Handle Tasks that may Fail

Question

How do I design steps to handle potential task failures at runtime?

Solution

Metaflow has two decorators that address this.

1. Using @retry and @catch

You can add Metaflow's @retry decorator before a step definition to rerun the step when it fails. The decorator takes an argument called times, a number in [0,4] that sets how many times the step is retried. It is intended for transient failures and is particularly useful when running tasks in the cloud, where machine failures are more common.

You can also apply it from the command line, as in python flow.py run --with retry. This retries any failed step that has no @retry decorator of its own, three times by default.
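To make the meaning of times concrete, here is a minimal plain-Python sketch of retry semantics. It is a conceptual model only, not Metaflow's implementation; run_with_retries and make_flaky are hypothetical helpers invented for this illustration, and the sketch assumes that times counts retries in addition to the first attempt.

```python
def run_with_retries(task, times):
    """Call task up to times + 1 times (hypothetical helper, not Metaflow code).

    Returns (result, attempts_used). Re-raises the last error if every
    attempt fails.
    """
    for attempt in range(times + 1):
        try:
            return task(), attempt + 1
        except Exception:
            if attempt == times:
                raise  # retries exhausted: let the failure propagate


def make_flaky(failures):
    """Build a task that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("transient failure")
        return "ok"

    return task


# Two transient failures are absorbed by three allowed retries.
result, attempts = run_with_retries(make_flaky(failures=2), times=3)
print(result, attempts)  # ok 3
```

A failure that is not transient, such as dividing by zero, fails on every attempt, so retrying alone cannot rescue it.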

Similarly, the @catch decorator catches exceptions raised in the task. Unlike @retry, @catch is intended for cases where you want the flow to continue after any exception. It takes an optional argument, var, which names a flow artifact where the exception is saved so that you can access it later.
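The following plain-Python sketch models how catching combines with retrying: the task is retried first, and only after the retries are used up is the exception recorded instead of raised. This is a simplified model under stated assumptions, not Metaflow's code; run_with_catch is a hypothetical helper, and it stores the raw exception, whereas the real decorator may store a wrapped form of it.

```python
def run_with_catch(task, times=0):
    """Return (result, caught) (hypothetical helper, not Metaflow code).

    caught is None when the task succeeds. When every attempt fails,
    result is None and caught holds the last exception.
    """
    last_error = None
    for _ in range(times + 1):
        try:
            return task(), None
        except Exception as exc:
            last_error = exc  # remember the failure and try again
    return None, last_error


# Dividing by zero fails on every attempt, so the exception is recorded.
result, caught = run_with_catch(lambda: 10 / 0, times=1)
print(result, type(caught).__name__)  # None ZeroDivisionError

# A successful task leaves the failure slot empty.
print(run_with_catch(lambda: 10 / 2))  # (5.0, None)
```

Note that the failed call produces no result at all, which is why later steps need to check the saved exception before reading anything the step would normally have set.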

Caution

When using @catch, design the steps that follow the caught step to tolerate a failure in it, since artifacts that step would normally set may be missing.
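As a rough illustration of what tolerating a failure means for a downstream step, the sketch below stands in for a join over three parent tasks, one of which failed before setting its result. The SimpleNamespace objects and the collect_results helper are stand-ins invented for this example; they only mimic attribute access on task inputs and are not Metaflow objects.

```python
from types import SimpleNamespace

# Stand-ins for the inputs of a join step. The failed task never set `res`.
inputs = [
    SimpleNamespace(divide_fail=ZeroDivisionError("division by zero")),
    SimpleNamespace(divide_fail=None, res=10.0),
    SimpleNamespace(divide_fail=None, res=5.0),
]


def collect_results(inputs):
    """Keep results only from tasks whose saved exception is empty."""
    return [i.res for i in inputs if not i.divide_fail]


print(collect_results(inputs))  # [10.0, 5.0]

# Reading `res` without the guard fails on the task that raised.
try:
    [i.res for i in inputs]
except AttributeError as err:
    print("unguarded access failed:", type(err).__name__)
```

Checking the saved exception first keeps the join from reading an artifact that was never created.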

2. Run Flow

This flow shows how to:

  • Create a foreach branch in start that launches three divide tasks.
  • Use @retry to rerun divide when the step code raises an exception.
  • Save the exception using @catch.
    • In the join step, use the saved exception to store results only from divide tasks that succeeded.
handle_failed_task.py
from metaflow import FlowSpec, step, retry, catch

class CatchRetryFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into one divide task per divisor; dividing by 0 will fail.
        self.divisors = [0, 1, 2]
        self.next(self.divide, foreach='divisors')

    # Retry once on failure, then save any remaining exception in divide_fail.
    @catch(var='divide_fail')
    @retry(times=1)
    @step
    def divide(self):
        self.res = 10 / self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        # Keep results only from divide tasks that did not record an exception.
        self.results = [i.res
                        for i in inputs
                        if not i.divide_fail]
        print('results', self.results)
        self.next(self.end)

    @step
    def end(self):
        print('done!')

if __name__ == '__main__':
    CatchRetryFlow()
python handle_failed_task.py run
     Workflow starting (run-id 1654221294647384):
[1654221294647384/start/1 (pid 71451)] Task is starting.
[1654221294647384/start/1 (pid 71451)] Task finished successfully.
[1654221294647384/divide/2 (pid 71461)] Task is starting.
[1654221294647384/divide/3 (pid 71462)] Task is starting.
[1654221294647384/divide/4 (pid 71463)] Task is starting.
[1654221294647384/divide/2 (pid 71461)] Traceback (most recent call last):
[1654221294647384/divide/2 (pid 71461)] ZeroDivisionError: division by zero
[1654221294647384/divide/2 (pid 71461)]
[1654221294647384/divide/2 (pid 71480)] Task is starting (retry).
[1654221294647384/divide/3 (pid 71462)] Task finished successfully.
[1654221294647384/divide/4 (pid 71463)] Task finished successfully.
[1654221294647384/divide/2 (pid 71480)] > Traceback (most recent call last):
[1654221294647384/divide/2 (pid 71480)] > ZeroDivisionError: division by zero
[1654221294647384/divide/2 (pid 71480)] Task finished successfully.
[1654221294647384/join/5 (pid 71492)] Task is starting.
[1654221294647384/join/5 (pid 71492)] results [10.0, 5.0]
[1654221294647384/join/5 (pid 71492)] Task finished successfully.
[1654221294647384/end/6 (pid 71504)] Task is starting.
[1654221294647384/end/6 (pid 71504)] done!
[1654221294647384/end/6 (pid 71504)] Task finished successfully.
Done!

Further Reading