
Debug Metaflow Errors with Resume

Question

I have a prototype flow that failed and I want to identify why it failed, where it failed, and debug it.

Solution

1. Run Flow with Error

When running debug_error_with_resume.py a ZeroDivisionError is produced in the join step.

This flow shows how to:

  • Pass artifacts into a join step.
  • Start a process to deal with Python errors in a task.

debug_error_with_resume.py

from metaflow import FlowSpec, step

class DebugFlow(FlowSpec):

    @step
    def start(self):
        # Branch into steps a and b.
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 0
        self.next(self.join)

    @step
    def join(self, inputs):
        # The divisor in the next line is 0!
        self.result = inputs.a.x / inputs.b.x
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DebugFlow()

python debug_error_with_resume.py run
    ...
[1654221288262697/a/2 (pid 71324)] Task is starting.
[1654221288262697/a/2 (pid 71324)] Task finished successfully.
...
[1654221288262697/b/3 (pid 71325)] Task is starting.
[1654221288262697/b/3 (pid 71325)] Task finished successfully.
...
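
The arithmetic behind the failure can also be reproduced outside Metaflow. The sketch below is a minimal stand-in, not Metaflow code: `join_result` is a hypothetical helper, and `SimpleNamespace` objects are assumed in place of the `inputs` object that a join step receives. It mirrors only the division, using the values set in steps a and b.

```python
from types import SimpleNamespace

def join_result(inputs):
    """Hypothetical stand-in for the body of the join step."""
    return inputs.a.x / inputs.b.x

# Mimic the artifacts set in steps a and b (x = 1 and x = 0).
inputs = SimpleNamespace(a=SimpleNamespace(x=1), b=SimpleNamespace(x=0))

try:
    join_result(inputs)
    error_name = None
except ZeroDivisionError as err:
    # The same exception type the join step produces.
    error_name = type(err).__name__
    print(f"join would fail with {error_name}: {err}")
```

Isolating the failing expression this way makes it clear that the error comes from the artifact values passed into join, not from the flow structure.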

2. Debug Flow

Having seen that the flow failed at the join step, you can fix the cause and resume the flow from the failed step. The line in the join step that divides inputs.a.x by inputs.b.x raises the ZeroDivisionError, because step b sets x to 0. You can replace this line with

self.result = inputs.a.x / (inputs.b.x + 1e-12)

to fix the error.
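
Note that the epsilon offset removes the exception but produces a very large quotient when the divisor is 0. The sketch below compares that fix with one possible alternative, an explicit zero check. Both function names (`divide_with_epsilon`, `safe_divide`) and the `default` parameter are hypothetical and introduced only for illustration; which behavior is appropriate depends on what the downstream steps expect.

```python
import math

EPSILON = 1e-12  # same offset as in the replacement line above

def divide_with_epsilon(numerator, denominator):
    """Mirror of the fix above: shift the divisor by a small constant."""
    return numerator / (denominator + EPSILON)

def safe_divide(numerator, denominator, default=None):
    """Hypothetical alternative: return a default instead of dividing by zero."""
    if denominator == 0:
        return default
    return numerator / denominator

eps_result = divide_with_epsilon(1, 0)   # roughly 1e12, no exception
guarded_result = safe_divide(1, 0)       # falls back to the default
print(eps_result)
print(guarded_result)
```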

3. Resume Flow from Failed Task

Now you can resume from join without re-running the start, a, and b steps. By default, resume enters the flow at the step that produced the error in the last run. None of the steps in this example are time-intensive, but in scenarios such as model training, where a and b might take a long time to compute, you would not want to re-run them just because the downstream join task failed.

python debug_error_with_resume.py resume
    Metaflow 2.6.0 executing DebugFlow for user:eddie
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
2022-06-02 20:54:52.800 Gathering required information to resume run (this may take a bit of time)...
2022-06-02 20:54:52.808 Workflow starting (run-id 1654221292799841):
2022-06-02 20:54:52.809 [1654221292799841/start/1] Cloning results of a previously run task 1654221288262697/start/1
2022-06-02 20:54:53.413 [1654221292799841/a/2] Cloning results of a previously run task 1654221288262697/a/2
2022-06-02 20:54:53.419 [1654221292799841/b/3] Cloning results of a previously run task 1654221288262697/b/3
2022-06-02 20:54:54.069 [1654221292799841/join/4 (pid 71446)] Task is starting.
2022-06-02 20:54:54.709 [1654221292799841/join/4 (pid 71446)] Task finished successfully.
2022-06-02 20:54:54.713 [1654221292799841/end/5 (pid 71454)] Task is starting.
2022-06-02 20:54:55.102 [1654221292799841/end/5 (pid 71454)] Task finished successfully.
2022-06-02 20:54:55.103 Done!
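
The log above shows the pattern: results of start, a, and b are cloned from the earlier run, and only join and end execute. The toy model below sketches that idea in plain Python. It is not Metaflow's implementation; the `resume` function, the linear step list, and the `previous` dictionary are all assumptions made for illustration, and the branch is flattened into a simple sequence.

```python
def resume(steps, previous_results):
    """Toy model of resume: reuse stored results until the first step without one.

    steps: ordered list of (name, function) pairs; each function receives
           the results computed so far.
    previous_results: dict mapping step name to its result from the earlier run.
    """
    results, log = {}, []
    rerun = False
    for name, func in steps:
        if not rerun and name in previous_results:
            # Step succeeded last time: copy its result instead of recomputing.
            results[name] = previous_results[name]
            log.append(f"cloned {name}")
        else:
            # First missing step and everything after it is executed again.
            rerun = True
            results[name] = func(results)
            log.append(f"ran {name}")
    return results, log

steps = [
    ("start", lambda r: None),
    ("a", lambda r: 1),
    ("b", lambda r: 0),
    ("join", lambda r: r["a"] / (r["b"] + 1e-12)),  # fixed division
    ("end", lambda r: None),
]
# The earlier run stored results only for the steps that succeeded.
previous = {"start": None, "a": 1, "b": 0}

results, log = resume(steps, previous)
print(log)
```

The benefit grows with the cost of the cloned steps: anything that finished successfully upstream of the failure is reused rather than recomputed.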

Further Reading