Debug Metaflow Errors with Resume
I have a prototype flow that failed and I want to identify why it failed, where it failed, and debug it.
1. Run Flow with Error
In this example, a ZeroDivisionError is produced in the join step.
This flow shows how to:
- Pass artifacts into a join step.
- Start a process to deal with Python errors in a task.
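from metaflow import FlowSpec, step

class DebugFlow(FlowSpec):

    @step
    def start(self):
        # Branch into two parallel steps.
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 0
        self.next(self.join)

    @step
    def join(self, inputs):
        # The divisor on the next line is 0!
        self.result = inputs.a.x / inputs.b.x
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DebugFlow()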
python debug_error_with_resume.py run
[1654221288262697/a/2 (pid 71324)] Task is starting.
[1654221288262697/a/2 (pid 71324)] Task finished successfully.
[1654221288262697/b/3 (pid 71325)] Task is starting.
[1654221288262697/b/3 (pid 71325)] Task finished successfully.
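The console output is one way to see where a run failed. As a rough sketch, the helper below walks a run object and collects the tasks that did not succeed. It assumes only that iterating a run yields steps, iterating a step yields tasks, and each task exposes pathspec and successful, which is how the Run, Step, and Task objects in Metaflow's Client API are expected to behave. The name failed_tasks is hypothetical, and the demo data is made up so that the block runs without Metaflow installed.

```python
from types import SimpleNamespace


def failed_tasks(run):
    """Return pathspecs of tasks in `run` whose `successful` flag is false."""
    failures = []
    for step in run:          # iterating a run yields its steps
        for task in step:     # iterating a step yields its tasks
            if not task.successful:
                failures.append(task.pathspec)
    return failures


# Stand-in data (made up for illustration): each inner list plays the role of a step.
fake_run = [
    [SimpleNamespace(pathspec="DebugFlow/1/start/1", successful=True)],
    [SimpleNamespace(pathspec="DebugFlow/1/a/2", successful=True)],
    [SimpleNamespace(pathspec="DebugFlow/1/b/3", successful=True)],
    [SimpleNamespace(pathspec="DebugFlow/1/join/4", successful=False)],
]

print(failed_tasks(fake_run))
```

With Metaflow installed and a recorded run, the same helper could presumably be called as failed_tasks(Flow('DebugFlow').latest_run) after from metaflow import Flow.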
Having seen that the code failed at the join step, you can fix whatever caused the failure and resume the flow from the faulty step. The line in the join step of this script that raises the ZeroDivisionError is marked with a comment. You can replace this line with
self.result = inputs.a.x / (inputs.b.x + 1e-12)
to fix the error.
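To see what the replacement line changes, the following minimal sketch reproduces the division outside Metaflow. The SimpleNamespace objects stand in for the inputs object that a join step receives (an assumption for illustration, not Metaflow's own class), and safe_ratio is a hypothetical helper. Note that the epsilon avoids the exception but yields a very large number rather than a meaningful ratio, so an explicit check may be preferable in real code.

```python
from types import SimpleNamespace

# Stand-in for the `inputs` object passed to the join step (illustration only).
inputs = SimpleNamespace(a=SimpleNamespace(x=1), b=SimpleNamespace(x=0))

# Original line: raises ZeroDivisionError because inputs.b.x is 0.
original_error = None
try:
    result = inputs.a.x / inputs.b.x
except ZeroDivisionError as err:
    original_error = err

# Fix from the text: a tiny epsilon keeps the divisor non-zero.
eps_result = inputs.a.x / (inputs.b.x + 1e-12)


def safe_ratio(numerator, denominator, default=None):
    """Hypothetical alternative: return `default` instead of dividing by zero."""
    if denominator == 0:
        return default
    return numerator / denominator


guarded_result = safe_ratio(inputs.a.x, inputs.b.x)

print(type(original_error).__name__)  # the exception raised by the original line
print(eps_result)                     # a very large float, not an error
print(guarded_result)                 # the default value
```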
3. Resume Flow from Failed Task
Now you can resume the flow at join without re-running the a and b steps. Note that by default the resume feature will enter the flow at the step that produced the error in the last run. In this example none of the steps are time-intensive, but you can imagine scenarios such as model training, where steps may take a long time to compute. You wouldn't want to re-run a and b if those tasks did expensive model training and the error was in the downstream join step.
python debug_error_with_resume.py resume
Metaflow 2.6.0 executing DebugFlow for user:eddie
Validating your flow...
The graph looks good!
Pylint is happy!
2022-06-02 20:54:52.800 Gathering required information to resume run (this may take a bit of time)...
2022-06-02 20:54:52.808 Workflow starting (run-id 1654221292799841):
2022-06-02 20:54:52.809 [1654221292799841/start/1] Cloning results of a previously run task 1654221288262697/start/1
2022-06-02 20:54:53.413 [1654221292799841/a/2] Cloning results of a previously run task 1654221288262697/a/2
2022-06-02 20:54:53.419 [1654221292799841/b/3] Cloning results of a previously run task 1654221288262697/b/3
2022-06-02 20:54:54.069 [1654221292799841/join/4 (pid 71446)] Task is starting.
2022-06-02 20:54:54.709 [1654221292799841/join/4 (pid 71446)] Task finished successfully.
2022-06-02 20:54:54.713 [1654221292799841/end/5 (pid 71454)] Task is starting.
2022-06-02 20:54:55.102 [1654221292799841/end/5 (pid 71454)] Task finished successfully.
2022-06-02 20:54:55.103 Done!
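The log shows the pattern behind resume: tasks from steps before the failure are cloned, and the failed step plus everything downstream of it is executed again. The toy model below illustrates that decision for this flow's graph. It is only a sketch of the observable behavior, not Metaflow's implementation. GRAPH and plan_resume are hypothetical names, and the graph is written out by hand from the step names in the log.

```python
# Hand-written graph of the example flow: step -> steps it transitions to.
GRAPH = {
    "start": ["a", "b"],
    "a": ["join"],
    "b": ["join"],
    "join": ["end"],
    "end": [],
}


def plan_resume(graph, resume_step):
    """Split steps into those to clone and those to re-execute."""
    # Collect resume_step and everything reachable from it.
    rerun, stack = set(), [resume_step]
    while stack:
        step = stack.pop()
        if step not in rerun:
            rerun.add(step)
            stack.extend(graph[step])
    # Preserve the graph's declaration order in both result lists.
    cloned = [s for s in graph if s not in rerun]
    executed = [s for s in graph if s in rerun]
    return cloned, executed


cloned, executed = plan_resume(GRAPH, "join")
print("clone:", cloned)      # clone: ['start', 'a', 'b']
print("execute:", executed)  # execute: ['join', 'end']
```

In this toy model, the result for join matches the log above: start, a, and b are cloned, while join and end run as new tasks.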