Debugging Flows

In this episode, you will see how to use the resume command on the command line to debug your flows. After the episode, you will be able to resume a flow at an arbitrary point in the DAG, so you don’t need to re-run time-consuming steps over and over. The same functionality works even when the steps ran on different computers. In fact, you can resume a Metaflow run on your local machine for a flow that was run automatically on a production scheduler like AWS Step Functions or Argo.

1 Common Resume Scenario

A common scenario for using resume goes something like this:

  • You write my_sweet_flow.py
  • You run python my_sweet_flow.py run
    • Oh no, something broke! Analyzing stack trace...
    • Found the bug!
    • Save my_sweet_flow.py with the fix.
  • You resume the flow from the step that produced the bug: python my_sweet_flow.py resume
    • Pick up the state of the last flow execution from the step that failed.
    • Note: You can also name the step to resume from, like python my_sweet_flow.py resume <DIFFERENT STEP NAME>
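
The resume variants in the list above can be sketched as a small helper that assembles the command line. This is only an illustration: build_resume_command is a hypothetical name, not part of Metaflow. The --origin-run-id option is not mentioned above; it is included on the assumption that your Metaflow version supports it for resuming a specific earlier run (check python my_sweet_flow.py resume --help).

```python
import sys


def build_resume_command(flow_file, step=None, origin_run_id=None):
    """Assemble the argv list for `python <flow_file> resume ...`.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    cmd = [sys.executable, flow_file, "resume"]
    if step is not None:
        # Optional positional argument: the step to resume from,
        # instead of the step that failed.
        cmd.append(step)
    if origin_run_id is not None:
        # Assumption: --origin-run-id selects the earlier run to resume.
        cmd.extend(["--origin-run-id", str(origin_run_id)])
    return cmd


# Resume from the step that failed in the previous run.
print(build_resume_command("my_sweet_flow.py")[1:])

# Resume from an explicitly named step.
print(build_resume_command("my_sweet_flow.py", step="time_consuming_step")[1:])
```

To actually launch the command, the returned list could be passed to subprocess.run.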

2 Example

Let's look at an example. In this flow:

  • The time_consuming_step mimics a process you'd rather not re-run because of a downstream error, such as a data transformation or model training.
  • The error_prone_step raises an Exception that halts your flow.
debuggable_flow.py
from metaflow import FlowSpec, step

class DebuggableFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.time_consuming_step)

    @step
    def time_consuming_step(self):
        # Stand-in for an expensive operation such as model training.
        import time
        time.sleep(12)
        self.next(self.error_prone_step)

    @step
    def error_prone_step(self):
        # Simulated bug: this exception halts the flow.
        raise Exception()
        self.next(self.end)

    @step
    def end(self):
        print("Flow is done!")

if __name__ == "__main__":
    DebuggableFlow()

2a Observe a Failed Task

python debuggable_flow.py run
    ...
[1666720922151822/error_prone_step/3 (pid 52879)] Task is starting.
[1666720922151822/error_prone_step/3 (pid 52879)] <flow DebuggableFlow step error_prone_step> failed:
[1666720922151822/error_prone_step/3 (pid 52879)] Internal error
[1666720922151822/error_prone_step/3 (pid 52879)] Traceback (most recent call last):
[1666720922151822/error_prone_step/3 (pid 52879)] start(auto_envvar_prefix="METAFLOW", obj=state)
[1666720922151822/error_prone_step/3 (pid 52879)] task.run_step(
[1666720922151822/error_prone_step/3 (pid 52879)] self._exec_step_function(step_func)
[1666720922151822/error_prone_step/3 (pid 52879)] step_function()
[1666720922151822/error_prone_step/3 (pid 52879)] raise Exception()
[1666720922151822/error_prone_step/3 (pid 52879)] Exception
[1666720922151822/error_prone_step/3 (pid 52879)]
[1666720922151822/error_prone_step/3 (pid 52879)] Task failed.
...
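
Each log line above is prefixed with run-id/step-name/task-id and the process id. As a rough sketch of how to read that prefix, the snippet below scans captured console output for tasks that ended with "Task failed.". The function name failed_tasks and the regular expression are assumptions based only on the log lines shown in this episode, not a Metaflow API.

```python
import re

# Matches prefixes such as "[1666720922151822/error_prone_step/3 (pid 52879)] ..."
# The pattern is inferred from the output shown in this episode.
TASK_LINE = re.compile(
    r"\[(?P<run_id>\d+)/(?P<step>\w+)/(?P<task_id>\d+) \(pid \d+\)\] (?P<message>.*)$"
)


def failed_tasks(log_text):
    """Return (run_id, step, task_id) for every task reported as failed."""
    failed = []
    for line in log_text.splitlines():
        match = TASK_LINE.search(line.strip())
        if match and match.group("message") == "Task failed.":
            failed.append(
                (match.group("run_id"), match.group("step"), match.group("task_id"))
            )
    return failed


sample_log = """\
[1666720922151822/error_prone_step/3 (pid 52879)] Task is starting.
[1666720922151822/error_prone_step/3 (pid 52879)] Exception
[1666720922151822/error_prone_step/3 (pid 52879)] Task failed.
"""

print(failed_tasks(sample_log))
```

The step name in the result is the step that resume will restart from by default.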

2b Fix the Issue

You can resolve the issue by:

  1. Finding and fixing the bug

    In this case:

- raise Exception()
+ print("Squashed bug")
debuggable_flow.py
from metaflow import FlowSpec, step

class DebuggableFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.time_consuming_step)

    @step
    def time_consuming_step(self):
        import time
        time.sleep(12)
        self.next(self.error_prone_step)

    @step
    def error_prone_step(self):
        print("Squashed bug")
        # raise Exception()
        self.next(self.end)

    @step
    def end(self):
        print("Flow is done!")

if __name__ == "__main__":
    DebuggableFlow()
  2. Saving the flow script

2c Resume the Flow

python debuggable_flow.py resume
Metaflow 2.7.12 executing DebuggableFlow for user:eddie
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
2022-10-25 13:02:16.194 Gathering required information to resume run (this may take a bit of time)...
2022-10-25 13:02:16.200 Workflow starting (run-id 1666720936193890):
2022-10-25 13:02:16.202 [1666720936193890/start/1] Cloning results of a previously run task 1666720922151822/start/1
2022-10-25 13:02:16.565 [1666720936193890/time_consuming_step/2] Cloning results of a previously run task 1666720922151822/time_consuming_step/2
2022-10-25 13:02:16.925 [1666720936193890/error_prone_step/3 (pid 52891)] Task is starting.
2022-10-25 13:02:17.220 [1666720936193890/error_prone_step/3 (pid 52891)] Squashed bug
2022-10-25 13:02:17.266 [1666720936193890/error_prone_step/3 (pid 52891)] Task finished successfully.
2022-10-25 13:02:17.273 [1666720936193890/end/4 (pid 52894)] Task is starting.
2022-10-25 13:02:17.570 [1666720936193890/end/4 (pid 52894)] Flow is done!
2022-10-25 13:02:17.615 [1666720936193890/end/4 (pid 52894)] Task finished successfully.
2022-10-25 13:02:17.616 Done!
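
The output above shows start and time_consuming_step being cloned from the earlier run, while error_prone_step and end are executed again. The toy model below sketches that idea for a linear sequence of steps. It is not Metaflow's implementation: execute, the step functions, and the dictionary-based state are all hypothetical stand-ins, and real flows also involve branches, a datastore, and separate task processes.

```python
def execute(steps, origin=None):
    """Run steps in order, reusing results of an earlier run where possible.

    steps:  ordered list of (name, function); each function takes a state
            dict and returns the updated state.
    origin: results of a previous run, mapping step name -> state.
    """
    results, log = {}, []
    origin = origin or {}
    state = {}
    reuse = True
    for name, func in steps:
        if reuse and name in origin:
            # The step succeeded last time: clone its result, skip the work.
            state = origin[name]
            results[name] = state
            log.append(f"{name}: cloned")
            continue
        reuse = False  # From the first missing step onward, re-run everything.
        try:
            state = func(dict(state))
        except Exception:
            log.append(f"{name}: failed")
            break
        results[name] = state
        log.append(f"{name}: ran")
    return results, log


def start(state):
    return state

def time_consuming_step(state):
    state["data"] = "expensive result"
    return state

def error_prone_step_buggy(state):
    raise Exception()

def error_prone_step_fixed(state):
    state["message"] = "Squashed bug"
    return state

def end(state):
    return state


buggy = [("start", start), ("time_consuming_step", time_consuming_step),
         ("error_prone_step", error_prone_step_buggy), ("end", end)]
fixed = [("start", start), ("time_consuming_step", time_consuming_step),
         ("error_prone_step", error_prone_step_fixed), ("end", end)]

first_results, first_log = execute(buggy)                      # like `run`
resumed_results, resumed_log = execute(fixed, origin=first_results)  # like `resume`
print(first_log)
print(resumed_log)
```

In this sketch the second call never invokes time_consuming_step, which mirrors why resuming saves the 12-second wait in the real flow.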

Congratulations, you have completed the Introduction to Metaflow tutorial! Now you are ready to operationalize your machine learning workflows with Metaflow.

To keep progressing in your Metaflow journey you can:

  • Get to know Outerbounds' view on the machine learning stack.
  • Check out the open-source repository.
  • Join our Slack community and engage in #ask-metaflow. There is a lot of machine learning wisdom to discover from the community!