
Load Parquet Data from S3 to Pandas DataFrame

Question

I have a Parquet dataset stored in AWS S3 and want to access it in a Metaflow flow. How can I read one or several Parquet files at once from a flow and load them into a pandas DataFrame?

Solution

1. Access Parquet Data in S3

You can access a Parquet dataset on S3 from a Metaflow flow using the metaflow.S3 client and load it into a pandas DataFrame for analysis. To fetch a single file, use metaflow.S3.get. Parquet datasets are often split across many files, which is a good use case for Metaflow's s3.get_many function, since it downloads the requested objects in parallel.
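As a minimal sketch of the single-file case, the helper below wraps S3.get and pandas.read_parquet. The name load_single_parquet is hypothetical and not part of Metaflow; the sketch assumes metaflow, pandas, and a Parquet engine such as pyarrow are installed, and that your credentials can read the bucket.

```python
def load_single_parquet(s3root, key):
    """Download one Parquet object with S3.get and return a DataFrame.

    Hypothetical helper for illustration; not part of Metaflow.
    """
    # Imported lazily so the helper can be defined without metaflow or
    # pandas installed; calling it requires both, plus a Parquet engine.
    from metaflow import S3
    import pandas as pd

    with S3(s3root=s3root) as s3:
        obj = s3.get(key)  # S3Object backed by a temporary local file
        # Read inside the with block: the temporary file is deleted
        # when the S3 context exits.
        return pd.read_parquet(obj.path)
```

Inside a flow step this could be called as self.df = load_single_parquet('s3://ookla-open-data/parquet/performance/type=fixed/', 'year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet').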

This example uses Parquet data stored in S3 from Ookla Global's AWS Open Data Submission.

2. Run Flow

This flow shows how to:

  • Download multiple Parquet files using Metaflow's s3.get_many function.
  • Read the first downloaded file (first quarter of 2019) into a pandas DataFrame.
load_parquet_to_pandas.py
from metaflow import FlowSpec, step, S3

# Root of the Ookla fixed-network performance data; keys are relative to it.
BASE_URL = 's3://ookla-open-data/parquet/performance/type=fixed/'
YEARS = ['2019', '2020', '2021', '2022']
# One first-quarter file per year.
S3_PATHS = [
    f'year={y}/quarter=1/{y}-01-01_performance_fixed_tiles.parquet'
    for y in YEARS
]

class ParquetPandasFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.load_parquet)

    @step
    def load_parquet(self):
        import pandas as pd
        # Downloaded files are temporary and are removed when the S3
        # context exits, so read them inside the with block.
        with S3(s3root=BASE_URL) as s3:
            tmp_data_path = s3.get_many(S3_PATHS)
            first_path = tmp_data_path[0].path
            self.df = pd.read_parquet(first_path)
        self.next(self.end)

    @step
    def end(self):
        print('DataFrame for first year ' + \
              f'has shape {self.df.shape}.')

if __name__ == '__main__':
    ParquetPandasFlow()

python load_parquet_to_pandas.py run

...
[638/end/3312 (pid 7120)] Task is starting.
[638/end/3312 (pid 7120)] DataFrame for first year has shape (4877036, 7).
[638/end/3312 (pid 7120)] Task finished successfully.
...
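
The flow above hardcodes first-quarter keys and picks the first download by position. As a rough sketch, the helpers below build a key for any year and quarter and map each object returned by s3.get_many to its year by parsing the key attribute of the returned objects. The names build_key, year_of, and paths_by_year are hypothetical, and the file dates for quarters 2 to 4 are an assumption (first day of the quarter) that should be checked against the bucket listing; only the quarter 1 pattern appears in the flow.

```python
import re

# Assumed naming convention: each file is dated with the first day of its
# quarter. Only the quarter 1 entry is confirmed by the flow above.
QUARTER_START = {1: '01-01', 2: '04-01', 3: '07-01', 4: '10-01'}

def build_key(year, quarter):
    """Return a key, relative to BASE_URL, for one year/quarter file."""
    start = QUARTER_START[quarter]
    return (f'year={year}/quarter={quarter}/'
            f'{year}-{start}_performance_fixed_tiles.parquet')

def year_of(key):
    """Extract the four-digit year from a partitioned key or URL."""
    match = re.search(r'year=(\d{4})', key)
    if match is None:
        raise ValueError(f'No year partition found in {key!r}')
    return match.group(1)

def paths_by_year(s3_objects):
    """Map year -> local path for objects that expose .key and .path."""
    return {year_of(obj.key): obj.path for obj in s3_objects}
```

Inside the with block of load_parquet, this would allow paths = paths_by_year(s3.get_many(S3_PATHS)) followed by self.df = pd.read_parquet(paths['2021']) to select a year by name rather than by list position.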

3. Access Artifacts Outside of Flow

The following can be run in any script or notebook to access the contents of the DataFrame that was stored as a flow artifact with self.df.

from metaflow import Flow
Flow('ParquetPandasFlow').latest_run.data.df.head()
            quadkey                                               tile  avg_d_kbps  avg_u_kbps  avg_lat_ms  tests  devices
0  0231113112003202  POLYGON((-90.6591796875 38.4922941923613, -90....       66216       12490          13     28        4
1  1322111021111001  POLYGON((110.352172851562 21.2893743558604, 11...      102598       37356          13     15        4
2  3112203030003110  POLYGON((138.592529296875 -34.9219710361638, 1...       24686       18736          18    162      106
3  0320000130321312  POLYGON((-87.637939453125 40.225024210605, -87...       17674       13989          78    364        4
4  0320001332313103  POLYGON((-84.7430419921875 38.9209554204673, -...      441192      218955          22     14        1
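
Note that latest_run may point at a run that failed or is still in progress, in which case the df artifact might not exist yet. A small sketch that guards against this by using the Client API's latest_successful_run is shown below; the helper name latest_successful_df is hypothetical.

```python
def latest_successful_df(flow_name='ParquetPandasFlow'):
    """Return the df artifact from the latest successful run of a flow.

    Hypothetical helper for illustration; not part of Metaflow.
    """
    # Imported lazily so the helper can be defined without metaflow installed.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        # No run of this flow has completed successfully yet.
        raise RuntimeError(f'No successful run found for {flow_name}.')
    return run.data.df
```

In a notebook, latest_successful_df().head() would then display the same preview as above, provided at least one run has finished successfully.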
