How to use `scan_csv` with a file-like object in Polars
One small trick to handle memory-intensive CSVs
I have a case where a bunch of CSVs are stored together in a zip file and I want to convert those CSVs into a Parquet file. I’m using Polars because it has an awesome ability to lazily read CSVs and then efficiently sink to Parquet. It’s actually kind of magical.
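To show the happy path first, here’s what that pattern looks like with a plain CSV sitting on disk (the file names below are just placeholders):

```python
import polars as pl

# Build a lazy query over the CSV and stream the result straight into Parquet,
# without materializing the whole table in memory.
pl.scan_csv("my_big_file.csv").sink_parquet("my_big_file.parquet")
```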
But, there’s a problem. Because the CSVs are in a zip file, you hit a snag pretty quickly. That’s because you can’t just pass the CSV file name to the `scan_csv` function. The following code will not work!
```python
import polars as pl

# zip_file is an already-open zipfile.ZipFile wrapping the archive of CSVs
with zip_file.open("csv_in_zipfolder.csv") as csv_file:
    # This fails: scan_csv expects a path, not a file-like object
    pl.scan_csv(csv_file).sink_parquet("my_new_file.parquet")
```
That’s because `csv_file` is actually a `ZipExtFile`, and the `scan_csv` function can’t accept that! According to the pola.rs API documentation, `scan_csv` only accepts a path to a file. Unlike the `read_csv` function, which accepts a path or a file-like object, `scan_csv` does not allow file-like objects.
This also means that attempting to download from a URL directly into `scan_csv` won’t work either. Bummer, right?
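As an aside, if you don’t actually need the lazy streaming behaviour, `read_csv` will take the file-like object directly; the trade-off is that it reads the whole CSV into memory eagerly. A minimal sketch, using a hypothetical archive name:

```python
import zipfile

import polars as pl

with zipfile.ZipFile("archive_with_csvs.zip") as zip_file:  # hypothetical archive name
    with zip_file.open("csv_in_zipfolder.csv") as csv_file:
        # read_csv accepts file-like objects, but loads the whole CSV eagerly
        pl.read_csv(csv_file).write_parquet("my_new_file.parquet")
```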
But if you want to keep the lazy scan, there’s a hack, as long as your CSV file will fit in memory*: write it to a temporary named file and then pass that temporary file’s path to the `scan_csv` function. Here’s how that looks:
```python
import tempfile

import polars as pl

# zip_file is an already-open zipfile.ZipFile, as above
with zip_file.open("csv_in_zipfolder.csv") as csv_file:
    # Create the temporary file
    with tempfile.NamedTemporaryFile() as tf:
        tf.write(csv_file.read())  # Write the CSV contents to the temporary file
        tf.flush()  # Make sure the bytes are on disk before Polars opens the path
        pl.scan_csv(tf.name).sink_parquet("my_new_file.parquet")
```
By writing the file-like object out to a named temporary file, you can happily pass that file’s path to Polars and scan to your heart’s content.
* Technically, you could even copy the CSV over in chunks (or iterate over its lines) if it doesn’t fit into memory at all, as sketched below.
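Here’s a minimal sketch of that idea, using `shutil.copyfileobj` to stream the zipped CSV into the temporary file in fixed-size chunks instead of one big `read()` (the archive name is again a placeholder):

```python
import shutil
import tempfile
import zipfile

import polars as pl

with zipfile.ZipFile("archive_with_csvs.zip") as zip_file:
    with zip_file.open("csv_in_zipfolder.csv") as csv_file:
        with tempfile.NamedTemporaryFile() as tf:
            # Copy in chunks, so only a small buffer lives in memory at any time
            shutil.copyfileobj(csv_file, tf)
            tf.flush()  # Make sure everything is on disk before Polars opens the path
            pl.scan_csv(tf.name).sink_parquet("my_new_file.parquet")
```

One caveat: on Windows, a `NamedTemporaryFile` generally can’t be opened a second time by name while it’s still open, so you may need `delete=False` (and manual cleanup) there.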