Pandas.read_csv too slow! (or not…) [Python]


Problem: pandas.read_csv was apparently too slow with big CSV files.

In the current versions of Pandas (1.0.1) and Python (3.7.3), even when setting the parameter nrows = 1, it looks like the CSV file is still parsed in its entirety. Using chunksize does not really help either, as one would expect.

But things are not always as they seem… In my case the real problem was the determination of the file encoding, using the following function:
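The original snippet is not shown here; as a minimal, stdlib-only sketch, such a get_encoding function might try candidate encodings by decoding the entire file (the candidate list and fallback are my assumptions):

```python
def get_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical reconstruction -- the key point is that the WHOLE file
    is read and decoded, which is the hidden cost described below.
    """
    with open(path, "rb") as f:
        raw = f.read()  # reads every byte of the file
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "latin-1"  # latin-1 accepts any byte sequence, safe fallback
```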

which was passed to the read_csv call as follows:
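The exact call is not shown; the pattern it describes might look like the sketch below (detect_encoding and load_head are placeholder names of mine), where the encoding is detected up front and handed to read_csv:

```python
import pandas as pd

def detect_encoding(path):
    # stand-in for the get_encoding function above: it still reads the
    # whole file, no matter how read_csv is restricted later
    raw = open(path, "rb").read()
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"

def load_head(path, n=1):
    # nrows=n limits what read_csv parses, but the encoding detection
    # above has already walked the entire file
    return pd.read_csv(path, encoding=detect_encoding(path), nrows=n)
```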

In this way, the function get_encoding runs independently of any row selection via nrows, since the program has to check the encoding by parsing the whole file anyway!

Two possible solutions for this:

  • Determine the character encoding “manually”, e.g. by opening your file with Notepad++ and checking the encoding via the corresponding top-menu option.
  • Instead of deducing the file encoding by reading all of its lines, just read e.g. the first 100 rows.

In the latter case, the function becomes:
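Again the snippet is missing; assuming the detection logic sketched earlier, the faster variant might sample only the first 100 lines instead of decoding the whole file:

```python
def get_encoding(path, nlines=100, candidates=("utf-8", "cp1252")):
    """Detect the encoding from a sample of the first nlines lines only."""
    with open(path, "rb") as f:
        # reading complete lines avoids cutting a multi-byte character
        # in half at an arbitrary byte boundary
        sample = b"".join(f.readline() for _ in range(nlines))
    for enc in candidates:
        try:
            sample.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "latin-1"  # fallback that accepts any byte sequence
```

The trade-off is that an encoding problem appearing only after line 100 would go undetected, so the sample size should be chosen with the data in mind.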

This gave me the largest improvement ever for my “fake”-problem!

A possible further improvement is given by passing engine=’c’ to read_csv.

Setting the chunksize didn’t really help either: even though it speeds up the reading itself, I needed a pandas.concat(chunks) afterwards to put the chunks back together, which added its own overhead!
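For reference, the chunked pattern described above might be sketched like this (the function name is mine): chunked reading bounds memory, but pd.concat copies all chunks into a new DataFrame at the end, which is exactly the extra overhead mentioned.

```python
import pandas as pd

def read_in_chunks(path, chunksize=100_000):
    # read_csv with chunksize returns an iterator of DataFrames;
    # pd.concat then materializes them into one DataFrame (extra copy)
    chunks = pd.read_csv(path, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)
```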

See you soon and write your suggestions as comments if you like 🙂
