I got the following error when using arrow::read_csv_arrow() or arrow::open_dataset().
Error: Invalid: CSV parse error: Row #20241208: Expected 111 columns, got 110:
"col1","col1","col3","col4","col5"," ...
At row 20241208, there might be unexpected characters in one or more columns that interfere with detecting the delimiter of the CSV file. These could include "", \", a blank space, or a combination of punctuation marks.
I have not found a perfect solution for this error; the right approach depends on what you plan to do with the data after importing it into R.
In my case, the CSV file was very large, and I wanted to convert it into
multiple Parquet files for storage and efficient querying later.
However, the file could not be converted unless it was read correctly.
Here are the steps I followed:
Step 1: Extract row 20241208 to identify the problematic characters; a sketch is shown after this step.
Big thanks to Dr. Huynh Vu for helping with the Python code. If you’re into Python and data stuff, you should totally check out his blog; he’s got tons of great tips and tricks!
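The row-extraction helper I used was written in Python; below is a minimal R sketch of the same idea, with data.csv as a placeholder file name. It streams the file in chunks so the whole CSV never sits in memory and prints the raw text of the offending row. If the header counts toward arrow's row number, or if some quoted fields contain embedded newlines, you may need to peek at a few neighbouring lines as well.

```r
# Minimal sketch: pull the raw text of row 20241208 from a large CSV
# without loading the whole file. "data.csv" is a placeholder name.
bad_row    <- 20241208
chunk_size <- 1e6

con <- file("data.csv", open = "r")
lines_read <- 0
repeat {
  lines <- readLines(con, n = chunk_size, warn = FALSE)
  if (length(lines) == 0) break             # reached end of file
  if (lines_read + length(lines) >= bad_row) {
    cat(lines[bad_row - lines_read], "\n")  # print the problematic row
    break
  }
  lines_read <- lines_read + length(lines)
}
close(con)
```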
Step 2: Use data.table::fread() to read the CSV file into R, and replace the problematic characters if necessary. This function does not fail due to parsing errors. The downside is that your computer’s RAM must be large enough for fread() to hold the entire file in memory.
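Here is a sketch of that step; data.csv and the column col_text are placeholders, and the gsub() call assumes the culprit turned out to be stray \" sequences (swap in whatever step 1 actually revealed).

```r
library(data.table)

# Read the entire CSV in memory; add fill = TRUE if some rows have
# a different number of columns.
dt <- fread("data.csv")

# Hypothetical cleanup: drop the stray \" sequences found in step 1.
# Replace "col_text" and the pattern with whatever your inspection showed.
dt[, col_text := gsub('\\"', "", col_text, fixed = TRUE)]
```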
Step 3: Convert the cleaned data into Parquet format.
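Finally, a sketch of the conversion, reusing the cleaned data.table dt from above. arrow::write_dataset() writes multiple Parquet files in one call; the year partitioning column is hypothetical, so use whichever column matches how you plan to query the data (or write_parquet() if a single file is enough).

```r
library(arrow)

# Write the cleaned table as a partitioned Parquet dataset; each value
# of the (hypothetical) "year" column gets its own folder of files.
write_dataset(dt, path = "output_parquet", format = "parquet",
              partitioning = "year")

# Later, query the files lazily without loading everything into RAM.
ds <- open_dataset("output_parquet")
```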