ashtum / lazycsv

A fast, lightweight and single-header C++ csv parser library

Problems with commas and double quotes

cla93 opened this issue · comments

commented

I don't think this should be called an issue; it's more of an improvement request (or maybe there is already a way to do it that I don't know about). I'm reading a CSV in the following way:

    for (const auto row : parser)
    {
        StopTimes_struct tmp_stop_times;
        const auto [trip_id, arrival, departure, stop_id, stop_sequence] = row.cells(0, 1, 2, 3, 4); // indexes must be in ascending order
        tmp_stop_times.trip_id = trip_id.raw();
        tmp_stop_times.stop_id = stop_id.raw();
        tmp_stop_times.departure_time = departure.raw();
        tmp_stop_times.arrival_time = arrival.raw();
    }

Without entering into much details, I have GTFS format files, i.e., transit data. The header of the csv is:
stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,stop_timezone

Now, I have two kinds of problems, which actually are connected:
If the header, together with all the rows, wraps each field value in double quotes, the strings resulting from the read keep the "\"" characters around the actual string. Is there a way, with lazycsv, to extract the string directly, without having to strip the quotes manually at each extraction?

The other problem: it can happen that a field contains a comma. Example of a row of this kind:
10018","","C.so Sempione, 83 prima di Via E. Filiberto","","45.4862832229375","9.15805393535531","","","","","",""

Thus, what should be the 3rd field is split on the comma.

Is there some way to solve this issue?

Thank you

Hi, thanks for your report.
As mentioned in the README, this parser currently does not handle quoted cells. Because a quoted cell can even contain a line break character (\n), I didn't find an efficient way of supporting this without parsing each line twice (once to find the end of the line and once for each column).
Maybe I should add it without support for escape characters inside columns.

commented

Thank you for the answer! Yes, that was my problem. Since your tool is super efficient, a second parsing pass to check for quotes would have roughly doubled the reading time of my CSV. Still, it could be faster than the other tools I tried out; I'll give it a try with a custom solution on my side while I wait for updates here.

I've added initial support for quoted cells in this branch.
It is not complete, but I think it should solve your issue.

commented

Thank you so much, it seems to work perfectly! I just have another "problem" about types, but I'll open another issue for that