Include metadata in the parse result for renamed columns

jchen042 opened this issue a year ago

Great project!
@pokoli - thanks for the update regarding #982, #129, and #956. Would the library consider adding configuration for duplicated headers? For example, an option to enable or disable the automatic renaming while still being able to read the right column value, or including the renaming metadata in the ParseResult so that developers have more options for handling this scenario.

My proposal is to include the metadata with each column:

With the following CSV data:

c;c;c;c_1
1;2;3;4

ParseResult.data would look like:

[{
  "c": {
    "originalName": "c",
    "value": "1"
  },
  "c_1": {
    "originalName": "c",
    "value": "2"
  },
  "c_2": {
    "originalName": "c",
    "value": "3"
  },
  "c_3": {
    "originalName": "c",
    "value": "4"
  }
}]

Alternatively, the column renaming metadata can be included in ParseResult.meta, like:

"columnNameMapping": {
  "c": "c",
  "c_1": "c",
  "c_2": "c",
  "c_3": "c"
}

If this is a good idea, I'm happy to create a PR to handle it.
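
To illustrate how a consumer might use such a mapping, here is a minimal sketch that collects the values of one parsed row under their original header names. The `ColumnNameMapping` type and the `groupByOriginalName` helper are hypothetical names introduced for illustration; the mapping shape (renamed column to original column) is the one proposed above, not a confirmed library API.

```typescript
// Hypothetical shape from the proposal above: renamed column -> original column.
type ColumnNameMapping = Record<string, string | undefined>;

// Collect the values of one parsed row under their original header names.
// Columns that are missing from the mapping keep their own name.
function groupByOriginalName(
  row: Record<string, string>,
  mapping: ColumnNameMapping,
): Record<string, string[]> {
  const grouped: Record<string, string[]> = {};
  for (const column of Object.keys(row)) {
    const original = mapping[column] ?? column;
    const bucket = grouped[original] ?? [];
    bucket.push(row[column]);
    grouped[original] = bucket;
  }
  return grouped;
}

// Example data mirroring the proposal: four columns that all map back to "c".
const exampleMapping: ColumnNameMapping = { c: "c", c_1: "c", c_2: "c", c_3: "c" };
const exampleRow: Record<string, string> = { c: "1", c_1: "2", c_2: "3", c_3: "4" };
const groupedRow = groupByOriginalName(exampleRow, exampleMapping);
```

With this approach the row data keeps its flat shape, and the grouping is only computed when the caller actually needs the duplicated values together.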

Sergi Almacellas Abellana · Answer 1 · Fri Mar 24 2023 16:21:11 GMT+0800 (China Standard Time)

Hi @FallingCeilingS,

Thanks for your proposal. I think the best option is to include the renamed columns in the metadata, so the original values can be restored just by reading them.

It would be great if you could create an MR for it.
I would expect the test cases to be extended to check that the proper metadata is generated, and the documentation to be extended to explain the newly available metadata.

I think the property should be named renamedHeaders so it can be accessed with ParseResult.meta.renamedHeaders
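
As a rough sketch of what "restoring the original values" could look like on the consumer side, the helper below rebuilds the original header row from the parsed field names. It assumes that `renamedHeaders` maps each renamed header to its original name and may be absent when nothing was renamed; that shape is an assumption, since the thread does not specify it. `restoreOriginalHeaders` is a hypothetical helper, not part of the library.

```typescript
// Assumed shape (not confirmed in this thread): renamed header -> original header.
type RenamedHeaders = Record<string, string | undefined>;

// Rebuild the header row as it appeared in the file, given the field names
// reported after parsing and the rename metadata.
function restoreOriginalHeaders(
  fields: string[],
  renamedHeaders: RenamedHeaders | null | undefined,
): string[] {
  return fields.map((field) => {
    const original = renamedHeaders ? renamedHeaders[field] : undefined;
    return original ?? field; // headers that were not renamed stay as they are
  });
}

// In real use these values would come from results.meta.fields and
// results.meta.renamedHeaders; literals keep the sketch self-contained.
const restoredHeaders = restoreOriginalHeaders(["a", "b", "a_1"], { a_1: "a" });
```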

Junxiang Chen · Answer 2 · Fri Mar 24 2023 16:36:34 GMT+0800 (China Standard Time)

Thanks for the reply @pokoli - I'll create a PR once I have free time.

Sergi Almacellas Abellana · Answer 3 · Fri Mar 24 2023 16:38:52 GMT+0800 (China Standard Time)

Thank you so much for taking the effort!

Junxiang Chen · Answer 4 · Wed Mar 29 2023 00:34:49 GMT+0800 (China Standard Time)

@pokoli - the PR is ready for review.

Junxiang Chen · Answer 5 · Tue Apr 04 2023 11:54:01 GMT+0800 (China Standard Time)

CC: @mholt .

Junxiang Chen · Answer 6 · Tue Apr 11 2023 09:15:44 GMT+0800 (China Standard Time)

@pokoli @mholt - the relevant PR is merged. Thanks for the opinions and reviews. Can you please kindly release a new version of the package so that external developers can start using the new feature? Then this issue can get closed.

Sergi Almacellas Abellana · Answer 7 · Tue Apr 11 2023 15:05:23 GMT+0800 (China Standard Time)

@FallingCeilingS I will release a new version once I have some time.
I am closing the issue for now, as there is nothing left to do.

Just for reference, this was solved with #990

Joseph D. Purcell · Answer 8 · Fri May 12 2023 00:47:32 GMT+0800 (China Standard Time)

This exactly solves the issue I was trying to solve, thank you! (I originally landed on #129). And thanks to all who have contributed fixes/updates to functionality related to duplicate headers.

I see a new tag is pending. Meanwhile, here is the ~~solution~~ hack I'm using to detect duplicates until this feature is available:

function completeFn(results: Papa.ParseResult<any>): void {
  const uniqueFieldNames = new Set<string>();
  const duplicateFieldNames: string[] = [];
  // meta.fields can be undefined (for example when header is false), so fall back to an empty list.
  for (const fieldName of results.meta.fields ?? []) {
    if (fieldName.slice(-2) === '_1') {
      // A "_1" suffix is treated as a renamed duplicate of the name without the suffix.
      const originalFieldName = fieldName.slice(0, -2);
      const isDuplicate = uniqueFieldNames.has(originalFieldName);
      if (isDuplicate) {
        duplicateFieldNames.push(originalFieldName);
      } else {
        uniqueFieldNames.add(originalFieldName);
      }
    } else {
      uniqueFieldNames.add(fieldName);
    }
  }
  // At this point, uniqueFieldNames contains each distinct field name (without the "_1" suffix),
  // and duplicateFieldNames contains the names that appeared more than once.
}

const papaParseConfig: Papa.ParseLocalConfig = {
  header: true,
  complete: completeFn,
};

This only checks if there is 1 duplicate, and doesn't work correctly if someone had a field with _1 in the suffix. In my use case, those are acceptable risks.
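
For anyone who needs to catch more than one duplicate per name, a variant along the following lines might work. It is still only a heuristic and rests on two assumptions: renamed duplicates get a `_N` numeric suffix, and the original name appears earlier in the field list. `findLikelyDuplicates` is a hypothetical helper, and a genuine column named `c_1` that follows a column `c` would still be reported as a false positive.

```typescript
// Heuristic sketch: report "name" as duplicated when a later field looks like
// "name_<number>" and "name" itself was already seen earlier in the list.
function findLikelyDuplicates(fields: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const field of fields) {
    const match = /^(.*)_(\d+)$/.exec(field);
    const base = match ? match[1] : undefined;
    if (base !== undefined && seen.has(base)) {
      duplicates.add(base);
    }
    seen.add(field);
  }
  return Array.from(duplicates);
}

// "c" is followed by "c_1" and "c_2", so it is reported once; "d" is not.
const likelyDuplicates = findLikelyDuplicates(["c", "c_1", "c_2", "d"]);
```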

When a tag is made I will upgrade, test, and post back to confirm that PR #990 works.