mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input

Home Page:http://PapaParse.com

Include metadata in the parse result for renamed columns

jchen042 opened this issue · comments

Great project!
@pokoli - thanks for the update w.r.t. #982, #129, and #956. Would the library consider adding configuration for duplicated headers? For example, an option to enable or disable the automatic renaming while keeping the ability to read the right column value, or renaming metadata included in the ParseResult, so that the end developer has more options for handling this scenario?

My proposal is to include the metadata with each column:

With the following CSV data:

c;c;c;c_1
1;2;3;4

ParseResult.data would look like this:

[{
  "c": {
    "originalName": "c",
    "value": "1"
  },
  "c_1": {
    "originalName": "c",
    "value": "2"
  },
  "c_2": {
    "originalName": "c",
    "value": "3"
  },
  "c_3": {
    "originalName": "c",
    "value": "4"
  }
}]

Alternatively, the column renaming metadata could be included in ParseResult.meta, like this:

"columnNameMapping": {
  "c": "c",
  "c_1": "c",
  "c_2": "c",
  "c_3": "c"
}
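
As a rough illustration of how a consumer might use such a mapping, here is a minimal TypeScript sketch. It assumes the `columnNameMapping` shape proposed above (parsed header name as key, original header name as value). The type `ColumnNameMapping` and the function `groupByOriginalName` are hypothetical names made up for this example and are not part of PapaParse.

```typescript
// Hypothetical shape, following the `columnNameMapping` proposal above:
// key = header as it appears in the parsed row, value = header as written in the CSV.
type ColumnNameMapping = Record<string, string>;

// Collect one row's values under their original header names, in column order.
function groupByOriginalName(
  row: Record<string, string>,
  mapping: ColumnNameMapping,
): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const [parsedName, value] of Object.entries(row)) {
    // Assumption: a header that is missing from the mapping was not renamed.
    const originalName = Object.prototype.hasOwnProperty.call(mapping, parsedName)
      ? mapping[parsedName]
      : parsedName;
    const values = grouped.get(originalName);
    if (values === undefined) {
      grouped.set(originalName, [value]);
    } else {
      values.push(value);
    }
  }
  return grouped;
}

// The row and mapping from the example above.
const exampleRow = { c: "1", c_1: "2", c_2: "3", c_3: "4" };
const exampleMapping: ColumnNameMapping = { c: "c", c_1: "c", c_2: "c", c_3: "c" };
const grouped = groupByOriginalName(exampleRow, exampleMapping);
// grouped.get("c") is ["1", "2", "3", "4"]
```

Keeping the mapping separate from the rows leaves ParseResult.data in its current shape, so existing code that reads row values by name would not need to change.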

If this is a good idea, I'm happy to create a PR to handle it.

Hi @FallingCeilingS,

Thanks for your proposal. I think the best option is to include the renamed columns in the metadata, so the original values can be restored just by reading them.

It would be great if you could create a MR for it.
I would expect the test cases to be extended to check that the proper metadata is generated, and the documentation to be extended to explain the newly available metadata.

I think the property should be named renamedHeaders, so it can be accessed as ParseResult.meta.renamedHeaders.

Thanks for the reply @pokoli - I'll create a PR once I have free time.

Thank you so much for taking the effort!

@pokoli - the PR is ready for review.

@pokoli @mholt - the relevant PR is merged. Thanks for the opinions and reviews. Can you please kindly release a new version of the package so that external developers can start using the new feature? Then this issue can get closed.

@FallingCeilingS I will release a new version once I have some time.
I'm closing the issue for now, as there is nothing left to do.

Just for reference, this was solved with #990

This solves exactly the issue I was trying to solve, thank you! (I originally landed on #129.) And thanks to all who have contributed fixes and updates to functionality related to duplicate headers.

I see a new tag is pending. Meanwhile, here is the workaround I'm using to detect duplicates until this feature is available:

import Papa from 'papaparse';

function completeFn(results: Papa.ParseResult<any>): void {
  const uniqueFieldNames = new Set<string>();
  const duplicateFieldNames: string[] = [];
  // meta.fields may be undefined in the type definitions, so fall back to an empty list.
  for (const fieldName of results.meta.fields ?? []) {
    if (fieldName.endsWith('_1')) {
      // A `_1` suffix is treated as the renamed first repeat of an earlier header.
      const originalFieldName = fieldName.slice(0, -2);
      const isDuplicate = uniqueFieldNames.has(originalFieldName);
      if (isDuplicate) {
        duplicateFieldNames.push(originalFieldName);
      } else {
        // No earlier header with this base name, so keep the field name as-is.
        uniqueFieldNames.add(fieldName);
      }
    } else {
      uniqueFieldNames.add(fieldName);
    }
  }
  // At this point, uniqueFieldNames contains each distinct field name once, and duplicateFieldNames contains the names that were repeated.
}
const papaParseConfig: Papa.ParseLocalConfig = {
  header: true,
  complete: completeFn,
};

This only checks for a single duplicate per header, and it doesn't work correctly if a field legitimately ends in _1. In my use case, those are acceptable risks.
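
For anyone who needs to catch more than one repeat per header, the same idea can be extended to any numeric suffix. The sketch below is a hedged variation, not the library's own logic: it assumes renamed headers look like `<original>_<n>`, as in the `c`, `c_1`, `c_2`, `c_3` example earlier in this thread, and `findDuplicateHeaders` is a hypothetical helper name.

```typescript
// Assumption: a renamed duplicate looks like `<original>_<n>` and appears
// after the header it duplicates, as in `c`, `c_1`, `c_2`, `c_3`.
function findDuplicateHeaders(fields: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const field of fields) {
    // Split a trailing `_<digits>` suffix off the field name, if there is one.
    const match = /^(.+)_(\d+)$/.exec(field);
    const baseName = match === null ? undefined : match[1];
    if (baseName !== undefined && seen.has(baseName)) {
      // The base name was already seen, so treat this field as a renamed repeat.
      duplicates.add(baseName);
    } else {
      seen.add(field);
    }
  }
  return Array.from(duplicates);
}

// With the header row from the start of this thread, as PapaParse reports it:
const duplicatedNames = findDuplicateHeaders(["c", "c_1", "c_2", "c_3"]);
// duplicatedNames is ["c"]
```

It could be called from the complete callback as findDuplicateHeaders(results.meta.fields ?? []). Like the workaround above, it still misreports a file that legitimately contains both `c` and `c_1` as headers. Telling those cases apart requires the renaming metadata itself.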

When a tag is made, I will upgrade, test, and post back confirmation that PR #990 works.