CRANE-toolbox / analysis-pipelines

Project CRANE (Crisis Racism and Narrative Evaluation) aims to support researchers and anti-racist organisations that wish to use state-of-the-art text analysis algorithms to study how specific events impact online hate speech and racist narratives. CRANE Toolbox is a Python package: once installed, the tools in CRANE are available as functions that users can use in their Python programs or directly through their terminal. CRANE targets users with basic programming but no machine learning skills.

Home Page: https://crane-toolbox.github.io

Validate fields in JSON on import

LaChapeliere opened this issue

In the import module, validate the presence of the required fields (id, created_at, and text).
Maybe add a parameter to specify different names for those fields, in case the data has already been processed by a script.

@LaChapeliere Can I take a look?

@riibeirogabriel Sure, you're very welcome :) My issue descriptions are rather lacking; I'm working on adding an issue template, so let me know if you need additional info.

Okay, no problem. Tomorrow I will try to use the package, and if I don't understand something I'll ask in this issue, okay?

If it's an installation/usage issue because the doc is not clear enough, rather than a question about the field validation, can you open a new issue for it so it doesn't get mixed up?

@LaChapeliere I understand this issue. I meant that if I don't manage to reproduce it, I'll come back here and ask.

Perfect, we understand each other then 🚀

@LaChapeliere I have some questions for you. First, the error happens in the "transform data" step, right? Where can I get mocked JSON tweets to use? I took a look at https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/intro-to-tweet-json, will the mocked JSONs on that site work?

Yes, in transform.py
It used to crash in lighten_tweet() but it looks like my co-maintainer added some error handling there so the tweet will now be put aside as incorrectly formatted if the fields do not exist. So the only thing that would remain is to add parameters to the input module so users can specify if the fields are called something else in their input. Like, we can look for the "text_content" field instead of "text", but still call it "text" in the output because that's what subsequent modules will expect.
I have some mock json files somewhere, I'll share them so you can test things.
I'll also create an issue to add tests for this, because we should have some.
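
For illustration, a minimal sketch of that renaming idea; the function name and signature below are assumptions, not the actual CRANE input-module code:

    def normalise_text_field(tweet, text_field="text"):
        # Hypothetical helper: check that the user-specified text field is
        # present and rename it back to the canonical "text" key, so that
        # subsequent modules can keep expecting "text".
        if text_field not in tweet:
            raise KeyError(f"missing expected text field: {text_field}")
        if text_field != "text":
            tweet["text"] = tweet.pop(text_field)
        return tweet

    # e.g. normalise_text_field({"id": 1, "text_content": "hi"}, "text_content")
    # returns {"id": 1, "text": "hi"}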

Okay, I understand. In the case of a different field name, like "text_content" instead of "text", the output CSV must rename "text_content" to "text", right? And great, thanks for sharing those files.

I've added mock files in a tests-input branch.
Lines with "yes" are supposed to end up in the output. Lines with "no" should be filtered out: they should not end up in the output, but they should not be counted as parsing failures either. Lines with "fail" should not end up in the output, and they should be counted as parsing failures.
I've created test files for the "different field names" use case too while I was at it.

Ok, I will take a look and make a PR.

@LaChapeliere In the lighten_tweet function defined in transform.py, the "text" field can be named "full_text" or "text" depending on the "truncated" param. What do you suggest we do with the "text" field?

I don't understand your question :s

Inside the lighten_tweet function defined in transform.py, there is a check on whether the tweet has the "truncated" field. If it is True, the text field that is looked up is "full_text"; if it is False, the field looked up is "text". If a user specifies a different text field name for their input JSON, will the "truncated" check be ignored? Do you see what I mean? Sorry for my bad English.

Or, when the "truncated" field is True, should we access the "extended_tweet" field and read the same optional text field, passed as a parameter, from its sub-document?

For now, let's not give the user the possibility to change ALL fields, only the basic ones. We'll improve on that if the need appears.

I'll try to answer with pseudocode:
if truncated:
    if extended has full_text:
        save full_text
    else:
        fail
else:
    if user has specified an alternative field name NAME for "text":
        save the content of NAME
    else:
        save text
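
A rough Python translation of that pseudocode, as a sketch only; the function and parameter names are assumptions, not the actual lighten_tweet code:

    def extract_text(tweet, text_field="text"):
        # Sketch of the pseudocode above, not the actual lighten_tweet implementation.
        if tweet.get("truncated"):
            extended = tweet.get("extended_tweet", {})
            if "full_text" in extended:
                return extended["full_text"]
            raise KeyError("truncated tweet without extended_tweet.full_text")
        # Non-truncated tweet: honour the user-specified field name, default "text".
        if text_field in tweet:
            return tweet[text_field]
        raise KeyError(f"missing text field: {text_field}")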

How can I test my solution?

Have you managed to run the module on the mock input files I've added to the tests folder?

Not yet, I am looking for a way to build my locally modified module so I can run it with these test files.

I will try to use the module without building it.

I think this is what you are looking for: pip install -e . from the root of the directory

@LaChapeliere Did you run crane-import with the correctImportInput.json?
Here the output is a blank file, "wrote 0 lines failures 9". Do you know what causes this problem?

The files in the inputs folder need to be formatted as a JSON array; I can push the formatted files in my PR.

It works for me. I've got "wrote 6 lines failures 0". The inputs need to be one JSON object per line; it's not a true JSON format.
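
For illustration, a minimal sketch of that one-JSON-per-line format (often called JSON Lines) and how it is parsed line by line; the tweet values below are made up, not taken from the actual test files:

    import json

    # Each input line is a standalone JSON object, parsed independently,
    # rather than one big JSON array.
    raw_lines = [
        '{"id": 1, "created_at": "Mon Jan 01 00:00:00 +0000 2020", "text": "first tweet"}',
        '{"id": 2, "created_at": "Mon Jan 01 00:00:01 +0000 2020", "text": "second tweet"}',
    ]
    tweets = [json.loads(line) for line in raw_lines]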

Sorry, I thought those files had errors haha. I ran it again and it works, thanks!

@LaChapeliere I was thinking of using a dict as the argument, with each key being the field and the value being the new name, like
crane-import --fields-name {text: content, id: UID, created_at: tweet_date}
but according to this https://stackoverflow.com/questions/18608812/accepting-a-dictionary-as-an-argument-with-argparse-and-python it is not possible to load a dict as an argument in ArgumentParser, but it is possible to pass the dict as a string and convert it to a dict. What do you think?

Isn't converting a string dict into a dict a pain? If it's easy I'm okay with it.
I see two other options:

  1. Create one arg per field name. Not ideal if we want to add the possibility of changing the names of the other fields later on, but easier for users who want to change just one field.
    E.g. crane-import --text-name content --date-name tweet_date
  2. Drop that dict into a JSON file and pass the path to that file as an argument. Heavier on the user but more extensible.

What do you think?

The first option sounds better for the user, can we go with that?

Totally. Especially since it'll be easy to change to the json config file if we need it later.
By the way, don't forget to update the doc (Readme and Sphinx-generated doc; if Sphinx is a problem I can do it later).
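
A possible argparse sketch for that first option; the flag names below are assumptions, not necessarily what crane-import will end up using:

    import argparse

    # Hypothetical flags: one optional argument per renameable field,
    # defaulting to the standard Twitter field names.
    parser = argparse.ArgumentParser(prog="crane-import")
    parser.add_argument("--text-name", default="text",
                        help="name of the text field in the input JSON")
    parser.add_argument("--id-name", default="id",
                        help="name of the id field in the input JSON")
    parser.add_argument("--date-name", default="created_at",
                        help="name of the creation date field in the input JSON")

    args = parser.parse_args(["--text-name", "content", "--date-name", "tweet_date"])
    # args.text_name == "content", args.id_name == "id", args.date_name == "tweet_date"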

What is the command to update the Sphinx docs?

make html from /docs

There are a lot of altered files in the docs folder after running make html, is that okay?

It's normal if it's in /doctrees, or if it's the doc for a file you modified. Nothing in /source should have changed, I think.

Then it sounds good! I will make the PR.

I made a bunch of comments on the PR, and I'm running the tests on my side to cross-check.