CRANE-toolbox / analysis-pipelines

Project CRANE (Crisis Racism and Narrative Evaluation) aims to support researchers and anti-racist organisations that wish to use state-of-the-art text analysis algorithms to study how specific events impact online hate speech and racist narratives. CRANE Toolbox is a Python package: once installed, the tools in CRANE are available as functions that users can use in their Python programs or directly through their terminal. CRANE targets users with basic programming but no machine learning skills.

Home Page: https://crane-toolbox.github.io

Validate fields in JSON on import

LaChapeliere opened this issue

In the import module, validate the presence of the required fields (id, created_at, and text).
Maybe add a parameter to specify different names for those fields, in case the data has already been processed by a script.

@LaChapeliere Can I take a look?

@riibeirogabriel Sure, you're very welcome :) My issue descriptions are rather lacking; I'm working on adding an issue template, so let me know if you need additional info.

Okay, no problem. Tomorrow I will try to use the package, and if I don't understand something I'll ask in this issue, okay?

If it's an installation/usage issue because the doc is not clear enough, rather than a question about the field validation, can you open a new issue for it so it doesn't get mixed up?

@LaChapeliere I understand this issue. I meant that if I don't manage to reproduce it, I'll come back here and ask.

Perfect, we understand each other then 🚀

@LaChapeliere I have some questions for you. First, the error happens in the "transform data" step, right? Where can I get mocked JSON tweets to use? I took a look at https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/intro-to-tweet-json, will the mocked JSONs on that site work?

Yes, in transform.py
It used to crash in lighten_tweet() but it looks like my co-maintainer added some error handling there so the tweet will now be put aside as incorrectly formatted if the fields do not exist. So the only thing that would remain is to add parameters to the input module so users can specify if the fields are called something else in their input. Like, we can look for the "text_content" field instead of "text", but still call it "text" in the output because that's what subsequent modules will expect.
I have some mock json files somewhere, I'll share them so you can test things.
I'll also create an issue to add tests for this, because we should have some.
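
For illustration, a minimal sketch of that renaming idea; the function name and signature below are assumptions, not the actual CRANE input-module code:

    def normalise_text_field(tweet, text_field="text"):
        # Hypothetical helper: check that the user-specified text field is
        # present and rename it back to the canonical "text" key, so that
        # subsequent modules can keep expecting "text".
        if text_field not in tweet:
            raise KeyError(f"missing expected text field: {text_field}")
        if text_field != "text":
            tweet["text"] = tweet.pop(text_field)
        return tweet

    # e.g. normalise_text_field({"id": 1, "text_content": "hi"}, "text_content")
    # returns {"id": 1, "text": "hi"}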

Okay, I understand. In the case of a different field name, like "text_content" instead of "text", the output CSV must rename "text_content" to "text", right? And great, thanks for sharing those files.

I've added mock files in a tests-input branch.
Lines with "yes" are supposed to end up in the output. Lines with "no" should be filtered out: they should not end up in the output, but they should not be counted as parsing failures either. Lines with "fail" should not end up in the output, and they should be counted as parsing failures.
I've created test files for the "different field names" use case too while I was at it.

Ok, I will take a look and make a PR.

@LaChapeliere In the lighten_tweet function defined in transform.py, the "text" field can be named "full_text" or "text" depending on the "truncated" param. What do you suggest we do with the "text" field?

I don't understand your question :s

Inside the lighten_tweet function defined in transform.py, there is a check on whether the tweet has the "truncated" field. If it is True, the text field that is looked up is "full_text"; if it is False, the field looked up is "text". If a user specifies a different text field name for their input JSON, will the "truncated" check be ignored? Do you see what I mean? Sorry for my bad English.

Or, when the "truncated" field is True, should we access the "extended_tweet" field and read the same optional text field, passed as a parameter, from its sub-document?

For now, let's not give the user the possibility to change ALL fields, only the basic ones. We'll improve on that if the need appears.

I'll try to answer with pseudocode:
if truncated:
    if extended has full_text:
        save full_text
    else:
        fail
else:
    if user has specified an alternative field name NAME for "text":
        save the content of NAME
    else:
        save text
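
A rough Python translation of that pseudocode, as a sketch only; the function and parameter names are assumptions, not the actual lighten_tweet code:

    def extract_text(tweet, text_field="text"):
        # Sketch of the pseudocode above, not the actual lighten_tweet implementation.
        if tweet.get("truncated"):
            extended = tweet.get("extended_tweet", {})
            if "full_text" in extended:
                return extended["full_text"]
            raise KeyError("truncated tweet without extended_tweet.full_text")
        # Non-truncated tweet: honour the user-specified field name, default "text".
        if text_field in tweet:
            return tweet[text_field]
        raise KeyError(f"missing text field: {text_field}")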

How can I test my solution?

Have you managed to run the module on the mock input files I've added to the tests folder?

Not yet, I am looking for a way to build my locally modified module so I can run it with these test files.

I will try to use the module without building it.

I think this is what you are looking for: pip install -e . from the root of the directory

@LaChapeliere Did you run crane-import with the correctImportInput.json?
Here the output is a blank file, "wrote 0 lines failures 9". Do you know what causes this problem?

The files in the inputs folder need to be formatted as a JSON array; I can push the formatted files in my PR.

It works for me. I've got "wrote 6 lines failures 0". The inputs need to be one JSON object per line; it's not a true JSON format.
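
For illustration, a minimal sketch of that one-JSON-per-line format (often called JSON Lines) and how it is parsed line by line; the tweet values below are made up, not taken from the actual test files:

    import json

    # Each input line is a standalone JSON object, parsed independently,
    # rather than one big JSON array.
    raw_lines = [
        '{"id": 1, "created_at": "Mon Jan 01 00:00:00 +0000 2020", "text": "first tweet"}',
        '{"id": 2, "created_at": "Mon Jan 01 00:00:01 +0000 2020", "text": "second tweet"}',
    ]
    tweets = [json.loads(line) for line in raw_lines]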

Sorry, I thought those files had errors haha. I ran it again and it works, thanks!

@LaChapeliere I was thinking of using a dict as the argument, with each key being the field and the value being the new name, like
crane-import --fields-name {text: content, id: UID, created_at: tweet_date}
but according to this https://stackoverflow.com/questions/18608812/accepting-a-dictionary-as-an-argument-with-argparse-and-python it is not possible to load a dict as an argument in ArgumentParser, but it is possible to pass the dict as a string and convert it to a dict. What do you think?

Isn't converting a string dict into a dict a pain? If it's easy I'm okay with it.
I see two other options:

  1. Create one arg per field name. Not ideal if we want to add the possibility of changing the names of the other fields later on, but easier for users who want to change just one field.
    E.g. crane-import --text-name content --date-name tweet_date
  2. Drop that dict into a JSON file and pass the path to that file as an argument. Heavier on the user but more extensible.

What do you think?

The first option sounds better for the user, can we go with that?

Totally. Especially since it'll be easy to change to the json config file if we need it later.
By the way, don't forget to update the doc (Readme and Sphinx-generated doc; if Sphinx is a problem I can do it later).
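
A possible argparse sketch for that first option; the flag names below are assumptions, not necessarily what crane-import will end up using:

    import argparse

    # Hypothetical flags: one optional argument per renameable field,
    # defaulting to the standard Twitter field names.
    parser = argparse.ArgumentParser(prog="crane-import")
    parser.add_argument("--text-name", default="text",
                        help="name of the text field in the input JSON")
    parser.add_argument("--id-name", default="id",
                        help="name of the id field in the input JSON")
    parser.add_argument("--date-name", default="created_at",
                        help="name of the creation date field in the input JSON")

    args = parser.parse_args(["--text-name", "content", "--date-name", "tweet_date"])
    # args.text_name == "content", args.id_name == "id", args.date_name == "tweet_date"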

What is the command to update the Sphinx docs?

make html from /docs

There are a lot of altered files in the docs folder after running make html, is that okay?

It's normal if it's in /doctrees, or if it's the doc for a file you modified. Nothing in /source should have changed, I think.

Then it sounds good! I will make the PR.

I made a bunch of comments on the PR, and I'm running the tests on my side to cross-check.