Pre-processing captions in the dataset before filtering
vishaal27 opened this issue · comments
Hey, thanks for your great work and releasing the code for reproducing results. The code is super easy to follow!
In the part where you do text-based data filtering, I noticed that you search for matches in the parquet files' TEXT fields without any pre-processing:
neural-priming/DataFiltering/FilterData.py
Line 210 in e37520f
I was just wondering whether you think some pre-processing of the searched text (e.g., simple lowercasing, stemming/lemmatization) might improve recall without harming precision? I do recognise that this is a minor point in the context of your work, but was curious if you had thoughts on this or any initial experiments.
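For concreteness, here is a minimal sketch of the kind of normalization I mean. The `normalize` and `caption_matches` helpers are hypothetical (they are not part of FilterData.py); they just lowercase and strip punctuation before the substring check, which is the simplest version of the recall boost I'm asking about:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def caption_matches(caption: str, query: str) -> bool:
    """Substring match on normalized text (hypothetical helper,
    not the actual logic in FilterData.py)."""
    return normalize(query) in normalize(caption)

# A raw substring search misses case/punctuation variants of the class name...
assert "golden retriever" not in "A photo of my Golden-Retriever!"
# ...while the normalized match catches them:
assert caption_matches("A photo of my Golden-Retriever!", "golden retriever")
```

Stemming/lemmatization would go one step further (matching "retrievers" as well), at the cost of an extra NLP dependency and some risk to precision.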
Also, I am not sure whether you did any pre-processing while converting the original parquet files into the sqlite db files; I assume not? If you did, please let me know.
I am currently using your scripts for a project of mine, and hence wanted to know what your thoughts are.
Hey, thanks for the interest in our work.
We didn't use many preprocessing techniques on the text side, since we wanted to keep the method as simple and general as possible, though from our experiments I think more preprocessing and filtering would improve performance. One thing we considered but didn't explore thoroughly was using an LLM to determine, given the caption, whether the class was in the image; the main bottleneck there is the computational cost of the LLM. In general we found precision was more important than recall, given the scale of LAION-5B.
We didn't do any preprocessing of the captions when transferring to sqlite.
-Matt