NLP4ALL / nlp4all

NLP4All is a learning platform for educational institutions that helps students who are not in data-oriented fields understand natural language processing techniques and applications.


Refactor data import

zeyus opened this issue

The new implementation reads the schema and imports the data straight after upload.

It will need to be benchmarked, but even though reading from the filesystem is slow, it may well be quicker to read the schema by iterating over all the rows first and then import only the selected fields. Right now, importing 400k tweets takes about 40 minutes, and the subsequent delete query (pre-index; an indexed version is being tested now) takes an additional 20 minutes when run as a single query, and more than 60 minutes when run as individual queries.
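As a rough illustration of that two-pass idea, the sketch below first walks every row to collect the schema, then imports only the fields the user selected. The function names, the line-delimited JSON format, and the `save` callback are assumptions for illustration, not the actual nlp4all import code.

```python
import json
from collections import defaultdict


def infer_schema(path):
    """First pass: walk every row once and record which keys occur.

    Hypothetical helper -- the real nlp4all import code may differ.
    """
    key_counts = defaultdict(int)
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # assumes one JSON document per line
            doc = json.loads(line)
            for key in doc:
                key_counts[key] += 1
    return key_counts


def import_selected(path, selected_keys, save):
    """Second pass: keep only the fields the user selected, then persist."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            save({k: v for k, v in doc.items() if k in selected_keys})
```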

While we're at it, consider using MongoDB for document storage and joining with a unique key (or the document ID).

See also: https://github.com/NLP4ALL/nlp4all/wiki/Performance

If we go this route (probably more performant), it will require hooks on init-db and drop-db, as well as when adding and deleting data sources.
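A minimal sketch of what that could look like with pymongo: the MongoDB document ID becomes the join key stored on the relational row, and a cleanup hook removes documents when a data source is deleted. The connection string, database/collection names, and the `source_id` field are assumptions, not the actual nlp4all schema.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
documents = client["nlp4all"]["documents"]         # assumed db/collection names


def add_document(source_id: int, raw_doc: dict) -> str:
    """Store the full document in MongoDB and return its ID, which the
    relational row keeps so the two stores can be joined later."""
    result = documents.insert_one({**raw_doc, "source_id": source_id})
    return str(result.inserted_id)


def drop_data_source(source_id: int) -> None:
    """Cleanup hook: remove all documents belonging to a deleted data source."""
    documents.delete_many({"source_id": source_id})
```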

Update
The version with a GIN index on the document column actually takes longer, both for import and for property deletion. This makes sense, since it has to update more information at each step, and the indexing probably doesn't extend to such deep nesting (it could, if the structure were consistent). It looks like MongoDB may be the way to go.
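For reference, the GIN-indexed variant mentioned above can be expressed in SQLAlchemy roughly as follows; the model, table, and index names are assumptions rather than the actual nlp4all models.

```python
from sqlalchemy import Column, Integer, Index
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class DataSourceDocument(Base):
    """Assumed model name; the real nlp4all model may differ."""

    __tablename__ = "data_source_document"

    id = Column(Integer, primary_key=True)
    document = Column(JSONB)

    __table_args__ = (
        # GIN index over the whole JSONB document; this is the variant
        # that turned out to slow down both import and key deletion.
        Index("ix_document_gin", document, postgresql_using="gin"),
    )
```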

UPDATE 2

MongoDB has now been implemented, and the import now takes about 3 minutes. Key deletion still takes around 8 minutes, which leaves one remaining task: process the schema BEFORE import and only import the required keys. The whole process should then be much quicker, probably totaling about the same 3 minutes.
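For context on the key-deletion cost, removing an unselected key across the whole MongoDB collection is a single `update_many` with `$unset`, along the lines of the sketch below (connection string and collection names are the same assumptions as in the earlier sketch).

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
documents = client["nlp4all"]["documents"]         # assumed db/collection names


def drop_key(key: str) -> None:
    """Strip an unselected key from every stored document in one query."""
    documents.update_many({}, {"$unset": {key: ""}})
```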