Project 2: De-duplication of Spanish Language Articles using UDPipe.
- Ishan Sharma - IXS171130
- Mavis Francia - MCF140030
- Tanushri Singh - TTS150030
- Vyaas Shenoy - VNS170230
There are two crawlers: News Please and an RSS crawler.

News Please can be installed with `pip install news-please` and run with `news-please -c Config/config.cfg`. By default it uses the `config.cfg` file inside the `Crawler` folder. Some websites tend to crash it; hjson comments in the config mark those websites.

The crawler writes articles to disk, and these files can be indexed by running `python misc/mongo_index.py`. That file also has some extra configuration options that can be changed at the top.
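The indexing step can be sketched roughly as follows. This is a minimal stand-in for `misc/mongo_index.py`: the on-disk layout and the field names (`url`, `title`, `maintext`) are assumptions about the news-please output format, and the Mongo write itself is left as a comment.

```python
import json
from pathlib import Path

def collect_articles(data_dir):
    """Walk the crawler's output directory and load every JSON article.

    The directory layout and field names are assumptions about the
    news-please output; misc/mongo_index.py is the real indexer.
    """
    articles = []
    for path in Path(data_dir).rglob("*.json"):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        # Keep only the fields the later pipeline stages need (hypothetical).
        articles.append({
            "url": doc.get("url"),
            "title": doc.get("title"),
            "text": doc.get("maintext"),
        })
    return articles

# A real run would then insert these into Mongo, e.g.:
# MongoClient().big_data.spanish_articles.insert_many(articles)
```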
The RSS crawler uses the `Crawler/config/sitelist_rss.hjson` config file. This file can be regenerated automatically by running `python3 misc/crawler_config_transformer.py` whenever the sites in `sitelist.hjson` have changed.
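The transformation itself amounts to mapping one site list onto the shape the RSS crawler expects. The sketch below is hypothetical in both input and output shape (and uses plain Python objects rather than hjson); the real logic lives in `misc/crawler_config_transformer.py`.

```python
def transform_sitelist(sites):
    """Turn a plain list of site URLs into RSS-crawler config entries.

    Both the input shape (a list of URLs) and the output keys
    ("base_urls", "crawler") are assumptions for illustration only.
    """
    return {
        "base_urls": [
            {"url": site, "crawler": "RssCrawler"} for site in sites
        ]
    }
```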
The model can be trained using `misc/doc2vec.py`. It reads data from the Mongo database `big_data`, collection `spanish_articles`, and saves the trained model to `misc/models/`.
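The training step can be sketched as below. The tokenizer is a minimal stand-in for whatever preprocessing `misc/doc2vec.py` actually does, the gensim import is deferred into `train` so the tokenizer stays dependency-free, and all hyperparameter values are assumptions rather than the project's settings.

```python
import re

def tokenize(text):
    """Minimal tokenizer: lowercase word tokens only (a stand-in for the
    real preprocessing in misc/doc2vec.py)."""
    return re.findall(r"\w+", text.lower())

def train(articles, model_path="misc/models/doc2vec.model"):
    """Train a doc2vec model over article dicts with 'text' and '_id' keys
    (hypothetical field names) and save it to disk."""
    # gensim is imported lazily so the tokenizer above works without it.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(tokenize(a["text"]), [str(a["_id"])])
        for a in articles
    ]
    # Hypothetical hyperparameters, not the project's tuned values.
    model = Doc2Vec(corpus, vector_size=100, epochs=20, min_count=2)
    model.save(model_path)
    return model
```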
After model training, `Pandas/doc2vec.py` needs to be run. It reads from the `spanish_articles` collection and writes to the `d2v_calculated` collection.
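Once document vectors are stored, near-duplicates can be flagged by comparing them pairwise with cosine similarity. A minimal sketch follows; the 0.9 cutoff is an assumed value, not the project's tuned threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_duplicate(u, v, threshold=0.9):
    """Flag a pair of document vectors as near-duplicates.

    The threshold is an assumption for illustration.
    """
    return cosine(u, v) >= threshold
```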
Download a Universal Dependencies model for Spanish. Here are two different ones:

You can find a full list of models here.

Install the `ufal.udpipe` library by running `pip install ufal.udpipe`. You can read more about this library here.
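UDPipe emits CoNLL-U output. The real call chain goes through `ufal.udpipe.Model.load` and `Pipeline` (shown in the trailing comment, with a hypothetical model path); as a self-contained illustration, here is a sketch that pulls the lemma column out of CoNLL-U text.

```python
def lemmas(conllu_text):
    """Extract the lemma column (field 3) from CoNLL-U output.

    CoNLL-U is tab-separated with ten columns per token line; comment
    lines start with '#' and sentences are separated by blank lines.
    """
    out = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            out.append(cols[2])
    return out

# With the real library, conllu_text would come from something like:
#   from ufal.udpipe import Model, Pipeline
#   model = Model.load("spanish-ud.udpipe")  # model path is hypothetical
#   pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT,
#                       Pipeline.DEFAULT, "conllu")
#   conllu_text = pipeline.process("Los gatos duermen.")
```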
Once UDPipe is installed, the Spark job can be run using `spark-submit Streaming/streamToSpark.py`. It writes the results to the Mongo collection `udpipe_parse`.
Similarity from UDPipe can be calculated by running `python UDPipe/runningJaccSim.py`. This writes the results to the `jacc_sim_calculated` collection.
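Jaccard similarity compares two token (or parse-feature) sets as |A ∩ B| / |A ∪ B|. A minimal sketch, with the set construction assumed rather than taken from `runningJaccSim.py`:

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets.

    Returns |A ∩ B| / |A ∪ B|; two empty sets are defined as identical.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```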
An iPython notebook for graphing and viewing some statistics is included at `Analysis/data_analysis.ipynb`.
All data can be fetched from * onedrive link
All models can be fetched from here