NewsFetch
Subprojects
This repository contains the following subprojects:
- newsfetch-core: The core library for the NewsFetch project
- newsfetch-common-crawl: Utilities that NewsFetch uses to interface with CommonCrawl
- newsfetch-api: An example API for the NewsFetch project
- sample-data: Sample data
Projects that NewsFetch depends on
- CommonCrawl: CommonCrawl is a large-scale web crawl that NewsFetch uses to collect news articles
- NewsPlease: NewsPlease is a Python library that NewsFetch uses to extract news articles from HTML pages
For enriching the news articles, NewsFetch uses the following projects:
- spaCy: spaCy is a Python library for natural language processing
- HuggingFace: HuggingFace hosts pre-trained ML models that NewsFetch uses for natural language processing
Setup
First install the following:
- Python 3.9
- Poetry
Recommended: use pyenv to manage your Python versions.
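As a quick sanity check before going further, the sketch below reports whether the `python` on your PATH is a 3.9 release. The helper name `is_python_39` is made up for this example and is not part of NewsFetch. The pyenv hint it prints assumes a recent pyenv release that resolves a `3.9` prefix to the latest patch version; older releases may need a full version number.

```shell
# Succeeds only if the given "major.minor[.patch]" string is a 3.9 release.
# (Hypothetical helper for illustration; not part of the repository.)
is_python_39() {
    case "$1" in
        3.9|3.9.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask the interpreter for its version; fall back to "none" if python is missing.
current="$(python -c 'import platform; print(platform.python_version())' 2>/dev/null || echo none)"

if is_python_39 "$current"; then
    echo "Python $current matches the required 3.9"
else
    # Assumption: a pyenv recent enough to accept a version prefix.
    echo "Python $current found; consider: pyenv install 3.9 && pyenv local 3.9"
fi
```

Running `pyenv local 3.9` inside the repository writes a `.python-version` file, so the selected interpreter applies only to that directory tree.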
Virtual environment
Using a virtual environment is highly recommended, since it avoids dependency conflicts with other projects.
To create and activate a virtual environment, run the following commands in each subproject:
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
source venv/bin/activate
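To confirm that activation worked, one option is to inspect the `VIRTUAL_ENV` variable, which the venv `activate` script exports. The function name `venv_status` below is a hypothetical helper for this sketch, not something the repository provides.

```shell
# Report whether a virtual environment is active in the current shell.
# The activate script exports VIRTUAL_ENV; it is unset or empty otherwise.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: ${VIRTUAL_ENV}"
    else
        echo "inactive"
    fi
}

venv_status
```

If this prints `inactive`, the `source venv/bin/activate` step was probably run in a different shell session or from a different directory.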
Install dependencies
Poetry is used to install and manage dependencies. It is also used to package the modules/libraries.
Note: The subprojects use relative paths to import the other subprojects/libraries. This is done to make it easier to develop the subprojects.
To install the dependencies, run the following command:
poetry install