MohamedBsh / ETL-News-Articles

A simple ETL pipeline that extracts information from news articles, transforms the article text to XML via spaCy, and loads it into SQLite.

News Article Text -> SQLite Database

An ETL pipeline that scrapes information from news articles, transforms the article text to XML via spaCy, and loads it into SQLite for use in modeling and analysis.

Files

  • crawler.py - Web crawler class built with BeautifulSoup
  • database.py - Database class for the SQLite database connection
  • xml.py - Extracts entities and dependencies from text using spaCy
  • main.py - Main script that crawls a news site, extracts text using Newspaper, and uploads it to a SQLite database
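
The transform and load steps described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in xml.py or database.py: the function names (`entities_to_xml`, `load_article`), the XML layout, and the `articles` table schema are all assumptions made for the example. The spaCy step is represented by plain `(text, label)` pairs so the sketch runs without a language model installed. Run it as its own script outside the repository folder, since a local xml.py would shadow the standard-library `xml` package.

```python
import sqlite3
import xml.etree.ElementTree as ET


def entities_to_xml(text, entities):
    """Build an XML string from article text and (entity_text, label) pairs.

    With spaCy, the pairs could be produced by something like:
        [(ent.text, ent.label_) for ent in nlp(text).ents]
    The element names used here are hypothetical.
    """
    root = ET.Element("article")
    ET.SubElement(root, "text").text = text
    ents_node = ET.SubElement(root, "entities")
    for ent_text, label in entities:
        node = ET.SubElement(ents_node, "entity", label=label)
        node.text = ent_text
    return ET.tostring(root, encoding="unicode")


def load_article(conn, url, xml_text):
    """Insert or update one article row in a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    # Parameterized query: values are bound by sqlite3, not string-formatted.
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, xml) VALUES (?, ?)",
        (url, xml_text),
    )
    conn.commit()


if __name__ == "__main__":
    # Hand-written sample input standing in for scraped text and spaCy output.
    sample_text = "Apple hired Tim Cook."
    sample_entities = [("Apple", "ORG"), ("Tim Cook", "PERSON")]

    connection = sqlite3.connect(":memory:")
    load_article(
        connection,
        "https://example.com/article",
        entities_to_xml(sample_text, sample_entities),
    )
    for row in connection.execute("SELECT url, xml FROM articles"):
        print(row)
```

Keying the table on the article URL means re-crawling the same page overwrites the earlier row instead of duplicating it; whether the actual project does this is not stated in this README.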

Contact Me

Email: adamr@hey.com
LinkedIn: https://www.linkedin.com/in/adamrauckhorst/

Languages

Python: 100.0%