Erdos1729 / webscrapping-identify-download-classify-published-pdfs-from-multiple-urls

This repository helps you scrape data from multiple websites. It identifies, downloads, and classifies the latest PDF files published on a website according to the user's requirements. This can be used to automate various operations involved in market research.

Web scraping to identify and download the latest PDF documents, and classify them into pre-defined categories.

  • This repository helps you scrape data from multiple websites. It downloads the latest PDF files published on a website into a specific folder, according to the user's requirements. This can be used to automate various operations involved in market research.
  • Once the PDFs are downloaded, they are classified into oil / no_oil / foreign_language categories based on a string-based rule.
  • You can customize these classification rules as needed.
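The string-based classification step above can be sketched roughly as follows. The keyword lists, marker words, and function name are illustrative assumptions for this sketch, not the exact rules used in radar_automation.py:

```python
# Sketch of a string-rule classifier for extracted PDF text.
# OIL_KEYWORDS and ENGLISH_MARKERS are placeholder word lists;
# customize them to match your own classification rules.
OIL_KEYWORDS = {"oil", "crude", "petroleum", "barrel", "opec"}
ENGLISH_MARKERS = {"the", "and", "of", "to", "in"}


def classify_text(text):
    """Return 'oil', 'no_oil', or 'foreign_language' for extracted text."""
    words = set(text.lower().split())
    # If none of the common English marker words appear,
    # assume the PDF is in a foreign language.
    if not words & ENGLISH_MARKERS:
        return "foreign_language"
    return "oil" if words & OIL_KEYWORDS else "no_oil"
```

Swapping in your own keyword sets (or adding more categories) only requires editing the rule table, not the control flow.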

Instructions

  • `pip install -r requirements`
  • Run `radar_automation.py`

Reference

I devised the solution from the following pages of the documentation:

  • [urllib] — a package that collects several modules for working with URLs
  • [beautifulsoup4] — to scrape information from web pages
  • [pdfminer] — a text extraction tool for PDF documents
  • [NLTK] — for natural language processing
  • Keyword-based search in the extracted text for rule-based classification
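As a rough illustration of how the first two pieces fit together, the sketch below finds PDF links on a page with beautifulsoup4 and downloads them with urllib. The function names and the output folder are placeholder assumptions; the actual logic in radar_automation.py may differ:

```python
import os
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup


def extract_pdf_links(html, base_url):
    """Return absolute URLs of all PDF links found in an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".pdf")
    ]


def download_pdfs(page_url, out_dir="downloads"):
    """Download every PDF linked from page_url into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(page_url).read()
    saved = []
    for pdf_url in extract_pdf_links(html, page_url):
        target = os.path.join(out_dir, os.path.basename(pdf_url))
        urlretrieve(pdf_url, target)  # fetch and save the PDF
        saved.append(target)
    return saved
```

Keeping link extraction separate from downloading makes the scraping rule easy to test and to adapt per site.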
