Erdos1729 / webscrapping-identify-download-classify-published-pdfs-from-multiple-urls

This repository helps you scrape data from multiple websites. It identifies, downloads, and classifies the latest PDF files published on a website according to the user's requirements. This can be used to automate various operations involved in market research.

Web scraping to identify and download the latest PDF documents, and classify them into pre-defined categories.

  • This repository helps you scrape data from multiple websites. It downloads the latest PDF files published on a website into a specific folder, according to the user's requirements. This can be used to automate various operations involved in market research.
  • Once the PDFs are downloaded, they are classified into oil / no_oil / foreign_language categories based on a string-based rule.
  • You can customize these classification rules as needed.
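The string-based classification step above can be sketched roughly as follows. The keyword lists, marker words, and function name are illustrative assumptions for this sketch, not the exact rules used in radar_automation.py:

```python
# Sketch of a string-rule classifier for extracted PDF text.
# OIL_KEYWORDS and ENGLISH_MARKERS are placeholder word lists;
# customize them to match your own classification rules.
OIL_KEYWORDS = {"oil", "crude", "petroleum", "barrel", "opec"}
ENGLISH_MARKERS = {"the", "and", "of", "to", "in"}


def classify_text(text):
    """Return 'oil', 'no_oil', or 'foreign_language' for extracted text."""
    words = set(text.lower().split())
    # If none of the common English marker words appear,
    # assume the PDF is in a foreign language.
    if not words & ENGLISH_MARKERS:
        return "foreign_language"
    return "oil" if words & OIL_KEYWORDS else "no_oil"
```

Swapping in your own keyword sets (or adding more categories) only requires editing the rule table, not the control flow.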

Instructions

  • `pip install -r requirements`
  • Run `radar_automation.py`

Reference

I devised the solution from the following pages of the documentation:

  • [urllib] — a package that collects several modules for working with URLs
  • [beautifulsoup4] — to scrape information from web pages
  • [pdfminer] — a text extraction tool for PDF documents
  • [NLTK] — for natural language processing
  • Keyword-based search in the extracted text for rule-based classification
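As a rough illustration of how the first two pieces fit together, the sketch below finds PDF links on a page with beautifulsoup4 and downloads them with urllib. The function names and the output folder are placeholder assumptions; the actual logic in radar_automation.py may differ:

```python
import os
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup


def extract_pdf_links(html, base_url):
    """Return absolute URLs of all PDF links found in an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".pdf")
    ]


def download_pdfs(page_url, out_dir="downloads"):
    """Download every PDF linked from page_url into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(page_url).read()
    saved = []
    for pdf_url in extract_pdf_links(html, page_url):
        target = os.path.join(out_dir, os.path.basename(pdf_url))
        urlretrieve(pdf_url, target)  # fetch and save the PDF
        saved.append(target)
    return saved
```

Keeping link extraction separate from downloading makes the scraping rule easy to test and to adapt per site.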
