yashmanne / UniScraper

A universal scraper that grabs text from multiple types of webpages.

UniScraper

Description

UniScraper is a universal scraper that collects text from multiple types of webpages. It currently supports HTML (including dynamic webpages that use JavaScript), online PDFs, Word documents, presentation slides, and spreadsheets.
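To illustrate the idea of handling several document types behind one entry point, here is a minimal sketch that guesses the document kind from a URL's file extension. It is not UniScraper's actual API: the names `EXTENSION_KINDS` and `guess_kind`, the extension list, and the fallback to HTML are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to document kind.
# This table is an assumption for illustration, not taken from UniScraper.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".doc": "word",
    ".docx": "word",
    ".ppt": "slides",
    ".pptx": "slides",
    ".xls": "spreadsheet",
    ".xlsx": "spreadsheet",
    ".csv": "spreadsheet",
}


def guess_kind(url: str) -> str:
    """Guess the document kind from the URL path, defaulting to HTML."""
    # urlparse separates the path from the query string and fragment,
    # so "report.pdf?download=1" still yields the ".pdf" suffix.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_KINDS.get(suffix, "html")
```

A real scraper would likely also check the response's Content-Type header, since many URLs that serve PDFs or Office files carry no extension.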

Installation instructions

Clone the git repo:

git clone https://github.com/caimeng2/UniScraper.git

Create and activate a conda environment by running the following commands from the repository root:

conda env create --prefix ./envs --file environment.yml

conda activate ./envs

To make the environment available as a kernel in Jupyter Notebook, run:

conda install -c anaconda ipykernel

python -m ipykernel install --user --name=envs

For more details, see https://moonbooks.org/Articles/How-to-use-a-specific-python-conda-environment-in-a-Jupyter-notebook-/

You also need to install nltk.

Dependencies

bs4, webdriver_manager, pandas, selenium, nltk, requests, python-docx, python-pptx, pdfminer
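As a quick sanity check after setting up the environment, the following sketch reports which of the packages listed above cannot be imported. The helper name `missing_modules` is hypothetical and not part of UniScraper. Note that python-docx and python-pptx are imported as `docx` and `pptx`.

```python
import importlib.util

# Import names for the dependencies listed above.
# python-docx installs the module "docx"; python-pptx installs "pptx".
REQUIRED_MODULES = [
    "bs4",
    "webdriver_manager",
    "pandas",
    "selenium",
    "nltk",
    "requests",
    "docx",
    "pptx",
    "pdfminer",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

Run it inside the activated conda environment; any names it prints are modules that still need to be installed.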

Example usage

Please run example.ipynb to see example usage.

In the top cell of the notebook, run the following:

    import nltk
    
    nltk.download('words')
