webscraping scraping scraper crawler ai artificial-intelligence python webautomation automation

AI Product Information Extractor

The AI Product Information Extractor implements Named Entity Recognition to extract key entities from product webpages of electronics websites using the CRF Classifer machine learning model. The entities extracted are:

Brand
Model
Price
Availability
Condition
Category

Python Dependencies

Flask (Version 1.1.1)
WTForms (Version 2.2.1)
python-tds (Version 1.9.1)
bs4 (Version 0.0.1)
lxml (Version 4.4.1)
google-api-python-client (Version 1.7.11)
google-api-core (Version 1.14.3)
google-api-python-client (Version 1.7.11)
google-auth (Version 1.6.3)
google-auth-httplib2 (Version 0.0.3)
google-cloud (Version 0.34.0)
google-cloud-core (Version 1.0.3)
google-cloud-storage (Version 1.20.0)
google-compute-engine (Version 2.8.16)
google-resumable-media (Version 0.4.1)
googleapis-common-protos (Version 1.6.0)

Java Dependencies

Execution of the java code contained within the src folder requires the following jar files:

stanford-corenlp-3.9.2.jar
stanford-corenlp-models-current.jar
stanford-english-corenlp-models-current.jar
stanford-english-kbp-corenlp-models-current.jar

The above jar files can be downloaded from the following link: https://github.com/stanfordnlp/CoreNLP

Note: After adding these jar files to the build path, the java code within the src folder must be converted into a jar file called crf.jar for integration with Python.

About

Open Source AIWebscraper built in Python, based on StanfordNLP models

webscraping scraping scraper crawler ai artificial-intelligence python webautomation automation

Languages

Language:Python 43.1%Language:JavaScript 38.0%Language:HTML 6.4%Language:Java 6.3%Language:CSS 6.2%Language:Shell 0.1%