TopreGroup / AIWebscraper

Open Source AIWebscraper built in Python, based on StanfordNLP models

AI Product Information Extractor

The AI Product Information Extractor uses Named Entity Recognition (NER) to extract key entities from product pages on electronics websites. Recognition is performed by a CRF classifier (a Conditional Random Field machine learning model). The extracted entities are:

  • Brand
  • Model
  • Price
  • Availability
  • Condition
  • Category
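As a rough illustration of how tagged tokens can be collected into these entities, the sketch below groups consecutive tokens that share a label. It assumes the classifier output is in the `token/LABEL` slash-tag style and that the labels are named after the entities above (`BRAND`, `MODEL`, and so on); both the label names and the helper `group_entities` are hypothetical and may not match this project's trained model or code.

```python
from itertools import groupby

# Hypothetical label names mirroring the entity list above; the labels used
# by the project's trained model may differ.
ENTITY_LABELS = {"BRAND", "MODEL", "PRICE", "AVAILABILITY", "CONDITION", "CATEGORY"}


def group_entities(tagged_text):
    """Group consecutive tokens that share a label into entity strings.

    `tagged_text` is a string of whitespace-separated `token/LABEL` pairs.
    Returns a dict mapping each entity label to a list of matched strings.
    """
    # rpartition splits on the LAST slash, so tokens containing "/" survive.
    pairs = [pair.rpartition("/") for pair in tagged_text.split()]
    entities = {}
    for label, run in groupby(pairs, key=lambda p: p[2]):
        if label in ENTITY_LABELS:
            text = " ".join(token for token, _, _ in run)
            entities.setdefault(label, []).append(text)
    return entities


if __name__ == "__main__":
    sample = "Apple/BRAND iPhone/MODEL 11/MODEL costs/O $699/PRICE"
    print(group_entities(sample))
```

Tokens with any label outside `ENTITY_LABELS` (such as a background label `O`) are dropped, so only the six entity types reach the caller.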

Python Dependencies

  • Flask (Version 1.1.1)
  • WTForms (Version 2.2.1)
  • python-tds (Version 1.9.1)
  • bs4 (Version 0.0.1)
  • lxml (Version 4.4.1)
  • google-api-python-client (Version 1.7.11)
  • google-api-core (Version 1.14.3)
  • google-auth (Version 1.6.3)
  • google-auth-httplib2 (Version 0.0.3)
  • google-cloud (Version 0.34.0)
  • google-cloud-core (Version 1.0.3)
  • google-cloud-storage (Version 1.20.0)
  • google-compute-engine (Version 2.8.16)
  • google-resumable-media (Version 0.4.1)
  • googleapis-common-protos (Version 1.6.0)
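One way to confirm that an environment matches these pins is to compare them against the installed package metadata. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the `PINNED` dictionary lists only a few of the packages above, and the helper `find_mismatches` is a hypothetical name, not part of this project.

```python
from importlib import metadata

# A subset of the pins listed above; extend with the remaining packages as needed.
PINNED = {
    "Flask": "1.1.1",
    "WTForms": "2.2.1",
    "python-tds": "1.9.1",
    "lxml": "4.4.1",
}


def find_mismatches(pinned, lookup=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    `lookup` is injectable so the check can be exercised without a real
    environment.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```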

Java Dependencies

Running the Java code in the src folder requires the following JAR files:

  • stanford-corenlp-3.9.2.jar
  • stanford-corenlp-models-current.jar
  • stanford-english-corenlp-models-current.jar
  • stanford-english-kbp-corenlp-models-current.jar

The JAR files above can be downloaded from: https://github.com/stanfordnlp/CoreNLP

Note: After adding these JAR files to the build path, the Java code in the src folder must be packaged into a JAR file named crf.jar so that it can be integrated with Python.
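How the Python side calls crf.jar is not described here. One common approach is to launch it as a subprocess, sketched below. This sketch assumes that crf.jar declares a Main-Class in its manifest and accepts an input file path as its only argument; those assumptions, along with the helper names `build_crf_command` and `run_crf`, are illustrative and may not reflect the project's actual integration.

```python
import subprocess


def build_crf_command(jar_path, input_path):
    """Build the argument list for running the packaged classifier.

    Assumes crf.jar has a Main-Class manifest entry and takes the input file
    as its only argument; neither is documented behavior of this project.
    """
    return ["java", "-jar", str(jar_path), str(input_path)]


def run_crf(jar_path, input_path, timeout=60):
    """Run the jar and return its standard output as text.

    Raises subprocess.CalledProcessError if Java exits with a non-zero
    status, and subprocess.TimeoutExpired if the run exceeds `timeout` seconds.
    """
    result = subprocess.run(
        build_crf_command(jar_path, input_path),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout


if __name__ == "__main__":
    # "page.txt" is a placeholder input file for illustration.
    print(run_crf("crf.jar", "page.txt"))
```

When a JAR is launched with `java -jar`, the classpath is taken from the JAR's manifest, so the Stanford CoreNLP JARs would need to be bundled into crf.jar or referenced from its manifest for this form of invocation to work.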

Languages

  • Python (43.1%)
  • JavaScript (38.0%)
  • HTML (6.4%)
  • Java (6.3%)
  • CSS (6.2%)
  • Shell (0.1%)