Tuanthai4444 / Webcrawler

KeyText

Problem

With hundreds of thousands of articles, research papers, and encyclopedia pages published each day, it is increasingly difficult for researchers, journalists, and everyday readers to keep up. This difficulty in digesting content reduces efficiency in many areas, from jobs, teaching, and research to simply reading the news.

Solution

KeyText alleviates these problems by providing functions that extract information from content. These range from crawling the web for important links and documents to processing those documents for key information. Using language and text processing algorithms and techniques, KeyText captures important keywords and creates summarized text.
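To make the keyword-capturing idea concrete, here is a minimal sketch of TF-IDF keyword selection in Java. It is not KeyText's actual code: the class name `TfIdfSketch`, the methods `tokenize` and `topKeywords`, and the choice of plain relative term frequency with a natural-log IDF are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; names and weighting choices are assumptions,
// not taken from the KeyText source.
public class TfIdfSketch {

    /** Lowercases text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns up to n terms of docs.get(target), ranked by TF-IDF (ties broken alphabetically). */
    static List<String> topKeywords(List<List<String>> docs, int target, int n) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Raw term counts for the target document.
        List<String> doc = docs.get(target);
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }

        // score = (count / document length) * ln(N / df)
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / doc.size();
            double idf = Math.log((double) docs.size() / df.get(e.getKey()));
            score.put(e.getKey(), tf * idf);
        }

        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

Terms that appear in every document get an IDF of zero under this weighting, so common words such as "the" drop to the bottom of the ranking without needing a stop-word list.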

Usage

KeyText is currently used within EduScribe.

KeyText Roadmap

  • Document link scraper for HTML ✔️
  • Okapi BM25 algorithm for document selection ✔️
  • Multiple BM25 options ❌
  • Document link scraper for other file/link types ❌
  • Optimized scraper for scanning the web and credible journals/websites ❌
  • TF-IDF algorithms for important word selection ✔️
  • Multiple TF-IDF operations and customizable K-value TF ✔️
  • Customer-focused design based on TF-IDF option benchmarking/use cases ✔️
  • Bag-of-words-style pruning ✔️
  • Single-document-focused pruning ❌
  • PCA and LSA implementations and research ❌
  • Document summarization (TF-IDF) ✔️
  • Optimized document summarization ❌
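
As an illustration of the Okapi BM25 item in the roadmap, the following is a minimal sketch of BM25 document scoring in Java. It is not KeyText's implementation: the class name `Bm25Sketch`, the method `score`, the parameter values `K1 = 1.2` and `B = 0.75` (common defaults), and the non-negative IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))` are all assumptions chosen for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration; names and parameter values are assumptions,
// not taken from the KeyText source.
public class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of document-length normalization

    /** Returns one BM25 score per document for the given query terms. */
    static double[] score(List<List<String>> docs, List<String> query) {
        int n = docs.size();
        double avgLen = docs.stream().mapToInt(List::size).average().orElse(0.0);
        double[] scores = new double[n];

        // Each distinct query term contributes independently to the score.
        for (String term : new HashSet<>(query)) {
            int df = 0;
            for (List<String> doc : docs) {
                if (doc.contains(term)) {
                    df++;
                }
            }
            // Non-negative IDF variant (an assumption; other variants exist).
            double idf = Math.log(1.0 + (n - df + 0.5) / (df + 0.5));

            for (int i = 0; i < n; i++) {
                int f = Collections.frequency(docs.get(i), term);
                if (f == 0) {
                    continue; // term absent: contributes nothing
                }
                double lengthNorm = 1.0 - B + B * docs.get(i).size() / avgLen;
                scores[i] += idf * f * (K1 + 1.0) / (f + K1 * lengthNorm);
            }
        }
        return scores;
    }
}
```

Documents can then be selected by sorting on the returned scores and keeping the highest-ranked ones. Exposing `K1` and `B` as arguments instead of constants would be one way to support multiple BM25 options.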

About

License: Other

Languages

Java 100.0%