einstin88 / sapling-release

A neural-network based AI to help you learn more from your readings with less time. Start by asking Sapling questions!

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Sapling Logo

Sapling 🌱

  • current release: v0.1
  • Aims to help students pin-point answers (1) from a large corpus of journal articles. It could also work for other formats of texts, suchs as textbook chapters or e-books.

Contents

  1. What it does?
  2. Why Sapling?
  3. Features
  4. Getting started
  5. FAQ
  6. Technical description

What it does?

In short:

  1. It reads in the passages of the documents in your folder and determine their relevancy to your questions. The top 5 passages will then be shown!
  2. Imagine a custom Google-like search engine for specific ideas or concepts in your readings! You can then tell Sapling to open the correct document for you and dive right in~~
  3. You may also use Sapling to do a negative query, that is searching the documents to be certain that they don't have the information you are looking for. Useful for filtering!
    • For example, if the retrieved answers turn out to have low confidence scores, it is likely that the information you need is not in those given documents (even after you tried rephrasing the question)
  • It does NOT assess your texts qualitatively, i.e., strengths & weaknesses or arguments, fallacies, robustness of results or the underlying research methodologies unless they are explicitly mentioned! These tasks will remain the responsibility of the reader fortunately/unfortunately πŸ˜ͺ

Why Sapling?

The creation of Sapling hopes :-

  1. to reduce the time ⏲ needed by students to learn about any text-based (usually new) topic
  2. to provides answers on the fly, such as during lectures πŸ™‹β€β™€οΈπŸ™‹β€β™‚οΈ and in workgroup sessions
  3. to help when you want to refresh on old knowledge such as topics that was learnt months or years ago
  4. to help reduce anxiety (due to information overload) while preparing for exams πŸ‘¨β€πŸ’»πŸ‘©β€πŸ’» or essay writing
  5. to ultimately reduce the knowledge acquisition barrier to help every one succeed in their education journey

(Inspired by advancements in AI πŸ€– on Natural Language Processing (NLP) such as IBM's Watson, FB's DrQA, and Google Research's BERT & ALBERT)

Features

  1. Uses nearly state-of-the-art human language comprehension and question answering architechture (produced in end-2019 to early-2020) that does not rely on memorizing words or questions to find answers. (The method is too technical to explain here, look at the article "Attention is all you need" by Google if you are interested )
  2. Sapling reads your texts to find the 5 most relevant sentences across multiple PDF articles within minutes based on your question about its topic. 😎 Confidence scores of each result are displayed!
  3. Locate the files and paragraphs where the answers are. You can open the file from the results if you want to.
  4. Leverages modern computing capability and speed to quickly 'read' contents among texts. Sapling works best when more texts files are being fed (2). You could for example feed it with all the articles and book chapters required for a course!
  5. Currently supports most PDFs files. Support for .txt and docx will be added in an upcoming version.
  6. Works on Windows 10 🍊 and MacOS 13 🍎 or newer.

Getting Started:

Download, Unzip, Run!

  1. Setting up

    1. Download 'Sapling':

    2. Download and install Java runtime

      • For Windows
      • For macOS ➑ choose 'macOS installer'
      • Java is required to run the parser that converts PDFs to a text format that is understandable by the computer. The same parser can also convert images or other document formats to plain texts which will come in future releases.
  2. Extract and run

    • Windows : Run autorun.bat after extracting the zip file
    • Mac:
      1. Follow instructions in the zipped file to run autorun.command.
      2. When you get a prompt about security, goto 'Setting' > 'Security & Privacy' > select 'open anyway'
  3. Provide path to the folder with your PDFs

    • This is the knowledge base which Sapling draw her answers from.

    • Example:

    This is the folder with the PDFs

    directory

    The full path to the folder provided like this

    then path to the folder

    • For macOS: 2 easy ways to copy the folder path
      1. Option 1:
        • Drag and drop into the console window
      2. Option 2:
        • Right click on the folder, then hold the 'option' key. You should see 'Copy xxx as pathname' option

    copy path on mac

    • You can easily drag and drop into the console on Windows
  4. Question away!

    • Ask anything you like or something you vaguely remember from reading the texts
    • And repeat!

Example query & results:

Key in a question

Example query

Voila!

Results

FAQ

  1. What it means by texts are not parsable?

    • It means that the PDF is a scanned image, or the internal character mapping is corrupt. Usually occur with very old PDFs.
  2. Why am I getting failed to see startup message error?

    • It takes a couple of seconds to load the PDF parser. Sometimes the process monitor times out before it is loaded. If it continues, then don't worry about the message. Otherwise, you may be running with MacOS 10.14 or older, which has old Java setup and will cause strange behaviour.
  3. Why am I getting segmentation fault : 11?

    • This should be fixed now. If you're using MacOS, it is due to the way Python and Mac handles memory. You may have fed it with a large file, increasing RAM and cache usage by the program.
  4. Why does it take a couple of minutes every time I enter a question?

    • It is still in the preliminary development phase of the program. So the focus is on accuracy of the results. Speed will be improved in the future for sure! Please let us know if you find the program useful or what needs to be improved, so it can be added as a future feature!

Feedback & issues reporting

Coming Soon

Technical description

  1. Preprocessing
    1. Retrieves compatible files from given directory
    2. Parse and clean texts of headers/footers/annotations/references
  2. Query processing
    1. Naive search for relevant docs with TF-IDF
    2. Fit query and passages using model pre-trained on Wiki texts and fine-tuned on Squad tasks, with span classification head
    3. Retrieve cross entropy losses to score passage fit and embedding vectors to compute argmax'es for answer spans
  3. Current model specifications
    1. Name: ALBERT base
    2. Vocabulary amount: 30,000
    3. Training data: English wikipedia, Squad v2

Version Releases

  • v0.1 [13-Oct-2020]: debut for alpha testing
    • features available
      1. Extract & clean texts from PDFs
      2. Return 5 top matching answers to your query

Notes

  1. Sapling's internal model was trained to understand public domain language and has not been trained on domain specific language, such as Political Science or Arts. That may reduce accuracy of answers slightly, but this limitation will be improved in future releases when it is used more often.
  2. Sapling has not yet been tested to its limit, but will perform slower when given hundreds of files. Performance is also machine dependant: processor speed, memory size and availability of Cuda-GPUs. (back)

return to top

Ideas for the future:

  • Improvements

    • simplify explanations for users with less technical background
  • Features

    • ability to change the number of results returned
    • improve search speed with multi-thread processing
    • tidier preprocessing of PDF headers, footers and citations
    • web-based UI
    • OCR capability for unparsable PDFs
    • extract text from docx files
    • save outputs
    • combine multiple directories as a common knowledge base
  • Bug fixes

    • Segmentation fault 11 - fault was due to pytorch _init_ calls

About

A neural-network based AI to help you learn more from your readings with less time. Start by asking Sapling questions!


Languages

Language:Python 100.0%