Sapling 🌱

current release: v0.1
Aims to help students pinpoint answers (1) from a large corpus of journal articles. It could also work for other text formats, such as textbook chapters or e-books.

What does it do?

In short:

It reads the passages of the documents in your folder and determines their relevance to your questions. The top 5 passages will then be shown!
Imagine a custom Google-like search engine for specific ideas or concepts in your readings! You can then tell Sapling to open the correct document for you and dive right in~~
You may also use Sapling to do a negative query, that is, searching the documents to be certain that they don't have the information you are looking for. Useful for filtering!
- For example, if the retrieved answers turn out to have low confidence scores, it is likely that the information you need is not in the given documents (even after you have tried rephrasing the question).
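The negative-query idea can be sketched in Python as a simple threshold check. The function name `looks_absent`, the 0.3 cutoff, and the assumption that confidence scores lie between 0 and 1 are all illustrative; none of them is taken from Sapling's actual interface.

```python
def looks_absent(scores, cutoff=0.3):
    """Return True when every retrieved answer scores below the cutoff.

    `scores` are the confidence scores of the top results. The 0-1 scale
    and the default cutoff are assumptions made for this sketch.
    """
    return all(score < cutoff for score in scores)


# Hypothetical confidence scores for the top 5 passages of two queries.
print(looks_absent([0.12, 0.09, 0.07, 0.05, 0.02]))  # True
print(looks_absent([0.81, 0.40, 0.22, 0.10, 0.04]))  # False
```

If every score stays below the cutoff even after rephrasing the question, the documents probably do not contain the answer.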

It does NOT assess your texts qualitatively, e.g., the strengths and weaknesses of arguments, fallacies, robustness of results, or the underlying research methodologies, unless they are explicitly mentioned! These tasks remain the responsibility of the reader, fortunately/unfortunately 😪

Sapling hopes:

to reduce the time ⏲ needed by students to learn about any text-based (usually new) topic
to provide answers on the fly, such as during lectures 🙋‍♀️🙋‍♂️ and in workgroup sessions
to help when you want to refresh old knowledge, such as topics that were learnt months or years ago
to help reduce anxiety (due to information overload) while preparing for exams 👨‍💻👩‍💻 or essay writing
to ultimately reduce the knowledge acquisition barrier and help everyone succeed in their education journey

(Inspired by advancements in AI 🤖 on Natural Language Processing (NLP) such as IBM's Watson, FB's DrQA, and Google Research's BERT & ALBERT)

Uses a nearly state-of-the-art human language comprehension and question answering architecture (developed between late 2019 and early 2020) that does not rely on memorizing words or questions to find answers. (The method is too technical to explain here; look at the article "Attention Is All You Need" by Google if you are interested.)
Sapling reads your texts to find the 5 most relevant sentences across multiple PDF articles within minutes based on your question about its topic. 😎 Confidence scores of each result are displayed!
Locate the files and paragraphs where the answers are. You can open the file from the results if you want to.
Leverages modern computing capability and speed to quickly 'read' the contents of texts. Sapling works best when more text files are fed to it (2). You could, for example, feed it all the articles and book chapters required for a course!
Currently supports most PDF files. Support for .txt and .docx will be added in an upcoming version.
Works on Windows 10 🍊 and macOS 10.13.6 🍎 or newer.

Download, Unzip, Run!

Setting up
1. Download 'Sapling':
  - For Windows
  - For macOS 10.13.6 or newer
2. Download and install Java runtime
  - For Windows
  - For macOS ➡ choose 'macOS installer'
  - Java is required to run the parser that converts PDFs to a text format that the computer can read. The same parser can also convert images and other document formats to plain text; support for these will come in future releases.
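If you are unsure whether Java is already installed, a quick check can be sketched in Python with the standard library. The function name `java_available` is hypothetical, and the check only looks for a `java` executable on the PATH; it does not verify the version.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java runtime found.")
else:
    print("Java runtime not found; install it before running Sapling.")
```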
Extract and run
- Windows: Run autorun.bat after extracting the zip file
- Mac:
  1. Follow instructions in the zipped file to run autorun.command.
  2. When you get a prompt about security, go to 'Setting' > 'Security & Privacy' > select 'open anyway'
Provide path to the folder with your PDFs
- This is the knowledge base that Sapling draws her answers from.
- Example:
This is the folder with the PDFs

The full path to the folder is provided like this
- For macOS: 2 easy ways to copy the folder path
  1. Option 1:
    - Drag and drop into the console window
  2. Option 2:
    - Right click on the folder, then hold the 'option' key. You should see a 'Copy xxx as pathname' option
- You can easily drag and drop into the console on Windows
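Dragged or pasted paths often arrive with extra characters: consoles on Windows typically wrap paths containing spaces in quotes, while the macOS Terminal escapes spaces with a backslash. The Python sketch below shows one way such input could be tidied. The function name `clean_folder_path` is hypothetical and this is not necessarily how Sapling handles the path internally.

```python
from pathlib import Path


def clean_folder_path(raw):
    """Normalise a folder path that was pasted or dragged into a console."""
    text = raw.strip()
    # Remove one pair of matching quotes around the whole path.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1]
    # Undo backslash-escaped spaces (assumes no folder name starts with a space).
    text = text.replace("\\ ", " ")
    return Path(text).expanduser()


# Hypothetical inputs as they might arrive from a console.
print(clean_folder_path('"/Users/me/My PDFs"'))
print(clean_folder_path("/Users/me/My\\ PDFs "))
```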
Question away!
- Ask anything you like or something you vaguely remember from reading the texts
- And repeat!

Key in a question

Voila!

What does 'texts are not parsable' mean?
- It means that the PDF is a scanned image, or its internal character mapping is corrupt. This usually occurs with very old PDFs.
Why am I getting a 'failed to see startup message' error?
- It takes a couple of seconds to load the PDF parser. Sometimes the process monitor times out before the parser is loaded. If the program continues to run, then don't worry about the message. Otherwise, you may be running macOS 10.14 or older, which has an old Java setup that causes strange behaviour.
Why am I getting 'segmentation fault: 11'?
- This should be fixed now. If you're using macOS, it is due to the way Python and macOS handle memory. You may have fed it a large file, which increases the program's RAM and cache usage.
Why does it take a couple of minutes every time I enter a question?
- The program is still in a preliminary development phase, so the focus is on the accuracy of the results. Speed will be improved in the future for sure! Please let us know if you find the program useful or what needs to be improved, so it can be added as a future feature!

Coming Soon

Preprocessing
1. Retrieves compatible files from given directory
2. Parses and cleans texts of headers/footers/annotations/references
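The two preprocessing steps can be sketched in Python roughly as follows. This is a minimal illustration, not Sapling's actual code: the function names, the non-recursive folder scan, and the "drop lines that repeat on most pages" heuristic for headers and footers are all assumptions, and the page texts are assumed to have been extracted from the PDF already.

```python
from collections import Counter
from pathlib import Path


def find_pdfs(folder):
    """List the PDF files directly inside `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on at least `min_share` of the pages.

    `pages` is a list of page texts. Running headers and footers tend to
    repeat across pages, so frequent lines are treated as boilerplate.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two occurrences so single-page documents are kept whole.
    limit = max(2, min_share * len(pages))
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if line.strip() and counts[line.strip()] < limit]
        cleaned.append("\n".join(kept))
    return cleaned


# Made-up page texts with a repeating header and footer.
pages = [
    "Journal of Examples\nCats sleep a lot.\nPage footer",
    "Journal of Examples\nDogs bark at night.\nPage footer",
    "Journal of Examples\nBirds sing at dawn.\nPage footer",
]
print(strip_repeated_lines(pages))
# ['Cats sleep a lot.', 'Dogs bark at night.', 'Birds sing at dawn.']
```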
Query processing
1. Naive search for relevant docs with TF-IDF
2. Fit query and passages using a model pre-trained on Wikipedia texts and fine-tuned on SQuAD tasks, with a span classification head
3. Retrieve cross-entropy losses to score passage fit, and embedding vectors to compute argmaxes for answer spans
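To make steps 1 and 3 more concrete, here is a small pure-Python sketch. It is an illustration under assumptions, not Sapling's implementation: `rank_by_tfidf` uses one common smoothed TF-IDF weighting, and `best_span` assumes the span head outputs one start score and one end score per token and picks the pair with the highest sum, which is a typical way to read out such a head. All function names and numbers are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_tfidf(query, docs):
    """Return document indices sorted from most to least relevant."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Smoothed inverse document frequency keeps weights positive.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) token pair with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider spans that end at or after the start token.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


docs = [
    "The mitochondria is the powerhouse of the cell.",
    "Inflation rose sharply in the 1970s.",
    "Cells divide through mitosis.",
]
print(rank_by_tfidf("What caused inflation in the 1970s?", docs))  # [1, 0, 2]
print(best_span([0.1, 2.0, 0.3], [0.2, 0.1, 3.0]))  # (1, 2)
```

In a pipeline like the one described, the cheap TF-IDF pass would narrow the candidates so that the slower neural model only has to read the most promising documents.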
Current model specifications
1. Name: ALBERT base
2. Vocabulary size: 30,000
3. Training data: English Wikipedia, SQuAD v2

v0.1 [13-Oct-2020]: debut for alpha testing
- features available
  1. Extract & clean texts from PDFs
  2. Return 5 top matching answers to your query

(1) Sapling's internal model was trained to understand general (public domain) language and has not been trained on domain-specific language, such as that of Political Science or the Arts. That may reduce the accuracy of answers slightly, but this limitation will be addressed in future releases when Sapling is used more often.
(2) Sapling has not yet been tested to its limit, but it will perform more slowly when given hundreds of files. Performance is also machine-dependent: it varies with processor speed, memory size, and the availability of CUDA GPUs.

Improvements
- simplify explanations for users with less technical background
Features
- ability to change the number of results returned
- improve search speed with multi-thread processing
- tidier preprocessing of PDF headers, footers and citations
- web-based UI
- OCR capability for unparsable PDFs
- extract text from docx files
- save outputs
- combine multiple directories as a common knowledge base
Bug fixes
- Segmentation fault 11 - fault was due to PyTorch __init__ calls