document-search glob pdf-converter pdf-search pdfminer python python3 querying tf-idf

PDF-Querying-using-TF-IDF-from-Scratch

Given a set of PDFs and a query, the most relevant PDF can be found with the help of TF-IDF. The TF-IDF computation is implemented from scratch, without any TF-IDF library.

Explanation

The code uses only two libraries: pdfminer to read the PDFs and glob to traverse a directory for PDF files. The TF-IDF computation is done manually, without any library. To understand the code, please read the comments in it.
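The reading step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the pdfminer.six distribution (which provides pdfminer.high_level.extract_text), and the helper names find_pdfs, tokenize, read_pdf_words and load_corpus are hypothetical.

```python
import glob
import os
import re


def find_pdfs(folder):
    """Return the sorted paths of all .pdf files directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.pdf")))


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def read_pdf_words(path):
    """Extract the text of one PDF and return its list of words."""
    # Assumption: pdfminer.six is installed. The import lives here so the
    # helpers above can be used even when pdfminer is not available.
    from pdfminer.high_level import extract_text

    return tokenize(extract_text(path))


def load_corpus(folder):
    """Map each PDF path found in folder to the list of words it contains."""
    return {path: read_pdf_words(path) for path in find_pdfs(folder)}
```

Calling load_corpus on the sample folder would give one word list per PDF, which is the input the TF-IDF step works on.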

PDF Files

A sample folder with a few PDFs is included so you can try out the code.

PDF_querying.py

Reads the PDF files using the pdfminer library
Extracts the words from each PDF
Takes the query as input from the user
Computes TF-IDF for the PDFs and the query
Ranks the PDFs that share words with the query
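The TF-IDF and ranking steps listed above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the script's actual implementation: it uses tf = count / document length, idf = log(N / df), and scores a document by summing tf * idf over the query words it contains. The function names and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(words):
    """tf(t, d): occurrences of t in d divided by the number of words in d."""
    total = len(words)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(words).items()}


def inverse_document_frequencies(docs):
    """idf(t) = log(N / df(t)), where df(t) counts the documents containing t."""
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}


def rank(docs, query):
    """Return (name, score) pairs, best first, for documents sharing a query word."""
    idf = inverse_document_frequencies(docs)
    query_terms = set(tokenize(query))
    scores = {}
    for name, words in docs.items():
        tf = term_frequencies(words)
        shared = [term for term in query_terms if term in tf]
        if shared:  # skip documents that have no word in common with the query
            scores[name] = sum(tf[term] * idf[term] for term in shared)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents standing in for the words extracted from three PDFs.
docs = {
    "a.pdf": tokenize("the cat sat on the mat"),
    "b.pdf": tokenize("the dog chased the cat"),
    "c.pdf": tokenize("stock markets fell sharply today"),
}

for name, score in rank(docs, "cat mat"):
    print(name, round(score, 4))
```

With these sample documents, the query "cat mat" ranks a.pdf above b.pdf and leaves out c.pdf, because "mat" is rarer across the collection than "cat" and c.pdf shares no word with the query. Note that with this idf definition, a word that appears in every document gets a weight of zero.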

text querying.py

The text of the documents is supplied directly as strings instead of being read from PDF files.
The rest of the process is the same as in PDF_querying.py.

Languages

Language:Python 100.0%