Final Project for Statistical Natural Language Processing Course
- Baseline Document Retrieval Model
  - Extract text from corpus
  - Preprocess the texts from corpus and apply tokenisation
  - Compute idf
  - Compute tf
  - Represent the query as a list of term weights, each the product of the term's tf and idf values
  - Relevance based on cosine similarity
  - Sort similarity scores and output top 50 most relevant documents
  - Function to evaluate retrieval performance using precision at r, with r = 50
  - Test on test_questions.txt
- Advanced Document Retriever with Re-Ranking
  - Use the baseline model and return the top 1000 documents
  - Re-rank the top 1000 documents with a more advanced approach
- Sentence Ranker
  - Split the top 50 documents into sentences (sent_tokenize)
  - Treat the sentences like documents, rank them with the same approach as above, and return the top 50 sentences
  - Evaluate performance using Mean Reciprocal Rank
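
The baseline steps and the two evaluation measures listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names (`tokenize`, `compute_idf`, `rank`, `precision_at_r`, `mean_reciprocal_rank`), the regex tokeniser, the raw-count tf, and the `log(N / df)` form of idf are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed preprocessing: lowercase and keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def compute_idf(tokenized_docs):
    # idf(t) = log(N / df(t)); one common variant, assumed here.
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Weight each term by raw term frequency times idf.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs, idf, top_k=50):
    # Return indices of the top_k documents by cosine similarity.
    q_vec = tfidf_vector(tokenize(query), idf)
    scored = [
        (cosine(q_vec, tfidf_vector(tokenize(doc), idf)), i)
        for i, doc in enumerate(docs)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


def precision_at_r(ranked_ids, relevant_ids, r=50):
    # Fraction of the top r results that are relevant (divides by r).
    hits = sum(1 for i in ranked_ids[:r] if i in relevant_ids)
    return hits / r


def mean_reciprocal_rank(rankings, relevant_sets):
    # Average of 1 / rank of the first relevant item per query
    # (a query with no relevant item contributes 0).
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for position, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / position
                break
    return total / len(rankings)


# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
idf = compute_idf([tokenize(d) for d in docs])
top_docs = rank("cat on the mat", docs, idf, top_k=2)
```

The sentence ranker could reuse `rank` unchanged by passing a list of sentences in place of `docs`, since sentences are treated like documents. The re-ranking approach is left open in the outline, so it is not sketched here.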