satarupaguha11

Satarupa Guha's repositories

Extract_EmailIDs_Unstructured_webpages

Goal: To understand basic crawling and to use simple heuristics for handling unclean real-world web data in order to extract email IDs. Input: 2000 business webpages crawled from Yelp. Each webpage is an HTML document containing details about a business. It does not contain the email ID, but it does contain the business's website address, which can be used to find the website's contact-us page and extract the email ID from there. Task: Obtain structured data for each business: business name, phone number, home page URL, contact-us URL, and email ID.

Language: Python
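
The contact-page heuristic described above could be sketched roughly as follows. This is an illustration in Python 3 using only the standard library, not the repository's code: the function names `find_contact_urls` and `extract_emails`, both regular expressions, and the handling of "[at]" / "[dot]" obfuscation are assumptions, and fetching the pages is left out.

```python
import re
from urllib.parse import urljoin

# Deliberately loose email pattern (an assumption); real pages need extra filtering.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Captures the href and the link text of each anchor tag.
LINK_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', re.I | re.S)


def find_contact_urls(home_url, html):
    """Return absolute URLs of links whose href or text mentions 'contact'."""
    urls = []
    for href, text in LINK_RE.findall(html):
        if "contact" in href.lower() or "contact" in text.lower():
            url = urljoin(home_url, href)
            if url not in urls:
                urls.append(url)
    return urls


def extract_emails(html):
    """Return lower-cased email addresses in order of first appearance."""
    # Undo simple obfuscations such as "info [at] example [dot] com".
    cleaned = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", html, flags=re.I)
    cleaned = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", cleaned, flags=re.I)
    emails = []
    for match in EMAIL_RE.findall(cleaned):
        match = match.lower()
        if match not in emails:
            emails.append(match)
    return emails
```

A crawler would call `find_contact_urls` on the business home page, download each returned URL, and run `extract_emails` on the result. Pattern-based extraction of this kind yields false positives (for example, image file names containing "@"), so the output would still need filtering.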

Handwritten-Digit-Recognition

Handwriting recognition is the task of automatically interpreting intelligible handwritten input. It is of great interest to the pattern recognition research community because it enables more convenient input devices and more efficient data organization and processing. The task here is to code a complete digit recognizer and test it on the MNIST digit dataset, which is widely used as a benchmark for classification algorithms and handwritten digit recognition systems. The dataset consists of 70,000 grayscale images, each 28×28 pixels (784 values). The recognizer reads the image data, extracts features from it, and uses a k-nearest neighbor classifier to recognize any test image. To carry out the experiments, the dataset is randomly divided into two partitions, training and testing. The training set is used to build the classifier and the test set is used to measure its accuracy.
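
As an illustration of the pipeline described above (random split, k-nearest neighbor classification, accuracy on the test set), here is a minimal pure-Python sketch. It is not the repository's code: the function names, the use of raw feature vectors with squared Euclidean distance, and the default `k=3` are assumptions, and a brute-force search like this would be slow on all 70,000 images.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: squared_distance(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def split_train_test(samples, labels, test_fraction=0.2, seed=0):
    """Randomly partition the data into training and testing sets."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx])


def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test samples whose predicted label matches the true one."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)
```

For MNIST, each sample would be a list of 784 pixel values (or features derived from them) and each label a digit from 0 to 9.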


Phrase-Translation

Words may not always be the best atomic unit of a sentence. One word in the source language often corresponds to multiple words in the target language, and a word-based model breaks down in these cases. This is the motivation for building a phrase-based translation model.


Top-K-Influentials-in-Temporal-Graph

Given a social network graph, the objective is to find the top-k influential nodes: if these k nodes are seeded with a piece of information, it should spread to the maximum number of nodes within a given number of timestamps. We also wish to optimise k so that there is a reasonable trade-off between cost and time.
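
One simple way to approach seed selection is a greedy search, sketched below. This is only an illustration under strong assumptions, not the repository's method: the graph is a static adjacency dictionary, information moves deterministically one hop per timestamp, and the names `reached_within` and `greedy_top_k` are made up for this example. A temporal or probabilistic diffusion model would need a different spread function.

```python
from collections import deque


def reached_within(graph, seeds, steps):
    """Nodes reachable from any seed in at most `steps` hops (multi-source BFS)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == steps:
            continue  # time budget used up along this path
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return set(dist)


def greedy_top_k(graph, k, steps):
    """Pick up to k seeds, each time adding the node that enlarges the spread most."""
    seeds, covered = [], set()
    for _ in range(k):
        best, best_cover = None, covered
        for node in sorted(graph):
            if node in seeds:
                continue
            cover = reached_within(graph, seeds + [node], steps)
            if len(cover) > len(best_cover):
                best, best_cover = node, cover
        if best is None:
            break  # no remaining node increases the spread
        seeds.append(best)
        covered = best_cover
    return seeds, covered
```

Every node must appear as a key in `graph`, even if it has no outgoing edges. Running the function for increasing values of k and comparing the size of the covered set gives one rough way to examine the cost and time trade-off mentioned above.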


try_sentiment

This is an attempt to implement NRC-Canada's sentiment module for SemEval 2014.

Language: Python

Analysis-Wikipedia-Entities

Goal: To understand the Wikipedia dataset, especially the entity infoboxes.
Task: Starting from the Wikipedia dump, extract information about the various entity types. The steps are as follows:
1. Given the Wikipedia dump, gather all the Wikipedia pages that have infoboxes.
2. Find the set of all possible entity types on Wikipedia.
3. Find the set of all possible attributes that can be associated with any entity type on Wikipedia.
4. From a few values of these attributes, infer the data type of each attribute as one of the following: string, set of strings, duration, number, set of durations, date, other.
5. Find the various units that can be used to express the value of a numeric attribute. E.g., for the “height” attribute of “person” entities, the units could be “cms, inches”.
6. For numeric attributes, find typical ranges (using the most popular unit). E.g., for person entities, the age attribute should have the range 0-150 years.
7. For attributes that are semantically similar but have different names across different entities of the same type, merge them. E.g., automatically identify that the attribute “birthdate” is the same as “bdate”.

Language: Python
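
The type-inference step (step 4 above) could be sketched along these lines. This is a simplified illustration rather than the repository's code: it covers only the number, date, string and other types, and the regular expression, the list of date formats, and the majority-vote rule are all assumptions.

```python
import re
from collections import Counter
from datetime import datetime

# Integers or decimals, optionally signed, with optional thousands separators.
NUMBER_RE = re.compile(r"^[+-]?\d+(?:,\d{3})*(?:\.\d+)?$")
# A small, assumed set of date layouts; real infoboxes use many more.
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y")


def is_date(value):
    """True if `value` parses under any of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def value_type(value):
    """Classify a single attribute value as number, date, or string."""
    value = value.strip()
    if NUMBER_RE.match(value):
        return "number"
    if is_date(value):
        return "date"
    return "string"


def infer_attribute_type(values):
    """Type shared by a strict majority of the sample values, else 'other'."""
    counts = Counter(value_type(v) for v in values)
    label, count = counts.most_common(1)[0]
    return label if count > len(values) / 2 else "other"
```

For example, `infer_attribute_type(["1.75", "180", "1,200"])` classifies the attribute as a number. Set-valued types and durations would need additional rules, such as splitting on separators before classifying each element.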

aspect_category

Language: Python

azure-docs

Open source documentation of Microsoft Azure

License: CC-BY-4.0

guidance

A guidance language for controlling large language models.

Language: Jupyter Notebook; License: MIT

ImplementingEigenFaces

The goal of this mini project is to become familiar with image representation, PCA and LDA, and face recognition. It also aims to build an understanding of the practical difficulties in developing real-world systems that work with acceptable accuracy.

Language: Matlab

ImplementingPerceptronAlgorithms

Language: Matlab

reRankURL

This project is based on the Personalized Web Search Challenge hosted on Kaggle. The aim of this challenge is to re-rank the URLs of each search engine results page (SERP) returned by the search engine according to the personal preferences of the users.

Language: Java

satarupaguha11.github.io

Personal Webpage

Language: CSS; License: NOASSERTION

SearchEngineForWikipedia

Given a query, search the Wikipedia corpus (46 GB) and return the titles of the top ten retrieved documents in ranked order. Queries can be either phrase queries or field-based queries. Multi-level indexes were built to improve retrieval speed. Evaluation is based primarily on the quality of results and the retrieval time (less than 1 second). Keeping the index small was also a challenge; compression techniques were used for that purpose.

Language: Python
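
The core of a search engine like the one described above is an inverted index with ranked retrieval. The sketch below is a minimal in-memory illustration, not the repository's implementation: it assumes TF-IDF scoring, treats the query as a bag of words, and leaves out phrase and field queries, multi-level indexes, and compression. All function names are made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {title: term frequency} for documents containing it."""
    index = defaultdict(dict)
    for title, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][title] = tf
    return index


def search(index, n_docs, query, top_n=10):
    """Return up to `top_n` titles ranked by summed TF-IDF over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for title, tf in postings.items():
            scores[title] += (1 + math.log(tf)) * idf
    # Highest score first; ties broken alphabetically by title.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]
```

With a corpus of this size the index would not fit comfortably in memory, which is where on-disk multi-level indexes and posting-list compression come in.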