College name: Hooghly Engineering and Technology College
Project type: Final Year Project
Mentor: Prof. Sanghamitra Das (Assistant Professor, Computer Science and Engineering Dept.)
Click here to see project documentation
Project Group Members:
- Soumodip Ghosh (CSE Dept., Year of Passing 2022)
- Jayanta Dhali (CSE Dept., Year of Passing 2022)
- Ankita Datta (CSE Dept., Year of Passing 2022)
- Sarthak Srivastava (CSE Dept., Year of Passing 2022)
Project Started: 1 October 2021
Project Ended: 11 June 2022
For any queries contact me at ghoshsoumo14@gmail.com
We would like to thank all the teachers for their support and the knowledge they shared for the successful completion of the project. A special thanks goes to our mentor, Prof. Sanghamitra Das (Assistant Professor, Computer Science and Engineering Dept.), for her continuous guidance, support, and the experience she shared with us. I am also thankful to my team, Soumodip, Jayanta, Ankita, and Sarthak, for the combined effort that brought this project to a successful completion.
April 27, 2022 by Soumo
This blog post is about the details and purpose of developing a Data Mining model. It also includes the detailed process of developing the model.
Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science and statistics with the overall goal of extracting information (with intelligent methods) from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Meaning
The term "data mining" is a misnomer because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself.
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, especially in the field of machine learning, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.
The knowledge discovery in databases (KDD) process is commonly defined with the stages:
- Selection
- Pre-processing
- Transformation
- Data mining
- Interpretation/Evaluation
Many variations on this theme exist, however, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
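As a small illustration of this cleaning step, here is a minimal pandas sketch; the file name and the "age" column are purely hypothetical.

```python
import pandas as pd

# Load the target data set (the file name and the "age" column are only illustrative).
df = pd.read_csv("target_data.csv")

# Remove observations with missing data.
df = df.dropna()

# Remove noisy observations, e.g. exact duplicates and impossible values.
df = df.drop_duplicates()
df = df[(df["age"] >= 0) & (df["age"] <= 120)]
```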
Data mining involves six common classes of tasks:
- Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
- Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
- Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
- Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam" (a small sketch of this task follows the list below).
- Regression – attempts to find a function that models the data with the least error; that is, it estimates the relationships among data or datasets.
- Summarization – providing a more compact representation of the data set, including visualization and report generation.
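As a concrete illustration of the classification task mentioned above, here is a minimal scikit-learn sketch that labels e-mails as spam or legitimate; the training data is invented purely for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: 1 = spam, 0 = legitimate.
emails = [
    "win money now",
    "cheap pills offer",
    "meeting at noon",
    "project status update",
]
labels = [1, 1, 0, 0]

# Turn every e-mail into a word-count vector and fit a Naive Bayes classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
clf = MultinomialNB().fit(X, labels)

# Classify a previously unseen e-mail.
test = vectorizer.transform(["cheap money offer"])
print(clf.predict(test))   # 1 means the message is classified as spam
```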
April 27, 2022 by Jayanta
This blog post is about the details and purpose of developing a Recommendation System. It also includes the detailed process of developing a Recommendation System.
A Recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item.
Recommender systems usually make use of either or both collaborative filtering and content-based filtering (also known as the personality-based approach), as well as other systems such as knowledge-based systems. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete, pre-tagged characteristics of an item in order to recommend additional items with similar properties.
Meaning
The term "recommendation system" has two parts one is "recommendation" which means giving a suitable suggestion and "system" refers to a engine.
One approach to the design of recommender systems that has wide use is collaborative filtering. Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items. By locating peer users/items with a rating history similar to the current user or item, they generate recommendations using this neighborhood. Collaborative filtering methods are classified as memory-based and model-based. A well-known example of memory-based approaches is the user-based algorithm, while that of model-based approaches is the Kernel-Mapping Recommender.
A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Many algorithms have been used in measuring user similarity or item similarity in recommender systems. For example, the k-nearest neighbor (k-NN) approach and the Pearson Correlation as first implemented by Allen.
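The following is a minimal, illustrative sketch of the memory-based, user-based idea described above, using cosine similarity on a toy rating matrix; it is not the project's implementation, and the ratings are invented.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items); 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the peer whose rating history is most similar to user 0 ...
target = 0
peers = [u for u in range(len(ratings)) if u != target]
sims = [cosine_sim(ratings[target], ratings[u]) for u in peers]
best_peer = peers[int(np.argmax(sims))]

# ... and recommend the items that peer rated highly but user 0 has not rated.
unseen = np.where(ratings[target] == 0)[0]
recommended = unseen[np.argsort(-ratings[best_peer, unseen])]
print("recommend items:", recommended)
```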
Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.
- Cold start: For a new user or item, there isn't enough data to make accurate recommendations. Note: one commonly implemented solution to this problem is the multi-armed bandit algorithm.
- Scalability: There are millions of users and products in many of the environments in which these systems make recommendations. Thus, a large amount of computation power is often necessary to calculate recommendations.
- Sparsity: The number of items sold on major e-commerce sites is extremely large. The most active users will only have rated a small subset of the overall database. Thus, even the most popular items have very few ratings.
One of the most famous examples of collaborative filtering is item-to-item collaborative filtering (people who buy x also buy y), an algorithm popularized by Amazon.com's recommender system. Many social networks originally used collaborative filtering to recommend new friends, groups, and other social connections by examining the network of connections between a user and their friends.[1] Collaborative filtering is still used as part of hybrid systems.
Another common approach when designing recommender systems is content-based filtering. Content-based filtering methods are based on a description of the item and a profile of the user's preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user's likes and dislikes based on an item's features.
In this system, keywords are used to describe the items, and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items similar to those that a user liked in the past or is examining in the present. It does not rely on a user sign-in mechanism to generate this often temporary profile. In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research.
To create a user profile, the system mostly focuses on two types of information:
- A model of the user's preference.
- A history of the user's interaction with the recommender system.
Basically, these methods use an item profile (i.e., a set of discrete attributes and features) characterizing the item within the system. To abstract the features of the items in the system, an item presentation algorithm is applied. A widely used algorithm is the tf–idf representation (also called vector space representation). The system creates a content-based profile of users based on a weighted vector of item features. The weights denote the importance of each feature to the user and can be computed from individually rated content vectors using a variety of techniques. Simple approaches use the average values of the rated item vector while other sophisticated methods use machine learning techniques such as Bayesian Classifiers, cluster analysis, decision trees, and artificial neural networks in order to estimate the probability that the user is going to like the item.
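As a hedged illustration of the tf–idf profile idea: in the sketch below the item descriptions are invented, and the user profile is simply the average of the liked items' vectors, one of the simple approaches mentioned above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up item descriptions.
items = [
    "space adventure with alien robots",
    "romantic comedy in paris",
    "alien invasion science fiction",
    "space station science documentary",
]

# tf-idf (vector space) representation of every item.
vectorizer = TfidfVectorizer()
item_vectors = vectorizer.fit_transform(items)

# A simple user profile: the average of the vectors of the items the user liked.
liked = [0, 2]
profile = np.asarray(item_vectors[liked].mean(axis=0))

# Recommend the item closest to the profile that the user has not liked yet.
scores = cosine_similarity(profile, item_vectors).ravel()
scores[liked] = -1.0          # do not re-recommend already liked items
print("best match:", items[int(scores.argmax())])
```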
A key issue with content-based filtering is whether the system can learn user preferences from users' actions regarding one content source and use them across other content types. When the system is limited to recommending content of the same type as the user is already using, the value from the recommendation system is significantly less than when other content types from other services can be recommended. For example, recommending news articles based on news browsing is useful. Still, it would be much more useful when music, videos, products, discussions, etc., from different services, can be recommended based on news browsing. To overcome this, most content-based recommender systems now use some form of the hybrid system.
Content-based recommender systems can also include opinion-based recommender systems. In some cases, users are allowed to leave text reviews or feedback on the items. These user-generated texts are implicit data for the recommender system because they are potentially rich resources of both features/aspects of the item and users' evaluation/sentiment toward the item. Features extracted from user-generated reviews are improved meta-data for items because, like meta-data, they reflect aspects of the item that users care about. Sentiments extracted from the reviews can be seen as users' rating scores on the corresponding features. Popular approaches of opinion-based recommender systems utilize various techniques including text mining, information retrieval, sentiment analysis (see also multimodal sentiment analysis) and deep learning.
These recommender systems use the interactions of a user within a session to generate recommendations. Session-based recommender systems are used at YouTube and Amazon. They are particularly useful when the history of a user (such as past clicks or purchases) is not available or not relevant in the current user session. Domains where session-based recommendations are particularly relevant include video, e-commerce, travel, music, and more. Most instances of session-based recommender systems rely on the sequence of recent interactions within a session without requiring any additional details (historical, demographic) of the user. Techniques for session-based recommendations are mainly based on generative sequential models such as recurrent neural networks, Transformers, and other deep learning based approaches.
The recommendation problem can be seen as a special instance of a reinforcement learning problem whereby the user is the environment upon which the agent, the recommendation system, acts in order to receive a reward, for instance a click or engagement by the user. One aspect of reinforcement learning that is of particular use in the area of recommender systems is the fact that the models or policies can be learned by providing a reward to the recommendation agent. In contrast to traditional learning techniques, which rely on less flexible supervised learning approaches, reinforcement learning recommendation techniques make it possible to train models that can be optimized directly on metrics of engagement and user interest.
The majority of existing approaches to recommender systems focus on recommending the most relevant content to users using contextual information, yet do not take into account the risk of disturbing the user with unwanted notifications. It is important to consider the risk of upsetting the user by pushing recommendations in certain circumstances, for instance, during a professional meeting, early morning, or late at night. Therefore, the performance of the recommender system depends in part on the degree to which it has incorporated the risk into the recommendation process. One option to manage this issue is DRARS, a system which models the context-aware recommendation as a bandit problem. This system combines a content-based technique and a contextual bandit algorithm.
Mobile recommender systems make use of internet-accessing smart phones to offer personalized, context-sensitive recommendations. This is a particularly difficult area of research as mobile data is more complex than data that recommender systems often have to deal with. It is heterogeneous, noisy, requires spatial and temporal auto-correlation, and has validation and generality problems.
There are three factors that could affect the mobile recommender systems and the accuracy of prediction results: the context, the recommendation method and privacy. Additionally, mobile recommender systems suffer from a transplantation problem – recommendations may not apply in all regions (for instance, it would be unwise to recommend a recipe in an area where all of the ingredients may not be available).
One example of a mobile recommender system is the approach taken by companies such as Uber and Lyft to generate driving routes for taxi drivers in a city. This system uses GPS data of the routes that taxi drivers take while working, which includes location (latitude and longitude), time stamps, and operational status (with or without passengers). It uses this data to recommend a list of pickup points along a route, with the goal of optimizing occupancy times and profits.
Most recommender systems now use a hybrid approach, combining collaborative filtering, content-based filtering, and other approaches. There is no reason why several different techniques of the same type could not be hybridized. Hybrid approaches can be implemented in several ways: by making content-based and collaborative-based predictions separately and then combining them; by adding content-based capabilities to a collaborative-based approach (and vice versa); or by unifying the approaches into one model (see [24] for a complete review of recommender systems). Several studies have empirically compared the performance of hybrid methods with pure collaborative and content-based methods and demonstrated that hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommender systems such as cold start and the sparsity problem, as well as the knowledge engineering bottleneck in knowledge-based approaches.
Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).
Some hybridization techniques include:
- Weighted: Combining the score of different recommendation components numerically.
- Switching: Choosing among recommendation components and applying the selected one.
- Mixed: Recommendations from different recommenders are presented together to give the recommendation.
- Feature Combination: Features derived from different knowledge sources are combined together and given to a single recommendation algorithm.
- Feature Augmentation: Computing a feature or set of features, which is then part of the input to the next technique.
- Cascade: Recommenders are given strict priority, with the lower priority ones breaking ties in the scoring of the higher ones.
- Meta-level: One recommendation technique is applied and produces some sort of model, which is then the input used by the next technique.
The process followed in developing a recommendation system involves:
- Data Collection
- Pre-processing
- Data mining
- Vectorization
- Training
- Testing
We have collected data sets from different open-source websites, for example IMDb, Kaggle, Data.gov, etc.
We import these four modules:
- Pandas: Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; it was created by Wes McKinney in 2008.
- NumPy: NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely. NumPy stands for Numerical Python.
- ast: "ast" is a module of the Python standard library which allows the user to interact with Python code itself and modify it. "ast" stands for "Abstract Syntax Tree".
- Natural Language Toolkit (nltk): This library is a suite that contains libraries and programs for statistical language processing. It is one of the most powerful NLP libraries, containing packages to make machines understand human language and reply with an appropriate response. Natural Language Processing (NLP) is the process of manipulating or understanding text or speech by software or a machine. An analogy is that humans interact, understand each other's views, and respond with an appropriate answer; in NLP, this interaction, understanding, and response are produced by a computer instead of a human.
First, we merge the movies having the same "title" from both data sets. We consider a few important parameters on the basis of which we refine the data sets, i.e. "movie_id", "title", "overview", "genres", "keywords", "cast", and "crew".
Then we drop all the entries that have missing values in the data set (missing values appear as "NaN").
We convert all the list- or object-format data in the data sets to string format, according to the data provided in the data sets.
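A sketch of these three steps, assuming TMDB-style CSV files (the file names are placeholders) whose "genres" and "keywords" columns store lists serialised as strings:

```python
import ast
import pandas as pd

# The CSV file names are placeholders for the data sets collected earlier.
movies = pd.read_csv("movies.csv")
credits = pd.read_csv("credits.csv")

# Merge the two data sets on the common "title" column and keep the
# parameters used for refining the data.
df = movies.merge(credits, on="title")
df = df[["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]]

# Drop the entries that contain missing (NaN) values.
df = df.dropna()

# Columns such as "genres" store lists serialised as strings, e.g.
# '[{"id": 28, "name": "Action"}]'. ast.literal_eval turns them back into
# Python objects so the names can be extracted as plain strings.
def to_names(text):
    return [entry["name"] for entry in ast.literal_eval(text)]

for col in ["genres", "keywords"]:
    df[col] = df[col].apply(to_names)
```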
After the data mining is done using the above essential parameters, a new tagline is created by concatenating them. This tagline is put inside a new data frame along with the respective movie id and title. Then we convert the tagline into lower case.
Then we use the Porter stemming algorithm (or "Porter stemmer"), which is a process for removing the commoner morphological and inflexional endings from words in English. It converts all words sharing the same root into a common root form.
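A small sketch of stemming with NLTK's PorterStemmer:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    # Reduce every word to its Porter stem so that inflected forms match.
    return " ".join(ps.stem(word) for word in text.split())

# Words like "dancing", "danced" and "dances" all reduce to the same stem.
print(stem("dancing danced dances loved loving"))
```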
"Vectorization" is a technique of implementing array operations without using for loops. Instead, we use functions defined by various modules which are highly optimized, reducing the running and execution time of the code. Vectorized array operations will be faster than their pure Python equivalents, with the biggest impact in any kind of numerical computation.
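A minimal illustration of the difference, using NumPy:

```python
import numpy as np

values = np.arange(1_000_000)

# Pure-Python loop: one interpreted operation per element.
squares_loop = [v * v for v in values]

# Vectorised NumPy operation: a single optimised call over the whole array.
squares_vec = values * values
```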
"CountVectorizer" is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts and we wish to convert each word in each text into vectors (for use in further text analysis).
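A small, hypothetical example of CountVectorizer in use; the taglines and the max_features/stop_words settings are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up taglines standing in for the real movie tags.
tags = [
    "action adventure space hero",
    "romance comedy hero",
    "space action thriller",
]

# stop_words and max_features shown here are typical, illustrative settings.
cv = CountVectorizer(max_features=5000, stop_words="english")
vectors = cv.fit_transform(tags).toarray()

print(cv.get_feature_names_out())   # the vocabulary
print(vectors)                      # one count vector per tagline
```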
After vectorization the desired data is converted to array form. The similarity between the vectors is then checked: the similarity of every movie with the other movies is computed and stored in an array. We find the cosine distance between the vectors and compare the cosine distances of every item with the other items (the smaller the cosine distance, the greater the similarity). Then we create a function that recommends the 5 most similar movies whenever a movie name is given.
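A self-contained sketch of this step; the tiny data frame and the column names stand in for the real data of roughly 5000 titles.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in for the real data frame of titles and taglines.
new_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "tags":  ["action space hero", "action hero battle",
              "romance comedy paris", "space alien battle"],
})

vectors = CountVectorizer().fit_transform(new_df["tags"]).toarray()
similarity = cosine_similarity(vectors)   # pairwise cosine similarity matrix

def recommend(title, n=5):
    # Index of the requested movie, then the n most similar other titles
    # (position 0 is the movie itself, so it is skipped).
    idx = new_df[new_df["title"] == title].index[0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [new_df.iloc[i]["title"] for i, _ in scores[1:n + 1]]

print(recommend("Movie A"))
```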
Then a final list is prepared which consists of all the recommendations of movies, serials, and web series for any input provided by the user, and this sheet is uploaded to the database of the website.
In this model we have used data for approximately 5000 movies, serials, and web series to train the model.
Our model will be recommending the best 10 movie names for each of the cases.
In our project we are using Django as our web development framework. Firstly, we created a home page for our web application. Django has built-in features for handling the front-end part of web development; HTML, CSS, JS, and Bootstrap have been used in our project for creating and designing the website.
The data list produced by the data mining model is stored in our Django database: a model is created inside the Django models.py file and a table called Movies_List is created inside our Django database. Then, using the import function, we directly import the entire list of data, which is in .xls form, into our database.
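As a rough illustration of that step, here is a minimal sketch of what such a Django model could look like; the field names are assumptions, not the project's exact schema.

```python
# movies/models.py -- sketch of the model backing the Movies_List table;
# the field names below are assumptions based on the data described above.
from django.db import models

class Movies_List(models.Model):
    movie_id = models.IntegerField()
    title = models.CharField(max_length=200)
    recommendations = models.TextField()  # e.g. the recommended titles for this movie

    def __str__(self):
        return self.title
```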
It is done using 4 important components:
- HTML: HTML (HyperText Markup Language) is the code that is used to structure a web page and its content. For example, content could be structured within a set of paragraphs, a list of bulleted points, or using images and data tables.
- CSS: CSS is the language we use to style an HTML document. CSS describes how HTML elements should be displayed.
- JS: JavaScript is a dynamic programming language that is used for web development, in web applications, for game development, and lots more. It allows you to implement dynamic features on web pages that cannot be done with only HTML and CSS. Many browsers use JavaScript as a scripting language for doing dynamic things on the web. Any time you see a click-to-show dropdown menu, extra content added to a page, or dynamically changing element colors on a page, to name a few features, you are seeing the effects of JavaScript.
- Bootstrap: Bootstrap is a free front-end framework for faster and easier web development. It includes HTML- and CSS-based design templates for typography, forms, buttons, tables, navigation, modals, image carousels, and many other components, as well as optional JavaScript plugins. Bootstrap also gives you the ability to easily create responsive designs.
How does the search function actually work and display the final list of movies? The views.py file contains a show function in which the main search functionality is defined. A POST request is received when a user searches for a movie, and an empty list "m" is created. We then check whether the searched item is empty; if it is, the view returns "No Search Item found". Otherwise, it filters the movies list already stored inside our Django database according to the search term entered by the user. Another check is then done to see whether the resulting movies object is empty; if it is not, the first object is stored in the list m created earlier, and all the suggestions matching the searched item are displayed to the user with a creative animated text format.
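Below is a minimal sketch of a view following that description; the model field, POST key, and template name are assumptions rather than the project's exact code.

```python
# movies/views.py -- sketch of the search view described above; the field
# name "title", the POST key "search", and "show.html" are assumptions.
from django.shortcuts import render

from .models import Movies_List

def show(request):
    if request.method == "POST":
        query = request.POST.get("search", "").strip()
        m = []
        if not query:
            return render(request, "show.html", {"error": "No Search Item found"})
        # Filter the stored movies list by the searched term.
        movies = Movies_List.objects.filter(title__icontains=query)
        if movies.exists():
            m.append(movies.first())
        return render(request, "show.html", {"movies": movies, "selected": m})
    return render(request, "show.html")
```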