umcody / MakeUpMatch

Web app to find similar beauty products while excluding a given allergy or toxin (python, MySQL, html, d3.js)


MakeUpMatch

See the final web application at http://www.makeupmatch.me

Hello! My name is Chanin Woods and this is Makeup Match. I created this web application as part of the Insight Data Science Fellowship. I enjoy following beauty blogs and keeping up with the latest trends, but I also have allergies. I created Makeup Match to help me identify great products without my worst allergens. I used natural language processing and collaborative filtering techniques to identify similar products based on reviews. Then I filtered out the unwanted ingredients using SQL queries.

The Data

The Sephora website contains a trove of beauty products and brands for sale. From this site, I scraped product information (brand, name, and URL), review information (reviews, reviewers), and ingredient information. I also downloaded a list of ingredient names and alternative names from the Cosmetic Ingredients and Substances list generated by the European Commission. The additional ingredient names expand the vocabulary that users can type to refer to a particular ingredient. I stored all of this data in a MySQL database. I limited the products to those in the five largest categories (foundation, lipstick, blush, mascara, and conditioner) that had both reviews and ingredients.
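
As a rough illustration of how the alternative names could widen what a user may type, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for the MySQL database, and the table names, column names, and sample rows are all hypothetical rather than taken from the actual schema. A substring match with LIKE covers both exact and inexact matches.

```python
import sqlite3

# sqlite3 stands in for MySQL here; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, brand TEXT, name TEXT);
CREATE TABLE product_ingredients (product_id INTEGER, ingredient TEXT);
CREATE TABLE ingredient_aliases (canonical TEXT, alias TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "BrandA", "Silk Foundation"),
    (2, "BrandB", "Matte Foundation"),
    (3, "BrandC", "Sheer Foundation"),
])
conn.executemany("INSERT INTO product_ingredients VALUES (?, ?)", [
    (1, "water"), (1, "methylparaben"),
    (2, "water"), (2, "dimethicone"),
    (3, "water"), (3, "tocopherol"),
])
conn.executemany("INSERT INTO ingredient_aliases VALUES (?, ?)", [
    ("tocopherol", "vitamin e"),
])


def products_without(conn, user_term):
    """Return ids of products with no ingredient matching the user's term."""
    term = user_term.strip().lower()
    # Expand the term with any canonical names it is an alias for.
    rows = conn.execute(
        "SELECT canonical FROM ingredient_aliases WHERE alias = ?", (term,)
    ).fetchall()
    terms = {term} | {row[0] for row in rows}

    excluded = set()
    for t in terms:
        # A substring match also catches exact matches.
        hits = conn.execute(
            "SELECT product_id FROM product_ingredients WHERE ingredient LIKE ?",
            (f"%{t}%",),
        )
        excluded.update(row[0] for row in hits)

    all_ids = [row[0] for row in conn.execute("SELECT id FROM products ORDER BY id")]
    return [pid for pid in all_ids if pid not in excluded]


print(products_without(conn, "paraben"))    # [2, 3]
print(products_without(conn, "Vitamin E"))  # [1, 2]
```

In this sketch, "paraben" excludes the product containing methylparaben through the inexact match, while "Vitamin E" is resolved to tocopherol through the alias table.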

The Algorithm

I processed the reviews using natural language processing. I grouped the reviews by product, then tokenized and stemmed them. Using gensim, I applied a tf-idf transformation to give heavier weight to rare words, then used latent semantic indexing to reduce the feature set from 51,000 features to 100 features. I then calculated the cosine similarity (the cosine of the angle between topic vectors) for each pair of products. This model successfully discriminated between the categories of makeup, but failed to pick up on nuanced differences between products.

To improve this aspect of the algorithm, I used item-based collaborative filtering. I created a user-item matrix for reviewers who had reviewed multiple items in the database. I performed sentiment analysis on the reviews to determine whether a reviewer liked or disliked each item, then calculated the cosine similarity between products in the matrix. This suggests items that were "Also Liked". I found that this improved the discrimination between products within the same category.

I merged the two similarity scores in a linear combination. Finally, I filtered out the unwanted ingredients by finding exact and inexact matches using SQL queries.
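
The collaborative filtering step and the final linear combination can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the reviewer names, product names, +1/-1 sentiment encoding, and the 0.5 weight are all hypothetical, and the real project may have encoded sentiment and chosen the weights differently.

```python
import math

# Hypothetical sentiment labels: +1 = reviewer liked the product, -1 = disliked.
# A missing entry means the reviewer did not review that product.
ratings = {
    "reviewer_a": {"lipstick_1": 1, "lipstick_2": 1, "blush_1": -1},
    "reviewer_b": {"lipstick_1": 1, "lipstick_2": 1},
    "reviewer_c": {"lipstick_1": -1, "blush_1": 1},
}


def item_vectors(ratings):
    """Turn the user-item mapping into one {reviewer: score} vector per product."""
    items = {}
    for reviewer, scores in ratings.items():
        for item, score in scores.items():
            items.setdefault(item, {})[reviewer] = score
    return items


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(text_sim, cf_sim, weight=0.5):
    """Linear combination of the two scores; the weight is an assumed value."""
    return weight * text_sim + (1 - weight) * cf_sim


items = item_vectors(ratings)
cf_same = cosine(items["lipstick_1"], items["lipstick_2"])  # liked together
cf_diff = cosine(items["lipstick_1"], items["blush_1"])     # opposite opinions
print(round(cf_same, 3), round(cf_diff, 3))

# Blend with a made-up text similarity score for the same pair of products.
print(combined_similarity(text_sim=0.9, cf_sim=cf_same))
```

Products that the same reviewers liked end up with a positive score, and products that drew opposite opinions end up with a negative one, which is what lets this signal separate items within a single category.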

The Tools

The web application backend was developed in Python using the following packages: BeautifulSoup, NLTK, pandas, gensim, mysql.connector, and scikit-learn. Data was stored in a MySQL database. The web app is served with Flask, and the frontend uses HTML, D3, and JavaScript. The application is deployed on Amazon Web Services.

About

License: Apache License 2.0


Languages

Jupyter Notebook 60.3%, CSS 19.4%, HTML 13.1%, Python 4.0%, JavaScript 3.1%