Soviet recipes data wrangling and visualization project
Project Description
-
Data Scraping
-
Target websites
-
Raw data format
At first, I found it hard to implement a nested JSON structure, so I used a flat structure for the raw data instead. The structure is as follows:
{
  "category": "category",
  "subcategory": "subcategory",
  "recipe_name": "recipe_name",
  "ingredients": [
    "ingredient_1",
    "ingredient_2",
    "ingredient_3"
  ]
}
Each ingredient string combines the ingredient name, its quantity, and its unit of measurement.
To scrape raw data, run:
cd ./scrapping
scrapy crawl sov-obshchepit -O ../data/raw_data.json
-
Nested raw data format
Eventually, I figured out a way to implement a nested JSON structure. I store the scraped data in a nested structure (nested_index in sov_obshchepit.py) and write it to a JSON file when the spider is closed. The structure of this file is as follows:
{
  "category_name": {
    "subcategory_name": {
      "recipe_name": {
        "ingredients": [
          "ingredient_1",
          "ingredient_2",
          "ingredient_3"
        ]
      }
    }
  }
}
As in the flat format, each ingredient string combines the ingredient name, its quantity, and its unit of measurement.
To scrape raw data with the nested structure, run:
cd ./scrapping
scrapy crawl sov-obshchepit -a nested_output=../data/raw_nested_data.json
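The relationship between the flat and nested formats can be illustrated with a small sketch. This is not the code from sov_obshchepit.py; the helper name add_to_nested_index and the sample records are hypothetical, and the only assumption is that each flat record carries the four keys shown above. In a Scrapy spider, a dictionary like this could be dumped to the nested_output path from the spider's closed() method.

```python
import json


def add_to_nested_index(index, item):
    """Insert one flat record into a category -> subcategory -> recipe index."""
    subcategories = index.setdefault(item["category"], {})
    recipes = subcategories.setdefault(item["subcategory"], {})
    recipes[item["recipe_name"]] = {"ingredients": list(item["ingredients"])}
    return index


# Hypothetical sample records in the flat format (not real scraped data).
flat_items = [
    {
        "category": "Soups",
        "subcategory": "Hot soups",
        "recipe_name": "Borscht",
        "ingredients": ["beets 200 g", "cabbage 100 g"],
    },
    {
        "category": "Soups",
        "subcategory": "Cold soups",
        "recipe_name": "Okroshka",
        "ingredients": ["kvass 1 l"],
    },
]

nested_index = {}
for item in flat_items:
    add_to_nested_index(nested_index, item)

# ensure_ascii=False keeps Cyrillic text readable in the output.
print(json.dumps(nested_index, ensure_ascii=False, indent=2))
```

Using setdefault keeps the function safe to call in any order: a category or subcategory is created the first time a recipe mentions it.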
-
Sorted and prettified raw data format
You might want to sort the raw data by category, subcategory, and recipe name. To do so, run:
cat ./data/raw_data.json | jq 'sort_by(.category, .subcategory, .recipe_name)' > ./data/raw_data_sorted.json
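If jq is not available, the same sort can be approximated in Python. The sketch below assumes the flat file is a JSON array of records with the keys described earlier; the function names sort_recipes and sort_file and the sample records are hypothetical.

```python
import json


def sort_recipes(records):
    """Sort flat records by category, then subcategory, then recipe name."""
    return sorted(
        records,
        key=lambda r: (r["category"], r["subcategory"], r["recipe_name"]),
    )


def sort_file(src_path, dst_path):
    """Read a flat JSON array, sort it, and write it back pretty-printed."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(sort_recipes(records), dst, ensure_ascii=False, indent=2)


# Hypothetical sample records (not real scraped data).
sample = [
    {"category": "Soups", "subcategory": "Hot soups", "recipe_name": "Shchi", "ingredients": []},
    {"category": "Salads", "subcategory": "Vegetable", "recipe_name": "Vinegret", "ingredients": []},
    {"category": "Soups", "subcategory": "Hot soups", "recipe_name": "Borscht", "ingredients": []},
]
sorted_sample = sort_recipes(sample)
print([r["recipe_name"] for r in sorted_sample])
```

To process the real file, one would call sort_file("./data/raw_data.json", "./data/raw_data_sorted.json") from the project root.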
-
-
Data Wrangling
Part of the data cleaning is done during scraping. For example, all trailing whitespace is removed from the scraped data, as are any empty strings and invisible characters. The rest of the data cleaning is done in the data_wrangling.ipynb notebook.
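A minimal sketch of this kind of scrape-time cleaning is shown below. It is not the spider's actual code: the function names are hypothetical, "invisible characters" is assumed to mean Unicode format characters (category Cf, such as zero-width spaces and byte order marks), and whitespace is stripped on both sides rather than only at the end.

```python
import unicodedata


def clean_text(value):
    """Drop Unicode format characters (category Cf), then strip whitespace."""
    visible = "".join(ch for ch in value if unicodedata.category(ch) != "Cf")
    return visible.strip()


def clean_ingredients(raw_strings):
    """Clean every string and discard those that end up empty."""
    cleaned = (clean_text(s) for s in raw_strings)
    return [s for s in cleaned if s]


# Hypothetical raw strings with typical scraping debris.
raw = ["  salt 5 g \u200b", "", "\ufeff", "flour 1 cup\xa0"]
print(clean_ingredients(raw))
```

Removing the format characters before stripping matters: a zero-width space at the end of a string would otherwise shield the whitespace in front of it from strip().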
Structured cleaned data
I used the Claude AI assistant to convert the raw ingredient strings into structured data. The resulting data is stored in data/structured_data.json. The structure of this file is as follows:
{
  "category_name": {
    "subcategory_name": {
      "recipe_name": {
        "ingredients": [
          "ingredient_1",
          "ingredient_2",
          "ingredient_3"
        ],
        "parsed_ingredients": [
          ["ingredient_name", "quantity", "measure units"],
          ["ingredient_name", "quantity", "measure units"],
          ["ingredient_name", "quantity", "measure units"]
        ]
      }
    }
  }
}
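For analysis it is often convenient to flatten this nested file into one row per ingredient. The sketch below assumes every parsed_ingredients entry is a three-element list in the order shown above; the function name iter_ingredient_rows, the row keys, and the sample data are hypothetical.

```python
def iter_ingredient_rows(structured):
    """Yield one dict per parsed ingredient, with its recipe context attached."""
    for category, subcategories in structured.items():
        for subcategory, recipes in subcategories.items():
            for recipe_name, recipe in recipes.items():
                # Recipes without parsed ingredients simply yield no rows.
                for name, quantity, unit in recipe.get("parsed_ingredients", []):
                    yield {
                        "category": category,
                        "subcategory": subcategory,
                        "recipe_name": recipe_name,
                        "ingredient": name,
                        "quantity": quantity,
                        "unit": unit,
                    }


# Hypothetical sample in the structured format (not real project data).
structured = {
    "Soups": {
        "Hot soups": {
            "Borscht": {
                "ingredients": ["beets 200 g", "cabbage 100 g"],
                "parsed_ingredients": [
                    ["beets", "200", "g"],
                    ["cabbage", "100", "g"],
                ],
            }
        }
    }
}

rows = list(iter_ingredient_rows(structured))
for row in rows:
    print(row)
```

A list of dicts like this can be passed directly to a tabular tool such as a pandas DataFrame constructor if the notebook uses one.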
-
-
Data Visualization
Visualization consists of two major parts: static, using plotly (Python), and dynamic, using d3.js (CoffeeScript) and plotly (Python export to HTML+JS).
The notebook data_visualization.ipynb has all the code for generating the SVG and HTML files used on the website.
Three types of charts are used:
- Bar charts. Each shows the number of recipes per subcategory in the specified category.
- Sunburst chart. Similar to the bar charts, it shows the number of recipes per subcategory, but for every category at once.
- Networks. For the specified category, each shows the connections between recipes and the ingredients they use.
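The data behind these charts can be derived from the structured file with a few lines of Python. The sketch below is not taken from data_visualization.ipynb; the function names and the sample data are hypothetical, and it covers only the data preparation, not the plotting.

```python
def recipes_per_subcategory(structured, category):
    """Count recipes in each subcategory of one category (bar chart data)."""
    return {
        subcategory: len(recipes)
        for subcategory, recipes in structured[category].items()
    }


def recipe_ingredient_edges(structured, category):
    """List (recipe, ingredient) pairs for one category (network edges)."""
    edges = []
    for recipes in structured[category].values():
        for recipe_name, recipe in recipes.items():
            for name, _quantity, _unit in recipe["parsed_ingredients"]:
                edges.append((recipe_name, name))
    return edges


# Hypothetical sample in the structured format (not real project data).
structured = {
    "Soups": {
        "Hot soups": {
            "Borscht": {"parsed_ingredients": [["beets", "200", "g"], ["salt", "5", "g"]]},
            "Shchi": {"parsed_ingredients": [["cabbage", "300", "g"], ["salt", "5", "g"]]},
        },
        "Cold soups": {
            "Okroshka": {"parsed_ingredients": [["kvass", "1", "l"]]},
        },
    }
}

counts = recipes_per_subcategory(structured, "Soups")
edges = recipe_ingredient_edges(structured, "Soups")
print(counts)
print(edges)
```

The counts map onto the bar and sunburst charts (for example via plotly express), while the edge list is the usual input for a network layout: recipes that share an ingredient, such as salt here, end up connected through the same ingredient node.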
Misc
Visual inspiration: everyday Soviet food.
(I intended to use these in the final design but ultimately did not.)