Soviet recipes data wrangling and visualization project

Project Description

Data Scraping
1. Target websites
  - Main source for Soviet recipes
  - Not used yet:
2. Raw data format
  
  At first, I found it hard to implement a nested JSON structure, so I used a flat structure for the raw data instead. Each scraped record has the following structure:
```
{
    "category": "category",
    "subcategory": "subcategory",
    "recipe_name": "recipe_name",
    "ingredients": [
        "ingredient_1", "ingredient_2", "ingredient_3"
    ]
}
```
  Each ingredient string combines the ingredient name, its quantity, and its unit of measurement.
  
  To scrape raw data, run:
```
cd ./scrapping
scrapy crawl sov-obshchepit -O ../data/raw_data.json
```
3. Nested raw data format
  
  Eventually, I figured out a way to implement a nested JSON structure. The spider collects the scraped data in a nested dictionary (nested_index in sov_obshchepit.py) and writes it to a JSON file when the spider is closed. The structure of this file is as follows:
```
{
    "category_name": {
        "subcategory_name": {
            "recipe_name": {
                "ingredients": [
                    "ingredient_1", "ingredient_2", "ingredient_3"
                ]
            }
        }
    }
}
```
  As in the flat format, each ingredient string combines the ingredient name, its quantity, and its unit of measurement.
  
  To scrape raw data with nested structure, run:
```
cd ./scrapping
scrapy crawl sov-obshchepit -a nested_output=../data/raw_nested_data.json
```
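
To illustrate how the flat records relate to the nested layout, here is a minimal sketch that groups flat records by category, subcategory, and recipe name. The function `nest_records` and the sample records are hypothetical; the real spider builds nested_index inside sov_obshchepit.py and may do so differently.

```python
import json


def nest_records(records):
    """Group flat recipe records into category -> subcategory -> recipe."""
    nested = {}
    for rec in records:
        # setdefault creates the intermediate dictionaries on first use.
        subcategories = nested.setdefault(rec["category"], {})
        recipes = subcategories.setdefault(rec["subcategory"], {})
        recipes[rec["recipe_name"]] = {"ingredients": list(rec["ingredients"])}
    return nested


# Hypothetical sample records in the flat format.
sample = [
    {"category": "Soups", "subcategory": "Hot", "recipe_name": "Borscht",
     "ingredients": ["beetroot 200 g", "cabbage 100 g"]},
    {"category": "Soups", "subcategory": "Cold", "recipe_name": "Okroshka",
     "ingredients": ["kvass 300 ml"]},
]

nested = nest_records(sample)
print(json.dumps(nested, ensure_ascii=False, indent=4))
```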
4. Sorted and prettified raw data format
  
  You might want to sort the raw data by category, subcategory, and recipe name. To do so, run:
```
cat ./data/raw_data.json | jq 'sort_by(.category, .subcategory, .recipe_name)' > ./data/raw_data_sorted.json
```
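
If jq is not available, roughly the same sorting can be done in Python. This is a sketch, not part of the repository: the function names `sort_records` and `sort_file` are hypothetical, and it assumes the raw file is a JSON array of flat records.

```python
import json


def sort_records(records):
    """Sort flat records by category, then subcategory, then recipe name."""
    return sorted(
        records,
        key=lambda r: (r["category"], r["subcategory"], r["recipe_name"]),
    )


def sort_file(src, dst):
    """Read a JSON array of records from src and write it sorted to dst."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Cyrillic text readable in the output.
        json.dump(sort_records(records), f, ensure_ascii=False, indent=2)


# Usage, with the same paths as the jq command:
# sort_file("./data/raw_data.json", "./data/raw_data_sorted.json")
```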

Data Wrangling

Part of the data cleaning happens during scraping: trailing whitespace, empty strings, and invisible characters are removed from the scraped data. The rest of the cleaning is done in the data_wrangling.ipynb notebook.
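
The scrape-time cleaning described above could look roughly like the sketch below. The helper names `clean_text` and `clean_list` are hypothetical, and the sketch assumes that "invisible characters" means Unicode format characters such as zero-width spaces; the actual spider may handle this differently.

```python
import unicodedata


def clean_text(value):
    """Drop invisible format characters and strip surrounding whitespace."""
    # Unicode category "Cf" covers zero-width spaces, joiners, BOMs, soft hyphens.
    visible = "".join(ch for ch in value if unicodedata.category(ch) != "Cf")
    # Turn non-breaking spaces into plain spaces, then trim both ends
    # (assumption: leading whitespace is unwanted too, not only trailing).
    return visible.replace("\u00a0", " ").strip()


def clean_list(values):
    """Clean each string and discard those that end up empty."""
    cleaned = (clean_text(v) for v in values)
    return [v for v in cleaned if v]


# Hypothetical scraped strings with stray whitespace and invisible characters.
result = clean_list(["  beetroot 200 g\u200b ", "\ufeff", "", "salt\u00a0"])
print(result)
```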

Structured cleaned data

I used the Claude AI assistant to convert the raw ingredient strings into structured data. The resulting data is stored in data/structured_data.json. The structure of this file is as follows:

```
{
    "category_name": {
        "subcategory_name": {
            "recipe_name": {
                "ingredients": [
                    "ingredient_1", "ingredient_2", "ingredient_3"
                ],
                "parsed_ingredients": [
                    ["ingredient_name", "quantity", "measure units"],
                    ["ingredient_name", "quantity", "measure units"],
                    ["ingredient_name", "quantity", "measure units"]
                ]
            }
        }
    }
}
```
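
A nested file like this is easy to flatten into rows for analysis. The sketch below walks the structure and yields one row per parsed ingredient. The function name `iter_parsed_ingredients`, the row keys, and the sample data are hypothetical and only assume the layout shown above.

```python
def iter_parsed_ingredients(structured):
    """Yield one flat row per parsed ingredient in the nested structure."""
    for category, subcategories in structured.items():
        for subcategory, recipes in subcategories.items():
            for recipe_name, recipe in recipes.items():
                # Each parsed entry is [ingredient_name, quantity, measure units].
                for name, quantity, unit in recipe.get("parsed_ingredients", []):
                    yield {
                        "category": category,
                        "subcategory": subcategory,
                        "recipe_name": recipe_name,
                        "ingredient": name,
                        "quantity": quantity,
                        "unit": unit,
                    }


# Hypothetical sample following the structure of data/structured_data.json.
sample = {
    "Soups": {
        "Hot": {
            "Borscht": {
                "ingredients": ["beetroot 200 g", "cabbage 100 g"],
                "parsed_ingredients": [
                    ["beetroot", "200", "g"],
                    ["cabbage", "100", "g"],
                ],
            }
        }
    }
}

rows = list(iter_parsed_ingredients(sample))
for row in rows:
    print(row)
```

To run it on the real file, load it with `json.load` and pass the resulting dictionary to the function.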

Data Visualization

Visualization consists of two major parts: static charts made with Plotly (Python), and dynamic charts made with D3.js (CoffeeScript) and Plotly (Python, exported to HTML and JS).

The data_visualization.ipynb notebook contains all the code for generating the SVG and HTML files used on the website.

Three types of charts are used:
1. Bar charts. Each shows the number of recipes per subcategory in the specified category.
2. Sunburst chart. Similar to the bar charts, it shows the number of recipes per subcategory, but for every category at once.
3. Networks. For the specified category, each shows the connections between recipes and the ingredients they use.
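
The data behind these charts can be derived directly from the nested structure. The sketch below computes the per-subcategory recipe counts (bar and sunburst charts) and the recipe-to-ingredient edges (networks). It is an illustration only: the function names and sample data are hypothetical, and the notebook may prepare its inputs differently.

```python
def recipes_per_subcategory(structured, category):
    """Count recipes in each subcategory of one category."""
    return {sub: len(recipes) for sub, recipes in structured[category].items()}


def recipe_ingredient_edges(structured, category):
    """Return sorted, de-duplicated (recipe, ingredient) pairs for one category."""
    edges = set()
    for recipes in structured[category].values():
        for recipe_name, recipe in recipes.items():
            for name, _quantity, _unit in recipe.get("parsed_ingredients", []):
                edges.add((recipe_name, name))
    return sorted(edges)


# Hypothetical sample following the structured data layout.
sample = {
    "Soups": {
        "Hot": {
            "Borscht": {"parsed_ingredients": [["beetroot", "200", "g"],
                                               ["salt", "5", "g"]]},
            "Shchi": {"parsed_ingredients": [["cabbage", "150", "g"],
                                             ["salt", "5", "g"]]},
        },
        "Cold": {
            "Okroshka": {"parsed_ingredients": [["kvass", "300", "ml"]]},
        },
    }
}

counts = recipes_per_subcategory(sample, "Soups")
edges = recipe_ingredient_edges(sample, "Soups")
print(counts)
print(edges)
```

Ingredients shared by several recipes, such as "salt" in the sample, appear in multiple edges; shared nodes like these are what link recipes together in a network chart.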

Misc

Visual inspiration: everyday Soviet food.

(I intended to use these in the final design but ultimately did not.)

danielpancake / soviet-recipes-data-wrangling-and-visualization
