danielpancake / soviet-recipes-data-wrangling-and-visualization

Soviet recipes visualization project for Data Wrangling and Visualization course at Innopolis University

Home Page: https://danielpancake.github.io/soviet-recipes-data-wrangling-and-visualization/visualization


Soviet recipes data wrangling and visualization project

Project Description

  1. Data Scraping

    1. Target websites

    2. Raw data format

      At first, I found it hard to implement a nested JSON structure, so I used a flat structure for the raw data instead. The structure is as follows:

      {
          "category": "category",
          "subcategory": "subcategory",
          "recipe_name": "recipe_name",
          "ingredients": [
              "ingredient_1", "ingredient_2", "ingredient_3"
          ]
      }

      Each ingredient string combines the ingredient name, its quantity, and its unit of measurement.

      To scrape raw data, run:

      cd ./scrapping
      scrapy crawl sov-obshchepit -O ../data/raw_data.json
    3. Nested raw data format

      Eventually, I figured out a way to implement a nested JSON structure. The spider accumulates the scraped data in a nested dictionary (nested_index in sov_obshchepit.py) and writes it to a JSON file when the spider is closed. The structure of this file is as follows:

      {
          "category_name": {
              "subcategory_name": {
                  "recipe_name": {
                      "ingredients": [
                          "ingredient_1", "ingredient_2", "ingredient_3"
                      ]
                  }
              }
          }
      }

      As in the flat format, each ingredient string combines the ingredient name, its quantity, and its unit of measurement.

      To scrape raw data with nested structure, run:

      cd ./scrapping
      scrapy crawl sov-obshchepit -a nested_output=../data/raw_nested_data.json
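
The relationship between the flat and nested layouts described above can be sketched in plain Python. This is only an illustration, not the code from sov_obshchepit.py: the function name `nest_records` and the sample records are hypothetical.

```python
import json


def nest_records(records):
    """Fold flat recipe records into category -> subcategory -> recipe."""
    nested = {}
    for record in records:
        # setdefault creates the intermediate dictionaries on first use.
        subcategories = nested.setdefault(record["category"], {})
        recipes = subcategories.setdefault(record["subcategory"], {})
        recipes[record["recipe_name"]] = {
            "ingredients": list(record["ingredients"]),
        }
    return nested


# Hypothetical sample records in the flat raw data format.
flat = [
    {"category": "soups", "subcategory": "hot", "recipe_name": "borscht",
     "ingredients": ["beet 100 g", "water 1 l"]},
    {"category": "soups", "subcategory": "cold", "recipe_name": "okroshka",
     "ingredients": ["kvass 300 ml"]},
    {"category": "salads", "subcategory": "vegetable", "recipe_name": "vinaigrette",
     "ingredients": ["beet 50 g", "potato 50 g"]},
]

nested = nest_records(flat)
# ensure_ascii=False keeps Cyrillic text readable in the output.
print(json.dumps(nested, ensure_ascii=False, indent=4))
```

If two records share a category, subcategory, and recipe name, the later one overwrites the earlier one in this sketch.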
    4. Sorted and prettified raw data format

      You might want to sort the raw data by category, subcategory, and recipe name. To do so, run:

      cat ./data/raw_data.json | jq 'sort_by(.category, .subcategory, .recipe_name)' > ./data/raw_data_sorted.json
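
If jq is not available, the same ordering can be approximated in Python. The sketch below works on an in-memory list; the function name `sort_records` and the sample records are hypothetical, and reading and writing the JSON files is left out.

```python
def sort_records(records):
    """Sort flat recipe records by category, subcategory, and recipe name."""
    return sorted(
        records,
        key=lambda r: (r["category"], r["subcategory"], r["recipe_name"]),
    )


# Hypothetical sample records in the flat raw data format.
records = [
    {"category": "soups", "subcategory": "hot", "recipe_name": "shchi", "ingredients": []},
    {"category": "salads", "subcategory": "vegetable", "recipe_name": "vinaigrette", "ingredients": []},
    {"category": "soups", "subcategory": "cold", "recipe_name": "okroshka", "ingredients": []},
    {"category": "soups", "subcategory": "hot", "recipe_name": "borscht", "ingredients": []},
]

sorted_records = sort_records(records)
for record in sorted_records:
    print(record["category"], record["subcategory"], record["recipe_name"])
```

Python compares strings by code point, so non-ASCII names such as Cyrillic titles are ordered by code point rather than by locale-aware collation.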
  2. Data Wrangling

    Part of the data cleaning is done during scraping. For example, trailing whitespace, empty strings, and invisible characters are removed from the scraped data. The rest of the data cleaning is done in the data_wrangling.ipynb notebook.
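
The kind of cleanup described above could look roughly like the following. This is a sketch, not the spider's actual code: the helper names `clean_text` and `clean_ingredients` are hypothetical, and treating Unicode format characters (category "Cf") as the "invisible characters" is an assumption.

```python
import unicodedata


def clean_text(value):
    """Remove invisible format characters and surrounding whitespace."""
    # Assumption: "invisible characters" means Unicode category "Cf",
    # e.g. zero-width space, byte order mark, soft hyphen.
    visible = "".join(ch for ch in value if unicodedata.category(ch) != "Cf")
    # Turn non-breaking spaces into ordinary spaces before stripping.
    return visible.replace("\u00a0", " ").strip()


def clean_ingredients(items):
    """Clean every ingredient string and drop the ones that end up empty."""
    cleaned = (clean_text(item) for item in items)
    return [item for item in cleaned if item]


# Hypothetical scraped strings with stray whitespace and invisible characters.
raw = ["  beet 100 g \u200b", "", "\u200b", "salt\u00a0"]
cleaned_items = clean_ingredients(raw)
print(cleaned_items)
```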

    1. Structured cleaned data

      I used the Claude AI assistant to convert the raw ingredient strings into structured data. The resulting data is stored in data/structured_data.json. The structure of this file is as follows:

      {
          "category_name": {
              "subcategory_name": {
                  "recipe_name": {
                      "ingredients": [
                          "ingredient_1", "ingredient_2", "ingredient_3"
                      ],
                      "parsed_ingredients": [
                          ["ingredient_name", "quantity", "measure units"],
                          ["ingredient_name", "quantity", "measure units"],
                          ["ingredient_name", "quantity", "measure units"]
                      ]
                  }
              }
          }
      }
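
To work with this structure in a notebook, it can help to flatten it into one row per parsed ingredient. The sketch below assumes the layout shown above; the function name `iter_parsed_rows` and the sample data are hypothetical.

```python
def iter_parsed_rows(structured):
    """Yield (category, subcategory, recipe, ingredient, quantity, unit) rows."""
    for category, subcategories in structured.items():
        for subcategory, recipes in subcategories.items():
            for recipe_name, recipe in recipes.items():
                # Recipes without parsed ingredients simply yield no rows.
                for name, quantity, unit in recipe.get("parsed_ingredients", []):
                    yield (category, subcategory, recipe_name, name, quantity, unit)


# Hypothetical sample in the structured data format.
structured = {
    "soups": {
        "hot": {
            "borscht": {
                "ingredients": ["beet 100 g", "water 1 l"],
                "parsed_ingredients": [
                    ["beet", "100", "g"],
                    ["water", "1", "l"],
                ],
            }
        }
    }
}

rows = list(iter_parsed_rows(structured))
for row in rows:
    print(row)
```

A list of such rows can be passed to a table library or written out as CSV for further analysis.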
  3. Data Visualization

    The visualization has two major parts: static charts built with Plotly (Python), and dynamic charts built with D3.js (CoffeeScript) and Plotly (exported from Python to HTML + JS).

    Notebook data_visualization.ipynb has all the code for generating svg and html files used on the website.

    Three types of charts are used:

    1. Bar charts. Each shows the number of recipes per subcategory in a specified category.
    2. Sunburst chart. Like the bar charts, it shows the number of recipes per subcategory, but for every category at once.
    3. Networks. For a specified category, each shows the connections between recipes and the ingredients they use.
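
The data behind these charts can be derived from the structured data with a few lines of Python. The sketch below is not taken from data_visualization.ipynb: the function names and the sample data are hypothetical, and it only prepares the numbers and edges, leaving the plotting to Plotly or D3.js.

```python
def recipes_per_subcategory(structured, category):
    """Count recipes in each subcategory of one category (bar chart input)."""
    return {sub: len(recipes) for sub, recipes in structured[category].items()}


def recipe_ingredient_edges(structured, category):
    """Collect unique (recipe, ingredient) pairs of one category (network input)."""
    edges = set()
    for recipes in structured[category].values():
        for recipe_name, recipe in recipes.items():
            for name, _quantity, _unit in recipe["parsed_ingredients"]:
                edges.add((recipe_name, name))
    return sorted(edges)


# Hypothetical sample in the structured data format.
structured = {
    "soups": {
        "hot": {
            "borscht": {"parsed_ingredients": [["beet", "100", "g"], ["water", "1", "l"]]},
            "shchi": {"parsed_ingredients": [["cabbage", "150", "g"], ["water", "1", "l"]]},
        },
        "cold": {
            "okroshka": {"parsed_ingredients": [["kvass", "300", "ml"]]},
        },
    }
}

counts = recipes_per_subcategory(structured, "soups")
edges = recipe_ingredient_edges(structured, "soups")
print(counts)
print(edges)
```

Ingredients shared by several recipes, such as water in the sample, appear in several edges, which is what links recipes together in the network view.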

Misc

Visual inspiration: everyday Soviet food.

(I intended to use these in the final design but did not.)


Languages

Jupyter Notebook 34.5%, HTML 30.6%, Python 17.2%, CoffeeScript 12.6%, CSS 3.2%, JavaScript 1.9%