
Twitter in Postgres

You will repeat the Twitter/MapReduce assignment using Postgres. Because this project involves many new programming concepts, it will be spread out over several assignments.

In this first assignment, we will focus on:

  1. working with Postgres from Python (see the sketch after this list)
  2. inserting data into the database
  3. understanding JSON/denormalized vs normalized schemas (i.e. NoSQL vs SQL)
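
To give a feel for item 1, here is a minimal sketch of connecting to Postgres from Python. The psycopg2 driver, credentials, and port below are assumptions for illustration; the starter code may use a different driver:

    import psycopg2

    # connection details are placeholders; substitute your container's port
    conn = psycopg2.connect(host='localhost', port=10001,
                            user='postgres', password='pass', dbname='postgres')
    with conn.cursor() as cur:
        cur.execute('SELECT version()')
        print(cur.fetchone()[0])
    conn.close()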

Tasks

  1. Getting started:

    1. Fork this repo
    2. Enable GitHub Actions on your fork
    3. Clone the fork onto the lambda server
    4. Modify the README.md file so that the test case badge image points to your forked repo
  2. Main tasks:

    1. There are two Postgres containers defined in the docker-compose.yml file: one contains a normalized database schema, and the other a denormalized schema. You will need to update the ports for each database so that they do not conflict with anyone else's. (An example ports stanza appears after the notes below.)

      NOTE: Recall that in the pagila assignments, there was no need to adjust the ports. This is because the database was not exposed to the lambda server. In this assignment, we must expose the database to the lambda server. The load_tweets.sh and load_tweets.py scripts will be run from the lambda server.

      It would be possible to put these scripts "inside" the database image so that we wouldn't need to expose the ports. But I've put the scripts "outside" the container to give you more practice connecting to the db from a remote system.
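
      As a point of reference, a ports stanza in a docker-compose.yml might look like the following. The service names and host port numbers here are placeholders, not this repo's actual values; only the host side (left of the colon) should change:

      services:
        pg_normalized:
          ports:
            - "10001:5432"    # host port (change me) : container port
        pg_denormalized:
          ports:
            - "10002:5432"    # host port (change me) : container port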

    2. Complete the missing sections of the load_tweets.py file. This file is responsible for loading data into the normalized database. The schema for the normalized database is summarized in a diagram in the repo, where the arrows represent foreign keys referencing the primary key of the target table.

      The foreign keys are likely to cause you many errors when inserting your data. These errors may be frustrating, but they are actually a GOOD thing (some would even say GREAT), because Postgres is preventing you from accidentally adding corrupted data into the database. A sketch of an insert that respects these constraints appears below.
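
      As a rough illustration of working with these constraints, the sketch below inserts the parent row before the child row that references it. The table and column names (users, tweets, id_users, id_tweets), the driver, and the connection details are all assumptions for illustration, not necessarily the repo's actual schema:

      import psycopg2

      # placeholder connection; point this at your normalized database's port
      conn = psycopg2.connect(host='localhost', port=10001,
                              user='postgres', password='pass', dbname='postgres')
      tweet = {'id': 1, 'user_id': 42, 'screen_name': 'alice', 'text': 'hello'}
      with conn, conn.cursor() as cur:
          # insert the referenced user first, since tweets.id_users
          # is a foreign key onto users.id_users
          cur.execute(
              'INSERT INTO users (id_users, screen_name) VALUES (%s, %s) '
              'ON CONFLICT DO NOTHING',
              (tweet['user_id'], tweet['screen_name']),
          )
          cur.execute(
              'INSERT INTO tweets (id_tweets, id_users, text) VALUES (%s, %s, %s)',
              (tweet['id'], tweet['user_id'], tweet['text']),
          )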

    3. Complete the missing sections of the load_tweets.sh file. This file will both call the load_tweets.py file and use the SQL COPY command to load data into the denormalized database. (A sketch of the overall shape follows.)
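
      The overall shape of such a script might look like the sketch below. The data paths, table name (tweets_jsonb), command-line flags, ports, and credentials are all assumptions; the csv quote/delimiter trick is one way to make COPY treat each input line as a single verbatim value:

      #!/bin/sh

      files=$(ls data/*)

      echo 'loading normalized database'
      for file in $files; do
          python3 load_tweets.py --db postgresql://postgres:pass@localhost:10001 --inputs "$file"
      done

      echo 'loading denormalized database'
      for file in $files; do
          unzip -p "$file" | psql postgresql://postgres:pass@localhost:10002 \
              -c "COPY tweets_jsonb (data) FROM STDIN csv quote e'\x01' delimiter e'\x02';"
      done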

    4. Grading Note: There are 9 total test cases in the sql folder. If you implement the code above correctly, then the output of the SELECT commands in each test case should be the same for each database. Each test case is worth 1 point per database, for 18 points total.
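
      One way to check a test case by hand is to run it against both databases and compare the output. The file name, ports, and credentials below are placeholders; adjust them to match your setup, and if the two schemas require different queries for the same test, substitute the corresponding file for each database:

      $ psql postgresql://postgres:pass@localhost:10001 < sql/test_case.sql > /tmp/normalized.out
      $ psql postgresql://postgres:pass@localhost:10002 < sql/test_case.sql > /tmp/denormalized.out
      $ diff /tmp/normalized.out /tmp/denormalized.out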

    5. HINT: As you debug your insert code, you may need to delete your database. Calling

      $ docker-compose down
      

      is not enough, since the database is persisted to a volume. To delete the database, you'll need to use the

      $ docker volume ls
      $ docker volume rm VOLUME_NAME
      

      commands to list the docker volumes and delete the appropriate ones. Alternatively, you can use

      $ docker volume prune
      

      to delete all unused volumes. Be careful on a shared machine: prune removes every unused volume on the system, not just yours.
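
      Note: docker-compose down -v also removes the named volumes declared in your compose file in a single step, without touching other volumes on the machine.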

  3. Upload a link to your forked GitHub repo on Sakai
