sciencificity / reclues

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

reclues

Learn new skills while solving a mystery

The reclues package mimics the SQL Murder Mystery done by Northwestern’s Knight Lab - their game is set up to use SQL to solve the mystery, reclues makes their data available in R to solve the mystery using R ;-).

A brief of the task at hand is given in their walkthrough. Want to play the SQL version of the game online? Here it is courtesy of Simon Willison!

Getting Started: Murder Mystery Prompt (@knightlab)

Whodunnit??

A crime has taken place and the detective needs your help. The detective gave you the crime scene report, but you somehow lost it. You vaguely remember that the crime was a murder that occurred sometime on Jan.15, 2018 and that it took place in SQL City. All the clues to this mystery are buried in a huge database, and you need to use your R skills to navigate through this vast network of information. Your first step to solving the mystery is to retrieve the corresponding crime scene report from the police department’s database.

Take a look at the cheatsheet (this Readme file) to learn some tips on how to do this using the tidyverse tools! From there, you can use your R skills to find the murderer. The crime may be solved with other tools besides tidy tools, please feel free to use the dataset as a learning tool for other packages in R or base R itself if that is your wish. The cheatsheet is for tidyverse tools for now as a start, but since I am learning myself I may expand this to include data.table etc. in future. Have fun!

Travis build status

The goal of reclues is to provide the datasets from the SQL Challenge mentioned above within R, and some cheatsheet tidyverse commands for getting you on your way to solving the mystery.

Please see the reclues package site for how to use the package, as well as the vignette included in the Get Started section which contains info on using the actual sqlite database included in the package.


The datasets

The datasets will be available as soon as you install the package. These are the datasets available and the data contained within them.

Table Name Fields Rows
crime_scene_report date, type, description, city 1,228 rows
drivers_license id, age, height, eye_color, hair_color, gender, plate_number, car_make, car_model 10,007 rows
facebook_event_checkin person_id, event_id, event_name, date 20,011 rows
get_fit_now_check_in membership_id, check_in_date, check_in_time, check_out_time 2,703 rows
get_fit_now_member id, person_id, name, membership_start_date, membership_status 184 rows
income ssn, annual_income 7,514 rows
interview person_id, transcript 4,991 rows
person id, name, license_id, address_number, address_street_name, ssn 10,011 rows

SQLite DB

The raw SQLite database as per @knightlab is also available through the get_db() function. To use the SQLite DB for your investigation you will need the DBI package.

install.packages("DBI")

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("sciencificity/reclues")

Some useful functions to view the data

library(reclues)
library(dplyr)
# basic example code
# see the first 6 observations
head(crime_scene_report)
#> # A tibble: 6 x 4
#>       date type    description                                      city   
#>      <int> <chr>   <chr>                                            <chr>  
#> 1 20180115 robbery A Man Dressed as Spider-Man Is on a Robbery Spr~ NYC    
#> 2 20180115 murder  Life? Dont talk to me about life.                Albany 
#> 3 20180115 murder  Mama, I killed a man, put a gun against his hea~ Reno   
#> 4 20180215 murder  REDACTED REDACTED REDACTED                       SQL Ci~
#> 5 20180215 murder  Someone killed the guard! He took an arrow to t~ SQL Ci~
#> 6 20180115 theft   Big Bully stole my lunch money!                  Chicago
glimpse(crime_scene_report)
#> Observations: 1,228
#> Variables: 4
#> $ date        <int> 20180115, 20180115, 20180115, 20180215, 20180215, ...
#> $ type        <chr> "robbery", "murder", "murder", "murder", "murder",...
#> $ description <chr> "A Man Dressed as Spider-Man Is on a Robbery Spree...
#> $ city        <chr> "NYC", "Albany", "Reno", "SQL City", "SQL City", "...

# Notice that the type field contains info on the 
# type of crime which took place?
# Want to see how many reports of each incident type we have?
table(crime_scene_report$type)
#> 
#>     arson   assault blackmail   bribery     fraud    murder   robbery 
#>       148       145       130       135       130       148       134 
#> smuggling     theft 
#>       117       141

Other great packages to explore your data are DataExplorer and skimr.

  • DataExplorer’screate_report(dataset) creates an html report with summary stats, missing data, graphs of categorical data etc.
  • skimr’s skim(dataset) creates a nice summary of your dataset separating the different types of data allowing you to look at summary stats by data type.

Some useful functions to explore the data

SELECT

In R dplyr’s select() works much the same as SELECT in SQL. You use it to get only specific columns you are interested in.

Let’s say I wanted a closer look at the plate_number, car_make and car_model from the drivers_license table?

# Let's have a look at a few columns of interest from
# the drivers_license table
drivers_license %>% 
  select(plate_number, car_make, car_model) %>% 
  head(3) %>% 
  # formattable func from the formattable package just prints a nice table in the Readme
  formattable::formattable()

plate_number

car_make

car_model

P24L4U

Acura

MDX

XF02T6

Cadillac

SRX

VKY5KR

Scion

xB

# There are also helper functions to select columns of interest
# starts_with('start_text') will help select columns that begin with start_text
# ends_with('end_text') will help select columns that end with end_text
drivers_license %>% 
  # Maybe I am only interested in the columns describing the car...
  select(starts_with('car')) %>% 
  head(3) %>% 
  # formattable just prints a nice table in the Readme
  formattable::formattable()

car_make

car_model

Acura

MDX

Cadillac

SRX

Scion

xB

SQL Equivalent is:

SELECT plate_number, car_make, car_model FROM drivers_license LIMIT 3

Here’s a snippet from the online SQL version:


LIMIT

Let’s say we wanted to see a part of the data - the head() function returns 6 observations and performs a similar functionality as the LIMIT keyword in SQL.

  • head() gives you the first 6 observations of the data in the “table”
  • tail() gives you the last 6 observations of the data in the “table”

You can also specify a number as an argument to the head() or tail() functions. For example, head(15) and tail(10) will give you the first 15, and last 10 observations respectively.

crime_scene_report %>% 
  select(description) %>% 
  head(8) %>% 
  # formattable func from the formattable package just prints a nice table in the Readme
  formattable::formattable()

description

A Man Dressed as Spider-Man Is on a Robbery Spree

Life? Dont talk to me about life.

Mama, I killed a man, put a gun against his head…

REDACTED REDACTED REDACTED

Someone killed the guard! He took an arrow to the knee!

Big Bully stole my lunch money!

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

A lone hunter stalks the night, firing arrows into the Darkness. There is no hiding, no escape. In the distance, the beast falters, tethered to the void. The killing blow comes without hesitation, without mercy.

SQL Equivalent is:

SELECT description FROM crime_scene_report LIMIT 8

Here’s a snippet from the online SQL version:

Maybe I am interested in having a look at all variables associated with a person but I just want to have a look at the data not bring back all 10,011 rows.

person %>% 
  head(4) %>% 
  formattable::formattable()

id

name

license_id

address_number

address_street_name

ssn

10000

Christoper Peteuil

993845

624

Bankhall Ave

747714076

10007

Kourtney Calderwood

861794

2791

Gustavus Blvd

477972044

10010

Muoi Cary

385336

741

Northwestern Dr

828638512

10016

Era Moselle

431897

1987

Wood Glade St

614621061

SQL Equivalent is:

SELECT * FROM person LIMIT 4;

Here’s a snippet from the online SQL version:


DISTINCT

Let’s say we wanted to see the different types of criminal activity the reports identify ….
The type field in the crime scene reports table seems to contain this info. We will use the distinct function from dplyr.

library(magrittr)

# the magrittr package contains the pipe %>% function
# Take the crime scene report data AND THEN
# give me the distinct values for the `type` variable.
crime_scene_report %>% 
    distinct(type) %>% 
    formattable::formattable()

type

robbery

murder

theft

fraud

arson

bribery

assault

smuggling

blackmail

SQL Equivalent is:

SELECT DISTINCT(type) FROM crime_scene_report

Here’s a snippet from the online SQL version:

COUNT DISTINCT

Let’s say we were wondering which city has the highest number of crimes

  • here we want the city and a count of the times that city is mentioned in the crime scene report …
crime_scene_report %>% 
    count(city) %>% 
    arrange(desc(n)) %>% 
    # filter to limit the print-out
    filter(n >= 7) %>% 
    formattable::formattable()

city

n

Murfreesboro

9

SQL City

9

Duluth

8

Evansville

8

Jersey City

8

Toledo

8

Dallas

7

Hollywood

7

Kissimmee

7

Lancaster

7

Little Rock

7

Newark

7

Portsmouth

7

Reno

7

Waterbury

7

Wilmington

7

Hhmmm looks like SQL City is quite notorious for crime!

SQL Equivalent is:

SELECT city, count(city) AS n
FROM crime_scene_report
GROUP BY city
ORDER BY n DESC

Here’s a snippet from the online SQL version:


Magnify long pieces of text

Sometimes there are fields like crime_scene_report.description which are hard to see because the text runs over several lines. Even using View() or printing just the description to the screen sometimes does not help.

Enter pull() from the dplyr package which extracts a column from the data.

Hint: You will need something like this to read some of the textual description and transcript information.

crime_scene_report %>% 
  head(8) %>% 
  pull(description) %>% 
  # these next 2 lines are just for displaying the result nicely in the Readme
  tibble::enframe(name = NULL) %>% 
  formattable::formattable()

value

A Man Dressed as Spider-Man Is on a Robbery Spree

Life? Dont talk to me about life.

Mama, I killed a man, put a gun against his head…

REDACTED REDACTED REDACTED

Someone killed the guard! He took an arrow to the knee!

Big Bully stole my lunch money!

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

A lone hunter stalks the night, firing arrows into the Darkness. There is no hiding, no escape. In the distance, the beast falters, tethered to the void. The killing blow comes without hesitation, without mercy.

interview %>% 
  filter(stringr::str_length(transcript) >= 230) %>% 
  pull(transcript) %>% 
  # these next 2 lines are just for displaying the result nicely in the Readme
  tibble::enframe(name = NULL) %>% 
  formattable::formattable()

value

I was hired by a woman with a lot of money. I don’t know her name but I know she’s around 5’5" (65“) or 5’7” (67"). She has red hair and she drives a Tesla Model S. I know that she attended the SQL Symphony Concert 3 times in December 2017.


LIKE

Let’s say we’re interested in finding the people that start with a Z. We will use the stringr package for this. The str_detect() function can be used in conjunction with regular expressions - here we looking for names that start with (^) Z.

library(stringr)
person %>% 
  filter(stringr::str_detect(name, "^Z")) %>% 
  # Limit to top 5 for the print-out
  head(5) %>% 
  formattable::formattable()

id

name

license_id

address_number

address_street_name

ssn

10452

Zachary Angeloro

702210

3713

Currant Ave

965949567

10797

Zack Pentecost

150590

1839

Rushwood St

532875652

11264

Zada Laban

808317

1663

Testa Circle

983015244

14182

Zack Karwoski

964918

1546

W Middleton Way

482427372

14930

Zella Pietrzyk

440964

3680

Sandgate Circle

443068729

SQL Equivalent is:

SELECT * FROM person
WHERE name LIKE ‘Z%’

Here’s a snippet from the online SQL version:


JOINS

dplyr has joining functions such as inner_join(), left_join() etc. for joining one table to another. This mimics the SQL INNER JOIN etc.
You will notice that the person table has a field called id and the interview table has a person_id field. Let’s join these tables and see what we get.

person %>% 
  # Since the two tables have diff field names for the common field
  # we have to specify the `by` argument.
  # by = c('field_name_from_left_table' = 'field_name_from_right_table')
  inner_join(interview, by = c('id' = 'person_id')) %>% 
  # Let's say we're only interested in interviews from people who live
  # on some Drive abbreviated to 'Dr'
  filter(stringr::str_detect(address_street_name, 'Dr')) %>% 
  # Limit for print-out
  head(3) %>% 
  formattable::formattable()

id

name

license_id

address_number

address_street_name

ssn

transcript

10027

Antione Godbolt

439509

2431

Zelham Dr

491650087

nearer to watch them, and just as she came up to them she heard one of

10034

Kyra Buen

920494

1873

Sleigh Dr

332497972

a kind of serpent, that�s all I can say.�

10039

Francesco Agundez

278151

736

Buswell Dr

861079251

Beau–ootiful Soo–oop!

SELECT * FROM person
INNER JOIN interview ON person.id = interview.person_id
WHERE address_street_name LIKE ‘%Dr%’
LIMIT 3

Here’s a snippet from the online SQL version:


Think you solved it?

Check via @knightlab’s Solution Checker

Head over to ‘The SQL Murder Mystery Page’ OR ‘The SQL Murder Mystery Walkthrough’ to check your solution! At the bottom of both pages there is a Check your solution section where you enter the name of the individual you suspect committed the crime.

Check in R

Run the following commands in R once you think you’ve solved the problem. You will need the DBI package and if you’ve been using the datasets to solve the mystery and not the SQLite database (i.e. the individual dataframes of person, drivers_license etc.) then uncomment the first line to make a connection to the database, run the queries below after you’ve put in the culprit you suspect, and then disconnect from the database.

conn <- reclues::get_db()

# Replace 'Insert the name of the person you found here' with the name of the individual you found.
DBI::dbExecute(conn, "INSERT INTO solution VALUES (1, 'Insert the name of the person you found here')")

# Did we solve it? You'll either get a "That's not the right person." or a "Congrats,..." message.
DBI::dbGetQuery(conn, "SELECT value FROM solution;")

DBI::dbDisconnect(conn)

Resources for learning

SQL

  • Brandon Rohrer (@_brohrer_) has a curated list of resources here

R

About

License:Other


Languages

Language:R 100.0%