Cleaning MyAnimeList data for Google Data Analytics Capstone
In the previous script I downloaded all of the data that I wanted to
use for my capstone project. Now I need to explore and clean it.
First things first - setting some variables
Set your working directory to wherever you’d like by replacing the
WORKINGDIRECTORY placeholder below.
wd1="WORKINGDIRECTORY"
setwd(wd1)
Install and load any libraries
The tidyverse will be used to manipulate/transform data and the
janitor library will be used to check duplicates and consistency of
data.
install.packages(c("tidyverse", "janitor"))
Next, we want to load the two libraries:
library(tidyverse)
library(janitor)
Load in the data
Previously, we had downloaded and made transformations to data
from the MyAnimeList API. Now we’re going to load that data into R for
further analysis and cleaning.
All of the column names match expectations, so we are good!
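As an illustration of what "match expectations" could mean in practice, here is a minimal base R sketch that compares a table's column names against an expected list. The data frame `demo_l_example` and its rows are made up for the example; only the column names come from the demo_l table described later.

```r
# Toy stand-in for a loaded table (rows are made up for illustration).
demo_l_example <- data.frame(
  tm_ky   = c(1L, 1L),
  demo_id = c(1L, 2L),
  demo_de = c("Shounen", "Seinen"),
  stringsAsFactors = FALSE
)

# Column names I expect this table to have.
expected_cols <- c("tm_ky", "demo_id", "demo_de")

# Names that are expected but absent, and names that are present but unexpected.
missing_cols <- setdiff(expected_cols, names(demo_l_example))
extra_cols   <- setdiff(names(demo_l_example), expected_cols)

# TRUE only when both lists are empty.
cols_ok <- length(missing_cols) == 0 && length(extra_cols) == 0
print(cols_ok)
```

Checking both directions catches renamed columns as well as columns that were dropped or added by the download script.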
Cleaning up strings
There is one situation that will cause issues when uploading to
Tableau: if a field contains a new line (\n) or carriage return (\r)
character, the table loading process will fail.
I need to search through every table that contains character fields and
confirm that none of them include those characters.
tables1 <- list(
  anime_demo_table = anime_demo_table,
  anime_genres_table = anime_genres_table,
  anime_ranking_table = anime_ranking_table,
  anime_studios_table = anime_studios_table,
  anime_syn_table = anime_syn_table,
  anime_table = anime_table,
  rank_table = rank_table,
  demo_l = demo_l,
  genres_l = genres_l,
  studios_l = studios_l
)
# Report any character column that contains a new line or carriage return
find_character <- function(df, dfName) {
  for (i in seq_along(df)) {
    # [[i]] returns a plain vector for both data frames and tibbles
    col <- df[[i]]
    if (is.character(col)) {
      # grepl() returns one TRUE/FALSE per value; NA values count as no match
      if (any(grepl("\n", col, fixed = TRUE))) {
        print(paste(dfName, "-", names(df)[i], "column has the new line character"))
      }
      if (any(grepl("\r", col, fixed = TRUE))) {
        print(paste(dfName, "-", names(df)[i], "column has the carriage return character"))
      }
    }
  }
}
for (i in seq_along(tables1)) {
  find_character(tables1[[i]], names(tables1)[i])
}
## [1] "anime_table - synopsis column has the new line character"
tables1<-NULL
It appears that only the synopsis column contains the new line
character, so let’s get rid of those.
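A minimal sketch of one way to strip these characters, assuming that replacing each run of line breaks with a single space is acceptable for the synopsis text. The `synopsis_example` data frame is a made-up stand-in for anime_table.

```r
# Toy stand-in for anime_table (rows are made up for illustration).
synopsis_example <- data.frame(
  mal_id   = c(1L, 2L),
  synopsis = c("First line.\nSecond line.", "Windows style.\r\nNext line."),
  stringsAsFactors = FALSE
)

# Replace any run of carriage returns and/or new lines with one space.
synopsis_example$synopsis <- gsub("[\r\n]+", " ", synopsis_example$synopsis)

# Confirm that no line-break characters are left (FALSE means clean).
has_breaks <- any(grepl("[\r\n]", synopsis_example$synopsis))
print(has_breaks)
```

Replacing with a space rather than an empty string keeps words on either side of the break from running together.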
Ok, so the following fields contain nulls in the anime_table:
end_date
demo_de
synonyms
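One way to arrive at a list like the one above is to count missing values per column. This is a base R sketch on a made-up data frame (`na_example`), not the exact code used for this report.

```r
# Toy stand-in for anime_table (rows are made up for illustration).
na_example <- data.frame(
  mal_id   = c(1L, 2L, 3L),
  end_date = c("2020-01-01", NA, NA),
  demo_de  = c("Shounen", NA, "Seinen"),
  stringsAsFactors = FALSE
)

# Count NA values in every column.
na_counts <- colSums(is.na(na_example))

# Keep only the columns that actually contain missing values.
cols_with_na <- na_counts[na_counts > 0]
print(cols_with_na)
```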
I plan on incorporating the demographic into my analysis. Because of
this, I will replace the null values in this field with 'missing'.
anime_table$demo_de<-anime_table$demo_de %>%
replace_na('missing')
# If I need to replace multiple columns, use the code below:
# anime_table <- anime_table %>%
#   replace_na(list(x = 'missing', y = 'none'))
# When I run this for the first full time I will also encounter missing
# values in start_season_year, which will require a substitution of the
# first four characters from the start_date:
# anime_table <- anime_table %>%
#   mutate(start_season.year = case_when(
#     is.na(start_season.year) ~ as.integer(substr(start_date, 1, 4)),
#     !is.na(start_season.year) ~ start_season.year))
# lapply(anime_table, function(x) { length(which(is.na(x))) })
All other tables look ok, so we’re good to move onto the next step.
Check for duplicates
I’ll check for duplicates for all fields first, then check individual
fields where I wouldn’t expect duplicates (primary keys and potentially
other fields).
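The checks below use get_dupes() from janitor. As a cross-check that does not depend on any package, the same primary-key test can be sketched in base R. The helper name `key_is_unique` and the `genres_example` rows are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: TRUE when no combination of the key columns repeats.
key_is_unique <- function(df, keys) {
  anyDuplicated(df[, keys, drop = FALSE]) == 0
}

# Toy stand-in for a bridge table (rows are made up for illustration).
genres_example <- data.frame(
  tm_ky     = c(1L, 1L, 1L),
  mal_id    = c(10L, 10L, 20L),
  genres_id = c(1L, 2L, 1L)
)

# The composite key (mal_id, genres_id) never repeats in the toy data.
composite_ok <- key_is_unique(genres_example, c("mal_id", "genres_id"))

# mal_id on its own does repeat, so it is not a valid key by itself.
single_ok <- key_is_unique(genres_example, "mal_id")

print(c(composite = composite_ok, single = single_ok))
```

A check like this could be wrapped in stopifnot() so that the script halts if a key that should be unique is not.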
anime_demo_table
Field      Type   Primary Key
tm_ky      int    PK
mal_id     int    PK
demo_id    int
get_dupes(anime_genres_table)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, mal_id, genres_id
## [1] tm_ky mal_id genres_id dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(anime_genres_table, mal_id, genres_id)
## No duplicate combinations found of: mal_id, genres_id
## [1] mal_id genres_id dupe_count tm_ky
## <0 rows> (or 0-length row.names)
anime_genres_table
Field        Type   Primary Key
tm_ky        int    PK
mal_id       int    PK
genres_id    int    PK
get_dupes(anime_genres_table)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, mal_id, genres_id
## [1] tm_ky mal_id genres_id dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(anime_genres_table, mal_id, genres_id)
## No duplicate combinations found of: mal_id, genres_id
## [1] mal_id genres_id dupe_count tm_ky
## <0 rows> (or 0-length row.names)
anime_ranking_table
Field                           Type   Primary Key
tm_ky                           int    PK
mal_id                          int    PK
mean                            dbl
rank                            int
popularity                      int
num_scoring_users               int
statistics.watching             int
statistics.completed            int
statistics.on_hold              int
statistics.dropped              int
statistics.plan_to_watch        int
statistics.num_scoring_users    int
get_dupes(anime_ranking_table)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, mal_id, mean, rank, popularity, num_scoring_users, statistics.watching, statistics.completed, statistics.on_hold, ... and 3 other variables
## [1] tm_ky mal_id
## [3] mean rank
## [5] popularity num_scoring_users
## [7] statistics.watching statistics.completed
## [9] statistics.on_hold statistics.dropped
## [11] statistics.plan_to_watch statistics.num_scoring_users
## [13] dupe_count
## <0 rows> (or 0-length row.names)
There can be duplicates for both rank and popularity, even though they
should be unique. The data download occurs over the space of ~1 hour.
Because of this, rankings may change slightly while the download is
occurring, resulting in duplicates or gaps.
Since I am not including the popularity or rank as items in my download,
this is OK. However, anyone using these fields should treat this as a
caveat.
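For anyone who does rely on these fields, a small base R sketch of how duplicated and missing rank values could be listed. It assumes ranks are positive integers starting at 1 with no NA values; the `ranking_example` rows are made up.

```r
# Toy stand-in for anime_ranking_table (rows are made up for illustration).
ranking_example <- data.frame(
  mal_id = c(101L, 102L, 103L, 104L),
  rank   = c(1L, 2L, 2L, 4L)
)

# Rank values that appear more than once.
dup_ranks <- unique(ranking_example$rank[duplicated(ranking_example$rank)])

# Rank values between 1 and the maximum that never appear (gaps).
gap_ranks <- setdiff(seq_len(max(ranking_example$rank)), ranking_example$rank)

print(dup_ranks)
print(gap_ranks)
```

In the toy data, rank 2 is shared by two titles and rank 3 is never assigned, which is the pattern a ranking shift during a long download would produce.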
anime_studios_table
Field        Type   Primary Key
tm_ky        int    PK
mal_id       int    PK
studio_id    int    PK
get_dupes(anime_studios_table)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, mal_id, studio_id
## [1] tm_ky mal_id studio_id dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(anime_studios_table, mal_id, studio_id)
## No duplicate combinations found of: mal_id, studio_id
## [1] mal_id studio_id dupe_count tm_ky
## <0 rows> (or 0-length row.names)
anime_syn_table
Field       Type   Primary Key
tm_ky       int    PK
mal_id      int    PK
synonyms    chr
get_dupes(anime_syn_table)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, mal_id, synonyms
## [1] tm_ky mal_id synonyms dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(anime_syn_table, mal_id, synonyms)
## No duplicate combinations found of: mal_id, synonyms
## [1] mal_id synonyms dupe_count tm_ky
## <0 rows> (or 0-length row.names)
Interestingly, in some cases there can be duplicate synonyms. These are
present on the website; however, I can exclude them here.
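A minimal sketch of excluding repeated synonyms, assuming a duplicate means the same (mal_id, synonyms) pair appearing more than once. It uses base R on a made-up `syn_example` data frame; `dplyr::distinct(mal_id, synonyms, .keep_all = TRUE)` would be a tidyverse alternative.

```r
# Toy stand-in for anime_syn_table (rows are made up for illustration).
syn_example <- data.frame(
  tm_ky    = c(1L, 1L, 1L),
  mal_id   = c(10L, 10L, 20L),
  synonyms = c("Alt Title", "Alt Title", "Other Title"),
  stringsAsFactors = FALSE
)

# Flag every repeat of a (mal_id, synonyms) pair after its first occurrence.
is_dupe <- duplicated(syn_example[, c("mal_id", "synonyms")])

# Keep only the first occurrence of each pair.
syn_deduped <- syn_example[!is_dupe, ]
print(nrow(syn_deduped))
```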
rank_table
get_dupes(rank_table)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, mal_id, title, rank, rank_category
## [1] tm_ky mal_id title rank rank_category
## [6] dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(rank_table, mal_id, rank_category)
## mal_id rank_category dupe_count tm_ky title rank
## 1 11079 favorite 2 2 Kill Me Baby 1500
## 2 11079 favorite 2 2 Kill Me Baby 1501
get_dupes(rank_table, rank_category, rank)
## No duplicate combinations found of: rank_category, rank
## [1] rank_category rank dupe_count tm_ky mal_id
## [6] title
## <0 rows> (or 0-length row.names)
Interestingly, there are no duplicates for rank. I assume that because
this download is pretty fast, there isn’t time for dynamic changes in
rank as time passes.
demo_l
Field      Type   Primary Key
tm_ky      int    PK
demo_id    int    PK
demo_de    chr
get_dupes(demo_l)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, demo_id, demo_de
## [1] tm_ky demo_id demo_de dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(demo_l, demo_id)
## No duplicate combinations found of: demo_id
## [1] demo_id dupe_count tm_ky demo_de
## <0 rows> (or 0-length row.names)
genres_l
Field        Type   Primary Key
tm_ky        int    PK
genres_id    int    PK
genres_de    chr
get_dupes(genres_l)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, genres_id, genres_de
## [1] tm_ky genres_id genres_de dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(genres_l, genres_id)
## No duplicate combinations found of: genres_id
## [1] genres_id dupe_count tm_ky genres_de
## <0 rows> (or 0-length row.names)
studios_l
Field        Type   Primary Key
tm_ky        int    PK
studio_id    int    PK
studio_de    chr
get_dupes(studios_l)
## No variable names specified - using all columns.
## No duplicate combinations found of: tm_ky, studio_id, studio_de
## [1] tm_ky studio_id studio_de dupe_count
## <0 rows> (or 0-length row.names)
get_dupes(studios_l, studio_id)
## No duplicate combinations found of: studio_id
## [1] studio_id dupe_count tm_ky studio_de
## <0 rows> (or 0-length row.names)
Check for inconsistencies in names
There are only a few fields that I can review for consistency. While I
would like to confirm that no title appears under slightly different
names, that just isn’t feasible at this point. As a result, I will take a
focused approach to each table.
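As an illustration of what a focused consistency check could look like, this base R sketch groups raw spellings that collapse to the same value once case and surrounding whitespace are ignored. The `studio_names` values are made up, and this normalization rule is only one assumption about what counts as "the same name".

```r
# Made-up description values with one inconsistent spelling.
studio_names <- c("Studio A", "studio a ", "Studio B", "Studio B")

# Work on distinct raw spellings only.
raw <- unique(studio_names)

# Group the raw spellings by their normalized form (trimmed, lower case).
groups <- split(raw, tolower(trimws(raw)))

# Keep groups where more than one raw spelling maps to the same form.
collisions <- groups[lengths(groups) > 1]
print(collisions)
```

Any group returned here is a candidate for manual review, since the check only flags differences in case and whitespace.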