Overview
I grew up playing the original version of Pokemon on my gray brick of a Gameboy, so what would be a better way to re-live my childhood other than digging through the data. Let’s try predicting whether or not a Pokemon is legendary based off their attributes. Historically, I have used the Caret
package as my modeling framework, but we will use the more modern Tidymodels
suite of packages made by Max Kuhn & Rstudio Team.
What we are going to cover:
- Set Up
- Data Wrangling
- Exploratory Data Analysis (EDA)
- Modeling
- Conclusion
Set Up
We are going to be loading two meta packages Tidyverse
for data manipulation, cleaning & visualization and Tidymodels
for model training & evaluation.
# If not installed unncomment below and run
# install.packages("tidyverse", dependencies = TRUE)
# install.packages("tidymodels", dependencies = TRUE)
# if installed load
library(tidyverse) # plotting and data manipulation
library(tidymodels) # tidy model framework
Data Wrangling
We are going to need to load the data, so I store a csv of seven generations of Pokemon. The block below will import the data and then check to see if it worked.
# import data from github
pokemon_df <- read_csv("https://raw.githubusercontent.com/Jordan-Krogmann/pokemon/master/data/pokemon.csv")
# check top rows
head(pokemon_df, 3)
## # A tibble: 3 x 41
## abilities against_bug against_dark against_dragon against_electric
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ['Overgr~ 1 1 1 0.5
## 2 ['Overgr~ 1 1 1 0.5
## 3 ['Overgr~ 1 1 1 0.5
## # ... with 36 more variables: against_fairy <dbl>, against_fight <dbl>,
## # against_fire <dbl>, against_flying <dbl>, against_ghost <dbl>,
## # against_grass <dbl>, against_ground <dbl>, against_ice <dbl>,
## # against_normal <dbl>, against_poison <dbl>, against_psychic <dbl>,
## # against_rock <dbl>, against_steel <dbl>, against_water <dbl>, attack <dbl>,
## # base_egg_steps <dbl>, base_happiness <dbl>, base_total <dbl>,
## # capture_rate <chr>, classfication <chr>, defense <dbl>,
## # experience_growth <dbl>, height_m <dbl>, hp <dbl>, japanese_name <chr>,
## # name <chr>, percentage_male <dbl>, pokedex_number <dbl>, sp_attack <dbl>,
## # sp_defense <dbl>, speed <dbl>, type1 <chr>, type2 <chr>, weight_kg <dbl>,
## # generation <dbl>, is_legendary <dbl>
Next I am going to add/clean a few columns that I will need later, so don’t focus on the why. There is a probably a smarter way to construct the two_types_flag
, but I got brain lazy (I will clean that up later).
pokemon_df <- pokemon_df %>%
mutate(type2 = case_when(is.na(type2) ~ "none", TRUE ~ type2)) %>%
mutate(two_types_flag = case_when(type2 == "none" ~ 0, TRUE ~ 1)) %>%
mutate(bug_type = case_when(type1 == "bug" | type2 == "bug" ~ 1, TRUE ~ 0),
dark_type = case_when(type1 == "dark" | type2 == "dark" ~ 1, TRUE ~ 0),
dragon_type = case_when(type1 == "dragon" | type2 == "dragon" ~ 1, TRUE ~ 0),
electric_type = case_when(type1 == "electric" | type2 == "electric" ~ 1, TRUE ~ 0),
fairy_type = case_when(type1 == "fairy" | type2 == "fairy" ~ 1, TRUE ~ 0),
fighting_type = case_when(type1 == "fighting" | type2 == "fighting" ~ 1, TRUE ~ 0),
fire_type = case_when(type1 == "fire" | type2 == "fire" ~ 1, TRUE ~ 0),
flying_type = case_when(type1 == "flying" | type2 == "flying" ~ 1, TRUE ~ 0),
ghost_type = case_when(type1 == "ghost" | type2 == "ghost" ~ 1, TRUE ~ 0),
grass_type = case_when(type1 == "grass" | type2 == "grass" ~ 1, TRUE ~ 0),
ground_type = case_when(type1 == "ground" | type2 == "ground" ~ 1, TRUE ~ 0),
ice_type = case_when(type1 == "ice" | type2 == "ice" ~ 1, TRUE ~ 0),
normal_type = case_when(type1 == "normal" | type2 == "normal" ~ 1, TRUE ~ 0),
poison_type = case_when(type1 == "poison" | type2 == "poison" ~ 1, TRUE ~ 0),
psychic_type = case_when(type1 == "psychic" | type2 == "psychic" ~ 1, TRUE ~ 0),
rock_type = case_when(type1 == "rock" | type2 == "rock" ~ 1, TRUE ~ 0),
steel_type = case_when(type1 == "steel" | type2 == "steel" ~ 1, TRUE ~ 0),
water_type = case_when(type1 == "water" | type2 == "water" ~ 1, TRUE ~ 0))
Here we will be needing a training & testing set for modeling. I am going to use the first six generations of Pokemon to train our models and hold out the seventh generation to test them.
train_df <- pokemon_df %>% filter(generation != 7)
test_df <- pokemon_df %>% filter(generation == 7)
EDA
Plot 1
1
Plot 2
2