Monday, May 2, 2016

Introduction to Machine Learning with Random Forest


The purpose of this tutorial is to serve as an introduction to the randomForest package in R and to some common analyses in machine learning.

Part 1. Getting Started

As a first step, we will load the package and the iris data set. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
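A minimal sketch of this first step, assuming the randomForest package is already installed (the install line is left commented out). The calls to str() and table() are only one way to confirm the 3 classes of 50 instances described above:

```r
# install.packages("randomForest")  # one-time install, if needed
library(randomForest)

# iris ships with base R (datasets package)
data(iris)

# Four numeric measurements plus the Species factor
str(iris)

# Number of instances in each of the 3 classes
table(iris$Species)
```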

Part 2. Fit Model

Now that we know what our data set contains, let's fit our first model. We will be fitting 500 trees in our forest and trying to classify the Species of each iris in the data set. For the randomForest() function, "~." means use all the variables in the data frame.

Note: a common beginner mistake is trying to classify a categorical variable that R sees as a character. To fix this, convert the variable to a factor, like this: randomForest(as.factor(Species) ~ ., iris, ntree=500)
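To see the difference the note describes, here is a small sketch that uses only base R. The data frame iris_chr is a hypothetical copy made for illustration, with the label deliberately turned into text first:

```r
# Hypothetical copy of iris in which the label was read in as plain text
iris_chr <- iris
iris_chr$Species <- as.character(iris_chr$Species)
class(iris_chr$Species)    # a character vector, not a factor

# Convert it back to a factor before modeling
iris_chr$Species <- as.factor(iris_chr$Species)
class(iris_chr$Species)
levels(iris_chr$Species)
```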

fit <- randomForest(Species ~ ., iris, ntree=500)

The next step is to use the newly created model in the fit variable to predict the label.

results <- predict(fit, iris)
summary(results)


After you have the predicted labels in a vector (results), the predicted and actual labels must be compared. This can be done with a confusion matrix. A confusion matrix is a table of the actual vs. the predicted labels, with the diagonal numbers being correctly classified elements while all others are incorrect.



# Confusion Matrix
 table(results, iris$Species) 




Now we can take the diagonal entries of the table and sum them; this gives us the total number of correctly classified instances. Dividing this number by the total number of instances then gives the percentage of predictions that were correctly classified.

# Calculate the accuracy  
correctly_classified <- table(results, iris$Species)[1,1] + table(results, iris$Species)[2,2] + table(results, iris$Species)[3,3] 
total_classified <- length(results)  
# Accuracy   
correctly_classified / total_classified   
# Error  
1 - (correctly_classified / total_classified)
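The same diagonal-sum calculation can be written more compactly with diag(). The sketch below uses a small made-up set of labels (the vectors actual and predicted are hypothetical, not taken from the iris model) so that the arithmetic is easy to check by hand:

```r
# Small made-up example: six true labels and six predictions
actual    <- factor(c("a", "a", "b", "b", "c", "c"))
predicted <- factor(c("a", "b", "b", "b", "c", "a"), levels = levels(actual))

# Confusion matrix: predictions in rows, true labels in columns
cm <- table(predicted, actual)
cm

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy        # 4 of the 6 predictions are correct
1 - accuracy    # error rate
```

This form also keeps working if the number of classes changes, since it does not hard-code the [1,1], [2,2] and [3,3] cells.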

Part 3. Validate Model 

The next step is to validate the prediction model. Validation requires splitting your data into two sections. The first is the training set, which will be used to create the model. The second is the test set, which will be used to test the accuracy of the prediction model. The reason for splitting the data is to allow a model to be created using one data set while reserving some data, where the output is already known, to "test" the model accuracy. This estimates the accuracy of the model more effectively, because the data used to create the model is not also used to measure its accuracy.

# How to split into a training set
rows <- nrow(iris)
col_count <- c(1:rows)
# Shuffle the row numbers and attach them to the data as a random Row_ID
Row_ID <- sample(col_count, rows, replace = FALSE)
iris$Row_ID <- Row_ID

# Choose the percent of the data to be used in the training data
training_set_size <- .80
# Now to split the data into training and test
index_percentile <- rows*training_set_size
# If the Row ID is smaller than or equal to the index percentile, it will be assigned into the training set
train <- iris[iris$Row_ID <= index_percentile,]
# If the Row ID is larger than the index percentile, it will be assigned into the test set
test <- iris[iris$Row_ID > index_percentile,]
train_data_rows <- nrow(train)
test_data_rows <- nrow(test)
total_data_rows <- (nrow(train)+nrow(test))
train_data_rows / total_data_rows
# Now we have 80% of the data in the training set
test_data_rows / total_data_rows
# Now we have 20% of the data in the test set
# Now let's build the random forest using the train data set
# (Row_ID is only a bookkeeping column, so it is dropped from the predictors)
fit <- randomForest(Species ~ . - Row_ID, train, ntree=500)
  

After the test set is predicted, a confusion matrix and accuracy must be calculated.

# Use the new model to predict the test set   
results <- predict(fit, test, type="response")   
# Confusion Matrix  
table(results, test$Species)  
# Calculate the accuracy   
correctly_classified <- table(results, test$Species)[1,1] + table(results, test$Species)[2,2] + table(results, test$Species)[3,3]
total_classified <- length(results)
# Accuracy   
correctly_classified / total_classified  
# Error  
1 - (correctly_classified / total_classified)
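For comparison, the same 80/20 split can be sketched in a few lines by sampling row numbers directly. This is an alternative to the Row_ID approach above, not a replacement for it; the names train_alt and test_alt and the seed value are assumptions chosen for illustration:

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

n <- nrow(iris)
# Draw 80% of the row numbers at random for training
train_idx <- sample(n, size = floor(0.80 * n))

train_alt <- iris[train_idx, ]
test_alt  <- iris[-train_idx, ]

nrow(train_alt) / n   # share of rows in the training set
nrow(test_alt) / n    # share of rows in the test set
```

Because no extra column is added to the data frame, the formula Species ~ . can be used on train_alt without picking up an ID column as a predictor.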

Part 4. Model Analysis 

After the model is created, it is important to understand which variables drive the predictions and how the error changes with the number of trees. R makes it easy to plot the errors of the model as the number of trees increases. This allows users to trade off between more trees with higher accuracy and fewer trees with lower computational time.

# Row_ID is only a bookkeeping column, so it is dropped from the predictors
fit <- randomForest(Species ~ . - Row_ID, train, ntree=500)
results <- predict(fit, test, type="response")

# Rank the input variables based on their effectiveness as predictors
varImpPlot(fit)
# To understand the error rate, let's plot the model's error as the number of trees increases
plot(fit)
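The numbers behind these two plots can also be inspected directly. The sketch below assumes the randomForest package is installed and, to stay self-contained, fits a fresh model on the full iris data under the hypothetical name fit_oob; the seed is arbitrary:

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility
fit_oob <- randomForest(Species ~ ., iris, ntree = 500)

# err.rate has one row per number of trees; the "OOB" column is the
# out-of-bag error after that many trees
oob_error <- fit_oob$err.rate[, "OOB"]
head(oob_error)

# Smallest number of trees that reaches the lowest OOB error
which.min(oob_error)

# Numeric version of what varImpPlot() draws
importance(fit_oob)
```

If the lowest out-of-bag error is reached well before 500 trees, a smaller ntree may be enough for this data set.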

Part 5. Handling Missing Values 

The last section of this tutorial involves one of the most time-consuming and important parts of the data analysis process: missing values. Very few machine learning algorithms can handle missing data. However, the randomForest package contains one of the most useful functions of all time, na.roughfix(). na.roughfix() replaces the NAs in a numeric column with the column median and the NAs in a factor column with the most frequent level. For this section we will first create some NAs in the data set, then replace them and run the prediction algorithm.

# Create some NA in the data.
iris.na <- iris
# For each of the 4 measurement columns, blank out between 1 and 20 random rows
for (i in 1:4)
  iris.na[sample(150, sample(20, 1)), i] <- NA
 
# Now we have a dataframe with NAs   
View(iris.na)   
#Adding na.action=na.roughfix   
#For numeric variables, NAs are replaced with column medians.   
#For factor variables, NAs are replaced with the most frequent levels (breaking ties at random) 
iris.narf <- randomForest(Species ~ . - Row_ID, iris.na, na.action=na.roughfix)
results <- predict(iris.narf, train, type="response")
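To see exactly what na.roughfix() does, it can be called on its own on a small data frame. The sketch below assumes the randomForest package is installed; the data frame toy is made up for illustration:

```r
library(randomForest)

# Made-up data frame with one numeric and one factor column
toy <- data.frame(
  size  = c(1, 2, NA, 4, 100),
  color = factor(c("red", "blue", "red", NA, "red"))
)

fixed <- na.roughfix(toy)
fixed
# size:  the NA becomes the median of 1, 2, 4, 100, which is 3
# color: the NA becomes "red", the most frequent level
```

Note that the median, not the mean, is used for numeric columns, so the outlier value of 100 does not pull the imputed value upward.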


Congratulations! You now know how to create machine learning models, fit data using those models, test a model’s accuracy and display it in a confusion matrix, validate the model, and quickly replace missing values. All of these are fundamental skills in machine learning!
Hallway Mathlete Data Scientist

I am a PhD student in Industrial Engineering at Penn State University. I did my undergrad at Iowa State in Industrial Engineering and Economics. Here is my academic website.

Thursday, April 21, 2016

R Stat Job Boards List





Job boards dedicated to R jobs:
And here is a list of more general data science job boards that are not R specific:
Please feel free to leave additional job boards in the comments and I will continue to update the list over time.




Tuesday, March 15, 2016

One of the Best and Most Underutilized Graphs in ggplot2



Understanding how a distribution of a variable changes over time can make a great visualization. These highly intuitive graphics can display a lot of information and can be simply rendered in R using ggplot2. However, based on my experience, they are one of the most underutilized graphs in R.

A good example of this style of graph is from my research. My research studies how data analysis can be utilized to improve the product design and manufacturing process. The style of graph discussed in this post is extremely useful for showing how design specifications change over time. Below you can see an example of how the specifications of secondary cameras on cellphones have changed over time. It is easily seen that before 2011 there were almost no secondary cameras, and by 2015 almost all cellphones released had some form of secondary camera.


To create these plots, first let's load ggplot2 and the diamonds data set.

library(ggplot2)
data(diamonds)
head(diamonds)


When creating these plots, I like to make sure I understand how the data is distributed over the x-axis. This is helpful because if there is a section of the x-axis with far fewer data points, the distribution on the y-axis can change rapidly over that section simply because of the small sample.
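One simple way to make this check, sketched here under the assumption that ggplot2 is installed, is a plain histogram of the x variable. The choice of 50 bins is arbitrary:

```r
library(ggplot2)
data(diamonds)

# How many diamonds fall in each price range
p_hist <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 50)
p_hist
```

Price ranges with very short bars are the sections where the group shares in the later plots rest on few diamonds and should be read with care.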

The plot below shows the distribution of diamonds grouped by cut as the price changes.

ggplot(data=diamonds, aes(x=price, group=cut, fill=cut)) + 
geom_density(adjust=1.5, position="stack")


In the next plots, instead of the stacked density, the y-axis shows each group's share (cut for the first example and clarity for the second) at each price.

ggplot(data=diamonds, aes(x=price, group=cut, fill=cut)) + 
geom_density(adjust=1.5, position="fill")


ggplot(data=diamonds, aes(x=price, group=clarity, fill=clarity)) +
geom_density(adjust=1.5, position="fill")
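One caveat: by default geom_density() scales each group's curve to the same total area, so groups of very different sizes are treated as equally large. If the goal is the share of actual diamonds at each price, a possible variant is to map y to the computed count. This is a sketch that assumes ggplot2 3.3.0 or later, where after_stat() is available:

```r
library(ggplot2)
data(diamonds)

# Weight each group's curve by its number of diamonds (count)
# instead of the per-group density
p_share <- ggplot(data = diamonds, aes(x = price, group = cut, fill = cut)) +
  geom_density(aes(y = after_stat(count)), adjust = 1.5, position = "fill")
p_share
```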









Sunday, February 28, 2016

Introduction to R: For Beginners Who Want to be Intermediate


This semester, I am managing an undergraduate on our research team. She has a basic engineering statistics background with a little experience doing analysis in Excel and understands some basic programming concepts. For her project with our group, she will have to learn R. 

I tried to compile a list of high quality videos and tutorials. While searching, I was surprised by two things: how much the quantity of resources has grown since I started to learn R, and the giant gap between beginner tutorials and the skills needed to be proficient in R. Many beginner tutorials explained how to assign or print out a variable but offered no introduction to real analysis or real-world application of these skills. Also, many of the tutorials only included basic R functions, for example plot() instead of ggplot2's ggplot(), which any good Hadley Wickham worshipper knows is akin to sin (for those of you who are just starting out, Hadley Wickham created ggplot2 and some of the most used R packages of all time). So I decided to post my compilation of tutorials and videos that will take you smoothly from knowing nothing about R to being an intermediate R programmer.


Step 1: Download and Setup
Download R  -   LINK
This is the basic program to run R on your computer.

Download RStudio  -   LINK
This is an IDE, or integrated development environment, for R. Always open your R files in RStudio and not basic R, because RStudio helps track variables, stores past graphs, and has many other features that you will use heavily in analyzing data.

Step 2: Understand the Basics
Follow this tutorial for R  -  LINK
This is all online but it gets the basics of the language down.

Do the Introduction to R tutorial and Data Manipulation in R with dplyr. If you have trouble finding them in the site, the links to each class are here.

Step 2.5: Really a tip more than a step
The rest of the steps will be inside RStudio. To run a line of code, click in the line of code you wish to run and press Ctrl + Enter on Windows or Command + Enter on Mac. The line should drop down to the console window and run. This will make more sense once you watch the later videos.

Step 3: Introduction to R and RStudio

Step 4: Data Visualization and Generating Graphs
R is made of groups of functions called “packages”. There are packages for everything in R from accessing databases to creating websites to creating predictive models to generating plots.

The most common package for data visualization is ggplot2. ggplot2 allows R users to quickly plot data stored in data frames. The following video steps you through using qplot (which stands for quick plot and is ggplot2's most popular function).

Introduction to R Programming - Module 8 (qplot)  -  LINK
The data for this video can be downloaded here and is called “bank.zip”: http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/
The file you want for this video is bank-full.csv.

In the video, replace the line:
attach(bank.marketing)

with the following code:
install.packages("ggplot2")
library(ggplot2)
file_path <- file.choose()
bank.marketing <- read.csv2(file_path)

This is how you read a csv data file separated by “;” into R. If it were separated by commas, the command would be read.csv(). Now the data from the selected csv is in the data frame “bank.marketing”.

The command file.choose() is used when the path to the file is unknown and will pop up a file search for you to select the desired file.
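The difference between the two read functions is easy to try out without downloading anything. The sketch below writes a tiny made-up file to a temporary location (the column names and values are hypothetical, not taken from bank-full.csv) and reads it both ways:

```r
# Write a tiny made-up file that uses ";" as the separator
tmp <- tempfile(fileext = ".csv")
writeLines(c("age;job;balance", "30;admin;1500", "45;technician;200"), tmp)

# read.csv2() expects ";" between fields
bank_ok <- read.csv2(tmp)
ncol(bank_ok)      # three columns

# read.csv() expects "," and so reads each line as a single field
bank_wrong <- read.csv(tmp)
ncol(bank_wrong)   # one column

unlink(tmp)        # remove the temporary file
```

A data frame with a single oddly named column right after import is usually a sign that the wrong separator was assumed.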

Step 5: Intro to Data Science
This is a great video for learning data analysis in R. There are three parts, and I would recommend watching all three.


Step 6: Intro to Web scraping
R Web Page Scraping Example Video

Web scraping is super easy in R and can allow you to grab data others cannot. It is often missed in intro R tutorials, which is a shame.


Step 7: Introduction to Machine Learning with Random Forest
A quick introduction tutorial to machine learning algorithms and best practices.
http://www.hallwaymathlete.com/2016/05/introduction-to-machine-learning-with.html

Step 8: Machine Learning Course
Hands down the best machine learning course on the planet. 

At this point, you are a full data scientist and you can gather and analyze any data on the web!