Tutorial 2: Getting Data In

DPI R Bootcamp

Jared Knowles

Overview

A quick note on R packages

I can haz packages?

# You can find and install packages within R
install.packages("foo")  # Name must be in quotes
install.packages(c("foo", "foo1", "foo2"))
# Packages get updated FREQUENTLY
update.packages()  # Gonna update them all

Finding Packages

Some Must Have Packages

plyr ggplot2 lme4 sp knitr

Data Management

The Working Directory

Preliminaries

RStudio Shortcuts

RStudio Projects

Manipulating Project Paths

When to use a full path

Ground Rules

Missing Data Symbols

a <- c(1, 2, 3)  # a is a vector with three elements
# Ask R for element 4
print(a[4])
## [1] NA
ls()  # get objects
## [1] "a"
a <- c(a, NULL)  # Append NULL onto a
print(a)
## [1] 1 2 3
# Notice no change
a <- c(a, NA)
print(a)
## [1]  1  2  3 NA

What the heck is Not a Number?

b <- 1
b <- sqrt(-b)
## Warning: NaNs produced
print(b)
## [1] NaN
pi/0
## [1] Inf

Beginning Analysis

Organization of a project is key

Create our Project

Read in Data

CSV is Our Friend

# Set working directory to the tutorial directory In RStudio can do this
# in 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
df <- read.csv("data/smalldata.csv")
# Note if we don't assign data to 'df' R just prints contents of table

Let’s Check What We Got

## 'data.frame':    2700 obs. of  6 variables:
##  $ schoolavg: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ schoollow: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ readSS   : num  357 264 370 347 373 ...
##  $ mathSS   : num  387 303 365 344 441 ...
##  $ proflvl  : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
##  $ race     : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...

Always Check Your Data

dim(df)
## [1] 2700   32
summary(df[, 1:5])
##        X               school         stuid            grade     
##  Min.   :     44   Min.   :   1   Min.   :   205   Min.   :3.00  
##  1st Qu.: 108677   1st Qu.: 195   1st Qu.: 44205   1st Qu.:4.00  
##  Median : 458596   Median : 436   Median : 88205   Median :5.00  
##  Mean   : 557918   Mean   : 460   Mean   : 99229   Mean   :5.44  
##  3rd Qu.: 972291   3rd Qu.: 717   3rd Qu.:132205   3rd Qu.:7.00  
##  Max.   :1499992   Max.   :1000   Max.   :324953   Max.   :8.00  
##      schid      
##  Min.   :  6.0  
##  1st Qu.: 15.0  
##  Median : 55.5  
##  Mean   : 52.0  
##  3rd Qu.: 75.0  
##  Max.   :105.0

Checking your data II

names(df)
##  [1] "X"           "school"      "stuid"       "grade"       "schid"      
##  [6] "dist"        "white"       "black"       "hisp"        "indian"     
## [11] "asian"       "econ"        "female"      "ell"         "disab"      
## [16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
## [21] "teachq"      "year"        "attday"      "schoolscore" "district"   
## [26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
## [31] "proflvl"     "race"
names(attributes(df))
## [1] "names"     "row.names" "class"
class(df)
## [1] "data.frame"

Data Warehouses, Oracle, SQL and RODBC

An Example From DPI

library(RODBC)  # interface driver for R
channel <- odbcConnect("Mydatabase.location", uid = "useR", pwd = "secret")
# establish connection we can do multiple connections in the same R
# session
# 
# WARNING: credentials stored in plain text unless you do some magic
table_list <- sqltables(channel, schema = "My_DB")
# Get a list of tables in the connection
colnames(sqlFetch(channel, "My_DB.TABLE_NAME", max = 1))
# get the column names of a table
datapull <- sqlQuery(channel, "SELECT DATA1, DATA2, DATA3 FROM My_DB.TABLE_NAME")
# execute some SQLquery, can paste any SQLquery as a string into this
# space

Missing Data

random <- sample(unique(df$stuid), 100)
random2 <- sample(unique(df$stuid), 120)
messdf <- df
messdf$readSS[messdf$stuid %in% random] <- NA
messdf$mathSS[messdf$stuid %in% random2] <- NA

Checking for Missing Data

summary(messdf[, c("stuid", "readSS", "mathSS")])
##      stuid            readSS        mathSS   
##  Min.   :   205   Min.   :252   Min.   :210  
##  1st Qu.: 44205   1st Qu.:430   1st Qu.:417  
##  Median : 88205   Median :495   Median :479  
##  Mean   : 99229   Mean   :496   Mean   :483  
##  3rd Qu.:132205   3rd Qu.:562   3rd Qu.:544  
##  Max.   :324953   Max.   :833   Max.   :828  
##                   NA's   :219   NA's   :264
nrow(messdf[!complete.cases(messdf), ])  # number of rows with missing data
## [1] 471
cleandf <- na.omit(messdf)
nrow(cleandf)
## [1] 2229

Now we have the data

dim(messdf)
## [1] 2700   32
str(messdf[, 18:26])
## 'data.frame':    2700 obs. of  9 variables:
##  $ luck       : int  0 1 0 1 0 0 1 0 0 0 ...
##  $ ability    : num  87.9 97.8 104.5 111.7 81.9 ...
##  $ measerr    : num  11.13 6.82 -7.86 -17.57 52.98 ...
##  $ teachq     : num  39.0902 0.0985 39.5389 24.1161 56.6806 ...
##  $ year       : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ attday     : int  180 180 160 168 156 157 169 180 170 152 ...
##  $ schoolscore: num  29.2 56 56 56 56 ...
##  $ district   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ schoolhigh : int  0 0 0 0 0 0 0 0 0 0 ...

Looking at data structure

names(messdf)
##  [1] "X"           "school"      "stuid"       "grade"       "schid"      
##  [6] "dist"        "white"       "black"       "hisp"        "indian"     
## [11] "asian"       "econ"        "female"      "ell"         "disab"      
## [16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
## [21] "teachq"      "year"        "attday"      "schoolscore" "district"   
## [26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
## [31] "proflvl"     "race"
length(unique(messdf$stuid))
## [1] 1200
length(unique(messdf$schid))
## [1] 6
length(unique(messdf$dist))
## [1] 3

Checking for Coding

unique(messdf$grade)
## [1] 3 4 5 6 7 8
unique(messdf$econ)
## [1] 0 1
unique(messdf$race)
## [1] B H I W A
## Levels: A B H I W
unique(messdf$disab)
## [1] 0 1

Next Steps

Exercises

  1. Read in the CSV file from the T drive or the project folder

  2. Read in the R data file from the T drive or the project folder

  3. Read in the sample datafile. Find the readSS (reading scale score) for student 205 in grade 4.

  4. Create a list of two attributes for each district in the df datafile.

  5. Think about your own data warehouse environment. Could R interface with it? How?

Other References

Session Info

It is good to include the session info, e.g. this document is produced with knitr version 0.8. Here is my session info:

print(sessionInfo(), locale = FALSE)
## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_0.9.2.1 lmtest_0.9-30   zoo_1.7-9       knitr_0.8      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-0   dichromat_1.2-4    digest_0.5.2      
##  [4] evaluate_0.4.2     formatR_0.6        grid_2.15.2       
##  [7] gtable_0.1.1       labeling_0.1       lattice_0.20-10   
## [10] MASS_7.3-22        memoise_0.1        munsell_0.4       
## [13] plyr_1.7.1         proto_0.3-9.2      RColorBrewer_1.0-5
## [16] reshape2_1.2.1     scales_0.2.2       stringr_0.6.1     
## [19] tools_2.15.1

Attribution and License

Public Domain Mark
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.