DPI R Bootcamp
Jared Knowles
packages
packages
are essentially free and open source add-ons for Rggplot2
packagelme4
package for mixed effects modelsscatterplot3d
package (also webGL
)sp
tm
caret
Rserve
fun
package# You can find and install packages within R
install.packages("foo") # Name must be in quotes
install.packages(c("foo", "foo1", "foo2"))
# Packages get updated FREQUENTLY
update.packages() # Gonna update them all
library(ggplot2)
and you’re done!plyr
ggplot2
lme4
sp
knitr
vignettes
which walk you through use cases for the packagecaret
has a great examplels()
commandmydata$thingIwant
or mydata@thingIwant
$
and @
distinction depends on whether this is an S3 or an S4 class$
doesn’t work, use @
wd
is both your friend and enemygetwd()
C:/Users/My Documents/My Project/
which is then set as the working directorysetwd()
command: setwd("PATH/TO/MY PROJECT/")
data
and handouts
/data
or ~data
which means, look for the data
folder in our current working directoryC:/Path/To/My/Data
or usr/home/jaredrocks
NA
or NaN
or NULL
depending on the context.a <- c(1, 2, 3) # a is a vector with three elements
# Ask R for element 4
print(a[4])
## [1] NA
NA
and NULL
?ls() # get objects
## [1] "a"
a <- c(a, NULL) # Append NULL onto a
print(a)
## [1] 1 2 3
# Notice no change
a <- c(a, NA)
print(a)
## [1] 1 2 3 NA
NA
can hold a place, NULL
cannotNaN
is even more special, and only holds things like imaginary numbersNaN
stands for “Not a Number”b <- 1
b <- sqrt(-b)
## Warning: NaNs produced
print(b)
## [1] NaN
pi/0
## [1] Inf
sin(Inf)
= NaN
ProjectTemplate
package for a more detailed philosophy about organizing projects.csv
.dta
.sas
.spss
.dat
and even .xls
and .xlsx
with some caredf
# Set working directory to the tutorial directory In RStudio can do this
# in 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
df <- read.csv("data/smalldata.csv")
# Note if we don't assign data to 'df' R just prints contents of table
## 'data.frame': 2700 obs. of 6 variables:
## $ schoolavg: int 1 1 1 1 1 1 1 1 1 1 ...
## $ schoollow: int 0 0 0 0 0 0 0 0 0 0 ...
## $ readSS : num 357 264 370 347 373 ...
## $ mathSS : num 387 303 365 344 441 ...
## $ proflvl : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
## $ race : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(df)
## [1] 2700 32
summary
summary(df[, 1:5])
## X school stuid grade
## Min. : 44 Min. : 1 Min. : 205 Min. :3.00
## 1st Qu.: 108677 1st Qu.: 195 1st Qu.: 44205 1st Qu.:4.00
## Median : 458596 Median : 436 Median : 88205 Median :5.00
## Mean : 557918 Mean : 460 Mean : 99229 Mean :5.44
## 3rd Qu.: 972291 3rd Qu.: 717 3rd Qu.:132205 3rd Qu.:7.00
## Max. :1499992 Max. :1000 Max. :324953 Max. :8.00
## schid
## Min. : 6.0
## 1st Qu.: 15.0
## Median : 55.5
## Mean : 52.0
## 3rd Qu.: 75.0
## Max. :105.0
names
names(df)
## [1] "X" "school" "stuid" "grade" "schid"
## [6] "dist" "white" "black" "hisp" "indian"
## [11] "asian" "econ" "female" "ell" "disab"
## [16] "sch_fay" "dist_fay" "luck" "ability" "measerr"
## [21] "teachq" "year" "attday" "schoolscore" "district"
## [26] "schoolhigh" "schoolavg" "schoollow" "readSS" "mathSS"
## [31] "proflvl" "race"
attributes
and class
names(attributes(df))
## [1] "names" "row.names" "class"
class(df)
## [1] "data.frame"
str
which lists all data elements in an object and their typelibrary(RODBC) # interface driver for R
channel <- odbcConnect("Mydatabase.location", uid = "useR", pwd = "secret")
# establish connection we can do multiple connections in the same R
# session
#
# WARNING: credentials stored in plain text unless you do some magic
table_list <- sqltables(channel, schema = "My_DB")
# Get a list of tables in the connection
colnames(sqlFetch(channel, "My_DB.TABLE_NAME", max = 1))
# get the column names of a table
datapull <- sqlQuery(channel, "SELECT DATA1, DATA2, DATA3 FROM My_DB.TABLE_NAME")
# execute some SQLquery, can paste any SQLquery as a string into this
# space
random <- sample(unique(df$stuid), 100)
random2 <- sample(unique(df$stuid), 120)
messdf <- df
messdf$readSS[messdf$stuid %in% random] <- NA
messdf$mathSS[messdf$stuid %in% random2] <- NA
summary
function helps identify missing datasummary(messdf[, c("stuid", "readSS", "mathSS")])
## stuid readSS mathSS
## Min. : 205 Min. :252 Min. :210
## 1st Qu.: 44205 1st Qu.:430 1st Qu.:417
## Median : 88205 Median :495 Median :479
## Mean : 99229 Mean :496 Mean :483
## 3rd Qu.:132205 3rd Qu.:562 3rd Qu.:544
## Max. :324953 Max. :833 Max. :828
## NA's :219 NA's :264
nrow(messdf[!complete.cases(messdf), ]) # number of rows with missing data
## [1] 471
na.omit
functioncleandf <- na.omit(messdf)
nrow(cleandf)
## [1] 2229
dim(messdf)
## [1] 2700 32
str(messdf[, 18:26])
## 'data.frame': 2700 obs. of 9 variables:
## $ luck : int 0 1 0 1 0 0 1 0 0 0 ...
## $ ability : num 87.9 97.8 104.5 111.7 81.9 ...
## $ measerr : num 11.13 6.82 -7.86 -17.57 52.98 ...
## $ teachq : num 39.0902 0.0985 39.5389 24.1161 56.6806 ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ attday : int 180 180 160 168 156 157 169 180 170 152 ...
## $ schoolscore: num 29.2 56 56 56 56 ...
## $ district : int 3 3 3 3 3 3 3 3 3 3 ...
## $ schoolhigh : int 0 0 0 0 0 0 0 0 0 0 ...
names(messdf)
## [1] "X" "school" "stuid" "grade" "schid"
## [6] "dist" "white" "black" "hisp" "indian"
## [11] "asian" "econ" "female" "ell" "disab"
## [16] "sch_fay" "dist_fay" "luck" "ability" "measerr"
## [21] "teachq" "year" "attday" "schoolscore" "district"
## [26] "schoolhigh" "schoolavg" "schoollow" "readSS" "mathSS"
## [31] "proflvl" "race"
id
variables, this is useful and it is good to check if these variables have multiple rows per id or not and we do this using length
and unique
length(unique(messdf$stuid))
## [1] 1200
length(unique(messdf$schid))
## [1] 6
length(unique(messdf$dist))
## [1] 3
unique(messdf$grade)
## [1] 3 4 5 6 7 8
unique(messdf$econ)
## [1] 0 1
unique(messdf$race)
## [1] B H I W A
## Levels: A B H I W
unique(messdf$disab)
## [1] 0 1
Read in the CSV file from the T drive or the project folder
Read in the R data file from the T drive or the project folder
Read in the sample datafile. Find the readSS
(reading scale score) for student 205 in grade 4.
Create a list of two attributes for each district in the df
datafile.
Think about your own data warehouse environment. Could R interface with it? How?
It is good to include the session info, e.g. this document is produced with knitr version 0.8
. Here is my session info:
print(sessionInfo(), locale = FALSE)
## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_0.9.2.1 lmtest_0.9-30 zoo_1.7-9 knitr_0.8
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-0 dichromat_1.2-4 digest_0.5.2
## [4] evaluate_0.4.2 formatR_0.6 grid_2.15.2
## [7] gtable_0.1.1 labeling_0.1 lattice_0.20-10
## [10] MASS_7.3-22 memoise_0.1 munsell_0.4
## [13] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5
## [16] reshape2_1.2.1 scales_0.2.2 stringr_0.6.1
## [19] tools_2.15.1
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.