DPI R Bootcamp
Jared Knowles
In this lesson we hope to learn:
# Set working directory to the tutorial directory In RStudio can do
# this in 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
load("data/smalldata.rda")
# Note if we don't assign data to 'df' R just prints contents of
# table
table
function is our friendtable(df$grade, df$year)
##
## 2000 2001 2002
## 3 200 100 200
## 4 100 200 100
## 5 200 100 200
## 6 100 200 100
## 7 200 100 200
## 8 100 200 100
table(df$year, df$race)
##
## A B H I W
## 2000 16 370 93 7 414
## 2001 16 370 93 7 414
## 2002 16 370 93 7 414
with(df[df$grade == 3, ], {
table(year, race)
})
## race
## year A B H I W
## 2000 4 78 22 4 92
## 2001 1 44 8 2 45
## 2002 0 74 20 1 105
with
specifies a data object to work on, in this case all elements of df
where grade==3
table
is the same command as above, but since we specified the data object in the with
statement, we don’t need the df$
in front of the variables of interestwith(df[df$___==___,]...)
with(df[df$race == "B", ], {
table(year, grade)
})
## grade
## year 3 4 5 6 7 8
## 2000 78 48 87 39 74 44
## 2001 44 78 48 87 39 74
## 2002 74 44 78 48 87 39
table(df$year, df$proflvl)
##
## advanced basic below basic proficient
## 2000 56 313 143 388
## 2001 229 183 64 424
## 2002 503 27 3 367
table(df$race, df$proflvl)
##
## advanced basic below basic proficient
## A 19 7 3 19
## B 160 302 162 486
## H 54 76 33 116
## I 7 4 1 9
## W 548 134 11 549
prop.table(table(df$race, df$proflvl))
##
## advanced basic below basic proficient
## A 0.0070370 0.0025926 0.0011111 0.0070370
## B 0.0592593 0.1118519 0.0600000 0.1800000
## H 0.0200000 0.0281481 0.0122222 0.0429630
## I 0.0025926 0.0014815 0.0003704 0.0033333
## W 0.2029630 0.0496296 0.0040741 0.2033333
round(prop.table(table(df$race, df$proflvl), 1), digits = 3)
##
## advanced basic below basic proficient
## A 0.396 0.146 0.062 0.396
## B 0.144 0.272 0.146 0.438
## H 0.194 0.272 0.118 0.416
## I 0.333 0.190 0.048 0.429
## W 0.441 0.108 0.009 0.442
1
tells R we want proportions rowise, a 2
goes columnwiseround
tells R to cut off some digits for usaggregate
function that can be used and helps us avoid the clustering problems aboveformula
(think I want variable X by grouping factor Y) and the statistic we want to compute# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)
## race readSS
## 1 A 508.7
## 2 B 460.2
## 3 H 473.2
## 4 I 485.2
## 5 W 533.2
aggregate
can take us a little further, we can use aggregate multiple variables at a timeaggregate(cbind(readSS, mathSS) ~ race, data = df, mean)
## race readSS mathSS
## 1 A 508.7 477.9
## 2 B 460.2 442.5
## 3 H 473.2 442.7
## 4 I 485.2 455.9
## 5 W 533.2 529.8
formula
syntaxhead(aggregate(cbind(readSS, mathSS) ~ race + grade, data = df, mean),
8)
## race grade readSS mathSS
## 1 A 3 397.8 454.8
## 2 B 3 409.8 371.6
## 3 H 3 417.7 364.2
## 4 I 3 407.6 449.3
## 5 W 3 481.1 450.7
## 6 A 4 456.0 438.2
## 7 B 4 426.9 408.1
## 8 H 4 418.8 404.6
head
or tail
just tell R to present us the firxt X items in an object, in this case 8 objectsag <- aggregate(readSS ~ race + grade, data = df, mean)
xtabs(readSS ~ ., data = ag)
## grade
## race 3 4 5 6 7 8
## A 397.8 456.0 479.1 539.5 600.4 605.3
## B 409.8 426.9 447.6 470.9 492.3 523.5
## H 417.7 418.8 481.2 489.1 500.3 534.2
## I 407.6 531.1 547.6 0.0 405.5 518.0
## W 481.1 498.5 517.1 546.6 565.2 596.1
ftable(xtabs(readSS ~ ., data = ag))
## grade 3 4 5 6 7 8
## race
## A 397.8 456.0 479.1 539.5 600.4 605.3
## B 409.8 426.9 447.6 470.9 492.3 523.5
## H 417.7 418.8 481.2 489.1 500.3 534.2
## I 407.6 531.1 547.6 0.0 405.5 518.0
## W 481.1 498.5 517.1 546.6 565.2 596.1
What is the mean reading score for 6th grade students with disabilities?
481.83
How many points is this from non-disabled students?
29.877
aggregate(cbind(readSS, mathSS) ~ disab + grade, data = df, mean)
## disab grade readSS mathSS
## 1 0 3 449.9 418.3
## 2 1 3 421.1 376.3
## 3 0 4 464.0 454.2
## 4 1 4 438.2 425.1
## 5 0 5 484.9 470.2
## 6 1 5 475.1 431.0
## 7 0 6 511.7 507.9
## 8 1 6 481.8 476.9
## 9 0 7 532.0 532.0
## 10 1 7 516.1 474.3
## 11 0 8 567.6 567.7
## 12 1 8 518.8 534.1
aggregate
we do:z <- aggregate(readSS ~ dist, FUN = mean, data = df)
z
## dist readSS
## 1 205 496.5
## 2 402 500.5
## 3 495 491.6
plyr
aggregate
is cool, but it isn’t very flexibleplyr
packageplyr
is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another typeplyr
package has a number of utilities to help us split-apply-combine across data types for both input and outputfor
loops to iterate over groups of students, because in R for
loops are slow, inefficient, and impracticalplyr
to the rescue, while not as fast as a compiled language, it is pretty dang good!
ddply
has before it combines it back for us when we do the call ddply(df,.(sex,age),"nrow")
plyr
has a straightforward syntaxplyr
functions are in the format XXply. The two X’s specify what the input file we are applying a function to is, and then what way we would like it outputted.plyr
d = dataframe, l= list, m=matrix, and a=array. By far the most common usage is ddply
plyr
in Tutorial 4 as welllibrary(plyr)
myag <- ddply(df, .(dist, grade), summarize, mean_read = mean(readSS, na.rm = T),
mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, na.rm = T),
sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), count_math = length(mathSS))
head(myag)
## dist grade mean_read mean_math sd_read sd_math count_read
## 1 205 3 451.7 406.1 93.52 72.45 200
## 2 205 4 438.9 459.9 77.76 79.10 100
## 3 205 5 487.9 462.6 85.30 75.10 200
## 4 205 6 514.7 526.8 76.83 66.04 100
## 5 205 7 530.0 521.5 84.82 74.85 200
## 6 205 8 575.5 581.2 79.58 83.45 100
## count_math
## 1 200
## 2 100
## 3 200
## 4 100
## 5 200
## 6 100
summarize
tells ddply
what we are doing to the data framemyag <- ddply(df, .(dist, grade), summarize, mean_read = mean(readSS, na.rm = T),
mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, na.rm = T),
sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), count_math = length(mathSS),
count_black = length(race[race == "B"]), per_black = length(race[race ==
"B"])/length(readSS))
summary(myag[, 7:10])
## count_read count_math count_black per_black
## Min. :100 Min. :100 Min. :36.0 Min. :0.360
## 1st Qu.:100 1st Qu.:100 1st Qu.:41.2 1st Qu.:0.386
## Median :150 Median :150 Median :60.5 Median :0.412
## Mean :150 Mean :150 Mean :61.7 Mean :0.411
## 3rd Qu.:200 3rd Qu.:200 3rd Qu.:82.0 3rd Qu.:0.439
## Max. :200 Max. :200 Max. :92.0 Max. :0.480
data.table
which is really handyplyr
versionWhat district ID has the highest mean score for 4th grade ELL students on reading? Math?
How many students are in these classes?
12 and 7 respectively
myag2 <- ddply(df, .(dist, grade, ell), summarize, mean_read = mean(readSS,
na.rm = T), mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS,
na.rm = T), sd_math = sd(mathSS, na.rm = T), count_read = length(readSS),
count_math = length(mathSS), count_black = length(race[race == "B"]),
per_black = length(race[race == "B"])/length(readSS))
subset(myag2, ell == 1 & grade == 4)
## dist grade ell mean_read mean_math sd_read sd_math count_read
## 4 205 4 1 403.0 392.9 64.52 39.09 16
## 16 402 4 1 443.1 388.7 79.52 53.28 29
## 28 495 4 1 408.8 431.9 77.47 70.77 13
## count_math count_black per_black
## 4 16 2 0.12500
## 16 29 6 0.20690
## 28 13 1 0.07692
order
function to sort datadf.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)
## [1] 106 1026 2 56 122 118
df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])
## stuid attday readSS mathSS
## 1 149995 180 357.3 387.3
## 2 13495 180 263.9 302.6
## 3 106495 160 369.7 365.5
## 4 45205 168 346.6 344.5
## 5 142705 156 373.1 441.2
## 6 14995 157 436.8 463.4
head(df.sort[, c(3, 23, 29, 30)])
## stuid attday readSS mathSS
## 106 106705 160 251.5 277.0
## 1026 80995 176 263.2 377.8
## 2 13495 180 263.9 302.6
## 56 122402 180 264.3 271.7
## 122 79705 168 266.4 318.7
## 118 40495 173 266.9 275.0
head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])
## stuid attday readSS mathSS
## 1631 145205 137 833.2 828.4
## 1462 107705 180 773.3 746.6
## 2252 122902 180 744.0 621.6
## 2341 44902 175 741.7 676.3
## 1482 134705 180 739.2 705.4
## 1630 14495 162 738.9 758.2
-
denotes we want descending order, R’s default is ascending orderM <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE, dimnames = list(NULL,
c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]
## a b
## [1,] 1 3
## [2,] 2 6
## [3,] 2 5
## [4,] 2 4
mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]
##
## 2000 2001 2002
## 4 100 200 100
## 6 100 200 100
## 8 100 200 100
## 3 200 100 200
## 5 200 100 200
## 7 200 100 200
mytab[order(mytab[, 2]), ]
##
## 2000 2001 2002
## 3 200 100 200
## 5 200 100 200
## 7 200 100 200
## 4 100 200 100
## 6 100 200 100
## 8 100 200 100
# Gives all rows that meet this requirement
df[df$readSS > 800, ]
## X school stuid grade schid dist white black hisp indian
## 1631 1281061 852 145205 8 15 205 1 0 0 0
## asian econ female ell disab sch_fay dist_fay luck ability
## 1631 0 0 1 0 0 0 0 0 108.3
## measerr teachq year attday schoolscore district schoolhigh
## 1631 6.325 155.7 2001 137 227.7 19 0
## schoolavg schoollow readSS mathSS proflvl race
## 1631 1 0 833.2 828.4 advanced W
df$grade[df$mathSS > 800]
## [1] 8
# Gives all values of grade that meet this requirement
df$grade[df$black == 1 & df$readSS > 650]
## [1] 8 7 8 6 6 7 8 7 8 8 8 4
df$black=1
or black==1
?spread
indicating whether a district has high or low spread among its student scoresmyag$spread <- NA # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
summary(myag$spread)
## high low
## 15 3
myag$spread <- NA # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
myag
called schoolperf
for mean_math
scores with the following coding scheme:Grade | Score Range | Code |
---|---|---|
3 | >425 | “Hi” |
4 | >450 | “Hi” |
5 | >475 | “Hi” |
6 | >500 | “Hi” |
7 | >525 | “Hi” |
8 | >575 | “Hi” |
dist
?myag$schoolperf <- "lo"
myag$schoolperf[myag$grade == 3 & myag$mean_math > 425] <- "hi"
myag$schoolperf[myag$grade == 4 & myag$mean_math > 450] <- "hi"
myag$schoolperf[myag$grade == 5 & myag$mean_math > 475] <- "hi"
myag$schoolperf[myag$grade == 6 & myag$mean_math > 500] <- "hi"
myag$schoolperf[myag$grade == 7 & myag$mean_math > 525] <- "hi"
myag$schoolperf[myag$grade == 8 & myag$mean_math > 575] <- "hi"
myag$schoolperf <- as.factor(myag$schoolperf)
summary(myag$schoolperf)
## hi lo
## 9 9
table(myag$dist, myag$schoolperf)
##
## hi lo
## 205 3 3
## 402 3 3
## 495 3 3
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- NA
head(myag[, 1:4], 2)
## dist grade mean_read mean_math
## 1 205 3 451.7 406.1
## 2 205 4 438.9 459.9
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- myag$mean_read[myag$dist ==
6 & myag$grade == 4]
head(myag[, 1:4], 2)
## dist grade mean_read mean_math
## 1 205 3 451.7 406.1
## 2 205 4 438.9 459.9
myag$mean_read[myag$grade == 3] <- NA
head(myag[order(myag$grade), 1:4])
## dist grade mean_read mean_math
## 1 205 3 NA 406.1
## 7 402 3 NA 431.9
## 13 495 3 NA 405.5
## 2 205 4 438.9 459.9
## 8 402 4 474.9 432.8
## 14 495 4 447.8 469.1
mean(myag$mean_math)
## [1] 490.7
mean(myag$mean_read)
## [1] NA
mean
of a vector with NAmean(myag$mean_math, na.rm = T)
## [1] 490.7
mean(myag$mean_read, na.rm = T)
## [1] 507.5
mean_read
of less than 500?length(myag$dist[myag$mean_read < 500])
## [1] 10
head(myag$mean_read[myag$mean_read < 500])
## [1] NA 438.9 487.9 NA 474.9 472.5
badvar <- myag$mean_read + myag$sd_read
summary(badvar)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 517 566 592 587 614 655 3
myag$sd_read[myag$count_read < 100 & myag$mean_read < 550] <- NA
length(myag$mean_read[myag$mean_read < 550])
## [1] 16
length(myag$mean_read[myag$mean_read < 550 & !is.na(myag$mean_read)])
## [1] 13
!is.na()
?is.na()
is a helpful function to identify TRUE if a value is missing!
is the reverse operatordf
which has multiple rows per student and myag
which has multiple rows per schoolnames(myag)
## [1] "dist" "grade" "mean_read" "mean_math"
## [5] "sd_read" "sd_math" "count_read" "count_math"
## [9] "count_black" "per_black" "spread" "schoolperf"
names(df[, c(2, 3, 4, 6)])
## [1] "school" "stuid" "grade" "dist"
dist
and grade
are in common. Is this ok?year
as well?merge
we want to consider with ?merge
merge
automagicallysimple_merge <- merge(df, myag)
names(simple_merge)
## [1] "grade" "dist" "X" "school"
## [5] "stuid" "schid" "white" "black"
## [9] "hisp" "indian" "asian" "econ"
## [13] "female" "ell" "disab" "sch_fay"
## [17] "dist_fay" "luck" "ability" "measerr"
## [21] "teachq" "year" "attday" "schoolscore"
## [25] "district" "schoolhigh" "schoolavg" "schoollow"
## [29] "readSS" "mathSS" "proflvl" "race"
## [33] "mean_read" "mean_math" "sd_read" "sd_math"
## [37] "count_read" "count_math" "count_black" "per_black"
## [41] "spread" "schoolperf"
simple_merge(df1,df2,by=c("id1","id2"))
simple_merge(df1,df2,by.x=c("id1","id2"),by.y=c("id1_a","id2_a"))
notsosimple_merge(df1,df2,all.x=TRUE,all.y=TRUE)
x
observations (df1), all the y
observations (df2) or neither, or bothhead(df[, 1:10], 3)
## X school stuid grade schid dist white black hisp indian
## 1 44 1 149995 3 105 495 0 1 0 0
## 2 53 1 13495 3 45 495 0 1 0 0
## 3 116 1 106495 3 45 495 0 1 0 0
head(widedf[, 28:40], 3)
## readSS.2000 mathSS.2000 proflvl.2000 race.2000 X.2001 school.2001
## 1 357.3 387.3 basic B 441000 1
## 2 263.9 302.6 below basic B 531000 1
## 3 369.7 365.5 basic B 1161000 1
## grade.2001 schid.2001 dist.2001 white.2001 black.2001 hisp.2001
## 1 4 105 495 0 1 0
## 2 4 45 495 0 1 0
## 3 4 45 495 0 1 0
## indian.2001
## 1 0
## 2 0
## 3 0
reshape
is the way to move from wide to long2700, 32
1200, 91
widedf <- reshape(df, timevar = "year", idvar = "stuid", direction = "wide")
idvar
represents the unit we want to represent a single row, in this case each unique student gets a single rowtimevar
is the variable that differenaties between two rows with the same student IDdirection
tells R we are going to move to wide datavarying
argument we can tell R explicitly which items we want to move widereshape
function works well in both directionslongdf <- reshape(widedf, idvar = "stuid", timevar = "year", varying = names(widedf[,
2:91]), direction = "long", sep = ".")
subset
function to get only 4th grade scoresg4 <- subset(df, grade == 4)
dim(g4)
## [1] 400 32
g4_b <- df[df$grade == 4, ]
identical(g4, g4_b)
## [1] TRUE
Say we are unhappy about attributing the school/grade mean score across years to student-year observations like we did in this lesson. Let’s fix it by first aggregating our student data frame to a school/grade/year data frame, and second by merging that new data frame with our student level data.
Sort the student-level data frame on attday
and ability
in descending order.
Find the highest proportion of black students in any school/grade/year combination.
It is good to include the session info, e.g. this document is produced with knitr version 0.8
. Here is my session info:
print(sessionInfo(), locale = FALSE)
## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods
## [7] base
##
## other attached packages:
## [1] plyr_1.7.1 ggplot2_0.9.2.1 lmtest_0.9-30 zoo_1.7-9
## [5] knitr_0.8
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-0 dichromat_1.2-4 digest_0.5.2
## [4] evaluate_0.4.2 formatR_0.6 grid_2.15.2
## [7] gtable_0.1.1 labeling_0.1 lattice_0.20-10
## [10] MASS_7.3-22 memoise_0.1 munsell_0.4
## [13] proto_0.3-9.2 RColorBrewer_1.0-5 reshape2_1.2.1
## [16] scales_0.2.2 stringr_0.6.1 tools_2.15.1
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.