Again, read in our dataset

# Set working directory to the tutorial directory In RStudio can do
# this in 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
load("data/smalldata.rda")
# Note if we don't assign data to 'df' R just prints contents of
# table

Aggregation

Sometimes we need to do some basic checking for the number of observations or types of observations in our dataset
To do this quickly and easily–the table function is our friend
Let’s look at our observations by year and grade

table(df$grade, df$year)

##    
##     2000 2001 2002
##   3  200  100  200
##   4  100  200  100
##   5  200  100  200
##   6  100  200  100
##   7  200  100  200
##   8  100  200  100

The first command gives the rows, the second gives the columns
Ugly, but effective

Aggregation can be more complex

Let’s aggregate by race and year

table(df$year, df$race)

##       
##          A   B   H   I   W
##   2000  16 370  93   7 414
##   2001  16 370  93   7 414
##   2002  16 370  93   7 414

Race is consistent across years, interesting
What if we want to only look at 3rd graders that year?

More complicated still

with(df[df$grade == 3, ], {
    table(year, race)
})

##       race
## year     A   B   H   I   W
##   2000   4  78  22   4  92
##   2001   1  44   8   2  45
##   2002   0  74  20   1 105

with specifies a data object to work on, in this case all elements of df where grade==3
table is the same command as above, but since we specified the data object in the with statement, we don’t need the df$ in front of the variables of interest

Quick exercise

Can you find the number of black students in each grade in each year?
hint: with(df[df$___==___,]...)
How many in year 2002, grade 6?
48
How many in 2001, grade 7?
39

Answer

with(df[df$race == "B", ], {
    table(year, grade)
})

##       grade
## year    3  4  5  6  7  8
##   2000 78 48 87 39 74 44
##   2001 44 78 48 87 39 74
##   2002 74 44 78 48 87 39

Tables cont.

This is really powerful for looking at the descriptive dimensions of the data, we can ask questions like:
how many students are at each proficiency level each year?

table(df$year, df$proflvl)

##       
##        advanced basic below basic proficient
##   2000       56   313         143        388
##   2001      229   183          64        424
##   2002      503    27           3        367

how many students are at each proficiency level by race?

table(df$race, df$proflvl)

##    
##     advanced basic below basic proficient
##   A       19     7           3         19
##   B      160   302         162        486
##   H       54    76          33        116
##   I        7     4           1          9
##   W      548   134          11        549

Proportional Tables

What if we aren’t interested in counts?
R makes it really easy to calculate proportions

prop.table(table(df$race, df$proflvl))

##    
##      advanced     basic below basic proficient
##   A 0.0070370 0.0025926   0.0011111  0.0070370
##   B 0.0592593 0.1118519   0.0600000  0.1800000
##   H 0.0200000 0.0281481   0.0122222  0.0429630
##   I 0.0025926 0.0014815   0.0003704  0.0033333
##   W 0.2029630 0.0496296   0.0040741  0.2033333

Hmmm, this is goofy. This tells us the proportion of each cell out of the total. Also, the digits are distracting. How can we fix this?

Try number 2

round(prop.table(table(df$race, df$proflvl), 1), digits = 3)

##    
##     advanced basic below basic proficient
##   A    0.396 0.146       0.062      0.396
##   B    0.144 0.272       0.146      0.438
##   H    0.194 0.272       0.118      0.416
##   I    0.333 0.190       0.048      0.429
##   W    0.441 0.108       0.009      0.442

The 1 tells R we want proportions rowise, a 2 goes columnwise
round tells R to cut off some digits for us
Proportions are just that, not in percentage terms (we need to multiply by 100 for this)
Can you make this table express percentages instead of proportions? How might that code look?
A few more problems arise–this pools all observations, including students across years

Aggregating Data

One of the most common questions will be to compute aggregates of data
R has an aggregate function that can be used and helps us avoid the clustering problems above
This works great for simple aggregation like scale score by race, we just need a formula (think I want variable X by grouping factor Y) and the statistic we want to compute

# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)

##   race readSS
## 1    A  508.7
## 2    B  460.2
## 3    H  473.2
## 4    I  485.2
## 5    W  533.2

Aggregate (II)

aggregate can take us a little further, we can use aggregate multiple variables at a time

aggregate(cbind(readSS, mathSS) ~ race, data = df, mean)

##   race readSS mathSS
## 1    A  508.7  477.9
## 2    B  460.2  442.5
## 3    H  473.2  442.7
## 4    I  485.2  455.9
## 5    W  533.2  529.8

We can add multiple grouping varialbes using the formula syntax

head(aggregate(cbind(readSS, mathSS) ~ race + grade, data = df, mean), 
    8)

##   race grade readSS mathSS
## 1    A     3  397.8  454.8
## 2    B     3  409.8  371.6
## 3    H     3  417.7  364.2
## 4    I     3  407.6  449.3
## 5    W     3  481.1  450.7
## 6    A     4  456.0  438.2
## 7    B     4  426.9  408.1
## 8    H     4  418.8  404.6

head or tail just tell R to present us the firxt X items in an object, in this case 8 objects

Crosstabs

We can build a systematic cross-tab now

ag <- aggregate(readSS ~ race + grade, data = df, mean)
xtabs(readSS ~ ., data = ag)

##     grade
## race     3     4     5     6     7     8
##    A 397.8 456.0 479.1 539.5 600.4 605.3
##    B 409.8 426.9 447.6 470.9 492.3 523.5
##    H 417.7 418.8 481.2 489.1 500.3 534.2
##    I 407.6 531.1 547.6   0.0 405.5 518.0
##    W 481.1 498.5 517.1 546.6 565.2 596.1

And prettier output

ftable(xtabs(readSS ~ ., data = ag))

##      grade     3     4     5     6     7     8
## race                                          
## A          397.8 456.0 479.1 539.5 600.4 605.3
## B          409.8 426.9 447.6 470.9 492.3 523.5
## H          417.7 418.8 481.2 489.1 500.3 534.2
## I          407.6 531.1 547.6   0.0 405.5 518.0
## W          481.1 498.5 517.1 546.6 565.2 596.1

Check your work

What is the mean reading score for 6th grade students with disabilities?
481.83
How many points is this from non-disabled students?
29.877

Answer II

aggregate(cbind(readSS, mathSS) ~ disab + grade, data = df, mean)

##    disab grade readSS mathSS
## 1      0     3  449.9  418.3
## 2      1     3  421.1  376.3
## 3      0     4  464.0  454.2
## 4      1     4  438.2  425.1
## 5      0     5  484.9  470.2
## 6      1     5  475.1  431.0
## 7      0     6  511.7  507.9
## 8      1     6  481.8  476.9
## 9      0     7  532.0  532.0
## 10     1     7  516.1  474.3
## 11     0     8  567.6  567.7
## 12     1     8  518.8  534.1

School Means

Consider the case we want to turn our student level data into school level data
Who hasn’t had to do this?!?
In aggregate we do:

z <- aggregate(readSS ~ dist, FUN = mean, data = df)
z

##   dist readSS
## 1  205  496.5
## 2  402  500.5
## 3  495  491.6

But I want more! I want to aggregate multiple variables. I want to do it across multiple groups. I want the output to be a dataframe I can work on.
Thank you plyr

Aggregate Isn’t Enough

aggregate is cool, but it isn’t very flexible
We can only use aggregate output as a table, which we have to convert to a data frame
There is a better way; the plyr package
plyr is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another type
We will focus here on summarizing and aggregating a data frame, but later in the bootcamp we’ll apply functions to lists and turn lists into data frames as well
This is cool!

The Logic of plyr

In R this is known as “split, apply, and combine”
Why? First, we split the data into groups by
Then we apply some function or another to that group (i.e. count the unique values of a variable)
Then we combine the data back together
This has some advantages–unlike other methods, the data does not have to be ordered by our id variable for this to work
The disadvantage is that this method is computationally expensive, even in R, and requires copying our data frame using up RAM

An Aside about Split-Apply-Combine

The plyr package has a number of utilities to help us split-apply-combine across data types for both input and output
In R we can’t just use for loops to iterate over groups of students, because in R for loops are slow, inefficient, and impractical
plyr to the rescue, while not as fast as a compiled language, it is pretty dang good!
And still readable

The logic of plyr

This shows how the dataframe is broken up into pieces and each piece then gets whatever functions, summaries, or transformations we apply to it

How plyr works on dataframes like what we will be doing

And this shows the output ddply has before it combines it back for us when we do the call ddply(df,.(sex,age),"nrow")

Using plyr

plyr has a straightforward syntax
All plyr functions are in the format XXply. The two X’s specify what the input file we are applying a function to is, and then what way we would like it outputted.
In plyr d = dataframe, l= list, m=matrix, and a=array. By far the most common usage is ddply
From a dataframe, to a dataframe.
We will see more of plyr in Tutorial 4 as well

plyr in Action

library(plyr)
myag <- ddply(df, .(dist, grade), summarize, mean_read = mean(readSS, na.rm = T), 
    mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, na.rm = T), 
    sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), count_math = length(mathSS))
head(myag)

##   dist grade mean_read mean_math sd_read sd_math count_read
## 1  205     3     451.7     406.1   93.52   72.45        200
## 2  205     4     438.9     459.9   77.76   79.10        100
## 3  205     5     487.9     462.6   85.30   75.10        200
## 4  205     6     514.7     526.8   76.83   66.04        100
## 5  205     7     530.0     521.5   84.82   74.85        200
## 6  205     8     575.5     581.2   79.58   83.45        100
##   count_math
## 1        200
## 2        100
## 3        200
## 4        100
## 5        200
## 6        100

This looks complex, but it only has a few components.
The first argument is the dataframe we are working on, the next argument is the level of identification we want to aggregate to
summarize tells ddply what we are doing to the data frame
Then we make a list of new variable names, and how to calculate them on each of the subsets in our large data frame
That’s it!

More plyr

This is great, we can quickly build a summary dataset from individual records
A few advanced tricks. How do we build counts and percentages into our dataset?

myag <- ddply(df, .(dist, grade), summarize, mean_read = mean(readSS, na.rm = T), 
    mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, na.rm = T), 
    sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), count_math = length(mathSS), 
    count_black = length(race[race == "B"]), per_black = length(race[race == 
        "B"])/length(readSS))
summary(myag[, 7:10])

##    count_read    count_math   count_black     per_black    
##  Min.   :100   Min.   :100   Min.   :36.0   Min.   :0.360  
##  1st Qu.:100   1st Qu.:100   1st Qu.:41.2   1st Qu.:0.386  
##  Median :150   Median :150   Median :60.5   Median :0.412  
##  Mean   :150   Mean   :150   Mean   :61.7   Mean   :0.411  
##  3rd Qu.:200   3rd Qu.:200   3rd Qu.:82.0   3rd Qu.:0.439  
##  Max.   :200   Max.   :200   Max.   :92.0   Max.   :0.480

Note for SQL Junkies

There is an alternate package to plyr called data.table which is really handy
It allows SQL like querying of R data frames
It is incredibly fast
It will be incorporated into the next plyr version
You can read up on it online

Quick Exercises in ddply

What if we want to compare how districts do on educating ELL students?
What district ID has the highest mean score for 4th grade ELL students on reading? Math?
66 in reading, 105 in math
How many students are in these classes?
12 and 7 respectively

Answer III

myag2 <- ddply(df, .(dist, grade, ell), summarize, mean_read = mean(readSS, 
    na.rm = T), mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, 
    na.rm = T), sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), 
    count_math = length(mathSS), count_black = length(race[race == "B"]), 
    per_black = length(race[race == "B"])/length(readSS))
subset(myag2, ell == 1 & grade == 4)

##    dist grade ell mean_read mean_math sd_read sd_math count_read
## 4   205     4   1     403.0     392.9   64.52   39.09         16
## 16  402     4   1     443.1     388.7   79.52   53.28         29
## 28  495     4   1     408.8     431.9   77.47   70.77         13
##    count_math count_black per_black
## 4          16           2   0.12500
## 16         29           6   0.20690
## 28         13           1   0.07692

Sorting

A key way to explore data in tabular form is to sort data
Sorting data in R can be dangerous as you can reorder the vectors of a dataframe
We use the order function to sort data

df.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)

## [1]  106 1026    2   56  122  118

Why is this wrong?
What is R giving us?
Rownames…

Correct Example

To fix it, we need to tell R to reorder the rownames in the order we want

df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])

##    stuid attday readSS mathSS
## 1 149995    180  357.3  387.3
## 2  13495    180  263.9  302.6
## 3 106495    160  369.7  365.5
## 4  45205    168  346.6  344.5
## 5 142705    156  373.1  441.2
## 6  14995    157  436.8  463.4

head(df.sort[, c(3, 23, 29, 30)])

##       stuid attday readSS mathSS
## 106  106705    160  251.5  277.0
## 1026  80995    176  263.2  377.8
## 2     13495    180  263.9  302.6
## 56   122402    180  264.3  271.7
## 122   79705    168  266.4  318.7
## 118   40495    173  266.9  275.0

Let’s clean it up a bit more

head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])

##       stuid attday readSS mathSS
## 1631 145205    137  833.2  828.4
## 1462 107705    180  773.3  746.6
## 2252 122902    180  744.0  621.6
## 2341  44902    175  741.7  676.3
## 1482 134705    180  739.2  705.4
## 1630  14495    162  738.9  758.2

Here we find the high performing students, note that the - denotes we want descending order, R’s default is ascending order
This is easy to correct

About sorting

Sorting works differently on some data types like matrices

M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE, dimnames = list(NULL, 
    c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]

##      a b
## [1,] 1 3
## [2,] 2 6
## [3,] 2 5
## [4,] 2 4

Tables are familiar

mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]

##    
##     2000 2001 2002
##   4  100  200  100
##   6  100  200  100
##   8  100  200  100
##   3  200  100  200
##   5  200  100  200
##   7  200  100  200

mytab[order(mytab[, 2]), ]

##    
##     2000 2001 2002
##   3  200  100  200
##   5  200  100  200
##   7  200  100  200
##   4  100  200  100
##   6  100  200  100
##   8  100  200  100

Filtering Data

Filtering data is an incredibly powerful feature and we have already seen it used to do some interesting things
Filtering data in R is loaded with trouble though, because the filtering arguments must be very carefully specified
Filtering is like a mini-sort
Always, always, always check your work
And remember, this is the place the NAs do the most damage
Let’s look at some examples

Basic Filtering a Column

# Gives all rows that meet this requirement
df[df$readSS > 800, ]

##            X school  stuid grade schid dist white black hisp indian
## 1631 1281061    852 145205     8    15  205     1     0    0      0
##      asian econ female ell disab sch_fay dist_fay luck ability
## 1631     0    0      1   0     0       0        0    0   108.3
##      measerr teachq year attday schoolscore district schoolhigh
## 1631   6.325  155.7 2001    137       227.7       19          0
##      schoolavg schoollow readSS mathSS  proflvl race
## 1631         1         0  833.2  828.4 advanced    W

df$grade[df$mathSS > 800]

## [1] 8

# Gives all values of grade that meet this requirement

This seems basic enough, let’s filter on multiple dimensions
Before the brackets we specify what we want returned, and within the brackets we present the logical expression to evaluate
Behind the scenes R does a logical test and gets the row numbers that match the logical expression
It then combines them back with the object in front of the brackets to return the values
So great we don’t have to do that!

Multiple filters

df$grade[df$black == 1 & df$readSS > 650]

##  [1] 8 7 8 6 6 7 8 7 8 8 8 4

What happens if we type df$black=1 or black==1?
Why won’t this work?

Using filters to assign values

We can also use filters to assign values as well
This is how you recode variables and create new ones
Let’s create a variable spread indicating whether a district has high or low spread among its student scores

myag$spread <- NA  # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
summary(myag$spread)

## high  low 
##   15    3

How does it work?

The previous block of code is a useful way to learn how to recode variables

myag$spread <- NA  # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)

Create a new variable in myag called schoolperf for mean_math scores with the following coding scheme:

Grade	Score Range	Code
3	>425	“Hi”
4	>450	“Hi”
5	>475	“Hi”
6	>500	“Hi”
7	>525	“Hi”
8	>575	“Hi”

All other values are coded as “low”
How many “high” and “low” observations do we have?
By dist?

Results

myag$schoolperf <- "lo"
myag$schoolperf[myag$grade == 3 & myag$mean_math > 425] <- "hi"
myag$schoolperf[myag$grade == 4 & myag$mean_math > 450] <- "hi"
myag$schoolperf[myag$grade == 5 & myag$mean_math > 475] <- "hi"
myag$schoolperf[myag$grade == 6 & myag$mean_math > 500] <- "hi"
myag$schoolperf[myag$grade == 7 & myag$mean_math > 525] <- "hi"
myag$schoolperf[myag$grade == 8 & myag$mean_math > 575] <- "hi"
myag$schoolperf <- as.factor(myag$schoolperf)
summary(myag$schoolperf)

## hi lo 
##  9  9

table(myag$dist, myag$schoolperf)

##      
##       hi lo
##   205  3  3
##   402  3  3
##   495  3  3

Let’s replace data

For district 6 let’s negate the grade 3 scores by replacing them with missing data

myag$mean_read[myag$dist == 6 & myag$grade == 3] <- NA
head(myag[, 1:4], 2)

##   dist grade mean_read mean_math
## 1  205     3     451.7     406.1
## 2  205     4     438.9     459.9

Let’s replace one data element with another

myag$mean_read[myag$dist == 6 & myag$grade == 3] <- myag$mean_read[myag$dist == 
    6 & myag$grade == 4]
head(myag[, 1:4], 2)

##   dist grade mean_read mean_math
## 1  205     3     451.7     406.1
## 2  205     4     438.9     459.9

Voila

Why do NAs matter so much?

Let’s consider the case above but insert some NA values for all 3rd grade tests

myag$mean_read[myag$grade == 3] <- NA
head(myag[order(myag$grade), 1:4])

##    dist grade mean_read mean_math
## 1   205     3        NA     406.1
## 7   402     3        NA     431.9
## 13  495     3        NA     405.5
## 2   205     4     438.9     459.9
## 8   402     4     474.9     432.8
## 14  495     4     447.8     469.1

NAs II

Now let’s calculate a few statistics:

mean(myag$mean_math)

## [1] 490.7

mean(myag$mean_read)

## [1] NA

Remember, NA values propogate, so R assumes an NA value could take literally any value, and as such it is impossible to know the mean of a vector with NA
We can override this though:

mean(myag$mean_math, na.rm = T)

## [1] 490.7

mean(myag$mean_read, na.rm = T)

## [1] 507.5

Beyond the Mean

But for other problems it is tricky
What if we want to know the number of rows that have a mean_read of less than 500?

length(myag$dist[myag$mean_read < 500])

## [1] 10

head(myag$mean_read[myag$mean_read < 500])

## [1]    NA 438.9 487.9    NA 474.9 472.5

And what if we want to add the standard deviation to these vectors?

badvar <- myag$mean_read + myag$sd_read
summary(badvar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     517     566     592     587     614     655       3

So we need to filter NAs explicitly

Consider the case where two sets of variables have different missing elements

myag$sd_read[myag$count_read < 100 & myag$mean_read < 550] <- NA
length(myag$mean_read[myag$mean_read < 550])

## [1] 16

length(myag$mean_read[myag$mean_read < 550 & !is.na(myag$mean_read)])

## [1] 13

What is !is.na() ?
is.na() is a helpful function to identify TRUE if a value is missing
! is the reverse operator
We are asking R if this value is not a missing value, and to only give us non-missing values back

Merging Data

It is unlikely all the data we will want resides in a single dataset and often we have to combine data from several sources
R makes this easy, but that simplicity comes at a cost–it can be easy to make mistakes if you don’t specify things carefully
Let’s merge attributes about a student’s school with the student row data
We might want to do that if we want to evaluate the performance of students in different school climates, and school climate was measured in part by the mean performance

Merging Data II

We have two data objects df which has multiple rows per student and myag which has multiple rows per school
What are the variables that link these two together?

names(myag)

##  [1] "dist"        "grade"       "mean_read"   "mean_math"  
##  [5] "sd_read"     "sd_math"     "count_read"  "count_math" 
##  [9] "count_black" "per_black"   "spread"      "schoolperf"

names(df[, c(2, 3, 4, 6)])

## [1] "school" "stuid"  "grade"  "dist"

It looks like dist and grade are in common. Is this ok?
Why might we want to consider re-aggregating with year as well?
For this example we won’t just yet

Merging 3

We have a few options with merge we want to consider with ?merge
In the simple case we let merge automagically

simple_merge <- merge(df, myag)
names(simple_merge)

##  [1] "grade"       "dist"        "X"           "school"     
##  [5] "stuid"       "schid"       "white"       "black"      
##  [9] "hisp"        "indian"      "asian"       "econ"       
## [13] "female"      "ell"         "disab"       "sch_fay"    
## [17] "dist_fay"    "luck"        "ability"     "measerr"    
## [21] "teachq"      "year"        "attday"      "schoolscore"
## [25] "district"    "schoolhigh"  "schoolavg"   "schoollow"  
## [29] "readSS"      "mathSS"      "proflvl"     "race"       
## [33] "mean_read"   "mean_math"   "sd_read"     "sd_math"    
## [37] "count_read"  "count_math"  "count_black" "per_black"  
## [41] "spread"      "schoolperf"

It looks like it did a good job

Merge Options

In complicated cases, merge has some important options we should review
First is the simple sounding ‘by’ argument:
simple_merge(df1,df2,by=c("id1","id2"))
We can also specify simple_merge(df1,df2,by.x=c("id1","id2"),by.y=c("id1_a","id2_a"))
This allows us to have different names for our ID variables
Now, what if we have two different sized objects and not all matches between them?
notsosimple_merge(df1,df2,all.x=TRUE,all.y=TRUE)
We can tell R whether we want to keep all of the x observations (df1), all the y observations (df2) or neither, or both

Reshaping Data

Reshaping data is a slightly different issue than aggregating data
Let’s review the two data types: long and wide

head(df[, 1:10], 3)

##     X school  stuid grade schid dist white black hisp indian
## 1  44      1 149995     3   105  495     0     1    0      0
## 2  53      1  13495     3    45  495     0     1    0      0
## 3 116      1 106495     3    45  495     0     1    0      0

Now let’s look at wide:

head(widedf[, 28:40], 3)

##   readSS.2000 mathSS.2000 proflvl.2000 race.2000  X.2001 school.2001
## 1       357.3       387.3        basic         B  441000           1
## 2       263.9       302.6  below basic         B  531000           1
## 3       369.7       365.5        basic         B 1161000           1
##   grade.2001 schid.2001 dist.2001 white.2001 black.2001 hisp.2001
## 1          4        105       495          0          1         0
## 2          4         45       495          0          1         0
## 3          4         45       495          0          1         0
##   indian.2001
## 1           0
## 2           0
## 3           0

How did we do this?

Wide data v. Long Data

The great debate
Most econometrics, panel, and time series datasets come wide and so these seem familiar
R for most cases prefers long data, including for most graphing and analysis functions
But, not all
So we have to learn both

The reshape Function

reshape is the way to move from wide to long
The data stays the same, but the shape of it changes
The long data had dimensions: 2700, 32
The wide data has dimensions: 1200, 91
How do we get to these numbers?
The rows in the wide dataframe represent unique students

Deconstructing reshape

widedf <- reshape(df, timevar = "year", idvar = "stuid", direction = "wide")

idvar represents the unit we want to represent a single row, in this case each unique student gets a single row
In this simple case timevar is the variable that differenaties between two rows with the same student ID
direction tells R we are going to move to wide data
As written all data will move, but using the varying argument we can tell R explicitly which items we want to move wide

What about Wide to Long?

We often need to do this to plot data in R
Luckily the reshape function works well in both directions

longdf <- reshape(widedf, idvar = "stuid", timevar = "year", varying = names(widedf[, 
    2:91]), direction = "long", sep = ".")

If our data is formatted nicely, R can do the guessing and identify the years for us by parsing the dataframe names

Subsetting Data

We have already seen a lot of subsetting examples above, which is what filtering is, but R provides some great shortcuts to this
Let’s look at the subset function to get only 4th grade scores

g4 <- subset(df, grade == 4)
dim(g4)

## [1] 400  32

This is equivalent to:

g4_b <- df[df$grade == 4, ]

These two elements are the same:

identical(g4, g4_b)

## [1] TRUE

That’s it

Now you can filter, subset, sort, recode, and aggregate data!
Let’s look at a few exercises to test these skills
Once these skills are mastered, we can begin to understand how to automate R to clean data with known errors, and to recode data in R so it is ready to be used for analysis
Then we can really take off!

Exercises

Say we are unhappy about attributing the school/grade mean score across years to student-year observations like we did in this lesson. Let’s fix it by first aggregating our student data frame to a school/grade/year data frame, and second by merging that new data frame with our student level data.
Sort the student-level data frame on attday and ability in descending order.
Find the highest proportion of black students in any school/grade/year combination.

Other References

Session Info

It is good to include the session info, e.g. this document is produced with knitr version 0.8. Here is my session info:

print(sessionInfo(), locale = FALSE)

## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods  
## [7] base     
## 
## other attached packages:
## [1] plyr_1.7.1      ggplot2_0.9.2.1 lmtest_0.9-30   zoo_1.7-9      
## [5] knitr_0.8      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-0   dichromat_1.2-4    digest_0.5.2      
##  [4] evaluate_0.4.2     formatR_0.6        grid_2.15.2       
##  [7] gtable_0.1.1       labeling_0.1       lattice_0.20-10   
## [10] MASS_7.3-22        memoise_0.1        munsell_0.4       
## [13] proto_0.3-9.2      RColorBrewer_1.0-5 reshape2_1.2.1    
## [16] scales_0.2.2       stringr_0.6.1      tools_2.15.1

Tutorial 3: Manipulating Data in R

Overview

Again, read in our dataset

Aggregation

Aggregation can be more complex

More complicated still

Quick exercise

Answer

Tables cont.

Proportional Tables

Try number 2

Aggregating Data

Aggregate (II)

Crosstabs

Check your work

Answer II

School Means

Aggregate Isn’t Enough

The Logic of plyr

An Aside about Split-Apply-Combine

The logic of plyr

How plyr works on dataframes like what we will be doing

Using plyr

plyr in Action

More plyr

Note for SQL Junkies

Quick Exercises in ddply

Answer III

Sorting

Correct Example

Let’s clean it up a bit more

About sorting

Filtering Data

Basic Filtering a Column

Multiple filters

Using filters to assign values

How does it work?

Results

Let’s replace data

Why do NAs matter so much?

NAs II

Beyond the Mean

So we need to filter NAs explicitly

Merging Data

Merging Data II

Merging 3

Merge Options

Reshaping Data

Wide data v. Long Data

The reshape Function

Deconstructing reshape

What about Wide to Long?

Subsetting Data

That’s it

Exercises

Other References

Session Info

Attribution and License