R Club

Reshape

posted by: Julin Maloof

Reshape

Often your data is collected in a format that is not convenient for analysis. In particular in R usually you want one observation per row–we will call this long format. However in excel it is common for each column to correspond to a different set of observations–we will consider this wide format. Consider:

#generate mock data frame in wide format
wide <- data.frame(
  
            wildtype=rnorm(10,4),
            phyA=rnorm(10,6),
            phyB=rnorm(10,7),
            det1=rnorm(10,2))
wide
##    wildtype  phyA  phyB    det1
## 1     4.607 6.246 7.076  1.1473
## 2     3.363 6.025 6.946  1.0420
## 3     3.507 5.925 5.828 -0.8757
## 4     4.846 6.071 8.055  1.3214
## 5     4.391 7.573 6.592  2.3292
## 6     5.120 6.008 8.177  4.3900
## 7     4.369 6.445 7.955  1.5134
## 8     2.887 3.669 7.276  1.5529
## 9     4.862 5.859 8.543  0.6698
## 10    3.885 4.244 8.238  1.3819

This might seem reasonable but it is the wrong format if, for example, you wanted to use ggplot, lm, lme, etc.

Functions in the reshape package will convert from the wide format to the long format and back.

Melt

melt() is used to change from the wide format to the long format.

library(reshape)
long <- melt(wide)
## Using as id variables
head(long)
##   variable value
## 1 wildtype 4.607
## 2 wildtype 3.363
## 3 wildtype 3.507
## 4 wildtype 4.846
## 5 wildtype 4.391
## 6 wildtype 5.120
tail(long)
##    variable  value
## 35     det1 2.3292
## 36     det1 4.3900
## 37     det1 1.5134
## 38     det1 1.5529
## 39     det1 0.6698
## 40     det1 1.3819

Melt arguments: id.vars and measure.vars

Your data frame might contain metadata that needs to be preserved for each data observation. For example if the hypocotyls in the above example came from two different plates we would want to know which plate each observation came from in the melted data frame. We can use the id.vars argument to specify which columns contain identification information that should be kept with each observation in a row. Alternatively we can use the measure.vars to specify the columns that contain measurements.

wide$plate <- rep(c("A", "B"), each = 5)
wide
##    wildtype  phyA  phyB    det1 plate
## 1     4.607 6.246 7.076  1.1473     A
## 2     3.363 6.025 6.946  1.0420     A
## 3     3.507 5.925 5.828 -0.8757     A
## 4     4.846 6.071 8.055  1.3214     A
## 5     4.391 7.573 6.592  2.3292     A
## 6     5.120 6.008 8.177  4.3900     B
## 7     4.369 6.445 7.955  1.5134     B
## 8     2.887 3.669 7.276  1.5529     B
## 9     4.862 5.859 8.543  0.6698     B
## 10    3.885 4.244 8.238  1.3819     B
long <- melt(wide, id.vars = "plate")
head(long)
##   plate variable value
## 1     A wildtype 4.607
## 2     A wildtype 3.363
## 3     A wildtype 3.507
## 4     A wildtype 4.846
## 5     A wildtype 4.391
## 6     B wildtype 5.120
# alternatively specify the meaurement variables:
long <- melt(wide, measure.vars = c("wildtype", "phyA", "phyB", "det1"))
# you can also use column numbers
long <- melt(wide, measure.vars = 1:4)

Melt arguments: variable_name

The variable_name argument allows us to specify a column name for the new column that will contain the previous column names once the data frame is melted.

long <- melt(wide, id.vars = "plate", variable_name = "genotype")
head(long)
##   plate genotype value
## 1     A wildtype 4.607
## 2     A wildtype 3.363
## 3     A wildtype 3.507
## 4     A wildtype 4.846
## 5     A wildtype 4.391
## 6     B wildtype 5.120

Additional melt functions

The above examples demonstrate melt as it applies to data frames. There are additional methods for arrays and lists. See ?melt for details.

Cast

The cast() function is used to summarize data that is in the long format. Here we specify a formula where each variable listed will form a new dimension.

# use cast to summarize by mean.
cast(long, plate ~ genotype, mean)
##   plate wildtype  phyA  phyB   det1
## 1     A    4.143 6.368 6.899 0.9928
## 2     B    4.224 5.245 8.038 1.9016
cast(long, genotype ~ plate, mean)  #changing the order changes the orientation.
##   genotype      A     B
## 1 wildtype 4.1429 4.224
## 2     phyA 6.3678 5.245
## 3     phyB 6.8993 8.038
## 4     det1 0.9928 1.902

We can specify that we want margin means

cast(long, plate ~ genotype, mean, margins = T)
##   plate wildtype  phyA  phyB   det1 (all)
## 1     A    4.143 6.368 6.899 0.9928 4.601
## 2     B    4.224 5.245 8.038 1.9016 4.852
## 3 (all)    4.184 5.806 7.469 1.4472 4.726

A “.” in the formula species no variables.

cast(long, plate ~ .)  #with no function specified this just counts the number of observations
## Aggregation requires fun.aggregate: length used as default
##   plate (all)
## 1     A    20
## 2     B    20
cast(long, plate ~ ., mean)
##   plate (all)
## 1     A 4.601
## 2     B 4.852
cast(long, genotype ~ ., mean)
##   genotype (all)
## 1 wildtype 4.184
## 2     phyA 5.806
## 3     phyB 7.469
## 4     det1 1.447

Posted by Julin Maloof

comments powered by Disqus