Reshape
Often your data is collected in a format that is not convenient for analysis. In particular in R usually you want one observation per row–we will call this long format. However in excel it is common for each column to correspond to a different set of observations–we will consider this wide format. Consider:
#generate mock data frame in wide format
wide <- data.frame(
wildtype=rnorm(10,4),
phyA=rnorm(10,6),
phyB=rnorm(10,7),
det1=rnorm(10,2))
wide
## wildtype phyA phyB det1
## 1 4.607 6.246 7.076 1.1473
## 2 3.363 6.025 6.946 1.0420
## 3 3.507 5.925 5.828 -0.8757
## 4 4.846 6.071 8.055 1.3214
## 5 4.391 7.573 6.592 2.3292
## 6 5.120 6.008 8.177 4.3900
## 7 4.369 6.445 7.955 1.5134
## 8 2.887 3.669 7.276 1.5529
## 9 4.862 5.859 8.543 0.6698
## 10 3.885 4.244 8.238 1.3819
This might seem reasonable but it is the wrong format if, for example, you wanted to use ggplot
, lm
, lme
, etc.
Functions in the reshape package will convert from the wide format to the long format and back.
Melt
melt()
is used to change from the wide format to the long format.
library(reshape)
long <- melt(wide)
## Using as id variables
head(long)
## variable value
## 1 wildtype 4.607
## 2 wildtype 3.363
## 3 wildtype 3.507
## 4 wildtype 4.846
## 5 wildtype 4.391
## 6 wildtype 5.120
tail(long)
## variable value
## 35 det1 2.3292
## 36 det1 4.3900
## 37 det1 1.5134
## 38 det1 1.5529
## 39 det1 0.6698
## 40 det1 1.3819
Melt arguments: id.vars and measure.vars
Your data frame might contain metadata that needs to be preserved for each data observation. For example if the hypocotyls in the above example came from two different plates we would want to know which plate each observation came from in the melted data frame. We can use the id.vars
argument to specify which columns contain identification information that should be kept with each observation in a row. Alternatively we can use the measure.vars
to specify the columns that contain measurements.
wide$plate <- rep(c("A", "B"), each = 5)
wide
## wildtype phyA phyB det1 plate
## 1 4.607 6.246 7.076 1.1473 A
## 2 3.363 6.025 6.946 1.0420 A
## 3 3.507 5.925 5.828 -0.8757 A
## 4 4.846 6.071 8.055 1.3214 A
## 5 4.391 7.573 6.592 2.3292 A
## 6 5.120 6.008 8.177 4.3900 B
## 7 4.369 6.445 7.955 1.5134 B
## 8 2.887 3.669 7.276 1.5529 B
## 9 4.862 5.859 8.543 0.6698 B
## 10 3.885 4.244 8.238 1.3819 B
long <- melt(wide, id.vars = "plate")
head(long)
## plate variable value
## 1 A wildtype 4.607
## 2 A wildtype 3.363
## 3 A wildtype 3.507
## 4 A wildtype 4.846
## 5 A wildtype 4.391
## 6 B wildtype 5.120
# alternatively specify the meaurement variables:
long <- melt(wide, measure.vars = c("wildtype", "phyA", "phyB", "det1"))
# you can also use column numbers
long <- melt(wide, measure.vars = 1:4)
Melt arguments: variable_name
The variable_name
argument allows us to specify a column name for the new column that will contain the previous column names once the data frame is melted.
long <- melt(wide, id.vars = "plate", variable_name = "genotype")
head(long)
## plate genotype value
## 1 A wildtype 4.607
## 2 A wildtype 3.363
## 3 A wildtype 3.507
## 4 A wildtype 4.846
## 5 A wildtype 4.391
## 6 B wildtype 5.120
Additional melt functions
The above examples demonstrate melt as it applies to data frames. There are additional methods for arrays and lists. See ?melt
for details.
Cast
The cast()
function is used to summarize data that is in the long format. Here we specify a formula where each variable listed will form a new dimension.
# use cast to summarize by mean.
cast(long, plate ~ genotype, mean)
## plate wildtype phyA phyB det1
## 1 A 4.143 6.368 6.899 0.9928
## 2 B 4.224 5.245 8.038 1.9016
cast(long, genotype ~ plate, mean) #changing the order changes the orientation.
## genotype A B
## 1 wildtype 4.1429 4.224
## 2 phyA 6.3678 5.245
## 3 phyB 6.8993 8.038
## 4 det1 0.9928 1.902
We can specify that we want margin means
cast(long, plate ~ genotype, mean, margins = T)
## plate wildtype phyA phyB det1 (all)
## 1 A 4.143 6.368 6.899 0.9928 4.601
## 2 B 4.224 5.245 8.038 1.9016 4.852
## 3 (all) 4.184 5.806 7.469 1.4472 4.726
A “.” in the formula species no variables.
cast(long, plate ~ .) #with no function specified this just counts the number of observations
## Aggregation requires fun.aggregate: length used as default
## plate (all)
## 1 A 20
## 2 B 20
cast(long, plate ~ ., mean)
## plate (all)
## 1 A 4.601
## 2 B 4.852
cast(long, genotype ~ ., mean)
## genotype (all)
## 1 wildtype 4.184
## 2 phyA 5.806
## 3 phyB 7.469
## 4 det1 1.447
Posted by Julin Maloof