Tidy Data in R
Last updated
Last updated
As a bioinformatician, you will get data in many shapes and forms. For instance, if you monitor the height of seedlings during a factorial experiment using warming and fertilization treatments, you might record your data like this:
This form of data is not suitable for your analysis by R. So, you may want to convert the above data as the following table:
Thus, by converting the input data into a suitable format for analysis we get tidy data.
In R, it is easiest to work with data that follow five basic rules:
Every variable is stored in its own column.
Every observation is stored in its own row—that is, every row corresponds to a single case.
Each value of a variable is stored in a cell of the table.
5. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.
6. There should be no extraneous information (footnotes, table titles, etc.)
NB: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.
In R, there are the two most widely used packages for converting your raw data into tidy data.
tidyr
package
reshape2
package
Most of the tutorials out there which utilize tidyr
package for this purpose, use the gather()
and spread()
functions. But, you will find that some tutorials also use, the pivot_longer()
and pivot_wider()
functions.
This creates confusion, about which one we should use.
Actually,pivot_longer()
and pivot_wider()
functions are updated versions of gather()
and spread()
functions, designed to be both simpler to use and to handle more use cases.
Developers of tidyr
recommend, to use pivot_longer()
and pivot_wider()
for new codes.
But, gather()
and spread()
aren't going away but are no longer under active development.
Here, are the terminologies between different packages, versions, and platforms:
tidyr
(version 1.0.0 or later)
pivot_longer()
pivot_wider()
tidyr
(version < 1.0.0)
gather()
spread()
reshape2
melt()
cast()
spreadsheets
unipivot
pivot
datasets
fold
unfold
In the following section, we will transform the data to get tidy data by tidyr
package.
tidyr
packagepivot_longer()
"lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider(
)
.
Let's use the pivot_longer()
command to transfer data from wider to longer format for various datasets.
The relig_income
dataset (from tidyr
package) stores counts based on a survey which (among other things) asked people about their religion and annual income.
View the dataset:
This dataset contains three variables:
religion
, stored in the rows,
income
spread across the column names, and
count
stored in the cell values.
To tidy this dataset we can use pivot_longer()
command:
Output:
The billboard
dataset (from tidyr
package) records the billboard rank of songs in the year 2000. It has a form similar to the relig_income
data, but the data encoded in the column names is really a number, not a string.
View the dataset:
Let's transfer the dataset into the longer format