🕶️Tidy Data in R
What is Tidy Data and why do we need it?
As a bioinformatician, you will get data in many shapes and forms. For instance, if you monitor the height of seedlings during a factorial experiment using warming and fertilization treatments, you might record your data like this:
Data in this form is not well suited to analysis in R, so you may want to convert it into the following table:
Converting the input data into a format suitable for analysis in this way gives us tidy data.
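As a rough sketch of such a conversion, the snippet below builds a small wide table and reshapes it with tidyr's pivot_longer(), which is covered later in this tutorial. The column names and height values are hypothetical, invented purely for illustration; they are not the tables from the original experiment.

```r
library(tidyr)

# Hypothetical wide-format record: one row per seedling,
# one column per measurement week (heights are made-up values).
seedlings_wide <- data.frame(
  seedling   = c("s1", "s2", "s3", "s4"),
  warming    = c("yes", "yes", "no", "no"),
  fertilizer = c("yes", "no", "yes", "no"),
  week1      = c(2.1, 1.8, 1.9, 1.5),
  week2      = c(3.4, 2.6, 2.8, 2.0)
)

# Tidy version: one row per seedling per week, with the week label
# and the measured height each stored in their own column.
seedlings_tidy <- pivot_longer(
  seedlings_wide,
  cols      = c(week1, week2),
  names_to  = "week",
  values_to = "height"
)

seedlings_tidy
```

In the tidy version, each row is a single observation (one seedling in one week), which is the shape most R modelling and plotting functions expect.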
In R, it is easiest to work with data that follow five basic rules:
1. Every variable is stored in its own column.
2. Every observation is stored in its own row; that is, every row corresponds to a single case.
3. Each value of a variable is stored in a cell of the table.
4. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.
5. There should be no extraneous information (footnotes, table titles, etc.).
NB: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.
How to get Tidy Data?
In R, the two most widely used packages for converting raw data into tidy data are:
tidyr package
reshape2 package
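Here is how the two reshaping operations are named across packages, versions, and platforms:

| Package or platform | Wide to long | Long to wide |
|---|---|---|
| tidyr (version 1.0.0 or later) | pivot_longer() | pivot_wider() |
| tidyr (version < 1.0.0) | gather() | spread() |
| reshape2 | melt() | dcast() (or acast()) |
| spreadsheets | unpivot | pivot |
| databases | fold | unfold |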
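The two directions named in this comparison can be sketched with the current tidyr functions. This is a minimal illustration only: the table, its column names (plot, year, yield), and its values are hypothetical.

```r
library(tidyr)

# Hypothetical long table: one row per (plot, year) pair.
long <- tibble::tibble(
  plot  = c("a", "a", "b", "b"),
  year  = c("2019", "2020", "2019", "2020"),
  yield = c(10, 12, 7, 9)
)

# Long -> wide: one column per year.
wide <- pivot_wider(long, names_from = year, values_from = yield)

# Wide -> long: back to one row per (plot, year) pair.
long_again <- pivot_longer(
  wide,
  cols      = !plot,
  names_to  = "year",
  values_to = "yield"
)

wide
long_again
```

Pivoting wider and then longer returns the same rows as the starting table, which is why the two functions are described as inverses.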
In the following section, we will transform data into tidy data using the tidyr package.
Pivot data from wide to long with the tidyr package
The tidyr function pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().
pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  ...
)
data: A data frame to pivot.
cols: <tidy-select> Columns to pivot into a longer format.
names_to: A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.
values_to: A string specifying the name of the column to create from the data stored in cell values. If names_to is a character vector containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.
Learn more in vignette("pivot") and the <tidy-select> documentation.
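The .value sentinel mentioned for names_to is easier to follow with a small example. The sketch below assumes a hypothetical table whose column names combine a measurement name and a time point separated by an underscore; the data frame, its column names, and its values are invented for illustration.

```r
library(tidyr)

# Hypothetical wide table: each column name encodes a measurement
# ("height" or "mass") and a time point ("t1" or "t2").
measurements <- tibble::tibble(
  id        = c(1, 2),
  height_t1 = c(10, 11),
  height_t2 = c(12, 14),
  mass_t1   = c(1.0, 1.2),
  mass_t2   = c(1.5, 1.9)
)

# ".value" marks the first part of each column name as the name of an
# output value column; the second part is stored in the "time" column.
# Because ".value" is used, values_to is ignored.
measurements_long <- pivot_longer(
  measurements,
  cols      = !id,
  names_to  = c(".value", "time"),
  names_sep = "_"
)

measurements_long
```

Here the result keeps height and mass as separate columns and adds one row per id and time point, rather than stacking all values into a single value column.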
Let's use the pivot_longer() command to convert data from wide to long format for several datasets.
# Load necessary libraries
library(tidyr)
library(dplyr)
library(readr)
# or load them all at once
library(tidyverse)
1. String data in column names
The relig_income dataset (from the tidyr package) stores counts based on a survey which (among other things) asked people about their religion and annual income.
Load and view the dataset:
data(relig_income)
relig_income
This dataset contains three variables:
religion, stored in the rows,
income, spread across the column names, and
count, stored in the cell values.
To tidy this dataset, we can use the pivot_longer() command:
# First option
pivot_longer(relig_income, !religion, names_to = "income", values_to = "count")
# Or, Second Option
relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")
First Option: The first version follows the syntax of pivot_longer() directly.
We specified relig_income for the data argument.
For the cols argument we used !religion, which means that every column except religion is pivoted into the longer format.
The names_to argument gives the name of the new column that stores the names of the pivoted columns (here, income).
The values_to argument gives the name of the new column that stores the values of the pivoted columns (here, count).
Second Option: This version does the same thing as the first, but passes the relig_income dataset to pivot_longer() with the pipe operator (%>%), so the data argument is not written inside the call.
Output:
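The result can be inspected with a short sketch like the one below, which also checks that the two options return the same table. The object names option_1 and option_2 are hypothetical and are introduced here only so the results can be compared.

```r
library(tidyr)
library(dplyr)

# Both call styles from above, stored so they can be compared.
option_1 <- pivot_longer(relig_income, !religion,
                         names_to = "income", values_to = "count")
option_2 <- relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")

# Check that the two options produce the same table.
identical(option_1, option_2)

# Compare the shape before and after pivoting.
dim(relig_income)  # wide: one row per religion
dim(option_1)      # long: one row per religion and income bracket

# Look at the first few rows of the tidy result.
head(option_1)
```

After pivoting, the table has three columns (religion, income, count), and each religion appears once for every income column of the original data.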

2. Numeric data in column names
The billboard dataset (from the tidyr package) records the Billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string.
Load and view the dataset:
data(billboard)
billboard
Let's pivot the dataset into the longer format:
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
The code starts by piping (%>%) the billboard dataset into pivot_longer(), so we don't need to specify the data argument inside the call.
cols = starts_with("wk"): Our targets are the columns whose names begin with "wk", so we select them with the starts_with() helper.
names_to = "week": After pivoting, the names of the target columns go into the week column.
values_to = "rank": The values from each target column go into the rank column.
values_drop_na = TRUE: Drops the rows that correspond to missing values. Not every song stays in the charts for all 76 weeks, so the structure of the input data forces the creation of unnecessary explicit NAs.
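With the call above, the week column still holds strings such as "wk1". One possible way to turn it into a number is sketched below, using the names_prefix and names_transform arguments of pivot_longer(). This assumes a tidyr version that supports names_transform (1.1.0 or later); the object name billboard_long is hypothetical.

```r
library(tidyr)
library(dplyr)

billboard_long <- billboard %>%
  pivot_longer(
    cols            = starts_with("wk"),
    names_to        = "week",
    names_prefix    = "wk",                     # strip the "wk" prefix
    names_transform = list(week = as.integer),  # convert "1" to 1L
    values_to       = "rank",
    values_drop_na  = TRUE
  )

# week is now an integer column, so it sorts and plots numerically.
class(billboard_long$week)
head(billboard_long)
```

Storing week as an integer makes it straightforward to sort by week or compute how long each song stayed in the charts.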