🕶️Tidy Data in R

What is Tidy Data and why do we need it?

As a bioinformatician, you will get data in many shapes and forms. For instance, if you monitor the height of seedlings during a factorial experiment using warming and fertilization treatments, you might record your data like this:

Data in this form is not suitable for analysis in R, so you will want to convert it into a table like the following:

Converting the input data into a format suitable for analysis in this way gives us tidy data.

In R, it is easiest to work with data that follow five basic rules:

1. Every variable is stored in its own column.
2. Every observation is stored in its own row; that is, every row corresponds to a single case.
3. Each value of a variable is stored in its own cell of the table.
4. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.
5. There should be no extraneous information (footnotes, table titles, etc.).

NB: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.
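As a rough illustration of that sneakier kind of untidiness, the sketch below builds a small seedling table in which the treatment variable is hidden in the column names, then tidies it. The data frame, its column names, and every height value are made up for illustration only; pivot_longer() is the tidyr function covered later in this tutorial.

```r
library(tidyr)

# Hypothetical seedling heights. The unit (cm) is documented here rather
# than stored in the cells. Each treatment combination has its own column,
# so the variable "treatment" lives in the column names (breaks rule 1).
seedlings_wide <- data.frame(
  replicate         = 1:3,
  control           = c(5.1, 4.8, 5.3),
  warmed            = c(6.0, 6.2, 5.9),
  fertilized        = c(6.5, 6.8, 6.4),
  warmed_fertilized = c(7.9, 8.1, 7.7)
)

# Tidy form: one row per observation (replicate x treatment),
# one column per variable (replicate, treatment, height).
seedlings_tidy <- pivot_longer(
  seedlings_wide,
  cols      = !replicate,
  names_to  = "treatment",
  values_to = "height"
)

seedlings_tidy
```

The wide table has 3 rows and 4 treatment columns, so the tidy version has 12 rows, one per measured height.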

How to get Tidy Data?

In R, the two most widely used packages for converting raw data into tidy data are:

tidyr package
reshape2 package

Most tutorials that use the tidyr package for this purpose rely on the gather() and spread() functions, but some use pivot_longer() and pivot_wider() instead.

This can cause confusion about which pair to use.

pivot_longer() and pivot_wider() are updated versions of gather() and spread(), designed to be both simpler to use and to handle more use cases.
The developers of tidyr recommend using pivot_longer() and pivot_wider() for new code.
gather() and spread() are not going away, but they are no longer under active development.
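To make the relationship between the two interfaces concrete, here is a minimal sketch that reshapes the same table with gather() and with pivot_longer(). The counts_wide data frame and its values are hypothetical and exist only for this comparison.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table, used only for illustration
counts_wide <- data.frame(
  species = c("oak", "pine"),
  plot_a  = c(3, 7),
  plot_b  = c(5, 2)
)

# Older interface: still works, but no longer actively developed
old_style <- gather(counts_wide, key = "plot", value = "n", -species)

# Newer interface, recommended for new code
new_style <- pivot_longer(
  counts_wide,
  cols      = !species,
  names_to  = "plot",
  values_to = "n"
)

# Rows present in old_style but missing from new_style: there should be none
unmatched <- anti_join(old_style, new_style, by = c("species", "plot", "n"))
nrow(unmatched)
```

Both calls produce the same observations. They can differ in row order and in the class of the result, because pivot_longer() returns a tibble.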

Here is how the equivalent operations are named across different packages, versions, and platforms:

| Package / version / platform | Wide to long | Long to wide |
|---|---|---|
| tidyr (version 1.0.0 or later) | pivot_longer() | pivot_wider() |
| tidyr (version < 1.0.0) | gather() | spread() |
| reshape2 | melt() | dcast() / acast() |
| spreadsheets | unpivot | pivot |
| databases | fold | unfold |

In the following section, we will use the tidyr package to transform data into tidy form.

Pivot data from wide to long with the `tidyr` package

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().
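The inverse relationship can be checked with a short round trip. This is only a sketch: the yearly data frame and its values are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one column per year
yearly <- data.frame(
  site  = c("A", "B"),
  y2019 = c(10, 12),
  y2020 = c(11, 15)
)

# Lengthen: 2 sites x 2 year columns become 4 rows
long <- pivot_longer(yearly, cols = !site, names_to = "year", values_to = "count")

# Widen again: the year labels become column names once more
back <- pivot_wider(long, names_from = year, values_from = count)

# pivot_wider() returns a tibble, so convert before comparing with the original
round_trip_ok <- isTRUE(all.equal(as.data.frame(back), yearly))
round_trip_ok
```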

pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  ...
)

data: A data frame to pivot.
cols: <tidy-select> Columns to pivot into a longer format.
names_to: A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.
values_to: A string specifying the name of the column to create from the data stored in cell values. If names_to is a character vector containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.

Learn more in vignette("pivot") and in the <tidy-select> documentation.

Let's use the pivot_longer() function to convert data from wide to long format for various datasets.
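The .value sentinel described under values_to is easier to follow with an example. In this sketch, the plants data frame, its column names, and its values are all hypothetical. Each column name combines a variable name and a week number, separated by an underscore.

```r
library(tidyr)

# Hypothetical measurements: column names look like <variable>_<week>
plants <- data.frame(
  plant    = c("a", "b"),
  height_1 = c(2.0, 2.5),
  height_2 = c(3.1, 3.4),
  leaves_1 = c(4, 5),
  leaves_2 = c(6, 8)
)

# ".value" says: the first part of each column name (height, leaves) names
# an output value column; the second part goes into a new "week" column.
# values_to is not needed here because the value column names come from .value.
plants_long <- pivot_longer(
  plants,
  cols      = !plant,
  names_to  = c(".value", "week"),
  names_sep = "_"
)

plants_long
```

The result has one row per plant and week, with separate height and leaves columns. The week column is stored as text because it comes from column names.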

# Load necessary libraries
library(tidyr)
library(dplyr)
library(readr)

# or, just
library(tidyverse)

1. String data in column names

The relig_income dataset (from tidyr package) stores counts based on a survey which (among other things) asked people about their religion and annual income.

View the dataset:

data(relig_income)  # load the dataset that ships with tidyr
relig_income        # print it to the console

This dataset contains three variables:

religion, stored in the rows,
income spread across the column names, and
count stored in the cell values.

To tidy this dataset, we can use the pivot_longer() function:

# First option
pivot_longer(relig_income, !religion, names_to = "income", values_to = "count")

# Or, Second Option
relig_income %>% 
    pivot_longer(!religion, names_to = "income", values_to = "count")

First option: The first call follows the pivot_longer() syntax directly.

We passed relig_income as the data argument.
For the cols argument we used !religion, which means that every column except religion is pivoted into the longer format.
The names_to argument gives the name of the new column (income) that stores the names of the pivoted columns.
The values_to argument gives the name of the new column (count) that stores the values from the pivoted columns.

Second option: This code does the same thing as the first, but it pipes the relig_income dataset into pivot_longer() with the pipe operator (%>%).
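A quick way to convince yourself that the two options are interchangeable is to store both results and compare them. This is a small sketch; the object names direct and piped are made up for the comparison.

```r
library(tidyr)
library(dplyr)

# First option: data passed as the first argument
direct <- pivot_longer(relig_income, !religion,
                       names_to = "income", values_to = "count")

# Second option: data piped into the call
piped <- relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")

# The pipe only changes how the call is written, not what it returns
identical(direct, piped)

# 18 religions x 10 income columns give 180 rows and 3 columns
dim(direct)
```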

Output:

2. Numeric data in column names

The billboard dataset (from tidyr package) records the billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string.

View the dataset:

data(billboard)  # load the dataset that ships with tidyr
billboard        # print it to the console

Let's convert the dataset into the longer format:

billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )

The code starts by piping (%>%) the billboard dataset into pivot_longer(), which is why we don't need to specify the data argument inside the call.
cols = starts_with("wk"): Our targets are the columns whose names start with "wk", so we select them with the starts_with() helper.
names_to = "week": The names of the pivoted columns go into the week column.
values_to = "rank": The values from each pivoted column go into the rank column.
values_drop_na = TRUE: Drops rows that correspond to missing values. Not every song stays in the charts for all 76 weeks, so the structure of the input data forces the creation of unnecessary explicit NAs.
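The call above leaves week as text such as "wk1", even though it really encodes a number. One possible follow-up, assuming tidyr 1.1.0 or later (where names_transform is available), is to strip the prefix with names_prefix and convert what remains to an integer. The object name billboard_weeks is made up for this sketch.

```r
library(tidyr)
library(dplyr)

billboard_weeks <- billboard %>%
  pivot_longer(
    cols            = starts_with("wk"),
    names_to        = "week",
    names_prefix    = "wk",                      # drop the leading "wk"
    names_transform = list(week = as.integer),   # "1" -> 1L, "2" -> 2L, ...
    values_to       = "rank",
    values_drop_na  = TRUE
  )

# week is now an integer column, so it sorts and plots as a number
class(billboard_weeks$week)
range(billboard_weeks$week)
```

With week stored as an integer, it becomes straightforward to filter on it, for example keeping only the first ten weeks with filter(week <= 10).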
