πŸ•ΆοΈTidy Data in R

What is Tidy Data and why do we need it?

As a bioinformatician, you will get data in many shapes and forms. For instance, if you monitor the height of seedlings during a factorial experiment using warming and fertilization treatments, you might record your data like this:

This form of data is not suitable for your analysis by R. So, you may want to convert the above data as the following table:

Thus, by converting the input data into a suitable format for analysis we get tidy data.

In R, it is easiest to work with data that follow five basic rules:

  1. Every variable is stored in its own column.

  2. Every observation is stored in its own rowβ€”that is, every row corresponds to a single case.

  3. Each value of a variable is stored in a cell of the table.

5. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.

6. There should be no extraneous information (footnotes, table titles, etc.)

NB: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the β€œsneakier” form of untidiness that violates at least one of the first three rules.

How to get Tidy Data?

In R, there are the two most widely used packages for converting your raw data into tidy data.

  1. tidyr package

  2. reshape2 package

Most of the tutorials out there which utilize tidyr package for this purpose, use the gather() and spread() functions. But, you will find that some tutorials also use, the pivot_longer() and pivot_wider() functions.

This creates confusion, about which one we should use.

  • Actually,pivot_longer() and pivot_wider() functions are updated versions of gather() and spread() functions, designed to be both simpler to use and to handle more use cases.

  • Developers of tidyr recommend, to use pivot_longer() and pivot_wider() for new codes.

  • But, gather() and spread() aren't going away but are no longer under active development.

Here, are the terminologies between different packages, versions, and platforms:

Package/Version/platformCommandCommand

tidyr (version 1.0.0 or later)

pivot_longer()

pivot_wider()

tidyr (version < 1.0.0)

gather()

spread()

reshape2

melt()

cast()

spreadsheets

unipivot

pivot

datasets

fold

unfold

In the following section, we will transform the data to get tidy data by tidyr package.

Pivot data from wide to long by tidyr package

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().

pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  ...
)

Let's use the pivot_longer()command to transfer data from wider to longer format for various datasets.

# Load necessary libraries
library(tidyr)
library(dplyr)
library(readr)

# or, just
library(tidyverse)

1. String data in column names

The relig_income dataset (from tidyr package) stores counts based on a survey which (among other things) asked people about their religion and annual income.

View the dataset:

data(relig_income)

This dataset contains three variables:

  • religion, stored in the rows,

  • income spread across the column names, and

  • count stored in the cell values.

To tidy this dataset we can use pivot_longer() command:

# First option
pivot_longer(relig_income, !religion, names_to = "income", values_to = "count")

# Or, Second Option
relig_income %>% 
    pivot_longer(!religion, names_to = "income", values_to = "count")

Output:

2. Numeric Data in Column Names:

The billboard dataset (from tidyr package) records the billboard rank of songs in the year 2000. It has a form similar to the relig_income data, but the data encoded in the column names is really a number, not a string.

View the dataset:

data(billboard)

Let's transfer the dataset into the longer format

billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )

Last updated