# Tidy Data in R

## What is Tidy Data and why do we need it?

As a bioinformatician, you will get data in many shapes and forms. For instance, if you monitor the height of seedlings during a factorial experiment using warming and fertilization treatments, you might record your data like this:

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FTqyY9ClJOrl43gXbTcBm%2Fimage.png?alt=media&#x26;token=b3131d00-63c6-4229-90f4-946104b894c6" alt=""><figcaption><p>Source: <a href="https://ourcodingclub.github.io/tutorials/data-manip-intro/">https://ourcodingclub.github.io/tutorials/data-manip-intro/</a></p></figcaption></figure>

This form of data is not suitable for your analysis by R. So, you may want to convert the above data as the following table:

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FIvyPcVRbNzfppaBkUcLY%2Fimage.png?alt=media&#x26;token=8e872e59-3bc0-45b9-85a8-d11bff5f6f06" alt=""><figcaption><p>Source: <a href="https://ourcodingclub.github.io/tutorials/data-manip-intro/">https://ourcodingclub.github.io/tutorials/data-manip-intro/</a></p></figcaption></figure>

Thus, by converting the input data into a suitable format for analysis we get **tidy data**.

In R, it is easiest to work with data that follow <mark style="color:green;">**five basic rules**</mark>:

1. Every **variable** is stored in its own **column**.
2. Every observation is stored in its own **row**—that is, every row corresponds to a single **case**.
3. Each **value** of a variable is stored in a **cell** of the table.

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FJpnRismIcPWsIovDA72x%2Fimage.png?alt=media&#x26;token=8a73f778-4703-47a2-8504-b8c4e10abcfc" alt=""><figcaption><p><a href="https://r4ds.hadley.nz/data-tidy.html">Visual Representation of Rules 1-3.</a></p></figcaption></figure>

5\. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a *codebook*.

6\. There should be no extraneous information (footnotes, table titles, etc.)

**NB**: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.

## How to get Tidy Data?

In R, there are the two most widely used packages for converting your raw data into tidy data.

1. <mark style="color:red;">`tidyr`</mark> package
2. <mark style="color:red;">`reshape2`</mark> package

{% hint style="info" %}
Most of the tutorials out there which utilize <mark style="color:green;">**`tidyr`**</mark> package for this purpose, use the <mark style="color:red;">`gather()`</mark> and <mark style="color:red;">`spread()`</mark> functions. But, you will find that some tutorials also use, the <mark style="color:red;">`pivot_longer()`</mark> and <mark style="color:red;">`pivot_wider()`</mark> functions.

<mark style="background-color:green;">This creates confusion, about which one we should use.</mark>

* Actually,<mark style="color:red;">`pivot_longer()`</mark> and <mark style="color:red;">`pivot_wider()`</mark> functions are updated versions of <mark style="color:red;">`gather()`</mark> and <mark style="color:red;">`spread()`</mark> functions, designed to be both simpler to use and to handle more use cases.
* Developers of <mark style="color:green;">**`tidyr`**</mark> recommend, to use <mark style="color:red;">`pivot_longer()`</mark> and <mark style="color:red;">`pivot_wider()`</mark> for new codes.
* But, <mark style="color:red;">`gather()`</mark> and <mark style="color:red;">`spread()`</mark> aren't going away but are no longer under active development.
  {% endhint %}

&#x20;Here, are the terminologies between different packages, versions, and platforms:

<table><thead><tr><th width="300.3333333333333">Package/Version/platform</th><th>Command</th><th>Command</th></tr></thead><tbody><tr><td><mark style="color:green;"><strong><code>tidyr</code></strong></mark> (version 1.0.0 or later)</td><td><mark style="color:red;"><code>pivot_longer()</code></mark></td><td><mark style="color:red;"><code>pivot_wider()</code></mark></td></tr><tr><td><mark style="color:green;"><strong><code>tidyr</code></strong></mark> (version &#x3C; 1.0.0)</td><td><mark style="color:red;"><code>gather()</code></mark></td><td><mark style="color:red;"><code>spread()</code></mark></td></tr><tr><td><mark style="color:green;"><strong><code>reshape2</code></strong></mark></td><td><mark style="color:red;"><code>melt()</code></mark></td><td><mark style="color:red;"><code>cast()</code></mark></td></tr><tr><td>spreadsheets</td><td>unipivot</td><td>pivot</td></tr><tr><td>datasets</td><td>fold</td><td>unfold</td></tr></tbody></table>

In the following section, we will transform the data to get tidy data by <mark style="color:green;">**`tidyr`**</mark> package.

## Pivot data from wide to long by <mark style="color:green;">`tidyr`</mark> package

<mark style="color:red;">`pivot_longer()`</mark> "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is [`pivot_wider(`](https://tidyr.tidyverse.org/reference/pivot_wider.html)`)`.

{% tabs %}
{% tab title="Syntax" %}

```r
pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  ...
)
```

{% endtab %}

{% tab title="Code Breakdown" %}

* <mark style="color:red;">**`data`**</mark>: A data frame to pivot.
* <mark style="color:red;">**`cols`**</mark>: <[`tidy-select`](https://tidyr.tidyverse.org/reference/tidyr_tidy_select.html)> Columns to pivot into a longer format.
* <mark style="color:red;">**`names_to`**</mark>: A character vector specifying the new column or columns to create from the information stored in the column names  <mark style="color:red;">`data`</mark> specified by <mark style="color:red;">`cols`</mark>.
* <mark style="color:red;">**`values_to`**</mark>: A string specifying the name of the column to create from the data stored in cell values. If <mark style="color:red;">`names_to`</mark> is a character containing the special <mark style="color:red;">`.value`</mark> sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.

Learn more in [`vignette("pivot")`](https://tidyr.tidyverse.org/articles/pivot.html), <[`tidy-select`](https://tidyr.tidyverse.org/reference/tidyr_tidy_select.html)>.
{% endtab %}
{% endtabs %}

Let's use the `pivot_longer()`command to transfer data from wider to longer format for various datasets.

```r
# Load necessary libraries
library(tidyr)
library(dplyr)
library(readr)

# or, just
library(tidyverse)
```

### 1. String data in column names

The `relig_income` dataset (from `tidyr` package) stores counts based on a survey which (among other things) asked people about their religion and annual income.

View the dataset:

```r
data(relig_income)
```

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FrKRvf1CZDmndHW9BcBI1%2Fimage.png?alt=media&#x26;token=6db86456-e531-4b72-aa59-8d7d6d3649d9" alt=""><figcaption><p>Part of the relig_income dataset</p></figcaption></figure>

This dataset contains three variables:

* `religion`, stored in the rows,
* `income` spread across the column names, and
* `count` stored in the cell values.

To tidy this dataset we can use <mark style="color:red;">`pivot_longer()`</mark> command:

{% tabs %}
{% tab title="Code" %}

```r
# First option
pivot_longer(relig_income, !religion, names_to = "income", values_to = "count")

# Or, Second Option
relig_income %>% 
    pivot_longer(!religion, names_to = "income", values_to = "count")
```

{% endtab %}

{% tab title="Code Breakdown" %}
**First Option:** The first code is quite straight forward as the syntax of <mark style="color:red;">`pivot_longer()`</mark> command.

1. We specified `relig_income` for `data` argument.
2. For the `cols` argument we used <mark style="color:red;">`!religion`</mark>, it means all columns except the **religion** column will be considered to pivot into a longer format.
3. In the `names_to` argument we specified the name of the column where all the pivoted columns will be stored.
4. The `values_to` argument contains values of all the pivoted columns.

**Second Option:** This code also works the same as the first one. But in this case we use R pipe sign (<mark style="color:red;">`%>%`</mark>) with the `relig_income` dataset.
{% endtab %}
{% endtabs %}

**Output:**

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FHNAQ3fQgSqZz1hafnRKN%2Fimage.png?alt=media&#x26;token=3715fdf2-e9bf-431d-b9fe-9c4a830fd781" alt=""><figcaption><p>The relig_income dataset after pivoting to long format.</p></figcaption></figure>

### 2. Numeric Data in Column Names:

The `billboard` dataset (from `tidyr` package) records the billboard rank of songs in the year 2000. It has a form similar to the `relig_income` data, but the data encoded in the column names is really a number, not a string.

View the dataset:

```r
data(billboard)
```

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FXRLYgCF9m3DNZXzWEUab%2Fimage.png?alt=media&#x26;token=acffcecc-d4d5-4fa0-8106-33d54a09f611" alt=""><figcaption><p>Part of the billboard dataset</p></figcaption></figure>

Let's transfer the dataset into the longer format

{% tabs %}
{% tab title="Code" %}

```r
billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )
```

{% endtab %}

{% tab title="Code Breakdown" %}

* The code started with piping (<mark style="color:red;">`%>%`</mark>) the `billboard` dataset with the <mark style="color:red;">`pivot_longer()`</mark> command. That's why we don't need to specify the `data` argument inside the command.&#x20;
* In this dataset, our targets were the columns named **"wk"**. That's why in the `cols` argument we used <mark style="color:red;">`start_with()`</mark> command to specify that.
* `names_to = "week"`: After pivoting our target columns will go under the **week** column.
* `values_to = "rank"`: Values from each column will go under the **rank** column.
* `values_drop_na = TRUE`: to drop rows that correspond to missing values. Not every song stays in the charts for all 76 weeks, so the structure of the input data forces the creation of unnecessary explicit `NA`s.
  {% endtab %}
  {% endtabs %}
