# DataFrame

## View the Data

To view the data:

```r
View(DataFrameName)
```

## View the Head and Tail of the DataFrame

```r
# View Head
head(DataFrameName)

# View Tail
tail(DataFrameName)
```

## View the dimension or structure of the DataFrame

```r
# View the number of rows and columns
dim(DataFrameName)

# View the overall structure
str(DataFrameName)
```

## Summarize the data frame

```r
summary(DataFrameName)
```

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FDw885ZgEFbkXwhpoS2NV%2Fimage.png?alt=media&#x26;token=de9fde20-40d7-4a67-a590-076fdd11086b" alt=""><figcaption><p>Output of the summary function in R</p></figcaption></figure>

{% hint style="info" %}
**Code Breakdown:**

Here,

* If a column of the dataframe is **factor** type, the `summary` function shows the number of observations in each factor level.
* If a column is **numeric** type, the `summary` function shows some basic statistics (min, max, mean, median, etc.) for that column.
  {% endhint %}
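As a minimal sketch of the two cases described above, the following uses a small made-up dataframe (the names `df`, `Species`, and `Weight` are hypothetical, not from the screenshot):

```r
# Hypothetical sample data: one factor column and one numeric column
df <- data.frame(
  Species = factor(c("cat", "dog", "cat", "bird")),
  Weight  = c(4.2, 12.5, 3.8, 0.3)
)

# Whole dataframe: counts per level for Species, basic statistics for Weight
summary(df)

# summary() also works on a single column
species_counts <- summary(df$Species)  # named counts, one per factor level
weight_stats   <- summary(df$Weight)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```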

## Get the content of a specific column

```r
DataFrameName$ColumnName
```

{% hint style="info" %}
**Code Breakdown:**

* First, specify the dataframe name.
* Then put a **$** sign.
* After that, add the **column** name.

This code returns the contents of that column as a vector.
{% endhint %}
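A short sketch of column access on a made-up dataframe (`df`, `Name`, and `Age` are hypothetical names). It also shows two related base R forms, `[[ ]]` and `[ ]`, for comparison:

```r
# Hypothetical sample data
df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35))

ages      <- df$Age       # the column contents as a numeric vector
same_ages <- df[["Age"]]  # equivalent; handy when the column name is stored in a variable
age_df    <- df["Age"]    # single brackets keep a one-column dataframe instead

is.vector(ages)           # TRUE
is.data.frame(age_df)     # TRUE
```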

## Extract unique data and their number in a column

Let's say the dataframe contains a column of **factor** type, and you want to view its **unique** values and how many of them there are.

```r
# View unique content
unique(DataFrameName$ColumnName)

# Get the number of unique content
length(unique(DataFrameName$ColumnName))
```
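If you also want to know how often each unique value occurs, base R's `table()` is one option. A minimal sketch on made-up data (`df` and `Colour` are hypothetical names):

```r
# Hypothetical sample data
df <- data.frame(Colour = factor(c("red", "blue", "red", "green", "red")))

unique_values <- unique(df$Colour)          # the distinct values
n_unique      <- length(unique(df$Colour))  # how many distinct values there are
counts        <- table(df$Colour)           # how many rows hold each value
counts
```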

## Get the observation of a specific cell

To get the value of a specific cell you have to specify the **Row Number** and **Column Number** of that cell.

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FlzWRDcm0eYhw31NeFrcE%2Fimage.png?alt=media&#x26;token=29fbf1d7-5e5e-4135-b7c2-6faad7e79f03" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
You can even specify the column name to get the value or observation of a specific row.
{% endhint %}

<figure><img src="https://3681152927-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FvUtrdiIkCrBX60yTgn1m%2Fuploads%2FZ7DIb8WQSaZfQ87BCJMz%2Fimage.png?alt=media&#x26;token=8c8f2760-62ee-4846-8cea-1ce2c16d1337" alt=""><figcaption></figcaption></figure>
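As a copyable text version of the idea, here is a minimal sketch on a made-up dataframe (the data and names are hypothetical and may differ from the screenshots):

```r
# Hypothetical sample data
df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35))

# Row number first, then column number
cell_by_number <- df[2, 2]

# Row number first, then column name
cell_by_name <- df[2, "Age"]
```

Both calls return the same value here, because `Age` is the second column.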

## Get all the observations from a row

```r
DataFrameName[6, ]
```

{% hint style="info" %}
**Code Breakdown:**

The above code will extract all the values from the 6th row of your given dataframe.

You just have to specify the row number.

Similarly, you can give a column number to get all the rows from that column.

`DataFrameName[ , 5]`
{% endhint %}
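One detail worth knowing: selecting a row returns a dataframe, while selecting a single column returns a plain vector unless you add `drop = FALSE`. A small sketch on made-up data (all names are hypothetical):

```r
# Hypothetical sample data
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age  = c(25, 30, 35),
  City = c("Oslo", "Lima", "Pune")
)

row_2    <- df[2, ]                # one-row dataframe
col_2    <- df[, 2]                # plain numeric vector
col_2_df <- df[, 2, drop = FALSE]  # one-column dataframe
```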

## Get a specific observation by its value

Subsetting with brackets using row and column numbers can be quite tedious if you have a large dataset and you don't know where the observations you're looking for are located. It is also fragile: if you hard-code a number in your script and later add some rows, you might not be selecting the same observations anymore. That's why we can use **logical operations** to access specific parts of the data that match our specifications.

```r
# Let's access the values for number 603
DataFrameName[DataFrameName$ColumnName == 603, ]
```

{% hint style="info" %}
**Code Breakdown:**

Let's say your column contains the value 603 and you want to access the row that holds it. Previously we used row and column numbers, which is not always a good idea.

* Here, we used the **column name** and the **observation** value.
* `==` is a logical operator that tests for equality.
  {% endhint %}

<details>

<summary>Operators for logical operations:</summary>

Here are some of the most commonly used operators to manipulate data. When you use them to create a subsetting condition, R will evaluate the expression, and return only the observations for which the condition is met.

* `==`: equals exactly
* `<`, `<=`: is smaller than, is smaller than or equal to
* `>`, `>=`: is bigger than, is bigger than or equal to
* `!=`: not equal to
* `%in%`: belongs to one of the following (usually followed by a vector of possible values)
* `&`: AND operator, allows you to chain two conditions which must both be met
* `|`: OR operator, to chain two conditions when at least one should be met
* `!`: NOT operator, to specify things that should be omitted

</details>
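To see what these operators return before using them inside brackets, you can try them on a plain vector. A minimal sketch (the vector `x` is made up for illustration):

```r
x <- c(3, 8, 12, 20)

is_eight   <- x == 8            # FALSE  TRUE FALSE FALSE
not_eight  <- x != 8            #  TRUE FALSE  TRUE  TRUE
at_least12 <- x >= 12           # FALSE FALSE  TRUE  TRUE
in_set     <- x %in% c(3, 20)   #  TRUE FALSE FALSE  TRUE
both       <- x > 5 & x < 15    # FALSE  TRUE  TRUE FALSE
either     <- x < 5 | x > 15    #  TRUE FALSE FALSE  TRUE
negated    <- !(x > 5)          #  TRUE FALSE FALSE FALSE

# A logical vector inside brackets keeps only the TRUE positions
x[both]                         # 8 12
```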

## Subset observation based on one condition

We can use logical operators to denote our conditions in a column and subset the observations that meet the condition.

(Subsetting means extracting observations from a larger dataset.)

```r
# Subset all the observations greater than 10
DataFrameName[DataFrameName$ColumnName > 10, ]

# This code gives the same result as the previous code
# Here, we negate "less than or equal to 10" with the not (!) operator
DataFrameName[!(DataFrameName$ColumnName <= 10), ]
```
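One caveat with this style of subsetting: if the column contains missing values (`NA`), the condition is `NA` for those rows and bracket subsetting returns them as rows full of `NA`. A minimal sketch on made-up data (`df`, `ID`, and `Value` are hypothetical names) showing two base R ways to avoid that:

```r
# Hypothetical sample data with one missing value
df <- data.frame(ID = 1:5, Value = c(4, 15, NA, 22, 10))

with_na_rows <- df[df$Value > 10, ]         # 3 rows: IDs 2 and 4 plus one all-NA row
no_na_rows   <- df[which(df$Value > 10), ]  # which() keeps only positions that are TRUE
same_result  <- subset(df, Value > 10)      # subset() also drops rows where the condition is NA
```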

## Subset observations based on two conditions

```r
DataFrameName[DataFrameName$Column_01 == 2 | DataFrameName$Column_01 == 5 , ]
```

{% hint style="info" %}
**Code Breakdown:**

The above code subsets all the rows where the value of **Column\_01** is 2 or (|) 5.
{% endhint %}

{% code overflow="wrap" %}

```r
DataFrameName[DataFrameName$Column_01 == 7 & DataFrameName$Column_02 %in% c(100:200) , ]
```

{% endcode %}

{% hint style="info" %}
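**Code Breakdown:**

This code extracts all the observations that have the value **7** in **Column\_01** and a whole-number value **from 100 to 200** (inclusive) in **Column\_02**. `%in%` tests for exact matches against the listed values, so a value such as 150.5 would not be selected.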
{% endhint %}
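If **Column\_02** can hold non-integer values and you want a true range check, combining `>=` and `<=` is one alternative. A minimal sketch on made-up data (the values are hypothetical; the column names follow the example above):

```r
# Hypothetical sample data
df <- data.frame(
  Column_01 = c(7, 7, 7, 3),
  Column_02 = c(150, 150.5, 250, 120)
)

# %in% matches only the whole numbers 100, 101, ..., 200
in_set <- df[df$Column_01 == 7 & df$Column_02 %in% 100:200, ]

# A range check also keeps 150.5
in_range <- df[df$Column_01 == 7 & df$Column_02 >= 100 & df$Column_02 <= 200, ]
```

Here `in_set` keeps only the first row, while `in_range` keeps the first two.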

## Change the name of the column in the Dataframe

### Option 1: Use column index

{% tabs %}
{% tab title="Syntax" %}

```r
colnames(df)[col_indx] <- "new_col_name"
```

{% endtab %}

{% tab title="Example" %}

```r
# Assigning a new name to the second column
colnames(df)[2] <- "new_col2"
```

{% endtab %}
{% endtabs %}

### Option 2: Use column name

{% tabs %}
{% tab title="Syntax" %}

```r
colnames(df)[colnames(df) == "old_col_name"] <- "new_col_name"
```

{% endtab %}

{% tab title="Example" %}

```r
# Create a sample dataframe
df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35))

# Rename the "Age" column to "Years"
colnames(df)[colnames(df) == "Age"] <- "Years"
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
We can use <mark style="color:red;">`names`</mark> instead of the <mark style="color:red;">`colnames`</mark> function; for a dataframe, both return the same column names. I prefer <mark style="color:red;">`colnames`</mark>.

* <mark style="color:red;">`names`</mark> : Function to get or set the names of an R object.
* <mark style="color:red;">`colnames`</mark> : Retrieve or set the column names of a matrix-like object (e.g. a dataframe).
  {% endhint %}
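A quick sketch of where the two functions agree and where they differ, using a made-up dataframe and matrix (`df` and `m` are hypothetical names):

```r
# Hypothetical sample data
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))

# For a dataframe, both functions return the column names
same_for_df <- identical(names(df), colnames(df))  # TRUE

# For a matrix, only colnames() returns the column names
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))
matrix_names    <- names(m)     # NULL
matrix_colnames <- colnames(m)  # "a" "b"
```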

### Option 3: Use the <mark style="color:red;">`rename()`</mark> function from <mark style="color:red;">`dplyr`</mark>

The <mark style="color:red;">`rename()`</mark> function from the <mark style="color:red;">`dplyr`</mark> package is a more concise way to rename columns in a dataframe. Each argument to <mark style="color:red;">`rename()`</mark> is a `new_name = old_name` pair: the **new column name** goes on the left and the **old column name** on the right. For example, to rename the column `old_column_name` to **`new_column_name`**, you would use the following code:

{% tabs %}
{% tab title="Syntax" %}
{% code overflow="wrap" %}

```r
library(dplyr)
new_df <- df %>% rename(new_column_name = old_column_name)

# To change multiple column names
new_df <- df %>% rename(new_column_name_1 = old_column_name_1, new_column_name_2 = old_column_name_2)

```

{% endcode %}
{% endtab %}

{% tab title="Example" %}

```r
library(dplyr)

# Create a sample dataframe
df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35))

# Rename the "Age" column to "Years"
df <- df %>% rename(Years = Age)

```

{% endtab %}
{% endtabs %}

### Option 4: Change the names of all columns

The <mark style="color:red;">`setNames()`</mark> function in R can also be used to assign new names to all the columns of a dataframe at once (it works on any object that has names, such as a list or a vector). It returns a renamed copy and leaves the original unchanged, so you have to assign the result back to the dataframe.

{% tabs %}
{% tab title="Syntax" %}

```r
df <- setNames(df, c("new_col_name_1", "new_col_name_2"))
```

{% endtab %}

{% tab title="Example" %}

```r
df <- setNames(df, c("changed_Col1","changed_Col2","changed_Col3"))
```

{% endtab %}
{% endtabs %}
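To illustrate that `setNames()` does not modify its input, here is a minimal sketch on a made-up two-column dataframe (all names are hypothetical):

```r
# Hypothetical sample data
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))

# setNames() returns a renamed copy; df itself is untouched
renamed <- setNames(df, c("FullName", "Years"))

names(df)       # "Name" "Age"
names(renamed)  # "FullName" "Years"

# Assign the result back to keep the new names
df <- setNames(df, c("FullName", "Years"))
```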

## Replace specific values in a column in R DataFrame

### Option 1: By using row and column number

{% tabs %}
{% tab title="Syntax" %}

```r
df[row_number, column_number] <- new_value
```

{% endtab %}

{% tab title="Example" %}

```r
df[2,5] <- 6.8
```

{% endtab %}
{% endtabs %}

### Option 2: Use a logical condition

{% tabs %}
{% tab title="Syntax" %}

```r
dataframe_name$column_name1[dataframe_name$column_name2==y] <- x
```

{% hint style="info" %}
**Code Breakdown:**

* <mark style="color:red;">`y`</mark>: The value in `column_name2` used to locate the rows to change
* <mark style="color:red;">`x`</mark> : The new value written into `column_name1` for those rows
  {% endhint %}
  {% endtab %}

{% tab title="Example" %}

```r
df$Marks[df$Names == "Raman"] <- 25
```

{% hint style="info" %}
**Code Breakdown:**

* Replace the **Marks** with **25** when the **Names** value is **Raman**.
  {% endhint %}
  {% endtab %}
  {% endtabs %}
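Both options can be tried end to end on a small made-up dataframe. This is only a sketch; the data is hypothetical, with column names borrowed from the example above:

```r
# Hypothetical sample data
df <- data.frame(
  Names = c("Raman", "Sita", "Raman"),
  Marks = c(10, 40, 12)
)

# Option 1: replace by position (row 2 of the "Marks" column)
df[2, "Marks"] <- 45

# Option 2: replace by condition (every row where Names is "Raman")
df$Marks[df$Names == "Raman"] <- 25

df$Marks  # 25 45 25
```

Note that the condition-based form changes every matching row, not just the first one.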

## Filter Observations of a dataframe based on observations from another dataframe

Suppose we have a main dataframe with many observations and a second, smaller dataframe that holds some observations of interest. We want to use the small dataframe to filter rows from the main dataframe, and we also want to keep only some specific columns from the main dataframe.

{% tabs %}
{% tab title="Using base R functions" %}

```r
# Load the main dataframe and the filtering csv file
main_df <- read.csv("main_df.csv")
filter_df <- read.csv("filter_df.csv")

# Extract the column from the filtering csv file that will be used for filtering
filter_col <- filter_df[, 1]

# Filter the main dataframe based on the values in filter_col
filtered_df <- main_df[main_df$column_to_filter %in% filter_col, ]

# Select the specific columns you want from the filtered dataframe
selected_df <- filtered_df[, c("column_1", "column_2", "column_3")]

# Write the selected data to a new csv file
write.csv(selected_df, "filtered_and_selected.csv")
```

**Code Breakdown:** In this example, <mark style="color:red;">`main_df.csv`</mark> is your main dataframe, and <mark style="color:red;">`filter_df.csv`</mark> is the csv file with the observations used to filter the main dataframe. <mark style="color:red;">`column_to_filter`</mark> is the column in <mark style="color:red;">`main_df`</mark> that you want to filter based on the values in <mark style="color:red;">`filter_df`</mark>. The specific columns you want to keep from the filtered dataframe are specified in the <mark style="color:red;">`c()`</mark> function in the line <mark style="color:red;">`selected_df <- filtered_df[, c("column_1", "column_2", "column_3")]`</mark>. Finally, the selected data is written to a new csv file, <mark style="color:red;">`filtered_and_selected.csv`</mark>.
{% endtab %}

{% tab title="Using dplyr package functions" %}

```r
library(dplyr)

# Load main dataframe
main_df <- read.csv("main_data.csv")

# Load filter data
filter_data <- read.csv("filter_data.csv")

# Filter the main dataframe based on values in filter_data
filtered_df <- main_df %>%
  filter(column_1 %in% filter_data$column_1) %>%
  select(column_2, column_3)

# Write the selected data to a new csv file
write.csv(filtered_df, "filtered_and_selected.csv")

```

**Code Breakdown:** In this example, <mark style="color:red;">`main_df`</mark> is loaded from a <mark style="color:red;">`.csv`</mark> file named <mark style="color:red;">`main_data.csv`</mark>. The `filter_data` dataframe is loaded from another <mark style="color:red;">`.csv`</mark> file named <mark style="color:red;">`filter_data.csv`</mark>. The <mark style="color:red;">`dplyr`</mark> <mark style="color:red;">`filter`</mark> function is used to keep only those observations in <mark style="color:red;">`main_df`</mark> where the value in <mark style="color:red;">`column_1`</mark> is found in the <mark style="color:red;">`column_1`</mark> column of <mark style="color:red;">`filter_data`</mark>. The `dplyr` <mark style="color:red;">`select`</mark> function is used to only keep the <mark style="color:red;">`column_2`</mark> and <mark style="color:red;">`column_3`</mark> columns in the filtered data.
{% endtab %}
{% endtabs %}
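To try the same idea without any csv files, here is a self-contained sketch in base R that builds both dataframes in memory. All names and values are hypothetical. It also uses `setdiff()` to list filter values that have no match in the main dataframe, which can help when checking the result:

```r
# Hypothetical main dataframe
main_df <- data.frame(
  id    = c(1, 2, 3, 4, 5),
  name  = c("a", "b", "c", "d", "e"),
  score = c(10, 20, 30, 40, 50)
)

# Hypothetical small dataframe holding the ids we care about
filter_df <- data.frame(id = c(2, 4, 9))

# Keep rows whose id appears in filter_df, and only the id and score columns
selected_df <- main_df[main_df$id %in% filter_df$id, c("id", "score")]

# Filter values that were not found in the main dataframe
unmatched_ids <- setdiff(filter_df$id, main_df$id)
```

In this sketch `selected_df` holds the rows with ids 2 and 4, and `unmatched_ids` contains 9.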

