DataFrame

View the Data

To view the data:

View(DataFrameName)

View the Head and Tail of the DataFrame

# View Head
head(DataFrameName)

# View Tail
tail(DataFrameName)

View the dimension or structure of the DataFrame

# View the number of rows and columns
dim(DataFrameName)

# View the overall structure
str(DataFrameName)
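As a quick illustration of the inspection functions above, here is a minimal sketch on a small made-up dataframe. The name scores and its columns are assumptions for illustration only, not part of the original notes.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(
  id    = 1:8,
  group = factor(c("a", "b", "a", "c", "b", "a", "c", "b")),
  value = c(12, 7, 15, 603, 9, 22, 10, 3)
)

first_rows <- head(scores, n = 3)  # first 3 rows instead of the default 6
last_rows  <- tail(scores, n = 2)  # last 2 rows
dims       <- dim(scores)          # number of rows, then number of columns

str(scores)                        # prints each column's type and first values
```

The n argument of head() and tail() is optional; without it both functions return six rows.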

Summarize the data frame

summary(DataFrameName)

Code Breakdown:

  • If a column of the dataframe is of factor type, summary() shows the number of observations in each factor level.

  • If a column is of numeric type, summary() shows basic statistics for that column (minimum, first quartile, median, mean, third quartile and maximum).
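The difference between factor and numeric columns can be seen in a small sketch. The dataframe scores is a made-up example, not data from the original notes.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(
  group = factor(c("a", "b", "a", "c")),
  value = c(12, 7, 15, 10)
)

group_summary <- summary(scores$group)  # count of rows per factor level
value_summary <- summary(scores$value)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

summary(scores)                         # both kinds of summary, column by column
```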

Get the content of a specific column

DataFrameName$ColumnName

Code Breakdown:

  • First, specify the dataframe name

  • then put a $ sign

  • After that add the column name.

This code returns the contents of that column as a vector (or as a factor, if the column is of factor type).
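A short sketch of column extraction on made-up data (the dataframe scores is an assumption for illustration). It also shows the double-bracket form, which gives the same result and is convenient when the column name is stored in a variable.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(id = 1:3, value = c(12, 7, 15))

col_dollar  <- scores$value       # numeric vector with the column's values
col_bracket <- scores[["value"]]  # same column, selected by a quoted name

col_name    <- "value"
col_by_var  <- scores[[col_name]] # works with a name held in a variable
```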

Extract unique data and their number in a column

Let's say the dataframe contains a column of factor type, and you want to view its unique values and count how many there are.

# View unique content
unique(DataFrameName$ColumnName)

# Get the number of unique content
length(unique(DataFrameName$ColumnName))
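If you also want to know how many rows fall under each unique value, base R's table() function is one option. The sketch below uses a made-up dataframe named scores, which is an assumption for illustration.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(group = factor(c("a", "b", "a", "c", "b", "a")))

unique_levels <- unique(scores$group)   # each distinct value once
n_unique      <- length(unique_levels)  # how many distinct values there are
level_counts  <- table(scores$group)    # number of rows for each value
```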

Get the observation of a specific cell

To get the value of a specific cell, specify the row number and the column number of that cell inside square brackets, separated by a comma.

You can also give the column name in quotes instead of the column number.
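A minimal sketch of both forms, using a made-up dataframe (scores and its columns are assumptions for illustration):

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(id = 1:8, value = c(12, 7, 15, 603, 9, 22, 10, 3))

cell_by_number <- scores[6, 2]        # row 6, column 2
cell_by_name   <- scores[6, "value"]  # row 6, column named "value"
```

Both lines read the same cell; the second keeps working if the column order changes.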

Get all the observations from a row

DataFrameName[6, ]

Code Breakdown:

The above code will extract all the values from the 6th row of your given dataframe.

You just have to specify the row number.

Similarly, you can give a column number to get all the rows from that column.

DataFrameName[ , 5]
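One detail that is easy to miss: selecting a row returns a dataframe, while selecting a single column this way returns a plain vector. The sketch below, on made-up data, shows this and the drop = FALSE argument that keeps the result as a dataframe.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(id = 1:3, value = c(12, 7, 15))

one_row    <- scores[2, ]                # dataframe with a single row
one_col    <- scores[, 2]                # plain numeric vector
one_col_df <- scores[, 2, drop = FALSE]  # one-column dataframe
```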

Get a specific observation by its value

Subsetting with brackets using row and column numbers can be tedious if you have a large dataset and don't know where the observations you're looking for are located. It is also fragile: if you hard-code a number in your script and rows are added later, you might no longer be selecting the same observations. That's why we use logical operations to access the parts of the data that match our specifications.

# Let's access the values for number 603
DataFrameName[DataFrameName$ColumnName == 603, ]

Code Breakdown:

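Let's say your column contains the value 603 and you want to access the rows where it occurs. Previously we used row and column numbers, which is not always a good idea.

  • Here, we used the column name and the observation value.

  • == is a comparison operator: it returns TRUE for each row where the column equals 603 and FALSE otherwise, and only the TRUE rows are kept.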
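To see what happens inside the brackets, the sketch below splits the operation into two steps on a made-up dataframe (scores is an assumption for illustration): first the comparison, then the subsetting.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(id = 1:4, value = c(12, 603, 15, 603))

is_603  <- scores$value == 603  # one TRUE/FALSE per row
matches <- scores[is_603, ]     # keeps only the rows marked TRUE
```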

Operators for logical operations:

Here are some of the most commonly used operators to manipulate data. When you use them to create a subsetting condition, R will evaluate the expression, and return only the observations for which the condition is met.

  • ==: equals exactly

  • <, <=: is smaller than, is smaller than or equal to

  • >, >=: is bigger than, is bigger than or equal to

  • !=: not equal to

  • %in%: belongs to one of the following (usually followed by a vector of possible values)

  • &: AND operator, allows you to chain two conditions which must both be met

  • |: OR operator, chains two conditions when at least one must be met

  • !: NOT operator, to specify things that should be omitted
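The operators listed above can be tried out on a small made-up dataframe. The name scores and its columns are assumptions for illustration.

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(
  group = c("a", "b", "c", "a", "b"),
  value = c(5, 12, 8, 20, 3)
)

in_a_or_c  <- scores[scores$group %in% c("a", "c"), ]              # group is "a" or "c"
not_b_big  <- scores[scores$group != "b" & scores$value >= 8, ]    # both conditions hold
small_or_b <- scores[scores$value < 5 | scores$group == "b", ]     # at least one holds
not_a      <- scores[!(scores$group == "a"), ]                     # everything except "a"
```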

Subset observation based on one condition

We can use logical operators to denote our conditions in a column and subset the observations that meet the condition.

[Subset means extracting observations from a bigger dataset.]

# Subset all the observations greater than 10
DataFrameName[DataFrameName$ColumnName > 10, ]

# This code gives the same result as the previous code
# Here, we used the not (!) operator to negate "less than or equal to 10"
DataFrameName[!(DataFrameName$ColumnName <= 10), ]
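When negating a comparison, note that R evaluates the comparison before the ! operator, so !x < 10 means "not less than 10" and keeps rows equal to 10. The sketch below checks this on made-up data (scores is an assumption for illustration).

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(id = 1:4, value = c(5, 10, 12, 20))

greater_than_10 <- scores[scores$value > 10, ]      # values 12 and 20
negated_le      <- scores[!(scores$value <= 10), ]  # same rows as above
negated_lt      <- scores[!scores$value < 10, ]     # also keeps the row with value 10
```

Wrapping the comparison in parentheses makes the intent explicit.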

Subset observations based on two conditions

DataFrameName[DataFrameName$Column_01 == 2 | DataFrameName$Column_01 == 5 , ]

Code Breakdown:

The above code will subset all the rows where the value of Column_01 is 2 or (|) 5.

DataFrameName[DataFrameName$Column_01 == 7 & DataFrameName$Column_02 %in% c(100:200) , ]

Code Breakdown:

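This code will extract all the rows that have the value 7 in Column_01 and a whole-number value from 100 to 200 in Column_02. Because %in% looks for exact matches against 100, 101, ..., 200, a value such as 150.5 would not be selected.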
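If Column_02 can hold non-integer values, a range written with >= and <= may be closer to what you want than %in% with 100:200. The sketch below compares the two on made-up data (the dataframe scores is an assumption for illustration).

```r
# Hypothetical example data (made up for this sketch)
scores <- data.frame(
  Column_01 = c(7, 7, 7, 2),
  Column_02 = c(150, 150.5, 250, 120)
)

# Exact matches against the whole numbers 100, 101, ..., 200
whole_numbers_only <- scores[scores$Column_01 == 7 &
                               scores$Column_02 %in% 100:200, ]

# Any value inside the range, including 150.5
any_value_in_range <- scores[scores$Column_01 == 7 &
                               scores$Column_02 >= 100 &
                               scores$Column_02 <= 200, ]
```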

Change the name of the column in the Dataframe

Option 1: Use column index

colnames(df)[col_indx] <- "new_col_name"

Option 2: Use column name

colnames(df)[colnames(df) == "Age"] <- "Years"

We can use the names function instead of colnames; for a dataframe both return the column names. I prefer colnames.

  • names: Gets or sets the names of an R object.

  • colnames: Retrieves or sets the column names of a matrix-like object (e.g. a dataframe).
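Options 1 and 2 can be sketched on a small made-up dataframe. The column Name and its values are assumptions for illustration; Age and Years come from the example above.

```r
# Hypothetical example data (made up for this sketch)
df <- data.frame(Name = c("Ann", "Bob"), Age = c(31, 45))

colnames(df)[1] <- "FirstName"                   # Option 1: rename by column index
colnames(df)[colnames(df) == "Age"] <- "Years"   # Option 2: rename by current name

same_result <- identical(names(df), colnames(df))  # names() reports the same columns
```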

Option 3: Using the rename() Function from dplyr:

The rename() function from the dplyr package is a concise way to rename columns in a dataframe. Inside rename(), write each pair as new_name = old_name: the new column name goes on the left of the equals sign and the existing column name on the right. For example, to rename the column old_column_name to new_column_name, you would use the following code:

library(dplyr)
new_df <- df %>% rename(new_column_name = old_column_name)

# To change multiple column names
new_df <- df %>% rename(new_column_name_1 = old_column_name_1, new_column_name_2 = old_column_name_2)

Option 4: Change the names of all columns

The setNames() function in R can be used to assign new names to all the columns of a dataframe at once (it also works on vectors and lists). It returns a renamed copy and does not modify the original, so the result has to be assigned back to the dataframe.

# new_col_names: a character vector with one name per column, in column order
df <- setNames(df, new_col_names)
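A minimal sketch of renaming every column with base R's setNames(), on a made-up dataframe (df and the new names are assumptions for illustration). It shows that the original stays unchanged until the result is assigned back.

```r
# Hypothetical example data (made up for this sketch)
df <- data.frame(a = 1:2, b = c("x", "y"))

renamed        <- setNames(df, c("id", "label"))  # renamed copy
original_names <- names(df)                       # df itself still has "a" and "b"

df <- renamed                                     # save the change back
```

Note that setNames() (capital N, base R) is different from setnames() in the data.table package, which changes names in place.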

Replace specific values in a column in R DataFrame

Option 1: By using row and column number

df[row_number, column_number] <- new_value

Option 2: Using the logical condition

dataframe_name$column_name1[dataframe_name$column_name2==y] <- x

Code Breakdown:

  • y: The value in column_name2 used to locate the rows to change

  • x: The new value written into column_name1 for those rows
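Both replacement options can be sketched on a small made-up dataframe. The names df, name, status and score are assumptions for illustration.

```r
# Hypothetical example data (made up for this sketch)
df <- data.frame(
  name   = c("Ann", "Bob", "Cid"),
  status = c("old", "new", "old"),
  score  = c(1, 2, 3)
)

df[2, 3] <- 20                      # Option 1: row 2, column 3 (score) becomes 20
df$score[df$status == "old"] <- 0   # Option 2: score becomes 0 where status is "old"
```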

Filter Observations of a dataframe based on observations from another dataframe

Suppose we have a main dataframe with many observations and a second, smaller dataframe that holds some of those observations. We want to use the small dataframe to filter the main dataframe, and we want to keep only some specific columns from the main dataframe.

# Load the main dataframe and the filtering csv file
main_df <- read.csv("main_df.csv")
filter_df <- read.csv("filter_df.csv")

# Extract the column from the filtering csv file that will be used for filtering
filter_col <- filter_df[, 1]

# Filter the main dataframe based on the values in filter_col
filtered_df <- main_df[main_df$column_to_filter %in% filter_col, ]

# Select the specific columns you want from the filtered dataframe
selected_df <- filtered_df[, c("column_1", "column_2", "column_3")]

# Write the selected data to a new csv file
write.csv(selected_df, "filtered_and_selected.csv")

Code Breakdown: In this example, main_df.csv is your main dataframe, and filter_df.csv is the csv file with the observations used to filter the main dataframe. column_to_filter is the column in main_df that you want to filter based on the values in filter_df. The specific columns you want to keep from the filtered dataframe are specified in the c() function in the line selected_df <- filtered_df[, c("column_1", "column_2", "column_3")]. Finally, the selected data is written to a new csv file, filtered_and_selected.csv.
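The same steps can be tried without any files by building the two dataframes in memory. Everything in the sketch below is made up for illustration (the values, the sample_id and extra columns); it also uses setdiff() to list filter values that have no match in the main dataframe, which is an extra check not in the original workflow.

```r
# Hypothetical in-memory stand-ins for main_df.csv and filter_df.csv
main_df <- data.frame(
  column_to_filter = c("s1", "s2", "s3", "s4"),
  column_1 = 1:4,
  column_2 = c(10, 20, 30, 40),
  column_3 = c("x", "y", "z", "w"),
  extra    = c(TRUE, FALSE, TRUE, FALSE)
)
filter_df <- data.frame(sample_id = c("s2", "s4", "s9"))

filter_col  <- filter_df[, 1]                                       # values to keep
filtered_df <- main_df[main_df$column_to_filter %in% filter_col, ]  # matching rows
selected_df <- filtered_df[, c("column_1", "column_2", "column_3")] # chosen columns

# Filter values that do not occur in the main dataframe
unmatched <- setdiff(filter_col, main_df$column_to_filter)
```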
