# View the number of rows and columns
dim(DataFrameName)
# View the overall structure
str(DataFrameName)
Summarize the data frame
Output of the summary function in R
Code Breakdown:
If a column of the dataframe is of factor type, the summary function shows the number of observations in each factor level.
If a column is numeric, the summary function shows basic statistics for that column (minimum, first quartile, median, mean, third quartile, and maximum).
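As a minimal sketch of the behaviour described above, the following builds a small, made-up dataframe (the names df, Group, and Score are assumptions for illustration) and summarizes it.

```r
# Hypothetical example data: one factor column and one numeric column
df <- data.frame(
  Group = factor(c("a", "b", "a", "c")),
  Score = c(10, 20, 30, 40)
)

# Summarize every column at once
summary(df)

# summary() also works on a single column
group_counts <- summary(df$Group)  # count per factor level
score_stats <- summary(df$Score)   # min, quartiles, median, mean, max
```

Printing group_counts and score_stats shows the two kinds of output side by side.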
Get the content of a specific column
Code Breakdown:
First, specify the dataframe name,
then put a $ sign.
After that, add the column name.
This code returns the contents of that column as a vector.
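The steps above can be sketched as follows, assuming a small made-up dataframe called df with columns Name and Age (both names are illustrative).

```r
# Hypothetical example data
df <- data.frame(
  Name = c("Ann", "Bob", "Cy"),
  Age = c(31, 45, 27)
)

# dataframe name, then $, then the column name
ages <- df$Age

# The result is a plain vector holding that column's values
ages
is.vector(ages)
```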
Extract unique data and their number in a column
Suppose the dataframe contains a factor column, and you want to view its unique values and how many of them there are.
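A minimal sketch of this, using a made-up dataframe (df and Species are assumed names). The table() call is an addition beyond the text above: it counts how often each value occurs, which is often what is wanted next.

```r
# Hypothetical example data with a factor column
df <- data.frame(
  Species = factor(c("cat", "dog", "cat", "bird", "dog", "cat"))
)

# The distinct values in the column
unique_values <- unique(df$Species)

# How many distinct values there are
n_unique <- length(unique_values)

# How many times each value occurs
counts <- table(df$Species)
counts
```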
Get the observation of a specific cell
To get the value of a specific cell you have to specify the Row Number and Column Number of that cell.
You can also use the column name instead of the column number to get the value of a specific row in that column.
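Both ways of addressing a cell can be sketched as follows, assuming a small made-up dataframe (df, Name, and Age are illustrative names).

```r
# Hypothetical example data
df <- data.frame(
  Name = c("Ann", "Bob", "Cy"),
  Age = c(31, 45, 27)
)

# Row 2, column 2, addressed by position
cell_by_index <- df[2, 2]

# The same cell, addressed by row number and column name
cell_by_name <- df[2, "Age"]
```

Addressing the column by name keeps working if the column order changes later, which the positional form does not.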
Get all the observations from a row
Code Breakdown:
DataFrameName[6, ] extracts all the values from the 6th row of the dataframe.
You only have to specify the row number and leave the column position empty.
Similarly, you can give a column number and leave the row position empty to get all the rows of that column.
DataFrameName[ , 5]
Get a specific observation by its value
Subsetting with brackets using row and column numbers can be tedious if you have a large dataset and you don’t know where the observations you’re looking for are located. It is also fragile: if you hard-code a number in your script and later add some rows, you might no longer be selecting the same observations. That’s why we can use logical operations to access the parts of the data that match our specifications.
Code Breakdown:
Let's say a column contains the value 603 and you want to access the observation that holds it. Previously we used row and column numbers, which is not always a good idea.
Here, we used the column name and the observation value.
== is a comparison operator; it returns TRUE for every row where the column value equals 603, and only those rows are kept.
Operators for logical operations:
Here are some of the most commonly used operators to manipulate data. When you use them to create a subsetting condition, R will evaluate the expression, and return only the observations for which the condition is met.
==: equals exactly
<, <=: is smaller than, is smaller than or equal to
>, >=: is bigger than, is bigger than or equal to
!=: not equal to
%in%: belongs to one of the following (usually followed by a vector of possible values)
&: AND operator, allows you to chain two conditions which must both be met
|: OR operator, chains two conditions when at least one must be met
!: NOT operator, to specify things that should be omitted
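To see what these operators return, here is a small sketch on a made-up numeric vector x (an assumed name). Each expression produces a logical vector, which can then be used inside brackets to keep only the matching elements.

```r
# Hypothetical example values
x <- c(3, 8, 12, 15, 8)

is_eight <- x == 8                # equals exactly
not_eight <- x != 8               # not equal to
in_set <- x %in% c(3, 15)         # belongs to a set of values
both <- x > 5 & x < 13            # AND: both conditions must hold
either <- x < 5 | x > 13          # OR: at least one condition must hold
negated <- !(x %in% c(3, 15))     # NOT: everything outside the set

# Use a logical vector to keep only the matching elements
x[both]
```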
Subset observation based on one condition
We can use logical operators to denote our conditions in a column and subset the observations that meet the condition.
[Subset means extracting observations from a bigger dataset.]
Subset observations based on two conditions
Code Breakdown:
The above code will subset all the observations of Column_01 where the value is 2 or (|) 5.
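A minimal sketch of the OR condition described here, using a made-up dataframe (df is an assumed name; Column_01 and Column_02 follow the text). The %in% variant is an equivalent alternative, not necessarily the form the original code used.

```r
# Hypothetical example data
df <- data.frame(
  Column_01 = c(2, 7, 5, 7, 3),
  Column_02 = c(50, 150, 120, 250, 180)
)

# Keep the rows where Column_01 is 2 or 5
either <- df[df$Column_01 == 2 | df$Column_01 == 5, ]

# The same selection written with %in%
same_with_in <- df[df$Column_01 %in% c(2, 5), ]
```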
Code Breakdown:
This code will extract all the observations that have the value 7 in Column_01 and a value between 100 and 200 in Column_02.
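A minimal sketch of the AND condition described here, using a made-up dataframe (df is an assumed name). It assumes the range is inclusive at both ends, which the text does not specify.

```r
# Hypothetical example data
df <- data.frame(
  Column_01 = c(2, 7, 5, 7, 3),
  Column_02 = c(50, 150, 120, 250, 180)
)

# Keep the rows where Column_01 is 7 AND Column_02 lies in [100, 200]
both <- df[df$Column_01 == 7 &
             df$Column_02 >= 100 &
             df$Column_02 <= 200, ]
both
```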
Change the name of the column in the Dataframe
Option 1: Use column index
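One way to rename a column by its index, sketched on a made-up dataframe (df, a, b, and label are assumed names):

```r
# Hypothetical example data
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Rename the 2nd column by its position
colnames(df)[2] <- "label"

colnames(df)
```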
Option 2: Use column name
We can use the names function instead of colnames, but I prefer colnames.
names: gets or sets the names of an R object.
colnames: retrieves or sets the column names of a matrix-like object (e.g., a dataframe).
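Renaming by column name can be sketched as follows, with colnames and names shown side by side. The dataframe and all column names here are made up for illustration.

```r
# Hypothetical example data
df <- data.frame(old_name = 1:3, other = c("a", "b", "c"))

# Find the column by its current name and assign the new name
colnames(df)[colnames(df) == "old_name"] <- "new_name"

# names() works the same way on a dataframe
names(df)[names(df) == "other"] <- "renamed_other"

colnames(df)
```

Matching on the current name avoids relying on the column's position.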
Option 3: Using the rename() Function from dplyr:
The rename() function from the dplyr package is a more concise way to rename columns in a dataframe. To use rename(), you specify each change as a new_name = old_name pair, where the left-hand side is the new column name and the right-hand side is the existing column name. For example, rename(DataFrameName, new_column_name = old_column_name) renames the column old_column_name to new_column_name.
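A minimal sketch of rename(), assuming the dplyr package is installed and using a made-up dataframe (df, old_column_name, and other are illustrative names). Note that the new name goes on the left of the equals sign.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(old_column_name = 1:3, other = c("a", "b", "c"))

# new_name = old_name; rename() returns a modified copy
renamed <- rename(df, new_column_name = old_column_name)

names(renamed)
```

Because rename() returns a copy, df itself is unchanged unless the result is assigned back to it.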
Option 4: Change the names of all columns
The setNames() function can also be used to assign new names to all columns at once; it works on any object that has names, such as a vector, a list, or a dataframe. It returns a renamed copy instead of changing the object in place, so the result has to be assigned back to the original dataframe to keep the change.
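A minimal sketch of renaming every column with setNames(), using a made-up dataframe (all names here are assumptions for illustration):

```r
# Hypothetical example data
df <- data.frame(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

# Supply one new name per column, in order, and assign the result back
df <- setNames(df, c("id", "label", "flag"))

names(df)
```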
Replace specific values in a column in R DataFrame
Option 1: By using row and column number
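Replacing a value by its row and column number can be sketched like this, on a made-up dataframe (the Names and Marks columns mirror the example used later in this section; the values are invented):

```r
# Hypothetical example data
df <- data.frame(
  Names = c("Asha", "Raman", "Lee"),
  Marks = c(18, 20, 22)
)

# Overwrite the cell in row 2, column 2
df[2, 2] <- 25

df
```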
Option 2: Using a logical condition
Code Breakdown:
y: the value in column_name2 that identifies the rows to change
x: the new value written into column_name1 for those rows
Code Breakdown:
Replace the Marks with 25 when the Names value is Raman.
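Here is the same replacement as a runnable sketch, with made-up data in which the name Raman appears twice to show that every matching row is updated:

```r
# Hypothetical example data
df <- data.frame(
  Names = c("Asha", "Raman", "Lee", "Raman"),
  Marks = c(18, 20, 22, 19)
)

# Set Marks to 25 in every row where Names is "Raman"
df$Marks[df$Names == "Raman"] <- 25

df
```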
Filter Observations of a dataframe based on observations from another dataframe
Suppose we have a main dataframe with many observations, and a second, smaller dataframe that holds some observations. We want to use the small dataframe to filter the main dataframe, and we only want to keep some specific columns of the main dataframe.
Code Breakdown: In this example, main_df.csv is your main dataframe, and filter_df.csv is the csv file with the observations used to filter the main dataframe. column_to_filter is the column in main_df that you want to filter based on the values in filter_df. The specific columns you want to keep from the filtered dataframe are specified in the c() function in the line selected_df <- filtered_df[, c("column_1", "column_2", "column_3")]. Finally, the selected data is written to a new csv file, filtered_and_selected.csv.
Code Breakdown: In this example, main_df is loaded from a .csv file named main_data.csv. The filter_data dataframe is loaded from another .csv file named filter_data.csv. The dplyr filter function is used to keep only those observations in main_df where the value in column_1 is found in the column_1 column of filter_data. The dplyr select function is used to keep only the column_2 and column_3 columns in the filtered data.
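To try the idea without any .csv files, here is a minimal base R sketch with two in-memory dataframes. The id column and all values are made up for illustration; the column names column_1 and column_2 follow the text.

```r
# Hypothetical main dataframe
main_df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  column_1 = c("a", "b", "c", "d", "e"),
  column_2 = c(10, 20, 30, 40, 50)
)

# Hypothetical small dataframe holding the ids we want to keep
filter_df <- data.frame(id = c(2, 4, 9))

# Keep only the rows of main_df whose id appears in filter_df
filtered_df <- main_df[main_df$id %in% filter_df$id, ]

# Keep only the columns of interest
selected_df <- filtered_df[, c("id", "column_2")]
selected_df
```

The id 9 has no match in main_df, so it is ignored.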
# View unique content
unique(DataFrameName$ColumnName)
# Get the number of unique content
length(unique(DataFrameName$ColumnName))
DataFrameName[6, ]
# Let's access the values for number 603
DataFrameName[DataFrameName$ColumnName == 603, ]
# Subset all the observations greater than 10
DataFrameName[DataFrameName$ColumnName > 10, ]
# This code gives the same result as the previous code
# Here, we negate the opposite condition with the not (!) operator
DataFrameName[!(DataFrameName$ColumnName <= 10), ]
dataframe_name$column_name1[dataframe_name$column_name2==y] <- x
df$Marks[df$Names == "Raman"] <- 25
# Load the main dataframe and the filtering csv file
main_df <- read.csv("main_df.csv")
filter_df <- read.csv("filter_df.csv")
# Extract the column from the filtering csv file that will be used for filtering
filter_col <- filter_df[, 1]
# Filter the main dataframe based on the values in filter_col
filtered_df <- main_df[main_df$column_to_filter %in% filter_col, ]
# Select the specific columns you want from the filtered dataframe
selected_df <- filtered_df[, c("column_1", "column_2", "column_3")]
# Write the selected data to a new csv file
write.csv(selected_df, "filtered_and_selected.csv")
library(dplyr)
# Load main dataframe
main_df <- read.csv("main_data.csv")
# Load filter data
filter_data <- read.csv("filter_data.csv")
# Filter the main dataframe based on values in filter_data
filtered_df <- main_df %>%
  filter(column_1 %in% filter_data$column_1) %>%
  select(column_2, column_3)
# Write the selected data to a new csv file
write.csv(filtered_df, "filtered_and_selected.csv")