🏀 Scatter Plot with R Base Graphics

Scatter plot:

Let's create a scatter plot:

# Data
x <- mtcars$wt
y <- mtcars$mpg

# Create the plot
plot(x, y, main = "Main title",
     xlab = "X axis title", ylab = "Y axis title",
     pch = 19, frame = FALSE)

Code Breakdown:

  • The code above creates a scatter plot in R using data from the built-in mtcars data set. The variables x and y are created by extracting the wt and mpg columns, respectively, from the mtcars data set.

  • The plot() function is then used to create the scatter plot. The first two arguments to plot(), x and y, specify the x and y coordinates of the points in the plot, respectively.

  • The main argument is used to specify the main title for the plot. The xlab and ylab arguments are used to specify the titles for the x and y axes, respectively.

  • The pch argument is used to specify the plotting character to use for the points in the plot. In this case, pch = 19 means that filled circles will be used.

  • Finally, the frame argument is set to FALSE to remove the default frame around the plot.

pch values for point types:
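The available symbols can also be listed directly in R. The sketch below draws the standard plotting symbols (pch = 0 to 25) with their codes; the grid layout, variable names, and fill color are arbitrary choices made for this illustration.

```r
# Draw the standard plotting symbols 0-25 in a grid, labelled with their codes
pch_codes <- 0:25
grid_x <- pch_codes %% 7        # column position (7 symbols per row)
grid_y <- 3 - pch_codes %/% 7   # row position, first row at the top

plot(grid_x, grid_y, pch = pch_codes, cex = 2,
     col = "black", bg = "orange",   # bg only affects symbols 21 to 25
     xlim = c(-0.5, 6.5), ylim = c(-0.5, 3.5),
     axes = FALSE, xlab = "", ylab = "", main = "pch values")

# Print each code just below its symbol
text(grid_x, grid_y, labels = pch_codes, pos = 1, offset = 1)
```

Symbols 21 to 25 take a separate border color (col) and fill color (bg), which is why bg is set here.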

Add regression line to the plot

Regression Line:

In statistics, a regression line is a line of best fit that represents the relationship between two continuous variables. It is used to model the relationship between the independent variable (often denoted as "x") and the dependent variable (often denoted as "y"). The line of best fit is calculated such that the sum of the squared differences between the observed values and the values predicted by the line is minimized. The equation of the line can be used to make predictions about the dependent variable based on new values of the independent variable. The regression line is also sometimes referred to as the "least-squares line".
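To make the least-squares idea concrete, here is a small sketch that computes the slope and intercept from the closed-form formulas for a single predictor and checks them against lm(). The variable names (slope, intercept, fit) are chosen for this illustration only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least-squares estimates for one predictor
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm(); unname() drops the coefficient labels
fit <- lm(y ~ x)
all.equal(c(intercept, slope), unname(coef(fit)))

# The quantity the line minimizes: the sum of squared residuals
sum((y - (intercept + slope * x))^2)
```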

Let's add a regression line to the above plot:

# Create the plot
plot(x, y, main = "Main title",
     xlab = "X axis title", ylab = "Y axis title",
     pch = 19, frame = FALSE)

# Add regression line
abline(lm(y ~ x), col = "red")

Here,

The last line, abline(lm(y ~ x), col = "red"), adds a regression line to the plot. The lm() function fits a linear model, and the formula y ~ x names y as the response variable and x as the predictor variable. Because x and y are standalone vectors created earlier rather than columns of mtcars, no data argument is needed; the equivalent call using the column names is lm(mpg ~ wt, data = mtcars). The abline() function then draws the fitted line on the existing plot, and the col argument sets the color of the line to red.
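As a follow-up, the fitted model can also be used for prediction, and abline() accepts an intercept and slope directly. This is a minimal sketch: the two car weights in new_cars are hypothetical values picked for illustration, and the axis labels assume the units documented for mtcars.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Predicted mpg for two hypothetical cars (wt is in units of 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.5))
predicted_mpg <- predict(fit, newdata = new_cars)

plot(mtcars$wt, mtcars$mpg, pch = 19, frame = FALSE,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# a = intercept, b = slope, taken from the fitted coefficients
abline(a = coef(fit)[1], b = coef(fit)[2], col = "red")

# The predictions fall exactly on the regression line
points(new_cars$wt, predicted_mpg, pch = 17, col = "blue")
```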

Add LOESS fit to the plot:

LOESS fit:

A LOESS fit is a type of non-parametric regression used to model the relationship between two continuous variables. "LOESS" stands for "locally estimated scatterplot smoothing"; the closely related, older method LOWESS stands for "locally weighted scatterplot smoothing". The basic idea is to fit a weighted regression model in the neighborhood of each point, giving nearby observations more weight than distant ones. In other words, instead of fitting a single global model to all of the data, a LOESS fit combines many local models fitted to different parts of the data. This allows the fit to capture more complex relationships between the variables, and to handle areas of the data that are noisy or structured differently from the rest. The result is a smooth curve that summarizes the relationship between the two variables and can be used to make predictions within the range of the observed data.

Let's add a LOESS fit to the main scatter plot:

# Create the plot
plot(x, y, main = "Main title",
     xlab = "X axis title", ylab = "Y axis title",
     pch = 19, frame = FALSE)
     
# Add loess fit
lines(lowess(x, y), col = "blue")

Here,

The lowess() function in R computes a locally weighted regression fit (LOWESS) and returns the smoothed values as x and y coordinates. The fit is a smooth curve that tries to capture the underlying trend in the data. The curve is created by fitting a simple regression model around each point in the data, weighting the other points by their proximity. The lines() function then draws the resulting curve over the scatter plot, and col sets its color to blue. Note that lowess() is the older of R's two local-regression smoothers; the formula-based loess() function is a separate, more general implementation.
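The amount of smoothing is adjustable. The sketch below compares the default lowess() smoother span (f = 2/3) with a smaller one, and overlays a curve from loess() for comparison. The span value 1/3, the colors, and the 100-point prediction grid are arbitrary choices for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

plot(x, y, pch = 19, frame = FALSE,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# lowess(): f is the proportion of points that influence each local fit
smooth_default <- lowess(x, y)            # default f = 2/3
smooth_wiggly  <- lowess(x, y, f = 1/3)   # smaller f follows the data more closely
lines(smooth_default, col = "blue")
lines(smooth_wiggly, col = "darkgreen", lty = 2)

# loess(): formula interface, evaluated on a grid inside the observed range
loess_fit <- loess(mpg ~ wt, data = mtcars)
wt_grid <- data.frame(wt = seq(min(x), max(x), length.out = 100))
lines(wt_grid$wt, predict(loess_fit, newdata = wt_grid), col = "red", lty = 3)

legend("topright", bty = "n", lty = c(1, 2, 3),
       col = c("blue", "darkgreen", "red"),
       legend = c("lowess, f = 2/3", "lowess, f = 1/3", "loess, default span"))
```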

Scatter Plot Matrices

Here, we'll go over creating a matrix of scatter plots, which is helpful for visualizing the pairwise relationships between variables in smaller data sets. The pairs() function from base R can be used for this purpose.

We will use R's built-in iris data set.

# load the data
data(iris)

The pairs() function creates a matrix of scatter plots to visualize the relationships between multiple variables in a data set. It belongs to the graphics package, which is loaded by default in every R session, so no additional packages are needed.

Here's an example of using the pairs() function in R:

# Create a basic plot
pairs(iris[,1:4], pch = 19)

The code generates a matrix of scatter plots showing how the variables in the iris data set relate to one another. Two arguments are passed to pairs() here: the first is a matrix or data frame of the variables to plot, and pch = 19 sets the plotting character to a filled circle. Only the first four columns of iris (the numeric measurements) are passed, so the resulting matrix displays one scatter plot for each pair of these variables.
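The numbers behind the picture can be obtained with cor(). This short sketch computes the Pearson correlation matrix for the same four columns; the name cor_matrix is just an illustrative choice.

```r
# Pairwise Pearson correlations for the four numeric iris columns
cor_matrix <- cor(iris[, 1:4])
round(cor_matrix, 2)
```

Each off-diagonal entry corresponds to one panel of the scatter plot matrix.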

To show only the upper panel:

pairs(iris[,1:4], pch = 19, lower.panel = NULL)

Note that, to only retain the lower panel, set the argument upper.panel to NULL.

Color points by groups

Let's color the points by species:

# Specify the color in a vector
my_cols <- c("#ff8000", "#0080ff", "#ff0080")  

# Create the plot
pairs(iris[,1:4], pch = 19,  cex = 0.5,
      col = my_cols[iris$Species],
      lower.panel=NULL)

Here,

  • The cex argument sets the size of the plotting characters, with the value of 0.5 indicating that they will be half their default size.

  • The col argument specifies the color of the points in the scatter plots. The value my_cols[iris$Species] assigns each point a color according to the species of that iris. The my_cols variable is a vector of color codes, and iris$Species is the column of the iris data frame that records the species of each iris.
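The indexing works because a factor used as an index is treated as its integer codes. A small sketch of the lookup, where point_cols and species_cols are names introduced only for this illustration:

```r
my_cols <- c("#ff8000", "#0080ff", "#ff0080")

# The factor levels, in order, correspond to positions 1, 2, 3 of my_cols
levels(iris$Species)

# One color per row of iris, chosen by the species of that row
point_cols <- my_cols[iris$Species]
table(iris$Species, point_cols)

# The same mapping as a named vector: species name -> color
species_cols <- setNames(my_cols, levels(iris$Species))
species_cols
```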

Add correlations on the scatter plots:

# Customize the upper panel
upper.panel <- function(x, y) {
  # Draw the points, colored by species
  points(x, y, pch = 19, cex = 0.5, col = c("red", "green3", "blue")[iris$Species])
  r <- round(cor(x, y), digits = 2)
  txt <- paste0("R = ", r)
  # Switch to unit coordinates for the label; restore the original ones on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}
pairs(iris[, 1:4], lower.panel = NULL,
      upper.panel = upper.panel)

Code Breakdown:

The code is creating a matrix of scatter plots using the pairs() function. The pairs() function takes the first four columns of the iris dataset as input. The code is customizing the upper panel of the plots using a user-defined function upper.panel. In the upper.panel function, points are plotted using the points() function. The arguments pch and col are used to specify the plot character and color for the points. The value for col is set based on the species of each sample in the iris dataset.

The cor() function calculates the Pearson correlation coefficient between the two variables in the panel, and round() rounds it to 2 decimal places. The rounded value is then concatenated with the string "R = " to create a label. Before the label is drawn, par(usr = c(0, 1, 0, 1)) rescales the panel's coordinate system to the unit square, so that the position (0.5, 0.9) passed to text() means horizontally centered and near the top of the panel, whatever the range of the data. The on.exit() call ensures that the original coordinates are restored after the upper.panel function has finished.

Finally, the pairs() function is called with the arguments lower.panel = NULL and upper.panel = upper.panel to create the matrix of scatter plots. The lower panel of the plots is suppressed by setting lower.panel = NULL, and the upper panel is customized using the upper.panel function.

Add correlations on the lower panels:

First, we have to define a function that will create a correlation panel. In this case, the size of the text is proportional to the correlation.

# Correlation panel
panel.cor <- function(x, y) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y), digits = 2)
  txt <- paste0("R = ", r)
  cex.cor <- 0.8 / strwidth(txt)
  # abs() keeps the text size positive when the correlation is negative
  text(0.5, 0.5, txt, cex = cex.cor * abs(r))
}

Here,

The code defines a custom panel function named panel.cor, which pairs() will call once per panel to print a correlation instead of drawing points. The function takes two arguments, x and y, which are the two variables whose correlation is being calculated.

  • The first line saves the panel's current user coordinates with par("usr") and uses on.exit() to restore them once the function has finished executing.

  • The second line calls par(usr = c(0, 1, 0, 1)), which rescales the panel's coordinate system to run from 0 to 1 on both axes, so that positions can be given as fractions of the panel.

  • The third line calculates the Pearson correlation coefficient between the two variables using the cor function and rounds it to two decimal places using the round function.

  • The fourth line concatenates the text "R = " with the correlation coefficient value stored in the r variable, and saves it in the txt variable.

  • The fifth line calculates the scaling factor cex.cor for the text, which is inversely proportional to the width of the text string, so that longer labels are drawn smaller.

  • Finally, the last line adds the label stored in txt to the panel using the text function, positioning it in the center of the panel (x = 0.5, y = 0.5) and scaling its size with the correlation through the cex argument.
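The panel function only takes effect once it is passed to pairs(). Here is a minimal, self-contained sketch, assuming the correlations go in the lower panels and the species-colored points in the upper panels. The abs() call is an addition made here so that negative correlations still produce a positive text size.

```r
# Correlation panel: text size grows with the magnitude of the correlation
panel.cor <- function(x, y) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y), digits = 2)
  txt <- paste0("R = ", r)
  cex.cor <- 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * abs(r))
}

# Upper panel: plain points colored by species
upper.panel <- function(x, y) {
  points(x, y, pch = 19, cex = 0.5,
         col = c("red", "green3", "blue")[iris$Species])
}

# Correlations below the diagonal, scatter plots above it
pairs(iris[, 1:4],
      lower.panel = panel.cor,
      upper.panel = upper.panel)
```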

Scatter plot by psych package

The psych package in R provides a function called pairs.panels which can be used to generate a scatter plot matrix. This plot displays bivariate scatter plots below the diagonal, histograms along the diagonal, and the Pearson correlation coefficient above the diagonal.

# Load the library
library(psych)

# Create the plot
pairs.panels(iris[,-5], 
             method = "pearson", # correlation method
             hist.col = "#e066ff",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

If lm = TRUE is passed to this function, the panels show linear regression fits for both y against x and x against y, along with correlation ellipses. The points in the scatter plots can also be colored by a grouping variable.
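Both options can be combined, as in the hedged sketch below. It assumes the psych package is installed (the call is skipped otherwise); the color vector species_cols is an illustrative choice. Coloring uses the bg argument together with pch = 21, a fillable symbol.

```r
# Colors for the three species (illustrative choice)
species_cols <- c("red", "green3", "blue")

# Only draw the plot if the psych package is available
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(iris[, -5],
                      lm = TRUE,                        # linear fits rather than smoothed curves
                      bg = species_cols[iris$Species],  # fill color by species
                      pch = 21)                         # fillable circle, so bg is visible
}
```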
