🏀 Scatter Plot with R Base Graphics
Scatter plot:
Let's create a scatter plot:
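A minimal sketch of such a plot, using the wt and mpg columns of mtcars. The title and axis label strings are illustrative assumptions; only the argument names (main, xlab, ylab, pch, frame) come from the description in this section.

```r
# Extract car weight (x) and fuel efficiency (y) from the built-in mtcars data
x <- mtcars$wt
y <- mtcars$mpg

# Scatter plot with filled circles and no frame around the plot
plot(x, y,
     main = "Weight vs. fuel efficiency",  # assumed title
     xlab = "Weight (1000 lbs)",           # assumed x-axis label
     ylab = "Miles per gallon",            # assumed y-axis label
     pch = 19, frame = FALSE)
```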
Code Breakdown:
The code above creates a scatter plot in R using data from the built-in mtcars data set. The variables x and y are created by extracting the wt and mpg columns, respectively, from the mtcars data set.
The plot() function is then used to create the scatter plot. The first two arguments to plot(), x and y, specify the x and y coordinates of the points in the plot, respectively.
The main argument specifies the main title for the plot. The xlab and ylab arguments specify the titles for the x and y axes, respectively.
The pch argument specifies the plotting character to use for the points in the plot. In this case, pch = 19 means that filled circles will be used.
Finally, the frame argument is set to FALSE to remove the default frame around the plot.
Output:
pch values for point types:
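As a rough sketch, the standard symbols can be listed by drawing pch values 0 to 25 on a small grid. The grid layout, colors, and title below are arbitrary choices made for illustration.

```r
# Draw the 26 standard plotting symbols (pch = 0 to 25) on a 7 x 4 grid
pch_values <- 0:25
grid_x <- pch_values %% 7        # column position of each symbol
grid_y <- 3 - pch_values %/% 7   # row position, first row on top

plot(grid_x, grid_y, pch = pch_values, cex = 2,
     col = "black", bg = "grey",   # bg fills symbols 21 to 25
     xlim = c(-0.5, 6.5), ylim = c(-0.5, 3.5),
     axes = FALSE, xlab = "", ylab = "",
     main = "pch values 0 to 25")

# Label each symbol with its pch number, just below the symbol
text(grid_x, grid_y, labels = pch_values, pos = 1, offset = 1)
```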
Add regression line to the plot
Regression Line:
In statistics, a regression line is a line of best fit that represents the relationship between two continuous variables. It is used to model the relationship between the independent variable (often denoted as "x") and the dependent variable (often denoted as "y"). The line of best fit is calculated such that the sum of the squared differences between the observed values and the values predicted by the line is minimized. The equation of the line can be used to make predictions about the dependent variable based on new values of the independent variable. The regression line is also sometimes referred to as the "least-squares line".
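To make the least-squares idea concrete, the sketch below fits a line to wt and mpg from mtcars, compares the fitted coefficients with the closed-form least-squares estimates, and predicts mpg for two new weights. The new weights (3 and 4) are arbitrary values chosen for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Fit the least-squares line y = a + b * x
fit <- lm(y ~ x)

# Closed-form least-squares estimates, for comparison with coef(fit)
b <- cov(x, y) / var(x)       # slope
a <- mean(y) - b * mean(x)    # intercept

# Sum of squared residuals: the quantity the fit minimizes
ssr <- sum(residuals(fit)^2)

# Predict mpg for two new weights (in 1000 lbs)
new_mpg <- predict(fit, newdata = data.frame(x = c(3, 4)))
```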
Let's add a regression line to the above plot:
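A minimal sketch, assuming x and y are the wt and mpg vectors extracted from mtcars. The plot title and axis labels are illustrative assumptions; the abline() call is the one discussed in this section.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Scatter plot (title and labels are assumed for illustration)
plot(x, y, main = "Weight vs. fuel efficiency",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     pch = 19, frame = FALSE)

# Fit y on x and draw the fitted line in red
abline(lm(y ~ x, data = mtcars), col = "red")
```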
Here, the call abline(lm(y ~ x, data = mtcars), col = "red") adds a regression line to the plot. The lm function is used to fit a linear model to the data. The y ~ x formula specifies the response variable y and the predictor variable x. The data argument names the data frame in which lm looks for the variables first, which is mtcars in this case. Because x and y are standalone vectors extracted from mtcars rather than columns of it, lm does not find them there and falls back to the environment of the formula, so the data argument could be omitted without changing the result. The abline function is then used to add the regression line to the plot, and the col argument specifies the color of the line as red.
Add LOESS fit to the plot:
LOESS fit:
A LOESS fit is a type of non-parametric regression that is used to model the relationship between two continuous variables. The term "LOESS" is usually expanded as "locally estimated scatterplot smoothing"; the closely related "LOWESS" stands for "locally weighted scatterplot smoothing". The basic idea behind a LOESS fit is to fit a locally weighted regression model in the neighborhood of each point. In other words, instead of fitting a single global model to all of the data, a LOESS fit fits multiple local models to different parts of the data. This allows the fit to capture more complex relationships between the variables, and to handle areas of the data that may be noisy or have a different structure than the rest of the data. The result of a LOESS fit is a smooth curve that can be used to make predictions about the relationship between the two variables, based on the observed data.
Let's add loess fit to the main scatter plot:
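A minimal sketch, again assuming x and y are the wt and mpg vectors from mtcars. The plot title and labels are illustrative assumptions; the smoother is computed with lowess() and drawn with lines() in blue.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Scatter plot (title and labels are assumed for illustration)
plot(x, y, main = "Weight vs. fuel efficiency",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     pch = 19, frame = FALSE)

# lowess() returns a list with components x and y describing the smooth curve
smooth_fit <- lowess(x, y)
lines(smooth_fit, col = "blue")
```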
Here, the lowess function in R computes a locally weighted regression fit. The fit is a smooth curve that tries to capture the underlying trend in the data. The curve is obtained by fitting a simple regression model around each point in the data, with nearby points weighted more heavily than distant ones. The resulting curve is then drawn over the scatter plot using the lines function, with its color set to "blue".
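How closely the curve follows the data depends on the smoother span, which lowess() exposes as its f argument (the fraction of points used for each local fit, 2/3 by default). The sketch below compares two spans; the values 0.3 and 0.9 are arbitrary choices for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

plot(x, y, pch = 19, frame = FALSE)

# f is the smoother span: the fraction of points that influence each local fit
wiggly <- lowess(x, y, f = 0.3)   # small span, follows the data closely
smooth <- lowess(x, y, f = 0.9)   # large span, gives a smoother trend

lines(wiggly, col = "blue", lty = 2)
lines(smooth, col = "blue", lty = 1)
legend("topright", legend = c("f = 0.3", "f = 0.9"),
       col = "blue", lty = c(2, 1), bty = "n")
```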
Scatter Plot Matrices
Here, we'll go over the process of creating a matrix of scatter plots. This is helpful for visualizing the pairwise correlations in smaller data sets. The pairs() function from base R can be used for this purpose.
We will use R's built-in iris data set.
The pairs() function creates a matrix of scatter plots to visualize the relationships between multiple variables in a data set. It belongs to the graphics package, which ships with R and is attached by default, so it can be used without loading any additional packages.
Here's an example of using the pairs() function in R:
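A minimal sketch of such a call, using the first four (numeric) columns of iris and filled circles as the plotting character:

```r
# Scatter plot matrix of the four numeric iris measurements
iris_num <- iris[, 1:4]
pairs(iris_num, pch = 19)
```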
The code generates a matrix of scatter plots to visualize the correlation between the variables in the iris data set. The pairs function is called with two arguments here: the first is a matrix or data frame of the variables to be plotted, and the second, pch, sets the plotting character, in this case 19 (filled circles). The iris data set is passed in as the first argument, but only the first 4 columns (i.e., the numeric variables) are included. The resulting matrix of scatter plots displays the relationship between each pair of variables.
To show only the upper panel:
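One way to do this, sketched here, is to suppress the lower panels by passing lower.panel = NULL to pairs():

```r
# Keep the panels above the diagonal, drop those below it
iris_num <- iris[, 1:4]
pairs(iris_num, pch = 19, lower.panel = NULL)
```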
Note that, to retain only the lower panel, set the argument upper.panel to NULL.
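The mirrored case can be sketched the same way:

```r
# Keep the panels below the diagonal, drop those above it
iris_num <- iris[, 1:4]
pairs(iris_num, pch = 19, upper.panel = NULL)
```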
Color points by groups
Let's color the points by species:
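A minimal sketch, assuming a three-color vector named my_cols. The specific hex codes are an illustrative assumption; any three colors would work.

```r
# One color per species; these particular hex codes are an assumption
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")

# Half-size filled circles, colored by the species of each observation
pairs(iris[, 1:4], pch = 19, cex = 0.5,
      col = my_cols[iris$Species])
```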
Here, the cex argument sets the size of the plotting characters, with the value of 0.5 indicating that they will be half their default size.
The col argument specifies the color of the points in the scatter plots, with the value my_cols[iris$Species] indicating that the color is determined by the species of each iris in the data frame. The my_cols variable is a vector of color codes, and iris$Species is a column in the iris data frame that indicates the species of each iris. Because Species is a factor, indexing my_cols with it uses the factor's underlying integer codes, so each of the three species selects one color.
Add correlations on the scatter plots:
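A sketch of a custom upper panel that draws the points and prints the correlation near the top of each panel. The function name upper.panel and the label position come from the description in this section; the my_cols palette is an illustrative assumption.

```r
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")  # assumed palette

# Custom upper panel: points colored by species plus a correlation label
upper.panel <- function(x, y) {
  points(x, y, pch = 19, col = my_cols[iris$Species])
  r <- round(cor(x, y), digits = 2)
  txt <- paste0("R = ", r)
  # Save the panel's user coordinates and restore them on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  # Use unit coordinates so (0.5, 0.9) is a relative position in the panel
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}

# Show only the customized upper panels
pairs(iris[, 1:4], lower.panel = NULL, upper.panel = upper.panel)
```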
Code breakdown:
The code creates a matrix of scatter plots using the pairs() function, which takes the first four columns of the iris dataset as input. The upper panel of the plots is customized with a user-defined function, upper.panel. Inside upper.panel, points are plotted using the points() function. The arguments pch and col specify the plot character and color for the points. The value for col is set based on the species of each sample in the iris dataset.
The cor() function calculates Pearson's correlation coefficient between the two variables in the plot, and round() rounds the value to 2 decimal places. The rounded value is then concatenated with the string "R = " to create a label for the plot. The par() function resets the panel's user coordinates to the unit square, so that the text() function can place the label at the relative position (0.5, 0.9), that is, horizontally centered near the top of the panel. The on.exit() function ensures that the original plot parameters are restored when the upper.panel function exits.
Finally, the pairs() function is called with the arguments lower.panel = NULL and upper.panel = upper.panel to create the matrix of scatter plots. The lower panel of the plots is suppressed by setting lower.panel = NULL, and the upper panel is customized using the upper.panel function.
Output:
Add correlations on the lower panels:
First, we define a function that creates a correlation panel. In this case, the size of the text is proportional to the correlation.
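A sketch of such a panel function. The names panel.cor, r, txt, and cex.cor follow the description in this section. The constant 0.8, the my_cols palette, and the use of abs(r) for the text size are assumptions made here: abs(r) keeps the size positive for negative correlations, which a plain cex.cor * r would not.

```r
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")  # assumed palette

# Correlation panel: prints "R = <value>", sized by the strength of r
panel.cor <- function(x, y) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y), digits = 2)
  txt <- paste0("R = ", r)
  cex.cor <- 0.8 / strwidth(txt)   # 0.8 is an assumed scaling constant
  # abs(r) is an assumption: it avoids a negative text size when r < 0
  text(0.5, 0.5, txt, cex = cex.cor * abs(r))
}

# Scatter panel: points colored by species
upper.panel <- function(x, y) {
  points(x, y, pch = 19, col = my_cols[iris$Species])
}

# Correlations below the diagonal, scatter plots above it
pairs(iris[, 1:4], lower.panel = panel.cor, upper.panel = upper.panel)
```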
Here, the code defines a custom panel function named panel.cor. This function is used to draw the correlation panels. It takes two arguments, x and y, which represent the two variables whose correlation is being calculated.
The first line of the function saves the current graphical parameters using the par function and sets on.exit to restore the original parameters after the function has finished executing.
The second line sets the graphical parameter usr = c(0, 1, 0, 1) using the par function, so that the user coordinates of the panel run from 0 to 1 on both the x and y axes.
The third line calculates the Pearson correlation coefficient between the two variables using the cor function and rounds it to two decimal places using the round function.
The fourth line concatenates the text "R = " with the correlation coefficient stored in the r variable, and saves the result in the txt variable.
The fifth line calculates the scaling factor cex.cor for the text, which is inversely proportional to the width of the text string.
Finally, the last line adds the label stored in txt to the plot using the text function, positioning it in the center of the panel (x = 0.5, y = 0.5) and setting its size proportional to the correlation coefficient (cex = cex.cor * r).
Output:
Scatter plot by psych package
The psych package in R provides a function called pairs.panels, which can be used to generate a scatter plot matrix. This plot displays bivariate scatter plots below the diagonal, histograms along the diagonal, and the Pearson correlation coefficient above the diagonal.
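A minimal sketch, assuming the psych package is installed from CRAN. The call is wrapped in a requireNamespace() check so the snippet does nothing when the package is missing; the histogram color is an illustrative choice.

```r
iris_num <- iris[, 1:4]

# Only draw the plot if the psych package is available
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(iris_num,
                      method = "pearson",    # correlation method
                      hist.col = "#00AFBB",  # assumed histogram color
                      density = TRUE,        # overlay density curves
                      ellipses = TRUE)       # draw correlation ellipses
}
```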
Output:
If lm = TRUE is used with this function, the plots display linear regression fits for both y against x and x against y. Correlation ellipses are also displayed. The points in the scatter plots may be colored differently based on a grouping variable.
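A sketch of both options together, again assuming psych is installed. Points are filled by species through the bg argument with pch = 21 (a fillable circle); the three colors are arbitrary.

```r
species_cols <- c("red", "yellow", "blue")  # assumed colors, one per species

if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(iris[, 1:4],
                      lm = TRUE,        # linear fits instead of smoothed fits
                      ellipses = TRUE,  # draw correlation ellipses
                      pch = 21,         # circle that can be filled via bg
                      bg = species_cols[iris$Species])
}
```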