One of the best features of R is its graphics capability. In terms of flexibility and breadth, R graphics are unsurpassed (in my opinion). There are MANY functions in R for visualizing data. There are also many examples online; a nice site that has tons of graphics (and syntax!) is: http://www.r-graph-gallery.com/.
Pictures are cool, no doubt. But, visualization of information is an ESSENTIAL part of your toolkit as a researcher, analyst, and/or rad R user. Visualizing information helps you tell the story and make the point to those who are “visual” learners.
For example, in a paper entitled “Solitary Confinement and the Well-Being of People in Prison”, Wright et al. examined the effects of solitary confinement on mental well-being at three time points over a 12-month period (called “baseline”, “6-Month”, and “12-Month”). A key feature of the study was tracking the movement of individuals through custody levels.
Take a look at the table below, which shows a cross-tabulation of custody levels for individuals. That is, each cell is the count of individuals who were at a particular custody level at baseline (the row) and at a particular custody level at 6 months (the column):
Baseline (rows) / 6-Month (columns) | Minimum | Medium | Close | Maximum | Attrite |
---|---|---|---|---|---|
Medium | 6 | 81 | 3 | 3 | 17 |
Close | 0 | 9 | 60 | 13 | 12 |
Maximum | 0 | 0 | 14 | 99 | 9 |
What does this table tell you? In other words, what can you infer from the table?
It is a bit hard to follow. Now, what if we added the two other tables: 6-month to 12-month and baseline to 12-month? That would be a lot of information to digest, and it would be tough to get a sense of what was occurring.
A visualization would help. How could we better represent this information?
Take a look at this visualization.
This is a “Sankey” plot. It shows flows where entities (nodes) are represented by rectangles or text, and arrows or arcs show the flows between them. It is created using the networkD3 package.
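As a rough sketch of how such a plot could be built from the table above, the code below reshapes the baseline-to-6-month counts into the node and link data frames that the sankeyNetwork() function in networkD3 takes. The object names (counts, links, nodes) are made up for illustration, the code assumes that rows are baseline levels and columns are 6-month levels as described above, and it is not necessarily how the figure shown here was made.

```r
# Counts from the table: rows = custody level at baseline, columns = level at 6 months.
counts <- matrix(
  c( 6, 81,  3,  3, 17,
     0,  9, 60, 13, 12,
     0,  0, 14, 99,  9 ),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    baseline = c( "Medium", "Close", "Maximum" ),
    month6   = c( "Minimum", "Medium", "Close", "Maximum", "Attrite" )
  )
)

# One row per (baseline, 6-month) pair, then drop the empty flows.
links <- as.data.frame( as.table( counts ), responseName = "value", stringsAsFactors = FALSE )
links <- links[ links$value > 0, ]

# Label nodes by time point so "Medium" at baseline and at 6 months stay distinct.
from  <- paste( links$baseline, "(BL)" )
to    <- paste( links$month6, "(6M)" )
nodes <- data.frame( name = unique( c( from, to ) ), stringsAsFactors = FALSE )

# sankeyNetwork() expects zero-based row positions in the nodes data frame.
links$source <- match( from, nodes$name ) - 1
links$target <- match( to, nodes$name ) - 1

# Draw the plot only if networkD3 is installed.
if ( requireNamespace( "networkD3", quietly = TRUE ) ) {
  networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target",
    Value = "value", NodeID = "name",
    fontSize = 12
  )
}
```

The wider a band is, the more people moved along that path, which is much easier to read than scanning the cells of the table.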
Visualization is a great medium for conveying information. There are a LOT of tools in R for visualizing information. So, to get a sense of how it all works, let’s start with a basic function: plot().
plot() Function
A very useful graphics function is the plot() function. In its simplest form, it draws a two-dimensional plot from two arguments giving the x and y coordinates.
Let’s create a simple plot:
x <- rnorm( 100, 0, 1 ) #create a vector with 100 elements drawn from a normal distribution.
y <- seq( 1, 10, length.out=length( x ) ) # create an evenly spaced sequence from 1 to 10 with the same length as x.
plot( x, y ) #plot it.
This is a pretty simple plot. As mentioned, one of the nice features of R is the flexibility to customize your plot. Take a look at the different arguments that we can pass to plot() to modify it by examining the help: ?plot.
Overall, there are many different parameters we can modify in plot(). Let’s check out a few:
We can change the “type” of plot:
plot( x, y, type="l" ) #plot a line.
plot( x, y, type="p" ) #plot points.
plot( x, y, type="b" ) #plot both!
Often, when plotting multiple objects, we want to first set up the plot region before adding anything. This is a plot of type “none”: plot( x, y, type="n" ).
plot( x, y,
type="n",
main="our sample plot", # plot a title.
xlab="this is the x axis", # change the x label.
ylab="this is the y axis" # change the y label.
)
We can also change the “characters” of the plot:
plot( x, y, pch=1 ) #plot an open circle.
plot( x, y, pch=2 ) #plot a triangle.
plot( x, y, pch=3 ) #plot a +.
plot( x, y, pch=4 ) #plot an x.
The argument pch determines the shape of the plotted points. The numeric values 0 to 25 represent different default shapes. We can also use any single character, such as a number, letter, or symbol, as a plotting shape (for example, pch="A").
Note that shapes 0 to 14 are hollow, 15 to 20 are solid, and 21 to 25 can also plot a background color specified by the bg= argument.
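To see these shapes for yourself, here is a small sketch that draws all 26 default symbols on a grid and labels each with its number. The grid layout and the object names (shapes, gx, gy) are just choices made for this illustration.

```r
shapes <- 0:25                       # the default plotting symbols.
gx <- ( shapes %% 7 ) + 1            # column position on a 7-wide grid.
gy <- 4 - ( shapes %/% 7 )           # row position, top row first.

plot( gx, gy,
      pch = shapes,                  # one shape per point.
      col = "red",                   # outline color.
      bg = "lightblue",              # fill color, only used by shapes 21 to 25.
      cex = 2,                       # make the symbols larger.
      xlim = c( 0.5, 7.5 ), ylim = c( 0.5, 4.8 ),
      axes = FALSE, xlab = "", ylab = "",
      main = "Values of pch" )

# label each shape with its number, just above the symbol.
text( gx, gy, labels = shapes, pos = 3, offset = 1 )
```

Only the last five symbols pick up the light blue fill, which is the bg= behavior described above.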
Additionally, the points(), lines(), segments(), and text() functions are useful for adding information to plots.
Here is an example I use in my data analysis course to illustrate the properties of the normal distribution, using three curves with the same mean but different standard deviations.
First, let’s set up our values:
y <- seq( -15, 30, length.out=1000 ) # sequence of 1000 values from -15 to 30.
hx.1 <- dnorm( y, 0, 1 ) # densities for the plots.
hx.2 <- dnorm( y, 0, 2 )
hx.3 <- dnorm( y, 0, 3 )
Next, let’s set up the plot, but we don’t want to add anything yet (so we use type="n"):
plot( y, hx.1,
xlab="", ylab="", # blank out the labels for x and y.
type="n", #do not plot anything.
main="Normal Distributions" # a title.
)
Now, illustrate the shape of the distributions using the lines() function (you can copy and paste one at a time to see them get added):
plot( y, hx.1,
xlab="", ylab="", # blank out the labels for x and y.
type="n", #do not plot anything.
main="Normal Distributions" # a title.
)
lines( y, hx.1, col="blue", type="l", lwd=2 )
lines( y, hx.2, col="red", type="l", lwd=2 )
lines( y, hx.3, col="darkgreen", type="l", lwd=2 )
Now, add a line to show the central tendency by using the segments() function:
plot( y, hx.1,
xlab="", ylab="", # blank out the labels for x and y.
type="n", #do not plot anything.
main="Normal Distributions" # a title.
)
lines( y, hx.1, col="blue", type="l", lwd=2 )
lines( y, hx.2, col="red", type="l", lwd=2 )
lines( y, hx.3, col="darkgreen", type="l", lwd=2 )
segments( 0, 0, 0, 0.5, col="black", lwd=2 )
Finally, add some text to show the values (note that we will use the text() function):
plot( y, hx.1,
xlab="", ylab="", # blank out the labels for x and y.
type="n", #do not plot anything.
main="Normal Distributions" # a title.
)
lines( y, hx.1, col="blue", type="l", lwd=2 )
lines( y, hx.2, col="red", type="l", lwd=2 )
lines( y, hx.3, col="darkgreen", type="l", lwd=2 )
segments( 0, 0, 0, 0.5, col="black", lwd=2 )
text( 11, 0.35, "Mean = 0, SD = 1", col="blue", cex=1.5 )
text( 12, 0.15, "Mean = 0, SD = 2", col="red", cex=1.5 )
text( 13, 0.06, "Mean = 0, SD = 3", col="darkgreen", cex=1.5 )
As we have seen, we can start with a basic plot and add information. Creating graphics in this way is referred to as layering because we are stacking additional layers of elements on top of each other.
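The same layering can be written more compactly with a loop, which is handy when there are many curves to add. This is only a sketch: the sds and cols vectors are names introduced here, and legend() is swapped in for the individual text() calls used above.

```r
y    <- seq( -15, 30, length.out = 1000 )   # same x values as before.
sds  <- c( 1, 2, 3 )                        # one standard deviation per curve.
cols <- c( "blue", "red", "darkgreen" )     # one color per curve.

# layer 1: an empty plot sized to fit the tallest curve.
plot( y, dnorm( y, 0, 1 ),
      type = "n", xlab = "", ylab = "",
      main = "Normal Distributions" )

# layer 2: add one curve per standard deviation.
for ( i in seq_along( sds ) ) {
  lines( y, dnorm( y, 0, sds[i] ), col = cols[i], lwd = 2 )
}

# layer 3: a legend that matches colors to distributions.
legend( "topright",
        legend = paste( "Mean = 0, SD =", sds ),
        col = cols, lwd = 2, bty = "n" )
```

Each call after plot() draws on top of what is already there, so the order of the calls is the order of the layers.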
Consider the following plot:
As you can see, there are a number of elements that have been used to create this plot:
plot()
points()
segments()
main=, xlab=, and ylab= arguments in the function plot()
text()
Let’s go through and build this plot, layer by layer.
First, what are these data?
The data are yearly rates of family deaths recorded by a professor at Penn State. That is, they are the rates at which family deaths were reported to him prior to an exam, from 1960 to 1995.
Here is what the data look like in a table:
Year | Death Rate |
---|---|
1960 | 0.18 |
1965 | 0.20 |
1970 | 0.24 |
1975 | 0.30 |
1980 | 0.47 |
1985 | 0.61 |
1990 | 0.70 |
1995 | 0.90 |
Now, let’s move the data into objects to work with in R:
x <- c( 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995 )
y <- c( 0.18, 0.20, 0.24, 0.30, 0.47, 0.61, 0.70, 0.90 )
Next, let’s set up the plot and define the limits of the axes using xlim= and ylim=, define the title using main=, and set the axis labels using xlab= and ylab=:
# Set up the plot.
plot(x, y,
xlim=c( min( x ) - 5, max( x ) + 5 ), # set x axis limits.
ylim=c( min( y ) - 0.5, max( y ) + 0.5) , # same for y axis.
main="Plot of Average Family Deaths by Year", # title.
xlab="Year", # x axis label.
ylab="Average Family Deaths", # same for y axis.
type="n" # don't plot anything inside.
)
Now, we can add the points to the plot using the points() function. We can customize the points using the pch=, col=, and bg= arguments.
plot(x, y,
xlim=c( min( x ) - 5, max( x ) + 5 ), # set x axis limits.
ylim=c( min( y ) - 0.5, max( y ) + 0.5) , # same for y axis.
main="Plot of Average Family Deaths by Year", # title.
xlab="Year", # x axis label.
ylab="Average Family Deaths", # same for y axis.
type="n" # don't plot anything inside.
)
points( x, y, pch = 21, col = "red", bg = "lightblue" )
What additional information could we add to this plot that would aid in our understanding of the relationship between year and average family deaths? What about understanding how regression works?
Well, we could add the least-squares regression line to the plot using the abline() function and the lm() function:
plot(x, y,
xlim=c( min( x ) - 5, max( x ) + 5 ), # set x axis limits.
ylim=c( min( y ) - 0.5, max( y ) + 0.5) , # same for y axis.
main="Plot of Average Family Deaths by Year", # title.
xlab="Year", # x axis label.
ylab="Average Family Deaths", # same for y axis.
type="n" # don't plot anything inside.
)
points( x, y, pch = 21, col = "red", bg = "lightblue" )
abline( lm( y ~ x ), lty=2 )
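To see what abline() is drawing, it can help to look at the fitted model itself. The sketch below stores the fit in an object (called fit here, a name chosen for illustration) and checks the intercept and slope against the usual formulas for simple regression.

```r
x <- c( 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995 )
y <- c( 0.18, 0.20, 0.24, 0.30, 0.47, 0.61, 0.70, 0.90 )

fit <- lm( y ~ x )      # least-squares fit of y on x.
coef( fit )             # the intercept and slope that abline() uses.

# the same estimates computed by hand.
slope     <- cov( x, y ) / var( x )            # about 0.021 per year.
intercept <- mean( y ) - slope * mean( x )
c( intercept, slope )
```

Because abline() accepts a fitted lm object directly, abline( fit, lty=2 ) draws the same dashed line as the call above.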
Additionally, we can illustrate how OLS estimation works. Recall that OLS finds the line that minimizes the sum of squared residuals. We can show that using the points(), abline(), segments(), and text() functions:
plot(x, y,
xlim=c( min( x ) - 5, max( x ) + 5 ), # set x axis limits.
ylim=c( min( y ) - 0.5, max( y ) + 0.5) , # same for y axis.
main="Plot of Average Family Deaths by Year", # title.
xlab="Year", # x axis label.
ylab="Average Family Deaths", # same for y axis.
type="n" # don't plot anything inside.
)
points( x, y, pch = 21, col = "red", bg = "lightblue" )
abline( lm( y ~ x ), lty=2 )
# mark the point of means ( xbar, ybar ) with a large +.
points( mean( x ), mean( y ), col="black", pch=3, cex=3 )
# plot the mean of y horizontally.
abline( h=mean( y ), lty=3 )
#plot the mean of x vertically.
abline( v=mean( x ), lty=3 )
# add segments and text showing the deviations.
segments( 1985, mean( y ), 1985, 0.61, lwd=3, col="red" )
text( 1987.2, 0.53, "y-ybar" )
segments( mean( x ), 0.61, 1985, 0.61, lwd=3, col="red" )
text( 1981, 0.65, "x-xbar" )
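A quick numerical check can make the OLS idea concrete. The sketch below defines a small helper, ssr() (a hypothetical name, not a built-in function), that returns the sum of squared residuals for any intercept and slope. It then compares the fitted line with two other lines that also pass through the point of means but are tilted slightly.

```r
x <- c( 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995 )
y <- c( 0.18, 0.20, 0.24, 0.30, 0.47, 0.61, 0.70, 0.90 )

b <- coef( lm( y ~ x ) )   # b[1] is the intercept, b[2] is the slope.

# sum of squared residuals for an arbitrary line (helper defined for this example).
ssr <- function( intercept, slope ) {
  sum( ( y - ( intercept + slope * x ) )^2 )
}

# tilt the line around the point of means to get two competing lines.
steeper <- b[2] * 1.2
flatter <- b[2] * 0.8

ssr.ols     <- ssr( b[1], b[2] )
ssr.steeper <- ssr( mean( y ) - steeper * mean( x ), steeper )
ssr.flatter <- ssr( mean( y ) - flatter * mean( x ), flatter )

# the OLS line has the smallest value of the three.
c( ols = ssr.ols, steeper = ssr.steeper, flatter = ssr.flatter )

# the fitted line passes through the point of means.
all.equal( unname( b[1] + b[2] * mean( x ) ), mean( y ) )
```

This ties back to the plot: the dashed regression line runs through the large + at the point of means, and any other line through that point leaves larger squared deviations.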
There is MUCH more you can do with just the plot() function. See help( par ) for a list of all the arguments and options for plotting.
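As one small illustration of par(), the sketch below uses the mfrow parameter to place two plots side by side and then restores the previous settings. The data are simulated just for this example, and old.par is a name chosen here.

```r
set.seed( 1 )                              # make the random draws reproducible.
x <- rnorm( 100, 0, 1 )
y <- seq( 1, 10, length.out = length( x ) )

# par() returns the previous values of whatever it changes, so we can save them.
old.par <- par( mfrow = c( 1, 2 ),         # 1 row and 2 columns of plots.
                mar = c( 4, 4, 2, 1 ) )    # margins: bottom, left, top, right.

plot( x, y, type = "p", main = "Points" )  # left panel.
plot( x, y, type = "l", main = "Lines" )   # right panel.

par( old.par )                             # restore the previous settings.
```

Saving and restoring the old settings keeps later plots from unexpectedly inheriting the two-panel layout.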
There are also many other options for plots in R. There are entire packages created for plotting. One in particular is the ggplot2 package. Check out this page showing crime in Phoenix.
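For a taste of the ggplot2 style, here is a hedged sketch that redraws the family deaths scatterplot with a least-squares line. It assumes ggplot2 is installed, and the data frame name deaths and its column names are introduced here for illustration.

```r
# the family deaths data from above, stored as a data frame.
deaths <- data.frame(
  year = c( 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995 ),
  rate = c( 0.18, 0.20, 0.24, 0.30, 0.47, 0.61, 0.70, 0.90 )
)

# build the plot only if ggplot2 is installed.
if ( requireNamespace( "ggplot2", quietly = TRUE ) ) {
  library( ggplot2 )
  p <- ggplot( deaths, aes( x = year, y = rate ) ) +
    geom_point( shape = 21, colour = "red", fill = "lightblue", size = 3 ) +  # the points layer.
    geom_smooth( method = "lm", formula = y ~ x, se = FALSE,
                 linetype = 2, colour = "black" ) +                          # the regression line layer.
    labs( title = "Plot of Average Family Deaths by Year",
          x = "Year", y = "Average Family Deaths" )
  print( p )
}
```

Notice that ggplot2 is also built around layering: each + adds a layer, much like the calls to points() and abline() did with plot().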