In the Introduction to
Exponential Random Graph Models (ERGMs) lecture you were introduced
to several examples of models from the exponential family of random
graphs. This lab provides an introduction to ERGMs in R using the
ergm
package. We will work through different dependence
specifications of the model reviewed in the lecture (e.g. dyadic
independence) and show how to incorporate node level attributes into the
model.
For the get along with network, individuals could have asymmetric nominations. That is, i could nominate j and j didn’t necessarily nominate i. But, we are going to symmetrize the network by only taking ties for which both i and j indicated that they get along with the other person. This will give us an undirected network.
# load the libraries we need
library( sna )
library( network )
# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-getalong-w1-adj.csv"
# read in the .csv file
gaMat <- as.matrix(
read.csv(
loc,
as.is = TRUE,
header = TRUE,
row.names = 1
)
)
# use the symmetrize() function to create an undirected matrix
gaMatU <- symmetrize( gaMat, rule = "strong" )
# create the network object
gaNetU <- as.network( gaMatU, directed = FALSE )
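The plotting calls below pass coord = coords so that every figure of this network uses the same node positions. If you do not already have a coords object from an earlier lab, a minimal sketch that creates one (saving the coordinates that gplot() returns) is:
# save a fixed set of node coordinates to reuse across plots
set.seed( 605 )
coords <- gplot( gaNetU, gmode = "graph" )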
gplot(
gaNetU,
gmode = "graph",
edge.col="grey40",
vertex.col="#c78c71",
coord = coords,
main = "PINS Get\n Along With Network (Undirected)"
)
Think about the structure here. What do you notice?
How might we go about incorporating this information into a model of the network?
Before we get going, let’s define some graphical objects we will use.
# define the number of nodes
g <- dim( gaMatU )[1]
# define the number of edges
l <- sum( gaMatU )/2
# define the density
den <- l / ( g *( g - 1 ) / 2 )
den
## [1] 0.004543281
What is the interpretation of the density?
Recall that the exponential random graph model seeks to model the
underlying processes that generated the observed network. Let’s create a
random graph with the same nodes, edges, and density as the
gaNetU
network and compare them by using the
rgraph()
function in the sna
package:
?rgraph
# set the seed to reproduce these results
set.seed( 605 )
# generate the random graph
random.graph <- rgraph(
g, # the number of nodes in the network
1, # we want just 1 random graph
tprob = den, # the density of the network
mode = "graph" # it is undirected
)
# now coerce the random graph to a network object
random.net <- as.network( random.graph, directed = FALSE )
We can now compare our observed network with a graph that has the same density but whose edges are randomly distributed over the nodes.
# set the margins
par( mfrow=c( 1,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
gaNetU,
gmode = "graph",
edge.col="grey40",
vertex.col="#c78c71",
coord = coords,
main = "PINS Get\n Along With Network (Undirected)"
)
# create the second plot
gplot(
random.net,
gmode = "graph",
edge.col = "grey40",
vertex.col="#069e6e",
main = "Random network"
)
How is the Get Along With network different from the random network?
Recall that there is an attributes file that we can attach to these data. Let’s load that and use these attributes in our plot.
# define the attributes object
attrs <- read.csv(
"https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
as.is = TRUE,
header = TRUE
)
# assign the attributes to the network
gaNetU %v% "Age" <- attrs[,1]
gaNetU %v% "Race" <- attrs[,2]
Now that these are assigned as an attribute, we can reference them
using the %v%
operator. For example, let’s look at the
values for the Race
attribute.
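The output below can be reproduced by printing the attribute and tabulating it; a sketch:
# print the values of the Race attribute
gaNetU %v% "Race"
# tabulate the frequency of each category
table( gaNetU %v% "Race" )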
## [1] 1 2 1 2 2 2 1 2 2 3 2 1 1 1 1 2 1 2 1 2 3 1 1 3 4 2 1 1 3 2 2 2 2 1 2 2 2
## [38] 3 1 2 2 1 1 3 1 2 2 3 2 1 1 1 2 3 1 1 1 1 3 1 2 1 1 2 2 3 2 1 2 1 1 1 2 2
## [75] 1 1 2 2 3 2 2 1 3 1 1 1 2 2 2 1 2 2 2 3 1 1 1 2 2 2 2 1 3 2 2 2 1 1 3 2 1
## [112] 1 1 2 2 1 3 2 2 1 2 1 2 1 1 2 1 2 1 2 1 2 3 3 2 2 3 3 3 3 2 2 2 3 2 1 1 2
## [149] 2 3 2 1 2 1 1 1 2 2 2 3 2 2 1 1 1 2 2 2 1 2 1 1 1 2 2 1 1 1 1 2 2 2 1 2 2
## [186] 2 1 2 1 1 1 3 2 2 2 1 2 2 2 3 2 3 3 2 2
##
## 1 2 3 4
## 79 96 29 1
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1
# now assign the recoded value
gaNetU %v% "Race" <- race.recoded
Now, let’s use the Race
attribute and the
Age
attribute in a plot. We will color the nodes based on
Race
and size the nodes based on Age
. Note
that for Age
, we are going to need to use the
rescale()
function.
# create the colors for race
node.cols <- gaNetU %v% "Race"
node.cols[ gaNetU %v% "Race" == 1 ] <- "#eb105d"
node.cols[ gaNetU %v% "Race" == 2 ] <- "#5773fa"
node.cols[ gaNetU %v% "Race" == 3 ] <- "#37cf00"
# define the rescale function
rescale <- function( nchar, low, high ){
min_d <- min( nchar )
max_d <- max( nchar )
rscl <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
rscl
}
# now, plug our pieces into our plot
gplot(
gaNetU,
gmode = "graph",
edge.col = "grey40",
vertex.col = node.cols,
vertex.cex = rescale( gaNetU %v% "Age", 0.5, 2 ),
coord = coords,
main = "PINS Get\n Along With Network (Undirected)"
)
# add a legend
legend( "topright",
legend=c( "White", "Black", "Hispanic" ),
col = unique( node.cols ),
pch = 19,
pt.cex=1
)
Think about the structure again. What do you notice about race or age?
How might we go about incorporating this information into a model of the network?
ergm Package and the ergm() Function
First things first, we need to install the ergm package and load the library of functions.
# install the ergm package
install.packages( "ergm" )
# load the ergm package
library( ergm )
# take a look at the functionality
help( package = "ergm" )
If you look at the ergm()
function using
?ergm
, then you will see an extensive help file on the
function.
ergm() Function
The ergm() function takes two arguments:
an object of class network (the dependent variable)
and some terms (i.e. network configurations)
Thus, the model form looks like this: model <- ergm(network ~ term). Recall from the
Introduction to ERGMs lecture that the general expression of the model
is:
\(logit\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = \sum_{k=1}^\kappa \theta_k\delta_{z_k(y)}\)
As with logistic regression, the logit transformation can be used to re-express the equation as the conditional probability of a tie:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_1\delta_{z_1}(y)+\theta_2\delta_{z_2}(y)+ ...)\)
Here, the coefficients are expressed as the conditional log-odds of a single actor pair. That is, how likely we are to see a tie between i and j, given some model terms. So, what are those model terms or network configurations?
We can view all of the network statistics that have been programmed by using the ?ergm.terms command. You can also use the vignette( "ergm-term-crossRef" ) command to see more information about the model terms.
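As noted above, both commands can be run directly from the R console once the ergm package is loaded:
# open the help page listing the available ERGM terms
?ergm.terms
# open the cross-referenced vignette of model terms
vignette( "ergm-term-crossRef" )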
The ERGM expresses the probability of observing a tie between nodes i and j given some terms (i.e. network configurations). The edge independence model of Erdos and Renyi (1959) uses a single term, the number of edges in the graph, to represent the probability of a tie:
\(P(Y=y)=\Bigg(\frac{1}{c}\Bigg) exp \big\{\theta L(y) \big\}\)
In the ergm()
function, this term is edges
.
You can see the help on the edges term using
?"edges-ergmTerm"
. Note the difference: we are calling
edges
within the ergmTerm
help page. Let’s
take a look at this model:
# run the model
edge.indep.gaNetU <- ergm( gaNetU ~ edges )
# check out the summary
summary( edge.indep.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges)
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -5.3896 0.1028 0 -52.41 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 28987 on 20910 degrees of freedom
## Residual Deviance: 1214 on 20909 degrees of freedom
##
## AIC: 1216 BIC: 1224 (Smaller is better. MC Std. Err. = 0)
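The output below shows the class of the fitted object and the components it stores; a sketch of the calls that produce it:
# the fitted model is an object of class ergm
class( edge.indep.gaNetU )
# the components stored in the fitted object
names( edge.indep.gaNetU )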
## [1] "ergm"
## [1] "coefficients" "iterations" "MCMCtheta" "gradient"
## [5] "hessian" "covar" "failure" "mple.lik"
## [9] "mple.lik.null" "nw.stats" "call" "network"
## [13] "ergm_version" "info" "MPLE_is_MLE" "drop"
## [17] "offset" "estimable" "etamap" "formula"
## [21] "reference" "constraints" "obs.constraints" "estimate"
## [25] "estimate.desc" "control" "null.lik" "mle.lik"
First, the summary shows the coefficient \(\theta\) from the equation above. The value
of -5.39 indicates that the density of the network is below .5 or 50%.
An edges
term of 0 would represent 50% or 0.5 density. As
we saw above, the density of the gaNetU
network is
0.005.
In the general formulation of the ERGM, the \(\delta\) represents the change
statistic, or the change in the statistic of interest when an edge
is added (i.e. when \(Y_{ij}\) goes from 0 to 1). The change statistic for the edges term is always 1, so we can think of the probability of a tie between i and j as the inverse logit (i.e. the logistic transformation) of the coefficient for the edges
term. Specifically, we can use the calculation that is usual for a
logistic regression to interpret the coefficient:
\(\frac{1}{1 + e^{-(\theta_1X_1)}}\)
If we plug in the value of -5.39 that is returned by the
ergm()
function, we get:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges})\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-5.39 \times 1)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = \frac{1}{1 + e^{-(-5.39 \times 1)}} = 0.005\)
Does the value 0.005 look familiar?
The value of 0.005 is returned by using the plogis()
function. Use the ?plogis
command to examine the help file
for this function.
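The two values below can be reproduced with plogis(); a sketch using the rounded coefficient and then the estimated coefficient itself:
# plug in the rounded coefficient
plogis( -5.39 )
# or use the estimated coefficient directly
plogis( coef( edge.indep.gaNetU ) )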
## [1] 0.004541256
## edges
## 0.004543281
So what is this? Recall that the ERGM expresses the probability of observing a tie between nodes i and j given some terms (i.e. network configurations). The edge independence model of Erdos and Renyi (1959) uses a single term, the number of edges in the graph, to represent the probability of a tie. So, here we are saying that the probability of a tie between two nodes is 0.0045433.
Does that help us understand the structure in the network?
Sort of, but it tells us about as much as the random network we plotted above. We can extend this model to include characteristics of nodes.
So far, we have assumed that the probability of a tie is independent of any nodal attribute. We could relax this assumption and test two hypotheses:
Nodes differ in their degree based on their race.
Nodes that are the same race are more likely to share ties (i.e. homophily).
We can test these hypotheses using the nodefactor
(for
categorical attributes [for continuous attributes use
nodecov
]) and nodematch
(for categorical
attributes [for continuous attributes use absdiff
]) terms.
Take a look at the description of the nodefactor term using ?nodefactor.
Why make these hypotheses? First, let’s take a look at the degree distribution by race by comparing the mean degree for each group:
# mean degree for white
round( mean( degree( gaNetU, gmode = "graph", cmode = "degree" )[gaNetU %v% "Race" == 1] ), 2 )
## [1] 1.18
# mean degree for black
round( mean( degree( gaNetU, gmode = "graph", cmode = "degree" )[gaNetU %v% "Race" == 2] ), 2 )
## [1] 0.89
# mean degree for hispanic
round( mean( degree( gaNetU, gmode = "graph", cmode = "degree" )[gaNetU %v% "Race" == 3] ), 2 )
## [1] 0.38
Now, how are the mean degrees different for each racial group?
Are these differences due to random variation?
# let's add the nodefactor term for race to our model
race.gaNetU <- ergm(
gaNetU ~ edges
+ nodefactor( "Race" )
)
summary( race.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodefactor("Race"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -4.9102 0.1799 0 -27.296 < 1e-04 ***
## nodefactor.Race.2 -0.2861 0.1505 0 -1.901 0.057348 .
## nodefactor.Race.3 -1.1391 0.3195 0 -3.565 0.000363 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 28987 on 20910 degrees of freedom
## Residual Deviance: 1197 on 20907 degrees of freedom
##
## AIC: 1203 BIC: 1227 (Smaller is better. MC Std. Err. = 0)
Note that the term adds one network statistic to the model for each categorical value of the variable we have passed, excluding the first category, which serves as the reference. Here, the first category is white (coded as 1), so it is automatically excluded and serves as the referent category. This default can be overridden if we want.
For the nodefactor
term, positive values indicate that a
node with that particular value of the attribute is more likely
to have an edge, relative to the referent category. Alternatively, a
negative value indicates that a node with that particular value of the
attribute is less likely to have an edge, relative to the
referent category.
What does the model tell us?
Let’s look at this for our example. Looking at the table we see:
the term nodefactor.Race.2 has a value of -0.29, but it is not significantly different from zero at the p < 0.05 level. Thus, the difference in degree between white and black respondents that we observed above is consistent with random variation.
the term nodefactor.Race.3 has a value of -1.14 and is significantly different from zero at the p < 0.05 level. Thus, compared to whites, the probability of a tie between i and j is lower if either i or j is Hispanic.
Recall that in the general formulation of the ERGM, the \(\delta\) represents the change
statistic, or the change in the statistic of interest when an edge
is added (i.e. when \(Y_{ij}\) goes from 0 to 1). The change statistic for the nodefactor
term is
different from the edges
term we saw above. If the
predictor is categorical, the value of the change statistic is 0, 1 or
2. Specifically:
If neither of the nodes has the characteristic of interest, the change statistic is 0.
If only one of the nodes has the characteristic, the change statistic is 1.
If both of the nodes in the dyad have the characteristic, the change statistic is 2.
Again, let’s use the calculation that is usual for a logistic regression to interpret the coefficient:
\(\frac{1}{1 + e^{-(\theta_1X_1)}}\)
Using the coefficient of -1.14, the predicted probability of a tie between i and j if they are both Hispanic is:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + \theta_{race} \times \delta_{race})\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic((-4.91 \times 1) + (-1.14 \times 2))\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-4.91 + -2.28)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-7.19) = 0.0008\)
The value of 0.00075 is returned by using the plogis() command. We can see this by plugging the equation elements into the plogis() function.
plogis(
race.gaNetU$coefficients[1]*1 # the edges term
+ race.gaNetU$coefficients[3]*2 # the nodefactor hispanic term
)
## edges
## 0.0007547075
## edges
## 0.00075
What is the predicted probability of a tie between i and j if only one is Hispanic?:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + \theta_{race} \times \delta_{race})\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic((-4.91 \times 1) + (-1.14 \times 1))\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-4.91 + -1.14)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.05) = 0.002\)
The value of 0.002 is returned by using the plogis() command. We can see this by plugging the equation elements into the plogis() function.
plogis(
race.gaNetU$coefficients[1]*1
+ race.gaNetU$coefficients[3]*1 # NOTE: we are only multiplying by 1!!!
)
## edges
## 0.002353967
## edges
## 0.002
Note that this is quite a change from having two nodes who are Hispanic. This difference in the probability of a tie shows you that Hispanics are less connected to the network, relative to other race/ethnicities.
Finally, what is the predicted probability of a tie between i and j if both are White?
Now let’s look at a different hypothesis: Nodes of the same
race/ethnicity are more likely to share ties (i.e. homophily).
We can test this using the nodematch
term. But first, let’s
look at the mixing between each category value with the
mixingmatrix()
function. This function returns homophilous
ties in the diagonal and heterophilous ties in the off-diagonal. To see
more on the mixingmatrix()
function, see the help,
?mixingmatrix
.
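A sketch of the call that produces the mixing matrix shown below:
# cross-tabulate ties within and between the Race categories
mixingmatrix( gaNetU, "Race" )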
## 1 2 3
## 1 40 14 0
## 2 14 34 3
## 3 0 3 4
Looking at the mixing matrix for race, what patterns do you see?
Before we estimate the model for homophily, let’s take a look at the
nodematch
term to see its functionality. You can do so
using ?"nodematch-ergmTerm"
. (note the use of help here is
a bit different).
Now, let’s estimate the model for homophily.
homophily.gaNetU <- ergm(
gaNetU ~ edges
+ nodematch( "Race" ) # add the term for homophily
)
summary( homophily.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodematch("Race"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.6214 0.2427 0 -27.283 <1e-04 ***
## nodematch.Race 1.9849 0.2680 0 7.405 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 28987 on 20910 degrees of freedom
## Residual Deviance: 1139 on 20908 degrees of freedom
##
## AIC: 1143 BIC: 1159 (Smaller is better. MC Std. Err. = 0)
Note that the term adds one network statistic to the model for the
variable. For the nodematch
term, positive values indicate
that for a pair of nodes with the same attribute value,
an edge is more likely. Alternatively, a negative value
indicates that for a pair of nodes with the same
attribute value, an edge is less likely. Thus, positive
coefficients indicate the presence of homophily and
negative coefficients indicate the presence of
heterophily.
Let’s work through our example to see the interpretation of the coefficient. Looking at the table, we see that the term nodematch.Race has a value of 1.98, indicating that ties are more likely among nodes with the same value for the variable Race.
Recall that in the general formulation of the ERGM, the \(\delta\) represents the change statistic, or the change in the statistic of interest when an edge is added (i.e. when \(Y_{ij}\) goes from 0 to 1). The change statistic for the nodematch
term is
similar to the edges
term we saw above. If the predictor is
categorical, the value of the change statistic is 0 or 1.
Specifically:
If i and j have the same value for a categorical covariate the change statistic is 1.
If i and j do not have the same value for a categorical covariate the change statistic is 0.
Again, let’s use the calculation that is usual for a logistic regression to interpret the coefficient:
\(\frac{1}{1 + e^{-(\theta_1X_1)}}\)
Using the coefficient of 1.98, the predicted probability of an edge between nodes with the same race/ethnicity is:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + (\theta_{race.homophily} \times \delta_{race.homophily}))\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.62 \times 1 + 1.98 \times 1)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.62 + 1.98)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-4.64) = 0.0096\)
The value of 0.0096 is returned by using the command
plogis()
.
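A sketch of the plogis() call that produces the value below, plugging in the edges and nodematch coefficients:
plogis(
homophily.gaNetU$coefficients[1]*1 # the edges term
+ homophily.gaNetU$coefficients[2]*1 # the nodematch term
)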
## edges
## 0.009598819
Now, what if we were interested in homophily differences by race?
That is, does homophily differ across the groups? This is referred to as
differential homophily and can be tested using the
diff=TRUE
argument in the nodematch
term in
the ergm()
function.
d.homophily.gaNetU <- ergm(
gaNetU ~ edges
+ nodematch( "Race", diff=TRUE ) # not the addition
)
summary( d.homophily.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodematch("Race", diff = TRUE))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.6214 0.2427 0 -27.283 < 1e-04 ***
## nodematch.Race.1 2.2647 0.2902 0 7.804 < 1e-04 ***
## nodematch.Race.2 1.7302 0.2975 0 5.815 < 1e-04 ***
## nodematch.Race.3 2.0112 0.5580 0 3.604 0.000313 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 28987 on 20910 degrees of freedom
## Residual Deviance: 1134 on 20906 degrees of freedom
##
## AIC: 1142 BIC: 1174 (Smaller is better. MC Std. Err. = 0)
What is the interpretation of the model?
In the model examining homophily, note that we excluded the
nodefactor
term. This inherently assumes that there are no
degree differences between the nodes based on Race
. As a
result, we are not accounting for an important feature of the network.
Specifically, we could observe homophily on Race
simply
because there are differences in the degree distribution by the values
for the attribute Race
. Let’s take this into account by
reestimating the model and including both the nodefactor
and the nodematch
terms.
DH.gaNetU <- ergm(
gaNetU ~ edges
+ nodefactor( "Race" )
+ nodematch( "Race", diff=FALSE )
)
summary( DH.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodefactor("Race") + nodematch("Race",
## diff = FALSE))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.2910 0.2837 0 -22.172 <1e-04 ***
## nodefactor.Race.2 -0.2443 0.1148 0 -2.129 0.0333 *
## nodefactor.Race.3 -0.3498 0.2759 0 -1.268 0.2048
## nodematch.Race 1.9323 0.2785 0 6.938 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 28987 on 20910 degrees of freedom
## Residual Deviance: 1134 on 20906 degrees of freedom
##
## AIC: 1142 BIC: 1174 (Smaller is better. MC Std. Err. = 0)
All of the models above are edge independent or dyad independent models, in that the probability of a tie from i to j does not depend on the presence or absence of other ties (e.g. a tie from i to k). But, in reality, many network structures are formed due to dependencies between dyads. For example, whether I decide to befriend you may depend on whether we have a shared friend in common. That is dyad dependence.
A major strength of the ERGM is the ability to model such endogenous
network structures. In the models above, we observed homophily for the
attribute Race
. The model proposed that the probability of
a tie was more likely to occur between i and j if they
both are of the same race/ethnicity. This is because we observed many
same-race/ethnicity dyads in the mixing matrix:
## 1 2 3
## 1 40 14 0
## 2 14 34 3
## 3 0 3 4
But, we may observe such mixing due to transitivity in a network. If we wanted to examine homophily, we would want to account for the tendency for transitive relationships to form.
Let’s estimate a model that includes the gwesp
term.
This term adds a geometrically weighted count of the edgewise shared partners in a network to the model. Positive values indicate that there is a tendency for an edge to form between i and j if they have a shared partner.
Note that when we move to the dyad dependence model, the estimation gets
a bit more complicated. We won’t go into the details now, but you will
see some differences in the code below.
triad.gaNetU <- ergm(
gaNetU ~ edges
+ nodematch( "Race", diff=TRUE )
+ gwesp( decay = 0.25, fixed = TRUE ), # here is our gwesp term
control = control.ergm(
seed = 605 ) # here we use the control argument to set the seed to reproduce results
)
summary( triad.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodematch("Race", diff = TRUE) +
## gwesp(decay = 0.25, fixed = TRUE), control = control.ergm(seed = 605))
##
## Monte Carlo Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.8062 0.2390 0 -28.472 < 1e-04 ***
## nodematch.Race.1 1.7942 0.2639 0 6.800 < 1e-04 ***
## nodematch.Race.2 1.4955 0.2736 0 5.467 < 1e-04 ***
## nodematch.Race.3 1.9731 0.5133 0 3.844 0.000121 ***
## gwesp.fixed.0.25 1.6955 0.1546 0 10.966 < 1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 28987 on 20910 degrees of freedom
## Residual Deviance: 1051 on 20905 degrees of freedom
##
## AIC: 1061 BIC: 1101 (Smaller is better. MC Std. Err. = 0.6673)
What is the interpretation of the model?
Let’s walk back through what we covered above with a directed network, noting differences in the model as we go.
For the power and influence network, individuals could have asymmetric nominations. That is, i could nominate j and j didn’t necessarily nominate i. We will keep this asymmetry so that we can treat the network as directed.
# clear the workspace since we will recycle object names
rm( list = ls() )
# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-power-w1-adj.csv"
# read in the .csv file
piMat <- as.matrix(
read.csv(
loc,
as.is = TRUE,
header = TRUE,
row.names = 1
)
)
piNetD <- as.network( piMat, directed = TRUE )
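Since we cleared the workspace, the coords object used for plotting no longer exists. As before, a minimal sketch that recreates it from the coordinates gplot() returns is:
# save node coordinates for the directed network to reuse across plots
set.seed( 605 )
coords <- gplot( piNetD, gmode = "digraph" )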
gplot(
piNetD,
gmode = "digraph",
edge.col="grey40",
vertex.col="#5bf5b5",
coord = coords,
main = "PINS Power/Influence Network (Directed)"
)
Think about the structure here. What do you notice?
How might we go about incorporating this information into a model of the network?
Before we get going, let’s define some graphical objects we will use.
# define the number of nodes
g <- dim( piMat )[1]
# define the number of edges (the network is directed, so we do not divide by 2)
l <- sum( piMat )
# define the density (a directed network has g*(g-1) possible ties)
den <- l / ( g *( g - 1 ) )
den
## [1] 0.003849832
What is the interpretation of the density?
Recall that the exponential random graph model seeks to model the
underlying processes that generated the observed network. Let’s create a
random graph with the same nodes, edges, and density as the
piNetD
network and compare them by using the
rgraph()
function in the sna
package:
# set the seed to reproduce these results
set.seed( 605 )
# generate the random graph
random.graph <- rgraph(
g, # the number of nodes in the network
1, # we want just 1 random graph
tprob = den, # the density of the network
mode = "digraph" # it is directed
)
# now coerce the random graph to a network object (keeping it directed)
random.net <- as.network( random.graph, directed = TRUE )
We can now compare our observed network with a graph that has the same density but whose edges are randomly distributed over the nodes.
# set the margins
par( mfrow=c( 1,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
piNetD,
gmode = "digraph",
edge.col="grey40",
vertex.col="#5bf5b5",
coord = coords,
main = "PINS Power/Influence Network (Directed)"
)
# create the second plot
gplot(
random.net,
gmode = "digraph",
edge.col = "grey40",
vertex.col="#069e6e",
main = "Random network"
)
How is the Power/Influence network different from the random network?
We can use the same attributes as above. Since we cleared the workspace, we will want to reload and recreate the attributes.
# define the attributes object
attrs <- read.csv(
"https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
as.is = TRUE,
header = TRUE
)
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1
# assign the attributes to the network
piNetD %v% "Age" <- attrs[,1]
# now assign the recoded value
piNetD %v% "Race" <- race.recoded
Now, let’s use the Race
attribute and the
Age
attribute in a plot. We will color the nodes based on
Race
and size the nodes based on Age
. Note
that for Age
, we are going to need to use the
rescale()
function.
# create the colors for race
node.cols <- piNetD %v% "Race"
node.cols[ piNetD %v% "Race" == 1 ] <- "#eb105d"
node.cols[ piNetD %v% "Race" == 2 ] <- "#5773fa"
node.cols[ piNetD %v% "Race" == 3 ] <- "#37cf00"
# define the rescale function
rescale <- function( nchar, low, high ){
min_d <- min( nchar )
max_d <- max( nchar )
rscl <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
rscl
}
# now, plug our pieces into our plot
gplot(
piNetD,
gmode = "digraph",
edge.col="grey40",
vertex.col=node.cols,
vertex.cex = rescale( piNetD %v% "Age", 0.5, 2 ),
coord = coords,
main = "PINS Power/Influence Network (Directed)"
)
# add a legend
legend( "topright",
legend=c( "White", "Black", "Hispanic" ),
col = unique( node.cols ),
pch = 19,
pt.cex=1
)
Think about the structure again. What do you notice about race or age?
How might we go about incorporating this information into a model of the network?
Let’s test three new hypotheses:
Nodes differ in their degree based on their age.
Nodes that are of similar age are more likely to share ties (i.e. homophily).
Nodes reciprocate ties.
We can test these hypotheses using the nodecov
(since
age is continuous) and absdiff
terms. And, for reciprocity,
a new term mutual
.
# let's add the nodecov term for age to our model
age.d.piNetD <- ergm(
piNetD ~ edges
+ nodecov( "Age" )
)
summary( age.d.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodecov("Age"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -8.868656 0.428498 0 -20.697 <1e-04 ***
## nodecov.Age 0.039425 0.004715 0 8.362 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 57975 on 41820 degrees of freedom
## Residual Deviance: 2044 on 41818 degrees of freedom
##
## AIC: 2048 BIC: 2065 (Smaller is better. MC Std. Err. = 0)
The nodecov term adds one network statistic to the
model that sums the attribute of interest for the two nodes comprising
the end points of each edge in the network. Positive values indicate
that as the continuous attribute increases, the edge is more
likely, relative to lower values. Note the difference here compared
to the interpretation for categorical attributes.
What does the model tell us?
Let’s look at this for our example. Looking at the table, we see that the term nodecov.Age has a value of 0.04 and is significantly different from zero at the p < 0.05 level, indicating that ties are more likely for nodes with higher values of Age.
Recall that in the general formulation of the ERGM, the \(\delta\) represents the change
statistic, or the change in the statistic of interest when an edge
is added (i.e. when \(Y_{ij}\) goes from 0 to 1). The change statistic for the nodecov
term is
different from the edges
term we saw above. If the
predictor is continuous, we interpret one-unit changes in the
attribute.
Using the coefficient of 0.04, if we increase Age
by 1
unit, the predicted probability of a tie between i and
j is:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + \theta_{age} \times \delta_{age})\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic((-8.87 \times 1) + (0.04 \times 1))\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.87 + 0.04)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.83) = 0.00015\)
The value of 0.00015 is returned by using the plogis() command. We can see this by plugging the equation elements into the plogis() function.
plogis(
age.d.piNetD$coefficients[1]*1 # the edges term
+ age.d.piNetD$coefficients[2]*1 # the nodecov term for age
)
## edges
## 0.0001463694
The term for Age above tells us that degree is correlated with age. But, remember that this network is directed. So, we could test two additional hypotheses: that age is associated with receiving ties (indegree) and that age is associated with sending ties (outdegree).
Note that we use the nodecov
term for continuous
attributes. But, we can refine this for directed graphs using the
nodeicov
and nodeocov
terms. (For categorical
attributes, we use nodeifactor
and
nodeofactor
).
For the nodeicov
term, positive values indicate that as
the continuous attribute increases, then i is more likely to
receive a tie from j (because it is the indegree).
For the nodeocov
term, positive values indicate that as the
continuous attribute increases, then i is more likely to
send a tie to j (because it is the outdegree).
Let’s check it out in the model.
# let's add the nodeicov and nodeocov terms for age to our model
age.sr.piNetD <- ergm(
piNetD ~ edges
+ nodeicov( "Age" ) # effect for receiving ties
+ nodeocov( "Age" ) # effect for sending ties
)
summary( age.sr.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodeicov("Age") + nodeocov("Age"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -8.908835 0.435864 0 -20.439 <1e-04 ***
## nodeicov.Age 0.067429 0.006566 0 10.270 <1e-04 ***
## nodeocov.Age 0.009698 0.006971 0 1.391 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 57975 on 41820 degrees of freedom
## Residual Deviance: 2006 on 41817 degrees of freedom
##
## AIC: 2012 BIC: 2038 (Smaller is better. MC Std. Err. = 0)
To remind you, in this model we have separated the effect of
Age
on sending ties and receiving ties. What is the
interpretation of the model estimates?
Now let’s look at a different hypothesis: Nodes of similar age are
more likely to share ties (i.e. homophily). We can test this
using the absdiff
term. Before we estimate the model for
homophily, let’s take a look at the absdiff
term to see its
functionality. You can do so using ?"absdiff-ergmTerm"
.
(note the use of help here is a bit different).
Now, let’s estimate the model for homophily.
age.h.piNetD <- ergm(
piNetD ~ edges
+ nodeicov( "Age" )
+ nodeocov( "Age" )
+ absdiff( "Age" ) # homophily for age
)
summary( age.h.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodeicov("Age") + nodeocov("Age") +
## absdiff("Age"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -8.668025 0.431516 0 -20.087 < 1e-04 ***
## nodeicov.Age 0.082026 0.008812 0 9.308 < 1e-04 ***
## nodeocov.Age -0.002340 0.008832 0 -0.265 0.79106
## absdiff.Age -0.033437 0.010562 0 -3.166 0.00155 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 57975 on 41820 degrees of freedom
## Residual Deviance: 1995 on 41816 degrees of freedom
##
## AIC: 2003 BIC: 2037 (Smaller is better. MC Std. Err. = 0)
Note that the term adds one network statistic to the model for the
variable that takes the absolute difference in the value of the
attribute for each node. This means that negative values will indicate
homophily. For the absdiff
term, negative values indicate
that for a pair of nodes with similar attribute values,
an edge is more likely. Put differently, as the absolute value
of the difference between each node on the attribute gets larger, the
probability of a tie declines. Thus, negative coefficients indicate the
presence of homophily and positive coefficients
indicate the presence of heterophily.
Using the coefficient of -0.03, the predicted probability of an edge between nodes with the same age is:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + (\theta_{age.homophily} \times \delta_{age.homophily}))\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.67 \times 1 + -0.03 \times 0)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.67 + 0)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.67) = 0.00017\)
The value of 0.00017 is returned by using the plogis() command.
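A sketch of the plogis() call that produces the value below (only the edges term enters, since the absdiff change statistic is 0 when the ages are the same):
plogis(
age.h.piNetD$coefficients[1]*1 # the edges term
+ age.h.piNetD$coefficients[4]*0 # the absdiff term (change statistic of 0)
)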
## edges
## 0.0001719689
Now, let’s test whether the probability of a tie from i to j depends on the presence of a tie from j to i (i.e. reciprocity).
recp.piNetD <- ergm(
piNetD ~ edges
+ nodeicov( "Age" )
+ nodeocov( "Age" )
+ absdiff( "Age" )
+ mutual, # add the term for reciprocity
control = control.ergm(
seed = 605 ) # here we use the control argument to set the seed to reproduce results
)
summary( recp.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodeicov("Age") + nodeocov("Age") +
## absdiff("Age") + mutual, control = control.ergm(seed = 605))
##
## Monte Carlo Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -8.569734 0.446426 0 -19.196 < 1e-04 ***
## nodeicov.Age 0.081230 0.008958 0 9.068 < 1e-04 ***
## nodeocov.Age -0.004240 0.008694 0 -0.488 0.62577
## absdiff.Age -0.033707 0.010528 0 -3.202 0.00137 **
## mutual 1.743896 0.741226 0 2.353 0.01864 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 57975 on 41820 degrees of freedom
## Residual Deviance: 1991 on 41815 degrees of freedom
##
## AIC: 2001 BIC: 2045 (Smaller is better. MC Std. Err. = 0.1983)
What is the interpretation of the coefficient?
Using the coefficient of 1.74, the predicted probability of an edge from i to j if a tie exists from j to i is:
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + (\theta_{reciprocity} \times \delta_{reciprocity}))\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.57 \times 1 + 1.74 \times 1)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.57 + 1.74)\)
\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.83) = 0.0011\)
The value of 0.0011 is returned by using the command
plogis()
.
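A sketch of the plogis() call that produces the value below:
plogis(
recp.piNetD$coefficients[1]*1 # the edges term
+ recp.piNetD$coefficients[5]*1 # the mutual term
)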
## edges
## 0.00108419
In this lab, you were introduced to the exponential family of random graph models in R using the ergm package. We worked through different dependence specifications of the model reviewed in the Introduction to Exponential Random Graph Models (ERGMs) lecture. We also showed how to incorporate node-level attributes and endogenous network structures.