In the Introduction to Exponential Random Graph Models (ERGMs) lecture you were introduced to several examples of models from the exponential family of random graphs. This lab provides an introduction to ERGMs in R using the ergm package. We will work through different dependence specifications of the model reviewed in the lecture (e.g. dyadic independence) and show how to incorporate node level attributes into the model.


Modeling Friendship in the PINS Get Along With Network

For the get along with network, individuals could have asymmetric nominations. That is, i could nominate j and j didn’t necessarily nominate i. But, we are going to symmetrize the network by only taking ties for which both i and j indicated that the get along with the other person. This will give us an undirected network.

# load the libraries we need
library( sna )
library( network )

# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-getalong-w1-adj.csv"

# read in the .csv file
gaMat <- as.matrix(
  read.csv( 
    loc,
    as.is = TRUE,
    header = TRUE,
    row.names = 1 
    )
  )

# use the symmetrize() function to create an undirected matrix
gaMatU <- symmetrize( gaMat, rule = "strong" )

# create the network object
gaNetU <- as.network( gaMatU, directed = FALSE )


# Set the coordinates
set.seed( 605 )
coords <- gplot( gaNetU )
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col="grey40", 
  vertex.col="#c78c71",
  coord = coords,
  main = "PINS Get\n Along With Network (Undirected)"  
  )

Think about the structure here. What do you notice?

How might we go about incorporating this information into a model of the network?


Before we get going, let’s define some graphical objects we will use.

# define the number of nodes
g <- dim( gaMatU )[1]

# define the number of edges
l <- sum( gaMatU )/2

# define the density
den <- l / ( g *( g - 1 ) / 2 ) 

den
## [1] 0.004543281

What is the interpretation of the density?


Recall that the exponential random graph model seeks to model the underlying processes that generated the observed network. Let’s create a random graph with the same nodes, edges, and density as the gaNetU network and compare them by using the rgraph() function in the sna package:

?rgraph

# set the seed to reproduce these results
set.seed( 605 )

# generate the random graph
random.graph <- rgraph(
  g,                   # the number of nodes in the network
  1,                   # we want just 1 random graph 
  tprob = den,         # the density of the network 
  mode = "graph"       # it is undirected 
  )

# now coerce the random graph to a network object
random.net <- as.network( random.graph, directed = FALSE )


We can now compare our observed data with a graph that has the same density and the edges are randomly distributed over the nodes.

# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col="grey40", 
  vertex.col="#c78c71",
  coord = coords,
  main = "PINS Get\n Along With Network (Undirected)"
  )

# create the second plot
gplot( 
  random.net, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#069e6e",
  main = "Random network"
  )

How is the Get Along With network different from the random network?


Adding Attributes (revisited)

Recall that there is an attributes file that we can attach to these data. Let’s load that and use these attributes in our plot.

# define the attributes object
attrs <- read.csv( 
    "https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
    as.is = TRUE,
    header = TRUE 
    )

# assign the attributes to the network
gaNetU %v% "Age" <- attrs[,1]
gaNetU %v% "Race" <- attrs[,2]

Now that these are assigned as an attribute, we can reference them using the %v% operator. For example, let’s look at the values for the Race attribute.

# print out the values for each node
gaNetU %v% "Race"
##   [1] 1 2 1 2 2 2 1 2 2 3 2 1 1 1 1 2 1 2 1 2 3 1 1 3 4 2 1 1 3 2 2 2 2 1 2 2 2
##  [38] 3 1 2 2 1 1 3 1 2 2 3 2 1 1 1 2 3 1 1 1 1 3 1 2 1 1 2 2 3 2 1 2 1 1 1 2 2
##  [75] 1 1 2 2 3 2 2 1 3 1 1 1 2 2 2 1 2 2 2 3 1 1 1 2 2 2 2 1 3 2 2 2 1 1 3 2 1
## [112] 1 1 2 2 1 3 2 2 1 2 1 2 1 1 2 1 2 1 2 1 2 3 3 2 2 3 3 3 3 2 2 2 3 2 1 1 2
## [149] 2 3 2 1 2 1 1 1 2 2 2 3 2 2 1 1 1 2 2 2 1 2 1 1 1 2 2 1 1 1 1 2 2 2 1 2 2
## [186] 2 1 2 1 1 1 3 2 2 2 1 2 2 2 3 2 3 3 2 2
# use the table() function to count each value
table( gaNetU %v% "Race" )
## 
##  1  2  3  4 
## 79 96 29  1
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1

# now assign the recoded value
gaNetU %v% "Race" <- race.recoded

Now, let’s use the Race attribute and the Age attribute in a plot. We will color the nodes based on Race and size the nodes based on Age. Note that for Age, we are going to need to use the rescale() function.

# create the colors for race
node.cols <- gaNetU %v% "Race"
node.cols[ gaNetU %v% "Race" == 1 ] <- "#eb105d"
node.cols[ gaNetU %v% "Race" == 2 ] <- "#5773fa"
node.cols[ gaNetU %v% "Race" == 3 ] <- "#37cf00"

# define the rescale function
rescale <- function( nchar, low, high ){
  min_d <- min( nchar )
  max_d <- max( nchar )
  rscl  <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
  rscl
}

# now, plug our pieces into our plot
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col = node.cols,
  vertex.cex = rescale( gaNetU %v% "Age", 0.5, 2 ),
  coord = coords,
  main = "PINS Get\n Along With Network (Undirected)"
  )

# add a legend
legend( "topright", 
        legend=c( "White", "Black", "Hispanic" ), 
        col = unique( node.cols ), 
        pch = 19, 
        pt.cex=1
        )

Think about the structure again. What do you notice about race or age?

How might we go about incorporating this information into a model of the network?


Getting Started with the ergm Package and the ergm() Function

First things first, we need to install the ergm package and load the library of functions.

# install the ergm package
install.packages( "ergm" )

# load the ergm package
library( ergm )

# take a look at the functionality
help( package = "ergm" )


If you look at the ergm() function using ?ergm, then you will see an extensive help file on the function.


The ergm() Function

The ergm() function takes two arguments:

  • an object of class network (the dependent variable)

  • and some terms (i.e. network configurations)

Thus, the model form looks like this: model <- ergm(network ~ term). Recall from the Introduction to ERGMs lecture that the general expression of the model is:

\(logit\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = \sum_{k=1}^\kappa \theta_k\delta_{z_k(y)}\)

As with logistic regression, the logit transformation can be used to re-express the equation as the conditional probability of a tie:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_1\delta_{z_1(y)+\theta_2\delta_{z_2}(y)...})\)

Here, the coefficients are expressed as the conditional log-odds of a single actor pair. That is, how likely we are to see a tie between i and j, given some model terms. So, what are those model terms or network configurations?

We can view all of the network statistics that have been programmed by using the ?ergm.terms command. You can also use the (vignette("ergm-term-crossRef")) command to show you a bit more info about the model terms.


Edge independence (Bernoulii/Simple Random graphs)

The ERGM expresses the probability of observing a tie between nodes i and j given some terms (i.e. network configurations). The edge independence model of Erdos and Renyi (1959) uses a single term, the number of edges in the graph, to represent the probability of a tie:

\(P(Y=y)=\Bigg(\frac{1}{c}\Bigg) exp \big\{\theta L(y) \big\}\)

In the ergm() function, this term is edges. You can see the help on the edges term using ?"edges-ergmTerm". Note the difference, we are calling edges within the ergmTerm help page. Let’s take a look at this model:

# run the model
edge.indep.gaNetU <- ergm( gaNetU ~ edges ) 

# check out the summary
summary( edge.indep.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges)
## 
## Maximum Likelihood Results:
## 
##       Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges  -5.3896     0.1028      0  -52.41   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 28987  on 20910  degrees of freedom
##  Residual Deviance:  1214  on 20909  degrees of freedom
##  
## AIC: 1216  BIC: 1224  (Smaller is better. MC Std. Err. = 0)
# note that the ergm() function creates an object of class ergm
class( edge.indep.gaNetU )
## [1] "ergm"
names( edge.indep.gaNetU )
##  [1] "coefficients"    "iterations"      "MCMCtheta"       "gradient"       
##  [5] "hessian"         "covar"           "failure"         "mple.lik"       
##  [9] "mple.lik.null"   "nw.stats"        "call"            "network"        
## [13] "ergm_version"    "info"            "MPLE_is_MLE"     "drop"           
## [17] "offset"          "estimable"       "etamap"          "formula"        
## [21] "reference"       "constraints"     "obs.constraints" "estimate"       
## [25] "estimate.desc"   "control"         "null.lik"        "mle.lik"

First, the summary shows the coefficient \(\theta\) from the equation above. The value of -5.39 indicates that the density of the network is below .5 or 50%. An edges term of 0 would represent 50% or 0.5 density. As we saw above, the density of the gaNetU network is 0.005.

In the general formulation of the ERGM, the \(\delta\) represents the change statistic, or the change in the statistic of interest when an edge is added (i.e. \(Y_{ij}\)) goes from 0 to 1). The change statistic for the edges term is always 1, so we can think of the probability of a tie between i and j as the logit of the coefficients for the edges term. Specifically, we can use the calculation that is usual for a logistic regression to interpret the coefficient:

\(\frac{1}{1 + e^{-(\theta_1X_1)}}\)


If we plug in the value of -5.39 that is returned by the ergm() function, we get:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges})\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-5.39 \times 1)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = \frac{1}{1 + e^{-(-5.39 \times 1)}} = 0.005\)


Does the value 0.005 look familiar?

The value of 0.005 is returned by using the plogis() function. Use the ?plogis command to examine the help file for this function.


# we could write it out 
plogis( -5.39 * 1 )
## [1] 0.004541256
# or, we could pull from the model object
plogis( edge.indep.gaNetU$coefficients[1] * 1 )
##       edges 
## 0.004543281

So what is this? Recall that the ERGM expresses the probability of observing a tie between nodes i and j given some terms (i.e. network configurations). The edge independence model of Erdos and Renyi (1959) uses a single term, the number of edges in the graph, to represent the probability of a tie. So, here we are saying that the probability of a tie between two nodes is 0.0045433.

Does that help us understand the structure in the network?

Sort of, but it tells us about as much as the random network we plotted above. We can extend this model to include characteristics of nodes.


Adding Nodal Covariates

So far, we have assumed that the probability of a tie is independent of any nodal attribute. We could relax this assumption and test two hypotheses:

  • Nodes differ in their degree based on their race.

  • Nodes that are the same race are more likely to share ties (i.e. homophily).


We can test these hypotheses using the nodefactor (for categorical attributes [for continuous attributes use nodecov]) and nodematch (for categorical attributes [for continuous attributes use absdiff]) terms. Take a look at the description of the term nodefactor term using ?nodefactor.


Testing the Degree Effects of Race/Ethnicity

Why make these hypotheses? First, let’s take a look at the degree distribution for race. Let’s compare the mean degree for each group:

# mean degree for white
round( mean( degree( gaNetU, gmode = "graph", cmode = "degree" )[gaNetU %v% "Race" == 1] ), 2 )
## [1] 1.18
# mean degree for black
round( mean( degree( gaNetU, gmode = "graph", cmode = "degree" )[gaNetU %v% "Race" == 2] ), 2 )
## [1] 0.89
# mean degree for hispanic
round( mean( degree( gaNetU, gmode = "graph", cmode = "degree" )[gaNetU %v% "Race" == 3] ), 2 )
## [1] 0.38

Now, how are the mean degrees different for each racial group?

Are these differences due to random variation?

# let's add the nodefactor term for race to our model
race.gaNetU <- ergm( 
  gaNetU ~ edges 
  + nodefactor( "Race" )
  ) 

summary( race.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodefactor("Race"))
## 
## Maximum Likelihood Results:
## 
##                   Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges              -4.9102     0.1799      0 -27.296  < 1e-04 ***
## nodefactor.Race.2  -0.2861     0.1505      0  -1.901 0.057348 .  
## nodefactor.Race.3  -1.1391     0.3195      0  -3.565 0.000363 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 28987  on 20910  degrees of freedom
##  Residual Deviance:  1197  on 20907  degrees of freedom
##  
## AIC: 1203  BIC: 1227  (Smaller is better. MC Std. Err. = 0)

Note that the term adds one network statistic to the model for each categorical value of the variable we have passed. The first category is excluded as the reference category. The first category is white (which is coded as 1), so it is automatically excluded and serves as the referent category. This can be overriden if we want to.

For the nodefactor term, positive values indicate that a node with that particular value of the attribute is more likely to have an edge, relative to the referent category. Alternatively, a negative value indicates that a node with that particular value of the attribute is less likely to have an edge, relative to the referent category.

What does the model tell us?


Let’s look at this for our example. Looking at the table we see:

  • the term nodefactor.Race.2 has a value of -0.29, but is not significantly different from zero at the p<0.5 level. Thus, the difference in degree for white and black that we observed above is consistent with random variation.

  • the term nodefactor.Race.3 has a value of -1.14 and is significantly different from zero at the p<0.5 level. Thus, compared to whites, the probability of a tie between i and j is lower if either i or j are Hispanic.

Recall that in the general formulation of the ERGM, the \(\delta\) represents the change statistic, or the change in the statistic of interest when an edge is added (i.e. \(Y_{ij}\)) goes from 0 to 1). The change statistic for the nodefactor term is different from the edges term we saw above. If the predictor is categorical, the value of the change statistic is 0, 1 or 2. Specifically:

  • If neither of the nodes have the characteristic of interest, the change statistic is 0.

  • A value of 2 indicates that both of the nodes in the dyad have the characteristic.


Again, let’s use the calculation that is usual for a logistic regression to interpret the coefficient:

\(\frac{1}{1 + e^{-(\theta_1X_1)}}\)


Using the coefficient of -1.14, the predicted probability of a tie between i and j if they are both Hispanic is:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + \theta_{race} \times \delta_{race})\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic((-4.91 \times 1) + (-1.14 \times 2))\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-4.91 + -2.28)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-7.19) = 0.0008\)


The value of 7.5^{-4} is returned by using the command plogis(). We can see this by adding the equation elements to the plogis() function.

plogis(
    race.gaNetU$coefficients[1]*1  # the edges term
  + race.gaNetU$coefficients[3]*2  # the nodefactor hispanic term
  )
##        edges 
## 0.0007547075
# rounded
round( plogis( race.gaNetU$coefficients[1]*1 + race.gaNetU$coefficients[3]*2 ), 5 )
##   edges 
## 0.00075

What is the predicted probability of a tie between i and j if only one is Hispanic?:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + \theta_{race} \times \delta_{race})\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic((-4.91 \times 1) + (-1.14 \times 1))\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-4.91 + -1.14)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.05) = 0.002\)

The value of 0.002 is returned by using the command plogis(). We can see this by adding the equation elements to the plogis() function.

plogis(
    race.gaNetU$coefficients[1]*1
  + race.gaNetU$coefficients[3]*1  # NOTE: we are only multiplying by 1!!!
  )
##       edges 
## 0.002353967
# rounded
round( plogis( race.gaNetU$coefficients[1]*1 + race.gaNetU$coefficients[3]*1 ), 3 )
## edges 
## 0.002

Note that this is quite a change from having two nodes who are Hispanic. This difference in the probability of a tie shows you that Hispanics are less connected to the network, relative to other race/ethnicities.

Finally, what is the predicted probability of a tie between i and j if only both are White?


Testing Homophily on Race/Ethnicity

Now let’s look at a different hypothesis: Nodes of the same race/ethnicity are more likely to share ties (i.e. homophily). We can test this using the nodematch term. But first, let’s look at the mixing between each category value with the mixingmatrix() function. This function returns homophilous ties in the diagonal and heterophilous ties in the off-diagonal. To see more on the mixingmatrix() function, see the help, ?mixingmatrix.

mixingmatrix( gaNetU, "Race" )
##    1  2  3
## 1 40 14  0
## 2 14 34  3
## 3  0  3  4

Looking at the mixing matrix for race, what patterns do you see?

Before we estimate the model for homophily, let’s take a look at the nodematch term to see its functionality. You can do so using ?"nodematch-ergmTerm". (note the use of help here is a bit different).

Now, let’s estimate the model for homophily.

homophily.gaNetU <- ergm( 
  gaNetU ~ edges 
  + nodematch( "Race" )  # add the term for homophily
  ) 

summary( homophily.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodematch("Race"))
## 
## Maximum Likelihood Results:
## 
##                Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges           -6.6214     0.2427      0 -27.283   <1e-04 ***
## nodematch.Race   1.9849     0.2680      0   7.405   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 28987  on 20910  degrees of freedom
##  Residual Deviance:  1139  on 20908  degrees of freedom
##  
## AIC: 1143  BIC: 1159  (Smaller is better. MC Std. Err. = 0)

Note that the term adds one network statistic to the model for the variable. For the nodematch term, positive values indicate that for a pair of nodes with the same attribute value, an edge is more likely. Alternatively, a negative value indicates that for a pair of nodes with the same attribute value, an edge is less likely. Thus, positive coefficients indicate the presence of homophily and negative coefficients indicate the presence of heterophily.

Let’s work through our example to see the interpretation of the coefficient. Looking at the table we see:

  • the term nodematch.Race has a value of round( homophily.gaNetU$coefficients[2], 2 ), indicating that ties are more likely among nodes with the same value for the variable Race.


Recall that in the general formulation of the ERGM, the \(\delta\) represents the change statistic, or the change in the statistic of interest when an edge is added (i.e. \(Y_{ij}\)) goes from 0 to 1). The change statistic for the nodematch term is similar to the edges term we saw above. If the predictor is categorical, the value of the change statistic is 0 or 1. Specifically:

  • If i and j have the same value for a categorical covariate the change statistic is 1.

  • If i and j do not have the same value for a categorical covariate the change statistic is 0.


Again, let’s use the calculation that is usual for a logistic regression to interpret the coefficient:

\(\frac{1}{1 + e^{-(\theta_1X_1)}}\)


Using the coefficient of round( homophily.gaNetU$coefficients[2], 2 ), the predicted probability of an edge between nodes with the same race/ethnicity is:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + (\theta_{race.homophily} \times \delta_{race.homophily}))\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.62 \times 1 + 1.98 \times 1)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.62 + 1.98)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-4.64) = 0.0096\)


The value of 0.0096 is returned by using the command plogis().

plogis( homophily.gaNetU$coefficients[1]*1 + homophily.gaNetU$coefficients[2]*1 )
##       edges 
## 0.009598819

Now, what if we where interested in homophily differences by race. That is, does homophily differ for the groups? This is referred to as differential homophily and can be tested using the diff=TRUE argument in the nodematch term in the ergm() function.

d.homophily.gaNetU <- ergm( 
  gaNetU ~ edges 
  + nodematch( "Race", diff=TRUE ) # not the addition
  ) 

summary( d.homophily.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodematch("Race", diff = TRUE))
## 
## Maximum Likelihood Results:
## 
##                  Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges             -6.6214     0.2427      0 -27.283  < 1e-04 ***
## nodematch.Race.1   2.2647     0.2902      0   7.804  < 1e-04 ***
## nodematch.Race.2   1.7302     0.2975      0   5.815  < 1e-04 ***
## nodematch.Race.3   2.0112     0.5580      0   3.604 0.000313 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 28987  on 20910  degrees of freedom
##  Residual Deviance:  1134  on 20906  degrees of freedom
##  
## AIC: 1142  BIC: 1174  (Smaller is better. MC Std. Err. = 0)

What is the interpretation of the model?


Accounting for Degree and Homophily

In the model examining homophily, note that we excluded the nodefactor term. This inherently assumes that there are no degree differences between the nodes based on Race. As a result, we are not accounting for an important feature of the network. Specifically, we could observe homophily on Race simply because there are differences in the degree distribution by the values for the attribute Race. Let’s take this into account by reestimating the model and including both the nodefactor and the nodematch terms.

DH.gaNetU <- ergm( 
  gaNetU ~ edges
  + nodefactor( "Race" )            
  + nodematch( "Race", diff=FALSE )
  ) 

summary( DH.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodefactor("Race") + nodematch("Race", 
##     diff = FALSE))
## 
## Maximum Likelihood Results:
## 
##                   Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges              -6.2910     0.2837      0 -22.172   <1e-04 ***
## nodefactor.Race.2  -0.2443     0.1148      0  -2.129   0.0333 *  
## nodefactor.Race.3  -0.3498     0.2759      0  -1.268   0.2048    
## nodematch.Race      1.9323     0.2785      0   6.938   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 28987  on 20910  degrees of freedom
##  Residual Deviance:  1134  on 20906  degrees of freedom
##  
## AIC: 1142  BIC: 1174  (Smaller is better. MC Std. Err. = 0)


Dyad Dependence: Modeling Endogenous Network Structures

All of the models above are edge indepedent or dyad independent models in that the probability of a tie from i to j is not dependent on a tie to k. But, in reality, many network structures are formed due to dependencies between dyads. For example, whether I decide to befriend you may depend on whether we have a shared friend in common. That is dyad dependence.

A major strength of the ERGM is the ability to model such endogenous network structures. In the models above, we observed homophily for the attribute Race. The model proposed that the probability of a tie was more likely to occur between i and j if they both are of the same race/ethnicity. This is because we observed many same-race/ethnicity dyads in the mixing matrix:

##    1  2  3
## 1 40 14  0
## 2 14 34  3
## 3  0  3  4

But, we may observe such mixing due to transitivity in a network. If we wanted to examine homophily, we would want to account for the tendency for transitive relationships to form.

Let’s estimate a model that includes the gwesp term. This terms add a count of the edgewise shared parterns in a network to the model. Positive values indicate that there is a tendency for an edge to form between i and j if they have a shared partner. Note that when we move to the dyad dependence model, the estimation gets a bit more complicated. We won’t go into the details now, but you will see some differences in the code below.

triad.gaNetU <- ergm( 
  gaNetU ~ edges
  + nodematch( "Race", diff=TRUE )
  + gwesp( decay = 0.25, fixed = TRUE ), # here is our gwesp term
  
  control = control.ergm(
    seed = 605 ) # here we use the control argument to set the seed to reproduce results
  ) 

summary( triad.gaNetU )
## Call:
## ergm(formula = gaNetU ~ edges + nodematch("Race", diff = TRUE) + 
##     gwesp(decay = 0.25, fixed = TRUE), control = control.ergm(seed = 605))
## 
## Monte Carlo Maximum Likelihood Results:
## 
##                  Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges             -6.8062     0.2390      0 -28.472  < 1e-04 ***
## nodematch.Race.1   1.7942     0.2639      0   6.800  < 1e-04 ***
## nodematch.Race.2   1.4955     0.2736      0   5.467  < 1e-04 ***
## nodematch.Race.3   1.9731     0.5133      0   3.844 0.000121 ***
## gwesp.fixed.0.25   1.6955     0.1546      0  10.966  < 1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 28987  on 20910  degrees of freedom
##  Residual Deviance:  1051  on 20905  degrees of freedom
##  
## AIC: 1061  BIC: 1101  (Smaller is better. MC Std. Err. = 0.6673)

What is the interpretation of the model?


Modeling Status in the PINS Power/Influence Network

Let’s walk back through what we covered above with a directed network, noting differences in the model as we go.

For the power and influence network, individuals could have asymmetric nominations. That is, i could nominate j and j didn’t necessarily nominate i. We will keep this asymmetry so that we can treat the network as directed.

# clear the workspace since we will recycle object names
rm( list = ls() )

# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-power-w1-adj.csv"

# read in the .csv file
piMat <- as.matrix(
  read.csv( 
    loc,
    as.is = TRUE,
    header = TRUE,
    row.names = 1 
    )
  )

piNetD <- as.network( piMat, directed = TRUE )


# Set the coordinates
set.seed( 605 )
coords <- gplot( piNetD )
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#5bf5b5",
  coord = coords,
  main = "PINS Power/Influence Network (Directed)"  
  )

Think about the structure here. What do you notice?

How might we go about incorporating this information into a model of the network?


Before we get going, let’s define some graphical objects we will use.

# define the number of nodes
g <- dim( piMat )[1]

# define the number of edges
l <- sum( piMat )/2

# define the density
den <- l / ( g *( g - 1 ) ) 

den
## [1] 0.001924916

What is the interpretation of the density?


Recall that the exponential random graph model seeks to model the underlying processes that generated the observed network. Let’s create a random graph with the same nodes, edges, and density as the piNetD network and compare them by using the rgraph() function in the sna package:

# set the seed to reproduce these results
set.seed( 605 )

# generate the random graph
random.graph <- rgraph(
  g,                   # the number of nodes in the network
  1,                   # we want just 1 random graph 
  tprob = den,         # the density of the network 
  mode = "digraph"    # it is directed 
  )

# now coerce the random graph to a network object
random.net <- as.network( random.graph, directed = FALSE )


We can now compare our observed data with a graph that has the same density and the edges are randomly distributed over the nodes.

# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#5bf5b5",
  coord = coords,
  main = "PINS Power/Influence Network (Directed)"  
  )

# create the second plot
gplot( 
  random.net, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#069e6e",
  main = "Random network"
  )

How is the Power/Influence network different from the random network?


Adding Attributes (revisited)

We can use the same attributes as above. Some we cleared the workspace, we will want to reload and recreate the attributes.

# define the attributes object
attrs <- read.csv( 
    "https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
    as.is = TRUE,
    header = TRUE 
    )

# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1


# assign the attributes to the network
piNetD %v% "Age" <- attrs[,1]

# now assign the recoded value
piNetD %v% "Race" <- race.recoded

Now, let’s use the Race attribute and the Age attribute in a plot. We will color the nodes based on Race and size the nodes based on Age. Note that for Age, we are going to need to use the rescale() function.

# create the colors for race
node.cols <- piNetD %v% "Race"
node.cols[ piNetD %v% "Race" == 1 ] <- "#eb105d"
node.cols[ piNetD %v% "Race" == 2 ] <- "#5773fa"
node.cols[ piNetD %v% "Race" == 3 ] <- "#37cf00"

# define the rescale function
rescale <- function( nchar, low, high ){
  min_d <- min( nchar )
  max_d <- max( nchar )
  rscl  <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
  rscl
}

# now, plug our pieces into our plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col=node.cols,
  vertex.cex = rescale( piNetD %v% "Age", 0.5, 2 ),
  coord = coords,
  main = "PINS Power/Influence Network (Directed)"  
  )

# add a legend
legend( "topright", 
        legend=c( "White", "Black", "Hispanic" ), 
        col = unique( node.cols ), 
        pch = 19, 
        pt.cex=1
        )

Think about the structure again. What do you notice about race or age?

How might we go about incorporating this information into a model of the network?


Adding Nodal Covariates

Let’s test three new hypotheses:

  • Nodes differ in their degree based on their age.

  • Nodes that are of similar age are more likely to share ties (i.e. homophily).

  • Nodes reciprocate ties.


We can test these hypotheses using the nodecov (since age is continuous) and absdiff terms. And, for reciprocity, a new term mutual.


Testing the Degree Effect of Age

# let's add the nodecov term for age to our model
age.d.piNetD <- ergm( 
  piNetD ~ edges 
  + nodecov( "Age" )
  ) 

summary( age.d.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodecov("Age"))
## 
## Maximum Likelihood Results:
## 
##              Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges       -8.868656   0.428498      0 -20.697   <1e-04 ***
## nodecov.Age  0.039425   0.004715      0   8.362   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 57975  on 41820  degrees of freedom
##  Residual Deviance:  2044  on 41818  degrees of freedom
##  
## AIC: 2048  BIC: 2065  (Smaller is better. MC Std. Err. = 0)

For the nodecov term adds one network statistic to the model that sums the attribute of interest for the two nodes comprising the end points of each edge in the network. Positive values indicate that as the continuous attribute increases, the edge is more likely, relative to lower values. Note the difference here compared to the interpretation for categorical attributes.

What does the model tell us?


Let’s look at this for our example. Looking at the table we see:

  • the term nodecov.Age has a value of 0.04, and is significantly different from zero at the p<0.5 level.

Recall that in the general formulation of the ERGM, the \(\delta\) represents the change statistic, or the change in the statistic of interest when an edge is added (i.e. \(Y_{ij}\)) goes from 0 to 1). The change statistic for the nodecov term is different from the edges term we saw above. If the predictor is continuous, we interpret one-unit changes in the attribute.


Using the coefficient of 0.04, if we increase Age by 1 unit, the predicted probability of a tie between i and j is:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + \theta_{age} \times \delta_{age})\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic((-8.87 \times 1) + (0.02 \times 1))\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.87 + 0.02)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.85) = 0.0001\)


The value of 1.4^{-4} is returned by using the command plogis(). We can see this by adding the equation elements to the plogis() function.

plogis( 
  age.d.piNetD$coefficients[1]*1  # the edges term
  + age.d.piNetD$coefficients[2]*1  # the nodecov term for age
  )
##        edges 
## 0.0001463694


The term for Age above tells us that degree is correlated with age. But, remember that this network is directed. So, we could test two additional hypotheses:

  • Indegree increases with age
  • Outdegree decreases with age

Note that we use the nodecov term for continuous attributes. But, we can refine this for directed graphs using the nodeicov and nodeocov terms. (For categorical attributes, we use nodeifactor and nodeofactor).

For the nodeicov term, positive values indicate that as the continuous attribute increases, then i is more likely to receive and tie from j (because it is the indegree). For the nodeocov term, positive values indicate that as the continuous attribute increases, then i is more likely to send and tie to j (because it is outdegree).

Let’s check it out in the model.

# let's add the nodeicov and nodeocov terms for age to our model
age.sr.piNetD <- ergm( 
  piNetD ~ edges 
  + nodeicov( "Age" ) # effect for receiving ties
  + nodeocov( "Age" ) # effect for sending ties
  ) 

summary( age.sr.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodeicov("Age") + nodeocov("Age"))
## 
## Maximum Likelihood Results:
## 
##               Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges        -8.908835   0.435864      0 -20.439   <1e-04 ***
## nodeicov.Age  0.067429   0.006566      0  10.270   <1e-04 ***
## nodeocov.Age  0.009698   0.006971      0   1.391    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 57975  on 41820  degrees of freedom
##  Residual Deviance:  2006  on 41817  degrees of freedom
##  
## AIC: 2012  BIC: 2038  (Smaller is better. MC Std. Err. = 0)

To remind you, in this model we have separated the effect of Age on sending ties and receiving ties. What is the interpretation of the model estimates?


Testing Homophily on Age

Now let’s look at a different hypothesis: Nodes of similar age are more likely to share ties (i.e. homophily). We can test this using the absdiff term. Before we estimate the model for homophily, let’s take a look at the absdiff term to see its functionality. You can do so using ?"absdiff-ergmTerm". (note the use of help here is a bit different).

Now, let’s estimate the model for homophily.

age.h.piNetD <- ergm( 
  piNetD ~ edges 
  + nodeicov( "Age" ) 
  + nodeocov( "Age" ) 
  + absdiff( "Age" ) # homophily for age  
  ) 

summary( age.h.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodeicov("Age") + nodeocov("Age") + 
##     absdiff("Age"))
## 
## Maximum Likelihood Results:
## 
##               Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges        -8.668025   0.431516      0 -20.087  < 1e-04 ***
## nodeicov.Age  0.082026   0.008812      0   9.308  < 1e-04 ***
## nodeocov.Age -0.002340   0.008832      0  -0.265  0.79106    
## absdiff.Age  -0.033437   0.010562      0  -3.166  0.00155 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 57975  on 41820  degrees of freedom
##  Residual Deviance:  1995  on 41816  degrees of freedom
##  
## AIC: 2003  BIC: 2037  (Smaller is better. MC Std. Err. = 0)

Note that the term adds one network statistic to the model for the variable that takes the absolute difference in the value of the attribute for each node. This means that negative values will indicate homophily. For the absdiff term, negative values indicate that for a pair of nodes with similar attribute values, an edge is more likely. Put differently, as the absolute value of the difference between each node on the attribute gets larger, the probability of a tie declines. Thus, negative coefficients indicate the presence of homophily and positive coefficients indicate the presence of heterophily.


Using the coefficient of -0.03, the predicted probability of an edge between nodes with the same age is:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + (\theta_{age.homophily} \times \delta_{age.homophily}))\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.67 \times 1 + -0.03 \times 0)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.67 + 0)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.67) = 0.00017\)


The value of 1.7^{-4} is returned by using the command plogis().

plogis( age.h.piNetD$coefficients[1]*1 + age.h.piNetD$coefficients[4]*0 )
##        edges 
## 0.0001719689


Testing for Reciprocity

Now, let’s test whether the probability of a tie for i to j depends on the presence of a tie from j to i (i.e. reciprocity).

recp.piNetD <- ergm( 
  piNetD ~ edges
  + nodeicov( "Age" ) 
  + nodeocov( "Age" ) 
  + absdiff( "Age" )
  + mutual,          # add the term for reciprocity
  
  control = control.ergm(
    seed = 605 ) # here we use the control argument to set the seed to reproduce results
  ) 

summary( recp.piNetD )
## Call:
## ergm(formula = piNetD ~ edges + nodeicov("Age") + nodeocov("Age") + 
##     absdiff("Age") + mutual, control = control.ergm(seed = 605))
## 
## Monte Carlo Maximum Likelihood Results:
## 
##               Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges        -8.569734   0.446426      0 -19.196  < 1e-04 ***
## nodeicov.Age  0.081230   0.008958      0   9.068  < 1e-04 ***
## nodeocov.Age -0.004240   0.008694      0  -0.488  0.62577    
## absdiff.Age  -0.033707   0.010528      0  -3.202  0.00137 ** 
## mutual        1.743896   0.741226      0   2.353  0.01864 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 57975  on 41820  degrees of freedom
##  Residual Deviance:  1991  on 41815  degrees of freedom
##  
## AIC: 2001  BIC: 2045  (Smaller is better. MC Std. Err. = 0.1983)

What is the interpretation of the coefficient?


Using the coefficient of 1.74, the predicted probability of an edge between i and j if a tie exist between j and i is:

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(\theta_{edges} \times \delta_{edges} + (\theta_{reciprocity} \times \delta_{reciprocity}))\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.57 \times 1 + 1.74 \times 1)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-8.57 + 1.74)\)

\(\Bigg(P \Bigg(Y_{ij}=1 \> | \> n \> actors, Y_{ij}^C\Bigg) \Bigg) = logistic(-6.83) = 0.0011\)


The value of 0.0011 is returned by using the command plogis().

plogis( recp.piNetD$coefficients[1]*1 + recp.piNetD$coefficients[5]*1 )
##      edges 
## 0.00108419



Wrapping up…

In this lab, you were introduced to exponential family of random graphs in R using the ergm package. We worked through different dependence specifications of the model reviewed in the Introduction to Exponential Random Graph Models (ERGMs) lecture. We also showed how to incorporate node level attributes and endogenous network structures.


Questions?


Please report any needed corrections to the Issues page. Thanks!


Back to SAND main page