In the Introduction to Exponential Random Graph Models (ERGMs) lecture you were introduced to several examples of models from the exponential family of random graphs and we worked through those models in the Introduction to Exponential Random Graph Models (ERGMs) in R lab.

This lab continues working with ERGMs in R using the ergm package. Specifically, we ask the question: how do we evaluate whether our model is a good fit to the data? This is the notion of “goodness of fit” and we can assess how well our model fits the data using the gof() function in the ergm package.

Evaluation Model Fit for the PINS Get Along With Network

Recall that in the Introduction to Exponential Random Graph Models (ERGMs) in R lab we estimated a series of ERGMs using the get along with network from PINS. Here, we want to re-estimate the models and examine goodness of fit.

Setting Up

First, let’s bring in the data, create the network object, and assign attributes.

# load the libraries we need
library( sna )
library( network )
library( ergm )

# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-getalong-w1-adj.csv"

# read in the .csv file
gaMat <- as.matrix(
  read.csv( 
    loc,
    as.is = TRUE,
    header = TRUE,
    row.names = 1 
    )
  )

# use the symmetrize() function to create an undirected matrix
gaMatU <- symmetrize( gaMat, rule = "strong" )

# create the network object
gaNetU <- as.network( gaMatU, directed = FALSE )


# define the attributes object
attrs <- read.csv( 
    "https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
    as.is = TRUE,
    header = TRUE 
    )

# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1

# now assign the recoded value
gaNetU %v% "Age" <- attrs[,1]
gaNetU %v% "Race" <- race.recoded


Now, let’s estimate a few models. We will estimate several (each subsequent model will include the prior terms):

  • An edge independence model
  • A model where we add a term where Race influences degree
  • A model where we add a term where there is homophily based on Race
  • A model where we take into account transitivity


Ok, let’s run the models. (remember that we can use the summary() function on the model object to summarize the results).

# edge independence model
m1.gaNetU <- ergm( gaNetU ~ edges ) 

# add the effect of Race on degree
m2.gaNetU <- ergm( gaNetU ~ edges  + nodefactor( "Race" ) ) 
  
# add the homophily effect for Race  
m3.gaNetU <- ergm( gaNetU ~ edges + nodematch( "Race" ) )

# add the effect for transitivity  
m4.gaNetU <- ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE ),
  control = control.ergm( seed = 605 ) ) 


The simulate.ergm() or simulate() and control.ergm() Functions

How can we go about assessing model fit? What would a good fitting model do? Well, one way to think about this is whether the model we estimate reproduces structures in the data. In other words, if we estimate a model, then simulated some data from that model, we could say: “does the simulated network look like our observed network?” This is the logic of how we will approach goodness of fit with our models.

First, we need to think about simulating a network (or networks) from our model. We can do this using the simulate.ergm() or simulate() function that is built into the ergm package. This function randomly samples networks from the specified model. We can examine the function using ?simulate.ergm. Let’s see how it works:

# simulate a network from the edge independence model
sim.m1.gaNetU <- simulate(
  m1.gaNetU,               # the model we want to simulate from
  nsim=1,                  # simulate 1 network
  seed = 605               # set the seed to reproduce results
)

Boom! You simulated a network. Now what? Take a look at what we did.

# the function creates an object of class network
class( sim.m1.gaNetU )
## [1] "network"
# look at the simulated network
sim.m1.gaNetU
##  Network attributes:
##   vertices = 205 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 92 
##     missing edges= 0 
##     non-missing edges= 92 
## 
##  Vertex attribute names: 
##     Age Race vertex.names 
## 
## No edge attributes

Awesome! We have an object of class network that has the attributes of our observed data. Now, we can compare our observed network to our simulated network.

# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Get\n Along With Network (Undirected)"
  )

# create the second plot for the simulated network
gplot( 
  sim.m1.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Edge Independence Model"
  )

How is the network different from the network generated from the edge independence model? What are the structural features that the model is not capturing?


Simulating More Complex Models

Well, dang. It looks like our model is doing a great job reproducing the structure in our observed network.

Let’s take a look at the additional models we created:

# simulate a network from the degree effect for Race model
sim.m2.gaNetU <- simulate( m2.gaNetU, nsim=1, seed = 605 )

# simulate a network from the homophily model
sim.m3.gaNetU <- simulate( m3.gaNetU, nsim=1, seed = 605 )

# simulate a network from transitivity model
sim.m4.gaNetU <- simulate( m4.gaNetU, nsim=1, seed = 605 )

Now that we have our simulated networks, let’s plot them.

# set the margins
par( mfrow=c( 2,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Get\n Along With Network (Undirected)"
  )

# create the second plot for the simulated network
gplot( 
  sim.m2.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Degree Model"
  )

# create the third plot for the simulated network
gplot( 
  sim.m3.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#e707f2",
  main = "Network Simulated from\n Homophily Model"
  )

# create the fourth plot for the simulated network
gplot( 
  sim.m4.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#17f207",
  main = "Network Simulated from\n Transitivity Model"
  )

What do think? As our models become more complex, are we doing better at reproducing the structures we observe?


Goodness-of-fit (the basic logic)

Recall that we are fitting a model to observed data. That is, we are specifying the stochastic process that generated our observed data. We have specified local processes that generate our observed network. So, one way to determine how well the model fits the data is to see how well the model generates global network properties. That is, if the processes we have specified are operating, would we get the degree distribution (for example) we observe?

We can get a sense of how well the model is representing the observed network based on the visual comparison with the simulated networks (as we did above). In addition, we could look at whether our simulated networks are reproducing particular structures in our network.

We can use the summary() function and use the ergm terms to return descriptive stats on the networks (i.e. the observed network and the simulated networks). Let’s take a look at the degree distribution as well as the number of triangles in our observed network and our simulated networks:

# let's see how this works for our observed network

summary( 
  gaNetU            # the network we want summary info on 
  ~ edges           # give the edges; note the ~
  + degree( 0:5 )   # count of degrees 0 through 5
  + triangle        # count of triangles
  )
##    edges  degree0  degree1  degree2  degree3  degree4  degree5 triangle 
##       95      111       47       20       15        7        3       13

The output shows that the network has:

  • 95 edges

  • 111 nodes with degree 0, 47 with degree 1, and so on to degree 5.

  • 13 triangles (i.e. i-j, j-k, & i-k)


Now, let’s take this info for our observed network and do the same for the simulated network from the fourth model. Then, we can look at whether the simulation is reproducing the structures we observe in the network.

summary( 
  sim.m4.gaNetU   
  ~ edges         
  + degree( 0:5 ) 
  + triangle      
  )
##    edges  degree0  degree1  degree2  degree3  degree4  degree5 triangle 
##      104       92       63       26       12        8        1       19

Let’s build a table to compare them:

# put a table together for ease of comparing
obsSimTab <- cbind(
  summary( gaNetU  ~ edges + degree( 0:5 ) + triangle ),
  summary( sim.m4.gaNetU  ~ edges + degree( 0:5 ) + triangle )
)

# add some names and then examine the table
colnames( obsSimTab ) <- c( "Obs Net", "Sim Net" )
obsSimTab
##          Obs Net Sim Net
## edges         95     104
## degree0      111      92
## degree1       47      63
## degree2       20      26
## degree3       15      12
## degree4        7       8
## degree5        3       1
## triangle      13      19

Looking through the table, how well is the model representing the structures we have decided to examine?


Going Beyond One Simulated Network

So far, we have only looked at one simulation. Alternatively, we could run a lot of simulations and examine the distribution of a particular structural property (e.g. triangles). In the simulate() function, we can use several arguments to give us what we want:

  • The monitor= argument to pass any specific terms we want to simulate. Since we are focusing just on the number of triangles, we would use monitor=~triangle. Note the incorporation of the ~ in the monitor= argument.

  • We can also use the output= argument to just calculate stats rather than generating networks. Since we are only interested in the stats and don’t want the networks, we use output="stats".


Let’s simulate 500 networks and take a look at the results:

# run the simulation

gaNetU.large.sim <- simulate(
  m4.gaNetU,                # the model to simulate from
  nsim = 500,               # ask for 500 simulations
  monitor = ~ triangle,     # we want it to focus on the triangles
  output = "stats",           # just keep the stats, not the networks
  seed = 605                # set the seed to reproduce the results
)

Boom! We simulated 500 networks from our model.

Let’s see what it gave us.

head( gaNetU.large.sim )
##      edges nodematch.Race.1 nodematch.Race.2 nodematch.Race.3 gwesp.fixed.0.25
## [1,]   104               49               35                5         48.86745
## [2,]   105               43               38                3         51.54839
## [3,]   119               50               42                7         60.92452
## [4,]   115               44               34                9         54.26092
## [5,]   117               52               43               10         59.30985
## [6,]   108               47               34               12         43.44240
##      triangle
## [1,]       19
## [2,]       19
## [3,]       24
## [4,]       21
## [5,]       23
## [6,]       15

We can see that it reports the count of all the terms in the model in the columns and each row is the count for each simulated network.

Now, we can plot the summary stats for all the simulated networks to visualize the information.

hist( 
  gaNetU.large.sim[,"triangle"],  # plot the triangle counts from each simulation
  main = "Simulations \n (X shows observed network)", # add a title
  xlim = c( 0,50 ), # set the x axis limits
  ylim = c( 0,200 ), # set the y axis limits
  xlab = "Number of triangles",  # label the x axis
  ylab = "Number of simulated networks" #label the y axes 
  )

# add a mark where the observed count is
points(
  summary( 
    gaNetU ~ triangle ),
  4, 
  pch="X", 
  cex=2, 
  col="red"
  )

In looking at the simulation of the triangles, is our model generating as much transitivity as we observe in the network?


Goodness-of-Fit with the gof() Function

As we just saw, we can simulate lots of networks and then compare our observed network to the networks simulated from the model estimates. In ergm, there is a function that does this for us, gof(). The gof() function calculates p-values for geodesic distance, degree, and reachability summaries to diagnose the goodness-of-fit of exponential family random graph models. We can examine the help page for the gof() function using ?gof.

As shown in the help, the gof() function takes a fitted ergm object (i.e. an estimated model), and uses the GOF= argument to specify what summaries information we want to calculate for our simulated networks. Rather than simulating the networks, then calculating the stats, then creating a plot, the gof() function does all of this for us! (how nice).

The GOF= argument takes a formula object, of the form ~ <model terms> specifying the statistics to use to diagnosis the goodness-of-fit of the model. They do not need to be terms that are in the model, and typically are not. Currently supported terms are the degree distribution (degree for undirected graphs, or idegree and/or odegree for directed graphs), geodesic distances (distance), shared partner distributions (espartners [1 edge-wise shared partner is i-j, j-k, and i-k] and dspartners [1 dyad-wise shared partner is j-k and i-k]), the triad census (triadcensus), and the terms of the original model (model). The default formula for undirected networks is ~ degree + espartners + distance + model, and the default formula for directed networks is ~ idegree + odegree + espartners + distance + model.


Let’s go ahead and take a look at this with the m4.gaNetU model we estimated above where:

  • ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE )

We want to calculate the degree distribution (degree), the edge-wise shared partner distribution (espartners), and the geodesic distances (distance).

m4.gaNetU.gof <- gof(
  m4.gaNetU,               # our estimated model
  GOF = ~degree            # the degree distribution 
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)

# print out the goodness of fit info
m4.gaNetU.gof
## 
## Goodness-of-fit for degree 
## 
##         obs min  mean max MC p-value
## degree0 111  75 96.99 125       0.10
## degree1  47  40 55.51  79       0.24
## degree2  20  19 29.58  46       0.08
## degree3  15   5 14.17  24       0.98
## degree4   7   1  5.46  15       0.66
## degree5   3   0  2.13   8       0.62
## degree6   1   0  0.71   6       0.92
## degree7   0   0  0.32   3       1.00
## degree8   0   0  0.06   1       1.00
## degree9   1   0  0.07   1       0.14
## 
## Goodness-of-fit for edgewise shared partner 
## 
##      obs min  mean max MC p-value
## esp0  60  42 59.02  78       0.94
## esp1  31  15 35.40  57       0.76
## esp2   4   0  3.84  18       0.90
## esp3   0   0  0.35   4       1.00
## esp4   0   0  0.03   1       1.00
## 
## Goodness-of-fit for minimum geodesic distance 
## 
##       obs   min     mean   max MC p-value
## 1      95    65    98.64   136       0.88
## 2     139    42    98.65   183       0.22
## 3     193    21    96.14   252       0.08
## 4     210    10    89.39   298       0.10
## 5     182     3    78.65   360       0.22
## 6     160     0    65.58   398       0.18
## 7     120     0    52.02   379       0.30
## 8      83     0    37.99   298       0.28
## 9      42     0    26.74   228       0.36
## 10     25     0    19.43   210       0.44
## 11     11     0    13.76   190       0.48
## 12      4     0     9.42   172       0.52
## 13      1     0     5.80   148       0.46
## 14      0     0     3.35   123       1.00
## 15      0     0     2.20   119       1.00
## 16      0     0     1.55   108       1.00
## 17      0     0     0.98    74       1.00
## 18      0     0     0.62    48       1.00
## 19      0     0     0.27    23       1.00
## 20      0     0     0.04     3       1.00
## Inf 19645 17848 20208.78 20741       0.30
## 
## Goodness-of-fit for model statistics 
## 
##                      obs min     mean       max MC p-value
## edges            95.0000  65 98.64000 136.00000       0.88
## nodematch.Race.1 40.0000  19 42.18000  62.00000       0.86
## nodematch.Race.2 34.0000  20 35.31000  63.00000       0.98
## nodematch.Race.3  4.0000   0  4.61000  11.00000       1.00
## gwesp.fixed.0.25 35.8848  15 40.57238  71.25171       0.76

The object m4.gaNetU.gof shows the goodness-of-fit for the degree distribution, the edge-wise shared partner distribution, and the geodesic distances. Each table shows the value in the observed network (obs) as well as the minimum (min), mean (mean), and maximum (max) of that value in the simulations.

The p-value column (p-value) is the proportion of the simulated values of the statistic that are at least as extreme as the observed value. Large values indicate that the simulated networks are similar to the observed network. Values less than 0.05 demonstrate a significant difference between the observed network and the simulated networks.


We can reference each table using the $pval operator. Take a look at the names using: names(m4.gaNetU.gof). Let’s take a look at the degree distribution first:

m4.gaNetU.gof$pval.deg[1:11,]
##          obs min  mean max MC p-value
## degree0  111  75 96.99 125       0.10
## degree1   47  40 55.51  79       0.24
## degree2   20  19 29.58  46       0.08
## degree3   15   5 14.17  24       0.98
## degree4    7   1  5.46  15       0.66
## degree5    3   0  2.13   8       0.62
## degree6    1   0  0.71   6       0.92
## degree7    0   0  0.32   3       1.00
## degree8    0   0  0.06   1       1.00
## degree9    1   0  0.07   1       0.14
## degree10   0   0  0.00   0       1.00

In looking through the Goodness-of-fit for degree table, what do the p-values tell us?

The degree distribution in the simulated data are matching well with the degree distribution for the observed data. In other words, our model (i.e. ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE )) is reproducing a degree distribution that is very similar to the degree distribution in our observed data.

Now let’s look at the other distributions. Take a look at the Goodness-of-fit for edge-wise shared partner table:

m4.gaNetU.gof$pval.espart[1:6,]
##      obs min  mean max MC p-value
## esp0  60  42 59.02  78       0.94
## esp1  31  15 35.40  57       0.76
## esp2   4   0  3.84  18       0.90
## esp3   0   0  0.35   4       1.00
## esp4   0   0  0.03   1       1.00
## esp5   0   0  0.00   0       1.00

Recall that an edge-wise shared partner of 1 or esp1 is i-j, j-k, and i-k. An edge-wise shared partner of 2 or esp2 is i-j, j-k and i-k, as well as i-l and l-l. And so on…

What is the interpretation of this table?

Conceptually, what does the difference (or lack thereof?) between the simulated networks and the observed networks, based on the edge-wise shared partner distribution tell us?


Now, take a look at the Goodness-of-fit for minimum geodesic distance table:

m4.gaNetU.gof$pval.dist[1:21,]
##    obs min  mean max MC p-value
## 1   95  65 98.64 136       0.88
## 2  139  42 98.65 183       0.22
## 3  193  21 96.14 252       0.08
## 4  210  10 89.39 298       0.10
## 5  182   3 78.65 360       0.22
## 6  160   0 65.58 398       0.18
## 7  120   0 52.02 379       0.30
## 8   83   0 37.99 298       0.28
## 9   42   0 26.74 228       0.36
## 10  25   0 19.43 210       0.44
## 11  11   0 13.76 190       0.48
## 12   4   0  9.42 172       0.52
## 13   1   0  5.80 148       0.46
## 14   0   0  3.35 123       1.00
## 15   0   0  2.20 119       1.00
## 16   0   0  1.55 108       1.00
## 17   0   0  0.98  74       1.00
## 18   0   0  0.62  48       1.00
## 19   0   0  0.27  23       1.00
## 20   0   0  0.04   3       1.00
## 21   0   0  0.00   0       1.00

Recall that a geodesic is the shortest path between two nodes. The distribution shows the count of these (when it shows Inf, this indicates the number of dyads that are unreachable).

What is the interpretation of this table?


Finally, we can plot the gof() object and examine the simulated and observed data.

# set the plot panes
par( mfrow = c( 2,2 ) )

plot( m4.gaNetU.gof )

Each plot shows boxplots for the values of the simulated networks. The dark line in each plot is the network statistic for the observed network. Situations where the dark line falls outside of the boxplots indicates that the simulation is not reproducing that particular structural aspect of the network.

Overall, how well is our model at reproducing the observed distributions?


As a contrast, let’s look at the goodness of fit for the model where we exclude the term to account for transitivity in the network. Recall that the m3.gaNetU model was: m3.gaNetU$formula.

# run the gof function
m3.gaNetU.gof <- gof(
  m3.gaNetU,               # our estimated model
  GOF = ~degree            # the degree distribution 
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)

# plot the gof diagnostics
par( mfrow = c( 2,2 ) )
plot( m3.gaNetU.gof )

Now, let’s ask a few questions:

  • Which distributions is the model doing a good job reproducing?
  • Where does it not do a good job? (Why is this the case?)



Evaluation Model Fit for the PINS Power/Influence Network

Now, let’s go back through this using a directed network. Recall that in the Introduction to Exponential Random Graph Models (ERGMs) in R lab we estimated a series of ERGMs using the power and influence network from PINS. Here, we want to re-estimate the models and examine goodness of fit.

Setting Up

First, let’s bring in the data, create the network object, and assign attributes.

# clear the workspace in case we recycle some objects
rm( list = ls( ) )

# load the libraries we need (this is redundant)
library( sna )
library( network )
library( ergm )

# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-power-w1-adj.csv"

# read in the .csv file
piMat <- as.matrix(
  read.csv( 
    loc,
    as.is = TRUE,
    header = TRUE,
    row.names = 1 
    )
  )

# create the network object
piNetD <- as.network( piMat, directed = TRUE )


# define the attributes object
attrs <- read.csv( 
    "https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
    as.is = TRUE,
    header = TRUE 
    )

# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1

# now assign the recoded value
piNetD %v% "Age" <- attrs[,1]
piNetD %v% "Race" <- race.recoded


Now, let’s estimate a few models. We will estimate several (each subsequent model will include the prior terms):

  • An edge independence model
  • A model where we add a term where Age influences indegree and outdegree
  • A model where we add a term where there is homophily based on Age
  • A model where we take into account reciprocity


Ok, let’s run the models. (remember that we can use the summary() function on the model object to summarize the results).

# edge independence model
m1.piNetD <- ergm( piNetD ~ edges ) 

# add the effects of Age on indegree and outdegree
m2.piNetD <- ergm( piNetD ~ edges  + nodeicov( "Age" ) + nodeocov( "Age" ) ) 
  
# add the homophily effect for Age  
m3.piNetD <- ergm( piNetD ~ edges  + nodeicov( "Age" ) + nodeocov( "Age" ) + absdiff( "Age" ) ) 

# add the effect for reciprocity  
m4.piNetD <- ergm( piNetD ~ edges + nodeicov( "Age" ) + nodeocov( "Age" ) + absdiff( "Age" ) + mutual,
  control = control.ergm( seed = 605 ) ) 


Simulating a Single Network

Let’s simulate a network and plot it along with our observed network. Let’s use the edge independence model:

# simulate a network from the edge independence model
sim.m1.piNetD <- simulate(
  m1.piNetD,               # the model we want to simulate from
  nsim=1,                  # simulate 1 network
  seed = 605               # set the seed to reproduce results
)

# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Power/Influence\n Network (Directed)"
  )

# create the second plot for the simulated network
gplot( 
  sim.m1.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Edge Independence Model"
  )

How is the network different from the network generated from the edge independence model? What are the structural features that the model is not capturing?


Simulating More Complex Models

Let’s take a look at the additional models we created:

# estimate the simulations
sim.m2.piNetD <- simulate( m2.piNetD, nsim=1, seed = 605 )
sim.m3.piNetD <- simulate( m3.piNetD, nsim=1, seed = 605 )
sim.m4.piNetD <- simulate( m4.piNetD, nsim=1, seed = 605 )

Now that we have our simulated networks, let’s plot them.

# set the margins
par( mfrow=c( 2,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Power/Influence\n Network (Directed)"
  )

# create the second plot for the simulated network
gplot( 
  sim.m2.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Degree Model"
  )

# create the third plot for the simulated network
gplot( 
  sim.m3.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#e707f2",
  main = "Network Simulated from\n Homophily Model"
  )

# create the fourth plot for the simulated network
gplot( 
  sim.m4.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#17f207",
  main = "Network Simulated from\n Reciprocity Model"
  )

What do think? As our models become more complex, are we doing better at reproducing the structures we observe?


Goodness-of-fit with the gof() function

Let’s examine the goodness of fit for the m4.piNetD model.

m4.piNetD.gof <- gof(
  m4.piNetD,               # our estimated model
  GOF = ~idegree           # the indegree distribution 
    + odegree              # the outdegree distribution
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)

# print out the goodness of fit info
m4.piNetD.gof
## 
## Goodness-of-fit for in-degree 
## 
##           obs min   mean max MC p-value
## idegree0  156  93 110.83 127       0.00
## idegree1   28  41  54.57  70       0.00
## idegree2    7  13  23.22  35       0.00
## idegree3    2   4   9.82  19       0.00
## idegree4    0   0   3.61  10       0.08
## idegree5    2   0   1.63   5       1.00
## idegree6    0   0   0.67   4       1.00
## idegree7    2   0   0.35   2       0.16
## idegree8    3   0   0.22   2       0.00
## idegree9    0   0   0.05   1       1.00
## idegree10   0   0   0.01   1       1.00
## idegree11   0   0   0.01   1       1.00
## idegree12   1   0   0.00   0       0.00
## idegree13   3   0   0.01   1       0.00
## idegree14   1   0   0.00   0       0.00
## 
## Goodness-of-fit for out-degree 
## 
##          obs min  mean max MC p-value
## odegree0 127  79 93.14 115       0.00
## odegree1  36  57 72.94  88       0.00
## odegree2  23  21 29.61  41       0.16
## odegree3  10   2  7.48  16       0.38
## odegree4   4   0  1.67   5       0.16
## odegree5   2   0  0.16   3       0.02
## odegree6   1   0  0.00   0       0.00
## odegree8   1   0  0.00   0       0.00
## odegree9   1   0  0.00   0       0.00
## 
## Goodness-of-fit for edgewise shared partner 
## 
##          obs min   mean max MC p-value
## esp.OTP0 129 136 161.04 184          0
## esp.OTP1  28   0   1.04   4          0
## esp.OTP2   3   0   0.00   0          0
## esp.OTP3   1   0   0.00   0          0
## 
## Goodness-of-fit for minimum geodesic distance 
## 
##       obs   min     mean   max MC p-value
## 1     161   137   162.08   185       0.90
## 2     142    78   133.17   189       0.62
## 3      79    46   105.60   199       0.54
## 4      25    19    81.87   213       0.12
## 5       3     3    61.05   228       0.02
## 6       0     0    43.09   219       0.06
## 7       0     0    29.13   190       0.14
## 8       0     0    18.58   175       0.50
## 9       0     0    12.11   169       0.74
## 10      0     0     7.64   162       1.00
## 11      0     0     4.40   122       1.00
## 12      0     0     2.51    91       1.00
## 13      0     0     1.34    62       1.00
## 14      0     0     0.85    44       1.00
## 15      0     0     0.33    16       1.00
## 16      0     0     0.17    14       1.00
## 17      0     0     0.10    10       1.00
## 18      0     0     0.04     4       1.00
## Inf 41410 39542 41155.94 41518       0.24
## 
## Goodness-of-fit for model statistics 
## 
##               obs  min    mean  max MC p-value
## edges         161  137  162.08  185       0.90
## nodeicov.Age 7849 6538 7915.01 8974       0.82
## nodeocov.Age 6543 5463 6569.14 7541       0.92
## absdiff.Age  2058 1643 2078.87 2486       0.82
## mutual          2    0    1.75    7       1.00


Look through the table for each distribution. What is the interpretation of each table?


Finally, we can plot the gof() object and examine the simulated and observed data.

# set the plot panes
par( mfrow = c( 3,2 ) )

plot( m4.piNetD.gof )

Each plot shows boxplots for the values of the simulated networks. The dark line in each plot is the network statistic for the observed network. Situations where the dark line falls outside of the boxplots indicates that the simulation is not reproducing that particular structural aspect of the network.

Overall, how well is our model at reproducing the observed distributions?


As a contrast, let’s look at the goodness of fit for the edge independence model (m1.piNetD).

# run the gof function
m1.piNetD.gof <- gof(
  m1.piNetD,               # our estimated model
  GOF = ~idegree           # the indegree distribution 
    + odegree              # the outdegree distribution
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)

# plot the gof diagnostics
par( mfrow = c( 3,2 ) )
plot( m1.piNetD.gof )

Now, let’s ask a few questions:

  • Which distributions is the model doing a good job reproducing?
  • Where does it not do a good job? (Why is this the case?)


Simulating Alternative Models

No Homophily?

Suppose we wanted to simulate a model where the effect of homophily is less strong. In other words, what would this network look like if there was not homophily? We can do this by plugging new values into the simulate.ergm function.

Since a fitted ergm model is of class ergm, it has several parts that we can reference as objects.

# see the names associated with the object
names( m4.piNetD )
##  [1] "coefficients"    "sample"          "iterations"      "MCMCtheta"      
##  [5] "loglikelihood"   "gradient"        "hessian"         "covar"          
##  [9] "failure"         "newnetwork"      "coef.init"       "est.cov"        
## [13] "coef.hist"       "stats.hist"      "steplen.hist"    "control"        
## [17] "etamap"          "MCMCflag"        "nw.stats"        "call"           
## [21] "network"         "ergm_version"    "info"            "MPLE_is_MLE"    
## [25] "drop"            "offset"          "estimable"       "formula"        
## [29] "reference"       "constraints"     "obs.constraints" "estimate"       
## [33] "estimate.desc"   "null.lik"        "mle.lik"
# the coefficients are an object that we can take, modify, and use with the simulation
m4.piNetD$coefficients
##        edges nodeicov.Age nodeocov.Age  absdiff.Age       mutual 
## -8.569733632  0.081229804 -0.004240028 -0.033707476  1.743896037

If we want to simulate a model where there is no homophily, then we can set the coefficient to zero, then simulate from that new set of model coefficients. In our case, we well change the effect of absdiff.Age to be zero.

# create a new object that is just the model coefficients
no.homophily.coef <- m4.piNetD$coefficients

# set the coefficent to zero; it is the 4th element
no.homophily.coef[4] <- 0 

# Now, the effect is zero.
no.homophily.coef
##        edges nodeicov.Age nodeocov.Age  absdiff.Age       mutual 
## -8.569733632  0.081229804 -0.004240028  0.000000000  1.743896037
# now, plug the edited coefficient into a simulation
no.homophily.sim <- simulate(
  m4.piNetD,                # the model we are simulating from 
  coef = no.homophily.coef, # tell it to use the new coefficients
  verbose = TRUE            # tell it to show us what it is doing
) 

Now that we have the simulated network, we can plot it and compare it to our observed network. We will want to size the nodes by age. So, let’s first run that through the rescale() function.

# define the rescale function
rescale <- function( nchar, low, high ){
  min_d <- min( nchar )
  max_d <- max( nchar )
  rscl  <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
  rscl
}

# create the objects to use for the plots
PiNetDobsSize <- rescale( piNetD %v% "Age", 0.2, 3 )
PiNetDsimSize <- rescale( no.homophily.sim %v% "Age", 0.2, 3 )
# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#f71505",
  vertex.cex = PiNetDobsSize,
  displayisolates = FALSE,
  main = "PINS Power/Influence\n Network (Directed)"
  )

# create the second plot for the simulated network
gplot( 
  no.homophily.sim, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#03fc7b",
  vertex.cex = PiNetDsimSize,
  displayisolates = FALSE,
  main = "Network Simulated from\n No Homophily Model"
  )

How does the simulated network differ from the piNetD network? Can you see the difference in the role that homophily plays?


Aversion to Reciprocity?

In the model above, m4.piNetD, we showed that there was evidence of reciprocity in power/influence nominations. What would the network look like if there was an aversion, such that the probability of a tie from i to j is lower if there is a tie from j to i?

If we want to simulate a model where there is an aversion to reciprocal ties, then we can take the coefficient for the reciprocity term and multiply it by -1, then simulate from that new set of model coefficients.

# create a new object that is just the model coefficients
anti.recip.coef <- m4.piNetD$coefficients

# set the coefficient
anti.recip.coef[5] <- anti.recip.coef[5]*-1 

# now, plug the edited coefficient into a simulation
anti.recip.sim <- simulate(
  m4.piNetD,                # the model we are simulating from 
  coef = anti.recip.coef,   # tell it to use the new coefficients
  verbose = TRUE            # tell it to show us what it is doing
) 

Now that we have the simulated network, we can plot it and compare it to our observed network. We will want to color the reciprocated ties. To do this, we can get the matrix of reciprocated ties (using symmetrize()) and then pass colors to this matrix.

# create the symmetric matrix
sympiMat <- symmetrize( piNetD, rule = "strong" )

# create a matrix of colors
sympiMatCols <- sympiMat
sympiMatCols[sympiMat == 0] <- "grey80"
sympiMatCols[sympiMat == 1] <- "#3746a3" # set a color for mutual ties


# create the symmetric matrix
sympiMat2 <- symmetrize( anti.recip.sim, rule = "strong" )

# create a matrix of colors
sympiMatCols2 <- sympiMat2
sympiMatCols2[sympiMat2 == 0] <- "grey80"
sympiMatCols2[sympiMat2 == 1] <- "#3796a4" # set a color for mutual ties

Now, we pass this information to the plot.

# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 3, 0.5 ) )

# set the seed
set.seed( 605 )

# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col=sympiMatCols, 
  vertex.col="#03eeff",
  displayisolates = FALSE,
  main = "PINS Power/Influence\n Network (Directed)"
  )

# create the second plot for the simulated network
gplot( 
  no.homophily.sim, 
  gmode = "digraph",
  edge.col = sympiMatCols2, 
  vertex.col="#03fc7b",
  displayisolates = FALSE,
  main = "Network Simulated from\n Aversion to\n Reciprocity Model"
  )

How does the simulated network differ from the piNetD network? Can you see the difference in the role that reciprocity plays?



Wrapping up…

In this lab, we continued working with ERGMs in R using the ergm package. Specifically, we ask the question: how do we evaluate whether our model is a good fit to the data? This is the notion of “goodness of fit” and we worked through how to assess model fit using the gof() function in the ergm package.


Questions?


Please report any needed corrections to the Issues page. Thanks!


Back to SAND main page