In the Introduction to Exponential Random Graph Models (ERGMs) lecture you were introduced to several examples of models from the exponential family of random graphs and we worked through those models in the Introduction to Exponential Random Graph Models (ERGMs) in R lab.
This lab continues working with ERGMs in R using the
ergm package. Specifically, we ask the question: how do
we evaluate whether our model is a good fit to the data? This is
the notion of “goodness of fit” and we can assess how well our model
fits the data using the gof() function in the
ergm package.
Recall that in the Introduction to Exponential Random Graph Models (ERGMs) in R lab we estimated a series of ERGMs using the get along with network from PINS. Here, we want to re-estimate the models and examine goodness of fit.
First, let’s bring in the data, create the network object, and assign attributes.
# load the libraries we need
library( sna )
library( network )
library( ergm )
# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-getalong-w1-adj.csv"
# read in the .csv file
gaMat <- as.matrix(
  read.csv( 
    loc,
    as.is = TRUE,
    header = TRUE,
    row.names = 1 
    )
  )
# use the symmetrize() function to create an undirected matrix
gaMatU <- symmetrize( gaMat, rule = "strong" )
# create the network object
gaNetU <- as.network( gaMatU, directed = FALSE )
# define the attributes object
attrs <- read.csv( 
    "https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
    as.is = TRUE,
    header = TRUE 
    )
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1
# now assign the recoded value
gaNetU %v% "Age" <- attrs[,1]
gaNetU %v% "Race" <- race.recodedNow, let’s estimate a few models. We will estimate several (each subsequent model will include the prior terms):
Race influences
degreeRaceOk, let’s run the models. (remember that we can use the
summary() function on the model object to summarize the
results).
# edge independence model
m1.gaNetU <- ergm( gaNetU ~ edges ) 
# add the effect of Race on degree
m2.gaNetU <- ergm( gaNetU ~ edges  + nodefactor( "Race" ) ) 
  
# add the homophily effect for Race  
m3.gaNetU <- ergm( gaNetU ~ edges + nodematch( "Race" ) )
# add the effect for transitivity  
m4.gaNetU <- ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE ),
  control = control.ergm( seed = 605 ) ) simulate.ergm() or simulate()
and control.ergm() FunctionsHow can we go about assessing model fit? What would a good fitting model do? Well, one way to think about this is whether the model we estimate reproduces structures in the data. In other words, if we estimate a model, then simulated some data from that model, we could say: “does the simulated network look like our observed network?” This is the logic of how we will approach goodness of fit with our models.
First, we need to think about simulating a network (or networks) from
our model. We can do this using the simulate.ergm() or
simulate() function that is built into the
ergm package. This function randomly samples networks from
the specified model. We can examine the function using
?simulate.ergm. Let’s see how it works:
# simulate a network from the edge independence model
sim.m1.gaNetU <- simulate(
  m1.gaNetU,               # the model we want to simulate from
  nsim=1,                  # simulate 1 network
  seed = 605               # set the seed to reproduce results
)Boom! You simulated a network. Now what? Take a look at what we did.
## [1] "network"##  Network attributes:
##   vertices = 205 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 92 
##     missing edges= 0 
##     non-missing edges= 92 
## 
##  Vertex attribute names: 
##     Age Race vertex.names 
## 
## No edge attributesAwesome! We have an object of class network that has the
attributes of our observed data. Now, we can compare our observed
network to our simulated network.
# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Get\n Along With Network (Undirected)"
  )
# create the second plot for the simulated network
gplot( 
  sim.m1.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Edge Independence Model"
  )How is the network different from the network generated from the edge independence model? What are the structural features that the model is not capturing?
Well, dang. It looks like our model is doing a great job reproducing the structure in our observed network.
Let’s take a look at the additional models we created:
# simulate a network from the degree effect for Race model
sim.m2.gaNetU <- simulate( m2.gaNetU, nsim=1, seed = 605 )
# simulate a network from the homophily model
sim.m3.gaNetU <- simulate( m3.gaNetU, nsim=1, seed = 605 )
# simulate a network from transitivity model
sim.m4.gaNetU <- simulate( m4.gaNetU, nsim=1, seed = 605 )Now that we have our simulated networks, let’s plot them.
# set the margins
par( mfrow=c( 2,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot( 
  gaNetU, 
  gmode = "graph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Get\n Along With Network (Undirected)"
  )
# create the second plot for the simulated network
gplot( 
  sim.m2.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Degree Model"
  )
# create the third plot for the simulated network
gplot( 
  sim.m3.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#e707f2",
  main = "Network Simulated from\n Homophily Model"
  )
# create the fourth plot for the simulated network
gplot( 
  sim.m4.gaNetU, 
  gmode = "graph",
  edge.col = "grey40", 
  vertex.col="#17f207",
  main = "Network Simulated from\n Transitivity Model"
  )What do think? As our models become more complex, are we doing better at reproducing the structures we observe?
Recall that we are fitting a model to observed data. That is, we are specifying the stochastic process that generated our observed data. We have specified local processes that generate our observed network. So, one way to determine how well the model fits the data is to see how well the model generates global network properties. That is, if the processes we have specified are operating, would we get the degree distribution (for example) we observe?
We can get a sense of how well the model is representing the observed network based on the visual comparison with the simulated networks (as we did above). In addition, we could look at whether our simulated networks are reproducing particular structures in our network.
We can use the summary() function and use the
ergm terms to return descriptive stats on the networks
(i.e. the observed network and the simulated networks). Let’s take a
look at the degree distribution as well as the number of triangles in
our observed network and our simulated networks:
# let's see how this works for our observed network
summary( 
  gaNetU            # the network we want summary info on 
  ~ edges           # give the edges; note the ~
  + degree( 0:5 )   # count of degrees 0 through 5
  + triangle        # count of triangles
  )##    edges  degree0  degree1  degree2  degree3  degree4  degree5 triangle 
##       95      111       47       20       15        7        3       13The output shows that the network has:
95 edges
111 nodes with degree 0, 47 with degree 1, and so on to degree 5.
13 triangles (i.e. i-j, j-k, & i-k)
Now, let’s take this info for our observed network and do the same for the simulated network from the fourth model. Then, we can look at whether the simulation is reproducing the structures we observe in the network.
##    edges  degree0  degree1  degree2  degree3  degree4  degree5 triangle 
##      104       92       63       26       12        8        1       19Let’s build a table to compare them:
# put a table together for ease of comparing
obsSimTab <- cbind(
  summary( gaNetU  ~ edges + degree( 0:5 ) + triangle ),
  summary( sim.m4.gaNetU  ~ edges + degree( 0:5 ) + triangle )
)
# add some names and then examine the table
colnames( obsSimTab ) <- c( "Obs Net", "Sim Net" )
obsSimTab##          Obs Net Sim Net
## edges         95     104
## degree0      111      92
## degree1       47      63
## degree2       20      26
## degree3       15      12
## degree4        7       8
## degree5        3       1
## triangle      13      19Looking through the table, how well is the model representing the structures we have decided to examine?
So far, we have only looked at one simulation. Alternatively, we
could run a lot of simulations and examine the distribution of a
particular structural property (e.g. triangles). In the
simulate() function, we can use several arguments to give
us what we want:
The monitor= argument to pass any specific terms we
want to simulate. Since we are focusing just on the number of triangles,
we would use monitor=~triangle. Note the incorporation of
the ~ in the monitor= argument.
We can also use the output= argument to just
calculate stats rather than generating networks. Since we are only
interested in the stats and don’t want the networks, we use
output="stats".
Let’s simulate 500 networks and take a look at the results:
# run the simulation
gaNetU.large.sim <- simulate(
  m4.gaNetU,                # the model to simulate from
  nsim = 500,               # ask for 500 simulations
  monitor = ~ triangle,     # we want it to focus on the triangles
  output = "stats",           # just keep the stats, not the networks
  seed = 605                # set the seed to reproduce the results
)Boom! We simulated 500 networks from our model.
Let’s see what it gave us.
##      edges nodematch.Race.1 nodematch.Race.2 nodematch.Race.3 gwesp.fixed.0.25
## [1,]   104               49               35                5         48.86745
## [2,]   105               43               38                3         51.54839
## [3,]   119               50               42                7         60.92452
## [4,]   115               44               34                9         54.26092
## [5,]   117               52               43               10         59.30985
## [6,]   108               47               34               12         43.44240
##      triangle
## [1,]       19
## [2,]       19
## [3,]       24
## [4,]       21
## [5,]       23
## [6,]       15We can see that it reports the count of all the terms in the model in the columns and each row is the count for each simulated network.
Now, we can plot the summary stats for all the simulated networks to visualize the information.
hist( 
  gaNetU.large.sim[,"triangle"],  # plot the triangle counts from each simulation
  main = "Simulations \n (X shows observed network)", # add a title
  xlim = c( 0,50 ), # set the x axis limits
  ylim = c( 0,200 ), # set the y axis limits
  xlab = "Number of triangles",  # label the x axis
  ylab = "Number of simulated networks" #label the y axes 
  )
# add a mark where the observed count is
points(
  summary( 
    gaNetU ~ triangle ),
  4, 
  pch="X", 
  cex=2, 
  col="red"
  )In looking at the simulation of the triangles, is our model generating as much transitivity as we observe in the network?
gof()
FunctionAs we just saw, we can simulate lots of networks and then compare our
observed network to the networks simulated from the model estimates. In
ergm, there is a function that does this for us,
gof(). The gof() function calculates
p-values for geodesic distance, degree, and reachability
summaries to diagnose the goodness-of-fit of exponential family random
graph models. We can examine the help page for the gof()
function using ?gof.
As shown in the help, the gof() function takes a fitted
ergm object (i.e. an estimated model), and uses the
GOF= argument to specify what summaries information we want
to calculate for our simulated networks. Rather than
simulating the networks, then calculating the stats, then creating a
plot, the gof() function does all of this for us! (how
nice).
The GOF= argument takes a formula object, of the form
~ <model terms> specifying the statistics to use to
diagnosis the goodness-of-fit of the model. They do not need to be terms
that are in the model, and typically are not. Currently supported terms
are the degree distribution (degree for undirected
graphs, or idegree and/or odegree for
directed graphs), geodesic distances (distance),
shared partner distributions (espartners [1 edge-wise
shared partner is i-j, j-k, and
i-k] and dspartners [1 dyad-wise shared
partner is j-k and i-k]), the triad
census (triadcensus), and the terms of the original model
(model). The default formula for undirected networks is
~ degree + espartners + distance + model, and the default
formula for directed networks is
~ idegree + odegree + espartners + distance + model.
Let’s go ahead and take a look at this with the
m4.gaNetU model we estimated above where:
ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE )We want to calculate the degree distribution (degree),
the edge-wise shared partner distribution (espartners), and
the geodesic distances (distance).
m4.gaNetU.gof <- gof(
  m4.gaNetU,               # our estimated model
  GOF = ~degree            # the degree distribution 
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)
# print out the goodness of fit info
m4.gaNetU.gof## 
## Goodness-of-fit for degree 
## 
##         obs min  mean max MC p-value
## degree0 111  75 96.99 125       0.10
## degree1  47  40 55.51  79       0.24
## degree2  20  19 29.58  46       0.08
## degree3  15   5 14.17  24       0.98
## degree4   7   1  5.46  15       0.66
## degree5   3   0  2.13   8       0.62
## degree6   1   0  0.71   6       0.92
## degree7   0   0  0.32   3       1.00
## degree8   0   0  0.06   1       1.00
## degree9   1   0  0.07   1       0.14
## 
## Goodness-of-fit for edgewise shared partner 
## 
##      obs min  mean max MC p-value
## esp0  60  42 59.02  78       0.94
## esp1  31  15 35.40  57       0.76
## esp2   4   0  3.84  18       0.90
## esp3   0   0  0.35   4       1.00
## esp4   0   0  0.03   1       1.00
## 
## Goodness-of-fit for minimum geodesic distance 
## 
##       obs   min     mean   max MC p-value
## 1      95    65    98.64   136       0.88
## 2     139    42    98.65   183       0.22
## 3     193    21    96.14   252       0.08
## 4     210    10    89.39   298       0.10
## 5     182     3    78.65   360       0.22
## 6     160     0    65.58   398       0.18
## 7     120     0    52.02   379       0.30
## 8      83     0    37.99   298       0.28
## 9      42     0    26.74   228       0.36
## 10     25     0    19.43   210       0.44
## 11     11     0    13.76   190       0.48
## 12      4     0     9.42   172       0.52
## 13      1     0     5.80   148       0.46
## 14      0     0     3.35   123       1.00
## 15      0     0     2.20   119       1.00
## 16      0     0     1.55   108       1.00
## 17      0     0     0.98    74       1.00
## 18      0     0     0.62    48       1.00
## 19      0     0     0.27    23       1.00
## 20      0     0     0.04     3       1.00
## Inf 19645 17848 20208.78 20741       0.30
## 
## Goodness-of-fit for model statistics 
## 
##                      obs min     mean       max MC p-value
## edges            95.0000  65 98.64000 136.00000       0.88
## nodematch.Race.1 40.0000  19 42.18000  62.00000       0.86
## nodematch.Race.2 34.0000  20 35.31000  63.00000       0.98
## nodematch.Race.3  4.0000   0  4.61000  11.00000       1.00
## gwesp.fixed.0.25 35.8848  15 40.57238  71.25171       0.76The object m4.gaNetU.gof shows the goodness-of-fit for
the degree distribution, the edge-wise shared partner distribution, and
the geodesic distances. Each table shows the value in the observed
network (obs) as well as the minimum (min),
mean (mean), and maximum (max) of that value
in the simulations.
The p-value column (p-value) is the proportion
of the simulated values of the statistic that are at least as extreme as
the observed value. Large values indicate that the simulated
networks are similar to the observed network. Values less than 0.05
demonstrate a significant difference between the observed network and
the simulated networks.
We can reference each table using the $pval operator.
Take a look at the names using: names(m4.gaNetU.gof). Let’s
take a look at the degree distribution first:
##          obs min  mean max MC p-value
## degree0  111  75 96.99 125       0.10
## degree1   47  40 55.51  79       0.24
## degree2   20  19 29.58  46       0.08
## degree3   15   5 14.17  24       0.98
## degree4    7   1  5.46  15       0.66
## degree5    3   0  2.13   8       0.62
## degree6    1   0  0.71   6       0.92
## degree7    0   0  0.32   3       1.00
## degree8    0   0  0.06   1       1.00
## degree9    1   0  0.07   1       0.14
## degree10   0   0  0.00   0       1.00In looking through the Goodness-of-fit for degree table, what do the p-values tell us?
The degree distribution in the simulated data are matching well with
the degree distribution for the observed data. In other words, our model
(i.e. ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE ))
is reproducing a degree distribution that is very similar to the degree
distribution in our observed data.
Now let’s look at the other distributions. Take a look at the Goodness-of-fit for edge-wise shared partner table:
##      obs min  mean max MC p-value
## esp0  60  42 59.02  78       0.94
## esp1  31  15 35.40  57       0.76
## esp2   4   0  3.84  18       0.90
## esp3   0   0  0.35   4       1.00
## esp4   0   0  0.03   1       1.00
## esp5   0   0  0.00   0       1.00Recall that an edge-wise shared partner of 1 or esp1 is
i-j, j-k, and i-k.
An edge-wise shared partner of 2 or esp2 is
i-j, j-k and i-k,
as well as i-l and l-l. And so
on…
What is the interpretation of this table?
Conceptually, what does the difference (or lack thereof?) between the simulated networks and the observed networks, based on the edge-wise shared partner distribution tell us?
Now, take a look at the Goodness-of-fit for minimum geodesic distance table:
##    obs min  mean max MC p-value
## 1   95  65 98.64 136       0.88
## 2  139  42 98.65 183       0.22
## 3  193  21 96.14 252       0.08
## 4  210  10 89.39 298       0.10
## 5  182   3 78.65 360       0.22
## 6  160   0 65.58 398       0.18
## 7  120   0 52.02 379       0.30
## 8   83   0 37.99 298       0.28
## 9   42   0 26.74 228       0.36
## 10  25   0 19.43 210       0.44
## 11  11   0 13.76 190       0.48
## 12   4   0  9.42 172       0.52
## 13   1   0  5.80 148       0.46
## 14   0   0  3.35 123       1.00
## 15   0   0  2.20 119       1.00
## 16   0   0  1.55 108       1.00
## 17   0   0  0.98  74       1.00
## 18   0   0  0.62  48       1.00
## 19   0   0  0.27  23       1.00
## 20   0   0  0.04   3       1.00
## 21   0   0  0.00   0       1.00Recall that a geodesic is the shortest path between two nodes. The
distribution shows the count of these (when it shows Inf,
this indicates the number of dyads that are unreachable).
What is the interpretation of this table?
Finally, we can plot the gof() object and examine the
simulated and observed data.
Each plot shows boxplots for the values of the simulated networks. The dark line in each plot is the network statistic for the observed network. Situations where the dark line falls outside of the boxplots indicates that the simulation is not reproducing that particular structural aspect of the network.
Overall, how well is our model at reproducing the observed distributions?
As a contrast, let’s look at the goodness of fit for the model where
we exclude the term to account for transitivity in the network. Recall
that the m3.gaNetU model was:
m3.gaNetU$formula.
# run the gof function
m3.gaNetU.gof <- gof(
  m3.gaNetU,               # our estimated model
  GOF = ~degree            # the degree distribution 
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)
# plot the gof diagnostics
par( mfrow = c( 2,2 ) )
plot( m3.gaNetU.gof )Now, let’s ask a few questions:
Now, let’s go back through this using a directed network. Recall that in the Introduction to Exponential Random Graph Models (ERGMs) in R lab we estimated a series of ERGMs using the power and influence network from PINS. Here, we want to re-estimate the models and examine goodness of fit.
First, let’s bring in the data, create the network object, and assign attributes.
# clear the workspace in case we recycle some objects
rm( list = ls( ) )
# load the libraries we need (this is redundant)
library( sna )
library( network )
library( ergm )
# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-power-w1-adj.csv"
# read in the .csv file
piMat <- as.matrix(
  read.csv( 
    loc,
    as.is = TRUE,
    header = TRUE,
    row.names = 1 
    )
  )
# create the network object
piNetD <- as.network( piMat, directed = TRUE )
# define the attributes object
attrs <- read.csv( 
    "https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
    as.is = TRUE,
    header = TRUE 
    )
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1
# now assign the recoded value
piNetD %v% "Age" <- attrs[,1]
piNetD %v% "Race" <- race.recodedNow, let’s estimate a few models. We will estimate several (each subsequent model will include the prior terms):
Age influences
indegree and outdegreeAgeOk, let’s run the models. (remember that we can use the
summary() function on the model object to summarize the
results).
# edge independence model
m1.piNetD <- ergm( piNetD ~ edges ) 
# add the effects of Age on indegree and outdegree
m2.piNetD <- ergm( piNetD ~ edges  + nodeicov( "Age" ) + nodeocov( "Age" ) ) 
  
# add the homophily effect for Age  
m3.piNetD <- ergm( piNetD ~ edges  + nodeicov( "Age" ) + nodeocov( "Age" ) + absdiff( "Age" ) ) 
# add the effect for reciprocity  
m4.piNetD <- ergm( piNetD ~ edges + nodeicov( "Age" ) + nodeocov( "Age" ) + absdiff( "Age" ) + mutual,
  control = control.ergm( seed = 605 ) ) Let’s simulate a network and plot it along with our observed network. Let’s use the edge independence model:
# simulate a network from the edge independence model
sim.m1.piNetD <- simulate(
  m1.piNetD,               # the model we want to simulate from
  nsim=1,                  # simulate 1 network
  seed = 605               # set the seed to reproduce results
)
# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Power/Influence\n Network (Directed)"
  )
# create the second plot for the simulated network
gplot( 
  sim.m1.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Edge Independence Model"
  )How is the network different from the network generated from the edge independence model? What are the structural features that the model is not capturing?
Let’s take a look at the additional models we created:
# estimate the simulations
sim.m2.piNetD <- simulate( m2.piNetD, nsim=1, seed = 605 )
sim.m3.piNetD <- simulate( m3.piNetD, nsim=1, seed = 605 )
sim.m4.piNetD <- simulate( m4.piNetD, nsim=1, seed = 605 )Now that we have our simulated networks, let’s plot them.
# set the margins
par( mfrow=c( 2,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#f71505",
  main = "PINS Power/Influence\n Network (Directed)"
  )
# create the second plot for the simulated network
gplot( 
  sim.m2.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#11adf5",
  main = "Network Simulated from\n Degree Model"
  )
# create the third plot for the simulated network
gplot( 
  sim.m3.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#e707f2",
  main = "Network Simulated from\n Homophily Model"
  )
# create the fourth plot for the simulated network
gplot( 
  sim.m4.piNetD, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#17f207",
  main = "Network Simulated from\n Reciprocity Model"
  )What do think? As our models become more complex, are we doing better at reproducing the structures we observe?
gof()
functionLet’s examine the goodness of fit for the m4.piNetD
model.
m4.piNetD.gof <- gof(
  m4.piNetD,               # our estimated model
  GOF = ~idegree           # the indegree distribution 
    + odegree              # the outdegree distribution
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)
# print out the goodness of fit info
m4.piNetD.gof## 
## Goodness-of-fit for in-degree 
## 
##           obs min   mean max MC p-value
## idegree0  156  93 110.83 127       0.00
## idegree1   28  41  54.57  70       0.00
## idegree2    7  13  23.22  35       0.00
## idegree3    2   4   9.82  19       0.00
## idegree4    0   0   3.61  10       0.08
## idegree5    2   0   1.63   5       1.00
## idegree6    0   0   0.67   4       1.00
## idegree7    2   0   0.35   2       0.16
## idegree8    3   0   0.22   2       0.00
## idegree9    0   0   0.05   1       1.00
## idegree10   0   0   0.01   1       1.00
## idegree11   0   0   0.01   1       1.00
## idegree12   1   0   0.00   0       0.00
## idegree13   3   0   0.01   1       0.00
## idegree14   1   0   0.00   0       0.00
## 
## Goodness-of-fit for out-degree 
## 
##          obs min  mean max MC p-value
## odegree0 127  79 93.14 115       0.00
## odegree1  36  57 72.94  88       0.00
## odegree2  23  21 29.61  41       0.16
## odegree3  10   2  7.48  16       0.38
## odegree4   4   0  1.67   5       0.16
## odegree5   2   0  0.16   3       0.02
## odegree6   1   0  0.00   0       0.00
## odegree8   1   0  0.00   0       0.00
## odegree9   1   0  0.00   0       0.00
## 
## Goodness-of-fit for edgewise shared partner 
## 
##          obs min   mean max MC p-value
## esp.OTP0 129 136 161.04 184          0
## esp.OTP1  28   0   1.04   4          0
## esp.OTP2   3   0   0.00   0          0
## esp.OTP3   1   0   0.00   0          0
## 
## Goodness-of-fit for minimum geodesic distance 
## 
##       obs   min     mean   max MC p-value
## 1     161   137   162.08   185       0.90
## 2     142    78   133.17   189       0.62
## 3      79    46   105.60   199       0.54
## 4      25    19    81.87   213       0.12
## 5       3     3    61.05   228       0.02
## 6       0     0    43.09   219       0.06
## 7       0     0    29.13   190       0.14
## 8       0     0    18.58   175       0.50
## 9       0     0    12.11   169       0.74
## 10      0     0     7.64   162       1.00
## 11      0     0     4.40   122       1.00
## 12      0     0     2.51    91       1.00
## 13      0     0     1.34    62       1.00
## 14      0     0     0.85    44       1.00
## 15      0     0     0.33    16       1.00
## 16      0     0     0.17    14       1.00
## 17      0     0     0.10    10       1.00
## 18      0     0     0.04     4       1.00
## Inf 41410 39542 41155.94 41518       0.24
## 
## Goodness-of-fit for model statistics 
## 
##               obs  min    mean  max MC p-value
## edges         161  137  162.08  185       0.90
## nodeicov.Age 7849 6538 7915.01 8974       0.82
## nodeocov.Age 6543 5463 6569.14 7541       0.92
## absdiff.Age  2058 1643 2078.87 2486       0.82
## mutual          2    0    1.75    7       1.00Look through the table for each distribution. What is the interpretation of each table?
Finally, we can plot the gof() object and examine the
simulated and observed data.
Each plot shows boxplots for the values of the simulated networks. The dark line in each plot is the network statistic for the observed network. Situations where the dark line falls outside of the boxplots indicates that the simulation is not reproducing that particular structural aspect of the network.
Overall, how well is our model at reproducing the observed distributions?
As a contrast, let’s look at the goodness of fit for the edge
independence model (m1.piNetD).
# run the gof function
m1.piNetD.gof <- gof(
  m1.piNetD,               # our estimated model
  GOF = ~idegree           # the indegree distribution 
    + odegree              # the outdegree distribution
    + espartners           # the edge-wise shared partners distribution
    + distance,            # the geodesic distance distribution
  verbose = TRUE,          # we want it to tell us what it is doing while it is doing it
  control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results 
)
# plot the gof diagnostics
par( mfrow = c( 3,2 ) )
plot( m1.piNetD.gof )Now, let’s ask a few questions:
Suppose we wanted to simulate a model where the effect of homophily
is less strong. In other words, what would this network look
like if there was not homophily? We can do this by plugging new values
into the simulate.ergm function.
Since a fitted ergm model is of class ergm,
it has several parts that we can reference as objects.
##  [1] "coefficients"    "sample"          "iterations"      "MCMCtheta"      
##  [5] "loglikelihood"   "gradient"        "hessian"         "covar"          
##  [9] "failure"         "newnetwork"      "coef.init"       "est.cov"        
## [13] "coef.hist"       "stats.hist"      "steplen.hist"    "control"        
## [17] "etamap"          "MCMCflag"        "nw.stats"        "call"           
## [21] "network"         "ergm_version"    "info"            "MPLE_is_MLE"    
## [25] "drop"            "offset"          "estimable"       "formula"        
## [29] "reference"       "constraints"     "obs.constraints" "estimate"       
## [33] "estimate.desc"   "null.lik"        "mle.lik"# the coefficients are an object that we can take, modify, and use with the simulation
m4.piNetD$coefficients##        edges nodeicov.Age nodeocov.Age  absdiff.Age       mutual 
## -8.569733632  0.081229804 -0.004240028 -0.033707476  1.743896037If we want to simulate a model where there is no homophily, then we
can set the coefficient to zero, then simulate from that new set of
model coefficients. In our case, we well change the effect of
absdiff.Age to be zero.
# create a new object that is just the model coefficients
no.homophily.coef <- m4.piNetD$coefficients
# set the coefficent to zero; it is the 4th element
no.homophily.coef[4] <- 0 
# Now, the effect is zero.
no.homophily.coef##        edges nodeicov.Age nodeocov.Age  absdiff.Age       mutual 
## -8.569733632  0.081229804 -0.004240028  0.000000000  1.743896037# now, plug the edited coefficient into a simulation
no.homophily.sim <- simulate(
  m4.piNetD,                # the model we are simulating from 
  coef = no.homophily.coef, # tell it to use the new coefficients
  verbose = TRUE            # tell it to show us what it is doing
) Now that we have the simulated network, we can plot it and compare it
to our observed network. We will want to size the nodes by age. So,
let’s first run that through the rescale() function.
# define the rescale function
rescale <- function( nchar, low, high ){
  min_d <- min( nchar )
  max_d <- max( nchar )
  rscl  <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
  rscl
}
# create the objects to use for the plots
PiNetDobsSize <- rescale( piNetD %v% "Age", 0.2, 3 )
PiNetDsimSize <- rescale( no.homophily.sim %v% "Age", 0.2, 3 )# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col="grey40", 
  vertex.col="#f71505",
  vertex.cex = PiNetDobsSize,
  displayisolates = FALSE,
  main = "PINS Power/Influence\n Network (Directed)"
  )
# create the second plot for the simulated network
gplot( 
  no.homophily.sim, 
  gmode = "digraph",
  edge.col = "grey40", 
  vertex.col="#03fc7b",
  vertex.cex = PiNetDsimSize,
  displayisolates = FALSE,
  main = "Network Simulated from\n No Homophily Model"
  )How does the simulated network differ from the
piNetD network? Can you see the difference in
the role that homophily plays?
In the model above, m4.piNetD, we showed that there was
evidence of reciprocity in power/influence nominations. What would the
network look like if there was an aversion, such that the
probability of a tie from i to j is lower if there is
a tie from j to i?
If we want to simulate a model where there is an aversion to reciprocal ties, then we can take the coefficient for the reciprocity term and multiply it by -1, then simulate from that new set of model coefficients.
# create a new object that is just the model coefficients
anti.recip.coef <- m4.piNetD$coefficients
# set the coefficient
anti.recip.coef[5] <- anti.recip.coef[5]*-1 
# now, plug the edited coefficient into a simulation
anti.recip.sim <- simulate(
  m4.piNetD,                # the model we are simulating from 
  coef = anti.recip.coef,   # tell it to use the new coefficients
  verbose = TRUE            # tell it to show us what it is doing
) Now that we have the simulated network, we can plot it and compare it
to our observed network. We will want to color the reciprocated ties. To
do this, we can get the matrix of reciprocated ties (using
symmetrize()) and then pass colors to this matrix.
# create the symmetric matrix
sympiMat <- symmetrize( piNetD, rule = "strong" )
# create a matrix of colors
sympiMatCols <- sympiMat
sympiMatCols[sympiMat == 0] <- "grey80"
sympiMatCols[sympiMat == 1] <- "#3746a3" # set a color for mutual ties
# create the symmetric matrix
sympiMat2 <- symmetrize( anti.recip.sim, rule = "strong" )
# create a matrix of colors
sympiMatCols2 <- sympiMat2
sympiMatCols2[sympiMat2 == 0] <- "grey80"
sympiMatCols2[sympiMat2 == 1] <- "#3796a4" # set a color for mutual tiesNow, we pass this information to the plot.
# set the margins
par( mfrow=c( 1,2 ), 
     mar=c( 0.1, 0.5, 3, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot( 
  piNetD, 
  gmode = "digraph",
  edge.col=sympiMatCols, 
  vertex.col="#03eeff",
  displayisolates = FALSE,
  main = "PINS Power/Influence\n Network (Directed)"
  )
# create the second plot for the simulated network
gplot( 
  no.homophily.sim, 
  gmode = "digraph",
  edge.col = sympiMatCols2, 
  vertex.col="#03fc7b",
  displayisolates = FALSE,
  main = "Network Simulated from\n Aversion to\n Reciprocity Model"
  )How does the simulated network differ from the
piNetD network? Can you see the difference in
the role that reciprocity plays?
In this lab, we continued working with ERGMs in R using the
ergm package. Specifically, we ask the question: how do
we evaluate whether our model is a good fit to the data? This is
the notion of “goodness of fit” and we worked through how to assess
model fit using the gof() function in the ergm
package.