In the Introduction to Exponential Random Graph Models (ERGMs) lecture you were introduced to several examples of models from the exponential family of random graphs and we worked through those models in the Introduction to Exponential Random Graph Models (ERGMs) in R lab.
This lab continues working with ERGMs in R using the
ergm
package. Specifically, we ask the question: how do
we evaluate whether our model is a good fit to the data? This is
the notion of “goodness of fit” and we can assess how well our model
fits the data using the gof()
function in the
ergm
package.
Recall that in the Introduction to Exponential Random Graph Models (ERGMs) in R lab we estimated a series of ERGMs using the get along with network from PINS. Here, we want to re-estimate the models and examine goodness of fit.
First, let’s bring in the data, create the network object, and assign attributes.
# load the libraries we need
library( sna )
library( network )
library( ergm )
# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-getalong-w1-adj.csv"
# read in the .csv file
gaMat <- as.matrix(
read.csv(
loc,
as.is = TRUE,
header = TRUE,
row.names = 1
)
)
# use the symmetrize() function to create an undirected matrix
gaMatU <- symmetrize( gaMat, rule = "strong" )
# create the network object
gaNetU <- as.network( gaMatU, directed = FALSE )
# define the attributes object
attrs <- read.csv(
"https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
as.is = TRUE,
header = TRUE
)
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1
# now assign the recoded value
gaNetU %v% "Age" <- attrs[,1]
gaNetU %v% "Race" <- race.recoded
Now, let’s estimate a few models. We will estimate several (each subsequent model will include the prior terms):
Race
influences
degreeRace
Ok, let’s run the models. (remember that we can use the
summary()
function on the model object to summarize the
results).
# edge independence model
m1.gaNetU <- ergm( gaNetU ~ edges )
# add the effect of Race on degree
m2.gaNetU <- ergm( gaNetU ~ edges + nodefactor( "Race" ) )
# add the homophily effect for Race
m3.gaNetU <- ergm( gaNetU ~ edges + nodematch( "Race" ) )
# add the effect for transitivity
m4.gaNetU <- ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE ),
control = control.ergm( seed = 605 ) )
simulate.ergm()
or simulate()
and control.ergm()
FunctionsHow can we go about assessing model fit? What would a good fitting model do? Well, one way to think about this is whether the model we estimate reproduces structures in the data. In other words, if we estimate a model, then simulated some data from that model, we could say: “does the simulated network look like our observed network?” This is the logic of how we will approach goodness of fit with our models.
First, we need to think about simulating a network (or networks) from
our model. We can do this using the simulate.ergm()
or
simulate()
function that is built into the
ergm
package. This function randomly samples networks from
the specified model. We can examine the function using
?simulate.ergm
. Let’s see how it works:
# simulate a network from the edge independence model
sim.m1.gaNetU <- simulate(
m1.gaNetU, # the model we want to simulate from
nsim=1, # simulate 1 network
seed = 605 # set the seed to reproduce results
)
Boom! You simulated a network. Now what? Take a look at what we did.
## [1] "network"
## Network attributes:
## vertices = 205
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 92
## missing edges= 0
## non-missing edges= 92
##
## Vertex attribute names:
## Age Race vertex.names
##
## No edge attributes
Awesome! We have an object of class network
that has the
attributes of our observed data. Now, we can compare our observed
network to our simulated network.
# set the margins
par( mfrow=c( 1,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
gaNetU,
gmode = "graph",
edge.col="grey40",
vertex.col="#f71505",
main = "PINS Get\n Along With Network (Undirected)"
)
# create the second plot for the simulated network
gplot(
sim.m1.gaNetU,
gmode = "graph",
edge.col = "grey40",
vertex.col="#11adf5",
main = "Network Simulated from\n Edge Independence Model"
)
How is the network different from the network generated from the edge independence model? What are the structural features that the model is not capturing?
Well, dang. It looks like our model is doing a great job reproducing the structure in our observed network.
Let’s take a look at the additional models we created:
# simulate a network from the degree effect for Race model
sim.m2.gaNetU <- simulate( m2.gaNetU, nsim=1, seed = 605 )
# simulate a network from the homophily model
sim.m3.gaNetU <- simulate( m3.gaNetU, nsim=1, seed = 605 )
# simulate a network from transitivity model
sim.m4.gaNetU <- simulate( m4.gaNetU, nsim=1, seed = 605 )
Now that we have our simulated networks, let’s plot them.
# set the margins
par( mfrow=c( 2,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
gaNetU,
gmode = "graph",
edge.col="grey40",
vertex.col="#f71505",
main = "PINS Get\n Along With Network (Undirected)"
)
# create the second plot for the simulated network
gplot(
sim.m2.gaNetU,
gmode = "graph",
edge.col = "grey40",
vertex.col="#11adf5",
main = "Network Simulated from\n Degree Model"
)
# create the third plot for the simulated network
gplot(
sim.m3.gaNetU,
gmode = "graph",
edge.col = "grey40",
vertex.col="#e707f2",
main = "Network Simulated from\n Homophily Model"
)
# create the fourth plot for the simulated network
gplot(
sim.m4.gaNetU,
gmode = "graph",
edge.col = "grey40",
vertex.col="#17f207",
main = "Network Simulated from\n Transitivity Model"
)
What do think? As our models become more complex, are we doing better at reproducing the structures we observe?
Recall that we are fitting a model to observed data. That is, we are specifying the stochastic process that generated our observed data. We have specified local processes that generate our observed network. So, one way to determine how well the model fits the data is to see how well the model generates global network properties. That is, if the processes we have specified are operating, would we get the degree distribution (for example) we observe?
We can get a sense of how well the model is representing the observed network based on the visual comparison with the simulated networks (as we did above). In addition, we could look at whether our simulated networks are reproducing particular structures in our network.
We can use the summary()
function and use the
ergm
terms to return descriptive stats on the networks
(i.e. the observed network and the simulated networks). Let’s take a
look at the degree distribution as well as the number of triangles in
our observed network and our simulated networks:
# let's see how this works for our observed network
summary(
gaNetU # the network we want summary info on
~ edges # give the edges; note the ~
+ degree( 0:5 ) # count of degrees 0 through 5
+ triangle # count of triangles
)
## edges degree0 degree1 degree2 degree3 degree4 degree5 triangle
## 95 111 47 20 15 7 3 13
The output shows that the network has:
95 edges
111 nodes with degree 0, 47 with degree 1, and so on to degree 5.
13 triangles (i.e. i-j, j-k, & i-k)
Now, let’s take this info for our observed network and do the same for the simulated network from the fourth model. Then, we can look at whether the simulation is reproducing the structures we observe in the network.
## edges degree0 degree1 degree2 degree3 degree4 degree5 triangle
## 104 92 63 26 12 8 1 19
Let’s build a table to compare them:
# put a table together for ease of comparing
obsSimTab <- cbind(
summary( gaNetU ~ edges + degree( 0:5 ) + triangle ),
summary( sim.m4.gaNetU ~ edges + degree( 0:5 ) + triangle )
)
# add some names and then examine the table
colnames( obsSimTab ) <- c( "Obs Net", "Sim Net" )
obsSimTab
## Obs Net Sim Net
## edges 95 104
## degree0 111 92
## degree1 47 63
## degree2 20 26
## degree3 15 12
## degree4 7 8
## degree5 3 1
## triangle 13 19
Looking through the table, how well is the model representing the structures we have decided to examine?
So far, we have only looked at one simulation. Alternatively, we
could run a lot of simulations and examine the distribution of a
particular structural property (e.g. triangles). In the
simulate()
function, we can use several arguments to give
us what we want:
The monitor=
argument to pass any specific terms we
want to simulate. Since we are focusing just on the number of triangles,
we would use monitor=~triangle
. Note the incorporation of
the ~
in the monitor=
argument.
We can also use the output=
argument to just
calculate stats rather than generating networks. Since we are only
interested in the stats and don’t want the networks, we use
output="stats"
.
Let’s simulate 500 networks and take a look at the results:
# run the simulation
gaNetU.large.sim <- simulate(
m4.gaNetU, # the model to simulate from
nsim = 500, # ask for 500 simulations
monitor = ~ triangle, # we want it to focus on the triangles
output = "stats", # just keep the stats, not the networks
seed = 605 # set the seed to reproduce the results
)
Boom! We simulated 500 networks from our model.
Let’s see what it gave us.
## edges nodematch.Race.1 nodematch.Race.2 nodematch.Race.3 gwesp.fixed.0.25
## [1,] 104 49 35 5 48.86745
## [2,] 105 43 38 3 51.54839
## [3,] 119 50 42 7 60.92452
## [4,] 115 44 34 9 54.26092
## [5,] 117 52 43 10 59.30985
## [6,] 108 47 34 12 43.44240
## triangle
## [1,] 19
## [2,] 19
## [3,] 24
## [4,] 21
## [5,] 23
## [6,] 15
We can see that it reports the count of all the terms in the model in the columns and each row is the count for each simulated network.
Now, we can plot the summary stats for all the simulated networks to visualize the information.
hist(
gaNetU.large.sim[,"triangle"], # plot the triangle counts from each simulation
main = "Simulations \n (X shows observed network)", # add a title
xlim = c( 0,50 ), # set the x axis limits
ylim = c( 0,200 ), # set the y axis limits
xlab = "Number of triangles", # label the x axis
ylab = "Number of simulated networks" #label the y axes
)
# add a mark where the observed count is
points(
summary(
gaNetU ~ triangle ),
4,
pch="X",
cex=2,
col="red"
)
In looking at the simulation of the triangles, is our model generating as much transitivity as we observe in the network?
gof()
FunctionAs we just saw, we can simulate lots of networks and then compare our
observed network to the networks simulated from the model estimates. In
ergm
, there is a function that does this for us,
gof()
. The gof()
function calculates
p-values for geodesic distance, degree, and reachability
summaries to diagnose the goodness-of-fit of exponential family random
graph models. We can examine the help page for the gof()
function using ?gof
.
As shown in the help, the gof()
function takes a fitted
ergm
object (i.e. an estimated model), and uses the
GOF=
argument to specify what summaries information we want
to calculate for our simulated networks. Rather than
simulating the networks, then calculating the stats, then creating a
plot, the gof()
function does all of this for us! (how
nice).
The GOF=
argument takes a formula object, of the form
~ <model terms>
specifying the statistics to use to
diagnosis the goodness-of-fit of the model. They do not need to be terms
that are in the model, and typically are not. Currently supported terms
are the degree distribution (degree
for undirected
graphs, or idegree
and/or odegree
for
directed graphs), geodesic distances (distance
),
shared partner distributions (espartners
[1 edge-wise
shared partner is i-j, j-k, and
i-k] and dspartners
[1 dyad-wise shared
partner is j-k and i-k]), the triad
census (triadcensus
), and the terms of the original model
(model
). The default formula for undirected networks is
~ degree + espartners + distance + model
, and the default
formula for directed networks is
~ idegree + odegree + espartners + distance + model
.
Let’s go ahead and take a look at this with the
m4.gaNetU
model we estimated above where:
ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE )
We want to calculate the degree distribution (degree
),
the edge-wise shared partner distribution (espartners
), and
the geodesic distances (distance
).
m4.gaNetU.gof <- gof(
m4.gaNetU, # our estimated model
GOF = ~degree # the degree distribution
+ espartners # the edge-wise shared partners distribution
+ distance, # the geodesic distance distribution
verbose = TRUE, # we want it to tell us what it is doing while it is doing it
control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results
)
# print out the goodness of fit info
m4.gaNetU.gof
##
## Goodness-of-fit for degree
##
## obs min mean max MC p-value
## degree0 111 75 96.99 125 0.10
## degree1 47 40 55.51 79 0.24
## degree2 20 19 29.58 46 0.08
## degree3 15 5 14.17 24 0.98
## degree4 7 1 5.46 15 0.66
## degree5 3 0 2.13 8 0.62
## degree6 1 0 0.71 6 0.92
## degree7 0 0 0.32 3 1.00
## degree8 0 0 0.06 1 1.00
## degree9 1 0 0.07 1 0.14
##
## Goodness-of-fit for edgewise shared partner
##
## obs min mean max MC p-value
## esp0 60 42 59.02 78 0.94
## esp1 31 15 35.40 57 0.76
## esp2 4 0 3.84 18 0.90
## esp3 0 0 0.35 4 1.00
## esp4 0 0 0.03 1 1.00
##
## Goodness-of-fit for minimum geodesic distance
##
## obs min mean max MC p-value
## 1 95 65 98.64 136 0.88
## 2 139 42 98.65 183 0.22
## 3 193 21 96.14 252 0.08
## 4 210 10 89.39 298 0.10
## 5 182 3 78.65 360 0.22
## 6 160 0 65.58 398 0.18
## 7 120 0 52.02 379 0.30
## 8 83 0 37.99 298 0.28
## 9 42 0 26.74 228 0.36
## 10 25 0 19.43 210 0.44
## 11 11 0 13.76 190 0.48
## 12 4 0 9.42 172 0.52
## 13 1 0 5.80 148 0.46
## 14 0 0 3.35 123 1.00
## 15 0 0 2.20 119 1.00
## 16 0 0 1.55 108 1.00
## 17 0 0 0.98 74 1.00
## 18 0 0 0.62 48 1.00
## 19 0 0 0.27 23 1.00
## 20 0 0 0.04 3 1.00
## Inf 19645 17848 20208.78 20741 0.30
##
## Goodness-of-fit for model statistics
##
## obs min mean max MC p-value
## edges 95.0000 65 98.64000 136.00000 0.88
## nodematch.Race.1 40.0000 19 42.18000 62.00000 0.86
## nodematch.Race.2 34.0000 20 35.31000 63.00000 0.98
## nodematch.Race.3 4.0000 0 4.61000 11.00000 1.00
## gwesp.fixed.0.25 35.8848 15 40.57238 71.25171 0.76
The object m4.gaNetU.gof
shows the goodness-of-fit for
the degree distribution, the edge-wise shared partner distribution, and
the geodesic distances. Each table shows the value in the observed
network (obs
) as well as the minimum (min
),
mean (mean
), and maximum (max
) of that value
in the simulations.
The p-value column (p-value
) is the proportion
of the simulated values of the statistic that are at least as extreme as
the observed value. Large values indicate that the simulated
networks are similar to the observed network. Values less than 0.05
demonstrate a significant difference between the observed network and
the simulated networks.
We can reference each table using the $pval
operator.
Take a look at the names using: names(m4.gaNetU.gof)
. Let’s
take a look at the degree distribution first:
## obs min mean max MC p-value
## degree0 111 75 96.99 125 0.10
## degree1 47 40 55.51 79 0.24
## degree2 20 19 29.58 46 0.08
## degree3 15 5 14.17 24 0.98
## degree4 7 1 5.46 15 0.66
## degree5 3 0 2.13 8 0.62
## degree6 1 0 0.71 6 0.92
## degree7 0 0 0.32 3 1.00
## degree8 0 0 0.06 1 1.00
## degree9 1 0 0.07 1 0.14
## degree10 0 0 0.00 0 1.00
In looking through the Goodness-of-fit for degree table, what do the p-values tell us?
The degree distribution in the simulated data are matching well with
the degree distribution for the observed data. In other words, our model
(i.e. ergm( gaNetU ~ edges + nodematch( "Race", diff=TRUE ) + gwesp( decay = 0.25, fixed = TRUE )
)
is reproducing a degree distribution that is very similar to the degree
distribution in our observed data.
Now let’s look at the other distributions. Take a look at the Goodness-of-fit for edge-wise shared partner table:
## obs min mean max MC p-value
## esp0 60 42 59.02 78 0.94
## esp1 31 15 35.40 57 0.76
## esp2 4 0 3.84 18 0.90
## esp3 0 0 0.35 4 1.00
## esp4 0 0 0.03 1 1.00
## esp5 0 0 0.00 0 1.00
Recall that an edge-wise shared partner of 1 or esp1
is
i-j, j-k, and i-k.
An edge-wise shared partner of 2 or esp2
is
i-j, j-k and i-k,
as well as i-l and l-l. And so
on…
What is the interpretation of this table?
Conceptually, what does the difference (or lack thereof?) between the simulated networks and the observed networks, based on the edge-wise shared partner distribution tell us?
Now, take a look at the Goodness-of-fit for minimum geodesic distance table:
## obs min mean max MC p-value
## 1 95 65 98.64 136 0.88
## 2 139 42 98.65 183 0.22
## 3 193 21 96.14 252 0.08
## 4 210 10 89.39 298 0.10
## 5 182 3 78.65 360 0.22
## 6 160 0 65.58 398 0.18
## 7 120 0 52.02 379 0.30
## 8 83 0 37.99 298 0.28
## 9 42 0 26.74 228 0.36
## 10 25 0 19.43 210 0.44
## 11 11 0 13.76 190 0.48
## 12 4 0 9.42 172 0.52
## 13 1 0 5.80 148 0.46
## 14 0 0 3.35 123 1.00
## 15 0 0 2.20 119 1.00
## 16 0 0 1.55 108 1.00
## 17 0 0 0.98 74 1.00
## 18 0 0 0.62 48 1.00
## 19 0 0 0.27 23 1.00
## 20 0 0 0.04 3 1.00
## 21 0 0 0.00 0 1.00
Recall that a geodesic is the shortest path between two nodes. The
distribution shows the count of these (when it shows Inf
,
this indicates the number of dyads that are unreachable).
What is the interpretation of this table?
Finally, we can plot the gof()
object and examine the
simulated and observed data.
Each plot shows boxplots for the values of the simulated networks. The dark line in each plot is the network statistic for the observed network. Situations where the dark line falls outside of the boxplots indicates that the simulation is not reproducing that particular structural aspect of the network.
Overall, how well is our model at reproducing the observed distributions?
As a contrast, let’s look at the goodness of fit for the model where
we exclude the term to account for transitivity in the network. Recall
that the m3.gaNetU
model was:
m3.gaNetU$formula
.
# run the gof function
m3.gaNetU.gof <- gof(
m3.gaNetU, # our estimated model
GOF = ~degree # the degree distribution
+ espartners # the edge-wise shared partners distribution
+ distance, # the geodesic distance distribution
verbose = TRUE, # we want it to tell us what it is doing while it is doing it
control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results
)
# plot the gof diagnostics
par( mfrow = c( 2,2 ) )
plot( m3.gaNetU.gof )
Now, let’s ask a few questions:
Now, let’s go back through this using a directed network. Recall that in the Introduction to Exponential Random Graph Models (ERGMs) in R lab we estimated a series of ERGMs using the power and influence network from PINS. Here, we want to re-estimate the models and examine goodness of fit.
First, let’s bring in the data, create the network object, and assign attributes.
# clear the workspace in case we recycle some objects
rm( list = ls( ) )
# load the libraries we need (this is redundant)
library( sna )
library( network )
library( ergm )
# set the location for the file
loc <- "https://github.com/jacobtnyoung/sna-textbook/raw/main/data/data-PINS-power-w1-adj.csv"
# read in the .csv file
piMat <- as.matrix(
read.csv(
loc,
as.is = TRUE,
header = TRUE,
row.names = 1
)
)
# create the network object
piNetD <- as.network( piMat, directed = TRUE )
# define the attributes object
attrs <- read.csv(
"https://raw.githubusercontent.com/jacobtnyoung/sna-textbook/main/data/data-PINS-w1-age-race-attributes.csv",
as.is = TRUE,
header = TRUE
)
# let's recode the single 4 to a 1
race.recoded <- attrs[,2]
race.recoded[ race.recoded == 4 ] <- 1
# now assign the recoded value
piNetD %v% "Age" <- attrs[,1]
piNetD %v% "Race" <- race.recoded
Now, let’s estimate a few models. We will estimate several (each subsequent model will include the prior terms):
Age
influences
indegree and outdegreeAge
Ok, let’s run the models. (remember that we can use the
summary()
function on the model object to summarize the
results).
# edge independence model
m1.piNetD <- ergm( piNetD ~ edges )
# add the effects of Age on indegree and outdegree
m2.piNetD <- ergm( piNetD ~ edges + nodeicov( "Age" ) + nodeocov( "Age" ) )
# add the homophily effect for Age
m3.piNetD <- ergm( piNetD ~ edges + nodeicov( "Age" ) + nodeocov( "Age" ) + absdiff( "Age" ) )
# add the effect for reciprocity
m4.piNetD <- ergm( piNetD ~ edges + nodeicov( "Age" ) + nodeocov( "Age" ) + absdiff( "Age" ) + mutual,
control = control.ergm( seed = 605 ) )
Let’s simulate a network and plot it along with our observed network. Let’s use the edge independence model:
# simulate a network from the edge independence model
sim.m1.piNetD <- simulate(
m1.piNetD, # the model we want to simulate from
nsim=1, # simulate 1 network
seed = 605 # set the seed to reproduce results
)
# set the margins
par( mfrow=c( 1,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
piNetD,
gmode = "digraph",
edge.col="grey40",
vertex.col="#f71505",
main = "PINS Power/Influence\n Network (Directed)"
)
# create the second plot for the simulated network
gplot(
sim.m1.piNetD,
gmode = "digraph",
edge.col = "grey40",
vertex.col="#11adf5",
main = "Network Simulated from\n Edge Independence Model"
)
How is the network different from the network generated from the edge independence model? What are the structural features that the model is not capturing?
Let’s take a look at the additional models we created:
# estimate the simulations
sim.m2.piNetD <- simulate( m2.piNetD, nsim=1, seed = 605 )
sim.m3.piNetD <- simulate( m3.piNetD, nsim=1, seed = 605 )
sim.m4.piNetD <- simulate( m4.piNetD, nsim=1, seed = 605 )
Now that we have our simulated networks, let’s plot them.
# set the margins
par( mfrow=c( 2,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
piNetD,
gmode = "digraph",
edge.col="grey40",
vertex.col="#f71505",
main = "PINS Power/Influence\n Network (Directed)"
)
# create the second plot for the simulated network
gplot(
sim.m2.piNetD,
gmode = "digraph",
edge.col = "grey40",
vertex.col="#11adf5",
main = "Network Simulated from\n Degree Model"
)
# create the third plot for the simulated network
gplot(
sim.m3.piNetD,
gmode = "digraph",
edge.col = "grey40",
vertex.col="#e707f2",
main = "Network Simulated from\n Homophily Model"
)
# create the fourth plot for the simulated network
gplot(
sim.m4.piNetD,
gmode = "digraph",
edge.col = "grey40",
vertex.col="#17f207",
main = "Network Simulated from\n Reciprocity Model"
)
What do think? As our models become more complex, are we doing better at reproducing the structures we observe?
gof()
functionLet’s examine the goodness of fit for the m4.piNetD
model.
m4.piNetD.gof <- gof(
m4.piNetD, # our estimated model
GOF = ~idegree # the indegree distribution
+ odegree # the outdegree distribution
+ espartners # the edge-wise shared partners distribution
+ distance, # the geodesic distance distribution
verbose = TRUE, # we want it to tell us what it is doing while it is doing it
control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results
)
# print out the goodness of fit info
m4.piNetD.gof
##
## Goodness-of-fit for in-degree
##
## obs min mean max MC p-value
## idegree0 156 93 110.83 127 0.00
## idegree1 28 41 54.57 70 0.00
## idegree2 7 13 23.22 35 0.00
## idegree3 2 4 9.82 19 0.00
## idegree4 0 0 3.61 10 0.08
## idegree5 2 0 1.63 5 1.00
## idegree6 0 0 0.67 4 1.00
## idegree7 2 0 0.35 2 0.16
## idegree8 3 0 0.22 2 0.00
## idegree9 0 0 0.05 1 1.00
## idegree10 0 0 0.01 1 1.00
## idegree11 0 0 0.01 1 1.00
## idegree12 1 0 0.00 0 0.00
## idegree13 3 0 0.01 1 0.00
## idegree14 1 0 0.00 0 0.00
##
## Goodness-of-fit for out-degree
##
## obs min mean max MC p-value
## odegree0 127 79 93.14 115 0.00
## odegree1 36 57 72.94 88 0.00
## odegree2 23 21 29.61 41 0.16
## odegree3 10 2 7.48 16 0.38
## odegree4 4 0 1.67 5 0.16
## odegree5 2 0 0.16 3 0.02
## odegree6 1 0 0.00 0 0.00
## odegree8 1 0 0.00 0 0.00
## odegree9 1 0 0.00 0 0.00
##
## Goodness-of-fit for edgewise shared partner
##
## obs min mean max MC p-value
## esp.OTP0 129 136 161.04 184 0
## esp.OTP1 28 0 1.04 4 0
## esp.OTP2 3 0 0.00 0 0
## esp.OTP3 1 0 0.00 0 0
##
## Goodness-of-fit for minimum geodesic distance
##
## obs min mean max MC p-value
## 1 161 137 162.08 185 0.90
## 2 142 78 133.17 189 0.62
## 3 79 46 105.60 199 0.54
## 4 25 19 81.87 213 0.12
## 5 3 3 61.05 228 0.02
## 6 0 0 43.09 219 0.06
## 7 0 0 29.13 190 0.14
## 8 0 0 18.58 175 0.50
## 9 0 0 12.11 169 0.74
## 10 0 0 7.64 162 1.00
## 11 0 0 4.40 122 1.00
## 12 0 0 2.51 91 1.00
## 13 0 0 1.34 62 1.00
## 14 0 0 0.85 44 1.00
## 15 0 0 0.33 16 1.00
## 16 0 0 0.17 14 1.00
## 17 0 0 0.10 10 1.00
## 18 0 0 0.04 4 1.00
## Inf 41410 39542 41155.94 41518 0.24
##
## Goodness-of-fit for model statistics
##
## obs min mean max MC p-value
## edges 161 137 162.08 185 0.90
## nodeicov.Age 7849 6538 7915.01 8974 0.82
## nodeocov.Age 6543 5463 6569.14 7541 0.92
## absdiff.Age 2058 1643 2078.87 2486 0.82
## mutual 2 0 1.75 7 1.00
Look through the table for each distribution. What is the interpretation of each table?
Finally, we can plot the gof()
object and examine the
simulated and observed data.
Each plot shows boxplots for the values of the simulated networks. The dark line in each plot is the network statistic for the observed network. Situations where the dark line falls outside of the boxplots indicates that the simulation is not reproducing that particular structural aspect of the network.
Overall, how well is our model at reproducing the observed distributions?
As a contrast, let’s look at the goodness of fit for the edge
independence model (m1.piNetD
).
# run the gof function
m1.piNetD.gof <- gof(
m1.piNetD, # our estimated model
GOF = ~idegree # the indegree distribution
+ odegree # the outdegree distribution
+ espartners # the edge-wise shared partners distribution
+ distance, # the geodesic distance distribution
verbose = TRUE, # we want it to tell us what it is doing while it is doing it
control = control.gof.ergm( seed = 605 ) # set the seed to reproduce results
)
# plot the gof diagnostics
par( mfrow = c( 3,2 ) )
plot( m1.piNetD.gof )
Now, let’s ask a few questions:
Suppose we wanted to simulate a model where the effect of homophily
is less strong. In other words, what would this network look
like if there was not homophily? We can do this by plugging new values
into the simulate.ergm
function.
Since a fitted ergm
model is of class ergm
,
it has several parts that we can reference as objects.
## [1] "coefficients" "sample" "iterations" "MCMCtheta"
## [5] "loglikelihood" "gradient" "hessian" "covar"
## [9] "failure" "newnetwork" "coef.init" "est.cov"
## [13] "coef.hist" "stats.hist" "steplen.hist" "control"
## [17] "etamap" "MCMCflag" "nw.stats" "call"
## [21] "network" "ergm_version" "info" "MPLE_is_MLE"
## [25] "drop" "offset" "estimable" "formula"
## [29] "reference" "constraints" "obs.constraints" "estimate"
## [33] "estimate.desc" "null.lik" "mle.lik"
# the coefficients are an object that we can take, modify, and use with the simulation
m4.piNetD$coefficients
## edges nodeicov.Age nodeocov.Age absdiff.Age mutual
## -8.569733632 0.081229804 -0.004240028 -0.033707476 1.743896037
If we want to simulate a model where there is no homophily, then we
can set the coefficient to zero, then simulate from that new set of
model coefficients. In our case, we well change the effect of
absdiff.Age
to be zero.
# create a new object that is just the model coefficients
no.homophily.coef <- m4.piNetD$coefficients
# set the coefficent to zero; it is the 4th element
no.homophily.coef[4] <- 0
# Now, the effect is zero.
no.homophily.coef
## edges nodeicov.Age nodeocov.Age absdiff.Age mutual
## -8.569733632 0.081229804 -0.004240028 0.000000000 1.743896037
# now, plug the edited coefficient into a simulation
no.homophily.sim <- simulate(
m4.piNetD, # the model we are simulating from
coef = no.homophily.coef, # tell it to use the new coefficients
verbose = TRUE # tell it to show us what it is doing
)
Now that we have the simulated network, we can plot it and compare it
to our observed network. We will want to size the nodes by age. So,
let’s first run that through the rescale()
function.
# define the rescale function
rescale <- function( nchar, low, high ){
min_d <- min( nchar )
max_d <- max( nchar )
rscl <- ( ( high - low )*( nchar - min_d ) ) / ( max_d - min_d ) + low
rscl
}
# create the objects to use for the plots
PiNetDobsSize <- rescale( piNetD %v% "Age", 0.2, 3 )
PiNetDsimSize <- rescale( no.homophily.sim %v% "Age", 0.2, 3 )
# set the margins
par( mfrow=c( 1,2 ),
mar=c( 0.1, 0.5, 2, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
piNetD,
gmode = "digraph",
edge.col="grey40",
vertex.col="#f71505",
vertex.cex = PiNetDobsSize,
displayisolates = FALSE,
main = "PINS Power/Influence\n Network (Directed)"
)
# create the second plot for the simulated network
gplot(
no.homophily.sim,
gmode = "digraph",
edge.col = "grey40",
vertex.col="#03fc7b",
vertex.cex = PiNetDsimSize,
displayisolates = FALSE,
main = "Network Simulated from\n No Homophily Model"
)
How does the simulated network differ from the
piNetD
network? Can you see the difference in
the role that homophily plays?
In the model above, m4.piNetD
, we showed that there was
evidence of reciprocity in power/influence nominations. What would the
network look like if there was an aversion, such that the
probability of a tie from i to j is lower if there is
a tie from j to i?
If we want to simulate a model where there is an aversion to reciprocal ties, then we can take the coefficient for the reciprocity term and multiply it by -1, then simulate from that new set of model coefficients.
# create a new object that is just the model coefficients
anti.recip.coef <- m4.piNetD$coefficients
# set the coefficient
anti.recip.coef[5] <- anti.recip.coef[5]*-1
# now, plug the edited coefficient into a simulation
anti.recip.sim <- simulate(
m4.piNetD, # the model we are simulating from
coef = anti.recip.coef, # tell it to use the new coefficients
verbose = TRUE # tell it to show us what it is doing
)
Now that we have the simulated network, we can plot it and compare it
to our observed network. We will want to color the reciprocated ties. To
do this, we can get the matrix of reciprocated ties (using
symmetrize()
) and then pass colors to this matrix.
# create the symmetric matrix
sympiMat <- symmetrize( piNetD, rule = "strong" )
# create a matrix of colors
sympiMatCols <- sympiMat
sympiMatCols[sympiMat == 0] <- "grey80"
sympiMatCols[sympiMat == 1] <- "#3746a3" # set a color for mutual ties
# create the symmetric matrix
sympiMat2 <- symmetrize( anti.recip.sim, rule = "strong" )
# create a matrix of colors
sympiMatCols2 <- sympiMat2
sympiMatCols2[sympiMat2 == 0] <- "grey80"
sympiMatCols2[sympiMat2 == 1] <- "#3796a4" # set a color for mutual ties
Now, we pass this information to the plot.
# set the margins
par( mfrow=c( 1,2 ),
mar=c( 0.1, 0.5, 3, 0.5 ) )
# set the seed
set.seed( 605 )
# create the first plot
gplot(
piNetD,
gmode = "digraph",
edge.col=sympiMatCols,
vertex.col="#03eeff",
displayisolates = FALSE,
main = "PINS Power/Influence\n Network (Directed)"
)
# create the second plot for the simulated network
gplot(
no.homophily.sim,
gmode = "digraph",
edge.col = sympiMatCols2,
vertex.col="#03fc7b",
displayisolates = FALSE,
main = "Network Simulated from\n Aversion to\n Reciprocity Model"
)
How does the simulated network differ from the
piNetD
network? Can you see the difference in
the role that reciprocity plays?
In this lab, we continued working with ERGMs in R using the
ergm
package. Specifically, we ask the question: how do
we evaluate whether our model is a good fit to the data? This is
the notion of “goodness of fit” and we worked through how to assess
model fit using the gof()
function in the ergm
package.