In this exercise, we will explore Exponential Random Graph Models (ERGMs), a statistical method used to analyze social network data. We will use the Edges and Vertices data to illustrate this concept.
Before we begin, we need to set the working directory and load the data. The Edges and Vertices data are read as follows:
head(edges <- read.csv("edges.csv",
header=TRUE))
head(vertices <- read.csv("vertices.csv",
header=TRUE))
Make sure that the igraph package is not attached, so that naming conflicts with the network and ergm packages are avoided. Then install (if necessary) and load the ergm package.
if (!requireNamespace("ergm", quietly = TRUE)) {
  install.packages("ergm",
                   repos = "https://cran.rstudio.com/",
                   quiet = TRUE)
}
library(ergm)
An edge list contains all the edges, denoted by two IDs. If two IDs appear alongside each other it means they have an edge. All EGO_IDs are in ALTER_IDs and vice versa. A vertex list contains each unique ID (both EGO and ALTER) once.
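The structure described above can be checked directly in R. The following is a minimal sketch on a made-up toy edge list and vertex list (the data frames toy_edges and toy_vertices are illustrative assumptions, not the course data). It verifies that the EGO and ALTER columns contain the same IDs and that every ID appears exactly once in the vertex list.

```r
# Toy data (hypothetical, for illustration only)
toy_edges <- data.frame(EGO_ID   = c(1, 1, 2, 3),
                        ALTER_ID = c(2, 3, 3, 1))
toy_vertices <- data.frame(EGO_ID = c(1, 2, 3),
                           gender = c(0, 1, 0),
                           age    = c(30, 29, 23))

# Do the EGO and ALTER columns contain the same set of IDs?
same_ids <- setequal(toy_edges$EGO_ID, toy_edges$ALTER_ID)

# All IDs that occur on either side of an edge
edge_ids <- unique(c(toy_edges$EGO_ID, toy_edges$ALTER_ID))

# Every ID in the edge list should appear in the vertex list ...
all_ids_known <- all(edge_ids %in% toy_vertices$EGO_ID)
# ... and the vertex list should contain each ID only once
ids_unique <- !any(duplicated(toy_vertices$EGO_ID))

c(same_ids = same_ids, all_ids_known = all_ids_known, ids_unique = ids_unique)
```

If one of these checks fails on real data, the vertex attributes assigned later may not line up with the vertices of the network.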
Next, a network is created. Because the raw edge list contains loops, reading it in directly as a network object will not work. Instead, first read it in as an igraph object, simplify it, and then convert it to a network object (igraph is needed only for this step). Note that the code uses p_load, so it is assumed that the pacman package has already been loaded.
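p_load(igraph, intergraph)
# Build an undirected igraph object from the two ID columns
# (graph_from_edgelist() is the current name of the deprecated graph.edgelist())
g <- graph_from_edgelist(as.matrix(edges[, c("EGO_ID", "ALTER_ID")]),
                         directed = FALSE)
# Drop loops and collapse multiple edges between the same pair of vertices
g <- igraph::simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
# Convert the igraph object to a network object (function from intergraph)
network <- asNetwork(g)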
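To make the simplification step concrete, here is a base-R sketch of what removing loops and multiple edges amounts to for an undirected edge list. It is not how igraph implements simplify(); it only mimics the result on a small made-up data frame (raw_edges is an assumption for illustration).

```r
# Toy undirected edge list with one loop (3-3) and one duplicate (1-2 and 2-1)
raw_edges <- data.frame(EGO_ID   = c(1, 2, 3, 1, 2),
                        ALTER_ID = c(2, 1, 3, 3, 3))

# 1. Remove loops: edges whose two endpoints are the same vertex
no_loops <- raw_edges[raw_edges$EGO_ID != raw_edges$ALTER_ID, ]

# 2. Remove multiple edges: in an undirected graph 1-2 and 2-1 are the same
#    edge, so sort the endpoints within each row before dropping duplicates
lo <- pmin(no_loops$EGO_ID, no_loops$ALTER_ID)
hi <- pmax(no_loops$EGO_ID, no_loops$ALTER_ID)
simple_edges <- unique(data.frame(EGO_ID = lo, ALTER_ID = hi))

simple_edges
```

In this toy example, three distinct edges remain: 1-2, 1-3 and 2-3.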
Save the simplified edge list to a new CSV file and start again from that file.
edges_new <- as.edgelist(network)
edges_new <- as.data.frame(edges_new)
names(edges_new) <- c("EGO_ID", "ALTER_ID")
write.csv(edges_new, row.names = FALSE,file = "edges_new.csv")
head(edges <- read.csv("edges_new.csv",
header=TRUE))
head(vertices <- read.csv("vertices.csv",
header=TRUE))
  EGO_ID ALTER_ID
1      1        2
2      1        3
3      1        4
4      1        5
5      1        7
6      1       11
  EGO_ID gender age
1      1      0  30
2      2      1  29
3      3      0  23
4      4      0  22
5      5      0  28
6      6      1  22
Now, the network can be created.
(network <- network(edges[, c("EGO_ID", "ALTER_ID")],
                    directed = FALSE))
Edge-level attributes as well as vertex-level attributes are assigned.
set.edge.attribute(network, "EGO_ID", edges[,"EGO_ID"])
set.edge.attribute(network, "ALTER_ID", edges[,"ALTER_ID"])
network %v% "gender" <- vertices[,'gender']
network %v% "age" <- vertices[,'age']
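Note that %v% is the vertex-attribute operator: on the right-hand side of an expression it extracts a vertex attribute, and on the left-hand side of an assignment, as here, it sets one.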
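Assigning vertices[,'gender'] directly relies on the rows of the vertex list being in the same order as the vertices of the network. The following base-R sketch shows one way the rows could be aligned by ID first. The vector vertex_order is a hypothetical stand-in for the network's vertex names (in the exercise it would come from network %v% "vertex.names"), and the toy data is made up.

```r
# Toy vertex list (hypothetical values)
toy_vertices <- data.frame(EGO_ID = c(1, 2, 3, 4),
                           gender = c(0, 1, 0, 1),
                           age    = c(30, 29, 23, 22))

# Hypothetical vertex order of a network object; vertex names are
# stored as character values, so the IDs are compared as strings
vertex_order <- c("3", "1", "4", "2")

# For each vertex, find the matching row of the vertex list
idx <- match(vertex_order, as.character(toy_vertices$EGO_ID))

# Attribute vectors in network order, ready to be assigned with %v%
gender_aligned <- toy_vertices$gender[idx]
age_aligned    <- toy_vertices$age[idx]

data.frame(vertex = vertex_order, gender = gender_aligned, age = age_aligned)
```

If the two orders already agree, idx is simply 1, 2, 3, ... and the alignment changes nothing.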
To get an overview of the network, use summary().
summary(network, print.adj=FALSE)
Network attributes:
vertices = 50
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges = 521
missing edges = 0
non-missing edges = 521
density = 0.4253061
Vertex attributes:
age:
integer valued attribute
50 values
gender:
integer valued attribute
50 values
vertex.names:
character valued attribute
50 valid vertex names
Edge attributes:
ALTER_ID:
integer valued attribute
521values
EGO_ID:
integer valued attribute
521values
The summary shows that the network has 50 vertices (compare nrow(vertices)). It is an undirected graph (ties have no direction). There are no hyperedges (edges that connect more than two vertices at once). There are no loops (ties from a vertex to itself), and there are no multiple edges (more than one edge between the same pair of vertices). Further, it is not a bipartite graph, that is, a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V.
In addition, there are 521 edges (compare nrow(edges)). The density is 0.43, meaning that 521 of the 50 × 49 / 2 = 1225 possible undirected ties are present. Finally, the vertex and edge attributes are listed.
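The density reported by summary() can be reproduced by hand. A minimal sketch, using only the vertex and edge counts from the output above (the variable names are chosen here for illustration):

```r
n_vertices <- 50   # from the summary output
n_edges    <- 521  # from the summary output

# Number of possible ties in an undirected network without loops
n_dyads <- n_vertices * (n_vertices - 1) / 2

# Density: share of possible ties that are actually present
density <- n_edges / n_dyads

n_dyads            # 1225
round(density, 7)  # 0.4253061
```

The number of dyads, 1225, is also the number of observations the ERGM works with, since each pair of vertices either has a tie or not.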
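Now, let’s estimate a first ERGM. The model formula combines several terms: the number of edges, the number of triangles, and the absolute difference in age between connected vertices.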
model <-ergm(network ~ edges + triangle + absdiff("age"),
verbose=FALSE)
MCMC diagnostics are outside the scope of this course and are not covered here. The model summary looks as follows.
summary(model)
Call:
ergm(formula = network ~ edges + triangle + absdiff("age"), verbose = FALSE)
Monte Carlo Maximum Likelihood Results:
             Estimate Std. Error MCMC % z value Pr(>|z|)
edges        0.006146   0.298517      0   0.021    0.984
triangle     0.034959   0.028950      0   1.208    0.227
absdiff.age -0.184762   0.026151      0  -7.065   <1e-04 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 1698 on 1225 degrees of freedom
Residual Deviance: 1609 on 1222 degrees of freedom
AIC: 1615 BIC: 1630 (Smaller is better. MC Std. Err. = 0.1687)
The interesting parts are the coefficients, standard errors (SE) and p-values. The coefficients are on the log-odds (logit) scale and can be interpreted much like logistic regression coefficients, conditional on the rest of the network.
An edges coefficient of 0.006146 means that, with the other terms held at zero (a tie that closes no triangle, between two vertices of the same age), the odds that two vertices are connected by a tie are exp(0.006146) = 1.006164925. In other words, the probability of such a tie is exp(0.006146) / (1 + exp(0.006146)) ≈ 50 %. The edges parameter can be thought of as the intercept and is not statistically significant in this case.
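The conversion from log-odds to odds and probability can be reproduced in a few lines of base R. A minimal sketch, with the edges estimate copied from the summary output (the variable names are chosen here for illustration):

```r
b_edges <- 0.006146  # edges estimate from the summary output

odds_edges <- exp(b_edges)                   # odds of a tie
prob_edges <- odds_edges / (1 + odds_edges)  # probability of a tie

# plogis() is base R's inverse logit and gives the same probability
prob_check <- plogis(b_edges)

odds_edges
prob_edges
```

Because the coefficient is so close to zero, the odds are close to 1 and the probability is close to one half.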
The coefficient for triangle is somewhat larger but also not statistically significant. Like edges, triangle is a structural network statistic rather than a vertex covariate: each triangle that a tie would complete changes the log-odds of that tie by 0.034959, conditional on the rest of the network.
A value of -0.184762 for the absolute difference in age means that a tie becomes less likely as the age difference between two vertices grows. To be exact, if the absolute age difference increases by 1, the odds of a tie are multiplied by exp(-0.184762) = 0.8313021102, i.e. the odds are 1 - 0.8313021102 ≈ 17 % lower, all else being equal. In terms of probabilities: starting from even odds (roughly the baseline here), an increase of one in the age difference leaves a less than 50-50 chance (about 45 %) of seeing a tie.
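The effect of the age difference can be explored numerically. The sketch below uses the estimates from the summary output and, as a simplifying assumption made only for illustration, ignores the triangle term when computing tie probabilities; the variable names and the chosen age differences are hypothetical.

```r
b_edges   <- 0.006146   # edges estimate from the summary output
b_absdiff <- -0.184762  # absdiff.age estimate from the summary output

# Odds ratio for an increase of one in the absolute age difference
or_absdiff <- exp(b_absdiff)

# Probability of a tie after such an increase, starting from even odds
p_from_even <- plogis(b_absdiff)

# Tie probability for several age differences, ignoring the triangle term
age_diff <- c(0, 1, 5, 10)
p_tie <- plogis(b_edges + b_absdiff * age_diff)

round(or_absdiff, 4)   # 0.8313
round(p_from_even, 2)  # 0.45
data.frame(age_diff = age_diff, p_tie = round(p_tie, 3))
```

The probabilities shrink steadily as the age difference grows, which is the pattern the negative coefficient describes.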
Remember the PadelNetwork from earlier. Create an ERGM for this network and store it in model. Use the terms edges and triangle.
To download the graph from the dataframe click: here1
Assume that g has been loaded.