In this exercise, we will explore Exponential Random Graph Models (ERGM), a statistical method used to analyze social network data. We will use the Edges and Vertices data to illustrate this concept.

Setting Up

Before we begin, we need to set the working directory and load the data. The Edges and Vertices data are read as follows:

head(edges <- read.csv("edges.csv", 
                       header=TRUE))
head(vertices <- read.csv("vertices.csv", 
                          header=TRUE))

Make sure that the igraph package is not attached, because igraph and the network package used by ergm share several function names. Then install the ergm package if it is missing, and load it.

# Install ergm only if it is not available yet, then load it.
if (!requireNamespace("ergm", quietly = TRUE)) {
  install.packages("ergm",
                   repos = "https://cran.rstudio.com/",
                   quiet = TRUE)
}
library(ergm)

Data

An edge list contains all the edges, each denoted by a pair of IDs: two IDs in the same row are connected by an edge. All EGO_IDs also occur among the ALTER_IDs and vice versa. A vertex list contains each unique ID (both EGO and ALTER) exactly once.
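These properties of the two files can be checked directly. The following is a minimal sketch on a made-up three-vertex example; the toy data frames are assumptions for illustration and are not the course data. The same checks can be run on edges and vertices once they are loaded.

```r
# Toy data for illustration only (not the course files).
toy_edges <- data.frame(EGO_ID   = c(1, 1, 2, 2, 3, 3),
                        ALTER_ID = c(2, 3, 1, 3, 1, 2))
toy_vertices <- data.frame(EGO_ID = 1:3,
                           gender = c(0, 1, 0),
                           age    = c(30, 29, 23))

# Every ego also appears as an alter, and vice versa.
same_ids <- setequal(toy_edges$EGO_ID, toy_edges$ALTER_ID)

# The vertex list holds each ID exactly once ...
unique_ids <- !any(duplicated(toy_vertices$EGO_ID))

# ... and covers every ID used in the edge list.
all_ids <- union(toy_edges$EGO_ID, toy_edges$ALTER_ID)
covered <- all(all_ids %in% toy_vertices$EGO_ID)

print(c(same_ids = same_ids, unique_ids = unique_ids, covered = covered))
```

If any of the three checks is FALSE on real data, the vertex attributes assigned later may not line up with the vertices in the network.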

Next, a network is created. Because the raw data contain loops, reading them in directly as a network object will not work. Instead, first read the data in as an igraph object, simplify it, and convert the result to a network object. The code assumes that the pacman package has been loaded, since p_load is used.

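p_load(igraph, intergraph)

# Build an undirected igraph object from the two ID columns
# (graph_from_edgelist replaces the deprecated graph.edgelist).
g <- graph_from_edgelist(as.matrix(edges[, c("EGO_ID", "ALTER_ID")]),
                         directed = FALSE)
# Drop loops and duplicate edges.
g <- igraph::simplify(g, remove.multiple = TRUE, remove.loops = TRUE)

# Convert the igraph object to a network object (intergraph).
network <- asNetwork(g)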
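To see what the simplification step removes, here is a base R sketch of the same idea on a made-up edge list: drop loops, then treat (a, b) and (b, a) as the same undirected edge and keep one copy. It illustrates the concept only and is not igraph's implementation; the toy data are an assumption.

```r
# Toy undirected edge list with a loop (3-3) and a duplicate (1-2 / 2-1).
toy <- data.frame(EGO_ID   = c(1, 2, 3, 1, 2),
                  ALTER_ID = c(2, 1, 3, 3, 3))

# 1. Remove loops: edges from a vertex to itself.
no_loops <- toy[toy$EGO_ID != toy$ALTER_ID, ]

# 2. Order the endpoints so that 1-2 and 2-1 look identical.
ordered <- data.frame(EGO_ID   = pmin(no_loops$EGO_ID, no_loops$ALTER_ID),
                      ALTER_ID = pmax(no_loops$EGO_ID, no_loops$ALTER_ID))

# 3. Keep one copy of each edge.
simple <- unique(ordered)
print(simple)
```

Of the five rows, three distinct edges remain: 1-2, 1-3 and 2-3.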

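Save the simplified edge list to a new CSV file and start over from that file. If igraph is still attached at this point, detach it or restart R first, so that the naming conflicts mentioned earlier are avoided.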

edges_new <- as.edgelist(network)
edges_new <- as.data.frame(edges_new)
names(edges_new) <- c("EGO_ID", "ALTER_ID")
write.csv(edges_new, file = "edges_new.csv", row.names = FALSE)

head(edges <- read.csv("edges_new.csv", 
                       header=TRUE))
head(vertices <- read.csv("vertices.csv", 
                          header=TRUE))

  EGO_ID ALTER_ID
1      1        2
2      1        3
3      1        4
4      1        5
5      1        7
6      1       11

  EGO_ID gender age
1      1      0  30
2      2      1  29
3      3      0  23
4      4      0  22
5      5      0  28
6      6      1  22

Network

Now, the network can be created.

(network <- network(edges[, c("EGO_ID", "ALTER_ID")],
                    directed = FALSE))

Edge-level attributes as well as vertex-level attributes are assigned.

set.edge.attribute(network, "EGO_ID", edges[,"EGO_ID"])
set.edge.attribute(network, "ALTER_ID", edges[,"ALTER_ID"])

network %v% "gender" <- vertices[,'gender']
network %v% "age" <- vertices[,'age']

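Note that %v% is the operator for getting or setting a vertex attribute. Here it is used with <- to set the attribute; the values are assigned in the order of the rows in vertices, so that order must match the order of the vertices in the network.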

To get a view of the network use summary().

summary(network, print.adj=FALSE)

Network attributes:
  vertices = 50
  directed = FALSE
  hyper = FALSE
  loops = FALSE
  multiple = FALSE
  bipartite = FALSE
 total edges = 521 
   missing edges = 0 
   non-missing edges = 521 
 density = 0.4253061 

Vertex attributes:

 age:
   integer valued attribute
   50 values

 gender:
   integer valued attribute
   50 values
  vertex.names:
   character valued attribute
   50 valid vertex names

Edge attributes:

 ALTER_ID:
   integer valued attribute
   521values

 EGO_ID:
   integer valued attribute
   521values

The summary reports 50 vertices, which matches nrow(vertices). The graph is undirected: ties have no direction, so an edge between two vertices holds both ways. There are no hyperedges (edges that connect more than two vertices at once), no loops (edges from a vertex to itself) and no multiple edges (more than one edge between the same pair of vertices). Further, the graph is not bipartite. A bipartite graph is one whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V; U and V are then each independent sets.

In addition, there are 521 edges (see nrow(edges)). The summary also shows that the density is 0.43: about 43 % of all possible ties between the 50 vertices are present. Finally, the vertex and edge attributes are listed.
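The reported density can be reproduced by hand from the vertex and edge counts. This is a small sketch using only the two numbers from the summary; the variable names are chosen for illustration.

```r
n_vertices <- 50
n_edges    <- 521

# Number of possible undirected ties without loops: n * (n - 1) / 2.
n_dyads <- choose(n_vertices, 2)

# Density: share of possible ties that are actually present.
density <- n_edges / n_dyads

print(n_dyads)             # 1225
print(round(density, 7))   # 0.4253061
```

The 1225 possible ties reappear as the degrees of freedom of the null deviance in the model summary in the ERGM section.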

ERGM

Now, let’s estimate a first ERGM. The model includes the following terms:

Triangles : This term adds one statistic to the model equal to the number of triangles in the network.
Edges : This term adds one network statistic equal to the number of edges in the network.
Absolute difference : absdiff(attrname) takes a character string giving the name of a quantitative attribute in the network’s vertex attribute list. This term adds one network statistic to the model equal to the sum of abs(attrname[i]-attrname[j]) over all edges (i,j) in the network.
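To make the three statistics concrete, the sketch below computes them by hand in base R for a made-up network of four vertices. The toy edges and ages are assumptions for illustration; this is not how ergm computes the statistics internally.

```r
# Toy undirected network on 4 vertices (illustrative, not the course data).
# Edges: 1-2, 1-3, 2-3, 3-4
toy_edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
A <- matrix(0, 4, 4)
A[toy_edges] <- 1          # fill one triangle of the adjacency matrix
A <- A + t(A)              # make it symmetric
age <- c(30, 29, 23, 22)

# edges: number of ties (each tie appears twice in a symmetric matrix).
n_edges <- sum(A) / 2

# triangle: closed walks of length 3, counted 6 times per triangle.
n_triangles <- sum(diag(A %*% A %*% A)) / 6

# absdiff("age"): sum of |age_i - age_j| over all ties.
absdiff_age <- sum(abs(age[toy_edges[, 1]] - age[toy_edges[, 2]]))

print(c(edges = n_edges, triangle = n_triangles, absdiff.age = absdiff_age))
```

The toy network has 4 edges, 1 triangle (1-2-3) and an absolute age difference sum of 1 + 7 + 6 + 1 = 15. With ergm loaded, the observed statistics of a real network can be obtained by calling summary() on the model formula, e.g. summary(network ~ edges + triangle + absdiff("age")).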

model <- ergm(network ~ edges + triangle + absdiff("age"),
              verbose = FALSE)

MCMC diagnostics are outside the scope of this course and are not covered here. The model looks as follows.

summary(model)

Call:
ergm(formula = network ~ edges + triangle + absdiff("age"), verbose = FALSE)

Monte Carlo Maximum Likelihood Results:

             Estimate Std. Error MCMC % z value Pr(>|z|)    
edges        0.006146   0.298517      0   0.021    0.984    
triangle     0.034959   0.028950      0   1.208    0.227    
absdiff.age -0.184762   0.026151      0  -7.065   <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 1698  on 1225  degrees of freedom
 Residual Deviance: 1609  on 1222  degrees of freedom
 
AIC: 1615  BIC: 1630  (Smaller is better. MC Std. Err. = 0.1687)

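The interesting parts are the coefficients, standard errors and p-values. The coefficients are on the log-odds (logit) scale and can be interpreted like logistic regression coefficients: each one gives the change in the log-odds of a tie for a one-unit change in the corresponding statistic, with the rest of the network held fixed.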

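An edges estimate of 0.006146 means that the odds of a tie between two vertices are exp(0.006146) = 1.006164925 when the other two statistics do not change, that is, for two vertices of the same age whose tie would not close any triangle. In other words, the probability of such a tie is exp(0.006146) / (1 + exp(0.006146)) = 0.5015, or about 50 %. The edges parameter can be thought of as the intercept and is not statistically significant in this case (p = 0.984).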
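The conversion from a log-odds coefficient to odds and to a probability can be reproduced in base R. The sketch below copies the edges estimate from the model summary; plogis() is the logistic function from the stats package.

```r
b_edges <- 0.006146          # edges estimate from the model summary

odds <- exp(b_edges)         # odds of a tie
prob <- plogis(b_edges)      # same as odds / (1 + odds)

print(round(odds, 6))        # 1.006165
print(round(prob, 4))        # 0.5015
```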

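The estimate for triangle (0.034959) is somewhat higher but also not statistically significant (p = 0.227). Unlike edges, triangle is not part of the intercept: the number of triangles a tie would close differs from one pair of vertices to the next, since it equals the number of neighbours the two vertices share. The coefficient is the change in the log-odds of a tie for each additional triangle that the tie would complete.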

A value of -0.184762 for the absolute difference in age means that a tie becomes less likely as the age difference grows. To be exact, each additional year of absolute age difference multiplies the odds of a tie by exp(-0.184762) = 0.8313021102, i.e. the odds are about 17 % lower (1 - 0.8313021102 = 0.1686978898), all else being equal. In terms of probabilities: for two vertices that differ in age by one year and whose tie would not close any triangle, the log-odds are 0.006146 - 0.184762 = -0.178616, which corresponds to a probability of about 45.5 %, compared with about 50 % when there is no age difference.
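The three estimates can be combined into a conditional tie probability for any pair of vertices. The sketch below is a hand calculation from the coefficients in the model summary, not an ergm function; tie_prob and its argument names are hypothetical helpers introduced for illustration.

```r
# Estimates copied from the model summary.
b_edges    <-  0.006146
b_triangle <-  0.034959
b_absdiff  <- -0.184762

# Conditional probability of a tie, given the rest of the network.
# age_diff:        absolute age difference between the two vertices
# shared_partners: number of triangles the tie would close (hypothetical name)
tie_prob <- function(age_diff, shared_partners = 0) {
  plogis(b_edges + b_triangle * shared_partners + b_absdiff * age_diff)
}

# Multiplicative change in the odds per extra year of age difference.
odds_ratio <- exp(b_absdiff)
print(round(odds_ratio, 4))   # 0.8313

# Tie probability for age differences of 0 to 5 years, no shared partners.
print(round(tie_prob(age_diff = 0:5), 3))
```

The probabilities fall steadily as the age difference grows, starting near 0.50 at a difference of 0 and dropping to about 0.455 at a difference of 1. Passing a positive shared_partners value shows how little the small, non-significant triangle estimate shifts these numbers.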

Exercise

Remember the PadelNetwork from earlier. Create an ERGM model for this network and store the model in model. Use edges and triangle.

To download the graph from the dataframe click: here¹

Assume that:

The ERGM library has been loaded.
The intergraph library has been loaded.
The graph from the dataframe g has been loaded.
