Understanding Network Clustering Metrics: Trace_e and Modularity Q

In network clustering, reliable metrics are needed to evaluate the quality of the clusters. Two such measures are trace_e and modularity Q. Both are computed from a matrix e of edge fractions between clusters, whose diagonal holds the within-cluster fractions. This exercise compares how the two metrics behave in three clustering scenarios.

Scenario 1: Optimal Cluster Structure (High Diagonal Values)

In a good cluster structure, most of the edge weight sits on the diagonal of e, so trace_e is high, indicating a high degree of intra-cluster connection. The modularity Q is also high, suggesting a good clustering solution.

e <- data.frame(Cluster1=c(0.2,0.05,0.025),
                Cluster2=c(0.05,0.3,0.1),
                Cluster3=c(0.025,0.1,0.15))
rownames(e) <- c("Cluster1","Cluster2","Cluster3")
         Cluster1 Cluster2 Cluster3
Cluster1    0.200     0.05    0.025
Cluster2    0.050     0.30    0.100
Cluster3    0.025     0.10    0.150

The sum of all elements in e should be equal to 1. Because e is symmetric, each row sum (a) should also equal the corresponding column sum (b).

a <- rowSums(e)
b <- colSums(e)
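
The two conditions above can be checked in code rather than by eye. The following is a minimal sketch, not part of the original exercise; the names total_is_one and margins_match are introduced here only for illustration.

```r
# Rebuild the Scenario 1 matrix so this check runs on its own.
e <- data.frame(Cluster1 = c(0.2, 0.05, 0.025),
                Cluster2 = c(0.05, 0.3, 0.1),
                Cluster3 = c(0.025, 0.1, 0.15))
rownames(e) <- c("Cluster1", "Cluster2", "Cluster3")

a <- rowSums(e)
b <- colSums(e)

# all.equal() compares with a small tolerance, so floating-point rounding
# does not cause a false alarm (sum(e) == 1 can be FALSE for such sums).
total_is_one  <- isTRUE(all.equal(sum(e), 1))
margins_match <- isTRUE(all.equal(unname(a), unname(b)))

print(c(total_is_one = total_is_one, margins_match = margins_match))
```

If either value is FALSE, the matrix was probably mistyped and the trace_e and Q values computed from it should not be trusted.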

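The matrix under the null model, in which edges are assigned at random and the expected fraction of edges between clusters i and j is a[i] * b[j]: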

# Expected edge fractions under the null model: rand[i, j] = a[i] * b[j]
rand <- outer(a, b)

Bad measure of clustering quality:

(trace_e <- sum(diag(as.matrix(e))))
[1] 0.65

Good measure of clustering quality:

(Q <- sum(diag(as.matrix(e))) - sum(diag(as.matrix(rand))))
[1] 0.29625
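
Since only the diagonal of the null-model matrix enters Q, the calculation reduces to trace_e minus sum(a * b). The sketch below wraps both metrics in one function; cluster_quality is a hypothetical helper name, not something defined in the original exercise.

```r
# Hypothetical helper: returns trace_e and Q for a square matrix of
# edge fractions whose elements sum to 1.
cluster_quality <- function(e) {
  e <- as.matrix(e)
  a <- rowSums(e)
  b <- colSums(e)
  trace_e  <- sum(diag(e))   # observed within-cluster fraction
  expected <- sum(a * b)     # same quantity under the null model
  c(trace_e = trace_e, Q = trace_e - expected)
}

# Scenario 1 matrix, entered row by row.
e1 <- matrix(c(0.2,   0.05, 0.025,
               0.05,  0.3,  0.1,
               0.025, 0.1,  0.15), nrow = 3, byrow = TRUE)

res1 <- cluster_quality(e1)
print(res1)  # trace_e = 0.65, Q = 0.29625, matching the values above
```

Using sum(a * b) avoids building the full null-model matrix, although the full matrix remains useful for inspecting the off-diagonal expectations.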

Scenario 2: Poor Cluster Structure (Low Diagonal Values)

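In the second scenario, most edges run between clusters, so trace_e is low. Here trace_e happens to give the right verdict, but Scenario 3 shows why it is still a bad measure of cluster quality. Q is negative, which is the result we want: the clustering is worse than a random assignment of edges.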

         Cluster1 Cluster2 Cluster3
Cluster1     0.05    0.200    0.150
Cluster2     0.20    0.025    0.100
Cluster3     0.15    0.100    0.025

Again, the sum of all elements in e should be equal to 1. Also, corresponding row- and column-sums should be the same.
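
The trace_e and Q lines for this scenario reuse the names e and rand, so both have to be rebuilt from the matrix above first. A minimal sketch, assuming the same construction as in Scenario 1 (outer(a, b) is used here as a compact way to fill rand[i, j] with a[i] * b[j]):

```r
# Scenario 2 matrix, entered column by column as in Scenario 1.
e <- data.frame(Cluster1 = c(0.05, 0.2, 0.15),
                Cluster2 = c(0.2, 0.025, 0.1),
                Cluster3 = c(0.15, 0.1, 0.025))
rownames(e) <- c("Cluster1", "Cluster2", "Cluster3")

# Row and column sums, then the null-model matrix built from them.
a <- rowSums(e)
b <- colSums(e)
rand <- outer(a, b)
```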

Bad measure of clustering quality:

(trace_e <- sum(diag(as.matrix(e))))
[1] 0.1

Good measure of clustering quality:

(Q <- sum(diag(as.matrix(e))) - sum(diag(as.matrix(rand))))
[1] -0.24125

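Scenario 3: Dominant Cluster (High Trace, Low Q)

In the third scenario, trace_e is the same (0.65) as in Scenario 1, even though almost all within-cluster linkage now sits in a single cluster. This is why trace_e is a bad measure. Q is low, so the clustering solution is judged to be poor. This is the desired result, since the clustering solution is barely better than random. A low Q does not mean that there is no within-cluster linkage at all: here a fraction of 0.65 of the edges still falls within Cluster1.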

         Cluster1 Cluster2 Cluster3
Cluster1    0.650     0.05    0.025
Cluster2    0.050     0.00    0.100
Cluster3    0.025     0.10    0.000

Again, the sum of all elements in e should be equal to 1. Also, corresponding row- and column-sums should be the same.
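
To see where the low Q comes from, it can help to split Q into one term per cluster, e[i, i] - a[i] * b[i]. The sketch below rebuilds e and rand for this scenario and computes those terms; the name contrib is introduced here only for illustration.

```r
# Scenario 3 matrix, entered column by column as in Scenario 1.
e <- data.frame(Cluster1 = c(0.65, 0.05, 0.025),
                Cluster2 = c(0.05, 0, 0.1),
                Cluster3 = c(0.025, 0.1, 0))
rownames(e) <- c("Cluster1", "Cluster2", "Cluster3")

a <- rowSums(e)
b <- colSums(e)
rand <- outer(a, b)

# Per-cluster contribution to Q: observed minus expected within-cluster fraction.
contrib <- diag(as.matrix(e)) - a * b
print(contrib)
```

Cluster1 contributes 0.124375, while Cluster2 and Cluster3 contribute -0.0225 and -0.015625. The large diagonal entry of Cluster1 is mostly what the null model already expects from a cluster that holds 72.5% of the edge ends, and the other two clusters have fewer internal edges than expected.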

Bad measure of clustering quality:

(trace_e <- sum(diag(as.matrix(e))))
[1] 0.65

Good measure of clustering quality:

(Q <- sum(diag(as.matrix(e))) - sum(diag(as.matrix(rand))))
[1] 0.08625
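
The point that a low Q can coexist with substantial within-cluster linkage can be pushed to the limit with a matrix that equals its own null model. The following is an illustrative sketch only; the cluster shares in p are hypothetical values, not data from the exercise.

```r
# Hypothetical cluster shares, chosen only for illustration.
p <- c(Cluster1 = 0.5, Cluster2 = 0.3, Cluster3 = 0.2)

# Build e so that it matches the null model exactly: e[i, j] = p[i] * p[j].
e_null <- outer(p, p)

a <- rowSums(e_null)
b <- colSums(e_null)

trace_null <- sum(diag(e_null))     # 0.25 + 0.09 + 0.04 = 0.38
Q_null <- trace_null - sum(a * b)   # zero up to floating-point rounding

print(c(trace_e = trace_null, Q = Q_null))
```

Here a fraction of 0.38 of the edges lies within clusters, yet Q is zero, because that is exactly the amount expected under random assignment.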

Multiple choice

Which of the following statements is not correct?

  1. Trace_e is the fraction of edges that connect vertices in the same cluster.
  2. Q = 0 means that there is absolutely no within-cluster linkage.
  3. Modularity (Q) can be calculated as the sum of the elements on the diagonal (within-cluster edges) minus the value of that sum expected under random assignment of edges.