In network clustering, it’s crucial to have reliable metrics to evaluate the quality of the clusters.
Two such measures are trace_e and modularity Q.
This exercise will delve into these metrics, demonstrating their behavior in different clustering scenarios.
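In the first scenario, a good cluster structure, trace_e (the sum of the diagonal of e, i.e. the fraction of links that fall within clusters) is high. The modularity Q, which subtracts from trace_e the within-cluster fraction expected under a random null model, is also high, indicating a good clustering solution.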
e <- data.frame(Cluster1=c(0.2,0.05,0.025),
Cluster2=c(0.05,0.3,0.1),
Cluster3=c(0.025,0.1,0.15))
rownames(e) <- c("Cluster1","Cluster2","Cluster3")
| | Cluster1 | Cluster2 | Cluster3 |
|---|---|---|---|
| Cluster1 | 0.200 | 0.05 | 0.025 |
| Cluster2 | 0.050 | 0.30 | 0.100 |
| Cluster3 | 0.025 | 0.10 | 0.150 |
The elements of e should sum to 1, since each entry is a fraction of all links. Because e is symmetric, each row sum (a) should equal the corresponding column sum (b).
a <- rowSums(e)
b <- colSums(e)
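The two conditions stated above can be checked in code. The following is a minimal sketch, assuming e is stored as a plain matrix rather than the data frame used in the exercise; the names total_is_one, sums_match and is_symmetric are introduced here for illustration only.

```r
# Scenario 1 matrix, repeated so that the snippet runs on its own
e <- matrix(c(0.2,   0.05, 0.025,
              0.05,  0.3,  0.1,
              0.025, 0.1,  0.15),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Cluster", 1:3), paste0("Cluster", 1:3)))

a <- rowSums(e)
b <- colSums(e)

# Compare with a tolerance, because the entries are floating-point numbers
total_is_one <- isTRUE(all.equal(sum(e), 1))   # all elements sum to 1
sums_match   <- isTRUE(all.equal(a, b))        # row sums equal column sums
is_symmetric <- isSymmetric(e)                 # e[i, j] equals e[j, i]

c(total_is_one = total_is_one, sums_match = sums_match, is_symmetric = is_symmetric)
```

all.equal() is used instead of == because sums of decimal fractions are rarely exactly equal in floating-point arithmetic.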
The matrix under the null model:
# Expected fraction of links between clusters i and j under the null model
rand <- data.frame()
for (i in seq_along(a)) {
  for (j in seq_along(b)) {
    rand[i, j] <- a[i] * b[j]
  }
}
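The same null-model matrix can be built without an explicit loop. This is a sketch of an alternative, not the exercise's own method: it assumes e is a plain matrix and uses base R's outer(), which returns the matrix of all products a[i] * b[j]. The name expected_within is introduced here for illustration.

```r
# Scenario 1 matrix, repeated so that the snippet runs on its own
e <- matrix(c(0.2,   0.05, 0.025,
              0.05,  0.3,  0.1,
              0.025, 0.1,  0.15),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Cluster", 1:3), paste0("Cluster", 1:3)))

a <- rowSums(e)
b <- colSums(e)

# rand[i, j] = a[i] * b[j]: the fraction of links expected between
# clusters i and j if links were placed at random
rand <- outer(a, b)

# Within-cluster fraction expected under the null model (diagonal only)
expected_within <- sum(diag(rand))
expected_within
```

Only the diagonal of rand enters Q, so sum(a * b) gives the same number without building the full matrix.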
Bad measure of clustering quality:
(trace_e <- sum(diag(as.matrix(e))))
[1] 0.65
Good measure of clustering quality:
(Q <- sum(diag(as.matrix(e))) - sum(diag(as.matrix(rand))))
[1] 0.29625
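To see where the value of Q comes from, it can be split into one term per cluster: the observed within-cluster fraction minus the fraction expected under the null model. The sketch below assumes e is a plain matrix; contrib is a name introduced here for illustration.

```r
# Scenario 1 matrix, repeated so that the snippet runs on its own
e <- matrix(c(0.2,   0.05, 0.025,
              0.05,  0.3,  0.1,
              0.025, 0.1,  0.15),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Cluster", 1:3), paste0("Cluster", 1:3)))

a <- rowSums(e)
b <- colSums(e)

# Per-cluster contribution: observed diagonal entry minus expected a[i] * b[i]
contrib <- diag(e) - a * b
contrib

# Q is the sum of the per-cluster contributions
Q <- sum(contrib)
Q
```

In this scenario every cluster contributes a positive amount, so each cluster has more internal links than the null model predicts.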
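In the second scenario, trace_e is low, which correctly signals a poor clustering. Scenario 3, however, shows why trace_e is still an unreliable measure of cluster quality. Q is strongly negative, which is the result we want: there are fewer within-cluster links than expected at random.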
| | Cluster1 | Cluster2 | Cluster3 |
|---|---|---|---|
| Cluster1 | 0.05 | 0.200 | 0.150 |
| Cluster2 | 0.20 | 0.025 | 0.100 |
| Cluster3 | 0.15 | 0.100 | 0.025 |
Again, the sum of all elements in e should be equal to 1. Also, corresponding row- and column-sums should be the same.
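The code that builds e for this scenario is not shown. A minimal sketch follows, assuming the matrix is entered exactly as in the table above, in the same data frame style as scenario 1, and that a, b and rand are recomputed before trace_e and Q are evaluated. Using outer() instead of the double loop is a substitution made here for brevity.

```r
# Scenario 2: most links run between clusters, few within them
e <- data.frame(Cluster1 = c(0.05, 0.2,   0.15),
                Cluster2 = c(0.2,  0.025, 0.1),
                Cluster3 = c(0.15, 0.1,   0.025))
rownames(e) <- c("Cluster1", "Cluster2", "Cluster3")

# Row and column sums must be recomputed for the new matrix
a <- rowSums(e)
b <- colSums(e)

# Null-model matrix: rand[i, j] = a[i] * b[j]
rand <- outer(a, b)
```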
Bad measure of clustering quality:
(trace_e <- sum(diag(as.matrix(e))))
[1] 0.1
Good measure of clustering quality:
(Q <- sum(diag(as.matrix(e))) - sum(diag(as.matrix(rand))))
[1] -0.24125
In the third scenario, trace_e is the same (0.65) as in scenario 1, even though nearly all within-cluster links now sit in Cluster1 and the other two clusters have none. This is why trace_e is a poor measure. Q is low, so the clustering solution is poor. This is the desired result: the solution is hardly better than random, although this does not mean that there is no within-cluster linkage at all.
| | Cluster1 | Cluster2 | Cluster3 |
|---|---|---|---|
| Cluster1 | 0.650 | 0.05 | 0.025 |
| Cluster2 | 0.050 | 0.00 | 0.100 |
| Cluster3 | 0.025 | 0.10 | 0.000 |
Again, the sum of all elements in e should be equal to 1. Also, corresponding row- and column-sums should be the same.
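As in the previous scenario, the setup code is not shown. The sketch below reconstructs it from the table above under the same assumptions (data frame input, outer() in place of the loop). It also computes the per-cluster terms, stored in the illustrative name contrib, which show that only Cluster1 has more internal links than expected.

```r
# Scenario 3: one dominant cluster, two clusters without internal links
e <- data.frame(Cluster1 = c(0.65,  0.05, 0.025),
                Cluster2 = c(0.05,  0,    0.1),
                Cluster3 = c(0.025, 0.1,  0))
rownames(e) <- c("Cluster1", "Cluster2", "Cluster3")

a <- rowSums(e)
b <- colSums(e)

# Null-model matrix: rand[i, j] = a[i] * b[j]
rand <- outer(a, b)

# Observed minus expected within-cluster fraction, per cluster
contrib <- diag(as.matrix(e)) - diag(rand)
contrib
```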
Bad measure of clustering quality:
(trace_e <- sum(diag(as.matrix(e))))
[1] 0.65
Good measure of clustering quality:
(Q <- sum(diag(as.matrix(e))) - sum(diag(as.matrix(rand))))
[1] 0.08625
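The three scenarios can be compared side by side. The following sketch wraps both measures in a helper; modularity_summary and scenarios are hypothetical names introduced here, and the matrices are copied from the three tables above.

```r
# Hypothetical helper: returns trace_e and Q for a cluster-link matrix e
modularity_summary <- function(e) {
  e <- as.matrix(e)
  observed <- sum(diag(e))                  # within-cluster fraction
  expected <- sum(rowSums(e) * colSums(e))  # same fraction under the null model
  c(trace_e = observed, Q = observed - expected)
}

scenarios <- list(
  scenario1 = matrix(c(0.2,  0.05,  0.025,
                       0.05, 0.3,   0.1,
                       0.025, 0.1,  0.15), nrow = 3, byrow = TRUE),
  scenario2 = matrix(c(0.05, 0.2,   0.15,
                       0.2,  0.025, 0.1,
                       0.15, 0.1,   0.025), nrow = 3, byrow = TRUE),
  scenario3 = matrix(c(0.65, 0.05,  0.025,
                       0.05, 0,     0.1,
                       0.025, 0.1,  0), nrow = 3, byrow = TRUE)
)

# One row per scenario, one column per measure
result <- t(sapply(scenarios, modularity_summary))
result
```

Scenarios 1 and 3 share the same trace_e but differ clearly in Q, which is the point of the exercise: trace_e cannot tell them apart, while Q can.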
Which of the following statements is not correct?