## Abstract

Many complex systems can be described as networks exhibiting inner organization as communities of nodes. The identification of communities is a key factor to understand community-based functionality. We propose a family of measures based on the weighted sum of two dissimilarity quantifiers that facilitates efficient classification of communities by tuning the quantifiers’ relative weight to the network’s particularities. Additionally, two new dissimilarities are introduced and incorporated in our analysis. The effectiveness of our approach is tested by examining the Zachary’s Karate Club Network and the *Caenorhabditis elegans* reactions network. The analysis reveals the method’s classification power as confirmed by the efficient detection of intrapathway metabolic functions in *C. elegans*.

## 1. Introduction

Characteristic inhomogeneities in real networks display order and organization, e.g. the local inhomogeneity in the distribution of links unveils the network organization in clusters of nodes. Such a feature is known as community structure [1]. In a community, nodes share some sort of similarity or common property. In a network, the communities may sustain functional meaning, e.g. in a social context clusters could be people with the same interests or buying patterns; in a biochemical network, clusters may perform specific functions such as energy production or storage. Despite the many works published on the topic, community detection in complex networks is still an active problem and has been addressed from different perspectives [2–18]. In this study, we follow a novel approach to this important problem. In particular, we propose a dissimilarity measure, *D*_{ϵ}(*d*_{1},*d*_{2}), to detect communities. Such a measure is built with the weighted combination of two quantifiers, *d*_{1} and *d*_{2}. This procedure is able to address different particularities of the network. To the best of our knowledge, there are no works dealing with weighted (dissimilarities) quantifiers in the context of community identification in complex networks.

## 2. The dissimilarity measure *D*_{ϵ}(*d*_{1},*d*_{2})

Given the single quantifiers, *d*_{1} and *d*_{2}, we define the parametric family of dissimilarity measures, *D*_{ϵ}(*d*_{1},*d*_{2}), as the weighted sum
2.1where 0≤*ϵ*≤1 is a weight parameter. In this expression, *D*_{ϵ}(*d*_{1},*d*_{2}) is a matrix of components. We will focus on a set of specific dissimilarity quantifiers: link betweenness, the Jaccard dissimilarity index, Meet/Min and two new additional quantifiers that we introduce below. Let us consider the following definitions: given a network , nodes *i* and *j*, the betweenness centrality of an edge {*i*,*j*}[19,20] is the sum of the fraction of all-pairs shortest paths that pass through {*i*,*j*}, i.e.
2.2where *σ*(*s*,*t*) is the number of shortest paths between nodes *s* and *t*, and *σ*(*s*,*t*|{*i*,*j*}) is the number of those paths passing through edge {*i*,*j*}.

If *N*(*i*) denotes the neighbourhood nodes of node *i*, one of the simplest indexes, developed to compare regional floras [21], is the Jaccard dissimilarity index between nodes *i* and *j*, which is defined by
2.3The Meet/Min dissimilarity, *mm*_{ij}, as introduced by Golberg & Roth [22] and by Ravasz *et al.* [23] is expressed as
2.4In this work, two additional quantifiers are introduced: (i) the Meet/Max, *MM*(*i*,*j*), proposed as a variation of equation (2.5) and defined by
2.5and (ii) the intensity of interaction between two nodes, , given by
2.6where *k*_{i} and *k*_{j} are the connectivity degrees of nodes *i* and *j*, respectively, and is the set of edges. Note the resemblance between and the channel capacity. quantifies information interchange between nodes, e.g. if only a link between two nodes occurs, the quantity is maximized because the flow of mutual information is not divided up by other nodes. On the contrary, if two high-degree nodes are connected the information circulating through them is shared with the additional neighbourhood nodes, producing poor effective communication between the two high-degree nodes. We found that this quantity is particularly useful to classify satellite nodes. As far as we know, has not been related to community detection yet. It seems to be one of the simplest dissimilarity measures to detect inner network communication.

Below, the following dissimilarity measures are evaluated: 2.7 2.8 2.9 2.10 2.11 2.12 2.13

To carry out our approach, hierarchical clustering on the *D*_{ϵ}(*d*_{1},*d*_{2}) matrix is developed, implementing a single-linkage renormalization or agglomerative method [24]. Consequently, the communities with the most similar nodes are recursively merged. This procedure is described in detail in figure 1. We have tested additional methods such as complete linkage [25] and average linkage [26], obtaining similar results. The performance of *D*_{ϵ}(*d*_{1},*d*_{2}) was tested by analysing the community structure of: (i) the Zachary’s Karate Club Network (ZKCN) [27] and (ii) the metabolic reactions network of *Caenorhabditis elegans*.

## 3. Zachary’s Karate Club Network

ZKCN data were downloaded from the University of California Irvine Network Data Repository. Reported by Wayne Zachary while studying information flow patterns and fission of small social groups [27], the ZKCN is a network of friendship of 34 members of a karate club at a US university in the 1970s. ZKCN data were taken after the club split into two schools following an internal dispute. Therefore, it is known that this network has at least two communities, i.e. the two schools created after fission: the one led by the club’s owner and the one led by the club’s sensei. The underlying community structure was determined by calculating *D*_{ϵ}(*d*_{1},*d*_{2}) for different weighted combinations of the quantifiers *d*_{1} and *d*_{2}. Results obtained with these combinations are displayed in figure 2*a*–*g*. A first obvious observation is that the identification of communities depends on the selection of the weight parameter *ϵ* and the particular measures *d*_{1} and *d*_{2}. The classification in communities varies in composition. However, the identification of some particular communities, such as the one coloured bright pink and located on the left of the network’s representation, does not depend on the implementation of *D*_{ϵ}(*d*_{1},*d*_{2}). However, communities such as the blue one located at the network centre, while being quite stable in composition, vary slightly depending on the implemented quantifier. Meanwhile, there is a set of communities whose composition varies tremendously with the implementation of *D*_{ϵ}(*d*_{1},*d*_{2}). The yellow, green, red and brown ones exemplify this situation. Finally, particular implementations of *D*_{ϵ}(*d*_{1},*d*_{2}) determine isolated nodes. Certainly, the family of measures *D*_{ϵ}(*d*_{1},*d*_{2}) show that, in all the analysed cases, the school led by node 1 (the club’s owner) and the school led by node 34 (the club’s sensei) may exhibit subdivisions into much more similar groups of nodes. The behaviour of the modularity values with the parameter *ϵ* for each of the evaluated combinations of *d*_{1} and *d*_{2} is shown in figure 3. In this representation, the orange line describes the best determined outcome, shown in figure 2*a*, that yields a modularity value *Q*_{max}=0.405. Such a measure was built with the weighted combination of the Jaccard and the interaction intensity given by equation (2.7) with optimum weight parameter values *ϵ*∈(0.778,0.889). In such a case, four different communities are identified in the ZKCN.

## 4. *Caenorhabditis elegans* reactions network

*Caenorhabditis elegans* is the first multicellular organism whose genome was sequenced [28] and is widely used in genetics and neuroscience research. Metabolic network data from the work of Nerima *et al.* [29] were used here. The network of reactions was obtained by the projection of the metabolic network using the methods outlined by Newman *et al.* [30]. On this reactions network, *D*_{ϵ}(*d*_{1},*d*_{2}) was calculated using the combinations given by equations (2.7)–(2.13). The best community structure was obtained with the weighted combination of the Meet/Min and betweenness centrality quantifiers given by equation (2.12) for *ϵ*=0.071 and the interval *ϵ*∈[0.117,0.147] (green line of figure 4 yielding a value of and *N*_{c}=74 communities). The network community structure obtained with can be seen in figure 5. In figure 6, the resulting network’s structure, the values of and the number of communities obtained with the rest of evaluations of *D*_{ϵ}(*d*_{1},*d*_{2}) is reported.

The functional significance of the detected structures using equation (2.12) was checked by searching in the KEGG database and manually curated for functions associated with the reactions at a given community. As expected our method classifies a set of reactions as communities developing specific metabolic pathways according to their gene ontology (GO). Remarkably, *D*_{ϵ}(*d*_{1},*d*_{2}) is able to detect subtle differences in the same GO group. This is the case for the pentose–phosphate pathway, which is commonly grouped as a single pathway but in our case we can separate it into its oxidative and non-oxidative component pathways. This is also true for the pathways of purine degradation and purine and pyrimidine metabolism.

## 5. Conclusion

Dissimilarity-based hierarchical clustering methods were conceived to account only for the quantitative or qualitative characteristics for which they were designed, being unable to access other aspects. Measures of dissimilarity were originally inspired by clustering of spatial (vectorial) data with no connections, in a context where communication or information flow were of no interest. As an example, consider the case of comparing two nodes’ shared neighbourhoods. Obviously, such a procedure would not consider details such as information flow or connectivity. While single dissimilarity approaches have already been used to detect communities [31], it may seem insufficient to capture the intrinsic features of a complex network. A single quantifier does not capture the structural complexity. In such a context, in this work we propose a novel strategy that complements a quantifier’s weaknesses with the strengths of another. As figure 2 illustrates, a particular measure *D*_{ϵ}(*d*_{1},*d*_{2}) may perform better, depending on how well their constituent quantifiers, *d*_{1} and *d*_{2}, adapt to the network’s specificity. Thus, choosing an optimal measure may depend on the topological and statistical heterogeneity of the network under analysis. Adaptively coupling two different quantifiers may improve community detection while considering more than one relevant network’s feature, e.g. local or global properties. The resulting network maximal modularity can be evaluated as a function of the parameter *ϵ*, thus tuning the weights in *D*_{ϵ}(*d*_{1},*d*_{2}) to the best structural result. Such a tuning is not an artefact, as the resulting classification shows. In fact, it is possible to observe a good classification of the different communities of ZKCN. Remarkably, for the case of *C. elegans*, the method precisely separates all the metabolic pathways and intrapathway functions. Consequently, such a detailed function identification represents an improvement with respect to the GO classification available at KEGG.

In our paper, we explored two archetypal systems used widely in the literature to test methods for community detection in complex networks. In particular, we found that our method is useful to detect communities in sparse networks such as the metabolic network discussed above. In fact we found that weighting dissimilarities perform better than several of the methods recently published [8–18]. An analysis of larger networks would offer further insights into the dissimilarity’s performance; however, such an analysis is outside the scope of this work and is left for a future publication. However, we analysed a number of different networks. We observed that when a network with more diverse characteristics is analysed the method adapts better and the modularity improves, as observed in our analysis of ZKCN () and *C. elegans* (). In particular, our analysis considered different networks whose results are not included here but will be reported elsewhere. Such analysis includes (but is not limited to) the networks of *Candida albicans* (), *Dictyostelium discoideum* (), *Cryptosporidium hominis* (), *Schizosaccharomyces pombe* () and *Entamoeba histolytica* ().

The current approach could be generalized to more than two measures, i.e. taking into account *n* different features to build a new measure
5.1where *d*_{1},…,*d*_{n} are dissimilarity quantifiers capturing most of the features of a community in a complex network and . At this point, it must be remarked that the current method is not based on the optimization of modularity but that such a function is used as a quantifier providing selection criteria for the best cutting level on the dendrogram. As a result, the method does not suffer from a resolution limit. Moreover, variations of our approach can be implemented using different modularity functions, e.g. using the recently reported surprise [32], and possibly changing the criteria on the dendrogram’s cut-off level. It seems obvious that a non-supervised version can be formulated with little effort.

Insights into the intermediation role that some nodes may play can be gained by integrating the information obtained with different implementations of *D*_{ϵ}(*d*_{1},*d*_{2}). The application of such a procedure could reveal vertices with robust memberships and vertices playing a clear intermediate role. This aspect is left for future work.

## Authors' contributions

A.J.A. and J.L.C. conceived and designed the research. A.J.A. developed the software. A.J.A. and C.E.S.-R. performed the numerical simulations. All authors participated in data analysis. A.J.A. and C.E.S.-R. prepared figures. J.L.C. coordinated the study. J.L.C. and A.J.A. wrote the manuscript. All authors gave final approval for publication.

## Competing interests

We declare we have no competing interests.

## Funding

A.J.A. acknowledges an IVIC graduate fellowship. This work was supported by grant IVIC-141.

## Footnotes

One contribution of 13 to a theme issue ‘Topics on non-equilibrium statistical mechanics and nonlinear physics (II)’.

- Accepted August 19, 2015.

- © 2015 The Author(s)