1 Vulnerability of clustering under node failure in complex networks Alan Kuhnle, Nam P. Nguyen, Thang N. Dinh, My T. Thai Department of Computer and Information Science and Engineering, University of Florida, USA

[email protected] Abstract Robustness in response to unexpected events is always desirable for real-world networks. To improve the robustness of any networked system, it is important to analyze vulnerability to external perturbation such as random failures or adversarial attacks occurring to elements of the network. In this paper, we study an emerging problem in assessing the robustness of complex networks: the vulnerability of the clustering of the network to the failure of network elements. Specifically, we identify vertices whose failures will critically damage the network by degrading its clustering, evaluated through the average clustering coefficient. This problem is important because any significant change made to the clustering, resulting from element-wise failures, could degrade network performance such as the ability for information to propagate in a social network. We formulate this vulnerability analysis as an optimization problem, prove its NP-completeness and non-monotonicity, and we offer two algorithms to identify the vertices most important to clustering. Finally, we conduct comprehensive experiments in synthesized social networks generated by various well-known models as well as traces of real social networks. The empirical results over other competitive strategies show the efficacy of our proposed algorithms. I. INTRODUCTION Network resilience to attacks and failures has been a growing concern in recent times. Robustness is perhaps one of the most desirable properties for corporeal complex networks, such as the World Wide Web, transportation networks, communication networks, biological networks and social information networks. Roughly speaking, robustness of a network evaluates how much the network’s normal function is affected in case of external perturbation, i.e., it measures the resilience of the network in response to unexpected events such as adversarial attacks and random failures (Holme et al., 2002). Complex systems that can sustain their organizational structure, functionality and responsiveness under such unexpected perturbation are considered more robust than those that fail to do so. The concept of vulnerability has generally been used to realize and characterize the lack of robustness and resilience of complex systems (Criado and Romance, 2012). In order to improve the robustness of real-world systems, it is therefore important to obtain key insights into the structural vulnerabilities of the networks representing them. A major aspect of this is to analyze and understand the effect of failure (either intentionally or at random) of individual components on the degree of clustering in the network. Clustering is a fundamental network property that has been shown to be relevant to a variety of topics. For example, consider the propagation of information through a social network, such as the spread of a rumor. A growing body of work has identified the importance of clustering to such propagation; the more clustered a network is, the easier it is for information to propagate (Barclay et al., 2013; Centola, 2010, 2011; Lü et al., 2011; Malik and Mucha, 2013). In addition, in Fig. 1, we show experimentally a strong relationship between the final spread of information and the level of clustering in the network, with higher clustering corresponding to higher levels of expected spread. The importance of clustering is not limited to social networks; in the context of air transportation networks, Ponton et al. (2013) argued that higher clustering of such a network is beneficial, as passengers for a cancelled flight can be rerouted more easily. In this work, we use average clustering coefficient (ALCC) as our definition and measure of clustering in a network. ALCC was proposed for this purpose by Watts and Strogatz (1998). The identification of elements that crucially affect the clustering of the network, as a result, is of great impact. For example, as a matter of homeland security, the critical elements for clustering in homeland communication networks should receive greater resources for protection; in complement, the identification of critical elements in a social network of adversaries could potentially limit the spread of information in such a network. However, most studies of network vulnerability in the literature focus on how the network behaves when its elements (nodes and edges) are removed based on the pair-wise connectivity (Dinh et al., 2012b), natural connectivity (Chan et al., 2014), or using centrality measures, such as degrees, betweeness (Albert et al., 2000), the geodesic length (Holme et al., 2002), eigenvector (Allesina and Pascual, 2009), etc. To our knowledge, none of the existing work has examined the average clustering coefficient from the perspective of vulnerability - as evidenced by the examples above, the damage made to the average clustering, resulted from element-wise failures, can potentially have severe effects on the functionality of the network. This drives the need for an analysis of clustering vulnerabilities in complex networks. Finding a solution for this emerging problem, nevertheless, is fundamentally yet technically challenging because (1) the behavior of ALCC is not monotonic with respect to node removal and thus can be unpredictable even in response to minor changes, and (2) given large sizes of real networks, the NP -completeness of the problem prohibits the tractable computation arXiv:1701.08787v1 [cs.SI] 30 Jan 20172 Fig. 1. Relationship between the value of ALCC and the expected number of activations under the LT and IC models, normalized by initial value. For more details and discussion of p, see Section VII. of an exact solution. In this paper, we tackle the problem and analyze the vulnerabilities of the network clustering. Particularly, we ask the question: “Given a complex network and its clustering coefficient, what are the most important vertices whose failure under attack, either intentionally or at random, will maximally degrade the network clustering?” There are many advantages of ALCC over other structural measures (Watts and Strogatz, 1998): (1) it is one of the most popular metrics for evaluating network clustering - the higher the ALCC of a network the better clustering it exhibits, (2) it implies multiple network modular properties such as small-world scale-free phenomena, small diameter and modular structure (or community structure), and (3) it is meaningful on both connected and disconnected as well as dense and sparse graphs: Sparse networks are expected to have small clustering coefficient whereas extant complex networks are found to have high clustering coefficients. Our contributions in this paper are: (1) We define the Clustering Vulnerability Assessment (CVA) on complex networks, and formulate it as an optimization problem with ALCC as the objective function. (2) We study CVA’s complexity (NP- completeness), provide rigorous proofs and vulnerability analysis on random failures and targeted attacks. To our knowledge, this is the first time the problem and the analysis are studied specifically for ALCC. (3) Given the intractability of the problem, we provide two efficient algorithms which scale to large networks to identify the worst-case scenarios of adversary attacks. Finally, (4) we conduct comprehensive experiments in both synthesized networks (generated by various well-known models) as well as real networks. The empirical results over other methods show the efficacy and scalability of our proposed algorithms. The paper is organized as follows: Section II reviews studies that are related to our work. Section III describes the notations, measure functions and the problem definition. Section IV shows the proof of NP-completeness implying the intractability of the problem. Section V and VI present our analysis of clustering behaviors on random failures and targeted attacks, respectively. In Section VII, we provide further evidence for a correlation between the extent of influence propagation and ALCC. In Section VIII, we report empirical results of our approaches in comparison with other strategies. Finally, Section IX concludes the paper. II. RELATED WORK Vulnerability assessment has attracted a large amount of attention from the network science community. Work in the literature can be divided into two categories: Measuring the robustness and manipulating the robustness of a network. In measuring the robustness, different measures and metrics have been proposed such as the graph connectivity (Dinh et al., 2012b), the diameter, relative size of largest components, and average size of the isolated cluster (Albert et al., 2000). Other work suggests using the minimum node/edge cut (Frank and Frisch, 1970) or the second smallest non-zero eigenvalue or the Laplacian matrix (Fiedler, 1973). In terms of manipulating the robustness, different strategies has been proposed such as Albert et al. (2000); Peixoto and Bornholdt (2012), or using graph percolation (Callaway et al., 2000). Other studies focus on excluding nodes by centrality measures, such as betweeness and the geodesic length (Holme et al., 2002), eigenvector (Allesina and Pascual, 2009), the shortest path between node pairs (Grubesic et al., 2008), or the total pair-wise connectivity (Dinh et al., 2012b). Veremyev et al. (2014, 2015) developed integer programming frameworks to determine the critical nodes that minimize a connectivity metric subject to a budgetary constraint. For more information on network vulnerability assessments, the reader is referred to the surveys (Chen, 2016) and (Gomes et al., 2016) and references therein. The vulnerability of the average clustering of a complex network has been a relatively unexplored area. In a related work (Nguyen et al., 2013), the authors introduced the community structure vulnerability to analyze how the communities are affected when top k vertices are excluded from the underlying graphs. They further provided different heuristic approaches to find out those critical components in modularity-based community structure. Alim et al. (2014b) suggested a method based on the generating edges of a community to find out the critical components. In a similar vein, Alim et al. (2014a) studied the problem of breaking all density-based communities in the network, proved its NP-hardness and suggested an approximation as well as heuristic solutions. These studies, while forming the basis of community-based vulnerability analysis, face a fundamental 3 TABLE I LIST OF SYMBOLS Notation Meaning N Number of vertices/nodes (N = |V |) M Number of edges/links (M = |E|) du The degree of u N (u) The set of neighbors of u T (u) The number of triangles containing u C(u), C(G) Clustering coefficients of u and G C?v (u), C?v (G) Clustering coefficients of u and G after removing node v from G G[S] The subgraph induced by S ? V in G tr(u, v) The number of triangles containing both u, v limitation due to the ambiguity of definitions of a community in a network. Our work overcomes this particular shortcoming as ALCC is a well-defined and commonly accepted concept for quantifying the clustering of a network. Ertem et al. (2016) studied the problem of how to detect groups of nodes in a social network with high clustering coefficient; however, their work does not consider the vulnerability of the average clustering coefficient of a network. The diffusion of information in a social network has been studied from many perspectives, including worm containment (Nguyen et al., 2010), viral marketing (Dinh et al., 2012a, 2013; Kempe et al., 2003; Kuhnle et al., 2017), and the detection of overlapping communities (Nguyen et al., 2011). A. Notations III. NOTATIONS AND PROBLEM DEFINITION Let G = (V, E) be an undirected graph representing a complex network where V is the set of N nodes and E is the set of edges containing M connections. For a node u ∈ V , denote by du and N (u) the degree of u and the set of u’s neighbors, respectively. For a subset of nodes S ? V , let G[S] and mS in this order denote the subgraph induced by S in G and the number of edges in this subgraph. Hereafter, the terms “vertices” and “nodes” as well as “edges” and “links” are used interchangeably. (Triangle-free graphs) A graph G is said to be triangle-free if no three vertices of G form a triangle of edges. Verifying whether a given graph G is triangle-free or not is tractable by computing the trace of A3 where A is the adjacency matrix of G. The trace is zero if and only if the graph is triangle-free. This verification can be done in polynomial time O(N ω ) for ω ≤ 2.372 with the latest matrix multiplying result (Gall, 2014). Alternatively, one can use the method of (Schank and Wagner, 2005) with time complexity O(M 3/2) to check if the graph is triangle-free. B. Clustering Measure Functions 1) Local Clustering Coefficient (LCC): Given a node u ∈ V , there are du adjacent vertices of u in G and there are du(du ? 1)/2 possible edges among all u’s neighbors. The local clustering coefficient C(u) is the probability that two random neighbors of u are connected. Equivalently, it quantifies how close the induced subgraph of neighbors is to a clique. The local clustering coefficient C(u) is defined (Watts and Strogatz, 1998) 2T (u) d 1 C(u) = du(du ? 1) u 0 otherwise where T (u) is the number of triangles containing u. It is clear that 0 ≤ C(u) ≤ 1 for any u ∈ V . For any node v /= u, let C?v (u) denote the clustering coefficient of u in G[V \{v}]. Finally, define tr(u, v) as the number of triangles containing both vertices u and v. 2) Average Clustering Coefficient (ALCC): In graph theory, the average local clustering coefficient (ALCC) C(G) of a graph G is a measure indicating how much vertices of G tend to cluster together (Watts and Strogatz, 1998). This measure is defined as the average of LCC over all vertices in the network. C(G) is defined as: 1 C(G) = , C(u). (1) N u∈V Because 0 ≤ C(u) ≤ 1 for every node u ∈ V , C(G) is normalized and can only take values in the range [0, 1] inclusively. For instance, C(G) = 0 when G is a triangle-free graph and C(G) = 1 when G is a clique or a collection of cliques. The 4 higher the clustering coefficient of G the more closely the graph locally resembles a clique. Also, we define C?v (G) = C (G[V \{v}]) . C. Problem definition We define the Clustering Structure Assessment problem (CSA) as follows Definition 1 (CSA(G, k)). Given a network G = (V, E) and a positive integer k ≤ N, find a subset S? ? V of cardinality at most k that maximizes the reduction of the clustering coefficient, i.e., S? = argmax ?C(S), S?V,|S|≤k where ?C(S) = C(G) ? C(G[V \S]). CSA problem aims to identify the most critical vertices of the network with respect to the avera