Applications

Editor: Mike Potel

Visual Matrix Clustering of Social Networks

Pak Chung Wong, Patrick Mackey, Harlan Foote, and Richard May
Pacific Northwest National Laboratory

In social networking applications, it’s desirable to be able to see and understand information such as the distances between individuals, groups within the population, degrees of separation, and disconnections. Given a social network such as a dataset of phone calls between users in a population over a period of time, you might wish to understand not only the direct connections among the callers but also the network’s cluster structure (where callers frequently talk to each other). The prevailing choices to graphically represent a social network (we treat the terms “network” and “graph” interchangeably in this article) are a node-link graph (see Figure 1a) and an adjacency matrix (see Figure 1b). However, neither visualization technique, in its native form, is fully equipped for social network analytics. Here, we focus on how to change the adjacency matrix from merely showing pairwise associations among network actors (or graph nodes) to depicting clusters of a social network. We use node-link graphs to supplement the discussion.

The Adjacency Matrix’s Strength and Limitation

July/August 2013

An adjacency matrix provides an instant overview of the pairwise connectivity among a social network’s actors. For example, the red straight-line pattern in Figure 2a is a starburst with four followers (nodes 2 to 5) directly connected to a single leader in the center (node 1), as Figure 2b illustrates. However, adjacency matrices lack a natural option to show a network path that connects multiple actors. As a simple example, once we see actor A is connected to actor B, we need to perform an additional step to search the matrix indices and determine whether B is also connected to actor C to establish the path A → B → C. The concept of a path in which some actors are closer to each other (the network paths are shorter) than to the others (the paths are longer) is the foundation of the graph node clustering frequently found in social network studies.

From Pairwise Connections to Cluster Analysis Visualization

We can alleviate the lack of readily available options to visualize network clustering if we look differently at the underlying network graph in the adjacency matrix. The workaround is to use every possible display pixel of the adjacency matrix to show not just the actor connections but also the pairwise shortest-path distances among the actors. By color-coding each pairwise distance according to the number of hops along the network path, we often can instantly spot clusters (or the lack of them). To demonstrate the concept, we use a graph-drawing conference contest dataset (http://vlado.fmf.uni-lj.si/pub/networks/data/gd/gd96/rules96.htm), which is a sanitized telephone connection graph, found in the public domain. This social network has 111 nodes and 386 links. Figure 3 depicts

■■ a discrete rainbow color map (see Figure 3a),
■■ a traditional adjacency matrix showing merely the connections (see Figure 3b),
■■ an enhanced adjacency matrix showing the pairwise shortest-path distances (see Figure 3c), and
■■ a corresponding node-link graph (see Figure 3d).

For simplicity, Figure 3a shows the pairwise shortest-path distances in order, with red indicating the shortest distance (1 hop) between two actors in Figure 3c. The sparsely scattered white dots in Figure 3b don’t reveal much about the network’s connectivity. The colored adjacency matrix in Figure 3c performs slightly better. Many orange blocks are present, indicating that many actors are only two hops away from many other actors.
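The pairwise shortest-path matrix described above can be sketched with one breadth-first search per actor. This is a minimal illustration, not GreenTea's implementation: the function names, the 0-based node indices, and the cap of the color index at 10 hops are assumptions made for the example.

```python
from collections import deque

def hop_distance_matrix(num_nodes, edges):
    """Return an n x n matrix of shortest-path hop counts (None if unreachable)."""
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:            # treat every link as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)

    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for source in range(num_nodes):
        dist[source][source] = 0
        queue = deque([source])
        while queue:              # breadth-first search from this source
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[source][v] is None:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist

def color_bin(hops, max_bin=10):
    """Map a hop count to a discrete color index (1..max_bin); assumed cap."""
    if hops is None or hops == 0:
        return None               # diagonal and unreachable cells get no color
    return min(hops, max_bin)

# Starburst of Figure 2 with 0-based indices: leader 0 linked to followers 1..4.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
star_dist = hop_distance_matrix(5, star_edges)
```

In this starburst, every follower is one hop from the leader and two hops from every other follower, so a colored matrix would show one red row and column and a block of two-hop cells elsewhere.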

Published by the IEEE Computer Society

0272-1716/13/$31.00 © 2013 IEEE


However, by brushing (interactively highlighting) the matrix cells in Figure 3c and linking the brushed graph nodes in Figure 3c to the corresponding nodes in Figure 3d, we learn that the index order (of the actors) along the matrix axes can be crucial in determining the adjacency matrix’s performance and success. In the visual analytics (VA) literature, brushing and linking are frequently combined for direct manipulation of data visualized using multiple techniques. William Cleveland provided an early discussion of applying brushing and linking to data visualization [1].

Node Ordering

To visualize the pairwise shortest-path distances, we developed GreenTea, a tool that applies eight node-ordering algorithms:

■■ the original data order (mainly for comparison),
■■ degree (a network actor’s fan-in and fan-out),
■■ Fiedler,
■■ Sloan,
■■ breadth-first search,
■■ depth-first search,
■■ Gibbs-King, and
■■ reverse Cuthill-McKee.
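Two of the simpler orderings in the list above, degree and breadth-first search, can be sketched as follows, together with the row-and-column permutation that any ordering implies. This is a hedged illustration rather than GreenTea's code: the function names are hypothetical, and the `bandwidth` measure is used here only as a convenient way to show the effect of reordering; it is not one of the measures the article discusses.

```python
from collections import deque

def degree_order(adj):
    """Node indices sorted by descending degree (ties broken by index)."""
    return sorted(range(len(adj)), key=lambda i: (-sum(adj[i]), i))

def bfs_order(adj, start=0):
    """Node indices in breadth-first visiting order, covering every component."""
    n = len(adj)
    seen, order = [False] * n, []
    for root in [start] + [i for i in range(n) if i != start]:
        if seen[root]:
            continue
        seen[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in range(n):
                if adj[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return order

def reorder(matrix, order):
    """Apply the same permutation to the rows and the columns."""
    return [[matrix[i][j] for j in order] for i in order]

def bandwidth(adj):
    """Largest |i - j| over nonzero entries; smaller means a tighter diagonal band."""
    return max((abs(i - j) for i, row in enumerate(adj)
                for j, x in enumerate(row) if x), default=0)

# A path graph 0-3-1-4-2 stored in a scrambled index order.
scrambled = [[0] * 5 for _ in range(5)]
for u, v in [(0, 3), (3, 1), (1, 4), (4, 2)]:
    scrambled[u][v] = scrambled[v][u] = 1

ordered = reorder(scrambled, bfs_order(scrambled, start=0))
```

Reordering changes only where the cells are drawn, not the graph itself, which is why the same network can look unstructured under one ordering and clearly banded or blocked under another.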

For a look at other tools for social network VA, see the related sidebar. We use Fiedler extensively in our demo examples (shown later in this article); for more on it, see the sidebar “An Implementation of the Fiedler Transform for Large Social Networks.” Sloan reduces a matrix’s profile and wavefront. The profile


Figure 1. The two prevailing ways to depict social networks. (a) A simple node-link graph with five nodes. (b) A corresponding 5 × 5 adjacency matrix. The red links in the node-link graph are red blocks in the adjacency matrix. Neither visualization technique, in its native form, is fully equipped for social network analytics.


Figure 2. A starburst graph is visualized by (a) an adjacency matrix and (b) a node-link graph. The red straight-line pattern in Figure 2a is a starburst with four followers (nodes 2 to 5) directly connected to a single leader in the center (node 1), as Figure 2b illustrates.


Figure 3. Using adjacency matrices for cluster analysis of a social network. (a) A rainbow color map. (b) An adjacency matrix showing the network connectivity. (c) An adjacency matrix showing the pairwise shortest-path distances. (d) A node-link graph of the network.

IEEE Computer Graphics and Applications


is the smallest number of entries that the matrix’s off-diagonal nonzero elements can contain. The wavefront is the maximum number of nonzero elements among all matrix rows. Both Gibbs-King and reverse Cuthill-McKee are breadth-first oriented but with different local-traversal priorities. As you’ll see, their visualization results sometimes look similar. Christopher Mueller and his colleagues discussed these and other ordering algorithms in more detail [2].

Figure 4a shows visualizations of the social network in Figure 3 using the eight node-ordering algorithms. These algorithms serve different analytical purposes. So, their performance and the resulting visualization’s quality will depend largely on the structures in the underlying datasets. The social network in Figure 3 contains several clusters with different pairwise shortest-path distances among the actors. The color blocks in Figure 4a represent clusters of actors separated by the same number of hops. For example, as in Figure 3c, actors in the orange blocks are separated by two hops. Figure 4b shows the correspondences between the major orange blocks in the Fiedler-ordered adjacency matrix and the starbursts in the node-link graph.

Selected Social Network Visual-Analytics Tools

Visual analytics (VA) has long been applied to graph-related data problems in science, engineering, and social domains for analysis and presentation. Many public-domain and commercial graph VA tools are available. Some of them are for general-purpose graph analytics applications; two such popular tools are Gephi (https://gephi.org) and Graphviz (www.graphviz.org). Other tools are for specific domain applications, such as Cytoscape (www.cytoscape.org) for bioinformatics. For social network analytics, four tools especially have an established reputation and a large user base in the information analytics community:

■■ IBM’s i2 Analyst’s Notebook (www-142.ibm.com/software/products/us/en/analysts-notebook),
■■ Palantir (www.palantir.com/labs/graph),
■■ Renoir (www.nsa.gov/research/tech_transfer/fact_sheets/renoir.shtml), and
■■ Starlight Visual Information System (VIS; www.futurepointsystems.com/?page=products).

i2 Analyst’s Notebook enables analysts to organize large sets of disparate information. It supports the import of structured and unstructured data from different sources for use in an analytical product. The software has several tools for information exploration but has paid particular attention to social network analysis. Users can calculate betweenness, closeness, degree, and eigenvector (hub and authority) centrality measures. i2 Analyst’s Notebook also supports conditional formatting, letting users customize social networks’ look to emphasize features (such as betweenness) or to simply clean up a complex network structure. IBM claims the tool’s user community exceeds 2,500 organizations.

Palantir offers an extensive platform for analysis of large local or cloud-based information repositories. It supports a range of data management and analytic tools. It’s an enterprise solution that allows extensive data and user access control and management. Palantir employs a dynamic ontology to arrange objects and their relationships. The object model (object, property, and link) used to describe the ontology is customizable for each organization. Palantir supports numerous network layouts and extensive access to node (object or property) information. Users can calculate network properties such as betweenness, closeness, degree, and eigenvector, and geospatial displays are also available. Palantir is an open platform with a set of APIs that let users customize features or develop applications that call the Palantir API. Figure A shows a social network investigation in Palantir, based on gathered associations among network actors. The associations fan out from a central hub; several of that hub’s satellites have an additional association to a professor at a university. The histograms on the right show the distribution of email domain names and gender (of selected objects), which are part of the information the graph was built on. The bottom panes show the investigation history.

Renoir is a Java-based general-purpose package to manipulate and visualize network structures. It’s a desktop application that reads from local files or available databases. Users can employ a host of layout techniques and network property calculations (including the basic betweenness, closeness, degree, and eigenvector). Renoir supports connectivity to other programs through remote method invocation, raw text sockets, and other techniques.

VIS handles structured and unstructured data. It includes a broad selection of tools, including text clustering; network analysis (including social network analysis); relationship recognition through hierarchical, geospatial, temporal, and link-charting visualizations; workflow management; and support for multimedia. VIS also includes Starlight Data Engineer (SDE), a system that supports data ingest. SDE’s graphical programming interface helps users customize the ingest of complex datasets. Figure B shows a screenshot of Starlight that visualizes multiple aspects of a network of commercial companies and their business profiles.

Figure A. A screenshot of Palantir showing a social network investigation based on gathered associate ties. The histograms on the right show the distribution of email domain names and gender (of selected objects), which are part of the information the graph was built on. The bottom panes show the investigation history.

Figure B. A Starlight Visual Information System visualization showing geospatial (see the upper right), topical (the middle), and social-relationship (the lower left and lower right) information from a collection of company profiles.

Applying the Approach

Two graph analytics examples illustrate our network-clustering approach. The first involves only static visualization; the second adds user interaction. In both cases, accomplishing the analytical goals is intuitive when we use both an adjacency matrix and node index ordering but would be difficult otherwise.

Time Series Visualization

This example employs a contest dataset from the 2008 IEEE Symposium on Visual Analytics Science and Technology (www.cs.umd.edu/hcil/VASTchallenge08). The time series dataset contains 19,668 telephone calls made by 400 telephones in 10 days. We want to investigate whether a static visualization lets us see a social network’s node clusters. Figure 5a shows 10 adjacency matrices, each representing one day of social contacts for the 400 phones. For presentation simplicity, we apply Fiedler to order the matrix indices. The clustering information, which is reflected by the color distribution of orange, yellow, green, and blue blocks (corresponding to 2, 3, 4, and 5 hops), is readily apparent. For example, by comparing the shape and size of the same color blocks in the circled areas, we can see changes to the connection patterns among the network actors over the 10 days. We couldn’t achieve such visual effects without reordering the matrix indices.


Figure 4. Visualizations of the social network in Figure 3. (a) The node-ordering algorithms (from left to right, top to bottom): the original input order, degree, Fiedler, Sloan, breadth-first search, depth-first search, Gibbs-King, and reverse Cuthill-McKee. (b) The correspondences among the major orange blocks in the adjacency matrix ordered by Fiedler and the node-link graph. (Orange indicates that the actors are separated by two hops.)

A logical question is, “Can node-link graphs perform as effectively as adjacency matrices in graph clustering?” Figure 5b shows four force-directed node-link graphs for the dataset’s last four days. Unlike Figure 3c, no major cluster patterns appear, and the connection patterns’ time-varying changes are subtle.
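The per-day matrices used in this example start from call records grouped by day. The sketch below shows one way that grouping could be done and how the links that differ between two days could be listed. The record layout `(day, caller, callee)`, the function names, and the toy records are hypothetical; they are not taken from the VAST 2008 dataset or from GreenTea.

```python
from collections import defaultdict

def daily_edge_sets(calls):
    """Group (day, caller, callee) records into one undirected edge set per day."""
    edges_by_day = defaultdict(set)
    for day, caller, callee in calls:
        if caller != callee:                       # ignore self-calls
            edges_by_day[day].add(frozenset((caller, callee)))
    return dict(edges_by_day)

def changed_links(edges_by_day, day_a, day_b):
    """Links present on exactly one of the two days."""
    return edges_by_day.get(day_a, set()) ^ edges_by_day.get(day_b, set())

# Hypothetical toy records for illustration only.
calls = [(1, "A", "B"), (1, "B", "C"), (1, "B", "A"),
         (2, "A", "B"), (2, "C", "D")]
by_day = daily_edge_sets(calls)
```

Each day's edge set would then feed the shortest-path computation and node ordering to produce one matrix per day, as in Figure 5a.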

Interactive Visualization

You could argue that it’s uncommon to rely completely on visual structure detection in any serious real-world VA application. So, this example includes user interaction such as data brushing and linking. The social network is a set of communication data gathered from 57 cell phones for testing purposes. The phone network contains 3,311 nodes (phone numbers contacted by the phones) and 29,242 edges.

With GreenTea, analysts can brush data in one visualization and link that data to another visualization. This brushing and linking lets them leverage the best features of both techniques. For example, if we brush the cluster at the top of the node-link graph in Figure 6a, we’ll see that this largely corresponds with the top left of the breadth-first-search view in Figure 6b. However, we’ll also notice that some nodes in this region aren’t selected, as is evident from the thin dark lines in that region in Figure 6b and the corresponding close-up in Figure 6c. By selecting this entire region in the matrix view in Figure 6d, we can see which nodes and edges our original selection didn’t include. Figure 6e shows what this would look like in the node-link graph.

With our new selection, the centers of many of the other clusters are now highlighted. This is because these centers are also reachable within one to three hops of our initial cluster. We know this must be so because all are red, orange, or yellow in our matrix selection. Without this feature, we might not have noticed how quickly we could get from our initial cluster to the centers of other clusters, including ones that appear much farther away in the traditional graph visualization.

The unique combination of visualizing a pairwise shortest-distance matrix and reordering matrix indices turns traditional adjacency matrices into a powerful VA approach that fills the gaps of both matrix-based and node-link-based graph analytics techniques.

Figure 5. Investigating whether a static visualization lets us see a social network’s node clusters. (a) Adjacency matrices for 10 consecutive days of phone calls from 400 phones. The circled regions highlight changing connection patterns among some of the network actors that are from two to four hops away. (b) Force-directed graphs for the dataset’s last four days.
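The observation that other cluster centers lie within one to three hops of the brushed cluster can be reproduced with a multi-source breadth-first search that stops at a hop limit. This is a minimal sketch under assumed names; the `within_hops` function and the toy network are hypothetical and do not describe how GreenTea implements brushing.

```python
from collections import deque

def within_hops(neighbors, selected, max_hops):
    """Return {node: hops} for nodes reachable from any selected node in <= max_hops."""
    hops = {node: 0 for node in selected}
    queue = deque(selected)
    while queue:                                  # multi-source breadth-first search
        u = queue.popleft()
        if hops[u] == max_hops:
            continue                              # do not expand past the hop limit
        for v in neighbors.get(u, ()):
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops

# Hypothetical toy network: two starbursts (centers "a" and "b") joined through "x".
neighbors = {
    "a": ["a1", "a2", "x"], "a1": ["a"], "a2": ["a"],
    "x": ["a", "b"],
    "b": ["x", "b1", "b2"], "b1": ["b"], "b2": ["b"],
}
reach = within_hops(neighbors, selected=["a1", "a2"], max_hops=3)
```

Here the brushed followers of one starburst reach the center of the other starburst in three hops, while that center's own followers fall outside the limit.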

Computational Performance

Computational performance also plays an important role in interactive VA. Here, we discuss the performance for the two datasets in Figures 4 and 5. As we mentioned before, we implemented the highly effective Fiedler algorithm [3]. For coding consistency, the other ordering algorithms were from the Boost Graph Library (www.boost.org). We performed the study on a modest Dell Precision 690 workstation with 4 Gbytes of memory. None of the algorithms were parallelized. The results (see Table 1) indicate that the algorithms performed well within the required interactive response time.

Selected Lessons Learned

We hesitate to draw any inferences or conclusions from the network cases demonstrated in this article. Instead, we highlight selected lessons learned from our analytical studies in a real-world setting.

First, despite the visualization results in Figure 4a, no ordering algorithm consistently provides the best visual results for different types of graph

An Implementation of the Fiedler Transform for Large Social Networks

In graph analytics, the Fiedler transform is frequently tied to the study of a graph partition that identifies subgraphs from a bigger graph with specific properties. Given a graph G, the graph Laplacian [1] L of G is L = D – A, where D is the degree matrix of G and A is the adjacency matrix of G. Given a graph G with n vertices, the matrix $L = (l_{i,j})_{n \times n}$ satisfies

$$
l_{i,j} =
\begin{cases}
\deg(v_i) & \text{if } i = j, \\
-1 & \text{if } i \neq j \text{ and } v_i \text{ is adjacent to } v_j, \\
0 & \text{otherwise.}
\end{cases}
$$
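As a small illustration of the definition above, the following sketch builds L = D – A as a dense list of lists for an undirected simple graph. The function name and the edge-list input format are assumptions made for the example.

```python
def graph_laplacian(num_nodes, edges):
    """Build L = D - A for an undirected simple graph as a dense list of lists."""
    lap = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        if u == v or lap[u][v] == -1:
            continue                  # skip self-loops and duplicate links
        lap[u][v] = lap[v][u] = -1    # off-diagonal: -1 for adjacent vertices
        lap[u][u] += 1                # diagonal accumulates deg(v_i)
        lap[v][v] += 1
    return lap

# Five-node starburst with 0-based indices: leader 0, followers 1..4.
star = graph_laplacian(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
```

Every row of a graph Laplacian sums to zero, which is why the constant vector is always an eigenvector with eigenvalue 0.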

The algebraic connectivity [1] of G is the second-smallest eigenvalue of L. The eigenvector associated with that connectivity is called the Fiedler vector [1] of L. The connectivity’s magnitude has implications for properties such as clustering and segmentation.

The implementation of node ordering (see the main article) can be straightforward on today’s desktop computers if the underlying graph has no more than a few hundred nodes. In that case, we can use one of the standard routines such as Lapack (Linear Algebra Package; www.netlib.org/lapack) dsyev to solve the eigenvectors. For larger graphs with tens of thousands to hundreds of thousands of nodes, we need additional heuristics to tackle the large-matrix problem. The following discussion is based on one of our previous papers [2], which included a performance comparison of the implementation.

Instead of solving the graph Laplacian all at once, we

1. generate a multiscale hierarchy with increasingly coarse graphs G0, G1, …, Gi, …, Gm (with Gm being the coarsest);
2. compute the Fiedler vector Vm of the Laplacian matrix Lm of Gm; and
3. refine and propagate Vm to all the graphs in the hierarchy until reaching the finest G0.

Our research on generating a multiscale graph hierarchy for a large graph is based partly on “A Multilevel Algorithm for Force-Directed Graph Drawing” [3], with improvements customized for small-world graphs such as social networks. To find the Fiedler vector, we modify and enhance a fast multilevel approach that Stephen Barnard and Horst Simon suggested [4].

After generating a multiscale hierarchy, we use dsyev to find the initial Vm. Because Gm is small (approximately 200 nodes in our implementation), the primitive Lapack routine causes insignificant potential delay. Using the coarsening information in the graph hierarchy, we first inject (that is, interpolate an approximate) Fiedler vector Vm–1 from Vm for the next finer graph Gm–1. Then, we construct the graph Laplacian Lm–1 for Gm–1. We then send Lm–1 and Vm–1 to a vector refinement process called Rayleigh quotient iteration (RQI) [5] to obtain an increasingly accurate Fiedler vector. Figure C shows the algorithm for finding the Fiedler vector.

Given a matrix and vector pair of Li and Vi (or just L and V for any Gi), RQI iteratively solves the sparse linear system $(L - \lambda_k I)V_{k+1} = B_k$, where $B_k$ is the normalized $V_k$, $\lambda_k = B_k^T L B_k$ is the last iteration’s eigenvalue, and $V_{k+1}$ is the refined Fiedler vector at iterative level k. Because we start with an accurate Fiedler vector and RQI converges rapidly, finding a stable answer for Gi requires only a few iterations (k = 1 to 2 in our experiments).

Input: A sparse graph G
Output: The Fiedler vector of G
Generate G0, G1, …, Gm from G
Lm ← Laplacian matrix (Gm)
V ← Fiedler vector (Lm)
for each Gi from Gm to G0
    Li ← Laplacian matrix (Gi)
    Vi,initial ← injection (V)
    Vi,final ← RQI (Vi,initial, Gi, Li)
    V ← Vi,final
Fiedler vector ← V

Figure C. The algorithm for finding the Fiedler vector. RQI stands for Rayleigh quotient iteration.
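For a small graph, the Fiedler vector and the node ordering it induces can be approximated without any multilevel machinery. The sketch below deliberately swaps in plain power iteration on a shifted Laplacian, with the constant eigenvector projected out, in place of the multilevel hierarchy, dsyev, and RQI described above; it is meant only to show what a Fiedler ordering produces, and all function names are hypothetical. It assumes a connected graph with no duplicate links.

```python
import math

def laplacian(num_nodes, edges):
    """Dense L = D - A for an undirected graph with no duplicate links."""
    lap = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        lap[u][v] = lap[v][u] = -1.0
        lap[u][u] += 1.0
        lap[v][v] += 1.0
    return lap

def fiedler_vector(lap, iterations=500):
    """Approximate the Fiedler vector of a connected graph's Laplacian."""
    n = len(lap)
    shift = 2.0 * max(lap[i][i] for i in range(n))   # upper bound on the largest eigenvalue
    vec = [float(i) for i in range(n)]                # deterministic starting guess
    for _ in range(iterations):
        mean = sum(vec) / n
        vec = [x - mean for x in vec]                 # project out the constant eigenvector
        # Multiply by (shift * I - L) so the smallest remaining eigenvalue dominates.
        vec = [shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec

def rayleigh_quotient(lap, vec):
    """Eigenvalue estimate v^T L v / v^T v."""
    n = len(lap)
    lv = [sum(lap[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(vec, lv)) / sum(x * x for x in vec)

# Two triangles (0,1,2) and (3,4,5) joined by the single link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
lap = laplacian(6, edges)
fiedler = fiedler_vector(lap)
order = sorted(range(6), key=lambda i: fiedler[i])    # Fiedler node ordering
```

Sorting the nodes by their Fiedler vector entries places each triangle in a contiguous run of indices, which is the property that makes clusters appear as blocks in the reordered matrix. Power iteration converges far more slowly than RQI and would not be a practical substitute on graphs of the size the sidebar targets.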

References

1. M. Fiedler, “Laplacian of Graphs and Algebraic Connectivity,” Combinatorics and Graph Theory, vol. 25, 1989, pp. 57–70.
2. P.C. Wong et al., “A Space-Filling Visualization Technique for Multivariate Small-World Graphs,” IEEE Trans. Visualization and Computer Graphics, vol. 18, no. 4, 2012, pp. 797–809.
3. C. Walshaw, “A Multilevel Algorithm for Force-Directed Graph Drawing,” J. Graph Algorithms and Applications, vol. 7, no. 3, 2003, pp. 253–285.
4. S.T. Barnard and H.D. Simon, “A Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems,” Concurrency: Practice and Experience, vol. 6, no. 2, Apr. 1994, pp. 101–117.
5. B.N. Parlett, The Symmetric Eigenvalue Problem, Soc. for Industrial and Applied Mathematics, 1987.

datasets. GreenTea thus provides a menu of network analysis options.

Second, algorithms such as Fiedler and Sloan consistently require more time to compute than the others when the graphs grow to tens of thousands of nodes. Surprisingly, GreenTea users generally have been willing to wait a few extra seconds for a potentially different view of the underlying network.

Third, the one-datum-per-pixel rule isn’t always a critical requirement in this visualization application. The analytical goal is to quickly visualize clusters of different sizes (the colored blocks’ sizes) and distances (the blocks’ colors), instead of the individual connections shown in Figure 3b. This gives adjacency matrices an advantage for analyzing social networks with substantially more actors than the number of matrix rows or columns.

Fourth, a social network is inherently a sparse graph, which normally doesn’t effectively utilize the limited screen pixels in social network analytics. GreenTea uses all possible screen pixels and maximizes pixel utilization.

Finally, no single node-ordering algorithm is the most effective for this application. So, perhaps the best solution to address some unsolved challenges in network-clustering analytics is a hybrid node-ordering technique combining different algorithms’ strengths into one adjacency matrix.

Figure 6. Interactive social network visualization. (a) A node-link graph with a node cluster brushed in red. (b) The corresponding adjacency matrix, showing dark horizontal and vertical lines in the upper-left orange block. (c) A close-up of that upper-left block. (d) A new node cluster brushed on the adjacency matrix that covers both the orange and yellow blocks. (e) The corresponding brushing results in the node-link graph.

Table 1. The performance (in wall-clock seconds) of seven vertex-ordering algorithms, using two phone call datasets from symposium contests.*

                          GD 1996     VAST 2008
No. of nodes              111         400
No. of links              386         19,668
Degree                    0.00908     0.08068
Fiedler                   0.00163     0.01832
Sloan                     0.00090     0.01180
Breadth-first search      0.00044     0.00730
Depth-first search        0.00043     0.00767
Gibbs-King                0.00038     0.01277
Reverse Cuthill-McKee     0.00033     0.01672

* The 1996 Symposium on Graph Drawing and the 2008 IEEE Symposium on Visual Analytics Science and Technology.

Our design turns an ordinary network visualization technique into a robust, scalable network-clustering analytics tool. We’ll continue exploring the possibility of developing a hybrid solution that maximizes the different algorithms’ strengths, as we suggested before.

Acknowledgments

We thank Mike Potel and the anonymous reviewers for their comments. The US Department of Defense, the National Visualization and Analytics Center (NVAC) at the Pacific Northwest National Laboratory, and additional US government agencies partly supported this research. NVAC was sponsored by the US Department of Homeland Security. Battelle manages the Pacific Northwest National Laboratory for the US Department of Energy under contract DE-AC05-76RL01830.

References

1. W.S. Cleveland, Visualizing Data, Hobart Press, 1993.
2. C. Mueller, B. Martin, and A. Lumsdaine, “A Comparison of Vertex Ordering Algorithms for Large Graph Visualization,” Proc. 6th Int’l Asia-Pacific Symp. Visualization (APVIS 07), IEEE CS, 2007.
3. P.C. Wong et al., “A Space-Filling Visualization Technique for Multivariate Small-World Graphs,” IEEE Trans. Visualization and Computer Graphics, vol. 18, no. 4, 2012, pp. 797–809.

Pak Chung Wong is a chief scientist and project manager at the Pacific Northwest National Laboratory, working on R&D for information analytics, visual analytics, extreme-scale data analytics, and social analytics. Contact him at [email protected].

Patrick Mackey is a scientist at the Pacific Northwest National Laboratory, working on R&D for data visualization, high-performance computing, and computer graphics. Contact him at [email protected].

Harlan Foote was a senior research scientist at the Pacific Northwest National Laboratory. He retired from his position in fall 2009 and passed away in 2010.

Richard May is the director of the National Visualization and Analytics Center at the Pacific Northwest National Laboratory, working on R&D for visual analytics, human-computer interaction, and computer graphics. Contact him at [email protected].

Contact department editor Mike Potel at potel@wildcrest.com.

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.
