Molli: interactive visualization for exploratory protein analysis.

Biomedical Applications

Molli: Interactive Visualization for Exploratory Protein Analysis

Sara Su ■ Google
Connor Gramazio ■ Brown University
Daniela Extrum-Fernandez ■ Royal Veterinary College
Caitlin Crumm ■ University of Texas Southwestern Medical School
Lenore J. Cowen ■ Tufts University
Matt Menke ■ Google Cambridge
Megan Strait ■ Tufts University

The Molli interactive visualization system characterizes three heterogeneous types of linked information that scientists can offload from working memory to coordinated multiple views in the study of protein structures. Molli enhances the exploratory process by preserving the linkages and relations of the underlying data and provides an intuitive interface for expert and novice system users.

Many programs have been designed to view the 3D structures of protein molecules in 2D. However, certain types of domain-specific information haven’t yet been categorized systematically in a way that highlights the interface design challenges. In particular, sequence, structure, and homology are heterogeneous types of linked information that a scientist needs in working memory to manipulate and understand a protein structure. Existing protein visualization software, such as Jmol (www.jmol.org), PyMOL (www.pymol.org) [1], and Swiss-PdbViewer [2], capably provides access to the three data facets individually. But these programs currently offer no easy way for users to coordinate operations between two or more facets or to link the views of homologous 3D models. Furthermore, these popular programs require deep use of nested menus or a scripting language to perform many functions.

IEEE Computer Graphics and Applications, September/October 2012

We developed Molli, an interactive visualization program, to give biologists and nonprogrammers multiple linked views of 3D protein models, along with point-and-click tools for manipulating them. A comparison of the user experience with Molli and Jmol in typical exploratory tasks showed Molli’s utility.

Molli System Features

Molli breaks away from the console and programming environment that are the norm in this domain. Our primary contributions are improved rendering styles, increased spatial-manipulation capabilities, and simplified user tools. Molli supports direct selection and direct manipulation of protein residues, along with a coordinated multiple-view display. Figure 1 shows Molli’s user interface and available views:

■ The display panes, in which the 3D protein models are rendered.
■ The sequence pane, at the window’s bottom, which displays the amino acid sequences in one or more protein chains.
■ The residue pane, on the window’s right, which displays only one chain at a time.

Published by the IEEE Computer Society

0272-1716/12/$31.00 © 2012 IEEE

[Figure 1 callouts: display pane, residue pane, sequence pane, coordinated multiple-view functionality.]

Figure 1. The Molli interface. The display panes show proteins 2L0P and 2L5C aligned. The sequence pane at the bottom and the residue pane on the right show the multiple-view functionality through concurrent highlighting in pink. A beta sheet substructure found only in 2L5C is additionally highlighted in yellow. Users can directly manipulate all panes to adjust the orientation, highlighting, and selection of one or more proteins.

Molli’s pane interface supports brushing and linking so that a user selection on the textual sequence highlights the corresponding structure, thus providing spatial context. This approach follows the classic interface work of Andreas Buja and his colleagues [3], who demonstrated greater efficiencies with multiple-view visualizations that address different parts of a dataset, as opposed to combining all options and views in a single window. In combining abstract and spatial data aspects, Molli resembles other modern interactive visualization systems that have been applied to such diverse domains as visualizing brain connectivity in neural maps [4] or creating simultaneous 2D and 3D views of geospatial terrain data [5]. Molli is the first structure viewer we’re aware of that enables multiple homologous proteins to be aligned and then their 3D models to rotate in concert. The closest work we could find to Molli was the SRS 3D module [6], which also couples a sequence pane to the structure pane but can’t link the views of homologous 3D models.
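The brushing-and-linking behavior described above can be illustrated with a small observer-style sketch. This is not Molli’s actual code: the class names (SelectionModel, SequencePane, DisplayPane) and the example sequence are hypothetical, chosen only to show how one shared selection can drive several coordinated views.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set

# A listener receives the current set of selected residue numbers (1-based).
Listener = Callable[[Set[int]], None]


@dataclass
class SelectionModel:
    """Hypothetical shared selection state; each subscribed view is notified on change."""
    selected: Set[int] = field(default_factory=set)
    listeners: List[Listener] = field(default_factory=list)

    def subscribe(self, listener: Listener) -> None:
        self.listeners.append(listener)

    def select(self, residues: Iterable[int]) -> None:
        self.selected = set(residues)
        for notify in self.listeners:
            notify(self.selected)


class SequencePane:
    """Textual view: selected residues are upper-cased, the rest lower-cased."""

    def __init__(self, sequence: str) -> None:
        self.sequence = sequence
        self.highlighted = sequence.lower()

    def on_selection(self, residues: Set[int]) -> None:
        self.highlighted = "".join(
            aa.upper() if i in residues else aa.lower()
            for i, aa in enumerate(self.sequence, start=1)
        )


class DisplayPane:
    """Stand-in for the 3D view: it only records which residues to highlight."""

    def __init__(self) -> None:
        self.highlighted_residues: List[int] = []

    def on_selection(self, residues: Set[int]) -> None:
        self.highlighted_residues = sorted(residues)


# Made-up eight-residue sequence, for illustration only.
model = SelectionModel()
seq = SequencePane("MKTAYIAK")
disp = DisplayPane()
model.subscribe(seq.on_selection)
model.subscribe(disp.on_selection)

model.select(range(3, 6))          # brush residues 3 through 5 in any pane
print(seq.highlighted)             # mkTAYiak
print(disp.highlighted_residues)   # [3, 4, 5]
```

Keeping the selection in one shared model, rather than in each pane, is one simple way to guarantee that every view shows the same highlighted residues.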

Sequence, Structure, and Homology

We now describe the underlying biology that motivates this functionality. A protein molecule is associated with a linearly ordered string of amino acid residues; this is the 1D representation of the protein structure. Each amino acid consists of backbone atoms, which are the same for every amino acid type, and side-chain atoms, which differ for each amino acid type. The linear protein chain folds into a minimum-energy conformation, a complicated 3D shape with atomic coordinates for typically between 1,000 and 10,000 atoms.
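As a rough illustration of how the 1D sequence and the 3D atoms relate, the sketch below models a chain as an ordered list of residues, each holding backbone and side-chain atoms. The types and the coordinates are hypothetical and are not taken from Molli; the backbone atom names N, CA, C, and O follow the usual Protein Data Bank naming.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Heavy backbone atoms shared by every amino acid type (PDB atom names).
BACKBONE_ATOM_NAMES = {"N", "CA", "C", "O"}


@dataclass(frozen=True)
class Atom:
    name: str
    xyz: Tuple[float, float, float]


@dataclass
class Residue:
    index: int        # 1-based position in the chain
    code: str         # one-letter amino acid code
    atoms: List[Atom]

    def backbone(self) -> List[Atom]:
        return [a for a in self.atoms if a.name in BACKBONE_ATOM_NAMES]

    def side_chain(self) -> List[Atom]:
        return [a for a in self.atoms if a.name not in BACKBONE_ATOM_NAMES]


def sequence(chain: List[Residue]) -> str:
    """The 1D representation: one letter per residue, in chain order."""
    return "".join(r.code for r in chain)


def make_backbone(offset: float) -> List[Atom]:
    # Placeholder coordinates for illustration; not a real structure.
    return [
        Atom("N", (offset, 0.0, 0.0)),
        Atom("CA", (offset + 1.5, 0.0, 0.0)),
        Atom("C", (offset + 2.5, 1.0, 0.0)),
        Atom("O", (offset + 2.5, 2.2, 0.0)),
    ]


# Glycine has no side-chain heavy atoms; alanine has a single one (CB).
chain = [
    Residue(1, "G", make_backbone(0.0)),
    Residue(2, "A", make_backbone(3.8) + [Atom("CB", (5.3, -1.5, 0.0))]),
]

print(sequence(chain))                       # GA
print([a.name for a in chain[1].side_chain()])  # ['CB']
```

A renderer that draws only backbone() for each residue and one that draws every atom would correspond to the two ends of the range of views a biologist switches between.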

Figure 2. The 2L5C protein rendered in Molli with an alpha-helix partially highlighted in yellow. Protein substructures are in grayscale for illustrative clarity.

It’s useful for biologists to shift between several views of a protein structure. For instance, they might start from a cartoon rendering, such as in Figure 2. It shows only the approximate position of the backbone atoms in the 3D structure (omitting the side-chain atoms), with annotated regions of what are called secondary structures. These structures consist of particular hydrogen bonding patterns of the backbone rendered as different types of straight or twisted ribbon. A biologist interested in potential docking sites on the protein surface might switch to a space-fill rendering, which represents the 3D coordinates of every atom on both the backbone and the side chains. In addition, most programs allow for the coloring and selection (or hiding) of different portions of the protein. Molli supports all these modes of viewing the protein sequence and structure.

However, the biggest conceptual leap in Molli’s design is incorporating homology into the visualization of multiple proteins (see Figure 3). (For an audiovisual presentation of the system at work, see the Web addendum at http://youtu.be/iMWwal3v5VE.)

Figure 3. Molli’s multiple-alignment function on four proteins: (a) 1HT1A, (b) 1HT1B, (c) 1PLUA, and (d) 2YF6A. Concurrent highlighting (in pink) identifies an alpha helix and a portion of a beta sheet common to all four proteins. When the user directly selects substructures in one protein, Molli simultaneously highlights the corresponding substructures of the other proteins, using the Multiple Alignment with Translations and Twists (MATT) algorithm for multiple structure alignment.

Homology is the term biologists use to describe proteins that share a common structure or function because they are believed to have evolved from a common ancestor. Structural homologs are proteins with 3D structures that can be superimposed. The superimposition corresponds to an alignment of the protein sequences; corresponding regions in one protein are thought to be related to corresponding regions in its homologs. Biologists are interested in looking at homologous proteins side by side.

Molli uses the Multiple Alignment with Translations and Twists (MATT) algorithm for multiple structure alignment to identify corresponding parts of homologous proteins [7]. Furthermore, users can view the parts together either in the same window or in multiple windows and rotate them simultaneously into the same orientation and view. Our main additional interface improvements are better rendering (see Figure 4), fewer and shallower menus, direct selection of model subparts, and a more tightly coupled interface between the 1D sequence and 3D structural view of the model.

Visualizing structural homologs helps scientists study protein active sites and protein docking at the interface between two protein chains that form a structural complex. For example, it might be experimentally determined that particular atoms (for example, amino acid residues 43, 14, and 235), all proximate in 3D, form one particular protein’s active site. A biologist might wish to select and highlight these atoms and automatically have the corresponding atoms highlighted on the structural homolog. It’s predicted computationally that the highlighted atoms will correspond to the new protein’s active site. In drug design, Molli makes it possible to visualize how a small molecule (drug) fits into the “pocket” of its protein target and to contrast the fit with the similarly but not identically shaped pocket of its structural homologs. In this way, a protein’s 3D structure is related not just to its 1D sequence but also to a library of other 3D protein structures sharing an approximate 3D shape. The ability to visualize in tandem mimics the state of biological knowledge of structure and function associated with that set of 3D structures.
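The idea of selecting residues in one protein and having the corresponding residues highlighted in a homolog can be sketched as a lookup through an alignment. The example below does not implement MATT; it assumes, purely for illustration, that an alignment is already available as two gapped strings of equal length, and the function names and sequences are hypothetical.

```python
from typing import Dict, Iterable, List

GAP = "-"


def residue_map(aligned_a: str, aligned_b: str) -> Dict[int, int]:
    """Map 1-based residue numbers of protein A to those of protein B.

    Only alignment columns where both proteins have a residue produce an entry.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mapping: Dict[int, int] = {}
    i = j = 0  # residues consumed so far in A and in B
    for a, b in zip(aligned_a, aligned_b):
        if a != GAP:
            i += 1
        if b != GAP:
            j += 1
        if a != GAP and b != GAP:
            mapping[i] = j
    return mapping


def corresponding(selection: Iterable[int], mapping: Dict[int, int]) -> List[int]:
    """Residues of B to highlight for a selection made on A (unaligned ones are skipped)."""
    return sorted(mapping[r] for r in selection if r in mapping)


# Made-up alignment: A has two extra residues at the start, B two at the end.
aligned_a = "MKTAYIAK--"
aligned_b = "--TAYLAKGS"
mapping = residue_map(aligned_a, aligned_b)

print(corresponding([2, 4, 6], mapping))  # [2, 4]
```

Residue 2 of protein A falls in a gap of protein B, so it has no counterpart and is dropped from the linked highlight; how a real viewer should signal such unaligned residues is a design choice this sketch leaves open.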

Evaluating Single and Coordinated Views

We conducted a study comparing typical exploratory tasks in Molli and the Jmol visualization program. We chose Jmol for comparison rather than other popular tools, PyMOL and Swiss-PdbViewer, because Jmol is the easiest for a new user to learn. Although PyMOL and Swiss-PdbViewer can create better-rendered 2D pictures than Jmol, their interfaces are much more awkward and difficult to learn. Still, even Jmol’s user interface involves deep menus and a console interface, and we wanted to understand how its user interface affects comprehension of complex structures and performance on tasks such as search, selection, and annotation, as compared to Molli. Although there is extensive research studying text-console versus mouse efficiency and the utility of interactive 2D representations of 3D scenes [8], little research specifically addresses biologists’ 3D-software usage patterns.

Figure 4. A close-up of the protein database’s 2L5C protein in (a) Jmol and (b) Molli. The two systems differ in their rendering of primitives. Molli maintains a smooth transition between substructures, whereas Jmol employs distinct objects.

Participants

We recruited 18 participants (7 males, 11 females) via advertisements within our institution and in the local scientific community. We compensated all participants for their time with gift cards. We prescreened them for familiarity with the relevant concepts at the level of a standard undergraduate biology curriculum, specifically the identification of alpha helices and beta sheets in a cartoon protein model. We further divided the group into two classes, biologists and computational biologists, depending on whether they had any experience writing computer programs.

Trials

We designed representative directed trials to test

■ residue and structure selection,
■ substructure coloring,
■ residue hiding and substructure occlusion,
■ view rendering, and
■ camera manipulations.

The trials reflect real tasks performed by biologists. Users had to complete nine trials comparable between the Jmol and Molli interfaces:

■ Six trials required users to change the color of a range of residues or protein substructures (for example, coloring residues 13 through 23 in the open protein 1r4y yellow).
■ Two rendering trials required changing the display style of residues (for example, from a cartoon to a spacefill style).
■ One trial required hiding part of the protein model (for example, restricting the view to protein chain A).

Participants also had to perform three camera manipulations for one of these tasks, for example, “Color the upper leftmost alpha helix orange and restrict the view to the top half of the protein.” All trials required the selection of a range of residues followed by a state change. A final trial required a more complex selection of a set of residues.

Equipment

We conducted all sessions in a closed laboratory. Participants worked at a standard desktop workstation with a 15-inch display. We installed both the Molli and Jmol software packages on the workstation as well as a specified collection of protein files downloaded from the Protein Data Bank (www.pdb.org) [9]. Participants operated the system using a keyboard and mouse. Figure 5 shows the Jmol interface (see Figure 1 for Molli’s interface). We used TechSmith’s Camtasia Studio screen-recording software (www.techsmith.com/camtasia.asp) to capture audio and video of each session for offline analysis. In addition, the observing researchers used note-taking software to annotate events during each session. The software timestamped the notes for synchronization with the recorded video.

Figure 5. The Jmol interface. The context menu is at the top of the window, the display pane is in the middle, and the text console is at the bottom.

Protocol

After obtaining the participants’ informed consent, we administered a pretest and survey to identify their backgrounds in biology and bioinformatics and experience with protein visualization software. Additionally, we asked participants to self-assess their sense of direction, spatial reasoning, visualization in 3D, and willingness to adopt new technology for comparative measures.

Comparative Evaluation of Jmol and Molli

Next, the participants completed a 10-minute hands-on tutorial on Jmol. After the tutorial, they each received a printed set of 12 Jmol trials, with a three-minute maximum allotted for each trial. The participants then completed a similar 10-minute tutorial on Molli and received a new set of 12 trials; among these 12 trials, 9 were matched as equivalent but not identical to those for Jmol. Example equivalent tasks are “Color residues 13–23 yellow” (Jmol, task 1) and “Color residues 7–17 brown” (Molli, task 1). The remaining three Molli trials evaluated features unique to Molli involving multiple structure alignment. We recorded time to completion (TTC) for comparison across conditions and user categories for the nine matched trials. The zero marker was set as soon as the participant opened the protein required for a trial, and the finish marker was set upon successful completion of that trial. We marked data that resulted from a mistrial, such as when a user didn’t complete the task as directed.

Qualitative Evaluation of Molli’s Multiple Alignment

We assigned participants a set of three additional trials using Molli after they completed the comparative tasks using Jmol and Molli. We specifically selected the proteins 2Q4M and 1I7E (of the superfamily Tubby C-Terminal) for these trials to qualitatively evaluate Molli’s multiple-alignment function. Participants opened the proteins in multiview (see Figure 1). The trials covered selection, coloring, hiding, and changing the rendering style.

Undirected Exploration of Molli

Each participant completed an undirected 10-minute session to explore the Molli software freely, using one or more of the preinstalled proteins. We collected qualitative feedback during the exploration using the talk-aloud method. After the time was exhausted, we gave participants a poststudy survey to rate Molli on qualitative measures using a seven-point Likert scale and to provide additional feedback with open-ended questions. The survey assessed the software’s learnability, simplicity, visibility, user control and freedom, error handling, efficiency, and graphic design as well as impressions of the multiple-alignment feature’s usability. The total time per user was approximately one hour.

Considerations

We acknowledge that the ordering of conditions (Jmol first, Molli second) could affect performance. Participants might perform better on the second condition because their comfort with protein interaction and interface commands has increased. Conversely, they might also perform worse owing to fatigue or waning interest. We assigned the Jmol condition first to prevent exposure to Molli’s additional features, such as multiple alignment and the capability for selecting an entire alpha helix directly.

Results

We analyzed the performance times collected from the 18 participant sessions on nine trials, for a total of 162 trials. Of the 162 trials assessed, we classified five as mistrials (incorrectly completed) and excluded them from comparative analysis. We used the TTC quantitative measure to compare the two interfaces. We analyzed both the TTC and the percent time to completion (PTC), which normalizes the TTC according to the total time each participant spent to complete all tasks. This second measure highlighted that the less technically experienced users (who had slower completion times across the board) had the most significant gains with Molli. We compared how the user type affected performance and found additional significant differences favoring both systems. We also analyzed the qualitative prestudy and poststudy surveys.
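The two measures can be sketched in a few lines. The article does not spell out the exact normalization, so the code below makes an assumption: PTC is a trial’s TTC expressed as a percentage of the participant’s total time over all of their valid trials in both conditions, with mistrials excluded. The Trial record, its field names, and the timings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Trial:
    participant: str
    condition: str      # "jmol" or "molli"
    task: int
    start_s: float      # zero marker: the protein for the trial was opened
    finish_s: float     # finish marker: the trial was completed successfully
    mistrial: bool = False


def ttc(trial: Trial) -> Optional[float]:
    """Time to completion in seconds, or None for a mistrial."""
    if trial.mistrial:
        return None
    return trial.finish_s - trial.start_s


def ptc_by_trial(trials: List[Trial]) -> Dict[Tuple[str, str, int], float]:
    """Percent time to completion per (participant, condition, task).

    Assumption: the denominator is the participant's summed TTC over all
    valid trials in both conditions.
    """
    totals: Dict[str, float] = {}
    for t in trials:
        duration = ttc(t)
        if duration is not None:
            totals[t.participant] = totals.get(t.participant, 0.0) + duration
    result: Dict[Tuple[str, str, int], float] = {}
    for t in trials:
        duration = ttc(t)
        if duration is not None:
            key = (t.participant, t.condition, t.task)
            result[key] = 100.0 * duration / totals[t.participant]
    return result


# Made-up timings for one participant; the Jmol attempt at task 2 is a mistrial.
trials = [
    Trial("p1", "jmol", 1, 0.0, 60.0),
    Trial("p1", "molli", 1, 0.0, 20.0),
    Trial("p1", "jmol", 2, 0.0, 90.0, mistrial=True),
    Trial("p1", "molli", 2, 0.0, 20.0),
]
ptc = ptc_by_trial(trials)
print(ptc[("p1", "jmol", 1)] - ptc[("p1", "molli", 1)])  # 40.0
```

Under this convention a positive Jmol-minus-Molli difference means the task took a smaller share of the participant’s time in Molli, which matches the sign convention stated for Figures 6 and 7.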

How Condition Affects Performance

We compared the PTC of trials performed on Jmol to the PTC of the corresponding trials on Molli. We ran a repeated-measures t-test on PTC at the 5 percent significance level. Performance on trials 2 (simple), 5 (complex), and 9 (very complex) proved statistically significantly faster using the Molli interface (p = 0.0004, p = 0.0344, and p < 0.0001, respectively). Additionally, performance on trial 6 was weakly significantly faster for Molli (p = 0.0677, just above the 5 percent threshold). These tasks include manipulating the camera position and view and selecting and coloring residues for proteins of all three complexity levels. These results suggest that Molli better supports common protein manipulations regardless of structure complexity.

Figure 6 shows the average difference in PTC between the Jmol and Molli conditions. A positive percentage indicates faster performance for the given task using Molli, and a negative percentage indicates a faster performance using Jmol. Asterisks indicate tasks with statistically significant differences in PTC performance measures.
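For readers who want to see what a repeated-measures (paired) t-test on one task involves, here is a minimal standard-library sketch. It computes only the t statistic and the degrees of freedom; the p-value would come from Student’s t distribution, for example via scipy.stats.ttest_rel if SciPy is available. The function name and the PTC values are hypothetical and are not the study’s data.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(jmol_ptc: Sequence[float], molli_ptc: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-participant PTC values.

    Each position holds the same participant's PTC on the same task in the
    two conditions. A positive t means Jmol took the larger share of time.
    """
    if len(jmol_ptc) != len(molli_ptc):
        raise ValueError("both conditions need one value per participant")
    n = len(jmol_ptc)
    if n < 2:
        raise ValueError("need at least two participants")
    diffs = [j - m for j, m in zip(jmol_ptc, molli_ptc)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Made-up PTC values for four participants on one task.
jmol = [12.0, 15.0, 11.0, 14.0]
molli = [9.0, 11.0, 10.0, 10.0]
t_stat, dof = paired_t(jmol, molli)
print(round(t_stat, 3), dof)
```

Because each participant serves as their own control, the test works on the per-participant differences rather than on the two groups of values separately.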

Figure 6. The average difference between conditions in percent time to completion (y-axis) by task (x-axis). Each of the nine tasks was performed by 18 distinct subjects (N = 18). Positive values indicate better performance using Molli; negative values indicate better performance using Jmol. Standard deviation is shown with error bars; asterisks indicate statistical significance.

How User Type Affects Performance

On the basis of our categorization of participants as biologists or computational biologists, we performed a between-subject PTC comparison for trials performed in Jmol and the corresponding trials in Molli, for each user group (N = 9). Figure 7 shows the average difference in PTC for the nine directed trials. We ran a repeated-measures t-test on PTC at the 5 percent significance level. For the computational biologist user group, performance on trials 2 and 9 was statistically significantly faster in the Molli condition (p = 0.0221 and p = 0.0186), consistent with the aggregate analyses. Additionally, performance on trial 7 was weakly significantly faster in the Molli condition (p = 0.0557). For the biologist user group, performance on trials 2 and 9 was statistically significantly faster in the Molli condition (p = 0.0096 and p < 0.0001), also consistent with the aggregate analyses. Performance on trial 6, weakly significant in the aggregate analyses, proved significantly faster in the Molli condition (p = 0.0388). In contrast, trial 4 (not significant in the aggregate analyses) proved statistically significantly faster in the Jmol condition (p = 0.0499). Trial 5 (significant in the aggregate analyses) wasn’t significant for either user group.

Figure 7. The average difference between conditions in percent time to completion (y-axis) by task (x-axis), grouped by user type, computational biologist (dark, N = 9) and biologist (light, N = 9). Positive values indicate better performance using Molli; negative values indicate better performance using Jmol. Standard deviation is shown with error bars; asterisks indicate statistical significance.

Poststudy Survey

Because some of Molli’s most novel features (for example, its ability to explore homologous proteins simultaneously) don’t exist in Jmol, we couldn’t evaluate them in a comparative study, so we asked participants to give qualitative feedback about the Molli system. The feedback was generally positive (see Table 1). Participants found Molli easy to learn, simple to use, and visually pleasing. Three categories showed individual instances of negative or neutral views: simplicity, error handling, and efficiency. Several participants asked for an undo button; we suspect the lack of this feature accounts for the lower marks in error handling. (We’ve since added some limited undo capability.) In addition, users might have either been unaware of or misrecollected efficient correcting methods presented in the guided tutorial.

Table 1. Average post-task survey results. Eighteen participants rated Molli on seven visual measures using a seven-point Likert scale, with seven being the most favorable.

Measure          Rating   Standard deviation
Learnability     6.059    0.658
Simplicity       6.000    0.935
Visibility       6.471    0.717
User control     6.235    0.903
Error handling   5.706    1.403
Efficiency       6.000    1.274
Graphic design   6.412    0.712

Many users found the capability to select an entire alpha helix or beta sheet useful but suggested augmenting it with an option to select all the alpha helices or beta sheets in a model. Nonprogrammer biologists uniformly preferred Molli’s direct-selection methods, but some computational biologists grew comfortable with the console commands in Jmol and advocated for a console in Molli.

Several users had some free interaction time with Molli after completing the directed trials. We encouraged them to open any favorite proteins or choose from a protein database list of files and experiment with Molli features. On average, computational biologists spent less time on the directed trials than biologists, so most of the information we collected about the undirected, exploratory usage patterns came from the computational biologists.

Molli is freely available under the General Public License (GPL), version 2.0, and can be downloaded from http://code.google.com/p/molli. We hope this article sparks additional visualization development using Molli as a platform. Its support of exploratory tasks, in particular, shows its potential as an educational tool. Future research includes developing analysis tools such as automatic protein rotation, bond length measurement, and a protein-protein interaction mode.

Molli builds on a long history of protein visualization programs, and we don’t expect it to completely replace the popular Jmol or PyMOL programs any time soon. Both of these tools have large, enthusiastic user communities. In addition, Jmol offers an applet form that can be easily embedded into other webpages, and PyMOL supports animated movies. Furthermore, much auxiliary software has been written to help PyMOL interface with other software packages used by x-ray crystallographers.

However, the application of computer visualization to bioinformatics has transformed the workflows of computational biologists, crystallographers, and biophysicists. Molli introduces an alternative visualization system with a shallow learning curve for biologists and other users who aren’t computer specialists. Molli’s interface lets users offload linked information from working memory to coordinated multiple views. The demonstrated power of its design leads us to believe its principles can be extended to develop linked information interfaces in related fields such as genomics.

Acknowledgments

We thank the Tufts Summer Scholars program for supporting Connor Gramazio and the Computing Research Association Committee on the Status of Women’s Distributed Research Experiences for Undergraduates program for supporting Dani Extrum-Fernandez and Caitlin Crumm. We also thank Rob Jacob, Remco Chang, and the researchers and students who participated in our study.

References

1. W.L. Delano, The PyMOL Molecular Graphics System, 2002; http://pymol.sourceforge.net/overview/index.htm.
2. N. Guex and M. Peitsch, “Swiss-Model and the Swiss-PdbViewer: An Environment for Comparative Protein Modeling,” Electrophoresis, vol. 18, no. 15, 1997, pp. 2714–2723.
3. A. Buja et al., “Interactive Data Visualization Using Focusing and Linking,” Proc. 2nd Conf. Visualization (VIS 91), IEEE CS, 1991, pp. 156–163.
4. R. Jianu, C. Demiralp, and D.H. Laidlaw, “Exploring Brain Connectivity with Two-Dimensional Neural Maps,” IEEE Trans. Visualization and Computer Graphics, vol. 18, no. 6, 2012, pp. 978–987.
5. S. Brooks and J.L. Whalley, “Multilayer Hybrid Visualizations to Support 3D GIS,” Computers, Environment and Urban Systems, vol. 32, no. 4, 2008, pp. 278–292.
6. S. O’Donoghue et al., “The SRS 3D Module: Integrating Structures, Sequences and Features,” Bioinformatics, vol. 20, no. 15, 2004, pp. 2476–2478.
7. M. Menke, B. Berger, and L. Cowen, “MATT: Local Flexibility Aids Protein Multiple Structure Alignment,” Public Library of Science Computational Biology, vol. 4, no. 1, 2008; http://dx.plos.org/10.1371%2Fjournal.pcbi.0040010.
8. T. Sando, M. Tory, and P. Irani, “Effects of Animation, User-Controlled Interactions, and Multiple Static Views in Understanding 3D Structures,” Proc. Applied Perception in Graphics and Visualization, ACM, 2009, pp. 69–76.
9. H. Berman et al., “The Protein Data Bank,” Nucleic Acids Research, vol. 28, no. 1, 2000, pp. 235–242.

Sara Su is a product manager at Google in Mountain View, California. Her research interests include computer graphics, interactive visualization, and human-computer interaction. Su has a PhD in electrical engineering and computer science from MIT. Contact her at [email protected].

Connor Gramazio is a PhD student in computer science at Brown University on a US National Science Foundation graduate fellowship. His research interests are in human-computer interaction and visualization. Gramazio has a BS in computer science from Tufts University. Contact him at [email protected].

Daniela Extrum-Fernandez is a student in the Royal Veterinary College in London. Her research interests are in computer graphics and visualization. Extrum-Fernandez has a BA in biology and computer science from Mills College. Contact her at [email protected].

Caitlin Crumm is a medical student at the University of Texas Southwestern Medical School. Her research interests include medical imaging and visualization and trauma medicine. Crumm has a BS in computer science from the University of Texas, Austin. Contact her at caitlin.crumm@utsouthwestern.edu.

Lenore J. Cowen is a professor in Tufts University’s Computer Science Department. Her research interests include algorithms and computational structural biology. Cowen has a PhD in mathematics from MIT. She’s a member of the ACM, American Mathematical Society, Association for Women in Mathematics, International Society in Computational Biology, and Society for Industrial and Applied Mathematics. Contact her at [email protected].

Matt Menke is a senior software engineer at Google Cambridge, where he is working on the Chrome browser. He was the principal architect of the Molli protein visualization system, and his research interests include network performance, bioinformatics, and protein visualization. Menke has a PhD in applied math and computer science from MIT. Contact him at [email protected].

Megan Strait is a graduate student in Tufts University’s Computer Science Department. Her research interests include human-computer interaction, robotics, and cognitive psychology. Strait has an MS in computer science from Tufts University. Contact her at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org. IEEE Computer Graphics and Applications
