Copyright 2013 by The Journal of Bone and Joint Surgery, Incorporated

The Orthopaedic Forum

Whole-Exome Sequencing: Discovering Genetic Causes of Orthopaedic Disorders

Nandina Paria, PhD, Lawson A. Copley, MD, John A. Herring, MD, Harry K.W. Kim, MD, Benjamin S. Richards, MD, Daniel J. Sucato, MD, Carol A. Wise, PhD, and Jonathan J. Rios, PhD

Sequencing technologies promising the “$1000 genome” have developed at a staggering pace, driven in part by the HapMap Project¹ (which set out to catalog common variations throughout the genome in multiple ethnic populations) and the 1000 Genomes Project² (a consortium to catalog rare variations from different ethnic populations worldwide). Commercial whole-genome sequencing platforms have yet to break the $1000 mark. However, sequencing throughput and accuracy are no longer limiting factors. Despite recent successes in patient-oriented whole-genome sequencing³⁻⁶, analyzing and interpreting these data sets represent a major challenge, particularly for the ~98% of the genome that does not encode proteins.

In 2009, an alternative to whole-genome sequencing was introduced⁷. Whole-exome sequencing uses high-throughput next-generation technologies to sequence only the portions of the genome that encode proteins (termed the exome). The goal of whole-exome sequencing is to identify sequence mutations that alter the amino acid content of a protein, potentially altering the protein’s function in a cell and leading to disease. Proof of concept for this approach was documented in a landmark paper that used whole-exome sequencing to identify the genetic cause of Freeman-Sheldon syndrome (OMIM [Online Mendelian Inheritance in Man] 193700)⁷. This study broke the technical and cost limitation barriers (compared with sequencing an entire genome) by sequencing only the protein-coding regions (~1% of the genome) and showed that a study of only a few patients was sufficient to identify a disease-causing gene.

Whole-exome sequencing has revolutionized disease-gene discovery and, supported by the 1000 Genomes Project² and other large-scale sequencing studies (e.g., the NHLBI/NHGRI [National Heart, Lung, and Blood Institute/National Human Genome Research Institute] Exome Project), it promises to identify virtually all genes causing disorders with Mendelian inheritance (i.e., those disorders caused by single genes).

In this review, we focus on single-gene Mendelian disorders rather than complex multifactorial orthopaedic disorders. Complex multifactorial disorders caused by mutations in multiple genes, each conferring modest disease risk, are best identified with use of genome-wide association studies in large patient and control populations. We describe various applications of whole-exome sequencing to identify disease-causing genes and provide examples of their use in orthopaedic disorders.

Whole-Exome Sequencing

Compared with whole-genome sequencing, which sequences the entire genome of an individual, whole-exome sequencing only sequences approximately 1% to 2% of the genome in order to identify gene mutations that alter protein sequences. Whole-exome sequencing can be divided into three phases

Disclosure: None of the authors received payments or services, either directly or indirectly (i.e., via his or her institution), from a third party in support of any aspect of this work. One or more of the authors, or his or her institution, has had a financial relationship, in the thirty-six months prior to submission of this work, with an entity in the biomedical arena that could be perceived to influence or have the potential to influence what is written in this work. No author has had any other relationships, or has engaged in any other activities, that could be perceived to influence or have the potential to influence what is written in this work. The complete Disclosures of Potential Conflicts of Interest submitted by the authors of this work are available with the online version of this article at jbjs.org.

J Bone Joint Surg Am. 2013;95:e185(1-8)

http://dx.doi.org/10.2106/JBJS.L.01620

The Journal of Bone & Joint Surgery, jbjs.org, Volume 95-A, Number 23, December 4, 2013

Figs. 1-A and 1-B Schematic illustration of whole-exome capture and sequence analysis. Fig. 1-A Genomic DNA is extracted from blood and is fragmented. The fragmented DNA corresponds to protein-coding regions (blue) and regions not encoding proteins (red). Fragments are hybridized to “probes” (shown in black), which will selectively bind to protein-coding regions of the genome. After fragments not bound to the probes are removed, the probes are separated and washed away, leaving a pool of DNA fragments corresponding to the protein-coding regions of the genome (i.e., the exome). Millions of fragments are then sequenced on a next-generation sequencing platform. Fig. 1-B The sequences of individual fragments are assembled together, using the published human genome sequence (shown at bottom) as a reference. The final assembled sequence is compared with the published reference genome to identify sequence mutations (boxed in black and shown with arrows in the middle panel). These mutations are then analyzed in all samples, and potential disease-causing genes (candidate genes) are identified.

(Figs. 1-A and 1-B): (1) exome capture and sequencing, (2) sequence analysis using bioinformatics methods, and (3) analysis of candidate genes.

Selectively targeting the protein-coding regions of the genome requires a capture procedure (Fig. 1-A). Commercially available capture methods use predesigned “probes,” which are short sequences homologous to protein-coding regions of the genome. The probes are designed to “capture” fragments of the genome that correspond to the exons of protein-coding genes (Fig. 2). The mixtures of probes provided for commercial platforms differ by design in several aspects, including (1) the length of each probe, (2) the spacing between probes, and (3) the genes and exons that are targeted by the probes. These differences in design affect the overall performance of the experiment⁸.

The procedure for exome capture is shown graphically in Figure 1-A. First, an individual’s genomic DNA is extracted from a blood (or tissue) sample and is cut into fragments that are ~300 to 600 base pairs in length. The genomic fragments are then mixed with the probes, which will selectively hybridize to their complementary target sequences (the exome). After the hybridized probes have been captured, genomic fragments that did not bind to probes (because the DNA does not encode protein) are washed away. The hybridized probes are then separated from the DNA and are washed away, leaving a sample of DNA fragments from the patient’s exome. Finally, millions of fragments are sequenced with use of next-generation sequencers, such as the SOLiD (Life Technologies), HiSeq (Illumina), or Genome Sequencer FLX (Roche) platforms.


Fig. 2 Schematic illustrating gene exons targeted for whole-exome sequencing. Genes contain protein-coding exons (black) with intervening intron sequences. Whole-exome sequencing uses short “probes” (blue) that are homologous to the gene exons. These probes selectively capture their target exons and do not capture the gene introns.

Exome Sequence Analysis

The principle of high-throughput next-generation sequencing involves sequencing a large number of DNA fragments until the combined sequence covers the entire exome (Fig. 1-B). Each base of the exome may be represented in hundreds of different fragments; thus, by sequencing millions of fragments, each position of the exome is sequenced dozens of times. The number of times that a base (or exon) is sequenced is termed the coverage. For example, assume that exon 2 of the neurofibromatosis gene (NF1) is represented in fifty fragments of the captured exome DNA. Thus, all fifty fragments containing exon 2 of NF1 may be sequenced among the millions of fragments in the experiment, and exon 2 of NF1 will have 50× sequence coverage. Sequencing the entire human exome to 50× coverage means that any position in the exome was sequenced, on average, in fifty different fragments. High sequence coverage improves the accuracy of identifying true mutations and limits the occurrence of false-positive mutation detections (sequencing errors) in the analysis. The original exome sequencing report indicated an average sequence coverage of 51×⁷. However, because the amount of data generated from sequencing machines is increasing at a staggering pace and higher sequence coverage provides better sequence accuracy, the desired average coverage is also increasing, reaching as high as 100×.

The computational challenges in analyzing whole-exome sequencing data stem from the inherent difficulty of accurately predicting where each fragment originated in the genome. Multiple software programs are available, and each uses complex statistical methods to accurately align each fragment to the published human reference genome, creating a final sequence assembly for each sample (Fig. 1-B). This assembly is used to identify sequence mutations in the sample.
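The notion of coverage described above can be illustrated with a minimal sketch. It assumes that each sequenced fragment has already been aligned to the reference and is represented as a half-open (start, end) interval; the function names and coordinates are hypothetical and are not taken from any specific analysis package.

```python
def per_base_coverage(fragments, region_start, region_end):
    """Count how many aligned fragments overlap each base of a region.

    fragments: list of (start, end) half-open intervals on the reference.
    Returns one depth value per base in [region_start, region_end).
    """
    depth = [0] * (region_end - region_start)
    for start, end in fragments:
        # Clip the fragment to the region of interest.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def mean_coverage(depth):
    """Average depth over the region (0.0 for an empty region)."""
    return sum(depth) / len(depth) if depth else 0.0


# Toy example: a 10-base "exon" at positions 100..109 and four fragments.
toy_fragments = [(95, 105), (100, 110), (103, 120), (90, 101)]
toy_depth = per_base_coverage(toy_fragments, 100, 110)

# Analogue of the NF1 example: fifty fragments that each span the exon
# give 50x coverage at every base of that exon.
fifty_fragments = [(0, 300)] * 50
exon_depth = per_base_coverage(fifty_fragments, 50, 200)

if __name__ == "__main__":
    print(toy_depth, mean_coverage(toy_depth))
    print(mean_coverage(exon_depth))
```

In practice, depth is computed by dedicated tools from the alignment files; the sketch shows only the counting logic behind a statement such as "50× coverage."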
Mutations are identified as sequence differences in the sample compared with the reference sequence, and a confidence score for each mutation is calculated (in part on the basis of the sequence coverage). Each mutation is then annotated with relevant information including its frequency in the population, its location in the gene, whether or not the mutation changes the protein sequence (“nonsynonymous” or “synonymous,” respectively), and the specific amino acid change and position in the protein.

Sequence mapping, mutation detection, and annotation often require dedicated computer servers and trained personnel, which most clinical and research laboratories lack. Consequently, sequencing centers and other fee-for-service providers typically perform all exome capture, sequencing, and

sequence analyses, returning lists of annotated mutations for each sample to the investigator. However, few sequencing centers perform candidate gene identification.

Candidate Gene Analysis

Whole-exome and whole-genome sequencing provide a unique opportunity to identify disease-causing mutations in individual patients. However, the challenge lies in determining which of the ~10,000 protein-altering nonsynonymous mutations identified in each exome is likely to cause the patient’s disease. The gene harboring such a mutation is referred to as a candidate disease gene. In general, candidate gene analysis assumes that the disease is caused by a rare protein-altering nonsynonymous mutation. A majority of variants identified in an individual’s exome are common in the population and are thus not predicted to cause a rare disease⁹. Such variants are generally referred to as polymorphisms, with most being single-nucleotide polymorphisms (SNPs). Data from the 1000 Genomes Project are used to determine the frequency of a given mutation in the population. Often, the candidate gene analysis excludes common mutations with >5% frequency in the 1000 Genomes Project data set. For a very rare disease, a stricter 1% cutoff is useful: mutations with >1% frequency are excluded because they are considered too common to cause the disease. Alternatively, all mutations that have been reported previously are excluded, leaving only novel mutations.

The goal of whole-exome sequencing analysis is to identify a single candidate gene that is likely to cause the patient’s disease. Multiple strategies exist for identifying candidate disease-causing genes from whole-exome sequencing data, depending on whether the focus is on analyzing a rare disease in unrelated patients, a disease inherited in a family, a disease caused by a de novo (new) mutation, or a somatic disease that occurs in only part of a patient’s body.
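The frequency-based filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used by any sequencing center: the Variant record, the gene names (GENE_A and so on), and the frequencies are hypothetical, and a frequency of None is taken to mean that the variant has not been reported previously.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    gene: str
    effect: str                       # "nonsynonymous" or "synonymous"
    population_freq: Optional[float]  # None = not previously reported (novel)


def filter_candidates(variants, max_freq=0.05, novel_only=False):
    """Keep rare, protein-altering variants.

    max_freq: variants with a population frequency above this are excluded.
    novel_only: if True, exclude every previously reported variant.
    """
    kept = []
    for v in variants:
        if v.effect != "nonsynonymous":
            continue  # does not change the protein sequence
        if novel_only:
            if v.population_freq is not None:
                continue  # reported before, so not novel
        elif v.population_freq is not None and v.population_freq > max_freq:
            continue  # too common to cause a rare disease
        kept.append(v)
    return kept


# Hypothetical annotated variants from one exome.
example_variants = [
    Variant("GENE_A", "nonsynonymous", 0.20),
    Variant("GENE_B", "nonsynonymous", 0.03),
    Variant("GENE_C", "nonsynonymous", 0.004),
    Variant("GENE_D", "synonymous", None),
    Variant("GENE_E", "nonsynonymous", None),
]

genes_5pct = [v.gene for v in filter_candidates(example_variants, max_freq=0.05)]
genes_1pct = [v.gene for v in filter_candidates(example_variants, max_freq=0.01)]
genes_novel = [v.gene for v in filter_candidates(example_variants, novel_only=True)]

if __name__ == "__main__":
    print(genes_5pct, genes_1pct, genes_novel)
```

Tightening the cutoff from 5% to 1%, and then to novel variants only, shortens the candidate list at each step, which is the behavior the text describes.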
The analysis method used depends on the disease (e.g., whether the disease is inherited or sporadic) and on the number of patients available for sequencing¹⁰,¹¹.

Family-based analyses utilize whole-exome sequencing of parents and offspring to determine which mutations in the patient were inherited from the mother and which were inherited from the father. Taken together, this information can be used to identify disease-causing genes for dominant and recessive diseases. Dominant diseases are inherited from an affected parent (Fig. 3-A), and a single mutation in the gene is sufficient to cause disease. Recessive disorders are caused by two inherited

Figs. 3-A, 3-B, and 3-C Pedigrees showing dominant, recessive, and de novo patterns of disease-causing mutations. Affected individuals are denoted by crosshatching, and the locations of the mutations are shown by red asterisks within the exon (thick black bar) in the corresponding schematic of the gene. Fig. 3-A A dominant disease is caused by a single mutation that is inherited from a single affected parent. Fig. 3-B A recessive disease is caused by mutations in both copies of a gene. For example, if each mutation results in a nonfunctional protein and the patient inherits mutations from each unaffected parent, the patient will have no functional protein and will present with the recessive disease. Fig. 3-C A de novo mutation is not inherited from either parent; rather, it is present for the first time in the patient. De novo mutations are very rare and are often considered likely disease-causing mutations.

mutations in the same gene. One mutation is inherited from the father and the second mutation is inherited from the mother; the co-occurrence in the patient causes the disease (Fig. 3-B). When the same mutation is inherited from both parents, the mutation is termed homozygous, whereas if two different mutations are inherited from the parents (as is the case in Figure 3-B), the mutations are termed compound heterozygous. Some diseases are caused by mutations that are not present in either parent and are therefore called de novo mutations; the mutation occurs for the first time in the patient and is sufficient to cause disease. De novo mutations are rare events that occur spontaneously (Fig. 3-C).
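The inheritance patterns described above can be expressed as a simple classification over a parent-offspring trio. The sketch below is a simplified illustration: it assumes that each variant has already passed the rarity and protein-altering filters, that genotypes are encoded as the number of mutant alleles (0, 1, or 2), and that both parents are unaffected. The variant identifiers and gene names are hypothetical, and sex chromosomes and genotyping errors are ignored.

```python
from collections import defaultdict


def classify_trio(variants):
    """Group a patient's variants by inheritance pattern.

    variants: list of dicts with keys "id", "gene", "child", "mother",
    "father", where the last three are mutant-allele counts (0, 1, or 2).
    """
    result = {"de_novo": [], "homozygous": [], "compound_het": []}
    maternal = defaultdict(list)  # heterozygous, inherited from mother only
    paternal = defaultdict(list)  # heterozygous, inherited from father only

    for v in variants:
        c, m, f = v["child"], v["mother"], v["father"]
        if c >= 1 and m == 0 and f == 0:
            # Present in the patient but in neither parent.
            result["de_novo"].append(v["id"])
        elif c == 2 and m == 1 and f == 1:
            # The same mutation inherited from both carrier parents.
            result["homozygous"].append(v["id"])
        elif c == 1 and m == 1 and f == 0:
            maternal[v["gene"]].append(v["id"])
        elif c == 1 and m == 0 and f == 1:
            paternal[v["gene"]].append(v["id"])

    # Compound heterozygous: two different mutations in the same gene,
    # one inherited from each parent.
    for gene in sorted(set(maternal) & set(paternal)):
        result["compound_het"].append((gene, maternal[gene], paternal[gene]))
    return result


# Hypothetical trio genotypes.
trio = [
    {"id": "v1", "gene": "GENE_A", "child": 1, "mother": 0, "father": 0},
    {"id": "v2", "gene": "GENE_B", "child": 2, "mother": 1, "father": 1},
    {"id": "v3", "gene": "GENE_C", "child": 1, "mother": 1, "father": 0},
    {"id": "v4", "gene": "GENE_C", "child": 1, "mother": 0, "father": 1},
    {"id": "v5", "gene": "GENE_D", "child": 1, "mother": 1, "father": 0},
]
trio_result = classify_trio(trio)

if __name__ == "__main__":
    print(trio_result)
```

A single heterozygous variant inherited from an unaffected parent (v5 in the example) is left unclassified, since on its own it fits neither the recessive nor the de novo pattern.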

Rare Disease Analysis

For very rare diseases, such as achondroplasia (frequency ~1/10,000; OMIM 100800), the exome sequence of only a few unrelated patients may be sufficient to identify the disease gene. Table I shows the effect of various filtering strategies and the effect of sequencing multiple patients to reduce the number of potential candidate genes. With more stringent filters, the number of candidate genes from the exome sequence of a single patient is substantially reduced. The majority of variants identified in a single exome (column 2 of the table) are common in the population (e.g., SNPs), and
