Functional optimization of gene clusters by combinatorial design and assembly.

Articles


© 2014 Nature America, Inc. All rights reserved.

Michael J Smanski1, Swapnil Bhatia2, Dehua Zhao1, YongJin Park1, Lauren B A Woodruff1,3, Georgia Giannoukos3, Dawn Ciulla3, Michele Busby3, Johnathan Calderon1, Robert Nicol3, D Benjamin Gordon1,3, Douglas Densmore2 & Christopher A Voigt1,3

Large microbial gene clusters encode useful functions, including energy utilization and natural product biosynthesis, but genetic manipulation of such systems is slow, difficult and complicated by complex regulation. We exploit the modularity of a refactored Klebsiella oxytoca nitrogen fixation (nif) gene cluster (16 genes, 103 parts) to build genetic permutations that could not be achieved by starting from the wild-type cluster. Constraint-based combinatorial design and DNA assembly are used to build libraries of radically different cluster architectures by varying part choice, gene order, gene orientation and operon occupancy. We construct 84 variants of the nifUSVWZM operon, 145 variants of the nifHDKY operon, 155 variants of the nifHDKYENJ operon and 122 variants of the complete 16-gene pathway. The performance and behavior of these variants are characterized by nitrogenase assay and strand-specific RNA sequencing (RNA-seq), and the results are incorporated into subsequent design cycles. We have produced a fully synthetic cluster that recovers 57% of wild-type activity. Our approach allows the performance of genetic parts to be quantified simultaneously in hundreds of genetic contexts. This parallelized design-build-test-learn cycle, which can access previously unattainable regions of genetic space, should provide a useful, fast tool for genetic optimization and hypothesis testing.

Biological systems are able to build intricate materials and chemicals that require precise dynamic and spatial control over many genes. However, engineering large systems that are composed of many genetic parts is not straightforward.
First, software is focused on combining parts at the DNA sequence level, which can make the design process time consuming. Second, it takes months to prototype a design. Although DNA synthesis is routine for individual genes1 and, indeed, has been used to build entire 1-Mb genomes2, it remains too costly to simultaneously synthesize many large alternative designs. In practice, it is feasible to build only a few designs for testing, so finding a design that works can take considerable time. This is further complicated when working with large, naturally occurring systems that are the product of evolutionary forces and have redundant and often overlapping regulatory elements3,4. Even in well-characterized systems, not all of the regulation or regulatory parts (for example, promoters) are known3. Starting with such a system, design choices cannot be cleanly implemented without triggering a web of secondary effects. For example, a desired change in gene order may be tolerable in itself, but if there are promoters internal to the ORFs, this could create transcriptional interference. Overlapping genetic elements also thwart part substitutions; if genes are translationally coupled, this complicates codon optimization or the substitution of a ribosome binding site (RBS), because these changes will have a secondary impact on neighboring genes.

Refactoring is an engineering approach to clean up a natural genetic system5. The goal is to create a fully defined and modular system by systematically eliminating native regulation and replacing it with well-characterized parts6. First, refactoring removes complex, multigene pathways from the control of the host and places them under the control of synthetic genetic sensors and circuits. This eliminates the influence of the many environmental and cellular inputs that can affect a system and enables it to be controlled with an inducible switch or as the output of a circuit7. Second, refactoring facilitates the large-scale part swapping and engineering that is required for species transfer. Each species speaks a different regulatory language, and refactoring simplifies the conversion of the code from one to another (codon optimizing each gene, converting ribosome binding sites and so on).

Here, we start with a refactored version of the nif gene cluster that encodes the enzymes necessary for nitrogenase activity from Klebsiella oxytoca6. Nitrogen fixation is a key process in agriculture, involving the conversion of atmospheric N2 to ammonia, and since the 1970s it has been a goal in biotechnology to move this function into cereal crops to reduce the use of chemically derived fertilizer8. In Klebsiella, the native cluster contains 20 genes encoded in 7 operons, altogether comprising 25 kb and encoding regulatory proteins, the nitrogenase enzyme, chaperones, electron transport proteins and the biosynthetic pathway for the iron-molybdenum cofactor (FeMo-co) and other metalloclusters9. Under appropriate environmental conditions, the cluster is highly expressed (nifH alone accounts for 10% of cell weight), and activity is balanced to avoid H2 generation as

1Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. 2Electrical and Computer Engineering Department, Boston University, Boston, Massachusetts, USA. 3Broad Technology Labs, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. Correspondence should be addressed to C.A.V. ([email protected]). Received 14 March; accepted 7 October; published online 24 November 2014; doi:10.1038/nbt.3063

nature biotechnology advance online publication


a byproduct10,11. Thus, the challenge is not to increase activity in Klebsiella but to transfer the nif cluster to a new host, such as a plant chloroplast or root-associated microorganism12. The transfer process requires large-scale part substitutions, to enable the cluster to function in the new host, and subsequent optimization, but the native nif system is not organized such that this can be done easily.

We previously refactored the nif gene cluster to reduce the genetic barriers to transfer6. This involved the removal of all noncoding DNA as well as nonessential and regulatory genes. The 16 remaining genes were ‘codon randomized’ to eliminate regulation internal to the open reading frame. These genes were organized into artificial operons and placed under the control of T7 RNA polymerase (RNAP) promoters and terminators, and synthetic RBSs were selected to optimize the expression levels. Finally, we constructed a ‘controller’ plasmid that contains synthetic sensors and circuits, whose output is an attenuated T7* RNAP7. T7 RNAP and promoters were selected because these are transcriptionally orthogonal from the host and can be used in many organisms. Thus, the cluster can be transferred and optimized using the same promoter set. When we were refactoring the nif cluster, two operons proved particularly difficult to optimize: nifHDKY (encoding the nitrogenase subunits) and nifUSVWZM (encoding the metal cofactor biosynthetic pathways)6.

Starting with this refactored cluster, here we applied a highly parallelized process of design, construction, analysis and learning (Fig. 1a). Combinatorial design is a field of mathematics that studies the arrangement of elements of a finite set into patterns according to specified constraints13. As applied to synthetic biology, it allows a design to be articulated as a set of parts and formalized constraints between parts.
This approach enables one design to capture many potential DNA constructs, averting the need to design each one individually by manually combining DNA parts. To realize the designs, we applied a hierarchal assembly process that builds the clusters in multiple cloning steps with techniques suited for each step (Fig. 1b). This approach allows a degree of architectural diversity that is otherwise difficult to obtain. Previous work involving the combinatorial assembly of parts has focused on the construction of libraries by substituting a set of parts at a specific location, such as a set of promoters of different strength14. Finally, we screened the library for function and applied RNA-seq to every member of the library to debug failed systems and systematically learn how design choices influence the outcome. We first used the pipeline to build libraries for the nifUSVWZM and nifHDKY operons (each 0.7 Mb), from which we identified optimized operons whose architectures differ substantially from that of the wild type. We confirmed the plasticity of the cluster as a whole by building large libraries in which the order, operon occupancy and orientation were changed for all 16 genes (2.0 Mb total). We transferred the refactored cluster from Klebsiella to Escherichia coli, and improved activity by building a library (0.9 Mb total) with simultaneous RBS substitutions across all 16 genes—a genetic manipulation that would be impossible with the native cluster. We used RNA-seq to extract new constraints that can be incorporated into the next round of design. This pipeline is applicable to genetic engineering challenges beyond nitrogen fixation. For example, this design-build-test-learn pipeline could be used to facilitate transfer of clusters encoding useful natural products, from the producer organism to an industrially relevant microbe. 
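To make the idea of constraint-based combinatorial design concrete, the following is a minimal Python sketch, not the Eugene specification used in this work. It enumerates architectures for the six nifUSVWZM genes under two constraints: nifU, nifS and nifV must precede nifW, nifZ and nifM, and a design must contain at least one promoter. The promoter names, the `None` placeholder for "no promoter", the orientation symbols and all function names are illustrative assumptions; RBS, spacer and terminator choices are ignored.

```python
from itertools import permutations, product

# Illustrative part sets (assumptions for this sketch, not the actual library).
GENES = ("U", "S", "V", "W", "Z", "M")
FIRST_HALF = frozenset({"U", "S", "V"})
PROMOTERS = ("P1", "P2", "P3", None)  # None means no promoter before the gene
ORIENTATIONS = ("+", "-")             # forward or reverse strand


def allowed_gene_orders():
    """Return gene orders in which nifU, nifS, nifV all precede nifW, nifZ, nifM."""
    return [order for order in permutations(GENES)
            if set(order[:3]) == FIRST_HALF]


def designs_for_order(order):
    """Yield one design per promoter/orientation assignment for a gene order.

    A design is a list of (gene, promoter, orientation) tuples. Designs with
    no promoter at all are rejected (a constraint assumed for this sketch).
    """
    for promoters in product(PROMOTERS, repeat=len(order)):
        if all(p is None for p in promoters):
            continue
        for strands in product(ORIENTATIONS, repeat=len(order)):
            yield list(zip(order, promoters, strands))
```

Under these assumptions there are 3! x 3! = 36 allowed gene orders and (4^6 - 1) x 2^6 = 262,080 promoter and orientation assignments per order. A single compact specification therefore stands for millions of candidate constructs, which is why a library is built from a sampled subset and not from the full enumeration.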
RESULTS

Combinatorial assembly of the nifUSVWZM operon

We started with a variant of the refactored nifUSVWZM operon that yields 60% wild-type activity in the context of K. oxytoca NF10

(ref. 6), a strain containing a chromosomal deletion of the wild-type nifUSVWZM but with the other nif genes remaining in the wild-type cluster in the chromosome (Online Methods). On the basis of this construct, we designed a library to vary the architecture and component parts. There was little guiding information about the importance of the genetic organization of the native operon in the literature. We noted that when we compared the nifUSVWZM operon across a range of species, the component genes were found to be arranged differently into subclusters with varying orders and orientations9,15 (Supplementary Fig. 1 and Supplementary Note 1). Therefore, rather than constraining the genes into an operon structure, we allowed the architectures in the library to vary considerably. We designed constructs that contain different operon structures, gene orders and gene orientations (Fig. 2a). The only architectural constraint that was imposed was to limit nifUSV to the first half of the cluster and nifWZM to the second half, to reduce the number of half clusters to be assembled (this does not constrain these genes to be part of the same operon). We allowed several nonstandard design features in the library, including tandem promoters, interference from downstream reverse promoters and genes after terminators that rely on read-through for transcription. (Eugene code16 corresponding to this library is available as Supplementary Data File 1, Supplementary Table 1 and Supplementary Note 2.) We selected genetic parts to vary gene expression levels. We used three T7 RNAP promoters to vary transcription levels and reused the same terminator throughout the designs6 (Supplementary Fig. 2). The same set of codon-randomized open reading frames for the six genes was used as described previously6. We designed a total of 12 RBSs using the RBS Calculator17 to provide strong and weak RBSs for each gene. 
This includes six from the original refactored cluster, four that were designed to be fivefold stronger (for nifU, nifV, nifW and nifZ) and two that were designed to be fivefold weaker (for nifS and nifM). Seven spacers composed of randomly generated 50-bp DNA sequences were included to increase the distance between RBSs and upstream elements and to separate cistron parts to reduce context effects that could occur between neighboring parts18 (the spacers were computationally scanned to eliminate functional sequences) (Supplementary Note 3). We developed a hierarchal DNA assembly strategy to efficiently combine genetic parts to form intermediate composite parts, half clusters and whole clusters. Each level of the hierarchy uses a different DNA assembly strategy that is optimal for the size and types of parts that exist at that stage (Fig. 1b). The first stage combines individual parts (spacers, promoters, RBSs, genes and terminators) to form 48 cistron-sized constructs (Supplementary Fig. 3 and Supplementary Note 4). At this stage, scarless assembly is critical, as the introduction of new sequences at the seams can affect part behavior19. We developed a simple ‘scarless stitching’ method that can combine up to three parts in vitro and uses an additional enzyme to remove the bridging scar (Supplementary Figs. 4 and 5 and Supplementary Note 5). The cistron-level constructs were then combined to form half clusters via the MoClo variation of Golden Gate assembly20. This method introduces 4-bp scars when building libraries, which we placed in the spacers separating cistrons. After building the cistron parts, we used one round of PCR to customize the flanking regions of each cistron, which contain MoClo cohesive ends that determine the eventual order and orientation in future assembly steps. We built 24 half clusters: 12 for nifUSV and 12 for nifWZM. We then combined these in 84 different combinations to build the full clusters using the same assembly process. 
We sequence verified a subset of the clusters and from that inferred that 22 were incorrect, mainly from point mutations in

[Figure 1 panels: (a) Design (Constrain: [promoter] = P1,P2,P3,null; [rbs] = RBSx1,RBSx5; [terminator] = T1,null; [order] = U,S,V < W,Z,M. Permute: promoter strength and number, RBS strength, terminator number, gene order and orientation), Build, Test (activity, transcriptome, growth, genomic impact) and Learn. (b) Assembly hierarchy: parts; 48 cistrons (scarless stitching, PCR); 24 half clusters (Golden Gate assembly); 84 full clusters, USVWZM-1 to USVWZM-84 (Golden Gate assembly).]

Figure 1 Combinatorial design and construction of gene cluster libraries. (a) The design-build-test-learn cycle. (b) Cluster assembly steps (left) illustrate the application of different techniques at different stages of the hierarchy, and an assembly graph (right) traces the path from parts to complete clusters for the nifUSVWZM library. The complete constructs corresponding to the letter codes are provided in Supplementary Figure 3.

coding sequences occurring in intermediate plasmids. We further analyzed a total of 62 constructs for sequence-activity relationships.

Screening and analysis of the nifUSVWZM library

All 62 gene clusters were introduced into K. oxytoca NF10 (∆nifUSVWZM) bearing a controller containing the IPTG-inducible PTac promoter driving the expression of T7* RNAP (plasmid N249)6. This is akin to a complementation experiment, where the remaining nif genes are encoded in the wild-type cluster in the genome. We characterized each variant using an acetylene reduction assay (Online Methods) and compared it to wild-type K. oxytoca M5al (Fig. 2a) to determine the activity as a percentage of wild-type activity. This percentage is for total activity, not the specific activity with respect to protein level, which is sometimes reported in the literature21,22. In measuring the activity, we diluted samples so that they would have the same optical density (OD) before induction.

One variant (USVWZM#30) recovered full wild-type activity (96% ± 9%, s.d.), but its genetic architecture differed from that of the wild type, with five transcription units, a different gene order and a change in orientation between nifUVS and nifZMW. Additionally, tandem promoters control nifZ and nifM in USVWZM#30. However, it is noteworthy that the second most active operon (USVWZM#1, 85% ± 5%) has the same single-operon architecture as the original refactored operon, with the only different parts being RBSs and spacers. The next three variants (USVWZM#61, #68 and #41) had high activity (77% ± 5%, 75% ± 5% and 70% ± 3%) but also differed substantially in their architectures, with two, five and three transcriptional units and two, eight and four promoters. The diversity of genetic architectures present in the five most active variants highlights the genetic plasticity of this operon.
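The activities above are reported as a percentage of wild-type activity with a standard deviation across replicates. The sketch below shows one plausible way to compute such a value from raw assay readings. It assumes that each variant replicate is divided by the mean wild-type reading, which the text does not specify; the function name and argument names are hypothetical.

```python
from statistics import mean, stdev


def percent_of_wild_type(variant_replicates, wild_type_replicates):
    """Return (mean, sample s.d.) of variant activity as % of wild type.

    Each variant replicate is normalized to the mean wild-type reading
    (an assumption of this sketch), then summarized across replicates.
    """
    wt_mean = mean(wild_type_replicates)
    if wt_mean <= 0:
        raise ValueError("wild-type activity must be positive")
    percents = [100.0 * value / wt_mean for value in variant_replicates]
    return mean(percents), stdev(percents)
```

With four replicates per strain, as described for the nitrogenase measurements in Figure 2, `stdev` gives the sample standard deviation (n - 1 in the denominator). Because the percentage refers to total activity, no normalization to protein level is applied.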

Several of the permuted nifUSVWZM constructs produced a slow growth phenotype in the assay conditions. We used the OD after 5 h of growth as a measure of the growth rate (Online Methods). When we arranged the nifUSVWZM variants in order by growth rate and then activity (Fig. 2a), constructs #1, #68 and #61 stood out as maintaining both high activity and wild-type growth. We analyzed the library to see whether there were correlations between nitrogenase activity and features of the genetic architecture, including gene orientation and order, part combinations and part activity (Supplementary Figs. 6–9 and Supplementary Note 6). There was a negative correlation between activity and the number of transcription units (as well as the number of promoters and terminators, which are related), but there were many outliers, and the most active variants contained multiple transcription units (Supplementary Note 6). There was no correlation with the number of orientation changes, but there was a preference for nifU and nifS to be in the reverse orientation. There was no enrichment for constructs that preserve any aspect of the gene order, including those orders most preserved in native clusters when compared across bacterial species. Likewise, there were few correlations between genetic architecture and growth rate (Supplementary Note 6). It is possible that the disruption of an operon would yield a variant that has high activity but is less robust over a range of RNAP concentrations23–25. The refactored clusters are induced by a controller, which simplifies the measurement of robustness by varying the concentration of IPTG to change T7* RNAP concentration (Fig. 2b). This enables the direct measurement of the robustness of the clusters to changes in expression; a phenomenon that has been characterized computationally for other systems26,27 but is not possible to measure experimentally for native systems with complex and redundant regulation. We quantified

the robustness of 49 of the most active nifUSVWZM variants by measuring the nitrogenase activity at five levels of IPTG induction that spanned two orders of magnitude (Fig. 2b). Clusters were placed into three groups on the basis of their response to increasing IPTG concentrations. The majority (35) of these monotonically increased in activity as a function of RNAP concentration. Several clusters were very robust over a wide range of RNAP concentrations, showing almost no change in activity. The most active cluster (#30) as well as those that balance activity and growth (#1, #68 and #61) were all robust, showing little change in activity over a wide range of RNAP concentrations. Notably, we observed no correlation between fragile clusters and the disruption of the operon structure, and #30 was among the most robust, despite the disruption of the original operon with seven promoters.
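The caption of Figure 2 describes robustness as the area under the induction curve, and the clusters are grouped by whether activity increases, stays flat or decreases with induction. A minimal sketch of both calculations follows. The trapezoidal rule, the normalization by the induction range and the 10% threshold used to call a response "flat" are assumptions of this sketch; the exact procedure is given in the Online Methods, which are not reproduced here.

```python
def area_under_curve(x, y):
    """Trapezoidal area under y(x); x must be sorted in ascending order."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two matching points")
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
               for i in range(len(x) - 1))


def robustness(transcription, normalized_activity):
    """Area under the induction curve divided by the induction range.

    With activity normalized to its maximum, a value near 1 means activity
    barely changes across the tested T7* RNAP levels.
    """
    span = transcription[-1] - transcription[0]
    return area_under_curve(transcription, normalized_activity) / span


def classify_response(activity, tolerance=0.1):
    """Label an induction curve 'increasing', 'decreasing' or 'flat'.

    The net change from lowest to highest induction is compared with the
    peak activity; the tolerance value is a hypothetical threshold.
    """
    peak = max(activity)
    if peak == 0:
        return "flat"
    change = (activity[-1] - activity[0]) / peak
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "flat"
```

For example, a cluster whose normalized activity stays between 0.9 and 1.0 at all five induction levels would be labeled "flat" and score close to 1, whereas a cluster that is active only at the highest induction level would score much lower.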

[Figure 2 panels: (a) activity (% wild-type) and OD600 by cluster number; (b) controller plasmid and refactored cluster plasmid, normalized activity versus T7* RNAP-driven transcription (a.u.), and activity (% wild-type) versus robustness; (c, d) activity (% wild-type); (e) part map of the refactored cluster v2.1, with a legend of RBS (Ru1 to Rm2), gene, promoter (P1 to P3), terminator (T1) and spacer (S[x]) symbols.]


Optimization and assembly of complete nif clusters

We further optimized the nifHDKY and nifENJ operons on the basis of the codon-randomized genes synthesized previously6. For nifHDKY, we started with the refactored operon that yields 45% wild-type activity in the context of K. oxytoca NF9 (ref. 6), a strain with a chromosomal deletion of the wild-type nifHDKY (Online Methods). We assembled a library that varied the operon occupancy and part identity and that introduced promoters and terminators to disrupt the operon. A set of RBSs was designed by the RBS Calculator to vary the expression level of nifH and nifD (Online Methods, Supplementary Fig. 10 and Supplementary Note 7). The library of 145 variants was built and screened, and the top construct yielded 102% wild-type activity in the ∆nifHDKY strain background (Fig. 2c). Similarly, we made RBS libraries using the RBS Calculator to vary the expression of the nifENJ operon (Supplementary Fig. 11 and Supplementary Note 7). We made a library of 155 variants with 50–60 RBS variants for each gene. We tested the variants with the optimized nifHDKY in K. oxytoca NF24 (ref. 6) (a strain with a nifHDKYENJ chromosomal deletion), and the top variant was able to achieve 82% activity in this strain background (Fig. 2c).

The improved operons were combined to optimize the activity of the complete refactored cluster in K. oxytoca NF26 ∆nif (Fig. 2d, Supplementary Fig. 12 and Supplementary Note 8). Placing the top nifHDKY operon in the context of the original refactored cluster (v1.1) increased the activity from 7% to 21%. This was further increased to 40% activity when the USVWZM#1 operon was added (v2.0). Finally, the inclusion of the optimized nifENJ (v2.1) recovered 57%. It is noteworthy that the genetic architectures of v2.1 and the wild-type cluster differ significantly (Fig. 2e).
To further explore the plasticity of the gene ordering and operon occupancy, we made a large library in which these variables were permuted for all 16 genes. The only pattern that emerged was a benefit for keeping nifE and nifN adjacent and co-transcribed in an operon (Supplementary Figs. 13–19 and Supplementary Note 9).

Transcriptome diversity in the variant nif clusters

Very few architectural rules were gleaned from the nifUSVWZM library and the assembly of the complete clusters. It is possible that it is important to maintain the correct expression levels and that, if this condition is satisfied, many architectures are equivalent. We used high-throughput transcriptomics to quantify the relationship between genetic architecture, expression and activity. We gathered RNA-seq data on the wild-type cluster, refactored cluster and 82 members of the nifUSVWZM library (Online Methods). Analyzing this many samples in a cost-effective manner required the application of new techniques involving pooling samples early in the process and multiplexing

reactions (data not shown). The method is also strand-specific, which is important for obtaining data for promoters oriented in opposite directions. This provides base-pair resolved transcript levels across the refactored clusters, as well as the entire genome, for the complete set of samples.

RNA-seq data were first gathered for the wild-type Klebsiella strain under inducing conditions (Online Methods, Supplementary Fig. 20 and Supplementary Note 10). To our knowledge, this was the first transcriptomics investigation of nitrogen fixation in K. oxytoca, although the cluster has been investigated in other organisms28–30. The different operons within the nif cluster were transcribed to different levels (Fig. 3a). Expression of nifHDKTY was approximately tenfold higher than that of nifUSVWZM, nifF and nifJ. The least-transcribed operons (nifENX and nifBQ) were 20-fold lower than nifHDKTY.

We then measured the transcription profile for the original refactored nif cluster (v1.0)6. Unlike the wild-type cluster, this profile showed very little change across the entire cluster (Fig. 3a). This is reflective of the design of this cluster: genes were maintained in the same orientation; super-operons were built by combining nifHDKY, nifEN and nifJ; and T7 RNAP is known for terminator readthrough31 and reduced attenuation32. In comparison, the profile of the improved refactored cluster, v2.1, was much more punctate, with genes expressed at different levels and clearer delineation between operons. Notably, this profile emerged as looking more similar to wild-type even though we had not made changes designed to achieve this.

To convert the profile into expression levels, we calculated the normalized reads assigned per kilobase of target per million mapped reads (RPKM) value for each gene (Online Methods). For both the v1.0 and v2.1 clusters, the transcript levels of all the genes were very different from that of wild type (Fig. 3b).
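For reference, the RPKM value mentioned above can be computed as in the following sketch, which uses the standard definition (reads assigned to a gene, scaled per kilobase of gene length and per million mapped reads). Any additional normalization applied in the Online Methods is not shown, and the function and parameter names are illustrative.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of target per million mapped reads.

    RPKM = reads / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = reads * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and mapped read total must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def fold_change(rpkm_variant, rpkm_wild_type):
    """Ratio of variant to wild-type expression for the same gene."""
    if rpkm_wild_type <= 0:
        raise ValueError("wild-type RPKM must be positive")
    return rpkm_variant / rpkm_wild_type
```

Dividing by gene length lets genes of different sizes be compared within a sample, and dividing by the mapped read total lets the same gene be compared between samples, which is what a comparison of refactored and wild-type transcript levels requires.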
Compared to the v1.0 cluster, the improved cluster (v2.1) showed a consistent ~2-fold drop in the transcripts, approaching wild-type levels, but the transcripts remained 2- to 100-fold higher than in wild type. To compare protein concentrations, we performed proteomics on the wild-type and refactored clusters. We found that the protein expression in both refactored clusters was much closer to that in the wild-type cluster, and v2.1 was nearly identical, for all genes (Fig. 3c). This is consistent with the fact that most of the debugging of the refactored clusters was performed by tuning the RBS strengths.

Expression tolerances and part behavior

Of the libraries that we constructed, nifUSVWZM contained the most architectural diversity. To quantify how the expression levels reflect this diversity, we performed RNA-seq on each library member and analyzed these data to determine the transcription levels of genes

Figure 2 Screening results for the nif cluster optimization in K. oxytoca. (a) The rank-ordered list of 62 members of the nifUSVWZM library, sorted by nitrogenase activity (gray) and growth (OD600 = 0.88–0.74 (dark green), 0.73–0.69 (medium green), 0.68–0.62 (light green) and 0.60–0.28 (yellow)). Details of the part sequences and functions are provided in Supplementary Notes 3 and 11. (b) Robustness of clusters to changes in T7* RNAP concentration. Top left, illustration of how inducible expression of T7* RNAP feeds forward to effect expression of refactored gene cluster. Bottom, responses of clusters to changes in T7* RNAP concentration grouped by increasing activity, flat and decreasing activity. Red traces and labels highlight the robustness of the most active variant (#30) and the three that exhibit fast growth and high activity (#1, #61 and #68). Top right, cluster activities plotted against their ‘robustness’, calculated as the area under the induction curves (Online Methods). a.u., arbitrary units. (c) The activity of each fragment of the cluster is shown in a Klebsiella strain where the genes (labeled on the x axis) are knocked out and the remainder are wild type. nifHDKY activity is shown in K. oxytoca NF9 (∆nifHDKY) as an example. Light bars indicate the fragments before optimization (from refactored v1.0); dark bars are those after optimization. (d) The activity of the refactored clusters in K. oxytoca NF26 in which the complete cluster is knocked out (∆nif). Cluster v1.0 was improved by the substitution of the optimized nifHDKY (v1.1), both the optimized nifHDKY and USVWZM#1 (v2.0) and the addition of the optimized nifENJ (v2.1). Error bars denote s.d. from two (OD) or four (nitrogenase activity) replicates performed on different days (a–d). (e) The parts composition of the optimized cluster (v2.1). Part classes are indicated by symbols (top), part name (bottom) and function (middle). 
Function values denote REUs for promoter parts, arbitrary units for RBSs, and TS for terminators. Numbers at the corner of each rectangle measure the location (in bp) within the genetic construct. Part details and sequences are provided in Supplementary Note 11.


Articles Thus, the individual genes must be expressed within a tolerated range, and their ratios must be conserved. One of the challenges in designing genetic systems is the reliance on part-characterization data that are based on measurements of the parts in isolation. Part function can vary considerably depending on the local genetic context33. When constructing the nifUSVWZM library, we varied the architectures substantially, but the same underlying parts were reused many times. Strong (0.12 relative expression units (REU)), medium (0.08 REU) and weak (0.03 REU) promoters were used 99, 101 and 257 times, respectively (Supplementary Figs. 21–23).


(RPKM values) and the activity of promoters and terminators (Fig. 3d). In addition to the plasmid-borne nifUSVWZM genes, the levels of the remainder of the genomic nif genes in the ∆nifUSVWZM strain can be calculated (Fig. 3e). The diversity of transcription levels of the nifUSVWZM genes was substantial, spanning five orders of magnitude. These data can be used to determine ranges of tolerance for the expression levels of the individual genes, outside of which low growth or reduced activity occur (Fig. 3f). The highly active and fast-growing variants (#1, #68 and #61) had gene-expression ratios that closely match those of the wild-type, all of which were close to 1:1 (Fig. 3g).




We used only one terminator (terminator strength TS = 2.6), and it appears 276 times (Supplementary Fig. 24). The transcript profiles can be aligned for each instance of the part in different contexts (Fig. 3h,i and Supplementary Note 10). When averaged across all contexts, the strengths of the promoters closely matched those from the measurements in isolation using fluorescent reporters and cytometry (Fig. 3i). However, there was considerable variation in individual contexts, and certain trends correlating genetic context with part behavior emerged (Supplementary Note 10). Collectively, these contextual effects caused the transcription of each gene to vary over ranges that are orders of magnitude larger than could be achieved with the context-free promoters (Fig. 3a,f). This is probably why the extreme architecture diversification of the nifUSVWZM library was successful in scanning expression ranges to identify the optimal levels.

Transfer of the refactored gene cluster to E. coli

We chose to study the transfer of the refactored nif cluster from Klebsiella to E. coli MG1655. Nitrogenase activity has been transferred to E. coli previously21,34,35, and these strains share similar regulatory parts. RBSs function in either host because the mRNA-binding region of the 16S rRNA is identical, T7 RNAP functions in both, and the plasmid origins can be carried in either. However, the quantitative function of these parts could be different owing to changes in the cellular environment, growth rate, metabolism and concentration of RNAPs and ribosomes36,37. This could lead to differences in the selection of parts to optimize activity in each organism. The original refactored nif cluster only retains 7% of the wild-type activity (Fig. 4a). Indeed, when we transferred the first version of the refactored cluster to E. coli, it dropped to 2% of the Klebsiella wild-type

[Figure 4 graphic. Panel a: nitrogenase activity (% wild-type) for v1.0, v2.1, RBS and v1.2. Panel b: permuted nif clusters (gene order HDKYENJBQFUSVWZM, strong or weak RBS per gene) against nitrogenase activity (% wild-type).]

Figure 4 Transfer of refactored nif clusters into E. coli MG1655. (a) Nitrogenase activity of synthetic clusters in E. coli MG1655 compared to the wild-type nif cluster expressed in K. oxytoca M5al. Data are shown for the original refactored cluster (v1.0)6, the optimized cluster (v2.1), the best obtained from the RBS library (RBS) and the intermediate containing the optimized USVWZM#1 in the v1.0 background (v1.2). Error bars represent s.d. of four experiments done on different days. (b) The RBS library built on the basis of v1.0 and screening data for E. coli. The most active mutant (top row) corresponds to the RBS bar in a.

activity (Fig. 4a). However, because the refactored cluster is modular, parts designed to function in the new host can be quickly substituted. Although the RBSs are similar between these hosts, their quantitative strength and rank order can differ36. We applied the RBS Calculator to build a library that systematically changes the RBSs across all 16 genes in the cluster to those designed to function in E. coli (Fig. 4b). To vary expression, we designed RBSs to be either fivefold stronger (nifDNJBQUVWZ) or fivefold weaker (nifHKYEFSM) than the calculated strength of the RBSs in the original refactored cluster. Those selected to be weaker were chosen because they were already tuned near the upper limit allowed by the RBS Calculator. We designed 20 clusters on the basis of a fractional factorial multivariate design38, and an additional 20 clusters by randomly choosing RBS combinations in silico (Online Methods, Supplementary Figs. 25 and 26 and Supplementary Note 9). We screened this library, and the best construct yielded 8% activity. Next, we examined whether the improvements that went into the construction of the v2.1 cluster in Klebsiella would likewise improve performance in E. coli (Fig. 4a). The best-performing refactored cluster in Klebsiella (v2.1) increased the activity to 7% in E. coli. In addition, we screened the intermediate constructs and were surprised to find that the top construct in E. coli contained only the optimized USVWZM#1 variant in the background of the original refactored cluster (v1.2). Further, we had difficulty transforming the v1.1 cluster into E. coli and the v1.2 cluster into Klebsiella, possibly owing to toxicity. Collectively, these results highlight the differences in part functions

Figure 3 Transcriptomic analysis of the optimized refactored clusters and nifUSVWZM library. (a) Strand-specific RNA-seq reads are mapped to the wild-type nif gene cluster (gray) and the refactored clusters before (v1.0, pink) and after (v2.1, blue) optimization. For the wild-type cluster, the transcription of the sense strand is shown in dark gray and the antisense strand is shown in light gray. The architecture of each gene cluster is shown and known promoters (green arrows) and terminators (red tees) are indicated. The purple lines map the genes from the wild-type to refactored clusters. (b) The ratio of the transcripts (RPKM values) for the refactored clusters compared to the wild type (v1.0, pink; v2.1, blue). x-axis labels correspond to each gene in the refactored nif gene cluster. (c) The ratios of proteins for the refactored clusters compared to wild type, as determined by global iTRAQ proteomics with directed mass spectrometry measurement (Online Methods). Asterisks mark proteins for which no ions were identified; error bars indicate the sample s.d. from two technical replicates of two biological replicates. The horizontal line marks the ratio of 1. (d) Representative RNA-seq traces for USVWZM#1 and USVWZM#32. The sequencing reads mapped to the sense and antisense strands are shown above and below the cluster diagrams, respectively. Examples are shown for the calculation of the strength of a promoter, expression level of a gene (RPKM) and terminator efficiency (Online Methods and Supplementary Note 10). (e) Expression levels of the nif genes (RPKM values) are shown for the nifUSVWZM library, organized by activity and growth rate as in Figure 2a. Red box indicates variants that are both highly active and fast growing. The expression levels of the genomic nif genes are shown in blue. The scales of the y axes for the genomic and plasmid-carried genes are shown (bottom right). (f) Expression levels in nifUSVWZM variants. 
Red lines indicate the high and low extremes of the expression levels for variants #1, #61 and #68. (g) The expression ratios between all pairs of nifUSVWZM genes were calculated and averaged (Online Methods). Each box represents one member of the library; the red boxes show variants #1, #61 and #68. (h) Behavior of terminator parts in the nifUSVWZM library. n, the number of instances of the part. Each trace is the normalized number of mapped RNA-seq reads, which has been aligned by the location of the part (the beginning and end of the part are shown as vertical dashed lines). Gray traces represent every instance of the part in the library; the red trace is the average across all genetic contexts. Inset, histogram of percentage termination among library terminators; red vertical line indicates mean. (i) Part function for the weak (green), medium (orange) and strong (blue) promoters in the library, labeled as in Figure 3h. Middle, histograms showing the gain of transcription for library promoters; colored lines indicate average gain. Right, data for promoter strength collected in isolation using an mRFP expression construct (Supplementary Note 3). Error bars denote s.d.
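The quantities named in Figure 3d can be illustrated with a short Python sketch. RPKM follows its standard definition (reads per kilobase of gene per million mapped reads). The termination calculation is an assumption made for illustration: it compares mean read depth in fixed windows upstream and downstream of the part. The procedure actually used is described in the Online Methods and Supplementary Note 10, and the function names and the 50-bp window here are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def mean_depth(coverage, start, end):
    """Mean per-base read depth over the half-open interval [start, end)."""
    window = coverage[start:end]
    return sum(window) / len(window)


def percent_termination(coverage, part_start, part_end, flank=50):
    """Percentage drop in read depth across a terminator part.

    Assumed definition: 100 * (1 - downstream / upstream), floored at 0,
    with depth averaged over `flank` bases on each side of the part.
    """
    upstream = mean_depth(coverage, part_start - flank, part_start)
    downstream = mean_depth(coverage, part_end, part_end + flank)
    if upstream == 0:
        return 0.0
    return max(0.0, 100.0 * (1.0 - downstream / upstream))
```

Here `coverage` is a list of strand-specific per-base read depths along the construct, so that the same terminator can be scored in every context in which it appears and the results averaged as in Figure 3h.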




between these hosts and the need to be able to rapidly build and test variants with clean part substitutions as part of the process of species transfer. For transfers to more disparate hosts, we expect that the refactored cluster will be nonfunctional in the new host. Therefore, it is necessary to harness the modularity of the refactored system to make many part substitutions, guided either by design of experiments or by part measurements in the respective hosts, to build a large library to take many ‘shots on goal’ to achieve a functional system in the new host.

DISCUSSION

Directed evolution has proven to be a powerful approach to optimize biological systems, including proteins, pathways and whole genomes39. Combinatorial design is essentially a new approach to the directed evolution of large, multi-gene systems. It allows parts to be changed to create diverse architectures, while preserving those rules that are known to be important, such as maintaining the necessary parts for transcription and translation. In this manuscript, we have demonstrated that a refactored gene cluster can be used as a platform to build diverse architectures and implement design choices that would not be possible with the wild-type system. A large fraction of these highly perturbed systems are functional and many are improved compared to the parent system. There are three classes of rules that we obtained from analyzing the libraries of nif clusters: architectural rules, part constraints and expression tolerances. These classes of rules occur frequently across genetic engineering projects and it is important to have formalisms to capture them, such that they can be fed into the next round of the design cycle. For the nif cluster, the new rules have been provided in Eugene format (Supplementary Data File 2). Surprisingly few architectural rules could be extracted from our data. 
The strongest constraints concern maintaining expression in a tolerated range and in defined ratios; however, there are many genetic architectures that are equivalent in achieving this result. We could not identify preferences for gene order, orientation, or operon occupancy. In fact, some of the most highly disrupted operons emerged as the most active in the library. When comparing nif clusters across species, there is similar diversity in the genetic encoding, with some gene order conservation (synteny), but there are many examples of disrupted gene pairs as well as changes in order and orientation40,41 (Supplementary Note 1). Although there may be unknown adaptive reasons for these changes, our results suggest that many of these differences are neutral, reflective more of evolutionary drift. Beyond the nif cluster, similar tolerance for diverse architectures has been shown in computational studies of viral genomes42 and alternative assemblies of genetic devices43. Complementary to comparative genomics, the diversity created using a refactored cluster offers the unique ability to compare alternative architectures akin to natural diversity, but side-by-side in the same species. For example, a rule that is preserved in phylogenetic comparisons as well as our analysis is the proximity of nifE and nifN, which may reflect the need to maintain a stoichiometric ratio in forming a protein complex6,44. The combination of refactoring and combinatorial design has the potential to affect the optimization, diversification and transfer of many classes of gene clusters encoding different cellular functions of interest in biotechnology45. Notably, there are many bacterial and fungal clusters that encode pathways to produce small molecules that could be valuable pharmaceuticals, insecticides or herbicides45. 
Accessing these functions involves transferring them to a new host, often requiring wholesale codon optimization, substitution of regulatory parts and subsequent optimization via directed evolution.
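As an illustration of how an architectural rule narrows a combinatorial design space, the following Python sketch enumerates every gene order and orientation for a three-gene fragment and keeps only the designs in which nifE and nifN are adjacent, the one proximity rule noted above. It is not Eugene syntax and does not reproduce the rules in Supplementary Data File 2; the function names and the choice of three genes are assumptions made to keep the example small.

```python
from itertools import permutations, product


def architectures(genes):
    """Yield every gene order combined with every per-gene strand orientation."""
    for order in permutations(genes):
        for strands in product("+-", repeat=len(genes)):
            yield tuple(zip(order, strands))


def adjacent(architecture, gene_a, gene_b):
    """True if the two genes occupy neighbouring positions in the design."""
    order = [gene for gene, _ in architecture]
    return abs(order.index(gene_a) - order.index(gene_b)) == 1


genes = ["nifE", "nifN", "nifJ"]
all_designs = list(architectures(genes))
kept = [a for a in all_designs if adjacent(a, "nifE", "nifN")]
print(len(all_designs), len(kept))  # prints "48 32"
```

Three genes give 3! orders and 2^3 orientation patterns, or 48 designs, of which 32 satisfy the adjacency rule. The same filter applied to 16 genes would have to be evaluated by a constraint solver rather than by exhaustive enumeration, which is the role Eugene plays in the design pipeline.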

Here we present a systematic pipeline to strip out evolutionary genetic baggage, parallelize the process of design and construction to increase the probability of success and implement ‘omics’ analytical pipelines to interpret the results and drive the next iteration of design. Using this pipeline to transfer nitrogenase activity to a nontrivial organism such as a root-associated bacterium, chloroplast or mitochondria requires the construction of large sets of parts for the target organism, the construction and transfer of large combinatorial libraries and the use of cellular analytics to assess cluster function in the new host.

Methods

Methods and any associated references are available in the online version of the paper.

Accession codes. Sequence Read Archive: K. oxytoca NF26 refactored v1.0, SRX722064 (experiment) and SRR1598918 (run); K. oxytoca NF26 refactored v2.1, SRX722065 (experiment) and SRR1599311 (run); nifUSVWZM library samples, SRX722066 (experiment) and SRR1599276 (run); wild-type, SRX719666 (experiment) and SRR1598917 (run).

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

Acknowledgments

M.J.S., D.Z., J.C., R.N., D.B.G. and C.A.V. are supported by the US Defense Advanced Research Projects Agency (DARPA) Living Foundries grant HR0011-12-C-0067 and the US National Science Foundation Synthetic Biology Engineering Research Center (SynBERC) through grant SA5284-11210 and are also supported by the Institute for Collaborative Biotechnologies through contract W911NF-09-0001 from the US Army Research Office. The content of the information does not necessarily reflect the position or the policy of the US government, and no official endorsement should be inferred. S.B. and D.D. are supported by DARPA Living Foundries grant HR0011-12-C-0067. M.J.S. is an HHMI Fellow of the Damon Runyon Cancer Research Foundation, DRG-2129-12. Y.P. is supported by the Samsung Scholarship.

AUTHOR CONTRIBUTIONS

M.J.S., D.Z. and C.A.V. conceived and designed the experiments and wrote the manuscript. M.J.S. performed the nifUSVWZM, monocistronic and RBS library construction and analysis. D.D. and S.B. performed the clustering analysis, wrote the design files and analyzed data. D.Z. constructed and analyzed the nifHDKY, nifENJ and complete cluster library. Y.P., D.B.G., M.B., G.G., R.N. and D.C. performed the RNA-seq experiments and analysis. L.B.A.W. and J.C. performed experiments.

COMPETING FINANCIAL INTERESTS

The authors declare competing financial interests: details are available in the online version of the paper.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Czar, M.J., Anderson, J.C., Bader, J.S. & Peccoud, J. Gene synthesis demystified. Trends Biotechnol. 27, 63–72 (2009).
2. Gibson, D.G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56 (2010).
3. Kröger, J.D. et al. The transcriptional landscape and small RNAs of Salmonella enterica serovar Typhimurium. Proc. Natl. Acad. Sci. USA 109, E1277–E1286 (2012).
4. Arnold, W., Rump, A., Klipp, W., Priefer, U.B. & Pühler, A. Nucleotide sequence of a 24,206-base-pair DNA fragment carrying the entire nitrogen fixation gene cluster of Klebsiella pneumoniae. J. Mol. Biol. 203, 715–738 (1988).
5. Chan, L.Y., Kosuri, S. & Endy, D. Refactoring bacteriophage T7. Mol. Sys. Biol. 1, 2005.0018 (2005).
6. Temme, K., Zhao, D. & Voigt, C.A. Refactoring the nitrogen fixation gene cluster from Klebsiella oxytoca. Proc. Natl. Acad. Sci. USA 109, 7085–7090 (2012).
7. Temme, K., Hill, R., Segall-Shapiro, T.H., Moser, F. & Voigt, C.A. Modular control of multiple pathways using engineered orthogonal T7 polymerases. Nucleic Acids Res. 40, 8773–8781 (2012).
8. Beatty, P.H. & Good, A.G. Future prospects for cereals that fix nitrogen. Science 333, 416–417 (2011).
9. Arnold, W., Rump, A., Klipp, W., Priefer, U.B. & Pühler, A. Nucleotide sequence of a 24,206-base-pair DNA fragment carrying the entire nitrogen fixation gene cluster of Klebsiella pneumoniae. J. Mol. Biol. 203, 715–738 (1988).
10. Eady, R.R., Issack, R., Kennedy, C., Postgate, J.R. & Ratcliffe, H.D. Nitrogenase synthesis in Klebsiella pneumoniae: comparison of ammonium and oxygen regulation. J. Gen. Microbiol. 104, 277–285 (1978).
11. Lowe, D.J. & Thorneley, R.N.F. The mechanism of Klebsiella pneumoniae nitrogenase action: pre-steady-state kinetics of H2 formation. Biochem. J. 224, 877–886 (1984).
12. Dixon, R., Cheng, Q., Shen, G.F., Day, A. & Dowson-Day, M. Nif gene transfer and expression in chloroplasts: prospects and problems. Plant Soil 194, 193–203 (1997).
13. Dukes, P., Lamken, E. & Wilson, R. Workshop report: Combinatorial design theory (Banff International Research Station meeting 08w5098) https://www.birs.ca/workshops/2008/08w5098/report08w5098.pdf (2008).
14. Alper, H., Fischer, C., Nevoigt, E. & Stephanopoulos, G. Tuning genetic control through promoter engineering. Proc. Natl. Acad. Sci. USA 102, 12678–12683 (2005).
15. Beynon, J., Cannon, M., Buchanan-Wollaston, V. & Cannon, F. The nif promoters of Klebsiella pneumoniae have a characteristic primary structure. Cell 34, 665–671 (1983).
16. Bilitchenko, L. et al. Eugene – a domain specific language for specifying and constraining synthetic biological parts, devices, and systems. PLoS ONE 6, e18882 (2011).
17. Salis, H.M., Mirsky, E.A. & Voigt, C.A. Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950 (2009).
18. Davis, J.H., Rubin, A.J. & Sauer, R.T. Design, construction and characterization of a set of insulated bacterial promoters. Nucleic Acids Res. 39, 1131–1141 (2011).
19. Crook, N.C., Freeman, E.S. & Alper, H.S. Re-engineering multicloning sites for function and convenience. Nucleic Acids Res. 39, e92 (2011).
20. Weber, E., Engler, C., Gruetzner, R., Werner, S. & Marillonnet, S. A modular cloning system for standardized assembly of multigene constructs. PLoS ONE 6, e16765 (2011).
21. Wang, X. et al. Using synthetic biology to distinguish and overcome regulatory and functional barriers related to nitrogen fixation. PLoS ONE 8, e68677 (2013).
22. Cannon, F.C., Dixon, R.A. & Postgate, J.R. Derivation and properties of F-prime factors in Escherichia coli carrying nitrogen fixation genes from Klebsiella pneumoniae. J. Gen. Microbiol. 93, 111–125 (1976).
23. Price, M.N., Huang, K.H., Arkin, A.P. & Alm, E.J. Operon formation is driven by co-regulation and not by horizontal gene transfer. Genome Res. 15, 809–819 (2005).
24. Lim, H.N., Lee, Y. & Hussein, R. Fundamental relationship between operon organization and gene expression. Proc. Natl. Acad. Sci. USA 108, 10626–10631 (2011).
25. Liang, L.W., Hussein, R., Block, D.H.S. & Lim, H.N. Minimal effect of gene clustering on expression in Escherichia coli. Genetics 193, 453–465 (2013).
26. Endy, D., You, L., Yin, J. & Molineux, I. Computation, prediction, and experimental test of fitness for bacteriophage T7 mutants with permuted genomes. Proc. Natl. Acad. Sci. USA 97, 5375–5380 (2000).
27. von Dassow, G., Meir, E., Munro, E.M. & Odell, G.M. The segment polarity network is a robust developmental module. Nature 406, 188–192 (2000).
28. Hamilton, T.L. et al. Transcriptional profiling of nitrogen fixation in Azotobacter vinelandii. J. Bacteriol. 193, 4477–4486 (2011).
29. Yan, Y. et al. Global transcriptional analysis of nitrogen fixation and ammonium repression in root-associated Pseudomonas stutzeri A1501. BMC Genomics 11, 11 (2010).
30. Poza-Carrión, C., Jiménez-Vicente, E., Navarro-Rodríguez, M., Echavarri-Erasun, C. & Rubio, L.M. Kinetics of nif gene expression in a nitrogen-fixing bacterium. J. Bacteriol. 196, 595–603 (2014).
31. Jeng, S.T., Gardner, J.F. & Gumport, R.I. Transcription termination by bacteriophage T7 RNA polymerase at rho-independent terminators. J. Biol. Chem. 265, 3823–3830 (1990).
32. McAllister, W.T. & Morris, C. Utilization of bacteriophage T7 late promoters in recombinant plasmids during infection. J. Mol. Biol. 153, 527–544 (1981).
33. Cardinale, S. & Arkin, A.P. Contextualizing context for synthetic biology–identifying causes of failure of synthetic biological systems. Biotechnol. J. 7, 856–866 (2012).
34. Dixon, R.A. & Postgate, J.R. Genetic transfer of nitrogen fixation from Klebsiella pneumoniae to Escherichia coli. Nature 237, 102–103 (1972).
35. Dixon, R. & Cannon, F. Construction of a P plasmid carrying nitrogen fixation genes from Klebsiella pneumoniae. Nature 260, 268–271 (1976).
36. Moser, F. et al. Genetic circuit performance under conditions relevant for industrial bioreactors. ACS Synth. Biol. 1, 555–564 (2012).
37. Gorochowski, T.E., van den Berg, E., Kerkman, R., Roubos, J.A. & Bovenberg, R.A.L. Using synthetic biological parts and microbioreactors to explore the protein expression characteristics of Escherichia coli. ACS Synth. Biol. 3, 129–139 (2014).
38. Plackett, R.L. & Burman, J.P. The design of optimum multifactorial experiments. Biometrika 33, 305–325 (1946).
39. May, O., Voigt, C.A. & Arnold, F.H. in Enzyme Catalysis in Organic Synthesis: A Comprehensive Handbook 2nd edn. (eds. Drauz, K. & Waldmann, H.) Ch. 4 (Wiley-VCH Verlag, 2002).
40. Ran, L. et al. Genome erosion in a nitrogen-fixing vertically transmitted endosymbiotic multicellular cyanobacterium. PLoS ONE 5, e11486 (2010).
41. Stucken, K. et al. The smallest known genomes of multicellular and toxic cyanobacteria: comparison, minimal gene sets for linked traits and the evolutionary implications. PLoS ONE 5, e9235 (2010).
42. Endy, D., You, L., Yin, J. & Molineux, I.J. Computation, prediction, and experimental tests of fitness for bacteriophage T7 mutants with permuted genomes. Proc. Natl. Acad. Sci. USA 97, 5375–5380 (2000).
43. Densmore, D., Kittleson, J.T., Bilitchenko, L., Liu, A. & Anderson, J.C. Rule based constraints for the construction of genetic devices. Proc. 2010 IEEE ISCAS, doi:10.1109/ISCAS.2010.5537540 (2010).
44. Suh, M.H., Pulakat, L. & Gavini, N. Functional expression of the FeMo-cofactor-specific biosynthetic genes nifEN as a NifE-N fusion protein synthesizing unit in Azotobacter vinelandii. Biochem. Biophys. Res. Commun. 299, 233–240 (2002).
45. Fischbach, M. & Voigt, C.A. Prokaryotic gene clusters: a rich toolbox for synthetic biology. Biotechnol. J. 5, 1277–1296 (2010).


ONLINE METHODS

Strains and media. Escherichia coli DH5α was used for routine cloning and plasmid propagation. E. coli MG1655 was used as a heterologous host for screening libraries of full refactored nitrogen fixation gene clusters. Klebsiella oxytoca M5al46 was used to determine wild-type nitrogenase activity levels, and knockout mutant strains K. oxytoca NF9, NF10, NF24, NF25 and NF26 (ref. 6) were used to screen synthetic nifHDKY, nifUSVWZM, nifHDKYENJ, nifBQFUSVWZM, or full nifHDKYENJBQFUSVWZM constructs, respectively. LB medium (10 g/L tryptone, 5 g/L yeast extract, 10 g/L NaCl; VWR #90003-350) with appropriate antibiotic supplementation was used for strain maintenance and plasmid construction in E. coli strains. LB-Lennox medium (10 g/L tryptone, 5 g/L yeast extract, 5 g/L NaCl; Invitrogen 12780-052) was used for strain maintenance in K. oxytoca strains. All nitrogen fixation assays were performed in minimal medium (25 g/L Na2HPO4, 3 g/L KH2PO4, 0.25 g/L MgSO4.7H2O, 1 g/L NaCl, 0.1 g/L CaCl2.2H2O, 2.9 mg/L FeCl3, 0.25 mg/L Na2MoO4.2H2O, and 20 g/L sucrose). Growth medium is defined as minimal medium supplemented with 6 mL/L of 22% ammonium acetate (filter sterilized). De-repression medium is defined as minimal medium supplemented with 1.5 mL/L of 10% serine (filter sterilized). Antibiotic selection was performed with spectinomycin (100 mg/L; MP Biomedicals #021 5899305), kanamycin (50 mg/L; Gold Bio #K-120-5), ampicillin (100 mg/L; Affymetrix #11259 5), and/or chloramphenicol (33 mg/L; VWR cat. #AAB20841-14). Isopropyl-β-d-1-thiogalactopyranoside (IPTG; Gold Bio #I2481C25 259) was supplemented to medium for induction at 1 mM unless otherwise stated. Blue-white screening of colonies resulting from DNA assembly reactions was performed on LB-agar plates (1.5% Bacto agar; VWR cat. #90000-760) supplemented with 0.15 mM IPTG, 60 mg/L 5-bromo-4-chloro-indolyl-β-d-galactopyranoside (Roche #10 745 740 001), and appropriate antibiotics.

Plasmids. The plasmids containing refactored nif operons or full gene clusters contain the same vector backbone as the previously described refactored nif cluster (SBa_000534)6, with the exception that the chloramphenicol resistance marker is replaced by the kanamycin resistance marker for the nifUSVWZM, monocistronic, and RBS libraries. Rebuilding full cluster v1.0 in this backbone with additional changes comprising a corrected point mutation in nifZ and 4-bp scars used for MoClo assembly produced a construct (pCV-RBS20) with the same nitrogenase activity as SBa_000534. The sequence for both vector backbones is provided with part sequences in Supplementary Data Files 3–6. The controller plasmid for expression in K. oxytoca was described previously6, and that used for expression in E. coli is identical except for the RBS controlling T7-RNAP translation (Supplementary Note 9). Part characterization was performed using plasmids described previously6 (Supplementary Note 3). Scarless stitching vectors (Supplementary Note 5) and MoClo assembly vectors were built from scratch using pMB* or p15a origins of replication and standard resistance cassettes, with the exception that recognition sites for restriction enzymes used during the assembly were removed by silent point mutations. Key plasmids are listed in Supplementary Table 2, and select plasmid maps are shown in Supplementary Figure 27. Eugene files capturing design of library plasmids are in Supplementary Data Files 1, 2 and 7 and described in Supplementary Note 2.

DNA assembly and verification. The promoter parts, RBS and CDS parts, and terminator parts that entered into the pipeline at the highest level of the assembly tree were constructed using standard cloning techniques including isothermal assembly47 and PCR ligation. Parts with an identification number beginning with SBa (Supplementary Note 11 and Supplementary Data Files 3–6) are also deposited in the SynBERC registry of parts (https://registry.synberc.org/). 
Part characterization is described in Supplementary Note 3. All promoter parts are flanked by sequences GGAG (upstream) and TACT (downstream), RBS and CDS parts are flanked by sequences AATG (upstream) and AGGT (downstream), and terminator parts (TPs) are flanked by sequences TACT (upstream) and AATG (downstream). These 4-bp sequences correspond to 5′ overhanging single-stranded cohesive ends when digested with restriction enzymes BbsI (promoter and RBS or CDS parts) or BsaI (terminator parts). Application of the scarless stitching method (Supplementary Note 5) to create a seamless junction between any combination of promoter part and RBS or CDS part proceeds as follows: 20 fmol each of promoter part plasmid, RBS or

nature biotechnology

CDS part plasmid, pMJS20BC, and pMJS23AD are mixed with 5 U BbsI (New England Biolabs, #R0539S) and 5 U T4 DNA Ligase (Promega, #M1794) in a total of 10 µl 1× Promega T4 DNA Ligase Buffer and incubated at 37 °C for 4.5 h. Next, a 10 µl solution containing 5 U MlyI (New England Biolabs, #R0610S) and 5 U T4 DNA Ligase in 1× Promega T4 DNA Ligase Buffer is added to each reaction and incubated an additional 30 min at 37 °C. Reactions are terminated by incubating at 50 °C for 5 min and 80 °C for 10 min. Constructed plasmids are transformed into E. coli and prepared for sequence confirmation by Sanger sequencing using standard techniques. Scarless stitching of a promoter-RBS or -CDS construct to a terminator part follows a similar protocol to that described above, with pMJS25DB and pMJS24AC replacing pMJS20BC and pMJS23AD, and BsaI (New England Biolabs #R0535S) replacing BbsI. Constructs containing a promoter part, RBS or CDS part, and terminator part are considered cistron parts. Sequence-verified cistron parts are PCR amplified to give each construct specific cohesive ends upon BbsI digestion that dictate the orientation and relative position in the overall assembly. PCR products are cloned into Level 1 plasmids (pCV27069) with the appropriate flanking cohesive ends using a Golden Gate assembly reaction20. At this stage, each part is sequence verified. Fourteen of the 48 cistron parts contained a 1- or 2-bp deletion in the beginning of the terminator part (Supplementary Note 5), but as the first 6 bp of the terminator parts are not part of the hairpin structure and are not expected to affect termination efficiency48, these were still carried further in the library assembly. 
Three (nifUSVWZM library) or four (monocistron library) Level 1 plasmids are combined by BsaI digestion/ligation into Level 2 plasmids (pCV27070) using a Golden Gate assembly reaction to form intermediate assembly plasmids dubbed half-clusters or quarter-clusters for the nifUSVWZM or monocistron libraries, respectively. Finally, Level 2 plasmids are combined by BbsI digestion and ligation into the expression vector pMJS2001AC to form Level 3 plasmids containing 6 or 16 genes of the nifUSVWZM operon or complete refactored nif gene cluster. Level 2 and Level 3 plasmids are verified by colony-multiplex PCR using primers that anneal to the CDS sequences of each gene. Colonies are picked into 10 µl of sterile H2O and boiled at 100 °C for 10 min. Boil preps are centrifuged to pellet cell debris, and 0.5 µl supernatant is used as template in 5 µl PCR reactions using Phusion High-Fidelity DNA Polymerase (New England Biolabs, cat. #M0530L) with standard reaction conditions and the following heat cycle in a Bio-Rad C1000 Touch Thermal Cycler (Hercules, CA): 98 °C for 30 s, 35 cycles of 98 °C for 10 s, 60 °C for 30 s, and 72 °C for 15 s, followed by 72 °C for 10 min. PCR reactions are analyzed by agarose gel electrophoresis or on a Qiaxcel (Qiagen, Germantown, MD) with a DNA Screening cartridge and 320 s separation time. A sample of 30 members of the nifUSVWZM library (#1–#20 plus the five best- and worst-performing constructs) was sequence verified by Sanger sequencing. As intermediate parts from the assembly were present in several of the sequence-verified constructs, we could infer when a sequence error was present in an intermediate construct. In such cases, all final constructs bearing these intermediate parts (USV-2, frameshift in stop codon of nifV; USV-10, frameshift in nifV; and USV-9, duplication of nifU cistron part) were treated as carrying the same error. 
Constructs for which mutations were directly observed or could be inferred based on the assembly hierarchy include: #3, #4, #17, #18, #19, #20, #26, #33, #34, #38, #45, #46, #50, #57, #58, #62, #63, #69, #70, #74, #76, #81, and #83. Only the remaining 62 constructs were included in the analyses reported here, unless specifically noted (for example in characterizing part behavior via RNA-seq). Combinatorial libraries of nifHDKY, nifHDKYENJ, and the improved clusters v2.0 and v2.1, all of which are described in more detail in Supplementary Note 7, were assembled via isothermal assembly47 or PCR ligation. In each case, mutagenized plasmids were sequence verified for a region 500 bp upstream and downstream of the ligation joints. The RBS-swapping library was constructed by directly cloning cistron-level parts into level 1 MoClo plasmids. Level 2 plasmids containing each possible combination of the four-gene quarter clusters (HDKY; ENJB; QFUS; VWZM) were constructed by type IIs digestion and ligation as described above. Thirty-nine of the 40 level 3 plasmid reactions yielded colonies that were verified by multiplexed PCR to contain each of the nif genes. The top two gene clusters in the library were sequence-verified by Sanger sequencing.
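The 40-member RBS-swapping library combines 20 clusters from a fractional factorial design38 with 20 clusters chosen at random in silico. The Python sketch below shows one way such a set could be generated. It assumes that the fractional factorial design is a 20-run Plackett-Burman design built by the Paley construction over the integers modulo 19, that each of the 16 genes takes one of two RBS levels, and that the first 16 of the 19 available columns are assigned to genes. The design matrix actually used is given in the Supplementary Information, and the function names, level labels and random seed here are hypothetical.

```python
import random

GENES = ["nif" + letter for letter in "HDKYENJBQFUSVWZM"]


def plackett_burman_20():
    """20-run, 19-factor two-level design (entries +1/-1) via quadratic residues mod 19."""
    q = 19
    residues = {(k * k) % q for k in range(1, q)}
    # Generator row: +1 at position 0 and at every quadratic residue, -1 elsewhere.
    first = [1 if (k == 0 or k in residues) else -1 for k in range(q)]
    # 19 cyclic shifts of the generator row, followed by one all-minus run.
    rows = [[first[(j - i) % q] for j in range(q)] for i in range(q)]
    rows.append([-1] * q)
    return rows


def rbs_designs(n_random=20, seed=0):
    """Return 20 Plackett-Burman designs followed by n_random random designs."""
    factorial = [row[: len(GENES)] for row in plackett_burman_20()]
    rng = random.Random(seed)
    randomized = [[rng.choice((-1, 1)) for _ in GENES] for _ in range(n_random)]
    label = {1: "strong", -1: "weak"}
    return [
        dict(zip(GENES, (label[level] for level in row)))
        for row in factorial + randomized
    ]
```

In the factorial half every gene receives each RBS level in exactly 10 of the 20 clusters, and every pair of genes is balanced across the four level combinations, which is what allows main effects to be estimated from so few constructs. The random half carries no such guarantee.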

doi:10.1038/nbt.3063


RBS design. All new RBS parts were designed using the RBS Calculator online tool version (https://salis.psu.edu/software/) using one of two approaches. Targeted RBS strength changes were designed using the RBS Calculator Design mode version 1.1. Upstream and downstream sequences used in the calculation were selected from the original refactored nif gene cluster design. When RBS strength was optimized using a library-based approach (nifH, nifD, nifE, nifJ), the RBS Calculator ‘reverse engineering mode’ with Free Energy Model 2.0 was used to evaluate pools generated by saturated mutagenesis of 3–4 continuous bases in the RBS part. Specifically, in silico libraries were generated by randomizing each 3- or 4-bp window spanning from –30 to –3 of the AUG start codon. Loci producing large variations in the in silico predictions of at least 100-fold were selected for empirical testing and were constructed using degenerate oligonucleotides. Nitrogenase activity assay. Nitrogenase activity is determined in vivo via the previously described acetylene reduction assay6,49. Following 12 h of growth in LB-lennox medium supplemented with required antibiotics, each strain is grown in 2 ml growth medium (supplemented with required antibiotics) in 15 mL culture tubes for 14 h in an incubated shaker (30 °C, 250 r.p.m.). For certain high-throughput experiments, these cultures were grown in 1.1 mL of the same medium in 96-well deep-well culture plates incubated in an INFORS Multitron incubated shaker (30 °C, 900 r.p.m.). Cultures are diluted in 2 ml de-repression medium (supplemented with required antibiotics and inducers) to a final OD600 of 0.5 in 10 ml glass vials with PTFE-silicone septa screw caps (Supelco Analytical, Bellefonte, PA, cat. #SU860103). Headspace in the bottles was repeatedly evacuated and flushed with N2 gas using a vacuum manifold equipped with a copper catalyst O2 trap. After 5 h incubation at 30 °C and 250 r.p.m. 
in an incubated shaker, headspace was replaced with 1 atm argon. Acetylene was freshly generated from CaC2 in a Burris bottle, and 1 ml was injected into each bottle to start the reaction. Cultures were incubated at 30 °C, 250 r.p.m. for 5 h (nifHDKY library), 12 h (nifHDKYENJ library and full clusters v1.1, v2.0, and v2.1) or 17 h (all other libraries) before the assay was quenched by the addition of 500 µl of 4 M NaOH to each vial. In each assay, a wild-type control allowed activity to be expressed as a percentage of wild-type activity. Ethylene production was analyzed by gas chromatography on an Agilent 7890A GC system (Agilent Technologies, Inc., Santa Clara, CA, USA) equipped with a PAL headspace autosampler and flame ionization detector (FID) as follows. A 250 µl headspace sample preincubated to 35 °C was separated on a GS-CarbonPLOT column (0.32 mm × 30 m, 3 µm; Agilent) at 60 °C and a He flow rate of 1.8 ml/min. Detection occurred in an FID heated to 300 °C with a gas flow of 35 ml/min H2 and 400 ml/min air. Under these conditions, acetylene eluted at 3.0 min after injection and ethylene at 3.7 min. Ethylene production was quantified by integrating the 3.7 min peak using Agilent GC/MSD ChemStation Software. Cell growth was determined under identical conditions, with 500 µl of culture sampled 5 h after induction and diluted 1:1 with minimal medium to return cultures to within the linear range for optical density (OD600) measurement. Optical density was measured on a Varian 50 Bio UV-Vis spectrophotometer.

To generate the T7* RNAP expression versus normalized nitrogenase activity plots for Figure 2b, raw nitrogen fixation activities at each level of induction (Supplementary Note 6) were first corrected for T7* RNAP-independent activity by subtracting the latter value from each data point, with a lower boundary of 0 (i.e., corrected activity levels were not allowed to be negative).
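As an illustration, the background correction described above, together with the per-cluster normalization to maximum activity that the text describes next, might be sketched as follows. This is a minimal sketch rather than the authors' analysis code: the function name, the argument names and the example numbers are hypothetical, and a single T7* RNAP-independent activity value per gene cluster is assumed.

```python
def correct_and_normalize(raw_activities, rnap_independent_activity):
    """Background-correct and normalize one cluster's induction series.

    raw_activities: raw nitrogenase activities, one per induction level.
    rnap_independent_activity: activity measured without T7* RNAP
        (assumed here to be a single value per gene cluster).
    """
    # Subtract the T7* RNAP-independent activity; clamp at 0 so that
    # corrected activities are never negative.
    corrected = [max(a - rnap_independent_activity, 0.0) for a in raw_activities]

    # Normalize to the maximum corrected activity of this cluster across
    # the induction levels assayed (left unscaled if every value is 0).
    peak = max(corrected)
    if peak == 0.0:
        return corrected
    return [c / peak for c in corrected]


# Hypothetical raw activities at three induction levels, background of 10.
normalized = correct_and_normalize([5.0, 15.0, 25.0], 10.0)
```

In this example the lowest induction level falls below the background and is clamped to 0, while the highest level defines the normalized value of 1.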
Next, the activity was normalized to the maximum activity of each gene cluster across the range of induction levels assayed. As a measure of cluster robustness, we integrated under a third-order polynomial best-fit curve and report this value as a construct’s robustness.

Part characterization. Promoter parts were characterized by replacing the P23100 locus upstream of mRFP in characterization plasmid N110 and measuring fluorescence in K. oxytoca containing the controller plasmid (N249). Fluorescence was measured after 5.5 h of growth after induction under conditions identical to those of the nitrogenase assay, except that headspace was not replaced with N2. Arbitrary fluorescence measurements were converted to REU using conversion factors described in Supplementary Note 3. RBSs were similarly characterized by cloning 82 bp including the RBS and upstream and downstream sequences (i.e., 82 bases ending in the +21 position from the AUG start codon) as translational fusions to mRFP in N110 and measuring fluorescence under the conditions described above. As RBS values were previously reported as arbitrary units6, that convention was maintained in this study. To maintain a comparable range for these arbitrary units despite measuring fluorescence on a different instrument, a conversion factor of 45 was calculated by re-measuring all previously described RBSs under the conditions described above. Termination strength was measured as described previously6, and all termination strengths were determined using a T7 RNAP expression system. Characterization data and sequences for parts and constructs are described in Supplementary Note 11.

Strand-specific RNA-seq. For RNA-sequencing samples, total RNA was harvested from each of the nifUSVWZM library strains cultured in nitrogenase assay conditions as well as wild-type K. oxytoca M5al grown with and without IPTG. RNA preparation was initiated following 5.5 h of growth in inducing conditions. From 8 ml of culture, cells were spun down at 4 °C and 21,000 relative centrifugal force (rcf) for 3 min. After centrifugation, supernatant was discarded and cell pellets were flash frozen in liquid nitrogen for storage at –80 °C. RNA was isolated with the PureLink RNA Mini Kit (Life Technologies, Carlsbad, CA) according to the manufacturer’s instructions and further purified and concentrated with RNA Clean & Concentrator-5 (Zymo Research) to ensure sample quality. Sample qualities were determined by spectrophotometry on a NanoDrop ND-1000 (Thermo Fisher Scientific, Inc., Wilmington, DE) and are listed in the attached supplemental file (Supplementary Data File 3). Purified RNA samples were submitted for deep sequencing at the Broad Institute (Cambridge, MA). Strand-specific RNA-seq libraries were created by the Broad Technology Labs specialized service facility. For transcriptomics of full cluster v1.0 and its wild-type control, samples were prepared as described previously50.
For the remaining samples, individual sample RNA was fragmented and the 3′ end was tagged with a DNA oligonucleotide containing a sample tag and a partial 5′ Illumina adaptor. Uniquely tagged RNAs were then pooled and carried through rRNA depletion (Ribo-Zero Magnetic Kit (Bacteria); Epicentre, Madison, WI), cDNA synthesis, ligation to a second oligonucleotide containing a partial Illumina 3′ adaptor, and amplification with full-length barcoded Illumina adaptor primers to tag pools and generate strand-specific, sequence-ready RNA-seq libraries. The 96 libraries were created as 3 pools of 32 samples. Each pool was split and sequenced on two lanes of an Illumina HiSeq 2500. A reference genome for each design was assembled by combining the wild-type (WT) genomic sequence with a plasmid sequence predicted based on the specific design. Reads were trimmed of barcodes and aligned to the associated reference genomes with BWA version 0.7.4 using the default settings51. Strand-specific RPKM values were calculated using custom scripts that used the Bamtools API52. Read depth profiles were computed using the “mpileup -d 20000” function from the SAMtools suite53. For both computations, counts from the replicate lanes were added together. Experimental error in RPKM values was calculated using values obtained from biological replicates of eight strains (Supplementary Note 10). Expression levels from the two sets of data are highly correlated (R² = 0.92 relative to the x = y line). The full-cluster RNA-seq data presented for the original refactored gene cluster and the improved v2.1 gene cluster (Fig. 3b,c) were processed in the same way, except that the original refactored gene cluster data were collected on an Illumina MiSeq. Relative RPKM values (Fig. 3c) were calculated by comparing to data obtained from a wild-type sample run under the same conditions as each refactored strain.
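For readers unfamiliar with the unit, the RPKM calculation with pooled replicate lanes can be sketched as below. The source states only that custom scripts built on the Bamtools API were used, so this is an illustration of the standard RPKM definition (reads per kilobase of feature per million mapped reads), not the authors' implementation. The function names are hypothetical, and summing the total mapped reads across lanes before normalizing is an assumption.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def pooled_rpkm(lane_counts, lane_totals, feature_length_bp):
    """RPKM after adding together counts from replicate sequencing lanes.

    lane_counts: reads assigned to the feature (on the strand of interest)
        in each lane.
    lane_totals: total mapped reads in each lane (assumed to be summed
        across lanes in the same way as the feature counts).
    """
    return rpkm(sum(lane_counts), feature_length_bp, sum(lane_totals))


# Hypothetical example: a 1,000-bp gene sequenced on two lanes.
value = pooled_rpkm([500, 500], [500_000, 500_000], 1000)
```

For strand-specific values, only reads mapping to the sense strand of the feature would be included in `lane_counts`.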
To calculate the average pairwise ratio of the expression levels of refactored genes from the nifUSVWZM library, the fold change between each of the 15 possible pairwise combinations (e.g., U-S, U-V, U-W) was computed and represented as a fold change value ≥1. These pairwise ratios were averaged to generate the final metric D. The formal equation to calculate this metric is $D = \frac{1}{15}\sum_{i<j}\max\left(\frac{x_i}{x_j}, \frac{x_j}{x_i}\right)$, where $x_i$ is the expression level of gene $i$ and the sum runs over all 15 pairs of the six genes.
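The average pairwise ratio described above can be sketched in a few lines of Python. This is an illustrative sketch based on the prose definition (every pairwise fold change expressed as a value ≥1, then averaged), not the authors' script. The function name and the example expression values are hypothetical.

```python
from itertools import combinations


def average_pairwise_ratio(expression):
    """Mean fold change over all gene pairs, each expressed as a value >= 1.

    expression: mapping of gene name to a positive expression level
        (for example, an RPKM value).
    """
    ratios = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        high = max(expression[gene_a], expression[gene_b])
        low = min(expression[gene_a], expression[gene_b])
        if low <= 0:
            raise ValueError("expression levels must be positive")
        # Dividing the larger by the smaller value keeps every ratio >= 1.
        ratios.append(high / low)
    return sum(ratios) / len(ratios)


# Hypothetical expression levels for the six nifUSVWZM genes.
example = {
    "nifU": 100.0, "nifS": 200.0, "nifV": 50.0,
    "nifW": 100.0, "nifZ": 400.0, "nifM": 100.0,
}
D = average_pairwise_ratio(example)
```

With six genes, `combinations` yields the 15 pairs mentioned in the text; a cluster in which all genes are expressed equally gives D = 1, and larger values indicate more uneven expression.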
