Parallel Protein Expression Strategies for Structural Biology


Dominic Esposito, Ph.D.

Protein Expression Laboratory, NCI-Frederick


The genomics era has opened up an overwhelming number of possibilities for structural biologists, with tens of thousands of new proteins waiting to be explored. The most interesting of these are human genes and closely related homologs that encode proteins involved in various aspects of human disease. Unfortunately, the biggest bottleneck in exploring this large protein space lies at the level of protein expression. While most prokaryotic genes are readily expressed in soluble form in Escherichia coli, many genes from eukaryotes, particularly those from humans, are very difficult to express in native form in heterologous organisms. Expressing soluble proteins in sufficient purity and quantity for structural biology is a pressing concern now being addressed in nearly every major university, government, and industrial setting.


At the NCI-Frederick Protein Expression Laboratory (PEL), we have taken the approach of developing highly parallel methods for protein expression. Unlike genomic techniques such as cloning, protein expression rarely offers any consistency from experiment to experiment. Each protein is a unique molecule and must be handled as such; we can draw conclusions from various experimental parameters, but so far we cannot accurately predict ahead of time how a particular protein will behave. For that reason, we view a parallel approach as the most time- and cost-efficient way to deal with individual proteins of interest.


Such a parallel scheme begins at the cloning stage, which requires a reliable and parallel cloning methodology. Currently, the best available system is the recombination cloning system called Gateway (Invitrogen Corporation), which allows simple subcloning into multiple vectors without the need for restriction enzymes, PCR, or sequencing. In brief, an initial sequence-verified clone called an Entry clone is created, and this clone can then be transferred in a 1-hour recombination reaction to any number of vectors with various tags, promoters, host specificities, etc. These Destination Vectors can also be constructed from other commercially available vectors by a simple ligation of a selectable marker cassette that carries the recombination signal sequences, allowing users to convert their favorite vectors to the system. Currently, the PEL has over 150 such vectors in its library.


To facilitate downstream processing (expression and solubility are of little use to the structural biologist if the protein cannot be purified), we add an amino-terminal protease cleavage site to our clones. Normally, this site is the 7 amino acid recognition sequence of the tobacco etch virus (TEV) protease. TEV has many advantages over the more commonly used thrombin, factor Xa, and enterokinase, the most important being its high specificity: it rarely if ever cleaves anywhere other than its recognition sequence. In addition, TEV cleaves at the end of its recognition sequence and leaves only a single extra residue on the target protein after cleavage, and that residue can be nearly any amino acid except proline. This yields the most native structure possible, with a minimum of extra amino acids present.
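The cleavage behavior described above can be sketched in a few lines of Python. This is only an illustration, not the PEL's tooling: the ENLYFQ core used here is the commonly cited TEV consensus and is an assumption (the text gives only the length of the site), and the function name and example sequences are hypothetical.

```python
import re
from typing import List

# Assumed consensus: E-N-L-Y-F-Q followed by one more residue (the 7th position).
# Cleavage occurs after the Q, so only that last residue remains on the target.
# Proline is excluded at that position, following the text above.
TEV_SITE = re.compile(r"ENLYFQ(?=[^P])")

def tev_cleave(fusion: str) -> List[str]:
    """Split a fusion protein sequence after the Q of every TEV site."""
    fragments = []
    start = 0
    for match in TEV_SITE.finditer(fusion):
        cut = match.end()          # index just past the Q
        fragments.append(fusion[start:cut])
        start = cut
    fragments.append(fusion[start:])
    return fragments

# Hypothetical His6-tagged construct: tag + TEV site + target.
print(tev_cleave("MHHHHHHENLYFQGMKTAYIAK"))   # ['MHHHHHHENLYFQ', 'GMKTAYIAK']
# Proline after the Q: the site is not treated as cleavable in this sketch.
print(tev_cleave("MHHHHHHENLYFQPMKTAYIAK"))   # ['MHHHHHHENLYFQPMKTAYIAK']
```

In the first example the released target keeps a single extra glycine at its amino terminus, which is the "single residue" left behind after cleavage.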


For structural biologists, E. coli is still the organism of choice. It produces large biomass, is very cheap to grow, and avoids problems with post-translational modifications such as glycosylation and phosphorylation, which might lead to a heterogeneous population of proteins. In addition, most structural biology labs have the equipment needed to grow and harvest E. coli. Unfortunately, as mentioned above, E. coli is not a particularly friendly host for many human proteins. Expression of large proteins (>60 kDa) in E. coli is very difficult, and toxicity and solubility remain huge hurdles. Some alterations in the conditions of expression can help reduce the solubility problem. Dropping the temperature at induction to 16 °C has been extremely helpful in increasing solubility. Other, more hit-or-miss variables include E. coli strain choice, addition of chaperones, and additives to the media. On occasion these can help, but so far there is no consistent way to tell ahead of time what will help a particular protein. Again, this argues for maximizing parallel processing of samples. To assist in this process, we have developed a protocol for simultaneous growth of 24 E. coli cultures using a customized shaking plate reader, which allows real-time monitoring of ODs and control over individual inductions. This makes it possible to test multiple conditions at the same time, increasing throughput, lowering technician hands-on time, and increasing the chances of finding the best conditions for expression and solubility.
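As a rough illustration of how a 24-culture screen might be laid out, the sketch below crosses three variables into a 4 x 6 well map. Everything specific in it is an assumption made for the example: the strain names, the induction OD values, the additive labels, and the function name `layout_24` are not taken from the PEL protocol.

```python
from itertools import product

def layout_24(strains, induction_ods, additives):
    """Map every combination of the three variables onto wells A1..D6."""
    conditions = list(product(strains, induction_ods, additives))
    if len(conditions) != 24:
        raise ValueError(f"expected 24 conditions, got {len(conditions)}")
    # Rows A-D, columns 1-6, filled row by row.
    wells = [f"{row}{col}" for row in "ABCD" for col in range(1, 7)]
    return dict(zip(wells, conditions))

# Illustrative choices only (4 strains x 3 induction ODs x 2 additive settings).
plate = layout_24(
    strains=["BL21(DE3)", "Rosetta(DE3)", "Origami(DE3)", "C41(DE3)"],
    induction_ods=[0.4, 0.8, 1.2],
    additives=["none", "additive_X"],
)

for well, (strain, od, additive) in plate.items():
    print(f"{well}: {strain}, induce at OD {od}, additive: {additive}")
```

Because the strain is the outermost variable, each plate row holds one strain, which keeps the layout easy to read at the bench. Other variables from the text, such as chaperone co-expression, could be swapped in the same way as long as the product stays at 24.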


Even with all of these variables, E. coli still frequently fails to produce large amounts of soluble protein. For this reason, the development of solubility tags has been a major goal of most protein expression labs. Although many tags have been developed that occasionally increase solubility (thioredoxin, GST), the workhorse tags in our lab are the maltose binding protein (MBP) and an E. coli protein called NusA. Though both are quite large (40 kDa and 65 kDa, respectively), they offer the highest likelihood of success in enhancing solubility in E. coli, often producing >50% soluble protein in cases where the native protein is completely insoluble. In optimal cases, the size of the tag is irrelevant, since it can be cleaved off after purification. In reality, the tag can still be a problem, as sometimes it will not cleave properly, or the protein will simply precipitate after cleavage. In these cases, finding a different host is often the next step. To simplify purification, we fuse a His6 tag to the amino terminus of the solubility tag, allowing easy purification by IMAC; when things work well, the structural biologists are on their way to getting crystals.


When E. coli fails to work, either due to insolubility or poor protein performance, Gateway allows us to rapidly turn to other systems. Our current favorite second choice is the baculovirus expression system used in insect cells. Baculovirus produces significant quantities of protein (not nearly as much as E. coli, but far better than mammalian cell culture), and though not simple to set up, it requires much less effort than other eukaryotic systems. The biggest problem with insect cell expression is the upfront time required: the subcloned gene must be introduced by transposition into the baculovirus genome, which must then be used to infect insect cells. Virus is then collected and titered prior to amplification and infection of cultures for a time course analysis. This step is necessary because individual proteins often express maximally at different time points. The whole process takes 2-3 weeks, so if there is any expectation of failure in E. coli (a very large protein, or a protein with many disulfide bonds, for instance), one should consider starting the baculovirus work early on. As in E. coli, solubility tags also function in insect cell expression. Both GST and MBP have been shown to increase solubility of proteins in insect cells, and coupling a His6 tag with these tags allows for convenient purification of proteins. We have had a lot of success with insect cell expression in cases where E. coli has failed to produce any quantity of soluble protein.


Mammalian cell expression is a much more complicated endeavor and generally yields less protein, which makes it less useful for structural biologists. However, in recent years developments in episomal mammalian cell expression constructs using the Epstein-Barr nuclear antigen (EBNA) have moved high-yield mammalian expression closer to reality. Vectors carrying EBNA, when expressed in the proper cell lines, can replicate to high copy number in the cell and can produce significant quantities of protein. We are currently exploring the expression and solubility characteristics of these constructs and comparing them to other systems.


No discussion of protein expression for structural biology is complete without some comment on the “total failures”: those proteins which refuse to express well no matter what system they are tried in. In these cases, I would argue that it is not worth giving up completely if you believe that the protein is important enough to warrant study. If so, it may be worth attempting to purify domains or fragments of the protein of interest; though this may not answer global structural questions, it may still prove quite useful in understanding some aspects of the protein. We have recently employed a scheme in which bioinformatics was used to break a particularly recalcitrant protein into 7 domains, which were expressed in all the various contiguous combinations to make 24 different proteins. While the parent protein was totally insoluble in E. coli, even with solubility tags, several of the subdomains were quite soluble and reasonably well behaved, including some large fragments encompassing half of the protein. With nothing at all known about this particular protein, even some structural information on half of the protein may go a long way toward understanding its function.
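The idea of expressing contiguous domain combinations can be sketched as a simple enumeration. The domain labels and the function name below are hypothetical, and the domain boundaries themselves would come from the bioinformatic analysis, which is not reproduced here.

```python
def contiguous_fragments(domains):
    """Return every run of adjacent domains, from single domains to full length."""
    n = len(domains)
    return [
        tuple(domains[start:end])
        for start in range(n)
        for end in range(start + 1, n + 1)
    ]

# Placeholder labels for a 7-domain protein.
domains = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]
fragments = contiguous_fragments(domains)

print(len(fragments))    # 28
print(fragments[0])      # ('D1',)
print(fragments[6])      # ('D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7')
```

An exhaustive enumeration of 7 domains gives n(n+1)/2 = 28 contiguous fragments, including the full-length protein. The 24 constructs described above would therefore be a subset of this list; which combinations were left out is not stated in the text.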


The most important take-home message for structural biologists is that protein expression is never a simple task. You should expect in many cases, particularly those involving large proteins or proteins from mammalian genes, that a significant amount of effort will be required to find the right conditions for expression. That leaves two choices: either take only the low-hanging fruit by ignoring proteins that don’t express in soluble form under one set of conditions, or be prepared to pursue parallel approaches to find the best conditions. The former approach is the one being taken by many of the large proteomics labs and structural genomics collaborations; I would argue, however, that these efforts will end up missing a lot of very interesting proteins which could be had for just a little bit of effort upfront and proper attention to the importance of trying many different things. X-ray crystallographers should appreciate this notion quite well, since it has been used for many years in the search for the best crystallization conditions. Given the choice between trying the 2 or 3 best formulations and screening 96 or 384 conditions at a time, I know where most crystallographers would put their money, and protein expression should be viewed in much the same way.