Charges are crucial to protein solubility, and in turn solubility is critical for handling proteins at high concentrations in biotechnology and biopharmaceutical applications. Along with Robin Curtis and other collaborators at the University of Manchester and in companies (in particular MedImmune), we are looking at various aspects, including the pH and ionic strength dependencies of solubility. These variables are relatively easy to treat computationally, so models can be developed alongside biophysical studies. It is more challenging to include the role of excipients and buffers in the models; we are a long way from a sufficient understanding in this area, which therefore represents a major challenge.
Along the way, using bioinformatics and available 'omics data, we are discovering novel features of proteins, such as an apparent evolutionary preference for lysine over arginine in proteins that are expressed at higher levels. We believe that the approach of combining bioinformatics, 'omics data resources, biophysical measurements and computational chemistry will deliver useful models for protein solubility and aggregation propensity.
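The pH dependence of protein charge mentioned above can be illustrated with a minimal sketch that treats every ionisable group as independent and applies the Henderson-Hasselbalch relation. This is a textbook approximation, not our solubility model: the pKa values, the function names and the example sequence are all assumptions made for illustration, and interactions between groups and ionic strength effects are ignored.

```python
# Approximate model pKa values and the sign of the ionised form.
# These numbers are illustrative assumptions, not fitted values.
SIDECHAIN_PKA = {
    "D": (4.0, -1), "E": (4.4, -1), "C": (8.5, -1), "Y": (10.0, -1),
    "H": (6.3, +1), "K": (10.4, +1), "R": (12.0, +1),
}
N_TERMINUS = (8.0, +1)
C_TERMINUS = (3.6, -1)


def fractional_charge(pka, sign, ph):
    """Average charge of one independent group (Henderson-Hasselbalch)."""
    if sign < 0:
        # Acidic group: charged when deprotonated.
        return -1.0 / (1.0 + 10.0 ** (pka - ph))
    # Basic group: charged when protonated.
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def net_charge(sequence, ph):
    """Sum the average charges of the termini and ionisable sidechains."""
    groups = [N_TERMINUS, C_TERMINUS]
    groups += [SIDECHAIN_PKA[aa] for aa in sequence.upper() if aa in SIDECHAIN_PKA]
    return sum(fractional_charge(pka, sign, ph) for pka, sign in groups)


def isoelectric_point(sequence, low=0.0, high=14.0, tol=1e-4):
    """Bisection on pH; net charge falls monotonically as pH rises."""
    while high - low > tol:
        mid = 0.5 * (low + high)
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


if __name__ == "__main__":
    example = "MKKDEELLRHAC"  # hypothetical sequence, for illustration only
    for ph in (3.0, 5.0, 7.0, 9.0):
        print(f"pH {ph:4.1f}  net charge {net_charge(example, ph):+.2f}")
    print(f"estimated pI {isoelectric_point(example):.2f}")
```

A sketch of this kind gives only the sequence-level trend. Structure-based calculations are needed to capture the pKa shifts that arise from the protein environment.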
In studies over several years, originating in collaborations with Paul Gane, Robert Freedman and Jim Bardwell, we have chosen the thioredoxin superfamily to develop and test our model for calculating the modulation of redox potential by protein environment. The geometry around the core CxxC redox motif is generally conserved within the superfamily, putting the emphasis on the amino acid changes in the neighbourhood. In this figure cysteine pKa (which tracks redox potential) varies according to both the identity of the xx residues and the framework structure within the superfamily.
Our continuum electrostatics model can progress in two directions: (i) through combination with more detailed models for redox potential variation, providing a scheme that accounts for chemical as well as structural change, and (ii) through wider-scale application to structures and models. Following this second line, in a superfamily with a conserved structural motif, it should be possible to build comparative models in which it is largely the sequence and amino acid sidechain variation that dictates functional differences. For a larger superfamily, with thousands of representatives in the sequence databases, this gives an opportunity to make large-scale predictions of redox function, adding a structural element to sequence-based functional annotation.
We are also interested in how redox potential and substrate specificity combine in particular sub-families, such as the Dsb proteins and the PDI family.
In a new collaboration with Chris Blanford and Sam DeVisser (both also in the MIB), undertaken by PhD student Nick Fowler, we are looking at computational models for redox potential more generally.
The era of Structural Genomics has reminded us how little we understand about structure-based, as opposed to purely sequence-based, annotation of protein function. Some things are clear, such as the observation that enzyme active sites tend to occur in relatively large clefts. We have found that this can be quantitated neatly with a pseudo-charge calculation, where a field is calculated from a protein that is uniformly charged over its volume. This allows us to find optimal values for predicting enzyme versus non-enzyme, and also gives an accurate method for locating an active site. We can also address the question we started out wanting to answer: what physical and chemical characteristics make an enzyme active site? For example, the intermediate plots here are catalytic and non-catalytic antibodies. In these terms, neither antibody version appears particularly well-suited to catalysis.
Application of this method to proteins of known structure but unknown function yields an estimate that well under half of these are enzymes (to the right of the vertical dotted line in this figure).
It is possible to use any property in our pseudo-charge calculation, such as sequence profile values obtained from a multiple alignment. This fusion of evolutionary trace and physics-based methods can be a useful addition to functional site finding algorithms. Ultimately we want to get back to the question we originally asked (but didn't answer), not for functional sites in general: What energetic properties can we calculate to give clues to function?
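The idea behind the pseudo-charge calculation can be sketched with a toy example. Here every atom carries the same unit charge, as a crude stand-in for a uniform volume charge, and a simple 1/r sum is evaluated at probe points. The spherical lattice "protein", the carved cleft, the function names and the distance cutoff are all assumptions for illustration; our published calculations use a continuum electrostatics treatment rather than this bare sum.

```python
import math


def uniform_charge_potential(atoms, probe, min_distance=1.0):
    """Sum of 1/r from identical unit charges on every atom (arbitrary units)."""
    total = 0.0
    for atom in atoms:
        # Clamp very short distances to avoid a singularity.
        total += 1.0 / max(math.dist(atom, probe), min_distance)
    return total


def toy_protein_with_cleft(radius=10.0, spacing=2.0, cleft_radius=3.0):
    """Lattice of pseudo-atoms filling a ball, with a cylindrical cleft along +z."""
    n = int(radius // spacing)
    axis = [i * spacing for i in range(-n, n + 1)]
    atoms = []
    for x in axis:
        for y in axis:
            for z in axis:
                inside_ball = x * x + y * y + z * z <= radius * radius
                in_cleft = z > 0 and x * x + y * y <= cleft_radius ** 2
                if inside_ball and not in_cleft:
                    atoms.append((x, y, z))
    return atoms


if __name__ == "__main__":
    atoms = toy_protein_with_cleft()
    cleft_probe = (0.0, 0.0, 2.0)      # sits at the bottom of the cleft
    surface_probe = (0.0, 0.0, -12.0)  # sits just outside the convex surface
    print("cleft  :", round(uniform_charge_potential(atoms, cleft_probe), 1))
    print("surface:", round(uniform_charge_potential(atoms, surface_probe), 1))
```

Both probes lie the same distance from their nearest pseudo-atom, but the probe in the cleft is surrounded by more of the uniformly charged volume and so sees a larger potential. That is the sense in which a single calculated number can flag large clefts.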
Ion channels represent a crossover between several areas of research: physiology and biophysics and, more recently, structural biology and molecular simulation. For potassium channels, a variety of functional differences such as gating and conductance are superposed on what appears to be a relatively uniform mechanism of ion translocation, based on the selectivity filter structure. One of the variations is pH-dependence of conductance, and (in a collaboration with Mark Boyett) we have used calculations and sequence analysis to rationalise pH properties where they are known in the Kv1 family, and to predict them in other cases.
A further collaboration, with James Magee, has allowed us to model differences in free energy of binding for protein-nucleic acid systems, in the first instance mRNA and eIF4E. With a simplified model for the non-specific component in complexes that exhibit a tethering point, we predict how the binding energy varies as charge complementarity changes. In our test case, this was between different eIF4E forms. Within the constraints of our model, we see the balancing of enthalpic and entropic contributions and how the average mRNA path can vary quite dramatically, according to the underlying protein charge distribution. Thus different eIF4E isoforms may have different affinities for capped mRNAs. Our current RNA model does not include secondary structure, and therefore needs improvement to study whether there could also be some selectivity on the RNA side, perhaps mediated by base-pairing and charge density.
Our latest study in the biophysics area compares proteins from mesophiles and thermophiles. This is a well-trodden path computationally, but we have come across a couple of novel observations. Firstly, the predicted average increase in stability due to interactions between ionisable groups, for proteins from hyperthermophiles, does not correlate with the number of ionisable groups. Secondly, and perhaps contrary to expectation, amino acids with bulky non-polar sidechains (such as tryptophan) appear on average to be more solvent exposed in proteins from organisms with higher growth temperatures.
A major use of continuum electrostatics models has been the analysis of pH-dependent properties, such as the free energy of folding (schematic figure to the right). Although individual salt-bridges are often relatively weak, their cumulative effect is demonstrated by the extent to which many protein folded states lose stability at acidic pH. This loss of stability is of interest in the analysis of mis-folding diseases, but the influence of pH-dependence has a wide molecular and physiological base, for example in control of enzyme activity, uncoating for some viruses, and intracellular trafficking.
Our group has been involved with the development of methods to calculate pKa values and pH-dependence. It is important to devise a scheme in which the larger pKa shifts of more buried ionisable groups are handled alongside the more typical smaller shifts for groups that are surrounded by water. We have combined Finite Difference Poisson-Boltzmann and Debye-Hückel methods into a hybrid FD/DH algorithm that accomplishes this task. This work underpins much of our analysis of various systems, for example comparison of proteins from thermophilic and mesophilic organisms, potassium binding in ion channels, and study of proteins according to subcellular location.
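To give a feel for the Debye-Hückel part of such a scheme, the sketch below evaluates a screened Coulomb interaction between two point charges in uniform water, and the pKa shift that a single neighbouring charge would produce. It is not the hybrid FD/DH algorithm, which also needs a finite difference Poisson-Boltzmann treatment of buried groups. The function names, the physical constants (water at about 25 °C, 1:1 electrolyte) and the uniform dielectric of 78.4 are assumptions for illustration.

```python
import math

# Assumed constants for water at about 298 K.
COULOMB_KJ_MOL_ANGSTROM = 1389.35  # e^2 / (4 pi eps0), in kJ/mol x Angstrom
DEBYE_LENGTH_1M_ANGSTROM = 3.04    # Debye length at 1 M ionic strength (1:1 salt)
LN10_RT_KJ_MOL = 5.708             # 2.303 RT at 298 K


def debye_huckel_energy(q1, q2, distance, ionic_strength, dielectric=78.4):
    """Screened Coulomb energy (kJ/mol) for charges q1, q2 (units of e).

    distance is in Angstrom and ionic_strength in mol/L.
    """
    kappa = math.sqrt(ionic_strength) / DEBYE_LENGTH_1M_ANGSTROM  # 1/Angstrom
    coulomb = COULOMB_KJ_MOL_ANGSTROM * q1 * q2 / (dielectric * distance)
    return coulomb * math.exp(-kappa * distance)


def pka_shift(neighbour_charge, distance, ionic_strength, dielectric=78.4):
    """pKa shift of a titratable group caused by one fixed neighbouring charge.

    A negative neighbour favours the protonated form and raises the pKa;
    a positive neighbour lowers it. This holds for acids and bases alike.
    """
    per_unit_charge = debye_huckel_energy(
        1.0, neighbour_charge, distance, ionic_strength, dielectric
    )
    return -per_unit_charge / LN10_RT_KJ_MOL


if __name__ == "__main__":
    for ionic_strength in (0.0, 0.05, 0.15, 0.5):
        energy = debye_huckel_energy(+1, -1, 4.0, ionic_strength)
        shift = pka_shift(-1, 4.0, ionic_strength)
        print(f"I = {ionic_strength:4.2f} M  U = {energy:6.2f} kJ/mol  dpKa = {shift:+.2f}")
```

With water-like screening the shift from a single solvent-exposed neighbour is modest. This is why a separate treatment is needed for buried groups, where the low-dielectric protein interior makes the shifts much larger.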
It is simple to test a calculated property against functional divisions that are populated in the PDB. In this example, we derive a set of protein structures that are annotated by subcellular location. We then see that the pH at which these proteins are predicted to be most stable appears, on average, to track the pH of the subcellular compartment.
A recurring theme in structural bioinformatics, given the number of coordinate files available, is to search for structure-function correlations that become evident over sets of proteins.