The Future of Protein Engineering Post-AlphaFold2
Recently, the scientific community shared an exciting moment: a deep neural network model, AlphaFold2, developed by a dedicated team at DeepMind (a British artificial intelligence subsidiary of Alphabet Inc.) solved a challenge that most experts consider the 'holy grail' of biochemistry.
The DeepMind team competed in and won, by a significant margin, the biennial competition CASP (Critical Assessment of protein Structure Prediction). The competition was created in 1994 to assess the state of the art in protein 3D structure prediction. Participants (mainly academic research groups) must predict the three-dimensional structure of a protein from its amino-acid sequence in a double-blind fashion: neither the competitors nor the organizers know the solution at the time the predictions are made. Not only did the DeepMind team outperform research groups that had been working on this problem for years, leaving the entire community dumbfounded, but their predictions were also, in many cases, indistinguishable from the experimentally determined structures (the closest thing we have to the 'ground truth').
The exact details of how the DeepMind team achieved these remarkable results are still shrouded in mystery, since the source code hasn't yet been published. However, assuming the evaluation process was valid, we believe we are on the brink of a new era in protein science. In this post, we'll offer several hypotheses about how this monumental scientific progress, almost unheard of in biochemistry and biology, is going to transform protein research.
Since the start of the century, the cost of DNA sequencing has dropped at a rate far outpacing Moore's law.
(Image source: Illumina)
The implications of that cost reduction are staggering and would take several posts, or even entire books, to describe. One obvious implication is our ability to generate data on a scale unheard of just a few years ago. Many organisms as-yet unknown to science live in extreme environments such as the Arctic, geothermal vents, or the Dead Sea in Israel. Since many industrially relevant chemical reactions occur under extreme conditions (such as high salinity, temperature, or pH), natural enzymes from organisms adapted to such conditions could be highly valuable to us. A paper published in 2017 in Nature Biotechnology by Mukherjee et al. reported that sequencing 1,003 bacterial and archaeal genomes increased the number of known protein families by 10.5% (roughly 25 million different proteins).

Without deciphering the structures of those proteins (and enzymes in particular), our ability to adapt them for industrial purposes is limited, since knowing a protein's structure is key to understanding its function and mode of operation. Unfortunately, experimental determination of protein structures is difficult, time-consuming, and labor-intensive. This is where AlphaFold2 comes in. We hope that the new version of AlphaFold will enable us to elucidate the structures of many of those millions of protein sequences. Both the scientific community and industry could use this information to engineer many types of novel functionality that are currently out of reach, without compromising performance. For instance, a future protein engineer could combine properties from enzymes that operate in different environments, such as high activity at low temperatures together with oxidation resistance, conditions typical of our washing machines. Such enzymes would be far more efficient than what's used today and could reduce energy consumption and pollution.
An even more exciting prospect is creating novel enzymes with activities not found in nature but with potentially immense significance for humanity: for example, enzymes that convert CO2 into high-value molecules, create silicon-carbon compounds, or deactivate pollutants that we have introduced into the environment.
Protein and Ligand Docking
Predicting protein-protein interactions (PPIs) remains a major challenge in biochemical and medical applications. The interaction between proteins can be either transient (to accomplish a specific task or process, such as signal transduction or an enzymatic reaction) or permanent (such as the formation of a multimer).
PPI prediction is composed of several sub-problems, each challenging in its own right:
1. Predicting whether two (or more) proteins bind each other (or what concentration of each monomer is required for the complex to form).
2. Predicting the binding site (i.e., which residues form the interface between the two proteins).
3. Predicting the rigid-body orientation of the proteins in the complex.
4. Predicting the conformational change induced by the binding event.
The first sub-problem, predicting whether two proteins interact at all, has been handled relatively well from sequence data alone, with recent attempts reaching more than 95% accuracy.
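To make the idea concrete, here is a minimal sketch of the kind of sequence-only featurization that composition-based PPI classifiers build on. The function names and toy sequences are ours, not from any published model; a real method would feed richer features (conjoint triads, evolutionary profiles, or learned embeddings) into a trained classifier.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]

def pair_features(seq_a, seq_b):
    """Concatenate both partners' compositions into one feature vector.
    A classifier trained on known interacting/non-interacting pairs would
    consume this 40-dimensional vector."""
    return composition(seq_a) + composition(seq_b)

# Toy sequences, for illustration only.
features = pair_features("MKTAYIAKQR", "GAVLIMCFWP")
```

The appeal of such sequence-only features is that they require no structure at all, which is exactly why the remaining sub-problems, where geometry matters, are so much harder.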
Sub-problems 2-4 are harder to predict from sequence alone and could probably gain a significant performance boost from structural information. For example, it has long been hypothesized that solvent-accessible regions of a protein that participate in PPIs have different properties than solvent-accessible regions that are not part of a protein-protein complex. For instance, regions of proteins that form interfaces with other proteins were shown to be enriched in hydrophobic residues (particularly Trp). When charged residues are found in an interface, they are typically paired with oppositely charged residues on the binding partner. We therefore believe that protein docking is the next challenge to be 'solved', and the implications of that future achievement could be even more far-reaching.
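The two interface properties above, hydrophobic enrichment and charge complementarity, are simple enough to check programmatically. The residue lists and contact pairs below are made up for illustration; real analyses would extract them from solved structures.

```python
# A common hydrophobic grouping of one-letter amino-acid codes (Trp = W).
HYDROPHOBIC = set("AVLIMFWC")
# Side-chain charges at roughly neutral pH.
CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1}

def hydrophobic_fraction(residues):
    """Share of residues (one-letter codes) that are hydrophobic."""
    return sum(r in HYDROPHOBIC for r in residues) / len(residues)

def oppositely_charged_pairs(contact_pairs):
    """Count contacting residue pairs whose side-chain charges are opposite,
    i.e., candidate salt bridges across the interface."""
    return sum(1 for a, b in contact_pairs
               if CHARGE.get(a, 0) * CHARGE.get(b, 0) < 0)

interface_residues = list("WLFKDVA")            # hypothetical interface patch
contacts = [("K", "E"), ("D", "R"), ("L", "V")]  # hypothetical cross-interface contacts
n_salt_bridges = oppositely_charged_pairs(contacts)
```

A docking pipeline would use statistics like these (computed over accessible surface patches of predicted structures) to score candidate interfaces.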
From Protein Docking to the Bigger Picture - Metabolic and Signal Transduction Pathways
Cells regularly interact with their surrounding environment. They receive inputs such as changes in acidity or salt concentrations, and respond with an output (action), such as secreting hormones or increasing the number of certain types of proteins within the cell (e.g., heat-shock proteins that help other heat-sensitive proteins fold). The flow of information from the outside and within the cell depends, above all, on the interactions of different proteins with other proteins or ligands.
The effort to map the "interactomes" (the networks of protein-protein interactions of an organism) of model organisms such as E. coli and S. cerevisiae started back in the early 2000s, using both experimental methods such as yeast two-hybrid (Y2H) and computational methods such as clustering analysis and data mining of the published literature.
However, protein-docking algorithms have rarely been applied to predicting large-scale interaction networks, owing to the lack of high-quality structural data and the lack of robust, cost-effective, accurate methods for predicting interactions at a fine level of granularity. Case in point: some proteins interact with their partners only when phosphorylated or modified with other functional groups; those modifications induce a conformational change that facilitates the interaction between the monomer and the rest of the complex. Incorporating structural data on a massive scale could capture an interactome at this level of resolution.
As we obtain a better picture of an increasing number of metabolic pathways and signal transduction networks, we can perturb and engineer them with higher confidence. We'll know, for example, how a particular mutation in gene X affects the metabolism of drug Y, or, given a set of mutated genes in a cancer cell, exactly which gene (or node, in network-analysis jargon) to target to induce apoptosis or mediate the cell's destruction by immune cells.
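The "which node to target" idea can be sketched on a toy graph. The miniature network below is loosely modeled on EGFR signaling but is entirely made up; real analyses apply far richer measures (betweenness, controllability) to experimentally mapped interactomes, and degree centrality here is only the simplest possible proxy for a node's importance.

```python
# Hypothetical interaction network: each protein maps to its direct partners.
network = {
    "EGFR":  {"GRB2", "SRC", "PLCG1"},
    "GRB2":  {"EGFR", "SOS1"},
    "SOS1":  {"GRB2", "KRAS"},
    "KRAS":  {"SOS1", "RAF1"},
    "RAF1":  {"KRAS"},
    "SRC":   {"EGFR"},
    "PLCG1": {"EGFR"},
}

def degree_centrality(net):
    """Fraction of the other nodes each protein interacts with directly."""
    n = len(net) - 1
    return {node: len(neighbors) / n for node, neighbors in net.items()}

centrality = degree_centrality(network)
hub = max(centrality, key=centrality.get)  # the most connected node
```

In this toy example the hub is the receptor itself; in a real interactome, the same ranking (with better centrality measures) is one way to shortlist drug targets.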
A Word of Caution
Machine learning models, including deep neural networks, excel in domains where the data is high-dimensional and complex. Humans struggle to extract useful information (AKA "inference") from such datasets without the help of ML models. This also creates a challenge: the models become a sort of 'black box' where data goes in and predictions come out, but we users are not privy to how those predictions were derived.
Why is it a problem for science?
Imagine a scenario in which computing and deep neural networks were developed 100 years ago. In that scenario, Albert Einstein wouldn't have developed the theory of relativity. Rather, he'd have developed a deep neural network that bridges the gap between classical mechanics and observations made by scientists such as Michelson and Morley.
From a practical standpoint, humanity would have been fine: all of our satellite communications would have worked flawlessly, and any device on Earth that needed relativistic calculations could have used the algorithm developed by our Einstein duplicate.
The lesson from this thought experiment is clear. Unless we want to outsource our science, we must not give up on basic research and scientific literacy (colloquially referred to as 'domain expertise').
In the context of AlphaFold and the tremendous promise it holds, we should remember that we still do not fully understand the forces that govern protein folding and other processes that occur at the quantum level. Therefore, we shouldn't mark this challenge as 'solved' just yet.