The Integration of Artificial Intelligence into Synthetic Biology to "Solve" the Protein Folding Problem

Using deep learning and AlphaFold to predict three-dimensional protein structures and unlock new possibilities in medicine and synthetic biology.

Within the next decade, using the power of artificial intelligence, we will be able to craft our own custom organisms to produce any organic medicine. I know, crazy, right? If I had told you back in 2019, "In the next five years, we will get artificial intelligence to generate any image from a prompt and instantly write hundreds of lines of code," you would probably have called me crazy too, but here we are. Artificial intelligence can now complete in seconds tasks that could take a person hours. The rapid growth of artificial intelligence within the last few years has been incredible, and we are now applying this amazing technology to the vast field of biology.

Companies such as Google DeepMind have been developing artificial intelligence (AI) models to help crack some major problems in biology, specifically the protein folding problem. If we solve the folding problem, we will open the door to limitless applications in the fields of medicine, biotechnology, and synthetic biology.

The Protein Folding Problem

Proteins Defined

All proteins are composed of many amino acids that are attached together in a chain-like structure (polypeptide), with some proteins being made up of multiple chains. These polypeptides fold, twist, and stack via molecular forces, which are just the forces between different atoms in the structure. Each protein has its own unique structure, composition, and function. The function of a protein is directly correlated to its three-dimensional structure, so to predict the protein's function, we must know its structure.

Levels of a protein's structure — Biology Dictionary

Proteins are used throughout the body. One common example is digestive enzymes such as lipase and protease, which help chemically break down lipids and proteins into nutrients for the body to absorb. These enzymes follow a lock and key model, where the substrate (the substance being broken down) fits perfectly into the active site of the enzyme, which is where the substrate is broken down. This demonstrates the importance of the three-dimensional structure of a protein. If the protein structure were altered, the active site could change shape, preventing the substrate from fitting into it to be broken down.

Visual diagram of the lock and key model of an enzyme — Biology Brain

The Folding Problem

The protein folding problem is the challenge of predicting the three-dimensional structure of a protein from its amino acid sequence. We can easily determine the amino acid sequence of a polypeptide, but predicting its three-dimensional shape has historically been far harder, as the chain can adopt multiple three-dimensional orientations which are all stable. If we cannot effectively predict the three-dimensional shape of the protein, we cannot determine its functions and properties. Without a known function, the protein is of little practical use to us. It's like having all the pieces of a machine without knowing what the full machine even looks like, let alone what the machine's functions are. How are you supposed to build and use that machine?

The different possible structures of the same polypeptide — Nature Biotechnology

Why We Were Slowed Down

Before the incorporation of AI into the field of synthetic biology, researchers struggled to simulate and replicate the three-dimensional structures of proteins. Replicating proteins over and over again was incredibly expensive, costing millions of dollars and years of repetitive lab procedures. This process involved replicating the environment which supports the synthesis of a specific protein, then adding all the components of the protein to begin the synthesis. The protein would then be investigated to identify its structure and functions. This process was incredibly tedious, as even the slightest changes in pH and temperature can greatly alter the structure of a protein, potentially changing its function. However, with the new addition of AI simulations, we can now predict these structures at near-zero cost. But how exactly are we using AI to predict these structures?

The Integration of Artificial Intelligence

The branch of AI that is primarily used in synthetic biology to solve the protein folding problem is deep learning. Deep learning is a type of machine learning that uses artificial neural networks to interpret and learn from data. These artificial neural networks are loosely modelled on the human brain, and are used for tasks such as image recognition, pattern recognition, and natural language processing.

These neural networks can have multiple levels of depth, where instead of one layer of processing, inputted information goes through multiple levels of analysis to determine an output. These types of networks are called deep neural networks, and are the primary type of network used in deep learning to address the protein folding problem.
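To make the idea of stacked layers concrete, here is a minimal Python sketch of a forward pass through a small fully connected network. The layer sizes, the tanh activation, the random weights, and the function names are all illustrative assumptions of mine; the networks used for protein structure prediction are far larger and structured differently.

```python
import math
import random

def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum of the inputs plus a bias, squashed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def make_layer(n_in, n_out, rng):
    """Random (untrained) weights for a layer with n_in inputs and n_out outputs."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(inputs, layers):
    """Pass the input through every layer in turn; each output feeds the next layer."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

rng = random.Random(0)
# Layer sizes 4 -> 8 -> 8 -> 2 are arbitrary, chosen only for illustration.
sizes = [4, 8, 8, 2]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = forward([0.5, -0.2, 0.1, 0.9], layers)
print(len(layers), "layers ->", len(output), "outputs")
```

The "depth" is simply the length of the `layers` list: adding more entries to `sizes` adds more levels of analysis between the input and the output.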

Deep learning is incorporated into synthetic biology to analyse the patterns between a protein's amino acid sequence and its three-dimensional shape. Previously determined three-dimensional protein structures are given to the AI model, which then analyses the interactions between the components of each structure. These interactions are then used to predict the three-dimensional structures of other proteins with different amino acid sequences. This process is repeated until the AI model can consistently predict the three-dimensional structure of a protein, and the model is then continuously built upon as new protein structures are discovered, drawing on an ever-expanding dataset.

Diagram explaining an example of a deep neural network — Altexsoft
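The "predict, check against known answers, adjust, repeat" cycle described above can be sketched with a deliberately tiny stand-in problem. This is not how AlphaFold is trained; it is a toy, assuming a one-input linear model fitted by gradient descent to made-up data, and the names `train` and `known` are hypothetical. It only shows the shape of the loop: keep correcting the model until its predictions on known examples are consistently accurate, then use it on an input it has not seen.

```python
def train(examples, learning_rate=0.05, tolerance=1e-6, max_epochs=10_000):
    """Fit y = w*x + b by gradient descent, stopping once the mean squared error is tiny."""
    w, b = 0.0, 0.0
    n = len(examples)
    loss = float("inf")
    for _ in range(max_epochs):
        # Measure how wrong the current predictions are on the known examples.
        errors = [(w * x + b) - y for x, y in examples]
        loss = sum(e * e for e in errors) / n
        if loss < tolerance:
            break
        # Nudge each parameter in the direction that reduces the error.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, examples)) / n
        grad_b = sum(2 * e for e in errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss

# Made-up "known answers" that follow y = 2x + 1.
known = [(x, 2 * x + 1) for x in range(5)]
w, b, loss = train(known)
prediction = w * 10 + b  # an input that was not in the training examples
print(f"w={w:.3f}, b={b:.3f}, prediction for x=10: {prediction:.2f}")
```

In the protein setting, the known examples would be sequences paired with experimentally determined structures, and the model would have vastly more adjustable parameters than the two used here.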

AlphaFold and DeepMind

AlphaFold 3 is a platform developed by Google DeepMind, the current leader in protein folding research in terms of simulation, testing, and the overall size of its database. AlphaFold uses an AI model to predict the three-dimensional structures of proteins from the inputted amino acid sequences, and it can also model other molecules such as RNA and DNA. AlphaFold seeks to transform our understanding of the biological world and contribute to major breakthroughs in the field of biology via AI protein modelling.

AlphaFold In-Depth

AlphaFold's AI model follows three general steps, each of which contains many substeps: input, processing, and output.

Input — The first step of AlphaFold's AI model is the input of information. The user inputs their amino acid sequences and other components to be simulated in a 3D space. These components can be ions, ligands, RNA sequences, and DNA sequences, which all contribute to the interactions and structure of a protein.
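As a rough illustration of what such an input might look like in code, here is a small sketch, assuming a hypothetical `PredictionJob` container. This is not AlphaFold's real input format or API; the class, its field names, and the validation rules are invented for this example. It simply groups the kinds of components listed above and checks that each sequence only uses letters from the expected alphabet.

```python
from dataclasses import dataclass, field

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes
DNA_BASES = set("ACGT")
RNA_BASES = set("ACGU")

@dataclass
class PredictionJob:
    """Hypothetical container for one prediction request (not a real AlphaFold format)."""
    protein_chains: list[str]
    dna_chains: list[str] = field(default_factory=list)
    rna_chains: list[str] = field(default_factory=list)
    ions: list[str] = field(default_factory=list)      # e.g. "MG", "ZN"
    ligands: list[str] = field(default_factory=list)   # e.g. "ATP"

    def validate(self) -> None:
        """Raise ValueError if a chain is empty or uses a letter outside its alphabet."""
        checks = [
            ("protein", self.protein_chains, AMINO_ACIDS),
            ("DNA", self.dna_chains, DNA_BASES),
            ("RNA", self.rna_chains, RNA_BASES),
        ]
        for kind, chains, alphabet in checks:
            for chain in chains:
                bad = set(chain.upper()) - alphabet
                if not chain or bad:
                    raise ValueError(f"invalid {kind} chain {chain!r}: {sorted(bad)}")

# A made-up ten-residue protein chain with one magnesium ion.
job = PredictionJob(protein_chains=["MKTAYIAKQR"], ions=["MG"])
job.validate()
print("job accepted:", len(job.protein_chains), "protein chain(s)")
```

Catching a malformed sequence at this stage is cheap, whereas discovering it after an expensive prediction run is not, which is the main reason to validate up front.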

Processing — The processing step of AlphaFold involves the deep neural network. AlphaFold 2 introduced a special neural network called the Evoformer (AlphaFold 3 uses a closely related successor called the Pairformer), which serves a similar purpose to the neural networks mentioned earlier. The Evoformer takes the inputted data from the user and translates it into information that a computer can process. It first encodes the amino acid sequence and compares it with previously analysed sequences that are similar to it, then makes a prediction based on this previous information.

Multiple Sequence Alignment is the process of comparing the current sequence to previously analysed sequences — BioPandit
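The comparison step can be illustrated with a small sketch, assuming the sequences have already been aligned (same length, with "-" marking a gap). The sequences below are made up, and the two helper functions are hypothetical names of mine; real alignment tools first have to search large databases and compute the alignment itself, which this example skips.

```python
from collections import Counter

def percent_identity(a, b):
    """Share of aligned positions (ignoring gaps) where two sequences match."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def column_conservation(alignment):
    """For each column, the most common symbol and the fraction of rows that share it."""
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result

# Made-up, pre-aligned sequences; the first row is the query.
alignment = ["MKTAYIAK", "MKSAYLAK", "MRTAY-AK"]
query = alignment[0]
identity = percent_identity(query, alignment[1])
conservation = column_conservation(alignment)
print(f"identity to second sequence: {identity:.1f}%")
print("first column:", conservation[0])
```

Columns that stay the same across many related sequences hint at positions that matter for the structure, which is the kind of signal the model draws from the alignment.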

The Evoformer then takes the amino acid sequence and determines each amino acid's position in 3D space relative to other amino acids (based on atomic interactions and behaviour). This process is repeated multiple times for each amino acid to determine the most probable orientation of the amino acids.
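The idea of describing each amino acid's position relative to the others can be shown with a pairwise distance table. The sketch below assumes one made-up (x, y, z) point per amino acid, measured in angstroms, and uses an 8 angstrom cutoff to call two residues "in contact"; that cutoff is a common convention but is an assumption here, as are the function names. It is a way to read a structure, not a reproduction of what the Evoformer computes internally.

```python
import math

def distance_matrix(coords):
    """Distance between every pair of residues, given one (x, y, z) point per residue."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contacts(coords, cutoff=8.0, min_separation=2):
    """Residue pairs that sit close in space despite being apart in the chain."""
    distances = distance_matrix(coords)
    n = len(coords)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if distances[i][j] <= cutoff
    ]

# Four made-up residue positions arranged in a square with 3.8 angstrom sides.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
matrix = distance_matrix(coords)
close_pairs = contacts(coords)
print(close_pairs)
```

A table like `matrix` does not depend on how the whole molecule is rotated or shifted in space, which is one reason relative positions are a convenient way to describe a fold.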

Structure Output — Lastly, a structure module takes the information from the Evoformer to start producing a three-dimensional structure. This structure is almost like a "rough draft" of the final output, as it takes the given measurements and orientations of the amino acids and outputs a starting template. The structure module then uses iterative refinement, which is the process of analysing the accuracy of each prediction made in the rough draft of the structure, then attempting to resolve these mistakes in order to increase the accuracy of the final output. This cycle is repeated until no further refinements can be made to the structure, which is then returned to the user.
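The refine-until-nothing-improves cycle can be sketched with a toy stand-in. AlphaFold's actual refinement is learned by the network; the example below instead assumes a single hand-written rule: consecutive residues in a chain should sit roughly 3.8 angstroms apart (the approximate spacing between neighbouring alpha-carbons). The draft coordinates are made up, and `refine` and `bond_errors` are hypothetical names.

```python
import math

TARGET = 3.8  # approximate spacing in angstroms between neighbouring alpha-carbons

def bond_errors(coords, target=TARGET):
    """How far each consecutive pair of residues is from the target spacing."""
    return [math.dist(a, b) - target for a, b in zip(coords, coords[1:])]

def refine(coords, target=TARGET, tolerance=1e-6, max_rounds=1000):
    """Repeatedly correct neighbour spacing until the worst error is below tolerance."""
    coords = [list(p) for p in coords]
    for round_number in range(max_rounds):
        # Step 1: assess the current draft.
        worst = max((abs(e) for e in bond_errors(coords, target)), default=0.0)
        if worst < tolerance:
            return [tuple(p) for p in coords], round_number
        # Step 2: fix each neighbouring pair by moving both points half the error.
        for i in range(len(coords) - 1):
            a, b = coords[i], coords[i + 1]
            d = math.dist(a, b)
            if d == 0:
                continue  # direction is undefined for coincident points
            shift = 0.5 * (d - target) / d
            for k in range(3):
                delta = (b[k] - a[k]) * shift
                a[k] += delta
                b[k] -= delta
    return [tuple(p) for p in coords], max_rounds

# A made-up "rough draft" whose neighbour spacings are all off target.
draft = [(0, 0, 0), (2, 1, 0), (5, 1, 1), (6, 4, 1), (9, 5, 3)]
refined, rounds = refine(draft)
print("rounds needed:", rounds)
print("worst remaining error:", max(abs(e) for e in bond_errors(refined)))
```

Fixing one pair slightly disturbs its neighbours, which is why a single pass is not enough and the loop has to re-check the whole draft each round.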

Potential Applications

As mentioned earlier, there is no shortage of things we could build once we "solve" the protein folding problem. We could create a platform which takes functions and characteristics as input, then outputs a synthetic protein to carry out these functions. In other words, we could essentially program a protein to have the desired functions for a given task. As I like to put it, in computers we use programming languages as a medium for translating human language into zeros and ones for the computer; once we solve the protein folding problem, we will start translating code from human language into amino acids and nitrogenous bases.

Insulin Production — As most people know, diabetes is an ongoing issue that affects people of all ages, but how exactly do we treat this disease? Diabetes occurs when a person's body cannot produce enough insulin, or cannot use it effectively, to regulate blood glucose (sugar) levels, leading to dangerously high levels of glucose in the body. Luckily for us, insulin is a protein. We can synthetically create an exact copy of insulin to treat diabetes, or potentially create a protein that is more efficient than insulin. Additionally, we could even design an enzyme or microorganism that is capable of creating insulin for when the body cannot. These insulin solutions could be more convenient, and most likely more cost-effective, than traditional insulin injections.

Biological Computing — The field of biological computing involves creating computer and logic circuits using biological components. Examples of this include using DNA sequences to store data, or using enzymes/proteins as logic gates for operations. Using the power of synthetic biology, we can use our knowledge of protein structures to create unique proteins that are built for these logic circuits by picking the exact functions we need from the protein. This method of computing is a sustainable alternative to traditional computing, as components used in biological computing are reusable in some shape or form.

Final Remarks

I put "solve" in quotation marks in the title because we don't truly know what "solved" looks like. How far can we really go with predicting three-dimensional protein structures? As our understanding of biology, chemistry, and physics increases, so does the depth and scope of the protein folding problem. There could be atomic behaviour we have not even discovered yet which could completely throw off everything we know about the folding of proteins. That is why synthetic biology is an ever-growing field: we can always gain new insights the further we go in our research, and I am excited to see where we will find ourselves in the next couple of decades.

