Computational engineering of protein binding pocket

Computational engineering of protein binding pocket

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I have an X-ray structure of an enzyme with reported activity to a small molecule. This activity is rather low since it is not the native substrate.

I can run molecular modeling simulations (e.g. using DOCK) to estimate the enzyme's binding behavior to different compounds. However, what I would like to be able to do is to mutate the binding pocket so that it better binds my small molecule of interest.

Exploring all possible amino acid substitutions is computationally too expensive, so I am wondering whether smarter ways of looking at this problem have been developed.

Can someone point me in the right direction? I've searched Google but can't really find a straight forward answer

You can try an evolutionary algorithm. If you can quickly evaluate the binding affinity you can initialize a set of copies of your protein but with random mutations. You can assign them a fitness score based on the affinity of the binding. The ones with highest affinity you replicate again with some probability of mutating each amino acid and you continue for as many generations as you need. If you find your parameters (mutation rate, population size, etc) correctly you can easily optimize the binding without knowing anything a priori about the different sites. There is a lot of literature on this, I suggest you do a quick Google scholar search on directed evolution or evolutionary algorithms to optimize binding affinity.

Paper on in silico evolution to optimize protein-protein binding

Computational design of G Protein-Coupled Receptor allosteric signal transductions

Membrane receptors sense and transduce extracellular stimuli into intracellular signaling responses but the molecular underpinnings remain poorly understood. We report a computational approach for designing protein allosteric signaling functions. By combining molecular dynamics simulations and design calculations, the method engineers amino-acid ‘microswitches’ at allosteric sites that modulate receptor stability or long-range coupling, to reprogram specific signaling properties. We designed 36 dopamine D2 receptor variants, whose constitutive and ligand-induced signaling agreed well with our predictions, repurposed the D2 receptor into a serotonin biosensor and predicted the signaling effects of more than 100 known G-protein-coupled receptor (GPCR) mutations. Our results reveal the existence of distinct classes of allosteric microswitches and pathways that define an unforeseen molecular mechanism of regulation and evolution of GPCR signaling. Our approach enables the rational design of allosteric receptors with enhanced stability and function to facilitate structural characterization, and reprogram cellular signaling in synthetic biology and cell engineering applications.


Mutations that arise in HIV-1 protease after exposure to various HIV-1 protease inhibitors have proved to be a difficult aspect in the treatment of HIV. Mutations in the binding pocket of the protease can prevent the protease inhibitor from binding to the protein effectively. In the present study, the crystal structures of 68 HIV-1 proteases complexed with one of the nine FDA approved protease inhibitors from the Protein Data Bank (PDB) were analyzed by (a) identifying the mutational changes with the aid of a developed mutation map and (b) correlating the structure of the binding pockets with the complexed inhibitors. The mutations of each crystal structure were identified by comparing the amino acid sequence of each structure against the HIV-1 wild-type strain HXB2. These mutations were visually presented in the form of a mutation map to analyze mutation patterns corresponding to each protease inhibitor. The crystal structure mutation patterns of each inhibitor (in vitro) were compared against the mutation patterns observed in in vivo data. The in vitro mutation patterns were found to be representative of most of the major in vivo mutations. We then performed a data mining analysis of the binding pockets from each crystal structure in terms of their chemical descriptors to identify important structural features of the HIV-1 protease protein with respect to the binding conformation of the HIV-1 protease inhibitors. Data mining analysis is performed using several classification techniques: Random Forest (RF), linear discriminant analysis (LDA), and logistic regression (LR). We developed two hybrid models, RF-LDA and RF-LR. Random Forest is used as a feature selection proxy, reducing the descriptor space to a few of the most relevant descriptors determined by the classifier. These descriptors are then used to develop the subsequent LDA, LR, and hierarchical classification models. Clustering analysis of the binding pockets using the selected descriptors used to produce the optimal classification models reveals conformational similarities of the ligands in each cluster. This study provides important information in understanding the structural features of HIV-1 protease which cannot be studied by other existing in vivo genomic data sets.


Homology models of Olfr73

To obtain a reliable 3D model of Olfr73 through the initial homology modelling, we first compared the sequence of Olfr73 with sequences of other class A GPCRs in the PDB database. We found that it shared in the best case 19% sequence identity with beta-2-adrenergic receptor (β2AR, pdb code: 4LDE) 48 and 16% sequence identity with rhodopsin (RHO, pdb code: 4BEY) 49 . Since multiple modelling templates can considerably improve the reliability of homology models 50,51,52 , we used the crystal structures of both receptors as templates for model building. 3D sequence alignment (Supplementary Fig. 1) indicates that most highly conserved residues/motifs in class A GPCRs 9 including N 1.50 , D 2.50 , DRY motif, W 4.50 , Y 5.58 , F 6.44 and NPxxY motif are also present in Olfr73 53 . However, residue P 5.50 and motif CWxP that are typically found in non-OR class A GPCRs are missing in Olfr73. Moreover, one gap in TM3 is observed between the sequence of C 3.25 and DRY motif of Olfr73 (Supplementary Fig. 1).

Interaction fingerprints (IFP) between agonists and Olfr73

The final refined homology model of Olfr73 (Fig. 2) shared many of the common features of non-OR class A GPCRs outlined elsewhere 9,54 . We then docked isoeugenol, a potent agonist for Olfr73 47 , into the predicted extracellular ligand-binding pocket to explore atomistic details of the interaction between the ligand and its receptor. As depicted in Fig. 2, the hydrophobic moiety of isoeugenol is surrounded by several aromatic residues including F102 3.30 , F105 3.33 , F182 ECL2 , F203 5.42 , Y260 6.52 which were also found by functional analysis of mutant receptors to play a crucial role in agonist binding 47 . Furthermore, the three non-aromatic hydrophobic residues L199 5.38 , L259 6.51 and V277 7.39 are contacting the agonist molecule. The hydroxyl group in isoeugenol forms an H-bond with Y260 6.52 , which in turn forms an H-bond with E208 5.47 and water mediated hydrogen bonding to S113 3.41 . Both Y260 6.52 , E208 5.47 and S113 3.41 had been shown elsewhere to be important for Olfr73 activation 47 .

The 3D structural model of Olfr73 (left) and enlarged view of the binding mode of the agonist isoeugenol (right). Amino acid side chains in contact with bound isoeugenol are shown in green

To further validate these observations, we performed an IFP analysis (Fig. 3a), which encodes specific interactions between a particular ligand and specific amino acids in the binding pocket. IFP analyses have been used for computational drug discovery for non-olfactory GPCRs 55 . Here we docked 25 previously reported Olfr73 agonist molecules 16,45 (Fig. 4) into the binding pocket of Olfr73 and obtained the interaction fingerprints of the different agonists with residues in the binding pocket (Fig. 3a). The IFP analysis showed that all docked agonists could interact with five residues in the receptor’s binding pocket including F102 3.30 , F105 3.33 , L199 5.38 , L259 6.51 and Y260 6.52 . Furthermore, C106 3.34 (80%), V109 3.37 (96%), E181 ECL2 (80%), F182 ECL2 (64%), F203 5.42 (60%), E208 5.47 (88%), V277 7.39 (52%) and T280 7.42 (52%) are also found frequently (percentage in parentheses) contacting the agonists (Fig. 3a, top histogram). In addition, the three residues V110 3.38 (5%), F179 ECL2 (8%) and K273 7.35 (12%) were found sometimes in contact with the agonists. Each particular ligand was found interacting with at least 80% of all the residues in the binding pocket (Fig. 3a, right histogram). Most of these mentioned residues were found by functional analysis of mutant receptors to play a crucial role in agonist binding 47 .

Interaction fingerprints of 25 known Olfr73 agonists grouped in classes 1–5 according to Fig. 4. a In the interaction histogram, each contact of a particular residue with the ligand is indicated by a color. The color code distinguishes the residue location in a particular TM helix. Each class of compounds is separated by a horizontal gray line. b The pharmacophore model based on 25 known Olfr73 agonists. As a prototypical example, the position of isoeugenol in the Olfr73 binding pocket showing the interaction fingerprint. Assignments: H-bond donor (I), H-bond acceptor (II), hydrophobic moiety (III, IV, V, VI) polar residues (yellow), aromatic residues (cyan), hydrophobic residues (green)

Hierarchical clustering of Olfr73 agonist molecules. Six different classes of agonists are identified (distinguished by a color code) according to their PH4 features. In the Hierarchical diagram, the links between the chemical compounds are represented as branched vertical lines. The height of the lines, coupled with merging distance (numbers showed in each node), indicate the normalized dissimilarity distance between the adjacent compounds. A higher line or a larger merging distance denotes a larger dissimilarity. A typical representative molecular structure of each class is shown below the dendrogram together with their molecular surfaces indicating hydrophobic moieties in grey and polar moieties in red. The commonly shared atoms within a certain class of molecules are labeled with colored dots accordingly. The molecular structures of the six classes of agonists are grouped in boxes. The 17 newly found agonists are represented as A1-A17 in blue. The 25 previously reported agonists are represented as B1-B25 in black. The agonist isoeugenol is B3 and p-isobutylphenol is A1. In all cases the corresponding micromolar EC50 values are indicated in brackets. Names of A- and B-compounds are listed in Supplementary Tables 2 and 3

Structural characteristics of Olfr73 from molecular dynamics simulations

We modelled the Olfr73 with crystal structures of activated GPCRs. To explore reliable atomic details, we performed 2 × 500 ns all-atom molecular dynamics simulations for both the apo form of the receptor (apo-Olfr73) and the receptor with agonist isoeugenol (iEG-Olfr73) (Fig. 2). The molecular dynamics simulations showed that the volumes of the binding pocket for apo-Olfr73 and iEG-Olfr73 were 190 ± 3 Å 3 and 220 ± 3 Å 3 , respectively. This is probably because of the induced fit effect (IFD), which has been widely observed in GPCR system and many others 56,57 .

Since the Olfr73 was modelled with low sequence identity templates, it was necessary to restrain the backbone of the modeled OR structure during molecular dynamics simulations to keep correct secondary structure 50,52 . Thus, we added a small force constrain during all our molecular dynamics simulations (see methods section). Transmembrane (TM) movements are a hallmark of GPCR activation. Since the templates used for the molecular dynamics simulations are based on receptors in activated states, the cytoplasmic TM regions of Olfr73 have been kept in the active open conformation by the end of molecular dynamics simulations (Supplementary Fig. 2).

Virtual agonist screening

We established a refined 3D structural homology model of Olfr73, the agonist-receptor interaction fingerprint and the structural framework explaining the mechanism of receptor activation. To validate these findings, we performed a virtual screen on a large chemical compound library to find new candidates of agonists for Olfr73 (Fig. 1) beyond classical odorant compound libraries, which finally will be tested by cellular functional assays.

First, we evaluated the physicochemical properties of all reported compounds (see Methods section on physical properties filtering) and used them for setting the conditions for an initial filter according to which 312,800 compounds were selected from the initial 1.58 million drug-like compounds of the ZINC library (Supplementary Table 1). For details about this procedure, please see the Methods section on virtual screening).

We applied the next round of selection criteria to our downsized chemical compound library using pharmacophore search (PH4) 58 , a screening method selecting compounds according to their chemical shape (Fig. 3b). The PH4 screen heavily relies on our results obtained from the IFP analysis and the molecular dynamics simulations. According to the molecular dynamics simulations, the oxygen at site I is crucial for agonist binding, forming distinct H-bonds with Y260 6.52 and E208 5.47 . IFP analysis further confirmed that the interaction with Y260 6.52 in this position is highly conserved (Fig. 3a). Since the -OH group could be either H-bond donor or H-bond acceptor, it was featured allocated preferentially the PH4 selection filter further downsized the library to 266,000 compounds.

Interestingly, the empty ligand-binding pocket of Olfr73 has a volume of 190 Å 3 and is noticeably smaller than that of other GPCRs including A2AR (270 Å 3 ) 59 , rhodopsin (260 Å 3 ) 11 (Fig. 5), the 5-HT1A 60 receptor (360 Å 3 ), or the μ-opioid receptor 57 (510 Å 3 ), and therefore acts as a size-selection filter for potential binders. This explains why all currently reported agonists of Olfr73 are small (MW = 130–220) and the corresponding EC50 values are relatively high, due to the limited interactions in such small binding pocket. On this basis we created a volume counter along the 3D space of the sixteen superimposed ligands further reducing the library of potential Olfr73 binders to 493 compounds. We then continued selection filtering applying first an ionization penalty and then a molecular polarity counter, which narrowed the library further down first to 371 and then to 204 compounds.

Cross-section through several GPCRs along the membrane normal showing the vertical part of the ligand-binding pocket of (a) A2AR in complex with ZMA, (b) rhodopsin in complex with retinal, and (c) Olfr73 in complex with isobutylphenol. d Plot of EC50 values versus agonist volumes and agonist polar surface areas (PSA) for Olfr73 based on all reported agonists. Highly potent agonists are located in regions a1, a2, a3 and a4 agonists with medium potency are in the regions of b1, b2 and b3 agonists with lower potency are found in regions c1 and c2. e Molecular mass distribution of OR ligands. f Molecular mass distribution of non-olfactory GPCR ligands

Finally, we selected potential agonists from the remaining 204 compounds using quantitative structure–activity relationships (QSAR) based on comparative molecular field analysis (CoMFA) methods 61 . We docked the top 100 ranked compounds by QSAR into the MD refined homology model and found that 64 compounds fitted into the ligand-binding pocket of Olfr73, close to the activation trigger F105 3.33 . However, only 25 out of the 64 selected compounds have been commercially available for testing biological activity.

Cell-based functional tests

Next, we used the SEAP reporter assay, monitoring changes in cyclic adenosine monophosphate (cAMP) second messenger signalling as a read-out for cellular responses of odorant-induced receptor activation and found ligands capable of activating Olfr73 in Hana3A cells 47 We tested 25 compounds of the molecules predicted from virtual screening and identified 17 (Fig. 4, blue labelled compounds Supplementary Table 1 and Supplementary Fig. 3) inducing a noticeable SEAP signal in a concentration-dependent activation of Olfr73. It would be interesting to test by additional experiments whether of the eight compounds, which did not show agonist activity, there are antagonists for Olfr73.

The diversity of OR agonists

In the following, we used a hierarchical agglomerative clustering method 62 to classify both the newly found and the previously known Olfr73 agonists based on their PH4 characteristics. As shown in Fig. 4, the 42 compounds can be grouped into 6 different classes. The four agonists of class-1 comprise a common phenol group with bulky hydrophobic groups (cyclohexyl or branched methyl containing alkyl chains) in para position. The EC50 values of class-1 agonists range from 13 to 64 μM. The agonists of class-2 share a central modified pyrocatechol structure (primarily in form of monometoxy-phenol or dimetoxy-phenon) with an additional linear, branched or cyclic hydrophobic group attached. The EC50 values of these agonists range from 7 to 240 μM. The agonists of class-3 contain a central benzaldehyde structure. The para positions carry primarily a metoxy- or ethoxy-substitute the meta positions are substituted mostly by methoxy groups or for one case by a methyl group. The EC50 of class-3 agonists range from 26 to 270 μM. The 16 agonists of class-4, share a central phenol structure with oxygen carrying groups in the para and sometimes also in the ortho position. The class-4 agonists are the most polar ones in our collection ten of them show EC50 values in the range of 4–100 μM, the remaining six have EC50 values between 200 and 660 μM. The class-5 agonists are quite different from the initial four classes they do not contain an aromatic ring but instead carry a central cyclohexanone structure preferentially with a linear or branched alkyl substituent at the para position. The four class-5 agonists show EC50 values from 36 to 63 μM. Only one agonist is listed in class-6. It is composed of a tetrahydro-2H-pyran structure carrying two hydrophobic substitutions in the ring and has an EC50 value of 630 μM.

Therapeutic potential of the newly discovered agonists

We found p-isobutylphenol (4-isobutylphenol) as the most potent ligand activating Olfr73 in our functional assay (Fig. 4). It is a known degradation product of Ibuprofen which is widely used as analgesic anti-inflammatory drug but p-isobutylphenol has also been shown to exhibit antibiotic activity 63 . The estrogenic activity of the compound 4-cyclohexylphenol has been documented by in-vitro assays 64 . The Olfr73 activating compound 4′-hydroxy-3′,5′-dimethoxyacetophenone (Acetosyringone) has anti-asthmatic and anti-inflammatory properties 65 . And finally, 4′-hydroxypropiophenone is a predicted inhibitor of metalloproteinase 10, which has an active role in lung cancer development (Kiresee et al., 2016). Thus, our results have revealed some insights regarding the potential poly-pharmacological profile of these drugs acting not only on a defined medicinal target but also activating an OR. Similar observations of unintended interactions and activation by medicinal drugs have also been documented for the bitter taste receptor TAS2R14 66 . The complete list of newly discovered compounds can be found in Supplementary Table 2. The full list of reported compounds is in Supplementary Table 3.

Limited volume of OR binding pocket

ORs in general and Olfr73 in particular show some interesting structural and functional differences to their class A GPCR relatives. Most of the known OR-agonists are smaller in size than the typical agonists of non-olfactory class A GPCRs. Considering a large panel of reported OR ligands 67 together with the new ligands (105 compounds in total) from this work shows that the molecular mass (M) of OR ligands distribute between 80 and 220 Da (maximum around 150 Da) (Fig. 5e). In contrast, non-olfactory GPCR ligands 68 (161,083 compounds in total) distribute primarily between 300 and 600 Da (maximum around 450 Da) (Fig. 5f). Additionally, EC50 values for OR-agonists are usually much higher than those of the agonists of the non-olfactory class A GPCRs 69 . In our present study, this can be explained by the volume of the ligand-binding pocket of the Olfr73 which in the apo form is considerably smaller than comparable regions of the non-olfactory class A GPCRs reducing the number of interaction points between ligand and receptor (Fig. 5). Obviously, the ligand-binding pocket of a particular receptor acts as a size exclusion filter for potential ligands. To test this hypothesis and to probe the flexibility of the ligand-binding pocket for the Olfr73-agonists, we submitted Olfr73 with bound compounds A1, A2 and A3 (Fig. 4) to additional 2 × 500 ns all-atom molecular dynamics simulations. Similar to the agonist isoeugenol, the volume of the binding pocket of Olfr73 increased from 190 ± 3 Å 3 in the empty state to 220 ± 3 Å 3 for A1 (MW = 149) to 225 ± 5 Å 3 for A2 (MW = 166), and 240 ± 2 Å 3 for A3 (MW = 171), respectively. In general, larger ligands induce a larger volume increase in the occupied binding pocket 56 . Obviously, the binding pocket of Olfr73 is within a certain range quite flexible and adjusts perfectly to the size of the bound ligand with volume changes between 15 and 25%. We previously made similar observations for non-olfactory class A GPCRs such as the P2Y1 receptor 70 changing the volume of the binding pocket from 230 ± 4 to 280 ± 5 Å 3 (22% change), the 5-HT1A 60 receptor from 360 ± 5 to 425 ± 3 Å 3 (18%), the A2AR 59 from 270 ± 2 to 315 ± 4 Å 3 (17%), and the μ-opioid receptor 57 from 510 ± 3 to 575 ± 5 Å 3 (13%). In the next step, we performed an interaction fingerprint analysis for our newly found 17 agonists and compared the outcome with that in Fig. 3 of the 25 known agonists (Supplementary Fig. 4). The IFPs of both sets of agonists are quite similar. Moreover, IFPs indicated that there are only two hydrogen bond interactions between Olfr73 and its agonists, whereas there are much more polar interactions in other GPCRs 71,72 . These results further confirm our conclusions that the volume of Olfr73 is limited which is responsible for the small size of its agonists and weak EC50 values.


In this study, we have computationally truncated and engineered the human ACE2 receptor sequence to potently bind to the SARS-CoV-2 RBD. We have further identified an optimized peptide variant that enables robust degradation of RBD-sfGFP complexes in human cells, both in trans and in cis with human E3 ubiquitin ligases. Finally, we have shown that our optimal fusion construct inhibits the production of infection-competent viruses pseudotyped with the full-length S protein of SARS-CoV-2.

Although further testing contexts are needed, there may be certain advantages to our platform as compared to the PAC-MAN strategy presented recently 6 , apart from the ethical implications of applying CRISPR in humans 35 . First, both of the peptide and E3 ubiquitin ligase components have been engineered from endogenous human proteins, unlike Cas13d, which is derived from Ruminococcus flavefaciens bacteria, thus potentially reducing the risk of immunogenicity. In addition, in terms of in vivo delivery as RNA or recombinant protein, Cas13d has an open-reading frame (ORF) of nearly 1000 amino acids, not including the guide RNAs needed for interference. The entire peptide-CHIPΔTPR ORF consists of just over 200 amino acids, which can be readily synthesized as a peptide or be efficiently packaged for delivery in a lipid nanoparticle or adeno-associated virus.

Furthermore, our peptide fusion platform as a prophylactic provides a viable alternative to current antiviral strategies being explored for COVID-19. Antiretroviral protease inhibitors for HIV, such as lopinavir and ritonavir, have shown minimal efficacy in clinical trials of COVID-19, and generated adverse effects in a subsection of patients 36 . Similarly, antimalarials, such as hydroxychloroquine and chloroquine, which may glycosylate ACE2, have demonstrated no benefit in patients infected with SARS-CoV-2 in randomized, controlled studies 37 . Finally, there is a global effort to generate a vaccine for COVID-19. Although 11 candidates have advanced into Phase 3 trials and 5 have been approved for early or limited use in China and Russia, a standard timeframe to fully assess safety and efficacy takes well over one year 4 . Though our platform likely requires synthesis and assessment of a gene therapy, rather than a small molecule or compound, and does not generate immunological memory against SARS-CoV-2 as would a vaccine, its rapid and direct targeting mechanism, coupled with its size and human-protein derivation, presents numerous advantages as compared with existing strategies.

In total, we envision that the strategy of utilizing a computationally designed peptide binder linked to an E3 ubiquitin ligase can be explored not only for SARS-CoV-2, but also for other viruses and drug targets that have known binding partners. With already over 30,000 co-crystal structures currently in the PDB, and structure determination becoming more routine with advances in cryogenic electron microscopy, the computational peptide engineering pipeline presented here provides a versatile new therapeutic platform in the fight against COVID-19, future emergent viral threats, and numerous diseases.


Our approach, described in more detail in the following sections, consists in (i) defining the query region, i.e. the region of the input target protein containing the desired binding site (ii) searching for regions structurally similar to the query region in a non redundant database of experimentally solved monomeric proteins (iii) retrieving continuous backbone fragments in contact with the query region and merging them in the appropriate relative position with the target protein and (iv) designing the peptide sequence using repeated cycles of structure diversification and sequence design. These steps are described below.

Define the query region: the query region can be defined in two ways, i.e. by listing a specific set of residues or by selecting a single residue res and a query region radius r. In the latter case, all residues having at least one atom within rÅ from any atom of res are included in the query.

Identify regions structurally similar to the query region: a structure similarity search of the query region as defined above is performed against a non-redundant database of solved monomeric proteins (filtered at the level of 70% sequence identity). We use Triangle Match ( 23, 24) with default parameters to select backbone regions (hit regions) structurally similar to the query region in an amino acid sequence and order independent fashion.

Retrieve the appropriate backbones: we retrieve backbone scaffolds by considering regions of the hit proteins in contact with the hit region. This is achieved analyzing contact graphs. Contacts are pre-calculated for all proteins in our database using Almost Delaunay tessellation ( 25) as implemented in the ADCGAL program ( 26) and used to construct graphs using the NetworkX package for Python ( 27). The conformations of the top 2000 longer hit regions are analyzed further. Fragments in contact with the hit regions (that we call scaffolds) are retrieved. Those that do not have an extended conformation are excluded since regions with helical or turn conformations in the native structure are unlikely to preserve their conformation in isolation. Extended backbone scaffolds are defined as the fragments for which ne/(L - 2) ≥ 0.5 where ne is the number of residues that have ϕ angles in the range −185° to −35° and ψ angles in the range from 85° to 160° and L is the length of the fragment defined as the number of its residues. Regions shorter than four residues are not considered.

The contact density (defined as the number of contacts per residue) between the remaining potential scaffolds and the corresponding hit region is used to re-sort the list and retain the top 500 scaffolds (or more if there are regions with the same contact density at position 500). This is based on the assumptions that higher contact density backbone scaffolds are more likely to support sequences leading to high affinity binding peptides in the subsequent sequence design step. Selected backbone scaffolds are then merged with the query protein to build putative protein–peptide complexes on the basis of the superposition between the query and the hit regions. The superposition is performed using Triangle Match ( 23, 24).

Sequence design: sequence design is performed using PyRosetta ( 22), a Python-based interface to the Rosetta molecular modeling package ( 28) and the Rosetta full atoms energy function with Talaris2013 energy term weights ( 29). The design consists of two different stages, both including structure diversification and sequence design.

First, a relaxation step is performed using PackRotamersMover to allow changes in side chain rotamers of both the protein and the peptide. Next, small rigid-body movements of the peptide and small local flexible backbone moves of the protein–peptide complex structure are performed using RigidBodyPerturbMover with translation and rotation steps of 0.08 Å and 0.3° respectively followed by five rounds of BackrubMover on the whole complex ( 22). Energy minimization of the protein–peptide complex is performed using MinMover DFP minimization with 0.01 Å tolerance. Next, the amino acid sequence of the peptide is optimized using a standard simulated annealing Monte Carlo method, using the PackRotamersMover function, where rotamers are changed in both the protein and the peptide while amino acids are mutated only in the peptide. For every backbone scaffold, 10 different peptide sequences are calculated and the resulting complexes scored and ranked using FoldX ( 30). The 100 backbone scaffolds corresponding to top protein peptide complexes in terms of lowest FoldX binding energy are selected for the subsequent exhaustive sequence design step.

In the next stage of design, each peptide is subjected to three iterations of backrub movements, energy minimization and sequence design and 100 peptides are generated from every backbone scaffold selected in the pre-design stage. A further structure refinement step is performed on the protein–peptide complex following the Rosetta Classic Relax protocol ( 31). In this final step the backbone of the protein is kept fixed and all residues within 20 Å from any atoms of the query region are considered in the calculation. Resulting models differing by more than 1Å in terms of Cα RMSD from the initial protein–peptide structure are filtered out in order to avoid both significantly distorted structures and peptide conformations that deviate too much from the starting backbone scaffold. The remaining peptides corresponding to the same backbone are grouped by sequence identity using CLUSEQ ( 32) and each group is assigned the average FoldX binding energy of its members.

The parameters used in the pipeline described above have been selected on the basis of their ability to retrieve peptides similar in both structure and sequence to experimentally known cases. A few selected examples of the results are illustrated later more are available in the ‘Example’ section of the web server.


Computational protocol

The computational protocol is organized in two main parts (Fig. 2). Computational models of ligand analogues (N6-(benzyl) ATP, N 6 -(1methylbutyl)adenosine-5′-triphosphate (N6-(1-methylbutyl) ATP), N 6 -cyclopentyl-adenosine-5′-triphosphate (N6-(cyclopentyl) ATP), N 6 -(2-phenythyl)adenosine-5′-triphosphate (N6-(2-phenythyl) ATP), and 1-tert-butyl-3-(4-methylphenyl)-1H-pyrazolo[3,4-d]pyrimidin-4-amine (PP1) Fig. 3) were modelled in Maestro (version 9.5, Schrödinger, LLC, New York, NY, 2013). For each molecule, an ensemble of low energy conformers was generated by performing an in vacuo conformational search keeping the adenine base, the ribose ring, the phosphates and the pyrazolopyrimidine core of PP1 fixed and allowing the bonds of each substituent group to rotate freely. We used the Monte Carlo multiple minimum (MCMM) method [29] for 10,000 steps and OPLS_2005 as force field [30, 31]. During the conformational search, new structures generated were retained if they exhibited conformational energies lower than 100 kJ/mol. The conformation energy cutoff was chosen at 100 kJ/mol to allow for the various geometric approximations made in the force field. It serves as a proxy for the estimated protein–ligand interaction energy. To obtain an ensemble of non-redundant conformations, each conformer was compared with the previous ones and only retained if the root mean square deviation (all atoms) exceeds 0.5 Å. The conformational search was performed with the MacroModel module implemented in the Schrödinger suite (version 10.1, Schrödinger, LLC, New York, NY, USA, 2013).

Workflow of the computational protocol. The protocol is organized in two parts, the first part identifies residues to mutate and the 2nd part evaluates mutant-analogue interactions. The specific inputs are depicted in circles, steps of the workflow are shown in rectangles and outputs are depicted in rectangles with dashed lines. In case all analogue conformations are scored as having favorable interactions with the wild type, the analogue is considered to act as substrate for the wild-type protein and thus not further considered

Chemical structures of ATP and ATP-competitive analogues used in this study. For N6-(substituent) ATPs only the structures of the adenine ring and the hydrophobic groups are shown

For each analogue, the ensemble was superposed onto the adenine moiety of the native ATP ligand within the binding pocket of the reference protein. If the distance between an atom of a protein residue and any atom of the substituent group of a ligand analogue in the ensemble is shorter than the sum of their van der Waals [32] radii, the corresponding residue is considered a potential candidate for single-point mutagenesis. If no residues were identified by this approach, the analogue was considered to act as substrate for the native target and thus not further considered. The method was implemented in Python 2.5.4 and contains functions from the OpenStructure software framework [33].

In the second step, the interaction between potential protein mutants and ligand analogues was evaluated using a protein–ligand scoring function. Amino acids at positions identified in the first step were replaced in silico to generate mutant proteins. When a residue was changed into Gly or Ala, the entire structure was relaxed by a minimization step performed using OPLS_2005 as force field in Maestro [34]. When a residue was mutated into an amino acid with a larger side chain, such as Met or Thr, a rotamer scan was performed to identify the most probable rotamer state using Rapid Torsion Scan tool available in Maestro. The kinase mutant-ligand conformer pairs were evaluated and ranked by the protein–ligand scoring function GlideScore [35]. The kinase mutant-ligand conformer structure with the lowest GlideScore was selected and the corresponding Glide energy was computed. The Glide energy is the sum of the Coulomb and van der Waals terms and represents an estimate for the protein–ligand interaction energy. Typically, predicted energies of interaction (Glide energies) correlate better with protein–ligand binding affinities or experimental IC50 values than GlideScore [36]. We arbitrarily limited all positive energies to zero as we were only interested in identifying favorable interactions. In the case of engineered kinases and ATP analogue pairs, only the adenine base and the substituent group were scored by GlideScore.

Kinase data set

A set of 7 protein kinases and 15 mutants for which experimental data were available in literature was used as test set (Table 1). Unless stated otherwise, in silico mutagenesis was performed using Maestro and the structure was prepared with the Protein Preparation Wizard tool [34]. Residues are numbered as as in PDB structures. The crystal structure of JNK bound to ANP (an ATP analogue with an amino group in place of the oxygen between the β and γ phosphates that mimics the natural cofactor) and Mg 2+ was solved in 1998 (Homo sapiens, PDB:1JNK, resolution 2.30 Å, [37]). The crystal structure was prepared for molecular modelling by adding hydrogen atoms, optimizing the hydrogen bonding network, the orientation of the amide groups of Asn and Gln, and the orientation and protonation state of the imidazole ring of His. This optimization allowed for improving interactions between charged groups as well as hydrogen bonds within the structure. The optimization was performed at pH of 7. Finally, a minimization step was applied to relax the entire structure. OPLS_2005 was used as force field and the termination criterion was based on the rmsd of the heavy atoms relative to their initial location (rmsd less than or equal to 0.30 Å). The M108GL168A mutant was obtained by in silico replacing Met108 to Gly and Leu168 to Ala and the structure was prepared as described above.

The kinase domain of v-Src differs from that of the cellular protein kinase c-Src at position 338 within the binding pocket (Ile338 in v-Src and Thr338 in c-Src). The crystal structure of c-Src in complex with ANP has been solved (Homo sapiens, PDB:2SRC, resolution 1.50 Å, [38]). To obtain a model of v-Src bound to its natural cofactor, we substituted in silico Thr338 into Ile. The v-SrcI338A and v-SrcI338G mutants were obtained in the same way.

To obtain a model of v-Src in complex with a pyrazolopyrimidine inhibitor, PP1, the structure of v-Src bound to ANP was superposed onto the structure of the hematopoietic cell kinase (Hck, a homologous protein) in complex with PP1 (Homo sapiens, PDB:1QCF, resolution 2.00 Å, [39]). The superposition was based on residues belonging to the hinge regions (residues 338–341 in both v-Src and Hck). The coordinates of PP1 were copied into the v-Src binding site and the complex was then prepared and minimized as described before. The same procedure was used for all other protein kinases and mutants studied in the same paper, proto-oncogene c-Fyn (Fyn, Homo sapiens, PDB:2DQ7, resolution 2.80 Å, [40]), abelson murine leukemia viral oncogene homolog 1 (Abl, Homo sapiens, PDB:2G1T, chain D, resolution 1.80 Å, [41]), calcium/calmodulin-dependent protein kinase type II subunit alpha (CamKII, Homo sapiens, PDB: 2VZ6, chain B, resolution 2.30 Å, [42]), cyclin-dependent kinase 2 (Cdk2, Homo sapiens, PDB:1HCK, resolution 1.90 Å, [43]), and mitogen-activated protein kinase p38 alpha (P38, Homo sapiens, PDB:1DI9, resolution 2.60 Å, [44]).

The complex of Fyn bound to the PP1 conformer with the best GlideScore was minimized in vacuo without constraints. We used the Polak-Ribier Conjugate Gradient (PRCG) as method for 2500 steps [45]. The same procedure was used for the complexes of FynT339A, Abl and AblT334A. The procedure was performed using MacroModel.

Data comparison

All plots reported in this paper were made using the Matplotlib [46] and NumPy packages [47]. In the plot of JNKM108GL168A, the interaction energies were scaled between 0 and 100 to fit the same range of observed phosphorylation values (expressed as percentage of phosphorylation). The lowest Glide energy was set to 0 and the highest to 100. The plots of v-Src, v-SrcI338A and v-SrcI338G in complex with ATP and N6-(benzyl) ATP were created by comparing the experimental catalytic efficiency (kcat/Km) and the predicted interaction energies (Glide energies). To correlate experimental and predicted data, we computed the negative logarithm of the kcat/Km ratio. The plots of tyrosine kinases and serine/threonine kinases in complex with PP1 were made measuring the linear correlation between the predicted interaction energies and the experimental measured pIC50 (−log(IC50)). For each family, the Pearson correlation coefficient was computed.


Elucidating the properties of a protein interfacial pocket (IP) can be a daunting task,2 , 32 let alone re-engineering it by altering residue and R-group arrangements to endow intended new functionalities. The IP of a protein may be abstracted as a set of amino acid residues, not necessarily adjacent in the linear polypeptide sequence, that form a local biochemical environment in three-dimensional structure space using conformations of their R-groups that is favorable to binding the proper substrate.29 These residues are housed amongst the rest of the residues, or scaffold, that do not participate in direct binding. Contribution to this favorable local environment can arise from a number of biophysical factors. The steric effects, approximated by a pairwise Lennard–Jones interaction energy, E VDW , of the van der Waals (VDW) radii of atoms composing the amino acid residues of the IP as well as the proper substrate, contribute to favorable binding between them and provide hindrance and shielding against unintended side-reactions involving other substrates.46 The coulombic effects, represented by an analogous pairwise electrostatic interaction energy, E electrostatics , enable charge complementarities between regions of the IP and proper substrate and repulsive mismatches with other substrates.46 Since both the IP and a region of the substrate with which it interfaces are occupied by electrostatic interactions with water molecules or some other solvent, evacuating this solvent is quantified as the electrostatic desolvation energy of the former, (ΔG desolvation IP , and the latter, ΔG desolvation substrate . Thus, an estimate has been often used for the net binding free energy, ΔG, from a linear sum of these weighted energies46:

The protein IP engineering possibilities, as outlined in Fig. 1, and IP residue replacement R-group refinement (R 4 ) maintain these favorable ΔG of interactions and global minimum energy conformations (GMECs) of the protein–substrate complex as a whole.14

Protein IP engineering possibilities. The wild-type protein, consisting of the original scaffold and IP with the original substrate (top) can serve as a starting point for three distinct engineering possibilities: an advantageous original scaffold can support an engineered IP that binds a different substrate than wild-type (bottom left) an original IP that binds the original substrate well can be adapted to an engineered scaffold (bottom center) or both IP and scaffold can be engineered to provide advantageous support and binding to a different substrate (bottom right)

Computational Filtering (CF) Approaches

There are a number of CF approaches to perform R 4 with the aforementioned energetic conditions under consideration. However, due to the rapidly increasing degrees of freedom at each residue, n, of the protein chain, coupled with the specific characteristics of the 20 amino acids that can be found at each position, an colossal combinatorial quagmire of 20 n possibilities require modeling and analysis—and for an average-sized protein composed of 100 amino acids, simulating 20 100 possible physical combinations exceeds the number of atoms in the known universe. Thus, the probability of the protein’s IP locating its native state by pursuing all these combinations is biologically infeasible (known as the Levinthal paradox)15 and computationally impractical (known as the Blind Watchmaker paradox).15 An exhaustive structural bioinformatics search for IP formation and end-state continues to be a challenge that is tackled using filtering, heuristics, homology, distributed computing, and high performance supercomputers with varied success.39

Heuristics are often helpful and necessary in undertaking R 4 at the scale of IPs. For example, heuristics in genetic algorithms, mean field algorithms, constraint logic programming enumeration, or database search perform adequately under certain scenarios and assumptions and not as well with others.11 While the computational cost is lessened or efficiency increased compared to the exhaustive search, the quality, however, of the end solution may or may not be consistent rather than assuring that the particular IP R 4 generated by the heuristic is located at or near the GMEC.

Homology can often aid in proper R 4 as well. Here, informatics searches and interpolations from signature sequences of a few residues composing a key motif of IP, substrate, or both can provide clues for engineering. This can extend further to domain sampling of entire regions across the protein that compose the IP. While this may be effective in well-investigated and documented systems, those sequences or structures with no similarity or availability of such information can hinder this approach. Even with fertile sources, often the R 4 is limited by what has been already observed to transpose well.26 , 27

In similar fashion, partitioning and docking can narrow the possibilities for R 4 .25 A collection of IP conformations can be generated that each present a different VDW, electrostatic profile, or desolvation cost. By docking this collection of IP conformations to the proper substrate, the affinity features of those subpartition of IPs that dock more readily can be gleaned. However, fully enumerating all the elements in this collection may be computationally difficult or biologically unsubstantiated.

Furthermore, exact filtering algorithms, among them integer programming,21 dead-end elimination (DEE),12 , 13 , 23 and A*,24 can advance the R 4 process by eliminating possibilities.17 For example, in DEE the relative global energy, E global, of an IP is composed of the linear sum of the energy contributions from the backbone, the self and interaction with backbone energy, E(i r), of rotamer, r, at its position, i, and its pairwise interaction energy with rotamer at nearby position, j:

Thus, if the minimum energy, determined via some given discrete rotamer library and energy function, or best case, arrangement of rotamer, i r, still has a higher energy than the maximum energy, or worst case, arrangement of an alternate rotamer, i t:

then the former rotamer is considered an energetic dead-end for further investigation as its arrangement is guaranteed to not be a participant in the GMEC, thus filtering the number of possibilities than need to undergo R 4 . However, this guarantee is accompanied by an expensive computational cost due to the enumeration of all elements of the rotamer library in use at each residue position of the IP. In addition, since these IP R-groups need to be energy minimized as a whole, then DEE may no longer be provably accurate. In summary, the CF approaches are often a trade-off between the quality of the end IP candidate and the efficiency to reach it.42

Biological Focusing (BF) Approaches

Correspondingly, there are many BF approaches to perform R 4 so that resulting possibilities are in or near the aforementioned energetic conditions, perhaps by virtue of the constraints and fitness requirements existing in and imposed by the biological environment.44 , 45 Here, the parallel processing nature of this environment may provide a natural, even advantageous, platform to evaluate the large combinatorial number of possibilities and interdependencies to be considered in a tractable manner. However, this evaluation is often performed in a stochastic, discovery-driven investigation using various mutagenesis techniques, recombination, and directed evolution among others to screen for high performing clones or select those that survive from a large starting population representing the number of possibilities.7

Stochastics are often necessary for R 4 at a single position in the protein, let alone the half-dozen to a dozen residues that comprise some IPs.19 Consider a random mutagenesis methodology using mutagenic chemicals, wobble base PCR, or error prone PCR to incorporate mutations at the genetic level that will be selected or screened for the desired characteristics at the protein level. Though apparently misguided, it has been observed that non-obvious mutations can give rise to proteins with new characteristics.3

Another group of approaches to achieve R 4 using biology relies on using the recombination of existing components in the system to generate new promising possibilities.43 Among these is incremental truncation to correlate the loss or gain of certain IP features and functions to the gene and protein truncation positions.31 , 38 There is also homologous gene shuffling to generate variants of the original IP from internal wellsprings of diversity.9

These external and internal sources of stochastics can be considered aspects of directed and simulated evolution, which mimic the fitness requirements, survival and natural selection, propagation and amplification and of individuals, or IPs, to evaluate massive potential-filled populations with desirable properties.19 However, these stochastic approaches rely on the robustness of this evolutionary condition to propagate order from randomness. In summary, the BF approaches are usually a compromise between the intended end IP and those that arise serendipitously or survive having unintended properties.

Coupling Computational and Biological Approaches: CF–BF

While CF and BF each has its own advantages and drawbacks, a synergistic coupling of CF and BF may narrow the scope to a smaller number of high-quality, intended candidates more efficiently than either alone (Fig. 2). This smaller number of possibilities is also more amenable to downstream computational and empirical evaluation and feedback. In this research, we attempt to demonstrate the application of the CF–BF criterion to computationally engineer a putative IP on the scaffold of the restriction endonuclease R.PvuII to bind a different DNA substrate, which is to be that of the restriction endonuclease R.EcoRV.

CF–BF reduces the search space Ω and the corresponding cost required to locate GMEC. Using existing CF criterion, shown in red arrows, the search space of all possibilities, Ω, eliminates residues and R-group configurations of those residues that are most likely not in the GMEC based on pairwise local energies, to yield a smaller number of conformational possibilities, shown in dotted blue line, that must be evaluated via global energy minimization (top panel). Coupling to a BF criterion, shown in green, improves this condition by further reducing Ω to an even smaller number of possibilities based on evolutionarily relevant residues and R-groups, to be evaluated for minimum global energy as well as functionality (bottom panel)

Web-based tools for computational enzyme design

Enzymes are increasingly used in technological applications but often require prior optimization.

Computational tools save time and effort in protein engineering endeavours.

Web-based tools make computational enzyme design accessible to a broad user community.

The recent enzyme engineering web-tools are described and classified by engineering purpose.

Enzymes are in high demand for very diverse biotechnological applications. However, natural biocatalysts often need to be engineered for fine-tuning their properties towards the end applications, such as the activity, selectivity, stability to temperature or co-solvents, and solubility. Computational methods are increasingly used in this task, providing predictions that narrow down the space of possible mutations significantly and can enormously reduce the experimental burden. Many computational tools are available as web-based platforms, making them accessible to non-expert users. These platforms are typically user-friendly, contain walk-throughs, and do not require deep expertise and installations. Here we describe some of the most recent outstanding web-tools for enzyme engineering and formulate future perspectives in this field.



fpocket relies on the concept of α-spheres, a concept initiated by Liang and Edelsbrunner ( 3) and is also used by Chemical Computing Group in the SiteFinder software ( An α-sphere is a sphere in contact with four atoms on its boundary, not containing any internal atom inside. For a protein, very small spheres are located within the protein, whereas large spheres are located at the exterior. Clefts and cavities correspond to spheres of intermediate radii. Thus, it is possible to filter the ensemble of α-spheres defined from the atoms of a protein according to some minimal and maximal radii values in order to address pocket detection. Based on this, we have recently introduced the fpocket package for pocket detection. For more information refer to ( 10).

Pocket tracking over collection of structural frames

Given a collection of comparable protein structures, such as provided by molecular dynamics or by homology search, one challenge is to track the persistence of pockets within this set of conformations or frames. The approach used can be summarized as an iterative run of fpocket on each frame, followed by a post-analysis using a grid-based approach, as illustrated Figure 1.

Workflow of the pocket tracking methodology. α-spheres from different snapshots are represented by different colors (dark and light).

Workflow of the pocket tracking methodology. α-spheres from different snapshots are represented by different colors (dark and light).

In more detail, a 1 � spaced grid is generated to encompass previously aligned conformers. The grid allows tracking of pockets (α-spheres) in very precise zones over time. On each grid point the α-sphere density of 8� 3 volume around it is calculated, corresponding to a small box of a 2 � sized edge. Furthermore, the associated pocket score for each α-sphere near a grid point is tracked following formula ( 3).

Watch the video: Disruption - Day 2 - Part 1 ENG (November 2022).