https://www.science.org/content/blog-post/ai-predicting-compound-affinity-we-aren-t-there-yet
Here’s a paper evaluating a popular AI/ML model for cofolding ligands and proteins, Boltz-2. This is of course a problem of extreme interest to the drug discovery community, as well as to all sorts of people working on cell biology, structural biology, and related fields. It’s been one of the goals for decades to start from scratch with a protein sequence and a small molecule and be able to say “Does this molecule bind to this protein? How well?”
And no, we really haven’t been able to do that, not in the way that we’d like (and certainly not on the scale that we’d like). The error bars on those binding predictions have generally been too wide, both on the underlying structure of the protein and on the energetic implications of how it interacts with a given small-molecule ligand. And the computational burdens of even getting that far have generally been too great, given the number of conformations you’re likely to need to examine (and the way that you’ll need to evaluate which of those are most plausible relative to the others).
Protein structure from scratch was of course a notoriously hard problem for decades as well, but machine-learning models trained on the databases of known protein structures (AlphaFold, RoseTTAFold, and the like) have made terrific progress by identifying the often-reused structural motifs and their effects on overall tertiary protein structure as they’re combined. But protein-with-ligand, that one is still the Holy Grail. If we could get that to work well, and get the speed up and the computational overhead down, then we’d perhaps be able to finally achieve primary-screen nirvana in silico. No need to make and purify protein, no need to have a basement full of hundreds of thousands of small-molecule candidates in vials and plates. No robot arms, no fluorescent plate readers. Just fire up the computational hardware and software and go get some lunch instead.
Boltz-2 is one of the open-source alternatives to software like AlphaFold 3, all of which are trying to address this problem. And it is claimed to produce protein structures at AlphaFold levels of accuracy while simultaneously predicting binding affinities at a level similar to the most computationally intensive methods (like free energy perturbation) but hundreds of times faster. So as you can imagine, it and the other programs in this space have gotten a lot of attention.
As the paper linked above notes, so far it looks like this software is at its best when working with rather locked-down protein structures and known binding structures - that is, when working on Easy Mode. Unfortunately, we don’t spend much time on Easy Mode in the wonder drug factories. We have a lot of other things to worry about: proteins that don’t have much (or any) good experimental structural data, binding sites that depend crucially on the effects of water molecules, small-molecule cofactors, or on the binding of another ligand at a completely different allosteric pocket. And some of the binding events we’re looking at turn out (once we get real-world data) to involve significant shifts in the original protein structure and/or rather odd twists in the conformation of ligand molecules, neither of which are easy to compute your way to. (You end up paying a lot of energetic penalties if you try to advance step-by-step, and the system may well throw in the towel before the big unexpected energetic payoff at the end).
In this paper the authors use a set of 943 virtual-screen hits from some very large previous screening efforts (hundreds of millions of candidates), binding to ten different target proteins, and with the associated real-world in vitro binding data already in hand. Those comprise 364 true positives and 579 false positives, as discovered when those assays were run. The paper notes that so far no affinity-prediction systems have been able to really tell those true positives from false positives in this data set, so this is an adversarial challenge for sure, and just the thing to let a hot new piece of software get its virtual teeth into. Most of the targets, as it happens, are G-protein-coupled receptors, although the binding site diversity is still very high.
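To put those numbers in perspective, any affinity predictor has to beat the trivial baseline of just calling every virtual-screen hit a binder. A quick back-of-the-envelope (illustrative code based on the counts above, not taken from the paper):

```python
# Illustrative arithmetic on the benchmark composition described above;
# the counts come from the paper as summarized in this post.
true_positives = 364   # hits confirmed by the in vitro binding assays
false_positives = 579  # virtual-screen hits that failed experimentally
total = true_positives + false_positives  # 943 candidates in all

# Baseline precision if a method simply accepted every screening hit:
baseline = true_positives / total
print(f"{total} candidates, naive-baseline precision {baseline:.1%}")
# → 943 candidates, naive-baseline precision 38.6%
```

So a model has to do noticeably better than roughly two-in-five before it is adding any information at all on this set.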
Boltz-2 ran through the 943 candidates at about two minutes per, and it got all of them into the right binding pocket (as it darn well should - all of these targets have ligand-bound structures in the PDB already). And its predictions for “Is this compound likely to be a good binder?” are notably better than any other method tested (with the exception of two targets on which it failed pretty thoroughly). So it really distinguished itself on finding the true positives. This did not seem to be due to similarities between these compounds and the Boltz-2 training set (which is something you always need to be wary of).
That said, its actual predictions of affinity were quite poor as compared to the experimentally determined values. And when the actual structures of the ligands in the binding pockets were examined, the Boltz-2 predictions were pretty far off of what is believed to be the actual situation. Even odder, the accuracy of distinguishing true positives did not seem to be affected by the quality of the docking poses, which is rather counterintuitive.
At this point the authors were mindful of a report that came out last year about AlphaFold 3 docking predictions. That work noted that AF3 poses and predictions seemed curiously insensitive to amino acid mutations in the binding site(s) that should severely affect such results. These clearly nonphysical results suggest a great deal of overfitting to the training data, or to particular trends in it, and caused those authors to warn people about relying too much on such deep-learning models. So the authors in this latest paper tried the same trick: introducing amino acid changes that would absolutely blow up important polar interactions between the ligands and their binding sites. We’re talking aspartic acid to alanine, that sort of thing, or dropping a proline into the hinge region of a kinase. These are grenades.
Unfortunately, Boltz-2 emerged from that challenge with predictions that were for the most part not statistically different from the ones generated from the wild-type structures. What’s more, the poses of the compounds in these messed-up binding sites seemed to have little in common with what it had generated earlier - i.e., it didn’t hang on to the other interactions it had found while just making the best of it with the abrogated ones. Further alanine-scan mutations (up to six per binding site!) made it clear that Boltz-2 just didn’t care much about such petty details.
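The mutation control being described is simple to state in code. Here is a hedged sketch of the general idea - note that `predict_affinity` is a placeholder for whatever scoring call a cofolding model exposes, not a real Boltz-2 API:

```python
# Hypothetical sketch of the mutation-sensitivity control described above.
# `predict_affinity(sequence, ligand)` stands in for a cofolding model's
# affinity scorer; it is NOT a real Boltz-2 function.
def mutate(sequence: str, position: int, new_residue: str) -> str:
    """Return the sequence with a single point mutation (0-indexed)."""
    return sequence[:position] + new_residue + sequence[position + 1:]

def sensitivity_check(sequence, ligand, hotspot_positions, predict_affinity):
    """Compare predicted affinity for wild type vs. binding-site knockouts.

    A physically grounded model should show a large drop in predicted
    affinity when key polar residues (an Asp, say) are mutated to Ala.
    A model that returns near-identical scores is not reading the site.
    """
    wild_type_score = predict_affinity(sequence, ligand)
    deltas = []
    for pos in hotspot_positions:
        knockout = mutate(sequence, pos, "A")  # alanine knockout
        deltas.append(predict_affinity(knockout, ligand) - wild_type_score)
    return wild_type_score, deltas
```

The whole point of the control is that the `deltas` should be large and negative for residues that anchor the ligand; per the paper, for Boltz-2 they mostly weren’t.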
Even reshuffling the target proteins completely and assigning random ones to the ligands (where they would be expected to have no binding whatsoever) only got rid of about half the true-positive recommendations. For the others, predictions of affinity seemed to be almost independent of what target they chose. This is not what you want. In fact, it’s the opposite of what you want. The authors “therefore advise remaining very skeptical with respect to affinity predictions” from the program (and others like it, you’d have to say) and I completely agree.
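That reshuffling control is also worth spelling out, since it’s the kind of negative control anyone evaluating these models can run. A hedged sketch (again, `predict_affinity` is a placeholder, not an actual Boltz-2 call):

```python
import random

# Hedged sketch of the target-reshuffling control described above:
# score each ligand against a randomly chosen *wrong* protein.
# `predict_affinity` is a placeholder scoring function, not a real
# Boltz-2 API; any cofolding model's scorer could be dropped in.
def shuffled_control(targets, ligand_to_target, predict_affinity, seed=0):
    """Score every ligand against a mismatched target.

    A physically grounded model should report negligible affinity for
    nearly all of these pairs, since the ligands were selected against
    entirely different proteins.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    scores = {}
    for ligand, true_target in ligand_to_target.items():
        decoys = [t for t in targets if t != true_target]
        scores[ligand] = predict_affinity(rng.choice(decoys), ligand)
    return scores
```

If roughly half of the original true-positive calls survive this treatment, as reported, the model’s affinity estimates are carrying a lot of ligand-only signal and not much target-specific physics.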
It is very tempting to look at the outputs of such software and to tell yourself that it must have a deep understanding of the physics and energetics of protein folding and compound binding. But that is an illusion. Large computational models do not understand anything, any more than LLM chatbots know what they are “saying”. We have built these systems to provide blurry copies of what were actually useful and pleasing outputs that we generated by our own efforts, and sometimes the results, the extruded simulated products, are worthwhile enough and sometimes they are not. In the case of overfitted models, which these seem to be, we are at great risk of just talking to ourselves and playing our own voices back to no good effect.
Understanding is still our human domain. And we need to understand not only the physics of small-molecule interactions, but the workings of our own tools.