Measuring the performance of ligand-based methods
This is another episode in the "how good are these methods really?" series. The aim is to understand how well the ligand-based 3D methods for virtual screening perform under standard benchmarking sets. In the original papers introducing the methods frequently is benchmarking done on a small test sets, prone to all sorts of small population problems.
Here I take 3 popular methods: the ROCS-like implementation of Gaussian shape overlap that I recently ported to rdkit as well as USR and USR-CAT, two very fast shape methods. All 3 are tested on the DUDE dataset, a standard benchmarking set. Conformer generation was done as outlined by JP in his paper.
|Pair of methods compared||Statistic used||Value|
|rdkit-shape VS usr||wilcoxon||T=2.24, p=0.02|
|rdkit-shape VS usrcat||wilcoxon||T=-0.47, p=0.63|
|usr VS usrcat||wilcoxon||T=-2.70, p=0.006|
One thing that struck me was the spread of performance. I got a little worried that the conformer generation method was messed up, so I tested that part and it correctly reproduced the crystal poses to within 2-3 Angstrom easily (below). So that part is okay.
It still could be that there is a numerical problem with rdkit-shapeit – and that it can't find the correct maximum overlap but I'm not sure how to test for that.
But let's take it all at a face value. These methods are not great... USR is worst of all, kind of making a case that it doesn't matter if the method is fast, it matters if it gives any sort of sensible results. rdkit-shapeit isn't much better. USRCAT performs best of all, along the lines of "if you include some chemistry features, it works much better". It seems that although shape complimentary is a necessary condition for molecular similarity, it's certainly not a sufficient condition...
Going back to the general statement (that these methods don't look awesome), it's clear that even the best one offers a median AUC (0.6) that's barely above random (0.5). The interquartile range for all 3 methods is huge, which means that depending on your target protein they may work well or terrible. It's interesting to think that a method with a narrow interquartile range around AUC of 0.6 would be more desirable that a method with median AUC of say 0.7 but with huge variation from target to target.
A blank space is left for rdkit-alignit, a pharmacophore tool. rdkit-alignit is to rdkit-shape what usrcat is to usr – it incorporates chemistry features into the search. The blank space is left because I haven't ported alignit into rdkit yet and so I can't make the comparison.