Showing posts from 2014

Performance of a popular docking code

This post is a quick note on a performance of a commercial docking code, as measured across the entire DUDE dataset. For around a 100 protein targets, the code is supposed to rank active compounds higher than decoys compounds – separate poppy seeds from sand if you like. Let me start by saying that I'm pretty impressed with what I've seen. Starting this side-project, I assumed that any docking code can be expected to have an AUC of around 0.6-0.7 measured on a standard benchmarking set (such as DUDE). I think that's largely true of free codes such as Autodock vina or rdock. But here we're looking at a piece of code from a major commercial vendor and it performs beyond my expectation. I'm not disclosing the name of the code, since there may have been something in the academic license that prevents such benchmarking studies from being published (a "gag order" effectively, in case somebody is measuring the code "incorrectly"). AUC is estima

SDF to Excel file, in an automated fashion

Sharing SDF files between chemists is often a pain. It's supposed to be vanilla and super-standard but sometimes still gives everyone involved a headache. Especially when moving SDF between two chemistry codes, especially if hydrogens are involved... For this reason, and because some people ONLY work with excel files, it's good to have an ability to automatically convert an SDF file to a Excel file (especially xlsx). With pandas and rdkit, its possible to easily make such moves. Example below. Pandas uses xlsxwriter module to support the Excel format. There is no easy way to pass image objects, embedded in the pandasa.DataFrame, down to xlsxwriter. The writer itself supports the insert_image functionality that takes a filename as argument example ). The easiest way is to make pandas detect that a cell contains a string ending with a .png and take use 'insert_image', see the hack below: And here you go: molecule_data.xlsx has a beautiful col

Measuring the performance of ligand-based methods

This is another episode in the "how good are these methods really?" series. The aim is to understand how well the ligand-based 3D methods for virtual screening perform under standard benchmarking sets. In the original papers introducing the methods frequently is benchmarking done on a small test sets, prone to all sorts of small population problems.  Here I take 3 popular methods: the ROCS-like implementation of Gaussian shape overlap[1] that I recently ported to rdkit as well as USR[2] and USR-CAT[3], two very fast shape methods. All 3 are tested on the DUDE[4] dataset, a standard benchmarking set. Conformer generation was done as outlined by JP in his paper[5]. Pair of methods compared Statistic used Value rdkit-shape VS usr wilcoxon T=2.24, p=0.02 rdkit-shape VS usrcat wilcoxon T=-0.47, p=0.63 usr VS usrcat wilcoxon T=-2.70, p=0.006 One thing that struck me was the spread of performance. I got a little worried that the conformer

Adding a simple cuda library to RDKit

There are many interesting things a chemist can do with a GPU. Think FastROCS as an example where GPU has greatly speed-up aligning two conformers. A chemist, in addition to learning how to code on a GPU a little bit, has to distribute the code. One strategy would be to include the code in another software package that's widely used by the target audience. This can be a very tedious task... and so the task of this post will be to cover this 'boiler plate' part. The code presented here does nothing meaningful, it just shows you how to do the piping. For the ubuntu users in the house, visit . In my setup ubuntu 14.04 64bit DEB was the choice, then 'sudo apt-get install nvidia-cuda-toolkit' Here is a simple 'hello world' program that truly does some work on the GPU. Let's save that as file. It can be anywhere – this will be independent of rdkit for now. That's all

ROCS-like shape overlap in rdkit

One year ago, I gave a brief talk at the RDKit user group meeting in Cambridge. As usual, I ranted about how it would be fantastic to have robust open-source computational chemistry tools. In particular, a commonly used tool from the 80s – gaussian-based shape overlap for pairs of molecules – was only covered by a single open-source package. So, together with help from some friends (big shout out to Greg Landrum, rdkit) I set out to change that. Took almost a year but we're getting there! The only open-source package doing shape overlap was shape-it from silicos-it. We basically ripped the code out of it and incorporated it into rdkit. The aim was to provide python-scriptable shape-overlap tool. The silicos-it is run by Hans de Winter – big thanks to him – I'm not sure what's going on now but the silicos-it webpage down. Greg did most (or certainly a _lot_) of work cleaning up the code. Now, we're finally moving towards testing against standard benc

Conformer generation

Edited: update the figure after working around the RMSD alignment (rdkit tries all the atom order permutations to find the best alignment, rather than keep the atom order fixed, i think); removed the technical problems paragraph. New figure.  "Conformer generation is an art". That statement from a more senior computational chemist initially made me chuckle – I mean, how hard could that be? If computational methods are claiming to bring a lot to the table, surely conformer generation is a problem solved long ago... Oh the naive young mind ;)  Let's start with a simple question here is: if we start from the native (experimentally observed) conformation of a ligand and minimize it using two popular ligand force-fields (UFF and MMFF), we should stay close to the crystal structure. Is that indeed the case? Figure 1 So that doesn't look so bad! Let's hydrogenate the structures – right now they're mixed bag. Figure 2 Woops, adding hydrogens s

Make a molecule dissapear in gromacs

This is a free energy perturbation basics – making molecules appear and disappear and using that to estimate free energies. This code can also be used ror other, imaginative purposes. To make a molecule 'Ligand' disappear, use the following grompp.mdp ... integrator = ld dt = 0.001 nsteps = 10000 ... ; FREE ENERGY free_energy          = yes init_lambda          = 0.00 delta_lambda         = 1e-4 sc-alpha             = 1.5 # sc-power             = 1.0 couple-moltype       = Pore couple-lambda0       = vdw-q couple-lambda1       = none Now let's break it down: free_energy = yes   is an on-switch for this feature of gromacs init_lambda = 0.0 defines how invisible is the molecule at the start of the run, a value of 0 indicates that it's fully visible delta_lambda = 1e-4 will define by how much the molecule is disappeared at every step of the simulation. Here this has to match the nsteps value - after 10000 steps the molecule will be fully gone. couple-lam

High-frequency lunch booking

It's summer time and a 750-year-old tradition is that Merton college SCR (the senior common room) opens up its lunch sittings to the MCR members (middle common room). It's quite a treat – the quality of the meal greatly surpasses what MCR members get in other times of the year. There is a catch obviously: the number of seats available to the MCR members is limited. The seats become available some time between Fri-Monday each week. Nobody has the time to sit there and keep refreshing the page so why not use a simple python web-crawler, executed periodically from crontab to maximize your odds of getting a spot? Source code is attached! Enjoy ;) EDIT: Due to popular demand from college folks, the tool has been converted to web app. It's called, HackAHall and it's available (invite only) here ;) Thanks, Jan

Acetylated lysine and negatively charged cysteine for CHARMM protein force-field

The CHARMM protein force-fields in gromacs provides topologies for many commonly found protein residues. However, when one wants to include a slightly more exotic residue - say a modified amino acid - the process can be a little tedious. Below, two example topologies are shown: the acetylated lysine (ALY) and negatively charged Cys (CYN); treat those as an extension to CHARMM protein force-field. ALY is can be used to model modifications on histone proteins, while CYN is supposed to model Cys in zinc-finger motifs where it coordinates Zn2+ ions. The two topologies below are simple hacks. At no point am I claiming that they are 'correct' in any way – but they provide a starting point for any improvements one wishes to make. Edit: I recently bumped into a paper called Force field parameters for the simulation of modified histone tails where quantum calculations are done to obtain CHARMM parameters for the modified aminoacid residues. This could be a more reliable set of pa

Measuring docking code performance

Here is something I've been meaning to do for a long time: measure the performance of a docking code on a standard benchmarking dataset. The choice was DUDE, the Dud38 subset. From the Dud38, I've decided to look at actives and decoys both the "scaffolds" and "final". Here, only partial results for the "scaffolds" are shown. The code tested is autodock_vina, version 1.1.2 with the default settings (including exhaustiveness). This is the only publicly available docking code that I know of. rdkit was used to prepare the molecules from smiles strings. The performance is measured over 38 protein targets (hence the Dud38). Experimentally known active compounds as well a fake, decoy molecules are docked and scored to each target. The scoring is then converted into a ranking of molecules. The performance metric is AUC, or area under the curve, commonly used in signal processing. The AUC of 1.0 indicates a perfect classifier (all actives score higher t