Posts

Cool pandas hack - get random rows in a multi-column dataframe

import pandas as pd

# load inputs
actives = pd.read_pickle("actives_final.pkl")
decoys = pd.read_pickle("decoys_final.pkl")

# stack tables
df = pd.concat([actives, decoys])

# remove duplicate indices
df = df.reset_index()

# one row per (category, molId): the one with the highest 'tc'
ordered = (
    df.sort_values(by='tc')
    .groupby(['category', 'molId'])
    .last()
    .reset_index()
)

# one row per (category, molId): a random one
shuffled = (
    df.sample(frac=1, random_state=123456)
    .groupby(['category', 'molId'])
    .last()
    .reset_index()
)
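To see what the hack does without the pickle files, here is a minimal sketch on a made-up table. The column names match the snippet above, but the values are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the stacked actives/decoys table; the values are made up.
df = pd.DataFrame({
    "category": ["active"] * 3 + ["decoy"] * 3,
    "molId": [1, 1, 2, 7, 7, 7],
    "tc": [0.2, 0.9, 0.5, 0.1, 0.4, 0.3],
})

keys = ["category", "molId"]

# Deterministic pick: sort by 'tc', then keep the last (highest) row per group.
ordered = df.sort_values(by="tc").groupby(keys).last().reset_index()

# Random pick: shuffle all rows, then keep whichever row ends up last per group.
shuffled = (
    df.sample(frac=1, random_state=123456)
    .groupby(keys)
    .last()
    .reset_index()
)

print(ordered)
print(shuffled)
```

Both results have one row per (category, molId) pair. One caveat: `last()` takes the last non-null value of each column, so if the table has missing values, `.groupby(keys).tail(1)` is probably the safer way to keep whole rows.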

Example slurm cluster on your laptop (multiple VMs via vagrant)

To anybody who's considering a switch to a modern queuing system like slurm, this will be a useful guide. Rather than doing a roll out on production nodes, we'll try things out in vagrant. This text assumes you have vagrant installed on your local machine (it's very easy).

The ultimate aim will be to set up 2 servers (worker nodes) and one master/controller, and run a simple job on them. There are multiple great guides out there, but most of them are out of date and won't work without some tweaking on Ubuntu 16.04 Xenial.

The entirety of the code is hosted at https://github.com/jandom/gromacs-slurm-openmpi-vagrant
The README.md file will show you how to set it up, with checks on every step.

References
https://mussolblog.wordpress.com/2013/07/17/setting-up-a-testing-slurm-cluster/
https://github.com/gabrieleiannetti/slurm_cluster_wiki/wiki/Installing-a-Slurm-Cluster
http://philipwfowler.me/2016/04/14/how-to-setup-a-gramble/
https://github.com/dakl/slurm-clus

Making DESMOND easier to run on a cluster

So, some of you probably wanted to try out this DESMOND code from DESRES. It's shipped with the Maestro suite by Schrodinger, which is free for academics and a little annoying to use if you wanna *just* use the binary. This tutorial describes how to run DESMOND without any of the annoying wrappers shipped by Schrodinger. It makes the code possible to run in a sane fashion, similar to how you launch any other MD code on the cluster. We'll cover both the CPU DESMOND and the GPU GDESMOND.

The required packages can be downloaded from https://www.deshawresearch.com/downloads/download_desmond.cgi/
Desmond_Maestro_2016.3.tar
Desmond-3.6.1.1.tar.gz

The 'Desmond-3.6.1.1.tar.gz' is just the source code and some sample systems. We won't be compiling that, just gonna use the sample system to see if the code runs. Install 'Desmond_Maestro_2016.3.tar' following their instructions. Here is an example module file, to accomp

What was that password again?

A few months ago I bought a new hard-drive. When you encrypt a drive, you need to set a password – but the trick here is to remember the password. Your hard-drive won't send you a "forgot password?" email. So you set that password, enter it, the drive mounts and then you work away happily for weeks without rebooting your machine. Then you reboot... and you need to re-enter that password. You slowly realize that those weeks ago you set a new, unique password for that bloody hard-drive, one that you can't remember for the love of god...

What to do? Well, one: don't be a complete idiot like me – remember the password. But if you do forget it, here is what to do. [Spoiler alert: I failed to recover the password but it's a pretty interesting ride...]

The new Ubuntu 16.04 comes with this handy tool called 'bruteforce-luks'; here are some scenarios. Note you have to run it as root. "I remember the start/end of the password and I sort of know what's in the midd
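To get a feel for why the "known start/end, unknown middle" scenario is the hopeful one, here is a rough sketch of the search space. It only illustrates the idea of enumerating candidates. It is not how 'bruteforce-luks' works internally, and the function names and the default lowercase charset are assumptions made up for this example.

```python
import itertools
import string


def candidates(prefix, suffix, middle_len, charset=string.ascii_lowercase):
    """Yield every password of the form prefix + middle + suffix.

    Hypothetical helper: 'middle' runs over all strings of length
    middle_len built from charset.
    """
    for combo in itertools.product(charset, repeat=middle_len):
        yield prefix + "".join(combo) + suffix


def search_space(middle_len, charset=string.ascii_lowercase):
    """Number of candidates for an unknown middle of the given length."""
    return len(charset) ** middle_len


# Known start and end, three unknown lowercase letters in the middle.
print(search_space(3))
print(next(candidates("ab", "yz", 2)))
```

The count grows as the charset size raised to the number of unknown characters, so every character you do remember cuts the work by a large factor.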

Benchmarking gromacs – 2 quick questions

With gromacs there are all these things you have to do when benchmarking, and it's a little bit of a mess. One question I always wondered about was: how long do you have to run to get a reliable performance number, in ns/day? 1e0 MD steps is certainly too short, but 1e7 steps seems excessive. The second question was: could one get away with benchmarking just a box of water? Currently people use systems like DHFR or APOA1 (protein in water) to assess performance, but those are arbitrary. A box of water has a dramatically simpler topology than a protein, but maybe it doesn't matter? One thing this doesn't assess is the absolute performance of gromacs: the ns/day below are low, but that's because an old workstation is used, without GPU acceleration. The answers are pretty simple: 1e4–1e5 steps are required to 'converge' the performance estimate, for my test system. Also, a water box of very similar dimensions (and # of particles) but without the p
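For reference, the ns/day figure is just simulated time divided by wall-clock time. A minimal sketch of the arithmetic is below. The function name and the example numbers are made up for illustration and are not measurements from this benchmark.

```python
def ns_per_day(n_steps, timestep_fs, wall_seconds):
    """Convert a finished run into a ns/day performance number.

    n_steps      -- number of MD steps in the run
    timestep_fs  -- integration timestep in femtoseconds
    wall_seconds -- wall-clock time the run took, in seconds
    """
    simulated_ns = n_steps * timestep_fs * 1e-6  # 1 fs = 1e-6 ns
    seconds_per_day = 86400.0
    return simulated_ns * seconds_per_day / wall_seconds


# Hypothetical run: 50,000 steps at 2 fs that took 86.4 s of wall time.
print(ns_per_day(50_000, 2.0, 86.4))
```

Very short runs are dominated by startup and load-balancing overhead in `wall_seconds`, which is one plausible reason the estimate needs a minimum number of steps before it settles.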

Pharma, you're not a good place to go

Just visited an old-favourite blog of mine – the excellent "In the pipeline" by Derek Lowe. Going through the last 5 or so pages, it's clear what the hiring state is in the industry that once was:

Merck Cuts Chemistry
Layoffs at Takeda
AstraZeneca Cutting Back Again
Layoffs at Medimmune

To balance that, there was one piece of positive news:

Merck Expanding on the West Coast

The blog posts span July 2016 to Sept 2016. Now, these are no hard numbers, but they give me an "anecdotal nudge" that the field/industry is in atrophy. When was the last time you saw any of the computer-science shops with similar headlines? "Google cuts down on software engineers", "Facebook re-organizes and lays off data scientists", anybody? It simply doesn't happen, not at this scale. If you're a young student, grad or post grad in life sciences, consider your options carefully. Academia is marred by fierce competition over small resources.

VNC connect from Mac OSX El Capitan to Ubuntu Linux

Edit 2016-11-05

This is actually trivial, here are the steps.

On the remote Ubuntu "workstation", run
$ vncserver

On the home Mac machine, run your favorite SSH tunnel
$ ssh -t -L 5901:localhost:5901 workstation

Open up the vnc screen in a Mac window
$ open vnc://localhost:5901
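The 5901 in those commands comes from the VNC convention that display :N listens on TCP port 5900 + N, so the steps above assume the server came up on display :1. If yours lands on a different display, a small sketch like this rebuilds the two Mac-side commands. The helper names are made up for illustration, and "workstation" is the SSH host alias from the steps above.

```python
def vnc_port(display):
    """TCP port for VNC display :N (convention: 5900 + N)."""
    return 5900 + display


def tunnel_commands(host, display=1):
    """Return the SSH tunnel and 'open' commands for the given display."""
    port = vnc_port(display)
    return [
        f"ssh -t -L {port}:localhost:{port} {host}",
        f"open vnc://localhost:{port}",
    ]


for command in tunnel_commands("workstation", display=1):
    print(command)
```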