Thursday, January 19, 2017

A great-looking option for doing easy parallel/distributed computation in Python

[Updated 20.01.2017 to use parallel file reading]

A few years ago I blogged about using an early version of Blaze to do some interesting data manipulations. Since then I've been paying attention to the progress of the various pieces of the Blaze ecosystem and experimenting with them. I have some really interesting results from using Dask to work with large amounts of chemical data, but those are taking me a while to write up, so I haven't been blogging about this.

Today I managed to make a bit of time to do some experimentation with using Dask to do parallel computation. As usual, I want to be able to do that with molecules, but fortunately Dask's bag data structure makes that pretty easy.
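For anyone who hasn't used it: a Dask bag is a lazy, partitioned collection of Python objects that supports the usual map/filter/fold operations. A toy example of the idea (mine, not part of the demo below):

import dask.bag as db

# a bag is lazy: nothing actually runs until compute() is called
b = db.from_sequence(range(10), npartitions=2)
b.filter(lambda x: x % 2 == 0).map(lambda x: x * x).compute()
# -> [0, 4, 16, 36, 64]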

Here's a short demo of that, without a lot of extra text:

Start by importing dask and the RDKit:
In [4]: import dask.bag as db
In [5]: from rdkit import Chem
Set up a file reader to pull in one of the ZINC15 tranches; the blocksize argument enables parallel file reading:
In [6]: text = db.read_text('zinc15_CD.rdk.smi', blocksize=int(1e7))
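If you want to double-check that the reader really did split the file up, the bag's npartitions attribute shows how many pieces it will be processed in (my addition, not part of the original session):

# with a ~10MB blocksize this should be roughly file_size / 1e7,
# and each partition can be read and processed in parallel
text.npartitions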
Check how many lines we have; this is the first time we actually do any work:
In [7]: text.count().compute()
Out[7]: 6008877
Create the substructure query we will use: 
In [10]: p = Chem.MolFromSmarts('c1nc(C)nc(O)c1')
Now create a bag containing the molecules. I'm skipping sanitization here; judging by the .rdk in the filename, these SMILES were already generated by the RDKit, so that should be safe:
In [12]: mbag = text.map(lambda x: Chem.MolFromSmiles(x.split()[0], sanitize=False))
And then filter the bag down to the molecules that match our substructure query (the y=p default argument is just a way of baking the query molecule into the lambda):
In [13]: mms = mbag.filter(lambda x: x is not None).filter(lambda x,y=p: x.HasSubstructMatch(y))
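If that y=p trick looks odd, an equivalent (and maybe more readable) spelling uses a small named function. This is just a variation on the same idea, not what I actually ran:

def has_match(mol, query=p):
    # the query is bound as a default argument so the function is
    # self-contained when Dask ships it off to the worker processes
    return mol.HasSubstructMatch(query)

mms = mbag.filter(lambda x: x is not None).filter(has_match)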
Finally, pull out the results and look at how long it takes:
In [14]: import time
In [15]: t1 = time.time(); ms = mms.compute(); print(" runtime: %.2fs"%(time.time()-t1))
runtime: 49.92s
In [16]: len(ms)
Out[16]: 5739

This got all the cores on my desktop machine happily working away.

~50 seconds to load 6 million molecules from SMILES and do a substructure filter across them without doing any precomputation seems pretty good to me.
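A side note: by default these bag operations use the multiprocessing scheduler on the local machine. If you have a cluster available, dask.distributed should let the same code scale out. A sketch, assuming a scheduler is already running at scheduler-host:8786 (a hypothetical address):

from dask.distributed import Client

# connecting a Client makes it the default scheduler, so the same
# mms.compute() call now runs on the cluster's workers instead
client = Client('scheduler-host:8786')
ms = mms.compute()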

A warning to anyone who wants to play along at home: you need to be careful about what you call compute() on. Doing mbag.compute() will cause Dask to happily go off and try to load all 6 million molecules at once. Unless you have a lot more RAM than I do, this is not going to make you happy.
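If you just want to look at a few of the results, the bag's take() method is a safer way to peek; by default it only computes the first partition:

# look at the first few matches without materializing the whole bag
mms.take(3)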

There's some really cool stuff going on in the Python data science world.