Saturday, February 14, 2015

Feels like getting something for nothing...

Background

One of the focus points for the 2015.03 RDKit release is improving performance. To this end we've made changes that mitigate or remove some of the performance bottlenecks. These include, among others, modifications to the way SMILES are generated, rearranging the way the molecular GetProp/SetProp interface is used internally, and making the RDKit molecule smaller so that less memory is required. There are a couple of other changes coming; I think there should be a nice increase in the speed of common operations when the new version is released.

Getting something for nothing

Brian Kelley pointed out that using tcmalloc instead of the system-provided malloc implementation can lead to big speedups. It's super-easy to test (just add LD_PRELOAD=/usr/local/lib/libtcmalloc.so to the command line, so I gave it a try with the RDKit. Wow did it make a difference!
Many of the tests in the RDKit's basic python performance suite run too quickly to really be able to say much about performance, so I created a second performance suite that runs larger tests where it makes sense. This isn't yet complete - I need to add some reasonably sized tests of the conformation generation and force fields - but it's a decent start.
Here's a performance comparison for the current trunk status. The tests were run on my linux box (a three year old Dell Studio XPS) running Unbuntu 14.04.
test default with tcmalloc fraction
50K mols from SMILES 21.7 12.7 0.59
generate SMILES 12.7 6.5 0.51
10x1K mols from SDF 8.2 5.6 0.68
823 queries from SMILES 0.1 0.1 1.00
HasSubstructMatch 102.0 80.9 0.79
GetSubstructMatches 115.3 91.9 0.80
428 queries from SMARTS 0.0 0.0 0.0
HasSubstructMatch 287.0 239.6 0.83
GetSubstructMatches 288.2 240.8 0.84
generate Mol blocks 37.8 24.5 0.65
BRICS decomposition 79.4 53.6 0.68
generate 2D coords 27.1 23.9 0.88
generate RDKit fingerprints 148.4 80.8 0.54
generate Morgan fingerprints 7.5 3.8 0.51
That's a pretty dramatic speedup for no work at all!
It is, unfortunatetly, not possible to make using tcmalloc the default at RDKit build time: this would require that other programs using the RDKit shared libraries (python, postgresql, etc.) also be re-compiled to use tcmalloc. It's probably also not safe to use the LD_PRELOAD trick in your .bashrc, but setting it before starting a long-running process seems like it definitely could be a win.

Friday, February 13, 2015

New Drawing Code

The 2015.03 release of the RDKit will contain a new set of C++-based functionality for drawing molecules. This code, based on a contribution from Dave Cosgrove, is much faster than the current python-based drawing code and provides much more flexibility and higher drawing quality than the current C++ code (which is used in Knime).

This post demonstrates some of the new functionality, which is currently available in the RDKit github repo.

The new code supports rendering to SVG, PNGs (using cairo), Qt, and wx (thanks to a contribution from Igor Filippov). It's pretty easy to add support for other backends as well. Here I'll show the SVG rendering.

In [1]:
from __future__ import print_function
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
import time
print(time.asctime())
Sat Feb 14 07:48:33 2015

In [2]:
m = Chem.MolFromSmiles('c1cccnc1O')
In [3]:
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
def moltosvg(mol,molSize=(450,150),kekulize=True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    # It seems that the svg renderer used doesn't quite hit the spec.
    # Here are some fixes to make it work in the notebook, although I think
    # the underlying issue needs to be resolved at the generation step
    return svg.replace('svg:','')
In [4]:
SVG(moltosvg(m))
Out[4]:
N OH
In [5]:
import glob
pkls = glob.glob('/home/glandrum/Code/benchmarking_platform/compounds/ChEMBL_II/*.pkl')
from rdkit.six.moves import cPickle
sets = cPickle.load(file(pkls[0]))
In [6]:
smis=sets.values()[0]
In [7]:
ms = [Chem.MolFromSmiles(y) for x,y in smis]
In [8]:
SVG(moltosvg(ms[1]))
Out[8]:
N NH O O N O N

Look at controlling that in more detail.

In [9]:
m = ms[1]
rdDepictor.Compute2DCoords(m)
Out[9]:
0
In [10]:
drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[10]:
N NH O O N O N

Highlight a substructure:

In [11]:
drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m,highlightAtoms=m.GetSubstructMatch(Chem.MolFromSmarts('c1ncoc1')))
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[11]:
N NH O O N O N

Individiual atoms can also be highlighted:

In [12]:
highlight=list(m.GetSubstructMatch(Chem.MolFromSmarts('c1ncoc1'))) + [10,1]

drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m,highlightAtoms=highlight)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[12]:
N NH O O N O N

The highlighting color can be changed:

In [13]:
highlight=[10,1,4,13]
colors={1:(1,0,0), 4:(0,1,0), 13:(0,0,1), 10:(1,0,1)}


drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m,highlightAtoms=highlight,highlightAtomColors=colors)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[13]:
N NH O O N O N

As we saw above, the bonds between adjacent atoms will be highlighted:

In [14]:
highlight=[10,1,4,5,6]
colors={1:(1,0,0), 4:(0,1,0), 5:(0,0,1), 10:(1,0,1)}


drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m,highlightAtoms=highlight,highlightAtomColors=colors)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[14]:
N NH O O N O N

But this can be turned off:

In [15]:
highlight=[10,1,4,5,6]
colors={1:(1,0,0), 4:(0,1,0), 5:(0,0,1), 10:(1,0,1)}


drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m,highlightAtoms=highlight,highlightBonds=[],highlightAtomColors=colors)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[15]:
N NH O O N O N

We can also highlight just bonds:

In [16]:
highlight=[1,5,6]
colors={1:(1,0,0), 4:(0,1,0), 5:(0,0,1), 10:(1,0,1)}


drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
drawer.DrawMolecule(m,highlightAtoms=[],highlightBonds=highlight,highlightBondColors=colors)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[16]:
N NH O O N O N

Customizing atom labels:

In [17]:
drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
opts = drawer.drawOptions()
opts.atomLabels[15]='a16'
opts.atomLabels[4]='a5'
opts.atomLabels[20]='a21'
drawer.DrawMolecule(m,highlightAtoms=[4,15,20])
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[17]:
N a5 NH O a16 O N a21 O N

Taking that to an extreme:

In [18]:
drawer = rdMolDraw2D.MolDraw2DSVG(400,200)
opts = drawer.drawOptions()

for i in range(m.GetNumAtoms()):
    opts.atomLabels[i] = m.GetAtomWithIdx(i).GetSymbol()+str(i)

drawer.DrawMolecule(m)
drawer.FinishDrawing()
svg = drawer.GetDrawingText().replace('svg:','')
SVG(svg)
Out[18]:
C0 N1 C2 C3 C4 C5 C6 C7 C8 N9 C10 O11 C12 C13 C14 C15 C16 O17 C18 N19 C20 C21 O22 C23 C24 N25 C26 C27

More to come!

One final note on performance.

Here's a quick comparison of the current Python-based renderer with the new C++ renderer for generating SVG:

In [19]:
from rdkit.six.moves import cStringIO
from rdkit.Chem import Draw
def moltosvg_old(mol,molSize=(450,150),kekulize=True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)

    sio = cStringIO()
    Draw.MolToFile(mc, sio, size=molSize, imageType="svg",
                           kekulize=False)
    svg = sio.getvalue()
    # It seems that the svg renderer used doesn't quite hit the spec.
    # Here are some fixes to make it work in the notebook, although I think
    # the underlying issue needs to be resolved at the generation step
    return svg.replace('svg:','')
In [20]:
%timeit [moltosvg_old(x) for x in ms]
10 loops, best of 3: 62.9 ms per loop

In [21]:
%timeit [moltosvg(x) for x in ms]
100 loops, best of 3: 7.47 ms per loop

Factors of 10 in performance should never be sneered at. :-)

In []: