Thursday, March 10, 2016

Explaining Morgan Similarity

Explaining Similarity with Morgan Fingerprints

This post comes out of a question I was asked at this week's CIC Spring School on Cheminformatics, where I did an introductory lecture on fingerprints.

On to the work

Start with the preliminaries

We start by doing a bunch of imports and defining some functions we'll use later.

Note: This is a Python3 notebook, so the code below may not work in Python2. It's also using code that is currently only present in github, but that will be in the next release

In [1]:
import numpy
from rdkit.Chem.Draw import IPythonConsole
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
from rdkit import rdBase
print(rdBase.rdkitVersion)

import time
print(time.asctime())
%pylab inline
2016.03.1.dev1
Thu Mar 10 19:01:19 2016
Populating the interactive namespace from numpy and matplotlib
In [2]:
#
# Functions for providing detailed descriptions of MFP bits 
#  inspired by some code from from Nadine Schneider 
#
def getSubstructSmi(mol,atomID,radius):
    if radius>0:
        env = Chem.FindAtomEnvironmentOfRadiusN(mol,radius,atomID)
        atomsToUse=[]
        for b in env:
            atomsToUse.append(mol.GetBondWithIdx(b).GetBeginAtomIdx())
            atomsToUse.append(mol.GetBondWithIdx(b).GetEndAtomIdx())
        atomsToUse = list(set(atomsToUse))
    else:
        atomsToUse = [atomID]
        env=None
    symbols = []
    for atom in mol.GetAtoms():
        deg = atom.GetDegree()
        isInRing = atom.IsInRing()
        nHs = atom.GetTotalNumHs()
        symbol = '['+atom.GetSmarts()
        if nHs: 
            symbol += 'H'
            if nHs>1:
                symbol += '%d'%nHs
        if isInRing:
            symbol += ';R'
        else:
            symbol += ';!R'
        symbol += ';D%d'%deg
        symbol += "]"
        symbols.append(symbol)
    smi = Chem.MolFragmentToSmiles(mol,atomsToUse,bondsToUse=env,allHsExplicit=True, allBondsExplicit=True, rootedAtAtom=atomID)
    smi2 = Chem.MolFragmentToSmiles(mol,atomsToUse,bondsToUse=env,atomSymbols=symbols, allBondsExplicit=True, rootedAtAtom=atomID)
    return smi,smi2
In [3]:
# Start by importing some code to allow the depiction to be used:
from IPython.display import SVG
from rdkit.Chem.Draw import rdMolDraw2D

# a function to make it a bit easier. This should probably move to somewhere in
# rdkit.Chem.Draw
def _prepareMol(mol,kekulize):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    return mc
def moltosvg(mol,molSize=(450,200),kekulize=True,drawer=None,**kwargs):
    mc = rdMolDraw2D.PrepareMolForDrawing(mol,kekulize=kekulize)
    if drawer is None:
        drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc,**kwargs)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    # It seems that the svg renderer used doesn't quite hit the spec.
    # Here are some fixes to make it work in the notebook, although I think
    # the underlying issue needs to be resolved at the generation step
    return SVG(svg.replace('svg:',''))
In [4]:
# do a depiction where the atom environment is highlighted normally and the central atom
# is highlighted in blue
def getSubstructDepiction(mol,atomID,radius,molSize=(450,200)):
    if radius>0:
        env = Chem.FindAtomEnvironmentOfRadiusN(mol,radius,atomID)
        atomsToUse=[]
        for b in env:
            atomsToUse.append(mol.GetBondWithIdx(b).GetBeginAtomIdx())
            atomsToUse.append(mol.GetBondWithIdx(b).GetEndAtomIdx())
        atomsToUse = list(set(atomsToUse))       
    else:
        atomsToUse = [atomID]
        env=None
    return moltosvg(mol,molSize=molSize,highlightAtoms=atomsToUse,highlightAtomColors={atomID:(0.3,0.3,1)})
def depictBit(bitId,mol,molSize=(450,200)):
    info={}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,radius=2,nBits=2048,bitInfo=info)
    aid,rad = info[bitId][0]
    return getSubstructDepiction(mol,aid,rad,molSize=molSize)

Basics: similarity between pyridine and benzene

In [5]:
bz = Chem.MolFromSmiles('c1ccccc1')
fp_bz = AllChem.GetMorganFingerprintAsBitVect(bz,radius=2,nBits=2048)
pyr = Chem.MolFromSmiles('c1ccccn1')
fp_pyr = AllChem.GetMorganFingerprintAsBitVect(pyr,radius=2,nBits=2048)
print("Similarity:",DataStructs.TanimotoSimilarity(fp_bz,fp_pyr))
Similarity: 0.3333333333333333

The question was: why is this value so low?

The obvious mathematical answer is to just look at the definition of Tanimoto similarity:

$Tani(fp_1,fp_2) = \frac{|fp_1 \cap fp_2|}{|fp_1|+|fp_2|-|fp_1 \cap fp_2|} = \frac{|fp_1 \cap fp_2|}{|fp_1 \cup fp_2|} $

In [6]:
print("intersection count:",(fp_bz&fp_pyr).GetNumOnBits())
print("union count:",(fp_bz|fp_pyr).GetNumOnBits())
intersection count: 3
union count: 9

But that doesn't really answer the question of "why", which boils down to the fact that the fingerprints don't have many bits in common relative to the number of bits set.

To understand why the two fingerprints have so few bits in common even though they look very similar to each other, we need to look at the bits themselves. This is most easily done using the bitInfo argument to the fingerprinter.

In [7]:
bi_bz = {}
fp_bz = AllChem.GetMorganFingerprintAsBitVect(bz,radius=2,nBits=2048,bitInfo=bi_bz)
bi_pyr = {}
fp_pyr = AllChem.GetMorganFingerprintAsBitVect(pyr,radius=2,nBits=2048,bitInfo=bi_pyr)

The bitInfo structures are dictionaries with bit IDs as keys and (atomId, radius) pairs as values:

In [8]:
bi_bz
Out[8]:
{389: ((2, 2), (3, 2), (1, 2), (0, 2), (5, 2), (4, 2)),
 1088: ((1, 1), (2, 1), (3, 1), (4, 1), (0, 1), (5, 1)),
 1873: ((0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0))}

We will start by collecting information about the bits in lists.

Here's benzene:

In [9]:
info_bz = []
for bitId,atoms in bi_bz.items():
    exampleAtom,exampleRadius = atoms[0]
    description = getSubstructSmi(bz,exampleAtom,exampleRadius)
    info_bz.append((bitId,exampleRadius,description[0],description[1]))
print(info_bz)
[(1088, 1, '[cH](:[cH]):[cH]', '[cH;R;D2](:[cH;R;D2]):[cH;R;D2]'), (1873, 0, '[cH]', '[cH;R;D2]'), (389, 2, '[cH](:[cH]:[cH]):[cH]:[cH]', '[cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[cH;R;D2]')]

And pyridine:

In [10]:
info_pyr = []
for bitId,atoms in bi_pyr.items():
    exampleAtom,exampleRadius = atoms[0]
    description = getSubstructSmi(pyr,exampleAtom,exampleRadius)
    info_pyr.append((bitId,exampleRadius,description[0],description[1]))
print(info_pyr)
[(1088, 1, '[cH](:[cH]):[cH]', '[cH;R;D2](:[cH;R;D2]):[cH;R;D2]'), (1873, 0, '[cH]', '[cH;R;D2]'), (1603, 1, '[n](:[cH]):[cH]', '[n;R;D2](:[cH;R;D2]):[cH;R;D2]'), (1155, 2, '[cH](:[cH]:[cH]):[cH]:[n]', '[cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[n;R;D2]'), (437, 2, '[cH](:[cH]:[cH]):[n]:[cH]', '[cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[n;R;D2]:[cH;R;D2]'), (378, 0, '[n]', '[n;R;D2]'), (383, 2, '[n](:[cH]:[cH]):[cH]:[cH]', '[n;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[cH;R;D2]'), (1866, 1, '[cH](:[cH]):[n]', '[cH;R;D2](:[cH;R;D2]):[n;R;D2]'), (389, 2, '[cH](:[cH]:[cH]):[cH]:[cH]', '[cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[cH;R;D2]')]

The text parts of the description show two different levels of detail. The first is simple: element, whether or not it's aromatic (upper-case or lower-case letter) and the H count. The second level of detail adds the number of heavy atom neighbors (also known as degree, and indicated with the "D") and whether or not the atom is in a ring (indicated by the "R"). The second level of detail corresponds to what is actually used in the fingerprints.

Now put those together into a single table with columns for pyridine and benzene:

In [11]:
collection = {}
for bid,rad,smi,sma in info_bz:
    collection[bid] = [bid,rad,smi,sma,'','']
for bid,rad,smi,sma in info_pyr:
    if bid not in collection:
        collection[bid] = [bid,rad,'','','','']
    collection[bid][-2] = smi
    collection[bid][-1] = sma

Now put those rows in a Pandas Dataframe

In [12]:
import pandas as pd
pd.options.display.width=100000 # options to make sure our wide columns display properly
pd.options.display.max_colwidth=1000
In [13]:
df = pd.DataFrame(list(collection.values()),columns=('Bit','radius','smi_bz','sma_bz','smi_pyr','sma_pyr'))
df.sort_values(by='radius')
Out[13]:
Bit radius smi_bz sma_bz smi_pyr sma_pyr
1 1873 0 [cH] [cH;R;D2] [cH] [cH;R;D2]
5 378 0 [n] [n;R;D2]
0 1088 1 [cH](:[cH]):[cH] [cH;R;D2](:[cH;R;D2]):[cH;R;D2] [cH](:[cH]):[cH] [cH;R;D2](:[cH;R;D2]):[cH;R;D2]
2 1603 1 [n](:[cH]):[cH] [n;R;D2](:[cH;R;D2]):[cH;R;D2]
7 1866 1 [cH](:[cH]):[n] [cH;R;D2](:[cH;R;D2]):[n;R;D2]
3 1155 2 [cH](:[cH]:[cH]):[cH]:[n] [cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[n;R;D2]
4 389 2 [cH](:[cH]:[cH]):[cH]:[cH] [cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[cH;R;D2] [cH](:[cH]:[cH]):[cH]:[cH] [cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[cH;R;D2]
6 383 2 [n](:[cH]:[cH]):[cH]:[cH] [n;R;D2](:[cH;R;D2]:[cH;R;D2]):[cH;R;D2]:[cH;R;D2]
8 437 2 [cH](:[cH]:[cH]):[n]:[cH] [cH;R;D2](:[cH;R;D2]:[cH;R;D2]):[n;R;D2]:[cH;R;D2]

Let's look at some of the bits.

The radius 0 bits are just the "c" or the "n", so they are uninteresting.

The radius 1 bits all have three atoms:

In [14]:
depictBit(1866,pyr,molSize=(125,125))
Out[14]:
N

The other two are easy to envision: bit 1088 is all C, and bit 1603 has the N in the middle.

The radius 2 bits each have 5 atoms, here are a couple of them:

In [15]:
depictBit(1155,pyr,molSize=(125,125))
Out[15]:
N
In [16]:
depictBit(383,pyr,molSize=(125,125))
Out[16]:
N

The other two have the N in the second position (bit 437) or are pure C (bit 389)

That's pretty much it for this case.

Sunday, March 6, 2016

The performance impact of virtualization on some basic RDKit tasks

Intro

After reading http://www.cybertec.at/2016/02/postgresql-on-hardware-vs-postgresql-on-virtualbox/ from Hans-Jürgen Schönig about performance differences between PostgreSQL running natively and PostgreSQL running in a VM, I got curious about the impact of virtualization on the RDKit. This is a brief exploration of that topic.

Test setup

Some technical details about the experiments first:

  • Code version: this is experiment was done using the 2015.09.2 version of the RDKit available from https://anaconda.org/rdkit/rdkit.
  • Testing code: I used the code $RDBASE/Regress/Scripts/new_timings.py. This is a more time consuming version of the standard RDKit benchmarking tests.
  • Python 3.5.1, from anaconda python
  • Test machine: a Dell XPS desktop with a 3.6GHz i7-4790 CPU and 16GB of RAM.
  • Test OS (physical machine): Ubuntu 15.10
  • Vagrant configuration: Ubuntu 14.04 (Trusty) running in a 4GB VirtualBox VM
  • Docker configuration: Debian Jessie using Docker 1.10 (the post about setting this up is coming, but see the bottom of the post for the very minimal Dockerfile I used)

Details of the tests:

The test set is 50K molecules pulled from ZNP (a subset that no longer exists) a few years ago.

  1. Building the molecules from SMILES
  2. Generating canonical SMILES
  3. Reading 10K molecules from SDF
  4. Constructing 823 queries from SMARTS
  5. Running HasSubstructMatch() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
  6. Running GetSubstructMatches() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
  7. Reading the 428 SMARTS from $RDBASE/Data/SmartsLib/RLewis_smarts.txt
  8. Running HasSubstructMatch() for the 50K molecules and the 428 SMARTS
  9. Running GetSubstructMatches() for the 50K molecules and the 428 SMARTS
  10. Generating 50K mol blocks
  11. Calling Chem.BRICS.BreakBRICSBonds() on the 50K molecules
  12. Generating 2D coordinates for the 50K molecules
  13. Generating the RDKit fingerprint for the 50K molecules
  14. Generating Morgan fingerprints for the 50K molecules

Note that none of these need to do much in the way of I/O.

Results

Env T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
Physical 12.6 6.1 5.0 0.0 56.3 60.7 0.0 163.6 168.7 18.5 44.6 15.8 64.8 5.0
Vagrant 12.9 6.5 5.0 0.1 56.0 61.4 0.0 164.2 168.5 19.3 45.5 16.1 68.5 5.1
Docker 12.6 6.2 4.9 0.0 54.5 59.8 0.0 161.5 162.6 18.4 43.8 15.4 67.9 5.0

Conclusions

Comfortingly, running the code in a virtual environment or container doesn't have much, if any, impact on performance for this CPU-intensive test.

Technical details

Using Docker

Since ContinuumIO makes Docker images with miniconda preconfigured available, this turns out to be really simple. Here's the Dockerfie I used:

FROM continuumio/miniconda3
MAINTAINER Greg Landrum <greg.landrum@gmail.com>

ENV PATH /opt/conda/bin:$PATH
ENV LANG C

# install the RDKit:
RUN conda config --add channels  https://conda.anaconda.org/rdkit
RUN conda install -y rdkit

You can put that in an empty directory and then build a local image with the RDKit installed by running:

docker build -t basic_conda .

I wanted to mirror my local RDKit checkout into the image when I ran it so that I had access to the Regress directory. This is easy to do:

docker run -i -t -v /scratch/RDKit_git:/opt/RDKit_git basic_conda /bin/bash

And then I ran the benchmark with:

cd /opt/RDKit_git/Regress/Scripts
python new_timing.py

Thursday, March 3, 2016

Capturing error information

When the RDKit has problems processing a molecule, it outputs information to the error console about what those problems were. Here's an example:

In [23]: m = Chem.MolFromSmiles('CO(C)C')
[06:18:04] Explicit valence for atom # 1 O, 3, is greater than permitted

It's sometimes useful to have programmatic access to this information for later use in reporting.

It also would be great if these types of messages were visible in the jupyter notebook.

Brian Kelley recently added functionality to the RDKit to enable both of these things. For anyone interested, the two pull requests for those changes are: #736 and #739.

This is a short note on how to take advantage of that.

A couple of things to note:

  • This is currently in git and will be available in the 2016.03 release.
  • This post was written using Python3, some adaptation would be required for Python2.

Let's start by showing the standard state of affairs:

In [2]:
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
2016.03.1.dev1
In [3]:
m = Chem.MolFromSmiles('CO(C)C')
m

At this point there is an error message in the console I launched Jupyter from, but it sure would be nice if it were visible here.

We can enable that by just importing the usual RDKit Jupyter integration code:

In [4]:
from rdkit.Chem.Draw import IPythonConsole
In [6]:
m = Chem.MolFromSmiles('CO(C)C')
RDKit ERROR: [06:46:02] Explicit valence for atom # 1 O, 3, is greater than permitted
In [7]:
Chem.MolFromSmiles('c1cc1')
RDKit ERROR: [06:46:11] Can't kekulize mol 
RDKit ERROR: 
In [8]:
Chem.MolFromSmiles('c1')
RDKit ERROR: [06:46:18] SMILES Parse Error: unclosed ring for input: 'c1'
In [9]:
Chem.MolFromSmiles('Ch')
RDKit ERROR: [06:46:41] SMILES Parse Error: syntax error for input: 'Ch'

So far so good. What if I want to have access to the error messages as strings in Python?

In [13]:
from io import StringIO
import sys
Chem.WrapLogs()
In [15]:
sio = sys.stderr = StringIO()
Chem.MolFromSmiles('Ch')
print("error message:",sio.getvalue())
error message: RDKit ERROR: [06:49:14] SMILES Parse Error: syntax error for input: 'Ch'

I can use this to write a bit of code that processes all of the molecules in an SDF and captures the errors:

In [16]:
def readmols(suppl):
    ok=[]
    failures=[]
    sio = sys.stderr = StringIO()
    for i,m in enumerate(suppl):
        if m is None:
            failures.append((i,sio.getvalue()))
            sio = sys.stderr = StringIO() # reset the error logger
        else:
            ok.append((i,m))
    return ok,failures
In [18]:
import gzip,os
from rdkit import RDConfig
inf = gzip.open(os.path.join(RDConfig.RDDataDir,'PubChem','Compound_000200001_000225000.sdf.gz'))
suppl = Chem.ForwardSDMolSupplier(inf)
ok,failures = readmols(suppl)
In [19]:
for i,fail in failures:
    print(i,fail)
2035 RDKit ERROR: [07:31:28] Explicit valence for atom # 0 Br, 5, is greater than permitted
RDKit ERROR: [07:31:28] ERROR: Could not sanitize molecule ending on line 404864

11460 RDKit ERROR: [07:31:28] ERROR: Explicit valence for atom # 0 Br, 5, is greater than permitted
RDKit ERROR: [07:31:32] Explicit valence for atom # 2 Te, 4, is greater than permitted
RDKit ERROR: [07:31:32] ERROR: Could not sanitize molecule ending on line 2344967

17016 RDKit ERROR: [07:31:32] ERROR: Explicit valence for atom # 2 Te, 4, is greater than permitted
RDKit ERROR: [07:31:34] Explicit valence for atom # 1 Br, 5, is greater than permitted
RDKit ERROR: [07:31:34] ERROR: Could not sanitize molecule ending on line 3489884

In [ ]: