Wednesday, November 15, 2017

Revisiting the MaxMinPicker

This is a guest post from Tim Dudgeon.
 
This post has a bit of history. It is primarily inspired by Roger Sayle's talk at the recent UGM in Berlin. If you have attended these great events you will know that Roger's talk is one of the highlights. In a way that only Roger can, he picks an area of the RDKit and tears it to pieces. Having done so he reassembles those pieces using his immense computer science brain and the result is something improved. Greg sits there not quite knowing what is coming, but assured that the outcome will be positive!
This time he picked on (pun!) the MaxMinPicker algorithm in the RDKit. This also has history, as it was the subject of a blog post by Greg from 2014 (and also this one).
The most obvious approach to picking a diverse subset might be to start by clustering the dataset, e.g. using Butina clustering, and then picking from each cluster. Indeed that's exactly what I've done here. But as Roger pointed out, this does not scale: the full distance matrix is needed which, assuming a symmetric distance function, has (n^2 - n)/2 entries. Something smarter is needed for large datasets.
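For a sense of what that clustering-based approach looks like, here is a minimal sketch (the 0.4 distance threshold and the Morgan fingerprint parameters are arbitrary choices for illustration; ms is assumed to be the list of molecules):

from rdkit import DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in ms]
# build the (n^2 - n)/2 element lower-triangle distance matrix -
# this is the part that does not scale
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend([1.0 - s for s in sims])
clusters = Butina.ClusterData(dists, len(fps), 0.4, isDistData=True)
# take the centroid (first member) of each cluster as a pick
picks = [c[0] for c in clusters]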
In fact the MaxMinPicker (see here for C++ or here for Python docs) also has the option to pass in the full distance matrix, which has the same limitation. But the key to MaxMinPicker (described in Ashton, M. et al., Quant. Struct.-Act. Relat., 21 (2002), 598-604) is that the full distance matrix does not need to be generated, only a subset of it, and it can be generated as needed in a lazy manner.
The algorithm works as follows.
  1. Generate descriptors (e.g. Morgan fingerprints) for all the molecules: any initial seeds plus those to pick from (the candidate pool). For large sets this will be slow, but cannot be avoided.
  2. If there are no initial seeds select an initial molecule at random from the candidate pool and it becomes the sole member of the picked set.
  3. From the molecules in the candidate pool find the one that has the maximum value for its minimum distance to molecules in the picked set (hence the MaxMin name), calculating and recording the distances as required. This molecule is the most distant one to those already picked so is transferred to the picked set.
  4. Repeat step 3 until you are done (e.g. you have picked the required number of molecules).
Step 3 here is the key one, and the one that Roger improved as we will see. The process is indicated in slide 11 from Roger's presentation and is repeated here:



The key is to think of traversing the table column by column, starting from the left. You move all the way down the first column to find its minimum value (3 in this case), and this becomes your current maximum. Then, for each remaining column, you move down the values until you reach a value that is less than or equal to that current maximum. If you hit such a value you know the molecule is more similar than your current candidate, so it can be discarded and you can immediately move on to the next column. If instead you reach the bottom of a column and its minimum value is greater than the current maximum, that compound becomes your current candidate, as it is more distant than the previous one. As you traverse the columns you calculate and record distances only if they have not already been calculated. This lazy evaluation of the distances is key to the efficiency of the algorithm: a distance is only calculated if it is needed, and it is never calculated twice. This is the aspect that Roger spotted was missing from the old implementation and that he improved, as described in his talk.
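To make that concrete, here is a minimal pure-Python sketch of the MaxMin loop with lazily cached distances and the early-exit column scan. This is only an illustration of the idea, not the RDKit's C++ implementation:

import random

def maxmin_pick(dist, n_total, n_picks, seeds=None):
    # dist(i, j) returns the distance between items i and j;
    # values are computed lazily and cached so nothing is computed twice
    cache = {}
    def d(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            cache[key] = dist(i, j)
        return cache[key]

    picked = list(seeds) if seeds else [random.randrange(n_total)]
    pool = [i for i in range(n_total) if i not in picked]
    while len(picked) < n_picks and pool:
        best, best_min = None, -1.0
        for i in pool:                     # one "column" per candidate
            min_d = None
            for j in picked:               # walk down the column
                dij = d(i, j)
                if dij <= best_min:        # cannot beat the current
                    min_d = dij            # candidate: bail out early
                    break
                min_d = dij if min_d is None else min(min_d, dij)
            if min_d > best_min:           # more distant than the current
                best, best_min = i, min_d  # candidate: take its place
        picked.append(best)
        pool.remove(best)
    return picked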

So what does this all mean? Roger's new version is present in the 2017_09_01 version of the RDKit (now released). The tests we show here use a beta version of this and compare timings against the 2017_03_03 version, which had the old implementation that performs many more unnecessary distance calculations.
The results are shown in this Jupyter notebook.
First we look at the LazyPick method:
picks = mmp.LazyPick(fn, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))
Let's explain the arguments to this function:
  1. fn: the distance function (an example is sketched after this list)
  2. len(benzodiazepines): the total number of descriptors, the initial seeds plus those to pick from
  3. start_with + how_many_to_pick: the number of indexes to return (including the initial seeds)
  4. list(range(start_with)): a list of the indexes of those in the initial pick list. Typically [0, 1, 2, ...], but can be None
What is returned are the indexes of all those in the picked set: the original seeds followed by the newly picked molecules.
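As an example, a simple Tanimoto-distance fn over the fingerprint list might look like the following sketch (the picker construction and imports are included for completeness; the default-argument trick just avoids a global lookup on every call):

from rdkit import DataStructs
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

def fn(i, j, fps=benzodiazepines):
    # Tanimoto distance between the fingerprints at indexes i and j
    return 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

mmp = MaxMinPicker()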
We compare this to the LazyBitVectorPick function. This does not allow a distance function to be defined, but instead uses the Tanimoto distance between the bit vectors. Whilst this is a restriction, that's exactly what we are using, and what most people will want to use in most cases, so it's not too much of a limitation. The plus side is that it involves less to-and-fro between the C++ and Python layers, so it performs much better.
picks = mmp.LazyBitVectorPick(benzodiazepines, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))
Timings for picking a number of molecules from a total set of 12,386 benzodiazepines given a number of seeds can be seen below. All timings were gathered on an i5-based MacBook Pro with 8GB RAM. You can use the Jupyter notebook to run these, and other combinations, yourself.
Method            | Seeds | Picks | Old time | New time | Improvement
LazyPick          |     0 |   100 |     17.5 |     0.67 |          26
LazyPick          |     0 |  1000 |   1523.9 |    13.94 |         109
LazyPick          |   100 |   100 |     81.6 |     1.41 |          58
LazyPick          |  1000 |  1000 |       ND |    27.22 |          
LazyBitVectorPick |     0 |   100 |     12.9 |     0.09 |         143
LazyBitVectorPick |     0 |  1000 |   1466.7 |     1.39 |        1055
LazyBitVectorPick |   100 |   100 |     72.1 |     0.18 |         401
LazyBitVectorPick |  1000 |   100 |    761.3 |     0.54 |        1410
LazyBitVectorPick |   100 |  1000 |   3105.8 |     1.38 |        2251
LazyBitVectorPick |  1000 |  1000 |       ND |     3.03 |
LazyBitVectorPick |  5000 |  1000 |       ND |     7.37 |
LazyBitVectorPick |  5000 |  5000 |       ND |     7.37 |
Times are in seconds and averages of 3 replicates.
Firstly, you'll notice that the LazyBitVectorPick function is significantly faster than the LazyPick one, especially with the new algorithm. As a result, the additional calculations were done with just LazyBitVectorPick.
The new algorithm is most remarkable in the way it scales: the bigger the problem to solve, the bigger the improvement. Large studies that are impractical with the old algorithm now run in seconds.
So how can this be used for some more realistically scaled studies? Compound collections will often be on the order of several hundred thousand compounds, and those of pharma companies in the range of a few million. Vendor catalogs that can be picked from are often in the range of 100,000 compounds. So we set up a study that tried to approach this sort of scale. We chose the NCI250 dataset (comprising 247,477 SMILES that the RDKit can handle) as the compounds that we already possess, and pick from the benzodiazepine set of 12,386 compounds. This can realistically only be done with the new algorithm.
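The basic trick is to put the fingerprints of the compounds we already possess at the front of the list and pass their indexes as the initial picks. A sketch (variable names assumed) looks like this:

# nci_fps: fingerprints of the NCI250 compounds we already possess
# bzd_fps: fingerprints of the benzodiazepines to pick from
all_fps = nci_fps + bzd_fps
n_seeds = len(nci_fps)
picks = mmp.LazyBitVectorPick(all_fps, len(all_fps),
                              n_seeds + how_many_to_pick,
                              list(range(n_seeds)))
# the new picks are the indexes after the seeds, offset back
# into the benzodiazepine list
new_picks = [i - n_seeds for i in picks[n_seeds:]]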
The Jupyter notebook shows how this can be done. Timings are as follows:
Picks | Time (sec), three replicates
    1 |    8.4    8.5    8.7
    2 |  137    133    128
    3 |  222    218    223
   10 | 1029   1006   1030
  100 | 1201   1181   1199
 1000 | 1256   1187   1245
Impressive. Large-scale compound selection done on a modest laptop in minutes.
Are there no limits? Well, for very large sets holding all the fingerprints and picking data in memory will become limiting, and generating the descriptors in the first place will take time. Roger informs me of a study he has done selecting 2000 diverse compounds from the ~90M in PubChem Compound that took 12.6 hours to read the molecules and generate fingerprints and then 2.5 hours for diversity selection. Memory used was ~25 gigabytes. So here we are a little beyond a trivial application on a laptop, but not massively so. Impressive!
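As a rough sanity check on that memory number: assuming 2048-bit fingerprints (the fingerprint type used in Roger's study isn't specified here), the raw fingerprint bits alone account for most of it:

# ~90M molecules x 2048-bit fingerprints, expressed in GB
print(90e6 * 2048 / 8 / 1e9)   # -> 23.04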
Finally, when running some of these longer jobs it was noticeable that only one core of the machine was being used. Clearly the MaxMinPicker is not written to be multi-threaded. In principle it seems that it could be sped up considerably by utilising all the cores, though this would not be trivial. So there is still scope for further improvement. Maybe another challenge for Roger!

Wednesday, November 8, 2017

Using Feature Maps

This post provides a short demonstration of how to use the RDKit's feature map implementation to score aligned 3D conformations against each other based on the overlap of their pharmacophoric features. The idea of the feature map is not a new one and there are numerous publications on the topic. These two come closest to describing the RDKit implementation:

  1. Putta, S., Landrum, G. A. & Penzotti, J. E. "Conformation mining: An algorithm for finding biologically relevant conformations." J. Med. Chem. 48, 3313–3318 (2005). http://pubs.acs.org/doi/abs/10.1021/jm049066l
  2. Landrum, G. A., Penzotti, J. E. & Putta, S. "Feature-map vectors: a new class of informative descriptors for computational drug discovery." J. Comput. Aided Mol. Des. 20, 751–762 (2006). https://link.springer.com/article/10.1007/s10822-006-9085-8
In [1]:
from rdkit.Chem import AllChem
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import time
print(time.asctime())
from rdkit import rdBase
print(rdBase.rdkitVersion)
Thu Nov  9 08:17:29 2017
2017.09.1

For this post we'll use a set of three 5-HT3 ligands.

In [2]:
smis = ['CC(OC(=O)c1c[nH]c2ccccc12)C1CCCCN1C',
 'CN1CCOc2c(C(=O)NC3CC4CCC(C3)N4C)cc(Cl)cc21',
 'CN1CC2CCC1CC2n1nnc2ccc(Cl)cc2c1=O']
ms = [Chem.MolFromSmiles(x) for x in smis]
Draw.MolsToGridImage(ms)
Out[2]:

Start by generating a conformation for each:

In [3]:
ms = [Chem.AddHs(m) for m in ms]
ps = AllChem.ETKDG()
ps.randomSeed = 0xf00d  # we seed the RNG so that this is reproducible
for m in ms:
    AllChem.EmbedMolecule(m,ps)

To get sensible scores for the feature maps, we need to start from aligned structures. For this example we'll just use the RDKit's Open3D Align implementation:

In [4]:
from rdkit.Chem import rdMolAlign
o3d = rdMolAlign.GetO3A(ms[1],ms[0])
o3d.Align()
Out[4]:
0.34363011582550307
In [5]:
o3d = rdMolAlign.GetO3A(ms[2],ms[0])
o3d.Align()
Out[5]:
0.5938107267982091

Now let's build a feature map.

We need to start by building a FeatureFactory object which defines the set of pharmacophore features being used. We'll use this to find features on the molecules.

In [6]:
import os
from rdkit import RDConfig
from rdkit.Chem.FeatMaps import FeatMaps
fdef = AllChem.BuildFeatureFactory(os.path.join(RDConfig.RDDataDir,'BaseFeatures.fdef'))
print(fdef.GetFeatureFamilies())
('Donor', 'Acceptor', 'NegIonizable', 'PosIonizable', 'ZnBinder', 'Aromatic', 'Hydrophobe', 'LumpedHydrophobe')

We also need the parameters for the points used to make up the feature map. The feature points are defined by

  • a FeatProfile - Gaussian, Triangle, or Box. Gaussian is the default.
  • a width - the precise meaning is determined by the profile, but this is the sigma factor for the Gaussian. The default value is 1.0.
  • a cutoff radius - Feature-feature overlaps at a distance longer than this will not be considered. The default is 2.5.

Here we'll take the defaults.

In [7]:
fmParams = {}
for k in fdef.GetFeatureFamilies():
    fparams = FeatMaps.FeatMapParams()
    fmParams[k] = fparams
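If you don't want the defaults, the parameters can be set directly on each FeatMapParams object. Something like this sketch should work (the attribute and profile names here are taken from the FeatMaps module, but are worth double-checking against your RDKit version):

fparams = FeatMaps.FeatMapParams()
fparams.width = 0.5      # narrower Gaussian
fparams.radius = 2.0     # tighter cutoff radius
fparams.featProfile = FeatMaps.FeatMapParams.FeatProfile.Triangle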

Next go through and find the features on each molecule. We'll only consider a subset of the features defined by the FeatureFactory. In "real" use we'd more likely use a FeatureFactory that only defines the features we are interested in, but this example of how to limit the features returned may still be interesting:

In [8]:
keep = ('Donor','Acceptor','NegIonizable','PosIonizable','Aromatic')
featLists = []
for m in ms:
    rawFeats = fdef.GetFeaturesForMol(m)
    # filter that list down to only include the ones we're interested in
    featLists.append([f for f in rawFeats if f.GetFamily() in keep])

Now it's straightforward to create FeatMap objects from the feature lists:

In [10]:
f1 = featLists[0][0]
f1
Out[10]:
<rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature at 0x118241a80>
In [17]:
fms = [FeatMaps.FeatMap(feats = x,weights=[1]*len(x),params=fmParams) for x in featLists]

Score the features from each molecule against the feature maps. Since scores are affected by the number of features in each FeatMap, we'll normalize here using the number of features in the smaller molecule.

In [19]:
fms[0].ScoreFeats(featLists[0])/fms[0].GetNumFeatures()
Out[19]:
1.0047133039823972
In [12]:
fms[0].ScoreFeats(featLists[1])/min(fms[0].GetNumFeatures(),len(featLists[1]))
Out[12]:
0.3087264298235622
In [13]:
fms[0].ScoreFeats(featLists[2])/min(fms[0].GetNumFeatures(),len(featLists[2]))
Out[13]:
0.41696159402924426
In [14]:
fms[1].ScoreFeats(featLists[2])/min(fms[1].GetNumFeatures(),len(featLists[2]))
Out[14]:
0.5575068770578883

An aside: since the weights on the features in FeatMap are all 1.0, the scoring is symmetric:

In [22]:
fms[2].ScoreFeats(featLists[1])/min(fms[2].GetNumFeatures(),len(featLists[1]))
Out[22]:
0.5575068770578883
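To wrap up, the scoring steps above can be bundled into a small convenience function; this is just a rearrangement of the calls already shown:

def scorePair(featsA, featsB, fmParams):
    # build a FeatMap from one feature list and score the other
    # against it, normalizing by the smaller feature count
    fm = FeatMaps.FeatMap(feats=featsA, weights=[1] * len(featsA),
                          params=fmParams)
    return fm.ScoreFeats(featsB) / min(fm.GetNumFeatures(), len(featsB))

scorePair(featLists[0], featLists[2], fmParams)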

Hopefully others find this brief introduction to the RDKit's FeatMap implementation useful.

Wednesday, August 9, 2017

Chemical Topic Modeling with the RDKit and KNIME

We recently published a paper on the application of topic modeling, a method developed in the text mining community, to chemical data: http://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00249. Here I'm going to show how to use this approach inside of KNIME. I'm really pleased with how the paper turned out and think the approach is a really useful one for efficiently organizing chemical datasets. We've got a bunch of cool ideas about what to do next too, so keep your eyes open...

An aside/apology: while doing the literature search, we somehow completely missed Rajarshi's blog post from 2012 (http://blog.rguha.net/?p=997). This is really embarrassing. Sorry Rajarshi...

Since we continue to work on the code that Nadine wrote while doing the research, called CheTo (for ChemicalTopic), Nadine and I have put it on GitHub (https://github.com/rdkit/CheTo). We'd also like to make it easy for other people to use the code, so we built a conda package for it and added it to the RDKit channel. If you're using the Anaconda Python distribution (and you should be!), you can install the package in your conda environment with a single command: conda install -c rdkit cheto. If you don't already have the RDKit installed, it will automatically be installed for you. We'll be updating the git repository in the coming weeks to provide more information about and examples of how to use the CheTo python code. This blog post, though, is about using it from KNIME.

Let's start with the pre-requisites. You need an installation of at least v3.4.0 of KNIME (released July 2017). That installation should have the KNIME text mining extensions and the Python Integration version that supports Python 2 and Python 3. At the time of writing these are both in KNIME Labs. It's not a bad idea to have the RDKit nodes installed too (these are available in the KNIME Community Extensions section in the "Install KNIME Extensions" dialog). You also need to have the Python extension properly configured; I covered this in a post on the KNIME blog: https://www.knime.com/blog/setting-up-the-knime-python-extension-revisited-for-python-30-and-20. The conda environment you are using from KNIME should have both the RDKit and CheTo installed from the rdkit channel (see the CheTo installation instructions above).

phew... now we're ready to go. Here's a screenshot of the sample workflow for doing chemical topic modeling:

The table reader at the beginning brings in a set of a couple hundred molecules taken from 12 ChEMBL documents. The real work is done in the "fingerprint and do LDA" wrapped metanode, which expects an input table that has a column named "smiles" that contains SMILES. We won't get into the details of the contents of this node here, but if you configure the node (double click or right click and "configure") you'll get a dialog which allows you to change the important parameters:


Executing the node, which can take a while since it's not currently very well optimized, gives you two tables. The first has the individual molecules assigned to topics:

and the second has the bits that define the topics themselves, including Nadine's very nice depictions of the fingerprint bits:


The GroupBy nodes provide a summary of the number of documents each topic shows up in as well as the number of topics that are identified in each document. This last was one of the validation metrics that we used in the paper; here's what we get with the sample data set:
You can see that the majority of the documents contain compounds that are assigned to a single topic, while a few documents contain compounds assigned to two topics and one, doc_id 44596, has compounds from three topics.

There's a lot more detail in the paper about what this all means and what you might do with it; the goal for this post was to provide a very quick overview of how to do the analysis and look at the results inside of KNIME. I hope I was successful with that.

The workflow itself is on the KNIME public examples server; you can find it in KNIME by logging into the examples server and then navigating to Examples/99_Community/03_RDKit/08_Chemical_Topic_Modeling:



Wednesday, May 17, 2017

Looking at the Platinum dataset

A recent paper out of the ZBH in Hamburg published a new set of ligand structures from the PDB to be used for benchmarking conformational analysis methods: http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00613. The dataset is available here: http://www.zbh.uni-hamburg.de/?id=628. Along with the paper the authors presented results from multiple conformation generation tools, including the RDKit. Sereina Riniker's ETKDG implementation (http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00654) in the RDKit ended up doing quite well relative to the other approaches. That, of course, made me quite happy.
I hadn't done anything with the Platinum dataset at all yet, but an exchange with David Koes on Twitter - https://twitter.com/david_koes/status/862831249617862656 - prompted me to take a look. It's an interesting one to look at from the "predicting bioactive conformations" perspective since it has >4500 PDB ligands. In the ETKDG paper we only looked at 238 PDB structures; that paper also included 1290 CSD structures.
In [204]:
from collections import defaultdict
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_3d=True
%pylab inline
Populating the interactive namespace from numpy and matplotlib
/home/glandrum/anaconda3/envs/py36/lib/python3.6/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['indices']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"
Read the molecules in and assign atomic chirality based on the structures provided.
In [193]:
ms = [x for x in Chem.SDMolSupplier('../data/platinum_dataset_2017_01.sdf',removeHs=False)]
# Assign atomic chirality based on the structures:
for m in ms: Chem.AssignAtomChiralTagsFromStructure(m)
len(ms)
Out[193]:
4548

Conformation generation

Start by generating 50 conformations via ETKDG
In [77]:
import time
tms = [Chem.Mol(x) for x in ms]
ps = AllChem.ETKDG()
ps.pruneRmsThresh=0.5
ps.numThreads=0
ts = []
for i,m in enumerate(tms):
    t1 = time.time()
    AllChem.EmbedMultipleConfs(m,50,ps)
    t2=time.time()
    ts.append((t2-t1))
    if not (i%50): print("done %d"%i)
done 0
done 50
done 100
done 150
  ... 
done 4350
done 4400
done 4450
done 4500
Repeat the conformation generation using standard DG
In [139]:
tms2 = [Chem.Mol(x) for x in ms]
ps = AllChem.ETKDG()
ps.useBasicKnowledge = False
ps.useExpTorsionAnglePrefs = False
ps.pruneRmsThresh=0.5
ps.numThreads=0
ts2 = []
for i,m in enumerate(tms2):
    t1 = time.time()
    AllChem.EmbedMultipleConfs(m,50,ps)
    t2=time.time()
    ts2.append((t2-t1))
    if not (i%50): print("done %d"%i)
done 0
done 50
done 100
done 150
 ...
done 4350
done 4400
done 4450
done 4500

RMSD calculation

For the RMS analysis we want molecules without Hs:
In [140]:
ms_noh = [Chem.RemoveHs(m) for m in ms]
tms_noh = [Chem.RemoveHs(m) for m in tms]
tms2_noh = [Chem.RemoveHs(m) for m in tms2]
In [141]:
import pickle
with open('../data/platinum_dataset_2017_01.chiral_confs.pkl','wb+') as outf:
    pickle.dump((ms_noh,tms_noh,tms2_noh,ts,ts2),outf)
Now generate RMS values to the crystal structure for 1, 5, 20, 25, and 50 conformer ensembles.
In [143]:
# ETKDG
rmsvs = defaultdict(list)
for m,tm in zip(ms_noh,tms_noh):
    best = 1e8
    cids = [x.GetId() for x in tm.GetConformers()]
    for i,cid in enumerate(cids):
        rms = AllChem.GetBestRMS(m,tm,probeConfId=cid)
        best = min(rms,best)
        if not i:
            best_1 = best
        if i<5:
            best_5 = best
        if i<20:
            best_20 = best
        if i<25:
            best_25 = best
    rmsvs[50].append(best)
    rmsvs[1].append(best_1)
    rmsvs[5].append(best_5)
    rmsvs[20].append(best_20)
    rmsvs[25].append(best_25)
# DG
rmsvs2 = defaultdict(list)
for m,tm in zip(ms_noh,tms2_noh):
    best = 1e8
    cids = [x.GetId() for x in tm.GetConformers()]
    for i,cid in enumerate(cids):
        rms = AllChem.GetBestRMS(m,tm,probeConfId=cid)
        best = min(rms,best)
        if not i:
            best_1 = best
        if i<5:
            best_5 = best
        if i<20:
            best_20 = best
        if i<25:
            best_25 = best
    rmsvs2[50].append(best)
    rmsvs2[1].append(best_1)
    rmsvs2[5].append(best_5)
    rmsvs2[20].append(best_20)
    rmsvs2[25].append(best_25)    
There are a couple molecules where we didn't manage to generate conformations:
In [181]:
len([1 for x in rmsvs[50] if x>1000]),len([1 for x in rmsvs2[50] if x>1000])
Out[181]:
(2, 1)
Let's look at those:
In [184]:
[i for i,x in enumerate(rmsvs[50]) if x > 1000],[i for i,x in enumerate(rmsvs2[50]) if x > 1000]
Out[184]:
([1641, 2708], [1641])
In [192]:
from rdkit.Chem import Draw
indices = [i for i,x in enumerate(rmsvs[50]) if x > 1000]
badms = [Draw.PrepareMolForDrawing(ms_noh[x],forceCoords=True) for x in indices]
legends = [ms[x].GetProp("_Name") for x in indices]
Draw.MolsToGridImage(badms,legends=legends)
Out[192]:
I'm surprised that there were problems with the second structure (H52). It doesn't look like it should be that hard to embed, so that's something to look into. The first structure (840) has multiple specified chiral centers in a highly constrained fused ring system, so it's not horribly surprising to encounter problems.

Looking at RMSD values

In [145]:
def values_there(seq,threshold=1000):
    return [x for x in seq if x<threshold]
Let's look at histograms of RMS values
In [248]:
figsize(18,6)
hist([values_there(rmsvs[x]) for x in sorted(rmsvs)],bins=20,label=[x for x in sorted(rmsvs)]);
legend();
title('ETKDG');
xlabel('RMSD')
ylabel('Number of structures');
In [249]:
figsize(18,6)
hist([values_there(rmsvs2[x]) for x in sorted(rmsvs2)],bins=20,label=[x for x in sorted(rmsvs2)]);
legend();
title('default DG');
xlabel('RMSD')
ylabel('Number of structures');
Another way of looking at that is with cumulative bar plots similar to what was used in Fig 7 of the ETKDG paper. These show the fraction of structures that have a minimum RMSD less than a particular value.
In [250]:
def fig7_plot(rmsvs,nmols,title,blocks=(0.1,0.5,1.0,1.5,2.0)):
    pltVals=[]
    for thresh in blocks:
        tVals = []
        for k in sorted(rmsvs):
            tVals.append(len([1 for x in rmsvs[k] if x<thresh])/nmols)
        pltVals.append(tVals)
    N = len(rmsvs.keys())
    ind = np.arange(N)
    w=0.1
    figsize(18,6)
    fig,ax = subplots()
    rects=[]
    for i,nc in enumerate(rmsvs):
        rects.append(ax.bar(ind+i*w,[x[i] for x in pltVals],w))
    ax.set_xticks(ind+(w*len(blocks)/2))
    ax.set_xticklabels('<%.2f'%x for x in blocks)
    ax.set_ylabel('Fraction of structures')
    ax.set_xlabel('RMSD')
    ax.set_title(title)
    ax.legend((r[0] for r in rects),('%d'%k for k in sorted(rmsvs)))
    ax.set_ybound(lower=0,upper=1)
    ax.grid(axis='y')
In [251]:
fig7_plot(rmsvs,len(ms),'ETKDG');
In [252]:
fig7_plot(rmsvs2,len(ms),'DG');
Those show what we'd expect: ETKDG is significantly better than DG and adding conformations tends to help. With 50 conformations ETKDG finds a result within 1.0A of the crystal structure for about 70% of the structures.
David Koes made some interesting plots that looked at average RMSD in structures as a function of the number of rotatable bonds. Let's try a few of those.
In [151]:
nrots = [AllChem.CalcNumRotatableBonds(x) for x in ms_noh]
rms_by_nrot = defaultdict(lambda:defaultdict(list))
rms_by_nrot2 = defaultdict(lambda:defaultdict(list))
for i,nrot in enumerate(nrots):
    for k in rmsvs:
        rms_by_nrot[k][nrot].append(rmsvs[k][i])
        rms_by_nrot2[k][nrot].append(rmsvs2[k][i])
In [152]:
import numpy as np
counts = sorted([(x,len(values_there(rms_by_nrot[50][x]))) for x in rms_by_nrot[50]])
In [270]:
figsize(8,8)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

# integral = [counts[0][1]]
# for i in range(1,len(counts)): 
#     integral.append(counts[i][1]+integral[-1])
# integral = [x/len(ms_noh) for x in integral]
# ax2.plot([x for x,y in counts],[y for y in integral],linestyle='--',color='r')
# ax2.set_ylabel('frac structs')

ax2.plot([x for x,y in counts],[y for x,y in counts],linestyle='--', color='r')
ax2.set_ylabel('num structs')

for k in sorted(rms_by_nrot):
    means = [(x,np.average(values_there(rms_by_nrot[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],label="ETKDG"+str(k))
for k in (5,50):
    means = [(x,np.average(values_there(rms_by_nrot2[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],linestyle=':',label="DG"+str(k))

ax1.set_ylabel('mean(RMSD)');
ax1.set_xlabel('num_rotatable')
ax1.legend();
ax2.grid(axis='y')
ax1.grid(axis='x')
I've deviated a bit from the way David presents this by adding a line showing the number of structures in the dataset with the corresponding number of rotatable bonds (the red dashed line and the right-hand Y axis). There aren't a lot of structures with more than 10 rotatable bonds in the dataset.
From the graph we can see the expected result: mean(RMSD) increases steadily with the number of rotatable bonds and ETKDG is consistently better than DG.

Timing information

We know that the ETKDG procedure is more computationally intensive than DG. Let's look at how much of a difference it makes on this data set:
In [292]:
figsize(8,8)
ptimes = [x for i,x in enumerate(ts) if tms[i].GetNumConformers()]
ptimes2 = [x for i,x in enumerate(ts2) if tms[i].GetNumConformers()]
scatter(ptimes2,ptimes)
plot((0,12),(0,12),'r')
A = np.vstack([ptimes2, np.ones(len(ptimes2))]).T
m, c = np.linalg.lstsq(A, ptimes)[0]
print(m,c)
grid()
plot(ptimes2,m*np.array(ptimes2)+c,color='r',linestyle='--')             
xlabel('Time-DG (s)');
ylabel('Time-ETKDG (s)');
1.73472205238 0.0968224468535
So ETKDG takes, on average, about 1.7 times as long as DG.
A different look at timing: the fraction of structures that are completed in less than a given time.
In [303]:
time_integration = defaultdict(int)
time_integration2 = defaultdict(int)
for tv in (0.1,.25, .5, 1, 1.5, 2, 3, 5,10,20):
    time_integration[tv] = len([1 for x in ts if x<=tv])
    time_integration2[tv] = len([1 for x in ts2 if x<=tv])
figsize(8,8)
plot(sorted(time_integration),[time_integration[x]/len(ms_noh) for x in sorted(time_integration)],label='ETKDG')
plot(sorted(time_integration),[time_integration2[x]/len(ms_noh) for x in sorted(time_integration)],label='DG')
grid()
xlabel('time (s)')
ylabel('fraction of structures')
legend()
xscale('log');
Interpretation of this: both ETKDG and DG are able to generate conformations (remember that the target number was 50 conformations with an RMS threshold of 0.5A between them) for >80% of the structures in less than 1 second. ETKDG completes ~85% and DG completes about 95%.
Look at how sensitive those times are to the number of rotatable bonds:
In [293]:
times_by_nrot = defaultdict(list)
times_by_nrot2 = defaultdict(list)
for i,nrot in enumerate(nrots):
    if not tms[i].GetNumConformers(): continue
    times_by_nrot[nrot].append(ts[i])
    times_by_nrot2[nrot].append(ts2[i])
In [294]:
figsize(8,8)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

ax2.plot([x for x,y in counts],[y for x,y in counts],linestyle='--', color='r')
ax2.set_ylabel('num structs')

means = [(x,np.average(times_by_nrot[x]),np.std(times_by_nrot[x])) for x,_ in counts]
means = sorted(means)
ax1.errorbar([x for x,y,z in means],[y for x,y,z in means],yerr=[z for x,y,z in means],label="ETKDG-50")
means = [(x,np.average(times_by_nrot2[x]),np.std(times_by_nrot2[x])) for x,_ in counts]
means = sorted(means)
ax1.errorbar([x for x,y,z in means],[y for x,y,z in means],yerr=[z for x,y,z in means],label="DG-50")

ax1.set_ylabel('time (s)');
ax1.set_xlabel('num_rotatable')
ax1.legend();
ax2.grid(axis='y')
ax1.grid(axis='x')
ax1.set_ybound(0,3)
Adding rotatable bonds causes conformation generation to take longer. This isn't a huge surprise.

Impact of force-field minimization

David also does an analysis of the impact of UFF minimization on the RMSD values. I'm not going to repeat exactly that, but it is worth seeing if minimizing the conformations improves the RMSDs. We'll do that here with UFF.
In [172]:
minms = [Chem.Mol(m) for m in tms]
mintimes = []
for i,m in enumerate(minms):
    t1 = time.time()
    if not m.GetNumConformers():
        continue
    AllChem.UFFOptimizeMoleculeConfs(m,numThreads=0)
    t2 = time.time()
    mintimes.append(t2-t1)
    if not i%500:
        print("Done:",i)
Done: 0
RDKit ERROR: [05:11:24] UFFTYPER: Unrecognized charge state for atom: 14
Done: 500
 ...
Done: 4000
RDKit ERROR: [05:26:56] UFFTYPER: Unrecognized charge state for atom: 13
RDKit ERROR: [05:27:20] UFFTYPER: Unrecognized charge state for atom: 12
Done: 4500
In [173]:
minms_noh = [Chem.RemoveHs(m) for m in minms]
In [174]:
with open('../data/platinum_dataset_2017_01.chiral_confs.minimized.pkl','wb+') as outf:
    pickle.dump((minms_noh,mintimes),outf)
In [175]:
rmsvs_min = defaultdict(list)
for m,tm in zip(ms_noh,minms_noh):
    best = 1e8
    cids = [x.GetId() for x in tm.GetConformers()]
    for i,cid in enumerate(cids):
        rms = AllChem.GetBestRMS(m,tm,probeConfId=cid)
        best = min(rms,best)
        if not i:
            best_1 = best
        if i<5:
            best_5 = best
        if i<20:
            best_20 = best
        if i<25:
            best_25 = best
    rmsvs_min[50].append(best)
    rmsvs_min[1].append(best_1)
    rmsvs_min[5].append(best_5)
    rmsvs_min[20].append(best_20)
    rmsvs_min[25].append(best_25)
In [176]:
rms_by_nrot_min = defaultdict(lambda:defaultdict(list))
for i,nrot in enumerate(nrots):
    for k in rmsvs_min:
        rms_by_nrot_min[k][nrot].append(rmsvs_min[k][i])
In [304]:
figsize(8,8)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax2.plot([x for x,y in counts],[y for x,y in counts],linestyle='--', color='r')
ax2.set_ylabel('num structs')

for k in sorted(rms_by_nrot):
    means = [(x,np.average(values_there(rms_by_nrot[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],label="ETKDG"+str(k))
for k in (1,5,25,50):
    means = [(x,np.average(values_there(rms_by_nrot_min[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],linestyle=':',label="ETKDG_min"+str(k))

ax1.set_ylabel('mean(RMSD)');
ax1.set_xlabel('num_rotatable')
ax1.legend();
ax2.grid(axis='y')
ax1.grid(axis='x')
The impact is not large.
There's a lot more to be looked at here with the force fields, but this blog post is already getting fairly long, so I'm going to come back to that in a future post.