Friday, April 28, 2017

Using custom fingerprints in PostgreSQL

A question recently came up on the mailing list about how to use custom fingerprints with the RDKit PostgreSQL cartridge without having to edit the cartridge code itself. Since the answer isn't trivial but may be useful to others, I'm doing a blog post with the answer.

We'll start the usual way, with a bunch of imports. In the interests of being maximally explicit and having this whole notebook be normal Python code, I'm handling the database connection with the usual psycopg2 connector to PostgreSQL.

In [1]:
from rdkit import Chem
from rdkit import DataStructs
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import psycopg2,time
print(time.asctime(),"rdkit=",rdBase.rdkitVersion)
Wed Apr 26 13:22:19 2017 rdkit= 2017.03.1

Read in a set of molecule data that we've used before:

In [2]:
with open('../data/chembl16_2010-2012.smi') as inf:
    data = [x.strip().split() for x in inf]
len(data)
Out[2]:
234681

Establish a connection to the database we'll use and insert that data:

In [3]:
conn = psycopg2.connect(database='rdkit_blog_demo')
curs = conn.cursor()
curs.execute('create extension if not exists rdkit')
curs.execute('drop table if exists raw_data')
curs.execute('create table raw_data (smiles text,molregno int)')
curs.executemany('insert into raw_data values (%s,%s)',data)
conn.commit()

Create a molecule table with the first 10K rows (we don't need all the data for the purposes of this post):

In [4]:
curs.execute('drop table if exists mols')
curs.execute('select molregno,mol_from_smiles(smiles::cstring) as m into mols from raw_data limit 10000')
conn.commit()

Now generate our custom fingerprints for those molecules. We're using one of the Atom Pair/Topological Torsion variants that the RDKit provides here:

In [5]:
from rdkit.Chem.AtomPairs import Sheridan
from rdkit.Chem import rdMolDescriptors
fps = []
# grab the molecules, we're pulling them out in their pickled form:
curs.execute('select molregno,mol_send(m) from mols limit 10000')
for molregno,pkl in curs.fetchall():
    if pkl is None: continue
    # construct a molecule
    m = Chem.Mol(pkl.tobytes())
    # now do our fingerprint. We're using Topological Torsions with Sheridan's binding properties
    # to define atom types
    fp = Sheridan.GetBTFingerprint(m,fpfn=rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect)
    fps.append((molregno,fp))

Now insert the fingerprints into the database, we do this by sending a byte string to the database using DataStructs.BitVectToBinaryText() on the python side and then converting that into a bit vector in the database using the function bfp_from_binary_text():

In [7]:
curs.execute('drop table if exists fps')
curs.execute('create table fps (molregno int, btfp bfp)')
curs.executemany('insert into fps values (%s,bfp_from_binary_text(%s))',
                 [(x,DataStructs.BitVectToBinaryText(y)) for x,y in fps])
conn.commit()

If this were a larger database I'd also create an index on the fingerprint column in order to speed similarity searches up a bit, but since this one's small we'll skip that.

And here's how you do a query:

In [8]:
fp = fps[-1][-1]
curs.execute('set rdkit.tanimoto_threshold=0.6')
curs.execute('select molregno,m from mols join fps using (molregno) where btfp%%bfp_from_binary_text(%s)',
             (DataStructs.BitVectToBinaryText(fp),))
res = curs.fetchall()
len(res)
Out[8]:
3

It's always good to test that we're getting the right answers, so let's verify that by repeating the same process in Python (just to be sure we did everything right!):

In [9]:
len([x for (x,y) in fps if DataStructs.TanimotoSimilarity(y,fp)>=0.6])
Out[9]:
3

Pull back some similarity values:

In [10]:
curs.execute('select molregno,tanimoto_sml(btfp,bfp_from_binary_text(%s)) from fps limit 10',
             (DataStructs.BitVectToBinaryText(fp),))
curs.fetchall()
Out[10]:
[(23, 0.159090909090909),
 (97, 0.223880597014925),
 (115, 0.256410256410256),
 (146, 0.228571428571429),
 (147, 0.160714285714286),
 (148, 0.102941176470588),
 (173, 0.226415094339623),
 (194, 0.160714285714286),
 (213, 0.333333333333333),
 (205, 0.203125)]

And again, just a test, ensure that we're getting the same thing we'd see in Python:

In [11]:
[(x,DataStructs.TanimotoSimilarity(y,fp)) for (x,y) in fps][:10]
Out[11]:
[(23, 0.1590909090909091),
 (97, 0.22388059701492538),
 (115, 0.2564102564102564),
 (146, 0.22857142857142856),
 (147, 0.16071428571428573),
 (148, 0.10294117647058823),
 (173, 0.22641509433962265),
 (194, 0.16071428571428573),
 (213, 0.3333333333333333),
 (205, 0.203125)]

That's it. Not particularly complicated, but still a useful thing to know how to do. Following this approach, any bit vector fingerprint (even one's generated outside of the RDKit) can be inserted into PostgreSQL tables and then searched using the RDKit cartridge.