dbichko / CDKHashFingerPrint

Improvised CDK Hashed fingerprint

This is an attempt to improve the CDK hashed fingerprint (Fingerprinter class).
The idea behind the improved version comes from my blog post on improved hashing functions and their impact on fingerprints:

http://chembioinfo.com/2011/10/30/revisiting-molecular-hashed-fingerprints/

Command line interface

/*  Test improved CDK FP */

java -jar BenchmarkHashedFingerprinter.jar test/data/mol hash  2  2000
 
/* Test CDK default FP */
 
java -jar BenchmarkHashedFingerprinter.jar test/data/mol cdk  2  2000
   
***************************************
Improved CDK HashedFingerprinter class with a fingerprint size of 1024
***************************************
CASES:       TP:     FP:     TN:       FN:   ACCURACY:   TPR:    FPR:    Time (mins):
200*200      629     189     39182     0     0.995       1.000   0.005   0.11
400*400      2428    972     156600    0     0.994       1.000   0.006   0.37
600*600      4940    2449    352611    0     0.993       1.000   0.007   0.75
800*800      8562    5083    626355    0     0.992       1.000   0.008   1.27
1000*1000    12802   9011    978187    0     0.991       1.000   0.009   2.04
1200*1200    17178   12727   1410095   0     0.991       1.000   0.009   2.94

***************************************
Improved HashedFingerprinter class with a fingerprint size of 2048
***************************************

------------------------------------------------------------------------------
CASES:       TP:     FP:     TN:       FN:   ACCURACY:   TPR:    FPR:    Time (mins):
------------------------------------------------------------------------------
200*200      629     189     39182     0     0.995       1.000   0.005   0.1
400*400      2381    974     156645    0     0.994       1.000   0.006   0.35
600*600      4882    2452    352666    0     0.993       1.000   0.007   0.71
800*800      8484    5085    626431    0     0.992       1.000   0.008   1.19
1000*1000    12710   9014    978276    0     0.991       1.000   0.009   1.93
1200*1200    17070   12730   1410200   0     0.991       1.000   0.009   2.77

***************************************
CDK default Fingerprinter class with a fingerprint size of 1024
***************************************
CASES:       TP:     FP:     TN:       FN:   ACCURACY:   TPR:    FPR:    Time (mins):
200*200      629     298     39073     0     0.993       1.000   0.008   0.11
400*400      2428    1691    155881    0     0.989       1.000   0.011   0.37
600*600      4940    3765    351295    0     0.990       1.000   0.011   0.74
800*800      8562    7522    623916    0     0.988       1.000   0.012   1.26
1000*1000    12802   13922   973276    0     0.986       1.000   0.014   2.05
1200*1200    17178   19262   1403560   0     0.987       1.000   0.014   2.92



Results:

The improved hashed fingerprinter has better accuracy
and ~30-40% fewer false positives (FPs) than the default CDK Fingerprinter at every test size!
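
The ACCURACY, TPR and FPR columns, and the relative drop in false positives, can be recomputed from the raw counts. The sketch below assumes the standard confusion-matrix definitions; the benchmark's own formulas are not shown here, but these definitions reproduce the 200*200 rows above. The class `ConfusionMetrics` and its method names are hypothetical and are not part of BenchmarkHashedFingerprinter.jar.

```java
import java.util.Locale;

/** Hypothetical helper (not part of this repository) that derives the table metrics. */
final class ConfusionMetrics {

    private ConfusionMetrics() {}

    /** Fraction of all comparisons classified correctly: (TP + TN) / total. */
    static double accuracy(long tp, long fp, long tn, long fn) {
        return (double) (tp + tn) / (tp + fp + tn + fn);
    }

    /** True positive rate: TP / (TP + FN). */
    static double tpr(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    /** False positive rate: FP / (FP + TN). */
    static double fpr(long fp, long tn) {
        return (double) fp / (fp + tn);
    }

    /** Relative drop in false positives versus a baseline, as a fraction of the baseline. */
    static double fpReduction(long baselineFp, long improvedFp) {
        return 1.0 - (double) improvedFp / baselineFp;
    }

    public static void main(String[] args) {
        // 200*200 row, improved fingerprinter (size 1024): TP=629, FP=189, TN=39182, FN=0
        System.out.println(String.format(Locale.ROOT,
                "ACCURACY=%.3f TPR=%.3f FPR=%.3f",
                accuracy(629, 189, 39182, 0), tpr(629, 0), fpr(189, 39182)));
        // The same row for the CDK default fingerprinter has FP=298
        System.out.println(String.format(Locale.ROOT,
                "FP reduction=%.1f%%", 100 * fpReduction(298, 189)));
    }
}
```

For the 200*200 case this gives an accuracy of 0.995, a TPR of 1.000 and an FPR of 0.005, matching the table, and a false-positive reduction of about 36.6% relative to the CDK default.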

/* Test new FP with ring matcher */

java -jar BenchmarkHashedFingerprinter.jar test/data/mol hash  1  2000

------------------------------------------------------------------------------
CASES:       TP:     FP:     TN:       FN:   ACCURACY:   TPR:    FPR:    Time (mins):
------------------------------------------------------------------------------
200*200      629     144     39227     0     0.996       1.000   0.004   0.1
400*400      2381    842     156777    0     0.995       1.000   0.005   0.34
600*600      4882    2161    352957    0     0.994       1.000   0.006   0.71
800*800      8484    4477    627039    0     0.993       1.000   0.007   1.2
1000*1000    12710   7977    979313    0     0.992       1.000   0.008   1.97
1200*1200    17070   11429   1411501   0     0.992       1.000   0.008   2.82

The improved hashed fingerprinter with ring matcher has better accuracy
and ~40-50% fewer false positives (FPs) than the original version!

/* Test new FP with bloom filter and ring matcher */

java -jar BenchmarkHashedFingerprinter.jar test/data/mol hashbloom  1  2000
