This is an attempt to improve the CDK HashFingerprint (Fingerprinter class). The idea behind the improved version is borrowed from my blog improvised hashing function and their impact on the fingerprints. http://chembioinfo.com/2011/10/30/revisiting-molecular-hashed-fingerprints/ Command line interface /* Test improved CDK FP */ java -jar BenchmarkHashedFingerprinter.jar test/data/mol hash 2 2000 /* Test CDK default FP */ java -jar BenchmarkHashedFingerprinter.jar test/data/mol cdk 2 2000 *************************************** Improved CDK HashedFingerprinter class with 1024 size FP *************************************** CASES: TP: FP: TN: FN: ACCURACY: TPR: FPR: Time (mins): 200*200 629 189 39182 0 0.995 1.000 0.005 0.11 400*400 2428 972 156600 0 0.994 1.000 0.006 0.37 600*600 4940 2449 352611 0 0.993 1.000 0.007 0.75 800*800 8562 5083 626355 0 0.992 1.000 0.008 1.27 1000*1000 12802 9011 978187 0 0.991 1.000 0.009 2.04 1200*1200 17178 12727 1410095 0 0.991 1.000 0.009 2.94 *************************************** Improved New HashedFingerprinter class with 2048 size FP *************************************** ------------------------------------------------------------------------------ CASES: TP: FP: TN: FN: ACCURACY: TPR: FPR: Time (mins): ------------------------------------------------------------------------------ 200*200 629 189 39182 0 0.995 1.000 0.005 0.1 400*400 2381 974 156645 0 0.994 1.000 0.006 0.35 600*600 4882 2452 352666 0 0.993 1.000 0.007 0.71 800*800 8484 5085 626431 0 0.992 1.000 0.008 1.19 1000*1000 12710 9014 978276 0 0.991 1.000 0.009 1.93 1200*1200 17070 12730 1410200 0 0.991 1.000 0.009 2.77 *************************************** CDK Default Fingerprinter class with 1024 size FP *************************************** CASES: TP: FP: TN: FN: ACCURACY: TPR: FPR: Time (mins): 200*200 629 298 39073 0 0.993 1.000 0.008 0.11 400*400 2428 1691 155881 0 0.989 1.000 0.011 0.37 600*600 4940 3765 351295 0 0.990 1.000 0.011 0.74 800*800 8562 7522 623916 0 0.988 1.000 0.012 1.26 1000*1000 12802 13922 973276 0 0.986 1.000 0.014 2.05 1200*1200 17178 19262 1403560 0 0.987 1.000 0.014 2.92 Results: The improved hashed fingerprinter has better "Accuracy" and ~30-40% lesser false positives (FPs) than the original version! /* Test new FP with ring matcher */ java -jar BenchmarkHashedFingerprinter.jar test/data/mol hash 1 2000 ------------------------------------------------------------------------------ CASES: TP: FP: TN: FN: ACCURACY: TPR: FPR: Time (mins): ------------------------------------------------------------------------------ 200*200 629 144 39227 0 0.996 1.000 0.004 0.1 400*400 2381 842 156777 0 0.995 1.000 0.005 0.34 600*600 4882 2161 352957 0 0.994 1.000 0.006 0.71 800*800 8484 4477 627039 0 0.993 1.000 0.007 1.2 1000*1000 12710 7977 979313 0 0.992 1.000 0.008 1.97 1200*1200 17070 11429 1411501 0 0.992 1.000 0.008 2.82 The improved hashed fingerprinter with ring matcher has better "Accuracy" and ~40% lesser false positives (FPs) than the original version! /* Test new FP with bloom filter and ring matcher */ java -jar BenchmarkHashedFingerprinter.jar test/data/mol hashbloom 1 2000