reimandlab / ActiveDriverDB

searching for mutations

reimand0 opened this issue · comments

TP53 R282W, the example query on the front page, is not actually working:
(screenshot of the failed search, 2017-07-18)

It lands on the page https://activedriverdb.org/search/mutations

It's a known issue - as I wrote in an email earlier this week, I am re-importing the mappings, and this functionality depends on the mappings database, which is not accessible at the moment. I am having difficulties recreating the mappings right now (since we got the MC3 data it goes much slower - around 4 days - and sometimes breaks due to MySQL issues), but a fix that will address this will soon land on the faster_mappings_import branch.

That's OK - I did not realise the update is still running!

I restored the old mappings database, so the mutation searching should work again.

Even though I reduced the overhead of my script significantly, the slowdown still occurs. It may be an issue of the data size and of BerkeleyDB itself. The slowdown starts gradually and intensifies as the data continue to be imported, so the process effectively stalls at some point, less than halfway to the end (with a projection of finishing the work in the next 120 hours or so, whereas during the initial iterations it projected 10 hours to import all the data).
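For context, this is roughly how such a projection can be tracked during an import; the helper below is hypothetical and not code from the repository, it only illustrates how a 10 h estimate can drift towards 120 h as the write rate degrades:

```python
import time
from collections import deque


def track_eta(records, total_count, window=100_000):
    """Yield records unchanged while logging a rolling import rate and projected ETA."""
    checkpoints = deque(maxlen=2)   # timestamps of the last two checkpoints
    checkpoints.append(time.time())

    for done, record in enumerate(records, start=1):
        yield record
        if done % window == 0:
            checkpoints.append(time.time())
            rate = window / (checkpoints[-1] - checkpoints[0])   # records per second
            remaining = total_count - done
            print(f'{done}/{total_count} imported, {rate:.0f} rec/s, '
                  f'ETA {remaining / rate / 3600:.1f} h')
```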

I found this issue on Stack Overflow: BerkeleyDB write performance problems, where another BDB user describes issues identical to the ones I am facing now. The discussion revolves around caching and hardware-specific issues; citing the most relevant parts:

What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning performance up to that point) but once it fills, the application, even the whole system, can slow dramatically, even stop.

Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion supported by clever caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.

I suggest you consider a disk controller cache. Its battery backed up memory would need to be about the size of your data.

As one of the answers indicates, increasing the BDB cache may improve performance dramatically. I will play around with various BDB settings later and set the import to run over the weekend. There are various options for tuning the BDB hashing strategy, like setting the expected number of elements, the hash table bucket density and the page size. I will write more here if any of this helps.
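For illustration, a minimal sketch of those knobs using the bsddb3 bindings; the paths and the concrete numbers here are assumptions for the example, not the settings actually used in ActiveDriverDB:

```python
from bsddb3 import db

# Environment with an enlarged memory pool cache.
env = db.DBEnv()
env.set_cachesize(8, 0, 1)   # (gigabytes, extra bytes, number of cache regions)
env.open('/data/mappings_env',          # directory must already exist (assumed path)
         db.DB_CREATE | db.DB_INIT_MPOOL)

# Hash database with tuned hashing parameters (all set before open()).
mappings = db.DB(env)
mappings.set_h_nelem(500_000_000)   # expected number of elements (a guess)
mappings.set_h_ffactor(40)          # desired keys per bucket (fill factor)
mappings.set_pagesize(4096)         # page size in bytes
mappings.open('mappings.db', None, db.DB_HASH, db.DB_CREATE)
```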

If all of this fails, we may need to use another key-value database for the mappings.

This was finally resolved four days ago (details in the referenced commit); lots of optimizations, increasing the BDB cache to 8 GB and using separate import functions for the refseq mappings and for the nucleotide mappings allowed everything to be re-imported nicely. It still takes about 24 hours to generate the mappings, but this is manageable and the script runs at a steady pace. Today I swapped the mappings database on production, so the issue is dealt with completely.
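As a rough illustration of the "separate import functions" idea (the function names and record formats below are hypothetical, not the actual code from the referenced commit): running each mapping type as its own pass means each BerkeleyDB file effectively gets the cache to itself while it is being populated.

```python
def import_refseq_mappings(rows, refseq_db):
    """First pass: store only protein identifier -> refseq identifier pairs."""
    for protein_id, refseq_id in rows:
        refseq_db.put(protein_id.encode(), refseq_id.encode())


def import_nucleotide_mappings(rows, nucleotide_db):
    """Second pass: store genomic position key -> protein mutation pairs."""
    for genomic_key, protein_mutation in rows:
        nucleotide_db.put(genomic_key.encode(), protein_mutation.encode())
```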

No, I don't think so; we should be able to easily handle additional datasets of a size comparable to pancanAtlas. Potential problems may arise if the size of an individual dataset is much larger than before (like 5-10 times larger).

Even if ExAC's size turned out to be a problem, we could (maybe) use a MAF filtering criterion. According to their paper, 54% of variants are singletons; are such mutations significant for ActiveDriverDB users?
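For illustration only, a minimal sketch of what such a MAF filter could look like; the field names follow VCF INFO conventions (AC = allele count, AN = allele number) and the cutoff value is an arbitrary assumption, not a proposal for the actual import:

```python
def keep_variant(allele_count: int, allele_number: int,
                 min_maf: float = 0.0001) -> bool:
    """Drop singletons and ultra-rare variants below a minor allele frequency cutoff."""
    if allele_count <= 1:           # singletons (~54% of ExAC variants)
        return False
    return allele_count / allele_number >= min_maf


# Example: a variant seen once in ~120 000 alleles would be filtered out.
print(keep_variant(allele_count=1, allele_number=121412))   # False
```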