reimandlab / ActiveDriverDB

searching for mutations

reimand0 opened this issue · comments

TP53 R282W, the example query on the front page, is not actually working:
(screenshot of the failed search, 2017-07-18)

It lands on the page https://activedriverdb.org/search/mutations

It's a known issue - as I wrote in an email earlier this week, I am re-importing the mappings, and this functionality depends on the mappings database, which is not accessible at the moment. I am having difficulties recreating the mappings right now (since we got the MC3 data it goes much slower - around 4 days - and sometimes breaks due to MySQL issues), but a fix that will address this will soon land on the faster_mappings_import branch.

That's OK - I did not realise the update is still running!

I restored the old mappings database, so the mutation searching should work again.

Even though I reduced the overhead of my script significantly, the slowdown still occurs. It may be an issue of the data size and of BerkeleyDB itself. The slowdown starts gradually and intensifies as the data continue to be imported, so the process effectively stalls at some point, less than halfway to the end (with a projection of finishing the work in the next 120 hours or so, whereas during the initial iterations it projected 10 hours to import all the data).
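For context, this is roughly how such a projection can be tracked during an import; the helper below is hypothetical and not code from the repository, it only illustrates how a 10 h estimate can drift towards 120 h as the write rate degrades:

```python
import time
from collections import deque


def track_eta(records, total_count, window=100_000):
    """Yield records unchanged while logging a rolling import rate and projected ETA."""
    checkpoints = deque(maxlen=2)   # timestamps of the last two checkpoints
    checkpoints.append(time.time())

    for done, record in enumerate(records, start=1):
        yield record
        if done % window == 0:
            checkpoints.append(time.time())
            rate = window / (checkpoints[-1] - checkpoints[0])   # records per second
            remaining = total_count - done
            print(f'{done}/{total_count} imported, {rate:.0f} rec/s, '
                  f'ETA {remaining / rate / 3600:.1f} h')
```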

I found this issue on Stack Overflow: BerkeleyDB write performance problems, where another BDB user describes issues identical to the ones I am facing now. The discussion revolves around caching and hardware-specific issues; citing the most relevant parts:

What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning performance up to that point) but once it fills, the application, even the whole system, can slow dramatically, even stop.

Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion supported by clever caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.

I suggest you consider a disk controller cache. Its battery backed up memory would need to be about the size of your data.

As one of the answers indicates, increasing the BDB cache may improve performance dramatically. I will play around with various BDB settings later and set the import to run over the weekend. There are various options for tuning the BDB hashing strategy, like setting the expected number of elements, the hash table bucket density and the page size. I will write more here if any of this helps.
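For illustration, a minimal sketch of those knobs using the bsddb3 bindings; the paths and the concrete numbers here are assumptions for the example, not the settings actually used in ActiveDriverDB:

```python
from bsddb3 import db

# Environment with an enlarged memory pool cache.
env = db.DBEnv()
env.set_cachesize(8, 0, 1)   # (gigabytes, extra bytes, number of cache regions)
env.open('/data/mappings_env',          # directory must already exist (assumed path)
         db.DB_CREATE | db.DB_INIT_MPOOL)

# Hash database with tuned hashing parameters (all set before open()).
mappings = db.DB(env)
mappings.set_h_nelem(500_000_000)   # expected number of elements (a guess)
mappings.set_h_ffactor(40)          # desired keys per bucket (fill factor)
mappings.set_pagesize(4096)         # page size in bytes
mappings.open('mappings.db', None, db.DB_HASH, db.DB_CREATE)
```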

If all of this fails, we may need to use another key-value database for the mappings.

This was finally resolved four days ago (details in the referenced commit); lots of optimizations, increasing the BDB cache to 8 GB and using separate import functions for the refseq mappings and for the nucleotide mappings allowed everything to be re-imported nicely. It still takes about 24 hours to generate the mappings, but this is manageable and the script runs at a steady pace. Today I swapped the mappings database on production, so the issue is dealt with completely.
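As a rough illustration of the "separate import functions" idea (the function names and record formats below are hypothetical, not the actual code from the referenced commit): running each mapping type as its own pass means each BerkeleyDB file effectively gets the cache to itself while it is being populated.

```python
def import_refseq_mappings(rows, refseq_db):
    """First pass: store only protein identifier -> refseq identifier pairs."""
    for protein_id, refseq_id in rows:
        refseq_db.put(protein_id.encode(), refseq_id.encode())


def import_nucleotide_mappings(rows, nucleotide_db):
    """Second pass: store genomic position key -> protein mutation pairs."""
    for genomic_key, protein_mutation in rows:
        nucleotide_db.put(genomic_key.encode(), protein_mutation.encode())
```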

No, I don't think so; we should be able to easily handle additional datasets of a size comparable to pancanAtlas. Potential problems may arise if the size of an individual dataset is much larger than before (like 5-10 times larger).

Even if ExAC's size turned out to be a problem, we could (maybe) use a MAF filtering criterion. According to their paper, 54% of variants are singletons; are such mutations significant for ActiveDriverDB users?
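For illustration only, a minimal sketch of what such a MAF filter could look like; the field names follow VCF INFO conventions (AC = allele count, AN = allele number) and the cutoff value is an arbitrary assumption, not a proposal for the actual import:

```python
def keep_variant(allele_count: int, allele_number: int,
                 min_maf: float = 0.0001) -> bool:
    """Drop singletons and ultra-rare variants below a minor allele frequency cutoff."""
    if allele_count <= 1:           # singletons (~54% of ExAC variants)
        return False
    return allele_count / allele_number >= min_maf


# Example: a variant seen once in ~120 000 alleles would be filtered out.
print(keep_variant(allele_count=1, allele_number=121412))   # False
```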