dplyr microbenchmark numpy pandas parallel-computing python r scipy seaborn

Analiza_Danych_projekt

More information about this project is available in Projekt Zaliczeniowy PDF.pdf.

We received four .vcf files containing genetic data from healthy and diseased Holstein-Friesian cows.

The files held more than 14.3 million records for diseased individuals and more than 13.8 million for healthy individuals.

The aim was to find SNP-type mutations that may have a biological basis for the development of the disease, and to determine their relationship with selected parameters for each chromosome.
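As an illustration of what "SNP-type" means in practice, the sketch below filters VCF data lines down to single-base substitutions. It is not the project's code: the names `Snp` and `iter_snps` are hypothetical, and it relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, ...).

```python
from typing import Iterable, Iterator, NamedTuple

BASES = {"A", "C", "G", "T"}


class Snp(NamedTuple):
    """One single-nucleotide variant (hypothetical helper type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def iter_snps(lines: Iterable[str]) -> Iterator[Snp]:
    """Yield SNPs from VCF text lines, skipping headers and indels."""
    for line in lines:
        # "##" meta lines and the "#CHROM" header both start with "#".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line, ignore it in this sketch
        chrom, pos, _id, ref, alt = fields[:5]
        # ALT may list several alleles separated by commas.
        for allele in alt.split(","):
            # A SNP replaces exactly one base with exactly one other base.
            if ref.upper() in BASES and allele.upper() in BASES:
                yield Snp(chrom, int(pos), ref.upper(), allele.upper())
```

Because the function is a generator, a file with millions of records can be streamed line by line (for example `iter_snps(open(path))`) without loading it into memory.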

The project used the Python libraries pandas, os, multiprocessing, NumPy, SciPy, and seaborn, and the R packages dplyr, microbenchmark, and parallel. SNPs were detected using the chi-squared (chi2) contingency test with Yates' correction. Additionally, we examined the Pearson correlation between our results and chromosome length.
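The two statistics can be sketched without any dependencies. The project itself used SciPy; the version below is a standard-library illustration only. It assumes each SNP is summarised as a 2x2 table (for example, two allele counts in diseased versus healthy animals), which is an assumption about the data layout, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def chi2_yates_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Chi-squared test with Yates' correction for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column total must be positive")
    # Yates' correction shrinks |ad - bc| by n/2, but never below zero.
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

In this sketch, `pearson_r` would take one value per chromosome on each side, such as the number of significant SNPs and the chromosome length.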

Because of the size of the data, we parallelized the analysis in Python (multiprocessing.Pool()) and in R (makeCluster and clusterApply). As a result, we found that the R implementation ran faster than the Python one.
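A minimal sketch of the Python side of this approach follows, assuming the work is split per chromosome and each SNP is represented by a 2x2 count table. Both are assumptions, as are the names `count_significant` and `analyse_in_parallel`; the project's actual task split may differ.

```python
from multiprocessing import Pool
from typing import Dict, List, Tuple

Table = Tuple[int, int, int, int]  # counts (a, b, c, d) of a 2x2 table

# Chi-squared critical value for p < 0.05 with one degree of freedom.
CHI2_CRITICAL_1DF_5PCT = 3.841


def yates_chi2(table: Table) -> float:
    """Yates-corrected chi-squared statistic; 0.0 for degenerate tables."""
    a, b, c, d = table
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # an empty row or column carries no information
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    return n * diff ** 2 / margins


def count_significant(job: Tuple[str, List[Table]]) -> Tuple[str, int]:
    """Worker: count tables on one chromosome that exceed the threshold."""
    chrom, tables = job
    hits = sum(1 for t in tables if yates_chi2(t) > CHI2_CRITICAL_1DF_5PCT)
    return chrom, hits


def analyse_in_parallel(
    tables_by_chrom: Dict[str, List[Table]], processes: int = 2
) -> Dict[str, int]:
    """Process each chromosome in a separate worker of a process pool."""
    jobs = list(tables_by_chrom.items())
    with Pool(processes=processes) as pool:
        # The worker must be a top-level function so it can be pickled.
        return dict(pool.map(count_significant, jobs))
```

Call `analyse_in_parallel` from under an `if __name__ == "__main__":` guard, which is required on platforms where worker processes are started by re-importing the main module.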

About

Finding mutations in genomic data using the chi2 test and parallel functions in Python and R

Languages

R 76.4%, Python 23.6%