Know Thyself: Using Data Science to Explore Your Own Genome

DNA analysis with pandas and Selenium

"Nosce te ipsum" ("know thyself"), an ancient maxim frequently associated with gory anatomical pop-up books of yore.

23andme once offered me a free DNA and ancestry test kit if I participated in one of their clinical studies. In exchange for a cheek swab and baring my guts and soul in a score of questionnaires, I got my genome sequenced and gained access to myriad reports on where my ancestors were likely from, whom else on the site I might be related to, and what health conditions and traits I probably have inherited.

Seriously?

23andme already provides an overwhelming amount of consumer-ready infographics and tools, but I knew I could do more with the data. The intrepid may download their raw genetic data if they dare, so of course I poured it into pandas to see what I could make of it.

%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
sns.color_palette('Spectral')
import matplotlib.pyplot as plt

import numpy as np
import requests
import pandas as pd

import re

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

Importing my DNA into pandas and exploring the genome

Looking at the .txt file, I could see that I was missing some genotype values, denoted with '--'.

Most of the chromosomes are ints, but three are X, Y, and MT (for 'mitochondrial'). I needed to specify the data type properly so that pandas wouldn't throw an error when it found mixed data in the input.

The other columns were fairly straightforward. I also wanted pandas to ignore the prefatory comments at the beginning of the file that consisted of lines beginning with an octothorpe.

The arguments I needed to pass, therefore, were:

separator (tab-delimited)
dtype (as a dict)
na_values ('--') (n.b.: I decided against this in the end to avoid dealing with more NaNs)
comment ('#')

data = pd.read_csv('genome.txt', sep='\t', dtype={'rsid':'str', 'chromosome':'object', 'position':'int', 'genotype':'str'}, comment='#')

print(data)

               rsid chromosome  position genotype
0       rs548049170          1     69869       TT
1        rs13328684          1     74792       --
2         rs9283150          1    565508       AA
3           i713426          1    726912       --
4       rs116587930          1    727841       GG
5         rs3131972          1    752721       AG
6        rs12184325          1    754105       CC

...             ...        ...       ...      ...

[638531 rows x 4 columns]

A quick note on the column names:

rsid stands for Reference SNP cluster ID. It identifies unique SNPs.
SNPs are Single Nucleotide Polymorphisms ('snips'), locations in the genome that vary between individuals. They can influence disease risk and drug effects, tell you about your ancestry, and predict aspects of how you look and act.
All humans have almost the same sequence of 3 billion DNA bases (A,C,G, or T) distributed between their 23 pairs of chromosomes. But at certain locations, some differences exist that researchers have declared meaningful, for medical or other reasons (like genealogy).

I started to navigate my new DataFrame with basic exploratory data analysis and data cleaning.

# Read the data into a pandas DataFrame and do some EDA
df = pd.DataFrame(data)

df.head(25)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rsid	chromosome	position	genotype
0	rs548049170	1	69869	TT
1	rs13328684	1	74792	--
2	rs9283150	1	565508	AA
3	i713426	1	726912	--
4	rs116587930	1	727841	GG
5	rs3131972	1	752721	AG
6	rs12184325	1	754105	CC
7	rs12567639	1	756268	AA
8	rs114525117	1	759036	GG
9	rs12124819	1	776546	AA
10	rs12127425	1	794332	GG
11	rs79373928	1	801536	TT
12	rs72888853	1	815421	--
13	rs7538305	1	824398	AA
14	rs28444699	1	830181	AA
15	i713449	1	830731	--
16	rs116452738	1	834830	GG
17	rs72631887	1	835092	TT
18	rs28678693	1	838665	TT
19	rs4970382	1	840753	CT
20	rs4475691	1	846808	CC
21	rs72631889	1	851390	GG
22	rs7537756	1	854250	AA
23	rs13302982	1	861808	GG
24	rs376747791	1	863130	AA

df.isna().any()

 rsid         False
chromosome    False
position      False
genotype      False
dtype: bool

df.nunique()

 rsid         638531
chromosome        25
position      634934
genotype          20
dtype: int64

# How many chromosomes am I missing by not 
# having a Y chromosome?
Y_chromosome = df[df.chromosome == 'Y']

len(Y_chromosome)

I converted the letter chromosomes to numbers, cast them to ints, and created a dictionary to translate them back later so that I could better manipulate the data.

df['chromosome'].unique()

array(['1', '2', '3', '4', '5', '6', '7', '8', 
      '9', '10', '11', '12','13', '14', '15', 
      '16', '17', '18', '19', '20', '21', '22',
      'X','MT'], dtype=object)

df['chromosome'] = df['chromosome'].apply(lambda x: 
  re.sub(r'X', r'23', x))
df['chromosome'] = df['chromosome'].apply(lambda x: 
  re.sub(r'MT', r'24', x))

df['chromosome'] = df['chromosome'].apply(lambda x: 
  int(x))

chromosome_dict = {1:'1', 2:'2', 3:'3', 4:'4', 5:'5', 
           6:'6', 7:'7', 8:'8', 9:'9', 10:'10', 
           11:'11', 12:'12', 13:'13', 14:'14', 
           15:'15', 16:'16', 17:'17', 18:'18', 
           19:'19', 20:'20', 21:'21', 22:'22', 
           23:'X', 24:'MT'}

print(chromosome_dict)
display(df.info())

{1: '1', 2: '2', 3: '3', 4: '4', 5: '5', 6: '6', 7: '7', 
 8: '8', 9: '9', 10: '10', 11: '11', 12: '12', 13: '13', 
 14: '14', 15: '15', 16: '16', 17: '17', 18: '18', 
 19: '19', 20: '20', 21: '21', 22: '22', 23: 'X', 
 24: 'MT'}

<class 'pandas.core.frame.DataFrame'>
Int64Index: 634798 entries, 0 to 638530

Data columns (total 4 columns):

 rsid         634798 non-null object
chromosome    634798 non-null int64
position      634798 non-null int64
genotype      634798 non-null object

dtypes: int64(2), object(2)
memory usage: 24.2+ MB

None

There were 16,005 genotypes that I simply lacked:

genotype_na = df[df.genotype == '--']
len(genotype_na)

Some visualizations

df[df.chromosome == 1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49514 entries, 0 to 49513
Data columns (total 4 columns):
 rsid         49514 non-null object
chromosome    49514 non-null int64
position      49514 non-null int64
genotype      49514 non-null object
dtypes: int64(2), object(2)
memory usage: 1.9+ MB

# Remove that pesky whitespace from the column name
df.rename({' rsid': 'rsid'}, axis='columns', inplace=True)

How many SNPs are there per chromosome?

# We can do this manually with a for loop . . .
x = []
y = []
for k in chromosome_dict:
    x.append(k)
    y.append(len(df[df.chromosome == k]))
rsid_per_chromosome = dict(zip(x,y))

rsid_per_chromosome

# . . . but pandas makes it a lot easier!
rsid_per_chromosome_series = df.groupby('chromosome')['rsid'].count()
rsid_per_chromosome_series.columns = ['chromosome', 'count']

rsid_per_chromosome_series.plot.barh(figsize=(16,9), fontsize=15)
plt.show()

rsid per chromosome

Getting data on SNPs from SNPedia

To acquire more information about my DNA, I pulled files from SNPedia, a wiki investigating human genetics that gathers extensive data and cites to peer-reviewed scientific publications. SNPedia catalogues common, reproducible SNPs (or ones found in meta-analyses or studies of at least 500 patients), or those with other historic or medical significance.

The columns are:

Unnamed: 0 (actually the SNP name)
Magnitude (a subjective measure of interest)
Repute (a subjective measure of whether the genotype is "good" or "bad" to have based on research, and blank for things like ancestry and eye color)
Summary (a narrative description)

snp_df = pd.read_csv('result.csv')
snp_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Unnamed: 0	Magnitude	Repute	Summary
0	Rs661(A;A)	9.0	Bad	early onset Alzheimer's disease
1	Rs6647(T;T)	0.0	Good	Normal; two copies of Pi-M1V allele
2	Rs6647(C;C)	0.0	Good	Normal; two copies of Pi-M1A allele
3	Rs1303(T;T)	0.0	Good	common in clinvar
4	Rs28929471(G;G)	0.0	Good	common in complete genomics

Fun with regular expressions

To align with my original DataFrame, I created a genotype column and used regex to separate out the genotype, which was stitched onto the end of the SNP name.

snp_df['genotype'] = snp_df['Unnamed: 0'].apply(lambda x: 
  re.sub(r'.*([AGCT]);([AGCT])\)', r'\1\2', x))

snp_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Unnamed: 0	Magnitude	Repute	Summary	genotype
0	Rs661(A;A)	9.0	Bad	early onset Alzheimer's disease	AA
1	Rs6647(T;T)	0.0	Good	Normal; two copies of Pi-M1V allele	TT
2	Rs6647(C;C)	0.0	Good	Normal; two copies of Pi-M1A allele	CC
3	Rs1303(T;T)	0.0	Good	common in clinvar	TT
4	Rs28929471(G;G)	0.0	Good	common in complete genomics	GG

For consistency's sake, I renamed the columns to match my original DataFrame and made sure the rsids were all lower-case.

new_cols = ['rsid', 'magnitude', 'repute', 
'summary', 'genotype']
snp_df.columns = new_cols

I used regex to clean up the rsid a little more, too (because I will take any excuse to use more regex).

snp_df['rsid'] = snp_df['rsid'].map(lambda x : x.lower())
snp_df['rsid'] = snp_df['rsid'].map(lambda x : 
  re.sub(r'([a-z]{1,}[\d]+)\([agct];[agct]\)', 
  r'\1', x))

snp_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rsid	magnitude	repute	summary	genotype
0	rs661	9.0	Bad	early onset Alzheimer's disease	AA
1	rs6647	0.0	Good	Normal; two copies of Pi-M1V allele	TT
2	rs6647	0.0	Good	Normal; two copies of Pi-M1A allele	CC
3	rs1303	0.0	Good	common in clinvar	TT
4	rs28929471	0.0	Good	common in complete genomics	GG

snp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
rsid         100 non-null object
magnitude    100 non-null float64
repute       91 non-null object
summary      96 non-null object
genotype     100 non-null object
dtypes: float64(1), object(4)
memory usage: 4.0+ KB

I overwrote the null reputes and summaries.

null_repute = snp_df[snp_df['repute'].isnull()]
null_summaries = snp_df[snp_df['summary'].isnull()]
null_repute_and_summaries = pd.concat([null_repute,null_summaries]).drop_duplicates().reset_index(drop=True)
display(null_repute_and_summaries)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rsid	magnitude	repute	summary	genotype
0	rs28931569	4.0	NaN	high risk of emphysema	CC
1	rs28929473	0.0	NaN	NaN	TT
2	rs28931572	0.0	NaN	NaN	AA
3	rs1801252	3.0	NaN	NaN	GG
4	rs8192466	4.0	NaN	uncertain	TT
5	rs4986852	2.0	NaN	predisposition to breast cancer?	AA
6	rs1800709	2.0	NaN	predisposition to breast cancer?	TT
7	rs28931592	0.0	NaN	NaN	TT
8	rs4986893	2.1	NaN	poor metabolizer of several commonly prescribe...	AA

snp_df['repute'].fillna(value='Neutral', inplace=True)
snp_df['summary'].fillna(value='None', inplace=True)

# No no NaNette
snp_df.isna().any()

rsid         False
magnitude    False
repute       False
summary      False
genotype     False
dtype: bool

Merging my data with SNPedia

Appropriately enough, I did an inner join of the SNPedia DataFrame on my DNA to see what data, if any, it had on my particular genotypes.

new_df = snp_df.merge(df, how='inner', on=['rsid', 'genotype'], suffixes=('_SNPedia', '_myDNA'))

new_df.head(20)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rsid	magnitude	repute	summary	genotype	chromosome	position
0	rs1303	0.0	Good	common in clinvar	TT	14	94844843
1	rs17580	2.5	Bad	a slightly reduced functionality form of Alpha...	TT	14	94847262
2	rs28931580	0.0	Good	common in clinvar	AA	12	50344783
3	rs1042714	0.0	Good	normal	CC	5	148206473
4	rs1800888	0.0	Good	normal	CC	5	148206885
5	rs2303790	0.0	Good	common in clinvar	AA	16	57017292
6	rs5882	2.0	Bad	Faster aging. Increased risk for Dementia. Les...	AA	16	57016092
7	rs2230199	2.0	Bad	2.5x+ risk of ARMD	GG	19	6718387
8	rs28931608	0.0	Good	common in clinvar	GG	7	75614497
9	rs4986893	0.0	Good	normal	GG	10	96540410
10	rs28399504	0.0	Good	normal	AA	10	96522463
11	rs2234922	0.0	Good	common in clinvar	AA	1	226026406
12	rs28931614	0.0	Good	common in clinvar	GG	4	1806119
13	rs28933068	0.0	Good	common in clinvar	CC	4	1807371

What's hiding in there?

# Create a DataFrame for some subsets of genes
good_genes = new_df[new_df.repute == 'Good']
bad_genes = new_df[new_df.repute == 'Bad']
interesting_genes = new_df[new_df.magnitude > 4] # 4 is the threshold for "worth your time" given by SNPedia

I have plenty of "good" genotypes, but none with a nonzero magnitude.

good_genes

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rsid	repute	summary	genotype	chromosome	position
0	rs1303	Good	common in clinvar	TT	14	94844843
2	rs28931580	Good	common in clinvar	AA	12	50344783
3	rs1042714	Good	normal	CC	5	148206473
4	rs1800888	Good	normal	CC	5	148206885
5	rs2303790	Good	common in clinvar	AA	16	57017292
8	rs28931608	Good	common in clinvar	GG	7	75614497
9	rs4986893	Good	normal	GG	10	96540410
10	rs28399504	Good	normal	AA	10	96522463
11	rs2234922	Good	common in clinvar	AA	1	226026406
12	rs28931614	Good	common in clinvar	GG	4	1806119
13	rs28933068	Good	common in clinvar	CC	4	1807371

I have three "bad" genotypes with a nonzero magnitude.

bad_genes

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rsid	magnitude	repute	summary	genotype	chromosome	position
1	rs17580	2.5	Bad	a slightly reduced functionality form of Alpha...	TT	14	94847262
6	rs5882	2.0	Bad	Faster aging. Increased risk for Dementia. Les...	AA	16	57016092
7	rs2230199	2.0	Bad	2.5x+ risk of ARMD	GG	19	6718387

Sadly, I had no "interesting" genotypes above the threshold of 4, although hearteningly I did possess some slightly interesting bad ones.

Scrape relevant articles with Selenium

I decided I might like to read up on my bad genetics, so I used Selenium to scrape the abstracts of some scientific papers from PubMed.

# Get the base URL from SNPedia
base_url = 'https://www.snpedia.com/index.php/'

# Create URLs for each gene that I want to study
gene_urls = [base_url + rsid for rsid in bad_genes['rsid']]

# Initialize Selenium
browser = webdriver.Chrome()

import time

# Write a function to visit the SNPedia URLs, click through to PubMed, 
# and retrieve the info on the articles for each gene

def scrape_abstracts(urls):
    
    rsid_list = []
    all_article_title = []
    all_article_citation = []
    all_article_authors = []
    all_article_abstract = []
    all_article_links = []

    for url in urls:
        link_urls = []
        browser.get(url) #load url
        rsid = browser.find_element_by_css_selector('.firstHeading').text
        links_elements = browser.find_elements_by_partial_link_text('PMID')
        
        # get the URLs to the PubMed pages
      for link in links_elements:
          link_urls.append(link.get_attribute('href')) 
    
      # follow each link element to PubMed
      for element in link_urls:
          browser.get(element) 
          time.sleep(2)
          article_title = browser.find_element_by_xpath("//div[@class='cit']/../h1").text
          article_citation = browser.find_element_by_class_name('cit').text
          article_authors = browser.find_element_by_class_name('auths').text
          article_abstract = browser.find_element_by_class_name('abstr').text
            
          rsid_list.append(rsid)
          all_article_title.append(article_title)
          all_article_citation.append(article_citation)
          all_article_authors.append(article_authors)
          all_article_abstract.append(article_abstract)
          all_article_links.append(element)

  # store the information
    df = pd.DataFrame()
    df['rsid'] = rsid_list
    df['article_title'] = all_article_title
    df['article_citation'] = all_article_citation
    df['article_authors'] = all_article_authors
    df['article_abstract'] = all_article_abstract
    df['link'] = all_article_links
        
    df = df.drop_duplicates()
        
    df.index = range(len(df.index))
    
    return df

abstracts_df = scrape_abstracts(gene_urls)

For later hypochondriacal perusal, I exported my findings, complete with abstracts and hyperlinks, to a CSV file using the pandas DataFrame.to_csv method.

# DataFrame to CSV
export_csv = abstracts_df.to_csv(r'/Users/lorajohns/Documents/Python/DNA/DNA_articles.csv')

Reading up on the medical literature

DNA_articles.csv in Numbers

Now I have a handy CSV file, nicely formatted, with citations to scientific articles analyzing and describing my probably un-problematic, but probationally proditory genotypes. Python provides prodigious tools to engage in literal introspection that the sawbones of old could never have imagined.

Solertis / DNA_pandas_selenium

Know Thyself: Using Data Science to Explore Your Own Genome

DNA analysis with pandas and Selenium

Importing my DNA into pandas and exploring the genome

Some visualizations

Getting data on SNPs from SNPedia

Fun with regular expressions

Merging my data with SNPedia

What's hiding in there?

Scrape relevant articles with Selenium

Reading up on the medical literature

About

Languages