Problem and goal

In genetic analysis, it is often necessary to remove closely related samples because many genetic models assume that samples are independent. However, simply removing one sample from each related pair ignores how the samples are connected to one another, so more samples may be removed than necessary.

To address this problem, I propose a graph-based method to break down the close relationships among samples while attempting to retain as many samples as possible for subsequent analysis. This graph-based approach aims to capture the connections between samples and utilize this information to make informed decisions about which samples to retain and which ones to remove.

Algorithm to solve the problem

The basic idea for solving the problem comes from Yang X, Xu S, The HUGO Pan-Asian SNP Consortium (2011) Identification of Close Relatives in the HUGO Pan-Asian SNP Database. PLOS ONE 6(12): e29502. https://doi.org/10.1371/journal.pone.0029502

Graph Construction: Build a graph representation of the relationships among the samples, where each sample is represented as a node in the graph. The connections between samples are determined by the kinship coefficients calculated by KING (Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873).

Here we focus on the relationships among samples with a high degree of relatedness.
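As a rough illustration of the graph construction step, the sketch below reads pairwise kinship records into an adjacency map. It is not the code shipped in this repository: the function name read_kinship_graph is hypothetical, and it assumes a whitespace-delimited file with the two sample IDs in the first two columns and the kinship coefficient in the last column, as in the sample input shown later. The demo rows are made-up values, not real data.

```python
import io


def read_kinship_graph(handle, cutoff=0.0884, has_title=True):
    """Return {sample_id: set(related_ids)} for pairs with kinship >= cutoff.

    Assumed layout (not confirmed by the tool): whitespace-delimited rows,
    IDs in columns 1 and 2, kinship in the last column.
    """
    graph = {}
    for i, line in enumerate(handle):
        fields = line.split()
        # Skip blank lines and, if present, the title line.
        if not fields or (has_title and i == 0):
            continue
        sid1, sid2 = fields[0], fields[1]
        kinship = float(fields[-1])
        if kinship >= cutoff:
            # Undirected edge: record the relationship in both directions.
            graph.setdefault(sid1, set()).add(sid2)
            graph.setdefault(sid2, set()).add(sid1)
    return graph


# Made-up demo records; only the first two pairs pass the cutoff.
demo = io.StringIO(
    "SID1 SID2 HetHet IBS0 Kinship\n"
    "A B 0.050 0.0100 0.2500\n"
    "A C 0.048 0.0120 0.1000\n"
    "C D 0.045 0.0144 0.0610\n"
)
graph = read_kinship_graph(demo)
print(graph)
```

Only samples that take part in at least one relationship above the cutoff become nodes, so unrelated samples never enter the graph and are never candidates for removal.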

Breaking Down Relationships: Once the close relationships are identified, determine a strategy for breaking them while preserving as many samples as possible. This step must balance removing related samples against retaining as many samples as possible.

Algorithm rel_breaker
Input: kinship file F, relatedness cutoff C
Output: a list of sample ids for removal
read the pairwise sample kinship from file F; using sample ids as vertices, build a graph G from the sample pairs whose kinship is greater than or equal to cutoff C;
initialize an empty remove list L
while the graph G is not empty:
do
  Choose any vertex V of the graph G as the start point
  Use breadth-first search (BFS) to traverse the graph G from V, and record the vertex V' with the maximum degree of connectivity.
  for each vertex Vi connected to V'
  do
    remove the edge between vertex Vi and vertex V'
    if no other vertex is connected to vertex Vi
    then
      pop Vi from the graph G
    end if
  end for
  append V' to remove list L
  pop V'
end while
output remove list L
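The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not necessarily how the repository's script is written: the function names and the adjacency-map representation ({sample_id: set of related ids}) are assumptions, and ties between vertices of equal degree are broken in favour of the vertex visited first.

```python
from collections import deque


def max_degree_vertex(graph, start):
    """BFS over the component containing start; return its highest-degree vertex."""
    best = start
    seen = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if len(graph[vertex]) > len(graph[best]):
            best = vertex
        for neighbour in graph[vertex]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return best


def rel_breaker(graph):
    """Return the list of sample ids to remove, following the pseudocode."""
    # Work on a copy so the caller's graph is left untouched.
    graph = {vertex: set(nbrs) for vertex, nbrs in graph.items()}
    removed = []
    while graph:
        start = next(iter(graph))  # any vertex will do
        hub = max_degree_vertex(graph, start)
        # Popping the hub drops all of its edges at once.
        for neighbour in graph.pop(hub):
            graph[neighbour].discard(hub)
            if not graph[neighbour]:
                # No relationships left: this sample is retained.
                del graph[neighbour]
        removed.append(hub)
    return removed


# Hypothetical example: A is related to B, C and D; C is also related to E.
example = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "E"},
    "D": {"A"},
    "E": {"C"},
}
removed = rel_breaker(example)
print(removed)
```

In this example, removing A breaks three relationships at once, leaving only the C-E pair to resolve, so two samples are removed and three are retained. Note that this is a greedy heuristic, so it is not guaranteed to find the smallest possible removal list on every graph.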

Implementation and usage

I implemented the algorithm described above in Python. The usage of the tool is as follows:

usage: graph.py [-h] -i INPUT [-o OUTPUT] [-c CUTOFF] [-t]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Filename of input (kinship file)
  -o OUTPUT, --output OUTPUT
                        Filename of output (removed eid list)
  -c CUTOFF, --cutoff CUTOFF
                        Cutoff for kinship, support absolute kinship value or degree such as (1, 2, or 3), default=0.0884
  -t, --has-title       The input has title or not, default=False

For example:

python3 rel_breaker.py --input sample.txt --output removed.txt --cutoff 0.0884 --has-title
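The --cutoff option accepts either a kinship value or a relationship degree. One plausible way to interpret a degree is to map it to the lower kinship bound that KING uses for that degree (0.177 for first degree, 0.0884 for second degree, 0.0442 for third degree), which is consistent with the default of 0.0884. The sketch below illustrates that idea; the helper name resolve_cutoff and the exact mapping are assumptions, not taken from the tool's source.

```python
# Lower kinship bounds per relationship degree, following KING's
# inference criteria. Whether the tool uses exactly these values
# is an assumption.
DEGREE_CUTOFFS = {"1": 0.177, "2": 0.0884, "3": 0.0442}


def resolve_cutoff(text):
    """Interpret a --cutoff argument as a degree (1, 2, 3) or a kinship value."""
    text = text.strip()
    if text in DEGREE_CUTOFFS:
        return DEGREE_CUTOFFS[text]
    return float(text)


print(resolve_cutoff("2"))
print(resolve_cutoff("0.05"))
```

Under this mapping, passing a degree of 2 and passing 0.0884 would select the same set of relationships.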

Sample input and output

The sample input looks like the following (for more relationship pairs, please refer to sample.txt):

SID1 SID2 HetHet IBS0 Kinship
S1000025 S2025656 0.045 0.0144 0.061
S1000130 S3375759 0.047 0.0133 0.0739

The sample output is the list of removed sample ids (for the full list, please refer to removed.txt).
