nf-core / hic

Analysis of Chromosome Conformation Capture data (Hi-C)

Home Page: https://nf-co.re/hic

Sort extremely large pairs files

andyyhchen opened this issue

##sort -k2,2 -k4,4 -k3,3n -k5,5n ${prefix}_contacts.pairs | bgzip -c > ${prefix}_contacts.pairs.gz

I'm not sure whether pairix still requires a sorted pairs file, but I suggest replacing the original command, sort -k2,2 -k4,4 -k3,3n -k5,5n ${prefix}_contacts.pairs | bgzip -c > ${prefix}_contacts.pairs.gz, with the following. It avoids a segmentation fault when dealing with extremely large pairs files (>50 GB) by sorting one chr1 chunk at a time instead of the whole file at once.

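    ## Reorder the input columns to: readID chr1 pos1 chr2 pos2 strand1 strand2
    awk '{OFS="\t";print \$1,\$2,\$3,\$5,\$6,\$4,\$7}' $vpairs > ${prefix}_contacts.pairs
    ## Split into one chunk per chr1 so that each sort works on a smaller file
    awk '{file=\$2 ".chunk"}{print > file}' ${prefix}_contacts.pairs
    for X in *.chunk; do sort -k2,2 -k4,4 -k3,3n -k5,5n < \$X > sorted-\$X; done
    ## Concatenate the sorted chunks in version-sorted (natural) chromosome order
    ls sorted-*.chunk | sort  -V | xargs cat > ${prefix}_contacts.pairs.tmp
    ## Compress with 4 threads, then index with pairix
    bgzip -c -@ 4  ${prefix}_contacts.pairs.tmp > ${prefix}_contacts.pairs.gz
    pairix -f ${prefix}_contacts.pairs.gz
    ## Clean up intermediate files
    rm *chunk
    rm ${prefix}_contacts.pairs.tmp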
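
For anyone who wants to check the idea on a small input first, here is a minimal sketch of the same split, sort and concatenate approach as a plain bash script (outside Nextflow, so `$` is not escaped). The toy records and the file names `toy.pairs`, `whole.sorted` and `chunked.sorted` are made up for illustration. One deliberate difference from the snippet above: the chunks are concatenated in the byte-wise order of their chr1 names rather than with `sort -V`, so that the result can be compared directly against a single whole-file sort.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # byte-wise collation, so every sort below agrees on ordering

workdir=$(mktemp -d)
cd "$workdir"

# Toy pairs file (hypothetical data): readID chr1 pos1 chr2 pos2 strand1 strand2
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  r1 chr2  500 chr2  900 + - \
  r2 chr1  300 chr10 100 + + \
  r3 chr10 50  chr10 75  - + \
  r4 chr1  20  chr1  40  + - \
  r5 chr1  100 chr1  10  - - \
  > toy.pairs

# Reference result: one sort over the whole file.
sort -k2,2 -k4,4 -k3,3n -k5,5n toy.pairs > whole.sorted

# Chunked variant: write one file per chr1 value, then sort each chunk alone.
awk '{ print > ($2 ".chunk") }' toy.pairs
for f in *.chunk; do
  sort -k2,2 -k4,4 -k3,3n -k5,5n "$f" > "sorted-$f"
done

# Concatenate the sorted chunks in the order a plain sort gives the chr1 names.
cut -f2 toy.pairs | sort -u | while read -r chrom; do
  cat "sorted-${chrom}.chunk"
done > chunked.sorted

# cmp exits non-zero (and set -e aborts the script) if the two files differ.
cmp whole.sorted chunked.sorted && echo "identical"
```

On this toy input the two files match, because within a chunk chr1 is constant and the remaining keys are the same ones the whole-file sort uses. With `sort -V` the chunks would instead come out as chr1, chr2, chr10, which is a different but still chr1-grouped order.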