Sort extremely large pairs files
andyyhchen opened this issue
Andy Chen commented
hic/modules/local/hicpro/hicpro2pairs.nf
Line 23 in fe4ac65
I'm not sure whether pairix still requires a sorted pairs file, but I suggest changing the original command
sort -k2,2 -k4,4 -k3,3n -k5,5n ${prefix}_contacts.pairs | bgzip -c > ${prefix}_contacts.pairs.gz
to the following, which avoids a segmentation fault when dealing with extremely large pairs files (>50G):
##columns: readID chr1 pos1 chr2 pos2 strand1 strand2
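# Reorder the input columns to match the ##columns layout above.
awk '{OFS="\t";print \$1,\$2,\$3,\$5,\$6,\$4,\$7}' $vpairs > ${prefix}_contacts.pairs
# Split into one chunk file per chr1 value so that each sort stays small.
awk '{file=\$2 ".chunk"}{print > file}' ${prefix}_contacts.pairs
# Sort every chunk with the same keys as the original command.
for X in *.chunk; do sort -k2,2 -k4,4 -k3,3n -k5,5n < "\$X" > "sorted-\$X"; done
# Concatenate the sorted chunks in version-sorted file name order.
ls sorted-*.chunk | sort -V | xargs cat > ${prefix}_contacts.pairs.tmp
# Compress with 4 threads, then index.
bgzip -c -@ 4 ${prefix}_contacts.pairs.tmp > ${prefix}_contacts.pairs.gz
pairix -f ${prefix}_contacts.pairs.gz
# Remove the intermediate files.
rm *.chunk ${prefix}_contacts.pairs.tmp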
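
For anyone who wants to try the split, sort, and concatenate idea outside the pipeline, here is a minimal standalone sketch on a toy file. The function name `sort_pairs_by_chunk`, the temporary directories, and the toy records are all made up for illustration and are not part of nf-core/hic. It uses plain `$` instead of the `\$` escaping needed inside a Nextflow script block, and it skips the bgzip and pairix steps so that it only depends on awk and a `sort` that supports `-V`.

```shell
# Hypothetical helper (illustration only): sort a pairs file by splitting it
# into one chunk per chr1 value, sorting each chunk, and concatenating the
# sorted chunks in version-sorted file name order.
# The body runs in a subshell, so its variables do not leak to the caller.
sort_pairs_by_chunk() (
    input="$1"
    output="$2"
    workdir=$(mktemp -d)

    # One chunk file per chr1 value (column 2).
    awk -v dir="$workdir" '{ file = dir "/" $2 ".chunk"; print > file }' "$input"

    # Sort each chunk with the same keys as the original command.
    for chunk in "$workdir"/*.chunk; do
        sort -k2,2 -k4,4 -k3,3n -k5,5n "$chunk" \
            > "$workdir/sorted-$(basename "$chunk")"
    done

    # Concatenate so that chr1 < chr2 < chr10 (version order of file names).
    ls "$workdir"/sorted-*.chunk | sort -V | xargs cat > "$output"

    rm -r "$workdir"
)

# Toy input; columns: readID chr1 pos1 chr2 pos2 strand1 strand2
demo_dir=$(mktemp -d)
{
    printf 'r1\tchr2\t500\tchr2\t900\t+\t-\n'
    printf 'r2\tchr1\t300\tchr2\t100\t+\t+\n'
    printf 'r3\tchr10\t50\tchr10\t80\t-\t+\n'
    printf 'r4\tchr1\t100\tchr1\t200\t+\t-\n'
    printf 'r5\tchr1\t20\tchr2\t40\t-\t-\n'
    printf 'r6\tchr1\t100\tchr1\t150\t+\t+\n'
} > "$demo_dir/toy.pairs"

sort_pairs_by_chunk "$demo_dir/toy.pairs" "$demo_dir/toy.sorted.pairs"

# Read IDs come out in the order: r6 r4 r5 r2 r1 r3
cut -f1-5 "$demo_dir/toy.sorted.pairs"
```

Note that the chunks are concatenated in version order (chr1, chr2, chr10), which is not the same as the plain lexicographic order a single `sort -k2,2` would give (chr1, chr10, chr2), so the result will not necessarily be byte-identical to the output of the original one-step command.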