cluster_files

cluster_files clusters files into multiple directories by creating symbolic links or moving files.

It is helpful in bioinformatic analyses, where multiple datasets are analysed in a series of steps, each with one or more methods.

Features
Best practice
Special cases
Installation
Usage
Support
License

Features

Safe. Creating symbolic links keeps the original files untouched.
Convenient for parallel processing of multiple datasets with the same input file structure.
- You can use rush or parallel for local batch processing,
- and easy_qsub or easy_sbatch for submitting batch jobs to a computer cluster.
Each analysis step can be performed separately in its own directory.
- Clear organization
- No conflicts between steps
- Support for simultaneous analyses with multiple methods

Best practice

Raw data.

 $ tree data/
 data/
 ├── A-1_R1_001.fastq.gz
 ├── A-1_R2_001.fastq.gz
 ├── A-2_R1_001.fastq.gz
 ├── A-2_R2_001.fastq.gz
 └── MD5
     ├── A-1_R1_001.fastq.gz.md5
     ├── A-1_R2_001.fastq.gz.md5
     ├── A-2_R1_001.fastq.gz.md5
     └── A-2_R2_001.fastq.gz.md5

Make them read-only for safety, and keep the original file names.

 chmod -R a-w data/

Create another directory and create symbolic links.

 mkdir raw; cd raw;
 # link every FASTQ file found under data/ into raw/
 find ../data/ -name "*.fastq.gz" \
     | while read -r f; do ln -s "$f"; done
 cd ..

 $ tree raw
 raw/
 ├── A-1_R1_001.fastq.gz -> ../data/A-1_R1_001.fastq.gz
 ├── A-1_R2_001.fastq.gz -> ../data/A-1_R2_001.fastq.gz
 ├── A-2_R1_001.fastq.gz -> ../data/A-2_R1_001.fastq.gz
 └── A-2_R2_001.fastq.gz -> ../data/A-2_R2_001.fastq.gz

Rename the symbolic links with brename:

 brename -p '_R(\d)_.+' -r '_${1}.fq.gz' raw/

 $ tree raw
 raw
 ├── A-1_1.fq.gz -> ../data/A-1_R1_001.fastq.gz
 ├── A-1_2.fq.gz -> ../data/A-1_R2_001.fastq.gz
 ├── A-2_1.fq.gz -> ../data/A-2_R1_001.fastq.gz
 └── A-2_2.fq.gz -> ../data/A-2_R2_001.fastq.gz
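
To see what the rename pattern does, the substitution can be reproduced with Python's re module. This is only an illustrative sketch, not how brename is implemented: brename writes the first capture group as ${1}, whereas Python's re.sub writes it as \1. The function name short_name is made up for this example.

```python
import re

# Same pattern as in the brename command above; the replacement uses
# Python's \1 where brename uses ${1}.
pattern = re.compile(r"_R(\d)_.+")

def short_name(name):
    """Return the shortened link name for a raw FASTQ file name."""
    return pattern.sub(r"_\1.fq.gz", name)

names = [
    "A-1_R1_001.fastq.gz",
    "A-1_R2_001.fastq.gz",
    "A-2_R1_001.fastq.gz",
    "A-2_R2_001.fastq.gz",
]
renamed = [short_name(n) for n in names]
for old, new in zip(names, renamed):
    print(old, "->", new)
```

Names that do not contain the pattern are returned unchanged, so only the read files are affected.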

Cluster files.

 cluster_files -p '(.+?)_[12]\.fq\.gz$' raw/ -o raw.cluster

 $ tree raw.cluster
 raw.cluster
 ├── A-1
 │   ├── A-1_1.fq.gz -> ../../raw/A-1_1.fq.gz
 │   └── A-1_2.fq.gz -> ../../raw/A-1_2.fq.gz
 └── A-2
     ├── A-2_1.fq.gz -> ../../raw/A-2_1.fq.gz
     └── A-2_2.fq.gz -> ../../raw/A-2_2.fq.gz
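
The clustering idea can be sketched in a few lines of Python 3. This is a simplified illustration, not the actual cluster_files source: the helper name cluster_by_pattern is hypothetical, and it assumes that the pattern is searched against each file name and that the first captured group names the cluster directory.

```python
import os
import re
import tempfile

def cluster_by_pattern(indir, outdir, pattern):
    """Link files in indir into outdir/<group>/, where <group> is the first
    captured group of pattern (hypothetical helper, for illustration only)."""
    regex = re.compile(pattern)
    links = {}
    for name in sorted(os.listdir(indir)):
        m = regex.search(name)
        if not m:
            continue  # files that do not match the pattern are skipped
        subdir = os.path.join(outdir, m.group(1))
        os.makedirs(subdir, exist_ok=True)
        # A relative target stays valid if the whole project directory is moved.
        target = os.path.relpath(os.path.join(indir, name), start=subdir)
        link = os.path.join(subdir, name)
        os.symlink(target, link)
        links[os.path.relpath(link, outdir)] = target
    return links

# Demo in a throwaway directory with empty stand-in files.
root = tempfile.mkdtemp()
raw = os.path.join(root, "raw")
os.makedirs(raw)
for name in ["A-1_1.fq.gz", "A-1_2.fq.gz", "A-2_1.fq.gz", "A-2_2.fq.gz"]:
    open(os.path.join(raw, name), "w").close()

links = cluster_by_pattern(raw, os.path.join(root, "raw.cluster"),
                           r"(.+?)_[12]\.fq\.gz$")
for link, target in sorted(links.items()):
    print(link, "->", target)
```

Computing the target with os.path.relpath from the cluster directory is what gives link targets of the form ../../raw/<file>, as in the tree output above.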

Run QC with fastp, using rush for batch processing. cluster_files is not used in this step; the per-sample output directories are created with mkdir instead.

 s=raw.cluster
 t=raw.cluster.fastp

 mkdir -p $t
 ls -d $s/* | rush -v t=$t 'mkdir -p {t}/{%}'

 minlen=70
 j=8
 J=16
 minq=25
 ls -d $s/* \
     | rush -j $j -v t=$t -v l=$minlen -v j=$J -v q=$minq  \
         -v 'p={}/{%}'  -v 'op={t}/{%}/{%}' \
         '{ time fastp -i {p}_1.fq.gz -I {p}_2.fq.gz -o {op}_1.fq.gz -O {op}_2.fq.gz \
                 --unpaired1 {op}_1.unpaired.fq.gz --unpaired2 {op}_2.unpaired.fq.gz \
                 -l {l} -q {q} -W 2 -M {q} -3 {q} --thread {j} \
                 --trim_poly_g --poly_g_min_len 5 --low_complexity_filter \
                 --html {op}.fastp.html --json {op}.fastp.json ; } &> {op}.fastp.log' \
         -c -C fastp.rush --eta

 $ tree raw.cluster.fastp/
 raw.cluster.fastp/
 ├── A-1
 │   ├── A-1_1.fq.gz
 │   ├── A-1_1.unpaired.fq.gz
 │   ├── A-1_2.fq.gz
 │   ├── A-1_2.unpaired.fq.gz
 │   ├── A-1.fastp.html
 │   ├── A-1.fastp.json
 │   └── A-1.fastp.log
 └── A-2
     ├── A-2_1.fq.gz
     ├── A-2_1.unpaired.fq.gz
     ├── A-2_2.fq.gz
     ├── A-2_2.unpaired.fq.gz
     ├── A-2.fastp.html
     ├── A-2.fastp.json
     └── A-2.fastp.log

Assemble with megahit.

 s=raw.cluster.fastp
 t=raw.cluster.fastp.megahit

 # link the paired reads
 cluster_files -p '(.+)_[12].fq.gz$'          $s -o $t
 # link the unpaired reads
 cluster_files -p '(.+)_[12].unpaired.fq.gz$' $s -o $t

 $ tree raw.cluster.fastp.megahit
 raw.cluster.fastp.megahit
 ├── A-1
 │   ├── A-1_1.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.fq.gz
 │   ├── A-1_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.unpaired.fq.gz
 │   ├── A-1_2.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.fq.gz
 │   └── A-1_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.unpaired.fq.gz
 └── A-2
     ├── A-2_1.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.fq.gz
     ├── A-2_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.unpaired.fq.gz
     ├── A-2_2.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.fq.gz
     └── A-2_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.unpaired.fq.gz
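
The two runs above write into the same output directory and select disjoint sets of files. A quick check with Python's re module illustrates why. It assumes, as a simplification, that each pattern is searched against the bare file name; the helper name matched is made up for this example.

```python
import re

# The two patterns used in the commands above.
paired = re.compile(r"(.+)_[12].fq.gz$")
unpaired = re.compile(r"(.+)_[12].unpaired.fq.gz$")

# File names produced by the fastp step for sample A-1.
names = [
    "A-1_1.fq.gz",
    "A-1_1.unpaired.fq.gz",
    "A-1_2.fq.gz",
    "A-1_2.unpaired.fq.gz",
    "A-1.fastp.html",
    "A-1.fastp.json",
    "A-1.fastp.log",
]

def matched(regex):
    """Map each matching file name to its captured cluster name."""
    return {n: m.group(1) for n in names for m in [regex.search(n)] if m}

paired_hits = matched(paired)
unpaired_hits = matched(unpaired)
print(paired_hits)
print(unpaired_hits)
```

Both patterns capture the same cluster name for a sample, so the paired and unpaired reads end up side by side, while the fastp reports and logs match neither pattern. Note that the dots in these two patterns are unescaped and therefore match any character; writing \. as in the earlier raw.cluster pattern is stricter.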


 # -------------------------------------------

 conda activate megahit

 ls -d $t/* \
     | rush -j 4 -v 'p={}/{%}' \
         '{ time megahit -1 {p}_1.fq.gz -2 {p}_2.fq.gz -r {p}_1.unpaired.fq.gz,{p}_2.unpaired.fq.gz -o {}/megahit \
             --presets meta-sensitive -t 40 -m 0.4 ; } &> {}/megahit.log' \
         -c -C megahit.rush --verbose --eta

 $ tree raw.cluster.fastp.megahit
 raw.cluster.fastp.megahit
 ├── A-1
 │   ├── A-1_1.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.fq.gz
 │   ├── A-1_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.unpaired.fq.gz
 │   ├── A-1_2.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.fq.gz
 │   ├── A-1_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.unpaired.fq.gz
 │   ├── megahit
 │   │   └── files omitted
 │   └── megahit.log
 └── A-2
     ├── A-2_1.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.fq.gz
     ├── A-2_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.unpaired.fq.gz
     ├── A-2_2.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.fq.gz
     ├── A-2_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.unpaired.fq.gz
     ├── megahit
     │   └── files omitted
     └── megahit.log

Assemble with another tool.

 s=raw.cluster.fastp
 t=raw.cluster.fastp.xxxx

 # link the paired reads
 cluster_files -p '(.+)_[12].fq.gz$'          $s -o $t
 # link the unpaired reads
 cluster_files -p '(.+)_[12].unpaired.fq.gz$' $s -o $t

 # do something

All directories.

 data
 raw
 raw.cluster
 raw.cluster.fastp
 raw.cluster.fastp.megahit
 raw.cluster.fastp.xxxx

Special cases

You may need to rename some of the original data files after some analysis steps have already been performed. It's easy to batch rename files and symbolic links with brename, but the renamed symbolic links will still point to the old file names and will therefore be broken. To fix them, just re-run cluster_files (v4.1.0 or later) with the same options and files.

 $ tree t/
 t/
 ├── A_1.fq.gz
 ├── A_2.fq.gz
 ├── B_1.fq.gz
 └── B_2.fq.gz

 $ cluster_files -p '(.+)_[12].fq.gz$' t
 [INFO] create a new directory: t.cluster/B
 [INFO] create a new symbolic link: t.cluster/B/B_2.fq.gz -> ../../t/B_2.fq.gz
 [INFO] create a new symbolic link: t.cluster/B/B_1.fq.gz -> ../../t/B_1.fq.gz
 [INFO] create a new directory: t.cluster/A
 [INFO] create a new symbolic link: t.cluster/A/A_2.fq.gz -> ../../t/A_2.fq.gz
 [INFO] create a new symbolic link: t.cluster/A/A_1.fq.gz -> ../../t/A_1.fq.gz

 # ---------------------------------------------------------------------------
 # well, I have to rename B to C
 $ brename -p B -r C -R -D t t.cluster/

 $ tree t t.cluster/
 t
 ├── A_1.fq.gz
 ├── A_2.fq.gz
 ├── C_1.fq.gz
 └── C_2.fq.gz
 t.cluster/
 ├── A
 │   ├── A_1.fq.gz -> ../../t/A_1.fq.gz
 │   └── A_2.fq.gz -> ../../t/A_2.fq.gz
 └── C
     ├── C_1.fq.gz -> ../../t/B_1.fq.gz         # broken symlinks
     └── C_2.fq.gz -> ../../t/B_2.fq.gz         # broken symlinks

 # ---------------------------------------------------------------------------
 # just re-run cluster_files

 $ cluster_files -p '(.+)_[12].fq.gz$' t
 [INFO] update existed directory: t.cluster
 [INFO] directory existed: t.cluster/C
 [INFO] fix the broken symbolic link: t.cluster/C/C_2.fq.gz -> ../../t/C_2.fq.gz
 [INFO] fix the broken symbolic link: t.cluster/C/C_1.fq.gz -> ../../t/C_1.fq.gz
 [INFO] directory existed: t.cluster/A
 [INFO] update the existed symbolic link: t.cluster/A/A_2.fq.gz -> ../../t/A_2.fq.gz
 [INFO] update the existed symbolic link: t.cluster/A/A_1.fq.gz -> ../../t/A_1.fq.gz
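
A broken symbolic link is one whose target no longer exists. The following Python 3 sketch shows one way to detect such links and point them at the renamed files. It is an illustration under simplifying assumptions (the helper names broken_links and relink are hypothetical, and the link name is assumed to equal the source file name), not the logic cluster_files itself uses.

```python
import os
import tempfile

def broken_links(directory):
    """Return paths of symbolic links under directory whose target is missing."""
    found = []
    for dirpath, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                found.append(path)
    return sorted(found)

def relink(link, source_dir):
    """Re-point link at the file with the same name in source_dir."""
    target = os.path.relpath(os.path.join(source_dir, os.path.basename(link)),
                             start=os.path.dirname(link))
    os.remove(link)
    os.symlink(target, link)

# Reproduce the situation above: the link was renamed to C_1.fq.gz,
# but it still points at the old name B_1.fq.gz.
root = tempfile.mkdtemp()
src = os.path.join(root, "t")
out = os.path.join(root, "t.cluster", "C")
os.makedirs(src)
os.makedirs(out)
open(os.path.join(src, "C_1.fq.gz"), "w").close()
os.symlink("../../t/B_1.fq.gz", os.path.join(out, "C_1.fq.gz"))

before = broken_links(os.path.join(root, "t.cluster"))
for link in before:
    relink(link, src)
after = broken_links(os.path.join(root, "t.cluster"))
print(len(before), "broken before,", len(after), "broken after")
```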

Installation

cluster_files is a single script written in Python using only the standard library. It is compatible with both Python 2 and Python 3; Python 2.7 or later is required.

You can simply save the script into any directory listed in the PATH environment variable, e.g., /usr/local/bin.

git clone https://github.com/shenwei356/cluster_files.git
cd cluster_files

mkdir -p $HOME/bin; cp cluster_files $HOME/bin

# sudo cp cluster_files /usr/local/bin

Usage

usage: cluster_files [-h] [-o OUTDIR] [-p PATTERN] [-k] [-m] [-f] indir

clustering files by regular expression (v4.1.0)

positional arguments:
  indir                 source directory

options:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        out directory [<indir>.cluster]
  -p PATTERN, --pattern PATTERN
                        pattern (regular expression) of files in indir. if not given, it will be the longest common substring of the
                        files.GROUP (parenthese) should be in the regular expression. Captured group will be the cluster name. e.g.
                        "(.+?)_\d\.fq\.gz"
  -k, --keep            keep original dir structure
  -m, --mv              moving files instead of creating symbolic links
  -f, --force           Attention: force directory overwriting, i.e. deleting existed out directory

https://github.com/shenwei356/cluster_files

Support

Please open an issue to report bugs, propose new functions, or ask for help.

License

MIT License

shenwei356 / cluster_files