shenwei356 / cluster_files

cluster files into multiple directories by creating symbolic links or moving files

cluster_files

cluster_files clusters files into multiple directories by creating symbolic links or moving files.

It is helpful in bioinformatic analyses, where multiple datasets are analysed in a series of steps, each with one or more methods.

Features

  • Safe. Creating symbolic links keeps the original files untouched.
  • Convenient for parallel processing of multiple datasets with the same input file structure.
  • Each analysis step runs separately in its own directory, which gives:
    • Clear organization
    • No conflicts between steps
    • Support for simultaneous analyses with multiple methods

Best practice

  1. Raw data.

     $ tree data/
     data/
     ├── A-1_R1_001.fastq.gz
     ├── A-1_R2_001.fastq.gz
     ├── A-2_R1_001.fastq.gz
     ├── A-2_R2_001.fastq.gz
     └── MD5
         ├── A-1_R1_001.fastq.gz.md5
         ├── A-1_R2_001.fastq.gz.md5
         ├── A-2_R1_001.fastq.gz.md5
         └── A-2_R2_001.fastq.gz.md5
    

    Make them read-only for safety, and keep the original file names.

     chmod -R a-w data/
    
  2. Create another directory and create symbolic links.

     mkdir raw; cd raw;
     find ../data/ -name "*.fastq.gz" \
         | while read -r f; do ln -s "$f"; done
     cd ..
    
     $ tree raw
     raw/
     ├── A-1_R1_001.fastq.gz -> ../data/A-1_R1_001.fastq.gz
     ├── A-1_R2_001.fastq.gz -> ../data/A-1_R2_001.fastq.gz
     ├── A-2_R1_001.fastq.gz -> ../data/A-2_R1_001.fastq.gz
     └── A-2_R2_001.fastq.gz -> ../data/A-2_R2_001.fastq.gz
    

    Rename the symbolic links with brename:

     brename -p '_R(\d)_.+' -r '_${1}.fq.gz' raw/
    
     $ tree raw
     raw
     ├── A-1_1.fq.gz -> ../data/A-1_R1_001.fastq.gz
     ├── A-1_2.fq.gz -> ../data/A-1_R2_001.fastq.gz
     ├── A-2_1.fq.gz -> ../data/A-2_R1_001.fastq.gz
     └── A-2_2.fq.gz -> ../data/A-2_R2_001.fastq.gz
    
  3. Cluster files.

     cluster_files -p '(.+?)_[12]\.fq\.gz$' raw/ -o raw.cluster
    
     $ tree raw.cluster
     raw.cluster
     ├── A-1
     │   ├── A-1_1.fq.gz -> ../../raw/A-1_1.fq.gz
     │   └── A-1_2.fq.gz -> ../../raw/A-1_2.fq.gz
     └── A-2
         ├── A-2_1.fq.gz -> ../../raw/A-2_1.fq.gz
         └── A-2_2.fq.gz -> ../../raw/A-2_2.fq.gz
    
  4. QC with fastp; rush is used for batch processing. cluster_files is not needed in this step.

     s=raw.cluster
     t=raw.cluster.fastp
    
     mkdir -p $t
     ls -d $s/* | rush -v t=$t 'mkdir -p {t}/{%}'
    
     minlen=70
     j=8
     J=16
     minq=25
     ls -d $s/* \
         | rush -j $j -v t=$t -v l=$minlen -v j=$J -v q=$minq  \
             -v 'p={}/{%}'  -v 'op={t}/{%}/{%}' \
             '{ time fastp -i {p}_1.fq.gz -I {p}_2.fq.gz -o {op}_1.fq.gz -O {op}_2.fq.gz \
                     --unpaired1 {op}_1.unpaired.fq.gz --unpaired2 {op}_2.unpaired.fq.gz \
                     -l {l} -q {q} -W 2 -M {q} -3 {q} --thread {j} \
                     --trim_poly_g --poly_g_min_len 5 --low_complexity_filter \
                     --html {op}.fastp.html --json {op}.fastp.json ; } &> {op}.fastp.log' \
             -c -C fastp.rush --eta
    
     $ tree raw.cluster.fastp/
     raw.cluster.fastp/
     ├── A-1
     │   ├── A-1_1.fq.gz
     │   ├── A-1_1.unpaired.fq.gz
     │   ├── A-1_2.fq.gz
     │   ├── A-1_2.unpaired.fq.gz
     │   ├── A-1.fastp.html
     │   ├── A-1.fastp.json
     │   └── A-1.fastp.log
     └── A-2
         ├── A-2_1.fq.gz
         ├── A-2_1.unpaired.fq.gz
         ├── A-2_2.fq.gz
         ├── A-2_2.unpaired.fq.gz
         ├── A-2.fastp.html
         ├── A-2.fastp.json
         └── A-2.fastp.log
    
  5. Assemble with megahit.

     s=raw.cluster.fastp
     t=raw.cluster.fastp.megahit
    
     # link the paired reads
     cluster_files -p '(.+)_[12]\.fq\.gz$'           $s -o $t
     # link the unpaired reads
     cluster_files -p '(.+)_[12]\.unpaired\.fq\.gz$' $s -o $t
    
     $ tree raw.cluster.fastp.megahit
     raw.cluster.fastp.megahit
     ├── A-1
     │   ├── A-1_1.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.fq.gz
     │   ├── A-1_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.unpaired.fq.gz
     │   ├── A-1_2.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.fq.gz
     │   └── A-1_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.unpaired.fq.gz
     └── A-2
         ├── A-2_1.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.fq.gz
         ├── A-2_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.unpaired.fq.gz
         ├── A-2_2.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.fq.gz
         └── A-2_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.unpaired.fq.gz
    
    
     # -------------------------------------------
    
     conda activate megahit
    
     ls -d $t/* \
         | rush -j 4 -v 'p={}/{%}' \
             '{ time megahit -1 {p}_1.fq.gz -2 {p}_2.fq.gz -r {p}_1.unpaired.fq.gz,{p}_2.unpaired.fq.gz -o {}/megahit \
                 --presets meta-sensitive -t 40 -m 0.4 ; } &> {}/megahit.log' \
             -c -C megahit.rush --verbose --eta
    
     $ tree raw.cluster.fastp.megahit
     raw.cluster.fastp.megahit
     ├── A-1
     │   ├── A-1_1.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.fq.gz
     │   ├── A-1_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_1.unpaired.fq.gz
     │   ├── A-1_2.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.fq.gz
     │   ├── A-1_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-1/A-1_2.unpaired.fq.gz
     │   ├── megahit
     │   │   └── files omitted
     │   └── megahit.log
     └── A-2
         ├── A-2_1.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.fq.gz
         ├── A-2_1.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_1.unpaired.fq.gz
         ├── A-2_2.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.fq.gz
         ├── A-2_2.unpaired.fq.gz -> ../../raw.cluster.fastp/A-2/A-2_2.unpaired.fq.gz
         ├── megahit
         │   └── files omitted
         └── megahit.log
    
  6. Assemble with another tool.

     s=raw.cluster.fastp
     t=raw.cluster.fastp.xxxx
    
     # link the paired reads
     cluster_files -p '(.+)_[12]\.fq\.gz$'           $s -o $t
     # link the unpaired reads
     cluster_files -p '(.+)_[12]\.unpaired\.fq\.gz$' $s -o $t
    
     # do something
    
  7. All directories.

     data
     raw
     raw.cluster
     raw.cluster.fastp
     raw.cluster.fastp.megahit
     raw.cluster.fastp.xxxx
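
The relative links shown in the trees above follow from two things: the cluster name is whatever the first group of the pattern captures, and each link target is written relative to the directory that holds the link. The following is a minimal Python 3 sketch of that idea, not the actual cluster_files source. The function name plan_links and the use of re.search are assumptions made for illustration.

```python
import os
import re
from collections import defaultdict


def plan_links(names, indir, outdir, pattern):
    """Return {cluster: [(link_path, relative_target), ...]} without touching disk.

    Hypothetical helper: it only illustrates the directory layout, it is not
    taken from cluster_files.
    """
    regex = re.compile(pattern)
    plan = defaultdict(list)
    for name in sorted(names):
        m = regex.search(name)
        if not m:
            continue  # names that do not match the pattern are skipped
        cluster = m.group(1)  # the captured group becomes the directory name
        link_dir = os.path.join(outdir, cluster)
        link_path = os.path.join(link_dir, name)
        # The target is expressed relative to the directory holding the link.
        target = os.path.relpath(os.path.join(indir, name), start=link_dir)
        plan[cluster].append((link_path, target))
    return dict(plan)


names = ["A-1_1.fq.gz", "A-1_2.fq.gz", "A-2_1.fq.gz", "A-2_2.fq.gz"]
plan = plan_links(names, "raw", "raw.cluster", r"(.+?)_[12]\.fq\.gz$")

for cluster in sorted(plan):
    for link_path, target in plan[cluster]:
        print(link_path, "->", target)
```

Creating the links would then take one os.makedirs call per cluster directory and one os.symlink(target, link_path) call per entry. Keeping the planning step free of side effects makes it easy to preview the layout first.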
    

Special cases

  1. You may need to rename some original data files after several analysis steps have already been run. brename makes it easy to batch rename both the files and the symbolic links, but the renamed links still point to the old file names and are therefore broken. To fix them, re-run cluster_files (v4.1.0 or later) with the same options and files.

     $ tree t/
     t/
     ├── A_1.fq.gz
     ├── A_2.fq.gz
     ├── B_1.fq.gz
     └── B_2.fq.gz
    
     $ cluster_files -p '(.+)_[12].fq.gz$' t
     [INFO] create a new directory: t.cluster/B
     [INFO] create a new symbolic link: t.cluster/B/B_2.fq.gz -> ../../t/B_2.fq.gz
     [INFO] create a new symbolic link: t.cluster/B/B_1.fq.gz -> ../../t/B_1.fq.gz
     [INFO] create a new directory: t.cluster/A
     [INFO] create a new symbolic link: t.cluster/A/A_2.fq.gz -> ../../t/A_2.fq.gz
     [INFO] create a new symbolic link: t.cluster/A/A_1.fq.gz -> ../../t/A_1.fq.gz
    
     # ---------------------------------------------------------------------------
     # well, I have to rename B to C
     $ brename -p B -r C -R -D t t.cluster/
    
     $ tree t t.cluster/
     t
     ├── A_1.fq.gz
     ├── A_2.fq.gz
     ├── C_1.fq.gz
     └── C_2.fq.gz
     t.cluster/
     ├── A
     │   ├── A_1.fq.gz -> ../../t/A_1.fq.gz
     │   └── A_2.fq.gz -> ../../t/A_2.fq.gz
     └── C
         ├── C_1.fq.gz -> ../../t/B_1.fq.gz         # broken symlinks
         └── C_2.fq.gz -> ../../t/B_2.fq.gz         # broken symlinks
    
     # ---------------------------------------------------------------------------
     # just re-run cluster_files
    
     $ cluster_files -p '(.+)_[12].fq.gz$' t
     [INFO] update existed directory: t.cluster
     [INFO] directory existed: t.cluster/C
     [INFO] fix the broken symbolic link: t.cluster/C/C_2.fq.gz -> ../../t/C_2.fq.gz
     [INFO] fix the broken symbolic link: t.cluster/C/C_1.fq.gz -> ../../t/C_1.fq.gz
     [INFO] directory existed: t.cluster/A
     [INFO] update the existed symbolic link: t.cluster/A/A_2.fq.gz -> ../../t/A_2.fq.gz
     [INFO] update the existed symbolic link: t.cluster/A/A_1.fq.gz -> ../../t/A_1.fq.gz
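
A symbolic link is broken when the path stored in it no longer exists, which is the state of t.cluster/C after the rename above. As a minimal sketch (Python 3, standard library only), such links can be listed before re-running cluster_files. The helper name broken_symlinks and the temporary directory layout are hypothetical, and this is not necessarily how cluster_files itself detects them.

```python
import os
import tempfile


def broken_symlinks(root):
    """Yield (link_path, stored_target) for every dangling symlink under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)


# Rebuild a tiny version of the renamed t/ and t.cluster/ layout in a temp dir.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "t")
    out = os.path.join(tmp, "t.cluster", "C")
    os.makedirs(src)
    os.makedirs(out)
    for name in ("C_1.fq.gz", "C_2.fq.gz"):
        open(os.path.join(src, name), "w").close()
    # Stale link: still points at the old name B_1.fq.gz.
    os.symlink("../../t/B_1.fq.gz", os.path.join(out, "C_1.fq.gz"))
    # Healthy link: points at the renamed file.
    os.symlink("../../t/C_2.fq.gz", os.path.join(out, "C_2.fq.gz"))

    found = list(broken_symlinks(os.path.join(tmp, "t.cluster")))

for path, target in found:
    print(os.path.basename(path), "->", target)
```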
    

Installation

cluster_files is a single Python script that uses only the standard library. It is compatible with Python 2 and 3; Python 2.7 or later is required.

You can simply save the script into any directory listed in the PATH environment variable, e.g., /usr/local/bin.

Or clone the repository and copy the script:

git clone https://github.com/shenwei356/cluster_files.git
cd cluster_files

# install for the current user ($HOME/bin must be in PATH)
mkdir -p $HOME/bin; cp cluster_files $HOME/bin/

# or install system-wide:
# sudo cp cluster_files /usr/local/bin

Usage

usage: cluster_files [-h] [-o OUTDIR] [-p PATTERN] [-k] [-m] [-f] indir

clustering files by regular expression (v4.1.0)

positional arguments:
  indir                 source directory

options:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        out directory [<indir>.cluster]
  -p PATTERN, --pattern PATTERN
                        pattern (regular expression) of files in indir. if not given, it will be the longest common substring of the
                        files.GROUP (parenthese) should be in the regular expression. Captured group will be the cluster name. e.g.
                        "(.+?)_\d\.fq\.gz"
  -k, --keep            keep original dir structure
  -m, --mv              moving files instead of creating symbolic links
  -f, --force           Attention: force directory overwriting, i.e. deleting existed out directory

https://github.com/shenwei356/cluster_files
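
Because the captured group becomes the cluster name, it can help to try a pattern on a few file names before running the tool. The sketch below (Python 3) is an illustration under assumptions: cluster_name is a hypothetical helper, and matching with re.search is assumed rather than taken from the cluster_files source. It also shows why escaping the dots matters: an unescaped dot matches any character.

```python
import re


def cluster_name(pattern, filename):
    """Return the text captured by the first group, or None if there is no match."""
    regex = re.compile(pattern)
    if regex.groups < 1:
        raise ValueError("the pattern needs at least one capture group")
    m = regex.search(filename)
    return m.group(1) if m else None


strict = r"(.+?)_\d\.fq\.gz"  # dots escaped: only a literal "." is accepted
loose = r"(.+?)_\d.fq.gz"     # dots unescaped: any character is accepted

print(cluster_name(strict, "A-1_1.fq.gz"))  # A-1
print(cluster_name(strict, "A-1_1xfq-gz"))  # None
print(cluster_name(loose, "A-1_1xfq-gz"))   # A-1
```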

Support

Please open an issue to report bugs, propose new functions, or ask for help.

License

MIT License
