ljchueca / ENA_submission_reads

ENA_submission_reads

Code to submit individual raw data from a population genomic study to the European Nucleotide Archive (ENA). An ENA account and credentials are required.

Disclaimer: These scripts work for us on our cluster, but there may be unforeseen idiosyncratic errors!

Necessary tools

STEP 1) Register data in ENA

  • 1.1) Register a Study

  • 1.2) Register Samples

  • 1.3) After registering the samples, save the SAMPLE accession numbers and your sample ALIAS IDs in a plain text file.

    EXAMPLE

    C18_102_3 ERS15590137
    C18_102_4 ERS15590138
    C18_103_1 ERS15590139
    C18_103_5 ERS15590140
    C18_108_3 ERS15590141
    C18_108_4 ERS15590142
    C18_133_1 ERS15590143
    C18_133_2 ERS15590144
    C18_133_3 ERS15590145
    C18_019_3 ERS15590146  
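
Before building manifests, it can help to confirm that the mapping file is well formed. The following is a minimal sketch, not part of the repository's scripts: the function name check_sample_map is hypothetical, and it assumes two whitespace-separated columns with ERS-style accessions, as in the example above.

```shell
#!/bin/bash
# check_sample_map MAP_FILE  (hypothetical helper, not in the repository)
# Prints every non-blank line that is not "<alias> <ERS accession>" and
# returns a non-zero status if at least one such line was found.
# Assumption: sample accessions look like ERS followed by digits.
check_sample_map() {
    local map_file=$1
    awk '
        NF == 0 { next }                      # ignore blank lines
        NF != 2 || $2 !~ /^ERS[0-9]+$/ {      # wrong column count or accession
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$map_file"
}
```

For instance, `check_sample_map samples.txt && echo "mapping OK"`, where samples.txt is a placeholder for whatever you named the mapping file.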
    

STEP 2) Create manifest files

  • 2.1) ENA_manifest_files.sh: creates the required manifest file for each sample in the directory that contains its reads.

  • 2.2) Validate the manifest files and reads. I usually validate the submission of only one sample. Be sure to enter your username and password correctly, within quotes.

    EXAMPLE

    #!/bin/bash
    #SBATCH --job-name=ENA_validate
    #SBATCH --error %x-%j.err
    #SBATCH --output %x-%j.out
    
    #SBATCH --partition=tbg
    #SBATCH --mem=5G
    #SBATCH --cpus-per-task=1
    
    module load java/1.8.0_221
    
    java -Xms4G -jar /home/user/programs/webin-cli-6.5.0.jar \
     -context=reads \
     -manifest=C18_133_1_manifest.txt \
     -username="Webin-USER-number" \
     -password="PASSWORD" \
     -inputdir=/home/user/documents/projects/raw_data/C18_133_1 \
     -outputdir=/home/user/documents/projects/raw_data/C18_133_1 \
     -validate
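
The manifest that Webin-CLI validates above is a short plain-text file of field/value pairs. As a rough sketch of what step 2.1 produces (this is not the repository's ENA_manifest_files.sh), the function below writes one manifest per sample from the alias/accession list saved in step 1.3. The function name, the study accession passed in, the instrument and library values, and the paired FASTQ file names are all assumptions to replace with your own.

```shell
#!/bin/bash
# write_manifests MAP_FILE READS_ROOT STUDY  (hypothetical helper)
# For each "<alias> <sample accession>" line in MAP_FILE, writes
# READS_ROOT/<alias>/<alias>_manifest.txt.
# Assumptions: READS_ROOT/<alias>/ already exists and holds paired reads
# named <alias>_1.fastq.gz and <alias>_2.fastq.gz; the instrument and
# library values below are placeholders for your own sequencing setup.
write_manifests() {
    local map_file=$1 reads_root=$2 study=$3
    local sample_alias accession
    while read -r sample_alias accession || [ -n "$sample_alias" ]; do
        [ -n "$sample_alias" ] || continue    # skip blank lines
        cat > "${reads_root}/${sample_alias}/${sample_alias}_manifest.txt" <<EOF
STUDY ${study}
SAMPLE ${accession}
NAME ${sample_alias}
INSTRUMENT Illumina NovaSeq 6000
LIBRARY_SOURCE GENOMIC
LIBRARY_SELECTION RANDOM
LIBRARY_STRATEGY WGS
FASTQ ${sample_alias}_1.fastq.gz
FASTQ ${sample_alias}_2.fastq.gz
EOF
    done < "$map_file"
}
```

A call might look like `write_manifests samples.txt /home/user/documents/projects/raw_data PRJEBxxxxx`, with the study accession from step 1.1 in place of the placeholder.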
    
    

STEP 3) Submit data to ENA

  • 3.1) ENA_submission_list.sh: creates a command list with the parameters for each submission. Each command is essentially the same as in the validation step, but with the -validate flag replaced by -submit and the -ascp flag added to enable Aspera.

  • 3.2) Submit all data in parallel

    EXAMPLE

    #!/bin/bash
    
    #SBATCH --job-name=ENA_data_submission
    #SBATCH --error %x-%j.err
    #SBATCH --output %x-%j.out
    
    #SBATCH --partition=tbg
    #SBATCH --mem=4G
    #SBATCH --cpus-per-task=1
    #SBATCH --array=1-26%26
    
    module load java/1.8.0_221
    
    # Submit the files to ENA in parallel
    FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ENA_submission.list)
    bash -c "$FILE"
    
  • 3.3) ENA_save_accession_numbers.sh: extracts and saves the EXPERIMENT and RUN accession numbers assigned after submission, together with their corresponding sample names.

    EXAMPLE

    Alias           Experiment              Run
    C18_019_3       ERX10957573             ERR11553340
    C18_019_4       ERX10957565             ERR11553332
    C18_079_3       ERX10957545             ERR11553312
    C18_080_1       ERX10957562             ERR11553329
    C18_093_3       ERX10957548             ERR11553315
    C18_093_4       ERX10957570             ERR11553337
    C18_095_5       ERX10957553             ERR11553320
    C18_102_3       ERX10957557             ERR11553324
    C18_102_4       ERX10957556             ERR11553323
    C18_103_1       ERX10957566             ERR11553333
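
A table like the one above could be assembled along these lines. This is a hedged sketch rather than the repository's ENA_save_accession_numbers.sh: it assumes that Webin-CLI left a receipt.xml for each sample under <outputdir>/reads/<name>/submit/, that the manifest NAME equals the sample alias, and that the receipt mentions the assigned ERX and ERR accessions. The function name collect_accessions is hypothetical.

```shell
#!/bin/bash
# collect_accessions READS_ROOT ALIAS...  (hypothetical helper)
# Prints a tab-separated table of alias, experiment and run accessions.
# Assumption: each sample was submitted with -outputdir=READS_ROOT/<alias>,
# so its receipt is expected at the path built below; adjust the path if
# your Webin-CLI output is laid out differently.
collect_accessions() {
    local reads_root=$1
    shift
    local sample receipt experiment run
    printf 'Alias\tExperiment\tRun\n'
    for sample in "$@"; do
        receipt="${reads_root}/${sample}/reads/${sample}/submit/receipt.xml"
        if [ ! -f "$receipt" ]; then
            echo "no receipt found for ${sample}" >&2
            continue
        fi
        # Take the first ERX / ERR identifier that appears in the receipt
        experiment=$(grep -oE 'ERX[0-9]+' "$receipt" | head -n 1)
        run=$(grep -oE 'ERR[0-9]+' "$receipt" | head -n 1)
        printf '%s\t%s\t%s\n' "$sample" "$experiment" "$run"
    done
}
```

For example, `collect_accessions /home/user/documents/projects/raw_data C18_019_3 C18_019_4 > accessions.tsv`. Samples without a receipt are reported on standard error and left out of the table, which makes failed submissions easy to spot.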