NationalGenomicsInfrastructure / piper

A genomics pipeline built on top of the GATK Queue framework

Change the pipeline setup xml to some more "flexible" solution

johandahlberg opened this issue · comments

It would be good if the pipeline setup xml format was more flexible, in the sense that one could specify the exact location of the fastq-files. Probably the files should contain the full project, sample, library, flowcell hierarchy.

I'll have to think about this though.

@vezzi and @mariogiov

I've been working on this now, and everything looks fine, I'm just going to give it a test run over the weekend to make sure nothing looks funky.

Things will now work like this.

The setup runner will accept paths to fastq-files using the -i flag. These can be a mix of things in the IGN format and in the UU format, and it will extract the information from the folder structure and file names (and, in the case of the UU format, the report.xml). This is then parsed into the xml-format, which looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<project xmlns="setup.xml.molmed">
    <metadata>
        <name>NA-001</name>
        <sequenceingcenter>NGI</sequenceingcenter>
        <platform>Illumina</platform>
        <uppmaxprojectid>a2009002</uppmaxprojectid>
        <uppmaxqos></uppmaxqos>
        <reference>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/exampleFASTA.fasta</reference>
    </metadata>
    <inputs>
        <sample>
            <samplename>F15</samplename>
            <library>
                <libraryname>SX396_MA140710.1</libraryname>
                <platformunit>
                    <unitinfo>000000000-AA3LB.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_F15/F15_CCGAAGTA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_F15/F15_CCGAAGTA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
        <sample>
            <samplename>E14</samplename>
            <library>
                <libraryname>SX396_MA140710.1</libraryname>
                <platformunit>
                    <unitinfo>000000000-AA3LB.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_E14/E14_AGTCACTA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_E14/E14_AGTCACTA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
        <sample>
            <samplename>P1171_104</samplename>
            <library>
                <libraryname>A</libraryname>
                <platformunit>
                    <unitinfo>AC41A2ANXX.2</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
                <platformunit>
                    <unitinfo>AC41A2ANXX.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
        <sample>
            <samplename>P1171_102</samplename>
            <library>
                <libraryname>A</libraryname>
                <platformunit>
                    <unitinfo>AC41A2ANXX.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_102/A/140702_AC41A2ANXX/P1171_102_ATTCAGAA-CCTATCCT_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_102/A/140702_AC41A2ANXX/P1171_102_ATTCAGAA-CCTATCCT_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
    </inputs>
</project>

As you can see, this basically takes the IGN folder structure and transforms it into an xml.
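
For anyone who wants to consume this xml from a script, here is a minimal sketch in Python of how the sample, library, platform unit, and fastq hierarchy could be read back out. It is an illustration only, not part of Piper: the function name `list_fastq_files` is hypothetical, and the embedded example is a trimmed copy of the xml above with shortened paths. The element names and the `setup.xml.molmed` namespace are taken from the example.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <project> element in the example above.
NS = {"s": "setup.xml.molmed"}

# Trimmed copy of the example; directory prefixes are shortened for readability.
EXAMPLE = """<project xmlns="setup.xml.molmed">
    <inputs>
        <sample>
            <samplename>P1171_104</samplename>
            <library>
                <libraryname>A</libraryname>
                <platformunit>
                    <unitinfo>AC41A2ANXX.2</unitinfo>
                    <fastqfile>
                        <path>P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
                <platformunit>
                    <unitinfo>AC41A2ANXX.1</unitinfo>
                    <fastqfile>
                        <path>P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
    </inputs>
</project>"""


def list_fastq_files(xml_text):
    """Return (sample, library, platform unit, fastq path) tuples (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for sample in root.findall("./s:inputs/s:sample", NS):
        sample_name = sample.findtext("s:samplename", namespaces=NS)
        for library in sample.findall("s:library", NS):
            library_name = library.findtext("s:libraryname", namespaces=NS)
            for unit in library.findall("s:platformunit", NS):
                unit_info = unit.findtext("s:unitinfo", namespaces=NS)
                for path in unit.findall("s:fastqfile/s:path", NS):
                    rows.append((sample_name, library_name, unit_info, path.text))
    return rows


if __name__ == "__main__":
    for row in list_fastq_files(EXAMPLE):
        print("\t".join(row))
```

Every lookup has to be namespace-qualified because the root element declares a default namespace; an unqualified `findall("sample")` would match nothing.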

Do give me a heads up if you have any modifications to this that you want to see. Otherwise I'm going to push a release containing this sometime early next week.

Sounds perfect to me. Just to double check: you want to maintain compatibility with your current UU structure for the other projects, but for IGN we are anyway going to start from the flowcells (i.e., Illumina format).

In this way the pipeline can pick up the fastq files, transfer them, and then feed piper.

Another question... why is nestor fully packed with processes named vezzi_bwa run by joda8933? @johandahlberg, when did you sequence my genome?

For IGN I will start from the IGN folder structure, e.g:

Project
 └── Sample
     └── Library Prep
         └── Sequencing Run
             ├── P1142_101_NoIndex_L002_R1_001.fastq.gz
             └── P1142_101_NoIndex_L002_R1_001.fastq.gz
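
To make the mapping from this folder structure to the xml fields concrete, here is a rough Python sketch. It is not Piper's actual implementation: the names `FastqInfo` and `describe_fastq` are hypothetical, and the rule that the platform unit is the flowcell id (the part of the sequencing run folder after the last underscore) plus the lane number is an assumption inferred from the example xml, where `140702_AC41A2ANXX` and `L002` give `AC41A2ANXX.2`.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class FastqInfo(NamedTuple):
    """Fields recovered from one fastq path (hypothetical container)."""
    project: str
    sample: str
    library: str
    run: str
    lane: int
    read: int
    platform_unit: str


# Matches the trailing "_L002_R1_001.fastq.gz" part of the file names shown above.
FILE_PATTERN = re.compile(r"_L(?P<lane>\d{3})_R(?P<read>[12])_\d{3}\.fastq\.gz$")


def describe_fastq(path):
    """Split a Project/Sample/Library Prep/Sequencing Run/file path into its parts."""
    p = PurePosixPath(path)
    if len(p.parts) < 5:
        raise ValueError(f"expected at least 4 directory levels above the file: {path}")
    project, sample, library, run = p.parts[-5:-1]

    match = FILE_PATTERN.search(p.name)
    if match is None:
        raise ValueError(f"unexpected fastq file name: {p.name}")
    lane = int(match.group("lane"))
    read = int(match.group("read"))

    # Assumption: the flowcell id is whatever follows the last underscore in the run folder.
    flowcell = run.split("_")[-1]
    return FastqInfo(project, sample, library, run, lane, read, f"{flowcell}.{lane}")


if __name__ == "__main__":
    print(describe_fastq(
        "sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/"
        "P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz"
    ))
```

Only the last four directory levels are used, so the same function works whether the path is absolute or relative to some root.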

And since you transform your Illumina format to that, I was thinking that you should be able to transform it in the same way if we deliver the content of the unaligned folder to Uppmax.

And yes, I want to keep backwards compatibility with the Uppsala style folder structure for now.

And, you know I took your coffee cup to the washer the other day? It had your DNA on it... 😈 (The real reason is that I'm test running on data I got from you and I had to name the run something...)

Sounds great!!!
I am a bit disappointed that the genome is not mine....

This should be fixed as of "v1.2.0-beta18". If @mariogiov and @vezzi would like to check it out and say whether they have any issues with this, we can then close this issue.

Great job @johandahlberg and @mariogiov: the fact that I spent only 30 minutes to pass from one format to another shows that the implementation is really general and engine agnostic :-D