Change the pipeline setup xml to some more "flexible" solution

Question

johandahlberg opened this issue 10 years ago · comments

It would be good if the pipeline setup xml format were more flexible, in the sense that one could specify the exact location of the fastq files. The files should probably contain the full project, sample, library, flowcell hierarchy.

I'll have to think about this though.

Johan Dahlberg commented 10 years ago

👍

Johan Dahlberg · Answer 1 · Fri Aug 22 2014 22:17:24 GMT+0800 (China Standard Time)

@vezzi and @mariogiov

I've been working on this now, and everything looks fine. I'm just going to give it a test run over the weekend to make sure nothing looks funky.

Things will now work like this.

The setup runner will accept paths to fastq files using the -i flag. These can be a mix of things in the IGN format and in the UU format, and it will extract the information from the folder structure and the file names (and, in the case of the UU format, the report.xml). This is then parsed into the xml format, which looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<project xmlns="setup.xml.molmed">
    <metadata>
        <name>NA-001</name>
        <sequenceingcenter>NGI</sequenceingcenter>
        <platform>Illumina</platform>
        <uppmaxprojectid>a2009002</uppmaxprojectid>
        <uppmaxqos></uppmaxqos>
        <reference>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/exampleFASTA.fasta</reference>
    </metadata>
    <inputs>
        <sample>
            <samplename>F15</samplename>
            <library>
                <libraryname>SX396_MA140710.1</libraryname>
                <platformunit>
                    <unitinfo>000000000-AA3LB.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_F15/F15_CCGAAGTA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_F15/F15_CCGAAGTA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
        <sample>
            <samplename>E14</samplename>
            <library>
                <libraryname>SX396_MA140710.1</libraryname>
                <platformunit>
                    <unitinfo>000000000-AA3LB.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_E14/E14_AGTCACTA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/140812_M00485_0148_000000000-AA3LB/Projects/MD-0274/140812_M00485_0148_000000000-AA3LB/Sample_E14/E14_AGTCACTA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
        <sample>
            <samplename>P1171_104</samplename>
            <library>
                <libraryname>A</libraryname>
                <platformunit>
                    <unitinfo>AC41A2ANXX.2</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
                <platformunit>
                    <unitinfo>AC41A2ANXX.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
        <sample>
            <samplename>P1171_102</samplename>
            <library>
                <libraryname>A</libraryname>
                <platformunit>
                    <unitinfo>AC41A2ANXX.1</unitinfo>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_102/A/140702_AC41A2ANXX/P1171_102_ATTCAGAA-CCTATCCT_L001_R1_001.fastq.gz</path>
                    </fastqfile>
                    <fastqfile>
                        <path>/home/MOLMED/johda411/workspace/piper/src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root/P1171_102/A/140702_AC41A2ANXX/P1171_102_ATTCAGAA-CCTATCCT_L001_R2_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
    </inputs>
</project>

As you can see, this basically takes the IGN folder structure and transforms it into xml.
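For anyone who wants to inspect such a file outside of Piper, here is a minimal sketch of reading the sample, library, platform unit, fastq hierarchy back out with Python's standard library. It is not part of Piper. The function name `list_fastq_files` is made up for illustration, and the embedded XML is a trimmed copy of the example above with a shortened placeholder path.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <project xmlns="setup.xml.molmed"> root element.
NS = {"s": "setup.xml.molmed"}

# Trimmed copy of the example above; "/data/..." is a placeholder path.
EXAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<project xmlns="setup.xml.molmed">
    <metadata>
        <name>NA-001</name>
    </metadata>
    <inputs>
        <sample>
            <samplename>P1171_104</samplename>
            <library>
                <libraryname>A</libraryname>
                <platformunit>
                    <unitinfo>AC41A2ANXX.2</unitinfo>
                    <fastqfile>
                        <path>/data/P1171_104/A/140702_AC41A2ANXX/P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz</path>
                    </fastqfile>
                </platformunit>
            </library>
        </sample>
    </inputs>
</project>
"""


def list_fastq_files(xml_text):
    """Return (sample, library, platform unit, path) tuples from a setup xml string.

    Hypothetical helper for illustration only.
    """
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for sample in root.findall("s:inputs/s:sample", NS):
        sample_name = sample.findtext("s:samplename", namespaces=NS)
        for library in sample.findall("s:library", NS):
            library_name = library.findtext("s:libraryname", namespaces=NS)
            for unit in library.findall("s:platformunit", NS):
                unit_info = unit.findtext("s:unitinfo", namespaces=NS)
                for path in unit.findall("s:fastqfile/s:path", NS):
                    rows.append((sample_name, library_name, unit_info, path.text))
    return rows


if __name__ == "__main__":
    for row in list_fastq_files(EXAMPLE):
        print("\t".join(row))
```

Every element lives in the `setup.xml.molmed` namespace, so each lookup has to be prefixed; an unqualified `findall("inputs/sample")` would return nothing.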

Do give me a heads up if you have any modifications to this that you want to see. Otherwise I'm going to push a release containing this sometime early next week.

Francesco · Answer 2 · Fri Aug 22 2014 22:21:17 GMT+0800 (China Standard Time)

Sounds perfect to me. Just to double check: you want to maintain compatibility with your current UU structure for the other projects, but for IGN we are going to start from the flowcells anyway (i.e., Illumina format).

This way the pipeline can pick up the fastq files, transfer them, and then feed piper.

Francesco · Answer 3 · Fri Aug 22 2014 22:24:08 GMT+0800 (China Standard Time)

Another question... why is nestor fully packed with processes named vezzi_bwa run by joda8933? @johandahlberg, when did you sequence my genome?

Johan Dahlberg · Answer 4 · Fri Aug 22 2014 22:41:14 GMT+0800 (China Standard Time)

For IGN I will start from the IGN folder structure, e.g:

Project
 └── Sample
     └── Library Prep
         └── Sequencing Run
             ├── P1142_101_NoIndex_L002_R1_001.fastq.gz
             └── P1142_101_NoIndex_L002_R2_001.fastq.gz

And since you transform your Illumina format to that, I was thinking that you should be able to transform it in the same way if we deliver the content of the unaligned folder to Uppmax.
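As a rough illustration of how the hierarchy can be read back from such a path, here is a small Python sketch. It is not Piper's actual code. It assumes the layout Project/Sample/Library Prep/Sequencing Run/file shown above, a run folder named `<date>_<flowcell>`, and file names of the form `<sample>_<index>_L<lane>_R<read>_<chunk>.fastq.gz`. Combining flowcell and lane into a platform unit mirrors the `unitinfo` values in the example xml (e.g. `AC41A2ANXX.2`), but that rule is inferred from the example, and `describe_ign_fastq` and the `MyProject` path are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed file name pattern: <sample>_<index>_L<lane>_R<read>_<chunk>.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_(?P<index>[^_]+)_L(?P<lane>\d{3})_R(?P<read>[12])_\d{3}\.fastq\.gz$"
)


def describe_ign_fastq(path):
    """Derive project/sample/library/platform unit from an IGN-style fastq path.

    Hypothetical helper for illustration only.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"expected project/sample/library/run/file, got: {path}")
    project, sample, library, run, file_name = parts[-5:]

    match = FASTQ_RE.match(file_name)
    if match is None:
        raise ValueError(f"unrecognised fastq file name: {file_name}")

    # Assumption: the run folder looks like <date>_<flowcell>.
    _, sep, flowcell = run.partition("_")
    if not sep or not flowcell:
        raise ValueError(f"unrecognised run folder name: {run}")

    lane = int(match.group("lane"))
    return {
        "project": project,
        "sample": sample,
        "library": library,
        "platform_unit": f"{flowcell}.{lane}",
        "index": match.group("index"),
        "read": int(match.group("read")),
    }


if __name__ == "__main__":
    example = (
        "MyProject/P1171_104/A/140702_AC41A2ANXX/"
        "P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz"
    )
    print(describe_ign_fastq(example))
```

The sample name is taken from the folder rather than the file name, since sample names such as `P1171_104` contain underscores themselves and the folder level is unambiguous.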

And yes, I want to keep backwards compatibility with the Uppsala style folder structure for now.

And, you know I took your coffee cup to the washer the other day? It had your DNA on it... 😈 (The real reason is that I'm test running on data I got from you and I had to name the run something...)

Francesco · Answer 5 · Fri Aug 22 2014 22:44:03 GMT+0800 (China Standard Time)

Sounds great!!!
I am a bit disappointed that the genome is not mine....

Johan Dahlberg · Answer 6 · Thu Aug 28 2014 16:01:31 GMT+0800 (China Standard Time)

This should be fixed as of "v1.2.0-beta18". If @mariogiov and @vezzi would like to check it out and say whether they have any issues with it, we can then close this issue.

Francesco · Answer 7 · Thu Aug 28 2014 22:51:54 GMT+0800 (China Standard Time)

Great job @johandahlberg and @mariogiov: the fact that I spent only 30 minutes passing from one format to another shows that the implementation is really general and engine agnostic :-D