Clinical-Genomics / cg

Glue between Clinical Genomics apps

Is code for demultiplexing on sequencer still relevant?

diitaz93 opened this issue

Description

There is a lot of infrastructure in cg for flow cells demultiplexed on the sequencer, which is currently unused because we don't demultiplex on the sequencer. It was tried in the past, but it turned out to be too slow and had some other issues that I don't remember clearly.

Is this code still needed? Are we going to demultiplex in the sequencer at some point in the future?

EDIT: I am also wondering about some of the alternative directory structures that cg searches to find FASTQ files, which I have not seen in any flow cell recently. I mean these:

def get_sample_fastqs_from_flow_cell(
    flow_cell_directory: Path, sample_internal_id: str
) -> list[Path] | None:
    """Retrieve all fastq files for a specific sample in a flow cell directory."""

    # The flat output structure for NovaseqX flow cells demultiplexed with BCLConvert on hasta
    root_pattern = f"{sample_internal_id}_S*_L*_R*_*{FileExtensions.FASTQ}{FileExtensions.GZIP}"

    # The default structure for flow cells demultiplexed with bcl2fastq
    unaligned_pattern = (
        f"Unaligned*/Project_*/Sample_{sample_internal_id}"
        f"/*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )

    # Alternative structure for bcl2fastq flow cells whose fastq files have a trailing sequence
    unaligned_alt_pattern = (
        f"Unaligned*/Project_*/Sample_{sample_internal_id}"
        f"_*/*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )

    # The default structure for flow cells demultiplexed with bclconvert
    bcl_convert_pattern = (
        f"Unaligned*/*/{sample_internal_id}_*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )

    # The pattern for novaseqx flow cells demultiplexed on board of the dragen
    demux_on_sequencer_pattern = (
        f"BCLConvert/fastq/{sample_internal_id}"
        f"_S*_L*_R*_*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )

    for pattern in [
        root_pattern,
        unaligned_pattern,
        unaligned_alt_pattern,
        bcl_convert_pattern,
        demux_on_sequencer_pattern,
    ]:
        sample_fastqs: list[Path] = get_files_matching_pattern(
            directory=flow_cell_directory, pattern=pattern
        )
        valid_sample_fastqs: list[Path] = get_valid_sample_fastqs(
            fastq_paths=sample_fastqs, sample_internal_id=sample_internal_id
        )

        if valid_sample_fastqs:
            return valid_sample_fastqs

In the above function, some patterns are labelled as belonging to bcl2fastq output, but cg has some BCLConvert flow cell fixtures that follow those patterns. Are those fixtures wrong? (I think so)

As far as I know, we are currently using only the first one (flat BCLConvert output). Should we deprecate the other patterns, or are they still used by anything?
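One way to check whether the other patterns are still used by anything is to count how many files each pattern matches in existing flow cell directories. The following is a minimal sketch, not code from cg: `build_patterns` and `count_pattern_hits` are hypothetical names, the suffix assumes that `FileExtensions.FASTQ` and `FileExtensions.GZIP` are `.fastq` and `.gz`, and it assumes the patterns are applied with `Path.glob` relative to the flow cell directory.

```python
from pathlib import Path

# Assumption: FileExtensions.FASTQ + FileExtensions.GZIP evaluates to ".fastq.gz".
SUFFIX = ".fastq.gz"


def build_patterns(sample_internal_id: str) -> dict[str, str]:
    """Return the five glob patterns from the function above, keyed by variable name."""
    return {
        "root_pattern": f"{sample_internal_id}_S*_L*_R*_*{SUFFIX}",
        "unaligned_pattern": (
            f"Unaligned*/Project_*/Sample_{sample_internal_id}/*{SUFFIX}"
        ),
        "unaligned_alt_pattern": (
            f"Unaligned*/Project_*/Sample_{sample_internal_id}_*/*{SUFFIX}"
        ),
        "bcl_convert_pattern": f"Unaligned*/*/{sample_internal_id}_*{SUFFIX}",
        "demux_on_sequencer_pattern": (
            f"BCLConvert/fastq/{sample_internal_id}_S*_L*_R*_*{SUFFIX}"
        ),
    }


def count_pattern_hits(flow_cell_directory: Path, sample_internal_id: str) -> dict[str, int]:
    """Count how many files each pattern matches in one flow cell directory."""
    return {
        name: len(list(flow_cell_directory.glob(pattern)))
        for name, pattern in build_patterns(sample_internal_id).items()
    }
```

Running `count_pattern_hits` over a sample of recent flow cell directories and summing the counts per pattern would show which patterns never match. A pattern with zero hits everywhere is a candidate for removal, although old flow cells that could still be retrieved from backup would need to be considered separately.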

Suggested solution

Bring this to a refinement session to discuss whether we should remove that code from cg.

This can be closed when

  • A decision has been made on what to do with this code
  • The decision has been implemented

Blocked by

Refinement session

I think we should trash it 👍 It can always be found in version control; right now it just adds bloat.

Since we now have only one demux command, we can remove the patterns in this function that do not describe its output structure.

Decision

Keep the "root_pattern" and "demux_on_sequencer_pattern" patterns.