Is code for demultiplexing on sequencer still relevant?
diitaz93 opened this issue · comments
Description
There is a lot of infrastructure in cg for flow cells demultiplexed on the sequencer, which is currently unused because we don't demultiplex on the sequencer. It was tried in the past, but it turned out to be too slow and had some other issues that I don't remember clearly.
Is this code still needed? Are we going to demultiplex on the sequencer at some point in the future?
EDIT: I am also wondering about some of the alternative directory structures that cg searches to find FASTQ files, since I have not seen them in any flow cell recently. I mean these:
def get_sample_fastqs_from_flow_cell(
    flow_cell_directory: Path, sample_internal_id: str
) -> list[Path] | None:
    """Retrieve all fastq files for a specific sample in a flow cell directory."""
    # The flat output structure for NovaseqX flow cells demultiplexed with BCLConvert on hasta
    root_pattern = f"{sample_internal_id}_S*_L*_R*_*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    # The default structure for flow cells demultiplexed with bcl2fastq
    unaligned_pattern = (
        f"Unaligned*/Project_*/Sample_{sample_internal_id}"
        f"/*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )
    # Alternative structure for bcl2fastq flow cells whose fastq files have a trailing sequence
    unaligned_alt_pattern = (
        f"Unaligned*/Project_*/Sample_{sample_internal_id}"
        f"_*/*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )
    # The default structure for flow cells demultiplexed with bclconvert
    bcl_convert_pattern = (
        f"Unaligned*/*/{sample_internal_id}_*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )
    # The pattern for novaseqx flow cells demultiplexed on board of the dragen
    demux_on_sequencer_pattern = (
        f"BCLConvert/fastq/{sample_internal_id}"
        f"_S*_L*_R*_*{FileExtensions.FASTQ}{FileExtensions.GZIP}"
    )
    for pattern in [
        root_pattern,
        unaligned_pattern,
        unaligned_alt_pattern,
        bcl_convert_pattern,
        demux_on_sequencer_pattern,
    ]:
        sample_fastqs: list[Path] = get_files_matching_pattern(
            directory=flow_cell_directory, pattern=pattern
        )
        valid_sample_fastqs: list[Path] = get_valid_sample_fastqs(
            fastq_paths=sample_fastqs, sample_internal_id=sample_internal_id
        )
        if valid_sample_fastqs:
            return valid_sample_fastqs
In the above function, there are some patterns labelled as belonging to a bcl2fastq output, but in cg there are some BCLConvert flow cell fixtures with that pattern. Are those fixtures wrong? (I think so)
As far as I know, we are currently using only the first one (Flat BCLConvert output). Should we deprecate the other patterns, or are they still used by anything?
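One way to check whether the other patterns are still used by anything is to count how many files each pattern matches in existing flow cell directories. The following is only a rough sketch: it assumes `FileExtensions.FASTQ` plus `FileExtensions.GZIP` expands to `.fastq.gz` and that the patterns are evaluated as plain globs relative to the flow cell directory. `LAYOUT_PATTERNS` and `count_matches_per_pattern` are hypothetical names that do not exist in cg.

```python
from pathlib import Path

# Assumption: FileExtensions.FASTQ + FileExtensions.GZIP expands to ".fastq.gz".
FASTQ_GZ = ".fastq.gz"

# Hypothetical mapping from the pattern names in the function above to glob templates.
LAYOUT_PATTERNS: dict[str, str] = {
    "root_pattern": "{sample}_S*_L*_R*_*" + FASTQ_GZ,
    "unaligned_pattern": "Unaligned*/Project_*/Sample_{sample}/*" + FASTQ_GZ,
    "unaligned_alt_pattern": "Unaligned*/Project_*/Sample_{sample}_*/*" + FASTQ_GZ,
    "bcl_convert_pattern": "Unaligned*/*/{sample}_*" + FASTQ_GZ,
    "demux_on_sequencer_pattern": "BCLConvert/fastq/{sample}_S*_L*_R*_*" + FASTQ_GZ,
}


def count_matches_per_pattern(
    flow_cell_directory: Path, sample_internal_id: str
) -> dict[str, int]:
    """Count how many files each layout pattern matches for one sample."""
    counts: dict[str, int] = {}
    for name, template in LAYOUT_PATTERNS.items():
        pattern = template.format(sample=sample_internal_id)
        # Path.glob evaluates the pattern relative to the flow cell directory.
        counts[name] = len(list(flow_cell_directory.glob(pattern)))
    return counts
```

Running this over the demultiplexed flow cells and the test fixtures would show which layouts still occur in practice; a pattern with zero matches everywhere is a candidate for removal.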
Suggested solution
Bring this to a refinement session to discuss whether we should remove that code from cg.
This can be closed when
- A decision has been made on what to do with this code
- The decision has been implemented
Blocked by
Refinement session
I think we should trash it 👍 It can always be found in version control; it just adds bloat.
Since we now have only one demux command, we can remove from this function the patterns that do not describe its output structure.
Decision
Keep the "root_pattern" and "demux_on_sequencer_pattern" patterns.
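
For illustration, the function could look roughly like this once only the two kept patterns remain. This is a sketch under assumptions, not the actual cg implementation: `FASTQ_GZ` stands in for `FileExtensions.FASTQ` plus `FileExtensions.GZIP`, `get_files_matching_pattern` is a simplified stand-in for the cg helper of the same name, and the `get_valid_sample_fastqs` filtering step is left out because its behaviour is not shown in this issue.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: stands in for FileExtensions.FASTQ + FileExtensions.GZIP.
FASTQ_GZ = ".fastq.gz"


def get_files_matching_pattern(directory: Path, pattern: str) -> list[Path]:
    """Simplified stand-in for the cg helper; assumed to glob inside the directory."""
    return sorted(directory.glob(pattern))


def get_sample_fastqs_from_flow_cell(
    flow_cell_directory: Path, sample_internal_id: str
) -> list[Path] | None:
    """Retrieve all fastq files for a specific sample in a flow cell directory."""
    # The flat output structure for flow cells demultiplexed with BCLConvert
    root_pattern = f"{sample_internal_id}_S*_L*_R*_*{FASTQ_GZ}"
    # The structure for flow cells demultiplexed on the sequencer
    demux_on_sequencer_pattern = f"BCLConvert/fastq/{sample_internal_id}_S*_L*_R*_*{FASTQ_GZ}"
    for pattern in [root_pattern, demux_on_sequencer_pattern]:
        sample_fastqs: list[Path] = get_files_matching_pattern(
            directory=flow_cell_directory, pattern=pattern
        )
        # The original function also filters through get_valid_sample_fastqs here.
        if sample_fastqs:
            return sample_fastqs
    # Make the "nothing found" case explicit instead of falling off the end.
    return None
```

The explicit `return None` is an optional addition; the original relies on the implicit return, which matches the declared `list[Path] | None` return type but is easier to miss when reading.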