zooplankton_barcode

get_bold_seqs.R

Required packages: bold, seqinr, tidyverse, optparse

Retrieves cytochrome c oxidase subunit I (COI) sequences from BOLD (Barcode of Life Data System).

Arguments:
-i: Path to species xlsx file. Must have a "Species" column.
-g: Group to subset, e.g. Salpida.
-o: Directory to save sequences in.
-m: Max number of sequences per species to retrieve. Defaults to all.

Example use: Navigate to the zooplankton_barcode directory in a terminal, and run:
Rscript get_bold_seqs.R -i data/species.xlsx -o test_out -g Salpida -m 1

To view help, run:
Rscript get_bold_seqs.R -h

Sequences will be named SPECIES_NAME|SUBSPECIES_NAME|MARKER_GENE|SAMPLE_ID|PROCESS_ID. Individual sequences will be stored in an individual_seqs folder within the -o directory, and concatenated in a file GROUP_NAME_combined_seqs.fasta.
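
The naming scheme above can be split back into its fields when post-processing the FASTA files. The following is a minimal Python sketch, not part of get_bold_seqs.R. It assumes the five fields are separated by "|" exactly as described, and that a missing field (such as the subspecies) is left empty. The names BoldHeader and parse_header and the example header are hypothetical.

```python
from typing import NamedTuple


class BoldHeader(NamedTuple):
    """The five fields of a sequence name, in the documented order."""
    species: str
    subspecies: str
    marker: str
    sample_id: str
    process_id: str


def parse_header(header: str) -> BoldHeader:
    """Split 'SPECIES|SUBSPECIES|MARKER|SAMPLE_ID|PROCESS_ID' into fields."""
    # Drop a leading FASTA '>' and surrounding whitespace, if present.
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 5:
        raise ValueError(
            f"expected 5 '|'-separated fields, got {len(fields)}: {header!r}"
        )
    return BoldHeader(*fields)


# Hypothetical header, made up for illustration only.
example = ">Salpa fusiformis||COI-5P|SAMPLE01|PROC001-20"
parsed = parse_header(example)
print(parsed.species, parsed.marker, parsed.process_id)
```

Raising an error on an unexpected field count makes malformed names visible early instead of silently shifting fields.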

Analysis with MEGA 11

Download MEGA 11 from https://www.megasoftware.net/.

Alignment

Open a combined sequence file with "File" -> "Open a File/Session".
Select the 'GROUPNAME_combined_seqs.fn' file.
Choose "Align".
Align with "Alignment" -> "Align by ClustalW".
Choose "OK" when prompted to select all.
Leave default options, then "OK".
Once the alignment finishes, delete overhanging sequences.
Save alignment as a '.mas' file.
Close alignment window.

Choose DNA evolution model

Open the '.mas' file, choose "Analyze".
Answer "Yes" when asked whether the sequences are protein-coding.
"Models" -> "Find Best Protein/DNA Model"
The model with the lowest BIC score is the best fit. Note any "+" items in the model name, and their parameter values. For example, +G means gamma-distributed rates, with the parameter value given in the (+G) column.
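
The selection rule above (lowest BIC wins, then note the "+" items) can be sketched in Python, for example when model results have been copied out of MEGA into a script. This is only an illustration: the model names and BIC values below are placeholders, not real MEGA output, and best_model and rate_options are hypothetical helper names.

```python
import re

# Placeholder (model name, BIC) pairs for illustration only.
# These are NOT real results; substitute the values MEGA reports.
model_scores = [
    ("GTR+G+I", 10250.4),
    ("HKY+G", 10231.7),
    ("T92+G", 10228.9),
    ("JC", 10410.2),
]


def best_model(scores):
    """Return the (name, BIC) pair with the lowest BIC."""
    return min(scores, key=lambda pair: pair[1])


def rate_options(model_name):
    """List the '+' items in a model name, e.g. 'GTR+G+I' -> ['G', 'I']."""
    return re.findall(r"\+([A-Z])", model_name)


name, bic = best_model(model_scores)
print(name, bic, rate_options(name))
```

With these placeholder values the sketch picks T92+G and reports a single "+" item, G, whose parameter value would then be read from the (+G) column.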

Compute Pairwise Distance Matrix

"Distance" -> "Compute Pairwise Distances"
Use the active data. Use ~300 bootstrap replicates and the best-fit model (with gamma-distributed rates if applicable).
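
To make the idea of a pairwise distance matrix concrete, here is a minimal Python sketch of the simplest distance, the uncorrected p-distance (proportion of differing sites). It is not what MEGA computes in the step above: MEGA applies the chosen substitution model and estimates variance by bootstrap. The sketch assumes the sequences are already aligned to equal length and skips sites with a gap or ambiguous base in either sequence. The toy sequences and function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip sites with a gap or ambiguity code in either sequence.
        if a in "-N?" or b in "-N?":
            continue
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def distance_matrix(aligned):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (a, b): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }


# Toy aligned sequences, made up for illustration only.
aligned = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "AC-TTCGA",
}
for pair, dist in distance_matrix(aligned).items():
    print(pair, round(dist, 3))
```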
