Required packages: bold, seqinr, tidyverse, optparse
get_bold_seqs.R retrieves cytochrome c oxidase subunit I (COI) sequences from BOLD (Barcode of Life Data System).
Arguments:
-i: Path to species xlsx file. Must have a "Species" column.
-g: Group to subset.
-o: Directory to save sequences in.
-m: Max number of sequences per species to retrieve. Defaults to all.
Example use:
Navigate to the zooplankton_barcode directory in a terminal, and run:
Rscript get_bold_seqs.R -i data/species.xlsx -o test_out -g Salpida -m 1
To view help, run:
Rscript get_bold_seqs.R -h
Sequences will be named SPECIES_NAME|SUBSPECIES_NAME|MARKER_GENE|SAMPLE_ID|PROCESS_ID. Individual sequences will be stored in an individual_seqs folder within the -o directory, and concatenated in a file GROUP_NAME_combined_seqs.fasta.
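Before moving on to alignment, it can help to check the combined file. The following is a minimal sketch, not part of get_bold_seqs.R. It assumes headers follow the pipe-separated naming scheme described above, that species names contain spaces, and that the subspecies field may be empty. The sample records, sample IDs, and process IDs are made up for illustration.

```shell
# Hypothetical sample standing in for <-o dir>/GROUP_NAME_combined_seqs.fasta.
# All species, sample IDs, process IDs, and sequences below are invented.
fasta="$(mktemp)"
cat > "$fasta" <<'EOF'
>Salpa fusiformis||COI-5P|SAMPLE_A|PROC001-20
ACGTACGTAC
>Salpa fusiformis||COI-5P|SAMPLE_B|PROC002-20
ACGTTCGTAC
>Thalia democratica||COI-5P|SAMPLE_C|PROC003-20
ACGAACGTAC
EOF

# Every FASTA record starts with a '>' header line, so counting
# those lines gives the number of sequences.
n_seqs=$(grep -c '^>' "$fasta")
echo "sequences: $n_seqs"

# Field 1 of the '|'-separated header is the species name.
species=$(grep '^>' "$fasta" | sed 's/^>//' | cut -d'|' -f1)

# Sequences per species, useful for checking the effect of -m.
printf '%s\n' "$species" | sort | uniq -c

# Number of distinct species.
n_species=$(printf '%s\n' "$species" | sort -u | grep -c .)
echo "species: $n_species"

# Field 5 is the BOLD process ID.
process_ids=$(grep '^>' "$fasta" | cut -d'|' -f5)
printf '%s\n' "$process_ids"
```

To run this against real output, point fasta at the combined file in the -o directory instead of the temporary sample.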
Download MEGA 11 from https://www.megasoftware.net/.
- Open a combined sequence file with "File" -> "Open a File/Session".
- Select the 'GROUPNAME_combined_seqs.fn' file.
- Choose "Align"
- Align with "Alignment" -> "Align by ClustalW"
- Choose "OK" when prompted to select all.
- Leave default options, then "OK".
- Once the alignment finishes, delete overhanging sequences.
- Save alignment as a '.mas' file.
- Close alignment window.
- Open the '.mas' file, choose "Analyze".
- Answer "Yes" when asked whether the data are protein coding.
- "Models" -> "Find Best Protein/DNA Model"
- The model with the lowest BIC score is the best fit. Note any "+" items in its name and their parameter values. E.g. +G means gamma-distributed rates among sites, with the gamma parameter value given in the (+G) column.
- "Distance" -> "Compute Pairwise Distances"
- Use the active data, ~300 bootstrap replicates, and the best-fit model (with gamma-distributed rates if applicable).
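The model-selection rule above (lowest BIC wins) can also be applied outside MEGA. This is a minimal sketch that assumes the model table has been saved as a CSV with hypothetical columns Model, BIC, and G (the +G parameter). The file layout and every value in it are invented for illustration and are not MEGA output.

```shell
# Hypothetical model table; column names and numbers are made up.
models_csv="$(mktemp)"
cat > "$models_csv" <<'EOF'
Model,BIC,G
GTR+G+I,10234.5,0.45
T92+G,10198.2,0.52
HKY,10310.9,n/a
EOF

# Skip the header row, sort numerically on the BIC column (field 2),
# and keep the first row: the model with the lowest BIC.
# LC_ALL=C keeps '.' as the decimal separator regardless of locale.
best=$(tail -n +2 "$models_csv" | LC_ALL=C sort -t',' -k2,2n | head -n 1)

best_model=$(printf '%s\n' "$best" | cut -d',' -f1)
best_bic=$(printf '%s\n' "$best" | cut -d',' -f2)
best_gamma=$(printf '%s\n' "$best" | cut -d',' -f3)

echo "best model: $best_model (BIC $best_bic)"

# A '+G' in the model name means the gamma parameter should be
# carried over to the pairwise distance computation.
case "$best_model" in
  *+G*) echo "gamma parameter: $best_gamma" ;;
  *)    echo "no gamma-distributed rates" ;;
esac
```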