Supplementary Material S1 - Eucalypt bait kit - full methods for designing the bait kit and FASTA file of genes in kit to use as targets for recovery.

Name: Supplementary Material S1 - Eucalypt bait kit - full methods for designing the bait kit and FASTA file of genes in kit to use as targets for recovery.
Published: 2023-06-15T03:30:12+00:00
License: https://creativecommons.org/licenses/by/4.0/
Keywords: eucalypt bait kit target file

dataset

posted on 2023-06-15, 03:30 authored by Todd McLay

To combine the A353 (Angiosperms353), OzBaits, and Myrtaceae target capture bait kits and design a eucalypt-specific kit, the target sequence files for each kit were used as inputs to recover the genes from the Eucalyptus grandis and Corymbia citriodora genomes using the tool BYO_transcriptomes.py (https://github.com/chrisjackson-pellicle/NewTargets/blob/master/BYO_transcriptome.py; McLay et al. 2021). This pipeline performs a hidden Markov model (HMM) search using a set of genes (in this case, the nucleotide sequences for each bait kit) and extracts the matching sequences from input transcriptomes or proteomes. An HMM e-value of 1e-10 was used to identify the target genes in the reference proteomes. For the Angiosperms353 kit, we first expanded the file of target genes to include all 1000 Plants (1KP) Myrtaceae transcriptomes with the NewTargets pipeline (https://github.com/chrisjackson-pellicle/NewTargets; McLay et al. 2021); all non-Myrtaceae transcriptomes were then removed from the target file.
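The e-value filtering step can be illustrated with a minimal sketch. This is not the code of BYO_transcriptomes.py; the `Hit` record, the `best_hits` function, the example hits, and the choice to keep the single lowest e-value hit per target gene are all assumptions made for illustration. Only the 1e-10 cutoff comes from the methods above.

```python
from typing import NamedTuple

E_VALUE_CUTOFF = 1e-10  # threshold stated in the methods


class Hit(NamedTuple):
    """Hypothetical record for one HMM search hit."""
    target_gene: str   # gene in the bait-kit target file
    sequence_id: str   # matching sequence in the reference proteome
    e_value: float


def best_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Drop hits above the cutoff, then keep the lowest e-value hit per gene.

    Keeping one hit per gene is an assumption of this sketch.
    """
    best = {}
    for hit in hits:
        if hit.e_value > cutoff:
            continue  # not significant enough at the chosen threshold
        current = best.get(hit.target_gene)
        if current is None or hit.e_value < current.e_value:
            best[hit.target_gene] = hit
    return best


# Hypothetical hits, for illustration only.
example_hits = [
    Hit("gene_A", "seq_1", 1e-50),
    Hit("gene_A", "seq_2", 1e-12),
    Hit("gene_B", "seq_3", 1e-5),   # fails the 1e-10 cutoff
]
selected = best_hits(example_hits)
```

In this example only gene_A is retained, represented by its lowest e-value hit.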

The following steps were performed in Geneious Prime v2022.2.2 (Kearse et al. 2012). Local BLAST searches (Altschul et al. 1990) were used to identify overlapping genes between the three kits. Genes that were represented two or more times were visually confirmed to be the same gene, and then one version was kept (preferably the longest if there was a length difference). To compare the target loci between the Eucalyptus and Corymbia genomes, we aligned the two sequences for each locus, calculated pairwise identity for each alignment, and then manually inspected the most divergent loci. Because we would expect the Eucalyptus and Corymbia sequences to be similar, this comparison was used to check for misassignments or misassemblies in the reference proteomes. Very divergent locus alignments were manually fixed by removing poorly matched regions (e.g. where introns were not removed from the proteome) or, in some cases, by removing either the Eucalyptus or the Corymbia sequence entirely if it appeared to be a poor match both to the other reference genome and to the original target sequence. Some genes were flagged by BYO_transcriptomes.py as being out of frame, and these were also manually checked. In most cases, these included regions that appeared to be introns or untranslated/non-coding regions, and these sections were removed.
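The pairwise identity check was done in Geneious Prime; the sketch below only illustrates the idea outside that software. The identity definition used here (identical columns divided by all columns that are not gap-versus-gap), the function names, and the example alignments are assumptions and may differ from what Geneious reports.

```python
def pairwise_identity(seq_a, seq_b):
    """Percent identity between two aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; a gap opposite
    a residue counts as a mismatch (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def most_divergent(alignments, n=10):
    """Return up to n (locus, identity) pairs, lowest identity first."""
    scores = {
        locus: pairwise_identity(a, b) for locus, (a, b) in alignments.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])[:n]


# Hypothetical two-sequence alignments, one per locus.
alignments = {
    "locus_1": ("ATGGCCAAGT", "ATGGCCAAGT"),
    "locus_2": ("ATGGCCAAGT", "ATG--CTAGT"),
}
ranked = most_divergent(alignments)
```

Sorting from lowest to highest identity puts the loci most in need of manual inspection at the top of the list; here locus_2 comes first.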

The FASTA file contains all 568 gene target sequences for the eucalypt bait kit.
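Before using the FASTA file as a target file, it can be useful to confirm that it parses cleanly and holds the expected number of sequences. The following is a minimal sketch with a hand-written parser; the function name and the example records are hypothetical, and treating 568 as the number of FASTA records (rather than, say, the number of distinct genes) is an assumption.

```python
import io

EXPECTED_TARGETS = 568  # number of gene target sequences stated above


def read_fasta(handle):
    """Parse FASTA text into a dict of {sequence name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            name = fields[0]  # name is the first word of the header
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record example standing in for the real file.
example = io.StringIO(">gene_A\nATGGCC\nAAGT\n>gene_B\nATGTTT\n")
targets = read_fasta(example)
```

With the downloaded file, the same function could be called on an open file handle and `len(targets)` compared against `EXPECTED_TARGETS`.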
