Nemabiome ITS Database

Posted on 2024-12-12, 05:38. Authored by Neil Young, Lucas Huggins

18S, ITS1 and ITS2, 28S Full Nematode Database:

Building a NanoCLUST Db for Parasitic Nematodes using 18S rRNA, 28S rRNA, ITS1, 5.8S and ITS2.

The NCBI Taxonomy ID for Nematoda is 6231, so the search is restricted with txid6231[Organism].

Key words:

18S ribosomal RNA

18S rRNA

18S

28S ribosomal RNA

28S rRNA

28S

5.8S ribosomal RNA

5.8S rRNA

5.8S

Ribosomal RNA

SSU rRNA

LSU rRNA

SSU ribosomal RNA

LSU ribosomal RNA

Internal transcribed spacer

Internal transcribed spacer 1

Internal transcribed spacer 2

ITS

ITS1

ITS2

Final NCBI GenBank search term:

(((((((((((((((((((((((((18S ribosomal RNA[Title]) OR 18S rRNA[Title]) OR 18S[Title]) OR 28S ribosomal RNA[Title]) OR 28S rRNA[Title]) OR 28S[Title]) OR 5.8S ribosomal RNA[Title]) OR 5.8S rRNA[Title]) OR 5.8S[Title]) OR ribosomal RNA[Title]) OR SSU rRNA[Title]) OR LSU rRNA[Title]) OR SSU ribosomal RNA[Title]) OR LSU ribosomal RNA[Title]) OR Internal transcribed spacer[Title]) OR Internal transcribed spacer 1[Title]) OR Internal transcribed spacer 2[Title]) OR ITS[Title]) OR ITS1[Title]) OR ITS2[Title]) AND txid6231[Organism])) AND 200:10000[Sequence Length])) AND nuccore pubmed[Filter]) NOT unverified[Keyword]
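The search term above is the keyword list joined with OR, each keyword restricted to the [Title] field, followed by the organism, length, and filter clauses. As a rough sketch, an equivalent flat query could be assembled from a keyword file as shown here. The file name keywords.txt and the function name build_query are hypothetical, and the flat form assumes that Entrez evaluates operators at the same level from left to right, which the nested parentheses in the original term spell out explicitly.

```shell
# Hypothetical helper: join one keyword per line into a [Title] OR-group,
# then append the organism, length, and filter clauses from the search term.
build_query() {
    # $1: file with one keyword per line (blank lines are skipped)
    awk 'NF { terms = terms (terms == "" ? "" : " OR ") $0 "[Title]" }
         END {
             printf "(%s) AND txid6231[Organism] AND 200:10000[Sequence Length]", terms
             print " AND nuccore pubmed[Filter] NOT unverified[Keyword]"
         }' "$1"
}

# Example with three of the keywords listed above
printf '%s\n' '18S rRNA' 'ITS1' 'ITS2' > keywords.txt
build_query keywords.txt
```

Keeping the keywords in a file makes it easier to add or remove a term without rebalancing the parentheses by hand.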

The search results were downloaded as a fasta file.

Next, a list of clade III and V parasitic nematodes (i.e. the ascarids, ancylostomatids, etc.) was obtained, and these were downloaded as a fasta file.

Next, the sequence titles in this fasta file were changed from 'sham' titles (e.g. "Unidentified nematode 18S ribosomal RNA, partial sequence") to nondescript accession numbers:

# Simplify the headers of your database fasta file

$ awk '{if($0~/^>/){print $1} else {print $0}}' Nemabiome_rRNA_fasta_v5_sequences.fasta > Nematoda_rRNA-ITS-5.8S_v5_30.04.24.fasta

# Make a text file of all the accession numbers in the database fasta file

$ awk '{if ($1~/^>/) print substr($1,2)}' Nematoda_rRNA-ITS-5.8S_v5_30.04.24.fasta > Nematoda_rRNA_v5_accession_ids.txt
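To illustrate what the two awk commands above do, here is a small sketch on a toy file. The accessions XX000001.1 and XX000002.1, the sequences, and the toy file names are made up for illustration and are not part of the database.

```shell
# Toy input: two records with full titles (accessions and sequences are made up)
cat > toy.fasta <<'EOF'
>XX000001.1 Unidentified nematode 18S ribosomal RNA, partial sequence
ACGTACGTAC
GTACGT
>XX000002.1 Unidentified nematode internal transcribed spacer 2
TTGACC
EOF

# Step 1: on header lines keep only the first whitespace-delimited field;
# sequence lines pass through unchanged
awk '{if($0~/^>/){print $1} else {print $0}}' toy.fasta > toy_simple.fasta

# Step 2: list the accessions, dropping the leading ">"
awk '{if ($1~/^>/) print substr($1,2)}' toy_simple.fasta > toy_ids.txt

cat toy_simple.fasta
cat toy_ids.txt
```

After step 1 each header is reduced to ">" plus the accession, and step 2 leaves one accession per line, which is the form matched against the second column of nucl_gb.accession2taxid in the next step.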

# Create a mapping table of each accession to its taxon ID - takes about 10 minutes as it has to read each of the 300 million lines of nucl_gb.accession2taxid

$ awk -F"\t" 'BEGIN{while((getline<"Nematoda_rRNA_v5_accession_ids.txt")>0) hash[$1]=1} {if ($2 in hash) print $2,$3}' nucl_gb.accession2taxid > Nematode_rRNA_v5_tax_map.txt
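Any accession that is absent from nucl_gb.accession2taxid gets no row in the map, and the mapping command gives no warning about it. A minimal sketch of a check follows; the function name missing_from_map and the toy file names and accessions are hypothetical.

```shell
# Hypothetical helper: print accessions in the ID list that have no row in the map.
missing_from_map() {
    # $1: accession list (one per line); $2: tax map (accession, then taxid)
    awk -v map="$2" '
        BEGIN {
            # Load the first column of the map into a lookup table
            while ((getline line < map) > 0) {
                split(line, f, " ")
                seen[f[1]] = 1
            }
        }
        NF && !($1 in seen)   # print non-blank IDs that were never seen
    ' "$1"
}

# Toy example: three accessions, one of which has no taxid row
printf '%s\n' 'XX000001.1' 'XX000002.1' 'XX000003.1' > toy_ids.txt
printf '%s\n' 'XX000001.1 6231' 'XX000003.1 6231' > toy_tax_map.txt
missing_from_map toy_ids.txt toy_tax_map.txt
```

On the toy files this prints XX000002.1. On the real files the call would be missing_from_map Nematoda_rRNA_v5_accession_ids.txt Nematode_rRNA_v5_tax_map.txt, and an empty result means every accession received a taxon ID.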

# Make the BLAST database from the database fasta file, for example:

$ makeblastdb -in Nematoda_rRNA-ITS-5.8S_v5_30.04.24.fasta -parse_seqids -blastdb_version 5 -taxid_map Nematode_rRNA_v5_tax_map.txt -title "Nemabiome_rRNA database_v5" -out Nemabiome_rRNA_v5_db -dbtype nucl

This produces 10 database files, for example Nemabiome_rRNA_v5_db.ndb, Nemabiome_rRNA_v5_db.nhr and Nemabiome_rRNA_v5_db.nin.

These can be used by NanoCLUST, e.g. in the command:

nextflow run main.nf -profile docker --reads '/home/Public/Ps1/Lucas_Workspace/MinION_Nemabiome_ECR_Project_SUPdata/100_Sample_Comparison_96-well_Trial-4/pass/barcodes01-96/ITS-reads-amended-filtered/barcode30.tmp.inverse.pblat.fix.fastq-filt.fastq.gz' --db "db/Nemabiome_rRNA_v5_db" --tax "db" --min_read_length 700 --max_read_length 1800 --min_cluster_size 100 --polishing_reads 100 --cluster_sel_epsilon 1 --max_memory '84.GB' --max_cpus 12 --outdir ./Nemabiome_trial
