The University of Melbourne
Nemabiome ITS Database

dataset
posted on 2024-12-12, 05:38, authored by Neil Young, Lucas Huggins

18S, ITS1 and ITS2, 28S Full Nematode Database:

Building a NanoCLUST Db for Parasitic Nematodes using 18S rRNA, 28S rRNA, ITS1, 5.8S and ITS2.

Nematoda NCBI Taxonomy ID: 6231 (hence the search must use txid6231).


Key words:

18S ribosomal RNA

18S rRNA

18S

28S ribosomal RNA

28S rRNA

28S

5.8S ribosomal RNA

5.8S rRNA

5.8S

Ribosomal RNA

SSU rRNA

LSU rRNA

SSU ribosomal RNA

LSU ribosomal RNA

Internal transcribed spacer

Internal transcribed spacer 1

Internal transcribed spacer 2

ITS

ITS1

ITS2


Final NCBI GenBank search term:

(((((((((((((((((((((((((18S ribosomal RNA[Title]) OR 18S rRNA[Title]) OR 18S[Title]) OR 28S ribosomal RNA[Title]) OR 28S rRNA[Title]) OR 28S[Title]) OR 5.8S ribosomal RNA[Title]) OR 5.8S rRNA[Title]) OR 5.8S[Title]) OR ribosomal RNA[Title]) OR SSU rRNA[Title]) OR LSU rRNA[Title]) OR SSU ribosomal RNA[Title]) OR LSU ribosomal RNA[Title]) OR Internal transcribed spacer[Title]) OR Internal transcribed spacer 1[Title]) OR Internal transcribed spacer 2[Title]) OR ITS[Title]) OR ITS1[Title]) OR ITS2[Title]) AND txid6231[Organism])) AND 200:10000[Sequence Length])) AND nuccore pubmed[Filter]) NOT unverified[Keyword]
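The search term above is the keyword list joined with OR on the Title field, followed by the organism, sequence-length and filter restrictions. A minimal shell sketch of assembling such a term from a keyword list follows. `build_query` is a hypothetical helper name, not part of any NCBI tool, and the flattened form assumes Entrez evaluates Boolean operators left to right, as the nesting of the parentheses in the original term implies.

```shell
#!/usr/bin/env bash
# Sketch only: build_query is a hypothetical helper, not an NCBI command.
# It joins each keyword as "<keyword>[Title]" with OR, then appends the
# organism, length and filter restrictions used in the search term above.
build_query() {
  local joined="" kw
  for kw in "$@"; do
    if [ -z "$joined" ]; then
      joined="${kw}[Title]"
    else
      joined="${joined} OR ${kw}[Title]"
    fi
  done
  printf '(%s) AND txid6231[Organism] AND 200:10000[Sequence Length] AND nuccore pubmed[Filter] NOT unverified[Keyword]\n' "$joined"
}

# Example with three of the keywords from the list
build_query "18S ribosomal RNA" "18S rRNA" "ITS2"
```

The printed string can be pasted into the NCBI Nucleotide search box; passing the full keyword list gives a term that should be equivalent to the one above, without the redundant parentheses.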


The search results were downloaded as a fasta file.


Next, a list of clade III and V parasitic nematodes (i.e. the ascarids, ancylostomatids, etc.) was obtained, and their sequences were downloaded as a fasta file.

Next, the sequence titles in this fasta file were changed to ‘sham’ titles with nondescript accession numbers, e.g. Unidentified nematode 18S ribosomal RNA, partial sequence.
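The record does not say how the titles were changed. One possible way, given purely as a sketch, is to keep the first token of each header and replace the rest with a fixed sham description. `retitle_fasta` is a hypothetical helper, and the accession `AB123456.1` in the demonstration is a made-up placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: retitle_fasta is a hypothetical helper and may not match
# how the authors actually changed the titles.
retitle_fasta() {
  awk -v title="Unidentified nematode 18S ribosomal RNA, partial sequence" '
    /^>/ { print $1 " " title; next }   # keep the first header token, swap the description
    { print }                            # sequence lines pass through unchanged
  ' "$1"
}

# Demonstration on a one-record fasta file (placeholder accession)
sample="$(mktemp)"
printf '>AB123456.1 Example species genomic DNA\nACGTACGT\n' > "$sample"
retitle_fasta "$sample"
rm -f "$sample"
```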


# Simplify the headers of your database fasta file

$ awk '{if($0~/^>/){print $1} else {print $0}}' Nemabiome_rRNA_fasta_v5_sequences.fasta > Nematoda_rRNA-ITS-5.8S_v5_30.04.24.fasta

# Make a text file of all the accession numbers in the database fasta file

$ awk '{if ($1~/^>/) print substr($1,2)}' Nematoda_rRNA-ITS-5.8S_v5_30.04.24.fasta > Nematoda_rRNA_v5_accession_ids.txt
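Because the database combines more than one download, the same accession may appear twice, and to my knowledge makeblastdb rejects duplicate sequence IDs when -parse_seqids is set. A small check on the accession list, as a sketch (`find_duplicate_ids` is a hypothetical helper name and the IDs in the demonstration are placeholders):

```shell
#!/usr/bin/env bash
# Sketch only: find_duplicate_ids is a hypothetical helper.
# Prints each accession that occurs more than once in the list file.
find_duplicate_ids() {
  sort "$1" | uniq -d
}

# Demonstration with placeholder IDs; on the real data the argument
# would be Nematoda_rRNA_v5_accession_ids.txt
ids="$(mktemp)"
printf 'ACC0001.1\nACC0002.1\nACC0001.1\n' > "$ids"
find_duplicate_ids "$ids"
rm -f "$ids"
```

No output means every accession in the list is unique.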

# Create a mapping table of each accession to its taxon ID. This takes about 10 minutes, as it has to read each of the 300 million lines of nucl_gb.accession2taxid

$ awk -F"\t" 'BEGIN{while((getline < "Nematoda_rRNA_v5_accession_ids.txt") > 0) hash[$1]=1} {if ($2 in hash) print $2,$3}' nucl_gb.accession2taxid > Nematode_rRNA_v5_tax_map.txt
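Accessions that are absent from nucl_gb.accession2taxid are silently left out of the mapping table, so it can be worth listing them before building the database. A sketch of such a check (`find_unmapped_ids` is a hypothetical helper name; the IDs in the demonstration are placeholders, with 6231 reused as the taxon ID):

```shell
#!/usr/bin/env bash
# Sketch only: find_unmapped_ids is a hypothetical helper.
# $1 = accession list, $2 = mapping table ("accession taxid" per line).
# Prints every accession from the list that has no entry in the table.
find_unmapped_ids() {
  awk 'FILENAME == ARGV[1] { mapped[$1] = 1; next }   # first file: the mapping table
       !($1 in mapped)                                 # second file: the accession list
      ' "$2" "$1"
}

# Demonstration with placeholder data; on the real data the arguments would be
# Nematoda_rRNA_v5_accession_ids.txt and Nematode_rRNA_v5_tax_map.txt
ids="$(mktemp)"; map="$(mktemp)"
printf 'ACC0001.1\nACC0002.1\nACC0003.1\n' > "$ids"
printf 'ACC0001.1 6231\nACC0003.1 6231\n' > "$map"
find_unmapped_ids "$ids" "$map"
rm -f "$ids" "$map"
```

The comparison uses FILENAME rather than the common NR==FNR idiom so that an empty mapping table still reports every accession as unmapped.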

# Make the BLAST database from the database fasta file, for example:

$ makeblastdb -in Nematoda_rRNA-ITS-5.8S_v5_30.04.24.fasta -parse_seqids -blastdb_version 5 -taxid_map Nematode_rRNA_v5_tax_map.txt -title "Nemabiome_rRNA database_v5" -out Nemabiome_rRNA_v5_db -dbtype nucl


This produces 10 final database files, for example Nemabiome_rRNA_v5_db.ndb, Nemabiome_rRNA_v5_db.nhr and Nemabiome_rRNA_v5_db.nin.


These can be used by NanoCLUST, e.g. in the command:


nextflow run main.nf -profile docker --reads '/home/Public/Ps1/Lucas_Workspace/MinION_Nemabiome_ECR_Project_SUPdata/100_Sample_Comparison_96-well_Trial-4/pass/barcodes01-96/ITS-reads-amended-filtered/barcode30.tmp.inverse.pblat.fix.fastq-filt.fastq.gz' --db "db/Nemabiome_rRNA_v5_db" --tax "db" --min_read_length 700 --max_read_length 1800 --min_cluster_size 100 --polishing_reads 100 --cluster_sel_epsilon 1 --max_memory '84.GB' --max_cpus 12 --outdir ./Nemabiome_trial
