NOTE: This is a temporary GitHub repository created for testing purposes.
The barcodeMineR package
This package lets you query multiple taxonomic names on the NCBI and BOLD repositories and retrieve DNA barcodes and associated metadata for any marker of interest. It relies heavily on the bold and rentrez packages from rOpenSci, and uses the asynchronous framework of the future package to speed up data retrieval from the NCBI while respecting its API request rate limit.
The final output is a data frame formatted to meet the requirements of the refdb package. It is cleaned of mining duplicates, of differences between BOLD and NCBI metadata formats (e.g. geographic coordinates, dates), and of common issues that arise from downloading and merging data from the two repositories.
In short, it provides a unified framework for mining DNA barcodes from the main online genomic repositories, delivering clean, metadata-rich sequences either programmatically or interactively, depending on the intended use.
Installation
As the package is still in development, you can install the current version of barcodeMineR directly from the GitHub repository using the BiocManager package:
BiocManager::install("MatteoCe/barcodeMineR")
Basic usage
Basic use of the package involves only two steps, which must be run in order:
library(barcodeMineR)
# check if a species is on the NCBI nucleotide database:
tax <- get_ncbi_taxonomy("Dissostichus mawsoni")
# download all records of this species from the database:
rec <- download_ncbi(tax, ask = FALSE)
# display output:
rec
#> # A tibble: 189 × 30
#> recordID markerCode DNA_seq phylum class order family genus species source
#> <chr> <chr> <DNA> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 HM422302.1 COI CTCTACT… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 2 ON000293.1 COX1 GCGCCTG… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 3 MK843765.1 COI TCTCTAC… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 4 DQ498816.1 Cytb GCCACCC… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 5 DQ498794.1 Rhod GCCTACA… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 6 MK500763.1 enc1 TCTGACG… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 7 MG729451.1 COI GAACTTA… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 8 KY656477.1 COI GCCGGAA… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 9 LC138011.1 ND1 ATGCTTT… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> 10 LC138011.1 ND2 ATGAGCC… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI
#> # ℹ 179 more rows
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> # QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> # sourceID <chr>, NCBI_ID <chr>, institutionStoring <lgl>,
#> # collected_by <chr>, collection_date <chr>, altitude <lgl>, depth <lgl>,
#> # country <chr>, directionPrimers <chr>, lengthSource <int>,
#> # PCR_primers <chr>, note <chr>
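The output above mixes several markers (COI, Cytb, Rhod, ...), and the same gene can appear under more than one label (here COI and COX1). As a minimal sketch of how the result could be narrowed down to one marker, the snippet below uses base R on a small mock data frame, `rec_example`, that copies only the `recordID` and `markerCode` values of the first five rows shown above. The names `rec_example`, `coi_labels`, `coi_records` and `marker_counts` are illustrative and not part of barcodeMineR; with a real result, the same subsetting would be applied to `rec` directly.

```r
# Mock of the first rows of `rec` shown above (two of the 30 columns only);
# with a real result, skip this step and use `rec` instead.
rec_example <- data.frame(
  recordID   = c("HM422302.1", "ON000293.1", "MK843765.1",
                 "DQ498816.1", "DQ498794.1"),
  markerCode = c("COI", "COX1", "COI", "Cytb", "Rhod"),
  stringsAsFactors = FALSE
)

# Assumption: "COI" and "COX1" are treated as labels for the same gene
coi_labels <- c("COI", "COX1")

# Keep only the rows whose marker matches one of the chosen labels
coi_records <- rec_example[rec_example$markerCode %in% coi_labels, ]

# Count records per marker label to see what else was retrieved
marker_counts <- table(rec_example$markerCode)
```

Checking `marker_counts` before filtering is a cheap way to spot alternative spellings of the marker you are interested in.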
Take a look at the introductory vignette for a more complete tutorial!
After that, proceed with the ‘Speeding up’ vignette to learn how to increase the speed and reliability of the package functions.
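For readers unfamiliar with the future package mentioned at the top of this README, the following is a rough sketch of how a parallel plan is typically set in an R session. It assumes that barcodeMineR functions pick up whatever plan is active, which this README does not confirm; the ‘Speeding up’ vignette remains the reference for the recommended setup. The example itself uses only the future package and does not call barcodeMineR.

```r
library(future)

# Assumption: barcodeMineR honours the plan that is active in the session.
# Start two background R sessions to evaluate futures in parallel.
plan(multisession, workers = 2)

# Plain future example, independent of barcodeMineR: the expression is
# evaluated in a background session while the main session stays free.
f <- future(sum(1:10))
result <- value(f)  # blocks until the background evaluation has finished

# Return to sequential evaluation when done
plan(sequential)
```

Resetting the plan to `sequential` at the end closes the background sessions, which avoids leaving idle workers around after the downloads are finished.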