Easy Mining of DNA Barcodes from the NCBI and BOLD Repositories • barcodeMineR

NOTE: This is a temporary GitHub repository created for testing purposes.

The barcodeMineR package

This package lets you query multiple taxonomic names against the NCBI and BOLD repositories and retrieve DNA barcodes and associated metadata for any marker of interest. It relies heavily on the bold and rentrez packages from rOpenSci, and uses the asynchronous framework of the future package to speed up data retrieval from the NCBI while respecting its API request rate limit.

The final output is a data frame formatted according to the requirements of the refdb package. It is cleaned of duplicates introduced during mining, of differences between BOLD and NCBI metadata formats (e.g. geographic coordinates, dates), and of other issues that commonly arise when downloading and merging data from the two repositories.
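As an illustration of the kind of format difference mentioned above, the sketch below converts a coordinate string in the style of the GenBank lat_lon qualifier (for example "77.85 S 166.67 E") into signed decimal latitude and longitude. The helper parse_lat_lon is hypothetical and written only for this example; it is not the package's internal code, and the package may handle this step differently.

```r
# Hypothetical helper (not part of barcodeMineR): convert a GenBank-style
# "lat_lon" string such as "77.85 S 166.67 E" into signed decimal degrees.
parse_lat_lon <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  # Expect exactly four tokens: value, N/S, value, E/W
  if (length(parts) != 4) {
    return(c(lat = NA_real_, lon = NA_real_))
  }
  lat <- as.numeric(parts[1])
  lon <- as.numeric(parts[3])
  # Southern and western hemispheres become negative values
  if (toupper(parts[2]) == "S") lat <- -lat
  if (toupper(parts[4]) == "W") lon <- -lon
  c(lat = lat, lon = lon)
}

coords <- parse_lat_lon("77.85 S 166.67 E")
```

Strings that do not match the expected four-token layout return NA for both values rather than raising an error, which keeps a malformed record from interrupting a batch.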

In summary, it provides a unified framework for mining DNA barcodes from the main online genomic repositories, delivering clean, metadata-rich sequences either programmatically or interactively, depending on the intended use.

Installation

As the package is still in development, you can install the current version of barcodeMineR directly from the GitHub repository using the BiocManager package:

BiocManager::install("MatteoCe/barcodeMineR")
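The command above assumes that BiocManager itself is already installed. If it is not, a common idiom is to install it from CRAN first; this is a general setup step, not something specific to barcodeMineR:

```r
# Install BiocManager from CRAN only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
```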

Basic usage

Basic usage of the package consists of only two steps, which must be run in sequence:

library(barcodeMineR)

# check if a species is on the NCBI nucleotide database:
tax <- get_ncbi_taxonomy("Dissostichus mawsoni")

# download all records of this species from the database:
rec <- download_ncbi(tax, ask = FALSE)

# display output:
rec
#> # A tibble: 189 × 30
#>    recordID   markerCode DNA_seq  phylum class order family genus species source
#>    <chr>      <chr>      <DNA>    <chr>  <chr> <chr> <chr>  <chr> <chr>   <chr> 
#>  1 HM422302.1 COI        CTCTACT… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  2 ON000293.1 COX1       GCGCCTG… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  3 MK843765.1 COI        TCTCTAC… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  4 DQ498816.1 Cytb       GCCACCC… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  5 DQ498794.1 Rhod       GCCTACA… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  6 MK500763.1 enc1       TCTGACG… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  7 MG729451.1 COI        GAACTTA… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  8 KY656477.1 COI        GCCGGAA… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#>  9 LC138011.1 ND1        ATGCTTT… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#> 10 LC138011.1 ND2        ATGAGCC… Chord… Acti… Perc… Notot… Diss… Dissos… NCBI  
#> # ℹ 179 more rows
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> #   QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> #   sourceID <chr>, NCBI_ID <chr>, institutionStoring <lgl>,
#> #   collected_by <chr>, collection_date <chr>, altitude <lgl>, depth <lgl>,
#> #   country <chr>, directionPrimers <chr>, lengthSource <int>,
#> #   PCR_primers <chr>, note <chr>
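Because the output is a regular data frame, it can be explored with ordinary subsetting. Note that the records above label the same gene both as COI and as COX1, so a filter on a single label would miss some rows. The sketch below uses a small toy data frame, rec_demo, built from a few of the rows printed above, so that it runs without querying the NCBI; with real results you would use rec instead.

```r
# Toy stand-in for `rec`, copied from a few of the rows printed above
rec_demo <- data.frame(
  recordID = c("HM422302.1", "ON000293.1", "MK843765.1",
               "DQ498816.1", "LC138011.1"),
  markerCode = c("COI", "COX1", "COI", "Cytb", "ND1"),
  stringsAsFactors = FALSE
)

# Count how many records carry each marker label
marker_counts <- table(rec_demo$markerCode)

# Keep cytochrome c oxidase subunit I records, matching both labels
coi <- rec_demo[rec_demo$markerCode %in% c("COI", "COX1"), ]
```

Checking marker_counts first is a quick way to spot alternative spellings of a marker before filtering.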

Take a look at the introductory vignette for a more complete tutorial!

After that, proceed to the ‘Speeding up’ vignette to learn how to increase the speed and reliability of the package functions.