NOTE: This package is still in development and awaiting CRAN approval.
As mentioned in the introductory vignette, the basic usage of the barcodeMineR package relies on two types of functions that must be run consecutively to obtain a final refdb object. However, to know exactly which records the package will recover, it is necessary to understand how these functions work and the rationale behind them.
How do the taxonomy functions work
The taxonomy functions are used to define exactly which records will be retrieved by the download functions. More precisely, only records that include the taxid/scientificName, retrieved using the taxonomy functions, as the lowest taxonomic identification, will be downloaded.
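The selection rule described above can be illustrated with a small base R sketch. It does not reproduce the package internals: the `records` data frame, its `lowest_taxid` column and the `keep_lowest_id` helper are hypothetical names introduced only for this illustration, and the record labels are invented. The taxids are the ones shown later in this vignette.

```r
# Mock records: each row carries the taxid of its lowest taxonomic
# identification (record labels are invented for illustration).
records <- data.frame(
  recordID = c("rec_A", "rec_B", "rec_C", "rec_D"),
  lowest_taxid = c(273041, 273041, 1931189, 2066661),
  stringsAsFactors = FALSE
)

# Taxonomy table containing only the species Maldane sarsi
tax_species_only <- data.frame(
  taxid = 273041,
  scientificName = "Maldane sarsi",
  stringsAsFactors = FALSE
)

# Taxonomy table containing the species and its subspecies
tax_with_subspecies <- data.frame(
  taxid = c(273041, 1931189),
  scientificName = c("Maldane sarsi", "Maldane sarsi antarctica"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the records whose lowest identification
# matches a taxid listed in the taxonomy table.
keep_lowest_id <- function(records, tax) {
  records[records$lowest_taxid %in% tax$taxid, , drop = FALSE]
}

species_only <- keep_lowest_id(records, tax_species_only)
with_subspecies <- keep_lowest_id(records, tax_with_subspecies)

species_only$recordID
with_subspecies$recordID
```

In this mock example the subspecies record (`rec_C`) is retained only when the subspecies taxid is part of the taxonomy table, while the record identified as another taxon (`rec_D`) is never retained.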
Although the basic usage of the two functions get_ncbi_taxonomy and get_bold_taxonomy is very similar, they slightly differ when it comes to downloading unclassified records. Below, you can find a thorough explanation of the two functions’ usage.
NCBI taxonomy
For example, passing the following object, obtained with the get_ncbi_taxonomy function, to download_ncbi retrieves all NCBI records corresponding to the species Maldane sarsi, excluding records identified as a subspecies:
tax_maldane
#> queryName taxid rank scientificName phylum class order
#> 1 Maldane sarsi 273041 species Maldane sarsi Annelida Polychaeta NA
#> family genus species
#> 1 Maldanidae Maldane Maldane sarsi
maldane_rec <- download_ncbi(tax_maldane, ask = FALSE)
nrow(maldane_rec)
#> [1] 9
However, if we were interested in all records identified as Maldane sarsi, including its child taxa, we would need to include both the species and its subspecies in the taxonomy data frame. When searching for a species, this is done by default if the ask argument is set to FALSE:
tax_maldane <- get_ncbi_taxonomy("Maldane sarsi", ask = FALSE)
# the taxonomy table now includes both species and subspecies:
tax_maldane
#> queryName taxid rank scientificName phylum class
#> 1 Maldane sarsi 1931189 subspecies Maldane sarsi antarctica Annelida Polychaeta
#> 2 Maldane sarsi 273041 species Maldane sarsi Annelida Polychaeta
#> order family genus species
#> 1 NA Maldanidae Maldane Maldane sarsi
#> 2 NA Maldanidae Maldane Maldane sarsi
Otherwise, the function will prompt the user to choose which taxonomic rank to include in the final output.
Running the download_ncbi function will search all the records with Maldane sarsi and Maldane sarsi antarctica as the lowest taxonomic identification.
maldane_rec <- download_ncbi(tax_maldane, ask = FALSE)
# the output now includes two more records
nrow(maldane_rec)
#> [1] 11
This applies to all taxonomic levels. For example, imagine we are interested in all Maldane sarsi records on the NCBI. To gather as many sequences as possible, we might also include the records identified only at the genus level, in case they include highly similar sequences that we could use in our analyses. To do this, we could search for “unclassified Maldane” and retain all species and subspecies:
additional_tax <- get_ncbi_taxonomy("unclassified Maldane")
# after selecting only the ranks "species" and "subspecies" this is the final
# output:
additional_tax
#> queryName taxid rank scientificName phylum
#> 1 unclassified Maldane 2066661 species Maldane sp. 1 GK-2017 Annelida
#> 2 unclassified Maldane 2066650 species Maldane sp. 2 GK-2017 Annelida
#> 3 unclassified Maldane 649691 species Maldane sp. CBCA-2009 Annelida
#> class order family genus species
#> 1 Polychaeta NA Maldanidae Maldane Maldane sp. 1 GK-2017
#> 2 Polychaeta NA Maldanidae Maldane Maldane sp. 2 GK-2017
#> 3 Polychaeta NA Maldanidae Maldane Maldane sp. CBCA-2009
We could merge the two taxonomic tables and thus search all records corresponding to Maldane sarsi and Maldane sarsi antarctica, plus Maldane sp. 1 GK-2017, Maldane sp. 2 GK-2017 and Maldane sp. CBCA-2009:
final_tax <- rbind(tax_maldane, additional_tax)
download_ncbi(final_tax, ask = FALSE)
#> # A tibble: 19 × 30
#> recordID markerCode DNA_seq phylum class order family genus species source
#> <chr> <chr> <DNA> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 OQ053050.1 COX1 AACCTT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 2 LC342665.1 COX1 AACACT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 3 LC342640.1 COX1 AACATT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 4 MG421523.1 COI AACATT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 5 GQ229112.1 COI TTGTGG… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 6 OQ071313.1 large subu… GAGGGA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 7 OQ071256.1 small subu… CCTTCG… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 8 LC366965.1 18S riboso… TAGTCA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 9 LC366940.1 18S riboso… TAGTCA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 10 LC366001.1 28S riboso… ACTTGG… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 11 LC365980.1 28S riboso… CCCCAG… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 12 LC365952.1 16S riboso… AGCTTC… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 13 KX867346.1 16S riboso… GTATCC… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 14 KX867345.1 16S riboso… TATCCT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 15 AY612628.1 28S riboso… CCAACT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 16 AY612617.1 18S riboso… TATCTT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 17 AY569681.1 16S riboso… CGCGGT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 18 AY569669.1 28S riboso… TGTGCG… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 19 AY569655.1 18S riboso… TGCCAG… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> # QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> # sourceID <chr>, NCBI_ID <chr>, institutionStoring <lgl>,
#> # collected_by <chr>, collection_date <chr>, altitude <lgl>, depth <lgl>,
#> # country <lgl>, directionPrimers <lgl>, lengthSource <int>,
#> # PCR_primers <lgl>, note <lgl>
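When taxonomy tables come from overlapping searches, the same taxid can end up in the merged table more than once. How download_ncbi treats repeated taxids is not covered here, so the sketch below takes the cautious route of dropping duplicates before downloading. It uses only base R on mock tables that keep a subset of the columns returned by get_ncbi_taxonomy; the object names `tax_a`, `tax_b`, `merged` and `merged_unique` are introduced for this illustration only.

```r
# Mock taxonomy tables with a reduced set of columns; the taxids and names
# are those shown in the tables of this vignette.
tax_a <- data.frame(
  queryName = c("Maldane sarsi", "Maldane sarsi"),
  taxid = c(1931189, 273041),
  scientificName = c("Maldane sarsi antarctica", "Maldane sarsi"),
  stringsAsFactors = FALSE
)
tax_b <- data.frame(
  queryName = c("Maldane", "Maldane"),
  taxid = c(273041, 2066661),
  scientificName = c("Maldane sarsi", "Maldane sp. 1 GK-2017"),
  stringsAsFactors = FALSE
)

# Stack the two tables, then keep the first occurrence of each taxid
merged <- rbind(tax_a, tax_b)
merged_unique <- merged[!duplicated(merged$taxid), ]

merged_unique
```

Note that the retained row keeps the queryName of the first search that returned it, which may matter if the query name is used later to trace records back to their search.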
In order to be sure to include all Maldane sarsi records, we could simply search the genus Maldane and actively select the species/subspecies we’re interested in. In this case, the base function grep might come in handy:
tax_maldane <- get_ncbi_taxonomy("Maldane", ask = FALSE)
# actively filter all records including the strings "Maldane" and "sarsi" in the
# field "scientificName"
tax_maldane_all <- tax_maldane[grep("(Maldane).*(sarsi)", tax_maldane$scientificName), ]
tax_maldane_all
#> queryName taxid rank scientificName phylum class
#> 6 Maldane 2066649 species Maldane cf. sarsi GK-2017 Annelida Polychaeta
#> 7 Maldane 1931189 subspecies Maldane sarsi antarctica Annelida Polychaeta
#> 9 Maldane 880914 species Maldane sarsi CMC02 Annelida Polychaeta
#> 10 Maldane 879538 species Maldane sarsi CMC01 Annelida Polychaeta
#> 12 Maldane 273041 species Maldane sarsi Annelida Polychaeta
#> order family genus species
#> 6 NA Maldanidae Maldane Maldane cf. sarsi GK-2017
#> 7 NA Maldanidae Maldane Maldane sarsi
#> 9 NA Maldanidae Maldane Maldane sarsi CMC02
#> 10 NA Maldanidae Maldane Maldane sarsi CMC01
#> 12 NA Maldanidae Maldane Maldane sarsi
By merging this data frame with the one including all unclassified Maldane, we can be confident of downloading most of the available Maldane sarsi records from the NCBI:
all_maldane <- rbind(tax_maldane_all, additional_tax)
ncbi_maldane <- download_ncbi(all_maldane, ask = FALSE)
ncbi_maldane
#> # A tibble: 29 × 30
#> recordID markerCode DNA_seq phylum class order family genus species source
#> <chr> <chr> <DNA> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 OQ053050.1 COX1 AACCTTA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 2 LC342665.1 COX1 AACACTA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 3 LC342640.1 COX1 AACATTA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 4 LC342639.1 COX1 AACATTA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 5 MG421523.1 COI AACATTA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 6 HQ023885.1 COI AACATTA… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 7 GQ229112.1 COI TTGTGGT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 8 GU672597.1 COI GGAACAT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 9 GU672596.1 COI GGAACAT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> 10 GU672576.1 COI GGAACAT… Annel… Poly… NA Malda… Mald… Maldan… NCBI
#> # ℹ 19 more rows
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> # QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> # sourceID <chr>, NCBI_ID <chr>, institutionStoring <lgl>,
#> # collected_by <chr>, collection_date <chr>, altitude <lgl>, depth <lgl>,
#> # country <lgl>, directionPrimers <lgl>, lengthSource <int>,
#> # PCR_primers <lgl>, note <lgl>
BOLD taxonomy
The basic usage of get_bold_taxonomy follows the same mechanism. In order to obtain all Clione antarctica records, we first need to get the corresponding taxonomy data frame:
# this taxonomy table will retrieve all "Clione antarctica" records
tax_clione
#> queryName taxid taxon rank records
#> 1 Clione antarctica 650223 Clione antarctica species 9
rec_clione <- download_bold(tax_clione, ask = FALSE)
nrow(rec_clione)
#> [1] 9
However, this will exclude all descending taxonomic levels, in this case any subspecies of Clione antarctica. The function get_bold_taxonomy automatically searches for downstream taxonomies, thanks to the default setting of the descend argument.
tax_clione_anta <- get_bold_taxonomy("Clione antarctica", descend = TRUE)
rec_clione <- download_bold(tax_clione_anta, ask = FALSE)
nrow(rec_clione)
#> [1] 9
However, this can be overridden by setting that argument to FALSE. The descend argument has been introduced to let the user skip the internal search for child taxa, which relies on the taxize package and may be unstable when searching long vectors. See the Tips section of this vignette for more information on this matter.
The main difference between the NCBI taxonomy function and get_bold_taxonomy actually relates to the subsequent recovery of records. In fact, the BOLD database does not include an “unclassified taxon” rank that groups all records identified to that taxonomic level but lacking a species name. However, publishing records on the BOLD database does not necessarily require an identification at the species level, meaning that searching a higher-level taxon results in the download of all records identified only up to that taxon.
To increase the recovery of records potentially related to Clione antarctica, we can also search for the genus Clione and download all the records identified up to the genus level:
tax_clione <- get_bold_taxonomy("Clione", descend = FALSE)
rec_clione <- download_bold(tax_clione, ask = FALSE)
rec_clione
#> # A tibble: 8 × 30
#> recordID markerCode DNA_seq phylum class order family genus species source
#> <chr> <chr> <DNA> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 QHAK1466-22 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 2 BHAK2663-20 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 3 QHAK1267-22 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 4 QHAK1223-22 COI-5P ACAAGGA… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 5 QHAK1237-22 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 6 QHAK1263-22 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 7 QHAK1446-22 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 8 QHAK3063-23 COI-5P GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> # QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> # sourceID <chr>, NCBI_ID <chr>, institutionStoring <chr>,
#> # collected_by <chr>, collection_date <lgl>, altitude <int>, depth <dbl>,
#> # country <chr>, directionPrimers <chr>, lengthSource <int>,
#> # PCR_primers <chr>, note <lgl>
These 8 records are identified only up to the genus Clione. Thus, to recover all records potentially belonging to the species Clione antarctica, we would have to search for all unidentified Clione records together with those identified as Clione antarctica. Ideally, this would be the simplest approach:
final_tax <- get_bold_taxonomy(c("Clione", "Clione antarctica"), descend = FALSE, ask = FALSE)
rec_clione <- download_bold(final_tax, ask = FALSE)
rec_clione
#> # A tibble: 17 × 30
#> recordID markerCode DNA_seq phylum class order family genus species source
#> <chr> <chr> <DNA> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 QHAK1466-22 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 2 BHAK2663-20 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 3 QHAK1267-22 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 4 QHAK1223-22 COI-5P ACAAGG… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 5 QHAK1237-22 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 6 QHAK1263-22 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 7 QHAK1446-22 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 8 QHAK3063-23 COI-5P GACTCT… Mollu… Gast… Pter… Clion… Clio… NA BOLD
#> 9 CAOII309-09 COI-5P ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 10 CAOII310-09 COI-5P ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 11 CMARA044-09 COI-5P ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 12 GBMLG17117… COI-5P TTGTTT… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 13 GBMLG18681… COI-5P GTAGGC… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 14 GBMLG18682… COI-5P GTAGGC… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 15 GBMLG18683… COI-5P GTAGGC… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 16 CAOII308-09 COI-5P ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> 17 CAOII311-09 COI-5P ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> # QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> # sourceID <chr>, NCBI_ID <chr>, institutionStoring <chr>,
#> # collected_by <chr>, collection_date <lgl>, altitude <int>, depth <dbl>,
#> # country <chr>, directionPrimers <chr>, lengthSource <int>,
#> # PCR_primers <chr>, note <chr>
As in the section above, searching all taxa for Clione and then using grep to keep only those including the strings “Clione” and “antarctica” would increase our chances of including records with identification qualifiers like “cf.”.
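The grep-based filtering mentioned above could look like the following sketch for a BOLD taxonomy table. The table is a mock: only the Clione antarctica row reflects the output shown in this vignette, while the other rows (including the “cf.” name and the missing taxids) are invented to show how the pattern behaves. The object names are hypothetical.

```r
# Mock BOLD taxonomy table. Only the first row (taxid 650223) comes from
# this vignette; the remaining rows are invented for illustration.
tax_clione_all <- data.frame(
  queryName = "Clione",
  taxid = c(650223, NA, NA, NA),
  taxon = c("Clione antarctica", "Clione cf. antarctica",
            "Clione limacina", "Clione"),
  rank = c("species", "species", "species", "genus"),
  stringsAsFactors = FALSE
)

# Keep taxa whose name contains "Clione" followed, anywhere, by "antarctica"
pattern <- "(Clione).*(antarctica)"
tax_clione_anta <- tax_clione_all[grep(pattern, tax_clione_all$taxon), ]

# Add the genus row so that records identified only to genus are included
tax_final <- rbind(
  tax_clione_anta,
  tax_clione_all[tax_clione_all$rank == "genus", ]
)

tax_final$taxon
```

The pattern keeps names with qualifiers between the genus and the epithet and drops other species of the genus, which are then excluded from the download.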
Tips
- Searching long vectors
Be careful when searching long vectors including many different taxa. This might result in frequent blocking by the BOLD servers, as explained in the vignette topic Why can’t I speed up the BOLD functions?. However, when the number of taxa to be searched remains under an approximate limit of 250 per hour, the mining operations using the BOLD functions of the package can be sped up by setting the api_rate argument to 1 (thus 1 request per second).
- Overriding the default taxonomy filter of download_bold
When the user is interested in all records belonging to many higher-level taxa (genera, families or orders), the get_bold_taxonomy function may take a lot of time to search for all descending taxonomic names. The low rate at which the BOLD functions are set is mainly due to the frequent blocking by the BOLD servers, as described in Why can’t I speed up the BOLD functions?. One solution consists in deactivating the default filtering step adopted by the download_bold function, which eliminates all records belonging to child taxa that are not present in the taxonomy table provided, i.e. the output of the previous get_bold_taxonomy call. This can be achieved by setting the filter argument to FALSE:
# get taxonomic table of the genus Clione
tax <- get_bold_taxonomy("Clione", descend = FALSE, ask = FALSE)
# download all records, including those identified to species belonging to Clione
download_bold(tax, filter = FALSE)
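Returning to the first tip, one way to stay under the approximate limit of 250 taxa per hour is to split a long vector of names into batches before searching. The sketch below uses base R only; the `taxa` vector holds placeholder names and `batch_size` is a variable introduced for this illustration, not a package argument.

```r
# Placeholder vector standing in for a long list of taxon names
taxa <- paste("Taxon", seq_len(600))

# Approximate hourly limit mentioned in the first tip
batch_size <- 250

# Assign each name to a batch of at most `batch_size` elements
batches <- split(taxa, ceiling(seq_along(taxa) / batch_size))

# Number of names in each batch
lengths(batches)

# Each element of `batches` could then be passed to get_bold_taxonomy,
# leaving enough time between batches to remain under the limit.
```

How long to wait between batches depends on how the BOLD servers respond, so the interval is left to the user.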