Searching taxonomy • barcodeMineR

library(barcodeMineR)

NOTE: This package is still in development and awaiting CRAN approval.

As mentioned in the introductory vignette, the basic usage of the barcodeMineR package consists of two types of functions that must be run consecutively to obtain a final refdb object. However, to understand exactly which records the package will recover, it helps to understand how these functions work and the rationale behind them.

How the taxonomy functions work

The taxonomy functions define exactly which records will be retrieved by the download functions. More precisely, a record is downloaded only if its lowest taxonomic identification matches a taxid/scientificName returned by the taxonomy functions.

Although the basic usage of the two functions get_ncbi_taxonomy and get_bold_taxonomy is very similar, they differ slightly when it comes to downloading unclassified records. Both are explained in detail below.

NCBI taxonomy

The following object, obtained using the get_ncbi_taxonomy function, will retrieve all NCBI records identified as the species Maldane sarsi, excluding records identified as a subspecies:

tax_maldane
#>       queryName  taxid    rank scientificName   phylum      class order
#> 1 Maldane sarsi 273041 species  Maldane sarsi Annelida Polychaeta    NA
#>       family   genus       species
#> 1 Maldanidae Maldane Maldane sarsi
maldane_rec <- download_ncbi(tax_maldane, ask = FALSE)

nrow(maldane_rec)
#> [1] 9
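
The matching rule can be pictured with a small toy example. The sketch below is only an illustration of the "lowest taxonomic identification" idea, not the package's internal code: the records data frame, its column names and the record IDs are made up for the purpose.

```r
# Toy illustration only: hypothetical records, each labelled with its lowest
# taxonomic identification (this is not how barcodeMineR stores records)
records <- data.frame(
  recordID = c("rec1", "rec2", "rec3"),
  lowest_id = c("Maldane sarsi", "Maldane sarsi antarctica", "Maldane"),
  stringsAsFactors = FALSE
)

# Names present in the taxonomy table passed to the download function
tax_names <- c("Maldane sarsi")

# Only records whose lowest identification matches exactly are kept:
# the subspecies record and the genus-level record are both excluded
kept <- records[records$lowest_id %in% tax_names, ]
kept$recordID
```

With a taxonomy table containing only the species, the subspecies-level and genus-level records are left out, which is why the choice of rows in the taxonomy table matters.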

However, if we were interested in all records identified as Maldane sarsi, including its child taxa, we would need to include both the species and its subspecies in the taxonomy data frame. When searching for a species, this is done automatically if the ask argument is set to FALSE:

tax_maldane <- get_ncbi_taxonomy("Maldane sarsi", ask = FALSE)

# the taxonomy table now includes both species and subspecies:
tax_maldane
#>       queryName   taxid       rank           scientificName   phylum      class
#> 1 Maldane sarsi 1931189 subspecies Maldane sarsi antarctica Annelida Polychaeta
#> 2 Maldane sarsi  273041    species            Maldane sarsi Annelida Polychaeta
#>   order     family   genus       species
#> 1    NA Maldanidae Maldane Maldane sarsi
#> 2    NA Maldanidae Maldane Maldane sarsi

Otherwise, the function will prompt the user to choose which taxonomic rank to include in the final output.
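
As a non-interactive alternative to the prompt, the taxonomy table can also be subset by rank after the search. The sketch below works on a hand-built copy of the relevant columns of the table shown above (named tax_example here to keep it separate from the vignette's objects) and assumes the rank column is a plain character vector.

```r
# Hand-built copy of some columns of the taxonomy table shown above
tax_example <- data.frame(
  queryName = c("Maldane sarsi", "Maldane sarsi"),
  taxid = c(1931189, 273041),
  rank = c("subspecies", "species"),
  scientificName = c("Maldane sarsi antarctica", "Maldane sarsi"),
  stringsAsFactors = FALSE
)

# Keep only the ranks of interest (here: species only)
ranks_to_keep <- c("species")
tax_species <- tax_example[tax_example$rank %in% ranks_to_keep, ]
tax_species$scientificName
```

The examples that follow keep both rows of the original table.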

Running the download_ncbi function will search all the records with Maldane sarsi and Maldane sarsi antarctica as the lowest taxonomic identification.

maldane_rec <- download_ncbi(tax_maldane, ask = FALSE)

# the output now includes two more records
nrow(maldane_rec)
#> [1] 11

This applies to all taxonomic levels. For example, imagine we are interested in all Maldane sarsi records on the NCBI. To gather as many sequences as possible, we might also include records identified only to the genus level, in case they include highly similar sequences that we could use in our analyses. To do this, we could search for “unclassified Maldane” and retain all species and subspecies:

additional_tax <- get_ncbi_taxonomy("unclassified Maldane")

# after selecting only the ranks "species" and "subspecies" this is the final 
# output:
additional_tax
#>              queryName   taxid    rank        scientificName   phylum
#> 1 unclassified Maldane 2066661 species Maldane sp. 1 GK-2017 Annelida
#> 2 unclassified Maldane 2066650 species Maldane sp. 2 GK-2017 Annelida
#> 3 unclassified Maldane  649691 species Maldane sp. CBCA-2009 Annelida
#>        class order     family   genus               species
#> 1 Polychaeta    NA Maldanidae Maldane Maldane sp. 1 GK-2017
#> 2 Polychaeta    NA Maldanidae Maldane Maldane sp. 2 GK-2017
#> 3 Polychaeta    NA Maldanidae Maldane Maldane sp. CBCA-2009

We could merge the two taxonomic tables and thus search all records corresponding to Maldane sarsi, Maldane sarsi antarctica, Maldane sp. 1 GK-2017, Maldane sp. 2 GK-2017 and Maldane sp. CBCA-2009:

final_tax <- rbind(tax_maldane, additional_tax)

download_ncbi(final_tax, ask = FALSE)
#> # A tibble: 19 × 30
#>    recordID   markerCode  DNA_seq phylum class order family genus species source
#>    <chr>      <chr>       <DNA>   <chr>  <chr> <chr> <chr>  <chr> <chr>   <chr> 
#>  1 OQ053050.1 COX1        AACCTT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  2 LC342665.1 COX1        AACACT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  3 LC342640.1 COX1        AACATT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  4 MG421523.1 COI         AACATT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  5 GQ229112.1 COI         TTGTGG… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  6 OQ071313.1 large subu… GAGGGA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  7 OQ071256.1 small subu… CCTTCG… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  8 LC366965.1 18S riboso… TAGTCA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  9 LC366940.1 18S riboso… TAGTCA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 10 LC366001.1 28S riboso… ACTTGG… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 11 LC365980.1 28S riboso… CCCCAG… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 12 LC365952.1 16S riboso… AGCTTC… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 13 KX867346.1 16S riboso… GTATCC… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 14 KX867345.1 16S riboso… TATCCT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 15 AY612628.1 28S riboso… CCAACT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 16 AY612617.1 18S riboso… TATCTT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 17 AY569681.1 16S riboso… CGCGGT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 18 AY569669.1 28S riboso… TGTGCG… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 19 AY569655.1 18S riboso… TGCCAG… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> #   QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> #   sourceID <chr>, NCBI_ID <chr>, institutionStoring <lgl>,
#> #   collected_by <chr>, collection_date <chr>, altitude <lgl>, depth <lgl>,
#> #   country <lgl>, directionPrimers <lgl>, lengthSource <int>,
#> #   PCR_primers <lgl>, note <lgl>

To be sure to include all Maldane sarsi records, we could instead search the genus Maldane and select the species/subspecies we’re interested in ourselves. In this case, the base function grep comes in handy:

tax_maldane <- get_ncbi_taxonomy("Maldane", ask = FALSE)

# keep all rows whose "scientificName" field contains "Maldane" followed,
# anywhere later in the name, by "sarsi"
tax_maldane_all <- tax_maldane[grep("(Maldane).*(sarsi)", tax_maldane$scientificName), ]

tax_maldane_all
#>    queryName   taxid       rank            scientificName   phylum      class
#> 6    Maldane 2066649    species Maldane cf. sarsi GK-2017 Annelida Polychaeta
#> 7    Maldane 1931189 subspecies  Maldane sarsi antarctica Annelida Polychaeta
#> 9    Maldane  880914    species       Maldane sarsi CMC02 Annelida Polychaeta
#> 10   Maldane  879538    species       Maldane sarsi CMC01 Annelida Polychaeta
#> 12   Maldane  273041    species             Maldane sarsi Annelida Polychaeta
#>    order     family   genus                   species
#> 6     NA Maldanidae Maldane Maldane cf. sarsi GK-2017
#> 7     NA Maldanidae Maldane             Maldane sarsi
#> 9     NA Maldanidae Maldane       Maldane sarsi CMC02
#> 10    NA Maldanidae Maldane       Maldane sarsi CMC01
#> 12    NA Maldanidae Maldane             Maldane sarsi
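
To check what a pattern keeps and what it drops before subsetting the real table, it can be tried on a plain character vector first. The sketch below uses scientific names taken from the tables in this vignette and base R's grepl, which returns a logical vector instead of row indices.

```r
# Scientific names copied from the taxonomy tables shown in this vignette
sci_names <- c(
  "Maldane cf. sarsi GK-2017",
  "Maldane sarsi antarctica",
  "Maldane sarsi CMC02",
  "Maldane sarsi CMC01",
  "Maldane sarsi",
  "Maldane sp. 1 GK-2017"
)

# "Maldane", then any characters, then "sarsi"
pattern <- "(Maldane).*(sarsi)"
hits <- grepl(pattern, sci_names)

sci_names[hits]   # names that would be kept
sci_names[!hits]  # names that would be dropped
```

Because ".*" allows any text between the two words, names carrying a qualifier such as "cf." are kept, while names that never mention "sarsi" are dropped.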

By merging this data frame with the one including all unclassified Maldane, we can be confident of downloading most of the available Maldane sarsi records from the NCBI:

all_maldane <- rbind(tax_maldane_all, additional_tax)

ncbi_maldane <- download_ncbi(all_maldane, ask = FALSE)
ncbi_maldane
#> # A tibble: 29 × 30
#>    recordID   markerCode DNA_seq  phylum class order family genus species source
#>    <chr>      <chr>      <DNA>    <chr>  <chr> <chr> <chr>  <chr> <chr>   <chr> 
#>  1 OQ053050.1 COX1       AACCTTA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  2 LC342665.1 COX1       AACACTA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  3 LC342640.1 COX1       AACATTA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  4 LC342639.1 COX1       AACATTA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  5 MG421523.1 COI        AACATTA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  6 HQ023885.1 COI        AACATTA… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  7 GQ229112.1 COI        TTGTGGT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  8 GU672597.1 COI        GGAACAT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#>  9 GU672596.1 COI        GGAACAT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> 10 GU672576.1 COI        GGAACAT… Annel… Poly… NA    Malda… Mald… Maldan… NCBI  
#> # ℹ 19 more rows
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> #   QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> #   sourceID <chr>, NCBI_ID <chr>, institutionStoring <lgl>,
#> #   collected_by <chr>, collection_date <chr>, altitude <lgl>, depth <lgl>,
#> #   country <lgl>, directionPrimers <lgl>, lengthSource <int>,
#> #   PCR_primers <lgl>, note <lgl>
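
When taxonomy tables from overlapping searches are merged with rbind, the same taxid can appear more than once. Whether the download functions tolerate duplicated rows is not stated here, so removing them first is a simple precaution. The sketch below uses hand-built subsets of two tables from this section; the object names are illustrative.

```r
# Hand-built subsets (two columns only) of taxonomy tables from this section
tab_a <- data.frame(
  taxid = c(1931189, 273041),
  scientificName = c("Maldane sarsi antarctica", "Maldane sarsi"),
  stringsAsFactors = FALSE
)
tab_b <- data.frame(
  taxid = c(2066649, 1931189, 273041),
  scientificName = c("Maldane cf. sarsi GK-2017",
                     "Maldane sarsi antarctica",
                     "Maldane sarsi"),
  stringsAsFactors = FALSE
)

# Merge, then keep the first occurrence of each taxid
merged <- rbind(tab_a, tab_b)
merged_unique <- merged[!duplicated(merged$taxid), ]
nrow(merged_unique)
```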

BOLD taxonomy

The basic usage of get_bold_taxonomy follows the same mechanism. In order to obtain all Clione antarctica records, we first need to get the corresponding taxonomy data frame:

# this taxonomy table will retrieve all "Clione antarctica" records 
tax_clione
#>           queryName  taxid             taxon    rank records
#> 1 Clione antarctica 650223 Clione antarctica species       9
rec_clione <- download_bold(tax_clione, ask = FALSE)
nrow(rec_clione)
#> [1] 9

However, this will exclude all descending taxonomic levels, in this case any subspecies of Clione antarctica. By default, get_bold_taxonomy also searches for downstream taxa, because the descend argument is set to TRUE unless specified otherwise:

tax_clione_anta <- get_bold_taxonomy("Clione antarctica", descend = TRUE)

rec_clione <- download_bold(tax_clione_anta, ask = FALSE)
nrow(rec_clione)
#> [1] 9

However, this can be overridden by setting that argument to FALSE. The descend argument was introduced so that the user can skip the internal search for child taxa, which relies on the taxize package and may be unstable when searching long vectors. See the Tips section of this vignette for more information on this matter.

The main difference between get_ncbi_taxonomy and get_bold_taxonomy relates to the subsequent recovery of records. The BOLD database does not include an “unclassified taxon” rank grouping all records identified to that taxonomic level but lacking a species name. However, publishing records on BOLD does not necessarily require an identification at the species level, so searching a higher-level taxon results in the download of only those records identified no further than that taxon.

To increase the recovery of records potentially related to Clione antarctica, we can also search for the genus Clione and download all the records identified up to the genus level:

tax_clione <- get_bold_taxonomy("Clione", descend = FALSE)

rec_clione <- download_bold(tax_clione, ask = FALSE)
rec_clione
#> # A tibble: 8 × 30
#>   recordID    markerCode DNA_seq  phylum class order family genus species source
#>   <chr>       <chr>      <DNA>    <chr>  <chr> <chr> <chr>  <chr> <chr>   <chr> 
#> 1 QHAK1466-22 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 2 BHAK2663-20 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 3 QHAK1267-22 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 4 QHAK1223-22 COI-5P     ACAAGGA… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 5 QHAK1237-22 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 6 QHAK1263-22 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 7 QHAK1446-22 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> 8 QHAK3063-23 COI-5P     GACTCTT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> #   QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> #   sourceID <chr>, NCBI_ID <chr>, institutionStoring <chr>,
#> #   collected_by <chr>, collection_date <lgl>, altitude <int>, depth <dbl>,
#> #   country <chr>, directionPrimers <chr>, lengthSource <int>,
#> #   PCR_primers <chr>, note <lgl>

These 8 records are identified only to the genus Clione. Thus, to recover all records potentially belonging to Clione antarctica, we would have to search for all unidentified Clione records together with those identified as Clione antarctica. Ideally, this would be the simplest approach:

final_tax <- get_bold_taxonomy(c("Clione", "Clione antarctica"), descend = FALSE, ask = FALSE)

rec_clione <- download_bold(final_tax, ask = FALSE)
rec_clione
#> # A tibble: 17 × 30
#>    recordID    markerCode DNA_seq phylum class order family genus species source
#>    <chr>       <chr>      <DNA>   <chr>  <chr> <chr> <chr>  <chr> <chr>   <chr> 
#>  1 QHAK1466-22 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  2 BHAK2663-20 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  3 QHAK1267-22 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  4 QHAK1223-22 COI-5P     ACAAGG… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  5 QHAK1237-22 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  6 QHAK1263-22 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  7 QHAK1446-22 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  8 QHAK3063-23 COI-5P     GACTCT… Mollu… Gast… Pter… Clion… Clio… NA      BOLD  
#>  9 CAOII309-09 COI-5P     ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 10 CAOII310-09 COI-5P     ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 11 CMARA044-09 COI-5P     ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 12 GBMLG17117… COI-5P     TTGTTT… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 13 GBMLG18681… COI-5P     GTAGGC… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 14 GBMLG18682… COI-5P     GTAGGC… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 15 GBMLG18683… COI-5P     GTAGGC… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 16 CAOII308-09 COI-5P     ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> 17 CAOII311-09 COI-5P     ------… Mollu… Gast… Pter… Clion… Clio… Clione… BOLD  
#> # ℹ 20 more variables: lat <dbl>, lon <dbl>, lengthGene <int>, sampleID <chr>,
#> #   QueryName <chr>, identified_by <chr>, taxNotes <lgl>, db_xref <chr>,
#> #   sourceID <chr>, NCBI_ID <chr>, institutionStoring <chr>,
#> #   collected_by <chr>, collection_date <lgl>, altitude <int>, depth <dbl>,
#> #   country <chr>, directionPrimers <chr>, lengthSource <int>,
#> #   PCR_primers <chr>, note <chr>

As in the NCBI section above, searching all taxa for Clione and then using grep to keep only those including the strings “Clione” and “antarctica” would increase our chances of including records with identification qualifiers like “cf.”.
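
A minimal sketch of that filtering step on a BOLD-style table is given below. The taxon names are made up for illustration and are not checked against BOLD. In practice the table would be the output of get_bold_taxonomy for the genus, where the name is stored in the taxon column.

```r
# Made-up taxonomy rows for illustration only (not checked against BOLD)
tax_clione_all <- data.frame(
  taxon = c("Clione", "Clione antarctica",
            "Clione cf. antarctica", "Clione limacina"),
  rank = c("genus", "species", "species", "species"),
  stringsAsFactors = FALSE
)

# Keep the rows whose taxon contains "Clione" followed by "antarctica"
keep <- grepl("(Clione).*(antarctica)", tax_clione_all$taxon)
tax_antarctica <- tax_clione_all[keep, ]
tax_antarctica$taxon
```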

Tips

Searching long vectors

Be careful when searching long vectors that include many different taxa. This might result in frequent blocking by the BOLD servers, as explained in the vignette topic Why can’t I speed up the BOLD functions?. However, as long as the number of taxa searched stays below roughly 250 per hour, mining with the BOLD functions of the package can be sped up by setting the api_rate argument to 1 (i.e. 1 request per second).
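
One way to keep long searches manageable is to split the vector into smaller batches and pause between them. The helper below is a hypothetical sketch, not part of barcodeMineR: search_in_batches, its arguments and the default batch size are all assumptions, and a stand-in search function is used so that the example runs without contacting any server.

```r
# Hypothetical helper (not part of barcodeMineR): apply a search function to a
# long vector of taxa in smaller batches, pausing between batches
search_in_batches <- function(taxa, search_fun, batch_size = 50, pause = 0) {
  # assign each taxon to a batch: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(taxa) / batch_size)
  batches <- split(taxa, batch_id)
  results <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    results[[i]] <- search_fun(batches[[i]])
    # wait before the next batch, but not after the last one
    if (i < length(batches)) Sys.sleep(pause)
  }
  # row-bind the per-batch data frames into one table
  do.call(rbind, results)
}

# Stand-in for a real search function, used only to show the mechanics
fake_search <- function(x) data.frame(queryName = x, stringsAsFactors = FALSE)

taxa <- paste("Taxon", 1:5)
res <- search_in_batches(taxa, fake_search, batch_size = 2)
nrow(res)
```

In real use, search_fun could be a small wrapper such as function(x) get_bold_taxonomy(x, ask = FALSE), with the pause and batch size chosen to stay within the hourly limit mentioned above.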

Overriding the default taxonomy filter of download_bold

When the user is interested in all records belonging to many higher-level taxa (genera, families or orders), get_bold_taxonomy may take a long time to search for all descending taxonomic names. The BOLD functions run at a low request rate mainly because of the frequent blocking by the BOLD servers, as described in Why can’t I speed up the BOLD functions?. One solution is to deactivate the default filtering step of download_bold, which removes all records belonging to child taxa that are not present in the supplied taxonomy table, i.e. the output of the previous get_bold_taxonomy call. This can be achieved by setting the filter argument to FALSE:

# get taxonomic table of the genus Clione
tax <- get_bold_taxonomy("Clione", descend = FALSE, ask = FALSE)

# download all records, including those identified to species belonging to Clione
download_bold(tax, filter = FALSE)