Queries galah only for taxa not already in taxonomy_file. Can return a list with, for each of several levels of the taxonomic hierarchy, the 'best' match at that level. For example, if 'genus' is included in needed_ranks, the returned list will have an element 'genus' that contains, in a column named taxa and for each of the original names provided, the best result at genus level or higher (higher in cases where no genus-level match was available).

make_taxonomy(
  df = NULL,
  taxa_col = "original_name",
  taxonomy_file = tempfile(),
  force_new = list(original_name = NULL, timediff = as.difftime(26, units = "weeks")),
  remove_taxa = c("bold:", "unverified", "undetermined", "unidentified", "annual herb",
    "annual grass", "incertae sedis", "\\?", "another\\s", "not naturalised in sa",
    "annual tussock grass", "*no id", "spec\\.", "aquatic grass"),
  remove_strings = c("\\s\\-\\-\\s.*", "\\ssp\\.$", "\\sssp\\.$",
    "\\sspec\\.$", "\\ssp$", "\\sssp$", "\\ssp\\d$", "dead", "\\sx\\s.*",
    "\\sX\\s.*", "unknown", "\\scultivar$", "\\scomplex$", "\\(NC\\)",
    "\\saff\\."),
  not_names = c("sp", "ssp", "var", "subsp", "subspecies", "form", "race", "nov", "aff",
    "cf", "lineage", "group", "et", "al", "and", "pl", "revised", "nov", "sensu", "lato",
    "hybrid", "complex"),
  tri_strings = c("\\sssp\\s", "\\sssp\\.\\s", "\\svar\\s",
    "\\svar\\.\\s", "\\ssubsp\\.", "\\ssubspecies", "\\sform\\)",
    "\\sform\\s", "\\sf\\.", "\\srace\\s", "\\srace\\)",
    "\\sp\\.v\\."),
  bi_strings = c("all\\ssubspecies", "\\ssp\\s", "\\ssp\\.\\s",
    "\\sspecies", "\\sspp\\.\\s", "\\sspp\\s"),
  atlas = c("Australia"),
  tweak_species = TRUE,
  return_taxonomy = TRUE,
  limit = TRUE,
  needed_ranks = c("species"),
  overrides = NULL
)

Arguments

df

Dataframe with taxa_col. Can be NULL only if taxonomy_file already exists.

taxa_col

Character or index. Name or index of the column with taxa names. Each unique taxon in this column will be queried against galah::search_taxa and will appear in the results list element lutaxa in a column called original_name.

taxonomy_file

Character. File path to save results to. Any file extension is ignored: results are always saved as a .parquet file.

force_new

List with elements timediff and any column name from taxonomy_file. If taxonomy_file already exists and a column name in force_new matches a column in taxonomy_file, the matching levels within that column will be requeried. Likewise, any original_name that has not been searched within timediff will be requeried. Set either to NULL to ignore.
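
A minimal sketch of a force_new list and of the age test it implies. The element name timediff follows the default shown in the usage above. The last_searched vector is hypothetical: how the package records the time of each search is not described here.

```r
# Requery one named taxon, plus anything not searched in the last 4 weeks
force_new <- list(
  original_name = c("Eucalyptus viminalis"),
  timediff = as.difftime(4, units = "weeks")
)

# Hypothetical record of when two names were last searched
last_searched <- as.POSIXct(c("2024-01-01", "2024-03-01"), tz = "UTC")
now <- as.POSIXct("2024-03-15", tz = "UTC")

# Age of each search in weeks, compared with the allowed age in weeks
age_weeks <- as.numeric(difftime(now, last_searched, units = "weeks"))
stale <- age_weeks > as.numeric(force_new$timediff, units = "weeks")
stale
```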

remove_taxa

Character. Regular expressions. Rows where tolower(taxa_col) matches any element of remove_taxa are removed (the whole row is removed).

remove_strings

Character. Regular expressions. Text that matches remove_strings is removed from taxa_col before searching (only the matching text is removed, not the row).
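
The difference between remove_taxa (drops rows) and remove_strings (drops text) can be sketched in base R. This is an illustration only, not the package's internal code: the input names are made up, and the patterns are a small subset of the documented defaults.

```r
# Hypothetical input names
taxa <- c("Eucalyptus viminalis ssp.", "Unidentified shrub",
          "Acacia sp.", "Banksia marginata")

# Subset of the documented default patterns
remove_taxa <- c("unidentified", "annual herb")
remove_strings <- c("\\ssp\\.$", "\\sssp\\.$")

# remove_taxa: drop whole rows whose lower-cased name matches any pattern
drop_row <- grepl(paste0(remove_taxa, collapse = "|"), tolower(taxa))
kept <- taxa[!drop_row]

# remove_strings: keep the row, but delete the matching text before searching
search_name <- gsub(paste0(remove_strings, collapse = "|"), "", kept)

data.frame(original_name = kept, search_name = search_name)
```

In this sketch "Unidentified shrub" disappears entirely, while "Acacia sp." is kept and would be searched as "Acacia".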

not_names

Character. Text that matches not_names is removed from original_name before a word count that is used to guess whether the original_name is trinomial (original_is_tri field in lutaxa).

tri_strings, bi_strings

Character. Regular expressions. Text that matches these strings is used to indicate whether the original_name is trinomial (tri_strings) or binomial (bi_strings). The results appear as original_is_tri and original_is_bi in the resulting lutaxa.

atlas

Character. Name of galah atlas to use.

return_taxonomy

Logical. If TRUE, a list is returned containing the best match for each original_name in lutaxa and additional elements named for their rank (see envClean::lurank) with unique rows for that rank. There is one element per rank provided in needed_ranks.

limit

Logical. If TRUE, the returned list will be limited to those original_names in df.

needed_ranks

Character vector of ranks required in the returned list. Can be "all" or any combination of ranks from envClean::lurank greater than or equal to subspecies.

overrides

Used to override results returned by galah::search_taxa(). Dataframe with (at least) columns: taxa_col and taxa_to_search. Can also contain any number of use_x columns, where x is any of kingdom, phylum, class, order, family, genus, species, subspecies, variety and form. A two-step process then attempts to find better results than a search on taxa_col would give. Step 1 searches for taxa_to_search instead of taxa_col. If any use_x columns are present, step 2 then checks that the results from step 1 have a result at x. If not, level x results will be taken from use_x.
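
A minimal sketch of an overrides dataframe, assuming taxa_col is "original_name". The pairing of names is taken from the override comment in the examples (C. rubricollis binned to T. cucullatus); the use_species value is an illustrative assumption.

```r
# Hypothetical override table; column names follow the description above
overrides <- data.frame(
  original_name = "Charadrius rubricollis",  # name as it appears in df
  taxa_to_search = "Thinornis cucullatus",   # step 1: search this name instead
  use_species = "Thinornis cucullatus",      # step 2: used if step 1 has no species-level result
  stringsAsFactors = FALSE
)

overrides

# Would then be supplied as: make_taxonomy(df = ..., overrides = overrides)
```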

tweak_species

Logical. If TRUE (default) and the returned species column result ends in a full stop, the values returned in the species column will be directly taken from the scientific_name column. See details.

Value

NULL or list (depending on return_taxonomy). Writes taxonomy_file. taxa_col will be named original_name in any outputs. Note that taxa_col, as original_name, will have any quotes removed. If a list, it has the elements:

  • raw - the 'raw' results returned from galah::search_taxa(), tweaked by: column rank is an ordered factor as per envClean::lurank; rank_adj is a new column that will reflect the rank column unless rank is less than subspecies, in which case it will be subspecies; and original_is_(bi or tri) are new columns

  • needed_ranks - One element for each rank specified in needed_ranks.

    • lutaxa - dataframe. For each unique name in taxa_col (original_name), the best taxonomic bin to use, taking into account each level of needed_ranks

      • original_name - unique values from taxa_col

      • match_type - directly from galah::search_taxa()

      • matched_rank - rank column from galah::search_taxa()

      • returned_rank - the rank of the taxa returned for each original_name. This will never be lower than needed_rank but may be higher than needed_rank if no match was available at needed_rank. Use this 'rank' to filter bins in a cleaning workflow

      • taxa - the best taxa available for original_name at needed_rank, perhaps taking into account overrides

      • override - is the taxa the result of an override?

      • original_is_tri, original_is_bi - Experimental. Is the original_name a trinomial or binomial? Highlights cases where the matched rank is > subspecies but the original_name is probably a subspecies. Guesses are based on a word count after removal of: not_names; numbers; punctuation; capitalised words that are not the first word; and single letter 'words'. bi_strings or tri_strings override the guess, flagging TRUE. Note that this is only an (informed) guess at whether the original_name is binomial or trinomial.

    • taxonomy - dataframe. For each taxa in lutaxa a row of taxonomic hierarchy
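
The word-count guess described under original_is_tri can be sketched in base R. This is not the package's implementation: the helper names count_name_words and guess_is_tri are hypothetical, the patterns are subsets of the documented defaults, and treating exactly three remaining words as trinomial is an assumption.

```r
# Subsets of the documented defaults
not_names <- c("sp", "ssp", "var", "subsp", "subspecies", "race", "nov", "cf", "aff")
tri_strings <- c("\\svar\\.\\s", "\\ssubsp\\.", "\\srace\\)")

# Hypothetical helper: count the words left after the documented removals
count_name_words <- function(x, not_names) {
  words <- strsplit(x, "\\s+")[[1]]
  words <- gsub("[[:punct:]]|[[:digit:]]", "", words)  # strip punctuation and numbers
  rest <- words[-1]
  rest <- rest[!grepl("^[[:upper:]]", rest)]           # drop capitalised words after the first
  words <- c(words[1], rest)
  words <- words[nchar(words) > 1]                     # drop single-letter 'words' and empties
  words <- words[!tolower(words) %in% not_names]       # drop non-names
  length(words)
}

# Hypothetical helper: a tri_strings match overrides the word count
guess_is_tri <- function(x, not_names, tri_strings) {
  by_count <- count_name_words(x, not_names) == 3      # assumption: three words means trinomial
  by_string <- grepl(paste0(tri_strings, collapse = "|"), x)
  by_count || by_string
}

guess_is_tri("Eucalyptus viminalis cygnetensis", not_names, tri_strings)
guess_is_tri("Petrogale lateralis (MacDonnell Ranges race)", not_names, tri_strings)
```

In the second call only two words survive the count, so the result depends on the tri_strings match on "race)", which is the kind of override the description refers to.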

Details

The argument tweak_species replaces the galah::search_taxa() result in the species column with the result in the scientific_name column. This attempts to deal with instances where galah::search_taxa() returns odd results in species but good results in scientific_name. For example, galah::search_taxa("Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)") returns spec. in the species column but Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033) in the scientific_name column.

The previous envClean::make_taxonomy() function is still available via envClean::make_gbif_taxonomy().

Examples


  # setup
  # library("envClean")

  temp_file <- tempfile()

  taxa_df <- tibble::tibble(taxa = c("Charadrius rubricollis"
                                     , "Thinornis cucullatus"
                                     , "Melithreptus gularis laetior"
                                     , "Melithreptus gularis gularis"
                                     , "Eucalyptus viminalis"
                                     , "Eucalyptus viminalis cygnetensis"
                                     , "Eucalyptus"
                                     , "Charadrius mongolus all subspecies"
                                     , "Bettongia lesueur Barrow and Boodie Islands subspecies"
                                     , "Lagorchestes hirsutus Central Australian subspecies"
                                     , "Perameles gunnii Victorian subspecies"
                                     , "Pterostylis sp. Rock ledges (pl. 185, Bates & Weber 1990)"
                                     , "Spyridium glabrisepalum"
                                     , "Spyridium eriocephalum var. glabrisepalum"
                                     , "Petrogale lateralis (MacDonnell Ranges race)"
                                     , "Gehyra montium (revised)"
                                     , "Korthalsella japonica f. japonica"
                                     , "Galaxias sp. nov. 'Hunter'"
                                     , "Some rubbish"
                                     , "Senna artemisioides subsp x artemisioides"
                                     , "Halosarcia sp.  (NC)"
                                     , "TERMITOIDAE sp." # 'epifamily'
                                     )
                            )

  # make taxonomy (returns list and writes taxonomy_file)
  taxonomy <- make_taxonomy(df = taxa_df
                            , taxa_col = "taxa"
                            , taxonomy_file = temp_file
                            , needed_ranks = c("kingdom", "genus", "species", "subspecies")
                            )
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#>  try using `brew()
  taxonomy$raw
#> Error: object 'taxonomy' not found
  taxonomy$kingdom
#> Error: object 'taxonomy' not found
  taxonomy$genus
#> Error: object 'taxonomy' not found
  taxonomy$species
#> Error: object 'taxonomy' not found
  taxonomy$subspecies
#> Error: object 'taxonomy' not found

  # query more taxa: results are added to taxonomy_file but only the new taxa are returned (default `limit = TRUE`)
  more_taxa <- tibble::tibble(original_name = c("Amytornis whitei"
                                                , "Amytornis striatus"
                                                , "Amytornis modestus (North, 1902)"
                                                , "Amytornis modestus modestus"
                                                , "Amytornis modestus cowarie"
                                                )
                              )

  make_taxonomy(df = more_taxa
                , taxonomy_file = temp_file
                , needed_ranks = c("species")
                )
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#>  try using `brew()

  # no dataframe supplied - all results in taxonomy_file returned
  make_taxonomy(taxonomy_file = temp_file
                , needed_ranks = c("subspecies")
                )
#> Error: No such file: /tmp/RtmpOXQ9vM/file16b51319b9a592.parquet

  # Try automatic overrides
  auto_overrides <- make_unmatched_overrides(df = taxa_df
                                             , taxa_col = "taxa"
                                             , taxonomy = taxonomy
                                             , target_rank = "species"
                                             )
#> Error: object 'taxonomy' not found

  # overrides
  overrides <- envClean::taxonomy_overrides

  # C. rubricollis binned to Phalaropus lobatus at species level!
  taxonomy <- make_taxonomy(df = overrides
                            , taxonomy_file = temp_file
                            , needed_ranks = c("species", "subspecies")
                            )
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#>  try using `brew()

  taxonomy$species$lutaxa %>%
    dplyr::filter(grepl("rubricollis", original_name))
#> Error: object 'taxonomy' not found

  # add in override - C. rubricollis is binned to T. cucullatus at species level
  taxonomy <- make_taxonomy(df = overrides
                            , taxonomy_file = temp_file
                            , needed_ranks = c("species", "subspecies")
                            , overrides = overrides
                            )
#> Joining with `by = join_by(original_name)`
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_package(.pkg): No data stored by `potions`
#>  try using `brew()

  taxonomy$species$lutaxa %>%
    dplyr::filter(grepl("rubricollis", original_name))
#> Error: object 'taxonomy' not found


  # tweak_species example
  make_taxonomy(df = tibble::tibble(original_name = "Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)")
                , tweak_species = FALSE
                )$raw %>%
    dplyr::select(original_name, scientific_name, species)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#>  try using `brew()

  make_taxonomy(df = tibble::tibble(original_name = "Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)")
                , tweak_species = TRUE
                )$raw %>%
    dplyr::select(original_name, scientific_name, species)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#>  try using `brew()

  # clean up
  rm(taxonomy)
#> Warning: object 'taxonomy' not found
  # delete the parquet file written by make_taxonomy()
  unlink(paste0(temp_file, ".parquet"))