Only queries galah for taxa not already in taxonomy_file. Can return a list
with one element per requested level of the taxonomic hierarchy, each holding
the 'best' match at that level. For example, if 'genus' is provided in
needed_ranks, the returned list will have an element 'genus' that contains,
in a column named taxa and for each of the original names provided, the best
result at genus level, or at a higher level where no genus level match was
available.
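The 'best result at that level or higher' idea can be illustrated with a small base R sketch. This is not the package's implementation: the results data frame is hand-made rather than real galah output, and best_at_rank() is a hypothetical helper written only for this illustration.

```r
# Hypothetical, hand-made search results (not real galah output)
results <- data.frame(
  original_name = c("Eucalyptus viminalis", "Eucalyptus", "Some rubbish"),
  kingdom = c("Plantae", "Plantae", NA),
  genus = c("Eucalyptus", "Eucalyptus", NA),
  species = c("Eucalyptus viminalis", NA, NA),
  stringsAsFactors = FALSE
)

# Ranks ordered from highest to lowest (a short subset, for illustration)
ranks <- c("kingdom", "genus", "species")

# Hypothetical helper: for one needed rank, take the match at that rank or,
# failing that, at the next available higher rank
best_at_rank <- function(df, needed_rank, ranks) {
  # candidate ranks, starting at needed_rank and moving up the hierarchy
  candidates <- rev(ranks[seq_len(match(needed_rank, ranks))])
  taxa <- rep(NA_character_, nrow(df))
  returned_rank <- rep(NA_character_, nrow(df))
  for (r in candidates) {
    fill <- is.na(taxa) & !is.na(df[[r]])
    taxa[fill] <- df[[r]][fill]
    returned_rank[fill] <- r
  }
  data.frame(
    original_name = df$original_name,
    taxa = taxa,
    returned_rank = returned_rank,
    stringsAsFactors = FALSE
  )
}

species_bins <- best_at_rank(results, "species", ranks)
species_bins
```

Here "Eucalyptus" has no species level match, so its bin falls back to genus, and "Some rubbish" has no match at any level.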
make_taxonomy(
  df = NULL,
  taxa_col = "original_name",
  taxonomy_file = tempfile(),
  force_new = list(original_name = NULL, timediff = as.difftime(26, units = "weeks")),
  remove_taxa = c("bold:", "unverified", "undetermined", "unidentified", "annual herb",
    "annual grass", "incertae sedis", "\\?", "another\\s", "not naturalised in sa",
    "annual tussock grass", "*no id", "spec\\.", "aquatic grass"),
  remove_strings = c("\\s\\-\\-\\s.*", "\\ssp\\.$", "\\sssp\\.$",
    "\\sspec\\.$", "\\ssp$", "\\sssp$", "\\ssp\\d$", "dead", "\\sx\\s.*",
    "\\sX\\s.*", "unknown", "\\scultivar$", "\\scomplex$", "\\(NC\\)",
    "\\saff\\."),
  not_names = c("sp", "ssp", "var", "subsp", "subspecies", "form", "race", "nov", "aff",
    "cf", "lineage", "group", "et", "al", "and", "pl", "revised", "nov", "sensu", "lato",
    "hybrid", "complex"),
  tri_strings = c("\\sssp\\s", "\\sssp\\.\\s", "\\svar\\s",
    "\\svar\\.\\s", "\\ssubsp\\.", "\\ssubspecies", "\\sform\\)",
    "\\sform\\s", "\\sf\\.", "\\srace\\s", "\\srace\\)",
    "\\sp\\.v\\."),
  bi_strings = c("all\\ssubspecies", "\\ssp\\s", "\\ssp\\.\\s",
    "\\sspecies", "\\sspp\\.\\s", "\\sspp\\s"),
  atlas = c("Australia"),
  tweak_species = TRUE,
  return_taxonomy = TRUE,
  limit = TRUE,
  needed_ranks = c("species"),
  overrides = NULL
)
df - Dataframe with taxa_col. Can be NULL only if taxonomy_file
already exists.
taxa_col - Character or index. Name or index of the column with taxa names.
Each unique taxon in this column will be queried against galah::search_taxa()
and appear in the results list element lutaxa in a column called
original_name.
taxonomy_file - Character. File path to save results to. Any file extension
is ignored; a .parquet file is used.
force_new - List with elements timediff and any column name from
taxonomy_file. If taxonomy_file already exists, then for any column name
shared between force_new and taxonomy_file, the matching levels within that
column will be requeried. Likewise, any original_name that has not been
searched within timediff will be requeried. Set either to NULL to ignore.
remove_taxa - Character. Regular expressions. Rows where
tolower(taxa_col) matches remove_taxa are removed (the whole row is removed).
remove_strings - Character. Text that matches remove_strings is
removed from taxa_col before searching (the text, not the row, is removed).
not_names - Character. Text that matches not_names is used to
remove non-names from original_name before a word count that indicates
(guesses) whether the original_name is trinomial (original_is_tri field in
lutaxa).
tri_strings, bi_strings - Character. Text that matches these strings is
used to indicate whether the original_name is trinomial or binomial. The
results appear as original_is_tri and original_is_bi in the resulting lutaxa.
atlas - Character. Name of galah atlas to use.
return_taxonomy - Logical. If TRUE, a list is returned containing the
best match for each original_name in lutaxa and additional elements named
for their rank (see envClean::lurank) with unique rows for that rank. One
element per rank provided in needed_ranks.
limit - Logical. If TRUE, the returned list will be limited to those
original_names in df.
needed_ranks - Character vector of ranks required in the returned list.
Can be "all" or any combination of ranks from envClean::lurank greater than
or equal to subspecies.
overrides - Used to override results returned by galah::search_taxa().
Dataframe with (at least) columns: taxa_col and taxa_to_search.
Can also contain any number of use_x columns where x is any of
kingdom, phylum, class, order, family, genus, species, subspecies, variety
and form. A two step process then attempts to find better results than if
searched on taxa_col. Step 1 searches for taxa_to_search instead of taxa_col.
If any use_x columns are present, step 2 then checks that the results from
step 1 have a result at x. If not, level x results will be taken from use_x.
tweak_species - Logical. If TRUE (default) and the returned species
column result ends in a full stop, the values returned in the species
column will be taken directly from the scientific_name column. See details.
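The difference between remove_taxa (rows are dropped) and remove_strings (text is stripped, the row stays) can be sketched in base R. This is only an illustration of the described behaviour, not the function's internals: the input names are hypothetical, the patterns are a small subset of the defaults, and combining the patterns with "|" is an assumption.

```r
# Hypothetical input names
taxa <- c(
  "Eucalyptus viminalis ssp.",
  "Unidentified herb",
  "Spyridium glabrisepalum (NC)",
  "Acacia sp."
)

# A small subset of the default patterns
remove_taxa <- c("unverified", "unidentified", "\\?")
remove_strings <- c("\\ssp\\.$", "\\sssp\\.$", "\\(NC\\)")

# remove_taxa: drop the whole entry when the lower-cased name matches
drop <- grepl(paste0(remove_taxa, collapse = "|"), tolower(taxa))
kept <- taxa[!drop]

# remove_strings: strip the matching text but keep the entry
searched <- trimws(gsub(paste0(remove_strings, collapse = "|"), "", kept))

data.frame(original = kept, searched = searched)
```

"Unidentified herb" disappears entirely, while the other three names are kept and searched as "Eucalyptus viminalis", "Spyridium glabrisepalum" and "Acacia".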
NULL or list (depending on return_taxonomy). Writes
taxonomy_file. taxa_col will be named original_name in any outputs. Note that
taxa_col, as original_name, will have any quotes removed.
If a list, it has the elements:
raw - the 'raw' results returned from galah::search_taxa(), tweaked
as follows: the rank column is an ordered factor as per envClean::lurank;
rank_adj is a new column that matches the rank column unless rank is
lower than subspecies, in which case it is subspecies; and
original_is_bi and original_is_tri are new columns
needed_ranks - One element for each rank specified in needed_ranks.
lutaxa - dataframe. For each unique name in taxa_col (as
original_name), the best taxonomic bin to use, taking into
account each level of needed_ranks
original_name - unique values from taxa_col
match_type - directly from galah::search_taxa()
matched_rank - rank column from galah::search_taxa()
returned_rank - the rank of the taxa returned for each
original_name. This will never be lower than needed_rank but
may be higher than needed_rank if no match was available at
needed_rank. Use this 'rank' to filter bins in a cleaning
workflow
taxa - the best taxa available for original_name at
needed_rank, perhaps taking into account overrides
override - is the taxa the result of an override?
original_is_tri, original_is_bi - Experimental. Is the
original_name a trinomial or binomial? Highlights cases where
the matched rank is higher than subspecies but the original_name is
probably a subspecies. Guesses are based on a word count after
removal of: not_names; numbers; punctuation; capitalised words
that are not the first word; and single letter 'words'.
A match to bi_strings or tri_strings overrides the guess, flagging TRUE.
Note that this is only an (informed) guess at whether the
original_name is binomial or trinomial.
taxonomy - dataframe. For each taxa in lutaxa a row of
taxonomic hierarchy
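The word count guess described for original_is_tri can be sketched in base R as below. This is a rough reading of the description, not the package's code: guess_is_tri() is a hypothetical helper, the not_names and tri_strings vectors are small subsets of the defaults, and treating exactly three remaining words as a trinomial is an assumption.

```r
# Small subsets of the defaults, for illustration only
not_names <- c("sp", "ssp", "var", "subsp", "subspecies", "form", "race", "revised")
tri_strings <- c("\\svar\\.\\s", "\\ssubsp\\.")

# Hypothetical helper: guess whether a name is trinomial
guess_is_tri <- function(name, not_names, tri_strings) {
  # an explicit trinomial marker overrides the word count
  if (any(vapply(tri_strings, grepl, logical(1), x = name))) {
    return(TRUE)
  }
  # remove punctuation and numbers, then split into words
  cleaned <- gsub("[[:punct:][:digit:]]", "", name)
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  # drop capitalised words after the first, non-names and single letters
  not_first <- seq_along(words) > 1
  drop <- (not_first & grepl("^[[:upper:]]", words)) |
    tolower(words) %in% not_names |
    nchar(words) <= 1
  # assumption: exactly three remaining words indicates a trinomial
  sum(!drop) == 3
}

names_to_check <- c(
  "Eucalyptus viminalis",
  "Eucalyptus viminalis cygnetensis",
  "Spyridium eriocephalum var. glabrisepalum",
  "Korthalsella japonica f. japonica",
  "Gehyra montium (revised)"
)

is_tri <- vapply(
  names_to_check,
  guess_is_tri,
  logical(1),
  not_names = not_names,
  tri_strings = tri_strings
)
is_tri
```

Under these assumptions the second, third and fourth names are flagged as trinomial; the first and last are not.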
The argument tweak_species replaces the galah::search_taxa() result in
the species column with the result in the scientific_name column. This
attempts to deal with instances where galah::search_taxa() returns odd
results in species but good results in scientific_name, e.g.
galah::search_taxa("Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)")
returns spec. in the species column but
Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033) in the
scientific_name column.
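A minimal base R sketch of the tweak_species rule (species values ending in a full stop are replaced by scientific_name) follows. The data frame is hand-made: the first row uses the values quoted above and the second row is a hypothetical, well-behaved result; it does not call galah and is not the package's implementation.

```r
# Hand-made rows shaped like search results (not real galah output)
raw <- data.frame(
  scientific_name = c(
    "Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)",
    "Eucalyptus viminalis"
  ),
  species = c("spec.", "Eucalyptus viminalis"),
  stringsAsFactors = FALSE
)

# Rows where the species result ends in a full stop
ends_in_stop <- !is.na(raw$species) & grepl("\\.$", raw$species)

# Take species directly from scientific_name for those rows only
raw$species[ends_in_stop] <- raw$scientific_name[ends_in_stop]

raw$species
```

Only the first row changes; the second row already has a usable species value and is left alone.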
Previous envClean::make_taxonomy() function is still available via
envClean::make_gbif_taxonomy()
# setup
# library("envClean")
temp_file <- tempfile()
taxa_df <- tibble::tibble(taxa = c("Charadrius rubricollis"
, "Thinornis cucullatus"
, "Melithreptus gularis laetior"
, "Melithreptus gularis gularis"
, "Eucalyptus viminalis"
, "Eucalyptus viminalis cygnetensis"
, "Eucalyptus"
, "Charadrius mongolus all subspecies"
, "Bettongia lesueur Barrow and Boodie Islands subspecies"
, "Lagorchestes hirsutus Central Australian subspecies"
, "Perameles gunnii Victorian subspecies"
, "Pterostylis sp. Rock ledges (pl. 185, Bates & Weber 1990)"
, "Spyridium glabrisepalum"
, "Spyridium eriocephalum var. glabrisepalum"
, "Petrogale lateralis (MacDonnell Ranges race)"
, "Gehyra montium (revised)"
, "Korthalsella japonica f. japonica"
, "Galaxias sp. nov. 'Hunter'"
, "Some rubbish"
, "Senna artemisioides subsp x artemisioides"
, "Halosarcia sp. (NC)"
, "TERMITOIDAE sp." # 'epifamily'
)
)
# make taxonomy (returns list and writes taxonomy_file)
taxonomy <- make_taxonomy(df = taxa_df
, taxa_col = "taxa"
, taxonomy_file = temp_file
, needed_ranks = c("kingdom", "genus", "species", "subspecies")
)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#> ℹ try using `brew()
taxonomy$raw
#> Error: object 'taxonomy' not found
taxonomy$kingdom
#> Error: object 'taxonomy' not found
taxonomy$genus
#> Error: object 'taxonomy' not found
taxonomy$species
#> Error: object 'taxonomy' not found
taxonomy$subspecies
#> Error: object 'taxonomy' not found
# query more taxa (results are added to taxonomy_file, but only the new taxa are returned by default: `limit = TRUE`)
more_taxa <- tibble::tibble(original_name = c("Amytornis whitei"
, "Amytornis striatus"
, "Amytornis modestus (North, 1902)"
, "Amytornis modestus modestus"
, "Amytornis modestus cowarie"
)
)
make_taxonomy(df = more_taxa
, taxonomy_file = temp_file
, needed_ranks = c("species")
)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#> ℹ try using `brew()
# no dataframe supplied - all results in taxonomy_file returned
make_taxonomy(taxonomy_file = temp_file
, needed_ranks = c("subspecies")
)
#> Error: No such file: /tmp/RtmpOXQ9vM/file16b51319b9a592.parquet
# Try automatic overrides
auto_overrides <- make_unmatched_overrides(df = taxa_df
, taxa_col = "taxa"
, taxonomy = taxonomy
, target_rank = "species"
)
#> Error: object 'taxonomy' not found
# overrides
overrides <- envClean::taxonomy_overrides
# C. rubricollis binned to Phalaropus lobatus at species level!
taxonomy <- make_taxonomy(df = overrides
, taxonomy_file = temp_file
, needed_ranks = c("species", "subspecies")
)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#> ℹ try using `brew()
taxonomy$species$lutaxa %>%
dplyr::filter(grepl("rubricollis", original_name))
#> Error: object 'taxonomy' not found
# add in override - C. rubricollis is binned to T. cucullatus at species level
taxonomy <- make_taxonomy(df = overrides
, taxonomy_file = temp_file
, needed_ranks = c("species", "subspecies")
, overrides = overrides
)
#> Joining with `by = join_by(original_name)`
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_package(.pkg): No data stored by `potions`
#> ℹ try using `brew()
taxonomy$species$lutaxa %>%
dplyr::filter(grepl("rubricollis", original_name))
#> Error: object 'taxonomy' not found
# tweak_species example
make_taxonomy(df = tibble::tibble(original_name = "Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)")
, tweak_species = FALSE
)$raw %>%
dplyr::select(original_name, scientific_name, species)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#> ℹ try using `brew()
make_taxonomy(df = tibble::tibble(original_name = "Acacia sp. Small Red-leaved Wattle (J.B.Williams 95033)")
, tweak_species = TRUE
)$raw %>%
dplyr::select(original_name, scientific_name, species)
#> Joining with `by = join_by(original_name)`
#> Error in check_pour_interactive(.slot): No data stored by `potions`
#> ℹ try using `brew()
# clean up
rm(taxonomy)
#> Warning: object 'taxonomy' not found
unlink(paste0(temp_file, ".parquet"))