Overview


  pacs <- c("knitr"
            , "envClean"
            , "envReport"
            , "envFunc", "fs", "purrr"
            , "dplyr", "sf", "tibble"
            , "tmap", "raster", "rstanarm"
            )

  purrr::walk(pacs
              , ~suppressPackageStartupMessages(library(.
                                                        , character.only = TRUE
                                                        , quietly = TRUE)
                                                )
              )

  #  Load data
  flor_all <- tibble::as_tibble(envClean::flor_all)
  
  # What crs to use for maps?
  use_crs <- 3577 # actually an epsg code. see epsg.io
  
  # Transform the area of interest to the chosen coordinate reference system
  aoi <- envClean::aoi %>%
    sf::st_transform(crs = use_crs)

Installation

envClean is not on CRAN.

Install the development version from GitHub

remotes::install_github("acanthiza/envClean")

Load envClean

library("envClean")

Suggested workflow

After many, many iterations, the following workflow has been found to be ok. Only ok. There is no awesome when cleaning large, unstructured data.

Suggested steps in the cleaning process
clean	desc	order
all	starting point	0
temp_range	within a date (temporal) range	1
temp_bin	assigning dates to temporal bins	2
temp_rel	temporal reliability	3
geo_range	within a geographic area	4
geo_bin	assigning locations to spatial bins	5
geo_rel	spatial reliability	6
context	define context	7
att	add non-spatial attributes	8
geo	add geographic context (e.g. IBRA)	9
ann	non-persistent taxa	10
taxa	align taxonomy and resolve any taxonomic duplication within bins	11
single	singletons	12
out	outliers	13
NA	NA values in important columns	14
ll	NA values in latitude/longitude columns	15
effort	context effort	16
prop	proportion of sites	17
life	as a byproduct of assigning all records of a taxa a lifeform	18
cov	as a byproduct of assigning all records of a taxa a cover value	19
recent	the most recent visit to a cell	20
lists	add list length	21
filt_list_df	filter occurrence data to a set of criteria	22
fst	fix spatial taxonomy	23
fbd	filter by distribution	24
pres	presences only	25
coord	centroids of state, capital and institutions	26
ind	indigenous species	27
rm	geographic reliability	28
include	taxa with presences, reliable distributions, and/or mcp around presences	29
bin	add temporal, geographic and/or taxonomic bins	30
region_taxa	taxa found within a geographic area (but including their records outside that area)	31
spt_att	add spatially dependent attributes	32
fr	fix reliability	33
clean	final step in the generic cleaning process	34
tax_level	all records identified to at least a specific taxonomic level	35
novagrant	filter vagrant records	36
noextinct	filter records from an extinct part of a taxon's range	37
nohaven	filter records from inside havens	38
tg	was the list long enough to imply good survey effort (and therefore absence may be implied for taxa not recorded here)	39

Key concepts

Filter/clean/tidy

envClean helps with implementing:

filtering: remove rows of a data frame. These may be entirely legitimate observations but it is desirable to remove them for the purposes of a downstream analysis. For example, a [context] with only one (legitimate) record may not meet the expectations of an analysis that within each [context] there is a list of taxa recorded.
cleaning: remove observations to reduce the risk that spurious observations are included in downstream analysis. For example, two different data sources may contain the same observation. Most analyses will perform better when records duplicated within a context are removed.
tidying: as per tidy data (Wickham 2014) where each variable is a column and each observation is a unique row.

In practice these tasks are often blurred within each of the functions.

In general the process will be referred to as cleaning.
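The difference between cleaning and filtering can be sketched with plain dplyr. This is only an illustration, not how envClean implements either step: the `records` data, its column names and the choice of `cell` and `year` as context are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical records from two data sources (A and B).
# The first two rows are the same observation reported by both sources.
records <- tibble::tibble(
  cell   = c(1, 1, 1, 2),
  year   = c(2020, 2020, 2020, 2021),
  taxa   = c("Eucalyptus gracilis", "Eucalyptus gracilis",
             "Triodia irritans", "Acacia ligulata"),
  source = c("A", "B", "A", "A")
)

context <- c("cell", "year") # assumed context for this sketch

# Cleaning: remove the observation duplicated within a context
cleaned <- records %>%
  dplyr::distinct(dplyr::across(dplyr::any_of(c(context, "taxa"))))

# Filtering: remove contexts with only one (legitimate) record
filtered <- cleaned %>%
  dplyr::group_by(dplyr::across(dplyr::any_of(context))) %>%
  dplyr::filter(dplyr::n() > 1) %>%
  dplyr::ungroup()
```

Here the single record in cell 2 is a legitimate observation, yet it is dropped by the filtering step because a one-taxon list does not suit the (assumed) downstream analysis.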

Bins (for sites, visits, records, taxa)

Due to the loose definition of bins (see below), the definitions of site, visit, record and taxa can change through the cleaning process.

sites are spatial locations. They may be defined by latitude, longitude, easting, northing and/or cell. These may be duplicated before exclusive application of context. They are not necessarily defined by all spatial concepts within context at all stages of the cleaning process. In the env packages, spatial bins are usually set by add_raster_cell.
visits are sites plus a time, such as year, month, day (or even hour). Again, until context is applied exclusively, these may be duplicated. In the env packages, temporal bins are usually year, month or, occasionally, day.
records are visits plus an observation to some level of the taxonomic hierarchy (referred to simply as ‘taxa’).
taxa refers to some form of taxonomic entity. An entity may be duplicated within a visit before taxonomy is resolved and context is applied exclusively. In the env packages, taxonomic bins are usually set by make_taxonomy(target_rank = "desired_rank") where "desired_rank" could be, say, "species" or "subspecies".
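As a rough illustration of how sites, visits and records nest, the sketch below derives each from one small table. The `obs` data and its `cell`, `year` and `taxa` columns are hypothetical, and plain `dplyr::distinct()` stands in for the envClean functions that would normally assign the bins.

```r
library(dplyr)

# Hypothetical observations that have already been assigned bins
obs <- tibble::tibble(
  cell = c(10, 10, 10, 11),
  year = c(2019, 2019, 2020, 2020),
  taxa = c("Acacia ligulata", "Triodia irritans",
           "Acacia ligulata", "Acacia ligulata")
)

sites   <- dplyr::distinct(obs, cell)             # spatial bin only
visits  <- dplyr::distinct(obs, cell, year)       # site plus time
records <- dplyr::distinct(obs, cell, year, taxa) # visit plus taxa
```

Changing the bins (say, adding month to the temporal bin) changes what counts as a visit, which is why these definitions can shift through the cleaning process.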

Throughout the series of env packages the concept of context is used extensively and, at least currently, somewhat loosely. Context supplies the bins: spatial, temporal and taxonomic.

With respect to ‘loosely’: context may be defined by, say, c("lat", "long", "cell", "year", "month"). At various stages through the cleaning process, not every one of those variables may be applicable. After running add_raster_cell (to assign a spatial bin) the variables lat and long may be removed (depending on the add_xy argument). However, context can still be used in full in later cleaning steps (via the consistent use of tidyselect::any_of in envClean functions).

Note that context must be applied exclusively at some point in the cleaning process (by, say, dplyr::distinct(across(any_of(context)))). Until that point extraneous fields/columns beyond context are maintained; and no claim is made regarding the uniqueness of ‘records’ until this step in the process.
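A minimal sketch of this behaviour, using a hypothetical `binned` data frame in which lat and long have already been dropped: `any_of()` silently skips the context columns that are absent, and `distinct()` applied to only the context columns discards everything else (here the made-up `observer` column).

```r
library(dplyr)

context <- c("lat", "long", "cell", "year", "month")

# Hypothetical data after spatial binning: lat and long have been removed,
# and an extraneous column (observer) is still present
binned <- tibble::tibble(
  cell     = c(1, 1, 2),
  year     = c(2020, 2020, 2020),
  month    = c(5, 5, 6),
  observer = c("x", "y", "x")
)

# Apply context exclusively: only context columns that exist are kept,
# and rows are made unique on them
visits <- binned %>%
  dplyr::distinct(dplyr::across(dplyr::any_of(context)))
```

Using `all_of()` instead would raise an error here because lat and long are missing, which is the reason `any_of()` suits a context that is only partly present at a given step.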

Summarising the cleaning process

There are some cleaning process summary functions. Taking advantage of these requires:

consistent naming with a prefix, default bio_
- the suffix is a short name for that step in the cleaning process, e.g. bio_taxa would be the object created when applying the taxonomic bins, and bio_geo_bin is the object created when applying geographic bins
- see envClean::luclean for the suffixes/short names (in the clean column)
addition of a ctime (creation time) attribute, probably using envFunc::add_time_stamp()

The function clean_summary() then prepares information, based on the objects created through the cleaning process, that can be used in summary reports. Optionally (the default is TRUE), clean_summary() also saves the start and end objects from the cleaning process.
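The two requirements can be illustrated in base R. This sketch is not the implementation of clean_summary(): it only shows why a shared prefix and a ctime attribute make the steps discoverable and orderable. The `steps` environment, the toy data, and the use of `Sys.time()` in place of a time stamp helper are all assumptions.

```r
# Hypothetical step objects kept in their own environment
steps <- new.env()

steps$bio_all <- data.frame(cell = c(1, 1, 2, NA),
                            taxa = c("a", "a", "b", "c"))
attr(steps$bio_all, "ctime") <- Sys.time() # stand-in for a time stamp helper

# Next step: keep only rows that received a spatial bin
steps$bio_geo_bin <- steps$bio_all[!is.na(steps$bio_all$cell), ]
attr(steps$bio_geo_bin, "ctime") <- Sys.time()

# Find every step by its prefix and recover the short name from the suffix
step_names <- ls(steps, pattern = "^bio_")
step_objs  <- unname(mget(step_names, envir = steps))

step_summary <- data.frame(
  clean = sub("^bio_", "", step_names),
  rows  = vapply(step_objs, nrow, integer(1)),
  ctime = do.call(c, lapply(step_objs, attr, "ctime"))
)

# Order the steps by when each object was created
step_summary <- step_summary[order(step_summary$ctime), ]
```

The `clean` column produced here is the kind of short name that could be matched against the clean column of envClean::luclean.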

cleaning_text() prepares text, based on a cleaning summary, that can be used directly in .Rmd.

There are also small .Rmd files in /inst that match the suffix for each step. Looping through these child files from a main .Rmd provides the structure for the output report.

Coordinate reference systems

There are two (possibly three) main coordinate reference systems (crs) to worry about:

the crs for the original records. If these are in decimal degrees, using epsg = 4283 is likely to return the correct crs.
the crs you’d like to use for most spatial data. Set here (in setup chunk) to use_crs = 3577. It is likely that a projected crs will work best, particularly for buffering, filtering etc.
the crs for any other spatial data imported to help with cleaning. Try using sf::st_read("random_shape_file.shp") %>% sf::st_transform(crs = use_crs) to deal with this.
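The first two cases can be sketched with sf as follows. The coordinates and the 1 km buffer are made up for illustration, and the sketch assumes the original records are in decimal degrees on GDA94 (epsg 4283).

```r
library(sf)

use_crs <- 3577 # projected crs used for most spatial work

# Hypothetical records in decimal degrees
recs <- data.frame(long = c(138.6, 140.1),
                   lat  = c(-34.9, -36.2))

# Declare the crs of the original records, then transform to use_crs
recs_sf   <- sf::st_as_sf(recs, coords = c("long", "lat"), crs = 4283)
recs_proj <- sf::st_transform(recs_sf, crs = use_crs)

# With a projected crs in metres, buffer distances are in metres
recs_buf <- sf::st_buffer(recs_proj, dist = 1000)
```

Note that `crs = 4283` in `st_as_sf()` only declares the crs the coordinates are already in, whereas `st_transform()` actually reprojects them.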

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

Department for Environment and Water

Nigel Willoughby

Friday, 05 June, 2026