  pacs <- c("knitr"
            , "envClean"
            , "envReport"
            , "envFunc", "fs", "purrr"
            , "dplyr", "sf", "tibble"
            , "tmap", "raster", "rstanarm"
            )

  purrr::walk(pacs
              , ~suppressPackageStartupMessages(library(.
                                                        , character.only = TRUE
                                                        , quietly = TRUE)
                                                )
              )

  #  Load data
  flor_all <- tibble::as_tibble(envClean::flor_all)
  
  # Coordinate reference system to use for maps (an EPSG code, see epsg.io)
  use_crs <- 3577

  # Transform the area of interest to that coordinate reference system
  aoi <- envClean::aoi %>%
    sf::st_transform(crs = use_crs)
  

Installation

envClean is not on CRAN.

Install the development version from GitHub

remotes::install_github("acanthiza/envClean")

Load envClean

Suggested workflow

After many, many iterations, the following workflow has been found to be ok. Only ok. There is no awesome when cleaning large, unstructured data.

Suggested steps in the cleaning process

| clean | desc | order |
|---|---|---|
| all | starting point | 0 |
| temp_range | within a date (temporal) range | 1 |
| temp_bin | assigning dates to temporal bins | 2 |
| temp_rel | temporal reliability | 3 |
| geo_range | within a geographic area | 4 |
| geo_bin | assigning locations to spatial bins | 5 |
| geo_rel | spatial reliability | 6 |
| context | define context | 7 |
| att | add non-spatial attributes | 8 |
| geo | add geographic context (e.g. IBRA) | 9 |
| ann | non-persistent taxa | 10 |
| taxa | align taxonomy and resolve any taxonomic duplication within bins | 11 |
| single | singletons | 12 |
| out | outliers | 13 |
| NA | NA values in important columns | 14 |
| ll | NA values in latitude/longitude columns | 15 |
| effort | context effort | 16 |
| prop | proportion of sites | 17 |
| life | as a byproduct of assigning all records of a taxa a lifeform | 18 |
| cov | as a byproduct of assigning all records of a taxa a cover value | 19 |
| recent | the most recent visit to a cell | 20 |
| lists | add list length | 21 |
| filt_list_df | filter occurrence data to a set of criteria | 22 |
| fst | fix spatial taxonomy | 23 |
| fbd | filter by distribution | 24 |
| pres | presences only | 25 |
| coord | centroids of state, capital and institutions | 26 |
| ind | indigenous species | 27 |
| rm | geographic reliability | 28 |
| include | taxa with presences, reliable distributions, and/or mcp around presences | 29 |
| bin | add temporal, geographic and/or taxonomic bins | 30 |
| region_taxa | taxa found within a geographic area (but including their records outside that area) | 31 |
| spt_att | add spatially dependent attributes | 32 |
| fr | fix reliability | 33 |
| clean | final step in the generic cleaning process | 34 |
| tax_level | all records identified to at least a specific taxonomic level | 35 |
| novagrant | filter vagrant records | 36 |
| noextinct | filter records from an extinct part of a taxa's range | 37 |
| nohaven | filter records from inside havens | 38 |
| tg | was the list long enough to imply good survey effort (and therefore absence may be implied for taxa not recorded here) | 39 |

Key concepts

Filter/clean/tidy

envClean helps with implementing:

  • filtering: remove rows of a data frame. These may be entirely legitimate observations but it is desirable to remove them for the purposes of a downstream analysis. For example, a context with only one (legitimate) record may not suit an analysis that expects a list of taxa to be recorded within each context.
  • cleaning: remove observations to reduce the risk that spurious observations are included in downstream analysis. For example, two different data sources may contain the same observation. Most analyses will perform better when records duplicated within a context are removed.
  • tidying: as per tidy data (Wickham 2014) where each variable is a column and each observation is a unique row.
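The difference between filtering and cleaning can be sketched on a toy data frame. This is only an illustration using dplyr directly: the column names and the object names (obs, cleaned, filtered) are assumptions made for the example, not envClean functions or data.

```r
library(dplyr)

# Toy occurrence data (hypothetical columns, not an envClean dataset)
obs <- tibble::tribble(
  ~cell, ~year, ~taxa,        ~source,
  1,     2020,  "Acacia",     "A",
  1,     2020,  "Acacia",     "B", # same observation from a second source
  1,     2020,  "Eucalyptus", "A",
  2,     2021,  "Acacia",     "A"  # the only record in its context
)

context <- c("cell", "year")

# Cleaning: remove records duplicated within a context
cleaned <- obs %>%
  distinct(across(all_of(c(context, "taxa"))))

# Filtering: remove (legitimate) contexts that hold only one record
filtered <- cleaned %>%
  group_by(across(all_of(context))) %>%
  filter(n() > 1) %>%
  ungroup()
```

Here cleaning drops the duplicated Acacia record, while filtering drops the context (cell 2, year 2021) whose single record is legitimate but unsuitable for an analysis that expects a list of taxa.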

In practice these tasks are often blurred within each of the functions.

In general the process will be referred to as cleaning.

Bins (for sites, visits, records, taxa)

Due to the loose definition of bins (see below), the definitions of site, visit, record and taxa can change through the cleaning process.

  • sites are spatial locations. They may be defined by latitude, longitude, easting, northing and/or cell. These may be duplicated before exclusive application of context. They are not necessarily defined by all spatial concepts within context at all stages of the cleaning process. In the env packages, spatial bins are usually set by add_raster_cell.
  • visits are sites plus a time, such as year, month, day (or even hour). Again, until context is applied exclusively, these may be duplicated. In the env packages, temporal bins are usually year, month or, occasionally, day.
  • records are visits plus an observation to some level of the taxonomic hierarchy (referred to simply as 'taxa').
  • taxa refers to some form of taxonomic entity. An entity may be duplicated within a visit before taxonomy is resolved and context is applied exclusively. In the env packages, taxonomic bins are usually set by make_taxonomy(target_rank = "desired rank") where "desired rank" could be, say, "species" or "subspecies".
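How the counts of sites, visits and records depend on the chosen bins can be sketched as follows. The helper count_bins() and the toy columns are hypothetical and exist only for this illustration; the cell column stands in for a spatial bin such as one assigned by add_raster_cell.

```r
library(dplyr)

# Toy data: 'cell' stands in for a spatial bin (hypothetical values)
obs <- tibble::tibble(
  cell = c(1, 1, 1, 2, 2),
  date = as.Date(c("2020-03-01", "2020-04-15", "2021-07-02",
                   "2020-03-09", "2020-03-09")),
  taxa = c("Acacia", "Acacia", "Acacia", "Acacia", "Eucalyptus")
) %>%
  mutate(year = as.integer(format(date, "%Y")),
         month = as.integer(format(date, "%m")))

# Hypothetical helper: count sites, visits and records for given temporal bins
count_bins <- function(df, temporal) {
  tibble::tibble(
    sites = nrow(distinct(df, cell)),
    visits = nrow(distinct(df, across(all_of(c("cell", temporal))))),
    records = nrow(distinct(df, across(all_of(c("cell", temporal, "taxa")))))
  )
}

by_year <- count_bins(obs, "year")
by_month <- count_bins(obs, c("year", "month"))
```

With yearly bins the two Acacia observations in cell 1 during 2020 collapse to a single record; with monthly bins they remain separate visits and records.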

Throughout the series of env packages the concept of context is used extensively and, at least currently, somewhat loosely. Context supplies the bins: spatial, temporal and taxonomic.

With respect to 'loosely': context may be defined by, say, c("lat", "long", "cell", "year", "month"). At various stages through the cleaning process not every one of those variables may be applicable. After running add_raster_cell (to assign a spatial bin) the variables lat and long may be removed (depending on the add_xy argument). However, context can still be used in full in cleaning steps (via the consistent use of tidyselect::any_of in envClean functions).

Note that context must be applied exclusively at some point in the cleaning process (by, say, dplyr::distinct(across(any_of(context)))). Until that point, extraneous fields/columns beyond context are retained, and no claim is made regarding the uniqueness of 'records'.
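A minimal sketch of this behaviour, assuming only dplyr: the toy data frame and the observer column are invented for the example, and dropping lat and long by hand stands in for what may happen after a spatial bin is assigned.

```r
library(dplyr)

context <- c("lat", "long", "cell", "year", "month")

# Toy data with one extraneous column beyond context ('observer')
obs <- tibble::tibble(
  lat = c(-34.9, -34.9, -35.1),
  long = c(138.6, 138.6, 138.7),
  cell = c(10, 10, 11),
  year = c(2020, 2020, 2020),
  month = c(3, 3, 3),
  observer = c("a", "b", "a")
)

# Stand-in for the state after spatial binning: lat and long removed
binned <- obs %>%
  select(-lat, -long)

# any_of() ignores context columns that are no longer present,
# whereas all_of() fails when any are missing
strict_fails <- inherits(
  try(select(binned, all_of(context)), silent = TRUE),
  "try-error"
)

# Applying context exclusively: only context columns remain, rows are unique
exclusive <- binned %>%
  distinct(across(any_of(context)))
```

After the last step the observer column is gone and the two rows sharing cell 10 have collapsed to one.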

Summarising the cleaning process

There are some cleaning process summary functions. Taking advantage of these requires:

  • consistent naming with a prefix, default bio_
    • the suffix is a short name for that step in the cleaning process, e.g. bio_taxa would be the object created when applying the taxonomic bins, and bio_geo_bin is the object created when applying geographic bins
    • see envClean::luclean for the suffixes/short names (in the clean column)
  • addition of a ctime (creation time) attribute, probably using envFunc::add_time_stamp()

The function clean_summary() then prepares information, based on the objects created through the cleaning process, that can be used in summary reports. Optionally (default TRUE), clean_summary() also saves the start and end objects from the cleaning process.

cleaning_text() prepares text, based on a cleaning summary, that can be used directly in .Rmd.

There are also small .Rmd files in /inst that match the suffix for each step. Looping through these child files from a main .Rmd provides the structure for the output report.
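The naming and time stamp conventions described in this section can be illustrated in base R. This is a sketch only: it does not call clean_summary() or envFunc::add_time_stamp(), whose exact behaviour is not reproduced here, and setting the ctime attribute by hand is an assumption standing in for the latter.

```r
# Objects named with the default prefix 'bio_' plus a step suffix
bio_all <- data.frame(taxa = c("a", "b", "c"))
attr(bio_all, "ctime") <- Sys.time() # stand-in for a time stamp helper

bio_taxa <- bio_all[bio_all$taxa != "c", , drop = FALSE]
attr(bio_taxa, "ctime") <- Sys.time()

# Collect every object carrying the prefix
steps <- ls(pattern = "^bio_")
objs <- mget(steps, envir = environment())

# One row per cleaning step: suffix, row count and creation time
ctimes <- vapply(objs, function(x) as.numeric(attr(x, "ctime")), numeric(1))

summary_df <- data.frame(
  step = sub("^bio_", "", steps),
  rows = unname(vapply(objs, nrow, integer(1))),
  ctime = as.POSIXct(unname(ctimes), origin = "1970-01-01")
)

# Order steps by when their objects were created
summary_df <- summary_df[order(summary_df$ctime), ]
```

A consistent prefix makes the step objects discoverable by pattern, and the ctime attribute lets them be ordered without relying on their names.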

Coordinate reference systems

There are two (possibly three) main coordinate reference systems (crs) to worry about:

  1. the crs for the original records. If these are in decimal degrees, using epsg = 4283 (GDA94) is likely to return the correct crs.
  2. the crs you’d like to use for most spatial data. Set here (in setup chunk) to use_crs = 3577. It is likely that a projected crs will work best, particularly for buffering, filtering etc.
  3. the crs for any other spatial data imported to help with cleaning. Try using sf::st_read("random_shape_file.shp") %>% sf::st_transform(crs = use_crs) to deal with this.
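The first two coordinate reference systems can be sketched with sf as follows. The coordinates are made up for illustration, and the 1000 m buffer is an arbitrary example of an operation that benefits from a projected crs.

```r
library(sf)

use_crs <- 3577 # projected crs used for most spatial work

# Hypothetical records in decimal degrees
recs <- data.frame(
  long = c(138.6, 140.8),
  lat = c(-34.9, -37.8)
)

# 1. declare the crs of the original records (decimal degrees)
recs_sf <- st_as_sf(recs, coords = c("long", "lat"), crs = 4283)

# 2. transform to the crs used for most spatial data
recs_proj <- st_transform(recs_sf, crs = use_crs)

# Distances are now in metres, so buffering is straightforward
recs_buf <- st_buffer(recs_proj, dist = 1000)
```

Note that st_as_sf() only declares the crs the coordinates are already in, whereas st_transform() actually reprojects them.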

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.