The background sampling includes code based on a Geographic Information Systems stack exchange answer by user Spacedman.
Usage
prep_sdm(
this_taxa = NULL,
out_dir = FALSE,
return_val = "path",
presence,
pres_col = "pa",
pres_val = 1,
pres_crs = 4326,
pres_x = "long",
pres_y = "lat",
pred_limit = TRUE,
limit_buffer = 0,
pred_clip = NULL,
predictors,
is_env_pred = TRUE,
terra_options = NULL,
cat_preds = NULL,
num_bg = 10000,
prop_abs = "abs",
many_p_prop = 2,
folds = 5,
spatial_folds = TRUE,
repeats = 1,
block_div = seq(5, by = 2, length.out = repeats),
min_fold_n = 8,
hold_prop = 0.3,
stretch_value = 10,
dens_res = 1000,
reduce_env_thresh_corr = 0.9,
reduce_env_quant_rf_imp = 0.2,
do_gc = FALSE,
force_new = FALSE
)
Arguments
- this_taxa
Character. Name of taxa. Only used to print some messages. Ignored if NULL
- out_dir
FALSE or character. If FALSE the result of prep_sdm will be saved to a temporary folder. If character, a file 'prep.rds' will be created at the path defined by out_dir.
- return_val
Character: "object" or "path". Both return a named list. In the case of "path" the named list is simply list(prep = out_dir). Will be set to "object" if
out_dir
is FALSE.- presence
Dataframe of presences with columns
pres_x
andpres_y
.- pres_col
Character. Name of column in
presence
that defines presence (1
) or absence (0
). Optional if only presence data is supplied.- pres_val
Numeric. Values in
pres_col
that represent presences. Optional if only presence data is supplied.- pres_crs
Anything that will return a legitimate crs when passed to the crs attribute of
sf::st_transform()
orsf::st_as_sf()
.- pres_x, pres_y
Character. Name of the columns in
presence
that have the x and y coordinates- pred_limit
Limit the background points and predictions? Can be
TRUE
(usepresence
to generate a minimum convex polygon to use as a limit. Not recommended as the points inpresence
have usually been filtered to very accurate spatial reliability and thus may be missing a large number of legitimate records);FALSE
(the full extent of the predictors will be used); path to existing .parquet to use; or sf object.- limit_buffer
Numeric. Apply this buffer to
pred_limit
. Only used ifpred_limit
isTRUE
. Passed to thedist
argument ofsf::st_buffer()
.- pred_clip
sf. Optional sf to clip the pred_limit back to (e.g. to prevent prediction into ocean).
- predictors
Character. Vector of paths to predictor
.tif
files.- is_env_pred
Logical. Does the naming of the directory and files in
predictors
follow the pattern required byenvRaster::parse_env_tif()
?- terra_options
Passed to
terra::terraOptions()
. e.g. list(memfrac = 0.6)- cat_preds
Character. Vector of predictor names that are character.
- num_bg
Numeric. How many background points?
- prop_abs
Character. Is
num_bg
a proportion (prop
) of the number of records inpresence
or an absolute (abs
) number?- many_p_prop
Numeric. Ensure the number of background points is at least
many_p_prop * number of presences
. e.g. If there are more than 5000 presences and num_bg is set at10000
andmany_p_prop
is2
, then num_bg will be increased tomany_p_prop * nrow(presences)
- folds
Numeric. How many folds to use in cross validation? Will be adjusted downwards if number of presences do not support
folds * min_fold_n
- spatial_folds
Logical. Use spatial folds? Even if
TRUE
, can resort to non-spatial cv if presences per fold do not meetmin_fold_n
or there are not enough presences to support more than one fold.- repeats
Numeric. Number of repeated cross validations.
- block_div
Numeric. The square root of the predict area is divided by this value before being passed to the
block_dist
argument ofblockCV::cv_spatial()
. If using repeated cross validation,block_div
must be of the same length as1:repeats
.- min_fold_n
Numeric. Sets both minimum number of presences, and, by default, the minimum number of presences required for a model.
- hold_prop
Numeric. Proportion of data to be held back from training to use to validate the final model.
- stretch_value
Numeric. Stretch the density raster to this value.
- dens_res
NULL
or numeric. Resolution (in metres) of density raster. Set toNULL
to use the same resolution as the predictors.- reduce_env_thresh_corr
Numeric. Threshold used to flag highly correlated variables. Set to 1 to skip this step. If > 0, highly correlated and low importance variables will be removed. In the case of highly correlated pairs of variables, only one is removed.
- reduce_env_quant_rf_imp
Numeric. Bottom quantile of importance values to drop.
- do_gc
Logical. Run
base::rm(list = ls)
andbase::gc()
at end of function? Useful when running SDMs for many, many taxa, especially if done in parallel.- force_new
Logical. If outputs already exist, should they be remade?
Value
If return_val
is "object" a named list. If return_val
is "path"
a named list list(prep = out_dir)
. If out_dir
is a valid path, the 'full
result' (irrespective of return_val
) is also saved to
fs::path(out_dir, "prep.rds")
. The 'full result' is a named list with
elements:
log:
a log of (rough) timings and other information from the process
abandoned:
Logical indicating if the sdm was abandoned. If abandoned is TRUE, some list elements may not be present
presence_ras:
tibble with two columns ('x' and 'y') representing unique cell centroids on the predictors at presences supplied in argument
presence
predict_boundary:
sf used to limit the background points and used by
predict_sdm()
to generate the 'mask'ed output
bg_points:
sf of cell centroids representing unique cell centroids for background points
blocks
data.frame with columns:
pa
: presence (1) or absence/background (0)x
andy
: cell centroids for each presence and absenceblock
: the spatial block to which the row belongsa column with values for each of
predictors
atx
andy
spatial_folds_used:
logical indicating if spatial folds were used. This may differ from the
spatial_folds
argument provided toprep_sdm()
if an attempt to use spatial folds failed to meet desiredfolds
andmin_fold_n
correlated:
list with elements as per
envModel::reduce_env()
, or, ifreduce_env
isFALSE
, a list with elementsremove_env
which is empty, andenv_var
andkeep
, which both contain the names of all predictors.
Details
If memory is an issue, try adjusting terra_options
and/or do_gc
.
To help build the density raster for assigning background points, 'absence'
data can be supplied in presence
as 0
values. e.g. For a bird, absence
data might be generated from other sites where birds were recorded but
this_taxa
was not.
Examples
out_dir <- file.path(system.file(package = "envSDM"), "examples")
data <- file.path(system.file(package = "predicts"), "ex") |>
fs::dir_ls(regexp = "\\.csv$") |>
tibble::enframe(name = NULL, value = "path") |>
dplyr::mutate(taxa = gsub("\\.csv", "", basename(path))
, presence = purrr::map(path, rio::import, setclass = "tibble", trust = TRUE)
, presence = purrr::map(presence
, \(x) x |>
dplyr::filter(!is.na(lat)
, !is.na(lon)
)
)
, taxa_dir = fs::path(out_dir, taxa)
, out_mcp = fs::path(taxa_dir, "mcp.parquet")
)
env_dat <- system.file("ex/bio.tif", package = "predicts")
# mcps --------
# make a clip boundary so mcps stay terrestrial
clip <- terra::as.polygons(terra::rast(env_dat)[[1]] > -Inf) |>
sf::st_as_sf()
purrr::pwalk(list(data$presence
, data$out_mcp
)
, \(x, y) make_mcp(x, y, pres_x = "lon"
, clip = clip
)
)
# prep -----------
# use the just created mcps (this allows using, say, a different spatial reliability threshold for the mcps)
purrr::pwalk(list(data$taxa
, data$taxa_dir
, data$presence
, data$out_mcp
)
, function(a, b, c, d) prep_sdm(this_taxa = a
, out_dir = b
, presence = c
, pres_x = "lon"
, pres_y = "lat"
, predictors = env_dat
, is_env_pred = FALSE
, pred_limit = d
, limit_buffer = 10000
, folds = 5
, repeats = 5
, hold_prop = 0
, dens_res = 1000 # ignored as decimal degrees preds
, reduce_env_thresh_corr = 0.95
, reduce_env_quant_rf_imp = 0.2
#, force_new = TRUE
)
)
# example of 'prep'
prep <- rio::import(fs::path(data$taxa_dir[[2]], "prep.rds"), trust = TRUE)
names(prep)
#> [1] "abandoned" "finished" "this_taxa" "original"
#> [5] "pa_ras" "presence_ras" "predict_boundary" "bg_points"
#> [9] "env" "testing" "training" "reduce_env"
#> [13] "log"
# env variables to remove prior to SDM
prep$reduce_env$remove
#> [1] "bio1" "bio8" "block" "cell" "id" "lat" "lon" "pa" "x"
#> [10] "y"
# Density raster
dens_ras <- terra::rast(fs::path(data$taxa_dir[[2]], "density.tif")) %>%
terra::mask(clip) %>%
terra::classify(matrix(c(0, NA), ncol = 2))
#> bio1
if(require("tmap")) {
m <-
tm_shape(clip) +
tm_borders() +
tm_shape(dens_ras) +
tm_raster(col.legend = "Background point density"
, col.scale = c(0, 2, 4, 6, 8, 10)
, drop.levels = TRUE
) +
tm_compass() +
tm_scalebar() +
tm_title(paste0("Background point density for ", prep$this_taxa))
m
presences <- prep$pa_ras |>
dplyr::filter(pa == 1) %>%
sf::st_as_sf(coords = c("x", "y")
, crs = 4326
)
m +
tm_shape(presences) +
tm_dots(fill = "pa")
}
#> Loading required package: tmap
#>
#> ── tmap v3 code detected ───────────────────────────────────────────────────────
#> [v3->v4] `tm_tm_raster()`: migrate the argument(s) related to the scale of the
#> visual variable `col` namely 'drop.levels' to col.scale = tm_scale(<HERE>).
#> ℹ For small multiples, specify a 'tm_scale_' for each multiple, and put them in
#> a list: 'col.scale = list(<scale1>, <scale2>, ...)'
#> The visual variable "fill" of the layer "dotssymbols" contains a unique value. Therefore a discrete scale is applied (tm_scale_discrete).
#> Scale bar set for latitude km and will be different at the top and bottom of the map.
# Background points
if(require("tmap")) {
blocks <- prep$bg_points %>%
dplyr::inner_join(prep$training |>
dplyr::select(rep, training) |>
tidyr::unnest(cols = c(training))
) |>
dplyr::mutate(block = factor(block) # for map
, rep = paste0("rep: ", rep)
) %>%
sf::st_as_sf(coords = c("x", "y")
, crs = sf::st_crs(terra::rast(env_dat[[1]]))
)
tm_shape(clip) +
tm_borders() +
tm_shape(blocks) +
tm_dots(title = "Background points\n coloured by block"
, col = "block"
) +
tm_facets(by = "rep") +
tm_legend(outside = TRUE) +
tm_compass() +
tm_scale_bar() +
tm_layout(main.title = paste0("Background points for ", prep$this_taxa))
}
#> Joining with `by = join_by(x, y)`
#> [tm_dots()] Argument `title` unknown.
#> ! `tm_scale_bar()` is deprecated. Please use `tm_scalebar()` instead.
#> [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
#>
#> ── tmap v3 code detected ───────────────────────────────────────────────────────
#> [v3->v4] `tm_legend()`: use 'tm_legend()' inside a layer function, e.g.
#> 'tm_polygons(..., fill.legend = tm_legend())'
#> This message is displayed once every 8 hours.
#> Scale bar set for latitude km and will be different at the top and bottom of the map.
#> Scale bar set for latitude km and will be different at the top and bottom of the map.
#> Scale bar set for latitude km and will be different at the top and bottom of the map.
#> Scale bar set for latitude km and will be different at the top and bottom of the map.
#> Scale bar set for latitude km and will be different at the top and bottom of the map.