Iteratively add trees to a random forest until predictions stabilise

Usage

make_rf_good(
  env_df,
  clust_col = "cluster",
  env_names,
  trees_start = 499,
  trees_add = 249,
  trees_max = 9999,
  rf_cores = 1,
  use_mtry = NULL,
  set_min = FALSE,
  accept_delta = 0.995,
  accept_run = 3,
  internal_metrics = TRUE,
  do_imp = FALSE,
  keep_rf = FALSE,
  out_file = NULL,
  save_res = FALSE,
  do_gc = TRUE,
  ...
)

Arguments

env_df: Data frame with clust_col, site_col and the columns named in env_names.
clust_col: Character. Name of the column with clusters.
env_names: Character. Names of the environmental variables (e.g. names(stack_list)).
trees_start: Number of trees in first random forest run.
trees_add: Number of trees to add in each subsequent run.
trees_max: Maximum number of trees in the random forest.
rf_cores: Number of cores to use for parallel processing.
use_mtry: mtry value for the randomForest::randomForest() call. If NULL, it will be generated by a (lengthy) call to caret::train() with a tune grid of .mtry = 1:floor(sqrt(length(env_names))).
set_min: FALSE or numeric. If numeric, classes in clust_col with fewer than set_min rows will be filtered out.
accept_delta: What proportion change between runs is acceptable?
accept_run: How many forests (in a row) need to beat accept_delta?
internal_metrics: TRUE, or test data in the same format as env_df.
do_imp: Logical. Passed to importance argument of randomForest::randomForest().
keep_rf: Logical. If TRUE, the randomForest object will be included in the output. Defaults to FALSE to save memory.
out_file: Optional name of file to save results.
save_res: FALSE or folder path. If a path is provided, any constant columns in env_df will be used to generate a file name, and metrics will be saved under that file name in the folder.
do_gc: Logical. Run gc() when results are available and all other objects have been removed.
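
Examples

The interplay of accept_delta, accept_run, trees_start, trees_add and trees_max can be sketched in base R. This is an illustration under stated assumptions, not the function's actual implementation: first_stable_run() is a hypothetical helper, the agreement values are made up, each value is assumed to be the proportion of predictions unchanged between one forest and the previous one, and "beat" is read as greater than or equal to accept_delta.

```r
# Hypothetical helper (not part of the documented function): returns the index
# of the comparison at which the stopping rule is met, or NA if it never is.
# `agreement` is ASSUMED to hold, for each added batch of trees, the proportion
# of predictions that did not change relative to the previous forest.
first_stable_run <- function(agreement, accept_delta = 0.995, accept_run = 3) {
  streak <- 0L
  for (i in seq_along(agreement)) {
    # A comparison extends the streak only if it meets the threshold;
    # any miss resets the count, so the runs must be consecutive.
    streak <- if (agreement[i] >= accept_delta) streak + 1L else 0L
    if (streak >= accept_run) return(i)
  }
  NA_integer_
}

# Made-up agreement values, for illustration only.
agreement <- c(0.981, 0.996, 0.993, 0.997, 0.998, 0.999)
stable_at <- first_stable_run(agreement)

# Forest size at that point using the documented defaults, ASSUMING one batch
# of trees_add trees has been added per comparison, capped at trees_max.
trees_start <- 499
trees_add <- 249
trees_max <- 9999
trees_used <- min(trees_start + stable_at * trees_add, trees_max)

print(stable_at)
print(trees_used)
```

With these made-up values the miss at the third comparison resets the streak, so the rule is first met at the sixth comparison. Requiring several consecutive acceptable runs, rather than one, guards against stopping on a single run that happens to agree by chance.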