Iteratively add trees to random forest until predictions stabilise

Usage

make_rf_good(
  env_df,
  clust_col = "cluster",
  env_names,
  trees_start = 499,
  trees_add = 249,
  trees_max = 9999,
  rf_cores = 1,
  use_mtry = NULL,
  set_min = FALSE,
  accept_delta = 0.995,
  accept_run = 3,
  internal_metrics = TRUE,
  do_imp = FALSE,
  keep_rf = FALSE,
  out_file = NULL,
  save_res = FALSE,
  do_gc = TRUE,
  ...
)
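
A minimal sketch of a call on toy data, not taken from the package's own examples. The column names site, temp, rain and elev are invented for illustration, and the call is only attempted if make_rf_good() is already available in the session. use_mtry is set explicitly so the lengthy tuning step described under Arguments is skipped.

```r
set.seed(1)

# Toy data: 60 sites, three clusters, three made-up environmental variables
env_df <- data.frame(
  site = paste0("site_", 1:60),  # hypothetical site column
  cluster = factor(rep(c("a", "b", "c"), each = 20)),
  temp = rnorm(60, mean = rep(c(10, 15, 20), each = 20)),
  rain = rnorm(60, mean = rep(c(300, 500, 700), each = 20), sd = 50),
  elev = runif(60, 0, 1000)
)

env_names <- c("temp", "rain", "elev")

# Only attempt the call when the function is available in this session
if (exists("make_rf_good", mode = "function")) {
  rf_res <- make_rf_good(
    env_df,
    clust_col = "cluster",
    env_names = env_names,
    use_mtry = 2,     # supply mtry directly instead of tuning it
    trees_max = 1999  # keep the toy run short
  )
}
```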

Arguments

env_df

Data frame containing clust_col, site_col and the columns named in env_names.

clust_col

Character. Name of the column containing clusters.

env_names

Character. Names of the environmental variables (e.g. names(stack_list)).

trees_start

Number of trees in first random forest run.

trees_add

Number of trees to add in each subsequent run.

trees_max

Maximum number of trees in the random forest.

rf_cores

Number of cores to use for parallel processing.

use_mtry

mtry value for the randomForest::randomForest() call. If NULL, it will be generated by a (lengthy) call to caret::train() with a tune grid of .mtry = 1:floor(sqrt(length(env_names))).
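
As a rough illustration of the size of that tune grid, the sketch below builds it for nine invented variable names. Only the grid expression comes from the description above; how the function passes the grid to caret::train() is not shown here.

```r
# Hypothetical variable names, used only to give length(env_names) a value
env_names <- c("temp", "rain", "elev", "slope", "aspect", "ph", "clay", "sand", "silt")

# Candidate mtry values: 1 up to the floor of the square root of the predictor count
tune_grid <- expand.grid(.mtry = 1:floor(sqrt(length(env_names))))

tune_grid$.mtry
```

With nine predictors the grid holds three candidate values, so tuning fits models for mtry of 1, 2 and 3.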

set_min

FALSE or numeric. If numeric, classes in clust_col with fewer than set_min rows will be filtered out.
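
The filtering that set_min describes can be sketched in base R as follows. This is an illustration of the idea on made-up data, not the function's internal code.

```r
# Made-up data: class "b" has only two rows
env_df <- data.frame(
  cluster = c(rep("a", 5), rep("b", 2), rep("c", 4)),
  temp = seq(10, 20, length.out = 11)
)
set_min <- 3

# Count rows per class, then keep only classes with at least set_min rows
class_n <- table(env_df$cluster)
keep <- names(class_n)[class_n >= set_min]
env_filtered <- env_df[env_df$cluster %in% keep, , drop = FALSE]

nrow(env_filtered)
```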

accept_delta

What proportion change between runs is acceptable?

accept_run

How many forests (in a row) need to beat accept_delta?
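
How accept_delta and accept_run interact can be sketched as below. The description above does not spell out how successive forests are compared, so this assumes one plausible reading: each run yields an agreement value (the proportion of predictions unchanged from the previous forest), and the process stops once accept_run consecutive values reach accept_delta. The helper first_stable_run() and the agreement values are hypothetical.

```r
# Hypothetical helper: index of the first run at which the last `accept_run`
# agreement values are all >= accept_delta, or NA if that never happens
first_stable_run <- function(agreement, accept_delta = 0.995, accept_run = 3) {
  streak <- 0
  for (i in seq_along(agreement)) {
    # Extend the streak on an acceptable run, otherwise start again from zero
    streak <- if (agreement[i] >= accept_delta) streak + 1 else 0
    if (streak >= accept_run) return(i)
  }
  NA_integer_
}

# Made-up agreement values; element i compares the forest after the i-th
# addition of trees with the forest before it
agreement <- c(0.970, 0.996, 0.990, 0.996, 0.997, 0.998, 0.999)

stable_at <- first_stable_run(agreement)

# Tree count at that point, assuming trees_add trees are added per run
trees_start <- 499
trees_add <- 249
trees_at_stop <- trees_start + stable_at * trees_add
```

Under these assumptions a single acceptable run is not enough: the value at run 2 is discarded when run 3 falls below the threshold, and the count restarts.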

internal_metrics

TRUE or test data in same format as env_df.

do_imp

Logical. Passed to importance argument of randomForest::randomForest().

keep_rf

Logical. If TRUE, the randomForest object will be included in the output. Defaults to FALSE to save memory.

out_file

Optional name of file to save results.

save_res

FALSE or folder path. If path is provided, any constant columns in env_df will be used to generate a file name, with metrics saved to folder path and file name.
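
To illustrate what "constant columns" means here, the sketch below finds columns holding a single distinct value and joins those values into a file name. The column names, the "__" separator, the .rds extension and the folder are all assumptions made for the example; the actual naming scheme is not documented above.

```r
# Made-up data: region and year are the same in every row
env_df <- data.frame(
  region = "north",
  year = 2020,
  cluster = c("a", "b", "a"),
  temp = c(10.1, 12.3, 11.8)
)

# A column is constant when it holds a single distinct value
is_const <- vapply(env_df, function(x) length(unique(x)) == 1, logical(1))
const_vals <- vapply(env_df[1, is_const, drop = FALSE], as.character, character(1))

# Hypothetical naming pattern: constant values joined by "__" inside the folder
save_res <- file.path(tempdir(), "rf_metrics")
out_path <- file.path(save_res, paste0(paste(const_vals, collapse = "__"), ".rds"))
```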

do_gc

Logical. Run gc() when results are available and all other objects have been removed.