Iteratively add trees to random forest until predictions stabilise
Usage
make_rf_good(
  env_df,
  clust_col = "cluster",
  env_names,
  trees_start = 499,
  trees_add = 249,
  trees_max = 9999,
  rf_cores = 1,
  use_mtry = NULL,
  set_min = FALSE,
  accept_delta = 0.995,
  accept_run = 3,
  internal_metrics = TRUE,
  do_imp = FALSE,
  keep_rf = FALSE,
  out_file = NULL,
  save_res = FALSE,
  do_gc = TRUE,
  ...
)
Arguments
- env_df
Dataframe with clust_col, site_col and columns env_names.
- clust_col
Character. Name of the column with clusters.
- env_names
Character. Names of the environmental variables (e.g. names(stack_list)).
- trees_start
Number of trees in first random forest run.
- trees_add
Number of trees to add in each subsequent run.
- trees_max
Maximum number of trees in the random forest.
- rf_cores
Number of cores to use for parallel processing.
- use_mtry
mtry value for the randomForest::randomForest() call. If NULL, it will be generated by a (lengthy) call to caret::train() with a tune grid of .mtry = 1:floor(sqrt(length(env_names))).
- set_min
FALSE or numeric. If numeric, classes in clust_col with fewer than set_min rows will be filtered out.
- accept_delta
What proportion change between runs is acceptable?
- accept_run
How many forests (in a row) need to beat accept_delta?
- internal_metrics
TRUE or test data in the same format as env_df.
- do_imp
Logical. Passed to the importance argument of randomForest::randomForest().
- keep_rf
Logical. If TRUE, the randomForest object will be included in the output. Defaults to FALSE to save memory.
- out_file
Optional name of file to save results.
- save_res
FALSE or folder path. If a path is provided, any constant columns in env_df will be used to generate a file name, with metrics saved to folder path and file name.
- do_gc
Logical. Run gc() when results are available and all other objects have been removed.
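
The interaction between trees_start, trees_add, trees_max, accept_delta and accept_run can be sketched in base R. This is only one plausible reading of the stopping rule, not the function's actual implementation: it assumes accept_delta is compared against the proportion of predictions that agree between consecutive runs. The helper is_stable() and the simulated "forest" (each tree votes for class "a" with a per-row probability) are hypothetical stand-ins introduced for illustration.

```r
# Hypothetical helper (not part of the documented function): TRUE once the
# last `accept_run` agreement values all reach `accept_delta`.
is_stable <- function(agreement, accept_delta = 0.995, accept_run = 3) {
  if (length(agreement) < accept_run) return(FALSE)
  all(utils::tail(agreement, accept_run) >= accept_delta)
}

set.seed(1)
n_obs <- 200
trees_start <- 499
trees_add <- 249
trees_max <- 9999

# Stand-in for a random forest: every tree votes for class "a" with a
# row-specific probability, and the prediction is the majority vote.
p_a <- runif(n_obs)
votes_a <- rbinom(n_obs, size = trees_start, prob = p_a)
trees <- trees_start
pred <- votes_a / trees > 0.5

# Proportion of rows whose prediction is unchanged between consecutive runs.
agreement <- numeric(0)

# Add trees until predictions are stable for `accept_run` runs in a row,
# or until another batch would exceed `trees_max`.
while (!is_stable(agreement) && trees + trees_add <= trees_max) {
  votes_a <- votes_a + rbinom(n_obs, size = trees_add, prob = p_a)
  trees <- trees + trees_add
  new_pred <- votes_a / trees > 0.5
  agreement <- c(agreement, mean(new_pred == pred))
  pred <- new_pred
}
```

Under this reading, the defaults require at least three further batches of trees_add trees after the first run before the loop can stop, since accept_run consecutive comparisons are needed.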