Description
Random forest is a widely used machine learning method across the life sciences due to its high predictive performance, minimal assumptions, and flexibility in handling diverse data types. However, a critical yet often overlooked property of random forest is its inherent non-determinism: repeated runs on the same data set can produce different prediction models. This variability can compromise the reproducibility of prediction-based decision-making, especially in high-stakes applications such as medical, agricultural, or environmental prediction tasks.
Random forest builds an ensemble of classification or regression trees and aggregates their outputs. A key but underexplored parameter in this process is the number of trees. While widely used implementations employ a fixed default value of 500 trees, this parameter strongly affects both model stability and computational efficiency.
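The non-determinism described above can be illustrated with a toy ensemble. The following is a minimal sketch, not a real random forest and not part of optRF: it averages one-split regression "stumps" fitted on bootstrap samples with randomly chosen split points. All names (`fit_stump`, `fit_forest`, `run`, `spread`) and the simulated data are assumptions made purely for illustration.

```python
import random
import statistics

def fit_stump(xs, ys, rng):
    """Fit a one-split regression stump on a bootstrap sample of (xs, ys)."""
    n = len(xs)
    idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample (with replacement)
    bx = [xs[i] for i in idx]
    by = [ys[i] for i in idx]
    threshold = rng.choice(bx)                      # random split point (toy simplification)
    left = [y for x, y in zip(bx, by) if x <= threshold]
    right = [y for x, y in zip(bx, by) if x > threshold]
    overall = sum(by) / len(by)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return threshold, left_mean, right_mean

def fit_forest(xs, ys, num_trees, seed):
    """Fit `num_trees` stumps; all randomness comes from `seed`."""
    rng = random.Random(seed)
    return [fit_stump(xs, ys, rng) for _ in range(num_trees)]

def predict(forest, x):
    """Aggregate the stumps by averaging their predictions."""
    preds = [lm if x <= t else rm for t, lm, rm in forest]
    return sum(preds) / len(preds)

# Fixed, simulated toy data set: y is roughly x plus noise.
data_rng = random.Random(0)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [x + data_rng.gauss(0.0, 1.0) for x in xs]
test_points = [1.0, 3.0, 5.0, 7.0, 9.0]

def run(num_trees, seed):
    """One complete 'analysis': fit an ensemble and predict the test points."""
    forest = fit_forest(xs, ys, num_trees, seed)
    return [predict(forest, x) for x in test_points]

def spread(num_trees, seeds=range(10)):
    """Mean over test points of the standard deviation across repeated runs."""
    runs = [run(num_trees, s) for s in seeds]
    return statistics.mean(statistics.stdev(col) for col in zip(*runs))

same_seed_equal = run(10, seed=1) == run(10, seed=1)  # same seed: same predictions
diff_seed_equal = run(10, seed=1) == run(10, seed=2)  # different seed: different predictions
spread_small = spread(5)      # run-to-run variation with few trees
spread_large = spread(200)    # run-to-run variation with many trees
```

In this sketch, the same data yield different predictions whenever the seed changes, and the run-to-run spread is smaller for the larger ensemble. How quickly the spread shrinks in a real random forest depends on the data and the implementation.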
This study introduces a measure of prediction stability that quantifies the effect non-determinism can have on prediction-based decision-making. Furthermore, we show that the relationship between the number of trees and prediction stability is non-linear: prediction stability increases rapidly with an initial increase in the number of trees and then plateaus, so that adding further trees improves stability only marginally while computation time continues to increase linearly. To address this trade-off, we developed the R package optRF, which models the relationship between prediction stability and the number of trees to determine the optimal number of trees for a given data set.
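The trade-off between a plateauing stability curve and linearly growing computation time can be sketched as a simple stopping rule. This is a hedged illustration only: it is not the model optRF fits, the saturating curve `synthetic_stability` is a synthetic stand-in rather than measured data, and the threshold `min_gain_per_tree` is an arbitrary assumption.

```python
def synthetic_stability(num_trees, half_point=50.0):
    """Synthetic saturating curve standing in for measured stability (NOT real data)."""
    return num_trees / (num_trees + half_point)

def smallest_plateau_size(candidates, stability_fn, min_gain_per_tree=1e-4):
    """Return the first candidate from which moving to the next candidate
    gains less than `min_gain_per_tree` stability per additional tree."""
    for current, following in zip(candidates, candidates[1:]):
        gain = stability_fn(following) - stability_fn(current)
        if gain / (following - current) < min_gain_per_tree:
            return current
    return candidates[-1]   # no plateau reached within the candidates

# Hypothetical candidate ensemble sizes, in increasing order.
candidates = [10, 50, 100, 250, 500, 1000, 2500, 5000]
chosen = smallest_plateau_size(candidates, synthetic_stability)
stabilities = [synthetic_stability(n) for n in candidates]
```

The value of `chosen` depends entirely on the assumed curve and threshold. With real data, the stability values would come from repeated random forest runs on the data set at hand.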
By providing a systematic way to quantify and optimise random forest stability, optRF contributes to more reliable and transparent machine learning analyses. While originally developed with genomic selection in mind, the approach is broadly applicable across the life sciences, where reproducibility and computational efficiency are critical for shaping data-driven decisions of the future.