- Title
Learning de-biased regression trees and forests from complex samples.
- Authors
Nalenz, Malte; Rodemann, Julian; Augustin, Thomas
- Abstract
Regression trees and forests are widely used due to their flexibility and predictive accuracy. Whereas typical tree induction assumes independently and identically distributed (i.i.d.) data, in many applications the training sample follows a complex sampling structure, including the unequal-probability sampling often found in survey data. A 'naive estimation' that simply ignores the sampling weights may then be substantially biased. This article analyzes the bias arising from naive estimation of regression trees or forests under complex sample designs and proposes ways of de-biasing. This is achieved by bridging tree learning and survey statistics, exploiting the correspondence between the mean-squared-error splitting criterion in regression trees and variance estimation. Transferring population variance estimation approaches from survey statistics to tree induction indeed considerably reduces the bias in the resulting trees, both in the predictions and in the tree structure. The latter is particularly crucial if the trees are to be interpreted. Our methodology is extended to random forests, where we show on simulated data and a housing dataset that correcting for complex sample designs leads to overall much better predictive accuracy and more trustworthy interpretation. Interestingly, corrected forests can surpass forests learned on i.i.d. samples in terms of accuracy, which also has important implications for adaptive data collection approaches.
- Subjects
RANDOM forest algorithms; POPULATION transfers; REGRESSION trees; SUPERVISED learning; ACQUISITION of data
- Publication
Machine Learning, 2024, Vol. 113, Issue 6, p. 3379
- ISSN
0885-6125
- Publication type
Article
- DOI
10.1007/s10994-023-06439-1