- Title
A Unified Sample Selection Framework for Output Noise Filtering: An Error-Bound Perspective.
- Authors
Gaoxia Jiang; Wenjian Wang; Yuhua Qian; Jiye Liang
- Abstract
The existence of output noise will bring difficulties to supervised learning. Noise filtering, aiming to detect and remove polluted samples, is one of the main ways to deal with the noise on outputs. However, most of the filters are heuristic and could not explain the filtering influence on the generalization error (GE) bound. The hyper-parameters in various filters are specified manually or empirically, and they are usually unable to adapt to the data environment. The filter with an improper hyper-parameter may overclean, leading to a weak generalization ability. This paper proposes a unified framework of optimal sample selection (OSS) for the output noise filtering from the perspective of error bound. The covering distance filter (CDF) under the framework is presented to deal with noisy outputs in regression and ordinal classification problems. Firstly, two necessary and sufficient conditions for a fixed goodness of fit in regression are deduced from the perspective of GE bound. They provide the unified theoretical framework for determining the filtering effectiveness and optimizing the size of removed samples. The optimal sample size has the adaptability to the environmental changes in the sample size, the noise ratio, and noise variance. It offers a choice of tuning the hyper-parameter and could prevent filters from overcleansing. Meanwhile, the OSS framework can be integrated with any noise estimator and produces a new filter. Then the covering interval is proposed to separate low-noise and high-noise samples, and the effectiveness is proved in regression. The covering distance is introduced as an unbiased estimator of high noises. Further, the CDF algorithm is designed by integrating the cover distance with the OSS framework. Finally, it is verified that the CDF not only recognizes noise labels correctly but also brings down the prediction errors on real apparent age data set. Experimental results on benchmark regression and ordinal classification data sets demonstrate that the CDF outperforms the state-of-the-art filters in terms of prediction ability, noise recognition, and efficiency.
- Subjects
KALMAN filtering; NOISE; SUPERVISED learning; AGE groups; ENVIRONMENTAL sampling; SAMPLE size (Statistics)
- Publication
Journal of Machine Learning Research, 2021, Vol 22, p1
- ISSN
1532-4435
- Publication type
Article