Abstract

A tremendous amount of attention has been paid to Big Data in recent years. Such data hold promise for scientific discoveries but also pose challenges for analyses. In their 2014 article "Challenges of Big Data analysis," Fan and colleagues propose that the high dimensionality of Big Data introduces statistical problems, including noise accumulation. This thesis explores noise accumulation in high-dimensional two-group classification problems. First, it aims to determine whether noise accumulation threatens the discriminative ability of classifiers developed with three common machine learning approaches: random forest, support vector machine, and boosted classification trees. Four scenarios with differing amounts of signal strength are simulated to evaluate each method. After determining that noise accumulation may impact the performance of these classifiers, the thesis characterizes the factors that affect noise accumulation. Simulations varying sample size, signal strength, signal strength proportional to the number of predictors, and signal magnitude are conducted with random forest classifiers. Finally, this thesis develops the Total Signal Index to summarize the amount of signal relative to noise in a two-group classification problem. Theoretical and empirical versions of this measure are defined, and simulations are used to assess them.
