Abstract
This thesis presents a comparative study of the Goodness of Pronunciation (GOP) score, a metric of phone-level pronunciation quality, tracing its formulation and evolution. The effectiveness of the GOP score depends predominantly on two components: the forced aligner, which produces the expected phone segments, and the phone loop, which produces the observed phone segments. Following the derivatives developed since the GOP score's inception, this thesis explores alternatives to the traditional forced aligner and phone loop using several Deep Neural Network (DNN) architectures. These architectures fall into two general classes: single-input classifiers and sequence-to-sequence classifiers. Alongside these architectures, the thesis proposes approaches for using DNNs within a GOP score. Lastly, it proposes a new generalized GOP score, the GOP-ensemble, which lets users combine established GOP scores into a new, modular pronunciation score.
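
For orientation, the interplay between the forced aligner and the phone loop can be illustrated with the classic likelihood-ratio form of the GOP score. The sketch below is a minimal illustration, not the implementation used in this thesis: the function name `gop_score` and its parameters are hypothetical, and it assumes that the segment log-likelihoods have already been computed by an acoustic model.

```python
def gop_score(forced_loglik, phone_loop_logliks, num_frames):
    """Classic likelihood-ratio GOP for a single phone segment (illustrative).

    forced_loglik:      log-likelihood of the segment under the expected phone,
                        as given by the forced alignment (assumed input).
    phone_loop_logliks: log-likelihoods of the same segment under each
                        candidate phone of the phone loop (assumed input).
    num_frames:         number of acoustic frames in the segment, used to
                        normalize for segment duration.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # The phone loop contributes its best-scoring (observed) phone.
    best_observed = max(phone_loop_logliks)
    # Absolute log-likelihood ratio, normalized by duration.
    return abs(forced_loglik - best_observed) / num_frames


# Hypothetical values: the expected phone scores -42.0, while the best
# phone-loop candidate scores -40.0 over an 8-frame segment.
score = gop_score(-42.0, [-40.0, -45.0, -42.0], 8)
```

Under this convention, a score of zero means the expected phone is also the best-scoring phone in the loop, and larger values indicate a greater mismatch between the expected and observed phones.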