Abstract
Identification of biologically relevant high-occupancy transcription factor binding sites (TFBS) in silico has historically been a difficult problem with a high error rate. Methods that utilize information in addition to the sequence of binding sites (e.g., chromatin information) have been shown to improve performance over strictly sequence-based methods; however, several questions about such methods remain unanswered: whether such models are suitable for multiple transcription factors, whether a general model or generalizable approach to the problem is possible, and what effect such prediction has on biological inference. In this work, we construct and evaluate several classifiers of position weight matrix-predicted TFBS (“occupancy classifiers”) based on four distinct transcription factors and demonstrate that such classifiers identify biochemically confirmed high-occupancy sites at a high rate.