Abstract
Chinese word segmentation (CWS) is essential for natural language processing because written Chinese lacks explicit word boundaries. Recent CWS evaluations include closed tasks (training data only) and open tasks (external resources allowed), but the impact of external resources remains unclear. This study quantifies the contributions of various resources and explores methods for integrating them. The results show that independent dictionaries significantly improve performance, a finding that generalizes to other languages. In addition, normalizing numbers, ASCII characters, and punctuation provides further gains. These findings offer practical guidance for optimizing open-task CWS systems.
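As a rough illustration of the normalization step mentioned above, the sketch below maps each character to a coarse class before it would be passed to a segmenter. It is not the study's actual procedure: the class labels (`<NUM>`, `<LAT>`, `<PUNC>`), the function names, and the choice of NFKC folding for full-width forms are all assumptions made for this example.

```python
import unicodedata

# Hypothetical class labels; the abstract does not specify any.
NUM, LATIN, PUNCT = "<NUM>", "<LAT>", "<PUNC>"


def char_class(ch: str) -> str:
    """Return a class label for digits, ASCII letters, and punctuation;
    return any other character unchanged."""
    # NFKC folds full-width forms (e.g. full-width digits and letters)
    # to their ASCII counterparts. Some characters expand to several
    # characters under NFKC; in that case keep the original.
    folded = unicodedata.normalize("NFKC", ch)
    probe = folded if len(folded) == 1 else ch

    if probe.isdigit():
        return NUM
    if probe.isascii() and probe.isalpha():
        return LATIN
    # Unicode general categories starting with "P" are punctuation.
    if unicodedata.category(probe).startswith("P"):
        return PUNCT
    return ch


def normalize(text: str) -> list[str]:
    """Normalize character by character, so the output has exactly one
    item per input character and word boundaries stay aligned."""
    return [char_class(ch) for ch in text]


if __name__ == "__main__":
    print(normalize("ＡＢ公司2008年，"))
```

Keeping one output item per input character means that boundary labels predicted on the normalized sequence can be applied directly to the original text. Whether to also treat Chinese numerals such as 一 or 十 as numbers is a separate design decision that this sketch leaves out.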