Applied Sciences, Vol. 15, Pages 2072: Automated Dataset-Creation and Evaluation Pipeline for NER in Russian Literary Heritage
Applied Sciences doi: 10.3390/app15042072
Authors:
Kenan Kassab
Nikolay Teslya
Ekaterina Vozhik
Developing robust and reliable Named Entity Recognition (NER) models for the Russian language is challenging because of the linguistic complexity of Russian and the limited availability of suitable training datasets. This study introduces a semi-automated methodology for building a customized Russian NER dataset designed specifically for literary purposes. The paper describes in detail how the dataset was collected and proofread, and outlines the pipeline used to process and annotate its contents. A comprehensive analysis highlights the dataset’s richness and diversity. Central to the proposed approach is a voting system that facilitates the efficient elicitation of entities, yielding significant time and cost savings compared with traditional methods of constructing NER datasets. The voting system is described theoretically and mathematically to show how it enhances the annotation process. Tests of the voting system with various thresholds show that it increases overall precision by 28% compared with auto-annotation using only the state-of-the-art model. The dataset is meticulously annotated and thoroughly proofread, making it a high-quality resource for training and evaluating NER models. Empirical evaluations using multiple NER models underscore the dataset’s importance and its potential to enhance the robustness and reliability of NER models in the Russian language.
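The abstract does not spell out the voting rule, so the following is only a minimal sketch of one plausible threshold-based vote over the outputs of several annotators, not the authors' formulation. The span representation `(start, end, label)`, the function name `vote_entities`, and the toy model outputs are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable

# Hypothetical entity representation: (start_char, end_char, label).
Span = tuple[int, int, str]


def vote_entities(predictions: list[Iterable[Span]], threshold: float) -> list[Span]:
    """Keep the spans proposed by at least `threshold` (a fraction) of the models."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    if not predictions:
        return []
    votes: Counter = Counter()
    for model_spans in predictions:
        # set() ensures a single model cannot vote twice for the same span.
        votes.update(set(model_spans))
    n_models = len(predictions)
    return sorted(span for span, count in votes.items() if count / n_models >= threshold)


# Made-up outputs from three hypothetical auto-annotators on one sentence.
model_a = [(0, 6, "PER"), (10, 16, "LOC")]
model_b = [(0, 6, "PER"), (10, 16, "ORG")]
model_c = [(0, 6, "PER"), (10, 16, "LOC"), (20, 25, "PER")]

accepted = vote_entities([model_a, model_b, model_c], threshold=2 / 3)
print(accepted)
```

In this sketch, raising the threshold keeps only the entities more models agree on, which would tend to trade recall for precision; the thresholds and models actually tested are those reported in the paper itself.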