Computers, Vol. 14, Pages 527: Zero-Inflated Text Data Analysis Using Imbalanced Data Sampling and Statistical Models


Computers doi: 10.3390/computers14120527

Authors:
Sunghae Jun

Text data often exhibits high sparsity and zero inflation: a substantial proportion of the entries in the document–keyword matrix are zeros. This poses a challenge for traditional count-based models, whose predictive accuracy and interpretability can degrade in the presence of excess zeros and overdispersion. To address this issue, we propose an analytical framework that combines imbalanced data handling, in the form of undersampling, with classical probabilistic count models. Specifically, we apply Poisson generalized linear models, zero-inflated Poisson models, and zero-inflated negative binomial models to zero-inflated text data while preserving the statistical interpretability of term-level counts. The framework is evaluated on both real-world patent documents and simulated datasets. Empirical results show that the undersampling-based approach improves model fit without modifying the downstream models. This study contributes a practical preprocessing strategy for zero-inflated text analysis and offers insights into model selection and data balancing techniques for sparse count data.
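The idea in the abstract can be illustrated with a minimal sketch; it is not the paper's implementation. It assumes that undersampling means randomly discarding documents whose count for a target keyword is zero, and it fits intercept-only Poisson and zero-inflated Poisson (ZIP) models to simulated counts. The function names, the `zero_ratio` parameter, and the simulation settings (5000 documents, 80% structural zeros, Poisson mean 2) are all hypothetical choices made for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    # Knuth's multiplication method; adequate for small lam.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_zip(n, pi, lam, rng):
    # With probability pi emit a structural zero, otherwise a Poisson(lam) draw.
    return [0 if rng.random() < pi else sample_poisson(lam, rng) for _ in range(n)]


def undersample_zeros(counts, zero_ratio, rng):
    # Assumed scheme: keep every nonzero count and at most
    # zero_ratio * (number of nonzero counts) randomly chosen zeros.
    nonzero = [c for c in counts if c > 0]
    zeros = [c for c in counts if c == 0]
    keep = min(len(zeros), int(zero_ratio * len(nonzero)))
    return nonzero + rng.sample(zeros, keep)


def poisson_loglik(counts, lam):
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)


def zip_loglik(counts, pi, lam):
    ll = 0.0
    for c in counts:
        if c == 0:
            ll += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            ll += math.log(1 - pi) + c * math.log(lam) - lam - math.lgamma(c + 1)
    return ll


def fit_zip_em(counts, iters=200):
    # EM for an intercept-only ZIP model; returns (pi, lam).
    n = len(counts)
    n_zero = sum(1 for c in counts if c == 0)
    total = sum(counts)
    pi, lam = 0.5, max(total / n, 1e-6)
    for _ in range(iters):
        # E-step: probability that an observed zero is structural.
        w = pi / (pi + (1 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean.
        pi = w * n_zero / n
        lam = total / (n - w * n_zero)
    return pi, lam


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


rng = random.Random(0)
counts = simulate_zip(5000, pi=0.8, lam=2.0, rng=rng)
balanced = undersample_zeros(counts, zero_ratio=1.0, rng=rng)

for name, data in [("original", counts), ("undersampled", balanced)]:
    lam_pois = sum(data) / len(data)  # Poisson MLE is the sample mean
    pi_hat, lam_hat = fit_zip_em(data)
    zero_frac = sum(1 for c in data if c == 0) / len(data)
    print(name, "n =", len(data), "zero fraction =", round(zero_frac, 3))
    print("  Poisson AIC:", round(aic(poisson_loglik(data, lam_pois), 1), 1))
    print("  ZIP AIC:    ", round(aic(zip_loglik(data, pi_hat, lam_hat), 2), 1))
```

AIC values computed on samples of different sizes are not directly comparable, so the sketch is best read as a within-sample comparison of the Poisson and ZIP fits before and after balancing. Undersampling also changes the apparent zero proportion, so parameters estimated on the balanced sample describe that sample, not the original corpus.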



Sunghae Jun www.mdpi.com