Computers, Vol. 14, Pages 527: Zero-Inflated Text Data Analysis Using Imbalanced Data Sampling and Statistical Models

Greenberg December 2, 2025 in News - 1 Minute

Computers doi: 10.3390/computers14120527

Authors:
Sunghae Jun

Text data often exhibits high sparsity and zero inflation, where a substantial proportion of entries in the document–keyword matrix are zeros. This characteristic presents challenges to traditional count-based models, which may suffer from reduced predictive accuracy and interpretability in the presence of excessive zeros and overdispersion. To overcome this issue, we propose an effective analytical framework that integrates imbalanced data handling by undersampling with classical probabilistic count models. Specifically, we apply Poisson's generalized linear models, zero-inflated Poisson, and zero-inflated negative binomial models to analyze zero-inflated text data while preserving the statistical interpretability of term-level counts. The framework is evaluated using both real-world patent documents and simulated datasets. Empirical results demonstrate that our undersampling-based approach improves the model fit without modifying the downstream models. This study contributes a practical preprocessing strategy for enhancing zero-inflated text analysis and offers insights into model selection and data balancing techniques for sparse count data.
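
The general idea in the abstract (undersample the zero-dominated documents, then fit standard count models) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the simulated data, the 1:1 undersampling ratio, the intercept-only models, and every function name (`simulate_docs`, `undersample_zero_docs`, `fit_zip`, and so on) are hypothetical choices made for the example. The zero-inflated negative binomial model is left out for brevity.

```python
import math
import random

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_docs(n, pi_zero, lam, rng):
    """Hypothetical data: (doc_id, count of one keyword) with structural zeros."""
    return [(i, 0 if rng.random() < pi_zero else draw_poisson(lam, rng))
            for i in range(n)]

def undersample_zero_docs(docs, ratio, rng):
    """Keep every document with a nonzero count and a random subset of the
    zero-count documents, at most ratio * (number of nonzero documents)."""
    nonzero = [d for d in docs if d[1] > 0]
    zero = [d for d in docs if d[1] == 0]
    keep = min(len(zero), int(ratio * len(nonzero)))
    return nonzero + rng.sample(zero, keep)

def poisson_aic(counts):
    """AIC of an intercept-only Poisson model (the MLE of lam is the mean)."""
    lam = sum(counts) / len(counts)
    ll = sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)
    return 2 * 1 - 2 * ll

def fit_zip(counts, iters=500):
    """EM for an intercept-only zero-inflated Poisson; returns (pi, lam)."""
    n, n0, total = len(counts), counts.count(0), sum(counts)
    pi, lam = 0.5, total / n
    for _ in range(iters):
        # E-step: probability that an observed zero is a structural zero
        w = pi / (pi + (1 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean
        pi = n0 * w / n
        lam = total / (n - n0 * w)
    return pi, lam

def zip_aic(counts, pi, lam):
    """AIC of the fitted zero-inflated Poisson model (two parameters)."""
    ll = 0.0
    for c in counts:
        if c == 0:
            ll += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            ll += math.log(1 - pi) + c * math.log(lam) - lam - math.lgamma(c + 1)
    return 2 * 2 - 2 * ll

rng = random.Random(0)
docs = simulate_docs(2000, 0.8, 3.0, rng)         # assumed: 80% structural zeros
balanced = undersample_zero_docs(docs, 1.0, rng)  # assumed: 1:1 zero/nonzero ratio

for name, sample in [("full", docs), ("undersampled", balanced)]:
    counts = [c for _, c in sample]
    pi_hat, lam_hat = fit_zip(counts)
    zero_share = counts.count(0) / len(counts)
    print(name, len(counts), round(zero_share, 3),
          round(poisson_aic(counts), 1), round(zip_aic(counts, pi_hat, lam_hat), 1))
```

AIC values computed on the full and undersampled samples are not directly comparable, because the samples differ in size. The script prints them only to contrast the Poisson and zero-inflated Poisson fits within each sample, and it makes no attempt to reproduce the paper's results.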

Source: Sunghae Jun, www.mdpi.com
