Diagnostics, Vol. 15, Pages 3014: A Novel Method for Predicting Oncogenic Types of Human Papillomavirus



Diagnostics doi: 10.3390/diagnostics15233014

Authors:
Songül Çeçen Kaynak
Hilal Arslan

Background and Objectives: Human Papillomavirus (HPV) is a leading cause of cervical and other anogenital cancers, with over 200 known genotypes classified into high-risk, probable high-risk, and low-risk groups. While conventional diagnostic and classification approaches often rely on sequence alignment, phylogenetic relationships, or protein structure analyses, these methods are limited in scalability, cost efficiency, and generalizability to emerging HPV types. This study aims to develop a novel, machine learning-based framework for classifying HPV genotypes by oncogenic risk using genome-derived numerical features. A key objective is to introduce TATA-box, CAAT-box, and CpG-island-based features to HPV risk prediction for the first time.

Methods: We constructed a comprehensive feature set that integrates regulatory sequence motifs (TATA-box, CAAT-box, CpG islands) with dinucleotide and trinucleotide (k-mer) composition derived from full HPV genomes. Multiple machine learning algorithms were implemented to evaluate classification performance across all risk categories. Model accuracy, precision, recall, and F1-score were calculated to assess the effectiveness and robustness of the proposed feature set.

Results: The proposed method achieves an average precision of 0.95, a recall of 0.95, an F1-score of 0.95, and an accuracy of 97.47%. The experimental findings indicate that the proposed method not only attains high classification accuracy across all HPV risk groups but also surpasses existing models in generalizability by utilizing genomic data and novel biologically informed features.

Conclusions: This study introduces regulatory motif-based numerical features to HPV classification for the first time and demonstrates that integrating these with k-mer descriptors yields a highly accurate and scalable machine learning model. Unlike previous studies, which often focus on specific HPV genes or a limited subset of types, our method is scalable, robust, and capable of classifying known and emerging HPV types with high reliability. This highlights its potential for real-world deployment in large-scale epidemiological screening and vaccine development programs.
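
As an illustration of the kind of genome-derived numerical features named in the Methods, the following is a minimal Python sketch and not the authors' implementation. The abstract does not give the exact feature definitions, so everything specific here is an assumption: the TATA-box pattern (consensus TATA[A/T]A[A/T]), the CAAT-box core (CCAAT), the use of a single whole-genome observed/expected CpG ratio in place of a windowed CpG-island search, and the hypothetical function names `kmer_frequencies`, `cpg_observed_expected`, and `extract_features`.

```python
import re
from itertools import product

BASES = "ACGT"

# Assumed consensus patterns; the abstract does not state the exact motifs used.
# The lookahead (?=...) lets overlapping occurrences be counted.
TATA_BOX = re.compile(r"(?=TATA[AT]A[AT])")
CAAT_BOX = re.compile(r"(?=CCAAT)")


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: count(CG) * length / (count(C) * count(G))."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def extract_features(seq):
    """Build an 83-value vector: 3 motif features, 16 dinucleotide, 64 trinucleotide."""
    seq = seq.upper()
    features = [
        len(TATA_BOX.findall(seq)),   # number of TATA-box-like sites
        len(CAAT_BOX.findall(seq)),   # number of CAAT-box-like sites
        cpg_observed_expected(seq),   # simplified stand-in for CpG-island features
    ]
    features += kmer_frequencies(seq, 2)
    features += kmer_frequencies(seq, 3)
    return features
```

A vector produced this way for each full genome could then be passed to any standard classifier together with its risk label. Which classifiers and which exact feature definitions the study used is not stated in the abstract.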


