Biology, Vol. 14, Pages 670: Analysis of Software Read Cross-Contamination in DNBSEQ Data


Biology doi: 10.3390/biology14060670

Authors:
Dmitry N. Konanov
Vera Y. Tereshchuk
Ignat V. Sonets
Elena V. Korneenko
Aleksandra V. Lukina-Gronskaya
Anna S. Speranskaya
Elena N. Ilina

DNA nanoball sequencing (DNBSEQ) is one of the most rapidly developing sequencing technologies and is widely applied in genomic and transcriptomic investigations. Recently, a new PE300 sequencing option, primarily recommended for amplicon analysis, was released for DNBSEQ-G99 and G400 devices. Given their unprecedentedly high data yield per flow cell, the new PE300 kits could be a great choice for various sequencing tasks, but we found that combining different types of DNA libraries in a single run can lead to undesired artifacts in the data. In this study, we investigate the occasional read cross-contamination that we first observed in our DNBSEQ PE300 run. The phenomenon, which we refer to as “software contamination”, is not actual contamination but manifests primarily as improper forward/reverse read pairing, improper demultiplexing, or “digital chimeric” reads. Although rare, these artifacts were found in all runs we analyzed, including several MGI demo datasets (both PE100 and PE150). We demonstrate that these artifacts arise primarily from the incorrect resolution of sequencing signals produced by neighboring DNA nanoballs, which leads to mixing up of forward and reverse reads or improper demultiplexing. The artifacts occur most frequently in read pairs where the insert sequence is shorter than the read length. Based on a few external NA12878 human exome sequencing datasets, we conclude that the total improper pairing rate in DNBSEQ data is comparable to that in Illumina data. Overall, the problem affects analysis results only when simultaneously sequenced libraries have markedly different insert size distributions or flow cell loading. Additionally, we demonstrate that raw DNBSEQ data might contain ~2% optical duplicates, resulting from the same effect of closely neighboring DNB sites in the flow cell.
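As an illustration of what an "improper pairing rate" can mean in practice, the sketch below counts paired primary alignments in SAM-formatted text that the aligner did not flag as properly paired. It is a minimal sketch, not the authors' pipeline: the function name `improper_pair_rate`, the filtering choices (skipping unmapped, secondary, and supplementary records), and the toy records are all assumptions made for illustration. Only the FLAG bit meanings come from the SAM format itself, and how an aligner decides "properly paired" varies between tools.

```python
# FLAG bits as defined by the SAM format.
PAIRED = 0x1          # template has multiple segments
PROPER_PAIR = 0x2     # aligner considers the pair properly aligned
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped
SECONDARY = 0x100     # secondary alignment
SUPPLEMENTARY = 0x800 # supplementary alignment


def improper_pair_rate(sam_lines):
    """Return the fraction of paired, mapped, primary alignments
    that lack the 'properly paired' flag (hypothetical helper)."""
    paired = 0
    improper = 0
    for line in sam_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if not flag & PAIRED:
            continue  # single-end record
        if flag & (UNMAPPED | MATE_UNMAPPED):
            continue  # pairing cannot be judged without both mates mapped
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary alignment
        paired += 1
        if not flag & PROPER_PAIR:
            improper += 1
    return improper / paired if paired else 0.0


# Toy records (made up): only the first two columns matter here.
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100",   # paired, proper
    "r1\t147\tchr1\t300",  # paired, proper
    "r2\t65\tchr1\t500",   # paired, not proper
    "r2\t129\tchr7\t900",  # paired, not proper
    "r3\t0\tchr1\t50",     # unpaired, ignored
]
rate = improper_pair_rate(example)
```

For real data, the same counting would typically be done on BAM files with an established tool rather than on raw text lines; the sketch only shows the flag logic.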


