BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
Recent advances in long-read sequencing allow characterization of genome structure and its variation within and between species at a resolution not previously possible. Detection of overlap between reads is an essential component of many long read genome pipelines, such as de novo genome assembly. Longer reads simplify genome assembly and improve reconstruction contiguity, but current long read technologies are associated with moderate to high error rates.
In this work, we present Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel overlap detection and alignment algorithm using sparse matrix-matrix multiplication. In addition, we present a probabilistic model that demonstrates the feasibility of using k-mers for overlap candidate detection and shows its flexibility when applied to different k-mer selection strategies. Based on such a model, we introduce a notion of reliable k-mers. Combining reliable k-mers with our binning mechanism increases the computational efficiency and accuracy of our algorithm. Finally, we present a new method based on Chernoff bounds to separate true overlaps from false positives by combining alignment techniques and probabilistic modeling. Our goal is to maximize the balance of precision and recall.
For both real and synthetic data, BELLA is among the best F1 scores, showing a stability of performance that is often lacking in competing software. BELLA’s F1 score is consistently within 1.7% of the top performer. In particular, we show improved de novo assembly quality on synthetic data when BELLA is coupled with the miniasm assembler.