Learning HTSlib (1)

In the past week, I’ve been adapting my own HiC pipeline to the 4DN recommended pipeline. One critical step is to convert hundreds of read-paired BAM files generated by my old pipeline to pairs format. 4DN consortium provides a tool, bam2pairs, for this task. It’s basically a Perl script that calls Samtools to read a BAM file and output targeted fields. Because the pair format suggests the columns to be sorted by chr1-chr2-pos1-pos2, bam2pairs then calls Linux “sort” to perform four times of sorting based on these four columns [1]. [Read More]

Exact string matching algorithms: Boyer-Moore and KMP

String matching algorithms are used in lots of scenarios such as searching words in a text file, or locating specific sequences in a genome. I’ve heard of KMP algorithm long time ago, but don’t have a chance to implement it by myself (just lazy). It seems that KMP is not widely used in genomics field, but instead another algorithm, Boyer-Moore, is taught more often in computational genomics classes. I recently find a great online tutorial on computational genomics by Ben Langmead @JHU (link). [Read More]