Demystify Hi-C Data Normalization

Hi-C is a sequencing-based method for profiling genome-wide chromatin contacts. It has been widely used to study biological questions such as gene regulation, chromatin structure, and genome assembly. A Hi-C experiment involves a series of biochemical reactions that may introduce noise into the output. Subsequent data analysis steps, such as read mapping, also give rise to noise that affects the interpretation of the final output: a contact matrix, where each element denotes the contact strength between two regions of the genome. [Read More]

Learning HTSlib (1)

In the past week, I’ve been adapting my own Hi-C pipeline to the 4DN recommended pipeline. One critical step is to convert hundreds of paired-read BAM files generated by my old pipeline to the pairs format. The 4DN consortium provides a tool, bam2pairs, for this task. It’s basically a Perl script that calls Samtools to read a BAM file and output the targeted fields. Because the pairs format suggests that records be sorted by chr1-chr2-pos1-pos2, bam2pairs then calls Linux “sort” to perform four rounds of sorting based on these four columns [1]. [Read More]
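
To make the chr1-chr2-pos1-pos2 ordering concrete, here is a minimal Python sketch that sorts a few toy records with a single composite key. It is only an illustration of the sort order, not how bam2pairs does it. The records are made up, the helper name `sort_pairs` is hypothetical, and the column layout (readID, chr1, pos1, chr2, pos2, strand1, strand2) is an assumption about the pairs format.

```python
def sort_pairs(records):
    """Sort pair records by chr1, chr2, pos1, pos2 using one composite key.

    Assumed column layout: readID, chr1, pos1, chr2, pos2, strand1, strand2.
    Positions are compared as integers; chromosome names are compared as
    plain strings (so "chr10" would sort before "chr2").
    """
    return sorted(records, key=lambda r: (r[1], r[3], int(r[2]), int(r[4])))


# Hypothetical toy records, for illustration only.
records = [
    ("r1", "chr2", "500", "chr2", "900", "+", "-"),
    ("r2", "chr1", "300", "chr2", "100", "+", "+"),
    ("r3", "chr1", "200", "chr1", "800", "-", "+"),
    ("r4", "chr1", "100", "chr2", "50", "+", "-"),
]

sorted_records = sort_pairs(records)
for rec in sorted_records:
    print("\t".join(rec))
```

Sorting on one tuple key gives the same final order as a sequence of stable sorts applied from the least significant column (pos2) to the most significant (chr1), but does it in a single pass over the data.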

Exact string matching algorithms: Boyer-Moore and KMP

String matching algorithms are used in many scenarios, such as searching for words in a text file or locating specific sequences in a genome. I heard of the KMP algorithm a long time ago but never got around to implementing it myself (just lazy). It seems that KMP is not widely used in genomics; instead, another algorithm, Boyer-Moore, is taught more often in computational genomics classes. I recently found a great online tutorial on computational genomics by Ben Langmead @JHU (link). [Read More]

How does the number of threads affect mapping time in BWA MEM

I’ve been using BWA MEM a lot recently for mapping short reads from whole-genome sequencing or Hi-C experiments. Mapping high-coverage sequencing data is a time-consuming task. The easiest way to speed it up, especially on servers, is to use multiple threads, which is done by adding the -t flag. I’ve seen my labmates use 20 threads, 32 threads, or 40 threads. There seems to be no official recommendation on the number of threads that would make the mapping fastest. [Read More]