A paper titled “Eigenvectors from eigenvalues” has recently sparkled heated discussions in Chinese communities on Wechat. This paper, co-authored by Peter Denton, Stephen Parke, Terence Tao and Xining Zhang, introduces a new method for calculating eigenvectors from solely eigenvalues. Some bloggers even describe this method as a revolution to traditional textbook method for calculating eigenvectors. I’ve never heard of this method before and wanted to give it a try to see if it is valid from a computational aspect.
[Read More]
Demystify Hi-C Data Normalization
Hi-C is a sequencing-based method for profiling the genome-wide chromatin contacts. It has been widely used in studying various biological questions such as gene regulation, chromatin structures, genome assembly, etc. The Hi-C experiments involves a series of biochemistical reactions that may introduce noises to the output. Subsequent data analysis such as read mapping also give rise to noises that affect the interpretation of the final output: a contact matrix, where each element in the matrix denotes the contact strength between any two regions of genome.
[Read More]
Learning HTSlib (1)
In the past week, I’ve been adapting my own HiC pipeline to the 4DN recommended pipeline. One critical step is to convert hundreds of read-paired BAM files generated by my old pipeline to pairs format. 4DN consortium provides a tool, bam2pairs, for this task. It’s basically a Perl script that calls Samtools to read a BAM file and output targeted fields. Because the pair format suggests the columns to be sorted by chr1-chr2-pos1-pos2, bam2pairs then calls Linux “sort” to perform four times of sorting based on these four columns [1].
[Read More]
Exact string matching algorithms: Boyer-Moore and KMP
String matching algorithms are used in lots of scenarios such as searching words in a text file, or locating specific sequences in a genome. I’ve heard of KMP algorithm long time ago, but don’t have a chance to implement it by myself (just lazy). It seems that KMP is not widely used in genomics field, but instead another algorithm, Boyer-Moore, is taught more often in computational genomics classes. I recently find a great online tutorial on computational genomics by Ben Langmead @JHU (link).
[Read More]
Similarity measurement of two sets of TADs
Months ago I had this question of comparing TADs from two Hi-C experiments in my research. TAD is short for topologically associated domains, discovered by Dixon et al [1] in 2012, which are regions in the genome that interacts more frequenctly with regions in the domain than outside. The goal of comparing TADs from two Hi-C expriments is to search for differentially interacting regions under two conditions to tease out regions that react to the conditions.
[Read More]
How does number of threads affect mapping time in BWA MEM
I’ve been using BWA MEM a lot recently for mapping short reads from whole genome sequencing or HiC experiments. Mapping high coverage sequencing data is a time-consuming task. The easiest way to accelerate, especially on servers, is to use multiple threads which is done by adding the -t flag. I’ve seen my labmates use 20 threads, 32 threads, or 40 threads. There seem to be no offical suggestions on the number of threads that would make the mapping fastest.
[Read More]