Eigenvectors from Eigenvalues: A Python implementation

Posted on November 19, 2019 | 4 minutes | 668 words | Fan

A paper titled “Eigenvectors from eigenvalues” has recently sparked heated discussions in Chinese communities on WeChat. This paper, co-authored by Peter Denton, Stephen Parke, Terence Tao and Xining Zhang, introduces a new method for calculating eigenvectors solely from eigenvalues. Some bloggers even describe this method as a revolution over the traditional textbook method for calculating eigenvectors. I had never heard of this method before and wanted to give it a try to see whether it holds up from a computational standpoint. [Read More]
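As a rough illustration of the identity the paper is named after: for a Hermitian matrix A with eigenvalues λ_i(A) and unit eigenvectors v_i, and with M_j denoting A with its j-th row and column removed, |v_{i,j}|² ∏_{k≠i} (λ_i(A) − λ_k(A)) = ∏_k (λ_i(A) − λ_k(M_j)). The sketch below is a minimal NumPy version and not necessarily the code from the post. It assumes a real symmetric input with distinct eigenvalues, and the function name `eigvec_sq_from_eigvals` is made up for this example.

```python
import numpy as np

def eigvec_sq_from_eigvals(A):
    """Return sq with sq[i, j] = |v_{i,j}|^2, computed from eigenvalues only.

    Assumes A is real symmetric with distinct eigenvalues; repeated
    eigenvalues would make the denominator zero.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    lam = np.linalg.eigvalsh(A)  # eigenvalues of A, ascending
    sq = np.empty((n, n))
    for j in range(n):
        # Minor M_j: drop the j-th row and the j-th column of A.
        M = np.delete(np.delete(A, j, axis=0), j, axis=1)
        mu = np.linalg.eigvalsh(M)  # eigenvalues of the minor
        for i in range(n):
            num = np.prod(lam[i] - mu)
            den = np.prod(lam[i] - np.delete(lam, i))
            sq[i, j] = num / den
    return sq

# Compare against the eigenvectors returned by a standard solver.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
w, V = np.linalg.eigh(A)  # column V[:, i] is the i-th eigenvector
print(np.allclose(eigvec_sq_from_eigvals(A), (V ** 2).T))
```

Note that the identity only recovers the squared magnitude of each component, not its sign or phase, and that this version calls an eigenvalue solver once per minor, so it is not a faster replacement for a standard eigenvector routine.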

Statistics Algorithm

Demystify Hi-C Data Normalization

Posted on November 3, 2019 | 7 minutes | 1398 words | Fan

Hi-C is a sequencing-based method for profiling genome-wide chromatin contacts. It has been widely used to study biological questions such as gene regulation, chromatin structure, and genome assembly. A Hi-C experiment involves a series of biochemical reactions that may introduce noise into the output. Subsequent data analysis, such as read mapping, also gives rise to noise that affects the interpretation of the final output: a contact matrix, where each element denotes the contact strength between two regions of the genome. [Read More]

Bioinfo Hi-C

Learning HTSlib (1)

Posted on April 7, 2019 | 6 minutes | 1220 words | Fan

In the past week, I’ve been adapting my own Hi-C pipeline to the 4DN recommended pipeline. One critical step is to convert hundreds of paired-read BAM files generated by my old pipeline to pairs format. The 4DN consortium provides a tool, bam2pairs, for this task. It’s basically a Perl script that calls Samtools to read a BAM file and output the targeted fields. Because the pairs format suggests that records be sorted by chr1-chr2-pos1-pos2, bam2pairs then calls Linux “sort” to perform four rounds of sorting, one for each of these four columns [1]. [Read More]
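To make the chr1-chr2-pos1-pos2 ordering concrete, here is a minimal Python sketch that sorts a handful of in-memory records with one composite key. It is not how bam2pairs works internally. The record layout (readID, chr1, pos1, chr2, pos2, strand1, strand2) and the function name `sort_pairs` are assumptions made for this example, and chromosome names are compared as plain strings.

```python
def sort_pairs(records):
    """Sort pair records by (chr1, chr2, pos1, pos2).

    Assumed record layout (hypothetical, for illustration only):
    (read_id, chr1, pos1, chr2, pos2, strand1, strand2)
    """
    # One composite key yields the same order as four stable sorts
    # applied from the least to the most significant column.
    return sorted(records, key=lambda r: (r[1], r[3], r[2], r[4]))

records = [
    ("r1", "chr2", 100, "chr2", 200, "+", "-"),
    ("r2", "chr1", 500, "chr2", 50, "+", "+"),
    ("r3", "chr1", 300, "chr1", 400, "-", "+"),
    ("r4", "chr1", 100, "chr1", 900, "+", "-"),
]

for rec in sort_pairs(records):
    print("\t".join(str(field) for field in rec))
```

For hundreds of large files an in-memory sort like this would not scale, which is one reason an external sort is the usual choice.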

Bioinfo C CPP Sequencing

Exact string matching algorithms: Boyer-Moore and KMP

Posted on December 28, 2018 | 7 minutes | 1310 words | Fan

String matching algorithms are used in many scenarios, such as searching for words in a text file or locating specific sequences in a genome. I heard of the KMP algorithm a long time ago, but never got around to implementing it myself (just lazy). It seems that KMP is not widely used in the genomics field; instead, another algorithm, Boyer-Moore, is taught more often in computational genomics classes. I recently found a great online tutorial on computational genomics by Ben Langmead @JHU (link). [Read More]

Algorithm Bioinfo Sequencing

Similarity measurement of two sets of TADs

Posted on November 24, 2018 | 5 minutes | 931 words | Fan

Months ago, the question of how to compare TADs from two Hi-C experiments came up in my research. TAD is short for topologically associated domain. TADs, discovered by Dixon et al. [1] in 2012, are regions of the genome that interact more frequently with regions inside the domain than with regions outside it. The goal of comparing TADs from two Hi-C experiments is to find regions that interact differently under two conditions, and thereby tease out the regions that respond to those conditions. [Read More]

Statistics Hi-C

How does number of threads affect mapping time in BWA MEM

Posted on August 9, 2017 | 2 minutes | 338 words | Fan

I’ve been using BWA MEM a lot recently for mapping short reads from whole-genome sequencing or Hi-C experiments. Mapping high-coverage sequencing data is a time-consuming task. The easiest way to speed it up, especially on servers, is to use multiple threads, which is done by adding the -t flag. I’ve seen my labmates use 20, 32, or 40 threads. There seems to be no official recommendation on the number of threads that makes mapping fastest. [Read More]

Bioinfo

Hello from dearxxj

Posted on January 8, 2017 | 1 minute | 2 words | Fan

Hello World!

Misc