In the past week, I’ve been adapting my own HiC pipeline to the 4DN recommended pipeline. One critical step is to convert hundreds of read-paired BAM files generated by my old pipeline to pairs format. 4DN consortium provides a tool, bam2pairs, for this task. It’s basically a Perl script that calls Samtools to read a BAM file and output targeted fields. Because the pair format suggests the columns to be sorted by chr1-chr2-pos1-pos2, bam2pairs then calls Linux “sort” to perform four times of sorting based on these four columns [1]. This step is actually very time-consuming. I tested it on a 200 million reads Hi-C bam file which took ~ 8 hours and 20 minutes to finish. This is an astounding task for my hundreds of BAM files, let alone many of these files contain billions of reads.
The reason that the pairs format is sorted this way is that some downstream tools (pairix/pairtools) requires the reads from the same block of the super matrix being placed together, such as reads from chr12-chr12, chr17-chr20, etc. Juicer tools (pre) that generate .hic format from pairs format also requires input the pairs file to be sorted at least by chr1-chr2. However, cooler has no such requirements. Pairs files are no more than intermediate files for generating .hic or .cool binary files. What I need is just pair files sorted by chr1-chr2, this should save some time compared with four-column sorting. Now that the goal is clear, I can either modify the original bam2pairs script or write my own version. Since I haven’t touched Perl for a long time, I think it’s time to write it from scratch with either C (HTSlib) or python (pysam). For efficiency consideration, I chose HTSlib for this task.
HTSlib is written by Heng Li, the author of Samtools and BWA. I wanted to share a little story here about him on how he influences the Bioinformatics field. The professor for my first-year Bioinformatics class once called him the “Mapping God”. Tools developed by Heng are fast and easy to use, like BWA/samtools/seqtk/bcftools/minimap etc. HTSlib is the underlying library for samtools/bcftools. One thing I don’t like about HTSlib is that it does not have a good document. Maybe that’s not a problem for pro developers. It gave me a hard time to learn HTSlib purely from the source code, but the process is lots of fun and I wanted to share my notes on HTSlib here.
HTSlib consists of a bunch of header files. Almost all SAM/BAM manipulations are written in the sam.h file. If all you need is read and write sam files, this is probably the only header file that you need to include. It first defines a BAM header struct.
1typedef struct bam_hdr_t {
2 int32_t n_targets, ignore_sam_err; // number of reference sequences
3 uint32_t l_text;
4 uint32_t *target_len;
5 int8_t *cigar_tab;
6 char **target_name; // names of the reference sequences
7 char *text;
8 void *sdict; // header dictionary
9 uint32_t ref_count;
10} bam_hdr_t;
target_name is an array of references names such as chr1, chr2, …, etc. It is useful when one wants to retrieve chromosome names from sam records which only stores the name indexes to save space. sdict is a string to integer hash table which is used in header parsing. Followed by the header struct are CIGAR-related macros.
1#define BAM_FPAIRED 1
2/*! @abstract the read is mapped in a proper pair */
3#define BAM_FPROPER_PAIR 2
4/*! @abstract the read itself is unmapped; conflictive with BAM_FPROPER_PAIR */
5#define BAM_FUNMAP 4
6/*! @abstract the mate is unmapped */
7#define BAM_FMUNMAP 8
8/*! @abstract the read is mapped to the reverse strand */
9#define BAM_FREVERSE 16
10/*! @abstract the mate is mapped to the reverse strand */
11#define BAM_FMREVERSE 32
I found BAM_FUNMAP and BAM_FMUNMAP are useful for basic read filtering, which is equivalent to “samtools view -F12”. BAM_FREVERSE and BAM_FMREVERSE can be used to distinguish reads mapped to two strands. The way to do these filtering in C is simply “flag & BAM_XXXXX”.
Next, structs for one alignment and alignment core information are defined.
1typedef struct {
2 int32_t tid; // chromosome ID, defined by bam_hdr_t
3 int32_t pos; // 0-based leftmost coordinate
4 uint16_t bin;
5 uint8_t qual;
6 uint8_t l_qname;
7 uint16_t flag;
8 uint8_t unused1;
9 uint8_t l_extranul;
10 uint32_t n_cigar;
11 int32_t l_qseq;
12 int32_t mtid; // chromosome ID of next/mate read in template, defined by bam_hdr_t
13 int32_t mpos;
14 int32_t isize;
15} bam1_core_t;
16
17typedef struct {
18 bam1_core_t core; // bam1_core_t
19 int l_data;
20 uint32_t m_data;
21 uint8_t *data; // all variable-length data, concatenated; structure: qname-cigar-seq-qual-aux
22#ifndef BAM_NO_ID
23 uint64_t id;
24#endif
25} bam1_t;
All fields in the bam1_core_t are pretty self-explained. tid/mtid are indexes to the chromosome names as mentioned above (e.g. bam_hdr->target_names[tid]). Note that pos here is 0-based while SAM format is 1-based, therefore (pos + 1) is need to generate SAM-like output. bam1_t is defined in a way similar to kstring_t struct. It can be accessed by some useful APIs defined in the following lines.
1#define bam_is_rev(b) (((b)->core.flag&BAM_FREVERSE) != 0) // whether the query is on the reverse strand
2#define bam_is_mrev(b) (((b)->core.flag&BAM_FMREVERSE) != 0) // whether the mate query is on the reverse strand
3#define bam_get_qname(b) ((char*)(b)->data) // Get the name of the query
4#define bam_get_cigar(b) // Get the CIGAR array
5#define bam_get_seq(b) // Get query sequence
6#define bam_seqi(s, i) ((s)[(i)>>1] >> ((~(i)&1)<<2) & 0xf) // Get a base on read
There are at least two ways to read a BAM file: 1) read from the beginning; 2) read from a given chromosomal position by loading the BAM index.
1////////// either way, initialization of an alignment is needed
2b = bam_init1();
3in = sam_open(argv[1], "r");
4if (in == NULL) return -1;
5header = sam_hdr_read(in);
6if (header = NULL) return -1;
7
8////////// 1. Read from the beginning
9while (sam_read1(in, header, b) >= 0) {
10 c = &b->core;
11 chr = header->target_name[c->tid];
12 mchr = header->target_name[c->mtid];
13 pos = c->pos + 1; // sam format is 1-based
14 mpos = c->mpos + 1;
15 fprintf(stdout, "%s\t%d\t%s\t%d\n", chr, pos, mchr, mpos);
16}
17
18////////// 2. Read from a given region
19hts_itr_t *iter; // initiate an iterator
20hts_idx_t *idx; // initiate an BAM index
21idx = sam_index_load(in, argv[1]); // second parameter is same as BAM file path
22if (idx == NULL) return -1;
23iter = sam_itr_querys(idx, header, argv[2]); // third parameter is like chr1:1000-2000, locate to the start of the region
24while ( sam_itr_next(in, iter, b) >= 0) {
25 ...
26}
27
28////////// Last, don't forget to destroy all pointers
29hts_itr_destroy(iter);
30bam_destroy1(b);
31bam_hdr_destroy(header);
32sam_close(in);
Compile a HTSlib program
1# 1. compile HTSlib and create a static library and header files
2git clone ....; ...; make && make install
3
4# 2. link to the HTSlib library and other necessary libraries
5gcc -O2 -Wall main.c htslib-build/lib/libhts.a -lz -lbz2 -llzma -lcurl -o bam2pairs
[1]. (Update 191103). GNU sort actually can be pretty fast if used properly (ref) since it is already optimized for large files. To do this, one needs to set the –parallel parameter to the number of cores that sort will use. Algorithmly, GNU sort implements external sorting which first split the input into smaller portions and fit that in the memory, then perform sorting on each portion and lastly merge the results.