Differences between LiBiNorm and htseq-count
Differences in the assignments of reads to genes
There are four differences in the way that LiBiNorm counts reads compared to htseq-count:
- Reads that map to contigs where no genes are located in the reference annotation are counted as noMatch by LiBiNorm. htseq-count ignores such reads.
- LiBiNorm ignores transcripts of biotype "retained intron" in gtf files as, in practice, few reads map to the retained introns and their presence gives misleading results when estimating gene/transcript length.
- When a bam file contains multiple alternative mappings for a given read or read pair, LiBiNorm counts this only once, as a single non-unique read or read pair. htseq-count counts the read multiple times.
- If the gtf file contains regions where the source is "ensembl_havana" or "ERCC" then only these regions will be used. This avoids problems where the same gene is defined
"LiBiNorm count" can be run with the -z option in an htseq-count compatible mode, in which reads are assigned to genes in exactly the same way as in htseq-count.
In addition, when htseq-count is used with paired-end data there are small differences between the counts generated from name ordered (-r name) and position ordered (-r pos) data. LiBiNorm generates the same counts irrespective of the read ordering, and these counts are identical to the results that htseq-count gives with name ordered data. The definition of the htseq-count __alignment_not_unique count changed at some point between releases 0.6 and 0.11. LiBiNorm 2.4 switched to the 0.11 definition.
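The multiple-mapping rule in the list above can be illustrated with a minimal Python sketch. It is not code from either tool: the input format (one tuple per alignment record), the function names and the way multi-mapped reads are detected are all assumptions made for illustration, and the sketch only shows how the non-unique total differs under the two rules as described in the text.

```python
from collections import Counter

# Hypothetical input: one (read_name, gene) tuple per alignment record.
# A read with several alternative mappings appears several times.
alignments = [
    ("read1", "geneA"),
    ("read2", "geneA"),  # read2 has two alternative mappings
    ("read2", "geneB"),
    ("read3", "geneB"),
]


def count_once_per_read(records):
    """Count a multi-mapped read once as non-unique (rule described for LiBiNorm)."""
    hits = Counter(name for name, _ in records)
    gene_counts = Counter()
    multi_mapped = set()
    for name, gene in records:
        if hits[name] > 1:
            multi_mapped.add(name)  # a set, so each read is recorded once
        else:
            gene_counts[gene] += 1
    return gene_counts, len(multi_mapped)


def count_every_alignment(records):
    """Count every alignment of a multi-mapped read (rule described for htseq-count)."""
    hits = Counter(name for name, _ in records)
    gene_counts = Counter()
    non_unique = 0
    for name, gene in records:
        if hits[name] > 1:
            non_unique += 1  # incremented once per alignment record
        else:
            gene_counts[gene] += 1
    return gene_counts, non_unique


once_genes, once_non_unique = count_once_per_read(alignments)
every_genes, every_non_unique = count_every_alignment(alignments)
print(once_non_unique, every_non_unique)
```

In this toy input the per-gene counts are the same under both rules; only the non-unique total differs, because read2 contributes one count under the first rule and two under the second.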
Processing speed
LiBiNorm is 10 to 15 times faster than htseq-count. The following table compares the times taken to process the same bam data in name ordered and position ordered form.
Program | Ordering | Size | Time |
LiBiNorm 2.4 | Name ordered | 5.9GByte | 8 minutes |
LiBiNorm 2.4 | Position ordered | 4.1GByte | 12 minutes |
htseq-count 0.11.3 | Name ordered | 5.9GByte | 117 minutes |
htseq-count 0.11.3 | Position ordered | 4.1GByte | 130 minutes |
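The speed-up factors implied by the table can be checked with a few lines of Python. The times are taken directly from the table; the dictionary layout and the function name are only illustrative.

```python
# Run times in minutes, copied from the table above.
timings_minutes = {
    "name": {"LiBiNorm 2.4": 8, "htseq-count 0.11.3": 117},
    "position": {"LiBiNorm 2.4": 12, "htseq-count 0.11.3": 130},
}


def speedup(ordering):
    """Return how many times faster LiBiNorm was for the given read ordering."""
    row = timings_minutes[ordering]
    return row["htseq-count 0.11.3"] / row["LiBiNorm 2.4"]


for ordering in timings_minutes:
    print(f"{ordering} ordered: {speedup(ordering):.1f}x")
# name ordered: 14.6x
# position ordered: 10.8x
```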
Unsupported htseq-count options
-f <format>, --format=<format>
Unavailable as LiBiNorm only supports SAM files
--additional-attr=<id attributes> *
LiBiNorm does not support the specification of additional attributes.
--nonunique=<nonunique mode> *
LiBiNorm only operates in the default mode (none).
--secondary-alignments=<mode> *
LiBiNorm only operates in the default mode (score), in which the score is used to determine whether the read should be included.
--supplementary-alignments=<mode> *
LiBiNorm only operates in the default mode (score), in which the score is used to determine whether the read should be included.
--max-reads-in-buffer=<number>
This is not required in LiBiNorm because, when the buffer size exceeds 200000, the buffer is written to one or more temporary files, which are then reprocessed to find all of the remaining pairs. The buffer size can be changed using the READ_CACHE_SIZE #define in Options.h.
* These modes are currently not supported but it is possible that any of them could be provided in the future if there is a demand for them.
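The buffering behaviour described under --max-reads-in-buffer can be pictured with the following Python sketch. It is a simplified illustration of the general spill-and-reprocess idea, not LiBiNorm's actual C++ implementation: the function name, the (name, payload) read representation, the single temporary file and the tiny cache size are all assumptions, and the second pass here keeps its unmatched reads in memory rather than spilling again.

```python
import tempfile

READ_CACHE_SIZE = 4  # tiny illustrative value; the text above gives 200000


def pair_reads(reads, cache_size=READ_CACHE_SIZE):
    """Pair reads by name, spilling waiting reads to a temporary file when the cache is full.

    `reads` is an iterable of (name, payload) string tuples (hypothetical format).
    Returns (pairs, unpaired): a dict of name -> sorted payload pair, and a dict
    of any reads whose mate was never seen.
    """
    pending = {}  # reads still waiting for their mate
    pairs = {}
    with tempfile.TemporaryFile(mode="w+") as spill:
        for name, payload in reads:
            if name in pending:
                pairs[name] = tuple(sorted((pending.pop(name), payload)))
                continue
            pending[name] = payload
            if len(pending) > cache_size:
                # Cache is full: write the waiting reads out and start afresh.
                for spilled_name, spilled_payload in pending.items():
                    spill.write(f"{spilled_name}\t{spilled_payload}\n")
                pending.clear()
        # Reprocess the spilled reads to find the remaining pairs.
        spill.seek(0)
        for line in spill:
            name, payload = line.rstrip("\n").split("\t")
            if name in pending:
                pairs[name] = tuple(sorted((pending.pop(name), payload)))
            else:
                pending[name] = payload
    return pairs, pending


# Position ordered style input: every first mate arrives before any second mate.
names = ["a", "b", "c", "d", "e", "f"]
example_reads = [(n, n + "/1") for n in names] + [(n, n + "/2") for n in names]
example_pairs, example_unpaired = pair_reads(example_reads)
print(len(example_pairs), len(example_unpaired))
```

With the cache limited to four waiting reads, none of the six pairs can be completed in the first pass, and all of them are recovered when the temporary file is reprocessed.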
Differences in htseq-count options
-o <bamout>, --bamout=<bamout>
The -o option creates bam files rather than sam files. With paired-end data it can only be used when the data is ordered by read name.