LiBiNorm count ≈ htseq-count

This mode takes aligned RNA-seq data from a bam file, determines the number of reads associated with each gene or transcript, and normalises for bias in the read distributions.

The command line format is LiBiNorm count [options] alignment_file gff_file

This mode builds on the functionality of htseq-count. Indeed, "LiBiNorm count" can be run in a fully htseq-count compatible mode using the -z option. This option also disables the two changes that have been made to the htseq-count algorithm, so that the results should match those of htseq-count.

The standard htseq-count command line options are supported, except:

  • the -f option is unavailable as currently only bam files are supported
  • the -o option creates bam files rather than sam files, and with paired-end data it can only be used when the data is ordered by read name.

In the default mode LiBiNorm count normalises the effective gene lengths based on the read distribution bias that is found. The associated paper describes six alternative bias models: A-E and BD. Model BD is appropriate for SMART-seq datasets and is used by default.

The output giving the list of genes and associated expression is sent to stdout, mirroring the operation of htseq-count. It can also be sent to a file using a new -c option.

Bias normalisation in "LiBiNorm count"

Bias normalisation within LiBiNorm is performed in two stages. The first determines the model parameters that best fit the data. The second uses those parameters to generate an expression value for each gene or transcript, using the TPM measure of expression with the transcript length adjusted by the bias predicted by the model. For example, if the transcript length is 2000 and the predicted bias is 0.4, then the effective length used in the calculation of TPM is 800.
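The effective-length adjustment can be illustrated with a minimal Python sketch. This is not LiBiNorm's own code: the function names and the example counts are hypothetical, the bias values are assumed to have been predicted already by a fitted model, and the standard TPM formula is assumed (reads per effective base, rescaled so that all values sum to one million).

```python
def effective_length(length, bias):
    """Transcript length scaled by the bias predicted by the model."""
    return length * bias


def tpm(counts, lengths, biases):
    """TPM computed with bias-adjusted (effective) transcript lengths."""
    # Reads per effective base for each transcript.
    rates = [
        count / effective_length(length, bias)
        for count, length, bias in zip(counts, lengths, biases)
    ]
    total = sum(rates)
    # Rescale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Hypothetical example: two transcripts of equal length and equal read count,
# but with different predicted bias (0.4 and 0.8).
example_lengths = [2000, 2000]
example_biases = [0.4, 0.8]
example_counts = [100, 100]

print(effective_length(2000, 0.4))  # effective length for the example in the text
print(tpm(example_counts, example_lengths, example_biases))
```

In this sketch the transcript with the smaller predicted bias has the shorter effective length, so the same read count translates into a higher TPM value.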

Bias normalisation options

The default bias compensation is based on model BD, which corresponds to SMART-seq protocols. Alternative models can be selected for different protocols using the -n M option, where M is one of the six models (A-E or BD), best, or none.

The -n best option calculates parameters for all 6 models, picks the most suitable model and normalises the expression value based on this model.

The -n none option disables normalisation and produces an htseq-compatible count file like the one produced with the -z option, except that the htseq-compatibility modes are not enabled.

Output files

When the -u <filenameroot> option is used a number of files are generated that provide details of the characteristics of the reads and the capabilities of all the models which are described here.

In the htseq-count compatible mode only the bias file is produced with the -u option.

The -o <bamfilename> option generates a bam file in which each read is annotated with an extra 'XF' tag that identifies the feature to which the read was mapped, mirroring htseq-count except that a bam rather than a sam file is created. Two alternatives to -o are -om <bamfilename>, which only outputs reads that are associated with a genetic feature, and -ou <bamfilename>, which only outputs reads that were not associated with features. In each case two bam files are created. In the first the reads retain the ordering of the original bam file, and in the second (bamfilename.sort.bam) they are ordered by genomic position, so that they can be viewed using viewers such as IGV.

Generating a landscape file for use by LiBiNorm model

Additional options for bias normalisation are available using the 'LiBiNorm model' mode. This takes as its input a landscape file, which LiBiNorm count produces when run with the -l option. The file is named <filenameroot>_landscape.txt.

Threads

By default LiBiNorm uses three threads for the parameter estimation but this can be changed with the -p option.

Parameter discovery using a reduced number of reads

By default LiBiNorm uses a maximum of 100 million reads for parameter estimation, but this can be changed using -d N. Reducing the number of reads reduces the time taken to determine the parameters but can make the parameters less optimal.
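The options described on this page can be combined on one command line. The following Python sketch assembles such a command as an argument list, which could then be passed to subprocess.run. The helper name and the file names are hypothetical, and it is assumed that -p takes the thread count as its argument in the same way that -d takes N.

```python
import shlex

# Model names accepted by -n, as listed on this page.
VALID_MODELS = {"A", "B", "C", "D", "E", "BD", "best", "none"}


def build_count_command(alignment_file, gff_file, model=None, threads=None, max_reads=None):
    """Assemble a 'LiBiNorm count' command line (hypothetical helper)."""
    if model is not None and model not in VALID_MODELS:
        raise ValueError(f"unknown bias model: {model}")
    cmd = ["LiBiNorm", "count"]
    if model is not None:
        cmd += ["-n", model]           # bias model selection
    if threads is not None:
        cmd += ["-p", str(threads)]    # assumed form: -p <number of threads>
    if max_reads is not None:
        cmd += ["-d", str(max_reads)]  # maximum reads used for parameter estimation
    # Positional arguments come last: alignment_file gff_file
    cmd += [alignment_file, gff_file]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_count_command("sample.bam", "genes.gff", model="best", threads=8, max_reads=10_000_000)
print(shlex.join(cmd))
```

Because the expression table is written to stdout, the result of running such a command can be captured by redirecting stdout to a file.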