Running LiBiNorm

This is the standard mode which uses aligned RNA-seq data from a bam file and determines the number of reads associated with each gene or transcript, and is able to normalise for bias associated with the the protocol, particularly Smart-seq2.

The command line format is LiBiNorm count [options] alignment_file gff_file

This mode takes as its foundation the functionality found in htseq-count, indeed "LiBiNorm count" can be run in an essentially htseq-count compatible mode using the -z option.

Standard htseq-count based options

The following htseq-count command line options are available withn LiBiNorm count. Options that are not available, and functional differences to the supported options.

-r <order>, --order=<order>
For paired-end data, the alignment have to be sorted either by read name or by alignment position. The default is name.

If name is indicated, LiBiNorm expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in an in memory cache until the mate is found. Each time the buffer exceeds 300000 reads it is written to a temporary file and at the end all the remaining pairs are located in the temporary cache files.

-s <yes/no/reverse>, --stranded=<yes/no/reverse>

whether the data is from a strand-specific assay (default: yes)

For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.

-a <minaqual>, --a=<minaqual>
skip all reads with alignment quality lower than the given minimum value (default: 10)

-t <feature type>, --type=<feature type>
feature type (3rd column in GFF file) to be used, all features of other types are ignored (default exon)

-i <id attribute>, --idattr=<id attribute>
GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table (defaults: GTF file: gene_id, GFF file: gene)

-m <mode>, --mode=<mode>

Mode to handle reads overlapping more than one feature. Possible values for <mode> are union, intersection-strict and intersection-nonempty (default: intersect_strict, or union when using -z htseq compatible mode)

-o <bamout>, --bamout=<bamout>

write out all BAM alignment records into an output BAM file called <bamout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’)

-q, --quiet
suppress progress report and warnings

-h, --help
Show a usage summary and exit.

Additional command line options

-z, --htseq-compatible

The default is that LiBiNorm performs global bias correction that is appropriate for the Smart-seq2 protocol. This option disables this correction and switches the format of the output data to be fully htseq-count compatible. This option also disables the four changes that have been made to the htseq-count algorithm and sets the default count mode to union rather than intersect_strict such that the results should align fully with those produced by htseq-count.

-c <filename>, --counts=<filename>

The gene expression data is writtent to <filename> rather than being sent to stdout, which is the default.

-om <bamfilename> and -ou <bamfilename>

-om <bamfilename> is similar to the -o option but only outputs reads that are associated with the selected genetic feature (typically exons). -ou <bamfilename> creates a file that only contains reads that were not associated with the feature. In each case two bam files are created each containing all the reads. In the first they retain the ordering of the original bam file, and in the second (<bamfilename>.sort.bam) they are ordered by genomic position, such that they can be viewed using viewers such as IGV.

-u <filenameroot>

A number of additional files, named <filenameroot>_<type>.txt are generated that provide details of the characteristics of the reads and the capabilities of all the models. The files are described in more detail. This can be used in conjunction with -z in which case a subset of the files are produced

-j, --fpkm

When used with the -u option, an extended <fileroot>_expression.txt file is created that contains TPM and FPKM values. If this is used with bias normalisation then both raw and bias corrected values are produced. From release 2.3 this can be used in conjunction with -z.

Bias normalisation options

Bias normalisation within LiBiNorm is performed in two stages. The first is to determine the model parameters that best fit the data, and the second is to use the parameters to generate expression values for each gene or transcript using the TPM measure of expression where the length of the transcript is adjusted based on the bias predicted by the model. For example, if the transcript length is 2000 and the predicted bias is 0.4 then the effective length used in the calculation of TPM is 800.

-n <model>, --normModel=<model>

The default bias compensation is based on model BD, which s appropriate for the Smart-seq2 protocol. This allows alternative models to be selected where <model> is one of the six models A-E, BD or none.

The -n none option disables normalisation and a htseq-compatible count file is produced as is produced using the -z option, except that the htseq-compatibility modes are disabled

-f, --full

When used in conjunction with -u <filenameroot> this results in the output files containing information relating to all six models rather than just the selected or the default model.

-p <N>, --threads=<N>

Sets the number of threads that are used during the process of determining model parameters (default 3).

-d <N>, --reads=<N>

Sets the maximum number of reads that are used during the process of parameter determination (default 100000000). A lower number will reduce the time taken to determine the parameters but may make the parameters less optimal.