Input File Conversion
ChIP-seq files come in a variety of formats.
The 'alignment -> BAR' function has been extended to accept a wide range of formats. In addition, two functions convert other formats into the simplest internal format (aln), which can then be passed to 'alignment -> BAR'. These are:
- Eland -> aln
This function was part of the original cisGenome and converts from Eland format. It has been extended to convert multiple input files into a single output file. Format example:
>FC12033_91907:6:1:419:667 AATCTCATAAACAATGTTGAATGAAAG U0 1 3 4 chr4.fa 56990795 R ..
>FC12033_91907:6:1:458:899 TAAAAAAAAAAAGCCATAAATCCAAAC U0 1 0 0 chr15.fa 38956903 F ..
- Illumina -> aln
There are (at least) two different data formats produced by the Illumina pipeline:
- 'results'
This provides information about each individual sequence found: how many locations in the genome the sequence matches with zero, one and two mismatches and, if there is exactly one best match, its genomic location. Format example:
>FC30DTE_20080920:6:1:543:118 GATGATGATTCCATTCCAGTCCATATGA R0 2 2 44
>FC30DTE_20080920:6:1:1357:1869 GAGTAGAGAGCTCAGCAGGACATGGCTT U0 1 0 0 chr13.fa 48695227 F ..
- 'multi'
Similar to 'results', except that when a sequence matches multiple genomic locations, information is provided about more than one of the matches. Format example:
>USI-EAS28:1:3:194:2007#0/1 AGCCAAATTTATCCTGACTTCCCAGAGA 1:0:0 chr6.fa:29803458F0
>USI-EAS28:1:3:194:756#0/1 AATCAAAACTAAAACCAAAGTGTCATTA 1:1:1 chr4.fa:8487516R0,chrX.fa:135845453F1
When parsing this data, the new cisGenome function provides the option of selecting only perfect alignments, alignments with up to one mismatch, or alignments with up to two mismatches.
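The line layouts shown above can be read with a few lines of Python. The following is only an illustrative sketch, not the cisGenome implementation: the names Hit, parse_results_line, parse_multi_line and to_aln are hypothetical, and several points are assumptions rather than documented behaviour: that the digit in a match code such as U0 is the mismatch count, that the trailing digit of each 'multi' hit is its mismatch count, that a 'multi' hit without a chromosome prefix lies on the same chromosome as the previous hit, and that the aln format is one tab-separated line of chromosome, position and strand (+ or -).

```python
import re
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    chrom: str       # chromosome name with any ".fa" suffix removed
    pos: int         # coordinate exactly as reported in the input file
    strand: str      # "F" or "R", as in the input
    mismatches: int  # number of mismatches for this alignment


def _chrom_name(field: str) -> str:
    """Strip the ".fa" suffix seen in the examples (chr4.fa -> chr4)."""
    return field[:-3] if field.endswith(".fa") else field


def parse_results_line(line: str) -> Optional[Hit]:
    """Parse one Eland / 'results' line; return None if it has no location."""
    fields = line.split()
    # Layout: >name  sequence  code  n0  n1  n2  [chrom  pos  strand  ..]
    if len(fields) < 9:
        return None  # e.g. an R0 line, which carries no genomic location
    code = fields[2]
    # Assumption: U0 / U1 / U2 mean a unique best match with that many mismatches.
    if not (code.startswith("U") and code[1:].isdigit()):
        return None
    return Hit(_chrom_name(fields[6]), int(fields[7]), fields[8], int(code[1:]))


# One hit in a 'multi' line, e.g. "chr4.fa:8487516R0" or (assumed) "8487516R0".
_HIT = re.compile(r"^(?:(?P<chrom>[^:]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<mm>\d+)$")


def parse_multi_line(line: str, max_mismatches: int = 2) -> List[Hit]:
    """Parse one 'multi' line, keeping hits with at most max_mismatches."""
    fields = line.split()
    # Layout: >name  sequence  n0:n1:n2  hit[,hit...]
    if len(fields) < 4:
        return []
    hits: List[Hit] = []
    chrom: Optional[str] = None
    for item in fields[3].split(","):
        m = _HIT.match(item)
        if m is None:
            continue  # skip anything that does not look like a hit
        if m.group("chrom"):
            chrom = _chrom_name(m.group("chrom"))
        if chrom is None:
            continue  # no chromosome seen yet, so the hit cannot be placed
        mismatches = int(m.group("mm"))
        if mismatches <= max_mismatches:
            hits.append(Hit(chrom, int(m.group("pos")), m.group("strand"), mismatches))
    return hits


def to_aln(hit: Hit) -> str:
    """Format a hit as an assumed aln line: chromosome, position, strand."""
    strand = "+" if hit.strand == "F" else "-"
    return f"{hit.chrom}\t{hit.pos}\t{strand}"
```

Calling parse_multi_line with max_mismatches set to 0, 1 or 2 mirrors the three options described above: perfect alignments only, up to one mismatch, or up to two mismatches. Lines for which parse_results_line returns None (no unique best match) would simply be skipped when writing the output file.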