bamCleave
In some circumstances bam files arise with reads from multiple genomes. These cannot be viewed using programs such as IGV and cannot be processed by applications that assume a single genome and standard chromosome names. bamCleave can separate out chromosomes for a specific genome and change the chromosome names to more standard names.
bam files can also contain data for multiple cells. bamCleave can divide such data into separate bam files for individual cells.
Syntax
bamCleave -b <bamFileName> (-n <txt>)/(-t <tag>) (-c N)/(-c <filename>)/(-p <prefix>)/(-m <mappingFile>) (-o <outputFile>)
/ indicates that the options (e.g. -n or -t) are mutually exclusive.
-b Specifies the source bam file.
-n <txt> indicates that the cell identifier is the part of the read name up to the character string <txt>.
-o specifies the root for the generated files. If this is not specified then the source bam filename is used as the root for the generated files.
-c N is for use with single cell data where the cell is identified by a barcode within the bam file (-t) or by the initial text in the read name (-n). bamCleave reads through the file to find the N cell identifiers with the most reads and then creates bam files (with an index) for all N cells.
-c <filename> is an alternative to -c N where only the data associated with the barcodes in <filename> is output. There should be one barcode per line in the file.
-d N is for use with single cell data. It specifies the maximum number of reads that are used to determine the N cell identifiers with the most associated reads. The default is 20,000,000.
-t <tag> specifies the tag that is used to identify the single cell barcode. If the -t option is not used, the default tag is XC.
The -p and/or -m options can be used to specify the chromosomes that are separated out into a separate bam file. The -p prefix option specifies that all chromosomes with names of the format <prefix>XYZ will be separated out and that the new chromosome names will have the prefix removed. The -m option specifies a tab-delimited file that lists all the chromosomes to be separated out and their names in the new bam file, e.g.
MOUSE_17 17
MOUSE_18 18
MOUSE_19 19
MOUSE_1 1
MOUSE_1_GL456210_random 1_GL456210_random
MOUSE_1_GL456211_random 1_GL456211_random
MOUSE_1_GL456212_random 1_GL456212_random
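A mapping file like the one above does not have to be typed by hand. The following is a minimal sketch and not part of bamCleave: make_mapping is a hypothetical helper name, and it assumes that the chromosome names are supplied one per line on standard input (for example, after being extracted from the bam header with another tool).

```shell
# Hypothetical helper (not part of bamCleave): build a tab-delimited mapping
# file from chromosome names read one per line on standard input.
# $1 is the prefix to strip, e.g. MOUSE_
make_mapping() {
    awk -v prefix="$1" '
        # keep only names that start with the prefix and have text after it
        index($0, prefix) == 1 && length($0) > length(prefix) {
            printf "%s\t%s\n", $0, substr($0, length(prefix) + 1)
        }'
}

# Example: two mouse chromosomes are mapped, the HUMAN_1 line is skipped
printf '%s\n' MOUSE_17 HUMAN_1 MOUSE_1_GL456210_random | make_mapping MOUSE_
```

Redirecting the output to a file gives something that could be passed to the -m option. Names that do not start with the prefix are left out, so their reads would not be separated out.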
Examples
Take the reads from source.bam and split them into separate files whose filenames start with 'output', where the individual cell barcodes use the 'CB' tag. Output data for the top 100 cells.
bamCleave -b source.bam -o output -t CB -c 100
Take the reads from source.bam and split them into separate files whose filenames start with 'output', where the individual cell barcodes are set by the read names up to the character ":". Output data for the top 300 cells.
bamCleave -b source.bam -o output -n ":" -c 300
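When a fixed set of cells is wanted, the -c <filename> form needs a file with one barcode per line. The sketch below shows one way such a file might be prepared. It is an illustration only: top_barcodes is a hypothetical helper, the input is assumed to be one barcode per read on standard input, and it does not describe how bamCleave itself counts reads.

```shell
# Hypothetical helper (not part of bamCleave): read one barcode per line on
# standard input and print the $1 most frequent barcodes, one per line.
top_barcodes() {
    sort |                    # group identical barcodes together
        uniq -c |             # count the occurrences of each barcode
        sort -k1,1nr -k2,2 |  # highest count first, ties ordered by barcode
        head -n "$1" |        # keep the requested number of barcodes
        awk '{print $2}'      # drop the count column
}

# Example with made-up barcodes: AAAC occurs 3 times, GGTT twice, CCGA once
printf '%s\n' AAAC GGTT AAAC CCGA AAAC GGTT | top_barcodes 2
```

If the output is saved, for instance as barcodes.txt, it could then be used in a command of the form bamCleave -b source.bam -o output -t CB -c barcodes.txt.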
Generated files
All files have a root file name <rootFileName> which is either specified by the -o option or is the source bam file without the .bam suffix.
<rootFileName>_<prefix>.bam or <rootFileName>_sel.bam | Reads that have been separated out using the -p or the -m options |
<rootFileName>_res.bam | The remaining reads |
<rootFileName>_<prefix>_XYZ.bam or <rootFileName>_sel_XYZ.bam | Reads associated with the cell with identifier XYZ |
<rootFileName>_chimeras.txt | Info about paired end reads which are split between genomes |
<rootFileName>_split.log | Run log. Includes read counts. |
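For scripts that need to find the generated files, the naming rule above can be sketched in shell. This is an illustration under assumptions: root_name is a hypothetical helper, and it assumes that any directory part of the source bam path is kept in the root file name.

```shell
# Hypothetical helper (not part of bamCleave): work out the root file name.
# $1 = source bam file, $2 = value given to -o (may be empty or omitted)
root_name() {
    if [ -n "$2" ]; then
        printf '%s\n' "$2"          # -o was given: use it as the root
    else
        printf '%s\n' "${1%.bam}"   # otherwise strip the .bam suffix
    fi
}

# Example: names of the remaining-reads file and the run log for source.bam
root=$(root_name source.bam)
echo "${root}_res.bam"
echo "${root}_split.log"
```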
Download
bamCleave is a command line executable that should be placed in an appropriate directory.
For Linux and Mac the command "chmod a+x bamCleave" will need to be used to allow the downloaded file to be run as a program. The program can be run by specifying the full path name to the program, or as ./bamCleave if it is in the current directory. Alternatively it can be put into a directory such as /usr/local/bin, where bamCleave will automatically be found when run from the command line in any other directory. A final alternative is to place it in a new directory for such programs and add that directory to the PATH environment variable in the .profile file.
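The last alternative can be sketched as a small shell function. This is a minimal sketch rather than a required procedure: install_tool is a hypothetical helper, and the directory and profile file in the example call are illustrative choices.

```shell
# Hypothetical helper: copy an executable into a tool directory, make it
# runnable, and add that directory to PATH in a profile file.
# $1 = downloaded file, $2 = target directory, $3 = profile file
install_tool() {
    src="$1"; dest_dir="$2"; profile="$3"
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    chmod a+x "$dest_dir/$(basename "$src")" || return 1
    # Append the PATH line only if the profile does not already contain it
    line="export PATH=\"$dest_dir:\$PATH\""
    grep -qxF "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}

# Example call (paths are illustrative):
# install_tool ./bamCleave "$HOME/tools" "$HOME/.profile"
```

The change to PATH takes effect in new login shells, after the profile file has been read again.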