Tools to manage pyrosequencing data
Extracting meaningful information from next-generation sequencing runs requires new ways of thinking and new tools. First and foremost, because the sheer quantity of data makes manual processing impossible, automated tools are required for basic data processing. Second, because the potential for erroneous analyses and conclusions scales with data quantity and ease of collection, rigorous and systematic quality control procedures are critical.
We have developed a collection of scripts for basic data processing and quality control of pyrosequencing data. Together they approximate a generic 'pyrosequencing pipeline' that can be customized as necessary. These scripts were written by me but are open-source and offered freely to the community; feel free to modify them as best suits your needs, and to contact me (brian.oakley@ars.usda.gov) if you have any questions.
We also have a series of videos (link via tabs above or links below) that demonstrate various aspects of this pyrosequencing analysis pipeline. The pipeline links a series of Perl scripts for initial processing of sequences, then reformats and summarizes the output from a rapid clustering method into a summary data table, which in turn serves as input for a series of R scripts that produce graphical and tabular output.
- Video 1: Initial screening and quality control.
- Video 2: High-throughput clustering and automated analysis Part I.
- Video 3: Automated analysis and graphics Part II.
- Video 4: Integration with mothur.
All of this can be invoked with a single command to take a fasta file of unaligned sequences and automatically produce summary analyses and graphics, all run locally. Scripts can be downloaded from the list at the bottom of the page (new versions uploaded 27 April 2012).
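To give a feel for what the initial screening stage involves, here is a minimal sketch of reading a fasta file of unaligned sequences, summarizing read lengths, and screening reads by length. It is written in Python purely for illustration and is not the code of the Perl scripts listed below; the function names (`read_fasta`, `length_screen`, `length_summary`) and the length thresholds are hypothetical.

```python
from io import StringIO
from statistics import median


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_screen(records, min_len, max_len):
    """Split records into (kept, rejected) using inclusive length bounds."""
    kept, rejected = [], []
    for header, seq in records:
        target = kept if min_len <= len(seq) <= max_len else rejected
        target.append((header, seq))
    return kept, rejected


def length_summary(records):
    """Return simple length statistics for a list of records."""
    lengths = [len(seq) for _, seq in records]
    if not lengths:
        return {"n": 0}
    return {
        "n": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Tiny in-memory example; a real run would open a fasta file instead.
demo = StringIO(">read1\nACGTACGT\nACGT\n>read2\nACG\n>read3\nACGTACGTAC\n")
records = list(read_fasta(demo))
kept, rejected = length_screen(records, min_len=5, max_len=12)  # arbitrary bounds
print(length_summary(records))
print("kept:", [header for header, _ in kept])
print("rejected:", [header for header, _ in rejected])
```

In practice the length bounds would be chosen from the length distribution of the run and the expected amplicon size, which is why a summary step of this kind normally comes before any trimming.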
Some of the videos show how to access many of the useful tools of Pat Schloss' excellent program mothur without having to go through all the time and tedium of aligning and making distance matrices with huge numbers of sequences, which is often a barely tractable task.
As a solution, I've come up with some scripts to do the clustering with CD-Hit, which is very fast and, as our group has shown with several control datasets (see citation below), accurate. The scripts then allow that output to be used directly in mothur. The videos show examples, but there are other examples of using the tools available in mothur on the mothur website.
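The idea of handing CD-Hit clusters to mothur can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the pipeline's own scripts: it assumes the usual CD-Hit `.clstr` layout (a `>Cluster N` line followed by member lines such as `0	250nt, >read1... *`) and a single-line mothur list layout (label, number of OTUs, then one tab-separated column per OTU with comma-separated sequence names). The function names and the `0.03` label are hypothetical, and some mothur versions may expect a header row in list files, so check the format against the version you use.

```python
import re
from io import StringIO

# Captures the sequence name from a member line such as
# "0\t250nt, >read1... *" (the name sits between ">" and "...").
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(handle):
    """Return a list of clusters, each a list of sequence names."""
    clusters = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])  # start a new cluster
        else:
            match = MEMBER_RE.search(line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return clusters


def to_list_line(clusters, label):
    """Format clusters as one mothur-style list line (assumed layout)."""
    otus = [",".join(members) for members in clusters if members]
    return "\t".join([label, str(len(otus))] + otus)


# Small in-memory example of CD-Hit cluster output.
demo = StringIO(
    ">Cluster 0\n"
    "0\t250nt, >read1... *\n"
    "1\t248nt, >read2... at +/98.00%\n"
    ">Cluster 1\n"
    "0\t251nt, >read3... *\n"
)
clusters = parse_clstr(demo)
print(to_list_line(clusters, label="0.03"))  # label is a hypothetical cutoff name
```

Because clustering has already grouped the reads, a conversion along these lines avoids the alignment and distance-matrix steps; the label should reflect the identity threshold actually used for clustering.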
Citation
If you use the Pyrosequence pipeline we would appreciate it if you reference the following paper:
Oakley BB, Carbonero F, Dowd SE, Hawkins RH, and Purdy KJ. (2012) Contrasting patterns of niche partitioning between two anaerobic terminal oxidizers of organic matter. ISME Journal 6: 905-914. DOI: 10.1038/ismej.2011.165
Scripts
Scripts can be downloaded as a set within a zip file:
Or as individual files:
- arb_format_conversion_v2.pl
- length_summary_and_trimming.pl
- primer_screen_and_barcode_trimming.pl
- remove_redundant_seqs_from_fasta_file.pl
- rename_seqs_by_pattern_matching.pl
- rev_com.pl
- split_fas_by_txt.pl
- split_fasta_file_into_batches.pl
- strip_gaps.pl
- translate_3_or_6_frames.pl
- trim_trailing_xs.pl
- shell_script_example_16s.sh