Targetted 16S Metagenomic Analysis using Qiime
This was from a workshop I have given in the lab, but could be used as a guide for future analysis as well.*
Background
Metagenomics involves the study of genomes found in the environment. Some microbes are difficult to culture in the laboratory, limiting the acquisition of data and study of genes in these organisms. Metagenomics involves sequencing and analyzing genomes of different organisms directly, without culturing and isolation, producing an unbiased profile of the community in a sample. Unlike whole genome analysis, metagenomic analysis is challenging in that the diversity of genomes in the sample produce redundancy of sequences between closely related community of microbes. In addition, the presence of multiple genomes in one sample complicates assembly, annotation, and the interpretation of data.
Targetted metagenomics aims to sequence and analyze only a portion of a genome or set(s) of genes from a given environmental sample. On the otherhad, shotgun sequencing involves indescriminate sequencing of DNA. This exercise covers basic steps in processing 16s rRNA targetted metagenomic sequences. 16s rRNA is one of the widely used genetic markers to profile bacterial communities.
Tools
QIIME uses external tools in some of its modules. Please see http://qiime.org for a list of QIIME dependencies and their corresponding citations.
Setting up
QIIME is an open-source software available at http://qiime.org. To install, simply run:
1
$ pip install qiime
Additional setting up include the following. Note that this is a non-exhaustive list and specifics would depend on the configurations of the machine you’re using, the network, as well as the specific QIIME modules you are going to run, etc:
Download the
uchime
binary here. Rename, copy, or link it asuchime61
. Put the location of the binary in yourPATH
by either writing it in.bashrc
or linking it in/bin
or/usr/local/bin
.- Make sure you have a copy of emperor or download it using
1
$ sudo pip install emperor
- If you have older versions of the Python module
numpy
, upgrade it using:1
$ pip install -U numpy
- QIIME works using older versions of
matplotlib
. To download specific versions of a Python module, e.g. matplotlib version 1.4.3:1
$ pip install matplotlib==1.4.3
- Set up
matplotlibrc
:1
$ vi .config/matplotlib/matplotlibrc
Go to
INSERT
mode by pressingi
in your keyboard, then write the following:1
backend: agg
This prevents matplotlib from using interactive mode when outputing figures. Save your edits by pressing
Esc
in your keyboard, and writing:wq
. - Lastly, you also need to copy the QIIME config file to your home directory:
1
$ cat /etc/qiime/qiime_config > ~/.qiime_config
Dowload databases
The following databases will be used for chimera searching and operational taxonomic units (OTU) picking:
- Ribosomal Database Project (RDP) Gold sequences 3, which can be downloaded at here.
- The Greengenes database 4 which can be downloaded here. Unpack the sequences before using, e.g.:
1
tar -xvzf gg_otus_4feb2011.tgz
Organize (output) data into directories
Create directories where the resulting files will be written to. This is a good practice for organizing your data, intermediate files, and results. An example of commands to do this are as follows:
1
2
3
4
5
6
7
8
9
$ mkdir -p outputs/{01_multiple_join_paired_ends, \
02_split_libraries_fastq, \
03_identify_chimeric_seqs,04_filter_fasta, \
05_pick_open_reference_otus, \
06_summarize_taxa_through_plots,07_alpha_rarefaction, \
08_biom_summarize_table, \
09_alphadiversity,10_betadiversity}
$ mkdir outputs/04_filter_fasta/{data1,data2,merged}
$ mkdir outputs/08_biom_summarize_table/{data1,data2,merged}
Sample workflow
This workflow uses 16s rRNA metagenomic reads, the QIIME modules, and the Linux commandline interface. It is assumed that you have a working knowledge of basic Linux commands.
- Always refer to figure below for each of the steps.
- For alpha diversity workflow only, do Step 1- Step 9.
- For beta diversity workflow only, do Step 1 - Step 4, and continue to Step 10 - Step 15.
Instructions
The input files are in FASTQ format inside the data
directory.
Assemble forward and reverse reads into longer contigs or sequences.
1
$ multiple_join_paired_ends.py -i data -o outputs/01_multiple_join_paired_ends
The command above will produce FASTQ sequences with filename
fastqjoin.join.fastq
in the directory specified in the command. You may check if you have generated such file usingless
orls -l
.Filter sequences by quality, and label each sequence with a “sample ID”.
1 2
$ split_libraries_fastq.py -i outputs/01_multiple_join_paired_ends/data1_R1_L001/fastqjoin.join.fastq -o outputs/02_split_libraries_fastq/data1 -q 19 --barcode_type 'not-barcoded' --sample_ids data1 $ split_libraries_fastq.py -i outputs/01_multiple_join_paired_ends/data2_R1_L001/fastqjoin.join.fastq -o outputs/02_split_libraries_fastq/data2 -q 19 --barcode_type 'not-barcoded' --sample_ids data2
This will produce sequences in FASTA format with filename
seqs.fna
. Each of the sequences insideseqs.fna
should have a FASTA header starting with the specified string in the--sample_ids
flag in the command, e.g.data1_filename_0
. Check if you have a similar label using the Linux commandless
. Note: If possible, do not put non-alphanumeric characters such as “_” as your sample ID, as this may result in errors in succeeding steps.- Identify chimeric sequences by comparing the assembled sequences from Step 2 to sequences from the RDP Gold database.
1 2
$ identify_chimeric_seqs.py -m usearch61 -i outputs/02_split_libraries_fastq/data1/seqs.fna -r /databases/rdp_gold/rdp_gold.fa -o outputs/03_identify_chimeric_seqs/data1 $ identify_chimeric_seqs.py -m usearch61 -i outputs/02_split_libraries_fastq/data2/seqs.fna -r /databases/rdp_gold/rdp_gold.fa -o outputs/03_identify_chimeric_seqs/data2
This step produces many files, most important of which is
chimeras.txt
which contains the names of sequences tagged to be chimeric. Check if you were able to generate such file. - Filter out sequences with chimera.
1 2
$ filter_fasta.py -f outputs/02_split_libraries_fastq/data1/seqs.fna -o outputs/04_filter_fasta/data1/seqs_chimera_filtered.fna -s outputs/03_identify_chimeric_seqs/data1/chimeras.txt -n $ filter_fasta.py -f outputs/02_split_libraries_fastq/data2/seqs.fna -o outputs/04_filter_fasta/data2/seqs_chimera_filtered.fna -s outputs/03_identify_chimeric_seqs/data2/chimeras.txt -n
The
-n
(negated) flag indicates that the supplied sequences via the flag-s
must be removed. - Pick operational taxonomic units or OTUs from the sequences. First, create a paramaters file for the tool, by using
vi
. Pressi
to go intoINSERT MODE
then type the following:1
$ vi params.txt
E.g.
1 2
pick_otus:enable_rev_strand_match True plot_taxa_summary:chart_type pie,bar
This file specifies other parameters for the OTU picking command in the first line in the format
<tool>:<parameter> value
. In addition, the second line specifies output formats for the next tool (summarize_taxa_through_plots.py
).Press
Esc
on the keyboard, then type:wq
and hit enter to save and quit. Execute the following commands to start OTU picking.1 2
$ pick_open_reference_otus.py -i outputs/04_filter_fasta/data1/seqs_chimera_filtered.fna -o outputs/05_pick_open_reference_otus/data1 -p params.txt -r /databases/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta --parallel --jobs_to_start=10 $ pick_open_reference_otus.py -i outputs/04_filter_fasta/data2/seqs_chimera_filtered.fna -o outputs/05_pick_open_reference_otus/data2 -p params.txt -r /databases/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta --parallel --jobs_to_start=10
The OTU picking script(s) of QIIME are in themselves a collection of tools and commands, see this for a description of these tools and steps. An “open reference” method of OTU picking, such as the one done in this step, is a combination of both referenced and de novo OTU picking.
This step produces many files including binary files with extensions
biom
or Biological Observation Matrix. These files contain the OTU counts. More information about theBIOM
format may be found here.In addition, this step makes a phylogenetic tree of representative sequences of the OTUs identified
rep_set.tre
. Descriptions of relevant file outputs are inindex.html
. Create summary plots of the OTUs found. First, create a mapping file containing details of the samples: sample ID, and description of each sample. Use the Linux command
vi
. Edit and save your file as in Step 5.1
$ vi mapping.tsv
Write the following inside
mapping.tsv
. Note that the.tsv
extension indicates that this is a tab-delimited file, i.e. a tab separates each value:1 2 3
#SampleID Description data1 first_data data2 second_data_duplicate
The mapping file (a.k.a the metadata mapping file) contains information about the sample. It contains information that may aid in the analysis of the samples e.g. sampling site, time points, barcodes, and primer used to amplify the sequences.
SampleID
andDescription
are among the important fields.Note: To be useful and to avoid errors, use the same sample ID for each sample as supplied in the
--sample_ids
flag in Step 2. The mapping file will be used by other tools in this workflow.Run the following commands:
1 2
$ summarize_taxa_through_plots.py -i outputs/05_pick_open_reference_otus/data1/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/06_summarize_taxa_through_plots/data1 -p params.txt -m mapping.tsv $ summarize_taxa_through_plots.py -i outputs/05_pick_open_reference_otus/data2/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/06_summarize_taxa_through_plots/data2 -p params.txt -m mapping.tsv
This will generate bar plots and and pie charts of the taxonomic diversity found in the samples.
Create rarefaction curves.
1 2
$ alpha_rarefaction.py -i outputs/05_pick_open_reference_otus/data1/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/07_alpha_rarefaction/data1 -m mapping.tsv -t outputs/05_pick_open_reference_otus/data1/rep_set.tre --parallel --jobs_to_start=10 $ alpha_rarefaction.py -i outputs/05_pick_open_reference_otus/data2/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/07_alpha_rarefaction/data2 -m mapping.tsv -t outputs/05_pick_open_reference_otus/data2/rep_set.tre --parallel --jobs_to_start=10
This step generates alpha rarefaction plots, as well as computes the alpha diversity metrics for each rarefied OTU table, and collate alpha diversity results. Chao 1 estimates true species diversity of a sample using an equtation based on the number of rare (singleton) and those seen twice (doubletons)5, while PD or phylogenetic diversity is also a measure of diversity based on the branch lengths of the phylogenetic tree generated6.
Create biom summary tables.
1 2
$ biom summarize-table -i outputs/05_pick_open_reference_otus/data1/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/08_biom_summarize_table/data1/biom_summary.txt $ biom summarize-table -i outputs/05_pick_open_reference_otus/data2/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/08_biom_summarize_table/data2/biom_summary.txt
This step outputs general summary metrics of the bioms generated by the OTU picking step.
Compute alpha diversity or within-species diversity using the OTU table generated.
1 2
$ alpha_diversity.py -i outputs/05_pick_open_reference_otus/data1/otu_table_mc2_w_tax_no_pynast_failures.biom -t outputs/05_pick_open_reference_otus/data1/rep_set.tre -o outputs/09_alphadiversity/data1.alpha_diversity $ alpha_diversity.py -i outputs/05_pick_open_reference_otus/data2/otu_table_mc2_w_tax_no_pynast_failures.biom -t outputs/05_pick_open_reference_otus/data2/rep_set.tre -o outputs/09_alphadiversity/data2.alpha_diversity
The following commands are applicable only for the beta-diversity workflow. Similar descriptions of Steps 11-14 can be found in Steps 5-8.
Concatenate cleaned sequences using the Linux command
cat
.1
$ cat outputs/04_filter_fasta/data1/seqs_chimera_filtered.fna outputs/04_filter_fasta/data2/seqs_chimera_filtered.fna > outputs/04_filter_fasta/merged/merged_seqs_chimera_filtered.fna
Pick OTUs in the combined datasets. First create a
params.txt
file as in Step 5.1
$ pick_open_reference_otus.py -i outputs/04_filter_fasta/merged/merged_seqs_chimera_filtered.fna -o outputs/05_pick_open_reference_otus/merged -p params.txt -r /databases/gg_otus_4feb2011/rep_set/gg_99_otus_4feb2011.fasta --parallel --jobs_to_start=10
Plot the taxonomix diversity of the samples. First create a
mapping.tsv
file as in Step 6.1
$ summarize_taxa_through_plots.py -i outputs/05_pick_open_reference_otus/merged/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/06_summarize_taxa_through_plots/merged -p params.txt -m mapping.tsv
Create rarefaction curves
1
$ alpha_rarefaction.py -i outputs/05_pick_open_reference_otus/merged/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/07_alpha_rarefaction/merged -m mapping.tsv -t outputs/05_pick_open_reference_otus/merged/rep_set.tre --parallel --jobs_to_start=10
```
Create biom summary tables
1
$ biom summarize-table -i outputs/05_pick_open_reference_otus/merged/otu_table_mc2_w_tax_no_pynast_failures.biom -o outputs/08_biom_summarize_table/merged/biom_summary.txt
Generate beta-diversity plots. Note that for this exercise, we will use other
biom
file and corresponding mapping file to illustrate how the tool can be ran because of the limitations of the tables generated fromdata1
anddata2
.1
$ beta_diversity_through_plots.py -i subs/otu_table_mc2_w_tax_no_pynast_failures.biom -m subs/mapping.tsv -t subs/rep_set.tre -o outputs/10_betadiversity/merged --parallel --jobs_to_start=10
Beta diversity is typically measured among samples from different sampling conditions or habitats. This step also generates interactive 3D Principal Coordinates of Analysis (PCoA) plots which groups points (samples) based on their distances.
Directory structure
A sample directory structure containing the inputs and outputs can be found below.
Go to the working directory, e.g. $ cd ~/qiime
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
data
----- data1_R1_L001.fastq *
----- data1_R2_L001.fastq *
----- data2_R1_L001.fastq *
----- data2_R2_L001.fastq *
outputs
----- 01_multiple_join_paired_ends
----- 02_split_libraries_fastq
----- 03_identify_chimeric_seqs
----- 04_filter_fasta
---------- data1
---------- data2
---------- merged
----- 05_pick_open_reference_otus
----- 06_summarize_taxa_through_plots
----- 07_alpha_rarefaction
----- 08_biom_summarize_table
---------- data1
---------- data2
---------- merged
----- 09_alphadiversity
----- 10_betadiversity
References
Kuczynski J, Stombaugh J, Walters WA, Gonzalez A, Caporaso JG, Knight R. (2011). Using QIIME to analyze 16S rRNA gene sequences from microbial communities. Current Protocols in Bioinformatics. Chapter 10:Unit 10.7. ↩︎
J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman, Elizabeth K Costello, Noah Fierer, Antonio Gonzalez Pena, Julia K Goodrich, Jeffrey I Gordon, Gavin A Huttley, Scott T Kelley, Dan Knights, Jeremy E Koenig, Ruth E Ley, Catherine A Lozupone, Daniel McDonald, Brian D Muegge, Meg Pirrung, Jens Reeder, Joel R Sevinsky, Peter J Turnbaugh, William A Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld and Rob Knight. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods. doi:10.1038/nmeth.f.303 ↩︎
Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, Brown CT, Porras-Alfaro A, Kuske CR, Tiedje JM. (2014). Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research.42(Database issue):D633-42. ↩︎
DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen. 2006. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol 72:5069-72 ↩︎