GATK pipeline, at a glance
Examples to serve as a guide for analyses using GATK. Version used is 3.2.2. It is best to consult the tool documentation for in-depth discussion of the parameters and outputs.
Set paths to tools
1
2
3
4
bwa="/opt/bwa/gcc-4.8.2_bwa-0.7.10/bwa"
picardtools="java -jar -Xmx10g -XX:ParallelGCThreads=1 /home/mroa/bin/picard-tools-1.119"
samtools="/home/mroa/bin/samtools-1.0/samtools"
gatk="java -jar /home/mroa/bin/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar"
More paths to directories, inputs and outputs
1
2
3
4
5
6
7
8
9
WORKDIR=/home/mroa/data
LANE=lane3 ## when working with multiple samples from the same lane
REFDIR=$WORKDIR/reference
REFERENCE=$REFDIR/reference.fasta
REFINDEX=$REFERENCE
REFDICT=$REFDIR/reference.dict
OUTDIR=$WORKDIR/snps/$LANE
DATADIR=$WORKDIR/demultiplex/$LANE
SNPSDIR=$WORKDIR/snps
Do this once for each reference
1
2
3
bwa index $REFERENCE &&
$picardtools/CreateSequenceDictionary.jar REFERENCE=$REFERENCE OUTPUT=$REFDICT &&
$samtools faidx $REFERENCE &&
Do this for all FASTQ pairs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
for num in {1..96}; do
READ=sample$num
LOGTIME=$WORKDIR/snps/$LANE/timelogs/$READ
mkdir -p $LOGTIME
READ1=$DATADIR/$READ\_1.fastq
READ2=$DATADIR/$READ\_2.fastq
SNPSDIR=$WORKDIR/snps
bwa mem -t 20 $REFINDEX $READ1 $READ2 > $READ.sam &&
$picardtools/SortSam.jar SO=coordinate INPUT=$READ.sam OUTPUT=$READ.sorted.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE &&
$picardtools/FixMateInformation.jar SO=coordinate INPUT=$READ.sorted.bam OUTPUT=$READ.fixmate.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE &&
$picardtools/MarkDuplicates.jar INPUT=$READ.fixmate.bam OUTPUT=$READ.markdup.bam METRICS_FILE=$READ.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 &&
$picardtools/AddOrReplaceReadGroups.jar INPUT=$READ.markdup.bam OUTPUT=$READ.adrg.bam RGID=$READ LB=$READ PL='Illumina' SM=$READ CN="" RGPU=$READ VALIDATION_STRINGENCY=LENIENT SO=coordinate CREATE_INDEX=TRUE &&
$gatk -T RealignerTargetCreator -nt 20 -R $REFERENCE -I $READ.adrg.bam -o $READ.intervals &&
$gatk -T IndelRealigner -I $READ.adrg.bam -R $REFERENCE -targetIntervals $READ.intervals -o $READ.realigned.bam
done
Do this once to merge all alignments
1
2
$samtools merge $SNPSDIR/Seq.merged.bam $SNPSDIR/lane1/*realigned.bam $SNPSDIR/lane2/*realigned.bam $SNPSDIR/lane3/*realigned.bam $SNPSDIR/lane4/*realigned.bam
$samtools index $SNPSDIR/Seq.merged.bam
Call SNPs
1
$gatk -T UnifiedGenotyper -nt 20 -R $REFERENCE -I $SNPSDIR/Seq.merged.bam -o $SNPSDIR/Seq.vcf.gz -glm SNP -mbq 20 --genotyping_mode DISCOVERY -out_mode EMIT_VARIANTS_ONLY
This post is licensed under CC BY 4.0 by the author.