Thursday, April 16, 2015

Generating a consensus sequence from a bam file

Creating a consensus sequence from a bam file requires 3 steps:
     Step 1: Run samtools mpileup to create the pileup file
     Step 2: Run bcftools call to call SNPs and INDELs
     Step 3: run vcfutils.pl to create the consensus file

You can run the commands separately or pipe them together in one command

The examples below include the options I use when creating the consensus sequence. You can change them as necessary.

NOTE: The commands listed below might wrap to additional lines, but they should be entered on one line

Step 1: Run samtools mpileup to create the pileup file
Parameters
--min-BQ
     Minimum base quality for a base to be considered.
--VCF
     Compute genotype likelihoods and output them in the variant call format (VCF).
--uncompressed
     Generate uncompressed VCF/BCF output.
--output-tags DP,DV,DPR,DP4,SP
     Comma-separated list of tags to output
          DP --> Number of high-quality bases
          DV --> Number of high-quality non-reference bases
          DPR --> Number of high-quality bases for each observed allele
          DP4 --> Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
          SP --> Phred-scaled strand bias P-value
--fasta-ref
     The faidx-indexed reference file in the FASTA format.
--output
     Write pileup or VCF/BCF output to a file, rather than the default of standard output.

Command
samtools mpileup --min-BQ 20 --VCF --uncompressed --output-tags DP,DV,DPR,DP4,SP --fasta-ref REF-FILE BAM-FILE --output VCF-FILE

where:
     REF-FILE is the name of the reference file
     BAM-FILE is the name of the bam file
     VCF-FILE is the name of the output file in VCF format

Step 2: Run bcftools call to call SNPs and INDELs
Parameters
--consensus-caller
     use the original samtools/bcftools calling method
--output-type v
     output uncompressed VCF
--output
     Write output to a file rather than the default of standard output.

Command
bcftools call --consensus-caller --output-type v PILEUP-FILE --output SNP-FILE

where:
     PILEUP-FILE is the name of the VCF file created in Step 1
     SNP-FILE is the name of the output file containing the SNPs and INDELs in VCF format

Step 3: run vcfutils.pl to create the consensus file
Parameters
vcf2fa
     Convert the VCF file to FASTA format
The greater than sign redirects the output to CONSENSUS-FILE instead of writing to STDOUT

Command
vcfutils.pl vcf2fa SNP-FILE > CONSENSUS-FILE

where
     SNP-FILE is the VCF file created in Step 2
     CONSENSUS-FILE is the name of the ouput consensus file in FASTA format