A CWL-based pipeline for calling small germline variants, namely SNPs and small INDELs, from whole-genome sequencing (WGS) or targeted sequencing (e.g., whole-exome sequencing; WES) data.
The corresponding GitHub folder contains:
- The CWL wrappers and subworkflows for the workflow
- A pre-configured YAML template, based on validation analysis of publicly available HTS data
Briefly, the workflow performs the following steps:
- Quality control of Illumina reads (FastQC)
- Trimming of the reads (e.g., removal of adapter and/or low-quality sequences) (Trim Galore)
- Mapping to reference genome (BWA-MEM)
- Conversion of mapped reads from SAM (Sequence Alignment Map) to BAM (Binary Alignment Map) format (samtools)
- Sorting mapped reads based on read names (samtools)
- Adding information regarding paired-end reads (e.g., CIGAR field information) (samtools)
- Re-sorting mapped reads based on chromosomal coordinates (samtools)
- Adding basic Read-Group information regarding sample name, platform unit, platform (e.g., ILLUMINA), library and identifier (picard AddOrReplaceReadGroups)
- Marking PCR and/or optical duplicate reads (picard MarkDuplicates)
- Collection of summary statistics (samtools)
- Creation of indexes for coordinate-sorted BAM files to enable fast random access (samtools)
- Splitting the reference genome into a predefined number of intervals for parallel processing (GATK SplitIntervals)
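
The preprocessing steps above correspond roughly to the command lines sketched below. This is a minimal Python sketch that only builds argument lists and runs nothing. The function name, all file names, the read-group values, the `picard` launcher name, the choice of `samtools flagstat` for summary statistics, and the scatter count are hypothetical placeholders; the options passed by the actual CWL wrappers may differ.

```python
from typing import Dict, List


def preprocessing_commands(sample: str, ref: str, r1: str, r2: str,
                           scatter_count: int = 4) -> Dict[str, List[str]]:
    """Build (but do not run) one command per preprocessing step for one sample."""
    # Placeholder names: Trim Galore chooses its own output file names.
    t1, t2 = f"{sample}_trimmed_1.fq.gz", f"{sample}_trimmed_2.fq.gz"
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    nsort = f"{sample}.namesorted.bam"
    fixed = f"{sample}.fixmate.bam"
    csort = f"{sample}.sorted.bam"
    rg = f"{sample}.rg.bam"
    dedup = f"{sample}.dedup.bam"
    return {
        "fastqc": ["fastqc", r1, r2],
        "trim": ["trim_galore", "--paired", r1, r2],
        "map": ["bwa", "mem", "-o", sam, ref, t1, t2],
        "sam_to_bam": ["samtools", "view", "-b", "-o", bam, sam],
        # fixmate expects name-sorted input, hence the sort by read name first
        "name_sort": ["samtools", "sort", "-n", "-o", nsort, bam],
        "fixmate": ["samtools", "fixmate", nsort, fixed],
        "coord_sort": ["samtools", "sort", "-o", csort, fixed],
        "read_groups": ["picard", "AddOrReplaceReadGroups",
                        f"I={csort}", f"O={rg}",
                        f"RGID={sample}", "RGLB=lib1", "RGPL=ILLUMINA",
                        "RGPU=unit1", f"RGSM={sample}"],
        "mark_duplicates": ["picard", "MarkDuplicates",
                            f"I={rg}", f"O={dedup}",
                            f"M={sample}.dup_metrics.txt"],
        # Assumption: flagstat stands in for the summary-statistics step
        "stats": ["samtools", "flagstat", dedup],
        "index": ["samtools", "index", dedup],
        "split_intervals": ["gatk", "SplitIntervals", "-R", ref,
                            "--scatter-count", str(scatter_count),
                            "-O", "intervals"],
    }
```

Joining each list with `shlex.join` gives a printable command line, which can help when comparing the sketch against the `baseCommand` and arguments of the corresponding CWL wrapper.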
From this point on, the single-sample workflow is applied: multiple samples are accepted as input, but instead of being merged into a unified VCF file they are processed separately at each step, producing one VCF file per sample:
- Application of Base Quality Score Recalibration (BQSR) (GATK BaseRecalibrator, GatherBQSRReports and ApplyBQSR tools)
- Variant calling (GATK HaplotypeCaller)
- Merging of all genomic interval-split gVCF files for each sample (GATK MergeVcfs)
- Separate annotation of SNPs and INDELs based on pretrained Convolutional Neural Network (CNN) models (GATK SelectVariants, CNNScoreVariants and FilterVariantTranches tools)
- (Optional) Independent step of hard-filtering (GATK VariantFiltration)
- Variant filtering based on the information added during VQSR and/or custom filters (bcftools)
- Normalization of INDELs (split multiallelic sites) (bcftools)
- Annotation of the final dataset of filtered variants with genomic, population-related and/or clinical information (ANNOVAR)
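
The per-sample, per-interval structure of the steps above can be illustrated with a small scatter-gather sketch. It covers only BQSR, variant calling, and the per-sample merge, and it builds command lists without running them. The function name, file names, and the single known-sites resource are hypothetical. Whether the real workflow applies each tool per interval exactly as shown is an assumption.

```python
from typing import Dict, List


def single_sample_plan(samples: List[str], intervals: List[str],
                       ref: str, known_sites: str) -> Dict[str, List[List[str]]]:
    """Return, for each sample, the ordered commands of a scatter-gather run."""
    plan: Dict[str, List[List[str]]] = {}
    for sample in samples:
        bam = f"{sample}.dedup.bam"          # placeholder input name
        recal_table = f"{sample}.recal.table"
        recal_bam = f"{sample}.bqsr.bam"
        cmds: List[List[str]] = []

        # Scatter: one BaseRecalibrator job per interval file
        tables = []
        for i, interval in enumerate(intervals):
            table = f"{sample}.{i}.recal.table"
            tables.append(table)
            cmds.append(["gatk", "BaseRecalibrator", "-I", bam, "-R", ref,
                         "--known-sites", known_sites,
                         "-L", interval, "-O", table])

        # Gather: combine the per-interval reports, then recalibrate the BAM
        gather = ["gatk", "GatherBQSRReports"]
        for table in tables:
            gather += ["-I", table]
        cmds.append(gather + ["-O", recal_table])
        cmds.append(["gatk", "ApplyBQSR", "-I", bam, "-R", ref,
                     "--bqsr-recal-file", recal_table, "-O", recal_bam])

        # Scatter: one HaplotypeCaller job per interval file
        vcfs = []
        for i, interval in enumerate(intervals):
            vcf = f"{sample}.{i}.vcf.gz"
            vcfs.append(vcf)
            cmds.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
                         "-L", interval, "-O", vcf])

        # Gather: merge the interval-split files into one file per sample
        merge = ["gatk", "MergeVcfs"]
        for vcf in vcfs:
            merge += ["-I", vcf]
        cmds.append(merge + ["-O", f"{sample}.vcf.gz"])

        plan[sample] = cmds
    return plan
```

Samples never share a command in this plan, which mirrors the single-sample design: each sample ends with its own merged VCF, and with n intervals each sample contributes 2n + 3 commands.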
Version History
Version 1 (earliest) Created 5th Jul 2023 at 10:48 by Konstantinos Kyritsis
Initial commit

