This section of the documentation is currently under construction. If your question is not answered here, please contact us!
St. Jude Cloud hosts both raw genomic data files and processed results files:
|File Type||Short Description||Details|
|BAM||HG38 aligned BAM files produced by Microsoft Genomics Service||Click here|
|gVCF||Genomic VCF files produced by Microsoft Genomics Service||Click here|
|Somatic VCF||Curated list of somatic variants produced by the St. Jude somatic variant analysis pipeline||Click here|
|CNV||List of somatic copy number alterations produced by the St. Jude CONSERTING pipeline||Click here|
In St. Jude Cloud, we store aligned sequence reads in the ubiquitous BAM file format. BAM files were produced by the Microsoft Genomics Service, with reads aligned to HG38 (GRCh38 no alt analysis set). For more information about how Microsoft Genomics produces BAM files, or for any other questions regarding data generation, please refer to the official Microsoft Genomics whitepaper.
For more information on SAM/BAM files, please refer to the SAM/BAM specification.
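As a quick sanity check after downloading, the sketch below tests whether a file looks like a BAM file. It relies on two facts from the SAM/BAM specification: BAM files are BGZF-compressed (which is gzip-compatible), and the decompressed stream starts with the four magic bytes `BAM\1`. The function name `looks_like_bam` is a hypothetical helper for illustration, not part of any St. Jude Cloud tooling, and the check does not validate the rest of the file.

```python
import gzip
import io

# Magic bytes at the start of a decompressed BAM stream (per the SAM/BAM spec).
BAM_MAGIC = b"BAM\x01"


def looks_like_bam(source):
    """Return True if `source` (a path or binary file object) starts like a BAM file.

    Only the first four decompressed bytes are inspected.
    """
    try:
        with gzip.open(source, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compatible, or truncated before the magic bytes.
        return False


# Usage with an in-memory stand-in for a real file:
fake_bam = io.BytesIO(gzip.compress(BAM_MAGIC + b"\x00" * 8))
print(looks_like_bam(fake_bam))
```

For real analysis, a dedicated BAM library or samtools is the better choice; this check only guards against obviously wrong or corrupted downloads.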
We provide gVCF files produced by the Microsoft Genomics Service. Variants in the gVCF files are called from the BAM files described above using GATK's HaplotypeCaller. For details on the format, we currently defer to the official specification document from the Broad Institute, as well as this discussion on the difference between VCF and gVCF files. For more information about how Microsoft Genomics produces gVCF files, or for any other questions regarding data generation, please refer to the official Microsoft Genomics whitepaper.
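To make the VCF versus gVCF distinction concrete, here is a minimal sketch that classifies a single gVCF data line. It assumes the GATK gVCF conventions: a reference block has `<NON_REF>` as its only ALT allele and carries an `END=` key in the INFO column, while variant records list real alternate alleles before `<NON_REF>`. The function name `parse_gvcf_line` and the example lines are made up for illustration.

```python
def parse_gvcf_line(line):
    """Parse one tab-delimited gVCF data line into a small summary dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt, info = (
        fields[0], int(fields[1]), fields[3], fields[4], fields[7]
    )

    # Turn "KEY=VALUE;FLAG" into a dict; flags get an empty string value.
    info_map = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        info_map[key] = value

    # A reference block has <NON_REF> as its only ALT allele.
    is_block = alt == "<NON_REF>"
    # Blocks span POS..END; otherwise the record covers the REF allele.
    end = int(info_map["END"]) if "END" in info_map else pos + len(ref) - 1

    return {
        "chrom": chrom,
        "start": pos,
        "end": end,
        "kind": "reference_block" if is_block else "variant",
    }


# Illustrative lines (not taken from a real St. Jude Cloud file):
block = "chr1\t10001\t.\tT\t<NON_REF>\t.\t.\tEND=10050\tGT:DP:GQ\t0/0:30:90"
variant = "chr1\t10051\t.\tA\tG,<NON_REF>\t500\t.\tDP=30\tGT:DP:GQ\t0/1:30:99"
print(parse_gvcf_line(block))
print(parse_gvcf_line(variant))
```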
Somatic VCF files
Somatic VCF files contain HG38-based SNV/indel variant calls from the St. Jude somatic variant analysis pipeline. Broadly speaking, the pipeline works as follows:
- Reads were aligned to HG19 using bwa backtrack (bwa sampe) with default parameters.
- Post-processing of aligned reads was performed using Picard.
- Variants were called using the Bambino variant caller (you can download it by navigating here and searching for "Bambino package").
- Variants were post-processed using an in-house post-processing pipeline that cleans and annotates variants. This pipeline is not currently publicly available.
- Variants were manually reviewed by analysts and published with the relevant Pediatric Cancer Genome Project (PCGP) paper.
- Post-publication, variants were lifted over to HG38 (the original HG19 coordinates are stored in the
For more information on the variants for each individual, please refer to the relevant PCGP paper. For more information on the Variant Call Format (VCF), please see the latest VCF specification document listed here.
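As a rough illustration of working with these files, the sketch below tallies SNVs and indels in an uncompressed VCF using only the column layout defined by the VCF specification (REF in column 4, ALT in column 5). The classification rule is a simplification chosen for this example, not the rule used by the St. Jude pipeline, and the function name `count_variant_types` and the sample records are hypothetical.

```python
import io
from collections import Counter


def count_variant_types(handle):
    """Count SNVs and indels in a text VCF stream.

    Simplified rule (an assumption for this sketch): single-base REF and ALT
    is an SNV, differing lengths is an indel, anything else is "other".
    """
    counts = Counter()
    for line in handle:
        if line.startswith("#"):
            continue  # skip meta-information and the column header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        ref = fields[3]
        for alt in fields[4].split(","):
            if len(ref) == 1 and len(alt) == 1:
                counts["SNV"] += 1
            elif len(ref) != len(alt):
                counts["indel"] += 1
            else:
                counts["other"] += 1
    return counts


# Illustrative records (not real St. Jude Cloud data):
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tT\t.\tPASS\t.\n"
    "chr2\t67890\t.\tCT\tC\t.\tPASS\t.\n"
)
print(count_variant_types(example))
```

For a real file, pass an open text handle instead of the in-memory example, for instance one returned by `open(path)` or `gzip.open(path, "rt")`.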
CNV files contain copy number alteration (CNA) analysis results for paired tumor-normal WGS samples. Files are produced by running paired tumor-normal BAM files through the CONSERTING pipeline, which identifies CNAs through iterative analysis of (i) local segmentation by read depth within boundaries identified by structural variation (SV) breakpoints, followed by (ii) segment merging and local SV analysis. CREST was used to identify local SV breakpoints. CNV files contain the following information:
|loc.start||Start of the segment|
|loc.end||End of the segment|
|num.mark||Number of windows retained in the segment (gaps and windows with low mappability are excluded)|
|length.ratio||The ratio of the total length of the retained windows to the genomic length of the segment|
|seg.mean||The estimated GC-corrected difference signal (a 2-copy gain will have a seg.mean of 1)|
|GMean||The mean coverage in the germline sample (a value of 1 represents diploid)|
|DMean||The mean coverage in the tumor sample|
|LogRatio||Log2 ratio between tumor and normal coverage|
|Quality score||An empirical score used in merging|
|SV_Matching||Whether the boundaries of the segment are supported by SVs (3: both ends supported, 2: right end supported, 1: left end supported, 0: neither end supported)|
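The fields above lend themselves to simple filtering. The sketch below keeps segments whose copy number change is reasonably large and whose boundaries are both supported by SVs. It assumes the CNV file is tab-delimited with a header row whose column names match the field names in the table. That layout is an assumption, so check it against an actual file. The function name `confident_segments` and the default threshold of 0.5 are arbitrary choices for illustration.

```python
import csv
import io


def confident_segments(handle, min_abs_log_ratio=0.5):
    """Return segments with |LogRatio| >= threshold and both ends SV-supported.

    Assumes a tab-delimited file with a header row (an assumption about the
    file layout, not something confirmed by the documentation).
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        log_ratio = float(row["LogRatio"])
        sv_support = int(row["SV_Matching"])
        # SV_Matching == 3 means both segment ends are supported by SVs.
        if abs(log_ratio) >= min_abs_log_ratio and sv_support == 3:
            kept.append(
                {
                    "start": int(row["loc.start"]),
                    "end": int(row["loc.end"]),
                    "log_ratio": log_ratio,
                }
            )
    return kept


# Made-up rows using a subset of the columns described above:
example = io.StringIO(
    "loc.start\tloc.end\tLogRatio\tSV_Matching\n"
    "1000\t5000\t1.0\t3\n"
    "6000\t9000\t0.1\t3\n"
    "10000\t20000\t-0.8\t1\n"
)
for segment in confident_segments(example):
    print(segment)
```

Requiring support at both ends is deliberately strict. Relaxing the condition to `sv_support >= 1` would also keep segments with only one supported boundary.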