RNAIndel
RNAIndel calls coding indels from tumor RNA-Seq data and classifies them as somatic, germline, or artifactual. RNAIndel supports GRCh38, GRCh37, and mouse mm10.
What's new in Version 3
New implementation with indelpost, an indel realigner/phaser.
* faster analysis (typically < 20 min with 8 cores)
* somatic complex indel calling in RNA-Seq
* ensemble calling with your own caller (e.g., GATK HaplotypeCaller/MuTect2)
* improved sensitivity for homopolymer indels by error-profile outlier analysis
Quick Start
RNAIndel can be run via Docker or installed locally from PyPI.
Docker
We publish our latest Docker builds on GitHub. You can run the latest code base with the following command:
> docker run --rm -v ${PWD}:/data ghcr.io/stjude/rnaindel:latest
If you want a more native feel, add an alias to your shell's rc file. Use single quotes so that ${PWD} is expanded each time the alias runs rather than once when the rc file is loaded.
> alias rnaindel='docker run --rm -v ${PWD}:/data ghcr.io/stjude/rnaindel:latest'
Note: the first time you execute the docker run command, you will see Docker's output as it downloads the image.
PyPI
RNAIndel depends on python>=3.8.0 and java>=1.8.0.
Installing via the pip command will install the following packages:
* indelpost>=0.0.4
* pysam>=0.15.0
* cython>=0.29.12
* numpy>=1.16.0
* ssw-py>=1.0.1
* pandas>=0.23.0
* scikit-learn>=0.22.0
> pip install indelpost --no-binary indelpost --no-build-isolation
> pip install rnaindel
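Before installing, it can help to confirm the two stated prerequisites (python>=3.8.0 and java>=1.8.0). The following is a minimal sketch, not part of RNAIndel; the function names are hypothetical. It assumes the usual `java -version` banner format, where legacy releases report `1.8.0_x` for Java 8 and newer releases report the feature number first (e.g., `17.0.1`).

```python
import re
import subprocess
import sys


def python_ok(version_info=sys.version_info):
    """True if the interpreter satisfies python>=3.8.0."""
    return tuple(version_info[:3]) >= (3, 8, 0)


def java_ok(version_banner):
    """True if a `java -version` banner reports Java 8 (1.8.0) or newer."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_banner)
    if match is None:
        return False
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    # Legacy scheme: "1.8.0_292" is Java 8. Modern scheme: "17.0.1" is Java 17.
    feature = minor if major == 1 else major
    return feature >= 8


def installed_java_banner():
    """Return the banner printed by `java -version` (Java writes it to stderr).

    Raises FileNotFoundError if no `java` executable is on the PATH.
    """
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr
```

A typical check would be `python_ok() and java_ok(installed_java_banner())`.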
Test the installation.
> rnaindel -h
usage: rnaindel <subcommand> [<args>]
subcommands are:
SetUp Initialize predicition models
PredictIndels Predict somatic/germline/artifact indels from tumor RNA-Seq data
CalculateFeatures Calculate and report features for training
Train Perform model training
CountOccurrence Count occurrence within cohort to filter false somatic predictions
positional arguments:
subcommand PredictIndels, CalculateFeatures, Train, CountOccurrence
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
DataPackage
Download the data package (version 3 is not compatible with previous data packages). This data package is based on NCBI RefSeq. If Ensembl transcripts are preferred, please use GRCh38 MANE (Matched Annotation from NCBI and EBI). The mouse genome is also experimentally supported (RNAIndel version 3.4.0 or higher).
#GRCh38
curl -LO https://zenodo.org/records/17675562/files/data_dir_grch38.tar.gz
tar -xvzf data_dir_grch38.tar.gz
#GRCh38 MANE (Ensembl)
curl -LO https://zenodo.org/records/17675562/files/data_dir_grch38_mane.tar.gz
tar -xvzf data_dir_grch38_mane.tar.gz
#GRCh37
curl -LO https://zenodo.org/records/17675562/files/data_dir_grch37.tar.gz
tar -xvzf data_dir_grch37.tar.gz
#mouse mm10
curl -LO https://zenodo.org/records/17675562/files/data_dir_mm10.tar.gz
tar -xvzf data_dir_mm10.tar.gz
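For scripted setups, the same download-and-extract steps can be sketched in Python. The URLs are the ones listed above; the dictionary keys (`grch38`, `grch38_mane`, `grch37`, `mm10`) and function names are hypothetical labels chosen for this example, not identifiers used by RNAIndel.

```python
import tarfile
import urllib.request

# Base URL and archive names are taken from the curl commands above.
ZENODO_FILES = "https://zenodo.org/records/17675562/files"
ARCHIVES = {
    "grch38": "data_dir_grch38.tar.gz",
    "grch38_mane": "data_dir_grch38_mane.tar.gz",
    "grch37": "data_dir_grch37.tar.gz",
    "mm10": "data_dir_mm10.tar.gz",
}


def data_package_url(build):
    """Return the download URL for a genome build key (hypothetical keys)."""
    try:
        return f"{ZENODO_FILES}/{ARCHIVES[build]}"
    except KeyError:
        raise ValueError(f"unknown build {build!r}; choose from {sorted(ARCHIVES)}")


def fetch_data_package(build, dest="."):
    """Download the archive for `build` and unpack it into `dest`."""
    archive = ARCHIVES[build]
    urllib.request.urlretrieve(data_package_url(build), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

Calling `fetch_data_package("grch38")` would mirror the first pair of commands.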
Usage
RNAIndel has 5 subcommands:
* SetUp: pretrain the models with the user's scikit-learn version
* PredictIndels: analyze RNA-Seq data for indel discovery
* CalculateFeatures: calculate features for training
* Train: train models with the user's dataset
* CountOccurrence: annotate over-represented somatic predictions
Subcommands are invoked as follows:
> rnaindel subcommand [subcommand-specific options]
Set up
Run this command once before the first analysis. It takes 5 to 10 minutes to complete.
> rnaindel SetUp -d data_dir
Discover somatic indels
Input BAM file
RNAIndel expects a STAR 2-pass mapped BAM file that is sorted by coordinate and processed with MarkDuplicates. Further preprocessing, such as indel realignment, may prevent the desired behavior.
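Some of these expectations can be spot-checked from the BAM header before a run. The sketch below is an illustration, not an RNAIndel feature: it inspects a header given as a plain dictionary (the shape returned by `pysam.AlignmentFile(path).header.to_dict()`) and reports anything that looks off. It assumes the aligner and duplicate marker recorded themselves in `@PG` records, which is optional in SAM, so a warning is a hint rather than proof. The header also cannot show whether STAR was run in 2-pass mode.

```python
def check_bam_header(header):
    """Return a list of warnings for a SAM/BAM header given as a dict."""
    problems = []

    # @HD SO records the sort order declared by the tool that wrote the file.
    sort_order = header.get("HD", {}).get("SO")
    if sort_order != "coordinate":
        problems.append(f"expected coordinate-sorted BAM, found SO={sort_order!r}")

    # Collect program IDs and names from the @PG records (may be absent).
    programs = [
        f"{pg.get('ID', '')} {pg.get('PN', '')}" for pg in header.get("PG", [])
    ]
    if not any("STAR" in p for p in programs):
        problems.append("no STAR entry among @PG records")
    if not any("MarkDuplicates" in p for p in programs):
        problems.append("no MarkDuplicates entry among @PG records")
    return problems


# Example with a hand-written header; real headers come from the BAM file.
example = {
    "HD": {"VN": "1.4", "SO": "coordinate"},
    "PG": [{"ID": "STAR", "PN": "STAR"}, {"ID": "MarkDuplicates", "PN": "MarkDuplicates"}],
}
print(check_bam_header(example))
```

An empty list means no problem was detected by these three checks.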
Standard calling
This mode uses the built-in caller to analyze simple and complex indels.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -p 8
The -p option sets the number of cores (default: 1).
Ensemble calling
Indels in the external VCF (supplied by -v) are integrated into the callset from the built-in caller to boost performance.
See demo.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -v gatk.vcf.gz -p 8
With DNA-Seq
Somatic predictions from RNA-Seq are validated against DNA-Seq on the fly.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -t tumor.dna.bam -n normal.dna.bam -p 8
Extravaganza
Leverage all resources for best performance.
> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -v mutect2.vcf.gz -t tumor.dna.bam -n normal.dna.bam -p 8
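The four invocations above differ only in which optional inputs are supplied. For pipelines that launch RNAIndel from Python, one way to express this is a small helper that assembles the argument list. This is a sketch: the function and parameter names are hypothetical, and only the command-line flags themselves (-i, -o, -r, -d, -v, -t, -n, -p) come from this README.

```python
def predict_indels_argv(bam, out_vcf, ref_fasta, data_dir, cores=1,
                        external_vcf=None, tumor_dna_bam=None, normal_dna_bam=None):
    """Build the argument list for `rnaindel PredictIndels`."""
    argv = ["rnaindel", "PredictIndels",
            "-i", bam, "-o", out_vcf, "-r", ref_fasta, "-d", data_dir]
    if external_vcf is not None:      # ensemble calling
        argv += ["-v", external_vcf]
    if tumor_dna_bam is not None:     # validation against tumor DNA-Seq
        argv += ["-t", tumor_dna_bam]
    if normal_dna_bam is not None:    # validation against normal DNA-Seq
        argv += ["-n", normal_dna_bam]
    argv += ["-p", str(cores)]
    return argv


# Standard calling with 8 cores, printed as a shell-style command line.
print(" ".join(predict_indels_argv("input.bam", "output.vcf", "ref.fa", "data_dir", cores=8)))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues with unusual file names.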
Options
* -i: input STAR-mapped BAM file (required)
* -o: output VCF file (required)
* -r: reference genome FASTA file (required)
* -d: data directory contains trained models and databases (required)
* -v: VCF file (must be .vcf.gz + index) from user's caller. (default: None)
* -p: number of cores (default: 1)
Other options:
* -t: Tumor DNA-Seq BAM file (default: None)
* -n: Normal DNA-Seq BAM file (default: None)
* -q: STAR mapping quality MAPQ for unique mappers (default: 255)
* -m: maximum heap space (default: 6000m)
* --region: target genomic region. specify by chrN:start-stop (default: None)
* --pon: user's defined list of non-somatic calls such as PanelOfNormals. Supply as .vcf.gz with index (default: None)
* --include-all-external-calls: set to include all indels in VCF file supplied by -v. (default: False. Use only calls with PASS in FILTER)
* --skip-homopolyer-outlier-analysis: no outlier analysis for homopolymer indels (repeat > 4) performed if set. (default: False)
* --safety-mode: deactivate parallelism at realignment step. may be required to run with -p > 1 on some platforms. (default: False)
* --deactivate-sensitive-mode: deactivate additional realignments for soft-clipped reads. (default: False)
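The --region value follows the `chrN:start-stop` pattern. When regions are generated programmatically, validating the string before launching a run can save a failed job. The parser below is a sketch under assumptions: it accepts any contig name without whitespace or colons and plain integer coordinates, and it rejects start > stop. How RNAIndel itself validates the string, and whether it accepts other notations, is not stated here.

```python
import re

# contig name, a colon, then two non-negative integers separated by a hyphen
_REGION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<stop>\d+)$")


def parse_region(region):
    """Split 'chrN:start-stop' into (chrom, start, stop) or raise ValueError."""
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f"region must look like chrN:start-stop, got {region!r}")
    start, stop = int(match.group("start")), int(match.group("stop"))
    if start > stop:
        raise ValueError(f"start ({start}) is greater than stop ({stop})")
    return match.group("chrom"), start, stop


# Hypothetical coordinates, used only to show the return shape.
print(parse_region("chr1:100000-200000"))
```

The contig name should match the naming in the reference FASTA supplied with -r (for example, `chr1` versus `1`).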
Benchmarking
Using pediatric tumor RNA-Seq samples (SJC-DS-1003, n=77), run time and memory consumption were benchmarked for ensemble calling with 8 cores (i.e., -p 8) on a server with a 32-core AMD EPYC 7542 CPU @ 2.90 GHz.
| | Run time (wall) | Max memory |
|---|---|---|
| median | 374 sec | 18.6 GB |
| max | 1388 sec | 23.5 GB |
Train RNAIndel
Users can train RNAIndel with their own training set.
Annotate over-represented putative somatic indels
Check occurrence to filter probable false positives.
Contact
- kohei.hagiwara[AT]stjude.org
Citation
Published in Bioinformatics