Train RNAIndel

Train the RNAIndel model using your training set.

Step 1 (feature calculation)

Features are calculated for each indel and reported in a tab-delimited file.
Suppose we have N samples. For the i-th sample:

rnaindel CalculateFeatures  -i sample.i.bam \
                            -o sample.i.tab \
                            -r reference.fa \
                            -d ./data_dir_grch38 \
                            [-v sample.i.external.vcf.gz]
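
To run this step for all N samples, the command can be wrapped in a loop. The following is a minimal sketch, assuming the inputs are named sample.1.bam to sample.N.bam in the current directory, as in the example above. The function name and the DRY_RUN switch are hypothetical helpers, not part of RNAIndel.

```shell
# Sketch: run feature calculation for samples 1..N.
# Assumes inputs are named sample.<i>.bam; adjust the paths to your layout.
# With DRY_RUN=1 each command is printed instead of executed.
calculate_features_all() {
    n="$1"
    i=1
    while [ "$i" -le "$n" ]; do
        cmd="rnaindel CalculateFeatures -i sample.$i.bam -o sample.$i.tab -r reference.fa -d ./data_dir_grch38"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Unquoted on purpose so the string splits into arguments;
            # this assumes the paths contain no spaces.
            $cmd || return 1
        fi
        i=$((i + 1))
    done
}
```

For example, `DRY_RUN=1 calculate_features_all 3` prints the three commands for review, and `calculate_features_all 3` runs them. The optional -v argument is left out of the sketch and can be appended to the command string where an external VCF is available.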

Step 2 (annotation)

The output tab-delimited file has a "truth" column. Users annotate each indel by filling in this column. Possible values are:

somatic, germline, artifact 


Repeat Steps 1 and 2 for all N samples.
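
Before moving on, it may help to check that every row carries one of the three allowed values. The following is a minimal sketch, assuming each file has a header line containing a column named exactly truth; the function name is hypothetical and not part of RNAIndel.

```shell
# Sketch: report rows whose "truth" value is not somatic, germline, or artifact.
# The column is located by its header name, so its position is not assumed.
# Exit status is non-zero if any problem is found.
check_truth_column() {
    awk -F '\t' '
        FNR == 1 {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "truth") col = i
            if (col == 0) {
                printf "%s: no \"truth\" column found\n", FILENAME
                bad = 1
            }
            next
        }
        col > 0 && $col != "somatic" && $col != "germline" && $col != "artifact" {
            printf "%s:%d: unexpected truth value \"%s\"\n", FILENAME, FNR, $col
            bad = 1
        }
        END { exit bad }
    ' "$@"
}
```

Running `check_truth_column sample.*.tab` lists each offending row as file:line, including rows that were left unannotated.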

Step 3 (update models)

Concatenate the annotated files.

head -n 1 sample.1.tab > training_set.tab          # keep the header line
tail -n +2 -q sample.*.tab >> training_set.tab     # append the data rows of every file, skipping each header
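
As a variation, the concatenation can be done in a single awk pass that also checks that all files share the same header. This is a sketch only; the function name is hypothetical, and it assumes the first line of each file is its header.

```shell
# Sketch: merge annotated files, keeping the header of the first file only.
# Stops with a non-zero status if a later file has a different header line.
concat_tabs() {
    awk '
        FNR == 1 {
            if (NR == 1) {
                header = $0
                print
            } else if ($0 != header) {
                printf "header mismatch in %s\n", FILENAME > "/dev/stderr"
                exit 1
            }
            next
        }
        { print }
    ' "$@"
}
```

Usage would be `concat_tabs sample.*.tab > training_set.tab`. If a header mismatch is reported, the output file is incomplete and should be discarded.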

The concatenated file is used as a training set to update the models. Specify the indel class to be trained by -c.

rnaindel Train -t training_set.tab -d ./data_dir_grch38 -c indel_class_to_train [other options]

Options

  • -t training set with annotation (required)
  • -d data directory containing trained models and databases (required)
  • -c indel class to be trained: "s" for single-nucleotide indel, "m" for multi-nucleotide indel, "h" for homopolymer indel (required)
  • other options

    • -k number of folds in k-fold cross-validation (default: 5)
    • -p number of processes (default: 1)
    • -l directory to output log files (default: current)
    • --ds-beta F-beta to be optimized in the downsampling step. Optimized for TPR if beta > 100. (default: 10)
    • --fs-beta F-beta to be optimized in the feature selection step. Optimized for TPR if beta > 100. (default: 10)
    • --pt-beta F-beta to be optimized in the parameter tuning step. Optimized for TPR if beta > 100. (default: 10)
    • --downsample-ratio train with a user-specified downsample ratio: integer between 1 and 20. (default: None)
    • --feature-names train with a user-specified subset of features: input example (default: None)
    • --auto-param train with sklearn.RandomForestClassifier's max_features="auto" (default: False)
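
If all three indel classes are to be retrained, the Train command can be repeated once per class. The following is a minimal sketch using only the flags listed above; the process count, the ./logs directory, the function name, and the DRY_RUN switch are placeholders chosen for illustration, and whether the log directory must exist beforehand is not stated here.

```shell
# Sketch: train the models for each indel class in turn.
# Class codes s, m, h and the flags -t, -d, -c, -k, -p, -l come from the
# option list above; the values given to -p and -l are placeholders.
# With DRY_RUN=1 each command is printed instead of executed.
train_all_classes() {
    for cls in s m h; do
        cmd="rnaindel Train -t training_set.tab -d ./data_dir_grch38 -c $cls -k 5 -p 4 -l ./logs"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Unquoted on purpose so the string splits into arguments.
            $cmd || return 1
        fi
    done
}
```

`DRY_RUN=1 train_all_classes` prints the three commands for review before anything is run.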