Structural Variants Tutorial

The possible structural variants to design primers around are insertions, deletions, translocations, and inversions. These have a slightly different file format than the Standard and Multiplex inputs, and insertion + translocation are different than deletion + inversion. To view input file specifications, check out the inputs page.

Although the input files are different, the pipeline inputs and outputs are the same for all SV’s, so I will just perform the tutorial with a deletion. The steps of the pipeline are also quite similar to the Standard and Multiplex Tutorial, so if you have performed that one already, this may seem more straightforward to you.

This pipeline just parses sequences from the reference genome based on the coordinates in the inputs. If the SV type is a bit more complex than a deletion, such as an insertion of the - strand to the + strand of the normal chromosome, some more complex string manipulation is done when designing the positional flanking sequences.

NOTE: THERE IS CURRENTLY NO OPTION TO MULTIPLEX SV’S ALTHOUGH THIS COULD BE ADDED IN A LATER RELEASE IF DESIRED.

If you followed the installation page and git cloned the repo, there is a directory called test in the PrimerTK repo. This is for code testing but it also has data for the user to implement the full pipeline (testing is good).

For the SV workflow, only primer3 needs to be added to your path. If you followed standard installation and added it to your .bashrc, you can skip this. If not:

export PATH=~/bin/primer3:$PATH

Or add the above to your ~/.bashrc to prevent this in the future. Now check ot make sure primer_tk and primer3_core are accessible via the command line anywhere in your system.

0) Input File Examples

It is important to note that each translocation type as its own input file format. I will display a csv example of each with a brief explanation:

deletion.csv, inversion.csv:

gene1,sample1,1,100,500
gene2,sample2,1,1000000,2000000

Deletions and inversions are expected to take place in a defined region, therefore the csv file columns are the same.

columns:

column1: an arbitrary name for the position (I put gene but you can call it whatever you want).
column2: another name, this time sample, that way if the same sample has multiple regions you can differentiate.
column3: chromosome
column4: sv start
column5: sv stop

translocation.csv:

gene1,sample1,1,1000000,+,2,2000000,+
gene1,sample1,1,1000000,+,2,2000000,-
gene1,sample1,1,1000000,-,2,2000000,+
gene1,sample1,1,1000000,-,2,2000000,-

columns:

column3: the chromosome that is being translocated.
column4: the position the translocation event starts
column5: the strand being translocated
column6: the chromosome receiving the translocation
column7: the position it is starting
column8: which strand the sequence is translocating onto

Consider these situations from the above (all circumstances):

(+)SeqZ:	      	  (+)SeqM: 	      output (++):

ZZZZZZZZZZZ		   MMMMMMMMMMM	      MMMMMMZZZZZ
zzzzzzzzzzz		   mmmmmmmmmmm	      mmmmmmmmmmm

(+)SeqZ:	      	  (-)SeqM:	      output (+-):

ZZZZZZZZZZZ		   MMMMMMMMMMM	      MMMMMMMMMMM
zzzzzzzzzzz		   mmmmmmmmmmm	      ZZZZZmmmmmm

(-)SeqZ:	      	  (+)SeqM:	      output (-+):

ZZZZZZZZZZZ		   MMMMMMMMMMM	      MMMMMMzzzzz
zzzzzzzzzzz		   mmmmmmmmmmm	      mmmmmmmmmmm

(+)SeqZ:	      	  (-)SeqM:	      output (--):

ZZZZZZZZZZZ		   MMMMMMMMMMM	      MMMMMMMMMMM
zzzzzzzzzzz		   mmmmmmmmmmm	      zzzzzmmmmmm

So this gives us a brief diagram that we can always expect the translocated sequence to be going from 5’ -> 3’ and all sequence manipulation to generate the proper primers is handled for you.

insertion.csv:

gene1,sample1,1,1000000,1005000,+,2,2000000,2005000,-

columns:

column3: chr of sequence being inserted
column4: start position of sequence being inserted
column5: stop position of sequence being inserted
column6: strand of sequence being inserted
column7: chr where insertion is
column8: start position on chr where insertion is
column9: stop position on chr where insertion is
column10: strand where the sequence is being inserted

A pair of primers is generated for each insertion breakpoint by the same logic used in the translocation.

1) Setup a tutorial directory called sv_tutorial with files from test

I will do everything in a directory called sv_tutorial:

mkdir -p ~/sv_tutorial/

cd ~/sv_tutorial/
cp ~/PrimerTK/test/data/humrep.ref .
cp ~/PrimerTK/test/data/test_standard.fa .
cp ~/PrimerTK/test/data/input_sv.csv .

2) Start running the program

Then let;s start running the program. Step 1 is called iterator_sv and it has a lot of inputs, but that is just because I wanted to give users a lot of flexibility on thermodynamic parameters for primer3.

Let’s display the help message and explain the parameters:

primer_tk iterator_sv -h
usage: primer_tk iterator_sv [-h] -ref REF_GENOME -in REGIONS_FILE
                             [-opt_size PRIMER_OPT_SIZE]
                             [-min_size PRIMER_MIN_SIZE]
                             [-max_size PRIMER_MAX_SIZE]
                             [-opt_gc PRIMER_OPT_GC] [-min_gc PRIMER_MIN_GC]
                             [-max_gc PRIMER_MAX_GC] [-opt_tm PRIMER_OPT_TM]
                             [-min_tm PRIMER_MIN_TM] [-max_tm PRIMER_MAX_TM]
                             [-sr PRODUCT_SIZE_RANGE]
                             [-flank FLANKING_REGION_SIZE]
                             [-st SEQUENCE_TARGET] -mp MISPRIMING -tp
                             THERMOPATH -sv {deletion,inversion,insertion,translocation}

optional arguments:
  -h, --help            show this help message and exit
  -ref REF_GENOME, --ref_genome REF_GENOME
                        Reference Genome File to design primers around
  -in REGIONS_FILE, --regions_file REGIONS_FILE
                        File with regions to design primers around
  -opt_size PRIMER_OPT_SIZE, --primer_opt_size PRIMER_OPT_SIZE
                        The optimum primer size for output, default: 22
  -min_size PRIMER_MIN_SIZE, --primer_min_size PRIMER_MIN_SIZE
                        The optimum primer size for output, default: 18
  -max_size PRIMER_MAX_SIZE, --primer_max_size PRIMER_MAX_SIZE
                        The optimum primer size for output, default: 25
  -opt_gc PRIMER_OPT_GC, --primer_opt_gc PRIMER_OPT_GC
                        Optimum primer GC, default: 50
  -min_gc PRIMER_MIN_GC, --primer_min_gc PRIMER_MIN_GC
                        Minimum primer GC, default: 20
  -max_gc PRIMER_MAX_GC, --primer_max_gc PRIMER_MAX_GC
                        Maximum primer GC, default: 80
  -opt_tm PRIMER_OPT_TM, --primer_opt_tm PRIMER_OPT_TM
                        Optimum primer TM, default: 60
  -min_tm PRIMER_MIN_TM, --primer_min_tm PRIMER_MIN_TM
                        minimum primer TM, default: 57
  -max_tm PRIMER_MAX_TM, --primer_max_tm PRIMER_MAX_TM
                        maximum primer TM, default: 63
  -sr PRODUCT_SIZE_RANGE, --product_size_range PRODUCT_SIZE_RANGE
                        Size Range for PCR Product, default=200-400
  -flank FLANKING_REGION_SIZE, --flanking_region_size FLANKING_REGION_SIZE
                        This value will select how many bases up and
                        downstream to count when flanking SNP (will do 200 up
                        and 200 down), default: 200
  -st SEQUENCE_TARGET, --sequence_target SEQUENCE_TARGET
                        default: 199,1, should be half of your flanking region
                        size, so SNP/V will be included.
  -mp MISPRIMING, --mispriming MISPRIMING
                        full path to mispriming library for primer3 (EX:
                        /home/dkennetz/mispriming/humrep.ref
  -tp THERMOPATH, --thermopath THERMOPATH
                        full path to thermo parameters for primer3 to use (EX:
                        /home/dkennetz/primer3/src/primer3_config/) install
                        loc
  -sv {deletion,inversion,insertion}, --sv-type {deletion,inversion,insertion}
                        currently supported SV primer generation: deletion,
                        inversion, and insertion.

The -ref parameter is the reference genome for which you want to design primers. The -in parameter is the position input file with the positions of interest. The parameters in the middle with -opt, -min, -max in front of them are the thermodynamic parameters. They all contain defaults that I have found to be good for most standard small fragment experiments. -sr is the product_size_range, -mp is the primer mispriming library which penalizes primers that land on sequences in the file. I have provided a copy of the file I use in the test directory as I think it is sufficient for most cases. -tp is the thermodynamics path for your primer3 program to use. This is in the primer3 install location, so if you installed primer3 in your ~/bin like the standard installation, this path will be ~/bin/primer3/src/primer3_config/. Lastly, -sv is the sv type. For this I will choose deletion. If you are doing a translocation setup, you should use insertion. This is explained in the inputs for the insertion file.

Anything with a default value does not need to be specified unless you want to change that default value to something else, but for the sake of the tutorial I will include them:

NOTE: THE -st SHOULD ALWAYS BE (-flank-1,1) SO IF FLANK IS 200, -st SHOULD be 200-1,1 OR 199,1. THIS IS SO YOUR POSITION OF INTEREST IS GUARANTEED TO BE IN YOUR PRODUCT. I ALSO ALWAYS SET MY -sr UPPER LIMIT TO BE TWICE MY FLANK SIZE.

primer_tk iterator_sv \
 -ref test_standard.fa \
 -in input_sv.csv \
 -opt_size 22 \
 -min_size 18 \
 -max_size 25 \
 -opt_gc 50 \
 -min_gc 20 \
 -max_gc 80 \
 -opt_tm 60 \
 -min_tm 57 \
 -max_tm 63 \
 -sr 200-400 \
 -flank 200 \
 -st 199,1 \
 -mp /home/dkennetz/sv_tutorial/humrep.ref \
 -tp /home/dkennetz/bin/primer3/src/primer3_config/ \
 -sv deletion

This will output the following files:

flanking_regions.input_sv.fasta
primer3_input.input_sv.txt

The flanking regions file is used later in the workflow, and the primer3_input file is used in the next (primer3) step. The filename convention is always flanking_regions..fasta and priemr3_input..txt.

3) Run Primer3

This step is easy enough but can be time consuming for large datasets. In our case, it should take about 30 seconds.

primer3_core --output=primer_dump.txt primer3_input.input_sv.txt

This will output a log file named primer_dump.txt (we want) and a bunch of intermediate files (we don’t want). I like to clean the intermediates up (the intermediates also aren’t kept if you use CWL).

rm -rf *.int *.for *.rev

Up to 5 primer pairs will be output per position, and all will be visible in the files. In the final step, we will have one file called “top_primers” and one called “all_primers”. The top_primers file will be the top ranking primer pair that passed all filtering steps, and all_primers will output all primers that passed all filtering steps (up to 5 per position).

4) Run the primer3 output parsing

This is similar to the Standard Primer Design pipeline in that all we are doing is parsing the primer3 output in this step:

primer_tk pre_sv -h
usage: primer_tk pre_sv [-h] -d DUMP -o OUTFILE [-pcr PCRFILE]

optional arguments:
  -h, --help            show this help message and exit
  -d DUMP, --primer3_dump DUMP
                        Primer3 stdout passed into a 'dump' file to be used as
                        input
  -o OUTFILE, --outfile_name OUTFILE
                        The output filename for all primer information.
  -pcr PCRFILE, --pcrfile PCRFILE
                        The pseudopcr file

the -pcr flag here just sets up a pseudopcr file and could be useful in the future, so I included it. To run:

primer_tk pre_sv \
 -d primer_dump.txt \
 -o total_primers.csv \
 -pcr fake_pcr.txt

If we look at the output, we see two primer pairs were designed around 1 position. This looks like a tough region to design around! We can always go back and check out the primer_dump.txt file to see why a specific region was rejected. It has a line like this for each position in the file:

PRIMER_LEFT_EXPLAIN=considered 653, low tm 476, high repeat similarity 113, long poly-x seq 64, ok 0
PRIMER_RIGHT_EXPLAIN=considered 1308, GC content failed 557, low tm 53, high tm 521, high hairpin stability 147, high repeat similarity 21, ok 9

So for position 1, we see all the problems associated with the region and why primers failed there. If we actually look at the sequence it used:

SEQUENCE_TEMPLATE=AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGC

This is reasonable!

5) Pseudo PCR check

There is no actual PCR done here, as it isn’t super necessary when we are not multiplexing. The likelihood of severe off-target for a single pair is pretty low, although it does happen occasionally!

To run the next step:

primer_tk post_sv -h
usage: primer_tk post_sv [-h] [-f FLANK_FILE] [-tp TOTAL_PRIMERS]
                         [-all ALL_FINAL_PRIMERS] [-top TOP_FINAL_PRIMERS]
                         [-plate PLATE_BASENAME]

optional arguments:
  -h, --help            show this help message and exit
  -f FLANK_FILE, --flank_file FLANK_FILE
                        use flanking_regions file from output of
                        genome_iterator_sv.py
  -tp TOTAL_PRIMERS, --total_primers TOTAL_PRIMERS
                        the pre-PCR master primer file that contains all
                        sample + primer info
  -all ALL_FINAL_PRIMERS, --all_final_primers ALL_FINAL_PRIMERS
                        all primers generated for targets
  -top TOP_FINAL_PRIMERS, --top_final_primers TOP_FINAL_PRIMERS
                        top primers generated for targets
  -plate PLATE_BASENAME, --plate_basename PLATE_BASENAME
                        the basename of the primers in plate format ready to
                        order.

For this step, we actually pass the flanking regions file output from step 1 back into this step because this is the sequence that was used to design the primer pair. From this we can extract the full sequence the primers are amplifying and give pseudo-pcr product information such as GC content and product length, which is why I say there is pseudo-pcr done here.

The primer positions for each pair in regards to the reference are also output, as this can be useful to potentially trouble shoot primers that seem like they have failed.

The outputs for this are all primer pairs, and top primer pair by input position, as well as an easy to order primer plate format.

To run:

primer_tk post_sv \ -f flanking_regions.input_sv.fasta \ -tp total_primers.csv \ -all all_primers.csv \ -top top_primers.csv \ -plate plate \

And all final outputs from this pipeline are:

primer3_input.input_sv.txt
flanking_regions.input_sv.fasta
primer_dump.txt
total_primers.csv
fake_pcr.txt
all_primers.csv
top_primers.csv
plate_F.csv
plate_R.csv

7) SNP Annotation of Primers

To view this, please head over to the Standard and Multiplex Tutorial Step 7.

PrimerTK

A toolkit to design standard primers, multiplexed primers, and primers around SV's

Structural Variants Tutorial

0) Input File Examples

1) Setup a tutorial directory called sv_tutorial with files from test

2) Start running the program

3) Run Primer3

4) Run the primer3 output parsing

5) Pseudo PCR check

7) SNP Annotation of Primers