Haplosaurus
Haplosaurus (haplo) is a locally run implementation of the functionality that powers the Ensembl transcript haplotypes view.
It takes phased genotypes from a VCF and constructs a pair of haplotype sequences for each overlapped transcript; these sequences are also translated into predicted protein haplotype sequences. Each variant haplotype sequence is aligned and compared to the reference, and an HGVS-like name is constructed representing its differences to the reference.
This approach offers an advantage over VEP's analysis, which treats each input variant independently. By considering the combined change contributed by all the variant alleles across a transcript, the compound effects the variants may have are correctly accounted for.
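The difference between per-variant and per-haplotype analysis can be illustrated with a toy example. The sketch below is not haplo's implementation: the function names, the one-codon reference and the four-entry codon table are all assumptions made for illustration, and only SNVs with diploid phased genotypes are handled. It applies two phased SNVs to a single codon, first independently and then together.

```python
# Toy codon table: only the four codons this example needs.
CODON_TABLE = {"CGG": "R", "TGG": "W", "CGA": "R", "TGA": "*"}


def apply_snvs(ref_seq, snvs):
    """Apply SNVs given as (1-based position, ref base, alt base)."""
    seq = list(ref_seq)
    for pos, ref, alt in snvs:
        if seq[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


def build_haplotypes(ref_seq, phased_snvs):
    """Build one sequence per chromosome copy from phased genotypes.

    phased_snvs holds (position, ref, alt, genotype) with genotypes
    such as "1|0"; a copy carries the SNV when its allele is "1".
    """
    haplotypes = []
    for copy in (0, 1):
        carried = [
            (pos, ref, alt)
            for pos, ref, alt, gt in phased_snvs
            if gt.split("|")[copy] == "1"
        ]
        haplotypes.append(apply_snvs(ref_seq, carried))
    return haplotypes


reference = "CGG"  # hypothetical one-codon reference (arginine)
snvs = [(1, "C", "T", "1|0"), (3, "G", "A", "1|0")]

# Each SNV considered on its own, as in a per-variant analysis.
independent = [
    CODON_TABLE[apply_snvs(reference, [(pos, ref, alt)])]
    for pos, ref, alt, _ in snvs
]
# Both SNVs applied to the chromosome copy that carries them.
combined = [CODON_TABLE[hap] for hap in build_haplotypes(reference, snvs)]

print(independent)  # ['W', 'R']
print(combined)     # ['*', 'R']
```

Taken one at a time, the SNVs look like a missense and a synonymous change; taken together on the same copy they create a stop codon, which is the kind of compound effect a haplotype-level analysis is meant to capture.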
haplo shares much of the same command line functionality with vep, and can use VEP caches, Ensembl databases, GFF and GTF files as sources of transcript data; all vep command line flags relating to this functionality work the same with haplo.
Download and install
Haplosaurus is part of the VEP package.
Please follow the instructions about the download and installation of VEP.
To make Haplosaurus run faster, see the bioperl-ext section below.
Usage
Input data must be a VCF containing phased genotype data for at least one individual, and the file must be sorted by chromosome and genomic position; no other formats are currently supported.
Example of VCF input:
##fileformat=VCFv4.2
#CHROM  POS       ID           REF     ALT  QUAL  FILTER  INFO  FORMAT  IND1  IND2
12      6029429   rs1800380    C       T    .     .       .     GT      1|0   1|0
12      6029431   rs370984712  G       A    .     .       .     GT      1|0   1|0
12      12477741  rs201941751  GCACGC  G    .     .       .     GT      0|1   1|0
12      12477747  rs200271649  TGGGC   T    .     .       .     GT      0|1   1|0
21      25597309  rs540774105  G       A    .     .       .     GT      0|0   0|0
21      25597391  rs1135618    T       C    .     .       .     GT      1|1   1|0
21      25606638  rs3989369    A       G    .     .       .     GT      0|1   0|0
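Before running haplo it can be useful to confirm that a file meets these input requirements. The following is a minimal sketch, not part of the VEP package: the function name check_vcf is hypothetical, "sorted" is interpreted here as contiguous chromosome blocks with non-decreasing positions, and a genotype is treated as unphased if it contains "/".

```python
def check_vcf(lines):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    finished_chroms = set()
    current_chrom = None
    last_pos = 0
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])

        # Sort check: each chromosome forms one block, positions ascend.
        if chrom != current_chrom:
            if chrom in finished_chroms:
                problems.append(f"line {number}: chromosome {chrom} is not contiguous")
            if current_chrom is not None:
                finished_chroms.add(current_chrom)
            current_chrom, last_pos = chrom, pos
        elif pos < last_pos:
            problems.append(f"line {number}: position {pos} precedes {last_pos}")
        else:
            last_pos = pos

        # Phasing check: assumes the FORMAT column lists a GT key.
        gt_index = fields[8].split(":").index("GT")
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            if "/" in genotype:
                problems.append(f"line {number}: unphased genotype {genotype}")
    return problems
```

For example, `check_vcf(open("input.vcf"))` returns an empty list for the file shown above, provided it is tab-delimited.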
When using a VEP cache as the source of transcript annotation, the first time you run haplo with a particular cache it will spend some time scanning transcript locations in the cache.
./haplo -i input.vcf -o out.txt --cache
Output
The default output format is a simple tab-delimited file reporting all observed non-reference haplotypes. It has the following fields:
- Transcript stable ID
- CDS haplotype name
- Comma-separated list of flags for CDS haplotype
- Protein haplotype name
- Comma-separated list of flags for protein haplotype
- Comma-separated list of frequency data for protein haplotype
- Comma-separated list of contributing variants
- Comma-separated list of sample:count that exhibit this haplotype
Example of outputs (default format), using the VCF data displayed above:
# Input
#CHROM  POS      ID           REF  ALT  QUAL  FILTER  INFO  FORMAT  IND1  IND2
12      6029429  rs1800380    C    T    .     .       .     GT      1|0   1|0
12      6029431  rs370984712  G    A    .     .       .     GT      1|0   1|0

# Output
ENST00000261405 | ENST00000261405:2878C>T,2880G>A | | ENSP00000261405:960R>*,961del{1854} | stop_change | | rs370984712,rs1800380 | IND1:1,IND2:1
# Input
#CHROM  POS       ID           REF     ALT  QUAL  FILTER  INFO  FORMAT  IND1  IND2
12      12477741  rs201941751  GCACGC  G    .     .       .     GT      0|1   1|0
12      12477747  rs200271649  TGGGC   T    .     .       .     GT      0|1   1|0

# Output
ENST00000298573 | ENST00000298573:1080del{4},1085delG,1087delGTG,1092delC | resolved_frameshift,indel | ENSP00000298573:364delPSV | indel | | rs200271649,rs201941751 | IND1:1,IND2:1
# Input
#CHROM  POS       ID         REF  ALT  QUAL  FILTER  INFO  FORMAT  IND1  IND2
21      25597391  rs1135618  T    C    .     .       .     GT      1|1   1|0
21      25606638  rs3989369  A    G    .     .       .     GT      0|1   0|0

# Output
ENST00000352957 | ENST00000352957:91T>C,612A>G | | ENSP00000284967:31S>P | | | rs3989369,rs1135618 | IND1:1
ENST00000307301 | ENST00000307301:612A>G | | ENSP00000305682:REF | | | rs1135618 | IND1:1,IND2:1
ENST00000307301 | ENST00000307301:91T>C,612A>G | | ENSP00000305682:31S>P | | | rs3989369,rs1135618 | IND1:1
ENST00000419219 | ENST00000419219:582A>G | | ENSP00000404426:REF | | | rs1135618 | IND1:1,IND2:1
ENST00000419219 | ENST00000419219:91T>C,582A>G | | ENSP00000404426:31S>P | | | rs3989369,rs1135618 | IND1:1
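The eight fields listed above can be read into a structured record with a few lines of Python. This is a sketch under assumptions: the column names in FIELDS are hypothetical labels (the output file itself has no such names), and the input is taken to be the tab-delimited file rather than the " | " layout used for display in the examples.

```python
# Hypothetical labels for the eight columns, in the documented order.
FIELDS = [
    "transcript_id",
    "cds_name",
    "cds_flags",
    "protein_name",
    "protein_flags",
    "protein_frequencies",
    "variants",
    "samples",
]


def split_list(value):
    """Split a comma-separated field, returning [] for an empty field."""
    return [item for item in value.split(",") if item]


def parse_haplo_line(line):
    """Parse one line of default haplo output into a dictionary."""
    # Strip only the newline so trailing empty columns are preserved.
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    record = dict(zip(FIELDS, columns))
    for key in ("cds_flags", "protein_flags", "protein_frequencies", "variants"):
        record[key] = split_list(record[key])
    # The last column holds sample:count pairs, e.g. IND1:1,IND2:1.
    samples = {}
    for item in split_list(record["samples"]):
        name, count = item.rsplit(":", 1)
        samples[name] = int(count)
    record["samples"] = samples
    return record
```

The haplotype names are left as single strings because they contain commas of their own (for example 2878C>T,2880G>A).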
Alternatively, JSON output matching the format of the transcript haplotype REST endpoint may be generated by using --json.
Each transcript analysed is summarised as a JSON object written to one line of the output file.
You may exclude fields in the JSON from being exported with --dont_export field1,field2.
This may be used, for example, to exclude the full haplotype sequence and aligned sequences from the output with --dont_export seq,aligned_sequences.
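Because each transcript occupies one line, the JSON output can be streamed line by line. The sketch below is an illustration only: read_haplo_json and strip_fields are hypothetical helper names, and the recursive removal of keys is a post-processing approximation that may not match exactly what --dont_export does inside haplo.

```python
import json


def strip_fields(value, drop):
    """Return a copy of a decoded JSON value without the named keys."""
    if isinstance(value, dict):
        return {
            key: strip_fields(item, drop)
            for key, item in value.items()
            if key not in drop
        }
    if isinstance(value, list):
        return [strip_fields(item, drop) for item in value]
    return value


def read_haplo_json(lines, drop=()):
    """Yield one decoded object per non-empty line of --json output."""
    drop = set(drop)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield strip_fields(json.loads(line), drop)
```

A typical use would be `for record in read_haplo_json(open("out.txt"), drop=["seq", "aligned_sequences"])`, which keeps memory use bounded to one transcript at a time.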
REST API
The transcript haplotype REST endpoint returns arrays of protein_haplotypes and cds_haplotypes for a given transcript.
The default haplotype record includes:
- population_counts: the number of times the haplotype is seen in each population
- population_frequencies: the frequency of the haplotype in each population
- contributing_variants: variants contributing to the haplotype
- diffs: differences between the reference and this haplotype
- hex: the md5 hex of this haplotype sequence
- other_hexes: the md5 hexes of other related haplotype sequences (the CDSHaplotypes that translate to this ProteinHaplotype, or the ProteinHaplotype representing the translation of this CDSHaplotype)
- has_indel: whether the haplotype contains insertions or deletions
- type: the type of haplotype (cds or protein)
- name: a human readable name for the haplotype (sequence id + REF or a change description)
- flags: flags for the haplotype (see the Flags section)
- frequency: haplotype frequency in full sample set
- count: haplotype count in full sample set
By default, the REST service does not return raw sequences, sample-haplotype assignments, or the aligned sequences used to generate differences.
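Once a response has been decoded, the record fields listed above are enough to pick out haplotypes of interest. The sketch below assumes the response is a dictionary with a protein_haplotypes array whose records carry name, flags (taken here to be a list of strings) and frequency; the function name and the sample values are hypothetical and do not come from a real query.

```python
def flagged_protein_haplotypes(response, flag):
    """Return protein haplotypes carrying a flag, most frequent first."""
    hits = [
        haplotype
        for haplotype in response.get("protein_haplotypes", [])
        if flag in haplotype.get("flags", [])
    ]
    return sorted(hits, key=lambda h: h.get("frequency", 0), reverse=True)


# Synthetic response with made-up names and frequencies.
example = {
    "protein_haplotypes": [
        {"name": "ENSP_EXAMPLE:REF", "flags": [], "frequency": 0.90},
        {"name": "ENSP_EXAMPLE:10delA", "flags": ["indel"], "frequency": 0.02},
        {"name": "ENSP_EXAMPLE:20delPS", "flags": ["indel"], "frequency": 0.08},
    ]
}

for haplotype in flagged_protein_haplotypes(example, "indel"):
    print(haplotype["name"], haplotype["frequency"])
```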
Flags
Haplotypes may be flagged with one or more of the following:
- indel: haplotype contains an insertion or deletion (indel) relative to the reference.
- frameshift: haplotype contains at least one indel that disrupts the reading frame of the transcript.
- resolved_frameshift: haplotype contains two or more indels whose combined effect restores the reading frame of the transcript.
- stop_changed: haplotype either gains a STOP codon (protein truncating variant, PTV) or loses the existing reference STOP codon.
- deleterious_sift_or_polyphen: haplotype contains at least one single amino acid substitution event flagged as deleterious (SIFT) or probably damaging (PolyPhen-2).
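The relationship between the indel, frameshift and resolved_frameshift flags can be approximated with simple length arithmetic. The following is a deliberately simplified sketch, not haplo's actual logic: it looks only at the net length change of each indel (negative for deletions), ignores positions and any stop codons in the shifted region, and the function name frame_flags is hypothetical.

```python
def frame_flags(length_changes):
    """Derive frame-related flags from per-indel length changes.

    length_changes: one integer per indel, e.g. -4 for a 4-base
    deletion or 2 for a 2-base insertion.
    """
    indels = [change for change in length_changes if change != 0]
    if not indels:
        return []
    flags = ["indel"]
    # An indel whose length is not a multiple of 3 shifts the frame.
    if any(change % 3 != 0 for change in indels):
        if sum(indels) % 3 == 0:
            # The combined change is a multiple of 3, so the frame is restored.
            flags.append("resolved_frameshift")
        else:
            flags.append("frameshift")
    return flags
```

Applied to the second example output, whose CDS haplotype name lists deletions of 4, 1, 3 and 1 bases, `frame_flags([-4, -1, -3, -1])` returns `['indel', 'resolved_frameshift']`, since the 9 deleted bases amount to exactly three codons.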
bioperl-ext
haplo can make use of a fast compiled alignment algorithm from the bioperl-ext package; this can speed up analysis, particularly in longer transcripts where insertions and/or deletions are introduced.
The bioperl-ext package is no longer maintained and requires some tweaking to install.
The following instructions install the package in $HOME/perl5; edit PREFIX=[path] to change this. You may also need to edit the export command to point to the path created for the architecture on your machine.
git clone https://github.com/bioperl/bioperl-ext.git
cd bioperl-ext/Bio/Ext/Align/
perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL
perl Makefile.PL PREFIX=~/perl5
make
make install
cd -
export PERL5LIB=${PERL5LIB}:${HOME}/perl5/lib/x86_64-linux-gnu/perl/5.22.1/
If the installation was successful, the following command should print OK:
perl -MBio::Tools::dpAlign -e"print qq{OK\n}"