Liftover of Clinvar¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "internal_hg19"
%env GOR_API_PROJECT={project}
Clinvar¶
First, lets look at our Clinvar database and the number of variants in it:
%%gor
gor ref/clinical_variants.gorz | granno genome -count | top 5 | columnreorder chrom,pos,ref,alt,allcount
Query ran in 0.09 sec Query fetched 5 rows in 4.10 sec (total time 4.19 sec)
chrom | pos | ref | alt | allCount | disease | dbsource | clinimpact | maxclinimpact | set_hgnc | variationid | submitter | FDA_approved | max_clinical_significance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69134 | A | G | 2307454 | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2205837 | Ambry Genetics | 0 | LB |
1 | chr1 | 69581 | C | G | 2307454 | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2252161 | Ambry Genetics | 0 | VUS |
2 | chr1 | 69682 | G | A | 2307454 | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2396347 | Ambry Genetics | 0 | VUS |
3 | chr1 | 69769 | T | C | 2307454 | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2288999 | Ambry Genetics | 0 | VUS |
4 | chr1 | 69995 | G | C | 2307454 | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2351346 | Ambry Genetics | 0 | VUS |
%%gor
gor ref/clinical_variants.gorz | top 5
| liftover config/liftover/hg19tohg38.gor -var -build hg19
| columnreorder chrom,pos,ref,alt,hg19_*
Query ran in 0.71 sec Query fetched 5 rows in 0.17 sec (total time 0.88 sec)
chrom | pos | ref | alt | hg19_chrom | hg19_pos | hg19_qStrand | hg19_liftoverStatus | disease | dbsource | clinimpact | maxclinimpact | set_hgnc | variationid | submitter | FDA_approved | max_clinical_significance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69134 | A | G | chr1 | 69134 | + | mapped | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2205837 | Ambry Genetics | 0 | LB |
1 | chr1 | 69581 | C | G | chr1 | 69581 | + | mapped | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2252161 | Ambry Genetics | 0 | VUS |
2 | chr1 | 69682 | G | A | chr1 | 69682 | + | mapped | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2396347 | Ambry Genetics | 0 | VUS |
3 | chr1 | 69769 | T | C | chr1 | 69769 | + | mapped | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2288999 | Ambry Genetics | 0 | VUS |
4 | chr1 | 69995 | G | C | chr1 | 69995 | + | mapped | INBORN_GENETIC_DISEASES | clinvar | unknown | Unknown | OR4F5 | 2351346 | Ambry Genetics | 0 | VUS |
We can summarize how many variants are mapped:
%%gor
gor ref/clinical_variants.gorz | select 1-4
| liftover config/liftover/hg19tohg38.gor -var -build hg19
| group genome -gc hg19_liftoverstatus -count
Query ran in 0.24 sec Query fetched 2 rows in 23.83 sec (total time 24.07 sec)
chrom | bpStart | bpStop | hg19_liftoverStatus | allCount | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | mapped | 2307326 |
1 | chrA | 0 | 1000000000 | unmapped | 128 |
We see that only 128 out of 2,307,454 = 128+2,307,326 are not mapped, i.e. 0.005%. Notice that we used the -var option which forces the whole ref segment of the variant to fall fully within the definition of the so-called chain files. In GOR, the UCSC defined chain files have been mapped to GOR files, i.e. config/liftover/hg19tohg38.gor. We can also run the liftover with the -snp option, i.e. map the variants as-if SNPs.
%%gor
gor ref/clinical_variants.gorz | select 1-4
| liftover config/liftover/hg19tohg38.gor -snp -build hg19
| group genome -gc hg19_liftoverstatus -count
Query ran in 0.65 sec Query fetched 2 rows in 18.79 sec (total time 19.45 sec)
chrom | bpStart | bpStop | hg19_liftoverStatus | allCount | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | mapped | 2307336 |
1 | chrA | 0 | 1000000000 | unmapped | 118 |
Now we see that 10 more variants are mapped.
Mapping the variants back to hg19 from hg38¶
We can also see how many of the variant are mapped back:
%%gor
gor ref/clinical_variants.gorz | select 1-4
| liftover config/liftover/hg19tohg38.gor -var -build hg19
| liftover config/liftover/hg38tohg19.gor -var -build hg38
| calc good if(chrom=hg19_chrom and pos = hg19_pos and pos > 0 and ref=upper(refbases(chrom,pos,pos+len(ref)-1)),1,0)
| group genome -gc good -count
Query ran in 0.52 sec Query fetched 2 rows in 49.63 sec (total time 50.14 sec)
chrom | bpStart | bpStop | good | allCount | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 0 | 486 |
1 | chrA | 0 | 1000000000 | 1 | 2306968 |
Now we have additional 486 variants that are not good, i.e. don't make it back and have reference value consistent with the reference sequence build, i.e. 0.021% loss in back-and-forth lift-over.
Looking at the reference consistency in hg38¶
Now we want to see how the reference of the variants successfully mapped to hg38 (based on the chain definitions) compare with the reference genome in hg38. For that we must save the output and move it to a differenct project, i.e. internal_hg38. But first we write out the file in hg19.
project = "internal_hg19"
%env GOR_API_PROJECT={project}
env: GOR_API_PROJECT=internal_hg19
%%gor
gor ref/clinical_variants.gorz | select 1-4 | rownum | rename rownum ID
| liftover config/liftover/hg19tohg38.gor -var -build hg19
| write user_data/hakon/liftover/clinvar_hg38.gorz
Query ran in 0.37 sec Query fetched 0 rows in 26.71 sec (total time 27.08 sec)
chrom | pos |
---|
Let's look at the folde we want to copy from:
import os
cmd = "ls -al ~/projects/internal_hg19/user_data/hakon/liftover"
returned_value = os.system(cmd) # Returns the exit code in Unix
print("Returned value:", returned_value)
total 143065 drwxr-xr-x 2 jovyan jovyan 25600 Apr 8 09:43 . drwxr-xr-x 21 jovyan jovyan 33792 Apr 7 19:07 .. -rw-r--r-- 1 jovyan jovyan 9140973 Apr 8 09:43 clinvar_hg19_chr1.vcf -rw-r--r-- 1 jovyan jovyan 107610131 Apr 8 09:37 clinvar_hg19.vcf -rw-r--r-- 1 jovyan jovyan 29910205 Apr 8 10:23 clinvar_hg38.gorz -rw-r--r-- 1 jovyan jovyan 403 Apr 8 10:23 clinvar_hg38.gorz.meta -rw-r--r-- 1 jovyan jovyan 5354 Apr 7 19:16 clinvar_unmapped.vcf Returned value: 0
cmd = "cp ~/projects/internal_hg19/user_data/hakon/liftover/clinvar_hg38.gorz ~/projects/internal_hg38/user_data/hakon/liftover/."
returned_value = os.system(cmd) # Returns the exit code in Unix
print("Returned value:", returned_value)
Returned value: 0
We switch project such that following queries run in hg38
project = "internal_hg38"
%env GOR_API_PROJECT={project}
env: GOR_API_PROJECT=internal_hg38
%%gor
gor user_data/hakon/liftover/clinvar_hg38.gorz
| calc good if(hg19_liftoverStatus = 'mapped' and ref=upper(refbases(chrom,pos,pos+len(ref)-1)),1,0)
| group genome -gc good -count
Query ran in 0.20 sec Query fetched 2 rows in 16.10 sec (total time 16.29 sec)
chrom | bpStart | bpStop | good | allCount | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 0 | 2906 |
1 | chrA | 0 | 1000000000 | 1 | 2304548 |
2906/(2304548+2906)*100
0.12593967203679898
We see that 0.12% of the variants do NOT have consistent reference in hg38, a significantly worse outcome than if we simply look at the success of mapping alone, as we did before and again here below:
%%gor
gor user_data/hakon/liftover/clinvar_hg38.gorz
| calc good if(hg19_liftoverStatus = 'mapped',1,0)
| group genome -gc good -count
Query ran in 0.54 sec Query fetched 2 rows in 2.17 sec (total time 2.71 sec)
chrom | bpStart | bpStop | good | allCount | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 0 | 128 |
1 | chrA | 0 | 1000000000 | 1 | 2307326 |
We can break down the differences:
%%gor
nor <(gor user_data/hakon/liftover/clinvar_hg38.gorz
| calc good if(hg19_liftoverStatus = 'mapped' and ref=upper(refbases(chrom,pos,pos+len(ref)-1)),1,0)
)
| calc type if(len(ref)=len(alt),'snp','indel')
| calc samechrom if(chrom=hg19_chrom,'yes','no')
| group -gc type,samechrom,hg19_qstrand,hg19_liftoverStatus,good -count
| sort -c good,allcount:n
Query ran in 0.18 sec Query fetched 10 rows in 3.75 sec (total time 3.93 sec)
type | samechrom | hg19_qStrand | hg19_liftoverStatus | good | allCount | |
---|---|---|---|---|---|---|
0 | snp | yes | - | mapped | 0 | 7 |
1 | indel | yes | NaN | unmapped | 0 | 16 |
2 | indel | yes | + | mapped | 0 | 61 |
3 | snp | yes | NaN | unmapped | 0 | 112 |
4 | snp | yes | + | mapped | 0 | 2710 |
5 | snp | no | + | mapped | 1 | 33 |
6 | indel | yes | - | mapped | 1 | 108 |
7 | snp | yes | - | mapped | 1 | 3035 |
8 | indel | yes | + | mapped | 1 | 183829 |
9 | snp | yes | + | mapped | 1 | 2117543 |
Comparing with Ensembl Assembly Converter.¶
http://www.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core
We took the Clinvar file and ran it through hg37 to hg38 conversion at Ensembl. They only accept 50Mb so we limit ourselves to chr1.
%%gor
gor user_data/hakon/liftover/clinvar_hg38.gorz | where hg19_chrom = 'chr1' | group genome -count
Query ran in 0.17 sec Query fetched 1 rows in 1.72 sec (total time 1.89 sec)
chrom | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 202012 |
%%gor
gor user_data/hakon/liftover/ensembl_AC_clinvar_chr1_hg38.gorz | group chrom -count
Query ran in 0.07 sec Query fetched 1 rows in 0.15 sec (total time 0.23 sec)
CHROM | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 201837 |
We see that 175 of the variants (0.09%) did not map. We can summarize this comparison with GOR liftover:
%%gor
create #Ens# = gor user_data/hakon/liftover/ensembl_AC_clinvar_chr1_hg38.gorz | select chrom,pos,id,ref,alt |
| calc goodEns if(ref=upper(refbases(chrom,pos,pos+len(ref)-1)),1,0);
nor <(gor user_data/hakon/liftover/clinvar_hg38.gorz | where hg19_chrom = 'chr1'
| calc good if(hg19_liftoverStatus = 'mapped' and ref=upper(refbases(chrom,pos,pos+len(ref)-1)),1,0))
| map -c id -m '0' <(nor [#Ens#]| select id,goodEns | calc inEnsembl 1)
| group -gc hg19_liftoverStatus,good,goodEns,inEnsembl -count
| sort -c inEnsembl
Query ran in 0.53 sec Query fetched 6 rows in 2.20 sec (total time 2.73 sec)
hg19_liftoverStatus | good | goodEns | inEnsembl | allCount | |
---|---|---|---|---|---|
0 | mapped | 0 | 0 | 0 | 53 |
1 | mapped | 1 | 0 | 0 | 114 |
2 | unmapped | 0 | 0 | 0 | 8 |
3 | mapped | 0 | 1 | 1 | 18 |
4 | mapped | 1 | 1 | 1 | 201810 |
5 | unmapped | 0 | 1 | 1 | 9 |
We see that there is good concordance between the methods, i.e. 201810 are mapped in both and 8 variants are missed in both. There are however differences:
- 53 variants are badly mapped in GOR and not in Ensembl AC.
- 114 are well mapped in GOR but missed by AC.
- 18 are badly mapped in GOR but correctly in AC.
- 9 are not mapped in GOR but correctly mapped in AC.
Hence Ensemble AC takes 27 variants correctly that GOR misses but AC incorrectly misses 114 that GOR got right, as it seems! We also notice that all the variants reported in Ensembl have the correct reference wheras this is a post-processing filtering step after the GOR liftover.