CNV features for variant scoring¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "internal_hg19"
%env GOR_API_PROJECT={project}
import matplotlib.pyplot as plt
CNV feature annotations¶
Here we will demonstrate how we can annotate CNVs based on the classified CNVs in the Oscar database. We will do the following:
- Define a subset of the Oscar CNVs; GAIN and LOSS variants based on date.
- We will then annotate them based on overlap statistics with other classified CNVs. Overlaps will only be calculated between CNVs of the same type.
- Three types of overlaps are defined: a) left-overlap (lov) where the overlap is normalized by the size of the CNV at hand. b) right-overlap (rov) where the overlap is normalized by the size of the CNV being compared with (e.g. one of the classified variants), and c) Jaccard style overlap, i.e. where the overlap is normalized with the total span of both of the CNVs being compared.
- Per CNV, all the overlaps are counted, averaged, and the min and max of them calculated, to define overlap features.
- The overlap analysis is carried both out with all the CNVs in the Oscar database and only between the CNVs that are similar in size, i.e. where the rounded log of the size is close.
- Additionally, we annotate the CNVs with gene classification binary attributes, weighted by the exon count fraction and the coding region overlap fraction of the CNV at hand with the corresponding gene. These weighted gene binary attributes are then aggregated for all the genes that overlap with the CNV at hand.
- Also, we calculate overlap metrics with GnomAD CNVs. These GnomAD CNVs are split into two groups, based on their estimated population frequency.
qh = GQH.GOR_Query_Helper()
mydefs = ""
Selecting CNV variants for training¶
First, we define a subset of the Oscar CNVs that we want to use and calculate the log-size attribute.
mydefs = qh.add_many("""
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;
create #class_cnvs# = gor ref_gregor/classvars/oscar_cnv.gorz
| where left(classification_date,4)>'2020'
| where alt in ('GAIN','LOSS')
| select 1-3,alt,gdx_classification
| calc size ceil(log(#3-#2))
| rownum
| rename rownum CNVid;
""")
Average overlaps with classified variants in different groups¶
Next we setup queries that define the two overlap analysis, i.e. for all CNVs and CNVs that are similar in size. The "left-cnvs" is the relation that defines the CNVs at hand. When generating the training data, this is the same relation as the one that defines the classified CNVs. When generating the training data, the WHERE filter is to avoid a CNV to be compared with itself.
mydefs = qh.add_many("""
def ##left_cnvs## = [#class_cnvs#];
def ##same_filter## = where CNVid != CNVidx;
create #all_overlaps# = pgor ##left_cnvs##
| tryhide gdx_classification /* we want to use the classification from the right-source */
| join -segseg -xl alt -xr alt [#class_cnvs#]
| ##same_filter##
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS'
;
create #simsize_overlaps# = pgor ##left_cnvs##
| tryhide gdx_classification /* we want to use the classification from the right-source */
| join -segseg -xl alt -xr alt [#class_cnvs#]
| ##same_filter##
| where abs(size-sizex)<2
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS';
""")
GeneMB multi-gene attributes summary¶
Next we setup CNV overlap with exons and gene coding regions and then weight the gene classification attributes from GeneMb. First, the map table of Boolean/binary like attributes per HGNC-ID. Then overlap of the CNVs at hand with the gene exons, turning the CNVs into gene segments with the correpsonding overlap fractions. Next, each CNV at hand is joined with the gene based CNVs, by segment overlap and CNVid. We sum up the binary attribute from each gene the CNV overlaps with, weighting them by the correspoinding exon and coding region fraction.
mydefs = qh.add_many("""
create ##gmb_filter## = nor ref_gregor/genemb/gene_disease_inher.tsv
| calc gmb_is_dom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR'),1,0)
| calc gmb_is_ar_validated if(disease_status = 'validated' and inheritance in ('AR'),1,0)
| calc gmb_is_xr_validated if(disease_status = 'validated' and inheritance in ('XR','XL'),1,0)
| calc gmb_is_any_ar_validated if(disease_status = 'validated' and inheritance in ('AR','DI'),1,0)
| calc gmb_is_alldom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XD','XL','MI'),1,0)
| calc gmb_is_lof if(mechanism = 'amorph',1,0)
| group -gc hgnc_id -max -ic gmb_is_dom_validated-
| rename max_(.*) #{1}
;
create #oscar_cnv_gene_summary# = pgor ##left_cnvs##
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,CNVid,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| select chrom,gene_start,gene_end,CNVid,hgnc_id,gene_symbol,alt,ex_ov_frac,gene_ov_frac
| sort chrom
;
create #cnv_gene_classification# = gor ##left_cnvs##
| join -segseg -l -e 0 -xl cnvid -xr cnvid <(gor [#oscar_cnv_gene_summary#] | map -c hgnc_id -m 0 [##gmb_filter##]
| replace gmb_* (ex_ov_frac+gene_ov_frac)*0.5*float(#rc)
)
| group 1 -gc #3,cnvid -sum -fc gmb_* -count
| rename allcount Genes;
""")
GnomAD¶
Then we setup a query to calculate the overlaps with common and rare GnomAD CNVs. We define CNVs that are in frequency greater than 1/1000 to be common but otherwise rare. We use all the same overlap types.
mydefs = qh.add_many("""
create #gnom# = gor ref_gdx/exvars/gnomad/CNVs_v4_11.gorz | select 1-alt,sc,sn
| replace alt decode(alt,'<DEL>,LOSS,<DUP>,GAIN,O')
| calc AF 'AF'+if(sc/sn<0.001,'rare','comm')
| rename cnv_end cnv_stop;
create #gnom_overlaps# = gor ##left_cnvs## | join -segseg -xl alt -xr alt [#gnom#]
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,cnvid,AF -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,CNVid -e 0 AF -v 'AFrare','AFcomm';
""")
Collecting the attributes¶
The final step is then to read all the CNVs at hand and decorate them with the different types of CNV attributes we defined above. Now the joins don't have to be -segseg but only -snpsnp because we are looking for exact position match. However, in case there are more than one CNV with the same starting position, we enforce match on the same CNVid via the -xl and -xr equi-join options.
%%gor
$mydefs
gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid
| top 10
/*
| write user_data/vinnie/CNV/dragen_sample_cnv_features.gorz
*/
Query ran in 3.13 sec. Query fetched 10 rows in 24.35 sec (total time 27.48 sec)
chrom | cnv_start | cnv_stop | alt | gdx_classification | size | CNVid | ge_Genes | ge_sum_gmb_is_dom_validated | ge_sum_gmb_is_ar_validated | ... | b_VUS_CNVovCount | b_VUS_min_lov | b_VUS_max_lov | b_VUS_avg_lov | b_VUS_min_rov | b_VUS_max_rov | b_VUS_avg_rov | b_VUS_min_jov | b_VUS_max_jov | b_VUS_avg_jov | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 125000000 | LOSS | UV | 9 | 1 | 1039 | 71.00000 | 131.00000 | ... | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 |
1 | chr1 | 0 | 2300000 | GAIN | UV | 7 | 2 | 56 | 6.00000 | 6.00000 | ... | 68 | 0.026 | 0.348 | 0.143 | 0.023 | 1.0 | 0.942 | 0.013 | 0.348 | 0.141 |
2 | chr1 | 0 | 28000000 | LOSS | UV | 8 | 3 | 373 | 33.00000 | 47.00000 | ... | 4 | 0.038 | 0.065 | 0.052 | 1.000 | 1.0 | 1.000 | 0.038 | 0.065 | 0.052 |
3 | chr1 | 69089 | 8378247 | LOSS | PATH | 7 | 4 | 106 | 9.00000 | 14.00000 | ... | 14 | 0.013 | 0.061 | 0.032 | 1.000 | 1.0 | 1.000 | 0.013 | 0.061 | 0.032 |
4 | chr1 | 564423 | 6031125 | GAIN | PATH | 7 | 5 | 78 | 7.00000 | 9.93640 | ... | 80 | 0.020 | 0.473 | 0.083 | 0.982 | 1.0 | 1.000 | 0.020 | 0.473 | 0.083 |
5 | chr1 | 746607 | 1252867 | LOSS | VUS | 6 | 6 | 20 | 0.00000 | 3.64505 | ... | 10 | 0.033 | 0.987 | 0.445 | 0.077 | 1.0 | 0.837 | 0.024 | 0.969 | 0.441 |
6 | chr1 | 753342 | 11417400 | LOSS | PATH | 8 | 7 | 137 | 15.00000 | 18.00000 | ... | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 |
7 | chr1 | 753342 | 1261909 | LOSS | VUS | 6 | 8 | 20 | 0.00000 | 4.00000 | ... | 10 | 0.051 | 0.982 | 0.446 | 0.119 | 1.0 | 0.849 | 0.037 | 0.969 | 0.443 |
8 | chr1 | 753342 | 2994476 | LOSS | PATH | 7 | 9 | 64 | 6.02985 | 7.00000 | ... | 13 | 0.052 | 0.227 | 0.124 | 0.987 | 1.0 | 0.999 | 0.052 | 0.227 | 0.124 |
9 | chr1 | 766235 | 1972528 | GAIN | UV | 7 | 10 | 48 | 5.00000 | 6.00000 | ... | 65 | 0.032 | 0.629 | 0.236 | 0.057 | 1.0 | 0.852 | 0.024 | 0.608 | 0.230 |
10 rows × 154 columns
Generating annotation features for CNV scoring¶
The code is analogous to what we defined above. Indeed, we only need to change our definition of the left CNV source. Here we will score Dragen CNVs that have been onboarded into clin-hg19, i.e. from 2274 samples, each with approximitely 700 CNVs or so.
mydefs = qh.add_many("""
create #Dragen_sample_cnvs# = gor user_data/hakon/CNVs/dragen_cnvs.gorz
| replace alt decode(alt,'<DUP>,GAIN,<DEL>,LOSS,O')
| where alt in ('GAIN','LOSS')
| select 1-3,alt,PN
| calc size ceil(log(#3-#2))
| rownum
| rename rownum CNVid | rename #2 cnv_start | rename #3 cnv_stop;
""")
%%gor
$mydefs
nor <(gor [#sample_cnvs#] | group genome -gc PN -count) | group -avg -std -ic allcount -count
Query ran in 0.14 sec Query fetched 1 rows in 1.29 sec (total time 1.43 sec)
allCount | avg_allCount | std_allCount | |
---|---|---|---|
0 | 2274 | 704.32102 | 352.488011 |
We now change our definition of the left-source:
mydefs = qh.add_many("""
def ##left_cnvs## = [#Dragen_sample_cnvs#];
def ##same_filter## = where 2=2;
""")
%%gor
$mydefs
nor <(gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid
| top 1 ) | unpivot #1-
/* | write user_data/vinnie/CNV/dragen_sample_cnv_features.gorz */
Query ran in 1.07 sec Query fetched 154 rows in 30.23 sec (total time 31.29 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | cnv_start | 724256 |
2 | cnv_stop | 758624 |
3 | ALT | GAIN |
4 | PN | GDX_2807791 |
... | ... | ... |
149 | b_VUS_max_rov | 0 |
150 | b_VUS_avg_rov | 0 |
151 | b_VUS_min_jov | 0 |
152 | b_VUS_max_jov | 0 |
153 | b_VUS_avg_jov | 0 |
154 rows × 2 columns
We see that we have about 150 feature columns for annotation. The entire query is shown below:
print(mydefs)
print("""gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid""")
def ##ref## = ref_gregor; def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz; def ##genes## = ##ref##/genemb/gmb_genes.gorz; create #class_cnvs# = gor ref_gregor/classvars/oscar_cnv.gorz | where left(classification_date,4)>'2020' | where alt in ('GAIN','LOSS') | select 1-3,alt,gdx_classification | calc size ceil(log(#3-#2)) | rownum | rename rownum CNVid; create #all_overlaps# = pgor ##left_cnvs## | tryhide gdx_classification /* we want to use the classification from the right-source */ | join -segseg -xl alt -xr alt [#class_cnvs#] | ##same_filter## | calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start) | calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx) | calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx)) | group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov | replace allcount[+1]- float(form(float(#rc),4,3)) | rename allcount CNVovCount | pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS' ; create #simsize_overlaps# = pgor ##left_cnvs## | tryhide gdx_classification /* we want to use the classification from the right-source */ | join -segseg -xl alt -xr alt [#class_cnvs#] | ##same_filter## | where abs(size-sizex)<2 | calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start) | calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx) | calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx)) | group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov | replace allcount[+1]- float(form(float(#rc),4,3)) | rename allcount CNVovCount | pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS'; create ##gmb_filter## = nor ref_gregor/genemb/gene_disease_inher.tsv | calc gmb_is_dom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR'),1,0) | calc gmb_is_ar_validated if(disease_status = 'validated' and inheritance in ('AR'),1,0) | calc gmb_is_xr_validated if(disease_status = 'validated' and inheritance in ('XR','XL'),1,0) | calc gmb_is_any_ar_validated if(disease_status = 'validated' and inheritance in ('AR','DI'),1,0) | calc gmb_is_alldom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XD','XL','MI'),1,0) | calc gmb_is_lof if(mechanism = 'amorph',1,0) | group -gc hgnc_id -max -ic gmb_is_dom_validated- | rename max_(.*) #{1} ; create #oscar_cnv_gene_summary# = pgor ##left_cnvs## | join -segseg -maxseg 100000 ##cnv_analysis_exons## | calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart) | group 1 -gc #3,alt,CNVid,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap | rename sum_ex_ovlap cnv_overlap | calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4) | calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4) | map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol) | select chrom,gene_start,gene_end,CNVid,hgnc_id,gene_symbol,alt,ex_ov_frac,gene_ov_frac | sort chrom ; create #cnv_gene_classification# = gor ##left_cnvs## | join -segseg -l -e 0 -xl cnvid -xr cnvid <(gor [#oscar_cnv_gene_summary#] | map -c hgnc_id -m 0 [##gmb_filter##] | replace gmb_* (ex_ov_frac+gene_ov_frac)*0.5*float(#rc) ) | group 1 -gc #3,cnvid -sum -fc gmb_* -count | rename allcount Genes; create #gnom# = gor ref_gdx/exvars/gnomad/CNVs_v4_11.gorz | select 1-alt,sc,sn | replace alt decode(alt,'<DEL>,LOSS,<DUP>,GAIN,O') | calc AF 'AF'+if(sc/sn<0.001,'rare','comm') | rename cnv_end cnv_stop; create #gnom_overlaps# = gor ##left_cnvs## | join -segseg -xl alt -xr alt [#gnom#] | calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start) | calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx) | calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx)) | group 1 -gc cnv_stop,alt,cnvid,AF -avg -min -max -count -fc lov,rov,jov | replace allcount[+1]- float(form(float(#rc),4,3)) | rename allcount CNVovCount | pivot -gc cnv_stop,alt,CNVid -e 0 AF -v 'AFrare','AFcomm'; create #Dragen_sample_cnvs# = gor user_data/hakon/CNVs/dragen_cnvs.gorz | replace alt decode(alt,'<DUP>,GAIN,<DEL>,LOSS,O') | where alt in ('GAIN','LOSS') | select 1-3,alt,PN | calc size ceil(log(#3-#2)) | rownum | rename rownum CNVid | rename #2 cnv_start | rename #3 cnv_stop; def ##left_cnvs## = [#Dragen_sample_cnvs#]; def ##same_filter## = where 2=2; gor ##left_cnvs## | join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#] | hide ge_cnv_stop-ge_cnvid | join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#] | hide gnom_cnv_stop-gnom_cnvid | join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#] | hide a_cnv_stop-a_cnvid | join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#] | hide b_cnv_stop-b_cnvid