CNV features for variant scoring¶

CNV_features

In [3]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "internal_hg19"
%env GOR_API_PROJECT={project}

import matplotlib.pyplot as plt

CNV feature annotations¶

Here we will demonstrate how we can annotate CNVs based on the classified CNVs in the Oscar database. We will do the following:

Define a subset of the Oscar CNVs; GAIN and LOSS variants based on date.
We will then annotate them based on overlap statistics with other classified CNVs. Overlaps will only be calculated between CNVs of the same type.
Three types of overlaps are defined: a) left-overlap (lov) where the overlap is normalized by the size of the CNV at hand. b) right-overlap (rov) where the overlap is normalized by the size of the CNV being compared with (e.g. one of the classified variants), and c) Jaccard style overlap, i.e. where the overlap is normalized with the total span of both of the CNVs being compared.
Per CNV, all the overlaps are counted, averaged, and the min and max of them calculated, to define overlap features.
The overlap analysis is carried both out with all the CNVs in the Oscar database and only between the CNVs that are similar in size, i.e. where the rounded log of the size is close.
Additionally, we annotate the CNVs with gene classification binary attributes, weighted by the exon count fraction and the coding region overlap fraction of the CNV at hand with the corresponding gene. These weighted gene binary attributes are then aggregated for all the genes that overlap with the CNV at hand.
Also, we calculate overlap metrics with GnomAD CNVs. These GnomAD CNVs are split into two groups, based on their estimated population frequency.

In [4]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

Selecting CNV variants for training¶

First, we define a subset of the Oscar CNVs that we want to use and calculate the log-size attribute.

In [5]:

mydefs = qh.add_many("""
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;

create #class_cnvs# = gor ref_gregor/classvars/oscar_cnv.gorz
| where left(classification_date,4)>'2020'
| where alt in ('GAIN','LOSS')
| select 1-3,alt,gdx_classification
| calc size ceil(log(#3-#2))
| rownum
| rename rownum CNVid;
""")

Average overlaps with classified variants in different groups¶

Next we setup queries that define the two overlap analysis, i.e. for all CNVs and CNVs that are similar in size. The "left-cnvs" is the relation that defines the CNVs at hand. When generating the training data, this is the same relation as the one that defines the classified CNVs. When generating the training data, the WHERE filter is to avoid a CNV to be compared with itself.

In [7]:

mydefs = qh.add_many("""
def ##left_cnvs## = [#class_cnvs#];
def ##same_filter## = where CNVid != CNVidx;

create #all_overlaps# = pgor ##left_cnvs##
| tryhide gdx_classification /* we want to use the classification from the right-source */
| join -segseg -xl alt -xr alt [#class_cnvs#]
| ##same_filter##
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS' 
;

create #simsize_overlaps# = pgor ##left_cnvs##
| tryhide gdx_classification /* we want to use the classification from the right-source */
| join -segseg -xl alt -xr alt [#class_cnvs#]
| ##same_filter##
| where abs(size-sizex)<2
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS';
""")

GeneMB multi-gene attributes summary¶

Next we setup CNV overlap with exons and gene coding regions and then weight the gene classification attributes from GeneMb. First, the map table of Boolean/binary like attributes per HGNC-ID. Then overlap of the CNVs at hand with the gene exons, turning the CNVs into gene segments with the correpsonding overlap fractions. Next, each CNV at hand is joined with the gene based CNVs, by segment overlap and CNVid. We sum up the binary attribute from each gene the CNV overlaps with, weighting them by the correspoinding exon and coding region fraction.

In [8]:

mydefs = qh.add_many("""
create ##gmb_filter## =  nor ref_gregor/genemb/gene_disease_inher.tsv
| calc gmb_is_dom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR'),1,0)
| calc gmb_is_ar_validated if(disease_status = 'validated' and inheritance in ('AR'),1,0)
| calc gmb_is_xr_validated if(disease_status = 'validated' and inheritance in ('XR','XL'),1,0)
| calc gmb_is_any_ar_validated if(disease_status = 'validated' and inheritance in ('AR','DI'),1,0)
| calc gmb_is_alldom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XD','XL','MI'),1,0)
| calc gmb_is_lof if(mechanism = 'amorph',1,0)
| group -gc hgnc_id -max -ic gmb_is_dom_validated-
| rename max_(.*) #{1}
;

create #oscar_cnv_gene_summary# = pgor ##left_cnvs##
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,CNVid,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| select chrom,gene_start,gene_end,CNVid,hgnc_id,gene_symbol,alt,ex_ov_frac,gene_ov_frac
| sort chrom
;

create #cnv_gene_classification# = gor ##left_cnvs##
| join -segseg -l -e 0 -xl cnvid -xr cnvid <(gor [#oscar_cnv_gene_summary#] | map -c hgnc_id -m 0 [##gmb_filter##]
   | replace gmb_* (ex_ov_frac+gene_ov_frac)*0.5*float(#rc)
)
| group 1 -gc #3,cnvid -sum -fc gmb_* -count
| rename allcount Genes;
""")

GnomAD¶

Then we setup a query to calculate the overlaps with common and rare GnomAD CNVs. We define CNVs that are in frequency greater than 1/1000 to be common but otherwise rare. We use all the same overlap types.

In [9]:

mydefs = qh.add_many("""
create #gnom# = gor ref_gdx/exvars/gnomad/CNVs_v4_11.gorz | select 1-alt,sc,sn
| replace alt decode(alt,'<DEL>,LOSS,<DUP>,GAIN,O')
| calc AF 'AF'+if(sc/sn<0.001,'rare','comm')
| rename cnv_end cnv_stop;

create #gnom_overlaps# = gor ##left_cnvs## | join -segseg -xl alt -xr alt [#gnom#]
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,cnvid,AF -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,CNVid -e 0 AF -v 'AFrare','AFcomm';
""")

Collecting the attributes¶

The final step is then to read all the CNVs at hand and decorate them with the different types of CNV attributes we defined above. Now the joins don't have to be -segseg but only -snpsnp because we are looking for exact position match. However, in case there are more than one CNV with the same starting position, we enforce match on the same CNVid via the -xl and -xr equi-join options.

In [10]:

%%gor
$mydefs
gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid
| top 10
/*
| write user_data/vinnie/CNV/dragen_sample_cnv_features.gorz
*/

Query ran in 3.13 sec.
Query fetched 10 rows in 24.35 sec (total time 27.48 sec)

Out[10]:

	chrom	cnv_start	cnv_stop	alt	gdx_classification	size	CNVid	ge_Genes	ge_sum_gmb_is_dom_validated	ge_sum_gmb_is_ar_validated	...	b_VUS_CNVovCount	b_VUS_min_lov	b_VUS_max_lov	b_VUS_avg_lov	b_VUS_min_rov	b_VUS_max_rov	b_VUS_avg_rov	b_VUS_min_jov	b_VUS_max_jov	b_VUS_avg_jov
0	chr1	0	125000000	LOSS	UV	9	1	1039	71.00000	131.00000	...	0	0.000	0.000	0.000	0.000	0.0	0.000	0.000	0.000	0.000
1	chr1	0	2300000	GAIN	UV	7	2	56	6.00000	6.00000	...	68	0.026	0.348	0.143	0.023	1.0	0.942	0.013	0.348	0.141
2	chr1	0	28000000	LOSS	UV	8	3	373	33.00000	47.00000	...	4	0.038	0.065	0.052	1.000	1.0	1.000	0.038	0.065	0.052
3	chr1	69089	8378247	LOSS	PATH	7	4	106	9.00000	14.00000	...	14	0.013	0.061	0.032	1.000	1.0	1.000	0.013	0.061	0.032
4	chr1	564423	6031125	GAIN	PATH	7	5	78	7.00000	9.93640	...	80	0.020	0.473	0.083	0.982	1.0	1.000	0.020	0.473	0.083
5	chr1	746607	1252867	LOSS	VUS	6	6	20	0.00000	3.64505	...	10	0.033	0.987	0.445	0.077	1.0	0.837	0.024	0.969	0.441
6	chr1	753342	11417400	LOSS	PATH	8	7	137	15.00000	18.00000	...	0	0.000	0.000	0.000	0.000	0.0	0.000	0.000	0.000	0.000
7	chr1	753342	1261909	LOSS	VUS	6	8	20	0.00000	4.00000	...	10	0.051	0.982	0.446	0.119	1.0	0.849	0.037	0.969	0.443
8	chr1	753342	2994476	LOSS	PATH	7	9	64	6.02985	7.00000	...	13	0.052	0.227	0.124	0.987	1.0	0.999	0.052	0.227	0.124
9	chr1	766235	1972528	GAIN	UV	7	10	48	5.00000	6.00000	...	65	0.032	0.629	0.236	0.057	1.0	0.852	0.024	0.608	0.230

10 rows × 154 columns

Generating annotation features for CNV scoring¶

The code is analogous to what we defined above. Indeed, we only need to change our definition of the left CNV source. Here we will score Dragen CNVs that have been onboarded into clin-hg19, i.e. from 2274 samples, each with approximitely 700 CNVs or so.

In [11]:

mydefs = qh.add_many("""
create #Dragen_sample_cnvs# = gor user_data/hakon/CNVs/dragen_cnvs.gorz
| replace alt decode(alt,'<DUP>,GAIN,<DEL>,LOSS,O')
| where alt in ('GAIN','LOSS')
| select 1-3,alt,PN
| calc size ceil(log(#3-#2))
| rownum
| rename rownum CNVid | rename #2 cnv_start | rename #3 cnv_stop;
""")

In [20]:

%%gor
$mydefs
nor <(gor [#sample_cnvs#] | group genome -gc PN -count) | group -avg -std -ic allcount -count

Query ran in 0.14 sec
Query fetched 1 rows in 1.29 sec (total time 1.43 sec)

Out[20]:

	allCount	avg_allCount	std_allCount
0	2274	704.32102	352.488011

We now change our definition of the left-source:

In [12]:

mydefs = qh.add_many("""
def ##left_cnvs## = [#Dragen_sample_cnvs#];
def ##same_filter## = where 2=2;
""")

In [13]:

%%gor
$mydefs
nor <(gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid
| top 1 ) | unpivot #1-
/* | write user_data/vinnie/CNV/dragen_sample_cnv_features.gorz */

Query ran in 1.07 sec
Query fetched 154 rows in 30.23 sec (total time 31.29 sec)

Out[13]:

	Col_Name	Col_Value
0	CHROM	chr1
1	cnv_start	724256
2	cnv_stop	758624
3	ALT	GAIN
4	PN	GDX_2807791
...	...	...
149	b_VUS_max_rov	0
150	b_VUS_avg_rov	0
151	b_VUS_min_jov	0
152	b_VUS_max_jov	0
153	b_VUS_avg_jov	0

154 rows × 2 columns

We see that we have about 150 feature columns for annotation. The entire query is shown below:

In [14]:

print(mydefs)
print("""gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid""")

def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;
create #class_cnvs# = gor ref_gregor/classvars/oscar_cnv.gorz
| where left(classification_date,4)>'2020'
| where alt in ('GAIN','LOSS')
| select 1-3,alt,gdx_classification
| calc size ceil(log(#3-#2))
| rownum
| rename rownum CNVid;
create #all_overlaps# = pgor ##left_cnvs##
| tryhide gdx_classification /* we want to use the classification from the right-source */
| join -segseg -xl alt -xr alt [#class_cnvs#]
| ##same_filter##
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS' 
;
create #simsize_overlaps# = pgor ##left_cnvs##
| tryhide gdx_classification /* we want to use the classification from the right-source */
| join -segseg -xl alt -xr alt [#class_cnvs#]
| ##same_filter##
| where abs(size-sizex)<2
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,size,gdx_classification,CNVid -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,size,CNVid -e 0 gdx_classification -v 'BEN','LBEN','LPATH','PATH','UV','VUS';
create ##gmb_filter## =  nor ref_gregor/genemb/gene_disease_inher.tsv
| calc gmb_is_dom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR'),1,0)
| calc gmb_is_ar_validated if(disease_status = 'validated' and inheritance in ('AR'),1,0)
| calc gmb_is_xr_validated if(disease_status = 'validated' and inheritance in ('XR','XL'),1,0)
| calc gmb_is_any_ar_validated if(disease_status = 'validated' and inheritance in ('AR','DI'),1,0)
| calc gmb_is_alldom_validated if(disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XD','XL','MI'),1,0)
| calc gmb_is_lof if(mechanism = 'amorph',1,0)
| group -gc hgnc_id -max -ic gmb_is_dom_validated-
| rename max_(.*) #{1}
;
create #oscar_cnv_gene_summary# = pgor ##left_cnvs##
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,CNVid,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| select chrom,gene_start,gene_end,CNVid,hgnc_id,gene_symbol,alt,ex_ov_frac,gene_ov_frac
| sort chrom
;
create #cnv_gene_classification# = gor ##left_cnvs##
| join -segseg -l -e 0 -xl cnvid -xr cnvid <(gor [#oscar_cnv_gene_summary#] | map -c hgnc_id -m 0 [##gmb_filter##]
   | replace gmb_* (ex_ov_frac+gene_ov_frac)*0.5*float(#rc)
)
| group 1 -gc #3,cnvid -sum -fc gmb_* -count
| rename allcount Genes;
create #gnom# = gor ref_gdx/exvars/gnomad/CNVs_v4_11.gorz | select 1-alt,sc,sn
| replace alt decode(alt,'<DEL>,LOSS,<DUP>,GAIN,O')
| calc AF 'AF'+if(sc/sn<0.001,'rare','comm')
| rename cnv_end cnv_stop;
create #gnom_overlaps# = gor ##left_cnvs## | join -segseg -xl alt -xr alt [#gnom#]
| calc lov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stop-cnv_start)
| calc rov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(cnv_stopx-cnv_startx)
| calc jov (min(cnv_stop,cnv_stopx)-max(cnv_start,cnv_startx))/(max(cnv_stop,cnv_stopx)-min(cnv_start,cnv_startx))
| group 1 -gc cnv_stop,alt,cnvid,AF -avg -min -max -count -fc lov,rov,jov
| replace allcount[+1]- float(form(float(#rc),4,3))
| rename allcount CNVovCount
| pivot -gc cnv_stop,alt,CNVid -e 0 AF -v 'AFrare','AFcomm';
create #Dragen_sample_cnvs# = gor user_data/hakon/CNVs/dragen_cnvs.gorz
| replace alt decode(alt,'<DUP>,GAIN,<DEL>,LOSS,O')
| where alt in ('GAIN','LOSS')
| select 1-3,alt,PN
| calc size ceil(log(#3-#2))
| rownum
| rename rownum CNVid | rename #2 cnv_start | rename #3 cnv_stop;
def ##left_cnvs## = [#Dragen_sample_cnvs#];
def ##same_filter## = where 2=2;

gor ##left_cnvs##
| join -snpsnp -rprefix ge -l -e 0 -xl cnvid -xr cnvid [#cnv_gene_classification#]
| hide ge_cnv_stop-ge_cnvid
| join -snpsnp -rprefix gnom -l -e 0 -xl cnvid -xr cnvid [#gnom_overlaps#]
| hide gnom_cnv_stop-gnom_cnvid
| join -snpsnp -rprefix a -l -e 0 -xl cnvid -xr cnvid [#all_overlaps#]
| hide a_cnv_stop-a_cnvid
| join -snpsnp -rprefix b -l -e 0 -xl cnvid -xr cnvid [#simsize_overlaps#]
| hide b_cnv_stop-b_cnvid

Note: Two of the def commands appear late in the query because we redefined them. Order of def and create statements in GOR-pipe queries does not matter however, although it makes it easier to follow the code when we refer to previously written code!

In [ ]: