CNVs from multiple XA-samples analysed¶

CNVs

In [3]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "internal_hg19"
%env GOR_API_PROJECT={project}

import matplotlib.pyplot as plt

XA and CNVs¶

Here we will take a look at the CNVs that are in the GORdb version of XA.

In [4]:

%%gor
create temp = nor -asdict XA/vars/cnv2.gord | group -count;
nor [temp]

Query ran in 5.93 sec.
Query fetched 1 rows in 0.00 sec (total time 5.94 sec)

Out[4]:

	allCount
0	628077

We see that we have 628 samples. Now a quick inspection of the columns we have.

In [7]:

%%gor
gor XA/vars/cnv2.gord -ff <(nor -asdict XA/vars/cnv2.gord | skip 10000 | top 2 | select #2) | top 5

Query ran in 0.21 sec
Query fetched 5 rows in 0.00 sec (total time 0.22 sec)

Out[7]:

	chrom	bpStart	bpstop	cnv_type	exons	alt	callcopies	genotype	gt_quality	var_allele_depth	sample_coverage	samtools	info	gatk_hap	sample_id	result_id	variant_id	lis_variant_info_id	pn
0	chr1	16376517	16386258	DEL	multi	-0.50	1	het	99	NaN	NaN	NaN	source=scramble	NaN	407016	3525594597	16342710	155224975	407016
1	chr1	16376517	16386258	DEL	multi	-0.75	0	wt	20	0.0	NaN	NaN	source=scramble	NaN	407074	3540269334	16348765	156159726	407074
2	chr1	26608795	26608872	DEL	ex16	-0.50	0	wt	20	0.0	NaN	NaN	source=scramble	NaN	407074	3540269337	16344714	156132607	407074
3	chr1	26608823	26608901	DEL	ex16	-0.50	1	het	99	NaN	NaN	NaN	source=scramble	NaN	407016	3525594598	16342641	155188171	407016
4	chr1	26671521	26671674	DEL	ex2	-0.50	0	wt	20	0.0	NaN	NaN	source=scramble	NaN	407074	3540269340	16347825	156441502	407074

We see that cnv_type is DEL or DUP and Info stores the algorithm (source). Now lets see how the CNVs are distributed among these algorithms. We will form an aggregate query that can be inspected in different ways:

In [8]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

In [9]:

mydefs = qh.add_many("""
def #xa_cnvs# = XA/vars/cnv2.gord;
create #p_cnvs_types# = partgor -dict #xa_cnvs# -partsize 10000 <(gor #xa_cnvs# -f #{tags} 
| calc size ceil(log(#3-#2))
| group chrom -gc cnv_type,size,info,pn -count
);
create #cnvs_types# = gor [#p_cnvs_types#] | group genome -gc cnv_type,size,info,pn -sum -ic allcount;
""")

CNV algorithms¶

In [11]:

%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc info -sum -ic sum_allcount;
nor [temp] | sort -c #2:nr | rename sum_sum_allcount CNVcount

Query ran in 6.13 sec.
Query fetched 65 rows in 0.00 sec (total time 6.14 sec)

Out[11]:

	info	CNVcount
0	cnv_reporter	50953508
1	NaN	16529000
2	source=scramble	16068727
3	source=cnv_reporter;	13940610
4	source=scramble;	8128519
...	...	...
60	source=dragen_ploidy:copynumber=1.56	1
61	source=dragen_ploidy:copynumber=1.61	1
62	source=dragen_ploidy:copynumber=1.64	1
63	source=dragen_ploidy:copynumber=1.99	1
64	source=dragen_ploidy:copynumber=2.14	1

65 rows × 2 columns

We see that most of the CNVs are from cnv_reporter. We can also see how the samples are distributed:

In [12]:

%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc info,pn | group -gc info -count;
nor [temp] | sort -c #2:nr | rename allcount Samplecount

Query ran in 11.15 sec
Query fetched 65 rows in 7.12 sec (total time 18.27 sec)

Out[12]:

	info	Samplecount
0	cnv_reporter	321183
1	source=scramble	310582
2	source=scramble;	152877
3	NaN	149428
4	source=cnv_reporter;	125093
...	...	...
60	source=dragen_ploidy:copynumber=1.56	1
61	source=dragen_ploidy:copynumber=1.61	1
62	source=dragen_ploidy:copynumber=1.64	1
63	source=dragen_ploidy:copynumber=1.99	1
64	source=dragen_ploidy:copynumber=2.14	1

65 rows × 2 columns

However, most of the samples have CNVs from more than one caller:

In [13]:

%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc pn -set -sc info;
nor [temp] | group -gc set_info -count | sort -c allcount:nr | top 5

Query ran in 6.41 sec.
Query fetched 5 rows in 3.87 sec (total time 10.28 sec)

Out[13]:

	set_info	allCount
0	cnv_reporter,source=scramble	272924
1	NaN	137193
2	source=cnv_reporter;,source=scramble;	113985
3	cnv_reporter,source=scramble;	29202
4	source=scramble	17850

and to see the rows from 6 and below:

In [14]:

%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc pn -set -sc info;
nor [temp] | group -gc set_info -count | sort -c allcount:nr | skip 5

Query ran in 6.00 sec.
Query fetched 153 rows in 1.37 sec (total time 7.37 sec)

Out[14]:

	set_info	allCount
0	dragen_cnv	17797
1	,cnv_reporter,source=scramble	9902
2	source=cnv_reporter;	8145
3	source=scramble;	6071
4	cnv_reporter,source=karyotype,source=scramble	4203
...	...	...
148	source=cnv_reporter,source=scramble,source=sma	1
149	source=dragen_ploidy:copynumber=0.0,source=dra...	1
150	source=dragen_ploidy:copynumber=0.0,source=dra...	1
151	source=dragen_ploidy:copynumber=0.49,source=dr...	1
152	source=dragen_ploidy:copynumber=1.46	1

153 rows × 2 columns

CNV size distribution¶

First for all callers combined. Note from the query above that size == ceil(log(cnv_span))

In [15]:

%%gor data <<
$mydefs
create temp = nor [#cnvs_types#] | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 6.48 sec.
Query fetched 10 rows in 0.00 sec (total time 6.48 sec)

In [16]:

size = data['size']
count = data['sum_sum_allCount']
plt.bar(size,count)
plt.title('All CNVs')

Out[16]:

Text(0.5, 1.0, 'All CNVs')

No description has been provided for this image

and now for DEL and DUP separately:

In [17]:

%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 7.66 sec.
Query fetched 10 rows in 5.24 sec (total time 12.90 sec)

In [18]:

%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 7.07 sec.
Query fetched 9 rows in 4.53 sec (total time 11.60 sec)

In [21]:

delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('All DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('All DUPs')

Out[21]:

Text(0.5, 1.0, 'All DUPs')

We notice that deletions are typically much smaller, i.e. most between 100-10k bp while quite a few are bigger than 10Mb.

Now lets look at Dragen only:

In [22]:

%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' and info = 'dragen_cnv' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 8.19 sec.
Query fetched 7 rows in 4.35 sec (total time 12.54 sec)

In [23]:

%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' and info = 'dragen_cnv' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 6.57 sec.
Query fetched 6 rows in 3.53 sec (total time 10.11 sec)

In [24]:

delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('Dragen DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('Dragen DUPs')

Out[24]:

Text(0.5, 1.0, 'Dragen DUPs')

We notice that there is more difference in the DEL size distribution. Now for cnv_reporter:

In [25]:

%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' and info = 'cnv_reporter' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 7.66 sec.
Query fetched 9 rows in 5.29 sec (total time 12.94 sec)

In [26]:

%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' and info = 'cnv_reporter' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 6.57 sec.
Query fetched 7 rows in 3.26 sec (total time 9.83 sec)

In [27]:

delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('cnv_reporter DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('cnv_reporter DUPs')

Out[27]:

Text(0.5, 1.0, 'cnv_reporter DUPs')

and finally Scramble:

In [28]:

%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' and info = 'source=scramble' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 7.16 sec.
Query fetched 8 rows in 5.29 sec (total time 12.44 sec)

In [29]:

%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' and info = 'source=scramble' | group -gc size -sum -ic sum_allcount;
nor [temp]

Query ran in 6.57 sec.
Query fetched 0 rows in 4.22 sec (total time 10.79 sec)

In [30]:

delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('Scramble DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('Scramble DUPs')

Out[30]:

Text(0.5, 1.0, 'Scramble DUPs')

We notice that Scramble only has deletions and also that there are quite many really large deletions called with it.

Count of CNVs in samples¶

Now we calculate some statistics for the number of CNVs in the 600k samples. First for duplications:

DUP counts¶

In [43]:

%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount 
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
    
nor [temp] | where cnv_type = 'DUP'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 20

Query ran in 5.87 sec.
Query fetched 20 rows in 0.01 sec (total time 5.88 sec)

Out[43]:

	info	cnv_type	Samples	min_cnvCount	med_cnvCount	max_cnvCount	avg_cnvCount	std_cnvCount
0	gatk_info	DUP	13	102	142	187	143	29
1	cnv_gatk	DUP	120	57	76	120	77	11
2	source=cnv_reporter	DUP	678	1	8	1574	44	151
3	dragen_cnv	DUP	18673	13	32	619	35	17
4	cnvReporter	DUP	3	19	19	19	19	0
5	NaN	DUP	126736	1	3	3103	16	106
6	source=cnv_reporter;	DUP	121680	1	6	575	12	24
7	cnv_reporter	DUP	314526	1	6	869	11	19
8	source=karyotype;	DUP	707	1	1	3	1	0
9	source=karyotype	DUP	5464	1	1	3	1	0
10	source=user:pipeline;	DUP	117	1	1	2	1	0
11	source=dragen_ploidy:copynumber=0.51	DUP	11	1	1	2	1	0
12	source=dragen_ploidy:copynumber=0.5	DUP	24	1	1	2	1	0
13	source=dragen_ploidy:copynumber=1.0	DUP	38	1	1	2	1	0
14	source=dragen_ploidy:copynumber=0.0	DUP	21	1	1	1	1	0
15	source=dragen_ploidy:copynumber=0.01	DUP	43	1	1	1	1	0
16	source=dragen_ploidy:copynumber=0.02	DUP	7	1	1	1	1	0
17	source=dragen_ploidy:copynumber=0.03	DUP	3	1	1	1	1	0
18	source=dragen_ploidy:copynumber=0.48	DUP	1	1	1	1	1	0
19	source=dragen_ploidy:copynumber=0.49	DUP	5	1	1	1	1	0

DEL counts¶

In [44]:

%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount 
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
    
nor [temp] | where cnv_type = 'DEL'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 20

Query ran in 5.88 sec.
Query fetched 20 rows in 0.01 sec (total time 5.90 sec)

Out[44]:

	info	cnv_type	Samples	min_cnvCount	med_cnvCount	max_cnvCount	avg_cnvCount	std_cnvCount
0	gatk_info	DEL	13	151	480	2178	530	532
1	cnvReporter	DEL	3	443	443	443	443	0
2	source=cnv_reporter	DEL	704	1	82	16432	434	1357
3	cnv_reporter	DEL	320916	1	57	4872	147	285
4	cnv_gatk	DEL	120	71	103	509	107	40
5	NaN	DEL	138451	1	17	52897	104	607
6	source=cnv_reporter;	DEL	124794	1	38	14949	99	202
7	dragen_cnv	DEL	18672	1	45	2440	60	93
8	source=scramble;	DEL	152877	2	54	1082	53	18
9	source=scramble	DEL	310582	1	51	512	51	17
10	source=dragen_ploidy:copynumber=0.51	DEL	5	1	1	2	1	0
11	source=user:pipeline;	DEL	197	1	1	4	1	0
12	source=karyotype	DEL	715	1	1	2	1	0
13	source=karyotype;	DEL	346	1	1	3	1	0
14	source=dragen_ploidy:copynumber=0.5	DEL	21	1	1	2	1	0
15	source=dragen_ploidy:copynumber=0.0	DEL	25	1	1	1	1	0
16	source=dragen_ploidy:copynumber=0.19	DEL	1	1	1	1	1	0
17	source=dragen_ploidy:copynumber=0.49	DEL	6	1	1	1	1	0
18	source=dragen_ploidy:copynumber=0.52	DEL	1	1	1	1	1	0
19	source=dragen_ploidy:copynumber=0.53	DEL	2	1	1	1	1	0

WGS vs WES¶

Now lets only look at WGS samples:

In [37]:

mydefs = qh.add_many("""
create #wgs_pns# = nor -h XA/raw/samples.tsv | select id,panel_name | where (Contains(panel_name,'wgs'))
| select #1 | rename #1 PN;
""")

In [38]:

%%gor
$mydefs
nor [#wgs_pns#] | group -count

Query ran in 22.67 sec
Query fetched 1 rows in 0.03 sec (total time 22.70 sec)

Out[38]:

	allCount
0	22670

In [40]:

%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount 
| inset -c pn [#wgs_pns#]
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
    
nor [temp] | where cnv_type = 'DUP'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 10

Query ran in 5.81 sec.
Query fetched 10 rows in 0.01 sec (total time 5.82 sec)

Out[40]:

	info	cnv_type	Samples	min_cnvCount	med_cnvCount	max_cnvCount	avg_cnvCount	std_cnvCount
0	dragen_cnv	DUP	18458	13	32	619	35	17
1	source=cnv_reporter;	DUP	177	1	6	27	6	5
2	NaN	DUP	254	1	1	91	3	11
3	source=dragen_ploidy:copynumber=0.51	DUP	10	1	1	2	1	0
4	source=dragen_ploidy:copynumber=0.5	DUP	24	1	1	2	1	0
5	source=dragen_ploidy:copynumber=1.0	DUP	38	1	1	2	1	0
6	source=dragen_ploidy:copynumber=0.0	DUP	20	1	1	1	1	0
7	source=dragen_ploidy:copynumber=0.01	DUP	43	1	1	1	1	0
8	source=dragen_ploidy:copynumber=0.02	DUP	7	1	1	1	1	0
9	source=dragen_ploidy:copynumber=0.03	DUP	3	1	1	1	1	0

We see that while we have more genomic region to analyze, the avgerage count for cnv_reporter goes from 12 to 6. Dragen stays the same because most of the 18k samples are WGS. Now the deletions:

In [41]:

%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount 
| inset -c pn [#wgs_pns#]
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
    
nor [temp] | where cnv_type = 'DEL'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 10

Query ran in 6.18 sec.
Query fetched 10 rows in 0.01 sec (total time 6.19 sec)

Out[41]:

	info	cnv_type	Samples	min_cnvCount	med_cnvCount	max_cnvCount	avg_cnvCount	std_cnvCount
0	source=cnv_reporter;	DEL	180	7	115	1537	180	203
1	dragen_cnv	DEL	18457	1	45	2440	60	93
2	source=scramble;	DEL	1	9	9	9	9	0
3	NaN	DEL	406	1	1	275	6	28
4	source=dragen_ploidy:copynumber=0.51	DEL	5	1	1	2	1	0
5	source=dragen_ploidy:copynumber=0.5	DEL	21	1	1	2	1	0
6	source=dragen_ploidy:copynumber=0.0	DEL	25	1	1	1	1	0
7	source=dragen_ploidy:copynumber=0.19	DEL	1	1	1	1	1	0
8	source=dragen_ploidy:copynumber=0.49	DEL	6	1	1	1	1	0
9	source=dragen_ploidy:copynumber=0.52	DEL	1	1	1	1	1	0

We see that WGS dels for source=cnv_reporter; are 180 vs 99 for WES+WGS.

Projecting CNVs from multiple samples onto the genome¶

The most unbiased way to estimate the frequency of CNVs is simply to find out how many CNVs overlap a particular nucleotide in the genome. GOR has the SEGPROJ command that can be used to project segments onto the bases and to calculate the projection height from overlap of multiple segments.

The query below was executed in few minutes for all the 600k samples. In it we processed CNVs from 5 Info types separately, i.e. what we believe is 3 different algorithms. The rest is collapsed into a single "other" type.

The first SEGPROJ command collapses the data per PN for each combination of cnv_type and info separately. This is done because we noticed that in some cases the same sample may have more than one CNV of the same type that overlap the same space, hence, would cause that sample (PN) to contribute more than once to the over all frequency count. The next SEGPROJ command then projects the segments, giving us a segCount number that represents the number of PN overlapping a given segment range. The -maxseg option is used to specify the size of the largest segment. Here it needs to be big enough to cover the larges CNV in all the data set. Smaller number allows to optimize performance, i.e. speed and memory usage. Not an issue here when we ran this with a PARTGOR that takes 50k PNs at the time.

Finally, we combine the segCounts from each 50k partition and calculate the overall segment projection. We don't execute the query here but refer to the multiple output files that are generated with the -fork option in the WRITE command

In [ ]:

def #xa_cnvs# = XA/vars/cnv2.gord;
create #pns# = nor -asdict XA/vars/cnv2.gord | select #2 | rename #1 PN | top 10000;

create #p# = partgor -dict #xa_cnvs# -partsize 50000 <(gor #xa_cnvs# -f #{tags}
| replace info if(info in ('cnv_reporter','source=cnv_reporter;','dragen_cnv','source=scramble;','source=scramble'),#rc,'other')
| segproj -gc cnv_type,info,pn -maxseg 250000000
| segproj -gc cnv_type,info -maxseg 250000000
);

gor [#p#] | segproj -gc cnv_type,info -maxseg 250000000 -sumcol segcount
| segspan -gc cnv_type,info,segcount -maxseg 250000000
| hide segcount | rename segcountx Samples
| calc fc replace(cnv_type+'_'+info,';','sc') | write -f fc -r user_data/hakon/XA_cnv/proj/#{fork}.gorz

In [47]:

%%gor
nor user_data/hakon/XA_cnv/proj | where right(filename,4)='gorz' | select filename,filesize

Query ran in 0.08 sec
Query fetched 11 rows in 0.01 sec (total time 0.09 sec)

Out[47]:

	Filename	Filesize
0	DUP_cnv_reporter.gorz	3454194
1	DEL_source=scramblesc.gorz	1038826
2	DUP_source=cnv_reportersc.gorz	2514174
3	DUP_source=scramblesc.gorz	119
4	DEL_dragen_cnv.gorz	1127070
5	DUP_other.gorz	3193421
6	DEL_other.gorz	4598861
7	DUP_dragen_cnv.gorz	562417
8	DEL_source=scramble.gorz	1432061
9	DEL_cnv_reporter.gorz	4278149
10	DEL_source=cnv_reportersc.gorz	4195546

We now combine all the Info type into a single count as well:

In [48]:

%%gor
def #path# = user_data/hakon/XA_cnv/proj;

gor #path#/DEL_cnv_reporter.gorz #path#/DEL_dragen_cnv.gorz #path#/DEL_other.gorz #path#/DEL_source=cnv_reportersc.gorz #path#/DEL_source=scramble.gorz #path#/DEL_source=scramblesc.gorz
| segproj -maxseg 250000000 -sumcol samples
| rename segcount Samples
| write user_data/hakon/XA_cnv/proj/DEL_all.gorz

Query ran in 0.49 sec
Query fetched 0 rows in 3.94 sec (total time 4.43 sec)

Out[48]:

	Chrom	bpStart

In [49]:

%%gor
def #path# = user_data/hakon/XA_cnv/proj;

gor #path#/DUP_cnv_reporter.gorz #path#/DUP_dragen_cnv.gorz #path#/DUP_other.gorz #path#/DUP_source=cnv_reportersc.gorz #path#/DUP_source=scramblesc.gorz
| segproj -maxseg 250000000 -sumcol samples
| rename segcount Samples
| write user_data/hakon/XA_cnv/proj/DUP_all.gorz

Query ran in 0.08 sec
Query fetched 0 rows in 2.30 sec (total time 2.38 sec)

Out[49]:

	Chrom	bpStart

Finally, we now generate a input to a Sequence Miner genome-browser track file with the following command:

In [50]:

%%gor
nor user_data/hakon/XA_cnv/proj | where right(filename,4)='gorz'
| calc name replace(filename,'.gorz','')
| calc file name+'.gorz'
| calc type if(left(name,3)='DEL','DEL','DUP')
| calc track '<track name="'+replace(name,type+'_','')+'" type="'+type+'_CNV_count" visible="true" selected="true">
ffiletrack(height=CNVheight,filename=CNVPath+"'+file+'", name="'+name+'", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="'+if(type='DEL','Green','Blue')+'",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>'
| select track

Query ran in 0.07 sec
Query fetched 13 rows in 0.03 sec (total time 0.10 sec)

Out[50]:

	track
0	<track name="cnv_reporter" type="DUP_CNV_count...
1	<track name="source=scramblesc" type="DEL_CNV_...
2	<track name="source=cnv_reportersc" type="DUP_...
3	<track name="all" type="DUP_CNV_count" visible...
4	<track name="source=scramblesc" type="DUP_CNV_...
5	<track name="all" type="DEL_CNV_count" visible...
6	<track name="dragen_cnv" type="DEL_CNV_count" ...
7	<track name="other" type="DUP_CNV_count" visib...
8	<track name="other" type="DEL_CNV_count" visib...
9	<track name="dragen_cnv" type="DUP_CNV_count" ...
10	<track name="source=scramble" type="DEL_CNV_co...
11	<track name="cnv_reporter" type="DEL_CNV_count...
12	<track name="source=cnv_reportersc" type="DEL_...

The genome browser track file is stored in user_data/hakon/XA_cnv/cnv_projections.gb and has the following content:

In [ ]:

<tracks>
<global>
CNVPath="user_data/hakon/XA_cnv/proj/"
CNVheight=50
CNVmax = "1000"
gorPath = ""
lessthan = chr(60)

</global>
<track name="all" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_all.gorz", name="DUP_all", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=3, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="all" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_all.gorz", name="DEL_all", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=3, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="cnv_reporter" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_cnv_reporter.gorz", name="DUP_cnv_reporter", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=scramblesc" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_source=scramblesc.gorz", name="DEL_source=scramblesc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=cnv_reportersc" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_source=cnv_reportersc.gorz", name="DUP_source=cnv_reportersc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=scramblesc" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_source=scramblesc.gorz", name="DUP_source=scramblesc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="dragen_cnv" type="DEL_CNV_count" visible="true" selected="true">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_dragen_cnv.gorz", name="DEL_dragen_cnv", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="other" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_other.gorz", name="DUP_other", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="other" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_other.gorz", name="DEL_other", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="dragen_cnv" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_dragen_cnv.gorz", name="DUP_dragen_cnv", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=scramble" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_source=scramble.gorz", name="DEL_source=scramble", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="cnv_reporter" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_cnv_reporter.gorz", name="DEL_cnv_reporter", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=cnv_reportersc" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_source=cnv_reportersc.gorz", name="DEL_source=cnv_reportersc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
</tracks>

In [55]:

from IPython.display import Image
Image('SM_GB_CNVs.png',height=500,width=800)

Out[55]:

Note that while the tracks shown above display sample counts, we could easily normalize these based on the total number of samples to give some kind of AF estimate for these CNVs.

Mapping CNVs to coding exons of genes.¶

In many cases it can be interesting to find how CNVs overlap with genes, i.e. which exons and how much of the coding region they cover.

Below we show a query that maps all the CNVs, separately for DEL and DUP (as stored in cnv_type renamed to alt). The query uses ##cnv_analysis_exons##, a table that have a single "super transcript" per gene, i.e. a projection of all the exons associated with the gene. It also stored the HGNC_ID of the gene, total coding region size, the exon number and the total number of exons per gene.

In [57]:

%%gor 
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
gor ##cnv_analysis_exons## | granno chrom -gc hgnc_id -count | where allcount > 1 | top 3

Query ran in 0.20 sec
Query fetched 3 rows in 0.13 sec (total time 0.32 sec)

Out[57]:

	chrom	chromstart	chromend	hgnc_id	strand	gene_coding_size	exon_num	exon_count	allCount
0	chr1	861321	861393	HGNC:28706	+	2321	1	17	17
1	chr1	865534	865716	HGNC:28706	+	2321	2	17	17
2	chr1	866418	866469	HGNC:28706	+	2321	3	17	17

The query below joins all the exons for all genes with all the CNVs from the samples. Notice that we use -maxseg 100000000 to indicate the maximum size of the CNVs in the "right-source". We then calculate the ex_overlap and the use MAP to add the start and stop coordinates of the correspoinding gene, which we make the GOR position columns with the COLUMNREORDER. Because we are switching the GOR ordering column, we must apply SORT to fix the ordering. In principle, a SORT with 3Mb should be enough (and make the query run faster), however, because of some discrepancies in locations of few genes (as compared with the transcript data), we got wrong-order error until we made the sliding sort window larger. The query below does still run fast and comletes the mapping of all the 600k samples to the 40k genes in half an hour or so.

In [72]:

def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;

def #xa_cnvs# = XA/vars/cnv2.gord; 

create #gene_cnvs# = partgor -dict #xa_cnvs# -partsize 10000 <(gor ##cnv_analysis_exons##
    | join -segseg -maxseg 100000000 <(gor #xa_cnvs# -f #{tags}
    | where info in ('cnv_reporter','source=cnv_reporter;','dragen_cnv','source=scramble;','source=scramble')
                                        | rename #2 cnv_start | rename #3 cnv_stop | rename cnv_type alt | select 1-3,alt,pn)
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| columnreorder chrom,gene_start,gene_end,pn,hgnc_id,gene_symbol,alt
| sort chrom
| group 1 -gc gene_end,pn,hgnc_id,gene_symbol,alt,gene_coding_size,exon_count -sum -ic ex_ovlap,exon_num -min -max
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(min(cnv_overlap/gene_coding_size,1.0),4,4)
| select 1,2,gene_end,pn,hgnc_id,gene_symbol,alt,ex_ov_frac,gene_ov_frac
| atmax 1 -gc hgnc_id,pn gene_ov_frac
);

pgor  [#gene_cnvs#] | write user_data/hakon/XA_cnv/gene_cnv_filt.gord

Query ran in 17.21 sec
Query fetched 0 rows in 1484.12 sec (total time 1501.33 sec)

Out[72]:

	chrom	gene_start	gene_end	pn	hgnc_id	gene_symbol	alt	ex_ov_frac	gene_ov_frac

Notice that for each sample, we pick either DEL or DUP within each gene, using ATMAX based on gene overlap fraction.

In [62]:

%%gor
gor -p chr1:16384263 <(gor user_data/hakon/XA_cnv/gene_cnv_filt.gord | sort 1 -c ex_ov_frac:n | top 5
                       | merge <(gor user_data/hakon/XA_cnv/gene_cnv_filt.gord | sort 1 -c ex_ov_frac:nr | top 5))

Query ran in 0.13 sec
Query fetched 10 rows in 0.76 sec (total time 0.89 sec)

Out[62]:

	chrom	gene_start	gene_end	pn	hgnc_id	gene_symbol	alt	ex_ov_frac	gene_ov_frac
0	chr1	16384263	16400146	191079	HGNC:26717	FAM131C	DEL	1.0000	1.0000
1	chr1	16384263	16400146	191080	HGNC:26717	FAM131C	DEL	1.0000	1.0000
2	chr1	16384263	16400146	191081	HGNC:26717	FAM131C	DEL	1.0000	1.0000
3	chr1	16384263	16400146	192253	HGNC:26717	FAM131C	DEL	1.0000	1.0000
4	chr1	16384263	16400146	192254	HGNC:26717	FAM131C	DEL	1.0000	1.0000
5	chr1	16384263	16400146	494102	HGNC:26717	FAM131C	DEL	0.1429	0.0261
6	chr1	16384263	16400146	494193	HGNC:26717	FAM131C	DEL	0.1429	0.0261
7	chr1	16384263	16400146	494194	HGNC:26717	FAM131C	DEL	0.1429	0.0261
8	chr1	16384263	16400146	494622	HGNC:26717	FAM131C	DEL	0.1429	0.0261
9	chr1	16384263	16400146	494623	HGNC:26717	FAM131C	DEL	0.1429	0.0261

Phasing the CNVs¶

Once we have the CNVs mapped to a "defined grid" (the genes) we can also start to compare them across samples and phase them. Below is a query that takes as input the table above and uses PEDPIVOT to restructure the data such that the paternity can easily be specified per row.

In [ ]:

def #samples# = XA/raw/samples.tsv;

create #ped# = nor #samples#
| select id,case_id,relation
| pivot -gc case_id relation -v proband,father,mother -e ''
| where father_id != '' and mother_id != ''and proband_id != ''
| select #2-;

pgor user_data/hakon/XA_cnv/gene_cnv_filt.gord
| calc present if(gene_ov_frac > 0.1 or ex_ov_frac > 0.5,1,0)
| hide ex_ov_frac,gene_ov_frac 
| pedpivot -gc #3,hgnc_id,gene_symbol,alt -e 0 PN [#ped#]
| where p_present = 1
| calc paternity if(f_present = 1 and m_present = 1,'Both', if(f_present = 1 and m_present = 0,'Paternal',if(f_present = 0 and m_present = 1,'Maternal','Unknown')))
/*
| where paternity in ('Paternal','Maternal')
*/
| select 1-P_pn,paternity
| rename p_pn PN
| write user_data/hakon/XA_cnv/gene_cnv_filt_phased.gord

As an example, we can look at phased CNVs in the RHD gene:

In [63]:

%%gor
gor -p chr1:25598976 user_data/hakon/XA_cnv/gene_cnv_phased.gord | where paternity in ('Paternal','Maternal')

Query ran in 0.42 sec
Query fetched 5 rows in 0.03 sec (total time 0.45 sec)

Out[63]:

	chrom	gene_start	gene_end	hgnc_id	gene_symbol	alt	PN	paternity
0	chr1	25598976	25656936	HGNC:10009	RHD	DEL	193957	Paternal
1	chr1	25598976	25656936	HGNC:10009	RHD	DEL	605367	Paternal
2	chr1	25598976	25656936	HGNC:10009	RHD	DEL	193965	Maternal
3	chr1	25598976	25656936	HGNC:10009	RHD	DEL	193939	Maternal
4	chr1	25598976	25656936	HGNC:10009	RHD	DEL	193976	Maternal

With the phased CNV genotypes, we can start to compare the paternity of CNV with the paternity of SNV in the same gene, i.e. to see if we have a CNV in trans with a SNP.

Oscar classified CNVs¶

Similarly, we can take the classified CNVs from Oscar and map the to the genes. We will do that separately for each classification lable. Note that here the join-order is reversed from what we saw above (this query was written before). While it is due to a very advanced topic on join performance, but because of the very large size of even just few of the CNVs and the nature of the JOIN implementation, most of the exons from the right-source can be fetched into the join-cache-buffer and cause unnecessary cartesian-join overhead, even for CNVs that are far away from them. By reversion the join order, as we did above, this overhead is dramatically less because of the small fraction of large CNVs. However, here, this is not an issue because the total CNVs from Oscar are only about 200k and NOT 108 million or so (the total XA CNVs)

In [64]:

%%gor
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;

create #oscar_cnv_gene_summary# = pgor ref_gregor/classvars/oscar_cnv.gorz 
| select 1-gdx_classification
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,gdx_classification,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| group chrom -gc hgnc_id,alt,gdx_classification -avg -max -fc gene_ov_frac,ex_ov_frac -count
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| rename allcount cnvCount
| select chrom,gene_start,gene_end,hgnc_id,gene_symbol,alt,gdx_classification,cnvCount,max_ex_ov_frac,avg_ex_ov_frac,max_gene_ov_frac,avg_gene_ov_frac
| sort chrom
;
                   
gor [#oscar_cnv_gene_summary#] | write user_data/hakon/XA_cnv/oscar_cnv_gene.gorz

Query ran in 1.98 sec
Query fetched 0 rows in 34.45 sec (total time 36.43 sec)

Out[64]:

	chrom	gene_start

Note: A very advanced topic on join performance: Because of the very large size of even just few of the CNVs and the nature of the JOIN implementation, most of the exons from the right-source can be fetched into the join-cache-buffer and cause unnecessary cartesian-join overhead, even for CNVs that are far away from them. By reversing the join order, as we did above when we joined the sample CNVs with exons, this overhead is dramatically less, because of the small fraction of large CNVs. However, here, this is not an issue because the total CNVs from Oscar are only about 200k and NOT 108 million or so as the total number of XA CNVs.

Additionally, above, we were not only aggregating by gene (group chrom -gc hgnc_id) but also by PN, hence the cardinality per chromosome would be VERY large and the memory requirements. By reversion the join order we could use GROUP 1 -gc hgnc_id,PN, i.e. one gene at the time, thus only the cardinality of the PNs. The downside is however that we have to do the SORT, but SORT is gentle with regard to memory usage.

In [66]:

%%gor
gor -p chr1:25598976 user_data/hakon/XA_cnv/oscar_cnv_gene.gorz

Query ran in 0.66 sec
Query fetched 8 rows in 0.00 sec (total time 0.66 sec)

Out[66]:

	chrom	gene_start	gene_end	hgnc_id	gene_symbol	alt	gdx_classification	cnvCount	max_ex_ov_frac	avg_ex_ov_frac	max_gene_ov_frac	avg_gene_ov_frac
0	chr1	25598976	25656936	HGNC:10009	RHD	GAIN	BEN	2	0.9286	0.821450	0.8246	0.748200
1	chr1	25598976	25656936	HGNC:10009	RHD	GAIN	LBEN	19	1.0000	0.954895	1.0000	0.923795
2	chr1	25598976	25656936	HGNC:10009	RHD	GAIN	PATH	5	1.0000	1.000000	1.0000	1.000000
3	chr1	25598976	25656936	HGNC:10009	RHD	GAIN	UV	8	1.0000	0.883925	1.0000	0.890213
4	chr1	25598976	25656936	HGNC:10009	RHD	LOSS	BEN	1	0.7143	0.714300	0.6718	0.671800
5	chr1	25598976	25656936	HGNC:10009	RHD	LOSS	UV	4	1.0000	1.000000	1.0000	1.000000
6	chr1	25598976	25656936	HGNC:10009	RHD	ROH	PATH	2	1.0000	1.000000	1.0000	1.000000
7	chr1	25598976	25656936	HGNC:10009	RHD	ROH	UV	819	1.0000	1.000000	1.0000	1.000000

Gnomad CNVs¶

In Gnomad v 4.11 we have information on 50k CNVs from various populations. These CNVs have been normalized based on majority overlap and each CNV is then annotated based on the number of samples that overlap (SC) an the total number of samples with data in given location (SN). We use the total SC aggregated over the gene as the gene's SC but the maximum SN as the denominator in a AF estimate for CNVs in a given gene.

In [67]:

%%gor
gor user_data/hakon/gnomAD/cnv_all_v4_hg19.gorz | group genome -count

Query ran in 0.30 sec
Query fetched 1 rows in 0.26 sec (total time 0.56 sec)

Out[67]:

	CHROM	bpStart	bpStop	allCount
0	chrA	0	1000000000	52060

In [69]:

%%gor
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;
create #gnomad_cnv_gene_summary# = gor user_data/hakon/gnomAD/cnv_all_v4_hg19.gorz
| select 1-alt,sc,sn
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_end,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,sc,sn,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| group chrom -gc hgnc_id,alt -avg -max -fc gene_ov_frac,ex_ov_frac -count -sum -ic sc,sn
| rename allcount cnvCount
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| select chrom,gene_start,gene_end,hgnc_id,gene_symbol,alt,cnvCount,max_ex_ov_frac,avg_ex_ov_frac,max_gene_ov_frac,avg_gene_ov_frac,sum_sc,max_sn
| calc AF sum_sc/max_sn
| sort chrom
| replace avg_*,max_* form(float(#rc),4,2) 
| replace af form(af,4,6)
;
    
gor [#gnomad_cnv_gene_summary#] | write user_data/hakon/XA_cnv/gnomad_cnv_gene.gorz

Query ran in 1.06 sec
Query fetched 0 rows in 3.34 sec (total time 4.40 sec)

Out[69]:

	CHROM	gene_start

Annotating CNVs with Oscar and Gnomad¶

First counting XA CNVs and then annotating:

In [73]:

%%gor
gor -p chr1:16384263 user_data/hakon/XA_cnv/gene_cnv.gord
| group 1 -gc 3,hgnc_id,gene_symbol,alt -count
| replace alt if(alt='DEL','LOSS','GAIN')
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Oscar user_data/hakon/XA_cnv/oscar_cnv_gene.gorz 
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Gnom <(gor user_data/hakon/XA_cnv/gnomad_cnv_gene.gorz | replace alt if(alt='<DEL>','LOSS','GAIN'))
| select 1,2,gene_end,hgnc_id,gene_symbol,alt,allCount,Oscar_gdx_classification,Oscar_cnvCount,Oscar_max_ex_ov_frac,Oscar_avg_ex_ov_frac,Oscar_max_gene_ov_frac,Oscar_avg_gene_ov_frac,Gnom_cnvCount,Gnom_max_ex_ov_frac,Gnom_avg_ex_ov_frac,Gnom_max_gene_ov_frac,Gnom_avg_gene_ov_frac,Gnom_sum_SC,Gnom_max_SN,Gnom_AF
| granno 1 -gc hgnc_id,alt -set -sc oscar_gdx_classification

Query ran in 0.72 sec
Query fetched 6 rows in 0.39 sec (total time 1.11 sec)

Out[73]:

	chrom	gene_start	gene_end	hgnc_id	gene_symbol	alt	allCount	Oscar_gdx_classification	Oscar_cnvCount	Oscar_max_ex_ov_frac	...	Oscar_avg_gene_ov_frac	Gnom_cnvCount	Gnom_max_ex_ov_frac	Gnom_avg_ex_ov_frac	Gnom_max_gene_ov_frac	Gnom_avg_gene_ov_frac	Gnom_sum_SC	Gnom_max_SN	Gnom_AF	set_Oscar_gdx_classification
0	chr1	16384263	16400146	HGNC:26717	FAM131C	LOSS	374495	PATH	8	1.0	...	0.963075	1	0.86	0.86	0.97	0.97	1	464297.0	0.000002	PATH,UV
1	chr1	16384263	16400146	HGNC:26717	FAM131C	LOSS	374495	UV	9	1.0	...	0.928556	1	0.86	0.86	0.97	0.97	1	464297.0	0.000002	PATH,UV
2	chr1	16384263	16400146	HGNC:26717	FAM131C	GAIN	7837	LPATH	3	1.0	...	1.000000	2	1.00	0.93	1.00	0.99	2	464297.0	0.000004	LPATH,PATH,UV,VUS
3	chr1	16384263	16400146	HGNC:26717	FAM131C	GAIN	7837	PATH	2	1.0	...	1.000000	2	1.00	0.93	1.00	0.99	2	464297.0	0.000004	LPATH,PATH,UV,VUS
4	chr1	16384263	16400146	HGNC:26717	FAM131C	GAIN	7837	UV	7	1.0	...	0.865943	2	1.00	0.93	1.00	0.99	2	464297.0	0.000004	LPATH,PATH,UV,VUS
5	chr1	16384263	16400146	HGNC:26717	FAM131C	GAIN	7837	VUS	4	1.0	...	0.920525	2	1.00	0.93	1.00	0.99	2	464297.0	0.000004	LPATH,PATH,UV,VUS

6 rows × 22 columns

Reading the phased CNVs for few samples and annotating them:

In [72]:

%%gor
gor -p chr1:25598976 user_data/hakon/XA_cnv/gene_cnv_phased.gord | where paternity in ('Paternal','Maternal')

| replace alt if(alt='DEL','LOSS','GAIN')
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Oscar user_data/hakon/XA_cnv/oscar_cnv_gene.gorz 
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Gnom <(gor user_data/hakon/XA_cnv/gnomad_cnv_gene.gorz | replace alt if(alt='<DEL>','LOSS','GAIN'))
| select 1,2,gene_end,pn,paternity,hgnc_id,gene_symbol,alt,Oscar_gdx_classification,Oscar_cnvCount,Oscar_max_ex_ov_frac,Oscar_avg_ex_ov_frac,Oscar_max_gene_ov_frac,Oscar_avg_gene_ov_frac,Gnom_cnvCount,Gnom_max_ex_ov_frac,Gnom_avg_ex_ov_frac,Gnom_max_gene_ov_frac,Gnom_avg_gene_ov_frac,Gnom_sum_SC,Gnom_max_SN,Gnom_AF
| granno 1 -gc hgnc_id,alt -set -sc oscar_gdx_classification

Query ran in 0.23 sec
Query fetched 10 rows in 0.15 sec (total time 0.38 sec)

Out[72]:

	chrom	gene_start	gene_end	PN	paternity	hgnc_id	gene_symbol	alt	Oscar_gdx_classification	Oscar_cnvCount	...	Oscar_avg_gene_ov_frac	Gnom_cnvCount	Gnom_max_ex_ov_frac	Gnom_avg_ex_ov_frac	Gnom_max_gene_ov_frac	Gnom_avg_gene_ov_frac	Gnom_sum_SC	Gnom_max_SN	Gnom_AF	set_Oscar_gdx_classification
0	chr1	25598976	25656936	193957	Paternal	HGNC:10009	RHD	LOSS	BEN	1	...	0.6718	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
1	chr1	25598976	25656936	193957	Paternal	HGNC:10009	RHD	LOSS	UV	4	...	1.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
2	chr1	25598976	25656936	605367	Paternal	HGNC:10009	RHD	LOSS	BEN	1	...	0.6718	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
3	chr1	25598976	25656936	605367	Paternal	HGNC:10009	RHD	LOSS	UV	4	...	1.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
4	chr1	25598976	25656936	193965	Maternal	HGNC:10009	RHD	LOSS	BEN	1	...	0.6718	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
5	chr1	25598976	25656936	193965	Maternal	HGNC:10009	RHD	LOSS	UV	4	...	1.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
6	chr1	25598976	25656936	193939	Maternal	HGNC:10009	RHD	LOSS	BEN	1	...	0.6718	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
7	chr1	25598976	25656936	193939	Maternal	HGNC:10009	RHD	LOSS	UV	4	...	1.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
8	chr1	25598976	25656936	193976	Maternal	HGNC:10009	RHD	LOSS	BEN	1	...	0.6718	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV
9	chr1	25598976	25656936	193976	Maternal	HGNC:10009	RHD	LOSS	UV	4	...	1.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BEN,UV

10 rows × 23 columns

Instead of using the gene-based CNV statistics, we can also use the inhouse XA metrics that are on base-level. More cosly however but easily doable for few hundred CNVs.

In [74]:

%%gor
gor -p chr1:16384263 user_data/hakon/XA_cnv/gene_cnv.gord
| group 1 -gc 3,hgnc_id,gene_symbol,alt -count
| where alt = 'DUP' 
| join -segseg -l -e 0 -maxseg 10000000 user_data/hakon/XA_cnv/proj/DUP_all.gorz
| group 1 -gc #3-allcount -min -max -ic samples

Query ran in 0.44 sec
Query fetched 1 rows in 0.35 sec (total time 0.79 sec)

Out[74]:

	chrom	gene_start	gene_end	hgnc_id	gene_symbol	alt	allCount	min_Samples	max_Samples
0	chr1	16384263	16400146	HGNC:26717	FAM131C	DUP	7837	337	11868

We see that we have 7837 samples in XA that include the FAM131C DUP CNV. Regions of this gene do however have overlap with from 337 to 11868 samples. We can compare this with the distribution of the overlap of the individual samples:

In [91]:

%%gor data <<
nor <(gor -p chr1:16384263 user_data/hakon/XA_cnv/gene_cnv.gord | where alt = 'DUP' ) | select gene_ov_frac

Query ran in 0.57 sec
Query fetched 7,837 rows in 0.29 sec (total time 0.86 sec)

In [92]:

plt.hist(data)

Out[92]:

(array([7.000e+00, 1.200e+01, 1.200e+01, 8.210e+02, 2.090e+02, 3.000e+00,
        3.992e+03, 1.241e+03, 6.570e+02, 8.830e+02]),
 array([0.0261 , 0.12349, 0.22088, 0.31827, 0.41566, 0.51305, 0.61044,
        0.70783, 0.80522, 0.90261, 1.     ]),
 <BarContainer object of 10 artists>)

This looks consistent with our number above, i.e. that only few of the samples have perfect CNV overlap with the gene and some of them very small.

In [ ]: