CNVs from multiple XA-samples analysed¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "internal_hg19"
%env GOR_API_PROJECT={project}
import matplotlib.pyplot as plt
XA and CNVs¶
Here we will take a look at the CNVs that are in the GORdb version of XA.
%%gor
create temp = nor -asdict XA/vars/cnv2.gord | group -count;
nor [temp]
Query ran in 5.93 sec. Query fetched 1 rows in 0.00 sec (total time 5.94 sec)
allCount | |
---|---|
0 | 628077 |
We see that we have 628 samples. Now a quick inspection of the columns we have.
%%gor
gor XA/vars/cnv2.gord -ff <(nor -asdict XA/vars/cnv2.gord | skip 10000 | top 2 | select #2) | top 5
Query ran in 0.21 sec Query fetched 5 rows in 0.00 sec (total time 0.22 sec)
chrom | bpStart | bpstop | cnv_type | exons | alt | callcopies | genotype | gt_quality | var_allele_depth | sample_coverage | samtools | info | gatk_hap | sample_id | result_id | variant_id | lis_variant_info_id | pn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 16376517 | 16386258 | DEL | multi | -0.50 | 1 | het | 99 | NaN | NaN | NaN | source=scramble | NaN | 407016 | 3525594597 | 16342710 | 155224975 | 407016 |
1 | chr1 | 16376517 | 16386258 | DEL | multi | -0.75 | 0 | wt | 20 | 0.0 | NaN | NaN | source=scramble | NaN | 407074 | 3540269334 | 16348765 | 156159726 | 407074 |
2 | chr1 | 26608795 | 26608872 | DEL | ex16 | -0.50 | 0 | wt | 20 | 0.0 | NaN | NaN | source=scramble | NaN | 407074 | 3540269337 | 16344714 | 156132607 | 407074 |
3 | chr1 | 26608823 | 26608901 | DEL | ex16 | -0.50 | 1 | het | 99 | NaN | NaN | NaN | source=scramble | NaN | 407016 | 3525594598 | 16342641 | 155188171 | 407016 |
4 | chr1 | 26671521 | 26671674 | DEL | ex2 | -0.50 | 0 | wt | 20 | 0.0 | NaN | NaN | source=scramble | NaN | 407074 | 3540269340 | 16347825 | 156441502 | 407074 |
We see that cnv_type is DEL or DUP and Info stores the algorithm (source). Now lets see how the CNVs are distributed among these algorithms. We will form an aggregate query that can be inspected in different ways:
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def #xa_cnvs# = XA/vars/cnv2.gord;
create #p_cnvs_types# = partgor -dict #xa_cnvs# -partsize 10000 <(gor #xa_cnvs# -f #{tags}
| calc size ceil(log(#3-#2))
| group chrom -gc cnv_type,size,info,pn -count
);
create #cnvs_types# = gor [#p_cnvs_types#] | group genome -gc cnv_type,size,info,pn -sum -ic allcount;
""")
CNV algorithms¶
%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc info -sum -ic sum_allcount;
nor [temp] | sort -c #2:nr | rename sum_sum_allcount CNVcount
Query ran in 6.13 sec. Query fetched 65 rows in 0.00 sec (total time 6.14 sec)
info | CNVcount | |
---|---|---|
0 | cnv_reporter | 50953508 |
1 | NaN | 16529000 |
2 | source=scramble | 16068727 |
3 | source=cnv_reporter; | 13940610 |
4 | source=scramble; | 8128519 |
... | ... | ... |
60 | source=dragen_ploidy:copynumber=1.56 | 1 |
61 | source=dragen_ploidy:copynumber=1.61 | 1 |
62 | source=dragen_ploidy:copynumber=1.64 | 1 |
63 | source=dragen_ploidy:copynumber=1.99 | 1 |
64 | source=dragen_ploidy:copynumber=2.14 | 1 |
65 rows × 2 columns
We see that most of the CNVs are from cnv_reporter. We can also see how the samples are distributed:
%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc info,pn | group -gc info -count;
nor [temp] | sort -c #2:nr | rename allcount Samplecount
Query ran in 11.15 sec Query fetched 65 rows in 7.12 sec (total time 18.27 sec)
info | Samplecount | |
---|---|---|
0 | cnv_reporter | 321183 |
1 | source=scramble | 310582 |
2 | source=scramble; | 152877 |
3 | NaN | 149428 |
4 | source=cnv_reporter; | 125093 |
... | ... | ... |
60 | source=dragen_ploidy:copynumber=1.56 | 1 |
61 | source=dragen_ploidy:copynumber=1.61 | 1 |
62 | source=dragen_ploidy:copynumber=1.64 | 1 |
63 | source=dragen_ploidy:copynumber=1.99 | 1 |
64 | source=dragen_ploidy:copynumber=2.14 | 1 |
65 rows × 2 columns
However, most of the samples have CNVs from more than one caller:
%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc pn -set -sc info;
nor [temp] | group -gc set_info -count | sort -c allcount:nr | top 5
Query ran in 6.41 sec. Query fetched 5 rows in 3.87 sec (total time 10.28 sec)
set_info | allCount | |
---|---|---|
0 | cnv_reporter,source=scramble | 272924 |
1 | NaN | 137193 |
2 | source=cnv_reporter;,source=scramble; | 113985 |
3 | cnv_reporter,source=scramble; | 29202 |
4 | source=scramble | 17850 |
and to see the rows from 6 and below:
%%gor
$mydefs
create temp = nor [#cnvs_types#] | group -gc pn -set -sc info;
nor [temp] | group -gc set_info -count | sort -c allcount:nr | skip 5
Query ran in 6.00 sec. Query fetched 153 rows in 1.37 sec (total time 7.37 sec)
set_info | allCount | |
---|---|---|
0 | dragen_cnv | 17797 |
1 | ,cnv_reporter,source=scramble | 9902 |
2 | source=cnv_reporter; | 8145 |
3 | source=scramble; | 6071 |
4 | cnv_reporter,source=karyotype,source=scramble | 4203 |
... | ... | ... |
148 | source=cnv_reporter,source=scramble,source=sma | 1 |
149 | source=dragen_ploidy:copynumber=0.0,source=dra... | 1 |
150 | source=dragen_ploidy:copynumber=0.0,source=dra... | 1 |
151 | source=dragen_ploidy:copynumber=0.49,source=dr... | 1 |
152 | source=dragen_ploidy:copynumber=1.46 | 1 |
153 rows × 2 columns
CNV size distribution¶
First for all callers combined. Note from the query above that size == ceil(log(cnv_span))
%%gor data <<
$mydefs
create temp = nor [#cnvs_types#] | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 6.48 sec. Query fetched 10 rows in 0.00 sec (total time 6.48 sec)
size = data['size']
count = data['sum_sum_allCount']
plt.bar(size,count)
plt.title('All CNVs')
Text(0.5, 1.0, 'All CNVs')
and now for DEL and DUP separately:
%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 7.66 sec. Query fetched 10 rows in 5.24 sec (total time 12.90 sec)
%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 7.07 sec. Query fetched 9 rows in 4.53 sec (total time 11.60 sec)
delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('All DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('All DUPs')
Text(0.5, 1.0, 'All DUPs')
We notice that deletions are typically much smaller, i.e. most between 100-10k bp while quite a few are bigger than 10Mb.
Now lets look at Dragen only:
%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' and info = 'dragen_cnv' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 8.19 sec. Query fetched 7 rows in 4.35 sec (total time 12.54 sec)
%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' and info = 'dragen_cnv' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 6.57 sec. Query fetched 6 rows in 3.53 sec (total time 10.11 sec)
delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('Dragen DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('Dragen DUPs')
Text(0.5, 1.0, 'Dragen DUPs')
We notice that there is more difference in the DEL size distribution. Now for cnv_reporter:
%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' and info = 'cnv_reporter' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 7.66 sec. Query fetched 9 rows in 5.29 sec (total time 12.94 sec)
%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' and info = 'cnv_reporter' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 6.57 sec. Query fetched 7 rows in 3.26 sec (total time 9.83 sec)
delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('cnv_reporter DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('cnv_reporter DUPs')
Text(0.5, 1.0, 'cnv_reporter DUPs')
and finally Scramble:
%%gor del_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DEL' and info = 'source=scramble' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 7.16 sec. Query fetched 8 rows in 5.29 sec (total time 12.44 sec)
%%gor dup_data <<
$mydefs
create temp = nor [#cnvs_types#] | where cnv_type = 'DUP' and info = 'source=scramble' | group -gc size -sum -ic sum_allcount;
nor [temp]
Query ran in 6.57 sec. Query fetched 0 rows in 4.22 sec (total time 10.79 sec)
delsize = del_data['size']
dupsize = dup_data['size']
delcount = del_data['sum_sum_allCount']
dupcount = dup_data['sum_sum_allCount']
plt.subplot(121)
plt.bar(delsize,delcount)
plt.title('Scramble DELs')
plt.subplot(122)
plt.bar(dupsize,dupcount)
plt.title('Scramble DUPs')
Text(0.5, 1.0, 'Scramble DUPs')
We notice that Scramble only has deletions and also that there are quite many really large deletions called with it.
%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
nor [temp] | where cnv_type = 'DUP'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 20
Query ran in 5.87 sec. Query fetched 20 rows in 0.01 sec (total time 5.88 sec)
info | cnv_type | Samples | min_cnvCount | med_cnvCount | max_cnvCount | avg_cnvCount | std_cnvCount | |
---|---|---|---|---|---|---|---|---|
0 | gatk_info | DUP | 13 | 102 | 142 | 187 | 143 | 29 |
1 | cnv_gatk | DUP | 120 | 57 | 76 | 120 | 77 | 11 |
2 | source=cnv_reporter | DUP | 678 | 1 | 8 | 1574 | 44 | 151 |
3 | dragen_cnv | DUP | 18673 | 13 | 32 | 619 | 35 | 17 |
4 | cnvReporter | DUP | 3 | 19 | 19 | 19 | 19 | 0 |
5 | NaN | DUP | 126736 | 1 | 3 | 3103 | 16 | 106 |
6 | source=cnv_reporter; | DUP | 121680 | 1 | 6 | 575 | 12 | 24 |
7 | cnv_reporter | DUP | 314526 | 1 | 6 | 869 | 11 | 19 |
8 | source=karyotype; | DUP | 707 | 1 | 1 | 3 | 1 | 0 |
9 | source=karyotype | DUP | 5464 | 1 | 1 | 3 | 1 | 0 |
10 | source=user:pipeline; | DUP | 117 | 1 | 1 | 2 | 1 | 0 |
11 | source=dragen_ploidy:copynumber=0.51 | DUP | 11 | 1 | 1 | 2 | 1 | 0 |
12 | source=dragen_ploidy:copynumber=0.5 | DUP | 24 | 1 | 1 | 2 | 1 | 0 |
13 | source=dragen_ploidy:copynumber=1.0 | DUP | 38 | 1 | 1 | 2 | 1 | 0 |
14 | source=dragen_ploidy:copynumber=0.0 | DUP | 21 | 1 | 1 | 1 | 1 | 0 |
15 | source=dragen_ploidy:copynumber=0.01 | DUP | 43 | 1 | 1 | 1 | 1 | 0 |
16 | source=dragen_ploidy:copynumber=0.02 | DUP | 7 | 1 | 1 | 1 | 1 | 0 |
17 | source=dragen_ploidy:copynumber=0.03 | DUP | 3 | 1 | 1 | 1 | 1 | 0 |
18 | source=dragen_ploidy:copynumber=0.48 | DUP | 1 | 1 | 1 | 1 | 1 | 0 |
19 | source=dragen_ploidy:copynumber=0.49 | DUP | 5 | 1 | 1 | 1 | 1 | 0 |
DEL counts¶
%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
nor [temp] | where cnv_type = 'DEL'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 20
Query ran in 5.88 sec. Query fetched 20 rows in 0.01 sec (total time 5.90 sec)
info | cnv_type | Samples | min_cnvCount | med_cnvCount | max_cnvCount | avg_cnvCount | std_cnvCount | |
---|---|---|---|---|---|---|---|---|
0 | gatk_info | DEL | 13 | 151 | 480 | 2178 | 530 | 532 |
1 | cnvReporter | DEL | 3 | 443 | 443 | 443 | 443 | 0 |
2 | source=cnv_reporter | DEL | 704 | 1 | 82 | 16432 | 434 | 1357 |
3 | cnv_reporter | DEL | 320916 | 1 | 57 | 4872 | 147 | 285 |
4 | cnv_gatk | DEL | 120 | 71 | 103 | 509 | 107 | 40 |
5 | NaN | DEL | 138451 | 1 | 17 | 52897 | 104 | 607 |
6 | source=cnv_reporter; | DEL | 124794 | 1 | 38 | 14949 | 99 | 202 |
7 | dragen_cnv | DEL | 18672 | 1 | 45 | 2440 | 60 | 93 |
8 | source=scramble; | DEL | 152877 | 2 | 54 | 1082 | 53 | 18 |
9 | source=scramble | DEL | 310582 | 1 | 51 | 512 | 51 | 17 |
10 | source=dragen_ploidy:copynumber=0.51 | DEL | 5 | 1 | 1 | 2 | 1 | 0 |
11 | source=user:pipeline; | DEL | 197 | 1 | 1 | 4 | 1 | 0 |
12 | source=karyotype | DEL | 715 | 1 | 1 | 2 | 1 | 0 |
13 | source=karyotype; | DEL | 346 | 1 | 1 | 3 | 1 | 0 |
14 | source=dragen_ploidy:copynumber=0.5 | DEL | 21 | 1 | 1 | 2 | 1 | 0 |
15 | source=dragen_ploidy:copynumber=0.0 | DEL | 25 | 1 | 1 | 1 | 1 | 0 |
16 | source=dragen_ploidy:copynumber=0.19 | DEL | 1 | 1 | 1 | 1 | 1 | 0 |
17 | source=dragen_ploidy:copynumber=0.49 | DEL | 6 | 1 | 1 | 1 | 1 | 0 |
18 | source=dragen_ploidy:copynumber=0.52 | DEL | 1 | 1 | 1 | 1 | 1 | 0 |
19 | source=dragen_ploidy:copynumber=0.53 | DEL | 2 | 1 | 1 | 1 | 1 | 0 |
WGS vs WES¶
Now lets only look at WGS samples:
mydefs = qh.add_many("""
create #wgs_pns# = nor -h XA/raw/samples.tsv | select id,panel_name | where (Contains(panel_name,'wgs'))
| select #1 | rename #1 PN;
""")
%%gor
$mydefs
nor [#wgs_pns#] | group -count
Query ran in 22.67 sec Query fetched 1 rows in 0.03 sec (total time 22.70 sec)
allCount | |
---|---|
0 | 22670 |
%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount
| inset -c pn [#wgs_pns#]
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
nor [temp] | where cnv_type = 'DUP'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 10
Query ran in 5.81 sec. Query fetched 10 rows in 0.01 sec (total time 5.82 sec)
info | cnv_type | Samples | min_cnvCount | med_cnvCount | max_cnvCount | avg_cnvCount | std_cnvCount | |
---|---|---|---|---|---|---|---|---|
0 | dragen_cnv | DUP | 18458 | 13 | 32 | 619 | 35 | 17 |
1 | source=cnv_reporter; | DUP | 177 | 1 | 6 | 27 | 6 | 5 |
2 | NaN | DUP | 254 | 1 | 1 | 91 | 3 | 11 |
3 | source=dragen_ploidy:copynumber=0.51 | DUP | 10 | 1 | 1 | 2 | 1 | 0 |
4 | source=dragen_ploidy:copynumber=0.5 | DUP | 24 | 1 | 1 | 2 | 1 | 0 |
5 | source=dragen_ploidy:copynumber=1.0 | DUP | 38 | 1 | 1 | 2 | 1 | 0 |
6 | source=dragen_ploidy:copynumber=0.0 | DUP | 20 | 1 | 1 | 1 | 1 | 0 |
7 | source=dragen_ploidy:copynumber=0.01 | DUP | 43 | 1 | 1 | 1 | 1 | 0 |
8 | source=dragen_ploidy:copynumber=0.02 | DUP | 7 | 1 | 1 | 1 | 1 | 0 |
9 | source=dragen_ploidy:copynumber=0.03 | DUP | 3 | 1 | 1 | 1 | 1 | 0 |
We see that while we have more genomic region to analyze, the avgerage count for cnv_reporter goes from 12 to 6. Dragen stays the same because most of the 18k samples are WGS. Now the deletions:
%%gor
$mydefs
create temp = nor [#cnvs_types#]
| group -gc info,cnv_type,pn -sum -ic sum_allcount
| inset -c pn [#wgs_pns#]
| rename sum_sum_allcount cnvCount
| group -gc info,cnv_type -min -max -avg -std -med -count -ic cnvCount;
nor [temp] | where cnv_type = 'DEL'
| rename allcount Samples | sort -c avg_cnvcount:rn | replace #6- int(float(#rc)) | top 10
Query ran in 6.18 sec. Query fetched 10 rows in 0.01 sec (total time 6.19 sec)
info | cnv_type | Samples | min_cnvCount | med_cnvCount | max_cnvCount | avg_cnvCount | std_cnvCount | |
---|---|---|---|---|---|---|---|---|
0 | source=cnv_reporter; | DEL | 180 | 7 | 115 | 1537 | 180 | 203 |
1 | dragen_cnv | DEL | 18457 | 1 | 45 | 2440 | 60 | 93 |
2 | source=scramble; | DEL | 1 | 9 | 9 | 9 | 9 | 0 |
3 | NaN | DEL | 406 | 1 | 1 | 275 | 6 | 28 |
4 | source=dragen_ploidy:copynumber=0.51 | DEL | 5 | 1 | 1 | 2 | 1 | 0 |
5 | source=dragen_ploidy:copynumber=0.5 | DEL | 21 | 1 | 1 | 2 | 1 | 0 |
6 | source=dragen_ploidy:copynumber=0.0 | DEL | 25 | 1 | 1 | 1 | 1 | 0 |
7 | source=dragen_ploidy:copynumber=0.19 | DEL | 1 | 1 | 1 | 1 | 1 | 0 |
8 | source=dragen_ploidy:copynumber=0.49 | DEL | 6 | 1 | 1 | 1 | 1 | 0 |
9 | source=dragen_ploidy:copynumber=0.52 | DEL | 1 | 1 | 1 | 1 | 1 | 0 |
We see that WGS dels for source=cnv_reporter; are 180 vs 99 for WES+WGS.
Projecting CNVs from multiple samples onto the genome¶
The most unbiased way to estimate the frequency of CNVs is simply to find out how many CNVs overlap a particular nucleotide in the genome. GOR has the SEGPROJ command that can be used to project segments onto the bases and to calculate the projection height from overlap of multiple segments.
The query below was executed in few minutes for all the 600k samples. In it we processed CNVs from 5 Info types separately, i.e. what we believe is 3 different algorithms. The rest is collapsed into a single "other" type.
The first SEGPROJ command collapses the data per PN for each combination of cnv_type and info separately. This is done because we noticed that in some cases the same sample may have more than one CNV of the same type that overlap the same space, hence, would cause that sample (PN) to contribute more than once to the over all frequency count. The next SEGPROJ command then projects the segments, giving us a segCount number that represents the number of PN overlapping a given segment range. The -maxseg option is used to specify the size of the largest segment. Here it needs to be big enough to cover the larges CNV in all the data set. Smaller number allows to optimize performance, i.e. speed and memory usage. Not an issue here when we ran this with a PARTGOR that takes 50k PNs at the time.
Finally, we combine the segCounts from each 50k partition and calculate the overall segment projection. We don't execute the query here but refer to the multiple output files that are generated with the -fork option in the WRITE command
def #xa_cnvs# = XA/vars/cnv2.gord;
create #pns# = nor -asdict XA/vars/cnv2.gord | select #2 | rename #1 PN | top 10000;
create #p# = partgor -dict #xa_cnvs# -partsize 50000 <(gor #xa_cnvs# -f #{tags}
| replace info if(info in ('cnv_reporter','source=cnv_reporter;','dragen_cnv','source=scramble;','source=scramble'),#rc,'other')
| segproj -gc cnv_type,info,pn -maxseg 250000000
| segproj -gc cnv_type,info -maxseg 250000000
);
gor [#p#] | segproj -gc cnv_type,info -maxseg 250000000 -sumcol segcount
| segspan -gc cnv_type,info,segcount -maxseg 250000000
| hide segcount | rename segcountx Samples
| calc fc replace(cnv_type+'_'+info,';','sc') | write -f fc -r user_data/hakon/XA_cnv/proj/#{fork}.gorz
%%gor
nor user_data/hakon/XA_cnv/proj | where right(filename,4)='gorz' | select filename,filesize
Query ran in 0.08 sec Query fetched 11 rows in 0.01 sec (total time 0.09 sec)
Filename | Filesize | |
---|---|---|
0 | DUP_cnv_reporter.gorz | 3454194 |
1 | DEL_source=scramblesc.gorz | 1038826 |
2 | DUP_source=cnv_reportersc.gorz | 2514174 |
3 | DUP_source=scramblesc.gorz | 119 |
4 | DEL_dragen_cnv.gorz | 1127070 |
5 | DUP_other.gorz | 3193421 |
6 | DEL_other.gorz | 4598861 |
7 | DUP_dragen_cnv.gorz | 562417 |
8 | DEL_source=scramble.gorz | 1432061 |
9 | DEL_cnv_reporter.gorz | 4278149 |
10 | DEL_source=cnv_reportersc.gorz | 4195546 |
We now combine all the Info type into a single count as well:
%%gor
def #path# = user_data/hakon/XA_cnv/proj;
gor #path#/DEL_cnv_reporter.gorz #path#/DEL_dragen_cnv.gorz #path#/DEL_other.gorz #path#/DEL_source=cnv_reportersc.gorz #path#/DEL_source=scramble.gorz #path#/DEL_source=scramblesc.gorz
| segproj -maxseg 250000000 -sumcol samples
| rename segcount Samples
| write user_data/hakon/XA_cnv/proj/DEL_all.gorz
Query ran in 0.49 sec Query fetched 0 rows in 3.94 sec (total time 4.43 sec)
Chrom | bpStart |
---|
%%gor
def #path# = user_data/hakon/XA_cnv/proj;
gor #path#/DUP_cnv_reporter.gorz #path#/DUP_dragen_cnv.gorz #path#/DUP_other.gorz #path#/DUP_source=cnv_reportersc.gorz #path#/DUP_source=scramblesc.gorz
| segproj -maxseg 250000000 -sumcol samples
| rename segcount Samples
| write user_data/hakon/XA_cnv/proj/DUP_all.gorz
Query ran in 0.08 sec Query fetched 0 rows in 2.30 sec (total time 2.38 sec)
Chrom | bpStart |
---|
Finally, we now generate a input to a Sequence Miner genome-browser track file with the following command:
%%gor
nor user_data/hakon/XA_cnv/proj | where right(filename,4)='gorz'
| calc name replace(filename,'.gorz','')
| calc file name+'.gorz'
| calc type if(left(name,3)='DEL','DEL','DUP')
| calc track '<track name="'+replace(name,type+'_','')+'" type="'+type+'_CNV_count" visible="true" selected="true">
ffiletrack(height=CNVheight,filename=CNVPath+"'+file+'", name="'+name+'", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="'+if(type='DEL','Green','Blue')+'",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>'
| select track
Query ran in 0.07 sec Query fetched 13 rows in 0.03 sec (total time 0.10 sec)
track | |
---|---|
0 | <track name="cnv_reporter" type="DUP_CNV_count... |
1 | <track name="source=scramblesc" type="DEL_CNV_... |
2 | <track name="source=cnv_reportersc" type="DUP_... |
3 | <track name="all" type="DUP_CNV_count" visible... |
4 | <track name="source=scramblesc" type="DUP_CNV_... |
5 | <track name="all" type="DEL_CNV_count" visible... |
6 | <track name="dragen_cnv" type="DEL_CNV_count" ... |
7 | <track name="other" type="DUP_CNV_count" visib... |
8 | <track name="other" type="DEL_CNV_count" visib... |
9 | <track name="dragen_cnv" type="DUP_CNV_count" ... |
10 | <track name="source=scramble" type="DEL_CNV_co... |
11 | <track name="cnv_reporter" type="DEL_CNV_count... |
12 | <track name="source=cnv_reportersc" type="DEL_... |
The genome browser track file is stored in user_data/hakon/XA_cnv/cnv_projections.gb and has the following content:
<tracks>
<global>
CNVPath="user_data/hakon/XA_cnv/proj/"
CNVheight=50
CNVmax = "1000"
gorPath = ""
lessthan = chr(60)
</global>
<track name="all" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_all.gorz", name="DUP_all", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=3, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="all" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_all.gorz", name="DEL_all", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=3, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="cnv_reporter" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_cnv_reporter.gorz", name="DUP_cnv_reporter", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=scramblesc" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_source=scramblesc.gorz", name="DEL_source=scramblesc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=cnv_reportersc" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_source=cnv_reportersc.gorz", name="DUP_source=cnv_reportersc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=scramblesc" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_source=scramblesc.gorz", name="DUP_source=scramblesc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="dragen_cnv" type="DEL_CNV_count" visible="true" selected="true">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_dragen_cnv.gorz", name="DEL_dragen_cnv", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="other" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_other.gorz", name="DUP_other", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="other" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_other.gorz", name="DEL_other", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="dragen_cnv" type="DUP_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DUP_dragen_cnv.gorz", name="DUP_dragen_cnv", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Blue",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=scramble" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_source=scramble.gorz", name="DEL_source=scramble", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="cnv_reporter" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_cnv_reporter.gorz", name="DEL_cnv_reporter", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
<track name="source=cnv_reportersc" type="DEL_CNV_count" visible="true" selected="false">
ffiletrack(height=CNVheight,filename=CNVPath+"DEL_source=cnv_reportersc.gorz", name="DEL_source=cnv_reportersc", isheader="true", idcol=-1, chrcol=0,startcol=1, stopcol=2, valuecol=5, primdim="", color="Green",painter=myPainter,scale="Linear",aggregate="Max",max=CNVmax,min="0",refline="None",stacking="true")
</track>
</tracks>
from IPython.display import Image
Image('SM_GB_CNVs.png',height=500,width=800)
Note that while the tracks shown above display sample counts, we could easily normalize these based on the total number of samples to give some kind of AF estimate for these CNVs.
Mapping CNVs to coding exons of genes.¶
In many cases it can be interesting to find how CNVs overlap with genes, i.e. which exons and how much of the coding region they cover.
Below we show a query that maps all the CNVs, separately for DEL and DUP (as stored in cnv_type renamed to alt). The query uses ##cnv_analysis_exons##, a table that have a single "super transcript" per gene, i.e. a projection of all the exons associated with the gene. It also stored the HGNC_ID of the gene, total coding region size, the exon number and the total number of exons per gene.
%%gor
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
gor ##cnv_analysis_exons## | granno chrom -gc hgnc_id -count | where allcount > 1 | top 3
Query ran in 0.20 sec Query fetched 3 rows in 0.13 sec (total time 0.32 sec)
chrom | chromstart | chromend | hgnc_id | strand | gene_coding_size | exon_num | exon_count | allCount | |
---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 861321 | 861393 | HGNC:28706 | + | 2321 | 1 | 17 | 17 |
1 | chr1 | 865534 | 865716 | HGNC:28706 | + | 2321 | 2 | 17 | 17 |
2 | chr1 | 866418 | 866469 | HGNC:28706 | + | 2321 | 3 | 17 | 17 |
The query below joins all the exons for all genes with all the CNVs from the samples. Notice that we use -maxseg 100000000 to indicate the maximum size of the CNVs in the "right-source". We then calculate the ex_overlap and the use MAP to add the start and stop coordinates of the correspoinding gene, which we make the GOR position columns with the COLUMNREORDER. Because we are switching the GOR ordering column, we must apply SORT to fix the ordering. In principle, a SORT with 3Mb should be enough (and make the query run faster), however, because of some discrepancies in locations of few genes (as compared with the transcript data), we got wrong-order error until we made the sliding sort window larger. The query below does still run fast and comletes the mapping of all the 600k samples to the 40k genes in half an hour or so.
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;
def #xa_cnvs# = XA/vars/cnv2.gord;
create #gene_cnvs# = partgor -dict #xa_cnvs# -partsize 10000 <(gor ##cnv_analysis_exons##
| join -segseg -maxseg 100000000 <(gor #xa_cnvs# -f #{tags}
| where info in ('cnv_reporter','source=cnv_reporter;','dragen_cnv','source=scramble;','source=scramble')
| rename #2 cnv_start | rename #3 cnv_stop | rename cnv_type alt | select 1-3,alt,pn)
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| columnreorder chrom,gene_start,gene_end,pn,hgnc_id,gene_symbol,alt
| sort chrom
| group 1 -gc gene_end,pn,hgnc_id,gene_symbol,alt,gene_coding_size,exon_count -sum -ic ex_ovlap,exon_num -min -max
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(min(cnv_overlap/gene_coding_size,1.0),4,4)
| select 1,2,gene_end,pn,hgnc_id,gene_symbol,alt,ex_ov_frac,gene_ov_frac
| atmax 1 -gc hgnc_id,pn gene_ov_frac
);
pgor [#gene_cnvs#] | write user_data/hakon/XA_cnv/gene_cnv_filt.gord
Query ran in 17.21 sec Query fetched 0 rows in 1484.12 sec (total time 1501.33 sec)
chrom | gene_start | gene_end | pn | hgnc_id | gene_symbol | alt | ex_ov_frac | gene_ov_frac |
---|
Notice that for each sample, we pick either DEL or DUP within each gene, using ATMAX based on gene overlap fraction.
%%gor
gor -p chr1:16384263 <(gor user_data/hakon/XA_cnv/gene_cnv_filt.gord | sort 1 -c ex_ov_frac:n | top 5
| merge <(gor user_data/hakon/XA_cnv/gene_cnv_filt.gord | sort 1 -c ex_ov_frac:nr | top 5))
Query ran in 0.13 sec Query fetched 10 rows in 0.76 sec (total time 0.89 sec)
chrom | gene_start | gene_end | pn | hgnc_id | gene_symbol | alt | ex_ov_frac | gene_ov_frac | |
---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 16384263 | 16400146 | 191079 | HGNC:26717 | FAM131C | DEL | 1.0000 | 1.0000 |
1 | chr1 | 16384263 | 16400146 | 191080 | HGNC:26717 | FAM131C | DEL | 1.0000 | 1.0000 |
2 | chr1 | 16384263 | 16400146 | 191081 | HGNC:26717 | FAM131C | DEL | 1.0000 | 1.0000 |
3 | chr1 | 16384263 | 16400146 | 192253 | HGNC:26717 | FAM131C | DEL | 1.0000 | 1.0000 |
4 | chr1 | 16384263 | 16400146 | 192254 | HGNC:26717 | FAM131C | DEL | 1.0000 | 1.0000 |
5 | chr1 | 16384263 | 16400146 | 494102 | HGNC:26717 | FAM131C | DEL | 0.1429 | 0.0261 |
6 | chr1 | 16384263 | 16400146 | 494193 | HGNC:26717 | FAM131C | DEL | 0.1429 | 0.0261 |
7 | chr1 | 16384263 | 16400146 | 494194 | HGNC:26717 | FAM131C | DEL | 0.1429 | 0.0261 |
8 | chr1 | 16384263 | 16400146 | 494622 | HGNC:26717 | FAM131C | DEL | 0.1429 | 0.0261 |
9 | chr1 | 16384263 | 16400146 | 494623 | HGNC:26717 | FAM131C | DEL | 0.1429 | 0.0261 |
Phasing the CNVs¶
Once we have the CNVs mapped to a "defined grid" (the genes) we can also start to compare them across samples and phase them. Below is a query that takes as input the table above and uses PEDPIVOT to restructure the data such that the paternity can easily be specified per row.
def #samples# = XA/raw/samples.tsv;
create #ped# = nor #samples#
| select id,case_id,relation
| pivot -gc case_id relation -v proband,father,mother -e ''
| where father_id != '' and mother_id != ''and proband_id != ''
| select #2-;
pgor user_data/hakon/XA_cnv/gene_cnv_filt.gord
| calc present if(gene_ov_frac > 0.1 or ex_ov_frac > 0.5,1,0)
| hide ex_ov_frac,gene_ov_frac
| pedpivot -gc #3,hgnc_id,gene_symbol,alt -e 0 PN [#ped#]
| where p_present = 1
| calc paternity if(f_present = 1 and m_present = 1,'Both', if(f_present = 1 and m_present = 0,'Paternal',if(f_present = 0 and m_present = 1,'Maternal','Unknown')))
/*
| where paternity in ('Paternal','Maternal')
*/
| select 1-P_pn,paternity
| rename p_pn PN
| write user_data/hakon/XA_cnv/gene_cnv_filt_phased.gord
As an example, we can look at phased CNVs in the RHD gene:
%%gor
gor -p chr1:25598976 user_data/hakon/XA_cnv/gene_cnv_phased.gord | where paternity in ('Paternal','Maternal')
Query ran in 0.42 sec Query fetched 5 rows in 0.03 sec (total time 0.45 sec)
chrom | gene_start | gene_end | hgnc_id | gene_symbol | alt | PN | paternity | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | DEL | 193957 | Paternal |
1 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | DEL | 605367 | Paternal |
2 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | DEL | 193965 | Maternal |
3 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | DEL | 193939 | Maternal |
4 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | DEL | 193976 | Maternal |
With the phased CNV genotypes, we can start to compare the paternity of CNV with the paternity of SNV in the same gene, i.e. to see if we have a CNV in trans with a SNP.
Oscar classified CNVs¶
Similarly, we can take the classified CNVs from Oscar and map the to the genes. We will do that separately for each classification lable. Note that here the join-order is reversed from what we saw above (this query was written before). While it is due to a very advanced topic on join performance, but because of the very large size of even just few of the CNVs and the nature of the JOIN implementation, most of the exons from the right-source can be fetched into the join-cache-buffer and cause unnecessary cartesian-join overhead, even for CNVs that are far away from them. By reversion the join order, as we did above, this overhead is dramatically less because of the small fraction of large CNVs. However, here, this is not an issue because the total CNVs from Oscar are only about 200k and NOT 108 million or so (the total XA CNVs)
%%gor
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;
create #oscar_cnv_gene_summary# = pgor ref_gregor/classvars/oscar_cnv.gorz
| select 1-gdx_classification
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_stop,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,gdx_classification,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| group chrom -gc hgnc_id,alt,gdx_classification -avg -max -fc gene_ov_frac,ex_ov_frac -count
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| rename allcount cnvCount
| select chrom,gene_start,gene_end,hgnc_id,gene_symbol,alt,gdx_classification,cnvCount,max_ex_ov_frac,avg_ex_ov_frac,max_gene_ov_frac,avg_gene_ov_frac
| sort chrom
;
gor [#oscar_cnv_gene_summary#] | write user_data/hakon/XA_cnv/oscar_cnv_gene.gorz
Query ran in 1.98 sec Query fetched 0 rows in 34.45 sec (total time 36.43 sec)
chrom | gene_start |
---|
Additionally, above, we were not only aggregating by gene (group chrom -gc hgnc_id) but also by PN, hence the cardinality per chromosome would be VERY large and the memory requirements. By reversion the join order we could use GROUP 1 -gc hgnc_id,PN, i.e. one gene at the time, thus only the cardinality of the PNs. The downside is however that we have to do the SORT, but SORT is gentle with regard to memory usage.
%%gor
gor -p chr1:25598976 user_data/hakon/XA_cnv/oscar_cnv_gene.gorz
Query ran in 0.66 sec Query fetched 8 rows in 0.00 sec (total time 0.66 sec)
chrom | gene_start | gene_end | hgnc_id | gene_symbol | alt | gdx_classification | cnvCount | max_ex_ov_frac | avg_ex_ov_frac | max_gene_ov_frac | avg_gene_ov_frac | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | GAIN | BEN | 2 | 0.9286 | 0.821450 | 0.8246 | 0.748200 |
1 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | GAIN | LBEN | 19 | 1.0000 | 0.954895 | 1.0000 | 0.923795 |
2 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | GAIN | PATH | 5 | 1.0000 | 1.000000 | 1.0000 | 1.000000 |
3 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | GAIN | UV | 8 | 1.0000 | 0.883925 | 1.0000 | 0.890213 |
4 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | LOSS | BEN | 1 | 0.7143 | 0.714300 | 0.6718 | 0.671800 |
5 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | LOSS | UV | 4 | 1.0000 | 1.000000 | 1.0000 | 1.000000 |
6 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | ROH | PATH | 2 | 1.0000 | 1.000000 | 1.0000 | 1.000000 |
7 | chr1 | 25598976 | 25656936 | HGNC:10009 | RHD | ROH | UV | 819 | 1.0000 | 1.000000 | 1.0000 | 1.000000 |
Gnomad CNVs¶
In Gnomad v 4.11 we have information on 50k CNVs from various populations. These CNVs have been normalized based on majority overlap and each CNV is then annotated based on the number of samples that overlap (SC) an the total number of samples with data in given location (SN). We use the total SC aggregated over the gene as the gene's SC but the maximum SN as the denominator in a AF estimate for CNVs in a given gene.
%%gor
gor user_data/hakon/gnomAD/cnv_all_v4_hg19.gorz | group genome -count
Query ran in 0.30 sec Query fetched 1 rows in 0.26 sec (total time 0.56 sec)
CHROM | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 52060 |
%%gor
def ##ref## = ref_gregor;
def ##cnv_analysis_exons## = ##ref##/cnv/cnv_analysis_exons.gorz;
def ##genes## = ##ref##/genemb/gmb_genes.gorz;
create #gnomad_cnv_gene_summary# = gor user_data/hakon/gnomAD/cnv_all_v4_hg19.gorz
| select 1-alt,sc,sn
| join -segseg -maxseg 100000 ##cnv_analysis_exons##
| calc ex_ovlap min(cnv_end,chromend)-max(cnv_start,chromstart)
| group 1 -gc #3,alt,sc,sn,hgnc_id,gene_coding_size,exon_count -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc ex_ov_frac form((max_exon_num-min_exon_num+1)/exon_count,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| group chrom -gc hgnc_id,alt -avg -max -fc gene_ov_frac,ex_ov_frac -count -sum -ic sc,sn
| rename allcount cnvCount
| map -c hgnc_id <(nor ##genes## | select hgnc_id,gene_start,gene_end,gene_symbol)
| select chrom,gene_start,gene_end,hgnc_id,gene_symbol,alt,cnvCount,max_ex_ov_frac,avg_ex_ov_frac,max_gene_ov_frac,avg_gene_ov_frac,sum_sc,max_sn
| calc AF sum_sc/max_sn
| sort chrom
| replace avg_*,max_* form(float(#rc),4,2)
| replace af form(af,4,6)
;
gor [#gnomad_cnv_gene_summary#] | write user_data/hakon/XA_cnv/gnomad_cnv_gene.gorz
Query ran in 1.06 sec Query fetched 0 rows in 3.34 sec (total time 4.40 sec)
CHROM | gene_start |
---|
Annotating CNVs with Oscar and Gnomad¶
First counting XA CNVs and then annotating:
%%gor
gor -p chr1:16384263 user_data/hakon/XA_cnv/gene_cnv.gord
| group 1 -gc 3,hgnc_id,gene_symbol,alt -count
| replace alt if(alt='DEL','LOSS','GAIN')
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Oscar user_data/hakon/XA_cnv/oscar_cnv_gene.gorz
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Gnom <(gor user_data/hakon/XA_cnv/gnomad_cnv_gene.gorz | replace alt if(alt='<DEL>','LOSS','GAIN'))
| select 1,2,gene_end,hgnc_id,gene_symbol,alt,allCount,Oscar_gdx_classification,Oscar_cnvCount,Oscar_max_ex_ov_frac,Oscar_avg_ex_ov_frac,Oscar_max_gene_ov_frac,Oscar_avg_gene_ov_frac,Gnom_cnvCount,Gnom_max_ex_ov_frac,Gnom_avg_ex_ov_frac,Gnom_max_gene_ov_frac,Gnom_avg_gene_ov_frac,Gnom_sum_SC,Gnom_max_SN,Gnom_AF
| granno 1 -gc hgnc_id,alt -set -sc oscar_gdx_classification
Query ran in 0.72 sec Query fetched 6 rows in 0.39 sec (total time 1.11 sec)
chrom | gene_start | gene_end | hgnc_id | gene_symbol | alt | allCount | Oscar_gdx_classification | Oscar_cnvCount | Oscar_max_ex_ov_frac | ... | Oscar_avg_gene_ov_frac | Gnom_cnvCount | Gnom_max_ex_ov_frac | Gnom_avg_ex_ov_frac | Gnom_max_gene_ov_frac | Gnom_avg_gene_ov_frac | Gnom_sum_SC | Gnom_max_SN | Gnom_AF | set_Oscar_gdx_classification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | LOSS | 374495 | PATH | 8 | 1.0 | ... | 0.963075 | 1 | 0.86 | 0.86 | 0.97 | 0.97 | 1 | 464297.0 | 0.000002 | PATH,UV |
1 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | LOSS | 374495 | UV | 9 | 1.0 | ... | 0.928556 | 1 | 0.86 | 0.86 | 0.97 | 0.97 | 1 | 464297.0 | 0.000002 | PATH,UV |
2 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | GAIN | 7837 | LPATH | 3 | 1.0 | ... | 1.000000 | 2 | 1.00 | 0.93 | 1.00 | 0.99 | 2 | 464297.0 | 0.000004 | LPATH,PATH,UV,VUS |
3 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | GAIN | 7837 | PATH | 2 | 1.0 | ... | 1.000000 | 2 | 1.00 | 0.93 | 1.00 | 0.99 | 2 | 464297.0 | 0.000004 | LPATH,PATH,UV,VUS |
4 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | GAIN | 7837 | UV | 7 | 1.0 | ... | 0.865943 | 2 | 1.00 | 0.93 | 1.00 | 0.99 | 2 | 464297.0 | 0.000004 | LPATH,PATH,UV,VUS |
5 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | GAIN | 7837 | VUS | 4 | 1.0 | ... | 0.920525 | 2 | 1.00 | 0.93 | 1.00 | 0.99 | 2 | 464297.0 | 0.000004 | LPATH,PATH,UV,VUS |
6 rows × 22 columns
Reading the phased CNVs for few samples and annotating them:
%%gor
gor -p chr1:25598976 user_data/hakon/XA_cnv/gene_cnv_phased.gord | where paternity in ('Paternal','Maternal')
| replace alt if(alt='DEL','LOSS','GAIN')
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Oscar user_data/hakon/XA_cnv/oscar_cnv_gene.gorz
| join -segseg -xl hgnc_id,alt -xr hgnc_id,alt -l -rprefix Gnom <(gor user_data/hakon/XA_cnv/gnomad_cnv_gene.gorz | replace alt if(alt='<DEL>','LOSS','GAIN'))
| select 1,2,gene_end,pn,paternity,hgnc_id,gene_symbol,alt,Oscar_gdx_classification,Oscar_cnvCount,Oscar_max_ex_ov_frac,Oscar_avg_ex_ov_frac,Oscar_max_gene_ov_frac,Oscar_avg_gene_ov_frac,Gnom_cnvCount,Gnom_max_ex_ov_frac,Gnom_avg_ex_ov_frac,Gnom_max_gene_ov_frac,Gnom_avg_gene_ov_frac,Gnom_sum_SC,Gnom_max_SN,Gnom_AF
| granno 1 -gc hgnc_id,alt -set -sc oscar_gdx_classification
Query ran in 0.23 sec Query fetched 10 rows in 0.15 sec (total time 0.38 sec)
chrom | gene_start | gene_end | PN | paternity | hgnc_id | gene_symbol | alt | Oscar_gdx_classification | Oscar_cnvCount | ... | Oscar_avg_gene_ov_frac | Gnom_cnvCount | Gnom_max_ex_ov_frac | Gnom_avg_ex_ov_frac | Gnom_max_gene_ov_frac | Gnom_avg_gene_ov_frac | Gnom_sum_SC | Gnom_max_SN | Gnom_AF | set_Oscar_gdx_classification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 25598976 | 25656936 | 193957 | Paternal | HGNC:10009 | RHD | LOSS | BEN | 1 | ... | 0.6718 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
1 | chr1 | 25598976 | 25656936 | 193957 | Paternal | HGNC:10009 | RHD | LOSS | UV | 4 | ... | 1.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
2 | chr1 | 25598976 | 25656936 | 605367 | Paternal | HGNC:10009 | RHD | LOSS | BEN | 1 | ... | 0.6718 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
3 | chr1 | 25598976 | 25656936 | 605367 | Paternal | HGNC:10009 | RHD | LOSS | UV | 4 | ... | 1.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
4 | chr1 | 25598976 | 25656936 | 193965 | Maternal | HGNC:10009 | RHD | LOSS | BEN | 1 | ... | 0.6718 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
5 | chr1 | 25598976 | 25656936 | 193965 | Maternal | HGNC:10009 | RHD | LOSS | UV | 4 | ... | 1.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
6 | chr1 | 25598976 | 25656936 | 193939 | Maternal | HGNC:10009 | RHD | LOSS | BEN | 1 | ... | 0.6718 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
7 | chr1 | 25598976 | 25656936 | 193939 | Maternal | HGNC:10009 | RHD | LOSS | UV | 4 | ... | 1.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
8 | chr1 | 25598976 | 25656936 | 193976 | Maternal | HGNC:10009 | RHD | LOSS | BEN | 1 | ... | 0.6718 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
9 | chr1 | 25598976 | 25656936 | 193976 | Maternal | HGNC:10009 | RHD | LOSS | UV | 4 | ... | 1.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BEN,UV |
10 rows × 23 columns
Instead of using the gene-based CNV statistics, we can also use the inhouse XA metrics that are on base-level. More cosly however but easily doable for few hundred CNVs.
%%gor
gor -p chr1:16384263 user_data/hakon/XA_cnv/gene_cnv.gord
| group 1 -gc 3,hgnc_id,gene_symbol,alt -count
| where alt = 'DUP'
| join -segseg -l -e 0 -maxseg 10000000 user_data/hakon/XA_cnv/proj/DUP_all.gorz
| group 1 -gc #3-allcount -min -max -ic samples
Query ran in 0.44 sec Query fetched 1 rows in 0.35 sec (total time 0.79 sec)
chrom | gene_start | gene_end | hgnc_id | gene_symbol | alt | allCount | min_Samples | max_Samples | |
---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 16384263 | 16400146 | HGNC:26717 | FAM131C | DUP | 7837 | 337 | 11868 |
We see that we have 7837 samples in XA that include the FAM131C DUP CNV. Regions of this gene do however have overlap with from 337 to 11868 samples. We can compare this with the distribution of the overlap of the individual samples:
%%gor data <<
nor <(gor -p chr1:16384263 user_data/hakon/XA_cnv/gene_cnv.gord | where alt = 'DUP' ) | select gene_ov_frac
Query ran in 0.57 sec Query fetched 7,837 rows in 0.29 sec (total time 0.86 sec)
plt.hist(data)
(array([7.000e+00, 1.200e+01, 1.200e+01, 8.210e+02, 2.090e+02, 3.000e+00, 3.992e+03, 1.241e+03, 6.570e+02, 8.830e+02]), array([0.0261 , 0.12349, 0.22088, 0.31827, 0.41566, 0.51305, 0.61044, 0.70783, 0.80522, 0.90261, 1. ]), <BarContainer object of 10 artists>)
This looks consistent with our number above, i.e. that only few of the samples have perfect CNV overlap with the gene and some of them very small.