CNV annotations for Gregor - draft¶

CNV_annotations

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "gdx-demo"
%env GOR_API_PROJECT={project}

Gregor¶

CNV view¶

Here we will describe the CNV query template used in Gregor, i.e. gregor_cnvs_anno.ftl.yml(). Some of the query code is similar to Gregor step2 that has been described elsewhere, such as the handling of candidate genes. We will try to focus on the aspects that are unique to the CNV, e.g. their segmental nature. We use the query helper as before and start by defining the path to the results from step1 and the candidate genes. For simplicity, we will skip the genepanelinfo file, because its us and handling is the same as in step2.

In [2]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

In [3]:

mydefs = qh.add_many("""
def ##ref## = ref;
def ##vep_path## = source/anno/vep_v106;

def ##vcf_wgs_anno## = user_data/Gregor1/DEMO_2796799_v3.gorz;
def ##candidate_gene_file## = studies/GCA724303/candidate_genes.gor;

def ##genepanelinfo## = ;
def ##defMOI## = 'all';
def ##fuzzdist## = 20;

def ##min_read_depth## = 8;
def ##min_hom_call_ratio## = 0.66;
def ##min_het_call_ratio## = 0.2;
def ##min_gt_likelihood## = 5;
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;
def ##max_cust_af## = ##rmax_af##;
def ##delta_ctrl## = 0;
def ##delta_cases## = 0;
def ##max_num_candgenes## = 10;
def ##disease_specificity_max_gene_count## = 20;
def ##min_lhz_overlap_fraction## = 0.1;
""")

# Filters
mydefs = qh.add_many("""
def ##vep_filter## = consequence in ('transcript_ablation','splice_acceptor_variant','splice_donor_variant','stop_gained','frameshift_variant','stop_lost','start_lost','transcript_amplification','inframe_insertion','inframe_deletion','missense_variant','protein_altering_variant','splice_region_variant','incomplete_terminal_codon_variant');
def ##vcf_filter## = ('Not LowQual');
def ##af_filter## = ref_af <= Gene_RmaxAf and ref_af_cust <= ##max_cust_af##;
def ##cat_filter## = DIAG_ACMGcat in ('Cat1','Cat1B','Cat1C');
def ##known_filter## = qc_known_vars = 1;
def ##protected_filter## = qc_protected_vars = 1;
def ##vcf_quality_filter## = qc_vcf_filter = 1;
def ##variant_scope_filter## =  2 = 2;
def ##repeat_regions_filter## =  qc_repeat_regions = 0 ;
def ##variant_filter## =  ( ##vep_filter## ) and ( ##af_filter## ) and ( ##repeat_regions_filter## ) and ( ##variant_scope_filter## )  and ( ##vcf_quality_filter## ) or ( ##cat_filter## ) or ( ##protected_filter## ) or ( ##known_filter## );
""")

mydefs = qh.add_many("""
def ##vep_single## = ##vep_path##/vepref_single_wgs.gord;
def ##vep_multi## = ##vep_path##/vepref_multi_wgs.gord;
def ##vep_clinical## = ##ref##/disvariants/clinical_variants_vepref_multi.gorz;
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def ##exons## = ref/refgenes/refgenes_exons.gorz;
def ##vep_impact_map## = ##ref##/VEP_impact.map;
""")

Reading data from step1¶

In our example for Gregor step1 and step2, we have been using a trio. We applied the same input parameters to the CNV YML, however, at present it does not make use of the parents in the CNV annotations, e.g. #pn_anno_vars# singles out the index case.

In [4]:

mydefs = qh.add_many("""
def ##wgs_vars## = source/var/wgs_varcalls.gord -s PN;
def ##segment_cov## = source/cov/segment_cov.gord -s PN;
def ##cand_vars## = source/var/candidate_variants.gord -s PN;
def ##cnvs## = source/var/cnv_varcalls.gord -s PN;
""")
# participant setup
mydefs = qh.add_many("""
def ##index_case## = 'DEMO_2796799';
def ##index_case_no_quotes## = 'DEMO_2796799';

create #pn_anno_vars# = gor ##vcf_wgs_anno##
| where pn = ##index_case##
| select 1,2,Reference,Call,PN,ref_af,gene_symbol,VEP_Max_Consequence,VEP_Impact,CallRatio,CallCopies,Depth,GT_IHE,GT_Paternity,GT_Info,diag_denovo,LHZ_vars,LHZ_size,lhz_transfrac,KNOWN_var_diseases,KNOWN_clinvar_stars,zygosity
| multimap -c vep_max_consequence -m 100 <(nor ref/VEP_impact.map | select #1,vep_rank)
| atmin 1 -gc #3,#4,gene_symbol vep_rank | hide vep_rank
;
""")

Candidate genes¶

The candidate genes are transformed as in step2.

In [5]:

mydefs = qh.add_many("""
create ##disease_specificity## = nor -h <(gor ##candidate_gene_file##
| where gene_candORparalog='c'
| split gene_candidate_phenotypes
| where not(gene_candidate_phenotypes like '*:par_*')
| distinct )
| granno -gc gene_candidate_phenotypes -count
| rename allcount hpo_geneCount
| atmin -gc gene_symbol hpo_geneCount
| group -gc gene_symbol -set -sc gene_candidate_phenotypes,hpo_geneCount
| rename set_hpo_genecount min_hpo_genecount
;

create ##alias_candidate_genes## = gor ##candidate_gene_file##
| calc ep gene_candidate_phenotypes | split ep
| map -c ep -h <(nor <(gor ##candidate_gene_file## | split gene_candidate_phenotypes
| group genome -gc gene_candidate_phenotypes -dis -sc gene_symbol)
| select GENE_candidate_Phenotypes,dis_GENE_symbol | calc pheno_significance 1.0/dis_GENE_symbol) | group 1 -gc 3-ep[-1] -sum -fc pheno_significance
| rename sum_pheno_significance Gene_Cand_PhenoScore | map -c gene_symbol ##gene_map## -n gene_aliases -m 'missing'
| split gene_aliases | replace gene_symbol if(gene_aliases != 'missing',gene_aliases,gene_symbol)
| hide gene_aliases
| distinct
| map -c gene_symbol -m 10000 -n min_hpo_genecount [##disease_specificity##]
;""")

Exon overlap¶

We want to analyze the overlap of CNVs with exons. However, exons are transcript dependent and we plan to report the CNVs on gene level, like the variants in step2. One option would be to pick a transcripts, e.g. a preferred transcript from gene panel info file or to select the largest or canonical transcript. Here, as shown in ##e##, we rather collapse the exons from all the transcripts by finding their overlap span using SEGSPAN and then calculate the size of the collapsed exons in es.

In our CNV overlap analysis, we want to list the range of the exons that overlap. Thus, we need to number them. For that, we borrow a small logic from the code in step1, where we self-join the exons by the gene_symbol, sum up upstream (bs, nif) and downstream (as, nir) values, and label the exon based on it strand.

The CNV exon overlap is then found by performing a -segseg join, calculating the overlap with each exon, then summing up for each CNV per gene the exon overlap. We also calculate the exon range overlapping from the mininum exon number and the largest. Finally, we calculate the overlap fraction of CNV and gene, e.g. cnv_ov_frac and gene_ov_frac.

In [6]:

mydefs = qh.add_many("""
create ##e## = gor ##exons##
| where transcript_biotype = 'protein_coding' and gene_symbol != 'na' | select 1-strand 
| segspan -gc gene_symbol,strand
| rename #2 chromstart | rename #3 chromend
| calc es #3-#2
;

create ##ex## = pgor [##e##]
| join -snpsnp -f 4000000 -xl gene_symbol -xr gene_symbol [##e##]
| calc bs if(chromstartx < chromstart,esx,0)
| calc as if(chromstartx > chromstart,esx,0)
| calc nif if(bs>0,1,0)
| calc nir if(as>0,1,0)
| group 1 -gc chromend,gene_symbol,gene_symbol,strand -sum -ic bs,as,nif,nir 
| calc exon_Num if(strand='+',sum_nif+1,sum_nir+1)
| calc numberOfExons sum_nif+sum_nir+1
| calc gene_coding_size sum_as+sum_bs+#3-#2
| select 1-strand,exon_Num-gene_coding_size
;

create #exon_overlap# = gor ##cnvs## -f ##index_case##
| select 1-5
| join -segseg -maxseg 100000 [##ex##]
| calc ex_ovlap min(bpstop,chromend)-max(bpstart,chromstart)
| group 1 -gc #3-alt,gene_symbol,numberOfExons,gene_coding_size -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc exon_range if(max_exon_num-min_exon_num+1=numberOfExons,'all of '+numberOfExons,
                     'ex '+replace(listdist(min_exon_num+','+max_exon_num),',','-')+' of '+numberOfExons)
| hide min_*,max_*,sum_*
| calc cnvsize #3-#2
| calc cnv_ov_frac form(cnv_overlap/cnvsize,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| hide numberOfExons;
""")

We can now inspect few of the CNV overlap rows:

In [7]:

%%gor
$mydefs
gor [#exon_overlap#] | where not(contains(exon_range,'all')) and cnvsize < 10000 | top 5

Query ran in 1.13 sec
Query fetched 5 rows in 0.01 sec (total time 1.14 sec)

Out[7]:

	CHROM	bpStart	bpStop	REF	ALT	gene_symbol	gene_coding_size	cnv_overlap	exon_range	cnvsize	cnv_ov_frac	gene_ov_frac
0	chr10	98355332	98355392	N	<DEL>	PIK3AP1	4800	59	ex 17 of 17	60	0.9833	0.0123
1	chr10	135369099	135372456	N	<DUP>	SYCE1	1789	634	ex 5-12 of 15	3357	0.1889	0.3544
2	chr11	7716914	7717219	A	<DEL>	OVCH2	3328	1	ex 12 of 17	305	0.0033	0.0003
3	chr11	67434043	67434166	N	<DEL>	ALDH3B2	2721	121	ex 6 of 12	123	0.9837	0.0445
4	chr11	114424905	114431925	G	<DEL>	NXPE1	4860	84	ex 1 of 8	7020	0.0120	0.0173

We can then pick the SYCE1 gene and see the overlap with the CNVs:

In [8]:

%%gor
$mydefs

gor [##ex##] | where gene_symbol = 'SYCE1' | calc es #3-#2 | join -segseg -l -e 0 <(gor ##cnvs## -f ##index_case## | select 1-3)

Query ran in 0.28 sec
Query fetched 15 rows in 0.27 sec (total time 0.55 sec)

Out[8]:

	Chrom	chromstart	chromend	gene_symbol	strand	exon_Num	numberOfExons	gene_coding_size	es	bpStart	bpStop
0	chr10	135367402	135367832	SYCE1	-	15	15	1789	430	0	0
1	chr10	135368350	135368633	SYCE1	-	14	15	1789	283	0	0
2	chr10	135368854	135368942	SYCE1	-	13	15	1789	88	0	0
3	chr10	135369100	135369211	SYCE1	-	12	15	1789	111	135369099	135372456
4	chr10	135369283	135369407	SYCE1	-	11	15	1789	124	135369099	135372456
5	chr10	135369484	135369551	SYCE1	-	10	15	1789	67	135369099	135372456
6	chr10	135370262	135370326	SYCE1	-	9	15	1789	64	135369099	135372456
7	chr10	135370570	135370660	SYCE1	-	8	15	1789	90	135369099	135372456
8	chr10	135371367	135371422	SYCE1	-	7	15	1789	55	135369099	135372456
9	chr10	135371670	135371718	SYCE1	-	6	15	1789	48	135369099	135372456
10	chr10	135372380	135372455	SYCE1	-	5	15	1789	75	135369099	135372456
11	chr10	135372804	135372864	SYCE1	-	4	15	1789	60	0	0
12	chr10	135373594	135373657	SYCE1	-	3	15	1789	63	0	0
13	chr10	135378960	135379087	SYCE1	-	2	15	1789	127	0	0
14	chr10	135381691	135381795	SYCE1	-	1	15	1789	104	0	0

And similarly, we can join the single CNV that overlaps the SYCE1 gene with its exons:

In [14]:

%%gor
$mydefs

gor -p chr10:135369099 ##cnvs## -f ##index_case## | select 1-3 | calc cnvsize #3-#2
| join -segseg -maxseg 100000 [##ex##]
| calc ex_ovlap min(bpstop,chromend)-max(bpstart,chromstart)
| granno chrom -sum -ic ex_ovlap

Query ran in 0.33 sec
Query fetched 8 rows in 0.02 sec (total time 0.35 sec)

Out[14]:

	CHROM	bpStart	bpStop	cnvsize	chromstart	chromend	gene_symbol	strand	exon_Num	numberOfExons	gene_coding_size	ex_ovlap	sum_ex_ovlap
0	chr10	135369099	135372456	3357	135369100	135369211	SYCE1	-	12	15	1789	111	634
1	chr10	135369099	135372456	3357	135369283	135369407	SYCE1	-	11	15	1789	124	634
2	chr10	135369099	135372456	3357	135369484	135369551	SYCE1	-	10	15	1789	67	634
3	chr10	135369099	135372456	3357	135370262	135370326	SYCE1	-	9	15	1789	64	634
4	chr10	135369099	135372456	3357	135370570	135370660	SYCE1	-	8	15	1789	90	634
5	chr10	135369099	135372456	3357	135371367	135371422	SYCE1	-	7	15	1789	55	634
6	chr10	135369099	135372456	3357	135371670	135371718	SYCE1	-	6	15	1789	48	634
7	chr10	135369099	135372456	3357	135372380	135372455	SYCE1	-	5	15	1789	75	634

Even though the CNV is fully covered by the gene, the low CNV overlap fraction of 0.1889 = 634/3357 is explained by the fact that most of it is covered by introns and the biggest exons are outside it, as we see in the es column of the previous query.

Variant annotations¶

We want to add some simple variant annotations to the CNVs. Here, we do this using the #exon_overlap# relation, because the variant annotations we are interested in here are based on genes, thus they should overlap this relation. From step1, #pn_anno_vars#, we have variants annotated on transcript/feature level, which is actually finer resolution that we prefer. Therefore, in the query below, we aggregate the variant annotations over the gene symbol. Following the -segsnp join between the CNV and the gene-based variants, we resolve these aggregations before we aggregate again per CNV.

Note: Rather than using the output from step1, one might also consider using the output from step2, where the final filtering and aggregation per gene symbol is performed. This would either call for including the logic of step2 in here, like we do in the gene overview report, or building a more complex dependency between the YMLs and feed the output from step2 as an input file to this CNV query, like we do with the output from step1.

In [10]:

mydefs = qh.add_many("""
create #anno_var_overlap# = gor [#exon_overlap#]
| select 1-gene_symbol
| join -segsnp -xl gene_symbol -xr gene_symbol <(gor [#pn_anno_vars#] | group 1 -gc gene_symbol -lis -sc VEP_Impact -max -ic known_clinvar_stars,LHZ_vars)
| calc high if(contains(lis_VEP_Impact,'HIGH'),1,0)
| calc moderate if(contains(lis_VEP_Impact,'MODERATE'),1,0)
| rename max_known_clinvar_stars known_clinvar_stars
| rename max_LHZ_vars LHZ_vars
| group 1 -gc #3-distance -ic high,moderate,known_clinvar_stars,LHZ_vars -sum -max
| rename sum_(.*) #{1}
| select 1-distance[-1],high,moderate,known_clinvar_stars,max_LHZ_vars
;""")

GDX CNV knowledgebase¶

Some of the most valuable annotations for CNVs are probably information about the overlap with known CNVs. The relation #GDX_cnvs# defines a subset of the GDX knowledgebase. Then, #GDX_ov# defines three levels, i.e. boolean flags for 25, 50, and 75 percent overlap, which are then aggregated for each CNV, in case there are many known CNVs overlapping each row.

In [11]:

mydefs = qh.add_many("""
create #GDX_cnvs# = gor ref_base/GDX/XA/confirmation_cnv.gorz 
| where (not(genotype in ('wt')) and relevance in ('1 - Causative Mutation','2 - Possibly Associated','5 - ACMG Secondary Finding') and report in ('True') and clinical_interp in ('LPATH','PATH'));

create #GDX_ov# = gor ##cnvs## -f ##index_case##
| select 1-5
| join -segseg -rprefix GDX [#GDX_cnvs#]
| calc overlap_size min(bpstop,gdx_cnv_end)-max(bpstart,gdx_cnv_start)
| calc gdx_cnv_size gdx_cnv_end-gdx_cnv_start
| calc gdx_ov_frac overlap_size/gdx_cnv_size
| calc ov25 if(gdx_ov_frac >= 0.25,1,0)
| calc ov50 if(gdx_ov_frac >= 0.50,1,0)
| calc ov75 if(gdx_ov_frac >= 0.75,1,0)
| group 1 -gc #3 -sum -ic ov25-ov75
| rename sum_(.*) GDXCNVs_#{1};
""")

In [12]:

%%gor
$mydefs

gor [#GDX_ov#]

Query ran in 0.13 sec
Query fetched 5 rows in 0.00 sec (total time 0.14 sec)

Out[12]:

	CHROM	bpStart	bpStop	GDXCNVs_ov25	GDXCNVs_ov50	GDXCNVs_ov75
0	chr1	861320	3440811	2	2	2
1	chr17	60741602	61565890	9	9	9
2	chr19	15473169	58657893	25	25	25
3	chr5	88032028	88032342	0	0	0
4	chr9	140772668	140773504	0	0	0

From above, we see that all the known CNVs for the first three rows overlap by 75% or more while for the last two it is less than 25%.

Het-Hom callratio analysis¶

CNVs, deletions or duplicatins, should impact the so-called B-allele-frequency or the callratio of non-reference allele reads. To estimate deviation from normal, in #hethom_avg#, we calculate here the average callratio of het and hom variants across the genome in non-CNV regions. We access all the sample variants in ##wgs_vars## and use negative join with CNVs to define the normal autosomal genom.

In #hethom#, we use inclusive join between the CNVs and the variants, to estimate these distribution of callratios of het and homs in the CNVs, i.e. the average and the standard deviation. While we are primarily interested in these metrics for the heterozygous variants, here we calculate it both for het and hom and then apply a student t-test to calculate a p-value significance. By removing the "hom" value in the -v option of PIVOT, we could easily suppress the columns representing the hom-calls. Similarly, we could relatively easily modify these calculation to estimate the ratio of het vs hom variants in the CNVs and compare them to other locations in the genome.

In [15]:

mydefs = qh.add_many("""
create #hethom_avg# = nor <(gor ##wgs_vars## -f ##index_case## | where not( #1 in ('chrX','chrY') )
| join -n -snpseg -maxseg 100000000 <(gor ##cnvs## -f ##index_case##)
| select 1-4,callcopies,callratio,depth | calc hethom if(callcopies=2,'hom','het') 
| group genome -gc hethom -avg -std -fc callratio,depth -count) | select #4-;


create #hethom# = gor ##cnvs## -f ##index_case##
| join -segsnp -r <(gor ##wgs_vars## -f ##index_case## 
                  | select 1-4,callcopies,callratio,depth | calc hethom if(callcopies=2,'hom','het') )
| group 1 -gc #3,hethom -avg -std -fc callratio,depth -count
| map -c hethom [#hethom_avg#]
| replace avg_callratio float(gform(float(#rc),3,3))
| calc t float(gform((avg_callratio-avg_callratiox)/sqrt(sqr(std_callratio)/allcount),3,3))
| calc p gform(if(contains(t,'Infinity'),1.0,1.0-student(max(allcount-1,1), abs(t))),3,3)
| select 1,2,bpStop,hethom,avg_callratio,allCount,t,p
| pivot -gc bpstop hethom -v het,hom -e 0
| rename het_allcount het_Vars
| rename hom_allcount hom_Vars
;""")

In [24]:

%%gor
$mydefs
gor [#hethom#] | where het_vars > 10 | top 10 | join -snpsnp <(gor ##cnvs## -f ##index_case## /* | select 1,2,3,svtype */)

Query ran in 0.47 sec
Query fetched 12 rows in 0.14 sec (total time 0.61 sec)

Out[24]:

	CHROM	bpStart	bpStop	het_avg_CallRatio	het_Vars	het_t	het_p	hom_avg_CallRatio	hom_Vars	hom_t	...	bpStopx	REF	ALT	QUAL	FILTER	cnvLen	svType	caller	formatZip	PN
0	chr1	861320	3440811	0.4790	697	-0.834	2.020000e-01	0.993	1264	-2.330	...	3440811	N	<DUP>	.	PASS	2579490	CNV	CNVREPORTER	GT=./1:SM=.:CN=0.67:BC=.:PE=.:QS=99	DEMO_2796799
1	chr1	161569418	161600885	0.4020	21	-1.710	5.140000e-02	1.000	23	inf	...	161600885	N	<DUP>	.	PASS	31466	CNV	CNVREPORTER	GT=0/0:SM=.:CN=.:BC=.:PE=.:QS=.	DEMO_2796799
2	chr10	39112347	42379386	0.4670	258	-1.240	1.080000e-01	0.962	228	-3.150	...	42379386	T	<DEL>	.	PASS	3267039	DEL	SCRAMBLE	GT=0/1:GQ=99:DP=169:AD=8:LSC=0:RSC=8	DEMO_2796799
3	chr10	42356843	42380124	0.4580	202	-1.760	4.000000e-02	0.959	134	-2.450	...	42380124	C	<DEL>	.	PASS	23281	DEL	SCRAMBLE	GT=0/1:GQ=99:DP=266:AD=10:LSC=0:RSC=10	DEMO_2796799
4	chr10	42384736	42396790	0.4780	256	-0.434	3.320000e-01	0.967	169	-3.940	...	42396790	C	<DEL>	.	PASS	12054	DEL	SCRAMBLE	GT=0/1:GQ=99:DP=872:AD=71:LSC=71:RSC=0	DEMO_2796799
5	chr10	42387487	42596876	0.4730	456	-0.981	1.640000e-01	0.970	345	-3.970	...	42596876	A	<DEL>	.	PASS	209389	DEL	SCRAMBLE	GT=0/1:GQ=99:DP=1850:AD=57:LSC=57:RSC=0	DEMO_2796799
6	chr10	124343836	124354045	0.4980	11	0.206	4.200000e-01	1.000	3	inf	...	124354045	N	<DUP>	.	PASS	10208	CNV	CNVREPORTER	GT=./1:SM=.:CN=1.45:BC=9:PE=.:QS=26	DEMO_2796799
7	chr11	56143368	56143433	0.0918	14	-180.000	0.000000e+00	0.000	0	0.000	...	56143433	T	TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG...	.	PASS	65	INS	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799
8	chr11	56143368	56143433	0.0918	14	-180.000	0.000000e+00	0.000	0	0.000	...	56143434	T	<DEL>	.	PASS	66	DEL	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799
9	chr11	56143368	56143434	0.1230	15	-12.000	4.680000e-09	0.000	0	0.000	...	56143433	T	TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG...	.	PASS	65	INS	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799
10	chr11	56143368	56143434	0.1230	15	-12.000	4.680000e-09	0.000	0	0.000	...	56143434	T	<DEL>	.	PASS	66	DEL	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799
11	chr16	32297076	33294648	0.4110	276	-6.960	1.250000e-11	0.993	329	-0.913	...	33294648	T	<DEL>	.	PASS	997572	DEL	SCRAMBLE	GT=0/1:GQ=99:DP=9:AD=6:LSC=0:RSC=6	DEMO_2796799

12 rows × 21 columns

Notice the exceptionally low callratio for two CNVs on chr11. Strangely, it covers both INS and DEL, indicating that there is something strange going on with the caller!

CNV seq-read coverage¶

For completion, we want to calculate the average sequence read depth in the CNVs taken from the ##segment_cov## relation. The logic is a simplified version of the gene coverage; in bases, we find the potential partial overlap between the CNVs and the coverage segments, and in basedepth we adjust for the depth rounding used in segment-cov.

In [25]:

mydefs = qh.add_many("""
create ##cnv_coverage## = gor ##cnvs## -f ##index_case## 
| rename #2 cnv_start
| rename #3 cnv_end
| calc cnv_size #3-#2
| join -segseg -l  -e '-1' -maxseg 10000  <(gor ##segment_cov## -f ##index_case## ) 
| calc bases if(depth = -1,cnv_end-cnv_start,min(cnv_end,bpstop)-max(cnv_start,bpstart)) 
| calc basedepth if(depth = -1, 0,if(depth<5,depth,if(depth<10,depth+1,if(depth<30,depth+2,if(depth<50,depth+5,depth+12))))*bases) 
| group 1 -gc #3,cnv_size -sum -ic basedepth,bases
| calc avg_depth int(sum_basedepth/sum_bases)
| select 1-3,avg_depth
;""")

In [28]:

%%gor
$mydefs
gor -p chr11:56143368 ##cnvs## -f ##index_case## | join -snpsnp [##cnv_coverage##] | join -snpseg -maxseg 10000 -e 0 <(gor ##segment_cov## -f ##index_case## )

Query ran in 0.69 sec
Query fetched 4 rows in 0.32 sec (total time 1.01 sec)

Out[28]:

	CHROM	bpStart	bpStop	REF	ALT	QUAL	FILTER	cnvLen	svType	caller	formatZip	PN	cnv_end	avg_depth	bpStartx	bpStopx	Depth	PNx
0	chr11	56143368	56143433	T	TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG...	.	PASS	65	INS	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799	56143433	212	56143355	56143434	200	DEMO_2796799
1	chr11	56143368	56143433	T	TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG...	.	PASS	65	INS	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799	56143434	212	56143355	56143434	200	DEMO_2796799
2	chr11	56143368	56143434	T	<DEL>	.	PASS	66	DEL	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799	56143433	212	56143355	56143434	200	DEMO_2796799
3	chr11	56143368	56143434	T	<DEL>	.	PASS	66	DEL	SCRAMBLE	GT=0\|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5	DEMO_2796799	56143434	212	56143355	56143434	200	DEMO_2796799

We see that our selected CNV has coverage reported by Scramble that is 392, while our segment coverage is 200 and the avg_depth estimate is 212 due to the fact that in segment cov we store a rounded depth as floor(depth/25.0)*25, when it exceeds 50. The difference in depth between Scramble and segment cov might be due to different filtering of reads, possibly due to filtering based on alignment mapping quality or base quality. To resolve this question, we dive into the BAM/CRAM and calculate the depth per base, without any filtering, e.g. using PILEUP with -q 0 -bq 0 options. We see that our average CNV coverage estimated from ##segment_cov## is indeed correct!

In [29]:

%%gor
gor -p chr11:56142368-56143434 source/samples/2796000-2796999/2796799/DEMO_2796799.bam 
| pileup -depth -q 0 -bq 0
| where pos >= 56143368 and pos <= 56143434
| granno chrom -avg -std -count -ic depth
| replace avg_depth,std_depth form(float(#rc),4,1)

Query ran in 0.97 sec
Query fetched 67 rows in 0.03 sec (total time 1.00 sec)

Out[29]:

	Chrom	Pos	RefBase	Depth	allCount	avg_Depth	std_Depth
0	chr11	56143368	A	211	67	212.0	4.2
1	chr11	56143369	T	212	67	212.0	4.2
2	chr11	56143370	G	205	67	212.0	4.2
3	chr11	56143371	T	204	67	212.0	4.2
4	chr11	56143372	T	204	67	212.0	4.2
...	...	...	...	...	...	...	...
62	chr11	56143430	G	210	67	212.0	4.2
63	chr11	56143431	A	210	67	212.0	4.2
64	chr11	56143432	A	210	67	212.0	4.2
65	chr11	56143433	T	210	67	212.0	4.2
66	chr11	56143434	C	210	67	212.0	4.2

67 rows × 7 columns

Putting it all together¶

The final query simply reads all the CNVs and annotates them with the relations we produced above. Because some CNVs may span multiple genes, we issue a cnvID to make it easy to visualize them and also calculate the number of overlapping genes via GRANNO as dis_gene_symbol. We then use COLUMNREORDER to organize some of the most informative columns to left.

In [30]:

%%gor
$mydefs

gor ##cnvs## -f ##index_case##
| select 1-5
| rownum | rename rownum cnv_ID | replace cnv_ID 'cnv'+#rc
| calc cnv_size #3-#2
| join -snpsnp -l -xl #3-#5 -xr #3-#5 -l [#exon_overlap#]
| hide bpstopx,refx,altx
| join -snpsnp -l -xl #3-#5,gene_symbol -xr #3-#5,gene_symbol -l -rprefix var [#anno_var_overlap#]

| hide var_bpstop,var_ref,var_alt,var_gene_symbol
| granno 1 -gc #3-#5 -dis -sc gene_symbol
| hide gene_coding_size,cnv_overlap,cnvsize

| join -snpsnp -xl #3 -xr cnv_end -r -l -e 0 [##cnv_coverage##] | hide cnv_end
| prefix exon_range-dis_gene_symbol[-1] gene
| prefix dis_gene_symbol- cnv
| join -snpsnp -xl #3 -xr #3 -r -l -e 0 -rprefix cnv [#hethom#] | hide cnv_bpstop
| calc cnv_var_count cnv_het_vars+cnv_hom_vars
| rename #2 CNV_Start
| rename #3 CNV_End
| rename gene_var_max_LHZ_vars gene_var_LHZ
| rename gene_gene_ov_frac gene_ov_frac
| hide ref
| replace gene_ov_frac-gene_var_LHZ if(#rc='','0',#rc)
| join -snpsnp -xl #3 -xr #3 -r -l -e 0 -rprefix cnv [#GDX_ov#] | hide cnv_bpStop
| columnreorder 1,2,CNV_End,ALT,cnv_ID,cnv_size,CNV_GDXCNVS_OV25,cnv_dis_gene_symbol,cnv_avg_depth,cnv_var_count,gene_symbol,gene_exon_range,gene_cnv_ov_frac,gene_ov_frac,gene_var_high,gene_var_moderate,gene_var_KNOWN_clinvar_stars,gene_var_LHZ /* client filters: */

| join -varseg -r -f 10000 -l -xl gene_symbol -xr gene_symbol [##alias_candidate_genes##]
| hide gene_symbolx,Gene_Cand_PhenoScore,min_hpo_genecount

Query ran in 0.91 sec
Query fetched 1,148 rows in 0.09 sec (total time 1.00 sec)

Out[30]:

	CHROM	CNV_Start	CNV_End	ALT	cnv_ID	cnv_size	cnv_GDXCNVs_ov25	cnv_dis_gene_symbol	cnv_avg_depth	cnv_var_count	...	cnv_hom_Vars	cnv_hom_t	cnv_hom_p	cnv_GDXCNVs_ov50	cnv_GDXCNVs_ov75	GENE_candidate_Phenotypes	GENE_candidate_paralogs	GENE_candORparalog	GENE_PHRANK	GENE_PHRANK_norm
0	chr1	861320	3440811	<DUP>	cnv1	2579491	2	66	52	1961	...	1264	-2.33	0.00998	2	2	NaN	NaN	NaN	NaN	NaN
1	chr1	861320	3440811	<DUP>	cnv1	2579491	2	66	52	1961	...	1264	-2.33	0.00998	2	2	NaN	NaN	NaN	NaN	NaN
2	chr1	861320	3440811	<DUP>	cnv1	2579491	2	66	52	1961	...	1264	-2.33	0.00998	2	2	NaN	NaN	NaN	NaN	NaN
3	chr1	861320	3440811	<DUP>	cnv1	2579491	2	66	52	1961	...	1264	-2.33	0.00998	2	2	NaN	NaN	NaN	NaN	NaN
4	chr1	861320	3440811	<DUP>	cnv1	2579491	2	66	52	1961	...	1264	-2.33	0.00998	2	2	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1143	chrX	155251243	155252494	<DEL>	cnv120	1251	0	1	120	12	...	1	inf	1.00000	0	0	NaN	NaN	NaN	NaN	NaN
1144	chrY	13651394	13808320	<DEL>	cnv121	156926	0	1	9	276	...	166	-1.24	0.10800	0	0	NaN	NaN	NaN	NaN	NaN
1145	chrY	13704402	13801064	<DEL>	cnv122	96662	0	1	4	95	...	53	-1.22	0.11400	0	0	NaN	NaN	NaN	NaN	NaN
1146	chrY	13851738	13851744	ACTGTAC	cnv123	6	0	1	74	0	...	0	0.00	0.00000	0	0	NaN	NaN	NaN	NaN	NaN
1147	chrY	13851738	13851950	<DEL>	cnv124	212	0	1	7	0	...	0	0.00	0.00000	0	0	NaN	NaN	NaN	NaN	NaN

1148 rows × 33 columns

In [ ]: