CNV annotations for Gregor - draft¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "gdx-demo"
%env GOR_API_PROJECT={project}
Gregor¶
CNV view¶
Here we will describe the CNV query template used in Gregor, i.e. gregor_cnvs_anno.ftl.yml(). Some of the query code is similar to Gregor step2 that has been described elsewhere, such as the handling of candidate genes. We will try to focus on the aspects that are unique to the CNV, e.g. their segmental nature. We use the query helper as before and start by defining the path to the results from step1 and the candidate genes. For simplicity, we will skip the genepanelinfo file, because its us and handling is the same as in step2.
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def ##ref## = ref;
def ##vep_path## = source/anno/vep_v106;
def ##vcf_wgs_anno## = user_data/Gregor1/DEMO_2796799_v3.gorz;
def ##candidate_gene_file## = studies/GCA724303/candidate_genes.gor;
def ##genepanelinfo## = ;
def ##defMOI## = 'all';
def ##fuzzdist## = 20;
def ##min_read_depth## = 8;
def ##min_hom_call_ratio## = 0.66;
def ##min_het_call_ratio## = 0.2;
def ##min_gt_likelihood## = 5;
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;
def ##max_cust_af## = ##rmax_af##;
def ##delta_ctrl## = 0;
def ##delta_cases## = 0;
def ##max_num_candgenes## = 10;
def ##disease_specificity_max_gene_count## = 20;
def ##min_lhz_overlap_fraction## = 0.1;
""")
# Filters
mydefs = qh.add_many("""
def ##vep_filter## = consequence in ('transcript_ablation','splice_acceptor_variant','splice_donor_variant','stop_gained','frameshift_variant','stop_lost','start_lost','transcript_amplification','inframe_insertion','inframe_deletion','missense_variant','protein_altering_variant','splice_region_variant','incomplete_terminal_codon_variant');
def ##vcf_filter## = ('Not LowQual');
def ##af_filter## = ref_af <= Gene_RmaxAf and ref_af_cust <= ##max_cust_af##;
def ##cat_filter## = DIAG_ACMGcat in ('Cat1','Cat1B','Cat1C');
def ##known_filter## = qc_known_vars = 1;
def ##protected_filter## = qc_protected_vars = 1;
def ##vcf_quality_filter## = qc_vcf_filter = 1;
def ##variant_scope_filter## = 2 = 2;
def ##repeat_regions_filter## = qc_repeat_regions = 0 ;
def ##variant_filter## = ( ##vep_filter## ) and ( ##af_filter## ) and ( ##repeat_regions_filter## ) and ( ##variant_scope_filter## ) and ( ##vcf_quality_filter## ) or ( ##cat_filter## ) or ( ##protected_filter## ) or ( ##known_filter## );
""")
mydefs = qh.add_many("""
def ##vep_single## = ##vep_path##/vepref_single_wgs.gord;
def ##vep_multi## = ##vep_path##/vepref_multi_wgs.gord;
def ##vep_clinical## = ##ref##/disvariants/clinical_variants_vepref_multi.gorz;
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def ##exons## = ref/refgenes/refgenes_exons.gorz;
def ##vep_impact_map## = ##ref##/VEP_impact.map;
""")
Reading data from step1¶
In our example for Gregor step1 and step2, we have been using a trio. We applied the same input parameters to the CNV YML, however, at present it does not make use of the parents in the CNV annotations, e.g. #pn_anno_vars# singles out the index case.
mydefs = qh.add_many("""
def ##wgs_vars## = source/var/wgs_varcalls.gord -s PN;
def ##segment_cov## = source/cov/segment_cov.gord -s PN;
def ##cand_vars## = source/var/candidate_variants.gord -s PN;
def ##cnvs## = source/var/cnv_varcalls.gord -s PN;
""")
# participant setup
mydefs = qh.add_many("""
def ##index_case## = 'DEMO_2796799';
def ##index_case_no_quotes## = 'DEMO_2796799';
create #pn_anno_vars# = gor ##vcf_wgs_anno##
| where pn = ##index_case##
| select 1,2,Reference,Call,PN,ref_af,gene_symbol,VEP_Max_Consequence,VEP_Impact,CallRatio,CallCopies,Depth,GT_IHE,GT_Paternity,GT_Info,diag_denovo,LHZ_vars,LHZ_size,lhz_transfrac,KNOWN_var_diseases,KNOWN_clinvar_stars,zygosity
| multimap -c vep_max_consequence -m 100 <(nor ref/VEP_impact.map | select #1,vep_rank)
| atmin 1 -gc #3,#4,gene_symbol vep_rank | hide vep_rank
;
""")
Candidate genes¶
The candidate genes are transformed as in step2.
mydefs = qh.add_many("""
create ##disease_specificity## = nor -h <(gor ##candidate_gene_file##
| where gene_candORparalog='c'
| split gene_candidate_phenotypes
| where not(gene_candidate_phenotypes like '*:par_*')
| distinct )
| granno -gc gene_candidate_phenotypes -count
| rename allcount hpo_geneCount
| atmin -gc gene_symbol hpo_geneCount
| group -gc gene_symbol -set -sc gene_candidate_phenotypes,hpo_geneCount
| rename set_hpo_genecount min_hpo_genecount
;
create ##alias_candidate_genes## = gor ##candidate_gene_file##
| calc ep gene_candidate_phenotypes | split ep
| map -c ep -h <(nor <(gor ##candidate_gene_file## | split gene_candidate_phenotypes
| group genome -gc gene_candidate_phenotypes -dis -sc gene_symbol)
| select GENE_candidate_Phenotypes,dis_GENE_symbol | calc pheno_significance 1.0/dis_GENE_symbol) | group 1 -gc 3-ep[-1] -sum -fc pheno_significance
| rename sum_pheno_significance Gene_Cand_PhenoScore | map -c gene_symbol ##gene_map## -n gene_aliases -m 'missing'
| split gene_aliases | replace gene_symbol if(gene_aliases != 'missing',gene_aliases,gene_symbol)
| hide gene_aliases
| distinct
| map -c gene_symbol -m 10000 -n min_hpo_genecount [##disease_specificity##]
;""")
Exon overlap¶
We want to analyze the overlap of CNVs with exons. However, exons are transcript dependent and we plan to report the CNVs on gene level, like the variants in step2. One option would be to pick a transcripts, e.g. a preferred transcript from gene panel info file or to select the largest or canonical transcript. Here, as shown in ##e##, we rather collapse the exons from all the transcripts by finding their overlap span using SEGSPAN and then calculate the size of the collapsed exons in es.
In our CNV overlap analysis, we want to list the range of the exons that overlap. Thus, we need to number them. For that, we borrow a small logic from the code in step1, where we self-join the exons by the gene_symbol, sum up upstream (bs, nif) and downstream (as, nir) values, and label the exon based on it strand.
The CNV exon overlap is then found by performing a -segseg join, calculating the overlap with each exon, then summing up for each CNV per gene the exon overlap. We also calculate the exon range overlapping from the mininum exon number and the largest. Finally, we calculate the overlap fraction of CNV and gene, e.g. cnv_ov_frac and gene_ov_frac.
mydefs = qh.add_many("""
create ##e## = gor ##exons##
| where transcript_biotype = 'protein_coding' and gene_symbol != 'na' | select 1-strand
| segspan -gc gene_symbol,strand
| rename #2 chromstart | rename #3 chromend
| calc es #3-#2
;
create ##ex## = pgor [##e##]
| join -snpsnp -f 4000000 -xl gene_symbol -xr gene_symbol [##e##]
| calc bs if(chromstartx < chromstart,esx,0)
| calc as if(chromstartx > chromstart,esx,0)
| calc nif if(bs>0,1,0)
| calc nir if(as>0,1,0)
| group 1 -gc chromend,gene_symbol,gene_symbol,strand -sum -ic bs,as,nif,nir
| calc exon_Num if(strand='+',sum_nif+1,sum_nir+1)
| calc numberOfExons sum_nif+sum_nir+1
| calc gene_coding_size sum_as+sum_bs+#3-#2
| select 1-strand,exon_Num-gene_coding_size
;
create #exon_overlap# = gor ##cnvs## -f ##index_case##
| select 1-5
| join -segseg -maxseg 100000 [##ex##]
| calc ex_ovlap min(bpstop,chromend)-max(bpstart,chromstart)
| group 1 -gc #3-alt,gene_symbol,numberOfExons,gene_coding_size -min -max -sum -ic exon_num,ex_ovlap
| rename sum_ex_ovlap cnv_overlap
| calc exon_range if(max_exon_num-min_exon_num+1=numberOfExons,'all of '+numberOfExons,
'ex '+replace(listdist(min_exon_num+','+max_exon_num),',','-')+' of '+numberOfExons)
| hide min_*,max_*,sum_*
| calc cnvsize #3-#2
| calc cnv_ov_frac form(cnv_overlap/cnvsize,4,4)
| calc gene_ov_frac form(cnv_overlap/gene_coding_size,4,4)
| hide numberOfExons;
""")
We can now inspect few of the CNV overlap rows:
%%gor
$mydefs
gor [#exon_overlap#] | where not(contains(exon_range,'all')) and cnvsize < 10000 | top 5
Query ran in 1.13 sec Query fetched 5 rows in 0.01 sec (total time 1.14 sec)
CHROM | bpStart | bpStop | REF | ALT | gene_symbol | gene_coding_size | cnv_overlap | exon_range | cnvsize | cnv_ov_frac | gene_ov_frac | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr10 | 98355332 | 98355392 | N | <DEL> | PIK3AP1 | 4800 | 59 | ex 17 of 17 | 60 | 0.9833 | 0.0123 |
1 | chr10 | 135369099 | 135372456 | N | <DUP> | SYCE1 | 1789 | 634 | ex 5-12 of 15 | 3357 | 0.1889 | 0.3544 |
2 | chr11 | 7716914 | 7717219 | A | <DEL> | OVCH2 | 3328 | 1 | ex 12 of 17 | 305 | 0.0033 | 0.0003 |
3 | chr11 | 67434043 | 67434166 | N | <DEL> | ALDH3B2 | 2721 | 121 | ex 6 of 12 | 123 | 0.9837 | 0.0445 |
4 | chr11 | 114424905 | 114431925 | G | <DEL> | NXPE1 | 4860 | 84 | ex 1 of 8 | 7020 | 0.0120 | 0.0173 |
We can then pick the SYCE1 gene and see the overlap with the CNVs:
%%gor
$mydefs
gor [##ex##] | where gene_symbol = 'SYCE1' | calc es #3-#2 | join -segseg -l -e 0 <(gor ##cnvs## -f ##index_case## | select 1-3)
Query ran in 0.28 sec Query fetched 15 rows in 0.27 sec (total time 0.55 sec)
Chrom | chromstart | chromend | gene_symbol | strand | exon_Num | numberOfExons | gene_coding_size | es | distance | bpStart | bpStop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr10 | 135367402 | 135367832 | SYCE1 | - | 15 | 15 | 1789 | 430 | 0 | 0 | 0 |
1 | chr10 | 135368350 | 135368633 | SYCE1 | - | 14 | 15 | 1789 | 283 | 0 | 0 | 0 |
2 | chr10 | 135368854 | 135368942 | SYCE1 | - | 13 | 15 | 1789 | 88 | 0 | 0 | 0 |
3 | chr10 | 135369100 | 135369211 | SYCE1 | - | 12 | 15 | 1789 | 111 | 0 | 135369099 | 135372456 |
4 | chr10 | 135369283 | 135369407 | SYCE1 | - | 11 | 15 | 1789 | 124 | 0 | 135369099 | 135372456 |
5 | chr10 | 135369484 | 135369551 | SYCE1 | - | 10 | 15 | 1789 | 67 | 0 | 135369099 | 135372456 |
6 | chr10 | 135370262 | 135370326 | SYCE1 | - | 9 | 15 | 1789 | 64 | 0 | 135369099 | 135372456 |
7 | chr10 | 135370570 | 135370660 | SYCE1 | - | 8 | 15 | 1789 | 90 | 0 | 135369099 | 135372456 |
8 | chr10 | 135371367 | 135371422 | SYCE1 | - | 7 | 15 | 1789 | 55 | 0 | 135369099 | 135372456 |
9 | chr10 | 135371670 | 135371718 | SYCE1 | - | 6 | 15 | 1789 | 48 | 0 | 135369099 | 135372456 |
10 | chr10 | 135372380 | 135372455 | SYCE1 | - | 5 | 15 | 1789 | 75 | 0 | 135369099 | 135372456 |
11 | chr10 | 135372804 | 135372864 | SYCE1 | - | 4 | 15 | 1789 | 60 | 0 | 0 | 0 |
12 | chr10 | 135373594 | 135373657 | SYCE1 | - | 3 | 15 | 1789 | 63 | 0 | 0 | 0 |
13 | chr10 | 135378960 | 135379087 | SYCE1 | - | 2 | 15 | 1789 | 127 | 0 | 0 | 0 |
14 | chr10 | 135381691 | 135381795 | SYCE1 | - | 1 | 15 | 1789 | 104 | 0 | 0 | 0 |
And similarly, we can join the single CNV that overlaps the SYCE1 gene with its exons:
%%gor
$mydefs
gor -p chr10:135369099 ##cnvs## -f ##index_case## | select 1-3 | calc cnvsize #3-#2
| join -segseg -maxseg 100000 [##ex##]
| calc ex_ovlap min(bpstop,chromend)-max(bpstart,chromstart)
| granno chrom -sum -ic ex_ovlap
Query ran in 0.33 sec Query fetched 8 rows in 0.02 sec (total time 0.35 sec)
CHROM | bpStart | bpStop | cnvsize | distance | chromstart | chromend | gene_symbol | strand | exon_Num | numberOfExons | gene_coding_size | ex_ovlap | sum_ex_ovlap | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135369100 | 135369211 | SYCE1 | - | 12 | 15 | 1789 | 111 | 634 |
1 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135369283 | 135369407 | SYCE1 | - | 11 | 15 | 1789 | 124 | 634 |
2 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135369484 | 135369551 | SYCE1 | - | 10 | 15 | 1789 | 67 | 634 |
3 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135370262 | 135370326 | SYCE1 | - | 9 | 15 | 1789 | 64 | 634 |
4 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135370570 | 135370660 | SYCE1 | - | 8 | 15 | 1789 | 90 | 634 |
5 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135371367 | 135371422 | SYCE1 | - | 7 | 15 | 1789 | 55 | 634 |
6 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135371670 | 135371718 | SYCE1 | - | 6 | 15 | 1789 | 48 | 634 |
7 | chr10 | 135369099 | 135372456 | 3357 | 0 | 135372380 | 135372455 | SYCE1 | - | 5 | 15 | 1789 | 75 | 634 |
Even though the CNV is fully covered by the gene, the low CNV overlap fraction of 0.1889 = 634/3357 is explained by the fact that most of it is covered by introns and the biggest exons are outside it, as we see in the es column of the previous query.
Variant annotations¶
We want to add some simple variant annotations to the CNVs. Here, we do this using the #exon_overlap# relation, because the variant annotations we are interested in here are based on genes, thus they should overlap this relation. From step1, #pn_anno_vars#, we have variants annotated on transcript/feature level, which is actually finer resolution that we prefer. Therefore, in the query below, we aggregate the variant annotations over the gene symbol. Following the -segsnp join between the CNV and the gene-based variants, we resolve these aggregations before we aggregate again per CNV.
mydefs = qh.add_many("""
create #anno_var_overlap# = gor [#exon_overlap#]
| select 1-gene_symbol
| join -segsnp -xl gene_symbol -xr gene_symbol <(gor [#pn_anno_vars#] | group 1 -gc gene_symbol -lis -sc VEP_Impact -max -ic known_clinvar_stars,LHZ_vars)
| calc high if(contains(lis_VEP_Impact,'HIGH'),1,0)
| calc moderate if(contains(lis_VEP_Impact,'MODERATE'),1,0)
| rename max_known_clinvar_stars known_clinvar_stars
| rename max_LHZ_vars LHZ_vars
| group 1 -gc #3-distance -ic high,moderate,known_clinvar_stars,LHZ_vars -sum -max
| rename sum_(.*) #{1}
| select 1-distance[-1],high,moderate,known_clinvar_stars,max_LHZ_vars
;""")
GDX CNV knowledgebase¶
Some of the most valuable annotations for CNVs are probably information about the overlap with known CNVs. The relation #GDX_cnvs# defines a subset of the GDX knowledgebase. Then, #GDX_ov# defines three levels, i.e. boolean flags for 25, 50, and 75 percent overlap, which are then aggregated for each CNV, in case there are many known CNVs overlapping each row.
mydefs = qh.add_many("""
create #GDX_cnvs# = gor ref_base/GDX/XA/confirmation_cnv.gorz
| where (not(genotype in ('wt')) and relevance in ('1 - Causative Mutation','2 - Possibly Associated','5 - ACMG Secondary Finding') and report in ('True') and clinical_interp in ('LPATH','PATH'));
create #GDX_ov# = gor ##cnvs## -f ##index_case##
| select 1-5
| join -segseg -rprefix GDX [#GDX_cnvs#]
| calc overlap_size min(bpstop,gdx_cnv_end)-max(bpstart,gdx_cnv_start)
| calc gdx_cnv_size gdx_cnv_end-gdx_cnv_start
| calc gdx_ov_frac overlap_size/gdx_cnv_size
| calc ov25 if(gdx_ov_frac >= 0.25,1,0)
| calc ov50 if(gdx_ov_frac >= 0.50,1,0)
| calc ov75 if(gdx_ov_frac >= 0.75,1,0)
| group 1 -gc #3 -sum -ic ov25-ov75
| rename sum_(.*) GDXCNVs_#{1};
""")
%%gor
$mydefs
gor [#GDX_ov#]
Query ran in 0.13 sec Query fetched 5 rows in 0.00 sec (total time 0.14 sec)
CHROM | bpStart | bpStop | GDXCNVs_ov25 | GDXCNVs_ov50 | GDXCNVs_ov75 | |
---|---|---|---|---|---|---|
0 | chr1 | 861320 | 3440811 | 2 | 2 | 2 |
1 | chr17 | 60741602 | 61565890 | 9 | 9 | 9 |
2 | chr19 | 15473169 | 58657893 | 25 | 25 | 25 |
3 | chr5 | 88032028 | 88032342 | 0 | 0 | 0 |
4 | chr9 | 140772668 | 140773504 | 0 | 0 | 0 |
From above, we see that all the known CNVs for the first three rows overlap by 75% or more while for the last two it is less than 25%.
Het-Hom callratio analysis¶
CNVs, deletions or duplicatins, should impact the so-called B-allele-frequency or the callratio of non-reference allele reads. To estimate deviation from normal, in #hethom_avg#, we calculate here the average callratio of het and hom variants across the genome in non-CNV regions. We access all the sample variants in ##wgs_vars## and use negative join with CNVs to define the normal autosomal genom.
In #hethom#, we use inclusive join between the CNVs and the variants, to estimate these distribution of callratios of het and homs in the CNVs, i.e. the average and the standard deviation. While we are primarily interested in these metrics for the heterozygous variants, here we calculate it both for het and hom and then apply a student t-test to calculate a p-value significance. By removing the "hom" value in the -v option of PIVOT, we could easily suppress the columns representing the hom-calls. Similarly, we could relatively easily modify these calculation to estimate the ratio of het vs hom variants in the CNVs and compare them to other locations in the genome.
mydefs = qh.add_many("""
create #hethom_avg# = nor <(gor ##wgs_vars## -f ##index_case## | where not( #1 in ('chrX','chrY') )
| join -n -snpseg -maxseg 100000000 <(gor ##cnvs## -f ##index_case##)
| select 1-4,callcopies,callratio,depth | calc hethom if(callcopies=2,'hom','het')
| group genome -gc hethom -avg -std -fc callratio,depth -count) | select #4-;
create #hethom# = gor ##cnvs## -f ##index_case##
| join -segsnp -r <(gor ##wgs_vars## -f ##index_case##
| select 1-4,callcopies,callratio,depth | calc hethom if(callcopies=2,'hom','het') )
| group 1 -gc #3,hethom -avg -std -fc callratio,depth -count
| map -c hethom [#hethom_avg#]
| replace avg_callratio float(gform(float(#rc),3,3))
| calc t float(gform((avg_callratio-avg_callratiox)/sqrt(sqr(std_callratio)/allcount),3,3))
| calc p gform(if(contains(t,'Infinity'),1.0,1.0-student(max(allcount-1,1), abs(t))),3,3)
| select 1,2,bpStop,hethom,avg_callratio,allCount,t,p
| pivot -gc bpstop hethom -v het,hom -e 0
| rename het_allcount het_Vars
| rename hom_allcount hom_Vars
;""")
%%gor
$mydefs
gor [#hethom#] | where het_vars > 10 | top 10 | join -snpsnp <(gor ##cnvs## -f ##index_case## /* | select 1,2,3,svtype */)
Query ran in 0.47 sec Query fetched 12 rows in 0.14 sec (total time 0.61 sec)
CHROM | bpStart | bpStop | het_avg_CallRatio | het_Vars | het_t | het_p | hom_avg_CallRatio | hom_Vars | hom_t | ... | bpStopx | REF | ALT | QUAL | FILTER | cnvLen | svType | caller | formatZip | PN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 861320 | 3440811 | 0.4790 | 697 | -0.834 | 2.020000e-01 | 0.993 | 1264 | -2.330 | ... | 3440811 | N | <DUP> | . | PASS | 2579490 | CNV | CNVREPORTER | GT=./1:SM=.:CN=0.67:BC=.:PE=.:QS=99 | DEMO_2796799 |
1 | chr1 | 161569418 | 161600885 | 0.4020 | 21 | -1.710 | 5.140000e-02 | 1.000 | 23 | inf | ... | 161600885 | N | <DUP> | . | PASS | 31466 | CNV | CNVREPORTER | GT=0/0:SM=.:CN=.:BC=.:PE=.:QS=. | DEMO_2796799 |
2 | chr10 | 39112347 | 42379386 | 0.4670 | 258 | -1.240 | 1.080000e-01 | 0.962 | 228 | -3.150 | ... | 42379386 | T | <DEL> | . | PASS | 3267039 | DEL | SCRAMBLE | GT=0/1:GQ=99:DP=169:AD=8:LSC=0:RSC=8 | DEMO_2796799 |
3 | chr10 | 42356843 | 42380124 | 0.4580 | 202 | -1.760 | 4.000000e-02 | 0.959 | 134 | -2.450 | ... | 42380124 | C | <DEL> | . | PASS | 23281 | DEL | SCRAMBLE | GT=0/1:GQ=99:DP=266:AD=10:LSC=0:RSC=10 | DEMO_2796799 |
4 | chr10 | 42384736 | 42396790 | 0.4780 | 256 | -0.434 | 3.320000e-01 | 0.967 | 169 | -3.940 | ... | 42396790 | C | <DEL> | . | PASS | 12054 | DEL | SCRAMBLE | GT=0/1:GQ=99:DP=872:AD=71:LSC=71:RSC=0 | DEMO_2796799 |
5 | chr10 | 42387487 | 42596876 | 0.4730 | 456 | -0.981 | 1.640000e-01 | 0.970 | 345 | -3.970 | ... | 42596876 | A | <DEL> | . | PASS | 209389 | DEL | SCRAMBLE | GT=0/1:GQ=99:DP=1850:AD=57:LSC=57:RSC=0 | DEMO_2796799 |
6 | chr10 | 124343836 | 124354045 | 0.4980 | 11 | 0.206 | 4.200000e-01 | 1.000 | 3 | inf | ... | 124354045 | N | <DUP> | . | PASS | 10208 | CNV | CNVREPORTER | GT=./1:SM=.:CN=1.45:BC=9:PE=.:QS=26 | DEMO_2796799 |
7 | chr11 | 56143368 | 56143433 | 0.0918 | 14 | -180.000 | 0.000000e+00 | 0.000 | 0 | 0.000 | ... | 56143433 | T | TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG... | . | PASS | 65 | INS | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 |
8 | chr11 | 56143368 | 56143433 | 0.0918 | 14 | -180.000 | 0.000000e+00 | 0.000 | 0 | 0.000 | ... | 56143434 | T | <DEL> | . | PASS | 66 | DEL | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 |
9 | chr11 | 56143368 | 56143434 | 0.1230 | 15 | -12.000 | 4.680000e-09 | 0.000 | 0 | 0.000 | ... | 56143433 | T | TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG... | . | PASS | 65 | INS | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 |
10 | chr11 | 56143368 | 56143434 | 0.1230 | 15 | -12.000 | 4.680000e-09 | 0.000 | 0 | 0.000 | ... | 56143434 | T | <DEL> | . | PASS | 66 | DEL | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 |
11 | chr16 | 32297076 | 33294648 | 0.4110 | 276 | -6.960 | 1.250000e-11 | 0.993 | 329 | -0.913 | ... | 33294648 | T | <DEL> | . | PASS | 997572 | DEL | SCRAMBLE | GT=0/1:GQ=99:DP=9:AD=6:LSC=0:RSC=6 | DEMO_2796799 |
12 rows × 21 columns
Notice the exceptionally low callratio for two CNVs on chr11. Strangely, it covers both INS and DEL, indicating that there is something strange going on with the caller!
CNV seq-read coverage¶
For completion, we want to calculate the average sequence read depth in the CNVs taken from the ##segment_cov## relation. The logic is a simplified version of the gene coverage; in bases, we find the potential partial overlap between the CNVs and the coverage segments, and in basedepth we adjust for the depth rounding used in segment-cov.
mydefs = qh.add_many("""
create ##cnv_coverage## = gor ##cnvs## -f ##index_case##
| rename #2 cnv_start
| rename #3 cnv_end
| calc cnv_size #3-#2
| join -segseg -l -e '-1' -maxseg 10000 <(gor ##segment_cov## -f ##index_case## )
| calc bases if(depth = -1,cnv_end-cnv_start,min(cnv_end,bpstop)-max(cnv_start,bpstart))
| calc basedepth if(depth = -1, 0,if(depth<5,depth,if(depth<10,depth+1,if(depth<30,depth+2,if(depth<50,depth+5,depth+12))))*bases)
| group 1 -gc #3,cnv_size -sum -ic basedepth,bases
| calc avg_depth int(sum_basedepth/sum_bases)
| select 1-3,avg_depth
;""")
%%gor
$mydefs
gor -p chr11:56143368 ##cnvs## -f ##index_case## | join -snpsnp [##cnv_coverage##] | join -snpseg -maxseg 10000 -e 0 <(gor ##segment_cov## -f ##index_case## )
Query ran in 0.69 sec Query fetched 4 rows in 0.32 sec (total time 1.01 sec)
CHROM | bpStart | bpStop | REF | ALT | QUAL | FILTER | cnvLen | svType | caller | formatZip | PN | cnv_end | avg_depth | distance | bpStartx | bpStopx | Depth | PNx | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr11 | 56143368 | 56143433 | T | TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG... | . | PASS | 65 | INS | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 | 56143433 | 212 | 0 | 56143355 | 56143434 | 200 | DEMO_2796799 |
1 | chr11 | 56143368 | 56143433 | T | TAGTATTTCGTTCAATGCATGTGCTGCTCAGTTAGGCTGTTTCCTG... | . | PASS | 65 | INS | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 | 56143434 | 212 | 0 | 56143355 | 56143434 | 200 | DEMO_2796799 |
2 | chr11 | 56143368 | 56143434 | T | <DEL> | . | PASS | 66 | DEL | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 | 56143433 | 212 | 0 | 56143355 | 56143434 | 200 | DEMO_2796799 |
3 | chr11 | 56143368 | 56143434 | T | <DEL> | . | PASS | 66 | DEL | SCRAMBLE | GT=0|1:GQ=99:DP=392:AD=20:LSC=15:RSC=5 | DEMO_2796799 | 56143434 | 212 | 0 | 56143355 | 56143434 | 200 | DEMO_2796799 |
We see that our selected CNV has coverage reported by Scramble that is 392, while our segment coverage is 200 and the avg_depth estimate is 212 due to the fact that in segment cov we store a rounded depth as floor(depth/25.0)*25, when it exceeds 50. The difference in depth between Scramble and segment cov might be due to different filtering of reads, possibly due to filtering based on alignment mapping quality or base quality. To resolve this question, we dive into the BAM/CRAM and calculate the depth per base, without any filtering, e.g. using PILEUP with -q 0 -bq 0 options. We see that our average CNV coverage estimated from ##segment_cov## is indeed correct!
%%gor
gor -p chr11:56142368-56143434 source/samples/2796000-2796999/2796799/DEMO_2796799.bam
| pileup -depth -q 0 -bq 0
| where pos >= 56143368 and pos <= 56143434
| granno chrom -avg -std -count -ic depth
| replace avg_depth,std_depth form(float(#rc),4,1)
Query ran in 0.97 sec Query fetched 67 rows in 0.03 sec (total time 1.00 sec)
Chrom | Pos | RefBase | Depth | allCount | avg_Depth | std_Depth | |
---|---|---|---|---|---|---|---|
0 | chr11 | 56143368 | A | 211 | 67 | 212.0 | 4.2 |
1 | chr11 | 56143369 | T | 212 | 67 | 212.0 | 4.2 |
2 | chr11 | 56143370 | G | 205 | 67 | 212.0 | 4.2 |
3 | chr11 | 56143371 | T | 204 | 67 | 212.0 | 4.2 |
4 | chr11 | 56143372 | T | 204 | 67 | 212.0 | 4.2 |
... | ... | ... | ... | ... | ... | ... | ... |
62 | chr11 | 56143430 | G | 210 | 67 | 212.0 | 4.2 |
63 | chr11 | 56143431 | A | 210 | 67 | 212.0 | 4.2 |
64 | chr11 | 56143432 | A | 210 | 67 | 212.0 | 4.2 |
65 | chr11 | 56143433 | T | 210 | 67 | 212.0 | 4.2 |
66 | chr11 | 56143434 | C | 210 | 67 | 212.0 | 4.2 |
67 rows × 7 columns
Putting it all together¶
The final query simply reads all the CNVs and annotates them with the relations we produced above. Because some CNVs may span multiple genes, we issue a cnvID to make it easy to visualize them and also calculate the number of overlapping genes via GRANNO as dis_gene_symbol. We then use COLUMNREORDER to organize some of the most informative columns to left.
%%gor
$mydefs
gor ##cnvs## -f ##index_case##
| select 1-5
| rownum | rename rownum cnv_ID | replace cnv_ID 'cnv'+#rc
| calc cnv_size #3-#2
| join -snpsnp -l -xl #3-#5 -xr #3-#5 -l [#exon_overlap#]
| hide bpstopx,refx,altx
| join -snpsnp -l -xl #3-#5,gene_symbol -xr #3-#5,gene_symbol -l -rprefix var [#anno_var_overlap#]
| hide var_bpstop,var_ref,var_alt,var_gene_symbol
| granno 1 -gc #3-#5 -dis -sc gene_symbol
| hide gene_coding_size,cnv_overlap,cnvsize
| join -snpsnp -xl #3 -xr cnv_end -r -l -e 0 [##cnv_coverage##] | hide cnv_end
| prefix exon_range-dis_gene_symbol[-1] gene
| prefix dis_gene_symbol- cnv
| join -snpsnp -xl #3 -xr #3 -r -l -e 0 -rprefix cnv [#hethom#] | hide cnv_bpstop
| calc cnv_var_count cnv_het_vars+cnv_hom_vars
| rename #2 CNV_Start
| rename #3 CNV_End
| rename gene_var_max_LHZ_vars gene_var_LHZ
| rename gene_gene_ov_frac gene_ov_frac
| hide ref
| replace gene_ov_frac-gene_var_LHZ if(#rc='','0',#rc)
| join -snpsnp -xl #3 -xr #3 -r -l -e 0 -rprefix cnv [#GDX_ov#] | hide cnv_bpStop
| columnreorder 1,2,CNV_End,ALT,cnv_ID,cnv_size,CNV_GDXCNVS_OV25,cnv_dis_gene_symbol,cnv_avg_depth,cnv_var_count,gene_symbol,gene_exon_range,gene_cnv_ov_frac,gene_ov_frac,gene_var_high,gene_var_moderate,gene_var_KNOWN_clinvar_stars,gene_var_LHZ /* client filters: */
| join -varseg -r -f 10000 -l -xl gene_symbol -xr gene_symbol [##alias_candidate_genes##]
| hide gene_symbolx,Gene_Cand_PhenoScore,min_hpo_genecount
Query ran in 0.91 sec Query fetched 1,148 rows in 0.09 sec (total time 1.00 sec)
CHROM | CNV_Start | CNV_End | ALT | cnv_ID | cnv_size | cnv_GDXCNVs_ov25 | cnv_dis_gene_symbol | cnv_avg_depth | cnv_var_count | ... | cnv_hom_Vars | cnv_hom_t | cnv_hom_p | cnv_GDXCNVs_ov50 | cnv_GDXCNVs_ov75 | GENE_candidate_Phenotypes | GENE_candidate_paralogs | GENE_candORparalog | GENE_PHRANK | GENE_PHRANK_norm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 861320 | 3440811 | <DUP> | cnv1 | 2579491 | 2 | 66 | 52 | 1961 | ... | 1264 | -2.33 | 0.00998 | 2 | 2 | NaN | NaN | NaN | NaN | NaN |
1 | chr1 | 861320 | 3440811 | <DUP> | cnv1 | 2579491 | 2 | 66 | 52 | 1961 | ... | 1264 | -2.33 | 0.00998 | 2 | 2 | NaN | NaN | NaN | NaN | NaN |
2 | chr1 | 861320 | 3440811 | <DUP> | cnv1 | 2579491 | 2 | 66 | 52 | 1961 | ... | 1264 | -2.33 | 0.00998 | 2 | 2 | NaN | NaN | NaN | NaN | NaN |
3 | chr1 | 861320 | 3440811 | <DUP> | cnv1 | 2579491 | 2 | 66 | 52 | 1961 | ... | 1264 | -2.33 | 0.00998 | 2 | 2 | NaN | NaN | NaN | NaN | NaN |
4 | chr1 | 861320 | 3440811 | <DUP> | cnv1 | 2579491 | 2 | 66 | 52 | 1961 | ... | 1264 | -2.33 | 0.00998 | 2 | 2 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1143 | chrX | 155251243 | 155252494 | <DEL> | cnv120 | 1251 | 0 | 1 | 120 | 12 | ... | 1 | inf | 1.00000 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
1144 | chrY | 13651394 | 13808320 | <DEL> | cnv121 | 156926 | 0 | 1 | 9 | 276 | ... | 166 | -1.24 | 0.10800 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
1145 | chrY | 13704402 | 13801064 | <DEL> | cnv122 | 96662 | 0 | 1 | 4 | 95 | ... | 53 | -1.22 | 0.11400 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
1146 | chrY | 13851738 | 13851744 | ACTGTAC | cnv123 | 6 | 0 | 1 | 74 | 0 | ... | 0 | 0.00 | 0.00000 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
1147 | chrY | 13851738 | 13851950 | <DEL> | cnv124 | 212 | 0 | 1 | 7 | 0 | ... | 0 | 0.00 | 0.00000 | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
1148 rows × 33 columns