Gene drill-in query¶

gene_overview

In [47]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "gdx-demo"
%env GOR_API_PROJECT={project}

Gregor¶

Gene drill-in overview¶

Here we will describe query logic that is used to provide more detailed variant information on a single gene at the time, e.g. as a drill-in function. As described elsewhere, filtering is applied both in step1 and step2, e.g. by gene region context, AF and impact cut-off etc. The purpose of the gene drill-in is to show the full variant picture, including gene introns, with key annotatons but without filtering.

Step2¶

Currently, some of the annotations in the gene drill-in are based not only on the output from step1 but also logic that is in step2, such as the full ACMG scoring. Because of this and the fact that the output of step2 is not stored in the same manner as step1 (but only in the GORdb cache), we need to execute all the query statements of step2 as well in the drill-in logic. This leads to duplication of code, because currently there is no systematic way to call a YML code from within another YML code, except by passing the full the full list of parameters to step2 YML.

In the future, we might however chain views together with dependency graph and actually feed a materialized output from step2 into the drill-in YML code. For this discussion, we will simply simulate that by running gregor_step2.ftl.yml, save the results, and use it instead of the ##gdx_result## statement that is currently referred in the code.

This saves us from listing all the code that we have already covered in the Gregor2.ipynb notebook. We run the step2 logic and save the result to user_data/Gregor2/DEMO_2796799.gorz

In [48]:

%%gor

gor user_data/GregorTraining/gregor_step2.ftl.yml(
    index_case = 'DEMO_2796799',father = 'DEMO_2827751',
    mother = 'DEMO_2796810', prepopulated_result = user_data/Gregor1/DEMO_2796799_v3.gorz,
    vep_consequence = 'transcript_ablation,splice_acceptor_variant,splice_donor_variant,stop_gained,frameshift_variant,stop_lost,start_lost,transcript_amplification'
    )
| write user_data/Gregor2/DEMO_2796799.gorz

Query ran in 1.16 sec
Query fetched 0 rows in 1.15 sec (total time 2.31 sec)

Out[48]:

	CHROM	POS

We will not include some of the definitions, that are mostly the same as in step2, but skip statements down to ##gdx_result## from the gregor_gene_overview.ftl.yml().

In [49]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

In [50]:

mydefs = qh.add_many("""
def ##ref## = ref;
def ##vep_path## = source/anno/vep_v110;
def ##vcf_wgs_anno## = user_data/Gregor1/DEMO_2796799_v3.gorz;
def ##genepanelinfo## = ;
def ##defMOI## = 'all';
def ##fuzzdist## = 20;
def ##min_read_depth## = 8;
def ##min_hom_call_ratio## = 0.66;
def ##min_het_call_ratio## = 0.2;
def ##min_gt_likelihood## = 5;
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;
def ##max_cust_af## = ##rmax_af##;
def ##delta_ctrl## = 0;
def ##delta_cases## = 0;
def ##max_num_candgenes## = 10;
def ##disease_specificity_max_gene_count## = 20;
def ##min_lhz_overlap_fraction## = 0.1;
def ##vep_filter## = consequence in ('transcript_ablation','splice_acceptor_variant','splice_donor_variant','stop_gained','frameshift_variant','stop_lost','start_lost','transcript_amplification','inframe_insertion','inframe_deletion','missense_variant','protein_altering_variant','splice_region_variant','incomplete_terminal_codon_variant');
def ##vcf_filter## = ('Not LowQual');
def ##af_filter## = ref_af <= Gene_RmaxAf and ref_af_cust <= ##max_cust_af##;
def ##cat_filter## = DIAG_ACMGcat in ('Cat1','Cat1B','Cat1C');
def ##known_filter## = qc_known_vars = 1;
def ##protected_filter## = qc_protected_vars = 1;
def ##vcf_quality_filter## = qc_vcf_filter = 1;
def ##variant_scope_filter## =  2 = 2;
def ##repeat_regions_filter## =  qc_repeat_regions = 0 ;
def ##variant_filter## =  ( ##vep_filter## ) and ( ##af_filter## ) and ( ##repeat_regions_filter## ) and ( ##variant_scope_filter## )  and ( ##vcf_quality_filter## ) or ( ##cat_filter## ) or ( ##protected_filter## ) or ( ##known_filter## );
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def ##vep_impact_map## = ##ref##/VEP_impact.map;
def ##vep_single## = ##vep_path##/vepref_single_wgs.gord;
def ##vep_multi## = ##vep_path##/vepref_multi_wgs.gord;
def ##vep_clinical## = ##ref##/disvariants/clinical_variants_vepref_multi.gorz;
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def #denovoAFpv# = 0.0005;
def #homAF# = 0.005;
def #chzAF# = 0.01;
def #xlinkAF# = 0.01;
def #sihetAF# = 0.01;
def #domAF# = 0.0005;
def #chomAF# = 0.001;
def ##index_case## = 'DEMO_2796799';
def ##father## = 'DEMO_2827751';
def ##father_affstat## = 'unaffected';
def ##mother## = 'DEMO_2796810';
def ##mother_affstat## = 'unaffected';
def ##male_controls## = 'DEMO_2827751';
def ##male_cases## = ;
def ##female_controls## = 'DEMO_2796810';
def ##female_cases## = ;
def ##cc_pns## = 'DEMO_2827751','DEMO_2796810';
def ##all_pns## = 'DEMO_2827751','DEMO_2796810','DEMO_2796799';
def ##all_pns_no_quotes## = DEMO_2827751,DEMO_2796810,DEMO_2796799;
def ##pivot_type## = 'male_cases','female_cases','male_controls','female_controls';
def ##pntype## = IF(PN IN ('DEMO_2796799'),'index',IF(PN IN ('DEMO_2796810'),'female_controls',IF(PN IN ('DEMO_2827751'),'male_controls','UNKNOWN')));
""")

We replace the actual ##gdx_result## statement with a simple GOR query that reads the file we generated above:

In [51]:

mydefs = qh.add_many("""
create ##gdx_result## = gor user_data/Gregor2/DEMO_2796799.gorz;
""")

In [52]:

%%gor
$mydefs
nor [##gdx_result##] | group -gc gene_symbol -count | sort -c allcount:nr

Query ran in 1.00 sec
Query fetched 79 rows in 0.01 sec (total time 1.01 sec)

Out[52]:

	Gene_Symbol	allCount
0	SKA3	3
1	HLA-DQB1	2
2	ITPKB	2
3	OR5H1	2
4	PCSK9	2
...	...	...
74	TTC3	1
75	UBE3A	1
76	UEVLD	1
77	ZC3H15	1
78	ZNF23	1

79 rows × 2 columns

We will pick the PCSK9 as our example in the following example. The two variants there are:

In [53]:

%%gor
$mydefs
gor [##gdx_result##] | where gene_symbol = 'PCSK9' | select 1-4,auto_*

Query ran in 0.08 sec
Query fetched 2 rows in 0.02 sec (total time 0.10 sec)

Out[53]:

	CHROM	POS	Reference	Call	auto_classification	auto_evidence	auto_classification_simple	auto_score
0	chr1	55505613	G	T	VUS	PM2,PP2	VUS	2
1	chr1	55517991	C	CAT	LP	PVS1,PM2	LP	9

and we will make the drill-in from the LP variants at pos = 55517991. Now, specificly for the gene-drill of this variant, we add the following definitions:

In [54]:

mydefs = qh.add_many("""
def ##wgs_vars## = source/var/wgs_varcalls.gord -s PN;
def ##segment_cov## = source/cov/segment_cov.gord -s PN;
def ##cand_vars## = source/var/candidate_variants.gord -s PN;

def ##var_gene_symbol## = 'PCSK9';
def ##var_pos## = 55517991;
""")

Gene variant count summary¶

To summarize the variants, we quickly read through the gene relations and filter it by our gene of interest, ##var_gene_symbol##, and perform a intersect-right-join with all the variants in the case-study, ##all_pns##. We summarize the counts for each case-ctrl group (and the index-case) into multi-columnar table, using the PIVOT command. In #varCounts#, we then collaps these columns into a attribute-value column, varCounts, showing only attributes for the groups that have non-zero count.

In [78]:

mydefs = qh.add("""
create #ccvar# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol## 
| join -ir -segsnp -f 100 <(gor ##wgs_vars## -f ##all_pns## )
| varnorm -left #3 #4
| calc PNtype ##pntype##
| split PNtype
| calc het IF(callCopies = 1,1,0)
| calc hom IF(callCopies = 2,1,0)
| group 1 -gc reference,call,PNtype -sum -ic het,hom 
| rename sum_het het
| rename sum_hom hom
| pivot PNtype -gc reference,call -v 'index',##pivot_type## -e 0
| calc CallCopies if(index_het=1,'1',if(index_hom=1,'2','0'))
| hide index_*
| columnsort chrom,pos,reference,call,callcopies
;""")

mydefs = qh.add("""
create #varCounts# = gor [#ccvar#]
| calc x male_cases_het+','+male_cases_hom+','+female_cases_het+','+female_cases_hom+','+male_controls_het+','+male_controls_hom+','+female_controls_het+','+female_controls_hom
| calc y 'caseMaleHet,caseMaleHom,caseFemHet,caseFemHom,ctrlMaleHet,ctrlMaleHom,ctrlFemHet,ctrlFemHom'
| calc varCounts replace(listzipfilter(listzip(y,x),x,'int(x)>0'),';','=')
| hide CallCopies[+1]-varCounts[-1];
""")

In [56]:

%%gor
$mydefs
gor [#varCounts#] | where pos in ('55517991','55505613') | varnorm -left #3 #4

Query ran in 0.09 sec
Query fetched 2 rows in 0.01 sec (total time 0.10 sec)

Out[56]:

	CHROM	POS	Reference	Call	CallCopies	varCounts
0	chr1	55505613	G	T	1	ctrlFemHet=1
1	chr1	55517991	C	CAT	1	ctrlMaleHet=1

Ensembl VEP annotation¶

We want to annotate every variant in the gene with VEP and show its position in context of exons and introns. Since there are possibly many different transcripts (and we want a single row per variant) we have to pick some transcript for this. We could use the feature value of the variant specified with ##var_pos##, however, this parameter it not mandatory and at present, the Gregor interface is not feeding this parameter into the YML. Therefore, our strategy is to pick the transcript that shows most often in the output of step2.

In [62]:

mydefs = qh.add_many("""
create #sel_feature# = nor <(gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 -xl gene_symbol -xr gene_symbol [##gdx_result##])
| group -gc feature -count
| rename feature selected_feature
| merge <(norrows 1 | calc selected_feature '' | select selected_feature | calc allcount 0)
| sort -c allcount:nr
| top 1
| hide allcount
;

create #vepvars# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 -xl gene_symbol -xr gene_symbol ##vep_multi##
| multimap -cartesian [#sel_feature#]
| where selected_feature = '' or feature = selected_feature
| hide selected_feature
| select 1,2,Reference,call,gene_symbol,Norm,Span,Max_Consequence,Consequence,cDNA_position,CDS_position,Protein_position,EXON,INTRON
| group 1 -gc #3,#4,gene_symbol -set -sc Norm-
| rename set_(.*) #{1}
;""")

The code above needs some explanation! In Gregor, the gene-drill-in view can be invoked from views that are not based on step2, e.g. gene coverage summary or CNVs. Thus, there is a possiblity that a gene is selected, for which no variant is present in ##mendel_result##. This would lead to an empty #sel_feature# relation and therefore the cartesian multimap join in #vepvars# would provide no output rows. While there are many ways to pick transcript features, e.g. simply take the one with the largest coding region, here we simply add a dummy empty feature to #sel_feature#. The sort and top 1 selection will ensure that it only shows up if no variant came out of the join with ##mendel_result##. Similarly, in #vepvars#, if this dummy feature was selected, we pass through all the features, instead of blocking all. Finally, to deal with this case of multiple features, we aggregate them with the GROUP command.

This is a somewhat ugly solution that could for instance be fixed by having the execution context passed in as parameter. We leave this as is for now - at least not hide any data! Notice that #vepvars# is an extract of all the variants in ##vep_multi## for a single gene and it may include significantly more variants than are present in our case-study, i.e. #varCounts#.

In [65]:

%%gor
$mydefs

nor <(gor [#vepvars#] | varjoin -ic [#varCounts#] ) | group -gc overlapcount -count

Query ran in 0.14 sec
Query fetched 2 rows in 0.01 sec (total time 0.15 sec)

Out[65]:

	OverlapCount	allCount
0	0	31
1	1	50

Speed considerations¶

The above numbers are not realistic since there are very few studies in this project. However, even for very large projects with a very high number of samples, the execution of #vepvars# should at most take few seconds. It is however possible to rewrite it and use PGOR statement that uses -SPLIT based on SEGHIST of all the variants in the project, e.g. source/anno/vep_v110/allvars.gord.

For educational purpose, we will quickly show such version below, although the YML is not written in that way:

In [71]:

mydefs = qh.add_many("""
create #segsplits# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol## 
| join -ir -segsnp -f 100 source/anno/vep_v110/allvars.gord
| group 1000 -count | seghist 100000;

create #vepvars# = pgor -split <(gor [#segsplits#]) [#varCounts#] 
| varjoin -ir ##vep_multi##
| where gene_symbol = ##var_gene_symbol##
| multimap -cartesian [#sel_feature#]
| where selected_feature = '' or feature = selected_feature
| hide selected_feature
| select 1,2,Reference,call,gene_symbol,Norm,Span,Max_Consequence,Consequence,cDNA_position,CDS_position,Protein_position,EXON,INTRON
| group 1 -gc #3,#4,gene_symbol -set -sc Norm-
| rename set_(.*) #{1}
| varjoin -i [#varCounts#]
;""")

In the above version, we generate split segments that based allvars.gord have no more than 100k variants in it. Keeping in mind the nature of SEGHIST, i.e. that it generates segments that cover the whole genome, as shown here:

In [73]:

%%gor
$mydefs
gor [#segsplits#]

Query ran in 1.18 sec
Query fetched 1 rows in 0.01 sec (total time 1.19 sec)

Out[73]:

	Chrom	bpStart	bpStop	Count
0	chr1	0	249250621	81

in #vepvars#, as GOR left-source, we therefore read the #varCounts# and varjoin into the BIG VEP relation the variants we need. The cost of this is proportional to the variant density in ##vep_multi## while the size of the cached data is only as large as the density of variants in #varCounts#. Because of this intersect join, we see that all the variants have an overlap count of 1:

In [74]:

%%gor
$mydefs

nor <(gor [#vepvars#] | varjoin -ic [#varCounts#] ) | group -gc overlapcount -count

Query ran in 0.73 sec
Query fetched 1 rows in 0.01 sec (total time 0.73 sec)

Out[74]:

	OverlapCount	allCount
0	1	50

We can easily add similar intersect join with #varCounts# in the first version of our #vepvars# statement. This does however mean that the query cannot be scheduled until execution of #varCounts# is finished and may therefore not necessarily be faster. Subsequent query that joins with #vepvars# will however be faster because of its reduced size.

Allele frequency information¶

Similarly to the VEP annotation, we want to decorate every variant with AF information. Here we draw AF infor from two sources as shown below. The same discussion about AF as in Gregor1.ipynb applies here!

In [75]:

mydefs = qh.add_many("""
create #XAaf# = pgor XA/analysis/unaff_parents_conf_gender_af.gorz | select 1-4,conservative_af,homCount
| varnorm -left #3 #4 | atmax 1 -gc ref,alt conservative_af ;

create #afvars# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 ##ref##/freq_max.gorz
| select 1-4,gnomad_af_max | replace gnomad_af_max form(gnomad_af_max,4,5)
| distinct
| varjoin -r -l -e 0 -rprefix XA [#XAaf#]
| replace gnomad_af_max,XA_conservative_af form(float(#rc),4,5)
;""")

Final output¶

Before the final query, we also extract a subset from the step2 results as annotations on variants where they apply. This could ofcourse also be done by a nested query in the final query.

In [76]:

mydefs = qh.add_many("""
create #mendvars# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 -xl gene_symbol -xr gene_symbol [##gdx_result##]
| replace ref_af form(ref_af,4,5)
| select 1,2,Reference,Call,auto_classification,auto_evidence,inheritance,GDX_view,GDX_classification,GT_Paternity,ref_af,Fidel_PRECPAT,spliceai_max_impact
| distinct
;""")

Now the final output is a simple join from all the variants in the case-study with our annotations relations:

In [82]:

%%gor finaloutput <<
$mydefs

gor [#varCounts#]
| varjoin -r -l -e '' [#mendvars#] 
| varjoin -r [#vepvars#]
| varjoin -r -l -e 0 [#afvars#]
| calc dist pos - ##var_pos##
| select 1,2,Reference,Call,CallCopies,dist,gene_symbol,auto_classification,auto_evidence,EXON,INTRON,varCounts,auto_evidence,auto_classification,GDX_view,GDX_classification,GT_Paternity,inheritance,Consequence,XA_conservative_af,XA_homCount,gnomad_af_max,Norm,Span,Max_Consequence,cDNA_position,CDS_position,Protein_position,Fidel_PRECPAT,spliceai_max_impact

Query ran in 0.83 sec
Query fetched 50 rows in 0.01 sec (total time 0.84 sec)

In [83]:

finaloutput

Out[83]:

	CHROM	POS	Reference	Call	CallCopies	dist	Gene_Symbol	auto_classification	auto_evidence	EXON	...	XA_homCount	gnomad_af_max	Norm	Span	Max_Consequence	cDNA_position	CDS_position	Protein_position	Fidel_PRECPAT	spliceai_max_impact
0	chr1	55505613	G	T	1	-12378	PCSK9	VUS	PM2,PP2	1/12	...	0	0.00004	S	0	missense_variant	393	103	35	0.956313	NaN
1	chr1	55505732	A	G	1	-12259	PCSK9	NaN	NaN	NaN	...	0	0.97029	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
2	chr1	55509355	C	T	1	-8636	PCSK9	NaN	NaN	NaN	...	0	0.96979	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
3	chr1	55509900	C	T	1	-8091	PCSK9	NaN	NaN	NaN	...	0	0.16981	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
4	chr1	55509939	T	C	2	-8052	PCSK9	NaN	NaN	NaN	...	0	0.16981	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
5	chr1	55512549	T	C	1	-5442	PCSK9	NaN	NaN	NaN	...	0	0.87313	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
6	chr1	55513061	T	C	0	-4930	PCSK9	NaN	NaN	NaN	...	0	0.80231	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
7	chr1	55513348	G	T	1	-4643	PCSK9	NaN	NaN	NaN	...	0	0.00000	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
8	chr1	55513725	T	A	0	-4266	PCSK9	NaN	NaN	NaN	...	0	0.87313	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
9	chr1	55514024	C	A	2	-3967	PCSK9	NaN	NaN	NaN	...	0	0.00000	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
10	chr1	55514046	G	T	2	-3945	PCSK9	NaN	NaN	NaN	...	0	0.87313	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
11	chr1	55514973	G	C	2	-3018	PCSK9	NaN	NaN	NaN	...	0	1.00000	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
12	chr1	55516188	C	T	2	-1803	PCSK9	NaN	NaN	NaN	...	0	0.80720	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
13	chr1	55516781	G	T	0	-1210	PCSK9	NaN	NaN	NaN	...	0	1.00000	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
14	chr1	55517861	C	G	1	-130	PCSK9	NaN	NaN	NaN	...	0	0.76132	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
15	chr1	55517883	C	G	1	-108	PCSK9	NaN	NaN	NaN	...	0	0.70558	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
16	chr1	55517991	C	CAT	1	0	PCSK9	LP	PVS1,PM2	4/12	...	0	0.00035	L	3	frameshift_variant	854-855	564-565	188-189	-1.000000	NaN
17	chr1	55518166	G	A	1	175	PCSK9	NaN	NaN	NaN	...	0	0.78655	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
18	chr1	55518190	AC	A	1	199	PCSK9	NaN	NaN	NaN	...	0	0.76427	L	7	intron_variant	NaN	NaN	NaN	NaN	NaN
19	chr1	55518316	C	T	1	325	PCSK9	NaN	NaN	NaN	...	1140	0.50049	S	0	splice_region_variant	NaN	NaN	NaN	NaN	NaN
20	chr1	55518467	A	G	1	476	PCSK9	NaN	NaN	NaN	...	0	0.72285	S	0	splice_donor_region_variant	NaN	NaN	NaN	NaN	NaN
21	chr1	55518528	C	A	1	537	PCSK9	NaN	NaN	NaN	...	0	0.77161	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
22	chr1	55518682	C	T	1	691	PCSK9	NaN	NaN	NaN	...	0	0.47277	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
23	chr1	55520408	G	C	2	2417	PCSK9	NaN	NaN	NaN	...	0	0.76671	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
24	chr1	55520682	GA	G	2	2691	PCSK9	NaN	NaN	NaN	...	0	0.77236	L	13	intron_variant	NaN	NaN	NaN	NaN	NaN
25	chr1	55521109	G	A	0	3118	PCSK9	NaN	NaN	NaN	...	0	0.77057	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
26	chr1	55521899	GGGGCGGA	G	1	3908	PCSK9	NaN	NaN	NaN	...	0	0.89578	L	17	intron_variant	NaN	NaN	NaN	NaN	NaN
27	chr1	55522741	C	CCGGT	2	4750	PCSK9	NaN	NaN	NaN	...	0	0.86457	L	1	intron_variant	NaN	NaN	NaN	NaN	NaN
28	chr1	55522741	C	CCTGT	0	4750	PCSK9	NaN	NaN	NaN	...	0	0.39025	L	2	intron_variant	NaN	NaN	NaN	NaN	NaN
29	chr1	55523033	A	G	2	5042	PCSK9	NaN	NaN	7/12	...	236917	1.00000	S	0	synonymous_variant	1316	1026	342	NaN	NaN
30	chr1	55523361	A	G	2	5370	PCSK9	NaN	NaN	NaN	...	0	0.99936	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
31	chr1	55523483	C	T	2	5492	PCSK9	NaN	NaN	NaN	...	0	0.99936	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
32	chr1	55523984	C	T	2	5993	PCSK9	NaN	NaN	NaN	...	0	0.99871	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
33	chr1	55524116	C	T	2	6125	PCSK9	NaN	NaN	NaN	...	0	0.99872	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
34	chr1	55524197	A	G	2	6206	PCSK9	NaN	NaN	9/12	...	0	0.99318	S	0	synonymous_variant	1670	1380	460	NaN	NaN
35	chr1	55524237	G	A	2	6246	PCSK9	NaN	NaN	9/12	...	0	0.99318	S	0	missense_variant	1710	1420	474	NaN	NaN
36	chr1	55524339	CGTGTGTGTGTGT	C	0	6348	PCSK9	NaN	NaN	NaN	...	0	0.08620	L	40	intron_variant	NaN	NaN	NaN	NaN	NaN
37	chr1	55524339	CGTGTGTGTGTGTGT	C	1	6348	PCSK9	NaN	NaN	NaN	...	0	0.43879	L	38	intron_variant	NaN	NaN	NaN	NaN	NaN
38	chr1	55524339	CGTGTGTGTGTGTGTGTGT	C	1	6348	PCSK9	NaN	NaN	NaN	...	0	0.11716	L	34	intron_variant	NaN	NaN	NaN	NaN	NaN
39	chr1	55524373	TGTGTGTGTGTGTG	T	1	6382	PCSK9	NaN	NaN	NaN	...	0	0.00000	L	0	intron_variant	NaN	NaN	NaN	NaN	NaN
40	chr1	55524387	T	C	2	6396	PCSK9	NaN	NaN	NaN	...	0	0.99740	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
41	chr1	55524601	G	A	2	6610	PCSK9	NaN	NaN	NaN	...	0	0.91528	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
42	chr1	55524842	G	A	1	6851	PCSK9	NaN	NaN	NaN	...	0	0.86632	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
43	chr1	55525400	G	A	1	7409	PCSK9	NaN	NaN	NaN	...	0	0.86026	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
44	chr1	55526685	G	A	2	8694	PCSK9	NaN	NaN	NaN	...	0	0.99871	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
45	chr1	55527323	G	A	1	9332	PCSK9	NaN	NaN	NaN	...	0	0.85806	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
46	chr1	55527479	A	G	2	9488	PCSK9	NaN	NaN	NaN	...	0	0.99872	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
47	chr1	55527918	C	T	0	9927	PCSK9	NaN	NaN	NaN	...	0	0.99872	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
48	chr1	55528807	G	C	1	10816	PCSK9	NaN	NaN	NaN	...	0	0.33353	S	0	intron_variant	NaN	NaN	NaN	NaN	NaN
49	chr1	55529187	G	A	1	11196	PCSK9	NaN	NaN	12/12	...	0	0.97625	S	0	missense_variant	2299	2009	670	NaN	NaN

50 rows × 30 columns

Notice that the dist column stores the base-pair distance from the varinat from which we activated the drill-in. Also notice that while most of the variants have AF annotation, the other annotations that are drawn from step2 are only present on the "most important" variants. If we want, we can change that and annotate by joining directly with those sources, however, some annotations such as auto_classification will call for the full query logic in step1 and step2.

To demonstrate this, we left join our output with the full Fidel table and prefix the corresponding columns Fidel2. We pick the three missense variants and do some transpose gymnastics to make it easier to visualize:

In [107]:

%%gor

nor <(gor [var:finaloutput] | where max_consequence in ('missense_variant')
      | varjoin -r -l -e 'missing' -rprefix Fidel2 ref_base/GDX/Fidel/fidelhg19.gorz
      | calc var iooa(pos) )
| columnreorder var
| unpivot #2-
| calc colord iooa(col_name)
| pivot var -v 1,2,3 -e '' -gc col_name,colord
| sort -c colord:n
| hide colord
| rename (.*)_col_value var_#{1}

Query ran in 0.08 sec
Loading relations [var:finaloutput] from local state
Query ran in 0.16 sec
Query fetched 35 rows in 0.02 sec (total time 0.18 sec)

Out[107]:

	Col_Name	var_1	var_2	var_3
0	CHROM	chr1	chr1	chr1
1	POS	55505613	55524237	55529187
2	Reference	G	G	G
3	Call	T	A	A
4	CallCopies	1	2	1
5	dist	-12378	6246	11196
6	Gene_Symbol	PCSK9	PCSK9	PCSK9
7	auto_classification	VUS	NaN	NaN
8	auto_evidence	PM2,PP2	NaN	NaN
9	EXON	1/12	9/12	12/12
10	INTRON	NaN	NaN	NaN
11	varCounts	ctrlFemHet=1	ctrlMaleHom=1,ctrlFemHom=1	ctrlMaleHom=1,ctrlFemHet=1
12	auto_evidencex	PM2,PP2	NaN	NaN
13	auto_classificationx	VUS	NaN	NaN
14	GDX_view	chz	NaN	NaN
15	GDX_classification	LPATH	NaN	NaN
16	GT_Paternity	Maternal	NaN	NaN
17	inheritance	CHZcc	NaN	NaN
18	Consequence	missense_variant	missense_variant	missense_variant
19	XA_conservative_AF	1e-05	0.0	0.0
20	XA_homCount	0	0	0
21	gnomad_af_max	4e-05	0.99318	0.97625
22	Norm	S	S	S
23	Span	0	0	0
24	Max_Consequence	missense_variant	missense_variant	missense_variant
25	cDNA_position	393	1710	2299
26	CDS_position	103	1420	2009
27	Protein_position	35	474	670
28	Fidel_PRECPAT	0.956313	NaN	NaN
29	spliceai_max_impact	NaN	NaN	NaN
30	Fidel2_PRECBEN	0.925260	0.993031	0.995968
31	Fidel2_PRECPAT	0.956313	0.684241	0.626545
32	Fidel2_ALPHA	0.118100	0.073900	0.054000
33	Fidel2_REVEL	0.320000	0.051000	0.067000
34	Fidel2_CALIBRATED	1	1	1

Now we see that all the variants have Fidel2 annotations whereas only one of them gets annotation from step2. The fact that we are extracting the Fidel variant annotations from a single gene, the join comes with almost no cost, just like the AF joins we setup before in #afvars#.

The easiest way to see the complete query statements, including those from step2, is to click on a gene from the standard variants views from step2, e.g. All_variants or GDX_variants, and then use the toolbar icon to copy the GOR query statemenst to the clipboard (next to the Excel icon).

In [ ]: