Gene drill-in query¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "gdx-demo"
%env GOR_API_PROJECT={project}
Gregor¶
Gene drill-in overview¶
Here we will describe query logic that is used to provide more detailed variant information on a single gene at the time, e.g. as a drill-in function. As described elsewhere, filtering is applied both in step1 and step2, e.g. by gene region context, AF and impact cut-off etc. The purpose of the gene drill-in is to show the full variant picture, including gene introns, with key annotatons but without filtering.
Step2¶
Currently, some of the annotations in the gene drill-in are based not only on the output from step1 but also logic that is in step2, such as the full ACMG scoring. Because of this and the fact that the output of step2 is not stored in the same manner as step1 (but only in the GORdb cache), we need to execute all the query statements of step2 as well in the drill-in logic. This leads to duplication of code, because currently there is no systematic way to call a YML code from within another YML code, except by passing the full the full list of parameters to step2 YML.
In the future, we might however chain views together with dependency graph and actually feed a materialized output from step2 into the drill-in YML code. For this discussion, we will simply simulate that by running gregor_step2.ftl.yml, save the results, and use it instead of the ##gdx_result## statement that is currently referred in the code.
This saves us from listing all the code that we have already covered in the Gregor2.ipynb notebook. We run the step2 logic and save the result to user_data/Gregor2/DEMO_2796799.gorz
%%gor
gor user_data/GregorTraining/gregor_step2.ftl.yml(
index_case = 'DEMO_2796799',father = 'DEMO_2827751',
mother = 'DEMO_2796810', prepopulated_result = user_data/Gregor1/DEMO_2796799_v3.gorz,
vep_consequence = 'transcript_ablation,splice_acceptor_variant,splice_donor_variant,stop_gained,frameshift_variant,stop_lost,start_lost,transcript_amplification'
)
| write user_data/Gregor2/DEMO_2796799.gorz
Query ran in 1.16 sec Query fetched 0 rows in 1.15 sec (total time 2.31 sec)
CHROM | POS |
---|
We will not include some of the definitions, that are mostly the same as in step2, but skip statements down to ##gdx_result## from the gregor_gene_overview.ftl.yml().
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def ##ref## = ref;
def ##vep_path## = source/anno/vep_v110;
def ##vcf_wgs_anno## = user_data/Gregor1/DEMO_2796799_v3.gorz;
def ##genepanelinfo## = ;
def ##defMOI## = 'all';
def ##fuzzdist## = 20;
def ##min_read_depth## = 8;
def ##min_hom_call_ratio## = 0.66;
def ##min_het_call_ratio## = 0.2;
def ##min_gt_likelihood## = 5;
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;
def ##max_cust_af## = ##rmax_af##;
def ##delta_ctrl## = 0;
def ##delta_cases## = 0;
def ##max_num_candgenes## = 10;
def ##disease_specificity_max_gene_count## = 20;
def ##min_lhz_overlap_fraction## = 0.1;
def ##vep_filter## = consequence in ('transcript_ablation','splice_acceptor_variant','splice_donor_variant','stop_gained','frameshift_variant','stop_lost','start_lost','transcript_amplification','inframe_insertion','inframe_deletion','missense_variant','protein_altering_variant','splice_region_variant','incomplete_terminal_codon_variant');
def ##vcf_filter## = ('Not LowQual');
def ##af_filter## = ref_af <= Gene_RmaxAf and ref_af_cust <= ##max_cust_af##;
def ##cat_filter## = DIAG_ACMGcat in ('Cat1','Cat1B','Cat1C');
def ##known_filter## = qc_known_vars = 1;
def ##protected_filter## = qc_protected_vars = 1;
def ##vcf_quality_filter## = qc_vcf_filter = 1;
def ##variant_scope_filter## = 2 = 2;
def ##repeat_regions_filter## = qc_repeat_regions = 0 ;
def ##variant_filter## = ( ##vep_filter## ) and ( ##af_filter## ) and ( ##repeat_regions_filter## ) and ( ##variant_scope_filter## ) and ( ##vcf_quality_filter## ) or ( ##cat_filter## ) or ( ##protected_filter## ) or ( ##known_filter## );
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def ##vep_impact_map## = ##ref##/VEP_impact.map;
def ##vep_single## = ##vep_path##/vepref_single_wgs.gord;
def ##vep_multi## = ##vep_path##/vepref_multi_wgs.gord;
def ##vep_clinical## = ##ref##/disvariants/clinical_variants_vepref_multi.gorz;
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def #denovoAFpv# = 0.0005;
def #homAF# = 0.005;
def #chzAF# = 0.01;
def #xlinkAF# = 0.01;
def #sihetAF# = 0.01;
def #domAF# = 0.0005;
def #chomAF# = 0.001;
def ##index_case## = 'DEMO_2796799';
def ##father## = 'DEMO_2827751';
def ##father_affstat## = 'unaffected';
def ##mother## = 'DEMO_2796810';
def ##mother_affstat## = 'unaffected';
def ##male_controls## = 'DEMO_2827751';
def ##male_cases## = ;
def ##female_controls## = 'DEMO_2796810';
def ##female_cases## = ;
def ##cc_pns## = 'DEMO_2827751','DEMO_2796810';
def ##all_pns## = 'DEMO_2827751','DEMO_2796810','DEMO_2796799';
def ##all_pns_no_quotes## = DEMO_2827751,DEMO_2796810,DEMO_2796799;
def ##pivot_type## = 'male_cases','female_cases','male_controls','female_controls';
def ##pntype## = IF(PN IN ('DEMO_2796799'),'index',IF(PN IN ('DEMO_2796810'),'female_controls',IF(PN IN ('DEMO_2827751'),'male_controls','UNKNOWN')));
""")
We replace the actual ##gdx_result## statement with a simple GOR query that reads the file we generated above:
mydefs = qh.add_many("""
create ##gdx_result## = gor user_data/Gregor2/DEMO_2796799.gorz;
""")
%%gor
$mydefs
nor [##gdx_result##] | group -gc gene_symbol -count | sort -c allcount:nr
Query ran in 1.00 sec Query fetched 79 rows in 0.01 sec (total time 1.01 sec)
Gene_Symbol | allCount | |
---|---|---|
0 | SKA3 | 3 |
1 | HLA-DQB1 | 2 |
2 | ITPKB | 2 |
3 | OR5H1 | 2 |
4 | PCSK9 | 2 |
... | ... | ... |
74 | TTC3 | 1 |
75 | UBE3A | 1 |
76 | UEVLD | 1 |
77 | ZC3H15 | 1 |
78 | ZNF23 | 1 |
79 rows × 2 columns
We will pick the PCSK9 as our example in the following example. The two variants there are:
%%gor
$mydefs
gor [##gdx_result##] | where gene_symbol = 'PCSK9' | select 1-4,auto_*
Query ran in 0.08 sec Query fetched 2 rows in 0.02 sec (total time 0.10 sec)
CHROM | POS | Reference | Call | auto_classification | auto_evidence | auto_classification_simple | auto_score | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 55505613 | G | T | VUS | PM2,PP2 | VUS | 2 |
1 | chr1 | 55517991 | C | CAT | LP | PVS1,PM2 | LP | 9 |
and we will make the drill-in from the LP variants at pos = 55517991. Now, specificly for the gene-drill of this variant, we add the following definitions:
mydefs = qh.add_many("""
def ##wgs_vars## = source/var/wgs_varcalls.gord -s PN;
def ##segment_cov## = source/cov/segment_cov.gord -s PN;
def ##cand_vars## = source/var/candidate_variants.gord -s PN;
def ##var_gene_symbol## = 'PCSK9';
def ##var_pos## = 55517991;
""")
Gene variant count summary¶
To summarize the variants, we quickly read through the gene relations and filter it by our gene of interest, ##var_gene_symbol##, and perform a intersect-right-join with all the variants in the case-study, ##all_pns##. We summarize the counts for each case-ctrl group (and the index-case) into multi-columnar table, using the PIVOT command. In #varCounts#, we then collaps these columns into a attribute-value column, varCounts, showing only attributes for the groups that have non-zero count.
mydefs = qh.add("""
create #ccvar# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 <(gor ##wgs_vars## -f ##all_pns## )
| varnorm -left #3 #4
| calc PNtype ##pntype##
| split PNtype
| calc het IF(callCopies = 1,1,0)
| calc hom IF(callCopies = 2,1,0)
| group 1 -gc reference,call,PNtype -sum -ic het,hom
| rename sum_het het
| rename sum_hom hom
| pivot PNtype -gc reference,call -v 'index',##pivot_type## -e 0
| calc CallCopies if(index_het=1,'1',if(index_hom=1,'2','0'))
| hide index_*
| columnsort chrom,pos,reference,call,callcopies
;""")
mydefs = qh.add("""
create #varCounts# = gor [#ccvar#]
| calc x male_cases_het+','+male_cases_hom+','+female_cases_het+','+female_cases_hom+','+male_controls_het+','+male_controls_hom+','+female_controls_het+','+female_controls_hom
| calc y 'caseMaleHet,caseMaleHom,caseFemHet,caseFemHom,ctrlMaleHet,ctrlMaleHom,ctrlFemHet,ctrlFemHom'
| calc varCounts replace(listzipfilter(listzip(y,x),x,'int(x)>0'),';','=')
| hide CallCopies[+1]-varCounts[-1];
""")
%%gor
$mydefs
gor [#varCounts#] | where pos in ('55517991','55505613') | varnorm -left #3 #4
Query ran in 0.09 sec Query fetched 2 rows in 0.01 sec (total time 0.10 sec)
CHROM | POS | Reference | Call | CallCopies | varCounts | |
---|---|---|---|---|---|---|
0 | chr1 | 55505613 | G | T | 1 | ctrlFemHet=1 |
1 | chr1 | 55517991 | C | CAT | 1 | ctrlMaleHet=1 |
Ensembl VEP annotation¶
We want to annotate every variant in the gene with VEP and show its position in context of exons and introns. Since there are possibly many different transcripts (and we want a single row per variant) we have to pick some transcript for this. We could use the feature value of the variant specified with ##var_pos##, however, this parameter it not mandatory and at present, the Gregor interface is not feeding this parameter into the YML. Therefore, our strategy is to pick the transcript that shows most often in the output of step2.
mydefs = qh.add_many("""
create #sel_feature# = nor <(gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 -xl gene_symbol -xr gene_symbol [##gdx_result##])
| group -gc feature -count
| rename feature selected_feature
| merge <(norrows 1 | calc selected_feature '' | select selected_feature | calc allcount 0)
| sort -c allcount:nr
| top 1
| hide allcount
;
create #vepvars# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 -xl gene_symbol -xr gene_symbol ##vep_multi##
| multimap -cartesian [#sel_feature#]
| where selected_feature = '' or feature = selected_feature
| hide selected_feature
| select 1,2,Reference,call,gene_symbol,Norm,Span,Max_Consequence,Consequence,cDNA_position,CDS_position,Protein_position,EXON,INTRON
| group 1 -gc #3,#4,gene_symbol -set -sc Norm-
| rename set_(.*) #{1}
;""")
The code above needs some explanation! In Gregor, the gene-drill-in view can be invoked from views that are not based on step2, e.g. gene coverage summary or CNVs. Thus, there is a possiblity that a gene is selected, for which no variant is present in ##mendel_result##. This would lead to an empty #sel_feature# relation and therefore the cartesian multimap join in #vepvars# would provide no output rows. While there are many ways to pick transcript features, e.g. simply take the one with the largest coding region, here we simply add a dummy empty feature to #sel_feature#. The sort and top 1 selection will ensure that it only shows up if no variant came out of the join with ##mendel_result##. Similarly, in #vepvars#, if this dummy feature was selected, we pass through all the features, instead of blocking all. Finally, to deal with this case of multiple features, we aggregate them with the GROUP command.
This is a somewhat ugly solution that could for instance be fixed by having the execution context passed in as parameter. We leave this as is for now - at least not hide any data! Notice that #vepvars# is an extract of all the variants in ##vep_multi## for a single gene and it may include significantly more variants than are present in our case-study, i.e. #varCounts#.
%%gor
$mydefs
nor <(gor [#vepvars#] | varjoin -ic [#varCounts#] ) | group -gc overlapcount -count
Query ran in 0.14 sec Query fetched 2 rows in 0.01 sec (total time 0.15 sec)
OverlapCount | allCount | |
---|---|---|
0 | 0 | 31 |
1 | 1 | 50 |
Speed considerations¶
The above numbers are not realistic since there are very few studies in this project. However, even for very large projects with a very high number of samples, the execution of #vepvars# should at most take few seconds. It is however possible to rewrite it and use PGOR statement that uses -SPLIT based on SEGHIST of all the variants in the project, e.g. source/anno/vep_v110/allvars.gord.
For educational purpose, we will quickly show such version below, although the YML is not written in that way:
mydefs = qh.add_many("""
create #segsplits# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 source/anno/vep_v110/allvars.gord
| group 1000 -count | seghist 100000;
create #vepvars# = pgor -split <(gor [#segsplits#]) [#varCounts#]
| varjoin -ir ##vep_multi##
| where gene_symbol = ##var_gene_symbol##
| multimap -cartesian [#sel_feature#]
| where selected_feature = '' or feature = selected_feature
| hide selected_feature
| select 1,2,Reference,call,gene_symbol,Norm,Span,Max_Consequence,Consequence,cDNA_position,CDS_position,Protein_position,EXON,INTRON
| group 1 -gc #3,#4,gene_symbol -set -sc Norm-
| rename set_(.*) #{1}
| varjoin -i [#varCounts#]
;""")
In the above version, we generate split segments that based allvars.gord have no more than 100k variants in it. Keeping in mind the nature of SEGHIST, i.e. that it generates segments that cover the whole genome, as shown here:
%%gor
$mydefs
gor [#segsplits#]
Query ran in 1.18 sec Query fetched 1 rows in 0.01 sec (total time 1.19 sec)
Chrom | bpStart | bpStop | Count | |
---|---|---|---|---|
0 | chr1 | 0 | 249250621 | 81 |
in #vepvars#, as GOR left-source, we therefore read the #varCounts# and varjoin into the BIG VEP relation the variants we need. The cost of this is proportional to the variant density in ##vep_multi## while the size of the cached data is only as large as the density of variants in #varCounts#. Because of this intersect join, we see that all the variants have an overlap count of 1:
%%gor
$mydefs
nor <(gor [#vepvars#] | varjoin -ic [#varCounts#] ) | group -gc overlapcount -count
Query ran in 0.73 sec Query fetched 1 rows in 0.01 sec (total time 0.73 sec)
OverlapCount | allCount | |
---|---|---|
0 | 1 | 50 |
We can easily add similar intersect join with #varCounts# in the first version of our #vepvars# statement. This does however mean that the query cannot be scheduled until execution of #varCounts# is finished and may therefore not necessarily be faster. Subsequent query that joins with #vepvars# will however be faster because of its reduced size.
Allele frequency information¶
Similarly to the VEP annotation, we want to decorate every variant with AF information. Here we draw AF infor from two sources as shown below. The same discussion about AF as in Gregor1.ipynb applies here!
mydefs = qh.add_many("""
create #XAaf# = pgor XA/analysis/unaff_parents_conf_gender_af.gorz | select 1-4,conservative_af,homCount
| varnorm -left #3 #4 | atmax 1 -gc ref,alt conservative_af ;
create #afvars# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 ##ref##/freq_max.gorz
| select 1-4,gnomad_af_max | replace gnomad_af_max form(gnomad_af_max,4,5)
| distinct
| varjoin -r -l -e 0 -rprefix XA [#XAaf#]
| replace gnomad_af_max,XA_conservative_af form(float(#rc),4,5)
;""")
Final output¶
Before the final query, we also extract a subset from the step2 results as annotations on variants where they apply. This could ofcourse also be done by a nested query in the final query.
mydefs = qh.add_many("""
create #mendvars# = gor ##all_genes## | where gene_symbol = ##var_gene_symbol##
| join -ir -segsnp -f 100 -xl gene_symbol -xr gene_symbol [##gdx_result##]
| replace ref_af form(ref_af,4,5)
| select 1,2,Reference,Call,auto_classification,auto_evidence,inheritance,GDX_view,GDX_classification,GT_Paternity,ref_af,Fidel_PRECPAT,spliceai_max_impact
| distinct
;""")
Now the final output is a simple join from all the variants in the case-study with our annotations relations:
%%gor finaloutput <<
$mydefs
gor [#varCounts#]
| varjoin -r -l -e '' [#mendvars#]
| varjoin -r [#vepvars#]
| varjoin -r -l -e 0 [#afvars#]
| calc dist pos - ##var_pos##
| select 1,2,Reference,Call,CallCopies,dist,gene_symbol,auto_classification,auto_evidence,EXON,INTRON,varCounts,auto_evidence,auto_classification,GDX_view,GDX_classification,GT_Paternity,inheritance,Consequence,XA_conservative_af,XA_homCount,gnomad_af_max,Norm,Span,Max_Consequence,cDNA_position,CDS_position,Protein_position,Fidel_PRECPAT,spliceai_max_impact
Query ran in 0.83 sec Query fetched 50 rows in 0.01 sec (total time 0.84 sec)
finaloutput
CHROM | POS | Reference | Call | CallCopies | dist | Gene_Symbol | auto_classification | auto_evidence | EXON | ... | XA_homCount | gnomad_af_max | Norm | Span | Max_Consequence | cDNA_position | CDS_position | Protein_position | Fidel_PRECPAT | spliceai_max_impact | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 55505613 | G | T | 1 | -12378 | PCSK9 | VUS | PM2,PP2 | 1/12 | ... | 0 | 0.00004 | S | 0 | missense_variant | 393 | 103 | 35 | 0.956313 | NaN |
1 | chr1 | 55505732 | A | G | 1 | -12259 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.97029 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
2 | chr1 | 55509355 | C | T | 1 | -8636 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.96979 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
3 | chr1 | 55509900 | C | T | 1 | -8091 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.16981 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
4 | chr1 | 55509939 | T | C | 2 | -8052 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.16981 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
5 | chr1 | 55512549 | T | C | 1 | -5442 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.87313 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
6 | chr1 | 55513061 | T | C | 0 | -4930 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.80231 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
7 | chr1 | 55513348 | G | T | 1 | -4643 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.00000 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
8 | chr1 | 55513725 | T | A | 0 | -4266 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.87313 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
9 | chr1 | 55514024 | C | A | 2 | -3967 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.00000 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
10 | chr1 | 55514046 | G | T | 2 | -3945 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.87313 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
11 | chr1 | 55514973 | G | C | 2 | -3018 | PCSK9 | NaN | NaN | NaN | ... | 0 | 1.00000 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
12 | chr1 | 55516188 | C | T | 2 | -1803 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.80720 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
13 | chr1 | 55516781 | G | T | 0 | -1210 | PCSK9 | NaN | NaN | NaN | ... | 0 | 1.00000 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
14 | chr1 | 55517861 | C | G | 1 | -130 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.76132 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
15 | chr1 | 55517883 | C | G | 1 | -108 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.70558 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
16 | chr1 | 55517991 | C | CAT | 1 | 0 | PCSK9 | LP | PVS1,PM2 | 4/12 | ... | 0 | 0.00035 | L | 3 | frameshift_variant | 854-855 | 564-565 | 188-189 | -1.000000 | NaN |
17 | chr1 | 55518166 | G | A | 1 | 175 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.78655 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
18 | chr1 | 55518190 | AC | A | 1 | 199 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.76427 | L | 7 | intron_variant | NaN | NaN | NaN | NaN | NaN |
19 | chr1 | 55518316 | C | T | 1 | 325 | PCSK9 | NaN | NaN | NaN | ... | 1140 | 0.50049 | S | 0 | splice_region_variant | NaN | NaN | NaN | NaN | NaN |
20 | chr1 | 55518467 | A | G | 1 | 476 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.72285 | S | 0 | splice_donor_region_variant | NaN | NaN | NaN | NaN | NaN |
21 | chr1 | 55518528 | C | A | 1 | 537 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.77161 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
22 | chr1 | 55518682 | C | T | 1 | 691 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.47277 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
23 | chr1 | 55520408 | G | C | 2 | 2417 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.76671 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
24 | chr1 | 55520682 | GA | G | 2 | 2691 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.77236 | L | 13 | intron_variant | NaN | NaN | NaN | NaN | NaN |
25 | chr1 | 55521109 | G | A | 0 | 3118 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.77057 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
26 | chr1 | 55521899 | GGGGCGGA | G | 1 | 3908 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.89578 | L | 17 | intron_variant | NaN | NaN | NaN | NaN | NaN |
27 | chr1 | 55522741 | C | CCGGT | 2 | 4750 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.86457 | L | 1 | intron_variant | NaN | NaN | NaN | NaN | NaN |
28 | chr1 | 55522741 | C | CCTGT | 0 | 4750 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.39025 | L | 2 | intron_variant | NaN | NaN | NaN | NaN | NaN |
29 | chr1 | 55523033 | A | G | 2 | 5042 | PCSK9 | NaN | NaN | 7/12 | ... | 236917 | 1.00000 | S | 0 | synonymous_variant | 1316 | 1026 | 342 | NaN | NaN |
30 | chr1 | 55523361 | A | G | 2 | 5370 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99936 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
31 | chr1 | 55523483 | C | T | 2 | 5492 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99936 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
32 | chr1 | 55523984 | C | T | 2 | 5993 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99871 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
33 | chr1 | 55524116 | C | T | 2 | 6125 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99872 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
34 | chr1 | 55524197 | A | G | 2 | 6206 | PCSK9 | NaN | NaN | 9/12 | ... | 0 | 0.99318 | S | 0 | synonymous_variant | 1670 | 1380 | 460 | NaN | NaN |
35 | chr1 | 55524237 | G | A | 2 | 6246 | PCSK9 | NaN | NaN | 9/12 | ... | 0 | 0.99318 | S | 0 | missense_variant | 1710 | 1420 | 474 | NaN | NaN |
36 | chr1 | 55524339 | CGTGTGTGTGTGT | C | 0 | 6348 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.08620 | L | 40 | intron_variant | NaN | NaN | NaN | NaN | NaN |
37 | chr1 | 55524339 | CGTGTGTGTGTGTGT | C | 1 | 6348 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.43879 | L | 38 | intron_variant | NaN | NaN | NaN | NaN | NaN |
38 | chr1 | 55524339 | CGTGTGTGTGTGTGTGTGT | C | 1 | 6348 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.11716 | L | 34 | intron_variant | NaN | NaN | NaN | NaN | NaN |
39 | chr1 | 55524373 | TGTGTGTGTGTGTG | T | 1 | 6382 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.00000 | L | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
40 | chr1 | 55524387 | T | C | 2 | 6396 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99740 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
41 | chr1 | 55524601 | G | A | 2 | 6610 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.91528 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
42 | chr1 | 55524842 | G | A | 1 | 6851 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.86632 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
43 | chr1 | 55525400 | G | A | 1 | 7409 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.86026 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
44 | chr1 | 55526685 | G | A | 2 | 8694 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99871 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
45 | chr1 | 55527323 | G | A | 1 | 9332 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.85806 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
46 | chr1 | 55527479 | A | G | 2 | 9488 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99872 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
47 | chr1 | 55527918 | C | T | 0 | 9927 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.99872 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
48 | chr1 | 55528807 | G | C | 1 | 10816 | PCSK9 | NaN | NaN | NaN | ... | 0 | 0.33353 | S | 0 | intron_variant | NaN | NaN | NaN | NaN | NaN |
49 | chr1 | 55529187 | G | A | 1 | 11196 | PCSK9 | NaN | NaN | 12/12 | ... | 0 | 0.97625 | S | 0 | missense_variant | 2299 | 2009 | 670 | NaN | NaN |
50 rows × 30 columns
Notice that the dist column stores the base-pair distance from the varinat from which we activated the drill-in. Also notice that while most of the variants have AF annotation, the other annotations that are drawn from step2 are only present on the "most important" variants. If we want, we can change that and annotate by joining directly with those sources, however, some annotations such as auto_classification will call for the full query logic in step1 and step2.
To demonstrate this, we left join our output with the full Fidel table and prefix the corresponding columns Fidel2. We pick the three missense variants and do some transpose gymnastics to make it easier to visualize:
%%gor
nor <(gor [var:finaloutput] | where max_consequence in ('missense_variant')
| varjoin -r -l -e 'missing' -rprefix Fidel2 ref_base/GDX/Fidel/fidelhg19.gorz
| calc var iooa(pos) )
| columnreorder var
| unpivot #2-
| calc colord iooa(col_name)
| pivot var -v 1,2,3 -e '' -gc col_name,colord
| sort -c colord:n
| hide colord
| rename (.*)_col_value var_#{1}
Query ran in 0.08 sec Loading relations [var:finaloutput] from local state Query ran in 0.16 sec Query fetched 35 rows in 0.02 sec (total time 0.18 sec)
Col_Name | var_1 | var_2 | var_3 | |
---|---|---|---|---|
0 | CHROM | chr1 | chr1 | chr1 |
1 | POS | 55505613 | 55524237 | 55529187 |
2 | Reference | G | G | G |
3 | Call | T | A | A |
4 | CallCopies | 1 | 2 | 1 |
5 | dist | -12378 | 6246 | 11196 |
6 | Gene_Symbol | PCSK9 | PCSK9 | PCSK9 |
7 | auto_classification | VUS | NaN | NaN |
8 | auto_evidence | PM2,PP2 | NaN | NaN |
9 | EXON | 1/12 | 9/12 | 12/12 |
10 | INTRON | NaN | NaN | NaN |
11 | varCounts | ctrlFemHet=1 | ctrlMaleHom=1,ctrlFemHom=1 | ctrlMaleHom=1,ctrlFemHet=1 |
12 | auto_evidencex | PM2,PP2 | NaN | NaN |
13 | auto_classificationx | VUS | NaN | NaN |
14 | GDX_view | chz | NaN | NaN |
15 | GDX_classification | LPATH | NaN | NaN |
16 | GT_Paternity | Maternal | NaN | NaN |
17 | inheritance | CHZcc | NaN | NaN |
18 | Consequence | missense_variant | missense_variant | missense_variant |
19 | XA_conservative_AF | 1e-05 | 0.0 | 0.0 |
20 | XA_homCount | 0 | 0 | 0 |
21 | gnomad_af_max | 4e-05 | 0.99318 | 0.97625 |
22 | Norm | S | S | S |
23 | Span | 0 | 0 | 0 |
24 | Max_Consequence | missense_variant | missense_variant | missense_variant |
25 | cDNA_position | 393 | 1710 | 2299 |
26 | CDS_position | 103 | 1420 | 2009 |
27 | Protein_position | 35 | 474 | 670 |
28 | Fidel_PRECPAT | 0.956313 | NaN | NaN |
29 | spliceai_max_impact | NaN | NaN | NaN |
30 | Fidel2_PRECBEN | 0.925260 | 0.993031 | 0.995968 |
31 | Fidel2_PRECPAT | 0.956313 | 0.684241 | 0.626545 |
32 | Fidel2_ALPHA | 0.118100 | 0.073900 | 0.054000 |
33 | Fidel2_REVEL | 0.320000 | 0.051000 | 0.067000 |
34 | Fidel2_CALIBRATED | 1 | 1 | 1 |
Now we see that all the variants have Fidel2 annotations whereas only one of them gets annotation from step2. The fact that we are extracting the Fidel variant annotations from a single gene, the join comes with almost no cost, just like the AF joins we setup before in #afvars#.
The easiest way to see the complete query statements, including those from step2, is to click on a gene from the standard variants views from step2, e.g. All_variants or GDX_variants, and then use the toolbar icon to copy the GOR query statemenst to the clipboard (next to the Excel icon).