Gregor2 SNV postprocessing¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "gdx-demo"
%env GOR_API_PROJECT={project}
Gregor Step2¶
Introduction¶
This notebook is an overview of the YML-based GOR query logic used as a the main interactive variant analysis and view in Gregor, i.e. step2. The design goal when the Gregor query logic was refactored was to make sure that the data dependancy of the interactive steps would primarily rely on a relatively small archivable data set and not on many very large, and potentially volatile, reference data sets. Not only would this make it easier to regenerate the output of step2 at later times but also to make the queries faster.
Gregor does archive the output of step1 for each case-study (and potentially more versions if study-setup changes). The query logic in step2 does then this as the "annotated variant input" and then applies last-minute decisions from the user to do the final processing. This may include:
- Variant QC filtering
- AF frequency filtering
- Refined VEP impact filtering
- Optional gene panel based feature selection
- Candidate (phenotype related) gene ACMG evidence labeling
- Inheritance logic (based on filtering, e.g. CHZ)
- Final ACMG scoring
- View classification logic, e.g. GDX-views
For a typical trio-output from step1, with about 250k transcript annotated variant rows and close to 150 columns, the processing time in step2 is typically few seconds and always < 15 seconds, even for parameter changes such as QC or AF cut-offs that involve re-evaluation of inheritance rules.
Here we will outline the structure of step2, including some recent changes that have been setup as a prototype for new GDX view filters.
qh = GQH.GOR_Query_Helper()
mydefs = ""
Like in step1, we start with some definitions that are directly based on input parameters.
mydefs = qh.add_many("""
def ##ref## = ref;
def ##fuzzdist## = 20;
def ##min_read_depth## = 8;
def ##min_hom_call_ratio## = 0.66;
def ##min_het_call_ratio## = 0.2;
def ##min_gt_likelihood## = 5;
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;
def ##max_cust_af## = ##rmax_af##;
def ##delta_ctrl## = 0;
def ##delta_cases## = 0;
def ##max_num_candgenes## = 10;
def ##disease_specificity_max_gene_count## = 20;
def ##min_lhz_overlap_fraction## = 0.1;
""")
The output from step1 is provided as a parameter to step2. It is assigned to ##vcf_wgs_anno##.
Step2 also accept two optional gene input files; one for so-called candidate genes, based on gene lists and HPO gene associations, and a gene panel info file that stores optional AF cut-off and inheritance models for genes. The candidate gene file is defined in another YML and executed by Gregor and stored in the corresponding study folder.
# Study type setup
mydefs = qh.add_many("""
def ##vcf_wgs_anno## = user_data/Gregor1/DEMO_2796799_v3.gorz;
def ##candidate_gene_file## = studies/GCA724303/candidate_genes.gor;
""")
mydefs = qh.add_many("""
create ##demo_genepanelinfo## = nor #genes# | select gene_symbol | top 100
| calc Gene_RmaxAf 'default' | calc Gene_RmaxGf 'default'
| calc Gene_DmaxAf 'default' | calc Gene_MOI 'all' | calc Gene_Deftrans 'default';
def ##genepanelinfo## = [##demo_genepanelinfo##];
def ##defMOI## = 'all';
""")
Rather than providing a real gene panel info file, we have just setup a demo file with the necessary columns and values that simply trigger the default behaviour, i.e. ##rmax_af##, ##dmax_af##, and ##defMOI##, essentially the same value as for the genes that are missing from the file.
In the future, we would like this file to come from GeneMB export. Later, we will discuss a preliminary export from GeneMB and how we have included it into the GDX based view calculations.
We add some more definitions:
# Filters
mydefs = qh.add_many("""
def ##vep_filter## = consequence in ('transcript_ablation','splice_acceptor_variant','splice_donor_variant','stop_gained','frameshift_variant','stop_lost','start_lost','transcript_amplification','inframe_insertion','inframe_deletion','missense_variant','protein_altering_variant','splice_region_variant','incomplete_terminal_codon_variant');
def ##vcf_filter## = ('Not LowQual');
def ##af_filter## = ref_af <= Gene_RmaxAf and ref_af_cust <= ##max_cust_af##;
def ##cat_filter## = DIAG_ACMGcat in ('Cat1','Cat1B','Cat1C');
def ##known_filter## = qc_known_vars = 1;
def ##protected_filter## = qc_protected_vars = 1;
def ##vcf_quality_filter## = qc_vcf_filter = 1;
def ##variant_scope_filter## = 2 = 2;
def ##repeat_regions_filter## = qc_repeat_regions = 0 ;
""")
# this is the main filter predicate used below in ##vcf_filtered##
mydefs = qh.add("""
def ##variant_filter## = ( ##vep_filter## ) and ( ##af_filter## ) and ( ##repeat_regions_filter## ) and ( ##variant_scope_filter## ) and ( ##vcf_quality_filter## ) or ( ##cat_filter## ) or ( ##protected_filter## ) or ( ##known_filter## )
;""")
# Some more reference files
mydefs = qh.add_many("""
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##gene_map## = ##ref##/refgenes/refgenes.map;
def ##vep_impact_map## = ##ref##/VEP_impact.map;
""")
# Some definition used in the GDX-view calculations
mydefs = qh.add_many("""
def #denovoAFpv# = 0.0005;
def #homAF# = 0.005;
def #chzAF# = 0.01;
def #xlinkAF# = 0.01;
def #sihetAF# = 0.01;
def #domAF# = 0.0005;
def #chomAF# = 0.001;
""")
# participant setup
mydefs = qh.add_many("""
def ##index_case## = 'DEMO_2796799';
def ##father## = 'DEMO_2827751';
def ##father_affstat## = 'unaffected';
def ##mother## = 'DEMO_2819603';
def ##mother_affstat## = 'unaffected';
def ##male_controls## = 'DEMO_2827751';
def ##male_cases## = ;
def ##female_controls## = 'DEMO_2819603';
def ##female_cases## = ;
def ##pivot_type## = 'male_cases','female_cases','male_controls','female_controls';
def ##pntype## = IF(PN IN ('DEMO_2796799'),'index',IF(PN IN ('DEMO_2819603'),'female_controls',IF(PN IN ('DEMO_2827751'),'male_controls','UNKNOWN')));
""")
We transform the gene panel info table and expand it for all gene aliases:
# Making the pane info file work for gene aliases
mydefs = qh.add_many("""
create ##genealiaspanelinfofile## = nor -h ##genepanelinfo##
| multimap -c gene_symbol -m missing -h <(nor ##gene_map## -h | select gene_symbol,gene_aliases | split gene_aliases)
| replace gene_symbol if(gene_aliases != 'missing',gene_aliases,gene_symbol) | hide gene_aliases | distinct
| group -gc gene_symbol -set -sc Gene_RmaxAf- -len 10000 | rename set_(.*) #{1}
| replace Gene_RmaxAf- if(listfilter(#rc,'x!="default"')!='',listfilter(#rc,'x!="default"'),'default')
| replace Gene_RmaxAf-Gene_DmaxAf if(#rc!='default',str(listnummax(#rc)),#rc);
""")
We want to combine the XA AF statistics with the ref_AF value from step1. First we make sure it is properly normalized, because variants in XA are not. We take the max where there is ambiguity in InDel notation:
# Setup the AF from XA in proper structure, e.g. normalize and max
mydefs = qh.add_many("""
create #XAaf# = pgor ref_base/GDX/XA/unaff_parents_conf_gender_af.gorz
| select 1-4,conservative_af,homCount
| varnorm -left #3 #4 | atmax 1 -gc ref,alt conservative_af;
""")
Filter the input variants¶
The filtering step below uses the definition ##variant_filter## shown above, i.e.
vep_filter##, ##af_filter##, ##repeat_regions_filter##, ##variant_scope_filter##, ##vcf_quality_filter##¶
cat_filter##¶
protected_filter##¶
known_filter##¶
Notice that we also join the XA-AF and let it override the ref_AF (from Gnomad in step1) if higher. The we map the ##genealiaspanelinfofile## to get gene specific AF cut-offs. Notice that missing genes get the "default" value, e.g. ##rmax_af## etc.
# We since there is no AF filtering in Gregor-step1, we can simply let the XA-AF override the Gnomad-AF before we filter
mydefs = qh.add_many("""
create ##vcf_filtered## = pgor ##vcf_wgs_anno##
| varjoin -l -r -e 0 -r -rprefix XA [#XAaf#]
| replace ref_af if(ref_af < XA_conservative_af, XA_conservative_af, ref_af)
| replace homcount if(homcount < XA_homcount, XA_homcount, homcount)
| map -c gene_symbol [##genealiaspanelinfofile##] -n Gene_RmaxAf,Gene_RmaxGf,Gene_DmaxAf,Gene_MOI -m 'default'
| prefix Gene_RmaxAf- Temp
| calc Gene_RmaxAf if(Temp_Gene_RmaxAf='default',##rmax_af##,float(Temp_Gene_RmaxAf))
| calc Gene_DmaxAf if(Temp_Gene_DmaxAf='default',##dmax_af##,float(Temp_Gene_DmaxAf))
| calc Gene_MOI if(Temp_Gene_MOI='default',##defMOI##,Temp_Gene_MOI)
| hide Temp_*
| calc qc_vcf_filter if( not(filter = 'LowQual') and
(Depth = -1 or GL_Call >= ##min_gt_likelihood## and
Depth >= ##min_read_depth## and (CallCopies = 2 and CallRatio >= ##min_hom_call_ratio## or CallCopies = 1 and CallRatio >= ##min_het_call_ratio## and CallRatio <= 1.0-##min_het_call_ratio##))
,1,0)
| where ##variant_filter##;
""")
Usually, the settings used in step2, the ##variant_filter## eliminates many low impact variants, e.g.:
%%gor
$mydefs
create #temp# = nor ##vcf_wgs_anno## | group -gc pn -count
| map -c pn <(nor [##vcf_filtered##] | group -gc pn -count | rename allcount filterCount);
nor [#temp#]
Query ran in 0.20 sec Query fetched 3 rows in 0.00 sec (total time 0.20 sec)
PN | allCount | filterCount | |
---|---|---|---|
0 | DEMO_2796799 | 250398 | 1396 |
1 | DEMO_2796810 | 266250 | 1567 |
2 | DEMO_2827751 | 253335 | 1519 |
Aggregate gene statistics¶
After having filtered the variants, we now aggregate some metrics for the for each gene-feature. First for the index-case then for the case-control groups. Notice that we join gene segments with the variants in order to get the gene_start and gene_stop, because we want to use this information later in GOR-join. If we would aggregate this for the gene symbols and store this as NOR ouput (TSV) and use this later with MAP command, we would need to store a memory-based hash-map with about 200k rows, given that there are on average ten transcrips per gene - definitely possible, but with much higher memory footprint than with GOR join. Here we aggregate per CHROM and thus the memory footprint is 10x smaller than if genome-wide.
Notice the final select replaces the chrom start with the gene_start, hence we must SORT per chrom to ensure correct ordering! In ##ccgene##, also pay attention to the -e option in the final PIVOT commands, that deals with the sparse nature of the input stream and sets all the count columns to zero where rows are absent.
# Aggregate index-case for gene transcripts
mydefs = qh.add_many("""
create ##pngene## = pgor [##vcf_filtered##]
| select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,callCopies,feature,diag_denovo,Gene_DmaxAf
| calc PNtype ##pntype##
| split PNtype
| where PNtype='index'
| join -varseg -f ##fuzzdist## -xl gene_symbol -xr gene_symbol <(gor ##all_genes## | select 1-4)
| select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,callCopies,feature,gene_start,gene_end,diag_denovo,Gene_DmaxAf
| calc from_father if(GT_Paternity='Paternal',1,0)
| calc from_mother if(GT_Paternity='Maternal',1,0)
| calc from_unknown if(from_father=0 and from_mother=0,1,0)
| calc from_denovo if(diag_denovo=1 and ref_af<Gene_DmaxAf,1,0)
| calc hom IF(Zygosity = 'hom',1,0)
| calc het IF(Zygosity = 'het',1,0)
| group chrom -gc PN,feature,gene_symbol,gene_start,gene_end -ic from_father,from_mother,from_unknown,from_denovo,het,hom -sum
| calc index_subjCompHeterInGeneTrans if( sum_het>1 and ( sum_from_father>0 and sum_from_mother+sum_from_denovo>0 or sum_from_father+sum_from_denovo>0 and sum_from_mother>0),1,0)
| calc index_subjCompHeterInGene if( sum_het>1 and (index_subjCompHeterInGeneTrans=1 or ( sum_from_father>0 and sum_from_mother+sum_from_unknown>0 or sum_from_father+sum_from_unknown>0 and sum_from_mother>0 or sum_from_unknown>1 )),1,0)
| calc index_subjWithVarInGene IF(sum_het+sum_hom>0,1,0)
| calc index_subjWithHomVarInGene IF(sum_hom>0,1,0)
| select chrom,gene_start,gene_end,feature,gene_symbol,PN,index_*
| sort chrom;
""")
# Aggregate other samples in case-control-male-female groups for gene transcripts
mydefs = qh.add_many("""
create ##ccgene## = pgor [##vcf_filtered##]
| select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,feature,covered
| join -varseg -f ##fuzzdist## -xl gene_symbol -xr gene_symbol <(gor ##all_genes## | select 1-4)
| select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,feature,gene_symbol,gene_start,gene_end,covered
| calc from_father if(GT_Paternity='Paternal' or zygosity='hom',1,0)
| calc from_mother if(GT_Paternity='Maternal' or zygosity='hom',1,0)
| calc from_unknown if(from_father=0 and from_mother=0,1,0)
| calc hom IF(Zygosity = 'hom',1,0)
| calc het IF(Zygosity = 'het',1,0)
| group chrom -gc PN,feature,gene_symbol,gene_start,gene_end -ic from_father,from_mother,from_unknown,covered,het,hom -sum
| calc CHZ if(sum_covered>0 and (sum_from_father>0 and sum_from_mother+sum_from_unknown>0 or sum_from_father+sum_from_unknown>0 and sum_from_mother>0 or sum_from_unknown>1),1,0)
| calc WVIG IF(sum_het+sum_hom>0,1,0)
| calc WHVIG IF(sum_hom>0,1,0)
| calc gcovered if(sum_covered>0,1,0)
| calc PNtype ##pntype##
| group chrom -gc feature,gene_symbol,gene_start,gene_end,PNtype -sum -ic CHZ,WVIG,WHVIG,gcovered
| rename sum_WVIG subjWithVarInGene
| rename sum_WHVIG subjWithHomVarInGene
| rename sum_CHZ subjCompHeterInGene
| rename sum_gcovered subjWithGeneCovered
| select chrom,gene_start,gene_end,feature,gene_symbol,PNtype,subj*
| sort chrom
| pivot -gc gene_end,feature,gene_symbol PNtype -v ##pivot_type## -e 0;
""")
Inheritance rules¶
We apply the inheritance rules on the filtered variants and join the aggregate numbers from ##pngene## and ##ccgene#, as calculated above, using a -varseg join per gene_symbol,feature and PN. The query shown below is based on FreeMarker logic for a trio with unaffected parents. When there are male or female cases, some of the conditions look more complicated and make use of column values such as male_cases_VarCovered and/or female_cases_VarCovered as calculated in ##ccvar## of step1. Likewise, diag_denovo is calculated in step1.
# Apply inheritance rules after filtering, using ##pngene## and ##ccgene## for CHZ logic
mydefs = qh.add_many("""
create ##vcf_inheritance## = pgor [##vcf_filtered##]
| calc PNtype ##pntype##
| split PNtype
| where PNtype='index'
| calc index_subjWithVar = 1
| calc index_subjWithHomVar = if(zygosity='hom',1,0)
| calc father_affstat ##father_affstat##
| calc mother_affstat ##mother_affstat##
| replace diag_denovo_strict if(diag_denovo_strict=1 and ref_af>Gene_DmaxAf,0,diag_denovo_strict)
| calc diag_dominant if(ref_af <= Gene_DmaxAf and (
index_subjWithVar>0 and 1=2
),1,0)
| calc diag_homrecess
if(( chrom = 'chrX' and (
( gender = 'male' and index_subjWithVar>0 or gender = 'female' and index_subjWithHomVar>0)
and female_controls_subjWithHomVar+male_controls_subjWithVar <= ##delta_ctrl##
) or chrom != 'chrX' and ( index_subjWithHomVar>0
and female_controls_subjWithHomVar+male_controls_subjWithHomVar <= ##delta_ctrl##
)),1,0)
| join -varseg -l -r -e 0 -xl gene_symbol,feature,PN -xr gene_symbol,feature,PN [##pngene##]
| hide gene_symbolx,featurex,PNx
| join -varseg -l -r -e 0 -xl gene_symbol,feature -xr gene_symbol,feature [##ccgene##]
| hide gene_symbolx,featurex
| calc diag_chz_solo if(zygosity='het' and (
chrom = 'chrX' and ( gender = 'male' and index_subjWithVar>0 or gender = 'female' and index_subjWithVar>0 and (GT_Paternity in ('Paternal','Maternal') or diag_denovo = 1) and index_subjCompHeterInGeneTrans>0)
or chrom != 'chrX' and index_subjWithVar > 0 and (GT_Paternity in ('Paternal','Maternal') or diag_denovo = 1) and index_subjCompHeterInGeneTrans>0
),1,0)
| calc case_support_xlinked 1
| calc control_support_xlinked if(female_controls_subjCompHeterInGene+male_controls_subjWithVarInGene <= ##delta_ctrl##,1,0)
| calc case_support_autosomal 1
| calc control_support_autosomal if(female_controls_subjCompHeterInGene+male_controls_subjCompHeterInGene <= ##delta_ctrl##,1,0)
| calc diag_chz if(diag_chz_solo=1 and ( chrom = 'chrX' and ( case_support_xlinked=1 and control_support_xlinked=1
) or chrom != 'chrX' and ( case_support_autosomal=1 and control_support_autosomal=1
)),1,0)
| hide case_support*,control_support*
| calc mother_withvar if( mother_call!='NaN' and GT_Paternity!='Paternal' and not(GT_Paternity='Unknown' and var_type='sub' and not(listhasany(mother_call,call))),1,0)
| calc father_withvar if( father_call!='NaN' and GT_Paternity!='Maternal' and not( GT_Paternity='Unknown' and var_type='sub' and not(listhasany(father_call,call))),1,0)
;""")
Notice that some of the rules above are greatly simplified here, because of the case-study setup we applied here. For instance if additional case samples would have been provided, then we would have seen the following:
if(female_cases_subjCompHeterInGene+male_cases_subjWithVarInGene >= max(female_cases_subjWithGeneCovered+male_cases_subjWithGeneCovered - ##delta_cases##,1),1,0)
The constants ##delta_cases## and ##delta_ctrl## are used to specify "how much discount" we give from a "perfect penetrance" in our rules. For full information on these rules, the best way is to read the full FreeMarker logic within the YML file. It does for instance use column values on *_VarCovered and *_subjWithGeneCovered to account for lack of data due to sequence read coverage.
Candidate genes¶
Next we have two old statements used to process the candidate genes and rank them based on relevance. This was setup before we implemented the Phrank score as part of the candidate gene logic (see Jagadeesh, K.A., Birgmeier, J., Guturu, H. et al. Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. Genet Med 21, 464–470 (2019)).
# Calculate the number of genes with each phenotype and for each gene list the most specific phenotype count
mydefs = qh.add_many("""
create ##disease_specificity## = nor -h <(gor ##candidate_gene_file##
| where gene_candORparalog='c'
| split gene_candidate_phenotypes
| where not(gene_candidate_phenotypes like '*:par_*')
| distinct )
| granno -gc gene_candidate_phenotypes -count
| rename allcount hpo_geneCount
| atmin -gc gene_symbol hpo_geneCount
| group -gc gene_symbol -set -sc gene_candidate_phenotypes,hpo_geneCount
| rename set_hpo_genecount min_hpo_genecount;
""")
# Old-school gene ranking now replaced with Phrank in candidate-gene-query
mydefs = qh.add_many("""
create ##alias_candidate_genes## = gor ##candidate_gene_file##
| calc ep gene_candidate_phenotypes | split ep
| map -c ep -h <(nor <(gor ##candidate_gene_file## | split gene_candidate_phenotypes /* Calc pheno significance based on distinct genes */
| group genome -gc gene_candidate_phenotypes -dis -sc gene_symbol)
| select GENE_candidate_Phenotypes,dis_GENE_symbol | calc pheno_significance 1.0/dis_GENE_symbol)
| group 1 -gc 3-ep[-1] -sum -fc pheno_significance /* Sum up the pheno significance for all phenotypes per gene_symbol */
| rename sum_pheno_significance Gene_Cand_PhenoScore
| map -c gene_symbol ##gene_map## -n gene_aliases -m 'missing' /* replace gene symbol with alias */
| split gene_aliases | replace gene_symbol if(gene_aliases != 'missing',gene_aliases,gene_symbol)
| hide gene_aliases
| distinct
| map -c gene_symbol -m 10000 -n min_hpo_genecount [##disease_specificity##]
;
""")
%%gor
$mydefs
nor [##alias_candidate_genes##]
Query ran in 0.23 sec Query fetched 23,429 rows in 0.17 sec (total time 0.40 sec)
chrom | GENE_start | GENE_end | GENE_symbol | GENE_candidate_Phenotypes | GENE_candidate_paralogs | GENE_candORparalog | GENE_PHRANK | GENE_PHRANK_norm | Gene_Cand_PhenoScore | min_hpo_genecount | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 859302 | 879954 | SAMD11 | 3:par_HPO:HP:0000582 | PHC1 | p | 0.000 | 0.000 | 0.000993 | 10000 |
1 | chr1 | 859302 | 879954 | MGC45873 | 3:par_HPO:HP:0000582 | PHC1 | p | 0.000 | 0.000 | 0.000993 | 10000 |
2 | chr1 | 955499 | 991496 | AGRN | 11:par_HPO:HP:0007018,1:par_HPO:HP:0000164,2:p... | HSPG2 | c | 8.791 | 0.020 | 0.004976 | 956 |
3 | chr1 | 955499 | 991496 | AGRIN | 11:par_HPO:HP:0007018,1:par_HPO:HP:0000164,2:p... | HSPG2 | c | 8.791 | 0.020 | 0.004976 | 10000 |
4 | chr1 | 1167616 | 1170421 | B3GALT6 | 1:HPO:HP:0000164,6:HPO:HP:0000750,8:HPO:HP:000... | NaN | c | 12.872 | 0.017 | 0.003034 | 956 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23424 | chrY | 21865750 | 21906612 | JARID1 | 11:par_HPO:HP:0007018,1:par_HPO:HP:0000164,2:p... | JARID2,KDM5B,KDM5C | p | 0.001 | 0.000 | 0.005901 | 10000 |
23425 | chrY | 21865750 | 21906612 | KIAA0234 | 11:par_HPO:HP:0007018,1:par_HPO:HP:0000164,2:p... | JARID2,KDM5B,KDM5C | p | 0.001 | 0.000 | 0.005901 | 10000 |
23426 | chrY | 26191375 | 26194166 | CDY1B | 6:par_HPO:HP:0000750,8:par_HPO:HP:0001270 | AUH | p | 0.000 | 0.000 | 0.000827 | 10000 |
23427 | chrY | 27768263 | 27771049 | CDY1 | 6:par_HPO:HP:0000750,8:par_HPO:HP:0001270 | AUH | p | 0.001 | 0.000 | 0.000827 | 10000 |
23428 | chrY | 27768263 | 27771049 | CDY1A | 6:par_HPO:HP:0000750,8:par_HPO:HP:0001270 | AUH | p | 0.001 | 0.000 | 0.000827 | 10000 |
23429 rows × 11 columns
Putting it all together¶
The final statement in step2, before the recently added prototype GDX-view logic, is ##mendel_result##. It is now modified to use ##XA_annotations##, a relation that stores the latest variant classification according to the XA knowledgebase.
We see that if gene panel info file is provided, it uses Gene_DefTrans to override qc_consequence_rank that was calculated in step1. It adds PP4 to ACMG auto_evidence, from step1, if min_hpo_genecount < ##disease_specificity_max_gene_count##, and similarly PM3 if the variant is recessive or CHZ. It then finalizes some of the other inheritance labels and combines them into a single inheritance list, name inheritance.
Notice that while that while each individual inheritance label is calculated per transcript/feature, we do aggregated the inheritance over the features, per variant and gene, using GRANNO.
We then filter the feature transcript with the lowest qc_consequence_rank (nor ATMIN pick single output row in case there is a tie, hence single feature per variant per gene)
Next we add some evidence based on proximity and codon change similarity with known pathogenetic variants and finally we implement the rule scoring based on the 2015 ACMG-AMP guidelines. We have all the evidence labels in auto_evidence and use listfilter and listsize to count the number of evidence that fall into each group.
In ##PP3_BP4## in step1 we both setup logic to specify PP3 and BP4 for the old 2015 guidelines as well as for the Bayesian framework by Tavtigian etal. Before we calculate the Bayesian score, we therefore replace the plain PP3 and BP4, if present, with their modified impact stored in fidel_bp4_pp3 and based on the Fidel score level. We then simply use listnumsum, listmap, and the decode function to translate the collective evidence into a log-style Bayesian probability score.
The columns auto_classification_simple and auto_classification store then the P,LP,VUS,LB,B labels, based on the 2015 guidelines and the Bayesian respectively.
# Special annotation for last XA classification. XA classes already included in Clinvar and HGMD source in Step1
mydefs = qh.add_many("""
def #date_val($1) = int(left($1,4))*365+int(substr($1,5,7))*30+int(substr($1,8,10));
create ##XA_annotations## = gor ref_base/GDX/XA/variant_classifications.gorz
| where active = 't' | calc t #date_val(date) | atmax 1 -gc ref,alt t | select 1-alt,classification;
""")
# Query to apply final annotations and selection of transcript,
# i.e. possibly override qc_consequence_rank based on genepanelinfo
# Inheritance is still collapsed to variant-gene_symbol level, for not to loose information
mydefs = qh.add_many("""
create ##mendel_result## = pgor [##vcf_inheritance##]
| map -c gene_symbol -h <(nor -h [##genealiaspanelinfofile##] | select gene_symbol,Gene_DefTrans) -m 'default'
| replace qc_consequence_rank if(containsany(feature,Gene_DefTrans),0,qc_consequence_rank)
| join -l -varseg -r -f ##fuzzdist## -xl gene_symbol -xr gene_symbol [##alias_candidate_genes##]
| hide gene_symbolx
| replace Gene_Cand_PhenoScore if(Gene_Cand_PhenoScore='','0',str(Gene_Cand_PhenoScore))
| replace auto_evidence if(gene_candORparalog='c' and min_hpo_genecount < ##disease_specificity_max_gene_count##,listadd(auto_evidence,'PP4'),auto_evidence)
| replace auto_evidence if(diag_chz=1 or diag_homrecess=1,listadd(auto_evidence,'PM3'),auto_evidence)
| calc inheritance_REC if(diag_homrecess=1, if(chrom = 'chrX','XREC','REC'),if(diag_chz=1,'CHZcc', if(diag_chz_solo=1,'CHZ','')))
| calc inheritance_DeNovoDOM if(diag_denovo_strict=1,'DeNovo', if(diag_dominant=1,'DOM',''))
| calc inheritance_LHZ if(lhz_transfrac > ##min_lhz_overlap_fraction## and lhz_vars != '','LHZ','')
| calc inheritance = listfilter(inheritance_DeNovoDOM+','+inheritance_REC+','+inheritance_LHZ,'len(x)>0')
| granno 1 -gc reference,call,gene_symbol -set -sc inheritance
| replace inheritance if(left(set_inheritance,1)=',',listtail(set_inheritance),set_inheritance)
| map -c vep_max_consequence <(nor -h ##vep_impact_map## | rename vep_impact vep_max_impact ) -n vep_max_impact
| atmin 1 -gc reference,call,gene_symbol,PN qc_consequence_rank
| hide set_*
| varjoin -r -l -e '' [##XA_annotations##]
| calc is_patho_in_knowledgebase if(contains(classification,'PATH') or contains(classification,'LPATH'),1,0)
| replace auto_evidence if(DIAG_ACMGcat='Cat1B',listadd(auto_evidence,'PS1'),if(DIAG_ACMGcat='Cat1C',listadd(auto_evidence,'PM5'),auto_evidence))
| replace auto_evidence if(auto_evidence='',if(DIAG_ACMGcat='Cat1','PP5',if(DIAG_ACMGcat='Cat1C','BP6',auto_evidence)),auto_evidence)
| calc ls_PVS1 listsize(listfilter(auto_evidence,'x in ("PVS1")'))
| calc ls_PS1_4 listsize(listfilter(auto_evidence,'x in ("PS1","PS2","PS3","PS4")'))
| calc ls_PM1_6 listsize(listfilter(auto_evidence,'x in ("PM1","PM2","PM3","PM4","PM5","PM6")'))
| calc ls_PP1_5 listsize(listfilter(auto_evidence,'x in ("PP1","PP2","PP3","PP4","PP5")'))
| calc ls_BA1 listsize(listfilter(auto_evidence,'x in ("BA1")'))
| calc ls_BS1_4 listsize(listfilter(auto_evidence,'x in ("BS1","BS2","BS3","BS4")'))
| calc ls_BP1_7 listsize(listfilter(auto_evidence,'x in ("BP1","BP2","BP3","BP4","BP5","BP6","BP7")'))
| calc D_Pi if(ls_PVS1=1 and (ls_PS1_4>=1 or ls_PM1_6>=2 or (LS_PM1_6=1 and ls_PP1_5=1) or ls_PP1_5>=2),1,0)
| calc D_Pii if(ls_PS1_4>=2,1,0)
| calc D_Piii if(ls_PS1_4>=1 and ( ls_PM1_6>=3 or (ls_PM1_6>=2 and ls_PP1_5>=2) or (ls_PM1_6>=1 and ls_PP1_5>=4) ),1,0)
| calc D_P if(D_Pi+D_Pii+D_Piii > 0,1,0)
| calc D_LPi if(ls_PVS1=1 and ls_PM1_6=1,1,0)
| calc D_LPii if(ls_PS1_4=1 and (ls_PM1_6=1 or ls_PM1_6=2),1,0)
| calc D_LPiii if(ls_PS1_4=1 and ls_PP1_5>=2,1,0)
| calc D_LPiv if(ls_PM1_6>=3,1,0)
| calc D_LPv if(ls_PM1_6=2 and ls_PP1_5>=2,1,0)
| calc D_LPvi if(ls_PM1_6=1 and ls_PP1_5>=4,1,0)
| calc D_LP if(D_LPi+D_LPii+D_LPiii+D_LPiv+D_LPi+D_LPv+D_LPvi > 0,1,0)
| calc D_B if(ls_BA1=1 or ls_BS1_4>=2,1,0)
| calc D_LB if(ls_BS1_4=1 and ls_BP1_7=1 or ls_BP1_7>=2,1,0)
| calc D_PLP if(D_P=1,'P',if(D_LP=1,'LP',''))
| calc D_BLB if(D_B=1,'B',if(D_LB=1,'LB',''))
| calc auto_classification_simple if(len(D_PLP)>0 and len(D_BLB)=0,D_PLP,if(len(D_PLP)=0 and len(D_BLB)>0,D_BLB,'VUS'))
| replace auto_evidence if(contains(auto_evidence,'PP3'),replace(auto_evidence,'PP3',fidel_bp4_pp3),
if(contains(auto_evidence,'BP4'),replace(auto_evidence,'BP4',fidel_bp4_pp3),auto_evidence))
| calc auto_score if(auto_evidence!='',int(listnumsum(listmap(auto_evidence,"decode(x,'BA1,-8,BS1,-4,BS2,-4,BS3,-4,BS4,-4,BP1,-1,BP2,-1,BP3,-1,BP4,-1,BP4_supporting,-1,BP4_moderate,-2,BP4_strong,-4,BP4_verystrong,-8,BP5,-1,BP6,-1,BP7,-1,PVS1,8,PS1,4,PS2,4,PS3,4,PS4,4,PM1,2,PM2,1,PM3,2,PM4,2,PM5,2,PM6,2,PP1,1,PP2,1,PP3,1,PP3_strong,4,PP3_moderate,2,PP3_supporting,1,PP4,1,PP5,1,0')"))),0)
| calc auto_classification if(auto_score >= 10,'P',
if(auto_score >= 6 and auto_score <=9,'LP',
if(auto_score >= 0 and auto_score <=5,'VUS',
if(auto_score >= -6 and auto_score <= -1,'LB',
if(auto_score <= -7,'B','')))))
| calc GT reference + '/' + call
| rename known_clinvar_stars clinvar_stars
| rename known_clinvar_varianttype clinvar_varianttype
| rename OMIM_disease_name OMIM_phenotype
| hide gender,covered,pntype,gene_dmaxAf,gene_rmaxAf,BP4_score,BP7_score,PP3_score,gene_MOI,Gene_DefTrans,D_*,ls_*,inheritance_*,femalecases_*,malecases_*
| replace auto_evidence replace(replace(replace(replace(#rc,'_verystrong',''),'_supporting',''),'_strong',''),'_moderate','')
;""")
Finally, we have some logic that was recently added to implement filter views based on GDX specification. First we load some GeneMB data to classify genes. The relation #gene_inher# combines gene inheritance models in gene2inher.tsv with gene AF reference and Phenovar categories in gene2freqcat.tsv. This data is transformaed and stored with one row per gene symbol. The final query then implements the predicate logic for the Boolean flags representing the GDX filter/view membership and collapses them then into a single list in GDX_view compactly using listzipfilter and cols2list.
The expressions speak for themselves!
mydefs = qh.add("""
create #gene_inher# = nor ref_base/GDX/GeneMB/gene2inher.tsv
| replace #2 listmap(#rc,"decode(x,'Autosomal dominant germline de novo mutation,ADdenov,Autosomal dominant inheritance,AD,Autosomal dominant inheritance with maternal imprinting,ADmi,Autosomal dominant inheritance with paternal imprinting,ADpi,Autosomal dominant somatic cell mutation,ADsom,Autosomal recessive inheritance,AR,Digenic inheritance,DI,Mitochondrial inheritance,MI,Multifactorial inheritance,MFI,Oligogenic inheritance,OGI,Semidominant inheritance,SDI,Somatic mutation,SM,X-linked dominant de novo mutation,XLdenovo,X-linked dominant inheritance,XLdom,X-linked dominant with interference,XLdomwi,X-linked inheritance,XL,X-linked recessive inheritance,XLrec,Y-linked inheritance,YL,?')")
| map -c gene_symbol -m '' <(nor ref_base/GDX/GeneMB/gene2freqcat.tsv
| group -gc gene_symbol -set -sc categorical_bin
| replace #2 upper(listfilter(#rc,"x!=''")));
""")
# Code for prototype GDX filtering/searches
mydefs = qh.add("""
create #GDX# = gor [##mendel_result##]
| map -c gene_symbol -m '' [#gene_inher#]
| varjoin -r -l -rprefix GDX [##XA_annotations##]
| calc GDX_phenovar if(contains(set_categorical_bin,'PHENOVAR'),1,0)
| calc GDX_path if(containsany(GDX_classification,'P','LP'),1,0)
| calc GDX_denovo if(diag_denovo = 1 and (not(homcount > 1 or GDX_classification in ('BEN','LBEN'))
and (ref_af = 0.0 and listhasany(set_inheritance,'AD','ADdenov','ADmi','ADpi','XLdom','XLdomwi')
or (ref_af <= #denovoAFpv# and (GDX_phenovar = 1 or listhasany(set_inheritance,'XLrec','XL','XLdenovo','XLdom','XLdomwi')))
or GDX_path = 1)), 1,0)
| calc GDX_homoz if(listhasany(set_inheritance,'AD')
and diag_homrecess = 1 and GT_IHE != 1
and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #homAF# or GDX_path = 1), 1,0)
| calc GDX_chz if(listhasany(set_inheritance,'AD')
and diag_chz_solo = 1 and GT_IHE != 1
and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #chzAF# or GDX_path = 1), 1,0)
| calc GDX_xlink if(listhasany(set_inheritance,'XLrec','XL','XLdenovo','XLdom','XLdomwi')
and (diag_chz_solo = 1 or diag_homrecess = 1) and GT_IHE != 1
and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #xlinkAF# or GDX_path = 1), 1,0)
| calc GDX_sihet if(listhasany(set_inheritance,'AR')
and callcopies = 1 and GT_IHE != 1
and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH') and (ref_af <= #sihetAF# or GDX_path = 1), 1,0)
| calc GDX_dom if(listhasany(set_inheritance,'AD','XLrec','XL','XLdenovo','XLdom','XLdomwi')
and diag_dominant = 1 and GT_IHE != 1
and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH') and (ref_af <= #domAF# or GDX_path = 1), 1,0)
| calc GDX_c_denovo if(diag_denovo = 1 and (not(homcount > 1 or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH','MODERATE')
and (ref_af = 0.0 and not(listhasany(set_inheritance,'AD','ADdenov','ADmi','ADpi','XLdom','XLdomwi'))
or GDX_path = 1)), 1,0)
| calc GDX_c_homoz if(not(listhasany(set_inheritance,'AD'))
and (diag_homrecess = 1 or diag_chz_solo = 1) and GT_IHE != 1
and not(homcount > 1 or GDX_classification in ('BEN','LBEN'))
and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #chomAF# or GDX_path = 1), 1,0)
| calc GDX_second if(GDX_path = 1 and GT_IHE != 1 and VEP_IMPACT in ('HIGH'), 1,0)
| inset -c gene_symbol -b <(nor ref/ensgenes/ensgenes_disease.map | grep acmg) | rename inset inACMG
| calc GDX_inciden if(GDX_path = 1 and GT_IHE != 1 and VEP_IMPACT in ('HIGH') and inACMG = 1, 1,0)
| hide inACMG
| calc GDX_view replace(listzipfilter('GDX_denovo,GDX_homoz,GDX_chz,GDX_xlink,GDX_sihet,GDX_dom,GDX_c_denovo,GDX_c_homoz,GDX_second,GDX_inciden',
cols2list('GDX_denovo,GDX_homoz,GDX_chz,GDX_xlink,GDX_sihet,GDX_dom,GDX_c_denovo,GDX_c_homoz,GDX_second,GDX_inciden'),'x="1"'),'GDX_','')
| hide GDX_denovo,GDX_homoz,GDX_chz,GDX_xlink,GDX_sihet,GDX_dom,GDX_c_denovo,GDX_c_homoz,GDX_second,GDX_inciden
| rename set_inheritance GDX_inheritance
| rename set_categorical_bin GDX_categorical_bin
| calc sequence_variant chrom+':'+pos+Reference+'>'+Call+', ' + zygosity + ', ' + Gene_symbol + ', '+HGVSc + if(len(HGVSp)>1,' ('+ if(posof(HGVSp,':p.')>0,substr(HGVSp,1+posof(HGVSp,':p.'),1000),HGVSp)+')','')
| columnsort 1,2,reference,call,zygosity,sequence_variant,Gene_Symbol,auto_classification,auto_evidence,GDX_view,DIAG_ACMGcat,Fidel_PRECPAT,Fidel_ALPHA,spliceai_max_impact,ref_af,XA_conservative_AF,XA_homCount,VEP_Max_Consequence,GT_Info,KNOWN_gene_diseases,Fidel_CALIBRATED,GDX_categorical_bin,GDX_classification,GDX_phenovar
;""")
%%gor
$mydefs
gor [#GDX#] | top 10
Query ran in 0.27 sec Query fetched 10 rows in 0.01 sec (total time 0.28 sec)
CHROM | POS | Reference | Call | zygosity | sequence_variant | Gene_Symbol | auto_classification | auto_evidence | GDX_view | ... | Refgene | ref_af_cust | rsids | spliceai_max_consequence | transcript_source | trigger50bpRule | var_type | VEP_Impact | vep_max_impact | VEP_Max_Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 906082 | C | T | het | chr1:906082C>T, het, PLEKHN1, NM_032129.3:c.42... | PLEKHN1 | LB | BP4 | NaN | ... | NM_032129 | 0.0 | rs373726748 | NaN | no_source_detected | 0 | sub | MODERATE | MODERATE | 0.711 |
1 | chr1 | 909854 | G | A | het | chr1:909854G>A, het, PLEKHN1, NM_032129.3:c.17... | PLEKHN1 | LB | BP4 | NaN | ... | NM_032129 | 0.0 | rs765491496 | NaN | no_source_detected | 1 | sub | MODERATE | MODERATE | 0.499 |
2 | chr1 | 911595 | A | G | hom | chr1:911595A>G, hom, PERM1, NM_001291366.2:c.2... | PERM1 | B | BA1,BP4 | NaN | ... | NM_001291366 | 0.0 | rs7417106 | NaN | no_source_detected | 1 | sub | MODERATE | MODERATE | 0.976 |
3 | chr1 | 911595 | A | G | hom | chr1:911595A>G, hom, PLEKHN1, | PLEKHN1 | B | BA1,BP4 | NaN | ... | NM_032129 | 0.0 | rs7417106 | NaN | no_source_detected | 0 | sub | LOWEST | MODERATE | 0.976 |
4 | chr1 | 6676839 | G | A | het | chr1:6676839G>A, het, PHF13, NM_153812.3:c.62G... | PHF13 | LB | BP4 | NaN | ... | NM_153812 | 0.0 | rs772942562 | NaN | no_source_detected | 0 | sub | MODERATE | MODERATE | 1.000 |
5 | chr1 | 7993342 | T | C | het | chr1:7993342T>C, het, TNFRSF9, NM_001561.6:c.5... | TNFRSF9 | LB | PM2,BP4 | NaN | ... | NM_001561 | 0.0 | NaN | NaN | no_source_detected | 0 | sub | MODERATE | MODERATE | 0.002 |
6 | chr1 | 10386304 | G | C | het | chr1:10386304G>C, het, KIF1B, NM_015074.3:c.26... | KIF1B | VUS | PM2,PP2,BP4 | NaN | ... | NM_015074 | 0.0 | rs752271843 | NaN | no_source_detected | 0 | sub | MODERATE | MODERATE | 0.974 |
7 | chr1 | 10386305 | G | C | het | chr1:10386305G>C, het, KIF1B, NM_015074.3:c.26... | KIF1B | VUS | PM2,PP2,BP4 | NaN | ... | NM_015074 | 0.0 | NaN | NaN | no_source_detected | 0 | sub | MODERATE | MODERATE | 0.857 |
8 | chr1 | 12919956 | G | C | het | chr1:12919956G>C, het, PRAMEF2, NM_023014.1:c.... | PRAMEF2 | LB | PM2,BP4 | NaN | ... | NM_023014 | 0.0 | NaN | NaN | no_source_detected | 0 | sub | MODERATE | MODERATE | 0.968 |
9 | chr1 | 13448185 | G | GTC | het | chr1:13448185G>GTC, het, PRAMEF13, NM_00129138... | PRAMEF13 | VUS | PM2 | NaN | ... | NM_001291380 | 0.0 | rs782180079 | NaN | no_source_detected | 1 | ins | HIGH | HIGH | 0.000 |
10 rows × 191 columns
Moving the query to Sequence Miner¶
We can display the entire query in the following way, copy it and paste it into SM in order to explore it there.
print(mydefs)
def ##ref## = ref; def ##fuzzdist## = 20; def ##min_read_depth## = 8; def ##min_hom_call_ratio## = 0.66; def ##min_het_call_ratio## = 0.2; def ##min_gt_likelihood## = 5; def ##rmax_af## = 0.01; def ##dmax_af## = 0.001; def ##max_cust_af## = ##rmax_af##; def ##delta_ctrl## = 0; def ##delta_cases## = 0; def ##max_num_candgenes## = 10; def ##disease_specificity_max_gene_count## = 20; def ##min_lhz_overlap_fraction## = 0.1; def ##vcf_wgs_anno## = user_data/Gregor1/DEMO_2796799_v3.gorz; def ##candidate_gene_file## = studies/GCA724303/candidate_genes.gor; create ##demo_genepanelinfo## = nor #genes# | select gene_symbol | top 100 | calc Gene_RmaxAf 'default' | calc Gene_RmaxGf 'default' | calc Gene_DmaxAf 'default' | calc Gene_MOI 'all' | calc Gene_Deftrans 'default'; def ##genepanelinfo## = [##demo_genepanelinfo##]; def ##defMOI## = 'all'; def ##vep_filter## = consequence in ('transcript_ablation','splice_acceptor_variant','splice_donor_variant','stop_gained','frameshift_variant','stop_lost','start_lost','transcript_amplification','inframe_insertion','inframe_deletion','missense_variant','protein_altering_variant','splice_region_variant','incomplete_terminal_codon_variant'); def ##vcf_filter## = ('Not LowQual'); def ##af_filter## = ref_af <= Gene_RmaxAf and ref_af_cust <= ##max_cust_af##; def ##cat_filter## = DIAG_ACMGcat in ('Cat1','Cat1B','Cat1C'); def ##known_filter## = qc_known_vars = 1; def ##protected_filter## = qc_protected_vars = 1; def ##vcf_quality_filter## = qc_vcf_filter = 1; def ##variant_scope_filter## = 2 = 2; def ##repeat_regions_filter## = qc_repeat_regions = 0; def ##variant_filter## = ( ##vep_filter## ) and ( ##af_filter## ) and ( ##repeat_regions_filter## ) and ( ##variant_scope_filter## ) and ( ##vcf_quality_filter## ) or ( ##cat_filter## ) or ( ##protected_filter## ) or ( ##known_filter## ) ; def ##all_genes## = ##ref##/refgenes/rgenes.gorz; def ##gene_map## = ##ref##/refgenes/refgenes.map; def ##vep_impact_map## = ##ref##/VEP_impact.map; def #denovoAFpv# = 0.0005; def #homAF# = 0.005; def #chzAF# = 0.01; def #xlinkAF# = 0.01; def #sihetAF# = 0.01; def #domAF# = 0.0005; def #chomAF# = 0.001; def ##index_case## = 'DEMO_2796799'; def ##father## = 'DEMO_2827751'; def ##father_affstat## = 'unaffected'; def ##mother## = 'DEMO_2819603'; def ##mother_affstat## = 'unaffected'; def ##male_controls## = 'DEMO_2827751'; def ##male_cases## =; def ##female_controls## = 'DEMO_2819603'; def ##female_cases## =; def ##pivot_type## = 'male_cases','female_cases','male_controls','female_controls'; def ##pntype## = IF(PN IN ('DEMO_2796799'),'index',IF(PN IN ('DEMO_2819603'),'female_controls',IF(PN IN ('DEMO_2827751'),'male_controls','UNKNOWN'))); create ##genealiaspanelinfofile## = nor -h ##genepanelinfo## | multimap -c gene_symbol -m missing -h <(nor ##gene_map## -h | select gene_symbol,gene_aliases | split gene_aliases) | replace gene_symbol if(gene_aliases != 'missing',gene_aliases,gene_symbol) | hide gene_aliases | distinct | group -gc gene_symbol -set -sc Gene_RmaxAf- -len 10000 | rename set_(.*) #{1} | replace Gene_RmaxAf- if(listfilter(#rc,'x!="default"')!='',listfilter(#rc,'x!="default"'),'default') | replace Gene_RmaxAf-Gene_DmaxAf if(#rc!='default',str(listnummax(#rc)),#rc); create #XAaf# = pgor ref_base/GDX/XA/unaff_parents_conf_gender_af.gorz | select 1-4,conservative_af,homCount | varnorm -left #3 #4 | atmax 1 -gc ref,alt conservative_af; create ##vcf_filtered## = pgor ##vcf_wgs_anno## | varjoin -l -r -e 0 -r -rprefix XA [#XAaf#] | replace ref_af if(ref_af < XA_conservative_af, XA_conservative_af, ref_af) | replace homcount if(homcount < XA_homcount, XA_homcount, homcount) | map -c gene_symbol [##genealiaspanelinfofile##] -n Gene_RmaxAf,Gene_RmaxGf,Gene_DmaxAf,Gene_MOI -m 'default' | prefix Gene_RmaxAf- Temp | calc Gene_RmaxAf if(Temp_Gene_RmaxAf='default',##rmax_af##,float(Temp_Gene_RmaxAf)) | calc Gene_DmaxAf if(Temp_Gene_DmaxAf='default',##dmax_af##,float(Temp_Gene_DmaxAf)) | calc Gene_MOI if(Temp_Gene_MOI='default',##defMOI##,Temp_Gene_MOI) | hide Temp_* | calc qc_vcf_filter if( not(filter = 'LowQual') and (Depth = -1 or GL_Call >= ##min_gt_likelihood## and Depth >= ##min_read_depth## and (CallCopies = 2 and CallRatio >= ##min_hom_call_ratio## or CallCopies = 1 and CallRatio >= ##min_het_call_ratio## and CallRatio <= 1.0-##min_het_call_ratio##)) ,1,0) | where ##variant_filter##; create ##pngene## = pgor [##vcf_filtered##] | select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,callCopies,feature,diag_denovo,Gene_DmaxAf | calc PNtype ##pntype## | split PNtype | where PNtype='index' | join -varseg -f ##fuzzdist## -xl gene_symbol -xr gene_symbol <(gor ##all_genes## | select 1-4) | select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,callCopies,feature,gene_start,gene_end,diag_denovo,Gene_DmaxAf | calc from_father if(GT_Paternity='Paternal',1,0) | calc from_mother if(GT_Paternity='Maternal',1,0) | calc from_unknown if(from_father=0 and from_mother=0,1,0) | calc from_denovo if(diag_denovo=1 and ref_af<Gene_DmaxAf,1,0) | calc hom IF(Zygosity = 'hom',1,0) | calc het IF(Zygosity = 'het',1,0) | group chrom -gc PN,feature,gene_symbol,gene_start,gene_end -ic from_father,from_mother,from_unknown,from_denovo,het,hom -sum | calc index_subjCompHeterInGeneTrans if( sum_het>1 and ( sum_from_father>0 and sum_from_mother+sum_from_denovo>0 or sum_from_father+sum_from_denovo>0 and sum_from_mother>0),1,0) | calc index_subjCompHeterInGene if( sum_het>1 and (index_subjCompHeterInGeneTrans=1 or ( sum_from_father>0 and sum_from_mother+sum_from_unknown>0 or sum_from_father+sum_from_unknown>0 and sum_from_mother>0 or sum_from_unknown>1 )),1,0) | calc index_subjWithVarInGene IF(sum_het+sum_hom>0,1,0) | calc index_subjWithHomVarInGene IF(sum_hom>0,1,0) | select chrom,gene_start,gene_end,feature,gene_symbol,PN,index_* | sort chrom; create ##ccgene## = pgor [##vcf_filtered##] | select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,feature,covered | join -varseg -f ##fuzzdist## -xl gene_symbol -xr gene_symbol <(gor ##all_genes## | select 1-4) | select 1-4,PN,gene_symbol,zygosity,ref_af,GT_paternity,feature,gene_symbol,gene_start,gene_end,covered | calc from_father if(GT_Paternity='Paternal' or zygosity='hom',1,0) | calc from_mother if(GT_Paternity='Maternal' or zygosity='hom',1,0) | calc from_unknown if(from_father=0 and from_mother=0,1,0) | calc hom IF(Zygosity = 'hom',1,0) | calc het IF(Zygosity = 'het',1,0) | group chrom -gc PN,feature,gene_symbol,gene_start,gene_end -ic from_father,from_mother,from_unknown,covered,het,hom -sum | calc CHZ if(sum_covered>0 and (sum_from_father>0 and sum_from_mother+sum_from_unknown>0 or sum_from_father+sum_from_unknown>0 and sum_from_mother>0 or sum_from_unknown>1),1,0) | calc WVIG IF(sum_het+sum_hom>0,1,0) | calc WHVIG IF(sum_hom>0,1,0) | calc gcovered if(sum_covered>0,1,0) | calc PNtype ##pntype## | group chrom -gc feature,gene_symbol,gene_start,gene_end,PNtype -sum -ic CHZ,WVIG,WHVIG,gcovered | rename sum_WVIG subjWithVarInGene | rename sum_WHVIG subjWithHomVarInGene | rename sum_CHZ subjCompHeterInGene | rename sum_gcovered subjWithGeneCovered | select chrom,gene_start,gene_end,feature,gene_symbol,PNtype,subj* | sort chrom | pivot -gc gene_end,feature,gene_symbol PNtype -v ##pivot_type## -e 0; create ##vcf_inheritance## = pgor [##vcf_filtered##] | calc PNtype ##pntype## | split PNtype | where PNtype='index' | calc index_subjWithVar = 1 | calc index_subjWithHomVar = if(zygosity='hom',1,0) | calc father_affstat ##father_affstat## | calc mother_affstat ##mother_affstat## | replace diag_denovo_strict if(diag_denovo_strict=1 and ref_af>Gene_DmaxAf,0,diag_denovo_strict) | calc diag_dominant if(ref_af <= Gene_DmaxAf and ( index_subjWithVar>0 and 1=2 ),1,0) | calc diag_homrecess if(( chrom = 'chrX' and ( ( gender = 'male' and index_subjWithVar>0 or gender = 'female' and index_subjWithHomVar>0) and female_controls_subjWithHomVar+male_controls_subjWithVar <= ##delta_ctrl## ) or chrom != 'chrX' and ( index_subjWithHomVar>0 and female_controls_subjWithHomVar+male_controls_subjWithHomVar <= ##delta_ctrl## )),1,0) | join -varseg -l -r -e 0 -xl gene_symbol,feature,PN -xr gene_symbol,feature,PN [##pngene##] | hide gene_symbolx,featurex,PNx | join -varseg -l -r -e 0 -xl gene_symbol,feature -xr gene_symbol,feature [##ccgene##] | hide gene_symbolx,featurex | calc diag_chz_solo if(zygosity='het' and ( chrom = 'chrX' and ( gender = 'male' and index_subjWithVar>0 or gender = 'female' and index_subjWithVar>0 and (GT_Paternity in ('Paternal','Maternal') or diag_denovo = 1) and index_subjCompHeterInGeneTrans>0) or chrom != 'chrX' and index_subjWithVar > 0 and (GT_Paternity in ('Paternal','Maternal') or diag_denovo = 1) and index_subjCompHeterInGeneTrans>0 ),1,0) | calc case_support_xlinked 1 | calc control_support_xlinked if(female_controls_subjCompHeterInGene+male_controls_subjWithVarInGene <= ##delta_ctrl##,1,0) | calc case_support_autosomal 1 | calc control_support_autosomal if(female_controls_subjCompHeterInGene+male_controls_subjCompHeterInGene <= ##delta_ctrl##,1,0) | calc diag_chz if(diag_chz_solo=1 and ( chrom = 'chrX' and ( case_support_xlinked=1 and control_support_xlinked=1 ) or chrom != 'chrX' and ( case_support_autosomal=1 and control_support_autosomal=1 )),1,0) | hide case_support*,control_support* | calc mother_withvar if( mother_call!='NaN' and GT_Paternity!='Paternal' and not(GT_Paternity='Unknown' and var_type='sub' and not(listhasany(mother_call,call))),1,0) | calc father_withvar if( father_call!='NaN' and GT_Paternity!='Maternal' and not( GT_Paternity='Unknown' and var_type='sub' and not(listhasany(father_call,call))),1,0) ; create ##disease_specificity## = nor -h <(gor ##candidate_gene_file## | where gene_candORparalog='c' | split gene_candidate_phenotypes | where not(gene_candidate_phenotypes like '*:par_*') | distinct ) | granno -gc gene_candidate_phenotypes -count | rename allcount hpo_geneCount | atmin -gc gene_symbol hpo_geneCount | group -gc gene_symbol -set -sc gene_candidate_phenotypes,hpo_geneCount | rename set_hpo_genecount min_hpo_genecount; create ##alias_candidate_genes## = gor ##candidate_gene_file## | calc ep gene_candidate_phenotypes | split ep | map -c ep -h <(nor <(gor ##candidate_gene_file## | split gene_candidate_phenotypes /* Calc pheno significance based on distinct genes */ | group genome -gc gene_candidate_phenotypes -dis -sc gene_symbol) | select GENE_candidate_Phenotypes,dis_GENE_symbol | calc pheno_significance 1.0/dis_GENE_symbol) | group 1 -gc 3-ep[-1] -sum -fc pheno_significance /* Sum up the pheno significance for all phenotypes per gene_symbol */ | rename sum_pheno_significance Gene_Cand_PhenoScore | map -c gene_symbol ##gene_map## -n gene_aliases -m 'missing' /* replace gene symbol with alias */ | split gene_aliases | replace gene_symbol if(gene_aliases != 'missing',gene_aliases,gene_symbol) | hide gene_aliases | distinct | map -c gene_symbol -m 10000 -n min_hpo_genecount [##disease_specificity##] ; def #date_val($1) = int(left($1,4))*365+int(substr($1,5,7))*30+int(substr($1,8,10)); create ##XA_annotations## = gor ref_base/GDX/XA/variant_classifications.gorz | where active = 't' | calc t #date_val(date) | atmax 1 -gc ref,alt t | select 1-alt,classification; create ##mendel_result## = pgor [##vcf_inheritance##] | map -c gene_symbol -h <(nor -h [##genealiaspanelinfofile##] | select gene_symbol,Gene_DefTrans) -m 'default' | replace qc_consequence_rank if(containsany(feature,Gene_DefTrans),0,qc_consequence_rank) | join -l -varseg -r -f ##fuzzdist## -xl gene_symbol -xr gene_symbol [##alias_candidate_genes##] | hide gene_symbolx | replace Gene_Cand_PhenoScore if(Gene_Cand_PhenoScore='','0',str(Gene_Cand_PhenoScore)) | replace auto_evidence if(gene_candORparalog='c' and min_hpo_genecount < ##disease_specificity_max_gene_count##,listadd(auto_evidence,'PP4'),auto_evidence) | replace auto_evidence if(diag_chz=1 or diag_homrecess=1,listadd(auto_evidence,'PM3'),auto_evidence) | calc inheritance_REC if(diag_homrecess=1, if(chrom = 'chrX','XREC','REC'),if(diag_chz=1,'CHZcc', if(diag_chz_solo=1,'CHZ',''))) | calc inheritance_DeNovoDOM if(diag_denovo_strict=1,'DeNovo', if(diag_dominant=1,'DOM','')) | calc inheritance_LHZ if(lhz_transfrac > ##min_lhz_overlap_fraction## and lhz_vars != '','LHZ','') | calc inheritance = listfilter(inheritance_DeNovoDOM+','+inheritance_REC+','+inheritance_LHZ,'len(x)>0') | granno 1 -gc reference,call,gene_symbol -set -sc inheritance | replace inheritance if(left(set_inheritance,1)=',',listtail(set_inheritance),set_inheritance) | map -c vep_max_consequence <(nor -h ##vep_impact_map## | rename vep_impact vep_max_impact ) -n vep_max_impact | atmin 1 -gc reference,call,gene_symbol,PN qc_consequence_rank | hide set_* | varjoin -r -l -e '' [##XA_annotations##] | calc is_patho_in_knowledgebase if(contains(classification,'PATH') or contains(classification,'LPATH'),1,0) | replace auto_evidence if(DIAG_ACMGcat='Cat1B',listadd(auto_evidence,'PS1'),if(DIAG_ACMGcat='Cat1C',listadd(auto_evidence,'PM5'),auto_evidence)) | replace auto_evidence if(auto_evidence='',if(DIAG_ACMGcat='Cat1','PP5',if(DIAG_ACMGcat='Cat1C','BP6',auto_evidence)),auto_evidence) | calc ls_PVS1 listsize(listfilter(auto_evidence,'x in ("PVS1")')) | calc ls_PS1_4 listsize(listfilter(auto_evidence,'x in ("PS1","PS2","PS3","PS4")')) | calc ls_PM1_6 listsize(listfilter(auto_evidence,'x in ("PM1","PM2","PM3","PM4","PM5","PM6")')) | calc ls_PP1_5 listsize(listfilter(auto_evidence,'x in ("PP1","PP2","PP3","PP4","PP5")')) | calc ls_BA1 listsize(listfilter(auto_evidence,'x in ("BA1")')) | calc ls_BS1_4 listsize(listfilter(auto_evidence,'x in ("BS1","BS2","BS3","BS4")')) | calc ls_BP1_7 listsize(listfilter(auto_evidence,'x in ("BP1","BP2","BP3","BP4","BP5","BP6","BP7")')) | calc D_Pi if(ls_PVS1=1 and (ls_PS1_4>=1 or ls_PM1_6>=2 or (LS_PM1_6=1 and ls_PP1_5=1) or ls_PP1_5>=2),1,0) | calc D_Pii if(ls_PS1_4>=2,1,0) | calc D_Piii if(ls_PS1_4>=1 and ( ls_PM1_6>=3 or (ls_PM1_6>=2 and ls_PP1_5>=2) or (ls_PM1_6>=1 and ls_PP1_5>=4) ),1,0) | calc D_P if(D_Pi+D_Pii+D_Piii > 0,1,0) | calc D_LPi if(ls_PVS1=1 and ls_PM1_6=1,1,0) | calc D_LPii if(ls_PS1_4=1 and (ls_PM1_6=1 or ls_PM1_6=2),1,0) | calc D_LPiii if(ls_PS1_4=1 and ls_PP1_5>=2,1,0) | calc D_LPiv if(ls_PM1_6>=3,1,0) | calc D_LPv if(ls_PM1_6=2 and ls_PP1_5>=2,1,0) | calc D_LPvi if(ls_PM1_6=1 and ls_PP1_5>=4,1,0) | calc D_LP if(D_LPi+D_LPii+D_LPiii+D_LPiv+D_LPi+D_LPv+D_LPvi > 0,1,0) | calc D_B if(ls_BA1=1 or ls_BS1_4>=2,1,0) | calc D_LB if(ls_BS1_4=1 and ls_BP1_7=1 or ls_BP1_7>=2,1,0) | calc D_PLP if(D_P=1,'P',if(D_LP=1,'LP','')) | calc D_BLB if(D_B=1,'B',if(D_LB=1,'LB','')) | calc auto_classification_simple if(len(D_PLP)>0 and len(D_BLB)=0,D_PLP,if(len(D_PLP)=0 and len(D_BLB)>0,D_BLB,'VUS')) | replace auto_evidence if(contains(auto_evidence,'PP3'),replace(auto_evidence,'PP3',fidel_bp4_pp3), if(contains(auto_evidence,'BP4'),replace(auto_evidence,'BP4',fidel_bp4_pp3),auto_evidence)) | calc auto_score if(auto_evidence!='',int(listnumsum(listmap(auto_evidence,"decode(x,'BA1,-8,BS1,-4,BS2,-4,BS3,-4,BS4,-4,BP1,-1,BP2,-1,BP3,-1,BP4,-1,BP4_supporting,-1,BP4_moderate,-2,BP4_strong,-4,BP4_verystrong,-8,BP5,-1,BP6,-1,BP7,-1,PVS1,8,PS1,4,PS2,4,PS3,4,PS4,4,PM1,2,PM2,1,PM3,2,PM4,2,PM5,2,PM6,2,PP1,1,PP2,1,PP3,1,PP3_strong,4,PP3_moderate,2,PP3_supporting,1,PP4,1,PP5,1,0')"))),0) | calc auto_classification if(auto_score >= 10,'P', if(auto_score >= 6 and auto_score <=9,'LP', if(auto_score >= 0 and auto_score <=5,'VUS', if(auto_score >= -6 and auto_score <= -1,'LB', if(auto_score <= -7,'B',''))))) | calc GT reference + '/' + call | rename known_clinvar_stars clinvar_stars | rename known_clinvar_varianttype clinvar_varianttype | rename OMIM_disease_name OMIM_phenotype | hide gender,covered,pntype,gene_dmaxAf,gene_rmaxAf,BP4_score,BP7_score,PP3_score,gene_MOI,Gene_DefTrans,D_*,ls_*,inheritance_*,femalecases_*,malecases_* | replace auto_evidence replace(replace(replace(replace(#rc,'_verystrong',''),'_supporting',''),'_strong',''),'_moderate','') ; create #gene_inher# = nor ref_base/GDX/GeneMB/gene2inher.tsv | replace #2 listmap(#rc,"decode(x,'Autosomal dominant germline de novo mutation,ADdenov,Autosomal dominant inheritance,AD,Autosomal dominant inheritance with maternal imprinting,ADmi,Autosomal dominant inheritance with paternal imprinting,ADpi,Autosomal dominant somatic cell mutation,ADsom,Autosomal recessive inheritance,AR,Digenic inheritance,DI,Mitochondrial inheritance,MI,Multifactorial inheritance,MFI,Oligogenic inheritance,OGI,Semidominant inheritance,SDI,Somatic mutation,SM,X-linked dominant de novo mutation,XLdenovo,X-linked dominant inheritance,XLdom,X-linked dominant with interference,XLdomwi,X-linked inheritance,XL,X-linked recessive inheritance,XLrec,Y-linked inheritance,YL,?')") | map -c gene_symbol -m '' <(nor ref_base/GDX/GeneMB/gene2freqcat.tsv | group -gc gene_symbol -set -sc categorical_bin | replace #2 upper(listfilter(#rc,"x!=''"))); ; create #GDX# = gor [##mendel_result##] | map -c gene_symbol -m '' [#gene_inher#] | varjoin -r -l -rprefix GDX [##XA_annotations##] | calc GDX_phenovar if(contains(set_categorical_bin,'PHENOVAR'),1,0) | calc GDX_path if(containsany(GDX_classification,'P','LP'),1,0) | calc GDX_denovo if(diag_denovo = 1 and (not(homcount > 1 or GDX_classification in ('BEN','LBEN')) and (ref_af = 0.0 and listhasany(set_inheritance,'AD','ADdenov','ADmi','ADpi','XLdom','XLdomwi') or (ref_af <= #denovoAFpv# and (GDX_phenovar = 1 or listhasany(set_inheritance,'XLrec','XL','XLdenovo','XLdom','XLdomwi'))) or GDX_path = 1)), 1,0) | calc GDX_homoz if(listhasany(set_inheritance,'AD') and diag_homrecess = 1 and GT_IHE != 1 and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #homAF# or GDX_path = 1), 1,0) | calc GDX_chz if(listhasany(set_inheritance,'AD') and diag_chz_solo = 1 and GT_IHE != 1 and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #chzAF# or GDX_path = 1), 1,0) | calc GDX_xlink if(listhasany(set_inheritance,'XLrec','XL','XLdenovo','XLdom','XLdomwi') and (diag_chz_solo = 1 or diag_homrecess = 1) and GT_IHE != 1 and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #xlinkAF# or GDX_path = 1), 1,0) | calc GDX_sihet if(listhasany(set_inheritance,'AR') and callcopies = 1 and GT_IHE != 1 and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH') and (ref_af <= #sihetAF# or GDX_path = 1), 1,0) | calc GDX_dom if(listhasany(set_inheritance,'AD','XLrec','XL','XLdenovo','XLdom','XLdomwi') and diag_dominant = 1 and GT_IHE != 1 and not((homcount > 3 or homcount > 1 and GDX_phenovar = 0) or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH') and (ref_af <= #domAF# or GDX_path = 1), 1,0) | calc GDX_c_denovo if(diag_denovo = 1 and (not(homcount > 1 or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af = 0.0 and not(listhasany(set_inheritance,'AD','ADdenov','ADmi','ADpi','XLdom','XLdomwi')) or GDX_path = 1)), 1,0) | calc GDX_c_homoz if(not(listhasany(set_inheritance,'AD')) and (diag_homrecess = 1 or diag_chz_solo = 1) and GT_IHE != 1 and not(homcount > 1 or GDX_classification in ('BEN','LBEN')) and VEP_IMPACT in ('HIGH','MODERATE') and (ref_af <= #chomAF# or GDX_path = 1), 1,0) | calc GDX_second if(GDX_path = 1 and GT_IHE != 1 and VEP_IMPACT in ('HIGH'), 1,0) | inset -c gene_symbol -b <(nor ref/ensgenes/ensgenes_disease.map | grep acmg) | rename inset inACMG | calc GDX_inciden if(GDX_path = 1 and GT_IHE != 1 and VEP_IMPACT in ('HIGH') and inACMG = 1, 1,0) | hide inACMG | calc GDX_view replace(listzipfilter('GDX_denovo,GDX_homoz,GDX_chz,GDX_xlink,GDX_sihet,GDX_dom,GDX_c_denovo,GDX_c_homoz,GDX_second,GDX_inciden', cols2list('GDX_denovo,GDX_homoz,GDX_chz,GDX_xlink,GDX_sihet,GDX_dom,GDX_c_denovo,GDX_c_homoz,GDX_second,GDX_inciden'),'x="1"'),'GDX_','') | hide GDX_denovo,GDX_homoz,GDX_chz,GDX_xlink,GDX_sihet,GDX_dom,GDX_c_denovo,GDX_c_homoz,GDX_second,GDX_inciden | rename set_inheritance GDX_inheritance | rename set_categorical_bin GDX_categorical_bin | calc sequence_variant chrom+':'+pos+Reference+'>'+Call+', ' + zygosity + ', ' + Gene_symbol + ', '+HGVSc + if(len(HGVSp)>1,' ('+ if(posof(HGVSp,':p.')>0,substr(HGVSp,1+posof(HGVSp,':p.'),1000),HGVSp)+')','') | columnsort 1,2,reference,call,zygosity,sequence_variant,Gene_Symbol,auto_classification,auto_evidence,GDX_view,DIAG_ACMGcat,Fidel_PRECPAT,Fidel_ALPHA,spliceai_max_impact,ref_af,XA_conservative_AF,XA_homCount,VEP_Max_Consequence,GT_Info,KNOWN_gene_diseases,Fidel_CALIBRATED,GDX_categorical_bin,GDX_classification,GDX_phenovar ;