Gregor1 SNV preprocessing¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "gdx-demo"
%env GOR_API_PROJECT={project}
Gregor Step1¶
Introduction¶
This notebook is an overview of the YML-based GOR query logic used as the study analysis pre-step in the Gregor clinical sequence variation analysis system. Following ETL steps that take gVCF files and load them into GORdb tables with bi-allelic variants and sequence read coverage information, multiple samples that form a case-study are processed and analysed together, e.g. for variant inheritance patterns.
The case-study analysis primarily consists of two steps: step1, where much of the heavy lifting of variant annotation with large variant database sources takes place, and step2, where the final variant filtering (e.g. based on QC, allele frequency, and variant impact), ACMG scoring, view filtering etc. take place. In addition to these steps, which work solely on small SNVs and InDels, there is additional YML-based report logic for processing CNVs, gene coverage and summary, as well as detailed gene drill-in overviews.
Here we will take apart the output of cla-queries/templates/gdx/gregor_step1.ftl.yml, which is a recent version of step1 that has been adjusted to some reference data at GeneDx, such as the Fidel score, SpliceAI score, classified variants, and genotype frequencies from XA.
We start by setting up a query helper in order to allow us to build the query incrementally and explore the individual steps as necessary. The helper is simple Python code with some limitations, especially in the add_many() method: it is sensitive to trailing characters following a semicolon and does not tolerate lines that contain only comments.
qh = GQH.GOR_Query_Helper()
mydefs = ""
PN parallelism¶
The following definition is used to enable PN-specific queries to be executed in parallel or with a single GOR worker. When step1 is executed as part of onboarding, i.e. in prewarming, the time it takes to run the full query without PN-parallelism is around 5 minutes per study. With PGOR, the time is less than a minute; however, the load on the file system and the GORdb cache is significantly higher and significantly less efficient when onboarding a high volume of samples in parallel. The PGOR option is therefore intended more for interactive re-evaluation, i.e. for case-study setup changes that call for re-analysis of step1 rather than the typical execution of step2.
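For interactive re-evaluation one would simply flip this definition to pgor. A minimal sketch of the alternative (not executed here; the variable name mydefs_alt is only illustrative, and the cell below keeps the single-worker gor setting):
# Alternative: run PN-specific statements with PGOR for interactive re-evaluation
mydefs_alt = qh.add_many("""def #gor_or_pgor# = pgor;""")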
# Do we run PN statements in PGOR or GOR
mydefs = qh.add_many("""def #gor_or_pgor# = gor;""")
Next we add multiple definitions for data sources (tables/files).
# reference data files and vep setup
mydefs = qh.add_many("""
def ##ref## = ref;
def ##vep_path## = source/anno/vep_v110;
def ##splice_path## = ref_base/spliceai/spliceai_scores.masked.all.hg19.filtered.gorz;
def ##FidelSource## = ref_base/GDX/Fidel/fidelhg19.gorz;
def ##hgmd_variants## = ##ref##/hgmd/hgmd_hgmd.gorz;
def ##hgmd_orpha## = ##ref##/hgmd/hgmd_orpha.gff.gorz;
def ##CGDmap_file## = ##ref##/disgenes/CGD.map;
def ##clinical_genes## = ##ref##/disgenes/clinical_genes.gorz;
def ##dbnsfp_max## = ##ref##/variants/dbnsfp_max.gorz;
def ##vep_impact## = ##ref##/VEP_impact.map;
def ##transcript_map## = ##ref##/disgenes/hgmd_clinvar_combined_transcripts.map;
def ##omim_inheritance_file## = ##ref##/disgenes/omim_inheritance.tsv;
def ##omim_morbidmap## = ##ref##/disgenes/omim_morbidmap.tsv;
def ##gnomad_gene_constraints## = ##ref##/variants/gnomad_gene_contraint.gorz;
def ##gnomad_freq## = ##ref##/variants/gnomad_freq.gorz;
def ##gene_map## = ##ref##/ensgenes/ensgenes.map;
def ##exac_gene_constraint## = ##ref##/variants/ExAC_gene_constraint_scores.gorz;
def ##exac_mpc_scores_file## = ##ref##/variants/ExAC_MPC_scores.gorz;
def ##mim2gene## = nor ##ref##/disgenes/mim2gene.txt | select 3,2,4,5 | rename #1 mim_number | rename #2 approved_gene_symbol | rename #3 mim_entry_type| rename #4 entrez_id;
def ##the_clinical_variants## = ##ref##/disvariants/clinical_variants.gorz;
def ##freq_max_file## = ##gnomad_freq##;
create ##dbsnp_clinvar## = pgor ##ref##/variants/dbsnp_clinvar.gorz | varnorm -left #3 #4
| granno 1 -gc #3,#4 -count | where allcount = 1 | hide allcount;
create ##dbsnp_file## = pgor ##ref##/variants/dbsnp.gorz;
create ##repeat_regions## = pgor ##ref##/repeats/simplerepeats.gorz;
""")
# Only using Refgene sources now
mydefs = qh.add_many("""
def ##vep_single## = ##vep_path##/vepref_single_wgs.gord;
def ##vep_multi## = ##vep_path##/vepref_multi_wgs.gord;
def ##vep_clinical## = ##ref##/disvariants/clinical_variants_vepref_multi.gorz;
create ##veprank## = nor -h ##vep_impact##
| rename so_term consequence
| rename vep_rank rank_consequence
| select consequence,rank_consequence;
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##all_genes_id## = ##ref##/refgenes/refgenes.gorz;
def ##all_exons## = ##ref##/refgenes/refgenes_exons.gorz;
def ##coding_exons## = ##ref##/refgenes/refgenes_codingexons.gorz;
def ##gene_transcripts## = ##ref##/refgenes/reftranscripts.gorz;
def ##deep_code_score## = <(gor ##ref##/deepCODE/deepCODE_score.gorz | rename refgene_transcripID transcript_stable_id);
""")
# Filter QC and parameter setup
mydefs = qh.add_many("""
def ##denovo_readcount## = 1;
def ##max_parents_read_ratio## = 0.1;
def ##cat3_max_score_threshold## = 0.9;
def ##cat4_minAf## = 0.03;
def ##hgmd_filter## = ('DM');
def ##clinvar_filter## = ('P','LP','P/LP');
def ##min_read_depth## = 8;
def ##gene_overlap_fuzz_factor## = 20;
def ##gene_overlap_fuzz_factor_vep## = 100;
def ##known_vars_fuzz_factor_vep## = 2;
def ##fuzzdist## = 20;
def ##vep_impact_filter## = ('HIGH','MODERATE');
def ##repeat_overlap_fuzz_factor## = 2;
def ##max_common_rare_ratio## = 0.05;
def ##min_lhz_size## = 3000;
def ##min_lhz_varcount## = 5;
""")
# Coverage setup
mydefs = qh.add_many("""
def ##lt5_max## = 0.05;
def ##lt10_max## = 0.1;
def ##genecov_hist## = source/cov/gene_cov_all_seg.gord -s PN;
""")
# Main gor-dict table sources
mydefs = qh.add_many("""
def ##wgs_vars## = source/var/wgs_varcalls.gord -s PN;
def ##good_cov## = source/cov/goodcov_8.wgs.gord -s PN;
def ##segment_cov## = source/cov/segment_cov.gord -s PN;
def ##cand_vars## = source/var/candidate_variants.gord -s PN;
""")
As an example, we see above that ##good_cov## is set for a read-depth of 8. This will be relevant in the analysis of de novo variants and the inheritance-rule analysis for cases and controls, e.g. via ##denovo_reads## and ##varcoverage##. We could also use ##segment_cov## for the same purpose; however, its data footprint is orders of magnitude larger! We do, however, use ##segment_cov## to calculate approximate depth at sites that do not have VCF variants in all individuals, e.g. in ##reference_depth##.
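As a quick sanity check, one could peek at a few coverage segments for a single sample straight from the dictionary. A sketch, not executed here, assuming the index sample DEMO_2796799 (used later in this notebook) is present in ##good_cov##:
%%gor
$mydefs
gor ##good_cov## -f 'DEMO_2796799' | top 5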
# Participant setup
mydefs = qh.add_many("""
def ##index_case## = 'DEMO_2796799';
def ##index_gender## = 'male';
def ##index_case_no_quotes## = 'DEMO_2796799';
def ##pngender_string## = 'DEMO_2796799,male';
def ##father## = 'DEMO_2827751';
def ##father_affstat## = unaffected;
def ##mother## = 'DEMO_2796810';
def ##mother_affstat## = unaffected;
def ##male_controls## = 'DEMO_2827751';
def ##male_cases## = ;
def ##female_controls## = 'DEMO_2796810';
def ##female_cases## = ;
def ##cc_pns## = 'DEMO_2827751','DEMO_2796810';
def ##all_pns## = 'DEMO_2827751','DEMO_2796810','DEMO_2796799';
def ##all_pns_no_quotes## = DEMO_2827751,DEMO_2796810,DEMO_2796799;
def ##pivot_type## = 'male_cases','female_cases','male_controls','female_controls';
def ##pntype## = IF(PN IN ('DEMO_2796799'),'index',IF(PN IN ('DEMO_2796810'),'female_controls',IF(PN IN ('DEMO_2827751'),'male_controls','UNKNOWN')));
create ##pngender## = nor <(gorrow chr1,1,2 | calc pn ##index_case_no_quotes## | split pn | calc gender decode(PN,##pngender_string##))
| select pn,gender;
create ##trio_pedigree## = nor <(gorrow chr1,1,2
| calc pn ##index_case_no_quotes##
| calc fn ##father##
| calc mn ##mother##
| split pn)
| select pn,fn,mn;
""")
# Legacy category names from WuxiNextcode
mydefs = qh.add_many("""
def ##category_map## = '1-KP_cDNA,Cat1,2-KP_AA,Cat1B,3-DV_KPCodon,Cat1C,4-KNonP,Cat1D,5-DV_intron,Cat1E,6-EP,Cat2,7-novel,Cat3,8-common,Cat4';
""")
# remove once we have stars column in dbsnp_clinvar file
mydefs = qh.add_many("""
def ##reviewstatus_to_stars## = 'no_assertion_provided,0,no_assertion_criteria_provided,0,no_interpretation_for_the_single_variant,0,criteria_provided|conflicting_interpretations,1,criteria_provided|single_submitter,1,criteria_provided|multiple_submitters|no_conflicts,2,reviewed_by_expert_panel,3,practice_guideline,4';
def #date_val($1) = int(left($1,4))*365+int(substr($1,5,7))*30+int(substr($1,8,10));
""")
The ##category_map## above maps internal WuXiNextCode categories to flavors of Cat1 variants in a classification scheme based on Cat1-Cat4, e.g. Cat1 is a Known Pathogenic variant with the same cDNA change, Cat1B the same amino acid change, Cat1C a different change in the same codon, Cat1D the same known but non-Pathogenic variant, and Cat1E a different but close-by known variant in an intron etc.
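Purely as an illustration of how such a map string is applied, decode() treats it as a comma-separated key,value list. A toy example, not part of step1 itself, where the column name legacy_cat is arbitrary:
%%gor
$mydefs
norrows 1 | calc category '2-KP_AA' | calc legacy_cat decode(category,##category_map##)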
The statements below show the sources for known variants: ##XA_annotations## holds known variants from XA based on their latest classification, selected with ATMAX on the date converted to a number, t.
Then we have ##clinical_variants##, which in the past has been based on the Clinvar database. Note that we merge these two sources and map their classifications to a common ACMG-style scheme (P, LP, VUS, LB, B) using the decode() function.
The column MaxClinImpact is used in many queries to define a subset of known Pathogenic variants. We see from the XA merge code that both P and LP in XA are mapped to Pathogenic in clinimpact and maxclinimpact. Furthermore, we collapse the variants and their sources into one row per variant with group 1 -gc ref,alt -set -sc #5- and then set MaxClinImpact to Pathogenic if any source labels them so, e.g. with the last replace command.
Finally, we generate an annotated version of the table, ##clinvar_hgmd_XA_annotations##, using var-left-joins with HGMD and dbsnp_clinvar.
The definition of these tables will need to be revised in light of how we want to combine Oscar, XA, Clinvar, and HGMD known variants; surely there is room to simplify this code significantly, or even remove it from step1 and place it in an update-merge process outside of it.
# Classified variants from XA
mydefs = qh.add_many("""
create ##XA_annotations## = gor ref_base/GDX/XA/variant_classifications.gorz
| where active = 't' | calc t #date_val(date) | atmax 1 -gc ref,alt t | select 1-alt,classification;
""")
# Merge XA classified variants with Clinvar and HGMD known variants. Collapse to single row per variant
mydefs = qh.add_many("""
create ##clinical_variants## = pgor ##the_clinical_variants##
| merge <(gor [##XA_annotations##] | calc dbsource 'XA'
| calc max_clinical_significance decode(classification,'BEN,B,GDXBEN,B,GDXVOUS,VUS,LBEN,LB,LPATH,LP,PATH,P,PAV,LP,PLEX,LP,RISK,VUS,RPTBEN,LBEN,UV,VUS,VUS,VUS')
| hide classification
| calc clinimpact if(contains(max_clinical_significance,'P'),'Pathogenic','unknown')
| calc maxclinimpact clinimpact
)
| group 1 -gc ref,alt -set -sc #5-
| replace set_max_clinical_significance if(listhasany(#rc,'P'),'P',if(listhasany(#rc,'LP'),'LP',if(listhasany(#rc,'VUS'),'VUS',if(listhasany(#rc,'LB'),'LB','B'))))
| replace set_* listdist(listfilter(#rc,"x!=''"))
| rename set_(.*) #{1}
| replace maxclinimpact if(contains(#rc,'Pathogenic'),'Pathogenic','')
;
create ##clinvar_hgmd_XA_annotations## = gor [##clinical_variants##]
| where dbsource LIKE '*hgmd*' or dbsource LIKE '*clinvar*'
| select 1-4,disease,dbsource,submitter,FDA_approved,max_clinical_significance
| rename #3 reference
| rename #4 call
| rename disease var_diseases
| rename dbsource source
| varjoin -l -r <(gor ##hgmd_variants## | select 1,2,ref,allele,hgmdacc,variantType,entrez,pmid | rename #3 reference | rename #4 call | rename variantType hgmd_variantType | rename hgmdacc hgmd_accession
| calc qc_known_vars if(hgmd_variantType in ##hgmd_filter##,1,0))
| varjoin -l -r <(gor [##dbsnp_clinvar##]| select 1,2,ref,alt,clnacc,clnsig_name,clnrevstat | rename #3 reference | rename #4 call | calc clinvar_stars decode(clnrevstat,##reviewstatus_to_stars##) | replace CLNSIG_NAME listfirst(CLNSIG_NAME) | replace CLNSIG_NAME if(CLNSIG_NAME like '*conflicting*','conflicting',if(CLNSIG_NAME='benign/likelybenign','B/LB',if(CLNSIG_NAME='pathogenic/likelypathogenic','P/LP',if(CLNSIG_NAME like '*likelybenign*','LB',if(CLNSIG_NAME like '*benign*','B',if(CLNSIG_NAME like '*likelypathogenic*','LP',if(CLNSIG_NAME like '*uncertainsignificance*','VUS',if(CLNSIG_NAME like '*pathogenic*','P','other')))))))) | rename clnsig_name clinvar_variantType | rename CLNACC clinvar_accession
| calc qc_known_vars if(clinvar_variantType in ##clinvar_filter##,1,0)
)
| replace clinvar_variantType if(contains(source,'XA'),max_clinical_significance,#rc)
| replace qc_known_vars if(qc_known_varsx = 1,1,qc_known_vars)
| where qc_known_vars > 0
| group 1 -gc reference,call -set -sc 5-pmid,clinvar_accession,clinvar_variantType,clinvar_stars
| rename set_(.*) #{1};
""")
As an example, we can explore the classified variants from XA like this:
%%gor
$mydefs
nor [##XA_annotations##] | group -gc classification -count
Query ran in 0.36 sec Query fetched 12 rows in 0.35 sec (total time 0.71 sec)
|   | classification | allCount |
|---|---|---|
| 0 | BEN | 38576 |
| 1 | GDXBEN | 3515 |
| 2 | GDXVOUS | 1 |
| 3 | LBEN | 108113 |
| 4 | LPATH | 25858 |
| 5 | PATH | 28763 |
| 6 | PAV | 5 |
| 7 | PLEX | 28 |
| 8 | RISK | 12 |
| 9 | RPTBEN | 47 |
| 10 | UV | 2384 |
| 11 | VUS | 199657 |
And the combined clinical variants like this:
%%gor
$mydefs
nor [##clinical_variants##] | grep Pathogenic | group -gc dbsource -count
Query ran in 0.17 sec Query fetched 9 rows in 9.16 sec (total time 9.34 sec)
|   | dbsource | allCount |
|---|---|---|
| 0 | XA | 26148 |
| 1 | XA,clinvar | 26915 |
| 2 | XA,clinvar,omim | 2866 |
| 3 | XA,omim | 74 |
| 4 | XA,omim,clinvar | 2290 |
| 5 | clinvar | 183283 |
| 6 | clinvar,omim | 6471 |
| 7 | omim | 7808 |
| 8 | omim,clinvar | 7943 |
We see that about half of the XA variants are present in Clinvar and that XA is only about 1/5th of Clinvar in terms of number of variants.
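One way to check that overlap claim directly would be to count the Pathogenic XA variants by whether clinvar also appears in their dbsource list; a sketch, not executed here:
%%gor
$mydefs
nor [##clinical_variants##] | grep Pathogenic | where contains(dbsource,'XA')
| calc also_in_clinvar if(contains(dbsource,'clinvar'),1,0)
| group -gc also_in_clinvar -count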
Next we calculate a mapping relation between Ensembl transcript IDs and the corresponding Refgene IDs, because some sources like dbNSFP and the Exac MPC scores do not come with Refgene IDs. This is done by collapsing all the start-stop coordinates of the exons in each transcript into a list and then joining on an exact match of these lists, in order to match the two ID domains. The relation ##enstrans2reftrans## also includes the Ensembl transcript biotype.
We then calculate a transcript-biotype relation by merging the biotype from ensgenes_codingexons and, where it is not available there, from refgenes_codingexons, e.g. using inset -c transcript_stable_id -not <(...).
In the statement for ##dbnsfp_file## we select a subset of the many columns that are available in dbNSFP and we translate from Ensembl transcript IDs to Refgene IDs. Notice that several columns in the dbnsfp table have semicolon-separated values, like the ensembl_transcriptid. If we were to use them, we would add them to the SPLIT command; however, none of the columns we are interested in here depend on the transcript, hence we only split on the Ensembl transcript ID.
# Create a map between ens- and ref-transcripts based on exact match of start-stop positions of exons
mydefs = qh.add_many("""
create ##enstrans2reftrans## = nor <(gor ##ref##/ensgenes/ensgenes_codingexons.gorz
| group chrom -gc transcript_stable_id,transcript_biotype -lis -sc 2,3
| join -snpsnp -xl lis_* -xr lis_*
<(gor ##ref##/refgenes/refgenes_codingexons.gorz
| where left(transcript_stable_id,3)='NM_'
| group chrom -gc transcript_stable_id -lis -sc 2,3)
| group genome -gc transcript_stable_id,transcript_biotype,transcript_stable_idx)
| select transcript_stable_id,transcript_biotype,transcript_stable_idx
| rename transcript_stable_idx ref_transcript_stable_id;
""")
# This assigns biotype from Ensembl where available (more detail, e.g. NMD), otherwise from Refgene
mydefs = qh.add_many("""
create ##trans2biotype## = nor [##enstrans2reftrans##]
| select ref_transcript_stable_id,transcript_biotype
| merge <(nor ##ref##/refgenes/refgenes_codingexons.gorz
| where left(transcript_stable_id,3)='NM_'
| inset -c transcript_stable_id -not <(nor [##enstrans2reftrans##] | select ref_transcript_stable_id)
| select transcript_stable_id,transcript_biotype | rename #1 ref_transcript_stable_id)
| group -gc #1 -set -sc transcript_biotype
| calc Biotype set_transcript_biotype
| select #1,biotype;
""")
# Add refgene transcript_stable_id using gene_symbol join to make file look as in Ensembl case. Ugly, but works for non-transcript based annotations
# Have to use add here instead of add_many because of a bug in handling semicolons!
mydefs = qh.add("""
create ##dbnsfp_file## = pgor ##ref##/variants/dbnsfp.gorz
| select 1,2,ref,alt,genename,ensembl_transcriptid,aaref,interpro_domain,MutationTaster_score,MetaSVM_score,MetaSVM_rankscore,MetaSVM_pred,CADD_phred,GERPxx_NR,GERPxx_RS,GERPxx_RS_rankscore
| split ensembl_transcriptid -s ';'
| multimap -c ensembl_transcriptid <(nor [##enstrans2reftrans##] | hide transcript_biotype
| map -c ref_transcript_stable_id <(nor ##ref##/refgenes/refgenes_codingexons.gorz
| where left(transcript_stable_id,3)='NM_' | select transcript_stable_id,gene_symbol | distinct))
| select 1,2,ref,alt,gene_symbol,ref_transcript_stable_id,aaref,interpro_domain,MutationTaster_score,MetaSVM_score,MetaSVM_rankscore,MetaSVM_pred,CADD_phred,GERPxx_NR,GERPxx_RS,GERPxx_RS_rankscore
| rename ref_transcript_stable_id transcript_stable_id
| distinct;
""")
%%gor
$mydefs
nor [##enstrans2reftrans##]
Query ran in 0.25 sec Query fetched 66,361 rows in 0.30 sec (total time 0.56 sec)
|   | transcript_stable_id | transcript_biotype | ref_transcript_stable_id |
|---|---|---|---|
| 0 | ENST00000000233 | protein_coding | NM_001662.4 |
| 1 | ENST00000000412 | protein_coding | NM_002355.4 |
| 2 | ENST00000000442 | protein_coding | NM_001282450.2 |
| 3 | ENST00000000442 | protein_coding | NM_004451.5 |
| 4 | ENST00000001008 | protein_coding | NM_002014.4 |
| ... | ... | ... | ... |
| 66356 | ENST00000610183 | protein_coding | NM_001286350.2 |
| 66357 | ENST00000610189 | protein_coding | NM_007266.4 |
| 66358 | ENST00000610205 | protein_coding | NM_145250.4 |
| 66359 | ENST00000610222 | protein_coding | NM_001253772.2 |
| 66360 | ENST00000610273 | protein_coding | NM_173516.3 |

66361 rows × 3 columns
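Since a single Ensembl transcript can map to more than one Refgene transcript (e.g. ENST00000000442 above), one might want to inspect the multiplicity of the mapping; a sketch, not executed here, that counts the Ensembl transcripts with more than one Refgene match:
%%gor
$mydefs
nor [##enstrans2reftrans##]
| group -gc transcript_stable_id -count
| where allcount > 1
| group -count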
The following statements generate a ##gene_panel_info## relation. The frequency cut-offs here are used to define the column af_cutoff in ##acmg_classification##, used to calculate ACMG evidence BS1. This type of gene-specific information should come from the GeneMB database. Furthermore, in Gregor step2, it is possible to provide a gene panel info file. The logic in ##acmg_classification## could be deferred to step2 without much increase in compute time. These are therefore candidates for revision.
# This preferred transcript might be replaced with data from GeneMb
mydefs = qh.add_many("""
create ##pref_transcript## = nor -h ##transcript_map##
| calc Gene_defTrans LISTDIST(if(HGMD_transcript!='',HGMD_transcript+if(CLINVAR_transcript!='',','+CLINVAR_transcript,''),CLINVAR_transcript))
| split Gene_defTrans
| colsplit Gene_defTrans 2 PrefTRANS -s '\.'
| group -gc gene_symbol -set -sc 2-
| rename set_(.*) #{1}
| hide prefTrans_2,Gene_defTrans
| rename PrefTRANS_1 Gene_defTrans;
""")
# Generate gene_panel_info file from hgmd_orpha. Used to be an input-file, e.g. GeneMB
mydefs = qh.add_many("""
create ##prevlookup## = norrows 1
| calc l '1-5 / 10 000'
| calc v 5.0/10000
| merge <(norrows 1 | calc l '1-9 / 1 000 000' | calc v 9.0/1000000)
| merge <(norrows 1 | calc l '1-9 / 100 000' | calc v 9.0/100000)
| merge <(norrows 1 | calc l '6-9 / 10 000' | calc v 9.0/10000)
| merge <(norrows 1 | calc l '<1 / 1 000 000' | calc v 1.0/1000000)
| merge <(norrows 1 | calc l '>1 / 1000' | calc v 1.0/1000)
| select l,v;
create ##gene_panel_info## = nor <(gor ##gene_transcripts## | select 1-gene_symbol,strand,transcript_stable_id | calc size #3-#2 | group chrom -gc gene_symbol,strand,transcript_stable_id -sum -ic size )
| atmax -gc gene_symbol sum_size
| select gene_symbol,strand,transcript_stable_id
| map -c gene_symbol -h -m NA <(nor ##hgmd_orpha##
| split prevalence_class
| map -c prevalence_class -h [##prevlookup##]
| select hgnc,v
| rename hgnc gene_symbol
| atmax -gc gene_symbol v
| calc Gene_maximum_genotype_freq_for_recessive sqrt(v)
| rename v Gene_maximum_genotype_freq_for_dominant)
| replace Gene_maximum_genotype_freq_for_dominant if(#rc='NA','0.001',#rc)
| replace Gene_maximum_genotype_freq_for_recessive if(#rc='NA','0.01',#rc);
""")
We see that at present, in our schema instance, the ##hgmd_orpha## table is empty:
%%gor
$mydefs
nor ##hgmd_orpha## | top 10
Query ran in 0.15 sec Query fetched 0 rows in 0.01 sec (total time 0.16 sec)
chrom | chromstart | chromend | name | strand | ID | accession | avg_age_of_death | avg_age_of_onset | disorder | ... | prevalence_class | prevalence_geography | prevalence_qualification | prevalence_source | prevalence_type | sign | uniprot | valmoy | feature | hyperlink |
---|
0 rows × 27 columns
and therefore, all the frequency cutoffs are the default values, e.g. 0.001 and 0.01.
%%gor
$mydefs
nor [##gene_panel_info##]
| rename Gene_maximum_genotype_freq_(.*) gmgf_#{1}
| granno -dis -sc gmgf_*
Query ran in 0.16 sec Query fetched 28,174 rows in 0.22 sec (total time 0.38 sec)
|   | gene_symbol | strand | transcript_stable_id | gmgf_for_dominant | gmgf_for_recessive | dis_gmgf_for_dominant | dis_gmgf_for_recessive |
|---|---|---|---|---|---|---|---|
| 0 | A3GALT2 | - | NM_001080438.1 | 0.001 | 0.01 | 1 | 1 |
| 1 | AADACL3 | + | NM_001103170.3 | 0.001 | 0.01 | 1 | 1 |
| 2 | AADACL4 | + | NM_001013630.2 | 0.001 | 0.01 | 1 | 1 |
| 3 | ABCA4 | - | NM_000350.3 | 0.001 | 0.01 | 1 | 1 |
| 4 | ABCB10 | - | NM_012089.3 | 0.001 | 0.01 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 28169 | VCY1B | + | NM_181880.2 | 0.001 | 0.01 | 1 | 1 |
| 28170 | XKRY | - | NM_004677.2 | 0.001 | 0.01 | 1 | 1 |
| 28171 | XKRY2 | + | NM_001002906.1 | 0.001 | 0.01 | 1 | 1 |
| 28172 | ZFY | + | NM_001145275.2 | 0.001 | 0.01 | 1 | 1 |
| 28173 | ZFY-AS1 | - | NR_144458.1 | 0.001 | 0.01 | 1 | 1 |

28174 rows × 7 columns
Next are two statements that transform the Gnomad and Exac tables that store information on intolerance to loss-of-function and missense deleteriousness. In the former, we transform from transcript level to gene level (most genes have only one transcript in this table and very few have two). In the second, we transform Ensembl transcript IDs to Refgene IDs using the mapping relation we defined above. In both cases, we select based on the maximum, e.g. of pLI and MPC respectively.
# Use the start,stop from ##all_genes## in gene_constraint
mydefs = qh.add_many("""
create ##gnomad_gene_constraints_annotations## = gor ##gnomad_gene_constraints##
| select 1,chromstart,chromend,gene,transcript,mis_z,syn_z,lof_z,pLI
| replace lof_z if(lof_z='NA','NaN',lof_z)
| replace pLI if(pLI='NA','NaN',pLI)
| map -c gene <(nor -h ##all_genes## | select gene_symbol,1-3)
| select 1,gene_start,gene_end,gene,mis_z,syn_z,lof_z,pLI
| sort chrom
| atmax 1 -gc gene pLI
| rename mis_z gnomad_gene_constraint_score_missense
| rename syn_z gnomad_gene_constraint_score_synonymous
| rename lof_z gnomad_gene_constraint_score_LOF_z
| rename pLI gnomad_gene_constraint_score_LOF_pLI
| distinct;
""")
# Convert exac file to use ref-transcripts
mydefs = qh.add_many("""
create ##exac_mpc_scores## = pgor ##exac_mpc_scores_file##
| select 1-4,gene_name,ENST,MPC
| multimap -c ENST [##enstrans2reftrans##]
| split ref_transcript_stable_id
| rename ref_transcript_stable_id transcript_stable_id
| group 1 -gc #3,#4,gene_name,transcript_stable_id -max -fc MPC
| rename max_MPC MPC;
""")
Following are some gene-disease-based annotation tables that have been used in the past.
The first two are solely used for annotations: one from the Clinical Gene Database, and another, ##clinical_genes##, that links to ref/disgenes/clinical_genes.gorz. The latter table is transformed to also annotate gene-disease relationships through gene paralog relationships, as stored in ##gene_map##. We notice that only new gene-disease relationships are added to Gene_par_diseases, by eliminating those already in gene_diseases from the list, e.g. with listfilter(set_par_diseases,'not(containsany(gene_diseases,x))').
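To make the listfilter/containsany filtering concrete, here is a toy example with made-up disease lists (not part of step1); only deafness should survive into new_par_diseases, since the other entries already appear in gene_diseases:
%%gor
$mydefs
norrows 1
| calc gene_diseases 'epilepsy,ataxia'
| calc set_par_diseases 'ataxia,deafness,epilepsy'
| calc new_par_diseases listfilter(set_par_diseases,'not(containsany(gene_diseases,x))')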
Then we have "bulky" transformations of the OMIM phenotype and inheritance modes. The last table ##omim_inheritance_matched## is used for annotation in ##vcf_wgs_anno## and then the omim_inheritance in ACMG evidence rules, e.g. contains(omim_inheritance,'AD').
# Clinical Gene Database
mydefs = qh.add_many("""
create ##CGDmap## = nor -h ##CGDmap_file##
| select gene,condition,inheritance,age_group,comments,intervention_categories,intervention_rationale,manifestation_categories,references
| prefix 1- CGD
| replace CGD_inheritance trim(CGD_inheritance)
| where CGD_MANIFESTATION_CATEGORIES != '' or CGD_INHERITANCE != '';
""")
# Create a table with gene-diseases and other gene-paralog-diseases
mydefs = qh.add_many("""
create ##pardiseases## = pgor ##clinical_genes## | select 1-gene_diseases
| map -c gene_symbol ##gene_map## -m '' -n gene_paralogs | split gene_paralogs
| map -c gene_paralogs -m '' <(nor ##clinical_genes## | select gene_symbol,gene_diseases) -h
| rename gene_diseasesx par_diseases
| group 1 -gc gene_end-gene_diseases -set -sc gene_paralogs,par_diseases -len 10000
| replace set_par_diseases listdist(listfilter(set_par_diseases,'not(containsany(gene_diseases,x))'))
| rename set_par_diseases Gene_par_diseases | rename set_gene_paralogs Gene_paralogs
| replace Gene_diseases,gene_par_diseases replace(#rc,',',', ')
| replace Gene_paralogs listfilter(Gene_paralogs,'x!="."');
""")
# Transform the OMIM table
mydefs = qh.add_many("""
create ##omim_inheritance## = nor -h ##omim_inheritance_file##
| split inheritance
| distinct
| replace inheritance if(inheritance='' and phenotype_id!='','Unknown inheritance',inheritance)
| calc new_inheritance if(inheritance = 'Mitochondrial','Mi',if(inheritance = 'Digenic dominant','DD',if(inheritance = 'Isolated cases','IC',if(inheritance = 'Pseudoautosomal recessive','PR',if(inheritance = 'Pseudoautosomal dominant','PD',if(inheritance = 'Pseudoautosomal recessive','PR',if(inheritance = '\?Autosomal dominant','?AD',if(inheritance like '*Somatic*', 'SMu',if(inheritance like '*Multifactorial*','MU',if(inheritance like '*Digenic recess*','DR',if(inheritance like '*linked d*','XLD',if(inheritance like '*linked r*','XLR',if(inheritance = 'X-linked','XL',if(inheritance = 'Unknown inheritance','UI',inheritance))))))))))))))
| group -gc omim_id,approved_symbol,phenotype_id -set -sc inheritance,new_inheritance
| rename set_inheritance inheritance | rename set_new_inheritance new_inheritance
| calc omim_moi if(listhasany(new_inheritance,'Autosomal recessive'),replace(new_inheritance,'Autosomal recessive','AR'),new_inheritance)
| replace omim_moi if(listhasany(omim_moi,'Autosomal dominant'),replace(omim_moi,'Autosomal dominant','AD'),omim_moi)
| replace omim_moi replace(omim_moi,',','/')
| replace new_inheritance replace(new_inheritance,',','/')
| group -gc omim_id,approved_symbol,phenotype_id, -set -sc new_inheritance,omim_moi
| rename set_new_inheritance inheritance
| rename set_omim_moi omim_moi
| replace inheritance if(listfirst(inheritance)='',listtail(inheritance), inheritance)
| columnsort approved_symbol
| split approved_symbol -s ','
| calc pheno_inheritance if(phenotype_id!='','('+phenotype_id+':'+omim_moi+')','')
| replace pheno_inheritance if(inheritance='' and phenotype_id!='', '('+phenotype_id+')', pheno_inheritance)
| select approved_symbol,omim_id,pheno_inheritance,phenotype_id,omim_moi
| group -gc approved_symbol,omim_id -lis -sc pheno_inheritance,phenotype_id,omim_moi
| rename lis_(.*) #{1}
| map -c omim_id -m '' <(nor -h ##omim_morbidmap##) -n omim_description
| replace phenotype_id if(left(phenotype_id,1)=',',right(phenotype_id,len(phenotype_id)-1),phenotype_id)
| rename phenotype_id omim_phenotype_mim
| rename pheno_inheritance omim_inheritance
| rename omim_description omim_disease_name
| rename omim_id omim_gene_mim
| rename approved_symbol gene_symbol
| map -c omim_gene_mim <(##mim2gene##)
| select entrez_id,gene_symbol,omim_*;
;
create ##omim_inheritance_entrez## = nor ##all_genes_id##
| rename gene_stable_id entrez_id
| select entrez_id,gene_symbol
| map -c entrez_id -m 'OMIM_NA' <(nor [##omim_inheritance##])
;
create ##omim_inheritance_gene## = nor [##omim_inheritance_entrez##]
| where omim_gene_mim='OMIM_NA'
| select entrez_id,gene_symbol
| map -c gene_symbol <(nor [##omim_inheritance##] | columnsort gene_symbol)
;
create ##omim_inheritance_matched## = nor [##omim_inheritance_entrez##]
| where omim_gene_mim!='OMIM_NA'
| merge [##omim_inheritance_gene##]
| select gene_symbol,omim_gene_mim,omim_phenotype_mim,omim_inheritance,omim_moi,omim_disease_name
| distinct;
""")
We will not go into the gory details of the above OMIM transformation. Needless to say, these tables need to be revised in connection with GeneMB. Below, we show the structure of ##omim_inheritance_matched## and that it satisfies a gene_symbol uniqueness condition.
%%gor
$mydefs
nor [##omim_inheritance_matched##] | granno -gc gene_symbol -count | throwif allcount > 1
| where omim_inheritance != '' and listsize(omim_inheritance) > 1 | top 1 | unpivot #1-
Query ran in 0.27 sec Query fetched 7 rows in 0.13 sec (total time 0.41 sec)
|   | Col_Name | Col_Value |
|---|---|---|
| 0 | gene_symbol | B3GALT6 |
| 1 | omim_gene_mim | 615291 |
| 2 | omim_phenotype_mim | 271640,609465,615349 |
| 3 | omim_inheritance | (271640:AR),(609465:AR),(615349:AR) |
| 4 | omim_moi | AR,AR,AR |
| 5 | omim_disease_name | Al-Gazali syndrome, 609465 (3),Ehlers-Danlos s... |
| 6 | allCount | 1 |
Looking at the above, we see that in ##acmg_classification## it would be cleaner to use listhasany(omim_moi,'AD') instead of contains(omim_inheritance,'AD') to check the mode of inheritance when setting AF cut-offs.
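A sketch of that comparison for a single gene from the table above (not executed here; the calculated column names are only illustrative):
%%gor
$mydefs
nor [##omim_inheritance_matched##] | where gene_symbol = 'B3GALT6'
| calc ad_via_contains if(contains(omim_inheritance,'AD'),1,0)
| calc ad_via_listhasany if(listhasany(omim_moi,'AD'),1,0)
| select gene_symbol,omim_inheritance,omim_moi,ad_via_contains,ad_via_listhasany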
AF and GT homozygous counts¶
Step1 provides options to select the AF source. Here we have selected Gnomad and we take the maximum across populations and across the different representations that may exist for InDels, e.g. via left varnorm.
Currently, there is no variant AF filtering in step1, but in step2 we have added lines for XA-derived AF:
create ##vcf_filtered## = pgor ##vcf_wgs_anno## | varjoin -l -r -e 0 -r -rprefix XA [#XAaf#] /* ref_base/GDX/XA/unaff_parents_conf_gender_af.gorz */ | replace ref_af if(ref_af < XA_conservative_af, XA_conservative_af, ref_af) | replace homcount if(homcount < XA_homcount, XA_homcount, homcount) ...
However, in step1 there is some ACMG evidence logic based on ref_af and it will be impacted by the AF source. Thus, the definition of ##frequency_file## should be revisited, e.g. in light of Gnomad v3, XA, etc. Clearly, it is trivial to add the same max-logic as in step2 into ##frequency_file## and ##homozygous_vars##.
# Generate AF file with ref_af column.
mydefs = qh.add_many("""
create ##frequency_file## = pgor ##freq_max_file##
| select 1-4,gnomad_exomes_qc,gnomad_genomes_qc,gnomad_het,gnomad_hom,gnomad_af_filtered,gnomad_af_afr,gnomad_af_amr,gnomad_af_asj,gnomad_af_eas,gnomad_af_fin,gnomad_af_nfe,gnomad_af_oth,gnomad_af_sas
| calc ref_af listnummax(gnomad_af_afr+','+gnomad_af_amr+','+gnomad_af_asj+','+gnomad_af_eas+','+gnomad_af_fin+','+gnomad_af_nfe+','+gnomad_af_oth+','+gnomad_af_sas)
| rename gnomad_af_filtered ref_af_filtered
| hide gnomad_af*
| varnorm -left #3 #4
| atmax 1 -gc #3,#4 ref_af
| replace ref_af form(ref_af,6,6);
create ##homozygous_vars## = pgor ##gnomad_freq##
| join -varseg -f 10 -i ##gene_transcripts##
| select 1-4,gnomad_exomes_hom,gnomad_genomes_hom
| where max(gnomad_exomes_hom,gnomad_genomes_hom)>0
| varnorm -left #3 #4
| calc homCount max(gnomad_exomes_hom,gnomad_genomes_hom)
| select 1-4,homCount
| atmax 1 -gc #3,#4 homCount;
""")
Computational evidence¶
Many algorithms have been developed to score variants, e.g. for pathogenicity. Within GDX there is work on a variant ranker that leverages patient phenotype information via gene multi-scores, which is planned to be integrated into Gregor. At present, we are using scores from dbNSFP for the legacy-type variant classification in DIAG_ACMGcat.
# This score should be combined with Fidel. Now used for old ACMG categories. Only SNPs, hence no need for left-norm
mydefs = qh.add_many("""
create ##max_score## = pgor ##dbnsfp_max##
| calc max_Score max(max(float(Polyphen2_HDIV_score,0),float(Polyphen2_HVAR_score,0)),float(Sift_score,0))
| group 1 -gc #3,#4 -max -fc max_score | rename max_max_score Max_Score;
""")
Additionally, for missense SNP variants, we use an internally developed score named Fidel that combines and recalibrates (for some genes) the AlphaMissense and REVEL scores. This score is used for ACMG evidence PP3.
# Calibration of scores analogous to the paper by Tavtigian; levels suggested by Robert Kueffner. Only SNPs, hence no need for left-norm.
mydefs = qh.add_many("""
create ##PP3_BP4## = pgor ##FidelSource##
| select 1-4,precben,precpat
| calc fidel_bp4_pp3
if(precpat>0.99,'PP3_strong',
if(precpat>.98,'PP3_moderate',
if(precpat>.97,'PP3_supporting',
if(precben>.99,'BP4_strong',
if(precben>.98,'BP4_moderate',
if(precben>.97,'BP4_supporting',''))))))
| where fidel_bp4_pp3 != ''
| calc PP3_score if(fidel_bp4_pp3 ~ 'PP3*',1,0)
| calc BP4_score if(fidel_bp4_pp3 ~ 'BP4*',1,0)
| select 1-4,fidel_bp4_pp3,PP3_score,BP4_score;
""")
Interpro regional domains are now simply extracted from dbNSFP. Some other BED-based source might be a better solution. Notice the definition of bpstart and the use of segspan -gc interpro_domain to get better overlap between adjacent SNPs sharing a domain.
# Calculate interpro regions from dbNSFP label
mydefs = qh.add("""
create ##interpro_regions## = pgor [##dbnsfp_file##]
| where interpro_domain != '.'
| select 1,2,interpro_domain
| split interpro_domain -s ';'
| where interpro_domain != '.'
| calc bpstart pos-2
| select 1,bpstart,pos,interpro_domain
| segspan -maxseg 10000 -gc interpro_domain
| calc newpos if(bpStart+1<bpStop,bpStart+1,bpStart)
| select 1,newpos,bpStop,interpro_domain
| rename newpos bpStart
| distinct;
""")
The above statements are non-volatile, i.e. they depend neither on sample input nor on shared sample tables like the master VEP table. Hence, they could be moved into reference data generation and, as such, be shared across GORdb projects, unlike the cached GORdb statements. So far, this has not been deemed important enough to be done, and keeping them here has allowed for flexibility in the Gregor YML development.
Volatile and PN based statements¶
Now we look into some of the key query statements in step1, ##sub_vep## and ##pn_variants_vepped##, both of which are what we refer to as volatile. The latter is obviously so through the change in ##all_pns##, i.e. it will depend on the case-study samples that are analysed in step1. It is less obvious for ##sub_vep##; however, keep in mind that as new samples are onboarded into the GORdb project schema, the incremental VEP process will add new variant rows to ##vep_multi##.
Currently, the expressions below are set up in step1 with an option to include or exclude SpliceAI and Fidel, hence the "unnecessarily complicated" use of definitions. We might want to simplify and change this soon.
# Logic to support conditional inclusion of spliceAI. Join on var-gene-aliases
mydefs = qh.add_many("""
def ##spliceai_join## = | varjoin -r -l ref_base/spliceai/spliceai_scores.masked.all.hg19.filtered.gorz
| leftwhere gene_symbolx[-1] listhasany(gene_aliases,gene_symbol)
| hide scorezip,gene_symbolx,gene_aliases
| calc BP7_score if(spliceai_max_impact = '',1,0)
| varjoin -r -l -e '-1' -rprefix Fidel ##FidelSource##
;
def ##spliceai_cols## = spliceai_max_consequence,spliceai_max_impact,Fidel_PRECPAT,Fidel_PRECBEN,Fidel_Calibrated,Fidel_ALPHA;
def ##spliceai_filter## = or spliceai_max_impact in ('HIGH','MODERATE') or Fidel_PRECPAT >= 0.97;
""")
# The master-join to generate a reduced vep table for fuzzy-exon-regions, high-mod VEP, known variants, protected.
# This statement is independent of PNs, however, needs to be re-evaluated if the VEP table is updated following sample import.
# All tables are assumed left-normalized.
# Right-join sources are stored in cache for fast seek, e.g. due to prior-filtering of left-source.
mydefs = qh.add_many("""
create ##sub_vep## = pgor -split 100 ##vep_multi##
| join -varseg -ic -f ##gene_overlap_fuzz_factor_vep## ##all_exons##
| rename overlapcount oc1
| calc oc2 0 /* varjoin -ic [##protected_variants##] | rename overlapcount oc2 */
| join -snpsnp -f ##known_vars_fuzz_factor_vep## -ic [##clinvar_hgmd_XA_annotations##]
| rename overlapcount oc3
| map -c consequence -h ##vep_impact##
##spliceai_join##
| where oc1 > 0 or oc2 > 0 or oc3 > 0 or vep_impact in ##vep_impact_filter## ##spliceai_filter##
| atmin 1 -last -gc reference,call,feature vep_rank
| calcifmissing BP7_score 1
| calc qc_high_mod_impact if(vep_impact in ##vep_impact_filter## ##spliceai_filter## ,1,0)
| replace gene_symbol if(gene_symbol='.','',gene_symbol)
| select 1,2,Reference,Call,Max_Consequence,Gene_symbol,Feature,Consequence,amino_acids,Protein_position,Protein_Size,Biotype,Exon,HGVSc,HGVSp,Refgene,vep_impact,qc_high_mod_impact,BP7_score,##spliceai_cols##
| varjoin -norm -l -r [##dbsnp_file##]
| varjoin -norm -r -l -e 0.0 [##frequency_file##]
| varjoin -norm -r -l -e 0.0 [##max_score##]
| varjoin -norm -r -l [##PP3_BP4##]
| varjoin -norm -r -l -e 0 [##homozygous_vars##]
| varjoin -norm -r -l -xl feature -xr transcript_stable_id -e 0 [##exac_mpc_scores##]
| hide gene_name,transcript_stable_id;
""")
Notice the use of -split 100 in PGOR. This is done because this join involves all the BIG table sources, i.e. the VEP table, which is ever growing because of incremental sample VEP, the dbNSFP-related tables, Gnomad, SpliceAI etc., all of which are very large (see query below).
Further notice that we apply ATMIN to select the most relevant VEP consequence per variant-transcript/feature (Ensembl VEP returns potentially multiple consequences which are all stored for optional further details). Because of the size of these tables, we use the -norm VARJOIN option, because we know that all these tables are left-normalized. This makes the join more efficient than if we normalize on the fly, especially for dense tables.
The most important command, though, is the WHERE filtering expression. We see that we pass through variants that fulfill any of the following:
- known classified variant
- inside fuzzy exon boundaries
- white-listed (oc2 is set to zero here because a protected-variants list was not provided)
- VEP consequence of significance (typically HIGH and MODERATE impact)
- SpliceAI impact HIGH or MODERATE
- Fidel pathogenicity prediction >= 0.97
Notice that the column qc_high_mod_impact includes VEP, SpliceAI, and Fidel conditions.
%%gor
$mydefs
create #a# = pgor [##frequency_file##] | group chrom -count;
create #b# = pgor [##dbsnp_file##] | group chrom -count;
nor [#a#] | group -sum -ic allcount | merge -s <(nor [#b#] | group -sum -ic allcount)
Query ran in 0.52 sec Query fetched 2 rows in 0.03 sec (total time 0.55 sec)
|   | sum_allCount | Source |
|---|---|---|
| 0 | 1251568008 | R |
| 1 | 275510472 | L |
Multiple executions of step1, e.g. for different case-studies, can leverage the same cached ##sub_vep##, as long as no incremental VEP is happening. Clearly, both for the efficiency of the VEP process and for the execution of step1, it is beneficial to batch both the VEP and the step1 analysis.
Next we define ##pn_variants_vepped##, a subset of ##sub_vep## for the variants in the samples of the case study. This table is the basis for most of the downstream analysis in step1. It basically varjoins the sample variants (after left-normalization, because the PN input is not assumed to be normalized) with ##sub_vep##. It then repeats some of the light joins in ##sub_vep## to establish filtering information, because only qc_high_mod_impact is stored there (this is ugly and could easily be modified). It then adds information about the preferred transcript (if available) and ranks the transcripts based on consequence and/or protein size. Furthermore, it turns all the qc_* columns into 1/0 Boolean-style values. Also, if a custom AF source is provided (which was not the case here), a ref_af_cust column is added from that source via varjoin.
# Left-varjoin from PN into ##sub_vep## and rank-order transcripts per gene. See genealiaspanelinfo in Gregor2 for ##pref_transcript##
mydefs = qh.add_many("""
create ##pn_variants_vepped## = #gor_or_pgor# ##wgs_vars## -f ##all_pns##
| select 1-4,PN,callratio,callcopies,depth,filter,gl_call
| varnorm -left #3 #4
| varjoin -ic [##clinvar_hgmd_XA_annotations##]
| rename OverlapCount qc_known_vars
| join -snpsnp -f ##known_vars_fuzz_factor_vep## -ic [##clinvar_hgmd_XA_annotations##]
| rename OverlapCount qc_closeby_known_vars
| calc qc_protected_vars 0
| varjoin -l -r -e 'NO_VEP_RESULT' <(gor [##sub_vep##])
| hide biotype
| map -c feature [##trans2biotype##] /* This filters out transcript based on NM status in refgene and biotype from Ensembl */
| join -varseg -ic -f ##gene_overlap_fuzz_factor## ##all_exons##
| rename overlapCount qc_in_exon
| where qc_in_exon > 0 or qc_known_vars > 0 or qc_protected_vars > 0 or qc_high_mod_impact > 0 or qc_closeby_known_vars > 0
| replace qc_in_exon,qc_known_vars,qc_closeby_known_vars if(int(#rc)>0,1,0)
| map -c consequence -m 100 -h [##veprank##]
| map -m 'miss' -c gene_symbol -h [##pref_transcript##]
| calc DefTrans if(Gene_DefTrans='miss', '0', if(containsany(listfirst(refgene,'.'),gene_DefTrans), '1', '0'))
| calc transcript_source if(LISTHASANY(Refgene,HGMD_transcript) and Refgene!='.' and not(LISTHASANY(Refgene,CLINVAR_transcript)),'is_hgmd',if(LISTHASANY(Refgene,HGMD_transcript) and Refgene!='.' and LISTHASANY(Refgene,CLINVAR_transcript),'is_hgmd_and_clinvar',if(not(LISTHASANY(Refgene,HGMD_transcript)) and Refgene!='.' and LISTHASANY(Refgene,CLINVAR_transcript),'is_clinvar','no_source_detected')))
| replace rank_consequence if(refgene='.',rank_consequence+100,rank_consequence)
| replace rank_consequence if(containsany(refgene,Gene_DefTrans),rank_consequence-200,rank_consequence)
| replace protein_size if(isint(protein_size),int(protein_size),1)
| granno 1 -gc reference,call,PN,gene_symbol -max -ic protein_size
| replace rank_consequence if(protein_size = max_protein_size,rank_consequence-50,rank_consequence)
| rename rank_consequence tmp_rank
| rank 1 -gc reference,call,PN,gene_symbol tmp_rank -o asc
| rename rank_tmp_rank qc_consequence_rank
| calc ref_af_cust 0.0
| select 1,2,Reference,Call,PN,gene_symbol,feature,ref_af,homcount,Max_score,Max_Consequence,Consequence,vep_impact,amino_acids,Protein_position,Protein_Size,BioType,Exon,HGVSc,HGVSp,Refgene,clinvar_transcript,hgmd_transcript,qc_consequence_rank,callratio,callcopies,depth,filter,gl_call,ref_af_cust,qc_in_exon,qc_protected_vars,qc_high_mod_impact,qc_known_vars,qc_closeby_known_vars,transcript_source,DefTrans,rsIDs,fidel_bp4_pp3,PP3_score,BP4_score,BP7_score,MPC,##spliceai_cols##
;""")
The ##pn_variants_vepped## table is a much reduced subset of ##sub_vep##, especially so if many samples have been onboarded into the project. While the heavy joins are done once for potentially multiple case studies in ##sub_vep##, it may be better to combine these two steps and execute them as a single step for multiple case studies, outside step1, and provide these study specific files as an input to step1, analogous to how step1 feeds step2 with its output. We will look at this and the efficiency improvements related to this in a separate notebook.
%%gor
$mydefs
gor [##pn_variants_vepped##] | top 100
Query ran in 0.39 sec Query fetched 100 rows in 0.01 sec (total time 0.39 sec)
CHROM | POS | Reference | Call | PN | Gene_Symbol | Feature | ref_af | homCount | Max_Score | ... | PP3_score | BP4_score | BP7_score | MPC | spliceai_max_consequence | spliceai_max_impact | Fidel_PRECPAT | Fidel_PRECBEN | Fidel_CALIBRATED | Fidel_ALPHA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69270 | A | G | DEMO_2796799 | OR4F5 | NM_001005484.2 | 0.998296 | 13922 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
1 | chr1 | 69270 | A | G | DEMO_2796810 | OR4F5 | NM_001005484.2 | 0.998296 | 13922 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
2 | chr1 | 69270 | A | G | DEMO_2827751 | OR4F5 | NM_001005484.2 | 0.998296 | 13922 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
3 | chr1 | 69511 | A | G | DEMO_2796799 | OR4F5 | NM_001005484.2 | 0.999479 | 74171 | 0.348 | ... | 0.0 | 1.0 | 1 | 0.0 | NaN | NaN | 0.36672 | 0.998221 | 0 | 0.0543 |
4 | chr1 | 69511 | A | G | DEMO_2796810 | OR4F5 | NM_001005484.2 | 0.999479 | 74171 | 0.348 | ... | 0.0 | 1.0 | 1 | 0.0 | NaN | NaN | 0.36672 | 0.998221 | 0 | 0.0543 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | chr1 | 879687 | T | C | DEMO_2827751 | NOC2L | NM_015658.4 | 0.961982 | 13740 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
96 | chr1 | 879789 | CCTGGAA | C | DEMO_2796799 | SAMD11 | NM_001385640.1 | 0.098191 | 15 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
97 | chr1 | 879789 | CCTGGAA | C | DEMO_2796799 | SAMD11 | NM_001385641.1 | 0.098191 | 15 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
98 | chr1 | 879789 | CCTGGAA | C | DEMO_2796799 | SAMD11 | NM_152486.4 | 0.098191 | 15 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
99 | chr1 | 879789 | CCTGGAA | C | DEMO_2796799 | NOC2L | NM_015658.4 | 0.098191 | 15 | 0.000 | ... | NaN | NaN | 1 | 0.0 | NaN | NaN | -1.00000 | -1.000000 | -1 | -1.0000 |
100 rows × 49 columns
Next we turn our attention to the classification of variants based on perfect overlap or adjacency with known classified variants. For that, we need to know the codon and amino acid change both for the sample-based variants and for all the known variants. In the past, we relied on a VEP table for Clinvar, e.g. ref/disvariants/clinical_variants_vep_multi.gorz; however, as we are now dynamically adding classified variants from XA and Oscar, we must have some process for updating this or an equivalent source.
Given that we do not need the detailed classification that Ensembl VEP provides, rather than setting up a separate VEP process for variants coming through Oscar (or folding them into the incremental sample VEP process), here we simply implement a super-fast GOR-VEP version, to do this within step1.
First, we set up ##e##, used to calculate the CDS position and amino acid changes for codons spanning two exons, and #geneseg#, used to classify non-coding variants (not really needed for our purpose here, though). Notice the use of refbases() in the calculation of fSeq and rSeq in ##extmp##, which are then used further below in ##known_var_vep_multi## to calculate fo_dnabefore and re_dnabefore, i.e. the coding sequence before the mutation.
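As a minimal illustration of refbases(), which returns the reference sequence for a given chromosome range (judging from its use in ##extmp##, 1-based and inclusive), here is a throw-away query at an arbitrary position, not executed and not part of step1:
%%gor
$mydefs
gorrow chr1,69270,69271 | calc three_ref_bases refbases(chrom,#2+1,#2+3)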
# Calculate segments used to classify non exonic variants in GOR-VEP
mydefs = qh.add_many("""
create #utrs# = gor ##gene_transcripts## | calc utr_start if(isint(cdsstart),gene_start+','+cdsend,'')
| calc utr_end if(isint(cdsstart),cdsstart+','+gene_end,'') | split utr_start,utr_end
| where utr_start != '' | calc segType if(utr_start=gene_start and strand='+' or utr_start!=gene_start and strand='-','5 prime UTR','3 prime UTR')
| select 1,utr_start,utr_end,segType,transcript_stable_id,strand,gene_symbol | sort chrom;
create #upstrdownstr# = gor <(nor ##gene_transcripts## | calc bpstart gene_start-5000 | calc bpstop gene_start | calc segType if(strand='+','Upstream','Downstream')
| merge <(nor ##gene_transcripts## | calc bpstart gene_end | calc bpstop gene_end+5000 | calc segType if(strand='-','Upstream','Downstream'))
| select 1,bpStart,bpStop,transcript_stable_id,strand,gene_symbol,segType | replace bpStart max(bpStart,0) ) | sort chrom;
create ##e## = gor ##coding_exons##
| hide transcript_biotype
| multimap -c Transcript_stable_id [##trans2biotype##] | rename Biotype transcript_biotype
| select 1-gene_symbol,exon,Transcript_stable_id,transcript_biotype,strand
| calc es #3-#2;
""")
# Self-join exons to calculate upstream/downstream ref-seq with adjacent exons, similar to Ex. 28 in the Supplemental material of the GORpipe paper
mydefs = qh.add("""
create ##extmp## = pgor [##e##]
| join -snpsnp -f 4000000 -xl transcript_stable_id -xr transcript_stable_id [##e##]
| calc bs if(chromstartx < chromstart,esx,0)
| calc as if(chromstartx > chromstart,esx,0)
| calc nif if(bs>0,1,0)
| calc nir if(as>0,1,0)
| calc rseq if(chromstartx < chromstart,refbases(chrom,max(chromendx-2,chromstartx+1),chromendx),'')
| calc fseq if(chromstartx > chromstart,refbases(chrom,chromstartx+1,min(chromstartx+3,chromendx)),'')
| group 1 -gc chromend,gene_symbol,exon,transcript_stable_id,transcript_biotype,strand -sum -ic bs,as,nif,nir -lis -sc rseq,fseq -len 10000
| hide lis_bs,lis_as,lis_nif,lis_nir
| calc rseq right(replace(lis_rseq,',',''),3)
| calc fseq left(replace(lis_fseq,',',''),3)
| hide lis_*
| calc exonNum if(strand='+',sum_nif+1,sum_nir+1)
| calc numExons sum_nif+sum_nir+1
| calc transcript_size sum_as+sum_bs+#3-#2
""")
# Exon table with information needed to calculate the cds position for 50bp rule and amino acid changes with codons spanning two exons
mydefs = qh.add("""create ##ex## = gor [##extmp##]
| map -c transcript_Stable_id <(nor -h <(gor [##extmp##]
| calc last_exon_size if(exonNum=numExons,chromend-chromstart,0))
| group -gc transcript_stable_id -max -ic last_exon_size
| rename max_(.*) #{1})
""")
# Generate introns by joining the exons based on adjacent position within the transcript
mydefs = qh.add("""
create #geneseg# = gor [##ex##] | join -snpsnp -f 4000000 -xl transcript_stable_id,sum_nif -xr transcript_stable_id,x <(gor [##ex##] | calc x sum_nif - 1)
| calc gapstart chromend-1 | calc gapend chromstartx | select chrom,gapstart,gapend,gene_symbol,transcript_stable_id,strand,exonNum
| replace exonNum if(strand = '-',exonNum-1,exonNum) | sort 4000000
| rename gapstart bpStart | rename gapend bpStop | calc segType 'Intron'
| merge <(gor [#utrs#] | rename utr_start bpStart | rename utr_end bpStop )
| merge <(gor [##ex##] | rename chromstart bpStart | rename chromend bpStop | select 1-3,gene_symbol,transcript_stable_id,strand,exonNum | calc segType 'Exon')
| merge [#upstrdownstr#]
| replace gene_symbol if(posof(gene_symbol,'-')>0,left(gene_symbol,posof(gene_symbol,'-')),gene_symbol)
""")
For our purpose, i.e. to calculate ##cat1bc##, we only need VEP-like information for the known Pathogenic variants. The following code delivers ##known_var_vep_multi## for all such variants, e.g. from XA and Clinvar.
# Use fast GOR-VEP to classify known variants.
# Using this approach, we no longer need to Ensembl-VEP the clinical variants, e.g. ref/disvariants/clinical_variants_vep_multi.gorz
# This allows us to classify sample variants in proximity with known variants with same amino-acid changes
# and to calculate ##patho_high## and ##patho_missense##. However, for detailed classification of the sample variants, we rely on Ensembl-VEP
# This fast GOR-VEP code is based on VEP_calc.yml
mydefs = qh.add("""
def #vep_input_file# = [##clinical_variants##] | where MaxClinImpact = 'Pathogenic';
create ##known_var_vep_multi## = pgor #vep_input_file# | select 1-4
| rename #3 reference | rename #4 call
| join -varseg -l -e 10 <(gor [#geneseg#]
| calc impact if(segType = 'Exon',1,if(segType in ('3 prime UTR','5 prime UTR','Intron'),2, if(segType in ('Downstream','Upstream'),3,4))))
| rank 1 impact -gc reference,call,transcript_stable_id -o asc -rmax 1
| where rank_impact = 1 | hide impact,rank_impact
| varjoin -r -l -xl transcript_stable_id,segType,pos -xr transcript_stable_id,segType,pos <(gor #vep_input_file# | select 1-4
| join -varseg [##ex##]
| rename chromstart sta | rename chromend sto
| rename #3 Ref | rename #4 Alt
| calc exorder if(strand = '+',sta,-sto)
| granno 1 -gc transcript_stable_id -min -ic exorder -count
| rename allCount exonOverlaps
| where exorder = min_exorder
| hide exorder,min_exorder
| calc cdspos if(strand = '+',pos-sta+sum_bs,sto-pos-len(Ref)+sum_as+2)
| calc fo_mo mod(pos+sum_bs-sta-1,3)
| calc fo_dnabefore if(pos>sta and pos+len(Ref)-1<=sto,lower(right(rSeq+if(pos>sta+1,refbases(chrom,max(sta+1,pos-4),pos-1),''),fo_mo)),'')
| calc fo_dnaafter if( (2-fo_mo-mod(len(Ref),3)) >= 0,lower(refbases(chrom,pos+len(Ref),pos+len(Ref)+2-fo_mo-mod(len(Ref),3))),'')
| calc re_mo mod(sto-(pos+len(Ref)-1)+sum_as,3)
| calc re_dnabefore if(pos>sta and pos+len(Ref)-1<=sto,lower(revcompl(left(if(pos+len(Ref)-1<sto,refbases(chrom,pos+len(Ref),min(sto,pos+len(Ref)+4)),'')+fSeq,re_mo))),'')
| calc re_dnaafter if( (2-re_mo-mod(len(Ref),3)) >= 0,lower(revcompl(refbases(chrom,pos-1-(2-re_mo-mod(len(Ref),3)),pos-1))),'')
| calc Codons if(strand = '+',fo_dnabefore+Ref+fo_dnaafter+'/'+fo_dnabefore+Alt+fo_dnaafter,re_dnabefore+revcompl(Ref)+re_dnaafter+'/'+re_dnabefore+revcompl(Alt)+re_dnaafter)
| hide distance-sto,sum_bs-fSeq,fo_mo-re_dnaafter
| calc Protein_position str(ceil(cdspos/3))+if(len(ref)>2,'-'+str(ceil((cdspos+len(ref)-1)/3)),'')
| calc Amino_acids if(codons2shortaminos(upper(substr(codons,0,posof(codons,'/'))))=codons2shortaminos(upper(substr(codons,posof(codons,'/')+1,1000)))
,codons2shortaminos(upper(substr(codons,0,posof(codons,'/'))))
,codons2shortaminos(upper(substr(codons,0,posof(codons,'/'))))+'/'+codons2shortaminos(upper(substr(codons,posof(codons,'/')+1,1000))))
| replace amino_acids replace(replace(amino_acids,'/X','/'+'*'),'X/','')
| calc segType = 'Exon'
)
| calc aminobefore if(contains(codons,'/'),codons2aminos(substr(codons,0,posof(codons,'/'))),'.')
| calc aminoafter if(contains(codons,'/'),codons2aminos(substr(codons,posof(codons,'/')+1,1000)),'.')
| calc Consequence if(segType = 'Intron' and (strand = '+' and segdist(pos-1,pos-1+len(reference),bpstart,bpstart+1)<3 or strand = '-' and segdist(pos-1,pos-1+len(reference),bpstop-1,bpstop)<3),'splice_donor_variant',
if(segType = 'Intron' and (strand = '+' and segdist(pos-1,pos-1+len(reference),bpstop-1,bpstop)<3 or strand = '-' and segdist(pos-1,pos-1+len(reference),bpstart,bpstart+1)<3),'splice_acceptor_variant',
if(segType = 'Intron' and (segdist(pos-1,pos-1+len(reference),bpstop-1,bpstop)<8 or segdist(pos-1,pos-1+len(reference),bpstart,bpstart+1)<8),'splice_region_variant',
if(segType = 'Intron','intron_variant',
if(segType = 'Upstream','upstream_gene_variant',
if(segType = 'Downstream','downstream_gene_variant',
if(segType = 'Exon' and aminoafter = 'STO','stop_gained',
if(segType = 'Exon' and mod(len(reference)-len(call),3)!=0,'frameshift_variant',
if(segType = 'Exon' and aminobefore = 'STO' and aminoafter != 'STO','stop_lost',
if(segType = 'Exon' and aminobefore = 'STO' and aminoafter = 'STO','stop_retained_variant',
if(segType = 'Exon' and aminobefore = 'ATG' and aminoafter != 'ATG','start_lost',
if(segType = 'Exon' and len(reference)<len(call),'inframe_insertion',
if(segType = 'Exon' and len(reference)>len(call),'inframe_deletion',
if(segType = 'Exon' and aminobefore != aminoafter,'missense_variant',
if(segType = 'Exon' and (( segdist(pos-1,pos-1+len(reference),bpstart,bpstart+1)<3 and not(strand='-' and exonNum=numExons)) or ( segdist(pos-1,pos-1+len(reference),bpstop-1,bpstop)<3 and not(strand='+' and exonNum=numExons))),'splice_region_variant',
if(segType = 'Exon' and aminobefore = aminoafter,'synonymous_variant',
if(segType = '5 prime UTR','5_prime_UTR_variant',
if(segType = '3 prime UTR','3_prime_UTR_variant','?'
))))))))))))))))))
| select 1,2,reference,call,gene_symbol,strand,transcript_stable_id,Consequence,Codons,Protein_position,Amino_acids,exonNum,segType,exon,transcript_biotype,transcript_size,cdspos
| rename transcript_stable_id feature
;""")
We can for instance compare our GOR-VEP with Ensembl VEP in terms of classifying these known variants as HIGH/MODERATE on variant level, i.e. by collapsing the impact membership to the variant and summarizing it as such:
%%gor
$mydefs
create temp = pgor [##known_var_vep_multi##] | where len(#3)=len(#4)
| varjoin -xl feature -xr feature ref/disvariants/clinical_variants_vepref_multi.gorz
| inset -b -c consequence <(nor ##vep_impact## | where vep_impact in ('HIGH','MODERATE'))
| inset -b -c consequencex <(nor ##vep_impact## | where vep_impact in ('HIGH','MODERATE'))
| group 1 -gc 3,4 -max -ic inset,insetx
| group chrom -gc max_* -count;
nor [temp] | group -gc max_* -sum -ic allcount
| calc same if(#1 = #2,sum_allcount,0) | calc diff if(#1 != #2,sum_allcount,0)
| group -sum -ic same,diff | calc f sum_diff/(sum_same+sum_diff)
Query ran in 0.47 sec Query fetched 1 rows in 0.01 sec (total time 0.48 sec)
 | sum_same | sum_diff | f |
---|---|---|---|
0 | 146553 | 882 | 0.005982 |
We see that the disagreement is only 0.6%. Similarly, we can compare the codon changes at the variant-feature level for important SNPs:
%%gor
$mydefs
create temp = pgor [##known_var_vep_multi##] | where len(#3)=len(#4)
| varjoin -xl feature -xr feature ref/disvariants/clinical_variants_vepref_multi.gorz
| inset -b -c consequence <(nor ##vep_impact## | where vep_impact in ('HIGH','MODERATE'))
| inset -b -c consequencex <(nor ##vep_impact## | where vep_impact in ('HIGH','MODERATE'))
| where inset = 1 or insetx = 1
| calc same_amino_acids if(amino_acids = amino_acidsx,1,0)
| group 1 -gc 3,4,feature -max -ic same_*
| group chrom -sum -ic max_* -count;
nor [temp] | group -sum -ic allcount- | calc f #2/#1*100
Query ran in 0.43 sec Query fetched 1 rows in 0.01 sec (total time 0.44 sec)
 | sum_allCount | sum_sum_max_same_amino_acids | f |
---|---|---|---|
0 | 489204 | 472427 | 96.570551 |
We see that the amino acid change for these important SNPs agrees for 97% of them. Not perfect, but certainly better than leaving all the variants that come from XA or Oscar unclassified because they are missing from the static clinical_variants_vepref_multi.gorz table!
Now we are ready to calculate all the Cat1 category flavors, which are ranked and merged in ##all_cat1## using GRANNO. This category is then used to define DIAG_ACMGcat, an old-school variant category, and also as input to ACMG evidence in step2.
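To make the ranking idea concrete, here is a minimal Python sketch of the same pattern: map each category to a numeric rank and keep only the best-ranked category per variant-feature pair, which is what the GRANNO -min annotation followed by the equality filter achieves below. The variants and rank values are invented for illustration.
# Toy illustration of the "rank and keep best category" pattern used in ##all_cat1##.
# The rows and the rank map below are made up; only the pattern matches the GOR logic.
rows = [
    {"var": "chr1:1000:A:G", "feature": "NM_1", "category": "2-KP_AA"},
    {"var": "chr1:1000:A:G", "feature": "NM_1", "category": "1-KP_cDNA"},
    {"var": "chr1:2000:C:T", "feature": "NM_2", "category": "5-DV_intron"},
]
rank = {"1-KP_cDNA": 100, "2-KP_AA": 200, "3-DV_KPCodon": 300, "4-KNonP": 400, "5-DV_intron": 500}
best = {}
for r in rows:
    key = (r["var"], r["feature"])
    # like GRANNO -min followed by "where cat1_rank = min_cat1_rank"
    if key not in best or rank[r["category"]] < rank[best[key]]:
        best[key] = r["category"]
print(best)  # {('chr1:1000:A:G', 'NM_1'): '1-KP_cDNA', ('chr1:2000:C:T', 'NM_2'): '5-DV_intron'}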
# Classify variants based on match with known classified variants, Path and non-path. Cat1 = 1-KP_cDNA and Cat1d = 4-KNonP
mydefs = qh.add("""
create ##cat1_1d## = #gor_or_pgor# [##pn_variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| varjoin -ic <(gor [##clinical_variants##] | select 1-Alt,MaxClinImpact | where MaxClinImpact = 'Pathogenic')
| rename overlapcount Path
| varjoin -ic <(gor [##clinical_variants##] | select 1-Alt,MaxClinImpact | where NOT(MaxClinImpact = 'Pathogenic'))
| rename overlapcount NonPath
| calc category if(Path > 0,'1-KP_cDNA',if(NonPath > 0,'4-KNonP','none'))
| where category != 'none'
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category
;
""")
# Classify variants based on imperfect match with Path variants.
# Label them based on the same amino acid change (Cat1b = 2-KP_AA) or the same protein position only (Cat1c = 3-DV_KPCodon)
mydefs = qh.add("""
create ##cat1bc## = #gor_or_pgor# [##pn_variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| join -snpsnp -xl feature -xr feature -f 2 [##known_var_vep_multi## ]
| where not(pos = posx and reference = referencex and call = callx) and not(amino_acids in ('','.'))
| calc category if(amino_acids = amino_acidsx and protein_position+'' = protein_positionx,'2-KP_AA',if(protein_position+'' = protein_positionx,'3-DV_KPCodon',''))
| where category != ''
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;
""")
# Find intronic variants close to known intronic variants. Cat1e = 5-DV_intron. An exact variant match supersedes this with Cat1 and Cat1d.
mydefs = qh.add("""
create ##cat1e## = #gor_or_pgor# [##pn_variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| join -snpsnp -f 2 <(gor [##clinical_variants##] | select 1-Alt,MaxClinImpact | rename #3 Reference | rename #4 Call)
| where not(pos = posx and reference = referencex and call = callx)
| where consequence = 'intron_variant'
| calc category '5-DV_intron'
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;
""")
# Rank and merge categories per variant-feature
mydefs = qh.add("""
create ##all_cat1## = #gor_or_pgor# [##cat1_1d##]
| merge [##cat1bc##]
| merge [##cat1e##]
| rownum
| calc cat1_rank mod(rownum,100)+if(category='1-KP_cDNA',100,if(category='2-KP_AA',200,if(category='3-DV_KPCodon',300,if(category='4-KNonP',400,500))))
| granno 1 -gc 3,4,gene_symbol,feature -min -ic cat1_rank
| where cat1_rank = min_cat1_rank
| select 1-4,gene_symbol,feature,category;
""")
These statements above depend on the PN input, while ##known_var_vep_multi## only needs to be updated when the knowledge base grows. It does, however, run very quickly, in less than a minute for several million variants.
Next we create a few derivatives of ##known_var_vep_multi## that are used to define moderately rare HIGH-impact and missense variants. These are then used, together with alternative start information, to calculate trigger50bpRule in ##neighbor_info##.
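As a quick aside, the 50bp rule flag that eventually comes out of this (violates50bpRule in ##patho_trans_info## and var_in_50bp_region in ##neighbor_info##) boils down to a simple arithmetic test on the CDS position. A minimal Python sketch with made-up numbers:
# Sketch of the 50bp-rule test used further below; all numbers are hypothetical.
def in_50bp_region(transcript_size, last_exon_size, cdspos):
    # variants less than 50bp upstream of the last exon-exon junction (or within the last exon)
    # are generally expected to escape nonsense-mediated decay
    return transcript_size - last_exon_size - cdspos < 50

print(in_50bp_region(transcript_size=3000, last_exon_size=900, cdspos=2080))  # True
print(in_50bp_region(transcript_size=3000, last_exon_size=900, cdspos=1500))  # False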
# Logic used for various ACMG evidence scoring
# See comment about ##newvep_clinical## above
mydefs = qh.add_many("""
create ##patho_high## = pgor [##clinical_variants##]
| group 1 -gc #3,#4 -set -sc clinimpact
| where not (contains(set_clinimpact,'unknown'))
| varjoin -i <(gor [##known_var_vep_multi##] | map -c consequence -h ##vep_impact## | where vep_impact = 'HIGH')
| varjoin -n <(gor [##frequency_file##] | where ref_af >= 0.05);
create ##patho_missense## = pgor [##clinical_variants##]
| group 1 -gc #3,#4 -set -sc clinimpact
| where not (contains(set_clinimpact,'unknown'))
| varjoin -i <(gor [##known_var_vep_multi##] | where consequence = 'missense_variant')
| varjoin -n <(gor [##frequency_file##] | where ref_af >= 0.05);
""")
# Find alternate start sites from dbNSFP
mydefs = qh.add("""
create ##altstart_vars## = pgor [##dbnsfp_file##]
| where aaref = 'M'
| select 1-4,transcript_stable_id
| distinct;
""")
# Annotate each pathogenic variant with cds position, exon number and flag if it violates the 50bp rule and NMD
mydefs = qh.add("""
create ##patho_trans_info## = gor [##patho_high##]
| join -snpseg <(gor [##ex##] | select 1-gene_symbol,strand,transcript_stable_id,sum_*,exonNum,numExons,transcript_size,last_exon_size,transcript_biotype)
| calc cdspos if(strand = '+',pos-chromstart+sum_bs,max(chromend-pos-len(ref)+sum_as+2,0))
| select 1-alt,transcript_stable_id,exonnum,numExons,cdspos,transcript_size,last_exon_size,transcript_biotype
| calc violates50bpRule if(transcript_size-last_exon_size-cdspos < 50,1,0)
| calc NMDtranscript if(contains(transcript_biotype,'nonsense_mediated_decay'),1,0);
""")
# Find the min and max positions of pathogenic variants that overlap a given transcript.
# Similarly, find min/max positions of alternative starts that do not have a pathogenic variant upstream
mydefs = qh.add("""
create ##transcript_info## = gor ##gene_transcripts##
| select 1-gene_symbol,strand,transcript_stable_id
| join -segsnp -l -xl transcript_stable_id -xr transcript_stable_id -e '-1' [##patho_trans_info##]
| rename pos path_pos
| group 1 -gc gene_end-transcript_stable_id -min -max -ic path_pos,violates50bpRule,NMDtranscript
| calc is50bpRuleViolated if(max_violates50bpRule = 1,1,0)
| calc isNMDtranscript if(max_NMDtranscript = 1,1,0)
| join -segsnp -l -e '-1' -xl transcript_stable_id -xr transcript_stable_id [##altstart_vars##]
| rename pos m_pos
| replace m_pos if(m_pos > 0 and max_path_pos < 0 or strand = '+' and m_pos < min_path_pos and m_pos > 0 or strand = '-' and m_pos > max_path_pos and max_path_pos > 0,m_pos,-1)
| group 1 -gc gene_end-max_path_pos,is50bpRuleViolated,isNMDtranscript -min -max -ic m_pos;
""")
We also like to classify genes for PVS1 and PP2 evidence, and protein domains for PM1, based on information about HIGH and MODERATE variants. Here we use Gnomad as our genomic source of variants and use GOR-VEP to find all HIGH/MODERATE variants in it, in a similar but slightly more efficient manner than in ##known_var_vep_multi## above, i.e. we only look at exons and filter before we save the output via inset. This processes the entire Gnomad source in a matter of one or two minutes.
Notice that currently in ##PM1_info## we aggregate ##PM1_info_tmp## over the Interpro domain type, rather than per spatial domain region. This is easy to change if per-region aggregation is considered the correct interpretation.
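For reference, the common/rare ratio used to filter domains in ##PM1_info## can be sketched in a few lines of Python; the counts here are invented, while in the query they come from pivoting the Gnomad-derived HIGH/MODERATE variants by frequency type:
# Sketch of the ratio_common_rare calculation; counts are hypothetical.
def ratio_common_rare(common_count, rare_count):
    # mirrors: if(rare_allCount != 0, common/rare, if(common = 0, 0.0, 1.0))
    if rare_count != 0:
        return common_count / rare_count
    return 0.0 if common_count == 0 else 1.0

print(ratio_common_rare(3, 120))  # 0.025 -> domain appears depleted of common functional variants
print(ratio_common_rare(50, 40))  # 1.25  -> domain tolerates common functional variation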
# Label PVS1 genes for tolerance and known high-impact pathogenic variants. Could use GOR-join instead of MAP !
mydefs = qh.add("""
create ##pvs1_genes## = nor ##all_genes##
| map -c gene_symbol -h -m '-1' <(nor <(gor ##exac_gene_constraint##
| select 1,2,gene_symbol,lof_pli | calc kdg 0
| merge <(gor ##gene_transcripts## | join -segsnp [##patho_high##] -i | select 1,2,gene_symbol | calc kdg 1 | calc lof_pli 0)
)
| group -gc gene_symbol -max -fc lof_pli,kdg
| rename max_lof_pli lof_pli | rename max_kdg kdg
| select gene_symbol-kdg | distinct)
| calc pvs1gene if(kdg=1 or lof_pli > 0.9,1,0)
| select gene_symbol,pvs1gene,lof_pli,kdg;
""")
# Estimate density of high-moderate variants - used for PM1 and PP2 ACMG evidence
mydefs = qh.add("""
create ##vep_highmod## = pgor ##gnomad_freq##
| join -varseg [##ex##]
| rename chromstart sta | rename chromend sto
| rename #3 Ref | rename #4 Alt
| calc exorder if(strand = '+',sta,-sto)
| granno 1 -gc transcript_stable_id -min -ic exorder -count
| rename allCount exonOverlaps
| where exorder = min_exorder
| hide exorder,min_exorder
| calc cdspos if(strand = '+',pos-sta+sum_bs,sto-pos-len(Ref)+sum_as+2)
| calc fo_mo mod(pos+sum_bs-sta-1,3)
| calc fo_dnabefore if(pos>sta and pos+len(Ref)-1<=sto,lower(right(rSeq+if(pos>sta+1,refbases(chrom,max(sta+1,pos-4),pos-1),''),fo_mo)),'')
| calc fo_dnaafter if( (2-fo_mo-mod(len(Ref),3)) >= 0,lower(refbases(chrom,pos+len(Ref),pos+len(Ref)+2-fo_mo-mod(len(Ref),3))),'')
| calc re_mo mod(sto-(pos+len(Ref)-1)+sum_as,3)
| calc re_dnabefore if(pos>sta and pos+len(Ref)-1<=sto,lower(revcompl(left(if(pos+len(Ref)-1<sto,refbases(chrom,pos+len(Ref),min(sto,pos+len(Ref)+4)),'')+fSeq,re_mo))),'')
| calc re_dnaafter if( (2-re_mo-mod(len(Ref),3)) >= 0,lower(revcompl(refbases(chrom,pos-1-(2-re_mo-mod(len(Ref),3)),pos-1))),'')
| calc Codons if(strand = '+',fo_dnabefore+Ref+fo_dnaafter+'/'+fo_dnabefore+Alt+fo_dnaafter,re_dnabefore+revcompl(Ref)+re_dnaafter+'/'+re_dnabefore+revcompl(Alt)+re_dnaafter)
| hide distance-sto,sum_bs-fSeq,fo_mo-re_dnaafter
| calc segType = 'Exon'
| calc aminobefore if(contains(codons,'/'),codons2aminos(substr(codons,0,posof(codons,'/'))),'.')
| calc aminoafter if(contains(codons,'/'),codons2aminos(substr(codons,posof(codons,'/')+1,1000)),'.')
| rename #3 Reference | rename #4 Call
| calc Consequence if(segType = 'Intron','intron_variant',
if(segType = 'Upstream','upstream_gene_variant',
if(segType = 'Downstream','downstream_gene_variant',
if(segType = 'Exon' and aminoafter = 'STO','stop_gained',
if(segType = 'Exon' and mod(len(reference)-len(call),3)!=0,'frameshift_variant',
if(segType = 'Exon' and aminobefore = 'STO' and aminoafter != 'STO','stop_lost',
if(segType = 'Exon' and aminobefore = 'STO' and aminoafter = 'STO','stop_retained_variant',
if(segType = 'Exon' and aminobefore = 'ATG' and aminoafter != 'ATG','start_lost',
if(segType = 'Exon' and len(reference)<len(call),'inframe_insertion',
if(segType = 'Exon' and len(reference)>len(call),'inframe_deletion',
if(segType = 'Exon' and aminobefore != aminoafter,'missense_variant',
if(segType = 'Exon' and aminobefore = aminoafter,'synonymous_variant','?'
))))))))))))
| inset -c consequence <(nor ##vep_impact## | where vep_impact in ('HIGH','MODERATE'))
| select 1-4,consequence | distinct;
""")
# Compare number of high-impact vs missense variants in interpro regions
mydefs = qh.add("""
create ##PM1_info_tmp## = pgor [##interpro_regions##]
| join -segsnp -ic [##patho_high##] | rename overlapcount path_high_allCount
| join -segsnp -ic [##patho_missense##] | rename overlapcount path_missense_allCount
| join -segsnp -l -e 'x' <(gor [##vep_highmod##] | select 1-4 | distinct
| varjoin -r -l -e 'rare' <(gor [##frequency_file##] | calc type if(ref_af >= 0.05,'common',if(ref_af<0.01,'rare','other')) ))
| group 1 -gc bpstop,interpro_domain,path_high_allCount,path_missense_allCount,type -count
| pivot -v rare,common type -gc bpstop,interpro_domain,path_high_allCount,path_missense_allCount -e 0
| where path_missense_allcount > 0 or path_high_allCount > 0 or rare_allcount > 0;
""")
# Here we aggregate interpro regions across the genome - not sure if that is the right interpretation
mydefs = qh.add("""
create ##PM1_info## = pgor [##interpro_regions##]
| map -c interpro_domain <(nor [##PM1_info_tmp##] | group -gc interpro_domain -count -sum -ic 5- ) -h
| rename sum_(.*) #{1}
| calc ratio_common_rare if(rare_allCount!=0, float(common_allCount/rare_allCount),if(common_allCount=0,0.0,1.0));
""")
# Similar to PM1, but now for genes for PP2 and BP1
mydefs = qh.add("""
create ##PP2_BP1_info## = pgor ##all_genes##
| join -segsnp -ic [##patho_high##] | rename overlapcount path_high_allCount
| join -segsnp -ic [##patho_missense##] | rename overlapcount path_missense_allCount
| join -segsnp -l -e 'x' <(gor [##vep_highmod##] | where consequence = 'missense_variant' | select 1-4 | distinct
| varjoin -r -l -e 'rare' <(gor [##frequency_file##] | calc type if(ref_af >= 0.05,'common',if(ref_af<0.01,'rare','other')) )
)
| group 1 -gc gene_end,gene_symbol,path_high_allCount,path_missense_allCount,type -count
| pivot -v rare,common type -gc gene_end,gene_symbol,path_high_allCount,path_missense_allCount -e 0
| where path_missense_allcount > 0 or path_high_allCount > 0 or rare_allcount > 0;
""")
LHZ analysis¶
Here we perform some simple LHZ analysis. We use the full set of variants in ##wgs_vars## for each sample. The logic in ##homoseg## uses a 2kb sliding window (step size of 10 bases) to find regions with no more than one heterozygote. Then a merge with heterozygote regions, followed by segment projection and filtering on segCount, is used to break up potential stretches of homozygous regions. In ##LHZ##, we refine the region boundaries with the overlapping variants and filter them based on variant count and size.
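The sliding-window idea is perhaps easiest to see in plain Python. The sketch below uses invented heterozygote positions for a single sample and simply counts heterozygotes per 2kb window in 10bp steps, which is what GROUP 2000 -steps 10 does in the query:
# Toy sliding-window heterozygote count; positions are made up.
het_positions = [1200, 5300, 5800, 9100]  # heterozygous variant positions for one sample

def windows_with_at_most_one_het(het_positions, start, end, size=2000, step=10):
    # yield window start positions whose window contains fewer than two heterozygotes
    for w in range(start, end - size + 1, step):
        hets = sum(1 for p in het_positions if w <= p < w + size)
        if hets < 2:
            yield w

candidate_starts = list(windows_with_at_most_one_het(het_positions, 0, 12000))
print(len(candidate_starts))  # windows containing both 5300 and 5800 are the only ones excluded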
# Select all 2kb regions in step of 10bp that have only one het.
# Create a continuous span of the mostly homozygous regions per sample.
# Generate 1bp regions for all heterozygous vars. Project them with the homozygous regions.
# Where segCount is 2, a heterozygous variant breaks up the end of an LHZ candidate.
# Then refine the LHZ segments based on the positions of the first and last overlapping homozygous variants
mydefs = qh.add("""
create ##LHZ## = #gor_or_pgor# ##wgs_vars## -f ##all_pns##
| select 1-4,PN,callcopies
| calc het if(callcopies=1,1,0)
| group 2000 -steps 10 -gc pn -sum -ic het
| where sum_het < 2
| segspan -gc pn
| merge <(gor ##wgs_vars## -f ##all_pns## | where callcopies = 1 | select 1,2,pn | calc startpos pos-1 | select chrom,startpos,pos,pn | rename #2 bpStart | rename #3 bpStop )
| segproj -gc pn
| rename #4 pn
| where segCount = 1 and #3-#2 > 1
| join -segsnp -xl PN -xr PN <(gor ##wgs_vars## -f ##all_pns## | where callcopies = 2 | select 1-callcopies,pn )
| group 1 -gc bpStop,pn -min -max -ic pos -count
| rename allCount LHZ_vars
| calc ns = min_pos-1
| calc ne = max_pos
| select #1,ns,ne,pn,LHZ_vars
| sort 100000
| rename #2 bpStart
| rename #3 bpStop
| calc LHZ_size #3-#2
| where LHZ_size>##min_lhz_size## and LHZ_vars>##min_lhz_varcount##;
""")
# Create a transcript map with fractional size of LHZ
mydefs = qh.add("""create ##LHZ_transcripts## = gor [##LHZ##]
| join -segseg -f 1000 -maxseg 4000000 <(gor ##gene_transcripts## | select 1,gene_start,gene_end,transcript_stable_id | distinct)
| where transcript_stable_id != '0'
| calc o min(#3,gene_end)-max(#2,gene_start)
| calc t gene_end-gene_start
| calc lhz_transfrac form(if(t>0,float(o)/t,0.0),3,2)
| select 1,gene_start,gene_end,PN,transcript_stable_id,lhz_transfrac
| sort chrom
| group chrom -gc PN,transcript_stable_id -max -fc lhz_transfrac
| rename max_lhz_transfrac lhz_transfrac;
""")
Variant loci information¶
The following query statements are used to provide information on potentially multiple biallelic variants that overlap in the same genome loci.
- In ##varspan## we annotate InDels with the span of their ambiguous notation due to their potential repeat nature (see the Python sketch after this list).
- In ##adjustedvars##, we go back from the typical biallelic notation used in GOR into the loci-based notation used in VCF. This allows for more strict inheritance analysis.
- In ##reference_depth## we calculate the approximate (due to the rounding in ##segment_cov##) sequence read coverage at each locus in the trio.
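The positional ambiguity captured in ##varspan## comes from the fact that an InDel inside a repeat can be written at several equivalent positions. A small, purely illustrative Python sketch (toy sequence, not the VARNORM algorithm itself):
# Toy example of InDel positional ambiguity; the query uses VARNORM -left/-right instead.
REF = "TTAAAAAAC"  # toy reference segment, 1-based positions 1..9

def deletion_placements(seq, base):
    # every position where deleting one `base` yields the same resulting sequence
    target, hits = None, []
    for i, b in enumerate(seq):
        if b != base:
            continue
        reduced = seq[:i] + seq[i + 1:]
        if target is None:
            target = reduced
        if reduced == target:
            hits.append(i + 1)  # 1-based position
    return hits

placements = deletion_placements(REF, "A")
print(placements)                         # [3, 4, 5, 6, 7, 8]
print(max(placements) - min(placements))  # 5, the varspan-style ambiguity of this deletion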
# Calculate span of InDels based on left- and right-normalization
mydefs = qh.add("""
create ##varspan## = #gor_or_pgor# [##pn_variants_vepped##]
| select 1-4
| group 1 -gc 3,4
| where not(len(#3)=1 and len(#4)=1)
| varnorm -left #3 #4
| distinct
| calc leftpos #2
| varnorm -right #3 #4
| calc rightpos #2
| calc varspan rightpos-leftpos
| group 1 -gc 3,4 -max -ic varspan
| rename max_varspan varspan
| select 1-4,varspan;
""")
# Calculate loci-based VCF-like variants from bi-allelic variants in GOR files, in order to strengthen the IHE check and for GT-Info
mydefs = qh.add("""
create ##adjustedvars## = #gor_or_pgor# <(gor [##pn_variants_vepped##] | select 1-4,callcopies,CallRatio,Depth,PN | distinct
| calc newpos if( not(left(#3,1)!=left(#4,1) and len(#3)>0 and len(#4)>0) ,#2+1,#2) | calc newref if( not(left(#3,1)!=left(#4,1) and len(#3)>0 and len(#4)>0) ,substr(#3,1,10000),#3) | calc newalt if( not(left(#3,1)!=left(#4,1) and len(#3)>0 and len(#4)>0) ,if(substr(#4,1,10000)="","Del",substr(#4,1,10000)),#4)
| select #1,newpos,newref,newalt,pn,callcopies,CallRatio,depth | rename #2 Pos | rename #3 Reference | rename #4 Call )
| sort 200
| replace callratio form(callratio,3,2)
| group 1 -gc reference,pn -lis -sc call,callcopies,CallRatio,depth
| rename lis_(.*) #{1}
| rename #3 Ref
| rename Call Alt
| calc newpos if(ref = "",pos-1,pos)
| calc newref if(ref="",upper(refbase(chrom,newpos)),ref)
| calc newalt if(ref="",listmap(alt,"newref+x"),alt)
| select #1,newpos,newref,newalt,pn,callcopies,CallRatio,depth
| rename newpos pos
| rename #3 Ref
| rename #4 Alt
| sort 2
| group 1 -gc ref,pn -lis -sc alt,callcopies,CallRatio,depth
| rename lis_(.*) #{1};
""")
# Get variant coverage for all sites (including hom-ref) from segment coverage.
mydefs = qh.add("""
create ##reference_depth## = #gor_or_pgor# [##adjustedvars##] | select 1-3 | distinct | calc PN '##all_pns_no_quotes##' | split PN
| join -snpseg -l -xl pn -xr pn -e 0 -r -maxseg 10000 <(gor ##segment_cov## -f ##all_pns## | rename Depth ApprDepth)
| select 1-3,PN,ApprDepth
| rename #3 ref
| pedpivot -gc ref pn -e 0 [##trio_pedigree##];
""")
Joint variant comparison¶
The following query statements are used to generate information about the index-case trio, i.e. the index sample together with the samples from the father and mother. We are working with the "adjusted" variants and therefore we apply PEDPIVOT based on "-gc ref" only, and join ##reference_depth## similarly (it is also calculated with similar PEDPIVOT logic). The inheritance analysis logic that follows is relatively complicated because we are dealing with a list of alleles.
In ##denovo_reads##, we use the highly sensitive variant calls stored in ##cand_vars## and the coverage information specified in ##good_cov## to collect information used in deNovo classification, i.e. the number of reads with the alternative allele and the coverage in the parents. This information is then used to calculate diag_denovo in ##vcf_wgs_anno##.
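The core of the inheritance bookkeeping is partitioning the proband's allele list by presence in the parents, which the query does with LISTFILTER/LISTHASANY. A minimal Python sketch with invented allele lists (matching the tri-allelic example shown further below):
# Sketch of the allele partitioning behind GT_Paternity and GT_IHE; allele lists are illustrative.
def partition_alleles(proband, father, mother):
    only_father = [a for a in proband if a in father and a not in mother]
    only_mother = [a for a in proband if a in mother and a not in father]
    in_both     = [a for a in proband if a in father and a in mother]
    in_neither  = [a for a in proband if a not in father and a not in mother]
    return only_father, only_mother, in_both, in_neither

# tri-allelic site: proband carries C and T, father carries T, mother carries C
print(partition_alleles(proband=["C", "T"], father=["T"], mother=["C"]))
# (['T'], ['C'], [], []) -> T looks paternal, C maternal, and nothing is a candidate IHE/de novo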
# Calculate IHE, inheritance and GT-info from adjusted vars and reference depth
mydefs = qh.add("""
create ##gt_mode_of_inheritance## = #gor_or_pgor# [##adjustedvars##]
| select 1,2,ref,alt,pn,callcopies,CallRatio,depth
| rename alt call
| colsplit callRatio 2 tmpCR -s ','
| colsplit depth 2 tmpD -s ','
| calc ratio_breakdown str(round(float(tmpCR_1)*float(tmpD_1)))+'/'+str(round(tmpD_1))
| replace ratio_breakdown if(tmpCR_2!='',ratio_breakdown+','+str(round(float(tmpCR_2)*float(tmpD_2)))+'/'+str(round(tmpD_2)),ratio_breakdown)
| hide tmp*
| pedpivot -gc ref pn -e '0' [##trio_pedigree##]
| where p_call != '0'
| join -snpsnp -xl ref,P_PN -xr ref,P_PN -l -e 0 [##reference_depth##]
| calc onlyinfather listfilter(p_call,'listhasany(f_call,x) and not(listhasany(m_call,x))')
| calc onlyinmother listfilter(p_call,'listhasany(m_call,x) and not(listhasany(f_call,x))')
| calc inboth listfilter(p_call,'listhasany(m_call,x) and listhasany(f_call,x)')
| calc inneither listfilter(p_call,'not(listhasany(m_call,x) or listhasany(f_call,x))')
| calc fromfather if(onlyinfather='',str(inboth),str(onlyinfather))
| calc frommother if(onlyinmother='',str(inboth),str(onlyinmother))
| calc P_GT if(p_call='0',if( P_ApprDepth>= ##min_read_depth## ,'HomRef;0;'+str(P_ApprDepth),'Unknown;0;'+str(P_ApprDepth)),listzip(listzip(listzip(p_call,listmap(p_callcopies,"if(int(x)=2,'Hom',if(int(x)=1,'Het','u'))")),p_callratio),p_depth))
| calc F_GT if(f_call='0',if( F_ApprDepth>= ##min_read_depth## ,'HomRef;0;'+str(F_ApprDepth),'Unknown;0;'+str(F_ApprDepth)),listzip(listzip(listzip(f_call,listmap(f_callcopies,"if(int(x)=2,'Hom',if(int(x)=1,'Het','u'))")),f_callratio),f_depth))
| calc M_GT if(m_call='0',if( M_ApprDepth>= ##min_read_depth## ,'HomRef;0;'+str(M_ApprDepth),'Unknown;0;'+str(M_ApprDepth)),listzip(listzip(listzip(m_call,listmap(m_callcopies,"if(int(x)=2,'Hom',if(int(x)=1,'Het','u'))")),m_callratio),m_depth))
| split P_call,p_callcopies
| calc GT_IHE if( listhasany(inneither,P_call) and ( int(p_callcopies)=1 and max(listnummax(f_depth),listnummax(f_apprdepth))>= ##min_read_depth## and max(listnummax(m_depth),listnummax(m_apprdepth))>= ##min_read_depth##
or int(p_callcopies)=2 and (max(listnummax(f_depth),listnummax(f_apprdepth))>= ##min_read_depth## or max(listnummax(m_depth),listnummax(m_apprdepth))>= ##min_read_depth##))
or int(p_callcopies) = 1 and listhasany(inboth,p_call) and f_callcopies in ('2','2.0') and m_callcopies in ('2','2.0')
or int(p_callcopies) = 2 and (listhasany(onlyinfather,p_call) and max(listnummax(m_depth),listnummax(m_apprdepth))>= ##min_read_depth## or listhasany(onlyinmother,p_call) and max(listnummax(f_depth),listnummax(f_apprdepth))>= ##min_read_depth##)
,1,0)
| rename ref reference
| rename P_callCopies callcopies
| rename P_call call
| varjoin -r -l -e 0 [##varspan##]
| rename varspan GT_PosAmbiguity
| calc GT_Paternity if(listhasany(fromfather,call) and not(listhasany(frommother,call)),'Paternal',if(listhasany(frommother,call) and not(listhasany(fromfather,call)),'Maternal',if(callcopies=2,'Both','Unknown')))
| calc GT_Info Reference+'>'+P_GT+' Father:_'+F_GT+' Mother:_'+M_GT
| rename P_(.*) #{1}
| rename f_(.*) father_#{1}
| rename m_(.*) mother_#{1}
| hide father_PN*,mother_PN*
| replace Call if(Call='Del','',Call)
| select 1,2,Reference,Call,PN,GT_PosAmbiguity,GT_IHE,GT_Paternity,GT_info,ratio_breakdown,father_*,mother_*
| distinct;
""")
# Use cand-vars and coverage to estimate de-novo inheritance
mydefs = qh.add("""
create ##denovo_reads## = #gor_or_pgor# [##pn_variants_vepped##]
| select 1-4
| multimap -cartesian -h <(nor [##trio_pedigree##]
| calc x PN+','+FN+','+MN
| select x
| rename x PN
| split PN
| distinct)
| varjoin -xl PN -xr PN -l -r -e 0 <(gor ##cand_vars## -fs -f ##all_pns##
| where varCount >= ##denovo_readcount##)
| hide PNx
| calc readsFound if(varCount=0,0,1)
| rename varCount readsWithVar
| join -snpseg -xl PN -xr PN -maxseg 10000 -ic <(gor ##good_cov## -f ##all_pns##)
| rename overlapcount goodCov
| pedpivot PN [##trio_pedigree##] -gc reference,call -e 0 -v
| rename Person_PN PN
| rename Person_readsFound P_readsFound
| inset -c PN <(nor [##trio_pedigree##]
| select PN)
| select 1-4,PN,P_readsFound,father_*,mother_*;
""")
Make note of the command split P_call,p_callcopies in ##gt_mode_of_inheritance##. It converts the loci-multi-allelic notation in ##adjustedvars## back to the regular biallelic notation, where we can apply VARJOINS on alleles. Similarly notice that pedpivot for ##trio_pedigree## uses only -gc ref (instead of the regular ref,alt) and the subsequent join with ##reference_depth## uses -xl ref,P_PN etc.
To understand this better, we can inspect ##adjustedvars## and ##gt_mode_of_inheritance## for a site that has a tri-allelic SNP. See the two alleles in the Alt column of the former relation and then see them split up in the call column of the latter relation. Also see how GT_Info (and indeed GT_IHE) takes both alleles into consideration.
%%gor
$mydefs
create temp = nor <(gor [##adjustedvars##] | where listsize(alt)>1 and len(ref)=1 and len(alt)=3 and listsize(listdist(callratio))>1 | top 1
| join -snpsnp -xl ref,pn -xr reference,pn [##gt_mode_of_inheritance##]
) | rownum | columnreorder rownum | unpivot 2- ;
nor [temp] | where rownum = 1 | hide rownum | map -c col_name <(nor [temp] | where rownum = 2 | hide rownum)
Query ran in 0.78 sec Query fetched 30 rows in 0.01 sec (total time 0.78 sec)
 | Col_Name | Col_Value | Col_Valuex |
---|---|---|---|
0 | CHROM | chr13 | chr13 |
1 | pos | 110859069 | 110859069 |
2 | Ref | G | G |
3 | PN | DEMO_2796799 | DEMO_2796799 |
4 | Alt | C,T | C,T |
5 | CallCopies | 1,1 | 1,1 |
6 | CallRatio | 0.39,0.61 | 0.39,0.61 |
7 | Depth | 23,23 | 23,23 |
8 | reference | G | G |
9 | call | C | T |
10 | PNx | DEMO_2796799 | DEMO_2796799 |
11 | GT_PosAmbiguity | 0 | 0 |
12 | GT_IHE | 0 | 0 |
13 | GT_Paternity | Maternal | Paternal |
14 | GT_Info | G>C;Het;0.39;23,T;Het;0.61;23 Father:_T;Het;0.... | G>C;Het;0.39;23,T;Het;0.61;23 Father:_T;Het;0.... |
15 | ratio_breakdown | 9/23,14/23 | 9/23,14/23 |
16 | father_call | T | T |
17 | father_CallCopies | 1 | 1 |
18 | father_CallRatio | 0.38 | 0.38 |
19 | father_Depth | 42 | 42 |
20 | father_ratio_breakdown | 16/42 | 16/42 |
21 | father_ApprDepth | 40 | 40 |
22 | father_GT | T;Het;0.38;42 | T;Het;0.38;42 |
23 | mother_call | C | C |
24 | mother_CallCopies | 2 | 2 |
25 | mother_CallRatio | 1.00 | 1.00 |
26 | mother_Depth | 74 | 74 |
27 | mother_ratio_breakdown | 74/74 | 74/74 |
28 | mother_ApprDepth | 75 | 75 |
29 | mother_GT | C;Hom;1.00;74 | C;Hom;1.00;74 |
Next we generate two relations that are used to properly calculate the inheritance rules in ##vcf_inheritance## of step2.
- In ##varcoverage## we join "sufficient coverage" with every distinct variant in the study for all four case/control groups. Notice the use of SEGPROJ to aggregate the coverage in each group.
- In ##ccvar## we aggregate the genotype counts for each group and pivot them (as PNtype), then join with the coverage counts to define how many samples have their variants covered (see the sketch after this list).
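The VarCovered columns in ##ccvar## are simply the per-group coverage count, floored by the number of carriers (a carrier is covered by definition). A one-function Python sketch with made-up counts:
# Sketch of the VarCovered derivation; counts are hypothetical.
def var_covered(subj_with_var, seg_count):
    # seg_count comes from SEGPROJ over ##good_cov##; carriers always count as covered
    return max(subj_with_var, seg_count)

print(var_covered(subj_with_var=2, seg_count=1))  # 2: both carriers are counted as covered
print(var_covered(subj_with_var=0, seg_count=5))  # 5: coverage only, no carriers in the group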
# Calculate variant coverage for all relation-gender combinations
mydefs = qh.add("""
create ##varcoverage## = #gor_or_pgor# [##pn_variants_vepped##]
| select 1-4
| distinct
| calc male_cases_segCount 0
| calc female_cases_segCount 0
| join -snpseg -r -l -e 0 -maxseg 10000 <( gor ##good_cov## -f ##male_controls## | segproj )
| rename segCount male_controls_segCount
| join -snpseg -r -l -e 0 -maxseg 10000 <( gor ##good_cov## -f ##female_controls## | segproj )
| rename segCount female_controls_segCount
""")
# Combine variant counts and coverage into single relation
mydefs = qh.add("""
create ##ccvar## = #gor_or_pgor# [##pn_variants_vepped##]
| calc PNtype ##pntype##
| split PNtype
| where PNtype != 'index'
| calc hom IF(callCopies = 2,1,0)
| group 1 -gc PNtype,reference,call,gene_symbol,feature -ic hom -sum -count
| rename allCount subjWithVar
| rename sum_hom subjWithHomVar
| pivot PNtype -gc reference,call,gene_symbol,feature -v ##pivot_type## -e 0
| varjoin -r -l -e 0 [##varcoverage##]
| distinct
| calc male_cases_VarCovered max(male_cases_subjWithVar,male_cases_segCount)
| calc female_cases_VarCovered max(female_cases_subjWithVar,female_cases_segCount)
| calc male_controls_VarCovered max(male_controls_subjWithVar,male_controls_segCount)
| calc female_controls_VarCovered max(female_controls_subjWithVar,female_controls_segCount)
| hide male_cases_segCount,female_cases_segCount,male_controls_segCount,female_controls_segCount
""")
Next we derive variant neighborhood information about HIGH/MODERATE variants, both for such variants in the sample itself and for known pathogenic variants. We self-join the variants within each transcript and record how many are upstream, downstream, or in the same exon (but not the same variant). This information is then also used to decide whether we trigger the 50bp rule and/or the NMD rule.
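The neighbor bookkeeping reduces to comparing CDS positions and exon numbers of two variants on the same transcript. A simplified Python sketch (it ignores the pos/ref/call identity checks the query applies, and all values are invented):
# Simplified sketch of the upstream/downstream/same-exon classification in ##neighbor_info##.
def classify_neighbor(var, other):
    # var/other: dicts with a CDS position and an exon number on the same transcript
    upstream   = 1 if other["cdspos"] < var["cdspos"] else 0
    downstream = 1 if other["cdspos"] > var["cdspos"] else 0
    same_exon  = 1 if other["exon"] == var["exon"] and other["cdspos"] != var["cdspos"] else 0
    return upstream, downstream, same_exon

v = {"cdspos": 350, "exon": 4}
print(classify_neighbor(v, {"cdspos": 120, "exon": 2}))  # (1, 0, 0): upstream only
print(classify_neighbor(v, {"cdspos": 380, "exon": 4}))  # (0, 1, 1): downstream and in the same exon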
# Annotate HIGH and MODERATE variants with transcript information
mydefs = qh.add("""
create ##sample_trans_info## = gor [##pn_variants_vepped##] | where vep_impact in ('HIGH','MODERATE') | select 1,2,Reference,Call,PN,feature,VEP_Impact | distinct
| join -snpseg -xl feature -xr transcript_stable_id <(gor [##ex##] | select 1-gene_symbol,strand,transcript_stable_id,sum_*,exonNum,numExons,transcript_size,last_exon_size)
| calc cdspos if(strand = '+',pos-chromstart+sum_bs,max(chromend-pos-len(reference)+sum_as+2,0))
| select 1-call,PN,VEP_Impact,transcript_stable_id,exonnum,numExons,cdspos,transcript_size,last_exon_size;
""")
# Self-join the sample's variants to find how many are upstream, downstream and in the same exon
# Similarly, find how many pathogenic variants overlap in this way.
# Then label the variants
mydefs = qh.add("""
create ##neighbor_info## = #gor_or_pgor# [##sample_trans_info##] | join -rprefix r -snpsnp -f 4000000 -xl pn,TRANSCRIPT_STABLE_ID -xr pn,TRANSCRIPT_STABLE_ID [##sample_trans_info##]
| calc upstream if(r_cdspos < cdspos,1,0)
| calc downstream if(r_cdspos > cdspos,1,0)
| calc sameexon if(r_exonnum = exonnum and pos!=r_pos and reference != r_reference and call != r_call,1,0)
| group 1 -gc 3-distance[-1] -sum -ic upstream-sameexon
| rename sum_(.*) HM_Sampvars_#{1}
| join -rprefix r -snpsnp -f 4000000 -l -e '-1' -xl TRANSCRIPT_STABLE_ID -xr TRANSCRIPT_STABLE_ID [##patho_trans_info##]
| calc upstream if(r_cdspos > 0 and r_cdspos < cdspos,1,0)
| calc downstream if(r_cdspos > 0 and r_cdspos > cdspos,1,0)
| calc sameexon if(r_exonnum > 0 and r_exonnum = exonnum and pos!=r_pos and reference != r_ref and call != r_alt,1,0)
| group 1 -gc 3-distance[-1] -sum -ic upstream-sameexon
| rename sum_(.*) HM_Pathvars_#{1}
| join -snpseg -l -xl transcript_stable_id -xr transcript_stable_id -r [##transcript_info##]
| calc alt_start if(strand = '+' and max_m_pos > pos,max_m_pos,if(strand = '-' and min_m_pos > 0 and min_m_pos < pos,min_m_pos,0))
| calc var_in_50bp_region if(transcript_size-last_exon_size-cdspos < 50,1,0)
/* if var is downstream of 50bp threshold and patho_var is not present (is50bpRuleViolated=0) in that region, we trigger the 50bp rule */
| calc trigger50bpRule if(var_in_50bp_region = 1 and is50bpRuleViolated = 0,1,0)
/* if var is upstream of 50bp threshold and the given transcript is subjected to NMD process, we trigger the NMD rule */
| calc triggerNMDRule if(var_in_50bp_region = 0 and isNMDtranscript = 1,1,0)
| select 1-4,PN,gene_symbol,transcript_stable_id,trigger50bpRule,alt_start,isNMDtranscript,triggerNMDRule,HM_*;
""")
%%gor
$mydefs
nor [##neighbor_info##] | where trigger50bpRule = 0 | top 1 | unpivot #1-
Query ran in 0.56 sec Query fetched 17 rows in 0.06 sec (total time 0.62 sec)
 | Col_Name | Col_Value |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 865694 |
2 | Reference | C |
3 | Call | T |
4 | PN | DEMO_2796799 |
5 | gene_symbol | SAMD11 |
6 | transcript_stable_id | NM_001385640.1 |
7 | trigger50bpRule | 0 |
8 | alt_start | 0 |
9 | isNMDtranscript | 0 |
10 | triggerNMDRule | 0 |
11 | HM_Sampvars_upstream | 0 |
12 | HM_Sampvars_downstream | 1 |
13 | HM_Sampvars_sameexon | 0 |
14 | HM_Pathvars_upstream | 0 |
15 | HM_Pathvars_downstream | 1 |
16 | HM_Pathvars_sameexon | 0 |
Final annotations¶
We are approaching the end. There are two statements, ##vcf_wgs_anno## and ##acmg_classification##, that used to be kept separate because of an option to run the ACMG logic on its own, but they could easily be combined into a single statement.
The only non-trivial part here is in the latter step, where we join the ##PM1_info## Interpro segments with the variants. There are several hundred thousand overlapping domain segments. We filter these segments based on the ratio of common to rare HIGH/MODERATE variants, but select only one domain as the representative one, based on the column tomax defined in the nested query, which is then used in the ATMAX command. This keeps the output at one row per variant per transcript per PN.
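The ATMAX step is just a per-group argmax. A small Python sketch of the same idea, with invented domain overlaps (tomax stands in for path_high_allcount + path_missense_allcount):
# Sketch of "keep one representative Interpro domain per variant" done with ATMAX; rows are illustrative.
overlaps = [
    {"variant": "chr2:5000:G:A", "domain": "IPR000001", "tomax": 3},
    {"variant": "chr2:5000:G:A", "domain": "IPR000719", "tomax": 11},
    {"variant": "chr3:9000:T:C", "domain": "IPR001245", "tomax": 1},
]
best = {}
for row in overlaps:
    key = row["variant"]
    # like ATMAX: per group, keep the row with the highest tomax value
    if key not in best or row["tomax"] > best[key]["tomax"]:
        best[key] = row
for key, row in sorted(best.items()):
    print(key, row["domain"])  # exactly one domain per variant remains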
# Annotate variants with inheritance and coverage
mydefs = qh.add("""
create ##vcf_wgs_anno## = #gor_or_pgor# [##pn_variants_vepped##]
| varjoin -r -l -xl PN -xr PN [##gt_mode_of_inheritance##]
| hide PNx
| join -varseg -xl gene_symbol,PN -xr gene_symbol,PN -l -e 'NaN' -rprefix GENECOV <(gor ##genecov_hist## -f ##all_pns##)
| calc Gene_cov if(not(GENECOV_lt5='NaN'),'L:'+form(GENECOV_lt5,4,2)+'M:'+form((GENECOV_lt10-GENECOV_lt5),4,2)+'H:'+form((1-GENECOV_lt10),4,2),'NaN')
| calc Gene_avg_depth if(GENECOV_avg_depth!='NaN',GENECOV_avg_depth,'')
| calc covered if(not(GENECOV_gene_start = 'NaN') and (GENECOV_lt5 < ##lt5_max## and GENECOV_lt10 < ##lt10_max##),1,0)
| hide GENECOV_*,distance
| varjoin -l -e 0 -r -xl PN -xr PN [##denovo_reads##]
| hide PNx
| calc diag_denovo_strict if(P_readsFound > 0 and (father_readsFound = 0 or float(father_readsWithVar)/int(father_ApprDepth) <= ##max_parents_read_ratio## ) and father_goodCov = 1 and (mother_readsFound = 0 or float(mother_readsWithVar)/int(mother_ApprDepth) <= ##max_parents_read_ratio## ) and mother_goodCov = 1,1,0)
| calc diag_denovo if(father_call='NaN' and father_goodCov = 1 and mother_call='NaN' and mother_goodCov = 1,1,0)
| hide P_readsFound,mother_readsFound,father_readsFound
| varjoin -l -e 0 -r -xl gene_symbol,feature -xr gene_symbol,feature [##ccvar##]
| hide gene_symbolx,featurex
| join -l -r -varseg -xl PN -xr PN [##LHZ##]
| hide PNx
| join -l -r -varseg -xl PN,feature -xr PN,transcript_stable_id [##LHZ_transcripts##]
| hide PNx,transcript_stable_id
| replace lhz_transfrac if(lhz_vars = '', '', if(isfloat(#rc) and float(#rc)=0, '0', #rc) )
| varjoin -l -r -rprefix KNOWN [##clinvar_hgmd_XA_annotations##]
| rename KNOWN_entrez gene_id
| varjoin -xl gene_symbol,feature -xr gene_symbol,feature -r -l -e '7-novel' [##all_cat1##]
| hide gene_symbolx,featurex
| replace category if(ref_af>##cat4_minAf## and qc_known_vars=0,'8-common',if(category='7-novel' and vep_impact='HIGH','6-EP',category))
| calc DIAG_ACMGcat decode(category, ##category_map##)
| replace DIAG_ACMGcat if(DIAG_ACMGcat = 'Cat3',if(max_score >= ##cat3_max_score_threshold##,'Cat3A','Cat3B'),DIAG_ACMGcat)
| join -varseg -l -f ##fuzzdist## -r -xl gene_symbol -xr gene [##gnomad_gene_constraints_annotations##]
| hide gene
| join -varseg -l -t -f ##fuzzdist## -r -rprefix KNOWN -xl GENE_symbol -xr GENE_SYMBOL <(gor [##pardiseases##]
| select 1-3,gene_symbol,gene_diseases,gene_par_diseases | replace gene_diseases if(listsize(#rc)>1,'"'+replace(#rc,',','", "')+'"',#rc) )
| hide KNOWN_GENE_SYMBOL
| map -c PN [##pngender##] -m '' -n gender
| map -c gene_symbol -h -m '' [##CGDmap##]
| map -c gene_symbol -m '' [##omim_inheritance_matched##] -n omim_gene_mim,omim_inheritance,omim_disease_name,omim_phenotype_mim,omim_moi
| replace omim_inheritance,omim_phenotype_mim,omim_disease_name,omim_moi if(left(#rc,1)=',',listtail(#rc),#rc)
| calc var_type if(len(reference)=len(call),'sub',if(len(call)<len(reference) and substr(reference,0,len(call)) = call,'del',if(len(call)>len(reference) and substr(call,0,len(reference)) = reference,'ins','indel') ))
| prefix max_consequence,max_score VEP
| join -varseg -ic -f ##repeat_overlap_fuzz_factor## [##repeat_regions##]
| rename OverlapCount qc_repeat_regions
| join -varseg -ic -f ##gene_overlap_fuzz_factor## ##coding_exons##
| calc qc_in_coding_exon if(overlapCount>0,1,0)
| hide overlapCount
| distinct
| calc zygosity if(callcopies=2,'hom','het');
""")
# Calculate ACMG based annotations
mydefs = qh.add("""
create ##acmg_classification## = #gor_or_pgor# [##vcf_wgs_anno##]
| select 1,2,Reference,Call,PN,gene_symbol,feature,ref_af,VEP_max_score,VEP_Impact,Vep_max_consequence,CallCopies,GT_PosAmbiguity,GT_IHE,GT_Paternity,category,DIAG_ACMGcat,diag_denovo,gender,omim_inheritance,var_type,qc_repeat_regions,fidel_bp4_pp3,PP3_score,BP4_score,BP7_score,MPC,HM_*
| distinct
| map -c gene_symbol -h -m 'missing' [##pvs1_genes##]
| varjoin -l -r -xl gene_symbol,feature,PN -xr gene_symbol,transcript_stable_id,PN -e 0 [##neighbor_info##]
| hide gene_symbolx,PNx,transcript_stable_id
| replace pvs1gene if(alt_start>0 or trigger50bpRule>0 or triggerNMDRule>0,'0',pvs1gene)
| calcifmissing spliceai_max_impact ''
| calc Evidence_PVS1 if(pvs1gene='1' and (VEP_Impact='HIGH' or spliceai_max_impact = 'HIGH') ,'PVS1','')
| calc Evidence_PS2_PM6 if(diag_denovo='1','PS2',if(GT_IHE='1','PM6',''))
| calc Evidence_PS3_BS3 ''
| varjoin -r -l -e 0 [##homozygous_vars##]
| map -c gene_symbol -h [##gene_panel_info##]
| calc af_cutoff if(contains(omim_inheritance,'AD') or #1 = 'chrX' and gender = 'male', Gene_maximum_genotype_freq_for_dominant, Gene_maximum_genotype_freq_for_recessive)
| calc Evidence_BA1_BS1_BS2_PS4_PM2 if(ref_af > 0.05,'BA1',
if( ref_af > af_cutoff,'BS1',
if( homCount > 4 /* conservative because using GnomAD instead of 1000G */,'BS2','')))
| replace Evidence_BA1_BS1_BS2_PS4_PM2 if(Evidence_BA1_BS1_BS2_PS4_PM2='' and (ref_af<=0.0 or ( ref_af < Gene_maximum_genotype_freq_for_recessive and contains(omim_inheritance,'AR') and not (contains(omim_inheritance,'AD')))) or (ref_af < Gene_maximum_genotype_freq_for_dominant and contains(omim_inheritance,'AD')),'PM2',Evidence_BA1_BS1_BS2_PS4_PM2)
/* todo : define PS4 using HPO carrier logic info to derive odds ratio once database is large enough */
| join -snpseg -rprefix PM1 -l -e 0 -r <(gor [##PM1_info##] | where path_high_allcount + path_missense_allcount > 0 and ratio_common_rare < ##max_common_rare_ratio## | calc tomax path_high_allcount + path_missense_allcount)
| atmax 1 -gc 3,4,gene_symbol,feature,PN pm1_tomax | hide pm1_tomax /* pick the most informative Interpro domain */
| calc Evidence_PM1 if(VEP_max_consequence in ('missense_variant') and pm1_path_high_allcount + pm1_path_missense_allcount > 0 and float(MPC,0)>2,'PM1','')
| calc Evidence_PM4_BP3 if(Vep_max_consequence in ('inframe_insertion','inframe_deletion'),if(qc_repeat_regions=0,'PM4','BP3'),'')
| replace Evidence_PM4_BP3 if(vep_max_consequence = 'stop_lost','PM4',Evidence_PM4_BP3)
| join -snpseg -xl gene_symbol -xr gene_symbol -rprefix PP2_BP1 -l -e 0 -r [##PP2_BP1_info##]
| hide PP2_BP1_gene_symbol
| calc Evidence_PP2_BP1 if(VEP_max_consequence='missense_variant',if(PP2_BP1_path_missense_allcount/float(PP2_BP1_path_high_allcount+PP2_BP1_path_missense_allcount+0.1)>0.8
and PP2_BP1_common_allcount/float(PP2_BP1_rare_allcount+PP2_BP1_common_allcount+0.1)<0.1,'PP2',
if(PP2_BP1_path_high_allcount/float(PP2_BP1_path_high_allcount+PP2_BP1_path_missense_allcount+0.1)>0.8,'BP1','')),'')
| calc Evidence_PP3_BP4 if(PP3_score=1,'PP3',if(BP4_score=1,'BP4',''))
| calc Evidence_BP7 if(BP7_score=1 and (VEP_max_consequence = 'synonymous_variant' or contains(VEP_max_consequence,'splice')),'BP7','')
| calc Evidence_all = listfilter(Evidence_PVS1+','+Evidence_PS2_PM6+','+Evidence_PS3_BS3+','+Evidence_BA1_BS1_BS2_PS4_PM2+','+Evidence_PM1+','+Evidence_PM4_BP3+','+Evidence_PP2_BP1+','+Evidence_PP3_BP4+','+Evidence_BP7,'len(x)>0')
| rename Evidence_all auto_evidence
| select 1-4,PN,gene_symbol,feature,auto_evidence,isNMDtranscript,trigger50bpRule,MPC,HM_*;
""")
# Combine the final two annotation steps
mydefs = qh.add("""
create ##step1_final## = gor [##vcf_wgs_anno##] | hide MPC
| varjoin -l -r -xl PN,gene_symbol,feature -xr PN,gene_symbol,feature [##acmg_classification##]
| hide PNx,gene_symbolx,featurex;
""")
%%gor
$mydefs
nor <(gor [##step1_final##] | group 1 -gc #3,#4,PN ) | group -gc PN -count | rename allcount varCount
| merge <(nor [##step1_final##] | group -gc PN -count | rename allcount rowCount)
Query ran in 0.63 sec Query fetched 6 rows in 19.91 sec (total time 20.54 sec)
 | PN | varCount | rowCount |
---|---|---|---|
0 | DEMO_2796799 | 65347.0 | NaN |
1 | DEMO_2796810 | 69416.0 | NaN |
2 | DEMO_2827751 | 65426.0 | NaN |
3 | DEMO_2796799 | NaN | 250398.0 |
4 | DEMO_2796810 | NaN | 266250.0 |
5 | DEMO_2827751 | NaN | 253335.0 |