SNV variant feature generation for ranking¶

RankerAnno

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "internal_hg19"
%env GOR_API_PROJECT={project}

Variant annotations¶

This notebooks describes how we can annotate variants, SNPs and small InDels, for the purpose of classifying them or for decorating them for downstream filtering or ranking by ML approaches. For training a ML variant ranker, it is important to be able to process variants from multiple samples efficiently in order to be able to try out improved attributes for ranking. Likewise, in order to use such a ranker in clinical practice, it is important to be able to run the annotation process effectively in incremental fashion.

Here we show annotation process that is similar to the annotation steps used in Gregor step1 that can be applied to all the 600k+ samples in XA within two hour ors on the GORdb cluster. By slight modifications to the existing step1, we can provide all the necessary annotations for our ranker. As is evident from the fact that we process all the XA samples very quickly, in future versions of the "incremental" annotation step, we will greatly improve the efficiency of the process by batching multiple samples together. This is because:

Many of the steps only involve the variant and its context in the genome and not other phenotypic characteristics of the samples. The fact that there is a great overlap between the variants of the samples (the "square-root law") means greater efficiency as the sample number grows.
Many of the annotations sources involve join with very large tables, e.g. the VEP table, gnomad and inhouse AF tables, dbNSFP etc. The per variant the join becomes more efficient as we process more sample variants together.

Much of the "business logic" shown here has been describe in Gregor1.ipynb. Big part of the code there is transformation of reference data to make it better suited for the sample based variant queries there. However, here we have "materialized" most of these non-PN based create statements. This is both because it relies less on the GORdb cache and also because many of these sources will be revised in relation to GeneMB, Oscar, etc. Below, we will list these tables for clarity.

We can try to summarize the steps in involved in the analysis process like this:

Calculate the inheritance of variants, e.g. deNovo, Het, Hom, CHZ, inheritance error etc.
Annotate all possible variants with variant effect prediction, known variants, AF and hom count, SpliceAI, scores, domains, gene scores etc.
Find variants adjacent to know variants, e.g. with same amino acid change or codon change.
Analyse positional context with relation to alternative starts, 50bp and NMD rule.
Label candidate genes based on HPO relationships.
Find overlap with HPO specific single variant GWAS data and gene-based association of impactful CHZ and hom variants.
Varint ACMG evidence generation, scoring and classification.
Prepare the output for the ranker by one-hot-encoding and by joining with the gene Multi-score data.

Without further ado we will dive into the code and explain the following steps as we come across them and as we see fit.

In [2]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

A query to calculate deNovo, IHE, and CHZ¶

To calculate the inheritance of variants, we need genotypes from trios. For accessing them we have two options: use the row-level variants in XA/vars/wesvars.gord or to use the genotype "freeze" structure (see Incremental_Freeze.ipynb). If we ignore sequence read coverage, it may actually be faster to read the variants from the row-level tables and use PEDPIVOT to format the data for trio comparison. This is so, even though each row has many string characters (compared to one char per gt in freeze), because most of the variants are very very rare. However, if we want to take coverage and lack of data into consideration (at least on variant level) then we are better of processing the data in "binary like" character level from the freeze. Fortunately, there are versatile and sharp query commands in GOR to manipulate the data effectively, without the need for creation of custom commands.

We start by creating out pedigree:

In [3]:

mydefs = qh.add_many("""
def #samples# = XA/raw/samples.tsv;
def #freeze# = freezes/xa1_v8/main;

create #ped# = nor #samples#
| select id,case_id,relation
| pivot -gc case_id relation -v proband,father,mother -e ''
| where father_id != '' and mother_id != ''and proband_id != ''
| inset -c proband_id #freeze#/buckets.tsv
| inset -c father_id #freeze#/buckets.tsv
| inset -c mother_id #freeze#/buckets.tsv
| select #2-
;""")

In [4]:

%%gor
$mydefs
nor [#ped#] | granno -count | top 3

Query ran in 1.09 sec
Query fetched 3 rows in 0.74 sec (total time 1.83 sec)

Out[4]:

	proband_id	father_id	mother_id	allCount
0	434770	434772	434771	154911
1	64876	65011	65012	154911
2	592267	592268	592269	154911

We see that we have 155k trios. Our total list of PNs can be defined like this:

In [5]:

mydefs = qh.add_many("""create #pns# = nor [#ped#] | calc PN #1+','+#2+','+#3 | select PN | split PN;""")

In [6]:

%%gor
$mydefs
nor [#pns#] | distinct | group -count

Query ran in 0.47 sec
Query fetched 1 rows in 0.95 sec (total time 1.42 sec)

Out[6]:

	allCount
0	464733

Note: While here 154911 * 3 = 464733, we have specificly NOT made the #pns# relation distinct, because if we use data from offsprings of the same parents more than once, we want those parents to be listed more than once as we will see below when we formulate the genotypes for trios.

Next we will define which variants we consider for the annotation process. Keep in mind that for labels such as compound heterozygosity (CHZ) we are only interested in exonic variants that are not too common and of significant impact. Here is our filtering definition:

In [7]:

mydefs = qh.add_many("""
def #vep_single# = #freeze#/metadata/vepref_single_wgs.gorz;
def #af# = #freeze#/metadata/AF.gorz;
create #to_use# = pgor #af# | select 1-4,af 
| varjoin -ic <(gor user_data/hakon/refdata/XA_annotations.gorz | where classification in ('PATH','LPATH'))
| rename overlapcount GDX_Path
| varjoin -r -norm <(gor #vep_single# | select 1-4,gene_symbol,max_consequence,max_impact )
| where GDX_path > 0 or (af <= 0.01 and max_impact in ('HIGH','MODERATE'))
| select 1-4,gene_symbol,max_consequence,max_impact,GDX_PATH
| replace GDX_PATH if(GDX_path > 0,1,0);
""")

As can be seen above, we will use variants labeled by GDX as PATH or LPATH as well as non-common HIGH and MODERATE impact variants in genes. To see the fraction of XA variants, we run the following:

In [8]:

%%gor
$mydefs
create #a# = pgor #af# | group chrom -count;
create #b# = pgor [#to_use#] | group 1 -gc #3,#4 | group chrom -count;
nor [#a#] | group -sum -ic allcount | merge <(nor [#b#] | group -sum -ic allcount )

Query ran in 7.81 sec.
Query fetched 2 rows in 0.02 sec (total time 7.83 sec)

Out[8]:

	sum_allCount
0	13243061
1	22618399

In other words, we will be annotating the inheritance of about half of the variants in XA.

Next we will explore how we can use CSVSEL to generate genotype triplets, somewhat analogous to how PEDPIVOT uses pedigree to pivot columnar row data. First we pick some two example variants that are share by two of the first 9 samples from the first 3 trios. We use CSVSEL with the -tag option:

In [9]:

%%gor
$mydefs

gor XA/vars/wesvars.gord -f '434770','64876' | where callcopies in ('1','2')| group 1 -gc 3,4 -dis -sc pn | where dis_pn > 1 | top 2 
| varjoin -ir -norm #freeze#/variants.gord 
| csvsel -gc ref,alt -vs 1 -u 3 #freeze#/buckets.tsv <(nor [#pns#] | top 9) -tag PN

Query ran in 20.87 sec
Query fetched 18 rows in 10.33 sec (total time 31.20 sec)

Out[9]:

	chrom	pos	REF	alt	PN	value
0	chr1	1666175	C	T	434770	1
1	chr1	1666175	C	T	434772	1
2	chr1	1666175	C	T	434771	1
3	chr1	1666175	C	T	64876	1
4	chr1	1666175	C	T	65011	1
5	chr1	1666175	C	T	65012	1
6	chr1	1666175	C	T	592267	0
7	chr1	1666175	C	T	592268	0
8	chr1	1666175	C	T	592269	0
9	chr1	1684374	T	TCTC	434770	1
10	chr1	1684374	T	TCTC	434772	1
11	chr1	1684374	T	TCTC	434771	1
12	chr1	1684374	T	TCTC	64876	1
13	chr1	1684374	T	TCTC	65011	1
14	chr1	1684374	T	TCTC	65012	2
15	chr1	1684374	T	TCTC	592267	0
16	chr1	1684374	T	TCTC	592268	0
17	chr1	1684374	T	TCTC	592269	0

Now we take the output from CSVSEL (without -tag option, i.e. a single value column per variant) and form a bucket on the fly using the pedigree and use CSVSEL a second time to form a 3-character genotype per index-case:

In [10]:

%%gor
$mydefs

gor XA/vars/wesvars.gord -f '434770','64876' | where callcopies in ('1','2')| group 1 -gc 3,4 -dis -sc pn | where dis_pn > 1 | top 2 
| varjoin -ir -norm #freeze#/variants.gord 
| csvsel -gc ref,alt -vs 1 -u 3 #freeze#/buckets.tsv [#pns#]
| calc bucket 'b1' 
| csvsel -gc ref,alt -vs 3 -u '333' <(nor [#ped#] | select #1 | calc bucket 'b1') <(nor [#ped#] | select #1 | top 3) -tag PN
| replace value '_'+value

Query ran in 5.43 sec.
Query fetched 6 rows in 5.09 sec (total time 10.52 sec)

Out[10]:

	chrom	pos	REF	alt	PN	value
0	chr1	1666175	C	T	434770	_111
1	chr1	1666175	C	T	64876	_111
2	chr1	1666175	C	T	592267	_000
3	chr1	1684374	T	TCTC	434770	_111
4	chr1	1684374	T	TCTC	64876	_112
5	chr1	1684374	T	TCTC	592267	_000

Note that the intermediate values column that is feed into the second CSVSEL command has 464733 characters, because we are using the full list of PNs in #pns#. By using top 3, we pick the value for the first 3 PNs, each of size 3 as specified with the -vs option. Before we setup the full genome wide parallel option, we want to specify with the -hide option which genotypes we are interested in. This reduces the number of rows that the second CSVSEL command needs to materialize, saving about half in compute time.

In [11]:

mydefs = qh.add_many("""
create #av# = norrows 4;
create #comb# = nor [#av#] | multimap -cartesian [#av#] | multimap -cartesian [#av#] | rename #1 P | rename #2 F | rename #3 M;

create #valcomb# = nor [#comb#] 
| calc ihe if(p=1 and (f=0 and m=0 or f=2 and m=2) or p=2 and (f=0 or m=0),1,0)
| calc denovo if(p=1 and f=0 and m=0,1,0)
| calc inher if(denovo=1,'denovo',if(p=1 and m=0,'father',if(p=1 and f=0,'mother',if(p=2,'both','unknown'))))
| calc non_hide if(inher in ('father','mother','both') or ihe = 1,1,0)
| calc value str(p)+str(f)+str(m)
| select value,inher,ihe,non_hide;
""")

In [12]:

%%gor
$mydefs
nor [#valcomb#] | group -gc non_hide -lis -sc value

Query ran in 0.35 sec
Query fetched 2 rows in 0.00 sec (total time 0.36 sec)

Out[12]:

	non_hide	lis_value
0	0	000,001,002,003,010,011,012,013,020,021,022,02...
1	1	100,101,102,103,110,120,122,130,200,201,202,20...

All the genotype combination where non-hide is zero (false) are non-informative for CHZ, deNovo and inheritance error. Therefore, we don't want them in our output.

In [13]:

%%gor
$mydefs

gor XA/vars/wesvars.gord -f '434770','64876' | where callcopies in ('1','2')| group 1 -gc 3,4 -dis -sc pn | where dis_pn > 1 | top 2 
| varjoin -ir -norm #freeze#/variants.gord 
| csvsel -gc ref,alt -vs 1 -u 3 #freeze#/buckets.tsv [#pns#]
| calc bucket 'b1' 
| csvsel -gc ref,alt -vs 3 -u '333' <(nor [#ped#] | select #1 | calc bucket 'b1') <(nor [#ped#] | select #1 | top 3) -tag PN
  -hide '000,001,002,003,010,011,012,013,020,021,022,023,030,031,032,033,111,112,113,121,123,131,132,133,300,301,302,303,310,311,312,313,320,321,322,323,330,331,332,333'
| replace value '_'+value

Query ran in 4.72 sec.
Query fetched 0 rows in 5.78 sec (total time 10.49 sec)

Out[13]:

	chrom	pos	REF	alt	PN	value

We see that none of the genotype triplets from above make it through because they don't provide information about phase, deNovo status or inheritance error. Finally, we are ready to craft the parallel query that includes only the variant from the freeze that we are interested in. Also, because we split up the genome into jobs, we make to ensure that we don't break up genes and thereby miss out on CHZ combinations. Therefore we use genes to join into the variants in the freeze.

In [14]:

mydefs = qh.add_many("""
def #all_genes# = ref/refgenes/refgenes.gorz;

create #usesegs# = pgor ref/refgenes/refgenes.gorz | segproj
| join -segsnp -f 10 -ic [#to_use#]
| group 100000 -sum -ic overlapcount;

create #parsegs# = gor [#usesegs#] | seghist 30000;

create #inher# = pgor -split <(gor [#parsegs#]) #all_genes# | segproj | join -segsnp -f 20 -ir #freeze#/variants.gord 
| varjoin -r -norm <(gor [#to_use#] | select 1-4,gene_symbol,max_impact)
| csvsel -gc ref,alt,gene_symbol,max_impact -vs 1 -u 3 #freeze#/buckets.tsv [#pns#] 
| calc bucket 'b1' 
| csvsel -gc ref,alt,gene_symbol,max_impact -vs 3 -u '333' <(nor [#ped#] | select #1 | calc bucket 'b1') <(nor [#ped#] | select #1) -tag PN
  -hide '000,001,002,003,010,011,012,013,020,021,022,023,030,031,032,033,111,112,113,121,123,131,132,133,300,301,302,303,310,311,312,313,320,321,322,323,330,331,332,333' 
| map -c value [#valcomb#]
| calc pat if(inher='father' or inher='both',1,0)
| calc mat if(inher='mother' or inher='both',1,0)
| granno gene -range -gc pn,gene_symbol,max_impact -sum -ic pat,mat
| calc hethomchz if(left(value,1) = '1' and max_impact in ('HIGH','MODERATE')
                    and (mat = 1 and sum_pat > 0 or pat = 1 and sum_mat > 0) and gene_symbol != '' ,'chz',if(left(value,1) = '2','hom','het'))
| hide pat-sum_mat,non_hide;
""")

We see that we are annotating the freeze with gene_symbol and impact. We also use MAP to get inheritance, denovo and IHE status for the different value combinations. Next, we calculate pat and mat, Boolean like values, and aggregate them per gene per PN with GRANNO, using the -range option for memory efficiency, i.e. flushing variant rows once they are futher than maximum gene length away from the most downstream row! With the SEGHIST command, as used above, we get about 500 splits, each taking about 10 minutes or so to run - typically taking about 15-20 minutes to run in the GORdb worker system. Finally, we are ready to write out our results which we will use at later stage.

In [15]:

%%gor
$mydefs
gor [#inher#] | top 1 /* | write user_data/hakon/Ranker/xa1_v8_var_inher.gorz */

Query ran in 2.44 sec
Query fetched 1 rows in 0.19 sec (total time 2.63 sec)

Out[15]:

	chrom	pos	REF	alt	Gene_Symbol	Max_Impact	PN	value	inher	ihe	hethomchz
0	chr1	69116	G	A	OR4F5	MODERATE	700099	201	both	1	hom

and here in the file we wrote out:

In [16]:

%%gor
gor user_data/hakon/Ranker/xa1_v8_var_inher.gorz | top 10

Query ran in 0.07 sec
Query fetched 10 rows in 0.00 sec (total time 0.07 sec)

Out[16]:

	chrom	pos	REF	alt	Gene_Symbol	Max_Impact	PN	value	inher	ihe	hethomchz
0	chr1	69116	G	A	OR4F5	MODERATE	700099	201	both	1	hom
1	chr1	69121	T	G	OR4F5	MODERATE	414695	100	denovo	1	het
2	chr1	69121	T	G	OR4F5	MODERATE	414698	120	father	0	het
3	chr1	69121	T	G	OR4F5	MODERATE	372859	100	denovo	1	het
4	chr1	69134	A	G	OR4F5	MODERATE	260990	101	mother	0	het
5	chr1	69134	A	G	OR4F5	MODERATE	187762	101	mother	0	het
6	chr1	69134	A	G	OR4F5	MODERATE	619425	110	father	0	het
7	chr1	69134	A	G	OR4F5	MODERATE	153921	110	father	0	het
8	chr1	69134	A	G	OR4F5	MODERATE	388719	120	father	0	het
9	chr1	69134	A	G	OR4F5	MODERATE	531602	120	father	0	het

Phenotype associations¶

We will now move our attention to analysis that associates phenotype information (HPO terms) of the samples with variants and genes. There are four ways in which we do this:

Calculate candidate genes based on established HPO relationships and ranking using the Phrank method (see Phrank_Candidate_Genes.ipynb)
Calculate variant association using HPO phenotypes of samples and Fisher-exact GWAS associations (see Fast_HPO_V_GWAS.ipynb)
Calculate gene variant burden association using Hom and CHZ rare, high-moderate impact variants (see Fast_HPO_G_GWAS.ipynb)
Calculate so-called Multi-score using modern vector-encoding for text mining and association with GDX reported variants.

Candidate genes¶

Phrank computes a score measuring the similarity between a patient’s phenotypes and each gene/disease–phenotype set. Each phenotype’s contribution to the Phrank score is a function of the number of genes known to cause the phenotype. Intuitively, the fewer genes are known to cause a phenotype, the more impressive is a candidate gene’s ability to explain this patient phenotype, and the higher the score derived from such a match. The conditional probability for a single node in the HPO phenotype ontology are log transformed, negated, and then summed via paths to the root node. Nodes annotated with fewer genes will ultimately contribute more to the final score as the match they represent is less expected by chance. (see Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. by Jagadeesh KA, et.al. Genet Med. 2019 Feb;21(2):464-470. doi: 10.1038/s41436-018-0072-y. Epub 2018 Jul 12. PMID: 29997393.)

As mentioned above, the full logic for the Phrank algorithm, as implemented in the GOR query language, can be found i a separate notebook (Phrank_Candidate_Genes.ipynb) where we show how we can rank candidate genes for close to 200k samples in few minutes. Here we will only show the beginning part of it that relates to the established HPO association. We setup a new query holder:

In [17]:

qh2 = GQH.GOR_Query_Helper()
mydefs2 = ""

In [18]:

mydefs2 = qh2.add_many("""
def ##ref## = ref;
def #hposource# = ##ref##/disgenes/hpo.tsv;
""")
mydefs2 = qh2.add("""
create #hpo_gene_symbol# = nor #hposource# | select hpo_code,gene_symbols 
| split gene_symbols -s ';' | rename gene_symbols gene_symbol
| where gene_symbol != ''
| distinct;
""")
mydefs2 = qh2.add_many("""
create #specific_hpo_gene_symbol# = nor [#hpo_gene_symbol#]
| granno -gc hpo_code -count | where allcount <= 50 | group -gc hpo_code,gene_symbol;

create #pheno_lis# = nor XA/pheno/dag_hpo.tsv | where hpo_code != '' | inset -c hpo_code [#specific_hpo_gene_symbol#] 
| group -gc pn -lis -sc hpo_code -len 10000;

create #modparts1# = nor [#pheno_lis#] | select pn | calc modulus mod(iooa(pn),10) | group -gc modulus -lis -sc pn -len 1000000 
| replace lis_pn listmap(lis_pn,'squote(x)');

def ##all_genes## = ref/refgenes/rgenes.gorz;

create #pn_candidate_genes# = parallel -parts [#modparts1#] <(gor <(nor [#pheno_lis#] | where pn in ( #{col:lis_pn} )
| split lis_hpo_code | rename lis_hpo_code hpo_code
| multimap -c hpo_code [#specific_hpo_gene_symbol#]
| group -gc pn,gene_symbol -count
| multimap -c gene_symbol <(nor ##all_genes## | columnsort gene_symbol)
| select chrom,gene_start,gene_end,gene_symbol,PN
)
| sort genome
);""")

Note: Because of the semicolons in the #hpo_gene_symbol# statement and the simplicity of GQH, we have to use qh2.add instead of add_many.

We see that the source for HPO to gene_symbol relationships is stored in the reference data folder disgenes. In the future, this could easily be the table stored in GeneMb. Also, because in the abstraction process within GDX the practice seems to be to record very high-level unspecific HPO terms, in the statement #specific_hpo_gene_symbol# we limit us to HPO terms that only associate with 50 genes or less. To understand this better, the following query show how many samples have "phenotypes" from the top level and the second level in the HPO-DAG.

In [19]:

%%gor
$mydefs2
create #counts# = nor XA/pheno/dag_hpo.tsv
| calc level1 if(hpo_code = 'HP:0000001',1,0)
| calc level2 if(hpo_code in ('HP:0045027','HP:0040064','HP:0033127','HP:0025354','HP:0025285','HP:0025142','HP:0025031'
,'HP:0003745','HP:0002715','HP:0002686','HP:0002664','HP:0002086','HP:0001939','HP:0001871'
,'HP:0001626','HP:0001608','HP:0001574','HP:0001507','HP:0001426','HP:0001197','HP:0000818'
,'HP:0000769','HP:0000707','HP:0000598','HP:0000478','HP:0000152','HP:0000119','HP:0032443','HP:0000118'), 1,0)
| group -gc pn -sum -ic level1,level2
| replace sum_* if(int(#rc)>1,1,0)
| group -gc sum_level1,sum_level2 -count;

nor [#counts#]

Query ran in 0.46 sec
Query fetched 2 rows in 0.00 sec (total time 0.46 sec)

Out[19]:

	sum_level1	sum_level2	allCount
0	0	0	83
1	0	1	235156

For comparison, the number of samples that have the more gene-specific HPO terms:

In [20]:

%%gor
$mydefs2
nor [#pheno_lis#] | group -count

Query ran in 0.34 sec
Query fetched 1 rows in 1.14 sec (total time 1.48 sec)

Out[20]:

	allCount
0	187996

Our direct or naive candidate gene list is like this:

In [21]:

%%gor
$mydefs2 
gor -p chr7 [#pn_candidate_genes#] | top 10

Query ran in 0.39 sec
Query fetched 10 rows in 0.00 sec (total time 0.39 sec)

Out[21]:

	chrom	gene_start	gene_end	gene_symbol	PN
0	chr7	192570	300738	FAM20C	100535
1	chr7	192570	300738	FAM20C	100584
2	chr7	192570	300738	FAM20C	101029
3	chr7	192570	300738	FAM20C	101279
4	chr7	192570	300738	FAM20C	102426
5	chr7	192570	300738	FAM20C	103797
6	chr7	192570	300738	FAM20C	105737
7	chr7	192570	300738	FAM20C	109311
8	chr7	192570	300738	FAM20C	110051
9	chr7	192570	300738	FAM20C	113005

In [22]:

%%gor
$mydefs2 
create #counts# = gor [#pn_candidate_genes#] | group genome -gc pn -count | granno genome -sum -ic allcount -count;
nor [#counts#] | select PN- | rename allcount Genes | rename allcountx Samples | rename sum_allcount TotalRows
| granno -avg -std -min -max -ic genes   | top 10

Query ran in 0.35 sec
Query fetched 10 rows in 0.35 sec (total time 0.69 sec)

Out[22]:

	PN	Genes	Samples	TotalRows	min_Genes	max_Genes	avg_Genes	std_Genes
0	100003	30	187983	17433798	1	909	92.741354	78.059362
1	100004	2	187983	17433798	1	909	92.741354	78.059362
2	100007	278	187983	17433798	1	909	92.741354	78.059362
3	10001	72	187983	17433798	1	909	92.741354	78.059362
4	100010	117	187983	17433798	1	909	92.741354	78.059362
5	100013	79	187983	17433798	1	909	92.741354	78.059362
6	100016	78	187983	17433798	1	909	92.741354	78.059362
7	100018	31	187983	17433798	1	909	92.741354	78.059362
8	100019	140	187983	17433798	1	909	92.741354	78.059362
9	10002	158	187983	17433798	1	909	92.741354	78.059362

We see that with the HPO database does give us very many gene associations. By using Phrank, we can try to score the based on relevance with the complete list of HPO terms per sample. Our results from the Phrank notebook are accessible in a single GORZ file here:

In [23]:

%%gor
gor -p chr7 user_data/hakon/Phrank/gene_pheno_score_25.gorz | top 10

Query ran in 0.07 sec
Query fetched 10 rows in 0.00 sec (total time 0.08 sec)

Out[23]:

	chrom	gene_start	gene_end	gene_symbol	pheno_score	norm_pheno_score	PN	rank_pheno_score	rank_norm_pheno_score
0	chr7	192570	300738	FAM20C	14.018	0.051	40219	19	57
1	chr7	192570	300738	FAM20C	14.028	0.051	191085	23	63
2	chr7	192570	300738	FAM20C	14.031	0.051	183198	8	35
3	chr7	192570	300738	FAM20C	14.031	0.051	475054	8	35
4	chr7	192570	300738	FAM20C	14.049	0.051	184944	9	48
5	chr7	192570	300738	FAM20C	14.049	0.051	364261	5	45
6	chr7	192570	300738	FAM20C	14.049	0.051	397803	4	36
7	chr7	192570	300738	FAM20C	14.060	0.051	85002	15	39
8	chr7	192570	300738	FAM20C	14.063	0.051	49046	3	40
9	chr7	192570	300738	FAM20C	14.063	0.051	76441	20	67

In [24]:

%%gor
create #counts# = gor user_data/hakon/Phrank/gene_pheno_score_25.gorz | group genome -gc pn -count | granno genome -sum -ic allcount -count;
nor [#counts#] | select PN- | rename allcount Genes | rename allcountx Samples | rename sum_allcount TotalRows
| granno -avg -std -min -max -ic genes   | top 10

Query ran in 0.08 sec
Query fetched 10 rows in 0.50 sec (total time 0.59 sec)

Out[24]:

	PN	Genes	Samples	TotalRows	min_Genes	max_Genes	avg_Genes	std_Genes
0	100003	42	187044	7849486	1	197	41.965987	23.283037
1	100004	15	187044	7849486	1	197	41.965987	23.283037
2	100007	56	187044	7849486	1	197	41.965987	23.283037
3	10001	54	187044	7849486	1	197	41.965987	23.283037
4	100010	65	187044	7849486	1	197	41.965987	23.283037
5	100013	95	187044	7849486	1	197	41.965987	23.283037
6	100016	42	187044	7849486	1	197	41.965987	23.283037
7	100018	13	187044	7849486	1	197	41.965987	23.283037
8	100019	53	187044	7849486	1	197	41.965987	23.283037
9	10002	54	187044	7849486	1	197	41.965987	23.283037

In gene_pheno_score_25.gorz we have picked only the genes that are in the top 25 rank based on normalized and regular Phrank score. We that we have much fewer genes and less variation in the number of candidate genes. Later, in our annotation query logic, we will return to this file.

Variant associations¶

In the notebook Fast_HPO_V_GWAS.ipynb, we show how we can calculate very quickly (few hours in GORdb), using Fisher-Exact test, variant association for thousands of phenotype definitions. Here we will only explore phenotype definitions based on a presence or absence of a single HPO term, but this analysis can easily be extended to more complex man-made or machine generated phenotypes. Importantly, this analysis takes into consideration presence and absence of data, e.g. via sequence read coverage as mentioned in our notebook Incremental_Freeze.ipyn.

Here, we will use the analysis dir_hpo_prob_ctrls_fix, but it is based on analysis where the cases are proband that have a give HPO term and the controls are other probands that do not have that term or any other related HPO term. The controls could als easily be picked from unaffected parents - but the definition of this is not the subject we cover here but rather how we fold this into the variant annotation process.

In [25]:

%%gor
nor XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.tsv | granno -count | top 10

Query ran in 0.50 sec
Query fetched 10 rows in 0.19 sec (total time 0.69 sec)

Out[25]:

	Pheno	name	allCount
0	HP:0006336	Short dental roots	6379
1	HP:0008038	Aplastic/hypoplastic lacrimal glands	6379
2	HP:0040198	Non-medullary thyroid carcinoma	6379
3	HP:0002710	Commissural lip pit	6379
4	HP:0000679	Taurodontia	6379
5	HP:0020072	Persistent EBV viremia	6379
6	HP:0025423	Abnormal larynx morphology	6379
7	HP:0100576	Amaurosis fugax	6379
8	HP:0040144	L-2-hydroxyglutaric aciduria	6379
9	HP:0011016	Abnormality of urine glucose concentration	6379

Similary a preview of the GWAS results:

In [26]:

%%gor
gor XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.gord | top 10 | columnreorder 1-alt,pheno

Query ran in 3.02 sec.
Query fetched 10 rows in 0.01 sec (total time 3.03 sec)

Out[26]:

	chrom	pos	REF	alt	Pheno	AF	pVal_mm	OR_mm	OR_SEmm	pVal_dom	...	case_0_GTcount	case_1_GTcount	case_2_GTcount	ctrl_0_GTcount	ctrl_1_GTcount	ctrl_2_GTcount	AN	AC
0	chr1	69121	T	G	HP:0100694	0.000012	0.0100	150.0	410.0	0.0100	...	435	1	0	126796	2	0	254468	3
1	chr1	69121	T	G	HP:0012646	0.000012	0.0068	220.0	620.0	0.0068	...	288	1	0	126943	2	0	254468	3
2	chr1	69134	A	G	HP:0012432	0.000444	0.0045	6.2	3.7	0.1000	...	743	0	2	126400	69	20	254468	113
3	chr1	69134	A	G	HP:0001572	0.000444	0.0099	14.0	13.0	0.1100	...	166	0	1	126977	69	21	254468	113
4	chr1	69134	A	G	HP:0000817	0.000444	0.0028	3.0	1.0	0.0400	...	4423	3	4	122720	66	18	254468	113
5	chr1	69134	A	G	HP:0031590	0.000444	0.0080	130.0	250.0	0.0064	...	8	1	0	127135	68	22	254468	113
6	chr1	69134	A	G	HP:0025030	0.000444	0.0035	320.0	670.0	0.0029	...	3	1	0	127140	68	22	254468	113
7	chr1	69134	A	G	HP:0033252	0.000444	0.0071	150.0	290.0	0.0057	...	7	1	0	127136	68	22	254468	113
8	chr1	69134	A	G	HP:0020110	0.000444	0.0025	12.0	8.4	0.0200	...	299	1	1	126844	68	21	254468	113
9	chr1	69134	A	G	HP:0001812	0.000444	0.0016	36.0	35.0	0.0010	...	63	2	0	127080	67	22	254468	113

10 rows × 23 columns

The table XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.gord is massive with almost every association (p < 0.01) for 23M variants and 6k phenotypes. We want to make a subset of this table, i.e. what we consider strong associations, and then join the table with the sample variants, however, only store output for samples that harbour the corresponding phenotype.

In [27]:

qh3 = GQH.GOR_Query_Helper()
mydefs3 = ""

In [28]:

mydefs3 = qh3.add_many("""
def #freeze# = freezes/xa1_v8/main;
def ##gwastable## = XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.gord;
def #vep_single# = #freeze#/metadata/vepref_single_wgs.gorz;
def #af# = #freeze#/metadata/AF.gorz;

create ##varcount## = pgor #af# | group 100000 -count;
create ##parsegs## = gor [##varcount##] | seghist 100000;

create ##strong_assoc## = pgor -split <(gor [##parsegs##] ) ##gwastable## 
| where or_mm > 5 and pval_mm < 1e-10 and
       (case_1_gtcount+case_2_gtcount)/(case_0_gtcount++1) > 0.2 and case_1_gtcount+case_2_gtcount > 9
| granno 1 -gc ref,alt -count | where allcount < 100 /* no more than 100 HPO phenotypes that associate per marker */
| select 1-4,pheno,pval_mm,OR_mm;
""")

In the definition above, we ask for large odds-ratio of statistical significance and additionally that the AF in the case group is larger than 0.2 and that the AC is 10 or greater - quite stringent conditions.

In [29]:

%%gor
$mydefs3
create #counts# = pgor [##strong_assoc##] | group chrom -gc pheno -count;
nor [#counts#] | group -gc pheno -sum -ic allcount | granno -sum -avg -min -max -ic sum_allcount  | top 10

Query ran in 0.65 sec
Query fetched 10 rows in 0.03 sec (total time 0.68 sec)

Out[29]:

	Pheno	sum_allCount	min_sum_allCount	max_sum_allCount	avg_sum_allCount	sum_sum_allCount
0	HP:0000006	30	1	10812	356.859743	333307
1	HP:0000007	1109	1	10812	356.859743	333307
2	HP:0000013	12	1	10812	356.859743	333307
3	HP:0000019	9	1	10812	356.859743	333307
4	HP:0000032	23	1	10812	356.859743	333307
5	HP:0000033	42	1	10812	356.859743	333307
6	HP:0000035	12	1	10812	356.859743	333307
7	HP:0000055	3609	1	10812	356.859743	333307
8	HP:0000058	11	1	10812	356.859743	333307
9	HP:0000059	12	1	10812	356.859743	333307

We see that our complete list of "strong associations" is only 333k rows - something we can easily join with the sample variants, which we do next:

In [30]:

mydefs3 = qh3.add_many("""
create #pheno_lis# = nor XA/pheno/dag_hpo.tsv | group -gc pn -lis -sc hpo_code -len 10000;
create #pns# = nor [#pheno_lis#] | select PN;
def #variants# = XA/vars/wesvars.gord;

create #pn_assoc# = partgor -dict #variants# -partscale 10 -ff [#pns#] <(gor #variants# -f #{tags}
| where genotype in ('het','hom')
| select 1-4,pn
| varjoin -r [##strong_assoc##]
| multimap -c pn <(nor [#pheno_lis#] | where pn in ( #{tags:q} ) ) 
| where listhasany(lis_hpo_code,pheno)
| group 1 -gc ref,alt,pn -count -lis -sc pheno,or_mm
);
""")

Notable in the above query is that:

we have "packed" the sample HPO phenotypes into a list for the following reasons: a) it makes the data footprint and parsing cost smaller, but more importanly, b) it makes our MULTIMAP join not multiply row for samples with many HPO terms
we use listhasany() to filter out samples that have the variant but not the corresponding phenotyes.
we use PARTGOR with the partscale option to process 10k sampels at the time (600k+ in total) and filter the MULTIMAP relation to reduce memory footprint

In [31]:

%%gor
$mydefs3
gor [#pn_assoc#] | top 10 /* | write user_data/hakon/Ranker/pn_strong_assoc.gorz */

Query ran in 3.64 sec.
Query fetched 10 rows in 0.03 sec (total time 3.67 sec)

Out[31]:

	chrom	pos	ref	alt	pn	allCount	lis_Pheno	lis_OR_mm
0	chr1	69270	A	G	13640	1	HP:0011355	7.9
1	chr1	69270	A	G	55279	1	HP:0011355	7.9
2	chr1	69270	A	G	56085	1	HP:0011355	7.9
3	chr1	69270	A	G	62752	1	HP:0011355	7.9
4	chr1	69270	A	G	63539	1	HP:0011355	7.9
5	chr1	69270	A	G	67468	1	HP:0011355	7.9
6	chr1	69270	A	G	67755	1	HP:0001743	9.8
7	chr1	69270	A	G	87911	1	HP:0011355	7.9
8	chr1	69270	A	G	101233	1	HP:0011355	7.9
9	chr1	69270	A	G	101255	1	HP:0000142	13.0

We see that the total running time is about 15min - pretty good give we are reading from a variant dataset of 600k samples and 23M different variants and phenotypes from about 200k samples. Now let us quickly look at the size of this table:

In [32]:

%%gor
$mydefs3
create #count# = pgor [#pn_assoc#] | group chrom -count;
nor [#count#] | group -sum -ic allcount

Query ran in 3.89 sec.
Query fetched 1 rows in 0.01 sec (total time 3.90 sec)

Out[32]:

	sum_allCount
0	26738903

We see that the table has 27M rows. Small enough for us to treat is as an GORZ annotation table in out annotation queries, i.e. we don't have to partition it by samples and for now we simply store it as user_data/hakon/Ranker/pn_strong_assoc.gorz.

Gene based associations¶

Similarly as with the variant associations, we can annotate sample variants that overlap genes that show association with phenotypes harboured by the sample. Here, we are though primarily interested in the variants that were used to establish the gene based association in the first place, i.e. rare high and moderate impact variants.

In [33]:

qh4 = GQH.GOR_Query_Helper()
mydefs4 = ""

In [34]:

mydefs4 = qh4.add_many("""
def ##ac## = XA/analysis/hpogwas;
def ##gene_source_file## = ref/ensgenes/genes.gorz;
def ##AF## = freezes/xa1_v1/main/metadata/AF.gorz;
def ##vep## = freezes/xa1_v1/main/metadata/vep_v109/vep_single.gorz;
def ##Gene_GWAS## = XA/analysis/hpogwas/gene_phewas_dir_hpo.gorz;
def #variants# = XA/vars/wesvars.gord;

create ##gene_strong_assoc## = gor ##Gene_GWAS## 
| where or_rec > 5 and p_rec < 1e-10 and case_rec > 9 and  case_rec/(case_rec+case_no_rec) > 0.2
| select 1,2,gene_end,gene_symbol,HPO_code,P_rec,OR_rec
| rename hpo_code Pheno
;

create #pheno_lis# = nor XA/pheno/dag_hpo.tsv | group -gc pn -lis -sc hpo_code -len 10000;
create #pns# = nor [#pheno_lis#] | select PN ;

create #pn_gene_assoc# = partgor -dict #variants# -partscale 10 -ff [#pns#] <(gor #variants# -f #{tags}
| where genotype in ('het','hom')
| varjoin -i <(gor ##vep## | where max_impact in ('HIGH','MODERATE'))
| varjoin -i <(gor ##AF## | where af <= 0.01)
| select 1-4,pn
| join -snpseg -f 10 -r [##gene_strong_assoc##]
| multimap -c pn <(nor [#pheno_lis#] | where pn in ( #{tags:q} ) )
| where listhasany(lis_hpo_code,pheno)
| group 1 -gc ref,alt,pn -count -lis -sc pheno,or_rec
);
""")

In [35]:

%%gor
$mydefs4
gor [#pn_gene_assoc#] | top 10 /* | write user_data/hakon/Ranker/pn_gene_strong_assoc.gorz */

Query ran in 3.76 sec.
Query fetched 10 rows in 0.03 sec (total time 3.78 sec)

Out[35]:

	chrom	pos	ref	alt	pn	allCount	lis_Pheno	lis_OR_rec
0	chr1	69134	A	G	400911	1	HP:0011355	23.4
1	chr1	69134	A	G	141915	1	HP:0011355	23.4
2	chr1	69134	A	G	450302	1	HP:0011355	23.4
3	chr1	69134	A	G	162011	1	HP:0011355	23.4
4	chr1	69134	A	G	241432	1	HP:0011355	23.4
5	chr1	69134	A	G	211732	1	HP:0011355	23.4
6	chr1	69134	A	G	106494	1	HP:0011355	23.4
7	chr1	69134	A	G	218300	1	HP:0011355	23.4
8	chr1	69134	A	G	151679	1	HP:0011355	23.4
9	chr1	69134	A	G	240942	1	HP:0011355	23.4

We see that this took only 7 minutes. And the size of the sample variants annotated with gene based associations is:

In [36]:

%%gor
$mydefs4
create #count# = pgor [#pn_gene_assoc#] | group chrom -count;
nor [#count#] | group -sum -ic allcount

Query ran in 3.89 sec.
Query fetched 1 rows in 0.01 sec (total time 3.90 sec)

Out[36]:

	sum_allCount
0	3063945

Like argued before, this file is small enought to simply store non-partitioned in user_data/hakon/Ranker/pn_gene_strong_assoc.gorz.

Gregor style ACMG annotations for a big population of samples¶

We are now ready to dive into the variant annotation query. It resembles the code in Gregor1.ipynb, however, many of the creates there have been materialized into GORZ files. Further more, all the steps that do not require sample specific information, such as zygosity, inheritance, or phenotypes, are executed here on a distinct variant list - greatly improving the performance.

Before we go futher, here is the complete list of files that we have materizlized.

gor [##frequency_file##] | write s3data://shared/user_data/hakon/refdata/frequency_file.gorz
gor [##homozygous_vars##] | write s3data://shared/user_data/hakon/refdata/homozygous_vars.gorz
gor [##max_score##] | write s3data://shared/user_data/hakon/refdata/max_score.gorz
gor [##PP3_BP4##] | write s3data://shared/user_data/hakon/refdata/PP3_BP4.gorz
gor [##interpro_regions##] | write s3data://shared/user_data/hakon/refdata/interpro_regions.gorz
gor [##known_var_vep_multi##] | write s3data://shared/user_data/hakon/refdata/known_var_vep_multi.gorz
gor [##patho_high##] | write s3data://shared/user_data/hakon/refdata/patho_high.gorz
gor [##patho_missense##] | write s3data://shared/user_data/hakon/refdata/patho_missense.gorz
gor [##altstart_vars##] | write s3data://shared/user_data/hakon/refdata/altstart_vars.gorz
gor [##ex##] | write s3data://shared/user_data/hakon/refdata/ex.gorz
gor [##transcript_info##] | write s3data://shared/user_data/hakon/refdata/transcript_info.gorz
nor [##pvs1_genes##] | write s3data://shared/user_data/hakon/refdata/pvs1_genes.tsv
gor [##vep_highmod##] | write s3data://shared/user_data/hakon/refdata/vep_highmod.gorz
gor [##PM1_info##] | write s3data://shared/user_data/hakon/refdata/PM1_info.gorz
gor [##PP2_BP1_info##] | write s3data://shared/user_data/hakon/refdata/PP2_BP1_info.gorz
gor [##repeat_regions##] | write s3data://shared/user_data/hakon/refdata/repeat_regions.gorz
nor [##omim_inheritance_matched##] | replace omim_inheritance,omim_phenotype_mim,omim_disease_name,omim_moi if(left(#rc,1)=',',listtail(#rc),#rc) | write s3data://shared/user_data/hakon/refdata/omim_inheritance_matched.tsv
gor [##gnomad_gene_constraints_annotations##] | write s3data://shared/user_data/hakon/refdata/gnomad_gene_constraints_annotations.gorz
gor [##exac_mpc_scores##] | write s3data://shared/user_data/hakon/refdata/exac_mpc_scores.gorz
gor [##XA_annotations##] | write s3data://shared/user_data/hakon/refdata/XA_annotations.gorz
gor [##clinvar_hgmd_XA_annotations##] | write s3data://shared/user_data/hakon/refdata/clinvar_hgmd_XA_annotations.gorz
gor [##clinical_variants##] | write s3data://shared/user_data/hakon/refdata/clinical_variants.gorz
nor [##trans2biotype##] | write s3data://shared/user_data/hakon/refdata/trans2biotype.tsv
nor [##gene_panel_info##] | write s3data://shared/user_data/hakon/refdata/gene_panel_info.tsv
nor [##genealiaspanelinfofile##] | write s3data://shared/user_data/hakon/refdata/genealiaspanelinfofile.tsv `

In [37]:

qh5 = GQH.GOR_Query_Helper()
mydefs5 = ""

First some definitions from the Gregor logic:

In [38]:

mydefs5 = qh5.add_many("""
def #gor_or_pgor# = pgor;

def ##ref## = ref;
def ##vep_path## = source/anno/vep_v110;
def ##splice_path## = ref_base/spliceai/spliceai_scores.masked.all.hg19.filtered.gorz;
def ##FidelSource## = ref_base/GDX/Fidel/fidelhg19.gorz;

def ##vep_impact## = ##ref##/VEP_impact.map;
def ##gene_map## = ##ref##/ensgenes/ensgenes.map;
def #freeze# = freezes/xa1_v8/main;
def ##vep_multi## = #freeze#/metadata/vepref_multi_wgs.gorz.link;

create ##veprank## = nor -h ##vep_impact##
| rename so_term consequence
| rename vep_rank rank_consequence
| select consequence,rank_consequence;

def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##all_exons## = ##ref##/refgenes/refgenes_exons.gorz;
def ##coding_exons## = ##ref##/refgenes/refgenes_codingexons.gorz;
def ##gene_transcripts## = ##ref##/refgenes/reftranscripts.gorz;

def ##cat3_max_score_threshold## = 0.9;
def ##cat4_minAf## = 0.03;

def ##gene_overlap_fuzz_factor## = 20;
def ##known_vars_fuzz_factor_vep## = 2;
def ##fuzzdist## = 20;
def ##vep_impact_filter## = ('HIGH','MODERATE');
def ##repeat_overlap_fuzz_factor## = 2;
def ##max_common_rare_ratio## = 0.05;

def ##defMOI## = 'all';
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;

def ##category_map## = '1-KP_cDNA,Cat1,2-KP_AA,Cat1B,3-DV_KPCodon,Cat1C,4-KNonP,Cat1D,5-DV_intron,Cat1E,6-EP,Cat2,7-novel,Cat3,8-common,Cat4';
def ##reviewstatus_to_stars## = 'no_assertion_provided,0,no_assertion_criteria_provided,0,no_interpretation_for_the_single_variant,0,criteria_provided|conflicting_interpretations,1,criteria_provided|single_submitter,1,criteria_provided|multiple_submitters|no_conflicts,2,reviewed_by_expert_panel,3,practice_guideline,4';
def #date_val($1) = int(left($1,4))*365+int(substr($1,5,7))*30+int(substr($1,8,10));
""")

Notice that we have use the Ensembl VEP workflow to process every variant in the freeze to generate vepref_multi_wgs.gord. In the past, the freezes have only been annotated with "VEP single" but we prefer to have the VEP annotation on transcript level and not just gene level.

Samples and pedigree¶

Now the set of samples we will process and XA source to define gender:

In [39]:

mydefs5 = qh5.add_many("""
def #samples# = XA/raw/samples.tsv ;

create #pn_proband_sub# = nor -asdict XA/gene/multiscore.gord | select #2 | rename #1 PN;

create ##pngender## = nor #samples# | select id,gender
| replace gender if(#rc='F','female',if(#rc='M','male','unknown')) | rename id PN | inset -c PN [#pn_proband_sub#];
""")

We see that the pedigree includes only full trios where all the samples are in the freeze and have a available Multiscore data in XA/gene/multiscore.gord.

The BIG annotation join step¶

Next, we add definitions for the VEP and other LARGE annotation sources. Because we did not have the Provean score in user_data/hakon/refdata/max_score.gorz, we include it here with a create statement. Additionally, we include #gnomad# and #fidel# as extra annotation sources that can be used for the variant ranker.

In [40]:

mydefs5 = qh5.add_many("""
def ##spliceai_join## = | varjoin -r -l ref_base/spliceai/spliceai_scores.masked.all.hg19.filtered.gorz
                        | leftwhere gene_symbolx[-1] listhasany(gene_aliases,gene_symbol)
                        | hide scorezip,gene_symbolx,gene_aliases
                        | calc BP7_score if(spliceai_max_impact = '',1,0)
                        | varjoin -r -l -e '-1' -rprefix Fidel ##FidelSource##
;
def ##spliceai_cols## = spliceai_max_consequence,spliceai_max_impact,Fidel_PRECPAT,Fidel_PRECBEN,Fidel_Calibrated,Fidel_ALPHA;
def ##spliceai_filter## = or spliceai_max_impact in ('HIGH','MODERATE') or Fidel_PRECPAT >= 0.97;
create #dbnsfp_provean# = pgor ref/variants/dbnsfp.gorz 
| group 1 -gc ref,alt -max -fc PROVEAN_score
| rename max_PROVEAN_score Provean;

create #gnomad# = pgor ref/variants/gnomad_freq.gorz 
| select chrom,pos,ref,alt,gnomad_af,gnomad_hom;

create #fidel# = pgor ref_base/GDX/Fidel/fidelhg19.gorz
| select 1-4,precpat,alpha,revel | rename precpat fidel;

def ##sub_vep## = ##vep_multi## 
| map -c consequence -h ##vep_impact##
##spliceai_join##
| atmin 1 -last -gc reference,call,feature vep_rank
| calcifmissing BP7_score 1
| calc qc_high_mod_impact if(vep_impact in ##vep_impact_filter## ##spliceai_filter## ,1,0)
| replace gene_symbol if(gene_symbol='.','',gene_symbol)
| select 1,2,Reference,Call,Max_Consequence,Gene_symbol,Feature,Consequence,amino_acids,Protein_position,Protein_Size,Biotype,Exon,HGVSc,HGVSp,Refgene,vep_impact,qc_high_mod_impact,BP7_score,##spliceai_cols##
| varjoin -norm -r -l -e 0.0 user_data/hakon/refdata/frequency_file.gorz
| varjoin -norm -r -l -e 0.0 user_data/hakon/refdata/max_score.gorz
| varjoin -norm -r -l user_data/hakon/refdata/PP3_BP4.gorz
| varjoin -norm -r -l -e 0 user_data/hakon/refdata/homozygous_vars.gorz
| varjoin -norm -r -l -xl feature -xr transcript_stable_id -e 0 user_data/hakon/refdata/exac_mpc_scores.gorz
| varjoin -norm -r -l -e '-1' [#dbnsfp_provean#]
| varjoin -norm -r -l -e '-1' [#gnomad#]
| varjoin -norm -r -l -e '-1' [#fidel#]
| hide gene_name,transcript_stable_id
| calc rsIDs '';
""")

For those who "remember" the code in Gregor1.ipynb, we see that ##sub_vep## is now a definition and not materialized, because we are not sharing this in the cache for subsequent analysis, but rather, here we use this for non-sample based variants. In other words, this analysis is fully independent of the samples and because of the square-root-law much more efficient to process in such way for multiple samples (yes - I keep saying this :).

Our monster join now looks like this:

In [41]:

mydefs5 = qh5.add("""
def #freeze# = freezes/xa1_v8/main;
create ##variants_vepped## = pgor -split 100 #freeze#/metadata/AF.gorz | select 1-4 | distinct
| rename #3 Reference | rename #4 Call
| varjoin -ic user_data/hakon/refdata/clinvar_hgmd_XA_annotations.gorz
| rename OverlapCount qc_known_vars
| join -snpsnp -f ##known_vars_fuzz_factor_vep## -ic user_data/hakon/refdata/clinvar_hgmd_XA_annotations.gorz
| rename OverlapCount qc_closeby_known_vars
| calc qc_protected_vars 0
| varjoin -l -r -e 'NO_VEP_RESULT' <(gor ##sub_vep## )
| hide biotype
| map -c feature user_data/hakon/refdata/trans2biotype.tsv /* This filters out transcript based on NM status in refgene and biotype from Ensembl */
| join -varseg -ic -f ##gene_overlap_fuzz_factor## ##all_exons##
| rename overlapCount qc_in_exon
| where qc_in_exon > 0 or qc_known_vars > 0 or qc_protected_vars > 0 or qc_high_mod_impact > 0 or qc_closeby_known_vars > 0
| replace qc_in_exon,qc_known_vars,qc_closeby_known_vars if(int(#rc)>0,1,0)
| map -c consequence -m 100 -h [##veprank##]
| map -m 'miss' -c gene_symbol user_data/hakon/refdata/pref_transcript.tsv
| calc DefTrans if(Gene_DefTrans='miss', '0', if(containsany(listfirst(refgene,'.'),gene_DefTrans), '1', '0'))
| replace rank_consequence if(refgene='.',rank_consequence+100,rank_consequence)
| replace rank_consequence if(containsany(refgene,Gene_DefTrans),rank_consequence-200,rank_consequence)
| replace protein_size if(isint(protein_size),int(protein_size),1)
| granno 1 -gc #3,#4,gene_symbol -max -ic protein_size
| replace rank_consequence if(protein_size = max_protein_size,rank_consequence-50,rank_consequence)
| rename rank_consequence tmp_rank
| rank 1 -gc #3,#4,gene_symbol tmp_rank -o asc
| rename rank_tmp_rank qc_consequence_rank
| select 1-4,gene_symbol,feature,ref_af,homcount,Max_score,Max_Consequence,Consequence,vep_impact,amino_acids,Protein_position,Protein_Size,qc_consequence_rank,qc_known_vars,rsIDs,fidel_bp4_pp3,PP3_score,BP4_score,BP7_score,MPC,provean,gnomad_af,gnomad_hom,fidel,alpha,revel,##spliceai_cols##;
""")

We see that the VEP expression shows up in a nested query via left-varjoin. The table ##variants_vepped## defines not only the key annotations that we want to use downstream but also which variants gets considered, i.e. the WHERE expression filters out variants that are not close to exons, known variant, protected variants (not use here) or variants of high or moderate VEP or SpliceAI impact.

Also, we rank the variant-transcript here (and don't filter them) because in Gregor step1 we do CHZ on transcript level. This is inconsitent with the logic described in the beginning here for CHZ which is on gene level. This is a minor flaw that can easily be fixed, however, here we will not bother with this issue. In the end of our annotation process in Gregor step1, we do indeed collapse the inheritance to gene level although the CHZ and ACMG evidence logic there is properly on transcript level.

In [42]:

%%gor
$mydefs5
create #counts# = pgor [##variants_vepped##] | group chrom -gc vep_impact,consequence -count;
nor [#counts#] | group -gc vep_impact,consequence -sum -ic allcount | granno -sum -ic sum_allcount | sort -c sum_allcount:nr

Query ran in 0.96 sec
Query fetched 25 rows in 0.01 sec (total time 0.97 sec)

Out[42]:

	VEP_Impact	Consequence	sum_allCount	sum_sum_allCount
0	MODERATE	missense_variant	33898165	87830901
1	LOW	synonymous_variant	15632968	87830901
2	LOW	intron_variant	7711420	87830901
3	LOWEST	downstream_gene_variant	6983214	87830901
4	LOWEST	upstream_gene_variant	4524536	87830901
5	LOW	splice_region_variant	3901022	87830901
6	HIGH	frameshift_variant	3355283	87830901
7	LOW	splice_polypyrimidine_tract_variant	2437143	87830901
8	LOW	5_prime_UTR_variant	1750489	87830901
9	HIGH	stop_gained	1513376	87830901
10	LOW	splice_donor_region_variant	1160228	87830901
11	MODERATE	inframe_insertion	960325	87830901
12	LOW	3_prime_UTR_variant	890454	87830901
13	HIGH	splice_donor_variant	772281	87830901
14	MODERATE	inframe_deletion	741601	87830901
15	HIGH	splice_acceptor_variant	671669	87830901
16	LOW	splice_donor_5th_base_variant	405727	87830901
17	MODERATE	protein_altering_variant	326094	87830901
18	HIGH	start_lost	112053	87830901
19	HIGH	stop_lost	55754	87830901
20	LOW	stop_retained_variant	23564	87830901
21	LOW	coding_sequence_variant	2481	87830901
22	LOWEST	intergenic_variant	1042	87830901
23	LOW	start_retained_variant	10	87830901
24	LOW	incomplete_terminal_codon_variant	2	87830901

From above, we see that we keep about 87M rows of variant annotations for the entire freeze of 23M variants. Keep in mind that the XA database is mostly significant exonic variants. The above joins in ##variants_vepped## run in few minutes and cal easily be made to run faster by using higher -split on PGOR.

Know pathogenic variants and their neighbours¶

Next is the logic to find variant adjacent to known pathogenic variants, causing the same codon change or the same amino acid change.

In [43]:

mydefs5 = qh5.add_many("""
create ##cat1_1d## = #gor_or_pgor# [##variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| varjoin -ic <(gor user_data/hakon/refdata/clinical_variants.gorz | select 1-4,MaxClinImpact | where MaxClinImpact = 'Pathogenic')
| rename overlapcount Path
| varjoin -ic <(gor user_data/hakon/refdata/clinical_variants.gorz | select 1-4,MaxClinImpact | where NOT(MaxClinImpact = 'Pathogenic'))
| rename overlapcount NonPath
| calc category if(Path > 0,'1-KP_cDNA',if(NonPath > 0,'4-KNonP','none'))
| where category != 'none'
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;

create ##cat1bc## = #gor_or_pgor# [##variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| join -snpsnp -xl feature -xr feature -f 2 user_data/hakon/refdata/known_var_vep_multi.gorz
| where not(pos = posx and reference = referencex and call = callx) and not(amino_acids in ('','.')) 
| calc category if(amino_acids = amino_acidsx and protein_position+'' = protein_positionx,'2-KP_AA',if(protein_position+'' = protein_positionx,'3-DV_KPCodon',''))
| where category != ''
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;

create ##cat1e## = #gor_or_pgor# [##variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| join -snpsnp -f 2 <(gor user_data/hakon/refdata/clinical_variants.gorz | select 1-Alt,MaxClinImpact | rename #3 Reference | rename #4 Call)
| where not(pos = posx and reference = referencex and call = callx) 
| where consequence = 'intron_variant'
| calc category '5-DV_intron'
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;

create ##all_cat1## = #gor_or_pgor# [##cat1_1d##] 
| merge [##cat1bc##]
| merge [##cat1e##] 
| calc cat1_rank if(category='1-KP_cDNA',100,if(category='2-KP_AA',200,if(category='3-DV_KPCodon',300,if(category='4-KNonP',400,500))))
| atmin 1 -gc 3,4,gene_symbol,feature cat1_rank
| select 1-4,gene_symbol,feature,category;
""")

In [44]:

%%gor
$mydefs5
nor [##all_cat1##] | group -gc category -count | calc oldACMGcat decode(category,##category_map##)

Query ran in 0.51 sec
Query fetched 5 rows in 11.07 sec (total time 11.58 sec)

Out[44]:

	category	allCount	oldACMGcat
0	1-KP_cDNA	440600	Cat1
1	2-KP_AA	6313	Cat1B
2	3-DV_KPCodon	741317	Cat1C
3	4-KNonP	5626752	Cat1D
4	5-DV_intron	870495	Cat1E

We see that we have about 440k known pathogenic variations and about 75k variations that can cause the same amino acid change or modification in the same codon.

NMD, 50bp rule and alternative start¶

Next we annotate all the variant in relation to NMD, 50bp rule and alternative start. This will impact ACMG evidence PVS1. Note that this logic could easily be improved by looking at known Cat1 variants and see if they contradict these labels.

In [45]:

mydefs5 = qh5.add_many("""
create #PVS1_info# = #gor_or_pgor# [##variants_vepped##]  | select 1-4,feature | distinct
| join -snpseg -xl feature -xr transcript_stable_id <(gor user_data/hakon/refdata/ex.gorz 
             | select 1-gene_symbol,strand,transcript_stable_id,sum_*,exonNum,numExons,transcript_size,last_exon_size)
| calc cdspos if(strand = '+',pos-chromstart+sum_bs,max(chromend-pos-len(reference)+sum_as+2,0))
| select 1-4,transcript_stable_id,exonnum,numExons,cdspos,transcript_size,last_exon_size
| join -snpseg -l -xl transcript_stable_id -xr transcript_stable_id -r user_data/hakon/refdata/transcript_info.gorz
| calc alt_start if(strand = '+' and max_m_pos > pos,max_m_pos,if(strand = '-' and min_m_pos > 0 and min_m_pos < pos,min_m_pos,0))
| calc var_in_50bp_region if(transcript_size-last_exon_size-cdspos < 50,1,0)
/* if var is downstream of 50bp threshold and patho_var is not present (is50bpRuleViolated=0) in that region, we trigger the 50bp rule */
| calc trigger50bpRule if(var_in_50bp_region = 1 and is50bpRuleViolated = 0,1,0)
/* if var is upstream of 50bp threshold and the given transcript is subjected to NMD process, we trigger the NMD rule */
| calc triggerNMDRule if(var_in_50bp_region = 0 and isNMDtranscript = 1,1,0)
| select 1-4,transcript_stable_id,trigger50bpRule,alt_start,isNMDtranscript,triggerNMDRule
| where trigger50bpRule+alt_start+isNMDtranscript+triggerNMDRule > 0;
""")

By filtering on trigger50bpRule+alt_start+isNMDtranscript+triggerNMDRule > 0 and using left-join that set these values to zero where missing, we save about 50% in storage.

Time dependent classified variants¶

Now we add two definitions that are used to enable sample-time-dependent annotation of classified variants. This is important when generating training data however not an issue once we generate data from ranking/prediction when processing new samples. We also add clinvar as a separate annotation source, for backward compatiblity with the first variant ranking query, although it may be included in user_data/hakon/refdata/clinical_variants.gorz together with classified XA variants.

In [46]:

mydefs5 = qh5.add_many("""
create #classifs_hist# = gor XA/vars_anno/variant_classifications.gorz 
| select chrom,pos,ref,alt,classification,date
| rename classification gdx_classification_hist
| rename date gdx_classif_date_hist;

create #sample_date# = nor -h user_data/vinnie/ranker_data/inputs_for_ranker_queries/ranker_training_case_ids_to20231031.tsv
| select sample_id,lims_start_date;

create #clinvar# = gor ref/dbsnp/dbsnp_clinvar.gorz 
| where clnrevstat in ('criteria_provided|multiple_submitters|no_conflicts','reviewed_by_expert_panel','practice_guideline')
| where clnsig_name in ('uncertainsignificance','benign/likelybenign','benign','likelybenign','pathogenic/likelypathogenic','pathogenic','likelypathogenic')
| rename clnsig_name clinvar_classification
| select chrom,pos,ref,alt,clinvar_classification;
""")

Annotating sample variants with inheritance¶

We can annotate our sample based variant with the CHZ labels we generated in user_data/hakon/Ranker/xa1_v8_var_inher.gorz in two way; either by reading the genotypes from the freeze or from the row level variant tables - the latter being much faster as we see below. First by reading from the freeze:

In [47]:

%%gor
create #time# = gor freezes/xa1_v8/main/variants.gord
| csvsel -gc 3,4 -vs 1 -u 3 freezes/xa1_v8/main/buckets.tsv <(nor freezes/xa1_v8/main/buckets.tsv | select pn) -hide '0,3' -tag PN
| rename value callcopies
| varjoin -l -xl pn -xr pn user_data/hakon/Ranker/xa1_v8_var_inher.gorz
| distloc 1000
| group genome -count | calc t time();
nor [#time#]

Query ran in 0.24 sec
Query fetched 1 rows in 0.00 sec (total time 0.24 sec)

Out[47]:

	chrom	bpStart	bpStop	allCount	t
0	chrA	0	1000000000	333183	47906

and now from XA/vars/wesvars.gord

In [48]:

%%gor
create #time# = gor XA/vars/wesvars.gord
| where callcopies in ('1','2')
| select 1-4,callcopies,pn
| distinct
| varjoin -l -xl pn -xr pn user_data/hakon/Ranker/xa1_v8_var_inher.gorz
| distloc 1000
| group genome -count | calc t time();
nor [#time#]

Query ran in 0.21 sec
Query fetched 1 rows in 0.00 sec (total time 0.21 sec)

Out[48]:

	chrom	bpStart	bpStop	allCount	t
0	chrA	0	1000000000	333183	9153

We see that the latter is 9sec, more than 5x faster, because most of the variants are so rare, hence few rows per variant. The fact that we have already used the coverage information in the freeze in deriving the inheritance status makes wesvars.gord a natural choice for the last stages.

In [49]:

%%gor
$mydefs5
nor [#pn_proband_sub#] | group -count

Query ran in 0.79 sec
Query fetched 1 rows in 0.02 sec (total time 0.81 sec)

Out[49]:

	allCount
0	14641

We can now check out the content of our join for our 14641 samples for which we have gene Multiscore data:

In [50]:

%%gor
$mydefs5
gor XA/vars/wesvars.gord -ff [#pn_proband_sub#]
| where callcopies in ('1','2')
| select 1-4,callcopies,pn
| distinct
| varjoin -l -xl pn -xr pn user_data/hakon/Ranker/xa1_v8_var_inher.gorz
| distloc 10

Query ran in 44.32 sec
Query fetched 304 rows in 1.50 sec (total time 45.82 sec)

Out[50]:

	chrom	pos	ref	alt	callcopies	pn	posx	REFx	altx	Gene_Symbol	Max_Impact	PNx	value	inher	ihe	hethomchz
0	chr1	69134	A	G	2	511157	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	chr1	69134	A	G	2	396122	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	chr1	69134	A	G	1	489068	69134.0	A	G	OR4F5	MODERATE	489068.0	101.0	mother	0.0	het
3	chr1	69134	A	G	1	553894	69134.0	A	G	OR4F5	MODERATE	553894.0	102.0	mother	0.0	het
4	chr1	69134	A	G	1	619425	69134.0	A	G	OR4F5	MODERATE	619425.0	110.0	father	0.0	het
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
299	chr1	69438	T	C	2	681582	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
300	chr1	69453	G	A	1	564860	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
301	chr1	69453	G	A	1	626981	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
302	chr1	69453	G	A	2	642095	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
303	chr1	69453	G	A	1	689960	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

304 rows × 16 columns

Sample annotation, ACMG scoring, Multiscore, and writing out the results¶

Now we are ready for the final query that reads the sample based variants, annotates them and writes them out to our destinations. Unlike the other queries, that we have executed for the complete XA database (600k+ samples), we will only run this for our subset, #pn_proband_sub#, because only for them do we have the gene Multi-score. Here are some of the key things to notice in the following query:

We use PARTGOR and process 1000 samples with each worker. This will take about 5min for chr1 and about an hour for the entire genome. Other option would have been to use PARALLEL, however, bit more complicated to express.
Some of our joins are on variant-only level while other are on variant-sample level. The gene Multiscore data has already been written into a partitioned/bucketized dictionary. It has about 20k rows per sample. Our other sample sources, such as xa1_v8_var_inher.gorz and gene_pheno_score_25.gorz are not partitioned, but with only about 3k and 50 rows per sample, respectively.
The pure variant joins are with ##variants_vepped##, ##all_cat1##, and #classifs_hist#. In the first, we use a filtering varjoin because we assume it to cover the complete variants, whereas for the other we use a left-varjoin becuase they are sparse.
For #classifs_hist#, we use additional LEFTWHERE because we want to have a left-join like mechanism on the date comparison, i.e. of the establishment of the classification and the time of the sample processing. Also, we take away Cat1 status in DIAG_ACMGcat based on gdx_classification_hist and is_patho_in_knowledgebase from Gregor step2 is now calculated based on it.
For the sample based joins, for the left varjoins we use additionally -xl and -xl options with PN and -c PN for maps.
The #variants# and XA/gene/multiscore.gord use the gor driver command filtering option with the tags (PNs) populated by the PARTGOR macro command.
The non-partitioned PN based files, such as xa1_v8_var_inher.gorz, pn_strong_assoc.gorz and #sample_date#, we filter the nested queries on PN with the tags in order to reduce the memory footprint of the joins - something especially important if we run this for larger cohorts.
diag_denovo and diag_chz are dependent on xa1_v8_var_inher.gorz while diag_homrecess uses callcopies from the full #variants# source.
We use some intermediare SELECTs, primarily to show which columns are necessary to proceed and to reduce the rows.
The one-hot-encoding is used to format the data for AI processing. Writing out these expressions is tedious and error prone and best done by auto-generating them, e.g. as shown in our appendix.
The definitions for #TOP# and #WRITE# make it easier to modify and test the query logic.
The rest is obvious!

In [51]:

mydefs5 = qh5.add_many("""
def #TOP# = top 100 ;
def #WRITE# = /* calc filepath 'user_data/hakon/Ranker/output/vars4ranker_'+left(md5('#{tags}'),5)+'.gorz'
| write -f filepath -r #{fork} */;

def #variants# = XA/vars/wesvars.gord;

create #output# = partgor -dict #variants# -partsize 1000 -ff <(nor [#pn_proband_sub#] )
<(gor #variants# -f #{tags}
| rename #3 Reference
| rename #4 Call
| where callcopies in ('1','2')
| select 1-4,callcopies,pn
| distinct
| varjoin -r [##variants_vepped##]

| varjoin -r -l -xl gene_symbol,pn -xr gene_symbol,pn <(gor user_data/hakon/Ranker/xa1_v8_var_inher.gorz | where pn in ( #{tags:q} ))
| prefix ihe,inher,hethomchz GT
| #TOP#
| varjoin -xl gene_symbol,feature -xr gene_symbol,feature -r -l -e '7-novel' [##all_cat1##]
| hide gene_symbolx,gene_symbolxx,featurex

| map -c pn -m '9' <(nor [#sample_date#] | where #1 in ( #{tags:q} ))
| varjoin -norm -r -l -e 'null' [#classifs_hist#]
| leftwhere lims_start_date gdx_classif_date_hist<lims_start_date or gdx_classif_date_hist = 'null'
| granno 1 -gc #3,#4,PN -max -sc gdx_classif_date_hist
| where gdx_classif_date_hist=max_gdx_classif_date_hist

| replace category if(ref_af>##cat4_minAf## and qc_known_vars=0,'8-common',if(category='7-novel' and vep_impact='HIGH','6-EP',category))
| calc DIAG_ACMGcat decode(category, ##category_map##)
| replace DIAG_ACMGcat if(DIAG_ACMGcat = 'Cat1' and gdx_classification_hist = 'null','',if(DIAG_ACMGcat = 'Cat3',if(max_score >= ##cat3_max_score_threshold##,'Cat3A','Cat3B'),DIAG_ACMGcat))

| join -varseg -l -f ##fuzzdist## -r -xl gene_symbol -xr gene -e 0 user_data/hakon/refdata/gnomad_gene_constraints_annotations.gorz
| hide gene
| prefix consequence,max_consequence,max_score VEP
| calc diag_denovo if(GT_inher = 'denovo',1,0)
| map -c PN [##pngender##] -m '' -n gender
| map -c gene_symbol -m '' user_data/hakon/refdata/omim_inheritance_matched.tsv -n omim_gene_mim,omim_inheritance,omim_disease_name,omim_phenotype_mim,omim_moi
| join -varseg -ic -f ##repeat_overlap_fuzz_factor## user_data/hakon/refdata/repeat_regions.gorz
| rename OverlapCount qc_repeat_regions

| select 1,2,Reference,Call,PN,gene_symbol,feature,ref_af,VEP_max_score,Vep_max_consequence,VEP_Impact,Vep_consequence,CallCopies,category,DIAG_ACMGcat,diag_denovo,gender,omim_inheritance,qc_repeat_regions,fidel_bp4_pp3,PP3_score,BP4_score,BP7_score,MPC,QC_CONSEQUENCE_RANK,GT_ihe,GT_inher,GT_hethomchz,provean,gdx_classification_hist,gnomad_gene_*,gnomad_af,gnomad_hom,fidel,alpha,revel

| varjoin -norm -xl PN -xr PN -r -l -e 0 <(gor user_data/hakon/Ranker/pn_strong_assoc.gorz
							| calc OR_v5c listsize(listfilter(lis_OR_mm,'float(x)>=5'))
							| calc OR_v10c listsize(listfilter(lis_OR_mm,'float(x)>=10'))
							| calc OR_v20c listsize(listfilter(lis_OR_mm,'float(x)>=20'))
							| select 1-4,pn,OR_*)
| varjoin -norm -xl PN -xr PN -r -l -e 0 <(gor user_data/hakon/Ranker/pn_gene_strong_assoc.gorz
							| calc OR_g5c listsize(listfilter(lis_OR_rec,'float(x)>=5'))
							| calc OR_g10c listsize(listfilter(lis_OR_rec,'float(x)>=10'))
							| calc OR_g20c listsize(listfilter(lis_OR_rec,'float(x)>=20'))
							| select 1-4,pn,OR_*)
| calc Evidence_PS4 if(OR_v10c > 0,'PS4','')

| distinct

| map -c gene_symbol -h -m 'missing' user_data/hakon/refdata/pvs1_genes.tsv
| varjoin -l -r -xl feature -xr transcript_stable_id -e 0 [#PVS1_info#]
| hide transcript_stable_id
| replace pvs1gene if(alt_start>0 or trigger50bpRule>0 or triggerNMDRule>0,'0',pvs1gene)
| calcifmissing spliceai_max_impact ''
| calc Evidence_PVS1 if(pvs1gene='1' and (VEP_Impact='HIGH' or spliceai_max_impact = 'HIGH') ,'PVS1','')
| calc Evidence_PS2_PM6 if(diag_denovo='1','PS2',if(GT_IHE='1','PM6',''))
| calc Evidence_PS3_BS3 ''
| varjoin -r -l -e 0 user_data/hakon/refdata/homozygous_vars.gorz
| map -c gene_symbol -m '1' user_data/hakon/refdata/gene_panel_info.tsv
| calc af_cutoff if(contains(omim_inheritance,'AD') or #1 = 'chrX' and gender = 'male', Gene_maximum_genotype_freq_for_dominant, Gene_maximum_genotype_freq_for_recessive)
| calc Evidence_BA1_BS1_BS2_PS4_PM2 if(ref_af > 0.05,'BA1',
  if( ref_af > af_cutoff,'BS1',
  if( homCount > 4 /* conservative because using GnomAD instead of 1000G */,'BS2','')))
| replace Evidence_BA1_BS1_BS2_PS4_PM2 if(Evidence_BA1_BS1_BS2_PS4_PM2='' and (ref_af<=0.0 or ( ref_af < Gene_maximum_genotype_freq_for_recessive and contains(omim_inheritance,'AR') and not (contains(omim_inheritance,'AD')))) or (ref_af < Gene_maximum_genotype_freq_for_dominant and contains(omim_inheritance,'AD')),'PM2',Evidence_BA1_BS1_BS2_PS4_PM2)

| join -snpseg -rprefix PM1 -l -e 0 -r <(gor user_data/hakon/refdata/PM1_info.gorz | where path_high_allcount + path_missense_allcount > 0 and ratio_common_rare < ##max_common_rare_ratio## | calc tomax path_high_allcount + path_missense_allcount)
| atmax 1 -gc 3,4,gene_symbol,feature,PN pm1_tomax | hide pm1_tomax /* pick the most informative Interpro domain */

| calc Evidence_PM1 if(VEP_consequence in ('missense_variant') and pm1_path_high_allcount + pm1_path_missense_allcount > 0 and float(MPC,0)>2,'PM1','')
| calc Evidence_PM4_BP3 if(Vep_consequence in ('inframe_insertion','inframe_deletion'),if(qc_repeat_regions=0,'PM4','BP3'),'')
| replace Evidence_PM4_BP3 if(vep_consequence = 'stop_lost','PM4',Evidence_PM4_BP3)
| join -snpseg -xl gene_symbol -xr gene_symbol -rprefix PP2_BP1 -l -e 0 -r user_data/hakon/refdata/PP2_BP1_info.gorz
| hide PP2_BP1_gene_symbol
| calc Evidence_PP2_BP1 if(VEP_consequence='missense_variant',if(PP2_BP1_path_missense_allcount/float(PP2_BP1_path_high_allcount+PP2_BP1_path_missense_allcount+0.1)>0.8
and PP2_BP1_common_allcount/float(PP2_BP1_rare_allcount+PP2_BP1_common_allcount+0.1)<0.1,'PP2',
if(PP2_BP1_path_high_allcount/float(PP2_BP1_path_high_allcount+PP2_BP1_path_missense_allcount+0.1)>0.8,'BP1','')),'')            
| calc Evidence_PP3_BP4 if(PP3_score=1,'PP3',if(BP4_score=1,'BP4',''))
| calc Evidence_BP7 if(BP7_score=1 and (VEP_consequence = 'synonymous_variant' or contains(VEP_consequence,'splice')),'BP7','')
| calc Evidence_all = listfilter(Evidence_PS4+','+Evidence_PVS1+','+Evidence_PS2_PM6+','+Evidence_PS3_BS3+','+Evidence_BA1_BS1_BS2_PS4_PM2+','+Evidence_PM1+','+Evidence_PM4_BP3+','+Evidence_PP2_BP1+','+Evidence_PP3_BP4+','+Evidence_BP7,'len(x)>0')
| rename Evidence_all auto_evidence

/* Gregor step2 based logic */ 

| map -c gene_symbol user_data/hakon/refdata/genealiaspanelinfofile.tsv -m 'default'
| replace Gene_DmaxAf if(Gene_DmaxAf = 'default',##dmax_af##,float(#rc))
| replace Gene_RmaxAf if(Gene_RmaxAf = 'default',##rmax_af##,float(#rc))
| replace qc_consequence_rank if(containsany(feature,Gene_DefTrans),0,qc_consequence_rank)


| join -varseg -r -f ##fuzzdist## -xl gene_symbol,pn -xr gene_symbol,pn -l -e 0 -rprefix cand <(gor user_data/hakon/Phrank/gene_pheno_score_25.gorz | where pn in ( #{tags:q} ))

| replace auto_evidence if(cand_rank_pheno_score > 0 and cand_rank_pheno_score < 10
                           or cand_rank_norm_pheno_score > 0 and cand_rank_norm_pheno_score < 10
                           or OR_g20c > 0,listadd(auto_evidence,'PP4'),auto_evidence)

| calc diag_dominant if(ref_af <= Gene_DmaxAf and GT_inher != 'denovo',1,0)
| calc diag_chz if(ref_af <= Gene_RmaxAf and GT_hethomchz = 'chz',1,0)
| calc diag_homrecess if(ref_af <= Gene_RmaxAf and (chrom = 'chrX' and (( gender = 'male' and callcopies in ('1','2')
                                                    or gender = 'female' and callcopies = '2') ) or chrom != 'chrX' and callcopies = '2'),1,0)

| replace auto_evidence if(diag_chz=1 or diag_homrecess=1,listadd(auto_evidence,'PM3'),auto_evidence)
| calc inheritance_REC if(diag_homrecess=1, if(chrom = 'chrX','XREC','REC'),if(diag_chz=1,'CHZcc', if(diag_chz=1,'CHZ','')))
| calc inheritance_DeNovoDOM if(diag_denovo=1,'DeNovo', if(diag_dominant=1,'DOM',''))

| calc inheritance_LHZ ''
| calc inheritance = listfilter(inheritance_DeNovoDOM+','+inheritance_REC+','+inheritance_LHZ,'len(x)>0')
| granno 1 -gc reference,call,gene_symbol,PN -set -sc inheritance
| replace inheritance if(left(set_inheritance,1)=',',listtail(set_inheritance),set_inheritance)
| map -c vep_max_consequence -m '' <(nor -h ##vep_impact## | rename vep_impact vep_max_impact ) -n vep_max_impact
| atmin 1 -gc reference,call,gene_symbol,PN qc_consequence_rank
| hide set_*

| calc is_patho_in_knowledgebase if(contains(gdx_classification_hist,'PATH') or contains(gdx_classification_hist,'LPATH'),1,0)

| replace auto_evidence if(is_patho_in_knowledgebase = 1 or DIAG_ACMGcat='Cat1B',listadd(auto_evidence,'PS1'),if(DIAG_ACMGcat='Cat1C',listadd(auto_evidence,'PM5'),auto_evidence))
| replace auto_evidence if(auto_evidence='',if(DIAG_ACMGcat='Cat1','PP5',if(DIAG_ACMGcat='Cat1C','BP6',auto_evidence)),auto_evidence)
| calc ls_PVS1 listsize(listfilter(auto_evidence,'x in ("PVS1")'))
| calc ls_PS1_4 listsize(listfilter(auto_evidence,'x in ("PS1","PS2","PS3","PS4")'))
| calc ls_PM1_6 listsize(listfilter(auto_evidence,'x in ("PM1","PM2","PM3","PM4","PM5","PM6")'))
| calc ls_PP1_5 listsize(listfilter(auto_evidence,'x in ("PP1","PP2","PP3","PP4","PP5")'))
| calc ls_BA1 listsize(listfilter(auto_evidence,'x in ("BA1")'))
| calc ls_BS1_4 listsize(listfilter(auto_evidence,'x in ("BS1","BS2","BS3","BS4")'))
| calc ls_BP1_7 listsize(listfilter(auto_evidence,'x in ("BP1","BP2","BP3","BP4","BP5","BP6","BP7")'))
| calc D_Pi if(ls_PVS1=1 and (ls_PS1_4>=1 or ls_PM1_6>=2 or (LS_PM1_6=1 and ls_PP1_5=1) or ls_PP1_5>=2),1,0)
| calc D_Pii if(ls_PS1_4>=2,1,0)
| calc D_Piii if(ls_PS1_4>=1 and ( ls_PM1_6>=3 or (ls_PM1_6>=2 and ls_PP1_5>=2) or (ls_PM1_6>=1 and ls_PP1_5>=4) ),1,0)
| calc D_P if(D_Pi+D_Pii+D_Piii > 0,1,0)
| calc D_LPi if(ls_PVS1=1 and ls_PM1_6=1,1,0)
| calc D_LPii if(ls_PS1_4=1 and (ls_PM1_6=1 or ls_PM1_6=2),1,0)
| calc D_LPiii if(ls_PS1_4=1 and ls_PP1_5>=2,1,0)
| calc D_LPiv if(ls_PM1_6>=3,1,0)
| calc D_LPv if(ls_PM1_6=2 and ls_PP1_5>=2,1,0)
| calc D_LPvi if(ls_PM1_6=1 and ls_PP1_5>=4,1,0)
| calc D_LP if(D_LPi+D_LPii+D_LPiii+D_LPiv+D_LPi+D_LPv+D_LPvi > 0,1,0)
| calc D_B if(ls_BA1=1 or ls_BS1_4>=2,1,0)
| calc D_LB if(ls_BS1_4=1 and ls_BP1_7=1 or ls_BP1_7>=2,1,0)
| calc D_PLP if(D_P=1,'P',if(D_LP=1,'LP',''))
| calc D_BLB if(D_B=1,'B',if(D_LB=1,'LB',''))
| calc auto_classification_simple if(len(D_PLP)>0 and len(D_BLB)=0,D_PLP,if(len(D_PLP)=0 and len(D_BLB)>0,D_BLB,'VUS'))      
| replace auto_evidence if(contains(auto_evidence,'PP3'),replace(auto_evidence,'PP3',fidel_bp4_pp3),
                      if(contains(auto_evidence,'BP4'),replace(auto_evidence,'BP4',fidel_bp4_pp3),auto_evidence))
| calc auto_score if(auto_evidence!='',int(listnumsum(listmap(auto_evidence,"decode(x,'BA1,-8,BS1,-4,BS2,-4,BS3,-4,BS4,-4,BP1,-1,BP2,-1,BP3,-1,BP4,-1,BP4_supporting,-1,BP4_moderate,-2,BP4_strong,-4,BP4_verystrong,-8,BP5,-1,BP6,-1,BP7,-1,PVS1,8,PS1,4,PS2,4,PS3,4,PS4,4,PM1,2,PM2,1,PM3,2,PM4,2,PM5,2,PM6,2,PP1,1,PP2,1,PP3,1,PP3_strong,4,PP3_moderate,2,PP3_supporting,1,PP4,1,PP5,1,0')"))),0)
| calc auto_classification if(auto_score >= 10,'P',
                       if(auto_score >= 6 and auto_score <=9,'LP',
                       if(auto_score >= 0 and auto_score <=5,'VUS',
                       if(auto_score >= -6 and auto_score <= -1,'LB',
                       if(auto_score <= -7,'B','')))))

| hide gender,gene_dmaxAf,gene_rmaxAf,BP4_score,BP7_score,PP3_score,gene_MOI,Gene_DefTrans,D_*,ls_*,inheritance_*,femalecases_*,malecases_*
| replace auto_evidence replace(replace(replace(replace(#rc,'_verystrong',''),'_supporting',''),'_strong',''),'_moderate','')

/* multi-score */
| join -snpseg -xl gene_symbol,pn -xr gene_symbol,pn -maxseg 4000000 -f 20 -l -e 0 -rprefix GMS <(gor XA/gene/multiscore.gord -s PN -f #{tags} | select 1-3,gene_symbol,PN,probability-w2v_lit)
| select 1,2,Reference,Call,PN,Gene_Symbol,ref_af,VEP_impact,VEP_Consequence,callcopies,DIAG_ACMGcat,qc_repeat_regions,lof_pli,kdg,spliceai_max_impact,homCount,auto_evidence,diag_dominant,diag_chz,diag_homrecess,diag_denovo,inheritance,auto_score,auto_classification,cand_pheno_score,cand_norm_pheno_score,cand_rank_pheno_score,cand_rank_norm_pheno_score,OR_v5c,OR_v10c,OR_v20c,OR_g5c,OR_g10c,OR_g20c,gdx_classification_hist,fidel,alpha,revel,provean,gnomad_af,gnomad_hom,gnomad_gene_*,GMS_probability,GMS_rank,GMS_doc2vec,GMS_hrss_dipr,GMS_hrss_hpoa,GMS_hrss_lit,GMS_jaccard_dipr,GMS_jaccard_lit,GMS_w2v_dipr,GMS_w2v_lit
| rename diag_acmgcat ACMG_oldCat
| rename diag_(.*) oheInher_#{1}
| rename auto_(.*) ACMG_#{1}

| varjoin -r -l [#clinvar#]

| calc oheVep_3_prime_UTR_variant,oheVep_5_prime_UTR_variant,oheVep_frameshift_variant,oheVep_inframe_deletion,oheVep_inframe_insertion,oheVep_intron_variant,oheVep_missense_variant,oheVep_non_coding_transcript_exon_variant,oheVep_protein_altering_variant,oheVep_splice_acceptor_variant,oheVep_splice_donor_variant,oheVep_splice_region_variant,oheVep_start_lost,oheVep_stop_gained,oheVep_stop_lost,oheVep_stop_retained_variant,oheVep_synonymous_variant if(VEP_consequence = '3_prime_UTR_variant',1,0),if(VEP_consequence = '5_prime_UTR_variant',1,0),if(VEP_consequence = 'frameshift_variant',1,0),if(VEP_consequence = 'inframe_deletion',1,0),if(VEP_consequence = 'inframe_insertion',1,0),if(VEP_consequence = 'intron_variant',1,0),if(VEP_consequence = 'missense_variant',1,0),if(VEP_consequence = 'non_coding_transcript_exon_variant',1,0),if(VEP_consequence = 'protein_altering_variant',1,0),if(VEP_consequence = 'splice_acceptor_variant',1,0),if(VEP_consequence = 'splice_donor_variant',1,0),if(VEP_consequence = 'splice_region_variant',1,0),if(VEP_consequence = 'start_lost',1,0),if(VEP_consequence = 'stop_gained',1,0),if(VEP_consequence = 'stop_lost',1,0),if(VEP_consequence = 'stop_retained_variant',1,0),if(VEP_consequence = 'synonymous_variant',1,0)

/* one-hot-GDXclass */
| calc oheGDXclass_BEN,oheGDXclass_GDXBEN,oheGDXclass_LBEN,oheGDXclass_LPATH,oheGDXclass_PATH,oheGDXclass_UV,oheGDXclass_VUS,oheGDXclass_null if(gdx_classification_hist = 'BEN',1,0),if(gdx_classification_hist = 'GDXBEN',1,0),if(gdx_classification_hist = 'LBEN',1,0),if(gdx_classification_hist = 'LPATH',1,0),if(gdx_classification_hist = 'PATH',1,0),if(gdx_classification_hist = 'UV',1,0),if(gdx_classification_hist = 'VUS',1,0),if(gdx_classification_hist = 'null',1,0)

/* one-hot-Clinvarclass */
| calc clinv_class decode(clinvar_classification,'benign,B,benign/likelybenign,LB,likelybenign,LB,likelypathogenic,LP,pathogenic,P,pathogenic/likelypathogenic,LP,uncertainsignificance,VUS,null')
| calc oheClinvclass_B,oheClinvclass_LB,oheClinvclass_LP,oheClinvclass_P,oheClinvclass_VUS,oheClinvclass_null if(clinv_class = 'B',1,0),if(clinv_class = 'LB',1,0),if(clinv_class = 'LP',1,0),if(clinv_class = 'P',1,0),if(clinv_class = 'VUS',1,0),if(clinv_class = 'null',1,0)
| calc oheSplice_high,oheSplice_moderate,oheSplice_other if(spliceai_max_impact = 'HIGH',1,0),if(spliceai_max_impact = 'MODERATE',1,0),if(not(spliceai_max_impact in ('HIGH','MODERATE')),1,0)

| calc use_prov if(isfloat(provean) and provean != '-1',1,0)
| replace provean if(use_prov=1,provean,'0')
| replace ref_AF,gnomad_af,lof_pli str(if(not(isfloat(#rc)) or isfloat(#rc) and float(#rc)<1e-20,-20,log(float(#rc))))

| calc oheACMGev_BA1,oheACMGev_BS1,oheACMGev_BS2,oheACMGev_BS3,oheACMGev_BS4,oheACMGev_BP1,oheACMGev_BP2,oheACMGev_BP3,oheACMGev_BP4,oheACMGev_BP4_supporting,oheACMGev_BP4_moderate,oheACMGev_BP4_strong,oheACMGev_BP4_verystrong,oheACMGev_BP5,oheACMGev_BP6,oheACMGev_BP7,oheACMGev_PVS1,oheACMGev_PS1,oheACMGev_PS2,oheACMGev_PS3,oheACMGev_PS4,oheACMGev_PM1,oheACMGev_PM2,oheACMGev_PM3,oheACMGev_PM4,oheACMGev_PM5,oheACMGev_PM6,oheACMGev_PP1,oheACMGev_PP2,oheACMGev_PP3,oheACMGev_PP3_strong,oheACMGev_PP3_moderate,oheACMGev_PP3_supporting,oheACMGev_PP4,oheACMGev_PP5 if(listhasany(acmg_evidence,'BA1'),1,0),if(listhasany(acmg_evidence,'BS1'),1,0),if(listhasany(acmg_evidence,'BS2'),1,0),if(listhasany(acmg_evidence,'BS3'),1,0),if(listhasany(acmg_evidence,'BS4'),1,0),if(listhasany(acmg_evidence,'BP1'),1,0),if(listhasany(acmg_evidence,'BP2'),1,0),if(listhasany(acmg_evidence,'BP3'),1,0),if(listhasany(acmg_evidence,'BP4'),1,0),if(listhasany(acmg_evidence,'BP4_supporting'),1,0),if(listhasany(acmg_evidence,'BP4_moderate'),1,0),if(listhasany(acmg_evidence,'BP4_strong'),1,0),if(listhasany(acmg_evidence,'BP4_verystrong'),1,0),if(listhasany(acmg_evidence,'BP5'),1,0),if(listhasany(acmg_evidence,'BP6'),1,0),if(listhasany(acmg_evidence,'BP7'),1,0),if(listhasany(acmg_evidence,'PVS1'),1,0),if(listhasany(acmg_evidence,'PS1'),1,0),if(listhasany(acmg_evidence,'PS2'),1,0),if(listhasany(acmg_evidence,'PS3'),1,0),if(listhasany(acmg_evidence,'PS4'),1,0),if(listhasany(acmg_evidence,'PM1'),1,0),if(listhasany(acmg_evidence,'PM2'),1,0),if(listhasany(acmg_evidence,'PM3'),1,0),if(listhasany(acmg_evidence,'PM4'),1,0),if(listhasany(acmg_evidence,'PM5'),1,0),if(listhasany(acmg_evidence,'PM6'),1,0),if(listhasany(acmg_evidence,'PP1'),1,0),if(listhasany(acmg_evidence,'PP2'),1,0),if(listhasany(acmg_evidence,'PP3'),1,0),if(listhasany(acmg_evidence,'PP3_strong'),1,0),if(listhasany(acmg_evidence,'PP3_moderate'),1,0),if(listhasany(acmg_evidence,'PP3_supporting'),1,0),if(listhasany(acmg_evidence,'PP4'),1,0),if(listhasany(acmg_evidence,'PP5'),1,0)
| #WRITE#
);
""")

We run our query in test mode to see its columns:

In [52]:

%%gor
$mydefs5

nor [#output#] | top 1 | unpivot #1-

Query ran in 1.53 sec
Query fetched 127 rows in 0.01 sec (total time 1.54 sec)

Out[52]:

	Col_Name	Col_Value
0	chrom	chr1
1	pos	69134
2	Reference	A
3	Call	G
4	pn	511157
...	...	...
122	oheACMGev_PP3_strong	0
123	oheACMGev_PP3_moderate	0
124	oheACMGev_PP3_supporting	0
125	oheACMGev_PP4	1
126	oheACMGev_PP5	0

127 rows × 2 columns

We now modify our definitions and run this genome wide with 15 workers. Note the write may cause an error in the end because current incompatibilty between PARGOR and fork-write. Also note that if we had more samples with gene Multiscore, the execution would not be much longer because with 600k samples we only need 600 workers in GORdb, i.e. 1200 cores and about 3Gb per worker - very modest resources indeed!

In [53]:

mydefs5 = qh5.add_many("""
def #TOP# = /* top 100 */;
def #WRITE# = calc filepath 'user_data/hakon/Ranker/output/vars4ranker_'+left(md5('#{tags}'),5)+'.gorz'
| write -f filepath -r #{fork};
""")

In [ ]:

%%gor
$mydefs5
nor [#output#] | top 1

Appendix - generating one-hot-encoding with GOR queries¶

We will show two examples here. First we take the list used in the decode() function in ACMG:## Appendic - generating one-hot-encoding with GOR queries

In [55]:

%%gor dfQuery <<
norrows 1 | calc x 'BA1,-8,BS1,-4,BS2,-4,BS3,-4,BS4,-4,BP1,-1,BP2,-1,BP3,-1,BP4,-1,BP4_supporting,-1,BP4_moderate,-2,BP4_strong,-4,BP4_verystrong,-8,BP5,-1,BP6,-1,BP7,-1,PVS1,8,PS1,4,PS2,4,PS3,4,PS4,4,PM1,2,PM2,1,PM3,2,PM4,2,PM5,2,PM6,2,PP1,1,PP2,1,PP3,1,PP3_strong,4,PP3_moderate,2,PP3_supporting,1,PP4,1,PP5,1'
| select x
| split x
| rownum
| where mod(rownum,2)=1
| calc cmd_col 'oheACMGev_'+x
| calc cmd_form "if(listhasany(acmg_evidence,'"+x+"'),1,0)"
| group -lis -sc cmd_col,cmd_form
| calc cmd 'calc '+lis_cmd_col+' '+lis_cmd_form
| select cmd

Query ran in 0.62 sec
Query fetched 1 rows in 0.01 sec (total time 0.63 sec)

In [56]:

print(dfQuery.at[0,'cmd'])

calc oheACMGev_BA1,oheACMGev_BS1,oheACMGev_BS2,oheACMGev_BS3,oheACMGev_BS4,oheACMGev_BP1,oheACMGev_BP2,oheACMGev_BP3,oheACMGev_BP4,oheACMGev_BP4_supporting,oheACMGev_BP4_moderate,oheACMGev_BP4_strong,oheACMGev_BP4_verystrong,oheACMGev_BP5,oheACMGev_BP6,oheACMGev_BP7,oheACMGev_PVS1,oheACMGev_PS1,oheACMGev_PS2,oheACMGev_PS3,oheACMGev_PS4,oheACMGev_PM1,oheACMGev_PM2,oheACMGev_PM3,oheACMGev_PM4,oheACMGev_PM5,oheACMGev_PM6,oheACMGev_PP1,oheACMGev_PP2,oheACMGev_PP3,oheACMGev_PP3_strong,oheACMGev_PP3_moderate,oheACMGev_PP3_supporting,oheACMGev_PP4,oheACMGev_PP5 if(listhasany(acmg_evidence,'BA1'),1,0),if(listhasany(acmg_evidence,'BS1'),1,0),if(listhasany(acmg_evidence,'BS2'),1,0),if(listhasany(acmg_evidence,'BS3'),1,0),if(listhasany(acmg_evidence,'BS4'),1,0),if(listhasany(acmg_evidence,'BP1'),1,0),if(listhasany(acmg_evidence,'BP2'),1,0),if(listhasany(acmg_evidence,'BP3'),1,0),if(listhasany(acmg_evidence,'BP4'),1,0),if(listhasany(acmg_evidence,'BP4_supporting'),1,0),if(listhasany(acmg_evidence,'BP4_moderate'),1,0),if(listhasany(acmg_evidence,'BP4_strong'),1,0),if(listhasany(acmg_evidence,'BP4_verystrong'),1,0),if(listhasany(acmg_evidence,'BP5'),1,0),if(listhasany(acmg_evidence,'BP6'),1,0),if(listhasany(acmg_evidence,'BP7'),1,0),if(listhasany(acmg_evidence,'PVS1'),1,0),if(listhasany(acmg_evidence,'PS1'),1,0),if(listhasany(acmg_evidence,'PS2'),1,0),if(listhasany(acmg_evidence,'PS3'),1,0),if(listhasany(acmg_evidence,'PS4'),1,0),if(listhasany(acmg_evidence,'PM1'),1,0),if(listhasany(acmg_evidence,'PM2'),1,0),if(listhasany(acmg_evidence,'PM3'),1,0),if(listhasany(acmg_evidence,'PM4'),1,0),if(listhasany(acmg_evidence,'PM5'),1,0),if(listhasany(acmg_evidence,'PM6'),1,0),if(listhasany(acmg_evidence,'PP1'),1,0),if(listhasany(acmg_evidence,'PP2'),1,0),if(listhasany(acmg_evidence,'PP3'),1,0),if(listhasany(acmg_evidence,'PP3_strong'),1,0),if(listhasany(acmg_evidence,'PP3_moderate'),1,0),if(listhasany(acmg_evidence,'PP3_supporting'),1,0),if(listhasany(acmg_evidence,'PP4'),1,0),if(listhasany(acmg_evidence,'PP5'),1,0)

and for the VEP logic:

In [57]:

%%gor
create x = gor freezes/xa1_v6/main/vep_single.gorz | group genome -gc max_consequence,max_impact -count;
 
nor [x] | where allcount > 5000  and max_impact != 'LOWEST' | calc c 'oheVep_'+max_consequence | calc i "if(max_consequence = '"+max_consequence+"',1,0)"
| group -lis -sc c,i
| calc cmd 'calc '+lis_c+' '+lis_i
| select cmd

Query ran in 0.25 sec
Query fetched 1 rows in 0.01 sec (total time 0.25 sec)

Out[57]:

	cmd
0	calc oheVep_3_prime_UTR_variant,oheVep_5_prime...

and we can test it like this:

In [58]:

%%gor
create x = gor freezes/xa1_v6/main/vep_single.gorz | group genome -gc max_consequence,max_impact -count;
 
nor [x] | select max_consequence
| calc oheVep_3_prime_UTR_variant,oheVep_5_prime_UTR_variant,oheVep_frameshift_variant,oheVep_inframe_deletion,oheVep_inframe_insertion,oheVep_intron_variant,oheVep_missense_variant,oheVep_non_coding_transcript_exon_variant,oheVep_protein_altering_variant,oheVep_splice_acceptor_variant,oheVep_splice_donor_variant,oheVep_splice_region_variant,oheVep_start_lost,oheVep_stop_gained,oheVep_stop_lost,oheVep_stop_retained_variant,oheVep_synonymous_variant if(max_consequence = '3_prime_UTR_variant',1,0),if(max_consequence = '5_prime_UTR_variant',1,0),if(max_consequence = 'frameshift_variant',1,0),if(max_consequence = 'inframe_deletion',1,0),if(max_consequence = 'inframe_insertion',1,0),if(max_consequence = 'intron_variant',1,0),if(max_consequence = 'missense_variant',1,0),if(max_consequence = 'non_coding_transcript_exon_variant',1,0),if(max_consequence = 'protein_altering_variant',1,0),if(max_consequence = 'splice_acceptor_variant',1,0),if(max_consequence = 'splice_donor_variant',1,0),if(max_consequence = 'splice_region_variant',1,0),if(max_consequence = 'start_lost',1,0),if(max_consequence = 'stop_gained',1,0),if(max_consequence = 'stop_lost',1,0),if(max_consequence = 'stop_retained_variant',1,0),if(max_consequence = 'synonymous_variant',1,0)

Query ran in 0.08 sec
Query fetched 25 rows in 0.02 sec (total time 0.10 sec)

Out[58]:

	max_consequence	oheVep_3_prime_UTR_variant	oheVep_5_prime_UTR_variant	oheVep_frameshift_variant	oheVep_inframe_deletion	oheVep_inframe_insertion	oheVep_intron_variant	oheVep_missense_variant	oheVep_non_coding_transcript_exon_variant	oheVep_protein_altering_variant	oheVep_splice_acceptor_variant	oheVep_splice_donor_variant	oheVep_splice_region_variant	oheVep_start_lost	oheVep_stop_gained	oheVep_stop_lost	oheVep_stop_retained_variant	oheVep_synonymous_variant
0	3_prime_UTR_variant	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	5_prime_UTR_variant	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	coding_sequence_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	downstream_gene_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	frameshift_variant	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
5	incomplete_terminal_codon_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
6	inframe_deletion	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
7	inframe_insertion	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
8	intergenic_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
9	intron_variant	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0
10	mature_miRNA_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
11	missense_variant	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
12	non_coding_transcript_exon_variant	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0
13	non_coding_transcript_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
14	protein_altering_variant	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
15	splice_acceptor_variant	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0
16	splice_donor_variant	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0
17	splice_region_variant	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
18	start_lost	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
19	start_retained_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
20	stop_gained	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
21	stop_lost	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
22	stop_retained_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
23	synonymous_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
24	upstream_gene_variant	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

In [ ]: