SNV variant feature generation for ranking¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "internal_hg19"
%env GOR_API_PROJECT={project}
Variant annotations¶
This notebooks describes how we can annotate variants, SNPs and small InDels, for the purpose of classifying them or for decorating them for downstream filtering or ranking by ML approaches. For training a ML variant ranker, it is important to be able to process variants from multiple samples efficiently in order to be able to try out improved attributes for ranking. Likewise, in order to use such a ranker in clinical practice, it is important to be able to run the annotation process effectively in incremental fashion.
Here we show annotation process that is similar to the annotation steps used in Gregor step1 that can be applied to all the 600k+ samples in XA within two hour ors on the GORdb cluster. By slight modifications to the existing step1, we can provide all the necessary annotations for our ranker. As is evident from the fact that we process all the XA samples very quickly, in future versions of the "incremental" annotation step, we will greatly improve the efficiency of the process by batching multiple samples together. This is because:
- Many of the steps only involve the variant and its context in the genome and not other phenotypic characteristics of the samples. The fact that there is a great overlap between the variants of the samples (the "square-root law") means greater efficiency as the sample number grows.
- Many of the annotations sources involve join with very large tables, e.g. the VEP table, gnomad and inhouse AF tables, dbNSFP etc. The per variant the join becomes more efficient as we process more sample variants together.
Much of the "business logic" shown here has been describe in Gregor1.ipynb. Big part of the code there is transformation of reference data to make it better suited for the sample based variant queries there. However, here we have "materialized" most of these non-PN based create statements. This is both because it relies less on the GORdb cache and also because many of these sources will be revised in relation to GeneMB, Oscar, etc. Below, we will list these tables for clarity.
We can try to summarize the steps in involved in the analysis process like this:
- Calculate the inheritance of variants, e.g. deNovo, Het, Hom, CHZ, inheritance error etc.
- Annotate all possible variants with variant effect prediction, known variants, AF and hom count, SpliceAI, scores, domains, gene scores etc.
- Find variants adjacent to know variants, e.g. with same amino acid change or codon change.
- Analyse positional context with relation to alternative starts, 50bp and NMD rule.
- Label candidate genes based on HPO relationships.
- Find overlap with HPO specific single variant GWAS data and gene-based association of impactful CHZ and hom variants.
- Varint ACMG evidence generation, scoring and classification.
- Prepare the output for the ranker by one-hot-encoding and by joining with the gene Multi-score data.
Without further ado we will dive into the code and explain the following steps as we come across them and as we see fit.
qh = GQH.GOR_Query_Helper()
mydefs = ""
A query to calculate deNovo, IHE, and CHZ¶
To calculate the inheritance of variants, we need genotypes from trios. For accessing them we have two options: use the row-level variants in XA/vars/wesvars.gord or to use the genotype "freeze" structure (see Incremental_Freeze.ipynb). If we ignore sequence read coverage, it may actually be faster to read the variants from the row-level tables and use PEDPIVOT to format the data for trio comparison. This is so, even though each row has many string characters (compared to one char per gt in freeze), because most of the variants are very very rare. However, if we want to take coverage and lack of data into consideration (at least on variant level) then we are better of processing the data in "binary like" character level from the freeze. Fortunately, there are versatile and sharp query commands in GOR to manipulate the data effectively, without the need for creation of custom commands.
We start by creating out pedigree:
mydefs = qh.add_many("""
def #samples# = XA/raw/samples.tsv;
def #freeze# = freezes/xa1_v8/main;
create #ped# = nor #samples#
| select id,case_id,relation
| pivot -gc case_id relation -v proband,father,mother -e ''
| where father_id != '' and mother_id != ''and proband_id != ''
| inset -c proband_id #freeze#/buckets.tsv
| inset -c father_id #freeze#/buckets.tsv
| inset -c mother_id #freeze#/buckets.tsv
| select #2-
;""")
%%gor
$mydefs
nor [#ped#] | granno -count | top 3
Query ran in 1.09 sec Query fetched 3 rows in 0.74 sec (total time 1.83 sec)
proband_id | father_id | mother_id | allCount | |
---|---|---|---|---|
0 | 434770 | 434772 | 434771 | 154911 |
1 | 64876 | 65011 | 65012 | 154911 |
2 | 592267 | 592268 | 592269 | 154911 |
We see that we have 155k trios. Our total list of PNs can be defined like this:
mydefs = qh.add_many("""create #pns# = nor [#ped#] | calc PN #1+','+#2+','+#3 | select PN | split PN;""")
%%gor
$mydefs
nor [#pns#] | distinct | group -count
Query ran in 0.47 sec Query fetched 1 rows in 0.95 sec (total time 1.42 sec)
allCount | |
---|---|
0 | 464733 |
Next we will define which variants we consider for the annotation process. Keep in mind that for labels such as compound heterozygosity (CHZ) we are only interested in exonic variants that are not too common and of significant impact. Here is our filtering definition:
mydefs = qh.add_many("""
def #vep_single# = #freeze#/metadata/vepref_single_wgs.gorz;
def #af# = #freeze#/metadata/AF.gorz;
create #to_use# = pgor #af# | select 1-4,af
| varjoin -ic <(gor user_data/hakon/refdata/XA_annotations.gorz | where classification in ('PATH','LPATH'))
| rename overlapcount GDX_Path
| varjoin -r -norm <(gor #vep_single# | select 1-4,gene_symbol,max_consequence,max_impact )
| where GDX_path > 0 or (af <= 0.01 and max_impact in ('HIGH','MODERATE'))
| select 1-4,gene_symbol,max_consequence,max_impact,GDX_PATH
| replace GDX_PATH if(GDX_path > 0,1,0);
""")
As can be seen above, we will use variants labeled by GDX as PATH or LPATH as well as non-common HIGH and MODERATE impact variants in genes. To see the fraction of XA variants, we run the following:
%%gor
$mydefs
create #a# = pgor #af# | group chrom -count;
create #b# = pgor [#to_use#] | group 1 -gc #3,#4 | group chrom -count;
nor [#a#] | group -sum -ic allcount | merge <(nor [#b#] | group -sum -ic allcount )
Query ran in 7.81 sec. Query fetched 2 rows in 0.02 sec (total time 7.83 sec)
sum_allCount | |
---|---|
0 | 13243061 |
1 | 22618399 |
In other words, we will be annotating the inheritance of about half of the variants in XA.
Next we will explore how we can use CSVSEL to generate genotype triplets, somewhat analogous to how PEDPIVOT uses pedigree to pivot columnar row data. First we pick some two example variants that are share by two of the first 9 samples from the first 3 trios. We use CSVSEL with the -tag option:
%%gor
$mydefs
gor XA/vars/wesvars.gord -f '434770','64876' | where callcopies in ('1','2')| group 1 -gc 3,4 -dis -sc pn | where dis_pn > 1 | top 2
| varjoin -ir -norm #freeze#/variants.gord
| csvsel -gc ref,alt -vs 1 -u 3 #freeze#/buckets.tsv <(nor [#pns#] | top 9) -tag PN
Query ran in 20.87 sec Query fetched 18 rows in 10.33 sec (total time 31.20 sec)
chrom | pos | REF | alt | PN | value | |
---|---|---|---|---|---|---|
0 | chr1 | 1666175 | C | T | 434770 | 1 |
1 | chr1 | 1666175 | C | T | 434772 | 1 |
2 | chr1 | 1666175 | C | T | 434771 | 1 |
3 | chr1 | 1666175 | C | T | 64876 | 1 |
4 | chr1 | 1666175 | C | T | 65011 | 1 |
5 | chr1 | 1666175 | C | T | 65012 | 1 |
6 | chr1 | 1666175 | C | T | 592267 | 0 |
7 | chr1 | 1666175 | C | T | 592268 | 0 |
8 | chr1 | 1666175 | C | T | 592269 | 0 |
9 | chr1 | 1684374 | T | TCTC | 434770 | 1 |
10 | chr1 | 1684374 | T | TCTC | 434772 | 1 |
11 | chr1 | 1684374 | T | TCTC | 434771 | 1 |
12 | chr1 | 1684374 | T | TCTC | 64876 | 1 |
13 | chr1 | 1684374 | T | TCTC | 65011 | 1 |
14 | chr1 | 1684374 | T | TCTC | 65012 | 2 |
15 | chr1 | 1684374 | T | TCTC | 592267 | 0 |
16 | chr1 | 1684374 | T | TCTC | 592268 | 0 |
17 | chr1 | 1684374 | T | TCTC | 592269 | 0 |
Now we take the output from CSVSEL (without -tag option, i.e. a single value column per variant) and form a bucket on the fly using the pedigree and use CSVSEL a second time to form a 3-character genotype per index-case:
%%gor
$mydefs
gor XA/vars/wesvars.gord -f '434770','64876' | where callcopies in ('1','2')| group 1 -gc 3,4 -dis -sc pn | where dis_pn > 1 | top 2
| varjoin -ir -norm #freeze#/variants.gord
| csvsel -gc ref,alt -vs 1 -u 3 #freeze#/buckets.tsv [#pns#]
| calc bucket 'b1'
| csvsel -gc ref,alt -vs 3 -u '333' <(nor [#ped#] | select #1 | calc bucket 'b1') <(nor [#ped#] | select #1 | top 3) -tag PN
| replace value '_'+value
Query ran in 5.43 sec. Query fetched 6 rows in 5.09 sec (total time 10.52 sec)
chrom | pos | REF | alt | PN | value | |
---|---|---|---|---|---|---|
0 | chr1 | 1666175 | C | T | 434770 | _111 |
1 | chr1 | 1666175 | C | T | 64876 | _111 |
2 | chr1 | 1666175 | C | T | 592267 | _000 |
3 | chr1 | 1684374 | T | TCTC | 434770 | _111 |
4 | chr1 | 1684374 | T | TCTC | 64876 | _112 |
5 | chr1 | 1684374 | T | TCTC | 592267 | _000 |
Note that the intermediate values column that is feed into the second CSVSEL command has 464733 characters, because we are using the full list of PNs in #pns#. By using top 3, we pick the value for the first 3 PNs, each of size 3 as specified with the -vs option. Before we setup the full genome wide parallel option, we want to specify with the -hide option which genotypes we are interested in. This reduces the number of rows that the second CSVSEL command needs to materialize, saving about half in compute time.
mydefs = qh.add_many("""
create #av# = norrows 4;
create #comb# = nor [#av#] | multimap -cartesian [#av#] | multimap -cartesian [#av#] | rename #1 P | rename #2 F | rename #3 M;
create #valcomb# = nor [#comb#]
| calc ihe if(p=1 and (f=0 and m=0 or f=2 and m=2) or p=2 and (f=0 or m=0),1,0)
| calc denovo if(p=1 and f=0 and m=0,1,0)
| calc inher if(denovo=1,'denovo',if(p=1 and m=0,'father',if(p=1 and f=0,'mother',if(p=2,'both','unknown'))))
| calc non_hide if(inher in ('father','mother','both') or ihe = 1,1,0)
| calc value str(p)+str(f)+str(m)
| select value,inher,ihe,non_hide;
""")
%%gor
$mydefs
nor [#valcomb#] | group -gc non_hide -lis -sc value
Query ran in 0.35 sec Query fetched 2 rows in 0.00 sec (total time 0.36 sec)
non_hide | lis_value | |
---|---|---|
0 | 0 | 000,001,002,003,010,011,012,013,020,021,022,02... |
1 | 1 | 100,101,102,103,110,120,122,130,200,201,202,20... |
All the genotype combination where non-hide is zero (false) are non-informative for CHZ, deNovo and inheritance error. Therefore, we don't want them in our output.
%%gor
$mydefs
gor XA/vars/wesvars.gord -f '434770','64876' | where callcopies in ('1','2')| group 1 -gc 3,4 -dis -sc pn | where dis_pn > 1 | top 2
| varjoin -ir -norm #freeze#/variants.gord
| csvsel -gc ref,alt -vs 1 -u 3 #freeze#/buckets.tsv [#pns#]
| calc bucket 'b1'
| csvsel -gc ref,alt -vs 3 -u '333' <(nor [#ped#] | select #1 | calc bucket 'b1') <(nor [#ped#] | select #1 | top 3) -tag PN
-hide '000,001,002,003,010,011,012,013,020,021,022,023,030,031,032,033,111,112,113,121,123,131,132,133,300,301,302,303,310,311,312,313,320,321,322,323,330,331,332,333'
| replace value '_'+value
Query ran in 4.72 sec. Query fetched 0 rows in 5.78 sec (total time 10.49 sec)
chrom | pos | REF | alt | PN | value |
---|
We see that none of the genotype triplets from above make it through because they don't provide information about phase, deNovo status or inheritance error. Finally, we are ready to craft the parallel query that includes only the variant from the freeze that we are interested in. Also, because we split up the genome into jobs, we make to ensure that we don't break up genes and thereby miss out on CHZ combinations. Therefore we use genes to join into the variants in the freeze.
mydefs = qh.add_many("""
def #all_genes# = ref/refgenes/refgenes.gorz;
create #usesegs# = pgor ref/refgenes/refgenes.gorz | segproj
| join -segsnp -f 10 -ic [#to_use#]
| group 100000 -sum -ic overlapcount;
create #parsegs# = gor [#usesegs#] | seghist 30000;
create #inher# = pgor -split <(gor [#parsegs#]) #all_genes# | segproj | join -segsnp -f 20 -ir #freeze#/variants.gord
| varjoin -r -norm <(gor [#to_use#] | select 1-4,gene_symbol,max_impact)
| csvsel -gc ref,alt,gene_symbol,max_impact -vs 1 -u 3 #freeze#/buckets.tsv [#pns#]
| calc bucket 'b1'
| csvsel -gc ref,alt,gene_symbol,max_impact -vs 3 -u '333' <(nor [#ped#] | select #1 | calc bucket 'b1') <(nor [#ped#] | select #1) -tag PN
-hide '000,001,002,003,010,011,012,013,020,021,022,023,030,031,032,033,111,112,113,121,123,131,132,133,300,301,302,303,310,311,312,313,320,321,322,323,330,331,332,333'
| map -c value [#valcomb#]
| calc pat if(inher='father' or inher='both',1,0)
| calc mat if(inher='mother' or inher='both',1,0)
| granno gene -range -gc pn,gene_symbol,max_impact -sum -ic pat,mat
| calc hethomchz if(left(value,1) = '1' and max_impact in ('HIGH','MODERATE')
and (mat = 1 and sum_pat > 0 or pat = 1 and sum_mat > 0) and gene_symbol != '' ,'chz',if(left(value,1) = '2','hom','het'))
| hide pat-sum_mat,non_hide;
""")
We see that we are annotating the freeze with gene_symbol and impact. We also use MAP to get inheritance, denovo and IHE status for the different value combinations. Next, we calculate pat and mat, Boolean like values, and aggregate them per gene per PN with GRANNO, using the -range option for memory efficiency, i.e. flushing variant rows once they are futher than maximum gene length away from the most downstream row! With the SEGHIST command, as used above, we get about 500 splits, each taking about 10 minutes or so to run - typically taking about 15-20 minutes to run in the GORdb worker system. Finally, we are ready to write out our results which we will use at later stage.
%%gor
$mydefs
gor [#inher#] | top 1 /* | write user_data/hakon/Ranker/xa1_v8_var_inher.gorz */
Query ran in 2.44 sec Query fetched 1 rows in 0.19 sec (total time 2.63 sec)
chrom | pos | REF | alt | Gene_Symbol | Max_Impact | PN | value | inher | ihe | hethomchz | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69116 | G | A | OR4F5 | MODERATE | 700099 | 201 | both | 1 | hom |
and here in the file we wrote out:
%%gor
gor user_data/hakon/Ranker/xa1_v8_var_inher.gorz | top 10
Query ran in 0.07 sec Query fetched 10 rows in 0.00 sec (total time 0.07 sec)
chrom | pos | REF | alt | Gene_Symbol | Max_Impact | PN | value | inher | ihe | hethomchz | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69116 | G | A | OR4F5 | MODERATE | 700099 | 201 | both | 1 | hom |
1 | chr1 | 69121 | T | G | OR4F5 | MODERATE | 414695 | 100 | denovo | 1 | het |
2 | chr1 | 69121 | T | G | OR4F5 | MODERATE | 414698 | 120 | father | 0 | het |
3 | chr1 | 69121 | T | G | OR4F5 | MODERATE | 372859 | 100 | denovo | 1 | het |
4 | chr1 | 69134 | A | G | OR4F5 | MODERATE | 260990 | 101 | mother | 0 | het |
5 | chr1 | 69134 | A | G | OR4F5 | MODERATE | 187762 | 101 | mother | 0 | het |
6 | chr1 | 69134 | A | G | OR4F5 | MODERATE | 619425 | 110 | father | 0 | het |
7 | chr1 | 69134 | A | G | OR4F5 | MODERATE | 153921 | 110 | father | 0 | het |
8 | chr1 | 69134 | A | G | OR4F5 | MODERATE | 388719 | 120 | father | 0 | het |
9 | chr1 | 69134 | A | G | OR4F5 | MODERATE | 531602 | 120 | father | 0 | het |
Phenotype associations¶
We will now move our attention to analysis that associates phenotype information (HPO terms) of the samples with variants and genes. There are four ways in which we do this:
- Calculate candidate genes based on established HPO relationships and ranking using the Phrank method (see Phrank_Candidate_Genes.ipynb)
- Calculate variant association using HPO phenotypes of samples and Fisher-exact GWAS associations (see Fast_HPO_V_GWAS.ipynb)
- Calculate gene variant burden association using Hom and CHZ rare, high-moderate impact variants (see Fast_HPO_G_GWAS.ipynb)
- Calculate so-called Multi-score using modern vector-encoding for text mining and association with GDX reported variants.
Candidate genes¶
Phrank computes a score measuring the similarity between a patient’s phenotypes and each gene/disease–phenotype set. Each phenotype’s contribution to the Phrank score is a function of the number of genes known to cause the phenotype. Intuitively, the fewer genes are known to cause a phenotype, the more impressive is a candidate gene’s ability to explain this patient phenotype, and the higher the score derived from such a match. The conditional probability for a single node in the HPO phenotype ontology are log transformed, negated, and then summed via paths to the root node. Nodes annotated with fewer genes will ultimately contribute more to the final score as the match they represent is less expected by chance. (see Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. by Jagadeesh KA, et.al. Genet Med. 2019 Feb;21(2):464-470. doi: 10.1038/s41436-018-0072-y. Epub 2018 Jul 12. PMID: 29997393.)
As mentioned above, the full logic for the Phrank algorithm, as implemented in the GOR query language, can be found i a separate notebook (Phrank_Candidate_Genes.ipynb) where we show how we can rank candidate genes for close to 200k samples in few minutes. Here we will only show the beginning part of it that relates to the established HPO association. We setup a new query holder:
qh2 = GQH.GOR_Query_Helper()
mydefs2 = ""
mydefs2 = qh2.add_many("""
def ##ref## = ref;
def #hposource# = ##ref##/disgenes/hpo.tsv;
""")
mydefs2 = qh2.add("""
create #hpo_gene_symbol# = nor #hposource# | select hpo_code,gene_symbols
| split gene_symbols -s ';' | rename gene_symbols gene_symbol
| where gene_symbol != ''
| distinct;
""")
mydefs2 = qh2.add_many("""
create #specific_hpo_gene_symbol# = nor [#hpo_gene_symbol#]
| granno -gc hpo_code -count | where allcount <= 50 | group -gc hpo_code,gene_symbol;
create #pheno_lis# = nor XA/pheno/dag_hpo.tsv | where hpo_code != '' | inset -c hpo_code [#specific_hpo_gene_symbol#]
| group -gc pn -lis -sc hpo_code -len 10000;
create #modparts1# = nor [#pheno_lis#] | select pn | calc modulus mod(iooa(pn),10) | group -gc modulus -lis -sc pn -len 1000000
| replace lis_pn listmap(lis_pn,'squote(x)');
def ##all_genes## = ref/refgenes/rgenes.gorz;
create #pn_candidate_genes# = parallel -parts [#modparts1#] <(gor <(nor [#pheno_lis#] | where pn in ( #{col:lis_pn} )
| split lis_hpo_code | rename lis_hpo_code hpo_code
| multimap -c hpo_code [#specific_hpo_gene_symbol#]
| group -gc pn,gene_symbol -count
| multimap -c gene_symbol <(nor ##all_genes## | columnsort gene_symbol)
| select chrom,gene_start,gene_end,gene_symbol,PN
)
| sort genome
);""")
We see that the source for HPO to gene_symbol relationships is stored in the reference data folder disgenes. In the future, this could easily be the table stored in GeneMb. Also, because in the abstraction process within GDX the practice seems to be to record very high-level unspecific HPO terms, in the statement #specific_hpo_gene_symbol# we limit us to HPO terms that only associate with 50 genes or less. To understand this better, the following query show how many samples have "phenotypes" from the top level and the second level in the HPO-DAG.
%%gor
$mydefs2
create #counts# = nor XA/pheno/dag_hpo.tsv
| calc level1 if(hpo_code = 'HP:0000001',1,0)
| calc level2 if(hpo_code in ('HP:0045027','HP:0040064','HP:0033127','HP:0025354','HP:0025285','HP:0025142','HP:0025031'
,'HP:0003745','HP:0002715','HP:0002686','HP:0002664','HP:0002086','HP:0001939','HP:0001871'
,'HP:0001626','HP:0001608','HP:0001574','HP:0001507','HP:0001426','HP:0001197','HP:0000818'
,'HP:0000769','HP:0000707','HP:0000598','HP:0000478','HP:0000152','HP:0000119','HP:0032443','HP:0000118'), 1,0)
| group -gc pn -sum -ic level1,level2
| replace sum_* if(int(#rc)>1,1,0)
| group -gc sum_level1,sum_level2 -count;
nor [#counts#]
Query ran in 0.46 sec Query fetched 2 rows in 0.00 sec (total time 0.46 sec)
sum_level1 | sum_level2 | allCount | |
---|---|---|---|
0 | 0 | 0 | 83 |
1 | 0 | 1 | 235156 |
For comparison, the number of samples that have the more gene-specific HPO terms:
%%gor
$mydefs2
nor [#pheno_lis#] | group -count
Query ran in 0.34 sec Query fetched 1 rows in 1.14 sec (total time 1.48 sec)
allCount | |
---|---|
0 | 187996 |
Our direct or naive candidate gene list is like this:
%%gor
$mydefs2
gor -p chr7 [#pn_candidate_genes#] | top 10
Query ran in 0.39 sec Query fetched 10 rows in 0.00 sec (total time 0.39 sec)
chrom | gene_start | gene_end | gene_symbol | PN | |
---|---|---|---|---|---|
0 | chr7 | 192570 | 300738 | FAM20C | 100535 |
1 | chr7 | 192570 | 300738 | FAM20C | 100584 |
2 | chr7 | 192570 | 300738 | FAM20C | 101029 |
3 | chr7 | 192570 | 300738 | FAM20C | 101279 |
4 | chr7 | 192570 | 300738 | FAM20C | 102426 |
5 | chr7 | 192570 | 300738 | FAM20C | 103797 |
6 | chr7 | 192570 | 300738 | FAM20C | 105737 |
7 | chr7 | 192570 | 300738 | FAM20C | 109311 |
8 | chr7 | 192570 | 300738 | FAM20C | 110051 |
9 | chr7 | 192570 | 300738 | FAM20C | 113005 |
%%gor
$mydefs2
create #counts# = gor [#pn_candidate_genes#] | group genome -gc pn -count | granno genome -sum -ic allcount -count;
nor [#counts#] | select PN- | rename allcount Genes | rename allcountx Samples | rename sum_allcount TotalRows
| granno -avg -std -min -max -ic genes | top 10
Query ran in 0.35 sec Query fetched 10 rows in 0.35 sec (total time 0.69 sec)
PN | Genes | Samples | TotalRows | min_Genes | max_Genes | avg_Genes | std_Genes | |
---|---|---|---|---|---|---|---|---|
0 | 100003 | 30 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
1 | 100004 | 2 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
2 | 100007 | 278 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
3 | 10001 | 72 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
4 | 100010 | 117 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
5 | 100013 | 79 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
6 | 100016 | 78 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
7 | 100018 | 31 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
8 | 100019 | 140 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
9 | 10002 | 158 | 187983 | 17433798 | 1 | 909 | 92.741354 | 78.059362 |
We see that with the HPO database does give us very many gene associations. By using Phrank, we can try to score the based on relevance with the complete list of HPO terms per sample. Our results from the Phrank notebook are accessible in a single GORZ file here:
%%gor
gor -p chr7 user_data/hakon/Phrank/gene_pheno_score_25.gorz | top 10
Query ran in 0.07 sec Query fetched 10 rows in 0.00 sec (total time 0.08 sec)
chrom | gene_start | gene_end | gene_symbol | pheno_score | norm_pheno_score | PN | rank_pheno_score | rank_norm_pheno_score | |
---|---|---|---|---|---|---|---|---|---|
0 | chr7 | 192570 | 300738 | FAM20C | 14.018 | 0.051 | 40219 | 19 | 57 |
1 | chr7 | 192570 | 300738 | FAM20C | 14.028 | 0.051 | 191085 | 23 | 63 |
2 | chr7 | 192570 | 300738 | FAM20C | 14.031 | 0.051 | 183198 | 8 | 35 |
3 | chr7 | 192570 | 300738 | FAM20C | 14.031 | 0.051 | 475054 | 8 | 35 |
4 | chr7 | 192570 | 300738 | FAM20C | 14.049 | 0.051 | 184944 | 9 | 48 |
5 | chr7 | 192570 | 300738 | FAM20C | 14.049 | 0.051 | 364261 | 5 | 45 |
6 | chr7 | 192570 | 300738 | FAM20C | 14.049 | 0.051 | 397803 | 4 | 36 |
7 | chr7 | 192570 | 300738 | FAM20C | 14.060 | 0.051 | 85002 | 15 | 39 |
8 | chr7 | 192570 | 300738 | FAM20C | 14.063 | 0.051 | 49046 | 3 | 40 |
9 | chr7 | 192570 | 300738 | FAM20C | 14.063 | 0.051 | 76441 | 20 | 67 |
%%gor
create #counts# = gor user_data/hakon/Phrank/gene_pheno_score_25.gorz | group genome -gc pn -count | granno genome -sum -ic allcount -count;
nor [#counts#] | select PN- | rename allcount Genes | rename allcountx Samples | rename sum_allcount TotalRows
| granno -avg -std -min -max -ic genes | top 10
Query ran in 0.08 sec Query fetched 10 rows in 0.50 sec (total time 0.59 sec)
PN | Genes | Samples | TotalRows | min_Genes | max_Genes | avg_Genes | std_Genes | |
---|---|---|---|---|---|---|---|---|
0 | 100003 | 42 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
1 | 100004 | 15 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
2 | 100007 | 56 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
3 | 10001 | 54 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
4 | 100010 | 65 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
5 | 100013 | 95 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
6 | 100016 | 42 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
7 | 100018 | 13 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
8 | 100019 | 53 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
9 | 10002 | 54 | 187044 | 7849486 | 1 | 197 | 41.965987 | 23.283037 |
In gene_pheno_score_25.gorz we have picked only the genes that are in the top 25 rank based on normalized and regular Phrank score. We that we have much fewer genes and less variation in the number of candidate genes. Later, in our annotation query logic, we will return to this file.
Variant associations¶
In the notebook Fast_HPO_V_GWAS.ipynb, we show how we can calculate very quickly (few hours in GORdb), using Fisher-Exact test, variant association for thousands of phenotype definitions. Here we will only explore phenotype definitions based on a presence or absence of a single HPO term, but this analysis can easily be extended to more complex man-made or machine generated phenotypes. Importantly, this analysis takes into consideration presence and absence of data, e.g. via sequence read coverage as mentioned in our notebook Incremental_Freeze.ipyn.
Here, we will use the analysis dir_hpo_prob_ctrls_fix, but it is based on analysis where the cases are proband that have a give HPO term and the controls are other probands that do not have that term or any other related HPO term. The controls could als easily be picked from unaffected parents - but the definition of this is not the subject we cover here but rather how we fold this into the variant annotation process.
%%gor
nor XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.tsv | granno -count | top 10
Query ran in 0.50 sec Query fetched 10 rows in 0.19 sec (total time 0.69 sec)
Pheno | name | allCount | |
---|---|---|---|
0 | HP:0006336 | Short dental roots | 6379 |
1 | HP:0008038 | Aplastic/hypoplastic lacrimal glands | 6379 |
2 | HP:0040198 | Non-medullary thyroid carcinoma | 6379 |
3 | HP:0002710 | Commissural lip pit | 6379 |
4 | HP:0000679 | Taurodontia | 6379 |
5 | HP:0020072 | Persistent EBV viremia | 6379 |
6 | HP:0025423 | Abnormal larynx morphology | 6379 |
7 | HP:0100576 | Amaurosis fugax | 6379 |
8 | HP:0040144 | L-2-hydroxyglutaric aciduria | 6379 |
9 | HP:0011016 | Abnormality of urine glucose concentration | 6379 |
Similary a preview of the GWAS results:
%%gor
gor XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.gord | top 10 | columnreorder 1-alt,pheno
Query ran in 3.02 sec. Query fetched 10 rows in 0.01 sec (total time 3.03 sec)
chrom | pos | REF | alt | Pheno | AF | pVal_mm | OR_mm | OR_SEmm | pVal_dom | ... | case_0_GTcount | case_1_GTcount | case_2_GTcount | case_3_GTcount | ctrl_0_GTcount | ctrl_1_GTcount | ctrl_2_GTcount | ctrl_3_GTcount | AN | AC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69121 | T | G | HP:0100694 | 0.000012 | 0.0100 | 150.0 | 410.0 | 0.0100 | ... | 435 | 1 | 0 | 0 | 126796 | 2 | 0 | 0 | 254468 | 3 |
1 | chr1 | 69121 | T | G | HP:0012646 | 0.000012 | 0.0068 | 220.0 | 620.0 | 0.0068 | ... | 288 | 1 | 0 | 0 | 126943 | 2 | 0 | 0 | 254468 | 3 |
2 | chr1 | 69134 | A | G | HP:0012432 | 0.000444 | 0.0045 | 6.2 | 3.7 | 0.1000 | ... | 743 | 0 | 2 | 0 | 126400 | 69 | 20 | 0 | 254468 | 113 |
3 | chr1 | 69134 | A | G | HP:0001572 | 0.000444 | 0.0099 | 14.0 | 13.0 | 0.1100 | ... | 166 | 0 | 1 | 0 | 126977 | 69 | 21 | 0 | 254468 | 113 |
4 | chr1 | 69134 | A | G | HP:0000817 | 0.000444 | 0.0028 | 3.0 | 1.0 | 0.0400 | ... | 4423 | 3 | 4 | 0 | 122720 | 66 | 18 | 0 | 254468 | 113 |
5 | chr1 | 69134 | A | G | HP:0031590 | 0.000444 | 0.0080 | 130.0 | 250.0 | 0.0064 | ... | 8 | 1 | 0 | 0 | 127135 | 68 | 22 | 0 | 254468 | 113 |
6 | chr1 | 69134 | A | G | HP:0025030 | 0.000444 | 0.0035 | 320.0 | 670.0 | 0.0029 | ... | 3 | 1 | 0 | 0 | 127140 | 68 | 22 | 0 | 254468 | 113 |
7 | chr1 | 69134 | A | G | HP:0033252 | 0.000444 | 0.0071 | 150.0 | 290.0 | 0.0057 | ... | 7 | 1 | 0 | 0 | 127136 | 68 | 22 | 0 | 254468 | 113 |
8 | chr1 | 69134 | A | G | HP:0020110 | 0.000444 | 0.0025 | 12.0 | 8.4 | 0.0200 | ... | 299 | 1 | 1 | 0 | 126844 | 68 | 21 | 0 | 254468 | 113 |
9 | chr1 | 69134 | A | G | HP:0001812 | 0.000444 | 0.0016 | 36.0 | 35.0 | 0.0010 | ... | 63 | 2 | 0 | 0 | 127080 | 67 | 22 | 0 | 254468 | 113 |
10 rows × 23 columns
The table XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.gord is massive with almost every association (p < 0.01) for 23M variants and 6k phenotypes. We want to make a subset of this table, i.e. what we consider strong associations, and then join the table with the sample variants, however, only store output for samples that harbour the corresponding phenotype.
qh3 = GQH.GOR_Query_Helper()
mydefs3 = ""
mydefs3 = qh3.add_many("""
def #freeze# = freezes/xa1_v8/main;
def ##gwastable## = XA/analysis/hpogwas/dir_hpo_prob_ctrls_fix.gord;
def #vep_single# = #freeze#/metadata/vepref_single_wgs.gorz;
def #af# = #freeze#/metadata/AF.gorz;
create ##varcount## = pgor #af# | group 100000 -count;
create ##parsegs## = gor [##varcount##] | seghist 100000;
create ##strong_assoc## = pgor -split <(gor [##parsegs##] ) ##gwastable##
| where or_mm > 5 and pval_mm < 1e-10 and
(case_1_gtcount+case_2_gtcount)/(case_0_gtcount++1) > 0.2 and case_1_gtcount+case_2_gtcount > 9
| granno 1 -gc ref,alt -count | where allcount < 100 /* no more than 100 HPO phenotypes that associate per marker */
| select 1-4,pheno,pval_mm,OR_mm;
""")
In the definition above, we ask for large odds-ratio of statistical significance and additionally that the AF in the case group is larger than 0.2 and that the AC is 10 or greater - quite stringent conditions.
%%gor
$mydefs3
create #counts# = pgor [##strong_assoc##] | group chrom -gc pheno -count;
nor [#counts#] | group -gc pheno -sum -ic allcount | granno -sum -avg -min -max -ic sum_allcount | top 10
Query ran in 0.65 sec Query fetched 10 rows in 0.03 sec (total time 0.68 sec)
Pheno | sum_allCount | min_sum_allCount | max_sum_allCount | avg_sum_allCount | sum_sum_allCount | |
---|---|---|---|---|---|---|
0 | HP:0000006 | 30 | 1 | 10812 | 356.859743 | 333307 |
1 | HP:0000007 | 1109 | 1 | 10812 | 356.859743 | 333307 |
2 | HP:0000013 | 12 | 1 | 10812 | 356.859743 | 333307 |
3 | HP:0000019 | 9 | 1 | 10812 | 356.859743 | 333307 |
4 | HP:0000032 | 23 | 1 | 10812 | 356.859743 | 333307 |
5 | HP:0000033 | 42 | 1 | 10812 | 356.859743 | 333307 |
6 | HP:0000035 | 12 | 1 | 10812 | 356.859743 | 333307 |
7 | HP:0000055 | 3609 | 1 | 10812 | 356.859743 | 333307 |
8 | HP:0000058 | 11 | 1 | 10812 | 356.859743 | 333307 |
9 | HP:0000059 | 12 | 1 | 10812 | 356.859743 | 333307 |
We see that our complete list of "strong associations" is only 333k rows - something we can easily join with the sample variants, which we do next:
mydefs3 = qh3.add_many("""
create #pheno_lis# = nor XA/pheno/dag_hpo.tsv | group -gc pn -lis -sc hpo_code -len 10000;
create #pns# = nor [#pheno_lis#] | select PN;
def #variants# = XA/vars/wesvars.gord;
create #pn_assoc# = partgor -dict #variants# -partscale 10 -ff [#pns#] <(gor #variants# -f #{tags}
| where genotype in ('het','hom')
| select 1-4,pn
| varjoin -r [##strong_assoc##]
| multimap -c pn <(nor [#pheno_lis#] | where pn in ( #{tags:q} ) )
| where listhasany(lis_hpo_code,pheno)
| group 1 -gc ref,alt,pn -count -lis -sc pheno,or_mm
);
""")
Notable in the above query is that:
- we have "packed" the sample HPO phenotypes into a list for the following reasons: a) it makes the data footprint and parsing cost smaller, but more importanly, b) it makes our MULTIMAP join not multiply row for samples with many HPO terms
- we use listhasany() to filter out samples that have the variant but not the corresponding phenotyes.
- we use PARTGOR with the partscale option to process 10k sampels at the time (600k+ in total) and filter the MULTIMAP relation to reduce memory footprint
%%gor
$mydefs3
gor [#pn_assoc#] | top 10 /* | write user_data/hakon/Ranker/pn_strong_assoc.gorz */
Query ran in 3.64 sec. Query fetched 10 rows in 0.03 sec (total time 3.67 sec)
chrom | pos | ref | alt | pn | allCount | lis_Pheno | lis_OR_mm | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 69270 | A | G | 13640 | 1 | HP:0011355 | 7.9 |
1 | chr1 | 69270 | A | G | 55279 | 1 | HP:0011355 | 7.9 |
2 | chr1 | 69270 | A | G | 56085 | 1 | HP:0011355 | 7.9 |
3 | chr1 | 69270 | A | G | 62752 | 1 | HP:0011355 | 7.9 |
4 | chr1 | 69270 | A | G | 63539 | 1 | HP:0011355 | 7.9 |
5 | chr1 | 69270 | A | G | 67468 | 1 | HP:0011355 | 7.9 |
6 | chr1 | 69270 | A | G | 67755 | 1 | HP:0001743 | 9.8 |
7 | chr1 | 69270 | A | G | 87911 | 1 | HP:0011355 | 7.9 |
8 | chr1 | 69270 | A | G | 101233 | 1 | HP:0011355 | 7.9 |
9 | chr1 | 69270 | A | G | 101255 | 1 | HP:0000142 | 13.0 |
We see that the total running time is about 15min - pretty good give we are reading from a variant dataset of 600k samples and 23M different variants and phenotypes from about 200k samples. Now let us quickly look at the size of this table:
%%gor
$mydefs3
create #count# = pgor [#pn_assoc#] | group chrom -count;
nor [#count#] | group -sum -ic allcount
Query ran in 3.89 sec. Query fetched 1 rows in 0.01 sec (total time 3.90 sec)
sum_allCount | |
---|---|
0 | 26738903 |
We see that the table has 27M rows. Small enough for us to treat is as an GORZ annotation table in out annotation queries, i.e. we don't have to partition it by samples and for now we simply store it as user_data/hakon/Ranker/pn_strong_assoc.gorz.
Gene based associations¶
Similarly as with the variant associations, we can annotate sample variants that overlap genes that show association with phenotypes harboured by the sample. Here, we are though primarily interested in the variants that were used to establish the gene based association in the first place, i.e. rare high and moderate impact variants.
qh4 = GQH.GOR_Query_Helper()
mydefs4 = ""
mydefs4 = qh4.add_many("""
def ##ac## = XA/analysis/hpogwas;
def ##gene_source_file## = ref/ensgenes/genes.gorz;
def ##AF## = freezes/xa1_v1/main/metadata/AF.gorz;
def ##vep## = freezes/xa1_v1/main/metadata/vep_v109/vep_single.gorz;
def ##Gene_GWAS## = XA/analysis/hpogwas/gene_phewas_dir_hpo.gorz;
def #variants# = XA/vars/wesvars.gord;
create ##gene_strong_assoc## = gor ##Gene_GWAS##
| where or_rec > 5 and p_rec < 1e-10 and case_rec > 9 and case_rec/(case_rec+case_no_rec) > 0.2
| select 1,2,gene_end,gene_symbol,HPO_code,P_rec,OR_rec
| rename hpo_code Pheno
;
create #pheno_lis# = nor XA/pheno/dag_hpo.tsv | group -gc pn -lis -sc hpo_code -len 10000;
create #pns# = nor [#pheno_lis#] | select PN ;
create #pn_gene_assoc# = partgor -dict #variants# -partscale 10 -ff [#pns#] <(gor #variants# -f #{tags}
| where genotype in ('het','hom')
| varjoin -i <(gor ##vep## | where max_impact in ('HIGH','MODERATE'))
| varjoin -i <(gor ##AF## | where af <= 0.01)
| select 1-4,pn
| join -snpseg -f 10 -r [##gene_strong_assoc##]
| multimap -c pn <(nor [#pheno_lis#] | where pn in ( #{tags:q} ) )
| where listhasany(lis_hpo_code,pheno)
| group 1 -gc ref,alt,pn -count -lis -sc pheno,or_rec
);
""")
%%gor
$mydefs4
gor [#pn_gene_assoc#] | top 10 /* | write user_data/hakon/Ranker/pn_gene_strong_assoc.gorz */
Query ran in 3.76 sec. Query fetched 10 rows in 0.03 sec (total time 3.78 sec)
chrom | pos | ref | alt | pn | allCount | lis_Pheno | lis_OR_rec | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 69134 | A | G | 400911 | 1 | HP:0011355 | 23.4 |
1 | chr1 | 69134 | A | G | 141915 | 1 | HP:0011355 | 23.4 |
2 | chr1 | 69134 | A | G | 450302 | 1 | HP:0011355 | 23.4 |
3 | chr1 | 69134 | A | G | 162011 | 1 | HP:0011355 | 23.4 |
4 | chr1 | 69134 | A | G | 241432 | 1 | HP:0011355 | 23.4 |
5 | chr1 | 69134 | A | G | 211732 | 1 | HP:0011355 | 23.4 |
6 | chr1 | 69134 | A | G | 106494 | 1 | HP:0011355 | 23.4 |
7 | chr1 | 69134 | A | G | 218300 | 1 | HP:0011355 | 23.4 |
8 | chr1 | 69134 | A | G | 151679 | 1 | HP:0011355 | 23.4 |
9 | chr1 | 69134 | A | G | 240942 | 1 | HP:0011355 | 23.4 |
We see that this took only 7 minutes. And the size of the sample variants annotated with gene based associations is:
%%gor
$mydefs4
create #count# = pgor [#pn_gene_assoc#] | group chrom -count;
nor [#count#] | group -sum -ic allcount
Query ran in 3.89 sec. Query fetched 1 rows in 0.01 sec (total time 3.90 sec)
sum_allCount | |
---|---|
0 | 3063945 |
Like argued before, this file is small enought to simply store non-partitioned in user_data/hakon/Ranker/pn_gene_strong_assoc.gorz.
Gregor style ACMG annotations for a big population of samples¶
We are now ready to dive into the variant annotation query. It resembles the code in Gregor1.ipynb, however, many of the creates there have been materialized into GORZ files. Further more, all the steps that do not require sample specific information, such as zygosity, inheritance, or phenotypes, are executed here on a distinct variant list - greatly improving the performance.
Before we go futher, here is the complete list of files that we have materizlized.
- gor [##frequency_file##] | write s3data://shared/user_data/hakon/refdata/frequency_file.gorz
- gor [##homozygous_vars##] | write s3data://shared/user_data/hakon/refdata/homozygous_vars.gorz
- gor [##max_score##] | write s3data://shared/user_data/hakon/refdata/max_score.gorz
- gor [##PP3_BP4##] | write s3data://shared/user_data/hakon/refdata/PP3_BP4.gorz
- gor [##interpro_regions##] | write s3data://shared/user_data/hakon/refdata/interpro_regions.gorz
- gor [##known_var_vep_multi##] | write s3data://shared/user_data/hakon/refdata/known_var_vep_multi.gorz
- gor [##patho_high##] | write s3data://shared/user_data/hakon/refdata/patho_high.gorz
- gor [##patho_missense##] | write s3data://shared/user_data/hakon/refdata/patho_missense.gorz
- gor [##altstart_vars##] | write s3data://shared/user_data/hakon/refdata/altstart_vars.gorz
- gor [##ex##] | write s3data://shared/user_data/hakon/refdata/ex.gorz
- gor [##transcript_info##] | write s3data://shared/user_data/hakon/refdata/transcript_info.gorz
- nor [##pvs1_genes##] | write s3data://shared/user_data/hakon/refdata/pvs1_genes.tsv
- gor [##vep_highmod##] | write s3data://shared/user_data/hakon/refdata/vep_highmod.gorz
- gor [##PM1_info##] | write s3data://shared/user_data/hakon/refdata/PM1_info.gorz
- gor [##PP2_BP1_info##] | write s3data://shared/user_data/hakon/refdata/PP2_BP1_info.gorz
- gor [##repeat_regions##] | write s3data://shared/user_data/hakon/refdata/repeat_regions.gorz
- nor [##omim_inheritance_matched##] | replace omim_inheritance,omim_phenotype_mim,omim_disease_name,omim_moi if(left(#rc,1)=',',listtail(#rc),#rc) | write s3data://shared/user_data/hakon/refdata/omim_inheritance_matched.tsv
- gor [##gnomad_gene_constraints_annotations##] | write s3data://shared/user_data/hakon/refdata/gnomad_gene_constraints_annotations.gorz
- gor [##exac_mpc_scores##] | write s3data://shared/user_data/hakon/refdata/exac_mpc_scores.gorz
- gor [##XA_annotations##] | write s3data://shared/user_data/hakon/refdata/XA_annotations.gorz
- gor [##clinvar_hgmd_XA_annotations##] | write s3data://shared/user_data/hakon/refdata/clinvar_hgmd_XA_annotations.gorz
- gor [##clinical_variants##] | write s3data://shared/user_data/hakon/refdata/clinical_variants.gorz
- nor [##trans2biotype##] | write s3data://shared/user_data/hakon/refdata/trans2biotype.tsv
- nor [##gene_panel_info##] | write s3data://shared/user_data/hakon/refdata/gene_panel_info.tsv
- nor [##genealiaspanelinfofile##] | write s3data://shared/user_data/hakon/refdata/genealiaspanelinfofile.tsv `
qh5 = GQH.GOR_Query_Helper()
mydefs5 = ""
First some definitions from the Gregor logic:
mydefs5 = qh5.add_many("""
def #gor_or_pgor# = pgor;
def ##ref## = ref;
def ##vep_path## = source/anno/vep_v110;
def ##splice_path## = ref_base/spliceai/spliceai_scores.masked.all.hg19.filtered.gorz;
def ##FidelSource## = ref_base/GDX/Fidel/fidelhg19.gorz;
def ##vep_impact## = ##ref##/VEP_impact.map;
def ##gene_map## = ##ref##/ensgenes/ensgenes.map;
def #freeze# = freezes/xa1_v8/main;
def ##vep_multi## = #freeze#/metadata/vepref_multi_wgs.gorz.link;
create ##veprank## = nor -h ##vep_impact##
| rename so_term consequence
| rename vep_rank rank_consequence
| select consequence,rank_consequence;
def ##all_genes## = ##ref##/refgenes/rgenes.gorz;
def ##all_exons## = ##ref##/refgenes/refgenes_exons.gorz;
def ##coding_exons## = ##ref##/refgenes/refgenes_codingexons.gorz;
def ##gene_transcripts## = ##ref##/refgenes/reftranscripts.gorz;
def ##cat3_max_score_threshold## = 0.9;
def ##cat4_minAf## = 0.03;
def ##gene_overlap_fuzz_factor## = 20;
def ##known_vars_fuzz_factor_vep## = 2;
def ##fuzzdist## = 20;
def ##vep_impact_filter## = ('HIGH','MODERATE');
def ##repeat_overlap_fuzz_factor## = 2;
def ##max_common_rare_ratio## = 0.05;
def ##defMOI## = 'all';
def ##rmax_af## = 0.01;
def ##dmax_af## = 0.001;
def ##category_map## = '1-KP_cDNA,Cat1,2-KP_AA,Cat1B,3-DV_KPCodon,Cat1C,4-KNonP,Cat1D,5-DV_intron,Cat1E,6-EP,Cat2,7-novel,Cat3,8-common,Cat4';
def ##reviewstatus_to_stars## = 'no_assertion_provided,0,no_assertion_criteria_provided,0,no_interpretation_for_the_single_variant,0,criteria_provided|conflicting_interpretations,1,criteria_provided|single_submitter,1,criteria_provided|multiple_submitters|no_conflicts,2,reviewed_by_expert_panel,3,practice_guideline,4';
def #date_val($1) = int(left($1,4))*365+int(substr($1,5,7))*30+int(substr($1,8,10));
""")
Notice that we have use the Ensembl VEP workflow to process every variant in the freeze to generate vepref_multi_wgs.gord. In the past, the freezes have only been annotated with "VEP single" but we prefer to have the VEP annotation on transcript level and not just gene level.
Samples and pedigree¶
Now the set of samples we will process and XA source to define gender:
mydefs5 = qh5.add_many("""
def #samples# = XA/raw/samples.tsv ;
create #pn_proband_sub# = nor -asdict XA/gene/multiscore.gord | select #2 | rename #1 PN;
create ##pngender## = nor #samples# | select id,gender
| replace gender if(#rc='F','female',if(#rc='M','male','unknown')) | rename id PN | inset -c PN [#pn_proband_sub#];
""")
We see that the pedigree includes only full trios where all the samples are in the freeze and have a available Multiscore data in XA/gene/multiscore.gord.
The BIG annotation join step¶
Next, we add definitions for the VEP and other LARGE annotation sources. Because we did not have the Provean score in user_data/hakon/refdata/max_score.gorz, we include it here with a create statement. Additionally, we include #gnomad# and #fidel# as extra annotation sources that can be used for the variant ranker.
mydefs5 = qh5.add_many("""
def ##spliceai_join## = | varjoin -r -l ref_base/spliceai/spliceai_scores.masked.all.hg19.filtered.gorz
| leftwhere gene_symbolx[-1] listhasany(gene_aliases,gene_symbol)
| hide scorezip,gene_symbolx,gene_aliases
| calc BP7_score if(spliceai_max_impact = '',1,0)
| varjoin -r -l -e '-1' -rprefix Fidel ##FidelSource##
;
def ##spliceai_cols## = spliceai_max_consequence,spliceai_max_impact,Fidel_PRECPAT,Fidel_PRECBEN,Fidel_Calibrated,Fidel_ALPHA;
def ##spliceai_filter## = or spliceai_max_impact in ('HIGH','MODERATE') or Fidel_PRECPAT >= 0.97;
create #dbnsfp_provean# = pgor ref/variants/dbnsfp.gorz
| group 1 -gc ref,alt -max -fc PROVEAN_score
| rename max_PROVEAN_score Provean;
create #gnomad# = pgor ref/variants/gnomad_freq.gorz
| select chrom,pos,ref,alt,gnomad_af,gnomad_hom;
create #fidel# = pgor ref_base/GDX/Fidel/fidelhg19.gorz
| select 1-4,precpat,alpha,revel | rename precpat fidel;
def ##sub_vep## = ##vep_multi##
| map -c consequence -h ##vep_impact##
##spliceai_join##
| atmin 1 -last -gc reference,call,feature vep_rank
| calcifmissing BP7_score 1
| calc qc_high_mod_impact if(vep_impact in ##vep_impact_filter## ##spliceai_filter## ,1,0)
| replace gene_symbol if(gene_symbol='.','',gene_symbol)
| select 1,2,Reference,Call,Max_Consequence,Gene_symbol,Feature,Consequence,amino_acids,Protein_position,Protein_Size,Biotype,Exon,HGVSc,HGVSp,Refgene,vep_impact,qc_high_mod_impact,BP7_score,##spliceai_cols##
| varjoin -norm -r -l -e 0.0 user_data/hakon/refdata/frequency_file.gorz
| varjoin -norm -r -l -e 0.0 user_data/hakon/refdata/max_score.gorz
| varjoin -norm -r -l user_data/hakon/refdata/PP3_BP4.gorz
| varjoin -norm -r -l -e 0 user_data/hakon/refdata/homozygous_vars.gorz
| varjoin -norm -r -l -xl feature -xr transcript_stable_id -e 0 user_data/hakon/refdata/exac_mpc_scores.gorz
| varjoin -norm -r -l -e '-1' [#dbnsfp_provean#]
| varjoin -norm -r -l -e '-1' [#gnomad#]
| varjoin -norm -r -l -e '-1' [#fidel#]
| hide gene_name,transcript_stable_id
| calc rsIDs '';
""")
For those who "remember" the code in Gregor1.ipynb, we see that ##sub_vep## is now a definition and not materialized, because we are not sharing this in the cache for subsequent analysis, but rather, here we use this for non-sample based variants. In other words, this analysis is fully independent of the samples and because of the square-root-law much more efficient to process in such way for multiple samples (yes - I keep saying this :).
Our monster join now looks like this:
mydefs5 = qh5.add("""
def #freeze# = freezes/xa1_v8/main;
create ##variants_vepped## = pgor -split 100 #freeze#/metadata/AF.gorz | select 1-4 | distinct
| rename #3 Reference | rename #4 Call
| varjoin -ic user_data/hakon/refdata/clinvar_hgmd_XA_annotations.gorz
| rename OverlapCount qc_known_vars
| join -snpsnp -f ##known_vars_fuzz_factor_vep## -ic user_data/hakon/refdata/clinvar_hgmd_XA_annotations.gorz
| rename OverlapCount qc_closeby_known_vars
| calc qc_protected_vars 0
| varjoin -l -r -e 'NO_VEP_RESULT' <(gor ##sub_vep## )
| hide biotype
| map -c feature user_data/hakon/refdata/trans2biotype.tsv /* This filters out transcript based on NM status in refgene and biotype from Ensembl */
| join -varseg -ic -f ##gene_overlap_fuzz_factor## ##all_exons##
| rename overlapCount qc_in_exon
| where qc_in_exon > 0 or qc_known_vars > 0 or qc_protected_vars > 0 or qc_high_mod_impact > 0 or qc_closeby_known_vars > 0
| replace qc_in_exon,qc_known_vars,qc_closeby_known_vars if(int(#rc)>0,1,0)
| map -c consequence -m 100 -h [##veprank##]
| map -m 'miss' -c gene_symbol user_data/hakon/refdata/pref_transcript.tsv
| calc DefTrans if(Gene_DefTrans='miss', '0', if(containsany(listfirst(refgene,'.'),gene_DefTrans), '1', '0'))
| replace rank_consequence if(refgene='.',rank_consequence+100,rank_consequence)
| replace rank_consequence if(containsany(refgene,Gene_DefTrans),rank_consequence-200,rank_consequence)
| replace protein_size if(isint(protein_size),int(protein_size),1)
| granno 1 -gc #3,#4,gene_symbol -max -ic protein_size
| replace rank_consequence if(protein_size = max_protein_size,rank_consequence-50,rank_consequence)
| rename rank_consequence tmp_rank
| rank 1 -gc #3,#4,gene_symbol tmp_rank -o asc
| rename rank_tmp_rank qc_consequence_rank
| select 1-4,gene_symbol,feature,ref_af,homcount,Max_score,Max_Consequence,Consequence,vep_impact,amino_acids,Protein_position,Protein_Size,qc_consequence_rank,qc_known_vars,rsIDs,fidel_bp4_pp3,PP3_score,BP4_score,BP7_score,MPC,provean,gnomad_af,gnomad_hom,fidel,alpha,revel,##spliceai_cols##;
""")
We see that the VEP expression shows up in a nested query via left-varjoin. The table ##variants_vepped## defines not only the key annotations that we want to use downstream but also which variants gets considered, i.e. the WHERE expression filters out variants that are not close to exons, known variant, protected variants (not use here) or variants of high or moderate VEP or SpliceAI impact.
Also, we rank the variant-transcript here (and don't filter them) because in Gregor step1 we do CHZ on transcript level. This is inconsitent with the logic described in the beginning here for CHZ which is on gene level. This is a minor flaw that can easily be fixed, however, here we will not bother with this issue. In the end of our annotation process in Gregor step1, we do indeed collapse the inheritance to gene level although the CHZ and ACMG evidence logic there is properly on transcript level.
%%gor
$mydefs5
create #counts# = pgor [##variants_vepped##] | group chrom -gc vep_impact,consequence -count;
nor [#counts#] | group -gc vep_impact,consequence -sum -ic allcount | granno -sum -ic sum_allcount | sort -c sum_allcount:nr
Query ran in 0.96 sec Query fetched 25 rows in 0.01 sec (total time 0.97 sec)
VEP_Impact | Consequence | sum_allCount | sum_sum_allCount | |
---|---|---|---|---|
0 | MODERATE | missense_variant | 33898165 | 87830901 |
1 | LOW | synonymous_variant | 15632968 | 87830901 |
2 | LOW | intron_variant | 7711420 | 87830901 |
3 | LOWEST | downstream_gene_variant | 6983214 | 87830901 |
4 | LOWEST | upstream_gene_variant | 4524536 | 87830901 |
5 | LOW | splice_region_variant | 3901022 | 87830901 |
6 | HIGH | frameshift_variant | 3355283 | 87830901 |
7 | LOW | splice_polypyrimidine_tract_variant | 2437143 | 87830901 |
8 | LOW | 5_prime_UTR_variant | 1750489 | 87830901 |
9 | HIGH | stop_gained | 1513376 | 87830901 |
10 | LOW | splice_donor_region_variant | 1160228 | 87830901 |
11 | MODERATE | inframe_insertion | 960325 | 87830901 |
12 | LOW | 3_prime_UTR_variant | 890454 | 87830901 |
13 | HIGH | splice_donor_variant | 772281 | 87830901 |
14 | MODERATE | inframe_deletion | 741601 | 87830901 |
15 | HIGH | splice_acceptor_variant | 671669 | 87830901 |
16 | LOW | splice_donor_5th_base_variant | 405727 | 87830901 |
17 | MODERATE | protein_altering_variant | 326094 | 87830901 |
18 | HIGH | start_lost | 112053 | 87830901 |
19 | HIGH | stop_lost | 55754 | 87830901 |
20 | LOW | stop_retained_variant | 23564 | 87830901 |
21 | LOW | coding_sequence_variant | 2481 | 87830901 |
22 | LOWEST | intergenic_variant | 1042 | 87830901 |
23 | LOW | start_retained_variant | 10 | 87830901 |
24 | LOW | incomplete_terminal_codon_variant | 2 | 87830901 |
From above, we see that we keep about 87M rows of variant annotations for the entire freeze of 23M variants. Keep in mind that the XA database is mostly significant exonic variants. The above joins in ##variants_vepped## run in few minutes and cal easily be made to run faster by using higher -split on PGOR.
Know pathogenic variants and their neighbours¶
Next is the logic to find variant adjacent to known pathogenic variants, causing the same codon change or the same amino acid change.
mydefs5 = qh5.add_many("""
create ##cat1_1d## = #gor_or_pgor# [##variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| varjoin -ic <(gor user_data/hakon/refdata/clinical_variants.gorz | select 1-4,MaxClinImpact | where MaxClinImpact = 'Pathogenic')
| rename overlapcount Path
| varjoin -ic <(gor user_data/hakon/refdata/clinical_variants.gorz | select 1-4,MaxClinImpact | where NOT(MaxClinImpact = 'Pathogenic'))
| rename overlapcount NonPath
| calc category if(Path > 0,'1-KP_cDNA',if(NonPath > 0,'4-KNonP','none'))
| where category != 'none'
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;
create ##cat1bc## = #gor_or_pgor# [##variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| join -snpsnp -xl feature -xr feature -f 2 user_data/hakon/refdata/known_var_vep_multi.gorz
| where not(pos = posx and reference = referencex and call = callx) and not(amino_acids in ('','.'))
| calc category if(amino_acids = amino_acidsx and protein_position+'' = protein_positionx,'2-KP_AA',if(protein_position+'' = protein_positionx,'3-DV_KPCodon',''))
| where category != ''
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;
create ##cat1e## = #gor_or_pgor# [##variants_vepped##]
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position
| distinct
| join -snpsnp -f 2 <(gor user_data/hakon/refdata/clinical_variants.gorz | select 1-Alt,MaxClinImpact | rename #3 Reference | rename #4 Call)
| where not(pos = posx and reference = referencex and call = callx)
| where consequence = 'intron_variant'
| calc category '5-DV_intron'
| select 1-4,gene_symbol,feature,consequence,amino_acids,protein_position,category;
create ##all_cat1## = #gor_or_pgor# [##cat1_1d##]
| merge [##cat1bc##]
| merge [##cat1e##]
| calc cat1_rank if(category='1-KP_cDNA',100,if(category='2-KP_AA',200,if(category='3-DV_KPCodon',300,if(category='4-KNonP',400,500))))
| atmin 1 -gc 3,4,gene_symbol,feature cat1_rank
| select 1-4,gene_symbol,feature,category;
""")
%%gor
$mydefs5
nor [##all_cat1##] | group -gc category -count | calc oldACMGcat decode(category,##category_map##)
Query ran in 0.51 sec Query fetched 5 rows in 11.07 sec (total time 11.58 sec)
category | allCount | oldACMGcat | |
---|---|---|---|
0 | 1-KP_cDNA | 440600 | Cat1 |
1 | 2-KP_AA | 6313 | Cat1B |
2 | 3-DV_KPCodon | 741317 | Cat1C |
3 | 4-KNonP | 5626752 | Cat1D |
4 | 5-DV_intron | 870495 | Cat1E |
We see that we have about 440k known pathogenic variations and about 75k variations that can cause the same amino acid change or modification in the same codon.
NMD, 50bp rule and alternative start¶
Next we annotate all the variant in relation to NMD, 50bp rule and alternative start. This will impact ACMG evidence PVS1. Note that this logic could easily be improved by looking at known Cat1 variants and see if they contradict these labels.
mydefs5 = qh5.add_many("""
create #PVS1_info# = #gor_or_pgor# [##variants_vepped##] | select 1-4,feature | distinct
| join -snpseg -xl feature -xr transcript_stable_id <(gor user_data/hakon/refdata/ex.gorz
| select 1-gene_symbol,strand,transcript_stable_id,sum_*,exonNum,numExons,transcript_size,last_exon_size)
| calc cdspos if(strand = '+',pos-chromstart+sum_bs,max(chromend-pos-len(reference)+sum_as+2,0))
| select 1-4,transcript_stable_id,exonnum,numExons,cdspos,transcript_size,last_exon_size
| join -snpseg -l -xl transcript_stable_id -xr transcript_stable_id -r user_data/hakon/refdata/transcript_info.gorz
| calc alt_start if(strand = '+' and max_m_pos > pos,max_m_pos,if(strand = '-' and min_m_pos > 0 and min_m_pos < pos,min_m_pos,0))
| calc var_in_50bp_region if(transcript_size-last_exon_size-cdspos < 50,1,0)
/* if var is downstream of 50bp threshold and patho_var is not present (is50bpRuleViolated=0) in that region, we trigger the 50bp rule */
| calc trigger50bpRule if(var_in_50bp_region = 1 and is50bpRuleViolated = 0,1,0)
/* if var is upstream of 50bp threshold and the given transcript is subjected to NMD process, we trigger the NMD rule */
| calc triggerNMDRule if(var_in_50bp_region = 0 and isNMDtranscript = 1,1,0)
| select 1-4,transcript_stable_id,trigger50bpRule,alt_start,isNMDtranscript,triggerNMDRule
| where trigger50bpRule+alt_start+isNMDtranscript+triggerNMDRule > 0;
""")
By filtering on trigger50bpRule+alt_start+isNMDtranscript+triggerNMDRule > 0 and using left-join that set these values to zero where missing, we save about 50% in storage.
Time dependent classified variants¶
Now we add two definitions that are used to enable sample-time-dependent annotation of classified variants. This is important when generating training data however not an issue once we generate data from ranking/prediction when processing new samples. We also add clinvar as a separate annotation source, for backward compatiblity with the first variant ranking query, although it may be included in user_data/hakon/refdata/clinical_variants.gorz together with classified XA variants.
mydefs5 = qh5.add_many("""
create #classifs_hist# = gor XA/vars_anno/variant_classifications.gorz
| select chrom,pos,ref,alt,classification,date
| rename classification gdx_classification_hist
| rename date gdx_classif_date_hist;
create #sample_date# = nor -h user_data/vinnie/ranker_data/inputs_for_ranker_queries/ranker_training_case_ids_to20231031.tsv
| select sample_id,lims_start_date;
create #clinvar# = gor ref/dbsnp/dbsnp_clinvar.gorz
| where clnrevstat in ('criteria_provided|multiple_submitters|no_conflicts','reviewed_by_expert_panel','practice_guideline')
| where clnsig_name in ('uncertainsignificance','benign/likelybenign','benign','likelybenign','pathogenic/likelypathogenic','pathogenic','likelypathogenic')
| rename clnsig_name clinvar_classification
| select chrom,pos,ref,alt,clinvar_classification;
""")
Annotating sample variants with inheritance¶
We can annotate our sample based variant with the CHZ labels we generated in user_data/hakon/Ranker/xa1_v8_var_inher.gorz in two way; either by reading the genotypes from the freeze or from the row level variant tables - the latter being much faster as we see below. First by reading from the freeze:
%%gor
create #time# = gor freezes/xa1_v8/main/variants.gord
| csvsel -gc 3,4 -vs 1 -u 3 freezes/xa1_v8/main/buckets.tsv <(nor freezes/xa1_v8/main/buckets.tsv | select pn) -hide '0,3' -tag PN
| rename value callcopies
| varjoin -l -xl pn -xr pn user_data/hakon/Ranker/xa1_v8_var_inher.gorz
| distloc 1000
| group genome -count | calc t time();
nor [#time#]
Query ran in 0.24 sec Query fetched 1 rows in 0.00 sec (total time 0.24 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 333183 | 47906 |
and now from XA/vars/wesvars.gord
%%gor
create #time# = gor XA/vars/wesvars.gord
| where callcopies in ('1','2')
| select 1-4,callcopies,pn
| distinct
| varjoin -l -xl pn -xr pn user_data/hakon/Ranker/xa1_v8_var_inher.gorz
| distloc 1000
| group genome -count | calc t time();
nor [#time#]
Query ran in 0.21 sec Query fetched 1 rows in 0.00 sec (total time 0.21 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 333183 | 9153 |
We see that the latter is 9sec, more than 5x faster, because most of the variants are so rare, hence few rows per variant. The fact that we have already used the coverage information in the freeze in deriving the inheritance status makes wesvars.gord a natural choice for the last stages.
%%gor
$mydefs5
nor [#pn_proband_sub#] | group -count
Query ran in 0.79 sec Query fetched 1 rows in 0.02 sec (total time 0.81 sec)
allCount | |
---|---|
0 | 14641 |
We can now check out the content of our join for our 14641 samples for which we have gene Multiscore data:
%%gor
$mydefs5
gor XA/vars/wesvars.gord -ff [#pn_proband_sub#]
| where callcopies in ('1','2')
| select 1-4,callcopies,pn
| distinct
| varjoin -l -xl pn -xr pn user_data/hakon/Ranker/xa1_v8_var_inher.gorz
| distloc 10
Query ran in 44.32 sec Query fetched 304 rows in 1.50 sec (total time 45.82 sec)
chrom | pos | ref | alt | callcopies | pn | posx | REFx | altx | Gene_Symbol | Max_Impact | PNx | value | inher | ihe | hethomchz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69134 | A | G | 2 | 511157 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | chr1 | 69134 | A | G | 2 | 396122 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | chr1 | 69134 | A | G | 1 | 489068 | 69134.0 | A | G | OR4F5 | MODERATE | 489068.0 | 101.0 | mother | 0.0 | het |
3 | chr1 | 69134 | A | G | 1 | 553894 | 69134.0 | A | G | OR4F5 | MODERATE | 553894.0 | 102.0 | mother | 0.0 | het |
4 | chr1 | 69134 | A | G | 1 | 619425 | 69134.0 | A | G | OR4F5 | MODERATE | 619425.0 | 110.0 | father | 0.0 | het |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
299 | chr1 | 69438 | T | C | 2 | 681582 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
300 | chr1 | 69453 | G | A | 1 | 564860 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
301 | chr1 | 69453 | G | A | 1 | 626981 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
302 | chr1 | 69453 | G | A | 2 | 642095 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
303 | chr1 | 69453 | G | A | 1 | 689960 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
304 rows × 16 columns
Sample annotation, ACMG scoring, Multiscore, and writing out the results¶
Now we are ready for the final query that reads the sample based variants, annotates them and writes them out to our destinations. Unlike the other queries, that we have executed for the complete XA database (600k+ samples), we will only run this for our subset, #pn_proband_sub#, because only for them do we have the gene Multi-score. Here are some of the key things to notice in the following query:
- We use PARTGOR and process 1000 samples with each worker. This will take about 5min for chr1 and about an hour for the entire genome. Other option would have been to use PARALLEL, however, bit more complicated to express.
- Some of our joins are on variant-only level while other are on variant-sample level. The gene Multiscore data has already been written into a partitioned/bucketized dictionary. It has about 20k rows per sample. Our other sample sources, such as xa1_v8_var_inher.gorz and gene_pheno_score_25.gorz are not partitioned, but with only about 3k and 50 rows per sample, respectively.
- The pure variant joins are with ##variants_vepped##, ##all_cat1##, and #classifs_hist#. In the first, we use a filtering varjoin because we assume it to cover the complete variants, whereas for the other we use a left-varjoin becuase they are sparse.
- For #classifs_hist#, we use additional LEFTWHERE because we want to have a left-join like mechanism on the date comparison, i.e. of the establishment of the classification and the time of the sample processing. Also, we take away Cat1 status in DIAG_ACMGcat based on gdx_classification_hist and is_patho_in_knowledgebase from Gregor step2 is now calculated based on it.
- For the sample based joins, for the left varjoins we use additionally -xl and -xl options with PN and -c PN for maps.
- The #variants# and XA/gene/multiscore.gord use the gor driver command filtering option with the tags (PNs) populated by the PARTGOR macro command.
- The non-partitioned PN based files, such as xa1_v8_var_inher.gorz, pn_strong_assoc.gorz and #sample_date#, we filter the nested queries on PN with the tags in order to reduce the memory footprint of the joins - something especially important if we run this for larger cohorts.
- diag_denovo and diag_chz are dependent on xa1_v8_var_inher.gorz while diag_homrecess uses callcopies from the full #variants# source.
- We use some intermediare SELECTs, primarily to show which columns are necessary to proceed and to reduce the rows.
- The one-hot-encoding is used to format the data for AI processing. Writing out these expressions is tedious and error prone and best done by auto-generating them, e.g. as shown in our appendix.
- The definitions for #TOP# and #WRITE# make it easier to modify and test the query logic.
- The rest is obvious!
mydefs5 = qh5.add_many("""
def #TOP# = top 100 ;
def #WRITE# = /* calc filepath 'user_data/hakon/Ranker/output/vars4ranker_'+left(md5('#{tags}'),5)+'.gorz'
| write -f filepath -r #{fork} */;
def #variants# = XA/vars/wesvars.gord;
create #output# = partgor -dict #variants# -partsize 1000 -ff <(nor [#pn_proband_sub#] )
<(gor #variants# -f #{tags}
| rename #3 Reference
| rename #4 Call
| where callcopies in ('1','2')
| select 1-4,callcopies,pn
| distinct
| varjoin -r [##variants_vepped##]
| varjoin -r -l -xl gene_symbol,pn -xr gene_symbol,pn <(gor user_data/hakon/Ranker/xa1_v8_var_inher.gorz | where pn in ( #{tags:q} ))
| prefix ihe,inher,hethomchz GT
| #TOP#
| varjoin -xl gene_symbol,feature -xr gene_symbol,feature -r -l -e '7-novel' [##all_cat1##]
| hide gene_symbolx,gene_symbolxx,featurex
| map -c pn -m '9' <(nor [#sample_date#] | where #1 in ( #{tags:q} ))
| varjoin -norm -r -l -e 'null' [#classifs_hist#]
| leftwhere lims_start_date gdx_classif_date_hist<lims_start_date or gdx_classif_date_hist = 'null'
| granno 1 -gc #3,#4,PN -max -sc gdx_classif_date_hist
| where gdx_classif_date_hist=max_gdx_classif_date_hist
| replace category if(ref_af>##cat4_minAf## and qc_known_vars=0,'8-common',if(category='7-novel' and vep_impact='HIGH','6-EP',category))
| calc DIAG_ACMGcat decode(category, ##category_map##)
| replace DIAG_ACMGcat if(DIAG_ACMGcat = 'Cat1' and gdx_classification_hist = 'null','',if(DIAG_ACMGcat = 'Cat3',if(max_score >= ##cat3_max_score_threshold##,'Cat3A','Cat3B'),DIAG_ACMGcat))
| join -varseg -l -f ##fuzzdist## -r -xl gene_symbol -xr gene -e 0 user_data/hakon/refdata/gnomad_gene_constraints_annotations.gorz
| hide gene
| prefix consequence,max_consequence,max_score VEP
| calc diag_denovo if(GT_inher = 'denovo',1,0)
| map -c PN [##pngender##] -m '' -n gender
| map -c gene_symbol -m '' user_data/hakon/refdata/omim_inheritance_matched.tsv -n omim_gene_mim,omim_inheritance,omim_disease_name,omim_phenotype_mim,omim_moi
| join -varseg -ic -f ##repeat_overlap_fuzz_factor## user_data/hakon/refdata/repeat_regions.gorz
| rename OverlapCount qc_repeat_regions
| select 1,2,Reference,Call,PN,gene_symbol,feature,ref_af,VEP_max_score,Vep_max_consequence,VEP_Impact,Vep_consequence,CallCopies,category,DIAG_ACMGcat,diag_denovo,gender,omim_inheritance,qc_repeat_regions,fidel_bp4_pp3,PP3_score,BP4_score,BP7_score,MPC,QC_CONSEQUENCE_RANK,GT_ihe,GT_inher,GT_hethomchz,provean,gdx_classification_hist,gnomad_gene_*,gnomad_af,gnomad_hom,fidel,alpha,revel
| varjoin -norm -xl PN -xr PN -r -l -e 0 <(gor user_data/hakon/Ranker/pn_strong_assoc.gorz
| calc OR_v5c listsize(listfilter(lis_OR_mm,'float(x)>=5'))
| calc OR_v10c listsize(listfilter(lis_OR_mm,'float(x)>=10'))
| calc OR_v20c listsize(listfilter(lis_OR_mm,'float(x)>=20'))
| select 1-4,pn,OR_*)
| varjoin -norm -xl PN -xr PN -r -l -e 0 <(gor user_data/hakon/Ranker/pn_gene_strong_assoc.gorz
| calc OR_g5c listsize(listfilter(lis_OR_rec,'float(x)>=5'))
| calc OR_g10c listsize(listfilter(lis_OR_rec,'float(x)>=10'))
| calc OR_g20c listsize(listfilter(lis_OR_rec,'float(x)>=20'))
| select 1-4,pn,OR_*)
| calc Evidence_PS4 if(OR_v10c > 0,'PS4','')
| distinct
| map -c gene_symbol -h -m 'missing' user_data/hakon/refdata/pvs1_genes.tsv
| varjoin -l -r -xl feature -xr transcript_stable_id -e 0 [#PVS1_info#]
| hide transcript_stable_id
| replace pvs1gene if(alt_start>0 or trigger50bpRule>0 or triggerNMDRule>0,'0',pvs1gene)
| calcifmissing spliceai_max_impact ''
| calc Evidence_PVS1 if(pvs1gene='1' and (VEP_Impact='HIGH' or spliceai_max_impact = 'HIGH') ,'PVS1','')
| calc Evidence_PS2_PM6 if(diag_denovo='1','PS2',if(GT_IHE='1','PM6',''))
| calc Evidence_PS3_BS3 ''
| varjoin -r -l -e 0 user_data/hakon/refdata/homozygous_vars.gorz
| map -c gene_symbol -m '1' user_data/hakon/refdata/gene_panel_info.tsv
| calc af_cutoff if(contains(omim_inheritance,'AD') or #1 = 'chrX' and gender = 'male', Gene_maximum_genotype_freq_for_dominant, Gene_maximum_genotype_freq_for_recessive)
| calc Evidence_BA1_BS1_BS2_PS4_PM2 if(ref_af > 0.05,'BA1',
if( ref_af > af_cutoff,'BS1',
if( homCount > 4 /* conservative because using GnomAD instead of 1000G */,'BS2','')))
| replace Evidence_BA1_BS1_BS2_PS4_PM2 if(Evidence_BA1_BS1_BS2_PS4_PM2='' and (ref_af<=0.0 or ( ref_af < Gene_maximum_genotype_freq_for_recessive and contains(omim_inheritance,'AR') and not (contains(omim_inheritance,'AD')))) or (ref_af < Gene_maximum_genotype_freq_for_dominant and contains(omim_inheritance,'AD')),'PM2',Evidence_BA1_BS1_BS2_PS4_PM2)
| join -snpseg -rprefix PM1 -l -e 0 -r <(gor user_data/hakon/refdata/PM1_info.gorz | where path_high_allcount + path_missense_allcount > 0 and ratio_common_rare < ##max_common_rare_ratio## | calc tomax path_high_allcount + path_missense_allcount)
| atmax 1 -gc 3,4,gene_symbol,feature,PN pm1_tomax | hide pm1_tomax /* pick the most informative Interpro domain */
| calc Evidence_PM1 if(VEP_consequence in ('missense_variant') and pm1_path_high_allcount + pm1_path_missense_allcount > 0 and float(MPC,0)>2,'PM1','')
| calc Evidence_PM4_BP3 if(Vep_consequence in ('inframe_insertion','inframe_deletion'),if(qc_repeat_regions=0,'PM4','BP3'),'')
| replace Evidence_PM4_BP3 if(vep_consequence = 'stop_lost','PM4',Evidence_PM4_BP3)
| join -snpseg -xl gene_symbol -xr gene_symbol -rprefix PP2_BP1 -l -e 0 -r user_data/hakon/refdata/PP2_BP1_info.gorz
| hide PP2_BP1_gene_symbol
| calc Evidence_PP2_BP1 if(VEP_consequence='missense_variant',if(PP2_BP1_path_missense_allcount/float(PP2_BP1_path_high_allcount+PP2_BP1_path_missense_allcount+0.1)>0.8
and PP2_BP1_common_allcount/float(PP2_BP1_rare_allcount+PP2_BP1_common_allcount+0.1)<0.1,'PP2',
if(PP2_BP1_path_high_allcount/float(PP2_BP1_path_high_allcount+PP2_BP1_path_missense_allcount+0.1)>0.8,'BP1','')),'')
| calc Evidence_PP3_BP4 if(PP3_score=1,'PP3',if(BP4_score=1,'BP4',''))
| calc Evidence_BP7 if(BP7_score=1 and (VEP_consequence = 'synonymous_variant' or contains(VEP_consequence,'splice')),'BP7','')
| calc Evidence_all = listfilter(Evidence_PS4+','+Evidence_PVS1+','+Evidence_PS2_PM6+','+Evidence_PS3_BS3+','+Evidence_BA1_BS1_BS2_PS4_PM2+','+Evidence_PM1+','+Evidence_PM4_BP3+','+Evidence_PP2_BP1+','+Evidence_PP3_BP4+','+Evidence_BP7,'len(x)>0')
| rename Evidence_all auto_evidence
/* Gregor step2 based logic */
| map -c gene_symbol user_data/hakon/refdata/genealiaspanelinfofile.tsv -m 'default'
| replace Gene_DmaxAf if(Gene_DmaxAf = 'default',##dmax_af##,float(#rc))
| replace Gene_RmaxAf if(Gene_RmaxAf = 'default',##rmax_af##,float(#rc))
| replace qc_consequence_rank if(containsany(feature,Gene_DefTrans),0,qc_consequence_rank)
| join -varseg -r -f ##fuzzdist## -xl gene_symbol,pn -xr gene_symbol,pn -l -e 0 -rprefix cand <(gor user_data/hakon/Phrank/gene_pheno_score_25.gorz | where pn in ( #{tags:q} ))
| replace auto_evidence if(cand_rank_pheno_score > 0 and cand_rank_pheno_score < 10
or cand_rank_norm_pheno_score > 0 and cand_rank_norm_pheno_score < 10
or OR_g20c > 0,listadd(auto_evidence,'PP4'),auto_evidence)
| calc diag_dominant if(ref_af <= Gene_DmaxAf and GT_inher != 'denovo',1,0)
| calc diag_chz if(ref_af <= Gene_RmaxAf and GT_hethomchz = 'chz',1,0)
| calc diag_homrecess if(ref_af <= Gene_RmaxAf and (chrom = 'chrX' and (( gender = 'male' and callcopies in ('1','2')
or gender = 'female' and callcopies = '2') ) or chrom != 'chrX' and callcopies = '2'),1,0)
| replace auto_evidence if(diag_chz=1 or diag_homrecess=1,listadd(auto_evidence,'PM3'),auto_evidence)
| calc inheritance_REC if(diag_homrecess=1, if(chrom = 'chrX','XREC','REC'),if(diag_chz=1,'CHZcc', if(diag_chz=1,'CHZ','')))
| calc inheritance_DeNovoDOM if(diag_denovo=1,'DeNovo', if(diag_dominant=1,'DOM',''))
| calc inheritance_LHZ ''
| calc inheritance = listfilter(inheritance_DeNovoDOM+','+inheritance_REC+','+inheritance_LHZ,'len(x)>0')
| granno 1 -gc reference,call,gene_symbol,PN -set -sc inheritance
| replace inheritance if(left(set_inheritance,1)=',',listtail(set_inheritance),set_inheritance)
| map -c vep_max_consequence -m '' <(nor -h ##vep_impact## | rename vep_impact vep_max_impact ) -n vep_max_impact
| atmin 1 -gc reference,call,gene_symbol,PN qc_consequence_rank
| hide set_*
| calc is_patho_in_knowledgebase if(contains(gdx_classification_hist,'PATH') or contains(gdx_classification_hist,'LPATH'),1,0)
| replace auto_evidence if(is_patho_in_knowledgebase = 1 or DIAG_ACMGcat='Cat1B',listadd(auto_evidence,'PS1'),if(DIAG_ACMGcat='Cat1C',listadd(auto_evidence,'PM5'),auto_evidence))
| replace auto_evidence if(auto_evidence='',if(DIAG_ACMGcat='Cat1','PP5',if(DIAG_ACMGcat='Cat1C','BP6',auto_evidence)),auto_evidence)
| calc ls_PVS1 listsize(listfilter(auto_evidence,'x in ("PVS1")'))
| calc ls_PS1_4 listsize(listfilter(auto_evidence,'x in ("PS1","PS2","PS3","PS4")'))
| calc ls_PM1_6 listsize(listfilter(auto_evidence,'x in ("PM1","PM2","PM3","PM4","PM5","PM6")'))
| calc ls_PP1_5 listsize(listfilter(auto_evidence,'x in ("PP1","PP2","PP3","PP4","PP5")'))
| calc ls_BA1 listsize(listfilter(auto_evidence,'x in ("BA1")'))
| calc ls_BS1_4 listsize(listfilter(auto_evidence,'x in ("BS1","BS2","BS3","BS4")'))
| calc ls_BP1_7 listsize(listfilter(auto_evidence,'x in ("BP1","BP2","BP3","BP4","BP5","BP6","BP7")'))
| calc D_Pi if(ls_PVS1=1 and (ls_PS1_4>=1 or ls_PM1_6>=2 or (LS_PM1_6=1 and ls_PP1_5=1) or ls_PP1_5>=2),1,0)
| calc D_Pii if(ls_PS1_4>=2,1,0)
| calc D_Piii if(ls_PS1_4>=1 and ( ls_PM1_6>=3 or (ls_PM1_6>=2 and ls_PP1_5>=2) or (ls_PM1_6>=1 and ls_PP1_5>=4) ),1,0)
| calc D_P if(D_Pi+D_Pii+D_Piii > 0,1,0)
| calc D_LPi if(ls_PVS1=1 and ls_PM1_6=1,1,0)
| calc D_LPii if(ls_PS1_4=1 and (ls_PM1_6=1 or ls_PM1_6=2),1,0)
| calc D_LPiii if(ls_PS1_4=1 and ls_PP1_5>=2,1,0)
| calc D_LPiv if(ls_PM1_6>=3,1,0)
| calc D_LPv if(ls_PM1_6=2 and ls_PP1_5>=2,1,0)
| calc D_LPvi if(ls_PM1_6=1 and ls_PP1_5>=4,1,0)
| calc D_LP if(D_LPi+D_LPii+D_LPiii+D_LPiv+D_LPi+D_LPv+D_LPvi > 0,1,0)
| calc D_B if(ls_BA1=1 or ls_BS1_4>=2,1,0)
| calc D_LB if(ls_BS1_4=1 and ls_BP1_7=1 or ls_BP1_7>=2,1,0)
| calc D_PLP if(D_P=1,'P',if(D_LP=1,'LP',''))
| calc D_BLB if(D_B=1,'B',if(D_LB=1,'LB',''))
| calc auto_classification_simple if(len(D_PLP)>0 and len(D_BLB)=0,D_PLP,if(len(D_PLP)=0 and len(D_BLB)>0,D_BLB,'VUS'))
| replace auto_evidence if(contains(auto_evidence,'PP3'),replace(auto_evidence,'PP3',fidel_bp4_pp3),
if(contains(auto_evidence,'BP4'),replace(auto_evidence,'BP4',fidel_bp4_pp3),auto_evidence))
| calc auto_score if(auto_evidence!='',int(listnumsum(listmap(auto_evidence,"decode(x,'BA1,-8,BS1,-4,BS2,-4,BS3,-4,BS4,-4,BP1,-1,BP2,-1,BP3,-1,BP4,-1,BP4_supporting,-1,BP4_moderate,-2,BP4_strong,-4,BP4_verystrong,-8,BP5,-1,BP6,-1,BP7,-1,PVS1,8,PS1,4,PS2,4,PS3,4,PS4,4,PM1,2,PM2,1,PM3,2,PM4,2,PM5,2,PM6,2,PP1,1,PP2,1,PP3,1,PP3_strong,4,PP3_moderate,2,PP3_supporting,1,PP4,1,PP5,1,0')"))),0)
| calc auto_classification if(auto_score >= 10,'P',
if(auto_score >= 6 and auto_score <=9,'LP',
if(auto_score >= 0 and auto_score <=5,'VUS',
if(auto_score >= -6 and auto_score <= -1,'LB',
if(auto_score <= -7,'B','')))))
| hide gender,gene_dmaxAf,gene_rmaxAf,BP4_score,BP7_score,PP3_score,gene_MOI,Gene_DefTrans,D_*,ls_*,inheritance_*,femalecases_*,malecases_*
| replace auto_evidence replace(replace(replace(replace(#rc,'_verystrong',''),'_supporting',''),'_strong',''),'_moderate','')
/* multi-score */
| join -snpseg -xl gene_symbol,pn -xr gene_symbol,pn -maxseg 4000000 -f 20 -l -e 0 -rprefix GMS <(gor XA/gene/multiscore.gord -s PN -f #{tags} | select 1-3,gene_symbol,PN,probability-w2v_lit)
| select 1,2,Reference,Call,PN,Gene_Symbol,ref_af,VEP_impact,VEP_Consequence,callcopies,DIAG_ACMGcat,qc_repeat_regions,lof_pli,kdg,spliceai_max_impact,homCount,auto_evidence,diag_dominant,diag_chz,diag_homrecess,diag_denovo,inheritance,auto_score,auto_classification,cand_pheno_score,cand_norm_pheno_score,cand_rank_pheno_score,cand_rank_norm_pheno_score,OR_v5c,OR_v10c,OR_v20c,OR_g5c,OR_g10c,OR_g20c,gdx_classification_hist,fidel,alpha,revel,provean,gnomad_af,gnomad_hom,gnomad_gene_*,GMS_probability,GMS_rank,GMS_doc2vec,GMS_hrss_dipr,GMS_hrss_hpoa,GMS_hrss_lit,GMS_jaccard_dipr,GMS_jaccard_lit,GMS_w2v_dipr,GMS_w2v_lit
| rename diag_acmgcat ACMG_oldCat
| rename diag_(.*) oheInher_#{1}
| rename auto_(.*) ACMG_#{1}
| varjoin -r -l [#clinvar#]
| calc oheVep_3_prime_UTR_variant,oheVep_5_prime_UTR_variant,oheVep_frameshift_variant,oheVep_inframe_deletion,oheVep_inframe_insertion,oheVep_intron_variant,oheVep_missense_variant,oheVep_non_coding_transcript_exon_variant,oheVep_protein_altering_variant,oheVep_splice_acceptor_variant,oheVep_splice_donor_variant,oheVep_splice_region_variant,oheVep_start_lost,oheVep_stop_gained,oheVep_stop_lost,oheVep_stop_retained_variant,oheVep_synonymous_variant if(VEP_consequence = '3_prime_UTR_variant',1,0),if(VEP_consequence = '5_prime_UTR_variant',1,0),if(VEP_consequence = 'frameshift_variant',1,0),if(VEP_consequence = 'inframe_deletion',1,0),if(VEP_consequence = 'inframe_insertion',1,0),if(VEP_consequence = 'intron_variant',1,0),if(VEP_consequence = 'missense_variant',1,0),if(VEP_consequence = 'non_coding_transcript_exon_variant',1,0),if(VEP_consequence = 'protein_altering_variant',1,0),if(VEP_consequence = 'splice_acceptor_variant',1,0),if(VEP_consequence = 'splice_donor_variant',1,0),if(VEP_consequence = 'splice_region_variant',1,0),if(VEP_consequence = 'start_lost',1,0),if(VEP_consequence = 'stop_gained',1,0),if(VEP_consequence = 'stop_lost',1,0),if(VEP_consequence = 'stop_retained_variant',1,0),if(VEP_consequence = 'synonymous_variant',1,0)
/* one-hot-GDXclass */
| calc oheGDXclass_BEN,oheGDXclass_GDXBEN,oheGDXclass_LBEN,oheGDXclass_LPATH,oheGDXclass_PATH,oheGDXclass_UV,oheGDXclass_VUS,oheGDXclass_null if(gdx_classification_hist = 'BEN',1,0),if(gdx_classification_hist = 'GDXBEN',1,0),if(gdx_classification_hist = 'LBEN',1,0),if(gdx_classification_hist = 'LPATH',1,0),if(gdx_classification_hist = 'PATH',1,0),if(gdx_classification_hist = 'UV',1,0),if(gdx_classification_hist = 'VUS',1,0),if(gdx_classification_hist = 'null',1,0)
/* one-hot-Clinvarclass */
| calc clinv_class decode(clinvar_classification,'benign,B,benign/likelybenign,LB,likelybenign,LB,likelypathogenic,LP,pathogenic,P,pathogenic/likelypathogenic,LP,uncertainsignificance,VUS,null')
| calc oheClinvclass_B,oheClinvclass_LB,oheClinvclass_LP,oheClinvclass_P,oheClinvclass_VUS,oheClinvclass_null if(clinv_class = 'B',1,0),if(clinv_class = 'LB',1,0),if(clinv_class = 'LP',1,0),if(clinv_class = 'P',1,0),if(clinv_class = 'VUS',1,0),if(clinv_class = 'null',1,0)
| calc oheSplice_high,oheSplice_moderate,oheSplice_other if(spliceai_max_impact = 'HIGH',1,0),if(spliceai_max_impact = 'MODERATE',1,0),if(not(spliceai_max_impact in ('HIGH','MODERATE')),1,0)
| calc use_prov if(isfloat(provean) and provean != '-1',1,0)
| replace provean if(use_prov=1,provean,'0')
| replace ref_AF,gnomad_af,lof_pli str(if(not(isfloat(#rc)) or isfloat(#rc) and float(#rc)<1e-20,-20,log(float(#rc))))
| calc oheACMGev_BA1,oheACMGev_BS1,oheACMGev_BS2,oheACMGev_BS3,oheACMGev_BS4,oheACMGev_BP1,oheACMGev_BP2,oheACMGev_BP3,oheACMGev_BP4,oheACMGev_BP4_supporting,oheACMGev_BP4_moderate,oheACMGev_BP4_strong,oheACMGev_BP4_verystrong,oheACMGev_BP5,oheACMGev_BP6,oheACMGev_BP7,oheACMGev_PVS1,oheACMGev_PS1,oheACMGev_PS2,oheACMGev_PS3,oheACMGev_PS4,oheACMGev_PM1,oheACMGev_PM2,oheACMGev_PM3,oheACMGev_PM4,oheACMGev_PM5,oheACMGev_PM6,oheACMGev_PP1,oheACMGev_PP2,oheACMGev_PP3,oheACMGev_PP3_strong,oheACMGev_PP3_moderate,oheACMGev_PP3_supporting,oheACMGev_PP4,oheACMGev_PP5 if(listhasany(acmg_evidence,'BA1'),1,0),if(listhasany(acmg_evidence,'BS1'),1,0),if(listhasany(acmg_evidence,'BS2'),1,0),if(listhasany(acmg_evidence,'BS3'),1,0),if(listhasany(acmg_evidence,'BS4'),1,0),if(listhasany(acmg_evidence,'BP1'),1,0),if(listhasany(acmg_evidence,'BP2'),1,0),if(listhasany(acmg_evidence,'BP3'),1,0),if(listhasany(acmg_evidence,'BP4'),1,0),if(listhasany(acmg_evidence,'BP4_supporting'),1,0),if(listhasany(acmg_evidence,'BP4_moderate'),1,0),if(listhasany(acmg_evidence,'BP4_strong'),1,0),if(listhasany(acmg_evidence,'BP4_verystrong'),1,0),if(listhasany(acmg_evidence,'BP5'),1,0),if(listhasany(acmg_evidence,'BP6'),1,0),if(listhasany(acmg_evidence,'BP7'),1,0),if(listhasany(acmg_evidence,'PVS1'),1,0),if(listhasany(acmg_evidence,'PS1'),1,0),if(listhasany(acmg_evidence,'PS2'),1,0),if(listhasany(acmg_evidence,'PS3'),1,0),if(listhasany(acmg_evidence,'PS4'),1,0),if(listhasany(acmg_evidence,'PM1'),1,0),if(listhasany(acmg_evidence,'PM2'),1,0),if(listhasany(acmg_evidence,'PM3'),1,0),if(listhasany(acmg_evidence,'PM4'),1,0),if(listhasany(acmg_evidence,'PM5'),1,0),if(listhasany(acmg_evidence,'PM6'),1,0),if(listhasany(acmg_evidence,'PP1'),1,0),if(listhasany(acmg_evidence,'PP2'),1,0),if(listhasany(acmg_evidence,'PP3'),1,0),if(listhasany(acmg_evidence,'PP3_strong'),1,0),if(listhasany(acmg_evidence,'PP3_moderate'),1,0),if(listhasany(acmg_evidence,'PP3_supporting'),1,0),if(listhasany(acmg_evidence,'PP4'),1,0),if(listhasany(acmg_evidence,'PP5'),1,0)
| #WRITE#
);
""")
We run our query in test mode to see its columns:
%%gor
$mydefs5
nor [#output#] | top 1 | unpivot #1-
Query ran in 1.53 sec Query fetched 127 rows in 0.01 sec (total time 1.54 sec)
Col_Name | Col_Value | |
---|---|---|
0 | chrom | chr1 |
1 | pos | 69134 |
2 | Reference | A |
3 | Call | G |
4 | pn | 511157 |
... | ... | ... |
122 | oheACMGev_PP3_strong | 0 |
123 | oheACMGev_PP3_moderate | 0 |
124 | oheACMGev_PP3_supporting | 0 |
125 | oheACMGev_PP4 | 1 |
126 | oheACMGev_PP5 | 0 |
127 rows × 2 columns
We now modify our definitions and run this genome wide with 15 workers. Note the write may cause an error in the end because current incompatibilty between PARGOR and fork-write. Also note that if we had more samples with gene Multiscore, the execution would not be much longer because with 600k samples we only need 600 workers in GORdb, i.e. 1200 cores and about 3Gb per worker - very modest resources indeed!
mydefs5 = qh5.add_many("""
def #TOP# = /* top 100 */;
def #WRITE# = calc filepath 'user_data/hakon/Ranker/output/vars4ranker_'+left(md5('#{tags}'),5)+'.gorz'
| write -f filepath -r #{fork};
""")
%%gor
$mydefs5
nor [#output#] | top 1
Appendix - generating one-hot-encoding with GOR queries¶
We will show two examples here. First we take the list used in the decode() function in ACMG:## Appendic - generating one-hot-encoding with GOR queries
%%gor dfQuery <<
norrows 1 | calc x 'BA1,-8,BS1,-4,BS2,-4,BS3,-4,BS4,-4,BP1,-1,BP2,-1,BP3,-1,BP4,-1,BP4_supporting,-1,BP4_moderate,-2,BP4_strong,-4,BP4_verystrong,-8,BP5,-1,BP6,-1,BP7,-1,PVS1,8,PS1,4,PS2,4,PS3,4,PS4,4,PM1,2,PM2,1,PM3,2,PM4,2,PM5,2,PM6,2,PP1,1,PP2,1,PP3,1,PP3_strong,4,PP3_moderate,2,PP3_supporting,1,PP4,1,PP5,1'
| select x
| split x
| rownum
| where mod(rownum,2)=1
| calc cmd_col 'oheACMGev_'+x
| calc cmd_form "if(listhasany(acmg_evidence,'"+x+"'),1,0)"
| group -lis -sc cmd_col,cmd_form
| calc cmd 'calc '+lis_cmd_col+' '+lis_cmd_form
| select cmd
Query ran in 0.62 sec Query fetched 1 rows in 0.01 sec (total time 0.63 sec)
print(dfQuery.at[0,'cmd'])
calc oheACMGev_BA1,oheACMGev_BS1,oheACMGev_BS2,oheACMGev_BS3,oheACMGev_BS4,oheACMGev_BP1,oheACMGev_BP2,oheACMGev_BP3,oheACMGev_BP4,oheACMGev_BP4_supporting,oheACMGev_BP4_moderate,oheACMGev_BP4_strong,oheACMGev_BP4_verystrong,oheACMGev_BP5,oheACMGev_BP6,oheACMGev_BP7,oheACMGev_PVS1,oheACMGev_PS1,oheACMGev_PS2,oheACMGev_PS3,oheACMGev_PS4,oheACMGev_PM1,oheACMGev_PM2,oheACMGev_PM3,oheACMGev_PM4,oheACMGev_PM5,oheACMGev_PM6,oheACMGev_PP1,oheACMGev_PP2,oheACMGev_PP3,oheACMGev_PP3_strong,oheACMGev_PP3_moderate,oheACMGev_PP3_supporting,oheACMGev_PP4,oheACMGev_PP5 if(listhasany(acmg_evidence,'BA1'),1,0),if(listhasany(acmg_evidence,'BS1'),1,0),if(listhasany(acmg_evidence,'BS2'),1,0),if(listhasany(acmg_evidence,'BS3'),1,0),if(listhasany(acmg_evidence,'BS4'),1,0),if(listhasany(acmg_evidence,'BP1'),1,0),if(listhasany(acmg_evidence,'BP2'),1,0),if(listhasany(acmg_evidence,'BP3'),1,0),if(listhasany(acmg_evidence,'BP4'),1,0),if(listhasany(acmg_evidence,'BP4_supporting'),1,0),if(listhasany(acmg_evidence,'BP4_moderate'),1,0),if(listhasany(acmg_evidence,'BP4_strong'),1,0),if(listhasany(acmg_evidence,'BP4_verystrong'),1,0),if(listhasany(acmg_evidence,'BP5'),1,0),if(listhasany(acmg_evidence,'BP6'),1,0),if(listhasany(acmg_evidence,'BP7'),1,0),if(listhasany(acmg_evidence,'PVS1'),1,0),if(listhasany(acmg_evidence,'PS1'),1,0),if(listhasany(acmg_evidence,'PS2'),1,0),if(listhasany(acmg_evidence,'PS3'),1,0),if(listhasany(acmg_evidence,'PS4'),1,0),if(listhasany(acmg_evidence,'PM1'),1,0),if(listhasany(acmg_evidence,'PM2'),1,0),if(listhasany(acmg_evidence,'PM3'),1,0),if(listhasany(acmg_evidence,'PM4'),1,0),if(listhasany(acmg_evidence,'PM5'),1,0),if(listhasany(acmg_evidence,'PM6'),1,0),if(listhasany(acmg_evidence,'PP1'),1,0),if(listhasany(acmg_evidence,'PP2'),1,0),if(listhasany(acmg_evidence,'PP3'),1,0),if(listhasany(acmg_evidence,'PP3_strong'),1,0),if(listhasany(acmg_evidence,'PP3_moderate'),1,0),if(listhasany(acmg_evidence,'PP3_supporting'),1,0),if(listhasany(acmg_evidence,'PP4'),1,0),if(listhasany(acmg_evidence,'PP5'),1,0)
and for the VEP logic:
%%gor
create x = gor freezes/xa1_v6/main/vep_single.gorz | group genome -gc max_consequence,max_impact -count;
nor [x] | where allcount > 5000 and max_impact != 'LOWEST' | calc c 'oheVep_'+max_consequence | calc i "if(max_consequence = '"+max_consequence+"',1,0)"
| group -lis -sc c,i
| calc cmd 'calc '+lis_c+' '+lis_i
| select cmd
Query ran in 0.25 sec Query fetched 1 rows in 0.01 sec (total time 0.25 sec)
cmd | |
---|---|
0 | calc oheVep_3_prime_UTR_variant,oheVep_5_prime... |
and we can test it like this:
%%gor
create x = gor freezes/xa1_v6/main/vep_single.gorz | group genome -gc max_consequence,max_impact -count;
nor [x] | select max_consequence
| calc oheVep_3_prime_UTR_variant,oheVep_5_prime_UTR_variant,oheVep_frameshift_variant,oheVep_inframe_deletion,oheVep_inframe_insertion,oheVep_intron_variant,oheVep_missense_variant,oheVep_non_coding_transcript_exon_variant,oheVep_protein_altering_variant,oheVep_splice_acceptor_variant,oheVep_splice_donor_variant,oheVep_splice_region_variant,oheVep_start_lost,oheVep_stop_gained,oheVep_stop_lost,oheVep_stop_retained_variant,oheVep_synonymous_variant if(max_consequence = '3_prime_UTR_variant',1,0),if(max_consequence = '5_prime_UTR_variant',1,0),if(max_consequence = 'frameshift_variant',1,0),if(max_consequence = 'inframe_deletion',1,0),if(max_consequence = 'inframe_insertion',1,0),if(max_consequence = 'intron_variant',1,0),if(max_consequence = 'missense_variant',1,0),if(max_consequence = 'non_coding_transcript_exon_variant',1,0),if(max_consequence = 'protein_altering_variant',1,0),if(max_consequence = 'splice_acceptor_variant',1,0),if(max_consequence = 'splice_donor_variant',1,0),if(max_consequence = 'splice_region_variant',1,0),if(max_consequence = 'start_lost',1,0),if(max_consequence = 'stop_gained',1,0),if(max_consequence = 'stop_lost',1,0),if(max_consequence = 'stop_retained_variant',1,0),if(max_consequence = 'synonymous_variant',1,0)
Query ran in 0.08 sec Query fetched 25 rows in 0.02 sec (total time 0.10 sec)
max_consequence | oheVep_3_prime_UTR_variant | oheVep_5_prime_UTR_variant | oheVep_frameshift_variant | oheVep_inframe_deletion | oheVep_inframe_insertion | oheVep_intron_variant | oheVep_missense_variant | oheVep_non_coding_transcript_exon_variant | oheVep_protein_altering_variant | oheVep_splice_acceptor_variant | oheVep_splice_donor_variant | oheVep_splice_region_variant | oheVep_start_lost | oheVep_stop_gained | oheVep_stop_lost | oheVep_stop_retained_variant | oheVep_synonymous_variant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3_prime_UTR_variant | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5_prime_UTR_variant | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | coding_sequence_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | downstream_gene_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | frameshift_variant | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | incomplete_terminal_codon_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | inframe_deletion | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | inframe_insertion | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | intergenic_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | intron_variant | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | mature_miRNA_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | missense_variant | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | non_coding_transcript_exon_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | non_coding_transcript_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | protein_altering_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | splice_acceptor_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | splice_donor_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | splice_region_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
18 | start_lost | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
19 | start_retained_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | stop_gained | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
21 | stop_lost | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
22 | stop_retained_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
23 | synonymous_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
24 | upstream_gene_variant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |