HPO Phewas¶

HPO_Phewas

In [19]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}

HPO Phewas based on a Phecode-like catalog¶

In this notebook, we will use the phenotypes defined in the notebook HPO_Phecode in a gene based Phewas.

We use variants and sequence read coverage to perform joint-genotyping and output horizontal organized genotypes.
Then we aggregate the variants based on all the samples to enable very efficient case-control analysis.
Finally, we use the PARALLEL command to calculate Fisher-Exact test for all variant and each phenotype.

Joint genotyping¶

We use the non-probabilistic GTGEN command to perform the genotyping. As input, we need row-level based (sparse) variants in a measure of sufficient read coverage, here set to 6 reads. We use the goodcov table instead of the more generic segcov table, because of much smaller data footprint. Given that we only have less than 25k samples, we will not use PARTGOR parallelism and only store the genotypes in a single horizontal bucket.

In [20]:

qh = GQH.GOR_Query_Helper()
mydefs = qh.add_many("""
def #hpo_pn# = user_data/hakon/hpo_pheno.tsv;
def #wgsvars# = source/var/wgs_varcalls.gord -s PN ;
def #goodcov# = source/cov/goodcov_6.wgs.gord -s PN ;
create #buckets# = nor -asdict #wgsvars# | select #2 | rename #1 PN | calc bucket 'b1_1';
""")

In [21]:

%%gor
$mydefs
nor [#buckets#] | group -count

Query ran in 0.83 sec
Query fetched 1 rows in 0.06 sec (total time 0.89 sec)

Out[21]:

	allCount
0	24158

We see that we have over 24k samples. Next we create the genotypes by fetching all the variants in 20bp vicinity of a gene of choice, e.g. BRCA1. A GT column is mandatory for GTGEN and it is calculated based on the QC metrics stored in each row, e.g. we map low quality variants to unknown '3'. Because we don't assume that the variant have been properly normalized, we perform a left-normalization and ensure that there are no duplicate rows per samples, using ATMAX. Finally, we feed the variant rows into GTGEN that also joins the coverage information. Depending on the size of the gene, this query may take a minute or two to run.

In [22]:

mydefs = qh.add_many("""
create #genotypes# = gor #genes# | where gene_symbol = 'BRCA1' | segproj 
| join -segsnp -f 20 -ir <(gor #wgsvars# | calc gt if(GL_Call>=20,callcopies,'3') | select 1-4,gt,pn )
| varnorm -left #3 #4
| atmax 1 -gc pn gt
| gtgen -gc #3,#4 [#buckets#] <(gor #goodcov# ) -maxseg 10000;
""")

We can inspect the genotypes and generate a simple histogram of the genotype values:

In [23]:

%%gor
$mydefs
gor [#genotypes#] | top 10 | replace values listcount(fsvmap(values,1,'x',',')) 

Query ran in 0.11 sec
Query fetched 10 rows in 0.06 sec (total time 0.17 sec)

Out[23]:

	CHROM	POS	Reference	Call	Bucket	Values
0	chr17	43044346	C	T	b1_1	0;3121,1;36,3;21001
1	chr17	43044351	C	T	b1_1	0;3155,1;1,3;21002
2	chr17	43044355	T	C	b1_1	0;3168,1;1,3;20989
3	chr17	43044358	T	G	b1_1	0;3165,1;1,3;20992
4	chr17	43044386	A	G	b1_1	0;3173,1;1,3;20984
5	chr17	43044391	G	A	b1_1	0;2327,1;676,2;169,3;20986
6	chr17	43044492	T	C	b1_1	0;3149,1;2,3;21007
7	chr17	43044565	C	T	b1_1	0;3148,1;3,3;21007
8	chr17	43044725	C	T	b1_1	0;3088,1;1,3;21069
9	chr17	43044750	G	A	b1_1	0;3080,1;3,3;21075

and we see the total number of variants as:

In [24]:

%%gor
$mydefs
gor [#genotypes#] | group chrom -count

Query ran in 0.10 sec
Query fetched 1 rows in 1.09 sec (total time 1.19 sec)

Out[24]:

	CHROM	bpStart	bpStop	allCount
0	chr17	0	83257441	5691

Next we want to generate summary statistics for all the samples. Typically, the number of samples that we exclude is much smaller than the number of controls. Having the counts for all the samples, we can generate the counts for the controls using simple substraction. Here we also calculate allele number (AN), allele count (AC) and allele frequence. Note, this calculation is only correct for autosomal chromosomes - see the report-builder freeze_af.ftl.yml() for gender specific implementation of AF.

In [25]:

mydefs = qh.add_many("""
create #all# = gor [#genotypes#] 
| csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor [#buckets#] | select PN | calc cc 'all')
| pivot -gc #3,#4,cc -v 0,1,2,3 gt -e 0 
| rename (.*_GTcount) all_#{1}
| hide cc
| calc AN 2*(all_0_GTcount + all_1_GTcount + all_2_GTcount)
| calc AC all_1_GTcount + 2*all_2_GTcount 
| calc AF GFORM(AC/AN,4,4);
""")

In [26]:

%%gor
$mydefs
gor [#all#] | top 10

Query ran in 0.10 sec
Query fetched 10 rows in 0.01 sec (total time 0.10 sec)

Out[26]:

	CHROM	POS	Reference	Call	all_0_GTcount	all_1_GTcount	all_2_GTcount	all_3_GTcount	AN	AC	AF
0	chr17	43044346	C	T	3121	36	0	21001	6314	36	0.005702
1	chr17	43044351	C	T	3155	1	0	21002	6312	1	0.000158
2	chr17	43044355	T	C	3168	1	0	20989	6338	1	0.000158
3	chr17	43044358	T	G	3165	1	0	20992	6332	1	0.000158
4	chr17	43044386	A	G	3173	1	0	20984	6348	1	0.000158
5	chr17	43044391	G	A	2327	676	169	20986	6344	1014	0.159800
6	chr17	43044492	T	C	3149	2	0	21007	6302	2	0.000317
7	chr17	43044565	C	T	3148	3	0	21007	6302	3	0.000476
8	chr17	43044725	C	T	3088	1	0	21069	6178	1	0.000162
9	chr17	43044750	G	A	3080	3	0	21075	6166	3	0.000487

The query above took short time to run, because we are only reading the genotypes once and the command CSVCC is efficiently calculating the frequency of each genotype value, per variant row. However, we will be using close to 5k phenotypes, therefore we want to use the PARALLEL command to split up the load from the many phenotypes at hand. In order to split the load evenly, we will order the phenotypes by size and round-robin them to the workers.

In [27]:

mydefs = qh.add_many("""
create #phenosizeordered# = nor #hpo_pn# 
| group -gc pheno_code -count | sort -c allcount:n | rownum;
""")

We are now ready to setup the PARALLEL macro command that uses a rownum in the parts relation and a modulus filter to provide different phenotype input to CSVCC in each parallel task. First, we use top 1 to run only one partition and top 10 to terminate the calculation quickly:

In [28]:

mydefs = qh.add_many("""
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# | top 1) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] 
  <(nor #hpo_pn# 
    | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount
| top 10
);
""")

In [29]:

%%gor
$mydefs
gor [#phewas#]

Query ran in 0.15 sec
Query fetched 10 rows in 0.01 sec (total time 0.16 sec)

Out[29]:

	CHROM	POS	Reference	Call	Pheno	case_0_GTcount	case_3_GTcount	excl_0_GTcount	...	all_1_GTcount	all_3_GTcount	AN	AC	AF	ctrl_0_GTcount	ctrl_1_GTcount	ctrl_3_GTcount
0	chr17	43044346	C	T	HP:0030714	0	1	6	...	36	21001	6314	36	0.005702	3115	36	20913
1	chr17	43044346	C	T	HP:0003049	0	1	11	...	36	21001	6314	36	0.005702	3110	35	20951
2	chr17	43044346	C	T	HP:0000218	26	429	17	...	36	21001	6314	36	0.005702	3078	36	20378
3	chr17	43044346	C	T	HP:0008661	3	15	3	...	36	21001	6314	36	0.005702	3115	36	20975
4	chr17	43044346	C	T	HP:0004385	0	3	294	...	36	21001	6314	36	0.005702	2827	29	20098
5	chr17	43044346	C	T	HP:0009103	0	4	62	...	36	21001	6314	36	0.005702	3059	35	20606
6	chr17	43044346	C	T	HP:0002265	2	11	3	...	36	21001	6314	36	0.005702	3116	36	20937
7	chr17	43044346	C	T	HP:0008668	0	1	9	...	36	21001	6314	36	0.005702	3112	36	20939
8	chr17	43044346	C	T	HP:0010883	2	6	21	...	36	21001	6314	36	0.005702	3098	36	20868
9	chr17	43044346	C	T	HP:0041064	0	2	38	...	36	21001	6314	36	0.005702	3083	33	20729

10 rows × 24 columns

Notice how we use the all_*_GTcounts columns to calculate the control counts. Since the number of samples that are excluded are typically much fewer than the controls, this means that the CSVCC command needs to perform much less counting. We can see the speedup from the query below:

In [30]:

%%gor
$mydefs
nor user_data/hakon/hpo_pheno_summary.tsv | select 1-3 
| calc x casecount+exclcount
| multimap -cartesian <(nor [#buckets#] | group -count)
| group -sum -ic x,allcount
| calc f sum_x/sum_allcount

Query ran in 0.13 sec
Query fetched 1 rows in 0.17 sec (total time 0.31 sec)

Out[30]:

	sum_x	sum_allCount	f
0	1511312	112817860	0.013396

From the above, we can see that we get close to 100-fold speedup by using this approach to calculate the Fisher-Exact statistics! Now we complete the final query by adding p-value and odds-ratio calculations for multiplicative, dominant, ad recessive model.

In [31]:

mydefs = qh.add_many("""
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# ) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] 
  <(nor #hpo_pn# 
    | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount

| calc pVal_mm EFORM(PVAL(case_2_GTcount*2+case_1_GTcount,case_0_GTcount*2 + case_1_GTcount,ctrl_2_GTcount*2+ctrl_1_GTcount,ctrl_0_GTcount*2 + ctrl_1_GTcount),5,1)
| calc OR_mm if((ctrl_2_GTcount*2+ctrl_1_GTcount) = 0 or (case_0_GTcount*2 + case_1_GTcount) = 0 and (ctrl_0_GTcount*2 + ctrl_1_GTcount) != 0,1000,if((case_0_GTcount*2 + case_1_GTcount) = 0 or (ctrl_0_GTcount*2 + ctrl_1_GTcount) = 0,float('NaN'),((case_2_GTcount*2+case_1_GTcount)/(case_0_GTcount*2 + case_1_GTcount))/((ctrl_2_GTcount*2+ctrl_1_GTcount)/(ctrl_0_GTcount*2 + ctrl_1_GTcount))))
  
| calc pVal_dom EFORM(PVAL(case_2_GTcount+case_1_GTcount,case_0_GTcount,ctrl_2_GTcount+ctrl_1_GTcount,ctrl_0_GTcount),5,1)
| calc OR_dom if(ctrl_2_GTcount+ctrl_1_GTcount = 0 or case_0_GTcount = 0 and ctrl_0_GTcount != 0,1000,if(case_0_GTcount = 0 or ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount+case_1_GTcount/case_0_GTcount)/(ctrl_2_GTcount+ctrl_1_GTcount/ctrl_0_GTcount)))

| calc pVal_rec EFORM(PVAL(case_2_GTcount,case_1_GTcount + case_0_GTcount,ctrl_2_GTcount,ctrl_1_GTcount + ctrl_0_GTcount),5,1)
| calc OR_rec if(ctrl_2_GTcount = 0 or case_1_GTcount + case_0_GTcount = 0 and ctrl_1_GTcount + ctrl_0_GTcount != 0,1000,if(case_1_GTcount + case_0_GTcount = 0 or ctrl_1_GTcount + ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount/case_1_GTcount + case_0_GTcount)/(ctrl_2_GTcount/ctrl_1_GTcount + ctrl_0_GTcount)))

| calc lnse sqrt(1.0/max(1.0,float(case_2_GTcount*2+case_1_GTcount))+1.0/max(1.0,float(case_0_GTcount*2 + case_1_GTcount))+1.0/max(1.0,float(ctrl_2_GTcount*2+ctrl_1_GTcount))+1.0/max(1.0,float(ctrl_0_GTcount*2 + ctrl_1_GTcount)))
| calc OR_SEmm GFORM((exp(ln(OR_mm)+1.96*lnse)-exp(ln(OR_mm)-1.96*lnse))/(2.0*1.96),4,3)

| select 1,2,Reference,Call,Pheno,AF,pVal_mm,OR_mm,OR_SEmm,pVal_dom,OR_dom,pVal_rec,OR_rec,case_0_GTcount,case_1_GTcount,case_2_GTcount,case_3_GTcount,ctrl_0_GTcount,ctrl_1_GTcount,ctrl_2_GTcount,ctrl_3_GTcount,AN,AC
| where float(pval_mm) < 0.1 and case_1_GTcount+case_2_GTcount > 0
| replace OR_* GFORM(float(#rc),2,2)
);
""")

In [32]:

%%gor
$mydefs
gor [#phewas#] | top 100

Query ran in 0.20 sec
Query fetched 100 rows in 0.02 sec (total time 0.22 sec)

Out[32]:

	CHROM	POS	Reference	Call	Pheno	AF	pVal_mm	OR_mm	OR_SEmm	pVal_dom	...	case_0_GTcount	case_1_GTcount	case_2_GTcount	case_3_GTcount	ctrl_0_GTcount	ctrl_1_GTcount	ctrl_2_GTcount	ctrl_3_GTcount	AN	AC
0	chr17	43044346	C	T	HP:0100852	0.005702	0.03900	2.7	1.4	0.03800	...	241	6	0	1172	2812	26	0	19428	6314	36
1	chr17	43044346	C	T	HP:0001337	0.005702	0.06500	3.5	2.7	0.06500	...	80	3	0	579	2916	31	0	19499	6314	36
2	chr17	43044346	C	T	HP:0031197	0.005702	0.01100	180.0	740.0	0.01100	...	0	1	0	0	3112	35	0	20985	6314	36
3	chr17	43044346	C	T	HP:0001696	0.005702	0.05600	20.0	41.0	0.05600	...	4	1	0	15	3116	35	0	20966	6314	36
4	chr17	43044346	C	T	HP:0031218	0.005702	0.01100	180.0	760.0	0.01100	...	0	1	0	4	3108	34	0	20929	6314	36
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	chr17	43044346	C	T	HP:0000549	0.005702	0.01000	5.2	3.3	0.01000	...	71	4	0	744	3024	32	0	19948	6314	36
96	chr17	43044346	C	T	HP:0001397	0.005702	0.09800	11.0	20.0	0.09800	...	8	1	0	30	3113	35	0	20971	6314	36
97	chr17	43044346	C	T	HP:0006385	0.005702	0.00047	92.0	130.0	0.00038	...	1	2	0	20	3120	34	0	20980	6314	36
98	chr17	43044346	C	T	HP:0100836	0.005702	0.08800	12.0	23.0	0.08800	...	7	1	0	32	3108	35	0	20946	6314	36
99	chr17	43044346	C	T	HP:0006699	0.005702	0.04400	26.0	55.0	0.04400	...	3	1	0	9	3093	34	0	20860	6314	36

100 rows × 23 columns

We can see that by spreading the phenotypes across 20 parallel workers, we complete about 25 million Fisher-Exact analysis test for 25k samples in less than a minute! We can see the names and filter out the important associations like this:

In [33]:

%%gor
$mydefs

nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
      | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
    ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n

Query ran in 0.18 sec
Query fetched 21,235 rows in 1.19 sec (total time 1.37 sec)

Out[33]:

	CHROM	POS	Reference	Call	Pheno	AF	pVal_mm	OR_mm	OR_SEmm	name	rank_pVal_mm
0	chr17	43159574	T	C	HP:0011458	0.594900	1.900000e-219	14.0	1.50	Abdominal symptom	1
1	chr17	43159574	T	C	HP:0012531	0.594900	1.900000e-219	18.0	2.00	Pain	1
2	chr17	43169893	C	T	HP:0002017	0.341000	1.900000e-219	17.0	1.60	Nausea and vomiting	1
3	chr17	43169893	C	T	HP:0002037	0.341000	1.900000e-219	41.0	7.20	Inflammation of the large intestine	1
4	chr17	43169893	C	T	HP:0011458	0.341000	1.900000e-219	13.0	0.91	Abdominal symptom	1
...	...	...	...	...	...	...	...	...	...	...	...
21230	chr17	43097369	A	G	HP:0012014	0.001242	9.900000e-03	120.0	240.00	EEG with central focal spikes	1
21231	chr17	43097369	A	G	HP:0010471	0.001242	9.900000e-03	120.0	240.00	Oligosacchariduria	3
21232	chr17	43100620	CATATATATATGTT	C	HP:0008458	0.004777	9.900000e-03	210.0	920.00	Progressive congenital scoliosis	6
21233	chr17	43100664	A	G	HP:0010301	0.026530	9.900000e-03	19.0	25.00	Spinal dysraphism	7
21234	chr17	43133822	A	G	HP:0000327	0.001657	9.900000e-03	140.0	370.00	Hypoplasia of the maxilla	9

21235 rows × 11 columns

Finally, we print out the complete query:

In [34]:

print(mydefs)
print("""nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
      | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
    ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n""")

def #hpo_pn# = user_data/hakon/hpo_pheno.tsv;
def #wgsvars# = source/var/wgs_varcalls.gord -s PN;
def #goodcov# = source/cov/goodcov_6.wgs.gord -s PN;
create #buckets# = nor -asdict #wgsvars# | select #2 | rename #1 PN | calc bucket 'b1_1';
create #genotypes# = gor #genes# | where gene_symbol = 'BRCA1' | segproj 
| join -segsnp -f 20 -ir <(gor #wgsvars# | calc gt if(GL_Call>=20,callcopies,'3') | select 1-4,gt,pn )
| varnorm -left #3 #4
| atmax 1 -gc pn gt
| gtgen -gc #3,#4 [#buckets#] <(gor #goodcov# ) -maxseg 10000;
create #all# = gor [#genotypes#] 
| csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor [#buckets#] | select PN | calc cc 'all')
| pivot -gc #3,#4,cc -v 0,1,2,3 gt -e 0 
| rename (.*_GTcount) all_#{1}
| hide cc
| calc AN 2*(all_0_GTcount + all_1_GTcount + all_2_GTcount)
| calc AC all_1_GTcount + 2*all_2_GTcount 
| calc AF GFORM(AC/AN,4,4);
create #phenosizeordered# = nor #hpo_pn# 
| group -gc pheno_code -count | sort -c allcount:n | rownum;
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# ) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] 
  <(nor #hpo_pn# 
    | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount

| calc pVal_mm EFORM(PVAL(case_2_GTcount*2+case_1_GTcount,case_0_GTcount*2 + case_1_GTcount,ctrl_2_GTcount*2+ctrl_1_GTcount,ctrl_0_GTcount*2 + ctrl_1_GTcount),5,1)
| calc OR_mm if((ctrl_2_GTcount*2+ctrl_1_GTcount) = 0 or (case_0_GTcount*2 + case_1_GTcount) = 0 and (ctrl_0_GTcount*2 + ctrl_1_GTcount) != 0,1000,if((case_0_GTcount*2 + case_1_GTcount) = 0 or (ctrl_0_GTcount*2 + ctrl_1_GTcount) = 0,float('NaN'),((case_2_GTcount*2+case_1_GTcount)/(case_0_GTcount*2 + case_1_GTcount))/((ctrl_2_GTcount*2+ctrl_1_GTcount)/(ctrl_0_GTcount*2 + ctrl_1_GTcount))))
  
| calc pVal_dom EFORM(PVAL(case_2_GTcount+case_1_GTcount,case_0_GTcount,ctrl_2_GTcount+ctrl_1_GTcount,ctrl_0_GTcount),5,1)
| calc OR_dom if(ctrl_2_GTcount+ctrl_1_GTcount = 0 or case_0_GTcount = 0 and ctrl_0_GTcount != 0,1000,if(case_0_GTcount = 0 or ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount+case_1_GTcount/case_0_GTcount)/(ctrl_2_GTcount+ctrl_1_GTcount/ctrl_0_GTcount)))

| calc pVal_rec EFORM(PVAL(case_2_GTcount,case_1_GTcount + case_0_GTcount,ctrl_2_GTcount,ctrl_1_GTcount + ctrl_0_GTcount),5,1)
| calc OR_rec if(ctrl_2_GTcount = 0 or case_1_GTcount + case_0_GTcount = 0 and ctrl_1_GTcount + ctrl_0_GTcount != 0,1000,if(case_1_GTcount + case_0_GTcount = 0 or ctrl_1_GTcount + ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount/case_1_GTcount + case_0_GTcount)/(ctrl_2_GTcount/ctrl_1_GTcount + ctrl_0_GTcount)))

| calc lnse sqrt(1.0/max(1.0,float(case_2_GTcount*2+case_1_GTcount))+1.0/max(1.0,float(case_0_GTcount*2 + case_1_GTcount))+1.0/max(1.0,float(ctrl_2_GTcount*2+ctrl_1_GTcount))+1.0/max(1.0,float(ctrl_0_GTcount*2 + ctrl_1_GTcount)))
| calc OR_SEmm GFORM((exp(ln(OR_mm)+1.96*lnse)-exp(ln(OR_mm)-1.96*lnse))/(2.0*1.96),4,3)

| select 1,2,Reference,Call,Pheno,AF,pVal_mm,OR_mm,OR_SEmm,pVal_dom,OR_dom,pVal_rec,OR_rec,case_0_GTcount,case_1_GTcount,case_2_GTcount,case_3_GTcount,ctrl_0_GTcount,ctrl_1_GTcount,ctrl_2_GTcount,ctrl_3_GTcount,AN,AC
| where float(pval_mm) < 0.1 and case_1_GTcount+case_2_GTcount > 0
| replace OR_* GFORM(float(#rc),2,2)
);

nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
      | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
    ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n

In [ ]: