HPO Phewas¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}
HPO Phewas based on a Phecode-like catalog¶
In this notebook, we will use the phenotypes defined in the notebook HPO_Phecode in a gene based Phewas.
- We use variants and sequence read coverage to perform joint-genotyping and output horizontal organized genotypes.
- Then we aggregate the variants based on all the samples to enable very efficient case-control analysis.
- Finally, we use the PARALLEL command to calculate Fisher-Exact test for all variant and each phenotype.
Joint genotyping¶
We use the non-probabilistic GTGEN command to perform the genotyping. As input, we need row-level based (sparse) variants in a measure of sufficient read coverage, here set to 6 reads. We use the goodcov table instead of the more generic segcov table, because of much smaller data footprint. Given that we only have less than 25k samples, we will not use PARTGOR parallelism and only store the genotypes in a single horizontal bucket.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add_many("""
def #hpo_pn# = user_data/hakon/hpo_pheno.tsv;
def #wgsvars# = source/var/wgs_varcalls.gord -s PN ;
def #goodcov# = source/cov/goodcov_6.wgs.gord -s PN ;
create #buckets# = nor -asdict #wgsvars# | select #2 | rename #1 PN | calc bucket 'b1_1';
""")
%%gor
$mydefs
nor [#buckets#] | group -count
Query ran in 0.83 sec Query fetched 1 rows in 0.06 sec (total time 0.89 sec)
allCount | |
---|---|
0 | 24158 |
We see that we have over 24k samples. Next we create the genotypes by fetching all the variants in 20bp vicinity of a gene of choice, e.g. BRCA1. A GT column is mandatory for GTGEN and it is calculated based on the QC metrics stored in each row, e.g. we map low quality variants to unknown '3'. Because we don't assume that the variant have been properly normalized, we perform a left-normalization and ensure that there are no duplicate rows per samples, using ATMAX. Finally, we feed the variant rows into GTGEN that also joins the coverage information. Depending on the size of the gene, this query may take a minute or two to run.
mydefs = qh.add_many("""
create #genotypes# = gor #genes# | where gene_symbol = 'BRCA1' | segproj
| join -segsnp -f 20 -ir <(gor #wgsvars# | calc gt if(GL_Call>=20,callcopies,'3') | select 1-4,gt,pn )
| varnorm -left #3 #4
| atmax 1 -gc pn gt
| gtgen -gc #3,#4 [#buckets#] <(gor #goodcov# ) -maxseg 10000;
""")
We can inspect the genotypes and generate a simple histogram of the genotype values:
%%gor
$mydefs
gor [#genotypes#] | top 10 | replace values listcount(fsvmap(values,1,'x',','))
Query ran in 0.11 sec Query fetched 10 rows in 0.06 sec (total time 0.17 sec)
CHROM | POS | Reference | Call | Bucket | Values | |
---|---|---|---|---|---|---|
0 | chr17 | 43044346 | C | T | b1_1 | 0;3121,1;36,3;21001 |
1 | chr17 | 43044351 | C | T | b1_1 | 0;3155,1;1,3;21002 |
2 | chr17 | 43044355 | T | C | b1_1 | 0;3168,1;1,3;20989 |
3 | chr17 | 43044358 | T | G | b1_1 | 0;3165,1;1,3;20992 |
4 | chr17 | 43044386 | A | G | b1_1 | 0;3173,1;1,3;20984 |
5 | chr17 | 43044391 | G | A | b1_1 | 0;2327,1;676,2;169,3;20986 |
6 | chr17 | 43044492 | T | C | b1_1 | 0;3149,1;2,3;21007 |
7 | chr17 | 43044565 | C | T | b1_1 | 0;3148,1;3,3;21007 |
8 | chr17 | 43044725 | C | T | b1_1 | 0;3088,1;1,3;21069 |
9 | chr17 | 43044750 | G | A | b1_1 | 0;3080,1;3,3;21075 |
and we see the total number of variants as:
%%gor
$mydefs
gor [#genotypes#] | group chrom -count
Query ran in 0.10 sec Query fetched 1 rows in 1.09 sec (total time 1.19 sec)
CHROM | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chr17 | 0 | 83257441 | 5691 |
Next we want to generate summary statistics for all the samples. Typically, the number of samples that we exclude is much smaller than the number of controls. Having the counts for all the samples, we can generate the counts for the controls using simple substraction. Here we also calculate allele number (AN), allele count (AC) and allele frequence. Note, this calculation is only correct for autosomal chromosomes - see the report-builder freeze_af.ftl.yml() for gender specific implementation of AF.
mydefs = qh.add_many("""
create #all# = gor [#genotypes#]
| csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor [#buckets#] | select PN | calc cc 'all')
| pivot -gc #3,#4,cc -v 0,1,2,3 gt -e 0
| rename (.*_GTcount) all_#{1}
| hide cc
| calc AN 2*(all_0_GTcount + all_1_GTcount + all_2_GTcount)
| calc AC all_1_GTcount + 2*all_2_GTcount
| calc AF GFORM(AC/AN,4,4);
""")
%%gor
$mydefs
gor [#all#] | top 10
Query ran in 0.10 sec Query fetched 10 rows in 0.01 sec (total time 0.10 sec)
CHROM | POS | Reference | Call | all_0_GTcount | all_1_GTcount | all_2_GTcount | all_3_GTcount | AN | AC | AF | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 43044346 | C | T | 3121 | 36 | 0 | 21001 | 6314 | 36 | 0.005702 |
1 | chr17 | 43044351 | C | T | 3155 | 1 | 0 | 21002 | 6312 | 1 | 0.000158 |
2 | chr17 | 43044355 | T | C | 3168 | 1 | 0 | 20989 | 6338 | 1 | 0.000158 |
3 | chr17 | 43044358 | T | G | 3165 | 1 | 0 | 20992 | 6332 | 1 | 0.000158 |
4 | chr17 | 43044386 | A | G | 3173 | 1 | 0 | 20984 | 6348 | 1 | 0.000158 |
5 | chr17 | 43044391 | G | A | 2327 | 676 | 169 | 20986 | 6344 | 1014 | 0.159800 |
6 | chr17 | 43044492 | T | C | 3149 | 2 | 0 | 21007 | 6302 | 2 | 0.000317 |
7 | chr17 | 43044565 | C | T | 3148 | 3 | 0 | 21007 | 6302 | 3 | 0.000476 |
8 | chr17 | 43044725 | C | T | 3088 | 1 | 0 | 21069 | 6178 | 1 | 0.000162 |
9 | chr17 | 43044750 | G | A | 3080 | 3 | 0 | 21075 | 6166 | 3 | 0.000487 |
The query above took short time to run, because we are only reading the genotypes once and the command CSVCC is efficiently calculating the frequency of each genotype value, per variant row. However, we will be using close to 5k phenotypes, therefore we want to use the PARALLEL command to split up the load from the many phenotypes at hand. In order to split the load evenly, we will order the phenotypes by size and round-robin them to the workers.
mydefs = qh.add_many("""
create #phenosizeordered# = nor #hpo_pn#
| group -gc pheno_code -count | sort -c allcount:n | rownum;
""")
We are now ready to setup the PARALLEL macro command that uses a rownum in the parts relation and a modulus filter to provide different phenotype input to CSVCC in each parallel task. First, we use top 1 to run only one partition and top 10 to terminate the calculation quickly:
mydefs = qh.add_many("""
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# | top 1) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#]
<(nor #hpo_pn#
| inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount
| top 10
);
""")
%%gor
$mydefs
gor [#phewas#]
Query ran in 0.15 sec Query fetched 10 rows in 0.01 sec (total time 0.16 sec)
CHROM | POS | Reference | Call | Pheno | case_0_GTcount | case_1_GTcount | case_2_GTcount | case_3_GTcount | excl_0_GTcount | ... | all_1_GTcount | all_2_GTcount | all_3_GTcount | AN | AC | AF | ctrl_0_GTcount | ctrl_1_GTcount | ctrl_2_GTcount | ctrl_3_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 43044346 | C | T | HP:0030714 | 0 | 0 | 0 | 1 | 6 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3115 | 36 | 0 | 20913 |
1 | chr17 | 43044346 | C | T | HP:0003049 | 0 | 0 | 0 | 1 | 11 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3110 | 35 | 0 | 20951 |
2 | chr17 | 43044346 | C | T | HP:0000218 | 26 | 0 | 0 | 429 | 17 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3078 | 36 | 0 | 20378 |
3 | chr17 | 43044346 | C | T | HP:0008661 | 3 | 0 | 0 | 15 | 3 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3115 | 36 | 0 | 20975 |
4 | chr17 | 43044346 | C | T | HP:0004385 | 0 | 0 | 0 | 3 | 294 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 2827 | 29 | 0 | 20098 |
5 | chr17 | 43044346 | C | T | HP:0009103 | 0 | 0 | 0 | 4 | 62 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3059 | 35 | 0 | 20606 |
6 | chr17 | 43044346 | C | T | HP:0002265 | 2 | 0 | 0 | 11 | 3 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3116 | 36 | 0 | 20937 |
7 | chr17 | 43044346 | C | T | HP:0008668 | 0 | 0 | 0 | 1 | 9 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3112 | 36 | 0 | 20939 |
8 | chr17 | 43044346 | C | T | HP:0010883 | 2 | 0 | 0 | 6 | 21 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3098 | 36 | 0 | 20868 |
9 | chr17 | 43044346 | C | T | HP:0041064 | 0 | 0 | 0 | 2 | 38 | ... | 36 | 0 | 21001 | 6314 | 36 | 0.005702 | 3083 | 33 | 0 | 20729 |
10 rows × 24 columns
Notice how we use the all_*_GTcounts columns to calculate the control counts. Since the number of samples that are excluded are typically much fewer than the controls, this means that the CSVCC command needs to perform much less counting. We can see the speedup from the query below:
%%gor
$mydefs
nor user_data/hakon/hpo_pheno_summary.tsv | select 1-3
| calc x casecount+exclcount
| multimap -cartesian <(nor [#buckets#] | group -count)
| group -sum -ic x,allcount
| calc f sum_x/sum_allcount
Query ran in 0.13 sec Query fetched 1 rows in 0.17 sec (total time 0.31 sec)
sum_x | sum_allCount | f | |
---|---|---|---|
0 | 1511312 | 112817860 | 0.013396 |
From the above, we can see that we get close to 100-fold speedup by using this approach to calculate the Fisher-Exact statistics! Now we complete the final query by adding p-value and odds-ratio calculations for multiplicative, dominant, ad recessive model.
mydefs = qh.add_many("""
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# ) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#]
<(nor #hpo_pn#
| inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount
| calc pVal_mm EFORM(PVAL(case_2_GTcount*2+case_1_GTcount,case_0_GTcount*2 + case_1_GTcount,ctrl_2_GTcount*2+ctrl_1_GTcount,ctrl_0_GTcount*2 + ctrl_1_GTcount),5,1)
| calc OR_mm if((ctrl_2_GTcount*2+ctrl_1_GTcount) = 0 or (case_0_GTcount*2 + case_1_GTcount) = 0 and (ctrl_0_GTcount*2 + ctrl_1_GTcount) != 0,1000,if((case_0_GTcount*2 + case_1_GTcount) = 0 or (ctrl_0_GTcount*2 + ctrl_1_GTcount) = 0,float('NaN'),((case_2_GTcount*2+case_1_GTcount)/(case_0_GTcount*2 + case_1_GTcount))/((ctrl_2_GTcount*2+ctrl_1_GTcount)/(ctrl_0_GTcount*2 + ctrl_1_GTcount))))
| calc pVal_dom EFORM(PVAL(case_2_GTcount+case_1_GTcount,case_0_GTcount,ctrl_2_GTcount+ctrl_1_GTcount,ctrl_0_GTcount),5,1)
| calc OR_dom if(ctrl_2_GTcount+ctrl_1_GTcount = 0 or case_0_GTcount = 0 and ctrl_0_GTcount != 0,1000,if(case_0_GTcount = 0 or ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount+case_1_GTcount/case_0_GTcount)/(ctrl_2_GTcount+ctrl_1_GTcount/ctrl_0_GTcount)))
| calc pVal_rec EFORM(PVAL(case_2_GTcount,case_1_GTcount + case_0_GTcount,ctrl_2_GTcount,ctrl_1_GTcount + ctrl_0_GTcount),5,1)
| calc OR_rec if(ctrl_2_GTcount = 0 or case_1_GTcount + case_0_GTcount = 0 and ctrl_1_GTcount + ctrl_0_GTcount != 0,1000,if(case_1_GTcount + case_0_GTcount = 0 or ctrl_1_GTcount + ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount/case_1_GTcount + case_0_GTcount)/(ctrl_2_GTcount/ctrl_1_GTcount + ctrl_0_GTcount)))
| calc lnse sqrt(1.0/max(1.0,float(case_2_GTcount*2+case_1_GTcount))+1.0/max(1.0,float(case_0_GTcount*2 + case_1_GTcount))+1.0/max(1.0,float(ctrl_2_GTcount*2+ctrl_1_GTcount))+1.0/max(1.0,float(ctrl_0_GTcount*2 + ctrl_1_GTcount)))
| calc OR_SEmm GFORM((exp(ln(OR_mm)+1.96*lnse)-exp(ln(OR_mm)-1.96*lnse))/(2.0*1.96),4,3)
| select 1,2,Reference,Call,Pheno,AF,pVal_mm,OR_mm,OR_SEmm,pVal_dom,OR_dom,pVal_rec,OR_rec,case_0_GTcount,case_1_GTcount,case_2_GTcount,case_3_GTcount,ctrl_0_GTcount,ctrl_1_GTcount,ctrl_2_GTcount,ctrl_3_GTcount,AN,AC
| where float(pval_mm) < 0.1 and case_1_GTcount+case_2_GTcount > 0
| replace OR_* GFORM(float(#rc),2,2)
);
""")
%%gor
$mydefs
gor [#phewas#] | top 100
Query ran in 0.20 sec Query fetched 100 rows in 0.02 sec (total time 0.22 sec)
CHROM | POS | Reference | Call | Pheno | AF | pVal_mm | OR_mm | OR_SEmm | pVal_dom | ... | case_0_GTcount | case_1_GTcount | case_2_GTcount | case_3_GTcount | ctrl_0_GTcount | ctrl_1_GTcount | ctrl_2_GTcount | ctrl_3_GTcount | AN | AC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 43044346 | C | T | HP:0100852 | 0.005702 | 0.03900 | 2.7 | 1.4 | 0.03800 | ... | 241 | 6 | 0 | 1172 | 2812 | 26 | 0 | 19428 | 6314 | 36 |
1 | chr17 | 43044346 | C | T | HP:0001337 | 0.005702 | 0.06500 | 3.5 | 2.7 | 0.06500 | ... | 80 | 3 | 0 | 579 | 2916 | 31 | 0 | 19499 | 6314 | 36 |
2 | chr17 | 43044346 | C | T | HP:0031197 | 0.005702 | 0.01100 | 180.0 | 740.0 | 0.01100 | ... | 0 | 1 | 0 | 0 | 3112 | 35 | 0 | 20985 | 6314 | 36 |
3 | chr17 | 43044346 | C | T | HP:0001696 | 0.005702 | 0.05600 | 20.0 | 41.0 | 0.05600 | ... | 4 | 1 | 0 | 15 | 3116 | 35 | 0 | 20966 | 6314 | 36 |
4 | chr17 | 43044346 | C | T | HP:0031218 | 0.005702 | 0.01100 | 180.0 | 760.0 | 0.01100 | ... | 0 | 1 | 0 | 4 | 3108 | 34 | 0 | 20929 | 6314 | 36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | chr17 | 43044346 | C | T | HP:0000549 | 0.005702 | 0.01000 | 5.2 | 3.3 | 0.01000 | ... | 71 | 4 | 0 | 744 | 3024 | 32 | 0 | 19948 | 6314 | 36 |
96 | chr17 | 43044346 | C | T | HP:0001397 | 0.005702 | 0.09800 | 11.0 | 20.0 | 0.09800 | ... | 8 | 1 | 0 | 30 | 3113 | 35 | 0 | 20971 | 6314 | 36 |
97 | chr17 | 43044346 | C | T | HP:0006385 | 0.005702 | 0.00047 | 92.0 | 130.0 | 0.00038 | ... | 1 | 2 | 0 | 20 | 3120 | 34 | 0 | 20980 | 6314 | 36 |
98 | chr17 | 43044346 | C | T | HP:0100836 | 0.005702 | 0.08800 | 12.0 | 23.0 | 0.08800 | ... | 7 | 1 | 0 | 32 | 3108 | 35 | 0 | 20946 | 6314 | 36 |
99 | chr17 | 43044346 | C | T | HP:0006699 | 0.005702 | 0.04400 | 26.0 | 55.0 | 0.04400 | ... | 3 | 1 | 0 | 9 | 3093 | 34 | 0 | 20860 | 6314 | 36 |
100 rows × 23 columns
We can see that by spreading the phenotypes across 20 parallel workers, we complete about 25 million Fisher-Exact analysis test for 25k samples in less than a minute! We can see the names and filter out the important associations like this:
%%gor
$mydefs
nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
| map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n
Query ran in 0.18 sec Query fetched 21,235 rows in 1.19 sec (total time 1.37 sec)
CHROM | POS | Reference | Call | Pheno | AF | pVal_mm | OR_mm | OR_SEmm | name | rank_pVal_mm | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 43159574 | T | C | HP:0011458 | 0.594900 | 1.900000e-219 | 14.0 | 1.50 | Abdominal symptom | 1 |
1 | chr17 | 43159574 | T | C | HP:0012531 | 0.594900 | 1.900000e-219 | 18.0 | 2.00 | Pain | 1 |
2 | chr17 | 43169893 | C | T | HP:0002017 | 0.341000 | 1.900000e-219 | 17.0 | 1.60 | Nausea and vomiting | 1 |
3 | chr17 | 43169893 | C | T | HP:0002037 | 0.341000 | 1.900000e-219 | 41.0 | 7.20 | Inflammation of the large intestine | 1 |
4 | chr17 | 43169893 | C | T | HP:0011458 | 0.341000 | 1.900000e-219 | 13.0 | 0.91 | Abdominal symptom | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21230 | chr17 | 43097369 | A | G | HP:0012014 | 0.001242 | 9.900000e-03 | 120.0 | 240.00 | EEG with central focal spikes | 1 |
21231 | chr17 | 43097369 | A | G | HP:0010471 | 0.001242 | 9.900000e-03 | 120.0 | 240.00 | Oligosacchariduria | 3 |
21232 | chr17 | 43100620 | CATATATATATGTT | C | HP:0008458 | 0.004777 | 9.900000e-03 | 210.0 | 920.00 | Progressive congenital scoliosis | 6 |
21233 | chr17 | 43100664 | A | G | HP:0010301 | 0.026530 | 9.900000e-03 | 19.0 | 25.00 | Spinal dysraphism | 7 |
21234 | chr17 | 43133822 | A | G | HP:0000327 | 0.001657 | 9.900000e-03 | 140.0 | 370.00 | Hypoplasia of the maxilla | 9 |
21235 rows × 11 columns
Finally, we print out the complete query:
print(mydefs)
print("""nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
| map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n""")
def #hpo_pn# = user_data/hakon/hpo_pheno.tsv; def #wgsvars# = source/var/wgs_varcalls.gord -s PN; def #goodcov# = source/cov/goodcov_6.wgs.gord -s PN; create #buckets# = nor -asdict #wgsvars# | select #2 | rename #1 PN | calc bucket 'b1_1'; create #genotypes# = gor #genes# | where gene_symbol = 'BRCA1' | segproj | join -segsnp -f 20 -ir <(gor #wgsvars# | calc gt if(GL_Call>=20,callcopies,'3') | select 1-4,gt,pn ) | varnorm -left #3 #4 | atmax 1 -gc pn gt | gtgen -gc #3,#4 [#buckets#] <(gor #goodcov# ) -maxseg 10000; create #all# = gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor [#buckets#] | select PN | calc cc 'all') | pivot -gc #3,#4,cc -v 0,1,2,3 gt -e 0 | rename (.*_GTcount) all_#{1} | hide cc | calc AN 2*(all_0_GTcount + all_1_GTcount + all_2_GTcount) | calc AC all_1_GTcount + 2*all_2_GTcount | calc AF GFORM(AC/AN,4,4); create #phenosizeordered# = nor #hpo_pn# | group -gc pheno_code -count | sort -c allcount:n | rownum; def #parts# = 20; create #phewas# = parallel -parts <(norrows #parts# ) <( gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor #hpo_pn# | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} )) | pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0 | pivot -gc #3,#4,pheno -v 1,3 cc -e 0 | prefix pheno[+1]- y | rename y_1_(.*) case_#{1} | rename y_3_(.*) excl_#{1} | varjoin -r [#all#] | calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount | calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount | calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount | calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount | calc pVal_mm EFORM(PVAL(case_2_GTcount*2+case_1_GTcount,case_0_GTcount*2 + case_1_GTcount,ctrl_2_GTcount*2+ctrl_1_GTcount,ctrl_0_GTcount*2 + ctrl_1_GTcount),5,1) | calc OR_mm if((ctrl_2_GTcount*2+ctrl_1_GTcount) = 0 or (case_0_GTcount*2 + case_1_GTcount) = 0 and (ctrl_0_GTcount*2 + ctrl_1_GTcount) != 0,1000,if((case_0_GTcount*2 + case_1_GTcount) = 0 or (ctrl_0_GTcount*2 + ctrl_1_GTcount) = 0,float('NaN'),((case_2_GTcount*2+case_1_GTcount)/(case_0_GTcount*2 + case_1_GTcount))/((ctrl_2_GTcount*2+ctrl_1_GTcount)/(ctrl_0_GTcount*2 + ctrl_1_GTcount)))) | calc pVal_dom EFORM(PVAL(case_2_GTcount+case_1_GTcount,case_0_GTcount,ctrl_2_GTcount+ctrl_1_GTcount,ctrl_0_GTcount),5,1) | calc OR_dom if(ctrl_2_GTcount+ctrl_1_GTcount = 0 or case_0_GTcount = 0 and ctrl_0_GTcount != 0,1000,if(case_0_GTcount = 0 or ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount+case_1_GTcount/case_0_GTcount)/(ctrl_2_GTcount+ctrl_1_GTcount/ctrl_0_GTcount))) | calc pVal_rec EFORM(PVAL(case_2_GTcount,case_1_GTcount + case_0_GTcount,ctrl_2_GTcount,ctrl_1_GTcount + ctrl_0_GTcount),5,1) | calc OR_rec if(ctrl_2_GTcount = 0 or case_1_GTcount + case_0_GTcount = 0 and ctrl_1_GTcount + ctrl_0_GTcount != 0,1000,if(case_1_GTcount + case_0_GTcount = 0 or ctrl_1_GTcount + ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount/case_1_GTcount + case_0_GTcount)/(ctrl_2_GTcount/ctrl_1_GTcount + ctrl_0_GTcount))) | calc lnse sqrt(1.0/max(1.0,float(case_2_GTcount*2+case_1_GTcount))+1.0/max(1.0,float(case_0_GTcount*2 + case_1_GTcount))+1.0/max(1.0,float(ctrl_2_GTcount*2+ctrl_1_GTcount))+1.0/max(1.0,float(ctrl_0_GTcount*2 + ctrl_1_GTcount))) | calc OR_SEmm GFORM((exp(ln(OR_mm)+1.96*lnse)-exp(ln(OR_mm)-1.96*lnse))/(2.0*1.96),4,3) | select 1,2,Reference,Call,Pheno,AF,pVal_mm,OR_mm,OR_SEmm,pVal_dom,OR_dom,pVal_rec,OR_rec,case_0_GTcount,case_1_GTcount,case_2_GTcount,case_3_GTcount,ctrl_0_GTcount,ctrl_1_GTcount,ctrl_2_GTcount,ctrl_3_GTcount,AN,AC | where float(pval_mm) < 0.1 and case_1_GTcount+case_2_GTcount > 0 | replace OR_* GFORM(float(#rc),2,2) ); nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name) ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n