HPO Phecode¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}
In this notebook, we will showcase how GOR can be used to define multiple phenotypes based on the HPO directed acyclic graph (DAG) and how we can use them to build a catalog similar to Phecode, with multiple HPO-based phenotypes and their corresponding inclusion and exclusion codes.
- First we explore the HPO DAG to automatically define a few thousand case/exclusion/control phenotype definitions.
- Then we apply these phenotype definitions to label over 20k samples/subjects with these phenotypes, based on their underlying HPO codes.
In another notebook, HPO_Phewas, we will then use these phenotypes to run a Fisher's exact association test over a complete gene, testing thousands of phenotypes against thousands of variants in a couple of minutes!
Let's begin!
HPO phenotypes on subjects¶
The sample phenotypes stored by CSA/Gregor are accessible in SM and through the query API via the report link, as shown here:
%%gor
nor SubjectReports/Phenotypes.rep | top 3
Query ran in 11.22 sec Query fetched 3 rows in 0.03 sec (total time 11.25 sec)
pn | taxonomy | code | description | present | |
---|---|---|---|---|---|
0 | BCH_019-III-01-P2_1_38E_BC38 | HPO | HP:0012531 | Pain | Y |
1 | BCH_019-III-01-P2_1_38E_BC38 | HPO | HP:0100577 | Urinary bladder inflammation | Y |
2 | BCH_020-II-8-A2_1_38E_BC38 | HPO | HP:0012531 | Pain | Y |
At present, the GORdb system always considers the report link "dirty" and will update every CREATE step that refers to it; hence, we start by persisting it under user_data.
%%gor
nor SubjectReports/Phenotypes.rep
| where taxonomy = 'HPO' and present = 'Y' | select pn,code
| write user_data/hakon/hpo_pn.tsv
Query ran in 6.34 sec. Query fetched 0 rows in 3.12 sec (total time 9.45 sec)
ChromNOR | PosNOR |
---|
After setting up some basic definitions, we start by exploring how many samples have phenotypes and the number of HPO codes per sample.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add_many("""
def #hpo_path# = ref_base/HG38-109-7-2/HPO;
def #hpo_pn# = user_data/hakon/hpo_pn.tsv
""")
%%gor
$mydefs
nor #hpo_pn# | group -gc pn | group -count
Query ran in 0.26 sec Query fetched 1 rows in 1.28 sec (total time 1.54 sec)
allCount | |
---|---|
0 | 9534 |
The above query shows that, out of the 24k samples in the project, 9534 have HPO phenotypes. Next, we count the number of phenotypes per sample.
%%gor pncount <<
$mydefs
nor #hpo_pn# | group -gc pn -count
Query ran in 0.92 sec Query fetched 9,534 rows in 1.97 sec (total time 2.89 sec)
p = pncount.plot.hist(bins=250, range = [0,250],histtype='step',title='HPO histogram')
p.set_ylabel('#PNs')
p.set_xlabel('#HPO codes')
Text(0.5, 0, '#HPO codes')
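For a quick numeric summary of the same distribution, the captured dataframe can also be inspected directly with pandas. This is an optional check, assuming the count column is named allCount as in the GROUP -count output shown earlier:
# optional: summary statistics of HPO codes per phenotyped sample
pncount['allCount'].describe()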
Next we look at the most frequent HPO codes and try to understand where they are placed in the HPO DAG.
%%gor
$mydefs
nor #hpo_pn# | group -gc code -count | rename #2 PNs | sort -c PNs:nr
| top 5
| map -c code <(nor #hpo_path#/hpo_dag_info.tsv | select hpo_code,name,dag_level,decendcount,min_path)
Query ran in 0.25 sec Query fetched 5 rows in 1.65 sec (total time 1.90 sec)
code | PNs | name | DAG_level | decendCount | Min_path | |
---|---|---|---|---|---|---|
0 | HP:0000152 | 4116 | Abnormality of head or neck | 2 | 1550 | T->HP:0000118->* |
1 | HP:0012531 | 4038 | Pain | 3 | 65 | T->HP:0000118->HP:0025142->* |
2 | HP:0000234 | 3982 | Abnormality of the head | 3 | 1502 | T->HP:0000118->HP:0000152->* |
3 | HP:0011024 | 3531 | Abnormality of the gastrointestinal tract | 3 | 431 | T->HP:0000118->HP:0025031->* |
4 | HP:0025031 | 3523 | Abnormality of the digestive system | 2 | 833 | T->HP:0000118->* |
The DAGMAP command¶
We see that several of these HPO codes are carried by a large share of the phenotyped samples (out of the 24k samples in the project). These common HPO codes sit only two or three levels below the top node and have very many child nodes. The dag_info data is generated from the parent-child graph structure using the DAGMAP command. We will now use it to traverse the DAG in the reverse direction, from a child to its parents, by switching the parent and child columns. With the -dp option, we get the path to all ancestral nodes, including the top node of the DAG.
%%gor
$mydefs
nor #hpo_pn# | group -gc code -count | rename #2 PNs | sort -c PNs:nr | granno -count | rename #3 allPNs
| top 5
| dagmap -c code -dp <(nor #hpo_path#/hpo_parent_child.tsv | select #2,#1)
Query ran in 0.20 sec Query fetched 23 rows in 1.42 sec (total time 1.61 sec)
code | PNs | allPNs | DAG_node | DAG_dist | DAG_path | |
---|---|---|---|---|---|---|
0 | HP:0000152 | 4116 | 5917 | HP:0000152 | 0 | HP:0000152 |
1 | HP:0000152 | 4116 | 5917 | HP:0000118 | 1 | HP:0000152->HP:0000118 |
2 | HP:0000152 | 4116 | 5917 | HP:0000001 | 2 | HP:0000152->HP:0000118->HP:0000001 |
3 | HP:0000152 | 4116 | 5917 | 3 | HP:0000152->HP:0000118->HP:0000001-> | |
4 | HP:0012531 | 4038 | 5917 | HP:0012531 | 0 | HP:0012531 |
5 | HP:0012531 | 4038 | 5917 | HP:0025142 | 1 | HP:0012531->HP:0025142 |
6 | HP:0012531 | 4038 | 5917 | HP:0000118 | 2 | HP:0012531->HP:0025142->HP:0000118 |
7 | HP:0012531 | 4038 | 5917 | HP:0000001 | 3 | HP:0012531->HP:0025142->HP:0000118->HP:0000001 |
8 | HP:0012531 | 4038 | 5917 | 4 | HP:0012531->HP:0025142->HP:0000118->HP:0000001-> | |
9 | HP:0000234 | 3982 | 5917 | HP:0000234 | 0 | HP:0000234 |
10 | HP:0000234 | 3982 | 5917 | HP:0000152 | 1 | HP:0000234->HP:0000152 |
11 | HP:0000234 | 3982 | 5917 | HP:0000118 | 2 | HP:0000234->HP:0000152->HP:0000118 |
12 | HP:0000234 | 3982 | 5917 | HP:0000001 | 3 | HP:0000234->HP:0000152->HP:0000118->HP:0000001 |
13 | HP:0000234 | 3982 | 5917 | 4 | HP:0000234->HP:0000152->HP:0000118->HP:0000001-> | |
14 | HP:0011024 | 3531 | 5917 | HP:0011024 | 0 | HP:0011024 |
15 | HP:0011024 | 3531 | 5917 | HP:0025031 | 1 | HP:0011024->HP:0025031 |
16 | HP:0011024 | 3531 | 5917 | HP:0000118 | 2 | HP:0011024->HP:0025031->HP:0000118 |
17 | HP:0011024 | 3531 | 5917 | HP:0000001 | 3 | HP:0011024->HP:0025031->HP:0000118->HP:0000001 |
18 | HP:0011024 | 3531 | 5917 | 4 | HP:0011024->HP:0025031->HP:0000118->HP:0000001-> | |
19 | HP:0025031 | 3523 | 5917 | HP:0025031 | 0 | HP:0025031 |
20 | HP:0025031 | 3523 | 5917 | HP:0000118 | 1 | HP:0025031->HP:0000118 |
21 | HP:0025031 | 3523 | 5917 | HP:0000001 | 2 | HP:0025031->HP:0000118->HP:0000001 |
22 | HP:0025031 | 3523 | 5917 | 3 | HP:0025031->HP:0000118->HP:0000001-> |
As shown above, DAGMAP traverses the graph and outputs DAG_node rows in a similar manner to MULTIMAP. Next we use it to find every child node of any given node present in the samples. In other words, for a phenotype named after an HPO code, we include all samples that have any of its child nodes, as defined by the HPO DAG. Notice that we exclude HPO codes with an excessively large number of child nodes.
mydefs = qh.add_many("""
create #all_codes# = nor #hpo_pn# | group -gc code;
create #case# = nor [#all_codes#]
| map -c code <(nor #hpo_path#/hpo_descr.tsv | select hpo_code,name)
| dagmap -c code #hpo_path#/hpo_parent_child.tsv
| group -gc code,name -set -sc dag_node -len 100000
| where listsize(set_dag_node) <= 1000
| rename set_dag_node case_codes;
""")
%%gor
$mydefs
nor [#case#] | top 10 | calc ls listsize(#3)
Query ran in 0.11 sec Query fetched 10 rows in 0.01 sec (total time 0.12 sec)
code | name | case_codes | ls | |
---|---|---|---|---|
0 | HP:0000002 | Abnormality of body height | HP:0000002,HP:0000098,HP:0000839,HP:0001519,HP... | 35 |
1 | HP:0000003 | Multicystic kidney dysplasia | HP:0000003 | 1 |
2 | HP:0000006 | Autosomal dominant inheritance | HP:0000006 | 1 |
3 | HP:0000007 | Autosomal recessive inheritance | HP:0000007 | 1 |
4 | HP:0000008 | Abnormal morphology of female internal genitalia | HP:0000008,HP:0000013,HP:0000130,HP:0000131,HP... | 106 |
5 | HP:0000009 | Functional abnormality of the bladder | HP:0000009,HP:0000011,HP:0000012,HP:0000016,HP... | 28 |
6 | HP:0000010 | Recurrent urinary tract infections | HP:0000010,HP:0012786,HP:0012787 | 3 |
7 | HP:0000011 | Neurogenic bladder | HP:0000011 | 1 |
8 | HP:0000012 | Urinary urgency | HP:0000012 | 1 |
9 | HP:0000013 | Hypoplasia of the uterus | HP:0000013 | 1 |
Now we have a phenotype name and all the HPO codes belonging to that phenotype listed as case_codes. The thinking is that samples with any of these HPO codes are cases for the phenotype named after "code". We could then define controls as all samples that are not cases; however, here we also want to exclude samples that are not cases but do have similar phenotypes. By similar, we mean that we traverse the HPO DAG up one level from the code that defines the phenotype and find all child nodes of those parent nodes that do not overlap the case codes.
mydefs = qh.add("""
create #excl# = nor [#case#]
| select code
| dagmap -c code -dl 1 <(nor #hpo_path#/hpo_parent_child.tsv | select #2,#1)
| where dag_dist > 0
| select code,dag_node | rename dag_node parent_node
| dagmap -c parent_node -dp -ps ',' #hpo_path#/hpo_parent_child.tsv
| where not(listhasany(dag_path,code))
| group -gc code -set -sc dag_node -len 100000
| rename set_dag_node excl_codes;
""")
%%gor
$mydefs
nor [#excl#] | top 10
Query ran in 0.11 sec Query fetched 10 rows in 0.02 sec (total time 0.13 sec)
code | excl_codes | |
---|---|---|
0 | HP:0000002 | HP:0000823,HP:0000839,HP:0001507,HP:0001508,HP... |
1 | HP:0000003 | HP:0000107,HP:0000108,HP:0000113,HP:0000800,HP... |
2 | HP:0000006 | HP:0000007,HP:0001417,HP:0001419,HP:0001423,HP... |
3 | HP:0000007 | HP:0000006,HP:0001417,HP:0001419,HP:0001423,HP... |
4 | HP:0000008 | HP:0000022,HP:0000024,HP:0000030,HP:0000031,HP... |
5 | HP:0000009 | HP:0000014,HP:0000015,HP:0000021,HP:0001586,HP... |
6 | HP:0000010 | HP:0000083,HP:0000093,HP:0000099,HP:0000100,HP... |
7 | HP:0000011 | HP:0000009,HP:0000012,HP:0000016,HP:0000017,HP... |
8 | HP:0000012 | HP:0000009,HP:0000011,HP:0000016,HP:0000017,HP... |
9 | HP:0000013 | HP:0000151,HP:0008684 |
See how in the above query we first use the DAGMAP command with the reversed parent-child graph and set the option -dl to 1, because we only want to go up one level from "code". We filter out dag_dist of zero, because that is the node itself. Then we traverse down from dag_node (the parent of code) and use the -dp option to get the path traversed, applying initial pruning to exclude paths that lead back to the node itself, i.e. not(listhasany(dag_path,code)). Next we join together the case codes and the exclusion codes and apply extra pruning, in case there is overlap between them due to the multi-parental nature of the HPO DAG.
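As a side note, LISTFILTER (used for the extra pruning in the next step) evaluates a boolean expression once per list element, with x bound to the element. The following toy illustration is only a sketch, assuming NORROWS and the int() function behave as in standard GOR; the column names are made up:
%%gor
norrows 1
| calc nums '1,5,10,20'
| calc kept listfilter(nums,'int(x) >= 10')
| hide rownum
The kept column should then contain only 10,20.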
mydefs = qh.add("""
create #case_excl_codes# = nor [#all_codes#]
| map -c code [#case#]
| calc case_size listsize(case_codes)
| map -c code [#excl#]
| replace excl_codes listfilter(excl_codes,'not(listhasany(case_codes,x))')
| calc excl_size listsize(excl_codes);
""")
%%gor
$mydefs
nor [#case_excl_codes#] | top 10
Query ran in 0.11 sec Query fetched 10 rows in 0.02 sec (total time 0.14 sec)
code | name | case_codes | case_size | excl_codes | excl_size | |
---|---|---|---|---|---|---|
0 | HP:0000002 | Abnormality of body height | HP:0000002,HP:0000098,HP:0000839,HP:0001519,HP... | 35 | HP:0000823,HP:0001507,HP:0001508,HP:0001510,HP... | 68 |
1 | HP:0000003 | Multicystic kidney dysplasia | HP:0000003 | 1 | HP:0000107,HP:0000108,HP:0000113,HP:0000800,HP... | 12 |
2 | HP:0000006 | Autosomal dominant inheritance | HP:0000006 | 1 | HP:0000007,HP:0001417,HP:0001419,HP:0001423,HP... | 11 |
3 | HP:0000007 | Autosomal recessive inheritance | HP:0000007 | 1 | HP:0000006,HP:0001417,HP:0001419,HP:0001423,HP... | 11 |
4 | HP:0000008 | Abnormal morphology of female internal genitalia | HP:0000008,HP:0000013,HP:0000130,HP:0000131,HP... | 106 | HP:0000022,HP:0000024,HP:0000030,HP:0000031,HP... | 62 |
5 | HP:0000009 | Functional abnormality of the bladder | HP:0000009,HP:0000011,HP:0000012,HP:0000016,HP... | 28 | HP:0000014,HP:0000015,HP:0000021,HP:0001586,HP... | 30 |
6 | HP:0000010 | Recurrent urinary tract infections | HP:0000010,HP:0012786,HP:0012787 | 3 | HP:0000083,HP:0000093,HP:0000099,HP:0000100,HP... | 491 |
7 | HP:0000011 | Neurogenic bladder | HP:0000011 | 1 | HP:0000009,HP:0000012,HP:0000016,HP:0000017,HP... | 27 |
8 | HP:0000012 | Urinary urgency | HP:0000012 | 1 | HP:0000009,HP:0000011,HP:0000016,HP:0000017,HP... | 27 |
9 | HP:0000013 | Hypoplasia of the uterus | HP:0000013 | 1 | HP:0000151,HP:0008684 | 2 |
Mapping samples to the phenotypes in the catalog¶
Ultimately, we want to map the subjects based on their associated HPO observations. Before we do that, we generate a mapping table to make the logic more efficient. Additionally, we throw away those machine-generated phenotypes that have a very high number of codes in their definitions.
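Before settling on a cut-off, one can check how many of the machine-generated definitions such a filter would discard. The query below is a small sketch, not part of the original flow, using the same 100-code threshold on both lists:
%%gor
$mydefs
nor [#case_excl_codes#]
| calc bucket if(case_size < 100 and excl_size < 100,'kept','dropped')
| group -gc bucket -count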
mydefs = qh.add_many("""
def #ce_filter# = where case_size < 100 and excl_size < 100;
create #case_map# = nor [#case_excl_codes#] | #ce_filter# | select code,case_codes | calc status '1'
| split case_codes | select case_codes,code,status | rename code pheno_code | rename #1 code;
create #excl_map# = nor [#case_excl_codes#] | #ce_filter# | select code,excl_codes | calc status '3'
| split excl_codes | select excl_codes,code,status | rename code pheno_code | rename #1 code;
create #pheno_map# = nor [#case_map#] | merge [#excl_map#];
""")
Now we use the #pheno_map# to map the PN-based HPO observations to the phenotypes and their corresponding case/exclusion status. Importantly, because the samples have multiple observations, a subject can qualify for both case and exclusion status for any given phenotype. Here we let the case status override by taking the minimum status per PN. There are certainly many more ways one could do this, e.g. by weighing the relative counts of these statuses; however, the purpose here is not to produce definitive phenotype definitions but rather to present some useful commands and related concepts.
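The precedence rule is simply that the smallest status wins: a subject that hits both a case code (status 1) and an exclusion code (status 3) for the same phenotype ends up as a case. A toy pandas equivalent of the GROUP -min step, using made-up data rather than project data:
# one subject hits both a case code (1) and an exclusion code (3) for the same phenotype;
# taking the minimum keeps the case status
toy = pd.DataFrame({'pn': ['S1', 'S1'],
                    'pheno_code': ['HP:0000002', 'HP:0000002'],
                    'status': [1, 3]})
toy.groupby(['pn', 'pheno_code'], as_index=False)['status'].min()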
%%gor
$mydefs
nor #hpo_pn# | select pn,code | multimap -c code [#pheno_map#]
| group -gc pn,pheno_code -min -ic status
| rename min_status status
| write user_data/hakon/hpo_pheno.tsv
Query ran in 0.13 sec Query fetched 0 rows in 10.70 sec (total time 10.83 sec)
ChromNOR | PosNOR |
---|
Above, we mapped all the 24k samples to over 5k phenotype definitions in a few seconds and wrote the results as a sparse phenotype TSV file. We can get summary-level counts for these phenotypes as well:
mydefs = qh.add_many("""
create #pheno_counts# = nor user_data/hakon/hpo_pheno.tsv
| group -gc pheno_code,status -count
| pivot -gc pheno_code -v 1,3 status -e 0
| rename #2 caseCount
| rename #3 exclCount
| multimap -cartesian <(nor user_data/hakon/hpo_pheno.tsv | group -gc pn | group -count )
| calc phenoCtrlCount allcount-#2-#3
| hide allcount
| sort -c phenoCtrlCount:n
| map -c #1 <(nor [#case_excl_codes#] | select #1,name);
""")
%%gor
$mydefs
nor [#pheno_counts#]
Query ran in 0.14 sec Query fetched 4,670 rows in 0.04 sec (total time 0.18 sec)
pheno_code | caseCount | exclCount | phenoCtrlCount | name | |
---|---|---|---|---|---|
0 | HP:0011458 | 4312 | 702 | 4440 | Abdominal symptom |
1 | HP:0012531 | 4343 | 458 | 4653 | Pain |
2 | HP:0030155 | 2 | 4359 | 5093 | Scrotal pain |
3 | HP:0007281 | 4 | 4345 | 5105 | Developmental stagnation |
4 | HP:0002376 | 361 | 3988 | 5105 | Developmental regression |
... | ... | ... | ... | ... | ... |
4665 | HP:0003090 | 1 | 0 | 9453 | Hypoplasia of the capital femoral epiphysis |
4666 | HP:0410339 | 1 | 0 | 9453 | Insect bite allergy |
4667 | HP:0500156 | 1 | 0 | 9453 | Hyperasparaginemia |
4668 | HP:0012405 | 1 | 0 | 9453 | Hypocitraturia |
4669 | HP:0500191 | 1 | 0 | 9453 | Increased CSF leucine concentration |
4670 rows × 5 columns
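As an optional visual check, not part of the original flow, the #pheno_counts# relation can be pulled into a dataframe and plotted in the same way as the per-sample histogram earlier; pheno_counts_df is a hypothetical variable name and caseCount is the column shown above:
%%gor pheno_counts_df <<
$mydefs
nor [#pheno_counts#]
p = pheno_counts_df['caseCount'].plot.hist(bins=100, range=[0,1000], histtype='step', title='Cases per phenotype')
p.set_ylabel('#phenotypes')
p.set_xlabel('#case samples')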
We can now write out the summary of the entire HPO phenotype catalog as follows:
%%gor
$mydefs
nor [#pheno_counts#] | map -c #1 [#case_excl_codes#]
| hide namex | write user_data/hakon/hpo_pheno_summary.tsv
Query ran in 0.32 sec Query fetched 0 rows in 0.94 sec (total time 1.26 sec)
ChromNOR | PosNOR |
---|
In the notebook HPO_Phewas, we show how these 4669 phenotypes can be used to perform a very rapid PheWAS.