HPO Phecode¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}
In this notebook, we will showcase how GOR can be used to define multiple phenotypes based on the HPO directed acyclic graph (DAG) and how we can use them to build a catalog similar to Phecode, with multiple HPO-based phenotypes and their corresponding inclusion and exclusion codes.
- First we explore the HPO DAG to automatically define a few thousand case/exclusion/control phenotype definitions.
- Then we apply these phenotype definitions to label over 20k samples/subjects with these phenotypes, based on their underlying HPO codes.
In another notebook, HPO_Phewas, we will then use these phenotypes to run a Fisher's exact association test over a complete gene, testing thousands of phenotypes against thousands of variants in a couple of minutes!
Let's begin!
HPO phenotypes on subjects¶
The sample phenotypes stored by CSA/Gregor are accessible in SM and through the query API via the report link, as shown here:
%%gor
nor SubjectReports/Phenotypes.rep | top 3
Query ran in 11.22 sec Query fetched 3 rows in 0.03 sec (total time 11.25 sec)
pn | taxonomy | code | description | present | |
---|---|---|---|---|---|
0 | BCH_019-III-01-P2_1_38E_BC38 | HPO | HP:0012531 | Pain | Y |
1 | BCH_019-III-01-P2_1_38E_BC38 | HPO | HP:0100577 | Urinary bladder inflammation | Y |
2 | BCH_020-II-8-A2_1_38E_BC38 | HPO | HP:0012531 | Pain | Y |
At present, the GORdb system always considers the report link "dirty" and will update every CREATE step that refers to it; hence, we start by persisting it under user_data.
%%gor
nor SubjectReports/Phenotypes.rep
| where taxonomy = 'HPO' and present = 'Y' | select pn,code
| write user_data/hakon/hpo_pn.tsv
Query ran in 6.34 sec. Query fetched 0 rows in 3.12 sec (total time 9.45 sec)
ChromNOR | PosNOR |
---|
After setting up some basic definitions, we start by exploring how many samples have phenotypes and the number of HPO codes per sample.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add_many("""
def #hpo_path# = ref_base/HG38-109-7-2/HPO;
def #hpo_pn# = user_data/hakon/hpo_pn.tsv
""")
%%gor
$mydefs
nor #hpo_pn# | group -gc pn | group -count
Query ran in 0.26 sec Query fetched 1 rows in 1.28 sec (total time 1.54 sec)
allCount | |
---|---|
0 | 9534 |
The above query shows that, out of the 24k samples in the project, 9534 have HPO phenotypes. Next, we count the number of phenotypes per sample.
%%gor pncount <<
$mydefs
nor #hpo_pn# | group -gc pn -count
Query ran in 0.92 sec Query fetched 9,534 rows in 1.97 sec (total time 2.89 sec)
p = pncount.plot.hist(bins=250, range = [0,250],histtype='step',title='HPO histogram')
p.set_ylabel('#PNs')
p.set_xlabel('#HPO codes')
Text(0.5, 0, '#HPO codes')
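For a quick numeric summary of the same distribution, the captured dataframe can also be inspected directly with pandas. This is an optional check, assuming the count column is named allCount as in the GROUP -count output shown earlier:
# optional: summary statistics of HPO codes per phenotyped sample
pncount['allCount'].describe()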
Next we look at the most frequent HPO codes and try to understand where they are placed in the HPO DAG.
%%gor
$mydefs
nor #hpo_pn# | group -gc code -count | rename #2 PNs | sort -c PNs:nr
| top 5
| map -c code <(nor #hpo_path#/hpo_dag_info.tsv | select hpo_code,name,dag_level,decendcount,min_path)
Query ran in 0.25 sec Query fetched 5 rows in 1.65 sec (total time 1.90 sec)
code | PNs | name | DAG_level | decendCount | Min_path | |
---|---|---|---|---|---|---|
0 | HP:0000152 | 4116 | Abnormality of head or neck | 2 | 1550 | T->HP:0000118->* |
1 | HP:0012531 | 4038 | Pain | 3 | 65 | T->HP:0000118->HP:0025142->* |
2 | HP:0000234 | 3982 | Abnormality of the head | 3 | 1502 | T->HP:0000118->HP:0000152->* |
3 | HP:0011024 | 3531 | Abnormality of the gastrointestinal tract | 3 | 431 | T->HP:0000118->HP:0025031->* |
4 | HP:0025031 | 3523 | Abnormality of the digestive system | 2 | 833 | T->HP:0000118->* |
The DAGMAP command¶
We see that several of these HPO codes are carried by a large share of the phenotyped samples (out of the 24k samples in the project). These common HPO codes sit only two or three levels below the top node and have very many child nodes. The dag_info data is generated from the parent-child graph structure using the DAGMAP command. We will now use it to traverse the DAG in the reverse direction, from a child to its parents, by switching the parent and child columns. With the -dp option, we get the path to all ancestral nodes, including the top node of the DAG.
%%gor
$mydefs
nor #hpo_pn# | group -gc code -count | rename #2 PNs | sort -c PNs:nr | granno -count | rename #3 allPNs
| top 5
| dagmap -c code -dp <(nor #hpo_path#/hpo_parent_child.tsv | select #2,#1)
Query ran in 0.20 sec Query fetched 23 rows in 1.42 sec (total time 1.61 sec)
code | PNs | allPNs | DAG_node | DAG_dist | DAG_path | |
---|---|---|---|---|---|---|
0 | HP:0000152 | 4116 | 5917 | HP:0000152 | 0 | HP:0000152 |
1 | HP:0000152 | 4116 | 5917 | HP:0000118 | 1 | HP:0000152->HP:0000118 |
2 | HP:0000152 | 4116 | 5917 | HP:0000001 | 2 | HP:0000152->HP:0000118->HP:0000001 |
3 | HP:0000152 | 4116 | 5917 | 3 | HP:0000152->HP:0000118->HP:0000001-> | |
4 | HP:0012531 | 4038 | 5917 | HP:0012531 | 0 | HP:0012531 |
5 | HP:0012531 | 4038 | 5917 | HP:0025142 | 1 | HP:0012531->HP:0025142 |
6 | HP:0012531 | 4038 | 5917 | HP:0000118 | 2 | HP:0012531->HP:0025142->HP:0000118 |
7 | HP:0012531 | 4038 | 5917 | HP:0000001 | 3 | HP:0012531->HP:0025142->HP:0000118->HP:0000001 |
8 | HP:0012531 | 4038 | 5917 | 4 | HP:0012531->HP:0025142->HP:0000118->HP:0000001-> | |
9 | HP:0000234 | 3982 | 5917 | HP:0000234 | 0 | HP:0000234 |
10 | HP:0000234 | 3982 | 5917 | HP:0000152 | 1 | HP:0000234->HP:0000152 |
11 | HP:0000234 | 3982 | 5917 | HP:0000118 | 2 | HP:0000234->HP:0000152->HP:0000118 |
12 | HP:0000234 | 3982 | 5917 | HP:0000001 | 3 | HP:0000234->HP:0000152->HP:0000118->HP:0000001 |
13 | HP:0000234 | 3982 | 5917 | 4 | HP:0000234->HP:0000152->HP:0000118->HP:0000001-> | |
14 | HP:0011024 | 3531 | 5917 | HP:0011024 | 0 | HP:0011024 |
15 | HP:0011024 | 3531 | 5917 | HP:0025031 | 1 | HP:0011024->HP:0025031 |
16 | HP:0011024 | 3531 | 5917 | HP:0000118 | 2 | HP:0011024->HP:0025031->HP:0000118 |
17 | HP:0011024 | 3531 | 5917 | HP:0000001 | 3 | HP:0011024->HP:0025031->HP:0000118->HP:0000001 |
18 | HP:0011024 | 3531 | 5917 | 4 | HP:0011024->HP:0025031->HP:0000118->HP:0000001-> | |
19 | HP:0025031 | 3523 | 5917 | HP:0025031 | 0 | HP:0025031 |
20 | HP:0025031 | 3523 | 5917 | HP:0000118 | 1 | HP:0025031->HP:0000118 |
21 | HP:0025031 | 3523 | 5917 | HP:0000001 | 2 | HP:0025031->HP:0000118->HP:0000001 |
22 | HP:0025031 | 3523 | 5917 | 3 | HP:0025031->HP:0000118->HP:0000001-> |
As shown above, DAGMAP traverses the graph and outputs DAG_node rows in a similar manner to MULTIMAP. Next we use it to find every child node of any given node present in the samples. In other words, for a phenotype named after an HPO code, we include all samples that have any of its child nodes, as defined by the HPO DAG. Notice that we exclude HPO codes with an excessively large number of child nodes.
mydefs = qh.add_many("""
create #all_codes# = nor #hpo_pn# | group -gc code;
create #case# = nor [#all_codes#]
| map -c code <(nor #hpo_path#/hpo_descr.tsv | select hpo_code,name)
| dagmap -c code #hpo_path#/hpo_parent_child.tsv
| group -gc code,name -set -sc dag_node -len 100000
| where listsize(set_dag_node) <= 1000
| rename set_dag_node case_codes;
""")
%%gor
$mydefs
nor [#case#] | top 10 | calc ls listsize(#3)
Query ran in 0.11 sec Query fetched 10 rows in 0.01 sec (total time 0.12 sec)
code | name | case_codes | ls | |
---|---|---|---|---|
0 | HP:0000002 | Abnormality of body height | HP:0000002,HP:0000098,HP:0000839,HP:0001519,HP... | 35 |
1 | HP:0000003 | Multicystic kidney dysplasia | HP:0000003 | 1 |
2 | HP:0000006 | Autosomal dominant inheritance | HP:0000006 | 1 |
3 | HP:0000007 | Autosomal recessive inheritance | HP:0000007 | 1 |
4 | HP:0000008 | Abnormal morphology of female internal genitalia | HP:0000008,HP:0000013,HP:0000130,HP:0000131,HP... | 106 |
5 | HP:0000009 | Functional abnormality of the bladder | HP:0000009,HP:0000011,HP:0000012,HP:0000016,HP... | 28 |
6 | HP:0000010 | Recurrent urinary tract infections | HP:0000010,HP:0012786,HP:0012787 | 3 |
7 | HP:0000011 | Neurogenic bladder | HP:0000011 | 1 |
8 | HP:0000012 | Urinary urgency | HP:0000012 | 1 |
9 | HP:0000013 | Hypoplasia of the uterus | HP:0000013 | 1 |
Now we have a phenotype name and all the HPO codes belonging to that phenotype listed as case_codes. The thinking is that samples with any of these HPO codes are cases for the phenotype named after "code". We could then define controls as all samples that are not cases; however, here we also want to exclude samples that are not cases but do have similar phenotypes. By similar, we mean that we traverse the HPO DAG up one level from the code that defines the phenotype and find all child nodes of those parent nodes that do not overlap the case codes.
mydefs = qh.add("""
create #excl# = nor [#case#]
| select code
| dagmap -c code -dl 1 <(nor #hpo_path#/hpo_parent_child.tsv | select #2,#1)
| where dag_dist > 0
| select code,dag_node | rename dag_node parent_node
| dagmap -c parent_node -dp -ps ',' #hpo_path#/hpo_parent_child.tsv
| where not(listhasany(dag_path,code))
| group -gc code -set -sc dag_node -len 100000
| rename set_dag_node excl_codes;
""")
%%gor
$mydefs
nor [#excl#] | top 10
Query ran in 0.11 sec Query fetched 10 rows in 0.02 sec (total time 0.13 sec)
code | excl_codes | |
---|---|---|
0 | HP:0000002 | HP:0000823,HP:0000839,HP:0001507,HP:0001508,HP... |
1 | HP:0000003 | HP:0000107,HP:0000108,HP:0000113,HP:0000800,HP... |
2 | HP:0000006 | HP:0000007,HP:0001417,HP:0001419,HP:0001423,HP... |
3 | HP:0000007 | HP:0000006,HP:0001417,HP:0001419,HP:0001423,HP... |
4 | HP:0000008 | HP:0000022,HP:0000024,HP:0000030,HP:0000031,HP... |
5 | HP:0000009 | HP:0000014,HP:0000015,HP:0000021,HP:0001586,HP... |
6 | HP:0000010 | HP:0000083,HP:0000093,HP:0000099,HP:0000100,HP... |
7 | HP:0000011 | HP:0000009,HP:0000012,HP:0000016,HP:0000017,HP... |
8 | HP:0000012 | HP:0000009,HP:0000011,HP:0000016,HP:0000017,HP... |
9 | HP:0000013 | HP:0000151,HP:0008684 |
See how in the above query we first use the DAGMAP command with the reversed parent-child graph and set the option -dl to 1, because we only want to go up one level from "code". We filter out dag_dist of zero, because that is the node itself. Then we traverse down from dag_node (the parent of code) and use the -dp option to get the path traversed, applying initial pruning to exclude paths that lead back to the node itself, i.e. not(listhasany(dag_path,code)). Next we join together the case codes and the exclusion codes and apply extra pruning, in case there is overlap between them due to the multi-parental nature of the HPO DAG.
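As a side note, LISTFILTER (used for the extra pruning in the next step) evaluates a boolean expression once per list element, with x bound to the element. The following toy illustration is only a sketch, assuming NORROWS and the int() function behave as in standard GOR; the column names are made up:
%%gor
norrows 1
| calc nums '1,5,10,20'
| calc kept listfilter(nums,'int(x) >= 10')
| hide rownum
The kept column should then contain only 10,20.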
mydefs = qh.add("""
create #case_excl_codes# = nor [#all_codes#]
| map -c code [#case#]
| calc case_size listsize(case_codes)
| map -c code [#excl#]
| replace excl_codes listfilter(excl_codes,'not(listhasany(case_codes,x))')
| calc excl_size listsize(excl_codes);
""")
%%gor
$mydefs
nor [#case_excl_codes#] | top 10
Query ran in 0.11 sec Query fetched 10 rows in 0.02 sec (total time 0.14 sec)
code | name | case_codes | case_size | excl_codes | excl_size | |
---|---|---|---|---|---|---|
0 | HP:0000002 | Abnormality of body height | HP:0000002,HP:0000098,HP:0000839,HP:0001519,HP... | 35 | HP:0000823,HP:0001507,HP:0001508,HP:0001510,HP... | 68 |
1 | HP:0000003 | Multicystic kidney dysplasia | HP:0000003 | 1 | HP:0000107,HP:0000108,HP:0000113,HP:0000800,HP... | 12 |
2 | HP:0000006 | Autosomal dominant inheritance | HP:0000006 | 1 | HP:0000007,HP:0001417,HP:0001419,HP:0001423,HP... | 11 |
3 | HP:0000007 | Autosomal recessive inheritance | HP:0000007 | 1 | HP:0000006,HP:0001417,HP:0001419,HP:0001423,HP... | 11 |
4 | HP:0000008 | Abnormal morphology of female internal genitalia | HP:0000008,HP:0000013,HP:0000130,HP:0000131,HP... | 106 | HP:0000022,HP:0000024,HP:0000030,HP:0000031,HP... | 62 |
5 | HP:0000009 | Functional abnormality of the bladder | HP:0000009,HP:0000011,HP:0000012,HP:0000016,HP... | 28 | HP:0000014,HP:0000015,HP:0000021,HP:0001586,HP... | 30 |
6 | HP:0000010 | Recurrent urinary tract infections | HP:0000010,HP:0012786,HP:0012787 | 3 | HP:0000083,HP:0000093,HP:0000099,HP:0000100,HP... | 491 |
7 | HP:0000011 | Neurogenic bladder | HP:0000011 | 1 | HP:0000009,HP:0000012,HP:0000016,HP:0000017,HP... | 27 |
8 | HP:0000012 | Urinary urgency | HP:0000012 | 1 | HP:0000009,HP:0000011,HP:0000016,HP:0000017,HP... | 27 |
9 | HP:0000013 | Hypoplasia of the uterus | HP:0000013 | 1 | HP:0000151,HP:0008684 | 2 |
Mapping samples to the phenotypes in the catalog¶
Ultimately, we want to map the subjects based on their associated HPO observations. Before we do that, we generate a mapping table to make the logic more efficient. Additionally, we throw away those machine-generated phenotypes that have a very high number of codes in their definitions.
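Before settling on a cut-off, one can check how many of the machine-generated definitions such a filter would discard. The query below is a small sketch, not part of the original flow, using the same 100-code threshold on both lists:
%%gor
$mydefs
nor [#case_excl_codes#]
| calc bucket if(case_size < 100 and excl_size < 100,'kept','dropped')
| group -gc bucket -count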
mydefs = qh.add_many("""
def #ce_filter# = where case_size < 100 and excl_size < 100;
create #case_map# = nor [#case_excl_codes#] | #ce_filter# | select code,case_codes | calc status '1'
| split case_codes | select case_codes,code,status | rename code pheno_code | rename #1 code;
create #excl_map# = nor [#case_excl_codes#] | #ce_filter# | select code,excl_codes | calc status '3'
| split excl_codes | select excl_codes,code,status | rename code pheno_code | rename #1 code;
create #pheno_map# = nor [#case_map#] | merge [#excl_map#];
""")
Now we use the #pheno_map# to map the PN-based HPO observations to the phenotypes and their corresponding case/exclusion status. Importantly, because the samples have multiple observations, a subject can qualify for both case and exclusion status for any given phenotype. Here we let the case status override by taking the minimum status per PN. There are certainly many more ways one could do this, e.g. by weighing the relative counts of these statuses; however, the purpose here is not to produce definitive phenotype definitions but rather to present some useful commands and related concepts.
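The precedence rule is simply that the smallest status wins: a subject that hits both a case code (status 1) and an exclusion code (status 3) for the same phenotype ends up as a case. A toy pandas equivalent of the GROUP -min step, using made-up data rather than project data:
# one subject hits both a case code (1) and an exclusion code (3) for the same phenotype;
# taking the minimum keeps the case status
toy = pd.DataFrame({'pn': ['S1', 'S1'],
                    'pheno_code': ['HP:0000002', 'HP:0000002'],
                    'status': [1, 3]})
toy.groupby(['pn', 'pheno_code'], as_index=False)['status'].min()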
%%gor
$mydefs
nor #hpo_pn# | select pn,code | multimap -c code [#pheno_map#]
| group -gc pn,pheno_code -min -ic status
| rename min_status status
| write user_data/hakon/hpo_pheno.tsv
Query ran in 0.13 sec Query fetched 0 rows in 10.70 sec (total time 10.83 sec)
ChromNOR | PosNOR |
---|
Above, we mapped all the 24k samples to over 5k phenotype definitions in a few seconds and wrote the results as a sparse phenotype TSV file. We can get summary-level counts for these phenotypes as well:
mydefs = qh.add_many("""
create #pheno_counts# = nor user_data/hakon/hpo_pheno.tsv
| group -gc pheno_code,status -count
| pivot -gc pheno_code -v 1,3 status -e 0
| rename #2 caseCount
| rename #3 exclCount
| multimap -cartesian <(nor user_data/hakon/hpo_pheno.tsv | group -gc pn | group -count )
| calc phenoCtrlCount allcount-#2-#3
| hide allcount
| sort -c phenoCtrlCount:n
| map -c #1 <(nor [#case_excl_codes#] | select #1,name);
""")
%%gor
$mydefs
nor [#pheno_counts#]
Query ran in 0.14 sec Query fetched 4,670 rows in 0.04 sec (total time 0.18 sec)
pheno_code | caseCount | exclCount | phenoCtrlCount | name | |
---|---|---|---|---|---|
0 | HP:0011458 | 4312 | 702 | 4440 | Abdominal symptom |
1 | HP:0012531 | 4343 | 458 | 4653 | Pain |
2 | HP:0030155 | 2 | 4359 | 5093 | Scrotal pain |
3 | HP:0007281 | 4 | 4345 | 5105 | Developmental stagnation |
4 | HP:0002376 | 361 | 3988 | 5105 | Developmental regression |
... | ... | ... | ... | ... | ... |
4665 | HP:0003090 | 1 | 0 | 9453 | Hypoplasia of the capital femoral epiphysis |
4666 | HP:0410339 | 1 | 0 | 9453 | Insect bite allergy |
4667 | HP:0500156 | 1 | 0 | 9453 | Hyperasparaginemia |
4668 | HP:0012405 | 1 | 0 | 9453 | Hypocitraturia |
4669 | HP:0500191 | 1 | 0 | 9453 | Increased CSF leucine concentration |
4670 rows × 5 columns
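As an optional visual check, not part of the original flow, the #pheno_counts# relation can be pulled into a dataframe and plotted in the same way as the per-sample histogram earlier; pheno_counts_df is a hypothetical variable name and caseCount is the column shown above:
%%gor pheno_counts_df <<
$mydefs
nor [#pheno_counts#]
p = pheno_counts_df['caseCount'].plot.hist(bins=100, range=[0,1000], histtype='step', title='Cases per phenotype')
p.set_ylabel('#phenotypes')
p.set_xlabel('#case samples')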
We can now write out the summary of the entire HPO phenotype catalog as follows:
%%gor
$mydefs
nor [#pheno_counts#] | map -c #1 [#case_excl_codes#]
| hide namex | write user_data/hakon/hpo_pheno_summary.tsv
Query ran in 0.32 sec Query fetched 0 rows in 0.94 sec (total time 1.26 sec)
ChromNOR | PosNOR |
---|
In the notebook HPO_Phewas, we show how these 4669 phenotypes can be used to perform a very rapid PheWAS.