ICD Longitudinal¶
Longitudinal phenotype analysis¶
First we include boiler plate code to define the project, set environment variables and import libraries.
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1
project = "janssen_sle"
%env GOR_API_PROJECT={project}
We start by defining a new ICD data table, here as the combination of two table in a our project.
gordefs = """
create #icdtable# = nor <(nor phenotypes/source/js_19k_array/sle/clinical_features.tsv | calc table 'clinical_features')
| merge <(nor phenotypes/source/js_19k_array/sle/selection_criteria.tsv | calc table 'selection_criteria')
| select pn,icd_code,description,entry_date,disease,disease_category,table;
"""
%%gor
$gordefs
nor [#icdtable#] | top 5
Query ran in 0.28 sec Query fetched 5 rows in 0.00 sec (total time 0.28 sec)
pn | icd_code | description | entry_date | disease | disease_category | table | |
---|---|---|---|---|---|---|---|
0 | GD515293101 | 285.9 | Anemia, unspecified | 2012-05-21 | Anemia | Hematologic | clinical_features |
1 | GD515293101 | 368.8 | Other specified visual disturbances | 2012-05-27 | Visual disturbance | Neurological | clinical_features |
2 | GD515293101 | 368.9 | Unspecified visual disturbance | 2012-05-21 | Visual disturbance | Neurological | clinical_features |
3 | GD515293101 | 585.9 | Chronic kidney disease, unspecified | 2013-11-15 | Chronic kidney disease | Renal | clinical_features |
4 | GD515293101 | 585.9 | Chronic kidney disease, unspecified | 2011-03-20 | Chronic kidney disease | Renal | clinical_features |
List summary¶
We can summarize the ICD codes for each PN in the two tables in following way using the PIVOT
command:
%%gor
$gordefs
nor [#icdtable#]
| group -gc pn,table -count -set -sc icd_code
| pivot table -v 'clinical_features','selection_criteria' -gc pn
Query ran in 0.12 sec Query fetched 1,402 rows in 0.21 sec (total time 0.33 sec)
pn | clinical_features_allCount | clinical_features_set_icd_code | selection_criteria_allCount | selection_criteria_set_icd_code | |
---|---|---|---|---|---|
0 | GD515267501 | 6 | 511.0,710.0,M32.10,M32.9 | 4 | 710.0,M32.10,M32.9 |
1 | GD515474101 | 8 | 710.0,M32.8 | 8 | 710.0,M32.8 |
2 | GD515402901 | 33 | 285.29,285.9,415.19,585.4,585.6,585.9,695.4,71... | 5 | 695.4,710.0 |
3 | GD515711101 | 35 | 288.00,515,710.0,719.42,719.43,719.44,719.46,7... | 77 | 710.0,710.8,710.9,714.0,M32.10,M35.1,M35.8,M35.9 |
4 | GD515473501 | 70 | 710.0,M32.10,M32.19,M32.9 | 72 | 710.0,M32.0,M32.10,M32.19,M32.9 |
... | ... | ... | ... | ... | ... |
1397 | GD515710701 | 27 | 585.1,585.9,710.0,719.41,782.1,791.0 | 27 | 710.0,710.9,714.0,M05.9,M06.09 |
1398 | GD515434801 | 32 | 287.5,710.0,719.44,719.46,719.49,729.1,G63,K12... | 14 | 710.0,M32.10,M32.19,M32.9 |
1399 | GD515479201 | 42 | 719.46,727.00,727.09,782.1,F06.31,M25.50,M25.5... | 4 | 710.8,M32.9 |
1400 | GD515700701 | 37 | 285.9,424.1,515,585.9,695.4,710.0,791.0 | 20 | 695.4,710.0,714.0 |
1401 | GD515434501 | 163 | 285.9,288.50,585,585.1,585.2,585.3,585.4,585.9... | 86 | 695.4,710.0,M32.10,M32.14,M32.19,M32.9 |
1402 rows × 5 columns
Similarly, we can collapse the tables into one and identify samples with two or more ICD codes of interes.
%%gor dfICDs <<
$gordefs
nor [#icdtable#]
| group -gc pn -count -set -sc icd_code
| calc has2859and2885 if(listhascount(set_icd_code,'285.9','288.50')>=2,1,0)
| sort -c has2859and2885:r
Query ran in 0.16 sec Query fetched 1,402 rows in 0.19 sec (total time 0.35 sec)
We show few selected rows from the the output with the following Python logic:
oc = 0
zc = 0
for i in range(0,len(dfICDs)):
if dfICDs.at[i,'has2859and2885'] == 1 and oc < 5:
print(f"{i+1}\t{dfICDs.at[i,'pn']}\t{dfICDs.at[i,'has2859and2885']}\t{dfICDs.at[i,'set_icd_code']}")
oc += 1
print("----------------")
if dfICDs.at[i,'has2859and2885'] == 0 and zc < 5:
print(f"{i+1}\t{dfICDs.at[i,'pn']}\t{dfICDs.at[i,'has2859and2885']}\t{dfICDs.at[i,'set_icd_code']}")
zc += 1
print("----------------")
1 GD514000901 1 285.9,288.50,424.1,425.4,585,585.1,585.2,585.3,585.9,695.4,710.0,710.2,710.3,710.8,714.0,719.41,719.46,719.47,729.1,780.60,782.1,D64.9,I42.5,M25.511,M25.519,M25.542,M25.561,M25.569,M32.9,M35.00,M35.01,M35.1,M79.1,N18.2,N18.3,R21 ---------------- 2 GD514086701 1 284.9,285.9,288.50,397.0,424.0,424.1,424.2,425.4,433,433.10,433.30,517.2,585.2,585.3,585.4,585.6,585.9,710.0,714.0,719.40,719.42,719.43,719.49,729.1,780.60,780.61,D64.9,D72.819,I36.1,I42.9,L63.9,M05.751,M05.752,M06.9,M25.512,M32.10,M32.19,M32.9,N18.3,N18.4,N18.9,R50.9,V45.81,Z95.1 ---------------- 3 GD514524901 1 285.29,285.9,287.5,288.50,424.0,710.0,719.47,729.1,780.39,780.60,784.0,D63.8,D64.9,D69.6,D70.9,D72.819,G63,H53.10,I08.1,I34.0,M25.569,M25.571,M32.11,M32.9,M79.1,N18.1,N18.2,N18.3,N18.9,R21,R50.9,R51,R56.9 ---------------- 4 GD514526801 1 285.29,285.9,288.50,517.8,528.2,585.1,585.2,585.4,585.9,695.4,710.0,714.0,719.46,780.39,782.1,784.0,791.0,D63.8,M32.10,M32.14,M32.19,M32.8,M32.9,N03.3,N18.2,N18.9,R21,R51,R56.9 ---------------- 5 GD514533401 1 285.9,287.5,288.50,413.9,515,528.00,528.09,585.2,585.3,585.6,585.9,695.4,710.0,710.1,710.3,780.60,782.1,784.0,791.0,D64.9,D69.6,I20.8,I20.9,J84.10,J84.17,M32.0,M32.10,M32.13,M32.14,M32.9,M35.8,M35.9,N18.3,N18.9,R51 ---------------- 50 GD513644101 0 710.0,727.05,L65.9,M25.50,M25.541,M25.542,M32.9,M35.00,M79.1 ---------------- 51 GD513645801 0 710.0,719.40,719.41,719.49,729.1,M12.9,M25.50,M32.9,M79.1 ---------------- 52 GD513664201 0 710.0 ---------------- 53 GD513791601 0 710.0,710.9,719.44 ---------------- 54 GD513812001 0 719.47,H53.8,M05.9,M06.059,M06.9,M25.531,M25.532,M25.551,M25.561,M25.571,M25.579,M32.9,R50.9,R51 ----------------
Finding first, last, and last-n ICDs¶
There are several ways to play with time ordered data as we will show, but first lets change the definition of are table such that it is ordered by samples and date in order to leverage "-ordered" option in some of the commads, like ÀTMIN
and GROUP
, and make the evaluation of the queries more efficient.
gordefs = """
create #icdtable# = nor <(nor phenotypes/source/js_19k_array/sle/clinical_features.tsv | calc table 'clinical_features')
| merge <(nor phenotypes/source/js_19k_array/sle/selection_criteria.tsv | calc table 'selection_criteria')
| select pn,icd_code,description,entry_date,disease,disease_category,table
| sort -c pn,entry_date;
create #ordered# = nor [#icdtable#]
| sort -c pn,entry_date;
"""
Row based approach¶
Now we can write the following queries. They are pretty self-explanatory keeping in mind the ordering of the #ordered# table and that the command ROWNUM
generates a column with the row number. Note however that the RANK
command does not support the"-ordered" option at the moment.
%%gor dfRowlast3 <<
$gordefs
create #first# = nor [#ordered#]
| rownum
| atmin rownum -gc pn -ordered
| hide rownum;
create #last# = nor [#ordered#]
| rownum
| atmax rownum -gc pn -ordered
| hide rownum;
create #last3# = nor [#ordered#]
| rownum
| rank rownum -gc pn -o desc
| where rank_rownum <= 3
| hide rownum,rank_rownum;
create #first3# = nor [#ordered#]
| rownum
| rank rownum -gc pn -o asc
| where rank_rownum <= 3
| hide rownum,rank_rownum;
nor [#last3#]
Query ran in 0.13 sec Query fetched 4,191 rows in 0.02 sec (total time 0.15 sec)
dfRowlast3[0:3]
pn | icd_code | description | entry_date | disease | disease_category | table | |
---|---|---|---|---|---|---|---|
0 | GD513644101 | M25.542 | Pain in joints of left hand | 2017-02-09 | Arthritis | Joint disease | clinical_features |
1 | GD513644101 | M35.00 | Sicca syndrome, unspecified | 2017-02-09 | Sjogren's | Other autoimmune disease | selection_criteria |
2 | GD513644101 | L65.9 | Nonscarring hair loss, unspecified | 2017-04-04 | Alopecia | Mucocutaneous | clinical_features |
List based approach¶
Here we turn all the columns into a list per PN and then use the listfilter to pick only the last 3 elemenst. The list-functions LISTMAP
, LISTFILTER
, and LISTZIPFILTER
recognize two variables in addition to the column values; namely x for the "current" element value (as string) and i for the list element index (as int).
%%gor dfListlast3 <<
$gordefs
create #listlast3# = nor [#ordered#]
| group -gc pn -ordered -lis -sc #2- -len 100000 -notruncate -count
| replace lis_* listfilter(#rc,'i>allcount-3')
| rename lis_(.*) last3_#{1};
nor [#listlast3#]
Query ran in 0.13 sec Query fetched 1,402 rows in 0.04 sec (total time 0.17 sec)
dfListlast3[0:3]
pn | allCount | last3_icd_code | last3_description | last3_entry_date | last3_disease | last3_disease_category | last3_table | |
---|---|---|---|---|---|---|---|---|
0 | GD513644101 | 13 | M25.542,M35.00,L65.9 | unspecified,Pain in unspecified joint,Myalgia... | 2017-02-09,2017-02-09,2017-04-04 | Arthritis,Sjogren's,Alopecia | Joint disease,Other autoimmune disease,Mucocut... | clinical_features,selection_criteria,clinical_... |
1 | GD513645801 | 23 | M32.9,M32.9,M32.9 | Pain in joint, multiple sites,Myalgia and myos... | 2018-10-07,2019-11-19,2019-11-19 | SLE,SLE,SLE | , | clinical_features,selection_criteria,clinical_... |
2 | GD513664201 | 4 | 710.0,710.0,710.0 | Systemic lupus erythematosus,Systemic lupus er... | 2014-06-16,2014-09-10,2014-09-10 | SLE,SLE,SLE | , | selection_criteria,clinical_features,selection... |
We can compare the results and see that they are identical using the THROWIF
command like this, utilizing the federated virtual relations based on Python dataframes.
%%gor
nor [var:dfRowlast3]
| group -gc pn -lis -sc 2- /* collapse into singleton */
| multimap -c pn -m 'missing' [var:dfListlast3] /* joining two singleton tables on PN */
| select pn,lis_icd_code,lis_disease,last3_icd_code,last3_disease
| throwif lis_icd_code != last3_icd_code or lis_disease != last3_disease
| top 3
Loading relations [var:dfListlast3], [var:dfRowlast3] from local state Query ran in 0.25 sec Query fetched 3 rows in 0.04 sec (total time 0.29 sec)
pn | lis_icd_code | lis_disease | last3_icd_code | last3_disease | |
---|---|---|---|---|---|
0 | GD513644101 | M25.542,M35.00,L65.9 | Arthritis,Sjogren's,Alopecia | M25.542,M35.00,L65.9 | Arthritis,Sjogren's,Alopecia |
1 | GD513645801 | M32.9,M32.9,M32.9 | SLE,SLE,SLE | M32.9,M32.9,M32.9 | SLE,SLE,SLE |
2 | GD513664201 | 710.0,710.0,710.0 | SLE,SLE,SLE | 710.0,710.0,710.0 | SLE,SLE,SLE |
De-list the data¶
We could also easily turn the list-based data into row-level data using the SPLIT
command, e.g.
%%gor
nor [var:dfListlast3] | select pn,last3_icd_code,last3_entry_date | split 2- | rename last3_(.*) #{1}
| top 5
Loading relations [var:dfListlast3] from local state Query ran in 0.16 sec Query fetched 5 rows in 0.00 sec (total time 0.16 sec)
pn | icd_code | entry_date | |
---|---|---|---|
0 | GD513644101 | M25.542 | 2017-02-09 |
1 | GD513644101 | M35.00 | 2017-02-09 |
2 | GD513644101 | L65.9 | 2017-04-04 |
3 | GD513645801 | M32.9 | 2018-10-07 |
4 | GD513645801 | M32.9 | 2019-11-19 |
We see that the samples have only three rows each. Note that the above approach may not work well if there are column characters that "conflict" with the list separator (here by default a simple comma). A more specific separator can be specified both in GROU
, SPLIT
and the LISFILTER
function.
Generating multiple Boolean conditions¶
This example show a simple pattern, based on GROUP
and "-max", where multiple Boolean variables are defined where each is only based on a single observation.
%%gor dfConds <<
$gordefs
nor [#ordered#]
| calc cond1 if(icd_code ~ 'M*' and description ~ '*lupus*',1,0)
| calc cond2 if(icd_code ~ '7*' and disease = 'SLE',1,0)
| group -gc pn -max -ic cond*
| rename max_(cond.*) #{1};
Query ran in 0.14 sec Query fetched 1,402 rows in 0.58 sec (total time 0.71 sec)
dfConds
pn | cond1 | cond2 | |
---|---|---|---|
0 | GD513644101 | 1 | 1 |
1 | GD513645801 | 1 | 1 |
2 | GD513664201 | 0 | 1 |
3 | GD513791601 | 0 | 1 |
4 | GD513812001 | 1 | 0 |
... | ... | ... | ... |
1397 | GD515716501 | 1 | 1 |
1398 | GD515716601 | 1 | 1 |
1399 | GD515716701 | 0 | 1 |
1400 | GD515716801 | 1 | 1 |
1401 | GD515716901 | 1 | 1 |
1402 rows × 3 columns
Table like this is easily explored in the SM grid using the Venn-tool or we can simply summarize it in the following way:
%%gor
nor [var:dfConds] | group -gc cond1,cond2 -count
Loading relations [var:dfConds] from local state Query ran in 0.12 sec Query fetched 3 rows in 0.00 sec (total time 0.12 sec)
cond1 | cond2 | allCount | |
---|---|---|---|
0 | 0 | 1 | 381 |
1 | 1 | 0 | 124 |
2 | 1 | 1 | 897 |
Time based window analysis¶
Join based approach¶
The last set of examples will show how we can work with longitudinal data and do window based analysis. For example be able to count the number of ICD codes in a particular time window or to find a set of certain codes that coexist in a given time window. The first approach uses a MULTIMAP
join, leveraging the order optimization in that command. Importantly, it prevents the right-source to be loaded into memory, which is important if we have high number of observations.
%%gor
$gordefs
def datediff($1,$2) = (int(substr($1,0,4))-int(substr($2,0,4)))*365+(int(substr($1,5,7))-int(substr($2,5,7)))*31+int(substr($1,8,10))-int(substr($2,8,10));
nor [#ordered#] | hide description
| multimap -c pn -ordered [#ordered#]
| where abs( datediff(entry_date,entry_datex) ) <= 365
| group -gc pn-entry_date -count -ordered -lis -sc icd_codex
| calc condition if(listhascount(lis_icd_codex,'710.0,M32.9,M25.50,727.05')>2,1,0)
| top 10
Query ran in 0.13 sec Query fetched 10 rows in 19.64 sec (total time 19.77 sec)
pn | icd_code | entry_date | allCount | lis_icd_codex | condition | |
---|---|---|---|---|---|---|
0 | GD513644101 | 710.0 | 2014-10-02 | 5 | 710.0,727.05,710.0,M32.9,M32.9 | 1 |
1 | GD513644101 | 727.05 | 2014-10-02 | 5 | 710.0,727.05,710.0,M32.9,M32.9 | 1 |
2 | GD513644101 | 710.0 | 2014-10-02 | 5 | 710.0,727.05,710.0,M32.9,M32.9 | 1 |
3 | GD513644101 | M32.9 | 2015-09-01 | 14 | 710.0,727.05,710.0,M32.9,M32.9,M32.9,M32.9,710... | 1 |
4 | GD513644101 | M32.9 | 2016-06-14 | 20 | M32.9,M32.9,M32.9,M32.9,M25.50,M79.1,M25.541,M... | 0 |
5 | GD513644101 | M25.50 | 2016-10-11 | 8 | M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0... | 0 |
6 | GD513644101 | M79.1 | 2017-02-09 | 8 | M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0... | 0 |
7 | GD513644101 | M25.541 | 2017-02-09 | 8 | M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0... | 0 |
8 | GD513644101 | M25.542 | 2017-02-09 | 8 | M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0... | 0 |
9 | GD513644101 | M35.00 | 2017-02-09 | 8 | M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0... | 0 |
Parallel NOR version¶
These types of queries can be quite costly and the cost goes up with the square of the number of observations per sample. We can speed things up by assigning multiple workers to the processing with the PARALLEL
command and partition the PN samples on the fly using the "in-order-of-appearance" function ÌOOA
and filter them with WHERE
based on the modulus of their row number. In addition to associating each ICD observation with the count and list of other ICD codes that are close in time, we can compute a Boolean condition based on the diagnoses found in the time window.
%%gor
$gordefs
def datediff($1,$2) = (int(substr($1,0,4))-int(substr($2,0,4)))*365+(int(substr($1,5,7))-int(substr($2,5,7)))*31+int(substr($1,8,10))-int(substr($2,8,10));
def #parts# = 4;
create #partcount# = parallel -parts <(norrows #parts#) <(nor [#ordered#] | hide description
| where mod(iooa(pn),#parts#) = #{col:rownum}
| multimap -c pn -ordered [#ordered#]
| where abs( datediff(entry_date,entry_datex) ) <= 365
| group -gc pn-entry_date -count -ordered -lis -sc icd_codex);
nor [#partcount#] | calc condition if(listhascount(lis_icd_codex,'710.0,M32.9,M25.50,727.05')>2,1,0)
| group -gc pn -max -ic condition
| group -gc max_condition -count
| rename max_condition samples_with_condition
Query ran in 0.22 sec Query fetched 2 rows in 2.71 sec (total time 2.92 sec)
samples_with_condition | allCount | |
---|---|---|
0 | 0 | 1334 |
1 | 1 | 68 |
List based approach¶
Finally, we show how the same type of window based time analysis can be done using list functions in the GORpipe syntax. One can argue that syntactically this approach is more complicated, however, for the analysis presented here it is actually close to 4x more efficient! The idea is to collapse all the observations per sample into lists. Then clone the lists, split them and extract sub-lists based on the time condition using LISTZIPFILTER
where we can filter the ICD code list based on the time list.
%%gor dfListbased <<
$gordefs
nor [#ordered#]
| calc dt int(substr(entry_date,0,4))*365+int(substr(entry_date,5,7))*31+int(substr(entry_date,8,10))
| group -gc pn -lis -sc icd_code,entry_date,dt -count -ordered -notruncate -len 100000
| calc dt lis_dt
| calc icd_code lis_icd_code
| split dt,icd_code,lis_entry_date
| rename lis_entry_date entry_date
| replace lis_icd_code listzipfilter(lis_icd_code,lis_dt,'abs(int(x)-dt) < 365')
| hide lis_dt,dt
Query ran in 0.14 sec Query fetched 149,077 rows in 16.23 sec (total time 16.37 sec)
dfListbased[0:5]
pn | allCount | lis_icd_code | entry_date | icd_code | |
---|---|---|---|---|---|
0 | GD513644101 | 13 | 710.0,727.05,710.0,M32.9,M32.9 | 2014-10-02 | 710.0 |
1 | GD513644101 | 13 | 710.0,727.05,710.0,M32.9,M32.9 | 2014-10-02 | 727.05 |
2 | GD513644101 | 13 | 710.0,727.05,710.0,M32.9,M32.9 | 2014-10-02 | 710.0 |
3 | GD513644101 | 13 | 710.0,727.05,710.0,M32.9,M32.9,M32.9,M32.9 | 2015-09-01 | M32.9 |
4 | GD513644101 | 13 | 710.0,727.05,710.0,M32.9,M32.9,M32.9,M32.9 | 2015-09-01 | M32.9 |
While the above approach is faster than the join approach shown earlier, it is more complex to read and the performance difference may be smaller if we need to keep track of more columns than icd_code and entry_date like in our example here. Obviously, the above query can be parallelized in the same way as the multimap join based approach. Finally, we show that we get the same sample count for our earlier condition analysis and we leave it as an exersize for the reader to show that they give exactly the same result.
%%gor
nor [var:dfListbased]
| calc condition if(listhascount(lis_icd_code,'710.0,M32.9,M25.50,727.05')>2,1,0)
| group -gc pn -max -ic condition
| group -gc max_condition -count
| rename max_condition samples_with_condition
Loading relations [var:dfListbased] from local state Query ran in 4.89 sec. Query fetched 2 rows in 2.65 sec (total time 7.54 sec)
samples_with_condition | allCount | |
---|---|---|
0 | 0 | 1334 |
1 | 1 | 68 |