ICD Longitudinal¶

ICD_longitudinal

Longitudinal phenotype analysis¶

First we include boiler plate code to define the project, set environment variables and import libraries.

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1

project = "janssen_sle"
%env GOR_API_PROJECT={project}

We start by defining a new ICD data table, here as the combination of two table in a our project.

In [2]:

gordefs = """
create #icdtable# = nor <(nor phenotypes/source/js_19k_array/sle/clinical_features.tsv | calc table 'clinical_features')
| merge <(nor phenotypes/source/js_19k_array/sle/selection_criteria.tsv | calc table 'selection_criteria')
| select pn,icd_code,description,entry_date,disease,disease_category,table;
"""

In [3]:

%%gor
$gordefs

nor [#icdtable#] | top 5

Query ran in 0.28 sec
Query fetched 5 rows in 0.00 sec (total time 0.28 sec)

Out[3]:

	pn	icd_code	description	entry_date	disease	disease_category	table
0	GD515293101	285.9	Anemia, unspecified	2012-05-21	Anemia	Hematologic	clinical_features
1	GD515293101	368.8	Other specified visual disturbances	2012-05-27	Visual disturbance	Neurological	clinical_features
2	GD515293101	368.9	Unspecified visual disturbance	2012-05-21	Visual disturbance	Neurological	clinical_features
3	GD515293101	585.9	Chronic kidney disease, unspecified	2013-11-15	Chronic kidney disease	Renal	clinical_features
4	GD515293101	585.9	Chronic kidney disease, unspecified	2011-03-20	Chronic kidney disease	Renal	clinical_features

List summary¶

We can summarize the ICD codes for each PN in the two tables in following way using the PIVOT command:

In [4]:

%%gor
$gordefs

nor [#icdtable#] 
| group -gc pn,table -count -set -sc icd_code
| pivot table -v 'clinical_features','selection_criteria' -gc pn

Query ran in 0.12 sec
Query fetched 1,402 rows in 0.21 sec (total time 0.33 sec)

Out[4]:

	pn	clinical_features_allCount	clinical_features_set_icd_code	selection_criteria_allCount	selection_criteria_set_icd_code
0	GD515267501	6	511.0,710.0,M32.10,M32.9	4	710.0,M32.10,M32.9
1	GD515474101	8	710.0,M32.8	8	710.0,M32.8
2	GD515402901	33	285.29,285.9,415.19,585.4,585.6,585.9,695.4,71...	5	695.4,710.0
3	GD515711101	35	288.00,515,710.0,719.42,719.43,719.44,719.46,7...	77	710.0,710.8,710.9,714.0,M32.10,M35.1,M35.8,M35.9
4	GD515473501	70	710.0,M32.10,M32.19,M32.9	72	710.0,M32.0,M32.10,M32.19,M32.9
...	...	...	...	...	...
1397	GD515710701	27	585.1,585.9,710.0,719.41,782.1,791.0	27	710.0,710.9,714.0,M05.9,M06.09
1398	GD515434801	32	287.5,710.0,719.44,719.46,719.49,729.1,G63,K12...	14	710.0,M32.10,M32.19,M32.9
1399	GD515479201	42	719.46,727.00,727.09,782.1,F06.31,M25.50,M25.5...	4	710.8,M32.9
1400	GD515700701	37	285.9,424.1,515,585.9,695.4,710.0,791.0	20	695.4,710.0,714.0
1401	GD515434501	163	285.9,288.50,585,585.1,585.2,585.3,585.4,585.9...	86	695.4,710.0,M32.10,M32.14,M32.19,M32.9

1402 rows × 5 columns

Similarly, we can collapse the tables into one and identify samples with two or more ICD codes of interes.

In [5]:

%%gor dfICDs <<
$gordefs

nor [#icdtable#] 
| group -gc pn -count -set -sc icd_code
| calc has2859and2885 if(listhascount(set_icd_code,'285.9','288.50')>=2,1,0)
| sort -c has2859and2885:r

Query ran in 0.16 sec
Query fetched 1,402 rows in 0.19 sec (total time 0.35 sec)

We show few selected rows from the the output with the following Python logic:

In [6]:

oc = 0
zc = 0
for i in range(0,len(dfICDs)):
    if dfICDs.at[i,'has2859and2885'] == 1 and oc < 5:
        print(f"{i+1}\t{dfICDs.at[i,'pn']}\t{dfICDs.at[i,'has2859and2885']}\t{dfICDs.at[i,'set_icd_code']}")
        oc += 1
        print("----------------")
    if dfICDs.at[i,'has2859and2885'] == 0 and zc < 5:
        print(f"{i+1}\t{dfICDs.at[i,'pn']}\t{dfICDs.at[i,'has2859and2885']}\t{dfICDs.at[i,'set_icd_code']}")
        zc += 1    
        print("----------------")

1	GD514000901	1	285.9,288.50,424.1,425.4,585,585.1,585.2,585.3,585.9,695.4,710.0,710.2,710.3,710.8,714.0,719.41,719.46,719.47,729.1,780.60,782.1,D64.9,I42.5,M25.511,M25.519,M25.542,M25.561,M25.569,M32.9,M35.00,M35.01,M35.1,M79.1,N18.2,N18.3,R21
----------------
2	GD514086701	1	284.9,285.9,288.50,397.0,424.0,424.1,424.2,425.4,433,433.10,433.30,517.2,585.2,585.3,585.4,585.6,585.9,710.0,714.0,719.40,719.42,719.43,719.49,729.1,780.60,780.61,D64.9,D72.819,I36.1,I42.9,L63.9,M05.751,M05.752,M06.9,M25.512,M32.10,M32.19,M32.9,N18.3,N18.4,N18.9,R50.9,V45.81,Z95.1
----------------
3	GD514524901	1	285.29,285.9,287.5,288.50,424.0,710.0,719.47,729.1,780.39,780.60,784.0,D63.8,D64.9,D69.6,D70.9,D72.819,G63,H53.10,I08.1,I34.0,M25.569,M25.571,M32.11,M32.9,M79.1,N18.1,N18.2,N18.3,N18.9,R21,R50.9,R51,R56.9
----------------
4	GD514526801	1	285.29,285.9,288.50,517.8,528.2,585.1,585.2,585.4,585.9,695.4,710.0,714.0,719.46,780.39,782.1,784.0,791.0,D63.8,M32.10,M32.14,M32.19,M32.8,M32.9,N03.3,N18.2,N18.9,R21,R51,R56.9
----------------
5	GD514533401	1	285.9,287.5,288.50,413.9,515,528.00,528.09,585.2,585.3,585.6,585.9,695.4,710.0,710.1,710.3,780.60,782.1,784.0,791.0,D64.9,D69.6,I20.8,I20.9,J84.10,J84.17,M32.0,M32.10,M32.13,M32.14,M32.9,M35.8,M35.9,N18.3,N18.9,R51
----------------
50	GD513644101	0	710.0,727.05,L65.9,M25.50,M25.541,M25.542,M32.9,M35.00,M79.1
----------------
51	GD513645801	0	710.0,719.40,719.41,719.49,729.1,M12.9,M25.50,M32.9,M79.1
----------------
52	GD513664201	0	710.0
----------------
53	GD513791601	0	710.0,710.9,719.44
----------------
54	GD513812001	0	719.47,H53.8,M05.9,M06.059,M06.9,M25.531,M25.532,M25.551,M25.561,M25.571,M25.579,M32.9,R50.9,R51
----------------

Finding first, last, and last-n ICDs¶

There are several ways to play with time ordered data as we will show, but first lets change the definition of are table such that it is ordered by samples and date in order to leverage "-ordered" option in some of the commads, like ÀTMIN and GROUP, and make the evaluation of the queries more efficient.

In [7]:

gordefs = """
create #icdtable# = nor <(nor phenotypes/source/js_19k_array/sle/clinical_features.tsv | calc table 'clinical_features')
| merge <(nor phenotypes/source/js_19k_array/sle/selection_criteria.tsv | calc table 'selection_criteria')
| select pn,icd_code,description,entry_date,disease,disease_category,table
| sort -c pn,entry_date;

create #ordered# = nor [#icdtable#]
| sort -c pn,entry_date;
"""

Row based approach¶

Now we can write the following queries. They are pretty self-explanatory keeping in mind the ordering of the #ordered# table and that the command ROWNUM generates a column with the row number. Note however that the RANK command does not support the"-ordered" option at the moment.

In [8]:

%%gor dfRowlast3 <<
$gordefs

create #first# = nor [#ordered#]
| rownum
| atmin rownum -gc pn -ordered
| hide rownum;

create #last# = nor [#ordered#]
| rownum
| atmax rownum -gc pn -ordered
| hide rownum;

create #last3# = nor [#ordered#]
| rownum
| rank rownum -gc pn -o desc
| where rank_rownum <= 3
| hide rownum,rank_rownum;

create #first3# = nor [#ordered#]
| rownum
| rank rownum -gc pn -o asc
| where rank_rownum <= 3
| hide rownum,rank_rownum;

nor [#last3#] 

Query ran in 0.13 sec
Query fetched 4,191 rows in 0.02 sec (total time 0.15 sec)

In [9]:

dfRowlast3[0:3]

Out[9]:

	pn	icd_code	description	entry_date	disease	disease_category	table
0	GD513644101	M25.542	Pain in joints of left hand	2017-02-09	Arthritis	Joint disease	clinical_features
1	GD513644101	M35.00	Sicca syndrome, unspecified	2017-02-09	Sjogren's	Other autoimmune disease	selection_criteria
2	GD513644101	L65.9	Nonscarring hair loss, unspecified	2017-04-04	Alopecia	Mucocutaneous	clinical_features

List based approach¶

Here we turn all the columns into a list per PN and then use the listfilter to pick only the last 3 elemenst. The list-functions LISTMAP, LISTFILTER, and LISTZIPFILTER recognize two variables in addition to the column values; namely x for the "current" element value (as string) and i for the list element index (as int).

In [10]:

%%gor dfListlast3 <<
$gordefs

create #listlast3# = nor [#ordered#]
| group -gc pn -ordered -lis -sc #2- -len 100000 -notruncate -count
| replace lis_* listfilter(#rc,'i>allcount-3')
| rename lis_(.*) last3_#{1};
    
nor [#listlast3#]

Query ran in 0.13 sec
Query fetched 1,402 rows in 0.04 sec (total time 0.17 sec)

In [11]:

dfListlast3[0:3]

Out[11]:

	pn	allCount	last3_icd_code	last3_description	last3_entry_date	last3_disease	last3_disease_category	last3_table
0	GD513644101	13	M25.542,M35.00,L65.9	unspecified,Pain in unspecified joint,Myalgia...	2017-02-09,2017-02-09,2017-04-04	Arthritis,Sjogren's,Alopecia	Joint disease,Other autoimmune disease,Mucocut...	clinical_features,selection_criteria,clinical_...
1	GD513645801	23	M32.9,M32.9,M32.9	Pain in joint, multiple sites,Myalgia and myos...	2018-10-07,2019-11-19,2019-11-19	SLE,SLE,SLE	,	clinical_features,selection_criteria,clinical_...
2	GD513664201	4	710.0,710.0,710.0	Systemic lupus erythematosus,Systemic lupus er...	2014-06-16,2014-09-10,2014-09-10	SLE,SLE,SLE	,	selection_criteria,clinical_features,selection...

We can compare the results and see that they are identical using the THROWIF command like this, utilizing the federated virtual relations based on Python dataframes.

In [12]:

%%gor
nor [var:dfRowlast3]
| group -gc pn -lis -sc 2- /* collapse into singleton */
| multimap -c pn -m 'missing' [var:dfListlast3] /* joining two singleton tables on PN */
| select  pn,lis_icd_code,lis_disease,last3_icd_code,last3_disease
| throwif lis_icd_code != last3_icd_code or lis_disease != last3_disease
| top 3

Loading relations [var:dfListlast3], [var:dfRowlast3] from local state
Query ran in 0.25 sec
Query fetched 3 rows in 0.04 sec (total time 0.29 sec)

Out[12]:

	pn	lis_icd_code	lis_disease	last3_icd_code	last3_disease
0	GD513644101	M25.542,M35.00,L65.9	Arthritis,Sjogren's,Alopecia	M25.542,M35.00,L65.9	Arthritis,Sjogren's,Alopecia
1	GD513645801	M32.9,M32.9,M32.9	SLE,SLE,SLE	M32.9,M32.9,M32.9	SLE,SLE,SLE
2	GD513664201	710.0,710.0,710.0	SLE,SLE,SLE	710.0,710.0,710.0	SLE,SLE,SLE

De-list the data¶

We could also easily turn the list-based data into row-level data using the SPLIT command, e.g.

In [13]:

%%gor
nor [var:dfListlast3] | select pn,last3_icd_code,last3_entry_date | split 2- | rename last3_(.*) #{1}
| top 5

Loading relations [var:dfListlast3] from local state
Query ran in 0.16 sec
Query fetched 5 rows in 0.00 sec (total time 0.16 sec)

Out[13]:

	pn	icd_code	entry_date
0	GD513644101	M25.542	2017-02-09
1	GD513644101	M35.00	2017-02-09
2	GD513644101	L65.9	2017-04-04
3	GD513645801	M32.9	2018-10-07
4	GD513645801	M32.9	2019-11-19

We see that the samples have only three rows each. Note that the above approach may not work well if there are column characters that "conflict" with the list separator (here by default a simple comma). A more specific separator can be specified both in GROU, SPLIT and the LISFILTER function.

Generating multiple Boolean conditions¶

This example show a simple pattern, based on GROUP and "-max", where multiple Boolean variables are defined where each is only based on a single observation.

In [14]:

%%gor dfConds <<
$gordefs

nor [#ordered#]
| calc cond1 if(icd_code ~ 'M*' and description ~ '*lupus*',1,0)
| calc cond2 if(icd_code ~ '7*' and disease = 'SLE',1,0)
| group -gc pn -max -ic cond*
| rename max_(cond.*) #{1};

Query ran in 0.14 sec
Query fetched 1,402 rows in 0.58 sec (total time 0.71 sec)

In [15]:

dfConds

Out[15]:

	pn	cond1	cond2
0	GD513644101	1	1
1	GD513645801	1	1
2	GD513664201	0	1
3	GD513791601	0	1
4	GD513812001	1	0
...	...	...	...
1397	GD515716501	1	1
1398	GD515716601	1	1
1399	GD515716701	0	1
1400	GD515716801	1	1
1401	GD515716901	1	1

1402 rows × 3 columns

Table like this is easily explored in the SM grid using the Venn-tool or we can simply summarize it in the following way:

In [16]:

%%gor

nor [var:dfConds] | group -gc cond1,cond2 -count

Loading relations [var:dfConds] from local state
Query ran in 0.12 sec
Query fetched 3 rows in 0.00 sec (total time 0.12 sec)

Out[16]:

	cond1	cond2	allCount
0	0	1	381
1	1	0	124
2	1	1	897

Time based window analysis¶

Join based approach¶

The last set of examples will show how we can work with longitudinal data and do window based analysis. For example be able to count the number of ICD codes in a particular time window or to find a set of certain codes that coexist in a given time window. The first approach uses a MULTIMAP join, leveraging the order optimization in that command. Importantly, it prevents the right-source to be loaded into memory, which is important if we have high number of observations.

In [17]:

%%gor
$gordefs

def datediff($1,$2) = (int(substr($1,0,4))-int(substr($2,0,4)))*365+(int(substr($1,5,7))-int(substr($2,5,7)))*31+int(substr($1,8,10))-int(substr($2,8,10));

nor [#ordered#] | hide description
| multimap -c pn -ordered [#ordered#] 
| where abs( datediff(entry_date,entry_datex) ) <= 365
| group -gc pn-entry_date -count -ordered -lis -sc icd_codex
| calc condition if(listhascount(lis_icd_codex,'710.0,M32.9,M25.50,727.05')>2,1,0)
| top 10

Query ran in 0.13 sec
Query fetched 10 rows in 19.64 sec (total time 19.77 sec)

Out[17]:

	pn	icd_code	entry_date	allCount	lis_icd_codex	condition
0	GD513644101	710.0	2014-10-02	5	710.0,727.05,710.0,M32.9,M32.9	1
1	GD513644101	727.05	2014-10-02	5	710.0,727.05,710.0,M32.9,M32.9	1
2	GD513644101	710.0	2014-10-02	5	710.0,727.05,710.0,M32.9,M32.9	1
3	GD513644101	M32.9	2015-09-01	14	710.0,727.05,710.0,M32.9,M32.9,M32.9,M32.9,710...	1
4	GD513644101	M32.9	2016-06-14	20	M32.9,M32.9,M32.9,M32.9,M25.50,M79.1,M25.541,M...	0
5	GD513644101	M25.50	2016-10-11	8	M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0...	0
6	GD513644101	M79.1	2017-02-09	8	M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0...	0
7	GD513644101	M25.541	2017-02-09	8	M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0...	0
8	GD513644101	M25.542	2017-02-09	8	M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0...	0
9	GD513644101	M35.00	2017-02-09	8	M32.9,M32.9,M25.50,M79.1,M25.541,M25.542,M35.0...	0

Parallel NOR version¶

These types of queries can be quite costly and the cost goes up with the square of the number of observations per sample. We can speed things up by assigning multiple workers to the processing with the PARALLEL command and partition the PN samples on the fly using the "in-order-of-appearance" function ÌOOA and filter them with WHERE based on the modulus of their row number. In addition to associating each ICD observation with the count and list of other ICD codes that are close in time, we can compute a Boolean condition based on the diagnoses found in the time window.

In [18]:

%%gor
$gordefs

def datediff($1,$2) = (int(substr($1,0,4))-int(substr($2,0,4)))*365+(int(substr($1,5,7))-int(substr($2,5,7)))*31+int(substr($1,8,10))-int(substr($2,8,10));

def #parts# = 4;
create #partcount# = parallel -parts <(norrows #parts#) <(nor [#ordered#] | hide description
| where mod(iooa(pn),#parts#) = #{col:rownum}
| multimap -c pn -ordered [#ordered#] 
| where abs( datediff(entry_date,entry_datex) ) <= 365
| group -gc pn-entry_date -count -ordered -lis -sc icd_codex);

nor [#partcount#] | calc condition if(listhascount(lis_icd_codex,'710.0,M32.9,M25.50,727.05')>2,1,0)
| group -gc pn -max -ic condition
| group -gc max_condition -count
| rename max_condition samples_with_condition

Query ran in 0.22 sec
Query fetched 2 rows in 2.71 sec (total time 2.92 sec)

Out[18]:

	samples_with_condition	allCount
0	0	1334
1	1	68

List based approach¶

Finally, we show how the same type of window based time analysis can be done using list functions in the GORpipe syntax. One can argue that syntactically this approach is more complicated, however, for the analysis presented here it is actually close to 4x more efficient! The idea is to collapse all the observations per sample into lists. Then clone the lists, split them and extract sub-lists based on the time condition using LISTZIPFILTER where we can filter the ICD code list based on the time list.

In [19]:

%%gor dfListbased <<
$gordefs

nor [#ordered#] 
| calc dt int(substr(entry_date,0,4))*365+int(substr(entry_date,5,7))*31+int(substr(entry_date,8,10))
| group -gc pn -lis -sc icd_code,entry_date,dt -count -ordered -notruncate -len 100000
| calc dt lis_dt
| calc icd_code lis_icd_code
| split dt,icd_code,lis_entry_date
| rename lis_entry_date entry_date
| replace lis_icd_code listzipfilter(lis_icd_code,lis_dt,'abs(int(x)-dt) < 365')
| hide lis_dt,dt
                     

Query ran in 0.14 sec
Query fetched 149,077 rows in 16.23 sec (total time 16.37 sec)

In [20]:

dfListbased[0:5]

Out[20]:

	pn	allCount	lis_icd_code	entry_date	icd_code
0	GD513644101	13	710.0,727.05,710.0,M32.9,M32.9	2014-10-02	710.0
1	GD513644101	13	710.0,727.05,710.0,M32.9,M32.9	2014-10-02	727.05
2	GD513644101	13	710.0,727.05,710.0,M32.9,M32.9	2014-10-02	710.0
3	GD513644101	13	710.0,727.05,710.0,M32.9,M32.9,M32.9,M32.9	2015-09-01	M32.9
4	GD513644101	13	710.0,727.05,710.0,M32.9,M32.9,M32.9,M32.9	2015-09-01	M32.9

While the above approach is faster than the join approach shown earlier, it is more complex to read and the performance difference may be smaller if we need to keep track of more columns than icd_code and entry_date like in our example here. Obviously, the above query can be parallelized in the same way as the multimap join based approach. Finally, we show that we get the same sample count for our earlier condition analysis and we leave it as an exersize for the reader to show that they give exactly the same result.

In [21]:

%%gor 
nor [var:dfListbased]
| calc condition if(listhascount(lis_icd_code,'710.0,M32.9,M25.50,727.05')>2,1,0)
| group -gc pn -max -ic condition
| group -gc max_condition -count
| rename max_condition samples_with_condition

Loading relations [var:dfListbased] from local state
Query ran in 4.89 sec.
Query fetched 2 rows in 2.65 sec (total time 7.54 sec)

Out[21]:

	samples_with_condition	allCount
0	0	1334
1	1	68

In [ ]: