Joins¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}
Genome JOINS¶
This notebook explains how genomic joins work in GOR and some of the subtle details around their performance. The genomic features that can be used in spatial joins are of three types:
- snp (single nucleotide position, e.g. chrom,pos)
- seg (segments, e.g. chrom,bpstart,bpstop)
- var (variants, e.g. based on chrom,pos,ref)
There are two main commands, JOIN and VARJOIN; the former is a general-purpose join command whereas the latter is for joining two variant tables.
GOR assumes consistent genomic ordering of both tables (files) and streams and primarily uses merge-join logic (equivalent to sort-merge logic in non-ordered databases). What makes the merge logic particularly powerful is its ability to switch between streaming mode and seek mode based on real-time cost optimization. Additionally, GOR can combine this with hash-based equi-joins on non-positional columns, providing a very efficient way to express genome spatial joins for multiple samples.
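To make the feature types concrete, here is a minimal sketch (not executed in this notebook) of how they map onto the join commands, using the #genes#, #exons#, #dbsnp# and #wgsvars# sources of this project. The option name states the type of the left-source first and the right-source second:
- seg vs seg: gor #genes# | join -segseg #exons#
- seg vs snp: gor #genes# | join -segsnp #dbsnp#
- var vs var: gor #wgsvars# | varjoin #dbsnp#
The -snpseg and -snpsnp variants used further below follow the same naming pattern.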
Join-order¶
Here we will explain how the order of relations matters in joins. Importantly, the left-source is always read fully while the right-source may not be, i.e. the join logic may decide to seek into the right-source.
%%gor
gor #genes# | where gene_symbol ~ 'brc*' | join -segseg #exons# | group chrom -count | calc t time()
Query ran in 2.37 sec Query fetched 4 rows in 0.06 sec (total time 2.43 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 188 | 1753 |
1 | chr17 | 0 | 83257441 | 473 | 1753 |
2 | chr5 | 0 | 181538259 | 1 | 1753 |
3 | chrX | 0 | 156040895 | 64 | 1753 |
%%gor
gor #exons# | join -segseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count | calc t time()
Query ran in 0.07 sec Query fetched 4 rows in 1.92 sec (total time 1.98 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 188 | 1921 |
1 | chr17 | 0 | 83257441 | 473 | 1921 |
2 | chr5 | 0 | 181538259 | 1 | 1921 |
3 | chrX | 0 | 156040895 | 64 | 1921 |
Above we see equivalent queries that join a subset of the genes with all the exons. The first version is slightly faster because it does not have to read the entire #exons# table. The difference is small because the system reads all 1.5M exon rows very quickly in the second example. However, if we replace the exons with the variants in #dbsnp# we see a much bigger difference.
%%gor
gor #genes# | where gene_symbol ~ 'brc*' | join -segsnp #dbsnp# | group chrom -count | calc t time()
Query ran in 0.64 sec Query fetched 4 rows in 3.71 sec (total time 4.35 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 61303 | 4197 |
1 | chr17 | 0 | 83257441 | 75293 | 4197 |
2 | chr5 | 0 | 181538259 | 394 | 4197 |
3 | chrX | 0 | 156040895 | 13932 | 4197 |
If we turn the query around, however, it will take a very long time to run, i.e.
gor #dbsnp# | join -snpseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count | calc t time()
will read all the 1.2 billion rows. In the former example, the JOIN command made 4 seeks into #dbsnp# and only variants in the neighborhood of the genes were read. We will explore this further below.
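If the big table really has to be the left-source, the full scan can be avoided by restricting the query to a genomic range with the -p option. A minimal sketch (not executed here; the range is an arbitrary illustrative window, not taken from this notebook):
gor -p chr17:43000000-43200000 #dbsnp# | join -snpseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count
Within such a window only a small slice of #dbsnp# is streamed instead of all 1.2 billion rows.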
Elevator pitch - skip-scan-join¶
Here we will explain how the "cost" of random access vs cost of streaming impacts when to perform a seek vs simply keeping on streaming. Although we will use #genes# and #exons#, which are segments, we will use -snpsnp joins to #dbsnp# just to explain the principle.
%%gor
gor -p chr1 #genes# | join -snpsnp -l -e 'missing' #dbsnp#
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.28 sec Query fetched 1 rows in 55.10 sec (total time 55.38 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 4345 | 2091 | 55169 | 12.697123 |
%%gor
gor -p chr1 #dbsnp# | join -snpsnp #genes# | group chrom -count | calc t time() | calc at t/allcount
Query ran in 0.36 sec Query fetched 1 rows in 83.24 sec (total time 83.60 sec)
chrom | bpStart | bpStop | allCount | t | at | |
---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 2091 | 83316 | 39.84505 |
We see that the former query, which reads the big #dbsnp# table as the right-source, takes 55sec while the query that reads #dbsnp# fully as the left-source takes about 83sec. The reason is that the former query does indeed perform 4243 seeks into the #dbsnp# table, spending on average 13msec per seek. This reflects the random-access cost of S3 object storage. Next we will move chr1 of #dbsnp# into Lustre by putting it into a temporary table.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""create temp_dbsnp = gor -p chr1 #dbsnp#;""")
%%gor
$mydefs
gor -p chr1 #genes# | join -snpsnp -l -e 'missing' [temp_dbsnp]
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.17 sec Query fetched 1 rows in 1.90 sec (total time 2.08 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 4345 | 2091 | 1916 | 0.440967 |
Now the query took less than 2sec! We see that the average seek cost is only 0.4msec, i.e. roughly 30 times faster random access than for S3. Now we switch the join order.
%%gor
$mydefs
gor -p chr1 [temp_dbsnp] | join -snpsnp #genes# | group chrom -count | calc t time() | calc at t/allcount
Query ran in 0.39 sec Query fetched 1 rows in 80.54 sec (total time 80.93 sec)
chrom | bpStart | bpStop | allCount | t | at | |
---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 2091 | 80553 | 38.523673 |
We see that the streaming speed on Lustre and S3 is practically the same for the 97M rows on chr1. To hammer this in, let's try #exons# instead of #genes#; two orders of magnitude more rows to join.
%%gor
$mydefs
gor -p chr1 #exons# | join -snpsnp -l -e 'missing' [temp_dbsnp]
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.30 sec Query fetched 1 rows in 7.68 sec (total time 7.98 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 167823 | 71618 | 7711 | 0.045947 |
We see that we get roughly 168k rows in 8sec and the average seek cost is now down to 0.05msec! Finally, we try the same for the S3 version:
%%gor
$mydefs
gor -p chr1 #exons# | join -snpsnp -l -e 'missing' #dbsnp#
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.26 sec Query fetched 1 rows in 54.42 sec (total time 54.68 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 167823 | 71618 | 54519 | 0.32486 |
The query is significantly slower, i.e. 54sec vs 8sec, but we see that the average seek cost is now down to 0.32msec. This is because many of the exons are close together and, rather than performing a seek for every exon segment, the JOIN command may revert to a skip-scan pattern, seeking into #dbsnp# when segments are far apart and streaming over segments that are close to each other.
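A rough back-of-the-envelope estimate, using only the numbers measured above, illustrates when the skip-scan logic prefers streaming over seeking: chr1 of #dbsnp# streams in roughly 80sec for about 97M rows, i.e. around 1.2M rows/sec, while a cold S3 seek costs about 13msec. A seek therefore only pays off when it skips more than roughly 0.013sec x 1.2M rows/sec ≈ 16k rows; for exon segments that lie closer together than that, the join simply keeps streaming.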
We now look at the use-case of fetching variants from all samples in a project where there are about 25k samples stored and bucketized into over 150 buckets. First we fetch all the variants in a single exon:
%%gor
$mydefs
gor -p chr1 #exons# | grep WASH7P | top 1
| join -segsnp -ir <(gor #wgsvars#)
| group chrom -count | calc t time()
Query ran in 0.76 sec Query fetched 1 rows in 0.59 sec (total time 1.35 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 1497 | 1155 |
We see that it took over a second to fetch 1497 rows from these 25k samples (actually longer on first execution, since the file system was cold). Now we select 5 samples from the project and run this experiment again, but this time only fetching variants in the same exon from these five samples:
%%gor myPNs <<
nor <(gor #wgsvars# | top 1000) | select PN | distinct | top 5
Query ran in 1.52 sec Query fetched 5 rows in 0.66 sec (total time 2.17 sec)
%%gor
$mydefs
gor -p chr1 #exons# | grep WASH7P | top 1
| join -segsnp -ir <(gor #wgsvars# -ff [var:myPNs])
| group chrom -count | calc t time()
Query ran in 0.29 sec Loading relations [var:myPNs] from local state Query ran in 0.62 sec Query fetched 1 rows in 0.10 sec (total time 0.72 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 3 | 233 |
As expected, we see a significantly faster response because we are only doing random access into 5 files instead of the 150 or so before.
Maximum segment size¶
When we join against a right-source that has segments, the system needs information about the segment sizes, because GOR only implicitly indexes the data by the start position of the features and does not, at the moment, rely on any file metadata describing the segment size distribution in the files.
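When the maximum segment size of a source is not known up front, it can be estimated with a simple aggregation. A minimal sketch (not executed here), using the gene_start and gene_end columns of #genes# seen later in this notebook:
gor #genes# | calc segsize gene_end-gene_start | group genome -max -ic segsize
The resulting max_segsize would be the value to pass to -maxseg in the joins below.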
We will now make a query that uses a join to fetch variants from two genes and then joins them with sequence read coverage segments from the same samples. In order to exaggerate the times, we pick 100 samples.
qh2 = GQH.GOR_Query_Helper()
mydefs2 = qh2.add_many("""
def #wgsvars# = source/var/wgs_varcalls.gord -s PN;
def #segcov# = source/cov/segment_cov.gord -s PN;
create #mysegs# = gor -p chr13 #genes# | merge <(gor -p chr17 #genes#) | grep brca;
create #mypns# = nor -asdict #wgsvars# | select #2 | top 100;
""")
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| calc t1 time()
| join -snpseg -l -e 0 <(gor #segcov# -ff [#mypns#]) | calc t2 time()
| calc t t2-t1
| group chrom -count -min -max -ic t
Query ran in 0.14 sec Query fetched 2 rows in 16.76 sec (total time 16.90 sec)
CHROM | bpStart | bpStop | allCount | min_t | max_t | |
---|---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 212149 | 0 | 5158 |
1 | chr17 | 0 | 83257441 | 222009 | 0 | 10149 |
In the above query, we see that the time difference before and after adding the sequence coverage information is between 5sec and 10sec (i.e. max_t). When the JOIN command with the -snpseg option sees a nested query, it will by default assume that the maximum segment size in the nested query is that of a gene, i.e. 4Mbp. However, when these segments were generated, a SEGSPAN command was used with -maxseg of 10kbp. Thus, if we provide this parameter, the seeks can be done in a more efficient manner:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| calc t1 time()
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#]) | calc t2 time()
| calc t t2-t1
| group chrom -count -min -max -ic t
Query ran in 0.19 sec Query fetched 2 rows in 1.66 sec (total time 1.84 sec)
CHROM | bpStart | bpStop | allCount | min_t | max_t | |
---|---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 212149 | 0 | 221 |
1 | chr17 | 0 | 83257441 | 222009 | 0 | 145 |
We see from the above a much reduced overhead in fetching the first rows from the #segcov# query.
Hash-option for additional equi-join¶
In the above query, we only joined the two tables based on genomic overlap and did not look at the PN column present in both relations. More typically, we would want only rows from the same sample to join. We can do that like this:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| where PN = PNx
| group chrom -count | calc t time()
Query ran in 0.18 sec Query fetched 2 rows in 1.47 sec (total time 1.66 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 2782 | 1600 |
1 | chr17 | 0 | 83257441 | 3136 | 1600 |
Notice however that the condition PN = PNx does not respect the fact that we specified a left-join with the -l option. Therefore, given that the #segcov# table is sparse (i.e. it may have no segments where the depth is zero - here we use random() to ensure that is the case in this demo), our result may be wrong! GOR does however have a special LEFTWHERE command that can be used in scenarios like this.
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| leftwhere distance[-1] PN = PNx
| group chrom -count | calc t time()
Query ran in 0.33 sec Query fetched 2 rows in 1.51 sec (total time 1.84 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 3073 | 1642 |
1 | chr17 | 0 | 83257441 | 3457 | 1642 |
A more appropriate way to write the above query would be to use the -xl and -xr options to enforce an equi-join condition in the JOIN command.
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| group chrom -count | calc t time()
Query ran in 0.22 sec Query fetched 2 rows in 1.47 sec (total time 1.69 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 3073 | 1613 |
1 | chr17 | 0 | 83257441 | 3457 | 1613 |
We see that the counts are the same as with the LEFTWHERE filter and the times are the same as before. If we however increase the number of samples, we will see a big difference between the Cartesian-join and the hash-join approach:
mydefs2 = qh2.add("""create #mypns# = nor -asdict #wgsvars# | select #2 | top 1000""")
First the cartesian-join approach with 1000 samples:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| leftwhere distance[-1] PN = PNx
| group chrom -count | calc t time()
Query ran in 0.71 sec Query fetched 2 rows in 37.11 sec (total time 37.82 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 26848 | 37486 |
1 | chr17 | 0 | 83257441 | 35289 | 37486 |
And now the hash-join approach:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| group chrom -count | calc t time()
Query ran in 0.30 sec Query fetched 2 rows in 6.65 sec (total time 6.95 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 26848 | 6851 |
1 | chr17 | 0 | 83257441 | 35289 | 6851 |
We see that with 1000 samples the hash-join approach is much faster, and the difference grows with the number of samples being processed simultaneously.
Join column and genome order¶
We will end with a few examples that highlight table/source order in joins and how it affects the GOR-order of the output relation. Importantly, it is always the first two (or three) columns that are used to perform the genome-spatial join. However, we have also learned that the left-source is read fully; therefore, if we are joining a small table against a massive one, it is preferable to use the small table as the left-source and allow the system to seek into the right-source as needed.
We will start by finding a region with multiple overlapping genes and exons
qh3 = GQH.GOR_Query_Helper()
mydefs3 = qh3.add_many("""
create #busy# = gor #genes# | segproj
| rank genome -o desc segcount | where rank_segcount = 1
| join -segseg -ir -maxseg 4000000 #genes#;
""")
%%gor
$mydefs3
gor [#busy#]
Query ran in 1.28 sec Query fetched 22 rows in 0.00 sec (total time 1.28 sec)
chrom | gene_start | gene_end | gene_symbol | |
---|---|---|---|---|
0 | chr5 | 141330513 | 141512975 | PCDHGA1 |
1 | chr5 | 141338759 | 141512975 | PCDHGA2 |
2 | chr5 | 141343828 | 141512975 | PCDHGA3 |
3 | chr5 | 141350098 | 141512975 | PCDHGB1 |
4 | chr5 | 141355020 | 141512975 | PCDHGA4 |
5 | chr5 | 141359993 | 141512975 | PCDHGB2 |
6 | chr5 | 141364161 | 141512975 | PCDHGA5 |
7 | chr5 | 141370241 | 141512975 | PCDHGB3 |
8 | chr5 | 141373890 | 141512975 | PCDHGA6 |
9 | chr5 | 141382738 | 141512975 | PCDHGA7 |
10 | chr5 | 141387697 | 141512975 | PCDHGB4 |
11 | chr5 | 141390156 | 141512975 | PCDHGA8 |
12 | chr5 | 141397946 | 141512975 | PCDHGB5 |
13 | chr5 | 141402777 | 141512975 | PCDHGA9 |
14 | chr5 | 141408020 | 141512975 | PCDHGB6 |
15 | chr5 | 141412986 | 141512975 | PCDHGA10 |
16 | chr5 | 141417644 | 141512975 | PCDHGB7 |
17 | chr5 | 141421046 | 141512975 | PCDHGA11 |
18 | chr5 | 141430506 | 141512975 | PCDHGA12 |
19 | chr5 | 141475946 | 141512977 | PCDHGC3 |
20 | chr5 | 141484996 | 141512975 | PCDHGC4 |
21 | chr5 | 141489080 | 141512975 | PCDHGC5 |
We see that we get multiple overlapping genes. Now we join them with the variants:
mydefs3 = qh3.add("""def #cvars# = ref/disvariants/clinical_variants.gorz | select 1-4,disease;""")
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars#)
Query ran in 0.34 sec Query fetched 5,865 rows in 0.13 sec (total time 0.47 sec)
chrom | gene_start | gene_end | gene_symbol | distance | pos | ref | alt | disease | |
---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330778 | C | A | INBORN_GENETIC_DISEASES |
1 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330954 | A | G | INBORN_GENETIC_DISEASES |
2 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331012 | G | T | INBORN_GENETIC_DISEASES |
3 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331196 | A | T | INBORN_GENETIC_DISEASES |
4 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331255 | C | T | INBORN_GENETIC_DISEASES |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5860 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491632 | C | A | INBORN_GENETIC_DISEASES |
5861 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511002 | C | T | NOT_PROVIDED |
5862 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511014 | A | G | NOT_PROVIDED |
5863 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511035 | C | T | NOT_PROVIDED |
5864 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511049 | G | A | INBORN_GENETIC_DISEASES |
5865 rows × 9 columns
We see that column 2 is the start position of the genes. Therefore, if we want to join the variants with #dbnsfp# in order to annotate them with MutPred_rankscore, we will have to perform the join either within the nested query or after switching the variant position and the gene start. Let's do the former first:
mydefs3 = qh3.add("""def #mutpred# = ref/variants/dbnsfp.gorz | select 1-4,MutPred_rankscore;""")
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# | varjoin -r <(gor #mutpred# ) )
Query ran in 0.31 sec Query fetched 5,522 rows in 4.45 sec (total time 4.77 sec)
chrom | gene_start | gene_end | gene_symbol | distance | pos | ref | alt | disease | MutPred_rankscore | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330778 | C | A | INBORN_GENETIC_DISEASES | . |
1 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0.97773 |
2 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0.66843 |
3 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0.45202 |
4 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0.54201 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491324 | T | C | INBORN_GENETIC_DISEASES | 0.56767 |
5518 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491544 | C | G | INBORN_GENETIC_DISEASES | 0.24911 |
5519 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491591 | G | A | INBORN_GENETIC_DISEASES | 0.11515 |
5520 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491632 | C | A | INBORN_GENETIC_DISEASES | 0.06613 |
5521 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . |
5522 rows × 10 columns
The above approach works fine because, within the context of the nested query, the VARJOIN command acts on the pos columns of #cvars# and #mutpred#. Now we will try an alternative approach:
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
Query ran in 0.19 sec Query fetched 5,865 rows in 0.13 sec (total time 0.32 sec)
chrom | pos | ref | alt | disease | distance | gene_start | gene_end | gene_symbol | |
---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5860 | chr5 | 141491632 | C | A | INBORN_GENETIC_DISEASES | 0 | 141489080 | 141512975 | PCDHGC5 |
5861 | chr5 | 141511002 | C | T | NOT_PROVIDED | 0 | 141489080 | 141512975 | PCDHGC5 |
5862 | chr5 | 141511014 | A | G | NOT_PROVIDED | 0 | 141489080 | 141512975 | PCDHGC5 |
5863 | chr5 | 141511035 | C | T | NOT_PROVIDED | 0 | 141489080 | 141512975 | PCDHGC5 |
5864 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141489080 | 141512975 | PCDHGC5 |
5865 rows × 9 columns
And now we add the varjoin to #mutpred#:
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
| varjoin -r #mutpred#
Query ran in 0.28 sec
==== Data Error ==== VARJOIN Wrong order observed at row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 Last row: chr5:141511049 Row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 Stack Trace: org.gorpipe.exceptions.GorDataException: VARJOIN Wrong order observed at row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 Last row: chr5:141511049 Row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 at gorsat.Analysis.CheckOrder.process(CheckOrder.scala:39) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.Select2.process(Select2.scala:32) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.JoinAnalysis$SegOverlap.output_row(JoinAnalysis.scala:155) at gorsat.Analysis.JoinAnalysis$SegOverlap.nested_process(JoinAnalysis.scala:213) at gorsat.Analysis.JoinAnalysis$SegOverlap.process(JoinAnalysis.scala:375) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.InRange.process(InRange.scala:36) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Monitors.CancelMonitor.process(CancelMonitor.scala:34) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.CheckOrder.process(CheckOrder.scala:41) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Monitors.CancelMonitor.process(CancelMonitor.scala:34) at gorsat.ReaderThread.run(ReaderThread.java:132)
We see that we get a Data Error because the rows are not in correct genomic order! The VARJOIN command verifies the order but does not correct it - it assumes it and relies on it for performance. The reason this happens is that there are overlapping gene segments in the #busy# table and they overlap multiple variants; when we changed the column order, we broke the genomic order. To fix this we can apply SORT. But we can be clever - we don't have to use a fully-blocking sort operation - because GOR provides a range-based sort. Since we know that the maximum segment size of genes is 4Mbp, we can inform SORT that this is the largest possible displacement in the ordering.
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
| sort 4000000
| varjoin -r <(gor #mutpred#)
Query ran in 0.25 sec Query fetched 5,522 rows in 4.36 sec (total time 4.61 sec)
chrom | pos | ref | alt | disease | distance | gene_start | gene_end | gene_symbol | MutPred_rankscore | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | . |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.97773 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.66843 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.45202 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.54201 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141421046 | 141512975 | PCDHGA11 | . |
5518 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141430506 | 141512975 | PCDHGA12 | . |
5519 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141475946 | 141512977 | PCDHGC3 | . |
5520 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141484996 | 141512975 | PCDHGC4 | . |
5521 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141489080 | 141512975 | PCDHGC5 | . |
5522 rows × 10 columns
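If the maximum displacement in the ordering is not known, a fully blocking sort over a whole chromosome should also restore the order, at a higher memory and latency cost. A minimal sketch (not executed here; it assumes SORT accepts a chromosome-wide window, written here as sort chrom):
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
| sort chrom
| varjoin -r <(gor #mutpred#)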
The important take-home lesson here is that while we can easily change the column order, if we want to apply pipe-step GOR commands that rely on consistent genomic order, such as VARJOIN, GROUP, RANK, etc., we must ensure that the first two columns are in proper genomic order. Before we leave this subject, we provide a few more ways to achieve the above.
First we make sure that we don't have overlapping segments by using the SEGPROJ command; however, we then lose the gene symbol. Therefore, we have to join it back in:
%%gor
$mydefs3
gor [#busy#] | segproj | join -segsnp <(gor #cvars# )
| select chrom,pos-disease | join -r -snpseg -maxseg 4000000 [#busy#]
| varjoin -r <(gor #mutpred# )
Query ran in 0.57 sec Query fetched 5,522 rows in 4.58 sec (total time 5.15 sec)
chrom | pos | ref | alt | disease | gene_symbol | MutPred_rankscore | |
---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | PCDHGA1 | . |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.97773 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.66843 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.45202 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.54201 |
... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGA11 | . |
5518 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGA12 | . |
5519 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGC3 | . |
5520 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGC4 | . |
5521 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGC5 | . |
5522 rows × 7 columns
Notice that we did not have to use a SORT because the segments in the left-source are non-overlapping after SEGPROJ. We can do something similar by using the -ir option in JOIN. This option outputs only the right-source rows, and each row only once, even though it may overlap multiple rows in the left-source.
%%gor
$mydefs3
gor [#busy#] | join -segsnp -ir <(gor #cvars# )
| varjoin -r <(gor #mutpred# )
| join -r -snpseg -maxseg 4000000 [#busy#]
Query ran in 0.31 sec Query fetched 5,522 rows in 4.33 sec (total time 4.64 sec)
chrom | pos | ref | alt | disease | MutPred_rankscore | gene_symbol | |
---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | . | PCDHGA1 |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0.97773 | PCDHGA1 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0.66843 | PCDHGA1 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0.45202 | PCDHGA1 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0.54201 | PCDHGA1 |
... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGA11 |
5518 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGA12 |
5519 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGC3 |
5520 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGC4 |
5521 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGC5 |
5522 rows × 7 columns
We see that for this approach to work we must read the gene segments twice if we want to both use the gene segments to select rows from the right-source and include the gene annotations in the output. In the above examples this is not a big concern because the gene table is very lightweight; this is very often the case, and for example reading all genes or exons comes at very little cost.