Joins¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}
Genome JOINS¶
This notebook explains how genomic joins work in GOR and some of the subtle details around their performance. The genomic features that can be used in spatial joins are of three types:
- snp (single nucleotide position, e.g. chrom,pos)
- seg (segments, e.g. chrom,bpstart,bpstop)
- var (variants, e.g. based on chrom,pos,ref)
There are two main commands, JOIN and VARJOIN; the former is a general-purpose join command whereas the latter is for joining two variant tables.
GOR assumes consistent genomic ordering of both tables (files) and streams and primarily uses merge-join logic (equivalent to sort-merge logic in non-ordered databases). What makes the merge logic particularly powerful is its ability to switch between streaming mode and seek mode based on real-time cost optimization. Additionally, GOR can combine this with hash-based equi-joins on non-positional columns, providing a very efficient way to express genome spatial joins for multiple samples.
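To make the feature types concrete, here is a minimal sketch (not executed in this notebook) of how they map onto the join commands, using the #genes#, #exons#, #dbsnp# and #wgsvars# sources of this project. The option name states the type of the left-source first and the right-source second:
- seg vs seg: gor #genes# | join -segseg #exons#
- seg vs snp: gor #genes# | join -segsnp #dbsnp#
- var vs var: gor #wgsvars# | varjoin #dbsnp#
The -snpseg and -snpsnp variants used further below follow the same naming pattern.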
Join-order¶
Here we will explain how the order of relations matters in joins. Importantly, the left-source is always read fully while the right-source may not be, i.e. the join logic may decide to seek into the right-source.
%%gor
gor #genes# | where gene_symbol ~ 'brc*' | join -segseg #exons# | group chrom -count | calc t time()
Query ran in 2.37 sec Query fetched 4 rows in 0.06 sec (total time 2.43 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 188 | 1753 |
1 | chr17 | 0 | 83257441 | 473 | 1753 |
2 | chr5 | 0 | 181538259 | 1 | 1753 |
3 | chrX | 0 | 156040895 | 64 | 1753 |
%%gor
gor #exons# | join -segseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count | calc t time()
Query ran in 0.07 sec Query fetched 4 rows in 1.92 sec (total time 1.98 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 188 | 1921 |
1 | chr17 | 0 | 83257441 | 473 | 1921 |
2 | chr5 | 0 | 181538259 | 1 | 1921 |
3 | chrX | 0 | 156040895 | 64 | 1921 |
Above we see equivalent queries that join a subset of the genes with all the exons. The first version is slightly faster because it does not have to read the entire #exons# table. The difference is small because the system reads all 1.5M exon rows very quickly in the second example. However, if we replace the exons with the variants in #dbsnp# we see a much bigger difference.
%%gor
gor #genes# | where gene_symbol ~ 'brc*' | join -segsnp #dbsnp# | group chrom -count | calc t time()
Query ran in 0.64 sec Query fetched 4 rows in 3.71 sec (total time 4.35 sec)
chrom | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 61303 | 4197 |
1 | chr17 | 0 | 83257441 | 75293 | 4197 |
2 | chr5 | 0 | 181538259 | 394 | 4197 |
3 | chrX | 0 | 156040895 | 13932 | 4197 |
If we turn the query around, however, it will take a very long time to run, i.e.
gor #dbsnp# | join -snpseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count | calc t time()
will read all the 1.2 billion rows. In the former example, the JOIN command made 4 seeks into #dbsnp# and only variants in the neighborhood of the genes were read. We will explore this further below.
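If the big table really has to be the left-source, the full scan can be avoided by restricting the query to a genomic range with the -p option. A minimal sketch (not executed here; the range is an arbitrary illustrative window, not taken from this notebook):
gor -p chr17:43000000-43200000 #dbsnp# | join -snpseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count
Within such a window only a small slice of #dbsnp# is streamed instead of all 1.2 billion rows.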
Elevator pitch - skip-scan-join¶
Here we will explain how the "cost" of random access vs cost of streaming impacts when to perform a seek vs simply keeping on streaming. Although we will use #genes# and #exons#, which are segments, we will use -snpsnp joins to #dbsnp# just to explain the principle.
%%gor
gor -p chr1 #genes# | join -snpsnp -l -e 'missing' #dbsnp#
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.28 sec Query fetched 1 rows in 55.10 sec (total time 55.38 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 4345 | 2091 | 55169 | 12.697123 |
%%gor
gor -p chr1 #dbsnp# | join -snpsnp #genes# | group chrom -count | calc t time() | calc at t/allcount
Query ran in 0.36 sec Query fetched 1 rows in 83.24 sec (total time 83.60 sec)
chrom | bpStart | bpStop | allCount | t | at | |
---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 2091 | 83316 | 39.84505 |
We see that the former query, which reads the big #dbsnp# table as the right-source, takes 55sec while the query that reads #dbsnp# fully as the left-source takes about 83sec. The reason is that the former query does indeed perform 4243 seeks into the #dbsnp# table, spending on average 13msec per seek. This reflects the random-access cost of S3 object storage. Next we will move chr1 of #dbsnp# into Lustre by putting it into a temporary table.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""create temp_dbsnp = gor -p chr1 #dbsnp#;""")
%%gor
$mydefs
gor -p chr1 #genes# | join -snpsnp -l -e 'missing' [temp_dbsnp]
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.17 sec Query fetched 1 rows in 1.90 sec (total time 2.08 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 4345 | 2091 | 1916 | 0.440967 |
Now the query took less than 2sec! We see that the average seek cost is only 0.4msec, i.e. roughly 30 times faster random access than for S3. Now we switch the join order.
%%gor
$mydefs
gor -p chr1 [temp_dbsnp] | join -snpsnp #genes# | group chrom -count | calc t time() | calc at t/allcount
Query ran in 0.39 sec Query fetched 1 rows in 80.54 sec (total time 80.93 sec)
chrom | bpStart | bpStop | allCount | t | at | |
---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 2091 | 80553 | 38.523673 |
We see that the streaming speed on Lustre and S3 is practically the same for the 97M rows on chr1. To hammer this in, let's try #exons# instead of #genes#; two orders of magnitude more rows to join.
%%gor
$mydefs
gor -p chr1 #exons# | join -snpsnp -l -e 'missing' [temp_dbsnp]
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.30 sec Query fetched 1 rows in 7.68 sec (total time 7.98 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 167823 | 71618 | 7711 | 0.045947 |
We see that we get roughly 168k rows in 8sec and the average seek cost is now down to 0.05msec! Finally, we try the same for the S3 version:
%%gor
$mydefs
gor -p chr1 #exons# | join -snpsnp -l -e 'missing' #dbsnp#
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.26 sec Query fetched 1 rows in 54.42 sec (total time 54.68 sec)
chrom | bpStart | bpStop | allCount | sum_found | t | at | |
---|---|---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 167823 | 71618 | 54519 | 0.32486 |
The query is significantly slower, i.e. 54sec vs 8sec, but we see that the average seek cost is now down to 0.32msec. This is because many of the exons are close together and, rather than performing a seek for every exon segment, the JOIN command may revert to a skip-scan pattern, seeking into #dbsnp# when segments are far apart and streaming over segments that are close to each other.
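A rough back-of-the-envelope estimate, using only the numbers measured above, illustrates when the skip-scan logic prefers streaming over seeking: chr1 of #dbsnp# streams in roughly 80sec for about 97M rows, i.e. around 1.2M rows/sec, while a cold S3 seek costs about 13msec. A seek therefore only pays off when it skips more than roughly 0.013sec x 1.2M rows/sec ≈ 16k rows; for exon segments that lie closer together than that, the join simply keeps streaming.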
We now look at the use-case of fetching variants from all samples in a project where there are about 25k samples stored and bucketized into over 150 buckets. First we fetch all the variants in a single exon:
%%gor
$mydefs
gor -p chr1 #exons# | grep WASH7P | top 1
| join -segsnp -ir <(gor #wgsvars#)
| group chrom -count | calc t time()
Query ran in 0.76 sec Query fetched 1 rows in 0.59 sec (total time 1.35 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 1497 | 1155 |
We see that it took over a second to fetch 1497 rows from these 25k samples (actually longer on first execution, since the file system was cold). Now we select 5 samples from the project and run this experiment again, but this time only fetching variants in the same exon from these five samples:
%%gor myPNs <<
nor <(gor #wgsvars# | top 1000) | select PN | distinct | top 5
Query ran in 1.52 sec Query fetched 5 rows in 0.66 sec (total time 2.17 sec)
%%gor
$mydefs
gor -p chr1 #exons# | grep WASH7P | top 1
| join -segsnp -ir <(gor #wgsvars# -ff [var:myPNs])
| group chrom -count | calc t time()
Query ran in 0.29 sec Loading relations [var:myPNs] from local state Query ran in 0.62 sec Query fetched 1 rows in 0.10 sec (total time 0.72 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 3 | 233 |
As expected, we see a significantly faster response because we are only doing random access into 5 files instead of the 150 or so before.
Maximum segment size¶
When we join against a right-source that has segments, the system needs information about the segment sizes, because GOR only implicitly indexes the data by the start position of the features and does not, at the moment, rely on any file metadata describing the segment size distribution in the files.
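When the maximum segment size of a source is not known up front, it can be estimated with a simple aggregation. A minimal sketch (not executed here), using the gene_start and gene_end columns of #genes# seen later in this notebook:
gor #genes# | calc segsize gene_end-gene_start | group genome -max -ic segsize
The resulting max_segsize would be the value to pass to -maxseg in the joins below.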
We will now make a query that uses a join to fetch variants from two genes and then joins them with sequence read coverage segments from the same samples. In order to exaggerate the times, we pick 100 samples.
qh2 = GQH.GOR_Query_Helper()
mydefs2 = qh2.add_many("""
def #wgsvars# = source/var/wgs_varcalls.gord -s PN;
def #segcov# = source/cov/segment_cov.gord -s PN;
create #mysegs# = gor -p chr13 #genes# | merge <(gor -p chr17 #genes#) | grep brca;
create #mypns# = nor -asdict #wgsvars# | select #2 | top 100;
""")
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| calc t1 time()
| join -snpseg -l -e 0 <(gor #segcov# -ff [#mypns#]) | calc t2 time()
| calc t t2-t1
| group chrom -count -min -max -ic t
Query ran in 0.14 sec Query fetched 2 rows in 16.76 sec (total time 16.90 sec)
CHROM | bpStart | bpStop | allCount | min_t | max_t | |
---|---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 212149 | 0 | 5158 |
1 | chr17 | 0 | 83257441 | 222009 | 0 | 10149 |
In the above query, we see that the time difference before and after adding the sequence coverage information is between 5sec and 10sec (i.e. max_t). When the JOIN command with the -snpseg option sees a nested query, it will by default assume that the maximum segment size in the nested query is that of a gene, i.e. 4Mbp. However, when these segments were generated, a SEGSPAN command was used with -maxseg of 10kbp. Thus, if we provide this parameter, the seeks can be done in a more efficient manner:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| calc t1 time()
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#]) | calc t2 time()
| calc t t2-t1
| group chrom -count -min -max -ic t
Query ran in 0.19 sec Query fetched 2 rows in 1.66 sec (total time 1.84 sec)
CHROM | bpStart | bpStop | allCount | min_t | max_t | |
---|---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 212149 | 0 | 221 |
1 | chr17 | 0 | 83257441 | 222009 | 0 | 145 |
We see from the above a much reduced overhead in fetching the first rows from the #segcov# query.
Hash-option for additional equi-join¶
In the above query, we only joined the two tables based on genomic overlap and did not look at the PN column present in both relations. More typically, we would want only rows from the same sample to join. We can do that like this:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| where PN = PNx
| group chrom -count | calc t time()
Query ran in 0.18 sec Query fetched 2 rows in 1.47 sec (total time 1.66 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 2782 | 1600 |
1 | chr17 | 0 | 83257441 | 3136 | 1600 |
Notice however that the condition PN = PNx does not respect the fact that we specified a left-join with the -l option. Therefore, given that the #segcov# table is sparse (i.e. it may have no segments where the depth is zero - here we use random() to ensure that is the case in this demo), our result may be wrong! GOR does however have a special LEFTWHERE command that can be used in scenarios like this.
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| leftwhere distance[-1] PN = PNx
| group chrom -count | calc t time()
Query ran in 0.33 sec Query fetched 2 rows in 1.51 sec (total time 1.84 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 3073 | 1642 |
1 | chr17 | 0 | 83257441 | 3457 | 1642 |
A more appropriate way to write the above query would be to use the -xl and -xr options to enforce an equi-join condition in the JOIN command.
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| group chrom -count | calc t time()
Query ran in 0.22 sec Query fetched 2 rows in 1.47 sec (total time 1.69 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 3073 | 1613 |
1 | chr17 | 0 | 83257441 | 3457 | 1613 |
We see that the counts are the same as with the LEFTWHERE filter and the times are the same as before. If we however increase the number of samples, we will see a big difference between the Cartesian-join and the hash-join approach:
mydefs2 = qh2.add("""create #mypns# = nor -asdict #wgsvars# | select #2 | top 1000""")
First the cartesian-join approach with 1000 samples:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| leftwhere distance[-1] PN = PNx
| group chrom -count | calc t time()
Query ran in 0.71 sec Query fetched 2 rows in 37.11 sec (total time 37.82 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 26848 | 37486 |
1 | chr17 | 0 | 83257441 | 35289 | 37486 |
And now the hash-join approach:
%%gor
$mydefs2
gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| group chrom -count | calc t time()
Query ran in 0.30 sec Query fetched 2 rows in 6.65 sec (total time 6.95 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr13 | 0 | 114364328 | 26848 | 6851 |
1 | chr17 | 0 | 83257441 | 35289 | 6851 |
We see that with 1000 samples the hash-join approach is much faster, and the difference grows with the number of samples being processed simultaneously.
Join column and genome order¶
We will end with a few examples that highlight table/source order in joins and how it affects the GOR-order of the output relation. Importantly, it is always the first two (or three) columns that are used to perform the genome-spatial join. However, we have also learned that the left-source is read fully; therefore, if we are joining a small table against a massive one, it is preferable to use the small table as the left-source and allow the system to seek into the right-source as needed.
We will start by finding a region with multiple overlapping genes and exons
qh3 = GQH.GOR_Query_Helper()
mydefs3 = qh3.add_many("""
create #busy# = gor #genes# | segproj
| rank genome -o desc segcount | where rank_segcount = 1
| join -segseg -ir -maxseg 4000000 #genes#;
""")
%%gor
$mydefs3
gor [#busy#]
Query ran in 1.28 sec Query fetched 22 rows in 0.00 sec (total time 1.28 sec)
chrom | gene_start | gene_end | gene_symbol | |
---|---|---|---|---|
0 | chr5 | 141330513 | 141512975 | PCDHGA1 |
1 | chr5 | 141338759 | 141512975 | PCDHGA2 |
2 | chr5 | 141343828 | 141512975 | PCDHGA3 |
3 | chr5 | 141350098 | 141512975 | PCDHGB1 |
4 | chr5 | 141355020 | 141512975 | PCDHGA4 |
5 | chr5 | 141359993 | 141512975 | PCDHGB2 |
6 | chr5 | 141364161 | 141512975 | PCDHGA5 |
7 | chr5 | 141370241 | 141512975 | PCDHGB3 |
8 | chr5 | 141373890 | 141512975 | PCDHGA6 |
9 | chr5 | 141382738 | 141512975 | PCDHGA7 |
10 | chr5 | 141387697 | 141512975 | PCDHGB4 |
11 | chr5 | 141390156 | 141512975 | PCDHGA8 |
12 | chr5 | 141397946 | 141512975 | PCDHGB5 |
13 | chr5 | 141402777 | 141512975 | PCDHGA9 |
14 | chr5 | 141408020 | 141512975 | PCDHGB6 |
15 | chr5 | 141412986 | 141512975 | PCDHGA10 |
16 | chr5 | 141417644 | 141512975 | PCDHGB7 |
17 | chr5 | 141421046 | 141512975 | PCDHGA11 |
18 | chr5 | 141430506 | 141512975 | PCDHGA12 |
19 | chr5 | 141475946 | 141512977 | PCDHGC3 |
20 | chr5 | 141484996 | 141512975 | PCDHGC4 |
21 | chr5 | 141489080 | 141512975 | PCDHGC5 |
We see that we get multiple overlapping genes. Now we join them with the variants:
mydefs3 = qh3.add("""def #cvars# = ref/disvariants/clinical_variants.gorz | select 1-4,disease;""")
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars#)
Query ran in 0.34 sec Query fetched 5,865 rows in 0.13 sec (total time 0.47 sec)
chrom | gene_start | gene_end | gene_symbol | distance | pos | ref | alt | disease | |
---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330778 | C | A | INBORN_GENETIC_DISEASES |
1 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330954 | A | G | INBORN_GENETIC_DISEASES |
2 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331012 | G | T | INBORN_GENETIC_DISEASES |
3 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331196 | A | T | INBORN_GENETIC_DISEASES |
4 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331255 | C | T | INBORN_GENETIC_DISEASES |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5860 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491632 | C | A | INBORN_GENETIC_DISEASES |
5861 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511002 | C | T | NOT_PROVIDED |
5862 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511014 | A | G | NOT_PROVIDED |
5863 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511035 | C | T | NOT_PROVIDED |
5864 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511049 | G | A | INBORN_GENETIC_DISEASES |
5865 rows × 9 columns
We see that column 2 is the start position of the genes. Therefore, if we want to join the variants with #dbnsfp# in order to annotate them with MutPred_rankscore, we will have to perform the join either within the nested query or after switching the variant position and the gene start. Let's do the former first:
mydefs3 = qh3.add("""def #mutpred# = ref/variants/dbnsfp.gorz | select 1-4,MutPred_rankscore;""")
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# | varjoin -r <(gor #mutpred# ) )
Query ran in 0.31 sec Query fetched 5,522 rows in 4.45 sec (total time 4.77 sec)
chrom | gene_start | gene_end | gene_symbol | distance | pos | ref | alt | disease | MutPred_rankscore | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330778 | C | A | INBORN_GENETIC_DISEASES | . |
1 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0.97773 |
2 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0.66843 |
3 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0.45202 |
4 | chr5 | 141330513 | 141512975 | PCDHGA1 | 0 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0.54201 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491324 | T | C | INBORN_GENETIC_DISEASES | 0.56767 |
5518 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491544 | C | G | INBORN_GENETIC_DISEASES | 0.24911 |
5519 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491591 | G | A | INBORN_GENETIC_DISEASES | 0.11515 |
5520 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141491632 | C | A | INBORN_GENETIC_DISEASES | 0.06613 |
5521 | chr5 | 141489080 | 141512975 | PCDHGC5 | 0 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . |
5522 rows × 10 columns
The above approach works fine because, within the context of the nested query, the VARJOIN command acts on the pos columns of #cvars# and #mutpred#. Now we will try an alternative approach:
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
Query ran in 0.19 sec Query fetched 5,865 rows in 0.13 sec (total time 0.32 sec)
chrom | pos | ref | alt | disease | distance | gene_start | gene_end | gene_symbol | |
---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5860 | chr5 | 141491632 | C | A | INBORN_GENETIC_DISEASES | 0 | 141489080 | 141512975 | PCDHGC5 |
5861 | chr5 | 141511002 | C | T | NOT_PROVIDED | 0 | 141489080 | 141512975 | PCDHGC5 |
5862 | chr5 | 141511014 | A | G | NOT_PROVIDED | 0 | 141489080 | 141512975 | PCDHGC5 |
5863 | chr5 | 141511035 | C | T | NOT_PROVIDED | 0 | 141489080 | 141512975 | PCDHGC5 |
5864 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141489080 | 141512975 | PCDHGC5 |
5865 rows × 9 columns
And now we add the varjoin to #mutpred#:
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
| varjoin -r #mutpred#
Query ran in 0.28 sec
==== Data Error ==== VARJOIN Wrong order observed at row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 Last row: chr5:141511049 Row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 Stack Trace: org.gorpipe.exceptions.GorDataException: VARJOIN Wrong order observed at row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 Last row: chr5:141511049 Row: chr5 141338997 A G INBORN_GENETIC_DISEASES 0 141338759 141512975 PCDHGA2 at gorsat.Analysis.CheckOrder.process(CheckOrder.scala:39) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.Select2.process(Select2.scala:32) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.JoinAnalysis$SegOverlap.output_row(JoinAnalysis.scala:155) at gorsat.Analysis.JoinAnalysis$SegOverlap.nested_process(JoinAnalysis.scala:213) at gorsat.Analysis.JoinAnalysis$SegOverlap.process(JoinAnalysis.scala:375) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.InRange.process(InRange.scala:36) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Monitors.CancelMonitor.process(CancelMonitor.scala:34) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Analysis.CheckOrder.process(CheckOrder.scala:41) at gorsat.Commands.Analysis.process(Analysis.scala:167) at gorsat.Monitors.CancelMonitor.process(CancelMonitor.scala:34) at gorsat.ReaderThread.run(ReaderThread.java:132)
We see that we get a Data Error because the rows are not in correct genomic order! The VARJOIN command verifies the order but does not correct it - it assumes it and relies on it for performance. The reason this happens is that there are overlapping gene segments in the #busy# table and they overlap multiple variants; when we changed the column order, we broke the genomic order. To fix this we can apply SORT. But we can be clever - we don't have to use a fully-blocking sort operation - because GOR provides a range-based sort. Since we know that the maximum segment size of genes is 4Mbp, we can inform SORT that this is the largest possible displacement in the ordering.
%%gor
$mydefs3
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
| sort 4000000
| varjoin -r <(gor #mutpred#)
Query ran in 0.25 sec Query fetched 5,522 rows in 4.36 sec (total time 4.61 sec)
chrom | pos | ref | alt | disease | distance | gene_start | gene_end | gene_symbol | MutPred_rankscore | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | . |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.97773 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.66843 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.45202 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0 | 141330513 | 141512975 | PCDHGA1 | 0.54201 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141421046 | 141512975 | PCDHGA11 | . |
5518 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141430506 | 141512975 | PCDHGA12 | . |
5519 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141475946 | 141512977 | PCDHGC3 | . |
5520 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141484996 | 141512975 | PCDHGC4 | . |
5521 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | 0 | 141489080 | 141512975 | PCDHGC5 | . |
5522 rows × 10 columns
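If the maximum displacement in the ordering is not known, a fully blocking sort over a whole chromosome should also restore the order, at a higher memory and latency cost. A minimal sketch (not executed here; it assumes SORT accepts a chromosome-wide window, written here as sort chrom):
gor [#busy#] | join -segsnp <(gor #cvars# )
| select chrom,pos-disease,distance,gene_*
| sort chrom
| varjoin -r <(gor #mutpred#)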
The important take-home lesson here is that while we can easily change the column order, if we want to apply pipe-step GOR commands that rely on consistent genomic order, such as VARJOIN, GROUP, RANK, etc., we must ensure that the first two columns are in proper genomic order. Before we leave this subject, we provide a few more ways to achieve the above.
First we make sure that we don't have overlapping segments by using the SEGPROJ command; however, we then lose the gene symbol. Therefore, we have to join it back in:
%%gor
$mydefs3
gor [#busy#] | segproj | join -segsnp <(gor #cvars# )
| select chrom,pos-disease | join -r -snpseg -maxseg 4000000 [#busy#]
| varjoin -r <(gor #mutpred# )
Query ran in 0.57 sec Query fetched 5,522 rows in 4.58 sec (total time 5.15 sec)
chrom | pos | ref | alt | disease | gene_symbol | MutPred_rankscore | |
---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | PCDHGA1 | . |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.97773 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.66843 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.45202 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | PCDHGA1 | 0.54201 |
... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGA11 | . |
5518 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGA12 | . |
5519 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGC3 | . |
5520 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGC4 | . |
5521 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | PCDHGC5 | . |
5522 rows × 7 columns
Notice that we did not have to use a SORT because the segments in the left-source are non-overlapping after SEGPROJ. We can do something similar by using the -ir option in JOIN. This option outputs only the right-source rows, and each row only once, even though it may overlap multiple rows in the left-source.
%%gor
$mydefs3
gor [#busy#] | join -segsnp -ir <(gor #cvars# )
| varjoin -r <(gor #mutpred# )
| join -r -snpseg -maxseg 4000000 [#busy#]
Query ran in 0.31 sec Query fetched 5,522 rows in 4.33 sec (total time 4.64 sec)
chrom | pos | ref | alt | disease | MutPred_rankscore | gene_symbol | |
---|---|---|---|---|---|---|---|
0 | chr5 | 141330778 | C | A | INBORN_GENETIC_DISEASES | . | PCDHGA1 |
1 | chr5 | 141330954 | A | G | INBORN_GENETIC_DISEASES | 0.97773 | PCDHGA1 |
2 | chr5 | 141331012 | G | T | INBORN_GENETIC_DISEASES | 0.66843 | PCDHGA1 |
3 | chr5 | 141331196 | A | T | INBORN_GENETIC_DISEASES | 0.45202 | PCDHGA1 |
4 | chr5 | 141331255 | C | T | INBORN_GENETIC_DISEASES | 0.54201 | PCDHGA1 |
... | ... | ... | ... | ... | ... | ... | ... |
5517 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGA11 |
5518 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGA12 |
5519 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGC3 |
5520 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGC4 |
5521 | chr5 | 141511049 | G | A | INBORN_GENETIC_DISEASES | . | PCDHGC5 |
5522 rows × 7 columns
We see that for this approach to work we must read the gene segments twice if we want to both use the gene segments to select rows from the right-source and include the gene annotations in the output. In the above examples this is not a big concern because the gene table is very lightweight; this is very often the case, and for example reading all genes or exons comes at very little cost.