Joining Tables¶
Joins are a key operation when working in relational databases and this is also the case in the GOR system. For whole genome query analysis, it is often necessary to be able to perform genomically spatial joins on a large number of features quickly. It is implicit when performing joins in the GOR system to do spatial proximity joins according to the genomic position of the data.
There are many different types of joins that can be performed using the GOR language. A full reference for the JOIN command can be found here, but this chapter will go into several of the most useful operations that can be performed with GOR joins.
Join conditions¶
We must first consider whether we are joining based on single positions (for example, if we are working with a file that has only the chromosome and the position defined, as is the case with the #dbsnp#
table) or based on segments, where the start and stop position are defined in each row (as is the case in a table such as #genes#
).
The table below shows a list of the different types of joins that can be performed along with a description of each.
Type |
Description |
---|---|
|
Matches single positions in one table to single positions in another table. |
|
Matches single positions in one table to segments in another table. |
|
Matches segments in two tables that overlap. |
|
Matches the overlap between one table (based on ref,alt) with a segment. |
|
Matches a segment to a variant (ref,alt) in the right source. |
Join options¶
The first type of join (-snpsnp
) matches single positions in one table to single position in another table. In this most basic example, we could join wesVars
to a VEP dictionary (in this case the vep_single_wgs.gord) to annotate the variants in a particular set of samples with VEP data.
gor source/var/wes_varcalls.gor
| TOP 10
| JOIN -snpsnp -xl reference,call -xr reference,call source/anno/vep_v85/vep_single_wgs_gord
Here we have introduced two extra options for having an exact match also on the reference and call columns in the left source (-xl
) and in the right source (-xr
). This is also referred to in this manual as an equi-join.
In the next example, we are using a negative join to return the variants in our samples minus the variants in dbSNP, which would tell us if there are any novel variants in our sample data:
gor #wesvars#
| JOIN -n #dbsnp# -snpsnp -xl reference,call -xr reference,allele
Note that, in this case, the reference column from the left must match the reference column on the right, but the call must match to allele.
We can perform a basic intersect join by using the -i
option. An intersect join will return only the overlap between two tables. In the following example, we are finding variants overlapping with a gene list (defined here as [my_gene_coordinates]
). Since the gene list will be defined in terms of segments, we must use a -snpseg
join condition here.
gor #wesVars#
| JOIN -snpseg -i [my_gene_coordinates]
Note
Segment ranges in GOR are zero-based UCSC style, e.g. (start,stop)=(100,200) denotes a genomic segment including bases 101-200, i.e. of length 100bp.
Left joins¶
When we talk about left joins in GOR, we are talking about joins that are known in the SQL world as left outer joins. In the following example, we are joining our sample variants to the frequency table, but we do not want to drop the row that fail to match. In this case, if we encounter rows that do not match, we use the -e
option to fill empty results with “0.0”.
gor #wesvars#
| join -l -snpsnp -e 0.0 #freqmax#
Stop positions¶
When you are joining using any of the options that use segments (-snpseg
, -segseg
, or -segsnp
), the stop position for the segments is assumed to be the third column (column #3) in the table. However, this may not be the case for some reason, such as when you are joining to a nested stream and the order of the columns has been changed.
In these cases, it can be useful to use the options -lstop
and -rstop
which both take a single parameter of the column that should be used as the stop position for the segment (in the left and right sources respectively).
Fuzzy joins¶
Fuzzy joins are created in GOR queries by using the -f
option on the query followed by the degree (an integer value) of the fuzziness.
gor #dbsnp# | TOP 1000 | JOIN -snpseg #exons# -f 10 | SELECT 1-10
Chrom |
POS |
reference |
allele |
rsIDs |
distance |
chromstart |
chromend |
gene_symbol |
exon |
---|---|---|---|---|---|---|---|---|---|
chr16 |
89978525 |
C |
T |
rs908913173 |
2 |
89978526 |
89979053 |
MC1R |
ENSE00002231523 |
chr16 |
89978529 |
A |
T |
rs187669455 |
0 |
89978526 |
89979053 |
MC1R |
ENSE00002231523 |
In the example shown above, we have set the join to be fuzzy with a degree of 10, which allows us to widen our result set.
Notice that all of the rows from the original #dbsnp#
table are included this time, but in the case of the first row, the distance is now equal to 2, which is how far outside of the exact match that row is.
Verifying genomic order¶
If we are joining multiple tables and rearranging the columns in our result set, it can sometimes occur that the genomic order is violated in the GOR output. In these cases, the VERIFYORDER command can be used to check the genomic order of the GOR output. If the command is added to the end of a GOR query, it will throw an exception if the genomic order of the GOR stream is violated.
If there is a problem with the genomic order of a GOR stream, it can be useful to use the SORT command to correct the violated order.
Joins in NOR¶
As we discussed at the beginning of this manual, when we are working with files in a NOR context, we cannot rely on genomic-ordered data to join tables. Joins in NOR queries are performed using the MAP and MULTIMAP commands, which are discussed in the following chapter on Map and Multimap.