Used in: gor only

JOIN¶

The JOIN command can be used in GOR queries to work with multiple sources (files, tables, or GOR streams) at once based on some overlap condition between the sources. It is possible to define the type of overlap condition and also to structure the output of the query based on some options that are defined in the table below.

Usage¶

gor <left_source> | JOIN <condition> [options] <right_source>

Conditions¶

A special set of arguments specify the matching conditions for the key columns in the left and right sources (i.e. snp, seg and var).

Type of join	Description
`-snpsnp`	The overlap condition is an exact match between the first two columns (chr,pos) if no fuzz-factor is specified. Otherwise the overlap condition is snpB.pos-fuzzfac <= snpA.pos and snpA.pos <= snpB.pos+fuzzfac
`-snpseg`	The overlap condition is seg.start-fuzzfac < snp.pos and snp.pos <= seg.stop+fuzzfac.
`-segseg`	The overlap condition is segA.start-fuzzfac < segB.stop && segA.stop+fuzzfac > segB.start.
`-segsnp`	The overlap condition is seg.start-fuzzfac < snp.pos and snp.pos <= seg.stop+fuzzfac.
`-varseg`	Overlap of (ref,alt) variant with segment.
`-segvar`	Overlap of segment with (ref,alt) variant.

Options¶

Further options on joins are specified in the table below.

`-lstop col`	The column with the stop position of the segments in the left-source. Defaults to #3.
`-rstop col`	The column with the stop position of the segments in the right-source. Defaults to #3.
`-xl cols`	The columns to perform additional equi-join in the left-source.
`-xr cols`	The columns to perform additional equi-join in the right-source
`-xcis`	Use case insensitive equi-joins.
`-maxseg size`	The maximum segment size. The default is 3MB for -segseg and -snpseg join. The default for the `-maxseg` value is set to 1 for `-segsnp`, `-varsnp`, `-snpsnp` joins.
`-l`	Left-join style overlap, shows all rows in the left-source.
`-t`	To-List style, single output line per left-source row. Comma separation of multiple overlap values.
`-c`	Count, returns a column with the number of overlaps. Only used with the -t option.
`-f fuzzfac`	A base pair proximity integer value, e.g. 0, 1000 or 10000000. Note that a large fuzz factor can increase the running time and memory usage.
`-n`	Return only rows from the left-stream that do NOT overlap with rightfile.gor.
`-i`	Return only rows from leftfile.gor that are included in the overlap (without the columns from the right-source).
`-ic`	Return the rows in left-source and a column named overlap. Count which shows how many rows in right-source overlap.
`-ir`	Return the rows from right-source that overlap with the left-source. Depending on the nature of the left-source and the overlap condition, the join may have to be followed with SORT step to ensure GOR.
`-m`	Return only overlaps with the smallest distance (one or many). Only makes sense when used with the -f option.
`-o n`	Return only the closest n overlaps row from rightfile per row in the left input source.
`-rprefix name`	A prefix to be added to the columns in the right-source.
`-r`	Reduced output. Do not include distances and position columns from the right source.
`-s colname`	Source column name (for GOR dictionary database file).
`-e char`	Character to denote empty field. Defaults to empty string, i.e. of length 0.

For -varseg and -segvar joins, there are a few extra options possible:

`-ref`	The column denoting the reference seq. (for both left- and right-source).
`-refl`	The column denoting the reference seq. in left-source.
`-refr`	The column denoting the reference seq. right-source.

Examples¶

The following join command joins a snp left source (the #dbSNP# table) to a right source containing segments (#genes#).

gor #dbSNP# | join -snpseg #genes#

In this next example, the #genes# table is joined as a nested query, which is necessary if there will be some calculations or other modifications done to the right source. Here, the size of the genes is being calculated and they are then ranked in relation to all other genes on the same chromosome. Furthermore, the -maxseg is setting the maximum segment size to 1B bases to cover the length of the longest chromosome, since otherwise the result would be incorrect.

gor #dbSNP# | join -snpseg -maxseg 1000000000 <(gor #genes# | calc size #3-#2 | rank chrom size -o asc | where rank_size < 100)

gor #genes# | where gene_symbol ~ 'B*' | join -segvar -ir #dbSNP# | sort 3000000 | verifyorder

gor #genes# | where gene_symbol ~ 'B*' | segspan | join -segvar -ir #dbSNP# | verifyorder

gor #genes# | where gene_symbol ~ 'B*' | segspan | join -segvar -ir #wgsvars# -s PN

gor #genes# | where gene_symbol ~ 'B*' | segspan | join -segvar -ir <(gor #wgsvars# -s PN -f PN1,PN2)

gor #wgsvars# -ff myPNs.tsv | join -snpseg -ic -xl PN -xr PN -maxseg 10000 <(gor #goodcov8# -ff myPNs.tsv)

gor #dbsnp# | multimap -cartesian myPNs.tsv | varjoin -l -xl PN -xr PN -e '0' -r <(#wgsvars# -ff myPNs.tsv | select 1-4,FILTER,Callcopies)
| join -snpseg -ic -xl PN -xr PN <(gor #goodcov8# -ff myPNs.tsv) | replace callcopies if(FILTER != 'PASS','NA',if(callcopies='0' and overlapCount=0,'NA',callcopies)
| sort 1 -c PN | group 1 -gc 3,4,rsID -lis -sc

Equivalent Queries¶

The following queries are equivalent and illustrate how to use different commands to achieve the same result.

gor #dbsnp# | join -snpseg -i #genes# == gor #dbsnp# | join -snpseg #genes# | group 1 -gc 3-rsids

gor #dbsnp# | join -snpseg -ic #genes# == gor #dbsnp# | join -snpseg -l <(gor #genes# | calc overlap 1) | group 1 -gc 3-rsids -sum -ic overlap | rename sum_overlap overlapcount

gor #dbsnp# | join -snpseg -ir #genes# | sort 3000000 == gor #dbsnp# | join -snpseg  #genes#  | select rsids[+1]- | sort 3000000