Filtering Streams¶
Once you have defined a source table for your data, you can start to structure the way you want to have the data processed and filtered. You can define commands for data processing and filtering in any number of pipe steps. Note that data flows through the pipes from left to right so that, most often, the order of the steps matters.
TOP and SKIP¶
There are several commands that filter streams in the GOR Query Language. The most basic are the TOP and SKIP commands, which are used to limit the number and section of rows in the stream.
gor #dbsnp# | top 100
The above example will only show the first 100 rows of the ##dbsnp##
table and terminate the reading of the data.
gor -p chr2 #dbsnp# | skip 100 | top 100
will likewise only show rows from 101 to 200 on chr2 from the ##dbsnp##
table.
WHERE¶
The main function for filtering streams is the WHERE command. We will now go over some of the different ways that a WHERE command can be used.
Matching Patterns¶
To start with a simple example of a WHERE command, we can consider the gene table, which is included in the reference data for the Sequence Miner. If you were to run the following GOR query:
gor #genes# | WHERE Gene_Symbol IN ('BRCA1', 'BRCA2')
you would expect the following output from the query showing only the two rows:
Chrom gene_start gene_end Gene_Symbol
chr13 32889610 32973805 BRCA1
chr17 41196311 41277500 BRCA2
Multiple match conditions can be linked together in a single WHERE command such in the following case:
gor #dbsnp# | WHERE reference+’>’+allele IN (’A>G’,’G>A’,’C>T’,’T>C’) OR (len(#3)=len(#4) AND (len(#3)=2 OR len(#3)=3))
As shown above, AND
and OR
statements can be nested in curved brackets to include multiple match conditions.
Note
Clicking the TAB
key while writing queries in the Sequence Miner brings up a context-sensitive dialog that allows you to select columns that can be added to the query.
Using Wildcard Notation¶
It is also possible to use wildcard notation in WHERE clauses. The previous example with the ##genes##
table could be accomplished as follows by using a wildcard:
gor #genes# | WHERE Gene_Symbol LIKE 'BRCA*'
Another example is shown with wildcards below, this time using LIKE and RLIKE which accepts regular expressions:
gor #dbsnp# | top 1000 | WHERE rsIDs ~ 'rs22*' and NOT( rsIDs LIKE 'rs?23*' or rsIDs RLIKE 'rs(3|4)*2.*')
Note
Using ~
is equivalent to using LIKE. ?
can be used for any character.
Base-pair Notation¶
There is a special syntax for base-pair positions, i.e. integers can be represented with commas every 3 digits. They must however be followed by the letters bp, e.g. not( pos > 123,345,789bp ).
Using Functions in WHERE commmands¶
It is possible to use Functions in WHERE commands and also in CALC commands, which are discussed in a later chapter.
Chrom |
POS |
reference |
allele |
rsIDs |
---|---|---|---|---|
chr1 |
10233 |
CCTAACCCTAACCCTAAACCCTAAACCC |
C |
rs200462216 |
chr1 |
10332 |
CCTAACCCTAACCCTAACCCTACCC |
C |
rs201106462 |
chr1 |
10621 |
GTTGCAAAGGCGCGCCGCGCCG |
G |
rs376342519 |
chr1 |
12940 |
AAACA |
A |
rs756849893 |
chr1 |
13421 |
A |
AGAGA |
rs777038595 |
chr1 |
15189 |
CGGGCACTGATGAGACAGCGGC |
C |
rs768510816 |
gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | TOP 6
In the above example, we are using the LEN function, which calculates the length of a string, to filter out any results from the #dbsnp#
table that do not have reference or allele base lengths longer than 4 bases. You can find more information on different formulas in GOR queries here. More information on the CALC command can be found here
Using SELECT to choose columns¶
The SELECT command allows you to choose which columns to show from the source by entering a comma-delimited list.
When selecting in GOR, genomic ordered relational data is expected, so it is mandatory to have a CHROM
and POS
as the first two columns. However, when the SELECT statement is used in NOR, you can choose whatever columns you want. If columns other than the first two (e.g. #1
and #2
) are selected to become the first two columns, care must be taken to ensure that proper genomic order is preserved (see discussion on the -ir
option in the JOIN command).
Showing and Hiding Columns¶
In the next example, let’s take a table that has many different columns such as the whole exon sequence variants from the subject project, which can be loaded using the alias #wesVars#
:
CHROM |
POS |
Reference |
Call |
CallCopies |
CallRatio |
Depth |
GL_Call |
FILTER |
FS |
FormatZip |
PN |
---|---|---|---|---|---|---|---|---|---|---|---|
chr1 |
13273 |
G |
C |
1 |
0.2 |
85 |
318 |
VQSRTrancheSNP99.90to100.00 |
0.0 |
Alt=C:GT=0/1,AD=68,17,DP=85,GQ=99,PL=318,0,826 |
C416TO_ANDERSON_CHILD_2005_M |
chr1 |
13273 |
G |
C |
1 |
0.239 |
88 |
408 |
VQSRTrancheSNP99.90to100.00 |
0.0 |
Alt=C:GT=0/1,AD=67,21,DP=88,GQ=99,PL=408,0,723 |
C416TO_ANDERSON_FATHER_1967_M |
chr1 |
13273 |
G |
C |
1 |
0.206 |
68 |
231 |
VQSRTrancheSNP99.90to100.00 |
12.71 |
Alt=C:GT=0/1,AD=54,14,DP=68,GQ=99,PL=231,0,687 |
C416TO_ANDERSON_CHILD_2000_F |
chr1 |
13302 |
C |
T |
1 |
0.614 |
44 |
50 |
LowQual |
0.0 |
Alt=T:GT=0/1,AD=17,27,DP=44,GQ=50,PL=50,0,239 |
C416TO_FATHER |
Now let’s say that we only want to retrieve the reference and call columns along with quality filter (FILTER column) and the unique patient identifier. To retrieve only those columns, we could execute the following GOR query on the table:
gor #wesVars# | SELECT 1,2,Reference,Call,FILTER,PN | TOP 4
Note
When writing comma-separated lists, take care not to have spaces between the items in a list.
This would result in the following result set being returned:
CHROM |
POS |
Reference |
Call |
FILTER |
PN |
---|---|---|---|---|---|
chr1 |
13273 |
G |
C |
VQSRTrancheSNP99.90to100.00 |
C416TO_ANDERSON_CHILD_2005_M |
chr1 |
13273 |
G |
C |
VQSRTrancheSNP99.90to100.00 |
C416TO_ANDERSON_FATHER_1967_M |
chr1 |
13273 |
G |
C |
VQSRTrancheSNP99.90to100.00 |
C416TO_ANDERSON_CHILD_2000_F |
chr1 |
13302 |
C |
T |
LowQual |
C416TO_FATHER |
We could accomplish the same result if we were to use the HIDE command to hide the columns we do not wish to see.
gor #wesVars# | HIDE 5-8,FS,#11 | TOP 4
In the example above, we are defining a range of columns (i.e. 5 through to 8); the FS column is referred to by its name; column #11 is referred to by its position in the column sequence.
Commands allow reference to columns based on their number and their names, however, numbers must be prefixed by #
must be used in WHERE and CALC to distinguish
them from numbers in expressions. If the listed columns do not exist, the query will throw an exception.
Note that commands and column names are case insensitive.