Filtering Streams¶

Once you have defined a source table for your data, you can start to structure the way you want to have the data processed and filtered. You can define commands for data processing and filtering in any number of pipe steps. Note that data flows through the pipes from left to right so that, most often, the order of the steps matters.

TOP and SKIP¶

There are several commands that filter streams in the GOR Query Language. The most basic are the TOP and SKIP commands, which are used to limit the number and section of rows in the stream.

gor #dbsnp# | top 100

The above example will only show the first 100 rows of the ##dbsnp## table and terminate the reading of the data.

gor -p chr2 #dbsnp# | skip 100 | top 100

will likewise only show rows from 101 to 200 on chr2 from the ##dbsnp## table.

WHERE¶

The main function for filtering streams is the WHERE command. We will now go over some of the different ways that a WHERE command can be used.

Matching Patterns¶

To start with a simple example of a WHERE command, we can consider the gene table, which is included in the reference data for the Sequence Miner. If you were to run the following GOR query:

gor #genes# | WHERE Gene_Symbol IN ('BRCA1', 'BRCA2')

you would expect the following output from the query showing only the two rows:

Chrom    gene_start     gene_end    Gene_Symbol
chr13    32889610       32973805    BRCA1
chr17    41196311       41277500    BRCA2

Multiple match conditions can be linked together in a single WHERE command such in the following case:

gor #dbsnp# | WHERE reference+’>’+allele IN (’A>G’,’G>A’,’C>T’,’T>C’) OR (len(#3)=len(#4) AND (len(#3)=2 OR len(#3)=3))

As shown above, AND and OR statements can be nested in curved brackets to include multiple match conditions.

Note

Clicking the TAB key while writing queries in the Sequence Miner brings up a context-sensitive dialog that allows you to select columns that can be added to the query.

Using Wildcard Notation¶

It is also possible to use wildcard notation in WHERE clauses. The previous example with the ##genes## table could be accomplished as follows by using a wildcard:

gor #genes# | WHERE Gene_Symbol LIKE 'BRCA*'

Another example is shown with wildcards below, this time using LIKE and RLIKE which accepts regular expressions:

gor #dbsnp# | top 1000 | WHERE rsIDs ~ 'rs22*' and NOT( rsIDs LIKE 'rs?23*' or rsIDs RLIKE 'rs(3|4)*2.*')

Note

Using ~ is equivalent to using LIKE. ? can be used for any character.

Base-pair Notation¶

There is a special syntax for base-pair positions, i.e. integers can be represented with commas every 3 digits. They must however be followed by the letters bp, e.g. not( pos > 123,345,789bp ).

Using Functions in WHERE commmands¶

It is possible to use Functions in WHERE commands and also in CALC commands, which are discussed in a later chapter.

#dbsnp# (showing different Indel examples)¶
Chrom	POS	reference	allele	rsIDs
chr1	10233	CCTAACCCTAACCCTAAACCCTAAACCC	C	rs200462216
chr1	10332	CCTAACCCTAACCCTAACCCTACCC	C	rs201106462
chr1	10621	GTTGCAAAGGCGCGCCGCGCCG	G	rs376342519
chr1	12940	AAACA	A	rs756849893
chr1	13421	A	AGAGA	rs777038595
chr1	15189	CGGGCACTGATGAGACAGCGGC	C	rs768510816

gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | TOP 6

In the above example, we are using the LEN function, which calculates the length of a string, to filter out any results from the #dbsnp# table that do not have reference or allele base lengths longer than 4 bases. You can find more information on different formulas in GOR queries here. More information on the CALC command can be found here

Using SELECT to choose columns¶

The SELECT command allows you to choose which columns to show from the source by entering a comma-delimited list.

When selecting in GOR, genomic ordered relational data is expected, so it is mandatory to have a CHROM and POS as the first two columns. However, when the SELECT statement is used in NOR, you can choose whatever columns you want. If columns other than the first two (e.g. #1 and #2) are selected to become the first two columns, care must be taken to ensure that proper genomic order is preserved (see discussion on the -ir option in the JOIN command).

Showing and Hiding Columns¶

In the next example, let’s take a table that has many different columns such as the whole exon sequence variants from the subject project, which can be loaded using the alias #wesVars#:

#wesVars#¶
CHROM	POS	Reference	Call	CallCopies	CallRatio	Depth	GL_Call	FILTER	FS	FormatZip	PN
chr1	13273	G	C	1	0.2	85	318	VQSRTrancheSNP99.90to100.00	0.0	Alt=C:GT=0/1,AD=68,17,DP=85,GQ=99,PL=318,0,826	C416TO_ANDERSON_CHILD_2005_M
chr1	13273	G	C	1	0.239	88	408	VQSRTrancheSNP99.90to100.00	0.0	Alt=C:GT=0/1,AD=67,21,DP=88,GQ=99,PL=408,0,723	C416TO_ANDERSON_FATHER_1967_M
chr1	13273	G	C	1	0.206	68	231	VQSRTrancheSNP99.90to100.00	12.71	Alt=C:GT=0/1,AD=54,14,DP=68,GQ=99,PL=231,0,687	C416TO_ANDERSON_CHILD_2000_F
chr1	13302	C	T	1	0.614	44	50	LowQual	0.0	Alt=T:GT=0/1,AD=17,27,DP=44,GQ=50,PL=50,0,239	C416TO_FATHER

Now let’s say that we only want to retrieve the reference and call columns along with quality filter (FILTER column) and the unique patient identifier. To retrieve only those columns, we could execute the following GOR query on the table:

gor #wesVars# | SELECT 1,2,Reference,Call,FILTER,PN | TOP 4

Note

When writing comma-separated lists, take care not to have spaces between the items in a list.

This would result in the following result set being returned:

#wesVars# with 6 columns selected¶
CHROM	POS	Reference	Call	FILTER	PN
chr1	13273	G	C	VQSRTrancheSNP99.90to100.00	C416TO_ANDERSON_CHILD_2005_M
chr1	13273	G	C	VQSRTrancheSNP99.90to100.00	C416TO_ANDERSON_FATHER_1967_M
chr1	13273	G	C	VQSRTrancheSNP99.90to100.00	C416TO_ANDERSON_CHILD_2000_F
chr1	13302	C	T	LowQual	C416TO_FATHER

We could accomplish the same result if we were to use the HIDE command to hide the columns we do not wish to see.

gor #wesVars# | HIDE 5-8,FS,#11 | TOP 4

In the example above, we are defining a range of columns (i.e. 5 through to 8); the FS column is referred to by its name; column #11 is referred to by its position in the column sequence. Commands allow reference to columns based on their number and their names, however, numbers must be prefixed by # must be used in WHERE and CALC to distinguish them from numbers in expressions. If the listed columns do not exist, the query will throw an exception.

Note that commands and column names are case insensitive.