Column Modifications¶
In this chapter, we will be going through several of the commands in the GOR query language that allow us to add columns to the GOR stream by calculating data and manipulating the columns’ titles in various ways to organize the data better.
To show examples of column modifications, we will be taking the example of the #dbsnp#
table shown below:
Chrom |
POS |
reference |
allele |
rsIDs |
---|---|---|---|---|
chr1 |
10233 |
CCTAACCCTAACCCTAAACCCTAAACCC |
C |
rs200462216 |
chr1 |
10332 |
CCTAACCCTAACCCTAACCCTACCC |
C |
rs201106462 |
chr1 |
10621 |
GTTGCAAAGGCGCGCCGCGCCG |
G |
rs376342519 |
chr1 |
12940 |
AAACA |
A |
rs756849893 |
chr1 |
13421 |
A |
AGAGA |
rs777038595 |
chr1 |
15189 |
CGGGCACTGATGAGACAGCGGC |
C |
rs768510816 |
gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | TOP 6
As can be seen above, the #dbsnp#
table shows different variants that can occur at various positions on each chromosome. In the example above, we have selected several rows that illustrate five different deletions and a single insertion, which will provide us with a good example for our next topics.
Calculated Columns¶
Using the CALC command in a GOR query, we can add columns to the GOR output. For example, we can add a column that shows the length of the reference string using the same len function that we used in the WHERE from the above query.
Chrom |
POS |
reference |
allele |
rsIDs |
refLength |
---|---|---|---|---|---|
chr1 |
10233 |
CCTAACCCTAACCCTAAACCCTAAACCC |
C |
rs200462216 |
28 |
chr1 |
10332 |
CCTAACCCTAACCCTAACCCTACCC |
C |
rs201106462 |
25 |
chr1 |
10621 |
GTTGCAAAGGCGCGCCGCGCCG |
G |
rs376342519 |
22 |
chr1 |
12940 |
AAACA |
A |
rs756849893 |
5 |
chr1 |
13421 |
A |
AGAGA |
rs777038595 |
1 |
chr1 |
15189 |
CGGGCACTGATGAGACAGCGGC |
C |
rs768510816 |
22 |
gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | CALC refLength len(reference) | TOP 6
As you can see above, a refLength column has been added to the end of the output with the base length of the reference column.
Replacing Columns with Calculated Columns¶
You may also choose to replace an existing column with some calculated data. The REPLACE command can be used to replace any column in the output (other than the Chrom and POS columns, which must always be present in the GOR stream). It works exactly like the CALC command, but instead of adding a column and specifying the name of the column, you specify the name of the column that you wish to replace.
For example, the following query replaces the rsIDs in the #dbsnp
table with the listsize of the content of the rsIDs cells.
gor #dbsnp# | REPLACE rsIDs listsize(rsIDs) | TOP 10
Note
Care should be taken when replacing columns with calculated content that the name of the column still makes sense.
Prefix and Rename¶
The column names and column order can be managed with commands such as RENAME, PREFIX and COLUMNSORT. As an example, in the following gor query, :
gor #dbsnp# | CALC splitprefix_cols listsize(rsIDs) | CALC splitprefix_1 listfilter(rsIDs,'i=1')
| CALC splitprefix_2 listfilter(rsIDs,'i=2') | RENAME splitprefix_(.*) #{1}
| PREFIX rsIDs[+1]- differentprefix | COLUMNSORT 1-4,different* | top 10
The RENAME command supports regular expression match on columns and the binding variable are referred to as #{n}. In the example above, the rename eliminates splitprefix_
from all columns having the corresponding prefix. The PREFIX command adds a new prefix, “differentprefix”, to all the columns to the right of the rsIDs column and, finally, the COLUMNSORT command lists the new output order of columns. Note that since all the columns are in the output, those who are not listed will be placed in alphabetical order. Thus, the COLUMNSORT command makes it easy to emphasize certain columns as the left-most columns in the output.
More about prefixes¶
It can be useful, particularly when working with joins, to add prefixes to column names to the columns that come from one of the sources to distinguish between analogous columns from two or more tables. However, the PREFIX command can be used to prefix any column name.
Using our CALC example from above, we could prefix the calculated column with a calc_ as shown below:
Chrom |
POS |
reference |
allele |
rsIDs |
calc_refLength |
---|---|---|---|---|---|
chr1 |
10233 |
CCTAACCCTAACCCTAAACCCTAAACCC |
C |
rs200462216 |
28 |
chr1 |
10332 |
CCTAACCCTAACCCTAACCCTACCC |
C |
rs201106462 |
25 |
chr1 |
10621 |
GTTGCAAAGGCGCGCCGCGCCG |
G |
rs376342519 |
22 |
chr1 |
12940 |
AAACA |
A |
rs756849893 |
5 |
chr1 |
13421 |
A |
AGAGA |
rs777038595 |
1 |
chr1 |
15189 |
CGGGCACTGATGAGACAGCGGC |
C |
rs768510816 |
22 |
gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | CALC refLength len(reference) | PREFIX refLength calc
Prefixes can also be useful with SELECT commands. We could, for example, prefix the variation data in a GOR stream with “var-” and then refer to all of the variation data columns in a SELECT command with var*
.
Note
It is not necessary to add the underscore to the prefix that you specify in the query. The underscore is added automatically.
Adding Row Numbers¶
It is possible to add a column with row numbers to the GOR output by using the ROWNUM command. Rows in the output will be given unique, sequential row numbers, based on the number of rows that have been passed through the query.
The ROWNUM command takes no extra parameters and can be added to a GOR query as the following example shows:
Chrom |
POS |
reference |
allele |
rsIDs |
rownum |
---|---|---|---|---|---|
chr1 |
10233 |
CCTAACCCTAACCCTAAACCCTAAACCC |
C |
rs200462216 |
1 |
chr1 |
10332 |
CCTAACCCTAACCCTAACCCTACCC |
C |
rs201106462 |
2 |
chr1 |
10621 |
GTTGCAAAGGCGCGCCGCGCCG |
G |
rs376342519 |
3 |
chr1 |
12940 |
AAACA |
A |
rs756849893 |
4 |
chr1 |
13421 |
A |
AGAGA |
rs777038595 |
5 |
chr1 |
15189 |
CGGGCACTGATGAGACAGCGGC |
C |
rs768510816 |
6 |
gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | ROWNUM | TOP 6
The row numbers are added to the output as a column with the title rownum
.
Take care when using row numbers with GOR queries containing nested queries as the nested queries are also given unique row numbers by this command, which may lead to unexpected results.
If we were to add the ROWNUM command before (and after) the WHERE command above, you can see where the filtered data points occur in relation to the whole variant table, which could be useful for data plots. That example is shown below:
Chrom |
POS |
reference |
allele |
rsIDs |
rownum |
rownumx |
---|---|---|---|---|---|---|
chr1 |
10233 |
CCTAACCCTAACCCTAAACCCTAAACCC |
C |
rs200462216 |
24 |
1 |
chr1 |
10332 |
CCTAACCCTAACCCTAACCCTACCC |
C |
rs201106462 |
58 |
2 |
chr1 |
10621 |
GTTGCAAAGGCGCGCCGCGCCG |
G |
rs376342519 |
104 |
3 |
chr1 |
12940 |
AAACA |
A |
rs756849893 |
146 |
4 |
chr1 |
13421 |
A |
AGAGA |
rs777038595 |
183 |
5 |
chr1 |
15189 |
CGGGCACTGATGAGACAGCGGC |
C |
rs768510816 |
234 |
6 |
gor #dbsnp# | ROWNUM | WHERE len(reference) > 4 OR len(allele) > 4 | ROWNUM | TOP 6