Column Modifications

In this chapter, we will be going through several of the commands in the GOR query language that allow us to add columns to the GOR stream by calculating data and manipulating the columns’ titles in various ways to organize the data better.

To show examples of column modifications, we will be taking the example of the #dbsnp# table shown below:

#dbsnp# (showing different Indel examples)

Chrom

POS

reference

allele

rsIDs

chr1

10233

CCTAACCCTAACCCTAAACCCTAAACCC

C

rs200462216

chr1

10332

CCTAACCCTAACCCTAACCCTACCC

C

rs201106462

chr1

10621

GTTGCAAAGGCGCGCCGCGCCG

G

rs376342519

chr1

12940

AAACA

A

rs756849893

chr1

13421

A

AGAGA

rs777038595

chr1

15189

CGGGCACTGATGAGACAGCGGC

C

rs768510816

gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | TOP 6

As can be seen above, the #dbsnp# table shows different variants that can occur at various positions on each chromosome. In the example above, we have selected several rows that illustrate five different deletions and a single insertion, which will provide us with a good example for our next topics.

Calculated Columns

Using the CALC command in a GOR query, we can add columns to the GOR output. For example, we can add a column that shows the length of the reference string using the same len function that we used in the WHERE from the above query.

#dbsnp# (with a calculated column showing the reference length)

Chrom

POS

reference

allele

rsIDs

refLength

chr1

10233

CCTAACCCTAACCCTAAACCCTAAACCC

C

rs200462216

28

chr1

10332

CCTAACCCTAACCCTAACCCTACCC

C

rs201106462

25

chr1

10621

GTTGCAAAGGCGCGCCGCGCCG

G

rs376342519

22

chr1

12940

AAACA

A

rs756849893

5

chr1

13421

A

AGAGA

rs777038595

1

chr1

15189

CGGGCACTGATGAGACAGCGGC

C

rs768510816

22

gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | CALC refLength len(reference) | TOP 6

As you can see above, a refLength column has been added to the end of the output with the base length of the reference column.

Replacing Columns with Calculated Columns

You may also choose to replace an existing column with some calculated data. The REPLACE command can be used to replace any column in the output (other than the Chrom and POS columns, which must always be present in the GOR stream). It works exactly like the CALC command, but instead of adding a column and specifying the name of the column, you specify the name of the column that you wish to replace.

For example, the following query replaces the rsIDs in the #dbsnp table with the listsize of the content of the rsIDs cells.

gor #dbsnp# | REPLACE rsIDs listsize(rsIDs) | TOP 10

Note

Care should be taken when replacing columns with calculated content that the name of the column still makes sense.

Prefix and Rename

The column names and column order can be managed with commands such as RENAME, PREFIX and COLUMNSORT. As an example, in the following gor query, :

gor #dbsnp# | CALC splitprefix_cols listsize(rsIDs) | CALC splitprefix_1 listfilter(rsIDs,'i=1')
| CALC splitprefix_2 listfilter(rsIDs,'i=2') | RENAME splitprefix_(.*) #{1}
| PREFIX rsIDs[+1]- differentprefix | COLUMNSORT 1-4,different* | top 10

The RENAME command supports regular expression match on columns and the binding variable are referred to as #{n}. In the example above, the rename eliminates splitprefix_ from all columns having the corresponding prefix. The PREFIX command adds a new prefix, “differentprefix”, to all the columns to the right of the rsIDs column and, finally, the COLUMNSORT command lists the new output order of columns. Note that since all the columns are in the output, those who are not listed will be placed in alphabetical order. Thus, the COLUMNSORT command makes it easy to emphasize certain columns as the left-most columns in the output.

More about prefixes

It can be useful, particularly when working with joins, to add prefixes to column names to the columns that come from one of the sources to distinguish between analogous columns from two or more tables. However, the PREFIX command can be used to prefix any column name.

Using our CALC example from above, we could prefix the calculated column with a calc_ as shown below:

#dbsnp# (with a calculated column showing the reference length)

Chrom

POS

reference

allele

rsIDs

calc_refLength

chr1

10233

CCTAACCCTAACCCTAAACCCTAAACCC

C

rs200462216

28

chr1

10332

CCTAACCCTAACCCTAACCCTACCC

C

rs201106462

25

chr1

10621

GTTGCAAAGGCGCGCCGCGCCG

G

rs376342519

22

chr1

12940

AAACA

A

rs756849893

5

chr1

13421

A

AGAGA

rs777038595

1

chr1

15189

CGGGCACTGATGAGACAGCGGC

C

rs768510816

22

gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | CALC refLength len(reference) | PREFIX refLength calc

Prefixes can also be useful with SELECT commands. We could, for example, prefix the variation data in a GOR stream with “var-” and then refer to all of the variation data columns in a SELECT command with var*.

Note

It is not necessary to add the underscore to the prefix that you specify in the query. The underscore is added automatically.

Adding Row Numbers

It is possible to add a column with row numbers to the GOR output by using the ROWNUM command. Rows in the output will be given unique, sequential row numbers, based on the number of rows that have been passed through the query.

The ROWNUM command takes no extra parameters and can be added to a GOR query as the following example shows:

#dbsnp# (with a calculated column showing the reference length)

Chrom

POS

reference

allele

rsIDs

rownum

chr1

10233

CCTAACCCTAACCCTAAACCCTAAACCC

C

rs200462216

1

chr1

10332

CCTAACCCTAACCCTAACCCTACCC

C

rs201106462

2

chr1

10621

GTTGCAAAGGCGCGCCGCGCCG

G

rs376342519

3

chr1

12940

AAACA

A

rs756849893

4

chr1

13421

A

AGAGA

rs777038595

5

chr1

15189

CGGGCACTGATGAGACAGCGGC

C

rs768510816

6

gor #dbsnp# | WHERE len(reference) > 4 OR len(allele) > 4 | ROWNUM | TOP 6

The row numbers are added to the output as a column with the title rownum.

Take care when using row numbers with GOR queries containing nested queries as the nested queries are also given unique row numbers by this command, which may lead to unexpected results.

If we were to add the ROWNUM command before (and after) the WHERE command above, you can see where the filtered data points occur in relation to the whole variant table, which could be useful for data plots. That example is shown below:

#dbsnp# (with a calculated column showing the reference length)

Chrom

POS

reference

allele

rsIDs

rownum

rownumx

chr1

10233

CCTAACCCTAACCCTAAACCCTAAACCC

C

rs200462216

24

1

chr1

10332

CCTAACCCTAACCCTAACCCTACCC

C

rs201106462

58

2

chr1

10621

GTTGCAAAGGCGCGCCGCGCCG

G

rs376342519

104

3

chr1

12940

AAACA

A

rs756849893

146

4

chr1

13421

A

AGAGA

rs777038595

183

5

chr1

15189

CGGGCACTGATGAGACAGCGGC

C

rs768510816

234

6

gor #dbsnp# | ROWNUM | WHERE len(reference) > 4 OR len(allele) > 4 | ROWNUM | TOP 6