Grouping and Aggregation¶

The GOR syntax provides two commands to do grouping and aggregation, the GROUP command and the GRANNO command, which is for aggregating and annotating in a single pass.

Grouping¶

The GROUP command allows you aggregate your results similarly to the way grouping works in SQL, although the syntax is slightly different. In the GOR language, grouping is always based on a sliding genomic bin size.

For example, the following GOR query calculates the number of PNs with vars in chr21, i.e. the number of variants per pn and then the number of PNs.

gor -p chr21 #wgsvars# | GROUP chrom -gc pn -count | GROUP chrom -count

We will now look at a few different types of grouping in the GOR query language.

Defining the Binsize¶

When performing a GROUP command, the query can be set to group based on an integer to denote a range (in base pairs) or it can be set to chromosome or genome. The following query returns the #dbsnp# variant table grouped by chromosome with a count of how many SNPs are on each (shown in the allCount column):

gor #dbsnp# | GROUP chrom -count

#dbsnp# grouped by chromosome¶
Chrom	bpStart	bpStop	allCount
chr1	0	249250621	18642860
chr2	0	243199373	20216254
chr3	0	198022430	16557941
chr4	0	191154276	16044444
chr5	0	180915260	14938502
chr6	0	171115067	14143210
…	…	…	…

Another way to get the total number of SNPs on chromosome 1, would be to use the following GOR query:

gor -p chr1 #dbsnp# | GROUP genome -count

Since we made the grouping over the genome, a special segment named chrA (chromosome all) is returned.

To further illustrate the use of binsize, we could then take only chromosome 21 (using the position filter on the GOR query) and group with a binsize of 1 million basepairs, as shown below:

gor -p chr21 #dbsnp# | GROUP 10000000 -count

#dbsnp# chr21 grouped into buckets of 10m bp¶
Chrom	bpStart	bpStop	allCount
chr21	0	10000000	38990
chr21	10000000	20000000	607009
chr21	20000000	30000000	889141
chr21	30000000	40000000	857939
chr21	40000000	50000000	782185

The above example shows how many SNPs exist in each segment of 1m base pairs on chromosome 21.

It is important to note that the binsize can directly impact the performance of your queries. It is advisable to limit the binsize to as small a number as possible (i.e. do not use the “genome” binsize when the “chrom” will do). Often you can limit the binsize to 3 megabases, which is longer than the longest human gene.

Introducing Parallel Queries¶

The query above that shows a GROUP command that returns the number of SNPs on each chromosome, but that query takes a very long time to run because it goes through the genome chromosome by chromosome. A better way to do this is using PGOR. If we were to write the query as follows:

pgor #dbsnp# | GROUP chrom -count

the query will automatically be split up and will return the count for each chromosome much faster. A full discussion of parallelization can be found in a later chapter specifically on Parallel GOR.

Grouping on Columns¶

We can specify additional grouping by using the -gc option, which stands for “grouping column”. To understand this type of group, we could take the same example of the 21st chromosome in the #dbsnp#, but in this case we would only look at substitutions as in the table shown below:

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1

#dbsnp# chr21 - substitutions only¶
Chrom	POS	reference	allele	rsIDs
chr21	9411199	T	C	rs376129767
chr21	9411236	G	A	rs922165264
chr21	9411239	G	A	rs559462325
chr21	9411242	C	A	rs531773366
chr21	9411243	A	C	rs191612142

Next, we might want to use the CALC command to create a new column with the type of SNP in each substitution, as shown below:

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs

#dbsnp# chr21 with SNP type¶
Chrom	POS	reference	allele	snptype
chr21	9411199	T	C	T/C
chr21	9411236	G	A	G/A
chr21	9411239	G	A	G/A
chr21	9411242	C	A	C/A
chr21	9411243	A	C	A/C

Using the -gc option on the GROUP command, we could then group based on this new snptype column using the GOR query shown here:

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs | GROUP chrom -count -gc snptype

#dbsnp# chr21: SNP types grouped over the chromosome¶
Chrom	bpStop	snptype	allCount
chr21	48129895	A/C	117055
chr21	48129895	A/G	399872
chr21	48129895	A/T	109699
chr21	48129895	C/A	152813
chr21	48129895	C/G	133850

As before, we could also define the binsize as a specific number of base pairs, in which case the output of the query would give us the number of each type of SNP in each interval we define in the binsize. In the GOR query below, we are grouping the SNP types with a bin size of 1m base pairs.

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs | GROUP 1000000 -count -gc snptype

The next example shows how we could calculate the number of PNs with variants in chr21.

gor -p chr21 #wgsvars# | GROUP chrom -gc PN -count | GROUP chrom -count

The example also shows a good example of using the GROUP command twice in a query, first to count the number of variants per PN and then to count the number of PNs.

As we will see in the next section, we can then use another GROUP command to perform various statistical calculations in the query.

Calculating Max, Min, Avgs¶

We can use some options in the GROUP command to calculate the maximum, minimum, and average number of each type of SNP in each 1 megabase binsize as is shown in the following example. Note that you must specify the type (in this case -ic for integer column) and name of the column (or multiple columns, comma delimted) that you wish to perform the calculations for.

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs | GROUP 100000 -count -gc snptype
| GROUP chrom -max -min -avg -ic allCount -gc snptype

#dbsnp# chr21: SNP types (max, min, avg)¶
Chrom	bpStart	bpStop	snptype	min_allCount	max_allCount	avg_allCount
chr21	0	48129895	A/C	37	936	327.885154
chr21	0	48129895	A/G	66	2337	1120.08963
chr21	0	48129895	A/T	24	1279	307.28011
chr21	0	48129895	C/A	61	1422	428.04761
chr21	0	48129895	C/G	39	1055	374.92997
…	…	…	…	…	…	…

As you can see in the table above, the calculated columns are prefixed with the type of calculation that has been performed in each case.

Using GRANNO to Annotate¶

The GRANNO command allows you to group and annotate in a single pass. A full description of the GRANNO command can be found here.

To illustrate the use of the GRANNO command, we could take a look at the previous example. Let’s say that instead of grouping over the whole chromosome, we might want to find out how many different SNPs there are, on average, on each gene. The #dbsnp# does not have gene data, so we will have to join to either the #genes# or #exons# table and group on the gene_symbol column, as shown in the example below:

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| JOIN -snpseg #exons# | CALC snptype reference+'/'+allele
| HIDE rsIDs | GROUP chrom -gc gene_symbol -count

Note that in the example we show here, we are hiding certain columns to sculpt the output of the query to include only those columns relevant to the example.

Aggregation of SNPs for each Gene¶
Chrom	bpStart	bpStop	gene_symbol	allCount
chr21	0	48129895	7SK	107
chr21	0	48129895	ABCC13	1654
chr21	0	48129895	ABCG1	5180
chr21	0	48129895	ADAMTS1	1955
chr21	0	48129895	ADARB1	1103
chr21	0	48129895	ADAMTS5	5693
…	…	…	…	…

However, as you can see above, this only gives us the start and stop position for the chromosome itself and not the exon in question. We may want to have this data genomic-ordered and so we need to add in the chromstart and chromend columns from the #exons# table.

gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| JOIN -snpseg #exons# | CALC snptype reference+'/'+allele
| HIDE rsIDs | GROUP chrom -gc gene_symbol -count

Aggregation of SNPs for each gene with position information¶
Chrom	bpStart	bpStop	chromstart	chromend	gene_symbol	allCount
chr21	0	48129895	10199942	10200025	CR381653.1	9
chr21	0	48129895	10380379	10380661	RN7SL52P	13
chr21	0	48129895	10385952	10386047	SNORA70	9
chr21	0	48129895	10475514	10476061	bP-21201H5.1	57
chr21	0	48129895	10862621	10862667	IGHV1OR21-1	64
chr21	0	48129895	10862750	10863057	IGHV1OR21-1	243
chr21	0	48129895	10862750	10863067	IGHV1OR21-1	250
…	…	…	…	…	…	…

As you can above, this gives us something closer to the genomic ordered stream we are looking for, but there are still multiple entries for individual genes (since they sometimes overlap with multiple exonic regions). To remedy this, we can use the GRANNO command, which can group and annotate in a single pass.

In the next example, we use the GRANNO command and we can define the binsize as a special value of gene, which essentially sets the value to 3M base pairs in conjunction with the -range option. This means that we will not look for any grouping of the gene_symbol column (indicated with the -gc option) outside of a range of 3M base pairs. Next, we define the annotation columns that we wish to generate (with the -ic option) and what types of annotations we wish to perform (in this case, -max and -min).

gor -p chr21 #dbsnp# | JOIN -snpseg #exons#
| WHERE len(reference)=1 AND len(allele)=1 | CALC snptype reference+'/'+allele
| GRANNO gene -range -gc gene_symbol -ic chromstart,chromend -max -min
| SELECT 1-4,chromstart,chromend,gene_symbol,snptype-

Aggregation of SNPs for each gene with GRANNO command¶
Chrom	POS	reference	allele	chromstart	chromend	gene_symbol	snptype	min_chromstart	max_chromstart	min_chromend	max_chromend
chr21	9683195	G	A	9683190	9683272	CR381670.1	G/A	9683190	9683190	9683272	9683272
chr21	9683199	C	G	9683190	9683272	CR381670.1	C/G	9683190	9683190	9683272	9683272
chr21	9683201	G	T	9683190	9683272	CR381670.1	G/T	9683190	9683190	9683272	9683272

We can then further group this result over the chromosome (with a GROUP command) on the columns min_chromstart, max_chromend, gene_symbol, and snptype, as shown in the example below:

gor -p chr21 #dbsnp# | JOIN -snpseg #exons#
| WHERE len(reference)=1 AND len(allele)=1 | CALC snptype reference+'/'+allele
| GRANNO gene -range -gc gene_symbol -ic chromstart,chromend -max -min
| GROUP chrom -gc min_chromstart,max_chromend,gene_symbol,snptype -count
| SELECT Chrom,min_chromstart- | SORT chrom

Aggregation of SNPs using both GRANNO and GROUP, genomic-ordered stream¶
Chrom	min_chromstart	max_chromend	gene_symbol	snptype	allCount
chr21	9683190	9683272	CR381670.1	A/C	1
chr21	9683190	9683272	CR381670.1	C/G	1
chr21	9683190	9683272	CR381670.1	C/T	2
chr21	9683190	9683272	CR381670.1	G/A	1
chr21	9683190	9683272	CR381670.1	G/T	3
…	…	…	…	…	…

There is a lot going on here, so we can take it line by line. The first line is the join between the tables. The second line does the filtering on the left source and adds the calculated column. The third line runs the GRANNO command as described above, grouping and annotating in a single step to find the min_chromstart and max_chromend for the genes. Next we group over the whole chromosome to get the count of each type of SNP. Finally, we SELECT the columns we want to display.

You’ll note at the end of this query that we have added a SORT command to make sure that the genomic order is preserved in the GOR stream. As discussed in a previous section on genomic order, this is often necessary when columns are being moved around so much within a GOR query.