Grouping and Aggregation¶
The GOR syntax provides two commands to do grouping and aggregation, the GROUP command and the GRANNO command, which is for aggregating and annotating in a single pass.
Grouping¶
The GROUP command allows you aggregate your results similarly to the way grouping works in SQL, although the syntax is slightly different. In the GOR language, grouping is always based on a sliding genomic bin size.
For example, the following GOR query calculates the number of PNs with vars in chr21, i.e. the number of variants per pn and then the number of PNs.
gor -p chr21 #wgsvars# | GROUP chrom -gc pn -count | GROUP chrom -count
We will now look at a few different types of grouping in the GOR query language.
Defining the Binsize¶
When performing a GROUP command, the query can be set to group based on an integer to denote a range (in base pairs) or it can be set to chromosome or genome. The following query returns the #dbsnp#
variant table grouped by chromosome with a count of how many SNPs are on each (shown in the allCount column):
gor #dbsnp# | GROUP chrom -count
Chrom |
bpStart |
bpStop |
allCount |
---|---|---|---|
chr1 |
0 |
249250621 |
18642860 |
chr2 |
0 |
243199373 |
20216254 |
chr3 |
0 |
198022430 |
16557941 |
chr4 |
0 |
191154276 |
16044444 |
chr5 |
0 |
180915260 |
14938502 |
chr6 |
0 |
171115067 |
14143210 |
… |
… |
… |
… |
Another way to get the total number of SNPs on chromosome 1, would be to use the following GOR query:
gor -p chr1 #dbsnp# | GROUP genome -count
Since we made the grouping over the genome, a special segment named chrA
(chromosome all) is returned.
To further illustrate the use of binsize, we could then take only chromosome 21 (using the position filter on the GOR query) and group with a binsize of 1 million basepairs, as shown below:
gor -p chr21 #dbsnp# | GROUP 10000000 -count
Chrom |
bpStart |
bpStop |
allCount |
---|---|---|---|
chr21 |
0 |
10000000 |
38990 |
chr21 |
10000000 |
20000000 |
607009 |
chr21 |
20000000 |
30000000 |
889141 |
chr21 |
30000000 |
40000000 |
857939 |
chr21 |
40000000 |
50000000 |
782185 |
The above example shows how many SNPs exist in each segment of 1m base pairs on chromosome 21.
It is important to note that the binsize can directly impact the performance of your queries. It is advisable to limit the binsize to as small a number as possible (i.e. do not use the “genome” binsize when the “chrom” will do). Often you can limit the binsize to 3 megabases, which is longer than the longest human gene.
Introducing Parallel Queries¶
The query above that shows a GROUP command that returns the number of SNPs on each chromosome, but that query takes a very long time to run because it goes through the genome chromosome by chromosome. A better way to do this is using PGOR. If we were to write the query as follows:
pgor #dbsnp# | GROUP chrom -count
the query will automatically be split up and will return the count for each chromosome much faster. A full discussion of parallelization can be found in a later chapter specifically on Parallel GOR.
Grouping on Columns¶
We can specify additional grouping by using the -gc
option, which stands for “grouping column”. To understand this type of group, we could take the same example of the 21st chromosome in the #dbsnp#
, but in this case we would only look at substitutions as in the table shown below:
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
Chrom |
POS |
reference |
allele |
rsIDs |
---|---|---|---|---|
chr21 |
9411199 |
T |
C |
rs376129767 |
chr21 |
9411236 |
G |
A |
rs922165264 |
chr21 |
9411239 |
G |
A |
rs559462325 |
chr21 |
9411242 |
C |
A |
rs531773366 |
chr21 |
9411243 |
A |
C |
rs191612142 |
Next, we might want to use the CALC command to create a new column with the type of SNP in each substitution, as shown below:
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs
Chrom |
POS |
reference |
allele |
snptype |
---|---|---|---|---|
chr21 |
9411199 |
T |
C |
T/C |
chr21 |
9411236 |
G |
A |
G/A |
chr21 |
9411239 |
G |
A |
G/A |
chr21 |
9411242 |
C |
A |
C/A |
chr21 |
9411243 |
A |
C |
A/C |
Using the -gc
option on the GROUP command, we could then group based on this new snptype
column using the GOR query shown here:
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs | GROUP chrom -count -gc snptype
Chrom |
bpStart |
bpStop |
snptype |
allCount |
---|---|---|---|---|
chr21 |
0 |
48129895 |
A/C |
117055 |
chr21 |
0 |
48129895 |
A/G |
399872 |
chr21 |
0 |
48129895 |
A/T |
109699 |
chr21 |
0 |
48129895 |
C/A |
152813 |
chr21 |
0 |
48129895 |
C/G |
133850 |
As before, we could also define the binsize as a specific number of base pairs, in which case the output of the query would give us the number of each type of SNP in each interval we define in the binsize. In the GOR query below, we are grouping the SNP types with a bin size of 1m base pairs.
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs | GROUP 1000000 -count -gc snptype
The next example shows how we could calculate the number of PNs with variants in chr21.
gor -p chr21 #wgsvars# | GROUP chrom -gc PN -count | GROUP chrom -count
The example also shows a good example of using the GROUP command twice in a query, first to count the number of variants per PN and then to count the number of PNs.
As we will see in the next section, we can then use another GROUP command to perform various statistical calculations in the query.
Calculating Max, Min, Avgs¶
We can use some options in the GROUP command to calculate the maximum, minimum, and average number of each type of SNP in each 1 megabase binsize as is shown in the following example. Note that you must specify the type (in this case -ic
for integer column) and name of the column (or multiple columns, comma delimted) that you wish to perform the calculations for.
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| CALC snptype reference+'/'+allele | HIDE rsIDs | GROUP 100000 -count -gc snptype
| GROUP chrom -max -min -avg -ic allCount -gc snptype
Chrom |
bpStart |
bpStop |
snptype |
min_allCount |
max_allCount |
avg_allCount |
---|---|---|---|---|---|---|
chr21 |
0 |
48129895 |
A/C |
37 |
936 |
327.885154 |
chr21 |
0 |
48129895 |
A/G |
66 |
2337 |
1120.08963 |
chr21 |
0 |
48129895 |
A/T |
24 |
1279 |
307.28011 |
chr21 |
0 |
48129895 |
C/A |
61 |
1422 |
428.04761 |
chr21 |
0 |
48129895 |
C/G |
39 |
1055 |
374.92997 |
… |
… |
… |
… |
… |
… |
… |
As you can see in the table above, the calculated columns are prefixed with the type of calculation that has been performed in each case.
Using GRANNO to Annotate¶
The GRANNO command allows you to group and annotate in a single pass. A full description of the GRANNO command can be found here.
To illustrate the use of the GRANNO command, we could take a look at the previous example. Let’s say that instead of grouping over the whole chromosome, we might want to find out how many different SNPs there are, on average, on each gene. The #dbsnp#
does not have gene data, so we will have to join to either the #genes#
or #exons#
table and group on the gene_symbol
column, as shown in the example below:
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| JOIN -snpseg #exons# | CALC snptype reference+'/'+allele
| HIDE rsIDs | GROUP chrom -gc gene_symbol -count
Note that in the example we show here, we are hiding certain columns to sculpt the output of the query to include only those columns relevant to the example.
Chrom |
bpStart |
bpStop |
gene_symbol |
allCount |
---|---|---|---|---|
chr21 |
0 |
48129895 |
7SK |
107 |
chr21 |
0 |
48129895 |
ABCC13 |
1654 |
chr21 |
0 |
48129895 |
ABCG1 |
5180 |
chr21 |
0 |
48129895 |
ADAMTS1 |
1955 |
chr21 |
0 |
48129895 |
ADARB1 |
1103 |
chr21 |
0 |
48129895 |
ADAMTS5 |
5693 |
… |
… |
… |
… |
… |
However, as you can see above, this only gives us the start and stop position for the chromosome itself and not the exon in question. We may want to have this data genomic-ordered and so we need to add in the chromstart
and chromend
columns from the #exons#
table.
gor -p chr21 #dbsnp# | WHERE len(reference)=1 AND len(allele)=1
| JOIN -snpseg #exons# | CALC snptype reference+'/'+allele
| HIDE rsIDs | GROUP chrom -gc gene_symbol -count
Chrom |
bpStart |
bpStop |
chromstart |
chromend |
gene_symbol |
allCount |
---|---|---|---|---|---|---|
chr21 |
0 |
48129895 |
10199942 |
10200025 |
CR381653.1 |
9 |
chr21 |
0 |
48129895 |
10380379 |
10380661 |
RN7SL52P |
13 |
chr21 |
0 |
48129895 |
10385952 |
10386047 |
SNORA70 |
9 |
chr21 |
0 |
48129895 |
10475514 |
10476061 |
bP-21201H5.1 |
57 |
chr21 |
0 |
48129895 |
10862621 |
10862667 |
IGHV1OR21-1 |
64 |
chr21 |
0 |
48129895 |
10862750 |
10863057 |
IGHV1OR21-1 |
243 |
chr21 |
0 |
48129895 |
10862750 |
10863067 |
IGHV1OR21-1 |
250 |
… |
… |
… |
… |
… |
… |
… |
As you can above, this gives us something closer to the genomic ordered stream we are looking for, but there are still multiple entries for individual genes (since they sometimes overlap with multiple exonic regions). To remedy this, we can use the GRANNO command, which can group and annotate in a single pass.
In the next example, we use the GRANNO command and we can define the binsize as a special value of gene
, which essentially sets the value to 3M base pairs in conjunction with the -range
option. This means that we will not look for any grouping of the gene_symbol column (indicated with the -gc
option) outside of a range of 3M base pairs. Next, we define the annotation columns that we wish to generate (with the -ic
option) and what types of annotations we wish to perform (in this case, -max
and -min
).
gor -p chr21 #dbsnp# | JOIN -snpseg #exons#
| WHERE len(reference)=1 AND len(allele)=1 | CALC snptype reference+'/'+allele
| GRANNO gene -range -gc gene_symbol -ic chromstart,chromend -max -min
| SELECT 1-4,chromstart,chromend,gene_symbol,snptype-
Chrom |
POS |
reference |
allele |
chromstart |
chromend |
gene_symbol |
snptype |
min_chromstart |
max_chromstart |
min_chromend |
max_chromend |
---|---|---|---|---|---|---|---|---|---|---|---|
chr21 |
9683195 |
G |
A |
9683190 |
9683272 |
CR381670.1 |
G/A |
9683190 |
9683190 |
9683272 |
9683272 |
chr21 |
9683199 |
C |
G |
9683190 |
9683272 |
CR381670.1 |
C/G |
9683190 |
9683190 |
9683272 |
9683272 |
chr21 |
9683201 |
G |
T |
9683190 |
9683272 |
CR381670.1 |
G/T |
9683190 |
9683190 |
9683272 |
9683272 |
We can then further group this result over the chromosome (with a GROUP command) on the columns min_chromstart
, max_chromend
, gene_symbol
, and snptype
, as shown in the example below:
gor -p chr21 #dbsnp# | JOIN -snpseg #exons#
| WHERE len(reference)=1 AND len(allele)=1 | CALC snptype reference+'/'+allele
| GRANNO gene -range -gc gene_symbol -ic chromstart,chromend -max -min
| GROUP chrom -gc min_chromstart,max_chromend,gene_symbol,snptype -count
| SELECT Chrom,min_chromstart- | SORT chrom
Chrom |
min_chromstart |
max_chromend |
gene_symbol |
snptype |
allCount |
---|---|---|---|---|---|
chr21 |
9683190 |
9683272 |
CR381670.1 |
A/C |
1 |
chr21 |
9683190 |
9683272 |
CR381670.1 |
C/G |
1 |
chr21 |
9683190 |
9683272 |
CR381670.1 |
C/T |
2 |
chr21 |
9683190 |
9683272 |
CR381670.1 |
G/A |
1 |
chr21 |
9683190 |
9683272 |
CR381670.1 |
G/T |
3 |
… |
… |
… |
… |
… |
… |
There is a lot going on here, so we can take it line by line. The first line is the join between the tables. The second line does the filtering on the left source and adds the calculated column. The third line runs the GRANNO command as described above, grouping and annotating in a single step to find the min_chromstart
and max_chromend
for the genes. Next we group over the whole chromosome to get the count of each type of SNP. Finally, we SELECT the columns we want to display.
You’ll note at the end of this query that we have added a SORT command to make sure that the genomic order is preserved in the GOR stream. As discussed in a previous section on genomic order, this is often necessary when columns are being moved around so much within a GOR query.