GROUP¶
The GROUP command allows you to group the output of your query according to a number of different criteria. The GROUP command is very similar to the “group by” function in SQL.
This command groups a set of rows into a set of summary row by values of columns. It returns one row for each group. We often use the GROUP command with aggregate functions such as -count
, -cdist
, -min
, -med
, -max
, -dis
, -set
, -lis
, -avg
, -std
, and -sum
.
The aggregate functions allow us to perform a calculation of a set of rows while grouping.
Using GROUP in a nor query will give an error if binsize
input parameter is used.
Options¶
|
Return the count for each bin. |
|
Return the number of distinct rows for each bin. |
|
Grouping columns (other than bin). |
|
String columns (-ac has been deprecated). |
|
Integer columns. |
|
Floating valued columns. |
|
Calculate the min for any type of column. |
|
Calculate the median for any type of column. |
|
Calculate the max for any type of column. |
|
Calculate the number of distinct values for any type of column. |
|
Return a comma separated set with the distinct values in the column. |
|
Return a comma separated list with the values in the column. |
|
Specify the maximum column length of a set and a list. Defaults to 10000 chars. |
|
Throw an exception rather than truncating the list or set if it exceeds the maximum length. |
|
Calculate the avg of all numeric columns. |
|
Calculate the std of all numeric columns. |
|
Calculate the sum of all numeric columns. |
|
The number of sliding steps per group window. |
|
The separator for elements in lists and sets. |
|
Assume the grouping columns are ordered. |
Attributes -ic
and -fc
explicitly define columns of either integer type or floating point type, while the -sc
attribute defines it as a string type. Any column defined with the -gc
option and is a part of the following aggregation
options -sc
, -ic
or -fc
will result in an error
|
Works with -max, -min and -med. |
|
Works with -max, -min, -med, -avg, -std and -sum. |
|
Works with -max, -min, -med, -avg, -std and -sum. |
Ensure that there are no spaces between the numbers. Use SELECT to pick a subset of columns from the output.
Use binsize = chrom
to aggregate for a whole chromosome and binsize = genome
to aggregate for the entire genome.
When using GROUP in a NOR context, the ordered flag can both speed up the operation and reduce the memory usage significantly. Note that there are no checks to see if the order is correct - only use this option if the input stream is correctly ordered.
Examples¶
Group columns a and b on each chromosome and counting the number of groups.
gor ... | group chrom -gc a,b -count
Group column a for the whole genome and calculate the sum and average for columns b and c (which are defined as floating point columns).
gor ... | group genome -gc a -fc b,c -ave -sum
Group #dnsnp#
on reference for the whole genome and return it as a set with -set attribute. This query should return a set
of A,C,G,T
(in undefined order)
gor #dbsnp# | where len(reference) = 1 | group genome -sc reference -set
In some cases, we may wish to know which individual rows have been grouped together. In that case, the -lis
argument comes in handy:
gor source/var/wgs_varcalls.gord -s PN
| group 1 -gc reference,call -lis -sc PN
| top 100
The query above will output a list on each variant showing which PN carried that variant.