Used in: gor/nor

GROUP¶

The GROUP command allows you to group the output of your query according to a number of different criteria. The GROUP command is very similar to the “group by” function in SQL.

This command groups a set of rows into a set of summary row by values of columns. It returns one row for each group. We often use the GROUP command with aggregate functions such as -count, -cdist, -min, -med, -max, -dis, -set, -lis, -avg, -std, and -sum.

The aggregate functions allow us to perform a calculation of a set of rows while grouping.

Using GROUP in a nor query will give an error if binsize input parameter is used.

Usage¶

gor ... | GROUP binsize [ attributes ]

nor ... | GROUP [ attributes ]

Options¶

`-count`	Return the count for each bin.
`-cdist`	Return the number of distinct rows for each bin.
`-gc cols`	Grouping columns (other than bin).
`-sc cols`	String columns (-ac has been deprecated).
`-ic cols`	Integer columns.
`-fc cols`	Floating valued columns.
`-min`	Calculate the min for any type of column.
`-med`	Calculate the median for any type of column.
`-max`	Calculate the max for any type of column.
`-dis`	Calculate the number of distinct values for any type of column.
`-set`	Return a comma separated set with the distinct values in the column.
`-lis`	Return a comma separated list with the values in the column.
`-len number`	Specify the maximum column length of a set and a list. Defaults to 10000 chars.
`-notruncate`	Throw an exception rather than truncating the list or set if it exceeds the maximum length.
`-avg`	Calculate the avg of all numeric columns.
`-std`	Calculate the std of all numeric columns.
`-sum`	Calculate the sum of all numeric columns.
`-steps number`	The number of sliding steps per group window.
`-s 'sep'`	The separator for elements in lists and sets.
`-ordered`	Assume the grouping columns are ordered.

Attributes -ic and -fc explicitly define columns of either integer type or floating point type, while the -sc attribute defines it as a string type. Any column defined with the -gc option and is a part of the following aggregation options -sc, -ic or -fc will result in an error

`-sc`	Works with -max, -min and -med.
`-ic`	Works with -max, -min, -med, -avg, -std and -sum.
`-fc`	Works with -max, -min, -med, -avg, -std and -sum.

Ensure that there are no spaces between the numbers. Use SELECT to pick a subset of columns from the output.

Use binsize = chrom to aggregate for a whole chromosome and binsize = genome to aggregate for the entire genome.

When using GROUP in a NOR context, the ordered flag can both speed up the operation and reduce the memory usage significantly. Note that there are no checks to see if the order is correct - only use this option if the input stream is correctly ordered.

Examples¶

Group columns a and b on each chromosome and counting the number of groups.

gor ... | group chrom -gc a,b -count

Group column a for the whole genome and calculate the sum and average for columns b and c (which are defined as floating point columns).

gor ... | group genome -gc a -fc b,c -ave -sum

Group #dnsnp# on reference for the whole genome and return it as a set with -set attribute. This query should return a set of A,C,G,T (in undefined order)

gor #dbsnp# | where len(reference) = 1 | group genome -sc reference -set

In some cases, we may wish to know which individual rows have been grouped together. In that case, the -lis argument comes in handy:

gor source/var/wgs_varcalls.gord -s PN
    | group 1 -gc reference,call -lis -sc PN
    | top 100

The query above will output a list on each variant showing which PN carried that variant.