GTGEN¶
The GTGEN command is used to generate genotypes in a horizontal bucket format, e.g. with one or more rows per variant
(chrom,pos,ref,alt) and bucket
and values
column to store the genotypes as characters (0,1,2,3). The input is a
input stream of row-based variants with a genotype column (GT
), a tagbucketrelation and coverage relation.
The GT
column (0,1,2,3) is mapped to a bucket and the horizontal character seat in the Values
.
GT in rows with empty tag value (PN) are not put into the value column, but can be used to enforce that
a variant row is present for each bucket (see the -u option in CSVCC and CSVSEL).
A presence of a coverage segment (per tag PN
) indicates sufficient coverage, i.e. missing variant data (GT not 0,1, or 3)
is interpreted as homozygous reference.
Usage¶
gor ... | GTGEN tagbucketrelation coveragefile [ attributes ]
Options¶
|
Grouping columns other than (chrom,pos) |
|
Override the default PN colunm name used to represent the tag value. |
|
The maximum span size of the coverage segments. Defaults to 10000 (10kb). |
Examples¶
/* An example of creating horizontal variations stored in buckets */
create #PNbuckets# = nor -asdict #wesVars# | SELECT #2 | RENAME #1 PN | ROWNUM
| CALC bucket 'bucket'+str(div((rownum-1),500)+1) | SELECT PN,bucket | SORT -c PN;
create #allvars# = gor #wesvars# -ff [#PNbuckets#] -s PN | SELECT 1-8,PN | DISTINCT;
create #horvars# = gor [#allvars#]
| CALC GT IF( Depth = 9999 OR GL_Call >= 5 AND Depth >= 8 AND
(CallCopies = 2 AND CallRatio >= 0.66 OR CallCopies = 1 AND CallRatio >= 0.2
AND CallRatio <= 1.0-0.2), CallCopies, 3)
| gtgen -gc #3,#4 [#PNbuckets#] <(gor source/cov/goodcov_8.wgs.gord -s PN -ff [#PNbuckets#]);
/* Selecting a random subset of PNs*/
gor [#horvars#] | CSVSEL [#PNbuckets#] <(nor [#PNbuckets#] | SELECT PN | WHERE random() < 0.05) -gc #3,#4 -u 3
The above query example generates a tag-bucket relation and a GOR relation with all variants.
The GTGEN command generates horizontal genotypes from the sparse row-based genotypes, based on
the PN-bucket structure, the quality of the variants and the presence of coverage information. The
output has a bucket
column and values
column storing the genotypes of the PN as a single character (0,1,2, or 3)
according to the position within the bucket.
For comparison, the #horvar# query can also be generated in the following manner using left-join logic, although it is much less efficient.
create #horvars# = gor [#allvars#] | SELECT 1-4 | MULTIMAP -cartesian -h [#PNbuckets#]
| VARJOIN -r -l -e 0 -xl pn -xr pn <(gor #wesVars# -ff [#PNbuckets#]
| CALC qual IF( Depth = 9999 OR GL_Call >= 5 AND Depth >= 8 AND
(CallCopies = 2 AND CallRatio >= 0.66 OR CallCopies = 1 AND CallRatio >= 0.2
AND CallRatio <= 1.0-0.2),1,-1)
| SELECT 1-4,PN,callcopies,qual )
| JOIN -snpseg -l -r -maxseg 10000 -e 0 -xl pn -xr pn
<(gor source/cov/goodcov_8.wgs.gord -s PN -ff [#PNbuckets#] | CALC goodcov 1)
| REPLACE callcopies IF(qual=-1,'NA',IF(callcopies='0' AND goodcov=0,'NA',callcopies))
| SORT 1 -c #3,#4,PN | GROUP 1 -gc #3-#4,bucket -sc callcopies -lis -len 100000
| RENAME lis_CallCopies values | SELECT 1-4,values,bucket;
Then horizontal variants are then created by expanding each variants with PN and bucket and followed with a left-join into
the sparse variant table #wesVars#
and the coverage table. The variant stream is then sorted into a PN order, consistent with the tag-bucket relation, and then collapsed into a comma-separated list per variant. The final query reads a random subset of PNs from the #horvars#
relation.
/* Selecting 10 PNs from the first bucket and return as rows */
gor [#horvars#] | CSVSEL -tag PN [#PNbuckets#]
<(nor [#PNbuckets#] | WHERE bucket = 'bucket1' | SELECT #PN | top 10) -gc #3,#4 -u 4
/* Selecting 10 PNs from the first bucket, a slightly faster version */
gor [#horvars#] | WHERE bucket = 'bucket1' | CSVSEL -tag PN [#PNbuckets#]
<(nor [#PNbuckets#] | WHERE bucket = 'bucket1' | SELECT #PN | top 10) -gc #3,#4 -u 4
The last example show how data is read from a table with horizontal buckets and filtered, partially in the reading of the bucket partitions and also in the CSVSEL step.
gor source/hvar/transwgs.gord -nf -ff myPNs.tsv | CSVSEL source/hvar/PNbuckets.tsv myPNs.tsv -gc #3,#4 -u 4