QUEEN¶
The QUEEN command is used to calculate parameters that can be used in the KING algorithm (see Ani Manichaikul et.al. Bioinformatics, Vol. 26 no. 22 2010, pages 2867–2873).
The left-input stream is of the same type as for :ref:CSVSEL, i.e. it must contain the columns values
and bucket
.
The tag-bucket relation is used to define how the values for different tags are stored in the value column in each bucket.
The tag selections specify which tags are selected and the calculation is carried out for each tag (PN) in the first
tag list against each tag in the second tag list, for every variant in the left-input stream. The left-input stream
must include a column with the allele frequency, named AF. Such column is most easily added to the horizontal genotype
variant format with a :ref:VARJOIN.
The output is a row for each tag-pair, using columns named PN1 and PN2. Since the output is an aggregate over the genome is also includes dummy columns for Chrom and Pos with the values “ChrA” and 0, respectively. The important columns are however PN1, PN2, IBS0, XX, tpq, kpq, Nhet, Nhom, NAai, NAaj, count, pi0, phi, and theta. See the code examples and the KING paper for further details related to the :ref:KING command.
Usage¶
gor ... | QUEEN tagbucketrelation tagselection1 tagselection2 [-gc cols] [(-s sep | -vs charsize)]
Options¶
|
Grouping columns other than (chrom,pos). NOTE, one of the must include AF |
|
Specify separator for the elements in values. Default to comma. |
|
Use a fixed character size for values, e.g. rather than variable length separated with comma. |
|
Threshold value to filter output on pi0, e.g. to pass we must have pi0 < value |
|
Threshold value to filter output on phi, e.g. to pass we must have phi > value |
|
Threshold value to filter output on theta, e.g. to pass we must have theta > value |
Examples¶
The following query shows an example where the the QUEEN command is used to estimate which sample pairs
to evaluate in a KING2 analysis. The QUEEN command uses two PN lists that are generated by grouping the
pn-pairs according to the bucket groups of the genotype matrix into an upper-triagonal matrix.
Only the genotype buckets necessary for evaluating the pairs are read from disk in each job and following the analysis,
the output-pairs are filtered to avoid storing symmetric pairs.
def #buckets# = freezes/ukbb_500k/imputed_comm_vep95/buckets.tsv;
def #hgt# = freezes/ukbb_500k/imputed_comm_vep95/variants.gord;
def #af# = freezes/ukbb_500k/imputed_comm_vep95/metadata/AF.gorz;
create #useaf# = gor #af# /* Only use rare variants for Queen */
| where len(ref)=len(alt) and af < 0.0025
| select 1-4,af;
create #buckgroups# = nor #buckets#
| select bucket
| distinct
| sort -c bucket:n
| rownum
| calc gr div(rownum,4); /* larger groups mean more memory usage but less repeated IO access of genotypes */
create #parts# = nor [#buckgroups#]
| select gr
| distinct
| multimap -cartesian <(nor [#buckgroups#] | select gr | distinct)
| where gr <= grx
| rownum
| calc p 1+mod(rownum,2) /* Allow us to avid limit on the number of maximum job per query, e.g. 1000 */
| hide rownum;
create #sharepairs1# = parallel -parts <(nor [#parts#] | where p = 1)
<(nor <(gor #hgt# -nf -ff <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:gr}' or gr = '#{col:grx}'))
| where chrom < 'chrX' | varjoin -i <(gor [#useaf#])
| QUEEN #buckets# <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:gr}') | select pn)
<(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:grx}') | select pn) -vs 1 -minSharing 0.02
| multimap -c pn1 #buckets# | multimap -c pn2 #buckets# | where bucket < bucketx or bucket = bucketx and pn1 < pn2 | hide bucket,bucketx )
| select pn1- );
create #sharepairs2# = parallel -parts <(nor [#parts#])
<(nor <(gor #hgt# -nf -ff <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:gr}' or gr = '#{col:grx}'))
| where chrom < 'chrX' | varjoin -i <(gor [#useaf#])
| QUEEN #buckets# <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:gr}') | select pn)
<(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:grx}') | select pn) -vs 1 -minSharing 0.02
| multimap -c pn1 #buckets# | multimap -c pn2 #buckets# | where bucket < bucketx or bucket = bucketx and pn1 < pn2 | hide bucket,bucketx )
| select pn1- );
create #usedvars# = pgor #af#
| where len(ref)=len(alt)
| select 1-4,af
| where abs(af-0.5)<0.4
| where random()<1e-2
| granno 10000 -min -ic pos
| where min_pos = pos
| hide min_pos;
create #af# = pgor [#usedvars#]
| select 1-4,af
| where len(#3)=len(#4)
| select 1-4,af
| calc tpq 2*af*af*(1.0-af)*(1.0-af)
| calc kpq 2.0*af*(1-af);
create #varsharing# = parallel -parts <(nor [#parts#] )
<(nor <(gor #hgt# -nf -ff <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:gr}' or gr = '#{col:grx}'))
| where len(ref)=len(alt) | until chrom = 'chrX' | varjoin -r [#af#]
| KING2 -gc af #buckets# <(nor [#sharepairs1#] | merge [#sharepairs2#]
/* unfortunately we don't have partitioned access to this data, hence the filtering with INSET */
| inset -c pn1 <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:gr}' ))
| inset -c pn2 <(nor #buckets# | inset -c bucket <(nor [#buckgroups#] | where gr = '#{col:grx}' )) | select pn1,pn2 )
-vs 1 -pi0thr 0.76 -phithr 0.039 )
| group -gc pn1,pn2 -sum -fc IBS0-
| calc pi0 sum_IBS0/float(sum_tpq)
| calc phi 0.5-float(sum_XX)/float(4.0*sum_kpq)
| calc theta (sum_Nhet-2.0*sum_Nhom)/(sum_NAai+sum_NAaj)
);
def expo($1,$2,$3) = exp(-sqr( $1 - $2 )/float(2* $3 * $3 )) / $3;
nor [#varsharing#]
| calc monozygotic if(phi > pow(2.0,-1.5) and phi < 0.1,1,0)
| calc parent_offspring if(phi > pow(2.0,-2.5) and phi < pow(2.0,-1.5) and pi0 < 0.1,1,0)
| calc full_sib if(phi > pow(2.0,-2.5) and phi < pow(2.0,-1.5) and pi0 > 0.1 and pi0 < 0.365,1,0)
| calc second_degree if(phi > pow(2.0,-3.5) and phi < pow(2.0,-2.5) and pi0 > 0.365 and pi0 < 1.0-pow(2,-1.5),1,0)
| calc third_degree if(phi > pow(2.0,-4.5) and phi < pow(2.0,-3.5) and pi0 > 1.0-pow(2,-1.5) and pi0 < 1.0-pow(2,-2.5),1,0)
| calc temp if(monozygotic=1,'Monozygotic twin,','')+if(parent_offspring=1,'Parent-offspring,','')+if(full_sib=1,'Full sib,','')+if(second_degree=1,'2nd Degree,','')+if(third_degree=1,'3rd Degree,','')
| calc relationship listfilter(temp,'len(x)>1')
| calc monozygotic_dist ln( expo(pi0,0.0,0.000001) * expo(phi,0.498,0.0013) )
| calc parent_offspring_dist ln( expo(pi0,0.00297,0.0025) * expo(phi,0.248,0.01) )
| calc full_sib_dist ln( expo(pi0,0.248,0.0406) * expo(phi,0.248,0.0187) )
| calc second_degree_dist ln( expo(pi0,0.4954,0.0608) * expo(phi,0.127,0.027) )
| calc third_degree_dist ln( expo(pi0,0.745,0.0428) * expo(phi,0.059,0.02) )
| calc rel 'Monozygotic twin,Parent-offspring,Full sib,2nd Degree,3rd Degree'
| calc relp str(monozygotic_dist)+','+str(parent_offspring_dist)+','+str(full_sib_dist)+','+str(second_degree_dist)+','+str(third_degree_dist)
| calc relmax listnummax(relp)
| calc GaussRel listzipfilter(rel,relp,'abs(float(x)-relmax)<0.001')
| select pn1,pn2,relationship,GaussRel,pi0,phi,theta