Incremental AF calculation¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "clin-hg19"
%env GOR_API_PROJECT={project}
Incremental AF and GF calculation¶
Allele and genotype frequencies of variants are of key importance in clinical diagnostic interpretation of variants. In this notebook we will show how this information can be calculated easily and efficiently from a freeze structure, using the CSVCC command as the main workhorse, i.e. to collapse the genotypes stored in the VALUES columns into aggregate counts.
- First we will go through the full parallel calculation, based on a query similar to the report builder template freeze_af.ftl.yml.
- Next we show how the same calculation can be done incrementally.
- Finally, we show how the calculation can be done for multiple sub-populations simultaneously and comment on performance.
We will populate the GOR query using the data freeze we wrote out in Incremental_Freeze.ipynb.
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v2.tsv;
def #hgt# = ##freeze##/freeze_v2.gord;
def ##input_vars## = ##freeze##/allvars_2.gorz;
create #pns# = nor #buckets# | select pn;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female' /* skip sex calculation and fix */ ;
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,gt cc -e 0 -v male,female
| pivot -gc ref,alt gt -v 0,1,2,3 -e 0
| prefix alt[+1]- x);
create #af# = pgor [#partaf#]
| group 1 -gc ref,alt -sum -ic x_*
| rename sum_(.*) #{1}
| calc x_AN_auto #diploid# * (x_0_female_gtcount + x_1_female_gtcount + x_2_female_gtcount + x_0_male_gtcount + x_1_male_gtcount + x_2_male_gtcount)
| calc x_AN_chrX #diploid# * (x_0_female_gtcount + x_1_female_gtcount + x_2_female_gtcount) + x_0_male_gtcount + x_1_male_gtcount + x_2_male_gtcount
| calc x_AN_chrYM x_0_female_gtcount + x_1_female_gtcount + x_2_female_gtcount + x_0_male_gtcount + x_1_male_gtcount + x_2_male_gtcount
| calc x_AC_auto x_1_female_gtcount + x_1_male_gtcount + #diploid# * (x_2_female_gtcount + x_2_male_gtcount)
| calc x_AC_chrX x_1_female_gtcount + x_1_male_gtcount + #diploid# * (x_2_female_gtcount) + x_2_male_gtcount
| calc x_AC_chrYM x_1_female_gtcount + x_1_male_gtcount + x_2_female_gtcount + x_2_male_gtcount
| calc AN if(chrom='chrX', x_AN_chrX, if(chrom='chrY' or chrom='chrM', x_AN_chrYM, x_AN_auto))
| calc AC if(chrom='chrX', x_AC_chrX, if(chrom='chrY' or chrom='chrM', x_AC_chrYM, x_AC_auto))
| calc x_0_gtcount x_0_female_gtcount + x_0_male_gtcount
| calc x_1_gtcount x_1_female_gtcount + x_1_male_gtcount
| calc x_2_gtcount x_2_female_gtcount + x_2_male_gtcount
| calc x_3_gtcount x_3_female_gtcount + x_3_male_gtcount
| calc AF form(AC/float(AN),6,7)
| calc conservative_AF form(max(0.0,float(float(AC) - 2*sqrt(float(AC) + 1)+2)/float(AN)),6,7)
| calc pnCount x_0_gtcount+x_1_gtcount+x_2_gtcount+x_3_gtcount
| calc marker_yield form((float(x_0_gtcount)+float(x_1_gtcount)+float(x_2_gtcount))/float(pncount),6,7)
| calc HWE_homFreq form(x_2_gtcount/float(PNcount),6,7) /* HWE assumes diploid, so we ignore sex */
| calc hetFreq form(x_1_gtcount/float(pncount),6,7)
| calc HWE_alleleFreq form((2*x_2_gtcount+x_1_gtcount)/float(2*pncount),6,7)
| calc pA = (2*x_2_gtcount+x_1_gtcount)/float(2*PNcount)
| calc pB = 1.0 - pA
| calc mAA = pA * pA * PNcount
| calc mAB = 2 * pA * pB * PNcount
| calc mBB = pB * pB * PNcount
| calc HWE_chisq = float(form(sqr(max(abs(x_2_gtcount-mAA)-0.5,0.0))/mAA
+ sqr(max(abs(x_1_gtcount-mAB)-0.5,0.0))/mAB
+ sqr(max(abs(PNcount-x_1_gtcount-x_2_gtcount-mBB)-0.5,0.0))/mBB ,5,5))
| calc HWE_pVal = chisquarecompl(1,HWE_chisq)
| calc HWE_logPval = float(form(-log(HWE_pVal),3,2))
| calc varCount x_1_gtcount + x_2_gtcount
| calc hetCount x_1_gtcount
| calc homCount x_2_gtcount
| hide pA-mBB
| hide x_*;
""")
In the query above, we have skipped the gender analysis that is in freeze_af.ftl.yml and simply fixed the phenotype to "female". Because our freeze was only written out for chr1, we calculate #segs# using the -p chr1 option instead of PGOR.
The main query, #partaf#, has been written using PARALLEL instead of PARTGOR, because we use buckets rather than PNs to define the partitions. Therefore, we use -f #{col:bucket} when reading the variants and INSET to filter the PN-list that we process in each CSVCC execution.
In #af#, we aggregate the partial counts using the GROUP command. The CALC steps that follow therefore apply only once per variant, not once per genotype.
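The sex-aware AN/AC logic in the #af# CALC steps can be sketched in plain Python. This is a simplified model for illustration, not part of the notebook's pipeline; the genotype counts used in the usage example are the chr1:10119 counts from this freeze.

```python
def allele_counts(chrom, gt):
    """AN and AC per the #af# CALC logic.

    gt maps '<gtvalue>_<sex>' -> genotype count, where genotype
    values are 0=hom-ref, 1=het, 2=hom-alt, 3=unknown.
    """
    f0, f1, f2 = gt["0_female"], gt["1_female"], gt["2_female"]
    m0, m1, m2 = gt["0_male"], gt["1_male"], gt["2_male"]
    if chrom == "chrX":  # females diploid, males haploid
        an = 2 * (f0 + f1 + f2) + m0 + m1 + m2
        ac = f1 + m1 + 2 * f2 + m2
    elif chrom in ("chrY", "chrM"):  # treated as haploid for everyone
        an = f0 + f1 + f2 + m0 + m1 + m2
        ac = f1 + m1 + f2 + m2
    else:  # autosomes: everyone diploid
        an = 2 * (f0 + f1 + f2 + m0 + m1 + m2)
        ac = f1 + m1 + 2 * (f2 + m2)
    return an, ac

# chr1:10119 in this freeze: 344 hom-ref and 2 het females, no males
counts = {"0_female": 344, "1_female": 2, "2_female": 0,
          "0_male": 0, "1_male": 0, "2_male": 0}
an, ac = allele_counts("chr1", counts)  # AN=692, AC=2, AF=2/692
```

Note that genotype value 3 (unknown) contributes to neither AN nor AC, which is why AN varies per variant while pnCount stays fixed.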
We can inspect #partaf#:
%%gor
$mydefs
nor [#partaf#] | top 1 | unpivot #1-
Query ran in 0.68 sec Query fetched 12 rows in 0.01 sec (total time 0.69 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 10048 |
2 | Ref | CT |
3 | Alt | C |
4 | x_0_male_GTcount | 0 |
5 | x_0_female_GTcount | 2 |
6 | x_1_male_GTcount | 0 |
7 | x_1_female_GTcount | 0 |
8 | x_2_male_GTcount | 0 |
9 | x_2_female_GTcount | 0 |
10 | x_3_male_GTcount | 0 |
11 | x_3_female_GTcount | 35 |
%%gor
$mydefs
gor [#af#] | top 10
Query ran in 0.36 sec Query fetched 10 rows in 0.00 sec (total time 0.37 sec)
CHROM | POS | Ref | Alt | AN | AC | AF | conservative_AF | pnCount | marker_yield | HWE_homFreq | hetFreq | HWE_alleleFreq | HWE_chisq | HWE_pVal | HWE_logPval | varCount | hetCount | homCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | 610 | 1 | 0.001639 | 0.000281 | 787 | 0.387548 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
1 | chr1 | 10057 | A | AC | 644 | 1 | 0.001553 | 0.000266 | 787 | 0.409149 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
2 | chr1 | 10119 | CT | C | 692 | 2 | 0.002890 | 0.000774 | 787 | 0.439644 | 0.000000 | 0.002541 | 0.001271 | 0.00000 | 1.000000e+00 | -0.00 | 2 | 2 | 0 |
3 | chr1 | 10143 | CTAACCCCT | C | 648 | 1 | 0.001543 | 0.000265 | 787 | 0.411690 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
4 | chr1 | 10146 | AC | A | 618 | 196 | 0.317152 | 0.274965 | 787 | 0.392630 | 0.011436 | 0.226175 | 0.124523 | 0.81427 | 3.668608e-01 | 0.44 | 187 | 178 | 9 |
5 | chr1 | 10149 | CCT | C | 614 | 1 | 0.001629 | 0.000279 | 787 | 0.390089 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
6 | chr1 | 10154 | CCCTAA | C | 592 | 3 | 0.005068 | 0.001689 | 787 | 0.376112 | 0.001271 | 0.001271 | 0.001906 | 87.19333 | 9.841514e-21 | 20.01 | 2 | 1 | 1 |
7 | chr1 | 10155 | CCTAACCCTAACCCTAACCCTAA | C | 594 | 1 | 0.001684 | 0.000289 | 787 | 0.377383 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
8 | chr1 | 10167 | CCTAACCCTAA | C | 580 | 4 | 0.006897 | 0.002634 | 787 | 0.368488 | 0.001271 | 0.002541 | 0.002541 | 48.74921 | 2.908757e-12 | 11.54 | 3 | 2 | 1 |
9 | chr1 | 10172 | C | A | 576 | 1 | 0.001736 | 0.000298 | 787 | 0.365947 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
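The HWE statistics in the table above can be reproduced with a small Python sketch of the same continuity-corrected chi-square test (for one degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2))). This mirrors the CALC steps in #af# and is for illustration only.

```python
import math

def hwe_stats(het, hom_alt, pncount):
    """Chi-square HWE test with continuity correction,
    mirroring the calc steps in #af# (diploid assumption)."""
    p_a = (2 * hom_alt + het) / (2 * pncount)
    p_b = 1.0 - p_a
    m_aa = p_a * p_a * pncount        # expected hom-alt
    m_ab = 2 * p_a * p_b * pncount    # expected het
    m_bb = p_b * p_b * pncount        # expected hom-ref
    obs_bb = pncount - het - hom_alt
    chisq = (max(abs(hom_alt - m_aa) - 0.5, 0.0) ** 2 / m_aa
             + max(abs(het - m_ab) - 0.5, 0.0) ** 2 / m_ab
             + max(abs(obs_bb - m_bb) - 0.5, 0.0) ** 2 / m_bb)
    p_val = math.erfc(math.sqrt(chisq / 2))  # chi-square tail, 1 df
    return chisq, p_val

# chr1:10154 above: 1 het, 1 hom-alt out of 787 samples
chisq, p = hwe_stats(1, 1, 787)  # chisq ≈ 87.19, as in the table
```

A lone hom-alt genotype with almost no hets is a strong HWE violation, hence the large chi-square and the HWE_logPval of about 20.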
We see that the variant at chr1:10119 has an AC of 2. We can quickly verify that:
%%gor
gor -p chr1:10119 source/var/wgs_varcalls.gord -s PN
Query ran in 0.61 sec Query fetched 2 rows in 0.00 sec (total time 0.61 sec)
CHROM | POS | Reference | Call | CallCopies | CallRatio | Depth | GL_Call | FILTER | FS | formatZip | PN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10119 | CT | C | 1 | 0.226 | 31 | 18 | PASS | 0.0 | NaN | GDX_2866201 |
1 | chr1 | 10119 | CT | C | 1 | 0.242 | 33 | 17 | PASS | 0.0 | NaN | GDX_2866252 |
%%gor
gor -p chr1:10119 user_data/IncrFreeze/freeze_v2.gord
| csvsel -gc ref,alt -vs 1 -tag PN -hide '0,3'
user_data/IncrFreeze/PNbuckets_v2.tsv <(nor user_data/IncrFreeze/PNbuckets_v2.tsv | select pn)
Query ran in 0.10 sec Query fetched 2 rows in 0.01 sec (total time 0.10 sec)
CHROM | POS | Ref | Alt | PN | value | |
---|---|---|---|---|---|---|
0 | chr1 | 10119 | CT | C | GDX_2866201 | 1 |
1 | chr1 | 10119 | CT | C | GDX_2866252 | 1 |
Incremental AF¶
We want to show that we can generate the relation #partaf# incrementally. If we can do that, the rest of the calculation to write out #af# (or a subset of its columns) is easy.
qh2 = GQH.GOR_Query_Helper()
mydefs2 = ""
We set the freeze to version 1 and use allvars_1 and PNbuckets_v1.
mydefs2 = qh2.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v1.tsv;
def #hgt# = ##freeze##/freeze_v1.gord;
def ##input_vars## = ##freeze##/allvars_1.gorz;
create #pns# = nor #buckets# | select pn /* this can be changed to any subset */;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female' /* skip sex calculation and fix */ ;
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf1# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,gt cc -e 0 -v male,female
| pivot -gc ref,alt gt -v 0,1,2,3 -e 0
| prefix alt[+1]- x);
""")
%%gor
$mydefs2
gor [#partaf1#] | top 1
Query ran in 2.06 sec Query fetched 1 rows in 12.63 sec (total time 14.69 sec)
CHROM | POS | Ref | Alt | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10057 | A | AC | 0 | 86 | 0 | 0 | 0 | 0 | 0 | 164 |
We write it out into our folder:
%%gor
$mydefs2
gor [#partaf1#] | write user_data/IncrFreeze/AF/part1.gorz
Query ran in 0.14 sec Query fetched 0 rows in 6.61 sec (total time 6.75 sec)
CHROM | POS |
---|
Incremental update¶
For this we will actually access the internal delta buckets, i.e. we want to eliminate the buckets used to calculate AF/part1.gorz. Let's look at the internals:
%%gor
nor -asdict user_data/IncrFreeze/freeze_v2.gord
Query ran in 0.16 sec Query fetched 6 rows in 0.00 sec (total time 0.16 sec)
file | part | chrstart | posstart | chrstop | posstop | tags | |
---|---|---|---|---|---|---|---|
0 | hgt_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
1 | hgt_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
2 | hgt_2_3.gorz | b_3 | chr1 | 0 | chrZ | 1000000000 | b_3 |
3 | hgt_2_4.gorz | b_4 | chr1 | 0 | chrZ | 1000000000 | b_4 |
4 | hgt_delta_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
5 | hgt_delta_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
We now write a new AF update GORD dictionary file:
%%gor
nor -asdict user_data/IncrFreeze/freeze_v2.gord | where not(left(file,5)='hgt_1') | write user_data/IncrFreeze/freeze_afdelt.tsv
Query ran in 0.08 sec Query fetched 0 rows in 0.02 sec (total time 0.09 sec)
ChromNOR | PosNOR |
---|
import os
proj_path = '/home/jovyan/projects/clin-hg19/'
os.rename(proj_path+'user_data/IncrFreeze/freeze_afdelt.tsv',proj_path+'user_data/IncrFreeze/freeze_afdelt.gord')
%%gor
nor -asdict user_data/IncrFreeze/freeze_afdelt.gord
Query ran in 0.08 sec Query fetched 4 rows in 0.00 sec (total time 0.08 sec)
file | part | chrstart | posstart | chrstop | posstop | tags | |
---|---|---|---|---|---|---|---|
0 | hgt_2_3.gorz | b_3 | chr1 | 0 | chrZ | 1000000000 | b_3 |
1 | hgt_2_4.gorz | b_4 | chr1 | 0 | chrZ | 1000000000 | b_4 |
2 | hgt_delta_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
3 | hgt_delta_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
For reference, version 1 looks like this:
%%gor
nor -asdict user_data/IncrFreeze/freeze_v1.gord
Query ran in 0.07 sec Query fetched 2 rows in 0.00 sec (total time 0.07 sec)
file | part | chrstart | posstart | chrstop | posstop | tags | |
---|---|---|---|---|---|---|---|
0 | hgt_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
1 | hgt_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
Now we use this delta dictionary to calculate #partaf2#, together with PNbuckets_v2.tsv and allvars_2.gorz.
qh3 = GQH.GOR_Query_Helper()
mydefs3 = ""
mydefs3 = qh3.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v2.tsv;
def #hgt# = ##freeze##/freeze_afdelt.gord;
def ##input_vars## = ##freeze##/allvars_2.gorz;
create #pns# = nor #buckets# | select pn /* this can be changed to any subset */;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female' /* skip sex calculation and fix */ ;
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf2# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,gt cc -e 0 -v male,female
| pivot -gc ref,alt gt -v 0,1,2,3 -e 0
| prefix alt[+1]- x);
""")
%%gor
$mydefs3
gor [#partaf2#] | write user_data/IncrFreeze/AF/part2.gorz
Query ran in 0.35 sec Query fetched 0 rows in 9.46 sec (total time 9.81 sec)
CHROM | POS |
---|
Our combined equivalent of #partaf# can now be generated as:
%%gor
gor user_data/IncrFreeze/AF/part1.gorz user_data/IncrFreeze/AF/part2.gorz
| group 1 -gc ref,alt -sum -ic x_*
| rename sum_(.*) #{1}
| top 10
Query ran in 0.08 sec Query fetched 10 rows in 0.00 sec (total time 0.09 sec)
CHROM | POS | Ref | Alt | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | 0 | 304 | 0 | 1 | 0 | 0 | 0 | 482 |
1 | chr1 | 10057 | A | AC | 0 | 321 | 0 | 1 | 0 | 0 | 0 | 465 |
2 | chr1 | 10119 | CT | C | 0 | 344 | 0 | 2 | 0 | 0 | 0 | 441 |
3 | chr1 | 10143 | CTAACCCCT | C | 0 | 323 | 0 | 1 | 0 | 0 | 0 | 463 |
4 | chr1 | 10146 | AC | A | 0 | 122 | 0 | 178 | 0 | 9 | 0 | 478 |
5 | chr1 | 10149 | CCT | C | 0 | 306 | 0 | 1 | 0 | 0 | 0 | 480 |
6 | chr1 | 10154 | CCCTAA | C | 0 | 294 | 0 | 1 | 0 | 1 | 0 | 491 |
7 | chr1 | 10155 | CCTAACCCTAACCCTAACCCTAA | C | 0 | 296 | 0 | 1 | 0 | 0 | 0 | 490 |
8 | chr1 | 10167 | CCTAACCCTAA | C | 0 | 287 | 0 | 2 | 0 | 1 | 0 | 497 |
9 | chr1 | 10172 | C | A | 0 | 287 | 0 | 1 | 0 | 0 | 0 | 499 |
We can indeed compare it with our original calculation and see that it gives the same aggregate results for every variant:
%%gor
$mydefs
gor -p chr1 <(gor user_data/IncrFreeze/AF/part1.gorz user_data/IncrFreeze/AF/part2.gorz
| group 1 -gc ref,alt -sum -ic x_*
| rename sum_(.*) #{1}
| merge -s <(gor [#partaf#] | group 1 -gc ref,alt -sum -ic x_* | rename sum_(.*) #{1} )
)
| group 1 -gc ref-source[-1] -dis -sc source
| throwif dis_source != 2
| group chrom -count
Query ran in 0.66 sec Query fetched 1 rows in 48.74 sec (total time 49.40 sec)
CHROM | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chr1 | 0 | 249250621 | 3633503 |
For each incremental step, we must aggregate the files part1.gorz and part2.gorz together as shown above before merging them with the aggregation from the next iteration. We then store this file together with the final AF calculation to enable the next incremental update. The cost of keeping this file is proportional to the number of variants (which grows roughly as the square root of the number of samples). Clearly, this type of analysis can be extended to incremental subtraction of old samples as well; however, given how rare that use-case is, we will not discuss it here. A full recalculation is not that expensive!
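The per-variant merge step can be sketched with pandas. The split of counts between the two part files below is hypothetical (the real bucket split is not shown here); only their sum matches the chr1:10119 totals from the freeze.

```python
import pandas as pd

key = ["chrom", "pos", "ref", "alt"]
cols = key + ["x_0_female_GTcount", "x_1_female_GTcount",
              "x_2_female_GTcount", "x_3_female_GTcount"]

# Hypothetical split of the chr1:10119 counts between the two part files
part1 = pd.DataFrame([["chr1", 10119, "CT", "C", 150, 1, 0, 200]], columns=cols)
part2 = pd.DataFrame([["chr1", 10119, "CT", "C", 194, 1, 0, 241]], columns=cols)

# Equivalent of: gor part1.gorz part2.gorz | group 1 -gc ref,alt -sum -ic x_*
combined = pd.concat([part1, part2]).groupby(key, as_index=False).sum()
```

Because the partial counts are plain integer sums, merging any number of parts is associative: each incremental step only needs the previous running total and the new part.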
Processing multiple sample groups simultaneously¶
We can extend the calculation above and aggregate the genotypes for multiple groups in a single sweep as we read them. To show this, we will modify the #pns# relation and generate three groups with some random sampling.
qh4 = GQH.GOR_Query_Helper()
mydefs4 = ""
mydefs4 = qh4.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v2.tsv;
def #hgt# = ##freeze##/freeze_v2.gord;
def ##input_vars## = ##freeze##/allvars_2.gorz;
create #pns# = nor #buckets# | select pn | calc cohort 'A,B,C' | split cohort | where random() < 0.5;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female';
""")
We can inspect our sample relation as:
%%gor
$mydefs4
nor [#gen_gender#] | group -gc cohort -count
Query ran in 0.38 sec Query fetched 3 rows in 0.01 sec (total time 0.39 sec)
cohort | allCount | |
---|---|---|
0 | A | 386 |
1 | B | 388 |
2 | C | 397 |
We can now try out CSVCC with this new ternary relation #gen_gender#:
%%gor
$mydefs4
gor #hgt# | distloc 1
| csvcc -gc ref,alt -vs 1 -u 3 #buckets# [#gen_gender#]
Query ran in 0.09 sec Query fetched 12 rows in 0.02 sec (total time 0.11 sec)
CHROM | POS | Ref | Alt | Pheno | CC | GT | GTcount | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | A | female | 0 | 159 |
1 | chr1 | 10048 | CT | C | A | female | 1 | 1 |
2 | chr1 | 10048 | CT | C | A | female | 2 | 0 |
3 | chr1 | 10048 | CT | C | A | female | 3 | 226 |
4 | chr1 | 10048 | CT | C | C | female | 0 | 137 |
5 | chr1 | 10048 | CT | C | C | female | 1 | 0 |
6 | chr1 | 10048 | CT | C | C | female | 2 | 0 |
7 | chr1 | 10048 | CT | C | C | female | 3 | 260 |
8 | chr1 | 10048 | CT | C | B | female | 0 | 144 |
9 | chr1 | 10048 | CT | C | B | female | 1 | 0 |
10 | chr1 | 10048 | CT | C | B | female | 2 | 0 |
11 | chr1 | 10048 | CT | C | B | female | 3 | 244 |
We see that the cohort now appears in the Pheno column, while our pheno attribute shows up in CC as before. This is somewhat confusing, but it follows from how CSVCC handles binary versus ternary relations. It also means that our gender shows up in the same output column as before. Our new PIVOT logic is only slightly modified from above:
%%gor
$mydefs4
gor #hgt# | distloc 1
| csvcc -gc ref,alt -vs 1 -u 3 #buckets# [#gen_gender#]
| pivot -gc ref,alt,pheno,gt cc -e 0 -v male,female
| pivot -gc ref,alt,pheno gt -v 0,1,2,3 -e 0
| prefix pheno[+1]- x
Query ran in 0.09 sec Query fetched 3 rows in 0.02 sec (total time 0.10 sec)
CHROM | POS | Ref | Alt | Pheno | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | B | 0 | 144 | 0 | 0 | 0 | 0 | 0 | 244 |
1 | chr1 | 10048 | CT | C | C | 0 | 137 | 0 | 0 | 0 | 0 | 0 | 260 |
2 | chr1 | 10048 | CT | C | A | 0 | 159 | 0 | 1 | 0 | 0 | 0 | 226 |
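The two-step PIVOT can be emulated with pandas to make the column construction explicit. This is an illustrative sketch using the cohort A counts at chr1:10048 from the output above; the variable names are ours, not GOR's.

```python
import pandas as pd

# Long-form CSVCC output for cohort A at chr1:10048 (from the table above)
long = pd.DataFrame({"pheno": ["A"] * 4, "cc": ["female"] * 4,
                     "gt": [0, 1, 2, 3], "gtcount": [159, 1, 0, 226]})

# First PIVOT: spread the case/control column (here the sex phenotype)
step1 = long.pivot_table(index=["pheno", "gt"], columns="cc",
                         values="gtcount", aggfunc="sum", fill_value=0)
# Emulate -v male,female: materialize both columns even when one is absent
step1 = step1.reindex(columns=["male", "female"], fill_value=0).reset_index()

# Second PIVOT: spread the genotype values, one row per pheno group
step2 = step1.pivot_table(index="pheno", columns="gt",
                          values=["male", "female"], aggfunc="sum", fill_value=0)
# Flatten to GOR-style names, e.g. x_0_female_GTcount
step2.columns = [f"x_{gt}_{sex}_GTcount" for sex, gt in step2.columns]
```

The key difference from the single-group query is simply that pheno stays in the row index (the -gc list) instead of being collapsed away.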
We add the rest of the parallel statements with this new pivot code:
mydefs4 = qh4.add_many("""
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,pheno,gt cc -e 0 -v male,female
| pivot -gc ref,alt,pheno gt -v 0,1,2,3 -e 0
| prefix pheno[+1]- x);
""");
To summarize the partitions, we also have to group by the pheno column:
%%gor
$mydefs4
gor [#partaf#] | group 1 -gc ref,alt,pheno -sum -ic x_*
| rename sum_(.*) #{1}
| top 6
Query ran in 0.24 sec Query fetched 6 rows in 0.01 sec (total time 0.25 sec)
CHROM | POS | Ref | Alt | Pheno | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | A | 0 | 159 | 0 | 1 | 0 | 0 | 0 | 226 |
1 | chr1 | 10048 | CT | C | B | 0 | 144 | 0 | 0 | 0 | 0 | 0 | 244 |
2 | chr1 | 10048 | CT | C | C | 0 | 137 | 0 | 0 | 0 | 0 | 0 | 260 |
3 | chr1 | 10057 | A | AC | A | 0 | 166 | 0 | 0 | 0 | 0 | 0 | 220 |
4 | chr1 | 10057 | A | AC | B | 0 | 154 | 0 | 0 | 0 | 0 | 0 | 234 |
5 | chr1 | 10057 | A | AC | C | 0 | 147 | 0 | 0 | 0 | 0 | 0 | 250 |
The rest of the AF logic is identical to the code shown for #af#, apart from this extra grouping by pheno. We can now choose to keep the AF information in a single file, separated by rows, or to "fork" the write so that each phenotype lands in its own GORD partition file.
A few remarks about performance¶
Both the incremental freeze and the AF calculation can save a lot of compute when the number of samples gets very large. Clearly, both benefit from fewer and larger increments (since we have to spend some effort on the old samples). Depending on which approach is used for inferring unknown genotypes in a freeze, the incremental cost for the old samples will differ. The method that uses only the #goodcov# data is close to 10x faster than the one that also reads #variants#. However, as shown in (https://doi.org/10.1101/2022.10.25.513331), both methods use GTGEN and are 50-100x faster than gVCF-based methods.
Once we have a large freeze with M samples and add N new samples, we can estimate the incremental cost as: a) for the new samples, proportional to sqrt(M+N)*N; b) for the old samples, proportional to M*(sqrt(M+N)-sqrt(M)); and c) for writing out the AF calculation, proportional to sqrt(M+N). These are, however, more like Big-O estimates, and we will have to measure the times of each step with real data sets to understand the relative constant factors. Theoretically, they should all converge to grow as sqrt(M), i.e. even the incremental update will become more and more costly, although it does not grow linearly. A regular full update should grow as M*sqrt(M).
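As a unit-less, back-of-the-envelope check of these estimates (the constant factors are unknown, so only the ratio is meaningful):

```python
import math

def full_cost(m):
    """Full recalculation over m samples: proportional to M*sqrt(M)."""
    return m * math.sqrt(m)

def incremental_cost(m, n):
    """Incremental update: M old samples plus N new ones."""
    new = n * math.sqrt(m + n)                    # a) new samples
    old = m * (math.sqrt(m + n) - math.sqrt(m))   # b) old samples
    write = math.sqrt(m + n)                      # c) writing the AF output
    return new + old + write

# e.g. adding 1,000 samples to a 100,000-sample freeze: the incremental
# update is dozens of times cheaper than a full rebuild
ratio = full_cost(101_000) / incremental_cost(100_000, 1_000)
```

For small N relative to M, both a) and b) are close to N*sqrt(M), which is the sqrt(M) per-added-sample growth noted above.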