Incremental AF calculation¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "clin-hg19"
%env GOR_API_PROJECT={project}
Incremental AF and GF calculation¶
Allele and genotype frequencies of variants are of key importance in clinical diagnostic interpretation of variants. In this notebook we will show how this information can be calculated easily and efficiently from a freeze structure, using the CSVCC command as the main workhorse, i.e. to collapse the genotypes stored in the VALUES columns into aggregate counts.
- First we will go through the full parallel calculation, based on a query similar to the report builder template freeze_af.ftl.yml.
- Next we show how the same calculation can be done incrementally.
- Finally, we show how the calculation can be done for multiple sub-populations simultaneously and comment on performance.
We will populate the GOR query using the data freeze we wrote out in Incremental_Freeze.ipynb.
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v2.tsv;
def #hgt# = ##freeze##/freeze_v2.gord;
def ##input_vars## = ##freeze##/allvars_2.gorz;
create #pns# = nor #buckets# | select pn;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female' /* skip sex calculation and fix */ ;
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,gt cc -e 0 -v male,female
| pivot -gc ref,alt gt -v 0,1,2,3 -e 0
| prefix alt[+1]- x);
create #af# = pgor [#partaf#]
| group 1 -gc ref,alt -sum -ic x_*
| rename sum_(.*) #{1}
| calc x_AN_auto #diploid# * (x_0_female_gtcount + x_1_female_gtcount + x_2_female_gtcount + x_0_male_gtcount + x_1_male_gtcount + x_2_male_gtcount)
| calc x_AN_chrX #diploid# * (x_0_female_gtcount + x_1_female_gtcount + x_2_female_gtcount) + x_0_male_gtcount + x_1_male_gtcount + x_2_male_gtcount
| calc x_AN_chrYM x_0_female_gtcount + x_1_female_gtcount + x_2_female_gtcount + x_0_male_gtcount + x_1_male_gtcount + x_2_male_gtcount
| calc x_AC_auto x_1_female_gtcount + x_1_male_gtcount + #diploid# * (x_2_female_gtcount + x_2_male_gtcount)
| calc x_AC_chrX x_1_female_gtcount + x_1_male_gtcount + #diploid# * (x_2_female_gtcount) + x_2_male_gtcount
| calc x_AC_chrYM x_1_female_gtcount + x_1_male_gtcount + x_2_female_gtcount + x_2_male_gtcount
| calc AN if(chrom='chrX', x_AN_chrX, if(chrom='chrY' or chrom='chrM', x_AN_chrYM, x_AN_auto))
| calc AC if(chrom='chrX', x_AC_chrX, if(chrom='chrY' or chrom='chrM', x_AC_chrYM, x_AC_auto))
| calc x_0_gtcount x_0_female_gtcount + x_0_male_gtcount
| calc x_1_gtcount x_1_female_gtcount + x_1_male_gtcount
| calc x_2_gtcount x_2_female_gtcount + x_2_male_gtcount
| calc x_3_gtcount x_3_female_gtcount + x_3_male_gtcount
| calc AF form(AC/float(AN),6,7)
| calc conservative_AF form(max(0.0,float(float(AC) - 2*sqrt(float(AC) + 1)+2)/float(AN)),6,7)
| calc pnCount x_0_gtcount+x_1_gtcount+x_2_gtcount+x_3_gtcount
| calc marker_yield form((float(x_0_gtcount)+float(x_1_gtcount)+float(x_2_gtcount))/float(pncount),6,7)
| calc HWE_homFreq form(x_2_gtcount/float(PNcount),6,7) /* HWE assumes diploid, so we ignore sex */
| calc hetFreq form(x_1_gtcount/float(pncount),6,7)
| calc HWE_alleleFreq form((2*x_2_gtcount+x_1_gtcount)/float(2*pncount),6,7)
| calc pA = (2*x_2_gtcount+x_1_gtcount)/float(2*PNcount)
| calc pB = 1.0 - pA
| calc mAA = pA * pA * PNcount
| calc mAB = 2 * pA * pB * PNcount
| calc mBB = pB * pB * PNcount
| calc HWE_chisq = float(form(sqr(max(abs(x_2_gtcount-mAA)-0.5,0.0))/mAA
+ sqr(max(abs(x_1_gtcount-mAB)-0.5,0.0))/mAB
+ sqr(max(abs(PNcount-x_1_gtcount-x_2_gtcount-mBB)-0.5,0.0))/mBB ,5,5))
| calc HWE_pVal = chisquarecompl(1,HWE_chisq)
| calc HWE_logPval = float(form(-log(HWE_pVal),3,2))
| calc varCount x_1_gtcount + x_2_gtcount
| calc hetCount x_1_gtcount
| calc homCount x_2_gtcount
| hide pA-mBB
| hide x_*;
""")
In the query above, we have skipped the gender analysis that is in freeze_af.ftl.yml and simply fixed the phenotype to "female". Because our freeze was only written out for chr1, we calculate #segs# using the -p chr1 option instead of PGOR.
The main query, #partaf#, has been written using PARALLEL instead of PARTGOR, because we use buckets rather than PNs to define the partitions. Therefore, we use -f #{col:bucket} when reading the variants and INSET to filter the PN-list that we process in each CSVCC execution.
In #af#, we aggregate the partial counts using the GROUP command. The CALC steps that follow therefore apply only once per variant, not once per genotype.
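The sex-aware AN/AC logic in the #af# CALC steps can be sketched in plain Python. This is a simplified model for illustration, not part of the notebook's pipeline; the genotype counts used in the usage example are the chr1:10119 counts from this freeze.

```python
def allele_counts(chrom, gt):
    """AN and AC per the #af# CALC logic.

    gt maps '<gtvalue>_<sex>' -> genotype count, where genotype
    values are 0=hom-ref, 1=het, 2=hom-alt, 3=unknown.
    """
    f0, f1, f2 = gt["0_female"], gt["1_female"], gt["2_female"]
    m0, m1, m2 = gt["0_male"], gt["1_male"], gt["2_male"]
    if chrom == "chrX":  # females diploid, males haploid
        an = 2 * (f0 + f1 + f2) + m0 + m1 + m2
        ac = f1 + m1 + 2 * f2 + m2
    elif chrom in ("chrY", "chrM"):  # treated as haploid for everyone
        an = f0 + f1 + f2 + m0 + m1 + m2
        ac = f1 + m1 + f2 + m2
    else:  # autosomes: everyone diploid
        an = 2 * (f0 + f1 + f2 + m0 + m1 + m2)
        ac = f1 + m1 + 2 * (f2 + m2)
    return an, ac

# chr1:10119 in this freeze: 344 hom-ref and 2 het females, no males
counts = {"0_female": 344, "1_female": 2, "2_female": 0,
          "0_male": 0, "1_male": 0, "2_male": 0}
an, ac = allele_counts("chr1", counts)  # AN=692, AC=2, AF=2/692
```

Note that genotype value 3 (unknown) contributes to neither AN nor AC, which is why AN varies per variant while pnCount stays fixed.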
We can inspect #partaf#:
%%gor
$mydefs
nor [#partaf#] | top 1 | unpivot #1-
Query ran in 0.68 sec Query fetched 12 rows in 0.01 sec (total time 0.69 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 10048 |
2 | Ref | CT |
3 | Alt | C |
4 | x_0_male_GTcount | 0 |
5 | x_0_female_GTcount | 2 |
6 | x_1_male_GTcount | 0 |
7 | x_1_female_GTcount | 0 |
8 | x_2_male_GTcount | 0 |
9 | x_2_female_GTcount | 0 |
10 | x_3_male_GTcount | 0 |
11 | x_3_female_GTcount | 35 |
%%gor
$mydefs
gor [#af#] | top 10
Query ran in 0.36 sec Query fetched 10 rows in 0.00 sec (total time 0.37 sec)
CHROM | POS | Ref | Alt | AN | AC | AF | conservative_AF | pnCount | marker_yield | HWE_homFreq | hetFreq | HWE_alleleFreq | HWE_chisq | HWE_pVal | HWE_logPval | varCount | hetCount | homCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | 610 | 1 | 0.001639 | 0.000281 | 787 | 0.387548 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
1 | chr1 | 10057 | A | AC | 644 | 1 | 0.001553 | 0.000266 | 787 | 0.409149 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
2 | chr1 | 10119 | CT | C | 692 | 2 | 0.002890 | 0.000774 | 787 | 0.439644 | 0.000000 | 0.002541 | 0.001271 | 0.00000 | 1.000000e+00 | -0.00 | 2 | 2 | 0 |
3 | chr1 | 10143 | CTAACCCCT | C | 648 | 1 | 0.001543 | 0.000265 | 787 | 0.411690 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
4 | chr1 | 10146 | AC | A | 618 | 196 | 0.317152 | 0.274965 | 787 | 0.392630 | 0.011436 | 0.226175 | 0.124523 | 0.81427 | 3.668608e-01 | 0.44 | 187 | 178 | 9 |
5 | chr1 | 10149 | CCT | C | 614 | 1 | 0.001629 | 0.000279 | 787 | 0.390089 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
6 | chr1 | 10154 | CCCTAA | C | 592 | 3 | 0.005068 | 0.001689 | 787 | 0.376112 | 0.001271 | 0.001271 | 0.001906 | 87.19333 | 9.841514e-21 | 20.01 | 2 | 1 | 1 |
7 | chr1 | 10155 | CCTAACCCTAACCCTAACCCTAA | C | 594 | 1 | 0.001684 | 0.000289 | 787 | 0.377383 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
8 | chr1 | 10167 | CCTAACCCTAA | C | 580 | 4 | 0.006897 | 0.002634 | 787 | 0.368488 | 0.001271 | 0.002541 | 0.002541 | 48.74921 | 2.908757e-12 | 11.54 | 3 | 2 | 1 |
9 | chr1 | 10172 | C | A | 576 | 1 | 0.001736 | 0.000298 | 787 | 0.365947 | 0.000000 | 0.001271 | 0.000635 | 0.00000 | 1.000000e+00 | -0.00 | 1 | 1 | 0 |
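The HWE statistics in the table above can be reproduced with a small Python sketch of the same continuity-corrected chi-square test (for one degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2))). This mirrors the CALC steps in #af# and is for illustration only.

```python
import math

def hwe_stats(het, hom_alt, pncount):
    """Chi-square HWE test with continuity correction,
    mirroring the calc steps in #af# (diploid assumption)."""
    p_a = (2 * hom_alt + het) / (2 * pncount)
    p_b = 1.0 - p_a
    m_aa = p_a * p_a * pncount        # expected hom-alt
    m_ab = 2 * p_a * p_b * pncount    # expected het
    m_bb = p_b * p_b * pncount        # expected hom-ref
    obs_bb = pncount - het - hom_alt
    chisq = (max(abs(hom_alt - m_aa) - 0.5, 0.0) ** 2 / m_aa
             + max(abs(het - m_ab) - 0.5, 0.0) ** 2 / m_ab
             + max(abs(obs_bb - m_bb) - 0.5, 0.0) ** 2 / m_bb)
    p_val = math.erfc(math.sqrt(chisq / 2))  # chi-square tail, 1 df
    return chisq, p_val

# chr1:10154 above: 1 het, 1 hom-alt out of 787 samples
chisq, p = hwe_stats(1, 1, 787)  # chisq ≈ 87.19, as in the table
```

A lone hom-alt genotype with almost no hets is a strong HWE violation, hence the large chi-square and the HWE_logPval of about 20.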
We see that the variant at chr1:10119 has an AC of 2. We can quickly verify that:
%%gor
gor -p chr1:10119 source/var/wgs_varcalls.gord -s PN
Query ran in 0.61 sec Query fetched 2 rows in 0.00 sec (total time 0.61 sec)
CHROM | POS | Reference | Call | CallCopies | CallRatio | Depth | GL_Call | FILTER | FS | formatZip | PN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10119 | CT | C | 1 | 0.226 | 31 | 18 | PASS | 0.0 | NaN | GDX_2866201 |
1 | chr1 | 10119 | CT | C | 1 | 0.242 | 33 | 17 | PASS | 0.0 | NaN | GDX_2866252 |
%%gor
gor -p chr1:10119 user_data/IncrFreeze/freeze_v2.gord
| csvsel -gc ref,alt -vs 1 -tag PN -hide '0,3'
user_data/IncrFreeze/PNbuckets_v2.tsv <(nor user_data/IncrFreeze/PNbuckets_v2.tsv | select pn)
Query ran in 0.10 sec Query fetched 2 rows in 0.01 sec (total time 0.10 sec)
CHROM | POS | Ref | Alt | PN | value | |
---|---|---|---|---|---|---|
0 | chr1 | 10119 | CT | C | GDX_2866201 | 1 |
1 | chr1 | 10119 | CT | C | GDX_2866252 | 1 |
Incremental AF¶
We want to show that we can generate the relation #partaf# incrementally. If we can do that, the rest of the calculation to write out #af# (or a subset of its columns) is easy.
qh2 = GQH.GOR_Query_Helper()
mydefs2 = ""
We set the freeze to version 1 and use allvars_1 and PNbuckets_v1.
mydefs2 = qh2.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v1.tsv;
def #hgt# = ##freeze##/freeze_v1.gord;
def ##input_vars## = ##freeze##/allvars_1.gorz;
create #pns# = nor #buckets# | select pn /* this can be changed to any subset */;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female' /* skip sex calculation and fix */ ;
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf1# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,gt cc -e 0 -v male,female
| pivot -gc ref,alt gt -v 0,1,2,3 -e 0
| prefix alt[+1]- x);
""")
%%gor
$mydefs2
gor [#partaf1#] | top 1
Query ran in 2.06 sec Query fetched 1 rows in 12.63 sec (total time 14.69 sec)
CHROM | POS | Ref | Alt | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10057 | A | AC | 0 | 86 | 0 | 0 | 0 | 0 | 0 | 164 |
We write it out into our folder:
%%gor
$mydefs2
gor [#partaf1#] | write user_data/IncrFreeze/AF/part1.gorz
Query ran in 0.14 sec Query fetched 0 rows in 6.61 sec (total time 6.75 sec)
CHROM | POS |
---|
Incremental update¶
For this we will actually access the internal delta buckets, i.e. we want to eliminate the buckets used to calculate AF/part1.gorz. Let's look at the internals:
%%gor
nor -asdict user_data/IncrFreeze/freeze_v2.gord
Query ran in 0.16 sec Query fetched 6 rows in 0.00 sec (total time 0.16 sec)
file | part | chrstart | posstart | chrstop | posstop | tags | |
---|---|---|---|---|---|---|---|
0 | hgt_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
1 | hgt_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
2 | hgt_2_3.gorz | b_3 | chr1 | 0 | chrZ | 1000000000 | b_3 |
3 | hgt_2_4.gorz | b_4 | chr1 | 0 | chrZ | 1000000000 | b_4 |
4 | hgt_delta_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
5 | hgt_delta_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
We now write a new AF update GORD dictionary file:
%%gor
nor -asdict user_data/IncrFreeze/freeze_v2.gord | where not(left(file,5)='hgt_1') | write user_data/IncrFreeze/freeze_afdelt.tsv
Query ran in 0.08 sec Query fetched 0 rows in 0.02 sec (total time 0.09 sec)
ChromNOR | PosNOR |
---|
import os
proj_path = '/home/jovyan/projects/clin-hg19/'
os.rename(proj_path+'user_data/IncrFreeze/freeze_afdelt.tsv',proj_path+'user_data/IncrFreeze/freeze_afdelt.gord')
%%gor
nor -asdict user_data/IncrFreeze/freeze_afdelt.gord
Query ran in 0.08 sec Query fetched 4 rows in 0.00 sec (total time 0.08 sec)
file | part | chrstart | posstart | chrstop | posstop | tags | |
---|---|---|---|---|---|---|---|
0 | hgt_2_3.gorz | b_3 | chr1 | 0 | chrZ | 1000000000 | b_3 |
1 | hgt_2_4.gorz | b_4 | chr1 | 0 | chrZ | 1000000000 | b_4 |
2 | hgt_delta_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
3 | hgt_delta_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
For reference, version 1 looks like this:
%%gor
nor -asdict user_data/IncrFreeze/freeze_v1.gord
Query ran in 0.07 sec Query fetched 2 rows in 0.00 sec (total time 0.07 sec)
file | part | chrstart | posstart | chrstop | posstop | tags | |
---|---|---|---|---|---|---|---|
0 | hgt_1_1.gorz | b_1 | chr1 | 0 | chrZ | 1000000000 | b_1 |
1 | hgt_1_2.gorz | b_2 | chr1 | 0 | chrZ | 1000000000 | b_2 |
Now we use this delta dictionary to calculate #partaf2#, together with PNbuckets_v2.tsv and allvars_2.gorz.
qh3 = GQH.GOR_Query_Helper()
mydefs3 = ""
mydefs3 = qh3.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v2.tsv;
def #hgt# = ##freeze##/freeze_afdelt.gord;
def ##input_vars## = ##freeze##/allvars_2.gorz;
create #pns# = nor #buckets# | select pn /* this can be changed to any subset */;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female' /* skip sex calculation and fix */ ;
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf2# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,gt cc -e 0 -v male,female
| pivot -gc ref,alt gt -v 0,1,2,3 -e 0
| prefix alt[+1]- x);
""")
%%gor
$mydefs3
gor [#partaf2#] | write user_data/IncrFreeze/AF/part2.gorz
Query ran in 0.35 sec Query fetched 0 rows in 9.46 sec (total time 9.81 sec)
CHROM | POS |
---|
Our combined equivalent of #partaf# can now be generated as:
%%gor
gor user_data/IncrFreeze/AF/part1.gorz user_data/IncrFreeze/AF/part2.gorz
| group 1 -gc ref,alt -sum -ic x_*
| rename sum_(.*) #{1}
| top 10
Query ran in 0.08 sec Query fetched 10 rows in 0.00 sec (total time 0.09 sec)
CHROM | POS | Ref | Alt | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | 0 | 304 | 0 | 1 | 0 | 0 | 0 | 482 |
1 | chr1 | 10057 | A | AC | 0 | 321 | 0 | 1 | 0 | 0 | 0 | 465 |
2 | chr1 | 10119 | CT | C | 0 | 344 | 0 | 2 | 0 | 0 | 0 | 441 |
3 | chr1 | 10143 | CTAACCCCT | C | 0 | 323 | 0 | 1 | 0 | 0 | 0 | 463 |
4 | chr1 | 10146 | AC | A | 0 | 122 | 0 | 178 | 0 | 9 | 0 | 478 |
5 | chr1 | 10149 | CCT | C | 0 | 306 | 0 | 1 | 0 | 0 | 0 | 480 |
6 | chr1 | 10154 | CCCTAA | C | 0 | 294 | 0 | 1 | 0 | 1 | 0 | 491 |
7 | chr1 | 10155 | CCTAACCCTAACCCTAACCCTAA | C | 0 | 296 | 0 | 1 | 0 | 0 | 0 | 490 |
8 | chr1 | 10167 | CCTAACCCTAA | C | 0 | 287 | 0 | 2 | 0 | 1 | 0 | 497 |
9 | chr1 | 10172 | C | A | 0 | 287 | 0 | 1 | 0 | 0 | 0 | 499 |
We can indeed compare it with our original calculation and see that it gives the same aggregate results for every variant:
%%gor
$mydefs
gor -p chr1 <(gor user_data/IncrFreeze/AF/part1.gorz user_data/IncrFreeze/AF/part2.gorz
| group 1 -gc ref,alt -sum -ic x_*
| rename sum_(.*) #{1}
| merge -s <(gor [#partaf#] | group 1 -gc ref,alt -sum -ic x_* | rename sum_(.*) #{1} )
)
| group 1 -gc ref-source[-1] -dis -sc source
| throwif dis_source != 2
| group chrom -count
Query ran in 0.66 sec Query fetched 1 rows in 48.74 sec (total time 49.40 sec)
CHROM | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chr1 | 0 | 249250621 | 3633503 |
For each incremental step, we must aggregate the files part1.gorz and part2.gorz together as shown above before merging them with the aggregation from the next iteration. We then store this file together with the final AF calculation to enable the next incremental update. The cost of keeping this file is proportional to the number of variants (which grows roughly as the square root of the number of samples). Clearly, this type of analysis can be extended to incremental subtraction of old samples as well; however, given how rare that use-case is, we will not discuss it here. A full recalculation is not that expensive!
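The per-variant merge step can be sketched with pandas. The split of counts between the two part files below is hypothetical (the real bucket split is not shown here); only their sum matches the chr1:10119 totals from the freeze.

```python
import pandas as pd

key = ["chrom", "pos", "ref", "alt"]
cols = key + ["x_0_female_GTcount", "x_1_female_GTcount",
              "x_2_female_GTcount", "x_3_female_GTcount"]

# Hypothetical split of the chr1:10119 counts between the two part files
part1 = pd.DataFrame([["chr1", 10119, "CT", "C", 150, 1, 0, 200]], columns=cols)
part2 = pd.DataFrame([["chr1", 10119, "CT", "C", 194, 1, 0, 241]], columns=cols)

# Equivalent of: gor part1.gorz part2.gorz | group 1 -gc ref,alt -sum -ic x_*
combined = pd.concat([part1, part2]).groupby(key, as_index=False).sum()
```

Because the partial counts are plain integer sums, merging any number of parts is associative: each incremental step only needs the previous running total and the new part.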
Processing multiple sample groups simultaneously¶
We can extend the calculation above and aggregate the genotypes for multiple groups in a single sweep as we read them. To show this, we will modify the #pns# relation and generate three groups with some random sampling.
qh4 = GQH.GOR_Query_Helper()
mydefs4 = ""
mydefs4 = qh4.add_many("""
def ##freeze## = user_data/IncrFreeze;
def #buckets# = user_data/IncrFreeze/PNbuckets_v2.tsv;
def #hgt# = ##freeze##/freeze_v2.gord;
def ##input_vars## = ##freeze##/allvars_2.gorz;
create #pns# = nor #buckets# | select pn | calc cohort 'A,B,C' | split cohort | where random() < 0.5;
create #parts# = nor #buckets# | group -gc bucket;
create #gen_gender# = nor [#pns#] | calc pheno 'female';
""")
We can inspect our sample relation as:
%%gor
$mydefs4
nor [#gen_gender#] | group -gc cohort -count
Query ran in 0.38 sec Query fetched 3 rows in 0.01 sec (total time 0.39 sec)
cohort | allCount | |
---|---|---|
0 | A | 386 |
1 | B | 388 |
2 | C | 397 |
We can now try out CSVCC with this new ternary relation #gen_gender#:
%%gor
$mydefs4
gor #hgt# | distloc 1
| csvcc -gc ref,alt -vs 1 -u 3 #buckets# [#gen_gender#]
Query ran in 0.09 sec Query fetched 12 rows in 0.02 sec (total time 0.11 sec)
CHROM | POS | Ref | Alt | Pheno | CC | GT | GTcount | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | A | female | 0 | 159 |
1 | chr1 | 10048 | CT | C | A | female | 1 | 1 |
2 | chr1 | 10048 | CT | C | A | female | 2 | 0 |
3 | chr1 | 10048 | CT | C | A | female | 3 | 226 |
4 | chr1 | 10048 | CT | C | C | female | 0 | 137 |
5 | chr1 | 10048 | CT | C | C | female | 1 | 0 |
6 | chr1 | 10048 | CT | C | C | female | 2 | 0 |
7 | chr1 | 10048 | CT | C | C | female | 3 | 260 |
8 | chr1 | 10048 | CT | C | B | female | 0 | 144 |
9 | chr1 | 10048 | CT | C | B | female | 1 | 0 |
10 | chr1 | 10048 | CT | C | B | female | 2 | 0 |
11 | chr1 | 10048 | CT | C | B | female | 3 | 244 |
We see that the cohort now appears in the Pheno column, while our pheno attribute shows up in CC as before. This is somewhat confusing, but it follows from how CSVCC handles binary versus ternary relations. It also means that our gender shows up in the same output column as before. Our new PIVOT logic is only slightly modified from above:
%%gor
$mydefs4
gor #hgt# | distloc 1
| csvcc -gc ref,alt -vs 1 -u 3 #buckets# [#gen_gender#]
| pivot -gc ref,alt,pheno,gt cc -e 0 -v male,female
| pivot -gc ref,alt,pheno gt -v 0,1,2,3 -e 0
| prefix pheno[+1]- x
Query ran in 0.09 sec Query fetched 3 rows in 0.02 sec (total time 0.10 sec)
CHROM | POS | Ref | Alt | Pheno | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | B | 0 | 144 | 0 | 0 | 0 | 0 | 0 | 244 |
1 | chr1 | 10048 | CT | C | C | 0 | 137 | 0 | 0 | 0 | 0 | 0 | 260 |
2 | chr1 | 10048 | CT | C | A | 0 | 159 | 0 | 1 | 0 | 0 | 0 | 226 |
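The two-step PIVOT can be emulated with pandas to make the column construction explicit. This is an illustrative sketch using the cohort A counts at chr1:10048 from the output above; the variable names are ours, not GOR's.

```python
import pandas as pd

# Long-form CSVCC output for cohort A at chr1:10048 (from the table above)
long = pd.DataFrame({"pheno": ["A"] * 4, "cc": ["female"] * 4,
                     "gt": [0, 1, 2, 3], "gtcount": [159, 1, 0, 226]})

# First PIVOT: spread the case/control column (here the sex phenotype)
step1 = long.pivot_table(index=["pheno", "gt"], columns="cc",
                         values="gtcount", aggfunc="sum", fill_value=0)
# Emulate -v male,female: materialize both columns even when one is absent
step1 = step1.reindex(columns=["male", "female"], fill_value=0).reset_index()

# Second PIVOT: spread the genotype values, one row per pheno group
step2 = step1.pivot_table(index="pheno", columns="gt",
                          values=["male", "female"], aggfunc="sum", fill_value=0)
# Flatten to GOR-style names, e.g. x_0_female_GTcount
step2.columns = [f"x_{gt}_{sex}_GTcount" for sex, gt in step2.columns]
```

The key difference from the single-group query is simply that pheno stays in the row index (the -gc list) instead of being collapsed away.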
We add the rest of the parallel statements with this new pivot code:
mydefs4 = qh4.add_many("""
create #varcount# = pgor ##input_vars## | group 100000 -count;
create #segs# = gor -p chr1 [#varcount#] | seghist 1000000;
def #diploid# = 2;
create #partaf# = parallel -parts [#parts#] <(pgor -split <(gor [#segs#]) #hgt# -f #{col:bucket}
| csvcc -gc ref,alt -vs 1 -u 3 #buckets#
<(nor [#gen_gender#] | inset -c pn <(nor #buckets# | where bucket = '#{col:bucket}' ) )
| pivot -gc ref,alt,pheno,gt cc -e 0 -v male,female
| pivot -gc ref,alt,pheno gt -v 0,1,2,3 -e 0
| prefix pheno[+1]- x);
""");
To summarize the partitions, we also have to group by the pheno column:
%%gor
$mydefs4
gor [#partaf#] | group 1 -gc ref,alt,pheno -sum -ic x_*
| rename sum_(.*) #{1}
| top 6
Query ran in 0.24 sec Query fetched 6 rows in 0.01 sec (total time 0.25 sec)
CHROM | POS | Ref | Alt | Pheno | x_0_male_GTcount | x_0_female_GTcount | x_1_male_GTcount | x_1_female_GTcount | x_2_male_GTcount | x_2_female_GTcount | x_3_male_GTcount | x_3_female_GTcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10048 | CT | C | A | 0 | 159 | 0 | 1 | 0 | 0 | 0 | 226 |
1 | chr1 | 10048 | CT | C | B | 0 | 144 | 0 | 0 | 0 | 0 | 0 | 244 |
2 | chr1 | 10048 | CT | C | C | 0 | 137 | 0 | 0 | 0 | 0 | 0 | 260 |
3 | chr1 | 10057 | A | AC | A | 0 | 166 | 0 | 0 | 0 | 0 | 0 | 220 |
4 | chr1 | 10057 | A | AC | B | 0 | 154 | 0 | 0 | 0 | 0 | 0 | 234 |
5 | chr1 | 10057 | A | AC | C | 0 | 147 | 0 | 0 | 0 | 0 | 0 | 250 |
The rest of the AF logic is identical to the code shown for #af#, apart from this extra grouping by pheno. We can now choose to keep the AF information in a single file, separated by rows, or to "fork" the write so that each phenotype lands in its own GORD partition file.
A few remarks about performance¶
Both the incremental freeze and the AF calculation can save a lot of compute when the number of samples gets very large. Clearly, both benefit from fewer and larger increments (since we have to spend some effort on the old samples). Depending on which approach is used for inferring unknown genotypes in a freeze, the incremental cost for the old samples will differ. The method that uses only the #goodcov# data is close to 10x faster than the one that also reads #variants#. However, as shown in (https://doi.org/10.1101/2022.10.25.513331), both methods use GTGEN and are 50-100x faster than gVCF-based methods.
Once we have a large freeze with M samples and add N new samples, we can estimate the incremental cost as: a) for the new samples, proportional to sqrt(M+N)*N; b) for the old samples, proportional to M*(sqrt(M+N)-sqrt(M)); and c) for writing out the AF calculation, proportional to sqrt(M+N). These are, however, more like Big-O estimates, and we will have to measure the times of each step with real data sets to understand the relative constant factors. Theoretically, they should all converge to grow as sqrt(M), i.e. even the incremental update will become more and more costly, although it does not grow linearly. A regular full update should grow as M*sqrt(M).
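As a unit-less, back-of-the-envelope check of these estimates (the constant factors are unknown, so only the ratio is meaningful):

```python
import math

def full_cost(m):
    """Full recalculation over m samples: proportional to M*sqrt(M)."""
    return m * math.sqrt(m)

def incremental_cost(m, n):
    """Incremental update: M old samples plus N new ones."""
    new = n * math.sqrt(m + n)                    # a) new samples
    old = m * (math.sqrt(m + n) - math.sqrt(m))   # b) old samples
    write = math.sqrt(m + n)                      # c) writing the AF output
    return new + old + write

# e.g. adding 1,000 samples to a 100,000-sample freeze: the incremental
# update is dozens of times cheaper than a full rebuild
ratio = full_cost(101_000) / incremental_cost(100_000, 1_000)
```

For small N relative to M, both a) and b) are close to N*sqrt(M), which is the sqrt(M) per-added-sample growth noted above.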