Gnomad¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "matthew_sampson_hg38"
%env GOR_API_PROJECT={project}
Gnomad v3.1.2¶
Here we will show how we can take very large VCF files from the Genome Aggregation Database and convert them to GOR tables. We will use this to explore some strategic options for computational efficiency and explain how we can leverage parallelism to both process and store the results in S3-based GOR tables (dictionaries). The focus here is on version 3.1.2, but we will also show how we can combine it with v2.1.1 to produce a more reliable AF table for filtering out "common" variants in rare-disease analysis.
The input VCF files are massive: 2.3 Tbytes with over 759M rows and close to 1000 attributes in the INFO column. We will convert this into GORZ files with an equal number of rows but close to 1000 columns, i.e. each attribute stored in a dedicated column. Storing the attributes as columns in GORZ reduces the storage footprint to 942 Gbytes without losing any information, and at the same time makes the data much faster to access.
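To make the target layout concrete, here is a tiny plain-Python illustration (not GOR; the attribute values are just examples taken from a single variant shown later in this notebook) of the transformation we are about to perform at scale: turning a semicolon-separated INFO string of attribute=value pairs into named columns.
# Minimal illustration of the INFO attribute-value -> column transformation.
# The INFO string below is a shortened example; the real gnomAD INFO field
# carries close to 1000 such attributes per row.
info = "AC=1;AN=151970;AF=6.58025e-06;variant_type=snv"
# one pass over the string: each attribute becomes a named "column"
columns = dict(pair.split("=", 1) for pair in info.split(";") if "=" in pair)
print(columns)
# {'AC': '1', 'AN': '151970', 'AF': '6.58025e-06', 'variant_type': 'snv'}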
Create a dictionary for the VCF files¶
First we create a dictionary to make it easy to refer to all the VCF files through a single "table". Because we are lazy, we use the #genes# table to list all the chromosomes, with the GROUP command. This also gives us the start and stop positions for each chromosome, but we must convert the start coordinate because dictionaries are one-based (like the -p option in GOR) whereas the genome segments in GOR follow the strange zero-based convention a la UCSC. We don't want life to be too simple! Also, we specifically filter out chrM because we don't have a VCF file for the mitochondrial genome.
%%gor
nor <(gor #genes# | group chrom | where chrom != 'chrM')
| replace bpStart bpStart+1 | select 1-3 | calc range #1+':'+bpstart+'-'+bpstop
| calc inputfile 's3://gsci-reference-data-sources/extdata/gnomad/3.1.2/genomes/gnomad.genomes.v3.1.2.sites.'+chrom+'.vcf.gz'
| rownum
| select inputfile,rownum,chrom,bpstart,chrom,bpstop
| rename #1 file
| rename #2 ID
| write user_data/hakon/gnomad/vcf_input.gord
Now we can sample the VCF files by reading 10k rows at 500 different locations. We split the INFO field with SPLIT into attribute-value pairs and then use COLSPLIT to get the attribute name. We then use GROUP to get a distinct list of them:
%%gor dfAttributes <<
create x = pgor -split 500 user_data/hakon/gnomad/vcf_input.gord | top 10000
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| group chrom -gc x_1
;
nor [x] | group -gc x_1 | rename #1 attr
Query ran in 2.61 sec Query fetched 908 rows in 1.01 sec (total time 3.63 sec)
dfAttributes.shape
(908, 1)
dfAttributes.sort_values(by='attr', ascending=True).head()
attr | |
---|---|
0 | AC |
1 | AC_XX |
2 | AC_XY |
3 | AC_afr |
4 | AC_afr_XX |
We see that we have 908 attributes in the INFO field. Our plan is to convert these AV pairs into columns, to make them more easily accessible in tables, with a smaller data footprint and much faster parsing. Note that in order to generate this attribute list, we sampled the VCF files. Another way is to take the VCF header and carve out the INFO declarations. With a minor manual text edit to turn the header into a single-column TSV, we then ran the following:
av_info_query = """nor x.tsv
| replace x replace(replace(x,'INFO=<',''),'>','')
| calc Attribute TAG(x,'ID',',')
| calc Number TAG(x,'Number',',')
| calc Type TAG(x,'Type',',')
| calc Description dunquote(TAG(x,'Description',','))
| hide x
| write user_data/hakon/gnomad3/attribute_descr.tsv """
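The query above assumes a file x.tsv that holds the ##INFO declarations from the VCF header as a single column named x. As a hedged sketch of that "minor manual text edit", something along these lines could produce it (the input path is illustrative; any of the per-chromosome VCFs carries the same header):
# Sketch: extract the ##INFO header lines from one of the bgzipped VCF files and
# write them as a single-column TSV (column name x) for the NOR query above.
import gzip

with gzip.open("gnomad.genomes.v3.1.2.sites.chr21.vcf.gz", "rt") as vcf, \
        open("x.tsv", "w") as out:
    out.write("x\n")                      # single header column, as the query expects
    for line in vcf:
        if not line.startswith("##"):     # end of the meta-information header
            break
        if line.startswith("##INFO="):
            out.write(line[2:])           # drop the leading '##', keep 'INFO=<...>'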
%%gor
nor user_data/hakon/gnomad3/attribute_descr.tsv | top 5
Query ran in 0.56 sec Query fetched 5 rows in 0.00 sec (total time 0.56 sec)
Attribute | Number | Type | Description | |
---|---|---|---|---|
0 | AC | A | Integer | Alternate allele count |
1 | AN | 1 | Integer | Total number of alleles |
2 | AF | A | Float | Alternate allele frequency |
3 | popmax | A | String | Population with maximum allele frequency |
4 | faf95_popmax | A | Float | Filtering allele frequency (using Poisson 95% ... |
Generate multiple CALC commands¶
To make it easier to visualize and execute, we first only generate five attribute columns.
calcCommands = ''
adf = dfAttributes
# generate one CALC command per attribute; only the first five here, for illustration
for i in range(0, len(adf.sort_values(by='attr', ascending=True).head())):
    calcCommands += f"""| calc {adf.at[i,'attr']} replace(tag(info,'{adf.at[i,'attr']}',';'),'NOT_FOUND','.')\n"""
print(calcCommands)
| calc AC replace(tag(info,'AC',';'),'NOT_FOUND','.')
| calc AC_XX replace(tag(info,'AC_XX',';'),'NOT_FOUND','.')
| calc AC_XY replace(tag(info,'AC_XY',';'),'NOT_FOUND','.')
| calc AC_afr replace(tag(info,'AC_afr',';'),'NOT_FOUND','.')
| calc AC_afr_XX replace(tag(info,'AC_afr_XX',';'),'NOT_FOUND','.')
Note that we use the REPLACE function to change the default behaviour of the TAG function, i.e. to return a dot when the attribute is not found. We can now form a query that picks up the first few attributes like this:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 10
$calcCommands
| select chrom,pos,AC-
Query ran in 0.71 sec Query fetched 10 rows in 0.10 sec (total time 0.80 sec)
CHROM | POS | AC | AC_XX | AC_XY | AC_afr | AC_afr_XX | |
---|---|---|---|---|---|---|---|
0 | chr7 | 10020 | 0 | 0 | 0 | 0 | 0 |
1 | chr7 | 10020 | 0 | 0 | 0 | 0 | 0 |
2 | chr7 | 10021 | 0 | 0 | 0 | 0 | 0 |
3 | chr7 | 10021 | 0 | 0 | 0 | 0 | 0 |
4 | chr7 | 10022 | 0 | 0 | 0 | 0 | 0 |
5 | chr7 | 10026 | 0 | 0 | 0 | 0 | 0 |
6 | chr7 | 10028 | 0 | 0 | 0 | 0 | 0 |
7 | chr7 | 10030 | 0 | 0 | 0 | 0 | 0 |
8 | chr7 | 10050 | 0 | 0 | 0 | 0 | 0 |
9 | chr7 | 10059 | 0 | 0 | 0 | 0 | 0 |
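As an aside, for readers less familiar with the TAG function, a plain-Python analogue of the replace(tag(info,...,';'),'NOT_FOUND','.') pattern above might look like the sketch below (the semantics are our assumption, shown for illustration only).
# Illustrative stand-in for replace(tag(info, attr, ';'), 'NOT_FOUND', '.'):
# return the value of the attribute if present in the INFO string, otherwise '.'.
def tag_or_dot(info: str, attr: str) -> str:
    for pair in info.split(";"):
        if pair.startswith(attr + "="):
            return pair.split("=", 1)[1]
    return "."                       # plays the role of the NOT_FOUND -> '.' replace

print(tag_or_dot("AC=1;AN=151970", "AC"))        # -> 1
print(tag_or_dot("AC=1;AN=151970", "AC_afr"))    # -> .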
Generate a single multi-CALC command¶
It turns out that 908 consecutive CALC commands introduce overhead which is not eliminated by the GOR compiler (as it stands today). First, while only a few rows have been read, each CALC adds type-inference overhead per row. Second, with a high number of rows, the overhead each CALC command introduces in order to add its column to the row starts to be noticeable. Therefore, in this extreme scenario with hundreds of new columns, we may be better off using the multi-calc functionality in CALC. Here we will actually use the GORpipe syntax to generate the query stubs. We use the query helper and add our create statement into it.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""
create x = pgor -split 500 user_data/hakon/gnomad/vcf_input.gord | top 10000
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| group chrom -gc x_1;
""")
We use GROUP twice, first to get a distinct set of attributes and then to make a list out of them.
%%gor dfQuery <<
$mydefs
nor [x] | group -gc x_1
| rename #1 col
| calc cmd "replace(tag(info,'"+#1+"',';'),'NOT_FOUND','.')"
| top 5
| group -lis -sc col,cmd -len 1000000
Query ran in 1.50 sec Query fetched 1 rows in 0.93 sec (total time 2.43 sec)
Now we form the multi CALC command:
calcCommand = '| calc '+dfQuery.at[0,'lis_col']+' '+dfQuery.at[0,'lis_cmd']
print(calcCommand)
| calc AC,AC_XX,AC_XY,AC_afr,AC_afr_XX replace(tag(info,'AC',';'),'NOT_FOUND','.'),replace(tag(info,'AC_XX',';'),'NOT_FOUND','.'),replace(tag(info,'AC_XY',';'),'NOT_FOUND','.'),replace(tag(info,'AC_afr',';'),'NOT_FOUND','.'),replace(tag(info,'AC_afr_XX',';'),'NOT_FOUND','.')
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 10
$calcCommand
| select chrom,pos,AC-
Query ran in 1.07 sec Query fetched 10 rows in 0.10 sec (total time 1.16 sec)
CHROM | POS | AC | AC_XX | AC_XY | AC_afr | AC_afr_XX | |
---|---|---|---|---|---|---|---|
0 | chr7 | 10020 | 0 | 0 | 0 | 0 | 0 |
1 | chr7 | 10020 | 0 | 0 | 0 | 0 | 0 |
2 | chr7 | 10021 | 0 | 0 | 0 | 0 | 0 |
3 | chr7 | 10021 | 0 | 0 | 0 | 0 | 0 |
4 | chr7 | 10022 | 0 | 0 | 0 | 0 | 0 |
5 | chr7 | 10026 | 0 | 0 | 0 | 0 | 0 |
6 | chr7 | 10028 | 0 | 0 | 0 | 0 | 0 |
7 | chr7 | 10030 | 0 | 0 | 0 | 0 | 0 |
8 | chr7 | 10050 | 0 | 0 | 0 | 0 | 0 |
9 | chr7 | 10059 | 0 | 0 | 0 | 0 | 0 |
PIVOT approach¶
Before we compare the performance of the two approaches shown above, we will present yet another way to extract AV data into columns from the VCF files, without introducing a special-purpose command for the task. Such a command would be the fastest option, but in our opinion it is not necessary, given that this is a one-time conversion exercise.
colList = dfQuery.at[0,'lis_col']
colList
'AC,AC_XX,AC_XY,AC_afr,AC_afr_XX'
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 10
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e . x_1 -v $colList
| rename (.*)_x_2 #{1}
| select chrom,pos,AC-
Query ran in 0.70 sec Query fetched 10 rows in 0.11 sec (total time 0.82 sec)
CHROM | POS | AC | AC_XX | AC_XY | AC_afr | AC_afr_XX | |
---|---|---|---|---|---|---|---|
0 | chr7 | 10020 | 0 | 0 | 0 | 0 | 0 |
1 | chr7 | 10020 | 0 | 0 | 0 | 0 | 0 |
2 | chr7 | 10021 | 0 | 0 | 0 | 0 | 0 |
3 | chr7 | 10021 | 0 | 0 | 0 | 0 | 0 |
4 | chr7 | 10022 | 0 | 0 | 0 | 0 | 0 |
5 | chr7 | 10026 | 0 | 0 | 0 | 0 | 0 |
6 | chr7 | 10028 | 0 | 0 | 0 | 0 | 0 |
7 | chr7 | 10030 | 0 | 0 | 0 | 0 | 0 |
8 | chr7 | 10050 | 0 | 0 | 0 | 0 | 0 |
9 | chr7 | 10059 | 0 | 0 | 0 | 0 | 0 |
The above method uses SPLIT and COLSPLIT to generate multiple rows from a single row (908 in the full case). While this introduces some overhead in handling the AV pairs, it turns out to be smaller than the cost of calling the TAG function repeatedly to extract each attribute value from the INFO string in the VCF file. A specialized command for AV-to-columnar transformation could combine SPLIT, COLSPLIT, and PIVOT into a single command without introducing the intermediate row model to represent the AV pairs. But now let's test the performance of these three methods for all 908 attributes and a larger number of rows. First, let's time the cost of reading the VCF file:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000 | group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 1.42 sec Query fetched 1 rows in 18.12 sec (total time 19.54 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr7 | 0 | 159345973 | 50000 | 18.952sec |
calcCommands = ''
adf = dfAttributes
# now generate a CALC command for every one of the 908 attributes
for i in range(0, len(adf.sort_values(by='attr', ascending=True))):
    calcCommands += f"""| calc {adf.at[i,'attr']} replace(tag(info,'{adf.at[i,'attr']}',';'),'NOT_FOUND','.')\n"""
%%gor dfQuery <<
$mydefs
nor [x] | group -gc x_1
| rename #1 col
| calc cmd "replace(tag(info,'"+#1+"',';'),'NOT_FOUND','.')"
| group -lis -sc col,cmd -len 1000000
Query ran in 2.59 sec Query fetched 1 rows in 0.94 sec (total time 3.53 sec)
calcCommand = '| calc '+dfQuery.at[0,'lis_col']+' '+dfQuery.at[0,'lis_cmd']
colList = dfQuery.at[0,'lis_col']
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000
$calcCommands
| group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 2.46 sec Query fetched 1 rows in 521.22 sec (total time 523.68 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr7 | 0 | 159345973 | 50000 | 523.138sec |
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000
$calcCommand
| group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 1.51 sec Query fetched 1 rows in 451.86 sec (total time 453.37 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr7 | 0 | 159345973 | 50000 | 452.741sec |
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e . x_1 -v $colList
| rename (.*)_x_2 #{1}
| group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 1.40 sec Query fetched 1 rows in 131.03 sec (total time 132.42 sec)
CHROM | bpStart | bpStop | allCount | t | |
---|---|---|---|---|---|
0 | chr7 | 0 | 159345973 | 50000 | 131.874sec |
What we see from these experiments is that because of the cost of calling the TAG function many times, there is not a great difference between the regular CALC approach and the multi-CALC approach, i.e. 523 sec vs 452 sec. Both are much slower than the 19 sec it takes to simply stream the VCF. The PIVOT approach, however, takes only 132 sec, i.e. about 7x overhead on top of reading the data.
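The intuition behind these numbers can be reproduced with a small plain-Python experiment (this is not GOR internals, just an analogy under our own assumptions): extracting k attributes by scanning the INFO string once per attribute, TAG-style, versus parsing it once and looking the attributes up.
# Rough analogy for the TAG-per-attribute vs. single-pass (SPLIT/PIVOT-like) cost.
import timeit

info = ";".join(f"attr{i}={i}" for i in range(900))   # synthetic ~900-attribute INFO string
wanted = [f"attr{i}" for i in range(900)]

def tag_style():
    # one linear scan of the INFO string per requested attribute
    return {a: info.split(a + "=", 1)[1].split(";", 1)[0] for a in wanted}

def single_pass():
    # parse the INFO string once, then do cheap dictionary lookups
    d = dict(p.split("=", 1) for p in info.split(";"))
    return {a: d[a] for a in wanted}

print("tag-style  :", timeit.timeit(tag_style, number=50))
print("single-pass:", timeit.timeit(single_pass, number=50))
The single pass does work proportional to the length of the INFO string once, instead of once per attribute, which mirrors why the PIVOT approach wins above.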
Genome wide processing¶
The Gnomad data has close to 759M rows. Converting the entire Gnomad 3.1.2 set to tabular GOR files will therefore take about 15 thousand times longer than our 50k-row benchmark, i.e. around 557 hours. With GORdb parallel processing this will be a walk in the park!
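A quick back-of-the-envelope check of that estimate, using the numbers measured above (a sketch; single-threaded extrapolation of the PIVOT timing):
# Scale the 50k-row PIVOT benchmark to the full gnomAD v3.1.2 row count.
rows_total  = 759_000_000        # total rows quoted above
rows_sample = 50_000             # rows in the chr7 timing runs
t_pivot_sec = 132                # PIVOT approach, seconds for the 50k-row sample

scale = rows_total / rows_sample
print(f"{scale:,.0f}x the sample -> {scale * t_pivot_sec / 3600:,.0f} hours single-threaded")
# ~15,180x and ~557 hours, hence pgor parallelism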
However, since we know that this query will run for more than a few minutes, we will execute it in the long-running queue, namely "lord". We therefore import the "old" query service SDK, which supports the lord job types. First we run it for only 10 rows per worker, to make sure that everything is in order.
import nextcode
svc_query = nextcode.get_service("query")
lqh = GQH.GOR_Query_Helper()
replace_symbol = "#{1}"
mydefs = lqh.add("""def ##top## = | top 10;""")
mydefs = lqh.add(f"""
create #temp# = pgor -split <(gor #genes# | group chrom | where chrom != 'chrM' | segspan -maxseg 10000000)
user_data/hakon/gnomad/vcf_input.gord ##top##
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e . x_1 -v {colList}
| rename (.*)_x_2 {replace_symbol}
""")
job = svc_query.execute(f"{mydefs} gor -p chr1 [#temp#] | top 3",job_type="lord") #,nowait="true"
job.dataframe()
CHROM | POS | REF | ALT | FILTER | AC | AC_XX | AC_XY | AC_afr | AC_afr_XX | ... | popmax | primate_ai_score | revel_score | segdup | splice_ai_consequence | splice_ai_max_ds | transmitted_singleton | variant_type | vep | was_mixed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10031 | T | C | AC0;AS_VQSR | 0 | 0 | 0 | 0 | 0 | ... | . | . | . | NaN | . | . | . | multi-snv | C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... | . |
1 | chr1 | 10037 | T | C | AS_VQSR | 2 | 1 | 1 | 0 | 0 | ... | eas | . | . | NaN | . | . | . | snv | C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... | . |
2 | chr1 | 10043 | T | C | AS_VQSR | 1 | 1 | 0 | 1 | 1 | ... | afr | . | . | NaN | . | . | . | snv | C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... | . |
3 rows × 913 columns
Now we modify the ##top## definition so that it passes all rows through. Based on our estimates above and the roughly 440 partitions we get from the -split option using SEGSPAN with 10Mb segments, the following query will take between one and two hours, depending on the number of workers available.
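As a rough sanity check of that one-to-two-hour estimate (assuming, for simplicity, that rows are spread evenly across the partitions and that each worker sustains the PIVOT rate measured on chr7):
# Per-partition time if the ~440 partitions run concurrently on enough workers.
rows_total = 759_000_000
partitions = 440
t_50k_sec  = 132                  # PIVOT time for 50k rows in the chr7 benchmark

rows_per_partition = rows_total / partitions            # ~1.7M rows
minutes = rows_per_partition / 50_000 * t_50k_sec / 60
print(f"~{minutes:.0f} minutes per partition")          # ~76 min, i.e. one to two hours of wall time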
mydefs = lqh.add("""def ##top## = | skip 0;""")
job = svc_query.execute(f"{mydefs} nor <(gor -p chr1 [#temp#] | top 1) | unpivot 1-",job_type="lord",nowait="true")
job
<GorQuery 652492 (DONE)>
job.dataframe()
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 10031 |
2 | REF | T |
3 | ALT | C |
4 | FILTER | AC0;AS_VQSR |
... | ... | ... |
908 | splice_ai_max_ds | . |
909 | transmitted_singleton | . |
910 | variant_type | multi-snv |
911 | vep | C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... |
912 | was_mixed | . |
913 rows × 2 columns
Persisting the results in S3¶
In the query above, which took close to two hours, we created a temporary table from our massive Gnomad table with close to 760M rows and 1000 columns. This table is stored in the Lustre/Posix GORdb cache and, given its size, it will most likely be garbage collected automatically unless it is used often. Now we want to write it to S3, a cheaper storage tier that we can also link into multiple projects. Writing out all these rows and columns from the cache to S3 is not trivial either, so we use a parallel write into a GOR dictionary.
job_w = svc_query.execute(f"""{mydefs} pgor -split 100 [#temp#]
| write s3data://project/user_data/hakon/gnomad/all.gord""",job_type="lord") #,nowait="true")
The above write took about 30 minutes. Notice that while the content of the dictionary is written to S3, the dictionary folder all.gord is stored under user_data on Lustre/Posix. Thus, when we access it, we skip the s3data prefix in the source URL, for instance now when we write out a subset of the columns:
%%gor
pgor -split 100 user_data/hakon/gnomad/all.gord
| select 1,2,REF,ALT,FILTER,AS_VQSLOD,AC,AC_afr,AC_afr_XX,AC_afr_XY,AC_ami,AC_ami_XX,AC_ami_XY,AC_amr,AC_amr_XX,AC_amr_XY,AC_asj,AC_asj_XX,AC_asj_XY,AC_eas,AC_eas_XX,AC_eas_XY,AC_fin,AC_fin_XX,AC_fin_XY,AC_mid,AC_mid_XX,AC_mid_XY,AC_nfe,AC_nfe_XX,AC_nfe_XY,AC_oth,AC_oth_XX,AC_oth_XY,AC_popmax,AC_raw,AC_sas,AC_sas_XX,AC_sas_XY,AC_XX,AC_XY,AF,AF_afr,AF_afr_XX,AF_afr_XY,AF_ami,AF_ami_XX,AF_ami_XY,AF_amr,AF_amr_XX,AF_amr_XY,AF_asj,AF_asj_XX,AF_asj_XY,AF_eas,AF_eas_XX,AF_eas_XY,AF_fin,AF_fin_XX,AF_fin_XY,AF_mid,AF_mid_XX,AF_mid_XY,AF_nfe,AF_nfe_XX,AF_nfe_XY,AF_oth,AF_oth_XX,AF_oth_XY,AF_popmax,AF_raw,AF_sas,AF_sas_XX,AF_sas_XY,AF_XX,AF_XY,AN,AN_afr,AN_afr_XX,AN_afr_XY,AN_ami,AN_ami_XX,AN_ami_XY,AN_amr,AN_amr_XX,AN_amr_XY,AN_asj,AN_asj_XX,AN_asj_XY,AN_eas,AN_eas_XX,AN_eas_XY,AN_fin,AN_fin_XX,AN_fin_XY,AN_mid,AN_mid_XX,AN_mid_XY,AN_nfe,AN_nfe_XX,AN_nfe_XY,AN_oth,AN_oth_XX,AN_oth_XY,AN_popmax,AN_raw,AN_sas,AN_sas_XX,AN_sas_XY,AN_XX,AN_XY,nhomalt,nhomalt_afr,nhomalt_afr_XX,nhomalt_afr_XY,nhomalt_ami,nhomalt_ami_XX,nhomalt_ami_XY,nhomalt_amr,nhomalt_amr_XX,nhomalt_amr_XY,nhomalt_asj,nhomalt_asj_XX,nhomalt_asj_XY,nhomalt_eas,nhomalt_eas_XX,nhomalt_eas_XY,nhomalt_fin,nhomalt_fin_XX,nhomalt_fin_XY,nhomalt_mid,nhomalt_mid_XX,nhomalt_mid_XY,nhomalt_nfe,nhomalt_nfe_XX,nhomalt_nfe_XY,nhomalt_oth,nhomalt_oth_XX,nhomalt_oth_XY,nhomalt_popmax,nhomalt_raw,nhomalt_sas,nhomalt_sas_XX,nhomalt_sas_XY,nhomalt_XX,nhomalt_XY,n_alt_alleles,popmax,faf95,faf95_popmax,faf99
| write s3data://project/user_data/hakon/gnomad/maincols.gord
Query ran in 3.12 sec. Query fetched 0 rows in 780.38 sec (total time 783.50 sec)
CHROM | POS | REF | ALT | FILTER | AS_VQSLOD | AC | AC_afr | AC_afr_XX | AC_afr_XY | ... | nhomalt_sas | nhomalt_sas_XX | nhomalt_sas_XY | nhomalt_XX | nhomalt_XY | n_alt_alleles | popmax | faf95 | faf95_popmax | faf99 |
---|
0 rows × 151 columns
Writing out a subset of the columns in gnomad/all.gord finished in just a few minutes, because the GOR format is much more efficient than VCF at extracting a subset of columns/attributes. Reading maincols.gord is likewise significantly faster than reading all.gord. Let's see how long it takes to count the number of variants.
%%gor
create #temp# = pgor -split 100 user_data/hakon/gnomad/maincols.gord | group chrom -count | calc tsec int(time()/1000);
nor [#temp#] | group -sum -ic allcount,tsec -avg -min -max
Query ran in 0.85 sec Query fetched 1 rows in 0.04 sec (total time 0.89 sec)
min_allCount | max_allCount | avg_allCount | sum_allCount | min_tsec | max_tsec | avg_tsec | sum_tsec | |
---|---|---|---|---|---|---|---|---|
0 | 23880 | 10602398 | 6.602628e+06 | 759302267 | 0 | 64 | 42.104348 | 4842 |
We see that there are 759M variants; with a split of 100, the average time for a worker to count its rows is 42 sec, and all the workers completed within about 2 minutes.
Combining Gnomad version 3.1.2 and version 2.1.1¶
Version 3.1.2 is based on WGS data from about 75k samples while version 2.1.1 is a combination of 125k WES samples and 15k WGS samples. Here we will generate a combined Gnomad table that includes the most important allele frequencies, including "popmax" frequencies and a conservative AF based on a 95% confidence interval. For the conservative estimate, we use the parameterized definition #conserv(). It is derived from the assumption of a Poisson-distributed allele count and the fact that the stdev is the square root of the mean, i.e. ACobs = ACm + 2*sqrt(ACm), which leads us to a formula for our conservative estimate. It converges to a formula that is easier to understand, ACm = ACobs - 2*sqrt(ACobs), when AC is high.
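To connect this to the #conserv() definition in the query below, here is the algebra behind it (a sketch, under the two-standard-deviation Poisson assumption just described). Treating the observed count as up to two standard deviations above the conservative mean and solving the quadratic in sqrt(ACm):

ACobs = ACm + 2*sqrt(ACm)
=> sqrt(ACm) = sqrt(ACobs + 1) - 1
=> ACm = ACobs + 2 - 2*sqrt(ACobs + 1)

This is exactly the numerator float($1) - 2*sqrt(float($1) + 1) + 2 in #conserv, with $1 standing for the observed allele count (the max(0.0, ...) clamp just keeps it non-negative for tiny counts); dividing by the allele number $2 gives the conservative allele frequency, and for large counts it reduces to the simpler ACm ≈ ACobs - 2*sqrt(ACobs) quoted above.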
The query below first generates intermediate transformations in #v2# and #v3#, both of which pick the popmax frequency and population using SPLIT and ATMAX. For v2.1.1 we pick popmax both for the regular and the conservative AF, whereas in #v3# we leverage the fact that the regular popmax frequency is already provided, hence only a simple rename. We then pick out the most important columns that we think are relevant for fast downstream filtering; keeping the column count low (compared with close to 1000 columns) matters when we have to join a table with close to 800M rows in downstream annotation analysis such as the one intended for this table.
def ##gnomad_freq## = ref/variants/gnomad_freq.gorz;
def ##freq_max_file## = ##gnomad_freq##;
def #conserv($1,$2) = if($1 = '.' or $2 = '.','0',if(float($2)!=0,str(max(0.0,float(float($1) - 2*sqrt(float($1) + 1)+2)/float($2))),'0'));
create #v2# = pgor -split 100 ##gnomad_freq##
| calc popmax_af gnomad_af_afr + ',' + gnomad_af_amr + ',' + gnomad_af_eas
+ ',' + gnomad_af_nfe + ',' + gnomad_af_sas
| calc popmax 'afr,amr,eas,nfe,sas'
| select 1-4,popmax,popmax_af
| split popmax,popmax_af
| atmax 1 -gc ref,alt popmax_af
| replace popmax_af gform(popmax_af,4,5)
| varjoin -r -norm <(gor ##gnomad_freq##
| calc popmax_c_af #conserv(gnomad_ac_afr,gnomad_an_afr) + ',' + #conserv(gnomad_ac_amr,gnomad_an_amr) + ',' + #conserv(gnomad_ac_eas,gnomad_an_eas)
+ ',' + #conserv(gnomad_ac_nfe,gnomad_an_nfe)+ ',' + #conserv(gnomad_ac_sas,gnomad_an_sas)
| calc popmax_c 'afr,amr,eas,nfe,sas'
| select 1-4,popmax_c_af,popmax_c
| split popmax_c,popmax_c_af
| atmax 1 -gc ref,alt popmax_c_af
)
| varjoin -r -norm <(gor ##gnomad_freq## | select 1-4,gnomad_an,gnomad_genomes_af,gnomad_genomes_hom,gnomad_exomes_af,gnomad_exomes_hom,gnomad_genomes_qc,gnomad_exomes_qc)
| rename gnomad_(.*) #{1}
| replace popmax_af,popmax_c_af,genomes_af,exomes_af if(isfloat(#rc),gform(float(#rc),4,4),if(#rc='.','0',#rc))
;
create #v3# = pgor -split 500 user_data/hakon/gnomad/maincols.gord
| rename af_popmax popmax_af
| calc popmax_c_af #conserv(ac_afr,an_afr) + ',' + #conserv(ac_amr,an_amr) + ',' + #conserv(ac_eas,an_eas)
+ ',' + #conserv(ac_nfe,an_nfe)+ ',' + #conserv(ac_sas,an_sas)
| calc popmax_c_af_XX #conserv(ac_afr_XX,an_afr_XX) + ',' + #conserv(ac_amr_XX,an_amr_XX) + ',' + #conserv(ac_eas_XX,an_eas_XX)
+ ',' + #conserv(ac_nfe_XX,an_nfe_XX)+ ',' + #conserv(ac_sas_XX,an_sas_XX)
| calc popmax_c_af_XY #conserv(ac_afr_XY,an_afr_XY) + ',' + #conserv(ac_amr_XY,an_amr_XY) + ',' + #conserv(ac_eas_XY,an_eas_XY)
+ ',' + #conserv(ac_nfe_XY,an_nfe_XY)+ ',' + #conserv(ac_sas_XY,an_sas_XY)
| calc popmax_c 'afr,amr,eas,nfe,sas'
| select 1-alt,an,popmax_af,popmax_c_af,popmax_c_af_XX,popmax_c_af_XY,popmax,popmax_c,AF_XX,AF_XY,FILTER,AS_VQSLOD,faf95_*
| split popmax_c,popmax_c_af,popmax_c_af_XX,popmax_c_af_XY
| atmax 1 -gc ref,alt popmax_c_af
| replace popmax_af,popmax_c_af,popmax_c_af_XX,popmax_c_af_XY,AF_XX,AF_XY if(isfloat(#rc),gform(float(#rc),4,4),if(#rc='.','0',#rc))
;
pgor -split 100 [#v2#] | select 1-4 | calc version 'v2' | merge <(gor [#v3#] | select 1-4 | calc version 'v3')
| group 1 -gc ref,alt -set -sc version | rename set_version versions
| varjoin -r -norm -l -e 'NaN' -rprefix v2 [#v2#]
| varjoin -r -norm -l -e 'NaN' -rprefix v3 [#v3#]
| calc version_use if(versions='v2,v3',if(v2_exomes_qc='PASS' and v3_FILTER != 'PASS' or v2_an > v3_an,'v2','v3'),if(versions='v3','v3','v2'))
| calc ref_af,ref_c_af if(version_use = 'v2',v2_popmax_af,v3_popmax_af),if(version_use = 'v2',v2_popmax_c_af,v3_popmax_c_af)
| columnsort chrom,pos,ref,alt,versions,version_use,ref_*
| write s3data://project/user_data/hakon/gnomad/v211v312popmaxconserv.gord
The transformations for #v2# and #v3# ran in less than 10 minutes, but the write took slightly longer, i.e. close to 20 minutes. Because the source table for v3.1.2 is significantly larger than the table for v2.1.1, we used -split 500 for #v3#. Both of these create statements will eventually be garbage collected automatically. In the final write we use only -split 100 because we don't want the end result, which will most likely be used very often, to be stored in too many partitions; a larger number of files per chromosome can slow down seeks, i.e. the large number of random-access lookups that may happen in joins.
Accessing the data¶
Before we end we will perform a quick analysis to count the number of variants in the combined table:
%%gor
create x = pgor -split 100 user_data/hakon/gnomad/v211v312popmaxconserv.gord | group chrom -gc versions -count;
nor [x] | group -sum -ic allcount -gc versions | granno -sum -ic sum_allcount
Query ran in 2.10 sec Query fetched 3 rows in 65.24 sec (total time 67.34 sec)
versions | sum_allCount | sum_sum_allCount | |
---|---|---|---|
0 | v2 | 21160577 | 780527236 |
1 | v2,v3 | 253987685 | 780527236 |
2 | v3 | 505378974 | 780527236 |
With 28 gorworkers on standby, this query ran in about a minute. We see that in total there are 780 million variants in Gnomad, most of them present in v3.1.2. Now let's do a quick lookup test against the original VCF files and compare its speed with the GOR files:
%%gor
gor -p chr7:100000000 user_data/hakon/gnomad/vcf_input.gord
| calc AF tag(info,'AF',';')
| calc AC tag(info,'AC',';')
| calc AN tag(info,'AN',';')
| hide info
| calc t time()
Query ran in 3.85 sec. Query fetched 1 rows in 0.01 sec (total time 3.86 sec)
CHROM | POS | ID | REF | ALT | QUAL | FILTER | IDx | AF | AC | AN | t | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr7 | 100000000 | . | A | G | . | PASS | 20 | 0.000007 | 1 | 151970 | 2848 |
And now read the same information from the GOR files:
%%gor
gor -p chr7:100000000 user_data/hakon/gnomad/all.gord
| select 1-4,AF,AC,AN | calc t time()
Query ran in 2.01 sec Query fetched 1 rows in 0.00 sec (total time 2.02 sec)
CHROM | POS | REF | ALT | AF | AC | AN | t | |
---|---|---|---|---|---|---|---|---|
0 | chr7 | 100000000 | A | G | 0.000007 | 1 | 151970 | 1864 |
We see that the GOR lookup is almost twice as fast, and the difference grows when we access more attributes:
%%gor
nor <(gor -p chr7:100000000 user_data/hakon/gnomad/vcf_input.gord)
| select chrom,pos,ref,alt,id,info
| split info -s ';'
Query ran in 3.10 sec. Query fetched 872 rows in 2.73 sec (total time 5.83 sec)
CHROM | POS | REF | ALT | ID | INFO | |
---|---|---|---|---|---|---|
0 | chr7 | 100000000 | A | G | . | AC=1 |
1 | chr7 | 100000000 | A | G | . | AN=151970 |
2 | chr7 | 100000000 | A | G | . | AF=6.58025e-06 |
3 | chr7 | 100000000 | A | G | . | popmax=nfe |
4 | chr7 | 100000000 | A | G | . | faf95_popmax=0.00000 |
... | ... | ... | ... | ... | ... | ... |
867 | chr7 | 100000000 | A | G | . | dp_hist_all_n_larger=1 |
868 | chr7 | 100000000 | A | G | . | ab_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|1... |
869 | chr7 | 100000000 | A | G | . | cadd_raw_score=0.192810 |
870 | chr7 | 100000000 | A | G | . | cadd_phred=3.04200 |
871 | chr7 | 100000000 | A | G | . | vep=G|intron_variant&non_coding_transcript_var... |
872 rows × 6 columns
%%gor
nor <(gor -p chr7:100000000 user_data/hakon/gnomad/all.gord)
| unpivot 5- | calc t time()
Query ran in 0.49 sec Query fetched 909 rows in 0.22 sec (total time 0.71 sec)
CHROM | POS | REF | ALT | Col_Name | Col_Value | t | |
---|---|---|---|---|---|---|---|
0 | chr7 | 100000000 | A | G | FILTER | PASS | 645 |
1 | chr7 | 100000000 | A | G | AC | 1 | 645 |
2 | chr7 | 100000000 | A | G | AC_XX | 0 | 645 |
3 | chr7 | 100000000 | A | G | AC_XY | 1 | 645 |
4 | chr7 | 100000000 | A | G | AC_afr | 0 | 645 |
... | ... | ... | ... | ... | ... | ... | ... |
904 | chr7 | 100000000 | A | G | splice_ai_max_ds | . | 646 |
905 | chr7 | 100000000 | A | G | transmitted_singleton | . | 646 |
906 | chr7 | 100000000 | A | G | variant_type | snv | 646 |
907 | chr7 | 100000000 | A | G | vep | G|intron_variant&non_coding_transcript_variant... | 646 |
908 | chr7 | 100000000 | A | G | was_mixed | . | 646 |
909 rows × 7 columns
We see that with GOR we fetch all the attributes for a variant within a second, about 5x faster than the VCF lookup.