GnomAD variant liftover¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "internal_hg19"
%env GOR_API_PROJECT={project}
gnomAD liftover from hg38 to hg19¶
Here we will explore the liftover of gnomAD v4 from hg38 to build hg19 that GDX is still using.
The gnomAD v4 exome data was downloaded and stored in user_data/ref_gnomad4 in internal-hg19 project. This includes both SNVs, CNVs and SVs.
Converting VCF attribute value tags to GOR columns.¶
The VCF files store all the information about the variatns in a single INFO column using attribute value structure. This is not only difficult to read but also inefficient in processing and storage. Therefore, we like to convert this information to multi-columnar GOR files.
We start by locating one of the source files and read its header. This can be done like this:
aws s3 cp s3://gdb-demo-dev-blobstorage/shared/extdata/hg38/gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chr4.vcf.gz - | gunzip - | head -n590 > header_example.txt
Note that we fine tune the head command with a binary-like search until we find where the first data line starts, i.e. to get all the attribute descriptions but NOT the massive data rows.
We then take this txt file and eliminate the ## header lines to make it possible to read it with NOR. We save this to header_example.tsv
Now we can run the following query to generate a description file.
%%gor
nor user_data/ref_gnomad4/files/header_example.tsv
| where contains(col1,'INFO=<ID=')
| replace col1 replace(replace(col1,'INFO=<',''),'>','')
| calc ID tag(col1,'ID',',')
| calc Number tag(col1,'Number',',')
| calc Type tag(col1,'Type',',')
| calc Description replace(tag(col1,'Description',','),'"','')
| select id-
| rename ID Column
/* | write user_data/ref_gnomad4/files/exome_col_descr.tsv */
Query ran in 0.19 sec Query fetched 583 rows in 0.02 sec (total time 0.21 sec)
Column | Number | Type | Description | |
---|---|---|---|---|
0 | AC | A | Integer | Alternate allele count |
1 | AN | 1 | Integer | Total number of alleles |
2 | AF | A | Float | Alternate allele frequency |
3 | grpmax | A | String | Genetic ancestry group with the maximum allele... |
4 | fafmax_faf95_max | A | Float | Maximum filtering allele frequency (using Pois... |
... | ... | ... | ... | ... |
578 | VRS_Allele_IDs | R | String | The computed identifiers for the GA4GH VRS All... |
579 | VRS_Starts | R | Integer | Interresidue coordinates used as the location ... |
580 | VRS_Ends | R | Integer | Interresidue coordinates used as the location ... |
581 | VRS_States | . | String | The literal sequence states used for the GA4GH... |
582 | vep | . | String | Consequence annotations from Ensembl VEP. Form... |
583 rows × 4 columns
We see that this has 582 columns. To transform the data we can now make a list from the columns using GROUP -lis -sc column. Few of the last columns, such as VEP, are annotation columns we are not interested in. The other we put into our #colslist# definition as shown below.
In the following query, we read the VCF data from a GORD file and transform it with the pivot approach, as described in https://docs.gorpipe.org/notebooks/Gnomad.html#notebooks-3
We will use the liftover definition file to fetch the end position of the chromosomes in hg38 while we simply use GROUP chrom to get them for hg19. With the following definition, we can modify the segment from SEGSPAN to fully cover hg38.
%%gor
gor <(nor <(gor #genes# | group chrom | segspan -maxseg 10000000)
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop=max_bpstop,max_tEnd,bpstop)
)
| granno genome -count | where #2 < #3 and chrom = 'chr17'
Query ran in 0.16 sec Query fetched 16 rows in 0.31 sec (total time 0.47 sec)
Chrom | bpStart | bpStop | segCount | max_bpStop | max_tEnd | allCount | |
---|---|---|---|---|---|---|---|
0 | chr17 | 0 | 5074700 | 1 | 81195210 | 83247441 | 433 |
1 | chr17 | 5074700 | 10149401 | 1 | 81195210 | 83247441 | 433 |
2 | chr17 | 10149401 | 15224101 | 1 | 81195210 | 83247441 | 433 |
3 | chr17 | 15224101 | 20298802 | 1 | 81195210 | 83247441 | 433 |
4 | chr17 | 20298802 | 25373502 | 1 | 81195210 | 83247441 | 433 |
5 | chr17 | 25373502 | 30448203 | 1 | 81195210 | 83247441 | 433 |
6 | chr17 | 30448203 | 35522904 | 1 | 81195210 | 83247441 | 433 |
7 | chr17 | 35522904 | 40597605 | 1 | 81195210 | 83247441 | 433 |
8 | chr17 | 40597605 | 45672305 | 1 | 81195210 | 83247441 | 433 |
9 | chr17 | 45672305 | 50747006 | 1 | 81195210 | 83247441 | 433 |
10 | chr17 | 50747006 | 55821706 | 1 | 81195210 | 83247441 | 433 |
11 | chr17 | 55821706 | 60896407 | 1 | 81195210 | 83247441 | 433 |
12 | chr17 | 60896407 | 65971107 | 1 | 81195210 | 83247441 | 433 |
13 | chr17 | 65971107 | 71045808 | 1 | 81195210 | 83247441 | 433 |
14 | chr17 | 71045808 | 76120509 | 1 | 81195210 | 83247441 | 433 |
15 | chr17 | 76120509 | 83247441 | 1 | 81195210 | 83247441 | 433 |
Above, we see for instance the last split segment that ends in '83247441' and not '8119521' as plain SEGSPAN in hg19 would let it do.
%%gor
def #top# = top 10;
def #colslist# = AC,AN,AF,grpmax,fafmax_faf95_max,fafmax_faf95_max_gen_anc,AC_XX,AF_XX,AN_XX,nhomalt_XX,AC_XY,AF_XY,AN_XY,nhomalt_XY,nhomalt,AC_afr_XX,AF_afr_XX,AN_afr_XX,nhomalt_afr_XX,AC_afr_XY,AF_afr_XY,AN_afr_XY,nhomalt_afr_XY,AC_afr,AF_afr,AN_afr,nhomalt_afr,AC_amr_XX,AF_amr_XX,AN_amr_XX,nhomalt_amr_XX,AC_amr_XY,AF_amr_XY,AN_amr_XY,nhomalt_amr_XY,AC_amr,AF_amr,AN_amr,nhomalt_amr,AC_asj_XX,AF_asj_XX,AN_asj_XX,nhomalt_asj_XX,AC_asj_XY,AF_asj_XY,AN_asj_XY,nhomalt_asj_XY,AC_asj,AF_asj,AN_asj,nhomalt_asj,AC_eas_XX,AF_eas_XX,AN_eas_XX,nhomalt_eas_XX,AC_eas_XY,AF_eas_XY,AN_eas_XY,nhomalt_eas_XY,AC_eas,AF_eas,AN_eas,nhomalt_eas,AC_fin_XX,AF_fin_XX,AN_fin_XX,nhomalt_fin_XX,AC_fin_XY,AF_fin_XY,AN_fin_XY,nhomalt_fin_XY,AC_fin,AF_fin,AN_fin,nhomalt_fin,AC_mid_XX,AF_mid_XX,AN_mid_XX,nhomalt_mid_XX,AC_mid_XY,AF_mid_XY,AN_mid_XY,nhomalt_mid_XY,AC_mid,AF_mid,AN_mid,nhomalt_mid,AC_nfe_XX,AF_nfe_XX,AN_nfe_XX,nhomalt_nfe_XX,AC_nfe_XY,AF_nfe_XY,AN_nfe_XY,nhomalt_nfe_XY,AC_nfe,AF_nfe,AN_nfe,nhomalt_nfe,AC_non_ukb_XX,AF_non_ukb_XX,AN_non_ukb_XX,nhomalt_non_ukb_XX,AC_non_ukb_XY,AF_non_ukb_XY,AN_non_ukb_XY,nhomalt_non_ukb_XY,AC_non_ukb,AF_non_ukb,AN_non_ukb,nhomalt_non_ukb,AC_non_ukb_afr_XX,AF_non_ukb_afr_XX,AN_non_ukb_afr_XX,nhomalt_non_ukb_afr_XX,AC_non_ukb_afr_XY,AF_non_ukb_afr_XY,AN_non_ukb_afr_XY,nhomalt_non_ukb_afr_XY,AC_non_ukb_afr,AF_non_ukb_afr,AN_non_ukb_afr,nhomalt_non_ukb_afr,AC_non_ukb_amr_XX,AF_non_ukb_amr_XX,AN_non_ukb_amr_XX,nhomalt_non_ukb_amr_XX,AC_non_ukb_amr_XY,AF_non_ukb_amr_XY,AN_non_ukb_amr_XY,nhomalt_non_ukb_amr_XY,AC_non_ukb_amr,AF_non_ukb_amr,AN_non_ukb_amr,nhomalt_non_ukb_amr,AC_non_ukb_asj_XX,AF_non_ukb_asj_XX,AN_non_ukb_asj_XX,nhomalt_non_ukb_asj_XX,AC_non_ukb_asj_XY,AF_non_ukb_asj_XY,AN_non_ukb_asj_XY,nhomalt_non_ukb_asj_XY,AC_non_ukb_asj,AF_non_ukb_asj,AN_non_ukb_asj,nhomalt_non_ukb_asj,AC_non_ukb_eas_XX,AF_non_ukb_eas_XX,AN_non_ukb_eas_XX,nhomalt_non_ukb_eas_XX,AC_non_ukb_eas_XY,AF_non_ukb_eas_XY,AN_non_ukb_eas_XY,nhomalt_non_ukb_eas_XY,AC_non_ukb_eas,AF_non_ukb_eas,AN_non_ukb_eas,nhomalt_non_ukb_eas,AC_non_ukb_fin_XX,AF_non_ukb_fin_XX,AN_non_ukb_fin_XX,nhomalt_non_ukb_fin_XX,AC_non_ukb_fin_XY,AF_non_ukb_fin_XY,AN_non_ukb_fin_XY,nhomalt_non_ukb_fin_XY,AC_non_ukb_fin,AF_non_ukb_fin,AN_non_ukb_fin,nhomalt_non_ukb_fin,AC_non_ukb_mid_XX,AF_non_ukb_mid_XX,AN_non_ukb_mid_XX,nhomalt_non_ukb_mid_XX,AC_non_ukb_mid_XY,AF_non_ukb_mid_XY,AN_non_ukb_mid_XY,nhomalt_non_ukb_mid_XY,AC_non_ukb_mid,AF_non_ukb_mid,AN_non_ukb_mid,nhomalt_non_ukb_mid,AC_non_ukb_nfe_XX,AF_non_ukb_nfe_XX,AN_non_ukb_nfe_XX,nhomalt_non_ukb_nfe_XX,AC_non_ukb_nfe_XY,AF_non_ukb_nfe_XY,AN_non_ukb_nfe_XY,nhomalt_non_ukb_nfe_XY,AC_non_ukb_nfe,AF_non_ukb_nfe,AN_non_ukb_nfe,nhomalt_non_ukb_nfe,AC_non_ukb_raw,AF_non_ukb_raw,AN_non_ukb_raw,nhomalt_non_ukb_raw,AC_non_ukb_remaining_XX,AF_non_ukb_remaining_XX,AN_non_ukb_remaining_XX,nhomalt_non_ukb_remaining_XX,AC_non_ukb_remaining_XY,AF_non_ukb_remaining_XY,AN_non_ukb_remaining_XY,nhomalt_non_ukb_remaining_XY,AC_non_ukb_remaining,AF_non_ukb_remaining,AN_non_ukb_remaining,nhomalt_non_ukb_remaining,AC_non_ukb_sas_XX,AF_non_ukb_sas_XX,AN_non_ukb_sas_XX,nhomalt_non_ukb_sas_XX,AC_non_ukb_sas_XY,AF_non_ukb_sas_XY,AN_non_ukb_sas_XY,nhomalt_non_ukb_sas_XY,AC_non_ukb_sas,AF_non_ukb_sas,AN_non_ukb_sas,nhomalt_non_ukb_sas,AC_raw,AF_raw,AN_raw,nhomalt_raw,AC_remaining_XX,AF_remaining_XX,AN_remaining_XX,nhomalt_remaining_XX,AC_remaining_XY,AF_remaining_XY,AN_remaining_XY,nhomalt_remaining_XY,AC_remaining,AF_remaining,AN_remaining,nhomalt_remaining,AC_sas_XX,AF_sas_XX,AN_sas_XX,nhomalt_sas_XX,AC_sas_XY,AF_sas_XY,AN_sas_XY,nhomalt_sas_XY,AC_sas,AF_sas,AN_sas,nhomalt_sas,AC_joint_XX,AF_joint_XX,AN_joint_XX,nhomalt_joint_XX,AC_joint_XY,AF_joint_XY,AN_joint_XY,nhomalt_joint_XY,AC_joint,AF_joint,AN_joint,nhomalt_joint,AC_joint_afr_XX,AF_joint_afr_XX,AN_joint_afr_XX,nhomalt_joint_afr_XX,AC_joint_afr_XY,AF_joint_afr_XY,AN_joint_afr_XY,nhomalt_joint_afr_XY,AC_joint_afr,AF_joint_afr,AN_joint_afr,nhomalt_joint_afr,AC_joint_ami_XX,AF_joint_ami_XX,AN_joint_ami_XX,nhomalt_joint_ami_XX,AC_joint_ami_XY,AF_joint_ami_XY,AN_joint_ami_XY,nhomalt_joint_ami_XY,AC_joint_ami,AF_joint_ami,AN_joint_ami,nhomalt_joint_ami,AC_joint_amr_XX,AF_joint_amr_XX,AN_joint_amr_XX,nhomalt_joint_amr_XX,AC_joint_amr_XY,AF_joint_amr_XY,AN_joint_amr_XY,nhomalt_joint_amr_XY,AC_joint_amr,AF_joint_amr,AN_joint_amr,nhomalt_joint_amr,AC_joint_asj_XX,AF_joint_asj_XX,AN_joint_asj_XX,nhomalt_joint_asj_XX,AC_joint_asj_XY,AF_joint_asj_XY,AN_joint_asj_XY,nhomalt_joint_asj_XY,AC_joint_asj,AF_joint_asj,AN_joint_asj,nhomalt_joint_asj,AC_joint_eas_XX,AF_joint_eas_XX,AN_joint_eas_XX,nhomalt_joint_eas_XX,AC_joint_eas_XY,AF_joint_eas_XY,AN_joint_eas_XY,nhomalt_joint_eas_XY,AC_joint_eas,AF_joint_eas,AN_joint_eas,nhomalt_joint_eas,AC_joint_fin_XX,AF_joint_fin_XX,AN_joint_fin_XX,nhomalt_joint_fin_XX,AC_joint_fin_XY,AF_joint_fin_XY,AN_joint_fin_XY,nhomalt_joint_fin_XY,AC_joint_fin,AF_joint_fin,AN_joint_fin,nhomalt_joint_fin,AC_joint_mid_XX,AF_joint_mid_XX,AN_joint_mid_XX,nhomalt_joint_mid_XX,AC_joint_mid_XY,AF_joint_mid_XY,AN_joint_mid_XY,nhomalt_joint_mid_XY,AC_joint_mid,AF_joint_mid,AN_joint_mid,nhomalt_joint_mid,AC_joint_nfe_XX,AF_joint_nfe_XX,AN_joint_nfe_XX,nhomalt_joint_nfe_XX,AC_joint_nfe_XY,AF_joint_nfe_XY,AN_joint_nfe_XY,nhomalt_joint_nfe_XY,AC_joint_nfe,AF_joint_nfe,AN_joint_nfe,nhomalt_joint_nfe,AC_joint_raw,AF_joint_raw,AN_joint_raw,nhomalt_joint_raw,AC_joint_remaining_XX,AF_joint_remaining_XX,AN_joint_remaining_XX,nhomalt_joint_remaining_XX,AC_joint_remaining_XY,AF_joint_remaining_XY,AN_joint_remaining_XY,nhomalt_joint_remaining_XY,AC_joint_remaining,AF_joint_remaining,AN_joint_remaining,nhomalt_joint_remaining,AC_joint_sas_XX,AF_joint_sas_XX,AN_joint_sas_XX,nhomalt_joint_sas_XX,AC_joint_sas_XY,AF_joint_sas_XY,AN_joint_sas_XY,nhomalt_joint_sas_XY,AC_joint_sas,AF_joint_sas,AN_joint_sas,nhomalt_joint_sas,AC_grpmax,AF_grpmax,AN_grpmax,nhomalt_grpmax,grpmax_non_ukb,AC_grpmax_non_ukb,AF_grpmax_non_ukb,AN_grpmax_non_ukb,nhomalt_grpmax_non_ukb,grpmax_joint,AC_grpmax_joint,AF_grpmax_joint,AN_grpmax_joint,nhomalt_grpmax_joint,faf95_XX,faf95_XY,faf95,faf95_afr_XX,faf95_afr_XY,faf95_afr,faf95_amr_XX,faf95_amr_XY,faf95_amr,faf95_eas_XX,faf95_eas_XY,faf95_eas,faf95_nfe_XX,faf95_nfe_XY,faf95_nfe,faf95_non_ukb_XX,faf95_non_ukb_XY,faf95_non_ukb,faf95_non_ukb_afr_XX,faf95_non_ukb_afr_XY,faf95_non_ukb_afr,faf95_non_ukb_amr_XX,faf95_non_ukb_amr_XY,faf95_non_ukb_amr,faf95_non_ukb_eas_XX,faf95_non_ukb_eas_XY,faf95_non_ukb_eas,faf95_non_ukb_nfe_XX,faf95_non_ukb_nfe_XY,faf95_non_ukb_nfe,faf95_non_ukb_sas_XX,faf95_non_ukb_sas_XY,faf95_non_ukb_sas,faf95_sas_XX,faf95_sas_XY,faf95_sas,faf99_XX,faf99_XY,faf99,faf99_afr_XX,faf99_afr_XY,faf99_afr,faf99_amr_XX,faf99_amr_XY,faf99_amr,faf99_eas_XX,faf99_eas_XY,faf99_eas,faf99_nfe_XX,faf99_nfe_XY,faf99_nfe,faf99_non_ukb_XX,faf99_non_ukb_XY,faf99_non_ukb,faf99_non_ukb_afr_XX,faf99_non_ukb_afr_XY,faf99_non_ukb_afr,faf99_non_ukb_amr_XX,faf99_non_ukb_amr_XY,faf99_non_ukb_amr,faf99_non_ukb_eas_XX,faf99_non_ukb_eas_XY,faf99_non_ukb_eas,faf99_non_ukb_nfe_XX,faf99_non_ukb_nfe_XY,faf99_non_ukb_nfe,faf99_non_ukb_sas_XX,faf99_non_ukb_sas_XY,faf99_non_ukb_sas,faf99_sas_XX,faf99_sas_XY,faf99_sas,fafmax_faf99_max,fafmax_faf99_max_gen_anc,fafmax_faf95_max_non_ukb,fafmax_faf95_max_gen_anc_non_ukb,fafmax_faf99_max_non_ukb,fafmax_faf99_max_gen_anc_non_ukb,faf95_joint_XX,faf95_joint_XY,faf95_joint,faf95_joint_afr_XX,faf95_joint_afr_XY,faf95_joint_afr,faf95_joint_amr_XX,faf95_joint_amr_XY,faf95_joint_amr,faf95_joint_eas_XX,faf95_joint_eas_XY,faf95_joint_eas,faf95_joint_nfe_XX,faf95_joint_nfe_XY,faf95_joint_nfe,faf95_joint_sas_XX,faf95_joint_sas_XY,faf95_joint_sas,faf99_joint_XX,faf99_joint_XY,faf99_joint,faf99_joint_afr_XX,faf99_joint_afr_XY,faf99_joint_afr,faf99_joint_amr_XX,faf99_joint_amr_XY,faf99_joint_amr,faf99_joint_eas_XX,faf99_joint_eas_XY,faf99_joint_eas,faf99_joint_nfe_XX,faf99_joint_nfe_XY,faf99_joint_nfe,faf99_joint_sas_XX,faf99_joint_sas_XY,faf99_joint_sas,fafmax_faf95_max_joint,fafmax_faf95_max_gen_anc_joint,fafmax_faf99_max_joint,fafmax_faf99_max_gen_anc_joint,fafmax_data_type_joint,age_hist_het_bin_freq,age_hist_het_n_smaller,age_hist_het_n_larger,age_hist_hom_bin_freq,age_hist_hom_n_smaller,age_hist_hom_n_larger,FS,MQ,MQRankSum,QUALapprox,QD,ReadPosRankSum,SOR,VarDP,monoallelic,only_het,transmitted_singleton,sibling_singleton,AS_FS,AS_MQ,AS_MQRankSum,AS_pab_max,AS_QUALapprox,AS_QD,AS_ReadPosRankSum,AS_SB_TABLE,AS_SOR,AS_VarDP,inbreeding_coeff,AS_culprit,AS_VQSLOD,negative_train_site,positive_train_site,allele_type,n_alt_alleles,variant_type,was_mixed,lcr,non_par,segdup,fail_interval_qc,outside_ukb_capture_region,outside_broad_capture_region,revel_max,spliceai_ds_max,pangolin_largest_ds,phylop,sift_max,polyphen_max;
create #hg38_433_splits# = gor <(nor <(gor #genes# | group chrom | segspan -maxseg 10000000)
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop=max_bpstop,max_tEnd,bpstop)
)
| where #2 < #3 | select 1-3;
create #temp# = pgor -split <(gor [#hg38_433_splits#]) user_data/ref_gnomad4/exomes.gord
| #top#
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| replace monoallelic,only_het,transmitted_singleton,sibling_singleton,negative_train_site,positive_train_site,was_mixed,lcr,non_par,segdup,fail_interval_qc,outside_ukb_capture_region,outside_broad_capture_region if(#rc='.',0,1)
;
create #w# = pgor [#temp#]; /* | write s3data://shared/user_data/ref_gnomad4/exome_cols_v2.gord; */
nor [#w#] | top 1 | unpivot #1-
Query ran in 1.52 sec Query fetched 574 rows in 12.09 sec (total time 13.61 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 11994 |
2 | REF | T |
3 | ALT | C |
4 | FILTER | AC0;AS_VQSR |
... | ... | ... |
569 | spliceai_ds_max | . |
570 | pangolin_largest_ds | -0.110000 |
571 | phylop | 1.08800 |
572 | sift_max | . |
573 | polyphen_max | . |
574 rows × 2 columns
Notice that we have applied a replace on the columns (attributes) that are labeled in the VCF to be flags, e.g.:
%%gor
nor user_data/ref_gnomad4/files/exome_col_descr.tsv | where Type in ('Flag')
Query ran in 0.17 sec Query fetched 13 rows in 0.01 sec (total time 0.18 sec)
Column | Number | Type | Description | |
---|---|---|---|---|
0 | monoallelic | 0 | Flag | All samples are homozygous alternate for the v... |
1 | only_het | 0 | Flag | All samples are heterozygous for the variant |
2 | transmitted_singleton | 0 | Flag | Variant was a callset-wide doubleton that was ... |
3 | sibling_singleton | 0 | Flag | Variant was a callset-wide doubleton that was ... |
4 | negative_train_site | 0 | Flag | Variant was used to build the negative trainin... |
5 | positive_train_site | 0 | Flag | Variant was used to build the positive trainin... |
6 | was_mixed | 0 | Flag | Variant type was mixed |
7 | lcr | 0 | Flag | Variant falls within a low complexity region |
8 | non_par | 0 | Flag | Variant (on sex chromosome) falls outside a ps... |
9 | segdup | 0 | Flag | Variant falls within a segmental duplication r... |
10 | fail_interval_qc | 0 | Flag | Less than 85 percent of samples meet 20X cover... |
11 | outside_ukb_capture_region | 0 | Flag | Variant falls outside of UK Biobank exome capt... |
12 | outside_broad_capture_region | 0 | Flag | Variant falls outside of Broad exome capture r... |
CNV and SV¶
We did very much the same for CNVs and the 1672 SV attributes:
%%gor
def #top# = top 10;
def #colslist# = END,SVLEN,SVTYPE,POSMIN,POSMAX,ENDMIN,ENDMAX,Genes,N_EXN_VAR,N_INT_VAR,SC,afr_SC,amr_SC,asj_SC,eas_SC,fin_SC,mid_SC,nfe_SC,sas_SC,SN,afr_SN,amr_SN,asj_SN,eas_SN,fin_SN,mid_SN,nfe_SN,sas_SN,SF,afr_SF,amr_SF,asj_SF,eas_SF,fin_SF,mid_SF,nfe_SF,sas_SF,MALE_SC,afr_MALE_SC,amr_MALE_SC,asj_MALE_SC,eas_MALE_SC,fin_MALE_SC,mid_MALE_SC,nfe_MALE_SC,sas_MALE_SC,MALE_SN,afr_MALE_SN,amr_MALE_SN,asj_MALE_SN,eas_MALE_SN,fin_MALE_SN,mid_MALE_SN,nfe_MALE_SN,sas_MALE_SN,MALE_SF,afr_MALE_SF,amr_MALE_SF,asj_MALE_SF,eas_MALE_SF,fin_MALE_SF,mid_MALE_SF,nfe_MALE_SF,sas_MALE_SF,FEMALE_SC,afr_FEMALE_SC,amr_FEMALE_SC,asj_FEMALE_SC,eas_FEMALE_SC,fin_FEMALE_SC,mid_FEMALE_SC,nfe_FEMALE_SC,sas_FEMALE_SC,FEMALE_SN,afr_FEMALE_SN,amr_FEMALE_SN,asj_FEMALE_SN,eas_FEMALE_SN,fin_FEMALE_SN,mid_FEMALE_SN,nfe_FEMALE_SN,sas_FEMALE_SN,FEMALE_SF,afr_FEMALE_SF,amr_FEMALE_SF,asj_FEMALE_SF,eas_FEMALE_SF,fin_FEMALE_SF,mid_FEMALE_SF,nfe_FEMALE_SF,sas_FEMALE_SF;
gor user_data/ref_gnomad4/cnv/gnomad.v4.0.cnv.all.vcf.gz
| #top#
| sort genome
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| select chrom,pos,end,ref,alt,svlen-
/* | write user_data/ref_gnomad4/cnv_all.gorz */
Query ran in 0.64 sec Query fetched 4 rows in 0.02 sec (total time 0.65 sec)
CHROM | POS | END | REF | ALT | SVLEN | SVTYPE | POSMIN | POSMAX | ENDMIN | ... | sas_FEMALE_SN | FEMALE_SF | afr_FEMALE_SF | amr_FEMALE_SF | asj_FEMALE_SF | eas_FEMALE_SF | fin_FEMALE_SF | mid_FEMALE_SF | nfe_FEMALE_SF | sas_FEMALE_SF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 925634 | 931187 | N | <DEL> | 5553 | DEL | 925634 | 925634 | 930434 | ... | 7959 | 0.000009 | 0 | 0 | 0 | 0.000126 | 0 | 0 | 0.000006 | 0 |
1 | chr1 | 925634 | 935994 | N | <DUP> | 10360 | DUP | 912956 | 925634 | 935994 | ... | 7959 | 0.000009 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 |
2 | chr1 | 935675 | 948701 | N | <DEL> | 13026 | DEL | 935675 | 935675 | 948701 | ... | 7960 | 0.000004 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 |
3 | chr1 | 940979 | 943475 | N | <DUP> | 2496 | DUP | 940979 | 940979 | 943475 | ... | 7960 | 0.000004 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 |
4 rows × 95 columns
Here we had to do use SORT genome before PIVOT because the single VCF file has data from multiple chromosomes (and no index) and the fact that VCF use numerical order of chromosomes while GOR uses alphabetical order.
Below we show a reduced version of the SV query (because the 1600+ attributes take too much space!):
%%gor
def #top# = top 10;
def #colslist# = AC,ALGORITHMS,AN,BOTHSIDES_SUPPORT,CHR2,CPX_INTERVALS,CPX_TYPE,END,END2,EVIDENCE,FEMALE_NCR,FEMALE_PCRMINUS_NCR,FEMALE_PCRPLUS_NCR,HIGH_SR_BACKGROUND,LIKELY_REFERENCE_ARTIFACT,LOW_CONFIDENCE_REPETITIVE_LARGE_DUP,MALE_NCR,MALE_PCRMINUS_NCR,MALE_PCRPLUS_NCR,MEMBERS,MULTIALLELIC,NCR,OUTLIER_SAMPLE_ENRICHED_LENIENT,PCRMINUS_NCR,PCRPLUS_NCR,PESR_GT_OVERDISPERSION,POS2,PREDICTED_BREAKEND_EXONIC,PREDICTED_COPY_GAIN,PREDICTED_DUP_PARTIAL,PREDICTED_INTERGENIC,PREDICTED_INTRAGENIC_EXON_DUP,PREDICTED_INTRONIC,PREDICTED_INV_SPAN,PREDICTED_LOF,PREDICTED_MSV_EXON_OVERLAP,PREDICTED_NEAREST_TSS,PREDICTED_PARTIAL_EXON_DUP,PREDICTED_PROMOTER,PREDICTED_TSS_DUP,PREDICTED_UTR,RESOLVED_POSTHOC,SOURCE,STRANDS,SVLEN,SVTYPE,UNRESOLVED_TYPE,AF;
pgor user_data/ref_gnomad4/sv.gord
| #top#
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,qual,filter,x_1,x_2
| pivot -gc ref,alt,qual,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| columnreorder chrom,pos,END,END2
/* | write s3data://shared/user_data/ref_gnomad4/sv_1672_cols.gord */
Query ran in 1.54 sec Query fetched 366 rows in 2.49 sec (total time 4.04 sec)
CHROM | POS | END | END2 | REF | ALT | QUAL | FILTER | AC | ALGORITHMS | ... | PREDICTED_PROMOTER | PREDICTED_TSS_DUP | PREDICTED_UTR | RESOLVED_POSTHOC | SOURCE | STRANDS | SVLEN | SVTYPE | UNRESOLVED_TYPE | AF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10000 | 295666 | . | N | <DUP> | 134 | HIGH_NCR | 139 | depth | ... | . | . | . | . | . | . | 285666 | DUP | . | 0.001997 |
1 | chr1 | 10434 | 10434 | 242183515 | N | <BND> | 260 | HIGH_NCR;UNRESOLVED | 8474 | manta | ... | . | . | . | . | . | ++ | -1 | BND | MIXED_BREAKENDS | 0.114134 |
2 | chr1 | 10440 | 10440 | 10977 | N | <BND> | 198 | HIGH_NCR;UNRESOLVED | 466 | manta | ... | . | . | . | . | . | +- | -1 | BND | SINGLE_ENDER_+- | 0.004201 |
3 | chr1 | 10450 | 10450 | 80258275 | N | <BND> | 287 | HIGH_NCR;UNRESOLVED | 21766 | manta | ... | . | . | . | . | . | ++ | -1 | BND | MIXED_BREAKENDS | 0.323899 |
4 | chr1 | 10464 | 10464 | 10502 | N | <BND> | 198 | HIGH_NCR;UNRESOLVED | 3119 | manta | ... | . | . | . | . | . | +- | -1 | BND | SINGLE_ENDER_+- | 0.036985 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | chrY | 2809388 | 2815388 | . | N | <DEL> | 107 | PASS | 3 | depth | ... | . | . | . | . | . | . | 6000 | DEL | . | 0.000101674 |
362 | chrY | 2813985 | 2818893 | . | N | <DEL> | 351 | PASS | 1 | manta | ... | . | . | . | . | . | . | 4908 | DEL | . | 3.31719e-05 |
363 | chrY | 2816900 | 2817065 | . | N | <DUP> | 859 | PASS | 1 | wham | ... | . | . | . | . | . | . | 165 | DUP | . | 3.31719e-05 |
364 | chrY | 2817388 | 2822788 | . | N | <DEL> | 999 | PASS | 2 | depth | ... | . | . | . | . | . | . | 5400 | DEL | . | 6.77025e-05 |
365 | chrY | 2817388 | 2913388 | . | N | <DUP> | 999 | PASS | 8 | depth | ... | . | . | . | . | . | . | 96000 | DUP | . | 0.00026888 |
366 rows × 54 columns
Liftover¶
For the liftover to hg19 there are slighltly different concerns for SNVs that the segment based CNVs and SVs.
Let's first look at the SNP and small InDels. We will use PARALLEL to split up the work to a high number of tasks, each dealing with 1M rows.
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def #source_table# = user_data/ref_gnomad4/exome_cols_v2.gord;
create #c# = pgor -split 100 #source_table#
| group 100000 -count;
create #segs# = gor [#c#]
| seghist 1000000;
create #parts# = nor [#segs#]
| replace #2 #2+1
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop = max_bpstop,max_tEnd,bpstop)
| hide count,max_tEnd;
""")
%%gor
$mydefs
nor [#parts#] | granno -count | top 3
Query ran in 1.64 sec Query fetched 3 rows in 18.66 sec (total time 20.30 sec)
Chrom | bpStart | bpStop | max_bpStop | allCount | |
---|---|---|---|---|---|
0 | chr1 | 1 | 6134734 | 249250621 | 196 |
1 | chr1 | 6134735 | 15123500 | 249250621 | 196 |
2 | chr1 | 15123501 | 21937240 | 249250621 | 196 |
In #parts# we shift the start position by one because the -p option in GOR is one-based while all segments use zero-based system (confusing - yes!). The query to execute the liftover uses the -var option in LIFTOVER, because for variants where the strand changes we must reverse-complement the refefrence sequence and the alternative.
mydefs = qh.add_many("""
create #lift# = parallel -parts [#parts#] <(gor -p #{col:chrom}:#{col:bpstart}-#{col:bpstop} #source_table#
| liftover config/liftover/hg38tohg19.gor -var -build hg38);
create #write# = pgor [#lift#] | write s3data://shared/user_data/hakon/gnomad_v4.gord;
""")
The #lift# relation above has many partitions and each partition my have data from multiple chromosomes. However, the final write will create a dictionary with regular PGOR split, i.e. 2 files for the bigger chromosomes in total 38 files, each file with data from single chromosome. We will not run the above query here but it takes only few minutes. We will however count the number of variants:
%%gor
create x = pgor -split 100 user_data/hakon/gnomad_v4.gord | group chrom -gc hg38_liftoverStatus -count;
nor [x] | group -gc hg38_liftoverStatus -sum -ic allcount | granno -sum -ic #2 | calc f #2/#3
Query ran in 0.31 sec Query fetched 2 rows in 0.03 sec (total time 0.34 sec)
hg38_liftoverStatus | sum_allCount | sum_sum_allCount | f | |
---|---|---|---|---|
0 | mapped | 181793123 | 182145119 | 0.998067 |
1 | unmapped | 351996 | 182145119 | 0.001933 |
Our dataset has 182M rows x 580 cols. We notice that 0.2% of the variants do not translate from hg38 to hg19. This is a bigger issue in CNVs where we have larger variants.
CNV liftover¶
When we do the CNV liftover, we are usually dealing with variants that span many bases. We now use -seg option for LIFTOVER, i.e. NO ref/alt to deal with.
qh2 = GQH.GOR_Query_Helper()
mydefs2 = ""
mydefs2 = qh2.add_many("""
create #pseg# = gor user_data/ref_gnomad4/cnv_all.gorz | select 1-3 |rownum
| calc oldsize #3-#2
| liftover config/liftover/hg38tohg19.gor -seg -build hg38;
""")
%%gor
$mydefs2
gor [#pseg#] | calc same if(#3-#2 = oldsize,1,0) | group genome -gc same,hg38_liftoverStatus -count
Query ran in 0.60 sec Query fetched 2 rows in 0.22 sec (total time 0.82 sec)
CHROM | bpStart | bpStop | same | hg38_liftoverStatus | allCount | |
---|---|---|---|---|---|---|
0 | chrA | 0 | 1000000000 | 0 | unmapped | 3364 |
1 | chrA | 0 | 1000000000 | 1 | mapped | 48934 |
Now close to 10% of the CNVs do not map. However, all the ones that do map conserver their segment size. This is because the LIFTOVER command only maps the segmens where both start and stop positions move in tandem. We can relax this condition by applying liftover on the start position and stop position separately and then join these individual liftover steps together by their row-numbers, e.g. :
mydefs2 = qh2.add_many("""
create #psta# = gor user_data/ref_gnomad4/cnv_all.gorz | select 1,2 | rownum
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
create #pend# = gor user_data/ref_gnomad4/cnv_all.gorz | select 1,3 | rownum | sort chrom
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
""")
%%gor
$mydefs2
nor [#psta#] | group -gc hg38_liftoverStatus -count
| merge <(nor [#pend#] | group -gc hg38_liftoverStatus -count)
Query ran in 0.27 sec Query fetched 4 rows in 0.16 sec (total time 0.43 sec)
hg38_liftoverStatus | allCount | |
---|---|---|
0 | mapped | 52285 |
1 | mapped | 52290 |
2 | unmapped | 13 |
3 | unmapped | 8 |
Now we see that almost all of the single-nucleotide-positions are mapped. Hence, if we map the start and end positions independently, we are likely to get a better yield in the liftover process.
%%gor
$mydefs2
nor [#pseg#] | map -c rownum <(nor [#psta#] | select rownum,pos,chrom | rename chrom sta_chrom)
| map -c rownum <(nor [#pend#] | select rownum,end,chrom | rename chrom end_chrom)
| calc s2 endx-posx
| calc f s2/oldsize
| calc similar if(f < 1.05 and f > 0.95 and sta_chrom = end_chrom,1,0)
| group -gc similar -count | granno -sum -ic allcount | calc fr allcount/sum_allcount
Query ran in 0.10 sec Query fetched 2 rows in 0.42 sec (total time 0.52 sec)
similar | allCount | sum_allCount | fr | |
---|---|---|---|---|
0 | 0 | 238 | 52298 | 0.004551 |
1 | 1 | 52060 | 52298 | 0.995449 |
Now we see that only 0.5% of the CNV segments do not map, if we insist that the change in their size is no more than 5% (picked from a hat!). We can use this approach to map most of the CNVs to hg19, allowing for a slight fuzzyness in the size.
%%gor
$mydefs2
create #cnvs# = nor user_data/ref_gnomad4/cnv_all.gorz | rownum | columnreorder rownum | sort -c rownum;
gor <(nor [#pseg#] | map -c rownum <(nor [#psta#] | select rownum,pos,chrom | rename chrom sta_chrom)
| map -c rownum <(nor [#pend#] | select rownum,end,chrom | rename chrom end_chrom)
| calc s2 endx-posx
| calc f s2/oldsize
| where f < 1.05 and f > 0.95 and sta_chrom = end_chrom
| select chrom,posx,endx,rownum
| rename #2 cnv_start
| rename #3 cnv_end
| sort -c rownum
| map -ordered -c rownum <(nor [#cnvs#] | prefix chrom,pos,end hg38)
| hide rownum
)
| sort genome
| select 1-cnv_end,ref-,hg38_*
| top 10 /* write user_data/hakon/gnomAD/cnv_all_v4_hg19.gorz */
Query ran in 0.13 sec Query fetched 10 rows in 5.00 sec (total time 5.14 sec)
CHROM | cnv_start | cnv_end | REF | ALT | SVLEN | SVTYPE | POSMIN | POSMAX | ENDMIN | ... | amr_FEMALE_SF | asj_FEMALE_SF | eas_FEMALE_SF | fin_FEMALE_SF | mid_FEMALE_SF | nfe_FEMALE_SF | sas_FEMALE_SF | hg38_CHROM | hg38_POS | hg38_END | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 861014 | 866567 | N | <DEL> | 5553 | DEL | 925634 | 925634 | 930434 | ... | 0 | 0 | 0.000126 | 0 | 0 | 0.000006 | 0 | chr1 | 925634 | 931187 |
1 | chr1 | 861014 | 871374 | N | <DUP> | 10360 | DUP | 912956 | 925634 | 935994 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 | chr1 | 925634 | 935994 |
2 | chr1 | 871055 | 884081 | N | <DEL> | 13026 | DEL | 935675 | 935675 | 948701 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 | chr1 | 935675 | 948701 |
3 | chr1 | 876359 | 878855 | N | <DUP> | 2496 | DUP | 940979 | 940979 | 943475 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 | chr1 | 940979 | 943475 |
4 | chr1 | 876359 | 883710 | N | <DEL> | 7351 | DEL | 940979 | 940979 | 948330 | ... | 0 | 0 | 0.000126 | 0 | 0 | 0.000000 | 0 | chr1 | 940979 | 948330 |
5 | chr1 | 877386 | 880278 | N | <DUP> | 2892 | DUP | 942006 | 942006 | 944898 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 | chr1 | 942006 | 944898 |
6 | chr1 | 878538 | 881131 | N | <DEL> | 2593 | DEL | 943158 | 943158 | 945751 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | chr1 | 943158 | 945751 |
7 | chr1 | 883414 | 887617 | N | <DEL> | 4203 | DEL | 948034 | 948034 | 952237 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | chr1 | 948034 | 952237 |
8 | chr1 | 895867 | 901197 | N | <DUP> | 5330 | DUP | 960487 | 960487 | 965817 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | chr1 | 960487 | 965817 |
9 | chr1 | 895867 | 912119 | N | <DEL> | 16252 | DEL | 960487 | 960487 | 976739 | ... | 0 | 0 | 0.000000 | 0 | 0 | 0.000006 | 0 | chr1 | 960487 | 976739 |
10 rows × 98 columns
SV liftover¶
Like with the CNVs, we expect SV to have similar behaviour with regard to liftover. We therefore first check the yield of the how the regualr segment based liftover for SV:
qh3 = GQH.GOR_Query_Helper()
mydefs3 = ""
mydefs3 = qh3.add_many("""
create #thin# = gor user_data/ref_gnomad4/sv_1672_cols.gord | select 1-3 ;
create #pseg# = gor [#thin#] | select 1-3 |rownum
| calc oldsize #3-#2
| liftover config/liftover/hg38tohg19.gor -seg -build hg38;
create #psta# = gor [#thin#] | select 1,2 | rownum
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
create #pend# = gor [#thin#] | select 1,3 | rownum | sort chrom
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
""")
%%gor
$mydefs3
create #temp# = nor [#pseg#] | group -gc hg38_liftoverstatus -count | granno -sum -ic allcount | calc fr allcount/sum_allcount;
nor [#temp#]
Query ran in 1.53 sec Query fetched 2 rows in 1.79 sec (total time 3.32 sec)
hg38_liftoverStatus | allCount | sum_allCount | fr | |
---|---|---|---|---|
0 | mapped | 2103093 | 2151209 | 0.977633 |
1 | unmapped | 48116 | 2151209 | 0.022367 |
Actually, the yield is significantly better than for the CNVs. To see how much we gain by fuzzy liftover we run:
%%gor
$mydefs3
create #temp# = nor [#pseg#] | map -c rownum <(nor [#psta#] | select rownum,pos,chrom | rename chrom sta_chrom)
| map -c rownum <(nor [#pend#] | select rownum,end,chrom | rename chrom end_chrom)
| calc s2 endx-posx
| calc f if(s2=oldsize,1.0,s2/oldsize)
| calc similar if(f < 1.05 and f > 0.95 and sta_chrom = end_chrom,1,0)
| group -gc similar -count | granno -sum -ic allcount | calc fr allcount/sum_allcount;
nor [#temp#]
Query ran in 1.24 sec Query fetched 2 rows in 11.20 sec (total time 12.44 sec)
similar | allCount | sum_allCount | fr | |
---|---|---|---|---|
0 | 0 | 32826 | 2151209 | 0.015259 |
1 | 1 | 2118383 | 2151209 | 0.984741 |
We are only slightly better than using perfect match. Thus, we will simply write our hg19 SVs using the regular segment based liftover.
%%gor
create #sv_hg19# = gor user_data/ref_gnomad4/sv_1672_cols.gord | liftover config/liftover/hg38tohg19.gor -seg -build hg38;
gor [#sv_hg19#] | where hg38_liftoverstatus = 'mapped' | top 10 /* | write user_data/hakon/gnomAD/sv_all_v4_1672_cols_hg19.gorz */
Query ran in 0.22 sec Query fetched 0 rows in 203.62 sec (total time 203.84 sec)
CHROM | POS |
---|
The liftover for the 2.1M SV variants takes about 10 minutes and another 3 minutes to writer the 1600+ columns out. This table should be narrowed down to a smaller table that has only the most important columns.
%%gor
nor user_data/hakon/gnomAD/sv_all_v4_1672_cols_hg19.gorz | top 1 | unpivot 1-
Query ran in 0.44 sec Query fetched 1,678 rows in 0.01 sec (total time 0.45 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 10240 |
2 | END | 10240 |
3 | END2 | 205738 |
4 | REF | N |
... | ... | ... |
1673 | hg38_CHROM | chr1 |
1674 | hg38_POS | 180903 |
1675 | hg38_END | 180903 |
1676 | hg38_qStrand | + |
1677 | hg38_liftoverStatus | mapped |
1678 rows × 2 columns
In the folder user_data/hakon/gnomAD we have now several gnomAD sources:
- gnomad_v4_hg19_clean.gord
- cnv_all_v4_hg19.gorz
- sv_all_v4_1672_cols_hg19.gorz
and from gnomAD v3
- gnomad_v3_maincols_hg19.gord
WGS data¶
We did the very much the same analysis for the WGS data as for the exoms. First, we make a GORD file from the VCF files in S3:
%%gor
nor user_data/ref_gnomad4/files/
| where right(filename,8)='vcf.link'
| select filename
| replace filename 's3://gdb-demo-dev-blobstorage/shared/extdata/hg38/gnomad/release/4.0/vcf/genomes/'+filename
| replace filename replace(filename,'vcf.link','vcf.gz')
| rownum
| rename rownum alias
| calc chr_start regsel(filename,'.*sites.(.*).vcf.gz')
| calc bp_start 0
| calc chr_stop chr_start
| calc chr_end 1000000000
/* | write user_data/ref_gnomad4/WGS/genomes.gord */ | top 5
Query ran in 0.15 sec Query fetched 5 rows in 0.19 sec (total time 0.34 sec)
Filename | alias | chr_start | bp_start | chr_stop | chr_end | |
---|---|---|---|---|---|---|
0 | s3://gdb-demo-dev-blobstorage/shared/extdata/h... | 1 | chr16 | 0 | chr16 | 1000000000 |
1 | s3://gdb-demo-dev-blobstorage/shared/extdata/h... | 2 | chr19 | 0 | chr19 | 1000000000 |
2 | s3://gdb-demo-dev-blobstorage/shared/extdata/h... | 3 | chr6 | 0 | chr6 | 1000000000 |
3 | s3://gdb-demo-dev-blobstorage/shared/extdata/h... | 4 | chr9 | 0 | chr9 | 1000000000 |
4 | s3://gdb-demo-dev-blobstorage/shared/extdata/h... | 5 | chr11 | 0 | chr11 | 1000000000 |
Then we transform the data to columnar GOR files:
%%gor
def #top# = top 10;
def #colslist# = AC,AN,AF,grpmax,fafmax_faf95_max,fafmax_faf95_max_gen_anc,AC_XX,AF_XX,AN_XX,nhomalt_XX,AC_XY,AF_XY,AN_XY,nhomalt_XY,nhomalt,AC_afr_XX,AF_afr_XX,AN_afr_XX,nhomalt_afr_XX,AC_afr_XY,AF_afr_XY,AN_afr_XY,nhomalt_afr_XY,AC_afr,AF_afr,AN_afr,nhomalt_afr,AC_ami_XX,AF_ami_XX,AN_ami_XX,nhomalt_ami_XX,AC_ami_XY,AF_ami_XY,AN_ami_XY,nhomalt_ami_XY,AC_ami,AF_ami,AN_ami,nhomalt_ami,AC_amr_XX,AF_amr_XX,AN_amr_XX,nhomalt_amr_XX,AC_amr_XY,AF_amr_XY,AN_amr_XY,nhomalt_amr_XY,AC_amr,AF_amr,AN_amr,nhomalt_amr,AC_asj_XX,AF_asj_XX,AN_asj_XX,nhomalt_asj_XX,AC_asj_XY,AF_asj_XY,AN_asj_XY,nhomalt_asj_XY,AC_asj,AF_asj,AN_asj,nhomalt_asj,AC_eas_XX,AF_eas_XX,AN_eas_XX,nhomalt_eas_XX,AC_eas_XY,AF_eas_XY,AN_eas_XY,nhomalt_eas_XY,AC_eas,AF_eas,AN_eas,nhomalt_eas,AC_fin_XX,AF_fin_XX,AN_fin_XX,nhomalt_fin_XX,AC_fin_XY,AF_fin_XY,AN_fin_XY,nhomalt_fin_XY,AC_fin,AF_fin,AN_fin,nhomalt_fin,AC_mid_XX,AF_mid_XX,AN_mid_XX,nhomalt_mid_XX,AC_mid_XY,AF_mid_XY,AN_mid_XY,nhomalt_mid_XY,AC_mid,AF_mid,AN_mid,nhomalt_mid,AC_nfe_XX,AF_nfe_XX,AN_nfe_XX,nhomalt_nfe_XX,AC_nfe_XY,AF_nfe_XY,AN_nfe_XY,nhomalt_nfe_XY,AC_nfe,AF_nfe,AN_nfe,nhomalt_nfe,AC_raw,AF_raw,AN_raw,nhomalt_raw,AC_remaining_XX,AF_remaining_XX,AN_remaining_XX,nhomalt_remaining_XX,AC_remaining_XY,AF_remaining_XY,AN_remaining_XY,nhomalt_remaining_XY,AC_remaining,AF_remaining,AN_remaining,nhomalt_remaining,AC_sas_XX,AF_sas_XX,AN_sas_XX,nhomalt_sas_XX,AC_sas_XY,AF_sas_XY,AN_sas_XY,nhomalt_sas_XY,AC_sas,AF_sas,AN_sas,nhomalt_sas,AC_joint_XX,AF_joint_XX,AN_joint_XX,nhomalt_joint_XX,AC_joint_XY,AF_joint_XY,AN_joint_XY,nhomalt_joint_XY,AC_joint,AF_joint,AN_joint,nhomalt_joint,AC_joint_afr_XX,AF_joint_afr_XX,AN_joint_afr_XX,nhomalt_joint_afr_XX,AC_joint_afr_XY,AF_joint_afr_XY,AN_joint_afr_XY,nhomalt_joint_afr_XY,AC_joint_afr,AF_joint_afr,AN_joint_afr,nhomalt_joint_afr,AC_joint_ami_XX,AF_joint_ami_XX,AN_joint_ami_XX,nhomalt_joint_ami_XX,AC_joint_ami_XY,AF_joint_ami_XY,AN_joint_ami_XY,nhomalt_joint_ami_XY,AC_joint_ami,AF_joint_ami,AN_joint_ami,nhomalt_joint_ami,AC_joint_amr_XX,AF_joint_amr_XX,AN_joint_amr_XX,nhomalt_joint_amr_XX,AC_joint_amr_XY,AF_joint_amr_XY,AN_joint_amr_XY,nhomalt_joint_amr_XY,AC_joint_amr,AF_joint_amr,AN_joint_amr,nhomalt_joint_amr,AC_joint_asj_XX,AF_joint_asj_XX,AN_joint_asj_XX,nhomalt_joint_asj_XX,AC_joint_asj_XY,AF_joint_asj_XY,AN_joint_asj_XY,nhomalt_joint_asj_XY,AC_joint_asj,AF_joint_asj,AN_joint_asj,nhomalt_joint_asj,AC_joint_eas_XX,AF_joint_eas_XX,AN_joint_eas_XX,nhomalt_joint_eas_XX,AC_joint_eas_XY,AF_joint_eas_XY,AN_joint_eas_XY,nhomalt_joint_eas_XY,AC_joint_eas,AF_joint_eas,AN_joint_eas,nhomalt_joint_eas,AC_joint_fin_XX,AF_joint_fin_XX,AN_joint_fin_XX,nhomalt_joint_fin_XX,AC_joint_fin_XY,AF_joint_fin_XY,AN_joint_fin_XY,nhomalt_joint_fin_XY,AC_joint_fin,AF_joint_fin,AN_joint_fin,nhomalt_joint_fin,AC_joint_mid_XX,AF_joint_mid_XX,AN_joint_mid_XX,nhomalt_joint_mid_XX,AC_joint_mid_XY,AF_joint_mid_XY,AN_joint_mid_XY,nhomalt_joint_mid_XY,AC_joint_mid,AF_joint_mid,AN_joint_mid,nhomalt_joint_mid,AC_joint_nfe_XX,AF_joint_nfe_XX,AN_joint_nfe_XX,nhomalt_joint_nfe_XX,AC_joint_nfe_XY,AF_joint_nfe_XY,AN_joint_nfe_XY,nhomalt_joint_nfe_XY,AC_joint_nfe,AF_joint_nfe,AN_joint_nfe,nhomalt_joint_nfe,AC_joint_raw,AF_joint_raw,AN_joint_raw,nhomalt_joint_raw,AC_joint_remaining_XX,AF_joint_remaining_XX,AN_joint_remaining_XX,nhomalt_joint_remaining_XX,AC_joint_remaining_XY,AF_joint_remaining_XY,AN_joint_remaining_XY,nhomalt_joint_remaining_XY,AC_joint_remaining,AF_joint_remaining,AN_joint_remaining,nhomalt_joint_remaining,AC_joint_sas_XX,AF_joint_sas_XX,AN_joint_sas_XX,nhomalt_joint_sas_XX,AC_joint_sas_XY,AF_joint_sas_XY,AN_joint_sas_XY,nhomalt_joint_sas_XY,AC_joint_sas,AF_joint_sas,AN_joint_sas,nhomalt_joint_sas,AC_grpmax,AF_grpmax,AN_grpmax,nhomalt_grpmax,grpmax_joint,AC_grpmax_joint,AF_grpmax_joint,AN_grpmax_joint,nhomalt_grpmax_joint,faf95_XX,faf95_XY,faf95,faf95_afr_XX,faf95_afr_XY,faf95_afr,faf95_amr_XX,faf95_amr_XY,faf95_amr,faf95_eas_XX,faf95_eas_XY,faf95_eas,faf95_nfe_XX,faf95_nfe_XY,faf95_nfe,faf95_sas_XX,faf95_sas_XY,faf95_sas,faf99_XX,faf99_XY,faf99,faf99_afr_XX,faf99_afr_XY,faf99_afr,faf99_amr_XX,faf99_amr_XY,faf99_amr,faf99_eas_XX,faf99_eas_XY,faf99_eas,faf99_nfe_XX,faf99_nfe_XY,faf99_nfe,faf99_sas_XX,faf99_sas_XY,faf99_sas,fafmax_faf99_max,fafmax_faf99_max_gen_anc,faf95_joint_XX,faf95_joint_XY,faf95_joint,faf95_joint_afr_XX,faf95_joint_afr_XY,faf95_joint_afr,faf95_joint_amr_XX,faf95_joint_amr_XY,faf95_joint_amr,faf95_joint_eas_XX,faf95_joint_eas_XY,faf95_joint_eas,faf95_joint_nfe_XX,faf95_joint_nfe_XY,faf95_joint_nfe,faf95_joint_sas_XX,faf95_joint_sas_XY,faf95_joint_sas,faf99_joint_XX,faf99_joint_XY,faf99_joint,faf99_joint_afr_XX,faf99_joint_afr_XY,faf99_joint_afr,faf99_joint_amr_XX,faf99_joint_amr_XY,faf99_joint_amr,faf99_joint_eas_XX,faf99_joint_eas_XY,faf99_joint_eas,faf99_joint_nfe_XX,faf99_joint_nfe_XY,faf99_joint_nfe,faf99_joint_sas_XX,faf99_joint_sas_XY,faf99_joint_sas,fafmax_faf95_max_joint,fafmax_faf95_max_gen_anc_joint,fafmax_faf99_max_joint,fafmax_faf99_max_gen_anc_joint,fafmax_data_type_joint,age_hist_het_bin_freq,age_hist_het_n_smaller,age_hist_het_n_larger,age_hist_hom_bin_freq,age_hist_hom_n_smaller,age_hist_hom_n_larger,FS,MQ,MQRankSum,QUALapprox,QD,ReadPosRankSum,SOR,VarDP,monoallelic,only_het,transmitted_singleton,AS_FS,AS_MQ,AS_MQRankSum,AS_pab_max,AS_QUALapprox,AS_QD,AS_ReadPosRankSum,AS_SB_TABLE,AS_SOR,AS_VarDP,inbreeding_coeff,AS_culprit,AS_VQSLOD,negative_train_site,positive_train_site,allele_type,n_alt_alleles,variant_type,was_mixed,lcr,non_par,segdup;
create #hg38_433_splits# = gor <(nor <(gor #genes# | group chrom | segspan -maxseg 10000000)
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop=max_bpstop,max_tEnd,bpstop)
)
| where #2 < #3 | select 1-3;
create #temp# = pgor -split <(gor [#hg38_433_splits#] ) user_data/ref_gnomad4/WGS/genomes.gord
| #top#
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| replace monoallelic,only_het,transmitted_singleton,negative_train_site,positive_train_site,was_mixed,lcr,non_par,segdup if(#rc='.',0,1)
;
create #w# = pgor [#temp#]; /* write s3data://shared/user_data/ref_gnomad4/WGS/genome_cols.gord; */
nor [#w#] | top 1 | unpivot #1-
Query ran in 5.97 sec. Query fetched 407 rows in 10.28 sec (total time 16.25 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 10031 |
2 | REF | T |
3 | ALT | C |
4 | FILTER | AC0;AS_VQSR |
... | ... | ... |
402 | variant_type | multi-snv |
403 | was_mixed | 0 |
404 | lcr | 1 |
405 | non_par | 0 |
406 | segdup | 1 |
407 rows × 2 columns
and then we do the liftover
def #source_table# = user_data/ref_gnomad4/WGS/genome_cols.gord;
create #c# = pgor -split 100 #source_table#
| group 100000 -count;
create #segs# = gor [#c#]
| seghist 1000000;
create #parts# = nor [#segs#]
| replace #2 #2+1
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop = max_bpstop,max_tEnd,bpstop)
| hide count,max_tEnd
;
create #lift# = parallel -parts [#parts#] <(gor -p #{col:chrom}:#{col:bpstart}-#{col:bpstop} #source_table#
| liftover config/liftover/hg38tohg19.gor -var -build hg38);
create #write# = pgor [#lift#] | write s3data://shared/user_data/ref_gnomad4/WGS/genome_cols_hg19.gord;
On our GORdb system, the former step took 18 min and the latter liftover step 32 minutes. Not bad given the volume of the data - 759,125,619 rows x 409 columns!
%%gor
gor user_data/ref_gnomad4/WGS/genome_cols_hg19.gord
| until hg38_liftoverstatus != 'unmapped'
| group chrom -count
Query ran in 1.19 sec Query fetched 1 rows in 13.64 sec (total time 14.83 sec)
CHROM | bpStart | bpStop | allCount | |
---|---|---|---|---|
0 | chr1 | 0 | 249250621 | 697113 |
We see that there are close to 700k variants that don't map.
How many have consistent Ref value?¶
Of those who do map, how do their reference bases compare with the build (the liftover command will reverse complement them if there is a change in strand):
%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;
create #temp# = pgor [#slice#] | where pos > 0
| calc good if(ref = upper(refbases(chrom,pos,pos+len(ref)-1)),1,0)
| calc same_chrom if(hg38_chrom = chrom,1,0)
| group chrom -gc good,same_chrom,hg38_qstrand -count;
nor [#temp#]
| group -gc good,same_chrom,hg38_qstrand -sum -ic allcount
| granno -gc same_chrom,hg38_qstrand -sum -ic sum_allcount
| calc ratio sum_allcount/sum_sum_allcount
| sort -c same_chrom,hg38_qstrand
Query ran in 1.97 sec Query fetched 8 rows in 75.54 sec (total time 77.51 sec)
good | same_chrom | hg38_qStrand | sum_allCount | sum_sum_allCount | ratio | |
---|---|---|---|---|---|---|
0 | 0 | 0 | + | 6680 | 814317 | 0.008203 |
1 | 1 | 0 | + | 807637 | 814317 | 0.991797 |
2 | 0 | 0 | - | 3375 | 355184 | 0.009502 |
3 | 1 | 0 | - | 351809 | 355184 | 0.990498 |
4 | 0 | 1 | + | 75926 | 747149623 | 0.000102 |
5 | 1 | 1 | + | 747073697 | 747149623 | 0.999898 |
6 | 0 | 1 | - | 16472 | 4269451 | 0.003858 |
7 | 1 | 1 | - | 4252979 | 4269451 | 0.996142 |
We see that when the liftover does not map across chromosomes and on '+' qStrand, it is good in 99.9898% cases and 99.6% cases for the '-'. Between chromosomes it is still good in 99% of the cases.
How many make it back?¶
For comparison, we can also look at how many of the variants "make it back" to their original location on hg38. We now run the liftover process in the other direction, using only the thin #slice# to make it faster. The slice has the original hg38 position and we can compare it with POS (now in hg38) to see if they are same:
%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;
create #back# = parallel -parts <(gor #genes# | group chrom)
<(gor -p #{col:chrom} [#slice#] | where pos > 0 | liftover config/liftover/hg19tohg38.gor -var -build hg19
| calc same if(chrom = hg38_chrom and pos = hg38_pos,1,0) | group chrom -gc same -count);
nor [#back#]
| group -gc same -sum -ic allcount
| granno -sum -ic sum_allcount | calc ratio sum_allcount/sum_sum_allcount
Query ran in 1.24 sec Query fetched 2 rows in 0.01 sec (total time 1.25 sec)
same | sum_allCount | sum_sum_allCount | ratio | |
---|---|---|---|---|
0 | 0 | 3955461 | 752588571 | 0.005256 |
1 | 1 | 748633110 | 752588571 | 0.994744 |
We see that 99.5% of the variant make it back.
How is the variant normalization holding up?¶
First, lets look at some of the variants where we see more than one row after variant normalization:
%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;
gor [#slice#] | where pos > 0 | varnorm -left ref alt | group 1 -gc ref,alt -count
| where allcount > 1
| varjoin [#slice#]
| top 100
Query ran in 1.33 sec Query fetched 100 rows in 1.07 sec (total time 2.40 sec)
CHROM | POS | REF | ALT | allCount | POSx | REFx | ALTx | hg38_qStrand | hg38_CHROM | hg38_POS | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10232 | CCCTAA | C | 2 | 10232 | CCCTAA | C | + | chr1 | 10232 |
1 | chr1 | 10232 | CCCTAA | C | 2 | 10232 | CCCTAA | C | + | chr1 | 180895 |
2 | chr1 | 10234 | CT | C | 2 | 10234 | CT | C | + | chr1 | 180897 |
3 | chr1 | 10234 | CT | C | 2 | 10234 | CT | C | + | chr1 | 10234 |
4 | chr1 | 10235 | T | A | 2 | 10235 | T | A | + | chr1 | 10235 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | chr1 | 10253 | CTA | C | 2 | 10253 | CTA | C | + | chr1 | 180916 |
96 | chr1 | 10254 | T | A | 2 | 10254 | T | A | + | chr1 | 10254 |
97 | chr1 | 10254 | T | A | 2 | 10254 | T | A | + | chr1 | 180917 |
98 | chr1 | 10254 | T | C | 2 | 10254 | T | C | + | chr1 | 10254 |
99 | chr1 | 10254 | T | C | 2 | 10254 | T | C | + | chr1 | 180917 |
100 rows × 11 columns
Clearly, we are seeing some duplicates!
gor s3://gdb-demo-dev-blobstorage/shared/extdata/hg38/gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr10.vcf.gz
| select 1,2,ref,alt
| granno 1 -gc ref,alt -count
| throwif allcount > 1
| calc oldpos pos
| top 100000
| varnorm -left ref alt
| where oldpos != pos
Now, lets understand the scope of these duplicates genome wide and in our targets:
%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;
create #gdx_targets# = gor user_data/hakon/gdx_targest.gorz | segproj;
create #temp# = pgor [#slice#] | where pos > 0
| varnorm -left ref alt | group 1 -gc ref,alt -count
| join -snpseg -ic [#gdx_targets#]
| rename overlapcount inTarget
| calc bad if(allcount > 1,1,0)
| calc type if(len(ref)=len(alt),'snp','InDel')
| group chrom -gc bad,type,inTarget -count;
nor [#temp#] | group -gc bad,type,inTarget -sum -ic allcount
| replace bad if(bad=1,'bad','good')
| pivot -v bad,good bad -gc type,intarget
| calc ratio good_sum_allcount/(bad_sum_allcount+good_sum_allcount)
Query ran in 0.75 sec Query fetched 4 rows in 0.02 sec (total time 0.77 sec)
type | inTarget | bad_sum_allCount | good_sum_allCount | ratio | |
---|---|---|---|---|---|
0 | InDel | 0 | 233695 | 108350603 | 0.997848 |
1 | InDel | 1 | 1095 | 349764 | 0.996879 |
2 | snp | 0 | 1497712 | 633885213 | 0.997643 |
3 | snp | 1 | 23431 | 6414500 | 0.996360 |
We see that the extent of the "problem" is similar within our targets as outside of them, i.e. only 99.6% of the variants or so have non-duplicates in the mapping. Note that ignoring variants that are have wrong reference value should not change these numbers. Also, as we see from the above table, this does not only apply to InDels but also SNPs, hence not a only a result of variant normalization having gone away after the liftover.
Resolving duplicates¶
In locations where we have duplicate rows, as described above, we must resolve the issues. We can choose several strategies, including
- eliminate all rows, i.e. ignore AF estimates from gnomAD where this is the case
- pick the row with the largest AN, i.e. where the highest numbers samples had good genotype of coverage
- pick the row with the higher AF estimate, assuming the AN there compares favorably with the one wit the largest AN
First, lets look at an example that shows the situation we are dealing with:
%%gor
gor -p chr1:12892115- user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,af,an
| granno 1 -gc ref,alt -count
| where allcount > 1
| sort 1 -c ref,alt,an:nr
| distloc 3
Query ran in 0.81 sec Query fetched 11 rows in 0.35 sec (total time 1.17 sec)
CHROM | POS | REF | ALT | AF | AN | allCount | |
---|---|---|---|---|---|---|---|
0 | chr1 | 12892215 | G | A | 0.000000 | 148116 | 2 |
1 | chr1 | 12892215 | G | A | 0.000059 | 102546 | 2 |
2 | chr1 | 12892215 | G | C | 0.349764 | 150264 | 2 |
3 | chr1 | 12892215 | G | C | 0.000027 | 148116 | 2 |
4 | chr1 | 12892215 | G | T | 0.000027 | 150252 | 3 |
5 | chr1 | 12892215 | G | T | 0.000013 | 148230 | 3 |
6 | chr1 | 12892215 | G | T | 0.000020 | 102452 | 3 |
7 | chr1 | 12892229 | A | T | 0.000040 | 148686 | 2 |
8 | chr1 | 12892229 | A | T | 0.000177 | 95950 | 2 |
9 | chr1 | 12892236 | A | G | 0.000013 | 148704 | 2 |
10 | chr1 | 12892236 | A | G | 0.000011 | 90658 | 2 |
Our select strategy could the be expressed with the code below distloc 3:
%%gor
gor -p chr1:12892115- user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,af,an
| granno 1 -gc ref,alt -count
| where allcount > 1
| distloc 3
| varnorm -left ref alt
| granno 1 -gc ref,alt -max -ic AN
| where AN > 0.75 * max_AN
| atmax 1 -gc ref,alt AF
Query ran in 0.28 sec Query fetched 5 rows in 0.54 sec (total time 0.82 sec)
CHROM | POS | REF | ALT | AF | AN | allCount | max_AN | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 12892215 | G | A | 0.000000 | 148116 | 2 | 148116 |
1 | chr1 | 12892215 | G | C | 0.349764 | 150264 | 2 | 150264 |
2 | chr1 | 12892215 | G | T | 0.000027 | 150252 | 3 | 150252 |
3 | chr1 | 12892229 | A | T | 0.000040 | 148686 | 2 | 148686 |
4 | chr1 | 12892236 | A | G | 0.000013 | 148704 | 2 | 148704 |
As we see, in some cases we pick the row with the largest AF while in some cases the difference in AN is so large that we pick the one with the better AN.
How does gnomad V4 compare with our inhouse AF estimates?¶
We have recently calculated the AF in XA, by taking into consideration the target coverage. Let's comare these variants with the AF in our new gnomAD v4 source:
%%gor
nor <(gor #genes# | grep brc | join -segsnp -ir <(gor ref_gdx/XA/AF/xa_af.gord -f ANY)
| where af > 0.001 | select 1-AF,homcount | top 100
| varjoin -r -e 0 <(gor user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,af,an | granno 1 -gc ref,alt -max -ic AN
| where AN > 0.75 * max_AN
| atmax 1 -gc ref,alt AF)
| join -snpseg -r -l -t #genes#
| calc diff abs(af-afx)/(af+afx)
)
| sort -c diff:nr
| calc big if(diff>0.2,1,0)
| granno -sum -ic big
Query ran in 1.77 sec Query fetched 65 rows in 3.56 sec (total time 5.33 sec)
chrom | pos | ref | alt | AN | AC | AF | homCount | AFx | ANx | max_AN | gene_symbol | diff | big | sum_big | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32929387 | T | C | 483384 | 12492 | 0.025843 | 6208 | 0.980573 | 152310 | 152310 | BRCA2 | 0.948644 | 1 | 36 |
1 | chr13 | 32911463 | A | G | 483384 | 1293 | 0.002675 | 41 | 0.046940 | 152300 | 152300 | BRCA2 | 0.892174 | 1 | 36 |
2 | chr13 | 32910721 | T | C | 483384 | 1252 | 0.002590 | 40 | 0.040967 | 152222 | 152222 | BRCA2 | 0.881070 | 1 | 36 |
3 | chr17 | 41245471 | C | T | 483384 | 1779 | 0.003680 | 71 | 0.056823 | 152244 | 152244 | BRCA1 | 0.878344 | 1 | 36 |
4 | chr17 | 41245900 | T | G | 483384 | 585 | 0.001210 | 17 | 0.000348 | 152334 | 152334 | BRCA1 | 0.553411 | 1 | 36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
60 | chr13 | 32937488 | G | T | 483384 | 649 | 0.001343 | 1 | 0.001281 | 152248 | 152248 | BRCA2 | 0.023553 | 0 | 36 |
61 | chr13 | 32972760 | G | A | 483384 | 637 | 0.001318 | 2 | 0.001267 | 152270 | 152270 | BRCA2 | 0.019460 | 0 | 36 |
62 | chr13 | 32937521 | G | A | 483384 | 1151 | 0.002381 | 1 | 0.002298 | 152290 | 152290 | BRCA2 | 0.017705 | 0 | 36 |
63 | chr17 | 41267763 | C | T | 483384 | 1450 | 0.003000 | 8 | 0.002950 | 152222 | 152222 | BRCA1 | 0.008414 | 0 | 36 |
64 | chr13 | 32912008 | G | A | 483384 | 1082 | 0.002238 | 1 | 0.002233 | 152288 | 152288 | BRCA2 | 0.001295 | 0 | 36 |
65 rows × 15 columns
In the BRCA1 and BRCA2 genes, there are only 65 variants with AF > 0.001 in XA. These variants are all present in gnomAD and 36 have relative difference greater than 20%. Some of the have 10x greater AF in gnomAD, however, about 50% of then are smaller in gnomAD. Most of those variants, however, have AF < 0.1%. We notice that the variant with the largest discrepancy has mostly homozygous carriers in XA, consistent with the AF value in gnomAD and not the AF in XA!