GnomAD variant liftover¶

gnomAD_liftover

In [4]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "internal_hg19"
%env GOR_API_PROJECT={project}

gnomAD liftover from hg38 to hg19¶

Here we will explore the liftover of gnomAD v4 from hg38 to build hg19 that GDX is still using.

The gnomAD v4 exome data was downloaded and stored in user_data/ref_gnomad4 in internal-hg19 project. This includes both SNVs, CNVs and SVs.

Converting VCF attribute value tags to GOR columns.¶

The VCF files store all the information about the variatns in a single INFO column using attribute value structure. This is not only difficult to read but also inefficient in processing and storage. Therefore, we like to convert this information to multi-columnar GOR files.

We start by locating one of the source files and read its header. This can be done like this:

aws s3 cp s3://gdb-demo-dev-blobstorage/shared/extdata/hg38/gnomad/release/4.0/vcf/exomes/gnomad.exomes.v4.0.sites.chr4.vcf.gz - | gunzip - | head -n590 > header_example.txt

Note that we fine tune the head command with a binary-like search until we find where the first data line starts, i.e. to get all the attribute descriptions but NOT the massive data rows.

We then take this txt file and eliminate the ## header lines to make it possible to read it with NOR. We save this to header_example.tsv

Now we can run the following query to generate a description file.

In [6]:

%%gor
nor user_data/ref_gnomad4/files/header_example.tsv 
| where contains(col1,'INFO=<ID=')
| replace col1 replace(replace(col1,'INFO=<',''),'>','')
| calc ID tag(col1,'ID',',')
| calc Number tag(col1,'Number',',')
| calc Type tag(col1,'Type',',')
| calc Description replace(tag(col1,'Description',','),'"','')
| select id-
| rename ID Column
/* | write user_data/ref_gnomad4/files/exome_col_descr.tsv */

Query ran in 0.19 sec
Query fetched 583 rows in 0.02 sec (total time 0.21 sec)

Out[6]:

	Column	Number	Type	Description
0	AC	A	Integer	Alternate allele count
1	AN	1	Integer	Total number of alleles
2	AF	A	Float	Alternate allele frequency
3	grpmax	A	String	Genetic ancestry group with the maximum allele...
4	fafmax_faf95_max	A	Float	Maximum filtering allele frequency (using Pois...
...	...	...	...	...
578	VRS_Allele_IDs	R	String	The computed identifiers for the GA4GH VRS All...
579	VRS_Starts	R	Integer	Interresidue coordinates used as the location ...
580	VRS_Ends	R	Integer	Interresidue coordinates used as the location ...
581	VRS_States	.	String	The literal sequence states used for the GA4GH...
582	vep	.	String	Consequence annotations from Ensembl VEP. Form...

583 rows × 4 columns

We see that this has 582 columns. To transform the data we can now make a list from the columns using GROUP -lis -sc column. Few of the last columns, such as VEP, are annotation columns we are not interested in. The other we put into our #colslist# definition as shown below.

In the following query, we read the VCF data from a GORD file and transform it with the pivot approach, as described in https://docs.gorpipe.org/notebooks/Gnomad.html#notebooks-3

Note: We want to use 500 workers to do the transformation and then write out the results in a dictionary with smaller number of files. We must however be careful that commands such as PGOR, GROUP, SEGHIST and SEGSPAN implicitly use the reference build to define the size of the chromosomes. Here we are working in hg19 but dealing with files in hg38 build.

We will use the liftover definition file to fetch the end position of the chromosomes in hg38 while we simply use GROUP chrom to get them for hg19. With the following definition, we can modify the segment from SEGSPAN to fully cover hg38.

In [15]:

%%gor
gor <(nor <(gor #genes# | group chrom | segspan -maxseg 10000000)
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop=max_bpstop,max_tEnd,bpstop)
)
| granno genome -count | where #2 < #3 and chrom = 'chr17'

Query ran in 0.16 sec
Query fetched 16 rows in 0.31 sec (total time 0.47 sec)

Out[15]:

	Chrom	bpStart	bpStop	segCount	max_bpStop	max_tEnd	allCount
0	chr17	0	5074700	1	81195210	83247441	433
1	chr17	5074700	10149401	1	81195210	83247441	433
2	chr17	10149401	15224101	1	81195210	83247441	433
3	chr17	15224101	20298802	1	81195210	83247441	433
4	chr17	20298802	25373502	1	81195210	83247441	433
5	chr17	25373502	30448203	1	81195210	83247441	433
6	chr17	30448203	35522904	1	81195210	83247441	433
7	chr17	35522904	40597605	1	81195210	83247441	433
8	chr17	40597605	45672305	1	81195210	83247441	433
9	chr17	45672305	50747006	1	81195210	83247441	433
10	chr17	50747006	55821706	1	81195210	83247441	433
11	chr17	55821706	60896407	1	81195210	83247441	433
12	chr17	60896407	65971107	1	81195210	83247441	433
13	chr17	65971107	71045808	1	81195210	83247441	433
14	chr17	71045808	76120509	1	81195210	83247441	433
15	chr17	76120509	83247441	1	81195210	83247441	433

Above, we see for instance the last split segment that ends in '83247441' and not '8119521' as plain SEGSPAN in hg19 would let it do.

In [17]:

%%gor
def #top# = top 10;

def #colslist# = AC,AN,AF,grpmax,fafmax_faf95_max,fafmax_faf95_max_gen_anc,AC_XX,AF_XX,AN_XX,nhomalt_XX,AC_XY,AF_XY,AN_XY,nhomalt_XY,nhomalt,AC_afr_XX,AF_afr_XX,AN_afr_XX,nhomalt_afr_XX,AC_afr_XY,AF_afr_XY,AN_afr_XY,nhomalt_afr_XY,AC_afr,AF_afr,AN_afr,nhomalt_afr,AC_amr_XX,AF_amr_XX,AN_amr_XX,nhomalt_amr_XX,AC_amr_XY,AF_amr_XY,AN_amr_XY,nhomalt_amr_XY,AC_amr,AF_amr,AN_amr,nhomalt_amr,AC_asj_XX,AF_asj_XX,AN_asj_XX,nhomalt_asj_XX,AC_asj_XY,AF_asj_XY,AN_asj_XY,nhomalt_asj_XY,AC_asj,AF_asj,AN_asj,nhomalt_asj,AC_eas_XX,AF_eas_XX,AN_eas_XX,nhomalt_eas_XX,AC_eas_XY,AF_eas_XY,AN_eas_XY,nhomalt_eas_XY,AC_eas,AF_eas,AN_eas,nhomalt_eas,AC_fin_XX,AF_fin_XX,AN_fin_XX,nhomalt_fin_XX,AC_fin_XY,AF_fin_XY,AN_fin_XY,nhomalt_fin_XY,AC_fin,AF_fin,AN_fin,nhomalt_fin,AC_mid_XX,AF_mid_XX,AN_mid_XX,nhomalt_mid_XX,AC_mid_XY,AF_mid_XY,AN_mid_XY,nhomalt_mid_XY,AC_mid,AF_mid,AN_mid,nhomalt_mid,AC_nfe_XX,AF_nfe_XX,AN_nfe_XX,nhomalt_nfe_XX,AC_nfe_XY,AF_nfe_XY,AN_nfe_XY,nhomalt_nfe_XY,AC_nfe,AF_nfe,AN_nfe,nhomalt_nfe,AC_non_ukb_XX,AF_non_ukb_XX,AN_non_ukb_XX,nhomalt_non_ukb_XX,AC_non_ukb_XY,AF_non_ukb_XY,AN_non_ukb_XY,nhomalt_non_ukb_XY,AC_non_ukb,AF_non_ukb,AN_non_ukb,nhomalt_non_ukb,AC_non_ukb_afr_XX,AF_non_ukb_afr_XX,AN_non_ukb_afr_XX,nhomalt_non_ukb_afr_XX,AC_non_ukb_afr_XY,AF_non_ukb_afr_XY,AN_non_ukb_afr_XY,nhomalt_non_ukb_afr_XY,AC_non_ukb_afr,AF_non_ukb_afr,AN_non_ukb_afr,nhomalt_non_ukb_afr,AC_non_ukb_amr_XX,AF_non_ukb_amr_XX,AN_non_ukb_amr_XX,nhomalt_non_ukb_amr_XX,AC_non_ukb_amr_XY,AF_non_ukb_amr_XY,AN_non_ukb_amr_XY,nhomalt_non_ukb_amr_XY,AC_non_ukb_amr,AF_non_ukb_amr,AN_non_ukb_amr,nhomalt_non_ukb_amr,AC_non_ukb_asj_XX,AF_non_ukb_asj_XX,AN_non_ukb_asj_XX,nhomalt_non_ukb_asj_XX,AC_non_ukb_asj_XY,AF_non_ukb_asj_XY,AN_non_ukb_asj_XY,nhomalt_non_ukb_asj_XY,AC_non_ukb_asj,AF_non_ukb_asj,AN_non_ukb_asj,nhomalt_non_ukb_asj,AC_non_ukb_eas_XX,AF_non_ukb_eas_XX,AN_non_ukb_eas_XX,nhomalt_non_ukb_eas_XX,AC_non_ukb_eas_XY,AF_non_ukb_eas_XY,AN_non_ukb_eas_XY,nhomalt_non_ukb_eas_XY,AC_non_ukb_eas,AF_non_ukb_eas,AN_non_ukb_eas,nhomalt_non_ukb_eas,AC_non_ukb_fin_XX,AF_non_ukb_fin_XX,AN_non_ukb_fin_XX,nhomalt_non_ukb_fin_XX,AC_non_ukb_fin_XY,AF_non_ukb_fin_XY,AN_non_ukb_fin_XY,nhomalt_non_ukb_fin_XY,AC_non_ukb_fin,AF_non_ukb_fin,AN_non_ukb_fin,nhomalt_non_ukb_fin,AC_non_ukb_mid_XX,AF_non_ukb_mid_XX,AN_non_ukb_mid_XX,nhomalt_non_ukb_mid_XX,AC_non_ukb_mid_XY,AF_non_ukb_mid_XY,AN_non_ukb_mid_XY,nhomalt_non_ukb_mid_XY,AC_non_ukb_mid,AF_non_ukb_mid,AN_non_ukb_mid,nhomalt_non_ukb_mid,AC_non_ukb_nfe_XX,AF_non_ukb_nfe_XX,AN_non_ukb_nfe_XX,nhomalt_non_ukb_nfe_XX,AC_non_ukb_nfe_XY,AF_non_ukb_nfe_XY,AN_non_ukb_nfe_XY,nhomalt_non_ukb_nfe_XY,AC_non_ukb_nfe,AF_non_ukb_nfe,AN_non_ukb_nfe,nhomalt_non_ukb_nfe,AC_non_ukb_raw,AF_non_ukb_raw,AN_non_ukb_raw,nhomalt_non_ukb_raw,AC_non_ukb_remaining_XX,AF_non_ukb_remaining_XX,AN_non_ukb_remaining_XX,nhomalt_non_ukb_remaining_XX,AC_non_ukb_remaining_XY,AF_non_ukb_remaining_XY,AN_non_ukb_remaining_XY,nhomalt_non_ukb_remaining_XY,AC_non_ukb_remaining,AF_non_ukb_remaining,AN_non_ukb_remaining,nhomalt_non_ukb_remaining,AC_non_ukb_sas_XX,AF_non_ukb_sas_XX,AN_non_ukb_sas_XX,nhomalt_non_ukb_sas_XX,AC_non_ukb_sas_XY,AF_non_ukb_sas_XY,AN_non_ukb_sas_XY,nhomalt_non_ukb_sas_XY,AC_non_ukb_sas,AF_non_ukb_sas,AN_non_ukb_sas,nhomalt_non_ukb_sas,AC_raw,AF_raw,AN_raw,nhomalt_raw,AC_remaining_XX,AF_remaining_XX,AN_remaining_XX,nhomalt_remaining_XX,AC_remaining_XY,AF_remaining_XY,AN_remaining_XY,nhomalt_remaining_XY,AC_remaining,AF_remaining,AN_remaining,nhomalt_remaining,AC_sas_XX,AF_sas_XX,AN_sas_XX,nhomalt_sas_XX,AC_sas_XY,AF_sas_XY,AN_sas_XY,nhomalt_sas_XY,AC_sas,AF_sas,AN_sas,nhomalt_sas,AC_joint_XX,AF_joint_XX,AN_joint_XX,nhomalt_joint_XX,AC_joint_XY,AF_joint_XY,AN_joint_XY,nhomalt_joint_XY,AC_joint,AF_joint,AN_joint,nhomalt_joint,AC_joint_afr_XX,AF_joint_afr_XX,AN_joint_afr_XX,nhomalt_joint_afr_XX,AC_joint_afr_XY,AF_joint_afr_XY,AN_joint_afr_XY,nhomalt_joint_afr_XY,AC_joint_afr,AF_joint_afr,AN_joint_afr,nhomalt_joint_afr,AC_joint_ami_XX,AF_joint_ami_XX,AN_joint_ami_XX,nhomalt_joint_ami_XX,AC_joint_ami_XY,AF_joint_ami_XY,AN_joint_ami_XY,nhomalt_joint_ami_XY,AC_joint_ami,AF_joint_ami,AN_joint_ami,nhomalt_joint_ami,AC_joint_amr_XX,AF_joint_amr_XX,AN_joint_amr_XX,nhomalt_joint_amr_XX,AC_joint_amr_XY,AF_joint_amr_XY,AN_joint_amr_XY,nhomalt_joint_amr_XY,AC_joint_amr,AF_joint_amr,AN_joint_amr,nhomalt_joint_amr,AC_joint_asj_XX,AF_joint_asj_XX,AN_joint_asj_XX,nhomalt_joint_asj_XX,AC_joint_asj_XY,AF_joint_asj_XY,AN_joint_asj_XY,nhomalt_joint_asj_XY,AC_joint_asj,AF_joint_asj,AN_joint_asj,nhomalt_joint_asj,AC_joint_eas_XX,AF_joint_eas_XX,AN_joint_eas_XX,nhomalt_joint_eas_XX,AC_joint_eas_XY,AF_joint_eas_XY,AN_joint_eas_XY,nhomalt_joint_eas_XY,AC_joint_eas,AF_joint_eas,AN_joint_eas,nhomalt_joint_eas,AC_joint_fin_XX,AF_joint_fin_XX,AN_joint_fin_XX,nhomalt_joint_fin_XX,AC_joint_fin_XY,AF_joint_fin_XY,AN_joint_fin_XY,nhomalt_joint_fin_XY,AC_joint_fin,AF_joint_fin,AN_joint_fin,nhomalt_joint_fin,AC_joint_mid_XX,AF_joint_mid_XX,AN_joint_mid_XX,nhomalt_joint_mid_XX,AC_joint_mid_XY,AF_joint_mid_XY,AN_joint_mid_XY,nhomalt_joint_mid_XY,AC_joint_mid,AF_joint_mid,AN_joint_mid,nhomalt_joint_mid,AC_joint_nfe_XX,AF_joint_nfe_XX,AN_joint_nfe_XX,nhomalt_joint_nfe_XX,AC_joint_nfe_XY,AF_joint_nfe_XY,AN_joint_nfe_XY,nhomalt_joint_nfe_XY,AC_joint_nfe,AF_joint_nfe,AN_joint_nfe,nhomalt_joint_nfe,AC_joint_raw,AF_joint_raw,AN_joint_raw,nhomalt_joint_raw,AC_joint_remaining_XX,AF_joint_remaining_XX,AN_joint_remaining_XX,nhomalt_joint_remaining_XX,AC_joint_remaining_XY,AF_joint_remaining_XY,AN_joint_remaining_XY,nhomalt_joint_remaining_XY,AC_joint_remaining,AF_joint_remaining,AN_joint_remaining,nhomalt_joint_remaining,AC_joint_sas_XX,AF_joint_sas_XX,AN_joint_sas_XX,nhomalt_joint_sas_XX,AC_joint_sas_XY,AF_joint_sas_XY,AN_joint_sas_XY,nhomalt_joint_sas_XY,AC_joint_sas,AF_joint_sas,AN_joint_sas,nhomalt_joint_sas,AC_grpmax,AF_grpmax,AN_grpmax,nhomalt_grpmax,grpmax_non_ukb,AC_grpmax_non_ukb,AF_grpmax_non_ukb,AN_grpmax_non_ukb,nhomalt_grpmax_non_ukb,grpmax_joint,AC_grpmax_joint,AF_grpmax_joint,AN_grpmax_joint,nhomalt_grpmax_joint,faf95_XX,faf95_XY,faf95,faf95_afr_XX,faf95_afr_XY,faf95_afr,faf95_amr_XX,faf95_amr_XY,faf95_amr,faf95_eas_XX,faf95_eas_XY,faf95_eas,faf95_nfe_XX,faf95_nfe_XY,faf95_nfe,faf95_non_ukb_XX,faf95_non_ukb_XY,faf95_non_ukb,faf95_non_ukb_afr_XX,faf95_non_ukb_afr_XY,faf95_non_ukb_afr,faf95_non_ukb_amr_XX,faf95_non_ukb_amr_XY,faf95_non_ukb_amr,faf95_non_ukb_eas_XX,faf95_non_ukb_eas_XY,faf95_non_ukb_eas,faf95_non_ukb_nfe_XX,faf95_non_ukb_nfe_XY,faf95_non_ukb_nfe,faf95_non_ukb_sas_XX,faf95_non_ukb_sas_XY,faf95_non_ukb_sas,faf95_sas_XX,faf95_sas_XY,faf95_sas,faf99_XX,faf99_XY,faf99,faf99_afr_XX,faf99_afr_XY,faf99_afr,faf99_amr_XX,faf99_amr_XY,faf99_amr,faf99_eas_XX,faf99_eas_XY,faf99_eas,faf99_nfe_XX,faf99_nfe_XY,faf99_nfe,faf99_non_ukb_XX,faf99_non_ukb_XY,faf99_non_ukb,faf99_non_ukb_afr_XX,faf99_non_ukb_afr_XY,faf99_non_ukb_afr,faf99_non_ukb_amr_XX,faf99_non_ukb_amr_XY,faf99_non_ukb_amr,faf99_non_ukb_eas_XX,faf99_non_ukb_eas_XY,faf99_non_ukb_eas,faf99_non_ukb_nfe_XX,faf99_non_ukb_nfe_XY,faf99_non_ukb_nfe,faf99_non_ukb_sas_XX,faf99_non_ukb_sas_XY,faf99_non_ukb_sas,faf99_sas_XX,faf99_sas_XY,faf99_sas,fafmax_faf99_max,fafmax_faf99_max_gen_anc,fafmax_faf95_max_non_ukb,fafmax_faf95_max_gen_anc_non_ukb,fafmax_faf99_max_non_ukb,fafmax_faf99_max_gen_anc_non_ukb,faf95_joint_XX,faf95_joint_XY,faf95_joint,faf95_joint_afr_XX,faf95_joint_afr_XY,faf95_joint_afr,faf95_joint_amr_XX,faf95_joint_amr_XY,faf95_joint_amr,faf95_joint_eas_XX,faf95_joint_eas_XY,faf95_joint_eas,faf95_joint_nfe_XX,faf95_joint_nfe_XY,faf95_joint_nfe,faf95_joint_sas_XX,faf95_joint_sas_XY,faf95_joint_sas,faf99_joint_XX,faf99_joint_XY,faf99_joint,faf99_joint_afr_XX,faf99_joint_afr_XY,faf99_joint_afr,faf99_joint_amr_XX,faf99_joint_amr_XY,faf99_joint_amr,faf99_joint_eas_XX,faf99_joint_eas_XY,faf99_joint_eas,faf99_joint_nfe_XX,faf99_joint_nfe_XY,faf99_joint_nfe,faf99_joint_sas_XX,faf99_joint_sas_XY,faf99_joint_sas,fafmax_faf95_max_joint,fafmax_faf95_max_gen_anc_joint,fafmax_faf99_max_joint,fafmax_faf99_max_gen_anc_joint,fafmax_data_type_joint,age_hist_het_bin_freq,age_hist_het_n_smaller,age_hist_het_n_larger,age_hist_hom_bin_freq,age_hist_hom_n_smaller,age_hist_hom_n_larger,FS,MQ,MQRankSum,QUALapprox,QD,ReadPosRankSum,SOR,VarDP,monoallelic,only_het,transmitted_singleton,sibling_singleton,AS_FS,AS_MQ,AS_MQRankSum,AS_pab_max,AS_QUALapprox,AS_QD,AS_ReadPosRankSum,AS_SB_TABLE,AS_SOR,AS_VarDP,inbreeding_coeff,AS_culprit,AS_VQSLOD,negative_train_site,positive_train_site,allele_type,n_alt_alleles,variant_type,was_mixed,lcr,non_par,segdup,fail_interval_qc,outside_ukb_capture_region,outside_broad_capture_region,revel_max,spliceai_ds_max,pangolin_largest_ds,phylop,sift_max,polyphen_max;

create #hg38_433_splits# = gor <(nor <(gor #genes# | group chrom | segspan -maxseg 10000000)
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop=max_bpstop,max_tEnd,bpstop)
)
| where #2 < #3 | select 1-3;
      
create #temp# = pgor -split <(gor [#hg38_433_splits#]) user_data/ref_gnomad4/exomes.gord
| #top#
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| replace monoallelic,only_het,transmitted_singleton,sibling_singleton,negative_train_site,positive_train_site,was_mixed,lcr,non_par,segdup,fail_interval_qc,outside_ukb_capture_region,outside_broad_capture_region if(#rc='.',0,1)
;

create #w# = pgor [#temp#]; /* | write s3data://shared/user_data/ref_gnomad4/exome_cols_v2.gord; */

nor [#w#] | top 1 | unpivot #1-

Query ran in 1.52 sec
Query fetched 574 rows in 12.09 sec (total time 13.61 sec)

Out[17]:

	Col_Name	Col_Value
0	CHROM	chr1
1	POS	11994
2	REF	T
3	ALT	C
4	FILTER	AC0;AS_VQSR
...	...	...
569	spliceai_ds_max	.
570	pangolin_largest_ds	-0.110000
571	phylop	1.08800
572	sift_max	.
573	polyphen_max	.

574 rows × 2 columns

Notice that we have applied a replace on the columns (attributes) that are labeled in the VCF to be flags, e.g.:

In [11]:

%%gor
nor user_data/ref_gnomad4/files/exome_col_descr.tsv | where Type in ('Flag')

Query ran in 0.17 sec
Query fetched 13 rows in 0.01 sec (total time 0.18 sec)

Out[11]:

	Column	Type	Description
0	monoallelic	Flag	All samples are homozygous alternate for the v...
1	only_het	Flag	All samples are heterozygous for the variant
2	transmitted_singleton	Flag	Variant was a callset-wide doubleton that was ...
3	sibling_singleton	Flag	Variant was a callset-wide doubleton that was ...
4	negative_train_site	Flag	Variant was used to build the negative trainin...
5	positive_train_site	Flag	Variant was used to build the positive trainin...
6	was_mixed	Flag	Variant type was mixed
7	lcr	Flag	Variant falls within a low complexity region
8	non_par	Flag	Variant (on sex chromosome) falls outside a ps...
9	segdup	Flag	Variant falls within a segmental duplication r...
10	fail_interval_qc	Flag	Less than 85 percent of samples meet 20X cover...
11	outside_ukb_capture_region	Flag	Variant falls outside of UK Biobank exome capt...
12	outside_broad_capture_region	Flag	Variant falls outside of Broad exome capture r...

CNV and SV¶

We did very much the same for CNVs and the 1672 SV attributes:

In [12]:

%%gor
def #top# = top 10;
def #colslist# = END,SVLEN,SVTYPE,POSMIN,POSMAX,ENDMIN,ENDMAX,Genes,N_EXN_VAR,N_INT_VAR,SC,afr_SC,amr_SC,asj_SC,eas_SC,fin_SC,mid_SC,nfe_SC,sas_SC,SN,afr_SN,amr_SN,asj_SN,eas_SN,fin_SN,mid_SN,nfe_SN,sas_SN,SF,afr_SF,amr_SF,asj_SF,eas_SF,fin_SF,mid_SF,nfe_SF,sas_SF,MALE_SC,afr_MALE_SC,amr_MALE_SC,asj_MALE_SC,eas_MALE_SC,fin_MALE_SC,mid_MALE_SC,nfe_MALE_SC,sas_MALE_SC,MALE_SN,afr_MALE_SN,amr_MALE_SN,asj_MALE_SN,eas_MALE_SN,fin_MALE_SN,mid_MALE_SN,nfe_MALE_SN,sas_MALE_SN,MALE_SF,afr_MALE_SF,amr_MALE_SF,asj_MALE_SF,eas_MALE_SF,fin_MALE_SF,mid_MALE_SF,nfe_MALE_SF,sas_MALE_SF,FEMALE_SC,afr_FEMALE_SC,amr_FEMALE_SC,asj_FEMALE_SC,eas_FEMALE_SC,fin_FEMALE_SC,mid_FEMALE_SC,nfe_FEMALE_SC,sas_FEMALE_SC,FEMALE_SN,afr_FEMALE_SN,amr_FEMALE_SN,asj_FEMALE_SN,eas_FEMALE_SN,fin_FEMALE_SN,mid_FEMALE_SN,nfe_FEMALE_SN,sas_FEMALE_SN,FEMALE_SF,afr_FEMALE_SF,amr_FEMALE_SF,asj_FEMALE_SF,eas_FEMALE_SF,fin_FEMALE_SF,mid_FEMALE_SF,nfe_FEMALE_SF,sas_FEMALE_SF;

gor user_data/ref_gnomad4/cnv/gnomad.v4.0.cnv.all.vcf.gz
| #top# 
| sort genome
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| select chrom,pos,end,ref,alt,svlen-
/* | write user_data/ref_gnomad4/cnv_all.gorz */

Query ran in 0.64 sec
Query fetched 4 rows in 0.02 sec (total time 0.65 sec)

Out[12]:

	CHROM	POS	END	REF	ALT	SVLEN	SVTYPE	POSMIN	POSMAX	ENDMIN	...	sas_FEMALE_SN	FEMALE_SF	eas_FEMALE_SF	nfe_FEMALE_SF
0	chr1	925634	931187	N	<DEL>	5553	DEL	925634	925634	930434	...	7959	0.000009	0.000126	0.000006
1	chr1	925634	935994	N	<DUP>	10360	DUP	912956	925634	935994	...	7959	0.000009	0.000000	0.000006
2	chr1	935675	948701	N	<DEL>	13026	DEL	935675	935675	948701	...	7960	0.000004	0.000000	0.000006
3	chr1	940979	943475	N	<DUP>	2496	DUP	940979	940979	943475	...	7960	0.000004	0.000000	0.000006

4 rows × 95 columns

Here we had to do use SORT genome before PIVOT because the single VCF file has data from multiple chromosomes (and no index) and the fact that VCF use numerical order of chromosomes while GOR uses alphabetical order.

Below we show a reduced version of the SV query (because the 1600+ attributes take too much space!):

In [13]:

%%gor
def #top# = top 10;
def #colslist# = AC,ALGORITHMS,AN,BOTHSIDES_SUPPORT,CHR2,CPX_INTERVALS,CPX_TYPE,END,END2,EVIDENCE,FEMALE_NCR,FEMALE_PCRMINUS_NCR,FEMALE_PCRPLUS_NCR,HIGH_SR_BACKGROUND,LIKELY_REFERENCE_ARTIFACT,LOW_CONFIDENCE_REPETITIVE_LARGE_DUP,MALE_NCR,MALE_PCRMINUS_NCR,MALE_PCRPLUS_NCR,MEMBERS,MULTIALLELIC,NCR,OUTLIER_SAMPLE_ENRICHED_LENIENT,PCRMINUS_NCR,PCRPLUS_NCR,PESR_GT_OVERDISPERSION,POS2,PREDICTED_BREAKEND_EXONIC,PREDICTED_COPY_GAIN,PREDICTED_DUP_PARTIAL,PREDICTED_INTERGENIC,PREDICTED_INTRAGENIC_EXON_DUP,PREDICTED_INTRONIC,PREDICTED_INV_SPAN,PREDICTED_LOF,PREDICTED_MSV_EXON_OVERLAP,PREDICTED_NEAREST_TSS,PREDICTED_PARTIAL_EXON_DUP,PREDICTED_PROMOTER,PREDICTED_TSS_DUP,PREDICTED_UTR,RESOLVED_POSTHOC,SOURCE,STRANDS,SVLEN,SVTYPE,UNRESOLVED_TYPE,AF;

pgor user_data/ref_gnomad4/sv.gord 
| #top#
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,qual,filter,x_1,x_2
| pivot -gc ref,alt,qual,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| columnreorder chrom,pos,END,END2
/* | write s3data://shared/user_data/ref_gnomad4/sv_1672_cols.gord */

Query ran in 1.54 sec
Query fetched 366 rows in 2.49 sec (total time 4.04 sec)

Out[13]:

	CHROM	POS	END	END2	REF	ALT	QUAL	FILTER	AC	ALGORITHMS	...	PREDICTED_PROMOTER	PREDICTED_TSS_DUP	PREDICTED_UTR	RESOLVED_POSTHOC	SOURCE	STRANDS	SVLEN	SVTYPE	UNRESOLVED_TYPE	AF
0	chr1	10000	295666	.	N	<DUP>	134	HIGH_NCR	139	depth	...	.	.	.	.	.	.	285666	DUP	.	0.001997
1	chr1	10434	10434	242183515	N	<BND>	260	HIGH_NCR;UNRESOLVED	8474	manta	...	.	.	.	.	.	++	-1	BND	MIXED_BREAKENDS	0.114134
2	chr1	10440	10440	10977	N	<BND>	198	HIGH_NCR;UNRESOLVED	466	manta	...	.	.	.	.	.	+-	-1	BND	SINGLE_ENDER_+-	0.004201
3	chr1	10450	10450	80258275	N	<BND>	287	HIGH_NCR;UNRESOLVED	21766	manta	...	.	.	.	.	.	++	-1	BND	MIXED_BREAKENDS	0.323899
4	chr1	10464	10464	10502	N	<BND>	198	HIGH_NCR;UNRESOLVED	3119	manta	...	.	.	.	.	.	+-	-1	BND	SINGLE_ENDER_+-	0.036985
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
361	chrY	2809388	2815388	.	N	<DEL>	107	PASS	3	depth	...	.	.	.	.	.	.	6000	DEL	.	0.000101674
362	chrY	2813985	2818893	.	N	<DEL>	351	PASS	1	manta	...	.	.	.	.	.	.	4908	DEL	.	3.31719e-05
363	chrY	2816900	2817065	.	N	<DUP>	859	PASS	1	wham	...	.	.	.	.	.	.	165	DUP	.	3.31719e-05
364	chrY	2817388	2822788	.	N	<DEL>	999	PASS	2	depth	...	.	.	.	.	.	.	5400	DEL	.	6.77025e-05
365	chrY	2817388	2913388	.	N	<DUP>	999	PASS	8	depth	...	.	.	.	.	.	.	96000	DUP	.	0.00026888

366 rows × 54 columns

Liftover¶

For the liftover to hg19 there are slighltly different concerns for SNVs that the segment based CNVs and SVs.

Let's first look at the SNP and small InDels. We will use PARALLEL to split up the work to a high number of tasks, each dealing with 1M rows.

In [18]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

In [19]:

mydefs = qh.add_many("""
def #source_table# = user_data/ref_gnomad4/exome_cols_v2.gord;

create #c# = pgor -split 100 #source_table#
| group 100000 -count;

create #segs# = gor [#c#]
| seghist 1000000;

create #parts# = nor [#segs#]
| replace #2 #2+1
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop = max_bpstop,max_tEnd,bpstop)
| hide count,max_tEnd;
""")

Note: At the moment, the PGOR command does not by default specify -nowithin option and hence it does not work in conjunction with LIFTOVER. In the future, this will be changed and be the default while one can optionally specify a -within option.

In [20]:

%%gor
$mydefs
nor [#parts#] | granno -count | top 3

Query ran in 1.64 sec
Query fetched 3 rows in 18.66 sec (total time 20.30 sec)

Out[20]:

	Chrom	bpStart	bpStop	max_bpStop	allCount
0	chr1	1	6134734	249250621	196
1	chr1	6134735	15123500	249250621	196
2	chr1	15123501	21937240	249250621	196

In #parts# we shift the start position by one because the -p option in GOR is one-based while all segments use zero-based system (confusing - yes!). The query to execute the liftover uses the -var option in LIFTOVER, because for variants where the strand changes we must reverse-complement the refefrence sequence and the alternative.

In [9]:

mydefs = qh.add_many("""
create #lift# = parallel -parts [#parts#] <(gor -p #{col:chrom}:#{col:bpstart}-#{col:bpstop} #source_table#
| liftover config/liftover/hg38tohg19.gor -var -build hg38);

create #write# = pgor [#lift#] | write s3data://shared/user_data/hakon/gnomad_v4.gord;
""")

The #lift# relation above has many partitions and each partition my have data from multiple chromosomes. However, the final write will create a dictionary with regular PGOR split, i.e. 2 files for the bigger chromosomes in total 38 files, each file with data from single chromosome. We will not run the above query here but it takes only few minutes. We will however count the number of variants:

In [22]:

%%gor
create x = pgor -split 100 user_data/hakon/gnomad_v4.gord | group chrom -gc hg38_liftoverStatus -count;
nor [x] | group -gc hg38_liftoverStatus -sum -ic allcount | granno -sum -ic #2 | calc f #2/#3

Query ran in 0.31 sec
Query fetched 2 rows in 0.03 sec (total time 0.34 sec)

Out[22]:

	hg38_liftoverStatus	sum_allCount	sum_sum_allCount	f
0	mapped	181793123	182145119	0.998067
1	unmapped	351996	182145119	0.001933

Our dataset has 182M rows x 580 cols. We notice that 0.2% of the variants do not translate from hg38 to hg19. This is a bigger issue in CNVs where we have larger variants.

CNV liftover¶

When we do the CNV liftover, we are usually dealing with variants that span many bases. We now use -seg option for LIFTOVER, i.e. NO ref/alt to deal with.

In [4]:

qh2 = GQH.GOR_Query_Helper()
mydefs2 = ""

In [6]:

mydefs2 = qh2.add_many("""
create #pseg# = gor user_data/ref_gnomad4/cnv_all.gorz | select 1-3 |rownum
| calc oldsize #3-#2
| liftover config/liftover/hg38tohg19.gor -seg -build hg38;
""")

In [7]:

%%gor 
$mydefs2
gor [#pseg#] | calc same if(#3-#2 = oldsize,1,0) | group genome -gc same,hg38_liftoverStatus -count

Query ran in 0.60 sec
Query fetched 2 rows in 0.22 sec (total time 0.82 sec)

Out[7]:

	CHROM	bpStart	bpStop	same	hg38_liftoverStatus	allCount
0	chrA	0	1000000000	0	unmapped	3364
1	chrA	0	1000000000	1	mapped	48934

Now close to 10% of the CNVs do not map. However, all the ones that do map conserver their segment size. This is because the LIFTOVER command only maps the segmens where both start and stop positions move in tandem. We can relax this condition by applying liftover on the start position and stop position separately and then join these individual liftover steps together by their row-numbers, e.g. :

In [8]:

mydefs2 = qh2.add_many("""
create #psta# = gor user_data/ref_gnomad4/cnv_all.gorz | select 1,2 | rownum
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;

create #pend# = gor user_data/ref_gnomad4/cnv_all.gorz | select 1,3 | rownum | sort chrom
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
""")

In [9]:

%%gor
$mydefs2
nor [#psta#] | group -gc hg38_liftoverStatus -count
| merge <(nor [#pend#] | group -gc hg38_liftoverStatus -count)

Query ran in 0.27 sec
Query fetched 4 rows in 0.16 sec (total time 0.43 sec)

Out[9]:

	hg38_liftoverStatus	allCount
0	mapped	52285
1	mapped	52290
2	unmapped	13
3	unmapped	8

Now we see that almost all of the single-nucleotide-positions are mapped. Hence, if we map the start and end positions independently, we are likely to get a better yield in the liftover process.

In [23]:

%%gor
$mydefs2
nor [#pseg#] | map -c rownum <(nor [#psta#] | select rownum,pos,chrom | rename chrom sta_chrom)
             | map -c rownum <(nor [#pend#] | select rownum,end,chrom | rename chrom end_chrom)
| calc s2 endx-posx
| calc f s2/oldsize
| calc similar if(f < 1.05 and f > 0.95 and sta_chrom = end_chrom,1,0)
| group -gc similar -count | granno -sum -ic allcount | calc fr allcount/sum_allcount

Query ran in 0.10 sec
Query fetched 2 rows in 0.42 sec (total time 0.52 sec)

Out[23]:

	similar	allCount	sum_allCount	fr
0	0	238	52298	0.004551
1	1	52060	52298	0.995449

Now we see that only 0.5% of the CNV segments do not map, if we insist that the change in their size is no more than 5% (picked from a hat!). We can use this approach to map most of the CNVs to hg19, allowing for a slight fuzzyness in the size.

In [20]:

%%gor
$mydefs2
create #cnvs# = nor user_data/ref_gnomad4/cnv_all.gorz | rownum | columnreorder rownum | sort -c rownum;

gor <(nor [#pseg#] | map -c rownum <(nor [#psta#] | select rownum,pos,chrom | rename chrom sta_chrom)
                   | map -c rownum <(nor [#pend#] | select rownum,end,chrom | rename chrom end_chrom)
| calc s2 endx-posx
| calc f s2/oldsize
| where f < 1.05 and f > 0.95 and sta_chrom = end_chrom
| select chrom,posx,endx,rownum
| rename #2 cnv_start
| rename #3 cnv_end
| sort -c rownum
| map -ordered -c rownum <(nor [#cnvs#] | prefix chrom,pos,end hg38)
| hide rownum
)
| sort genome
| select 1-cnv_end,ref-,hg38_*
| top 10 /* write user_data/hakon/gnomAD/cnv_all_v4_hg19.gorz */

Query ran in 0.13 sec
Query fetched 10 rows in 5.00 sec (total time 5.14 sec)

Out[20]:

	CHROM	cnv_start	cnv_end	REF	ALT	SVLEN	SVTYPE	POSMIN	POSMAX	ENDMIN	...	eas_FEMALE_SF	nfe_FEMALE_SF	hg38_CHROM	hg38_POS	hg38_END
0	chr1	861014	866567	N	<DEL>	5553	DEL	925634	925634	930434	...	0.000126	0.000006	chr1	925634	931187
1	chr1	861014	871374	N	<DUP>	10360	DUP	912956	925634	935994	...	0.000000	0.000006	chr1	925634	935994
2	chr1	871055	884081	N	<DEL>	13026	DEL	935675	935675	948701	...	0.000000	0.000006	chr1	935675	948701
3	chr1	876359	878855	N	<DUP>	2496	DUP	940979	940979	943475	...	0.000000	0.000006	chr1	940979	943475
4	chr1	876359	883710	N	<DEL>	7351	DEL	940979	940979	948330	...	0.000126	0.000000	chr1	940979	948330
5	chr1	877386	880278	N	<DUP>	2892	DUP	942006	942006	944898	...	0.000000	0.000006	chr1	942006	944898
6	chr1	878538	881131	N	<DEL>	2593	DEL	943158	943158	945751	...	0.000000	0.000000	chr1	943158	945751
7	chr1	883414	887617	N	<DEL>	4203	DEL	948034	948034	952237	...	0.000000	0.000000	chr1	948034	952237
8	chr1	895867	901197	N	<DUP>	5330	DUP	960487	960487	965817	...	0.000000	0.000000	chr1	960487	965817
9	chr1	895867	912119	N	<DEL>	16252	DEL	960487	960487	976739	...	0.000000	0.000006	chr1	960487	976739

10 rows × 98 columns

SV liftover¶

Like with the CNVs, we expect SV to have similar behaviour with regard to liftover. We therefore first check the yield of the how the regualr segment based liftover for SV:

In [24]:

qh3 = GQH.GOR_Query_Helper()
mydefs3 = ""

In [25]:

mydefs3 = qh3.add_many("""
create #thin# = gor user_data/ref_gnomad4/sv_1672_cols.gord | select 1-3 ;

create #pseg# = gor [#thin#] | select 1-3 |rownum
| calc oldsize #3-#2
| liftover config/liftover/hg38tohg19.gor -seg -build hg38;
create #psta# = gor [#thin#] | select 1,2 | rownum
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
create #pend# = gor [#thin#] | select 1,3 | rownum | sort chrom
| liftover config/liftover/hg38tohg19.gor -snp -build hg38;
""")

In [26]:

%%gor
$mydefs3
create #temp# = nor [#pseg#] | group -gc hg38_liftoverstatus -count | granno -sum -ic allcount | calc fr allcount/sum_allcount;
nor [#temp#]

Query ran in 1.53 sec
Query fetched 2 rows in 1.79 sec (total time 3.32 sec)

Out[26]:

	hg38_liftoverStatus	allCount	sum_allCount	fr
0	mapped	2103093	2151209	0.977633
1	unmapped	48116	2151209	0.022367

Actually, the yield is significantly better than for the CNVs. To see how much we gain by fuzzy liftover we run:

In [27]:

%%gor
$mydefs3
create #temp# = nor [#pseg#] | map -c rownum <(nor [#psta#] | select rownum,pos,chrom | rename chrom sta_chrom)
             | map -c rownum <(nor [#pend#] | select rownum,end,chrom | rename chrom end_chrom)
| calc s2 endx-posx
| calc f if(s2=oldsize,1.0,s2/oldsize)
| calc similar if(f < 1.05 and f > 0.95 and sta_chrom = end_chrom,1,0)
| group -gc similar -count | granno -sum -ic allcount | calc fr allcount/sum_allcount;
                 
nor [#temp#]

Query ran in 1.24 sec
Query fetched 2 rows in 11.20 sec (total time 12.44 sec)

Out[27]:

	similar	allCount	sum_allCount	fr
0	0	32826	2151209	0.015259
1	1	2118383	2151209	0.984741

We are only slightly better than using perfect match. Thus, we will simply write our hg19 SVs using the regular segment based liftover.

In [30]:

%%gor
create #sv_hg19# = gor user_data/ref_gnomad4/sv_1672_cols.gord | liftover config/liftover/hg38tohg19.gor -seg -build hg38;

gor [#sv_hg19#] | where hg38_liftoverstatus = 'mapped' | top 10 /* |  write user_data/hakon/gnomAD/sv_all_v4_1672_cols_hg19.gorz */

Query ran in 0.22 sec
Query fetched 0 rows in 203.62 sec (total time 203.84 sec)

Out[30]:

	CHROM	POS

The liftover for the 2.1M SV variants takes about 10 minutes and another 3 minutes to writer the 1600+ columns out. This table should be narrowed down to a smaller table that has only the most important columns.

In [34]:

%%gor
nor user_data/hakon/gnomAD/sv_all_v4_1672_cols_hg19.gorz | top 1 | unpivot 1-

Query ran in 0.44 sec
Query fetched 1,678 rows in 0.01 sec (total time 0.45 sec)

Out[34]:

	Col_Name	Col_Value
0	CHROM	chr1
1	POS	10240
2	END	10240
3	END2	205738
4	REF	N
...	...	...
1673	hg38_CHROM	chr1
1674	hg38_POS	180903
1675	hg38_END	180903
1676	hg38_qStrand	+
1677	hg38_liftoverStatus	mapped

1678 rows × 2 columns

In the folder user_data/hakon/gnomAD we have now several gnomAD sources:

gnomad_v4_hg19_clean.gord
cnv_all_v4_hg19.gorz
sv_all_v4_1672_cols_hg19.gorz

and from gnomAD v3

gnomad_v3_maincols_hg19.gord

WGS data¶

We did the very much the same analysis for the WGS data as for the exoms. First, we make a GORD file from the VCF files in S3:

In [22]:

%%gor
nor user_data/ref_gnomad4/files/
| where right(filename,8)='vcf.link'
| select filename
| replace filename 's3://gdb-demo-dev-blobstorage/shared/extdata/hg38/gnomad/release/4.0/vcf/genomes/'+filename
| replace filename replace(filename,'vcf.link','vcf.gz')
| rownum 
| rename rownum alias 
| calc chr_start regsel(filename,'.*sites.(.*).vcf.gz')
| calc bp_start 0
| calc chr_stop chr_start
| calc chr_end 1000000000
/* | write user_data/ref_gnomad4/WGS/genomes.gord */ | top 5

Query ran in 0.15 sec
Query fetched 5 rows in 0.19 sec (total time 0.34 sec)

Out[22]:

	Filename	alias	chr_start	chr_stop	chr_end
0	s3://gdb-demo-dev-blobstorage/shared/extdata/h...	1	chr16	chr16	1000000000
1	s3://gdb-demo-dev-blobstorage/shared/extdata/h...	2	chr19	chr19	1000000000
2	s3://gdb-demo-dev-blobstorage/shared/extdata/h...	3	chr6	chr6	1000000000
3	s3://gdb-demo-dev-blobstorage/shared/extdata/h...	4	chr9	chr9	1000000000
4	s3://gdb-demo-dev-blobstorage/shared/extdata/h...	5	chr11	chr11	1000000000

Then we transform the data to columnar GOR files:

In [21]:

%%gor
def #top# = top 10;

def #colslist# = AC,AN,AF,grpmax,fafmax_faf95_max,fafmax_faf95_max_gen_anc,AC_XX,AF_XX,AN_XX,nhomalt_XX,AC_XY,AF_XY,AN_XY,nhomalt_XY,nhomalt,AC_afr_XX,AF_afr_XX,AN_afr_XX,nhomalt_afr_XX,AC_afr_XY,AF_afr_XY,AN_afr_XY,nhomalt_afr_XY,AC_afr,AF_afr,AN_afr,nhomalt_afr,AC_ami_XX,AF_ami_XX,AN_ami_XX,nhomalt_ami_XX,AC_ami_XY,AF_ami_XY,AN_ami_XY,nhomalt_ami_XY,AC_ami,AF_ami,AN_ami,nhomalt_ami,AC_amr_XX,AF_amr_XX,AN_amr_XX,nhomalt_amr_XX,AC_amr_XY,AF_amr_XY,AN_amr_XY,nhomalt_amr_XY,AC_amr,AF_amr,AN_amr,nhomalt_amr,AC_asj_XX,AF_asj_XX,AN_asj_XX,nhomalt_asj_XX,AC_asj_XY,AF_asj_XY,AN_asj_XY,nhomalt_asj_XY,AC_asj,AF_asj,AN_asj,nhomalt_asj,AC_eas_XX,AF_eas_XX,AN_eas_XX,nhomalt_eas_XX,AC_eas_XY,AF_eas_XY,AN_eas_XY,nhomalt_eas_XY,AC_eas,AF_eas,AN_eas,nhomalt_eas,AC_fin_XX,AF_fin_XX,AN_fin_XX,nhomalt_fin_XX,AC_fin_XY,AF_fin_XY,AN_fin_XY,nhomalt_fin_XY,AC_fin,AF_fin,AN_fin,nhomalt_fin,AC_mid_XX,AF_mid_XX,AN_mid_XX,nhomalt_mid_XX,AC_mid_XY,AF_mid_XY,AN_mid_XY,nhomalt_mid_XY,AC_mid,AF_mid,AN_mid,nhomalt_mid,AC_nfe_XX,AF_nfe_XX,AN_nfe_XX,nhomalt_nfe_XX,AC_nfe_XY,AF_nfe_XY,AN_nfe_XY,nhomalt_nfe_XY,AC_nfe,AF_nfe,AN_nfe,nhomalt_nfe,AC_raw,AF_raw,AN_raw,nhomalt_raw,AC_remaining_XX,AF_remaining_XX,AN_remaining_XX,nhomalt_remaining_XX,AC_remaining_XY,AF_remaining_XY,AN_remaining_XY,nhomalt_remaining_XY,AC_remaining,AF_remaining,AN_remaining,nhomalt_remaining,AC_sas_XX,AF_sas_XX,AN_sas_XX,nhomalt_sas_XX,AC_sas_XY,AF_sas_XY,AN_sas_XY,nhomalt_sas_XY,AC_sas,AF_sas,AN_sas,nhomalt_sas,AC_joint_XX,AF_joint_XX,AN_joint_XX,nhomalt_joint_XX,AC_joint_XY,AF_joint_XY,AN_joint_XY,nhomalt_joint_XY,AC_joint,AF_joint,AN_joint,nhomalt_joint,AC_joint_afr_XX,AF_joint_afr_XX,AN_joint_afr_XX,nhomalt_joint_afr_XX,AC_joint_afr_XY,AF_joint_afr_XY,AN_joint_afr_XY,nhomalt_joint_afr_XY,AC_joint_afr,AF_joint_afr,AN_joint_afr,nhomalt_joint_afr,AC_joint_ami_XX,AF_joint_ami_XX,AN_joint_ami_XX,nhomalt_joint_ami_XX,AC_joint_ami_XY,AF_joint_ami_XY,AN_joint_ami_XY,nhomalt_joint_ami_XY,AC_joint_ami,AF_joint_ami,AN_joint_ami,nhomalt_joint_ami,AC_joint_amr_XX,AF_joint_amr_XX,AN_joint_amr_XX,nhomalt_joint_amr_XX,AC_joint_amr_XY,AF_joint_amr_XY,AN_joint_amr_XY,nhomalt_joint_amr_XY,AC_joint_amr,AF_joint_amr,AN_joint_amr,nhomalt_joint_amr,AC_joint_asj_XX,AF_joint_asj_XX,AN_joint_asj_XX,nhomalt_joint_asj_XX,AC_joint_asj_XY,AF_joint_asj_XY,AN_joint_asj_XY,nhomalt_joint_asj_XY,AC_joint_asj,AF_joint_asj,AN_joint_asj,nhomalt_joint_asj,AC_joint_eas_XX,AF_joint_eas_XX,AN_joint_eas_XX,nhomalt_joint_eas_XX,AC_joint_eas_XY,AF_joint_eas_XY,AN_joint_eas_XY,nhomalt_joint_eas_XY,AC_joint_eas,AF_joint_eas,AN_joint_eas,nhomalt_joint_eas,AC_joint_fin_XX,AF_joint_fin_XX,AN_joint_fin_XX,nhomalt_joint_fin_XX,AC_joint_fin_XY,AF_joint_fin_XY,AN_joint_fin_XY,nhomalt_joint_fin_XY,AC_joint_fin,AF_joint_fin,AN_joint_fin,nhomalt_joint_fin,AC_joint_mid_XX,AF_joint_mid_XX,AN_joint_mid_XX,nhomalt_joint_mid_XX,AC_joint_mid_XY,AF_joint_mid_XY,AN_joint_mid_XY,nhomalt_joint_mid_XY,AC_joint_mid,AF_joint_mid,AN_joint_mid,nhomalt_joint_mid,AC_joint_nfe_XX,AF_joint_nfe_XX,AN_joint_nfe_XX,nhomalt_joint_nfe_XX,AC_joint_nfe_XY,AF_joint_nfe_XY,AN_joint_nfe_XY,nhomalt_joint_nfe_XY,AC_joint_nfe,AF_joint_nfe,AN_joint_nfe,nhomalt_joint_nfe,AC_joint_raw,AF_joint_raw,AN_joint_raw,nhomalt_joint_raw,AC_joint_remaining_XX,AF_joint_remaining_XX,AN_joint_remaining_XX,nhomalt_joint_remaining_XX,AC_joint_remaining_XY,AF_joint_remaining_XY,AN_joint_remaining_XY,nhomalt_joint_remaining_XY,AC_joint_remaining,AF_joint_remaining,AN_joint_remaining,nhomalt_joint_remaining,AC_joint_sas_XX,AF_joint_sas_XX,AN_joint_sas_XX,nhomalt_joint_sas_XX,AC_joint_sas_XY,AF_joint_sas_XY,AN_joint_sas_XY,nhomalt_joint_sas_XY,AC_joint_sas,AF_joint_sas,AN_joint_sas,nhomalt_joint_sas,AC_grpmax,AF_grpmax,AN_grpmax,nhomalt_grpmax,grpmax_joint,AC_grpmax_joint,AF_grpmax_joint,AN_grpmax_joint,nhomalt_grpmax_joint,faf95_XX,faf95_XY,faf95,faf95_afr_XX,faf95_afr_XY,faf95_afr,faf95_amr_XX,faf95_amr_XY,faf95_amr,faf95_eas_XX,faf95_eas_XY,faf95_eas,faf95_nfe_XX,faf95_nfe_XY,faf95_nfe,faf95_sas_XX,faf95_sas_XY,faf95_sas,faf99_XX,faf99_XY,faf99,faf99_afr_XX,faf99_afr_XY,faf99_afr,faf99_amr_XX,faf99_amr_XY,faf99_amr,faf99_eas_XX,faf99_eas_XY,faf99_eas,faf99_nfe_XX,faf99_nfe_XY,faf99_nfe,faf99_sas_XX,faf99_sas_XY,faf99_sas,fafmax_faf99_max,fafmax_faf99_max_gen_anc,faf95_joint_XX,faf95_joint_XY,faf95_joint,faf95_joint_afr_XX,faf95_joint_afr_XY,faf95_joint_afr,faf95_joint_amr_XX,faf95_joint_amr_XY,faf95_joint_amr,faf95_joint_eas_XX,faf95_joint_eas_XY,faf95_joint_eas,faf95_joint_nfe_XX,faf95_joint_nfe_XY,faf95_joint_nfe,faf95_joint_sas_XX,faf95_joint_sas_XY,faf95_joint_sas,faf99_joint_XX,faf99_joint_XY,faf99_joint,faf99_joint_afr_XX,faf99_joint_afr_XY,faf99_joint_afr,faf99_joint_amr_XX,faf99_joint_amr_XY,faf99_joint_amr,faf99_joint_eas_XX,faf99_joint_eas_XY,faf99_joint_eas,faf99_joint_nfe_XX,faf99_joint_nfe_XY,faf99_joint_nfe,faf99_joint_sas_XX,faf99_joint_sas_XY,faf99_joint_sas,fafmax_faf95_max_joint,fafmax_faf95_max_gen_anc_joint,fafmax_faf99_max_joint,fafmax_faf99_max_gen_anc_joint,fafmax_data_type_joint,age_hist_het_bin_freq,age_hist_het_n_smaller,age_hist_het_n_larger,age_hist_hom_bin_freq,age_hist_hom_n_smaller,age_hist_hom_n_larger,FS,MQ,MQRankSum,QUALapprox,QD,ReadPosRankSum,SOR,VarDP,monoallelic,only_het,transmitted_singleton,AS_FS,AS_MQ,AS_MQRankSum,AS_pab_max,AS_QUALapprox,AS_QD,AS_ReadPosRankSum,AS_SB_TABLE,AS_SOR,AS_VarDP,inbreeding_coeff,AS_culprit,AS_VQSLOD,negative_train_site,positive_train_site,allele_type,n_alt_alleles,variant_type,was_mixed,lcr,non_par,segdup;

create #hg38_433_splits# = gor <(nor <(gor #genes# | group chrom | segspan -maxseg 10000000)
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop=max_bpstop,max_tEnd,bpstop)
)
| where #2 < #3 | select 1-3;

create #temp# = pgor -split <(gor [#hg38_433_splits#] ) user_data/ref_gnomad4/WGS/genomes.gord
| #top# 
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e '.' x_1 -v #colslist#
| rename (.*)_x_2 #{1}
| replace monoallelic,only_het,transmitted_singleton,negative_train_site,positive_train_site,was_mixed,lcr,non_par,segdup if(#rc='.',0,1)
;

create #w# = pgor [#temp#]; /* write s3data://shared/user_data/ref_gnomad4/WGS/genome_cols.gord; */

nor [#w#] | top 1 | unpivot #1-

Query ran in 5.97 sec.
Query fetched 407 rows in 10.28 sec (total time 16.25 sec)

Out[21]:

	Col_Name	Col_Value
0	CHROM	chr1
1	POS	10031
2	REF	T
3	ALT	C
4	FILTER	AC0;AS_VQSR
...	...	...
402	variant_type	multi-snv
403	was_mixed	0
404	lcr	1
405	non_par	0
406	segdup	1

407 rows × 2 columns

and then we do the liftover

In [ ]:

def #source_table# = user_data/ref_gnomad4/WGS/genome_cols.gord;

create #c# = pgor -split 100 #source_table#
| group 100000 -count;

create #segs# = gor [#c#]
| seghist 1000000;

create #parts# = nor [#segs#]
| replace #2 #2+1
| granno -gc chrom -max -ic bpstop
| map -c chrom <(nor config/liftover/hg38tohg19.gor | group -gc chrom -max -ic tEnd)
| replace bpstop if(bpstop < bpstart or bpstop = max_bpstop,max_tEnd,bpstop)
| hide count,max_tEnd
;

create #lift# = parallel -parts [#parts#] <(gor -p #{col:chrom}:#{col:bpstart}-#{col:bpstop} #source_table#
| liftover config/liftover/hg38tohg19.gor -var -build hg38);

create #write# = pgor [#lift#] | write s3data://shared/user_data/ref_gnomad4/WGS/genome_cols_hg19.gord;

On our GORdb system, the former step took 18 min and the latter liftover step 32 minutes. Not bad given the volume of the data - 759,125,619 rows x 409 columns!

Validating the liftover¶

We know that liftover is not a perfect process. Here we will inspect the quality of the liftover process.

How many map?¶

To start with, lets see how many of the variants did not map between the builds:

In [24]:

%%gor
gor user_data/ref_gnomad4/WGS/genome_cols_hg19.gord
| until hg38_liftoverstatus != 'unmapped'
| group chrom -count

Query ran in 1.19 sec
Query fetched 1 rows in 13.64 sec (total time 14.83 sec)

Out[24]:

	CHROM	bpStart	bpStop	allCount
0	chr1	0	249250621	697113

We see that there are close to 700k variants that don't map.

How many have consistent Ref value?¶

Of those who do map, how do their reference bases compare with the build (the liftover command will reverse complement them if there is a change in strand):

In [29]:

%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;

create #temp# = pgor [#slice#] | where pos > 0 
| calc good if(ref = upper(refbases(chrom,pos,pos+len(ref)-1)),1,0)
| calc same_chrom if(hg38_chrom = chrom,1,0)
| group chrom -gc good,same_chrom,hg38_qstrand -count;

nor [#temp#] 
| group -gc good,same_chrom,hg38_qstrand -sum -ic allcount
| granno -gc same_chrom,hg38_qstrand -sum -ic sum_allcount
| calc ratio sum_allcount/sum_sum_allcount
| sort -c same_chrom,hg38_qstrand

Query ran in 1.97 sec
Query fetched 8 rows in 75.54 sec (total time 77.51 sec)

Out[29]:

	good	same_chrom	hg38_qStrand	sum_allCount	sum_sum_allCount	ratio
0	0	0	+	6680	814317	0.008203
1	1	0	+	807637	814317	0.991797
2	0	0	-	3375	355184	0.009502
3	1	0	-	351809	355184	0.990498
4	0	1	+	75926	747149623	0.000102
5	1	1	+	747073697	747149623	0.999898
6	0	1	-	16472	4269451	0.003858
7	1	1	-	4252979	4269451	0.996142

We see that when the liftover does not map across chromosomes and on '+' qStrand, it is good in 99.9898% cases and 99.6% cases for the '-'. Between chromosomes it is still good in 99% of the cases.

How many make it back?¶

For comparison, we can also look at how many of the variants "make it back" to their original location on hg38. We now run the liftover process in the other direction, using only the thin #slice# to make it faster. The slice has the original hg38 position and we can compare it with POS (now in hg38) to see if they are same:

In [30]:

%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;

create #back# = parallel -parts <(gor #genes# | group chrom)
<(gor -p #{col:chrom} [#slice#] | where pos > 0 | liftover config/liftover/hg19tohg38.gor -var -build hg19
| calc same if(chrom = hg38_chrom and pos = hg38_pos,1,0) | group chrom -gc same -count);

nor [#back#]
| group -gc same -sum -ic allcount
| granno -sum -ic sum_allcount | calc ratio sum_allcount/sum_sum_allcount

Query ran in 1.24 sec
Query fetched 2 rows in 0.01 sec (total time 1.25 sec)

Out[30]:

	same	sum_allCount	sum_sum_allCount	ratio
0	0	3955461	752588571	0.005256
1	1	748633110	752588571	0.994744

We see that 99.5% of the variant make it back.

How is the variant normalization holding up?¶

First, lets look at some of the variants where we see more than one row after variant normalization:

In [31]:

%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;

gor [#slice#] | where pos > 0 | varnorm -left ref alt | group 1 -gc ref,alt -count
| where allcount > 1
| varjoin [#slice#]
| top 100

Query ran in 1.33 sec
Query fetched 100 rows in 1.07 sec (total time 2.40 sec)

Out[31]:

	CHROM	POS	REF	ALT	allCount	POSx	REFx	ALTx	hg38_qStrand	hg38_CHROM	hg38_POS
0	chr1	10232	CCCTAA	C	2	10232	CCCTAA	C	+	chr1	10232
1	chr1	10232	CCCTAA	C	2	10232	CCCTAA	C	+	chr1	180895
2	chr1	10234	CT	C	2	10234	CT	C	+	chr1	180897
3	chr1	10234	CT	C	2	10234	CT	C	+	chr1	10234
4	chr1	10235	T	A	2	10235	T	A	+	chr1	10235
...	...	...	...	...	...	...	...	...	...	...	...
95	chr1	10253	CTA	C	2	10253	CTA	C	+	chr1	180916
96	chr1	10254	T	A	2	10254	T	A	+	chr1	10254
97	chr1	10254	T	A	2	10254	T	A	+	chr1	180917
98	chr1	10254	T	C	2	10254	T	C	+	chr1	10254
99	chr1	10254	T	C	2	10254	T	C	+	chr1	180917

100 rows × 11 columns

Clearly, we are seeing some duplicates!

Note: We ran the following command in hg38 context and saw nothing, i.e. the gnomad data is properly left-normalized as stated

In [ ]:

gor s3://gdb-demo-dev-blobstorage/shared/extdata/hg38/gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr10.vcf.gz
| select 1,2,ref,alt
| granno 1 -gc ref,alt -count
| throwif allcount > 1
| calc oldpos pos
| top 100000
| varnorm -left ref alt
| where oldpos != pos

Now, lets understand the scope of these duplicates genome wide and in our targets:

In [40]:

%%gor
create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;

create #slice# = pgor -split 500 user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,hg38_qstrand,hg38_chrom,hg38_pos;

create #gdx_targets# = gor user_data/hakon/gdx_targest.gorz | segproj;

create #temp# = pgor [#slice#] | where pos > 0
| varnorm -left ref alt | group 1 -gc ref,alt -count
| join -snpseg -ic [#gdx_targets#]
| rename overlapcount inTarget
| calc bad if(allcount > 1,1,0)
| calc type if(len(ref)=len(alt),'snp','InDel')    
| group chrom -gc bad,type,inTarget -count;
    
nor [#temp#] | group -gc bad,type,inTarget -sum -ic allcount
| replace bad if(bad=1,'bad','good')
| pivot -v bad,good bad -gc type,intarget
| calc ratio good_sum_allcount/(bad_sum_allcount+good_sum_allcount)

Query ran in 0.75 sec
Query fetched 4 rows in 0.02 sec (total time 0.77 sec)

Out[40]:

	type	inTarget	bad_sum_allCount	good_sum_allCount	ratio
0	InDel	0	233695	108350603	0.997848
1	InDel	1	1095	349764	0.996879
2	snp	0	1497712	633885213	0.997643
3	snp	1	23431	6414500	0.996360

We see that the extent of the "problem" is similar within our targets as outside of them, i.e. only 99.6% of the variants or so have non-duplicates in the mapping. Note that ignoring variants that are have wrong reference value should not change these numbers. Also, as we see from the above table, this does not only apply to InDels but also SNPs, hence not a only a result of variant normalization having gone away after the liftover.

Resolving duplicates¶

In locations where we have duplicate rows, as described above, we must resolve the issues. We can choose several strategies, including

eliminate all rows, i.e. ignore AF estimates from gnomAD where this is the case
pick the row with the largest AN, i.e. where the highest numbers samples had good genotype of coverage
pick the row with the higher AF estimate, assuming the AN there compares favorably with the one wit the largest AN

First, lets look at an example that shows the situation we are dealing with:

In [41]:

%%gor

gor -p chr1:12892115- user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,af,an
| granno 1 -gc ref,alt -count
| where allcount > 1
| sort 1 -c ref,alt,an:nr
| distloc 3

Query ran in 0.81 sec
Query fetched 11 rows in 0.35 sec (total time 1.17 sec)

Out[41]:

	CHROM	POS	REF	ALT	AF	AN	allCount
0	chr1	12892215	G	A	0.000000	148116	2
1	chr1	12892215	G	A	0.000059	102546	2
2	chr1	12892215	G	C	0.349764	150264	2
3	chr1	12892215	G	C	0.000027	148116	2
4	chr1	12892215	G	T	0.000027	150252	3
5	chr1	12892215	G	T	0.000013	148230	3
6	chr1	12892215	G	T	0.000020	102452	3
7	chr1	12892229	A	T	0.000040	148686	2
8	chr1	12892229	A	T	0.000177	95950	2
9	chr1	12892236	A	G	0.000013	148704	2
10	chr1	12892236	A	G	0.000011	90658	2

Our select strategy could the be expressed with the code below distloc 3:

In [47]:

%%gor

gor -p chr1:12892115- user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,af,an
| granno 1 -gc ref,alt -count
| where allcount > 1
| distloc 3 
| varnorm -left ref alt
| granno 1 -gc ref,alt -max -ic AN
| where AN > 0.75 * max_AN
| atmax 1 -gc ref,alt AF

Query ran in 0.28 sec
Query fetched 5 rows in 0.54 sec (total time 0.82 sec)

Out[47]:

	CHROM	POS	REF	ALT	AF	AN	allCount	max_AN
0	chr1	12892215	G	A	0.000000	148116	2	148116
1	chr1	12892215	G	C	0.349764	150264	2	150264
2	chr1	12892215	G	T	0.000027	150252	3	150252
3	chr1	12892229	A	T	0.000040	148686	2	148686
4	chr1	12892236	A	G	0.000013	148704	2	148704

As we see, in some cases we pick the row with the largest AF while in some cases the difference in AN is so large that we pick the one with the better AN.

How does gnomad V4 compare with our inhouse AF estimates?¶

We have recently calculated the AF in XA, by taking into consideration the target coverage. Let's comare these variants with the AF in our new gnomAD v4 source:

In [52]:

%%gor

nor <(gor #genes# | grep brc | join -segsnp -ir <(gor ref_gdx/XA/AF/xa_af.gord -f ANY)
| where af > 0.001 | select 1-AF,homcount | top 100
| varjoin -r -e 0 <(gor user_data/ref_gnomad4/WGS/genome_cols_hg19.gord | select 1-4,af,an | granno 1 -gc ref,alt -max -ic AN
| where AN > 0.75 * max_AN
| atmax 1 -gc ref,alt AF)
| join -snpseg -r -l -t #genes#
| calc diff abs(af-afx)/(af+afx)
)
| sort -c diff:nr
| calc big if(diff>0.2,1,0)
| granno -sum -ic big

Query ran in 1.77 sec
Query fetched 65 rows in 3.56 sec (total time 5.33 sec)

Out[52]:

	chrom	pos	ref	alt	AN	AC	AF	homCount	AFx	ANx	max_AN	gene_symbol	diff	big	sum_big
0	chr13	32929387	T	C	483384	12492	0.025843	6208	0.980573	152310	152310	BRCA2	0.948644	1	36
1	chr13	32911463	A	G	483384	1293	0.002675	41	0.046940	152300	152300	BRCA2	0.892174	1	36
2	chr13	32910721	T	C	483384	1252	0.002590	40	0.040967	152222	152222	BRCA2	0.881070	1	36
3	chr17	41245471	C	T	483384	1779	0.003680	71	0.056823	152244	152244	BRCA1	0.878344	1	36
4	chr17	41245900	T	G	483384	585	0.001210	17	0.000348	152334	152334	BRCA1	0.553411	1	36
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
60	chr13	32937488	G	T	483384	649	0.001343	1	0.001281	152248	152248	BRCA2	0.023553	0	36
61	chr13	32972760	G	A	483384	637	0.001318	2	0.001267	152270	152270	BRCA2	0.019460	0	36
62	chr13	32937521	G	A	483384	1151	0.002381	1	0.002298	152290	152290	BRCA2	0.017705	0	36
63	chr17	41267763	C	T	483384	1450	0.003000	8	0.002950	152222	152222	BRCA1	0.008414	0	36
64	chr13	32912008	G	A	483384	1082	0.002238	1	0.002233	152288	152288	BRCA2	0.001295	0	36

65 rows × 15 columns

In the BRCA1 and BRCA2 genes, there are only 65 variants with AF > 0.001 in XA. These variants are all present in gnomAD and 36 have relative difference greater than 20%. Some of the have 10x greater AF in gnomAD, however, about 50% of then are smaller in gnomAD. Most of those variants, however, have AF < 0.1%. We notice that the variant with the largest discrepancy has mostly homozygous carriers in XA, consistent with the AF value in gnomAD and not the AF in XA!

In [ ]: