AF VEP Join¶

Joining variant allele frequency information with variant effect prediction¶

First we include boiler plate code to define the project, set environment variables and import libraries.

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1

project = "janssen_sle"
%env GOR_API_PROJECT={project}

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.

In [2]:

gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
"""

The first query is a simple inspection of the AF table using the UNPIVOT command. In GOR context we must have chrom and pos columns and therefore we would have had to use unpivot only from column three and upwards. Hence, we use non-ordered context and use NOR to read the AF table and unpivot all the columns.

In [3]:

%%gor
$gordefs
nor #af# | top 1 | unpivot 1-

Query ran in 0.47 sec
Query fetched 19 rows in 0.00 sec (total time 0.47 sec)

Out[3]:

	Col_Name	Col_Value
0	CHROM	chr1
1	POS	630053
2	REF	C
3	alt	T
4	AN	38852
5	AC	89
6	AF	0.0022907
7	conservative_AF	0.0018051
8	pnCount	19462
9	marker_yield	0.9981502414962491
10	HWE_homFreq	0.0022608
11	hetFreq	0.0000514
12	HWE_alleleFreq	0.0022865
13	HWE_chisq	18596.152229760144
14	HWE_pVal	0.0
15	HWE_logPval	Infinity
16	varCount	45
17	hetCount	1
18	homCount	44

We will now select few rows from the beginning of chr7, showing only a handful of the columns listed above.

In [4]:

%%gor
$gordefs
gor -p chr7 #af#
| select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
| top 5

Query ran in 0.21 sec
Query fetched 5 rows in 0.00 sec (total time 0.21 sec)

Out[4]:

	CHROM	POS	REF	alt	AF	marker_yield	HWE_pVal	hetCount	homCount
0	chr7	43748	T	G	0.028676	0.999846	0.517244	1078	19
1	chr7	43961	C	T	0.028843	0.998510	0.348696	1097	12
2	chr7	44766	C	A	0.022768	0.994091	0.259247	869	6
3	chr7	44935	C	T	0.390294	0.993629	0.000341	9003	3046
4	chr7	46239	G	A	0.515991	0.999332	0.011186	9899	5086

We can now join the bi-allelic variants in the AF table (#af#) with the VEP information stored in #vep#. To do this, we use VARJOIN instead of the more generic JOIN command. It joins equivalent representations of bi-allelic variants in the left-source with the data in the right-source (here #vep#). The option "-r" is used to eliminate redundant display of (pos,ref,alt) columns from the right-source, although it can be different for InDels when no normalization is used. However, in the freeze tables, all variants have been left-normalized.

In [5]:

%%gor
$gordefs
gor -p chr7 #af#
| top 5
| select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
| varjoin -r #vep#
| select 1-homcount,gene_symbol,max_impact,max_consequence

Query ran in 0.34 sec
Query fetched 5 rows in 0.05 sec (total time 0.40 sec)

Out[5]:

	CHROM	POS	REF	alt	AF	marker_yield	HWE_pVal	hetCount	homCount	Gene_Symbol	Max_Impact	max_consequence
0	chr7	43748	T	G	0.028676	0.999846	0.517244	1078	19	.	LOWEST	intergenic_variant
1	chr7	43961	C	T	0.028843	0.998510	0.348696	1097	12	.	LOWEST	intergenic_variant
2	chr7	44766	C	A	0.022768	0.994091	0.259247	869	6	.	LOWEST	intergenic_variant
3	chr7	44935	C	T	0.390294	0.993629	0.000341	9003	3046	.	LOWEST	intergenic_variant
4	chr7	46239	G	A	0.515991	0.999332	0.011186	9899	5086	.	LOWEST	intergenic_variant

It is important to emphasize that #vep# table (often referred to as VEP-single) may have the same variant represented in multiple rows. This happens when the variant overlap with multiple genes. Indeed, the columns Max_impact and Max_consequence refer to "maximum" for the variant in the corresponding gene. We can see this easily if we join the genes stored in the #genes# table with the #vep# table. We us the JOIN command with the a "-segsnp" option:

In [6]:

%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp #vep#
| top 5

Query ran in 0.20 sec
Query fetched 5 rows in 0.23 sec (total time 0.43 sec)

Out[6]:

	chrom	gene_start	gene_end	gene_symbol	POS	Reference	Call	Max_Impact	Biotype	Gene_Symbolx	Transcript_count	Amino_Acids	Protein_Position	CDS_position	Refgene	max_consequence
0	chr13	32315085	32400266	BRCA2	32315226	G	A	LOW	protein_coding	BRCA2	1	NaN	NaN	NaN	.	intron_variant
1	chr13	32315085	32400266	BRCA2	32315226	G	A	LOW	protein_coding	ZAR1L	1	NaN	NaN	NaN	.	intron_variant
2	chr13	32315085	32400266	BRCA2	32315531	G	A	LOW	protein_coding	BRCA2	3	NaN	NaN	NaN	NM_000059	5_prime_UTR_variant
3	chr13	32315085	32400266	BRCA2	32315531	G	A	LOWEST	protein_coding	ZAR1L	2	NaN	NaN	NaN	NM_001136571	upstream_gene_variant
4	chr13	32315085	32400266	BRCA2	32315644	A	G	LOW	protein_coding	BRCA2	3	NaN	NaN	NaN	NM_000059	5_prime_UTR_variant

We notice that first columns are from the left-source (#genes#) and that the Gene_symbol column (automatically rename Gene_symbolx to avoid naming conflict) has symbols different from BRCA2. We can see this more easily by using a join option that includes only the right-source, i.e. "-ir" that specifies "intersect from right":

In [7]:

%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp -ir #vep#
| top 5

Query ran in 0.15 sec
Query fetched 5 rows in 0.06 sec (total time 0.21 sec)

Out[7]:

	CHROM	POS	Reference	Call	Max_Impact	Biotype	Gene_Symbol	Transcript_count	Amino_Acids	Protein_Position	CDS_position	Refgene	max_consequence
0	chr13	32315226	G	A	LOW	protein_coding	BRCA2	1	NaN	NaN	NaN	.	intron_variant
1	chr13	32315226	G	A	LOW	protein_coding	ZAR1L	1	NaN	NaN	NaN	.	intron_variant
2	chr13	32315531	G	A	LOW	protein_coding	BRCA2	3	NaN	NaN	NaN	NM_000059	5_prime_UTR_variant
3	chr13	32315531	G	A	LOWEST	protein_coding	ZAR1L	2	NaN	NaN	NaN	NM_001136571	upstream_gene_variant
4	chr13	32315644	A	G	LOW	protein_coding	BRCA2	3	NaN	NaN	NaN	NM_000059	5_prime_UTR_variant

We can do similarly for the join between #af# and #vep# that we showed earlier if we join the genes with a nested query as the right-source:

In [8]:

%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2' 
| join -segsnp -ir <(gor #af#
                     | select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
                     | varjoin -r #vep#
                     | select 1-homcount,gene_symbol,max_impact,max_consequence)
| top 5

Query ran in 0.25 sec
Query fetched 5 rows in 0.91 sec (total time 1.16 sec)

Out[8]:

	CHROM	POS	REF	alt	AF	marker_yield	HWE_pVal	hetCount	homCount	Gene_Symbol	Max_Impact	max_consequence
0	chr13	32315226	G	A	0.174649	0.999846	0.369971	5647	575	BRCA2	LOW	intron_variant
1	chr13	32315226	G	A	0.174649	0.999846	0.369971	5647	575	ZAR1L	LOW	intron_variant
2	chr13	32316435	G	A	0.255409	0.999743	0.000549	7217	1361	BRCA2	LOW	5_prime_UTR_variant
3	chr13	32316435	G	A	0.255409	0.999743	0.000549	7217	1361	ZAR1L	LOWEST	upstream_gene_variant
4	chr13	32316446	A	C	0.000231	0.999692	1.000000	9	0	BRCA2	LOW	5_prime_UTR_variant

Finally, we can "collapse" the rows such that there is only a single row per variant, by using the list or set functionality of the GROUP command:

In [9]:

%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2' 
| join -segsnp -ir <(gor #af#
                     | select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
                     | varjoin -r #vep#
                     | select 1-homcount,gene_symbol,max_impact,max_consequence)
| group 1 -gc ref,alt-homcount -lis -sc homcount[+1]-
| top 5

Query ran in 0.27 sec
Query fetched 5 rows in 0.26 sec (total time 0.53 sec)

Out[9]:

	CHROM	POS	REF	alt	AF	marker_yield	HWE_pVal	hetCount	homCount	lis_Gene_Symbol	lis_Max_Impact	lis_max_consequence
0	chr13	32315226	G	A	0.174649	0.999846	0.369971	5647	575	BRCA2,ZAR1L	LOW,LOW	intron_variant,intron_variant
1	chr13	32316435	G	A	0.255409	0.999743	0.000549	7217	1361	BRCA2,ZAR1L	LOW,LOWEST	5_prime_UTR_variant,upstream_gene_variant
2	chr13	32316446	A	C	0.000231	0.999692	1.000000	9	0	BRCA2,ZAR1L	LOW,LOWEST	5_prime_UTR_variant,upstream_gene_variant
3	chr13	32316447	T	C	0.000051	0.999692	1.000000	2	0	BRCA2,ZAR1L	LOW,LOWEST	5_prime_UTR_variant,upstream_gene_variant
4	chr13	32316494	T	G	0.000026	0.999692	1.000000	1	0	BRCA2,ZAR1L	MODERATE,LOWEST	missense_variant,upstream_gene_variant

Notice, that we select a grouping genomic bin-size of 1bp but in addition, we use the "-gc" option to select grouping columns. Here, we select not only ref and alt but also the columns from the #af# table, since they should be constant per variant, given that there is only on row per variant in the #af# table. The "-sc" options specifies the string columns for which we apply the list-aggregator.

In [ ]: