AF VEP Join¶
Joining variant allele frequency information with variant effect prediction¶
First we include boiler plate code to define the project, set environment variables and import libraries.
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1
project = "janssen_sle"
%env GOR_API_PROJECT={project}
Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.
gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
"""
The first query is a simple inspection of the AF table using the UNPIVOT
command. In GOR context we must have chrom and pos columns and therefore we would have had to use unpivot only from column three and upwards. Hence, we use non-ordered context and use NOR
to read the AF table and unpivot all the columns.
%%gor
$gordefs
nor #af# | top 1 | unpivot 1-
Query ran in 0.47 sec Query fetched 19 rows in 0.00 sec (total time 0.47 sec)
Col_Name | Col_Value | |
---|---|---|
0 | CHROM | chr1 |
1 | POS | 630053 |
2 | REF | C |
3 | alt | T |
4 | AN | 38852 |
5 | AC | 89 |
6 | AF | 0.0022907 |
7 | conservative_AF | 0.0018051 |
8 | pnCount | 19462 |
9 | marker_yield | 0.9981502414962491 |
10 | HWE_homFreq | 0.0022608 |
11 | hetFreq | 0.0000514 |
12 | HWE_alleleFreq | 0.0022865 |
13 | HWE_chisq | 18596.152229760144 |
14 | HWE_pVal | 0.0 |
15 | HWE_logPval | Infinity |
16 | varCount | 45 |
17 | hetCount | 1 |
18 | homCount | 44 |
We will now select few rows from the beginning of chr7, showing only a handful of the columns listed above.
%%gor
$gordefs
gor -p chr7 #af#
| select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
| top 5
Query ran in 0.21 sec Query fetched 5 rows in 0.00 sec (total time 0.21 sec)
CHROM | POS | REF | alt | AF | marker_yield | HWE_pVal | hetCount | homCount | |
---|---|---|---|---|---|---|---|---|---|
0 | chr7 | 43748 | T | G | 0.028676 | 0.999846 | 0.517244 | 1078 | 19 |
1 | chr7 | 43961 | C | T | 0.028843 | 0.998510 | 0.348696 | 1097 | 12 |
2 | chr7 | 44766 | C | A | 0.022768 | 0.994091 | 0.259247 | 869 | 6 |
3 | chr7 | 44935 | C | T | 0.390294 | 0.993629 | 0.000341 | 9003 | 3046 |
4 | chr7 | 46239 | G | A | 0.515991 | 0.999332 | 0.011186 | 9899 | 5086 |
We can now join the bi-allelic variants in the AF table (#af#) with the VEP information stored in #vep#. To do this, we use VARJOIN
instead of the more generic JOIN
command. It joins equivalent representations of bi-allelic variants in the left-source with the data in the right-source (here #vep#). The option "-r" is used to eliminate redundant display of (pos,ref,alt) columns from the right-source, although it can be different for InDels when no normalization is used. However, in the freeze tables, all variants have been left-normalized.
%%gor
$gordefs
gor -p chr7 #af#
| top 5
| select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
| varjoin -r #vep#
| select 1-homcount,gene_symbol,max_impact,max_consequence
Query ran in 0.34 sec Query fetched 5 rows in 0.05 sec (total time 0.40 sec)
CHROM | POS | REF | alt | AF | marker_yield | HWE_pVal | hetCount | homCount | Gene_Symbol | Max_Impact | max_consequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr7 | 43748 | T | G | 0.028676 | 0.999846 | 0.517244 | 1078 | 19 | . | LOWEST | intergenic_variant |
1 | chr7 | 43961 | C | T | 0.028843 | 0.998510 | 0.348696 | 1097 | 12 | . | LOWEST | intergenic_variant |
2 | chr7 | 44766 | C | A | 0.022768 | 0.994091 | 0.259247 | 869 | 6 | . | LOWEST | intergenic_variant |
3 | chr7 | 44935 | C | T | 0.390294 | 0.993629 | 0.000341 | 9003 | 3046 | . | LOWEST | intergenic_variant |
4 | chr7 | 46239 | G | A | 0.515991 | 0.999332 | 0.011186 | 9899 | 5086 | . | LOWEST | intergenic_variant |
It is important to emphasize that #vep# table (often referred to as VEP-single) may have the same variant represented in multiple rows. This happens when the variant overlap with multiple genes. Indeed, the columns Max_impact and Max_consequence refer to "maximum" for the variant in the corresponding gene. We can see this easily if we join the genes stored in the #genes# table with the #vep# table. We us the JOIN
command with the a "-segsnp" option:
%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp #vep#
| top 5
Query ran in 0.20 sec Query fetched 5 rows in 0.23 sec (total time 0.43 sec)
chrom | gene_start | gene_end | gene_symbol | distance | POS | Reference | Call | Max_Impact | Biotype | Gene_Symbolx | Transcript_count | Amino_Acids | Protein_Position | CDS_position | Refgene | max_consequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32315085 | 32400266 | BRCA2 | 0 | 32315226 | G | A | LOW | protein_coding | BRCA2 | 1 | NaN | NaN | NaN | . | intron_variant |
1 | chr13 | 32315085 | 32400266 | BRCA2 | 0 | 32315226 | G | A | LOW | protein_coding | ZAR1L | 1 | NaN | NaN | NaN | . | intron_variant |
2 | chr13 | 32315085 | 32400266 | BRCA2 | 0 | 32315531 | G | A | LOW | protein_coding | BRCA2 | 3 | NaN | NaN | NaN | NM_000059 | 5_prime_UTR_variant |
3 | chr13 | 32315085 | 32400266 | BRCA2 | 0 | 32315531 | G | A | LOWEST | protein_coding | ZAR1L | 2 | NaN | NaN | NaN | NM_001136571 | upstream_gene_variant |
4 | chr13 | 32315085 | 32400266 | BRCA2 | 0 | 32315644 | A | G | LOW | protein_coding | BRCA2 | 3 | NaN | NaN | NaN | NM_000059 | 5_prime_UTR_variant |
We notice that first columns are from the left-source (#genes#) and that the Gene_symbol column (automatically rename Gene_symbolx to avoid naming conflict) has symbols different from BRCA2. We can see this more easily by using a join option that includes only the right-source, i.e. "-ir" that specifies "intersect from right":
%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp -ir #vep#
| top 5
Query ran in 0.15 sec Query fetched 5 rows in 0.06 sec (total time 0.21 sec)
CHROM | POS | Reference | Call | Max_Impact | Biotype | Gene_Symbol | Transcript_count | Amino_Acids | Protein_Position | CDS_position | Refgene | max_consequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | LOW | protein_coding | BRCA2 | 1 | NaN | NaN | NaN | . | intron_variant |
1 | chr13 | 32315226 | G | A | LOW | protein_coding | ZAR1L | 1 | NaN | NaN | NaN | . | intron_variant |
2 | chr13 | 32315531 | G | A | LOW | protein_coding | BRCA2 | 3 | NaN | NaN | NaN | NM_000059 | 5_prime_UTR_variant |
3 | chr13 | 32315531 | G | A | LOWEST | protein_coding | ZAR1L | 2 | NaN | NaN | NaN | NM_001136571 | upstream_gene_variant |
4 | chr13 | 32315644 | A | G | LOW | protein_coding | BRCA2 | 3 | NaN | NaN | NaN | NM_000059 | 5_prime_UTR_variant |
We can do similarly for the join between #af# and #vep# that we showed earlier if we join the genes with a nested query as the right-source:
%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2'
| join -segsnp -ir <(gor #af#
| select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
| varjoin -r #vep#
| select 1-homcount,gene_symbol,max_impact,max_consequence)
| top 5
Query ran in 0.25 sec Query fetched 5 rows in 0.91 sec (total time 1.16 sec)
CHROM | POS | REF | alt | AF | marker_yield | HWE_pVal | hetCount | homCount | Gene_Symbol | Max_Impact | max_consequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | 0.174649 | 0.999846 | 0.369971 | 5647 | 575 | BRCA2 | LOW | intron_variant |
1 | chr13 | 32315226 | G | A | 0.174649 | 0.999846 | 0.369971 | 5647 | 575 | ZAR1L | LOW | intron_variant |
2 | chr13 | 32316435 | G | A | 0.255409 | 0.999743 | 0.000549 | 7217 | 1361 | BRCA2 | LOW | 5_prime_UTR_variant |
3 | chr13 | 32316435 | G | A | 0.255409 | 0.999743 | 0.000549 | 7217 | 1361 | ZAR1L | LOWEST | upstream_gene_variant |
4 | chr13 | 32316446 | A | C | 0.000231 | 0.999692 | 1.000000 | 9 | 0 | BRCA2 | LOW | 5_prime_UTR_variant |
Finally, we can "collapse" the rows such that there is only a single row per variant, by using the list or set functionality of the GROUP
command:
%%gor
$gordefs
gor #genes# | where gene_symbol = 'BRCA2'
| join -segsnp -ir <(gor #af#
| select chrom,pos,ref,alt,af,marker_yield,HWE_pVal,hetCount,homCount
| varjoin -r #vep#
| select 1-homcount,gene_symbol,max_impact,max_consequence)
| group 1 -gc ref,alt-homcount -lis -sc homcount[+1]-
| top 5
Query ran in 0.27 sec Query fetched 5 rows in 0.26 sec (total time 0.53 sec)
CHROM | POS | REF | alt | AF | marker_yield | HWE_pVal | hetCount | homCount | lis_Gene_Symbol | lis_Max_Impact | lis_max_consequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | 0.174649 | 0.999846 | 0.369971 | 5647 | 575 | BRCA2,ZAR1L | LOW,LOW | intron_variant,intron_variant |
1 | chr13 | 32316435 | G | A | 0.255409 | 0.999743 | 0.000549 | 7217 | 1361 | BRCA2,ZAR1L | LOW,LOWEST | 5_prime_UTR_variant,upstream_gene_variant |
2 | chr13 | 32316446 | A | C | 0.000231 | 0.999692 | 1.000000 | 9 | 0 | BRCA2,ZAR1L | LOW,LOWEST | 5_prime_UTR_variant,upstream_gene_variant |
3 | chr13 | 32316447 | T | C | 0.000051 | 0.999692 | 1.000000 | 2 | 0 | BRCA2,ZAR1L | LOW,LOWEST | 5_prime_UTR_variant,upstream_gene_variant |
4 | chr13 | 32316494 | T | G | 0.000026 | 0.999692 | 1.000000 | 1 | 0 | BRCA2,ZAR1L | MODERATE,LOWEST | missense_variant,upstream_gene_variant |
Notice, that we select a grouping genomic bin-size of 1bp but in addition, we use the "-gc" option to select grouping columns. Here, we select not only ref and alt but also the columns from the #af# table, since they should be constant per variant, given that there is only on row per variant in the #af# table. The "-sc" options specifies the string columns for which we apply the list-aggregator.