GWAS Comparison¶
Demonstrate functionality to compare different GWAS runs¶
First we include boiler plate code to define the project, set environment variables and import libraries.
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1
project = "janssen_sle"
%env GOR_API_PROJECT={project}
Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the examples and paths to two GWAS runs: one Plink analysis and a corresponding Saige analysis of the same phenotype.
gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
def #plinkrun# = workflow/plink-regression/2021/09/07/sle_demo_run/gwas/plink_us01_sle_case_control_allanc_unannotated_unfiltered.gorz;
def #saigerun# = workflow/saige/2021/09/07/saige_demo1/gwas/saige_us01_sle_case_control_allanc.gorz;
"""
First we setup a query that fetches rows from both of the GWAS analysis. We do this by using the MERGE
command within a nested query, hence the range specified with the "-p" option on the GOR
command applies to both of the sources.
%%gor
$gordefs
gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| top 5
Query ran in 0.10 sec Query fetched 5 rows in 0.27 sec (total time 0.37 sec)
CHROM | POS | REF | ALT | P | OR | GC | type | |
---|---|---|---|---|---|---|---|---|
0 | chr2 | 11944 | C | T | 0.252400 | 1.193795 | NaN | saige |
1 | chr2 | 11944 | C | T | 0.254322 | 1.180080 | 0.262492 | plink |
2 | chr2 | 17822 | G | A | 0.193109 | 1.148673 | NaN | saige |
3 | chr2 | 34049 | G | A | 0.087926 | 1.185207 | NaN | saige |
4 | chr2 | 34049 | G | A | 0.092090 | 1.172690 | 0.097759 | plink |
Join approach¶
GOR does not support outer-join that would ensure that a join between the two GWAS runs returns all variants from both runs. We can however achieve the same results as an outer-join by performing a left-join from the union of variants with each of the GWAS tables:
%%gor
$gordefs
gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| select 1-4 | distinct
| varjoin -r -l -e '' -rprefix plink <(gor #plinkrun# | select chrom,pos,ref,alt,p,or,gc)
| varjoin -r -l -e '' -rprefix saige <(gor #saigerun# | select chrom,pos,ref,alt,p,or)
| top 5
Query ran in 0.07 sec Query fetched 5 rows in 0.40 sec (total time 0.47 sec)
CHROM | POS | REF | ALT | plink_P | plink_OR | plink_GC | saige_P | saige_OR | |
---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 11944 | C | T | 0.254322 | 1.180080 | 0.262492 | 0.252400 | 1.193795 |
1 | chr2 | 17822 | G | A | NaN | NaN | NaN | 0.193109 | 1.148673 |
2 | chr2 | 34049 | G | A | 0.092090 | 1.172690 | 0.097759 | 0.087926 | 1.185207 |
3 | chr2 | 34289 | C | T | 0.216733 | 0.836433 | 0.224653 | 0.217158 | 0.839865 |
4 | chr2 | 35015 | G | T | 0.338578 | 1.181850 | 0.346860 | 0.319962 | 1.204271 |
Pivot approach¶
An alternative to using join to present GWAS data from two or more runs, it is possible to use pivoting. The PIVOT
command takes multiple rows from an input stream with a pivot column and presents the in a horizontal fashion. The benefit of this approach compared to the join approach above is that the data is only read once but also this approach is more easily extended to deal with multiple GWAS runs.
%%gor
$gordefs
gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge -e '#missing#' <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| pivot -gc ref,alt type -v 'plink','saige' -e '#missing#'
| top 5
Query ran in 0.07 sec Query fetched 5 rows in 0.21 sec (total time 0.27 sec)
CHROM | POS | REF | ALT | plink_P | plink_OR | plink_GC | saige_P | saige_OR | saige_GC | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 11944 | C | T | 0.254322 | 1.18008 | 0.262492 | 0.252400 | 1.193795 | #missing# |
1 | chr2 | 17822 | G | A | #missing# | #missing# | #missing# | 0.193109 | 1.148673 | #missing# |
2 | chr2 | 34049 | G | A | 0.0920904 | 1.17269 | 0.0977592 | 0.087926 | 1.185207 | #missing# |
3 | chr2 | 34289 | C | T | 0.216733 | 0.836433 | 0.224653 | 0.217158 | 0.839865 | #missing# |
4 | chr2 | 35015 | G | T | 0.338578 | 1.18185 | 0.34686 | 0.319962 | 1.204271 | #missing# |
As an example, if we are only interested in variants where one of the runs has significant association, we can apply a filter with OR condition on the plink_P column and the saige_P column shown in previous output. When there are many GWAS runs, this can be done more elegantly by using the GRANNO
command, where we aggregate over rows and annotate them for the same input stream. In our case, we find the minimum p-value per variant.
%%gor
$gordefs
gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge -e '#missing#' <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| granno 1 -gc ref,alt -min -fc p
| where min_p < 1e-3
| hide min_p
| pivot -gc ref,alt type -v 'plink','saige' -e '#missing#'
| top 5
Query ran in 0.09 sec Query fetched 5 rows in 0.42 sec (total time 0.52 sec)
CHROM | POS | REF | ALT | plink_P | plink_OR | plink_GC | saige_P | saige_OR | saige_GC | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 6854878 | A | C | 0.000892 | 1.44220 | 0.001090 | 0.001128 | 1.490767 | #missing# |
1 | chr2 | 8712377 | G | A | 0.000366 | 1.56093 | 0.000460 | 0.000334 | 1.660539 | #missing# |
2 | chr2 | 9812615 | T | C | 0.000114 | 1.45902 | 0.000149 | 0.000137 | 1.421855 | #missing# |
3 | chr2 | 26659157 | C | T | 0.000190 | 1.68182 | 0.000244 | 0.000198 | 1.833909 | #missing# |
4 | chr2 | 26687688 | T | C | 0.000512 | 1.20112 | 0.000637 | 0.000784 | 1.196617 | #missing# |
With minor modification to the above query, we can select variants from a subset of genes and add annotations to the results. The VEP consequence and AF annotations need no explanation. The NHGRI-EBI GWAS Catalog annotations are more involved. First, we use the RANK
command to pick only the strongest disease associations per variant. Second, because the variants in the catalog may be based on different variants that the Plink and Saige runs, we do approximate/fuzzy join instead of a varjoin. The "-f 1000" option allow join with distance up to 1000bp and then we use ÀTMIN
to pick only the join output representing the closest distance.
%%gor
$gordefs
def #gwas_cat# = ref_external_datasets/gwascat/gwas_catalog_v1.0.2-associations_e100_r2021-01-14.gorz;
gor #genes# | where gene_symbol ~ 'BRC*' | join -segsnp -ir <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge -e '#missing#' <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| pivot -gc ref,alt type -v 'plink','saige' -e '#missing#'
| varjoin -r <(gor #vep# | select 1-4,gene_symbol,max_consequence)
| varjoin -r <(gor #af# | select 1-4,af)
| join -snpsnp -f 1000 <(gor #gwas_cat# | select 1-4,pvalue,disease_trait
| rank 1 pvalue | where rank_pvalue <= 2 | group 1 -gc ref,alt -lis -sc pvalue,disease_trait -count)
| replace distance abs(distance)
| atmin 1 distance -gc ref,alt
| hide posx-altx,plink_gc,saige_gc
Query ran in 0.31 sec Query fetched 9 rows in 1.93 sec (total time 2.24 sec)
CHROM | POS | REF | ALT | plink_P | plink_OR | saige_P | saige_OR | Gene_Symbol | max_consequence | AF | distance | allCount | lis_pvalue | lis_disease_trait | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32354065 | A | G | 0.0599445 | 1.10752 | 0.063479 | 1.110271 | BRCA2 | intron_variant | 0.264648 | 206 | 1 | 1.0E-9 | Squamous cell carcinoma |
1 | chr13 | 32379251 | T | C | 0.793676 | 1.01269 | 0.775338 | 1.014224 | BRCA2 | intron_variant | 0.518345 | 0 | 3 | 2.0E-6,6.0E-6,2.0E-6 | Low density lipoprotein cholesterol levels,Tot... |
2 | chr13 | 32385099 | C | T | 0.0152087 | 0.707357 | 0.018276 | 0.727597 | BRCA2 | intron_variant | 0.040305 | 37 | 2 | 3.0E-27,4.0E-15 | Apolipoprotein B levels,LDL cholesterol levels |
3 | chr17 | 43092919 | G | A | 0.15997 | 1.07704 | 0.170581 | 1.077417 | BRCA1 | missense_variant | 0.401352 | 530 | 2 | 8.0E-11,3.0E-8 | Menopause (age at onset),Menarche (age at onset) |
4 | chr17 | 43092919 | G | T | 0.177814 | 1.07372 | 0.189337 | 1.073985 | BRCA1 | missense_variant | 0.401316 | 530 | 2 | 8.0E-11,3.0E-8 | Menopause (age at onset),Menarche (age at onset) |
5 | chr17 | 43093220 | A | G | 0.658944 | 1.02378 | 0.665534 | 1.023980 | BRCA1 | synonymous_variant | 0.316521 | 229 | 2 | 8.0E-11,3.0E-8 | Menopause (age at onset),Menarche (age at onset) |
6 | chr17 | 43093448 | C | A | 0.323196 | 1.2337 | 0.309669 | 1.262938 | BRCA1 | missense_variant | 0.013408 | 1 | 2 | 8.0E-11,3.0E-8 | Menopause (age at onset),Menarche (age at onset) |
7 | chr17 | 43093449 | G | A | #missing# | #missing# | 0.798490 | 1.013768 | BRCA1 | synonymous_variant | 0.282838 | 0 | 2 | 8.0E-11,3.0E-8 | Menopause (age at onset),Menarche (age at onset) |
8 | chr17 | 43093454 | C | T | 0.933784 | 1.00804 | 0.942843 | 1.007130 | BRCA1 | missense_variant | 0.072382 | 5 | 2 | 8.0E-11,3.0E-8 | Menopause (age at onset),Menarche (age at onset) |