GWAS Comparison¶

GWAS_Comparison

Demonstrate functionality to compare different GWAS runs¶

First we include boiler plate code to define the project, set environment variables and import libraries.

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1

project = "janssen_sle"
%env GOR_API_PROJECT={project}

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the examples and paths to two GWAS runs: one Plink analysis and a corresponding Saige analysis of the same phenotype.

In [2]:

gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;

def #plinkrun# = workflow/plink-regression/2021/09/07/sle_demo_run/gwas/plink_us01_sle_case_control_allanc_unannotated_unfiltered.gorz;
def #saigerun# = workflow/saige/2021/09/07/saige_demo1/gwas/saige_us01_sle_case_control_allanc.gorz;
"""

First we setup a query that fetches rows from both of the GWAS analysis. We do this by using the MERGE command within a nested query, hence the range specified with the "-p" option on the GOR command applies to both of the sources.

In [3]:

%%gor
$gordefs

gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
              
| top 5

Query ran in 0.10 sec
Query fetched 5 rows in 0.27 sec (total time 0.37 sec)

Out[3]:

	CHROM	POS	REF	ALT	P	OR	GC	type
0	chr2	11944	C	T	0.252400	1.193795	NaN	saige
1	chr2	11944	C	T	0.254322	1.180080	0.262492	plink
2	chr2	17822	G	A	0.193109	1.148673	NaN	saige
3	chr2	34049	G	A	0.087926	1.185207	NaN	saige
4	chr2	34049	G	A	0.092090	1.172690	0.097759	plink

Join approach¶

GOR does not support outer-join that would ensure that a join between the two GWAS runs returns all variants from both runs. We can however achieve the same results as an outer-join by performing a left-join from the union of variants with each of the GWAS tables:

In [4]:

%%gor
$gordefs

gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| select 1-4 | distinct
| varjoin -r -l -e '' -rprefix plink <(gor #plinkrun# | select chrom,pos,ref,alt,p,or,gc)
| varjoin -r -l -e '' -rprefix saige <(gor #saigerun# | select chrom,pos,ref,alt,p,or)
| top 5

Query ran in 0.07 sec
Query fetched 5 rows in 0.40 sec (total time 0.47 sec)

Out[4]:

	CHROM	POS	REF	ALT	plink_P	plink_OR	plink_GC	saige_P	saige_OR
0	chr2	11944	C	T	0.254322	1.180080	0.262492	0.252400	1.193795
1	chr2	17822	G	A	NaN	NaN	NaN	0.193109	1.148673
2	chr2	34049	G	A	0.092090	1.172690	0.097759	0.087926	1.185207
3	chr2	34289	C	T	0.216733	0.836433	0.224653	0.217158	0.839865
4	chr2	35015	G	T	0.338578	1.181850	0.346860	0.319962	1.204271

Pivot approach¶

An alternative to using join to present GWAS data from two or more runs, it is possible to use pivoting. The PIVOT command takes multiple rows from an input stream with a pivot column and presents the in a horizontal fashion. The benefit of this approach compared to the join approach above is that the data is only read once but also this approach is more easily extended to deal with multiple GWAS runs.

In [5]:

%%gor
$gordefs

gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge -e '#missing#' <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| pivot -gc ref,alt type -v 'plink','saige' -e '#missing#'
| top 5

Query ran in 0.07 sec
Query fetched 5 rows in 0.21 sec (total time 0.27 sec)

Out[5]:

	CHROM	POS	REF	ALT	plink_P	plink_OR	plink_GC	saige_P	saige_OR	saige_GC
0	chr2	11944	C	T	0.254322	1.18008	0.262492	0.252400	1.193795	#missing#
1	chr2	17822	G	A	#missing#	#missing#	#missing#	0.193109	1.148673	#missing#
2	chr2	34049	G	A	0.0920904	1.17269	0.0977592	0.087926	1.185207	#missing#
3	chr2	34289	C	T	0.216733	0.836433	0.224653	0.217158	0.839865	#missing#
4	chr2	35015	G	T	0.338578	1.18185	0.34686	0.319962	1.204271	#missing#

As an example, if we are only interested in variants where one of the runs has significant association, we can apply a filter with OR condition on the plink_P column and the saige_P column shown in previous output. When there are many GWAS runs, this can be done more elegantly by using the GRANNO command, where we aggregate over rows and annotate them for the same input stream. In our case, we find the minimum p-value per variant.

In [6]:

%%gor
$gordefs

gor -p chr2 <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge -e '#missing#' <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| granno 1 -gc ref,alt -min -fc p
| where min_p < 1e-3
| hide min_p
| pivot -gc ref,alt type -v 'plink','saige' -e '#missing#'
| top 5

Query ran in 0.09 sec
Query fetched 5 rows in 0.42 sec (total time 0.52 sec)

Out[6]:

	CHROM	POS	REF	ALT	plink_P	plink_OR	plink_GC	saige_P	saige_OR	saige_GC
0	chr2	6854878	A	C	0.000892	1.44220	0.001090	0.001128	1.490767	#missing#
1	chr2	8712377	G	A	0.000366	1.56093	0.000460	0.000334	1.660539	#missing#
2	chr2	9812615	T	C	0.000114	1.45902	0.000149	0.000137	1.421855	#missing#
3	chr2	26659157	C	T	0.000190	1.68182	0.000244	0.000198	1.833909	#missing#
4	chr2	26687688	T	C	0.000512	1.20112	0.000637	0.000784	1.196617	#missing#

With minor modification to the above query, we can select variants from a subset of genes and add annotations to the results. The VEP consequence and AF annotations need no explanation. The NHGRI-EBI GWAS Catalog annotations are more involved. First, we use the RANK command to pick only the strongest disease associations per variant. Second, because the variants in the catalog may be based on different variants that the Plink and Saige runs, we do approximate/fuzzy join instead of a varjoin. The "-f 1000" option allow join with distance up to 1000bp and then we use ÀTMIN to pick only the join output representing the closest distance.

In [7]:

%%gor
$gordefs

def #gwas_cat# = ref_external_datasets/gwascat/gwas_catalog_v1.0.2-associations_e100_r2021-01-14.gorz;

gor #genes# | where gene_symbol ~ 'BRC*' | join -segsnp -ir <(gor #plinkrun#
| select chrom,pos,ref,alt,p,or,gc | calc type 'plink'
| merge -e '#missing#' <(gor #saigerun# | select chrom,pos,ref,alt,p,or | calc type 'saige')
)
| pivot -gc ref,alt type -v 'plink','saige' -e '#missing#'
| varjoin -r <(gor #vep# | select 1-4,gene_symbol,max_consequence)
| varjoin -r <(gor #af# | select 1-4,af)
| join -snpsnp -f 1000 <(gor #gwas_cat# | select 1-4,pvalue,disease_trait
                         | rank 1 pvalue | where rank_pvalue <= 2 | group 1 -gc ref,alt -lis -sc pvalue,disease_trait -count)
| replace distance abs(distance)
| atmin 1 distance -gc ref,alt
| hide posx-altx,plink_gc,saige_gc

Query ran in 0.31 sec
Query fetched 9 rows in 1.93 sec (total time 2.24 sec)

Out[7]:

	CHROM	POS	REF	ALT	plink_P	plink_OR	saige_P	saige_OR	Gene_Symbol	max_consequence	AF	distance	allCount	lis_pvalue	lis_disease_trait
0	chr13	32354065	A	G	0.0599445	1.10752	0.063479	1.110271	BRCA2	intron_variant	0.264648	206	1	1.0E-9	Squamous cell carcinoma
1	chr13	32379251	T	C	0.793676	1.01269	0.775338	1.014224	BRCA2	intron_variant	0.518345	0	3	2.0E-6,6.0E-6,2.0E-6	Low density lipoprotein cholesterol levels,Tot...
2	chr13	32385099	C	T	0.0152087	0.707357	0.018276	0.727597	BRCA2	intron_variant	0.040305	37	2	3.0E-27,4.0E-15	Apolipoprotein B levels,LDL cholesterol levels
3	chr17	43092919	G	A	0.15997	1.07704	0.170581	1.077417	BRCA1	missense_variant	0.401352	530	2	8.0E-11,3.0E-8	Menopause (age at onset),Menarche (age at onset)
4	chr17	43092919	G	T	0.177814	1.07372	0.189337	1.073985	BRCA1	missense_variant	0.401316	530	2	8.0E-11,3.0E-8	Menopause (age at onset),Menarche (age at onset)
5	chr17	43093220	A	G	0.658944	1.02378	0.665534	1.023980	BRCA1	synonymous_variant	0.316521	229	2	8.0E-11,3.0E-8	Menopause (age at onset),Menarche (age at onset)
6	chr17	43093448	C	A	0.323196	1.2337	0.309669	1.262938	BRCA1	missense_variant	0.013408	1	2	8.0E-11,3.0E-8	Menopause (age at onset),Menarche (age at onset)
7	chr17	43093449	G	A	#missing#	#missing#	0.798490	1.013768	BRCA1	synonymous_variant	0.282838	0	2	8.0E-11,3.0E-8	Menopause (age at onset),Menarche (age at onset)
8	chr17	43093454	C	T	0.933784	1.00804	0.942843	1.007130	BRCA1	missense_variant	0.072382	5	2	8.0E-11,3.0E-8	Menopause (age at onset),Menarche (age at onset)

In [ ]: