LD¶

LD

Linkage disequilibrium¶

First we include boiler plate code to define the project, set environment variables and import libraries.

In [118]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1

project = "janssen_sle"
%env GOR_API_PROJECT={project}

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.

In [119]:

gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
"""

We start by selecting a single variant from a GWAS run that we are interested in:

In [120]:

gwasdefs = """
def #chrom# = chr6 /* chr2 */;
def #mypos# = 32335915 /* 191099907 */;
def #maxdist# = 1000000;
def #gwas# = workflow/plink-regression/2021/09/07/sle_demo_run/gwas/plink_us01_sle_case_control_allanc_unannotated_unfiltered.gorz;
"""

In [121]:

%%gor
$gordefs
$gwasdefs

gor -p #chrom#:#mypos# #gwas#

Query ran in 0.05 sec
Query fetched 1 rows in 0.00 sec (total time 0.06 sec)

Out[121]:

	CHROM	POS	ID	REF	ALT	A1	TEST	OBS_CT	SE	P	GC	QQ	BETA	OR	AF	AF_CASES	AF_CONTROLS
0	chr6	32335915	rs1265754	T	A	A	ADD	13583	0.075092	6.058370e-17	2.004860e-16	9.626490e-07	0.628075	1.874	0.109144	0.13797	0.106015

To explain the self-join like behaviour of the GTLD command we first show a simple query processing only 3 variants:

In [122]:

%%gor
$gordefs
$gwasdefs

gor -p chr2:10000000- #hgt# 
| csvsel -gc ref,alt -vs 1 ##freeze##/buckets.tsv <(nor #buckets# | select pn)
| top 3
| rownum
| gtld -f #maxdist# -sum -calc

Query ran in 1.57 sec
Query fetched 9 rows in 0.01 sec (total time 1.57 sec)

Out[122]:

	CHROM	POS	REF	alt	rownum	distance	POSx	REFx	altx	rownumx	...	LD_g20	LD_g01	LD_g11	LD_g21	LD_g02	LD_g12	LD_g22	LD_D	LD_Dp	LD_r
0	chr2	10000853	C	T	1	0	10000853	C	T	1	...	0	0	8624	0	0	0	2214	0.2235	1.0000	1.0000
1	chr2	10000853	C	T	1	15745	10016598	G	A	3	...	681	2939	4027	1037	449	912	493	0.0508	0.2531	0.2339
2	chr2	10000853	C	T	1	9481	10010334	C	A	2	...	2206	615	342	8	18	1	0	-0.0081	-0.0126	-0.1080
3	chr2	10010334	C	A	2	-9481	10000853	C	T	1	...	18	8274	342	1	2206	8	0	-0.0081	-0.0126	-0.1080
4	chr2	10010334	C	A	2	0	10010334	C	A	2	...	0	0	969	0	0	0	19	0.0252	1.0000	1.0000
5	chr2	10010334	C	A	2	6264	10016598	G	A	3	...	17	7744	296	2	1859	3	0	-0.0075	-0.0110	-0.1026
6	chr2	10016598	G	A	3	-15745	10000853	C	T	1	...	449	3678	4027	912	681	1037	493	0.0508	0.2531	0.2339
7	chr2	10016598	G	A	3	-6264	10010334	C	A	2	...	1859	670	296	3	17	2	0	-0.0075	-0.0110	-0.1026
8	chr2	10016598	G	A	3	0	10016598	G	A	3	...	0	0	8049	0	0	0	1864	0.2111	1.0000	1.0000

9 rows × 22 columns

We notice that we get 9 rows showing the LD (D-prime and r) between each combination of variants. Also notice that the column rownum that we calculated is shown for both variants. Below, we expand the above to select only variants close (as defined by #maxdist#) to one variant of interest and also label that variant to be the only variant used as left-variant. Furthermore, we add annotations from AF, VEP and our GWAS run. These annotations are generated by the nested query defined in #panno# below.

In [123]:

%%gor
$gordefs
$gwasdefs

def #panno# = <(gor #af# | select 1-4,af | varjoin -r <(gor #vep# | select 1-4,gene_symbol,max_consequence | group 1 -gc #3,#4 -lis -sc gene_symbol,max_consequence)
| varjoin -r -l <(gor #gwas# | select chrom,pos,ref,alt,p,or) );

gor -p #chrom#:#mypos# #af# | where pos = #mypos# | join -snpsnp -f #maxdist# -ir <(gor #hgt#)
/* | varjoin -i <(gor #af# | where abs(af-0.5)<0.49 or pos = #mypos# ) */
| csvsel -gc ref,alt -vs 1 ##freeze##/buckets.tsv <(nor #buckets# | select pn)
| calc useonlyasleftvar if(pos = #mypos#, 1,0)
| varjoin -r -l #panno#
| gtld -f #maxdist# -sum -calc
| select 1-alt,af,lis_gene_symbol,lis_max_consequence,p,or,distance,ld_dp,ld_r,posx-altx,afx,lis_gene_symbolx,lis_max_consequencex,px,orx
| where abs(ld_dp) > 0.2

Query ran in 0.87 sec
Query fetched 879 rows in 5.48 sec (total time 6.35 sec)

Out[123]:

	CHROM	POS	REF	alt	AF	lis_Gene_Symbol	lis_max_consequence	P	OR	distance	LD_Dp	LD_r	POSx	REFx	altx	AFx	lis_Gene_Symbolx	lis_max_consequencex	Px	ORx
0	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	-104340	0.9677	0.2533	32231575	C	T	0.629452	.	intergenic_variant	7.389440e-02	1.101430
1	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	-104548	0.9881	0.6502	32231367	C	A	0.211807	.	intergenic_variant	9.838840e-07	1.321630
2	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	-104756	0.9862	0.4290	32231159	G	A	0.380763	.	intergenic_variant	6.423180e-04	1.184010
3	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	-10736	0.9998	0.2360	32325179	T	C	0.676371	TSBP1-AS1,TSBP1,HNRNPA1P2	intron_variant,intron_variant,upstream_gene_va...	8.740700e-01	1.008230
4	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	-108441	0.9691	0.2432	32227474	T	C	0.648823	NOTCH4	upstream_gene_variant	2.571610e-01	1.063360
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
874	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	97247	0.9800	0.5139	32433162	G	C	0.297398	.	intergenic_variant	2.233160e-02	1.129580
875	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	97387	0.9854	0.1349	32433302	A	G	0.861384	.	intergenic_variant	9.331980e-03	0.831227
876	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	97525	0.9882	0.2538	32433440	C	T	0.638325	.	intergenic_variant	5.805040e-01	0.972710
877	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	9967	1.0000	0.5874	32345882	G	A	0.038549	TSBP1-AS1,TSBP1	intron_variant,intron_variant	4.337390e-07	1.784040
878	chr6	32335915	T	A	0.104271	TSBP1,TSBP1-AS1	missense_variant,intron_variant	6.058370e-17	1.874	99963	0.9804	0.5454	32435878	T	G	0.273255	HLA-DRA	upstream_gene_variant	1.028970e-02	1.151890

879 rows × 20 columns

Notice that not only do we see the annotation of the variant itself but also the corresponding annotations of the variant in LD. Commented out in the above query is a intersect join filtering to limit LD calculations only to common variants.

Finally, we expand the left-variants from a single variant to all the strongly associated variants (P <= 1e-6) and compare them with all the moderately significant variants. Keep in mind that the complexity goes up as the product of the number of variants (left and right) that overlap in the region defined by #maxdist# then times the number of samples (PNs). Hence, it is important to select these parameters in a thoughtful manner!

In [124]:

%%gor
$gordefs
$gwasdefs

create #leftvars# = gor #gwas# | where P <= 1e-6 | select chrom,pos,ref,alt,p,or;
create #allvars# = gor #gwas# | where P <= 1e-4 | select chrom,pos,ref,alt,p,or;

create #ld# = pgor [#allvars#] | varjoin -r <(gor #hgt#)
| csvsel -gc ref,alt,p,or -vs 1 #buckets#  <(nor #buckets# | select pn)
| varjoin -r -l -e 0 <(gor [#leftvars#] | select 1-4 | calc useonlyasleftvar 1)
| gtld -f #maxdist# -sum -calc;

gor [#ld#] | select 1-alt,p,or,distance-altx,px,orx,ld_dp,ld_r

Query ran in 0.09 sec
Query fetched 119,899 rows in 0.51 sec (total time 0.60 sec)

Out[124]:

	CHROM	POS	REF	ALT	P	OR	distance	POSx	REFx	ALTx	Px	ORx	LD_Dp	LD_r
0	chr1	183563445	G	T	1.958070e-07	1.67993	0	183563445	G	T	1.958070e-07	1.67993	1.0000	1.0000
1	chr2	191079016	C	T	5.302250e-09	1.38603	-40984	191038032	G	A	8.209250e-05	1.24798	0.7084	0.7032
2	chr2	191079016	C	T	5.302250e-09	1.38603	-43293	191035723	G	A	6.383040e-05	1.25269	0.7074	0.7022
3	chr2	191079016	C	T	5.302250e-09	1.38603	-7938	191071078	C	T	1.189440e-06	1.32898	0.8682	0.8055
4	chr2	191079016	C	T	5.302250e-09	1.38603	0	191079016	C	T	5.302250e-09	1.38603	1.0000	1.0000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
119894	chr7	129055929	C	A	3.168790e-11	1.64458	-115941	128939988	A	G	1.644540e-05	1.23192	0.9951	0.3728
119895	chr7	129055929	C	A	3.168790e-11	1.64458	-122016	128933913	G	A	3.983020e-05	1.22396	0.7927	0.3130
119896	chr7	129055929	C	A	3.168790e-11	1.64458	-78517	128977412	A	G	1.831400e-11	1.65603	0.9962	0.9961
119897	chr7	129055929	C	A	3.168790e-11	1.64458	0	129055929	C	A	3.168790e-11	1.64458	1.0000	1.0000
119898	chr7	129055929	C	A	3.168790e-11	1.64458	21923	129077852	G	A	1.193920e-06	1.42165	0.8453	0.7637

119899 rows × 14 columns

The examples above show how easy it is to interactively explore LD for variants of interest. It is for example trivial to modify the above queries to include a subset of the samples (e.g. by changing the use of the CSVSEL command). We have also pre-build workflows and report builders to perform LD-clumping where GWAS runs are pruned based on proximity to stronger associations.

In [ ]: