LD¶
Linkage disequilibrium¶
First we include boiler plate code to define the project, set environment variables and import libraries.
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1
project = "janssen_sle"
%env GOR_API_PROJECT={project}
Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.
gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
"""
We start by selecting a single variant from a GWAS run that we are interested in:
gwasdefs = """
def #chrom# = chr6 /* chr2 */;
def #mypos# = 32335915 /* 191099907 */;
def #maxdist# = 1000000;
def #gwas# = workflow/plink-regression/2021/09/07/sle_demo_run/gwas/plink_us01_sle_case_control_allanc_unannotated_unfiltered.gorz;
"""
%%gor
$gordefs
$gwasdefs
gor -p #chrom#:#mypos# #gwas#
Query ran in 0.05 sec Query fetched 1 rows in 0.00 sec (total time 0.06 sec)
CHROM | POS | ID | REF | ALT | A1 | TEST | OBS_CT | SE | P | GC | BETA | OR | AF | AF_CASES | AF_CONTROLS | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr6 | 32335915 | rs1265754 | T | A | A | ADD | 13583 | 0.075092 | 6.058370e-17 | 2.004860e-16 | 9.626490e-07 | 0.628075 | 1.874 | 0.109144 | 0.13797 | 0.106015 |
To explain the self-join like behaviour of the GTLD
command we first show a simple query processing only 3 variants:
%%gor
$gordefs
$gwasdefs
gor -p chr2:10000000- #hgt#
| csvsel -gc ref,alt -vs 1 ##freeze##/buckets.tsv <(nor #buckets# | select pn)
| top 3
| rownum
| gtld -f #maxdist# -sum -calc
Query ran in 1.57 sec Query fetched 9 rows in 0.01 sec (total time 1.57 sec)
CHROM | POS | REF | alt | rownum | distance | POSx | REFx | altx | rownumx | ... | LD_g20 | LD_g01 | LD_g11 | LD_g21 | LD_g02 | LD_g12 | LD_g22 | LD_D | LD_Dp | LD_r | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 10000853 | C | T | 1 | 0 | 10000853 | C | T | 1 | ... | 0 | 0 | 8624 | 0 | 0 | 0 | 2214 | 0.2235 | 1.0000 | 1.0000 |
1 | chr2 | 10000853 | C | T | 1 | 15745 | 10016598 | G | A | 3 | ... | 681 | 2939 | 4027 | 1037 | 449 | 912 | 493 | 0.0508 | 0.2531 | 0.2339 |
2 | chr2 | 10000853 | C | T | 1 | 9481 | 10010334 | C | A | 2 | ... | 2206 | 615 | 342 | 8 | 18 | 1 | 0 | -0.0081 | -0.0126 | -0.1080 |
3 | chr2 | 10010334 | C | A | 2 | -9481 | 10000853 | C | T | 1 | ... | 18 | 8274 | 342 | 1 | 2206 | 8 | 0 | -0.0081 | -0.0126 | -0.1080 |
4 | chr2 | 10010334 | C | A | 2 | 0 | 10010334 | C | A | 2 | ... | 0 | 0 | 969 | 0 | 0 | 0 | 19 | 0.0252 | 1.0000 | 1.0000 |
5 | chr2 | 10010334 | C | A | 2 | 6264 | 10016598 | G | A | 3 | ... | 17 | 7744 | 296 | 2 | 1859 | 3 | 0 | -0.0075 | -0.0110 | -0.1026 |
6 | chr2 | 10016598 | G | A | 3 | -15745 | 10000853 | C | T | 1 | ... | 449 | 3678 | 4027 | 912 | 681 | 1037 | 493 | 0.0508 | 0.2531 | 0.2339 |
7 | chr2 | 10016598 | G | A | 3 | -6264 | 10010334 | C | A | 2 | ... | 1859 | 670 | 296 | 3 | 17 | 2 | 0 | -0.0075 | -0.0110 | -0.1026 |
8 | chr2 | 10016598 | G | A | 3 | 0 | 10016598 | G | A | 3 | ... | 0 | 0 | 8049 | 0 | 0 | 0 | 1864 | 0.2111 | 1.0000 | 1.0000 |
9 rows × 22 columns
We notice that we get 9 rows showing the LD (D-prime and r) between each combination of variants. Also notice that the column rownum that we calculated is shown for both variants. Below, we expand the above to select only variants close (as defined by #maxdist#) to one variant of interest and also label that variant to be the only variant used as left-variant. Furthermore, we add annotations from AF, VEP and our GWAS run. These annotations are generated by the nested query defined in #panno# below.
%%gor
$gordefs
$gwasdefs
def #panno# = <(gor #af# | select 1-4,af | varjoin -r <(gor #vep# | select 1-4,gene_symbol,max_consequence | group 1 -gc #3,#4 -lis -sc gene_symbol,max_consequence)
| varjoin -r -l <(gor #gwas# | select chrom,pos,ref,alt,p,or) );
gor -p #chrom#:#mypos# #af# | where pos = #mypos# | join -snpsnp -f #maxdist# -ir <(gor #hgt#)
/* | varjoin -i <(gor #af# | where abs(af-0.5)<0.49 or pos = #mypos# ) */
| csvsel -gc ref,alt -vs 1 ##freeze##/buckets.tsv <(nor #buckets# | select pn)
| calc useonlyasleftvar if(pos = #mypos#, 1,0)
| varjoin -r -l #panno#
| gtld -f #maxdist# -sum -calc
| select 1-alt,af,lis_gene_symbol,lis_max_consequence,p,or,distance,ld_dp,ld_r,posx-altx,afx,lis_gene_symbolx,lis_max_consequencex,px,orx
| where abs(ld_dp) > 0.2
Query ran in 0.87 sec Query fetched 879 rows in 5.48 sec (total time 6.35 sec)
CHROM | POS | REF | alt | AF | lis_Gene_Symbol | lis_max_consequence | P | OR | distance | LD_Dp | LD_r | POSx | REFx | altx | AFx | lis_Gene_Symbolx | lis_max_consequencex | Px | ORx | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | -104340 | 0.9677 | 0.2533 | 32231575 | C | T | 0.629452 | . | intergenic_variant | 7.389440e-02 | 1.101430 |
1 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | -104548 | 0.9881 | 0.6502 | 32231367 | C | A | 0.211807 | . | intergenic_variant | 9.838840e-07 | 1.321630 |
2 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | -104756 | 0.9862 | 0.4290 | 32231159 | G | A | 0.380763 | . | intergenic_variant | 6.423180e-04 | 1.184010 |
3 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | -10736 | 0.9998 | 0.2360 | 32325179 | T | C | 0.676371 | TSBP1-AS1,TSBP1,HNRNPA1P2 | intron_variant,intron_variant,upstream_gene_va... | 8.740700e-01 | 1.008230 |
4 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | -108441 | 0.9691 | 0.2432 | 32227474 | T | C | 0.648823 | NOTCH4 | upstream_gene_variant | 2.571610e-01 | 1.063360 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
874 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | 97247 | 0.9800 | 0.5139 | 32433162 | G | C | 0.297398 | . | intergenic_variant | 2.233160e-02 | 1.129580 |
875 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | 97387 | 0.9854 | 0.1349 | 32433302 | A | G | 0.861384 | . | intergenic_variant | 9.331980e-03 | 0.831227 |
876 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | 97525 | 0.9882 | 0.2538 | 32433440 | C | T | 0.638325 | . | intergenic_variant | 5.805040e-01 | 0.972710 |
877 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | 9967 | 1.0000 | 0.5874 | 32345882 | G | A | 0.038549 | TSBP1-AS1,TSBP1 | intron_variant,intron_variant | 4.337390e-07 | 1.784040 |
878 | chr6 | 32335915 | T | A | 0.104271 | TSBP1,TSBP1-AS1 | missense_variant,intron_variant | 6.058370e-17 | 1.874 | 99963 | 0.9804 | 0.5454 | 32435878 | T | G | 0.273255 | HLA-DRA | upstream_gene_variant | 1.028970e-02 | 1.151890 |
879 rows × 20 columns
Notice that not only do we see the annotation of the variant itself but also the corresponding annotations of the variant in LD. Commented out in the above query is a intersect join filtering to limit LD calculations only to common variants.
Finally, we expand the left-variants from a single variant to all the strongly associated variants (P <= 1e-6) and compare them with all the moderately significant variants. Keep in mind that the complexity goes up as the product of the number of variants (left and right) that overlap in the region defined by #maxdist# then times the number of samples (PNs). Hence, it is important to select these parameters in a thoughtful manner!
%%gor
$gordefs
$gwasdefs
create #leftvars# = gor #gwas# | where P <= 1e-6 | select chrom,pos,ref,alt,p,or;
create #allvars# = gor #gwas# | where P <= 1e-4 | select chrom,pos,ref,alt,p,or;
create #ld# = pgor [#allvars#] | varjoin -r <(gor #hgt#)
| csvsel -gc ref,alt,p,or -vs 1 #buckets# <(nor #buckets# | select pn)
| varjoin -r -l -e 0 <(gor [#leftvars#] | select 1-4 | calc useonlyasleftvar 1)
| gtld -f #maxdist# -sum -calc;
gor [#ld#] | select 1-alt,p,or,distance-altx,px,orx,ld_dp,ld_r
Query ran in 0.09 sec Query fetched 119,899 rows in 0.51 sec (total time 0.60 sec)
CHROM | POS | REF | ALT | P | OR | distance | POSx | REFx | ALTx | Px | ORx | LD_Dp | LD_r | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 183563445 | G | T | 1.958070e-07 | 1.67993 | 0 | 183563445 | G | T | 1.958070e-07 | 1.67993 | 1.0000 | 1.0000 |
1 | chr2 | 191079016 | C | T | 5.302250e-09 | 1.38603 | -40984 | 191038032 | G | A | 8.209250e-05 | 1.24798 | 0.7084 | 0.7032 |
2 | chr2 | 191079016 | C | T | 5.302250e-09 | 1.38603 | -43293 | 191035723 | G | A | 6.383040e-05 | 1.25269 | 0.7074 | 0.7022 |
3 | chr2 | 191079016 | C | T | 5.302250e-09 | 1.38603 | -7938 | 191071078 | C | T | 1.189440e-06 | 1.32898 | 0.8682 | 0.8055 |
4 | chr2 | 191079016 | C | T | 5.302250e-09 | 1.38603 | 0 | 191079016 | C | T | 5.302250e-09 | 1.38603 | 1.0000 | 1.0000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119894 | chr7 | 129055929 | C | A | 3.168790e-11 | 1.64458 | -115941 | 128939988 | A | G | 1.644540e-05 | 1.23192 | 0.9951 | 0.3728 |
119895 | chr7 | 129055929 | C | A | 3.168790e-11 | 1.64458 | -122016 | 128933913 | G | A | 3.983020e-05 | 1.22396 | 0.7927 | 0.3130 |
119896 | chr7 | 129055929 | C | A | 3.168790e-11 | 1.64458 | -78517 | 128977412 | A | G | 1.831400e-11 | 1.65603 | 0.9962 | 0.9961 |
119897 | chr7 | 129055929 | C | A | 3.168790e-11 | 1.64458 | 0 | 129055929 | C | A | 3.168790e-11 | 1.64458 | 1.0000 | 1.0000 |
119898 | chr7 | 129055929 | C | A | 3.168790e-11 | 1.64458 | 21923 | 129077852 | G | A | 1.193920e-06 | 1.42165 | 0.8453 | 0.7637 |
119899 rows × 14 columns
The examples above show how easy it is to interactively explore LD for variants of interest. It is for example trivial to modify the above queries to include a subset of the samples (e.g. by changing the use of the CSVSEL
command). We have also pre-build workflows and report builders to perform LD-clumping where GWAS runs are pruned based on proximity to stronger associations.