GWAS vs. Finngen¶
Comparison of Plink GWAS with Finngen results in Analysis Catalog structure¶
Here we show how we can compare our Plink results with GWAS data stored in AC structure. First we use our standard boilerplaite code.
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1
project = "janssen_sle"
%env GOR_API_PROJECT={project}
Finngen in Analysis Catalog structure¶
In the Finngen AC, we have GWAS runs for over 2800 different phenotypes. It is important to keep in mind that these are BIG data sets with over 12 million rows of marker associations per phenotype; over 24 billion rows in total. We can count the phenotypes using the phenotype overview table:
%%gor
nor AC/finngen_r5/phenotypes.tsv | group -count
Query ran in 0.06 sec Query fetched 1 rows in 0.00 sec (total time 0.06 sec)
allCount | |
---|---|
0 | 2803 |
In our first example, we will pick three cancer phenotypes that have the highest number of cases:
%%gor dfPhenocodelist <<
nor AC/finngen_r5/phenotypes.tsv | grep cancer | sort -c n_cases:nr | top 3 | group -lis -sc phenocode
Query ran in 0.06 sec Query fetched 1 rows in 0.01 sec (total time 0.07 sec)
print(f"Top three cancer phenotypes: {dfPhenocodelist.at[0,'lis_phenocode']}")
Top three cancer phenotypes: CD2_BENIGN_EXALLC,C3_CANCER,CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC
It is now possible to use first principles to access the GWAS results for each phenotype, however, it is important to keep in mind that these are BIG data sets with over 12 million rows per phenotype; over 24 billion rows in total.
For an example, if we want to list the top five p-values for each of these three phenotypes in the BRCA2 gene we can do the following:
phenolist = dfPhenocodelist.at[0,'lis_phenocode']
%%gor
gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp <(gor AC/finngen_r5/ac.gord -s phenocode -f $phenolist )
| rank 1 -gc gene_symbol,phenocode pval -o asc | where rank_pval <= 3
| select chrom,pos-pval,gene_symbol,phenocode
| sort 4000000
Query ran in 1.83 sec Query fetched 9 rows in 5.41 sec (total time 7.24 sec)
chrom | pos | ref | alt | pval | gene_symbol | phenocode | |
---|---|---|---|---|---|---|---|
0 | chr13 | 32318557 | G | T | 0.007522 | BRCA2 | CD2_BENIGN_EXALLC |
1 | chr13 | 32318683 | T | C | 0.007518 | BRCA2 | CD2_BENIGN_EXALLC |
2 | chr13 | 32318907 | A | G | 0.001871 | BRCA2 | C3_CANCER |
3 | chr13 | 32321537 | C | T | 0.007480 | BRCA2 | CD2_BENIGN_EXALLC |
4 | chr13 | 32326722 | C | G | 0.005910 | BRCA2 | CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC |
5 | chr13 | 32345218 | T | C | 0.003759 | BRCA2 | CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC |
6 | chr13 | 32363033 | C | T | 0.004839 | BRCA2 | CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC |
7 | chr13 | 32394413 | G | A | 0.002818 | BRCA2 | C3_CANCER |
8 | chr13 | 32398489 | A | T | 0.003545 | BRCA2 | C3_CANCER |
In the query above, we apply the RANK
command on the start position of the gene
(because it is more memory efficient that to rank over the entire chromosome and all variants overlaping the gene will have the same values in the
first two columns). However, we then change the position column by selecting the variant position, therefore we must apply window based sort to preserve proper GOR ordering in case of overlaping genes (not really needed in this example because we select only a single gene - an important pattern nevertheless!)
Using AC table functions¶
In case we want to run this type of analysis genome wide and for high number of phenotypes or for a gene list or for a marker list, it is important to leverage parallel execution. For the AC structures, the easiest way to do that is to use the pre-built YML report builders that can be used and queried as parameterized table functions. Importantly, behind the scenes they are setup to run optimized queries in parallel that depend on the input parameters at hand. Ofcourse, we must be careful not to select "crazy" combinations of input parameters, e.g. all phenotypes and p-value threshold of 1.0, since it may result in duplication of the entire AC dataset in temporary files. Here is how we can repeat the above query using the YML-table function.
%%gor
create #genesOfInterest# = gor #genes# | where gene_symbol = 'BRCA2';
gor AC/finngen_r5/ac_filter.yml(phenocodes = '$phenolist', input_gene_grid = [#genesOfInterest#], max_pval='1.0', annotate = 'true')
| top 5
Query ran in 1.38 sec Query fetched 5 rows in 0.16 sec (total time 1.53 sec)
chrom | pos | ref | alt | pval | beta | sebeta | maf | maf_cases | maf_controls | ... | Max_Impact | Biotype | Gene_Symbol | Transcript_count | Amino_Acids | Protein_Position | CDS_position | Refgene | max_consequence | AF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | 0.06518 | -0.0198 | 0.0107 | 0.2265 | 0.2221 | 0.2275 | ... | LOW | protein_coding | BRCA2 | 1 | NaN | NaN | NaN | . | intron_variant | 0.2265 |
1 | chr13 | 32315226 | G | A | 0.06518 | -0.0198 | 0.0107 | 0.2265 | 0.2221 | 0.2275 | ... | LOW | protein_coding | ZAR1L | 1 | NaN | NaN | NaN | . | intron_variant | 0.2265 |
2 | chr13 | 32315226 | G | A | 0.63190 | -0.0077 | 0.0160 | 0.2269 | 0.2243 | 0.2274 | ... | LOW | protein_coding | BRCA2 | 1 | NaN | NaN | NaN | . | intron_variant | 0.2265 |
3 | chr13 | 32315226 | G | A | 0.63190 | -0.0077 | 0.0160 | 0.2269 | 0.2243 | 0.2274 | ... | LOW | protein_coding | ZAR1L | 1 | NaN | NaN | NaN | . | intron_variant | 0.2265 |
4 | chr13 | 32315226 | G | A | 0.66270 | 0.0043 | 0.0099 | 0.2272 | 0.2269 | 0.2273 | ... | LOW | protein_coding | BRCA2 | 1 | NaN | NaN | NaN | . | intron_variant | 0.2265 |
5 rows × 27 columns
We see annotated variant rows from BRCA2 and overlaping genes for the three phenotypes of interest. Now we can add similar ranking logic as we used before, to find the top three hits per phenotype.
%%gor
create #genesOfInterest# = gor #genes# | where gene_symbol = 'BRCA2';
gor AC/finngen_r5/ac_filter.yml(phenocodes = '$phenolist', input_gene_grid = [#genesOfInterest#], max_pval='1.0', annotate = 'true')
| select chrom,pos-pval,gene_symbol,phenocode,name,max_consequence
| rank chrom -gc gene_symbol,phenocode pval -o asc | where rank_pval <= 3
| where gene_symbol = 'BRCA2'
Query ran in 0.80 sec Query fetched 9 rows in 0.07 sec (total time 0.87 sec)
chrom | pos | ref | alt | pval | Gene_Symbol | phenocode | name | max_consequence | rank_pval | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32318557 | G | T | 0.007522 | BRCA2 | CD2_BENIGN_EXALLC | Benign neoplasms (all cancers excluded) | intron_variant | 3 |
1 | chr13 | 32318683 | T | C | 0.007518 | BRCA2 | CD2_BENIGN_EXALLC | Benign neoplasms (all cancers excluded) | intron_variant | 2 |
2 | chr13 | 32318907 | A | G | 0.001871 | BRCA2 | C3_CANCER | Malignant neoplasm | intron_variant | 1 |
3 | chr13 | 32321537 | C | T | 0.007480 | BRCA2 | CD2_BENIGN_EXALLC | Benign neoplasms (all cancers excluded) | intron_variant | 1 |
4 | chr13 | 32326722 | C | G | 0.005910 | BRCA2 | CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC | Leiomyoma of uterus (all cancers excluded) | intron_variant | 3 |
5 | chr13 | 32345218 | T | C | 0.003759 | BRCA2 | CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC | Leiomyoma of uterus (all cancers excluded) | intron_variant | 1 |
6 | chr13 | 32363033 | C | T | 0.004839 | BRCA2 | CD2_BENIGN_LEIOMYOMA_UTERI_EXALLC | Leiomyoma of uterus (all cancers excluded) | intron_variant | 2 |
7 | chr13 | 32394413 | G | A | 0.002818 | BRCA2 | C3_CANCER | Malignant neoplasm | intron_variant | 2 |
8 | chr13 | 32398489 | A | T | 0.003545 | BRCA2 | C3_CANCER | Malignant neoplasm | stop_gained | 3 |
Genome wide comparison¶
In our final example, we will show how we can define regions of interest based on a GWAS run and then use the AC table function to locate hits from the Finngen phenotypes.
mydefs = """
def #plinkrun# = workflow/plink-regression/2021/09/07/sle_demo_run/gwas/plink_us01_sle_case_control_allanc_unannotated_unfiltered.gorz;
create #varsOfInterest# = gor #plinkrun#
| where p < 1e-4 and chrom != 'chr6'
| select chrom,pos,ref,alt;
def #bpwindow# = 10000;
create #regionsOfInterest# = gor [#varsOfInterest#]
| calc bpstart pos - #bpwindow# | calc bpstop pos + #bpwindow#
| select chrom,bpstart,bpstop | segproj;
"""
The above definitions locate all moderately significant association hits outside of chr6 and find 10kb regions surrounding them. The command SEGPROJ
is use to project overlaping segments together. We can now inspect these regions and the number of variants that overlap:
%%gor
$mydefs
gor [#regionsOfInterest#]
| join -segsnp -ic [#varsOfInterest#] | rename overlapcount numVarsOfInterst
| join -segsnp -ic #plinkrun# | rename overlapcount numPlinkVars
Query ran in 0.09 sec Query fetched 94 rows in 0.17 sec (total time 0.25 sec)
CHROM | bpstart | bpstop | segCount | numVarsOfInterst | numPlinkVars | |
---|---|---|---|---|---|---|
0 | chr1 | 77491822 | 77500432 | 1 | 0 | 1 |
1 | chr1 | 77500432 | 77511822 | 2 | 2 | 3 |
2 | chr1 | 77511822 | 77520432 | 1 | 0 | 1 |
3 | chr1 | 161473076 | 161493076 | 1 | 1 | 3 |
4 | chr1 | 173336486 | 173354690 | 1 | 1 | 4 |
... | ... | ... | ... | ... | ... | ... |
89 | chr8 | 12847660 | 12863246 | 1 | 1 | 10 |
90 | chr8 | 65104392 | 65124392 | 1 | 1 | 6 |
91 | chr9 | 17993635 | 18013635 | 1 | 1 | 10 |
92 | chr9 | 18070919 | 18090919 | 1 | 1 | 8 |
93 | chr9 | 116193874 | 116213874 | 1 | 1 | 3 |
94 rows × 6 columns
To fetch significant Finngen hits from these regions we can do the following:
%%gor
$mydefs
gor AC/finngen_r5/ac_filter.yml(region_grid = [#regionsOfInterest#], max_pval='1e-4', annotate = 'false')
Query ran in 0.83 sec Query fetched 5,731 rows in 0.69 sec (total time 1.52 sec)
chrom | pos | ref | alt | pval | beta | sebeta | maf | maf_cases | maf_controls | n_hom_cases | n_het_cases | n_hom_controls | n_het_controls | phenocode | name | rsid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 77492143 | A | G | 0.000012 | 2.7132 | 0.6206 | 0.011360 | 0.040540 | 0.011330 | 0.00 | 15.00 | 36.32 | 4711.61 | L12_PERIORALDERM | Perioral dermatitis | rs61778697 |
1 | chr1 | 77492803 | A | G | 0.000023 | 3.2163 | 0.7604 | 0.005268 | 0.022240 | 0.005247 | 0.00 | 12.01 | 7.11 | 2228.01 | AB1_CMV | Cytomegaloviral disease | rs183450646 |
2 | chr1 | 77495104 | A | G | 0.000071 | 0.2135 | 0.0537 | 0.283200 | 0.316500 | 0.283100 | 91.94 | 395.27 | 16755.96 | 84813.31 | R18_ABNORMAL_SPERMATOZ | Abnormal spermatozoa | rs35038885 |
3 | chr1 | 77495339 | G | A | 0.000042 | 25.3979 | 6.2016 | 0.000608 | 0.013850 | 0.000600 | 0.00 | 3.13 | 0.09 | 216.76 | CD2_PRIMARY_LYMPHOID_HEMATOPOIETIC_NAS_EXALLC | Other and unspecified malignant neoplasms of l... | rs1268680760 |
4 | chr1 | 77495339 | G | A | 0.000050 | 24.1656 | 5.9586 | 0.000625 | 0.013850 | 0.000618 | 0.00 | 3.13 | 0.09 | 270.26 | CD2_PRIMARY_LYMPHOID_HEMATOPOIETIC_NAS | Other and unspecified malignant neoplasms of l... | rs1268680760 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5726 | chr9 | 116210845 | C | T | 0.000003 | 3.9723 | 0.8499 | 0.000787 | 0.003320 | 0.000765 | 0.00 | 12.34 | 0.00 | 332.06 | Z21_PERSONS_W_POTEN_HEALTH_HAZARDS_RELATED_COM... | Persons with potential health hazards related ... | rs765054452 |
5727 | chr9 | 116210845 | C | T | 0.000054 | 4.3205 | 1.0701 | 0.000780 | 0.003612 | 0.000765 | 0.00 | 8.34 | 0.00 | 332.06 | Z21_CONTACT_W_EXPOS_COMMUNICAB_DISEA | Contact with and exposure to communicable dise... | rs765054452 |
5728 | chr9 | 116212550 | C | A | 0.000069 | 9.4753 | 2.3815 | 0.002277 | 0.022080 | 0.002267 | 0.00 | 4.77 | 0.03 | 940.62 | L12_PIGPURPDERM | Pigmented purpuric dermatosis | rs79537882 |
5729 | chr9 | 116212745 | G | A | 0.000090 | 13.2505 | 3.3826 | 0.000519 | 0.009079 | 0.000512 | 0.00 | 3.34 | 0.01 | 223.93 | Z21_ALCOHOL_ABUSE_COUNSELLI_SURVE | Alcohol abuse counselling and surveillance | rs1276461641 |
5730 | chr9 | 116213286 | T | C | 0.000016 | 0.6704 | 0.1556 | 0.006944 | 0.011580 | 0.006853 | 0.91 | 81.11 | 11.46 | 2487.95 | E4_DMNAS | Unspecified diabetes | rs74889321 |
5731 rows × 17 columns
Currently the AC tablefunction does not support a phenotype relation as parameter, however, we can use the approach from above.
%%gor dfCancerPhenos <<
nor AC/finngen_r5/phenotypes.tsv | grep cancer | select phenocode | top 10 |group -lis -sc phenocode
Query ran in 0.05 sec Query fetched 1 rows in 0.00 sec (total time 0.05 sec)
cancerPhenoslist = dfCancerPhenos.at[0,'lis_phenocode']
cancerPhenoslist
'C3_AML,C3_AML_EXALLC,C3_ANUS_ANALCANAL,C3_ANUS_ANALCANAL_EXALLC,C3_BILIARY_TRACT,C3_BILIARY_TRACT_EXALLC,C3_BLADDER,C3_BLADDER_EXALLC,C3_BONE_CARTILAGE,C3_BONE_CARTILAGE_EXALLC'
%%gor
$mydefs
gor AC/finngen_r5/ac_filter.yml(phenocodes = '$cancerPhenoslist', region_grid = [#regionsOfInterest#], max_pval='1e-4', annotate = 'false')
| select chrom,pos,ref,alt,pval,beta,phenocode,name
Query ran in 0.35 sec Query fetched 10 rows in 0.05 sec (total time 0.41 sec)
chrom | pos | ref | alt | pval | beta | phenocode | name | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 161475170 | G | GAAGGACT | 0.000068 | 24.0631 | C3_BONE_CARTILAGE | Malignant neoplasm of bone and articular carti... |
1 | chr1 | 161475170 | G | GAAGGACT | 0.000095 | 21.5256 | C3_BONE_CARTILAGE_EXALLC | Malignant neoplasm of bone and articular carti... |
2 | chr17 | 38891032 | A | G | 0.000049 | -0.5771 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... |
3 | chr17 | 38891032 | A | G | 0.000055 | -0.5770 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... |
4 | chr17 | 38891405 | A | G | 0.000035 | -0.5823 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... |
5 | chr17 | 38891405 | A | G | 0.000039 | -0.5819 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... |
6 | chr17 | 38892000 | T | G | 0.000045 | -0.5774 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... |
7 | chr17 | 38892000 | T | G | 0.000049 | -0.5783 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... |
8 | chr2 | 191115382 | G | A | 0.000012 | 20.4189 | C3_AML | Acute myeloid leukaemia |
9 | chr2 | 191115382 | G | A | 0.000010 | 21.0000 | C3_AML_EXALLC | Acute myeloid leukaemia (all cancers excluded) |
Finally, we can find which hits are closest to our variants of interest:
%%gor
$mydefs
create #temp# = gor AC/finngen_r5/ac_filter.yml(phenocodes = '$cancerPhenoslist', region_grid = [#regionsOfInterest#], max_pval='1e-4', annotate = 'false')
| select chrom,pos,ref,alt,pval,beta,phenocode,name;
gor #plinkrun# | select 1-5,p,or | join -snpsnp -f #bpwindow# [#temp#]
| replace distance abs(distance)
| rank 1 -gc ref,alt pval -o asc | where rank_pval <= 5
| rank 1 -gc ref,alt distance -o asc | where rank_distance <= 2
/* | group 1 -gc 3-OR -lis -sc distance,pval-name */
Query ran in 0.10 sec Query fetched 14 rows in 1.55 sec (total time 1.65 sec)
CHROM | POS | ID | REF | ALT | P | OR | distance | posx | refx | altx | pval | beta | phenocode | name | rank_pval | rank_distance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 161483076 | rs71639066 | G | T | 1.179060e-05 | 1.701230 | 7906 | 161475170 | G | GAAGGACT | 0.000068 | 24.0631 | C3_BONE_CARTILAGE | Malignant neoplasm of bone and articular carti... | 1 | 1 |
1 | chr1 | 161483076 | rs71639066 | G | T | 1.179060e-05 | 1.701230 | 7906 | 161475170 | G | GAAGGACT | 0.000095 | 21.5256 | C3_BONE_CARTILAGE_EXALLC | Malignant neoplasm of bone and articular carti... | 2 | 1 |
2 | chr17 | 38895878 | rs12940572 | T | C | 7.186300e-05 | 1.228520 | 4473 | 38891405 | A | G | 0.000035 | -0.5823 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... | 1 | 2 |
3 | chr17 | 38895878 | rs12940572 | T | C | 7.186300e-05 | 1.228520 | 4473 | 38891405 | A | G | 0.000039 | -0.5819 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... | 2 | 2 |
4 | chr17 | 38895878 | rs12940572 | T | C | 7.186300e-05 | 1.228520 | 3878 | 38892000 | T | G | 0.000045 | -0.5774 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... | 3 | 1 |
5 | chr17 | 38895878 | rs12940572 | T | C | 7.186300e-05 | 1.228520 | 3878 | 38892000 | T | G | 0.000049 | -0.5783 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... | 4 | 1 |
6 | chr17 | 38898519 | rs1130638 | C | T | 1.138850e-02 | 1.136420 | 7114 | 38891405 | A | G | 0.000035 | -0.5823 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... | 1 | 2 |
7 | chr17 | 38898519 | rs1130638 | C | T | 1.138850e-02 | 1.136420 | 7114 | 38891405 | A | G | 0.000039 | -0.5819 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... | 2 | 2 |
8 | chr17 | 38898519 | rs1130638 | C | T | 1.138850e-02 | 1.136420 | 6519 | 38892000 | T | G | 0.000045 | -0.5774 | C3_BILIARY_TRACT | Malignant neoplasm of other and unspecified pa... | 3 | 1 |
9 | chr17 | 38898519 | rs1130638 | C | T | 1.138850e-02 | 1.136420 | 6519 | 38892000 | T | G | 0.000049 | -0.5783 | C3_BILIARY_TRACT_EXALLC | Malignant neoplasm of other and unspecified pa... | 4 | 1 |
10 | chr2 | 191105394 | rs7582694 | C | G | 1.106290e-07 | 0.748417 | 9988 | 191115382 | G | A | 0.000012 | 20.4189 | C3_AML | Acute myeloid leukaemia | 2 | 1 |
11 | chr2 | 191105394 | rs7582694 | C | G | 1.106290e-07 | 0.748417 | 9988 | 191115382 | G | A | 0.000010 | 21.0000 | C3_AML_EXALLC | Acute myeloid leukaemia (all cancers excluded) | 1 | 1 |
12 | chr2 | 191124630 | rs6738544 | A | C | 8.776110e-01 | 0.992415 | 9248 | 191115382 | G | A | 0.000012 | 20.4189 | C3_AML | Acute myeloid leukaemia | 2 | 1 |
13 | chr2 | 191124630 | rs6738544 | A | C | 8.776110e-01 | 0.992415 | 9248 | 191115382 | G | A | 0.000010 | 21.0000 | C3_AML_EXALLC | Acute myeloid leukaemia (all cancers excluded) | 1 | 1 |