XA report builder query demo¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = 'ref-data-hg19-stg'
jp_root = f'/home/jovyan/projects/{project}/'
%env GOR_API_PROJECT={project}
Demo of report builders related to gene burden analysis of XA variant and HPO data¶
This notebook provides short demos to explore the data generated in the gene analysis notebooks for XA. Here we will simply show quick command line use of the data but these same report builders are more easily used from the Sequnece Miner.
The notebooks are under data_onboarding/xa_notebooks/gene_analysis:
- rare_var_base_with_qc.ipynb: Filters the variant in XA/vars/wesvar.gord by 1% AF, and variant QC, trims them down and saves them in XA/gene_analysis/rare_vars_001_qc.gorz, and pns in XA/gene_analysis/rare_vars_001_qc_pns.tsv
- phased_rare_var_base_with_qc: Phases all probands in cases with full trio data, filters it like above and saves in XA/gene_analysis/phased_rare_vars.gorz and pn/ped in XA/gene_analysis/phased_rare_vars_ped.tsv. Note the name should include 001_qc in it since it uses the same filtering.
- gene_kit_stat_4_usage: Calculates the avg and stdev of variant counts (unfiltered) in genes per kit-type. Uses these means and deviations to find kit-outliers that are then excluded in gene analysis.
- dominant_gene_freezes: Generates gene freezes based on dominant inheritance model. Two types of freezes are generated, e.g. for HIGH vep variants and HIGH+MODERATE. About 860k samples.
- recessive_gene_greezes: Uses only proband samples with phased data (ca 200k samples). Same filtering as above but requiring two alleles in trans to be present in gene. See XA/gene_analysis/gene_freeze
- HPO_analysis: sets up phenotype data of a possible HPO singletons and "relevant" HPO pairs. HPO terms with more than 50k samples are excluded. 12985 HPO singleton phenotypes are generated and 68292 HPO pairs. This data is accessible under XA/pheno/HPO_*
- hpo_gene_phewas: Runs gene based phewas burden analysis using Fisher-exact test for all the HPO singletons and pairs. The output is stored under XA/gene_analysis/HPO_phewas
Exploring the gene association results¶
XA/gene_analysis/report_builders/XA_phewas_lookup.yml(¶
XA/gene_analysis/report_builders/XA_phewas_lookup.yml(
- name: gene_list_selection
display_name: Limit to gene(s)
description: Select gene(s) for analysis - regression will be performed on these genes only.
optional: no
type: query
quoted: true
query: "nor ref_gregor/genemb/gmb_genes.gorz | select gene_symbol | distinct"
format:
keywords: "%s"
values: "%s"
empty: ""
display_width: 300
- name: pairs_single
display_name: Pairs or Singletons
description: HPO pairs or HPO singletons
optional: no
type: string
default: "S"
values: ["S","P"]
- name: min_OR
display_name: Minimum OR
description: Minimum Odds-ratio
optional: no
type: string
default: "2"
values: ["1","2","4","8","10","50","100"]
- name: max_Pval
display_name: Maximum Pval
description: Maximum P-value
optional: no
type: string
default: "1e-6"
values: ["0.05","1e-2","1e-3","1e-4","1e-5","1e-6","1e-7","1e-8","1e-9"]
- name: min_Cases
display_name: Minimum Cases
description: Minimum Cases with burden
optional: yes
type: string
default: "4"
values: ["1","2","4","8","10","50","100"])
)
HPO singletons in two genes:¶
%%gor
gor XA/gene_analysis/report_builders/XA_phewas_lookup.yml(gene_list_selection = 'LPL,CHRNG',pairs_single = 'S', min_OR = '20', max_Pval = '1e-8')
Query fetched -1 rows in 0.36 sec
chrom | gene_start | gene_symbol | type | PhenoSingleton | OR_Fisher | pVal_Fisher | cases_1 | ctrls_1 | CaseCount | ... | all_1 | all_3 | beta_Fisher | cases_0 | cases_3 | ctrls_0 | ctrls_3 | HPO_DAG_level | HPO_name | HPO_XA_Samples | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 233404401 | CHRNG | Rec_High | pheno7 | 1000.00 | 4.000000e-09 | 10 | 0 | 47225 | ... | 10 | 2 | 6.907755 | 28676 | 0 | 169667 | 2 | 4 | Abnormal foot morphology | 47225 |
1 | chr2 | 233404401 | CHRNG | Rec_High | pheno257 | 234.40 | 1.300000e-12 | 9 | 1 | 12242 | ... | 10 | 2 | 5.457024 | 7334 | 0 | 191009 | 2 | 5 | Flexion contracture | 12241 |
2 | chr2 | 233404401 | CHRNG | Rec_High | pheno251 | 226.60 | 1.700000e-12 | 9 | 1 | 12704 | ... | 10 | 2 | 5.423292 | 7576 | 0 | 190767 | 2 | 4 | Joint contracture | 12703 |
3 | chr2 | 233404401 | CHRNG | Rec_High | pheno221 | 215.20 | 2.600000e-12 | 9 | 1 | 13624 | ... | 10 | 2 | 5.371441 | 7963 | 0 | 190380 | 2 | 5 | Abnormal tendon morphology | 13623 |
4 | chr2 | 233404401 | CHRNG | Rec_High | pheno191 | 81.58 | 9.500000e-10 | 8 | 2 | 15071 | ... | 10 | 2 | 4.401531 | 9271 | 1 | 189072 | 1 | 4 | Abnormal lower limb bone morphology | 15070 |
5 | chr2 | 233404401 | CHRNG | Rec_High | pheno988 | 163.20 | 1.200000e-10 | 6 | 4 | 2740 | ... | 10 | 2 | 5.095201 | 1806 | 0 | 196537 | 2 | 6 | Congenital contracture | 2740 |
6 | chr2 | 233404401 | CHRNG | Rec_High | pheno582 | 82.83 | 6.300000e-09 | 6 | 4 | 5527 | ... | 10 | 2 | 4.416784 | 3528 | 0 | 194815 | 2 | 7 | Talipes equinovarus | 5527 |
7 | chr2 | 233404401 | CHRNG | Rec_High | pheno557 | 77.63 | 9.200000e-09 | 6 | 4 | 5850 | ... | 10 | 2 | 4.351905 | 3760 | 0 | 194583 | 2 | 6 | Talipes | 5850 |
8 | chr2 | 233404401 | CHRNG | Rec_High | pheno155 | 136.10 | 1.300000e-10 | 9 | 1 | 18357 | ... | 10 | 2 | 4.913516 | 12301 | 0 | 186042 | 2 | 4 | Abnormal neck morphology | 18355 |
9 | chr2 | 233404401 | CHRNG | Rec_High | pheno150 | 133.50 | 1.500000e-10 | 9 | 1 | 18831 | ... | 10 | 2 | 4.893754 | 12531 | 0 | 185812 | 2 | 3 | Abnormality of the neck | 18829 |
10 | chr2 | 233404401 | CHRNG | Rec_High | pheno1747 | 1170.00 | 8.600000e-19 | 8 | 2 | 980 | ... | 10 | 2 | 7.064440 | 676 | 0 | 197667 | 2 | 6 | Abnormal talus morphology | 980 |
11 | chr2 | 233404401 | CHRNG | Rec_High | pheno1814 | 1243.00 | 5.300000e-19 | 8 | 2 | 919 | ... | 10 | 2 | 7.125637 | 636 | 0 | 197707 | 2 | 7 | Rocker bottom foot | 919 |
12 | chr2 | 233404401 | CHRNG | Rec_High | pheno1101 | 543.90 | 3.700000e-16 | 8 | 2 | 2249 | ... | 10 | 2 | 6.298782 | 1448 | 0 | 196895 | 2 | 5 | Abnormality of the tarsal bones | 2249 |
13 | chr2 | 233404401 | CHRNG | Rec_HighMode | pheno1747 | 27.44 | 1.500000e-10 | 9 | 96 | 980 | ... | 105 | 2 | 3.312027 | 675 | 0 | 197573 | 2 | 6 | Abnormal talus morphology | 980 |
14 | chr2 | 233404401 | CHRNG | Rec_HighMode | pheno1814 | 29.18 | 8.600000e-11 | 9 | 96 | 919 | ... | 105 | 2 | 3.373317 | 635 | 0 | 197613 | 2 | 7 | Rocker bottom foot | 919 |
15 | chr8 | 19796763 | LPL | Rec_HighMode | pheno609 | 22.29 | 9.700000e-09 | 8 | 49 | 5154 | ... | 57 | 63115 | 3.104099 | 983 | 560 | 134200 | 62555 | 5 | Hyperlipidemia | 5154 |
16 | chr8 | 19796763 | LPL | Rec_HighMode | pheno1312 | 41.40 | 8.400000e-11 | 8 | 49 | 1676 | ... | 57 | 63115 | 3.723308 | 531 | 317 | 134652 | 62798 | 6 | Hypertriglyceridemia | 1676 |
17 rows × 22 columns
HPO pairs:¶
Looking up HPO pair associations is similar, i.e. by changing the value for the parameter pairs_single. In both cases the lookup is very fast:
%%gor
gor XA/gene_analysis/report_builders/XA_phewas_lookup.yml(gene_list_selection = 'LPL,CHRNG',pairs_single = 'P', min_OR = '20', max_Pval = '1e-8')
| sort chrom -c pval_fisher:n
Query fetched -1 rows in 0.04 sec
chrom | gene_start | gene_symbol | type | PhenoPair | OR_Fisher | pVal_Fisher | cases_1 | ctrls_1 | CaseCount | ... | HPO2_XA_Samples | HPO_codes | all_0 | all_1 | all_3 | beta_Fisher | cases_0 | cases_3 | ctrls_0 | ctrls_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 233404401 | CHRNG | Rec_High | pheno28881 | 2736.00 | 4.600000e-20 | 7 | 3 | 242 | ... | 919 | HP:0000464&HP:0001838 | 198343 | 10 | 2 | 7.914300 | 169 | 0 | 198174 | 2 |
1 | chr2 | 233404401 | CHRNG | Rec_High | pheno29081 | 2736.00 | 4.600000e-20 | 7 | 3 | 241 | ... | 18355 | HP:0001838&HP:0025668 | 198343 | 10 | 2 | 7.914300 | 169 | 0 | 198174 | 2 |
2 | chr2 | 233404401 | CHRNG | Rec_High | pheno28080 | 2642.00 | 5.800000e-20 | 7 | 3 | 249 | ... | 18355 | HP:0008365&HP:0025668 | 198343 | 10 | 2 | 7.879382 | 175 | 0 | 198168 | 2 |
3 | chr2 | 233404401 | CHRNG | Rec_High | pheno27760 | 2627.00 | 6.100000e-20 | 7 | 3 | 251 | ... | 980 | HP:0000464&HP:0008365 | 198343 | 10 | 2 | 7.873679 | 176 | 0 | 198167 | 2 |
4 | chr2 | 233404401 | CHRNG | Rec_High | pheno17804 | 1764.00 | 9.300000e-19 | 7 | 3 | 371 | ... | 919 | HP:0001371&HP:0001838 | 198343 | 10 | 2 | 7.475385 | 262 | 0 | 198081 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
86 | chr2 | 233404401 | CHRNG | Rec_High | pheno1063 | 134.40 | 5.400000e-09 | 5 | 5 | 2343 | ... | 3142 | HP:0001155&HP:0100360 | 198343 | 10 | 2 | 4.900729 | 1465 | 0 | 196878 | 2 |
87 | chr2 | 233404401 | CHRNG | Rec_HighMode | pheno3836 | 22.04 | 7.900000e-09 | 8 | 97 | 1155 | ... | 43031 | HP:0001850&HP:0011729 | 198248 | 105 | 2 | 3.092972 | 739 | 0 | 197509 | 2 |
88 | chr2 | 233404401 | CHRNG | Rec_HighMode | pheno3827 | 22.01 | 8.000000e-09 | 8 | 97 | 1158 | ... | 43094 | HP:0001850&HP:0034430 | 198248 | 105 | 2 | 3.091615 | 740 | 0 | 197508 | 2 |
89 | chr8 | 19796763 | LPL | Rec_HighMode | pheno12075 | 87.75 | 2.200000e-10 | 6 | 51 | 510 | ... | 1676 | HP:0001392&HP:0002155 | 135183 | 57 | 63115 | 4.474482 | 181 | 95 | 135002 | 63020 |
90 | chr8 | 19796763 | LPL | Rec_HighMode | pheno10034 | 79.01 | 4.100000e-10 | 6 | 51 | 590 | ... | 1676 | HP:0002012&HP:0002155 | 135183 | 57 | 63115 | 4.369526 | 201 | 101 | 134982 | 63014 |
91 rows × 25 columns
%%gor
gor XA/gene_analysis/report_builders/XA_phewas_lookup.yml(gene_list_selection = 'LPL,CHRNG',pairs_single = 'S', min_OR = '20', max_Pval = '1e-8')
| group 1 -gc gene_symbol,type -count
Query fetched -1 rows in 0.04 sec
chrom | gene_start | gene_symbol | type | allCount | |
---|---|---|---|---|---|
0 | chr2 | 233404401 | CHRNG | Rec_High | 13 |
1 | chr2 | 233404401 | CHRNG | Rec_HighMode | 2 |
2 | chr8 | 19796763 | LPL | Rec_HighMode | 2 |
%%gor
gor XA/gene_analysis/report_builders/XA_phewas_lookup.yml(gene_list_selection = 'BRCA1',pairs_single = 'S')
Query fetched -1 rows in 0.04 sec
chrom | gene_start | gene_symbol | type | PhenoSingleton | OR_Fisher | pVal_Fisher | cases_1 | ctrls_1 | CaseCount | ... | all_1 | all_3 | beta_Fisher | cases_0 | cases_3 | ctrls_0 | ctrls_3 | HPO_DAG_level | HPO_name | HPO_XA_Samples | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | Dom_HighMod | pheno2480 | 2.079 | 1.300000e-08 | 87 | 84710 | 473 | ... | 84797 | 71 | 0.731787 | 386 | 0 | 781293 | 71 | 10 | Diffuse white matter abnormalities | 473 |
1 | chr17 | 41196311 | BRCA1 | Rec_HighMode | pheno814 | 3.938 | 3.900000e-11 | 36 | 684 | 3477 | ... | 720 | 23 | 1.370775 | 2606 | 0 | 195006 | 23 | 7 | Poor gross motor coordination | 3477 |
2 | chr17 | 41196311 | BRCA1 | Rec_HighMode | pheno306 | 2.544 | 6.100000e-10 | 62 | 658 | 10447 | ... | 720 | 23 | 0.933850 | 7057 | 0 | 190555 | 23 | 6 | Poor motor coordination | 10446 |
3 | chr17 | 41196311 | BRCA1 | Dom_HighMod | pheno1965 | 2.039 | 1.000000e-12 | 143 | 84654 | 790 | ... | 84797 | 71 | 0.712542 | 647 | 0 | 781032 | 71 | 7 | Moderate global developmental delay | 790 |
4 rows × 22 columns
Using covariants¶
The above association results are pre-calculated for all genes and all HPO singletons and pairs, ie. 20k genes times 80k phenotypes, using 900k and 200k samples for dominant and recessive analysis respectively. To make this feasible, we use a Fisher-exact test and leverage a simple calculation trick to handle the cost of the controls. However, the Fisher-exact test does not allow us to fine tune the analysis by providing covariates. Similarly, we have only applied this analysis for all the auto-generated HPO singletons and pairs. Here we explore a report builder to calculate dominant and recessive associations for the same gene freezes, however, for custom defined phenotypes and covariates using logistic regression. This freeze based gene regression is fast and can easily be extended to use more complicated burden model.
Defining HPO phenotype matrix¶
We will define two phenotypes, i.e. the phenotypes pheno28080 and pheno28881 from above that show significant association in the gene CHRNG. For this we use the following report builder:
XA/gene_analysis/report_builders/XA_HPO_pairs.yml
arguments:
- name: HPO_pairs
display_name: HPO pairs
description: Select pheno from list of HPO pairs
optional: yes
quoted: true
type: query
single_selection: no
query: "nor XA/pheno/HPO_pairs/pheno_groups.tsv | hide #1"
- name: HPO_pheno_grid
display_name: HPO grid
description: A grid with one or more row of pheno representing HPO pairs.
optional: yes
type: grid
required_columns: ["pheno"]
display_width: 300
- name: CtrlCaseRatio
display_name: Ctrl Case ratio
description: Target ratio per kit - may not be achieved
optional: yes
type: string
default: "All"
values: ["0.5", "1", "2", "4", "10", "100", "1000", "All"]
- name: GeneFreeze
display_name: Gene freeze
description: Dominant or recessive freeze
optional: yes
type: string
default: "Dom"
values: ["Dom","Rec"]
%%gor
nor XA/gene_analysis/report_builders/XA_HPO_pairs.yml(HPO_pairs='pheno28080,pheno28881',CtrlCaseRatio='All',GeneFreeze='Rec')
| group -gc #2,#3 -count
Query fetched -1 rows in 0.44 sec
pheno28080 | pheno28881 | allCount | |
---|---|---|---|
0 | 1 | 1 | 198173 |
1 | 2 | 1 | 6 |
2 | 2 | 2 | 176 |
We see that these phenotypes are very are, i.e. 176 and 182 cases. The number agree with the case counts above for case_1 + case_0, i.e 7 + 169 and 7 + 175. Note that these numbers are not the same as the following:
%%gor
nor XA/pheno/HPO_pairs/pheno_groups.tsv | where pheno in ('pheno28080','pheno28881')
Query fetched -1 rows in 0.40 sec
pheno_group | pheno | HPO_codes | CaseCount | HPO1_name | HPO1_DAG_level | HPO1_XA_Samples | HPO2_name | HPO2_DAG_level | HPO2_XA_Samples | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 80 | pheno28080 | HP:0008365&HP:0025668 | 249 | Abnormal talus morphology | 6 | 980 | Abnormal neck morphology | 4 | 18355 |
1 | 881 | pheno28881 | HP:0000464&HP:0001838 | 242 | Abnormality of the neck | 3 | 18829 | Rocker bottom foot | 7 | 919 |
Here we see case counts of 249 and 242, however, this does not take into consideration overlap with the samples in the recessive freeze. We can indeed query the phenotype sources and see how some of the samples are not part of the recessive freeze. The fast way to do that is use the partitioned dictionary table case.nord:
%%gor
create #pheno_group_filter# = nor XA/pheno/HPO_pairs/pheno_groups.tsv | where pheno in ('pheno28080','pheno28881');
def #pheno_filter# = <(nor [#pheno_group_filter#] | columnsort pheno);
nor XA/pheno/HPO_pairs/case.nord -s pheno_group -ff [#pheno_group_filter#]
| map -c pheno #pheno_filter#
| inset -c pn -b XA/gene_analysis/gene_freeze/recessive_buckets.tsv
| group -gc pheno,inset -count
Query fetched -1 rows in 0.24 sec
pheno | inSet | allCount | |
---|---|---|---|
0 | pheno28080 | 0 | 67 |
1 | pheno28080 | 1 | 182 |
2 | pheno28881 | 0 | 66 |
3 | pheno28881 | 1 | 176 |
Note that we can also limit the number of controls by specifying the desired case/ctrl ratio:
%%gor
nor XA/gene_analysis/report_builders/XA_HPO_pairs.yml(HPO_pairs='pheno28080,pheno28881',CtrlCaseRatio='10',GeneFreeze='Rec')
Query fetched -1 rows in 1.09 sec
pn | kit | pheno28080 | pheno28881 | |
---|---|---|---|---|
0 | 647146 | V6 | NaN | 1.0 |
1 | 25150 | V2 | NaN | 1.0 |
2 | 316583 | V6 | 1.0 | NaN |
3 | 49623 | V4 | 2.0 | 2.0 |
4 | 832912 | V8 | 1.0 | NaN |
... | ... | ... | ... | ... |
5121 | 234472 | V5 | NaN | 1.0 |
5122 | 485100 | V6 | 1.0 | NaN |
5123 | 783555 | V8 | 1.0 | NaN |
5124 | 32339 | V2 | NaN | 1.0 |
5125 | 617018 | V6 | NaN | 1.0 |
5126 rows × 4 columns
Importantly, we can see that the sampling of the controls is done in such a manner that the kit distribution is the same:
%%gor
nor XA/gene_analysis/report_builders/XA_HPO_pairs.yml(HPO_pairs='pheno28080,pheno28881',CtrlCaseRatio='10',GeneFreeze='Rec')
| group -gc kit,pheno28080 -count
Query fetched -1 rows in 0.08 sec
kit | pheno28080 | allCount | |
---|---|---|---|
0 | V1 | 1.0 | 50 |
1 | V1 | 2.0 | 5 |
2 | V1 | NaN | 50 |
3 | V2 | 1.0 | 170 |
4 | V2 | 2.0 | 17 |
5 | V2 | NaN | 163 |
6 | V4 | 1.0 | 130 |
7 | V4 | 2.0 | 13 |
8 | V4 | NaN | 128 |
9 | V5 | 1.0 | 880 |
10 | V5 | 2.0 | 88 |
11 | V5 | NaN | 842 |
12 | V6 | 1.0 | 730 |
13 | V6 | 2.0 | 73 |
14 | V6 | NaN | 723 |
15 | V8 | 1.0 | 310 |
16 | V8 | 2.0 | 31 |
17 | V8 | NaN | 266 |
18 | WGS-V3 | 1.0 | 220 |
19 | WGS-V3 | 2.0 | 22 |
20 | WGS-V3 | NaN | 215 |
We see that the ratio 10/1 holds for all kits! As we proceed, we will however use the full control list to show that we can validate the Fisher-exact results with logistic regression that does not use covariates.
XA/gene_analysis/report_builders/XA_freeze_logistic_regression.yml
arguments:
- name: pheno_grid
display_name: Phenotype grid
description: A phenotype grid with the first column PN containing the sample IDs, the second column containing the primary outcome (The default encoding is 0 for controls and 1 for cases for a dichotomous/logistic model, or otherwise numerical values for a continuous model), the third column containing the secondary outcome (phenotype 2) etc... Missing phenotypes must be coded as "NA". No duplicated PNs are allowed. When multiple phenotypes are present, the regression is ran sequentially for each phenotypes.
optional: no
type: grid
display_width: 300
- name: covar_grid
display_name: Covariate grid
description: A covariate grid with the first column PN containing the sample IDs, the following column(s) must contain the covariates. Missing phenotypes must be coded as "NA". Duplicated PNs are not allowed. For sex use 1 for males and 2 for females and "NA" for missing values.
optional: yes
type: grid
display_width: 300
- name: GeneFreeze
display_name: Gene freeze
description: Gene freeze type, 001 = 1% AF cutoff, VEP high/moderate impact
optional: no
type: string
default: "dominant_af001_high"
values: ["dominant_af001_high","dominant_af001_highmod","recessive_af001_high","recessive_af001_highmod"]
- name: gene_list_selection
display_name: Limit to gene(s)
description: Select gene(s) for analysis - regression will be performed on these genes only.
optional: yes
type: query
quoted: true
query: "nor ref_gregor/genemb/gmb_genes.gorz | select gene_symbol | distinct"
format:
keywords: "%s"
values: "%s"
empty: ""
display_width: 300
%%gor
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_pairs.yml(HPO_pairs='pheno28080,pheno28881',CtrlCaseRatio='All',GeneFreeze='Rec');
gor XA/gene_analysis/report_builders/XA_freeze_logistic_regression.yml(pheno_grid=[#pheno_matrix#],gene_list_selection='CHRNG',GeneFreeze='recessive_af001_high')
Query fetched -1 rows in 0.87 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr2 | 233404401 | CHRNG | pheno28080 | 7.8794 | 11.350 | 0.0 | 2642.286712 |
1 | chr2 | 233404401 | CHRNG | pheno28881 | 7.9143 | 11.398 | 0.0 | 2736.130568 |
as a reminder we show the results from above by filtering:
%%gor
gor XA/gene_analysis/report_builders/XA_phewas_lookup.yml(gene_list_selection = 'CHRNG',pairs_single = 'P', min_OR = '20', max_Pval = '1e-8')
| where phenopair in ('pheno28080','pheno28881') and type = 'Rec_High'
Query fetched -1 rows in 0.04 sec
chrom | gene_start | gene_symbol | type | PhenoPair | OR_Fisher | pVal_Fisher | cases_1 | ctrls_1 | CaseCount | ... | HPO2_XA_Samples | HPO_codes | all_0 | all_1 | all_3 | beta_Fisher | cases_0 | cases_3 | ctrls_0 | ctrls_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr2 | 233404401 | CHRNG | Rec_High | pheno28881 | 2736.0 | 4.600000e-20 | 7 | 3 | 242 | ... | 919 | HP:0000464&HP:0001838 | 198343 | 10 | 2 | 7.914300 | 169 | 0 | 198174 | 2 |
1 | chr2 | 233404401 | CHRNG | Rec_High | pheno28080 | 2642.0 | 5.800000e-20 | 7 | 3 | 249 | ... | 18355 | HP:0008365&HP:0025668 | 198343 | 10 | 2 | 7.879382 | 175 | 0 | 198168 | 2 |
2 rows × 25 columns
We see that the OR are very much the same while the precision in the P-values is not the same in logistic regression as in the Fisher-exact algorithm.
While we used XA_HPO_pairs.yml to create the phenotype matrix, we can indeed generate the phenotypes in any way we choose. Similarly, we can provide a covariants via the covar_grid parameter to XA_freeze_logistic_regression.yml. While we scoped the analysis here to a single gene to get instant results, this analysis is still really fast and can be executed for multple genes in few seconds and by using this report builder within the Sequence Miner, one can access the code and modify it to run genome wide with ease.
Running logistic regression from first principles¶
The above example showed us how we can run analysis on existing 4 gene freezes (it is easy to generate more with different models) but with other phenotypes and covariates using logistic regression instead of Fisher-exact.
Instead of using the gene freezes, we can also use the rare variant row sources (phased or not) as an input to the logistic regression. This give us opportunity to modify the model for the gene burden, the variant filtering, the use of kit-statistics to eliminate samples etc.
Already there is a report builder that allows this to be done quickly for handful of genes. It provides option to use the KitStatFilter or not. It is easy to modify this report builder and provide more options to impact the variant filter
| join -segvar -xl gene_symbol -xr gene_symbol <(gor #VEP# | where max_impact in ('HIGH','MODERATE') | select 1-4,gene_symbol,max_consequence,max_impact
| varjoin -i <(gor #AF# | where af <= 0.01 | select 1-4)
| varjoin -r <(gor #rare_vars# | inset -c PN <(nor [#buckets#] | where bucket = '${colbucket}') ) )
or the burden model
<#if AnalysisType.val == 'Dom' >
| calc dom if(gt = 1 or gt = 2,1,0)
| group 1 -gc #3,gene_symbol,pn -sum -ic dom
| calc value if(sum_dom > 0,1,0)
<#else>
| group 1 -gc #3,gene_symbol,pn -sum -ic pat,mat
| calc value if(sum_pat > 0 and sum_mat > 0,1,0)
</#if>
etc. Alternatively, one can access the query from within the Sequence Miner, edit it and run it ad-hoc. Here we will show it some basic use of it for the BRCA1 gene:
XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml
arguments:
- name: pheno_grid
display_name: Phenotype grid
description: A phenotype grid with the first column PN containing the sample IDs, the second column containing the primary outcome (The default encoding is 0 for controls and 1 for cases for a dichotomous/logistic model, or otherwise numerical values for a continuous model), the third column containing the secondary outcome (phenotype 2) etc... Missing phenotypes must be coded as "NA". No duplicated PNs are allowed. When multiple phenotypes are present, the regression is ran sequentially for each phenotypes.
optional: no
type: grid
display_width: 300
- name: covar_grid
display_name: Covariate grid
description: A covariate grid with the first column PN containing the sample IDs, the following column(s) must contain the covariates. Missing phenotypes must be coded as "NA". Duplicated PNs are not allowed. For sex use 1 for males and 2 for females and "NA" for missing values.
optional: yes
type: grid
display_width: 300
- name: AnalysisType
display_name: Analysis type
description: Dominant or recessive inheritance model
optional: yes
type: string
default: "Dom"
values: ["Dom","Rec"]
- name: gene_list_selection
display_name: Limit to gene(s)
description: Select gene(s) for analysis - regression will be performed on these genes only.
optional: no
type: query
quoted: true
query: "nor ref_gregor/genemb/gmb_genes.gorz | select gene_symbol | distinct"
format:
keywords: "%s"
values: "%s"
empty: ""
display_width: 300
- name: KitStatFilter
display_name: Kit Stat Filter
description: Ignore samples with poor kit stat for gene
optional: yes
type: string
default: "Yes"
values: ["Yes","No"]
%%gor
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_singletons.yml(HPO_singletons='pheno814,pheno306',CtrlCaseRatio='All',GeneFreeze='Rec');
gor XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml(pheno_grid=[#pheno_matrix#],gene_list_selection='BRCA1',AnalysisType='Rec')
Query fetched -1 rows in 2.21 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | pheno306 | 0.93385 | 7.0003 | 2.553400e-12 | 2.544286 |
1 | chr17 | 41196311 | BRCA1 | pheno814 | 1.37080 | 7.9636 | 1.665300e-15 | 3.938500 |
and from above for comparison
%%gor
gor XA/gene_analysis/report_builders/XA_phewas_lookup.yml(gene_list_selection = 'BRCA1',pairs_single = 'S')
| where phenosingleton in ('pheno814','pheno306')
Query fetched -1 rows in 0.13 sec
chrom | gene_start | gene_symbol | type | PhenoSingleton | OR_Fisher | pVal_Fisher | cases_1 | ctrls_1 | CaseCount | ... | all_1 | all_3 | beta_Fisher | cases_0 | cases_3 | ctrls_0 | ctrls_3 | HPO_DAG_level | HPO_name | HPO_XA_Samples | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | Rec_HighMode | pheno814 | 3.938 | 3.900000e-11 | 36 | 684 | 3477 | ... | 720 | 23 | 1.370775 | 2606 | 0 | 195006 | 23 | 7 | Poor gross motor coordination | 3477 |
1 | chr17 | 41196311 | BRCA1 | Rec_HighMode | pheno306 | 2.544 | 6.100000e-10 | 62 | 658 | 10447 | ... | 720 | 23 | 0.933850 | 7057 | 0 | 190555 | 23 | 6 | Poor motor coordination | 10446 |
2 rows × 22 columns
Againg, we see that the ORs are identical and the P-values are similar indeed. While we processed the variants in the gene from scratch, it only took 15sec to run for these two phenotypes. Indeed, the number of phenotypes does not govern the computation cost is when they are few and we can easily use a phenotype matrix with dozens of phenotypes.
We can now for instance inspect the impact of not using the KitStatFilter:
%%gor
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_singletons.yml(HPO_singletons='pheno814,pheno306',CtrlCaseRatio='All',GeneFreeze='Rec');
gor XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml(pheno_grid=[#pheno_matrix#],gene_list_selection='BRCA1',AnalysisType='Rec',KitStatFilter='No')
Query fetched -1 rows in 0.51 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | pheno306 | 0.93385 | 7.0003 | 2.553400e-12 | 2.544286 |
1 | chr17 | 41196311 | BRCA1 | pheno814 | 1.37080 | 7.9636 | 1.665300e-15 | 3.938500 |
We see not difference. This is indeed because the PNf fraction for BRCA1 is very close to 1 and hardly no kits are excluded from being outliers from the total variant distribution, as shown below:
%%gor
gor XA/gene_analysis/kit_stat/gene_use.gorz | where gene_symbol = 'BRCA1'
Query fetched -1 rows in 0.03 sec
chrom | gene_start | gene_end | gene_symbol | kit | use | avg | wavg | std | wstd | stdm | PNf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | 41277381 | BRCA1 | NaN | 1 | 0.2714 | 0.338 | 0.4757 | 0.6046 | 0.0569 | 0.9999 |
1 | chr17 | 41196311 | 41277381 | BRCA1 | Roxanne_2705a | 0 | 1.0000 | 0.338 | 0.0000 | 0.6046 | 0.0000 | 0.9999 |
2 | chr17 | 41196311 | 41277381 | BRCA1 | V1 | 1 | 0.4122 | 0.338 | 0.6294 | 0.6046 | 0.0048 | 0.9999 |
3 | chr17 | 41196311 | 41277381 | BRCA1 | V2 | 1 | 0.4090 | 0.338 | 0.6425 | 0.6046 | 0.0039 | 0.9999 |
4 | chr17 | 41196311 | 41277381 | BRCA1 | V3 | 1 | 0.4418 | 0.338 | 0.6299 | 0.6046 | 0.0195 | 0.9999 |
5 | chr17 | 41196311 | 41277381 | BRCA1 | V4 | 1 | 0.4064 | 0.338 | 0.6372 | 0.6046 | 0.0037 | 0.9999 |
6 | chr17 | 41196311 | 41277381 | BRCA1 | V5 | 1 | 0.3217 | 0.338 | 0.5905 | 0.6046 | 0.0012 | 0.9999 |
7 | chr17 | 41196311 | 41277381 | BRCA1 | V6 | 1 | 0.3062 | 0.338 | 0.5803 | 0.6046 | 0.0011 | 0.9999 |
8 | chr17 | 41196311 | 41277381 | BRCA1 | V8 | 1 | 0.3640 | 0.338 | 0.6222 | 0.6046 | 0.0014 | 0.9999 |
9 | chr17 | 41196311 | 41277381 | BRCA1 | WGS-V1 | 1 | 0.2727 | 0.338 | 0.5745 | 0.6046 | 0.0322 | 0.9999 |
10 | chr17 | 41196311 | 41277381 | BRCA1 | WGS-V2 | 0 | 0.1449 | 0.338 | 0.3910 | 0.6046 | 0.0471 | 0.9999 |
11 | chr17 | 41196311 | 41277381 | BRCA1 | WGS-V3 | 1 | 0.3987 | 0.338 | 0.6865 | 0.6046 | 0.0030 | 0.9999 |
Defining more complex HPO phenotypes for cases and exclusion list¶
The following report builder allows us to specify cases based on inclusion of N out of M HPO codes. Similalry, exclusion from case group can be defined in a similar fashion, e.g. 1 out of some list of HPO codes.
XA_HPO_case_excl_ctrl.yml
arguments:
- name: Case_HPOs
display_name: Case HPOs
description: Select pheno from list of HPO singletons
optional: yes
quoted: true
type: query
single_selection: no
query: "nor XA/pheno/HPO_overview.tsv | inset -c hpo_code <(nor XA/pheno/HPO_singletons/pheno_groups.tsv | select hpo_code) | select hpo_code,name,description,XA_Samples,XA_Samples_direct,DAG_level,Min_path"
format:
keywords: "%s"
values: "%s"
empty: ""
display_width: 300
- name: CaseHPOcount
display_name: Case HPO Count
description: Min num of HPO codes from case HPO selection
optional: no
type: string
default: "2"
values: ["1", "2", "3", "4", "5"]
- name: Excl_HPOs
display_name: Excl HPOs
description: Select pheno from list of HPO singletons
optional: yes
quoted: true
type: query
single_selection: no
query: "nor XA/pheno/HPO_overview.tsv | inset -c hpo_code <(nor XA/pheno/HPO_singletons/pheno_groups.tsv | select hpo_code) | select hpo_code,name,description,XA_Samples,XA_Samples_direct,DAG_level,Min_path"
format:
keywords: "%s"
values: "%s"
empty: ""
display_width: 300
- name: ExclHPOcount
display_name: Excl HPO Count
description: Min num of HPO codes from exclusion HPO selection
optional: no
type: string
default: "1"
values: ["1", "2", "3", "4", "5"]
- name: CtrlCaseRatio
display_name: Ctrl Case ratio
description: Target ratio per kit - may not be achieved
optional: yes
type: string
default: "All"
values: ["0.5", "1", "2", "4", "10", "100", "1000", "All"]
- name: GeneFreeze
display_name: Gene freeze
description: Dominant or recessive freeze
optional: yes
type: string
default: "Dom"
values: ["Dom","Rec"]
Consider an hypothetical example of studying high-frequence hearing impariment. Below is a query that give us some HPO codes related to hearing:
%%gor
nor XA/pheno/HPO_overview.tsv
| where (name in ('High-frequency hearing impairment','High-frequency sensorineural hearing impairment','Low-frequency hearing loss','Low-frequency sensorineural hearing impairment','Mild conductive hearing impairment','Mild hearing impairment','Mild neurosensory hearing impairment','Mixed hearing impairment','Moderate conductive hearing impairment','Moderate hearing impairment','Moderate sensorineural hearing impairment','Old-aged sensorineural hearing impairment'))
Query fetched -1 rows in 0.28 sec
hpo_code | XA_Samples | XA_Samples_direct | name | description | parentCount | childCount | decendCount | Min_path | Path_counts | DAG_level | geneCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HP:0000410 | 792 | 792 | Mixed hearing impairment | A type of hearing loss resulting from a combin... | 2 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0011... | 4 | 6 | 44 |
1 | HP:0001757 | 58 | 58 | High-frequency sensorineural hearing impairment | A form of sensorineural hearing impairment tha... | 1 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0011... | 2 | 6 | 6 |
2 | HP:0005101 | 216 | 216 | High-frequency hearing impairment | A type of hearing impairment affecting primari... | 1 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0000... | 1 | 6 | 8 |
3 | HP:0008504 | 71 | 71 | Moderate sensorineural hearing impairment | The presence of a moderate form of sensorineur... | 2 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0011... | 3 | 6 | 0 |
4 | HP:0008542 | 88 | 88 | Low-frequency hearing loss | A type of hearing impairment affecting primari... | 1 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0000... | 1 | 6 | 0 |
5 | HP:0008573 | 27 | 27 | Low-frequency sensorineural hearing impairment | A form of sensorineural hearing impairment tha... | 1 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0011... | 2 | 6 | 4 |
6 | HP:0008587 | 19 | 19 | Mild neurosensory hearing impairment | The presence of a mild form of sensorineural h... | 2 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0011... | 3 | 6 | 2 |
7 | HP:0008598 | 263 | 263 | Mild conductive hearing impairment | A mild form of conductive hearing impairment. | 2 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031703->HP:0000... | 3 | 7 | 1 |
8 | HP:0012712 | 552 | 211 | Mild hearing impairment | The presence of a mild form of hearing impairm... | 1 | 4 | 4 | T->HP:0000118->HP:0000598->HP:0031704->HP:0000... | 1 | 6 | 7 |
9 | HP:0012713 | 167 | 34 | Moderate hearing impairment | The presence of a moderate form of hearing imp... | 1 | 2 | 2 | T->HP:0000118->HP:0000598->HP:0031704->HP:0000... | 1 | 6 | 0 |
10 | HP:0012716 | 62 | 62 | Moderate conductive hearing impairment | The presence of a moderate form of conductive ... | 3 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0000... | 4 | 7 | 0 |
11 | HP:0040113 | 5 | 5 | Old-aged sensorineural hearing impairment | Old-aged sensorineural hearing impairment | 1 | 0 | 0 | T->HP:0000118->HP:0000598->HP:0031704->HP:0011... | 2 | 6 | 1 |
Lets say we want to define cases from those who have either of HP:0001757, HP:0005101 or HP:0000410 and exclude from controls any sample with any of the following: HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113
%%gor
nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom')
| group -gc pheno -count
Query fetched -1 rows in 0.20 sec
pheno | allCount | |
---|---|---|
0 | 1.0 | 104900 |
1 | 2.0 | 1049 |
2 | NaN | 723 |
We see that we get 1049 cases, 10x more controls and 723 samples are excluded from the dominant freeze (which we are targeting here). For fun, we can make the exclusion criteria more stringent by insisting for 2 out of the listed exclusion HPO codes:
%%gor
nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1', Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113'
,ExclHPOcount='2',CtrlCaseRatio='100', GeneFreeze='Dom')
| group -gc pheno -count
Query fetched -1 rows in 3.44 sec
pheno | allCount | |
---|---|---|
0 | 1.0 | 104900 |
1 | 2.0 | 1049 |
2 | NaN | 394 |
Clearly, this reduces the number of excluded sampels as expected.
%%gor
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom')
| hide kit;
gor XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml(pheno_grid=[#pheno_matrix#],gene_list_selection='EFTUD2',AnalysisType='Dom',KitStatFilter='No')
Query fetched -1 rows in 1.01 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 42927315 | EFTUD2 | pheno | 0.35297 | 1.2072 | 0.22736 | 1.423288 |
We can see the impact of adding filtering to kit outliers
%%gor
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom')
| hide kit;
gor XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml(pheno_grid=[#pheno_matrix#],gene_list_selection='EFTUD2',AnalysisType='Dom',KitStatFilter='Yes')
Query fetched -1 rows in 0.65 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 42927315 | EFTUD2 | pheno | 0.4082 | 1.3365 | 0.18137 | 1.504108 |
We see that there is slight change, i.e. a little big larger odds-ratio.
%%gor
gor XA/gene_analysis/kit_stat/gene_use.gorz | where gene_symbol = 'EFTUD2'
Query fetched -1 rows in 0.22 sec
chrom | gene_start | gene_end | gene_symbol | kit | use | avg | wavg | std | wstd | stdm | PNf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr17 | 42927315 | 42976813 | EFTUD2 | NaN | 1 | 0.1286 | 0.0747 | 0.3347 | 0.2712 | 0.0400 | 0.9999 |
1 | chr17 | 42927315 | 42976813 | EFTUD2 | V1 | 0 | 0.1290 | 0.0747 | 0.3698 | 0.2712 | 0.0028 | 0.9999 |
2 | chr17 | 42927315 | 42976813 | EFTUD2 | V2 | 1 | 0.0712 | 0.0747 | 0.2681 | 0.2712 | 0.0016 | 0.9999 |
3 | chr17 | 42927315 | 42976813 | EFTUD2 | V3 | 1 | 0.0953 | 0.0747 | 0.3247 | 0.2712 | 0.0101 | 0.9999 |
4 | chr17 | 42927315 | 42976813 | EFTUD2 | V4 | 1 | 0.0645 | 0.0747 | 0.2552 | 0.2712 | 0.0015 | 0.9999 |
5 | chr17 | 42927315 | 42976813 | EFTUD2 | V5 | 1 | 0.0682 | 0.0747 | 0.2604 | 0.2712 | 0.0005 | 0.9999 |
6 | chr17 | 42927315 | 42976813 | EFTUD2 | V6 | 1 | 0.0679 | 0.0747 | 0.2585 | 0.2712 | 0.0005 | 0.9999 |
7 | chr17 | 42927315 | 42976813 | EFTUD2 | V8 | 1 | 0.0812 | 0.0747 | 0.2808 | 0.2712 | 0.0006 | 0.9999 |
8 | chr17 | 42927315 | 42976813 | EFTUD2 | WGS-V1 | 1 | 0.0815 | 0.0747 | 0.2736 | 0.2712 | 0.0153 | 0.9999 |
9 | chr17 | 42927315 | 42976813 | EFTUD2 | WGS-V2 | 0 | 0.1594 | 0.0747 | 0.3661 | 0.2712 | 0.0441 | 0.9999 |
10 | chr17 | 42927315 | 42976813 | EFTUD2 | WGS-V3 | 1 | 0.1092 | 0.0747 | 0.3218 | 0.2712 | 0.0014 | 0.9999 |
%%gor
nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom')
| where kit in ('V1','WGS-V2') | group -gc pheno,kit -count
Query fetched -1 rows in 0.20 sec
pheno | kit | allCount | |
---|---|---|---|
0 | 1.0 | V1 | 3600 |
1 | 2.0 | V1 | 36 |
2 | NaN | V1 | 18 |
From the above, we see that there are 36 cases and 18 excluded samples of kit-type V1 that are eliminated, since use = 0 for this kit in EFTUD2. We can similarly summarize the variants in this gene for kit V1 in a manner that mimics the query used to feed the logistic regression:
%%gor
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom')
;
def #AF# = XA/freezes/xa_feb20_2025/metadata/af.gorz;
def #VEP# = XA/freezes/250303/metadata/vepref_single_wgs.gorz;
gor ref_gregor/genemb/gmb_genes.gorz | where gene_symbol = 'EFTUD2'
| join -segsnp -ir XA/gene_analysis/rare_vars_001_qc.gorz
| varjoin -i <(gor #VEP# | where max_impact in ('HIGH','MODERATE') and gene_symbol = 'EFTUD2')
| varjoin -i <(gor #AF# | where af <= 0.01 )
| map -c pn [#pheno_matrix#]
| group chrom -gc kit,pheno,gt -count
| where kit = 'V1'
| sort 1 -c gt:r
Query fetched -1 rows in 0.58 sec
chrom | bpStart | bpStop | kit | pheno | gt | allCount | |
---|---|---|---|---|---|---|---|
0 | chr17 | 0 | 81195210 | V1 | 1.0 | 3 | 144 |
1 | chr17 | 0 | 81195210 | V1 | 2.0 | 3 | 1 |
2 | chr17 | 0 | 81195210 | V1 | NaN | 3 | 1 |
3 | chr17 | 0 | 81195210 | V1 | 1.0 | 1 | 112 |
4 | chr17 | 0 | 81195210 | V1 | 2.0 | 1 | 1 |
5 | chr17 | 0 | 81195210 | V1 | NaN | 1 | 2 |
By eliminating data from kit V1, we throw away 112 carriers in controls and only 1 in cases. This is slightly larger than the CtrlCaseRatio='100' that we used for sampling the controls and explains the slight drop in the odds-ratio.
Adding covariants based on kit and gender¶
We will now explore the use the kit as covariate.
qh = GQH.GOR_Query_Helper()
mydefs = qh.add_many("""
create #pheno_matrix# = nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom');
""")
%%gor
$mydefs
nor [#pheno_matrix#] | group -gc kit -count
Query fetched -1 rows in 1.45 sec
kit | allCount | |
---|---|---|
0 | V1 | 3654 |
1 | V2 | 7811 |
2 | V3 | 202 |
3 | V4 | 8044 |
4 | V5 | 36640 |
5 | V6 | 24893 |
6 | V8 | 21762 |
7 | WGS-V3 | 3666 |
For the GOR REGRESSION command we must convert this categorial value into multiple binary variables (or one-hot-encoding). We can do that easily with the PIVOT command:
mydefs = qh.add_many("""
create #covars# = nor [#pheno_matrix#] | select pn,kit | calc v 1
| pivot -gc pn kit -v V1,V2,V3,V4,V5,V6,V8,WGS-V3 -e 0;
""")
%%gor
$mydefs
nor [#covars#]
Query fetched -1 rows in 0.72 sec
pn | V1_v | V2_v | V3_v | V4_v | V5_v | V6_v | V8_v | WGSxV3_v | |
---|---|---|---|---|---|---|---|---|---|
0 | 348079 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 181424 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 42216 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 738084 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 173387 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
106667 | 783784 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
106668 | 239099 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
106669 | 217786 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
106670 | 78535 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
106671 | 858367 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
106672 rows × 9 columns
Now lets feed this into the logistic regression as above:
%%gor
$mydefs
create #pheno_without_kit_col# = nor [#pheno_matrix#] | hide kit;
gor XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml(pheno_grid=[#pheno_without_kit_col#],covar_grid=[#covars#],gene_list_selection='BRCA1',AnalysisType='Rec',KitStatFilter='No')
Query fetched -1 rows in 1.01 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | pheno | NaN | NaN | NaN | NaN |
Clearly, this did not work out as expected! Lets try using sex/gender instead.
mydefs = qh.add_many("""
create #covars# = nor XA/pheno/samples_slim.tsv | select pn,gender | replace gender if(gender='M',1,0);
""")
%%gor
$mydefs
create #pheno_without_kit_col# = nor [#pheno_matrix#] | hide kit;
gor XA/gene_analysis/report_builders/XA_gene_logistic_regression.yml(pheno_grid=[#pheno_without_kit_col#],covar_grid=[#covars#],gene_list_selection='BRCA1',AnalysisType='Rec',KitStatFilter='No')
Query fetched -1 rows in 0.51 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | pheno | 0.11143 | 0.18998 | 0.84932 | 1.117875 |
This got us at least somewhere (:=) - we see that the odds-ratio has dropped a bit - in both cases this is FAR away from any significance - and concludes our exploration of the gene burden freezes, predefined genome wide Phewas analysis and HPO based tools for phenotype definition.
The GORy details¶
It is revealing to show the code under the hood that is generated by XA_gene_logistic_regression.yml in the examples above.
%%gor
create #pheno_matrix_input# = nor XA/gene_analysis/report_builders/XA_HPO_case_excl_ctrl.yml(Case_HPOs='HP:0001757,HP:0005101,HP:0000410', CaseHPOcount='1'
, Excl_HPOs='HP:0008504,HP:0008542,HP:0008573,HP:0008587,HP:0008598,HP:0012712,HP:0012713,HP:0012716,HP:0040113',ExclHPOcount='1',CtrlCaseRatio='100', GeneFreeze='Dom');
create #pheno_without_kit_col# = nor [#pheno_matrix_input#] | hide kit;
create #covars_input# = nor XA/pheno/samples_slim.tsv | select pn,gender | replace gender if(gender='M',1,0);
def #AF# = XA/freezes/xa_feb20_2025/metadata/af.gorz;
def #VEP# = XA/freezes/250303/metadata/vepref_single_wgs.gorz;
def #kit_use# = XA/gene_analysis/kit_stat/gene_use.gorz;
def #rare_vars# = XA/gene_analysis/phased_rare_vars.gorz;
def #rare_pns# = XA/gene_analysis/phased_rare_vars_ped.tsv;
create #buckets# = nor #rare_pns#
| select pn
| sort -c PN
| rownum
| calc bucket 'b'+str(div(rownum-1,250000))
| hide rownum;
create #pn_kit# = nor -h XA/raw/samples.tsv
| select id,kit
| rename #1 PN;
def ##all_genes## = ref_gregor/genemb/gmb_genes.gorz;
def #gene_filter# = where gene_symbol in ('BRCA1');
create #horiz_gt# = parallel -parts <(nor [#buckets#] | group -gc bucket) <(gor ##all_genes##
| #gene_filter#
| join -segvar -xl gene_symbol -xr gene_symbol <(gor #VEP# | where max_impact in ('HIGH','MODERATE') | select 1-4,gene_symbol,max_consequence,max_impact
| varjoin -i <(gor #AF# | where af <= 0.01 | select 1-4)
| varjoin -r <(gor #rare_vars# | inset -c PN <(nor [#buckets#] | where bucket = '#{col:bucket}') ) )
| group 1 -gc #3,gene_symbol,pn -sum -ic pat,mat
| calc value if(sum_pat > 0 and sum_mat > 0,1,0)
| select chrom,gene_start,gene_end,gene_symbol,value,pn
| merge <(gor ##all_genes## | #gene_filter# | select chrom,gene_start,gene_end,gene_symbol | calc value 0
| multimap -cartesian <(nor [#buckets#] | where bucket = '#{col:bucket}' | select pn))
| atmax 1 -gc gene_symbol,pn value
/*
| map -c pn -m '' <(nor [#pn_kit#] | inset -c PN <(nor [#buckets#] | where bucket = '#{col:bucket}') )
| join -segseg -xl gene_symbol,kit -xr gene_symbol,kit -l -e 0 <(gor #kit_use# | select 1-use)
| replace value if(use = 0,'',value)
*/
| sort 1 -c gene_end,gene_symbol,pn
| group 1 -gc gene_end,gene_symbol -lis -sc value -s ',' -len 1000000
| calc bucket '#{col:bucket}'
);
create #pheno_matrix# = nor [#buckets#]
| select pn
| map -c pn -m 'NA' <(nor [#pheno_without_kit_col#] | columnsort pn);
create #covar_matrix# = nor [#buckets#]
| select pn
| map -c pn -m '0' <(nor [#covars_input#] | columnsort pn);
create #regression# = gor [#horiz_gt#]
| rename lis_value values
| csvsel -gc gene_end,gene_symbol -s ',' -u '' [#buckets#] <(nor [#buckets#] | select pn)
| regression [#pheno_matrix#] -logistic -s ',' -covar [#covar_matrix#]
| where covariate = 'Genotype'
| hide covariate,gene_end
| tryhide bucket
| calc OR exp(beta);
gor [#regression#]
Query fetched -1 rows in 1.30 sec
chrom | gene_start | gene_symbol | Phenotype | beta | Z_stat | P_value | OR | |
---|---|---|---|---|---|---|---|---|
0 | chr17 | 41196311 | BRCA1 | pheno | 0.11143 | 0.18998 | 0.84932 | 1.117875 |
This is exactly the same OR number as shown above.