Exon coverage analysis¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "clin-hg19"
%env GOR_API_PROJECT={project}
Gene exon coverage analysis¶
This notebook looks into how well our WES and WGS sequencing is covering the genes in the genome.
Definition of genes and coding exons¶
The definition of the Refgene transcripts in hg19 and the corresponding coding regions it downloaded from the following UCSC link:
The scripts to convert this table into exons, coding exon, and transcript files can be found in: user_data/hakon/refgene/write_coding_exons.gorq and write_transcripts.gorq. Our definition of genes is an export from GeneMb, ref_gregor/genemb/gmb_genes.gorz, and for gene primary transcript definition, we use definitions from Oscar. However, for genes where such definition is missing, we use Mane transcripts and were neither Oscar nor Mane definition is available, we use the transcript with the largest coding region, based on our UCSC download. The logic can be found in user_data/hakon/refgene/select_new_primary_trans.gorq.
Genes are also classified into validated and non-validated based on statuses and disease inheritance classification in GeneMB, e.g. where mechanism = 'amorph' or disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR','AR','XD','AR','DI','MI').
Coverage statistics from samples¶
We have then calculated coverage statistcs for every coding exon for 13225 V8 (WES) samples and 1478 WGS-V3 samples. This query for this analysis can be found in user_data/hakon/wesCov/ucsc_coding_exon_kit_coverage.gorq and the output data is stored in user_data/hakon/wesCov/ucsc_exon_kit_stat.gord, shown below:
%%gor
gor user_data/hakon/wesCov/ucsc_exon_kit_stat.gord | top 6
Query ran in 4.42 sec. Query fetched 6 rows in 0.01 sec (total time 4.43 sec)
chrom | bpStart | bpStop | transcript_stable_id | exon | kit | Samples | avg_depth | std_depth | min_depth | max_depth | avg_lt5 | min_lt5 | max_lt5 | avg_lt10 | min_lt10 | max_lt10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69090 | 70008 | NM_001005484 | 1 | V8 | 13225 | 199.623 | 79.869 | 52.9 | 1067.0 | 0.031 | 0.000 | 0.164 | 0.043 | 0.0 | 0.236 |
1 | chr1 | 69090 | 70008 | NM_001005484 | 1 | WGS-V3 | 1478 | 15.745 | 11.630 | 0.0 | 61.1 | 0.264 | 0.000 | 1.000 | 0.375 | 0.0 | 1.000 |
2 | chr1 | 367658 | 368597 | NM_001005221 | 1 | V8 | 13225 | 0.255 | 0.306 | 0.0 | 2.3 | 0.999 | 0.803 | 1.000 | 1.000 | 1.0 | 1.000 |
3 | chr1 | 367658 | 368597 | NM_001005221 | 1 | WGS-V3 | 1478 | 0.000 | NaN | 0.0 | 0.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.0 | 1.000 |
4 | chr1 | 367658 | 368597 | NM_001005224 | 1 | V8 | 13225 | 0.255 | 0.306 | 0.0 | 2.3 | 0.999 | 0.803 | 1.000 | 1.000 | 1.0 | 1.000 |
5 | chr1 | 367658 | 368597 | NM_001005224 | 1 | WGS-V3 | 1478 | 0.000 | NaN | 0.0 | 0.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.0 | 1.000 |
Regions of interest¶
In addition to coverage analysis from samples, we also look into how our ROI definitions correspond with the actual coding exon regions. While we have explored several version of ROI definitions, we will only consider the "old" definition (believed to be based on the XA onboarding) and the "new" definition, based on regions covered by with 10 or more reads by by 75% of the WES samples.
Difficult regions in the genome¶
Finally, we also used a method, described by Illumina to classify the genome into regions where good sequence variant concordance can be expected. This is done by analysing sequence read base quality, mapping quality and pileup depth for 50 WGS samples (see https://www.illumina.com/science/genomics-research/articles/identifying-genomic-regions-with-high-quality-single-nucleotide-.html) and the corresponding GORdb query logic in user_data/hakon/IlluminaQC/WGS/pileup_analysis.gorq. As shown in this article, the SNV concordance is 95.7% for autosomal SNPs in about 90% of the genome that is high-quality (where the three quality indicators are good as defined by them), while it drops to 66.7% or other regions and down to 46.8% in regions where all three metrics indicate poor quality. The output of this analysis is stored in user_data/hakon/IlluminaQC/WGS/IllQualityRegions.gord. We can summarize it for the genome as shown below and see that it compares will with the results in the paper (e.g. 90% vs 88.3% of high-quality regions)
%%gor
create #temp# = gor user_data/hakon/IlluminaQC/WGS/IllQualityRegions.gord
| calc s #3-#2
| group genome -sum -ic s -gc good*
| granno genome -sum -ic sum*
| calc f sum_s/sum_sum_s;
nor [#temp#] | select goodMapQ- | sort -c f:nr
Query ran in 4.24 sec. Query fetched 8 rows in 0.00 sec (total time 4.24 sec)
goodMapQ | goodDepth | goodBaseQ | goodLabels | sum_s | sum_sum_s | f | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | three | 2588400062 | 2861338695 | 0.904612 |
1 | 1 | 0 | 1 | two | 108502086 | 2861338695 | 0.037920 |
2 | 0 | 1 | 1 | two | 62452951 | 2861338695 | 0.021826 |
3 | 0 | 0 | 1 | one | 55622933 | 2861338695 | 0.019439 |
4 | 1 | 1 | 0 | two | 34774931 | 2861338695 | 0.012153 |
5 | 1 | 0 | 0 | one | 6204224 | 2861338695 | 0.002168 |
6 | 0 | 0 | 0 | none | 3947616 | 2861338695 | 0.001380 |
7 | 0 | 1 | 0 | one | 1433892 | 2861338695 | 0.000501 |
Exon analysis¶
Now we proceed with few definitions and the code that calculates overlap with our coding exons. First few definitions:
qh = GQH.GOR_Query_Helper()
mydefs = ""
mydefs = qh.add_many("""
def #exons# = user_data/hakon/refgene/coding_exons.gorz;
def #roi# = user_data/hakon/wesCov/roi10_75.gorz;
def #genes# = ref_gregor/genemb/gmb_genes.gorz;
def #old_roi# = ref_gregor/util/target_roi.gorz;
def #new_roi# = ref_gregor/util/roi10_75.gorz;
create ##pt## = nor user_data/hakon/refgene/hgnc_primary_transcript.tsv
| replace #2 if(posof(#2,'.')>0,left(#2,posof(#2,'.')),#2);
""")
Then we calculate the fractional overlap of each exon with the two ROI definitions. We then define 4 different boolean quality labels based on fractional overlap with each exon.
mydefs = qh.add_many("""
create #exon_old_roi# = gor #exons#
| join -segseg -l -e 0 <(gor #old_roi# | rename #2 bpstart | rename #3 bpstop)
| calc ov if(bpstop=0,0,min(bpstop,exon_stop)-max(bpstart,exon_start))
| group 1 -gc #3,transcript_stable_id,hgnc_id,gene_symbol,strand,exon -sum -ic ov
| rename sum_ov ov
| calc size #3-#2
| calc lovf ov/size
| calc missEx95 if(lovf<0.95,1,0)
| calc missEx90 if(lovf<0.90,1,0)
| calc missEx75 if(lovf<0.75,1,0)
| calc missEx50 if(lovf<0.50,1,0)
;
create #exon_new_roi# = gor #exons#
| join -segseg -l -e 0 <(gor #new_roi# | rename #2 bpstart | rename #3 bpstop)
| calc ov if(bpstop=0,0,min(bpstop,exon_stop)-max(bpstart,exon_start))
| group 1 -gc #3,transcript_stable_id,hgnc_id,gene_symbol,strand,exon -sum -ic ov
| rename sum_ov ov
| calc size #3-#2
| calc lovf ov/size
| calc missEx95 if(lovf<0.95,1,0)
| calc missEx90 if(lovf<0.90,1,0)
| calc missEx75 if(lovf<0.75,1,0)
| calc missEx50 if(lovf<0.50,1,0)
;
""")
To understand this data lets look at few rows:
%%gor
$mydefs
gor [#exon_new_roi#] | top 3
Query ran in 5.27 sec. Query fetched 3 rows in 0.00 sec (total time 5.27 sec)
chrom | exon_start | exon_stop | transcript_stable_id | hgnc_id | gene_symbol | strand | exon | ov | size | lovf | missEx95 | missEx90 | missEx75 | missEx50 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69090 | 70008 | NM_001005484 | HGNC:14825 | OR4F5 | + | 1 | 822 | 918 | 0.895425 | 1 | 1 | 0 | 0 |
1 | chr1 | 367658 | 368597 | NM_001005221 | HGNC:31275 | OR4F29 | + | 1 | 0 | 939 | 0.000000 | 1 | 1 | 1 | 1 |
2 | chr1 | 367658 | 368597 | NM_001005224 | HGNC:8300 | OR4F3 | + | 1 | 0 | 939 | 0.000000 | 1 | 1 | 1 | 1 |
Then the overlap with our regions based on the Illumina quality metrics, i.e. based on three good, two, one etc.:
mydefs = qh.add_many("""
create #exon_Ill# = gor #exons#
| join -segseg -l -e 0 <(gor user_data/hakon/IlluminaQC/WGS/IllQualityRegions.gord | select 1-bpstop,goodlabels)
| calc ov if(bpstop=0,0,min(bpstop,exon_stop)-max(bpstart,exon_start))
| group 1 -gc #3,transcript_stable_id,goodlabels -sum -ic ov
| rename sum_ov ov
| pivot -gc #3,transcript_stable_id goodlabels -v none,one,two,three -e 0
;
""")
Here are few rows for clarification:
%%gor
$mydefs
gor [#exon_Ill#] | top 3
Query ran in 0.19 sec Query fetched 3 rows in 0.00 sec (total time 0.19 sec)
chrom | exon_start | exon_stop | transcript_stable_id | none_ov | one_ov | two_ov | three_ov | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 69090 | 70008 | NM_001005484 | 4 | 355 | 559 | 0 |
1 | chr1 | 367658 | 368597 | NM_001005221 | 0 | 538 | 401 | 0 |
2 | chr1 | 367658 | 368597 | NM_001005224 | 0 | 538 | 401 | 0 |
Then we pick up the exon based coverage and calculate boolean flags based on some cut-offs:
mydefs = qh.add_many("""
create #exon_cov# = gor #exons# | join -snpsnp -xl exon_stop,transcript_stable_id -xr bpstop,transcript_stable_id user_data/hakon/wesCov/ucsc_exon_kit_stat.gord
| calc bad_avg_depth if(avg_depth < 10,1,0)
| calc bad_avg_lt10 if(avg_lt10 > 0.25,1,0)
| select chrom-gene_symbol,kit,avg_depth,avg_lt10,bad_*;
""")
%%gor
$mydefs
gor [#exon_cov#] | top 6
Query ran in 1.37 sec Query fetched 6 rows in 0.00 sec (total time 1.37 sec)
chrom | exon_start | exon_stop | exon | transcript_stable_id | strand | hgnc_id | gene_symbol | kit | avg_depth | avg_lt10 | bad_avg_depth | bad_avg_lt10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69090 | 70008 | 1 | NM_001005484 | + | HGNC:14825 | OR4F5 | V8 | 199.623 | 0.043 | 0 | 0 |
1 | chr1 | 69090 | 70008 | 1 | NM_001005484 | + | HGNC:14825 | OR4F5 | WGS-V3 | 15.745 | 0.375 | 0 | 1 |
2 | chr1 | 367658 | 368597 | 1 | NM_001005221 | + | HGNC:31275 | OR4F29 | V8 | 0.255 | 1.000 | 1 | 1 |
3 | chr1 | 367658 | 368597 | 1 | NM_001005221 | + | HGNC:31275 | OR4F29 | WGS-V3 | 0.000 | 1.000 | 1 | 1 |
4 | chr1 | 367658 | 368597 | 1 | NM_001005224 | + | HGNC:8300 | OR4F3 | V8 | 0.255 | 1.000 | 1 | 1 |
5 | chr1 | 367658 | 368597 | 1 | NM_001005224 | + | HGNC:8300 | OR4F3 | WGS-V3 | 0.000 | 1.000 | 1 | 1 |
We notice that in the above relation, we have two rows for each exon; one for WGS and one for WES.
Gene analysis¶
First our GeneMB gene classification:
mydefs = qh.add_many("""
create #geneMBval# = nor ref_gregor/genemb/gmb_genes.gorz | map -c hgnc_id ref_gregor/genemb/gene_disease_inher.tsv
| where mechanism = 'amorph' or disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR','AR','XD','AR','DI','MI')
| select hgnc_id ;
""")
Gene aggregation¶
To simplify our interpretation of the gene analysis, we only use exons that are part of the primary transcripts.
mydefs = qh.add_many("""
create #gene_summ# = gor #genes# | join -segseg -xl hgnc_id -xr hgnc_id <(gor [#exon_new_roi#]
| map -c hgnc_id [##pt##]
| where transcript_stable_id = primary_transcript
| join -snpsnp -xl exon_stop,transcript_stable_id -xr exon_stop,transcript_stable_id [#exon_cov#]
| calc ov_size ov*size
| calc avg_depth_size avg_depth*size
| calc avg_lt10_size avg_lt10*size
| join -snpsnp -xl exon_stop,transcript_stable_id -xr exon_stop,transcript_stable_id [#exon_Ill#]
)
| group 1 -gc gene_end,transcript_stable_id,hgnc_id,gene_symbol,strand,kit -count -sum -ic size,ov_size,avg_depth_size,avg_lt10_size,missEx95,missEx90,missEx75,missEx50,bad_avg_depth,bad_avg_lt10,none_ov-three_ov
| calc avg_depth form(sum_avg_depth_size/sum_size,3,1)
| calc avg_lt10 form(sum_avg_lt10_size/sum_size,3,1)
| rename allcount Exons
| replace sum_none_ov-sum_three_ov form(float(#rc)/sum_size,3,3)
| rename sum_(.*)_ov #{1}_ovfr
| select 1,2,gene_end,gene_symbol,hgnc_id,transcript_stable_id,strand,kit,Exons,sum_size,sum_missEx95,sum_missEx90,sum_missEx75,sum_missEx50,sum_bad_avg_depth,sum_bad_avg_lt10,avg_depth,avg_lt10,none_ovfr-three_ovfr
| rename sum_size CodingSize
| inset -c hgnc_id -b [#geneMBval#] | rename inSet GeneMBval
;
create #gene_roi# = nor [#exon_new_roi#]
| map -c hgnc_id [##pt##]
| where transcript_stable_id = primary_transcript
| group -gc hgnc_id -sum -ic ov,size | calc roi_fr form(sum_ov/sum_size,3,3) | calc roi 'new'
| merge <(nor [#exon_old_roi#]
| map -c hgnc_id [##pt##]
| where transcript_stable_id = primary_transcript
| group -gc hgnc_id -sum -ic ov,size | calc roi_fr form(sum_ov/sum_size,3,3) | calc roi 'old')
| hide sum_*
| pivot -gc hgnc_id roi -v new,old -e 0;
""")
%%gor
$mydefs
gor [#gene_summ#] | top 3
Query ran in 7.47 sec. Query fetched 3 rows in 0.00 sec (total time 7.48 sec)
chrom | gene_start | gene_end | gene_symbol | hgnc_id | transcript_stable_id | strand | kit | Exons | CodingSize | ... | sum_missEx50 | sum_bad_avg_depth | sum_bad_avg_lt10 | avg_depth | avg_lt10 | none_ovfr | one_ovfr | two_ovfr | three_ovfr | GeneMBval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 65418 | 71585 | OR4F5 | HGNC:14825 | NM_001005484 | + | V8 | 1 | 918 | ... | 0 | 0 | 0 | 199.6 | 0.0 | 0.004 | 0.387 | 0.609 | 0.0 | 0 |
1 | chr1 | 65418 | 71585 | OR4F5 | HGNC:14825 | NM_001005484 | + | WGS-V3 | 1 | 918 | ... | 0 | 0 | 1 | 15.7 | 0.4 | 0.004 | 0.387 | 0.609 | 0.0 | 0 |
2 | chr1 | 367658 | 368597 | OR4F29 | HGNC:31275 | NM_001005221 | + | V8 | 1 | 939 | ... | 1 | 1 | 1 | 0.3 | 1.0 | 0.000 | 0.573 | 0.427 | 0.0 | 0 |
3 rows × 23 columns
%%gor
$mydefs
nor [#gene_roi#] | top 3
Query ran in 0.20 sec Query fetched 3 rows in 0.00 sec (total time 0.21 sec)
hgnc_id | new_roi_fr | old_roi_fr | |
---|---|---|---|
0 | HGNC:33809 | 1.000 | 1.0 |
1 | HGNC:25954 | 0.432 | 1.0 |
2 | HGNC:25489 | 1.000 | 1.0 |
Finally, one example of using this data to group the genes into good and bad based on coverage.
- We split the genes into validated and non-validated
- We separate WGS and WES
- We annotate with the Illumina QC region, i.e. the fractional coverage of the different types (three,two, and one)
- We also annotate with the ROI overlap
%%gor
$mydefs
nor [#gene_summ#]
| calc good if(sum_bad_avg_depth = 0 and sum_bad_avg_lt10 = 0 and avg_lt10 < 0.01,1,0)
| calc bad if(sum_bad_avg_depth/exons > 0.1 or sum_bad_avg_lt10/exons >0.1 or avg_lt10 > 0.1 or avg_depth < 20,1,0)
| calc verybad if(sum_bad_avg_depth/exons > 0.1 and sum_bad_avg_lt10/exons >0.1 and avg_lt10 > 0.1 and avg_depth < 20,1,0)
| calc qctype if(verybad=1,'verybad',if(bad=1,'bad',if(good=1,'good','good')))
| map -c hgnc_id [#gene_roi#]
| group -gc kit,genembval,qctype -count -fc new_roi_fr,old_roi_fr,one_ovfr-three_ovfr -avg -std
| replace avg*,std* form(float(#rc),3,3)
| calc order decode(qctype,'good,1,bad,2,verybad,3')
| sort -c kit,genembval,order | hide order
| granno -gc kit,genembval -sum -ic allcount
| calc f form(100*allcount/sum_allcount,4,1)+'%'
| columnreorder kit,genembval,allcount,f,qctype
| hide std_*,sum_allcount
Query ran in 0.26 sec Query fetched 12 rows in 0.35 sec (total time 0.61 sec)
kit | GeneMBval | allCount | f | qctype | avg_one_ovfr | avg_two_ovfr | avg_three_ovfr | avg_new_roi_fr | avg_old_roi_fr | |
---|---|---|---|---|---|---|---|---|---|---|
0 | V8 | 0 | 15108 | 95.4% | good | 0.012 | 0.051 | 0.937 | 0.999 | 0.974 |
1 | V8 | 0 | 599 | 3.8% | bad | 0.158 | 0.203 | 0.635 | 0.653 | 0.861 |
2 | V8 | 0 | 124 | 0.8% | verybad | 0.155 | 0.164 | 0.679 | 0.082 | 0.899 |
3 | V8 | 1 | 1537 | 99.4% | good | 0.003 | 0.041 | 0.956 | 1.000 | 0.996 |
4 | V8 | 1 | 7 | 0.5% | bad | 0.000 | 0.059 | 0.941 | 0.777 | 1.000 |
5 | V8 | 1 | 2 | 0.1% | verybad | 0.482 | 0.000 | 0.500 | 0.250 | 0.750 |
6 | WGS-V3 | 0 | 15099 | 95.4% | good | 0.005 | 0.046 | 0.949 | 0.985 | 0.979 |
7 | WGS-V3 | 0 | 419 | 2.6% | bad | 0.135 | 0.182 | 0.682 | 0.924 | 0.924 |
8 | WGS-V3 | 0 | 313 | 2.0% | verybad | 0.524 | 0.456 | 0.013 | 0.716 | 0.530 |
9 | WGS-V3 | 1 | 1505 | 97.3% | good | 0.002 | 0.038 | 0.960 | 0.998 | 0.996 |
10 | WGS-V3 | 1 | 38 | 2.5% | bad | 0.043 | 0.132 | 0.825 | 1.000 | 1.000 |
11 | WGS-V3 | 1 | 3 | 0.2% | verybad | 0.580 | 0.302 | 0.102 | 0.833 | 0.500 |
We can also ignore the coverage and just compare the two ROI definitions:
%%gor
$mydefs
nor [#gene_roi#]
| inset -c hgnc_id -b [#geneMBval#] | rename inSet GeneMBval
| select hgnc_id,genembval,new_roi_fr,old_roi_fr
| unpivot 3-
| calc qual if(col_value = 1,'perfect',if(col_value > 0.95,'good',if(col_value > 0.9,'ok','bad')))
| calc roi if(col_name = 'new_roi_fr','new','old')
| group -gc roi,genembval,qual -count
| granno -gc roi,genembval -sum -ic allcount
| calc f form(100*allcount/sum_allcount,3,1)+'%'
| hide sum_allcount
| sort -c roi,genembval,allcount:nr
| rename allcount Genes
Query ran in 0.26 sec Query fetched 16 rows in 0.11 sec (total time 0.37 sec)
roi | GeneMBval | qual | Genes | f | |
---|---|---|---|---|---|
0 | new | 0 | perfect | 14865 | 93.5% |
1 | new | 0 | bad | 715 | 4.5% |
2 | new | 0 | good | 168 | 1.1% |
3 | new | 0 | ok | 149 | 0.9% |
4 | new | 1 | perfect | 1529 | 98.9% |
5 | new | 1 | good | 6 | 0.4% |
6 | new | 1 | ok | 6 | 0.4% |
7 | new | 1 | bad | 5 | 0.3% |
8 | old | 0 | perfect | 15107 | 95.0% |
9 | old | 0 | bad | 655 | 4.1% |
10 | old | 0 | good | 91 | 0.6% |
11 | old | 0 | ok | 44 | 0.3% |
12 | old | 1 | perfect | 1518 | 98.2% |
13 | old | 1 | bad | 14 | 0.9% |
14 | old | 1 | good | 11 | 0.7% |
15 | old | 1 | ok | 3 | 0.2% |