Exon coverage analysis¶

ExonCovRoiandIlluminaQC_analysis

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "clin-hg19"
%env GOR_API_PROJECT={project}

Gene exon coverage analysis¶

This notebook looks into how well our WES and WGS sequencing is covering the genes in the genome.

Definition of genes and coding exons¶

The definition of the Refgene transcripts in hg19 and the corresponding coding regions it downloaded from the following UCSC link:

https://genome-euro.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=1&hgta_group=genes&hgta_track=refSeqComposite&hgta_table=refGene&hgsid=345929287_QygU2zjucafYXtAuZEr4HnbqYwAC

The scripts to convert this table into exons, coding exon, and transcript files can be found in: user_data/hakon/refgene/write_coding_exons.gorq and write_transcripts.gorq. Our definition of genes is an export from GeneMb, ref_gregor/genemb/gmb_genes.gorz, and for gene primary transcript definition, we use definitions from Oscar. However, for genes where such definition is missing, we use Mane transcripts and were neither Oscar nor Mane definition is available, we use the transcript with the largest coding region, based on our UCSC download. The logic can be found in user_data/hakon/refgene/select_new_primary_trans.gorq.

Genes are also classified into validated and non-validated based on statuses and disease inheritance classification in GeneMB, e.g. where mechanism = 'amorph' or disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR','AR','XD','AR','DI','MI').

Coverage statistics from samples¶

We have then calculated coverage statistcs for every coding exon for 13225 V8 (WES) samples and 1478 WGS-V3 samples. This query for this analysis can be found in user_data/hakon/wesCov/ucsc_coding_exon_kit_coverage.gorq and the output data is stored in user_data/hakon/wesCov/ucsc_exon_kit_stat.gord, shown below:

In [2]:

%%gor
gor user_data/hakon/wesCov/ucsc_exon_kit_stat.gord | top 6

Query ran in 4.42 sec.
Query fetched 6 rows in 0.01 sec (total time 4.43 sec)

Out[2]:

	chrom	bpStart	bpStop	transcript_stable_id	exon	kit	Samples	avg_depth	std_depth	min_depth	max_depth	avg_lt5	min_lt5	max_lt5	avg_lt10	min_lt10	max_lt10
0	chr1	69090	70008	NM_001005484	1	V8	13225	199.623	79.869	52.9	1067.0	0.031	0.000	0.164	0.043	0.0	0.236
1	chr1	69090	70008	NM_001005484	1	WGS-V3	1478	15.745	11.630	0.0	61.1	0.264	0.000	1.000	0.375	0.0	1.000
2	chr1	367658	368597	NM_001005221	1	V8	13225	0.255	0.306	0.0	2.3	0.999	0.803	1.000	1.000	1.0	1.000
3	chr1	367658	368597	NM_001005221	1	WGS-V3	1478	0.000	NaN	0.0	0.2	1.000	1.000	1.000	1.000	1.0	1.000
4	chr1	367658	368597	NM_001005224	1	V8	13225	0.255	0.306	0.0	2.3	0.999	0.803	1.000	1.000	1.0	1.000
5	chr1	367658	368597	NM_001005224	1	WGS-V3	1478	0.000	NaN	0.0	0.2	1.000	1.000	1.000	1.000	1.0	1.000

Regions of interest¶

In addition to coverage analysis from samples, we also look into how our ROI definitions correspond with the actual coding exon regions. While we have explored several version of ROI definitions, we will only consider the "old" definition (believed to be based on the XA onboarding) and the "new" definition, based on regions covered by with 10 or more reads by by 75% of the WES samples.

Difficult regions in the genome¶

Finally, we also used a method, described by Illumina to classify the genome into regions where good sequence variant concordance can be expected. This is done by analysing sequence read base quality, mapping quality and pileup depth for 50 WGS samples (see https://www.illumina.com/science/genomics-research/articles/identifying-genomic-regions-with-high-quality-single-nucleotide-.html) and the corresponding GORdb query logic in user_data/hakon/IlluminaQC/WGS/pileup_analysis.gorq. As shown in this article, the SNV concordance is 95.7% for autosomal SNPs in about 90% of the genome that is high-quality (where the three quality indicators are good as defined by them), while it drops to 66.7% or other regions and down to 46.8% in regions where all three metrics indicate poor quality. The output of this analysis is stored in user_data/hakon/IlluminaQC/WGS/IllQualityRegions.gord. We can summarize it for the genome as shown below and see that it compares will with the results in the paper (e.g. 90% vs 88.3% of high-quality regions)

In [3]:

%%gor
create #temp# = gor user_data/hakon/IlluminaQC/WGS/IllQualityRegions.gord
| calc s #3-#2
| group genome -sum -ic s -gc good*
| granno genome -sum -ic sum*
| calc f sum_s/sum_sum_s;

nor [#temp#] | select goodMapQ- | sort -c f:nr

Query ran in 4.24 sec.
Query fetched 8 rows in 0.00 sec (total time 4.24 sec)

Out[3]:

	goodMapQ	goodDepth	goodBaseQ	goodLabels	sum_s	sum_sum_s	f
0	1	1	1	three	2588400062	2861338695	0.904612
1	1	0	1	two	108502086	2861338695	0.037920
2	0	1	1	two	62452951	2861338695	0.021826
3	0	0	1	one	55622933	2861338695	0.019439
4	1	1	0	two	34774931	2861338695	0.012153
5	1	0	0	one	6204224	2861338695	0.002168
6	0	0	0	none	3947616	2861338695	0.001380
7	0	1	0	one	1433892	2861338695	0.000501

Exon analysis¶

Now we proceed with few definitions and the code that calculates overlap with our coding exons. First few definitions:

In [4]:

qh = GQH.GOR_Query_Helper()
mydefs = ""

In [5]:

mydefs = qh.add_many("""
def #exons# = user_data/hakon/refgene/coding_exons.gorz;
def #roi# = user_data/hakon/wesCov/roi10_75.gorz;
def #genes# = ref_gregor/genemb/gmb_genes.gorz;

def #old_roi# = ref_gregor/util/target_roi.gorz;

def #new_roi# = ref_gregor/util/roi10_75.gorz;

create ##pt## = nor user_data/hakon/refgene/hgnc_primary_transcript.tsv
    | replace #2 if(posof(#2,'.')>0,left(#2,posof(#2,'.')),#2);
""")

Then we calculate the fractional overlap of each exon with the two ROI definitions. We then define 4 different boolean quality labels based on fractional overlap with each exon.

In [7]:

mydefs = qh.add_many("""
create #exon_old_roi# = gor #exons#
| join -segseg -l -e 0 <(gor #old_roi# | rename #2 bpstart | rename #3 bpstop)
| calc ov if(bpstop=0,0,min(bpstop,exon_stop)-max(bpstart,exon_start))
| group 1 -gc #3,transcript_stable_id,hgnc_id,gene_symbol,strand,exon -sum -ic ov
| rename sum_ov ov
| calc size #3-#2
| calc lovf ov/size
| calc missEx95 if(lovf<0.95,1,0)
| calc missEx90 if(lovf<0.90,1,0)
| calc missEx75 if(lovf<0.75,1,0)
| calc missEx50 if(lovf<0.50,1,0)
;

create #exon_new_roi# = gor #exons#
| join -segseg -l -e 0 <(gor #new_roi# | rename #2 bpstart | rename #3 bpstop)
| calc ov if(bpstop=0,0,min(bpstop,exon_stop)-max(bpstart,exon_start))
| group 1 -gc #3,transcript_stable_id,hgnc_id,gene_symbol,strand,exon -sum -ic ov
| rename sum_ov ov
| calc size #3-#2
| calc lovf ov/size
| calc missEx95 if(lovf<0.95,1,0)
| calc missEx90 if(lovf<0.90,1,0)
| calc missEx75 if(lovf<0.75,1,0)
| calc missEx50 if(lovf<0.50,1,0)
;
""")

To understand this data lets look at few rows:

In [10]:

%%gor
$mydefs
gor [#exon_new_roi#] | top 3

Query ran in 5.27 sec.
Query fetched 3 rows in 0.00 sec (total time 5.27 sec)

Out[10]:

	chrom	exon_start	exon_stop	transcript_stable_id	hgnc_id	gene_symbol	strand	exon	ov	size	lovf	missEx95	missEx90	missEx75	missEx50
0	chr1	69090	70008	NM_001005484	HGNC:14825	OR4F5	+	1	822	918	0.895425	1	1	0	0
1	chr1	367658	368597	NM_001005221	HGNC:31275	OR4F29	+	1	0	939	0.000000	1	1	1	1
2	chr1	367658	368597	NM_001005224	HGNC:8300	OR4F3	+	1	0	939	0.000000	1	1	1	1

Then the overlap with our regions based on the Illumina quality metrics, i.e. based on three good, two, one etc.:

In [11]:

mydefs = qh.add_many("""
create #exon_Ill# = gor #exons#
| join -segseg -l -e 0 <(gor user_data/hakon/IlluminaQC/WGS/IllQualityRegions.gord | select 1-bpstop,goodlabels)
| calc ov if(bpstop=0,0,min(bpstop,exon_stop)-max(bpstart,exon_start))
| group 1 -gc #3,transcript_stable_id,goodlabels -sum -ic ov
| rename sum_ov ov
| pivot -gc #3,transcript_stable_id goodlabels -v none,one,two,three -e 0
;
""")

Here are few rows for clarification:

In [12]:

%%gor
$mydefs
gor [#exon_Ill#] | top 3

Query ran in 0.19 sec
Query fetched 3 rows in 0.00 sec (total time 0.19 sec)

Out[12]:

	chrom	exon_start	exon_stop	transcript_stable_id	none_ov	one_ov	two_ov
0	chr1	69090	70008	NM_001005484	4	355	559
1	chr1	367658	368597	NM_001005221	0	538	401
2	chr1	367658	368597	NM_001005224	0	538	401

Then we pick up the exon based coverage and calculate boolean flags based on some cut-offs:

In [13]:

mydefs = qh.add_many("""
create #exon_cov# = gor #exons#  | join -snpsnp -xl exon_stop,transcript_stable_id -xr bpstop,transcript_stable_id user_data/hakon/wesCov/ucsc_exon_kit_stat.gord
| calc bad_avg_depth if(avg_depth < 10,1,0)
| calc bad_avg_lt10 if(avg_lt10 > 0.25,1,0)
| select chrom-gene_symbol,kit,avg_depth,avg_lt10,bad_*;
""")

In [14]:

%%gor
$mydefs
gor [#exon_cov#] | top 6

Query ran in 1.37 sec
Query fetched 6 rows in 0.00 sec (total time 1.37 sec)

Out[14]:

	chrom	exon_start	exon_stop	exon	transcript_stable_id	strand	hgnc_id	gene_symbol	kit	avg_depth	avg_lt10	bad_avg_depth	bad_avg_lt10
0	chr1	69090	70008	1	NM_001005484	+	HGNC:14825	OR4F5	V8	199.623	0.043	0	0
1	chr1	69090	70008	1	NM_001005484	+	HGNC:14825	OR4F5	WGS-V3	15.745	0.375	0	1
2	chr1	367658	368597	1	NM_001005221	+	HGNC:31275	OR4F29	V8	0.255	1.000	1	1
3	chr1	367658	368597	1	NM_001005221	+	HGNC:31275	OR4F29	WGS-V3	0.000	1.000	1	1
4	chr1	367658	368597	1	NM_001005224	+	HGNC:8300	OR4F3	V8	0.255	1.000	1	1
5	chr1	367658	368597	1	NM_001005224	+	HGNC:8300	OR4F3	WGS-V3	0.000	1.000	1	1

We notice that in the above relation, we have two rows for each exon; one for WGS and one for WES.

Gene analysis¶

First our GeneMB gene classification:

In [15]:

mydefs = qh.add_many("""
create #geneMBval# = nor  ref_gregor/genemb/gmb_genes.gorz | map -c hgnc_id ref_gregor/genemb/gene_disease_inher.tsv
| where mechanism = 'amorph' or disease_status = 'validated' and inheritance in ('AD','AD-ImpM','AD-ImpP','XL','XR','AR','XD','AR','DI','MI')
| select hgnc_id ;
""")

Gene aggregation¶

To simplify our interpretation of the gene analysis, we only use exons that are part of the primary transcripts.

In [16]:

mydefs = qh.add_many("""
create #gene_summ# = gor #genes# | join -segseg -xl hgnc_id -xr hgnc_id <(gor [#exon_new_roi#]
| map -c hgnc_id [##pt##]
| where transcript_stable_id = primary_transcript
| join -snpsnp -xl exon_stop,transcript_stable_id -xr exon_stop,transcript_stable_id [#exon_cov#]
| calc ov_size ov*size
| calc avg_depth_size avg_depth*size
| calc avg_lt10_size avg_lt10*size
| join -snpsnp -xl exon_stop,transcript_stable_id -xr exon_stop,transcript_stable_id [#exon_Ill#]
)
| group 1 -gc gene_end,transcript_stable_id,hgnc_id,gene_symbol,strand,kit -count -sum -ic size,ov_size,avg_depth_size,avg_lt10_size,missEx95,missEx90,missEx75,missEx50,bad_avg_depth,bad_avg_lt10,none_ov-three_ov
| calc avg_depth form(sum_avg_depth_size/sum_size,3,1)
| calc avg_lt10 form(sum_avg_lt10_size/sum_size,3,1)
| rename allcount Exons
| replace sum_none_ov-sum_three_ov form(float(#rc)/sum_size,3,3)
| rename sum_(.*)_ov #{1}_ovfr
| select 1,2,gene_end,gene_symbol,hgnc_id,transcript_stable_id,strand,kit,Exons,sum_size,sum_missEx95,sum_missEx90,sum_missEx75,sum_missEx50,sum_bad_avg_depth,sum_bad_avg_lt10,avg_depth,avg_lt10,none_ovfr-three_ovfr
| rename sum_size CodingSize
| inset -c hgnc_id -b [#geneMBval#] | rename inSet GeneMBval
;

create #gene_roi# = nor [#exon_new_roi#] 
| map -c hgnc_id [##pt##]
| where transcript_stable_id = primary_transcript
| group -gc hgnc_id -sum -ic ov,size | calc roi_fr form(sum_ov/sum_size,3,3) | calc roi 'new'
| merge <(nor [#exon_old_roi#] 
| map -c hgnc_id [##pt##]
| where transcript_stable_id = primary_transcript
| group -gc hgnc_id -sum -ic ov,size | calc roi_fr form(sum_ov/sum_size,3,3) | calc roi 'old')
| hide sum_*
| pivot -gc hgnc_id roi -v new,old -e 0;
""")

In [17]:

%%gor
$mydefs
gor [#gene_summ#] | top 3

Query ran in 7.47 sec.
Query fetched 3 rows in 0.00 sec (total time 7.48 sec)

Out[17]:

	chrom	gene_start	gene_end	gene_symbol	hgnc_id	transcript_stable_id	strand	kit	Exons	CodingSize	...	sum_missEx50	sum_bad_avg_depth	sum_bad_avg_lt10	avg_depth	avg_lt10	none_ovfr	one_ovfr	two_ovfr
0	chr1	65418	71585	OR4F5	HGNC:14825	NM_001005484	+	V8	1	918	...	0	0	0	199.6	0.0	0.004	0.387	0.609
1	chr1	65418	71585	OR4F5	HGNC:14825	NM_001005484	+	WGS-V3	1	918	...	0	0	1	15.7	0.4	0.004	0.387	0.609
2	chr1	367658	368597	OR4F29	HGNC:31275	NM_001005221	+	V8	1	939	...	1	1	1	0.3	1.0	0.000	0.573	0.427

3 rows × 23 columns

In [23]:

%%gor
$mydefs
nor [#gene_roi#] | top 3

Query ran in 0.20 sec
Query fetched 3 rows in 0.00 sec (total time 0.21 sec)

Out[23]:

	hgnc_id	new_roi_fr	old_roi_fr
0	HGNC:33809	1.000	1.0
1	HGNC:25954	0.432	1.0
2	HGNC:25489	1.000	1.0

Finally, one example of using this data to group the genes into good and bad based on coverage.

We split the genes into validated and non-validated
We separate WGS and WES
We annotate with the Illumina QC region, i.e. the fractional coverage of the different types (three,two, and one)
We also annotate with the ROI overlap

In [26]:

%%gor
$mydefs

nor [#gene_summ#]
| calc good if(sum_bad_avg_depth = 0 and sum_bad_avg_lt10 = 0 and avg_lt10 < 0.01,1,0)
| calc bad if(sum_bad_avg_depth/exons > 0.1  or sum_bad_avg_lt10/exons >0.1 or avg_lt10 > 0.1 or avg_depth < 20,1,0)
| calc verybad if(sum_bad_avg_depth/exons > 0.1  and sum_bad_avg_lt10/exons >0.1  and avg_lt10 > 0.1 and avg_depth < 20,1,0)
| calc qctype if(verybad=1,'verybad',if(bad=1,'bad',if(good=1,'good','good')))
| map -c hgnc_id [#gene_roi#]
| group -gc kit,genembval,qctype -count -fc new_roi_fr,old_roi_fr,one_ovfr-three_ovfr -avg -std
| replace avg*,std* form(float(#rc),3,3)
| calc order decode(qctype,'good,1,bad,2,verybad,3')
| sort -c kit,genembval,order | hide order
| granno -gc kit,genembval -sum -ic allcount
| calc f form(100*allcount/sum_allcount,4,1)+'%'
| columnreorder kit,genembval,allcount,f,qctype
| hide std_*,sum_allcount

Query ran in 0.26 sec
Query fetched 12 rows in 0.35 sec (total time 0.61 sec)

Out[26]:

	kit	GeneMBval	allCount	f	qctype	avg_one_ovfr	avg_two_ovfr	avg_three_ovfr	avg_new_roi_fr	avg_old_roi_fr
0	V8	0	15108	95.4%	good	0.012	0.051	0.937	0.999	0.974
1	V8	0	599	3.8%	bad	0.158	0.203	0.635	0.653	0.861
2	V8	0	124	0.8%	verybad	0.155	0.164	0.679	0.082	0.899
3	V8	1	1537	99.4%	good	0.003	0.041	0.956	1.000	0.996
4	V8	1	7	0.5%	bad	0.000	0.059	0.941	0.777	1.000
5	V8	1	2	0.1%	verybad	0.482	0.000	0.500	0.250	0.750
6	WGS-V3	0	15099	95.4%	good	0.005	0.046	0.949	0.985	0.979
7	WGS-V3	0	419	2.6%	bad	0.135	0.182	0.682	0.924	0.924
8	WGS-V3	0	313	2.0%	verybad	0.524	0.456	0.013	0.716	0.530
9	WGS-V3	1	1505	97.3%	good	0.002	0.038	0.960	0.998	0.996
10	WGS-V3	1	38	2.5%	bad	0.043	0.132	0.825	1.000	1.000
11	WGS-V3	1	3	0.2%	verybad	0.580	0.302	0.102	0.833	0.500

We can also ignore the coverage and just compare the two ROI definitions:

In [28]:

%%gor
$mydefs

nor [#gene_roi#]
| inset -c hgnc_id -b [#geneMBval#] | rename inSet GeneMBval
| select hgnc_id,genembval,new_roi_fr,old_roi_fr
| unpivot 3-
| calc qual if(col_value = 1,'perfect',if(col_value > 0.95,'good',if(col_value > 0.9,'ok','bad')))
| calc roi if(col_name = 'new_roi_fr','new','old') 
| group -gc roi,genembval,qual -count
| granno -gc roi,genembval -sum -ic allcount
| calc f form(100*allcount/sum_allcount,3,1)+'%'
| hide sum_allcount
| sort -c roi,genembval,allcount:nr
| rename allcount Genes

Query ran in 0.26 sec
Query fetched 16 rows in 0.11 sec (total time 0.37 sec)

Out[28]:

	roi	GeneMBval	qual	Genes	f
0	new	0	perfect	14865	93.5%
1	new	0	bad	715	4.5%
2	new	0	good	168	1.1%
3	new	0	ok	149	0.9%
4	new	1	perfect	1529	98.9%
5	new	1	good	6	0.4%
6	new	1	ok	6	0.4%
7	new	1	bad	5	0.3%
8	old	0	perfect	15107	95.0%
9	old	0	bad	655	4.1%
10	old	0	good	91	0.6%
11	old	0	ok	44	0.3%
12	old	1	perfect	1518	98.2%
13	old	1	bad	14	0.9%
14	old	1	good	11	0.7%
15	old	1	ok	3	0.2%