Parallel Queries: PGOR¶

ParallelQueries_PGOR

In [2]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "matthew_sampson_hg38"
%env GOR_API_PROJECT={project}

Parallel GORpipe queries¶

This notebook explains the basics of how to use parallel execution in GORdb. Parallel execution in important when working with very large datasets or when applying computationally heavy operations on the data. Importantly, GOR is inherently multi-threaded, meaning that regular commands may actually involve more than one thread, e.g. a special thread for decompressing files, another thread for command steps, another for nested queries, or for writing and compressing data. The commands we will discuss here are the macro commands for parallel execution. They are three:

PGOR
PARTGOR
PARALLEL

This notebook will focus only on PGOR and the more advanced commands will be described separately. We will proceed with selected examples to explain the behaviour but more detailed specification can be found in the documentation pages.

Basic PGOR functionality¶

Consider counting the number of exons per chromosome. Note that the number of exons is less than 2 millions and this query runs a second or two so it hardly justifies using parallel commands to speed it up.

In [3]:

%%gor

gor #exons# | group chrom -count | granno genome -sum -ic allcount

Query ran in 1.16 sec
Query fetched 25 rows in 2.86 sec (total time 4.02 sec)

Out[3]:

	chrom	bpStop	allCount	sum_allCount
0	chr1	248956422	145831	1572313
1	chr10	133797422	61586	1572313
2	chr11	135086622	91380	1572313
3	chr12	133275309	86588	1572313
4	chr13	114364328	27565	1572313
5	chr14	107043718	55289	1572313
6	chr15	101991189	58444	1572313
7	chr16	90338345	70107	1572313
8	chr17	83257441	88751	1572313
9	chr18	80373285	27803	1572313
10	chr19	58617616	81557	1572313
11	chr2	242193529	123667	1572313
12	chr20	64444167	37696	1572313
13	chr21	46709983	18791	1572313
14	chr22	50818468	33549	1572313
15	chr3	198295559	104300	1572313
16	chr4	190214555	65051	1572313
17	chr5	181538259	70514	1572313
18	chr6	170805979	73980	1572313
19	chr7	159345973	74905	1572313
20	chr8	145138636	58508	1572313
21	chr9	138394717	60518	1572313
22	chrM	16569	37	1572313
23	chrX	156040895	51549	1572313
24	chrY	57227415	4347	1572313

Note: We use the GRANNO genome command to sum and annotate the number of exons across the genome. One has to be careful applying this command with large grouping range because it load the rows into memory for annotation, however, here the rows are only 24.

Now we perform the same using the macro commad PGOR:

In [4]:

%%gor

pgor #exons# | group chrom -count | granno genome -sum -ic allcount

Query ran in 0.36 sec
Query fetched 25 rows in 0.02 sec (total time 0.38 sec)

Out[4]:

	chrom	bpStop	allCount	sum_allCount
0	chr1	248956422	145831	145831
1	chr10	133797422	61586	61586
2	chr11	135086622	91380	91380
3	chr12	133275309	86588	86588
4	chr13	114364328	27565	27565
5	chr14	107043718	55289	55289
6	chr15	101991189	58444	58444
7	chr16	90338345	70107	70107
8	chr17	83257441	88751	88751
9	chr18	80373285	27803	27803
10	chr19	58617616	81557	81557
11	chr2	242193529	123667	123667
12	chr20	64444167	37696	37696
13	chr21	46709983	18791	18791
14	chr22	50818468	33549	33549
15	chr3	198295559	104300	104300
16	chr4	190214555	65051	65051
17	chr5	181538259	70514	70514
18	chr6	170805979	73980	73980
19	chr7	159345973	74905	74905
20	chr8	145138636	58508	58508
21	chr9	138394717	60518	60518
22	chrM	16569	37	37
23	chrX	156040895	51549	51549
24	chrY	57227415	4347	4347

Alert: PGOR cannot be used inside a nested-query. Also, notice that even though GRANNO has genome group range, the sum_allCount number is only the sum per chromosome because of the way the PGOR command is implemented, i.e. there is no automatic distributive aggregation.

Previous parallel command is indeed a shorthand equivalent to the following query:

create #chr1# = gor -p chr1 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
create #chr10# = gor -p chr10 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
create #chr11# = gor -p chr11 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
...
create #chrY# = gor -p chrY <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);

gor [#chr1#] [#chr10#] [#chr11#] ... [#chrY#]

Therefore, one or more gor-workers execute each CREATE statement and the result (a single row in this case) will be persited in cache. Also the data in scope for each GRANNO command is only the rows within each create statement. Because of the caching, re-run of a PGOR is always fast. Indeed, PGOR does always use a CREATE statement and the query can also be written the in following way:

In [5]:

%%gor
create #exon_counts# = pgor #exons# | group chrom -count | granno genome -sum -ic allcount;
gor [#exon_counts#] | top 3

Query ran in 0.19 sec
Query fetched 3 rows in 0.02 sec (total time 0.22 sec)

Out[5]:

	chrom	bpStop	allCount	sum_allCount
0	chr1	248956422	145831	145831
1	chr10	133797422	61586	61586
2	chr11	135086622	91380	91380

Fix the aggregation¶

In order to get correct exon counts across the genome we must aggregate after the parallel execution, a pattern that is commonly applied in GOR queries:

In [6]:

%%gor 
create #exon_counts# = pgor #exons# | group chrom -count ;
gor [#exon_counts#] | granno genome -sum -ic allcount | top 3

Query ran in 1.08 sec
Query fetched 3 rows in 0.20 sec (total time 1.28 sec)

Out[6]:

	chrom	bpStop	allCount	sum_allCount
0	chr1	248956422	145831	1572313
1	chr10	133797422	61586	1572313
2	chr11	135086622	91380	1572313

Controlling the level of splitting¶

Before we simply used PGOR without any instructions of how to split the queries across the genome. By default, the execution engine will rely on a metadata chrom split file and partition the larger chromosomes into two segments, with the cut at the centromeres, and a single partition for the smaller ones, i.e. 38 partitions. However, if there are commands that perform grouping where the grouping range is larger than 1base, it uses only one per chromosome, i.e. 25 partitions.

The PGOR command does however have a -split option that allows us to control how the splitting is performed. For instance, if there is large volume of data, we may want to use many more partitions.

Tip: For efficient parallel processing, it is important not to use too many partitions, because there is a small startup overhead cost for each gor-worker. Also, each worker will deliver it's results in a cache-file and merging the results from too many files can generate additional overhead.

We will now to our attention to the dbNSFP data set. It is a very large table with about 84M rows of variants and 643 columns. Our goal is to map calibrated Revel and MutPred scores to ACMG evidence level for PP3 and BP4. For the following examples, we will use the Python GOR query helper library.

In [7]:

qh = GQH.GOR_Query_Helper()
qh.add("""def ##ref## = ref;""")
qh.add("""def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol""")
mydefs = qh.defs()
print(mydefs)

def ##ref## = ref;
def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;

Now we can use the string variable mydefs within the magic syntax like this:

In [8]:

%%gor
$mydefs
nor ##dbnsfp## | top 1 | unpivot 1- | granno -count | rename allcount Tot_Col_Count
| where lower(col_name) in ('chr','pos','ref','alt','transcript_stable_id','gene_symbol','revel_score','mutpred_score')

Query ran in 1.13 sec
Query fetched 8 rows in 0.01 sec (total time 1.15 sec)

Out[8]:

	Col_Name	Col_Value	Tot_Col_Count
0	chr	chr1	643
1	pos	65565	643
2	ref	A	643
3	alt	C	643
4	gene_symbol	OR4F5	643
5	transcript_stable_id	ENST00000641515	643
6	REVEL_score	.	643
7	MutPred_score	.	643

We inspect the dbNSFP table and see that it has many of the columns with semicolon separated list, i.e. multiple transcrips are collapsed into lists per variant.

In [9]:

%%gor
$mydefs

gor ##ref##/ensgenes/ensgenes_transcripts.gor | segproj | where segcount > 1
| join -segsnp -ir ##dbnsfp## 
| where contains(transcript_stable_id,';')
| top 1
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| unpivot 3-

Query ran in 1.07 sec
Query fetched 6 rows in 2.12 sec (total time 3.19 sec)

Out[9]:

	chr	pos	Col_Name	Col_Value
0	chr1	925942	ref	A
1	chr1	925942	alt	C
2	chr1	925942	transcript_stable_id	ENST00000420190;ENST00000437963;ENST0000034206...
3	chr1	925942	gene_symbol	SAMD11;SAMD11;SAMD11;SAMD11;SAMD11;SAMD11;SAMD...
4	chr1	925942	REVEL_score	0.521;0.521;0.521;.;.;.;.;.;.;.;.
5	chr1	925942	MutPred_score	0.988

We will now split these lists into multiple rows and replace the score value where it is missing for the transcrips with the minimum score per variant.

In [10]:

%%gor
$mydefs

gor ##ref##/ensgenes/ensgenes_transcripts.gor | segproj | where segcount > 1
| join -segsnp -ir ##dbnsfp## 
| where contains(transcript_stable_id,';')
| top 1
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))

Query ran in 0.98 sec
Query fetched 11 rows in 1.19 sec (total time 2.17 sec)

Out[10]:

	chr	pos	ref	alt	transcript_stable_id	gene_symbol	REVEL_score	MutPred_score	min_REVEL_score	min_MutPred_score	new_revel_score	new_mutpred_score
0	chr1	925942	A	C	ENST00000420190	SAMD11	0.521	0.988	0.521	0.988	0.521	0.988
1	chr1	925942	A	C	ENST00000437963	SAMD11	0.521	NaN	0.521	0.988	0.521	0.988
2	chr1	925942	A	C	ENST00000342066	SAMD11	0.521	NaN	0.521	0.988	0.521	0.988
3	chr1	925942	A	C	ENST00000618181	SAMD11	.	NaN	0.521	0.988	0.521	0.988
4	chr1	925942	A	C	ENST00000622503	SAMD11	.	NaN	0.521	0.988	0.521	0.988
5	chr1	925942	A	C	ENST00000618323	SAMD11	.	NaN	0.521	0.988	0.521	0.988
6	chr1	925942	A	C	ENST00000616016	SAMD11	.	NaN	0.521	0.988	0.521	0.988
7	chr1	925942	A	C	ENST00000618779	SAMD11	.	NaN	0.521	0.988	0.521	0.988
8	chr1	925942	A	C	ENST00000616125	SAMD11	.	NaN	0.521	0.988	0.521	0.988
9	chr1	925942	A	C	ENST00000620200	SAMD11	.	NaN	0.521	0.988	0.521	0.988
10	chr1	925942	A	C	ENST00000617307	SAMD11	.	NaN	0.521	0.988	0.521	0.988

Specifying the number of split partitions¶

We are now ready to create our first transformation of dbNSFP

In [11]:

mydefs = qh.add("""create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## 
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
""")
print(mydefs)

def ##ref## = ref;
def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;
create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## 
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
;

Notice that we use here 100 splits, because dbNSFP is a large table and we are performing significant calculations per row. With sufficient number of stand-by workers this should finish in about a minute. Also, worth noting is that with the CREATE statement, we are creating a temporary table which is significantly smaller than the original dbNSFP table, with its 643 columns.

In [12]:

%%gor
$mydefs

gor [##dbnsfp_small##] | where revel_score != '.' | top 2

Query ran in 0.74 sec
Query fetched 2 rows in 0.15 sec (total time 0.90 sec)

Out[12]:

	chr	pos	ref	alt	transcript_stable_id	gene_symbol	revel_score	mutpred_score
0	chr1	69091	A	C	ENST00000641515	OR4F5	0.109	0.823
1	chr1	69091	A	C	ENST00000335137	OR4F5	0.109	0.823

To understand better how the results in ##dbnsfp_small## are stored we can inspect the "dictionary" like this:

In [13]:

%%gor
$mydefs

nor -asdict [##dbnsfp_small##] | granno -count | top 5

Query ran in 0.58 sec
Query fetched 5 rows in 0.02 sec (total time 0.60 sec)

Out[13]:

	col1	col2	col3	col4	col5	col6	allCount
0	../../../../cache/result_cache/cache01/a64/a64...	1	chr7	120274633	chr7	149879677	114
1	../../../../cache/result_cache/cache01/130/130...	2	chr12	61710374	chr12	89655886	114
2	../../../../cache/result_cache/cache01/440/440...	3	chr6	30061525	chr6	57646158	114
3	../../../../cache/result_cache/cache01/bc3/bc3...	4	chr2	94588522	chr2	119977070	114
4	../../../../cache/result_cache/cache01/e9e/e9e...	5	chr9	14807	chr9	27950671	114

We see reference to 114 cache files (slightly different than the target split of 100 because of uneven chromosome sizes) and each cache file has metadata telling the system the position of the first and last row in any given file. Now we are ready to map the scores to ACMG evidence.

In [14]:

mydefs = qh.add("""create ##BP4_PP3_scores## = pgor [##dbnsfp_small##]
| calc Revel_BP4_PP3 if(revel_score!='.',if(float(revel_score) <= 0.003,'BP4_verystrong',
if(0.003 < float(revel_score) and float(revel_score) <= 0.016,'BP4_strong',
if(0.016 < float(revel_score) and float(revel_score) <= 0.183,'BP4_moderate',
if(0.183 < float(revel_score) and float(revel_score) <= 0.290,'BP4_supporting',
if(0.644 <= float(revel_score) and float(revel_score) < 0.773,'PP3_supporting',
if(0.773 <= float(revel_score) and float(revel_score) < 0.932,'PP3_moderate',
if(0.932 <= float(revel_score),'PP3_strong',''))))))),'')
| calc Mutpred_BP4_PP3 if(mutpred_score!='.',if(float(mutpred_score) <= 0.010,'BP4_strong',
if(0.010 < float(mutpred_score) and float(mutpred_score) <= 0.197,'BP4_moderate',
if(0.197 < float(mutpred_score) and float(mutpred_score) <= 0.391,'BP4_supporting',
if(0.737 <= float(mutpred_score) and float(mutpred_score) < 0.829,'PP3_supporting',
if(0.829 <= float(mutpred_score) and float(mutpred_score) < 0.932,'PP3_moderate',
if(0.932 <= float(mutpred_score),'PP3_strong','')))))),'')
""")

In [15]:

%%gor
$mydefs
gor [##BP4_PP3_scores##] | where len(revel_bp4_pp3)>1 | top 4

Query ran in 0.73 sec
Query fetched 4 rows in 0.06 sec (total time 0.79 sec)

Out[15]:

	chr	pos	ref	alt	transcript_stable_id	gene_symbol	revel_score	mutpred_score	Revel_BP4_PP3	Mutpred_BP4_PP3
0	chr1	69091	A	C	ENST00000641515	OR4F5	0.109	0.823	BP4_moderate	PP3_supporting
1	chr1	69091	A	C	ENST00000335137	OR4F5	0.109	0.823	BP4_moderate	PP3_supporting
2	chr1	69091	A	G	ENST00000641515	OR4F5	0.133	0.813	BP4_moderate	PP3_supporting
3	chr1	69091	A	G	ENST00000335137	OR4F5	0.133	0.813	BP4_moderate	PP3_supporting

Notice, that in ##BP4_PP3_scores## we use only the default split, because the volume of data we are working with now is much smaller than in the original dbNSFP table.

Specifying the partition with segments¶

Here we want to show how we can calculate how the evidence levels are distributes with VEP consequence labels. Hence, we will join the project VEP table with ##BP4_PP3_scores##. This can be a heavy quer and we would like to make sure that the load per gor-workers is similar. To do that we will use the SEGHIST command to generate regions with similar number of rows. We start by counting the total number of rows in the VEP table by joining the distinct variants with the transcrips:

In [16]:

mydefs = qh.add("def ##veppath## = source/anno/vep_v107;")

In [17]:

%%gor rowcount <<
$mydefs
create #temp# = pgor ##veppath##/allvars.gord 
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor 
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group chrom -sum -ic overlapcount;
nor [#temp#] | group -sum -ic sum_OverlapCount 

Query ran in 2.32 sec
Query fetched 1 rows in 5.91 sec (total time 8.22 sec)

In [18]:

rowCountPerSeg = int(rowcount.at[0,'sum_sum_OverlapCount']/200)
rowCountPerSeg

Out[18]:

The number above is the targe count to achieve 200 partitions. With it in hand, we can apply SEGHIST that gives us close to this count per segment.

In [19]:

mydefs = qh.add(f"""create #segs# = pgor ##veppath##/allvars.gord 
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group 10000 -sum -ic overlapcount | seghist {rowCountPerSeg}""")

In [20]:

%%gor
$mydefs
nor [#segs#] | granno -count | top 5

Query ran in 1.31 sec
Query fetched 5 rows in 1.69 sec (total time 3.00 sec)

Out[20]:

	Chrom	bpStart	bpStop	Count	allCount
0	chr1	0	6088290	1084847	216
1	chr1	6088290	16584106	1084847	216
2	chr1	16584106	26792265	1084847	216
3	chr1	26792265	39327763	1084847	216
4	chr1	39327763	45336868	1084847	216

We see that we get approximately 200 segments that contain close to 216k rows in the VEP table. Now we will join the consequence of the transcripts in the VEP table with the evidence level in the ##BP4_PP3_scores## table. For demonstration purposes, we will only use the segments on chr22 to begin with.

Note: SEGHIST provides a simple yet a very powerful framework to split parallelization based on expected load, e.g. density of variants, density of variants multiplied by transcript count as shown here, or any other measure that is expected to reflect query cost. The GROUP command in front of SEGHIST should use a range that is large enough such that the input to SEGHIST does not contain too many rows and small enough to enable SEGHIST to interpolate accurately. 10kbp is usually a safe number.

In [21]:

mydefs = qh.add(f"""create ##part_counts## = pgor -split <(gor [#segs#] | where chrom = 'chr22') [##BP4_PP3_scores##] 
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count""")

In [22]:

%%gor
$mydefs
gor [##part_counts##]

Query ran in 0.72 sec
Query fetched 791 rows in 0.02 sec (total time 0.73 sec)

Out[22]:

	chr	bpStart	bpStop	Revel_BP4_PP3	Mutpred_BP4_PP3	Consequence	allCount
0	chr22	0	50818468	NaN	NaN	3_prime_UTR_variant	70
1	chr22	0	50818468	NaN	NaN	NMD_transcript_variant	88
2	chr22	0	50818468	NaN	NaN	coding_sequence_variant	1
3	chr22	0	50818468	NaN	NaN	intron_variant	94
4	chr22	0	50818468	NaN	NaN	missense_variant	15360
...	...	...	...	...	...	...	...
786	chr22	0	50818468	PP3_supporting	PP3_strong	missense_variant	57
787	chr22	0	50818468	PP3_supporting	PP3_supporting	5_prime_UTR_variant	2
788	chr22	0	50818468	PP3_supporting	PP3_supporting	missense_variant	271
789	chr22	0	50818468	PP3_supporting	PP3_supporting	splice_region_variant	13
790	chr22	0	50818468	PP3_supporting	PP3_supporting	start_lost	3

791 rows × 7 columns

Now we override ##part_counts## and for the whole genome:

In [23]:

mydefs = qh.add(f"""create ##part_counts## = pgor -split <(gor [#segs#]) [##BP4_PP3_scores##] 
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count""")

and run a summary showing how the evidence levels are covered by different VEP impacts:

In [24]:

%%gor
$mydefs
nor [##part_counts##] 
| map -c consequence ##ref##/VEP_impact.map
| group -gc Revel_BP4_PP3,vep_impact -sum -ic allcount
| granno -gc Revel_BP4_PP3 -sum -ic sum_allcount
| calc f form(100*sum_allcount/sum_sum_allcount,2,1)+'%'
| rename sum_allcount count
| select revel_bp4_pp3,vep_impact,f,count
| pivot -gc revel_bp4_pp3 vep_impact -v 'HIGH','MODERATE','LOW','LOWEST' -e '0%'
| calc order decode(#1,'BP4_verystrong,1,BP4_strong,2,BP4_moderate,3,BP4_supporting,4,PP3_supporting,5,PP3_moderate,6,PP3_strong,7,0')
| sort -c order:r | hide order

Query ran in 3.91 sec.
Query fetched 8 rows in 6.14 sec (total time 10.05 sec)

Out[24]:

	Revel_BP4_PP3	HIGH_f	HIGH_count	MODERATE_f	MODERATE_count	LOW_f	LOW_count	LOWEST_f	LOWEST_count
0	PP3_strong	0.1%	282	94.4%	206552	5.3%	11704	0.1%	246
1	PP3_moderate	0.3%	2190	94.6%	685780	5.1%	36741	0.1%	428
2	PP3_supporting	0.5%	3315	94.8%	664206	4.7%	33059	0.1%	371
3	BP4_supporting	0.6%	13501	95.3%	2061078	4.1%	87969	0.0%	966
4	BP4_moderate	0.2%	12682	96.3%	5934168	3.5%	215552	0.0%	2658
5	BP4_strong	0.1%	320	97.2%	349011	2.6%	9515	0.1%	341
6	BP4_verystrong	0.1%	10	97.6%	18445	2.3%	435	0.1%	15
7	NaN	17.1%	822918	78.7%	3776064	4.1%	197932	0.1%	3973

Because we aggregated the counts for -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence, we can use the same CREATE statement ##part_counts## to summarize for Mutpred a second:

In [25]:

%%gor
$mydefs
nor [##part_counts##] 
| map -c consequence ##ref##/VEP_impact.map
| group -gc Mutpred_BP4_PP3,vep_impact -sum -ic allcount
| granno -gc Mutpred_BP4_PP3 -sum -ic sum_allcount
| calc f form(100*sum_allcount/sum_sum_allcount,2,1)+'%'
| rename sum_allcount count
| select Mutpred_BP4_PP3,vep_impact,f,count
| pivot -gc Mutpred_BP4_PP3 vep_impact -v 'HIGH','MODERATE','LOW','LOWEST' -e '0%'
| calc order decode(#1,'BP4_verystrong,1,BP4_strong,2,BP4_moderate,3,BP4_supporting,4,PP3_supporting,5,PP3_moderate,6,PP3_strong,7,0')
| sort -c order:r | hide order

Query ran in 1.01 sec
Query fetched 6 rows in 0.20 sec (total time 1.21 sec)

Out[25]:

	Mutpred_BP4_PP3	HIGH_f	HIGH_count	MODERATE_f	MODERATE_count	LOW_f	LOW_count	LOWEST_f	LOWEST_count
0	PP3_strong	17.1%	22489	77.9%	102720	4.9%	6427	0.2%	200
1	PP3_moderate	1.1%	5085	94.3%	427941	4.5%	20637	0.1%	303
2	PP3_supporting	0.5%	2778	95.2%	577747	4.3%	25966	0.0%	227
3	BP4_supporting	0.1%	4879	96.0%	3254904	3.8%	129346	0.1%	1932
4	BP4_moderate	0.1%	1853	96.4%	1196030	3.3%	41469	0.1%	1122
5	NaN	8.8%	818134	87.2%	8135962	4.0%	369062	0.1%	5214

Finally, if we want to test these queries in the Sequence Miner, we can get all the complete definitions like this:

In [26]:

print(qh.defs())

def ##ref## = ref;
def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;
create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## 
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
;
create ##BP4_PP3_scores## = pgor [##dbnsfp_small##]
| calc Revel_BP4_PP3 if(revel_score!='.',if(float(revel_score) <= 0.003,'BP4_verystrong',
if(0.003 < float(revel_score) and float(revel_score) <= 0.016,'BP4_strong',
if(0.016 < float(revel_score) and float(revel_score) <= 0.183,'BP4_moderate',
if(0.183 < float(revel_score) and float(revel_score) <= 0.290,'BP4_supporting',
if(0.644 <= float(revel_score) and float(revel_score) < 0.773,'PP3_supporting',
if(0.773 <= float(revel_score) and float(revel_score) < 0.932,'PP3_moderate',
if(0.932 <= float(revel_score),'PP3_strong',''))))))),'')
| calc Mutpred_BP4_PP3 if(mutpred_score!='.',if(float(mutpred_score) <= 0.010,'BP4_strong',
if(0.010 < float(mutpred_score) and float(mutpred_score) <= 0.197,'BP4_moderate',
if(0.197 < float(mutpred_score) and float(mutpred_score) <= 0.391,'BP4_supporting',
if(0.737 <= float(mutpred_score) and float(mutpred_score) < 0.829,'PP3_supporting',
if(0.829 <= float(mutpred_score) and float(mutpred_score) < 0.932,'PP3_moderate',
if(0.932 <= float(mutpred_score),'PP3_strong','')))))),'')
;
def ##veppath## = source/anno/vep_v107;
create #segs# = pgor ##veppath##/allvars.gord 
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group 10000 -sum -ic overlapcount | seghist 1084847;
create ##part_counts## = pgor -split <(gor [#segs#]) [##BP4_PP3_scores##] 
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count;

In [ ]: