Parallel Queries: PGOR¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "matthew_sampson_hg38"
%env GOR_API_PROJECT={project}
Parallel GORpipe queries¶
This notebook explains the basics of how to use parallel execution in GORdb. Parallel execution in important when working with very large datasets or when applying computationally heavy operations on the data. Importantly, GOR is inherently multi-threaded, meaning that regular commands may actually involve more than one thread, e.g. a special thread for decompressing files, another thread for command steps, another for nested queries, or for writing and compressing data. The commands we will discuss here are the macro commands for parallel execution. They are three:
- PGOR
- PARTGOR
- PARALLEL
This notebook will focus only on PGOR and the more advanced commands will be described separately. We will proceed with selected examples to explain the behaviour but more detailed specification can be found in the documentation pages.
Basic PGOR functionality¶
Consider counting the number of exons per chromosome. Note that the number of exons is less than 2 millions and this query runs a second or two so it hardly justifies using parallel commands to speed it up.
%%gor
gor #exons# | group chrom -count | granno genome -sum -ic allcount
Query ran in 1.16 sec Query fetched 25 rows in 2.86 sec (total time 4.02 sec)
chrom | bpStart | bpStop | allCount | sum_allCount | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 145831 | 1572313 |
1 | chr10 | 0 | 133797422 | 61586 | 1572313 |
2 | chr11 | 0 | 135086622 | 91380 | 1572313 |
3 | chr12 | 0 | 133275309 | 86588 | 1572313 |
4 | chr13 | 0 | 114364328 | 27565 | 1572313 |
5 | chr14 | 0 | 107043718 | 55289 | 1572313 |
6 | chr15 | 0 | 101991189 | 58444 | 1572313 |
7 | chr16 | 0 | 90338345 | 70107 | 1572313 |
8 | chr17 | 0 | 83257441 | 88751 | 1572313 |
9 | chr18 | 0 | 80373285 | 27803 | 1572313 |
10 | chr19 | 0 | 58617616 | 81557 | 1572313 |
11 | chr2 | 0 | 242193529 | 123667 | 1572313 |
12 | chr20 | 0 | 64444167 | 37696 | 1572313 |
13 | chr21 | 0 | 46709983 | 18791 | 1572313 |
14 | chr22 | 0 | 50818468 | 33549 | 1572313 |
15 | chr3 | 0 | 198295559 | 104300 | 1572313 |
16 | chr4 | 0 | 190214555 | 65051 | 1572313 |
17 | chr5 | 0 | 181538259 | 70514 | 1572313 |
18 | chr6 | 0 | 170805979 | 73980 | 1572313 |
19 | chr7 | 0 | 159345973 | 74905 | 1572313 |
20 | chr8 | 0 | 145138636 | 58508 | 1572313 |
21 | chr9 | 0 | 138394717 | 60518 | 1572313 |
22 | chrM | 0 | 16569 | 37 | 1572313 |
23 | chrX | 0 | 156040895 | 51549 | 1572313 |
24 | chrY | 0 | 57227415 | 4347 | 1572313 |
Now we perform the same using the macro commad PGOR:
%%gor
pgor #exons# | group chrom -count | granno genome -sum -ic allcount
Query ran in 0.36 sec Query fetched 25 rows in 0.02 sec (total time 0.38 sec)
chrom | bpStart | bpStop | allCount | sum_allCount | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 145831 | 145831 |
1 | chr10 | 0 | 133797422 | 61586 | 61586 |
2 | chr11 | 0 | 135086622 | 91380 | 91380 |
3 | chr12 | 0 | 133275309 | 86588 | 86588 |
4 | chr13 | 0 | 114364328 | 27565 | 27565 |
5 | chr14 | 0 | 107043718 | 55289 | 55289 |
6 | chr15 | 0 | 101991189 | 58444 | 58444 |
7 | chr16 | 0 | 90338345 | 70107 | 70107 |
8 | chr17 | 0 | 83257441 | 88751 | 88751 |
9 | chr18 | 0 | 80373285 | 27803 | 27803 |
10 | chr19 | 0 | 58617616 | 81557 | 81557 |
11 | chr2 | 0 | 242193529 | 123667 | 123667 |
12 | chr20 | 0 | 64444167 | 37696 | 37696 |
13 | chr21 | 0 | 46709983 | 18791 | 18791 |
14 | chr22 | 0 | 50818468 | 33549 | 33549 |
15 | chr3 | 0 | 198295559 | 104300 | 104300 |
16 | chr4 | 0 | 190214555 | 65051 | 65051 |
17 | chr5 | 0 | 181538259 | 70514 | 70514 |
18 | chr6 | 0 | 170805979 | 73980 | 73980 |
19 | chr7 | 0 | 159345973 | 74905 | 74905 |
20 | chr8 | 0 | 145138636 | 58508 | 58508 |
21 | chr9 | 0 | 138394717 | 60518 | 60518 |
22 | chrM | 0 | 16569 | 37 | 37 |
23 | chrX | 0 | 156040895 | 51549 | 51549 |
24 | chrY | 0 | 57227415 | 4347 | 4347 |
Previous parallel command is indeed a shorthand equivalent to the following query:
create #chr1# = gor -p chr1 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
create #chr10# = gor -p chr10 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
create #chr11# = gor -p chr11 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
...
create #chrY# = gor -p chrY <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
gor [#chr1#] [#chr10#] [#chr11#] ... [#chrY#]
Therefore, one or more gor-workers execute each CREATE statement and the result (a single row in this case) will be persited in cache. Also the data in scope for each GRANNO command is only the rows within each create statement. Because of the caching, re-run of a PGOR is always fast. Indeed, PGOR does always use a CREATE statement and the query can also be written the in following way:
%%gor
create #exon_counts# = pgor #exons# | group chrom -count | granno genome -sum -ic allcount;
gor [#exon_counts#] | top 3
Query ran in 0.19 sec Query fetched 3 rows in 0.02 sec (total time 0.22 sec)
chrom | bpStart | bpStop | allCount | sum_allCount | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 145831 | 145831 |
1 | chr10 | 0 | 133797422 | 61586 | 61586 |
2 | chr11 | 0 | 135086622 | 91380 | 91380 |
Fix the aggregation¶
In order to get correct exon counts across the genome we must aggregate after the parallel execution, a pattern that is commonly applied in GOR queries:
%%gor
create #exon_counts# = pgor #exons# | group chrom -count ;
gor [#exon_counts#] | granno genome -sum -ic allcount | top 3
Query ran in 1.08 sec Query fetched 3 rows in 0.20 sec (total time 1.28 sec)
chrom | bpStart | bpStop | allCount | sum_allCount | |
---|---|---|---|---|---|
0 | chr1 | 0 | 248956422 | 145831 | 1572313 |
1 | chr10 | 0 | 133797422 | 61586 | 1572313 |
2 | chr11 | 0 | 135086622 | 91380 | 1572313 |
Controlling the level of splitting¶
Before we simply used PGOR without any instructions of how to split the queries across the genome. By default, the execution engine will rely on a metadata chrom split file and partition the larger chromosomes into two segments, with the cut at the centromeres, and a single partition for the smaller ones, i.e. 38 partitions. However, if there are commands that perform grouping where the grouping range is larger than 1base, it uses only one per chromosome, i.e. 25 partitions.
The PGOR command does however have a -split
option that allows us to control how the splitting is performed. For instance, if there is large volume of data, we may want to use many more partitions.
We will now to our attention to the dbNSFP data set. It is a very large table with about 84M rows of variants and 643 columns. Our goal is to map calibrated Revel and MutPred scores to ACMG evidence level for PP3 and BP4. For the following examples, we will use the Python GOR query helper library.
qh = GQH.GOR_Query_Helper()
qh.add("""def ##ref## = ref;""")
qh.add("""def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol""")
mydefs = qh.defs()
print(mydefs)
def ##ref## = ref; def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz | rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;
Now we can use the string variable mydefs within the magic syntax like this:
%%gor
$mydefs
nor ##dbnsfp## | top 1 | unpivot 1- | granno -count | rename allcount Tot_Col_Count
| where lower(col_name) in ('chr','pos','ref','alt','transcript_stable_id','gene_symbol','revel_score','mutpred_score')
Query ran in 1.13 sec Query fetched 8 rows in 0.01 sec (total time 1.15 sec)
Col_Name | Col_Value | Tot_Col_Count | |
---|---|---|---|
0 | chr | chr1 | 643 |
1 | pos | 65565 | 643 |
2 | ref | A | 643 |
3 | alt | C | 643 |
4 | gene_symbol | OR4F5 | 643 |
5 | transcript_stable_id | ENST00000641515 | 643 |
6 | REVEL_score | . | 643 |
7 | MutPred_score | . | 643 |
We inspect the dbNSFP table and see that it has many of the columns with semicolon separated list, i.e. multiple transcrips are collapsed into lists per variant.
%%gor
$mydefs
gor ##ref##/ensgenes/ensgenes_transcripts.gor | segproj | where segcount > 1
| join -segsnp -ir ##dbnsfp##
| where contains(transcript_stable_id,';')
| top 1
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| unpivot 3-
Query ran in 1.07 sec Query fetched 6 rows in 2.12 sec (total time 3.19 sec)
chr | pos | Col_Name | Col_Value | |
---|---|---|---|---|
0 | chr1 | 925942 | ref | A |
1 | chr1 | 925942 | alt | C |
2 | chr1 | 925942 | transcript_stable_id | ENST00000420190;ENST00000437963;ENST0000034206... |
3 | chr1 | 925942 | gene_symbol | SAMD11;SAMD11;SAMD11;SAMD11;SAMD11;SAMD11;SAMD... |
4 | chr1 | 925942 | REVEL_score | 0.521;0.521;0.521;.;.;.;.;.;.;.;. |
5 | chr1 | 925942 | MutPred_score | 0.988 |
We will now split these lists into multiple rows and replace the score value where it is missing for the transcrips with the minimum score per variant.
%%gor
$mydefs
gor ##ref##/ensgenes/ensgenes_transcripts.gor | segproj | where segcount > 1
| join -segsnp -ir ##dbnsfp##
| where contains(transcript_stable_id,';')
| top 1
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
Query ran in 0.98 sec Query fetched 11 rows in 1.19 sec (total time 2.17 sec)
chr | pos | ref | alt | transcript_stable_id | gene_symbol | REVEL_score | MutPred_score | min_REVEL_score | min_MutPred_score | new_revel_score | new_mutpred_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 925942 | A | C | ENST00000420190 | SAMD11 | 0.521 | 0.988 | 0.521 | 0.988 | 0.521 | 0.988 |
1 | chr1 | 925942 | A | C | ENST00000437963 | SAMD11 | 0.521 | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
2 | chr1 | 925942 | A | C | ENST00000342066 | SAMD11 | 0.521 | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
3 | chr1 | 925942 | A | C | ENST00000618181 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
4 | chr1 | 925942 | A | C | ENST00000622503 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
5 | chr1 | 925942 | A | C | ENST00000618323 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
6 | chr1 | 925942 | A | C | ENST00000616016 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
7 | chr1 | 925942 | A | C | ENST00000618779 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
8 | chr1 | 925942 | A | C | ENST00000616125 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
9 | chr1 | 925942 | A | C | ENST00000620200 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
10 | chr1 | 925942 | A | C | ENST00000617307 | SAMD11 | . | NaN | 0.521 | 0.988 | 0.521 | 0.988 |
Specifying the number of split partitions¶
We are now ready to create our first transformation of dbNSFP
mydefs = qh.add("""create ##dbnsfp_small## = pgor -split 100 ##dbnsfp##
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
""")
print(mydefs)
def ##ref## = ref; def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz | rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol; create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## | select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score | split 5- -s ';' | granno 1 -gc ref,alt -min -fc revel_score,mutpred_score | calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score)) | calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score)) | select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score | replace new_* if(#rc='NaN','.',#rc) | rename new_(.*) #{1} ;
Notice that we use here 100 splits, because dbNSFP is a large table and we are performing significant calculations per row. With sufficient number of stand-by workers this should finish in about a minute. Also, worth noting is that with the CREATE statement, we are creating a temporary table which is significantly smaller than the original dbNSFP table, with its 643 columns.
%%gor
$mydefs
gor [##dbnsfp_small##] | where revel_score != '.' | top 2
Query ran in 0.74 sec Query fetched 2 rows in 0.15 sec (total time 0.90 sec)
chr | pos | ref | alt | transcript_stable_id | gene_symbol | revel_score | mutpred_score | |
---|---|---|---|---|---|---|---|---|
0 | chr1 | 69091 | A | C | ENST00000641515 | OR4F5 | 0.109 | 0.823 |
1 | chr1 | 69091 | A | C | ENST00000335137 | OR4F5 | 0.109 | 0.823 |
To understand better how the results in ##dbnsfp_small## are stored we can inspect the "dictionary" like this:
%%gor
$mydefs
nor -asdict [##dbnsfp_small##] | granno -count | top 5
Query ran in 0.58 sec Query fetched 5 rows in 0.02 sec (total time 0.60 sec)
col1 | col2 | col3 | col4 | col5 | col6 | allCount | |
---|---|---|---|---|---|---|---|
0 | ../../../../cache/result_cache/cache01/a64/a64... | 1 | chr7 | 120274633 | chr7 | 149879677 | 114 |
1 | ../../../../cache/result_cache/cache01/130/130... | 2 | chr12 | 61710374 | chr12 | 89655886 | 114 |
2 | ../../../../cache/result_cache/cache01/440/440... | 3 | chr6 | 30061525 | chr6 | 57646158 | 114 |
3 | ../../../../cache/result_cache/cache01/bc3/bc3... | 4 | chr2 | 94588522 | chr2 | 119977070 | 114 |
4 | ../../../../cache/result_cache/cache01/e9e/e9e... | 5 | chr9 | 14807 | chr9 | 27950671 | 114 |
We see reference to 114 cache files (slightly different than the target split of 100 because of uneven chromosome sizes) and each cache file has metadata telling the system the position of the first and last row in any given file. Now we are ready to map the scores to ACMG evidence.
mydefs = qh.add("""create ##BP4_PP3_scores## = pgor [##dbnsfp_small##]
| calc Revel_BP4_PP3 if(revel_score!='.',if(float(revel_score) <= 0.003,'BP4_verystrong',
if(0.003 < float(revel_score) and float(revel_score) <= 0.016,'BP4_strong',
if(0.016 < float(revel_score) and float(revel_score) <= 0.183,'BP4_moderate',
if(0.183 < float(revel_score) and float(revel_score) <= 0.290,'BP4_supporting',
if(0.644 <= float(revel_score) and float(revel_score) < 0.773,'PP3_supporting',
if(0.773 <= float(revel_score) and float(revel_score) < 0.932,'PP3_moderate',
if(0.932 <= float(revel_score),'PP3_strong',''))))))),'')
| calc Mutpred_BP4_PP3 if(mutpred_score!='.',if(float(mutpred_score) <= 0.010,'BP4_strong',
if(0.010 < float(mutpred_score) and float(mutpred_score) <= 0.197,'BP4_moderate',
if(0.197 < float(mutpred_score) and float(mutpred_score) <= 0.391,'BP4_supporting',
if(0.737 <= float(mutpred_score) and float(mutpred_score) < 0.829,'PP3_supporting',
if(0.829 <= float(mutpred_score) and float(mutpred_score) < 0.932,'PP3_moderate',
if(0.932 <= float(mutpred_score),'PP3_strong','')))))),'')
""")
%%gor
$mydefs
gor [##BP4_PP3_scores##] | where len(revel_bp4_pp3)>1 | top 4
Query ran in 0.73 sec Query fetched 4 rows in 0.06 sec (total time 0.79 sec)
chr | pos | ref | alt | transcript_stable_id | gene_symbol | revel_score | mutpred_score | Revel_BP4_PP3 | Mutpred_BP4_PP3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 69091 | A | C | ENST00000641515 | OR4F5 | 0.109 | 0.823 | BP4_moderate | PP3_supporting |
1 | chr1 | 69091 | A | C | ENST00000335137 | OR4F5 | 0.109 | 0.823 | BP4_moderate | PP3_supporting |
2 | chr1 | 69091 | A | G | ENST00000641515 | OR4F5 | 0.133 | 0.813 | BP4_moderate | PP3_supporting |
3 | chr1 | 69091 | A | G | ENST00000335137 | OR4F5 | 0.133 | 0.813 | BP4_moderate | PP3_supporting |
Notice, that in ##BP4_PP3_scores## we use only the default split, because the volume of data we are working with now is much smaller than in the original dbNSFP table.
Specifying the partition with segments¶
Here we want to show how we can calculate how the evidence levels are distributes with VEP consequence labels. Hence, we will join the project VEP table with ##BP4_PP3_scores##. This can be a heavy quer and we would like to make sure that the load per gor-workers is similar. To do that we will use the SEGHIST command to generate regions with similar number of rows. We start by counting the total number of rows in the VEP table by joining the distinct variants with the transcrips:
mydefs = qh.add("def ##veppath## = source/anno/vep_v107;")
%%gor rowcount <<
$mydefs
create #temp# = pgor ##veppath##/allvars.gord
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group chrom -sum -ic overlapcount;
nor [#temp#] | group -sum -ic sum_OverlapCount
Query ran in 2.32 sec Query fetched 1 rows in 5.91 sec (total time 8.22 sec)
rowCountPerSeg = int(rowcount.at[0,'sum_sum_OverlapCount']/200)
rowCountPerSeg
1084847
The number above is the targe count to achieve 200 partitions. With it in hand, we can apply SEGHIST that gives us close to this count per segment.
mydefs = qh.add(f"""create #segs# = pgor ##veppath##/allvars.gord
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group 10000 -sum -ic overlapcount | seghist {rowCountPerSeg}""")
%%gor
$mydefs
nor [#segs#] | granno -count | top 5
Query ran in 1.31 sec Query fetched 5 rows in 1.69 sec (total time 3.00 sec)
Chrom | bpStart | bpStop | Count | allCount | |
---|---|---|---|---|---|
0 | chr1 | 0 | 6088290 | 1084847 | 216 |
1 | chr1 | 6088290 | 16584106 | 1084847 | 216 |
2 | chr1 | 16584106 | 26792265 | 1084847 | 216 |
3 | chr1 | 26792265 | 39327763 | 1084847 | 216 |
4 | chr1 | 39327763 | 45336868 | 1084847 | 216 |
We see that we get approximately 200 segments that contain close to 216k rows in the VEP table. Now we will join the consequence of the transcripts in the VEP table with the evidence level in the ##BP4_PP3_scores## table. For demonstration purposes, we will only use the segments on chr22 to begin with.
mydefs = qh.add(f"""create ##part_counts## = pgor -split <(gor [#segs#] | where chrom = 'chr22') [##BP4_PP3_scores##]
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count""")
%%gor
$mydefs
gor [##part_counts##]
Query ran in 0.72 sec Query fetched 791 rows in 0.02 sec (total time 0.73 sec)
chr | bpStart | bpStop | Revel_BP4_PP3 | Mutpred_BP4_PP3 | Consequence | allCount | |
---|---|---|---|---|---|---|---|
0 | chr22 | 0 | 50818468 | NaN | NaN | 3_prime_UTR_variant | 70 |
1 | chr22 | 0 | 50818468 | NaN | NaN | NMD_transcript_variant | 88 |
2 | chr22 | 0 | 50818468 | NaN | NaN | coding_sequence_variant | 1 |
3 | chr22 | 0 | 50818468 | NaN | NaN | intron_variant | 94 |
4 | chr22 | 0 | 50818468 | NaN | NaN | missense_variant | 15360 |
... | ... | ... | ... | ... | ... | ... | ... |
786 | chr22 | 0 | 50818468 | PP3_supporting | PP3_strong | missense_variant | 57 |
787 | chr22 | 0 | 50818468 | PP3_supporting | PP3_supporting | 5_prime_UTR_variant | 2 |
788 | chr22 | 0 | 50818468 | PP3_supporting | PP3_supporting | missense_variant | 271 |
789 | chr22 | 0 | 50818468 | PP3_supporting | PP3_supporting | splice_region_variant | 13 |
790 | chr22 | 0 | 50818468 | PP3_supporting | PP3_supporting | start_lost | 3 |
791 rows × 7 columns
Now we override ##part_counts## and for the whole genome:
mydefs = qh.add(f"""create ##part_counts## = pgor -split <(gor [#segs#]) [##BP4_PP3_scores##]
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count""")
and run a summary showing how the evidence levels are covered by different VEP impacts:
%%gor
$mydefs
nor [##part_counts##]
| map -c consequence ##ref##/VEP_impact.map
| group -gc Revel_BP4_PP3,vep_impact -sum -ic allcount
| granno -gc Revel_BP4_PP3 -sum -ic sum_allcount
| calc f form(100*sum_allcount/sum_sum_allcount,2,1)+'%'
| rename sum_allcount count
| select revel_bp4_pp3,vep_impact,f,count
| pivot -gc revel_bp4_pp3 vep_impact -v 'HIGH','MODERATE','LOW','LOWEST' -e '0%'
| calc order decode(#1,'BP4_verystrong,1,BP4_strong,2,BP4_moderate,3,BP4_supporting,4,PP3_supporting,5,PP3_moderate,6,PP3_strong,7,0')
| sort -c order:r | hide order
Query ran in 3.91 sec. Query fetched 8 rows in 6.14 sec (total time 10.05 sec)
Revel_BP4_PP3 | HIGH_f | HIGH_count | MODERATE_f | MODERATE_count | LOW_f | LOW_count | LOWEST_f | LOWEST_count | |
---|---|---|---|---|---|---|---|---|---|
0 | PP3_strong | 0.1% | 282 | 94.4% | 206552 | 5.3% | 11704 | 0.1% | 246 |
1 | PP3_moderate | 0.3% | 2190 | 94.6% | 685780 | 5.1% | 36741 | 0.1% | 428 |
2 | PP3_supporting | 0.5% | 3315 | 94.8% | 664206 | 4.7% | 33059 | 0.1% | 371 |
3 | BP4_supporting | 0.6% | 13501 | 95.3% | 2061078 | 4.1% | 87969 | 0.0% | 966 |
4 | BP4_moderate | 0.2% | 12682 | 96.3% | 5934168 | 3.5% | 215552 | 0.0% | 2658 |
5 | BP4_strong | 0.1% | 320 | 97.2% | 349011 | 2.6% | 9515 | 0.1% | 341 |
6 | BP4_verystrong | 0.1% | 10 | 97.6% | 18445 | 2.3% | 435 | 0.1% | 15 |
7 | NaN | 17.1% | 822918 | 78.7% | 3776064 | 4.1% | 197932 | 0.1% | 3973 |
Because we aggregated the counts for -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence, we can use the same CREATE statement ##part_counts## to summarize for Mutpred a second:
%%gor
$mydefs
nor [##part_counts##]
| map -c consequence ##ref##/VEP_impact.map
| group -gc Mutpred_BP4_PP3,vep_impact -sum -ic allcount
| granno -gc Mutpred_BP4_PP3 -sum -ic sum_allcount
| calc f form(100*sum_allcount/sum_sum_allcount,2,1)+'%'
| rename sum_allcount count
| select Mutpred_BP4_PP3,vep_impact,f,count
| pivot -gc Mutpred_BP4_PP3 vep_impact -v 'HIGH','MODERATE','LOW','LOWEST' -e '0%'
| calc order decode(#1,'BP4_verystrong,1,BP4_strong,2,BP4_moderate,3,BP4_supporting,4,PP3_supporting,5,PP3_moderate,6,PP3_strong,7,0')
| sort -c order:r | hide order
Query ran in 1.01 sec Query fetched 6 rows in 0.20 sec (total time 1.21 sec)
Mutpred_BP4_PP3 | HIGH_f | HIGH_count | MODERATE_f | MODERATE_count | LOW_f | LOW_count | LOWEST_f | LOWEST_count | |
---|---|---|---|---|---|---|---|---|---|
0 | PP3_strong | 17.1% | 22489 | 77.9% | 102720 | 4.9% | 6427 | 0.2% | 200 |
1 | PP3_moderate | 1.1% | 5085 | 94.3% | 427941 | 4.5% | 20637 | 0.1% | 303 |
2 | PP3_supporting | 0.5% | 2778 | 95.2% | 577747 | 4.3% | 25966 | 0.0% | 227 |
3 | BP4_supporting | 0.1% | 4879 | 96.0% | 3254904 | 3.8% | 129346 | 0.1% | 1932 |
4 | BP4_moderate | 0.1% | 1853 | 96.4% | 1196030 | 3.3% | 41469 | 0.1% | 1122 |
5 | NaN | 8.8% | 818134 | 87.2% | 8135962 | 4.0% | 369062 | 0.1% | 5214 |
Finally, if we want to test these queries in the Sequence Miner, we can get all the complete definitions like this:
print(qh.defs())
def ##ref## = ref; def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz | rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol; create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## | select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score | split 5- -s ';' | granno 1 -gc ref,alt -min -fc revel_score,mutpred_score | calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score)) | calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score)) | select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score | replace new_* if(#rc='NaN','.',#rc) | rename new_(.*) #{1} ; create ##BP4_PP3_scores## = pgor [##dbnsfp_small##] | calc Revel_BP4_PP3 if(revel_score!='.',if(float(revel_score) <= 0.003,'BP4_verystrong', if(0.003 < float(revel_score) and float(revel_score) <= 0.016,'BP4_strong', if(0.016 < float(revel_score) and float(revel_score) <= 0.183,'BP4_moderate', if(0.183 < float(revel_score) and float(revel_score) <= 0.290,'BP4_supporting', if(0.644 <= float(revel_score) and float(revel_score) < 0.773,'PP3_supporting', if(0.773 <= float(revel_score) and float(revel_score) < 0.932,'PP3_moderate', if(0.932 <= float(revel_score),'PP3_strong',''))))))),'') | calc Mutpred_BP4_PP3 if(mutpred_score!='.',if(float(mutpred_score) <= 0.010,'BP4_strong', if(0.010 < float(mutpred_score) and float(mutpred_score) <= 0.197,'BP4_moderate', if(0.197 < float(mutpred_score) and float(mutpred_score) <= 0.391,'BP4_supporting', if(0.737 <= float(mutpred_score) and float(mutpred_score) < 0.829,'PP3_supporting', if(0.829 <= float(mutpred_score) and float(mutpred_score) < 0.932,'PP3_moderate', if(0.932 <= float(mutpred_score),'PP3_strong','')))))),'') ; def ##veppath## = source/anno/vep_v107; create #segs# = pgor ##veppath##/allvars.gord | join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor | replace overlapcount if(overlapcount = 0,1,overlapcount) | group 10000 -sum -ic overlapcount | seghist 1084847; create ##part_counts## = pgor -split <(gor [#segs#]) [##BP4_PP3_scores##] | varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord | group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count;