Parallel Queries: PARALLEL¶

ParallelQueries_PARALLEL

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}

Parallel GORpipe queries with PARALLEL¶

The PARALLEL macro command is similar to PGOR and PARTGOR, however, unlike them it can be used both for GOR and NOR command and provides flexibility to make parallel queries based on any parameters provided in a relation/table. First we will start with a simple example where we use PARALLEL instead of PGOR.

In [7]:

qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""create #segs# = gor #genes# | group chrom | segspan -maxseg 100000000;""")

In [10]:

%%gor
$mydefs
nor [#segs#] | granno -count | top 10

Query ran in 0.09 sec
Query fetched 10 rows in 0.01 sec (total time 0.11 sec)

Out[10]:

	Chrom	bpStart	bpStop	segCount	allCount
0	chr1	0	62239105	1	45
1	chr1	62239105	124478211	1	45
2	chr1	124478211	186717316	1	45
3	chr1	186717316	248956422	1	45
4	chr10	0	66898711	1	45
5	chr10	66898711	133797422	1	45
6	chr11	0	67543311	1	45
7	chr11	67543311	135086622	1	45
8	chr12	0	66637654	1	45
9	chr12	66637654	133275309	1	45

We see that we get 45 zero-based segments, each no more than 100Mbases in size. Lets now calculate the numbers of variants in dbSNP.

In [24]:

mydefs = qh.add("""create #pcounts# = parallel -parts <(nor [#segs#] | replace bpStart bpStart+1)
<(gor -p #{col:Chrom}:#{col:bpStart}-#{col:bpStop} #dbsnp# | group chrom -count);""")

In [18]:

%%gor
$mydefs
gor [#pcounts#] | top 10

Query ran in 0.79 sec
Query fetched 10 rows in 0.03 sec (total time 0.82 sec)

Out[18]:

	chrom	bpStop	allCount
0	chr1	248956422	25553008
1	chr1	248956422	18569734
2	chr1	248956422	27038573
3	chr1	248956422	26007566
4	chr10	133797422	27554851
5	chr10	133797422	29196823
6	chr11	135086622	28411265
7	chr11	135086622	29833483
8	chr12	133275309	27487946
9	chr12	133275309	28683770

We see that we get multiple counts just like if we had done a PGOR. Unlike for PGOR, the temp table dictionary does not have chromasomal range partition metadata:

In [19]:

%%gor
$mydefs
nor -asdict [#pcounts#] | top 10

Query ran in 0.78 sec
Query fetched 10 rows in 0.01 sec (total time 0.79 sec)

Out[19]:

	col1	col2
0	../../../../cache/result_cache/cache01/ae1/ae1...	45
1	../../../../cache/result_cache/cache01/4a4/4a4...	44
2	../../../../cache/result_cache/cache01/709/709...	43
3	../../../../cache/result_cache/cache01/bab/bab...	42
4	../../../../cache/result_cache/cache01/4b6/4b6...	41
5	../../../../cache/result_cache/cache01/11a/11a...	40
6	../../../../cache/result_cache/cache01/d22/d22...	39
7	../../../../cache/result_cache/cache01/ede/ede...	38
8	../../../../cache/result_cache/cache01/21b/21b...	37
9	../../../../cache/result_cache/cache01/6ec/6ec...	36

This is because PARALLEL is designed for both GOR and NOR execution and as is stands does always return same types of metadata (as it stands now). We can therefore modify the query and have it return non-gor data:

In [38]:

mydefs = qh.add("""create #pcounts# = parallel -parts <(nor [#segs#] | replace bpStart bpStart+1)
<(nor <(gor -p #{col:Chrom}:#{col:bpStart}-#{col:bpStop} #dbsnp# | group chrom -count) 
| calc range '#{col:Chrom}:#{col:bpStart}-#{col:bpStop}' | select range,allcount) ;""")

In [40]:

%%gor
$mydefs
nor [#pcounts#] | sort -c range | top 10

Query ran in 0.87 sec
Query fetched 10 rows in 0.22 sec (total time 1.09 sec)

Out[40]:

	range	allCount
0	chr10:1-66898711	29196823
1	chr10:66898712-133797422	27554851
2	chr11:1-67543311	29833483
3	chr11:67543312-135086622	28411265
4	chr12:1-66637654	28683770
5	chr12:66637655-133275309	27487946
6	chr13:1-57182164	18190940
7	chr13:57182165-114364328	24140223
8	chr14:1-53521859	15182593
9	chr14:53521860-107043718	22303823

We can indeed get similar results using PGOR with split option:

In [41]:

mydefs = qh.add("""create #pgor_counts# = pgor -split <(gor [#segs#]) #dbsnp# | top 10
| calc range '#{CHROM}:#{BPSTART}-#{BPSTOP}'
| group chrom -gc range -count
;""")

In [43]:

%%gor
$mydefs
gor [#pgor_counts#] | top 10

Query ran in 1.66 sec
Query fetched 10 rows in 0.03 sec (total time 1.69 sec)

Out[43]:

	chrom	bpStop	range	allCount
0	chr1	248956422	chr1:124478212-186717316	10
1	chr1	248956422	chr1:186717317-248956422	10
2	chr1	248956422	chr1:1-62239105	10
3	chr1	248956422	chr1:62239106-124478211	10
4	chr10	133797422	chr10:1-66898711	10
5	chr10	133797422	chr10:66898712-133797422	10
6	chr11	135086622	chr11:67543312-135086622	10
7	chr11	135086622	chr11:1-67543311	10
8	chr12	133275309	chr12:1-66637654	10
9	chr12	133275309	chr12:66637655-133275309	10

In [45]:

%%gor
$mydefs
nor -asdict [#pgor_counts#] | granno -count | top 10

Query ran in 0.49 sec
Query fetched 10 rows in 0.01 sec (total time 0.50 sec)

Out[45]:

	col1	col2	col3	col5	allCount
0	../../../../cache/result_cache/cache01/9f1/9f1...	1	chr6	chr6	45
1	../../../../cache/result_cache/cache01/4a7/4a7...	2	chr2	chr2	45
2	../../../../cache/result_cache/cache01/43a/43a...	3	chr9	chr9	45
3	../../../../cache/result_cache/cache01/8e8/8e8...	4	chr5	chr5	45
4	../../../../cache/result_cache/cache01/c92/c92...	5	chr3	chr3	45
5	../../../../cache/result_cache/cache01/79f/79f...	6	chr11	chr11	45
6	../../../../cache/result_cache/cache01/b76/b76...	7	chrM	chrM	45
7	../../../../cache/result_cache/cache01/0a9/0a9...	8	chr14	chr14	45
8	../../../../cache/result_cache/cache01/657/657...	9	chr11	chr11	45
9	../../../../cache/result_cache/cache01/0d1/0d1...	10	chr3	chr3	45

Note: PGOR and PARTGOR generate GOR dictionaries that can be used in downstream GOR queries with partitioned seek, whereas PARALLEL outputs dictionary without range metadata. Therefore, we typically recommend the use of them for GOR based queries while PARALLEL may be used with NORD type phenotype dictionaries and NOR style queries.

NOR type parallelization¶

Here we look at an example where we want to calculate column statistics of a table. First we select an arbitrary table with a mix of numeric and string columns.

In [56]:

qh2 = GQH.GOR_Query_Helper()
mydefs2 = qh2.add_many("""
def #maxrows# = 1000000;
create #table# = nor ref/ensgenes/ensexons.gorz | calc f1 random() | calc f2 random() | calc f3 sqr(f1/(1+sqr(f2)));
""")

Firs we will run a generic query that is relatively slow but memory efficient to discover the column types:

In [59]:

mydefs2 = qh2.add("""
create #coltypes# = nor [#table#] | calc dummys 's' | calc dummyi 1
| unpivot 1-
| where col_value != 'NA'
| calc type if(isint(col_value),'I',if(isfloat(col_value),'F','S'))
| group -gc col_name,type
| group -gc col_name -set -sc type;
""")

In [60]:

%%gor
$mydefs2
nor [#coltypes#]

Query ran in 91.93 sec
Query fetched 15 rows in 0.01 sec (total time 91.94 sec)

Out[60]:

	Col_Name	set_type
0	chrom	S
1	chromend	I
2	chromstart	I
3	dummyi	I
4	dummys	S
5	exon	S
6	f1	F
7	f2	F
8	f3	F
9	gene_biotype	S
10	gene_stable_id	S
11	gene_symbol	S
12	strand	S
13	transcript_biotype	S
14	transcript_stable_id	S

Ignore the dummy columns that are created for now - we will explain them later. Now we add definitions to calculate more complex statistics, both for numeric and string columns. Because this analysis uses the RANK command, we will run this in parallel for a subset of columns at the time, because it will require significant memory and compute.

In [77]:

mydefs2 = qh2.add_many("""
def #parallel_parts# = 4;

create #fcolparts# = nor [#coltypes#] | where set_type in ('F','I','F,I')  | select col_name 
| rownum | calc g mod(rownum,#parallel_parts#) | group -gc g -lis -sc col_name;

create #scolparts# = nor [#coltypes#] | where not(set_type in ('F','I','F,I'))  | select col_name
| rownum | calc g mod(rownum,#parallel_parts#) | group -gc g -lis -sc col_name;

create #numstat# = parallel -parts [#fcolparts#] <(nor [#table#]
| unpivot 1-
| where col_value != 'NA'
| inset -c col_name <(nor [#coltypes#] | where listhasany('#{col:lis_col_name}',col_name))
| rank col_value -o asc -gc col_name -q 
| calc v10 if(lowOReqRank <= 0.1, col_value, -1e100)
| calc v25 if(lowOReqRank <= 0.25, col_value, -1e100)
| calc v50 if(lowOReqRank <= 0.5, col_value, -1e100)
| calc v75 if(lowOReqRank <= 0.75, col_value, -1e100)
| calc v90 if(lowOReqRank <= 0.90, col_value, -1e100)
| group -gc col_name -count -min -max -avg -std -fc col_value,v10-
| rename allcount nonNA
| select col_name-std_col_value,max_v10,max_v25,max_v50,max_v75,max_v90
| rename max_(v.*) #{1}
);
create #stringstat# = parallel -parts [#scolparts#] <(nor [#table#]
| unpivot 1-
| inset -c col_name <(nor [#coltypes#] | where listhasany('#{col:lis_col_name}',col_name))
| group -gc col_name,col_value -count
| sort -c col_name,allcount:nr
| rename allcount value_count
| group -gc col_name -min -max -len 1000000 -lis -sc col_value,value_count
| calc distValues listsize(lis_value_count)
| calc histo str(distValues)+':'+listmap(listfilter(listzip(lis_col_value,lis_value_count),'i<=10'),'brackets(x)')+if(listsize(lis_value_count)>10,',...','')
| hide lis_*
);
""")

In #fcolparts# and #scolparts# we make #parallel_parts# list of numeric and string columns, respectively. We group the column names together into lis_col_name and use a relation with this column to generate parallel partitions for #numstat# and #stringstat#. We ensure that the parallel commands don't get an emtpy relation with no row by using the dummy columns mentioned above (because corrently it does not tolerate it). Notice that each parallel query will read all of #table#, however, we unpivot and filter with listhasany('#{col:lis_col_name}',col_name)) to limit the scope of data that goes into RANK and GROUP -gc col_name,col_value, both of which may require significant memory. The PARALLEL macro command replaces #{col:lis_col_name} with the appropriate value for the partition at hand. We generate ranges for the numeric distribution in v10-v90 and a histogram with the most common strings.

We generate a description for all the columns in #table# by:

In [86]:

%%gor
$mydefs2

nor [#coltypes#] | map -c col_name [#numstat#]
| merge <(nor [#coltypes#] | map -c col_name [#stringstat#])
| where not(col_name in ('dummyi','dummys'))

Query ran in 0.40 sec
Query fetched 13 rows in 0.06 sec (total time 0.46 sec)

Out[86]:

	Col_Name	set_type	nonNA	min_Col_Value	max_Col_Value	avg_Col_Value	std_Col_Value	v10	v25	v50	v75	v90	min_value_count	max_value_count	distValues	histo
0	chrom	S	NaN	chr1	chrY	NaN	NaN	NaN	NaN	NaN	NaN	NaN	104300.0	91380.0	25.0	25:(chr1;145831),(chr2;123667),(chr3;104300),(...
1	chromend	I	1572313.0	647.0	2.48937043E8	7.604323e+07	5.600179e+07	1.130701e+07	3.260182e+07	6.383704e+07	1.117808e+08	1.553267e+08	NaN	NaN	NaN	NaN
2	chromstart	I	1572313.0	576.0	2.4893658E8	7.604297e+07	5.600179e+07	1.130672e+07	3.260151e+07	6.383714e+07	1.117807e+08	1.553259e+08	NaN	NaN	NaN	NaN
3	exon	S	NaN	ENSE00000000003	ENSE00003970673	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	98.0	491801.0	491801:(ENSE00001639513;201),(ENSE00001666408;...
4	f1	F	1572313.0	1.9561712161220157E-7	0.9999998997299483	5.001228e-01	2.887756e-01	9.975428e-02	2.499995e-01	5.002753e-01	7.503033e-01	9.001638e-01	NaN	NaN	NaN	NaN
5	f2	F	1572313.0	6.00124052674289E-7	0.9999996842541631	5.002463e-01	2.885996e-01	1.002356e-01	2.504988e-01	5.005184e-01	7.503596e-01	9.000775e-01	NaN	NaN	NaN	NaN
6	f3	F	1572313.0	2.9139279845133705E-14	0.9998610928646812	2.142417e-01	2.213703e-01	5.607365e-03	3.527133e-02	1.407709e-01	3.238778e-01	5.541057e-01	NaN	NaN	NaN	NaN
7	gene_biotype	S	NaN	IG_C_gene	vault_RNA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	99.0	40.0	40:(protein_coding;1322775),(lncRNA;188620),(t...
8	gene_stable_id	S	NaN	ENSG00000000003	ENSG00000289719	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	997.0	61498.0	61498:(ENSG00000145362;3796),(ENSG00000224078;...
9	gene_symbol	S	NaN	5S_rRNA	snoZ196	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	997.0	39850.0	39850:(na;89965),(ANK2;3796),(SNHG14;2915),(AB...
10	strand	S	NaN	+	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	769432.0	802881.0	2.0	2:(+;802881),(-;769432)
11	transcript_biotype	S	NaN	IG_C_gene	vault_RNA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	99.0	44.0	44:(protein_coding;823843),(nonsense_mediated_...
12	transcript_stable_id	S	NaN	ENST00000000233	ENST00000697417	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	98.0	246450.0	246450:(ENST00000589042;363),(ENST00000591111;...

We can display the histogram with wider columns like this:

In [87]:

%%gor dfHisto <<
$mydefs2
nor [#stringstat#] | select col_name,histo

Query ran in 0.57 sec
Query fetched 8 rows in 0.03 sec (total time 0.60 sec)

In [89]:

for i in range(0,len(dfHisto)):
    print(f"{i+1}\t{dfHisto.at[i,'Col_Name']}\t{dfHisto.at[i,'histo']}")
    print("----------------")

1	exon	491801:(ENSE00001639513;201),(ENSE00001666408;199),(ENSE00001912145;151),(ENSE00001352600;143),(ENSE00001926576;137),(ENSE00001854584;131),(ENSE00001723087;122),(ENSE00001697882;118),(ENSE00001836596;117),(ENSE00001859880;113),...
----------------
2	strand	2:(+;802881),(-;769432)
----------------
3	gene_symbol	39850:(na;89965),(ANK2;3796),(SNHG14;2915),(ABCC8;2691),(SNX14;2419),(DYNC1H1;2357),(IFT122;2338),(KCNMA1;2138),(MAPK10;2044),(PLCB4;1918),...
----------------
4	chrom	25:(chr1;145831),(chr2;123667),(chr3;104300),(chr11;91380),(chr17;88751),(chr12;86588),(chr19;81557),(chr7;74905),(chr6;73980),(chr5;70514),...
----------------
5	gene_stable_id	61498:(ENSG00000145362;3796),(ENSG00000224078;2915),(ENSG00000006071;2691),(ENSG00000135317;2419),(ENSG00000197102;2357),(ENSG00000163913;2338),(ENSG00000156113;2138),(ENSG00000109339;2044),(ENSG00000101333;1918),(ENSG00000155657;1700),...
----------------
6	transcript_stable_id	246450:(ENST00000589042;363),(ENST00000591111;313),(ENST00000342992;312),(ENST00000342175;191),(ENST00000359218;191),(ENST00000460472;191),(ENST00000397345;182),(ENST00000427231;182),(ENST00000454784;173),(ENST00000409198;150),...
----------------
7	gene_biotype	40:(protein_coding;1322775),(lncRNA;188620),(transcribed_unprocessed_pseudogene;22141),(processed_pseudogene;11768),(unprocessed_pseudogene;7847),(transcribed_unitary_pseudogene;5151),(transcribed_processed_pseudogene;2886),(misc_RNA;2211),(snRNA;1901),(miRNA;1877),...
----------------
8	transcript_biotype	44:(protein_coding;823843),(nonsense_mediated_decay;212755),(lncRNA;186275),(retained_intron;168473),(processed_transcript;142138),(processed_pseudogene;11768),(unprocessed_pseudogene;7808),(transcribed_unprocessed_pseudogene;5855),(misc_RNA;2211),(snRNA;1901),...
----------------

For more examples of the use of the PARALLEL macro command, see the notebook HPO_Phewas.

In [ ]: