Extract Variants¶

Extract_Variants

Demonstrate the basic commands to extract variants from a genotype freeze¶

First we include boiler plate code to define the project, set environment variables and import libraries.

In [1]:

%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1

project = "janssen_sle"
%env GOR_API_PROJECT={project}

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.

In [2]:

gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
"""

First we define which samples (PNs) and which variants we want to export the genotypes from. We use CREATE statements to create what we refer to as virtual relations or temporary tables. We pick the first 10 samples in the freeze bucket relation/table/file and 100 variants from the BRCA2 gene. The bucket relation has two columns (PN,bucket). The row number and the bucket value are used to determine the position of the sample within the value column in the genotype freeze rows. We use filtering options to extract only rows from buckets that contain genotypes from our samples of interest that are stored in #pns#. Because the value column in the horizontal bucketized genotype table stores genotypes from multiple PNs, we first trim it down by replacing it with its length.

In [3]:

%%gor
$gordefs

create #pns# = nor #buckets# | select pn | top 10;
create ##input_vars## = gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp -ir #af# | select 1-4 | top 100;

gor [##input_vars##]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| replace values len(values) | rename values len_values
| top 5

Query ran in 0.32 sec
Query fetched 5 rows in 0.34 sec (total time 0.66 sec)

Out[3]:

	CHROM	POS	REF	alt	bucket	len_values	Source
0	chr13	32315226	G	A	bucket_1	1946	bucket_1
1	chr13	32316435	G	A	bucket_1	1946	bucket_1
2	chr13	32316446	A	C	bucket_1	1946	bucket_1
3	chr13	32316447	T	C	bucket_1	1946	bucket_1
4	chr13	32316494	T	G	bucket_1	1946	bucket_1

We will move the create statement we setup into a Python string because we will be using them in multiple samples below.

In [4]:

gorcreates = """
create #pns# = nor #buckets# | select pn | top 10;
create ##input_vars## = gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp -ir #af# | select 1-4 | top 100;
"""

Horizontal GOR-style output¶

First we show how we can extract a table that represents the variants, for only the samples of interest, in a horizontal format. Both as fixed value-size format (1 char per genotype) and then we use the "fixed size value map" function to transform values into a comma separated list with VCF-style genotypes. The CSVSEL command takes a stream of variant value rows and uses the bucket relation and samples to generate a single value row per variant (because we use -gc #3,#4).

In [5]:

%%gor
$gordefs
$gorcreates

gor [##input_vars##]
| varjoin  -ir <(gor #hgt# -nf -ff [#pns#])
| csvsel -vs 1 -gc #3,#4 #buckets# [#pns#]
| calc lis_values fsvmap(values,1,'decode(x,"0,0/0,1,0/1,2,1/1,./.")',',')
| replace values '_'+values

Query ran in 0.30 sec
Query fetched 100 rows in 0.18 sec (total time 0.48 sec)

Out[5]:

	CHROM	POS	REF	alt	values	lis_values
0	chr13	32315226	G	A	_0110002100	0/0,0/1,0/1,0/0,0/0,0/0,1/1,0/1,0/0,0/0
1	chr13	32316435	G	A	_0100000010	0/0,0/1,0/0,0/0,0/0,0/0,0/0,0/0,0/1,0/0
2	chr13	32316446	A	C	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
3	chr13	32316447	T	C	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
4	chr13	32316494	T	G	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
...	...	...	...	...	...	...
95	chr13	32337431	A	T	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
96	chr13	32337458	G	T	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
97	chr13	32337512	T	C	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
98	chr13	32337573	A	G	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
99	chr13	32337580	T	C	_0000000000	0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0

100 rows × 6 columns

Multi-columnar VCF-style output¶

The CSVSEL command has a "-vcf" option to format the genotypes according to the VCF specification.

In [6]:

%%gor
$gordefs
$gorcreates

gor [##input_vars##]
| varjoin  -ir <(gor #hgt# -nf -ff [#pns#])
| csvsel -vs 1 -gc #3,#4 #buckets# <(nor [#pns#] | top 5) -vcf
| top 5

Query ran in 0.30 sec
Query fetched 5 rows in 0.15 sec (total time 0.45 sec)

Out[6]:

	CHROM	POS	REF	alt	QUAL	FILTER	INFO	FORMAT	GD515262501	GD515227401	GD515226101	GD515228601	GD515247301
0	chr13	32315226	G	A	.	.	.	GT	0/0	0/1	0/1	0/0	0/0
1	chr13	32316435	G	A	.	.	.	GT	0/0	0/1	0/0	0/0	0/0
2	chr13	32316446	A	C	.	.	.	GT	0/0	0/0	0/0	0/0	0/0
3	chr13	32316447	T	C	.	.	.	GT	0/0	0/0	0/0	0/0	0/0
4	chr13	32316494	T	G	.	.	.	GT	0/0	0/0	0/0	0/0	0/0

Notice that we use a nested query for the PN selection, to prevent that we get too many columns in the output. If we end the query with a `WRITE` command, we can easily write the output to a file stored in user_data, even though it contains hundred thousand rows. Keep in mind however, that displaying such columns is difficult and the SequenceMiner will not handle them well if there are more than few hundred columns.

Row-level output¶

By using the "-tag PN" option on the CSVSEL command, rather than getting a single row with a value column representing all the samples, we get one row per sample for each variant. Optionally, the "-hide" option can be used to suppress unwanted values, such as homozygous reference.

In [7]:

%%gor
$gordefs
$gorcreates

gor [##input_vars##]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| csvsel -vs 1 -gc #3,#4 -tag PN /* -hide '0' */ #buckets# [#pns#]
| rename value GT | top 5
                

Query ran in 0.28 sec
Query fetched 5 rows in 0.16 sec (total time 0.44 sec)

Out[7]:

	CHROM	POS	REF	alt	PN	GT
0	chr13	32315226	G	A	GD515262501	0
1	chr13	32315226	G	A	GD515227401	1
2	chr13	32315226	G	A	GD515226101	1
3	chr13	32315226	G	A	GD515228601	0
4	chr13	32315226	G	A	GD515247301	0

Transposed output¶

In some cases it may be desired to get the genotypes in a format where all the variants per sample are stored in a single row. This can be achieved with the GTTRANSPOSEcommand. Notice here that we the nested GOR expression as an input to the NOR command in order to apply SELECT to eliminate the redundant (chrom,pos) command that come from the transpose GOR command. As an "additiona stunt", we filter the samples such that we only see samples with two or more genotype columns with non-zero value.

In [8]:

%%gor
$gordefs
$gorcreates

create #myvars# = gor [##input_vars##] | top 8;

nor <(gor [#myvars#]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| gttranspose #buckets# [#pns#] [#myvars#] -vs 1 -cols)
| select pn-
| where listsize(listfilter(cols2list('#2-'),'x!="0"'))>1               
| top 5

Query ran in 0.33 sec
Query fetched 5 rows in 0.17 sec (total time 0.51 sec)

Out[8]:

	PN	chr13_32315226_G_A	chr13_32316435_G_A	chr13_32319070_T_A	chr13_32319090_A_C
0	GD515227401	1	1	0	2
1	GD515226101	1	0	0	2
2	GD515247301	0	0	3	2
3	GD515201701	2	0	0	2
4	GD515214801	1	0	0	2

Parallel execution¶

As a final example, we show an example where we extract genotypes for all carriers of high-impact variants in the top 10 genes with the highest number of high impact variants.

#genecount# annotates the genes with overlap count (see "-ic" option) of high-impact variants int the gene. We join gene segments and gene_symbols with variants in #vep#. We use PGOR because the #vep# table may have very many rows.
#topgenes# ranks the genes and filters them based on high overlap count.
#myvars# uses "-ir" join to extract the high impact variants from the top ranking genes and eliminates potential duplicates because of overlaping genes.

In [9]:

%%gor
$gordefs

create #genecount# = pgor #genes# | join -segsnp -xl gene_symbol -xr gene_symbol -ic <(gor #vep# | where max_impact = 'HIGH');
create #topgenes# = gor [#genecount#] | rank genome overlapcount -o desc | where rank_overlapcount <= 10;
create #selvars# = gor [#topgenes#] | join -segsnp -xl gene_symbol -xr gene_symbol -ir #vep# | where max_impact = 'HIGH' | select 1-4 | distinct;
gor [#selvars#]

Query ran in 0.22 sec
Query fetched 2,614 rows in 0.01 sec (total time 0.23 sec)

Out[9]:

	CHROM	POS	Reference	Call
0	chr11	47331871	T	C
1	chr11	47331882	C	T
2	chr11	47332070	A	G
3	chr11	47332071	C	T
4	chr11	47332075	G	A
...	...	...	...	...
2609	chrX	32844848	C	A
2610	chrX	32849781	G	A
2611	chrX	33020202	T	A
2612	chrX	33211281	C	A
2613	chrX	33211304	C	T

2614 rows × 4 columns

For example purpose, we use an advanced "-split" option in PGOR where we provide it segments to split up the genome for parallel execution. We defined the segments based on the density of the variants that we want to export, i.e. the variants in #selvars#, and the fact that we are extraction the genotypes for all the samples in the freeze (possibly >> 10k). With the SEGHIST command, we can ensure that each parallel task extracts no more than #numVarsPerTask# = 1000 in our example. One can inspect the #segs# table to see the actual number. Keep in mind that we don't want to have the task too few but also not too many, e.g. much over 500 partition tasks per query. We will fetch the data into a the Python variable pythongenotypes.

In [10]:

%%gor pythongenotypes <<
$gordefs

create #genecount# = pgor #genes# | join -segsnp -xl gene_symbol -xr gene_symbol -ic <(gor #vep# | where max_impact = 'HIGH');
create #topgenes# = gor [#genecount#] | rank genome overlapcount -o desc | where rank_overlapcount <= 10;
create #selvars# = gor [#topgenes#] | join -segsnp -xl gene_symbol -xr gene_symbol -ir #vep# | where max_impact = 'HIGH' | select 1-4 | distinct;

def #numVarsPerTask# = 1000;
create #segs# = gor [#selvars#] | group 100000 -count | seghist #numVarsPerTask#;

create #varoutput# = pgor -split <(gor [#segs#]) [#selvars#]
| varjoin  -ir <(gor #hgt#)
| csvsel -vs 1 -gc #3,#4 #buckets# <(nor #buckets# | select pn) -tag PN -hide '0,3'
| rename value GT                 
;
                  
gor [#varoutput#]  | varjoin -r <(gor #vep# | where max_impact = 'HIGH' | select 1-4,gene_symbol,max_consequence,cds_position,amino_acids)

Query ran in 0.32 sec
Query fetched 128,414 rows in 3.26 sec (total time 3.58 sec)

In [11]:

pythongenotypes[0:10]

Out[11]:

	CHROM	POS	REF	alt	PN	GT	Gene_Symbol	max_consequence	CDS_position	Amino_Acids
0	chr11	47333192	A	C	GD514033801	1	MYBPC3	splice_donor_variant	NaN	NaN
1	chr11	47333192	A	C	GD514545901	1	MYBPC3	splice_donor_variant	NaN	NaN
2	chr11	47333193	C	A	GD515070601	1	MYBPC3	splice_donor_variant	NaN	NaN
3	chr11	47333238	C	A	GD514873301	1	MYBPC3	stop_gained	3286	E/*
4	chr11	47333291	C	T	GD515326501	1	MYBPC3	stop_gained	3233	W/*
5	chr11	47333291	C	T	GD515331401	1	MYBPC3	stop_gained	3233	W/*
6	chr11	47333291	C	T	GD514681901	1	MYBPC3	stop_gained	3233	W/*
7	chr11	47333291	C	T	GD514891101	1	MYBPC3	stop_gained	3233	W/*
8	chr11	47333291	C	T	GD515054401	1	MYBPC3	stop_gained	3233	W/*
9	chr11	47333291	C	T	GD515070601	1	MYBPC3	stop_gained	3233	W/*

Since most sample are carriers of few variants, we can collapse the information per sample. We can use the federated virtual relation support in the GORdb query service to do the job using NOR:

In [12]:

%%gor
nor [var:pythongenotypes]
| calc gt_info '('+gene_symbol+' '+amino_acids+' '+cds_position+')'
| group -gc pn -lis -sc gt_info -s ', ' -count | sort -c allcount:nr
| map -c pn <(nor [var:pythongenotypes] | group -gc pn,gene_symbol -count | calc gene_var_count gene_symbol+' '+allcount | select pn,gene_var_count)

Loading relations [var:pythongenotypes] from local state
Query ran in 0.89 sec
Query fetched 19,462 rows in 0.99 sec (total time 1.88 sec)

Out[12]:

	PN	allCount	lis_gt_info	gene_var_count
0	GD514311501	12	(MYBPC3 E/* 1156), (BRCA2 S/* 7115), (BRCA2 E/...	APC 1,BRCA1 4,BRCA2 2,LDLR 1,MSH2 1,MYBPC3 1,T...
1	GD514818901	12	(MYBPC3 S/* 416), (BRCA2 ), (BRCA2 R/* 6952),...	BRCA1 2,BRCA2 5,LDLR 1,MSH2 2,MYBPC3 1,TSC2 1
2	GD514681901	11	(MYBPC3 W/* 3233), (MYBPC3 E/* 2965), (MYBPC3 ...	BRCA1 2,BRCA2 2,MSH2 1,MYBPC3 5,TSC2 1
3	GD515116601	11	(MYBPC3 E/* 2965), (MYBPC3 S/* 416), (BRCA2 S/...	BRCA1 2,BRCA2 3,MSH2 1,MYBPC3 2,TSC2 3
4	GD515234501	11	(MYBPC3 ), (MYBPC3 S/* 416), (BRCA2 S/* 7115)...	BRCA1 2,BRCA2 2,MSH2 1,MYBPC3 2,TSC2 4
...	...	...	...	...
19457	GD515530801	5	(BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T...	BRCA1 1,BRCA2 2,MSH2 1,TSC2 1
19458	GD515558601	5	(BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T...	BRCA1 2,BRCA2 2,TSC2 1
19459	GD515563601	5	(BRCA2 E/* 1132,64,8566), (TSC2 ), (BRCA1 L/*...	BRCA1 2,BRCA2 1,MSH2 1,TSC2 1
19460	GD515592501	5	(BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T...	BRCA1 1,BRCA2 2,MSH2 1,TSC2 1
19461	GD515612401	5	(BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T...	BRCA1 1,BRCA2 2,MSH2 1,TSC2 1

19462 rows × 4 columns

In [ ]: