Extract Variants¶
Demonstrate the basic commands to extract variants from a genotype freeze¶
First we include boiler plate code to define the project, set environment variables and import libraries.
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
%env LOG_QUERY=1
project = "janssen_sle"
%env GOR_API_PROJECT={project}
Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.
gordefs = """
def ##freeze## = freezes/js_19k_array;
def #buckets# = ##freeze##/buckets.tsv;
def #hgt# = ##freeze##/variants.gord;
def #af# = ##freeze##/metadata/AF.gorz;
def #vep# = ##freeze##/vep_single.gorz;
"""
First we define which samples (PNs) and which variants we want to export the genotypes from. We use CREATE
statements to create what we refer to as virtual relations or temporary tables. We pick the first 10 samples in the freeze bucket relation/table/file and 100 variants from the BRCA2 gene. The bucket relation has two columns (PN,bucket). The row number and the bucket value are used to determine the position of the sample within the value column in the genotype freeze rows. We use filtering options to extract only rows from buckets that contain genotypes from our samples of interest that are stored in #pns#. Because the value column in the horizontal bucketized genotype table stores genotypes from multiple PNs, we first trim it down by replacing it with its length.
%%gor
$gordefs
create #pns# = nor #buckets# | select pn | top 10;
create ##input_vars## = gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp -ir #af# | select 1-4 | top 100;
gor [##input_vars##]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| replace values len(values) | rename values len_values
| top 5
Query ran in 0.32 sec Query fetched 5 rows in 0.34 sec (total time 0.66 sec)
CHROM | POS | REF | alt | bucket | len_values | Source | |
---|---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | bucket_1 | 1946 | bucket_1 |
1 | chr13 | 32316435 | G | A | bucket_1 | 1946 | bucket_1 |
2 | chr13 | 32316446 | A | C | bucket_1 | 1946 | bucket_1 |
3 | chr13 | 32316447 | T | C | bucket_1 | 1946 | bucket_1 |
4 | chr13 | 32316494 | T | G | bucket_1 | 1946 | bucket_1 |
We will move the create statement we setup into a Python string because we will be using them in multiple samples below.
gorcreates = """
create #pns# = nor #buckets# | select pn | top 10;
create ##input_vars## = gor #genes# | where gene_symbol = 'BRCA2' | join -segsnp -ir #af# | select 1-4 | top 100;
"""
Horizontal GOR-style output¶
First we show how we can extract a table that represents the variants, for only the samples of interest, in a horizontal format. Both as fixed value-size format (1 char per genotype) and then we use the "fixed size value map" function to transform values into a comma separated list with VCF-style genotypes. The CSVSEL
command takes a stream of variant value rows and uses the bucket relation and samples to generate a single value row per variant (because we use -gc #3,#4).
%%gor
$gordefs
$gorcreates
gor [##input_vars##]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| csvsel -vs 1 -gc #3,#4 #buckets# [#pns#]
| calc lis_values fsvmap(values,1,'decode(x,"0,0/0,1,0/1,2,1/1,./.")',',')
| replace values '_'+values
Query ran in 0.30 sec Query fetched 100 rows in 0.18 sec (total time 0.48 sec)
CHROM | POS | REF | alt | values | lis_values | |
---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | _0110002100 | 0/0,0/1,0/1,0/0,0/0,0/0,1/1,0/1,0/0,0/0 |
1 | chr13 | 32316435 | G | A | _0100000010 | 0/0,0/1,0/0,0/0,0/0,0/0,0/0,0/0,0/1,0/0 |
2 | chr13 | 32316446 | A | C | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
3 | chr13 | 32316447 | T | C | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
4 | chr13 | 32316494 | T | G | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
... | ... | ... | ... | ... | ... | ... |
95 | chr13 | 32337431 | A | T | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
96 | chr13 | 32337458 | G | T | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
97 | chr13 | 32337512 | T | C | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
98 | chr13 | 32337573 | A | G | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
99 | chr13 | 32337580 | T | C | _0000000000 | 0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0 |
100 rows × 6 columns
Multi-columnar VCF-style output¶
The CSVSEL
command has a "-vcf" option to format the genotypes according to the VCF specification.
%%gor
$gordefs
$gorcreates
gor [##input_vars##]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| csvsel -vs 1 -gc #3,#4 #buckets# <(nor [#pns#] | top 5) -vcf
| top 5
Query ran in 0.30 sec Query fetched 5 rows in 0.15 sec (total time 0.45 sec)
CHROM | POS | REF | alt | QUAL | FILTER | INFO | FORMAT | GD515262501 | GD515227401 | GD515226101 | GD515228601 | GD515247301 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | . | . | . | GT | 0/0 | 0/1 | 0/1 | 0/0 | 0/0 |
1 | chr13 | 32316435 | G | A | . | . | . | GT | 0/0 | 0/1 | 0/0 | 0/0 | 0/0 |
2 | chr13 | 32316446 | A | C | . | . | . | GT | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
3 | chr13 | 32316447 | T | C | . | . | . | GT | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
4 | chr13 | 32316494 | T | G | . | . | . | GT | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
Notice that we use a nested query for the PN selection, to prevent that we get too many columns in the output. If we end the query with a `WRITE` command, we can easily write the output to a file stored in user_data, even though it contains hundred thousand rows. Keep in mind however, that displaying such columns is difficult and the SequenceMiner will not handle them well if there are more than few hundred columns.
Row-level output¶
By using the "-tag PN" option on the CSVSEL
command, rather than getting a single row with a value column representing all the samples, we get one row per sample for each variant. Optionally, the "-hide" option can be used to suppress unwanted values, such as homozygous reference.
%%gor
$gordefs
$gorcreates
gor [##input_vars##]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| csvsel -vs 1 -gc #3,#4 -tag PN /* -hide '0' */ #buckets# [#pns#]
| rename value GT | top 5
Query ran in 0.28 sec Query fetched 5 rows in 0.16 sec (total time 0.44 sec)
CHROM | POS | REF | alt | PN | GT | |
---|---|---|---|---|---|---|
0 | chr13 | 32315226 | G | A | GD515262501 | 0 |
1 | chr13 | 32315226 | G | A | GD515227401 | 1 |
2 | chr13 | 32315226 | G | A | GD515226101 | 1 |
3 | chr13 | 32315226 | G | A | GD515228601 | 0 |
4 | chr13 | 32315226 | G | A | GD515247301 | 0 |
Transposed output¶
In some cases it may be desired to get the genotypes in a format where all the variants per sample are stored in a single row.
This can be achieved with the GTTRANSPOSE
command. Notice here that we the nested GOR expression as an input to the NOR
command in order to
apply SELECT
to eliminate the redundant (chrom,pos) command that come from the transpose GOR command. As an "additiona stunt", we filter the samples such that we only see samples with two or more genotype columns with non-zero value.
%%gor
$gordefs
$gorcreates
create #myvars# = gor [##input_vars##] | top 8;
nor <(gor [#myvars#]
| varjoin -ir <(gor #hgt# -nf -ff [#pns#])
| gttranspose #buckets# [#pns#] [#myvars#] -vs 1 -cols)
| select pn-
| where listsize(listfilter(cols2list('#2-'),'x!="0"'))>1
| top 5
Query ran in 0.33 sec Query fetched 5 rows in 0.17 sec (total time 0.51 sec)
PN | chr13_32315226_G_A | chr13_32316435_G_A | chr13_32316446_A_C | chr13_32316447_T_C | chr13_32316494_T_G | chr13_32316609_C_G | chr13_32319070_T_A | chr13_32319090_A_C | |
---|---|---|---|---|---|---|---|---|---|
0 | GD515227401 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
1 | GD515226101 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
2 | GD515247301 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 |
3 | GD515201701 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
4 | GD515214801 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
Parallel execution¶
As a final example, we show an example where we extract genotypes for all carriers of high-impact variants in the top 10 genes with the highest number of high impact variants.
- #genecount# annotates the genes with overlap count (see "-ic" option) of high-impact variants int the gene. We join gene segments and gene_symbols with variants in #vep#. We use
PGOR
because the #vep# table may have very many rows. - #topgenes# ranks the genes and filters them based on high overlap count.
- #myvars# uses "-ir" join to extract the high impact variants from the top ranking genes and eliminates potential duplicates because of overlaping genes.
%%gor
$gordefs
create #genecount# = pgor #genes# | join -segsnp -xl gene_symbol -xr gene_symbol -ic <(gor #vep# | where max_impact = 'HIGH');
create #topgenes# = gor [#genecount#] | rank genome overlapcount -o desc | where rank_overlapcount <= 10;
create #selvars# = gor [#topgenes#] | join -segsnp -xl gene_symbol -xr gene_symbol -ir #vep# | where max_impact = 'HIGH' | select 1-4 | distinct;
gor [#selvars#]
Query ran in 0.22 sec Query fetched 2,614 rows in 0.01 sec (total time 0.23 sec)
CHROM | POS | Reference | Call | |
---|---|---|---|---|
0 | chr11 | 47331871 | T | C |
1 | chr11 | 47331882 | C | T |
2 | chr11 | 47332070 | A | G |
3 | chr11 | 47332071 | C | T |
4 | chr11 | 47332075 | G | A |
... | ... | ... | ... | ... |
2609 | chrX | 32844848 | C | A |
2610 | chrX | 32849781 | G | A |
2611 | chrX | 33020202 | T | A |
2612 | chrX | 33211281 | C | A |
2613 | chrX | 33211304 | C | T |
2614 rows × 4 columns
For example purpose, we use an advanced "-split" option in PGOR
where we provide it segments to split up the genome for parallel execution. We defined the segments based on the density of the variants that we want to export, i.e. the variants in #selvars#, and the fact that we are extraction the genotypes for all the samples in the freeze (possibly >> 10k). With the SEGHIST
command, we can ensure that each parallel task extracts no more than #numVarsPerTask# = 1000 in our example. One can inspect the #segs# table to see the actual number. Keep in mind that we don't want to have the task too few but also not too many, e.g. much over 500 partition tasks per query. We will fetch the data into a the Python variable pythongenotypes.
%%gor pythongenotypes <<
$gordefs
create #genecount# = pgor #genes# | join -segsnp -xl gene_symbol -xr gene_symbol -ic <(gor #vep# | where max_impact = 'HIGH');
create #topgenes# = gor [#genecount#] | rank genome overlapcount -o desc | where rank_overlapcount <= 10;
create #selvars# = gor [#topgenes#] | join -segsnp -xl gene_symbol -xr gene_symbol -ir #vep# | where max_impact = 'HIGH' | select 1-4 | distinct;
def #numVarsPerTask# = 1000;
create #segs# = gor [#selvars#] | group 100000 -count | seghist #numVarsPerTask#;
create #varoutput# = pgor -split <(gor [#segs#]) [#selvars#]
| varjoin -ir <(gor #hgt#)
| csvsel -vs 1 -gc #3,#4 #buckets# <(nor #buckets# | select pn) -tag PN -hide '0,3'
| rename value GT
;
gor [#varoutput#] | varjoin -r <(gor #vep# | where max_impact = 'HIGH' | select 1-4,gene_symbol,max_consequence,cds_position,amino_acids)
Query ran in 0.32 sec Query fetched 128,414 rows in 3.26 sec (total time 3.58 sec)
pythongenotypes[0:10]
CHROM | POS | REF | alt | PN | GT | Gene_Symbol | max_consequence | CDS_position | Amino_Acids | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr11 | 47333192 | A | C | GD514033801 | 1 | MYBPC3 | splice_donor_variant | NaN | NaN |
1 | chr11 | 47333192 | A | C | GD514545901 | 1 | MYBPC3 | splice_donor_variant | NaN | NaN |
2 | chr11 | 47333193 | C | A | GD515070601 | 1 | MYBPC3 | splice_donor_variant | NaN | NaN |
3 | chr11 | 47333238 | C | A | GD514873301 | 1 | MYBPC3 | stop_gained | 3286 | E/* |
4 | chr11 | 47333291 | C | T | GD515326501 | 1 | MYBPC3 | stop_gained | 3233 | W/* |
5 | chr11 | 47333291 | C | T | GD515331401 | 1 | MYBPC3 | stop_gained | 3233 | W/* |
6 | chr11 | 47333291 | C | T | GD514681901 | 1 | MYBPC3 | stop_gained | 3233 | W/* |
7 | chr11 | 47333291 | C | T | GD514891101 | 1 | MYBPC3 | stop_gained | 3233 | W/* |
8 | chr11 | 47333291 | C | T | GD515054401 | 1 | MYBPC3 | stop_gained | 3233 | W/* |
9 | chr11 | 47333291 | C | T | GD515070601 | 1 | MYBPC3 | stop_gained | 3233 | W/* |
Since most sample are carriers of few variants, we can collapse the information per sample. We can use the federated virtual relation support in the GORdb query service to do the job using NOR:
%%gor
nor [var:pythongenotypes]
| calc gt_info '('+gene_symbol+' '+amino_acids+' '+cds_position+')'
| group -gc pn -lis -sc gt_info -s ', ' -count | sort -c allcount:nr
| map -c pn <(nor [var:pythongenotypes] | group -gc pn,gene_symbol -count | calc gene_var_count gene_symbol+' '+allcount | select pn,gene_var_count)
Loading relations [var:pythongenotypes] from local state Query ran in 0.89 sec Query fetched 19,462 rows in 0.99 sec (total time 1.88 sec)
PN | allCount | lis_gt_info | gene_var_count | |
---|---|---|---|---|
0 | GD514311501 | 12 | (MYBPC3 E/* 1156), (BRCA2 S/* 7115), (BRCA2 E/... | APC 1,BRCA1 4,BRCA2 2,LDLR 1,MSH2 1,MYBPC3 1,T... |
1 | GD514818901 | 12 | (MYBPC3 S/* 416), (BRCA2 ), (BRCA2 R/* 6952),... | BRCA1 2,BRCA2 5,LDLR 1,MSH2 2,MYBPC3 1,TSC2 1 |
2 | GD514681901 | 11 | (MYBPC3 W/* 3233), (MYBPC3 E/* 2965), (MYBPC3 ... | BRCA1 2,BRCA2 2,MSH2 1,MYBPC3 5,TSC2 1 |
3 | GD515116601 | 11 | (MYBPC3 E/* 2965), (MYBPC3 S/* 416), (BRCA2 S/... | BRCA1 2,BRCA2 3,MSH2 1,MYBPC3 2,TSC2 3 |
4 | GD515234501 | 11 | (MYBPC3 ), (MYBPC3 S/* 416), (BRCA2 S/* 7115)... | BRCA1 2,BRCA2 2,MSH2 1,MYBPC3 2,TSC2 4 |
... | ... | ... | ... | ... |
19457 | GD515530801 | 5 | (BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T... | BRCA1 1,BRCA2 2,MSH2 1,TSC2 1 |
19458 | GD515558601 | 5 | (BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T... | BRCA1 2,BRCA2 2,TSC2 1 |
19459 | GD515563601 | 5 | (BRCA2 E/* 1132,64,8566), (TSC2 ), (BRCA1 L/*... | BRCA1 2,BRCA2 1,MSH2 1,TSC2 1 |
19460 | GD515592501 | 5 | (BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T... | BRCA1 1,BRCA2 2,MSH2 1,TSC2 1 |
19461 | GD515612401 | 5 | (BRCA2 S/* 7115), (BRCA2 E/* 1132,64,8566), (T... | BRCA1 1,BRCA2 2,MSH2 1,TSC2 1 |
19462 rows × 4 columns