Uniparental disomy (UPD)¶
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1
project = "test-hg19"
%env GOR_API_PROJECT={project}
Uniparental disomy¶
This notebook serves as an example of how a relatively simple Perl/Python script that processes variants can be rewritten using the GOR query methodology. As describe in a GDX paper by Julie Scuffins et.al. in Genetics in Medicine (2021) 23:1101–1107; https://doi.org/10.1038/s41436-020-01092-8, the identification of UPD regions involves the following:
"The detection method and algorithm (Figure S1) described below was developed internally in Perl, Python, and R program languages except where noted. For each trio, variant calls were subset to SNPs with a genotype quality ≥90 in all three. Variants with the potential to reduce reliability of observed inheritance patterns were excluded, including all indels for potential allele balance skewing and all variants within the MHC (chr6:28477797–33448354) and LRC (chr19:51403257–59118983) due to high rates of somatic variation. The sex chromosomes were also excluded.
This notebook show a new UPD detection algorithm that is NOT a direct copy of the Perl script but rather written from first principles.
It also comes with few tests using simulation of UPD from real data.
qh = GQH.GOR_Query_Helper()
mydefs = ""
First we create phased parental data that we want to use to simulate the disomy
mydefs = qh.add_many("""
def ##index_case## = 'TEST_2943078';
def ##index_case_no_quotes## = 'TEST_2943078';
def ##father## = 'TEST_2943110';
def ##mother## = 'TEST_2943097';
def ##goodcov## = source/cov/goodcov_8.wgs.gord -s PN;
def ##variants## = source/var/wgs_varcalls.gord -s PN;
create #f# = gor ##variants## -f TEST_2943110
| calc phase if(random()>0.5,1,0)
| calc f if(callcopies = '2',1,if(callcopies=1 and phase=1,1,0))
| calc m if(callcopies = '2',1,if(callcopies=1 and phase=0,1,0))
| select 1-4,callcopies,gl_call,pn,f,m
;
create #m# = gor ##variants## -f TEST_2943097
| calc phase if(random()>0.5,1,0)
| calc f if(callcopies = '2',1,if(callcopies=1 and phase=1,1,0))
| calc m if(callcopies = '2',1,if(callcopies=1 and phase=0,1,0))
| select 1-4,callcopies,gl_call,pn,f,m
""")
Next we setup segments for simulating the disomy
mydefs = qh.add("""
create #test_regions# = gor <(nor <(norrows 1 | calc x 'chr1,0,100000000,Heterodisomy;chr2,0,100000000,Isodisomy;chr3,10000000,15000000,Isodisomy;chr4,10000000,11000000,Isodisomy'
| split x -s ';' | colsplit x 4 c -s ',' | select c_1- | rename #1 chrom | rename #2 segstart | rename #3 segstop | rename #4 region))
| sort chrom;
""")
%%gor
$mydefs
gor [#test_regions#]
Query ran in 0.64 sec Query fetched 4 rows in 0.00 sec (total time 0.65 sec)
chrom | segstart | segstop | region | |
---|---|---|---|---|
0 | chr1 | 0 | 100000000 | Heterodisomy |
1 | chr2 | 0 | 100000000 | Isodisomy |
2 | chr3 | 10000000 | 15000000 | Isodisomy |
3 | chr4 | 10000000 | 11000000 | Isodisomy |
Note that by definition, the last region is too small to be picked up, as we will see. Next we simulate the proband. Within the disomy region we use non-Mendelian inheritance as specified in the IF logic for GT.
mydefs = qh.add("""
create #proband# = pgor [#f#] [#m#] | select 1-4 | distinct
| varjoin -norm -r -l -e 0 -rprefix fa [#f#]
| varjoin -norm -r -l -e 0 -rprefix mo [#m#]
| join -snpseg -r -l [#test_regions#]
| calc gt if(region = 'Isodisomy',str(fa_f)+str(fa_f),if(region = 'Heterodisomy',str(fa_f)+str(fa_m),if(random()<0.5,str(fa_f),str(fa_m))+if(random()<0.5,str(mo_f),str(mo_m))))
| calc callcopies decode(gt,'00,0,01,1,10,1,11,2,3')
| select 1-4,callcopies
| where callcopies != '0'
| calc gl_call 100
| calc pn ##index_case##
""")
Finally, we add the query logic from gregor_upd_caller.ftl.yml(). It is same, excepth that the variant source has been edited to use the simulated data, e.g. #proband#, #f# and #m#.
mydefs = qh.add_many("""
create ##ped## = nor <(gorrow chr1,1,2 | calc pn ##index_case_no_quotes## | calc fn ##father## | calc mn ##mother##) | select pn,fn,mn;
create ##pns## = nor [##ped##] | unpivot #1- | select col_value | rename #1 PN | where PN != '' | distinct;
create ##buck## = nor [##pns##] | calc bucket 'b1';
def ##genome_filter## = not(chrom in ('chrM','chrX','chrY')) and not(chrom = 'chr6' and 28477797 <= pos and pos <= 33448354) and not(chrom = 'chr19' and 51403257 <= pos and pos <= 59118983);
def ##roi_targets## = ref_gregor/util/target_roi.gorz;
def #supportive#($1,$2,$3) = if(($1 = 'hom' and $2 in ('het','hom','unknown') and $3 = 'wt') or ($1 = 'wt' and $2 in ('wt','het','unknown') and $3 = 'hom'),1,0);
def #contradictive#($1,$2,$3) = if(($1 in ('het','hom') and $2 = 'wt' and $3 in ('het','hom')) or ($1 in ('het','wt') and $2 = 'hom' and $3 in ('het','wt')),1,0);
def #iso#($1,$2) = if($1 in ('hom','wt') and $2 = 'het',1,0);
create ##upd_class## = pgor /* ##variants## -fs -ff [##pns##] */ <(gor [#proband#] | merge [#f#] | merge [#m#])
| join -snpseg -i ##roi_targets##
| where ##genome_filter##
| calc goodQC if(GL_Call>=30,1,0)
| granno 1 -gc #3,#4 -sum -ic goodQC -count
| where sum_goodQC = allcount
| select 1-4,callcopies,pn
| rename callcopies GT
| gtgen -gc #3,#4 [##buck##] <(gor ##goodcov## -fs -ff [##pns##])
| csvsel -gc #3,#4 -vs 1 [##buck##] [##pns##] -tag PN
| rename value GT
| pedpivot -gc #3,#4 -e 3 PN [##ped##]
| select 1-4,P_PN,P_GT,F_GT,M_GT
| rename P_PN PN | rename P_GT child | rename F_GT father | rename M_GT mother
| replace child-mother decode(#rc,'0,wt,1,het,2,hom,3,unknown')
| calc father_supp #supportive#(child,father,mother)
| calc mother_supp #supportive#(child,mother,father)
| calc father_contr #contradictive#(child,father,mother)
| calc mother_contr #contradictive#(child,mother,father)
| calc father_iso if(father_supp = 1, #iso#(child,father) , 0)
| calc mother_iso if(mother_supp = 1, #iso#(child,mother) , 0)
| calc ihe if(child='het' and (father='wt' and mother='wt' or father='hom' and mother='hom') or child='hom' and (father='wt' or mother='wt') or child = 'wt' and (father = 'hom' or mother = 'hom'),1,0)
;
create ##upd_seg## = gor <(gor [##upd_class##] | group 2500000 -steps 20 -count -sum -ic father_supp-
| merge <(gor [##upd_class##] | group 5000000 -steps 20 -count -sum -ic father_supp-)
| merge <(gor [##upd_class##] | group 10000000 -steps 20 -count -sum -ic father_supp-)
| merge <(gor [##upd_class##] | group 20000000 -steps 20 -count -sum -ic father_supp-)
| merge <(gor [##upd_class##] | group 50000000 -steps 20 -count -sum -ic father_supp-)
| merge <(gor [##upd_class##] | group 100000000 -steps 20 -count -sum -ic father_supp-)
| rename sum_(.*) #{1}
| where ( father_contr/(father_contr+father_supp) < 0.1 and father_supp > 1 or father_contr/(father_contr+father_supp) < 0.1 and mother_supp > 1 )
)
| calc type if(father_supp > mother_supp,'father','mother')
| segspan -maxseg 1000000000 -gc type
| select 1-3,type
""")
Now we run the final logic to list potential UPD regions that pass our criteria for a length of 5Mb and more than 10 inheritance errors:
%%gor
$mydefs
gor [##upd_seg##] | join -segsnp [##upd_class##]
| group 1 -gc #3-type -count -sum -min -max -ic pos,father_supp-
| rename min_pos diso_start | rename max_pos diso_stop
| rename sum_(.*) #{1}
| rename allcount variantCount
| hide max_*,min_*
| calc father_hetero father_supp-father_iso
| calc mother_hetero mother_supp-mother_iso
| replace type if(type = 'father',if(father_iso > father_hetero,'father Isodisomies','father Heterodisomies'),if(mother_iso > mother_hetero,'mother Isodisomies','mother Heterodisomies'))
| calc size diso_stop-diso_start
| where size >= 5000000 and ihe >= 10
| replace size if(size>1e6,str(round(size/1e6))+'Mb',if(size>1e3,str(round(size/1e3))+'kb',str(size)+'bp'))
| calc father_conf form(father_supp/(father_supp+father_contr)*100,3,1)+'%'
| calc mother_conf form(mother_supp/(mother_supp+mother_contr)*100,3,1)+'%'
| select chrom,diso_start,diso_stop,size,type,variantCount,ihe,father_conf,mother_conf,father_supp,father_contr,mother_supp,mother_contr,father_iso,father_hetero,mother_iso,mother_hetero
| sort chrom
Query ran in 0.93 sec Query fetched 3 rows in 0.05 sec (total time 0.99 sec)
Chrom | diso_start | diso_stop | size | type | variantCount | ihe | father_conf | mother_conf | father_supp | father_contr | mother_supp | mother_contr | father_iso | father_hetero | mother_iso | mother_hetero | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 877831 | 104166496 | 103Mb | father Heterodisomies | 1620 | 100 | 98.0% | 0.0% | 100 | 2 | 0 | 567 | 0 | 100 | 0 | 0 |
1 | chr2 | 45895 | 109746039 | 110Mb | father Isodisomies | 899 | 186 | 91.6% | 0.0% | 186 | 17 | 0 | 211 | 141 | 45 | 0 | 0 |
2 | chr3 | 10028294 | 16926666 | 7Mb | father Isodisomies | 130 | 22 | 95.7% | 0.0% | 22 | 1 | 0 | 26 | 18 | 4 | 0 | 0 |
We see that all of our regions show up. Note that the variant count is reflecting that we use our standard GDX ##roi_targets##. For WGS analysis, we could make this more accurate.