Sequence Reads¶

The sequence data that is imported into the WuXi NextCODE system is part of an analysis pipeline that begins with the DNA samples taken from a subject, which are sequenced using next-generation sequencing (NGS). The overlapping sequence reads are then aligned to the reference genome and compiled into a coverage file. Individual sequence reads can be up to 250 base-pairs long.

Finally, this coverage file is compared to known variants that have been annotated with regards to their effects on the individual (i.e. their genetic traits or the likelihood of developing certain diseases). The WXNC tools are developed specifically to work with this last step in the pipeline.

In the following chapter, we will introduce some commands for working with sequence read data in the GOR query language, in particular sequence read data that is stored in BAM files.

Viewing Sequence Reads in Sequence Miner¶

Before we show each of the commands that can be performed with sequence read data, we will first show how to open the data in Sequence Miner and describe how the data can be interpreted using the tool. In the following video example, we show how to open the variation and BAM files for participants in a study and view the BAM tracks in the genome browser.

Pileup¶

The PILEUP command describes the base-pair formation at each chromosomal position. It summarizes the base calls of aligned sequence reads to a reference sequence.

Bamflag¶

The FLAG column in the sequence reads indicates settings for a number of different parameters. The BAMFLAG command expands the FLAG bitmap column into multiple Boolean (1/0) columns, which are easier to read. The table below has a list of the different parameters and their corresponding bit in the flag value.

The Flag Column as expanded by BAMFLAG¶
#	Binary	Decimal	Hexadecimal	Description
1	1	1	0x1	Read paired
2	10	2	0x2	Read mapped in proper pair
3	100	4	0x4	Read unmapped
4	1000	8	0x8	Mate unmapped
5	10000	16	0x10	Read reverse strand
6	100000	32	0x20	Mate reverse strand
7	1000000	64	0x40	First in pair
8	10000000	128	0x80	Second in pair
9	100000000	256	0x100	Not primary alignment
10	1000000000	512	0x200	Read fails platform/vendor quality checks
11	10000000000	1024	0x400	Read is PCR or optical duplicate
12	100000000000	2048	0x800	Supplementary alignment

Cigarsegs¶

The CIGARSEGS command takes sequence reads from a BAM-like stream and splits them into multiple reads based on the CIGAR column in the stream.

Bases¶

The BASES command splits the SEQ column in the BAM file into the individual bases, showing the relative position within the read and the base-quality.

Variants¶

The VARIANTS command returns the variants found in sequence reads and their associated quality. The variant quality is simply the base-quality of the first position in the variants. The VARIANTS command uses sliding window sort so that variants from overlapping reads are returned in genomic order.

Liftover¶

The LIFTOVER command is used to convert GOR data from one reference genome build to another, for example from hg19 to hg38.