Querying bgen files in CRI


Sabrina Mi


November 10, 2020

BGEN format

A bgen file begins with a header block holding information about the file, including the number of samples, the number of variant data blocks, and flags that describe how the data is stored. Each variant data block contains the data for a single SNP, including its ID, position, and alleles. bgen files from UK Biobank also have a sample identifier block, which holds an identifier for each sample.
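To make the header layout concrete, here is a minimal sketch that prints those header fields using only od. It assumes the byte offsets given in the BGEN specification (a 4-byte offset, then the header length, variant count, and sample count, with the flags in the last 4 bytes of the header) and a little-endian host. The function names bgen_u32 and bgen_header are made up for this example.

```shell
# Sketch: print BGEN header fields with od.
# Assumes a little-endian host, so od's 4-byte integers match BGEN's byte order.
bgen_u32() {
    # Print the unsigned 32-bit integer stored at byte offset $2 of file $1.
    od -An -tu4 -j "$2" -N4 "$1" | tr -d ' \n'
}

bgen_header() {
    file=$1
    header_len=$(bgen_u32 "$file" 4)     # length of the header block in bytes
    n_variants=$(bgen_u32 "$file" 8)     # number of variant data blocks
    n_samples=$(bgen_u32 "$file" 12)     # number of samples
    # The header block starts at byte 4 and its last 4 bytes are the flags,
    # so the flags sit at byte offset 4 + header_len - 4 = header_len.
    flags=$(bgen_u32 "$file" "$header_len")
    echo "variants=$n_variants samples=$n_samples"
    # Bits 0-1: compression type; bits 2-5: layout version.
    echo "compression=$(( flags & 3 )) layout=$(( (flags >> 2) & 15 ))"
    # Bit 31: 1 if a sample identifier block is present.
    echo "sample_ids=$(( (flags >> 31) & 1 ))"
}
```

For example, bgen_header pnpo_37.bgen would report how many variants and samples the file declares and whether it carries a sample identifier block.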

Querying a subset of a bgen genotype

Start an interactive job with qsub -I. We can then query the data for a region of a chromosome in a bgen file:

module load gcc/6.2.0; module load bgen/1.1.3
bgenix -g /gpfs/data/ukb-share/genotypes/v3/ukb_imp_chr17_v3.bgen -incl-range 17:46018872-46026674 > pnpo_37.bgen

This outputs a bgen file containing only the SNPs on chromosome 17 between positions 46018872 and 46026674.
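When the region comes from a script rather than being typed by hand, it can help to validate it before calling bgenix. The following is only a sketch: bgen_range is a hypothetical helper, not part of bgenix, that builds the CHR:START-END string used with -incl-range and rejects non-numeric or reversed positions.

```shell
# Sketch: build the CHR:START-END string passed to bgenix -incl-range.
# bgen_range is a hypothetical helper written for this example.
bgen_range() {
    chr=$1
    start=$2
    end=$3
    # Both positions must be non-empty strings of digits.
    for pos in "$start" "$end"; do
        case "$pos" in
            ''|*[!0-9]*)
                echo "positions must be non-negative integers" >&2
                return 1
                ;;
        esac
    done
    # A reversed range is almost certainly a mistake, so fail early.
    if [ "$start" -gt "$end" ]; then
        echo "start ($start) is greater than end ($end)" >&2
        return 1
    fi
    printf '%s:%s-%s\n' "$chr" "$start" "$end"
}

# Usage (inside a job with the bgen module loaded):
#   bgenix -g /gpfs/data/ukb-share/genotypes/v3/ukb_imp_chr17_v3.bgen \
#       -incl-range "$(bgen_range 17 46018872 46026674)" > pnpo_37.bgen
```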

Before querying the new file, we need to create an index file for it:

bgenix -g pnpo_37.bgen -index

More documentation is here.
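In a script that may be rerun, the index step can be skipped when the index is already there. This is a sketch under the assumption that bgenix writes its index next to the bgen file with a .bgi suffix (its default naming). ensure_bgen_index is a hypothetical helper name.

```shell
# Sketch: create the bgenix index only if it is missing.
# Assumes the index lives at <file>.bgi, the default name bgenix uses.
ensure_bgen_index() {
    bgen=$1
    if [ ! -f "$bgen" ]; then
        echo "no such bgen file: $bgen" >&2
        return 1
    fi
    if [ -f "$bgen.bgi" ]; then
        echo "index already present: $bgen.bgi"
    else
        # Same command as above; requires the bgen module to be loaded.
        bgenix -g "$bgen" -index
    fi
}

# Usage:
#   ensure_bgen_index pnpo_37.bgen
```

Note that this only checks that the index exists, not that it is newer than the bgen file, so delete the .bgi by hand if the bgen file has been regenerated.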

Convert to VCF

For some reason, using the bgenix -vcf argument to convert the bgen output to a vcf is unreliable, so we use qctool to convert instead. qctool is installed in /gpfs/data/im-lab/nas40t2/bin. Note that qctool requires gcc, so this will need to be run in a job with the gcc module loaded.

export PATH=$PATH:/gpfs/data/im-lab/nas40t2/bin/software

bgenix -g pnpo_37.bgen | qctool -g - -filetype bgen -s /gpfs/data/ukb-share/genotypes/ukb19526_imp_chr1_v3_s487395.sample -og ~/PNPO_37.vcf

Run qctool -help for a list of options. More documentation is at https://www.well.ox.ac.uk/~gav/qctool_v2/index.html.
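After the conversion, a quick sanity check is to count the variants and samples in the resulting VCF. The sketch below assumes an uncompressed, tab-delimited VCF with a FORMAT column, so that sample columns start after the ninth field of the #CHROM line. vcf_summary is a hypothetical helper name.

```shell
# Sketch: count variant records and sample columns in an uncompressed VCF.
vcf_summary() {
    vcf=$1
    # Variant records are the lines that do not start with '#'.
    n_variants=$(grep -vc '^#' "$vcf")
    # The #CHROM line has 8 fixed columns plus FORMAT, then one column per sample.
    n_samples=$(awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$vcf")
    echo "variants=$n_variants samples=$n_samples"
}

# Usage:
#   vcf_summary ~/PNPO_37.vcf
```

The sample count should match the number of samples in the .sample file passed to qctool, and the variant count should match the number of variants in pnpo_37.bgen.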