Tuesday, March 13, 2012

Date: March 13, Qi Sun: Next generation genotype storage

Qi Sun talks about next generation genotype storage databases, including NetCDF4 and HDF5.

NetCDF4 is essentially HDF5 with certain restrictions (a subset of HDF5). If NetCDF4 is used, both the NetCDF4 and the HDF5 code libraries can be used to access the data. NetCDF4 has Perl libraries available, while HDF5 does not. NetCDF4 also has a pure Java library. The Integrative Genomics Viewer (IGV) used HDF5 as its database engine, but has since moved to another format.

HDF5 organizes data like a file system: the equivalent of a directory is called a "group", and the equivalent of a file is called a "dataset". Metadata is stored in "attributes" (embedded in a dataset).

The contents of an HDF5 file can be browsed with a graphical viewer (HDFView).

Example of an HDF5 store: a root directory holds a project directory, which contains one folder per chromosome. For each chromosome there is a dataset with a matrix of scores (1-byte values), and two additional datasets encode the row and column headers (as fixed-width strings).
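To make the layout concrete, here is a plain-Python mock of it. This is only a sketch: it uses nested dictionaries in place of groups and lists of bytes in place of datasets, and it does not call any HDF5 library. The names (`project`, `chr1`, `scores`, `row_headers`, `col_headers`), the sample labels, the width of 8 bytes, and the NUL padding are all assumptions made for illustration.

```python
def to_fixed_width(labels, width):
    """Encode labels as fixed-width ASCII byte strings, padded with NUL bytes."""
    encoded = []
    for label in labels:
        raw = label.encode("ascii")
        if len(raw) > width:
            raise ValueError(f"{label!r} is longer than {width} bytes")
        encoded.append(raw.ljust(width, b"\x00"))
    return encoded


# Hypothetical row and column headers (not from the talk).
taxa = ["B73", "Mo17", "CML247"]
markers = ["S1_1001", "S1_2047"]

# Dictionaries stand in for groups, the leaf values stand in for datasets.
store = {
    "project": {
        "chr1": {
            # Score matrix: one bytes object per row, one byte per value.
            "scores": [bytes([1, 0]), bytes([3, 1]), bytes([0, 2])],
            # Headers stored as fixed-width strings (8 bytes each here).
            "row_headers": to_fixed_width(taxa, 8),
            "col_headers": to_fixed_width(markers, 8),
        }
    }
}
```

Fixed-width headers mean the header for row i always starts at byte offset i * width, so it can be located without scanning the earlier entries.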

Transforming a table into an HDF5 store: the data can be written row by row or column by column (one dimension is then fast to read and the other slow). Data chunking is also possible (a block of consecutive values, then a jump to the next line, and so on). The choice has a huge effect on performance. Chunking offers a compromise in performance between the two dimensions.
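A rough way to see why the layout matters is to count how many stored blocks have to be touched to read one full row or one full column. The sketch below does that in plain Python. The 1000 x 1000 matrix and the three chunk shapes are hypothetical, and the chunk count is only a proxy for real I/O cost.

```python
def chunks_touched(cells, chunk_shape):
    """Count the distinct chunks that contain the given (row, col) cells."""
    chunk_rows, chunk_cols = chunk_shape
    return len({(r // chunk_rows, c // chunk_cols) for r, c in cells})


n_rows, n_cols = 1000, 1000  # hypothetical matrix size

one_row = [(0, c) for c in range(n_cols)]  # every cell of row 0
one_col = [(r, 0) for r in range(n_rows)]  # every cell of column 0

# Each layout is expressed as a chunk shape (rows per chunk, columns per chunk).
layouts = {
    "row by row": (1, n_cols),
    "column by column": (n_rows, 1),
    "chunked 100 x 100": (100, 100),
}

for name, shape in layouts.items():
    row_cost = chunks_touched(one_row, shape)
    col_cost = chunks_touched(one_col, shape)
    print(f"{name}: row read touches {row_cost}, column read touches {col_cost}")
```

Under these assumptions the row-by-row layout reads a row from 1 block but a column from 1000, the column-by-column layout is the reverse, and the 100 x 100 chunks need 10 blocks for either, which is the compromise described above.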

Encoding of genotypes: the simplest option is two bytes per genotype. Another possibility is IUPAC nomenclature, with 1 byte per data point. The most compressed option encodes genotypes with 2 bits each: 0-0 is missing data, 1-0 and 0-1 are polymorphisms (heterozygotes), and 1-1 is a homozygote (no polymorphism).
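The 2-bit scheme can be sketched as follows, packing four genotypes into each byte. The bit values follow the codes in the notes. The function names, the constant names, and the choice to put the first genotype in the highest two bits of each byte are assumptions made for this example, not details from the talk.

```python
# Two-bit genotype codes, following the scheme in the notes.
MISSING = 0b00      # 0-0
HET_A = 0b01        # 0-1, heterozygote
HET_B = 0b10        # 1-0, heterozygote
HOMOZYGOUS = 0b11   # 1-1


def pack_genotypes(codes):
    """Pack a sequence of 2-bit codes into bytes, four codes per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"code {code} does not fit in 2 bits")
        # The first code in each group of four goes into the highest two bits.
        shift = 6 - 2 * (i % 4)
        packed[i // 4] |= code << shift
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover `count` 2-bit codes from packed bytes."""
    codes = []
    for i in range(count):
        shift = 6 - 2 * (i % 4)
        codes.append((packed[i // 4] >> shift) & 0b11)
    return codes


codes = [HOMOZYGOUS, HET_A, MISSING, HET_B, HOMOZYGOUS]
packed = pack_genotypes(codes)            # 5 genotypes fit in 2 bytes
restored = unpack_genotypes(packed, len(codes))
```

The unused bits in the last byte are left as 00, which is also the code for missing data, so the number of genotypes has to be stored separately (for example as an attribute) to tell padding from real missing values.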

This will be incorporated into TASSEL.

Jarek developed an interface based on a network socket.












