Tuesday, March 27, 2012

Date: March 27. Naama Menda. Storing phenotypes

Storing Phenotypes.

Presentation: Naama Menda (BTI / SGN)


What is a phenotype?
How can we store phenotypes?

Phenotypes include morphology, development, behavior, biochemical properties, and molecular characteristics. They depend on both the genotype and the environment.

Quantitative vs qualitative.

Metadata: weather information, who collected, how measured, etc.

Storing phenotypes: Chado phenotype modules. Problem: the Chado tables are too generic. One table captures all aspects of a phenotype: a cvterm for the entity and a cvterm for the quality (etc.), plus a value, a cvalue_id (cvterm), and an attribute_id. uniquename holds a free-text description that is basically for qualitative phenotypic information. This table is too limited.
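To make the "one table for everything" point concrete, here is a minimal sketch of what a single row of such a table looks like. The field names follow these notes rather than the actual Chado schema (which was not checked here), `entity_id` is a hypothetical name for the entity cvterm, and all cvterm IDs and values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhenotypeRow:
    """One row of a generic phenotype table (simplified, names assumed)."""
    uniquename: str                     # free-text description
    entity_id: Optional[int] = None     # cvterm for the entity (hypothetical name)
    attribute_id: Optional[int] = None  # cvterm for the quality/attribute
    value: Optional[str] = None         # free value, e.g. a measurement as text
    cvalue_id: Optional[int] = None     # cvterm for a controlled value


# Made-up cvterm IDs: 1 = fruit, 2 = weight, 3 = color, 4 = red.
quantitative = PhenotypeRow("fruit weight", entity_id=1, attribute_id=2, value="85.3")
qualitative = PhenotypeRow("fruit is red", entity_id=1, attribute_id=3, cvalue_id=4)
```

Note that the row has nowhere to put who measured the trait, how, or in which environment, which is the kind of metadata the notes say is missing.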

Thinking not about phenotypes, but about natural diversity: create a Natural Diversity Chado module, which can store lots of metadata about experiments and stocks.

Chado uses attribute/value, like PATO.  This came from FlyBase and is old.  It is not
EQV, entity/quality/value (fruit/color/red).  FlyBase doesn't actually
use the IDs, just text; and doesn't give the values, just names the
traits.  But Sol can't change the Chado schema because it would affect
"Post-composed" phenotypes like fruit/color/red.  Can it be reused for
different experiments?
Now the gmod.org/wiki/Chado_Natural_Diversity_Module.
Interactive QTL mapping in biparental populations.  Want to add
association mapping and genomic selection.
Chado "property" tables, generic.

Storing post-composed terms: work in progress. Who wants to help?

Tuesday, March 13, 2012

Date: March 13, Qi Sun: Next generation genotype storage

Qi Sun talks about next generation genotype storage databases, including NetCDF4 and HDF5.

NetCDF4 is basically HDF5 with certain restrictions (a subset of HDF5). If NetCDF4 is used, both the NetCDF4 and the HDF5 code libraries can be used. NetCDF4 has Perl libraries available, while HDF5 does not. NetCDF4 also has a pure Java library. The Integrative Genomics Viewer (IGV) used HDF5 as its database engine, but has since moved to another format.

An HDF5 file is organized like a file system: directories are called "groups" and files are called "datasets". Metadata is stored as "attributes" (embedded in a dataset).

HDF5 files can be browsed with a graphical viewer (HDFView).

Example of an HDF5 store: a root directory containing a project directory, which contains one folder per chromosome; for each chromosome, a dataset with a matrix of scores (1-byte values); two additional datasets encode the row and column headers (as fixed-width strings).
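The layout above can be sketched with plain Python containers. This is only a model of the hierarchy, not the HDF5 API: nested dicts stand in for groups, and lists of bytes stand in for datasets. The project, chromosome, taxon, and marker names are hypothetical, as is the choice of taxa as rows and markers as columns.

```python
def fixed_width(labels, width):
    """Encode labels as fixed-width ASCII byte strings, padded with NUL bytes."""
    encoded = []
    for label in labels:
        raw = label.encode("ascii")
        if len(raw) > width:
            raise ValueError(f"label {label!r} is longer than {width} bytes")
        encoded.append(raw.ljust(width, b"\0"))
    return encoded


# Hypothetical example data: rows are taxa, columns are markers (an assumption).
taxa = ["line_A", "line_B"]
markers = ["snp1", "snp2", "snp3"]
scores = [bytes([1, 0, 3]), bytes([3, 3, 0])]  # one byte per score, one row per taxon

store = {                      # root
    "project": {               # project group
        "chr1": {              # one group per chromosome
            "scores": scores,
            "row_headers": fixed_width(taxa, 8),
            "col_headers": fixed_width(markers, 8),
        },
    },
}
```

Fixed-width headers mean the i-th label always starts at byte `i * width`, so a single header can be read without scanning the others.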

Transforming a table to an HDF5 store: the data can be written row by row or column by column (slow vs. fast dimension). Data chunking is also possible (blocks of consecutive data, then a jump to the next line, etc.). The choice has a huge effect on performance. Chunked storage offers a compromise in performance for both dimensions.
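A rough way to see the trade-off is to count how many separately stored blocks must be touched to read one full row or one full column. The sketch below is a simplified model, not a measurement of HDF5 itself: it treats row-by-row storage as blocks of shape (1, n_cols), column-by-column storage as blocks of shape (n_rows, 1), and uses made-up matrix and chunk sizes.

```python
def blocks_touched(cells, block_rows, block_cols):
    """Count the distinct storage blocks that contain the requested cells."""
    return len({(r // block_rows, c // block_cols) for r, c in cells})


n_rows, n_cols = 1000, 500                      # made-up matrix size
one_row = [(0, c) for c in range(n_cols)]       # all cells of row 0
one_col = [(r, 0) for r in range(n_rows)]       # all cells of column 0

layouts = {
    "row by row": (1, n_cols),
    "column by column": (n_rows, 1),
    "chunked 100x100": (100, 100),
}

cost = {
    name: (blocks_touched(one_row, *shape), blocks_touched(one_col, *shape))
    for name, shape in layouts.items()
}

for name, (row_cost, col_cost) in cost.items():
    print(f"{name}: {row_cost} block(s) per row, {col_cost} block(s) per column")
```

In this model the unchunked layouts are ideal in one direction and worst-case in the other, while the chunked layout needs a handful of blocks either way.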

Encoding of genotypes. The simplest is two bytes per genotype. Another possibility is IUPAC nomenclature, 1 byte per data point. The most compressed option encodes genotypes with 2 bits each: 0-0 is missing data; 1-0 and 0-1 are polymorphisms (heterozygotes); 1-1 is a homozygote (no polymorphism).
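The 2-bit scheme can be sketched as follows, packing four genotypes into each byte. The bit codes follow the notes above; the order of genotypes within a byte (lowest bits first) and the function names are assumptions made for this illustration.

```python
MISSING = 0b00       # 0-0: missing data
HET_A = 0b10         # 1-0: heterozygote
HET_B = 0b01         # 0-1: heterozygote
HOMOZYGOTE = 0b11    # 1-1: homozygote (no polymorphism)


def pack_2bit(codes):
    """Pack a sequence of 2-bit codes (0..3) into bytes, four codes per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"code {code} does not fit in 2 bits")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, n):
    """Recover the first n 2-bit codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


genotypes = [MISSING, HET_A, HET_B, HOMOZYGOTE, HOMOZYGOTE]
packed = pack_2bit(genotypes)
restored = unpack_2bit(packed, len(genotypes))
```

The number of genotypes has to be stored separately (here it is passed as `n`), because padding bits in the last byte are indistinguishable from missing data.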

This will be incorporated into TASSEL.

Jarek developed an interface based on a network socket.

Date: March 6th, 2012: Genevieve DeClerck, Genotyping module

Genevieve DeClerck (Gramene) talked about the Gramene genotypes and phenotypes modules.

GDPDM database schema.

PackSNP.pm module. (Link?)