Bioinformatics Practitioners Club (was BTI Perl Club)

Date: Oct 27, 2017. Mark Daly: Dovetail Genomics – Building the Best Genomes on Earth

2017-10-09T14:40:00.000-04:00

Mark Daly is the Director of Sales at Dovetail Genomics LLC

A contiguous and accurate genome assembly is a crucial first step in fully understanding the biology of any organism. A high-quality genome assembly makes downstream analyses such as gene annotation, synteny, comparative genomics and population genetics far easier and more reliable.

Dovetail Genomics is the leading service provider for high-quality genome assemblies. To date, they have completed more than 400 projects, spanning many classes of organisms from plants to reptiles, amphibians, fish, mammals, birds, insects and more.

Using two complementary scaffolding methods, Chicago and Dovetail Hi-C, they are dramatically increasing the contiguity and accuracy of genome assemblies, enabling true, full-chromosome-length scaffolding. This talk will provide an update on their most current technologies. Mark will profile several customer projects and discuss how improvements in their assemblies led to better science and new discoveries.

Date: Oct 27, 2017
Time: 10:00 AM
Location: Weill hall, Room 226

Date: Dec 7, 2015. Susan Strickler: Long read assembly and annotation strategies for complex plant genomes

2015-09-26T19:00:00.000-04:00

Susan Strickler is a Research Associate in the Mueller lab at the Boyce Thompson Institute on Cornell University campus.

Plant genome assembly can be notoriously difficult due to challenges such as heterozygosity, polyploidy, and repeats. The long-read technology provided by PacBio sequencing can help overcome some of these obstacles, resulting in more complete and accurate assembly and annotation. In this talk, I will discuss de novo genome assembly tools and methods for processing PacBio genome and transcriptome data to generate high-quality plant genomes and gene models.

Date: Dec 07, 2015
Time: 11:00 AM
Location: Weill hall, Room 221

Slides

Date: Oct 19, 2015. Katie Wilkins: Population Diversity of Xanthomonas oryzae pv oryzicola TAL Effectors and their Candidate Targets

2015-09-26T18:20:00.000-04:00

Katie Wilkins is a Computational Biology PhD Candidate in the Bogdanove lab in the Plant Pathology and Plant-Microbe Biology section of the School of Integrative Plant Science at Cornell University.

Xanthomonas oryzae pv oryzicola is the causal agent of bacterial leaf streak of rice, a disease that can lead to up to 30% yield loss in this staple crop. Disease progression is mediated in part by the secretion of transcription activator-like (TAL) effectors that upregulate host genes by binding to corresponding promoter regions. Genes upregulated by TAL effectors can confer host resistance or enhance host susceptibility. Knowledge of these important TAL effector-target pairs informs breeding of resistant rice varieties. To determine the distribution of TAL effectors and their candidate targets at the population level, we sequenced 10 strains of Xanthomonas oryzae pv oryzicola and performed RNA-Seq of rice inoculated with each strain. We also used population level conservation to evaluate potential importance of the identified TAL effectors and their candidate targets.

Date: Oct 19, 2015
Time: 11:00 AM
Location: Weill hall, Room 221

Slides

Date: May 4, 2015. Kevin Panke-Buisse: Rhizosphere Microbiome: Manipulation and Investigation

2015-04-28T11:23:00.000-04:00

Kevin Panke-Buisse is a PhD Candidate in the Kao-Kniffin Lab in the Horticulture section of the School of Integrative Plant Science at Cornell University.

Soil microorganisms found in the root zone impact plant growth and development, but the potential to harness these benefits is hampered by the sheer abundance and diversity of the players influencing desirable plant traits. This talk will outline some of the ways we can manipulate the rhizosphere microbiome and look at shifts across treatments via 16S sequencing.

Date: May 04, 2015
Time: 10:30 AM
Location: Weill hall, Room 321
Paper
Slides

Date: Apr 20, 2015. Jeff Glaubitz: The Maize Rare Alleles Project: Biology & Bioinformatics

2015-04-13T13:34:00.001-04:00

Jeff Glaubitz is a Senior Research Associate and the Project Manager of Panzea - the NSF Maize Diversity Project.

The NSF project Biology Of Rare Alleles In Maize And Its Wild Relatives (Ed Buckler, PI) is combining the power of population genetic and molecular models with quantitative genetics to elucidate the relative contributions of rare versus common alleles to phenotypic variation and evolution. We are taking advantage of recent advances in high-throughput genotyping and phenotyping methodologies to identify the key biological attributes of variants (genome annotations) that will allow us to better predict the functional effects of rare alleles in Zea. This information will then be used to accelerate crop improvement either through more accurate genomic selection or via future genome editing approaches. We hope to enhance the effectiveness of plant breeding by improving our ability to identify, predict, and select on the effects of rare variants, both deleterious and beneficial. In this talk I will give an overview of the biological goals of this project and the various bioinformatic tools that are being developed to achieve these goals, with an emphasis on TASSEL.

Date: April 20, 2015
Time: 11:00 AM
Location: Weill hall, Room 321
Slides

Date: Mar 2, 2015. Minghui Wang: Genome-wide crossover distribution in the population of maize B73 and Mo17

2015-02-26T09:52:00.001-05:00

Minghui Wang is a Postdoctoral Associate at the BRC Bioinformatics Facility on campus.

Crossovers (COs) are essential for the accurate segregation of homologous chromosomes at the first meiotic division. However, COs are not evenly distributed across the genome; their number and location are tightly regulated. Here, we report a detailed, genome-wide characterization of the rate and localization of COs in maize, in male and female meiosis.

Date: Mar 2, 2015
Time: 11:00 AM
Location: Weill hall, Room 321
Slides

Date: Feb 2, 2015. Zehong Ding: Comparison of leaf gradient transcriptomics in multiple C3 and C4 species

2015-01-27T23:34:00.000-05:00

Zehong Ding is a Postdoctoral Associate at the BRC Bioinformatics Facility on campus.

Transferring C4 photosynthesis into C3 crops has been proposed as one of the most promising ways to increase the yield ceiling and hence global productivity. To better understand the function of C4 photosynthesis, and to identify candidate genes associated with the C4 pathway, comparative transcriptome analyses were conducted along a leaf developmental gradient in maize, Setaria viridis, sorghum and rice. In total, 478 C4 candidate genes were identified. Besides the classical C4 genes, these included many functionally well-characterized genes associated with the light reactions, starch and sucrose metabolism, hormones, transcription factors (TFs), and transporters. These findings will provide important insights into gene differentiation between C3 and C4 species. In addition, the C4 candidate genes identified by our approach should be a useful resource for C4 engineering of C3 crops.

Date: Feb 2, 2015
Time: 11:00 AM
Location: Weill hall, Room 321
Slides

Date: Dec 8, 2014. Brandon Barker: Autosave for Research: Checkpoint and Restart Computational Workloads

2014-12-04T01:45:00.000-05:00

Brandon Barker is a Computational Scientist working at the Cornell Center for Advanced Computing with research interests in safety-critical programming, parallel computing and linear modelling of metabolic systems.

It is not uncommon to have computational analyses running for many days or weeks. Software, hardware, and power failures all present the possibility that a significant amount of work could be lost. In some cases, the programmer can incrementally save data at specific intervals, but this is an error-prone process, and it is time-consuming to implement for each application. If a failure occurs, all data not saved will need to be regenerated, and the researcher can only hope that another failure won't occur.

There is a general solution known as Checkpoint/Restart (C/R) that can work for serial applications and a large variety of parallel applications. The primary advantage of all C/R implementations is the automation of saving program state at specified increments and allowing the program to be resumed. There are numerous C/R frameworks and implementations, each offering various advantages and disadvantages; despite some implementations being very mature, C/R remains an area of open and active research as no single solution covers every application type. By knowing the capabilities and drawbacks of each C/R solution, as well as the requirements and specifications of your application, it should be straightforward to choose a C/R framework that is right for you.

In this talk, we discuss in more detail what checkpointing is, and several scenarios where one would and would not want to use it. Next we discuss several popular checkpointing solutions, giving examples when pertinent of software that would not work well with each solution. Finally, we give some simple examples of how checkpointing can be implemented on your own system (Linux currently required).
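To make the manual approach mentioned above concrete (the programmer saving data at intervals, as opposed to a C/R framework doing it automatically), here is a minimal Python sketch. It is not taken from the talk: the function names, the pickle file format and the toy workload (summing integers) are all assumptions chosen for illustration. It writes the checkpoint to a temporary file and renames it into place, so that a failure during the write does not corrupt the previous checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path via a temp file and rename, so a crash mid-write
    leaves the previous checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic when both paths are on one filesystem


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(total_steps, path, every=100, stop_after=None):
    """Toy workload: sum the integers 0..total_steps-1, saving every `every` steps.

    stop_after simulates a failure by returning None after that many steps
    in the current invocation.
    """
    state = load_checkpoint(path, {"next_step": 0, "total": 0})
    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated crash: work since the last save is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

Calling run() again with the same path after a simulated failure resumes from the last saved step rather than from zero. Even this small example has to decide what counts as state and when it is safe to save it, which is the per-application effort that C/R frameworks aim to remove.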

Date: Dec 8, 2014
Time: 11:00 AM
Location: Weill hall, Room 121
Webex (webcast)
Slides

Date: Nov 3, 2014. Steve Lantz: Parallel MATLAB: the Parallel Computing Toolbox, MDCS, and Red Cloud

2014-10-27T10:26:00.000-04:00

Steve Lantz is a Senior Research Associate working at the Cornell Center for Advanced Computing with research interests in numerical modeling, fluid dynamics and parallel computing.

MATLAB can be very useful as a tool for data analysis and interaction. In the typical scenario, you simply run it on a single-user computing resource like a laptop, controlling it through some convenient combination of scripts, commands, and the GUI. But what happens when your intended analysis starts to take days to run, instead of hours? Or what if your script starts crashing because you have exceeded your local memory?

The Parallel Computing Toolbox (PCT) gives you features that can help you overcome these performance and memory limitations. First, it allows you to write code that may take better advantage of the multiple cores on your local machine, or perhaps even its GPU. If that is insufficient, PCT also allows you to connect your local MATLAB client to remote resources based on the MATLAB Distributed Computing Server (MDCS) software. These remote resources become an extension of your local client, so that you can do large-scale, batch-style processing straight from your laptop. Many of the same PCT strategies that you use to enhance local execution are also able to exploit MDCS; furthermore, you can use a whole cluster’s memory in an aggregated fashion.

In this talk, Dr. Lantz will give a quick overview of the various capabilities provided by PCT. He will look at how scripts can be scaled up from your multi-core laptop all the way to cluster-scale MDCS resources. Finally, he will present CAC’s Red Cloud with MATLAB service as an on-campus source of the extra cycles and memory that you may sometimes require to get your work done. The process of connecting your client to CAC (or any similar MDCS-based service) will also be described.

Date: Nov 3, 2014
Time: 11:00 AM
Location: Weill hall, Room 121

Slides

Date: Oct 13, 2014. Julia Goodrich: Conducting a microbiome study

2014-10-02T14:16:00.000-04:00

Julia Goodrich is a PhD student in the lab of Dr. Ruth Ley in the Department of Molecular Biology and Genetics and the Department of Microbiology at Cornell University.

Human microbiome research is an actively developing area of inquiry, with ramifications for our lifestyles, our interactions with microbes, and how we treat disease. Bacterial and archaeal 16S rRNA gene sequence data from complex microbial communities present statistical and computational challenges. She will present some of the standard techniques currently used to characterize microbial communities.

Date: Oct 13, 2014
Time: 11:00 AM
Location: Weill hall, Room 121

Slides

Date: Sep 29, 2014. Adam Brazier: Workflows and Data management

2014-09-15T14:37:00.001-04:00

Adam Brazier is a computational scientist working at the Cornell Center for Advanced Computing, having worked previously in the Cornell Astronomy Department.

Workflows have always been a part of the scientific research enterprise, but as their complexity and scale have grown, so have the demands on modelling, planning, implementing and monitoring them. Inextricably tied to the workflow, and extending past it, is data management. Dr. Brazier will talk about both topics in the era of growing, distributed collaborations.

Date: Sep 29, 2014
Time: 11:00 AM
Location: Weill hall, Room 121

Slides

Date: Aug 11, 2014. Haruo Suzuki: Microbial genome analysis using the G-language system

2014-07-31T15:02:00.000-04:00

Haruo Suzuki is an Associate Professor at Yamaguchi University, Japan. You can find his publications here.

He will talk about research on GC skew (predicting the origin and terminus of DNA replication), codon usage (predicting highly expressed or horizontally transferred genes), and dinucleotide composition (classifying DNA sequences and predicting plasmid hosts), followed by an introduction to the data sources (NCBI GenBank and FASTA files) and a hands-on demo of G-language Web services.

The Web services provide URL-based access to all functions of the G-language Genome Analysis Environment. For example, information for the Staphylococcus aureus N315 genome (NC_002745) is given by http://rest.g-language.org/NC_002745, information for the tst gene is shown by http://rest.g-language.org/NC_002745/tst, and GC skew is computed by http://rest.g-language.org/NC_002745/gcskew. Please bring your smartphone, tablet or laptop to follow the demonstration.
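The URL pattern in the examples above (base address, then accession, then an optional gene name or function name) can be sketched in a few lines of Python. This is only an illustration built from the three example URLs; the helper names are hypothetical, and whether the service is still reachable at this address is not something the sketch checks.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address taken from the example URLs in the announcement.
BASE_URL = "http://rest.g-language.org"


def g_language_url(accession, *path_parts):
    """Build a REST URL from an accession and optional path parts
    (a gene name such as "tst" or a function name such as "gcskew")."""
    parts = [quote(str(part), safe="") for part in (accession, *path_parts)]
    return BASE_URL + "/" + "/".join(parts)


def fetch(url, timeout=30):
    """Fetch a URL and return the raw response bytes (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

For example, g_language_url("NC_002745", "gcskew") reproduces the GC skew URL given above, and passing the result to fetch() would download whatever the service returns for it.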

Date: Aug 11, 2014
Time: 11:00 AM
Location: Weill hall, Room 321

Slides

Date: Apr 28, 2014. Robert Bukowski: HapMap3: large-scale genotyping for Zea mays

2014-04-22T13:15:00.000-04:00

Robert Bukowski is a Senior Research Associate at the BRC Bioinformatics Facility on campus.

The maize HapMap project, currently at its third release, focuses on capturing allelic variation, including rare alleles, across a highly diverse set of maize germplasm. It is part of an ongoing effort in the pan-zea community to identify variants responsible for complex trait variation. I will present computational techniques we use to characterize genotypes of about 1,000 maize lines from Illumina whole-genome sequencing data.

Date: Apr 28, 2014
Time: 11:00 AM
Location: Weill hall, Room 221

Slides

Date: Mar 10, 2014. Aureliano Bombarely: Insights into Plant Genome Sequence Assembly

2014-03-04T15:04:00.002-05:00

Aureliano Bombarely is a Research Associate at the Doyle Lab in the Plant Biology Department. He was formerly with the Sol Genomics group at BTI. Please see here for recent publications.

Recent developments in genome assembly tools and algorithms will be presented. There will be particular focus on challenges associated with assembling polyploid plant genomes. Methods to assess quality and completeness in order to compare multiple assemblies will also be covered.

Date: Mar 10, 2014
Time: 4:00 PM
Location: Weill hall, Room 321

Slides

Date: Feb 24, 2014. Jaroslaw Pillardy: Solutions for large data storage computing infrastructure

2014-02-13T13:59:00.000-05:00

Jarek is the Director of the BRC Bioinformatics Facility on campus.

Large data storage infrastructure problems and solutions based on the Bioinformatics Facility's experience will be discussed, with a focus on hardware issues. What do we use and why do we use it? How do we achieve a compromise between cost, performance and maintenance?

Date: Feb 24, 2014
Time: 11:00 AM
Location: Weill hall, Room 221

Slides

Date: Nov 22, 2013. Lukas Mueller: Git

2013-11-18T09:44:00.003-05:00

The first talk will be on Git by Lukas Mueller from the Sol Genomics Network. It's highly recommended that you bring your laptop with Git installed for hands-on exercises. You can download Git here.

Date: Nov 22, 2013
Time: 2:00 PM
Location: Weill hall, Room 321

Slides

Some useful links
Git for Scientists
Git branching model

The BTI Perl Club is being revived as the Bioinformatics Practitioners Club

2013-11-08T19:26:00.002-05:00

This will be a forum for practicing bioinformaticians on the Ithaca campus to discuss technology. There are many introductory courses as well as domain-specific journal clubs already available. See here for a list. The focus of the Bioinformatics Practitioners Club will be developments in methods, practices and new tools for bioinformatics. The meetings will typically be held once a month. The time and location details for the next talk will be sent out soon on the mailing list.

Date: Oct 30, 2012. Jeremy Edwards. Topic: Pedigrees, GraphViz, SVG

2012-10-30T14:11:00.003-04:00

GraphViz: from AT&T

SVG

Pedigrees in SGN

Slides

Date: Oct 16, 2012. Presenter: Qi Sun. Topic: Ensembl & Multithreading

2012-10-29T21:51:00.000-04:00

Qi Sun

Overview of Ensembl system.
Implementing multi-processing in Perl (with example code)

Presentation

Date: April 10: Dave Matthews. FastBit & T3 Phenotypes

2012-04-10T22:10:00.001-04:00

Today, Dave Matthews gave an overview of FastBit (which is part of FastQuery for HDF5).

Slides

Dave also demonstrated some of the capabilities of the T3 phenotypic database.

Date: March 27. Naama Menda. Storing phenotypes

2012-03-27T14:11:00.002-04:00

Storing Phenotypes.

Presentation: Naama Menda (BTI / SGN)

Slides

What is a phenotype?
How can we store phenotypes?

Phenotypes include morphology, development, behavior, biochemical properties, molecular characteristics. Dependent on genotype and the environment.

Quantitative vs qualitative.

Metadata: weather information, who collected, how measured, etc.

Storing phenotypes: Chado phenotype modules. Problem: Chado tables too generic. One table captured all aspects of a phenotype (a cvterm for the entity and a cvterm for the quality, etc., plus a value, cvalue_id (cvterm), and attribute_id). uniquename has a free text description that is basically for qualitative phenotypic information. This table is too limited.

Thinking not about phenotypes, but about natural diversity. Create a Natural Diversity Chado module, which can store lots of metadata about experiments and stocks.

Chado uses attribute/value, like PATO. Came from FlyBase, old. Not EQV, entity/quality/value (fruit/color/red). FlyBase doesn't actually use the IDs, just text; and doesn't give the values, just names the traits. But Sol can't change the Chado schema because it would affect FlyBase.
"Post-composed" phenotypes like fruit/color/red. Can they be reused for different experiments?
Now the gmod.org/wiki/Chado_Natural_Diversity_Module.
Interactive QTL mapping in biparental populations. Want to add association mapping and genomic selection.
Chado "property" tables, generic.

Storing post-composed terms: work in progress. Who wants to help?
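As a rough illustration of the entity/quality/value idea in these notes, and of the question of whether one post-composed term can be reused across experiments, here is a small Python sketch. It is not the Chado schema and not SGN code: every class, field and stock name below is hypothetical, and plain strings stand in for ontology term IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostComposedTerm:
    """A phenotype composed from three parts, e.g. fruit/color/red."""
    entity: str
    quality: str
    value: str

    def label(self):
        return f"{self.entity}/{self.quality}/{self.value}"


@dataclass(frozen=True)
class Observation:
    """One recorded phenotype for a stock in an experiment."""
    stock: str
    experiment: str
    term: PostComposedTerm


class TermRegistry:
    """Stores each entity/quality/value triple once so observations can share it."""

    def __init__(self):
        self._terms = {}

    def get(self, entity, quality, value):
        key = (entity, quality, value)
        if key not in self._terms:
            self._terms[key] = PostComposedTerm(*key)
        return self._terms[key]

    def __len__(self):
        return len(self._terms)
```

With a registry like this, two observations from different experiments that both record fruit/color/red point at the same stored term, which is one possible answer to the reuse question; the metadata listed earlier (who collected, how measured) would hang off the observation rather than the term.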

Date: March 13, Qi Sun: Next generation genotype storage

2012-03-13T14:10:00.001-04:00

Qi Sun talks about next generation genotype storage databases, including NetCDF4 and HDF5.

NetCDF4 is basically HDF5 with certain restrictions (a subset of HDF5). If NetCDF4 is used, both NetCDF4 and HDF5 code libraries can be used. NetCDF4 has Perl libraries available, while HDF5 does not. NetCDF4 also has a pure Java library. The Integrative Genomics Viewer (IGV) used HDF5 as its database engine, but has since moved to another format.

In HDF5, the equivalents of directories are called "groups" and the equivalents of files are called "datasets". Metadata is stored as "attributes" (embedded in a dataset).

HDF5 datasets can be browsed using a browser (HDFView).

Example of an HDF5 store: a root directory containing a project directory, which contains one folder per chromosome; for each chromosome, a dataset with a matrix of scores (containing 1-byte values); two additional datasets encode row and column headers (as fixed-width strings).
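The layout described above can be mocked up without any HDF5 library. The Python sketch below uses nested dicts as stand-ins for groups and plain bytes objects as stand-ins for datasets; it does not read or write HDF5 files. The header width of 16 bytes, the dataset names and the function names are assumptions made for the example, not details from the talk.

```python
HEADER_WIDTH = 16  # assumed width for the fixed-width header strings


def fixed_width(names, width=HEADER_WIDTH):
    """Encode names as fixed-width ASCII byte strings, padded with spaces."""
    encoded = []
    for name in names:
        raw = name.encode("ascii")
        if len(raw) > width:
            raise ValueError(f"{name!r} is longer than {width} bytes")
        encoded.append(raw.ljust(width))
    return encoded


def build_store(project, chromosomes):
    """Return nested dicts mirroring root -> project -> one group per chromosome.

    chromosomes maps a chromosome name to (row_names, column_names, scores),
    where scores is a list of rows of integers in 0..255 (one byte each).
    """
    project_group = {}
    for chrom, (rows, cols, scores) in chromosomes.items():
        if len(scores) != len(rows) or any(len(r) != len(cols) for r in scores):
            raise ValueError(f"score matrix does not match headers for {chrom}")
        project_group[chrom] = {
            "scores": [bytes(row) for row in scores],  # 1 byte per value
            "row_headers": fixed_width(rows),
            "column_headers": fixed_width(cols),
        }
    return {"/": {project: project_group}}
```

Keeping the headers in their own fixed-width datasets means the score matrix itself stays a uniform block of 1-byte values, and a row or column name can be looked up by position without parsing variable-length text.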

Transforming a table to an HDF5 store: row by row or column by column (slow vs fast dimension). Data chunking is also possible (blocks of consecutive data, then a jump to the next line, etc.). The choice has a huge effect on performance. Chunked storage offers a compromise in performance for both dimensions.
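A little arithmetic shows why the layout matters. The Python sketch below counts how many stored chunks must be read to retrieve one full row or one full column of a matrix, for a given chunk shape. It is a simplified model that ignores caching and compression; the 10,000 x 10,000 matrix size and the 100 x 100 chunk shape used in the example are made-up numbers, not figures from the talk.

```python
from math import ceil


def chunks_touched(n_rows, n_cols, chunk_rows, chunk_cols):
    """Return (chunks read for one full row, chunks read for one full column).

    A row spans every column, so it crosses ceil(n_cols / chunk_cols) chunks;
    a column spans every row, so it crosses ceil(n_rows / chunk_rows) chunks.
    """
    per_row = ceil(n_cols / chunk_cols)
    per_col = ceil(n_rows / chunk_rows)
    return per_row, per_col


# Three layouts for a hypothetical 10,000 x 10,000 matrix.
N = 10_000
row_by_row = chunks_touched(N, N, 1, N)        # each row stored contiguously
col_by_col = chunks_touched(N, N, N, 1)        # each column stored contiguously
square_tiles = chunks_touched(N, N, 100, 100)  # 100 x 100 chunks
```

In this model, row-by-row storage reads a row from 1 chunk but a column from 10,000; column-by-column storage is the mirror image; and 100 x 100 chunks need 100 reads in either direction, which is the compromise for both dimensions described above.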

Encoding of genotypes. Simplest is two bytes per genotype. Another possibility is IUPAC nomenclature, 1 byte per data point. The most compressed option encodes genotypes with 2 bits each. Missing data: 0-0. polymorphism: 1-0 0-1 (heterozygotes). 1-1 homozygote (no polymorphism).
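The 2-bit option can be sketched as packing four genotype codes into each byte. The Python below is one possible implementation, not the one presented in the talk: the function names and the choice to put the first genotype in the high bits of each byte are assumptions. The codes are just the bit pairs from the notes (0-0, 1-0, 0-1, 1-1) written as integers 0 to 3; the sketch does not interpret them further.

```python
def pack_genotypes(codes):
    """Pack 2-bit codes (integers 0..3) four to a byte, first code in the high bits."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"code {code} does not fit in 2 bits")
        shift = 6 - 2 * (i % 4)
        packed[i // 4] |= code << shift
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover the first `count` 2-bit codes from packed bytes."""
    if count > 4 * len(packed):
        raise ValueError("count exceeds the number of packed codes")
    codes = []
    for i in range(count):
        shift = 6 - 2 * (i % 4)
        codes.append((packed[i // 4] >> shift) & 0b11)
    return codes
```

Compared with two bytes per genotype this is an eightfold reduction. One design consequence: unused bits in the last byte are zero, which is the same bit pattern as the 0-0 missing code, so the number of genotypes has to be stored separately (for example as dataset metadata) rather than inferred from the byte length.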

This will be incorporated into TASSEL.

Jarek developed an interface based on a network socket.

Date: March 6th, 2012: Genevieve DeClerck, Genotyping module

2012-03-13T14:09:00.001-04:00

Genevieve DeClerck (Gramene) talked about the Gramene genotypes and phenotypes modules.

GDPDM database schema.

PackSNP.pm module. (Link?)

Slides

2011-11-01T09:32:00.001-04:00

Date: Nov 8, 2011.

Topic: RNASeq data processing pipeline developed in the Brutnell lab

Presenter: Lin Wang

Slides: [not yet]

2011-11-01T09:30:00.001-04:00

Date: Nov 1, 2011.
Topic: MySQL database performance tuning.
Presenter: Clayton Birkett, USDA

Clay discussed several ways to increase database performance, including effects of data denormalization, use of different storage engines, distributed databases (incl. memcached), and use of solid state drives.

Slides