Thursday, December 4, 2014

Date: Dec 8, 2014. Brandon Barker: Autosave for Research: Checkpoint and Restart Computational Workloads

Brandon Barker is a Computational Scientist at the Cornell Center for Advanced Computing with research interests in safety-critical programming, parallel computing, and linear modelling of metabolic systems.

It is not uncommon for computational analyses to run for many days or weeks. Software, hardware, and power failures all present the possibility that a significant amount of work could be lost. In some cases, the programmer can incrementally save data at specific intervals, but this is an error-prone process, and it is time-consuming to implement for each application. If a failure occurs, all unsaved data will need to be regenerated, and the researcher can only hope that another failure won't occur in the meantime.

There is a general solution known as Checkpoint/Restart (C/R) that works for serial applications and a large variety of parallel applications. The primary advantage of all C/R implementations is that they automate saving program state at specified increments and allow the program to be resumed. There are numerous C/R frameworks and implementations, each offering various advantages and disadvantages; although some implementations are very mature, C/R remains an area of open and active research, as no single solution covers every application type. By knowing the capabilities and drawbacks of each C/R solution, as well as the requirements of your application, it should be straightforward to choose a C/R framework that is right for you.

In this talk, we discuss in more detail what checkpointing is and several scenarios where one would, and would not, want to use it. Next, we discuss several popular checkpointing solutions, giving pertinent examples of software that would not work well with each. Finally, we give some simple examples of how checkpointing can be implemented on your own system (Linux currently required).
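The "incrementally save data at specific intervals" approach mentioned above is application-level checkpointing, as opposed to the framework-level C/R solutions covered in the talk. As a minimal Python sketch (the checkpoint file name, state layout, and interval are all arbitrary choices for illustration), the key ideas are periodic saves and an atomic rename so a crash mid-write cannot corrupt the last good checkpoint:

```python
import os
import pickle

CKPT = "state.ckpt"

def load_state():
    """Resume from a previous checkpoint if one exists."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}  # fresh start

def save_state(state):
    """Write to a temp file, then rename: the rename is atomic on
    POSIX, so a crash never leaves a half-written checkpoint."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for i in range(state["i"], 1_000_000):   # stand-in for a long computation
    state["total"] += i
    state["i"] = i + 1
    if i % 100_000 == 0:                 # checkpoint at specified increments
        save_state(state)
save_state(state)                        # final checkpoint on completion

print(state["total"])  # prints 499999500000
```

Re-running the script after an interruption resumes from the last checkpoint instead of starting over; the trade-off, as noted above, is that this logic must be written and debugged per application, which is exactly what generic C/R frameworks avoid.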

Date: Dec 8, 2014
Time: 11:00 AM
Location: Weill hall, Room 121
Webex (webcast)
Slides

Monday, October 27, 2014

Date: Nov 3, 2014. Steve Lantz: Parallel MATLAB: the Parallel Computing Toolbox, MDCS, and Red Cloud

Steve Lantz is a Senior Research Associate at the Cornell Center for Advanced Computing with research interests in numerical modeling, fluid dynamics, and parallel computing.

MATLAB can be very useful as a tool for data analysis and interaction. In the typical scenario, you simply run it on a single-user computing resource like a laptop, controlling it through some convenient combination of scripts, commands, and the GUI. But what happens when your intended analysis starts to take days to run, instead of hours? Or what if your script starts crashing because you have exceeded your local memory?

The Parallel Computing Toolbox (PCT) gives you features that can help you overcome these performance and memory limitations. First, it allows you to write code that may take better advantage of the multiple cores on your local machine, or perhaps even its GPU. If that is insufficient, PCT also allows you to connect your local MATLAB client to remote resources based on the MATLAB Distributed Computing Server (MDCS) software. These remote resources become an extension of your local client, so that you can do large-scale, batch-style processing straight from your laptop. Many of the same PCT strategies that you use to enhance local execution are also able to exploit MDCS; furthermore, you can use a whole cluster’s memory in an aggregated fashion.

In this talk, Dr. Lantz will give a quick overview of the various capabilities provided by PCT. He will look at how scripts can be scaled up from your multi-core laptop all the way to cluster-scale MDCS resources. Finally, he will present CAC’s Red Cloud with MATLAB service as an on-campus source of the extra cycles and memory that you may sometimes require to get your work done. The process of connecting your client to CAC (or any similar MDCS-based service) will also be described.

Date: Nov 3, 2014
Time: 11:00 AM
Location: Weill hall, Room 121

Slides

Thursday, October 2, 2014

Date: Oct 13, 2014. Julia Goodrich: Conducting a microbiome study

Julia Goodrich is a PhD student in the lab of Dr. Ruth Ley in the Department of Molecular Biology and Genetics and the Department of Microbiology at Cornell University.

Human microbiome research is an actively developing area of inquiry, with ramifications for our lifestyles, our interactions with microbes, and how we treat disease. Bacterial and archaeal 16S rRNA gene sequence data from complex microbial communities present statistical and computational challenges. Goodrich will present some of the standard techniques currently used to characterize microbial communities.


Date: Oct 13, 2014
Time: 11:00 AM
Location: Weill hall, Room 121

Slides

Monday, September 15, 2014

Date: Sep 29, 2014. Adam Brazier: Workflows and Data management

Adam Brazier is a computational scientist working at the Cornell Center for Advanced Computing, having worked previously in the Cornell Astronomy Department.

Workflows have always been part of the scientific research enterprise, but as their complexity and scale have grown, so have the demands of modelling, planning, implementing, and monitoring them. Inextricably tied to the workflow, and extending beyond it, is data management. Dr. Brazier will talk about both topics in the era of growing, distributed collaborations.


Date: Sep 29, 2014
Time: 11:00 AM
Location: Weill hall, Room 121

Slides

Thursday, July 31, 2014

Date: Aug 11, 2014. Haruo Suzuki: Microbial genome analysis using the G-language system

Haruo Suzuki is an Associate Professor at Yamaguchi University, Japan. You can find his publications here.

He will talk about research on GC skew (predicting the origin and terminus of DNA replication), codon usage (predicting highly expressed or horizontally transferred genes), and dinucleotide composition (classifying DNA sequences and predicting plasmid hosts), followed by an introduction to the data sources (NCBI GenBank and FASTA files) and a hands-on demo of G-language Web services.

The Web services provide URL-based access to all functions of the G-language Genome Analysis Environment. For example, information for the Staphylococcus aureus N315 genome (NC_002745) is given by http://rest.g-language.org/NC_002745, information for the tst gene is shown by http://rest.g-language.org/NC_002745/tst, and GC skew is computed by http://rest.g-language.org/NC_002745/gcskew. Please bring your smartphone, tablet, or laptop to follow the demonstration.
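The three example URLs above follow one pattern: the base service address, an accession number, and optional path segments naming a gene or analysis. A small Python helper makes that structure explicit (`glang_url` is our own illustrative name, not part of the service; fetching a built URL is then a one-liner with urllib):

```python
BASE = "http://rest.g-language.org"

def glang_url(accession, *operations):
    """Build a G-language REST URL: BASE/<accession>[/<gene-or-analysis>...]."""
    return "/".join([BASE, accession, *operations])

# The three examples from the announcement:
print(glang_url("NC_002745"))            # genome information
print(glang_url("NC_002745", "tst"))     # the tst gene
print(glang_url("NC_002745", "gcskew"))  # GC skew analysis

# To actually retrieve a result (requires network access):
# from urllib.request import urlopen
# body = urlopen(glang_url("NC_002745", "gcskew")).read()
```

Because every function is just a URL, the same calls work equally well from a browser on a phone or tablet, which is what makes the hands-on demo practical.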

Date: Aug 11, 2014
Time: 11:00 AM
Location: Weill hall, Room 321

Slides

Tuesday, April 22, 2014

Date: Apr 28, 2014. Robert Bukowski: HapMap3: large-scale genotyping for Zea mays

Robert Bukowski is a Senior Research Associate at the BRC Bioinformatics Facility on campus.

The maize HapMap project, currently at its third release, focuses on capturing allelic variation, including rare alleles, across a highly diverse set of maize germplasm. It is part of an ongoing effort in the pan-zea community to identify variants responsible for complex trait variation. I will present the computational techniques we use to characterize genotypes of about 1,000 maize lines from Illumina whole-genome sequencing data.

Date: Apr 28, 2014
Time: 11:00 AM
Location: Weill hall, Room 221

Slides

Tuesday, March 4, 2014

Date: Mar 10, 2014. Aureliano Bombarely: Insights into Plant Genome Sequence Assembly

Aureliano Bombarely is a Research Associate at the Doyle Lab in the Plant Biology Department. He was formerly with the Sol Genomics group at BTI. Please see here for recent publications.

Recent developments in genome assembly tools and algorithms will be presented, with a particular focus on the challenges associated with assembling polyploid plant genomes. Methods to assess quality and completeness in order to compare multiple assemblies will also be covered.

Date: Mar 10, 2014
Time: 4:00 PM
Location: Weill hall, Room 321

Slides

Thursday, February 13, 2014

Date: Feb 24, 2014. Jaroslaw Pillardy: Solutions for large data storage computing infrastructure

Jarek is the Director of the BRC Bioinformatics Facility on campus.

Large data storage infrastructure problems, and solutions based on the Bioinformatics Facility's experience, will be discussed with a focus on hardware issues. What do we use and why do we use it? How can one achieve a compromise among cost, performance, and maintenance?

Date: Feb 24, 2014
Time: 11:00 AM
Location: Weill hall, Room 221

Slides