[BDCSG2008] Computational Paradigms for Genomic Medicine (Jill Mesirov)

Jill is reviewing what is going on with data and biology. There has been an explosion in the amount of data they generate, in both volume and throughput. Simulations have also become common practice, as have robot operations, and both produce more and more data. Some numbers: their center now uses 4.8K processors and 1440+ terabytes of storage. The challenge is to give the proper tools to biologists (not CS people). The two key topics of the talk are computational paradigms and computational foundations.

They rely heavily on gene expression arrays (rows are patients, columns are genes, values are expression levels). A simple example is classifying leukemias, which shows how subtypes can be distinguished using expression arrays. From patient samples they extract messenger RNA and then create the expression signatures (high dimensionality, small training sample set). They repeated the same approach to predict outcome (prognosis) in brain cancer, but for this problem there was no signal strong enough to make the predictions accurate. Genes work in regulatory networks (sets of genes), so they tried doing the analysis at the level of gene sets, which acts as adding background knowledge to the problem. This boosted the results and made treatment possible.

The problem is that there should be an infrastructure that is easy to use and makes experiments replicable. The infrastructure should integrate components and let them interact. It should support technical users and non-programmers equally, through two interfaces (visual and programmatic). It should give access to a library of functions, let users write pipelines, be language agnostic, and be built on web services (scalable from the laptop to clusters). Its name is GenePattern. They are collaborating with Microsoft on a tool for Word documents that links to the pipelines and the data from within the document (which can be rerun with another version) and appends the results to the document too.
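To make the expression-array layout concrete (rows as patients, columns as genes), here is a minimal sketch in Python. It is not the method from the talk: the nearest-centroid classifier is a simple stand-in, the data values and class labels are made up, and the names `class_centroids`, `classify`, and `gene_set_score` are hypothetical. The gene-set score, a plain average over a set of genes, is only an assumed illustration of analyzing sets of genes instead of single genes.

```python
from statistics import mean

# Toy expression matrix: one row per patient, one column per gene.
# All values and labels below are made up for illustration.
GENES = ["g1", "g2", "g3", "g4"]
EXPRESSION = [
    [9.0, 8.5, 1.0, 1.2],  # patient 0
    [8.7, 9.1, 0.8, 1.5],  # patient 1
    [1.1, 0.9, 8.8, 9.3],  # patient 2
    [1.4, 1.2, 9.0, 8.6],  # patient 3
]
LABELS = ["A", "A", "B", "B"]  # hypothetical class labels


def class_centroids(rows, labels):
    """Average the expression profile of each class, gene by gene."""
    centroids = {}
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def classify(sample, centroids):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(sample, centroids[label]))


def gene_set_score(sample, genes, gene_set):
    """Collapse a set of genes into one score by averaging their expression."""
    return mean(sample[genes.index(g)] for g in gene_set)


centroids = class_centroids(EXPRESSION, LABELS)
new_patient = [8.9, 8.8, 1.1, 1.0]
predicted = classify(new_patient, centroids)
set_score = gene_set_score(new_patient, GENES, ["g1", "g2"])
print(predicted, set_score)
```

With real arrays there are thousands of gene columns and only a few dozen patient rows, which is the high-dimensionality, small-sample setting the notes mention. Scoring sets of genes is one way to reduce that dimensionality before classifying.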