Posts

Meandre: Semantic-Driven Data-Intensive Flow Engine

Finally we have finished setting up the website for Meandre, a semantic-driven data-intensive flow engine. Meandre provides basic infrastructure for data-intensive computation. It provides, among others, tools for creating components and flows, a high-level language to describe flows, and a multicore and distributed execution environment based on a service-oriented paradigm. We are currently gearing up for a first alpha release. You can visit the Meandre site here. I will be posting in the Meandre blog about our current steps toward getting the release out of the door. The Meandre infrastructure is being built to support the SEASR project ...

[BDCSG2008] Summary of BDCSG2008 blogging

It has been a great meeting. Lots of interesting ideas and a lot to explore from now on. Just what I like :D. Below I summarize the list of posts I made related to the meeting.

- Introductory post
- [Data-Intensive Scalable Computing. Randy Bryant, CMU](/posts/data-intensive-scalable-computing-randy-bryant.md)
- Text Information Management: Challenges and Opportunities. ChengXiang Zhai, UIUC
- Clouds and ManyCore: The Revolution. Dan Reed, MSR
- Computational Paradigms for Genomic Medicine. Jill Mesirov, Broad Institute of MIT and Harvard
- Simplicity and Complexity in Data Systems (Garth Gibson)
- Handling Large Datasets at Google: Current Systems and Future Directions. Jeff Dean, Google
- Algorithmic Perspectives on Large-Scale Social Network Data. Jon Kleinberg, Cornell
- Mining the Web Graph. Marc Najork, MSR
- “What” Goes Around. Joe Hellerstein, Berkeley
- Sherpa: Hosted Data Serving. Raghu Ramakrishnan, Yahoo!
- Scientific Applications of Large Databases. Alex Szalay, JHU
- Data-Rich Computing: Where It’s At. Phil Gibbons, Intel
- NSF Plans for Supporting Data Intensive Computing: Jeannette Wing, NSF.
- The Google/IBM data center: Christophe Bisciglia, Google

Blogging from the Big Data Computing Study Group 2008

I was lucky to attend the Big Data Computing Study Group 2008. The lineup of speakers is impressive. The event was held at Yahoo! Sunnyvale, and Thomas Kwan (a UIUC alumnus now at Yahoo!) helped organize it. I blogged about it on my DITA blog, where you can find links to all the related posts.

[BDCSG2008] NSF Plans for Supporting Data Intensive Computing (Jeannette Wing and Christophe Bisciglia)

NSF listens to you academics. Jeannette opens the floor with this claim. Questions: What are the limitations of this (data-intensive) modeling paradigm? What are meaningful metrics of performance here? What about the security of processes and data on a shared resource? How can we reduce power consumption? Can this paradigm solve problems not possible otherwise, simplify them, or open the door to new applications? NSF is rolling out a cluster exploratory program and is also going to roll out a new solicitation for Data-Intensive Computing. It is also emphasizing the path from data to knowledge, since scientists are throwing data away. This is a great opportunity for collaborative efforts between CS and scientists. NSF goal: provide access to cluster resources and to massive data sets. Google and IBM are rolling out the cluster (for academics). The cluster exploratory will be the solicitation program, announced yesterday, that distributes access to the cluster and research grants. Christophe reviewed his experience teaching a class about clustering, where he realized that giving away computer cycles is more valuable than plain grant money. It runs on Hadoop. The cluster will be allocated in rack-weeks, with 5 terabytes and priority on 80 processes (though other people will still be there, at lower priority and with large data sets). Since reviewing is not Google's expertise, they reached out to NSF to handle it. Googlers will start collaborations, and IBM will also help by providing support for it. Jeannette claims this is a new model, but NSF is open to new models and other partners. ...

[BDCSG2008] Data-Rich Computing: Where It’s At (Phil Gibbons)

The next speaker of the afternoon is Phil Gibbons from Intel Research. Intel has created a research theme on data-rich computing for the next few years (the same one presented at the Hadoop Summit, about ground modeling). One approach is to bring the computation to the data (the cluster approach), but there are two more elements in the picture: (1) memory hierarchy issues, and (2) pervasive multimedia sensing. The first one is important for pure performance; the second one keeps pushing the computation closer to the sensors. The memory hierarchy implies that multiple cores share a common L2 cache, that bandwidth drops the farther we move from the cores, and that you can keep pushing out to HD/SSD. This basic unit is what gets replicated to build clusters. (Little note about the SSD rewriting quirk and how cache coherency plays into the overall picture.) All this led to the HI-SPADE project (Hierarchy-Savvy Parallel Algorithm Design). The goal is to hide as much as possible and expose only what is important to tune. He continues with an example of cache misses and how they can be palliated with the right scheduler. Phil then moved on to show how that would work on parallel merge sort (a solution à la depth-first search), later compared with hash join, and again, no free lunch (they run into the usual situation: one works like a champ, the other fails miserably). The next topic of the presentation went to the quirks of SSDs (flash based). The improvements over traditional HD reach three orders of magnitude. But again, random rewrites in flash are painful, although expressing algorithms in a semi-random way may help palliate the problem. Shifting gears, pervasive multimedia sensing is the next topic in the arena. Phil starts by reviewing sensor networks and how the sensors are becoming more powerful but, more importantly, how their numbers are growing and scaling up (exponentially).
Again, he moves to their example, the IrisNet project, and how pushing computation down also helps with the distribution of the data, pushing results as (XML) feeds into aggregation nodes arranged in a tree. Once aggregated, they provide the usual distribution, replication, and querying. ...
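
To make the tree-shaped aggregation idea concrete, here is a minimal sketch of my own, not IrisNet code. It assumes plain numeric readings instead of XML feeds, and the `Summary` and `Node` names and the example topology are hypothetical. Each node summarizes its local readings and only ships the small summary up to its parent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Summary:
    """Partial aggregate that is cheap to ship up the tree (hypothetical)."""
    count: int = 0
    total: float = 0.0

    def merge(self, other: "Summary") -> "Summary":
        # Combining two summaries never needs the raw readings.
        return Summary(self.count + other.count, self.total + other.total)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


@dataclass
class Node:
    """Hypothetical aggregation node; leaves sit next to the sensors."""
    name: str
    readings: List[float] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def aggregate(self) -> Summary:
        # Summarize local readings first, so raw data stays at this node.
        result = Summary(len(self.readings), sum(self.readings))
        # Fold in the summaries pushed up by the children.
        for child in self.children:
            result = result.merge(child.aggregate())
        return result


def build_example() -> Node:
    # Made-up topology: two leaf sensors under one block, under one root.
    lot_a = Node("lot-a", readings=[1.0, 0.0, 1.0])
    lot_b = Node("lot-b", readings=[0.0, 0.0])
    block = Node("block-1", children=[lot_a, lot_b])
    return Node("city", children=[block])
```

Calling `build_example().aggregate()` walks the tree once and returns a single `Summary` for the root. The point of the design is that only two numbers per node travel upward, no matter how many readings each sensor collected.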