[BDCSG2008] Summary of BDCSG2008 blogging

It has been a great meeting. Lots of interesting ideas and a lot to explore from now on. Just what I like :D. Below is the list of posts I wrote related to the meeting.

- Introductory post
- [Data-Intensive Scalable Computing. Randy Bryant, CMU](/posts/data-intensive-scalable-computing-randy-bryant.md)
- Text Information Management: Challenges and Opportunities. ChengXiang Zhai, UIUC
- Clouds and ManyCore: The Revolution. Dan Reed, MSR
- Computational Paradigms for Genomic Medicine. Jill Mesirov, Broad Institute of MIT and Harvard
- Simplicity and Complexity in Data Systems (Garth Gibson)
- Handling Large Datasets at Google: Current Systems and Future Directions. Jeff Dean, Google
- Algorithmic Perspectives on Large-Scale Social Network Data. Jon Kleinberg, Cornell
- Mining the Web Graph. Marc Najork, MSR
- “What” Goes Around. Joe Hellerstein, Berkeley
- Sherpa: Hosted Data Serving. Raghu Ramakrishnan, Yahoo!
- Scientific Applications of Large Databases. Alex Szalay, JHU
- Data-Rich Computing: Where It’s At. Phil Gibbons, Intel
- NSF Plans for Supporting Data Intensive Computing: Jeannette Wing, NSF
- The Google/IBM data center: Christophe Bisciglia, Google

Mar 27, 2008 · 1 min · 159 words · Xavier Llorà

Blogging from the Big Data Computing Study Group 2008

I was lucky to attend the Big Data Computing Study Group 2008. The lineup of speakers is impressive. The event was held at Yahoo! Sunnyvale, and Thomas Kwan (a UIUC alumnus now at Yahoo!) helped organize it. I blogged about it on my DITA blog, where you can find links to all the related posts.

Mar 27, 2008 · 1 min · 54 words · Xavier Llorà

[BDCSG2008] NSF Plans for Supporting Data Intensive Computing (Jeannette Wing and Christophe Bisciglia)

NSF listens to you, academics. Jeannette opens the floor with this claim. Questions: What are the limitations of this modeling paradigm (the data-intensive one)? What are meaningful metrics of performance here? What about securing processes and data on a shared resource? How can we reduce power consumption? Can this paradigm tackle problems not possible otherwise, or simplify them, or open the door to new applications? NSF is rolling out the cluster exploratory program, and is also going to roll out a new solicitation for Data-Intensive Computing. It is also emphasizing going from data to knowledge, since scientists are throwing data away. This is a great opportunity for collaborative efforts between CS and scientists. NSF goal: provide access to cluster resources and access to massive data sets. Google and IBM are rolling out the cluster (for academics). The cluster exploratory, the solicitation program announced yesterday, will distribute access to the cluster and research grants. Review of Christophe's experience teaching a class about cluster computing; he realized that giving away compute cycles is more valuable than plain grant money. It runs on Hadoop. The cluster will be allocated by rack-weeks: 5 terabytes of storage and priority on 80 processors (lower-priority users with large data sets can still run there). And since reviewing proposals is not Google's expertise, they reached out to NSF to run it. Googlers will start collaborations, and IBM will also help by providing support for it. Jeannette claims this is a new model, and NSF is open to new models and other partners. ...
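Since the talk notes that the shared cluster runs Hadoop, here is a minimal sketch of the programming model academics would get access to: a hypothetical word-count script usable as both mapper and reducer under Hadoop Streaming. The script name, the single-file map/reduce layout, and the word-count task are my own illustration, not from the talk.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: Hadoop pipes raw input lines to the
# mapper on stdin and feeds the key-sorted mapper output to the reducer.
import sys

def mapper():
    # Emit one "word<TAB>1" pair per token in the input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n or 1)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A typical invocation would be along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper "wc.py map" -reducer "wc.py reduce"` (the exact jar path varies by install).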

Mar 27, 2008 · 2 min · 249 words · Xavier Llorà

[BDCSG2008] Data-Rich Computing: Where It’s At (Phil Gibbons)

The next speaker of the afternoon is Phil Gibbons from Intel Research. Intel has created a research theme on data-rich computing for the next few years (like the other one presented at the Hadoop summit about ground modeling). One approach is to bring the computation to the data (the cluster approach), but there are also two other elements in the picture: (1) memory hierarchy issues, and (2) pervasive multimedia sensing. The first one is important for pure performance; the second keeps forcing the computation closer to the sensors. The memory hierarchy implies that multiple cores share a common L2 cache, and the farther we move from the cores the more the bandwidth drops, and you can keep pushing down into HD/SSD. This basic unit is what gets replicated to build clusters. (A little note about the SSD rewriting quirk and how cache coherency plays into the overall picture.) All this leads to the HI-SPADE project (Hierarchy-Savvy Parallel Algorithm Design). The goal is to hide as much of the hierarchy as possible, and only expose what is important to tune. He continued with an example of cache misses and how they can be mitigated with the right scheduler. Phil then moved on to show how that would work on parallel merge sort (a depth-first merge solution), later also compared with hash join, and again, no free lunch (they ran into the usual trade-off: one works like a champ where the other fails miserably). The next topic of the presentation went to the quirks of SSDs (flash based). The improvements over traditional HDs reach three orders of magnitude. But again, random rewrites in flash are painful, though there is a way to express algorithms as semi-random writes that may help mitigate the problem. Shifting gears, pervasive multimedia sensing is the next topic in the arena. Phil started by reviewing sensor networks and how sensors are becoming more powerful, but, more importantly, their numbers are growing and scaling up (exponentially). He then moved to their example, the IrisNet project, and how pushing computation down also helps with the distribution of the data, pushing results as feeds (XML) into aggregation nodes (in a tree shape). Once aggregated, they provide the usual distribution, replication, and querying. ...
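To make the "hide the hierarchy, expose one knob" idea concrete, here is a toy sketch in the spirit of HI-SPADE (not Phil's actual code): a depth-first merge sort that recurses until a subproblem fits an assumed cache-resident size, so the only tuning parameter is `BLOCK`.

```python
# Toy "hierarchy-savvy" divide and conquer: recurse until a subproblem
# fits the fastest cache level, exposing a single knob (BLOCK) instead of
# the whole memory hierarchy. BLOCK is an assumed stand-in for the number
# of elements that fit in L2 cache.
BLOCK = 4096

def hs_merge_sort(a):
    """Sort `a` with cache-sized base cases; depth-first recursion keeps
    each subproblem's working set resident before merging."""
    if len(a) <= BLOCK:
        return sorted(a)  # base case: small enough to run in-cache
    mid = len(a) // 2
    left = hs_merge_sort(a[:mid])   # depth-first: finish left half first
    right = hs_merge_sort(a[mid:])
    return merge(left, right)

def merge(left, right):
    # Sequential two-way merge: the linear scan pattern is friendly to
    # every level of the hierarchy (prefetchers see streaming access).
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out
```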

Mar 27, 2008 · 2 min · 359 words · Xavier Llorà

[BDCSG2008] Scientific Applications of Large Databases (Alex Szalay)

Alex opened the talk showing the clear exponential growth in astronomy (LSST and the petabyte-generation example). Data generated from sensors keeps growing like crazy. Images have more and more resolution. Hopkins' databases started with the Digital Sky initiative, which generated 3 terabytes of data in the 90s, and the numbers keep growing, up to the point of LSST, which will be forced to dump images because it is not possible to store all of them. SkyServer is based on SQL Server and .NET, serving the 3 terabytes via SQL queries, recently reaching 15 million queries. Alex presented a review of the usage and data delivery. They are planning to anonymize the log and make it publicly available. Then he switched gears toward the immersive turbulence project, which seeks to generate high-resolution turbulence images. Again, they store the information in SQL Server. Moving on to SkyQuery: a federation of web services that builds join queries on the fly, but the problem is that some join results turn out to be unfeasible. The review of projects then moved to "Life Under Your Feet", a non-intrusive way to measure environments. The key component is the aggregation and drill-down mode for exploring those aggregates down to the sensor level. Another one, OncoSpace, tracks the evolution of oncology treatments by comparing images of the same patients across time, again implemented on the same SQL Server. But all these projects have commonalities: indexing and the need to extract small subsets from larger datasets. MyDB targets extracting data from other databases and services and leaving it in a relational database the user can use. GrayWulf: there is no off-the-shelf solution for 1,000 TB of scientific data, so the solution is to scale it out. They took the route of fragmentation, creating chunks that are easy to manage, again using SQL Server clusters. They also introduced a workflow manager for monitoring and control. And this led them to create a petascale center to play with at Johns Hopkins University, built on Pan-STARRS. ...
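As a minimal sketch of the scale-out route described for GrayWulf (fragment a large dataset into manageable chunks, then fan a query out over all of them), here is a toy partition-and-scatter-gather example. All names, the hash partitioning scheme, and the sample data are illustrative assumptions; the real system runs on SQL Server clusters.

```python
# Toy scale-out sketch: partition a large table into chunks spread over
# several servers, then run the same filter on every chunk and merge the
# partial results, mimicking a query fanned out across a cluster.
from collections import defaultdict

N_SERVERS = 4  # assumed cluster size for the sketch

def partition(rows, key):
    """Assign each row to a server chunk by hashing its key."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[hash(row[key]) % N_SERVERS].append(row)
    return chunks

def scatter_gather(chunks, predicate):
    """Apply the same selection to every chunk and merge the results."""
    results = []
    for server, rows in chunks.items():
        results.extend(r for r in rows if predicate(r))
    return results

# Example: sky objects with an id and a magnitude column (made up).
objects = [{"obj_id": i, "mag": 14 + (i % 10)} for i in range(1000)]
chunks = partition(objects, "obj_id")
bright = scatter_gather(chunks, lambda r: r["mag"] < 16)
```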

Mar 26, 2008 · 2 min · 345 words · Xavier Llorà