[BDCSG2008] Data-Rich Computing: Where It’s At (Phil Gibbons)

The next speaker of the afternoon is Phil Gibbons from Intel Research. Intel has created a research theme on data-rich computing for the next few years (the same theme as the one presented at the Hadoop Summit in the talk on ground modeling). The approach is to bring the computation to the data (the cluster approach), but there are two more elements in the picture: (1) memory hierarchy issues, and (2) pervasive multimedia sensing. The first is important for pure performance; the second keeps pushing the computation closer to the sensors.
In the memory hierarchy, multiple cores share a common L2 cache, bandwidth drops the farther you move away from the cores, and the hierarchy keeps extending down into HD/SSD. This basic unit is what gets replicated to build clusters. (There was a short note about the SSD rewriting quirk and how cache coherency plays into the overall picture.) All this led to the HI-SPADE project (Hierarchy-Savvy Parallel Algorithm Design). The goal is to hide as much of the hierarchy as possible and expose only what is important to tune. Phil continued with an example of cache misses and how they can be mitigated with the right scheduler. He then showed how that would work on parallel merge sort (a solution à la depth-first search), later compared it with hash join, and again, no free lunch (they run into the usual situation: one works like a champ, the other fails miserably).
The next topic of the presentation was the quirks of flash-based SSDs. The improvements over traditional HD reach three orders of magnitude. However, random rewrites in flash are painful, although there is a way to express semi-random algorithms that may help mitigate the problem.
Shifting gears, pervasive multimedia sensing was the next topic. Phil started by reviewing sensor networks and how the sensors are becoming more powerful but, more importantly, how their numbers are growing and scaling up (exponentially).
Phil then moved to their example, the IrisNet project, and how pushing computation down also helps with the distribution of the data: nodes push their results as XML feeds into aggregation nodes arranged in a tree. Once the data is aggregated, they provide the usual distribution, replication, and querying.
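
To make the tree-shaped aggregation a bit more concrete, here is a minimal sketch of the idea of pushing computation down and sending only small XML summaries up the tree. This is not IrisNet's actual code or schema: the `Node` class, the `summarize` and `feed` methods, the `<summary>` element layout, and the sample readings are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical node in an aggregation tree.

    Leaves stand in for sensor sites holding raw readings; inner nodes
    only combine the summaries of their children.
    """

    def __init__(self, name, readings=None, children=None):
        self.name = name
        self.readings = readings or []   # raw local sensor values (invented)
        self.children = children or []   # child nodes feeding into this one

    def summarize(self):
        """Return (count, total) for this subtree, computed bottom-up.

        Only this two-number summary travels up the tree, never the raw
        readings, which is the point of pushing computation down.
        """
        count = len(self.readings)
        total = sum(self.readings)
        for child in self.children:
            child_count, child_total = child.summarize()
            count += child_count
            total += child_total
        return count, total

    def feed(self):
        """Render this subtree's summary as a small XML element (assumed layout)."""
        count, total = self.summarize()
        elem = ET.Element("summary", name=self.name)
        ET.SubElement(elem, "count").text = str(count)
        ET.SubElement(elem, "total").text = str(total)
        return elem


def build_demo_tree():
    """Build a tiny three-level tree with invented readings."""
    site_a = Node("site-a", readings=[3, 5])
    site_b = Node("site-b", readings=[2])
    site_c = Node("site-c", readings=[4, 4, 1])
    region_1 = Node("region-1", children=[site_a, site_b])
    region_2 = Node("region-2", children=[site_c])
    return Node("root", children=[region_1, region_2])


if __name__ == "__main__":
    root = build_demo_tree()
    # Print the XML feed that the root aggregation node would publish.
    print(ET.tostring(root.feed(), encoding="unicode"))
```

In this toy version a query at the root only ever sees per-subtree counts and totals (6 readings summing to 19 for the demo tree). A real deployment would also need the distribution, replication, and querying layers mentioned in the talk, which this sketch leaves out.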