[BDCSG2008] Data-Rich Computing: Where It’s At (Phil Gibbons)

The next speaker of the afternoon is Phil Gibbons from Intel Research. Intel has created a research theme on data-rich computing for the next few years (the same one presented at the Hadoop Summit about ground modeling). One approach is to bring the computation to the data (the cluster approach), but there are two more elements in the picture: (1) memory hierarchy issues, and (2) pervasive multimedia sensing. The first is important for pure performance; the second keeps forcing computation closer to the sensors. In the memory hierarchy, multiple cores share a common L2 cache, and the farther we move from the cores the more the bandwidth drops, down through HD/SSD. This basic unit is what gets replicated to build clusters. (Little note about the SSD rewriting quirk and how cache coherency plays into the overall picture.) All this led to the HI-SPADE project (Hierarchy-Savvy Parallel Algorithm Design). The goal is to hide as much of the hierarchy as possible, exposing only what is important to tune. He continued with an example of cache misses and how they can be mitigated with the right scheduler. Phil then showed how that plays out on parallel merge sort (a depth-first merge approach), later also compared with hash join, and again, no free lunch (as usual, one runs like a champ while the other fails miserably). The next topic of the presentation was the quirks of (flash-based) SSDs. The improvements over traditional HDs reach three orders of magnitude, but random rewrites in flash are painful; expressing algorithms with semi-random writes may help mitigate the problem. Shifting gears, pervasive multimedia sensing was the next topic in the arena. Phil started by reviewing sensor networks and how sensors are becoming more powerful but, more importantly, how their numbers keep growing and scaling up (exponentially).
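The “semi-random” write idea mentioned above can be sketched as a toy buffer (my reconstruction for illustration, not Intel’s actual algorithm; the erase-block size and the flush callback are assumptions): random page writes are staged per erase block and flushed in ascending page order within each block — random across blocks, sequential within a block — a pattern flash handles far better than fully random writes.

```python
from collections import defaultdict

PAGES_PER_BLOCK = 64  # assumed erase-block size, in pages


class SemiRandomWriter:
    """Buffer random page writes and flush them sequentially within
    each erase block (the 'semi-random' write pattern)."""

    def __init__(self, flush):
        self.flush = flush               # callback: flush(page, data)
        self.buffers = defaultdict(dict)  # block -> {page: data}

    def write(self, page, data):
        block = page // PAGES_PER_BLOCK
        self.buffers[block][page] = data
        if len(self.buffers[block]) == PAGES_PER_BLOCK:
            self.flush_block(block)

    def flush_block(self, block):
        # Pages inside one block are written in ascending order.
        for page in sorted(self.buffers[block]):
            self.flush(page, self.buffers[block][page])
        del self.buffers[block]

    def close(self):
        for block in list(self.buffers):
            self.flush_block(block)
```

The writes still land on random blocks, but the device only ever sees sequential page order inside each block, which is the part flash penalizes.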
Phil then moved to their IrisNet project example, and how pushing computation down also helps with the distribution of the data: sensor nodes push their results as feeds (XML) into aggregation nodes (arranged in a tree). Once aggregated, they provide the usual distribution, replication, and querying. ...

Mar 27, 2008 · 2 min · 359 words · Xavier Llorà

[BDCSG2008] Scientific Applications of Large Databases (Alex Szalay)

Alex opened the talk showing the clear exponential growth in astronomy data (LSST as the petabyte-scale example). Data generated from sensors keeps growing like crazy, and images have more and more resolution. Johns Hopkins’s databases started with the Digital Sky initiative, which generated 3 terabytes of data in the ’90s, and the numbers keep growing, up to the point where LSST will be forced to dump images because it is not possible to store all of them. SkyServer is based on SQL Server and .NET, serving those 3 terabytes and answering SQL queries, recently reaching 15 million queries. Alex presented a review of usage and data delivery; they are planning to anonymize the logs and make them publicly available. Then he switched gears to the immersive turbulence project, which seeks to generate high-resolution turbulence images, again storing the information in SQL Server. Moving on to SkyQuery: a federation of web services that builds join queries on the fly, but the problem is that some join results turn out to be infeasible. The review of projects then moved to “Life Under Your Feet,” a non-intrusive way to measure environments. The key component is aggregation plus a drill-down mode for exploring those aggregates down to the sensor level. Another one, OncoSpace, tracks the evolution of oncology treatments by comparing images of the same patients across time, again implemented on the same SQL Server. All these projects have commonalities: indexing, and the need to extract small subsets from larger datasets. MyDQ aims to extract data from other databases and services and leave it in a relational database the user can work with. GrayWulf: there is no off-the-shelf solution for 1,000 TB of scientific data, so the solution is to scale out. They took the route of fragmentation, creating chunks that are easy to manage, again using SQL Server clusters. They also introduced a workflow manager to monitor and control it all.
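The GrayWulf scale-out idea — partition a dataset far too big for one box into manageable chunks spread across a cluster — can be sketched with a toy range partitioner (the function names and fixed chunk size are my assumptions; the real system shards SQL Server databases):

```python
def build_chunks(min_id, max_id, chunk_size):
    """Split an id range into fixed-size chunks: a toy stand-in for
    fragmenting a huge table across SQL Server instances."""
    chunks = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        chunks.append((lo, hi))
        lo = hi + 1
    return chunks


def chunk_for(obj_id, chunks):
    """Locate the chunk (and thus the server) owning an object id."""
    for i, (lo, hi) in enumerate(chunks):
        if lo <= obj_id <= hi:
            return i
    raise KeyError(obj_id)
```

A query touching one id range then only hits the few servers owning those chunks, which is what makes the “extract small subsets from larger datasets” pattern cheap.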
And this led them to create a petascale center to play with at Johns Hopkins University, built on Pan-STARRS. ...

Mar 26, 2008 · 2 min · 345 words · Xavier Llorà

[BDCSG2008] Sherpa: Cloud Computing of the Third Kind (Raghu Ramakrishnan)

Raghu (a former professor at Wisconsin–Madison, now at Yahoo!) is leading a very interesting project on large-scale storage (Sherpa). Here you can find some of my unconnected notes. Software as a service requires both CPU and data. Cloud computing is usually assimilated to MapReduce grids, but those decouple computation and data. For instance, Condor is great for high-throughput computing, while on the data side you run into SSDS, Hadoop, etc. But there is a third kind: transactional storage. Moreover, SQL is the most widely used parallel programming language. Raghu wonders why we can’t build on the lessons learned from RDBMSs and OLTP. Sherpa does not aim to support the full ACID model; instead it aims to be massively scalable via relaxation. Updates: creation, or simple object updates. Queries: selection with filtering. The vision is to start in a box; if it needs to scale, that should be transparent. PNUTS is part of Sherpa, and it is the technology for geographic replication, uniform updates, queries, and flexible schemas. He then described the inner parts of PNUTS and its software stack. Some interesting notes: no logging, message validation, no traditional transactions. The lower levels put and get a key; on top of that sit ranges and sorting; PNUTS at the top provides the querying facility (insert, select, remove). Schemas are flexible: fields are declared at the table level but do not need to be present in every record (flexible growth). Records are mastered on different nodes, and masters can migrate depending on usage. The basic consistency model is based on a timeline: writes and reads on the master are ordered, and other replicas can catch up over time. Load balancing is done by splitting and migration, guaranteed by the Yahoo! Message Broker. The goal: simple, light, massively scalable. ...
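The per-record timeline model can be sketched as follows (my reconstruction of the idea, not PNUTS code): each record has a single master that orders all writes into increasing versions; replicas apply them in that order, so a read may return a stale version but never an out-of-order one.

```python
class Record:
    def __init__(self):
        self.version = 0
        self.value = None


class Replica:
    def __init__(self):
        self.records = {}

    def apply(self, key, version, value):
        rec = self.records.setdefault(key, Record())
        # Updates arrive in master order; ignore anything stale.
        if version > rec.version:
            rec.version, rec.value = version, value


class Master(Replica):
    """The record's master serializes all writes into a timeline."""

    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def put(self, key, value):
        rec = self.records.setdefault(key, Record())
        rec.version += 1
        rec.value = value
        for r in self.replicas:  # asynchronous in the real system
            r.apply(key, rec.version, value)
```

Because versions only move forward, a replica that lags simply serves an older point on the timeline — the relaxation of ACID that buys the scalability.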

Mar 26, 2008 · 2 min · 289 words · Xavier Llorà

[BDCSG2008] “What” goes around (Joe Hellerstein)

Joe opens fire saying “The web is big, a lot of monkeys pushing keys”. Funny. The industrial revolution of data is coming: large amounts of data are going to be produced. The other revolution is the hardware revolution, leading to the question of how we program such animals to avoid the death of the hardware industry. The last one is the industrial revolution in software, echoing automatic programming. Declarative programming is great, but how many domains are there, and which ones can absorb it? Benefits: rapid prototyping, pocket-size code bases, independence from the runtime, ease of analysis and security, and room for optimization and adaptability. But the key question is: where is this useful (besides SQL and spreadsheets)? His group has rolled out declarative languages for networking, including routing algorithms, other networking stacks, and wireless sensor nets. His approach is a reincarnation of Datalog; it fits the centrality of graphs and rendezvous in networks. After these initial efforts, P2 has been used for consensus (Paxos), secure networking, flexible data replication, and mobile networks. Other applications are currently being built: compilers, natural language, computer games, security protocols, information extraction, and modular robotics. The current challenges they face include sound system design, languages that hold up to real-world programming, the lack of analysis tools for these languages (which are not Turing complete), connections to graph theory and algebraic modeling, and efficient models for A*. Another challenge is how to do distributed inference and metacompilation to provide hardware runtimes. On data/network uncertainty, P2 can help embed the routing information, the network routing information, and the conceptual networks together, and express them jointly. Evita Raced is the runtime for P2 (a simple wire data-flow bootstrapper). More info here. ...
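The flavor of the Datalog approach can be shown with the classic two-rule reachability program — `path(X,Y) :- link(X,Y)` and `path(X,Z) :- link(X,Y), path(Y,Z)` — evaluated here with a naive bottom-up fixpoint in Python (a toy evaluator for illustration, not P2):

```python
def reachability(links):
    """Naive bottom-up fixpoint for:
         path(X,Y) :- link(X,Y).
         path(X,Z) :- link(X,Y), path(Y,Z).
    links: set of (src, dst) tuples; returns the set of path facts."""
    path = set(links)
    while True:
        # Derive new path facts by joining link with the current paths.
        new = {(x, z) for (x, y1) in links
                      for (y2, z) in path if y1 == y2}
        if new <= path:       # fixpoint reached: nothing new derivable
            return path
        path |= new
```

Two rules replace pages of imperative graph-traversal code, which is exactly the pocket-size-code-base argument for declarative networking.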

Mar 26, 2008 · 2 min · 287 words · Xavier Llorà

[BDCSG2008] Mining the Web Graph (Marc Najork)

Marc takes the floor and starts talking about the web graph (the one generated by page hyperlinks). Hyperlinks are a key element of the web. Lately web pages have seen an increase in the number of links, usually generated by CMSs (for instance, navigation links), and with that a change in the meaning of those hyperlinks. Analytics come in different flavors: for example, PageRank is pretty simple, but others require random access, and thus memory storage (holding huge graphs in memory). Using their own Microsoft tools, they distribute and replicate the graph across a cluster to be able to run some of these analytic algorithms (for instance, HITS for ranking). Sampling can help deal with high-arity nodes in the graph. He continued by presenting the SALSA algorithm (a successor of HITS). SALSA requires sampling, and Marc suggests that uniform sampling works pretty well. However, how do you evaluate ranking algorithms? Compile a truth set? These are sometimes assembled by humans (who may not know what the intent of the query was); another alternative is to use click logs (potentially biased toward the first results presented). As a field, he claims, we need to collaborate with the social sciences to model and better understand the meaning and motivations of hyperlinks. ...
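For concreteness, here is a minimal power-iteration PageRank over an adjacency list (a textbook sketch, not the variants Marc’s group runs at cluster scale; the 0.85 damping factor is the conventional choice):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {node: [outlinks]}. Returns {node: rank}; ranks sum to 1."""
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank
```

The whole computation is a repeated sparse matrix-vector product, which is why it needs only streaming access, while link-structure algorithms like HITS and SALSA need random access into the neighborhood of each node — the distinction Marc draws above.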

Mar 26, 2008 · 1 min · 204 words · Xavier Llorà