[BDCSG2008] Sherpa: Cloud Computing of the Third Kind (Raghu Ramakrishnan)

Raghu (a former professor at Wisconsin-Madison, now at Yahoo!) is leading a very interesting project on large-scale storage (Sherpa). Here you can find some of my unconnected notes. Software as a service requires both CPU and data. Cloud computing is usually assimilated to Map-Reduce grids, but those decouple computation and data. For instance, Condor is great for high-throughput computing, while on the data side you run into SSDS, Hadoop, etc. But there is a third kind: transactional storage. Moreover, SQL is the most widely used parallel programming language. Raghu wonders why we can't build on the lessons learned from RDBMSs for OLTP. Sherpa does not aim to support the full ACID model, but to be massively scalable via relaxation. Updates: creation or simple object updates. Queries: selection with filtering. The vision is to start in a box; if it needs to scale, that should be transparent. PNUTS is part of Sherpa, and it is the technology for geographic replication, uniform updates, queries, and flexible schemas. He then describes the inner parts of PNUTS and its software stack. Some interesting notes: no logging, message validation, no traditional transactions. The lower levels put and get a key, on top of that sit ranges and sorting, and PNUTS on top provides the querying facility (insert, select, remove). Flexible schemas: the fields are declared at the table level but do not need to be present in every record (flexible growth). Records are mastered on different nodes, and masters can migrate depending on how records are used. The basic consistency model is based on a timeline: master writes and reads are ordered, and other replicas can catch up over time. Load balancing is done by splitting and migration, backed by the Yahoo! Message Broker. The goal: simple, light, massively scalable. ...
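
To make these notes a bit more concrete, here is a toy sketch of mine of the layering described above: flexible fields declared at the table level, a per-record master whose writes define a version timeline, and a replica that catches up to it. All class and method names are illustrative, not Yahoo!'s PNUTS API.

```python
# Toy sketch of the PNUTS-style ideas in the notes: flexible per-record
# fields, master-ordered writes forming a version timeline, and a replica
# that lags behind until it syncs.  Names are mine, not Yahoo!'s interface.

class Record:
    def __init__(self, key, fields):
        self.key = key
        self.fields = dict(fields)   # flexible schema: any subset of declared fields
        self.version = 0             # position on the record's write timeline

class PNUTSTable:
    def __init__(self, declared_fields):
        self.declared_fields = set(declared_fields)  # declared at the table level
        self.master = {}                             # key -> authoritative record
        self.replica = {}                            # key -> possibly stale copy

    def insert(self, key, **fields):
        unknown = set(fields) - self.declared_fields
        if unknown:
            raise ValueError(f"undeclared fields: {unknown}")
        rec = self.master.get(key) or Record(key, {})
        rec.fields.update(fields)                    # simple object update
        rec.version += 1                             # ordered by the master
        self.master[key] = rec
        return rec.version

    def select(self, predicate):
        """Selection with filtering over the (possibly stale) replica."""
        return [r for r in self.replica.values() if predicate(r.fields)]

    def sync(self):
        """Replica catches up along the timeline (the message broker's job)."""
        for key, rec in self.master.items():
            local = self.replica.get(key)
            if local is None or local.version < rec.version:
                copy = Record(rec.key, rec.fields)
                copy.version = rec.version
                self.replica[key] = copy
```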

Mar 26, 2008 · 2 min · 289 words · Xavier Llorà

[BDCSG2008] “What” goes around (Joe Hellerstein)

Joe opens fire saying "The web is big, a lot of monkeys pushing keys". Funny. The industrial revolution of data is coming: large amounts of data are going to be produced. The other revolution is the hardware revolution, leading to the question of how we program such animals to avoid the death of the hardware industry. The last one is the industrial revolution in software, echoing automatic programming. Declarative programming is great, but how many domains, and which ones, can absorb it? Benefits: rapid prototyping, pocket-size code bases, independence from the runtime, ease of analysis and security, and room for optimization and adaptability. But the key question is: where is this useful (besides SQL and spreadsheets)? His group has rolled out declarative languages for networking, including routing algorithms, other networking stacks, and wireless sensor nets. His approach is a reincarnation of Datalog. It fits the centrality of graphs and rendezvous in networks. After these initial efforts, P2 has been used for consensus (Paxos), secure networking, flexible data replication, and mobile networks. Other applications currently being built: compilers, natural language, computer games, security protocols, information extraction, modular robotics. The current challenges they are facing include sound system design, a language suited to real-world programming, the lack of analysis tools for languages that are not Turing complete, connections to graph theory and algebraic modeling, and efficient models for A*. Another challenge is how to do distributed inference and metacompilation to provide hardware runtimes. With data and network uncertainty, P2 can help embed the routing information, the network routing information, and the conceptual networks together, and express them jointly. Evita Raced is the runtime for P2 (a simple wired data-flow bootstrapper). More info here. ...
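
For readers unfamiliar with the Datalog flavor of this work, here is a toy fixed-point evaluation of the classic reachability rules, the kind of declarative routing spec his group compiles into a dataflow. It is my own illustration, not Overlog or P2 code.

```python
# Toy illustration of the Datalog style behind declarative networking:
# paths are declared as rules and derived by fixed-point evaluation,
# instead of being computed by an imperative routing loop.

def reachable(links):
    """links: set of (src, dst) edges.  Returns what the Datalog program
        path(X, Y) :- link(X, Y).
        path(X, Z) :- path(X, Y), link(Y, Z).
    would derive (the transitive closure)."""
    path = set(links)                 # path(X,Y) :- link(X,Y).
    while True:
        new = {(x, z) for (x, y) in path for (y2, z) in links if y == y2}
        if new <= path:               # fixed point reached
            return path
        path |= new

links = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(reachable(links)))       # includes ('a', 'c'), ('a', 'd'), ...
```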

Mar 26, 2008 · 2 min · 287 words · Xavier Llorà

[BDCSG2008] Mining the Web Graph (Marc Najork)

Marc takes the floor and starts talking about the web graph (the one generated by page hyperlinks). Hyperlinks are a key element of the web. Lately, webpages show an increase in the number of links, usually generated by CMSs (for instance, navigation links). However, this changes the meaning of those hyperlinks. Analytics come in different flavors: for example, PageRank is pretty simple, but others require random access and hence in-memory storage (which means fitting huge graphs in memory). Using their own Microsoft tools, they distribute and replicate the graph across a cluster to be able to run some of these analytic algorithms (for instance, HITS for ranking). Sampling can help deal with high-arity nodes in the graph. He continues presenting the SALSA algorithm (a successor of HITS). SALSA requires sampling, and Marc suggests that uniform sampling works pretty well. However, how do you evaluate ranking algorithms? Compile a truth set? Sometimes it is assembled by humans (who may not know what the intent of the query was), but another alternative is to use click logs (potentially biased toward the first results presented). As a field, he argues for the need to collaborate with the social sciences to model and better understand the meaning and motivations of hyperlinks. ...
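
As a reminder of how simple the "pretty simple" analytic is, here is a toy PageRank power iteration over an in-memory adjacency list; the cluster-scale versions Marc describes distribute and replicate the graph instead. The graph below is made up, and this is not Microsoft's tooling.

```python
# Minimal PageRank power iteration over a toy adjacency list.

def pagerank(out_links, damping=0.85, iters=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in out_links.items():
            if not outs:                      # dangling node: spread rank evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```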

Mar 26, 2008 · 1 min · 204 words · Xavier Llorà

[BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)

How can we help social scientists do their science, but also how can we create systems from the lessons learned? This topic also includes the security and sensitivity of the data. He reviews work from the karate-club papers to the latest papers about social networks. Scale changes the way you approach the data: the original studies allowed knowing what each link meant, but large-scale networks lose this property. He is working toward a language to express some of these analyses of social networks and processes. Also, how do we bind information to each user, and how can we model users? And then there are security policies. Diffusion in social networks covers how things propagate (even locally), but it is hard to measure how people change their minds during the diffusion process. In a chain-letter study, the petition and its forwarding trace were collected; letters can also be forwarded to mailing lists, but some of those traces can still be recovered. The paths were messed up with mutations (typos), amputations, etc. They designed algorithms for maximum-likelihood assembly of the tree. The output was unexpected: opposed to the six-degrees-of-separation intuition, they found narrow, deep trees. Why would a chain letter run like a depth-first search? Time played a role: each friend performs only a small search, and replicated copies are basically discarded. A model following this time dimension was able to reproduce the observed trees. Another element thrown into the mix is the diffusion threshold: a message gets in, but how many repeated inputs do you require before validating it and passing it along? Results show that the second input is the one that boosts that threshold. Viral marketing is another case where people want to understand diffusion. All this leads to multiple models and to the question of how you integrate them. Privacy and social networks is another key element. How does that play out? Is anonymization the way to go? In social network graphs, even anonymized, hints can lead to deanonymization of the picture. Before the network is released you can add structure to it, and then you have something to roll back from: the idea is to create a unique pattern and then pin it to other people. You can compromise a graph with on the order of the square root of the log of the number of nodes. Jon's final reflections: toward a model of you. Models of human behavior are possible (for instance, a model of the time to reply to email). But computers track more and more information about your behavior, opening the door to new modeling (something that the DISCUS project has also been postulating for the last five years). ...
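
A toy simulation of the threshold idea, where a node forwards a message only after hearing it from enough distinct friends, may help make the mechanism concrete; the graph, seeds, and threshold below are made up for illustration and are not from the talk.

```python
# Toy threshold-diffusion simulation: a person forwards a message only
# after hearing it from at least `threshold` distinct friends (the notes
# report the second exposure is the one that matters).

from collections import deque

def threshold_diffusion(friends, seeds, threshold=2):
    exposures = {v: 0 for v in friends}
    adopted = set(seeds)
    queue = deque(seeds)
    while queue:
        sender = queue.popleft()
        for friend in friends[sender]:
            if friend in adopted:
                continue
            exposures[friend] += 1
            if exposures[friend] >= threshold:   # enough repeated inputs
                adopted.add(friend)
                queue.append(friend)
    return adopted

friends = {
    "ana": ["bo", "cid"], "bo": ["cid", "dee"],
    "cid": ["dee"], "dee": [], "eve": ["ana"],
}
print(threshold_diffusion(friends, seeds={"ana", "bo"}))
```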

Mar 26, 2008 · 3 min · 436 words · Xavier Llorà

[BDCSG2008] Handling Large Datasets at Google: Current Systems and Future Directions (Jeff Dean)

Jeff was the big proponent of the MapReduce model, the MapReduce guy. Jeff reviews the infrastructure and the heterogeneous data sets (heterogeneous and at least petabyte scale). Their goal: maximize performance per buck. Data centers and locality are also key in the equation. Low cost: no redundant power supplies, no RAID disks, running Linux, standard networking. Software needs to be resilient to failures (nodes, disks, or racks going dead). Linux on all production machines. A scheduler across the cluster schedules jobs, a cluster-wide file system sits on top, and usually a BigTable cell. GFS has a centralized master that manages metadata and the allocation and replication of chunks (you talk to the master and then talk to the chunk servers). BigTable helps applications that need a bit more structured storage (row key, column, timestamp). It also provides garbage-collection policies. Data is distributed by breaking it into tablets (ranges of rows), each managed by a single machine, and the system can split growing tablets. BigTable provides transactions and allows specifying locality groups of columns stored together. They allow replication policies across data centers. MapReduce is a nice-fitting model for some programs (for instance, reverse-index creation). It allows moving computation closer to the data and also enables load balancing. GFS works OK for a single cluster, but they do not have a global view across data centers. For instance, they are looking into unique naming of data and, if integrated, into letting a data center keep working if it becomes disconnected. They are also looking into data distribution policies. ...
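
To illustrate the shape of the reverse-index example, here is a toy single-process MapReduce of my own: map emits (word, doc_id) pairs, a shuffle groups them by key, and reduce merges the postings. The documents are made up, and this is obviously not Google's implementation.

```python
# Toy single-process MapReduce building an inverted (reverse) index.

from collections import defaultdict

def map_phase(doc_id, text):
    for word in text.lower().split():
        yield word, doc_id            # emit (word, doc_id) pairs

def reduce_phase(word, doc_ids):
    return word, sorted(set(doc_ids)) # merge postings for one word

def mapreduce_inverted_index(docs):
    shuffle = defaultdict(list)       # group intermediate pairs by key
    for doc_id, text in docs.items():
        for word, value in map_phase(doc_id, text):
            shuffle[word].append(value)
    return dict(reduce_phase(w, ids) for w, ids in shuffle.items())

docs = {1: "big data systems", 2: "big table stores data"}
print(mapreduce_inverted_index(docs))
# {'big': [1, 2], 'data': [1, 2], 'systems': [1], 'table': [2], 'stores': [2]}
```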

Mar 26, 2008 · 2 min · 257 words · Xavier Llorà