[BDCSG2008] Sherpa: Cloud Computing of the Third Kind (Raghu Ramakrishnan)

Raghu (a former professor at the University of Wisconsin–Madison, now at Yahoo!) is leading a very interesting project on large-scale storage (Sherpa). Here you can find some of my unconnected notes. Software as a service requires both CPU and data. Cloud computing is usually equated with Map-Reduce grids, but those decouple computation and data. For instance, Condor is great for high-throughput computing, but on the data side you run into SSDS, Hadoop, etc. There is, however, a third kind: transactional storage....

Mar 26, 2008 · 2 min · 289 words · Xavier Llorà

[BDCSG2008] “What” goes around (Joe Hellerstein)

Joe opens fire by saying "The web is big, a lot of monkeys pushing keys". Funny. The industrial revolution of data is coming: large amounts of data are going to be produced. The other revolution is the hardware one, leading to the question of how we program such animals to avoid the death of the hardware industry. The last one is the industrial revolution in software, echoing automatic programming. Declarative programming is great, but how many domains, and which ones, can absorb it....

Mar 26, 2008 · 2 min · 287 words · Xavier Llorà

[BDCSG2008] Mining the Web Graph (Marc Najork)

Marc takes the floor and starts talking about web graphs (the ones generated by page hyperlinks). Hyperlinks are a key element of the web. Lately, web pages show an increase in the number of links, usually generated by CMSs (for instance, navigation). However, there is a change in the meaning of those hyperlinks. Analytics come in different flavors: for example, PageRank is pretty simple, but others require random access, and hence memory storage (requiring huge graphs to be kept in memory)....

Mar 26, 2008 · 1 min · 204 words · Xavier Llorà
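The contrast Marc draws (PageRank is simple, other analytics need the whole graph in memory) can be illustrated with a few lines of power iteration. This is a minimal sketch, not code from the talk; the toy graph and parameter values are made up for illustration:

```python
# Minimal PageRank via power iteration (illustrative sketch only).
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each node to its list of outbound neighbors."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in links.items():
            if not outs:
                # Dangling node: spread its rank uniformly.
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
        rank = new
    return rank

# Hypothetical three-page web: "a" is linked from both "b" and "c".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Note that even this toy version touches every edge on every iteration, which is why the large-scale variants Marc mentions need the graph resident in memory (or clever disk layouts) for random access.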

[BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)

How can we help social scientists do their science, but also how can we build systems from the lessons learned? These topics also include security and the sensitivity of the data. He also reviews work from the karate-club paper to the latest papers on social networks. Scale changes the way you approach the data: the original studies allowed knowing what each link meant, but large-scale networks lose this property. However, he is pushing for a language to express some of the analyses of social networks and processes....

Mar 26, 2008 · 3 min · 436 words · Xavier Llorà

[BDCSG2008] Handling Large Datasets at Google: Current Systems and Future Directions (Jeff Dean)

Jeff was the big proponent of the map-reduce model (the map-reduce guy). Jeff reviews the infrastructure and the data sets (heterogeneous and at least petabyte scale); their goal is to maximize performance per buck. Data centers, locality, and placement are also key in the equation. Low cost (no redundant power supplies, no RAID disks, running Linux, standard networking). Software needs to be resilient to failure (nodes, disks, or racks going dead). Linux on all production machines....

Mar 26, 2008 · 2 min · 257 words · Xavier Llorà
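The map-reduce model Jeff championed can be sketched as an in-memory toy: a map phase emitting key/value pairs and a reduce phase aggregating by key. This is an illustrative sketch of the programming pattern only, not Google's implementation; the word-count example and input documents are invented:

```python
# Toy in-memory sketch of the map-reduce word-count pattern (illustrative only).
from collections import defaultdict

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the values."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

docs = ["big data computing", "big data"]
counts = reduce_phase(map_phase(docs))
print(counts)  # {'big': 2, 'data': 2, 'computing': 1}
```

The real system's value lies in everything this sketch omits: partitioning the map output across thousands of machines, and transparently re-running tasks when nodes, disks, or racks go dead, which is exactly the failure resilience Jeff emphasizes.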