[BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)

How can we help social scientists do their science, and how can we build systems from the lessons learned? The topic also includes security and the sensitivity of the data. He reviews work from the original karate-club paper to the latest papers on social networks. Scale changes the way you approach the data: the original studies let you know what each link means, but large-scale networks lose this property. He is working toward a language to express some of these analyses of social networks and their processes: how we bind information to each user, how we can model users, and also security policies. Diffusion in social networks covers how things propagate (even locally), but it is hard to measure how people change their minds during the diffusion process. In the chain-letter study, the petition and its trace were collected; letters can also be forwarded to mailing lists, and some of the traces from those lists can be recovered. The paths were messed up with mutations (typos), amputations, etc., so they built algorithms for maximum-likelihood assembly of the tree. The output was unexpected: as opposed to six degrees of separation, they found narrow, deep trees. Why would a chain letter run like a depth-first search? Time played a role: since friends respond at different times, the replicated copies were basically discarded. A model of the trees was able to replicate this once the time dimension was included. Another element thrown into the mix is the diffusion threshold: a message gets in, but how many repeated inputs do you require to validate it and pass it along? Results show that the second input is the one that boosts adoption the most (sketched below). Viral marketing is another example of wanting to understand diffusion. All this leads to multiple models and to the question of how to integrate them. Privacy in social networks is another key element. How does that play out? Is anonymization the way to go? Social network graphs, even if anonymized, contain hints that can lead to de-anonymization of the picture. Before the network is released you can add nodes and edges to it, and then you have something to work back from: the idea is to create a unique pattern and then link it to the people you want to identify. You can compromise a graph with on the order of the square root of the logarithm of the number of nodes. Jon's final reflections: toward a model of you. Models of human behavior are possible (for instance, modeling the time it takes to reply to email). Computers track more and more information about your behavior, opening the door to new kinds of modeling (something the DISCUS project has also been postulating for the last five years). ...
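To make the threshold idea concrete, here is a minimal, purely illustrative sketch (my own, not Kleinberg's code): a node passes the message along only after it has heard it from at least k distinct friends, and the simulation reports how far the cascade spreads. The graph, node names, and threshold value are all invented.

```python
from collections import deque

def threshold_cascade(graph, seeds, k=2):
    """Simple k-threshold diffusion: a node adopts (and forwards) the message
    once at least k distinct adopting neighbors have sent it along.
    graph maps a node to its list of neighbors; seeds start as adopters."""
    exposures = {}            # node -> number of distinct adopting neighbors seen
    adopted = set(seeds)
    frontier = deque(seeds)
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in adopted:
                continue
            exposures[neighbor] = exposures.get(neighbor, 0) + 1
            # The talk's observation: the second exposure is the one that matters most.
            if exposures[neighbor] >= k:
                adopted.add(neighbor)
                frontier.append(neighbor)
    return adopted

# Toy friendship graph (invented) with two initial adopters.
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["a", "c"],
}
print(threshold_cascade(graph, seeds={"a", "b"}, k=2))  # all four nodes end up adopting
```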

Mar 26, 2008 · 3 min · 436 words · Xavier Llorà

[BDCSG2008] Handling Large Datasets at Google: Current Systems and Future Directions (Jeff Dean)

Jeff was the big proposer of the map-reduce model (the map-reduce guy). Jeff reviews the infrastructure and the heterogeneous data sets (heterogeneous and at least petabyte scale); their goal is to maximize performance per buck. Data centers, locality, and power are also key in the equation. Low cost (no redundant power supplies, no RAID disks, running Linux, standard networking). Software needs to be resilient to failure (nodes, disks, or racks going dead). Linux on all the production machines. A scheduler across the cluster schedules jobs, with a cluster-wide file system on top and usually a Big Table cell. The GFS centralized master manages metadata, allocation of chunks, and replication (you talk to the master and then talk to the chunk servers). Big Table helps applications that need a bit more structured storage (row key, column, timestamp). It also provides garbage-collection policies. The data is broken into tablets (ranges of rows), each managed by a single machine, and the system can split growing tablets. Big Table provides transactions and lets you specify locality groups of columns to store together. They allow replication policies across data centers. MapReduce is a nice fitting model for some programs (for instance, reverse-index creation; a toy sketch follows below). It allows moving computation closer to the data and also allows implementing load balancing. GFS works OK for a single cluster, but they do not have a global view across data centers. For instance, they are looking at unique naming of data and, if integrated, at letting a data center keep working if it becomes disconnected. They are also looking at data distribution policies. ...
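As a toy illustration of the map-reduce programming model applied to the reverse-index example mentioned above (my own single-machine sketch; the function names and documents are invented, and it ignores the distributed runtime, GFS, and fault tolerance the talk is actually about):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # map: emit (word, doc_id) pairs for every word in the document.
    for word in text.lower().split():
        yield word, doc_id

def reduce_phase(word, doc_ids):
    # reduce: collapse all doc ids for a word into a sorted posting list.
    return word, sorted(set(doc_ids))

def mapreduce(documents):
    grouped = defaultdict(list)        # shuffle: group intermediate pairs by key
    for doc_id, text in documents.items():
        for word, value in map_phase(doc_id, text):
            grouped[word].append(value)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())

# Tiny invented corpus -> inverted index: word -> documents containing it.
docs = {"d1": "big data at google", "d2": "big table and gfs"}
print(mapreduce(docs))   # e.g. {'big': ['d1', 'd2'], 'data': ['d1'], ...}
```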

Mar 26, 2008 · 2 min · 257 words · Xavier Llorà

[BDCSG2008] Simplicity and Complexity in Data Systems (Garth Gibson)

Energy community and HPC. It is cheaper to collect a lot of samples and run simulations to decide where to drill (the extremely costly part). Review of several efforts on modeling for doing science. They also ran a collection of failure and maintenance-cycle statistics on hardware: job interruptions grow linearly with the number of chips (quite an interesting result). Compute power in the Top500 still doubles every year, and so do the failure rates. A lost disk becomes more and more painful, since regeneration currently takes about 8 hours per terabyte and is predicted to take weeks. All this calls for a change in design. The approach is to spread the data as widely as possible, which can help reduce reconstruction times roughly linearly (see the sketch below). File systems become "object" file systems such as the Google FS and Hadoop FS. pNFS: scalable NFS coming soon. A key goal: reuse the tools people are already using, to speed up adoption and make it appealing to the users. ...
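A back-of-the-envelope sketch of the "spread the data" argument above (my own numbers, not Gibson's): if the contents of a failed disk are reconstructed from many disks in parallel rather than from a single partner, rebuild time drops roughly linearly with the number of participating disks.

```python
def rebuild_hours(disk_tb, rebuild_mb_per_s, participating_disks):
    """Rough declustered-rebuild estimate: surviving disks each reconstruct
    an equal slice of the failed disk's capacity in parallel."""
    total_mb = disk_tb * 1_000_000                  # TB -> MB (decimal, rough)
    per_disk_mb = total_mb / participating_disks
    return per_disk_mb / rebuild_mb_per_s / 3600    # seconds -> hours

# Illustrative numbers only: a 1 TB disk rebuilt at ~35 MB/s per disk,
# which gives roughly the 8 hours per terabyte cited in the talk.
for disks in (1, 10, 100):
    print(f"{disks:3d} disks -> {rebuild_hours(1, 35, disks):.1f} hours")
```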

Mar 26, 2008 · 1 min · 181 words · Xavier Llorà

[BDCSG2008] Computational Paradigms for Genomic Medicine (Jill Mesirov)

Jill reviews what is going on with data in biology. There has been an explosion in the amount of data they are generating (in both volume and throughput). Simulations have also become common practice, along with robot operations, etc.: more and more data. Some numbers: their center now uses 4.8K processors and 1440+ terabytes of storage. The challenge is to give the proper tools to biologists (not CS people). The two key topics of the talk: computational paradigms and computational foundations. They heavily rely on genome expression arrays (rows are patients, columns are genes, values are expression levels). A simple example: classifying leukemias (an example of how they can be distinguished using expression arrays). Take patient samples, extract messenger RNA, and then create the expression signatures (high dimensionality, low training sample size; a toy version of this setup is sketched below). They repeated the same approach to predict outcome and prognosis for brain cancer, but for that problem there was no signal strong enough to make them accurate. Genes work in regulatory networks (sets of genes), so they redid the analysis at that level, which acts as adding background knowledge to the problem, boosting the results and making treatment possible. But the problem is that there should be an infrastructure that is easy to use and able to replicate experiments. The infrastructure should integrate and interact with components and should support both tech-savvy and non-technical users equally. Two interfaces (visual and programmatic), access to a library of functions, the ability to write pipelines, language agnostic, and built on web services (scalable from a laptop to clusters). The name: GenePattern. They are collaborating with Microsoft on a tool (a Word document) that links pipelines and data within the document (so the analysis can be re-run with other versions) and appends results to the document too. ...
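A minimal sketch of the expression-array setup described above (rows are patients, columns are genes): a nearest-centroid classifier over the expression matrix. The data here is invented noise plus an artificial signal; it only illustrates the data layout and the high-dimensionality, low-sample-count flavor of the problem, not GenePattern or the Broad's actual pipelines.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """X: patients x genes expression matrix; y: class label per patient."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def nearest_centroid_predict(centroids, X):
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return [labels[i] for i in dists.argmin(axis=0)]

# Toy data: 6 patients, 50 genes, two invented subtypes "ALL" and "AML";
# a handful of genes carry the separating signal.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 50))
X_train[:3, :10] += 3.0
y_train = np.array(["ALL"] * 3 + ["AML"] * 3)

centroids = nearest_centroid_fit(X_train, y_train)
X_new = rng.normal(size=(2, 50))
X_new[0, :10] += 3.0
print(nearest_centroid_predict(centroids, X_new))   # expected: ['ALL', 'AML']
```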

Mar 26, 2008 · 2 min · 277 words · Xavier Llorà

[BDCSG2008] Clouds and ManyCores: The Revolution (Dan Reed)

Dan Reed (former NCSA director, now at Microsoft Research) continues the meeting presentations. His elevator pitch: the infrastructure needs to take into account applications and the user experience. The current trend is that monolithic data consolidation is crumbling under dispersion, changing the traditional picture. The flavors of big data can be explored along two dimensions: regular versus irregular, and structured versus unstructured. He emphasizes focusing more on the user experience with big data and on how you can manage resources at any given point. Cloud computing can help organically orchestrate these resources on demand. He also showed some examples of Dryad (the Microsoft take on map-reduce architectures) and DryadLINQ. Another interesting comment: ...

Mar 26, 2008 · 1 min · 140 words · Xavier Llorà