[BDCSG2008] Mining the Web Graph (Marc Najork)

Marc takes the floor and starts talking about the web graphs (the one generated by pages hyperlinks). Hyperlinks is a key element of the element. Lately webpages has an increase of the number of links, usually generated by CMS (for instance navigation). However, there is a change on the meaning of those hyperlinks. Analytics have different flavors, for example page rank is pretty simple, but others require random access, requiring memory storage (requiring to to huge re graphs in memory). Using their own Microsoft tools, they distribute and replicate it in a cluster, to be able to run some of these analytic algorithms (for instance HITS for page ranking). Sampling can help deal with high a-rity nodes in a graph. He continues presenting the SALSA algorithm (successor of HITS). SALSA requires sampling, and Marc suggest that uniform works pretty well) However, how you evaluate the ranking algorithms? Compile a truth set? Sometime assembled by humans (may not know what the intend of the query was), but another alternative is to use click logs (potentially biassed toward the first results presented). As a field, he claims about the need to collaborate with social sciences to model and better understand the meaning and motivations of hyperlinks.