Xavier Llorà

Easy, reliable, and flexible storage for Python

A while ago I wrote a short post about alternative column stores. One that I mentioned was Tokyo Cabinet (and its associated server, Tokyo Tyrant). Tokyo Cabinet is a key-value store written in C, with bindings for multiple languages (including Python and Java). It can keep databases in memory or persist them to disk (you can pick between hash-based and B-tree-based stores). Having heard a bunch of good things, I finally gave it a try. I installed both Cabinet and Tyrant (you may find useful installation instructions here, using the usual configure, make, make install cycle). Another nice feature of Tyrant is that it also supports HTTP gets and puts. With all that said, I just wanted to check how easy it was to use from Python. The answer: very easy. Joseph Turian’s examples got me running in less than 2 minutes (see the piece of code below) when dealing with a particular database. Using Tyrant over HTTP is quite simple too (see the PeteSearch blog post). ...
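
As a rough illustration of the HTTP gets and puts mentioned above, here is a minimal sketch that uses only the Python standard library. It is not the code from the posts referenced in the text. It assumes a Tokyo Tyrant server is already running on localhost at port 1978 (Tyrant's default) and that the key is carried in the URL path; the helper names tyrant_put and tyrant_get are made up for this example.

```python
import http.client
from urllib.parse import quote

# Assumption: a Tokyo Tyrant server is listening here (1978 is its default port).
TYRANT_HOST = "localhost"
TYRANT_PORT = 1978


def tyrant_put(key, value, host=TYRANT_HOST, port=TYRANT_PORT):
    """Store `value` (bytes) under `key` with an HTTP PUT; return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        # The key travels in the URL path, so percent-encode it first.
        conn.request("PUT", "/" + quote(key, safe=""), body=value)
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        return response.status
    finally:
        conn.close()


def tyrant_get(key, host=TYRANT_HOST, port=TYRANT_PORT):
    """Fetch the value stored under `key` with an HTTP GET; return bytes or None."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/" + quote(key, safe=""))
        response = conn.getresponse()
        body = response.read()
        # Treat any non-200 answer (for example a missing key) as "no value".
        return body if response.status == 200 else None
    finally:
        conn.close()
```

With a server running, calling tyrant_put("greeting", b"hello") followed by tyrant_get("greeting") should round-trip the value. The functions open a fresh connection per call to keep the sketch short; a real client would probably reuse connections.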

Large Scale Data Mining using Genetics-Based Machine Learning

Below you may find the slides of the GECCO 2009 tutorial that Jaume Bacardit and I put together. Hope you enjoy them.

Abstract: We are living in the petabyte era. We have larger and larger data to analyze, process, and transform into useful answers for the domain experts. Robust data mining tools, able to cope with petascale volumes and/or high dimensionality while producing human-understandable solutions, are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task, among others, due to the recent advances in representations, learning paradigms, and theoretical modeling. If evolutionary learning techniques aspire to be a relevant player in this context, they need to have the capacity to process these vast amounts of data, and they need to process this data within reasonable time. Moreover, massive computation cycles are getting cheaper and cheaper every day, allowing researchers to have access to unprecedented degrees of parallelization. Several topics are interlaced in these two requirements, among them: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts. This tutorial will try to address these topics, following a roadmap that starts with the questions of what large means and why large is a challenge for GBML methods. Afterwards, we will discuss different facets in which we can overcome this challenge: efficiency enhancement techniques, representations able to cope with large dimensionality spaces, and the scalability of learning paradigms. We will also review a topic interlaced with all of them: how we can model the scalability of the components of our GBML systems to better engineer them and get the best performance out of them on large datasets. The roadmap continues with examples of real applications of GBML systems and finishes with an analysis of further directions. ...

Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre

Below you may find the slides I used during GECCO 2009 to present the paper titled “Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre”. An early preprint, in the form of a technical report, can be found as IlliGAL TR No. 2009001, and the full paper is available at the ACM Digital Library.

NIGEL 2006 Part VI: Bacardit

After coming back from GECCO, I just uploaded the last of the NIGEL 2006 talks at LCS & GBML Central. This last talk was by Jaume Bacardit, on GBML for protein structure prediction.

NIGEL 2006 Part V: Bernardó vs. Lanzi

After the vacation break, two more NIGEL 2006 talks are available at LCS & GBML Central. This week Ester Bernardó presents how LCS can perform in the presence of class imbalance, whereas Lanzi continues his quest on computed predictions.