Xavier Llorà

Efficient storage for Python

Have you ever run into a situation where your analysis or simulation data is too large to fit in memory? Has the flat file format you use for storing your data sets become so big that it slows everything to a crawl? If you answered yes, you may want to give the HDF5 library a spin. HDF5 files are not a replacement for relational databases. They cater to storing complex data objects and a wide variety of metadata, and the format is also optimized for efficient storage and retrieval. The underlying library is written in C. If you are a Python user, PyTables provides a very efficient wrapper for HDF5 files. It gives you access to all the HDF5 API, plus it is nicely integrated with NumPy and provides natural naming conventions. In other words, you can quickly store and retrieve your arrays and matrices in HDF5 files, giving you a very interesting persistence layer. For instance you can do a simple table scan by: ...
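The excerpt above is cut off before its own example, so what follows is not the original snippet. It is a minimal sketch of what a PyTables table scan can look like, assuming a recent PyTables 3.x release (the snake_case `open_file`/`create_table` API) is installed. The `Reading` record layout, the `readings` table name, and the helper functions `write_readings` and `scan_values` are hypothetical names chosen for illustration.

```python
import os
import tempfile

import tables  # PyTables; assumed to be installed (e.g. pip install tables)


class Reading(tables.IsDescription):
    # Hypothetical record layout, used only for this illustration.
    sample_id = tables.Int64Col()
    value = tables.Float64Col()


def write_readings(path, readings):
    """Store (sample_id, value) pairs as rows of a table in an HDF5 file."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "readings", Reading, title="Example readings")
        row = table.row
        for sample_id, value in readings:
            row["sample_id"] = sample_id
            row["value"] = value
            row.append()
        table.flush()  # make sure buffered rows reach the file


def scan_values(path, threshold):
    """Scan the table row by row and collect the values above the threshold."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.readings  # natural naming: nodes are reachable as attributes
        return [float(r["value"]) for r in table.iterrows() if r["value"] > threshold]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    write_readings(demo_path, [(0, 1.0), (1, 2.5), (2, 4.0)])
    print(scan_values(demo_path, 2.0))
```

The scan iterates over rows on disk rather than loading the whole table first, which is the point when the data does not fit in memory. For larger tables, PyTables also offers in-kernel queries through `table.where(...)`, which may be faster than filtering in a Python loop.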

Back to Urbana despite Continental Airlines

Yes, I am back. Almost 28 hours later than expected, but I am back in Urbana. The trip started with a very tiny little delay of 8 hours in Barcelona. Yes, COA121 was the flight to Newark. Yes, COA121 was the first leg toward Urbana. Yes, the plane had technical problems on the way to Barcelona and had to land in the Azores islands to get it “fixed”. Yes, we did not dare to ask. Yes, we were tired of waiting for a plane that seemed never to come; but you should have seen the faces of the passengers deplaning at Barcelona: exhausted is not a descriptive enough word. Yes, we all filled out a complaint form asking for our money back according to the European Union bill of rights for air passengers. Yes, we know that it was a bit of wishful thinking and the lack of anything better to do while waiting. ...

Managing your digital library of research papers

A while ago I wrote about tools for managing your LaTeX bibliography. Although the tools I described help manage your LaTeX bibliography collection, they still did not help much with managing the tons of PDF files you end up piling up when doing research on a particular topic. BibDesk now has the ability to attach files to entries, but Zotero, with its ability to store snapshots, is still the closest thing I have found so far. However, a friend just pointed me to Papers, a Mac tool (yes, it is only available for Mac) for managing your digital library of papers. Very much like iTunes, it lets you streamline your searching, reading, organizing, and writing; there is a very interesting webcast by the creators of the software. If you have a Mac, it is worth giving it a spin. ...

ICEIS 2008: Blogging summary and final strings

If you are looking for a list of the related blogging done during ICEIS 2008, just follow this link. On Sunday morning I ran into Angel A. Juan, an assistant professor at the Open University of Catalonia (UOC), who is interested in analyzing online teaching efforts and in how tools can assist professors in monitoring students' performance in online media. I visited him yesterday at his office and we had an interesting exchange of ideas. Most of them revolved around the work we have conducted under the DISCUS project, and how similar our efforts on marketing focus groups are to their online teaching environment. His group, Distributed, Parallel and Collaborative Systems, was also interested in the work done under the SEASR project, mostly focusing on the Meandre infrastructure for data-intensive flow computing that we are getting close to releasing. ...

ICEIS 2008: Final sprint and Ricardo Baeza-Yates

This is the final sprint for ICEIS. I have been mostly focusing on posters this morning. It is hard to pick just one. I would just say that there was some interesting work on personalized recommender systems (paper 219). But as I said, there were a bunch of interesting ones and quite a few interesting by-the-poster conversations. Actually, I am having a very interesting time thanks to the mix of attendees’ profiles. The morning finally meandered into Ricardo Baeza-Yates’s keynote talk. After the initial technical problems (the presentation mode of OpenOffice running on Ubuntu 8.04 showed just 75% of the slide area), they finally succeeded in getting something up and the talk started. This was a pretty technical talk about Yahoo! Research's efforts on caching to improve performance, help scalability, and contain costs in the coming years. Besides several caching techniques, he also presented a bunch of possible parallelization models based on document/term partitions. One thing he breezed over was the machine learning model for classifying queries. It surfaced in several places, from predicting common and rare content to frequent, infrequent, and rare queries. I was glad that the technical problems were solved and we could enjoy it. And the conference is finally closed. Next year, Milan. ...