Programming

Meandre: Semantic-Driven Data-Intensive Flows in the Clouds

by Llorà, X., Ács, B., Auvil, L., Capitanu, B., Welge, M.E., Goldberg, D.E. (2008). This paper has been accepted at the 4th IEEE International Conference on e-Science. An early draft of the paper can be found as IlliGAL technical report 2008013. You can download the PDF here. More information is also available at the Meandre website as part of the SEASR project.

Abstract: Data-intensive flow computing allows efficient processing of large volumes of data otherwise unapproachable. This paper introduces a new semantic-driven data-intensive flow infrastructure which: (1) provides a robust and transparent scalable solution from a laptop to large-scale clusters, (2) creates a unified solution for batch and interactive tasks in high-performance computing environments, and (3) encourages reusing and sharing components. Banking on virtualization and cloud computing techniques, the Meandre infrastructure is able to create and dispose of Meandre clusters on demand, being transparent to the final user. This paper also presents a prototype of such clustered infrastructure and some results obtained using it. ...

Fast mutation implementation for genetic algorithms in Python

The other day I was playing around to see how much I could squeeze out of a genetic algorithm written in Python. The code below shows the example I used. The first part implements a simple two-loop version of a traditional random allele mutation. The second part is coded using numpy 2D arrays. The code also measures the time spent in both implementations using cProfile.

    import cProfile
    import numpy as np

    pop_size = 2000   # individuals in the population
    l = 200           # alleles per individual
    z = np.zeros((pop_size, l))

    def mutate():
        # Two-loop version: visit every allele and mutate it with probability 0.5.
        for i in range(pop_size):
            for j in range(l):
                if np.random.random() < 0.5:
                    z[i, j] = np.random.random()

    cProfile.run('mutate()')

    def mutate_matrix():
        # Matrix version: draw the mutation mask and the new values in one shot.
        r = np.random.random(size=(pop_size, l)) < 0.5
        v = np.random.random(size=(pop_size, l))
        # Take the new value where the mask is set, keep the old one elsewhere.
        z[:] = r*v + np.logical_not(r)*z

    cProfile.run('mutate_matrix()')

If you run the code listed above you may get something similar to ...
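The matrix version above builds a boolean mask and blends new and old values arithmetically. The same idea can be sketched with numpy.where, which picks between the two arrays directly. The function name mutate_where, the rate parameter, and the use of numpy.random.default_rng are assumptions of this sketch and are not part of the original code.

```python
import numpy as np


def mutate_where(population, rate=0.5, rng=None):
    """Return a mutated copy of a 2D population array (hypothetical helper).

    Each allele is replaced by a fresh uniform random value with
    probability `rate`; all other alleles are copied unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(population.shape) < rate    # which alleles mutate
    values = rng.random(population.shape)         # candidate new alleles
    return np.where(mask, values, population)     # pick new or old per allele


population = np.zeros((2000, 200))
mutated = mutate_where(population, rate=0.5, rng=np.random.default_rng(0))
```

Unlike the in-place loop, this variant returns a new array and leaves the input untouched, which makes it easier to compare the population before and after mutation.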

Who does your intranet link to?

Have you ever wondered who your intranet links to? I was sitting in a meeting the other day (yes, I know, breaking news) and wondering what would be a fast way to answer that question. The basic sketch I did in my mind was simple:

1. Set up a web crawler for the domain I want to analyze
2. Run the crawling job
3. Get the links collected on the web map
4. Process the links to keep only the site they refer to
5. Remove duplicates
6. Visualize the graph

Simple, isn't it? So, what do I need to get it to work? Basically three pieces of software (a web crawler, a graph manipulation library, and a visualization package) and some glue. Going over the things I have been playing with over the last year, I picked three candidates: Nutch, RDFlib, and prefuse. Oh, the glue will be just two Python scripts. ...
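The two link-processing steps (keeping only the site each link refers to and removing duplicates) can be sketched with the standard library alone. This is a minimal illustration, not the glue scripts from the post: the function names, the (source, target) pair format, and the example URLs are all assumptions, and no Nutch or RDFlib call is involved.

```python
from urllib.parse import urlparse


def site_of(url):
    """Return the host name of a URL, or None when it has none (e.g. mailto:)."""
    return urlparse(url).hostname


def site_edges(links):
    """Collapse (source_url, target_url) pairs into unique site-to-site edges.

    Links that stay within one site are dropped here; that is a choice of
    this sketch, made because the question is who the intranet links out to.
    """
    edges = set()
    for source, target in links:
        src_site, dst_site = site_of(source), site_of(target)
        if src_site and dst_site and src_site != dst_site:
            edges.add((src_site, dst_site))
    return sorted(edges)


# Hypothetical crawl output, for illustration only.
links = [
    ("http://intranet.example.org/a.html", "http://www.python.org/doc/"),
    ("http://intranet.example.org/b.html", "http://www.python.org/"),
    ("http://intranet.example.org/b.html", "http://intranet.example.org/c.html"),
    ("http://intranet.example.org/a.html", "mailto:someone@example.org"),
]
edges = site_edges(links)
# edges == [("intranet.example.org", "www.python.org")]
```

The resulting list of site pairs is the kind of edge list a graph library or a visualization package could take as input.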

New Semester and IACAT

Every new semester has its kickoff overhead. Besides trying to get back on track after travelling, and having meeting after meeting, now I am sitting at the inaugural event of the Institute for Advanced Computing Applications and Technologies at the University of Illinois at Urbana-Champaign. A list and description of the three kickoff projects of the center is also available. The three themes are:

- Next-Generation Acceleration Systems for Advanced Science and Engineering Applications
- Multiscale Simulation in Science and Engineering
- Synergistic Research on Parallel Programming for Petascale Applications

Quite an interesting mix to follow. ...

Efficient storage for Python

Did you ever run into the situation where your analysis/simulation data is too large to fit in memory? Has the flat file format you use for storing your data sets become so big that it slows everything to a crawl? If you answered yes, you may want to give the HDF5 library a spin. HDF5 files are not a replacement for relational databases. They are designed for storing complex data objects and a wide variety of metadata, and they are optimized for efficient storage and retrieval. The underlying library is written in C. If you are a Python user, PyTables provides a very efficient wrapper for HDF5 files. It gives you access to the HDF5 API, plus it is nicely integrated with NumPy and provides natural naming conventions. In other words, you can quickly store and retrieve your arrays/matrices to and from HDF5 files, giving you a very interesting persistence layer. For instance you can do a simple table scan by: ...
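As an illustration of the kind of persistence layer described above, here is a minimal PyTables sketch that stores a small table and a NumPy array and then scans the table. It is not the listing from the original post: the Reading record layout, the node names, the threshold, and the helper write_and_scan are all hypothetical, and the sketch assumes the tables package is installed.

```python
import os
import tempfile

import numpy as np
import tables


class Reading(tables.IsDescription):
    # Hypothetical record layout, chosen only for this sketch.
    run_id = tables.Int32Col()
    fitness = tables.Float64Col()


def write_and_scan(path, threshold=0.5):
    """Write a table and an array to an HDF5 file, then scan the table."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "readings", Reading)
        row = table.row
        for i in range(10):
            row["run_id"] = i
            row["fitness"] = i / 10.0
            row.append()
        table.flush()
        # NumPy arrays can be stored directly as array nodes.
        h5.create_array("/", "population", np.zeros((4, 3)))

    with tables.open_file(path, mode="r") as h5:
        # Natural naming: nodes are reachable as attributes of h5.root.
        table = h5.root.readings
        # Plain table scan: visit every row and filter in Python.
        hits = [int(r["run_id"]) for r in table.iterrows()
                if r["fitness"] > threshold]
        population = h5.root.population.read()
    return hits, population


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hits, population = write_and_scan(os.path.join(tmp, "demo.h5"))
        print(hits, population.shape)
```

For larger tables, PyTables also offers in-kernel queries through table.where, which evaluate the condition without building a Python object for every row.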