Home > Activities > Seminaire Logiciel Libre et Programmation > Farewell to Disks: Efficient Processing of Obstinate Data

Farewell to Disks: Efficient Processing of Obstinate Data

— filed under:

By Diomidis Spinellis

What
  • seminaire
When Mar 25, 2011
from 10:00 AM to 11:00 AM
Where IRILL, 23 Avenue d'Italie, Paris, France - 3th floor
Add event to calendar vCal
iCal

Questions whose answer requires sophisticated processing of huge data sets come up increasingly often in our networked and interlinked, and (increasingly) DNA-sequenced world. Attacking such problems with traditional techniques, such as loading data into memory for processing or querying a relational database, is cumbersome and inefficient. Data sizes are growing inexorably, while disk-based data structures and applications relying on them, optimized to handle sequential retrievals and relational joins, often prove inadequate for running complex algorithms. Therefore, sophisticated processing of huge complex data sets requires us to rethink the relationship between disk-based storage and main-memory processing.

Some features of modern systems, namely 64-bit architectures, memory mapped sparse files, virtual memory, and copy on write support, allow us to process our data with readable and efficient RAM-based algorithms, using slow disks and filesystems only for their large capacity and to secure the data’s persistence. I demonstrate this approach through a series of C++ programs that run on Wikipedia’s data looking for matching words and links between unrelated entries.

Through these programs I will show how we can use STL containers, iterators, and algorithms to access disk-based data without performing any system calls. Although RAM-based processing opens up again many problems that database systems already solve, I will argue that such processing is the right move, because it provides us with a unified programming and performance model for all our data operations irrespective of where the data resides.