IRILL - Research and Innovation on Free Software

Machine learning threats and opportunities for Debian and Free Software

by Pablo Ariel Duboue, on 2012-07-14

machine learning models and source data, that is

I would like to give a talk about how machine learning models are more of a game changer for Free Software than Debian has collectively agreed so far. The key aspects I want to discuss are user expectations for software that incorporates trained models, the preferred form of modification, training times, and licensing issues for training data. This talk is informed by the discussion on debian-legal [1] back in 2009 and by my work on the machine learning infrastructure for the IBM DeepQA Watson system. The objective is to start a discussion on the subject and bring more people to the table. (Hopefully without ending up in a major flamewar like in 2009 ;-)

[1] http://lists.debian.org/debian-legal/2009/05/msg00028.html

What follows is the skeleton of the talk. If you have any questions, just ping me on IRC (DrDub). I have already purchased my ticket for Managua so I'll certainly be there!

This talk in a nutshell
* Status Quo is good for now
* Many threats need to be addressed outside of Debian (e.g., licensing)
* The opportunities can be tackled by multi-distro efforts

Threats
* Source code is not what it used to be
** The value is in the data
** People are getting used to technology with complex training on the backend (e.g., Siri)
* Yet-another-clever-GPL-circumvention trick
** I give you the code but I charge you for the model. Sweet.

Questions
* How can we acquire the data?
** Maybe build a volunteer-driven, Free Software equivalent of Mechanical Turk?
* How can we ensure the data is kept Free?
** CC-SA and derivatives?

OK, we have the data (actually, we have plenty of data already), then what?
* Training machine learning models takes a whole different type of build machine
** 64 GB of RAM for 3 days, sure!
** Why? Oh my, why?

Opportunities
* The main challenge for Debian, IMO, is to turn users into contributors
** Training data contributors can follow the success case of translators
* Training data contributors
** Annotate more data to fix a bug
** More bugs with "data patches"
* Inter-distro collaboration opportunities
** Sharing data is easier than sharing code (think object-orientation)

Conclusions
* Don't shoot the messenger
* Not doing anything is an option... for now
* Any pointers for licensing?
* Any pointers for an inter-distro model training project?