Sunday, March 24, 2013

Does Big Data Require an Epistemological Revolution?

In a paper recently published in Paris Tech Review, Big Data: Farewell to Cartesian Thinking?, Jean-Pierre Malle argues that "Big Data deviates from 'traditional' scientific knowledge". So radical is this deviation that it entails a "cultural revolution" turning into an "industrial revolution", as dramatized by the breakneck speed of technological innovation. And indeed Cartesian dualism – among other intellectual and religious influences – did prepare the way for modern scientists to think about the world in abstractions. The Newtonian world-view of physics, for instance, was certainly facilitated by Descartes' philosophical system. Before Descartes and Newton, such plainly observable phenomena as the falling of an apple from a tree, the rhythm of the tides in the oceans, or the movements of the planets around the sun were separate and distinct events, "raw data" in Malle's view, "usually kept for subsequent processes that are not yet determined". Through Newton's abstract conceptualization, however, apples, oceans, and heavenly bodies all became essentially the same: masses attracted by other masses, all moving according to the same laws of gravitation. Malle does not take issue with this kind of abstraction, however: "for inductive speed algorithms, it is preferable to transform information from a form 'in extension' (eg conservation of all receipts of Mr. Smith [/or records of phenomena/]) to a form 'in comprehension' (eg Mr. Smith buys a loaf all Mondays and sometimes a cake [/or the law of gravitation/]) more manageable and less bulky". Abstraction as such hardly qualifies as a "farewell" to Cartesian thinking.
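To make Malle's distinction concrete, here is a minimal sketch in Python of moving from a form "in extension" to a form "in comprehension". The receipts, the week numbering, and the two helper functions are all hypothetical, invented for illustration; they are not taken from Malle's article.

```python
from collections import defaultdict

# Hypothetical receipts "in extension": one (week, weekday, item) record per purchase.
receipts = [
    (1, "Mon", "loaf"),
    (2, "Mon", "loaf"), (2, "Mon", "cake"),
    (3, "Mon", "loaf"),
    (4, "Mon", "loaf"), (4, "Thu", "cake"),
]


def summarize(records):
    """Compress receipts into the fraction of weeks in which each
    (weekday, item) pair occurs: a crude form "in comprehension"."""
    weeks = {week for week, _, _ in records}
    seen = defaultdict(set)
    for week, day, item in records:
        seen[(day, item)].add(week)
    return {key: len(ws) / len(weeks) for key, ws in seen.items()}


def describe(frequency):
    """Map a frequency to the loose quantifier used in the rule."""
    return "always" if frequency >= 1.0 else "sometimes"


rules = summarize(receipts)
for (day, item), frequency in sorted(rules.items()):
    print(f"Mr. Smith {describe(frequency)} buys a {item} on {day}")
```

The summary is smaller than the receipts and easier to reuse, but it is also lossy: nothing in `rules` says which Monday the cake was bought.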

In fact, according to the author, the issue lies in the deductive thinking of the "Western scientific tradition that derives from Descartes". Malle's critical point is that Big Data conceptualization and processing, in contrast, demand induction. The metaphor used to illustrate the point is illuminating; it runs along the following lines.

  • "Induction, unlike deduction, is a mechanism used by the human brain at almost every moment";
  • "[/Induction/] is particularly relevant when analyzing a situation out of its context";
  • "For example, to apply deductive logics to the decision of crossing a street, you would need to measure all vehicle speeds, locate them in space and calculate, using a set of equations, which is the right time to cross. Needless to say, the slowness of this technique of analysis would be more of an obstacle, compared with the use your own senses and cognitive abilities… In fact, our brain captures the global scene in a comprehensive situation and processes it by using induction. To do this, it generalizes the principles observed in similar situations involving us – or others – that we have observed (other people crossings streets, at any time, with or without light, wet or dry ground, etc.). Our brain is able to integrate a huge number of parameters in a flash and project the results of its inductions on the current scene.";
  • "This is exactly what Big Data processing needs: search instantly for the critical information, process them as a whole without preconditions, reproduce effective mechanisms that have been observed in the past, generate new data that can be used directly in the current situation.";

These are strongly stated epistemological claims. They are immediately followed by the inevitable reference to a famous 2008 article by Chris Anderson in Wired, to the effect that "the knowledge from Big Data will be produced by 'agnostic' statistics. This lack of ideology is the very condition of their success: in their own way, numbers speak for themselves". Even so, the claims are worth investigating further and qualifying more carefully.

In the analogy offered by the author, the Big Data algorithm – later called an inductive algorithm in the article – plays the role of the human brain confronting a worldly situation in real time. It is a trademark of the current commentary on Big Data that resurgent ideas from early cybernetics serve as a source of metaphors. The viewpoint from constructivist empiricism, acknowledged in the article, is then that Big Data repositories and streams constitute the "world" – a Brave New World, possibly – and that Big Data processing constructs the "knowledge", a representation of "reality". In keeping with this cybernetic line of reasoning, what is missing here is a proper identification of the closing of the loop, the Wiener-Odobleja notion of feedback. Big Data is not as innocent as that, notwithstanding Anderson. As Hans Jonas remarked, "there is a strong and, it seems, almost irresistible tendency in the human mind to interpret human functions in terms of the artifacts that take their place, and artifacts in terms of the replaced human function". Big Data appears to be generally collected, if in unquestionably staggering volume and at blazing speed, to serve a purpose – precisely an analysis in view of, say, a commercial (e-commerce marketing) or a public health (Open Data) objective. Aren't inductive algorithms, singled out as "particularly relevant when analyzing a situation out of its context" and "generat[/ing/] new data that can be used directly in the current situation", reviving the cybernetic attempt to account for purposive behavior without purpose – much as behaviorism is an attempt at psychology without the psyche?

Not that induction itself, moreover, is free of deep questions. The fascinating predicates of Nelson Goodman's Fact, Fiction and Forecast, grue and bleen, respectively:

  • (grue) applying to all things examined before a time t just in case they are green but to other things just in case they are blue, and
  • (bleen) applying to all things examined before a time t just in case they are blue but to other things just in case they are green;

are illuminating cases in point. They illustrate the new riddle of induction.
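The two predicates can be sketched in a few lines of Python. The cutoff `T`, the observation years, and the function names are arbitrary choices made for illustration only; the point is that every observation made before t confirms "all emeralds are green" and "all emeralds are grue" equally well, while the two hypotheses disagree about anything first examined afterwards.

```python
T = 2030  # hypothetical time t; any cutoff later than the observations will do


def is_grue(color, examined_at, t=T):
    """Grue: green if examined before t, blue otherwise."""
    if examined_at < t:
        return color == "green"
    return color == "blue"


def is_bleen(color, examined_at, t=T):
    """Bleen: blue if examined before t, green otherwise."""
    if examined_at < t:
        return color == "blue"
    return color == "green"


def color_if_green(examined_at):
    """Color predicted by the hypothesis 'all emeralds are green'."""
    return "green"


def color_if_grue(examined_at, t=T):
    """Color predicted by the hypothesis 'all emeralds are grue'."""
    return "green" if examined_at < t else "blue"


# Hypothetical data: every emerald examined so far, all before t, was green.
observations = [("green", year) for year in range(2000, 2013)]

all_green = all(color == "green" for color, _ in observations)
all_grue = all(is_grue(color, year) for color, year in observations)
print(all_green, all_grue)  # both hypotheses fit every observation

# Yet they disagree about an emerald first examined after t.
print(color_if_green(T + 1), color_if_grue(T + 1))
```

No amount of additional data collected before t separates the two hypotheses, which is why the choice between them cannot be read off the observations alone.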

Let us start with Hume's old riddle of induction, because it sits at the core of the author's argument in favor of inductive algorithms. Goodman describes Hume's riddle of induction as the problem of the validity of the predictions we make. Since predictions are about what has yet to be observed, and because there is no necessary connection between what has been observed and what will be observed, what is the justification for the predictions we make? And indeed, as Malle rightly argues, we cannot use deductive logic to infer predictions about future observations from past observations, because there are no valid rules of deductive logic for such inferences. In fact there has been, since Haskell Curry, a very active research community in mathematical logic striving to refine these statements further (works by Martin-Löf, Girard, and Prawitz seem relevant here). Hume's answer was that our observations of one kind of event following another kind of event result in our minds forming habits of regularity. Goodman tackles the old riddle by turning, for comparison, to the problem of justifying a system of rules of deduction. For Goodman, the validity of a deductive system is justified by its conformity to good deductive practice. The justification of the rules of a deductive system then depends on our (individual, group, sociological) judgments about whether to reject or accept specific deductive inferences. Thus, for Goodman, the problem of induction dissolves into the same problem as justifying a deductive system: the problem of confirmation of generalizations.

The new riddle of induction, for Goodman, rests on our ability to distinguish law-like generalizations, required for making predictions, from non-law-like generalizations. Law-like generalizations are capable of confirmation while non-law-like generalizations are not. A good inductive algorithm based on a Big Data collection of observations of emeralds should conclude that they are green and not grue, at whatever time t it is run. How can we make sure that it does?

This is where pragmatic approaches are called for, and after this investigatory detour, we concur – albeit for different reasons – with some of the calls to action in Malle's article. Leo Breiman, in Statistical Modeling: The Two Cultures, helps us reclaim some of these treacherous inductive grounds. Breiman makes Big Data fit in the modest (though mythical) Black Box: "Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables". (A basic setting indeed, which may nonetheless carry ominous undertones of the sealed box of Schrödinger's cat. Rest assured, gentle reader, we won't go into the alternative interpretations of Quantum Mechanics here!) Breiman then brings out purpose, naming two goals for the analysis:

  • Prediction: to be able to predict what the responses are going to be to future input variables;
  • Information: to extract some information about how nature is associating the response variables to the input variables.

The earlier remarks on the often underplayed importance of the purposes served by collecting and analyzing data, and on the hard problem of distinguishing Goodman's projectable predicates from non-law-like generalizations, should make Breiman's remark clear enough. Breiman then goes on to contrast two cultures in their approach to both goals.

In the data modeling culture, the analysis begins by assuming a stochastic data model for the inside of the black box. The values of the parameters are estimated from the data, and the model is then used for information and/or prediction. (One then talks of model validation, goodness-of-fit and so forth.) In the algorithmic modeling culture, the analysis considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y. (One also talks of algorithm validation, but in terms such as predictive accuracy.)
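The contrast between the two cultures can be sketched with a toy black box. Everything below is an assumption made for illustration and not taken from Breiman's paper: the hidden function `nature`, the sample sizes, the straight-line data model, and the nearest-neighbour predictor standing in for "an algorithm that operates on x". The box is deliberately chosen so that the assumed linear model is wrong, so the comparison shows what each culture checks rather than which one is better in general.

```python
import random

random.seed(0)


def nature(x):
    """The inside of the black box (hidden from both analysts):
    a nonlinear response plus noise."""
    return x * x + random.gauss(0, 0.1)


data = [(x, nature(x)) for x in (random.uniform(-2, 2) for _ in range(200))]
train, held_out = data[:150], data[150:]


def fit_linear(points):
    """Data modeling culture: assume y = a + b*x + noise and
    estimate a and b by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def fit_knn(points, k=5):
    """Algorithmic modeling culture: no model of the box, only a
    predictor f(x) averaging the k nearest training responses."""
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def mean_squared_error(f, points):
    """Predictive accuracy on points not used for fitting."""
    return sum((f(x) - y) ** 2 for x, y in points) / len(points)


linear_error = mean_squared_error(fit_linear(train), held_out)
knn_error = mean_squared_error(fit_knn(train), held_out)
print(f"linear model: {linear_error:.3f}  nearest neighbours: {knn_error:.3f}")
```

The linear fit yields two interpretable parameters but predicts poorly here; the nearest-neighbour predictor yields no parameters to interpret and is judged only by its held-out error, which is the trade-off between Breiman's two goals of information and prediction.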

Breiman, a well-informed critic of the prevalence of the data modeling culture in statistics, calls for a new balance in the discipline, leaning towards the algorithmic modeling approach – a research area to which he made momentous contributions. His conclusion – the paper was published in 2001 – offers the following: "The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots." The current flurry of research and development, both theoretical and practical, on Big Data representation and algorithms testifies that this hope was not voiced in vain.

Saturday, March 16, 2013

Personal Big Data

L'Ordinateur individuel (literally, "The Personal Computer") disappears this month to become 01 Net Magazine. The author, a young hacker before his time, still remembers emerging dazzled from the "microcomputing" tent (Sicob Micro-Boutique), pitched on the still-deserted esplanade of La Défense on the fringes of Sicob 1978, and clutching the first issue with emotion. The magazine vanishes at the very moment when the personal triumphs throughout computing (smartphones, tablets, social network profiles, personalized recommendations on e-commerce sites), sometimes where one least expects it. Big Data, for example.

Today, Big Data is you! We are witnessing the resurrection of the doctrine of the Vienna Circle, but applied, more than a century later, to oneself: "self knowledge by numbers", the Quantified Self site dogmatically announces. Quality of life through personal quantification! Here is the New Age hygienist slogan, an allegorical collision of sustainable development and quantum mechanics! The ambiguity of the watchword reveals everything: self-knowledge through measurement (more precisely through self-measurement, notably of one's own medical parameters, with a perfectly laudable "health" objective) also means personal information in great quantity, with an inevitable Big Data, Big Brother analogy from which dark images well up.

The miniaturization of sensors and their systematic connection to the Net quite literally allow us to clothe ourselves in a fabric of communicating measurement points, capable of continuously emitting the streams of numbers that would define your self. This is stepping straight into cyberspace, until recently still the territory of science fiction. And the reversal of viewpoint is close at hand: soon you will be nothing but this cord of correlated data streams, this statistical fog of linear regressions in the making, the latest technical avatar of Bergson's "supplément d'âme" (the extra measure of soul). From the panoptic surveillance of all your vital parameters, from your daily physical activity down to the kinetics of the smallest molecules, from the apoptosis of your cells to the map of your personal genome (in reconstituted colors, available in several practical and inexpensive formats), you will know everything, everything about yourself. The personal computer is indeed fading before the person turned computer.

All the more so since the personal computer, for its part, puts Big Data within everyone's reach. A new generation of software tools is dawning that threatens to relegate to the Darwinian rank of dinosaurs the discipline's founding frameworks, such as Hadoop and Pregel, all great predators of the datacenter habitat. Today, so rapid is progress in storage and parallelization technologies that teraflops (Teratophoneus Data) are no longer needed to analyze Big Data; a simple PC is amply sufficient for the task.

GraphChi, for example, uses an innovative algorithm to run computations on very large graphs, on the order of a billion vertices, from the plain hard disk or SSD of a modest present-day PC. Shark turbocharges your analytical queries: 5 to 10 times faster on disk than Hive, Hadoop, or the fastest massively parallel databases, and 100 times faster on SSD! Julia, a new programming language for technical and scientific applications, is said to promise to leave its big brother R (which is nonetheless enjoying growing success, carried along by Big Data) in the starting blocks. Tsunamis of trillions of time-series points are processed on the fly by the new so-called "dynamic time warping" algorithms; it is as beautiful as Star Trek! In short, the data scientist's complete toolkit is arriving on your PC: Big Data for all, and to each their own Big Data.

We will not return to the epistemological position underlying this rush toward Big Data, which we have already discussed in these columns, but simply note that we have not heard the last of data and of the computer, both as personal as ever after these first thirty-five years.

(And happy birthday to rms, 60 years old today!)

By Jean-Marie Chauvet. March 16, 2013.