Minimum_description_length Minimum_description_length

Minimum description length - Definition and Overview

Related Words: Anecdote, Blood, Brand, Breed, Cast, Category, Character, Characterization, Chronicle, Clan, Class, Color, Commentary, Construction, Definition, Denomination, Diagnosis, Differentiation, Explanation, Feather, Form, Genre, Genus

The minimum description length principle is a formalization of Occam's Razor in which a computer program is used to represent some given information. Occam's Razor states, roughly speaking, that the simplest hypothesis that can explain the data at hand should be preferred over more complicated ones; in MDL, the simplest hypothesis is the shortest program. MDL is important in information theory and learning theory.

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet. "The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally." (Grünwald, 1998. See the link below.) Since we want to select the hypothesis that captures the most regularity in the data, we look for the hypothesis with which the best compression can be achieved.

In order to do this, we must first fix a code to compress the data. In other words, we must choose a (Turing-complete) computer language. We then write a program in that language, that outputs the data. This program thus represents the data. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is the central idea of Ray Solomonoff's idealized theory of inductive inference.

However, this mathematical theory does not provide a practical way of doing inference. The most important reasons for this are:

  • Kolmogorov complexity is uncomputable: there exists no computer program that, when input an arbitrary sequence of data, outputs the shortest program that produces the data. Even if we accidentally find the shortest program that outputs the data, it is in general not possible to know that it is the shortest.
  • The Kolmogorov complexity depends on what computer language is used to describe programs. It is only defined up to a constant number of bits. If only a small amount of data is available, then such constants may have a very large influence on the inference results: good results cannot be guaranteed when one is working with limited data.

MDL is an attempt to remedy these, by:

  • Restricting the set of allowed codes in such a way that it becomes possible (computable) to find the shortest codelength of the data, relative to the allowed codes, and
  • Choosing a code that it is reasonably efficient whatever the data at hand. This point is somewhat elusive and much research is still going on in this area.

Rather than "programs", in MDL theory one usually speaks of candidate hypotheses or models. The set of allowed programs is then called the model class. (To confuse matters, some authors refer to the model class as the model.) The model that, together with the data, has the shortest description length is then selected.

MDL was not the first attempt to do hypothesis selection through minimizing description length; as early as 1968 Wallace and Boulton pioneered a related concept called Minimum Message Length (MML). MDL was introduced by Jorma Rissanen in 1978; it differs from MML in several ways, most notably in its extensive use of one-part rather than two-part codes.

Central to MDL theory is the 1-1 correspondence between code length functions and probability distributions. (The lemma involved is the Kraft-McMillan inequality.) For any probability distribution <math>P<math>, it is possible to construct a code <math>C<math> such that the length (in bits) of <math>C(x)<math> is equal to <math>-log_2 P(x)<math>; this code minimizes the expected code length. Vice versa, given a code <math>C<math>, one can construct a probability distribution <math>P<math> such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code reduces to searching for a good probability distribution, and vice versa.

This has led some researchers to view MDL as being equivalent to Bayesian inference. Code length of the model and code length of model and data together in the MDL framework correspond to prior probability and marginal likelihood respectively in the Bayesian framework. This point of view is expressed for example in David MacKay's Information Theory, Inference, and Learning Algorithms (see link below). However, while Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework sometimes uses other codes that do not fit into a Bayesian framework. An example is the Shtarkov `normalized maximum likelihood code', which plays a central role in current MDL theory, but has no equivalent in Bayesian inference. Furthermore, the MDL Principle prefers some priors over others. While the same priors tend to be favored in so-called objective Bayesian analysis, they are favored for different reasons.

External links

  • A short introduction (http://volker.nannen.com/pdf/short_introduction_to_model_selection.pdf) to Model Selection, Kolmogorov Complexity and Minimum Description Length.
Copyright 2009 WordIQ.com - Privacy Policy  :: Terms of Use  :: Contact Us  :: About Us
This article is licensed under the GNU Free Documentation License. It uses material from the this Wikipedia article.