
MIC correlation

The MIC method is based on mutual information (MI), which quantifies how much the joint distribution of X and Y differs from what that distribution would be if X and Y were independent (see, e.g., the Wikipedia entry). Mathematically, MI is defined as

MI = H(X) + H(Y) - H(X, Y)

where

H(X) = -Σ_i p(x_i) log p(x_i)

is the entropy of a single variable and

H(X, Y) = -Σ_{i,j} p(x_i, y_j) log p(x_i, y_j)

is the joint entropy of two variables.
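
As a quick numerical illustration of these definitions, here is a rough sketch that estimates MI from a discretized sample (the 10x10 binning and the toy data are arbitrary choices for the example, not part of the method):

    import numpy as np

    def entropy(p):
        """Shannon entropy (in bits) of a probability vector; 0 log 0 is treated as 0."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = x + 0.5 * rng.normal(size=1000)           # noisy linear relationship

    counts, _, _ = np.histogram2d(x, y, bins=10)  # discretize onto a 10x10 grid
    p_xy = counts / counts.sum()                  # joint distribution p(x_i, y_j)
    p_x = p_xy.sum(axis=1)                        # marginal p(x_i)
    p_y = p_xy.sum(axis=0)                        # marginal p(y_j)

    mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
    print(f"estimated MI: {mi:.3f} bits")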

 

The authors' main idea is to discretize the data onto many different two-dimensional grids and to calculate, on each grid, a normalized score that represents the mutual information of the two variables. The scores are normalized to ensure a fair comparison between different grids; they range from 0 (uncorrelated) to 1 (strongly correlated).

MIC is defined as the highest score obtained and indicates how strongly the two variables are correlated. In fact, the authors claim that for noiseless functional relationships, MIC values are comparable to the coefficient of determination (R^2).
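
To make the grid idea concrete, here is a crude brute-force sketch of the same principle (an illustration only, not the authors' algorithm: it uses only equal-width grids over a handful of resolutions, whereas the actual method also optimizes the bin edges):

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def grid_score(x, y, nx, ny):
        """Mutual information on an nx-by-ny grid, normalized to lie in [0, 1]."""
        counts, _, _ = np.histogram2d(x, y, bins=[nx, ny])
        p_xy = counts / counts.sum()
        p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
        mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
        return mi / np.log2(min(nx, ny))  # MI never exceeds log2(min(nx, ny))

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 1000)
    y = np.sin(4 * np.pi * x) + 0.1 * rng.normal(size=1000)  # noisy sine wave

    # crude stand-in for MIC: best normalized score over a few small grids
    scores = [grid_score(x, y, nx, ny) for nx in range(2, 8) for ny in range(2, 8)]
    print(f"highest normalized grid score: {max(scores):.3f}")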

Comments

  • jaguar1637 3076 days ago

    MIC and MINE, a short description

     

    I thought I should give a brief description of MIC and MINE, the topic of our Science paper.  Although you're probably better off, if you're interested, looking at our website, http://www.exploredata.net/.  Not only does it have a brief description of the algorithm, but now we have a link that gives access to the paper (if you couldn't access it previously) available on this page, thanks to the people at Science.

    Our starting point was figuring out what to do with large, multidimensional data sets.  If you're given a bunch of data, how do you start to figure out what might be interesting?  A natural approach is to calculate some measure on each pair of variables which serves as a proxy for how interesting they are -- where interesting, to us, means that the variables appear dependent in some way.  Once you have this, you can pull out the top-scoring pairs and take a closer look.  

    What properties would we like our measure to have?

    First, we would like it to be general, in the sense that it should pick up as diverse a collection of associations between variables as possible.  The well-known Pearson correlation coefficient, for example, does not have this property -- as you can see from this Wikipedia picture

    [Wikipedia figure: sets of scatterplots labeled with their Pearson correlation coefficients]



    it's great at picking up linear relationships, but it's not so good at other relationships (periodic sine waves, parabolas, exponential functions, circular associations).
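
    As a tiny numerical illustration of that point (a toy example, not from the paper): a noiseless parabola can have a Pearson coefficient near zero even though the dependence is perfect.

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.uniform(-1, 1, 10_000)
        line = 2 * x + 1                        # linear relationship
        parabola = x ** 2                       # noiseless, but nonlinear

        print(np.corrcoef(x, line)[0, 1])       # ~1.0: Pearson picks this up
        print(np.corrcoef(x, parabola)[0, 1])   # ~0.0: strong dependence that Pearson misses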

    Second, we would like it to be what we call equitable.  By this we mean that scores should degrade with noise at (roughly) the same rate.  If you start with a line and a sine wave, and add an equivalent amount of noise to both of them in the same way, the resulting scores should be about the same;  otherwise the method would give preference to some types of relationships over others.

    I should point out that the goal behind these properties is to allow exploration -- the point is that you don't know what sort of patterns in the data you should be looking for.  If you do know, you're probably better off using a specific test.  If you're looking for linear relationships, Pearson is great -- and many powerful tests have been developed for specific types of relationships over the years.

    Our suggested measure, MIC (maximal information coefficient), is based on mutual information, but is different from it.  Our intuition is that if a relationship exists between two variables, then a grid can be drawn on the scatterplot that partitions the data in a way that corresponds to that relationship.  Suppose we have a grid with x rows and y columns;  then this grid induces a probability distribution, where the probability associated with a box on the grid is proportional to the number of data points it contains.  Let I_G be the mutual information of the grid.  We aim to maximize I_G / log min(x, y) over grids -- this maximum is the MIC score.  (We bound the x, y values to avoid trivial cases like every point having its own box;  see the paper for more details.)  MIC values are between 0 and 1, and we can approximately compute MIC values heuristically.  Our work shows via a mix of proofs and experiments that MIC appears to have the properties of generality and equitability we desire.
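
    If you want to try this without writing the grid search yourself, the third-party minepy package implements these statistics.  A minimal usage sketch might look like the following (the toy data is arbitrary, and the alpha and c values are, as far as I recall, the usual defaults):

        import numpy as np
        from minepy import MINE  # third-party implementation of the MINE statistics

        rng = np.random.default_rng(0)
        x = rng.uniform(-1, 1, 1000)
        y = np.cos(6 * np.pi * x) + 0.2 * rng.normal(size=1000)  # noisy periodic relationship

        mine = MINE(alpha=0.6, c=15)  # parameters believed to match the usual defaults
        mine.compute_score(x, y)
        print("MIC:", mine.mic())     # a value between 0 and 1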

    Based on the approach we use to find MIC, we suggest several other possibly useful measures, and we refer to the collection as MINE (maximal information-based nonparametric exploration) statistics.  For example, MIC - Pearson^2 turns out to be a useful measure for specifically finding nonlinear relationships.  When MIC is high (close to 1) and Pearson is low (close to 0), there is a relationship that MIC has found, but it is not a linear one;  for linear patterns, both MIC and Pearson are high, and so this measure is close to 0.  Other measures we have capture things like deviations from monotonicity -- useful for finding periodic relationships.
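
    Continuing that sketch, the MIC - Pearson^2 idea could be computed along these lines (again an illustration built on minepy and scipy, not code from the paper):

        import numpy as np
        from minepy import MINE
        from scipy.stats import pearsonr

        def nonlinearity(x, y):
            """MIC - Pearson^2: large when a strong but nonlinear relationship exists."""
            mine = MINE(alpha=0.6, c=15)
            mine.compute_score(x, y)
            rho, _ = pearsonr(x, y)
            return mine.mic() - rho ** 2

        rng = np.random.default_rng(0)
        x = rng.uniform(-1, 1, 1000)
        linear = 3 * x + 0.1 * rng.normal(size=1000)
        periodic = np.sin(4 * np.pi * x) + 0.1 * rng.normal(size=1000)

        print(nonlinearity(x, linear))    # close to 0: MIC and Pearson^2 are both high
        print(nonlinearity(x, periodic))  # closer to 1: MIC is high, Pearson^2 is near 0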

    Should MIC/MINE prove useful, I believe there are lots of theoretical questions left to consider.  The approach is currently essentially heuristic, although we do prove a fair number of things about it.  Proving more about its properties would be useful, as would improving the complexity of computing the relevant measures.