Is my code correctly calculating the entropy/conditional entropy of a dataset?

Problem description:

I'm writing a Java class that I want to use to calculate things like the entropy, joint entropy, and conditional entropy of a given dataset. The class in question is below:

import java.util.LinkedList;
import java.util.List;

public class Entropy {

    private Frequency<String> iFrequency = new Frequency<String>();
    private Frequency<String> rFrequency = new Frequency<String>();

    Entropy() {
        super();
    }

    public void setInterestedFrequency(List<String> interestedFrequency) {
        for (String s : interestedFrequency) {
            this.iFrequency.addValue(s);
        }
    }

    public void setReducingFrequency(List<String> reducingFrequency) {
        for (String s : reducingFrequency) {
            this.rFrequency.addValue(s);
        }
    }

    // logarithm of num in the given base
    private double log(double num, int base) {
        return Math.log(num) / Math.log(base);
    }

    // Shannon entropy of the empirical distribution of the values in data
    public double entropy(List<String> data) {

        double entropy = 0.0;
        double prob = 0.0;
        Frequency<String> frequency = new Frequency<String>();

        for (String s : data) {
            frequency.addValue(s);
        }

        String[] keys = frequency.getKeys();

        for (int i = 0; i < keys.length; i++) {
            prob = frequency.getPct(keys[i]);
            entropy = entropy - prob * log(prob, 2);
        }

        return entropy;
    }

    /*
     * returns the conditional probability P(interestedClass | reducingClass)
     */
    public double conditionalProbability(List<String> interestedSet,
                                         List<String> reducingSet,
                                         String interestedClass,
                                         String reducingClass) {
        List<Integer> conditionalData = new LinkedList<Integer>();

        if (iFrequency.getKeys().length == 0) {
            this.setInterestedFrequency(interestedSet);
        }

        if (rFrequency.getKeys().length == 0) {
            this.setReducingFrequency(reducingSet);
        }

        for (int i = 0; i < reducingSet.size(); i++) {
            if (reducingSet.get(i).equalsIgnoreCase(reducingClass)) {
                if (interestedSet.get(i).equalsIgnoreCase(interestedClass)) {
                    conditionalData.add(i);
                }
            }
        }

        int numerator = conditionalData.size();
        int denominator = this.rFrequency.getNum(reducingClass);

        return (double) numerator / denominator;
    }

    public double jointEntropy(List<String> set1, List<String> set2) {

        String[] set1Keys;
        String[] set2Keys;
        Double prob1;
        Double prob2;
        Double entropy = 0.0;

        if (this.iFrequency.getKeys().length == 0) {
            this.setInterestedFrequency(set1);
        }

        if (this.rFrequency.getKeys().length == 0) {
            this.setReducingFrequency(set2);
        }

        set1Keys = this.iFrequency.getKeys();
        set2Keys = this.rFrequency.getKeys();

        for (int i = 0; i < set1Keys.length; i++) {
            for (int j = 0; j < set2Keys.length; j++) {
                prob1 = iFrequency.getPct(set1Keys[i]);
                prob2 = rFrequency.getPct(set2Keys[j]);

                entropy = entropy - (prob1 * prob2) * log((prob1 * prob2), 2);
            }
        }

        return entropy;
    }

    public double conditionalEntropy(List<String> interestedSet, List<String> reducingSet) {

        double jointEntropy = jointEntropy(interestedSet, reducingSet);
        double reducingEntropyX = entropy(reducingSet);
        double conEntYgivenX = jointEntropy - reducingEntropyX;

        return conEntYgivenX;
    }
}
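
(Frequency is just a small counting helper, not the Apache Commons Math Frequency, which is not generic and has no getKeys/getNum. A trimmed-down stand-in consistent with how it is used above, shown only as a sketch, could look like this:)

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the counting helper used above. The posted code only ever uses
// Frequency<String>, so the keys are Strings here; the unused type parameter
// is kept only so that "new Frequency<String>()" still compiles.
public class Frequency<T> {

    private final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    private int total = 0;

    // record one occurrence of value
    public void addValue(String value) {
        Integer count = counts.get(value);
        counts.put(value, count == null ? 1 : count + 1);
        total++;
    }

    // number of times value has been added
    public int getNum(String value) {
        Integer count = counts.get(value);
        return count == null ? 0 : count;
    }

    // relative frequency of value among everything added so far
    public double getPct(String value) {
        return total == 0 ? 0.0 : (double) getNum(value) / total;
    }

    // distinct values seen so far
    public String[] getKeys() {
        return counts.keySet().toArray(new String[0]);
    }
}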

For the past few days I've been trying to figure out why my entropy calculation is almost always exactly the same as my conditional entropy calculation.

I am using the following formulas:

H(X) = - Σ from x=1 to n of p(x) * log(p(x))

H(X,Y) = - Σ from x=1 to n, y=1 to m of (p(x) * p(y)) * log(p(x) * p(y))

H(Y|X) = H(X,Y) - H(X)

The values I get for entropy and conditional entropy are almost identical.

With the dataset I am using for testing, I get the following values:

@Test 
public void testEntropy(){ 
    FileHelper fileHelper = new FileHelper(); 
    List<String> lines = fileHelper.readFileToMemory(""); 
    Data freshData = fileHelper.parseCSVData(lines); 

    LinkedList<String> headersToChange = new LinkedList<String>(); 
    headersToChange.add("lwt"); 

    Data discreteData = freshData.discretize(freshData.getData(),headersToChange,1,10); 

    Entropy entropy = new Entropy(); 
    Double result = entropy.entropy(discreteData.getData().get("lwt")); 
    assertEquals(2.48,result,.006); 
} 

@Test 
public void testConditionalProbability(){ 

    FileHelper fileHelper = new FileHelper(); 
    List<String> lines = fileHelper.readFileToMemory(""); 
    Data freshData = fileHelper.parseCSVData(lines); 

    LinkedList<String> headersToChange = new LinkedList<String>(); 
    headersToChange.add("age"); 
    headersToChange.add("lwt"); 


    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10); 

    Entropy entropy = new Entropy(); 
    double conditionalProb = entropy.conditionalProbability(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6"); 
    assertEquals(.1,conditionalProb,.005); 
} 

@Test 
public void testJointEntropy(){ 


    FileHelper fileHelper = new FileHelper(); 
    List<String> lines = fileHelper.readFileToMemory(""); 
    Data freshData = fileHelper.parseCSVData(lines); 

    LinkedList<String> headersToChange = new LinkedList<String>(); 
    headersToChange.add("age"); 
    headersToChange.add("lwt"); 

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10); 

    Entropy entropy = new Entropy(); 
    double jointEntropy = entropy.jointEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age")); 
    assertEquals(5.05,jointEntropy,.006); 
} 

@Test 
public void testSpecifiedConditionalEntropy(){ 

    FileHelper fileHelper = new FileHelper(); 
    List<String> lines = fileHelper.readFileToMemory(""); 
    Data freshData = fileHelper.parseCSVData(lines); 

    LinkedList<String> headersToChange = new LinkedList<String>(); 
    headersToChange.add("age"); 
    headersToChange.add("lwt"); 

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10); 

    Entropy entropy = new Entropy(); 
    double specCondiEntropy = entropy.specifiedConditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6"); 
    assertEquals(.332,specCondiEntropy,.005); 

} 

@Test 
public void testConditionalEntropy(){ 

    FileHelper fileHelper = new FileHelper(); 
    List<String> lines = fileHelper.readFileToMemory(""); 
    Data freshData = fileHelper.parseCSVData(lines); 

    LinkedList<String> headersToChange = new LinkedList<String>(); 
    headersToChange.add("age"); 
    headersToChange.add("lwt"); 

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10); 

    Entropy entropy = new Entropy(); 
    Double result = entropy.conditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age")); 
    assertEquals(2.47,result,.006); 
} 

Everything compiles and runs, but I am fairly sure my conditional entropy calculation is incorrect; I just can't see what mistake I'm making.

The values in the unit tests are the values I am currently getting; they are the same as the output of the functions above.

At one point I also tested with the following data:

List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1"); 
List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1"); 

where male = 1 and survived = 1. I then used this to calculate

double result = entropy.entropy(survived); 
assertEquals(.996,result,.006); 

as well as

double jointEntropy = entropy.jointEntropy(survived,sex); 
assertEquals(1.99,jointEntropy,.006); 

I also checked my work by calculating the values by hand. You can see an image here. Since my code gives me the same values I get when I work the problem out by hand, and since the other functions are fairly simple and only use the entropy/joint entropy functions, I assumed everything was fine.
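
Concretely, for the survived list (7 ones and 8 zeros out of 15 values) the hand calculation is

H(survived) = -(7/15) * log2(7/15) - (8/15) * log2(8/15) ≈ 0.997,

which agrees with the 0.996 asserted above.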

However, a problem has come up. Below are the two functions I wrote to calculate the information gain and the symmetrical uncertainty of a pair of sets.

public double informationGain(List<String> interestedSet, List<String> reducingSet){ 
    double entropy = entropy(interestedSet); 
    double conditionalEntropy = conditionalEntropy(interestedSet,reducingSet); 
    double infoGain = entropy - conditionalEntropy; 
    return infoGain; 
} 

public double symmetricalUncertainty(List<String> interestedSet, List<String> reducingSet){ 
    double infoGain = informationGain(interestedSet,reducingSet); 
    double intSet = entropy(interestedSet); 
    double redSet = entropy(reducingSet); 
    double symUnc = 2 * (infoGain/ (intSet+redSet)); 
    return symUnc; 
} 

With the original survived/sex sets, this gave me a slightly negative answer. But since it was only about -0.000000000000002, I assumed it was a rounding error. When I run my program on the real data, though, none of the values I get for symmetrical uncertainty make any sense.

Please show the output of your test runs. Degenerate cases are easy: for example, give it a list of N distinct consecutive integers; the entropy should be log2(N), right? Repeat with similar cases that are easy to verify with another routine or by hand. – Prune
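
(In the style of the tests above, such a degenerate-case check might look like this; the values are just illustrative:)

@Test
public void testEntropyOfDistinctValues() {
    // eight distinct values, each occurring once: entropy should be log2(8) = 3
    List<String> uniform = Arrays.asList("1", "2", "3", "4", "5", "6", "7", "8");

    Entropy entropy = new Entropy();
    assertEquals(3.0, entropy.entropy(uniform), 1e-9);
}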

@Prune That's why I included the tests. The values currently listed in the unit tests are the values I'm getting. I calculated the entropy and joint entropy values by hand, so I'm fairly confident those two functions work correctly. But when I compute the symmetrical uncertainty, my answers are off: I get negative values when they should be between 0 and 1. I think there is probably something small I'm missing, and I can't see it because I've been staring at it for so long. –

tl;dr: Your calculation of H(X,Y) implicitly assumes that X and Y are independent, which leads to H(X,Y) = H(X) + H(Y), which in turn makes H(X|Y) equal to H(X).
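
To spell out the algebra: if p(x,y) = p(x) * p(y), then

H(X,Y) = - Σ_x Σ_y p(x) p(y) * [log(p(x)) + log(p(y))]
       = - Σ_x p(x) log(p(x)) - Σ_y p(y) log(p(y))
       = H(X) + H(Y)

and therefore H(X|Y) = H(X,Y) - H(Y) = H(X) + H(Y) - H(Y) = H(X).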

Is that your problem? If so, then use the correct formula for the joint entropy of X and Y (taken from Wikipedia):

H(X,Y) = - Σ_x Σ_y p(x,y) * log2(p(x,y))

You went wrong by substituting p(x,y) = p(x) * p(y), which assumes that the two variables are independent. If the two variables really were independent, then H(X|Y) = H(X) would in fact hold, because Y would tell you nothing about X, and knowing Y would therefore not reduce the entropy of X.
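
As a rough sketch of the fix (a drop-in replacement for jointEntropy; the pair-counting map is just one way to estimate p(x,y), and it assumes the two lists are aligned row by row):

// add to the imports of Entropy.java
import java.util.HashMap;
import java.util.Map;

// H(X,Y) estimated from the empirical joint distribution of the aligned lists,
// i.e. p(x,y) is counted directly instead of being replaced by p(x)*p(y)
public double jointEntropy(List<String> set1, List<String> set2) {
    Map<String, Integer> pairCounts = new HashMap<String, Integer>();
    int n = set1.size();

    // count each (x, y) pair that actually occurs, row by row
    for (int i = 0; i < n; i++) {
        String pair = set1.get(i) + "|" + set2.get(i); // assumes the values contain no '|'
        Integer count = pairCounts.get(pair);
        pairCounts.put(pair, count == null ? 1 : count + 1);
    }

    double entropy = 0.0;
    for (Integer count : pairCounts.values()) {
        double p = (double) count / n;              // empirical p(x, y)
        entropy -= p * (Math.log(p) / Math.log(2)); // - p(x,y) * log2(p(x,y))
    }
    return entropy;
}

With p(x,y) counted this way, H(X,Y) is strictly less than H(X) + H(Y) whenever the variables are dependent, so the conditional entropy no longer collapses to H(X) and the information gain is no longer forced to (numerically) zero.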

That is almost exactly my problem. What is the correct formula? I could only find [this](https://en.wikipedia.org/wiki/Joint_entropy) on Wikipedia. My probability theory book doesn't cover the subject, so that is pretty much what I've been using. –

@j.jerrod.taylor The formula in the link is the correct one. See my edit. – ziggystar

That makes sense. I see where I messed up now. –