

This post introduces the third paper from Prof. Yingfei Xiong (熊英飞) in 2017-18: "Identifying Patch Correctness in Test-Based Program Repair".

Identifying Patch Correctness in Test-based Program Repair

1.1 One-sentence summary of the paper

The test suites in practice are often too weak to guarantee the correctness and existing approaches often generate a large number of incorrect patches.
To reduce the number of incorrect patches generated, we propose a novel approach that exploits the behavior similarity of test case executions.
1) Passing tests on original and patched programs are likely to behave similarly;
2) Failing tests on original and patched programs are likely to behave differently;
3) If two tests exhibit similar runtime behavior, the two tests are likely to have the same results.

Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness.
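
To make observations 1) and 2) concrete, here is a minimal Python sketch of how a patch could be flagged from per-test behavior changes. It is an illustration only: the function name, the max/mean aggregation, and the threshold value `k_p` are assumptions made for this note, not details quoted from the paper.

```python
from statistics import mean


def looks_incorrect(passing_dists, failing_dists, k_p=0.25):
    """Heuristically flag a patch as incorrect.

    passing_dists / failing_dists: one distance in [0, 1] per test, measuring
    how much that test's execution changed between the original and the
    patched program. k_p is a hypothetical threshold chosen for illustration.
    """
    if not failing_dists:
        raise ValueError("a repair scenario needs at least one failing test")
    # Observation 1: originally passing tests should barely change.
    change_passing = max(passing_dists, default=0.0)
    # Observation 2: originally failing tests should change noticeably.
    change_failing = mean(failing_dists)
    return change_passing >= k_p or change_passing >= change_failing


# Passing tests barely move, the failing test moves a lot: patch is kept.
print(looks_incorrect([0.05, 0.10], [0.60]))
# A passing test changes a lot: patch is flagged.
print(looks_incorrect([0.40], [0.60]))
```

The distances themselves are taken as inputs here, so any execution-similarity measure can be plugged in.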


1.2 Experimental results

Our approach successfully prevented 56.3% of the incorrect patches to be generated, without blocking any correct patches.

The results look quite good: no correct patch was blocked.

1.3 Some background

In the past decades, a large number of automated program repair approaches [7–9, 12, 13, 16–19, 25, 26, 28, 43, 44] have been proposed, and many of them fall into the category of test-based program repair.

1.4 Where the idea comes from

weak test suite:
1) test suites in real world projects are often too weak [34], and a patched program passing all the tests may still be faulty;
2) As studied by Long et al. [20], the test suites in real world systems are usually weak such that most of the plausible patches are incorrect, making it difficult for a test-based program repair system to ensure the correctness of the patches.
3) As existing studies [24, 34, 37] show, multiple automatic program repair systems produce much more incorrect patches than correct patches on real world defects, leading to low precision in their generated patches.
4) an existing study [39] also shows that, when developers are provided with low-quality patches, their performance will drop compared to the situation where no patch is provided.

As a result, we believe it is critical to improve the precision of program repair systems, even at the risk of losing some correct patches.

the limitations of existing approaches for enhancing the test suites:
1) existing studies [42, 46] have attempted to generate new test cases to identify incorrect patches.
However, while test inputs can be generated, test oracles cannot be automatically generated in general, known as the oracle problem [1, 31].
As a result, existing approaches either require humans to determine test results [42], which is too expensive in many scenarios, or rely on inherent oracles such as crash-free [46], which can only identify certain types of incorrect patches that violate such oracles.
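
The limitation of an inherent oracle can be seen in a small Python sketch. Everything here is hypothetical (the `crash_free_oracle` helper and the two toy "patches" for a function meant to return the last element of a list, or `None` when the list is empty); it only illustrates that a crash-free check needs no expected outputs but misses patches that are wrong without crashing.

```python
def crash_free_oracle(program, test_inputs):
    """Return the inputs on which `program` raises an exception.

    No expected outputs are needed, which is what makes this oracle
    "inherent", and also what makes it weak.
    """
    crashing = []
    for x in test_inputs:
        try:
            program(x)
        except Exception:
            crashing.append(x)
    return crashing


def patch_crashing(items):
    # Hypothetical incorrect patch: raises IndexError on an empty list.
    return items[-1]


def patch_wrong_but_quiet(items):
    # Hypothetical incorrect patch: returns the first element, never crashes.
    return items[0] if items else None


generated_inputs = [[], [1], [1, 2]]
print(crash_free_oracle(patch_crashing, generated_inputs))         # caught
print(crash_free_oracle(patch_wrong_but_quiet, generated_inputs))  # missed
```

The second patch returns the wrong value for `[1, 2]`, yet the crash-free oracle reports nothing for it.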


1.5 作者的idea

Our goal is to classify patches heuristically without knowing the full oracle.
1) Patch-Sim
2) Test-Sim
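
As a rough illustration of the Test-Sim idea (observation 3 in 1.1), a newly generated test without an oracle can be given a guessed result from the existing test whose execution it resembles most. The function below is a sketch under assumptions: its name, its inputs, and the tie-breaking toward "passing" are choices made for this note rather than details taken from the paper.

```python
def classify_generated_test(dists_to_passing, dists_to_failing):
    """Guess the result of a generated test on the original program.

    dists_to_passing / dists_to_failing: distances between the generated
    test's execution and each existing passing / failing test's execution.
    Returns "failing" only when some failing test is strictly closer than
    every passing test (ties go to "passing", an assumption of this sketch).
    """
    if not dists_to_passing and not dists_to_failing:
        raise ValueError("need at least one existing test to compare against")
    nearest_pass = min(dists_to_passing, default=float("inf"))
    nearest_fail = min(dists_to_failing, default=float("inf"))
    return "failing" if nearest_fail < nearest_pass else "passing"


# The generated test runs almost like an existing failing test.
print(classify_generated_test([0.50, 0.70], [0.10]))
# The generated test runs almost like an existing passing test.
print(classify_generated_test([0.05], [0.40, 0.90]))
```

Once generated tests carry a guessed label, they can be treated like the original passing and failing tests when judging a patch.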

Figure: workflow of the proposed approach.

1.6 Where the idea of the first paper, "ISSTA18a-Shaping Program Repair Space with Existing Patches and similar code", comes from


Statistics. Some approaches build a statistical model to select the patches that are likely to fix the defects based on various information sources, such as existing patches [12, 13, 19] and existing source code [43].

[12] PAR, 2013; [13] HDRepair, 2016; [19] Prophet, 2016.
[43] ACS, 2017.

1.7 On related work


1) Test-based program repair
2) Patch classification
3) Patch ranking
4) Approaches to the oracle problem
5) Other related work

However, the effect of such an application on patch correctness identification is still unknown as far as we are aware and remains as future work.

This is a future direction to be explored.

1.8 patch correctness and behavior similarity


1.9 On the techniques the authors use


measure the similarity of two test executions. In our approach, we measure the similarity of complete-path executions:
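
One plausible way to compare two complete-path executions is to treat each as the sequence of executed statements and normalize the length of their longest common subsequence (LCS). The sketch below assumes that formulation; the function names and the exact normalization are assumptions made for this note, not a transcription of the paper's definition.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_distance(a, b):
    """Distance in [0, 1]: 0 for identical paths, 1 for paths sharing nothing."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty executions are considered identical
    return 1.0 - lcs_length(a, b) / longest


# Hypothetical statement IDs executed by one test before and after a patch.
before = ["s1", "s2", "s3", "s4"]
after = ["s1", "s3", "s4"]
print(path_distance(before, after))  # 0.25
```

A small distance means the patch hardly changed what the test executes; a large one means the test now takes a substantially different path.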

In our implementation, we chose Randoop [29], a random testing tool, as the test generation tool.

1.10 Summary

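1) The amount of work is substantial, including comparisons with other techniques: Opad, anti-patterns, and syntactic and semantic distances.
2) One interesting part in the middle is RQ 4, on page 7: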

we considered two different test generation strategies and compared their results with the result of RQ 1.

3) With 7 RQs in total, the evaluation feels very thorough; impressive.

1.11 Future work (worth following)

The results suggest that measuring behavior similarity can be a promising way to tackle the oracle problem and call for more research on this topic.

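In other words, there is a lot of room for future work in this direction. After all, the oracle problem will not be solved any time soon, and it needs a great deal of further research.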
