介绍软件工程领域熊英飞老师及其2017-18年论文——第三篇

- 前言
- Identifying Patch Correctness in Test-based Program Repair

前言

本文旨在介绍熊英飞老师2017-18年第三篇论文——“Identifying Patch Correctness in Test-Based Program Repair”（在基于测试的程序修复中确认补丁的正确性）。

Identifying Patch Correctness in Test-based Program Repair

1.1 一句话概括文章

The test suites in practice are often too weak to guarantee the correctness and existing approaches often generate a large number of incorrect patches.
To reduce the number of incorrect patches generated, we propose a novel apprach that exploits the behavior similarity of test case executions.
1)passing tests on original and patched programs are likely to behave similarly;
2)failing tests on original and patched programs are likely to behave differently;
3)If two tests exhibit similar runtime behavior, the two tests are likely to have the same results.

Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness.

这个确实是我不了解的技术，不知道还可以这样操作…

1.2 实验结果

Our approach successfully prevented 56.3% of the incorrect patches to be generated, without blocking any correct patches.

还是用了很多工具的，比如ASTOR的两个工具（jKali，jGenprog），Nopol，ACS，HDRepair
实验效果看起来也还不错，without blocking any correct patches。

1.3 一些背景知识

In the past decades, a large number of automated program repair approaches [7–9, 12, 13, 16–19, 25, 26, 28, 43, 44] have been proposed, and many of them fall into the category of test-based program repair.

[7] Qing Gao, Yingfei Xiong, Yaqing Mi, Lu Zhang, Weikun Yang, Zhaoping Zhou, Bing Xie, and Hong Mei. 2015. Safe Memory-Leak Fixing for C Programs. In Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 459–470.
[8] Qing Gao, Hansheng Zhang, Jie Wang, and Yingfei Xiong. 2015. Fixing Recurring Crash Bugs via Analyzing Q&A Sites. In ASE. 307–318.
[9] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In AAAI. 1345–1351
[12] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In ICSE ’13. 802–811.
[13] Xuan-Bach D Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1. IEEE, 213–224.
[16] C. Le Goues, ThanhVu Nguyen, S. Forrest, and W. Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. Software Engineering, IEEE Transactions on 38, 1 (Jan 2012), 54–72.

[17] Xuliang Liu and Hao Zhong. 2018. Mining * for Program Repair. (2018), to appear pages.

[18] Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015. 166–178. https://doi.org/10.1145/2786805.2786811
[19] Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016. 298–312. https://doi.org/10.1145/2837614.2837617
[25] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2015. DirectFix: Looking for Simple Program Repairs. In ICSE. 448–458.
[26] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. *Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis. In ICSE. 691–701.
[28] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: Program Repair via Semantic Analysis. In ICSE. 772–781.
[43] Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise Condition Synthesis for Program Repair. In ICSE.
[44] Yingfei Xiong, Hansheng Zhang, Arnaud Hubaux, Steven She, Jie Wang, and Krzysztof Czarnecki. 2015. Range fixes: Interactive error resolution for software configuration. Software Engineering, IEEE Transactions on 41, 6 (2015), 603–619.

这些都是比较经典或者比较新的修复工具，感觉可以关注一下。标粗的[17]也可以看看。

1.4 idea的来源

weak test suite:
1) test suites in real world projects are often too weak [34], and a patched program passing all the tests may still be faulty;
[34] Zichao Qi, Fan Long, Sara Achour, and Martin C. Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In ISSTA. 24–36.

2) As studied by Long et al. [20], the test suites in real world systems are usually weak such that most of the plausible patches are incorrect, making it difficult for a test-based program repair system to ensuire the correctness of the patches.
第二点承接第一点（相似但不相同，而且递进）。功力深厚
[20] Fan Long and Martin C. Rinard. 2016. An analysis of the search spaces for generate and validate patch generation systems. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016. 702–713. https://doi.org/10.1145/2884781.2884872

3) As existing studies [24, 34, 37] show, multiple automatic program repair systems produce much more incorrect patches than correct patches on real world defects, leading to low precision in their generated patches.
[24] Matias Martinez and Martin Monperrus. 2016. ASTOR: A Program Repair Library for Java. In Proceedings of ISSTA, Demonstration Track. 441–444. https://doi.org/ 10.1145/2931037.2948705
[34] Zichao Qi, Fan Long, Sara Achour, and Martin C. Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In ISSTA. 24–36.

[37] Edward K Smith, Earl T Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the cure worse than the disease? overfitting in automated program repair. In FSE. 532–543.

4) an existing study [39] also shows that, when developer are provided with low-quality patches, their performance will drop compared to the situation where no patch is provided.

As a result, we believe it is critical to improve the precision of program repair systems, even at the risk of losing some correct patches.
感觉这个说的很精辟

the limitations of existing approaches for enhancing the test suites:
1) existing studies [42, 46] have attempted to generate new test cases to identify incorrect patches.
However, while test inputs can be generated, test oracles cannot be automatically generated in general, known as the oracle problem [1, 31].
As a result, existing approaches either require human to determine test results [42], which is too expensive in many senarios, or rely on inherent oracles such as crash-free [46], which can only identify certain types of incorrect patches that violate such oracles.

不得不感叹，这实在是太紧跟热点了。

有一个需要探究一下的疑问：
作者说的这些limitation在[42], [46] 的原文中能够找到吗？

这个简直是紧跟当前最热的点，实在厉害。对我来说，就算给我看了这几篇我也想不出来啊，积累不够。

[42] Qi Xin and Steven Reiss. 2017. Identifying Test-Suite-Overfitted Patches through Test Case Generation. In ISSTA.
[46] Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. 2017. Better Test Cases for Better Automated Program Repair. In FSE.
[1] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Trans. Software Eng. 41, 5 (2015), 507–525. https://doi.org/10.1109/TSE.2014.2372785
[31] Mauro Pezze and Cheng Zhang. 2015. Automated Test Oracles: A Survey. Advances in Computers 95 (2015), 1–48.

1.5 作者的idea

Our goal is to classify patches heuristically without knowing the full oracle.
1) Patch-Sim
2) Test-Sim

workflow of the proposed approach

1.6 关于第一篇文章“ISSTA18a-Shaping Program Repair Space with Existing Patches and similar code”的idea由来

在这篇文章里面看到，不禁感叹。这里记下来：

Statistics. Some approaches build a statistical model to select the patches that are likely to fix the defects based on various information sources, such as existing patches [12, 13, 19] and existing source code [43].

[12] PAR, 2013; [13] HDRepair, 2016; [19] Prophet 2016.
[43] ACS 2017.

1.7 关于related work

作者的知识面也太广了吧，介绍的这些我其实基本上都想不到，，，那就更加别说idea了

1) Test-based program repair
2) patch classification
3) patch ranking
4) approaches to the oracle problem
5) other related work

中间还给出了一些未来方向，比如：
However, the effect of such an application on patch correctness identification is still unknown as far as we are aware and remains as future work.

This is a future direction to be explored.

1.8 patch correctness and behavior similarity

说的确实很有道理。

1.9 关于作者使用的技术

在Test-Sim上，作者用到了：
1)

measure the similarity of two test executions. In our approach, we measure the similarity of complete-path executions:

[10] Mary Jean Harrold, Gregg Rothermel, Rui Wu, and Liu Yi. 1998. An Empirical Investigation of Program Spectra. In Proceedings of the SIGPLAN/SIGSOFT Workshop on Program Analysis For Software Tools and Engineering, PASTE ’98, Montreal, Canada, June 16, 1998, Thomas Ball, Frank Tip, and A. Michael Berman (Eds.). ACM, 83–90. https://doi.org/10.1145/277631.277647

In our implementation, we chose Randoop [29], a random testing tool, as the test generation tool.

此外还解释了为什么不用evesuite，因为产生的测试用例不够。
[29] Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-directed Random Testing for Java. In OOPSLA. 815–816.

1.10 小结

后面的章节太多了，我也就大致看了一看，我发现整个实验：
1）工作量很足，和其他的技术比较：包括Opad，anti-patterns，syntactic and semantic distances。
中间很有趣的是，在RQ 4 第七页，

we considered two different test generation strategies and compared their results with the result of RQ 1.

然后第一个就是不做测试用例生产，作为基准程序，和做了randoop的工具去比较。这个很酷。
2）图表很丰富，感觉分析的应该是很到位了，虽然有部分我也没细看
3）整个 7 个 RQs，让我觉得很详细，佩服。

1.11 Future work（值得关注）

The result suggest that measuring behavior similarity can be a promising way to tackle the oracle problem and calls for more research on this topic.

意思是未来这方面大有可为,毕竟oracle problem不是一时半会儿能够解决的.需要大量的研究工作的推进

然而我的积累还太少,这方面没什么知识基础,以后也拿不准会不会做这个.