Best Paper Award

Source File Set Reuse Detection between Projects with Lightweight Similarity Calculation [IEICE TRANS. INF. & SYST., Vol.J103-D No.7 JULY 2020]

Kaoru ITO
Takashi ISHIO
Tetsuya KANDA
Katsuro INOUE

In software development, source code from open source software (OSS) is commonly reused to improve the quality and reliability of the software. However, the version information of the reused OSS source may be lost over prolonged development or through repeated updates and modifications. Identifying the reused source requires calculating the similarity between the target software and the OSS source code, but existing methods, which calculate similarity per source file, have difficulty identifying the library version automatically, and the similarity calculation itself takes a very long time.

In this paper, the authors propose a method for automatically and quickly identifying the version of a reused library by comparing the source files of the software under analysis with the repository of the reused library. The proposed method first calculates file-level similarities between two sets of source files, one from the target software and one from the library, and takes the sum of these similarities as the similarity of the library. The library version with the highest similarity is then selected as the reused one. To speed up the calculation, instead of computing exactly the Jaccard coefficients that constitute the similarity, the method approximates them with b-bit MinHash, a locality-sensitive hashing (LSH) technique. Applying the proposed method to various OSS libraries, the authors confirmed that it identifies the version information with an average accuracy of 99.3%, comparable to the existing method that calculates the Jaccard coefficients exactly. In addition, the average execution time is reduced by 24.1%, and the memory consumption is less than 1% of that of the existing method.
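The idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token 3-gram features, the salted BLAKE2b hash family, the parameters (128 hashes, b = 1), the rule of matching each target file to its most similar library file, and all function names are assumptions made for illustration. The Jaccard estimate uses the common simplification that two unrelated b-bit values collide with probability 2^-b.

```python
import hashlib
import re


def ngrams(text, n=3):
    """Turn source text into a set of token n-grams (tokenization is an assumption)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def bbit_signature(items, num_hashes=128, b=1):
    """Keep only the lowest b bits of each of num_hashes MinHash values."""
    mask = (1 << b) - 1
    signature = []
    for i in range(num_hashes):
        salt = i.to_bytes(8, "little")  # one salted hash function per index
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(x.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for x in items
        )
        signature.append(smallest & mask)
    return signature


def estimate_jaccard(sig_a, sig_b, b=1):
    """Estimate the Jaccard coefficient from the fraction of matching b-bit values.

    Simplified correction: unrelated values match by chance with probability 2^-b.
    """
    match = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    chance = 1.0 / (1 << b)
    return max(0.0, (match - chance) / (1.0 - chance))


def identify_version(target_files, library_versions, num_hashes=128, b=1):
    """Score each library version by summed file similarities; return the best one.

    Assumption: each target file contributes its best match within the version.
    """
    def signatures(files):
        return [bbit_signature(ngrams(src), num_hashes, b) for src in files]

    target_sigs = signatures(target_files)
    scores = {}
    for version, files in library_versions.items():
        lib_sigs = signatures(files)
        scores[version] = sum(
            max(estimate_jaccard(t, l, b) for l in lib_sigs) for t in target_sigs
        )
    return max(scores, key=scores.get), scores
```

Because each signature stores only b bits per hash function, a file is represented by a few bytes rather than its full feature set, which is the source of the memory savings that this kind of approximation aims for.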

In recent years, software development has become increasingly large-scale and complex, and the highly accurate and fast version identification method proposed in this paper is expected to contribute significantly to improving software quality. This paper is therefore highly regarded as worthy of the Best Paper Award of the Institute of Electronics, Information and Communication Engineers.