Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code

05/24/2023
by   Yiming Sun, et al.
0

We consider the problem of identifying the provenance of free/open source software (FOSS) and specifically the need of identifying where reused source code has been copied from. We propose a lightweight approach to solve the problem based on software identifiers-such as the names of variables, classes, and functions chosen by programmers. The proposed approach is able to efficiently narrow down to a small set of candidate origin products, to be further analyzed with more expensive techniques to make a final provenance determination.By analyzing the PyPI (Python Packaging Index) open source ecosystem we find that globally defined identifiers are very distinct. Across PyPI's 244 K packages we found 11.2 M different global identifiers (classes and method/function names-with only 0.6 of entities); 76 most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to a maximum of 3 products within 89 packages implemented in Python to the corresponding PyPI packages; this approach uses at most five trials, where each trial uses three randomly chosen global identifiers from a randomly chosen python file of the subject software package, then ranks results using a popularity index and requires to inspect only the top result. In our experiments, this method is effective at finding the true origin of a project with a recall of 0.9 and precision of 0.77.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset