Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes

03/30/2022
by   Jingzhi Gong, et al.

Learning and predicting the performance of a configurable software system helps to provide better quality assurance. One important engineering decision therein is how to encode the configuration into the model built. Despite the presence of different encoding schemes, there is still little understanding of which is better and under what circumstances, as the community often relies on general beliefs that inform the decision in an ad-hoc manner. To bridge this gap, in this paper, we empirically compare the widely used encoding schemes for software performance learning, namely label, scaled label, and one-hot encoding. The study covers five systems, seven models, and three encoding schemes, leading to 105 cases of investigation. Our key findings reveal that: (1) conducting trial-and-error to find the best encoding scheme in a case-by-case manner can be rather expensive, requiring up to 400+ hours on some models and systems; (2) one-hot encoding often leads to the most accurate results, while scaled label encoding is generally weak on accuracy across different models; (3) conversely, scaled label encoding tends to result in the fastest training time across the models/systems, while one-hot encoding is the slowest; (4) for all models studied, label and scaled label encoding often lead to relatively less biased outcomes between accuracy and training time, but the best-paired model varies according to the system. We discuss the actionable suggestions derived from our findings, hoping to provide a better understanding of this topic for the community. To promote open science, the data and code of this work can be publicly accessed at https://github.com/ideas-labo/MSR2022-encoding-study.
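To make the three schemes concrete, the following is a minimal sketch of how a single configuration option might be encoded under each scheme. The option name and its values are invented for illustration only; they are not taken from the systems studied in the paper, and the study's actual preprocessing pipeline may differ.

```python
# Hypothetical configuration option "cache_policy" with three possible values.
options = ["lru", "lfu", "fifo"]
config = "lfu"  # the value to encode

# Label encoding: map each option value to an integer index.
label = options.index(config)

# Scaled label encoding: normalize the label into the [0, 1] range,
# so options with many values do not dominate on raw magnitude.
scaled_label = label / (len(options) - 1)

# One-hot encoding: a binary vector with a single 1 at the value's index,
# which avoids implying an ordering between unordered option values.
one_hot = [1 if i == label else 0 for i in range(len(options))]

print(label, scaled_label, one_hot)
```

The trade-off the paper examines is visible even in this sketch: one-hot encoding expands each option into as many features as it has values (often improving accuracy but inflating training time), while label and scaled label encoding keep one feature per option.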


