Data generator based on RBF network
There are plenty of problems where the data available is scarce and expensive. We propose a generator of semi-artificial data with similar properties to the original data which enables development and testing of different data mining algorithms and optimization of their parameters. The generated data allow a large scale experimentation and simulations without danger of overfitting. The proposed generator is based on RBF networks which learn sets of Gaussian kernels. Learned Gaussian kernels can be used in a generative mode to generate the data from the same distributions. To asses quality of the generated data we developed several workflows and used them to evaluate the statistical properties of the generated data, structural similarity, and predictive similarity using supervised and unsupervised learning techniques. To determine usability of the proposed generator we conducted a large scale evaluation using 51 UCI data sets. The results show a considerable similarity between the original and generated data and indicate that the method can be useful in several development and simulation scenarios.
READ FULL TEXT