The Klarna Product Page Dataset: A Realistic Benchmark for Web Representation Learning

11/03/2021
by Alexandra Hotti, et al.

This paper tackles the under-explored problem of DOM tree element representation learning. We advance the field of machine learning-based web automation, and hope to spur further research in this crucial area, with two contributions. First, we adapt several popular graph-based neural network models and apply them to embed elements in website DOM trees. Second, we present a large-scale and realistic dataset of webpages. By providing this open-access resource, we lower the entry barrier to this area of research. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. These properties make the dataset substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web. Finally, using our proposed dataset, we show that the embeddings produced by a Graph Convolutional Neural Network outperform representations produced by other state-of-the-art methods in a web element prediction task.
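
To make the idea of embedding DOM elements with a graph convolution more concrete, the following is a minimal sketch, not the authors' model. It assumes the standard symmetric-normalized propagation rule (aggregate over each node's parent, children and itself, then project and apply ReLU) and uses a hypothetical five-node page, one-hot tag features and identity weights chosen purely for illustration. The function name `gcn_layer` and the parent-index encoding of the tree are likewise assumptions.

```python
import math

def gcn_layer(parents, features, weight):
    """One graph-convolution step over a tree given as a parent-index list."""
    n = len(parents)
    # Undirected adjacency with self-loops: each node sees itself,
    # its parent, and its children.
    neighbors = [{i} for i in range(n)]
    for child, parent in enumerate(parents):
        if parent is not None:
            neighbors[child].add(parent)
            neighbors[parent].add(child)
    degree = [len(nb) for nb in neighbors]
    in_dim, out_dim = len(weight), len(weight[0])
    output = []
    for i in range(n):
        # Symmetric-normalized sum of neighbor features.
        agg = [0.0] * in_dim
        for j in neighbors[i]:
            coeff = 1.0 / math.sqrt(degree[i] * degree[j])
            for k, value in enumerate(features[j]):
                agg[k] += coeff * value
        # Linear projection followed by ReLU.
        output.append([
            max(0.0, sum(agg[k] * weight[k][c] for k in range(in_dim)))
            for c in range(out_dim)
        ])
    return output

# Hypothetical toy page: <html> -> <body> -> (<h1>, <span>, <button>).
tags = ["html", "body", "h1", "span", "button"]
parents = [None, 0, 1, 1, 1]

# One-hot tag features (an assumption; real inputs would be much richer).
vocab = sorted(set(tags))
features = [[1.0 if t == v else 0.0 for v in vocab] for t in tags]

# Identity weights, so each embedding is just the normalized
# neighborhood aggregate; a trained model would learn these.
weight = [[1.0 if r == c else 0.0 for c in range(len(vocab))]
          for r in range(len(vocab))]

embeddings = gcn_layer(parents, features, weight)
```

After one layer, each element's vector mixes its own tag with those of its parent and children, so the three leaf nodes already differ from `<body>` by their position in the tree. Stacking more layers would widen the context each element sees.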
