NatCat: Weakly Supervised Text Classification with Naturally Annotated Datasets

09/29/2020
by Zewei Chu, et al.

We seek to improve text classification by leveraging naturally annotated data. In particular, we construct a general-purpose text categorization dataset (NatCat) from three online resources: Wikipedia, Reddit, and Stack Exchange. These datasets consist of document-category pairs derived from the manual curation that their communities perform naturally. We build general-purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval). We benchmark different modeling choices and dataset combinations and show how each task benefits from different NatCat training resources.
