https://doi.org/10.1140/epjds/s13688-022-00353-7
Regular Article
Evaluating the construct validity of text embeddings with application to survey questions
1 Department of Methodology & Statistics, Utrecht University, Padualaan 14, Utrecht, The Netherlands
2 Department of Information & Computing Sciences, Utrecht University, Princetonplein 5, Utrecht, The Netherlands
3 Department of Biostatistics and Data Science, Julius Center, University Medical Center Utrecht (UMCU), Universiteitsweg 100, Utrecht, The Netherlands
Received: 21 February 2022
Accepted: 22 June 2022
Published online: 7 July 2022
Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are high-quality representations of the information they are meant to encode. We view this quality evaluation problem from a measurement validity perspective and propose using the classic construct validity framework to evaluate the quality of text embeddings. First, we describe how this framework can be adapted to the opaque and high-dimensional nature of text embeddings. Second, we apply our adapted framework to an example in which we compare the validity of survey question representations across text embedding models.
Key words: Word embeddings / Sentence embeddings / Measurement validity / Content validity / Convergent validity / Discriminant validity / Predictive validity / Survey questions / Survey methodology / Computational social science
© The Author(s) 2022
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.