https://doi.org/10.1140/epjds/s13688-026-00654-1
Research
From keyword-based text measures to latent variables: confirmatory factor analysis with word embeddings
Computational Social Science Department, Institute of Philosophy and Sociology of the Polish Academy of Sciences, Nowy Świat 72, 00-330, Warsaw, Poland
Received: 14 October 2025
Accepted: 31 March 2026
Published online: 14 April 2026
Abstract
Dictionary-based text analysis, where researchers select keywords to measure constructs such as public sentiment, anxiety, or political attitudes in large text corpora, is widely used in computational social science. However, keyword selection is rarely subjected to the same psychometric scrutiny applied to survey instruments: studies seldom report reliability, evaluate internal structure, or test whether the measurement holds across subpopulations or time points. Moreover, few existing methods enable the construction of measures that reflect theoretical or expected relationships among keywords. This paper proposes a method that brings these capabilities to text analysis by applying Confirmatory Factor Analysis (CFA) to word embeddings. Keywords are treated as observed indicators of a latent construct, and their semantic relationships, operationalized as centered cosine similarities between embedding vectors, serve as the input correlation matrix for CFA estimation. The framework enables researchers to estimate factor loadings and model fit indices (CFI, TLI, RMSEA, SRMR), compute reliability coefficients (Cronbach’s alpha, Omega), and test measurement invariance across groups or time periods using multigroup models with structured means. The method also allows researchers to compare latent construct intensity across groups or time periods, transforming keyword-based text measures from descriptive indicators into formally comparable latent variables. The method is demonstrated through an empirical application to the discourse of war anxiety during Russia’s 2022 invasion of Ukraine. A Monte Carlo simulation further examines the behavior of fit indices under random keyword selection. The approach complements existing text analysis methods and can be implemented using standard software, such as the lavaan R package.
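The input step described in the abstract, a keyword-by-keyword matrix of centered cosine similarities, can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that "centered" means subtracting each embedding's own mean across its dimensions, in which case the cosine equals the Pearson correlation between two keywords' coordinates. The function names and the toy vectors are hypothetical, and the reliability figure shown is the standardized form of Cronbach's alpha computed directly from the matrix.

```python
import math


def centered_cosine_matrix(vectors):
    """Pairwise cosine similarity after centering each vector.

    Assumption: each embedding is centered by subtracting its own mean
    across dimensions, so the result is a Pearson correlation matrix.
    """
    centered = []
    for v in vectors:
        mean = sum(v) / len(v)
        centered.append([x - mean for x in v])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    k = len(centered)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def standardized_alpha(corr):
    """Standardized Cronbach's alpha from a k x k correlation matrix."""
    k = len(corr)
    off_diag = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    mean_r = sum(off_diag) / len(off_diag)
    return k * mean_r / (1 + (k - 1) * mean_r)


# Toy 4-dimensional vectors standing in for three keyword embeddings
# (hypothetical values, chosen only to keep the arithmetic easy to check).
toy_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [1.0, 3.0, 2.0, 4.0],
]

R = centered_cosine_matrix(toy_vectors)
alpha = standardized_alpha(R)
```

In a full analysis, a matrix like `R` built from real embeddings would be handed to CFA software to estimate loadings and fit indices. The sketch stops at the matrix and a single reliability coefficient.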
Key words: Computational social science / Text-as-data / Measurement validity / Confirmatory factor analysis / Word embeddings / Measurement invariance / Latent variable modeling / Keyword selection
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1140/epjds/s13688-026-00654-1.
Handling Editor: Jana Lasser
© The Author(s) 2026
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

