https://doi.org/10.1140/epjds/s13688-021-00260-3
Regular Article
Generalized word shift graphs: a method for visualizing and explaining pairwise comparisons between texts
1 Network Science Institute, Northeastern University, 02115, Boston, MA, USA
2 Department of Informatics and Networked Systems, University of Pittsburgh, 15260, Pittsburgh, PA, USA
3 Connection Science, Massachusetts Institute of Technology, 02139, Cambridge, MA, USA
4 Institute for Human-Centered Artificial Intelligence, Stanford University, 94305, Stanford, CA, USA
5 School of Mathematical Sciences, The University of Adelaide, 5005, Adelaide, SA, Australia
6 Computational Story Lab, Vermont Complex Systems Center, & Vermont Advanced Computing Core, The University of Vermont, 05401, Burlington, VT, USA
7 Gund Institute for Environment & Rubenstein School of Environment and Natural Resources, The University of Vermont, 05401, Burlington, VT, USA
8 Department of Ecology and Evolutionary Biology, University of Colorado at Boulder, 80309, Boulder, CO, USA
9 MassMutual Data Science, 01002, Amherst, MA, USA
10 MassMutual Center of Excellence for Complex Systems and Data Science & Department of Mathematics and Statistics, The University of Vermont, 05401, Burlington, VT, USA
a gallagher.r@northeastern.edu
Received: 3 September 2020
Accepted: 6 January 2021
Published online: 19 January 2021
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts’ rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback–Leibler and Jensen–Shannon divergences. Through a diverse set of case studies ranging from presidential speeches to tweets posted in urban green spaces, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
Key words: Text as data / Data visualization / Word shift graphs / Sentiment analysis / Computational social science / Digital humanities / Natural language processing / Information theory
© The Author(s) 2021
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.