- Published on 12 July 2017
The buzz of busy commuters, as well as the lack of it, leaves behind digital footprints that are rich in information about all aspects of people's lives. In EPJ Data Science, Eszter Bokányi and team analyze 63 million tweets posted across the US over a period of 10 months, and find links between unemployment rates and the users' Twitter activity.
Guest post by Eszter Bokányi, originally published on SpringerOpen blog
Until recently, it has been a time-consuming, costly and arduous task to collect and analyze data about individual humans at a large scale. With the advent of the digital era, there is a growing amount of data accessible online that enables the analysis and modeling of human behavior. However, our understanding of these digital data sources and the methods that connect the data to real-world outcomes is still limited.
Some of the most interesting data that can be collected from these digital footprints are those with geographic information attached. By connecting digital data to geographical areas, researchers are able to estimate a range of phenomena, from land use patterns to poverty, population density or crime rates.
In our article just published in EPJ Data Science, we present a framework for estimating the employment and unemployment rates of United States counties. Previous research has linked the daily activity patterns of individuals to the regularity of their working hours, unemployment to measurable psychological effects in mobile communication patterns, and the aggregated activity of geographical regions in certain time intervals of the day to unemployment. The aim of the present work is to give an alternative narrative and a broader mathematical framework for these estimates.
We collected aggregated workday activity timelines of US counties from the normalized number of messages sent in each hour on the online social network Twitter from January 2014 to October 2014. These aggregated timelines are the superpositions of many individuals’ timelines that we cannot measure due to the sparsity of the data. But if we could cluster individuals into homogeneously behaving groups based on their daily activity patterns, the data would enable us to measure the extent to which the time series of each group is present in the time series of the whole county.
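As a rough illustration of what such an hourly timeline looks like, the sketch below turns a list of message hours into a normalized 24-bin profile. The normalization used here (dividing by the total number of messages) is an assumption, since the post does not spell out the exact scheme, and the function name and example hours are made up for illustration.

```python
import numpy as np

def normalized_timeline(hours, n_bins=24):
    """Return the share of messages sent in each hour of the day.

    `hours` holds one integer hour (0-23) per message.
    Dividing by the total count is an assumed normalization.
    """
    counts = np.bincount(np.asarray(hours, dtype=int), minlength=n_bins).astype(float)
    total = counts.sum()
    if total == 0:
        return counts  # no messages: keep an all-zero timeline
    return counts / total

# Made-up posting hours for one hypothetical county (illustration only).
example_hours = [7, 8, 8, 12, 18, 21, 21, 21]
timeline = normalized_timeline(example_hours)
```

Each entry of `timeline` is the fraction of that county's messages sent in the corresponding hour, so counties with very different tweet volumes become comparable.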
We assume that, judging by their daily time series, there are two types of people: those who have regular working hours and those who do not. We hypothesize that each county’s timeline is a linear combination of these two patterns, and then look for the underlying patterns and mixing factors that minimize the error with respect to the measured timelines.
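One way to picture this fit is the minimal sketch below. It is not necessarily the procedure used in the article: it assumes the two-pattern model can be written as `X[c] ≈ w[c] * p_a + (1 - w[c]) * p_b` and solves it in the least-squares sense with a singular value decomposition of the mean-centred data. The function name, the synthetic "early" and "late" patterns, and the county count are all hypothetical.

```python
import numpy as np

def fit_two_patterns(X):
    """Least-squares fit of X[c] ~ w[c] * p_a + (1 - w[c]) * p_b.

    X has one row per county and one column per hour of the day.
    Returns (w, p_a, p_b), with w rescaled so that it spans [0, 1].
    """
    mean = X.mean(axis=0)
    # The leading singular vector of the centred data is the direction
    # along which county timelines differ the most.
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = U[:, 0] * S[0]
    direction = Vt[0]
    lo, hi = scores.min(), scores.max()
    w = (scores - lo) / (hi - lo)
    p_b = mean + lo * direction  # pattern reached at w = 0
    p_a = mean + hi * direction  # pattern reached at w = 1
    return w, p_a, p_b

# Synthetic data for illustration only: two made-up daily patterns
# mixed with random weights for 50 hypothetical counties.
rng = np.random.default_rng(0)
hours = np.arange(24)
early = np.exp(-0.5 * ((hours - 8) / 3) ** 2)
early /= early.sum()
late = np.exp(-0.5 * ((hours - 20) / 3) ** 2)
late /= late.sum()
true_w = rng.uniform(0.2, 0.8, size=50)
X = np.outer(true_w, early) + np.outer(1 - true_w, late)

w, p_a, p_b = fit_two_patterns(X)
reconstruction = np.outer(w, p_a) + np.outer(1 - w, p_b)
```

Note that such a decomposition is only determined up to a rescaling of the mixing factor, and the sign of the singular vector is arbitrary, so which of `p_a` and `p_b` plays the role of the regular-working-hours pattern has to be decided by inspecting the recovered shapes.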
It turns out that the underlying “hidden” patterns indeed correspond to one with earlier morning and night activity, and one with a later rise in the morning and increased night activity. Moreover, the mixing factor indicating the extent of the early-rising pattern in each county’s timeline correlates significantly with employment rates (0.46±0.02), and with the opposite sign with unemployment rates (−0.34±0.02).
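A correlation of this kind can be sketched as follows. The post does not say how the quoted uncertainties were obtained, so the bootstrap used here is an assumption, and the mixing factors and employment rates are synthetic numbers generated purely for illustration, not the article's data.

```python
import numpy as np

def correlation_with_bootstrap(x, y, n_resamples=1000, seed=0):
    """Pearson correlation of x and y plus a bootstrap estimate of its spread."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    rng = np.random.default_rng(seed)
    n = len(x)
    resampled = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample counties with replacement and recompute the correlation.
        idx = rng.integers(0, n, size=n)
        resampled[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    return r, resampled.std()

# Synthetic stand-ins for 200 hypothetical counties (illustration only).
rng = np.random.default_rng(1)
mixing = rng.uniform(0.0, 1.0, size=200)
employment = 0.5 + 0.1 * mixing + rng.normal(0.0, 0.05, size=200)

r, spread = correlation_with_bootstrap(mixing, employment)
```

Here `r` is the point estimate and `spread` is the standard deviation of the correlation across resamples, which plays a role analogous to the ± values quoted above.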
Our results thus show that, by analyzing a relatively sparse, publicly available geolocated dataset, a very simple model can explain employment and unemployment rates to a certain extent. This kind of analysis could give policy makers better insight into the processes connected to employment phenomena, and could form the basis of future datasets in which problems are identified not only from officially registered unemployment, but also on the basis of the digital footprints people leave on different platforms.