Data science and digital humanities
Fact-checking Ms. Spinney’s opinion piece Inequality doesn't just make pandemics worse – it could cause them (12 April 2020) made me realize that it makes a bogus reference to statistical tests that were probably never carried out, on data that has been shown to mislead many historians. I could not find any support for her claim that “Historian Peter Turchin has described a strong statistical association between global connectedness, social crises and pandemics throughout history.” On the other hand, I found evidence that Mr. Turchin is one of the historians known to have misrepresented a well-known historical dataset from 1975.
This dataset, contrary to what Ms. Spinney’s source suggests, is not a ‘plague incidence’ dataset. It has been shown to be biased by urbanization rate, country and time, and is therefore unusable for comparisons across time or across countries; most interestingly, it is well known to have been misrepresented and misused by historians [1].
What went wrong in this case[edit]
- Humanities information was collected and reported with a questionable methodology. While J.-N. Biraben's original work, whose bibliography runs 225 pages with references to medieval texts about the plague, has huge academic value, later research, partly using digital methods, has found problems with his approach and limited the usability of his work.
- The true meaning of the information was lost in translation. It is well documented that an original work by a historian, after being translated into digital data, has misled many historians (for decades!) who seem to have misunderstood or wilfully misrepresented the data.
- The misrepresented, and in any case problematic, data was used with bad statistical methodology. This led to quasi-statistical claims in the field of humanities which are inherently invalid.
- The digital humanities research, based on wrong data that was misunderstood and misused with bad methodology, was then reported in popular science to support a biased argument with a seemingly scientific claim that has no scientific support.
This is a very good case study because it is very well documented: it potentially allows tracing misunderstandings in a humanities field over at least two to three decades, and it includes critical methodological problems both in the original humanities field and in the digital (data science) field. Some of the misunderstandings have nothing to do with the digital transformation of information into data – they are problems with source critique in an inherently humanities publication. However, data science can help add detail to this source critique. And there is another line of problems with the misuse of basic statistical inference, in this case correlation, which is famously not equal to causation.
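On that last point, a minimal sketch (with entirely invented numbers, not the historical data discussed here) of how two independent time series that merely share an upward trend can show a strong correlation without any causal link:

```python
# Toy illustration: two series that both drift upward over time correlate
# strongly even though, apart from the shared trend, they are independent.
# All numbers are synthetic; the variable names are illustrative only.
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(42)
years = range(200)
# "connectedness" and "pandemic counts" both trend upward for
# entirely unrelated (here: hard-coded) reasons.
connectedness = [t + random.gauss(0, 20) for t in years]
pandemics = [0.5 * t + random.gauss(0, 20) for t in years]

r = pearson(connectedness, pandemics)
print(round(r, 2))  # a high r, despite no causal relation between the series
```

Detrending both series (or differencing them) before correlating would make most of this apparent association disappear, which is one reason trend-driven correlations across centuries of data need careful handling.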
How Can We Make Some Typology Of Good Collaboration?[edit]
Because the original dataset is available online, and it is very interesting indeed, we can do many things. We can show, among others:
- How poor source critique of a digital transcription of humanities information creates a problematic dataset. In this particular case, mapping the humanities data assembled by Biraben in 1975 makes it clear how his (otherwise originally undocumented) selection of chronicles was biased: he researched whatever he could read in French urban libraries, and whatever microfiches or other reproductions were available to him.
- How, after careful source critique, this digitally transformed humanities source can be put to valid scientific use with the help of data science – not only by supporting a more thorough source critique, but also by finding novel, limited ways to use a problematic data source. In this case, Roosen and Curtis show, by correlating it with other digitally transcribed humanities data, which elements of Biraben's dataset are likely to be unbiased and usable for the kind of research Turchin and Spinney could rely on.
- How data science results should be fed back into the humanities argument without misinterpreting or over-interpreting the numerical analysis. We can show, by re-creating Turchin's work, that the correlation he saw was not statistically significant.
- How alternative hypotheses within the humanities field can be explored with real and simulated data, to highlight potential ambiguities and reasoning problems in an inherently humanities argument. We could show that the dataset could have supported two radically different theoretical conclusions, namely i) that inequality causes pandemics, or ii) that pandemics reduce inequality. Both theories are in line with data showing higher inequality at the time of a pandemic or local epidemic and lower inequality in the following years. A skillful analysis can give hints for more methodological humanities research and source critique to take the next step in theorizing.
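To illustrate the significance point above (this is not a reproduction of Turchin's actual analysis; every number below is simulated), a simple permutation test shows how a correlation computed on only a handful of historical periods can easily be statistically insignificant:

```python
# Simulated sketch of a permutation test for correlation significance.
# Invented data, not Biraben's or Turchin's actual series.
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n = 12  # only a dozen historical "crisis" periods -> low statistical power
connectedness = [random.gauss(0, 1) for _ in range(n)]
pandemics = [0.3 * c + random.gauss(0, 1) for c in connectedness]

r_obs = pearson(connectedness, pandemics)

# Permutation test: shuffle one series repeatedly and count how often
# random pairings correlate at least as strongly as the observed data.
trials, extreme = 5000, 0
shuffled = pandemics[:]
for _ in range(trials):
    random.shuffle(shuffled)
    if abs(pearson(connectedness, shuffled)) >= abs(r_obs):
        extreme += 1
p_value = extreme / trials
```

A large p-value here means the observed correlation is entirely compatible with chance pairing, which is the kind of check a claim of "strong statistical association" over sparse historical periods would need to pass.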
A Possible Framework[edit]
We can eventually form a deductive framework describing potential workflows in which:
1. Exploratory data science (which makes no causal claims, just presents data for any later interpretation in an unbiased, systematic way) can help reveal patterns and biases in empirical data.
2. Predictive data science (which aims to find causal links) can show whether there may be a causal link between various data. If one piece of information helps predict cases of another, then there may be a causal link that the researcher can investigate further. Alternatively, we can check whether a hypothesized link between two variables behaves randomly, or whether it follows a predictable pattern.
3. Quantitative social science modelling (which aims to find theoretically grounded causal links) can analyze the various variables more systematically.
4. Empirical humanities research can be better designed – for example, fieldwork better prepared – if the researcher is well aware of the potential geographical, temporal or other biases of the information, or simply of the total size of the field, e.g. how many people (drug users, etc.) there are within an urban space.
5. Source critique of humanities sources: as Roosen and Curtis did with the Biraben plague dataset, data scientists can help trace back the likely biases or methodological issues of a source that is not well documented. For example, Biraben did not document his selection of chronicles, but data scientists could more or less re-create the way he worked 40-50 years later.
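As a toy illustration of step 1 (exploratory bias checks) feeding step 5 (source critique), the following sketch, with entirely invented region names and numbers, shows how raw record counts can mirror archival coverage rather than true incidence – the core problem with comparing Biraben-style counts across regions:

```python
# Hypothetical sketch: if recorded plague mentions simply track how well
# each region's chronicles were accessible to the compiler, raw counts
# mislead. All rates and coverage shares below are invented.

# Invented "true" outbreak rates (identical everywhere) and the share of
# each region's chronicles accessible to the compiler.
true_rate = {"France": 10, "Iberia": 10, "Balkans": 10}
coverage = {"France": 0.9, "Iberia": 0.4, "Balkans": 0.1}

# What a compiler limited to accessible chronicles would actually record.
recorded = {region: round(true_rate[region] * coverage[region])
            for region in true_rate}
print(recorded)  # France dominates, although the true rates are identical

# A coverage-adjusted estimate recovers the flat underlying pattern.
adjusted = {region: recorded[region] / coverage[region]
            for region in recorded}
```

The exploratory step (noticing that counts track coverage) motivates the source critique; the adjustment is only as good as the coverage estimate, which in the real case Roosen and Curtis had to reconstruct decades after the fact.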
These interactions, if done in a logical workflow, can greatly increase the precision, focus and validity of humanities research.
References[edit]
- ↑ Roosen, J., & Curtis, D. R. (2018). Dangers of Noncritical Use of Historical Plague Data. Emerging Infectious Diseases, 24(1), 103-110. https://dx.doi.org/10.3201/eid2401.170477
- ↑ Schmid, B. V., Büntgen, U., Easterday, W. R., Ginzler, C., Walløe, L., Bramanti, B., & Stenseth, N. C. (2015). Source code and datasets used to link new waves of plague outbreaks in medieval Europe to climate fluctuations affecting the reservoirs of the disease in Asia [Data set]. Zenodo. https://zenodo.org/record/14973