Data science and digital humanities
Fact-checking Ms. Spinney’s opinion piece [Inequality doesn't just make pandemics worse – it could cause them, 12 April 2020 | https://www.theguardian.com/commentisfree/2020/apr/12/inequality-pandemic-lockdown] made me realize that she makes a bogus reference to [some statistical tests | http://peterturchin.com/cliodynamica/coronavirus-and-our-age-of-discord] which were probably never carried out, on data that is known to have misled many historians. I could not find any support for her claim that “Historian Peter Turchin has described a strong statistical association between global connectedness, social crises and pandemics throughout history.” On the contrary, I found evidence that Mr. Turchin is one of the historians known to have misrepresented a well-known historical dataset from 1975. This dataset, contrary to what Ms. Spinney’s source says, is not a ‘plague incident’ dataset: it is demonstrably biased by urbanization rate, country and time, and is therefore unusable for comparisons across time or across countries; most interestingly, it is well known to be misrepresented and misused by historians [1].
What went wrong in this case:
- Humanities information was collected and reported with a questionable methodology. While J.-N. Biraben's original work, with a bibliography running to 225 pages of references to medieval texts about the plague, has huge academic value, over the following decades later research, partly using digital methods, has found problems with his approach and limited the usability of his work.
- This humanities information was translated into a [digital dataset](https://zenodo.org/record/14973#.XqAUj8gzbIU), which is available to researchers [2].
- The true meaning of the information was lost in translation. It is well documented that the historian's original work, once translated into digital data, misled many historians (for decades!) who seem to have misunderstood or wilfully misrepresented the data.
- The misrepresented, and in any case problematic, data was used with bad statistical methodology. This led to quasi-statistical claims in the field of humanities that are inherently invalid.
- The digital humanities research built on this wrong data, misunderstood and misused with bad methodology, was then reported in popular science to lend a seemingly scientific footing to a biased argument that has no scientific support.
This is a very good case study because it is very well documented, it potentially allows tracing misunderstandings in a humanities field over at least two to three decades, and it includes critical methodological problems both in the original humanities field and in the digital (data science) field. Some of the misunderstandings have nothing to do with the digital transformation of information into data – they are problems of source critique in an inherently humanities publication. However, data science can help add detail to this source critique. And there is another line of problems concerning the misuse of basic statistical inference, in this case correlation, which is famously not equal to causation.
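The correlation problem is especially acute for historical time series: any two series that merely trend over time will correlate, whatever their causal relationship. A minimal sketch with purely synthetic data (no real historical series involved) shows how strongly two independent random walks correlate on average, and how the apparent association collapses once the shared trending behaviour is removed by differencing:

```python
import numpy as np

rng = np.random.default_rng(0)

def abs_corr(x, y):
    """Absolute Pearson correlation between two series."""
    return abs(np.corrcoef(x, y)[0, 1])

n_pairs, n_steps = 200, 300
walk_r, diff_r = [], []
for _ in range(n_pairs):
    # Two independent random walks: no causal link whatsoever.
    a = np.cumsum(rng.normal(size=n_steps))
    b = np.cumsum(rng.normal(size=n_steps))
    walk_r.append(abs_corr(a, b))
    # Differencing removes the trend; what remains is independent noise.
    diff_r.append(abs_corr(np.diff(a), np.diff(b)))

print(f"mean |r| between independent random walks: {np.mean(walk_r):.2f}")
print(f"mean |r| after differencing:               {np.mean(diff_r):.2f}")
```

The trending series report a sizeable average correlation despite being generated independently; the differenced series do not. Any "strong statistical association" between slow-moving historical quantities deserves exactly this kind of scrutiny before a causal story is attached to it.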
Because the original dataset is available online, and it is very interesting indeed, we can show, among other things:
- how bad source critique of a digital transcription of humanities information creates a problematic dataset;
- after careful source critique, how this digitally transformed humanities source can be put to valid scientific use with the help of data science, not only by supporting a more thorough source critique, but also by finding novel, limited ways to use a problematic data source;
- how data science results should be fed back to the humanities argument without misinterpreting or over-interpreting the numerical analysis;
- how alternative hypotheses within the humanities field can be explored with real and simulated data, to highlight potential ambiguities and reasoning problems in an inherently humanities argument.
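The last point can be sketched with simulated data. Assume, purely for illustration, a constant true outbreak rate per century and a recording probability that rises with urbanization over time – the kind of reporting bias attributed to the 1975 dataset [1]. All rates and probabilities below are hypothetical, chosen only to make the mechanism visible:

```python
import numpy as np

rng = np.random.default_rng(1)

centuries = np.arange(14, 19)  # 14th to 18th century, illustrative span
true_rate = 100                # hypothetical constant outbreaks per century

# Hypothetical recording probability rising with urbanization over time:
# early outbreaks are rarely written down, later ones almost always.
record_prob = np.linspace(0.2, 0.9, centuries.size)

true_counts = rng.poisson(true_rate, size=centuries.size)
recorded = rng.binomial(true_counts, record_prob)

for c, t, r in zip(centuries, true_counts, recorded):
    print(f"{c}th century: true outbreaks = {t:3d}, recorded = {r:3d}")
```

The recorded counts climb steeply across the centuries even though the underlying rate is flat: the "trend" is an artefact of who was writing things down. Comparing real counts against simulations like this one is a cheap way to test whether an observed pattern survives plausible reporting biases.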
We can eventually form a deductive framework for the potential workflows in which
1. exploratory data science (which makes no causal claims, just presents data for later interpretation in an unbiased, systematic way),
2. predictive data science (which aims to find causal links),
3. quantitative social science modelling (which aims to find theoretically grounded causal links),
4. empirical humanities research, and
5. source critique in humanities sources
can create valid and invalid scientific results.
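A toy contrast between workflows 1 and 2 can be run on the same kind of synthetic, biased data as above (flat true rate, rising recording probability – all numbers hypothetical). The exploratory step merely presents the counts; a naive predictive step fits a trend and would happily report growth, even though the data-generating process contains no growth at all, only changing recording:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: flat true outbreak rate, rising recording probability.
centuries = np.arange(14, 19)
true_counts = rng.poisson(100, size=centuries.size)
recorded = rng.binomial(true_counts, np.linspace(0.2, 0.9, centuries.size))

# Workflow 1, exploratory: present the data, make no causal claim.
for c, r in zip(centuries, recorded):
    print(f"century {c}: recorded outbreaks = {r}")

# Workflow 2, naive predictive: a least-squares fit "finds" a trend,
# although only the recording probability changed, not the true rate.
slope, intercept = np.polyfit(centuries, recorded, 1)
print(f"fitted trend: {slope:.1f} additional recorded outbreaks per century")
```

The fitted slope is positive and would look like evidence of worsening outbreaks; only knowledge of the source (here, the simulation; in real research, the source critique) reveals it as invalid. This is the sense in which the same dataset, passed through different workflows, yields valid or invalid scientific results.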
- [1] Roosen, J., & Curtis, D. R. (2018). Dangers of Noncritical Use of Historical Plague Data. Emerging Infectious Diseases, 24(1), 103–110. https://dx.doi.org/10.3201/eid2401.170477
- [2] Schmid, B. V., Büntgen, U., Easterday, W. R., Ginzler, C., Walløe, L., Bramanti, B., & Stenseth, N. C. (2015). Source code and datasets used to link new waves of plague outbreaks in medieval Europe to climate fluctuations affecting the reservoirs of the disease in Asia [Data set]. Zenodo. https://zenodo.org/record/14973#.XqAUj8gzbIU