How has/will "big data" affect the study of history?

by Psythor

I've got a couple of related questions I guess...

  1. I was wondering if anyone has any insight into any interesting history that has been done utilising "big data" - by which I mean, mining huge datasets in a way that has only been possible in the last decade or two? Did any of this sort of analysis overturn any previously prevalent explanations for events?

  2. Has any thought been given by historians as to how historians of the future will deal with the huge amounts of data we generate now? Rather than worry about a lack of sources, won't future historians be drowning in billions of tweets, e-mails, even Reddit posts? Has any thought been given to how all of this information might be parsed in a meaningful way?

Hoping this doesn't break the 20-year rule, as I'm asking about the practice of studying history rather than about recent events per se!

flynavy88

Well, one of the limitations of data mining is that the data has to exist in the first place: there must be records and figures available to be mined and explored. That mostly limits its scope to more recent history, and it also pushes the focus toward certain areas of history more than others, such as economic history, for which we have large quantifiable sets of data available.

As far as interesting uses... one of the most interesting is THOR, a US Air Force program to map every single bomb dropped by the US since WW I, using paper reports and modern databases, to examine the effectiveness of air campaigns and bombing targets. A lot of it of course is used to refine and improve tactics and strategies, but I'm sure a military historian could derive a lot of interesting historical insights from it.

Likewise, for US military historians, combat and casualty figures are being documented and explored to provide insight into military operations. For instance, historical data has been compiled into databases of the casualties sustained in each war. Example: American War and Military Operations Casualties: Lists and Statistics

As far as how future historians will deal with it? A lot of it is speculation, but it will likely be parsed as it is today - with automation to sift through the data to aid historians in identifying relevant and historically accurate data from the massive amounts of irrelevant and inaccurate data.

rbnm899

I'm not a historian, but I do work with big data as part of my day-to-day work, so I may be able to answer this.

The type of big data I work with is biological, and one of the major ways big data has already affected the study of history is from the study of genetics.

Most people, I'm sure, have heard of the study which showed that about 0.5% of the world's population are descended from Genghis Khan. This was originally published in the American Journal of Human Genetics in a report called The Genetic Legacy of the Mongols. In brief, the study involved sequencing the Y-chromosomes of many different men, which makes it possible to trace their male lineage. A genetic analysis was then used to spot unusual features shared by many of the Y-chromosomes and thus infer a common ancestor, with the global distribution of these features among different populations suggesting a link to the Mongols.

Now that study was from 2003, when genetic sequencing was in its infancy; it is now much cheaper and we can sequence an individual's DNA relatively easily. To see the potential for the future, just read this news report in Nature of a genome hacker creating the largest ever family tree, containing 13 million individuals.

So what can big data genetics do in the future? From DNA sequencing of the modern population we can trace past migrations of human populations. This was actually done fairly notably back in 2001 as part of a BBC series called Blood of the Vikings, which examined where in Britain the Vikings settled, leaving their DNA behind. With more and more people across the world being sequenced, we should be able to build a more detailed picture of past human migrations.

You may also be able to find genetic marks in modern DNA left by other historical events. We can already see evidence of the Mongol invasions in the modern human genome, and the approach has the potential to reveal smaller events such as the sack of Magdeburg.

One of the things I know for certain this field has disproved is the theory that Polynesia was populated by people from South America who reached it by raft, as proposed by Thor Heyerdahl in Kon-Tiki. In the 1990s scientists examined the mitochondrial DNA of Polynesians and found that they are most closely related to people from Southeast Asia, not South America.

Interestingly, non-human genetic data is also being studied. I remember reading in 1491 by Charles C. Mann how maize was remarkable for being domesticated from a non-edible species, teosinte. Recent scientific work on the population genetics of different varieties of maize is giving us some insight into how this was done.

restricteddata

  1. I don't think there are really new, huge conclusions that have been enabled by "big data." There have long been historians willing to crunch numbers, and sometimes that has produced interesting arguments (the Annales School was very influential for a while, using economic data to tell stories about major shifts in history), but generally speaking most of the "big data" projects of the last decade or so have so far ended up with rather banal conclusions. At least the ones I have seen. Most of the conclusions seem to just underwrite the qualitative conclusions that people had already come to. (In science studies, there seem to be endless "big data" citation studies that conclude that disciplines mostly exist except some people in some disciplines publish in related disciplines. Surprise surprise.)

But I don't want to dismiss the tools. The tools matter, even if they are only part of an argument rather than the provider of the argument. So as a very easy, cheap example: whenever I want to track the rise and fall of a given term, I just pop over to Google Ngram Viewer and see what it tells me. It's never the argument, but it can be part of an argument. Here's one of my favorite examples: What age are we in? You can see a lot of interesting trends here about how we define ourselves relative to technology. That can't be the argument in and of itself, but it meshes well with other historical discussions.
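For anyone curious what "tracking the rise and fall of a term" amounts to computationally, here is a minimal sketch in Python. It is not how Google Ngram Viewer is actually implemented; it just illustrates the general idea of dividing the number of times a phrase appears in a given year by the total number of same-length phrases in that year. The function names and the tiny corpus are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_share(texts_by_year, phrase):
    """For each year, return occurrences of `phrase` divided by the
    total number of n-grams of the same length in that year's texts."""
    target = tuple(tokenize(phrase))
    n = len(target)
    shares = {}
    for year, texts in sorted(texts_by_year.items()):
        hits = total = 0
        for text in texts:
            tokens = tokenize(text)
            # Slide a window of n tokens across the text.
            grams = list(zip(*(tokens[i:] for i in range(n))))
            total += len(grams)
            hits += sum(1 for gram in grams if gram == target)
        shares[year] = hits / total if total else 0.0
    return shares

# Hypothetical toy corpus, made up purely for this example.
toy_corpus = {
    1950: ["We live in the atomic age", "The atomic age has begun"],
    1995: ["Welcome to the information age", "The atomic age is over"],
}

for year, share in ngram_share(toy_corpus, "atomic age").items():
    print(year, round(share, 3))
```

Normalizing by the yearly total matters because far more text survives from recent decades; raw counts would make almost every term look like it is on the rise.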

Similarly, citation analysis actually can provide interesting arguments about changes over time. So you can track how many articles are submitted on the philosophy of quantum mechanics over time, for example, and note that the proportion relative to other articles in physics drops dramatically in the 1950s through the 1970s. This is then a nice, compelling datapoint to add to a discussion of changing trends in physics in the Cold War — a shift away from the abstract, in part because the abstract doesn't lend itself towards scaling up and doesn't get you government grants.
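As a rough sketch of the kind of count behind such a datapoint, the share of articles on a topic can be computed per decade as below. The article records and topic labels are invented placeholders, not real publication data, and a real study would need a much more careful way of deciding what counts as being "on" a topic.

```python
from collections import defaultdict

def topic_share_by_decade(articles, topic):
    """Return, for each decade, the fraction of articles tagged with `topic`.

    `articles` is an iterable of (year, set_of_topic_labels) pairs.
    """
    totals = defaultdict(int)
    on_topic = defaultdict(int)
    for year, topics in articles:
        decade = year // 10 * 10
        totals[decade] += 1
        if topic in topics:
            on_topic[decade] += 1
    return {decade: on_topic[decade] / totals[decade] for decade in sorted(totals)}

# Hypothetical records, made up purely for this example.
toy_articles = [
    (1935, {"quantum foundations"}),
    (1938, {"nuclear"}),
    (1955, {"nuclear"}),
    (1957, {"solid state"}),
    (1958, {"quantum foundations"}),
    (1959, {"particle"}),
]

print(topic_share_by_decade(toy_articles, "quantum foundations"))
```

The point of using a proportion rather than a raw count is that the total volume of physics publishing changed a lot over the period, so only the relative share says anything about shifting priorities.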

So the tools matter. Digital tools have radically transformed how younger historians do research. Our training, note-taking, and archive use sometimes vary drastically from how our advisors, even our "young" advisors (e.g. people who got tenure recently), wrote their dissertations. And this is something historians have spent a lot of time talking about.

  2. It has been hemmed and hawed over. There are some efforts to preserve that data; Twitter is giving its Tweets to the Library of Congress, for example. Much of it however is not preserved. And it's not clear that this is the stuff that matters. Is having a Twitter archive going to make up for the fact that most people don't preserve their e-mail for decades and decades? I mean, even historians of the late 20th century generally have long, detailed correspondence files that their historical actors kept and somehow got preserved. But nobody does that anymore. Historians of science have thoroughly freaked out about the fact that most communication between scientists today is very ephemeral and not preserved. We have their tweets, but those are usually pretty shallow as a dataset.

As for what to do about it... historians do try to encourage (and sometimes require) scientists and important figures to save their e-mails. Of course not everyone likes this; e-mail accounts are often a blend of professional and unprofessional, and even the "professional" correspondence is a lot more casual than most written communication. I doubt the success rate will be high. Unless the NSA is secretly archiving everything and will let historians of the future look at it, I doubt much will be there. (As an aside, the FBI often transcribed wiretapped phone conversations — their archives can give rare insights into casual telephone conversations, the sort of thing that is rarely written down. Of course, it does so at the cost of violating the privacy of the person in question.)

But historians will adapt. They always do. That's the job. They will find ways to make interesting arguments and tell interesting stories. They will be limited by the source material, but they always are, always have been, and always will be.