During its bankruptcy, a lot of Enron's internal communication was made public. Were historians and sociologists able to extract any useful information from it?

This is a really fascinating question, and the answer is a definitive yes.

The trove of emails revealed during the bankruptcy and investigation - which have become known as "The Enron Corpus" - was likely the largest and densest dataset of person-to-person written communication ever to be analysed by academics at the time. It was also unique in the sense that it was likely the first such dataset never to be actually read by a human researcher, but rather parsed for meaning largely by computers.

The dataset comprised over 600,000 emails between approximately 150 senior employees over a period of years. [See note 1]

The emails (along with a load of other internal data) were originally captured by the Federal Energy Regulatory Commission during their investigation, and put into the public domain as an explicit means of making their findings more credible (and you might argue, more damning).

One challenge which recurs quite a lot in the story of the Enron Corpus is that "putting something online", as FERC did, didn't necessarily make it in any way practically available. Today we'd happily download and process the 1.7 GB file with little difficulty, but this was 2001 and things were more challenging back then, not only because of the size but also because of the completely disordered way the emails were provided.

(Interestingly, for an electronic and standardised medium, email is really difficult to parse, not least because the default behaviour of email is to repeat every previous email in a chain. There are many emails in the corpus where it's not even clear who is sending the email.)
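To give a feel for the parsing problem, here is a minimal sketch in Python using only the standard library's `email` module. The sample message, the `QUOTE_MARKERS` tuple and the `newest_text` helper are all made up for illustration; they are not taken from the corpus or from any researcher's actual pipeline, and real mail clients use many more quoting conventions than the two assumed here.

```python
from email import message_from_string

# Hypothetical raw message, invented for illustration (not from the corpus).
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: RE: Friday

Sounds good, see you then.

-----Original Message-----
From: bob@example.com
Sent: Monday
Subject: Friday

Are we still on for Friday?
"""

# Assumed markers that introduce a repeated earlier email in the chain.
QUOTE_MARKERS = ("-----Original Message-----", "----- Forwarded by")


def newest_text(raw):
    """Return (sender, text) for only the newest message in a chain."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a plain string for a non-multipart message
    kept = []
    for line in body.splitlines():
        if line.strip().startswith(QUOTE_MARKERS):
            break  # everything after this is a repeated earlier email
        if line.startswith(">"):
            continue  # skip inline quoted lines
        kept.append(line)
    # The sender is None when the From header is missing.
    return msg.get("From"), "\n".join(kept).strip()


sender, text = newest_text(RAW)
print(sender, "|", text)
```

Even this toy version has to guess where the "new" part of an email ends, and it returns `None` for the sender when the `From` header is missing, which is exactly the sort of ambiguity described above.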

Eventually, various academics purchased copies of it (including one who apparently paid $10,000 for hard disks containing it) and made it available in more manageable ways online, including a searchable version which is still available today: http://www.enron-mail.com/ (Trigger warning: site contains 2001-era CSS 😎)

Academics set to work on this corpus with alacrity. To date, the corpus has been referred to in over 3000 academic papers.

The New Yorker wrote an article which provides a really interesting summary of the range of purposes for which the emails have been used, including the mundane:

An “extensive benchmark study of e-mail foldering” ... used seven large accounts to help determine whether people organized their e-mail in ways that might be replicable by machine intelligence. (“Email foldering is a rich and interesting task,” the study’s lead author, Ron Bekkerman, noted, in what may be the paper’s most surprising conclusion.) The answer was not yet: people are too idiosyncratic in the ways they organize their stuff.

To the potentially useful, even if ultimately unsuccessful:

Another team used the corpus to develop a “compliance bot” that could identify sensitive elements in text and alert writers if a message might get them in trouble. ... An M.I.T. student working on a compliance bot noted that it seemed nearly impossible to identify evidence of financial misconduct using basic search strings. He had more success tracking down pornography—of which there was, oddly, a lot—with words like “sex.”

To the sociological:

Noting that “a small number of users have sent a large number of messages”—a fact that will shock no one who gets e-mail at work—one research team mapped epistolary ties on a Gower layout (a connect-the-dots plot) to understand who was in contact with whom. They found a tight nest of connections around Enron’s president, vice-president, and C.E.O. Angled off to either side were ears with more remote networks of traders, managers, and lawyers. The plot looks like a donkey head.

To the linguistic:

In 2014, an enterprising business-English teacher named Evan Frendo had the idea of using the corpus to locate phrases helpful to the foreign businessperson working with Americans. After what must have been punishing study, he discovered a fixation on “ball” metaphors. “I thought I’d get the ball rolling,” one Enroner wrote. “Sounds like you guys had a ball at dinner,” another said. “I played hard ball and told them that I had to have more time,” a correspondent reported. “Someone really dropped the ball here!” an employee chides. “From June 1, we will be totally on the ball,” reads an e-mail that you don’t believe. “I will pretty much leave it in your ball park about Friday night,” somebody writes (a message that Frendo correctly annotates “???”). All told, the corpus contained six hundred and two instances of ball speech, apparently covering every scenario in modern American business. It is not clear that this compendium eases the task of the Danish banker on a morning flight to Dallas. But perhaps it tells him where to focus his study.

The "Enron Corpus" has now been largely deprecated in academic use and it's difficult to find papers that don't resolve to 404 links of now-retired researchers.

Why did it fall out of use? The amount of person-to-person communication available to researchers massively increased as a result of the arrival of "Web 2.0" and the Enron Corpus is now dwarfed by other datasets such as the Reddit Corpus, which is probably even sophisticated enough to detect that this entire paragraph is intended to distract attention from the fact that my sources contain only one academic paper and two press articles.

If you do want a real-life example of recent academic analysis of the Enron Corpus, then the Parakweet Labs email intent dataset (linked in the sources below) is a good, freely available example.

Notes:

1. Quantifying the actual number of emails is harder than it sounds, and there is essentially no final number. Does an out-of-office reply count? What about an email that wasn't actually sent but is included in a forwarded chain? What about an email from a listserv? What about a failed delivery notification? But we can agree it's a lot of emails.

Sources:

- Enron Mail archive: http://www.enron-mail.com/

- Carnegie Mellon University description of the dataset: https://www.cs.cmu.edu/~enron/

- Stanford University site on the dataset: https://snap.stanford.edu/data/email-Enron.html

- "What the Enron E-mails Say About Us", The New Yorker: https://www.newyorker.com/magazine/2017/07/24/what-the-enron-e-mails-say-about-us

- "Armies of Expensive Lawyers Replaced by Cheaper Robots", The New York Times: https://www.nytimes.com/2011/03/05/science/05legal.html?hp

- Recent example of academic research using the corpus: https://github.com/ParakweetLabs/EmailIntentDataSet/wiki