Are Google's statistics for the prevalence of the word "fun" over the years accurate? If so, why did its usage suddenly drop off in the early 1800s?

by 1sagas1
ben0x539

Am I totally off-base if I look at the graphs for "sun,fun" and observe that in the 1800s, "fun" drops wildly just as "sun" goes up, and speculate that it's clearly because Google has been misreading that weird long s (ſ) as f all along? Wikipedia quotes someone saying the weird long s "rarely appears in good quality London printing after 1800," so that'd fit.
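To make the guess concrete, here is a minimal sketch of what such a misreading would do to word counts. The `ocr_misread` function and the sample sentence are made up for illustration; I don't know how Google's OCR actually handles the long s.

```python
from collections import Counter

def ocr_misread(text):
    # Assumption for illustration: the OCR turns every long s (ſ) into 'f'.
    return text.replace("ſ", "f")

# Made-up sample of pre-1800 style printing, with long s at the start
# and in the middle of words.
printed = "the ſun roſe and the ſun ſet"

# Count words as the (hypothetical) OCR would see them.
counts = Counter(ocr_misread(printed).split())
print(counts["fun"], counts["sun"])
```

Every "ſun" on the page ends up counted as "fun", and "sun" is not counted at all, which is the pattern the two graphs would show if this were the cause.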

edXcitizen87539319

One thing to keep in mind with Google's Ngram Viewer is the way it works:

It shows the relative frequency of the search term: for every year, it counts how often the word(s) occur in all of that year's written works available to Google, as a share of all the words in those works.

Ideally this would be all written works, but Google is not yet that big. Instead, they have put most of the corpora they have access to in a huge database and use their Ngram Viewer on that. And they keep adding stuff to it (most recently in 2012). Having more works to search is better, but there is one downside: the content of the database is not random. This matters more the fewer works there are (the farther back you go in time).

For example, let's say there are only two corpora: one corpus of works of fiction, containing most of the books published since 1975 (I made up this year for the example), and one corpus of legal texts published since 1980 (again, for the example only). In works of fiction there is lots of 'fun' to be had, in legal texts much less so. If you searched this database for the word 'fun', the relative frequency would drop sharply in 1980. Not because the word was used any less, but because the data doubled in size from that year on!
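The two-corpus example can be sketched in a few lines. All the counts here are invented, just like the years; the only point is that the ratio changes while the fiction usage stays exactly the same.

```python
# Hypothetical start years and per-year counts, made up for this example.
FICTION_START, LEGAL_START = 1975, 1980
FICTION_FUN, FICTION_WORDS = 100, 1_000_000   # 'fun' is common in fiction
LEGAL_FUN, LEGAL_WORDS = 1, 1_000_000         # 'fun' is rare in legal texts

def relative_frequency(year):
    """Share of all words in the database for `year` that are 'fun'."""
    fun_count = 0
    total_words = 0
    if year >= FICTION_START:
        fun_count += FICTION_FUN
        total_words += FICTION_WORDS
    if year >= LEGAL_START:
        fun_count += LEGAL_FUN
        total_words += LEGAL_WORDS
    # No data at all before the first corpus starts.
    return fun_count / total_words if total_words else None

for year in (1979, 1980):
    print(year, relative_frequency(year))
```

The frequency roughly halves between 1979 and 1980, even though the number of times 'fun' appears in fiction did not change at all.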

Now I tried finding out what corpora Google uses exactly, but the info page which is linked everywhere (including on the Google Research Blog) gives a 404 error. I suspect though that something like what I described above is the case here. If you turn off smoothing in your graph ("smoothing of [0]" instead of [7]) you can see that the frequencies vary wildly per year before 1820 (from 0.001 to 0.013). This is an indication of a relatively small dataset for those years.
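To show why smoothing hides this, here is a rough sketch. I'm assuming the smoothing setting is a centered moving average, where a value of n averages each year with up to n years on either side; the `smooth` function and the raw values are my own, made up within the range quoted above.

```python
def smooth(values, n):
    """Centered moving average: each point is averaged with up to n
    neighbours on each side (fewer at the edges of the series)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - n)
        hi = min(len(values), i + n + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out

# Made-up yearly frequencies that jump around like the pre-1820 data.
raw = [0.001, 0.013, 0.002, 0.012, 0.001, 0.013, 0.002]

def spread(values):
    return max(values) - min(values)

print(spread(smooth(raw, 0)))  # smoothing of 0: the raw jumps are visible
print(spread(smooth(raw, 3)))  # heavier smoothing: the jumps mostly vanish
```

With a smoothing of 0 you get the raw series back, so the year-to-year noise from a small dataset shows up directly.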

The trend after 1820 is much smoother, even without any smoothing. Of course this still leaves the question of how that upward trend came to be, and whether or not the dip in the 1960s is real. For a proper analysis, you really need to know more about the underlying data.