Word frequencies from French Google Books corpus

The Google Books NGram Viewer is a great resource. It has word frequency counts for a large sampling of books spanning hundreds of years and many languages.

I wrote some code (in this Colab notebook) to help me augment my vocabulary lists with frequencies of how often a word, in any of its inflected forms, appears in the subset of French books published in 2007 and known to Google. This lets me post tables like this:

expression (root)    frequency
bondé                1 in 742,000
détrempé             1 in 2,040,000
patauger             1 in 1,220,000
ballotté             1 in 834,000
pain azyme           1 in 10,200,000

According to this estimate, I’d have to read 742,000 words on average before coming upon bondé or one of its forms. As it happens, this word has become (somewhat) more common in books over time.

I’ve gone back to my earlier vocabulary list posts (Pietr-le-Letton Chapter 5 and Chapter 6) and updated the lists with frequencies. I’ve also pointed out a few false conflations that Google has made. For example, it counts étaient as a form of the rare verb étayer. Technically it is one, but most instances of étaient are conjugations of être. Take a look at the old list posts, and play around with the NGram Viewer if you’ve never seen it before.

Vocab list: Pietr-le-Letton, Chapter 5

I’m making lists of unfamiliar words as I read Georges Simenon’s Pietr-le-Letton. Below is my list for Chapter 5 (Le Russe Ivre), with links to the search result page on Linguee and word frequencies from the Google NGram Viewer.

The chapter takes place in a run-down bar in a fishing town (Fécamp) in winter, which explains why there are so many words about boats, bars, and rain. There are 26 words here and the chapter is 9 pages long, so that’s about 3 new words a page, which makes this a “just right book” for my reading level.

expression (root)       frequency
prunelles               1 in 742,000
bouges                  1 in 61,200
soutiers                1 in 11,100,000
zinc                    1 in 396,000
canaille                1 in 690,000
entrebâillement         1 in 4,290,000
crapuleux               1 in 1,690,000
louvoyer                1 in 1,640,000
luisant                 1 in 670
oeillade                1 in 13,900,000
se saouler              1 in 5,040,000
vergue                  1 in 1,610,000
tressaillir             1 in 454,000
heurter                 1 in 48,400
toussotement            1 in 11,600,000
buée                    1 in 1,670,000
ricaner                 1 in 528,000
bac                     1 in 82,000
tremper                 1 in 140,000
tiraillait              1 in 594,000
bec-de-cane             1 in 19,800,000
tournant                1 in 8,540
marchand de bestiaux    1 in 17,500,000
entrouverte             1 in 382,000
blême                   1 in 860,000
tasser                  1 in 166,000

The frequency numbers are from the French Google Books corpus, specifically books published in 2007. They count how many words of such books you would have to read on average before coming upon the given word in any of its inflected forms. As you can see, a lot of these are fairly literary or old-fashioned words; Pietr-le-Letton was written in 1931, after all.

There are a few glitches in this analysis. The word luisant, from luire = to shine, is not so common that you’d see it once in 670 words. Rather, Google NGram Viewer thinks that lui is a form of luire. As far as I can tell, that’s outright wrong, but of course the pronoun lui is very common, and so the conflation makes the estimate worthless. The single form luisant occurs 1 in 1,160,000, but that doesn’t account for all the other forms of luire. So take the frequency estimates with a grain of salt.
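To make the arithmetic behind the “1 in N” figures concrete, here is a minimal sketch of how per-form relative frequencies could be summed and turned into that format. It is not the code from my notebook: the function names are made up for illustration, it assumes you already have a relative frequency for each inflected form, and it does not call the NGram Viewer at all. The value for lui is an assumption too. I’ve approximated it as 1 in 670, since it dominates the conflated figure.

```python
from math import floor, log10
from typing import Mapping


def round_sig(n: float, digits: int = 3) -> int:
    """Round a positive number to `digits` significant figures."""
    if n <= 0:
        raise ValueError("n must be positive")
    magnitude = floor(log10(n))
    if magnitude < digits - 1:
        # Fewer integer digits than requested: plain rounding is enough.
        return round(n)
    factor = 10 ** (magnitude - digits + 1)
    return round(n / factor) * factor


def one_in_n(form_frequencies: Mapping[str, float]) -> str:
    """Sum the relative frequency of every form, then report the reciprocal.

    `form_frequencies` maps each inflected form to its share of all
    words in the corpus (a hypothetical input format).
    """
    total = sum(form_frequencies.values())
    return f"1 in {round_sig(1 / total):,}"


# The single form luisant, using the figure quoted above.
print(one_in_n({"luisant": 1 / 1_160_000}))  # 1 in 1,160,000

# Wrongly conflating the pronoun lui (approximated here) swamps the estimate.
print(one_in_n({"luisant": 1 / 1_160_000, "lui": 1 / 670}))  # 1 in 670
```

Because the frequencies are summed before taking the reciprocal, one very common form that doesn’t belong is enough to drag the whole estimate down to its own level, which is what happened with luisant.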

I’ll be curious to see if my list length diminishes in later chapters and later novels. I’m reminded of the game I used to play when reading Sherlock Holmes stories aloud with my daughter – we’d joke about how many paragraphs into a story Conan Doyle could get without using the word “singular”. It was rarely double-digit.