Among the application data stored by Google Chrome is a file named passwords.txt containing 30,000 commonly used passwords. These are used by the Zxcvbn password strength estimator from Dropbox. Chrome has several billion users, so this file sits on millions if not billions of devices. I was interested in what art could be gleaned from such a common file, so I calculated how many syllables are found in each password. From this data I could extract all the haikus hiding in the file: sequences of consecutive passwords whose syllables accumulate into the 5-7-5 pattern. Perhaps this is some of the most widespread poetry in the world.
To do this, I wrote a Node.js script that iterates over the passwords and attempts to detect the syllable count of each. Progress was slow: I checked each string against an English dictionary, which could also tell me how many syllables it contained, thanks to the syllable-count-english npm package. Over time I improved the algorithm as I learned more about the file's contents. Using regular expressions to match pure-digit strings, or strings with leading or trailing digits, I could convert the digits to their word equivalents so the syllable matcher had regular words to count. Still, this was only of partial help: most of the passwords are not pure English words, so manual checking was required for most of the strings, and 30,000 strings is a lot! In total my script detected 426 haikus.
There were also some oddities in the passwords themselves. Some appear to be truncated, monste for example. The intent behind others could be hard to determine: dallas12 could be pronounced “dallas one two” or “dallas twelve”. I tried to follow the principle of using the reading with the fewest syllables, which in this example prioritises “twelve” over “one two”. It does mean vlad1996 is converted to “vlad one nine nine six”, which might sound unnatural. Certain numbers with cultural resonance were respected, though: 007 became “double oh seven” instead of “oh oh seven”. Digits that replaced letters within words, as in sc0tland, weren’t considered to alter the pronunciation.
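The fewest-syllables rule can be made mechanical. The sketch below is my illustration, not the author's code: the syllable table is hard-coded, multi-digit runs longer than two are read digit by digit (matching the vlad1996 example), and 007 is a hand-picked special case.

```javascript
const DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"];
const TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"];
const TENS = [null, null, "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"];

// Hard-coded syllable counts for the number words above.
const SYLLABLES = {
  zero: 2, one: 1, two: 1, three: 1, four: 1, five: 1, six: 1, seven: 2,
  eight: 1, nine: 1, ten: 1, eleven: 3, twelve: 1, thirteen: 2,
  fourteen: 2, fifteen: 2, sixteen: 2, seventeen: 3, eighteen: 2,
  nineteen: 2, twenty: 2, thirty: 2, forty: 2, fifty: 2, sixty: 2,
  seventy: 3, eighty: 2, ninety: 2,
};

const phraseSyllables = (p) => p.split(" ").reduce((s, w) => s + SYLLABLES[w], 0);

// "12" -> "one two"
const digitByDigit = (num) => [...num].map((d) => DIGITS[d]).join(" ");

// "12" -> "twelve" (two-digit numbers only)
function asNumber(num) {
  const n = parseInt(num, 10);
  if (n < 10) return DIGITS[n];
  if (n < 20) return TEENS[n - 10];
  const word = TENS[Math.floor(n / 10)];
  return n % 10 ? `${word} ${DIGITS[n % 10]}` : word;
}

function readDigits(num) {
  if (num === "007") return "double oh seven"; // cultural resonance wins
  if (num.length > 2) return digitByDigit(num); // "1996" -> "one nine nine six"
  const spelled = digitByDigit(num);
  const numeric = asNumber(num);
  return phraseSyllables(numeric) <= phraseSyllables(spelled) ? numeric : spelled;
}
```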
Getting deeper into the file, I started being able to spot, in seemingly gibberish strings, patterns of Latin characters that hint at Cyrillic languages: the keystrokes produced when Russian words are typed on a QWERTY keyboard instead of the Russian ЙЦУКЕН layout. Starting with cn, nj, jkm, bz, vf and fh, and building from there, I was able to compile an almost comprehensive digest of the Russian passwords in the file. I was grateful to learn that the syllable count of a Russian word equals its number of vowels; as a non-Russian speaker, this sped up the processing. Thank you to the dictionary-ru and nspell packages.
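Decoding these strings amounts to mapping each QWERTY key to the Cyrillic letter on the same key in the ЙЦУКЕН layout, then counting vowels. A minimal sketch, assuming a lowercase-letters-only map (punctuation keys, which also carry Cyrillic letters, are omitted):

```javascript
// Each QWERTY letter key mapped to its ЙЦУКЕН (Russian) counterpart.
const QWERTY_TO_RU = {
  q: "й", w: "ц", e: "у", r: "к", t: "е", y: "н", u: "г", i: "ш", o: "щ", p: "з",
  a: "ф", s: "ы", d: "в", f: "а", g: "п", h: "р", j: "о", k: "л", l: "д",
  z: "я", x: "ч", c: "с", v: "м", b: "и", n: "т", m: "ь",
};

// Decode QWERTY keystrokes back into Cyrillic; pass through anything else.
function toCyrillic(latin) {
  return [...latin.toLowerCase()].map((ch) => QWERTY_TO_RU[ch] ?? ch).join("");
}

// In Russian, syllable count equals the number of vowels.
function russianSyllables(word) {
  return (word.match(/[аеёиоуыэюя]/g) || []).length;
}
```

For example, the gibberish-looking password vfhbz decodes to мария (the name Maria), which has three vowels and therefore three syllables.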
Still, I spent a lot of time on manual checks, processing around 1,000 passwords a day. To be in vogue I could certainly have used AI to do the job much faster, but what for? It wouldn’t have been as fun, and I learned much about the world from reading these passwords. Now I know how to say ‘I love you’ in Vietnamese (anh yêu em), my grasp of the Cyrillic alphabet has improved, I am aware of tons more American sports teams (I’ll probably forget these soon), I learned so many curse words, and I discovered quite how popular it is to reference sex in HTML input elements.
Explore further at Authenticated poetry.