Ooh, I got @-ed!
There are a lot of frequency-related files mentioned by @Nibbanka in this post:
Of these, the largest is a huge (22 MB) file called sortedFrequencyPali.txt:
Complete word list of all Pali words (about 967.000) as occurring in the CSCD (VRI) Tipitaka edition
Apparently the site on which the file (stored as a .zip) was originally hosted, nibbanam.com, has evaporated, but happily it survives in the Internet Archive:
http://wayback.archive.org/web/20150707075127/http://www.nibbanam.com/sortedFrequencyPali.zip
Here are the first 100 lines:
167140 ca
150824 na
116790 vā
76637 pana
72505 hoti
65284 taṃ
54673 tattha
54515 evaṃ
49782 so
49223 pe
45897 kho
44558 nāma
40389 hi
38357 tassa
38087 te
37980 vuttaṃ
35300 bhikkhave
28660 attho
26537 ayaṃ
25953 viya
25277 tena
23661 tesaṃ
22309 atha
21929 katvā
21675 yaṃ
21550 me
20986 āha
20667 tasmā
20462 idaṃ
20258 yathā
20208 ettha
19663 dhammaṃ
18975 tathā
18879 dhammā
18565 tato
18532 yo
18172 uppajjati
17963 bhagavā
17925 dhammo
17771 attano
17371 bhante
17055 paccayo
16923 ekaṃ
16701 no
16632 dve
15885 paṭicca
15779 bhikkhu
14585 idha
14428 atthi
14268 natthi
14011 kiṃ
13991 ni0
13631 vuccati
13334 cittaṃ
13099 eva
13078 honti
12783 tasmiṃ
12773 hotīti
12619 ye
12318 tīṇi
12237 sā
11286 bhikkhū
11087 iti
11006 yassa
10794 ahaṃ
10652 hutvā
10612 iminā
10566 sace
10461 bhagavato
9957 disvā
9806 imaṃ
9772 saddhiṃ
9720 ceva
9697 gahetvā
9377 pañca
9305 puna
8981 kathaṃ
8856 ime
8819 rūpaṃ
8802 rājā
8764 sī0
8753 tvaṃ
8735 siyā
8416 yena
8269 syā0
8199 ahosi
8145 gantvā
7884 niṭṭhitā
7813 nu
7789 nava
7777 idāni
7564 āvuso
7533 yattha
7425 dhammassa
7374 maṃ
7371 ka0
7342 eko
7250 yasmā
7178 sati
7153 vatvā
After that, just 966,566 lines to go.
It’s a pretty remarkable resource, if a little hard to deal with! If anyone would like a shortened version, I’d be happy to put one together and put it online.
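For anyone who would rather trim the file themselves, here is a rough Python sketch of one way to do it. It rests on a few assumptions I haven't verified against the whole file: that the unzipped text is UTF-8, that every line has the shape "count word" as in the sample above, and that a simple frequency cutoff is what you want. The function names and the default cutoff of 100 are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (count, word) pairs from lines shaped like '167140 ca'."""
    for line in lines:
        parts = line.split(maxsplit=1)
        # Skip blank or unexpected lines rather than guess at their meaning.
        if len(parts) != 2 or not parts[0].isdecimal():
            continue
        yield int(parts[0]), parts[1].strip()


def shorten(lines: Iterable[str], min_count: int = 100) -> List[Tuple[int, str]]:
    """Keep only the words that occur at least min_count times.

    min_count is a hypothetical cutoff; pick whatever suits your use.
    """
    return [
        (count, word)
        for count, word in parse_frequency_lines(lines)
        if count >= min_count
    ]


def write_short_version(src_path: str, dst_path: str, min_count: int = 100) -> int:
    """Write the trimmed list to dst_path and return how many words were kept."""
    # Assumption: the source is UTF-8; 'utf-8-sig' also tolerates a leading BOM.
    with open(src_path, encoding="utf-8-sig") as src:
        kept = shorten(src, min_count)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for count, word in kept:
            dst.write(f"{count} {word}\n")
    return len(kept)
```

Calling write_short_version("sortedFrequencyPali.txt", "shortFrequencyPali.txt", min_count=100) would then write only the words seen at least 100 times (the output file name is just a placeholder). Since the file is already sorted by frequency, you could equally stop reading at the first line that falls below the cutoff, but filtering every line keeps the sketch correct even if that ordering assumption turns out to be wrong somewhere.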