Where should I look to gather and sort data on the most frequent SC segments occurring in the suttas? Or has someone done this already?
Thanks
I do not think anybody has done it yet. However, it seems like a good approach.
I tried to quickly do that: SC frequencies - Google Drive
The CSV contains the segments from the bilara-data repository which occur more than once, without further preprocessing (sorted descending by frequency). Not sure if that's what was meant.
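For anyone who wants to reproduce a count like this, here is a minimal sketch. It is not the script used for the CSV above. It assumes each bilara-data file is a flat JSON object mapping segment IDs to text, and the function names `load_segments` and `repeated_segments` are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def load_segments(root):
    """Yield segment texts from every JSON file under `root`.

    Assumption: each file is a flat {segment_id: text} mapping.
    """
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        for text in data.values():
            if isinstance(text, str):
                yield text


def repeated_segments(texts, min_count=2):
    """Return (text, count) pairs seen at least `min_count` times,
    sorted by descending count."""
    counts = Counter(t.strip() for t in texts if t.strip())
    return [(t, n) for t, n in counts.most_common() if n >= min_count]
```

Calling `repeated_segments(load_segments(some_directory))` on a local checkout would give the raw, unnormalized frequencies; which directory to point it at depends on which texts you want to include.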
Does this have any fuzziness? We'd probably want to filter out punctuation and capitalization at the very least. And maybe add a little fuzziness to the text matches.
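A rough sketch of what that filtering could look like, using only the Python standard library. The helper names `normalize` and `similar` and the 0.9 threshold are illustrative assumptions, and `difflib` is just one simple way to get approximate matching; comparing every pair of segments this way would be slow on the full corpus.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace Unicode punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(kept.split())


def similar(a, b, threshold=0.9):
    """True if the normalized strings are at least `threshold` similar.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

Normalizing first and then counting exact matches covers the punctuation and capitalization cases cheaply; `similar` is only needed for the genuinely fuzzy ones, such as spelling variants.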
But of course it really depends what we're looking for. A problem inherent in this approach is that the choice of where to segment is somewhat arbitrary. So there will be more common strings inside the segments.
In the CSV you made, the number one is Āmantā, which really just tells us that a one-word sentence is more likely to be in a separate segment.
Would you be willing to talk about what you are interested in learning from that data? Just curious.
I just took the raw segments/paragraphs and counted how often each recurs. I agree that counting n-grams (word sequences) is probably more useful, whatever the purpose. The folder should be updated now with some frequencies of word sequences on lowercased text without punctuation (e.g. "freqs-3" contains the counts for 3-word sequences). Accounting for fuzziness would add quite some complexity, I think.
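Counting word sequences along these lines can be sketched as follows. This is an illustration and not the code behind the "freqs-N" files; it assumes the input texts are already lowercased and stripped of punctuation, and `ngram_counts` is a hypothetical name.

```python
from collections import Counter


def ngram_counts(texts, n=3):
    """Count every run of `n` consecutive words within each text.

    N-grams never span two texts, so phrases that straddle a segment
    boundary are not counted.
    """
    counts = Counter()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts
```

`ngram_counts(texts, n=3).most_common(50)` would then list the fifty most frequent 3-word sequences. Segments shorter than `n` words contribute nothing, which removes the bias toward one-word segments seen in the raw segment counts.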
@olastor thanks for that. Pretty much what I was after. I didn't realise how tiny some of the segments are.
I was also under the impression that the segments had been divided in such a way that they were repeated for the purpose of quick translation. Maybe I'm wrong.
My idea was to create Anki flash cards, and later a game/app thing, aimed at memorisation of and familiarity with the most frequently occurring Pali words and phrases. I was inspired to do this with audio by a friend who has dyslexia but wishes to understand Pali. Additionally, I am an auditory learner and have been thinking about "Duolingo meets Pali" for some time. The reason I was interested in segments was because it seems this is what SC-voice uses.
My idea was to find the segments I want (maybe they aren't quite the most frequent; this was a starting point), get their SC-voice files, and then have either the whole segment or a section of it played along with its English (or whatever language) translation.
In reality, this may be way above my coding skill level, but I'm full of ideas and like to tinker, so I thought I'd give it a go. I've already identified a few areas where the app would need to differ from Duolingo, and areas which I really like with regard to how they naturally teach pronunciation, case and gender.
I'd be super keen for collaborators if anyone would like to investigate this further with me. For a bit of background, I come from a graphic design and web design background and have been building custom websites (HTML, CSS, PHP, JS) for about 20 years.
You know what I really struggle with: sandhi. There's gotta be a fun video game for untangling combined vowels…
In WA the monastics have a rule: if you come up with the idea… you are the one to do it. I'll add it to my brainstorming list though.