Sutta segments sorted by frequency?

Pasanna · May 20, 2022, 2:10am

I am wondering where to look to gather and sort data on the most frequent SC segments occurring in the suttas? Or even if someone has done this already.
Thanks

bagginstyrone · May 21, 2022, 10:07am

I do not think anybody has done it yet. However, it seems like a good approach.

josephzizys · May 21, 2022, 2:45pm

Is https://buddhanexus.net/ able to do this?

olastor · May 21, 2022, 6:14pm

I tried to quickly do that: SC frequencies - Google Drive

The CSV contains the segments from the bilara-data repository which occur more than once without further preprocessing (sorted descending by frequency). Not sure if that’s what was meant.
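For anyone wanting to reproduce a count like this, here is a minimal sketch, not necessarily how the CSV above was produced. It assumes each bilara-data JSON file is a flat object mapping a segment ID to its text; the function names and the directory path in the usage note are illustrative assumptions.

```python
import json
from collections import Counter
from pathlib import Path


def load_segment_maps(root_dir):
    """Yield one {segment_id: text} dict per JSON file under root_dir.

    Assumption: every file is a flat JSON object of segment ID -> text.
    """
    for path in sorted(Path(root_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield json.load(f)


def count_repeated_segments(segment_maps):
    """Return (text, count) pairs for segments seen more than once,
    sorted by descending frequency. No preprocessing beyond trimming
    leading and trailing whitespace."""
    counts = Counter()
    for segments in segment_maps:
        for text in segments.values():
            text = text.strip()
            if text:
                counts[text] += 1
    return [(text, n) for text, n in counts.most_common() if n > 1]
```

Usage would look something like `count_repeated_segments(load_segment_maps("bilara-data/root/pli/ms"))`, where the path is a guess at where a local clone keeps the Pali root texts.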

sujato · May 21, 2022, 10:58pm

Does this have any fuzziness? We’d probably want to filter out punctuation and capitalization at the very least. And maybe add a little fuzziness to the text matches.
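A rough sketch of what that filtering and fuzziness could look like, using only the Python standard library. The function names and the 0.9 threshold are arbitrary assumptions for illustration, and comparing every pair of segments this way would be slow on the full corpus.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and replace punctuation with spaces.

    Letters with diacritics are kept; NFC normalization makes composed
    and decomposed forms of the same letter compare equal.
    """
    text = unicodedata.normalize("NFC", text).lower()
    chars = []
    for ch in text:
        # Unicode categories starting with "P" are punctuation
        # (quotes, dashes, full stops, and so on).
        if unicodedata.category(ch).startswith("P"):
            chars.append(" ")
        else:
            chars.append(ch)
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(chars).split())


def similar(a, b, threshold=0.9):
    """True if the normalized texts have a similarity ratio >= threshold."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

Counting on `normalize(text)` instead of the raw text handles the punctuation and capitalization part; `similar` is only one possible notion of "a little fuzziness".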

But of course it really depends what we’re looking for. A problem inherent in this approach is that the choice of where to segment is somewhat arbitrary. So there will be more common strings inside the segments.

In the CSV you made, the number one is Āmantā, which really just tells us that a one-word sentence is more likely to be in a separate segment.

Snowbird · May 22, 2022, 2:56am

Would you be willing to talk about what you are interested in learning from that data? Just curious.

olastor · May 22, 2022, 9:18pm

I just took the raw segments / paragraphs and counted how often each one recurs. I agree that counting n-grams (word sequences) is probably more useful, whatever the purpose turns out to be. The folder should now be updated with frequencies of word sequences computed on lowercased text without punctuation (e.g. “freqs-3” contains the counts for 3-word sequences). Accounting for fuzziness would add quite some complexity, I think.
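A minimal sketch of n-gram counting along these lines, not the exact script behind the "freqs-N" files. The tokenizer (Unicode word characters via `\w+`) and the function name are assumptions; n-grams are counted within each segment and never across segment boundaries.

```python
import re
import unicodedata
from collections import Counter


def ngram_counts(texts, n):
    """Count n-word sequences over lowercased text with punctuation dropped."""
    counts = Counter()
    for text in texts:
        text = unicodedata.normalize("NFC", text).lower()
        # \w+ matches runs of Unicode letters, digits and underscores,
        # so punctuation acts as a word separator.
        words = re.findall(r"\w+", text)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts
```

For example, `ngram_counts(segments, 3).most_common(20)` would list the twenty most frequent 3-word sequences in a list of segment strings called `segments`.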

Pasanna · May 23, 2022, 1:19am

@olastor thanks for that. Pretty much what I was after. I didn’t realise the tininess of some of the segments.

I was also under the impression that the segments had been divided in such a way that they were repeated for the purpose of quick translation. Maybe I’m wrong.

My idea was to create Anki flash cards, and later a game/app thing, which would aim at memorisation and familiarity with Pali of the most frequently occurring words and phrases. I was inspired to do this with audio by a friend who has dyslexia but wishes to understand Pali. Additionally, I am an auditory learner and have been thinking about Duolingo meets Pali for some time. The reason I was interested in segments was because it seems this is what SC-voice uses.

My idea was to find the segments I want… maybe they aren’t quite the most frequent, this was a starting point… get their SC-voice files and then have either the whole or a section of that segment played along with its English (or whatever language).

In reality, this may be way above my coding skill level, but I’m full of ideas and like to tinker so I thought I’d give it a go. I’ve already identified a few areas where the app would need to differ from Duolingo and areas which I really like with regards to how they naturally teach pronunciation, case and gender.

I’d be super keen for collaborators if anyone would like to investigate this further with me. For a bit of background, I come from a graphic design and web design background and have been building custom websites (HTML, CSS, PHP, JS) for about 20 years.

Khemarato.bhikkhu · May 26, 2022, 1:00pm

You know what I really struggle with: sandhi. There’s gotta be a fun video game for untangling combined vowels…

Pasanna · May 26, 2022, 11:30pm

In WA the monastics have a rule: if you come up with the idea… you are the one to do it. I’ll add it to my brainstorming list though.