Want to know a neat trick to tell the gender of an Indian first name?
Easy: does it end in "I" or "A"? Then it's female.
This simple heuristic works well for most common names of women / men. How well does it really do?
🧵 Tricks with Indian names using fun stats!
The "AI heuristic" captures 78% of females correctly [recall] — misses Komal, Khushboo, Poonam and Sonam.
..but only gets it right 66.4% of the time [precision] — calls Ravi, Aditya, Jitendra, Krishna and Devendra female!
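Here's a minimal sketch of the heuristic and of how precision and recall are computed for it. The function names and the tiny sample are assumptions for illustration: the sample mixes a few names from above with some made-up rows, so the numbers it prints say nothing about the real dataset.

```python
def ends_in_a_or_i(name: str) -> str:
    """The "AI heuristic": guess female if the name ends in A or I."""
    return "F" if name.strip().upper().endswith(("A", "I")) else "M"


def precision_recall(rows, predict, positive="F"):
    """rows is a list of (name, true_gender) pairs."""
    tp = fp = fn = 0
    for name, truth in rows:
        guess = predict(name)
        if guess == positive and truth == positive:
            tp += 1   # called female, is female
        elif guess == positive:
            fp += 1   # called female, is male
        elif truth == positive:
            fn += 1   # called male, is female
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy sample (an assumption, NOT the real data).
sample = [
    ("Priya", "F"), ("Anjali", "F"), ("Komal", "F"), ("Poonam", "F"),
    ("Ravi", "M"), ("Krishna", "M"), ("Rahul", "M"), ("Amit", "M"),
]
p, r = precision_recall(sample, ends_in_a_or_i)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```

Recall drops because of Komal and Poonam (missed), precision drops because of Ravi and Krishna (wrongly called female).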
Is that.. good? How would a human do?
If >50% of people with a certain first name are male, they'd guess male.
This approach gets us a precision (P) of 96% with a recall (R) of 88%. The F1 score [which combines P and R] goes from 0.72 to 0.92!
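For anyone checking the arithmetic: F1 is the harmonic mean of precision and recall. A small sketch that plugs in the figures quoted above (the helper name `f1` is just a label chosen here):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1(0.664, 0.78), 2))  # "AI heuristic" -> 0.72
print(round(f1(0.96, 0.88), 2))   # majority guess -> 0.92
```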
[Fun] Most common name at every length!
What names are hardest for humans to classify?
Even the perfect algorithm struggles with unisex names that could belong to either boys or girls: Manpreet, Anshu, Suman, Soumya and Kiran.
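A rough sketch of the majority-guess baseline and of why unisex names cap it. The counts below are invented purely for illustration (they are not from the dataset), and `majority_guess` / `best_possible_accuracy` are hypothetical helper names.

```python
# (male_count, female_count) per first name. Made-up numbers, not real data.
counts = {
    "Rahul": (980, 20),
    "Priya": (10, 990),
    "Kiran": (520, 480),
}


def majority_guess(name: str) -> str:
    """Guess male if more than half the people with this name are male."""
    male, female = counts[name]
    return "M" if male > female else "F"


def best_possible_accuracy(name: str) -> float:
    """Share of people with this name that the majority guess gets right."""
    male, female = counts[name]
    return max(male, female) / (male + female)


for name in counts:
    print(name, majority_guess(name), f"{best_possible_accuracy(name):.0%}")
```

With these toy counts, a near 50/50 name like Kiran is right only about half the time no matter which way you guess.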
Can we improve our heuristic?
Let's try some things:
- If a name ends in a vowel, call it female
- Include "female" suffixes of len 2—CY, HY, OO, EE, BY, LY
- Blacklist male ones—RA, AI, WA, JI, VI
- Add in len 3 suffixes
This gets us up to an F1 of 0.78 [OpenAI had an F1 of ~0.5]!
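The rules above could be wired together roughly like this. It's a sketch under assumptions: the order in which rules win (longest suffix first, then the vowel fallback) is a guess, and the 3-letter suffix sets are left as empty, optional parameters because the exact ones used aren't listed here.

```python
VOWELS = set("AEIOU")
FEMALE_2 = {"CY", "HY", "OO", "EE", "BY", "LY"}  # "female" suffixes of len 2
MALE_2 = {"RA", "AI", "WA", "JI", "VI"}          # blacklisted male suffixes


def guess_gender(name, female_3=frozenset(), male_3=frozenset()):
    """Suffix heuristic. Assumed precedence: longest matching suffix wins."""
    n = name.strip().upper()
    if n[-3:] in male_3:
        return "M"
    if n[-3:] in female_3:
        return "F"
    if n[-2:] in MALE_2:
        return "M"
    if n[-2:] in FEMALE_2:
        return "F"
    # Fallback: a name ending in a vowel is called female.
    return "F" if n[-1:] in VOWELS else "M"


for name in ["Ravi", "Jitendra", "Khushboo", "Krishna", "Komal"]:
    print(name, guess_gender(name))
```

In this sketch Ravi and Jitendra flip to male and Khushboo to female, while Krishna and Komal are still misclassified, so it's an improvement, not a fix.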
Why are suffixes so telling for Indian names?
When you anglicize Indian languages, a letter + vowel often becomes 3 letters in English. If you group male / female names by suffix, many are shared!
Women—ITA, IKA, SHA, ANA, SHI
Men—ESH, DRA, EET, ISH, ANT
What about prefixes?
Many Indian names start with the same syllable too — here's a look at the most common 3-letter combinations that start male and female Indian names!
Both—PRA, SHA, SAN, SHI
Women—ANU, PRI, SHR, MAN, SON, NEE
Men—MAN, RAJ, HAR, PAR, ABH, SUB
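Counting these prefixes and suffixes takes only a few lines. A minimal sketch, assuming you already have separate lists of female and male names; the short lists below are illustrative stand-ins, and `top_affixes` is a hypothetical helper name.

```python
from collections import Counter


def top_affixes(names, n=3, where="suffix", k=5):
    """Return the k most common n-letter prefixes or suffixes."""
    if where == "suffix":
        affix = lambda s: s[-n:]
    else:
        affix = lambda s: s[:n]
    # Skip names shorter than n letters so short names don't count whole.
    counts = Counter(affix(name.upper()) for name in names if len(name) >= n)
    return counts.most_common(k)


# Illustrative stand-in lists, not the real dataset.
women = ["Priya", "Priyanka", "Anita", "Sunita", "Anusha"]
men = ["Rajesh", "Rakesh", "Rajiv", "Harish", "Suresh"]

print(top_affixes(women, where="prefix", k=1))  # [('PRI', 2)]
print(top_affixes(men, where="suffix", k=1))    # [('ESH', 3)]
```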
Turns out — it's not just Indian names! When I did a similar analysis on International names, I got 93% precision (but only 50% recall) with the "AI" heuristic.
In other words, if you assumed a person whose name ended in A or I was female, you'd be right 93% of the time!
What's the point of all this?
The problem of predicting gender from an Indian first name is:
- Clearly solvable, with labelled data
- A basic intro to ML
- Under-explored in literature
- Helps classify gender in other datasets!
- Fun data!
A simple heuristic does SO well!
The dataset was ~350k gender-labelled, non-unique Indian names, of which 75% were male. 40k of those were unique. That is after cleaning the data (removing abbreviated first names and conjoined first/last names), which drops ~10% of the names.
CSV to classify—
gist.github.com/deedy/dfe64d21faa4180aff64495dc1c76f52