name-variants

License: MIT

"Chan" is simultaneously 陳, 陈, 찬, and จัน; lookup() returns all of them.

Every romanization system produces a member of an equivalence class: no canonical form, no ordering dependency, no silent data loss. share_cluster("Hsu", "Xu") is True. lookup("Chan") returns a Chinese surname cluster and a Korean given-name cluster, sorted by bearer count.

Available as a Python library, CLI, pandas accessor, and Model Context Protocol (MCP) server.

pip install name-variants

The core idea

A NameCluster is a frozenset of co-equal representations. 陳, 陈, chen, chan, tan, chern are all members of the same Chinese surname cluster — none is more "real" than another. lookup() returns every cluster that contains your query, sorted by frequency:

from name_variants import lookup, share_cluster

clusters = lookup("Chan")
# [NameCluster(language='chinese', 8 forms),
#  NameCluster(language='korean_given', 3 forms)]

# Both Chinese scripts are in the same cluster — co-equal
assert "陈" in clusters[0]   # Simplified
assert "陳" in clusters[0]   # Traditional

# Membership is case-insensitive
assert "CHAN" in clusters[0]

# Ambiguity is surfaced, not suppressed
assert len(clusters) == 2    # Chinese AND Korean, not one-or-the-other

API

lookup() — all matching clusters

from name_variants import lookup

lookup("Chan")
# [NameCluster(language='chinese', 8 forms),
#  NameCluster(language='korean_given', 3 forms)]

lookup("Nguyen")
# [NameCluster(language='vietnamese', 4 forms)]

lookup("Smith")
# []

Results are sorted by frequency (approximate bearer count) in descending order, so the statistically most likely interpretation comes first.
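The ordering rule can be sketched in a few lines. The Cluster tuple and by_frequency helper below are hypothetical stand-ins, not the library's API. Placing clusters with no bearer count last is an assumption, since their position is not documented here.

```python
from typing import NamedTuple, Optional


class Cluster(NamedTuple):
    """Hypothetical stand-in for NameCluster, reduced to two fields."""
    language: str
    frequency: Optional[int]


def by_frequency(clusters):
    # Known counts first, largest first; clusters with no count go last
    # (an assumption about how None is handled).
    return sorted(clusters, key=lambda c: (c.frequency is None, -(c.frequency or 0)))


ranked = by_frequency([
    Cluster("korean_given", None),
    Cluster("chinese", 90_000_000),
])
print([c.language for c in ranked])  # ['chinese', 'korean_given']
```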

share_cluster() — equivalence check

from name_variants import share_cluster

share_cluster("Chan", "Chen")        # True  — same Chinese cluster
share_cluster("Chou", "Zhou")        # True  — Wade-Giles = Pinyin
share_cluster("Chiang", "Jiang")     # True  — Chiang Kai-shek / 蒋介石
share_cluster("Hsu", "Xu")           # True  — Taiwan diaspora romanization
share_cluster("Tsao", "Cao")         # True  — Ts'ao Ts'ao / 曹操
share_cluster("Chan", "Kim")         # False — different names
share_cluster("", "Chan")            # False — empty input
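One way to read these results: two names are equivalent when at least one cluster contains both. The sketch below illustrates that rule on toy data. TOY_CLUSTERS and share_cluster_sketch are hypothetical names, and the library's real implementation may differ.

```python
# Toy clusters for illustration only; the real tables are much larger.
TOY_CLUSTERS = [
    frozenset({"陳", "陈", "chen", "chan", "tan"}),
    frozenset({"찬", "chan", "chahn"}),
    frozenset({"김", "kim", "gim"}),
]


def share_cluster_sketch(a: str, b: str) -> bool:
    a, b = a.strip().casefold(), b.strip().casefold()
    if not a or not b:
        return False  # empty input never matches anything
    # True as soon as a single cluster holds both forms.
    return any(a in cluster and b in cluster for cluster in TOY_CLUSTERS)


print(share_cluster_sketch("Chan", "Chen"))  # True
print(share_cluster_sketch("Chan", "Kim"))   # False
print(share_cluster_sketch("", "Chan"))      # False
```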

dialect() — Chinese romanization system tag

from name_variants import dialect

dialect("chen")   # "mandarin_pinyin"
dialect("chan")   # "cantonese"
dialect("tan")    # "hokkien"
dialect("chou")   # "wade_giles"
dialect("hsu")    # "wade_giles"
dialect("陳")     # "traditional"
dialect("Smith")  # None

normalize() — text preprocessing

from name_variants import normalize

normalize("  NGUYỄN  ")                    # "nguyễn"
normalize("Nguyễn", strip_diacritics=True) # "nguyen"
normalize("ch\u200ban")                    # "chan" (zero-width space stripped)
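For readers who want to see what this kind of preprocessing involves, here is a minimal standard-library sketch. normalize_sketch is a hypothetical name, the list of zero-width characters is an assumption, and the library's own normalize() may handle more cases.

```python
import unicodedata

# Zero-width characters that often leak in through copy-paste (assumed list).
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def normalize_sketch(text: str, strip_diacritics: bool = False) -> str:
    # Delete zero-width characters, trim surrounding whitespace, lowercase.
    text = text.translate(_ZERO_WIDTH).strip().lower()
    if strip_diacritics:
        # Decompose, then drop every combining mark.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so equal strings compare equal code point by code point.
    return unicodedata.normalize("NFC", text)


print(normalize_sketch("Nguyễn", strip_diacritics=True))  # nguyen
print(normalize_sketch("ch\u200ban"))                      # chan
```

Letters such as đ carry no combining mark after decomposition, so this approach leaves them unchanged.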

CLI

nv lookup Chan
# [chinese] (~90M bearers)
#   陈  陳  chan  chen  tan  ...
# [korean_given]
#   찬  chan  chahn

nv match Chan Chen          # true
nv match Chan Kim           # false
nv match --exit-code Chan Chen && echo same   # shell-scripting friendly

nv canonicalize-csv names.csv --col name --out out.csv
# adds {name}_canonical column

nv dedupe names.csv --col name --out out.csv
# adds cluster_id column grouping romanization variants

Pandas

pip install "name-variants[pandas]"

import pandas as pd
import name_variants  # registers .nv accessor

s = pd.Series(["Chan", "Chen", "Smith", "Park"])

s.nv.lookup()
# 0    [NameCluster(chinese, ...), NameCluster(korean_given, ...)]
# 1    [NameCluster(chinese, ...)]
# 2    []
# 3    [NameCluster(korean, ...)]

s.nv.cluster_id()
# 0    a3f2b1c4d5e6   ← same as row 1 (Chan and Chen share chinese cluster)
# 1    a3f2b1c4d5e6
# 2                   ← empty string for unknown
# 3    9b8c7d6e5f4a

a = pd.Series(["Chan", "Park"])
b = pd.Series(["Chen", "Bak"])
a.nv.share_cluster_with(b)   # [True, True]
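The cluster_id() values shown earlier in this section are short hex strings that are identical for names in the same cluster. How the library derives them is not documented here. The sketch below shows one plausible scheme (a truncated hash of the language and the sorted forms) and is an assumption rather than the actual algorithm.

```python
import hashlib


def cluster_id_sketch(language: str, forms: frozenset) -> str:
    # Sort the forms so the ID does not depend on set iteration order.
    payload = language + "|" + "|".join(sorted(forms))
    # Truncate to 12 hex characters, matching the length of the IDs shown above.
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


chinese = frozenset({"陳", "陈", "chen", "chan", "tan"})
korean = frozenset({"박", "park", "bak"})

# "Chan" and "Chen" resolve to the same cluster, so they receive the same ID.
print(cluster_id_sketch("chinese", chinese) == cluster_id_sketch("chinese", chinese))  # True
print(cluster_id_sketch("chinese", chinese) == cluster_id_sketch("korean", korean))    # False
```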

MCP server (Model Context Protocol)

name-variants ships a built-in Model Context Protocol server, exposing name lookup as MCP tools that any MCP-compatible AI client (Claude Desktop, Claude Code, Cursor, etc.) can call directly.

Claude Code:

claude mcp add name-variants -- uvx --from "name-variants[mcp]" nv-mcp

Claude Desktop — add to claude_desktop_config.json:

{
  "mcpServers": {
    "name-variants": {
      "command": "uvx",
      "args": ["--from", "name-variants[mcp]", "nv-mcp"]
    }
  }
}

Three MCP tools are exposed:

| Tool | Arguments | Returns |
| --- | --- | --- |
| `lookup` | `text: str` | list of `{language, forms[], frequency}` clusters |
| `share_cluster` | `a: str, b: str` | `true` / `false` |
| `dialect` | `text: str` | romanization system string or `null` |
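Because a frozenset is not JSON-serializable, a cluster has to be converted before it can travel over MCP. The sketch below shows one way to produce the `{language, forms[], frequency}` shape listed for `lookup`. cluster_to_json is a hypothetical helper, and sorting the forms is an assumption made so the output is deterministic.

```python
import json


def cluster_to_json(language, forms, frequency):
    # Sets have no stable order, so emit the forms as a sorted list.
    return {"language": language, "forms": sorted(forms), "frequency": frequency}


payload = json.dumps(
    [cluster_to_json("korean_given", frozenset({"찬", "chan", "chahn"}), None)],
    ensure_ascii=False,  # keep Hangul readable instead of \u escapes
)
print(payload)
# [{"language": "korean_given", "forms": ["chahn", "chan", "찬"], "frequency": null}]
```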

Language tables

| Language | Entries | Coverage |
| --- | --- | --- |
| `chinese` | 140 | Pinyin + Wade-Giles + Cantonese + Hokkien + Hakka + Teochew + Traditional |
| `japanese` | 143 | Hepburn + macron variants |
| `korean` | 100 | Revised Romanization + McCune-Reischauer |
| `arabic` | 92 | Multiple transliteration systems |
| `vietnamese` | 84 | Diacritics + stripped forms |
| `russian` | 79 | Multiple transliteration systems |
| `indonesian_malay` | 77 | - |
| `persian` | 80 | - |
| `indian_hindi` | 80 | - |
| `hebrew` | 75 | - |
| `turkish` | 74 | Dotted-İ variants |
| `greek` | 60 | - |
| `thai` | 68 | - |
| `indian_bengali` | 56 | - |
| `indian_tamil` | 53 | - |
| `chinese_given` | 120 | Common given-name characters with Pinyin |
| `korean_given` | 70 | Common given-name syllables |
| `japanese_given` | 107 | Common given-name kanji |

from name_variants import ALL_TABLES
list(ALL_TABLES.keys())   # all 18 table names

Chinese romanization systems

| System | Examples |
| --- | --- |
| Mandarin Pinyin | Zhou, Zhang, Wang, Xu |
| Wade-Giles | Chou, Chang, Wang, Hsu, Tsao, Kuo, Hsieh |
| Cantonese (Jyutping/Yale) | Chan, Wong, Ng, Lam, Tsui |
| Hokkien/Min Nan | Tan, Ng, Lim, Goh |
| Hakka | Fong, Thong |
| Teochew | Teo, Ng |
| Postal romanization | Peking, Nanking, Chungking |
| Traditional characters | 陳, 劉, 張, 楊, 趙 |

NameCluster reference

from dataclasses import dataclass

@dataclass(frozen=True)
class NameCluster:
    forms: frozenset[str]    # all representations, co-equal
    language: str            # "chinese", "korean", "vietnamese", etc.
    frequency: int | None    # approximate global bearer count

    def __contains__(self, text: str) -> bool: ...  # case-insensitive
    def __iter__(self): ...                         # iterate all forms
    def __len__(self) -> int: ...
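The reference above lists signatures only. As an illustration of how such a class could behave, here is a runnable sketch with the method bodies filled in. ClusterSketch is a hypothetical name, and case-folding both sides in __contains__ is an assumption about how the case-insensitive membership test works.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ClusterSketch:
    forms: frozenset[str]
    language: str
    frequency: int | None = None

    def __contains__(self, text: str) -> bool:
        # Compare case-folded strings so "CHAN" matches "chan".
        return text.casefold() in {form.casefold() for form in self.forms}

    def __iter__(self) -> Iterator[str]:
        return iter(self.forms)

    def __len__(self) -> int:
        return len(self.forms)


cluster = ClusterSketch(frozenset({"찬", "chan", "chahn"}), "korean_given")
print("CHAN" in cluster)  # True
print(len(cluster))       # 3
```

Because the dataclass is frozen and every field is hashable, instances can themselves be stored in sets or used as dict keys.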

Why equivalence classes instead of a canonical key?

A canonical-key model forces a false choice: "Chan" must map to either 陈 or 찬, not both. Table ordering becomes load-bearing — whichever table is consulted last wins. Romanizations must be stripped from given-name tables to prevent collisions.

The NameCluster model eliminates this: every romanization system's output is just another member of a frozenset. lookup() returns all matching clusters. Ambiguity is surfaced, not suppressed. The most likely interpretation comes first by frequency.
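The difference between the two models shows up on a two-table toy dataset. In the sketch below, TABLES, canonical, and index are hypothetical names used only for this illustration, not the library's internals.

```python
from collections import defaultdict

# (language, forms) pairs in the order the tables would be loaded.
TABLES = [
    ("chinese", frozenset({"陳", "陈", "chen", "chan", "tan"})),
    ("korean_given", frozenset({"찬", "chan", "chahn"})),
]

# Canonical-key model: one slot per romanization, so a later table overwrites an earlier one.
canonical = {}
for language, forms in TABLES:
    for form in forms:
        canonical[form] = language

# Equivalence-class model: each form maps to every cluster that contains it.
index = defaultdict(list)
for language, forms in TABLES:
    for form in forms:
        index[form].append(language)

print(canonical["chan"])  # korean_given (the Chinese reading was silently lost)
print(index["chan"])      # ['chinese', 'korean_given']
```

Swapping the order of TABLES changes the canonical-key answer for "chan" but only reorders the list in the equivalence-class index.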

Contributing

git clone https://github.com/SecurityRonin/name-variants
cd name-variants
pip install -e ".[dev]"
pytest

Data files are in name_variants/*_names.py and name_variants/*_surnames.py. Each entry is a plain Python dict — easy to read and edit:

"陈": {
    "forms": ["陳", "chen", "chan", "tan", ...],
    "frequency": 90_000_000,
    "dialects": {
        "chen": "mandarin_pinyin",
        "chan": "cantonese",
        "tan":  "hokkien",
        "陳":   "traditional",
    },
},

Adding a new variant is one edit to one entry — forms, frequency, and dialect tag colocated.
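To make the entry layout concrete, the sketch below turns one such dict into a set of forms and a dialect lookup. build_forms and dialect_of are hypothetical helpers, the forms list is shortened for illustration, and treating the dict key as a member of the cluster is an assumption based on 陈 appearing in the lookup results.

```python
ENTRY_KEY = "陈"
ENTRY = {
    "forms": ["陳", "chen", "chan", "tan"],  # shortened for illustration
    "frequency": 90_000_000,
    "dialects": {
        "chen": "mandarin_pinyin",
        "chan": "cantonese",
        "tan": "hokkien",
        "陳": "traditional",
    },
}


def build_forms(key, entry):
    # Assumption: the dict key is itself one of the co-equal forms.
    return frozenset([key, *entry["forms"]])


def dialect_of(text, entry):
    # Returns None for forms that carry no dialect tag.
    return entry["dialects"].get(text.casefold())


forms = build_forms(ENTRY_KEY, ENTRY)
print(len(forms))                # 5
print(dialect_of("Chan", ENTRY))  # cantonese
```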
