Sep 2, 2024

What's in a Name, Part 1 - Introduction

What information is encoded in a name, and how does HumanGraphics decode it? This introduction to the series lays out how HumanGraphics solves "The Name Problem" to create value for users.

A rose by any other name would smell as sweet.
- Juliet, from Romeo and Juliet, by William Shakespeare

Billy ain’t wrong, but I’m guessing he never had to plan and execute a mission-critical sales campaign on a budget and a deadline, either. It turns out there’s a lot of useful information in leads' and prospects' names — if you know where to look.

In this blog series, we'll explore what a name can reveal about a person, and how HumanGraphics goes about making that data available and usable. This introductory post explains why name parsing is hard, and why solving it matters for your business.

The Name Problem

There is an enormous amount of demographic information contained in names. However, it is encoded in a truly mind-boggling variety of naming traditions from around the world. For example, did you know:

The majority of Indonesians do not have family names. Rather, their given names are geographically and culturally specific.¹
In traditional Lithuanian culture, the ending of a woman's surname indicates whether she is married or not. Last names of married women end in -ienė while those of unmarried girls end in -ytė, -iūtė, -utė, -aitė.²
Many Portuguese surnames may be preceded by of/from (de, d') or of the/from the (do, da, dos, das) as in de Sousa, da Costa, d'Oliveira. Those elements are not part of the surname, and as such the names Sousa and de Sousa are considered equivalent.³

And these are only a few of the quirks of global naming traditions!
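To make two of these quirks concrete, here is a minimal Python sketch. It is an illustration only, not how HumanGraphics implements anything: the function names are hypothetical, the suffix and particle lists are copied straight from the examples above and are not exhaustive, and a real system would need far more cultural context than a string match.

```python
import unicodedata
from typing import Optional


def _nfc(text: str) -> str:
    """Normalize to NFC so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text)


# Suffixes from the Lithuanian example above (illustrative, not exhaustive).
MARRIED_SUFFIXES = tuple(_nfc(s) for s in ("ienė",))
UNMARRIED_SUFFIXES = tuple(_nfc(s) for s in ("ytė", "iūtė", "utė", "aitė"))

# Particles from the Portuguese example above.
PORTUGUESE_PARTICLES = {"de", "do", "da", "dos", "das"}


def lithuanian_marital_hint(surname: str) -> Optional[str]:
    """Return a marital-status hint based on the surname ending, or None."""
    name = _nfc(surname).strip().lower()
    if name.endswith(MARRIED_SUFFIXES):
        return "married"
    if name.endswith(UNMARRIED_SUFFIXES):
        return "unmarried"
    return None


def portuguese_surname_key(surname: str) -> str:
    """Drop leading particles so that 'de Sousa' and 'Sousa' compare equal."""
    name = _nfc(surname).strip().lower()
    # Handle the elided form d' (straight or typographic apostrophe).
    if name.startswith(("d'", "d\u2019")):
        name = name[2:]
    tokens = name.split()
    # Strip particles, but never strip the last remaining token.
    while len(tokens) > 1 and tokens[0] in PORTUGUESE_PARTICLES:
        tokens = tokens[1:]
    return " ".join(tokens)


print(lithuanian_marital_hint("Kazlauskienė"))
print(portuguese_surname_key("de Sousa") == portuguese_surname_key("Sousa"))
```

Even this toy version needs Unicode normalization and special cases, and it only covers two traditions out of hundreds.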

Just cataloguing what demographic facts are hidden within names and creating a standardized representation for them requires a systematic survey of the available demographic facts followed by an exercise in data design. Decoding these demographic facts and extracting them from names requires recognition of relevant names and understanding of corresponding cultural nuances. Then actually building a system to expose all of these data is an enormous engineering task.

The HumanGraphics Solution

But while the problem is hard, the solution is invaluable. For instance:

In marketing applications, enriching name-containing records with demographic facts, without having to wait for browsing or purchasing behavior, unlocks critical marketing infrastructure such as segmentation and personalization while you are still qualifying leads and courting prospects. That can increase conversion dramatically in any industry.
In machine learning applications, demographics are valuable in any model, but are particularly important in specific use cases, like cold-starting recommendation engines for new users.
In regulatory and governance applications, latent demographic facts can be used in after-the-fact analysis to demonstrate that business processes and models in regulated industries are fair, which is an invaluable tool for working with regulators effectively.

HumanGraphics' name engine makes all this information and intelligence available in one affordable, easy-to-use platform. It takes names (and other data such as location and a headshot, if available) and produces country and gender estimates for all markets, and additional age and race estimates for a subset of markets, namely the US.

The name engine uses a proprietary reverse-template statistical parser implementation to "parse" (break down) names into labeled, constituent parts, such as first name and last name. The parser then uses a combination of templates and HumanGraphics' global names dataset to determine the most likely parse for each name.

The Templates

Different cultures use different formats for writing down names. For instance:

Western Tradition: Joe Biden (<GIVEN_NAME> <FAMILY_NAME>)
Eastern Tradition: Abe Shinzō (<FAMILY_NAME> <GIVEN_NAME>)

HumanGraphics has more than 150 templates for parsing and formatting names, with more being added regularly. These templates are the result of substantial research and allow HumanGraphics to parse names quickly and accurately.

Unfortunately, within a writing system, there are typically no syntactic hints about which name format is being used. So how does one determine if a name written in Japanese is in western order or eastern order? Data! Inferring name order correctly requires an enormous catalog of various name components to determine whether a given name token is a forename or a surname.

Therefore, a name parser can be no better than the dataset backing it, which is why HumanGraphics' global dataset is such an important differentiator.
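The interplay between templates and data can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not HumanGraphics' proprietary parser: the two templates are the ones shown above, the token counts are invented for the example, and the scoring rule (a product of smoothed per-token role probabilities) is one simple choice among many.

```python
import unicodedata
from typing import Dict, List, Tuple

# The two templates from the examples above.
TEMPLATES: List[Tuple[str, ...]] = [
    ("GIVEN_NAME", "FAMILY_NAME"),  # western order
    ("FAMILY_NAME", "GIVEN_NAME"),  # eastern order
]


def _key(token: str) -> str:
    """Normalize a token for lookup (Unicode NFC, case-insensitive)."""
    return unicodedata.normalize("NFC", token).casefold()


# Hypothetical counts of how often each token is seen in each role.
# These numbers are made up; a real dataset would be vastly larger.
_RAW_COUNTS: Dict[str, Dict[str, int]] = {
    "Joe": {"GIVEN_NAME": 980, "FAMILY_NAME": 20},
    "Biden": {"GIVEN_NAME": 5, "FAMILY_NAME": 995},
    "Abe": {"GIVEN_NAME": 300, "FAMILY_NAME": 700},
    "Shinzō": {"GIVEN_NAME": 990, "FAMILY_NAME": 10},
}
TOKEN_COUNTS = {_key(token): counts for token, counts in _RAW_COUNTS.items()}


def role_probability(token: str, role: str) -> float:
    """Estimate P(role | token) with add-one smoothing over the two roles."""
    counts = TOKEN_COUNTS.get(_key(token), {})
    total = sum(counts.values())
    return (counts.get(role, 0) + 1) / (total + 2)


def best_parse(name: str) -> Dict[str, str]:
    """Score every template of matching length and return the likeliest parse."""
    tokens = name.split()
    best: Dict[str, str] = {}
    best_score = -1.0
    for template in TEMPLATES:
        if len(template) != len(tokens):
            continue
        score = 1.0
        for token, role in zip(tokens, template):
            score *= role_probability(token, role)
        if score > best_score:
            best_score = score
            best = dict(zip(template, tokens))
    if not best:
        raise ValueError(f"no template matches {name!r}")
    return best


print(best_parse("Joe Biden"))
print(best_parse("Abe Shinzō"))
```

With these made-up counts, the sketch labels "Abe" as the family name whether the input is "Abe Shinzō" or "Shinzō Abe". An unknown token falls back to a 50/50 guess, which is exactly why coverage of the backing dataset matters so much.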

The Dataset

No one name dataset is broad or deep enough to power high-quality name matching. Even extremely high-quality datasets like the US Census are inadequate for reasons of data quality and scope. For example:

The U.S. Census removes all spaces from surnames, which makes culturally distinct and meaningfully different surnames like "de la Cruz" (predominantly Hispanic), "dela Cruz" (Filipino), and "Delacruz" (anglicized) indistinguishable in census data.⁴ And even if it were perfect, it would still only cover the USA!
Names extracted from social media data offer important real-world behavioral signals for naming, but in practice they do not provide structured demographic data.
Other kinds of publicly available data are often the result of surveys or messy database dumps, which inevitably introduce errors and thus require corroboration.
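The census example above is easy to reproduce. The short Python sketch below uses a hypothetical `census_style_key` function that removes all whitespace, as described above; the upper-casing is an added assumption, included only so that the comparison ignores case.

```python
def census_style_key(surname: str) -> str:
    """Remove all whitespace from a surname; upper-casing is an assumption."""
    return "".join(surname.split()).upper()


variants = ["de la Cruz", "dela Cruz", "Delacruz"]
keys = {census_style_key(variant) for variant in variants}

# All three culturally distinct spellings collapse into a single key.
print(keys)
```

Once the spacing is gone, no amount of downstream processing can recover which of the three forms a given record originally used.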

This is why HumanGraphics' global name dataset is a hand-crafted combination of multiple datasets. Containing specific data about more than 2 billion people and statistical data about more than 7 billion people in more than 40 writing systems, it is one of the largest of its kind in the world. Combining data from multiple official and behavioral sources allows HumanGraphics' name matches to be the most accurate in the market today.

Up Next - The Anatomy of a Name

Now that the nature of the name problem and how HumanGraphics approaches the solution are clearer at a high level, it's time to dive deeper. The next post in the series will cover the anatomy of names, or how HumanGraphics represents names from across the world. Thanks for reading!