What information is encoded in a name, and how does HumanGraphics decode it? This introduction to the series lays out how HumanGraphics solves "The Name Problem" to create value for users.
A rose by any other name would smell as sweet.
- Juliet, from Romeo and Juliet, by William Shakespeare
Billy ain’t wrong, but I’m guessing he never had to plan and execute a mission-critical sales campaign on a budget and a deadline, either. It turns out there’s a lot of useful information in leads' and prospects' names — if you know where to look.
In this blog series, we'll explore what a name can reveal about a person, and how HumanGraphics goes about making that data available and usable. This introductory post explains why name parsing is hard, and why solving it matters for your business.
There is an enormous amount of demographic information contained in names. However, it is encoded in a truly mind-boggling variety of naming traditions from around the world. For example, did you know:
And these are only a few quirks of global naming traditions!
Just cataloguing the demographic facts hidden within names and creating a standardized representation for them requires a systematic survey followed by an exercise in data design. Decoding and extracting these facts requires recognizing the relevant names and understanding the corresponding cultural nuances. Actually building a system to expose all of this data is then an enormous engineering task.
But while the problem is hard, the solution is invaluable. For instance:
HumanGraphics' name engine makes all this information and intelligence available in one affordable, easy-to-use platform. It takes names (and other data such as location and a headshot, if available) and produces country and gender estimates for all markets, and additional age and race estimates for a subset of markets, namely the US.
The name engine uses a proprietary reverse-template statistical parser to "parse" (break down) names into labeled constituent parts, such as first name and last name. The parser uses a combination of templates and HumanGraphics' global names dataset to determine the most likely parse for each name.
Different cultures use different formats for writing down names. For instance:
HumanGraphics has more than 150 templates for parsing and formatting names, with more being added regularly. These templates are the result of substantial research and allow HumanGraphics to parse names quickly and accurately.
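To make the idea of a template concrete, here is a minimal sketch in Python. It assumes a template is nothing more than an ordered list of part labels applied to whitespace-separated tokens. The template names, the labels, and both functions are hypothetical illustrations, not HumanGraphics' actual templates or API.

```python
# Hypothetical templates: each one is an ordered tuple of part labels.
# These names and labels are assumptions made for illustration only.
TEMPLATES = {
    "given_family": ("given", "family"),
    "family_given": ("family", "given"),
    "given_middle_family": ("given", "middle", "family"),
}


def parse_with_template(name, template_id):
    """Label the tokens of a name using one template.

    Returns None when the token count does not match the template.
    """
    tokens = name.split()
    labels = TEMPLATES[template_id]
    if len(tokens) != len(labels):
        return None
    return dict(zip(labels, tokens))


def format_with_template(parts, template_id):
    """Write labeled parts back out in the order the template prescribes."""
    labels = TEMPLATES[template_id]
    return " ".join(parts[label] for label in labels)


parsed = parse_with_template("Akira Kurosawa", "given_family")
print(parsed)
print(format_with_template(parsed, "family_given"))
```

The same template serves both directions: it labels tokens when parsing and orders labeled parts when formatting. What this sketch cannot do is decide which template applies to a given name. That choice needs data.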
Unfortunately, within a writing system, there are typically no syntactic hints about which name format is being used. So how does one determine if a name written in Japanese is in western order or eastern order? Data! Inferring name order correctly requires an enormous catalog of various name components to determine whether a given name token is a forename or a surname.
Therefore, a name parser can be no better than the dataset backing it, which is why HumanGraphics' global dataset is such an important differentiator.
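As a rough illustration of how data can settle name order, the sketch below scores both possible orders of a two-token name against a tiny lookup table and keeps the more likely one. Everything here is an assumption for the sake of example: the counts are invented, the function names are hypothetical, and the romanized, whitespace-separated input sidesteps the tokenization that names written in Japanese script would need. It is not a description of HumanGraphics' parser.

```python
# Toy counts of how often each token appears as a given name vs. a family name.
# The numbers are invented for illustration only.
COUNTS = {
    "akira": {"given": 950, "family": 50},
    "kurosawa": {"given": 5, "family": 995},
}

# The two candidate orders for a two-token name.
ORDERS = [("given", "family"), ("family", "given")]


def role_probability(token, role):
    """Estimate P(role | token), with add-one smoothing so unseen tokens stay neutral."""
    counts = COUNTS.get(token.lower(), {})
    given = counts.get("given", 0) + 1
    family = counts.get("family", 0) + 1
    return (given if role == "given" else family) / (given + family)


def infer_order(name):
    """Return (labeled parts, confidence) for the most likely order of a two-token name."""
    tokens = name.split()
    if len(tokens) != 2:
        raise ValueError("this sketch only handles two-token names")
    scores = {}
    for order in ORDERS:
        score = 1.0
        for token, role in zip(tokens, order):
            score *= role_probability(token, role)
        scores[order] = score
    best = max(scores, key=scores.get)
    confidence = scores[best] / sum(scores.values())
    return dict(zip(best, tokens)), confidence


print(infer_order("Kurosawa Akira"))
print(infer_order("Akira Kurosawa"))
```

When neither token appears in the table, both orders score the same and the confidence falls to 0.5, which is the sketch's way of saying it is guessing. The wider the catalog of name components, the less often that happens.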
No single name dataset is broad or deep enough to power high-quality name matching. Even extremely high-quality datasets like the US Census are inadequate for reasons of data quality and scope. For example:
This is why HumanGraphics' global name dataset is a hand-crafted combination of multiple datasets. Containing specific data about more than 2 billion people and statistical data about more than 7 billion people in more than 40 writing systems, it is one of the largest of its kind in the world. Combining data from multiple official and behavioral sources allows HumanGraphics' name matches to be the most accurate in the market today.
Now that the nature of the name problem and how HumanGraphics approaches the solution are clearer at a high level, it's time to dive deeper. The next post in the series will cover the anatomy of names, or how HumanGraphics represents names from across the world. Thanks for reading!