See how the HumanGraphics engineering team processes GeoNames data with BigQuery in this ETL example
At HumanGraphics, we work with data every day to provide accurate and meaningful insights into online and social audience behavior. High-quality location data plays an important role in ensuring the reliability of our results. The excellent GeoNames dataset is one of our favorites to work with.
In this post from the engineering team, we’ll walk through a practical example of performing ETL (Extract, Transform, Load) on two data dumps from the GeoNames dataset, allCountries and alternateNamesV2, using Google Cloud and BigQuery. Whether you’re working with location data for analytics, integrating it into an API, or managing large datasets, this guide offers a straightforward approach to handling and transforming that data.
When loading data into a new system, the goal is to balance retaining the original structure of the data against making it idiomatic and usable in the target system. In this case, that involved standardizations like:
- Converting empty strings to NULL
- Splitting comma-separated values into arrays
- Converting 0/1 integer flags into booleans
- Parsing date strings into proper DATE values
- Using snake_case column names, and renaming columns that collide with SQL reserved words (e.g., from to from_date, and to to to_date)
These standardizations help ensure the data is both easy to use and remains faithful to its original intent.
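To make these concrete, here is a minimal illustrative query run against literal values (not the real tables), showing the BigQuery functions that the queries later in this post rely on for each standardization:
bq query --use_legacy_sql=false <<'SQL'
SELECT
  NULLIF('', '') AS empty_string_to_null,                       -- '' becomes NULL
  SPLIT('en,fr,de', ',') AS comma_list_to_array,                -- STRING becomes ARRAY<STRING>
  COALESCE(CAST(NULL AS INT64), 0) = 1 AS int_flag_to_bool,     -- 0/1 flag becomes BOOL
  SAFE.PARSE_DATE('%Y-%m-%d', '2012-01-17') AS string_to_date   -- date string becomes DATE
SQL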
First, download the current allCountries.zip data dump and unzip.
curl -O http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip
This creates a file, allCountries.txt, in the current directory.
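For a quick sanity check before uploading, standard shell tools can show the row count and the first record (the exact count grows over time as GeoNames is updated):
wc -l allCountries.txt          # one row per geographical point, several million in total
head -1 allCountries.txt        # first record, tab-separated, 19 columns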
Upload this file to GCS. (Be sure to replace the below placeholders for bucket name and upload path with your own values.)
gsutil cp allCountries.txt gs://bucket/path/to/allCountries.txt
The file uses the following schema. Place this into a file in the current directory named allCountries-schema.json.
[
{
"name": "geoname_id",
"type": "INTEGER",
"mode": "REQUIRED",
"description": "Integer ID of record in GeoNames database"
},
{
"name": "name",
"type": "STRING",
"mode": "REQUIRED",
"description": "Name of geographical point (UTF-8)"
},
{
"name": "ascii_name",
"type": "STRING",
"mode": "NULLABLE",
"description": "Name of geographical point in plain ASCII characters"
},
{
"name": "alternate_names",
"type": "STRING",
"mode": "NULLABLE",
"description": "Comma-separated alternate names; ASCII names automatically transliterated"
},
{
"name": "latitude",
"type": "FLOAT",
"mode": "REQUIRED",
"description": "Latitude in decimal degrees (WGS84)"
},
{
"name": "longitude",
"type": "FLOAT",
"mode": "REQUIRED",
"description": "Longitude in decimal degrees (WGS84)"
},
{
"name": "feature_class",
"type": "STRING",
"mode": "NULLABLE",
"description": "Feature class (1 character, see GeoNames codes)"
},
{
"name": "feature_code",
"type": "STRING",
"mode": "NULLABLE",
"description": "Feature code (see GeoNames codes)"
},
{
"name": "country_code",
"type": "STRING",
"mode": "NULLABLE",
"description": "ISO-3166 2-letter country code"
},
{
"name": "alternate_country_codes",
"type": "STRING",
"mode": "NULLABLE",
"description": "Alternate country codes (comma-separated ISO-3166 2-letter country codes)"
},
{
"name": "admin1_code",
"type": "STRING",
"mode": "NULLABLE",
"description": "Admin1 code (e.g., state/province, FIPS or ISO)"
},
{
"name": "admin2_code",
"type": "STRING",
"mode": "NULLABLE",
"description": "Admin2 code (e.g., county)"
},
{
"name": "admin3_code",
"type": "STRING",
"mode": "NULLABLE",
"description": "Admin3 code (e.g., municipality)"
},
{
"name": "admin4_code",
"type": "STRING",
"mode": "NULLABLE",
"description": "Admin4 code (e.g., locality)"
},
{
"name": "population",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "Population"
},
{
"name": "elevation",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "Elevation in meters"
},
{
"name": "dem",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "Digital elevation model (average elevation in meters)"
},
{
"name": "timezone",
"type": "STRING",
"mode": "NULLABLE",
"description": "IANA timezone ID"
},
{
"name": "modification_date",
"type": "DATE",
"mode": "NULLABLE",
"description": "Date of last modification (YYYY-MM-DD)"
}
]
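Optionally, confirm that the schema lines up with the file before loading. This check assumes jq is installed; head and awk are standard:
jq length allCountries-schema.json                  # 19 fields in the schema
head -1 allCountries.txt | awk -F'\t' '{print NF}'  # 19 tab-separated columns in the data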
Next, create a new table in BigQuery from the data in GCS and the given schema. This is a staging table rather than the final table, so name it accordingly. (You may also want to replace the dataset and table names.)
bq load \
--source_format=CSV \
--field_delimiter=$'\t' \
--quote='' \
--schema=allCountries-schema.json \
etl.allcountries_load \
gs://bucket/path/to/allCountries.txt
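Once the load job completes, a quick row count confirms the staging table is populated (the exact number depends on the current GeoNames dump):
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM `etl.allcountries_load`'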
Finally, we'll perform ETL of the original data using a SQL query, and store the transformed data in its final destination! (Again, you may also want to replace the dataset and table names.)
bq query \
--use_legacy_sql=false \
--destination_table=geonames.allcountries \
--allow_large_results <<'SQL'
SELECT
geoname_id,
name,
NULLIF(ascii_name, '') AS ascii_name,
SPLIT(NULLIF(alternate_names, ''), ',') AS alternate_names,
latitude,
longitude,
feature_class,
feature_code,
country_code,
SPLIT(NULLIF(alternate_country_codes, ''), ',') AS alternate_country_codes,
admin1_code,
admin2_code,
admin3_code,
admin4_code,
population,
elevation,
dem,
timezone,
modification_date
FROM
`etl.allcountries_load`
SQL
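As a quick spot check that the comma-separated fields became proper arrays, the example query below (illustrative only) lists the records with the most alternate names:
bq query --use_legacy_sql=false <<'SQL'
SELECT name, country_code, ARRAY_LENGTH(alternate_names) AS alternate_name_count
FROM `geonames.allcountries`
ORDER BY alternate_name_count DESC
LIMIT 5
SQL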
The data is now available in BigQuery at geonames.allcountries, or wherever you chose to put it. The GCS file gs://bucket/path/to/allCountries.txt and the BigQuery table etl.allcountries_load can now be deleted, if desired.
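If you do want to clean up, the corresponding commands are below; they are destructive, so double-check the bucket path and table name first:
gsutil rm gs://bucket/path/to/allCountries.txt
bq rm -f -t etl.allcountries_load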
First, download the current alternateNamesV2.zip data dump and unzip.
curl -O http://download.geonames.org/export/dump/alternateNamesV2.zip
unzip alternateNamesV2.zip
This creates a file, alternateNamesV2.txt, in the current directory.
Upload this file to GCS. (Be sure to replace the below placeholders for bucket name and upload path with your own values.)
gsutil cp alternateNamesV2.txt gs://bucket/path/to/alternateNamesV2.txt
The file uses the following schema. Place this into a file in the current directory named alternateNamesV2-schema.json.
[
{
"name": "alternate_name_id",
"type": "INTEGER",
"mode": "REQUIRED",
"description": "The ID of this alternate name"
},
{
"name": "geoname_id",
"type": "INTEGER",
"mode": "REQUIRED",
"description": "GeonameId referring to ID in table 'geoname'"
},
{
"name": "isolanguage",
"type": "STRING",
"mode": "NULLABLE",
"description": "ISO 639 language code (2- or 3-characters), optionally with country code or variant, or pseudo codes (post, link, etc.)"
},
{
"name": "alternate_name",
"type": "STRING",
"mode": "NULLABLE",
"description": "Alternate name or name variant"
},
{
"name": "is_preferred_name",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "'1', if this alternate name is an official/preferred name"
},
{
"name": "is_short_name",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "'1', if this is a short name like 'California' for 'State of California'"
},
{
"name": "is_colloquial",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "'1', if this alternate name is a colloquial or slang term"
},
{
"name": "is_historic",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "'1', if this alternate name is historic and was used in the past"
},
{
"name": "from_date",
"type": "STRING",
"mode": "NULLABLE",
"description": "Period from when the name was used"
},
{
"name": "to_date",
"type": "STRING",
"mode": "NULLABLE",
"description": "Period to when the name was used"
}
]
Next, create a new table in BigQuery from the data in GCS and the given schema. This is a staging table rather than the final table, so name it accordingly. (You may also want to replace the dataset and table names.)
bq load \
--source_format=CSV \
--field_delimiter=$'\t' \
--quote='' \
--schema=alternateNamesV2-schema.json \
etl.alternatenamesv2_load \
gs://bucket/path/to/alternateNamesV2.txt
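If you'd like to verify that the staging table picked up the intended schema, bq can print it back out:
bq show --schema --format=prettyjson etl.alternatenamesv2_load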
Then run the following query:
bq query \
--use_legacy_sql=false \
--destination_table=geonames.alternatenamesv2 \
--allow_large_results <<'SQL'
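-- from_date and to_date arrive as strings; this helper converts the formats
-- handled below (YYYY, YYYYMMDD, YYYY-MM-DD, YYYY.MM.DD) to DATE and returns
-- NULL for anything else.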
CREATE TEMP FUNCTION PARSE_GEONAME_DATE(s STRING)
RETURNS DATE
AS (
CASE
WHEN s IS NULL THEN NULL
WHEN LENGTH(s) = 4 THEN DATE(SAFE_CAST(s AS INT64), 1, 1)
WHEN LENGTH(s) = 8 THEN SAFE.PARSE_DATE('%Y%m%d', s)
WHEN LENGTH(s) = 10 AND CONTAINS_SUBSTR(s, '-') THEN SAFE.PARSE_DATE('%Y-%m-%d', s)
WHEN LENGTH(s) = 10 AND CONTAINS_SUBSTR(s, '.') THEN SAFE.PARSE_DATE('%Y.%m.%d', s)
ELSE NULL
END
);
SELECT
alternate_name_id,
geoname_id,
NULLIF(isolanguage, '') AS isolanguage,
alternate_name,
COALESCE(is_preferred_name, 0)=1 AS is_preferred_name,
COALESCE(is_short_name, 0)=1 AS is_short_name,
COALESCE(is_colloquial, 0)=1 AS is_colloquial,
COALESCE(is_historic, 0)=1 AS is_historic,
PARSE_GEONAME_DATE(from_date) AS from_date,
PARSE_GEONAME_DATE(to_date) AS to_date
FROM
`etl.alternatenamesv2_load`
SQL
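As a quick spot check that the flag and date conversions behaved as expected (the exact counts depend on your dump), you can summarize the new columns:
bq query --use_legacy_sql=false <<'SQL'
SELECT
  COUNT(*) AS total_names,
  COUNTIF(is_historic) AS historic_names,
  COUNT(from_date) AS names_with_from_date
FROM `geonames.alternatenamesv2`
SQL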
The data is now available in BigQuery at geonames.alternatenamesv2, or wherever you chose to put it. The GCS file gs://bucket/path/to/alternateNamesV2.txt and the BigQuery table etl.alternatenamesv2_load can now be deleted, if desired.
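As a small example of putting the two tables to work together, here is an illustrative query (adjust the filters to taste) that pulls French-language names for populated places in France:
bq query --use_legacy_sql=false <<'SQL'
SELECT g.name, a.alternate_name
FROM `geonames.allcountries` AS g
JOIN `geonames.alternatenamesv2` AS a USING (geoname_id)
WHERE g.country_code = 'FR'
  AND g.feature_class = 'P'   -- populated places
  AND a.isolanguage = 'fr'
LIMIT 10
SQL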
Handling large, location-based datasets like GeoNames can seem challenging, but tools like Google Cloud and BigQuery make the process efficient and scalable. In this guide, we demonstrated how to perform ETL to transform raw data into a structured and usable format, ready for integration into APIs or analytics workflows.
If you’re working with similar datasets, this process provides a solid foundation for managing and enriching location data. But if you’d rather not do all this work yourself, the HumanGraphics API offers robust location data services to handle parsing, enrichment, and more. Let us take care of the heavy lifting so you can focus on building and analyzing!