Technology
Dec 17, 2024

Performing ETL on GeoNames Data with BigQuery

See how the HumanGraphics engineering team processes GeoNames data with BigQuery in this ETL example

At HumanGraphics, we work with data every day to provide accurate and meaningful insights into online and social audience behavior. High-quality location data plays an important role in ensuring the reliability of our results. The excellent GeoNames dataset is one of our favorites to work with.

In this post from the engineering team, we’ll walk through a practical example of performing ETL (Extract, Transform, Load) on two data dumps from the GeoNames dataset, allCountries and alternateNamesV2, using Google Cloud Storage and BigQuery. Whether you’re working with location data for analytics, integrating it into an API, or managing large datasets, this guide offers a straightforward approach to handling and transforming that data.

Approach to ETL

When loading data into a new system, the goal is to balance retaining the original structure of the data against making it idiomatic and usable in the target system. For GeoNames and BigQuery, that meant standardizations like:

  • Converting comma-separated lists into string arrays
  • Converting boolean encodings from “1 or empty” to “true or false”
  • Converting date strings into date types
  • Converting placeholder empty strings to NULL
  • Standardizing names for format and SQL compatibility (e.g., renaming the from column to from_date, since from is a reserved SQL keyword)

These standardizations help ensure the data is both easy to use and remains faithful to its original intent.
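
In BigQuery Standard SQL, each of these standardizations is a one-line expression. As a rough sketch (the table and column names here are placeholders; the real queries appear later in this post):

SELECT
  -- Comma-separated list to a string array
  SPLIT(NULLIF(alternate_names, ''), ',') AS alternate_names,
  -- "1 or empty" encoding to a real boolean
  COALESCE(is_preferred_name, 0) = 1 AS is_preferred_name,
  -- Date string to a DATE value (unparseable strings become NULL)
  SAFE.PARSE_DATE('%Y-%m-%d', some_date_string) AS some_date,
  -- Placeholder empty string to NULL
  NULLIF(ascii_name, '') AS ascii_name
FROM etl.some_raw_table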

The "allCountries" Dataset

1. Download the data from GeoNames

First, download the current allCountries.zip data dump and unzip it.

curl -O http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip

This creates a file, allCountries.txt, in the current directory.
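
Optionally, sanity-check the extracted file before uploading; it should contain millions of tab-separated rows, one per GeoNames feature.

wc -l allCountries.txt
head -3 allCountries.txt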

2. Upload the data to Google Cloud Storage

Upload this file to GCS. (Be sure to replace the placeholder bucket name and upload path below with your own values.)

gsutil cp allCountries.txt gs://bucket/path/to/allCountries.txt

3. Load the raw data into BigQuery

The file uses the following schema. Place this into a file in the current directory named allCountries-schema.json.

[
  {
    "name": "geoname_id",
    "type": "INTEGER",
    "mode": "REQUIRED",
    "description": "Integer ID of record in GeoNames database"
  },
  {
    "name": "name",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Name of geographical point (UTF-8)"
  },
  {
    "name": "ascii_name",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Name of geographical point in plain ASCII characters"
  },
  {
    "name": "alternate_names",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Comma-separated alternate names; ASCII names automatically transliterated"
  },
  {
    "name": "latitude",
    "type": "FLOAT",
    "mode": "REQUIRED",
    "description": "Latitude in decimal degrees (WGS84)"
  },
  {
    "name": "longitude",
    "type": "FLOAT",
    "mode": "REQUIRED",
    "description": "Longitude in decimal degrees (WGS84)"
  },
  {
    "name": "feature_class",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Feature class (1 character, see GeoNames codes)"
  },
  {
    "name": "feature_code",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Feature code (see GeoNames codes)"
  },
  {
    "name": "country_code",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "ISO-3166 2-letter country code"
  },
  {
    "name": "alternate_country_codes",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Alternate country codes (comma-separated ISO-3166 2-letter country codes)"
  },
  {
    "name": "admin1_code",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Admin1 code (e.g., state/province, FIPS or ISO)"
  },
  {
    "name": "admin2_code",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Admin2 code (e.g., county)"
  },
  {
    "name": "admin3_code",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Admin3 code (e.g., municipality)"
  },
  {
    "name": "admin4_code",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Admin4 code (e.g., locality)"
  },
  {
    "name": "population",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "Population"
  },
  {
    "name": "elevation",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "Elevation in meters"
  },
  {
    "name": "dem",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "Digital elevation model (average elevation in meters)"
  },
  {
    "name": "timezone",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "IANA timezone ID"
  },
  {
    "name": "modification_date",
    "type": "DATE",
    "mode": "NULLABLE",
    "description": "Date of last modification (YYYY-MM-DD)"
  }
]

Next, create a new table in BigQuery from the data in GCS and the given schema. This is only a staging table, not the final version, so name it accordingly. (You may also want to replace the dataset and table names.)
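
(The commands below assume the etl and geonames BigQuery datasets already exist. If yours don't, create them first.)

bq mk etl
bq mk geonames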

bq load \
  --source_format=CSV \
  --field_delimiter=$'\t' \
  --quote='' \
  --schema=allCountries-schema.json \
  etl.allcountries_load \
  gs://bucket/path/to/allCountries.txt
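
Once the load job completes, a quick row count is a good sanity check; it should match the line count of the source file. (The dataset and table names here match the load command above.)

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `etl.allcountries_load`'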

4. Transform and save the final data

Finally, we’ll transform the raw data with a SQL query and store the result in its final destination! (Again, you may also want to replace the dataset and table names.)

bq query \
  --use_legacy_sql=false \
  --destination_table=geonames.allcountries \
  --allow_large_results <<'SQL'
  SELECT
    geoname_id,
    name,
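    -- Empty strings become NULL; comma-separated lists become string arrays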
    NULLIF(ascii_name, '') AS ascii_name,
    SPLIT(NULLIF(alternate_names, ''), ',') AS alternate_names,
    latitude,
    longitude,
    feature_class,
    feature_code,
    country_code,
    SPLIT(NULLIF(alternate_country_codes, ''), ',') AS alternate_country_codes,
    admin1_code,
    admin2_code,
    admin3_code,
    admin4_code,
    population,
    elevation,
    dem,
    timezone,
    modification_date
  FROM
    `etl.allcountries_load`
SQL

The data is now available in BigQuery at geonames.allcountries, or wherever you chose to put it. The GCS file gs://bucket/path/to/allCountries.txt and the BigQuery table etl.allcountries_load can now be deleted, if desired.
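
To confirm everything landed as expected, you can query the final table directly, for example listing the most populous populated places (GeoNames feature class "P"):

bq query --use_legacy_sql=false \
  'SELECT geoname_id, name, country_code, population
   FROM `geonames.allcountries`
   WHERE feature_class = "P"
   ORDER BY population DESC
   LIMIT 10'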

The "alternateNamesV2" Dataset

1. Download the data from GeoNames

First, download the current alternateNamesV2.zip data dump and unzip it.

curl -O http://download.geonames.org/export/dump/alternateNamesV2.zip
unzip alternateNamesV2.zip

This creates a file, alternateNamesV2.txt, in the current directory.

2. Upload the data to Google Cloud Storage

Upload this file to GCS. (Be sure to replace the below placeholders for bucket name and upload path with your own values.)

gsutil cp alternateNamesV2.txt gs://bucket/path/to/alternateNamesV2.txt

3. Load the raw data into BigQuery

The file uses the following schema. Place this into a file in the current directory named alternateNamesV2-schema.json.

[
  {
    "name": "alternate_name_id",
    "type": "INTEGER",
    "mode": "REQUIRED",
    "description": "The ID of this alternate name"
  },
  {
    "name": "geoname_id",
    "type": "INTEGER",
    "mode": "REQUIRED",
    "description": "GeonameId referring to ID in table 'geoname'"
  },
  {
    "name": "isolanguage",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "ISO 639 language code (2- or 3-characters), optionally with country code or variant, or pseudo codes (post, link, etc.)"
  },
  {
    "name": "alternate_name",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Alternate name or name variant"
  },
  {
    "name": "is_preferred_name",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "'1', if this alternate name is an official/preferred name"
  },
  {
    "name": "is_short_name",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "'1', if this is a short name like 'California' for 'State of California'"
  },
  {
    "name": "is_colloquial",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "'1', if this alternate name is a colloquial or slang term"
  },
  {
    "name": "is_historic",
    "type": "INTEGER",
    "mode": "NULLABLE",
    "description": "'1', if this alternate name is historic and was used in the past"
  },
  {
    "name": "from_date",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Period from when the name was used"
  },
  {
    "name": "to_date",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Period to when the name was used"
  }
]

Next, create a new table in BigQuery from the data in GCS and the given schema. This is only a staging table, not the final version, so name it accordingly. (You may also want to replace the dataset and table names.)

bq load \
  --source_format=CSV \
  --field_delimiter=$'\t' \
  --quote='' \
  --schema=alternateNamesV2-schema.json \
  etl.alternatenamesv2_load \
  gs://bucket/path/to/alternateNamesV2.txt

4. Transform and save the final data

Finally, transform the raw data and store the result in its final destination. The query first defines a temporary SQL function, PARSE_GEONAME_DATE, that normalizes the date strings in from_date and to_date into proper DATE values. (Again, you may also want to replace the dataset and table names.)

bq query \
  --use_legacy_sql=false \
  --destination_table=geonames.alternatenamesv2 \
  --allow_large_results <<'SQL'
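  -- Normalize the date strings found in from_date/to_date (YYYY, YYYYMMDD,
  -- YYYY-MM-DD, or YYYY.MM.DD); anything else becomes NULL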
  CREATE TEMP FUNCTION PARSE_GEONAME_DATE(s STRING)
  RETURNS DATE
  AS (
    CASE
      WHEN s IS NULL THEN NULL
      WHEN LENGTH(s) = 4 THEN DATE(SAFE_CAST(s AS INT64), 1, 1)
      WHEN LENGTH(s) = 8 THEN SAFE.PARSE_DATE('%Y%m%d', s)
      WHEN LENGTH(s) = 10 AND CONTAINS_SUBSTR(s, '-') THEN SAFE.PARSE_DATE('%Y-%m-%d', s)
      WHEN LENGTH(s) = 10 AND CONTAINS_SUBSTR(s, '.') THEN SAFE.PARSE_DATE('%Y.%m.%d', s)
      ELSE NULL
    END
  );

  SELECT
    alternate_name_id,
    geoname_id,
    NULLIF(isolanguage, '') AS isolanguage,
    alternate_name,
    COALESCE(is_preferred_name, 0)=1 AS is_preferred_name,
    COALESCE(is_short_name, 0)=1 AS is_short_name,
    COALESCE(is_colloquial, 0)=1 AS is_colloquial,
    COALESCE(is_historic, 0)=1 AS is_historic,
    PARSE_GEONAME_DATE(from_date) AS from_date,
    PARSE_GEONAME_DATE(to_date) AS to_date
  FROM
    `etl.alternatenamesv2_load`
SQL

The data is now available in BigQuery at geonames.alternatenamesv2, or wherever you chose to put it. The GCS file gs://bucket/path/to/alternateNamesV2.txt and the BigQuery table etl.alternatenamesv2_load can now be deleted, if desired.
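
With both tables loaded, they join naturally on geoname_id. For example, here is a sketch of a query that pulls the preferred English name for the most populous places (the table names match the destination tables created above):

bq query --use_legacy_sql=false \
  'SELECT
     p.geoname_id,
     p.name,
     n.alternate_name AS preferred_english_name,
     p.population
   FROM `geonames.allcountries` AS p
   JOIN `geonames.alternatenamesv2` AS n
     ON n.geoname_id = p.geoname_id
   WHERE p.feature_class = "P"
     AND n.isolanguage = "en"
     AND n.is_preferred_name
   ORDER BY p.population DESC
   LIMIT 10'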

Conclusion

Handling large, location-based datasets like GeoNames can seem challenging, but tools like Google Cloud and BigQuery make the process efficient and scalable. In this guide, we demonstrated how to perform ETL to transform raw data into a structured and usable format, ready for integration into APIs or analytics workflows.

If you’re working with similar datasets, this process provides a solid foundation for managing and enriching location data. But if you’d rather not do all this work yourself, the HumanGraphics API offers robust location data services to handle parsing, enrichment, and more. Let us take care of the heavy lifting so you can focus on building and analyzing!

Andy Boothe

Andy Boothe is the owner of HumanGraphics