Building a Geospatial Lakehouse with Open Source and Databricks

Most data that pertains to a measurable process in the real world has a geospatial aspect to it. Organisations that manage assets over a wide geographical area, or have a business process that requires them to consider many layers of geographical attributes that require mapping, will have more complicated geospatial analytics requirements when they start to use this data to answer strategic questions or optimise. These geospatially focussed organisations might ask these sorts of questions of their data:

How many of my assets fall within a geographical boundary?

How long does it take my customers to get to a site on foot or by car?

What’s the density of footfall I should expect per unit area?

All of these are useful geospatial queries, requiring that a number of data entities be integrated in a common storage layer, and that geospatial joins such as point-in-polygon operations and geospatial indexing be scaled to handle the inputs involved. This article discusses approaches to scaling geospatial analytics using the features of Databricks and open-source tools that take advantage of Spark implementations, the common Delta table storage format and Unity Catalog [1], focussing on batch analytics on vector geospatial data.

Solution Overview

The diagram below summarises an open-source approach to building a geospatial Lakehouse in Databricks. Through a variety of ingestion modes (though often through public APIs), geospatial datasets are landed into cloud storage in a variety of formats; with Databricks this could be a volume within a Unity Catalog catalog and schema. Geospatial data formats primarily include vector formats (GeoJSON, .csv and Shapefiles .shp), which represent latitude/longitude points, lines or polygons and their attributes, and raster formats (GeoTIFF, HDF5) for imaging data. Using GeoPandas [2] or Spark-based geospatial tools such as Mosaic [3] or the H3 Databricks SQL functions [4], we can prepare vector files in memory and save them in a unified bronze layer in Delta format, using Well Known Text (WKT) as a string representation of any points or geometries.

Overview of a geospatial analytics workflow built using Unity Catalog and open source in Databricks. Image by author.
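To make the WKT step concrete, here is a minimal sketch of turning a landed GeoJSON feature into a flat row with a WKT string column, the shape a bronze table row might take. It is plain Python rather than GeoPandas or Mosaic, it handles only Point, LineString and Polygon, and the function names, the `site_id` attribute and the sample feature are all assumptions made for illustration.

```python
import json


def _coords(points):
    """Format a list of [lon, lat] positions as 'lon lat, lon lat, ...'."""
    return ", ".join(f"{p[0]} {p[1]}" for p in points)


def geojson_to_wkt(geometry):
    """Convert a GeoJSON Point, LineString or Polygon dict to a WKT string."""
    kind, coords = geometry["type"], geometry["coordinates"]
    if kind == "Point":
        return f"POINT ({coords[0]} {coords[1]})"
    if kind == "LineString":
        return f"LINESTRING ({_coords(coords)})"
    if kind == "Polygon":
        rings = ", ".join(f"({_coords(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"Unsupported geometry type: {kind}")


# Hypothetical landed feature; in practice this would be read from a file in a volume.
feature = json.loads("""
{"type": "Feature",
 "properties": {"site_id": "A1"},
 "geometry": {"type": "Polygon",
              "coordinates": [[[0, 0], [4, 0], [4, 3], [0, 3], [0, 0]]]}}
""")

# Flatten attributes and geometry into one record with a WKT string column.
bronze_row = {**feature["properties"], "wkt": geojson_to_wkt(feature["geometry"])}
print(bronze_row)
```

Keeping the geometry as a string means the row can be stored in an ordinary Delta table and parsed back into a geometry by whichever library reads it later.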

While the landing to bronze layer represents an audit log of ingested data, the bronze to silver layer is where data preparation and any geospatial joins common to all downstream use-cases can be applied. The finished silver layer should represent a single geospatial view and may integrate with other non-geospatial datasets as part of an enterprise data model; it also offers an opportunity to consolidate multiple tables from bronze into core geospatial datasets, which may have multiple attributes and geometries, at the base level of grain required for aggregations downstream. The gold layer is then the geospatial presentation layer where the output of geospatial analytics such as travel time or density calculations can be stored. For use in dashboarding tools such as Power BI, outputs may be materialised as star schemas, whilst cloud GIS tools such as ESRI Online will prefer GeoJSON files for specific mapping applications.

Geospatial Data Preparation

In addition to the typical data quality challenges faced when unifying many individual data sources in a data lake architecture (missing data, variable recording practices, etc.), geospatial data has unique data quality and preparation challenges. To make vector geospatial datasets interoperable and easily visualised downstream, it’s best to choose a single geospatial co-ordinate system such as WGS 84 (the widely used international GPS standard). In the UK many public geospatial datasets use other co-ordinate systems such as OSGB 36, which is optimised for mapping geographical features in the UK with increased accuracy (this format is often written in Eastings and Northings rather than the more typical Latitude and Longitude pairs), and a transformation to WGS 84 is required for these datasets to avoid inaccuracies in the downstream mapping, as outlined in the Figure below.

*Overview of geospatial co-ordinate systems a) and overlay of WGS 84 and OSGB 36 for the UK b). Images adapted from [5]* with permission from author. Copyright (c) Ordnance Survey 2018.

Most geospatial libraries such as GeoPandas, Mosaic and others have built-in functions to handle these conversions, for example from the Mosaic documentation:

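from pyspark.sql.functions import lit
from mosaic import enable_mosaic, st_astext, st_geomfromwkt, st_setsrid, st_transform

enable_mosaic(spark, dbutils)  # assumes a Databricks notebook where spark and dbutils exist

# Tag the WKT geometry as WGS 84 (EPSG:4326), then reproject it to Web Mercator (EPSG:3857)
df = (
  spark.createDataFrame([{'wkt': 'MULTIPOINT ((10 40), (40 30), (20 20), (30 10))'}])
  .withColumn('geom', st_setsrid(st_geomfromwkt('wkt'), lit(4326)))
)
df.select(st_astext(st_transform('geom', lit(3857)))).show(1, False)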
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MULTIPOINT ((1113194.9079327357 4865942.279503176), (4452779.631730943 3503549.843504374), (2226389.8158654715 2273030.926987689), (3339584.723798207 1118889.9748579597))|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Converts a multi-point geometry from WGS 84 to Web Mercator projection format.
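For intuition about what this particular transformation does, the spherical Web Mercator formulas can be written out in a few lines of plain Python. This is a minimal sketch, not a replacement for a library: the function name `wgs84_to_web_mercator` is my own, and it only covers the WGS 84 (EPSG:4326) to Web Mercator (EPSG:3857) case. A datum change such as OSGB 36 to WGS 84 is more involved and is better left to the libraries mentioned above.

```python
import math

# WGS 84 semi-major axis in metres, which EPSG:3857 uses as the radius of its sphere.
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Project a lon/lat pair in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# The same four points as the MULTIPOINT in the Mosaic example.
points = [(10, 40), (40, 30), (20, 20), (30, 10)]
for lon, lat in points:
    print((lon, lat), "->", wgs84_to_web_mercator(lon, lat))
```

The printed pairs should agree closely with the co-ordinates in the Mosaic output above, which is a quick sanity check that the data really was in WGS 84 to begin with.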

Another data quality issue unique to vector geospatial data is the concept of invalid geometries, outlined in the Figure below. These invalid geometries will break downstream GeoJSON files or analyses, so it’s best to fix them or delete them if necessary. Most geospatial libraries offer functions to find or attempt to fix invalid geometries.

*Examples of types of invalid geometries. Image taken from [6] with permission from author*. Copyright (c) 2024 Christoph Rieke.
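As a rough illustration of what such a validity check looks for, the sketch below flags three common problems in a single polygon ring: too few points, a ring that does not close, and edges that cross each other (a "bow-tie"). It is a simplified stand-in written in plain Python with hypothetical function names; real libraries check many more cases (holes, touching edges, duplicate points) and should be preferred in practice.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right turn, 0 collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _segments_cross(p1, p2, p3, p4):
    """True when segments p1-p2 and p3-p4 cross at a point interior to both."""
    return (_orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
            and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0)


def ring_problems(ring):
    """Return a list of validity problems for a polygon ring of (x, y) tuples."""
    ring = list(ring)
    problems = []
    if len(ring) < 4:
        problems.append("too few points")
    if ring and ring[0] != ring[-1]:
        problems.append("ring not closed")
        ring = ring + [ring[0]]  # close it so the crossing check below still runs
    edges = list(zip(ring, ring[1:]))
    # Compare every pair of non-adjacent edges; adjacent edges only share an endpoint.
    for i in range(len(edges)):
        for j in range(i + 2, len(edges)):
            if _segments_cross(*edges[i], *edges[j]):
                problems.append("self-intersection")
                return problems
    return problems


square = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
bow_tie = [(0, 0), (4, 4), (4, 0), (0, 4), (0, 0)]
unclosed = [(0, 0), (4, 0), (4, 4), (0, 4)]

for name, ring in [("square", square), ("bow_tie", bow_tie), ("unclosed", unclosed)]:
    print(name, ring_problems(ring))
```

Returning a list of named problems, rather than a single boolean, makes it easier to decide per record whether to repair the geometry or drop it.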

These data quality and preparation steps should be implemented early on in the Lakehouse layers; I’ve done them in the bronze to silver step in the past, along with any reusable geospatial joins and other transformations.

Scaling Geospatial Joins and Analytics

The geospatial aspect of the silver/enterprise layer should ideally represent a single geospatial view that feeds all downstream aggregations, analytics, ML modelling and AI. In addition to data quality checks and remediation, it’s sometimes useful to consolidate many geospatial datasets with aggregations or unions to simplify the data model, simplify downstream queries and prevent the need to redo expensive geospatial joins. Geospatial joins are often very computationally expensive due to the large number of bits required to represent sometimes complex multi-polygon geometries and the need for many pair-wise comparisons.

A few strategies exist to make these joins more efficient. You can, for example, simplify complex geometries, effectively reducing the number of lat/lon pairs required to represent them; different approaches are available for doing this, geared towards different desired outputs (e.g., preserving area, or removing redundant points), and these can be implemented in the libraries, for example in Mosaic:

from mosaic import st_simplify  # assumes enable_mosaic(spark, dbutils) has already been run

df = spark.createDataFrame([{'wkt': 'LINESTRING (0 1, 1 2, 2 1, 3 0)'}])
df.select(st_simplify('wkt', 1.0)).show()
+----------------------------+
| st_simplify(wkt, 1.0)      |
+----------------------------+
| LINESTRING (0 1, 1 2, 3 0) |
+----------------------------+
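To show what a tolerance-based simplification does under the hood, here is a minimal sketch of the Douglas-Peucker algorithm, one common way of removing redundant points. I am not claiming this is exactly what Mosaic runs internally; the function names are my own, and this version measures distance to the infinite line through the two end points, which is the usual textbook simplification.

```python
import math


def _perp_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def douglas_peucker(points, tolerance):
    """Drop points that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    # Find the interior point furthest from the chord joining the end points.
    distances = [_perp_distance(p, points[0], points[-1]) for p in points[1:-1]]
    max_dist = max(distances)
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Keep that point and simplify the two halves either side of it.
    split = distances.index(max_dist) + 1
    left = douglas_peucker(points[:split + 1], tolerance)
    right = douglas_peucker(points[split:], tolerance)
    return left[:-1] + right


line = [(0, 1), (1, 2), (2, 1), (3, 0)]
print(douglas_peucker(line, 1.0))
```

On the same linestring and tolerance as the Mosaic example, this keeps (1, 2), which sits about 1.26 units from the chord, and drops (2, 1), which lies exactly on the line between its neighbours.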

Another approach to scaling geospatial queries is to use a geospatial indexing system, as outlined in the Figure below. By aggregating point or polygon geometry data to a geospatial indexing system such as H3, an approximation of the same information can be represented in a highly compressed form: a short string identifier that maps to one of a set of fixed polygons (with visualisable lat/lon pairs) which cover the globe, over a range of hexagon/pentagon areas at different resolutions, and which can be rolled up/down in a hierarchy.

*Motivation for geospatial indexing systems (compression) [7] and visualisation of the H3 index from Uber [8]. Images adapted with permission from authors.* Copyright (c) CARTO 2023. Copyright (c) Uber 2018.

In Databricks the H3 indexing system is also optimised for use with its Spark SQL engine, so you can write queries such as this point-in-polygon join as approximations in H3, first converting the points and polygons to H3 indexes at the desired resolution (res. 7, which is ~5 km^2 per cell) and then using the H3 index fields as keys to join on:

-- Index each point at H3 resolution 7
WITH locations_h3 AS (
    SELECT
        id,
        lat,
        lon,
        h3_pointash3(
            CONCAT('POINT(', lon, ' ', lat, ')'),
            7
        ) AS h3_index
    FROM locations
),
-- Fill each region polygon with resolution 7 cells, one row per cell
regions_h3 AS (
    SELECT
        name,
        explode(
            h3_polyfillash3(
                wkt,
                7
            )
        ) AS h3_index
    FROM regions
)
-- Join points to regions on the shared cell id
SELECT
    l.id AS point_id,
    r.name AS region_name,
    l.lat,
    l.lon,
    r.h3_index,
    h3_boundaryaswkt(r.h3_index) AS h3_polygon_wkt
FROM locations_h3 l
JOIN regions_h3 r
  ON l.h3_index = r.h3_index;
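The reason this pattern scales is that an expensive geometric comparison is replaced by an equality join on a cell key. The sketch below illustrates that idea in plain Python under simplifying assumptions: it uses a square grid in degrees as a stand-in for H3 hexagons, it keeps a cell when its centre falls inside the polygon (one possible fill rule, chosen for this sketch), and every name and sample value in it is hypothetical.

```python
import math
from collections import defaultdict

CELL_SIZE = 0.5  # hypothetical cell width in degrees; plays the role of an H3 resolution


def point_to_cell(lon, lat, size=CELL_SIZE):
    """Map a point to the (column, row) key of the grid cell containing it."""
    return (math.floor(lon / size), math.floor(lat / size))


def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the closed ring of (lon, lat) tuples?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def polyfill(ring, size=CELL_SIZE):
    """Return the set of cells whose centre falls inside the ring."""
    lons = [x for x, _ in ring]
    lats = [y for _, y in ring]
    cells = set()
    for cx in range(math.floor(min(lons) / size), math.floor(max(lons) / size) + 1):
        for cy in range(math.floor(min(lats) / size), math.floor(max(lats) / size) + 1):
            if point_in_ring((cx + 0.5) * size, (cy + 0.5) * size, ring):
                cells.add((cx, cy))
    return cells


# Hypothetical inputs: two square regions and three points as (id, lon, lat).
regions = {
    "north": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "east": [(2, 0), (4, 0), (4, 2), (2, 2), (2, 0)],
}
locations = [(1, 0.6, 1.2), (2, 3.1, 0.4), (3, 5.0, 5.0)]

# Index the regions once, then each point lookup is a dictionary hit on the cell key.
cell_to_regions = defaultdict(list)
for name, ring in regions.items():
    for cell in polyfill(ring):
        cell_to_regions[cell].append(name)

matches = [
    (point_id, name)
    for point_id, lon, lat in locations
    for name in cell_to_regions.get(point_to_cell(lon, lat), [])
]
print(matches)
```

The third point falls in a cell that no region covers, so it drops out of the result just as it would in an inner join. Points close to a polygon edge can be misassigned, which is the approximation being traded for speed; a finer cell size reduces the error at the cost of more cells.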

GeoPandas and Mosaic will also allow you to do geospatial joins without any approximations if required, but often the use of H3 is a sufficiently accurate approximation for joins and analytics such as density calculations. With a cloud analytics platform you can also make use of APIs to bring in live traffic data and travel time calculations using services such as Open Route Service [9], or enrich geospatial data with additional attributes (e.g., transport hubs or retail locations) using tools such as the Overpass API for OpenStreetMap [10].

Geospatial Presentation Layers

Now that some geospatial queries and aggregations have been done and analytics are ready to visualise downstream, the presentation layer of a geospatial lakehouse can be structured according to the downstream tools used for consuming the maps or analytics derived from the data. The Figure below outlines two typical approaches.

*Comparison of GeoJSON Feature Collection a) vs dimensionally modelled star schema b) as data structures for geospatial presentation layer outputs. Image by author.*

When serving a cloud geospatial information system (GIS) such as ESRI Online or another web application with mapping tools, GeoJSON files stored in a gold/presentation layer volume, containing all of the necessary data for the map or dashboard to be created, can constitute the presentation layer. Using the FeatureCollection GeoJSON type you can create a nested JSON containing multiple geometries and associated attributes (“features”), which may be points, linestrings or polygons. If the downstream dashboarding tool is Power BI, a star schema might be preferred, where the geometries and attributes can be modelled as facts and dimensions to make the most of its cross filtering and measure support, with outputs materialised as Delta tables in the presentation layer.
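As a small illustration of the GeoJSON route, the sketch below assembles a FeatureCollection with the standard library only. The helper `to_feature`, the attribute names and the co-ordinates are all made up for the example; in a real pipeline the features would come from the gold tables and the text would be written to a file in the presentation volume.

```python
import json


def to_feature(geometry_type, coordinates, **properties):
    """Wrap a geometry and its attributes as a GeoJSON Feature."""
    return {
        "type": "Feature",
        "geometry": {"type": geometry_type, "coordinates": coordinates},
        "properties": properties,
    }


# Hypothetical outputs: one point asset and one polygon service area (lon, lat order).
features = [
    to_feature("Point", [-1.55, 53.80], name="Depot", asset_count=12),
    to_feature(
        "Polygon",
        [[[-1.6, 53.7], [-1.4, 53.7], [-1.4, 53.9], [-1.6, 53.9], [-1.6, 53.7]]],
        name="Service area",
        asset_count=40,
    ),
]

feature_collection = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

Everything a map needs travels in one file: each feature carries its own geometry and its own attributes, which is why this format suits GIS tools, whereas a star schema separates the two for filtering and measures.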

Platform Architecture and Integrations

Geospatial data will usually represent one part of a wider enterprise data model and portfolio of analytics and ML/AI use-cases, and these will require (ideally) a cloud data platform, with a series of upstream and downstream integrations to deploy, orchestrate and actually see that the analytics prove useful to an organisation. The Figure below shows a high-level architecture for the kind of Azure data platform I’ve worked with geospatial data on in the past.

*High-level architecture of a geospatial Lakehouse in Azure*. Image by author.

Data is landed using a variety of ETL tools (if possible Databricks itself is sufficient). Within the workspace(s) a medallion pattern of raw (bronze), enterprise (silver) and presentation (gold) layers is maintained, using the hierarchy of Unity Catalog catalog.schema.table/volume to generate per use-case layer separation (particularly of permissions) if needed. When presentable outputs are ready to share, there are a number of options for data sharing, app building, dashboarding and GIS integration.
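One way to keep that separation consistent is to generate object names from a single convention instead of typing them by hand. The sketch below assumes one catalog per use case and one schema per medallion layer, which is only one of several reasonable layouts; the helper names, the `footfall` use case and the file names are hypothetical.

```python
LAYERS = ("bronze", "silver", "gold")


def table_name(catalog, schema, table):
    """Three-level Unity Catalog name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def volume_path(catalog, schema, volume, *parts):
    """File path inside a Unity Catalog volume: /Volumes/catalog/schema/volume/..."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


# Hypothetical convention: one catalog per use case, one schema per layer.
use_case = "footfall"
tables = {layer: table_name(use_case, layer, "sites") for layer in LAYERS}
landing_file = volume_path(use_case, "bronze", "landing", "sites.geojson")
export_file = volume_path(use_case, "gold", "exports", "sites_density.geojson")

print(tables)
print(landing_file)
print(export_file)
```

With names built this way, permissions can be granted at the catalog or schema level and will line up with the layers without any per-table bookkeeping.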

For example, with ESRI cloud, an ADLS Gen2 storage account connector within ESRI allows data written to an external Unity Catalog volume (i.e., GeoJSON files) to be pulled through into the ESRI platform for integration into maps and dashboards. Some organisations may prefer that geospatial outputs be written to downstream systems such as CRMs or other geospatial databases. Curated geospatial data and its aggregations are also frequently used as input features to ML models, and this works seamlessly with geospatial Delta tables. Databricks is developing various AI analytics features built into the workspace (e.g., AI/BI Genie [11] and Agent Bricks [12]) that give the ability to query data in Unity Catalog using English, and the likely long-term vision is for any geospatial data to work with these AI tools in the same way as any other tabular data, only one of the visualised outputs will be maps.

In Closing

At the end of the day, it’s all about making cool maps that are useful for decision making. The figure below shows a couple of geospatial analytics outputs I’ve generated over the last few years. Geospatial analytics boils down to knowing things like where people or events or assets cluster, how long it typically takes to get from A to B, and what the landscape looks like in terms of the distribution of some attribute of interest (might be habitats, deprivation, or some risk factor). All important things to know for strategic planning (e.g., where do I put a fire station?), knowing your customer base (e.g., who’s within 30 min of my location?) or operational decision support (e.g., this Friday which locations are likely to require extra capacity?).