Ross Bernet edited this page Jul 26, 2017 · 4 revisions

GeoTrellis Project Roadmap

This document is a work in progress while this note appears. The release schedule may change following discussion and estimation.

Overall Objectives

  • Enable processing of large-scale geospatial data
  • Support ad-hoc analytic workflows
  • Provide clear on-boarding and project architecture documentation
  • Enable machine learning workflows on satellite imagery

Focus on ad hoc workflows

There are two types of potential users for GeoTrellis: (1) those focused on application development and (2) those focused on data analysis.

  1. Product Developers

    Product developers create an application/system with a GIS component. They are interested in a modular, stable API and key features that solve the "hard" problems. This has been the primary focus for GeoTrellis development leading up to the 1.0.0 release.

  2. Data Scientists

    Data scientists are interested in extracting information by combining multiple datasets through ad-hoc analysis. Much of the effort to address this use case has gone into GeoPySpark, but it relates to the core goals of GeoTrellis.

    1. A data science focus reaches a wider audience than Spark/Scala application developers.
    2. Ad-hoc analysis puts more pressure on the maturity and composability of the API, aiding all users.
    3. Ad-hoc analysis more often deals with data that is heterogeneous in projection and resolution, exposing more performance problems.
    4. Many important social questions, like measuring deforestation or the impact of climate change on a given area or industry, are best handled through ad-hoc analysis.

Release Schedule

Objective: Release a version at the end of every quarter

Development of 2.0 features will start before the 1.2 release. The workflow will be to bump the head of the repository to 2.0 as soon as the first 2.0 feature goes in, and to back-merge 1.2 PRs into a release branch as they are implemented.

2017 July - Sept (GeoTrellis 1.2)

The main focus is to support the data science use case for GeoTrellis, where the driving use case is GeoPySpark. This results in a focus on key new features and on optimizing central operations.

  1. New Operations

    • Euclidean Distance
    • Viewshed
    • RDD Rasterization
  2. Problems

    • GeoTiff band interleave streaming
  3. geotrellis-spark-sql

    • Spark DataFrame Support
    • SparkSQL Support
    • SparkML Integration
  4. Layer IO SPI

    Decouple the different back-ends through use of the Java Service Provider Interface (SPI) to load:

    • AttributeStore
    • LayerReader
    • LayerWriter

    The interface should be based on producing these classes from a URI that fully configures them.
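
    A minimal sketch of what URI-driven provider resolution might look like, modeled loosely on `java.util.ServiceLoader` discovery (all class and method names here are hypothetical, not the GeoTrellis API):

```scala
import java.net.URI

trait AttributeStore
trait LayerReaderProvider {
  // A provider declares which URIs it can handle (e.g. by scheme).
  def canProcess(uri: URI): Boolean
  // The URI fully configures the produced object.
  def attributeStore(uri: URI): AttributeStore
}

// Example back-end keyed off the "file" scheme.
class FileLayerProvider extends LayerReaderProvider {
  def canProcess(uri: URI): Boolean = uri.getScheme == "file"
  def attributeStore(uri: URI): AttributeStore = new AttributeStore {}
}

// Resolution mimics ServiceLoader iteration: pick the first
// registered provider that accepts the URI.
def resolve(uri: URI, providers: Seq[LayerReaderProvider]): Option[LayerReaderProvider] =
  providers.find(_.canProcess(uri))
```

    In the real SPI the provider list would come from `META-INF/services` registration rather than being passed in explicitly.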

  5. TileView

    Accumulates transformations on a tile, avoiding intermediate allocations until the view is converted back into a Tile.

    1. Facets

      • Local
      • Focal: how do we cursor?
        • Benchmark focal view vs focal tile operation.
        • Check if cursor is compatible with random strategy.
      • Resample
      • Reproject

      What can be done for focal operations? Current focal methods are stateful in order to optimize the overall transformation. We can't translate that directly, because the focal call will be obscured behind the produced view. This implies that the view must be available for random access.

    2. Opens

      • Lower memory footprint for Tile transformations
      • Lower memory footprint for RDD transformations
      • Map Algebra RPC
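
    The allocation-avoiding idea behind TileView can be sketched in miniature (a toy model; `IntTile`, `TileView`, and their methods are illustrative assumptions, not the GeoTrellis API):

```scala
// A tiny single-band integer tile.
final case class IntTile(cols: Int, rows: Int, cells: Array[Int])

final class TileView(source: IntTile, f: Int => Int) {
  // Compose instead of allocating: no intermediate array is created here.
  def map(g: Int => Int): TileView = new TileView(source, f.andThen(g))
  // Random access, which focal operations over a view would require.
  def get(col: Int, row: Int): Int = f(source.cells(row * source.cols + col))
  // A single allocation when the view is turned back into a concrete tile.
  def toTile: IntTile = IntTile(source.cols, source.rows, source.cells.map(f))
}

def view(t: IntTile): TileView = new TileView(t, identity)
```

    Chaining `map` calls on a view composes functions; only `toTile` walks the cells, so a pipeline of n local operations allocates one output array instead of n.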
  6. Optimize central operations

    Optimizations here will have positive impact on applications across the board.

    • Pyramid: build all levels in a single shuffle step
    • Reproject: avoid collecting metadata again after reprojection
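
    The single-shuffle pyramid idea can be sketched with plain collections standing in for an RDD (`Key`, `ancestors`, and the summing merge are illustrative assumptions, not the GeoTrellis implementation):

```scala
// A spatial key at a given zoom level.
final case class Key(zoom: Int, col: Int, row: Int)

// Every ancestor key of a base tile, from zoom 0 up to its own zoom.
def ancestors(k: Key): Seq[Key] =
  (0 to k.zoom).map(z => Key(z, k.col >> (k.zoom - z), k.row >> (k.zoom - z)))

// Instead of shuffling once per zoom level, emit all ancestor keys for each
// base tile and reduce in one grouping step (the lone "shuffle").
def pyramid(base: Seq[(Key, Int)]): Map[Key, Int] =
  base
    .flatMap { case (k, v) => ancestors(k).map(_ -> v) }
    .groupBy(_._1)
    .map { case (k, vs) => k -> vs.map(_._2).sum }
```

    In the RDD version the `groupBy`/`map` pair would be a single `reduceByKey`, and the merge would mosaic tiles rather than sum integers.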
  7. GeoTiff Layers: Support for storing GeoTrellis layers as GeoTiffs and reading directly from them

  8. Batch Pipeline

    https://github.com/locationtech/geotrellis/blob/master/docs/architecture/005-etl-pipeline.rst

  9. Vector Tile updates

    Bring the VectorTile class up to date with VectorPipe development. Provide and document an API for simple Vector/VectorTile/Raster workflows.

  10. LiDAR Support

    Read and sort points from `.laz` files into Hadoop-friendly formats. Generate elevation rasters from LiDAR point clouds through either IDW or Delaunay triangulation. This will result in a `geotrellis.pdal` subproject.

  11. TensorFlow Integration

    Use a TensorFlow model to label a raster layer.

2017 Oct - Dec (GeoTrellis 2.0)

This release will focus on addressing API issues that we and our users have hit through usage of GeoTrellis. Some new abstractions will be introduced to unify the multiple contexts in which the same operation is performed.

  1. Cloud optimized GeoTiff Layer

    Ability to save GeoTrellis layers as a set of GeoTiffs. Each GeoTiff would act as a meta-tile and provide a segment layout optimized for TMS fetches.

    Objectives:

    • reduce friction between GeoTrellis and other GIS tools
    • provide meta-tile support
    • enable layer reads of a subset of bands
  2. Spatial Indexing

    • SFCurve Dependency
    • Temporal Binning
  3. Machine Learning

    • Converting spatio-temporal imagery into training sets
    • Pattern of Life
    • ML Model Application
  4. MAML RPC

    Support a project that creates TMS endpoints from MAML definitions.

  5. Cross-Resolution Raster Operations

    Ability to predictably perform operations on rasters whose pixel grids do not align; i.e., map algebra over arbitrary rasters.

    The product must have a specified resolution, so inputs are resampled to that resolution. Intersecting these rasters requires a spatial join. Problems:

    • tiles are too big
    • tiles intersect incorrectly
    • how do you collect neighbors on non-tiled rasters?
    • need some use-cases
    1. Affected

      [diagram: interleaved, misaligned raster1 / raster2 tiles]

    2. Requires

      • NoData Semantics
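
    The "resample inputs to the product's resolution" step can be illustrated with a toy square-grid raster and nearest-neighbor resampling (all names are assumptions for illustration, not a GeoTrellis API):

```scala
// A square single-band raster of Doubles.
final case class Grid(size: Int, cells: Array[Double]) {
  def get(c: Int, r: Int): Double = cells(r * size + c)
}

// Nearest-neighbor resample to a target size: map each target cell
// back to the source cell that covers it.
def resampleTo(g: Grid, target: Int): Grid = {
  val out = Array.tabulate(target * target) { i =>
    val (c, r) = (i % target, i / target)
    g.get(c * g.size / target, r * g.size / target)
  }
  Grid(target, out)
}

// A local operation becomes well-defined once both inputs share
// the product's resolution.
def localAdd(a: Grid, b: Grid, target: Int): Grid = {
  val (ra, rb) = (resampleTo(a, target), resampleTo(b, target))
  Grid(target, ra.cells.zip(rb.cells).map { case (x, y) => x + y })
}
```

    A real implementation would also need a resolved NoData policy (see below in this list) and a spatial join to pair up intersecting tiles.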
  6. Lazy Layer IO…

    Reading rasters metadata-first would give a chance to filter and join the future rasters before they are fully read. This is useful both for ingest and for reading layers through LayerReader.
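
    A toy illustration of the metadata-first idea (all names are hypothetical): filter on lightweight metadata, and only then perform the expensive read of the matching rasters.

```scala
// Lightweight per-raster metadata, cheap to list from a backend.
final case class Metadata(key: Int, bounds: (Int, Int))
// The heavyweight payload we want to avoid reading unnecessarily.
final case class Raster(key: Int, cells: Array[Int])

// Stand-in for the expensive backend read.
def readRaster(md: Metadata): Raster = Raster(md.key, Array.fill(4)(md.key))

// Filter first, read second: only matching rasters are materialized.
def lazyQuery(catalog: Seq[Metadata], keep: Metadata => Boolean): Seq[Raster] =
  catalog.filter(keep).map(readRaster)
```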

  7. Unified MapAlgebra API

    • TileLike: Abstracts over Tile, MultibandTile, TileView
    • LayerLike: Abstracts over RDD[(K,V)], Seq[(K,V)], Map[K, V]
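
    A minimal sketch of how a `TileLike` abstraction might let one local operation serve both single-band and multiband tiles (instances are passed explicitly for clarity; every name here is illustrative, not the proposed API):

```scala
// The abstraction: anything whose cells can be mapped over.
trait TileLike[T] {
  def mapCells(t: T)(f: Int => Int): T
}

// Toy stand-ins for Tile and MultibandTile.
final case class Band(cells: Array[Int])
final case class Multiband(bands: Vector[Band])

object BandOps extends TileLike[Band] {
  def mapCells(t: Band)(f: Int => Int): Band = Band(t.cells.map(f))
}
object MultibandOps extends TileLike[Multiband] {
  def mapCells(t: Multiband)(f: Int => Int): Multiband =
    Multiband(t.bands.map(b => Band(b.cells.map(f))))
}

// One definition of a local operation that works for either tile type.
def localAddN[T](t: T, n: Int, ops: TileLike[T]): T = ops.mapCells(t)(_ + n)
```

    A `LayerLike` abstraction over `RDD[(K, V)]`, `Seq[(K, V)]`, and `Map[K, V]` would follow the same shape, one level up.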
  8. Separation of spark/collections API

    There should be a way to perform collection operations without bringing in the heavyweight `spark-core` dependency. However, this will require moving some utility classes that describe tile layers but do not require Spark out of the spark package.

  9. NoData Semantics

    Parameterize this behavior: 1 + ND = 1 or 1 + ND = ND
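
    The two candidate semantics can be sketched with `Option[Int]` standing in for a possibly-NoData cell (a toy model of the parameterization, not the GeoTrellis implementation):

```scala
sealed trait NoDataRule
case object NoDataPropagates extends NoDataRule // 1 + ND = ND
case object NoDataIgnored    extends NoDataRule // 1 + ND = 1

// A local add whose NoData behavior is chosen by the caller.
def add(a: Option[Int], b: Option[Int], rule: NoDataRule): Option[Int] =
  (a, b) match {
    case (Some(x), Some(y)) => Some(x + y)
    case _ =>
      rule match {
        case NoDataPropagates => None
        case NoDataIgnored    => a.orElse(b) // keep the defined operand, if any
      }
  }
```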

Future Releases

  • OGC Standards
    • GeoServer Plugin for GeoTrellis raster layers
  • Further LiDAR work

API Refactor Candidates

  • LayerReader/Writer
    • Relies only on Avro
    • Relies on SprayJson
    • No segmented reads
  • No abstraction between Tile/MultibandTile