[Feature] Schema management functionality on streams for InfinyOn Cloud and Fluvio Topics #3267

Open
drc-infinyon opened this issue May 20, 2023 · 1 comment

Comments

@drc-infinyon
Contributor

drc-infinyon commented May 20, 2023

Related RFC: #3081

Summary

This product brief describes the need for schema management functionality. Members of our developer community have asked whether we support functionality similar to the Kafka Schema Registry. This document describes the problem space and the functionality needed to serve InfinyOn customers.

Opportunity

Schemas are an essential input to implementing and maintaining data contracts and data quality. The majority of the data world operates on defined schemas and data models. The ability to implement a schema on topics will enable many different features, including time-window-based aggregation and materialized views, which rely on a tabular structure.

Target audience

Schema management will be relevant to InfinyOn Cloud developers, as well as analysts, who need to implement a schema configuration in their data flows.

Customer Insights

Among our current user feedback, we have an IoT company that described its need for schemas.

They receive data from sensors that are made and deployed by different vendors. The sensors send similar payloads that differ in attribute names and in the measurement systems used for dimensions. These differences need to be reconciled during cleanup. Below is 5 minutes of the customer describing the use case.
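As a rough illustration of the reconciliation described above, the sketch below maps vendor-specific attribute names and units onto one canonical record. The vendor names, field names, and units are all hypothetical; nothing here reflects the customer's actual payloads or an existing Fluvio API.

```python
# Hypothetical per-vendor mappings: source field -> (canonical field, unit converter).
def fahrenheit_to_celsius(value):
    return (value - 32.0) * 5.0 / 9.0


def identity(value):
    return value


VENDOR_MAPPINGS = {
    "vendor_a": {
        "temp_f": ("temperature_c", fahrenheit_to_celsius),
        "device": ("sensor_id", identity),
    },
    "vendor_b": {
        "temperatureC": ("temperature_c", identity),
        "sensorId": ("sensor_id", identity),
    },
}


def normalize(vendor, payload):
    """Rename fields and convert units so every vendor yields the same shape."""
    mapping = VENDOR_MAPPINGS[vendor]
    record = {}
    for source_field, value in payload.items():
        if source_field not in mapping:
            continue  # unknown attributes are dropped in this sketch
        canonical_field, convert = mapping[source_field]
        record[canonical_field] = convert(value)
    return record
```

With these assumed mappings, a `vendor_a` reading in Fahrenheit and a `vendor_b` reading in Celsius both come out as a record with `sensor_id` and `temperature_c`.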

Another consumption pattern was shared by a SaaS company developing usage-based billing, which receives consumption data from its users and provides them with billing and invoicing capabilities.

Experience

Currently, users may have a wide range of experiences with schemas, given that schemas are handled differently across systems such as databases and streaming tools like Kafka.

As we consider what schema management should look like for the InfinyOn Cloud user, we need to be informed by the data sources, the payloads, and the consumption patterns.

For instance, for semi-structured data from web pages, RSS feeds, and clickstreams, we would expect XML and JSON inputs. Considering the consumption patterns and the serialization/deserialization requirements, we have come across customers and prospects who use Avro or Protobuf as serialization formats, with the data stored in Parquet-based table formats like Hudi or Iceberg, or in other optimized columnar formats like Arrow.

A schema makes it possible to model semi-structured data in a tabular form, which in turn makes it possible to perform aggregations, create derived columns, and model the data for analytical workflows.
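To make the tabular idea concrete, here is a minimal sketch that projects nested JSON-like events onto flat columns and then aggregates one column. The column names, paths, and event shape are assumptions made up for this example, not part of any proposed Fluvio interface.

```python
from collections import defaultdict

# Hypothetical schema: column name -> path into the nested event.
COLUMNS = {
    "user_id": ("user", "id"),
    "page": ("page",),
    "duration_ms": ("metrics", "duration_ms"),
}


def to_row(event):
    """Project a nested event onto the flat columns; missing values become None."""
    row = {}
    for column, path in COLUMNS.items():
        value = event
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        row[column] = value
    return row


def total_by(rows, group_column, value_column):
    """Sum value_column per distinct group_column, skipping rows with no value."""
    totals = defaultdict(int)
    for row in rows:
        if row[value_column] is not None:
            totals[row[group_column]] += row[value_column]
    return dict(totals)
```

Once every event has the same columns, grouping and summing becomes a simple loop, which is the property that window aggregations and materialized views depend on.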

For InfinyOn customers, we need to enable schema management on data collected from the edge, in order to generate alerts on schema changes or payload issues at the source, and to support dynamic computation using SmartModules based on attribute values.
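One possible shape for the schema-change alert is a plain diff between two schema versions. The sketch below assumes a schema can be reduced to a mapping of column name to type name; that representation and the message wording are hypothetical.

```python
def diff_schemas(old, new):
    """Compare two {column: type} mappings and describe what changed."""
    changes = []
    for name in sorted(old.keys() - new.keys()):
        changes.append(f"removed column '{name}'")
    for name in sorted(new.keys() - old.keys()):
        changes.append(f"added column '{name}' ({new[name]})")
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            changes.append(
                f"column '{name}' changed type: {old[name]} -> {new[name]}"
            )
    return changes
```

An empty result means the two versions agree; any other result could be surfaced as an alert.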

Acceptance Criteria

  • Ability to define a schema configuration using YAML files specifying the schema type and the keys
  • Ability to apply the schema configuration using the Fluvio CLI
  • Ability to detect changes in the schema or incorrect data and generate error messages
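The third criterion could be checked along the following lines. This is only a sketch: the column list mirrors the example spec in this issue, while the mapping from type names to Python types and the error message format are assumptions.

```python
# Column definitions mirroring the example spec in this issue; the mapping of
# type names ("integer", "string") to Python types is an assumption.
SPEC = [
    {"name": "fruit_id", "key": True, "type": "integer"},
    {"name": "fruit_name", "type": "string"},
    {"name": "fruit_color", "type": "string"},
]
PYTHON_TYPES = {"integer": int, "string": str}


def validate(record, spec=SPEC):
    """Return a list of error messages; an empty list means the record conforms."""
    errors = []
    known = {column["name"] for column in spec}
    for column in spec:
        name = column["name"]
        if name not in record:
            errors.append(f"missing field '{name}'")
            continue
        expected = PYTHON_TYPES[column["type"]]
        value = record[name]
        # bool is a subclass of int in Python, so it is rejected explicitly
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(
                f"field '{name}' expected {column['type']}, "
                f"got {type(value).__name__}"
            )
    for name in sorted(record.keys() - known):
        errors.append(f"unexpected field '{name}'")
    return errors
```

Returning all errors at once, rather than stopping at the first, gives the producer a complete picture of what is wrong with a payload.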

Competitive Insights

  1. Confluent Schema Registry: https://docs.confluent.io/platform/current/schema-registry/index.html
  2. Slalom schema registry introduction: https://medium.com/slalom-technology/introduction-to-schema-registry-in-kafka-915ccf06b902
  3. Confluent Schema Registry 101, Avro, JSON: https://youtu.be/ovIsHhIrie8

Interface

Configuration

Schema configuration example applied to topic:

*schema-config.yaml*
meta:
  name: column-schema-1
  version: 1.0.0 # semver expected
  # schema-provider names a SmartModule conforming to a SmartModule schema interface
  schema-provider: infinyon/[email protected] # alternatives include column, protobuf, parquet, arrow

# spec is a user-defined custom specification string; it is not parsed here and is
# passed to the schema SmartModule as an opaque string
spec: |
  - name: fruit_id
    key: true
    type: integer
  - name: fruit_name
    type: string
  - name: fruit_color
    type: string

CLI

CLI Commands concept

fluvio schema create

fluvio schema list

fluvio schema describe SCHEMA_NAME[@VERSION]

fluvio schema apply SCHEMA_NAME TOPIC_NAME

fluvio schema remove SCHEMA_NAME TOPIC_NAME

fluvio schema delete

fluvio schema disable SCHEMA_NAME@VERSION

fluvio schema create --config schema-config.yaml
@ajhunyady
Contributor

@drc-infinyon, as per our conversation, the schema should be applied at the topic level. Do you have the notes or a picture from the whiteboard session?

fluvio topic create <name> --config <config with schema definition>
fluvio topic apply <name> --config <config with schema definition>

@drc-infinyon drc-infinyon moved this from 🏷 Features to 🏗 In progress in InfinyOn Public Roadmap Jun 12, 2023
@drc-infinyon drc-infinyon changed the title Schema management functionality on streams for InfinyOn Cloud and Fluvio Topics [Feature] Schema management functionality on streams for InfinyOn Cloud and Fluvio Topics Jul 5, 2023
Projects
Status: 🏗 In progress

2 participants