
Explicit Data Definitions For Smooth Integrations

July 02, 2021

This post explains explicit data definitions and how they can be leveraged to make integrations safe and easy. It then contrasts explicit data definitions with integration approaches that rely on implicitly defined data.

First Class Data

Many companies start with implicitly defined data. This means the data is defined in application code without a shareable representation. The burden falls on consumers to impose structure and types on the data being emitted, which creates a fundamentally inequitable relationship.

The following diagram shows an example of implicitly defined data, and calls out a number of issues with it:

[Diagram: implicitly defined data and the issues it creates]

Implicit data is difficult to work with, presenting challenges along a number of dimensions:

  • Discoverability: Definitions live in code and must be hunted down through code search.
  • Searchability: Finding specific instances of fields or data also requires code search.
  • Documentation: Definitions live in application code, which means any documentation most likely lives there as well.
  • Policies: It is hard to apply global policies, such as PII handling or compliance rules, to implicitly defined data. Are policies distributed as libraries? What happens when services don't play by the rules or adopt a non-compliant naming approach?
  • Discovering Structure: Structure may be created dynamically or inconsistently.
  • Discovering Types: Types may also be created dynamically or inconsistently.

Implicit data has a very real cost to organizations. Data consumers such as billing, usage, data science, and security build up dependencies on implicit data, which is subject to change at any moment. Building business critical reporting on implicit data is like building a house on sand: at some point it will fall over.

[Image: building a house on sand]

Implicit data is not auditable. It is extremely hard to discover the structure of the data being emitted, its types, the information it contains, or its privacy implications.

Explicit data definitions address these concerns by promoting data to an explicitly encoded, first class entity. The following shows an example of an explicit event defined using protocol buffers:

[Figure: an event defined in protocol buffers]
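
As a minimal sketch of what such a definition might look like (the message, package, and field names here are illustrative, not the exact event from the figure):

    syntax = "proto3";

    package billing.events;

    // A first class, shareable definition of a subscription event.
    // Structure and types live here, not in any one service's code.
    message SubscriptionCreated {
      string account_id = 1;  // which account the subscription belongs to
      string plan = 2;        // e.g. "pro" or "enterprise"
      int64 created_at = 3;   // unix epoch seconds
    }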

An explicit event lives on its own and is defined independently of any programming language or producer. Explicit events are shareable and strongly typed. The event above represents the data being communicated between two teams. It is a first class entity with an explicit structure and types. The definition can be shared, discussed, validated, and used as a blueprint to create events in almost every programming language.

Stakeholder Integration

Using explicitly defined data benefits the integration process. During integration, teams can share ideas about data structure and types because both are explicitly encoded in the data schema. Design conversations start with sketching out data structures. After a data format is agreed upon, teams can leverage the protocol buffers toolchain to generate language specific bindings (such as node.js and python client libraries) that produce data using the explicit definition as a blueprint.

[Diagram: a schema shared between producer and consumer]
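
As a concrete sketch of that code generation step, assuming the illustrative definition above lives in subscription_created.proto:

    # Generate Python bindings from the shared definition; similar
    # protoc plugins exist for node.js, Java, Go, and other languages.
    protoc --python_out=. subscription_created.proto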

Integrating using explicit contracts was faster, safer, and cheaper than integrating using implicit data structures encoded in JSON. I can't get over how much easier and safer it is to consume explicitly defined data. The image below shows explicitly defined data consumption on the left and implicit data consumption on the right.

[Image: explicit (left) vs implicit (right) data consumption]

The right hand side contains one snippet from an entire file. The logic involved in handling implicit data (does this field exist, is it this type, is this value an object, does the object have a key, and so on) is a huge time sink and a major vector for bugs. Contrast this with the snippet on the left, which leverages an explicit data schema. It's 2 LINES LONG! If the data or its types are invalid, an exception is raised on deserialization!
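
For a sense of the contrast, here is a hedged Python sketch: subscription_created_pb2 is the module protoc would generate from the illustrative definition earlier, and both payloads are inlined so the snippet stands alone.

    import json

    from subscription_created_pb2 import SubscriptionCreated

    # Explicit consumption: structure and types are validated on parse,
    # and google.protobuf.message.DecodeError is raised if the payload
    # is malformed.
    raw_bytes = SubscriptionCreated(account_id="acct_1", plan="pro").SerializeToString()
    event = SubscriptionCreated()
    event.ParseFromString(raw_bytes)

    # Implicit consumption: defensive checks for every field, repeated
    # by every consumer, and easy to get subtly wrong.
    payload = json.loads('{"account_id": "acct_1", "plan": "pro"}')
    if isinstance(payload, dict) and isinstance(payload.get("account_id"), str):
        account_id = payload["account_id"]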

Paving the Way For The Future

I believe that all data deployments can benefit from explicitly defined data, with schema on write enforced as close to the publisher as feasible. Explicitly defined data supports privacy, discoverability, auditability, and contractual compliance, while controlling costs.

Explicitly Typed Data is Enforceable

Enforceability supports correctness. Data is either valid against an explicit specification or it is invalid; no ambiguity can exist. Structured, typed data is easy to reason about and to work with. This differs from using JSON to implicitly define data (which is super common in the industry and completely normal). Once an organization reaches a certain size in terms of teams, services, and data volumes, implicitly defined data becomes untenable.
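
Continuing the illustrative Python sketch from above, the generated bindings enforce types at write time as well as on parse:

    from subscription_created_pb2 import SubscriptionCreated

    event = SubscriptionCreated()
    event.account_id = "acct_1"  # ok: the field is declared as a string

    # Assigning the wrong type fails immediately at write time rather
    # than producing bad data for downstream consumers to discover.
    try:
        event.created_at = "yesterday"  # the field is declared as int64
    except TypeError as err:
        print(f"rejected: {err}")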

Businesses need to enforce privacy, PII, and contractual compliance. Explicitly defined data gives companies a hook for metadata, such as annotating an event as customer data or customer usage data, and annotating individual fields as PII. Since the data definition is explicit, companies can use it to apply policies programmatically early in the data's creation lifecycle.
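
One way to express such annotations is a custom field option that policy tooling can read programmatically. This is a sketch; the option name and extension number are illustrative:

    syntax = "proto3";

    import "google/protobuf/descriptor.proto";

    // A hypothetical org-wide option; 50000-99999 is the extension
    // number range protobuf reserves for in-house use.
    extend google.protobuf.FieldOptions {
      bool pii = 50001;
    }

    message SubscriptionCreated {
      string account_id = 1;
      string email = 2 [(pii) = true];  // policy tools can find this field
    }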

Documentable / Discoverable

Explicitly defined data makes it trivial to generate documentation. It also allows definitions to be published to a centralized, searchable index, which makes it easy for stakeholders or compliance teams to audit data definitions.
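
For example, the community protoc-gen-doc plugin can render definitions directly into browsable docs (a sketch; the paths are illustrative):

    # Render HTML docs for every definition in a shared schema repository
    protoc --doc_out=./docs --doc_opt=html,index.html protos/*.proto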

Contrast this with implicit data definitions. It is extremely difficult to document and discover data before it is created. Implicit data is not easily documentable until the data is in motion (in Kinesis or Kafka) or has landed in S3, because the definition of the data is created dynamically and lives within individual services.

Cost Optimized

Explicitly defined data is trivial to convert to a cost optimized format such as Parquet. Companies often begin by storing almost all data as gzipped JSON files on blob storage. As data volumes grow, companies can leverage tools such as Presto/Athena to expose near real time exploratory analysis and batch workloads at low cost.

But companies can only do this if the data has a fixed schema. Implicit data is untyped and dynamic, which prevents reliably leveraging cost and read optimized storage formats such as Parquet.
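
As a sketch of that conversion, assuming the pyarrow library and the illustrative event definition from earlier, the explicit schema maps directly onto a Parquet schema with nothing inferred from sampled records:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # The Parquet schema falls directly out of the explicit definition.
    schema = pa.schema([
        ("account_id", pa.string()),
        ("plan", pa.string()),
        ("created_at", pa.int64()),
    ])

    rows = [{"account_id": "acct_1", "plan": "pro", "created_at": 1625212800}]
    pq.write_table(pa.Table.from_pylist(rows, schema=schema), "events.parquet")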

Conclusion

Implicitly defined data is an intuitive way to start a service or project, but at a certain size it begins to fall apart. Explicitly defined data provides an alternative that makes it simple to produce and parse data across multiple programming languages. The state of data tooling (such as protocol buffers) adds minimal overhead to development and deployment while providing all the benefits of explicitly typed data.

Happy Hacking!

