This post explains explicit data definitions and how they make integration between teams safer and easier. It then contrasts explicit data definitions with integration approaches built on implicitly defined data.
Many companies initially start with implicitly defined data. This means the data is defined only in application code, with no shareable representation. The burden falls on consumers (such as a data team) to infer the structure and types of the data being emitted, which creates a fundamentally inequitable relationship.
The following diagram shows an example of implicitly defined data, and calls out a number of issues with it:
Implicit data is difficult to work with, presenting challenges along a number of dimensions:
- Discoverability: Definitions live in application code and can only be discovered through code search.
- Searchability: Finding specific fields or instances of data also requires code search.
- Documentation: Because definitions live in application code, documentation (if it exists at all) is scattered across application code as well.
- Policies: Global policies such as PII handling and compliance are hard to apply to implicitly defined data. Are policies distributed as libraries? What happens when services don't play by the rules, or adopt a non-compliant naming approach?
- Discovering structure: Data structures may be created dynamically, or inconsistently across producers.
- Discovering types: Field types may likewise be assigned dynamically or inconsistently.
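To make the problem concrete, here is a small sketch of implicitly defined data (service and field names are illustrative): two services emit the "same" signup event, but its structure and types exist only in each service's application code, so the shapes drift apart.

```python
import json

def emit_signup_service_a(user_id):
    # Service A: integer id, snake_case fields, epoch timestamp
    return json.dumps({"user_id": user_id, "signed_up_at": 1700000000})

def emit_signup_service_b(user_id):
    # Service B: string id, camelCase fields, ISO-8601 timestamp
    return json.dumps({"userId": str(user_id), "signedUpAt": "2023-11-14T22:13:20Z"})

# A downstream consumer must reverse-engineer and reconcile both shapes.
a = json.loads(emit_signup_service_a(42))
b = json.loads(emit_signup_service_b(42))
assert set(a) != set(b)  # conceptually the same event, yet incompatible schemas
```

Nothing stops a third code path from emitting yet another variant, and no tool can tell the consumer which shape is "correct".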
Implicit data has a very real cost to the organization. Data warehouse consumers such as billing, usage, data science, and security build up a dependency on implicit data, which is subject to change at any moment. Building business-critical reporting on implicit data is like building a house on sand: at some point it will collapse.
Implicit data is also not auditable. It is extremely hard to discover the structure and types of the data being emitted, what information it contains, and what its privacy implications are.
Explicit data definitions address these concerns by encoding data in a shareable schema. The following shows an example of an explicit event defined using protocol buffers:
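A minimal sketch of such a definition (the message and field names here are illustrative, not from a real system):

```protobuf
syntax = "proto3";

package events;

// A hypothetical signup event, defined once and shared between teams.
message UserSignedUp {
  string user_id = 1;      // stable, explicitly typed identifier
  string plan = 2;         // e.g. "free", "pro"
  int64 signed_up_at = 3;  // unix epoch seconds
}
```

Every field has a name, a type, and a stable tag number, so producers and consumers agree on the shape of the event before a single line of application code is written.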
An explicit event lives on its own and is defined independently of any programming language or producer. Explicit events are shareable and strongly typed. The event above represents the data being communicated between two teams. It is a first-class entity with an explicit structure and explicit types. The data definition can be shared, discussed, validated, and used as a blueprint to create events in almost every programming language.
Using explicitly defined data benefits the integration process. During integration, teams can share ideas about data structure and types because both are explicitly encoded in the data schema. Design conversations start by sketching out data structures. Once a format is agreed upon, teams can use the protocol buffers toolchain to generate language-specific bindings (such as Node.js and Python client libraries) that produce events using the explicit definition as a blueprint.
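For example, given a shared `.proto` file, each team can generate bindings for its own language with `protoc` (paths and filenames below are illustrative):

```shell
# Generate Python bindings from a shared schema
protoc --proto_path=schemas --python_out=gen/python schemas/user_signed_up.proto

# Generate JavaScript bindings for a Node.js producer
# (newer protoc releases require the protoc-gen-js plugin for --js_out)
protoc --proto_path=schemas --js_out=import_style=commonjs:gen/node schemas/user_signed_up.proto
```

Because both producers build their events from the same generated classes, the structure and types of the data are guaranteed to match the shared definition.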