Schema On Read Considered Harmful

Author: Danny Mican

Schema-on-read places the burden of imposing structure and types on the consumer (reader) of the data. It decouples the reader from the writer, which increases consumer complexity and the risk of incorrectness and bugs. It also creates expensive feedback loops when data changes. Schema-on-read is no longer a viable strategy in the era of big data, two-pizza teams, and microservices. Beware of schema-on-read: it's harmful.

Schema-on-read vs Schema-on-write

A data schema strategy, such as schema-on-read or schema-on-write, applies at data integration points. The structure and types of data must be known during both writes and reads; otherwise the data is an opaque, unusable blob. Choosing a schema strategy is therefore unavoidable. Common integration points of data are:

In each of these cases, data is produced, persisted, and read back for later use. During a write, data is serialized into a byte representation. During a read, data is deserialized from that byte representation into a more usable form. Two approaches exist for managing the serialization and deserialization process:

  • Schema-on-read
  • Schema-on-write

Schema-on-read

Schema-on-read occurs when no metadata is available about the types and structure of data. Schema-on-read only communicates a serialization type, such as CSV or JSON. Schema-on-read lacks metadata around data shape (structure), fields and data types.

Schema-on-read most often involves JSON: a publisher creates and persists JSON-serialized data, and a consumer process later loads and parses that JSON.

The following are common examples of schema-on-read:

  • JSON data is written to MongoDB and queried later
  • JSON is logged from an application and consumed by Elasticsearch
  • JSON is published to Kafka
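The publish-then-parse flow behind these examples can be sketched in a few lines. This is only an illustration: the `publish` and `consume` functions and the `order_id`/`total` fields are hypothetical names invented for this sketch. The point is that nothing in the payload tells the consumer which types to expect, so the consumer has to guess.

```python
import json


def publish(order_id, total):
    # Producer side: serialize whatever it has. Only the serialization
    # format (JSON) is communicated, not the structure or the types.
    return json.dumps({"order_id": order_id, "total": total}).encode("utf-8")


def consume(raw):
    # Consumer side: structure and types are imposed while reading.
    event = json.loads(raw.decode("utf-8"))
    return {
        "order_id": int(event["order_id"]),  # consumer's guess: an integer
        "total": float(event["total"]),      # consumer's guess: a number
    }


# This particular producer happens to send both values as strings.
message = publish("1001", "19.99")
parsed = consume(message)
print(parsed)
```

The casts in `consume` encode assumptions that the producer never agreed to, which is where the issues described later come from.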

Schema-on-read is alluring when building out distributed systems because the lack of a schema is perceived to reduce friction. Fast iteration is a selling point for schemaless systems like MongoDB. It's also alluring to publish JSON in queue-based systems because publishing is so easy. Don't fall into the trap! The lack of a structured, enforceable contract creates issues as soon as systems and teams begin to grow.

Schema-on-write

Schema-on-write occurs when metadata such as structure and types is available and enforced when writing data. The most common example is a relational database schema. A relational database prohibits inserting non-compliant data: it enforces data structure on insert (on write).
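As a rough illustration of enforcement on insert, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table and its columns are hypothetical. Because SQLite is loosely typed by default, the sketch relies on explicit `NOT NULL` and `CHECK` constraints to reject bad rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        active INTEGER NOT NULL CHECK (active IN (0, 1))
    )
    """
)

insert_sql = "INSERT INTO customers (email, active) VALUES (?, ?)"

# A compliant row is accepted.
conn.execute(insert_sql, ("a@example.com", 1))

# Non-compliant rows are rejected at write time, before any reader sees them.
rejected = []
for row in [(None, 1), ("b@example.com", "true")]:
    try:
        conn.execute(insert_sql, row)
    except sqlite3.IntegrityError as exc:
        rejected.append(str(exc))

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count, rejected)
```

The writer that sends `"true"` for a boolean column gets an error immediately, instead of a consumer discovering the problem later.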

Common examples of schema-on-write are:

Schema-on-write is essential for companies with multiple teams or microservices, but enforcing schema and sharing metadata is additional work, which can be viewed as unnecessary for early software or small teams.

Issues

Unknown shape / fields

Lacking known metadata about the shape and structure of data results in code like the following:

data = get_json_data()  # placeholder for however the parsed JSON payload is loaded
contents = data.get('contents', {})
# "customer" is sometimes at the top level and sometimes under "details"
customer = contents.get(
    'details',  # retrieve the "details" key if present
    {}  # if not, return an empty dictionary
).get(
    'customer',  # get the "customer" key from the data returned in the prior step
    contents.get(  # if "customer" is not present, fall back to the top-level "customer"
        'customer'
    )
)

Everything about the field location and structure is implicit. At any moment the location of fields may change, which would result in broken consumers.

Unknown Datatypes

Consumers run into trouble inferring and casting field types. Some producers may provide a small int for a boolean, while others may provide the string 'true' or 'false'. Structure and datatype issues commonly overlap when accessing data. A value may be an object or a literal:

details = data['details']
if isinstance(details, dict):
    # details is a dictionary with sub fields
    ...
elif isinstance(details, str):
    # details is a string and does not contain sub fields
    ...

Schema-on-read creates complex consumers. Consumers can't trust the data they receive, so they must either code defensively (verbose code that gives up on type guarantees) or risk errors. The resulting code is filled with conditionals, introspection, and flaky logic.
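The boolean case mentioned above gives a sense of how much defensive code a single field can require. The helper below is a hypothetical sketch, not a recommended utility: it accepts the encodings named in the text (real booleans, small ints, and the strings 'true' and 'false') and refuses everything else.

```python
def to_bool(value):
    """Best-effort coercion of the boolean encodings a producer might send."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int) and value in (0, 1):
        return bool(value)
    if isinstance(value, str) and value.strip().lower() in ("true", "false"):
        return value.strip().lower() == "true"
    # Anything else is ambiguous, so fail loudly rather than guess.
    raise ValueError(f"cannot interpret {value!r} as a boolean")


print(to_bool(1), to_bool("false"), to_bool("True"))
```

Every consumer of the field needs logic like this, and each one may make slightly different choices about what counts as true.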

Out of sync data

Schema-on-read creates long feedback loops between a consumer and a producer. Producers create data of any structure and type, and the burden of parsing the data falls on the consumer. Consumers often break outright when producers update or change the data. Worse, the defensive conditionals shown above create a situation where consumers may continue without an explicit error. Producers may change field locations or data types without communicating with the consumer teams. This often causes consumers to emit null or unknown values, which go unnoticed until further downstream.
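A small sketch of the silent failure mode: the `customer_name` function and both payloads are hypothetical, with the second payload representing a producer that moved `customer` under `details` without telling anyone.

```python
def customer_name(event):
    # Defensive lookups never raise; a missing field silently becomes None.
    return event.get("contents", {}).get("customer", {}).get("name")


before = {"contents": {"customer": {"name": "Ada"}}}
# The producer later nests "customer" under "details".
after = {"contents": {"details": {"customer": {"name": "Ada"}}}}

print(customer_name(before))
print(customer_name(after))
```

No exception is raised for the second payload, so the consumer keeps running and passes a null name downstream.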

The burden of detecting invalid or changed data becomes the consumer's responsibility. The consumer becomes a canary for detecting changes and alerting the producer. Producers and consumers are often on different teams, so every data change ends up requiring tickets, scheduling, PRs, and cross-team communication. This gets extremely expensive, especially considering that the issue is completely solvable with schema-on-write.

Avoiding Schema-on-read

Explicit Contracts - (Typed enforceable)

Schema-on-write is achieved by leveraging explicit, enforceable schemas. Tools like Protocol Buffers, relational databases, and API validation all enforce schema on write. These approaches validate data when it is created, which avoids complexity on the consumer side and eliminates long feedback loops between producers and consumers.
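To make write-time enforcement concrete without pulling in a schema compiler, the sketch below uses a standard-library dataclass as a stand-in for a real schema tool such as Protocol Buffers. The `CustomerEvent` type, its fields, and the `publish` function are assumptions made for illustration only.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class CustomerEvent:
    customer_id: int
    email: str
    active: bool

    def __post_init__(self):
        # Enforce the declared types at creation time. This relies on the
        # annotations being real classes (no postponed annotations).
        for f in fields(self):
            value = getattr(self, f.name)
            if type(value) is not f.type:
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )


def publish(event):
    # Only validated events can reach this point.
    return json.dumps(asdict(event)).encode("utf-8")


payload = publish(CustomerEvent(customer_id=1, email="a@example.com", active=True))

try:
    CustomerEvent(customer_id=1, email="a@example.com", active="true")
    error = None
except TypeError as exc:
    error = str(exc)

print(payload, error)
```

The producer that tries to send `"true"` fails in its own process, so the feedback loop is measured in seconds rather than in cross-team tickets.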

Contract Driven Design

Contract driven design treats data as a first class entity and promotes defining the data as the first step in design. Teams can work together to create data definitions and encode those definitions in a tool like Protocol Buffers. With such tools, language-specific bindings can be generated automatically from a given definition.
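The idea of defining the data first and deriving bindings from it can be approximated in plain Python. In this sketch, `ORDER_CONTRACT` is a hypothetical definition agreed on by both teams, and `dataclasses.make_dataclass` stands in for the code generation step that a tool like Protocol Buffers would perform. JSON is used as the wire format only to keep the example self-contained.

```python
import json
from dataclasses import asdict, make_dataclass

# Hypothetical contract, written down before any producer or consumer code.
ORDER_CONTRACT = [
    ("order_id", int),
    ("customer_email", str),
    ("total_cents", int),
]

# Stand-in for generated bindings: a class derived from the contract.
Order = make_dataclass("Order", ORDER_CONTRACT, frozen=True)


def encode(order):
    # Producer side: serialize an instance of the shared type.
    return json.dumps(asdict(order)).encode("utf-8")


def decode(raw):
    # Consumer side: Order(**data) raises TypeError when fields are
    # missing or unexpected, so structural drift fails loudly.
    data = json.loads(raw.decode("utf-8"))
    return Order(**data)


original = Order(order_id=7, customer_email="a@example.com", total_cents=1999)
round_tripped = decode(encode(original))
print(round_tripped == original)
```

Both sides import the same definition, so a change to the contract is visible to producers and consumers at the same time. Note that this stand-in only checks field names, not value types; a real schema tool checks both.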

Conclusion

Schema-on-read may look alluring when starting new projects because of the perception that less work is involved in sharing data. In practice, more work is involved because of the complexity of consumer processes and the burden of imposing structure and types. Structured serialization formats such as Protocol Buffers, Avro, and Thrift make it trivial to serialize language-specific objects using a schema and trivial to deserialize data back into language-specific objects.

Accepting schema-on-read is like supporting a REST API that must accept any arbitrary payload. It doesn't work for any party involved, especially consumers. Data benefits from explicit contracts in the same way REST APIs enforce known allowed inputs and outputs and refuse to accept arbitrary data.

Happy Hacking!