
Systems and Tech Thoughts

It’s Never Been Easier To Get Started With Data

November 22, 2020

The low cost of storage, combined with systems like Presto that decouple compute from storage, makes it easier than ever to get started with in-house data solutions. Leveraging data to make decisions and understand customers can directly impact business financials. This article explores how two features of the current data landscape, low-cost storage and decoupled compute, make this new world possible, and explains why now is the right time to hone a data-driven strategy.

What’s At Stake?

Money, customers, and unrealized opportunities. Developing a customer-focused view through data analytics allows companies to better serve their customers. Data lets a company build a full picture of the customer experience, such as full order history and yearly value, so it can better fit solutions to customers.

Companies such as Google, Amazon, Netflix, Kroger, Nordstrom and countless others are achieving this vision through the use of data. I hope the value of data in business is obvious at this point. It’s necessary for visibility and feedback on decisions. Companies without data are essentially flying blind.

What’s Different About Now?

Four main factors led to the revolution in data technology, and make now the perfect time to get started with data:

  • Streaming Systems With Connector Ecosystems
  • Cheap Cloud-Based Storage
  • Decoupled Compute
  • Open Source Visualization Tools

Each component has mature open source or cloud based solutions.

Streaming Systems - “Getting Data”

Figure: Kafka Connect. Source: Robin Moffatt

Streaming systems like Kafka enable real-time data ingestion and transfer. Most importantly, they have rich ecosystems with many built-in connectors. These connectors make it super easy to move data around. Change Data Capture (CDC) connectors can be bolted on to existing databases to turn a standard RDBMS into a streaming event source.

These components can also put data in systems like S3 or analytic data systems. Leveraging streaming platforms like Confluent, Kafka or AWS Kinesis Firehose outsources complexity and means that companies no longer have to build high throughput distributed components in-house. Many cloud providers offer hosted Kafka. When combined with Kafka Connect it’s never been easier to start moving data.
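As a rough illustration of bolting a Change Data Capture connector onto an existing database, the sketch below builds the JSON body that Kafka Connect's REST API accepts on POST /connectors. It assumes Debezium's PostgreSQL connector, which the article does not name, and the connector name, host, credentials and table names are all hypothetical. The `topic.prefix` key is the Debezium 2.x name; older releases used a different key, so check the version you run.

```python
import json


def build_cdc_connector_request(name, host, database, tables):
    """Build the JSON body for registering a CDC source connector.

    Kafka Connect accepts this {"name": ..., "config": {...}} shape
    via POST /connectors on its REST API.
    """
    return {
        "name": name,
        "config": {
            # Assumption: Debezium's PostgreSQL connector. Other databases
            # use other connector classes.
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "cdc_user",       # hypothetical credentials
            "database.password": "change-me",  # use a secret store in practice
            "database.dbname": database,
            # Prefix for the topics the connector writes to (Debezium 2.x key).
            "topic.prefix": name,
            # Only these tables are captured as change events.
            "table.include.list": ",".join(tables),
        },
    }


# Hypothetical values for illustration only.
request_body = build_cdc_connector_request(
    name="orders-cdc",
    host="orders-db.internal",
    database="shop",
    tables=["public.orders", "public.customers"],
)
print(json.dumps(request_body, indent=2))
```

Posting this body to a running Kafka Connect worker would register the connector. The example stops at building the request so it runs without any infrastructure.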

Cheap Cloud Storage - “Storing Data”

The availability of cheap cloud storage makes it possible to retain huge amounts of raw data at low cost. Low-cost cloud storage enables Extract Load Transform (ELT), which loads all of the raw data into storage first and transforms it afterwards. This contrasts with the prior generation’s approach of Extract Transform Load (ETL). ETL extracts and transforms data in a single atomic step before loading it, which results in losing the raw data. Cheap cloud storage is also essential for enabling decoupled compute.
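The ELT idea can be sketched in a few lines. This is a toy example under stated assumptions: a temporary local directory stands in for cloud storage, and the event fields (`type`, `customer`, `amount`) are made up for illustration. The point is only that the raw events are landed untouched and the transform runs later against the retained copy.

```python
import json
import tempfile
from pathlib import Path


def land_raw(events, raw_dir):
    """Load step: write events untouched, one JSON object per line."""
    raw_file = Path(raw_dir) / "events.jsonl"
    with raw_file.open("w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return raw_file


def revenue_by_customer(raw_file):
    """Transform step: runs later, reading from the retained raw data."""
    totals = {}
    with Path(raw_file).open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "order":
                customer = event["customer"]
                totals[customer] = totals.get(customer, 0) + event["amount"]
    return totals


# Hypothetical events; a real pipeline would receive these from a stream.
events = [
    {"type": "order", "customer": "a", "amount": 30},
    {"type": "page_view", "customer": "a", "path": "/pricing"},
    {"type": "order", "customer": "b", "amount": 12},
    {"type": "order", "customer": "a", "amount": 5},
]

with tempfile.TemporaryDirectory() as raw_dir:
    raw_file = land_raw(events, raw_dir)
    totals = revenue_by_customer(raw_file)
    # The raw file still holds every event, including the page view
    # that this particular transform ignored.
    with raw_file.open(encoding="utf-8") as f:
        retained = sum(1 for _ in f)

print(totals)    # {'a': 35, 'b': 12}
print(retained)  # 4
```

Because the page view is still sitting in the raw file, a new transform (say, page views per customer) can be written later without re-extracting anything. An ETL pipeline that kept only the revenue totals could not do that.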

Decoupled Compute - “Processing Data”

Figure: Presto architecture. Source: PrestoDB

Systems like Presto, Snowflake, and Redshift Spectrum make it easier than ever to process data. Traditional data warehouses require that data be loaded before it is queried (the L in ETL). Distributed SQL engines like Presto can query the data where it sits. This means you can put some JSON or CSV on S3 and then query it directly using SQL, without explicitly loading it into a data warehouse! This blows open the ability of companies to begin analyzing their data.

Suppose you have a high-throughput web application hosted on AWS that generates 500GB of log data a day, all of which ends up on S3. Presto-like engines enable you to query that log data where it sits. This means you can execute SQL statements directly against the data that lives on S3! Before Presto, indexing and querying this data would have been a big data problem. With Presto it’s as simple as defining a schema, loading the partitions and then issuing queries.
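The three steps above (define a schema, load the partitions, issue queries) might look roughly like the statements generated below. This is a sketch under several assumptions: Presto's Hive connector is mounted as a catalog named `hive`, the logs are JSON files laid out on S3 in Hive-style `dt=YYYY-MM-DD/` directories, and the bucket, schema, table and column names are all hypothetical. The exact S3 URI scheme and table properties depend on how the cluster is configured.

```python
def presto_statements(schema, table, s3_location, day):
    """Return three statements: define schema, load partitions, query.

    Values are interpolated directly, so pass trusted constants only.
    """
    # Step 1: describe the files already on S3 as an external table.
    # The partition column is listed last in the column list.
    create_table = f"""
CREATE TABLE hive.{schema}.{table} (
    request_path VARCHAR,
    status_code  INTEGER,
    latency_ms   DOUBLE,
    dt           VARCHAR
)
WITH (
    format = 'JSON',
    external_location = '{s3_location}',
    partitioned_by = ARRAY['dt']
)""".strip()

    # Step 2: register the partition directories found under the location.
    sync_partitions = (
        f"CALL hive.system.sync_partition_metadata('{schema}', '{table}', 'ADD')"
    )

    # Step 3: query in place; the dt filter limits the scan to one partition.
    query = f"""
SELECT status_code, count(*) AS requests
FROM hive.{schema}.{table}
WHERE dt = '{day}'
GROUP BY status_code
ORDER BY requests DESC""".strip()

    return [create_table, sync_partitions, query]


# Hypothetical bucket and names for illustration only.
statements = presto_statements(
    schema="logs",
    table="requests",
    s3_location="s3://example-bucket/logs/requests/",
    day="2020-11-22",
)
for statement in statements:
    print(statement + ";\n")
```

The statements can be submitted through the Presto CLI or any Presto client library. Partitioning by day matters at this volume: at 500GB a day, a query filtered to a single `dt` only has to read that day's files and not the full history.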

Open Source Visualization - “Visualizing Data”

Figure: Superset dashboard. Source: Superset GitHub

The final step is visualizing the data. The visualization space is maturing, and it’s easier than ever to point an open source or low-cost cloud visualization tool like Superset, Metabase or Tableau at your data. Visualization or BI tools are essential for enabling non-technical business decision makers. They also enable self-service use of data, where each team or business unit can maintain a view into the data that benefits them.

More Accessible Than Ever

The offerings above combine to create an environment in which it’s easier than ever to get started with data. These technologies are so accessible they can be “bolted on” to current or legacy solutions. It’s a really fun time to be working in data. The tools above lower the barrier to entry for harnessing data and using it to make a real financial impact on the business.

