3 minute read / Oct 26, 2022
9 Predictions for Data in 2023
These are my 9 predictions. A year from now, I’ll score them to see how I did.
- Cloud data warehouses (CDW) will process 75% of workloads by 2024. In the last five years, CDWs have grown from 20% of the workloads to 50%, with on-prem databases constituting the remainder. Meanwhile, the industry has grown from $36b to $80b during that time.
- Data workloads will segment by use case into three groups. First, in-process databases like DuckDB will grow to dominate local analysis even for massive files. Second, CDWs will retain classic BI & exploration uses. Third, cloud data lakehouses will serve jobs operating on massive data & jobs that don’t require the lowest latency - and do it at half the storage price.
- Metrics layers will unify the data stack. Today, there are two different forks in data. The first fork uses ETL to pump data into a CDW, then to a BI or data exploration tool. The second fork, the machine learning stack, is identical save for the outputs: model serving & model training. The metric layer will become the single place metrics & features are defined, unifying the stack & potentially moving model serving & training into the database.
- Large language models will change the role of data engineers. I recorded a video of myself writing code to produce charts & embedded it in the presentation. The video shows GitHub Copilot magically creating a chart for the DuckDB star growth. Copilot ingests a comment, writes the code, & even adds my custom theme function. When I execute the code, it works. Technologies like this will push data engineering work to a higher plane of abstraction.
- WebAssembly or WASM will become an essential part of end-user facing data apps. WASM is a technology that accelerates browser software. Pages load faster, data processing is speedier & users are happier. Every major browser supports WASM & consequently, anyone producing a data app for an end user will use it.
- Notebooks will win 20% of Excel users. Of the 1b global Excel users, 20% will become prosumers, writing Python/SQL to analyze data. They will do it in notebooks like Jupyter, which are easily shared, reproducible & version controlled. Those notebooks will become data apps used by end users inside companies, replacing brittle Excel & Google Sheets.
- SaaS applications will use the CDW as a backend for both reading & writing. Today, sales, marketing, & finance data exist in disparate systems. ETL systems use APIs to push that data into the CDW for analysis. In the future, software products will build their apps on top of the CDW to take advantage of centralized security, faster procurement processes, & adjacent data. These systems will also write back to the CDW.
- Data Observability becomes a Must Have. Software engineers measure the success of their efforts through up-time. 99.9% or three-nines of up-time means only about 1 hour of downtime per 1000 hours. Today’s data teams see 70 incidents per 1000 tables. Data teams will align on data uptime/accuracy metrics & drive to the three-nines equivalent, using data observability tools to measure their performance.
- The Decade of Data Continues. Data startups raised more than $60b in total in 2021, more than 20% of all venture dollars raised. We’re still in the early innings of this foundational movement.
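The three-nines arithmetic behind the data observability prediction can be checked with a short sketch. It is only an illustration: the function names are hypothetical, and treating "1 incident per 1000 tables" as the data equivalent of three-nines is an assumption, not a definition from any observability tool.

```python
def allowed_downtime_hours(availability: float, window_hours: float = 1000.0) -> float:
    """Hours of downtime permitted in a window at a given availability (e.g. 0.999)."""
    return (1.0 - availability) * window_hours


def incidents_per_table(incidents: int, tables: int) -> float:
    """Average number of data incidents per table."""
    return incidents / tables


# Three-nines of up-time leaves roughly 1 hour of downtime per 1000 hours.
downtime = allowed_downtime_hours(0.999)

# Figure quoted in the post: 70 incidents per 1000 tables.
current_rate = incidents_per_table(70, 1000)

# Assumption: the three-nines equivalent for data is 1 incident per 1000 tables.
target_rate = incidents_per_table(1, 1000)

# How far today's rate is from that assumed target.
gap = current_rate / target_rate

print(f"Allowed downtime per 1000 hours: {downtime:.2f}")
print(f"Current incidents per table: {current_rate:.3f}")
print(f"Reduction needed to reach the target: {gap:.0f}x")
```

Under that assumption, data teams would need to cut incidents by roughly a factor of 70 to match the reliability bar software teams already hold themselves to.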
Thank you to the Monte Carlo team for the opportunity & the audience for the great questions at the end.