The world is full of situations where one size doesn’t fit all – footwear, healthcare, the number of sprinkles you want on a fudge sundae, to name a few. You can add data pipelines to the list.
Traditionally, a data pipeline manages connectivity to business applications, controls requests and data flow in new data environments, and then manages the steps necessary to cleanse, organize, and present a refined data product to consumers, inside or outside the walls of the company. These results have become essential to help decision-makers move their business forward.
Lessons from Big Data
Everyone knows the Big Data success stories: how companies like Netflix build pipelines that handle more than a petabyte of data every day, or how Meta analyzes over 300 petabytes of clickstream data within its analytics platforms. It’s easy to assume that we’ve already solved all the hard problems once we’ve reached that scale.
Unfortunately, it’s not that simple. Just ask anyone who works with operational data pipelines – they’ll be the first to tell you that one size certainly doesn’t fit all.
For operational data, which is the data that underpins critical parts of a business like finance, supply chain, and HR, organizations routinely fail to deliver value from analytics pipelines. This is true even though they were designed in a way that resembles Big Data environments.
Why? Because they’re trying to solve a fundamentally different data problem with essentially the same approach, and it doesn’t work.
The problem is not the size of the data, but its complexity.
Major social or digital streaming platforms often store large sets of data as a series of simple, ordered events. One line of data is captured in a data pipeline for a user watching a TV show, and another logs every “like” button clicked on a social media profile. All of this data is processed through data pipelines at tremendous speed and scale using cloud technology.
The datasets themselves are large, but that’s manageable because the underlying data is extremely well-ordered and managed to begin with. The highly organized structure of clickstream data means that billions and billions of records can be analyzed in no time.
Data pipelines and ERP platforms
For operational systems, however, such as the enterprise resource planning (ERP) platforms that most organizations use to run their essential day-to-day processes, it’s a very different data landscape.
Since their introduction in the 1970s, ERP systems have evolved to squeeze out every ounce of performance when capturing the raw transactions of the business environment. Every sales order, financial entry, and inventory item in the supply chain must be captured and processed as quickly as possible.
To achieve this performance, ERP systems have evolved to manage tens of thousands of individual database tables that track business data items and even more relationships between these objects. This data architecture is effective in ensuring the consistency of a customer’s or supplier’s records over time.
But, at the end of the day, what’s great for transaction speed within that business process usually isn’t great for analytics performance. Instead of the clean, simple, and well-organized tables created by modern online applications, there is a spaghetti-like mess of data spread across a complex, real-time, mission-critical application.
For example, analyzing a single financial transaction on a company’s books may require data from over 50 separate tables in the ERP backend database, often with multiple lookups and calculations.
To answer questions that span hundreds of tables and relationships, business analysts must write increasingly complex queries that often take hours to return results. Too often, the answers arrive too late, leaving the company blind at a critical moment in its decision-making.
To address this issue, organizations try to refine the design of their data pipelines, routing data into increasingly simplified business views that reduce the complexity of queries and make them easier to run.
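To make the join problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema, table names, and values (including the exchange rate) are invented for illustration and are far simpler than any real ERP backend; the point is only that even one invoice total already touches four tables.

```python
import sqlite3

# Hypothetical, heavily simplified ERP-style schema. Real systems spread
# the same information across many more tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE currencies (currency_code TEXT PRIMARY KEY, usd_rate REAL);
CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY,
                       customer_id INTEGER, currency_code TEXT);
CREATE TABLE invoice_lines (line_id INTEGER PRIMARY KEY, invoice_id INTEGER,
                            quantity INTEGER, unit_price REAL);

INSERT INTO customers VALUES (1, 'Acme');
INSERT INTO currencies VALUES ('USD', 1.0), ('EUR', 2.0);  -- made-up rate
INSERT INTO invoices VALUES (100, 1, 'EUR');
INSERT INTO invoice_lines VALUES (1, 100, 2, 10.0), (2, 100, 1, 5.0);
""")


def invoice_total_usd(invoice_id):
    """Return (customer name, invoice total in USD) for one invoice."""
    return conn.execute(
        """
        SELECT c.name, SUM(l.quantity * l.unit_price * cur.usd_rate)
        FROM invoices i
        JOIN customers c ON c.customer_id = i.customer_id
        JOIN currencies cur ON cur.currency_code = i.currency_code
        JOIN invoice_lines l ON l.invoice_id = i.invoice_id
        WHERE i.invoice_id = ?
        GROUP BY c.name
        """,
        (invoice_id,),
    ).fetchone()


result = invoice_total_usd(100)
```

Every additional lookup (tax codes, cost centers, payment terms) would add another join to the same query, which is how analytical queries against operational schemas grow so quickly.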
This could work in theory, but it comes at the cost of oversimplifying the data itself. Rather than allowing analysts to ask and answer any question with data, this approach frequently summarizes or reshapes the data to improve performance. This means analysts can get quick answers to predefined questions and wait longer for everything else.
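The trade-off can be shown with a toy example in plain Python. The order records and the `total_by` helper are hypothetical; the sketch only shows that a pre-summarized view answers the question it was built for and nothing else.

```python
from collections import defaultdict

# Hypothetical raw order records, as they might arrive from a source system.
raw_orders = [
    {"region": "EMEA", "product": "bolts", "amount": 100},
    {"region": "EMEA", "product": "nuts", "amount": 50},
    {"region": "APAC", "product": "bolts", "amount": 70},
]


def total_by(rows, field):
    """Sum the 'amount' of each row, grouped by the given field."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[field]] += row["amount"]
    return dict(totals)


# The pipeline ships only this simplified view to analysts.
region_summary = total_by(raw_orders, "region")

# Predefined question: answered instantly from the summary.
emea_revenue = region_summary["EMEA"]

# New question (revenue by product): the summary has discarded the product
# detail, so the only way to answer it is to go back to the raw records.
product_revenue = total_by(raw_orders, "product")
```

Once the product column has been aggregated away, no amount of querying the summary can bring it back.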
With inflexible data pipelines, asking new questions means going back to the source system, which is time-consuming and quickly becomes expensive. And if anything changes in the ERP application, the pipeline can break completely.
Rather than applying a static pipeline model that cannot respond effectively to more interconnected data, it is important to design this level of connection from the start.
Rather than simplifying the pipelines ever further to solve the problem, the design should instead embrace these connections. In practice, this means addressing the fundamental reason behind the pipeline itself: making data accessible to users without the time and cost associated with expensive analytical queries.
Each table connected in a complex analysis puts additional pressure on the underlying platform and on the people responsible for maintaining business performance by tuning and optimizing these queries. To rethink the approach, look at what can be optimized when the data is loaded, crucially before any query is executed. This is usually referred to as query acceleration, and it provides a useful shortcut.
This approach to query acceleration can deliver performance many times that of traditional data analysis, and it does so without requiring the data to be remodeled or summarized in advance. Because the entire data set is analyzed and prepared at load time, before any query runs, the questions analysts can ask are less constrained. It also improves the utility of each query by making the full breadth of raw business data available for exploration.
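The general idea of doing the expensive work at load time can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: the `PreparedStore` class, its fields, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


class PreparedStore:
    """Toy store that resolves table relationships once, at load time."""

    def __init__(self, orders, customers):
        # Load-time work: index customers by key and attach the related
        # attribute to every order, keeping all raw columns intact.
        customers_by_id = {c["customer_id"]: c for c in customers}
        self.rows = [
            {**order, "segment": customers_by_id[order["customer_id"]]["segment"]}
            for order in orders
        ]

    def total_by(self, field):
        # Query-time work: a single pass over prepared rows, no joins.
        totals = defaultdict(int)
        for row in self.rows:
            totals[row[field]] += row["amount"]
        return dict(totals)


customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "wholesale"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "product": "bolts", "amount": 100},
    {"order_id": 11, "customer_id": 2, "product": "bolts", "amount": 70},
    {"order_id": 12, "customer_id": 1, "product": "nuts", "amount": 50},
]

store = PreparedStore(orders, customers)
by_segment = store.total_by("segment")  # needs the customer relationship
by_product = store.total_by("product")  # a question nobody planned for
```

Because the rows are enriched rather than summarized, both the anticipated question and the unplanned one are answered from the same prepared data, with the join cost paid once when the data was loaded.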
By challenging fundamental assumptions about how we acquire, process and analyze our operational data, it is possible to simplify and streamline the steps needed to move from fragile and costly data pipelines to faster business decisions. Remember: one size does not fit all.
Nick Jewell is the Senior Director of Product Marketing at Incorta.