The Data-led Professional Podcast: What is a single source of truth and why do you need it?
Welcome to The Data-led Professional Podcast: a podcast dedicated to helping folks become data-led to build better products and experiences.
With Claudiu Murariu, CEO and Co-Founder of InnerTrends, and Arpit Choudhury, Founder of Data-led Academy. Featuring special guest Tejas Manohar, Co-Founder of Hightouch.
In this week’s episode, podcast hosts and special guest Tejas Manohar, co-founder of Hightouch - a Reverse ETL tool - discuss why organizations need a single source of truth for their customer data, the benefits of adopting this format, how data is analyzed and acted upon when a single source of truth is accessible, what data you should expect to find in your data warehouse, and so much more.
Want to listen to the whole episode? Check it out here:
Subscribe for more episodes on your favorite platform:
Select excerpts from the episode:
What is the single source of truth for customer data? And why do organizations need one?
[T]: Over the last 10 years, companies have gained access to more customer data than ever; there's just been an explosion of digital data. Whether you're a B2B company, or a B2C company, you have digital data about your customers not just from your application, but also from their interactions with you on social media, on your marketing content, your advertisements, your support tickets. With all of this information, there are a lot of insights that can be gathered about your customers that might influence the behavior of each team and business function around the company.
Having a single source of truth for customer data allows the company to operate on a single picture of what customers are doing across all of the mediums in which you're collecting this customer data. And with this single source of customer data, each of the business functions can have insights all the way down to what the customer is doing across all functions, rather than just in their one function.
A single source of truth for customer data is a dream for most companies. It’s always an incremental process to get there. But having more customer information inside each department can definitely make that department more efficient.
[C]: What I like about the source of truth is that “source” means there is this area of data that feeds into everything else. And “truth” means that it's accurate. It's not misleading, it doesn't have partial data, everything is in place.
Arpit, why do you think it’s a dream? Why isn’t it a reality?
[A]: It's a dream because it's something that people are constantly thinking about. It's not a reality because it's a really hard problem to solve. A lot of companies think that they already have a source of truth.
Where should the source of truth exist? From your experience, Tejas, where do you see most companies keep the source of truth?
[T]: In the early stages, it's really hard to say where the source of truth is for a company. But as the company scales, it gravitates towards an architecture where the source of truth across their company is actually the business intelligence layer that sits on top of the data warehouse.
Over the last seven to 10 years, it's become easier than ever for companies to dump all of their customer data from various data sources into a data warehouse, and then throw a BI tool on top of it that allows them to slice and dice that data by a variety of metrics and create reports and dashboards that the whole company relies on.
What we see when we talk to companies that have a software operation is that when they hit about 30 to 50 people, the data warehouse and the BI layer actually become the source of truth around the company, which means people are looking at dashboards in tools like Looker, Mode, or Tableau to find out core company metrics.
It doesn't necessarily mean that it holds the most accurate data at this stage. Oftentimes, companies need to invest more to actually tune that data and make it usable, accurate, and clean. But it is the place that contains all of the data, or the closest thing to all of the data. That's why it naturally becomes a source of truth within a company.
[C]: The problem we see is that a warehouse is built, data is sent to the warehouse, but the warehouse is not seen as a source of truth because the company doesn't really put effort into it; they just throw data at it, and a lot of data gets gathered. That warehouse should become the source of truth, but because there is not a good process in place to make sure that the data is accurate, it's normalized correctly, and can be linked in a good way, most companies don't use it as a source of truth.
Should the data warehouse be the source of truth, and should it have all the processes in place to be an accurate truth?
Can you shed some light on when customer data platforms can really be the source of truth or whether they can coexist with data warehouses?
[T]: Customer data platforms [like Segment] definitely aren't mutually exclusive with data warehouses. In fact, they really go hand in hand.
But what we find is that when companies hit a certain stage, they have customer data that originates from so many different systems that they need a place that can easily allow them to dump raw data into it, and then figure out what to do with it there. And that's where data warehouses have really excelled. And not just data warehouses’ technology, but also the ecosystem around them.
I think we are seeing the data warehouse and the BI layer become that source of truth around the company, as in: it's the place that people look if they can't easily find the answer that they're trying to find in their tool of choice.
But where the data warehouse falls short as a source of truth for a lot of companies today, and this is a problem that's being addressed, is in being a usable source of truth for most of the company. The biggest problem with data warehouses today is that for a lot of teams, they're not accessible, or they haven't been made accessible.
What data should I expect to find in the data warehouse?
[T]: The trend in data warehouses is that you can actually scale the storage layer of them separately from the compute layer, which allows you to serve different queries around the company. This is a technology advancement that's happened recently that basically allows you to dump as much data as possible in the data warehouse.
The general principle of data collection we're seeing when it comes to warehouses is: dump as much data as you want into the data warehouse. Extract it from your systems and from third-party systems, load it into the data warehouse, and then figure it out from there using a mixture of transformation tools and business intelligence tools in the ecosystem. You should aim to get all of your customer data into the data warehouse.
There are a few main categories of the customer data that you should expect to see in a good data warehouse:
First, you want Entity Data. You want to be able to query, what's the state of different entities that exist across my company (a user, a project, an integration, a specific setting someone switched on)? At a minimum, you want to be able to serve in the warehouse the same queries that your developers would be able to write against their application database that powers your company's application.
The next step is bringing into the warehouse more data than the bare minimum that you need to run your application. This consists of additional Entity Data from SaaS services. These are services that the company is using as a system of record of sorts for different business functions. As companies have gotten more digital, a lot of business functions can be bootstrapped with SaaS services instead of building things in-house: support systems like Zendesk, CRM systems like Salesforce, or billing systems like Stripe. So there's core data in those systems, almost as valuable as or more valuable than the data in your application, about how your customers are interacting with your company and with your brand. Dumping all of that data into the data warehouse is the next step.
The third thing that you should always expect inside of a good data warehouse is some sort of time series or event data. You want to know when a customer did certain things, in what sequence, and what that journey actually looked like. So it's not just about what the customer's data is today, but how it evolved over time. And that's super useful and super important for running analytics.
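To make the three categories above concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for a warehouse. The table names, column names, and sample rows are all hypothetical and only illustrate the idea: one table of application entities, one table of entities from a SaaS system of record, and one table of timestamped events.

```python
import sqlite3

# In-memory SQLite stands in for a real warehouse; every table and column
# name below is a hypothetical example, not a recommended schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1. Entity data copied from the application database
    CREATE TABLE app_users (user_id INTEGER PRIMARY KEY, email TEXT, plan TEXT);
    -- 2. Entity data extracted from a SaaS system of record (e.g. support)
    CREATE TABLE support_tickets (ticket_id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
    -- 3. Event (time-series) data: what happened, and when
    CREATE TABLE events (user_id INTEGER, name TEXT, occurred_at TEXT);

    INSERT INTO app_users VALUES (1, 'ana@example.com', 'pro'), (2, 'bo@example.com', 'free');
    INSERT INTO support_tickets VALUES (10, 1, 'open'), (11, 1, 'closed');
    INSERT INTO events VALUES
        (1, 'signed_up',       '2021-01-01T09:00:00'),
        (1, 'created_project', '2021-01-01T09:05:00'),
        (1, 'upgraded_plan',   '2021-01-03T14:30:00'),
        (2, 'signed_up',       '2021-01-02T11:00:00');
""")


def open_tickets_by_user(conn):
    """Current state: join application entities with SaaS entities."""
    return conn.execute("""
        SELECT u.email, COUNT(t.ticket_id)
        FROM app_users u
        LEFT JOIN support_tickets t
          ON t.user_id = u.user_id AND t.status = 'open'
        GROUP BY u.user_id
        ORDER BY u.user_id
    """).fetchall()


def journey(conn, user_id):
    """History: the ordered sequence of event names for one user."""
    rows = conn.execute(
        "SELECT name FROM events WHERE user_id = ? ORDER BY occurred_at",
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows]


print(open_tickets_by_user(conn))
print(journey(conn, 1))
```

The first query answers a "what is the state today" question across two systems, while the second reconstructs how one customer's activity unfolded over time, which entity tables alone cannot do.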
When it comes to getting data into the warehouse, there's bound to be messy data on the way in. It’s really important to transform and post-process it once it's there. For this, solutions like dbt (data build tool) are popping up. These are emerging tools in the data engineering and data analytics ecosystem that allow data analysts to create core tables and views in their data warehouse that represent core data models.
This allows people in the company to look at core metrics (the revenue of a user, the health score of the user, their churn risk, etc.) rather than individual data points.
My philosophy is that if you have the opportunity to put a transformation layer in your data warehouse and have analysts or data engineers build the core data models for your company, you should dump as much data as possible into the warehouse, so that you don't regret not having captured certain things before.
The source of truth around the company becomes that processed data (created out of raw data) and those models that have been built out to answer common questions that everyone around the company is asking.
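As a rough illustration of what such a transformation layer produces, the sketch below builds one core model as a SQL view on top of messy raw tables. It again uses in-memory SQLite as a stand-in for a warehouse, and the raw tables, the `user_revenue` view, and the cleaning rules are assumptions made for the example. In a tool like dbt, a model is similarly expressed as a SELECT statement that the tool materializes as a table or view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw tables as they might land in the warehouse (hypothetical schema).
    CREATE TABLE raw_users (user_id INTEGER, email TEXT);
    CREATE TABLE raw_payments (user_id INTEGER, amount_cents INTEGER, status TEXT);

    INSERT INTO raw_users VALUES (1, ' Ana@Example.com '), (2, 'bo@example.com');
    INSERT INTO raw_payments VALUES
        (1, 5000, 'succeeded'),
        (1, 5000, 'succeeded'),
        (1, 5000, 'failed'),
        (2, 1000, 'succeeded');

    -- Core model: one cleaned row per user with a revenue metric.
    -- Emails are trimmed and lowercased; failed payments are excluded.
    CREATE VIEW user_revenue AS
    SELECT u.user_id,
           LOWER(TRIM(u.email)) AS email,
           COALESCE(SUM(p.amount_cents), 0) / 100.0 AS revenue_usd
    FROM raw_users u
    LEFT JOIN raw_payments p
      ON p.user_id = u.user_id AND p.status = 'succeeded'
    GROUP BY u.user_id, u.email;
""")


def user_revenue(conn):
    """Read the core model the way a BI tool or analyst would."""
    return conn.execute(
        "SELECT user_id, email, revenue_usd FROM user_revenue ORDER BY user_id"
    ).fetchall()


print(user_revenue(conn))
```

People querying `user_revenue` get one agreed-upon definition of revenue per user instead of each re-deriving it from raw payment rows.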
[C]: You’re saying that the source of truth is a system, not just the data warehouse by itself. That’s a very interesting point.
[T]: This approach, however, has bandwidth concerns as you need to invest in making your data warehouse clean and in making the data model accessible and easy to understand. Otherwise, the source of truth will not really be a system, it will be the people who own the data analytics and the BI team that become a source of truth around the company, which is not what you want in order to create a data-driven culture.
[A]: Excellent point!
Let’s talk about tools
[T]: It's really interesting because we're seeing a modern data stack emerge as certain tools become popular in the ecosystem.
There’s the ETL (or ELT) layer: tools that help you get data into the warehouse from all of the systems that your company is using. Then you have the transformation layer (dbt), and then BI and analytics tools on top. Finally, you have a “reverse ETL” tool like Hightouch that pulls the processed data from the data warehouse and feeds it back to the core systems that each of the company’s teams use.
Closing the loop like this, all of these teams can rely on having a single view of the customer, from across the organization, that doesn't just exist in a BI tool. And you can define granular rules for how you want this data to show up in each system.
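The reverse ETL step described above can be sketched in a few lines. This is not how Hightouch or any specific product works; it is a minimal illustration, under assumed names, of reading a processed model from the warehouse and reshaping it with per-destination field rules. The `user_metrics` table, the `FIELD_MAPPINGS` rules, and the destination names are all hypothetical, and the sketch stops at building payloads rather than calling any real API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A processed model that already exists in the warehouse (hypothetical).
    CREATE TABLE user_metrics (email TEXT, revenue_usd REAL, churn_risk TEXT);
    INSERT INTO user_metrics VALUES
        ('ana@example.com', 100.0, 'low'),
        ('bo@example.com',   10.0, 'high');
""")

# Granular per-destination rules: warehouse column -> destination field.
# Destination names and field names are made up for this example.
FIELD_MAPPINGS = {
    "crm": {"email": "Email", "revenue_usd": "lifetime_revenue"},
    "support": {"email": "email", "churn_risk": "churn_risk_tag"},
}


def build_payloads(conn, destination):
    """Turn warehouse rows into records shaped for one destination."""
    mapping = FIELD_MAPPINGS[destination]
    columns = list(mapping)
    # Column names come from the trusted config above, not from user input.
    query = f"SELECT {', '.join(columns)} FROM user_metrics ORDER BY email"
    return [
        {mapping[col]: value for col, value in zip(columns, row)}
        for row in conn.execute(query)
    ]


for destination in FIELD_MAPPINGS:
    print(destination, build_payloads(conn, destination))
```

A real sync would also need to send these records to each system and track what has changed since the last run; those parts are left out here.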
Finally, another interesting term that is starting to emerge is operational analytics. Instead of making the analytics function of the company a reactive initiative, or something that only helps you analyze what's happened in the past, you take the insights that you've built in the BI layer and operationalize them so that they can influence how you talk to and interact with customers in the future.
Have any additional questions about building a single source of truth for customer data, or any topics you would like to hear covered on The Data-led Professional Podcast? Comment below! We can’t wait to hear from you.