Hey blog! It has been a while! Life happens and has a way of distracting us from some of the things we love. I have been busy with work and family, and still posting content elsewhere; some of it I might share here in the future, while the rest will stay where it is. It’s valuable content in the context in which it was created, but not necessarily valuable in the context of this blog. I have been thinking about how to get back into the swing of things, and I think I have a good topic for today. ...
What do we mean by Self-Serve
Self-serve, or self-service, has been around in various guises for decades, from Automatic Teller Machines (ATMs) through to self-serve analytics. With the advent of spreadsheet tools, self-serve analytics has arguably been around for a long time, and with the more recent introduction of dedicated self-serve Business Intelligence tools, self-serve has become embedded in the analytics vernacular. But what do we really mean when we say “self-serve”? In my experience, we all have different interpretations of what we mean, and those interpretations are specific to the context in which we operate. For some, self-serve could mean allowing the end-user to author their own reports. For others, it could mean allowing the end-user to be fully involved in the onboarding, curation and sharing of data. ...
Scheduling Databricks Cluster Uptime
Problem Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 minutes through to almost 10 minutes. For some workloads and users, this waiting time can be frustrating, if not unacceptable. For this use case, we had streaming clusters that needed to be available when streams started at 07:00 and turned off when streams stopped being sent at 21:00. Similarly, there was also a need from business users for their SQL Warehouse clusters to be available when the business started trading, so that their BI reports didn’t time out waiting for the clusters to start. ...
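As a rough sketch of how the start/stop could be automated, the snippet below calls the Databricks REST API from Python. The workspace URL, token handling and cluster/warehouse IDs are placeholders, and on older workspaces the SQL Warehouse endpoints may still be exposed under sql/endpoints rather than sql/warehouses.

```python
import os
import requests

# A sketch only: the workspace URL is a placeholder, the personal access token is
# assumed to be in the DATABRICKS_TOKEN environment variable, and the cluster /
# warehouse IDs come from the workspace.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def start_cluster(cluster_id: str) -> None:
    """Start a terminated interactive cluster."""
    requests.post(f"{HOST}/api/2.0/clusters/start",
                  headers=HEADERS, json={"cluster_id": cluster_id}).raise_for_status()

def stop_cluster(cluster_id: str) -> None:
    """Terminate the cluster; its configuration is kept, so it can be restarted later."""
    requests.post(f"{HOST}/api/2.0/clusters/delete",
                  headers=HEADERS, json={"cluster_id": cluster_id}).raise_for_status()

def start_warehouse(warehouse_id: str) -> None:
    """Start a SQL Warehouse; the matching stop endpoint ends in /stop."""
    requests.post(f"{HOST}/api/2.0/sql/warehouses/{warehouse_id}/start",
                  headers=HEADERS).raise_for_status()
```

Functions like these can then be triggered at 07:00 and 21:00 by whatever scheduler suits the platform, such as an Azure Automation runbook or a small orchestrator job.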
CI / CD With Synapse Serverless
Context A project that I’m working on uses Azure Synapse Serverless as a serving layer option for its data platform. The main processing and transformation of data is done using Databricks, with the resulting data made available as Delta files. Our processes ensure that the Delta files are registered automatically within Databricks as Delta Tables, but there is no native way to register Delta objects in Synapse. Therefore, we’ve gone down the route of creating a series of Stored Procedures in Synapse, which can be called from Databricks, to register the Delta files as views within Synapse. There are performance considerations to be understood with this approach, namely the use of loosely typed data. For example, a string in Delta is converted to VARCHAR(8000) in Synapse. This approach isn’t for everyone, so use it with caution. ...
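To give a flavour of what that registration ends up doing, here’s a minimal sketch, run from a Databricks notebook via pyodbc, that exposes a single Delta path as a serverless view using OPENROWSET. The endpoint, credentials, schema and storage path are all hypothetical, and the actual Stored Procedures wrap logic along these lines rather than being defined exactly like this.

```python
import pyodbc  # requires the Microsoft ODBC driver to be installed on the Databricks cluster

# Hypothetical serverless endpoint, database, login and storage path; dbutils is the
# Databricks notebook utility. It's assumed the serverless pool already has rights
# to read the storage account (e.g. via its managed identity).
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=serving;"
    "UID=svc_databricks;"
    "PWD=" + dbutils.secrets.get("synapse", "sql-password")
)

delta_path = "https://mystorage.dfs.core.windows.net/lake/gold/sales/"

# Serverless SQL can query Delta directly via OPENROWSET, so a view is enough to
# expose the table. Without an explicit WITH (...) clause the result is loosely
# typed - this is where Delta strings surface as VARCHAR(8000).
create_view = f"""
CREATE OR ALTER VIEW gold.sales AS
SELECT *
FROM OPENROWSET(
    BULK '{delta_path}',
    FORMAT = 'DELTA'
) AS src;
"""

with pyodbc.connect(conn_str, autocommit=True) as conn:
    conn.execute(create_view)
```

The idea is that the CREATE OR ALTER / OPENROWSET statement sits inside a Stored Procedure with the schema, view name and path as parameters, so Databricks only needs to pass those three values.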
Why Data Quality is Important
Data is among the most valuable assets for any organisation. Without data, the ability to make informed decisions is diminished. So it stands to reason that Data Quality is incredibly important to any organisation. If data doesn’t meet the expectations of accuracy, validity, completeness, and consistency that an organisation sets for it, then it could have severe implications for the organisation. Conversely, if data does meet those expectations, then it is a real asset that can be used to drive value across an organisation. ...
Using and Abusing Auto Loader's Inferred Schema
Problem Databricks’ Auto Loader has the ability to infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema, or with a wide and complex schema that you don’t always want to define up-front. But what happens if the inferred schema isn’t the schema you were expecting, or it contains fields which you definitely don’t want to ingest, like PCI or PII data fields? ...
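As a hedged illustration of the pattern, the snippet below lets Auto Loader infer and track the schema, nudges one type with a schema hint, and drops a couple of made-up sensitive columns before they land in the bronze table. The paths, table and column names are purely illustrative.

```python
# spark here is the notebook's SparkSession; paths, table and column names are
# purely illustrative.
raw_path = "abfss://landing@mystorage.dfs.core.windows.net/orders/"
schema_location = "/mnt/autoloader/orders/schema"

df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_location)   # where the inferred schema is tracked
        .option("cloudFiles.inferColumnTypes", "true")          # infer types instead of defaulting to string
        .option("cloudFiles.schemaHints", "order_total DECIMAL(10,2)")  # override a poorly inferred type
        .load(raw_path)
        .drop("card_number", "customer_email")                  # hypothetical PCI/PII fields we never want
)

(
    df.writeStream
      .option("checkpointLocation", "/mnt/autoloader/orders/checkpoint")
      .trigger(once=True)
      .toTable("bronze.orders")   # DataStreamWriter.toTable requires Spark 3.1+
)
```

Note that the inferred schema tracked at schema_location will still contain those fields; the drop only keeps them out of the target table.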
Using Auto Loader on Azure Databricks with AWS S3
Problem Recently on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume from AWS S3 into our Azure-hosted data platform. The reason we opted for Auto Loader over any other solution is that it exists natively within Databricks and allows us to quickly ingest data from Azure Storage Accounts and AWS S3 Buckets, while using the benefits of Structured Streaming to checkpoint which files it last loaded. It also means we’re less dependent upon additional systems to provide that “what did we last load” context. ...
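A minimal sketch of the setup, assuming the AWS keys live in a Databricks secret scope (the scope, key and bucket names are made up): the keys are applied to the session’s Hadoop configuration so that s3a:// paths resolve from the Azure workspace, and Auto Loader then reads from the bucket.

```python
# dbutils and sc are the Databricks notebook utilities; the secret scope, keys and
# bucket name are made up. Cluster-level spark.hadoop.fs.s3a.* config is an
# alternative to setting the Hadoop configuration in the notebook.
access_key = dbutils.secrets.get(scope="aws", key="access-key")
secret_key = dbutils.secrets.get(scope="aws", key="secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.schemaLocation", "/mnt/autoloader/s3_orders/schema")
        .load("s3a://my-source-bucket/landing/orders/")
)

(
    df.writeStream
      .option("checkpointLocation", "/mnt/autoloader/s3_orders/checkpoint")
      .trigger(once=True)
      .toTable("bronze.s3_orders")
)
```

This relies on Auto Loader’s default directory-listing mode; file-notification mode is also possible but needs SQS/SNS configured on the AWS side, which is a separate consideration.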
Data Product Fictional Case Study: Retail
Background In a previous post, we explored what the data domains could look like for our fictional retailer - XclusiV. In this post, we will explore how the data products could work in this fictional case study, including how pure data consumers would handle the data, particularly those consumers who have a holistic view of an organisation (a group of consumers for whom a traditional analytical model is also perfect). ...
Data Domain Fictional Case Study: Retail
In previous posts we’ve looked at what Data Mesh is and gone into greater detail on its principles. In this next series of posts I want to use a fictional case study to explore how the underlying principles could work in practice. This post will introduce the fictitious company; the challenges it faces; and how the principle of decentralised data ownership and architecture, with domain alignment, would work. Fictitious Company: XclusiV XclusiV is a luxury retailer operating in multiple countries. It has two divisions that operate almost as separate businesses, which we will call Division X and Division V. Within each market that it operates in, the Point-of-Sale (POS) and Enterprise Resource Planning (ERP) systems are the same for both divisions, but the POS and ERP systems can vary between markets. ...
Databricks Labs: Data Generator
Databricks recently released the public preview of a Data Generator for use within Databricks to generate synthetic data. This is particularly exciting, as the Information Security manager at a client recently requested that synthetic data be generated for use in all non-production environments as a feature of a platform I’ve been designing for them. The Product Owner decided at the time that it was too costly to implement any time soon, but this release from Databricks makes the requirement for synthetic data much easier and quicker to realise and deliver. ...
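For a feel of the API, here’s a minimal sketch using the dbldatagen library; the row count, table and column names are made up purely for illustration.

```python
# A sketch with made-up row counts, table and column names; dbldatagen is installed
# with `pip install dbldatagen` and spark is the notebook's SparkSession.
import dbldatagen as dg
from pyspark.sql.types import IntegerType, StringType

spec = (
    dg.DataGenerator(spark, name="synthetic_customers", rows=100000, partitions=4)
      .withIdOutput()                                                        # emit the seed id column
      .withColumn("customer_code", StringType(), prefix="CUST", baseColumn="id")
      .withColumn("age", IntegerType(), minValue=18, maxValue=90, random=True)
      .withColumn("country", StringType(), values=["GB", "FR", "IT", "US"], random=True)
)

synthetic_df = spec.build()
synthetic_df.write.mode("overwrite").saveAsTable("nonprod.customers")
```

spec.build() returns a regular DataFrame, so the synthetic data can be written wherever the non-production environments expect it to live.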