Scheduling Databricks Cluster Uptime
Problem Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 mins through to almost 10 mins. For some workloads and users, this waiting time can be frustrating if not unacceptable. For this use case, we had streaming clusters that needed to be available for when streams started at 07:00 and to be turned off when streams stopped being sent at 21:00.
CI / CD With Synapse Serverless
Context A project that I’m working on uses Azure Synapse Serverless as a serving layer option for its data platform. The main processing and transformation of data is achieved using Databricks, with the resulting data being made available as a Delta file. Our processes ensure that the Delta files are registered automatically within Databricks as Delta Tables, but there is no native way to register Delta objects in Synapse. Therefore, we’ve gone down a route of creating a series of Stored Procedures in Synapse - which can be called from Databricks - which register the Delta files as views within Synapse.
Why Data Quality is Important
Data is among the most valuable assets for any organisation. Without data, the ability to make informed decisions is diminished. So it stands to reason that Data Quality is incredibly important to any organisation. If data doesn’t meet the expectations of accuracy, validity, completeness, and consistency that an organisation sets it, then the data could have severe implications for the organisation. Conversely, if data does meet those expectations, then it is a real asset that can be used to drive value across an organisation.
Using and Abusing Auto Loader's Inferred Schema
Problem Databricks' Auto Loader has the ability to infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema or a wide and complex schema, which you don’t always want to define up-front. But what happens if the schema that has been inferred isn’t the schema you were expecting or it contains fields which you definitely don’t want to ingest - like PCI or PII data fields?
Using Auto Loader on Azure Databricks with AWS S3
Problem Recently on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume from AWS S3 into our Azure hosted data platform. The reason why we opted for Auto Loader over any other solution is because it natively exists within Databricks and allows us to quickly ingest data from Azure Storage Accounts and AWS S3 Buckets, while using the benefits of Structured Streaming to checkpoint which files it last loaded.
Data Product Fictional Case Study: Retail
Background In a previous post, we explored what the data domains could look like for our fictional retailer - XclusiV. In this post, we will explore how the data products could work in this fictional case study, including how pure data consumers would handle the data - particularly those consumers who have a holistic view of an organisation (also a group of consumers for whom a traditional analytical model is perfect).
Data Domain Fictional Case Study: Retail
In previous posts we’ve understood what is Data Mesh and gone into greater detail with regards to the principles. In this next series of posts I want to use a fictional case study to explore how the underlying principles could work in practice. This post will introduce the fictitious company; the challenges it faces; and how the principle of decentralised data ownership and architecture, with domain alignment, would work. Fictitious Company: XclusiV XclusiV is a luxury retailer operating in multiple countries.
Databricks Labs: Data Generator
Databricks recently released the public preview of a Data Generator for use within Databricks to generate synthetic data. This is particularly exciting as the Information Security manager at a client recently requested synthetic data to be generated for use in all non-production environments as a feature of a platform I’ve been designing for them. The Product Owner decided at the time that it was too costly to implement any time soon, but this release from Databricks makes the requirement for synthetic data much easier and quicker to realise and deliver.
Data Mesh Deep Dive
In a previous post, we laid down the foundational principles of a Data Mesh, and touched on some of the problems we have with the current analytical architectures. In this post, I will go deeper into the underlying principles of Data Mesh, particularly why we need an architecture paradigm like Data Mesh. Let’s start with why we need a paradigm like Data Mesh. Why do we need Data Mesh? In my previous post, I made the bold claim that analytical architectures hadn’t fundamentally progressed since the inception of the Data Warehouse in the 1990s.
What is Data Mesh?
To be able to properly describe what Data Mesh is, we need to contextualise in which analytical generation we currently are, mostly so that we can describe what it is not. Analytical Generations The first generation of analytics is the humble Data Warehouse and has existed since the 1990s and, while being mature and well known, is not always implemented correctly and, even the purest of implementation, comes under the strain of creaking and complex ETLs as it has struggled to scale with the increased volume of data and demand from consumers.
Tabular Automation with TMSL and PowerShell: SQL Bits
At SQL Bits, Europe’s largest data conference, I presented a Tabular Automation with TMSL and PowerShell. By now, I was a seasoned speaker but was still quite nervous, especially as the attendees of my session included members from Microsoft’s development team. Feedback from them, was that it was nice to demonstrate how TMSL could be used outside what it was originally intended to do. Video to the session can be found on the SQL Bits website.
Deep Dive Into Data Lakes: SQL Bits
At SQL Bits, Europe’s largest data conference, I presented a Deep Dive Into Data Lakes. It was my first solo speaking session and was the 08:00 session on the Saturday - right after the Friday night party. Video to the session can be found on the SQL Bits website.