I am a passionate data engineer and analytics consultant with over 10 years of experience. I help my customers overcome their data challenges, allowing them to unlock value and gain valuable insights from their data.
Context
A project that I’m working on uses Azure Synapse Serverless as a serving layer option for its data platform. The main processing and transformation of data is done in Databricks, with the resulting data made available as Delta files. Our processes ensure that the Delta files are registered automatically within Databricks as Delta Tables, but there is no native way to register Delta objects in Synapse. Therefore, we’ve created a series of Stored Procedures in Synapse, callable from Databricks, which register the Delta files as views within Synapse.
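To illustrate the pattern, here is a minimal sketch of the Databricks side of that handshake, calling the Serverless SQL endpoint with pyodbc. The stored procedure name (dbo.RegisterDeltaView), its parameters, and all connection details are assumptions for illustration only; the real procedure would wrap a dynamic CREATE VIEW over OPENROWSET(..., FORMAT = 'DELTA') pointing at the Delta file's storage location.

```python
import pyodbc

# Illustrative sketch: the server, database, credentials, and stored
# procedure below are placeholders, not the project's actual names.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=serving;"
    "UID=svc_databricks;PWD=<secret>;"
)
cursor = conn.cursor()

# Hypothetical procedure that issues a dynamic
# CREATE VIEW <schema>.<view> AS
#   SELECT * FROM OPENROWSET(BULK '<delta path>', FORMAT = 'DELTA') AS r
cursor.execute(
    "EXEC dbo.RegisterDeltaView @SchemaName = ?, @ViewName = ?, @DeltaPath = ?",
    "dbo",
    "sales_orders",
    "https://mylake.dfs.core.windows.net/curated/sales_orders/",
)
conn.commit()
conn.close()
```

Calling the procedure from the Databricks job that produced the Delta file keeps registration in lockstep with the data itself, rather than relying on a separate deployment step in Synapse.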
Data is among the most valuable assets for any organisation. Without data, the ability to make informed decisions is diminished. So it stands to reason that Data Quality is incredibly important to any organisation. If data doesn’t meet the expectations of accuracy, validity, completeness, and consistency that an organisation sets for it, then it can have severe implications for the organisation. Conversely, if data does meet those expectations, then it is a real asset that can be used to drive value across an organisation.
Problem
Databricks' Auto Loader can infer a schema from a sample of files. This means you don’t have to provide a schema yourself, which is really handy when you’re dealing with an unknown schema, or with a wide and complex schema that you don’t want to define up-front. But what happens if the inferred schema isn’t the schema you were expecting, or it contains fields you definitely don’t want to ingest, like PCI or PII data fields?
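As a reference point, a minimal Auto Loader read with schema inference might look like the sketch below. The paths are placeholders, and `spark` is the SparkSession that Databricks notebooks provide.

```python
# Runs in a Databricks notebook, where `spark` is the ambient SparkSession.
# Both paths are placeholders for this project's storage layout.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader samples incoming files, infers a schema, and persists it
    # here so it can be tracked and evolved across runs.
    .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/orders")
    # By default inferred columns are strings; this asks Auto Loader to
    # infer richer column types from the sampled data.
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/mnt/lake/raw/orders")
)
```

The convenience cuts both ways: the schema that lands in the schema location is entirely data-driven, so an unexpected or sensitive field in the source files flows straight through unless something checks for it.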