Using and Abusing Auto Loader's Inferred Schema
Problem
Databricks’ Auto Loader has the ability to infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema, or a wide and complex schema that you don’t always want to define up-front.
But what happens if the schema that has been inferred isn’t the schema you were expecting, or it contains fields which you definitely don’t want to ingest, like PCI or PII data fields?
Solution
Schema inference in Auto Loader works by sampling the data: it reads either the first 50 GB or the first 1,000 files it discovers, whichever threshold is crossed first.
By default, a schema will be inferred at every run, but that’s not efficient, so we need to supply a schema location where the inferred schema is stored. At every subsequent run, Databricks will retrieve that stored schema and use it.
However, unless you set cloudFiles.inferColumnTypes to true in the config, it’ll try to infer the data types on every run, even though they have been defined in the schema. This can cause errors if it decides that a long is an int, as it will produce a schema mismatch.
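To make this concrete, here’s a minimal sketch of an Auto Loader stream with both a schema location and column type inference set (the file format, paths and variable names here are illustrative, not from the original example):
# Minimal Auto Loader reader sketch - format and paths are placeholders
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/lake/schemaLocation")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/mnt/lake/raw/sales"))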
There’s a lot more on schema inference on the Databricks website, including all the different options and how it relates to schema evolution.
In terms of taking that inferred schema and amending it to what you were expecting, or using it to filter out sensitive data fields, there are two approaches:
- Schema Hints
- Amending the Inferred Schema File
Schema Hints
Schema hints are really useful if the schema Databricks infers for you has data types which you think or know are incorrect. When you provide a schema hint, this overrides the inferred schema for the fields you’ve provided hints for.
In this example, our inferred schema has detected our Date field as a String; Quantity as an Integer; and Amount as an Integer. But what if Date is a date and Amount is actually a float?
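As a sketch, that inferred schema would look something like this when printed with printSchema() (nullability will depend on your data):
root
 |-- Date: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Amount: integer (nullable = true)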
This is where the schema hint comes in handy. We don’t need to modify the entire schema; we just need to give Databricks the necessary information to make the right choice.
.option("cloudFiles.schemaHints", "Date DATE, Amount FLOAT")
Which will produce a schema along these lines (again shown as printSchema() output):
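root
 |-- Date: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Amount: float (nullable = true)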
This is great for small amendments to a schema, particularly if it’s a large one. If you need to make significant changes, then you might want to define the schema up front or take a look at the other option…
Amending the Inferred Schema File
On the first run using schema inference, Databricks will output the schema to a _schemas directory under a location of your choosing. That location is supplied via the schema location option.
.option("cloudFiles.schemaLocation", "/mnt/lake/schemaLocation")
CAUTION: The file that is output is not meant to be edited, so proceed with care.
The file is often called 0, though it could be any integer if the schema has evolved, and contains two lines (sketched below):
- The version of the schema, represented by vN
- The schema in a JSON block
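For the example above, the file’s contents might look something like this (a simplified sketch: the exact wrapper JSON and its keys vary between Databricks Runtime versions, but the schema itself is stored as an escaped JSON struct):
v1
{"dataSchemaJson":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Date\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Quantity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Amount\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}"}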
If you have a large and/or complex schema, editing this JSON block could be a challenge. But if that schema contains sensitive fields, it may be necessary to edit this block.
In this example, we have sensitive fields that we do not want to have in our data platform because:
- We do not need them in our platform for analysis purposes
- They will cause a world of pain if they get compromised
Therefore it’s safer to remove them from the platform before they get ingested. Removing them is a manual exercise of deleting each field’s definition block from the schema file.
{\"name\":\"fieldName\",\"type\":\"dataType\",\"nullable\":boolean,\"metadata\":{}}
Depending on the number of sensitive fields, this could significantly shrink your schema. Depending on what’s left over, it may be easier and safer to define the schema up front.
We’ve removed the fields from the schema, but Auto Loader’s schema evolution functionality can easily add those fields back in. Therefore, we need to explicitly tell Auto Loader not to perform schema evolution.
.option("cloudFiles.schemaEvolutionMode", "none")
This means that any new columns, including the ones we have intentionally excluded, will not make their way any further into the data flow.
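Putting the pieces together, a sketch of the final reader for this approach might look like the following (again, the file format and paths are placeholders):
# Reuse the edited schema and refuse to evolve past it
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/lake/schemaLocation")
    .option("cloudFiles.schemaEvolutionMode", "none")
    .load("/mnt/lake/raw/sales"))
With evolution disabled, any columns that aren’t in the edited schema file, including the sensitive fields we deleted, are simply ignored rather than added back.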
Conclusion
We’ve seen two approaches to using (and abusing) the inferred schema functionality of Auto Loader. Schema inference significantly reduces the barriers to ingesting data, but there are occasions when it might need a little help. Hopefully, this post has given you an idea of when and why you might want to interfere with an automated process.