What is Extract Transform Load (ETL)?
Extract Transform Load refers to a trio of processes that are performed when moving raw data from its source to a data warehouse, data mart, or relational database.
What do I need to know about ETL?
Data must be properly formatted and normalized in order to be loaded into these types of data storage systems, and ETL is used as shorthand to describe the three stages of preparing data. ETL also describes the commercial software category that automates the three processes.
What does ETL accomplish?
The three words in Extract Transform Load each describe a process in the moving of data from its source to a formal data storage system (most often a data warehouse).
- Extract—The extraction process is the first phase of ETL, in which data is collected from one or more data sources and held in temporary storage where the subsequent two phases can be executed. During extraction, validation rules are applied to test whether data has expected values essential to the data warehouse. Data that fails the validation is rejected and further processed to discover why it failed validation and remediate if possible.
- Transform—In the transformation phase, the data is processed to make values and structure consistent across all data. Typical transformations include things like date formatting, resorting rows or columns of data, joining data from two values into one, or, conversely, splitting data from one value into two. The goal of transformation is to make all the data conform to a uniform schema.
- Load—The load phase moves the transformed data into the permanent, target database. Once loaded, the ETL process is complete, although in many organizations ETL is performed regularly in order to keep the data warehouse updated with the latest data.
Why do you need ETL?
When creating a data warehouse, it is common for data from disparate sources to be brought together in one place so that it can be analyzed for patterns and insights. It would be great if data from all these sources had a compatible schema from the outset, but this is rarely the case. ETL takes data that is heterogeneous and makes it homogeneous. Without ETL it would be impossible to programmatically analyze heterogeneous data and derive business intelligence from it.