Building A Data Lakehouse On Google Cloud Platform


Published May 2022


Historically, organizations have implemented siloed architectures. Data warehouses store structured, aggregated data (primarily used for BI and reporting), whereas data lakes store large volumes of unstructured and semi-structured data (primarily used for ML workloads). This approach often results in complex ETL pipelines because of extensive data movement, processing, and duplication. Operationalizing and governing this architecture is challenging and costly, and it reduces agility. As organizations move to the cloud, they want to break down these silos.

To address these issues, a new architecture choice has emerged: the data lakehouse. The data lakehouse combines the key benefits of data lakes and data warehouses. It offers a low-cost storage format that is accessible to various processing engines, such as Spark, while also providing powerful management and optimization features.

The data landscape continues to evolve and to grow at an exponential rate. Flexible patterns and limitless scale are important to ensure that data is treated as an investment rather than a sunk cost.