Using Apache Spark on Amazon EMR with SageMaker for End-to-End ML and Data Science Workflows

Webinar

Published May 2022


Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform every ML development step, from preparing data to building, training, and deploying models. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and ML applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. In this talk, we demonstrate recent integrations between the two services that make it simple for data scientists and ML engineers to use distributed big data frameworks such as Spark in their ML workflows.

Learning Objectives:

  • How to use a unified notebook-centric experience to create and manage EMR clusters, run analytics on those clusters, and train and deploy SageMaker models.
  • How to use a one-click interface for debugging and monitoring Amazon EMR jobs through the Spark UI.
  • How data workers can discover, connect to, create, and stop clusters in a multi-account setup.