AWS Lake Formation — Step-by-Step Guide

Dipali Kulshrestha
Mar 6, 2022

In this tutorial, we will create a data lake using AWS Lake Formation and ingest both batch data and real-time data into it.

Step 1: Set Up the Data Lake Admin User

Log in to the AWS Management Console as the root user.

a) Since the root user cannot be a data lake administrator, create an IAM user named datalake-admin-user.

b) Attach the following policies to this user:

i) AdministratorAccess

ii) AWSLakeFormationDataAdmin

c) Go to Lake Formation and add this new user as an administrator [Permissions => Administrative Roles and Tasks].

d) Log out and log back in as the new user.
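If you prefer to script this step, a minimal boto3 sketch of the same flow could look like the following; the ACCOUNT_ID placeholder is yours to fill in, and the user name comes from step a):

import boto3

iam = boto3.client("iam")
lf = boto3.client("lakeformation")

# Create the IAM user and attach the two policies from step b)
iam.create_user(UserName="datalake-admin-user")
for policy in ("AdministratorAccess", "AWSLakeFormationDataAdmin"):
    iam.attach_user_policy(
        UserName="datalake-admin-user",
        PolicyArn=f"arn:aws:iam::aws:policy/{policy}",
    )

# Register the user as a Lake Formation administrator (step c).
# ACCOUNT_ID is a placeholder for your own AWS account id.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings.setdefault("DataLakeAdmins", []).append(
    {"DataLakePrincipalIdentifier": "arn:aws:iam::ACCOUNT_ID:user/datalake-admin-user"}
)
lf.put_data_lake_settings(DataLakeSettings=settings)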

Step 2: Configure S3 and Register the Location with Lake Formation

  • Create an S3 bucket, e.g. dlfdemo
  • Register this location with Lake Formation
  • Create a folder ‘rawdata’ in the bucket to hold all the raw data (a scripted version of these steps follows below)
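For reference, the same setup scripted with boto3 might look like this (the bucket name dlfdemo comes from this tutorial; the region handling is an assumption, since only us-east-1 needs no CreateBucketConfiguration):

import boto3

s3 = boto3.client("s3")
lf = boto3.client("lakeformation")

# Create the bucket. Outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
s3.create_bucket(Bucket="dlfdemo")

# S3 "folders" are just key prefixes; an empty object marks the prefix.
s3.put_object(Bucket="dlfdemo", Key="rawdata/")

# Register the bucket with Lake Formation via the service-linked role.
lf.register_resource(ResourceArn="arn:aws:s3:::dlfdemo", UseServiceLinkedRole=True)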

Step 3: Data Ingestion Setup

Part 1: Ingest Batch Data

Create a folder ingest-batch inside ‘rawdata’.

Upload the data file customers.csv into ingest-batch [you can download a sample file from https://github.com/dipalikulshrestha/datalakeformation].
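Assuming customers.csv sits in your working directory, the upload is a single boto3 call:

import boto3

s3 = boto3.client("s3")
s3.upload_file("customers.csv", "dlfdemo", "rawdata/ingest-batch/customers.csv")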

Step 4: Create the Database and Metadata

a) In Lake Formation, go to Databases and create a database named ‘rawdata’.

b) Map its location to the folder ‘rawdata’.

Look at the S3 folder structure: think of rawdata as your database and ingest-batch as a table within it.
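Lake Formation databases live in the Glue Data Catalog, so the console step above roughly corresponds to this boto3 sketch (the LocationUri matches the folder we created in Step 2):

import boto3

glue = boto3.client("glue")
glue.create_database(
    DatabaseInput={
        "Name": "rawdata",
        "LocationUri": "s3://dlfdemo/rawdata/",
    }
)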

c) Create a Glue crawler ‘batchcrawler’ to build metadata from the data available in ingest-batch (it will eventually create a table named ‘ingest_batch’)

crawler name: batchcrawler

path: s3://dlfdemo/rawdata/ingest-batch

role: AWSGlueServiceRole-batchingest

database: rawdata
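A scripted equivalent of this crawler setup, assuming the role AWSGlueServiceRole-batchingest already exists, could be:

import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="batchcrawler",
    Role="AWSGlueServiceRole-batchingest",  # must already exist
    DatabaseName="rawdata",
    Targets={"S3Targets": [{"Path": "s3://dlfdemo/rawdata/ingest-batch"}]},
)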


If we run this crawler now, it does not actually have permission to look inside the bucket. Fundamentally, the whole point of Lake Formation is to control access to the data we have registered. We registered this bucket as the data lake and created a service role; that role needs Lake Formation permissions to read data from the data lake.

Go to Permissions ==> Data Lake Permissions ==> Grant, and give the crawler role all access on the rawdata database.
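That console grant maps to a single API call. Granting ALL mirrors the "give all access" above, though narrower permissions such as CREATE_TABLE and ALTER would also work; the account id and the service-role path are placeholders:

import boto3

lf = boto3.client("lakeformation")
lf.grant_permissions(
    # Placeholder ARN: adjust ACCOUNT_ID and the role path to your setup.
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::ACCOUNT_ID:role/service-role/AWSGlueServiceRole-batchingest"
    },
    Resource={"Database": {"Name": "rawdata"}},
    Permissions=["ALL"],
)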

Now run the crawler.

It takes 1–2 minutes and should add one new table named ingest_batch.
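You can also start the crawler and wait for it from code; the polling loop below is a simple sketch, not production-grade retry logic:

import time
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="batchcrawler")

# Poll until the crawler returns to the READY state.
while glue.get_crawler(Name="batchcrawler")["Crawler"]["State"] != "READY":
    time.sleep(15)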

Go to Tables and verify the table.

You can also edit the schema of this table (if required).

Now, as a next step, we want to query this data in the data lake interactively. For that we will use the serverless service AWS Athena.

  • Athena lets you run SQL queries on data in data lakes
  • Serverless: Athena uses Presto, a distributed SQL engine, to run queries
  • It removes a lot of complexity in data querying
  • Earlier, you needed EMR to run Hadoop and Hue for queries
  • Before you run your first query, you need to set up a query result location in Amazon S3
  • Settings → Edit → Select the location
  • To do that, we can create a folder in our S3 bucket

In Athena, choose rawdata as the database and run the query:

select * from rawdata.ingest_batch
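The same query can also be issued through the Athena API. The athena-results/ prefix here is an assumption; any S3 location you configured as the query result location will do:

import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT * FROM rawdata.ingest_batch LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://dlfdemo/athena-results/"},
)
print("query execution id:", resp["QueryExecutionId"])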

Part 2: Ingest Real-Time Data Using Kinesis Data Firehose

In S3, create a folder for the real-time data ingestion at rawdata/ingest-real-time.

Now go to Kinesis => Kinesis Data Firehose => Create Delivery Stream

Source: Direct PUT & Destination: Amazon S3

Delivery Stream Name: DLRealTimeIngest

S3 Bucket: s3://dlfdemo

Prefix: rawdata/ingest-real-time/

In Advanced settings, note the IAM role.
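If you script the delivery stream instead of clicking through the console, a minimal sketch might look like this; the role ARN is a placeholder for the role noted above:

import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="DLRealTimeIngest",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        # Placeholder: the delivery role the console created (or your own).
        "RoleARN": "arn:aws:iam::ACCOUNT_ID:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::dlfdemo",
        "Prefix": "rawdata/ingest-real-time/",
        # 60-second buffer, matching the behaviour described below.
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
    },
)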

Go to Lake Formation and grant this Firehose role access to ingest data into the data lake:

Permissions => Data Lake Permissions => Grant

Now, to ingest data into the Firehose delivery stream, test with demo data => Start sending demo data.

Records take about 60 seconds to appear, since that is the buffer interval we set while creating the delivery stream.

After a while, check the ingest-real-time folder in S3 to confirm that Firehose has started pushing files.

Now you can stop sending demo data.
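Instead of the console's demo-data generator, you can also push records yourself with boto3; the payload fields below are made up purely for illustration:

import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical record, just to show the call shape.
payload = {"customer_id": 1, "event": "signup"}
firehose.put_record(
    DeliveryStreamName="DLRealTimeIngest",
    Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
)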

Create a Glue crawler for this real-time data from Lake Formation:

name: realtimecrawler

Path: s3://dlfdemo/rawdata/ingest-real-time

role: AWSGlueServiceRole-realtimeingestdemo

Database: rawdata

Add the IAM role permissions in Lake Formation. There are two main types of permissions in AWS Lake Formation:

Metadata access: permissions on Data Catalog resources (Data Catalog permissions).

Underlying data access: permissions on locations in Amazon Simple Storage Service (Amazon S3).

Data Lake Permissions ==> Grant

Data Locations ==> Grant
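Both grants map to the same grant_permissions call with different resource types; the account id and service-role path are placeholders:

import boto3

lf = boto3.client("lakeformation")
role = "arn:aws:iam::ACCOUNT_ID:role/service-role/AWSGlueServiceRole-realtimeingestdemo"

# Metadata access: let the crawler create and update tables in 'rawdata'.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role},
    Resource={"Database": {"Name": "rawdata"}},
    Permissions=["ALL"],
)

# Underlying data access: let it read the registered S3 location.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role},
    Resource={
        "DataLocation": {"ResourceArn": "arn:aws:s3:::dlfdemo/rawdata/ingest-real-time"}
    },
    Permissions=["DATA_LOCATION_ACCESS"],
)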

  • Run the crawler: realtimecrawler

It should add one table named ingest_real_time. Check/edit the schema if needed, and query it via Athena.
