AWS Lake Formation — Step-by-Step Guide

Dipali Kulshrestha
Mar 6, 2022

In this tutorial, we will create a data lake using AWS Lake Formation and ingest both batch data and real-time data into it.

Step 1: Set Up the Data Lake Admin User

Log in to the AWS Management Console as the root user.

a) Since the root user cannot be a data lake administrator, create an IAM user named datalake-admin-user.

b) Attach the following policies to this user:

i) AdministratorAccess

ii) AWSLakeFormationDataAdmin

c) Go to Lake Formation and add this new user as an administrator [Permissions => Administrative Roles and Tasks].

d) Log out and log back in as the new user.
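If you prefer to script this step, a minimal boto3 sketch of the same flow could look like the following; the ACCOUNT_ID placeholder is yours to fill in, and the user name comes from step a):

import boto3

iam = boto3.client("iam")
lf = boto3.client("lakeformation")

# Create the IAM user and attach the two policies from step b)
iam.create_user(UserName="datalake-admin-user")
for policy in ("AdministratorAccess", "AWSLakeFormationDataAdmin"):
    iam.attach_user_policy(
        UserName="datalake-admin-user",
        PolicyArn=f"arn:aws:iam::aws:policy/{policy}",
    )

# Register the user as a Lake Formation administrator (step c).
# ACCOUNT_ID is a placeholder for your own AWS account id.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings.setdefault("DataLakeAdmins", []).append(
    {"DataLakePrincipalIdentifier": "arn:aws:iam::ACCOUNT_ID:user/datalake-admin-user"}
)
lf.put_data_lake_settings(DataLakeSettings=settings)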

Step 2: Configure S3 and Register the Location with Lake Formation

  • Create an S3 bucket, e.g. dlfdemo
  • Register this location with Lake Formation
  • Create a folder ‘rawdata’ in the bucket to hold all the raw data (a scripted version of these steps follows below)
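For reference, the same setup scripted with boto3 might look like this (the bucket name dlfdemo comes from this tutorial; the region handling is an assumption, since only us-east-1 needs no CreateBucketConfiguration):

import boto3

s3 = boto3.client("s3")
lf = boto3.client("lakeformation")

# Create the bucket. Outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
s3.create_bucket(Bucket="dlfdemo")

# S3 "folders" are just key prefixes; an empty object marks the prefix.
s3.put_object(Bucket="dlfdemo", Key="rawdata/")

# Register the bucket with Lake Formation via the service-linked role.
lf.register_resource(ResourceArn="arn:aws:s3:::dlfdemo", UseServiceLinkedRole=True)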

Step 3: Data Ingestion Setup

Part 1: Ingest Batch Data

Create a folder ingest-batch inside ‘rawdata’.

Upload the data file customers.csv into ingest-batch [you can download a sample file from https://github.com/dipalikulshrestha/datalakeformation].
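Assuming customers.csv sits in your working directory, the upload is a single boto3 call:

import boto3

s3 = boto3.client("s3")
s3.upload_file("customers.csv", "dlfdemo", "rawdata/ingest-batch/customers.csv")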

Step 4: Create the Database and Metadata

a) In Lake Formation, go to Databases and create a database named ‘rawdata’.

b) Map its location to the folder ‘rawdata’.

Look at the S3 folder structure: think of rawdata as your database and ingest-batch as a table within it.
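Lake Formation databases live in the Glue Data Catalog, so the console step above roughly corresponds to this boto3 sketch (the LocationUri matches the folder we created in Step 2):

import boto3

glue = boto3.client("glue")
glue.create_database(
    DatabaseInput={
        "Name": "rawdata",
        "LocationUri": "s3://dlfdemo/rawdata/",
    }
)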

c) Create a Glue crawler ‘batchcrawler’ to build metadata from the data available in ingest-batch (it will eventually create a table named ‘ingest_batch’)

crawler name: batchcrawler

path: s3://dlfdemo/rawdata/ingest-batch

role: AWSGlueServiceRole-batchingest

database: rawdata
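A scripted equivalent of this crawler setup, assuming the role AWSGlueServiceRole-batchingest already exists, could be:

import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="batchcrawler",
    Role="AWSGlueServiceRole-batchingest",  # must already exist
    DatabaseName="rawdata",
    Targets={"S3Targets": [{"Path": "s3://dlfdemo/rawdata/ingest-batch"}]},
)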


If we run this crawler now, it does not actually have permission to look inside the bucket. Fundamentally, the whole point of Lake Formation is to control access to the data we have registered. We registered this bucket as the data lake and created a service role; that role needs Lake Formation permissions to read data from the data lake.

Go to Permissions ==> Data Lake Permissions ==> Grant, and give the crawler role all access on the rawdata database.
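That console grant maps to a single API call. Granting ALL mirrors the "give all access" above, though narrower permissions such as CREATE_TABLE and ALTER would also work; the account id and the service-role path are placeholders:

import boto3

lf = boto3.client("lakeformation")
lf.grant_permissions(
    # Placeholder ARN: adjust ACCOUNT_ID and the role path to your setup.
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::ACCOUNT_ID:role/service-role/AWSGlueServiceRole-batchingest"
    },
    Resource={"Database": {"Name": "rawdata"}},
    Permissions=["ALL"],
)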

Now run the crawler.

It takes 1–2 minutes and should add one new table named ingest_batch.
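You can also start the crawler and wait for it from code; the polling loop below is a simple sketch, not production-grade retry logic:

import time
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="batchcrawler")

# Poll until the crawler returns to the READY state.
while glue.get_crawler(Name="batchcrawler")["Crawler"]["State"] != "READY":
    time.sleep(15)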

Go to Tables and verify the table.

You can also edit the schema of this table (if required).

Now, as a next step, we want to query this data in the data lake interactively. For that we will use the serverless service AWS Athena.

  • Athena lets you run SQL queries on data in data lakes
  • Serverless: Athena uses Presto, a distributed SQL engine, to run queries
  • It removes a lot of complexity in data querying
  • Earlier, you needed EMR to run Hadoop and Hue for queries
  • Before you run your first query, you need to set up a query result location in Amazon S3
  • Settings → Edit → Select the location
  • To do that, we can create a folder in our S3 bucket

In Athena, choose rawdata as the database and run the query:

select * from rawdata.ingest_batch
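The same query can also be issued through the Athena API. The athena-results/ prefix here is an assumption; any S3 location you configured as the query result location will do:

import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT * FROM rawdata.ingest_batch LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://dlfdemo/athena-results/"},
)
print("query execution id:", resp["QueryExecutionId"])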

Part 2: Ingest Real-Time Data Using Kinesis Data Firehose

In S3, create a folder for the real-time data ingestion at rawdata/ingest-real-time.

Now go to Kinesis => Kinesis Data Firehose => Create Delivery Stream

Source: Direct PUT & Destination: Amazon S3

Delivery Stream Name: DLRealTimeIngest

S3 Bucket: s3://dlfdemo

Prefix: rawdata/ingest-real-time/

In Advanced settings, note the IAM role.
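If you script the delivery stream instead of clicking through the console, a minimal sketch might look like this; the role ARN is a placeholder for the role noted above:

import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="DLRealTimeIngest",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        # Placeholder: the delivery role the console created (or your own).
        "RoleARN": "arn:aws:iam::ACCOUNT_ID:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::dlfdemo",
        "Prefix": "rawdata/ingest-real-time/",
        # 60-second buffer, matching the behaviour described below.
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
    },
)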

Go to Lake Formation and grant this Firehose role access to ingest data into the data lake:

Permissions => Data Lake Permissions => Grant

Now, to ingest data into the Firehose delivery stream, test with demo data => Start sending demo data.

Records take about 60 seconds to appear, since that is the buffer interval we set while creating the delivery stream.

After a while, check the ingest-real-time folder in S3 to confirm that Firehose has started pushing files.

Now you can stop sending demo data.
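Instead of the console's demo-data generator, you can also push records yourself with boto3; the payload fields below are made up purely for illustration:

import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical record, just to show the call shape.
payload = {"customer_id": 1, "event": "signup"}
firehose.put_record(
    DeliveryStreamName="DLRealTimeIngest",
    Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
)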

Create a Glue crawler for this real-time data from Lake Formation:

name: realtimecrawler

Path: s3://dlfdemo/rawdata/ingest-real-time

role: AWSGlueServiceRole-realtimeingestdemo

Database: rawdata

Add the IAM role permissions in Lake Formation. There are two main types of permissions in AWS Lake Formation:

Metadata access: permissions on Data Catalog resources (Data Catalog permissions).

Underlying data access: permissions on locations in Amazon Simple Storage Service (Amazon S3).

Data Lake Permissions ==> Grant

Data Locations ==> Grant
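Both grants map to the same grant_permissions call with different resource types; the account id and service-role path are placeholders:

import boto3

lf = boto3.client("lakeformation")
role = "arn:aws:iam::ACCOUNT_ID:role/service-role/AWSGlueServiceRole-realtimeingestdemo"

# Metadata access: let the crawler create and update tables in 'rawdata'.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role},
    Resource={"Database": {"Name": "rawdata"}},
    Permissions=["ALL"],
)

# Underlying data access: let it read the registered S3 location.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role},
    Resource={
        "DataLocation": {"ResourceArn": "arn:aws:s3:::dlfdemo/rawdata/ingest-real-time"}
    },
    Permissions=["DATA_LOCATION_ACCESS"],
)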

  • Run the crawler: realtimecrawler

It should add one table named ingest_real_time. Check/edit the schema if needed, and query it via Athena.
