
AWS Glue: Developer Guide


Copyright © 2019 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner
that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not
owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by
Amazon.

Table of Contents
What Is AWS Glue? ............................................................................................................................. 1
When Should I Use AWS Glue? .................................................................................................... 1
How It Works .................................................................................................................................... 3
Serverless ETL Jobs Run in Isolation ............................................................................................. 3
Concepts ................................................................................................................................... 4
AWS Glue Terminology ....................................................................................................... 5
Components .............................................................................................................................. 6
AWS Glue Console .............................................................................................................. 6
AWS Glue Data Catalog ...................................................................................................... 7
AWS Glue Crawlers and Classifiers ........................................................................................ 7
AWS Glue ETL Operations ................................................................................................... 7
The AWS Glue Jobs System ................................................................................................. 7
Converting Semi-Structured Schemas to Relational Schemas ........................................................... 8
Getting Started ................................................................................................................................ 10
Setting up IAM Permissions for AWS Glue ................................................................................... 10
Step 1: Create an IAM Policy for the AWS Glue Service .......................................................... 10
Step 2: Create an IAM Role for AWS Glue ............................................................................ 14
Step 3: Attach a Policy to IAM Users That Access AWS Glue ................................................... 15
Step 4: Create an IAM Policy for Notebook Servers ............................................................... 22
Step 5: Create an IAM Role for Notebook Servers ................................................................. 24
Step 6: Create an IAM Policy for Amazon SageMaker Notebooks ............................................. 25
Step 7: Create an IAM Role for Amazon SageMaker Notebooks ............................................... 27
Setting Up DNS in Your VPC ...................................................................................................... 28
Setting Up Your Environment to Access Data Stores ..................................................................... 28
Amazon VPC Endpoints for Amazon S3 ............................................................................... 29
Setting Up a VPC to Connect to JDBC Data Stores ................................................................ 30
Setting Up Your Environment for Development Endpoints ............................................................. 33
Setting Up Your Network for a Development Endpoint .......................................................... 33
Setting Up Amazon EC2 for a Notebook Server .................................................................... 34
Setting Up Encryption .............................................................................................................. 35
Console Workflow Overview ...................................................................................................... 37
Security ........................................................................................................................................... 39
Authentication and Access Control ............................................................................................. 40
Authentication ................................................................................................................. 40
Access-Control Overview ................................................................................................... 41
Cross-Account Access ........................................................................................................ 50
Resource ARNs ................................................................................................................. 54
Policy Examples ............................................................................................................... 57
API Permissions Reference ................................................................................................. 66
Encryption and Secure Access .................................................................................................... 80
Encrypting Your Data Catalog ............................................................................................ 81
Encrypting Connection Passwords ...................................................................................... 82
Encrypting Data Written by AWS Glue ................................................................................ 82
Populating the AWS Glue Data Catalog ............................................................................................... 86
Defining a Database in Your Data Catalog ................................................................................... 88
Working with Databases on the Console ............................................................................. 88
Defining Tables in the AWS Glue Data Catalog ............................................................................. 88
Table Partitions ................................................................................................................ 89
Working with Tables on the Console ................................................................................... 89
Adding a Connection to Your Data Store ..................................................................................... 92
When Is a Connection Used? .............................................................................................. 92
Defining a Connection in the AWS Glue Data Catalog ............................................................ 92
Connecting to a JDBC Data Store in a VPC .......................................................................... 93
Working with Connections on the Console ........................................................................... 94


Cataloging Tables with a Crawler ............................................................................................... 97


Defining a Crawler in the AWS Glue Data Catalog ................................................................. 98
Which Data Stores Can I Crawl? ......................................................................................... 98
Using Include and Exclude Patterns .................................................................................... 98
What Happens When a Crawler Runs? ............................................................................... 101
Are Amazon S3 Folders Created as Tables or Partitions? ...................................................... 102
Configuring a Crawler ..................................................................................................... 103
Scheduling a Crawler ...................................................................................................... 106
Working with Crawlers on the Console .............................................................................. 107
Adding Classifiers to a Crawler ................................................................................................. 109
When Do I Use a Classifier? ............................................................................................. 109
Custom Classifiers ........................................................................................................... 109
Built-In Classifiers in AWS Glue ........................................................................................ 110
Writing Custom Classifiers ............................................................................................... 112
Working with Classifiers on the Console ............................................................................ 122
Working with Data Catalog Settings on the AWS Glue Console ..................................................... 123
Populating the Data Catalog Using AWS CloudFormation Templates ............................................. 124
Sample Database ............................................................................................................ 125
Sample Database, Table, Partitions ................................................................................... 126
Sample Grok Classifier .................................................................................................... 129
Sample JSON Classifier ................................................................................................... 130
Sample XML Classifier ..................................................................................................... 130
Sample Amazon S3 Crawler ............................................................................................. 131
Sample Connection ......................................................................................................... 132
Sample JDBC Crawler ...................................................................................................... 133
Sample Job for Amazon S3 to Amazon S3 ......................................................................... 135
Sample Job for JDBC to Amazon S3 ................................................................................. 136
Sample On-Demand Trigger ............................................................................................. 137
Sample Scheduled Trigger ............................................................................................... 138
Sample Conditional Trigger .............................................................................................. 139
Sample Development Endpoint ........................................................................................ 140
Authoring Jobs ............................................................................................................................... 141
Workflow Overview ................................................................................................................ 142
Adding Jobs ........................................................................................................................... 142
Defining Job Properties ................................................................................................... 142
Built-In Transforms ......................................................................................................... 144
Jobs on the Console ....................................................................................................... 146
Editing Scripts ........................................................................................................................ 151
Defining a Script ............................................................................................................ 151
Scripts on the Console .................................................................................................... 152
Providing Your Own Custom Scripts .................................................................................. 153
Triggering Jobs ...................................................................................................................... 154
Triggering Jobs Based on Schedules or Events .................................................................... 154
Defining Trigger Types .................................................................................................... 154
Working with Triggers on the Console ............................................................................... 154
Using Development Endpoints ................................................................................................. 156
Managing the Environment .............................................................................................. 156
Using a Dev Endpoint ..................................................................................................... 156
Accessing Your Dev Endpoint ........................................................................................... 156
Development Endpoints on the Console ............................................................................ 157
Tutorial Prerequisites ...................................................................................................... 161
Tutorial: Local Zeppelin Notebook .................................................................................... 164
Tutorial: Amazon EC2 Zeppelin Notebook Server ................................................................ 167
Tutorial: Use a REPL Shell ................................................................................................ 170
Tutorial: Use PyCharm Professional ................................................................................... 171
Managing Notebooks .............................................................................................................. 177
Notebook Server Considerations ....................................................................................... 179


Notebooks on the Console ............................................................................................... 185


Running and Monitoring .................................................................................................................. 188
Automated Tools .................................................................................................................... 189
Time-Based Schedules for Jobs and Crawlers ............................................................................. 189
Cron Expressions ............................................................................................................ 189
Job Bookmarks ....................................................................................................................... 191
Using Job Bookmarks ...................................................................................................... 192
Using an AWS Glue Script ................................................................................................ 193
Using Modification Timestamps ........................................................................................ 194
Automating with CloudWatch Events ........................................................................................ 196
Monitoring with Amazon CloudWatch ....................................................................................... 196
Using CloudWatch Metrics ............................................................................................... 197
Setting Up Amazon CloudWatch Alarms on AWS Glue Job Profiles ........................................ 210
Job Monitoring and Debugging ................................................................................................ 210
Debugging OOM Exceptions and Job Abnormalities ............................................................ 211
Debugging Demanding Stages and Straggler Tasks ............................................................. 218
Monitoring the Progress of Multiple Jobs .......................................................................... 222
Monitoring for DPU Capacity Planning .............................................................................. 226
Logging Using CloudTrail ......................................................................................................... 230
AWS Glue Information in CloudTrail .................................................................................. 231
Understanding AWS Glue Log File Entries .......................................................................... 231
Troubleshooting ............................................................................................................................. 234
Gathering AWS Glue Troubleshooting Information ...................................................................... 234
Troubleshooting Connection Issues ........................................................................................... 234
Troubleshooting Errors ............................................................................................................ 235
Error: Resource Unavailable ............................................................................................. 235
Error: Could Not Find S3 Endpoint or NAT Gateway for subnetId in VPC ................................. 236
Error: Inbound Rule in Security Group Required .................................................................. 236
Error: Outbound Rule in Security Group Required ............................................................... 236
Error: Custom DNS Resolution Failures .............................................................................. 236
Error: Job Run Failed Because the Role Passed Should Be Given Assume Role Permissions for
the AWS Glue Service ..................................................................................................... 236
Error: DescribeVpcEndpoints Action Is Unauthorized. Unable to Validate VPC ID vpc-id ............. 237
Error: DescribeRouteTables Action Is Unauthorized. Unable to Validate Subnet Id: subnet-id in
VPC id: vpc-id ................................................................................................................ 237
Error: Failed to Call ec2:DescribeSubnets ........................................................................... 237
Error: Failed to Call ec2:DescribeSecurityGroups ................................................................. 237
Error: Could Not Find Subnet for AZ ................................................................................. 237
Error: Job Run Exception When Writing to a JDBC Target ..................................................... 237
Error: Amazon S3 Timeout ............................................................................................... 238
Error: Amazon S3 Access Denied ....................................................................................... 238
Error: Amazon S3 Access Key ID Does Not Exist .................................................................. 238
Error: Job Run Fails When Accessing Amazon S3 with an s3a:// URI .................................... 238
Error: Amazon S3 Service Token Expired ............................................................................ 240
Error: No Private DNS for Network Interface Found ............................................................. 240
Error: Development Endpoint Provisioning Failed ................................................................ 240
Error: Notebook Server CREATE_FAILED ............................................................................. 240
Error: Local Notebook Fails to Start .................................................................................. 240
Error: Notebook Usage Errors ........................................................................................... 241
Error: Running Crawler Failed ........................................................................................... 241
Error: Upgrading Athena Data Catalog .............................................................................. 241
Error: A Job is Reprocessing Data When Job Bookmarks Are Enabled ..................................... 241
AWS Glue Limits ..................................................................................................................... 242
ETL Programming ........................................................................................................................... 244
General Information ................................................................................................................ 244
Special Parameters ......................................................................................................... 244
Connection Parameters ................................................................................................... 245


Format Options .............................................................................................................. 248


Managing Partitions ........................................................................................................ 250
Grouping Input Files ....................................................................................................... 251
Reading from JDBC in Parallel .......................................................................................... 252
Moving Data to and from Amazon Redshift ....................................................................... 253
ETL Programming in Python .................................................................................................... 254
Using Python ................................................................................................................. 254
List of Extensions ........................................................................................................... 255
List of Transforms .......................................................................................................... 255
Python Setup ................................................................................................................. 255
Calling APIs ................................................................................................................... 256
Python Libraries ............................................................................................................. 258
Python Samples ............................................................................................................. 259
PySpark Extensions ......................................................................................................... 273
PySpark Transforms ........................................................................................................ 297
ETL Programming in Scala ....................................................................................................... 324
Using Scala .................................................................................................................... 329
Scala API List ................................................................................................................. 330
AWS Glue API ................................................................................................................................ 371
Security ................................................................................................................................. 376
— data types — .......................................................................................................... 376
DataCatalogEncryptionSettings ........................................................................................ 377
EncryptionAtRest ............................................................................................................ 377
ConnectionPasswordEncryption ........................................................................................ 377
EncryptionConfiguration .................................................................................................. 378
S3Encryption ................................................................................................................. 378
CloudWatchEncryption .................................................................................................... 378
JobBookmarksEncryption ................................................................................................. 379
SecurityConfiguration ...................................................................................................... 379
— operations — .......................................................................................................... 379
GetDataCatalogEncryptionSettings (get_data_catalog_encryption_settings) ........................... 379
PutDataCatalogEncryptionSettings (put_data_catalog_encryption_settings) ........................... 380
PutResourcePolicy (put_resource_policy) ............................................................................ 381
GetResourcePolicy (get_resource_policy) ............................................................................ 381
DeleteResourcePolicy (delete_resource_policy) .................................................................... 382
CreateSecurityConfiguration (create_security_configuration) ................................................. 382
DeleteSecurityConfiguration (delete_security_configuration) ................................................ 383
GetSecurityConfiguration (get_security_configuration) ......................................................... 384
GetSecurityConfigurations (get_security_configurations) ...................................................... 384
Catalog ................................................................................................................................. 385
Databases ...................................................................................................................... 385
Tables ........................................................................................................................... 389
Partitions ....................................................................................................................... 403
Connections ................................................................................................................... 414
User-Defined Functions ................................................................................................... 421
Importing an Athena Catalog ........................................................................................... 426
Crawlers and Classifiers ........................................................................................................... 427
Classifiers ...................................................................................................................... 428
Crawlers ........................................................................................................................ 435
Scheduler ...................................................................................................................... 444
Autogenerating ETL Scripts ...................................................................................................... 446
— data types — .......................................................................................................... 446
CodeGenNode ................................................................................................................ 446
CodeGenNodeArg ........................................................................................................... 447
CodeGenEdge ................................................................................................................. 447
Location ........................................................................................................................ 447
CatalogEntry .................................................................................................................. 448


MappingEntry ................................................................................................................ 448


— operations — .......................................................................................................... 448
CreateScript (create_script) .............................................................................................. 449
GetDataflowGraph (get_dataflow_graph) ........................................................................... 449
GetMapping (get_mapping) .............................................................................................. 450
GetPlan (get_plan) .......................................................................................................... 450
Jobs ...................................................................................................................................... 451
Jobs .............................................................................................................................. 451
Job Runs ....................................................................................................................... 458
Triggers ......................................................................................................................... 465
DevEndpoints ......................................................................................................................... 472
— data types — .......................................................................................................... 472
DevEndpoint .................................................................................................................. 472
DevEndpointCustomLibraries ............................................................................................ 474
— operations — .......................................................................................................... 474
CreateDevEndpoint (create_dev_endpoint) ......................................................................... 475
UpdateDevEndpoint (update_dev_endpoint) ....................................................................... 477
DeleteDevEndpoint (delete_dev_endpoint) ......................................................................... 477
GetDevEndpoint (get_dev_endpoint) ................................................................................. 478
GetDevEndpoints (get_dev_endpoints) .............................................................................. 478
Common Data Types ............................................................................................................... 479
Tag ............................................................................................................................... 479
DecimalNumber .............................................................................................................. 479
ErrorDetail ..................................................................................................................... 480
PropertyPredicate ........................................................................................................... 480
ResourceUri .................................................................................................................... 480
String Patterns ............................................................................................................... 480
Exceptions ............................................................................................................................. 481
AccessDeniedException .................................................................................................... 481
AlreadyExistsException .................................................................................................... 481
ConcurrentModificationException ...................................................................................... 481
ConcurrentRunsExceededException ................................................................................... 481
CrawlerNotRunningException ........................................................................................... 482
CrawlerRunningException ................................................................................................ 482
CrawlerStoppingException ............................................................................................... 482
EntityNotFoundException ................................................................................................ 482
GlueEncryptionException ................................................................................................. 482
IdempotentParameterMismatchException .......................................................................... 483
InternalServiceException .................................................................................................. 483
InvalidExecutionEngineException ...................................................................................... 483
InvalidInputException ...................................................................................................... 483
InvalidTaskStatusTransitionException ................................................................................. 483
JobDefinitionErrorException ............................................................................................. 484
JobRunInTerminalStateException ...................................................................................... 484
JobRunInvalidStateTransitionException .............................................................................. 484
JobRunNotInTerminalStateException ................................................................................. 484
LateRunnerException ....................................................................................................... 485
NoScheduleException ...................................................................................................... 485
OperationTimeoutException ............................................................................................. 485
ResourceNumberLimitExceededException ........................................................................... 485
SchedulerNotRunningException ........................................................................................ 485
SchedulerRunningException ............................................................................................. 486
SchedulerTransitioningException ....................................................................................... 486
UnrecognizedRunnerException .......................................................................................... 486
ValidationException ......................................................................................................... 486
VersionMismatchException ............................................................................................... 486
Document History .......................................................................................................................... 487


Earlier Updates ....................................................................................................................... 488


AWS Glossary ................................................................................................................................. 490


What Is AWS Glue?


AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-
effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine
that automatically generates Python or Scala code, and a flexible scheduler that handles dependency
resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or
manage.

Use the AWS Glue console to discover data, transform it, and make it available for search and querying.
The console calls the underlying services to orchestrate the work required to transform your data. You
can also use the AWS Glue API operations to interface with AWS Glue services. Edit, debug, and test your
Python or Scala Apache Spark ETL code using a familiar development environment.

For pricing information, see AWS Glue Pricing.

When Should I Use AWS Glue?


You can use AWS Glue to build a data warehouse to organize, cleanse, validate, and format data. You
can transform and move AWS Cloud data into your data store. You can also load data from disparate
sources into your data warehouse for regular reporting and analysis. By storing it in a data warehouse,
you integrate information from different parts of your business and provide a common source of data for
decision making.

AWS Glue simplifies many tasks when you are building a data warehouse:

• Discovers and catalogs metadata about your data stores into a central catalog. You can process semi-
structured data, such as clickstream or process logs.
• Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs.
Crawlers call classifier logic to infer the schema, format, and data types of your data. This metadata is
stored as tables in the AWS Glue Data Catalog and used in the authoring process of your ETL jobs.
• Generates ETL scripts to transform, flatten, and enrich your data from source to target.
• Detects schema changes and adapts based on your preferences.
• Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your
data into your data warehouse. Triggers can be used to create a dependency flow between jobs.
• Gathers runtime metrics to monitor the activities of your data warehouse.
• Handles errors and retries automatically.
• Scales resources, as needed, to run your jobs.

You can use AWS Glue when you run serverless queries against your Amazon S3 data lake. AWS Glue
can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying
with Amazon Athena and Amazon Redshift Spectrum. With crawlers, your metadata stays in sync with
the underlying data. Athena and Redshift Spectrum can directly query your Amazon S3 data lake using
the AWS Glue Data Catalog. With AWS Glue, you access and analyze data through one unified interface
without loading it into multiple data silos.

You can create event-driven ETL pipelines with AWS Glue. You can run your ETL jobs as soon as
new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda
function. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
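The following is a minimal sketch of such a Lambda handler, written in Python with the AWS SDK for Python (Boto3). The job name, argument key, and bucket are examples only, not values defined by AWS Glue; you would substitute your own job and arguments.

# Sketch: an AWS Lambda handler that starts a Glue ETL job when a new object
# lands in Amazon S3. "process-new-data" and "--input_path" are example names.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # The S3 event notification carries the bucket and key of the new object.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Start the ETL job and pass the new object's location as a job argument.
    response = glue.start_job_run(
        JobName="process-new-data",
        Arguments={"--input_path": f"s3://{bucket}/{key}"}
    )
    return response["JobRunId"]

Configuring the S3 bucket to send event notifications to this Lambda function completes the event-driven pipeline.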


You can use AWS Glue to understand your data assets. You can store your data using various AWS
services and still maintain a unified view of your data using the AWS Glue Data Catalog. View the Data
Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in
one central repository. The Data Catalog also serves as a drop-in replacement for your external Apache
Hive Metastore.


AWS Glue: How It Works


AWS Glue uses other AWS services to orchestrate your ETL (extract, transform, and load) jobs to build
a data warehouse. AWS Glue calls API operations to transform your data, create runtime logs, store
your job logic, and create notifications to help you monitor your job runs. The AWS Glue console
connects these services into a managed application, so you can focus on creating and monitoring your
ETL work. The console performs administrative and job development operations on your behalf. You
supply credentials and other properties to AWS Glue to access your data sources and write to your data
warehouse.

AWS Glue takes care of provisioning and managing the resources that are required to run your workload.
You don't need to create the infrastructure for an ETL tool because AWS Glue does it for you. When
resources are required, to reduce startup time, AWS Glue uses an instance from its warm pool of
instances to run your workload.

With AWS Glue, you create jobs using table definitions in your Data Catalog. Jobs consist of scripts that
contain the programming logic that performs the transformation. You use triggers to initiate jobs either
on a schedule or as a result of a specified event. You determine where your target data resides and which
source data populates your target. With your input, AWS Glue generates the code that's required to
transform your data from source to target. You can also provide scripts in the AWS Glue console or API to
process your data.

AWS Glue is available in several AWS Regions. For more information, see AWS Regions and Endpoints in
the Amazon Web Services General Reference.

Topics
• Serverless ETL Jobs Run in Isolation (p. 3)
• AWS Glue Concepts (p. 4)
• AWS Glue Components (p. 6)
• Converting Semi-Structured Schemas to Relational Schemas (p. 8)

Serverless ETL Jobs Run in Isolation


AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on
virtual resources that it provisions and manages in its own service account.

AWS Glue is designed to do the following:

• Segregate customer data.


• Protect customer data in transit and at rest.
• Access customer data only as needed in response to customer requests, using temporary, scoped-down
credentials, or with a customer's consent to IAM roles in their account.

During provisioning of an ETL job, you provide input data sources and output data targets in your virtual
private cloud (VPC). In addition, you provide the IAM role, VPC ID, subnet ID, and security group that
are needed to access data sources and targets. For each tuple (customer account ID, IAM role, subnet
ID, and security group), AWS Glue creates a new Spark environment that is isolated at the network and
management level from all other Spark environments inside the AWS Glue service account.
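The following is a minimal sketch of how these network settings can be supplied when you define a JDBC connection with the AWS SDK for Python (Boto3). Every identifier, the connection URL, and the credentials shown are placeholders, not values from this guide.

# Sketch: supplying the subnet and security group that AWS Glue uses when it
# provisions its Spark environment, by defining a JDBC connection with Boto3.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "example-jdbc-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://example-host:3306/exampledb",
            "USERNAME": "example_user",
            "PASSWORD": "example_password",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)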

AWS Glue creates elastic network interfaces in your subnet using private IP addresses. Spark jobs use
these elastic network interfaces to access your data sources and data targets. Traffic in, out, and within
the Spark environment is governed by your VPC and networking policies with one exception: Calls made
to AWS Glue libraries can proxy traffic to AWS Glue API operations through the AWS Glue VPC. All AWS
Glue API calls are logged; thus, data owners can audit API access by enabling AWS CloudTrail, which
delivers audit logs to your account.

AWS Glue managed Spark environments that run your ETL jobs are protected with the same security
practices followed by other AWS services. Those practices are listed in the AWS Access section of the
Introduction to AWS Security Processes whitepaper.

AWS Glue Concepts


The following diagram shows the architecture of an AWS Glue environment.

You define jobs in AWS Glue to accomplish the work that's required to extract, transform, and load (ETL)
data from a data source to a data target. You typically perform the following actions:

• You define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You
point your crawler at a data store, and the crawler creates table definitions in the Data Catalog.

In addition to table definitions, the AWS Glue Data Catalog contains other metadata that is required to
define ETL jobs. You use this metadata when you define a job to transform your data.
• AWS Glue can generate a script to transform your data. Or, you can provide the script in the AWS Glue
console or API.
• You can run your job on demand, or you can set it up to start when a specified trigger occurs. The
trigger can be a time-based schedule or an event.


When your job runs, a script extracts data from your data source, transforms the data, and loads it to
your data target. The script runs in an Apache Spark environment in AWS Glue.
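The following is a minimal PySpark sketch of what such a script can look like. The database, table, mappings, and output path are placeholder examples; a script that AWS Glue generates for you includes additional boilerplate specific to your sources and targets.

# Sketch: extract from a cataloged table, transform, and load to a target.
# "example_db", "example_table", and the S3 path are placeholder names.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: build a DynamicFrame from a table that a crawler cataloged.
source = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table")

# Transform: rename and retype columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "string"),
              ("price", "string", "price", "double")])

# Load: write the result to the target data store.
glueContext.write_dynamic_frame.from_options(
    frame=mapped, connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet")

job.commit()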

Important
Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain
metadata; they don't contain data from a data store.

Text-based data, such as CSVs, must be encoded in UTF-8 for AWS Glue to process it successfully.
For more information, see UTF-8 in Wikipedia.

AWS Glue Terminology


AWS Glue relies on the interaction of several components to create and manage your data warehouse
workflow.

AWS Glue Data Catalog


The persistent metadata store in AWS Glue. Each AWS account has one AWS Glue Data Catalog. It
contains table definitions, job definitions, and other control information to manage your AWS Glue
environment.

Classifier
Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV,
JSON, AVRO, XML, and others. It also provides classifiers for common relational database management
systems using a JDBC connection. You can write your own classifier by using a grok pattern or by
specifying a row tag in an XML document.

Connection
Contains the properties that are required to connect to your data store.

Crawler
A program that connects to a data store (source or target), progresses through a prioritized list of
classifiers to determine the schema for your data, and then creates metadata in the AWS Glue Data
Catalog.

Database
A set of associated table definitions organized into a logical group in AWS Glue.

Data store, data source, data target


A data store is a repository for persistently storing your data. Examples include Amazon S3 buckets and
relational databases. A data source is a data store that is used as input to a process or transform. A data
target is a data store that a process or transform writes to.

Development endpoint
An environment that you can use to develop and test your AWS Glue scripts.


Job
The business logic that is required to perform ETL work. It is composed of a transformation script, data
sources, and data targets. Job runs are initiated by triggers that can be scheduled or triggered by events.

Notebook server
A web-based environment that you can use to run your PySpark statements. For more information,
see Apache Zeppelin. You can set up a notebook server on a development endpoint to run PySpark
statements with AWS Glue extensions.

Script
Code that extracts data from sources, transforms it, and loads it into targets. AWS Glue generates
PySpark or Scala scripts. PySpark is a Python dialect for ETL programming.

Table
The metadata definition that represents your data. Whether your data is in an Amazon Simple Storage
Service (Amazon S3) file, an Amazon Relational Database Service (Amazon RDS) table, or another set
of data, a table defines the schema of your data. A table in the AWS Glue Data Catalog consists of the
names of columns, data type definitions, and other metadata about a base dataset. The schema of your
data is represented in your AWS Glue table definition. The actual data remains in its original data store,
whether it be in a file or a relational database table. AWS Glue catalogs your files and relational database
tables in the AWS Glue Data Catalog. They are used as sources and targets when you create an ETL job.

Transform
The code logic that is used to manipulate your data into a different format.

Trigger
Initiates an ETL job. Triggers can be defined based on a scheduled time or an event.

AWS Glue Components


AWS Glue provides a console and API operations to set up and manage your extract, transform, and
load (ETL) workload. You can use API operations through several language-specific SDKs and the AWS
Command Line Interface (AWS CLI). For information about using the AWS CLI, see AWS CLI Command
Reference.

AWS Glue uses the AWS Glue Data Catalog to store metadata about data sources, transforms, and
targets. The Data Catalog is a drop-in replacement for the Apache Hive Metastore. The AWS Glue Jobs
system provides a managed infrastructure for defining, scheduling, and running ETL operations on your
data. For more information about the AWS Glue API, see AWS Glue API (p. 371).

AWS Glue Console


You use the AWS Glue console to define and orchestrate your ETL workflow. The console calls several API
operations in the AWS Glue Data Catalog and AWS Glue Jobs system to perform the following tasks:

• Define AWS Glue objects such as jobs, tables, crawlers, and connections.
• Schedule when crawlers run.


• Define events or schedules for job triggers.


• Search and filter lists of AWS Glue objects.
• Edit transformation scripts.

AWS Glue Data Catalog


The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you
store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive
metastore.

Each AWS account has one AWS Glue Data Catalog. It provides a uniform repository where disparate
systems can store and find metadata to keep track of data in data silos, and use that metadata to query
and transform the data.
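The following is a minimal sketch of reading that metadata back with the AWS SDK for Python (Boto3). The database name is a placeholder, and result pagination is omitted for brevity.

# Sketch: listing databases and tables in the Data Catalog with Boto3.
import boto3

glue = boto3.client("glue")

# List the databases in this account's Data Catalog.
for database in glue.get_databases()["DatabaseList"]:
    print(database["Name"])

# List the tables defined in one database, with their column names.
for table in glue.get_tables(DatabaseName="example_db")["TableList"]:
    columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)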

You can use AWS Identity and Access Management (IAM) policies to control access to the data sources
managed by the AWS Glue Data Catalog. These policies allow different groups in your enterprise to
safely publish data to the wider organization while protecting sensitive information. IAM policies let you
clearly and consistently define which users have access to which data, regardless of its location.

For information about how to use the AWS Glue Data Catalog, see Populating the AWS Glue Data
Catalog (p. 86). For information about how to program using the Data Catalog API, see Catalog
API (p. 385).

AWS Glue Crawlers and Classifiers


AWS Glue also lets you set up crawlers that can scan data in all kinds of repositories, classify it, extract
schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. From
there it can be used to guide ETL operations.
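The following is a minimal sketch of defining and starting such a crawler with the AWS SDK for Python (Boto3). The crawler name, IAM role, database, and Amazon S3 path are placeholders; the role must grant AWS Glue access to the path being crawled.

# Sketch: define a crawler over an S3 location and run it.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="example-crawler",
    Role="AWSGlueServiceRole-example",      # IAM role the crawler assumes
    DatabaseName="example_db",              # where table definitions are written
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw-data/"}]},
)

glue.start_crawler(Name="example-crawler")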

For information about how to set up crawlers and classifiers, see Cataloging Tables with a
Crawler (p. 97). For information about how to program crawlers and classifiers using the AWS Glue
API, see Crawlers and Classifiers API (p. 427).

AWS Glue ETL Operations


Using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API
for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL
operations. For example, you can extract, clean, and transform raw data, and then store the result in a
different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a
relational form and save it in Amazon Redshift.
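As a sketch of that example, the following PySpark fragment reads a cataloged CSV table and writes it to Amazon Redshift. It assumes a JDBC connection named example-redshift already exists in the Data Catalog; the database, table, and Amazon S3 temporary directory are placeholders.

# Sketch: load a cataloged CSV table into Amazon Redshift.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# "csv_db"/"csv_table" stand in for a table a crawler created from CSV files.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="csv_db", table_name="csv_table")

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="example-redshift",
    connection_options={"dbtable": "public.example_table", "database": "exampledb"},
    redshift_tmp_dir="s3://example-bucket/temp-dir/")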

For more information about how to use AWS Glue ETL capabilities, see Programming ETL
Scripts (p. 244).

The AWS Glue Jobs System


The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can
create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to
different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the
arrival of new data.
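The following is a minimal sketch of defining a job and a time-based trigger with the AWS SDK for Python (Boto3). The job name, role, script location, and schedule are placeholders.

# Sketch: register an ETL script as a job and run it on a daily schedule.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="example-etl-job",
    Role="AWSGlueServiceRole-example",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/example-etl-job.py",
             "PythonVersion": "3"},
)

# Run the job every day at 06:00 UTC.
glue.create_trigger(
    Name="example-daily-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "example-etl-job"}],
    StartOnCreation=True,
)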

For more information about using the AWS Glue Jobs system, see Running and Monitoring AWS
Glue (p. 188). For information about programming using the AWS Glue Jobs system API, see Jobs
API (p. 451).


Converting Semi-Structured Schemas to Relational Schemas

It's common to want to convert semi-structured data into relational tables. Conceptually, you are
flattening a hierarchical schema to a relational schema. AWS Glue can perform this conversion for you
on-the-fly.

Semi-structured data typically contains mark-up to identify entities within the data. It can have nested
data structures with no fixed schema. For more information about semi-structured data, see Semi-
structured data in Wikipedia.

Relational data is represented by tables that consist of rows and columns. Relationships between tables
can be represented by a primary key (PK) to foreign key (FK) relationship. For more information, see
Relational database in Wikipedia.

AWS Glue uses crawlers to infer schemas for semi-structured data. It then transforms the data to a
relational schema using an ETL (extract, transform, and load) job. For example, you might want to
parse JSON data from Amazon Simple Storage Service (Amazon S3) source files to Amazon Relational
Database Service (Amazon RDS) tables. Understanding how AWS Glue handles the differences between
schemas can help you understand the transformation process.

This diagram shows how AWS Glue transforms a semi-structured schema to a relational schema.

The diagram illustrates the following:

• Single value A converts directly to a relational column.


• The pair of values, B1 and B2, convert to two relational columns.


• Structure C, with children X and Y, converts to two relational columns.
• Array D[] converts to a relational column with a foreign key (FK) that points to another relational
table. Along with a primary key (PK), the second relational table has columns that contain the offset
and value of the items in the array.
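The Relationalize transform in the AWS Glue PySpark library performs this kind of flattening. The following is a minimal sketch; the database, table, and Amazon S3 paths are placeholders, and the exact names of the child frames depend on your schema.

# Sketch: flatten a nested DynamicFrame into a collection of relational frames.
from awsglue.transforms import Relationalize
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

nested = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="nested_json_table")

flattened = Relationalize.apply(
    frame=nested,
    staging_path="s3://example-bucket/temp-dir/",
    name="root")

# "root" holds the flattened top-level columns; a child frame for an array
# field (such as D) holds one row per array element, keyed back to its parent.
for frame_name in flattened.keys():
    glueContext.write_dynamic_frame.from_options(
        frame=flattened.select(frame_name),
        connection_type="s3",
        connection_options={"path": f"s3://example-bucket/relational/{frame_name}/"},
        format="parquet")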


Getting Started Using AWS Glue


The following sections provide an overview and walk you through setting up and using AWS Glue. For
information about AWS Glue concepts and components, see AWS Glue: How It Works (p. 3).

Topics
• Setting up IAM Permissions for AWS Glue (p. 10)
• Setting Up DNS in Your VPC (p. 28)
• Setting Up Your Environment to Access Data Stores (p. 28)
• Setting Up Your Environment for Development Endpoints (p. 33)
• Setting Up Encryption in AWS Glue (p. 35)
• AWS Glue Console Workflow Overview (p. 37)

Setting up IAM Permissions for AWS Glue


You use AWS Identity and Access Management (IAM) to define policies and roles that are needed to
access resources used by AWS Glue. The following steps lead you through the basic permissions that you
need to set up your environment. Depending on your business needs, you might have to add or reduce
access to your resources.

1. Create an IAM Policy for the AWS Glue Service (p. 10): Create a service policy that allows access to
AWS Glue resources.
2. Create an IAM Role for AWS Glue (p. 14): Create an IAM role, and attach the AWS Glue service policy
and a policy for your Amazon Simple Storage Service (Amazon S3) resources that are used by AWS
Glue.
3. Attach a Policy to IAM Users That Access AWS Glue (p. 15): Attach policies to any IAM user that
signs in to the AWS Glue console.
4. Create an IAM Policy for Notebooks (p. 22): Create a notebook server policy to use in the creation
of notebook servers on development endpoints.
5. Create an IAM Role for Notebooks (p. 24): Create an IAM role and attach the notebook server policy.
6. Create an IAM Policy for Amazon SageMaker Notebooks (p. 25): Create an IAM policy to use when
creating Amazon SageMaker notebooks on development endpoints.
7. Create an IAM Role for Amazon SageMaker Notebooks (p. 27): Create an IAM role and attach the
policy to grant permissions when creating Amazon SageMaker notebooks on development endpoints.

Step 1: Create an IAM Policy for the AWS Glue Service


For any operation that accesses data on another AWS resource, such as accessing your objects in Amazon
S3, AWS Glue needs permission to access the resource on your behalf. You provide those permissions by
using AWS Identity and Access Management (IAM).
Note
You can skip this step if you use the AWS managed policy AWSGlueServiceRole.

In this step, you create a policy that is similar to AWSGlueServiceRole. You can find the most current
version of AWSGlueServiceRole on the IAM console.

To create an IAM policy for AWS Glue


This policy grants permission for some Amazon S3 actions to manage resources in your account that are
needed by AWS Glue when it assumes the role using this policy. Some of the resources that are specified


in this policy refer to default names that are used by AWS Glue for Amazon S3 buckets, Amazon S3 ETL
scripts, CloudWatch Logs, and Amazon EC2 resources. For simplicity, AWS Glue writes some Amazon S3
objects into buckets in your account prefixed with aws-glue-* by default.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Policies.
3. Choose Create Policy.
4. On the Create Policy screen, navigate to a tab to edit JSON. Create a policy document with the
following JSON statements, and then choose Review policy.
Note
Add any permissions needed for Amazon S3 resources. You might want to scope the
resources section of your access policy to only those resources that are required.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:*",
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:ListAllMyBuckets",
"s3:GetBucketAcl",
"ec2:DescribeVpcEndpoints",
"ec2:DescribeRouteTables",
"ec2:CreateNetworkInterface",
"ec2:DeleteNetworkInterface",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeVpcAttribute",
"iam:ListRolePolicies",
"iam:GetRole",
"iam:GetRolePolicy",
"cloudwatch:PutMetricData"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:CreateBucket"
],
"Resource": [
"arn:aws:s3:::aws-glue-*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::aws-glue-*/*",
"arn:aws:s3:::*/*aws-glue-*/*"
]


},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::crawler-public*",
"arn:aws:s3:::aws-glue-*"
]
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:AssociateKmsKey"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
},
{
"Effect": "Allow",
"Action": [
"ec2:CreateTags",
"ec2:DeleteTags"
],
"Condition": {
"ForAllValues:StringEquals": {
"aws:TagKeys": [
"aws-glue-service-resource"
]
}
},
"Resource": [
"arn:aws:ec2:*:*:network-interface/*",
"arn:aws:ec2:*:*:security-group/*",
"arn:aws:ec2:*:*:instance/*"
]
}
]
}

The following table describes the permissions granted by this policy.

Action: "glue:*"
Resource: "*"
Description: Allows permission to run all AWS Glue API operations.

Action: "s3:GetBucketLocation", "s3:ListBucket", "s3:ListAllMyBuckets", "s3:GetBucketAcl"
Resource: "*"
Description: Allows listing of Amazon S3 buckets from crawlers, jobs, development endpoints, and notebook servers.

Action: "ec2:DescribeVpcEndpoints", "ec2:DescribeRouteTables", "ec2:CreateNetworkInterface", "ec2:DeleteNetworkInterface", "ec2:DescribeNetworkInterfaces", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeVpcAttribute"
Resource: "*"
Description: Allows setup of Amazon EC2 network items, such as VPCs, when running jobs, crawlers, and development endpoints.

Action: "iam:ListRolePolicies", "iam:GetRole", "iam:GetRolePolicy"
Resource: "*"
Description: Allows listing IAM roles from crawlers, jobs, development endpoints, and notebook servers.

Action: "cloudwatch:PutMetricData"
Resource: "*"
Description: Allows writing CloudWatch metrics for jobs.

Action: "s3:CreateBucket"
Resource: "arn:aws:s3:::aws-glue-*"
Description: Allows the creation of Amazon S3 buckets in your account from jobs and notebook servers. Naming convention: Uses Amazon S3 folders named aws-glue-.

Action: "s3:GetObject", "s3:PutObject", "s3:DeleteObject"
Resource: "arn:aws:s3:::aws-glue-*/*", "arn:aws:s3:::*/*aws-glue-*/*"
Description: Allows get, put, and delete of Amazon S3 objects in your account when storing objects such as ETL scripts and notebook server locations. Naming convention: Grants permission to Amazon S3 buckets or folders whose names are prefixed with aws-glue-.

Action: "s3:GetObject"
Resource: "arn:aws:s3:::crawler-public*", "arn:aws:s3:::aws-glue-*"
Description: Allows get of Amazon S3 objects used by examples and tutorials from crawlers and jobs. Naming convention: Amazon S3 bucket names begin with crawler-public and aws-glue-.

Action: "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"
Resource: "arn:aws:logs:*:*:/aws-glue/*"
Description: Allows writing logs to CloudWatch Logs. Naming convention: AWS Glue writes logs to log groups whose names begin with aws-glue.

Action: "ec2:CreateTags", "ec2:DeleteTags"
Resource: "arn:aws:ec2:*:*:network-interface/*", "arn:aws:ec2:*:*:security-group/*", "arn:aws:ec2:*:*:instance/*"
Description: Allows tagging of Amazon EC2 resources created for development endpoints. Naming convention: AWS Glue tags Amazon EC2 network interfaces, security groups, and instances with aws-glue-service-resource.


5. On the Review Policy screen, type your Policy Name, for example GlueServiceRolePolicy. Type an
optional description, and when you're satisfied with the policy, choose Create policy.
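
If you prefer to create the policy programmatically, the following sketch shows roughly equivalent
calls using the AWS SDK for Python (Boto3). It assumes that the JSON policy document shown above has
been saved locally as glue-service-policy.json; the file name and policy name are placeholders.

import boto3

iam = boto3.client("iam")

# Read the policy document created in the previous steps (placeholder file name).
with open("glue-service-policy.json") as f:
    policy_document = f.read()

response = iam.create_policy(
    PolicyName="GlueServiceRolePolicy",
    PolicyDocument=policy_document,
    Description="Policy that grants AWS Glue access to the resources it manages")

print(response["Policy"]["Arn"])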

Step 2: Create an IAM Role for AWS Glue


You need to grant your IAM role permissions that AWS Glue can assume when calling other services
on your behalf. This includes access to Amazon S3 for any sources, targets, scripts, and temporary
directories that you use with AWS Glue. Permission is needed by crawlers, jobs, and development
endpoints.

You provide those permissions by using AWS Identity and Access Management (IAM). Add a policy to the
IAM role that you pass to AWS Glue.

To create an IAM role for AWS Glue

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Roles.
3. Choose Create role.
4. For role type, choose AWS Service, find and choose Glue, and choose Next: Permissions.
5. On the Attach permissions policy page, choose the policies that contain the required permissions;
for example, the AWS managed policy AWSGlueServiceRole for general AWS Glue permissions and
the AWS managed policy AmazonS3FullAccess for access to Amazon S3 resources. Then choose
Next: Review.
Note
Ensure that one of the policies in this role grants permissions to your Amazon S3 sources
and targets. You might want to provide your own policy for access to specific Amazon S3
resources. Data sources require s3:ListBucket and s3:GetObject permissions. Data
targets require s3:ListBucket, s3:PutObject, and s3:DeleteObject permissions. For
more information about creating an Amazon S3 policy for your resources, see Specifying
Resources in a Policy. For an example Amazon S3 policy, see Writing IAM Policies: How to
Grant Access to an Amazon S3 Bucket.
If you plan to access Amazon S3 sources and targets that are encrypted with SSE-KMS, then
attach a policy that allows AWS Glue crawlers, jobs, and development endpoints to decrypt
the data. For more information, see Protecting Data Using Server-Side Encryption with AWS
KMS-Managed Keys (SSE-KMS). The following is an example:

{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"kms:Decrypt"
],
"Resource":[
"arn:aws:kms:*:account-id-without-hyphens:key/key-id"
]
}
]
}

6. For Role name, type a name for your role; for example, AWSGlueServiceRoleDefault. Create the
role with the name prefixed with the string AWSGlueServiceRole to allow the role to be passed
from console users to the service. AWS Glue provided policies expect IAM service roles to begin with


AWSGlueServiceRole. Otherwise, you must add a policy to allow your users the iam:PassRole
permission for IAM roles to match your naming convention. Choose Create Role.
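
The same role can be created with the AWS SDK for Python (Boto3), as in the following sketch. The
role name matches the example above; the managed policy ARNs follow the standard
arn:aws:iam::aws:policy form, but verify them in the IAM console before use.

import json
import boto3

iam = boto3.client("iam")
role_name = "AWSGlueServiceRoleDefault"

# Trust policy that lets the AWS Glue service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the managed policies mentioned in step 5 (verify these ARNs for your account).
for policy_arn in (
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
        "arn:aws:iam::aws:policy/AmazonS3FullAccess"):
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)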

Step 3: Attach a Policy to IAM Users That Access AWS Glue
Any IAM user that signs in to the AWS Glue console or AWS Command Line Interface (AWS CLI) must
have permissions to access specific resources. You provide those permissions by using AWS Identity and
Access Management (IAM), through policies.

When you finish this step, your IAM user has the following policies attached:

• The AWS managed policy AWSGlueConsoleFullAccess or the custom policy GlueConsoleAccessPolicy


• AWSGlueConsoleSageMakerNotebookFullAccess
• CloudWatchLogsReadOnlyAccess
• AWSCloudFormationReadOnlyAccess
• AmazonAthenaFullAccess

To attach an inline policy and embed it in an IAM user


You can attach an AWS managed policy or an inline policy to an IAM user to access the AWS Glue
console. Some of the resources specified in this policy refer to default names that are used by AWS Glue
for Amazon S3 buckets, Amazon S3 ETL scripts, CloudWatch Logs, AWS CloudFormation, and Amazon
EC2 resources. For simplicity, AWS Glue writes some Amazon S3 objects into buckets in your account
prefixed with aws-glue-* by default.
Note
You can skip this step if you use the AWS managed policy AWSGlueConsoleFullAccess.
Important
AWS Glue needs permission to assume a role that is used to perform work on your behalf. To
accomplish this, you add the iam:PassRole permissions to your AWS Glue users. This policy
grants permission to roles that begin with AWSGlueServiceRole for AWS Glue service roles,
and AWSGlueServiceNotebookRole for roles that are required when you create a notebook
server. You can also create your own policy for iam:PassRole permissions that follows your
naming convention.

In this step, you create a policy that is similar to AWSGlueConsoleFullAccess. You can find the most
current version of AWSGlueConsoleFullAccess on the IAM console.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Users.
3. In the list, choose the name of the user to embed a policy in.
4. Choose the Permissions tab and, if necessary, expand the Permissions policies section.
5. Choose the Add Inline policy link.
6. On the Create Policy screen, navigate to a tab to edit JSON. Create a policy document with the
following JSON statements, and then choose Review policy.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [


"glue:*",
"redshift:DescribeClusters",
"redshift:DescribeClusterSubnetGroups",
"iam:ListRoles",
"iam:ListRolePolicies",
"iam:GetRole",
"iam:GetRolePolicy",
"iam:ListAttachedRolePolicies",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeVpcs",
"ec2:DescribeVpcEndpoints",
"ec2:DescribeRouteTables",
"ec2:DescribeVpcAttribute",
"ec2:DescribeKeyPairs",
"ec2:DescribeInstances",
"rds:DescribeDBInstances",
"s3:ListAllMyBuckets",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetBucketLocation",
"cloudformation:DescribeStacks",
"cloudformation:GetTemplateSummary",
"dynamodb:ListTables",
"kms:ListAliases",
"kms:DescribeKey",
"cloudwatch:GetMetricData",
"cloudwatch:ListDashboards"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-glue-*/*",
"arn:aws:s3:::*/*aws-glue-*/*",
"arn:aws:s3:::aws-glue-*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:CreateBucket"
],
"Resource": [
"arn:aws:s3:::aws-glue-*"
]
},
{
"Effect": "Allow",
"Action": [
"logs:GetLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
},
{
"Effect": "Allow",
"Action": [


"cloudformation:CreateStack",
"cloudformation:DeleteStack"
],
"Resource": "arn:aws:cloudformation:*:*:stack/aws-glue*/*"
},
{
"Effect": "Allow",
"Action": [
"ec2:RunInstances"
],
"Resource": [
"arn:aws:ec2:*:*:instance/*",
"arn:aws:ec2:*:*:key-pair/*",
"arn:aws:ec2:*:*:image/*",
"arn:aws:ec2:*:*:security-group/*",
"arn:aws:ec2:*:*:network-interface/*",
"arn:aws:ec2:*:*:subnet/*",
"arn:aws:ec2:*:*:volume/*"
]
},
{
"Effect": "Allow",
"Action": [
"ec2:TerminateInstances",
"ec2:CreateTags",
"ec2:DeleteTags"
],
"Resource": [
"arn:aws:ec2:*:*:instance/*"
],
"Condition": {
"StringLike": {
"ec2:ResourceTag/aws:cloudformation:stack-id":
"arn:aws:cloudformation:*:*:stack/aws-glue-*/*"
},
"StringEquals": {
"ec2:ResourceTag/aws:cloudformation:logical-id": "ZeppelinInstance"
}
}
},
{
"Action": [
"iam:PassRole"
],
"Effect": "Allow",
"Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
"Condition": {
"StringLike": {
"iam:PassedToService": [
"glue.amazonaws.com"
]
}
}
},
{
"Action": [
"iam:PassRole"
],
"Effect": "Allow",
"Resource": "arn:aws:iam::*:role/AWSGlueServiceNotebookRole*",
"Condition": {
"StringLike": {
"iam:PassedToService": [
"ec2.amazonaws.com"
]
}


}
},
{
"Action": [
"iam:PassRole"
],
"Effect": "Allow",
"Resource": [
"arn:aws:iam::*:role/service-role/AWSGlueServiceRole*"
],
"Condition": {
"StringLike": {
"iam:PassedToService": [
"glue.amazonaws.com"
]
}
}
}
]
}

The following table describes the permissions granted by this policy.

Action: "glue:*"
Resource: "*"
Description: Allows permission to run all AWS Glue API operations.

Action: "redshift:DescribeClusters", "redshift:DescribeClusterSubnetGroups"
Resource: "*"
Description: Allows creation of connections to Amazon Redshift.

Action: "iam:ListRoles", "iam:ListRolePolicies", "iam:GetRole", "iam:GetRolePolicy", "iam:ListAttachedRolePolicies"
Resource: "*"
Description: Allows listing IAM roles when working with crawlers, jobs, development endpoints, and notebook servers.

Action: "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeVpcs", "ec2:DescribeVpcEndpoints", "ec2:DescribeRouteTables", "ec2:DescribeVpcAttribute", "ec2:DescribeKeyPairs", "ec2:DescribeInstances"
Resource: "*"
Description: Allows setup of Amazon EC2 network items, such as VPCs, when running jobs, crawlers, and development endpoints.

Action: "rds:DescribeDBInstances"
Resource: "*"
Description: Allows creation of connections to Amazon RDS.

Action: "s3:ListAllMyBuckets", "s3:ListBucket", "s3:GetBucketAcl", "s3:GetBucketLocation"
Resource: "*"
Description: Allows listing of Amazon S3 buckets when working with crawlers, jobs, development endpoints, and notebook servers.

Action: "dynamodb:ListTables"
Resource: "*"
Description: Allows listing of DynamoDB tables.

Action: "kms:ListAliases", "kms:DescribeKey"
Resource: "*"
Description: Allows working with KMS keys.

Action: "cloudwatch:GetMetricData", "cloudwatch:ListDashboards"
Resource: "*"
Description: Allows working with CloudWatch metrics.

Action: "s3:GetObject", "s3:PutObject"
Resource: "arn:aws:s3:::aws-glue-*/*", "arn:aws:s3:::*/*aws-glue-*/*", "arn:aws:s3:::aws-glue-*"
Description: Allows get and put of Amazon S3 objects in your account when storing objects such as ETL scripts and notebook server locations. Naming convention: Grants permission to Amazon S3 buckets or folders whose names are prefixed with aws-glue-.

Action: "s3:CreateBucket"
Resource: "arn:aws:s3:::aws-glue-*"
Description: Allows creation of an Amazon S3 bucket in your account when storing objects such as ETL scripts and notebook server locations. Naming convention: Grants permission to Amazon S3 buckets whose names are prefixed with aws-glue-.

Action: "logs:GetLogEvents"
Resource: "arn:aws:logs:*:*:/aws-glue/*"
Description: Allows retrieval of CloudWatch Logs. Naming convention: AWS Glue writes logs to log groups whose names begin with aws-glue-.

Action: "cloudformation:CreateStack", "cloudformation:DeleteStack"
Resource: "arn:aws:cloudformation:*:*:stack/aws-glue*/*"
Description: Allows managing AWS CloudFormation stacks when working with notebook servers. Naming convention: AWS Glue creates stacks whose names begin with aws-glue.

Action: "ec2:RunInstances"
Resource: "arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:key-pair/*", "arn:aws:ec2:*:*:image/*", "arn:aws:ec2:*:*:security-group/*", "arn:aws:ec2:*:*:network-interface/*", "arn:aws:ec2:*:*:subnet/*", "arn:aws:ec2:*:*:volume/*"
Description: Allows running of development endpoints and notebook servers.

Action: "ec2:TerminateInstances", "ec2:CreateTags", "ec2:DeleteTags"
Resource: "arn:aws:ec2:*:*:instance/*"
Description: Allows manipulating development endpoints and notebook servers. Naming convention: Applies to AWS CloudFormation stacks whose names are prefixed with aws-glue- and whose logical-id is ZeppelinInstance.

Action: "iam:PassRole"
Resource: "arn:aws:iam::*:role/AWSGlueServiceRole*"
Description: Allows AWS Glue to assume PassRole permission for roles that begin with AWSGlueServiceRole.

Action: "iam:PassRole"
Resource: "arn:aws:iam::*:role/AWSGlueServiceNotebookRole*"
Description: Allows Amazon EC2 to assume PassRole permission for roles that begin with AWSGlueServiceNotebookRole.

Action: "iam:PassRole"
Resource: "arn:aws:iam::*:role/service-role/AWSGlueServiceRole*"
Description: Allows AWS Glue to assume PassRole permission for roles that begin with service-role/AWSGlueServiceRole.

7. On the Review policy screen, type a name for the policy, for example GlueConsoleAccessPolicy.
When you're satisfied with the policy, then choose Create policy. Ensure that no errors appear in a
red box at the top of the screen. Correct any that are reported.
Note
If Use autoformatting is selected, the policy is reformatted whenever you open a policy or
choose Validate Policy.

To attach the AWSGlueConsoleFullAccess managed policy

You can attach the AWSGlueConsoleFullAccess policy to provide permissions that are required by the
AWS Glue console user.
Note
You can skip this step if you created your own policy for AWS Glue console access.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Policies.
3. In the list of policies, select the check box next to the AWSGlueConsoleFullAccess. You can use the
Filter menu and the search box to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.

To attach the AWSGlueConsoleSageMakerNotebookFullAccess managed policy

You can attach the AWSGlueConsoleSageMakerNotebookFullAccess policy to a user to manage Amazon
SageMaker notebooks created on the AWS Glue console. In addition to other required AWS Glue console
permissions, this policy grants access to resources needed to manage Amazon SageMaker notebooks.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.


2. In the navigation pane, choose Policies.


3. In the list of policies, select the check box next to the
AWSGlueConsoleSageMakerNotebookFullAccess. You can use the Filter menu and the search box
to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.

To attach the CloudWatchLogsReadOnlyAccess managed policy

You can attach the CloudWatchLogsReadOnlyAccess policy to a user to view the logs created by AWS
Glue on the CloudWatch Logs console.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Policies.
3. In the list of policies, select the check box next to the CloudWatchLogsReadOnlyAccess. You can use
the Filter menu and the search box to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.

To attach the AWSCloudFormationReadOnlyAccess managed policy

You can attach the AWSCloudFormationReadOnlyAccess policy to a user to view the AWS
CloudFormation stacks used by AWS Glue on the AWS CloudFormation console.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Policies.
3. In the list of policies, select the check box next to the AWSCloudFormationReadOnlyAccess. You
can use the Filter menu and the search box to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.

To attach the AmazonAthenaFullAccess managed policy

You can attach the AmazonAthenaFullAccess policy to a user to view Amazon S3 data in the Athena
console.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Policies.
3. In the list of policies, select the check box next to the AmazonAthenaFullAccess. You can use the
Filter menu and the search box to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.
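
To attach the same set of managed policies from a script instead of the console, a sketch like the
following can be used with the AWS SDK for Python (Boto3). The user name is a placeholder, and the
ARNs assume the standard arn:aws:iam::aws:policy prefix for AWS managed policies.

import boto3

iam = boto3.client("iam")

managed_policy_arns = [
    "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
    "arn:aws:iam::aws:policy/AWSGlueConsoleSageMakerNotebookFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess",
    "arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
]

# Attach each managed policy to the console user (placeholder user name).
for policy_arn in managed_policy_arns:
    iam.attach_user_policy(UserName="glue-console-user", PolicyArn=policy_arn)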


Step 4: Create an IAM Policy for Notebook Servers


If you plan to use notebooks with development endpoints, you must specify permissions when
you create the notebook server. You provide those permissions by using AWS Identity and Access
Management (IAM).

This policy grants permission for some Amazon S3 actions to manage resources in your account that are
needed by AWS Glue when it assumes the role using this policy. Some of the resources that are specified
in this policy refer to default names used by AWS Glue for Amazon S3 buckets, Amazon S3 ETL scripts,
and Amazon EC2 resources. For simplicity, AWS Glue writes some Amazon S3 objects into buckets in your
account prefixed with aws-glue-* by default.
Note
You can skip this step if you use the AWS managed policy AWSGlueServiceNotebookRole.

In this step, you create a policy that is similar to AWSGlueServiceNotebookRole. You can find the
most current version of AWSGlueServiceNotebookRole on the IAM console.

To create an IAM policy for notebooks

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Policies.
3. Choose Create Policy.
4. On the Create Policy screen, navigate to a tab to edit JSON. Create a policy document with the
following JSON statements, and then choose Review policy.

{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"glue:CreateDatabase",
"glue:CreatePartition",
"glue:CreateTable",
"glue:DeleteDatabase",
"glue:DeletePartition",
"glue:DeleteTable",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartition",
"glue:GetPartitions",
"glue:GetTable",
"glue:GetTableVersions",
"glue:GetTables",
"glue:UpdateDatabase",
"glue:UpdatePartition",
"glue:UpdateTable",
"glue:CreateBookmark",
"glue:GetBookmark",
"glue:UpdateBookmark",
"glue:GetMetric",
"glue:PutMetric",
"glue:CreateConnection",
"glue:CreateJob",
"glue:DeleteConnection",
"glue:DeleteJob",
"glue:GetConnection",
"glue:GetConnections",


"glue:GetDevEndpoint",
"glue:GetDevEndpoints",
"glue:GetJob",
"glue:GetJobs",
"glue:UpdateJob",
"glue:BatchDeleteConnection",
"glue:UpdateConnection",
"glue:GetUserDefinedFunction",
"glue:UpdateUserDefinedFunction",
"glue:GetUserDefinedFunctions",
"glue:DeleteUserDefinedFunction",
"glue:CreateUserDefinedFunction",
"glue:BatchGetPartition",
"glue:BatchDeletePartition",
"glue:BatchCreatePartition",
"glue:BatchDeleteTable",
"glue:UpdateDevEndpoint",
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:ListAllMyBuckets",
"s3:GetBucketAcl"
],
"Resource":[
"*"
]
},
{
"Effect":"Allow",
"Action":[
"s3:GetObject"
],
"Resource":[
"arn:aws:s3:::crawler-public*",
"arn:aws:s3:::aws-glue*"
]
},
{
"Effect":"Allow",
"Action":[
"s3:PutObject",
"s3:DeleteObject"
],
"Resource":[
"arn:aws:s3:::aws-glue*"
]
},
{
"Effect":"Allow",
"Action":[
"ec2:CreateTags",
"ec2:DeleteTags"
],
"Condition":{
"ForAllValues:StringEquals":{
"aws:TagKeys":[
"aws-glue-service-resource"
]
}
},
"Resource":[
"arn:aws:ec2:*:*:network-interface/*",
"arn:aws:ec2:*:*:security-group/*",
"arn:aws:ec2:*:*:instance/*"
]
}
]
}

The following table describes the permissions granted by this policy.

Action: "glue:*"
Resource: "*"
Description: Allows permission to run all AWS Glue API operations.

Action: "s3:GetBucketLocation", "s3:ListBucket", "s3:ListAllMyBuckets", "s3:GetBucketAcl"
Resource: "*"
Description: Allows listing of Amazon S3 buckets from notebook servers.

Action: "s3:GetObject"
Resource: "arn:aws:s3:::crawler-public*", "arn:aws:s3:::aws-glue-*"
Description: Allows get of Amazon S3 objects used by examples and tutorials from notebooks. Naming convention: Amazon S3 bucket names begin with crawler-public and aws-glue-.

Action: "s3:PutObject", "s3:DeleteObject"
Resource: "arn:aws:s3:::aws-glue*"
Description: Allows put and delete of Amazon S3 objects in your account from notebooks. Naming convention: Uses Amazon S3 folders named aws-glue.

Action: "ec2:CreateTags", "ec2:DeleteTags"
Resource: "arn:aws:ec2:*:*:network-interface/*", "arn:aws:ec2:*:*:security-group/*", "arn:aws:ec2:*:*:instance/*"
Description: Allows tagging of Amazon EC2 resources created for notebook servers. Naming convention: AWS Glue tags Amazon EC2 instances with aws-glue-service-resource.

5. On the Review Policy screen, type your Policy Name, for example
GlueServiceNotebookPolicyDefault. Type an optional description, and when you're satisfied with
the policy, choose Create policy.

Step 5: Create an IAM Role for Notebook Servers


If you plan to use notebooks with development endpoints, you need to grant the IAM role permissions.
You provide those permissions by using AWS Identity and Access Management, through an IAM role.
Note
When you create an IAM role using the IAM console, the console creates an instance profile
automatically and gives it the same name as the role to which it corresponds.

To create an IAM role for notebooks

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Roles.
3. Choose Create role.


4. For role type, choose AWS Service, find and choose EC2, and choose the EC2 use case, then choose
Next: Permissions.
5. On the Attach permissions policy page, choose the policies that contain the required permissions;
for example, AWSGlueServiceNotebookRole for general AWS Glue permissions and the AWS
managed policy AmazonS3FullAccess for access to Amazon S3 resources. Then choose Next:
Review.
Note
Ensure that one of the policies in this role grants permissions to your Amazon S3 sources
and targets. Also confirm that your policy allows full access to the location where you store
your notebook when you create a notebook server. You might want to provide your own
policy for access to specific Amazon S3 resources. For more information about creating an
Amazon S3 policy for your resources, see Specifying Resources in a Policy.
If you plan to access Amazon S3 sources and targets that are encrypted with SSE-KMS,
then attach a policy which allows notebooks to decrypt the data. For more information, see
Protecting Data Using Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS). For
example:

{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"kms:Decrypt"
],
"Resource":[
"arn:aws:kms:*:account-id-without-hyphens:key/key-id"
]
}
]
}

6. For Role name, type a name for your role. Create the role with the name prefixed with the
string AWSGlueServiceNotebookRole to allow the role to be passed from console users
to the notebook server. AWS Glue provided policies expect IAM service roles to begin with
AWSGlueServiceNotebookRole. Otherwise you must add a policy to your users to allow the
iam:PassRole permission for IAM roles to match your naming convention. For example, type
AWSGlueServiceNotebookRoleDefault. Then choose Create role.
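
Because the IAM console creates the matching instance profile automatically, a scripted version of
this step must create the instance profile itself. The following Boto3 sketch uses the example role
name from above and omits the policy attachments described in step 5.

import json
import boto3

iam = boto3.client("iam")
role_name = "AWSGlueServiceNotebookRoleDefault"

# Trust policy that lets Amazon EC2 (the notebook server host) assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Unlike the console, the API does not create an instance profile automatically.
iam.create_instance_profile(InstanceProfileName=role_name)
iam.add_role_to_instance_profile(InstanceProfileName=role_name, RoleName=role_name)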

Step 6: Create an IAM Policy for Amazon SageMaker Notebooks
If you plan to use Amazon SageMaker notebooks with development endpoints, you must specify
permissions when you create the notebook. You provide those permissions by using AWS Identity and
Access Management (IAM).

To create an IAM policy for Amazon SageMaker notebooks

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Policies.
3. Choose Create Policy.
4. On the Create Policy page, navigate to a tab to edit the JSON. Create a policy document with the
following JSON statements. Edit bucket-name, region-code, account-id, and development-endpoint-name
for your environment. The development-endpoint-name must already exist before you use this policy in
an IAM role used to create an Amazon SageMaker notebook.

{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:ListBucket"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::bucket-name"
]
},
{
"Action": [
"s3:GetObject"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::bucket-name*"
]
},
{
"Action": [
"logs:CreateLogStream",
"logs:DescribeLogStreams",
"logs:PutLogEvents",
"logs:CreateLogGroup"
],
"Effect": "Allow",
"Resource": [
"arn:aws:logs:region-code:account-id:log-group:/aws/sagemaker/*",
"arn:aws:logs:region-code:account-id:log-group:/aws/sagemaker/*:log-
stream:aws-glue-*"
]
},
{
"Action": [
"glue:UpdateDevEndpoint",
"glue:GetDevEndpoint",
"glue:GetDevEndpoints"
],
"Effect": "Allow",
"Resource": [
"arn:aws:glue:region-code:account-id:devEndpoint/development-endpoint-
name*"
]
}
]
}

Then choose Review policy.

The following table describes the permissions granted by this policy.

Action: "s3:ListBucket"
Resource: "arn:aws:s3:::bucket-name"
Description: Grants permission to list Amazon S3 buckets.

Action: "s3:GetObject"
Resource: "arn:aws:s3:::bucket-name*"
Description: Grants permission to get Amazon S3 objects that are used by Amazon SageMaker notebooks.

Action: "logs:CreateLogStream", "logs:DescribeLogStreams", "logs:PutLogEvents", "logs:CreateLogGroup"
Resource: "arn:aws:logs:region-code:account-id:log-group:/aws/sagemaker/*", "arn:aws:logs:region-code:account-id:log-group:/aws/sagemaker/*:log-stream:aws-glue-*"
Description: Grants permission to write logs to Amazon CloudWatch Logs from notebooks. Naming convention: Writes to log streams whose names begin with aws-glue.

Action: "glue:UpdateDevEndpoint", "glue:GetDevEndpoint", "glue:GetDevEndpoints"
Resource: "arn:aws:glue:region-code:account-id:devEndpoint/development-endpoint-name*"
Description: Grants permission to use a development endpoint from Amazon SageMaker notebooks.
5. On the Review Policy screen, enter your Policy Name, for example AWSGlueSageMakerNotebook.
Enter an optional description, and when you're satisfied with the policy, choose Create policy.

Step 7: Create an IAM Role for Amazon SageMaker Notebooks
If you plan to use Amazon SageMaker notebooks with development endpoints, you need to grant the
IAM role permissions. You provide those permissions by using AWS Identity and Access Management
(IAM), through an IAM role.

To create an IAM role for Amazon SageMaker notebooks

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Roles.
3. Choose Create role.
4. For role type, choose AWS Service, find and choose SageMaker, and then choose the SageMaker -
Execution use case. Then choose Next: Permissions.
5. On the Attach permissions policy page, choose the policies that contain the required permissions;
for example, AmazonSageMakerFullAccess. Choose Next: Review.

If you plan to access Amazon S3 sources and targets that are encrypted with SSE-KMS, attach a
policy that allows notebooks to decrypt the data, as shown in the following example. For more
information, see Protecting Data Using Server-Side Encryption with AWS KMS-Managed Keys (SSE-
KMS).

{
"Version":"2012-10-17",


"Statement":[
{
"Effect":"Allow",
"Action":[
"kms:Decrypt"
],
"Resource":[
"arn:aws:kms:*:account-id-without-hyphens:key/key-id"
]
}
]
}

6. For Role name, enter a name for your role. To allow the role to be passed from
console users to Amazon SageMaker, use a name that is prefixed with the string
AWSGlueServiceSageMakerNotebookRole. AWS Glue provided policies expect IAM roles to begin
with AWSGlueServiceSageMakerNotebookRole. Otherwise you must add a policy to your users to
allow the iam:PassRole permission for IAM roles to match your naming convention.

For example, enter AWSGlueServiceSageMakerNotebookRole-Default, and then choose Create role.
7. After you create the role, attach the policy that allows additional permissions required to create
Amazon SageMaker notebooks from AWS Glue.

Open the role that you just created, AWSGlueServiceSageMakerNotebookRole-Default, and choose
Attach policies. Attach the policy that you created named AWSGlueSageMakerNotebook to the
role.

Setting Up DNS in Your VPC


Domain Name System (DNS) is a standard by which names used on the internet are resolved to their
corresponding IP addresses. A DNS hostname uniquely names a computer and consists of a host name
and a domain name. DNS servers resolve DNS hostnames to their corresponding IP addresses. When
using a custom DNS for name resolution, both forward DNS lookup and reverse DNS lookup must be
implemented.

To set up DNS in your VPC, ensure that DNS hostnames and DNS resolution are both enabled in your
VPC. The VPC network attributes enableDnsHostnames and enableDnsSupport must be set to true.
To view and modify these attributes, go to the VPC console at https://console.aws.amazon.com/vpc/.

For more information, see Using DNS with your VPC.


Note
If you are using Route 53, confirm that your configuration does not override DNS network
attributes.
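
The two attributes can also be checked and enabled with the AWS SDK for Python (Boto3), as in the
following sketch. The VPC ID is a placeholder.

import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"  # placeholder VPC ID

# Each DNS attribute is modified with a separate call.
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})

# Confirm that both attributes are now true.
for attribute in ("enableDnsSupport", "enableDnsHostnames"):
    print(ec2.describe_vpc_attribute(VpcId=vpc_id, Attribute=attribute))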

Setting Up Your Environment to Access Data Stores
To run your extract, transform, and load (ETL) jobs, AWS Glue must be able to access your data stores.
If a job doesn't need to run in your virtual private cloud (VPC) subnet—for example, transforming data
from Amazon S3 to Amazon S3—no additional configuration is needed.

If a job needs to run in your VPC subnet—for example, transforming data from a JDBC data store in a
private subnet—AWS Glue sets up elastic network interfaces that enable your jobs to connect securely to
other resources within your VPC. Each elastic network interface is assigned a private IP address from the


IP address range within the subnet you specified. No public IP addresses are assigned. Security groups
specified in the AWS Glue connection are applied on each of the elastic network interfaces. For more
information, see Setting Up a VPC to Connect to JDBC Data Stores (p. 30).

All JDBC data stores that are accessed by the job must be available from the VPC subnet. To access
Amazon S3 from within your VPC, a VPC endpoint (p. 29) is required. If your job needs to access both
VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT)
gateway inside the VPC.

A job or development endpoint can only access one VPC (and subnet) at a time. If you need to access
data stores in different VPCs, you have the following options:

• Use VPC peering to access the data stores. For more information about VPC peering, see VPC Peering Basics.
• Use an Amazon S3 bucket as an intermediary storage location. Split the work into two jobs, with the
Amazon S3 output of job 1 as the input to job 2.

For JDBC data stores, you create a connection in AWS Glue with the necessary properties to connect to
your data stores. For more information about the connection, see Adding a Connection to Your Data
Store (p. 92).
Note
Make sure you set up your DNS environment for AWS Glue. For more information, see Setting
Up DNS in Your VPC (p. 28).

Topics
• Amazon VPC Endpoints for Amazon S3 (p. 29)
• Setting Up a VPC to Connect to JDBC Data Stores (p. 30)

Amazon VPC Endpoints for Amazon S3


For security reasons, many AWS customers run their applications within an Amazon Virtual Private Cloud
environment (Amazon VPC). With Amazon VPC, you can launch Amazon EC2 instances into a virtual
private cloud, which is logically isolated from other networks—including the public internet. With an
Amazon VPC, you have control over its IP address range, subnets, routing tables, network gateways, and
security settings.
Note
If you created your AWS account after 2013-12-04, you already have a default VPC in each AWS
Region. You can immediately start using your default VPC without any additional configuration.
For more information, see Your Default VPC and Subnets in the Amazon VPC User Guide.

Many customers have legitimate privacy and security concerns about sending and receiving data across
the public internet. Customers can address these concerns by using a virtual private network (VPN) to
route all Amazon S3 network traffic through their own corporate network infrastructure. However, this
approach can introduce bandwidth and availability challenges.

VPC endpoints for Amazon S3 can alleviate these challenges. A VPC endpoint for Amazon S3 enables
AWS Glue to use private IP addresses to access Amazon S3 with no exposure to the public internet. AWS
Glue does not require public IP addresses, and you don't need an internet gateway, a NAT device, or a
virtual private gateway in your VPC. You use endpoint policies to control access to Amazon S3. Traffic
between your VPC and the AWS service does not leave the Amazon network.

When you create a VPC endpoint for Amazon S3, any requests to an Amazon S3 endpoint within the
Region (for example, s3.us-west-2.amazonaws.com) are routed to a private Amazon S3 endpoint within
the Amazon network. You don't need to modify your applications running on EC2 instances in your VPC
—the endpoint name remains the same, but the route to Amazon S3 stays entirely within the Amazon
network, and does not access the public internet.


For more information about VPC endpoints, see VPC Endpoints in the Amazon VPC User Guide.

The following diagram shows how AWS Glue can use a VPC endpoint to access Amazon S3.

To set up access for Amazon S3

1. Sign in to the AWS Management Console and open the Amazon VPC console at https://
console.aws.amazon.com/vpc/.
2. In the left navigation pane, choose Endpoints.
3. Choose Create Endpoint, and follow the steps to create an Amazon S3 endpoint in your VPC.
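
The endpoint can also be created programmatically. The following Boto3 sketch creates a gateway
endpoint for Amazon S3; the VPC ID, route table ID, and Region in the service name are placeholders
for your own values.

import boto3

ec2 = boto3.client("ec2")

# Gateway endpoints for Amazon S3 are associated with one or more route tables.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"])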

Setting Up a VPC to Connect to JDBC Data Stores


To enable AWS Glue components to communicate, you must set up access to your data stores, such
as Amazon Redshift and Amazon RDS. To enable AWS Glue to communicate between its components,
specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-
referencing rule, you can restrict the source to the same security group in the VPC, and it's not open to all
networks. The default security group for your VPC might already have a self-referencing inbound rule for
ALL Traffic.

To set up access for Amazon Redshift data stores

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. In the left navigation pane, choose Clusters.


3. Choose the cluster name that you want to access from AWS Glue.
4. In the Cluster Properties section, choose a security group in VPC security groups to allow AWS Glue
to use. Record the name of the security group that you chose for future reference. Choosing the
security group opens the Amazon EC2 console Security Groups list.
5. Choose the security group to modify and navigate to the Inbound tab.
6. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or
confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and
whose Source is the same security group name as the Group ID.

The inbound rule looks similar to the following:

Type Protocol Port Range Source

All TCP TCP 0–65535 database-security-group


7. Add a rule for outbound traffic also. Either open outbound traffic to all ports, for example:

Type Protocol Port Range Destination

All Traffic ALL ALL 0.0.0.0/0

Or create a self-referencing rule where Type All TCP, Protocol is TCP, Port Range includes all
ports, and whose Destination is the same security group name as the Group ID. If using an Amazon
S3 VPC endpoint, also add an HTTPS rule for Amazon S3 access. The s3-prefix-list-id is
required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.

For example:

Type Protocol Port Range Destination

All TCP TCP 0–65535 security-group

HTTPS TCP 443 s3-prefix-list-id


To set up access for Amazon RDS data stores

1. Sign in to the AWS Management Console and open the Amazon RDS console at https://
console.aws.amazon.com/rds/.
2. In the left navigation pane, choose Instances.
3. Choose the Amazon RDS Engine and DB Instance name that you want to access from AWS Glue.
4. From Instance Actions, choose See Details. On the Details tab, find the Security Groups name you
will access from AWS Glue. Record the name of the security group for future reference.
5. Choose the security group to open the Amazon EC2 console.
6. Confirm that your Group ID from Amazon RDS is chosen, then choose the Inbound tab.
7. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or
confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and
whose Source is the same security group name as the Group ID.

The inbound rule looks similar to this:

Type Protocol Port Range Source

All TCP TCP 0–65535 database-security-group


8. Add a rule for outbound traffic also. Either open outbound traffic to all ports, for example:

Type Protocol Port Range Destination

All Traffic ALL ALL 0.0.0.0/0

Or create a self-referencing rule where Type All TCP, Protocol is TCP, Port Range includes all
ports, and whose Destination is the same security group name as the Group ID. If using an Amazon
S3 VPC endpoint, also add an HTTPS rule for Amazon S3 access. The s3-prefix-list-id is
required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.

For example:


Type Protocol Port Range Destination

All TCP TCP 0–65535 security-group

HTTPS TCP 443 s3-prefix-list-id
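
Both procedures above add the same kind of self-referencing rule. For reference, the following Boto3
sketch adds such an inbound rule to a security group; the security group ID is a placeholder.

import boto3

ec2 = boto3.client("ec2")
security_group_id = "sg-0123456789abcdef0"  # placeholder: the data store's security group

# Allow all TCP ports, but only from members of the same security group.
ec2.authorize_security_group_ingress(
    GroupId=security_group_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": security_group_id}]
    }])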

Setting Up Your Environment for Development Endpoints
To run your extract, transform, and load (ETL) scripts with AWS Glue, you sometimes develop and test
your scripts using a development endpoint. When you set up a development endpoint, you specify a
virtual private cloud (VPC), subnet, and security groups.
Note
Make sure you set up your DNS environment for AWS Glue. For more information, see Setting
Up DNS in Your VPC (p. 28).

Setting Up Your Network for a Development Endpoint
To enable AWS Glue to access required resources, add a row in your subnet route table to associate
a prefix list for Amazon S3 to the VPC endpoint. A prefix list ID is required for creating an outbound
security group rule that allows traffic from a VPC to access an AWS service through a VPC endpoint. To
ease connecting to a notebook server that is associated with this development endpoint, from your local
machine, add a row to the route table to add an internet gateway ID. For more information, see VPC
Endpoints. Update the subnet routes table to be similar to the following table:

Destination Target

10.0.0.0/16 local

pl-id for Amazon S3 vpce-id

0.0.0.0/0 igw-xxxx

To enable AWS Glue to communicate between its components, specify a security group with a self-
referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source
to the same security group in the VPC, and it's not open to all networks. The default security group for
your VPC might already have a self-referencing inbound rule for ALL Traffic.

To set up a security group

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://
console.aws.amazon.com/ec2/.
2. In the left navigation pane, choose Security Groups.
3. Either choose an existing security group from the list, or Create Security Group to use with the
development endpoint.
4. In the security group pane, navigate to the Inbound tab.


5. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or
confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and
whose Source is the same security group name as the Group ID.

The inbound rule looks similar to this:

Type Protocol Port Range Source

All TCP TCP 0–65535 security-group


6. Add a rule for outbound traffic also. Either open outbound traffic to all ports, or create a self-
referencing rule where Type is All TCP, Protocol is TCP, Port Range includes all ports, and whose
Destination is the same security group name as the Group ID.

The outbound rule looks similar to one of these rules:

Type Protocol Port Range Destination

All TCP TCP 0–65535 security-group

All Traffic ALL ALL 0.0.0.0/0

Setting Up Amazon EC2 for a Notebook Server


With a development endpoint, you can create a notebook server to test your ETL scripts with Zeppelin
notebooks. To enable communication to your notebook, specify a security group with inbound rules
for both HTTPS (port 443) and SSH (port 22). Ensure that the rule's source is either 0.0.0.0/0 or the IP
address of the machine that is connecting to the notebook.
Note
When using a custom DNS, ensure that the custom DNS server is able to do forward and reverse
resolution for the entire subnet CIDR where the notebook server is launched.

To set up a security group

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://
console.aws.amazon.com/ec2/.


2. In the left navigation pane, choose Security Groups.


3. Either choose an existing security group from the list, or Create Security Group to use with your
notebook server. The security group that is associated with your development endpoint is also used
to create your notebook server.
4. In the security group pane, navigate to the Inbound tab.
5. Add inbound rules similar to this:

Type Protocol Port Range Source

SSH TCP 22 0.0.0.0/0

HTTPS TCP 443 0.0.0.0/0


Setting Up Encryption in AWS Glue


The following example workflow highlights the options to configure when you use encryption with AWS
Glue. The example demonstrates the use of specific AWS Key Management Service (AWS KMS) keys,
but you might choose other settings based on your particular needs. This workflow highlights only the
options that pertain to encryption when setting up AWS Glue. For more information about encryption,
see Encryption and Secure Access for AWS Glue (p. 80).

1. If the user of the AWS Glue console doesn't use a permissions policy that allows all AWS Glue API
operations (for example, "glue:*"), confirm that the following actions are allowed:
• "glue:GetDataCatalogEncryptionSettings"
• "glue:PutDataCatalogEncryptionSettings"
• "glue:CreateSecurityConfiguration"
• "glue:GetSecurityConfiguration"
• "glue:GetSecurityConfigurations"
• "glue:DeleteSecurityConfiguration"
2. Any client that accesses or writes to an encrypted catalog—that is, any console user, crawler, job, or
development endpoint—needs the following permissions:

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",


"Action": [
"kms:GenerateDataKey",
"kms:Decrypt",
"kms:Encrypt"
],
"Resource": "(key-arns-used-for-data-catalog)"
}
}

3. Any user or role that accesses an encrypted connection password needs the following permissions:

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"kms:Decrypt"
],
"Resource": "(key-arns-used-for-password-encryption)"
}
}

4. The role of any extract, transform, and load (ETL) job that writes encrypted data to Amazon S3 needs
the following permissions:

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:Encrypt",
"kms:GenerateDataKey"
],
"Resource": "(key-arns-used-for-s3)"
}
}

5. Any ETL job or crawler that writes encrypted Amazon CloudWatch Logs requires the following
permissions in the key policy (not IAM policy):

{
"Effect": "Allow",
"Principal": {
"Service": "logs.region.amazonaws.com"
},
"Action": [
"kms:Encrypt*",
"kms:Decrypt*",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:Describe*"
],
"Resource": "arn of key used for ETL/crawler cloudwatch encryption"
}

For more information about key policies, see Using Key Policies in AWS KMS in the AWS Key
Management Service Developer Guide.
6. Any ETL job that uses an encrypted job bookmark needs the following permissions:

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:Encrypt"
],
"Resource": "(key-arns-used-for-job-bookmark-encryption)"
}
}

7. On the AWS Glue console, choose Settings in the navigation pane. On the Data catalog settings page,
encrypt your Data Catalog by selecting the Metadata encryption check box. This option encrypts all
the objects in the Data Catalog with the AWS KMS key that you choose.

When encryption is enabled, the client that is accessing the Data Catalog must have AWS KMS
permissions. For more information, see Encrypting Your Data Catalog (p. 81).
8. In the navigation pane, choose Security configurations. A security configuration is a set of
security properties that can be used to configure AWS Glue processes. Choose Add security
configuration, and in the configuration choose any of the following options (a scripted sketch of
steps 7 and 8 appears after this list):
a. Select the S3 encryption check box. For Encryption mode, choose SSE-KMS. For the AWS KMS
key, choose aws/s3 (ensure that the user has permission to use this key). This encrypts the data
that the job writes to Amazon S3 with the AWS managed key that you chose.
b. Select the CloudWatch logs encryption check box, and choose an AWS managed AWS KMS
key (ensure that the user has permission to use this key). This encrypts the data that the job
writes to CloudWatch Logs with the AWS managed key that you chose.
c. Choose Advanced properties, and select the Job bookmark encryption check box. For the AWS
KMS key, choose aws/glue (ensure that the user has permission to use this key). This enables
encryption of job bookmarks written to Amazon S3 with the AWS Glue AWS KMS key.
9. In the navigation pane, choose Connections. Choose Add connection to create a connection to the
Java Database Connectivity (JDBC) data store that is the target of your ETL job. To enforce that Secure
Sockets Layer (SSL) encryption is used, select the Require SSL connection check box, and test your
connection.
10.In the navigation pane, choose Jobs. Choose Add job to create a job that transforms data. In the job
definition, choose the security configuration that you created.
11.On the AWS Glue console, run your job on demand. Verify that any Amazon S3 data written by the
job, the CloudWatch Logs written by the job, and the job bookmarks are all encrypted.
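
As referenced in step 8, the following Boto3 sketch shows roughly how steps 7 and 8 can be performed
through the AWS Glue API instead of the console. The KMS key ARN, Region, account ID, and
configuration name are placeholders.

import boto3

glue = boto3.client("glue")
kms_key_arn = "arn:aws:kms:us-west-2:111122223333:key/example-key-id"  # placeholder

# Step 7: encrypt the Data Catalog metadata with the chosen AWS KMS key.
glue.put_data_catalog_encryption_settings(
    DataCatalogEncryptionSettings={
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_arn
        }
    })

# Step 8: create a security configuration covering S3, CloudWatch Logs, and job bookmarks.
glue.create_security_configuration(
    Name="example-security-configuration",
    EncryptionConfiguration={
        "S3Encryption": [{"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}],
        "CloudWatchEncryption": {"CloudWatchEncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn},
        "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "CSE-KMS", "KmsKeyArn": kms_key_arn}
    })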

AWS Glue Console Workflow Overview


With AWS Glue, you store metadata in the AWS Glue Data Catalog. You use this metadata to orchestrate
ETL jobs that transform data sources and load your data warehouse. The following steps describe the
general workflow and some of the choices that you make when working with AWS Glue; a scripted sketch
of the same workflow follows the list.

1. Populate the AWS Glue Data Catalog with table definitions.

In the console, you can add a crawler to populate the AWS Glue Data Catalog. You can start the Add
crawler wizard from the list of tables or the list of crawlers. You choose one or more data stores for
your crawler to access. You can also create a schedule to determine the frequency of running your
crawler.

Optionally, you can provide a custom classifier that infers the schema of your data. You can create
custom classifiers using a grok pattern. However, AWS Glue provides built-in classifiers that are
automatically used by crawlers if a custom classifier does not recognize your data. When you define a
crawler, you don't have to select a classifier. For more information about classifiers in AWS Glue, see
Adding Classifiers to a Crawler (p. 109).


Crawling some types of data stores requires a connection that provides authentication and location
information. If needed, you can create a connection that provides this required information in the AWS
Glue console.

The crawler reads your data store and creates data definitions and named tables in the AWS Glue Data
Catalog. These tables are organized into a database of your choosing. You can also populate the Data
Catalog with manually created tables. With this method, you provide the schema and other metadata
to create table definitions in the Data Catalog. Because this method can be a bit tedious and error
prone, it's often better to have a crawler create the table definitions.

For more information about populating the AWS Glue Data Catalog with table definitions, see
Defining Tables in the AWS Glue Data Catalog (p. 88).
2. Define a job that describes the transformation of data from source to target.

Generally, to create a job, you have to make the following choices:


• Pick a table from the AWS Glue Data Catalog to be the source of the job. Your job uses this table
definition to access your data store and interpret the format of your data.
• Pick a table or location from the AWS Glue Data Catalog to be the target of the job. Your job uses
this information to access your data store.
• Tell AWS Glue to generate a PySpark script to transform your source to target. AWS Glue generates
the code to call built-in transforms to convert data from its source schema to target schema
format. These transforms perform operations such as copy data, rename columns, and filter data to
transform data as necessary. You can modify this script in the AWS Glue console.

For more information about defining jobs in AWS Glue, see Authoring Jobs in AWS Glue (p. 141).
3. Run your job to transform your data.

You can run your job on demand, or start it based on one of these trigger types (see the sketch after this workflow):
• A trigger that is based on a cron schedule.
• A trigger that is event-based; for example, the successful completion of another job can start an
AWS Glue job.
• A trigger that starts a job on demand.

For more information about triggers in AWS Glue, see Triggering Jobs in AWS Glue (p. 154).
4. Monitor your scheduled crawlers and triggered jobs.

Use the AWS Glue console to view the following:


• Job run details and errors.
• Crawler run details and errors.
• Any notifications about AWS Glue activities.

For more information about monitoring your crawlers and jobs in AWS Glue, see Running and
Monitoring AWS Glue (p. 188).
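
The same workflow can also be driven programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3) to run a job on demand and to create a cron-based trigger, as mentioned in step 3. The job name, trigger name, and schedule are placeholders, not values from this guide.

import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Run a job on demand (the job name is a placeholder).
run = glue.start_job_run(JobName="my-etl-job")
print("Started job run:", run["JobRunId"])

# Create a cron-based trigger that starts the same job every day at 12:00 UTC.
glue.create_trigger(
    Name="daily-etl-trigger",           # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 12 * * ? *)",      # AWS cron syntax
    Actions=[{"JobName": "my-etl-job"}],
    StartOnCreation=True,
)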


Security in AWS Glue


You can manage your AWS Glue resources and your data stores by using authentication, access control,
and encryption.

Use AWS Identity and Access Management (IAM) policies to assign permissions and control access to AWS
Glue resources.

AWS Glue also enables you to encrypt data, logs, and bookmarks using keys that you manage with AWS
KMS. You can configure ETL jobs and development endpoints to use AWS KMS keys to write encrypted
data at rest. Additionally, you can use AWS KMS keys to encrypt the logs generated by crawlers and ETL
jobs, as well as to encrypt ETL job bookmarks. With AWS Glue, you can also encrypt the metadata stored in
the Data Catalog with keys that you manage with AWS KMS.
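
As an illustration, a security configuration that covers Amazon S3 output, CloudWatch Logs, and job bookmarks can be created programmatically. This is a minimal sketch using the AWS SDK for Python (Boto3); the configuration name and the KMS key ARN are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-west-2")
kms_key_arn = "arn:aws:kms:us-west-2:123456789012:key/example-key-id"  # placeholder key ARN

glue.create_security_configuration(
    Name="my-security-configuration",  # placeholder name
    EncryptionConfiguration={
        # Encrypt data written to Amazon S3 with SSE-KMS.
        "S3Encryption": [{"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}],
        # Encrypt CloudWatch Logs written by crawlers and jobs.
        "CloudWatchEncryption": {"CloudWatchEncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn},
        # Encrypt job bookmarks on the client side before they are stored.
        "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "CSE-KMS", "KmsKeyArn": kms_key_arn},
    },
)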

The following examples describe some of the methods you can use for secure processing.

• Use AWS Identity and Access Management (IAM) policies to assign permissions that determine who
is allowed to manage AWS Glue resources. For more information, see Identity-Based Policies (IAM
Policies) For Access Control (p. 44).
• Use AWS Glue resource policies to control access to Data Catalog resources and grant cross-account
access without using an IAM role. For more information, see AWS Glue Resource Policies For Access
Control (p. 48).
• Use security configurations to encrypt your Amazon S3, Amazon CloudWatch Logs, and job bookmarks
data at rest. For more information, see Encrypting Data Written by Crawlers, Jobs, and Development
Endpoints (p. 82).
• Only connect to JDBC data stores with trusted Secure Sockets Layer (SSL) connections (see the sketch
after this list). For more information, see Working with Connections on the AWS Glue Console (p. 94).
• Encrypt your AWS Glue Data Catalog. For more information, see Encrypting Your Data
Catalog (p. 81).
• Use the security features of your database engine to control who can log in to the databases on a
database instance, just as you might do with a database on your local network.
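
The following is a minimal sketch of creating an SSL-enforcing JDBC connection with the AWS SDK for Python (Boto3). The connection name, JDBC URL, and credentials are placeholders; the JDBC_ENFORCE_SSL property corresponds to the Require SSL connection option in the console.

import boto3

glue = boto3.client("glue", region_name="us-west-2")

glue.create_connection(
    ConnectionInput={
        "Name": "my-jdbc-connection",   # placeholder connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://dbhost:5432/mydb",  # placeholder URL
            "USERNAME": "etl_user",     # placeholder credentials
            "PASSWORD": "etl_password",
            "JDBC_ENFORCE_SSL": "true", # require a trusted SSL connection
        },
    }
)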

Important
To use fine-grained access control with the Data Catalog and Athena, consider the following
limitations:

• You must upgrade from an Athena-managed Data Catalog to the AWS Glue Data Catalog.
• Athena does not support cross-account access to an AWS Glue Data Catalog.
• You cannot limit access to individual partitions within a table. You can only limit access to
databases and entire tables.
• When limiting access to a specific database in the AWS Glue Data Catalog, you must also
specify a default database for each AWS Region. If you use Athena and the AWS Glue Data
Catalog in more than one region, add a resource ARN for each default database in each
region. For example, to allow GetDatabase access to example_db in the us-east-1
Region, include the default database in the policy as well:

{
"Effect": "Allow",
"Action": [
"glue:GetDatabase"


],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:database/default",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:catalog"
]
}

For more information, see Fine-Grained Access to Databases and Tables in the AWS Glue Data
Catalog.

Topics
• Authentication and Access Control for AWS Glue (p. 40)
• Encryption and Secure Access for AWS Glue (p. 80)

Authentication and Access Control for AWS Glue


Access to AWS Glue requires credentials. Those credentials must have permissions to access AWS
resources, such as an AWS Glue table or an Amazon Elastic Compute Cloud (Amazon EC2) instance. The
following sections provide details on how you can use AWS Identity and Access Management (IAM) and
AWS Glue to help secure access to your resources.

Topics
• Authentication (p. 40)
• Overview of Managing Access Permissions to Your AWS Glue Resources (p. 41)
• Granting Cross-Account Access (p. 50)
• Specifying AWS Glue Resource ARNs (p. 54)
• AWS Glue Access-Control Policy Examples (p. 57)
• AWS Glue API Permissions: Actions and Resources Reference (p. 66)

Authentication
You can access AWS as any of the following types of identities:

• AWS account root user – When you first create an AWS account, you begin with a single sign-in
identity that has complete access to all AWS services and resources in the account. This identity is
called the AWS account root user and is accessed by signing in with the email address and password
that you used to create the account. We strongly recommend that you do not use the root user for
your everyday tasks, even the administrative ones. Instead, adhere to the best practice of using the
root user only to create your first IAM user. Then securely lock away the root user credentials and use
them to perform only a few account and service management tasks.
• IAM user – An IAM user is an identity within your AWS account that has specific custom permissions
(for example, permissions to create a table in AWS Glue). You can use an IAM user name and password
to sign in to secure AWS webpages like the AWS Management Console, AWS Discussion Forums, or the
AWS Support Center.

In addition to a user name and password, you can also generate access keys for each user. You can
use these keys when you access AWS services programmatically, either through one of the several
SDKs or by using the AWS Command Line Interface (CLI). The SDK and CLI tools use the access keys


to cryptographically sign your request. If you don’t use AWS tools, you must sign the request yourself.
AWS Glue supports Signature Version 4, a protocol for authenticating inbound API requests. For more
information about authenticating requests, see Signature Version 4 Signing Process in the AWS General
Reference. A brief SDK sketch follows this list.

• IAM role – An IAM role is an IAM identity that you can create in your account that has specific
permissions. It is similar to an IAM user, but it is not associated with a specific person. An IAM role
enables you to obtain temporary access keys that can be used to access AWS services and resources.
IAM roles with temporary credentials are useful in the following situations:

• Federated user access – Instead of creating an IAM user, you can use existing user identities from
AWS Directory Service, your enterprise user directory, or a web identity provider. These are known as
federated users. AWS assigns a role to a federated user when access is requested through an identity
provider. For more information about federated users, see Federated Users and Roles in the IAM User
Guide.

• AWS service access – You can use an IAM role in your account to grant an AWS service permissions
to access your account’s resources. For example, you can create a role that allows Amazon Redshift
to access an Amazon S3 bucket on your behalf and then load data from that bucket into an Amazon
Redshift cluster. For more information, see Creating a Role to Delegate Permissions to an AWS
Service in the IAM User Guide.

• Applications running on Amazon EC2 – You can use an IAM role to manage temporary credentials
for applications that are running on an EC2 instance and making AWS API requests. This is preferable
to storing access keys within the EC2 instance. To assign an AWS role to an EC2 instance and make
it available to all of its applications, you create an instance profile that is attached to the instance.
An instance profile contains the role and enables programs that are running on the EC2 instance
to get temporary credentials. For more information, see Using an IAM Role to Grant Permissions to
Applications Running on Amazon EC2 Instances in the IAM User Guide.
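
As noted in the IAM user item above, programmatic access uses signed requests. The following is a minimal sketch with the AWS SDK for Python (Boto3); the profile name is a placeholder, and Boto3 performs the Signature Version 4 signing for you.

import boto3

# Boto3 resolves credentials from the environment, a shared credentials file,
# or an instance profile, and signs each request with Signature Version 4.
session = boto3.Session(profile_name="glue-dev", region_name="us-east-1")  # placeholder profile
glue = session.client("glue")

# A simple authenticated call: list the databases the caller is allowed to see.
for database in glue.get_databases()["DatabaseList"]:
    print(database["Name"])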

Overview of Managing Access Permissions to Your AWS Glue Resources
You can have valid credentials to authenticate your requests, but unless you have the appropriate
permissions, you can't create or access an AWS Glue resource such as a table in the Data Catalog.

Every AWS resource is owned by an AWS account, and permissions to create or access a resource are
governed by permissions policies. An account administrator can attach permissions policies to IAM
identities (that is, users, groups, and roles). Some services (such as AWS Glue and Amazon S3) also
support attaching permissions policies to the resources themselves.
Note
An account administrator (or administrator user) is a user who has administrative privileges. For
more information, see IAM Best Practices in the IAM User Guide.

When granting permissions, you decide who is getting the permissions, the resources they get
permissions for, and the specific actions that you want to allow on those resources.

Topics
• Using Permissions Policies to Manage Access to Resources (p. 42)
• AWS Glue Resources and Operations (p. 42)


• Understanding Resource Ownership (p. 42)


• Managing Access to Resources (p. 43)
• Specifying Policy Elements: Actions, Effects, and Principals (p. 44)
• Specifying Conditions in a Policy (p. 44)
• Identity-Based Policies (IAM Policies) For Access Control (p. 44)
• AWS Glue Resource Policies For Access Control (p. 48)

Using Permissions Policies to Manage Access to Resources


A permissions policy is defined by a JSON object that describes who has access to what. The syntax of the
JSON object is largely defined by the IAM service (see AWS IAM Policy Reference in the IAM User Guide).
Note
This section discusses using IAM in the context of AWS Glue but does not provide detailed
information about the IAM service. For detailed IAM documentation, see What Is IAM? in the IAM
User Guide.

For a list showing all of the AWS Glue API operations and the resources that they apply to, see AWS Glue
API Permissions: Actions and Resources Reference (p. 66).

To learn more about IAM policy syntax and descriptions, see AWS IAM Policy Reference in the IAM User
Guide.

AWS Glue supports two kinds of policy:

• Identity-Based Policies (IAM Policies) For Access Control (p. 44)


• AWS Glue Resource Policies For Access Control (p. 48)

By supporting both identity-based and resource policies, AWS Glue gives you fine-grained control over
who can access what metadata.

For more examples, see AWS Glue Resource-Based Access-Control Policy Examples (p. 62).

AWS Glue Resources and Operations


AWS Glue provides a set of operations to work with AWS Glue resources. For a list of available
operations, see the AWS Glue API (p. 371).

Understanding Resource Ownership


The AWS account owns the resources that are created in the account, regardless of who created the
resources. Specifically, the resource owner is the AWS account of the principal entity (that is, the root
account, an IAM user, or an IAM role) that authenticates the resource creation request. The following
examples illustrate how this works:

• If you use the root account credentials of your AWS account to create a table, your AWS account is the
owner of the resource (in AWS Glue, the resource is a table).
• If you create an IAM user in your AWS account and grant permissions to create a table to that user,
the user can create a table. However, your AWS account, to which the user belongs, owns the table
resource.
• If you create an IAM role in your AWS account with permissions to create a table, anyone who can
assume the role can create a table. Your AWS account, to which the role belongs, owns the table
resource.


Managing Access to Resources


A permissions policy describes who has access to what. The following section explains the available
options for creating permissions policies.
Note
This section discusses using IAM in the context of AWS Glue. It doesn't provide detailed
information about the IAM service. For complete IAM documentation, see What Is IAM? in the
IAM User Guide. For information about IAM policy syntax and descriptions, see AWS IAM Policy
Reference in the IAM User Guide.

Policies attached to an IAM identity are referred to as identity-based policies (IAM policies) and policies
attached to a resource are referred to as resource-based policies.

Topics
• Identity-Based Policies (IAM Policies) (p. 43)
• Resource-Based Policies (p. 44)

Identity-Based Policies (IAM Policies)


You can attach policies to IAM identities. For example, you can do the following:

• Attach a permissions policy to a user or a group in your account – To grant a user permissions to
create an AWS Glue resource, such as a table, you can attach a permissions policy to a user or group
that the user belongs to.
• Attach a permissions policy to a role (grant cross-account permissions) – You can attach an
identity-based permissions policy to an IAM role to grant cross-account permissions. For example,
the administrator in account A can create a role to grant cross-account permissions to another AWS
account (for example, account B) or an AWS service as follows:
1. Account A administrator creates an IAM role and attaches a permissions policy to the role that
grants permissions on resources in account A.
2. Account A administrator attaches a trust policy to the role identifying account B as the principal
who can assume the role.
3. Account B administrator can then delegate permissions to assume the role to any users in account B.
Doing this allows users in account B to create or access resources in account A. The principal in the
trust policy can also be an AWS service principal if you want to grant an AWS service permissions to
assume the role.

For more information about using IAM to delegate permissions, see Access Management in the IAM
User Guide.

The following is an example identity-based policy that grants permissions for one AWS Glue action
(GetTables). The wildcard character (*) in the Resource value means that you are granting permission
for this action to obtain the names and details of all the tables in all databases in the Data Catalog. If the user
also has access to other catalogs through a resource policy, they are granted access to those resources too.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTables",
"Effect": "Allow",
"Action": [
"glue:GetTables"
],
"Resource": "*"


}
]
}

For more information about using identity-based policies with AWS Glue, see Identity-Based Policies
(IAM Policies) For Access Control (p. 44). For more information about users, groups, roles, and
permissions, see Identities (Users, Groups, and Roles) in the IAM User Guide.

Resource-Based Policies
Other services, such as Amazon S3, also support resource-based permissions policies. For example, you
can attach a policy to an S3 bucket to manage access permissions to that bucket.

Specifying Policy Elements: Actions, Effects, and Principals


For each AWS Glue resource, the service defines a set of API operations. To grant permissions for these
API operations, AWS Glue defines a set of actions that you can specify in a policy. Some API operations
can require permissions for more than one action in order to perform the API operation. For more
information about resources and API operations, see AWS Glue Resources and Operations (p. 42) and
the AWS Glue API (p. 371).

The following are the most basic policy elements:

• Resource – You use an Amazon Resource Name (ARN) to identify the resource that the policy applies
to. For more information, see AWS Glue Resources and Operations (p. 42).
• Action – You use action keywords to identify resource operations that you want to allow or deny. For
example, you can use create to allow users to create a table.
• Effect – You specify the effect, either allow or deny, when the user requests the specific action. If you
don't explicitly grant access to (allow) a resource, access is implicitly denied. You can also explicitly
deny access to a resource, which you might do to make sure that a user cannot access it, even if a
different policy grants access.
• Principal – In identity-based policies (IAM policies), the user that the policy is attached to is the
implicit principal. For resource-based policies, you specify the user, account, service, or other entity
that you want to receive permissions. AWS Glue supports resource-based policies only for Data Catalog
resources.

To learn more about IAM policy syntax and descriptions, see AWS IAM Policy Reference in the IAM User
Guide.

For a list showing all of the AWS Glue API operations and the resources that they apply to, see AWS Glue
API Permissions: Actions and Resources Reference (p. 66).

Specifying Conditions in a Policy


When you grant permissions, you can use the access policy language to specify the conditions when a
policy should take effect. For example, you might want a policy to be applied only after a specific date.
For more information about specifying conditions in a policy language, see Condition in the IAM User
Guide.

To express conditions, you use predefined condition keys. There are AWS-wide condition keys and AWS
Glue–specific keys that you can use as appropriate. For a complete list of AWS-wide keys, see Available
Keys for Conditions in the IAM User Guide.

Identity-Based Policies (IAM Policies) For Access Control


This type of policy is attached to an IAM identity (user, group, role, or service) and grants permissions for
that IAM identity to access specified resources.


AWS Glue supports identity-based policies (IAM policies) for all AWS Glue operations. By attaching a
policy to a user or a group in your account, you can grant them permissions to create, access, or modify
an AWS Glue resource such as a table in the AWS Glue Data Catalog.

By attaching a policy to an IAM role, you can grant cross-account access permissions to IAM identities in
other AWS accounts. For more information, see Granting Cross-Account Access (p. 50).

The following is an example identity-based policy that grants permissions for AWS Glue actions
(glue:GetTable, GetTables, GetDatabase, and GetDatabases). The wildcard character (*) in the
Resource value means that you are granting permission to these actions to obtain names and details of
all the tables and databases in the Data Catalog. If the user also has access to other catalogs through a
resource policy, they are granted access to those resources too.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTables",
"Effect": "Allow",
"Action": [
"glue:GetTable",
"glue:GetTables",
"glue:GetDatabase",
"glue:GetDataBases"
],
"Resource": "*"
}
]
}

Here is another example, targeting the us-west-2 Region and using a placeholder for the specific AWS
account number.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesActionOnBooks",
"Effect": "Allow",
"Action": [
"glue:GetTable",
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/books"
]
}
]
}

This policy grants read-only permission to a table named books in the database named db1. Notice
that to grant Get permission on a table, permission on the catalog and database resources is also
required.

To deny access to a table, create a policy that denies the user access to the table, or to its parent
database or catalog. This lets you easily deny access to a specific resource in a way that cannot
be circumvented by a subsequent allow permission. For example, if you deny access to table books
in database db1, then even if you grant access to database db1, access to table books is still denied.


The following is an example identity-based policy that denies permissions for AWS Glue actions
(glue:GetTables and GetTable) to database db1 and all of the tables within it.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyGetTablesToDb1",
"Effect": "Deny",
"Action": [
"glue:GetTables",
"glue:GetTable"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:database/db1"
]
}
]
}

For more policy examples, see Identity-Based Policy Examples (p. 57).

Resource-Level Permissions Apply Only to Data Catalog Objects


Because you can only define fine-grained control for catalog objects in the Data Catalog, you must
write your client's IAM policy so that API operations that allow ARNs for the Resource statement are
not mixed with API operations that do not allow ARNs. For example, the following IAM policy allows
API operations for GetJob and GetCrawler and defines the Resource as * because AWS Glue does
not allow ARNs for crawlers and jobs. Because ARNs are allowed for catalog API operations such as
GetDatabase and GetTable, ARNs can be specified in the second half of the policy.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetJob*",
"glue:GetCrawler*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"glue:Get*"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default",
"arn:aws:glue:us-east-1:123456789012:table/default/e*1*",
"arn:aws:glue:us-east-1:123456789012:connection/connection2"
]
}
]
}

For a list of AWS Glue catalog objects that allow ARNs, see Data Catalog ARNs (p. 54).


Permissions Required to Use the AWS Glue Console


For a user to work with the AWS Glue console, that user must have a minimum set of permissions that
allows the user to work with the AWS Glue resources for their AWS account. In addition to these AWS
Glue permissions, the console requires permissions from the following services:

• Amazon CloudWatch Logs permissions to display logs.


• AWS Identity and Access Management permissions to list and pass roles.
• AWS CloudFormation permissions to work with stacks.
• Amazon Elastic Compute Cloud permissions to list VPCs, subnets, security groups, instances, and other
objects.
• Amazon Simple Storage Service permissions to list buckets and objects. Also permission to retrieve
and save scripts.
• Amazon Redshift permissions to work with clusters.
• Amazon Relational Database Service permissions to list instances.

For more information on the permissions that your users require to view and work with the AWS Glue
console, see Step 3: Attach a Policy to IAM Users That Access AWS Glue (p. 15).

If you create an IAM policy that is more restrictive than the minimum required permissions, the console
won't function as intended for users with that IAM policy. To ensure that those users can still use the
AWS Glue console, also attach the AWSGlueConsoleFullAccess managed policy to the user, as
described in AWS Managed (Predefined) Policies for AWS Glue (p. 47).

You don't need to allow minimum console permissions for users that are making calls only to the AWS
CLI or the AWS Glue API.

AWS Managed (Predefined) Policies for AWS Glue


AWS addresses many common use cases by providing standalone IAM policies that are created and
administered by AWS. These AWS managed policies grant necessary permissions for common use cases
so that you can avoid having to investigate what permissions are needed. For more information, see AWS
Managed Policies in the IAM User Guide.

The following AWS managed policies, which you can attach to users in your account, are specific to AWS
Glue and are grouped by use case scenario:

• AWSGlueConsoleFullAccess – Grants full access to AWS Glue resources when using the AWS
Management Console. If you follow the naming convention for resources specified in this policy, users
have full console capabilities. This policy is typically attached to users of the AWS Glue console.
• AWSGlueServiceRole – Grants access to resources that various AWS Glue processes require to run on
your behalf. These resources include AWS Glue, Amazon S3, IAM, CloudWatch Logs, and Amazon EC2.
If you follow the naming convention for resources specified in this policy, AWS Glue processes have the
required permissions. This policy is typically attached to roles specified when defining crawlers, jobs,
and development endpoints.
• AWSGlueServiceNotebookRole – Grants access to resources required when creating a notebook
server. These resources include AWS Glue, Amazon S3, and Amazon EC2. If you follow the naming
convention for resources specified in this policy, AWS Glue processes have the required permissions.
This policy is typically attached to roles specified when creating a notebook server on a development
endpoint.
• AWSGlueConsoleSageMakerNotebookFullAccess – Grants full access to AWS Glue and Amazon
SageMaker resources when using the AWS Management Console. If you follow the naming convention
for resources specified in this policy, users have full console capabilities. This policy is typically
attached to users of the AWS Glue console who manage Amazon SageMaker notebooks.


Note
You can review these permissions policies by signing in to the IAM console and searching for
specific policies there.

You can also create your own custom IAM policies to allow permissions for AWS Glue actions and
resources. You can attach these custom policies to the IAM users or groups that require those
permissions.

AWS Glue Resource Policies For Access Control


A resource policy is a policy that is attached to a resource rather than to an IAM identity. For example, in
Amazon Simple Storage Service (Amazon S3), a resource policy is attached to an Amazon S3 bucket.

AWS Glue supports using resource policies to control access to Data Catalog resources. These resources
include databases, tables, connections, and user-defined functions, along with the Data Catalog APIs that
interact with these resources.

An AWS Glue resource policy can only be used to manage permissions for Data Catalog resources. You
can't attach it to any other AWS Glue resources such as jobs, triggers, development endpoints, crawlers,
or classifiers.

A resource policy is attached to a catalog, which is a virtual container for all the kinds of Data Catalog
resources mentioned previously. Each AWS account owns a single catalog in an AWS Region whose
catalog ID is the same as the AWS account ID. A catalog cannot be deleted or modified.

A resource policy is evaluated for all API calls to the catalog where the caller principal is included in the
"Principal" block of the policy document.

Currently, only one resource policy is allowed per catalog, and its size is limited to 10 KB.

You use a policy document written in JSON format to create or modify a resource policy. The policy
syntax is the same as for an IAM policy (see AWS IAM Policy Reference), with the following exceptions:

• A "Principal" or "NotPrincipal" block is required for each policy statement.


• The "Principal" or "NotPrincipal" must identify valid existing AWS root users or IAM users,
roles, or groups. Wildcard patterns (like arn:aws:iam::account-id:user/*) are not allowed.
• The "Resource" block in the policy requires all resource ARNs to match the following regular
expression syntax (where the first %s is the region and the second %s is the account-id):

*arn:aws:glue:%s:%s:(\*|[a-zA-Z\*]+\/?.*)

For example, both arn:aws:glue:us-west-2:account-id:* and arn:aws:glue:us-west-2:account-id:database/default are allowed, but * is not allowed.
• Unlike IAM policies, an AWS Glue resource policy must only contain ARNs of resources belonging to the
catalog to which the policy is attached. Such ARNs always start with arn:aws:glue:.
• A policy cannot cause the identity creating it to be locked out of further policy creation or
modification.
• A resource-policy JSON document cannot exceed 10 KB in size.

As an example, if the following policy is attached to the catalog in Account A, it grants the IAM
identity dev in Account A permission to create any table in database db1 in Account A. It also grants the
same permission to the root user in Account B.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateTable"
],
"Principal": {"AWS": [
"arn:aws:iam::account-A-id:user/dev",
"arn:aws:iam::account-B-id:root"
]},
"Resource": [
"arn:aws:glue:us-east-1:account-A-id:table/db1/*",
"arn:aws:glue:us-east-1:account-A-id:database/db1",
"arn:aws:glue:us-east-1:account-A-id:catalog"
]
}
]
}

The following are some examples of resource policy documents that are not valid.

For example, a policy is invalid if it specifies a user that does not exist in the account of the catalog to
which it is attached:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateTable"
],
"Principal": {"AWS": [
"arn:aws:iam::account-A-id:user/(non-existent-user)"
]},
"Resource": [
"arn:aws:glue:us-east-1:account-A-id:table/db1/tbl1",
"arn:aws:glue:us-east-1:account-A-id:database/db1",
"arn:aws:glue:us-east-1:account-A-id:catalog"
]
}
]
}

A policy is invalid if it contains a resource ARN for a resource in a different account than the catalog to
which it is attached. In this example, this is an incorrect policy if attached to account-A:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateTable"
],
"Principal": {"AWS": [
"arn:aws:iam::account-A-id:user/dev"
]},
"Resource": [
"arn:aws:glue:us-east-1:account-B-id:table/db1/tbl1",
"arn:aws:glue:us-east-1:account-B-id:database/db1",
"arn:aws:glue:us-east-1:account-B-id:catalog"


]
}
]
}

A policy is invalid if it contains a resource ARN for a resource that is not an AWS Glue resource (in this
case, an Amazon S3 bucket):

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateTable"
],
"Principal": {"AWS": [
"arn:aws:iam::account-A-id:user/dev"
]},
"Resource": [
"arn:aws:glue:us-east-1:account-A-id:table/db1/tbl1",
"arn:aws:glue:us-east-1:account-A-id:database/db1",
"arn:aws:glue:us-east-1:account-A-id:catalog",
"arn:aws:s3:::bucket/my-bucket"
]
}
]
}

AWS Glue Resource Policy APIs


You can use the following AWS Glue Data Catalog APIs to create, retrieve, modify, and delete an AWS
Glue resource policy:

• PutResourcePolicy (put_resource_policy) (p. 381)


• GetResourcePolicy (get_resource_policy) (p. 381)
• DeleteResourcePolicy (delete_resource_policy) (p. 382)

You can also use the AWS Glue console to view and edit a resource policy. For more information, see
Working with Data Catalog Settings on the AWS Glue Console (p. 123).
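
The same operations are available through the AWS SDKs. The following is a minimal sketch with the AWS SDK for Python (Boto3); the account ID and principal ARN are placeholders, not values from this guide.

import json
import boto3

glue = boto3.client("glue", region_name="us-east-1")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["glue:GetDatabase", "glue:GetTable"],
        "Principal": {"AWS": ["arn:aws:iam::123456789012:user/Alice"]},  # placeholder principal
        "Resource": ["arn:aws:glue:us-east-1:123456789012:*"],           # placeholder account ID
    }],
}

# Create or replace the catalog's resource policy.
glue.put_resource_policy(PolicyInJson=json.dumps(policy))

# Retrieve the current policy and its hash.
current = glue.get_resource_policy()
print(current["PolicyInJson"])

# Delete the policy, conditioned on the hash retrieved above.
glue.delete_resource_policy(PolicyHashCondition=current["PolicyHash"])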

Granting Cross-Account Access


There are two ways in AWS to grant cross-account access to a resource:

Use a resource policy to grant cross-account access

1. An administrator (or other authorized identity) in Account A attaches a resource policy to the Data
Catalog in Account A. This policy grants Account B specific cross-account permissions to perform
operations on a resource in Account A's catalog.
2. An administrator in Account B attaches an IAM policy to a user or other IAM identity in Account B
that delegates the permissions received from Account A.
3. The user or other identity in Account B now has access to the specified resource in Account A.

The user needs permission from both the resource owner (Account A) and their parent account
(Account B) to be able to access the resource.


Use an IAM role to grant cross-account access

1. An administrator (or other authorized identity) in the account that owns the resource (Account A)
creates an IAM role.
2. The administrator in Account A attaches a policy to the role that grants cross-account permissions
for access to the resource in question.
3. The administrator in Account A attaches a trust policy to the role that identifies an IAM identity in a
different account (Account B) as the principal who can assume the role.

The principal in the trust policy can also be an AWS service principal if you want to grant an AWS
service permission to assume the role.
4. An administrator in Account B now delegates permissions to one or more IAM identities in Account B
so that they can assume that role. Doing so gives those identities in Account B access to the resource
in account A.

For more information about using IAM to delegate permissions, see Access Management in the IAM
User Guide. For more information about users, groups, roles, and permissions, see Identities (Users,
Groups, and Roles), also in the IAM User Guide.

For a comparison of these two approaches, see How IAM Roles Differ from Resource-based Policies in the
IAM User Guide. AWS Glue supports both options, with the restriction that a resource policy can grant
access only to Data Catalog resources.

For example, to give user Bob in Account B access to database db1 in Account A, attach the following
resource policy to the catalog in Account A:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase"
],
"Principal": {"AWS": [
"arn:aws:iam::account-B-id:user/Bob"
]},
"Resource": [
"arn:aws:glue:us-east-1:account-A-id:catalog",
"arn:aws:glue:us-east-1:account-A-id:database/db1"
]
}
]
}

In addition, Account B would have to attach the following IAM policy to Bob before he would actually get
access to db1 in Account A:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:account-A-id:catalog",
"arn:aws:glue:us-east-1:account-A-id:database/db1"
]


}
]
}

How to Make a Cross-Account API Call


All AWS Glue Data Catalog operations have a CatalogId field. If all the required permissions have
been granted to enable cross-account access, a caller can make AWS Glue Data Catalog API calls across
accounts by passing the target AWS account ID in CatalogId so as to access the resource in that target
account.

If no CatalogId value is provided, AWS Glue uses the caller's own account ID by default, and the call is
not cross-account.
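
For example, with the AWS SDK for Python (Boto3), a caller can read a table definition from another account's Data Catalog by passing that account's ID as CatalogId. This is a minimal sketch; the account ID, database name, and table name are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Pass the target account's ID as CatalogId to make the call cross-account.
response = glue.get_table(
    CatalogId="111122223333",   # placeholder: the resource owner's account ID
    DatabaseName="db1",
    Name="books",
)
print(response["Table"]["Name"])

# Omitting CatalogId targets the caller's own Data Catalog.
local = glue.get_table(DatabaseName="db1", Name="books")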

How to Make a Cross-Account ETL Call


Some AWS Glue PySpark and Scala APIs have a catalog ID field. If all the required permissions have been
granted to enable cross-account access, an ETL job can make PySpark and Scala calls to API operations
across accounts by passing the target AWS account ID in the catalog ID field to access Data Catalog
resources in a target account.

If no catalog ID value is provided, AWS Glue uses the caller's own account ID by default, and the call is
not cross-account.

See GlueContext Class (p. 291) for the PySpark APIs that support catalog_id, and AWS Glue Scala
GlueContext APIs (p. 351) for the Scala APIs that support catalogId.
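
For example, in a PySpark ETL script, a DynamicFrame can be created from a table in another account's Data Catalog by passing the owner's account ID as catalog_id. This is a minimal sketch; the account ID, database name, and table name are placeholders.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a table from the Data Catalog owned by another account.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="db1",             # placeholder database in the grantor's catalog
    table_name="books",         # placeholder table
    catalog_id="111122223333",  # placeholder: the grantor's AWS account ID
)
print(dyf.count())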

The following example shows the permissions required by the grantee to run an ETL job. In this example,
grantee-account-id is the catalog-id of the client running the job and grantor-account-id is the
owner of the resource. This example grants permission to all catalog resources in the grantor's account.
To limit the scope of resources granted, you can provide specific ARNs for the catalog, database, table,
and connection.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetConnection",
"glue:GetDatabase",
"glue:GetTable",
"glue:GetPartition"
],
"Principal": {"AWS": ["arn:aws:iam:grantee-account-id:root"]},
"Resource": [
"arn:aws:glue:us-east-1:grantor-account-id:*"
]
}
]
}

Note
If a table in the grantor's account points to an Amazon S3 location that is also in the grantor's
account, the IAM role used to run an ETL job in the grantee's account must have permission to
list and get objects from the grantor's account.

Given that the client in Account A already has permission to create and run ETL jobs, the basic steps to
set up an ETL job for cross-account access are:

1. Allow cross-account data access (skip this step if Amazon S3 cross-account access is already set up).


a. Update the Amazon S3 bucket policy in Account B to allow cross-account access from Account
A.
b. Update the IAM policy in Account A to allow access to the bucket in Account B.
2. Allow cross-account Data Catalog access.

a. Create or update the resource policy attached to the Data Catalog in Account B to allow access
from Account A.
b. Update the IAM policy in Account A to allow access to the Data Catalog in Account B.

AWS Glue Resource Ownership and Operations


Your AWS account owns the AWS Glue Data Catalog resources that are created in that account,
regardless of who created them. Specifically, the owner of a resource is the AWS account of the principal
entity (that is, the root account, the IAM user, or the IAM role) that authenticated the creation request
for that resource; for example:

• If you use the root account credentials of your AWS account to create a table in your Data Catalog,
your AWS account is the owner of the resource.
• If you create an IAM user in your AWS account and grant permissions to that user to create a table,
every table that the user creates is owned by your AWS account, to which the user belongs.
• If you create an IAM role in your AWS account with permissions to create a table, anyone who can
assume the role can create a table. But again, your AWS account owns the table resources that are
created using that role.

For each AWS Glue resource, the service defines a set of API operations that apply to it. To grant
permissions for these API operations, AWS Glue defines a set of actions that you can specify in a policy.
Some API operations can require permissions for more than one action in order to perform the API
operation.

Cross-Account Resource Ownership and Billing


When a user in one AWS account (Account A) creates a new resource such as a database in a different
account (Account B), that resource is then owned by Account B, the account where it was created. An
administrator in Account B automatically gets full permissions to access the new resource, including
reading, writing, and granting access permissions to a third account. The user in Account A can access the
resource that they just created only if they have the appropriate permissions granted by Account B.

Storage costs and other costs that are directly associated with the new resource are billed to Account
B, the resource owner. The cost of requests from the user who created the resource is billed to the
requester's account, Account A.

For more information about AWS Glue billing and pricing, see How AWS Pricing Works.

Cross-Account Access Limitations


AWS Glue cross-account access has the following limitations:

• Cross-account access to AWS Glue is not allowed if the resource owner account has not migrated
the Amazon Athena data catalog to AWS Glue. You can find the current migration status using the
GetCatalogImportStatus (get_catalog_import_status) (p. 427). For more details on how to migrate
an Athena catalog to AWS Glue, see Upgrading to the AWS Glue Data Catalog Step-by-Step in the
Amazon Athena User Guide.
• Cross-account access is only supported for Data Catalog resources, including databases, tables, user-
defined functions, and connections.


• Cross-account access to the Data Catalog is not supported when using an AWS Glue crawler, Amazon
Athena, or Amazon Redshift.

Specifying AWS Glue Resource ARNs


In AWS Glue, access to resources can be controlled with an IAM policy. In a policy, you use an Amazon
Resource Name (ARN) to identify the resource that the policy applies to. Not all resources in AWS Glue
support ARNs.

Topics
• Data Catalog Amazon Resource Names (ARNs) (p. 54)
• Amazon Resource Names (ARNs) for Non-Catalog Objects (p. 56)
• Access Control for AWS Glue Non-Catalog Singular API Operations (p. 56)
• Access-Control for AWS Glue Non-Catalog API Operations That Retrieve Multiple Items (p. 57)

Data Catalog Amazon Resource Names (ARNs)


Data Catalog resources have a hierarchical structure, with catalog as the root:

arn:aws:glue:region:account-id:catalog

Each AWS account has a single Data Catalog in an AWS Region with the 12-digit account ID as the
catalog ID. Resources have unique Amazon Resource Names (ARNs) associated with them, as shown in
the following table.

Resource Type            ARN Format

Catalog                  arn:aws:glue:region:account-id:catalog
                         For example: arn:aws:glue:us-east-1:123456789012:catalog

Database                 arn:aws:glue:region:account-id:database/database name
                         For example: arn:aws:glue:us-east-1:123456789012:database/db1

Table                    arn:aws:glue:region:account-id:table/database name/table name
                         For example: arn:aws:glue:us-east-1:123456789012:table/db1/tbl1

User-defined function    arn:aws:glue:region:account-id:userDefinedFunction/database name/user-defined function name
                         For example: arn:aws:glue:us-east-1:123456789012:userdefinedfunction/db1/func1

Connection               arn:aws:glue:region:account-id:connection/connection name
                         For example: arn:aws:glue:us-east-1:123456789012:connection/connection1

To enable fine-grained access control, you can use these ARNs in your IAM policies and resource policies
to grant and deny access to specific resources. Wildcards are allowed in the policies; for example, the
following ARN matches all tables in database default.

arn:aws:glue:us-east-1:123456789012:table/default/*

Important
All operations performed on a Data Catalog resource require permission on the resource
and all the ancestors of that resource. For example, creating partitions for a table requires
permission on the table, database, and catalog where the table is located. The following
example shows the permission required to create partitions on table PrivateTable in
database PrivateDatabase in the Data Catalog.

{
"Sid": "GrantCreatePartitions",
"Effect": "Allow",
"Action": [
"glue:BatchCreatePartitions"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:table/PrivateDatabase/PrivateTable",
"arn:aws:glue:us-east-1:123456789012:database/PrivateDatabase",
"arn:aws:glue:us-east-1:123456789012:catalog"
]
}

In addition to permission on the resource and all its ancestors, all delete operations require
permission on all children of that resource. For example, deleting a database requires
permission on all the tables and user-defined functions in the database, as well as the database
and the catalog where the database is located. The following example shows the permission
required to delete database PrivateDatabase in the Data Catalog.

{
"Sid": "GrantDeleteDatabase",
"Effect": "Allow",
"Action": [
"glue:DeleteDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:table/PrivateDatabase/*",
"arn:aws:glue:us-east-1:123456789012:userdefinedfunction/PrivateDatabase/*",
"arn:aws:glue:us-east-1:123456789012:database/PrivateDatabase",
"arn:aws:glue:us-east-1:123456789012:catalog"
]
}

In summary, actions on Data Catalog resources follow these permission rules:

• Actions on the catalog require permission on the catalog only.
• Actions on a database require permission on the database and catalog.
• Delete actions on a database require permission on the database and catalog plus all tables
and user-defined functions in the database.
• Actions on a table, partition, or table version require permission on the table, database, and
catalog.
• Actions on a user-defined function require permission on the user-defined function,
database, and catalog.
• Actions on a connection require permission on the connection and catalog.


Amazon Resource Names (ARNs) for Non-Catalog Objects


Some AWS Glue resources allow resource-level permissions to control access using an ARN. You can
use these ARNs in your IAM policies to enable fine-grained access control. The following table lists the
resources that can contain resource ARNs.

Resource Type            ARN Format

Development endpoint     arn:aws:glue:region:account-id:devEndpoint/development-endpoint-name
                         For example: arn:aws:glue:us-east-1:123456789012:devEndpoint/temporarydevendpoint

Access Control for AWS Glue Non-Catalog Singular API Operations

AWS Glue non-catalog singular API operations act on a single item (such as a development endpoint). Examples
are GetDevEndpoint, CreateDevEndpoint, and UpdateDevEndpoint. For these operations,
a policy must put the API name in the "action" block and the resource ARN in the "resource" block.

Suppose that you want to allow a user to call the GetDevEndpoint operation. The following policy
grants the minimum necessary permissions to an endpoint named myDevEndpoint-1:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Minimum permissions",
"Effect": "Allow",
"Action": "glue:GetDevEndpoint",
"Resource": "arn:aws:glue:us-east-1:123456789012:devEndpoint/myDevEndpoint-1"
}
]
}
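
With this policy attached, the user can retrieve that one endpoint programmatically. A minimal sketch with the AWS SDK for Python (Boto3), reusing the endpoint name from the policy above:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Allowed: the policy grants glue:GetDevEndpoint on this specific ARN.
endpoint = glue.get_dev_endpoint(EndpointName="myDevEndpoint-1")
print(endpoint["DevEndpoint"]["Status"])

# A call for any other endpoint name is denied under this policy.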

The following policy allows UpdateDevEndpoint access to resources that match myDevEndpoint-
with a wildcard (*):

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Permission with wildcard",
"Effect": "Allow",
"Action": "glue:UpdateDevEndpoint",
"Resource": "arn:aws:glue:us-east-1:123456789012:devEndpoint/myDevEndpoint-*"
}
]
}

You can combine the two policies as in the following example. Although you might see
EntityNotFoundException for any development endpoint whose name begins with A, an access
denied error is returned when you try to access other development endpoints:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Combined permissions",
"Effect": "Allow",
"Action": [
"glue:UpdateDevEndpoint",
"glue:GetDevEndpoint"
],
"Resource": "arn:aws:glue:us-east-1:123456789012:devEndpoint/A*"
}
]
}

Access-Control for AWS Glue Non-Catalog API Operations That Retrieve Multiple Items
Some AWS Glue API operations retrieve multiple items (such as multiple development endpoints); for
example, GetDevEndpoints. For this operation, you can specify only a wildcard (*) resource, not specific
ARNs.

For example, to allow GetDevEndpoints, the following policy must scope the resource to the wildcard
(*) because GetDevEndpoints is a plural API operation. The singular operations (GetDevEndpoint,
CreateDevEndpoint, and UpdateDevEndpoint) are also scoped to all (*) resources in the example.

{
"Sid": "Plural API included",
"Effect": "Allow",
"Action": [
"glue:GetDevEndpoints",
"glue:GetDevEndpoint",
"glue:CreateDevEndpoint",
"glue:UpdateDevEndpoint"
],
"Resource": [
"*"
]
}

AWS Glue Access-Control Policy Examples


This section contains examples of both identity-based (IAM) access-control policies and AWS Glue
resource policies.

Topics
• AWS Glue Identity-Based (IAM) Access-Control Policy Examples (p. 57)
• AWS Glue Resource-Based Access-Control Policy Examples (p. 62)

AWS Glue Identity-Based (IAM) Access-Control Policy Examples


This section contains example AWS Identity and Access Management (IAM) policies that grant
permissions for various AWS Glue actions and resources. You can copy these examples and edit them on
the IAM console. Then you can attach them to IAM identities such as users, roles, and groups.
Note
These examples all use the us-west-2 Region. You can replace this with whatever AWS Region
you are using.


Example 1: Grant Read-Only Permission to a Table


The following policy grants read-only permission to a books table in database db1. For more
information about resource ARNs, see Data Catalog Amazon Resource Names (ARNs) (p. 54).

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesActionOnBooks",
"Effect": "Allow",
"Action": [
"glue:GetTables",
"glue:GetTable"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/books"
]
}
]
}

This policy grants read-only permission to a table named books in the database named db1. Notice
that to grant Get permission on a table, permission on the catalog and database resources is also
required.

The following policy grants the minimum necessary permissions to create table tbl1 in database db1:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateTable"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:table/db1/tbl1",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:catalog"
]
}
]
}

Example 2: Filter Tables by GetTables Permission


Assume that there are three tables—customers, stores, and store_sales—in database db1. The
following policy grants GetTables permission to stores and store_sales, but not to customers.
When you call GetTables with this policy, the result contains only the two authorized tables (the
customers table is not returned).

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesExample",
"Effect": "Allow",
"Action": [


"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/store_sales",
"arn:aws:glue:us-west-2:123456789012:table/db1/stores"
]
}
]
}

You can simplify the preceding policy by using store* to match any table names that start with store:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesExample2",
"Effect": "Allow",
"Action": [
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/store*"
]
}
]
}

Similarly, using /db1/* to match all tables in db1, the following policy grants GetTables access to all
the tables in db1:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesReturnAll",
"Effect": "Allow",
"Action": [
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/*"
]
}
]
}

If no table ARN is provided, a call to GetTables succeeds, but it returns an empty list:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesEmptyResults",
"Effect": "Allow",
"Action": [


"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1"
]
}
]
}

If the database ARN is missing in the policy, a call to GetTables fails with an
AccessDeniedException:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GetTablesAccessDeny",
"Effect": "Allow",
"Action": [
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:table/db1/*"
]
}
]
}

Example 3: Grant Full Access to a Table and All Partitions


The following policy grants all permissions on a table named books in database db1. This includes read
and write permissions on the table itself, on archived versions of it, and on all its partitions.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "FullAccessOnTable",
"Effect": "Allow",
"Action": [
"glue:CreateTable",
"glue:GetTable",
"glue:GetTables",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:GetTableVersion",
"glue:GetTableVersions",
"glue:DeleteTableVersion",
"glue:BatchDeleteTableVersion",
"glue:CreatePartition",
"glue:BatchCreatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition",
"glue:UpdatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",


"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/books"
]
}
]
}

The preceding policy can be simplified in practice:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "FullAccessOnTable",
"Effect": "Allow",
"Action": [
"glue:*Table*",
"glue:*Partition*"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/db1",
"arn:aws:glue:us-west-2:123456789012:table/db1/books"
]
}
]
}

Notice that the minimum granularity of fine-grained access control is at the table level. This means that
you can't grant a user access to some partitions in a table but not others, or to some table columns but
not to others. A user either has access to all of a table, or to none of it.

Example 4: Control Access by Name Prefix and Explicit Denial


In this example, suppose that the databases and tables in your AWS Glue Data Catalog are organized
using name prefixes. The databases in the development stage have the name prefix dev-, and those in
production have the name prefix prod-. You can use the following policy to grant developers full access
to all databases, tables, UDFs, and so on, that have the dev- prefix. But you grant read-only access to
everything with the prod- prefix.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DevAndProdFullAccess",
"Effect": "Allow",
"Action": [
"glue:*Database*",
"glue:*Table*",
"glue:*Partition*",
"glue:*UserDefinedFunction*",
"glue:*Connection*"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:catalog",
"arn:aws:glue:us-west-2:123456789012:database/dev-*",
"arn:aws:glue:us-west-2:123456789012:database/prod-*",
"arn:aws:glue:us-west-2:123456789012:table/dev-*/*",
"arn:aws:glue:us-west-2:123456789012:table/*/dev-*",
"arn:aws:glue:us-west-2:123456789012:table/prod-*/*",
"arn:aws:glue:us-west-2:123456789012:table/*/prod-*",
"arn:aws:glue:us-west-2:123456789012:userDefinedFunction/dev-*/*",


"arn:aws:glue:us-west-2:123456789012:userDefinedFunction/*/dev-*",
"arn:aws:glue:us-west-2:123456789012:userDefinedFunction/prod-*/*",
"arn:aws:glue:us-west-2:123456789012:userDefinedFunction/*/prod-*",
"arn:aws:glue:us-west-2:123456789012:connection/dev-*",
"arn:aws:glue:us-west-2:123456789012:connection/prod-*"
]
},
{
"Sid": "ProdWriteDeny",
"Effect": "Deny",
"Action": [
"glue:*Create*",
"glue:*Update*",
"glue:*Delete*"
],
"Resource": [
"arn:aws:glue:us-west-2:123456789012:database/prod-*",
"arn:aws:glue:us-west-2:123456789012:table/prod-*/*",
"arn:aws:glue:us-west-2:123456789012:table/*/prod-*",
"arn:aws:glue:us-west-2:123456789012:userDefinedFunction/prod-*/*",
"arn:aws:glue:us-west-2:123456789012:userDefinedFunction/*/prod-*",
"arn:aws:glue:us-west-2:123456789012:connection/prod-*"
]
}
]
}

The second statement in the preceding policy uses explicit deny. You can use explicit deny to override
any allow permissions that are granted to the principal. This lets you lock down access to critical
resources and prevent another policy from accidentally granting access to them.

In the preceding example, even though the first statement grants full access to prod- resources, the
second statement explicitly revokes write access to them, leaving only read access to prod- resources.

AWS Glue Resource-Based Access-Control Policy Examples


This section contains example resource policies, including policies that grant cross-account access.
Important
By changing an AWS Glue resource policy, you might accidentally revoke permissions for existing
AWS Glue users in your account and cause unexpected disruptions. Try these examples only in
development or test accounts, and ensure that they don't break any existing workflows before
you make the changes.
Note
Both IAM policies and an AWS Glue resource policy take a few seconds to propagate. After you
attach a new policy, you might notice that the old policy is still in effect until the new policy has
propagated through the system.

The following examples use the AWS Command Line Interface (AWS CLI) to interact with AWS Glue
service APIs. You can perform the same operations on the AWS Glue console or using one of the AWS
SDKs.

To set up the AWS CLI

1. Install the AWS CLI by following the instructions in Installing the AWS Command Line Interface in
the AWS Command Line Interface User Guide.
2. Configure the AWS CLI by following the instructions in Configuration and Credential Files. Create an
admin profile using your AWS account administrator credentials. Configure the default AWS Region
to us-west-2 (or a Region that you use), and set the default output format to JSON.
3. Test access to the AWS Glue API by running the following command (replacing Alice with a real
IAM user or role in your account):


# Run as admin of account account-id


$ aws glue put-resource-policy --profile administrator-name --region us-west-2 \
    --policy-in-json '{
"Version": "2012-10-17",
"Statement": [
{
"Principal": {
"AWS": [
"arn:aws:iam::account-id:user/Alice"
]
},
"Effect": "Allow",
"Action": [
"glue:*"
],
"Resource": [
"arn:aws:glue:us-west-2:account-id:*"
]
}
]
}'

4. Configure a user profile for each IAM user in the accounts that you use for testing your resource
policy and cross-account access.

Example 1. Use a Resource Policy to Control Access in the Same Account


In this example, an admin user in Account A creates a resource policy that grants IAM user Alice in
Account A full access to the catalog. Alice has no IAM policy attached.

To do this, the administrator user runs the following AWS CLI command:

# Run as admin of Account A


$ aws glue put-resource-policy --profile administrator-name --region us-west-2 \
    --policy-in-json '{
"Version": "2012-10-17",
"Statement": [
{
"Principal": {
"AWS": [
"arn:aws:iam::account-A-id:user/Alice"
]
},
"Effect": "Allow",
"Action": [
"glue:*"
],
"Resource": [
"arn:aws:glue:us-west-2:account-A-id:*"
]
}
]
}'

Instead of entering the JSON policy document as a part of your AWS CLI command, you can save a
policy document in a file and reference the file path in the AWS CLI command, prefixed by file://. The
following is an example of how you might do that:

$ echo '{
"Version": "2012-10-17",


"Statement": [
{
"Principal": {
"AWS": [
"arn:aws:iam::account-A-id:user/Alice"
]
},
"Effect": "Allow",
"Action": [
"glue:*"
],
"Resource": [
"arn:aws:glue:us-west-2:account-A-id:*"
]
}
]
}' > /temp/policy.json

$ aws glue put-resource-policy --profile admin1 \
    --region us-west-2 --policy-in-json file:///temp/policy.json

After this resource policy has propagated, Alice can access all AWS Glue resources in Account A; for
example:

# Run as user Alice


$ aws glue create-database --profile alice --region us-west-2 --database-input '{
"Name": "new_database",
"Description": "A new database created by Alice",
"LocationUri": "s3://my-bucket"
}'

$ aws glue get-table --profile alice --region us-west-2 --database-name "default" \
    --table-name "tbl1"

In response to Alice's get-table call, the AWS Glue service returns the following:

{
"Table": {
"Name": "tbl1",
"PartitionKeys": [],
"StorageDescriptor": {
......
},
......
}
}
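
To confirm which resource policy is currently in effect, the administrator can call the GetResourcePolicy operation. The following call and abbreviated response are a sketch; the policy document, hash, and timestamps in your account will differ:

# Run as admin of Account A

$ aws glue get-resource-policy --profile administrator-name --region us-west-2
{
    "PolicyInJson": "...",
    "PolicyHash": "...",
    "CreateTime": ...,
    "UpdateTime": ...
}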

Example 2. Use a Resource Policy to Grant Cross-Account Access


In this example, a resource policy in Account A grants Bob in Account B read-only access to all
Data Catalog resources in Account A. Four steps are needed:

1. Verify that Account A has migrated its Amazon Athena data catalog to AWS Glue.

Cross-account access to AWS Glue is not allowed if the resource-owner account has not migrated its
Athena data catalog to AWS Glue. For more details on how to migrate the Athena catalog to AWS
Glue, see Upgrading to the AWS Glue Data Catalog Step-by-Step in the Amazon Athena User Guide.

# Verify that the value "ImportCompleted" is true. This value is region specific.
$ aws glue get-catalog-import-status --profile admin1 --region us-west-2
{


"ImportStatus": {
"ImportCompleted": true,
"ImportTime": 1502512345.0,
"ImportedBy": "StatusSetByDefault"
}
}

2. An admin in Account A creates a resource policy granting Account B access.

# Run as admin of Account A


$ aws glue put-resource-policy --profile admin1 --region us-west-2 --policy-in-json '{
"Version": "2012-10-17",
"Statement": [
{
"Principal": {
"AWS": [
"arn:aws:iam::account-B-id:root"
]
},
"Effect": "Allow",
"Action": [
"glue:Get*",
"glue:BatchGet*"
],
"Resource": [
"arn:aws:glue:us-west-2:account-A-id:*"
]
}
]
}'

3. An admin in Account B grants Bob access to Account A using an IAM policy.

# Run as admin of Account B


$ aws iam put-user-policy --profile admin2 --user-name Bob --policy-name CrossAccountReadOnly \
    --policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:Get*",
"glue:BatchGet*"
],
"Resource": [
"arn:aws:glue:us-west-2:account-A-id:*"
]
}
]
}'

4. Verify Bob's access to a resource in Account A.

# Run as user Bob. This call succeeds in listing Account A databases.


$ aws glue get-databases --profile bob --region us-west-2 --catalog-id account-A-id
{
"DatabaseList": [
{
"Name": "db1"
},
{
"Name": "db2"
},
......


]
}

# This call succeeds in listing tables in the 'default' Account A database.


$ aws glue get-table --profile bob --region us-west-2 --catalog-id account-A-id \
--database-name "default" --table-name "tbl1"
{
"Table": {
"Name": "tbl1",
"PartitionKeys": [],
"StorageDescriptor": {
......
},
......
}
}

# This call fails with access denied, because Bob has only been granted read access.
$ aws glue create-database --profile bob --region us-west-2 --catalog-id account-A-id \
    --database-input '{
"Name": "new_database2",
"Description": "A new database created by Bob",
"LocationUri": "s3://my-bucket2"
}'

An error occurred (AccessDeniedException) when calling the CreateDatabase operation:
User: arn:aws:iam::account-B-id:user/Bob is not authorized to perform:
glue:CreateDatabase on resource: arn:aws:glue:us-west-2:account-A-id:database/new_database2

In step 2, the administrator in Account A grants permission to the root user of Account B. The root user
can then delegate the permissions it owns to all IAM principals (users, roles, groups, and so forth) by
attaching IAM policies to them. Because an admin user already has a full-access IAM policy attached, an
administrator automatically owns the permissions granted to the root user, and also the permission to
delegate permissions to other IAM users in the account.

Alternatively, in step 2, you could grant permission to the ARN of user Bob directly. This restricts the
cross-account access permission to Bob alone. However, step 3 is still required for Bob to actually gain
the cross-account access. For cross-account access, both the resource policy in the resource account
and an IAM policy in the user's account are required for access to work. This is different from the same-
account access in Example 1, where either the resource policy or the IAM policy can grant access without
needing the other.
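
For example, the following variant of the step 2 resource policy is a sketch that grants the same read-only actions to user Bob's ARN instead of to the Account B root user:

# Run as admin of Account A

$ aws glue put-resource-policy --profile admin1 --region us-west-2 --policy-in-json '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Principal": {
                "AWS": [
                    "arn:aws:iam::account-B-id:user/Bob"
                ]
            },
            "Effect": "Allow",
            "Action": [
                "glue:Get*",
                "glue:BatchGet*"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:account-A-id:*"
            ]
        }
    ]
}'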

AWS Glue API Permissions: Actions and Resources Reference

Use the following list as a reference when you're setting up Authentication and Access Control for AWS
Glue (p. 40) and writing a permissions policy to attach to an IAM identity (identity-based policy) or to
a resource (resource policy). The list includes each AWS Glue API operation, the corresponding actions for
which you can grant permissions to perform the action, and the AWS resource for which you can grant
the permissions. You specify the actions in the policy's Action field, and you specify the resource value
in the policy's Resource field.

Actions on some AWS Glue resources require that ancestor and child resource ARNs are also included
in the policy's Resource field. For more information, see Data Catalog Amazon Resource Names
(ARNs) (p. 54).

Generally, you can replace ARN segments with wildcards. For more information see IAM JSON Policy
Elements in the IAM User Guide.


You can use AWS-wide condition keys in your AWS Glue policies to express conditions. For a complete list
of AWS-wide keys, see Available Keys in the IAM User Guide.
Note
To specify an action, use the glue: prefix followed by the API operation name (for example,
glue:GetTable).
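
For example, the following identity-based policy is a sketch that combines glue: actions with the AWS-wide aws:SourceIp condition key. The account ID and CIDR range are placeholders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables"
            ],
            "Resource": "arn:aws:glue:us-west-2:123456789012:*",
            "Condition": {
                "IpAddress": {
                    "aws:SourceIp": "203.0.113.0/24"
                }
            }
        }
    ]
}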

AWS Glue API Permissions: Actions and Resources Reference

BatchCreatePartition (batch_create_partition) (p. 407)

Action(s): glue:BatchCreatePartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

BatchDeleteConnection (batch_delete_connection) (p. 420)

Action(s): glue:BatchDeleteConnection

Resource:

arn:aws:glue:region:account-id:connection/connection-name
arn:aws:glue:region:account-id:catalog

Note
All the connection deletions to be performed by the call must be authorized by IAM. If any
of these deletions is not authorized, the call fails and no connections are deleted.
BatchDeletePartition (batch_delete_partition) (p. 409)

Action(s): glue:BatchDeletePartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

Note
All the partition deletions to be performed by the call must be authorized by IAM. If any of
these deletions is not authorized, the call fails and no partitions are deleted.
BatchDeleteTable (batch_delete_table) (p. 398)

Action(s): glue:BatchDeleteTable

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog


Note
All the table deletions to be performed by the call must be authorized by IAM. If any of
these deletions is not authorized, the call fails and no tables are deleted.
BatchDeleteTableVersion (batch_delete_table_version) (p. 402)

Action(s): glue:BatchDeleteTableVersion

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

BatchGetPartition (batch_get_partition) (p. 414)

Action(s): glue:BatchGetPartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

BatchStopJobRun (batch_stop_job_run) (p. 463)

Action(s): glue:BatchStopJobRun

Resource:

*
CreateClassifier (create_classifier) (p. 432)

Action(s): glue:CreateClassifier

Resource:

*
CreateConnection (create_connection) (p. 417)

Action(s): glue:CreateConnection

Resource:

arn:aws:glue:region:account-id:connection/connection-name
arn:aws:glue:region:account-id:catalog

CreateCrawler (create_crawler) (p. 439)

Action(s): glue:CreateCrawler

Resource:


*
CreateDatabase (create_database) (p. 386)

Action(s):glue:CreateDatabase

Resource:

arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

CreateDevEndpoint (create_dev_endpoint) (p. 475)

Action(s): glue:CreateDevEndpoint

Resource:

arn:aws:glue:region:account-id:devEndpoint/development-endpoint-name

or

arn:aws:glue:region:account-id:devEndpoint/*
CreateJob (create_job) (p. 455)

Action(s): glue:CreateJob

Resource:

*
CreatePartition (create_partition) (p. 406)

Action(s): glue:CreatePartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

CreateScript (create_script) (p. 449)

Action(s): glue:CreateScript

Resource:

*
CreateSecurityConfiguration (create_security_configuration) (p. 382)

Action(s): glue:CreateSecurityConfiguration

Resource:

*
CreateTable (create_table) (p. 396)

Action(s): glue:CreateTable

Resource:


arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

CreateTrigger (create_trigger) (p. 468)

Action(s): glue:CreateTrigger

Resource:

*
CreateUserDefinedFunction (create_user_defined_function) (p. 422)

Action(s): glue:CreateUserDefinedFunction

Resource:

arn:aws:glue:region:account-id:userDefinedFunction/database-name/user-defined-function-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

DeleteClassifier (delete_classifier) (p. 433)

Action(s): glue:DeleteClassifier

Resource:

*
DeleteConnection (delete_connection) (p. 418)

Action(s): glue:DeleteConnection

Resource:

arn:aws:glue:region:account-id:connection/connection-name
arn:aws:glue:region:account-id:catalog

DeleteCrawler (delete_crawler) (p. 440)

Action(s): glue:DeleteCrawler

Resource:

*
DeleteDatabase (delete_database) (p. 387)

Action(s): glue:DeleteDatabase

Resource:

arn:aws:glue:region:account-id:database/database-name


arn:aws:glue:region:account-id:userDefinedFunction/database-name/*
arn:aws:glue:region:account-id:table/database-name/*
arn:aws:glue:region:account-id:catalog

DeleteDevEndpoint (delete_dev_endpoint) (p. 477)

Action(s): glue:DeleteDevEndpoint

Resource:

arn:aws:glue:region:account-id:devEndpoint/development-endpoint-name

or

arn:aws:glue:region:account-id:devEndpoint/*
DeleteJob (delete_job) (p. 458)

Action(s): glue:DeleteJob

Resource:

*
DeletePartition (delete_partition) (p. 409)

Action(s): glue:DeletePartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

DeleteResourcePolicy (delete_resource_policy) (p. 382)

Action(s): glue:DeleteResourcePolicy

Resource:

*
DeleteSecurityConfiguration (delete_security_configuration) (p. 383)

Action(s): glue:DeleteSecurityConfiguration

Resource:

*
DeleteTable (delete_table) (p. 397)

Action(s): glue:DeleteTable

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog


DeleteTableVersion (delete_table_version) (p. 402)

Action(s): glue:DeleteTableVersion

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

DeleteTrigger (delete_trigger) (p. 472)

Action(s): glue:DeleteTrigger

Resource:

*
DeleteUserDefinedFunction (delete_user_defined_function) (p. 424)

Action(s): glue:DeleteUserDefinedFunction

Resource:

arn:aws:glue:region:account-id:userDefinedFunction/database-name/user-defined-function-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetCatalogImportStatus (get_catalog_import_status) (p. 427)

Action(s): glue:GetCatalogImportStatus

Resource:

arn:aws:glue:region:account-id:catalog

GetClassifier (get_classifier) (p. 433)

Action(s): glue:GetClassifier

Resource:

*
GetClassifiers (get_classifiers) (p. 434)

Action(s): glue:GetClassifiers

Resource:

*
GetConnection (get_connection) (p. 418)

Action(s): glue:GetConnection

Resource:


arn:aws:glue:region:account-id:connection/connection-name
arn:aws:glue:region:account-id:catalog

GetConnections (get_connections) (p. 419)

Action(s): glue:GetConnections

Resource:

arn:aws:glue:region:account-id:connection/connection-names
arn:aws:glue:region:account-id:catalog

GetCrawler (get_crawler) (p. 441)

Action(s): glue:GetCrawler

Resource:

*
GetCrawlerMetrics (get_crawler_metrics) (p. 442)

Action(s): glue:GetCrawlerMetrics

Resource:

*
GetCrawlers (get_crawlers) (p. 441)

Action(s): glue:GetCrawlers

Resource:

*
GetDatabase (get_database) (p. 388)

Action(s): glue:GetDatabase

Resource:

arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetDatabases (get_databases) (p. 389)

Action(s): glue:GetDatabases

Resource:

arn:aws:glue:region:account-id:database/database-names
arn:aws:glue:region:account-id:catalog

GetDataCatalogEncryptionSettings (get_data_catalog_encryption_settings) (p. 379)

Action(s): glue:GetDataCatalogEncryptionSettings


Resource:

*
GetDataflowGraph (get_dataflow_graph) (p. 449)

Action(s): glue:GetDataflowGraph

Resource:

*
GetDevEndpoint (get_dev_endpoint) (p. 478)

Action(s): glue:GetDevEndpoint

Resource:

arn:aws:glue:region:account-id:devEndpoint/development-endpoint-name

or

arn:aws:glue:region:account-id:devEndpoint/*
GetDevEndpoints (get_dev_endpoints) (p. 478)

Action(s): glue:GetDevEndpoints

Resource:

*
GetJob (get_job) (p. 457)

Action(s): glue:GetJob

Resource:

*
GetJobRun (get_job_run) (p. 463)

Action(s): glue:GetJobRun

Resource:

*
GetJobRuns (get_job_runs) (p. 464)

Action(s): glue:GetJobRuns

Resource:

*
GetJobs (get_jobs) (p. 457)

Action(s): glue:GetJobs

Resource:

*
GetMapping (get_mapping) (p. 450)

Action(s): glue:GetMapping


Resource:

*
GetPartition (get_partition) (p. 410)

Action(s): glue:GetPartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetPartitions (get_partitions) (p. 411)

Action(s): glue:GetPartitions

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetPlan (get_plan) (p. 450)

Action(s): glue:GetPlan

Resource:

*
GetResourcePolicy (get_resource_policy) (p. 381)

Action(s): glue:GetResourcePolicy

Resource:

*
GetSecurityConfiguration (get_security_configuration) (p. 384)

Action(s): glue:GetSecurityConfiguration

Resource:

*
GetSecurityConfigurations (get_security_configurations) (p. 384)

Action(s): glue:GetSecurityConfigurations

Resource:

*
GetTable (get_table) (p. 399)

Action(s): glue:GetTable

Resource:


arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetTables (get_tables) (p. 399)

Action(s): glue:GetTables

Resource:

arn:aws:glue:region:account-id:table/database-name/table-names
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetTableVersion (get_table_version) (p. 400)

Action(s): glue:GetTableVersion

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetTableVersions (get_table_versions) (p. 401)

Action(s): glue:GetTableVersions

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetTrigger (get_trigger) (p. 470)

Action(s): glue:GetTrigger

Resource:

*
GetTriggers (get_triggers) (p. 470)

Action(s): glue:GetTriggers

Resource:

*
GetUserDefinedFunction (get_user_defined_function) (p. 424)

Action(s): glue:GetUserDefinedFunction


Resource:

arn:aws:glue:region:account-id:userDefinedFunction/database-name/user-defined-function-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

GetUserDefinedFunctions (get_user_defined_functions) (p. 425)

Action(s): glue:GetUserDefinedFunctions

Resource:

arn:aws:glue:region:account-id:userDefinedFunction/database-name/user-defined-function-names
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

ImportCatalogToGlue (import_catalog_to_glue) (p. 426)

Action(s): glue:ImportCatalogToGlue

Resource:

arn:aws:glue:region:account-id:catalog

PutDataCatalogEncryptionSettings (put_data_catalog_encryption_settings) (p. 380)

Action(s): glue:PutDataCatalogEncryptionSettings

Resource:

*
PutResourcePolicy (put_resource_policy) (p. 381)

Action(s): glue:PutResourcePolicy

Resource:

*
ResetJobBookmark (reset_job_bookmark) (p. 465)

Action(s): glue:ResetJobBookmark

Resource:

*
StartCrawler (start_crawler) (p. 443)

Action(s): glue:StartCrawler

Resource:

*

StartCrawlerSchedule (start_crawler_schedule) (p. 445)

Action(s): glue:StartCrawlerSchedule

Resource:

*
StartJobRun (start_job_run) (p. 462)

Action(s): glue:StartJobRun

Resource:

*
StartTrigger (start_trigger) (p. 469)

Action(s): glue:StartTrigger

Resource:

*
StopCrawler (stop_crawler) (p. 444)

Action(s): glue:StopCrawler

Resource:

*
StopCrawlerSchedule (stop_crawler_schedule) (p. 446)

Action(s): glue:StopCrawlerSchedule

Resource:

*
StopTrigger (stop_trigger) (p. 471)

Action(s): glue:StopTrigger

Resource:

*
UpdateClassifier (update_classifier) (p. 434)

Action(s): glue:UpdateClassifier

Resource:

*
UpdateConnection (update_connection) (p. 420)

Action(s): glue:UpdateConnection

Resource:

arn:aws:glue:region:account-id:connection/connection-name
arn:aws:glue:region:account-id:catalog


UpdateCrawler (update_crawler) (p. 442)

Action(s): glue:UpdateCrawler

Resource:

*
UpdateCrawlerSchedule (update_crawler_schedule) (p. 445)

Action(s): glue:UpdateCrawlerSchedule

Resource:

*
UpdateDatabase (update_database) (p. 387)

Action(s): glue:UpdateDatabase

Resource:

arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

UpdateDevEndpoint (update_dev_endpoint) (p. 477)

Action(s): glue:UpdateDevEndpoint

Resource:

arn:aws:glue:region:account-id:devEndpoint/development-endpoint-name

or

arn:aws:glue:region:account-id:devEndpoint/*
UpdateJob (update_job) (p. 456)

Action(s): glue:UpdateJob

Resource:

*
UpdatePartition (update_partition) (p. 408)

Action(s): glue:UpdatePartition

Resource:

arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

UpdateTable (update_table) (p. 396)

Action(s): glue:UpdateTable

Resource:


arn:aws:glue:region:account-id:table/database-name/table-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

UpdateTrigger (update_trigger) (p. 471)

Action(s): glue:UpdateTrigger

Resource:

*
UpdateUserDefinedFunction (update_user_defined_function) (p. 423)

Action(s): glue:UpdateUserDefinedFunction

Resource:

arn:aws:glue:region:account-id:userDefinedFunction/database-name/user-defined-function-name
arn:aws:glue:region:account-id:database/database-name
arn:aws:glue:region:account-id:catalog

Related Topics
• Authentication and Access Control (p. 40)

Encryption and Secure Access for AWS Glue


You can encrypt metadata objects in your AWS Glue Data Catalog in addition to the data written to
Amazon Simple Storage Service (Amazon S3) and Amazon CloudWatch Logs by jobs, crawlers, and
development endpoints. You can enable encryption of the entire Data Catalog in your account. When you
create jobs, crawlers, and development endpoints in AWS Glue, you can provide encryption settings, such
as a security configuration, to configure encryption for that process.

With AWS Glue, you can encrypt data using keys that you manage with AWS Key Management Service
(AWS KMS). With encryption enabled, when you add Data Catalog objects, run crawlers, run jobs, or start
development endpoints, AWS KMS keys are used to write data at rest. In addition, you can configure
AWS Glue to only access Java Database Connectivity (JDBC) data stores through a trusted Secure Sockets
Layer (SSL) protocol.

In AWS Glue, you control encryption settings in the following places:

• The settings of your Data Catalog.


• The security configurations that you create.
• The server-side encryption (SSE-S3) setting that is passed as a parameter to your AWS Glue ETL
(extract, transform, and load) job.

For more information about how to set up encryption, see Setting Up Encryption in AWS Glue (p. 35).

Topics


• Encrypting Your Data Catalog (p. 81)


• Encrypting Connection Passwords (p. 82)
• Encrypting Data Written by Crawlers, Jobs, and Development Endpoints (p. 82)

Encrypting Your Data Catalog


You can enable encryption of your AWS Glue Data Catalog objects in the Settings of the Data Catalog on
the AWS Glue console. You can enable or disable encryption settings for the entire Data Catalog. In the
process, you specify an AWS KMS key that is automatically used when objects, such as tables, are written
to the Data Catalog. The encrypted objects include the following:

• Databases
• Tables
• Partitions
• Table versions
• Connections
• User-defined functions

You can set this behavior using the AWS Management Console or AWS Command Line Interface (AWS
CLI).

To enable encryption using the console

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose Settings in the navigation pane.
3. On the Data catalog settings page, select the Metadata encryption check box, and choose an AWS
KMS key.

When encryption is enabled, all future Data Catalog objects are encrypted. The default key is the AWS
Glue AWS KMS key that is created for your account by AWS. If you clear this setting, objects are no longer
encrypted when they are written to the Data Catalog. Any encrypted objects in the Data Catalog can
continue to be accessed with the key.

To enable encryption using the SDK or AWS CLI

• Use the PutDataCatalogEncryptionSettings API operation. If no key is specified, the default
AWS Glue encryption key for the customer account is used.
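
The following AWS CLI call is a minimal sketch of enabling metadata encryption with PutDataCatalogEncryptionSettings; the AWS KMS key ARN is a placeholder for a key in your account. You can verify the result with the GetDataCatalogEncryptionSettings operation.

$ aws glue put-data-catalog-encryption-settings --region us-west-2 \
    --data-catalog-encryption-settings '{
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": "arn:aws:kms:us-west-2:123456789012:key/key-id"
        }
    }'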

Important
The AWS KMS key must remain available in the AWS KMS key store for any objects that are
encrypted with it in the Data Catalog. If you remove the key, the objects can no longer be
decrypted. You might want this in some scenarios to prevent access to Data Catalog metadata.

When encryption is enabled, the client that is accessing the Data Catalog must have the following AWS
KMS permissions in its policy:

• kms:Decrypt
• kms:Encrypt
• kms:GenerateDataKey


For example, when you define a crawler or a job, the IAM role that you provide in the definition must
have these AWS KMS permissions:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:Encrypt",
"kms:GenerateDataKey"
],
"Resource": "ARN-of-key-used-to-encrypt-data-catalog"
}
]
}

Encrypting Connection Passwords


You can retrieve connection passwords in the AWS Glue Data Catalog by using the GetConnection and
GetConnections API operations. These passwords are stored in the Data Catalog connection and are
used when AWS Glue connects to a Java Database Connectivity (JDBC) data store. When the connection
was created or updated, an option in the Data Catalog settings determined whether the password was
encrypted, and if so, what AWS Key Management Service (AWS KMS) key was specified.

On the AWS Glue console, you can enable this option on the Data catalog settings page:

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose Settings in the navigation pane.
3. On the Data catalog settings page, select the Encrypt connection passwords check box.
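
If you manage this setting programmatically, it corresponds to the ConnectionPasswordEncryption portion of the Data Catalog encryption settings. The following AWS CLI call is a sketch; the AWS KMS key ARN is a placeholder. Because the call sets the complete encryption settings structure, include any EncryptionAtRest settings that you have already enabled in the same call.

$ aws glue put-data-catalog-encryption-settings --region us-west-2 \
    --data-catalog-encryption-settings '{
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": true,
            "AwsKmsKeyId": "arn:aws:kms:us-west-2:123456789012:key/key-id"
        }
    }'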

For more information, see Working with Data Catalog Settings on the AWS Glue Console (p. 123).

Encrypting Data Written by Crawlers, Jobs, and Development Endpoints

A security configuration is a set of security properties that can be used by AWS Glue. You can use a
security configuration to encrypt data at rest. The following scenarios show some of the ways that you
can use a security configuration.

• Attach a security configuration to an AWS Glue crawler to write encrypted Amazon CloudWatch Logs.
• Attach a security configuration to an extract, transform, and load (ETL) job to write encrypted Amazon
Simple Storage Service (Amazon S3) targets and encrypted CloudWatch Logs.
• Attach a security configuration to an ETL job to write its jobs bookmarks as encrypted Amazon S3
data.
• Attach a security configuration to a development endpoint to write encrypted Amazon S3 targets.

Important
Currently, a security configuration overrides any server-side encryption (SSE-S3) setting
that is passed as an ETL job parameter. Thus, if both a security configuration and an SSE-S3
parameter are associated with a job, the SSE-S3 parameter is ignored.


For more information about security configurations, see Working with Security Configurations on the
AWS Glue Console (p. 84).

Topics
• Setting Up AWS Glue to Use Security Configurations (p. 83)
• Creating a Route to AWS KMS for VPC Jobs and Crawlers (p. 83)
• Working with Security Configurations on the AWS Glue Console (p. 84)

Setting Up AWS Glue to Use Security Configurations


Follow these steps to set up your AWS Glue environment to use security configurations.

1. Create or update your AWS Key Management Service (AWS KMS) keys to allow AWS KMS permissions
to the IAM roles that are passed to AWS Glue crawlers and jobs to encrypt CloudWatch Logs.
For more information, see Encrypt Log Data in CloudWatch Logs Using AWS KMS in the Amazon
CloudWatch Logs User Guide.

In the following example, "role1", "role2", and "role3" are IAM roles that are passed to
crawlers and jobs:

{
"Effect": "Allow",
"Principal": { "Service": "logs.region.amazonaws.com",
"AWS": [
"role1",
"role2",
"role3"
] },
"Action": [
"kms:Encrypt*",
"kms:Decrypt*",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:Describe*"
],
"Resource": "*"
}

The Service statement, shown as "Service": "logs.region.amazonaws.com", is required if
you use the key to encrypt CloudWatch Logs.
2. Ensure that the AWS KMS key is ENABLED before it is used.

Creating a Route to AWS KMS for VPC Jobs and Crawlers


You can connect directly to AWS KMS through a private endpoint in your virtual private cloud (VPC)
instead of connecting over the internet. When you use a VPC endpoint, communication between your
VPC and AWS KMS is conducted entirely within the AWS network.

You can create an AWS KMS VPC endpoint within a VPC. Without this step, your jobs or crawlers might
fail with a kms timeout on jobs or an internal service exception on crawlers. For detailed
instructions, see Connecting to AWS KMS Through a VPC Endpoint in the AWS Key Management Service
Developer Guide.

As you follow these instructions, on the VPC console, you must do the following:

• Select the Enable Private DNS name check box.


• Choose the Security group (with self-referencing rule) that you use for your job or crawler that
accesses Java Database Connectivity (JDBC). For more information about AWS Glue connections, see
Adding a Connection to Your Data Store (p. 92).

When you add a security configuration to a crawler or job that accesses JDBC data stores, AWS Glue must
have a route to the AWS KMS endpoint. You can provide the route with a network address translation
(NAT) gateway or with an AWS KMS VPC endpoint. To create a NAT gateway, see NAT Gateways in the
Amazon VPC User Guide.
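
As a sketch of the VPC endpoint option, the following AWS CLI call creates an interface endpoint for AWS KMS with private DNS enabled. The VPC, subnet, and security group IDs are placeholders, and the service name varies by Region:

$ aws ec2 create-vpc-endpoint --vpc-endpoint-type Interface \
    --vpc-id vpc-0abc1234 \
    --service-name com.amazonaws.us-west-2.kms \
    --subnet-ids subnet-0abc1234 \
    --security-group-ids sg-0abc1234 \
    --private-dns-enabled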

Working with Security Configurations on the AWS Glue Console


A security configuration in AWS Glue contains the properties that are needed when you write encrypted
data. You create security configurations on the AWS Glue console to provide the encryption properties
that are used by crawlers, jobs, and development endpoints.

To see a list of all the security configurations that you have created, open the AWS Glue console at
https://console.aws.amazon.com/glue/ and choose Security configurations in the navigation pane.

The Security configurations list displays the following properties about each configuration:

Name

The unique name you provided when you created the configuration.
S3 encryption mode

If enabled, the Amazon Simple Storage Service (Amazon S3) encryption mode such as SSE-KMS or
SSE-S3.
CloudWatch logs encryption mode

If enabled, the encryption mode used for CloudWatch Logs, such as SSE-KMS.


Job bookmark encryption mode

If enabled, the encryption mode used for job bookmarks, such as CSE-KMS.


Date created

The date and time (UTC) that the configuration was created.

You can add or delete configurations in the Security configurations section on the console. To see more
details for a configuration, choose the configuration name in the list. Details include the information that
you defined when you created the configuration.

Adding a Security Configuration


To add a security configuration using the AWS Glue console, on the Security configurations page,
choose Add security configuration. The wizard guides you through setting the required properties.

To set up encryption of data and metadata with AWS Key Management Service (AWS KMS) keys on the
AWS Glue console, add a policy to the console user. This policy must specify the allowed resources as key
ARNs that are used to encrypt Amazon S3 data stores, as in the following example:

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"kms:GenerateDataKey",
"kms:Decrypt",


"kms:Encrypt"],
"Resource": "arn:aws:kms:region:account-id:key/key-id"}
}

Important
When a security configuration is attached to a crawler or job, the IAM role that is passed must
have AWS KMS permissions. For more information, see Encrypting Data Written by Crawlers,
Jobs, and Development Endpoints (p. 82).

When you define a configuration, you can provide values for the following properties:

S3 encryption

When you are writing Amazon S3 data, you use either server-side encryption with Amazon S3
managed keys (SSE-S3) or server-side encryption with AWS KMS managed keys (SSE-KMS). This
field is optional. To enable it with SSE-KMS, choose an AWS KMS key, or choose Enter a
key ARN and provide the Amazon Resource Name (ARN) for the key. Enter the ARN in the form
arn:aws:kms:region:account-id:key/key-id. You can also provide the ARN as a key alias,
such as arn:aws:kms:region:account-id:alias/alias-name.
CloudWatch Logs encryption

Server-side (SSE-KMS) encryption is used to encrypt CloudWatch Logs. This field is optional. To
enable it, choose an AWS KMS key, or choose Enter a key ARN and provide the ARN for the key.
Enter the ARN in the form arn:aws:kms:region:account-id:key/key-id. You can also
provide the ARN as a key alias, such as arn:aws:kms:region:account-id:alias/alias-name.
Job bookmark encryption

Client-side (CSE-KMS) encryption is used to encrypt job bookmarks. This field is optional. The
bookmark data is encrypted before it is sent to Amazon S3 for storage. To enable it, choose an AWS
KMS key, or choose Enter a key ARN and provide the ARN for the key. Enter the ARN in the form
arn:aws:kms:region:account-id:key/key-id. You can also provide the ARN as a key alias,
such as arn:aws:kms:region:account-id:alias/alias-name.
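
The console wizard creates the same object that the CreateSecurityConfiguration operation creates. The following AWS CLI call is a sketch of an equivalent configuration; the configuration name and key ARNs are placeholders. You can then reference the configuration by name when you create a crawler, job, or development endpoint.

$ aws glue create-security-configuration --region us-west-2 \
    --name my-security-configuration \
    --encryption-configuration '{
        "S3Encryption": [
            {
                "S3EncryptionMode": "SSE-KMS",
                "KmsKeyArn": "arn:aws:kms:us-west-2:123456789012:key/key-id"
            }
        ],
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": "arn:aws:kms:us-west-2:123456789012:key/key-id"
        },
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS",
            "KmsKeyArn": "arn:aws:kms:us-west-2:123456789012:key/key-id"
        }
    }'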

For more information, see the following topics in the Amazon Simple Storage Service Developer Guide:

• For information about SSE-S3, see Protecting Data Using Server-Side Encryption with Amazon S3-
Managed Encryption Keys (SSE-S3).
• For information about SSE-KMS, see Protecting Data Using Server-Side Encryption with AWS KMS–
Managed Keys (SSE-KMS).
• For information about CSE-KMS, see Using an AWS KMS–Managed Customer Master Key (CMK).


Populating the AWS Glue Data Catalog

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your
extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse, you must catalog
this data. The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your
data. You use the information in the Data Catalog to create and monitor your ETL jobs. Typically, you
run a crawler to take inventory of the data in your data stores, but there are other ways to add metadata
tables into your Data Catalog.

You can add table definitions to the AWS Glue Data Catalog in the following ways:

• Run a crawler that connects to one or more data stores, determines the data structures, and writes
tables into the Data Catalog. You can run your crawler on a schedule. For more information, see
Cataloging Tables with a Crawler (p. 97).
• Use the AWS Glue console to create a table in the AWS Glue Data Catalog. For more information, see
Working with Tables on the AWS Glue Console (p. 89).

• Use the CreateTable operation in the AWS Glue API (p. 371) to create a table in the AWS Glue Data
Catalog.
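
The following AWS CLI call is a minimal sketch of the CreateTable path for a CSV data set in Amazon S3; the database, table, columns, and location are placeholders:

$ aws glue create-table --region us-west-2 --database-name my_database --table-input '{
    "Name": "my_csv_table",
    "StorageDescriptor": {
        "Columns": [
            { "Name": "id", "Type": "int" },
            { "Name": "name", "Type": "string" }
        ],
        "Location": "s3://my-app-bucket/my-csv-data/",
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
            "Parameters": { "separatorChar": "," }
        }
    },
    "Parameters": { "classification": "csv" }
}'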

The following workflow diagram shows how AWS Glue crawlers interact with data stores and other
elements to populate the Data Catalog.


The following is the general workflow for how a crawler populates the AWS Glue Data Catalog:

1. A crawler runs any custom classifiers that you choose to infer the schema of your data. You provide
the code for custom classifiers, and they run in the order that you specify.

The first custom classifier to successfully recognize the structure of your data is used to create a
schema. Custom classifiers lower in the list are skipped.
2. If no custom classifier matches your data's schema, built-in classifiers try to recognize your data's
schema.
3. The crawler connects to the data store. Some data stores require connection properties for crawler
access.
4. The inferred schema is created for your data.
5. The crawler writes metadata to the Data Catalog. A table definition contains metadata about the
data in your data store. The table is written to a database, which is a container of tables in the Data
Catalog. Attributes of a table include classification, which is a label created by the classifier that
inferred the table schema.
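
As a sketch of this workflow from the AWS CLI, the following calls create and start a crawler over an example Amazon S3 path. The crawler name, IAM role, database, and path are placeholders:

$ aws glue create-crawler --region us-west-2 \
    --name my-s3-crawler \
    --role AWSGlueServiceRole-my-role \
    --database-name my_database \
    --targets '{ "S3Targets": [ { "Path": "s3://my-app-bucket/Sales/" } ] }'

$ aws glue start-crawler --region us-west-2 --name my-s3-crawler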

Topics
• Defining a Database in Your Data Catalog (p. 88)
• Defining Tables in the AWS Glue Data Catalog (p. 88)
• Adding a Connection to Your Data Store (p. 92)
• Cataloging Tables with a Crawler (p. 97)
• Adding Classifiers to a Crawler (p. 109)


• Working with Data Catalog Settings on the AWS Glue Console (p. 123)
• Populating the Data Catalog Using AWS CloudFormation Templates (p. 124)

Defining a Database in Your Data Catalog


When you define a table in the AWS Glue Data Catalog, you add it to a database. A database is used to
organize tables in AWS Glue. You can organize your tables using a crawler or using the AWS Glue console.
A table can be in only one database at a time.

Your database can contain tables that define data from many different data stores. This data can include
objects in Amazon Simple Storage Service (Amazon S3) and relational tables in Amazon Relational
Database Service.
Note
When you delete a database, all the tables in the database are also deleted.

For more information about defining a database using the AWS Glue console, see Working with
Databases on the AWS Glue Console (p. 88).

Working with Databases on the AWS Glue Console


A database in the AWS Glue Data Catalog is a container that holds tables. You use databases to organize
your tables into separate categories. Databases are created when you run a crawler or add a table
manually. The database list in the AWS Glue console displays descriptions for all your databases.

To view the list of databases, sign in to the AWS Management Console and open the AWS Glue console at
https://console.aws.amazon.com/glue/. Choose Databases, and then choose a database name in the list
to view the details.

From the Databases tab in the AWS Glue console, you can add, edit, and delete databases:

• To create a new database, choose Add database and provide a name and description. For compatibility
with other metadata stores, such as Apache Hive, the name is folded to lowercase characters.
Note
If you plan to access the database from Amazon Athena, then provide a name with only
alphanumeric and underscore characters. For more information, see Athena names.
• To edit the description for a database, select the check box next to the database name and choose
Action, Edit database.
• To delete a database, select the check box next to the database name and choose Action, Delete
database.
• To display the list of tables contained in the database, select the check box next to the database name
and choose View tables.

To change the database that a crawler writes to, you must change the crawler definition. For more
information, see Cataloging Tables with a Crawler (p. 97).

Defining Tables in the AWS Glue Data Catalog


When you define a table in AWS Glue, you also specify the value of a classification field that indicates the
type and format of the data that's stored in that table. If a crawler creates the table, these classifications
are determined by either a built-in classifier or a custom classifier. If you create a table manually in


the console or by using an API, you specify the classification when you define the table. For more
information about creating a table using the AWS Glue console, see Working with Tables on the AWS
Glue Console (p. 89).

When a crawler detects a change in table metadata, a new version of the table is created in the AWS
Glue Data Catalog. You can compare current and past versions of a table.

The schema of the table contains its structure. You can also edit a schema to create a new version of the
table.

The table's history is also maintained in the Data Catalog. This history includes metrics that are gathered
when a data store is updated by an extract, transform, and load (ETL) job. You can find out the name of
the job, when it ran, how many rows were added, and how long the job took to run. The version of the
schema that was used by an ETL job is also kept in the history.

Table Partitions
An AWS Glue table definition of an Amazon Simple Storage Service (Amazon S3) folder can describe
a partitioned table. For example, to improve query performance, a partitioned table might separate
monthly data into different files using the name of the month as a key. In AWS Glue, table definitions
include the partitioning key of a table. When AWS Glue evaluates the data in Amazon S3 folders to
catalog a table, it determines whether an individual table or a partitioned table is added.

All the following conditions must be true for AWS Glue to create a partitioned table for an Amazon S3
folder:

• The schemas of the files are similar, as determined by AWS Glue.


• The data format of the files is the same.
• The compression format of the files is the same.

For example, you might own an Amazon S3 bucket named my-app-bucket, where you store both
iOS and Android app sales data. The data is partitioned by year, month, and day. The data files for iOS
and Android sales have the same schema, data format, and compression format. In the AWS Glue Data
Catalog, the AWS Glue crawler creates one table definition with partitioning keys for year, month, and
day.

The following Amazon S3 listing of my-app-bucket shows some of the partitions. The = symbol is used
to assign partition key values.

my-app-bucket/Sales/year='2010'/month='feb'/day='1'/iOS.csv
my-app-bucket/Sales/year='2010'/month='feb'/day='1'/Android.csv
my-app-bucket/Sales/year='2010'/month='feb'/day='2'/iOS.csv
my-app-bucket/Sales/year='2010'/month='feb'/day='2'/Android.csv
...
my-app-bucket/Sales/year='2017'/month='feb'/day='4'/iOS.csv
my-app-bucket/Sales/year='2017'/month='feb'/day='4'/Android.csv
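
After a crawler catalogs this bucket, the partition keys appear on the table definition. The following GetTable call and abbreviated response are a sketch of what you might see; the database and table names are placeholders, and the key types that a crawler assigns can vary:

$ aws glue get-table --region us-west-2 --database-name my_app_db --table-name sales
{
    "Table": {
        "Name": "sales",
        "PartitionKeys": [
            { "Name": "year", "Type": "string" },
            { "Name": "month", "Type": "string" },
            { "Name": "day", "Type": "string" }
        ],
        ......
    }
}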

Working with Tables on the AWS Glue Console


A table in the AWS Glue Data Catalog is the metadata definition that represents the data in a data
store. You create tables when you run a crawler, or you can create a table manually in the AWS Glue
console. The Tables list in the AWS Glue console displays values of your table's metadata. You use table
definitions to specify sources and targets when you create ETL (extract, transform, and load) jobs.


To get started, sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/. Choose the Tables tab, and use the Add tables button to create tables
either with a crawler or by manually typing attributes.

Adding Tables on the Console


To use a crawler to add tables, choose Add tables, Add tables using a crawler. Then follow the
instructions in the Add crawler wizard. When the crawler runs, tables are added to the AWS Glue Data
Catalog. For more information, see Cataloging Tables with a Crawler (p. 97).

If you know the attributes that are required to create an Amazon Simple Storage Service (Amazon S3)
table definition in your Data Catalog, you can create it with the table wizard. Choose Add tables, Add
table manually, and follow the instructions in the Add table wizard.

When adding a table manually through the console, consider the following:

• If you plan to access the table from Amazon Athena, then provide a name with only alphanumeric and
underscore characters. For more information, see Athena names.
• The location of your source data must be an Amazon S3 path.
• The data format of the data must match one of the listed formats in the wizard. The corresponding
classification, SerDe, and other table properties are automatically populated based on the format
chosen. You can define tables with the following formats:
JSON

JavaScript Object Notation.


CSV

Character separated values. You also specify the delimiter of either comma, pipe, semicolon, tab,
or Ctrl-A.
Parquet

Apache Parquet columnar storage.


Avro

Apache Avro JSON binary format.


XML

Extensible Markup Language format. Specify the XML tag that defines a row in the data. Columns
are defined within row tags.
• You can define a partition key for the table.
• Currently, partitioned tables that you create with the console cannot be used in ETL jobs.

Table Attributes
The following are some important attributes of your table:

Table name

The name is determined when the table is created, and you can't change it. You refer to a table
name in many AWS Glue operations.
Database

The container object where your table resides. This object contains an organization of your tables
that exists within the AWS Glue Data Catalog and might differ from an organization in your data
store. When you delete a database, all tables contained in the database are also deleted from the
Data Catalog.


Location

The pointer to the location of the data in a data store that this table definition represents.
Classification

A categorization value provided when the table was created. Typically, this is written when a crawler
runs and specifies the format of the source data.
Last updated

The time and date (UTC) that this table was updated in the Data Catalog.
Date added

The time and date (UTC) that this table was added to the Data Catalog.
Description

The description of the table. You can write a description to help you understand the contents of the
table.
Deprecated

If AWS Glue discovers that a table in the Data Catalog no longer exists in its original data store, it
marks the table as deprecated in the data catalog. If you run a job that references a deprecated
table, the job might fail. Edit jobs that reference deprecated tables to remove them as sources and
targets. We recommend that you delete deprecated tables when they are no longer needed.
Connection

If AWS Glue requires a connection to your data store, the name of the connection is associated with
the table.

Viewing and Editing Table Details


To see the details of an existing table, choose the table name in the list, and then choose Action, View
details.

The table details include properties of your table and its schema. This view displays the schema of
the table, including column names in the order defined for the table, data types, and key columns
for partitions. If a column is a complex type, you can choose View properties to display details of the
structure of that field, as shown in the following example:

{
"StorageDescriptor": {
"cols": {
"FieldSchema": [
{
"name": "primary-1",
"type": "CHAR",
"comment": ""
},
{
"name": "second ",
"type": "STRING",
"comment": ""
}
]
},
"location": "s3://aws-logs-111122223333-us-east-1",
"inputFormat": "",
"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"compressed": "false",
"numBuckets": "0",


"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
"parameters": {
"separatorChar": "|"
}
},
"bucketCols": [],
"sortCols": [],
"parameters": {},
"SkewedInfo": {},
"storedAsSubDirectories": "false"
},
"parameters": {
"classification": "csv"
}
}

For more information about the properties of a table, such as StorageDescriptor, see
StorageDescriptor Structure (p. 392).

To change the schema of a table, choose Edit schema to add and remove columns, change column
names, and change data types.

To compare different versions of a table, including its schema, choose Compare versions to see a side-
by-side comparison of two versions of the schema for a table.

To display the files that make up an Amazon S3 partition, choose View partition. For Amazon S3 tables,
the Key column displays the partition keys that are used to partition the table in the source data store.
Partitioning is a way to divide a table into related parts based on the values of a key column, such as
date, location, or department. For more information about partitions, search the internet for information
about "hive partitioning."
Note
To get step-by-step guidance for viewing the details of a table, see the Explore table tutorial in
the console.

Adding a Connection to Your Data Store


Connections are used by crawlers and jobs in AWS Glue to access certain types of data stores. For
information about adding a connection using the AWS Glue console, see Working with Connections on
the AWS Glue Console (p. 94).

When Is a Connection Used?


If your data store requires one, the connection is used when you crawl a data store to catalog its
metadata in the AWS Glue Data Catalog. The connection is also used by any job that uses the data store
as a source or target.

Defining a Connection in the AWS Glue Data Catalog


Some types of data stores require additional connection information to access your data. This
information might include an additional user name and password (different from your AWS credentials),
or other information that is required to connect to the data store.

After AWS Glue connects to a JDBC data store, it must have permission from the data store to perform
operations. The username you provide with the connection must have the required permissions or


privileges. For example, a crawler requires SELECT privileges to retrieve metadata from a JDBC data
store. Likewise, a job that writes to a JDBC target requires the necessary privileges to INSERT, UPDATE,
and DELETE data into an existing table.

AWS Glue can connect to the following data stores by using the JDBC protocol:

• Amazon Redshift
• Amazon Relational Database Service
• Amazon Aurora
• MariaDB
• Microsoft SQL Server
• MySQL
• Oracle
• PostgreSQL
• Publicly accessible databases
• Amazon Aurora
• MariaDB
• Microsoft SQL Server
• MySQL
• Oracle
• PostgreSQL

A connection is not typically required for Amazon S3. However, to access Amazon S3 from within your
virtual private cloud (VPC), an Amazon S3 VPC endpoint is required. For more information, see Amazon
VPC Endpoints for Amazon S3 (p. 29).

In your connection information, you also must consider whether data is accessed through a VPC and then
set up network parameters accordingly. AWS Glue requires a private IP address for JDBC endpoints. Connections
to on-premises databases can go over a VPN or AWS Direct Connect, because both provide private IP access.

For information about how to connect to on-premises databases, see the blog post How to access and
analyze on-premises data stores using AWS Glue.
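
Besides the console, you can define a JDBC connection with the CreateConnection operation. The following AWS CLI call is a sketch; the connection name, JDBC URL, and credentials are placeholders:

$ aws glue create-connection --region us-west-2 --connection-input '{
    "Name": "my-jdbc-connection",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:mysql://myhost.example.com:3306/mydb",
        "USERNAME": "my_username",
        "PASSWORD": "my_password",
        "JDBC_ENFORCE_SSL": "false"
    }
}'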

Connecting to a JDBC Data Store in a VPC


Typically, you create resources inside Amazon Virtual Private Cloud (Amazon VPC) so that they cannot
be accessed over the public internet. By default, resources in a VPC can't be accessed from AWS Glue.
To enable AWS Glue to access resources inside your VPC, you must provide additional VPC-specific
configuration information that includes VPC subnet IDs and security group IDs. AWS Glue uses this
information to set up elastic network interfaces that enable your function to connect securely to other
resources in your private VPC.

Accessing VPC Data Using Elastic Network Interfaces


When AWS Glue connects to a JDBC data store in a VPC, AWS Glue creates an elastic network interface
(with the prefix Glue_) in your account to access your VPC data. You can't delete this network interface
as long as it's attached to AWS Glue. As part of creating the elastic network interface, AWS Glue
associates one or more security groups to it. To enable AWS Glue to create the network interface,
security groups that are associated with the resource must allow inbound access with a source rule.
This rule contains a security group that is associated with the resource. This gives the elastic network
interface access to your data store with the same security group.


To allow AWS Glue to communicate with its components, specify a security group with a self-referencing
inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same
security group in the VPC and not open it to all networks. The default security group for your VPC might
already have a self-referencing inbound rule for ALL Traffic.

You can create rules in the Amazon VPC console. To update rule settings via the AWS Management
Console, navigate to the VPC console (https://console.aws.amazon.com/vpc/), and select the
appropriate security group. Specify the inbound rule for ALL TCP to have as its source the same security
group name. For more information about security group rules, see Security Groups for Your VPC.
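
As a sketch, you can also add the self-referencing rule from the AWS CLI. The security group ID is a placeholder, and the same group is used as both the group to modify and the source:

$ aws ec2 authorize-security-group-ingress --group-id sg-0abc1234 \
    --protocol tcp --port 0-65535 --source-group sg-0abc1234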

Each elastic network interface is assigned a private IP address from the IP address range in the subnets
that you specify. The network interface is not assigned any public IP addresses. AWS Glue requires
internet access (for example, to access AWS services that don't have VPC endpoints). You can configure
a network address translation (NAT) instance inside your VPC, or you can use the Amazon VPC NAT
gateway. For more information, see NAT Gateways in the Amazon VPC User Guide. You can't directly use
an internet gateway attached to your VPC as a route in your subnet route table because that requires the
network interface to have public IP addresses.

The VPC network attributes enableDnsHostnames and enableDnsSupport must be set to true. For
more information, see Using DNS with your VPC.
Important
Don't put your data store in a public subnet or in a private subnet that doesn't have internet
access. Instead, attach it only to private subnets that have internet access through a NAT
instance or an Amazon VPC NAT gateway.

Elastic Network Interface Properties


To create the elastic network interface, you must supply the following properties:

VPC

The name of the VPC that contains your data store.


Subnet

The subnet in the VPC that contains your data store.


Security groups

The security groups that are associated with your data store. AWS Glue associates these security
groups with the elastic network interface that is attached to your VPC subnet. To allow AWS Glue
components to communicate and also prevent access from other networks, at least one chosen
security group must specify a self-referencing inbound rule for all TCP ports.

For information about managing a VPC with Amazon Redshift, see Managing Clusters in an Amazon
Virtual Private Cloud (VPC).

For information about managing a VPC with Amazon RDS, see Working with an Amazon RDS DB Instance
in a VPC.
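
When you define the connection programmatically, these properties map to the PhysicalConnectionRequirements structure of the connection input; the VPC itself is implied by the subnet that you choose. The following fragment is a sketch with placeholder IDs:

"PhysicalConnectionRequirements": {
    "SubnetId": "subnet-0abc1234",
    "SecurityGroupIdList": [ "sg-0abc1234" ],
    "AvailabilityZone": "us-west-2a"
}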

Working with Connections on the AWS Glue Console


A connection contains the properties that are needed to access your data store. To see a list of all the
connections that you have created, open the AWS Glue console at https://console.aws.amazon.com/
glue/, and choose the Connections tab.

The Connections list displays the following properties about each connection:


Name

When you create a connection, you give it a unique name.


Type

The data store type and the properties that are required for a successful connection. AWS Glue uses
the JDBC protocol to access several types of data stores.
Date created

The date and time (UTC) that the connection was created.
Last updated

The date and time (UTC) that the connection was last updated.
Updated by

The user who created or last updated the connection.

From the Connections tab in the AWS Glue console, you can add, edit, and delete connections. To see
more details for a connection, choose the connection name in the list. Details include the information
you defined when you created the connection.

As a best practice, before you use a data store connection in an ETL job, choose Test connection. AWS
Glue uses the parameters in your connection to confirm that it can access your data store and reports
back any errors. Connections are required for Amazon Redshift, Amazon Relational Database Service
(Amazon RDS), and JDBC data stores. For more information, see Connecting to a JDBC Data Store in a
VPC (p. 93).
Important
Currently, an ETL job can use only one JDBC connection. If you have multiple data stores in a
job, they must be on the same subnet.

Adding a JDBC Connection to a Data Store


To add a connection in the AWS Glue console, choose Add connection. The wizard guides you through
adding the properties that are required to create a JDBC connection to a data store. If you choose
Amazon Redshift or Amazon RDS, AWS Glue tries to determine the underlying JDBC properties to create
the connection.

When you define a connection, values for the following properties are required:

Connection name

Type a unique name for your connection.


Connection type

Choose Amazon Redshift, Amazon RDS, or JDBC.


• If you choose Amazon Redshift, choose a Cluster, Database name, Username, and Password in
your account to create a JDBC connection.
• If you choose Amazon RDS, choose an Instance, Database name, Username, and Password in your
account to create a JDBC connection. The console also lists the supported database engine types.
Require SSL connection

Select this option to require AWS Glue to verify that the JDBC database connection is made over a
trusted Secure Sockets Layer (SSL). Selecting this option is optional. If it is not selected, AWS Glue can
ignore failures when it uses SSL to encrypt a connection to a JDBC database. See the documentation
for your database for configuration instructions. When you select this option, if AWS Glue cannot
connect using SSL, the job run, crawler, or ETL statements in a development endpoint fail.


This option is validated on the AWS Glue client side. AWS Glue only connects to JDBC over SSL with
certificate and host name validation. Support is available for:
• Oracle
• Microsoft SQL Server
• PostgreSQL
• Amazon Redshift
• MySQL (Amazon RDS instances only)
• Aurora MySQL (Amazon RDS instances only)
• Aurora Postgres (Amazon RDS instances only)
Note
To enable an Amazon RDS Oracle data store to use Require SSL connection, you need to
create and attach an option group to the Oracle instance.
1. Sign in to the AWS Management Console and open the Amazon RDS console at https://
console.aws.amazon.com/rds/.
2. Add an Option group to the Amazon RDS Oracle instance. For more information about
how to add an option group on the Amazon RDS console, see Creating an Option Group
3. Add an Option to the option group for SSL. The Port you specify for SSL is later used
when you create an AWS Glue JDBC connection URL for the Amazon RDS Oracle instance.
For more information about how to add an option on the Amazon RDS console, see
Adding an Option to an Option Group. For more information about the Oracle SSL
option, see Oracle SSL.
4. On the AWS Glue console, create a connection to the Amazon RDS Oracle instance. In
the connection definition, select Require SSL connection, and when requested, enter the
Port you used in the Amazon RDS Oracle SSL option.
JDBC URL

Type the URL for your JDBC data store. For most database engines, this field is in the following
format.

jdbc:protocol://host:port/db_name

Depending on the database engine, a different JDBC URL format might be required. This format can
have slightly different use of the colon (:) and slash (/) or different keywords to specify databases.

For JDBC to connect to the data store, a db_name in the data store is required. The db_name is used
to establish a network connection with the supplied username and password. When connected,
AWS Glue can access other databases in the data store to run a crawler or run an ETL job.

The following JDBC URL examples show the syntax for several database engines.
• To connect to an Amazon Redshift cluster data store with a dev database:

jdbc:redshift://xxx.us-east-1.redshift.amazonaws.com:8192/dev
• To connect to an Amazon RDS for MySQL data store with an employee database:

jdbc:mysql://xxx-cluster.cluster-xxx.us-east-1.rds.amazonaws.com:3306/employee
• To connect to an Amazon RDS for PostgreSQL data store with an employee database:

jdbc:postgresql://xxx-cluster.cluster-xxx.us-east-1.rds.amazonaws.com:5432/employee
• To connect to an Amazon RDS for Oracle data store with an employee service name:

jdbc:oracle:thin://@xxx-cluster.cluster-xxx.us-east-1.rds.amazonaws.com:1521/employee


The syntax for Amazon RDS for Oracle can follow the following patterns:
• jdbc:oracle:thin://@host:port/service_name
• jdbc:oracle:thin://@host:port:SID
• To connect to an Amazon RDS for Microsoft SQL Server data store with an employee database:

jdbc:sqlserver://xxx-cluster.cluster-xxx.us-east-1.rds.amazonaws.com:1433;database=employee

The syntax for Amazon RDS for SQL Server can follow the following patterns:
• jdbc:sqlserver://server_name:port;database=db_name
• jdbc:sqlserver://server_name:port;databaseName=db_name
Username

Provide a user name that has permission to access the JDBC data store.
Password

Type the password for the user name that has access permission to the JDBC data store.
Port

Type the port used in the JDBC URL to connect to an Amazon RDS Oracle instance. This field is only
shown when Require SSL connection is selected for an Amazon RDS Oracle instance.
VPC

Choose the name of the virtual private cloud (VPC) that contains your data store. The AWS Glue
console lists all VPCs for the current region.
Subnet

Choose the subnet within the VPC that contains your data store. The AWS Glue console lists all
subnets for the data store in your VPC.
Security groups

Choose the security groups that are associated with your data store. AWS Glue requires one or more
security groups with an inbound source rule that allows AWS Glue to connect. The AWS Glue console
lists all security groups that are granted inbound access to your VPC. AWS Glue associates these
security groups with the elastic network interface that is attached to your VPC subnet.
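
If you prefer to define the connection programmatically, the same properties map onto the
CreateConnection API. The following is a minimal Boto3 sketch; the JDBC URL, credentials, subnet,
security group, and Availability Zone are placeholders.

import boto3

glue = boto3.client('glue')

# All values below are placeholders; substitute your own resources.
glue.create_connection(
    ConnectionInput={
        'Name': 'my-jdbc-connection',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:mysql://xxx.us-east-1.rds.amazonaws.com:3306/employee',
            'USERNAME': 'glue_user',
            'PASSWORD': 'glue_password',
            'JDBC_ENFORCE_SSL': 'false',  # 'true' corresponds to Require SSL connection
        },
        'PhysicalConnectionRequirements': {
            'SubnetId': 'subnet-0123456789abcdef0',
            'SecurityGroupIdList': ['sg-0123456789abcdef0'],
            'AvailabilityZone': 'us-east-1a',
        },
    }
)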

Cataloging Tables with a Crawler


You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method
used by most AWS Glue users. You add a crawler within your Data Catalog to traverse your data stores.
The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog.
Extract, transform, and load (ETL) jobs that you define in AWS Glue use these metadata tables as sources
and targets.

Your crawler uses an AWS Identity and Access Management (IAM) role for permission to access your
data stores and the Data Catalog. The role you pass to the crawler must have permission to access
Amazon S3 paths and Amazon DynamoDB tables that are crawled. For more information, see Working
with Crawlers on the AWS Glue Console (p. 107). Some data stores require additional authorization to
establish a connection. For more information, see Adding a Connection to Your Data Store (p. 92).

For more information about using the AWS Glue console to add a crawler, see Working with Crawlers on
the AWS Glue Console (p. 107).


Defining a Crawler in the AWS Glue Data Catalog


When you define a crawler, you choose one or more classifiers that evaluate the format of your data to
infer a schema. When the crawler runs, the first classifier in your list to successfully recognize your data
store is used to create a schema for your table. You can use built-in classifiers or define your own. AWS
Glue provides built-in classifiers to infer schemas from common files with formats that include JSON,
CSV, and Apache Avro. For the current list of built-in classifiers in AWS Glue, see Built-In Classifiers in
AWS Glue (p. 110).

Which Data Stores Can I Crawl?


A crawler can crawl both file-based and table-based data stores. Crawlers can crawl the following data
stores:

• Amazon Simple Storage Service (Amazon S3)
• Amazon Redshift
• Amazon Relational Database Service (Amazon RDS)
  • Amazon Aurora
  • MariaDB
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
• Amazon DynamoDB
• Publicly accessible databases
  • Aurora
  • MariaDB
  • SQL Server
  • MySQL
  • Oracle
  • PostgreSQL

When you define an Amazon S3 data store to crawl, you can choose whether to crawl a path in your
account or another account. The output of the crawler is one or more metadata tables defined in the
AWS Glue Data Catalog. A table is created for one or more files found in your data store. If all the
Amazon S3 files in a folder have the same schema, the crawler creates one table. Also, if the Amazon S3
object is partitioned, only one metadata table is created.

If the data store that is being crawled is a relational database, the output is also a set of metadata
tables defined in the AWS Glue Data Catalog. When you crawl a relational database, you must provide
authorization credentials for a connection to read objects in the database engine. Depending on the type
of database engine, you can choose which objects are crawled, such as databases, schemas, and tables.

If the data store that's being crawled is one or more Amazon DynamoDB tables, the output is one or
more metadata tables in the AWS Glue Data Catalog. When defining a crawler using the AWS Glue
console, you specify a DynamoDB table. If you're using the AWS Glue API, you specify a list of tables.

Using Include and Exclude Patterns


When evaluating what to include or exclude in a crawl, a crawler starts by evaluating the required include
path for Amazon S3 and relational data stores. For every data store that you want to crawl, you must
specify a single include path.


For Amazon S3 data stores, the syntax is bucket-name/folder-name/file-name.ext. To crawl all
objects in a bucket, you specify just the bucket name in the include path.

For JDBC data stores, the syntax is either database-name/schema-name/table-name or
database-name/table-name. The syntax depends on whether the database engine supports schemas within a
database. For example, for database engines such as MySQL or Oracle, don't specify a schema-name in
your include path. You can substitute the percent sign (%) for a schema or table in the include path to
represent all schemas or all tables in a database. You cannot substitute the percent sign (%) for database
in the include path.

A crawler connects to a JDBC data store using an AWS Glue connection that contains a JDBC URI
connection string. The crawler only has access to objects in the database engine using the JDBC user
name and password in the AWS Glue connection. The crawler can only create tables that it can access
through the JDBC connection. After the crawler accesses the database engine with the JDBC URI, the
include path is used to determine which tables in the database engine are created in the Data Catalog.
For example, with MySQL, if you specify an include path of MyDatabase/%, then all tables within
MyDatabase are created in the Data Catalog. When accessing Amazon Redshift, if you specify an include
path of MyDatabase/%, then all tables within all schemas for database MyDatabase are created in the
Data Catalog. If you specify an include path of MyDatabase/MySchema/%, then all tables in database
MyDatabase and schema MySchema are created.

After you specify an include path, you can then exclude objects from the crawl that your include
path would otherwise include by specifying one or more Unix-style glob exclude patterns. These
patterns are applied to your include path to determine which objects are excluded. These patterns
are also stored as a property of tables created by the crawler. AWS Glue PySpark extensions, such as
create_dynamic_frame.from_catalog, read the table properties and exclude objects defined by the
exclude pattern.
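
For example, the following is a minimal Boto3 sketch that supplies an include path and an exclude
pattern for an Amazon S3 crawl target. The crawler name, role, database, and bucket are placeholders.

import boto3

glue = boto3.client('glue')

# The role, database, and bucket names are placeholders.
glue.create_crawler(
    Name='sales-data-crawler',
    Role='AWSGlueServiceRole-Example',
    DatabaseName='sales_db',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://mybucket/myfolder/',  # include path
                'Exclusions': ['**.csv'],           # Unix-style glob exclude patterns
            }
        ]
    },
)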

AWS Glue supports the following kinds of glob patterns in the exclude pattern.

Exclude pattern     Description

*.csv               Matches an Amazon S3 path that represents an object name in the
                    current folder ending in .csv

*.*                 Matches all object names that contain a dot

*.{csv,avro}        Matches object names ending with .csv or .avro

foo.?               Matches object names starting with foo. that are followed by a
                    single character extension

myfolder/*          Matches objects in one level of subfolder from myfolder, such as
                    /myfolder/mysource

myfolder/*/*        Matches objects in two levels of subfolders from myfolder, such
                    as /myfolder/mysource/data

myfolder/**         Matches objects in all subfolders of myfolder, such as
                    /myfolder/mysource/mydata and /myfolder/mysource/data

myfolder**          Matches subfolder myfolder as well as files below myfolder, such
                    as /myfolder and /myfolder/mydata.txt

Market*             Matches tables in a JDBC database with names that begin with
                    Market, such as Market_us and Market_fr


AWS Glue interprets glob exclude patterns as follows:

• The slash (/) character is the delimiter to separate Amazon S3 keys into a folder hierarchy.
• The asterisk (*) character matches zero or more characters of a name component without crossing
folder boundaries.
• A double asterisk (**) matches zero or more characters crossing folder or schema boundaries.
• The question mark (?) character matches exactly one character of a name component.
• The backslash (\) character is used to escape characters that otherwise can be interpreted as special
characters. The expression \\ matches a single backslash, and \{ matches a left brace.
• Brackets [ ] create a bracket expression that matches a single character of a name component out
of a set of characters. For example, [abc] matches a, b, or c. The hyphen (-) can be used to specify
a range, so [a-z] specifies a range that matches from a through z (inclusive). These forms can be
mixed, so [abce-g] matches a, b, c, e, f, or g. If the character after the bracket ([) is an exclamation
point (!), the bracket expression is negated. For example, [!a-c] matches any character except a, b,
or c.

Within a bracket expression, the *, ?, and \ characters match themselves. The hyphen (-) character
matches itself if it is the first character within the brackets, or if it's the first character after the ! when
you are negating.
• Braces ({ }) enclose a group of subpatterns, where the group matches if any subpattern in the group
matches. A comma (,) character is used to separate the subpatterns. Groups cannot be nested.
• Leading period or dot characters in file names are treated as normal characters in match operations.
For example, the * exclude pattern matches the file name .hidden.

Example of Amazon S3 Exclude Patterns

Each exclude pattern is evaluated against the include path. For example, suppose that you have the
following Amazon S3 directory structure:

/mybucket/myfolder/
    departments/
        finance.json
        market-us.json
        market-emea.json
        market-ap.json
    employees/
        hr.json
        john.csv
        jane.csv
        juan.txt

Given the include path s3://mybucket/myfolder/, the following are some sample results for exclude
patterns:

Exclude pattern       Results

departments/**        Excludes all files and folders below departments and includes
                      the employees folder and its files

departments/market*   Excludes market-us.json, market-emea.json, and market-ap.json

**.csv                Excludes all objects below myfolder that have a name ending
                      with .csv

employees/*.csv       Excludes all .csv files in the employees folder

Example of Excluding a Subset of Amazon S3 Partitions

Suppose that your data is partitioned by day, so that each day in a year is in a separate Amazon S3
partition. For January 2015, there are 31 partitions. Now, to crawl data for only the first week of January,
you must exclude all partitions except days 1 through 7:

2015/01/{[!0],0[8-9]}**, 2015/0[2-9]/**, 2015/1[0-2]/**

Take a look at the parts of this glob pattern. The first part, 2015/01/{[!0],0[8-9]}**, excludes all
days that don't begin with a "0" in addition to day 08 and day 09 from month 01 in year 2015. Notice
that "**" is used as the suffix to the day number pattern and crosses folder boundaries to lower-level
folders. If "*" is used, lower folder levels are not excluded.

The second part, 2015/0[2-9]/**, excludes days in months 02 to 09, in year 2015.

The third part, 2015/1[0-2]/**, excludes days in months 10, 11, and 12, in year 2015.

Example of JDBC Exclude Patterns

Suppose that you are crawling a JDBC database with the following schema structure:

MyDatabase/MySchema/
    HR_us
    HR_fr
    Employees_Table
    Finance
    Market_US_Table
    Market_EMEA_Table
    Market_AP_Table

Given the include path MyDatabase/MySchema/%, the following are some sample results for exclude
patterns:

Exclude pattern     Results

HR*                 Excludes the tables with names that begin with HR

Market_*            Excludes the tables with names that begin with Market_

**_Table            Excludes all tables with names that end with _Table
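
A Boto3 sketch that applies these JDBC exclude patterns when defining a crawler might look like the
following. The connection, role, and database names are placeholders.

import boto3

glue = boto3.client('glue')

# The connection, role, and database names are placeholders.
glue.create_crawler(
    Name='jdbc-schema-crawler',
    Role='AWSGlueServiceRole-Example',
    DatabaseName='catalog_db',
    Targets={
        'JdbcTargets': [
            {
                'ConnectionName': 'my-jdbc-connection',
                'Path': 'MyDatabase/MySchema/%',     # include path
                'Exclusions': ['HR*', '**_Table'],   # exclude patterns
            }
        ]
    },
)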

What Happens When a Crawler Runs?


When a crawler runs, it takes the following actions to interrogate a data store:

• Classifies data to determine the format, schema, and associated properties of the raw data – You
can configure the results of classification by creating a custom classifier.


• Groups data into tables or partitions – Data is grouped based on crawler heuristics.
• Writes metadata to the Data Catalog – You can configure how the crawler adds, updates, and deletes
tables and partitions.

The metadata tables that a crawler creates are contained in the database that you specify when you
define the crawler. If your crawler does not specify a database, your tables are placed in the default
database. In addition, each table has a classification column that is filled in by the classifier that first
successfully recognized the data store.

The crawler can process both relational database and file data stores.

If the file that is crawled is compressed, the crawler must download it to process it. When a crawler runs,
it interrogates files to determine their format and compression type and writes these properties into
the Data Catalog. Some file formats (for example, Apache Parquet) enable you to compress parts of the
file as it is written. For these files, the compressed data is an internal component of the file and AWS
Glue does not populate the compressionType property when it writes tables into the Data Catalog.
In contrast, if an entire file is compressed by a compression algorithm (for example, gzip), then the
compressionType property is populated when tables are written into the Data Catalog.

The crawler generates the names for the tables it creates. The names of the tables that are stored in the
AWS Glue Data Catalog follow these rules:

• Only alphanumeric characters and underscore (_) are allowed.


• Any custom prefix cannot be longer than 64 characters.
• The maximum length of the name cannot be longer than 128 characters. The crawler truncates
generated names to fit within the limit.
• If duplicate table names are encountered, the crawler adds a hash string suffix to the name.

If your crawler runs more than once, perhaps on a schedule, it looks for new or changed files or tables in
your data store. The output of the crawler includes new tables found since a previous run.

Are Amazon S3 Folders Created as Tables or Partitions?
When an AWS Glue crawler scans Amazon S3 and detects multiple folders in a bucket, it determines the
root of a table in the folder structure and which folders are partitions of a table. The name of the table
is based on the Amazon S3 prefix or folder name. You provide an Include path that points to the folder
level to crawl. When the majority of schemas at a folder level are similar, the crawler creates partitions
of a table instead of two separate tables. To influence the crawler to create separate tables, add each
table's root folder as a separate data store when you define the crawler.

For example, with the following Amazon S3 structure:

s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt

If the schemas for table1 and table2 are similar, and a single data store is defined in the crawler
with Include path s3://bucket01/folder1/, the crawler creates a single table with two partition
columns. One partition column contains table1 and table2, and a second partition column contains
partition1 through partition5. To create two separate tables, define the crawler with two data
stores. In this example, define the first Include path as s3://bucket01/folder1/table1/ and the
second as s3://bucket01/folder1/table2.
Note
In Amazon Athena, each table corresponds to an Amazon S3 prefix with all the objects in it. If
objects have different schemas, Athena does not recognize different objects within the same
prefix as separate tables. This can happen if a crawler creates multiple tables from the same
Amazon S3 prefix. This might lead to queries in Athena that return zero results. For Athena
to properly recognize and query tables, create the crawler with a separate Include path for
each different table schema in the Amazon S3 folder structure. For more information, see Best
Practices When Using Athena with AWS Glue.
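
To make the two-table outcome from the example above concrete, the following Boto3 sketch defines a
single crawler with a separate include path for each table root. The role and database names are
placeholders.

import boto3

glue = boto3.client('glue')

# The role and database names are placeholders.
glue.create_crawler(
    Name='two-table-crawler',
    Role='AWSGlueServiceRole-Example',
    DatabaseName='catalog_db',
    Targets={
        'S3Targets': [
            {'Path': 's3://bucket01/folder1/table1/'},
            {'Path': 's3://bucket01/folder1/table2/'},
        ]
    },
)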

Configuring a Crawler
When a crawler runs, it might encounter changes to your data store that result in a schema or partition
that is different from a previous crawl. You can use the AWS Management Console or the AWS Glue API
to configure how your crawler processes certain types of changes.

Topics
• Configuring a Crawler on the AWS Glue Console (p. 103)
• Configuring a Crawler Using the API (p. 104)
• How to Prevent the Crawler from Changing an Existing Schema (p. 105)
• How to Create a Single Schema For Each Amazon S3 Include Path (p. 106)

Configuring a Crawler on the AWS Glue Console


When you define a crawler using the AWS Glue console, you have several options for configuring the
behavior of your crawler. For more information about using the AWS Glue console to add a crawler, see
Working with Crawlers on the AWS Glue Console (p. 107).

When a crawler runs against a previously crawled data store, it might discover that a schema has
changed or that some objects in the data store have been deleted. The crawler logs changes to a schema.
New tables and partitions are always created regardless of the schema change policy.

To specify what the crawler does when it finds changes in the schema, you can choose one of the
following actions on the console:

• Update the table definition in the Data Catalog – Add new columns, remove missing columns, and
modify the definitions of existing columns in the AWS Glue Data Catalog. Remove any metadata that is
not set by the crawler. This is the default setting.
• Add new columns only – For tables that map to an Amazon S3 data store, add new columns as they
are discovered, but don't remove or change the type of existing columns in the Data Catalog. Choose
this option when the current columns in the Data Catalog are correct and you don't want the crawler
to remove or change the type of the existing columns. If a fundamental Amazon S3 table attribute
changes, such as classification, compression type, or CSV delimiter, mark the table as deprecated.
Maintain input format and output format as they exist in the Data Catalog. Update SerDe parameters
only if the parameter is one that is set by the crawler. For all other data stores, modify existing column
definitions.
• Ignore the change and don't update the table in the Data Catalog – Only new tables and partitions
are created.

A crawler might also discover new or changed partitions. By default, new partitions are added and
existing partitions are updated if they have changed. In addition, you can set a crawler configuration
option to Update all new and existing partitions with metadata from the table on the AWS Glue
console. When this option is set, partitions inherit metadata properties—such as their classification,
input format, output format, SerDe information, and schema—from their parent table. Any changes to
these properties in a table are propagated to its partitions. When this configuration option is set on an
existing crawler, existing partitions are updated to match the properties of their parent table the next
time the crawler runs.

To specify what the crawler does when it finds a deleted object in the data store, choose one of the
following actions:

• Delete tables and partitions from the Data Catalog


• Ignore the change and don't update the table in the Data Catalog
• Mark the table as deprecated in the Data Catalog – This is the default setting.

Configuring a Crawler Using the API


When you define a crawler using the AWS Glue API, you can choose from several fields to configure
your crawler. The SchemaChangePolicy in the crawler API determines what the crawler does when it
discovers a changed schema or a deleted object. The crawler logs schema changes as it runs.

When a crawler runs, new tables and partitions are always created regardless of the schema
change policy. You can choose one of the following actions in the UpdateBehavior field in the
SchemaChangePolicy structure to determine what the crawler does when it finds a changed table
schema:

• UPDATE_IN_DATABASE – Update the table in the AWS Glue Data Catalog. Add new columns, remove
missing columns, and modify the definitions of existing columns. Remove any metadata that is not set
by the crawler.
• LOG – Ignore the changes, and don't update the table in the Data Catalog.

You can also override the SchemaChangePolicy structure using a JSON object supplied in the crawler
API Configuration field. This JSON object can contain a key-value pair to set the policy to not update
existing columns and only add new columns. For example, provide the following JSON object as a string:

{
    "Version": 1.0,
    "CrawlerOutput": {
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
}

This option corresponds to the Add new columns only option on the AWS Glue console. It overrides
the SchemaChangePolicy structure for tables that result from crawling Amazon S3 data stores only.
Choose this option if you want to maintain the metadata as it exists in the Data Catalog (the source
of truth). New columns are added as they are encountered, including nested data types. But existing
columns are not removed, and their type is not changed. If an Amazon S3 table attribute changes
significantly, mark the table as deprecated, and log a warning that an incompatible attribute needs to be
resolved.

When a crawler runs against a previously crawled data store, it might discover new or changed partitions.
By default, new partitions are added and existing partitions are updated if they have changed. In
addition, you can set a crawler configuration option to InheritFromTable (corresponding to the
Update all new and existing partitions with metadata from the table option on the AWS Glue console).
When this option is set, partitions inherit metadata properties from their parent table, such as their
classification, input format, output format, SerDe information, and schema. Any property changes to the
parent table are propagated to its partitions.


When this configuration option is set on an existing crawler, existing partitions are updated to match
the properties of their parent table the next time the crawler runs. This behavior is set in the crawler API
Configuration field. For example, provide the following JSON object as a string:

{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
    }
}

The crawler API Configuration field can set multiple configuration options. For example, to configure
the crawler output for both partitions and tables, you can provide a string representation of the
following JSON object:

{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
}

You can choose one of the following actions to determine what the crawler does when it finds a deleted
object in the data store. The DeleteBehavior field in the SchemaChangePolicy structure in the
crawler API sets the behavior of the crawler when it discovers a deleted object.

• DELETE_FROM_DATABASE – Delete tables and partitions from the Data Catalog.


• LOG – Ignore the change and don't update the table in the Data Catalog.
• DEPRECATE_IN_DATABASE – Mark the table as deprecated in the Data Catalog. This is the default
setting.
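
Putting these fields together, the following Boto3 sketch sets the schema change policy and passes the
combined Configuration JSON as a string when updating an existing crawler. The crawler name is a
placeholder.

import boto3
import json

glue = boto3.client('glue')

configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
}

# The crawler name is a placeholder; Configuration must be a JSON string.
glue.update_crawler(
    Name='sales-data-crawler',
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE',
    },
    Configuration=json.dumps(configuration),
)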

How to Prevent the Crawler from Changing an Existing Schema


If you don't want a crawler to overwrite updates you made to existing fields in an Amazon S3 table
definition, choose the option on the console to Add new columns only or set the configuration option
MergeNewColumns. This applies to tables and partitions, unless Partitions.AddOrUpdateBehavior
is overridden to InheritFromTable.

If you don't want a table schema to change at all when a crawler runs, set the schema change policy to
LOG. You can also set a configuration option that sets partition schemas to inherit from the table.

If you are configuring the crawler on the console, you can choose the following actions:

• Ignore the change and don't update the table in the Data Catalog
• Update all new and existing partitions with metadata from the table

When you configure the crawler using the API, set the following parameters:

• Set the UpdateBehavior field in SchemaChangePolicy structure to LOG.


• Set the Configuration field with a string representation of the following JSON object in the crawler
API; for example:

{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
    }
}

How to Create a Single Schema For Each Amazon S3 Include Path
By default, when a crawler defines tables for data stored in Amazon S3, it considers both data
compatibility and schema similarity. Data compatibility factors taken into account include whether
the data is of the same format (for example, JSON), the same compression type (for example, GZIP),
the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how
closely the schemas of separate Amazon S3 objects are similar.

You can configure a crawler to CombineCompatibleSchemas into a common table definition when
possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the
specific schemas when evaluating Amazon S3 objects in the specified include path.

If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a
single schema for each S3 path.

When you configure the crawler using the API, set the following configuration option:

• Set the Configuration field with a string representation of the following JSON object in the crawler
API; for example:

{
    "Version": 1.0,
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}

To help illustrate this option, suppose you define a crawler with an include path s3://bucket/
table1/. When the crawler runs, it finds two JSON files with the following characteristics:

• File 1 – S3://bucket/table1/year=2017/data1.json
  • File content – {"A": 1, "B": 2}
  • Schema – A:int, B:int

• File 2 – S3://bucket/table1/year=2018/data2.json
  • File content – {"C": 3, "D": 4}
  • Schema – C:int, D:int

By default, the crawler creates two tables, named year_2017 and year_2018 because the schemas
are not sufficiently similar. However, if the option Create a single schema for each S3 path is
selected, and if the data is compatible, the crawler creates one table. The table has the schema
A:int,B:int,C:int,D:int and partitionKey year:string.
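
When you create the crawler through the API instead of the console, the grouping policy travels in the
same Configuration string. The following is a minimal Boto3 sketch; the crawler, role, and database
names are placeholders.

import boto3
import json

glue = boto3.client('glue')

# The crawler, role, and database names are placeholders.
glue.create_crawler(
    Name='single-schema-crawler',
    Role='AWSGlueServiceRole-Example',
    DatabaseName='catalog_db',
    Targets={'S3Targets': [{'Path': 's3://bucket/table1/'}]},
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)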

Scheduling an AWS Glue Crawler


You can run an AWS Glue crawler on demand or on a regular schedule. Crawler schedules can be
expressed in cron format. For more information, see cron in Wikipedia.


When you create a crawler based on a schedule, you can specify certain constraints, such as the
frequency at which the crawler runs, which days of the week it runs, and at what time. These constraints
are based on cron. When setting up a crawler schedule, you should consider the features and limitations
of cron. For example, if you choose to run your crawler on day 31 each month, keep in mind that some
months don't have 31 days.

For more information about using cron to schedule jobs and crawlers, see Time-Based Schedules for Jobs
and Crawlers (p. 189).
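
For example, the following Boto3 sketch attaches a cron-based schedule that runs an existing crawler
every day at 12:15 UTC. The crawler name is a placeholder.

import boto3

glue = boto3.client('glue')

# The crawler name is a placeholder; the schedule is a cron expression in UTC.
glue.update_crawler(
    Name='sales-data-crawler',
    Schedule='cron(15 12 * * ? *)',  # run every day at 12:15 UTC
)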

Working with Crawlers on the AWS Glue Console


A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue
Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list
displays status and metrics from the last run of your crawler.

To add a crawler using the console

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/. Choose Crawlers in the navigation pane.
2. Choose Add crawler, and follow the instructions in the Add crawler wizard.
Note
To get step-by-step guidance for adding a crawler, choose Add crawler under Tutorials
in the navigation pane. You can also use the Add crawler wizard to create and modify an
IAM role that attaches a policy that includes permissions for your Amazon Simple Storage
Service (Amazon S3) data stores.

Optionally, you can add a security configuration to a crawler to specify at-rest encryption
options. For more information, see Encrypting Data Written by Crawlers, Jobs, and Development
Endpoints (p. 82).

When a crawler runs, the provided IAM role must have permission to access the data store that is
crawled. For an Amazon S3 data store, you can use the AWS Glue console to create a policy or add a
policy similar to the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucket/object*"
            ]
        }
    ]
}

If the crawler reads KMS encrypted Amazon S3 data, then the IAM role must have decrypt permission on
the KMS key. For more information, see Step 2: Create an IAM Role for AWS Glue (p. 14).

For an Amazon DynamoDB data store, you can use the AWS Glue console to create a policy or add a
policy similar to the following:


"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:DescribeTable",
"dynamodb:Scan"
],
"Resource": [
"arn:aws:dynamodb:region:account-id:table/table-name*"
]
}
]
}

For Amazon S3 data stores, an exclude pattern is relative to the include path. For more information
about glob patterns, see Which Data Stores Can I Crawl? (p. 98).

When you crawl a JDBC data store, a connection is required. For more information, see Working with
Connections on the AWS Glue Console (p. 94). An exclude path is relative to the include path. For
example, to exclude a table in your JDBC data store, type the table name in the exclude path.

When you crawl DynamoDB tables, you can choose one table name from the list of DynamoDB tables in
your account.

Viewing Crawler Results


To view the results of a crawler, find the crawler name in the list and choose the Logs link. This link takes
you to CloudWatch Logs, where you can see details about which tables were created in the AWS Glue
Data Catalog and any errors that were encountered. You can manage your log retention period in the
CloudWatch console. The default log retention is Never Expire. For more information about how to
change the retention period, see Change Log Data Retention in CloudWatch Logs.

To see details of a crawler, choose the crawler name in the list. Crawler details include the information
you defined when you created the crawler with the Add crawler wizard. When a crawler run completes,
choose Tables in the navigation pane to see the tables that were created by your crawler in the database
that you specified.
Note
The crawler assumes the permissions of the IAM role that you specify when you define it. This
IAM role must have permissions to extract data from your data store and write to the Data
Catalog. The AWS Glue console lists only IAM roles that have attached a trust policy for the AWS
Glue principal service. From the console, you can also create an IAM role with an IAM policy to
access Amazon S3 data stores accessed by the crawler. For more information about providing
roles for AWS Glue, see Identity-Based Policies (p. 44).

The following are some important properties and metrics about the last run of a crawler:

Name

When you create a crawler, you must give it a unique name.


Schedule

You can choose to run your crawler on demand or choose a frequency with a schedule. For more
information about scheduling a crawler, see Scheduling a Crawler (p. 106).
Status

A crawler can be ready, starting, stopping, scheduled, or schedule paused. A running crawler
progresses from starting to stopping. You can resume or pause a schedule attached to a crawler.


Logs

Links to any available logs from the last run of the crawler.
Last runtime

The amount of time it took the crawler to run when it last ran.
Median runtime

The median amount of time it took the crawler to run since it was created.
Tables updated

The number of tables in the AWS Glue Data Catalog that were updated by the latest run of the
crawler.
Tables added

The number of tables that were added into the AWS Glue Data Catalog by the latest run of the
crawler.
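
Several of these run metrics are also available through the GetCrawlerMetrics API. The following Boto3
sketch prints the fields that correspond most closely to the console columns; the crawler name is a
placeholder.

import boto3

glue = boto3.client('glue')

# The crawler name is a placeholder.
response = glue.get_crawler_metrics(CrawlerNameList=['sales-data-crawler'])

for metrics in response['CrawlerMetricsList']:
    print('Crawler:', metrics['CrawlerName'])
    print('Last runtime (seconds):', metrics['LastRuntimeSeconds'])
    print('Median runtime (seconds):', metrics['MedianRuntimeSeconds'])
    print('Tables created:', metrics['TablesCreated'])
    print('Tables updated:', metrics['TablesUpdated'])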

Adding Classifiers to a Crawler


A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema.
The classifier also returns a certainty number to indicate how certain the format recognition was.

AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue
invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on
the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a
classifier returns certainty=1.0 during processing, it indicates that it's 100 percent certain that it can
create the correct schema. AWS Glue then uses the output of that classifier.

If no classifier returns certainty=1.0, AWS Glue uses the output of the classifier that has the highest
certainty. If no classifier returns a certainty greater than 0.0, AWS Glue returns the default classification
string of UNKNOWN.

When Do I Use a Classifier?


You use classifiers when you crawl a data store to define metadata tables in the AWS Glue Data Catalog.
You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier,
the classifier determines whether the data is recognized. If the classifier can't recognize the data or is
not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can
recognize the data.

For more information about creating a classifier using the AWS Glue console, see Working with Classifiers
on the AWS Glue Console (p. 122).

Custom Classifiers
The output of a classifier includes a string that indicates the file's classification or format (for example,
json) and the schema of the file. For custom classifiers, you define the logic for creating the schema
based on the type of classifier. Classifier types include defining schemas based on grok patterns, XML
tags, and JSON paths.

If you change a classifier definition, any data that was previously crawled using the classifier is not
reclassified. A crawler keeps track of previously crawled data. New data is classified with the updated
classifier, which might result in an updated schema. If the schema of your data has evolved, update the
classifier to account for any schema changes when your crawler runs. To reclassify data to correct an
incorrect classifier, create a new crawler with the updated classifier.

For more information about creating custom classifiers in AWS Glue, see Writing Custom
Classifiers (p. 112).
Note
If your data format is recognized by one of the built-in classifiers, you don't need to create a
custom classifier.

Built-In Classifiers in AWS Glue


AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many
database systems.

If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent certainty, it
invokes the built-in classifiers in the order shown in the following table. The built-in classifiers return a
result to indicate whether the format matches (certainty=1.0) or does not match (certainty=0.0).
The first classifier that has certainty=1.0 provides the classification string and schema for a metadata
table in your Data Catalog.

Classifier type        Classification string   Notes

Apache Avro            avro                    Reads the schema at the beginning of the file to
                                               determine format.

Apache ORC             orc                     Reads the file metadata to determine format.

Apache Parquet         parquet                 Reads the schema at the end of the file to
                                               determine format.

JSON                   json                    Reads the beginning of the file to determine format.

Binary JSON            bson                    Reads the beginning of the file to determine format.

XML                    xml                     Reads the beginning of the file to determine format.
                                               AWS Glue determines the table schema based on XML
                                               tags in the document. For information about creating
                                               a custom XML classifier to specify rows in the
                                               document, see Writing XML Custom
                                               Classifiers (p. 116).

Amazon Ion             ion                     Reads the beginning of the file to determine format.

Combined Apache log    combined_apache         Determines log formats through a grok pattern.

Apache log             apache                  Determines log formats through a grok pattern.

Linux kernel log       linux_kernel            Determines log formats through a grok pattern.

Microsoft log          microsoft_log           Determines log formats through a grok pattern.

Ruby log               ruby_logger             Reads the beginning of the file to determine format.

Squid 3.x log          squid                   Reads the beginning of the file to determine format.

Redis monitor log      redismonlog             Reads the beginning of the file to determine format.

Redis log              redislog                Reads the beginning of the file to determine format.

CSV                    csv                     Checks for the following delimiters: comma (,),
                                               pipe (|), tab (\t), semicolon (;), and Ctrl-A
                                               (\u0001). Ctrl-A is the Unicode control character
                                               for Start Of Heading.

Amazon Redshift        redshift                Uses JDBC connection to import metadata.

MySQL                  mysql                   Uses JDBC connection to import metadata.

PostgreSQL             postgresql              Uses JDBC connection to import metadata.

Oracle database        oracle                  Uses JDBC connection to import metadata.

Microsoft SQL Server   sqlserver               Uses JDBC connection to import metadata.

Amazon DynamoDB        dynamodb                Reads data from the DynamoDB table.

Files in the following compressed formats can be classified:

• ZIP (supported for archives containing only a single file). Note that Zip is not well-supported in other
services (because of the archive).
• BZIP
• GZIP
• LZ4
• Snappy (as standard Snappy format, not as Hadoop native Snappy format)

Built-In CSV Classifier


The built-in CSV classifier parses CSV file contents to determine the schema for an AWS Glue table. This
classifier checks for the following delimiters:

• Comma (,)
• Pipe (|)
• Tab (\t)
• Semicolon (;)
• Ctrl-A (\u0001)

Ctrl-A is the Unicode control character for Start Of Heading.

To be classified as CSV, the table schema must have at least two columns and two rows of data. The
CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the
classifier can't determine a header from the first row of data, column headers are displayed as col1,
col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the
following characteristics of the file:

• Every column in a potential header parses as a STRING data type.


• Except for the last column, every column in a potential header has content that is fewer than 150
characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
• Every column in a potential header must meet the AWS Glue regex requirements for a column name.
• The header row must be sufficiently different from the data rows. To determine this, one or more of
the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of
data is not sufficiently different from subsequent rows to be used as the header.


Note
If the built-in CSV classifier does not create your AWS Glue table as you want, you might be able
to use one of the following alternatives:

• Change the column names in the Data Catalog, set the SchemaChangePolicy to LOG, and
set the partition output configuration to InheritFromTable for future crawler runs.
• Create a custom grok classifier to parse the data and assign the columns that you want.
• The built-in CSV classifier creates tables referencing the LazySimpleSerDe as the
serialization library, which is a good choice for type inference. However, if the CSV
data contains quoted strings, edit the table definition and change the SerDe library to
OpenCSVSerDe. Adjust any inferred types to STRING, set the SchemaChangePolicy to LOG,
and set the partitions output configuration to InheritFromTable for future crawler runs.
For more information about SerDe libraries, see SerDe Reference in the Amazon Athena User
Guide.

Writing Custom Classifiers


You can provide a custom classifier to classify your data using a grok pattern or an XML tag in AWS Glue.
A crawler calls a custom classifier. If the classifier recognizes the data, it returns the classification and
schema of the data to the crawler. You might need to define a custom classifier if your data doesn't
match any built-in classifiers, or if you want to customize the tables that are created by the crawler.

For more information about creating a classifier using the AWS Glue console, see Working with Classifiers
on the AWS Glue Console (p. 122).

AWS Glue runs custom classifiers before built-in classifiers, in the order you specify. When a crawler finds
a classifier that matches the data, the classification string and schema are used in the definition of tables
that are written to your AWS Glue Data Catalog.

Topics
• Writing Grok Custom Classifiers (p. 112)
• Writing XML Custom Classifiers (p. 116)
• Writing JSON Custom Classifiers (p. 117)

Writing Grok Custom Classifiers


Grok is a tool that is used to parse textual data given a matching pattern. A grok pattern is a named
set of regular expressions (regex) that are used to match data one line at a time. AWS Glue uses grok
patterns to infer the schema of your data. When a grok pattern matches your data, AWS Glue uses the
pattern to determine the structure of your data and map it into fields.

AWS Glue provides many built-in patterns, or you can define your own. You can create a grok pattern
using built-in patterns and custom patterns in your custom classifier definition. You can tailor a grok
pattern to classify custom text file formats.
Note
AWS Glue grok custom classifiers use the GrokSerDe serialization library for tables created
in the AWS Glue Data Catalog. If you are using the AWS Glue Data Catalog with Amazon
Athena, Amazon EMR, or Redshift Spectrum, check the documentation about those services
for information about support of the GrokSerDe. Currently, you might encounter problems
querying tables created with the GrokSerDe from Amazon EMR and Redshift Spectrum.

The following is the basic syntax for the components of a grok pattern:

%{PATTERN:field-name}


Data that matches the named PATTERN is mapped to the field-name column in the schema, with
a default data type of string. Optionally, the data type for the field can be cast to byte, boolean,
double, short, int, long, or float in the resulting schema.

%{PATTERN:field-name:data-type}

For example, to cast a num field to an int data type, you can use this pattern:

%{NUMBER:num:int}

Patterns can be composed of other patterns. For example, you can have a pattern for a SYSLOG
time stamp that is defined by patterns for month, day of the month, and time (for example, Feb 1
06:25:43). For this data, you might define the following pattern:

SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}

Note
Grok patterns can process only one line at a time. Multiple-line patterns are not supported. Also,
line breaks within a pattern are not supported.

Custom Classifier Values in AWS Glue


When you define a grok classifier, you supply the following values to AWS Glue to create the custom
classifier.

Name

Name of the classifier.


Classification

The text string that is written to describe the format of the data that is classified; for example,
special-logs.
Grok pattern

The set of patterns that are applied to the data store to determine whether there is a match. These
patterns are from AWS Glue built-in patterns (p. 114) and any custom patterns that you define.

The following is an example of a grok pattern:

%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\]
%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}

When the data matches TIMESTAMP_ISO8601, a schema column timestamp is created. The
behavior is similar for the other named patterns in the example.
Custom patterns

Optional custom patterns that you define. These patterns are referenced by the grok pattern that
classifies your data. You can reference these custom patterns in the grok pattern that is applied to
your data. Each custom component pattern must be on a separate line. Regular expression (regex)
syntax is used to define the pattern.

The following is an example of using custom patterns:

CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)
MESSAGEPREFIX .*-.*-.*-.*-.*


The first custom named pattern, CRAWLERLOGLEVEL, is a match when the data matches one of the
enumerated strings. The second custom pattern, MESSAGEPREFIX, tries to match a message prefix
string.
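
The following Boto3 sketch registers a grok classifier with the example grok pattern and custom patterns
shown above. The classifier name is a placeholder.

import boto3

glue = boto3.client('glue')

# The classifier name is a placeholder; the patterns come from the example above.
glue.create_classifier(
    GrokClassifier={
        'Name': 'special-logs-classifier',
        'Classification': 'special-logs',
        'GrokPattern': (
            '%{TIMESTAMP_ISO8601:timestamp} \\[%{MESSAGEPREFIX:message_prefix}\\] '
            '%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}'
        ),
        'CustomPatterns': (
            'CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)\n'
            'MESSAGEPREFIX .*-.*-.*-.*-.*'
        ),
    }
)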

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

AWS Glue Built-In Patterns


AWS Glue provides many common patterns that you can use to build a custom classifier. You add a
named pattern to the grok pattern in a classifier definition.

The following list consists of a line for each pattern. In each line, the pattern name is followed by its
definition. Regular expression (regex) syntax is used in defining the pattern.

#AWS Glue Built-in patterns


USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME:UNWANTED}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM:UNWANTED})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.
[0-9A-Fa-f]+)))\b
BOOLEAN (?i)(true|false)

POSINT \b(?:[1-9][0-9]*)\b
NONNEGINT \b(?:[0-9]+)\b
WORD \b\w+\b
NOTSPACE \S+
SPACE \s*
DATA .*?
GREEDYDATA .*
#QUOTEDSTRING (?:(?<!\\)(?:"(?:\\.|[^\\"])*"|(?:'(?:\\.|[^\\'])*')|(?:`(?:\\.|[^\\`])*`)))
QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\
\`]+)+`)|``))
UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}

# Networking
MAC (?:%{CISCOMAC:UNWANTED}|%{WINDOWSMAC:UNWANTED}|%{COMMONMAC:UNWANTED})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-
f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|
(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.
(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4})
{1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|
[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4})
{0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|
(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|
2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:)
{1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?
\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-
Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d))
{3}))|:)))(%.+)?
IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?
[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]
{1,2}))(?![0-9])
IP (?:%{IPV6:UNWANTED}|%{IPV4:UNWANTED})
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-_]
{0,62}))*(\.?|\b)
HOST %{HOSTNAME:UNWANTED}
IPORHOST (?:%{HOSTNAME:UNWANTED}|%{IP:UNWANTED})


HOSTPORT (?:%{IPORHOST}:%{POSINT:PORT})

# paths
PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (?>/(?>[\w_%!$@:.,~-]+|\\.)*)+
#UNIXPATH (?<![\w\/])(?:/[^\/\s?*]*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+
#URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]]*
URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?

# Months: January, Feb, 3, 03, 12, December


MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|
Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHNUM2 (?:0[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])

# Days: Monday, Tue, Thu, etc...


DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|
Sun(?:day)?)

# Years?
YEAR (?>\d\d){1,2}
# Time: HH:MM:SS
#TIME \d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?
# TIME %{POSINT<24}:%{POSINT<60}(?::%{POSINT<60}(?:\.%{POSINT})?)?
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
# '60' is a leap second in most time standards and thus is valid.
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
# datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
DATESTAMP_US %{DATE_US}[- ]%{TIME}
DATESTAMP_EU %{DATE_EU}[- ]%{TIME}
ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
ISO8601_SECOND (?:%{SECOND}|60)
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?
%{ISO8601_TIMEZONE}?
TZ (?:[PMCE][SD]T|UTC)
DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}
CISCOTIMESTAMP %{MONTH} %{MONTHDAY} %{TIME}

# Syslog Dates: Month Day HH:MM:SS


SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
PROG (?:[\w._/%-]+)
SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
SYSLOGHOST %{IPORHOST}
SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}

# Shortcuts
QS %{QUOTEDSTRING:UNWANTED}

# Log formats
SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource}
%{SYSLOGPROG}:

MESSAGESLOG %{SYSLOGBASE} %{DATA}

COMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\]


"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})"
%{NUMBER:response} (?:%{Bytes:bytes=%{NUMBER}|-})
COMBINEDAPACHELOG %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}
COMMONAPACHELOG_DATATYPED %{IPORHOST:clientip} %{USER:ident;boolean} %{USER:auth}
\[%{HTTPDATE:timestamp;date;dd/MMM/yyyy:HH:mm:ss Z}\] "(?:%{WORD:verb;string}
%{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion;float})?|%{DATA:rawrequest})"
%{NUMBER:response;int} (?:%{NUMBER:bytes;long}|-)

# Log Levels
LOGLEVEL ([A|a]lert|ALERT|[T|t]race|TRACE|[D|d]ebug|DEBUG|[N|n]otice|NOTICE|[I|i]nfo|
INFO|[W|w]arn?(?:ing)?|WARN?(?:ING)?|[E|e]rr?(?:or)?|ERR?(?:OR)?|[C|c]rit?(?:ical)?|CRIT?
(?:ICAL)?|[F|f]atal|FATAL|[S|s]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)

Writing XML Custom Classifiers


XML (Extensible Markup Language) defines the structure of a document with the use of tags in the file.
With an XML custom classifier, you can specify the tag name used to define a row.

Custom Classifier Values in AWS Glue


When you define an XML classifier, you supply the following values to AWS Glue to create the classifier.
The classification field of this classifier is set to xml.

Name

Name of the classifier.


Row tag

The XML tag name that defines a table row in the XML document, without angle brackets < >. The
name must comply with XML rules for a tag.
Note
The element containing the row data cannot be a self-closing empty element. For example,
this empty element is not parsed by AWS Glue:

<row att1="xx" att2="yy" />

Empty elements can be written as follows:

<row att1="xx" att2="yy"> </row>

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

For example, suppose that you have the following XML file. To create an AWS Glue table that only
contains columns for author and title, create a classifier in the AWS Glue console with Row tag as
AnyCompany. Then add and run a crawler that uses this custom classifier.

<?xml version="1.0"?>
<catalog>
<book id="bk101">
<AnyCompany>
<author>Rivera, Martha</author>
<title>AnyCompany Developer Guide</title>
</AnyCompany>
</book>
<book id="bk102">
<AnyCompany>
<author>Stiles, John</author>
<title>Style Guide for AnyCompany</title>
</AnyCompany>
</book>
</catalog>
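
If you prefer to create the same classifier programmatically, the following is a minimal sketch that uses the AWS SDK for Python (Boto3). The classifier name is hypothetical; the row tag is the AnyCompany tag from the example above.

import boto3

glue = boto3.client("glue")

# Create an XML custom classifier whose rows are delimited by the AnyCompany element.
glue.create_classifier(
    XMLClassifier={
        "Name": "anycompany-row-classifier",  # hypothetical classifier name
        "Classification": "xml",
        "RowTag": "AnyCompany",               # row tag, without angle brackets
    }
)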

Writing JSON Custom Classifiers


JSON (JavaScript Object Notation) is a data-interchange format. It defines data structures with name-
value pairs or an ordered list of values. With a JSON custom classifier, you can specify the JSON path to a
data structure that is used to define the schema for your table.

Custom Classifier Values in AWS Glue


When you define a JSON classifier, you supply the following values to AWS Glue to create the classifier.
The classification field of this classifier is set to json.

Name

Name of the classifier.


JSON path

A JSON path that points to an object that is used to define a table schema. The JSON path can be
written in dot notation or bracket notation. The following operators are supported:

Operator      Description

$             Root element of a JSON object. This starts all path expressions.
*             Wildcard character. Available anywhere a name or numeric value is required in the JSON path.
.<name>       Dot-notated child. Specifies a child field in a JSON object.
['<name>']    Bracket-notated child. Specifies a child field in a JSON object. Only a single child field can be specified.
[<number>]    Array index. Specifies the value of an array by index.

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

Example of Using a JSON Classifier to Pull Records from an Array

Suppose that your JSON data is an array of records. For example, the first few lines of your file might
look like the following:

[
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ak",
"name": "Alaska"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:1",
"name": "Alabama's 1st congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:2",
"name": "Alabama's 2nd congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:3",
"name": "Alabama's 3rd congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:4",
"name": "Alabama's 4th congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:5",
"name": "Alabama's 5th congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:6",
"name": "Alabama's 6th congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:al\/cd:7",
"name": "Alabama's 7th congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ar\/cd:1",
"name": "Arkansas's 1st congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ar\/cd:2",
"name": "Arkansas's 2nd congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ar\/cd:3",
"name": "Arkansas's 3rd congressional district"
},
{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ar\/cd:4",
"name": "Arkansas's 4th congressional district"
}
]

When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema.
Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
For example, the schema might look like the following:

root
|-- record: array

However, to create a schema that is based on each record in the JSON array, create a custom JSON
classifier and specify the JSON path as $[*]. When you specify this JSON path, the classifier interrogates
all 12 records in the array to determine the schema. The resulting schema contains separate fields for
each object, similar to the following example:

root
|-- type: string
|-- id: string
|-- name: string
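
You can create this classifier in the AWS Glue console by entering $[*] as the JSON path, or programmatically. The following is a minimal sketch that uses the AWS SDK for Python (Boto3); the classifier name is hypothetical.

import boto3

glue = boto3.client("glue")

# Create a JSON custom classifier that treats each element of the
# top-level array as a separate record.
glue.create_classifier(
    JsonClassifier={
        "Name": "records-array-classifier",  # hypothetical classifier name
        "JsonPath": "$[*]",
    }
)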

Example of Using a JSON Classifier to Examine Only Parts of a File


Suppose that your JSON data follows the pattern of the example JSON file s3://awsglue-datasets/
examples/us-legislators/all/areas.json drawn from http://everypolitician.org/. Example
objects in the JSON file look like the following:

{
"type": "constituency",
"id": "ocd-division\/country:us\/state:ak",
"name": "Alaska"
}
{
"type": "constituency",
"identifiers": [
{
"scheme": "dmoz",
"identifier": "Regional\/North_America\/United_States\/Alaska\/"
},
{
"scheme": "freebase",
"identifier": "\/m\/0hjy"
},
{
"scheme": "fips",
"identifier": "US02"
},
{
"scheme": "quora",
"identifier": "Alaska-state"
},
{
"scheme": "britannica",
"identifier": "place\/Alaska"
},
{
"scheme": "wikidata",
"identifier": "Q797"
}
],
"other_names": [
{
"lang": "en",
"note": "multilingual",
"name": "Alaska"
},
{

"lang": "fr",
"note": "multilingual",
"name": "Alaska"
},
{
"lang": "nov",
"note": "multilingual",
"name": "Alaska"
}
],
"id": "ocd-division\/country:us\/state:ak",
"name": "Alaska"
}

When you run a crawler using the built-in JSON classifier, the entire file is used to create the schema. You
might end up with a schema like this:

root
|-- type: string
|-- id: string
|-- name: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- lang: string
| | |-- note: string
| | |-- name: string

However, to create a schema using just the "id" object, create a custom JSON classifier and specify the
JSON path as $.id. Then the schema is based on only the "id" field:

root
|-- record: string

The first few lines of data extracted with this schema look like this:

{"record": "ocd-division/country:us/state:ak"}
{"record": "ocd-division/country:us/state:al/cd:1"}
{"record": "ocd-division/country:us/state:al/cd:2"}
{"record": "ocd-division/country:us/state:al/cd:3"}
{"record": "ocd-division/country:us/state:al/cd:4"}
{"record": "ocd-division/country:us/state:al/cd:5"}
{"record": "ocd-division/country:us/state:al/cd:6"}
{"record": "ocd-division/country:us/state:al/cd:7"}
{"record": "ocd-division/country:us/state:ar/cd:1"}
{"record": "ocd-division/country:us/state:ar/cd:2"}
{"record": "ocd-division/country:us/state:ar/cd:3"}
{"record": "ocd-division/country:us/state:ar/cd:4"}
{"record": "ocd-division/country:us/state:as"}
{"record": "ocd-division/country:us/state:az/cd:1"}
{"record": "ocd-division/country:us/state:az/cd:2"}
{"record": "ocd-division/country:us/state:az/cd:3"}
{"record": "ocd-division/country:us/state:az/cd:4"}
{"record": "ocd-division/country:us/state:az/cd:5"}
{"record": "ocd-division/country:us/state:az/cd:6"}

{"record": "ocd-division/country:us/state:az/cd:7"}

To create a schema based on a deeply nested object, such as "identifier," in the JSON file, you can
create a custom JSON classifier and specify the JSON path as $.identifiers[*].identifier.
Although the schema is similar to the previous example, it is based on a different object in the JSON file.

The schema looks like the following:

root
|-- record: string

Listing the first few lines of data from the table shows that the schema is based on the data in the
"identifier" object:

{"record": "Regional/North_America/United_States/Alaska/"}
{"record": "/m/0hjy"}
{"record": "US02"}
{"record": "5879092"}
{"record": "4001016-8"}
{"record": "destination/alaska"}
{"record": "1116270"}
{"record": "139487266"}
{"record": "n79018447"}
{"record": "01490999-8dec-4129-8254-eef6e80fadc3"}
{"record": "Alaska-state"}
{"record": "place/Alaska"}
{"record": "Q797"}
{"record": "Regional/North_America/United_States/Alabama/"}
{"record": "/m/0gyh"}
{"record": "US01"}
{"record": "4829764"}
{"record": "4084839-5"}
{"record": "161950"}
{"record": "131885589"}

To create a table based on another deeply nested object, such as the "name" field in the "other_names"
array in the JSON file, you can create a custom JSON classifier and specify the JSON path as
$.other_names[*].name. Although the schema is similar to the previous example, it is based on a
different object in the JSON file. The schema looks like the following:

root
|-- record: string

Listing the first few lines of data in the table shows that it is based on the data in the "name" object in
the "other_names" array:

{"record": "Alaska"}
{"record": "Alaska"}
{"record": "######"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}

{"record": "######"}
{"record": "######"}
{"record": "######"}
{"record": "Alaska"}
{"record": "Alyaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "#### ######"}
{"record": "######"}
{"record": "Alaska"}
{"record": "#######"}

Working with Classifiers on the AWS Glue Console


A classifier determines the schema of your data. You can write a custom classifier and point to it from
AWS Glue. To see a list of all the classifiers that you have created, open the AWS Glue console at https://
console.aws.amazon.com/glue/, and choose the Classifiers tab.

The list displays the following properties about each classifier:

Classifier

The classifier name. When you create a classifier, you must provide a name for it.
Classification

The classification type of tables inferred by this classifier.


Last updated

The last time this classifier was updated.

From the Classifiers list in the AWS Glue console, you can add, edit, and delete classifiers. To see more
details for a classifier, choose the classifier name in the list. Details include the information you defined
when you created the classifier.

To add a classifier in the AWS Glue console, choose Add classifier. When you define a classifier, you
supply values for the following:

Classifier name

Provide a unique name for your classifier.


Classification

For grok classifiers, describe the format or type of data that is classified or provide a custom label.
Grok pattern

For grok classifiers, this is used to parse your data into a structured schema. The grok pattern is
composed of named patterns that describe the format of your data store. You write this grok pattern
using the named built-in patterns provided by AWS Glue and custom patterns you write and include
in the Custom patterns field. Although grok debugger results might not match the results from AWS
Glue exactly, we suggest that you try your pattern using some sample data with a grok debugger.
You can find grok debuggers on the web. The named built-in patterns provided by AWS Glue are
generally compatible with grok patterns that are available on the web.

Build your grok pattern by iteratively adding named patterns and checking your results in a debugger. This gives you confidence that when the AWS Glue crawler runs your grok pattern, your data can be parsed. A worked example that combines built-in and custom patterns follows this list.

Custom patterns

For grok classifiers, these are optional building blocks for the Grok pattern that you write. When
built-in patterns cannot parse your data, you might need to write a custom pattern. These custom
patterns are defined in this field and referenced in the Grok pattern field. Each custom pattern is
defined on a separate line. Just like the built-in patterns, it consists of a named pattern definition
that uses regular expression (regex) syntax.

For example, the following has the name MESSAGEPREFIX followed by a regular expression
definition to apply to your data to determine whether it follows the pattern.

MESSAGEPREFIX .*-.*-.*-.*-.*

Row tag

For XML classifiers, this is the name of the XML tag that defines a table row in the XML document.
Type the name without angle brackets < >. The name must comply with XML rules for a tag.
JSON path

For JSON classifiers, this is the JSON path to the object, array, or value that defines a row of
the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue
supported operators. For more information, see the list of operators in Writing JSON Custom
Classifiers (p. 117).
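
The following worked example shows how the Grok pattern and Custom patterns fields fit together. It is a minimal sketch for a hypothetical log line; the pattern name REQUESTID is not a built-in pattern.

Sample data:

INFO 2019-01-01T12:00:00Z request-4711 Processing completed

Custom patterns:

REQUESTID request-[0-9]+

Grok pattern:

%{LOGLEVEL:level} %{TIMESTAMP_ISO8601:timestamp} %{REQUESTID:request_id} %{GREEDYDATA:message}

If this pattern matches your sample data in a grok debugger, a crawler that uses the classifier creates a table with the columns level, timestamp, request_id, and message.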

For more information, see Writing Custom Classifiers (p. 112).

Working with Data Catalog Settings on the AWS Glue Console

The Data Catalog settings page contains options to set properties for the Data Catalog in your account.

To change the fine-grained access control of the Data Catalog:

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose Settings, and then in the Permissions editor, add the policy statement to change fine-
grained access control of the Data Catalog for your account. Only one policy at a time can be
attached to a Data Catalog.
3. Choose Save to update your Data Catalog with any changes you made.

You can also use AWS Glue API operations to put, get, and delete resource policies. For more information,
see Security APIs in AWS Glue (p. 376).
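
For example, the following is a minimal sketch that puts, gets, and deletes a resource policy with the AWS SDK for Python (Boto3). It assumes that the policy document is saved locally as glue-resource-policy.json (a hypothetical file name).

import boto3

glue = boto3.client("glue")

# Attach the resource policy to the Data Catalog in the current account and Region.
with open("glue-resource-policy.json") as policy_file:
    glue.put_resource_policy(PolicyInJson=policy_file.read())

# Retrieve the policy that is currently attached.
print(glue.get_resource_policy()["PolicyInJson"])

# Remove the policy when it is no longer needed.
# glue.delete_resource_policy()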

The Settings page displays the following options:

Metadata encryption

Select this check box to encrypt the metadata in your Data Catalog. Metadata is encrypted at rest
using the AWS Key Management Service (AWS KMS) key that you specify. For more information, see
Encrypting Your Data Catalog (p. 81).

Encrypt connection passwords

Select this check box to encrypt passwords in the AWS Glue connection object when the connection
is created or updated. Passwords are encrypted using the AWS KMS key that you specify. When
passwords are returned, they are encrypted. This option is a global setting for all AWS Glue
connections in the Data Catalog. If you clear this check box, previously encrypted passwords remain
encrypted using the key that was used when they were created or updated. For more information
about AWS Glue connections, see Adding a Connection to Your Data Store (p. 92).

When you enable this option, choose an AWS KMS key, or choose Enter a key ARN
and provide the Amazon Resource Name (ARN) for the key. Enter the ARN in the form
arn:aws:kms:region:account-id:key/key-id. You can also provide the ARN as a key alias,
such as arn:aws:kms:region:account-id:alias/alias-name.
Important
If this option is selected, any user or role that creates or updates a connection must have
kms:Encrypt permission on the specified KMS key.

For more information, see Encrypting Connection Passwords (p. 82).


Permissions

Add a resource policy to define fine-grained access control of the Data Catalog. You can paste a
JSON resource policy into this control. For more information, see Resource Policies (p. 48).

Populating the Data Catalog Using AWS CloudFormation Templates

AWS CloudFormation is a service that can create many AWS resources. AWS Glue provides API operations
to create objects in the AWS Glue Data Catalog. However, it might be more convenient to define and
create AWS Glue objects and other related AWS resource objects in an AWS CloudFormation template
file. Then you can automate the process of creating the objects.

AWS CloudFormation provides a simplified syntax, either JSON (JavaScript Object Notation) or
YAML (YAML Ain't Markup Language), to express the creation of AWS resources. You can use AWS
CloudFormation templates to define Data Catalog objects such as databases, tables, partitions,
crawlers, classifiers, and connections. You can also define ETL objects such as jobs, triggers, and
development endpoints. You create a template that describes all the AWS resources you want, and AWS
CloudFormation takes care of provisioning and configuring those resources for you.

For more information, see What Is AWS CloudFormation? and Working with AWS CloudFormation
Templates in the AWS CloudFormation User Guide.

If you plan to use AWS CloudFormation templates that are compatible with AWS Glue, as an
administrator, you must grant access to AWS CloudFormation and to the AWS services and actions on
which it depends. To grant permissions to create AWS CloudFormation resources, attach the following
policy to the IAM users that work with AWS CloudFormation:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudformation:*"
],
"Resource": "*"
}

]
}

The following table contains the actions that an AWS CloudFormation template can perform on your
behalf. It includes links to information about the AWS resource types and their property types that you
can add to an AWS CloudFormation template.

AWS Glue Resource      AWS CloudFormation Template   AWS Glue Samples

Classifier             AWS::Glue::Classifier         Grok classifier (p. 129), JSON classifier (p. 130), XML classifier (p. 130)
Connection             AWS::Glue::Connection         MySQL connection (p. 132)
Crawler                AWS::Glue::Crawler            Amazon S3 crawler (p. 131), MySQL crawler (p. 133)
Database               AWS::Glue::Database           Empty database (p. 125), Database with tables (p. 126)
Development endpoint   AWS::Glue::DevEndpoint        Development endpoint (p. 140)
Job                    AWS::Glue::Job                Amazon S3 job (p. 135), JDBC job (p. 136)
Partition              AWS::Glue::Partition          Partitions of a table (p. 126)
Table                  AWS::Glue::Table              Table in a database (p. 126)
Trigger                AWS::Glue::Trigger            On-demand trigger (p. 137), Scheduled trigger (p. 138), Conditional trigger (p. 139)

To get started, use the following sample templates and customize them with your own metadata. Then
use the AWS CloudFormation console to create an AWS CloudFormation stack to add objects to AWS
Glue and any associated services. Many fields in an AWS Glue object are optional. These templates
illustrate the fields that are required or are necessary for a working and functional AWS Glue object.

An AWS CloudFormation template can be in either JSON or YAML format. In these examples, YAML is
used for easier readability. The examples contain comments (#) to describe the values that are defined in
the templates.

AWS CloudFormation templates can include a Parameters section. This section can be changed in
the sample text or when the YAML file is submitted to the AWS CloudFormation console to create a
stack. The Resources section of the template contains the definition of AWS Glue and related objects.
AWS CloudFormation template syntax definitions might contain properties that include more detailed
property syntax. Not all properties might be required to create an AWS Glue object. These samples show
example values for common properties to create an AWS Glue object.
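
As an alternative to the AWS CloudFormation console, you can create a stack from a template programmatically. The following is a minimal sketch that uses the AWS SDK for Python (Boto3); the stack name and local file name are hypothetical.

import boto3

cloudformation = boto3.client("cloudformation")

# Create a stack from a sample template saved locally as glue-sample.yaml.
with open("glue-sample.yaml") as template_file:
    cloudformation.create_stack(
        StackName="cfn-glue-sample-stack",   # hypothetical stack name
        TemplateBody=template_file.read(),
        # Acknowledgment required when a template creates IAM resources,
        # such as the crawler samples that create an IAM role.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )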

Sample AWS CloudFormation Template for an AWS Glue Database

An AWS Glue database in the Data Catalog contains metadata tables. The database consists of very few
properties and can be created in the Data Catalog with an AWS CloudFormation template. The following
sample template is provided to get you started and to illustrate the use of AWS CloudFormation
stacks with AWS Glue. The only resource created by the sample template is a database named cfn-
mysampledatabase. You can change it by editing the text of the sample or changing the value on the
AWS CloudFormation console when you submit the YAML.

The following shows example values for common properties to create an AWS Glue database. For more
information about the AWS CloudFormation database template for AWS Glue, see AWS::Glue::Database.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database named
mysampledatabase
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
CFNDatabaseName:
Type: String
Default: cfn-mysampledatabase

# Resources section defines metadata for the Data Catalog


Resources:
# Create an AWS Glue database
CFNDatabaseFlights:
Type: AWS::Glue::Database
Properties:
# The database is created in the Data Catalog for your account
CatalogId: !Ref AWS::AccountId
DatabaseInput:
# The name of the database is defined in the Parameters section above
Name: !Ref CFNDatabaseName
Description: Database to hold tables for flights data
LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
#Parameters: Leave AWS database parameters blank

Sample AWS CloudFormation Template for an AWS Glue Database, Table, and Partition

An AWS Glue table contains the metadata that defines the structure and location of data that you want
to process with your ETL scripts. Within a table, you can define partitions to parallelize the processing of
your data. A partition is a chunk of data that you define with a key. For example, using month as a key,
all the data for January is contained in the same partition. In AWS Glue, databases can contain tables,
and tables can contain partitions.

The following sample shows how to populate a database, a table, and partitions using an AWS
CloudFormation template. The base data format is csv and delimited by a comma (,). Because a
database must exist before it can contain a table, and a table must exist before partitions can be created,
the template uses the DependsOn statement to define the dependency of these objects when they are
created.

The values in this sample define a table that contains flight data from a publicly available Amazon
S3 bucket. For illustration, only a few columns of the data and one partitioning key are defined. Four
partitions are also defined in the Data Catalog. Some fields to describe the storage of the base data are
also shown in the StorageDescriptor fields.

---

AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database, a table, and
partitions
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters substituted in the Resources section
# These parameters are names of the resources created in the Data Catalog
Parameters:
CFNDatabaseName:
Type: String
Default: cfn-database-flights-1
CFNTableName1:
Type: String
Default: cfn-manual-table-flights-1
# Resources to create metadata in the Data Catalog
Resources:
###
# Create an AWS Glue database
CFNDatabaseFlights:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: !Ref CFNDatabaseName
Description: Database to hold tables for flights data
###
# Create an AWS Glue table
CFNTableFlights:
# Creating the table waits for the database to be created
DependsOn: CFNDatabaseFlights
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableInput:
Name: !Ref CFNTableName1
Description: Define the first few columns of the flights table
TableType: EXTERNAL_TABLE
Parameters: {
"classification": "csv"
}
# ViewExpandedText: String
PartitionKeys:
# Data is partitioned by month
- Name: mon
Type: bigint
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: year
Type: bigint
- Name: quarter
Type: bigint
- Name: month
Type: bigint
- Name: day_of_month
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 1
# Create an AWS Glue partition
CFNPartitionMon1:

DependsOn: CFNTableFlights
Type: AWS::Glue::Partition
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableName: !Ref CFNTableName1
PartitionInput:
Values:
- 1
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: mon
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 2
# Create an AWS Glue partition
CFNPartitionMon2:
DependsOn: CFNTableFlights
Type: AWS::Glue::Partition
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableName: !Ref CFNTableName1
PartitionInput:
Values:
- 2
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: mon
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=2/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 3
# Create an AWS Glue partition
CFNPartitionMon3:
DependsOn: CFNTableFlights
Type: AWS::Glue::Partition
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableName: !Ref CFNTableName1
PartitionInput:
Values:
- 3
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: mon
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=3/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

# Partition 4
# Create an AWS Glue partition
CFNPartitionMon4:
DependsOn: CFNTableFlights
Type: AWS::Glue::Partition
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableName: !Ref CFNTableName1
PartitionInput:
Values:
- 4
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: mon
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=4/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Sample AWS CloudFormation Template for an AWS Glue Grok Classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a grok
pattern to match your data. If the pattern matches, then the custom classifier is used to create your
table's schema and set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with one column named message and sets the
classification to greedy.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the classifier to be created


CFNClassifierName:
Type: String
Default: cfn-classifier-grok-one-column-1

#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses grok pattern to put all data in one column and classifies it
as "greedy".
CFNClassifierFlights:
Type: AWS::Glue::Classifier
Properties:
GrokClassifier:
#Grok classifier that puts all data in one column
Name: !Ref CFNClassifierName
Classification: greedy
GrokPattern: "%{GREEDYDATA:message}"

#CustomPatterns: none

Sample AWS CloudFormation Template for an AWS Glue JSON Classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a
JsonPath string defining the JSON data for the classifier to classify. AWS Glue supports a subset of the
operators for JsonPath, as described in Writing JSON Custom Classifiers (p. 117).

If the pattern matches, then the custom classifier is used to create your table's schema.

This sample creates a classifier that creates a schema based on each record in the Records3 array of a
JSON object.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a JSON classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the classifier to be created


CFNClassifierName:
Type: String
Default: cfn-classifier-json-one-column-1

#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses a JSON pattern.
CFNClassifierFlights:
Type: AWS::Glue::Classifier
Properties:
JSONClassifier:
#JSON classifier
Name: !Ref CFNClassifierName
JsonPath: $.Records3[*]

Sample AWS CloudFormation Template for an AWS Glue XML Classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier specifies an
XML tag to designate the element that contains each record in an XML document that is being parsed.
If the pattern matches, then the custom classifier is used to create your table's schema and set the
classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with each record in the Records tag and sets the
classification to XML.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an XML classifier
#

# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the classifier to be created


CFNClassifierName:
Type: String
Default: cfn-classifier-xml-one-column-1

#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses the XML pattern and classifies it as "XML".
CFNClassifierFlights:
Type: AWS::Glue::Classifier
Properties:
XMLClassifier:
#XML classifier
Name: !Ref CFNClassifierName
Classification: XML
RowTag: <Records>

Sample AWS CloudFormation Template for an AWS Glue Crawler for Amazon S3

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can
then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog.
When this crawler is run, it assumes the IAM role and creates a table in the database for the public flights
data. The table is created with the prefix "cfn_sample_1_". The IAM role created by this template
allows global permissions; you might want to create a custom role. No custom classifiers are defined for
this crawler, so AWS Glue built-in classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that you want to
create the IAM role.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the crawler to be created


CFNCrawlerName:
Type: String
Default: cfn-crawler-flights-1
CFNDatabaseName:
Type: String
Default: cfn-database-flights-1
CFNTablePrefixName:
Type: String
Default: cfn_sample_1_
#
#
# Resources section defines metadata for the Data Catalog
Resources:

#Create IAM Role assumed by the crawler. For demonstration, this role is given all
permissions.
CFNRoleFlights:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: "Allow"
Principal:
Service:
- "glue.amazonaws.com"
Action:
- "sts:AssumeRole"
Path: "/"
Policies:
-
PolicyName: "root"
PolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: "Allow"
Action: "*"
Resource: "*"
# Create a database to contain tables created by the crawler
CFNDatabaseFlights:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: !Ref CFNDatabaseName
Description: "AWS Glue container to hold metadata tables for the flights crawler"
#Create a crawler to crawl the flights data on a public S3 bucket
CFNCrawlerFlights:
Type: AWS::Glue::Crawler
Properties:
Name: !Ref CFNCrawlerName
Role: !GetAtt CFNRoleFlights.Arn
#Classifiers: none, use the default classifier
Description: AWS Glue crawler to crawl flights data
#Schedule: none, use default run-on-demand
DatabaseName: !Ref CFNDatabaseName
Targets:
S3Targets:
# Public S3 bucket with the flights data
- Path: "s3://crawler-public-us-east-1/flight/2016/csv"
TablePrefix: !Ref CFNTablePrefixName
SchemaChangePolicy:
UpdateBehavior: "UPDATE_IN_DATABASE"
DeleteBehavior: "LOG"
Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":
{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":
\"MergeNewColumns\"}}}"

Sample AWS CloudFormation Template for an AWS Glue Connection

An AWS Glue connection in the Data Catalog contains the JDBC and network information that is required
to connect to a JDBC database. This information is used when you connect to a JDBC database to crawl
or run ETL jobs.

This sample creates a connection to an Amazon RDS MySQL database named devdb. When this
connection is used, an IAM role, database credentials, and network connection values must also be
supplied. See the details of necessary fields in the template.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a connection
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the connection to be created


CFNConnectionName:
Type: String
Default: cfn-connection-mysql-flights-1
CFNJDBCString:
Type: String
Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
CFNJDBCUser:
Type: String
Default: "master"
CFNJDBCPassword:
Type: String
Default: "12345678"
NoEcho: true
#
#
# Resources section defines metadata for the Data Catalog
Resources:
CFNConnectionMySQL:
Type: AWS::Glue::Connection
Properties:
CatalogId: !Ref AWS::AccountId
ConnectionInput:
Description: "Connect to MySQL database."
ConnectionType: "JDBC"
#MatchCriteria: none
PhysicalConnectionRequirements:
AvailabilityZone: "us-east-1d"
SecurityGroupIdList:
- "sg-7d52b812"
SubnetId: "subnet-84f326ee"
ConnectionProperties: {
"JDBC_CONNECTION_URL": !Ref CFNJDBCString,
"USERNAME": !Ref CFNJDBCUser,
"PASSWORD": !Ref CFNJDBCPassword
}
Name: !Ref CFNConnectionName

Sample AWS CloudFormation Template for an AWS Glue Crawler for JDBC

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can
then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, required IAM role, and an AWS Glue database in the Data Catalog. When
this crawler is run, it assumes the IAM role and creates a table in the database for the public flights data
that has been stored in a MySQL database. The table is created with the prefix "cfn_jdbc_1_". The
IAM role created by this template allows global permissions; you might want to create a custom role. No
custom classifiers can be defined for JDBC data. AWS Glue built-in classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that you want to
create the IAM role.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the crawler to be created


CFNCrawlerName:
Type: String
Default: cfn-crawler-jdbc-flights-1
# The name of the database to be created to contain tables
CFNDatabaseName:
Type: String
Default: cfn-database-jdbc-flights-1
# The prefix for all tables crawled and created
CFNTablePrefixName:
Type: String
Default: cfn_jdbc_1_
# The name of the existing connection to the MySQL database
CFNConnectionName:
Type: String
Default: cfn-connection-mysql-flights-1
# The name of the JDBC path (database/schema/table) with wildcard (%) to crawl
CFNJDBCPath:
Type: String
Default: saldev/%
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all
permissions.
CFNRoleFlights:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: "Allow"
Principal:
Service:
- "glue.amazonaws.com"
Action:
- "sts:AssumeRole"
Path: "/"
Policies:
-
PolicyName: "root"
PolicyDocument:
Version: "2012-10-17"
Statement:
-
Effect: "Allow"
Action: "*"
Resource: "*"

# Create a database to contain tables created by the crawler


CFNDatabaseFlights:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: !Ref CFNDatabaseName
Description: "AWS Glue container to hold metadata tables for the flights crawler"
#Create a crawler to crawl the flights data in MySQL database
CFNCrawlerFlights:
Type: AWS::Glue::Crawler
Properties:
Name: !Ref CFNCrawlerName
Role: !GetAtt CFNRoleFlights.Arn
#Classifiers: none, use the default classifier
Description: AWS Glue crawler to crawl flights data
#Schedule: none, use default run-on-demand
DatabaseName: !Ref CFNDatabaseName
Targets:
JdbcTargets:
# JDBC MySQL database with the flights data
- ConnectionName: !Ref CFNConnectionName
Path: !Ref CFNJDBCPath
#Exclusions: none
TablePrefix: !Ref CFNTablePrefixName
SchemaChangePolicy:
UpdateBehavior: "UPDATE_IN_DATABASE"
DeleteBehavior: "LOG"
Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":
{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":
\"MergeNewColumns\"}}}"

Sample AWS CloudFormation Template for an AWS Glue Job for Amazon S3 to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in
AWS Glue.

This sample creates a job that reads flight data from an Amazon S3 bucket in csv format and writes it to
an Amazon S3 Parquet file. The script that is run by this job must already exist. You can generate an ETL
script for your environment with the AWS Glue console. When this job is run, an IAM role with the correct
permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs)
defaults to 5.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a
public bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the job to be created


CFNJobName:
Type: String
Default: cfn-job-S3-to-S3-2
# The name of the IAM role that the job assumes. It must have access to data, script,
temporary directory

CFNIAMRoleName:
Type: String
Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
CFNScriptLocation:
Type: String
Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create job to run script which accesses flightscsv table and write to S3 file as parquet.
# The script already exists and is called by this job
CFNJobFlights:
Type: AWS::Glue::Job
Properties:
Role: !Ref CFNIAMRoleName
#DefaultArguments: JSON object
# If script written in Scala, then set DefaultArguments={'--job-language'; 'scala',
'--class': 'your scala class'}
#Connections: No connection needed for S3 to S3 job
# ConnectionsList
#MaxRetries: Double
Description: Job created with CloudFormation
#LogUri: String
Command:
Name: glueetl
ScriptLocation: !Ref CFNScriptLocation
# for access to directories use proper IAM role with permission to buckets and
folders that begin with "aws-glue-"
# script uses temp directory from job definition if required (temp directory
not used S3 to S3)
# script defines target for output as s3://aws-glue-target/sal
AllocatedCapacity: 5
ExecutionProperty:
MaxConcurrentRuns: 1
Name: !Ref CFNJobName

Sample AWS CloudFormation Template for an AWS Glue Job for JDBC to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in
AWS Glue.

This sample creates a job that reads flight data from a MySQL JDBC database as defined by the
connection named cfn-connection-mysql-flights-1 and writes it to an Amazon S3 Parquet file.
The script that is run by this job must already exist. You can generate an ETL script for your environment
with the AWS Glue console. When this job is run, an IAM role with the correct permissions must also be
supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs)
defaults to 5.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data
to an S3 file
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog

Parameters:

# The name of the job to be created


CFNJobName:
Type: String
Default: cfn-job-JDBC-to-S3-1
# The name of the IAM role that the job assumes. It must have access to data, script,
temporary directory
CFNIAMRoleName:
Type: String
Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
CFNScriptLocation:
Type: String
Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a
# The name of the connection used for JDBC data source
CFNConnectionName:
Type: String
Default: cfn-connection-mysql-flights-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create job to run script which accesses JDBC flights table via a connection and write to
S3 file as parquet.
# The script already exists and is called by this job
CFNJobFlights:
Type: AWS::Glue::Job
Properties:
Role: !Ref CFNIAMRoleName
#DefaultArguments: JSON object
# For example, if required by script, set temporary directory as
DefaultArguments={'--TempDir'; 's3://aws-glue-temporary-xyc/sal'}
Connections:
Connections:
- !Ref CFNConnectionName
#MaxRetries: Double
Description: Job created with CloudFormation using existing script
#LogUri: String
Command:
Name: glueetl
ScriptLocation: !Ref CFNScriptLocation
# for access to directories use proper IAM role with permission to buckets and
folders that begin with "aws-glue-"
# if required, script defines temp directory as argument TempDir and used in
script like redshift_tmp_dir = args["TempDir"]
# script defines target for output as s3://aws-glue-target/sal
AllocatedCapacity: 5
ExecutionProperty:
MaxConcurrentRuns: 1
Name: !Ref CFNJobName

Sample AWS CloudFormation Template for an AWS Glue On-Demand Trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job
run when the trigger fires. An on-demand trigger fires when you enable it.

This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.

---

AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an on-demand trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
# The existing job to be started by this trigger
CFNJobName:
Type: String
Default: cfn-job-S3-to-S3-1
# The name of the trigger to be created
CFNTriggerName:
Type: String
Default: cfn-trigger-ondemand-flights-1
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating an on-demand trigger for a job
Resources:
# Create trigger to run an existing job (CFNJobName) on an on-demand schedule.
CFNTriggerSample:
Type: AWS::Glue::Trigger
Properties:
Name:
Ref: CFNTriggerName
Description: Trigger created with CloudFormation
Type: ON_DEMAND
Actions:
- JobName: !Ref CFNJobName
# Arguments: JSON object
#Schedule:
#Predicate:

Sample AWS CloudFormation Template for an AWS Glue Scheduled Trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job
run when the trigger fires. A scheduled trigger fires when it is enabled and the cron timer pops.

This sample creates a scheduled trigger that starts one job named cfn-job-S3-to-S3-1. The timer is a
cron expression to run the job every 10 minutes on weekdays.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a scheduled trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
# The existing job to be started by this trigger
CFNJobName:
Type: String
Default: cfn-job-S3-to-S3-1
# The name of the trigger to be created
CFNTriggerName:
Type: String
Default: cfn-trigger-scheduled-flights-1
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating a scheduled trigger for a job
#
Resources:

# Create trigger to run an existing job (CFNJobName) on a cron schedule.


TriggerSample1CFN:
Type: AWS::Glue::Trigger
Properties:
Name:
Ref: CFNTriggerName
Description: Trigger created with CloudFormation
Type: SCHEDULED
Actions:
- JobName: !Ref CFNJobName
# Arguments: JSON object
# # Run the trigger every 10 minutes on Monday to Friday
Schedule: cron(0/10 * ? * MON-FRI *)
#Predicate:

Sample AWS CloudFormation Template for an AWS Glue Conditional Trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job
run when the trigger fires. A conditional trigger fires when it is enabled and its conditions are met, such
as a job completing successfully.

This sample creates a conditional trigger that starts one job named cfn-job-S3-to-S3-1. This job
starts when the job named cfn-job-S3-to-S3-2 completes successfully.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts
when another job completes
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
# The existing job to be started by this trigger
CFNJobName:
Type: String
Default: cfn-job-S3-to-S3-1
# The existing job that when it finishes causes trigger to fire
CFNJobName2:
Type: String
Default: cfn-job-S3-to-S3-2
# The name of the trigger to be created
CFNTriggerName:
Type: String
Default: cfn-trigger-conditional-1
#
Resources:
# Create trigger to run an existing job (CFNJobName) when another job completes
(CFNJobName2).
CFNTriggerSample:
Type: AWS::Glue::Trigger
Properties:
Name:
Ref: CFNTriggerName
Description: Trigger created with CloudFormation
Type: CONDITIONAL
Actions:
- JobName: !Ref CFNJobName
# Arguments: JSON object
#Schedule: none
Predicate:

#Value for Logical is required if more than 1 job listed in Conditions


Logical: AND
Conditions:
- LogicalOperator: EQUALS
JobName: !Ref CFNJobName2
State: SUCCEEDED

Sample AWS CloudFormation Template for an AWS Glue Development Endpoint

An AWS Glue development endpoint is an environment that you can use to develop and test your AWS
Glue scripts.

This sample creates a development endpoint with the minimal network parameter values required to
successfully create it. For more information about the parameters that you need to set up a development
endpoint, see Setting Up Your Environment for Development Endpoints (p. 33).

You provide an existing IAM role ARN (Amazon Resource Name) to create the development endpoint.
Supply a valid RSA public key and keep the corresponding private key available if you plan to create a
notebook server on the development endpoint.
Note
You manage any notebook server that you create and associate with a development endpoint.
Therefore, if you delete the development endpoint, you must also delete the AWS CloudFormation
stack on the AWS CloudFormation console to delete the notebook server.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a development endpoint
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:

# The name of the crawler to be created


CFNEndpointName:
Type: String
Default: cfn-devendpoint-1
CFNIAMRoleArn:
Type: String
Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
#
#
# Resources section defines metadata for the Data Catalog
Resources:
CFNDevEndpoint:
Type: AWS::Glue::DevEndpoint
Properties:
EndpointName: !Ref CFNEndpointName
#ExtraJarsS3Path: String
#ExtraPythonLibsS3Path: String
NumberOfNodes: 5
PublicKey: ssh-rsa public.....key myuserid-key
RoleArn: !Ref CFNIAMRoleArn
SecurityGroupIds:
- sg-64986c0b
SubnetId: subnet-c67cccac


Authoring Jobs in AWS Glue


A job is the business logic that performs the extract, transform, and load (ETL) work in AWS Glue. When
you start a job, AWS Glue runs a script that extracts data from sources, transforms the data, and loads it
into targets. You can create jobs in the ETL section of the AWS Glue console. For more information, see
Working with Jobs on the AWS Glue Console (p. 146).

The following diagram summarizes the basic workflow and steps involved in authoring a job in AWS
Glue:

Topics
• Workflow Overview (p. 142)
• Adding Jobs in AWS Glue (p. 142)
• Editing Scripts in AWS Glue (p. 151)
• Triggering Jobs in AWS Glue (p. 154)
• Using Development Endpoints for Developing Scripts (p. 156)
• Managing Notebooks (p. 177)

Workflow Overview
When you author a job, you supply details about data sources, targets, and other information. The result
is a generated Apache Spark API (PySpark) script. You can then store your job definition in the AWS Glue
Data Catalog.

The following describes an overall process of authoring jobs in the AWS Glue console:

1. You choose a data source for your job. The tables that represent your data source must already be
defined in your Data Catalog. If the source requires a connection, the connection is also referenced in
your job. If your job requires multiple data sources, you can add them later by editing the script.
2. You choose a data target of your job. The tables that represent the data target can be defined in your
Data Catalog, or your job can create the target tables when it runs. You choose a target location when
you author the job. If the target requires a connection, the connection is also referenced in your job. If
your job requires multiple data targets, you can add them later by editing the script.
3. You customize the job-processing environment by providing arguments for your job and generated
script. For more information, see Adding Jobs in AWS Glue (p. 142).
4. Initially, AWS Glue generates a script, but you can also edit this script to add sources, targets, and
transforms. For more information about transforms, see Built-In Transforms (p. 144).
5. You specify how your job is invoked, either on demand, by a time-based schedule, or by an event. For
more information, see Triggering Jobs in AWS Glue (p. 154).
6. Based on your input, AWS Glue generates a PySpark or Scala script. You can tailor the script based on
your business needs. For more information, see Editing Scripts in AWS Glue (p. 151).

Adding Jobs in AWS Glue


A job consists of the business logic that performs extract, transform, and load (ETL) work in AWS Glue.
You can monitor job runs to understand runtime metrics such as success, duration, and start time. The
output of a job is your transformed data, written to a location that you specify.

Job runs can be initiated by triggers that start a job when they fire. A job contains a script that connects
to your source data, processes your data using the script's logic, and then writes it out to your data
target. Your job can have multiple data sources and multiple data targets. You can use scripts that are
generated by AWS Glue to transform data, or you can provide your own. The AWS Glue code generator
can automatically create an Apache Spark API (PySpark) script given a source schema and target location
or schema. You can use this script as a starting point and edit it to meet your goals.

AWS Glue can write output files in several data formats, including JSON, CSV, ORC (Optimized Row
Columnar), Apache Parquet, and Apache Avro. For some data formats, common compression formats can
be written.
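
If you prefer to define jobs programmatically rather than in the console, the AWS SDKs expose the
same job properties. The following is a minimal sketch using the AWS SDK for Python (Boto3); the job
name, role, bucket paths, and argument values are placeholders that you would replace with your own.

import boto3

glue = boto3.client('glue')

# Create a Spark ETL job definition that points to a script in Amazon S3.
response = glue.create_job(
    Name='example-etl-job',                             # placeholder job name
    Role='AWSGlueServiceRoleDefault',                    # IAM role the job assumes
    Command={
        'Name': 'glueetl',                               # Spark ETL job type
        'ScriptLocation': 's3://example-bucket/scripts/example-etl-job.py'
    },
    DefaultArguments={
        '--TempDir': 's3://example-bucket/temp/',        # temporary directory
        '--job-bookmark-option': 'job-bookmark-enable'   # enable job bookmarks
    },
    MaxRetries=1,
    Timeout=2880                                         # minutes
)
print(response['Name'])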

Defining Job Properties


When you define your job in the AWS Glue console (p. 146), you provide the following information to
control the AWS Glue runtime environment:

IAM role

Specify the IAM role that is used for authorization to resources used to run the job and access data
stores. For more information about permissions for running jobs in AWS Glue, see Overview of
Managing Access Permissions to Your AWS Glue Resources (p. 41).


Generated or custom script

The code in the ETL script defines your job's procedural logic. The script can be coded in Python or
Scala. You can choose whether the script that the job runs is generated by AWS Glue or provided
by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3).
Confirm that there isn't a file with the same name as the script directory in the path. To learn more
about using scripts, see Editing Scripts in AWS Glue (p. 151).
Scala class name

If the script is coded in Scala, a class name must be provided. The default class name for AWS Glue
generated scripts is GlueApp.
Temporary directory

Provide the location of a working directory in Amazon S3 where temporary intermediate results are
written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the
temporary directory in the path. This directory is used when AWS Glue reads and writes to Amazon
Redshift and by certain AWS Glue transforms.
Job bookmark

Specify how AWS Glue processes state information when the job runs. You can have it remember
previously processed data, update state information, or ignore state information.
Job metrics

Enable or disable the creation of CloudWatch metrics when this job runs. To see profiling data, this
option must be enabled. For more information about how to enable and visualize metrics, see Job
Monitoring and Debugging (p. 210).
Server-side encryption

If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using
SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon
S3 temporary directory are encrypted. For more information, see Protecting Data Using Server-Side
Encryption with Amazon S3-Managed Encryption Keys (SSE-S3).
Important
Currently, a security configuration overrides any server-side encryption (SSE-S3) setting
passed as an ETL job parameter. Thus, if both a security configuration and an SSE-S3
parameter are associated with a job, the SSE-S3 parameter is ignored.
Script libraries

If your script requires it, you can specify locations for the following:
• Python library path
• Dependent jars path
• Referenced files path

You can define the comma-separated Amazon S3 paths for these libraries when you define a job.
You can override these paths when you run the job. For more information, see Providing Your Own
Custom Scripts (p. 153).
Concurrent DPUs per job run

A data processing unit (DPU) is a relative measure of processing power that is used by a job. Choose
an integer from 2 to 100. The default is 10. A single DPU provides processing capacity that consists
of 4 vCPUs of compute capacity and 16 GB of memory.
Max concurrency

Sets the maximum number of concurrent runs that are allowed for this job. The default is 1. An
error is returned when this threshold is reached. The maximum value you can specify is controlled
by a service limit. For example, if a previous run of a job is still running when a new instance is
started, you might want to return an error to prevent two instances of the same job from running
concurrently.
Job timeout

Sets the maximum execution time in minutes. The default is 2880 minutes. If the execution time
exceeds this limit, the job run state changes to “TIMEOUT”.
Delay notification threshold

Sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to
send notifications when a RUNNING, STARTING, or STOPPING job run takes more than an expected
number of minutes.
Number of retries

Specify the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it
fails.
Job parameters

A set of key-value pairs that are passed as named parameters to the script invoked by the job. These
are default values that are used when the script is run, but you can override them at run time. The
key name is prefixed with --, for example --myKey, and the value is value-for-myKey. For an
example of reading these parameters in a script, see the sketch at the end of this section.

'--myKey' : 'value-for-myKey'

For more examples, see Python parameters in Passing and Accessing Python Parameters in AWS
Glue (p. 256).
Target path

For Amazon S3 target locations, provide the location of a directory in Amazon S3 where your output
is written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the
target path directory in the path.

For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS
Glue Console (p. 146).
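
The following is a minimal sketch of how a script can read the job parameters described earlier, using
the getResolvedOptions utility from the AWS Glue Python library. The parameter name myKey matches
the example above and is otherwise arbitrary.

import sys
from awsglue.utils import getResolvedOptions

# Resolve the named parameters passed to this job run. JOB_NAME is supplied
# automatically when the job runs; myKey is the example parameter above.
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'myKey'])
print(args['myKey'])   # prints value-for-myKey unless overridden at run time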

Built-In Transforms
AWS Glue provides a set of built-in transforms that you can use to process your data. You can call these
transforms from your ETL script. Your data passes from transform to transform in a data structure
called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame
contains your data, and you reference its schema to process your data. For more information about these
transforms, see AWS Glue PySpark Transforms Reference (p. 297).

AWS Glue provides the following built-in transforms:

ApplyMapping

Maps source columns and data types from a DynamicFrame to target columns and data types in a
returned DynamicFrame. You specify the mapping argument, which is a list of tuples that contain
source column, source type, target column, and target type.
DropFields

Removes a field from a DynamicFrame. The output DynamicFrame contains fewer fields than the
input. You specify which fields to remove using the paths argument. The paths argument points
to a field in the schema tree structure using dot notation. For example, to remove field B, which is a
child of field A in the tree, type A.B for the path.
DropNullFields

Removes null fields from a DynamicFrame. The output DynamicFrame does not contain fields of
the null type in the schema.
Filter

Selects records from a DynamicFrame and returns a filtered DynamicFrame. You specify a function,
such as a Lambda function, which determines whether a record is output (function returns true) or
not (function returns false).
Join

Equijoin of two DynamicFrames. You specify the key fields in the schema of each frame to compare
for equality. The output DynamicFrame contains rows where keys match.
Map

Applies a function to the records of a DynamicFrame and returns a transformed DynamicFrame.


The supplied function is applied to each input record and transforms it to an output record. The map
transform can add fields, delete fields, and perform lookups using an external API operation. If there
is an exception, processing continues, and the record is marked as an error.
MapToCollection

Applies a transform to each DynamicFrame in a DynamicFrameCollection.


Relationalize

Converts a DynamicFrame to a relational (rows and columns) form. Based on the data's schema,
this transform flattens nested structures and creates DynamicFrames from array structures. The
output is a collection of DynamicFrames that can result in data written to multiple tables.
RenameField

Renames a field in a DynamicFrame. The output is a DynamicFrame with the specified field
renamed. You provide the new name and the path in the schema to the field to be renamed.
ResolveChoice

Use ResolveChoice to specify how a column should be handled when it contains values of
multiple types. You can choose to either cast the column to a single data type, discard one or more
of the types, or retain all types in either separate columns or a structure. You can select a different
resolution policy for each column or specify a global policy that is applied to all columns.
SelectFields

Selects fields from a DynamicFrame to keep. The output is a DynamicFrame with only the selected
fields. You provide the paths in the schema to the fields to keep.
SelectFromCollection

Selects one DynamicFrame from a collection of DynamicFrames. The output is the selected
DynamicFrame. You provide an index to the DynamicFrame to select.
Spigot

Writes sample data from a DynamicFrame. Output is a JSON file in Amazon S3. You specify the
Amazon S3 location and how to sample the DynamicFrame. Sampling can be a specified number of
records from the beginning of the file or a probability factor used to pick records to write.
SplitFields

Splits fields into two DynamicFrames. Output is a collection of DynamicFrames: one with selected
fields, and one with the remaining fields. You provide the paths in the schema to the selected fields.


SplitRows

Splits rows in a DynamicFrame based on a predicate. The output is a collection of two
DynamicFrames: one with selected rows, and one with the remaining rows. You provide the
comparison based on fields in the schema. For example, A > 4.
Unbox

Unboxes a string field from a DynamicFrame. The output is a DynamicFrame with the selected
string field reformatted. The string field can be parsed and replaced with several fields. You provide
a path in the schema for the string field to reformat and its current format type. For example, you
might have a CSV file that has one field that is in JSON format {"a": 3, "b": "foo", "c":
1.2}. This transform can reformat the JSON into three fields: an int, a string, and a double.
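
For example, the following sketch chains two of the transforms described above (ApplyMapping and
DropFields). It assumes that a GlueContext named glueContext has already been created, as in a
generated script, and that the database, table, and field names are placeholders for your own.

from awsglue.transforms import ApplyMapping, DropFields

# Read a Data Catalog table into a DynamicFrame.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="exampledb", table_name="exampletable")

# Map source columns and data types to target columns and data types.
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[("id", "string", "id", "string"),
              ("price", "double", "price", "double")])

# Remove a field that is not needed in the target.
trimmed = DropFields.apply(frame=mapped, paths=["internal_notes"])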

Working with Jobs on the AWS Glue Console


A job in AWS Glue consists of the business logic that performs extract, transform, and load (ETL) work.
You can create jobs in the ETL section of the AWS Glue console.

To view existing jobs, sign in to the AWS Management Console and open the AWS Glue console at
https://console.aws.amazon.com/glue/. Then choose the Jobs tab in AWS Glue. The Jobs list displays
the location of the script that is associated with each job, when the job was last modified, and the
current job bookmark option.

From the Jobs list, you can do the following:

• To start an existing job, choose Action, and then choose Run job.
• To stop a Running or Starting job, choose Action, and then choose Stop job run.
• To add triggers that start a job, choose Action, Choose job triggers.
• To modify an existing job, choose Action, and then choose Edit job or Delete.
• To change a script that is associated with a job, choose Action, Edit script.
• To reset the state information that AWS Glue stores about your job, choose Action, Reset job
bookmark.
• To create a development endpoint with the properties of this job, choose Action, Create development
endpoint.

To add a new job using the console

1. Open the AWS Glue console, and choose the Jobs tab.
2. Choose Add job, and follow the instructions in the Add job wizard.

If you decide to have AWS Glue generate a script for your job, you must specify the job properties,
data sources, and data targets, and verify the schema mapping of source columns to target columns.
The generated script is a starting point for you to add code to perform your ETL work. Verify the
code in the script and modify it to meet your business needs.
Note
To get step-by-step guidance for adding a job with a generated script, see the Add job
tutorial in the console.

Optionally, you can add a security configuration to a job to specify at-rest encryption options.
For more information, see Encrypting Data Written by Crawlers, Jobs, and Development
Endpoints (p. 82).


If you provide or author the script, your job defines the sources, targets, and transforms. But you
must specify any connections that are required by the script in the job. For information about
creating your own script, see Providing Your Own Custom Scripts (p. 153).

Note
The job assumes the permissions of the IAM role that you specify when you create it. This
IAM role must have permission to extract data from your data store and write to your target.
The AWS Glue console only lists IAM roles that have attached a trust policy for the AWS Glue
principal service. For more information about providing roles for AWS Glue, see Identity-Based
Policies (p. 44).
If the job reads KMS encrypted Amazon S3 data, then the IAM role must have decrypt
permission on the KMS key. For more information, see Step 2: Create an IAM Role for AWS
Glue (p. 14).
Important
Check Troubleshooting Errors in AWS Glue (p. 235) for known problems when a job runs.

To learn about the properties that are required for each job, see Defining Job Properties (p. 142).

To get step-by-step guidance for adding a job with a generated script, see the Add job tutorial in the
AWS Glue console.

Viewing Job Details


To see details of a job, select the job in the Jobs list and review the information on the following tabs:

• History
• Details
• Script
• Metrics

History
The History tab shows your job run history and how successful a job has been in the past. For each job,
the run metrics include the following:

• Run ID is an identifier created by AWS Glue for each run of this job.
• Retry attempt shows the number of attempts for jobs that required AWS Glue to automatically retry.
• Run status shows the success of each run listed with the most recent run at the top. If a job is
Running or Starting, you can choose the action icon in this column to stop it.
• Error shows the details of an error message if the run was not successful.
• Logs links to the logs written to stdout for this job run.

The Logs link takes you to the CloudWatch Logs, where you can see all the details about the tables
that were created in the AWS Glue Data Catalog and any errors that were encountered. You can
manage your log retention period in the CloudWatch console. The default log retention is Never
Expire. For more information about how to change the retention period, see Change Log Data
Retention in CloudWatch Logs.
• Error logs links to the logs written to stderr for this job run.

This link takes you to the CloudWatch Logs, where you can see details about any errors that were
encountered. You can manage your log retention period in the CloudWatch console. The default log
retention is Never Expire. For more information about how to change the retention period, see
Change Log Data Retention in CloudWatch Logs.
• Execution time shows the length of time during which the job run consumed resources. The amount is
calculated from when the job run starts consuming resources until it finishes.
• Timeout shows the maximum execution time during which this job run can consume resources before
it stops and goes into timeout status.
• Delay shows the threshold before sending a job delay notification. When a job run execution time
reaches this threshold, AWS Glue sends a notification ("Glue Job Run Status") to CloudWatch Events.
• Triggered by shows the trigger that fired to start this job run.
• Start time shows the date and time (local time) that the job started.
• End time shows the date and time (local time) that the job ended.

For a specific job run, you can View run metrics, which displays graphs of metrics for the selected job
run. For more information about how to enable metrics and interpret the graphs, see Job Monitoring and
Debugging (p. 210).

Details
The Details tab includes attributes of your job. It shows you the details about the job definition and also
lists the triggers that can start this job. Each time one of the triggers in the list fires, the job is started.
For the list of triggers, the details include the following:

• Trigger name shows the names of triggers that start this job when fired.
• Trigger type lists the type of trigger that starts this job.
• Trigger status displays whether the trigger is created, activated, or deactivated.
• Trigger parameters shows parameters that define when the trigger fires.
• Jobs to trigger shows the list of jobs that start when this trigger fires.

Script
The Script tab shows the script that runs when your job is started. You can invoke an Edit script view
from this tab. For more information about the script editor in the AWS Glue console, see Working with
Scripts on the AWS Glue Console (p. 152). For information about the functions that are called in your
script, see Program AWS Glue ETL Scripts in Python (p. 254).

Metrics
The Metrics tab shows metrics collected when a job runs and profiling is enabled. The following graphs
are shown:

• ETL Data Movement
• Memory Profile: Driver and Executors

Choose View additional metrics to show the following graphs:

• ETL Data Movement
• Memory Profile: Driver and Executors
• Data Shuffle Across Executors
• CPU Load: Driver and Executors
• Job Execution: Active Executors, Completed Stages & Maximum Needed Executors


Data for these graphs is pushed to CloudWatch metrics if the job is enabled to collect metrics. For
more information about how to enable metrics and interpret the graphs, see Job Monitoring and
Debugging (p. 210).
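
As a hedged sketch, you can also retrieve these metrics programmatically with the AWS SDK for
Python (Boto3). The namespace (Glue) and the dimension names (JobName, JobRunId, Type) shown
here are assumptions about how AWS Glue publishes job metrics, and the job name is a placeholder;
verify the exact names in the CloudWatch console for your account.

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client('cloudwatch')

# Fetch the average JVM heap usage reported by the driver over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace='Glue',
    MetricName='glue.driver.jvm.heap.usage',
    Dimensions=[
        {'Name': 'JobName', 'Value': 'example-etl-job'},
        {'Name': 'JobRunId', 'Value': 'ALL'},
        {'Name': 'Type', 'Value': 'gauge'},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=['Average'],
)
print(stats['Datapoints'])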

Example of ETL Data Movement Graph

The ETL Data Movement graph shows the following metrics:

• the number of bytes read from Amazon S3 by all executors—glue.ALL.s3.filesystem.read_bytes (p. 207)
• the number of bytes written to Amazon S3 by all executors—glue.ALL.s3.filesystem.write_bytes (p. 208)

Example of Memory Profile Graph

The Memory Profile graph shows the following metrics:

• the fraction of memory used by the JVM heap (scale: 0-1) by the driver, an executor
identified by executorId, or all executors—
• glue.driver.jvm.heap.usage (p. 205)
• glue.executorId.jvm.heap.usage (p. 205)
• glue.ALL.jvm.heap.usage (p. 205)

Example of Data Shuffle Across Executors Graph

The Data Shuffle Across Executors graph shows the following metrics:


• the number of bytes read by all executors to shuffle data between them
—glue.driver.aggregate.shuffleLocalBytesRead (p. 202)
• the number of bytes written by all executors to shuffle data between them
—glue.driver.aggregate.shuffleBytesWritten (p. 201)

Example of CPU Load Graph

The CPU Load graph shows the following metrics:

• the fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by executorId, or
all executors—
• glue.driver.system.cpuSystemLoad (p. 209)
• glue.executorId.system.cpuSystemLoad (p. 209)
• glue.ALL.system.cpuSystemLoad (p. 209)

Example of Job Execution Graph

The Job Execution graph shows the following metrics:

• the number of actively running executors—glue.driver.ExecutorAllocationManager.executors.numberAllExecutors (p. 203)
• the number of completed stages—glue.aggregate.numCompletedStages (p. 199)
• the number of maximum needed executors—glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors (p. 204)


Editing Scripts in AWS Glue


A script contains the code that extracts data from sources, transforms it, and loads it into targets. AWS
Glue runs a script when it starts a job.

AWS Glue ETL scripts can be coded in Python or Scala. Python scripts use a language that is an extension
of the PySpark Python dialect for extract, transform, and load (ETL) jobs. The script contains extended
constructs to deal with ETL transformations. When you automatically generate the source code logic for
your job, a script is created. You can edit this script, or you can provide your own script to process your
ETL work.

For information about defining and editing scripts using the AWS Glue console, see Working with Scripts
on the AWS Glue Console (p. 152).

Defining a Script
Given a source and target, AWS Glue can generate a script to transform the data. This proposed script
is an initial version that fills in your sources and targets, and suggests transformations in PySpark. You
can verify and modify the script to fit your business needs. Use the script editor in AWS Glue to add
arguments that specify the source and target, and any other arguments that are required to run. Scripts
are run by jobs, and jobs are started by triggers, which can be based on a schedule or an event. For more
information about triggers, see Triggering Jobs in AWS Glue (p. 154).

In the AWS Glue console, the script is represented as code. You can also view the script as a diagram that
uses annotations (##) embedded in the script. These annotations describe the parameters, transform
types, arguments, inputs, and other characteristics of the script that are used to generate a diagram in
the AWS Glue console.

The diagram of the script shows the following:

• Source inputs to the script


• Transforms
• Target outputs written by the script

Scripts can contain the following annotations:

Annotation   Usage

@params      Parameters from the ETL job that the script requires.

@type        Type of node in the diagram, such as the transform type, data source, or data sink.

@args        Arguments passed to the node, except reference to input data.

@return      Variable returned from script.

@inputs      Data input to node.

To learn about the code constructs within a script, see Program AWS Glue ETL Scripts in
Python (p. 254).
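
For example, an annotated data source node in a generated script looks similar to the following
sketch. It assumes a GlueContext named glueContext and uses placeholder database and table
names; the ## lines are the annotations that the console reads to draw the diagram.

## @type: DataSource
## @args: [database = "exampledb", table_name = "exampletable", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="exampledb", table_name="exampletable", transformation_ctx="datasource0")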

Working with Scripts on the AWS Glue Console


A script contains the code that performs extract, transform, and load (ETL) work. You can provide your
own script, or AWS Glue can generate a script with guidance from you. For information about creating
your own scripts, see Providing Your Own Custom Scripts (p. 153).

You can edit a script in the AWS Glue console. When you edit a script, you can add sources, targets, and
transforms.

To edit a script

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/. Then choose the Jobs tab.
2. Choose a job in the list, and then choose Action, Edit script to open the script editor.

You can also access the script editor from the job details page. Choose the Script tab, and then
choose Edit script.

Script Editor
The AWS Glue script editor lets you insert, modify, and delete sources, targets, and transforms in your
script. The script editor displays both the script and a diagram to help you visualize the flow of data.

To create a diagram for the script, choose Generate diagram. AWS Glue uses annotation lines in the
script beginning with ## to render the diagram. To correctly represent your script in the diagram, you
must keep the parameters in the annotations and the parameters in the Apache Spark code in sync.

The script editor lets you add code templates wherever your cursor is positioned in the script. At the top
of the editor, choose from the following options:

• To add a source table to the script, choose Source.


• To add a target table to the script, choose Target.
• To add a target location to the script, choose Target location.
• To add a transform to the script, choose Transform. For information about the functions that are
called in your script, see Program AWS Glue ETL Scripts in Python (p. 254).
• To add a Spigot transform to the script, choose Spigot.

In the inserted code, modify the parameters in both the annotations and Apache Spark code. For
example, if you add a Spigot transform, verify that the path is replaced in both the @args annotation
line and the output code line.


The Logs tab shows the logs that are associated with your job as it runs. The most recent 1,000 lines are
displayed.

The Schema tab shows the schema of the selected sources and targets, when available in the Data
Catalog.

Providing Your Own Custom Scripts


Scripts perform the extract, transform, and load (ETL) work in AWS Glue. A script is created when you
automatically generate the source code logic for a job. You can either edit this generated script, or you
can provide your own custom script.
Important
Your custom script must be compatible with Apache Spark 2.2.1.

To provide your own custom script in AWS Glue, follow these general steps:

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose the Jobs tab, and then choose Add job to start the Add job wizard.
3. In the Job properties screen, choose the IAM role that is required for your custom script to run. For
more information, see Authentication and Access Control for AWS Glue (p. 40).
4. Under This job runs, choose one of the following:

• An existing script that you provide


• A new script to be authored by you
5. Choose any connections that your script references. These objects are needed to connect to the
necessary JDBC data stores.

An elastic network interface is a virtual network interface that you can attach to an instance in a
virtual private cloud (VPC). Choose the elastic network interface that is required to connect to the
data store that's used in the script.
6. If your script requires additional libraries or files, you can specify them as follows:

Python library path

Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that
are required by the script.
Note
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the
pandas Python Data Analysis Library, are not yet supported.
Dependent jars path

Comma-separated Amazon S3 paths to JAR files that are required by the script.
Note
Currently, only pure Java or Scala (2.11) libraries can be used.
Referenced files path

Comma-separated Amazon S3 paths to additional files (for example, configuration files) that are
required by the script.
7. If you want, you can add a schedule to your job. To change a schedule, you must delete the existing
schedule and add a new one.
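
If you author a new script yourself, the following is a minimal sketch of the boilerplate that a custom
PySpark script typically contains before your own sources, transforms, and targets are added. The
Amazon S3 paths are placeholders.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue job so that features such as job bookmarks work.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read JSON data from Amazon S3 into a DynamicFrame.
frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},
    format="json")

# ... your transforms go here ...

# Write the result back to Amazon S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet")

job.commit()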

For more information about adding jobs in AWS Glue, see Adding Jobs in AWS Glue (p. 142).


For step-by-step guidance, see the Add job tutorial in the AWS Glue console.

Triggering Jobs in AWS Glue


You decide what triggers an extract, transform, and load (ETL) job to run in AWS Glue. The triggering
condition can be based on a schedule (as defined by a cron expression) or on an event. You can also run a
job on demand.

Triggering Jobs Based on Schedules or Events


When you create a trigger for a job based on a schedule, you can specify constraints, such as the
frequency the job runs, which days of the week it runs, and at what time. These constraints are based on
cron. When you're setting up a schedule for a trigger, you should consider the features and limitations
of cron. For example, if you choose to run your crawler on day 31 each month, keep in mind that some
months don't have 31 days. For more information about cron, see Time-Based Schedules for Jobs and
Crawlers (p. 189).

When you create a trigger based on an event, you specify events to watch that cause the trigger to fire,
such as when another job succeeds. For a conditional trigger based on job events, you specify a list of
jobs that cause the trigger to fire when any or all of them satisfy the watched job events. In turn, when
the trigger fires, it starts a run of any dependent jobs.

Defining Trigger Types


A trigger can be one of the following types:

Schedule

A time-based trigger based on cron.


Job events (conditional)

An event-based trigger that fires when a previous job or multiple jobs satisfy a list of conditions.
You provide a list of job events to watch for when their run state changes to succeeded, failed,
stopped, or timeout. This trigger waits to fire until any or all the conditions are satisfied.
Important
Dependent jobs are started only if the job that completes was started by a trigger (not run
ad hoc). All jobs in a dependency chain must be descendants of a single schedule or on-
demand trigger.
On-demand

The trigger fires when you start it. As jobs complete, any triggers watching for completion are also
fired and dependent jobs are started.

So that they are ready to fire as soon as they exist, you can set a flag to enable (activate) schedule and
job events (conditional) triggers when they are created.
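
Triggers can also be created programmatically. The following is a hedged sketch using the AWS SDK
for Python (Boto3) to create a schedule trigger that starts a placeholder job every day at 12:00 UTC.

import boto3

glue = boto3.client('glue')

glue.create_trigger(
    Name='example-daily-trigger',
    Type='SCHEDULED',
    Schedule='cron(0 12 * * ? *)',             # cron expression: daily at 12:00 UTC
    Actions=[{'JobName': 'example-etl-job'}],  # placeholder job to start
    StartOnCreation=True                       # enable (activate) the trigger immediately
)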

For more information about defining triggers using the AWS Glue console, see Working with Triggers on
the AWS Glue Console (p. 154).

Working with Triggers on the AWS Glue Console


A trigger controls when an ETL job runs in AWS Glue. To view your existing triggers, sign in to the AWS
Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Then
choose the Triggers tab.


The Triggers list displays properties for each trigger:

Trigger name

The unique name you gave the trigger when you created it.
Trigger type

Indicates whether the trigger is time-based (Schedule), event-based (Job events), or started by you
(On-demand).
Trigger status

Indicates whether the trigger is Enabled or ACTIVATED and ready to invoke associated jobs when
it fires. The trigger can also be Disabled or DEACTIVATED and paused so that it doesn't determine
whether a job is invoked. A CREATED trigger exists, but does not factor into whether a job runs.
Trigger parameters

For Schedule triggers, this includes the details about the frequency and time to fire the trigger. For
Job events triggers, it includes the list of jobs to watch that, depending on their run state, might fire
the trigger. See the details of the trigger for the watch list of jobs with events.
Jobs to trigger

Lists the jobs associated with the trigger that are invoked when this trigger fires.

Adding and Editing Triggers


To edit, delete, or start a trigger, select the check box next to the trigger in the list, and then choose
Action. Here you can also disable a trigger to prevent it from starting any associated jobs, or enable it to
start associated jobs when it fires.

Choose a trigger in the list to view details for the trigger. Trigger details include the information you
defined when you created the trigger.

To add a new trigger, choose Add trigger, and follow the instructions in the Add trigger wizard.

You provide the following properties:

Name

Give your trigger a unique name.


Trigger type

Specify one of the following:


• Schedule: The trigger fires at a specific time.
• Job events: The trigger fires when any or all jobs in the list match the selected job event. For the
trigger to fire, the watched job must have been started by a trigger. For any job you choose, you
can only watch one job event.
• On-demand: The trigger fires when it is started from the triggers list page.

For Schedule and Job events trigger types, you can enable them when they are created.
Jobs to trigger

List of jobs that are started by this trigger.

For more information, see Triggering Jobs in AWS Glue (p. 154).


Using Development Endpoints for Developing


Scripts
AWS Glue can create an environment for you to iteratively develop and test your extract, transform, and
load (ETL) scripts. You can develop your script in a notebook and point to an AWS Glue endpoint to test
it. When you're satisfied with the results of your development process, you can create an ETL job that
runs your script. With this process, you can add functions and debug your script in an interactive manner.
Note
Your Python scripts must target Python 2.7, because AWS Glue development endpoints do not
support Python 3 yet.

Managing Your Development Environment


With AWS Glue, you can create, edit, and delete development endpoints. You provide configuration
values to provision the development environments. These values tell AWS Glue how to set up the
network so that you can access your development endpoint securely, and your endpoint can access your
data stores. Then, create a notebook that connects to the development endpoint, and use your notebook
to author and test your ETL script.

For more information about managing a development endpoint using the AWS Glue console, see
Working with Development Endpoints on the AWS Glue Console (p. 157).

How to Use a Development Endpoint


To use a development endpoint, you can follow this workflow.

1. Create an AWS Glue development endpoint through the console or API. This endpoint is launched in
your virtual private cloud (VPC) with your defined security groups.
2. The console or API can poll the development endpoint until it is provisioned and ready for work.
When it's ready, you can connect to the development endpoint to create and test AWS Glue scripts.

• You can install an Apache Zeppelin notebook on your local machine, connect it to a development
endpoint, and then develop on it using your browser.
• You can create an Apache Zeppelin notebook server in its own Amazon EC2 instance in your
account using the AWS Glue console, and then connect to it using your browser. For more
information about how to create a notebook server, see Creating a Notebook Server Associated
with a Development Endpoint (p. 179).
• You can create an Amazon SageMaker notebook in your account using the AWS Glue console. For
more information about how to create a notebook, see Working with Notebooks on the AWS Glue
Console (p. 185).
• You can open a terminal window to connect directly to a development endpoint.
• If you have the Professional edition of the JetBrains PyCharm Python IDE, you can connect it to
a development endpoint and use it to develop interactively. PyCharm can then support remote
breakpoints if you insert pydevd statements in your script.
3. When you finish debugging and testing on your development endpoint, you can delete it.

Accessing Your Development Endpoint


When you create a development endpoint in a virtual private cloud (VPC), AWS Glue returns only
a private IP address, and the public IP address field is not populated. When you create a non-VPC
development endpoint, AWS Glue returns only a public IP address.


If your development endpoint has a Public address, then confirm it is reachable with the SSH private key
for the development endpoint. For example:

ssh -i dev-endpoint-private-key.pem glue@public-address

If your development endpoint has a Private address and your VPC subnet is routable from the public
internet and its security groups allow inbound access from your client, then you can follow these
instructions to attach an elastic IP to a development endpoint, thereby allowing access from the
internet.
Note
To use elastic IPs, the subnet being used requires an internet gateway associated through the
route table.

1. On the AWS Glue console, navigate to the development endpoint details page. Record the Private
address for use in the next step.
2. On the Amazon EC2 console, navigate to Network and Security, then choose Network Interfaces.
Search for the Private DNS (IPv4) that corresponds to the Private address in the AWS Glue console
development endpoint details page. You might need to modify which columns are displayed in
your Amazon EC2 console. Note the Network interface ID (ENI) for this address. For example
eni-12345678.
3. On the Amazon EC2 console, navigate to Network and Security, then choose Elastic IPs. Choose
Allocate new address, then Allocate to allocate a new elastic IP.
4. On the Elastic IPs page, choose the newly allocated Elastic IP. Then choose Actions, Associate
address.
5. On the Associate address page make the following choices:

• For Resource type, choose Network interface.


• In the Network interface field, type the Network interface ID (ENI) for the private address.
• Choose Associate.
6. Confirm if the newly associated Elastic IP is reachable with the SSH private key associated with the
development endpoint. For example:

ssh -i dev-endpoint-private-key.pem glue@elastic-ip

For information about using a bastion host to get SSH access to the development endpoint’s private
address, see Securely Connect to Linux Instances Running in a Private Amazon VPC.

Working with Development Endpoints on the AWS


Glue Console
A development endpoint is an environment that you can use to develop and test your AWS Glue scripts.
The Dev endpoints tab on the AWS Glue console lists all the development endpoints that you have
created. You can add, delete, or rotate the SSH key of a development endpoint. You can also create
notebooks that use the development endpoint.

To display details for a development endpoint, choose the endpoint in the list. Endpoint details include
the information you defined when you created it using the Add endpoint wizard. They also include
information that you need to connect to the endpoint and any notebooks that use the endpoint.

Follow the instructions in the tutorial topics to learn the details about how to use your development
endpoint with notebooks.


The following are some of the development endpoint properties:

Endpoint name

The unique name that you give the endpoint when you create it.
Provisioning status

Describes whether the endpoint is being created (PROVISIONING), ready to be used (READY), in the
process of terminating (TERMINATING), terminated (TERMINATED), or failed (FAILED).
Failure reason

Reason for the development endpoint failure.


Private address

Address to connect to the development endpoint. On the Amazon EC2 console, you can view the
ENI attached to this IP address. This internal address is created if the development endpoint is
associated with a virtual private cloud (VPC). For more information about accessing a development
endpoint from a private address, see the section called “Accessing Your Dev Endpoint” (p. 156).
Public address

Address to connect to the development endpoint.


Public key contents

Current public SSH keys that are associated with the development endpoint (optional). If you
provided a public key when you created the development endpoint, you should have saved the
corresponding SSH private key.
IAM role

Specify the IAM role that is used for authorization to resources. If the development endpoint reads
KMS encrypted Amazon S3 data, then the IAM role must have decrypt permission on the KMS
key. For more information about creating an IAM role, see Step 2: Create an IAM Role for AWS
Glue (p. 14).
SSH to Python REPL

You can open a terminal window on your computer (laptop) and type this command to interact with
the development endpoint as a Read-Eval-Print Loop (REPL) shell. This field is only shown if the
development endpoint contains a public SSH key.
SSH to Scala REPL

You can open a terminal window on your computer and type this command to interact with the
development endpoint as a REPL shell. This field is only shown if the development endpoint
contains a public SSH key.
SSH tunnel to remote interpreter

You can open a terminal window on your computer and type this command to open a tunnel to
the development endpoint. Then open your local Apache Zeppelin notebook and point to the
development endpoint as a remote interpreter. Once the interpreter is set up, all notes within the
notebook can use it. This field is only shown if the development endpoint contains a public SSH key.
Public key update status

The status of completing an update of the public key on the development endpoint. When you
update a public key, the new key must be propagated to the development endpoint. Status values
include COMPLETED and PENDING.
Last modified time

Last time this development endpoint was modified.


Running for

Amount of time the development endpoint has been provisioned and READY.

Adding an Endpoint
1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose the Dev endpoints tab, and then choose Add endpoint.
3. Follow the steps in the AWS Glue Add endpoint wizard to provide the properties that are required to
create an endpoint. If you choose to provide an SSH public key when you create your development
endpoint, save your SSH private key to access the development endpoint later.

The following are some optional fields you can provide:

Security configuration

Add a security configuration to a development endpoint to specify at-rest encryption options.


For more information, see Encrypting Data Written by Crawlers, Jobs, and Development
Endpoints (p. 82).
Data processing units (DPUs)

The number of DPUs that AWS Glue uses for your development endpoint. The number must be
greater than 1.
Python library path

Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that
are required by your script. Multiple values must be complete paths separated by a comma (,).
Only individual files are supported, not a directory path.
Note
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the
pandas Python Data Analysis Library, are not yet supported.
Dependent jars path

Comma-separated Amazon S3 paths to JAR files that are required by the script.
Note
Currently, only pure Java or Scala (2.11) libraries can be used.
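
As a hedged alternative to the Add endpoint wizard, you can create a development endpoint with the
AWS SDK for Python (Boto3). The role ARN, key file, and library paths below are placeholders.

import boto3

glue = boto3.client('glue')

glue.create_dev_endpoint(
    EndpointName='demo-endpoint',
    RoleArn='arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault',
    PublicKey=open('dev-endpoint-key.pub').read(),   # SSH public key you generated
    NumberOfNodes=5,                                 # DPUs allocated to the endpoint
    ExtraPythonLibsS3Path='s3://example-bucket/libs/my_lib.zip',
    ExtraJarsS3Path='s3://example-bucket/jars/my-deps.jar'
)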

Creating a Notebook Server Hosted on Amazon EC2


You can install an Apache Zeppelin Notebook on your local machine and use it to debug and
test ETL scripts on a development endpoint. Alternatively, you can host the Zeppelin Notebook
server on an Amazon EC2 instance. For more information, see the section called “Notebook Server
Considerations” (p. 179).

In the AWS Glue Create notebook server window, you add the properties that are required to create a
notebook server to use an Apache Zeppelin notebook.
Note
You manage any notebook server that you create and associate with a development endpoint.
Therefore, if you delete the development endpoint, you must also delete the AWS CloudFormation
stack on the AWS CloudFormation console to delete the notebook server.
Important
Before you can use the notebook server hosted on Amazon EC2, you must run a script on the
Amazon EC2 instance that does the following actions:


• Sets the Zeppelin notebook password.


• Sets up communication between the notebook server and the development endpoint.
• Verifies or generates a Secure Sockets Layer (SSL) certificate to access the Zeppelin notebook.

For more information, see the section called “Notebook Server Considerations” (p. 179).

You provide the following properties:

CloudFormation stack name

The name of your notebook that is created in the AWS CloudFormation stack on the development
endpoint. The name is prefixed with aws-glue-. This notebook runs on an Amazon EC2 instance.
The Zeppelin HTTP server is started either on public port 443 or on localhost port 8080, which can be
accessed with an SSH tunnel command.
IAM role

A role with a trust relationship to Amazon EC2 that matches the Amazon EC2 instance profile
exactly. Create the role in the IAM console. Choose Amazon EC2, and attach a policy for the
notebook, such as AWSGlueServiceNotebookRoleDefault. For more information, see Step 5: Create
an IAM Role for Notebook Servers (p. 24).

For more information about instance profiles, see Using Instance Profiles.
EC2 key pair

The Amazon EC2 key that is used to access the Amazon EC2 instance hosting the notebook server.
You can create a key pair on the Amazon EC2 console (https://console.aws.amazon.com/ec2/). Save
the key files for later use. For more information, see Amazon EC2 Key Pairs.
Attach a public IP to the notebook server EC2 instance

Select this option to attach a public IP which can be used to access the notebook server from the
internet. Whether you choose a public or private Subnet is a factor when deciding to select this
option. In a public subnet, a notebook server requires a public IP to access the internet. If your
notebook server is in a private subnet and you do not want a public IP, don't select this option.
However, your notebook server still requires a route to the internet such as through a NAT gateway.
Notebook username

The user name that you use to access the Zeppelin notebook. The default is admin.
Notebook S3 path

The location where the state of the notebook is stored. The Amazon S3 path to the Zeppelin
notebook must follow the format: s3://bucket-name/username. Subfolders cannot be included
in the path. The default is s3://aws-glue-notebooks-account-id-region/notebook-
username.
Subnet

The available subnets that you can use with your notebook server. An asterisk (*) indicates that the
subnet can be accessed from the internet. The subnet must have a route to the internet through
an internet gateway (IGW), NAT gateway, or VPN. For more information, see Setting Up Your
Environment for Development Endpoints (p. 33).
Security groups

The available security groups that you can use with your notebook server. The security group must
have inbound rules for HTTPS (port 443) and SSH (port 22). Ensure that the rule's source is either
0.0.0.0/0 or the IP address of the machine connecting to the notebook.


S3 AWS KMS key

A key used for client-side KMS encryption of the Zeppelin notebook storage on Amazon S3. This
field is optional. Enable access to Amazon S3 by either choosing an AWS KMS key or choose Enter
a key ARN and provide the Amazon Resource Name (ARN) for the key. Type the ARN in the form
arn:aws:kms:region:account-id:key/key-id. You can also provide the ARN in the form of a
key alias, such as arn:aws:kms:region:account-id:alias/alias-name.
Custom AMI ID

A custom Amazon Machine Image (AMI) ID of an encrypted Amazon Elastic Block Store (EBS) EC2
instance. This field is optional. Provide the AMI ID by either choosing an AMI ID or choose Enter AMI
ID and type the custom AMI ID. For more information about how to encrypt your notebook server
storage, see Encryption and AMI Copy.
Notebook server tags

The AWS CloudFormation stack is always tagged with a key aws-glue-dev-endpoint and the value
of the name of the development endpoint. You can add more tags to the AWS CloudFormation
stack.

EC2 instance

The name of the Amazon EC2 instance that is created to host your notebook. This links to the Amazon
EC2 console (https://console.aws.amazon.com/ec2/) where the instance is tagged with the key aws-
glue-dev-endpoint and value of the name of the development endpoint.
CloudFormation stack

The name of the AWS CloudFormation stack used to create the notebook server.
SSH to EC2 server command

Type this command in a terminal window to connect to the Amazon EC2 instance that is running
your notebook server. The Amazon EC2 address shown in this command is either public or private
depending on whether you chose to Attach a public IP to the notebook server EC2 instance.
Copy certificate

Example scp command to copy the keystore required to set up the Zeppelin notebook server to the
Amazon EC2 instance that hosts the notebook server. Run the command from a terminal window
in the directory where the Amazon EC2 private key is located. The key used to access the Amazon EC2
instance is the parameter to the -i option. You provide the path-to-keystore-file. The
remaining part of the command is the location on the Amazon EC2 server where the development
endpoint's private SSH key is stored.
HTTPS URL

After completing the setup of a notebook server, type this URL in a browser to connect to your
notebook using HTTPS.

Tutorial Setup: Prerequisites for the Development


Endpoint Tutorials
Development endpoints create an environment where you can interactively test and debug ETL scripts
in various ways before you run them as AWS Glue jobs. The tutorials in this section show you how to do
this using different IDEs. All of them assume that you have set up a development endpoint and crawled
sample data to create tables in your AWS Glue Data Catalog using the steps in the following sections.
Note
Your Python scripts must target Python 2.7, because AWS Glue development endpoints do not
support Python 3 yet.


Because you're using only Amazon Simple Storage Service (Amazon S3) data in some cases, and a mix
of JDBC and Amazon S3 data in others, you will set up one development endpoint that is not in a virtual
private cloud (VPC) and one that is.

Crawling the Sample Data Used in the Tutorials


The first step is to create a crawler that can crawl some sample data and record metadata about it in
tables in your Data Catalog. The sample data that is used is drawn from http://everypolitician.org/ and
has been modified slightly for purposes of the tutorials. It contains data in JSON format about United
States legislators and the seats that they have held in the US House of Representatives and Senate.

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.

In the AWS Glue console, choose Databases in the navigation pane, and then choose Add database.
Name the database legislators.
2. Choose Crawlers, and then choose Add crawler. Name the crawler legislator_crawler, and then
choose Next.
3. Leave Amazon S3 as the data store. Under Crawl data in, choose Specified path in another account.
Then in the Include path box, type s3://awsglue-datasets/examples/us-legislators/
all. Choose Next, and then choose Next again to confirm that you don't want to add another data
store.
4. Provide an IAM role for the crawler to assume when it runs, and then choose Next. Then choose Next to
confirm that this crawler will be run on demand.
5. For Database, choose the legislators database. Choose Next, and then choose Finish to
complete the creation of the new crawler.
6. Choose Crawlers in the navigation pane again. Select the check box next to the new
legislator_crawler crawler, and choose Run crawler.
7. Choose Databases in the navigation pane. Choose the legislators database, and then choose
Tables in legislators. You should see six tables created by the crawler in your Data Catalog,
containing metadata that the crawler retrieved.
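
If you prefer to script this setup, the following is a hedged sketch of the same database and crawler
created with the AWS SDK for Python (Boto3); the role name is a placeholder.

import boto3

glue = boto3.client('glue')

# Create the database and the crawler, then run the crawler on demand.
glue.create_database(DatabaseInput={'Name': 'legislators'})

glue.create_crawler(
    Name='legislator_crawler',
    Role='AWSGlueServiceRoleDefault',   # IAM role the crawler assumes
    DatabaseName='legislators',
    Targets={'S3Targets': [{'Path': 's3://awsglue-datasets/examples/us-legislators/all'}]}
)

glue.start_crawler(Name='legislator_crawler')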

Creating a Development Endpoint for Amazon S3 Data


The next thing to do is to create a development endpoint for Amazon S3 data. When you use a JDBC
data source or target, the development endpoint must be created with a VPC. However, this isn't
necessary in this tutorial if you are only accessing Amazon S3.

1. In the AWS Glue console, choose Dev endpoints. Choose Add endpoint.
2. Specify an endpoint name, such as demo-endpoint.
3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs.
For more information, see Step 2: Create an IAM Role for AWS Glue (p. 14). Choose Next.
4. In Networking, leave Skip networking information selected, and choose Next.
5. In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-
keygen (do not use an Amazon EC2 key pair). The generated public key will be imported into your
development endpoint. Save the corresponding private key to later connect to the development
endpoint using SSH. Choose Next. For more information, see ssh-keygen in Wikipedia.
Note
When generating the key on Microsoft Windows, use a current version of PuTTYgen and
paste the public key into the AWS Glue console from the PuTTYgen window. Generate an
RSA key. Do not upload a file with the public key, instead use the key generated in the field
Public key for pasting into OpenSSH authorized_keys file. The corresponding private key
(.ppk) can be used in PuTTY to connect to the development endpoint. To connect to the
development endpoint with SSH on Windows, convert the private key from .ppk format to
OpenSSH .pem format using the PuTTYgen Conversion menu. For more information, see
Connecting to Your Linux Instance from Windows Using PuTTY.
6. In Review, choose Finish. After the development endpoint is created, wait for its provisioning status
to move to READY.

Creating an Amazon S3 Location to Use for Output


If you don't already have a bucket, follow the instructions in Create a Bucket to set one up in Amazon S3
where you can save output from sample ETL scripts.

Creating a Development Endpoint with a VPC


Although not required for this tutorial, a VPC development endpoint is needed if both Amazon S3 and
JDBC data stores are accessed by your ETL statements. In this case, when you create a development
endpoint you specify network properties of the virtual private cloud (Amazon VPC) that contains your
JDBC data stores. Before you start, set up your environment as explained in Setting Up Your Environment
for Development Endpoints (p. 33).

1. In the AWS Glue console, choose Dev endpoints in the navigation pane. Then choose Add endpoint.
2. Specify an endpoint name, such as vpc-demo-endpoint.
3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs.
For more information, see Step 2: Create an IAM Role for AWS Glue (p. 14). Choose Next.
4. In Networking, specify an Amazon VPC, a subnet, and security groups. This information is used
to create a development endpoint that can connect to your data resources securely. Consider the
following suggestions when filling in the properties of your endpoint:

• If you already set up a connection to your data stores, you can use the same connection to
determine the Amazon VPC, subnet, and security groups for your endpoint. Otherwise, specify
these parameters individually.
• Ensure that your Amazon VPC has Edit DNS hostnames set to yes. This parameter can be set
in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information, see
Setting Up DNS in Your VPC (p. 28).
• For this tutorial, ensure that the Amazon VPC you select has an Amazon S3 VPC endpoint. For
information about how to create an Amazon S3 VPC endpoint, see Amazon VPC Endpoints for
Amazon S3 (p. 29).
• Choose a public subnet for your development endpoint. You can make a subnet a public subnet
by adding a route to an internet gateway. For IPv4 traffic, create a route with Destination
0.0.0.0/0 and Target the internet gateway ID. Your subnet’s route table should be associated
with an internet gateway, not a NAT gateway. This information can be set in the Amazon VPC
console (https://console.aws.amazon.com/vpc/).
For more information, see Route tables for Internet Gateways. For information about how to
create an internet gateway, see Internet Gateways.
• Ensure that you choose a security group that has an inbound self-reference rule. This information
can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/).

For more information about how to set up your subnet, see Setting Up Your Environment for
Development Endpoints (p. 33).

Choose Next.
5. In SSH Public Key, enter a public key generated by an SSH key generator program (do not use an
Amazon EC2 key pair). Save the corresponding private key to later connect to the development
endpoint using SSH. Choose Next.
Note
When generating the key on Microsoft Windows, use a current version of PuTTYgen and
paste the public key into the AWS Glue console from the PuTTYgen window. Generate an
RSA key. Do not upload a file with the public key, instead use the key generated in the field
Public key for pasting into OpenSSH authorized_keys file. The corresponding private key
(.ppk) can be used in PuTTY to connect to the development endpoint. To connect to the
development endpoint with SSH on Windows, convert the private key from .ppk format to
OpenSSH .pem format using the PuTTYgen Conversion menu. For more information, see
Connecting to Your Linux Instance from Windows Using PuTTY.
6. In Review, choose Finish. After the development endpoint is created, wait for its provisioning status
to move to READY.

You are now ready to try out the tutorials in this section:

• Tutorial: Set Up a Local Apache Zeppelin Notebook to Test and Debug ETL Scripts (p. 164)
• Tutorial: Set Up an Apache Zeppelin Notebook Server on Amazon EC2 (p. 167)
• Tutorial: Use a REPL Shell with Your Development Endpoint (p. 170)

Tutorial: Set Up a Local Apache Zeppelin Notebook to


Test and Debug ETL Scripts
In this tutorial, you connect an Apache Zeppelin Notebook on your local machine to a development
endpoint so that you can interactively run, debug, and test AWS Glue ETL (extract, transform, and load)


scripts before deploying them. This tutorial uses SSH port forwarding to connect your local machine to
an AWS Glue development endpoint. For more information, see Port forwarding in Wikipedia.

The tutorial assumes that you have already taken the steps outlined in Tutorial Prerequisites (p. 161).

Installing an Apache Zeppelin Notebook


1. Make sure that you have an up-to-date version of Java installed on your local machine (see the Java
home page for the latest version).

If you are running on Microsoft Windows, make sure that the JAVA_HOME environment variable
points to the right Java directory. It's possible to update Java without updating this variable, and if it
points to a folder that no longer exists, Zeppelin fails to start.
2. Download Apache Zeppelin (the version with all interpreters) from the Zeppelin download page onto
your local machine.

Download the older release named zeppelin-0.7.3-bin-all.tgz from the download page and follow
the installation instructions. Start Zeppelin in the way that's appropriate for your operating system.
Leave the terminal window that starts the notebook server open while you are using Zeppelin. When
the server has started successfully, you can see a line in the console that ends with "Done, zeppelin
server started."
3. Open Zeppelin in your browser by navigating to http://localhost:8080.
4. In Zeppelin in the browser, open the drop-down menu at anonymous in the upper-right corner of
the page, and choose Interpreter. On the interpreters page, search for spark, and choose edit on
the right. Make the following changes:

• Select the Connect to existing process check box, and then set Host to localhost and Port to
9007 (or whatever other port you are using for port forwarding).
• In Properties, set master to yarn-client.
• If there is a spark.executor.memory property, delete it by choosing the x in the action column.
• If there is a spark.driver.memory property, delete it by choosing the x in the action column.

Choose Save at the bottom of the page, and then choose OK to confirm that you want to update the
interpreter and restart it. Use the browser back button to return to the Zeppelin start page.

Initiating SSH Port Forwarding to Connect to Your DevEndpoint


Next, use SSH local port forwarding to forward a local port (here, 9007) to the remote destination
defined by AWS Glue (169.254.76.1:9007).

Open a terminal window that gives you access to the SSH secure-shell protocol. On Microsoft Windows,
you can use the BASH shell provided by Git for Windows, or install Cygwin.

Run the following SSH command, modified as follows:

• Replace private-key-file-path with a path to the .pem file that contains the private key
corresponding to the public key that you used to create your development endpoint.
• If you are forwarding a different port than 9007, replace 9007 with the port number that you are
actually using locally. The address, 169.254.76.1:9007, is the remote port and not changed by you.
• Replace dev-endpoint-public-dns with the public DNS address of your development endpoint. To
find this address, navigate to your development endpoint in the AWS Glue console, choose the name,
and copy the Public address that's listed in the Endpoint details page.


ssh -i private-key-file-path -NTL 9007:169.254.76.1:9007 glue@dev-endpoint-public-dns

You will likely see a warning message like the following:

The authenticity of host 'ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com (xx.xxx.xxx.xx)'
can't be established. ECDSA key fingerprint is SHA256:4e97875Brt+1wKzRko+JflSnp21X7aTP3BcFnHYLEts.
Are you sure you want to continue connecting (yes/no)?

Type yes and leave the terminal window open while you use your Zeppelin notebook.

Running a Simple Script Fragment in a Notebook Paragraph


In the Zeppelin start page, choose Create new note. Name the new note Legislators, and confirm
spark as the interpreter.

Type the following script fragment into your notebook and run it. It uses the person's metadata in the
AWS Glue Data Catalog to create a DynamicFrame from your sample data. It then prints out the item
count and the schema of this data.

%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

# Create a Glue context


glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'persons_json' table


persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators",
table_name="persons_json")

# Print out information about this data


print "Count: ", persons_DyF.count()
persons_DyF.printSchema()

The output of the script is as follows:

Count: 1961
root
|-- family_name: string
|-- name: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- note: string
| | |-- name: string


| | |-- lang: string


|-- sort_name: string
|-- images: array
| |-- element: struct
| | |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
|-- death_date: string
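
If the count and schema look correct, you can keep experimenting in additional notebook paragraphs. The following paragraph is a small follow-up sketch that assumes the persons_DyF frame from the paragraph above; it keeps a few fields, converts the DynamicFrame to a Spark DataFrame, and aggregates the records by gender.

%pyspark
# Keep a few fields, convert to a Spark DataFrame, and aggregate by gender
selected_DyF = persons_DyF.select_fields(['family_name', 'given_name', 'gender'])
selected_DyF.toDF().groupBy('gender').count().show()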

Troubleshooting Your Local Notebook Connection


• If you encounter a connection refused error, you might be using a development endpoint that is out of
date. Try creating a new development endpoint and reconnecting.
• If your connection times out or stops working for any reason, you may need to take the following steps
to restore it:
1. In Zeppelin, in the drop-down menu in the upper-right corner of the page, choose Interpreters. On
the interpreters page, search for spark. Choose edit, and clear the Connect to existing process
check box. Choose Save at the bottom of the page.
2. Initiate SSH port forwarding as described earlier.
3. In Zeppelin, re-enable the spark interpreter's Connect to existing process settings, and then save
again.

Resetting the interpreter like this should restore the connection. Another way to accomplish this is to
choose restart for the Spark interpreter on the Interpreters page. Then wait for up to 30 seconds to
ensure that the remote interpreter has restarted.
• Ensure your development endpoint has permission to access the remote Zeppelin interpreter. Without
the proper networking permissions you might encounter errors such as open failed: connect failed:
Connection refused.

Tutorial: Set Up an Apache Zeppelin Notebook Server on Amazon EC2
In this tutorial, you create an Apache Zeppelin Notebook server that is hosted on an Amazon EC2
instance. The notebook connects to one of your development endpoints so that you can interactively
run, debug, and test AWS Glue ETL (extract, transform, and load) scripts before deploying them.

The tutorial assumes that you have already taken the steps outlined in Tutorial Prerequisites (p. 161).

Creating an Apache Zeppelin Notebook Server on an Amazon EC2 Instance
To create a notebook server on Amazon EC2, you must have permission to create resources in AWS
CloudFormation, Amazon EC2, and other services. For more information about required user permissions,
see Step 3: Attach a Policy to IAM Users That Access AWS Glue (p. 15).

1. On the AWS Glue console, choose Dev endpoints to go to the development endpoints list.
2. Choose an endpoint by selecting the box next to it. Choose an endpoint with an empty SSH public
key, because the key is generated by a later action on the Amazon EC2 instance. Then choose
Actions, and choose Create notebook server.


To host the notebook server, an Amazon EC2 instance is spun up using an AWS CloudFormation
stack on your development endpoint. If you create the Zeppelin server with an SSL certificate, the
Zeppelin HTTPS server is started on port 443.
3. Enter an AWS CloudFormation stack server name such as demo-cf, using only alphanumeric
characters and hyphens.
4. Choose an IAM role that you have set up with a trust relationship to Amazon EC2, as documented in
Step 5: Create an IAM Role for Notebook Servers (p. 24).
5. Choose an Amazon EC2 key pair that you have generated on the Amazon EC2 console (https://
console.aws.amazon.com/ec2/), or choose Create EC2 key pair to generate a new one. Remember
where you have downloaded and saved the private key portion of the pair. This key pair is different
from the SSH key you used when creating your development endpoint (the keys that Amazon EC2
uses are 2048-bit SSH-2 RSA keys). For more information about Amazon EC2 keys, see Amazon EC2
Key Pairs.

It is generally a good practice to ensure that the private-key file is write-protected so that it is
not accidentally modified. On macOS and Linux systems, do this by opening a terminal and entering
chmod 400 private-key-file-path. On Windows, open the console and enter attrib +r
private-key-file-path.
6. Choose a user name to access your Zeppelin notebook.
7. Choose an Amazon S3 path for your notebook state to be stored in.
8. Choose Create.

You can view the status of the AWS CloudFormation stack in the AWS CloudFormation console Events
tab (https://console.aws.amazon.com/cloudformation). You can view the Amazon EC2 instances created
by AWS CloudFormation in the Amazon EC2 console (https://console.aws.amazon.com/ec2/). Search
for instances that are tagged with the key name aws-glue-dev-endpoint and value of the name of your
development endpoint.

After the notebook server is created, its status changes to CREATE_COMPLETE in the AWS
CloudFormation console. Details about your server also appear in the development endpoint details page. When the
creation is complete, you can connect to a notebook on the new server.

To complete the setup of the Zeppelin notebook server, you must run a script on the Amazon EC2
instance. This tutorial requires that you upload an SSL certificate when you create the Zeppelin server
on the Amazon EC2 instance. But there is also an SSH local port forwarding method to connect.
For additional setup instructions, see Creating a Notebook Server Associated with a Development
Endpoint (p. 179). When the creation is complete, you can connect to a notebook on the new server
using HTTPS.
Note
You manage any notebook server that you create and associate with a development endpoint.
Therefore, if you delete the development endpoint, you must also delete the notebook server by
deleting its AWS CloudFormation stack on the AWS CloudFormation console.

Connecting to Your Notebook Server on Amazon EC2


1. In the AWS Glue console, choose Dev endpoints to navigate to the development endpoints list.
Choose the name of the development endpoint for which you created a notebook server. Choosing
the name opens its details page.
2. On the Endpoint details page, copy the URL labeled HTTPS URL for your notebook server.
3. Open a web browser, and paste in the notebook server URL. This lets you access the server using
HTTPS on port 443. Your browser may not recognize the server's certificate, in which case you have
to override its protection and proceed anyway.
4. Log in to Zeppelin using the user name and password that you provided when you created the
notebook server.


Running a Simple Script Fragment in a Notebook Paragraph


1. Choose Create new note and name it Legislators. Confirm spark as the Default Interpreter.
2. You can verify that your notebook is now set up correctly by typing the statement spark.version
and running it. This returns the version of Apache Spark that is running on your notebook server.
3. Type the following script into the next paragraph in your notebook and run it. This script reads
metadata from the persons_json table that your crawler created, creates a DynamicFrame from the
underlying data, and displays the number of records and the schema of the data.

%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions

# Create a Glue context


glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'persons_json' table


persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators",
table_name="persons_json")

# Print out information about this data


print "Count: ", persons_DyF.count()
persons_DyF.printSchema()

The output of the script should be:

Count: 1961
root
|-- family_name: string
|-- name: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- note: string
| | |-- name: string
| | |-- lang: string
|-- sort_name: string
|-- images: array
| |-- element: struct
| | |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
|-- death_date: string
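
As an optional next step, you can write a trimmed-down copy of this data back to Amazon S3 from another notebook paragraph. The following is only a sketch: the mappings list is a subset of the schema shown above, and s3://my-example-bucket/legislators-out/ is a placeholder path that you would replace with a bucket that your development endpoint's IAM role can write to. ApplyMapping is available because the previous paragraph imported awsglue.transforms.

%pyspark
# Keep three columns and write the result to a placeholder S3 path as JSON
mapped_DyF = ApplyMapping.apply(
    frame=persons_DyF,
    mappings=[
        ('family_name', 'string', 'family_name', 'string'),
        ('given_name', 'string', 'given_name', 'string'),
        ('birth_date', 'string', 'birth_date', 'string')])

glueContext.write_dynamic_frame.from_options(
    frame=mapped_DyF,
    connection_type='s3',
    connection_options={'path': 's3://my-example-bucket/legislators-out/'},
    format='json')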


Tutorial: Use a REPL Shell with Your Development Endpoint
In AWS Glue, you can create a development endpoint and then invoke a REPL (Read–Evaluate–Print
Loop) shell to run PySpark code incrementally so that you can interactively debug your ETL scripts before
deploying them.

The tutorial assumes that you have already taken the steps outlined in Tutorial Prerequisites (p. 161).

1. In the AWS Glue console, choose Dev endpoints to navigate to the development endpoints list.
Choose the name of a development endpoint to open its details page.
2. Copy the SSH command labeled SSH to Python REPL, and paste it into a text editor. This field
is only shown if the development endpoint contains a public SSH key. Replace the <private-
key.pem> text with the path to the private-key .pem file that corresponds to the public key that
you used to create the development endpoint. Use forward slashes rather than backslashes as
delimiters in the path.
3. On your local computer, open a terminal window that can run SSH commands, and paste in the
edited SSH command. Run the command. The output will look like this:

download: s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar to ../../usr/share/aws/glue/etl/jars/glue-assembly.jar
download: s3://aws-glue-jes-prod-us-east-1-assets/etl/python/PyGlue.zip to ../../usr/share/aws/glue/etl/python/PyGlue.zip
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/
slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/
slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.1.0
/_/

Using Python version 2.7.12 (default, Sep 1 2016 22:14:00)


SparkSession available as 'spark'.
>>>

4. Test that the REPL shell is working correctly by typing the statement, print spark.version. As
long as that displays the Spark version, your REPL is now ready to use.
5. Now you can try executing the following simple script, line by line, in the shell:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
glueContext = GlueContext(SparkContext.getOrCreate())
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators",
table_name="persons_json")
print "Count: ", persons_DyF.count()


persons_DyF.printSchema()
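
You can continue to explore the data incrementally from the same shell. For example, the following lines (a sketch that assumes the persons_DyF frame created above) convert the DynamicFrame to a Spark DataFrame and count the records that have no death_date value:

persons_df = persons_DyF.toDF()
print "Count with no death_date: ", persons_df.filter(persons_df['death_date'].isNull()).count()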

Tutorial: Set Up PyCharm Professional with a Development Endpoint
This tutorial shows you how to connect the PyCharm Professional Python IDE running on your local
machine to a development endpoint so that you can interactively run, debug, and test AWS Glue ETL
(extract, transform, and load) scripts before deploying them.

To connect to a development endpoint interactively, you must have PyCharm Professional installed. You
can't do this using the free edition.

The tutorial assumes that you have already taken the steps outlined in Tutorial Prerequisites (p. 161).

Connecting PyCharm Professional to a Development Endpoint


1. Create a new pure-Python project in PyCharm named legislators.
2. Create a file named get_person_schema.py in the project with the following content:

import sys
import pydevd
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

def main():
# Invoke pydevd
pydevd.settrace('169.254.76.0', port=9001, stdoutToServer=True, stderrToServer=True)

# Create a Glue context


glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'persons_json' table


persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators",
table_name="persons_json")

# Print out information about this data


print "Count: ", persons_DyF.count()
persons_DyF.printSchema()

if __name__ == "__main__":
main()

3. Download the AWS Glue Python library file, PyGlue.zip, from https://s3.amazonaws.com/
aws-glue-jes-prod-us-east-1-assets/etl/python/PyGlue.zip to a convenient location
on your local machine.
4. Add PyGlue.zip as a content root for your project in PyCharm:

• In PyCharm, choose File, Settings to open the Settings dialog box. (You can also use the gear-
and-wrench icon on the toolbar, or press Ctrl+Alt+S.)
• Expand the legislators project and choose Project Structure. Then in the right pane, choose +
Add Content Root.
• Navigate to the location where you saved PyGlue.zip, select it, then choose Apply.

The Settings screen should look something like the following:


Leave the Settings dialog box open after you choose Apply.
5. Configure deployment options to upload the local script to your development endpoint using SFTP
(this capability is available only in PyCharm Professional):

• In the Settings dialog box, expand the Build, Execution, Deployment section. Choose the
Deployment subsection.
• Choose the + icon at the top of the middle pane to add a new server. Give it a name and set its
Type to SFTP.
• Set the SFTP host to the Public address of your development endpoint, as listed on its details
page (choose the name of your development endpoint in the AWS Glue console to display the
details page).
• Set the User name to glue.
• Set the Auth type to Key pair (OpenSSH or Putty). Set the Private key file by browsing to the
location where your development endpoint's private key file is located. Note that PyCharm only
supports DSA, RSA and ECDSA OpenSSH key types, and does not accept keys in Putty's private
format. You can use an up-to-date version of ssh-keygen to generate a key-pair type that
PyCharm accepts, using syntax like the following:

ssh-keygen -t rsa -f my_key_file_name -C "[email protected]"

• Choose Test SFTP connection, and allow the connection to be tested. If the connection succeeds,
choose Apply.

The Settings screen should now look something like the following:


Again, leave the Settings dialog box open after you choose Apply.
6. Map the local directory to a remote directory for deployment:

• In the right pane of the Deployment page, choose the middle tab at the top, labeled Mappings.
• In the Deployment Path column, enter a path under /home/glue/scripts/ for deployment of
your project path.
• Choose Apply.

The Settings screen should now look something like the following:


Choose OK to close the Settings dialog box.

Deploying the Script to Your Development Endpoint


To deploy your script to the development endpoint, choose Tools, Deployment, and then choose the
name under which you set up your development endpoint, as shown in the following image:

After your script has been deployed, the bottom of the screen should look something like the following:


Starting the Debug Server on localhost and a Local Port


To start the debug server, take the following steps:

1. Choose Run, Edit Configuration.


2. Expand Defaults in the left pane, and choose Python Remote Debug.
3. Enter a port number, such as 9001, for the Port:

4. Note items 2 and 3 in the instructions in this screen. The script file that you created does import
pydevd. But in the call to settrace, it replaces localhost with 169.254.76.0, which is a special
link local IP address that is accessible to your development endpoint.
5. Choose Apply to save this default configuration.
6. Choose the + icon at the top of the screen to create a new configuration based on the default that
you just saved. In the drop-down menu, choose Python Remote Debug. Name this configuration
demoDevEndpoint, and choose OK.
7. On the Run menu, choose Debug 'demoDevEndpoint'. Your screen should now look something like
the following:


Initiating Port Forwarding


To invoke silent-mode remote port forwarding over SSH, open a terminal window that supports SSH,
such as Bash (or on Windows, Git Bash). Type this command with the replacements that follow:

ssh -i private-key-file-path -nNT -g -R :9001:localhost:9001 glue@ec2-12-345-678-9.compute-1.amazonaws.com

Replacements

• Replace private-key-file-path with the path to the private-key .pem file that corresponds to
your development endpoint's public key.
• Replace ec2-12-345-678-9.compute-1.amazonaws.com with the public address of your
development endpoint. You can find the public address in the AWS Glue console by choosing Dev
endpoints. Then choose the name of the development endpoint to open its Endpoint details page.

Running Your Script on the Development Endpoint


To run your script on your development endpoint, open another terminal window that supports SSH, and
type this command with the replacements that follow:

ssh -i private-key-file-path \
    glue@ec2-12-345-678-9.compute-1.amazonaws.com \
    -t gluepython deployed-script-path/script-name


Replacements

• Replace private-key-file-path with the path to the private-key .pem file that corresponds to
your development endpoint's public key.
• Replace ec2-12-345-678-9.compute-1.amazonaws.com with the public address of your
development endpoint. You can find the public address in the AWS Glue console by navigating to Dev
endpoints. Then choose the name of the development endpoint to open its Endpoint details page.
• Replace deployed-script-path with the path that you entered in the Deployment Mappings tab
(for example, /home/glue/scripts/legislators/).
• Replace script-name with the name of the script that you uploaded (for example,
get_person_schema.py).

PyCharm now prompts you to provide a local source file equivalent to the one being debugged remotely:

Choose Autodetect.

You are now set up to debug your script remotely on your development endpoint.

Managing Notebooks
A notebook enables interactive development and testing of your ETL (extract, transform, and load)
scripts on a development endpoint. AWS Glue provides an interface to Amazon SageMaker notebooks
and Apache Zeppelin notebook servers.

• Amazon SageMaker provides an integrated Jupyter authoring notebook instance. With AWS Glue, you
create and manage Amazon SageMaker notebooks. You can also open Amazon SageMaker notebooks
from the AWS Glue console.


In addition, you can use Apache Spark with Amazon SageMaker on AWS Glue development endpoints
that support Amazon SageMaker (but not AWS Glue ETL jobs). SageMaker Spark is an open source
Apache Spark library for Amazon SageMaker. For more information, see Using Apache Spark with
Amazon SageMaker.
• Apache Zeppelin notebook servers are run on Amazon EC2 instances. You can create these instances
on the AWS Glue console.

For more information about creating and accessing your notebooks using the AWS Glue console, see
Working with Notebooks on the AWS Glue Console (p. 185).

For more information about creating development endpoints, see Working with Development Endpoints
on the AWS Glue Console (p. 157).

Important
Managing Amazon SageMaker notebooks with AWS Glue development endpoints is available
in the following AWS Regions:

Region Code

US East (Ohio) us-east-2

US East (N. Virginia) us-east-1

US West (N. California) us-west-1

US West (Oregon) us-west-2

Asia Pacific (Tokyo) ap-northeast-1

Asia Pacific (Seoul) ap-northeast-2

Asia Pacific (Mumbai) ap-south-1

Asia Pacific (Singapore) ap-southeast-1

Asia Pacific (Sydney) ap-southeast-2

Canada (Central) ca-central-1

EU (Frankfurt) eu-central-1

EU (Ireland) eu-west-1

EU (London) eu-west-2

Topics
• Creating a Notebook Server Associated with a Development Endpoint (p. 179)
• Working with Notebooks on the AWS Glue Console (p. 185)


Creating a Notebook Server Associated with a Development Endpoint
One method for testing your ETL code is to use an Apache Zeppelin notebook running on an Amazon
Elastic Compute Cloud (Amazon EC2) instance. When you use AWS Glue to create a notebook server on
an Amazon EC2 instance, there are several actions you must take to set up your environment securely.
The development endpoint is built to be accessed from a single client. To simplify your setup, start by
creating a development endpoint that is used from a notebook server on Amazon EC2.

The following sections explain some of the choices to make and the actions to take to create a notebook
server securely. These instructions perform the following tasks:

• Create a development endpoint.


• Spin up a notebook server on an Amazon EC2 instance.
• Securely connect a notebook server to a development endpoint.
• Securely connect a web browser to a notebook server.

Choices on the AWS Glue Console


For more information about managing development endpoints using the AWS Glue console, see Working
with Development Endpoints on the AWS Glue Console (p. 157).

To create the development endpoint and notebook server

1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose Dev endpoints in the navigation pane, and then choose Add endpoint to create a
development endpoint.
3. Follow the steps in the wizard to create a development endpoint that you plan to associate with one
notebook server running on Amazon EC2.

In the Add SSH public key (optional) step, leave the public key empty. In a later step, you generate
and push a public key to the development endpoint and a corresponding private key to the Amazon
EC2 instance that is running the notebook server.
4. When the development endpoint is provisioned, continue with the steps to create a notebook server
on Amazon EC2. On the development endpoints list page, choose the development endpoint that
you just created. Choose Action, Create Zeppelin notebook server, and fill in the information about
your notebook server. (For more information, see the section called “Development Endpoints on the
Console” (p. 157).)
5. Choose Finish. The notebook server is created with an AWS CloudFormation stack. The AWS Glue
console provides you with the information you need to access the Amazon EC2 instance.

After the notebook server is ready, you must run a script on the Amazon EC2 instance to complete
the setup.

Actions on the Amazon EC2 Instance to Set Up Access


After you create the development endpoint and notebook server, complete the following actions to set
up the Amazon EC2 instance for your notebook server.


To set up access to the notebook server

1. If your local desktop is running Windows, you need a way to run the SSH and SCP commands to
interact with the Amazon EC2 instance. You can find instructions for connecting in the Amazon EC2
documentation. For more information, see Connecting to Your Linux Instance from Windows Using
PuTTY.
2. You can connect to your Zeppelin notebook using an HTTPS URL. This requires a Secure Sockets
Layer (SSL) certificate on your Amazon EC2 instance. The notebook server must provide web
browsers with a certificate to validate its authenticity and to allow encrypted traffic for sensitive
data such as passwords.

If you have an SSL certificate from a certificate authority (CA), copy your SSL certificate key store
onto the Amazon EC2 instance into a path that the ec2-user has write access to, such as /home/
ec2-user/. See the AWS Glue console notebook server details for the scp command to Copy
certificate. For example, open a terminal window, and enter the following command:

scp -i ec2-private-key keystore.jks ec2-user@dns-address-of-ec2-instance:~/keystore.jks

The truststore, keystore.jks, that is copied to the Amazon EC2 instance must have been created
with a password.

The ec2-private-key is the key needed to access the Amazon EC2 instance. When you created the
notebook server, you provided an Amazon EC2 key pair and saved this EC2 private key to your local
machine. You might need to edit the Copy certificate command to point to the key file on your local
machine. You can also find this key file name on the Amazon EC2 console details for your notebook
server.

The dns-address-of-ec2-instance is the address of the Amazon EC2 instance where the
keystore is copied.
Note
There are many ways to generate an SSL certificate. It is a security best practice to use a
certificate generated with a certificate authority (CA). You might need to enlist the help of
an administrator in your organization to obtain the certificate. Follow the policies of your
organization when you create a keystore for the notebook server. For more information, see
Certificate authority in Wikipedia.

Another method is to generate a self-signed certificate with a script on your notebook server
Amazon EC2 instance. However, with this method, each local machine that connects to the notebook
server must be configured to trust the certificate generated before connecting to the notebook
server. Also, when the generated certificate expires, a new certificate must be generated and trusted
on all local machines. For more information about the setup, see Self-signed certificates (p. 183).
For more information, see Self-signed certificate in Wikipedia.
3. Using SSH, connect to the Amazon EC2 instance that is running your notebook server; for example:

ssh -i ec2-private-key ec2-user@dns-address-of-ec2-instance

The ec2-private-key is the key that is needed to access the Amazon EC2 instance. When you
created the notebook server, you provided an Amazon EC2 key pair and saved this EC2 private key to
your local machine. You might need to edit the Copy certificate command to point to the key file on
your local machine. You can also find this key file name on the Amazon EC2 console details for your
notebook server.

The dns-address-of-ec2-instance is the address of the Amazon EC2 instance where the
keystore is copied.

4. From the home directory, /home/ec2-user/, run the ./setup_notebook_server.py script.


AWS Glue created and placed this script on the Amazon EC2 instance. The script performs the
following actions:

• Asks for a Zeppelin notebook password: The password is SHA-256 hashed plus salted-and-
iterated with a random 128-bit salt kept in the shiro.ini file with restricted access. This is the
best practice available to Apache Shiro, the authorization package that Apache Zeppelin uses.
• Generates SSH public and private keys: The script overwrites any existing SSH public key on
the development endpoint that is associated with the notebook server. As a result, any other
notebook servers, Read–Eval–Print Loops (REPLs), or IDEs that connect to this development
endpoint can no longer connect.
• Verifies or generates an SSL certificate: Either use an SSL certificate that was generated with
a certificate authority (CA) or generate a certificate with this script. If you copied a certificate,
the script asks for the location of the keystore file. Provide the entire path on the Amazon EC2
instance, for example, /home/ec2-user/keystore.jks. The SSL certificate is verified.

The following example output of the setup_notebook_server.py script generates a self-signed SSL certificate.

Starting notebook server setup. See AWS Glue documentation for more details.
Press Enter to continue...

Creating password for Zeppelin user admin


Type the password required to access your Zeppelin notebook:
Confirm password:
Updating user credentials for Zeppelin user admin

Zeppelin username and password saved.

Setting up SSH tunnel to devEndpoint for notebook connection.


Do you want a SSH key pair to be generated on the instance? WARNING this will replace
any existing public key on the DevEndpoint [y/n] y
Generating SSH key pair /home/ec2-user/dev.pem
Generating public/private rsa key pair.
Your identification has been saved in /home/ec2-user/dev.pem.
Your public key has been saved in /home/ec2-user/dev.pem.pub.
The key fingerprint is:
26:d2:71:74:b8:91:48:06:e8:04:55:ee:a8:af:02:22 ec2-user@ip-10-0-0-142
The key's randomart image is:
+--[ RSA 2048]----+
|.o.oooo..o. |
| o. ...+. |
| o . . .o |
| .o . o. |
| . o o S |
|E. . o |
|= |
|.. |
|o.. |
+-----------------+

Attempting to reach AWS Glue to update DevEndpoint's public key. This might take a
while.
Waiting for DevEndpoint update to complete...
Waiting for DevEndpoint update to complete...
Waiting for DevEndpoint update to complete...
DevEndpoint updated to use the public key generated.
Configuring Zeppelin server...


********************
We will configure Zeppelin to be a HTTPS server. You can upload a CA signed certificate
for the server to consume (recommended). Or you can choose to have a self-signed
certificate created.
See AWS Glue documentation for additional information on using SSL/TLS certificates.
********************

Do you have a JKS keystore to encrypt HTTPS requests? If not, a self-signed certificate
will be generated. [y/n] n
Generating self-signed SSL/TLS certificate at /home/ec2-user/ec2-192-0-2-0.compute-1.amazonaws.com.jks
Self-signed certificates successfully generated.
Exporting the public key certificate to /home/ec2-user/ec2-192-0-2-0.compute-1.amazonaws.com.der
Certificate stored in file /home/ec2-user/ec2-192-0-2-0.compute-1.amazonaws.com.der
Configuring Zeppelin to use the keystore for SSL connection...

Zeppelin server is now configured to use SSL.


SHA256 Fingerprint=53:39:12:0A:2B:A5:4A:37:07:A0:33:34:15:B7:2B:6F:ED:35:59:01:B9:43:AF:B9:50:55:E4:A2:8B

**********
The public key certificate is exported to /home/ec2-user/ec2-192-0-2-0.compute-1.amazonaws.com.der
The SHA-256 fingerprint for the certificate is
53:39:12:0A:2B:A5:4A:37:07:A0:33:34:15:B7:2B:6F:ED:35:59:01:B9:43:AF:B9:50:55:E4:A2:8B:3B:59:E6.
You may need it when importing the certificate to the client. See AWS Glue
documentation for more details.
**********

Press Enter to continue...

All settings done!

Starting SSH tunnel and Zeppelin...


autossh start/running, process 6074
Done. Notebook server setup is complete. Notebook server is ready.
See /home/ec2-user/zeppelin/logs/ for Zeppelin log files.

5. Check for errors with trying to start the Zeppelin server in the log files located at /home/ec2-
user/zeppelin/logs/.

Actions on Your Local Computer to Connect to the Zeppelin Server
After you create the development endpoint and notebook server, connect to your Zeppelin notebook.
Depending on how you set up your environment, you can connect in one of the following ways.

1. Connect with a trusted CA certificate. If you provided an SSL certificate from a certificate authority
(CA) when the Zeppelin server was set up on the Amazon EC2 instance, choose this method. To
connect with HTTPS on port 443, open a web browser and enter the URL for the notebook server.
You can find this URL on the development notebook details page for your notebook server. Enter
the contents of the HTTPS URL field; for example:

https://public-dns-address-of-ec2-instance:443


2. Connect with a self-signed certificate. If you ran the setup_notebook_server.py script to
generate an SSL certificate, first trust the connection between your web browser and the notebook
server. The details of this action vary by operating system and web browser. The general work flow is
as follows:

1. Access the SSL certificate from the local computer. For some scenarios, this requires you to copy
the SSL certificate from the Amazon EC2 instance to the local computer; for example:

scp -i path-to-ec2-private-key ec2-user@notebook-server-dns:/home/ec2-user/notebook-server-dns.der notebook-server-dns.der

2. Import and view (or view and then import) the certificate into the certificate manager that is used
by your operating system and browser. Verify that it matches the certificate generated on the
Amazon EC2 instance.

Mozilla Firefox browser:

In Firefox, you might encounter an error like Your connection is not secure. To set up the
connection, the general steps are as follows (the steps might vary by Firefox version):

1. Find the Options or Preferences page, navigate to the page and choose View Certificates. This
option might appear in the Privacy, Security, or Advanced tab.
2. In the Certificate Manager window, choose the Servers tab, and then choose Add Exception.
3. Enter the HTTPS Location of the notebook server on Amazon EC2, and then choose Get
Certificate. Choose View.
4. Verify that the Common Name (CN) matches the DNS of the notebook server Amazon EC2
instance. Also, verify that the SHA-256 Fingerprint matches that of the certificate generated on
the Amazon EC2 instance. You can find the SHA-256 fingerprint of the certificate in the output of
the setup_notebook_server.py script or by running an openssl command on the notebook
instance.

openssl x509 -noout -fingerprint -sha256 -inform der -in path-to-certificate.der

5. If the values match, confirm to trust the certificate.


6. When the certificate expires, generate a new certificate on the Amazon EC2 instance and trust it
on your local computer.

Google Chrome browser on macOS:

When using Chrome on macOS, you might encounter an error like Your connection is not private.
To set up the connection, the general steps are as follows:

1. Copy the SSL certificate from the Amazon EC2 instance to your local computer.
2. Choose Preferences or Settings to find the Settings page. Navigate to the Advanced section, and
then find the Privacy and security section. Choose Manage certificates.
3. In the Keychain Access window, navigate to the Certificates and choose File, Import items to
import the SSL certificate.
4. Verify that the Common Name (CN) matches the DNS of the notebook server Amazon EC2
instance. Also, verify that the SHA-256 Fingerprint matches that of the certificate generated on
the Amazon EC2 instance. You can find the SHA-256 fingerprint of the certificate in the output of
the setup_notebook_server.py script or by running an openssl command on the notebook
instance.


openssl x509 -noout -fingerprint -sha256 -inform der -in path-to-certificate.der

5. Trust the certificate by setting Always Trust.


6. When the certificate expires, generate a new certificate on the Amazon EC2 instance and trust it
on your local computer.

Chrome browser on Windows:

When using Chrome on Windows, you might encounter an error like Your connection is not private.
To set up the connection, the general steps are as follows:

1. Copy the SSL certificate from the Amazon EC2 instance to your local computer.
2. Find the Settings page, navigate to the Advanced section, and then find the Privacy and security
section. Choose Manage certificates.
3. In the Certificates window, navigate to the Trusted Root Certification Authorities tab, and
choose Import to import the SSL certificate.
4. Place the certificate in the Certificate store for Trusted Root Certification Authorities.
5. Trust by installing the certificate.
6. Verify that the SHA-1 Thumbprint that is displayed by the certificate in the browser matches that
of the certificate generated on the Amazon EC2 instance. To find the certificate on the browser,
navigate to the list of Trusted Root Certification Authorities, and choose the certificate Issued
To the Amazon EC2 instance. Choose to View the certificate, choose Details, and then view the
Thumbprint for sha1. You can find the corresponding SHA-1 fingerprint of the certificate by
running an openssl command on the Amazon EC2 instance.

openssl x509 -noout -fingerprint -sha1 -inform der -in path-to-certificate.der

7. When the certificate expires, generate a new certificate on the Amazon EC2 instance and trust it
on your local computer.

Microsoft Internet Explorer browser on Windows:

When using Internet Explorer on Windows, you might encounter an error like Your connection is not
private. To set up the connection, the general steps are as follows:

1. Copy the SSL certificate from the Amazon EC2 instance to your local computer.
2. Find the Internet Options page, navigate to the Content tab, and then find the Certificates
section.
3. In the Certificates window, navigate to the Trusted Root Certification Authorities tab, and
choose Import to import the SSL certificate.
4. Place the certificate in the Certificate store for Trusted Root Certification Authorities.
5. Trust by installing the certificate.
6. Verify that the SHA-1 Thumbprint that is displayed by the certificate in the browser matches that
of the certificate generated on the Amazon EC2 instance. To find the certificate on the browser,
navigate to the list of Trusted Root Certification Authorities, and choose the certificate Issued
To the Amazon EC2 instance. Choose to View the certificate, choose Details, and then view the
Thumbprint for sha1. You can find the corresponding SHA-1 fingerprint of the certificate by
running an openssl command on the Amazon EC2 instance.


openssl x509 -noout -fingerprint -sha1 -inform der -in path-to-certificate.der

7. When the certificate expires, generate a new certificate on the Amazon EC2 instance and trust it
on your local computer.

After you trust the certificate, to connect with HTTPS on port 443, open a web browser and enter
the URL for the notebook server. You can find this URL on the development notebook details page
for your notebook server. Enter the contents of the HTTPS URL field; for example:

https://public-dns-address-of-ec2-instance:443

Working with Notebooks on the AWS Glue Console


A development endpoint is an environment that you can use to develop and test your AWS Glue scripts. A
notebook enables interactive development and testing of your ETL (extract, transform, and load) scripts
on a development endpoint.

AWS Glue provides an interface to Amazon SageMaker notebooks and Apache Zeppelin notebook
servers. On the AWS Glue notebooks page, you can create Amazon SageMaker notebooks and attach
them to a development endpoint. You can also manage Zeppelin notebook servers that you created and
attached to a development endpoint. To create a Zeppelin notebook server, see Creating a Notebook
Server Hosted on Amazon EC2 (p. 159).

The Notebooks page on the AWS Glue console lists all the Amazon SageMaker notebooks and Zeppelin
notebook servers in your AWS Glue environment. You can use the console to perform several actions
on your notebooks. To display details for a notebook or notebook server, choose the notebook in the
list. Notebook details include the information that you defined when you created it using the Create
SageMaker notebook or Create Zeppelin Notebook server wizard.

Amazon SageMaker Notebooks on the AWS Glue Console


The following are some of the properties for Amazon SageMaker notebooks. The console displays some
of these properties when you view the details of a notebook.

Important
AWS Glue only manages Amazon SageMaker notebooks in certain AWS Regions. For more
information, see Managing Notebooks (p. 177).

Ensure you have permissions to manage Amazon SageMaker notebooks in the AWS Glue console. For
more information, see AWSGlueConsoleSageMakerNotebookFullAccess in Step 3: Attach a Policy to
IAM Users That Access AWS Glue (p. 15).

Notebook name

The unique name of the Amazon SageMaker notebook.


Development endpoint

The name of the development endpoint that this notebook is attached to.
Important
This development endpoint must have been created after August 15, 2018.


Status

The provisioning status of the notebook and whether it is Ready, Failed, Starting, Stopping, or
Stopped.
Failure reason

If the status is Failed, the reason for the notebook failure.


Instance type

The type of the instance used by the notebook.


IAM role

The IAM role that was used to create the Amazon SageMaker notebook.

This role has a trust relationship to Amazon SageMaker. You create this role in the IAM console.
When creating the role, choose Amazon SageMaker, and attach a policy for the notebook, such as
AWSGlueServiceSageMakerNotebookRoleDefault. For more information, see Step 7: Create an IAM
Role for Amazon SageMaker Notebooks (p. 27).

Zeppelin Notebook Servers on the AWS Glue Console


The following are some of the properties for Zeppelin notebook servers. The console displays some of
these properties when you view the details of a notebook.

Notebook server name

The unique name of the Zeppelin notebook server.


Development endpoint

The unique name that you give the endpoint when you create it.
Provisioning status

Describes whether the notebook server is CREATE_COMPLETE or ROLLBACK_COMPLETE.


Failure reason

If the status is Failed, the reason for the notebook failure.


CloudFormation stack

The name of the AWS CloudFormation stack that was used to create the notebook server.
EC2 instance

The name of the Amazon EC2 instance that is created to host your notebook. This links to the Amazon
EC2 console (https://console.aws.amazon.com/ec2/) where the instance is tagged with the key aws-
glue-dev-endpoint and value of the name of the development endpoint.
SSH to EC2 server command

Enter this command in a terminal window to connect to the Amazon EC2 instance that is running
your notebook server. The Amazon EC2 address shown in this command is either public or private,
depending on whether you chose to Attach a public IP to the notebook server EC2 instance.
Copy certificate

Example scp command to copy the keystore that is required to set up the Zeppelin notebook server
to the Amazon EC2 instance that hosts the notebook server. Run the command from a terminal
window in the directory where the Amazon EC2 private key is located. The key to access the Amazon
EC2 instance is the parameter to the -i option. You provide the path-to-keystore-file. The


remaining part of the command is the location where the development endpoint private SSH key on
the Amazon EC2 server is located.
HTTPS URL

After setting up a notebook server, enter this URL in a browser to connect to your notebook using
HTTPS.


Running and Monitoring AWS Glue


You can automate the running of your ETL (extract, transform, and load) jobs. AWS Glue also provides
metrics for crawlers and jobs that you can monitor. After you set up the AWS Glue Data Catalog with
the required metadata, AWS Glue provides statistics about the health of your environment. You can
automate the invocation of crawlers and jobs with a time-based schedule based on cron. You can also
trigger jobs when an event-based trigger fires.

The main objective of AWS Glue is to provide an easier way to extract and transform your data from
source to target. To accomplish this objective, an ETL job follows these typical steps (as shown in the
diagram that follows):

1. A trigger fires to initiate a job run. This event can be set up on a recurring schedule or to satisfy a
dependency.
2. The job extracts data from your source. If required, connection properties are used to access your
source.
3. The job transforms your data using a script that you created and the values of any arguments. The
script contains the Scala or PySpark Python code that transforms your data.
4. The transformed data is loaded to your data targets. If required, connection properties are used to
access the target.
5. Statistics are collected about the job run and are written to your Data Catalog.

The following diagram shows the ETL workflow containing these five steps.
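
To make steps 2 through 4 concrete, the following is a minimal PySpark sketch of the extract, transform, and load portion of such a job. The database, table, column names, and output path are placeholders; the scripts that AWS Glue generates for you also include argument parsing, job initialization, and job bookmark handling.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glueContext = GlueContext(SparkContext.getOrCreate())

# Extract: read the source table described in the Data Catalog (placeholder names)
source_DyF = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='my_table')

# Transform: rename and retype columns (placeholder mappings)
mapped_DyF = ApplyMapping.apply(
    frame=source_DyF,
    mappings=[('col0', 'string', 'id', 'string'),
              ('col1', 'string', 'value', 'string')])

# Load: write the transformed data to a placeholder Amazon S3 target
glueContext.write_dynamic_frame.from_options(
    frame=mapped_DyF,
    connection_type='s3',
    connection_options={'path': 's3://my-example-bucket/output/'},
    format='json')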


Topics
• Automated Monitoring Tools (p. 189)
• Time-Based Schedules for Jobs and Crawlers (p. 189)
• Tracking Processed Data Using Job Bookmarks (p. 191)
• Automating AWS Glue with CloudWatch Events (p. 196)
• Monitoring with Amazon CloudWatch (p. 196)
• Job Monitoring and Debugging (p. 210)
• Logging AWS Glue API Calls with AWS CloudTrail (p. 230)

Automated Monitoring Tools


Monitoring is an important part of maintaining the reliability, availability, and performance of AWS Glue
and your other AWS solutions. AWS provides monitoring tools that you can use to watch AWS Glue,
report when something is wrong, and take action automatically when appropriate:

You can use the following automated monitoring tools to watch AWS Glue and report when something is
wrong:

• Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes
in AWS resources. CloudWatch Events enables automated event-driven computing. You can write rules
that watch for certain events and trigger automated actions in other AWS services when these events
occur. For more information, see the Amazon CloudWatch Events User Guide. (A sketch of such a rule
for AWS Glue job events follows this list.)
• Amazon CloudWatch Logs enables you to monitor, store, and access your log files from Amazon EC2
instances, AWS CloudTrail, and other sources. CloudWatch Logs can monitor information in the log
files and notify you when certain thresholds are met. You can also archive your log data in highly
durable storage. For more information, see the Amazon CloudWatch Logs User Guide.
• AWS CloudTrail captures API calls and related events made by or on behalf of your AWS account
and delivers the log files to an Amazon S3 bucket that you specify. You can identify which users and
accounts call AWS, the source IP address from which the calls are made, and when the calls occur. For
more information, see the AWS CloudTrail User Guide.
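
As an illustration of the first item in this list, the following boto3 sketch creates a CloudWatch Events rule that matches AWS Glue job state-change events and routes them to an existing Amazon SNS topic. The rule name and topic ARN are placeholders, and the event pattern shown (source aws.glue, detail-type Glue Job State Change) is the pattern documented for AWS Glue job events; confirm it against the CloudWatch Events documentation before relying on it.

import json
import boto3

events = boto3.client('events')

# Match AWS Glue job runs that fail or time out
event_pattern = {
    'source': ['aws.glue'],
    'detail-type': ['Glue Job State Change'],
    'detail': {'state': ['FAILED', 'TIMEOUT']}
}

events.put_rule(
    Name='glue-job-failures',                     # placeholder rule name
    EventPattern=json.dumps(event_pattern))

events.put_targets(
    Rule='glue-job-failures',
    Targets=[{'Id': 'notify-sns',
              'Arn': 'arn:aws:sns:us-east-1:123456789012:my-topic'}])  # placeholder topic ARN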

Time-Based Schedules for Jobs and Crawlers


You can define a time-based schedule for your crawlers and jobs in AWS Glue. The definition of these
schedules uses the Unix-like cron syntax. You specify time in Coordinated Universal Time (UTC), and the
minimum precision for a schedule is 5 minutes.

Cron Expressions
Cron expressions have six required fields, which are separated by white space.

Syntax

cron(Minutes Hours Day-of-month Month Day-of-week Year)

Fields          Values              Wildcards

Minutes         0–59                ,-*/
Hours           0–23                ,-*/
Day-of-month    1–31                ,-*?/LW
Month           1–12 or JAN-DEC     ,-*/
Day-of-week     1–7 or SUN-SAT      ,-*?/L
Year            1970–2199           ,-*/

Wildcards

• The , (comma) wildcard includes additional values. In the Month field, JAN,FEB,MAR would include
January, February, and March.
• The - (dash) wildcard specifies ranges. In the Day field, 1–15 would include days 1 through 15 of the
specified month.
• The * (asterisk) wildcard includes all values in the field. In the Hours field, * would include every hour.
• The / (forward slash) wildcard specifies increments. In the Minutes field, you could enter 1/10 to
specify every 10th minute, starting from the first minute of the hour (for example, the 11th, 21st, and
31st minute).
• The ? (question mark) wildcard specifies one or another. In the Day-of-month field you could enter
7, and if you didn't care what day of the week the seventh was, you could enter ? in the Day-of-week
field.
• The L wildcard in the Day-of-month or Day-of-week fields specifies the last day of the month or
week.
• The W wildcard in the Day-of-month field specifies a weekday. In the Day-of-month field, 3W
specifies the day closest to the third weekday of the month.

Limits

• You can't specify the Day-of-month and Day-of-week fields in the same cron expression. If you
specify a value in one of the fields, you must use a ? (question mark) in the other.
• Cron expressions that lead to rates faster than 5 minutes are not supported.

Examples

When creating a schedule, you can use the following sample cron strings.

Minutes   Hours   Day of month   Month   Day of week   Year   Meaning

0         10      *              *       ?             *      Run at 10:00 am (UTC) every day
15        12      *              *       ?             *      Run at 12:15 pm (UTC) every day
0         18      ?              *       MON-FRI       *      Run at 6:00 pm (UTC) every Monday through Friday
0         8       1              *       ?             *      Run at 8:00 am (UTC) every first day of the month
0/15      *       *              *       ?             *      Run every 15 minutes
0/10      *       ?              *       MON-FRI       *      Run every 10 minutes Monday through Friday
0/5       8-17    ?              *       MON-FRI       *      Run every 5 minutes Monday through Friday between 8:00 am and 5:55 pm (UTC)

For example, to run on a schedule of every day at 12:15 UTC, specify:

cron(15 12 * * ? *)
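
You can attach the same kind of schedule to a job programmatically. The following boto3 sketch creates a scheduled trigger that starts a hypothetical job named my-etl-job with the cron expression above; the trigger name and job name are placeholders.

import boto3

glue = boto3.client('glue')

# Create a trigger that starts my-etl-job at 12:15 pm (UTC) every day
glue.create_trigger(
    Name='daily-1215-utc-trigger',        # placeholder trigger name
    Type='SCHEDULED',
    Schedule='cron(15 12 * * ? *)',
    Actions=[{'JobName': 'my-etl-job'}],  # placeholder job name
    StartOnCreation=True)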

Tracking Processed Data Using Job Bookmarks


AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting
state information from the job run. This persisted state information is called a job bookmark. Job
bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With
job bookmarks, you can process new data when rerunning on a scheduled interval. A job bookmark is
composed of the states for various elements of jobs, such as sources, transformations, and targets. For
example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue tracks which partitions
the job has processed successfully to prevent duplicate processing and duplicate data in the job's target
data store.

Job bookmarks are implemented for some Amazon Simple Storage Service (Amazon S3) sources and the
Relationalize transform. AWS Glue supports job bookmarks for the Amazon S3 source formats JSON,
CSV, Apache Avro, and XML. The Apache Parquet and ORC formats are currently not supported.

Job bookmarks are implemented for a limited use case for a relational database (JDBC connection)
input source. For this input source, job bookmarks are supported only if the table's primary keys are
in sequential order. Also, job bookmarks search for new rows, but not updated rows. This is because
bookmarks look for the primary keys, which already exist.


Topics
• Using Job Bookmarks in AWS Glue (p. 192)
• Using Job Bookmarks with the AWS Glue Generated Script (p. 193)
• Tracking Files Using Modification Timestamps (p. 194)

Using Job Bookmarks in AWS Glue


On the AWS Glue console, a job bookmark option is passed as a parameter when the job is started. The
following table describes the options for setting job bookmarks in AWS Glue.

Job bookmark Description

Enable Causes the job to update the state after a run to keep track of previously
processed data. If your job has a source with job bookmark support, it will
keep track of processed data, and when a job runs, it processes new data
since the last checkpoint.

Disable Job bookmarks are not used, and the job always processes the entire
dataset. You are responsible for managing the output from previous job
runs. This is the default.

Pause Process incremental data since the last run. The job reads the state
information from the last run but does not update it. You can use this so
that every subsequent run processes data since the last bookmark. You are
responsible for managing the output from previous job runs.

For details about the parameters passed to a job, and specifically for a job bookmark, see Special
Parameters Used by AWS Glue (p. 244).

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to
verify which objects need to be reprocessed. If your input source data has been modified since your last
job run, the files are reprocessed when you run the job again.

If you intend to reprocess all the data using the same job, reset the job bookmark. To reset
the job bookmark state, use the AWS Glue console, the ResetJobBookmark Action (Python:
reset_job_bookmark) (p. 465) API operation, or the AWS CLI. For example, enter the following
command using the AWS CLI:

aws glue reset-job-bookmark --job-name my-job-name
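
The same reset is available through the AWS SDKs. The following is a minimal boto3 sketch; the job
name is a placeholder.

import boto3

glue = boto3.client("glue")

# Reset the bookmark so that the next run reprocesses all of the data.
glue.reset_job_bookmark(JobName="my-job-name")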

AWS Glue keeps track of job bookmarks by job. If you delete a job, the job bookmark is deleted.

In some cases, you might have enabled AWS Glue job bookmarks but your ETL job is reprocessing data
that was already processed in an earlier run. For information about resolving common causes of this
error, see Troubleshooting Errors in AWS Glue (p. 235).

Transformation Context
Many of the AWS Glue PySpark dynamic frame methods include an optional parameter
named transformation_ctx, which is a unique identifier for the ETL operator instance. The


transformation_ctx parameter is used to identify state information within a job bookmark for the
given operator. Specifically, AWS Glue uses transformation_ctx to index the key to the bookmark
state.

For job bookmarks to work properly, enable the job bookmark parameter and set the
transformation_ctx parameter. If you don't pass in the transformation_ctx parameter, then
job bookmarks are not enabled for a dynamic frame or a table used in the method. For example,
if you have an ETL job that reads and joins two Amazon S3 sources, you might choose to pass the
transformation_ctx parameter only to those methods that you want to enable bookmarks. If you
reset the job bookmark for a job, it resets all transformations that are associated with the job regardless
of the transformation_ctx used.
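
For example, the following sketch enables a bookmark for only the first of two Amazon S3 sources by
passing transformation_ctx to only that method. The paths, key names, and context names are
placeholders, and glueContext is assumed to be initialized as in the generated script that follows.

from awsglue.transforms import Join

# Bookmarked source: its state is tracked in the job bookmark under "incremental_src".
incremental = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://incremental_path"]},
    format="json",
    transformation_ctx="incremental_src",
)

# Reference source: no transformation_ctx, so it is always read in full.
reference = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://reference_path"]},
    format="json",
)

joined = Join.apply(incremental, reference, "id", "id")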

For more information about the DynamicFrameReader class, see DynamicFrameReader Class (p. 290).
For more information about PySpark extensions, see AWS Glue PySpark Extensions Reference (p. 273).

Using Job Bookmarks with the AWS Glue Generated Script
This section describes more of the operational details of using job bookmarks. It also provides an
example of a script that you can generate from AWS Glue when you choose a source and destination and
run a job.

Job bookmarks store the states for a job. Each instance of the state is keyed by a job name and a version
number. When a script invokes job.init, it retrieves its state and always gets the latest version. Within
a state, there are multiple state elements, which are specific to each source, transformation, and sink
instance in the script. These state elements are identified by a transformation context that is attached to
the corresponding element (source, transformation, or sink) in the script. The state elements are saved
atomically when job.commit is invoked from the user script. The script gets the job name and the
control option for the job bookmarks from the arguments.

The state elements in the job bookmark are source, transformation, or sink-specific data. For example,
suppose that you want to read incremental data from an Amazon S3 location that is being constantly
written to by an upstream job or process. In this case, the script must determine what has been
processed so far. The job bookmark implementation for the Amazon S3 source saves information so that
when the job runs again, it can filter only the new objects using the saved information and recompute
the state for the next run of the job. A timestamp is used to filter the new files.

In addition to the state elements, job bookmarks have a run number, an attempt number, and a version
number. The run number tracks the run of the job, and the attempt number records the attempts for
a job run. The job run number is a monotonically increasing number that is incremented for every
successful run. The attempt number tracks the attempts for each run, and is only incremented when
there is a run after a failed attempt. The version number increases monotonically and tracks the updates
to a job bookmark.

The following is an example of the generated script. The script and its associated arguments illustrate
the various elements that are required for using job bookmarks: the job.init and job.commit calls, the
transformation_ctx parameters, and the job bookmark option passed as a job argument. For more
information about these elements, see the GlueContext Class (p. 291) API and the DynamicFrameWriter
Class (p. 288) API.

# Sample Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "database", table_name = "relatedqueries_csv", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "database",
    table_name = "relatedqueries_csv", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("col0", "string", "name", "string"), ("col1", "string", "number", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("col0", "string", "name", "string"),
    ("col1", "string", "number", "string")], transformation_ctx = "applymapping1")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://input_path"}, format = "json", transformation_ctx = "datasink2"]
## @return: datasink2
## @inputs: [frame = applymapping1]
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1,
    connection_type = "s3", connection_options = {"path": "s3://input_path"}, format = "json",
    transformation_ctx = "datasink2")

job.commit()

Job Arguments:

--job-bookmark-option, job-bookmark-enable
--JOB_NAME, name-1-s3-2-s3-encrypted

Tracking Files Using Modification Timestamps


For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the files to verify
which objects need to be reprocessed.

Consider the following example. In the diagram, the X axis is a time axis, from left to right, with the
leftmost point being T0. The Y axis is the list of files observed at time T. The elements representing the
list are placed in the graph based on their modification time.


In this example, when a job starts at modification timestamp 1 (T1), it looks for files that have a
modification time greater than T0 and less than or equal to T1. Those files are F2, F3, F4, and F5. The job
bookmark stores the timestamps T0 and T1 as the low and high timestamps respectively.

When the job reruns at T2, it filters files that have a modification time greater than T1 and less than or
equal to T2. Those files are F7, F8, F9, and F10. It thereby misses the files F3', F4', and F5'. These files
have a modification time less than or equal to T1, but they show up in the listing only after T1 because
of Amazon S3 eventual list consistency.

To account for Amazon S3 eventual consistency, AWS Glue includes a list of files (or path hash) in the
job bookmark. AWS Glue assumes that the Amazon S3 file list is only inconsistent up to a finite period
(dt) before the current time. That is, the file list for files with a modification time between T1 - dt and
T1, when the listing is done at T1, is inconsistent. However, the list of files with a modification time less
than or equal to T1 - dt is consistent at any time greater than or equal to T1.

You specify the period of time in which AWS Glue saves the file list (the window in which the listing
might still be inconsistent) by using the MaxBand option in the AWS Glue connection options. The
default value is 900 seconds (15 minutes). For more information about this property, see Connection
Types and Options for ETL in AWS Glue (p. 245).
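
For example, the following sketch widens this window to one hour when reading with a dynamic frame.
The path and context name are placeholders, and the option key is assumed to be spelled maxBand (in
seconds), as in the connection options reference.

# A minimal sketch: read JSON from Amazon S3 with a 3600-second consistency window
# instead of the 900-second default. The path and context name are placeholders.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://input_path"],
        "recurse": True,
        "maxBand": 3600,
    },
    format="json",
    transformation_ctx="datasource0",
)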

When the job reruns at timestamp 2 (T2), it lists the files in the following ranges:

• T1 - dt (exclusive) to T1 (inclusive). This list includes F4, F5, F4', and F5'. This range is consistent now,
but it was inconsistent for the listing done at T1, so the files F3, F4, and F5 were saved in the bookmark.
To get the files to process at T2, the already processed files F3, F4, and F5 are removed from this list.
• T1 (exclusive) to T2 - dt (inclusive). This list includes F7 and F8. This is a consistent range.
• T2 - dt (exclusive) to T2 (inclusive). This list includes F9 and F10. This is an inconsistent range.


The resultant list of files is F3', F4', F5', F7, F8, F9, and F10.

The new files in the inconsistent list are F9 and F10, which are saved in the filter for the next run.

For more information about Amazon S3 eventual consistency, see Introduction to Amazon S3 in the
Amazon Simple Storage Service Developer Guide.

Job Run Failures


A job run version increments when a job fails. For example, if a job run at timestamp 1 (T1) fails, and
it is rerun at T2, it advances the high timestamp to T2. Then, when the job is run at a later point T3, it
advances the high timestamp to T3.

If a job run fails before the job.commit() (at T1), the files are processed in a subsequent run, in which
AWS Glue processes the files from T0 to T2.

Automating AWS Glue with CloudWatch Events


You can use Amazon CloudWatch Events to automate your AWS services and respond automatically to
system events such as application availability issues or resource changes. Events from AWS services are
delivered to CloudWatch Events in near real time. You can write simple rules to indicate which events are
of interest to you, and what automated actions to take when an event matches a rule. The actions that
can be automatically triggered include the following:

• Invoking an AWS Lambda function


• Invoking Amazon EC2 Run Command
• Relaying the event to Amazon Kinesis Data Streams
• Activating an AWS Step Functions state machine
• Notifying an Amazon SNS topic or an Amazon SQS queue

Some examples of using CloudWatch Events with AWS Glue include the following:

• Activating a Lambda function when an ETL job succeeds


• Notifying an Amazon SNS topic when an ETL job fails

The following CloudWatch Events are generated by AWS Glue.

• Events for "detail-type":"Glue Job State Change" are generated for SUCCEEDED, FAILED,
TIMEOUT, and STOPPED.
• Events for "detail-type":"Glue Job Run Status" are generated for RUNNING, STARTING, and
STOPPING job runs when they exceed the job delay notification threshold.
• Events for "detail-type":"Glue Crawler State Change" are generated for Started,
Succeeded, and Failed.

For more information, see the Amazon CloudWatch Events User Guide. For events specific to AWS Glue,
see AWS Glue Events.
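
As a sketch of the second example (notifying an Amazon SNS topic when an ETL job fails), the following
boto3 code creates a rule for failed job runs and points it at an existing topic. The rule name, target ID,
and topic ARN are placeholders, and the assumption that the event detail carries the job state in a field
named state is illustrative. The topic's access policy must also allow CloudWatch Events to publish to it.

import json
import boto3

events = boto3.client("events")

# Match AWS Glue job runs that end in the FAILED state.
events.put_rule(
    Name="glue-job-failed",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED"]},   # detail field name assumed for this sketch
    }),
)

# Send matching events to an existing SNS topic (the ARN is a placeholder).
events.put_targets(
    Rule="glue-job-failed",
    Targets=[{"Id": "notify-sns", "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
)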

Monitoring with Amazon CloudWatch


You can monitor AWS Glue using CloudWatch, which collects and processes raw data from AWS Glue
into readable, near real-time metrics. These statistics are recorded for a period of two weeks, so
that you can access historical information for a better perspective on how your web application or
service is performing. By default, AWS Glue metric data is sent to CloudWatch automatically. For more
information, see What Is Amazon CloudWatch? in the Amazon CloudWatch User Guide, and AWS Glue
Metrics (p. 198).

Topics
• Monitoring AWS Glue Using CloudWatch Metrics (p. 197)
• Setting Up Amazon CloudWatch Alarms on AWS Glue Job Profiles (p. 210)

Monitoring AWS Glue Using CloudWatch Metrics


You can profile and monitor AWS Glue operations using AWS Glue Job Profiler, which collects and
processes raw data from AWS Glue jobs into readable, near real-time metrics stored in CloudWatch.
These statistics are retained and aggregated in CloudWatch so that you can access historical information
for a better perspective on how your application is performing.

AWS Glue Metrics Overview


When you interact with AWS Glue, it sends the metrics described below to CloudWatch. You can view
these metrics in the AWS Glue console (the preferred method), the CloudWatch console dashboard, or
the AWS CLI.

To view metrics using the AWS Glue console dashboard


You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run. For
details about the graphs and metrics you can access in the AWS Glue console dashboard, see Working
with Jobs on the AWS Glue Console (p. 146).

1. Open the AWS Glue console at https://console.aws.amazon.com/glue/


2. In the navigation pane, choose Jobs.
3. Select a job from the Jobs list.
4. Select the Metrics tab.
5. Select View Additional Metrics to see more detailed metrics.

To view metrics using the CloudWatch console dashboard


Metrics are grouped first by the service namespace, and then by the various dimension combinations
within each namespace.

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. In the navigation pane, choose Metrics.
3. Select the Glue namespace.

To view metrics using the AWS CLI

• At a command prompt, use the following command:

aws cloudwatch list-metrics --namespace "Glue"
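
To retrieve a specific metric programmatically, you can also call the CloudWatch API directly. The
following is a minimal boto3 sketch that pulls one of the metrics described later in this section; the job
name is a placeholder.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum of records read by a job over the last hour, at one-minute resolution.
# The job name is a placeholder; JobRunId "ALL" aggregates across job runs.
response = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.driver.aggregate.recordsRead",
    Dimensions=[
        {"Name": "JobName", "Value": "my-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Sum"],
)
print(response["Datapoints"])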

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch Metric Dashboards
are configured to display them every minute. The AWS Glue metrics represent delta values from
the previously reported values. Where appropriate, the Metric Dashboards aggregate (sum) the 30-
second values to obtain a value for the entire last minute. AWS Glue metrics are enabled at initialization
of a GlueContext in a script and are generally updated only at the end of an Apache Spark task. They
represent the aggregate values across all completed Spark tasks so far.

The Spark metrics that AWS Glue passes on to CloudWatch, on the other hand, are generally absolute
values representing the current state at the time they are reported. AWS Glue reports them to
CloudWatch every 30 seconds, and the Metric Dashboards generally show the average across the data
points received in the last one minute.

AWS Glue metric names are all preceded by one of three kinds of prefix:

• glue.driver. – Metrics whose names begin with this prefix either represent AWS Glue Metrics that
are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark
driver.
• glue.executorId. – The executorId is the number of a specific Spark executor, and corresponds with
the executors listed in the logs.
• glue.ALL. – Metrics whose names begin with this prefix aggregate values from all Spark executors.

AWS Glue Metrics


AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue
Metrics Dashboard reports them once a minute:

Metric Description

glue.driver.aggregate.bytesRead The number of bytes read from all data sources by all
completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor:

• Bytes read.
• Job progress.
• JDBC data sources.
• Job Bookmark Issues.
• Variance across Job Runs.

This metric can be used the same way as the


glue.ALL.s3.filesystem.read_bytes metric, with
the difference that this metric is updated at the end of a
Spark task and captures non-S3 data sources as well.

glue.driver.aggregate.elapsedTime The ETL elapsed time in milliseconds (does not include


the job bootstrap times).

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from
the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Milliseconds

Can be used to determine how long it takes a job run to


run on average.

Some ways to use the data:

• Set alarms for stragglers.


• Measure variance across job runs.

glue.driver.aggregate.numCompletedStages The number of completed stages in the job.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

• Job progress.
• Per-stage timeline of job execution, when correlated
with other metrics.

Some ways to use the data:

• Identify demanding stages in the execution of a job.


• Set alarms for correlated spikes (demanding stages)
across job runs.

glue.driver.aggregate.numCompletedTasks The number of completed tasks in the job.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

• Job progress.
• Parallelism within a stage.


Metric Description

glue.driver.aggregate.numFailedTasks The number of failed tasks.

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

• Data abnormalities that cause job tasks to fail.


• Cluster abnormalities that cause job tasks to fail.
• Script abnormalities that cause job tasks to fail.

The data can be used to set alarms for increased failures


that might suggest abnormalities in data, cluster or
scripts.

glue.driver.aggregate.numKilledTasks The number of tasks killed.

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

• Abnormalities in Data Skew that result in exceptions


(OOMs) that kill tasks.
• Script abnormalities that result in exceptions (OOMs)
that kill tasks.

Some ways to use the data:

• Set alarms for increased failures indicating data


abnormalities.
• Set alarms for increased failures indicating cluster
abnormalities.
• Set alarms for increased failures indicating script
abnormalities.


Metric Description

glue.driver.aggregate.recordsRead The number of records read from all data sources by all
completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

• Records read.
• Job progress.
• JDBC data sources.
• Job Bookmark Issues.
• Skew in Job Runs over days.

This metric can be used in a similar way to the


glue.ALL.s3.filesystem.read_bytes metric, with
the difference that this metric is updated at the end of a
Spark task.

glue.driver.aggregate.shuffleBytesWritten The number of bytes written by all executors to shuffle data
between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number
of bytes written for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins,


groupBy, repartition, coalesce).

Some ways to use the data:

• Repartition or decompress large input files before


further processing.
• Repartition data more uniformly to avoid hot keys.
• Pre-filter data before joins or groupBy operations.


Metric Description

glue.driver.aggregate.shuffleLocalBytesRead The number of bytes read by all executors to shuffle data
between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number
of bytes read for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(count).

Valid Statistics: SUM. This metric is a delta value from


the last reported value, so on the AWS Glue Metrics
Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins,


groupBy, repartition, coalesce).

Some ways to use the data:

• Repartition or decompress large input files before


further processing.
• Repartition data more uniformly to avoid hot keys.
• Pre-filter data before joins or groupBy operations.

glue.driver.BlockManager.disk.diskSpaceUsed_MB The number of megabytes of disk space used across
all executors.

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(gauge).

Valid Statistics: Average. This is a Spark metric, reported


as an absolute value.

Unit: Megabytes

Can be used to monitor:

• Disk space used for blocks that represent cached RDD


partitions.
• Disk space used for blocks that represent
intermediate shuffle outputs.
• Disk space used for blocks that represent broadcasts.

Some ways to use the data:

• Identify job failures due to increased disk usage.


• Identify large partitions resulting in spilling or
shuffling.
• Increase provisioned DPU capacity to correct these
issues.


Metric Description

glue.driver.ExecutorAllocationManager.executors.numberAllExecutors The number of actively running
job executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type
(gauge).

Valid Statistics: Average. This is a Spark metric, reported


as an absolute value.

Unit: Count

Can be used to monitor:

• Job activity.
• Straggling executors (with only a few executors running).
• Current executor-level parallelism.

Some ways to use the data:

• Repartition or decompress large input files


beforehand if cluster is under-utilized.
• Identify stage or job execution delays due to straggler
scenarios.
• Compare with numberMaxNeededExecutors to understand the backlog for provisioning more DPUs.


Metric Description

glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors The number of
maximum (actively running and pending) job executors needed to satisfy the current load.

Valid dimensions: JobName (the name of the AWS Glue


Job), JobRunId (the JobRun ID. or ALL), and Type
(gauge).

Valid Statistics: Maximum. This is a Spark metric,


reported as an absolute value.

Unit: Count

Can be used to monitor:

• Job activity.
• Current executor-level parallelism and backlog
of pending tasks not yet scheduled because of
unavailable executors due to DPU capacity or killed/
failed executors.

Some ways to use the data:

• Identify pending/backlog of scheduling queue.


• Identify stage or job execution delays due to straggler
scenarios.
• Compare with numberAllExecutors to understand
backlog for provisioning more DPUs.
• Increase provisioned DPU capacity to correct the
pending executor backlog.


Metric Description

glue.driver.jvm.heap.usage

glue.executorId.jvm.heap.usage

glue.ALL.jvm.heap.usage

The fraction of memory used by the JVM heap (scale: 0-1) for the driver, an executor identified by
executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type
(gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Percentage

Can be used to monitor:

• Driver out-of-memory conditions (OOM) using


glue.driver.jvm.heap.usage.
• Executor out-of-memory conditions (OOM) using
glue.ALL.jvm.heap.usage.

Some ways to use the data:

• Identify memory-consuming executor ids and stages.


• Identify straggling executor ids and stages.
• Identify a driver out-of-memory condition (OOM).
• Identify an executor out-of-memory condition (OOM)
and obtain the corresponding executor ID so as to be
able to get a stack trace from the executor log.
• Identify files or partitions that may have data skew
resulting in stragglers or out-of-memory conditions
(OOMs).


Metric Description

glue.driver.jvm.heap.used

glue.executorId.jvm.heap.used

glue.ALL.jvm.heap.used

The number of memory bytes used by the JVM heap for the driver, the executor identified by
executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type
(gauge).

Valid Statistics: Average. This is a Spark metric, reported


as an absolute value.

Unit: Bytes

Can be used to monitor:

• Driver out-of-memory conditions (OOM).


• Executor out-of-memory conditions (OOM).

Some ways to use the data:

• Identify memory-consuming executor ids and stages.


• Identify straggling executor ids and stages.
• Identify a driver out-of-memory condition (OOM).
• Identify an executor out-of-memory condition (OOM)
and obtain the corresponding executor ID so as to be
able to get a stack trace from the executor log.
• Identify files or partitions that may have data skew
resulting in stragglers or out-of-memory conditions
(OOMs).


Metric Description

glue.driver.s3.filesystem.read_bytes

glue.executorId.s3.filesystem.read_bytes

glue.ALL.s3.filesystem.read_bytes

The number of bytes read from Amazon S3 by the driver, an executor identified by executorId, or ALL
executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of
bytes read during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).
Valid Statistics: SUM. This metric is a delta value from
the last reported value, so on the AWS Glue Metrics
Dashboard a SUM statistic is used for aggregation.
The area under the curve on the AWS Glue Metrics
Dashboard can be used to visually compare bytes read
by two different job runs.

Unit: Bytes.

Can be used to monitor:

• ETL data movement.


• Job progress.
• Job bookmark issues (data processed, reprocessed,
and skipped).
• Comparison of reads to ingestion rate from external
data sources.
• Variance across job runs.

Resulting data can be used for:

• DPU capacity planning.


• Setting alarms for large spikes or dips in data read for
job runs and job stages.


Metric Description

glue.driver.s3.filesystem.write_bytes

glue.executorId.s3.filesystem.write_bytes

glue.ALL.s3.filesystem.write_bytes

The number of bytes written to Amazon S3 by the driver, an executor identified by executorId, or ALL
executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of
bytes written during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).
Valid Statistics: SUM. This metric is a delta value from
the last reported value, so on the AWS Glue Metrics
Dashboard a SUM statistic is used for aggregation.
The area under the curve on the AWS Glue Metrics
Dashboard can be used to visually compare bytes
written by two different job runs.

Unit: Bytes

Can be used to monitor:

• ETL data movement.


• Job progress.
• Job bookmark issues (data processed, reprocessed,
and skipped).
• Comparison of reads to ingestion rate from external
data sources.
• Variance across job runs.

Some ways to use the data:

• DPU capacity planning.


• Setting alarms for large spikes or dips in data read for
job runs and job stages.


Metric Description

glue.driver.system.cpuSystemLoad

glue.executorId.system.cpuSystemLoad

glue.ALL.system.cpuSystemLoad

The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by executorId, or
ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type
(gauge).

Valid Statistics: Average. This metric is reported as an absolute value.

Unit: Percentage

Can be used to monitor:

• Driver CPU load.


• Executor CPU load.
• Detecting CPU-bound or IO-bound executors or
stages in a Job.

Some ways to use the data:

• DPU capacity planning, along with IO metrics (bytes read/shuffle bytes, task parallelism) and the
number of maximum needed executors metric.
• Identify the CPU/IO-bound ratio. This allows for repartitioning and increasing provisioned capacity
for long-running jobs with splittable datasets that have lower CPU utilization.

Dimensions for AWS Glue Metrics


AWS Glue metrics use the AWS Glue namespace and provide metrics for the following dimensions:

Dimension Description

JobName This dimension filters for metrics of all job runs of a


specific AWS Glue job.

JobRunId This dimension filters for metrics of a specific AWS Glue


job run by a JobRun ID, or ALL.

Type This dimension filters for metrics by either count (an


aggregate number) or gauge (a value at a point in
time).

For more information, see the CloudWatch User Guide.


Setting Up Amazon CloudWatch Alarms on AWS Glue Job Profiles
AWS Glue metrics are also available in Amazon CloudWatch. You can set up alarms on any AWS Glue
metric for scheduled jobs.

A few common scenarios for setting up alarms are as follows:

• Jobs running out of memory (OOM): Set an alarm when the memory usage exceeds the normal
average for either the driver or an executor for an AWS Glue job (a sketch follows this list).
• Straggling executors: Set an alarm when the number of executors falls below a certain threshold for a
large duration of time in an AWS Glue job.
• Data backlog or reprocessing: Compare the metrics from individual jobs in a workflow using a
CloudWatch math expression. You can then trigger an alarm on the resulting expression value (such as
the ratio of bytes written by a job and bytes read by a following job).
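
For example, the first scenario in the preceding list (an out-of-memory alarm on driver heap usage) can
be expressed with the following minimal boto3 sketch. The job name and topic ARN are placeholders,
and the 0.5 threshold corresponds to the 50 percent safe threshold used in the debugging sections that
follow.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average driver JVM heap usage stays above 50 percent for three
# consecutive 1-minute periods. The job name and SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="my-etl-job-driver-heap-high",
    Namespace="Glue",
    MetricName="glue.driver.jvm.heap.usage",
    Dimensions=[
        {"Name": "JobName", "Value": "my-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0.5,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:glue-alerts"],
)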

For detailed instructions on setting alarms, see Create or Edit a CloudWatch Alarm in the Amazon
CloudWatch User Guide.

For monitoring and debugging scenarios using CloudWatch, see Job Monitoring and
Debugging (p. 210).

Job Monitoring and Debugging


You can collect metrics about AWS Glue jobs and visualize them on the AWS Glue and Amazon
CloudWatch consoles to identify and fix issues. Profiling your AWS Glue jobs requires the following steps:

1. Enable the Job metrics option in the job definition. You can enable profiling in the AWS Glue console
or as a parameter to the job. For more information see Defining Job Properties (p. 142) or Special
Parameters Used by AWS Glue (p. 244).
2. Confirm that the job script initializes a GlueContext. For example, the following script snippet
initializes a GlueContext and shows where profiled code is placed in the script. This general format is
used in the debugging scenarios that follow.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

...
...


code-to-profile
...
...

job.commit()

3. Run the job.


4. Visualize the metrics on the AWS Glue console and identify abnormal metrics for the driver or an
executor.
5. Narrow down the root cause using the identified metric.
6. Optionally, confirm the root cause using the log stream of the identified driver or job executor.

Topics
• Debugging OOM Exceptions and Job Abnormalities (p. 211)
• Debugging Demanding Stages and Straggler Tasks (p. 218)
• Monitoring the Progress of Multiple Jobs (p. 222)
• Monitoring for DPU Capacity Planning (p. 226)

Debugging OOM Exceptions and Job Abnormalities


You can debug out-of-memory (OOM) exceptions and job abnormalities in AWS Glue. The following
sections describe scenarios for debugging out-of-memory exceptions of the Apache Spark driver or a
Spark executor.

• Debugging a Driver OOM Exception (p. 211)


• Debugging an Executor OOM Exception (p. 214)

Debugging a Driver OOM Exception


In this scenario, a Spark job is reading a large number of small files from Amazon Simple Storage Service
(Amazon S3). It converts the files to Apache Parquet format and then writes them out to Amazon S3.
The Spark driver is running out of memory. The input Amazon S3 data has more than 1 million files in
different Amazon S3 partitions.

The profiled code is as follows:

data = spark.read.format("json").option("inferSchema", False).load("s3://input_path")


data.write.format("parquet").save(output_path)

Visualize the Profiled Metrics on the AWS Glue Console


The following graph shows the memory usage as a percentage for the driver and executors. This usage
is plotted as one data point that is averaged over the values reported in the last minute. You can see in
the memory profile of the job that the driver memory (p. 205) crosses the safe threshold of 50 percent
usage quickly. On the other hand, the average memory usage (p. 205) across all executors is still less
than 4 percent. This clearly shows an abnormality with driver execution in this Spark job.


The job run soon fails, and the following error appears in the History tab on the AWS Glue console:
Command Failed with Exit Code 1. This error string means that the job failed due to a systemic error—
which in this case is the driver running out of memory.

On the console, choose the Error logs link on the History tab to confirm the finding about driver OOM
from the CloudWatch Logs. Search for "Error" in the job's error logs to confirm that it was indeed an
OOM exception that failed the job:

# java.lang.OutOfMemoryError: Java heap space


# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12039"...

On the History tab for the job, choose Logs. You can find the following trace of driver execution in
the CloudWatch Logs at the beginning of the job. The Spark driver tries to list all the files in all the
directories, constructs an InMemoryFileIndex, and launches one task per file. This in turn results in the
Spark driver having to maintain a large amount of state in memory to track all the tasks. It caches the
complete list of a large number of files for the in-memory index, resulting in a driver OOM.

Fix the Processing of Multiple Files Using Grouping


You can fix the processing of the multiple files by using the grouping feature in AWS Glue. Grouping is
automatically enabled when you use dynamic frames and when the input dataset has a large number
of files (more than 50,000). Grouping allows you to coalesce multiple files together into a group, and
it allows a task to process the entire group instead of a single file. As a result, the Spark driver stores
significantly less state in memory to track fewer tasks. For more information about manually enabling
grouping for your dataset, see Reading Input Files in Larger Groups (p. 251).

To check the memory profile of the AWS Glue job, profile the following code with grouping enabled:

df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://input_path"],


"recurse":True, 'groupFiles': 'inPartition'}, format="json")
datasink = glueContext.write_dynamic_frame.from_options(frame = df, connection_type = "s3",
connection_options = {"path": output_path}, format = "parquet", transformation_ctx =
"datasink")

You can monitor the memory profile and the ETL data movement in the AWS Glue job profile.

The driver executes below the safe threshold of 50 percent memory usage over the entire duration of the
AWS Glue job. The executors stream the data from Amazon S3, process it, and write it out to Amazon S3.
As a result, they consume less than 5 percent memory at any point in time.

The data movement profile below shows the total number of Amazon S3 bytes that are read (p. 207)
and written (p. 208) in the last minute by all executors as the job progresses. Both follow a similar
pattern as the data is streamed across all the executors. The job finishes processing all one million files in
less than three hours.


Debugging an Executor OOM Exception


In this scenario, you can learn how to debug OOM exceptions that could occur in Apache Spark executors.
The following code uses the Spark MySQL reader to read a large table of about 34 million rows into a
Spark dataframe. It then writes it out to Amazon S3 in Parquet format. You can provide the connection
properties and use the default Spark configurations to read the table.

val connectionProperties = new Properties()


connectionProperties.put("user", user)
connectionProperties.put("password", password)
connectionProperties.put("Driver", "com.mysql.jdbc.Driver")
val sparkSession = glueContext.sparkSession
val dfSpark = sparkSession.read.jdbc(url, tableName, connectionProperties)
dfSpark.write.format("parquet").save(output_path)

Visualize the Profiled Metrics on the AWS Glue Console


The following graph shows that within a minute of execution, the average memory usage (p. 205)
across all executors spikes up quickly above the safe threshold of 50 percent. The usage reaches up to 92
percent and the container running the executor is terminated ("killed") by Apache Hadoop YARN.


As the following graph shows, there is always a single executor (p. 203) running until the job fails. This
is because a new executor is launched to replace the killed executor. The JDBC data source reads are not
parallelized by default because it would require partitioning the table on a column and opening multiple
connections. As a result, only one executor reads in the complete table sequentially.

As the following graph shows, Spark tries to launch a new task four times before failing the job. You can
see the memory profile (p. 206) of three executors. Each executor quickly uses up all of its memory. The
fourth executor runs out of memory, and the job fails. As a result, its metric is not reported immediately.


You can confirm from the error string on the AWS Glue console that the job failed due to OOM
exceptions, as shown in the following image.

Job output logs: To further confirm your finding of an executor OOM exception, look at the CloudWatch
Logs. When you search for Error, you find the four executors being killed in roughly the same time
windows as shown on the metrics dashboard. All are terminated by YARN as they exceed their memory
limits.

Executor 1

18/06/13 16:54:29 WARN YarnAllocator: Container killed by YARN for exceeding


memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN
for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 ERROR YarnClusterScheduler: Lost executor 1 on
ip-10-1-2-175.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB
of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-10-1-2-175.ec2.internal, executor 1): ExecutorLostFailure (executor 1
exited caused by one of the running tasks) Reason: Container killed by YARN for
exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

Executor 2

18/06/13 16:55:35 WARN YarnAllocator: Container killed by YARN for exceeding


memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN
for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 ERROR YarnClusterScheduler: Lost executor 2 on ip-10-1-2-16.ec2.internal:
Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory
used. Consider boosting spark.yarn.executor.memoryOverhead.


18/06/13 16:55:35 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1,
ip-10-1-2-16.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited
caused by one of the running tasks) Reason: Container killed by YARN for
exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

Executor 3

18/06/13 16:56:37 WARN YarnAllocator: Container killed by YARN for exceeding


memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN
for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 ERROR YarnClusterScheduler: Lost executor 3 on
ip-10-1-2-189.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB
of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2,
ip-10-1-2-189.ec2.internal, executor 3): ExecutorLostFailure (executor 3
exited caused by one of the running tasks) Reason: Container killed by YARN for
exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

Executor 4

18/06/13 16:57:18 WARN YarnAllocator: Container killed by YARN for exceeding


memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN
for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 ERROR YarnClusterScheduler: Lost executor 4 on ip-10-1-2-96.ec2.internal:
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory
used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3,
ip-10-1-2-96.ec2.internal, executor 4): ExecutorLostFailure (executor 4 exited
caused by one of the running tasks) Reason: Container killed by YARN for
exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

Fix the Fetch Size Setting Using AWS Glue Dynamic Frames
The executor ran out of memory while reading the JDBC table because the default configuration for the
Spark JDBC fetch size is zero. This means that the JDBC driver on the Spark executor tries to fetch the 34
million rows from the database together and cache them, even though Spark streams through the rows
one at a time. With Spark, you can avoid this scenario by setting the fetch size parameter to a non-zero
default value.
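
For example, with the PySpark JDBC reader the fetch size is a read option. The following is a minimal
sketch; the connection values are placeholders, spark is the job's SparkSession, and fetchsize is the
standard Spark JDBC option name.

# Read the table through Spark JDBC with a bounded fetch size so that the JDBC
# driver on the executor streams 1,000 rows at a time instead of caching the
# entire result set. The connection values are placeholders.
df_spark = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://host:3306/db_name")
    .option("dbtable", "table_name")
    .option("user", "user")
    .option("password", "password")
    .option("fetchsize", "1000")
    .load())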

You can also fix this issue by using AWS Glue dynamic frames instead. By default, dynamic frames use a
fetch size of 1,000 rows. As a result, the executor does not take more than 7 percent of its total memory.
The AWS Glue job finishes in less than two minutes with only a single executor.

val (url, database, tableName) = {
  ("jdbc_url", "db_name", "table_name")
}
val source = glueContext.getSource(format, sourceJson)
val df = source.getDynamicFrame
// Write the dynamic frame to Amazon S3 in Parquet format. This is a sketch that uses the
// Scala DataSink API; the Python equivalent is glueContext.write_dynamic_frame.from_options.
glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map("path" -> output_path)),
    format = "parquet",
    transformationContext = "datasink"
  ).writeDynamicFrame(df)


Normal profiled metrics: The executor memory (p. 205) with AWS Glue dynamic frames never exceeds
the safe threshold, as shown in the following image. It streams in the rows from the database and caches
only 1,000 rows in the JDBC driver at any point in time.

Debugging Demanding Stages and Straggler Tasks


You can use AWS Glue job profiling to identify demanding stages and straggler tasks in your extract,
transform, and load (ETL) jobs. A straggler task takes much longer than the rest of the tasks in a stage of
an AWS Glue job. As a result, the stage takes longer to complete, which also delays the total execution
time of the job.

Coalescing Small Input Files into Larger Output Files


A straggler task can occur when there is a non-uniform distribution of work across the different tasks, or
a data skew results in one task processing more data.

You can profile the following code—a common pattern in Apache Spark—to coalesce a large number of
small files into larger output files. For this example, the input dataset is 32 GB of JSON Gzip compressed
files. The output dataset has roughly 190 GB of uncompressed JSON files.

The profiled code is as follows:

datasource0 = spark.read.format("json").load("s3://input_path")
df = datasource0.coalesce(1)
df.write.format("json").save(output_path)

Visualize the Profiled Metrics on the AWS Glue Console


You can profile your job to examine four different sets of metrics:

• ETL data movement


• Data shuffle across executors
• Job execution
• Memory profile

ETL data movement: In the ETL Data Movement profile, the bytes are read (p. 207) fairly quickly by
all the executors in the first stage that completes within the first six minutes. However, the total job
execution time is around one hour, mostly consisting of the data writes (p. 208).


Data shuffle across executors: The number of bytes read (p. 202) and written (p. 201) during
shuffling also shows a spike before Stage 2 ends, as indicated by the Job Execution and Data Shuffle
metrics. After the data shuffles from all executors, the reads and writes proceed from executor number 3
only.

Job execution: As shown in the graph below, all other executors are idle and are eventually relinquished
by the time 10:09. At that point, the total number of executors decreases to only one. This clearly shows
that executor number 3 consists of the straggler task that is taking the longest execution time and is
contributing to most of the job execution time.


Memory profile: After the first two stages, only executor number 3 (p. 206) is actively consuming
memory to process the data. The remaining executors are simply idle or have been relinquished shortly
after the completion of the first two stages.

Fix Straggling Executors Using Grouping


You can avoid straggling executors by using the grouping feature in AWS Glue. Use grouping to distribute
the data uniformly across all the executors and coalesce files into larger files using all the available
executors on the cluster. For more information, see Reading Input Files in Larger Groups (p. 251).

To check the ETL data movements in the AWS Glue job, profile the following code with grouping
enabled:

df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://input_path"],


"recurse":True, 'groupFiles': 'inPartition'}, format="json")
datasink = glueContext.write_dynamic_frame.from_options(frame = df, connection_type =
"s3", connection_options = {"path": output_path}, format = "json", transformation_ctx =
"datasink4")


ETL data movement: The data writes are now streamed in parallel with the data reads throughout the
job execution time. As a result, the job finishes within eight minutes—much faster than previously.

Data shuffle across executors: As the input files are coalesced during the reads using the grouping
feature, there is no costly data shuffle after the data reads.

Job execution: The job execution metrics show that the total number of active executors running and
processing data remains fairly constant. There is no single straggler in the job. All executors are active
and are not relinquished until the completion of the job. Because there is no intermediate shuffle of data
across the executors, there is only a single stage in the job.


Memory profile: The metrics show the active memory consumption (p. 206) across all executors—
reconfirming that there is activity across all executors. As data is streamed in and written out in parallel,
the total memory footprint of all executors is roughly uniform and well below the safe threshold for all
executors.

Monitoring the Progress of Multiple Jobs


You can profile multiple AWS Glue jobs together and monitor the flow of data between them. This is a
common workflow pattern, and requires monitoring for individual job progress, data processing backlog,
data reprocessing, and job bookmarks.

Topics
• Profiled Code (p. 223)
• Visualize the Profiled Metrics on the AWS Glue Console (p. 223)
• Fix the Processing of Files (p. 225)


Profiled Code
In this workflow, you have two jobs: an Input job and an Output job. The Input job is scheduled to run
every 30 minutes using a periodic trigger. The Output job is scheduled to run after each successful run of
the Input job. These scheduled jobs are controlled using job triggers.

Input job: This job reads in data from an Amazon Simple Storage Service (Amazon S3) location,
transforms it using ApplyMapping, and writes it to a staging Amazon S3 location. The following code is
profiled code for the Input job:

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3",
connection_options = {"paths": ["s3://input_path"],
"useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1,
connection_type = "s3", connection_options = {"path": staging_path, "compression":
"gzip"}, format = "json")

Output job: This job reads the output of the Input job from the staging location in Amazon S3,
transforms it again, and writes it to a destination:

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3",
connection_options = {"paths": [staging_path],
"useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1,
connection_type = "s3", connection_options = {"path": output_path}, format = "json")

Visualize the Profiled Metrics on the AWS Glue Console


The following dashboard superimposes the Amazon S3 bytes written metric from the Input job onto the
Amazon S3 bytes read metric on the same timeline for the Output job. The timeline shows different job
runs of the Input and Output jobs. The Input job (shown in red) starts every 30 minutes. The Output Job
(shown in brown) starts at the completion of the Input Job, with a Max Concurrency of 1.


In this example, job bookmarks are not enabled. No transformation contexts are used to enable job
bookmarks in the script code.

Job History: The Input and Output jobs have multiple runs, as shown on the History tab, starting from
12:00 PM.

The Input job on the AWS Glue console looks like this:

The following image shows the Output job:

First job runs: As shown in the Data Bytes Read and Written graph below, the first job runs of the Input
and Output jobs between 12:00 and 12:30 show roughly the same area under the curves. Those areas
represent the Amazon S3 bytes written by the Input job and the Amazon S3 bytes read by the Output
job. This data is also confirmed by the ratio of Amazon S3 bytes read by the Output job to Amazon S3
bytes written by the Input job (each summed over 30 minutes, the job trigger frequency for the Input
job). The data point for this ratio for the Input job run that started at 12:00 PM is also 1.

The following graph shows the data flow ratio across all the job runs:

Second job runs: In the second job run, there is a clear difference in the number of bytes read by the
Output job compared to the number of bytes written by the Input job. (Compare the area under the
curve across the two job runs for the Output job, or compare the areas in the second run of the Input
and Output jobs.) The ratio of the bytes read and written shows that the Output Job read about 2.5x
the data written by the Input job in the second span of 30 minutes from 12:30 to 13:00. This is because
the Output Job reprocessed the output of the first job run of the Input job because job bookmarks were
not enabled. A ratio above 1 shows that there is an additional backlog of data that was processed by the
Output job.


Third job runs: The Input job is fairly consistent in terms of the number of bytes written (see the area
under the red curves). However, the third job run of the Input job ran longer than expected (see the
long tail of the red curve). As a result, the third job run of the Output job started late. The third job run
processed only a fraction of the data accumulated in the staging location in the remaining 30 minutes
between 13:00 and 13:30. The ratio of the bytes flow shows that it only processed 0.83 of data written
by the third job run of the Input job (see the ratio at 13:00).

Overlap of Input and Output jobs: The fourth job run of the Input job started at 13:30 as per the
schedule, before the third job run of the Output job finished. There is a partial overlap between these
two job runs. However, the third job run of the Output job captures only the files that it listed in the
staging location of Amazon S3 when it began around 13:17. This consists of all data output from the first
job runs of the Input job. The actual ratio at 13:30 is around 2.75. The third job run of the Output job
processed about 2.75x of data written by the fourth job run of the Input job from 13:30 to 14:00.

As these images show, the Output job is reprocessing data from the staging location from all prior job
runs of the Input job. As a result, the fourth job run for the Output job is the longest and overlaps with
the entire fifth job run of the Input job.

Fix the Processing of Files


You should ensure that the Output job processes only the files that haven't been processed by previous
job runs of the Output job. To do this, enable job bookmarks and set the transformation context in the
Output job, as follows:

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3",
connection_options = {"paths": [staging_path],
"useS3ListImplementation":True,"recurse":True}, format="json", transformation_ctx =
"bookmark_ctx")

With job bookmarks enabled, the Output job doesn't reprocess the data in the staging location from all
the previous job runs of the Input job. In the following image showing the data read and written, the
area under the brown curve is fairly consistent and similar to the red curves.

The ratios of byte flow also remain roughly close to 1 because there is no additional data processed.


A job run for the Output job starts and captures the files in the staging location before the next Input job
run starts putting more data into the staging location. As long as it continues to do this, it processes only
the files captured from the previous Input job run, and the ratio stays close to 1.

Suppose that the Input job takes longer than expected, and as a result, the Output job captures files in
the staging location from two Input job runs. The ratio is then higher than 1 for that Output job run.
However, the following job runs of the Output job don't process any files that are already processed by
the previous job runs of the Output job.

Monitoring for DPU Capacity Planning


You can use job metrics in AWS Glue to estimate the number of data processing units (DPUs) that can be
used to scale out an AWS Glue job.

Topics
• Profiled Code (p. 227)


• Visualize the Profiled Metrics on the AWS Glue Console (p. 227)
• Determine the Optimal DPU Capacity (p. 229)

Profiled Code
The following script reads an Amazon Simple Storage Service (Amazon S3) partition containing 428
gzipped JSON files. The script applies a mapping to change the field names, and converts and writes
them to Amazon S3 in Apache Parquet format. You provision 10 DPUs as per the default and execute this
job.

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3",
connection_options = {"paths": [input_path],
"useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1,
connection_type = "s3", connection_options = {"path": output_path}, format = "parquet")

Visualize the Profiled Metrics on the AWS Glue Console


Job Run 1: This job run shows how to find whether there are under-provisioned DPUs in the cluster. The job
execution functionality in AWS Glue shows the total number of actively running executors (p. 203), the
number of completed stages (p. 199), and the number of maximum needed executors (p. 204).

The number of maximum needed executors is computed by adding the total number of running tasks
and pending tasks, and dividing by the tasks per executor. This result is a measure of the total number of
executors required to satisfy the current load.

In contrast, the number of actively running executors measures how many executors are running active
Apache Spark tasks. As the job progresses, the maximum needed executors can change and typically
goes down towards the end of the job as the pending task queue diminishes.

The horizontal red line in the following graph shows the number of maximum allocated executors, which
depends on the number of DPUs that you allocate for the job. In this case, you allocate 10 DPUs for the
job run. One DPU is reserved for the application master. Nine DPUs run two executors each and one
executor is reserved for the Spark driver. So, the number of maximum allocated executors is 2*9 - 1 = 17
executors.


As the graph shows, the number of maximum needed executors starts at 107 at the beginning of the
job, whereas the number of active executors remains 17. This is the same as the number of maximum
allocated executors with 10 DPUs. The ratio between the maximum needed executors and maximum
allocated executors (adding 1 to both for the Spark driver) gives you the under-provisioning factor:
108/18 = 6x. You can provision 6*9 + 1 DPUs = 55 DPUs to scale out the job to run it with maximum
parallelism and finish faster.
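
The executor and DPU arithmetic above can be reproduced with a short calculation. The following Python
sketch is purely illustrative (the helper names are not part of any AWS Glue API) and encodes the
assumptions stated above: one DPU reserved for the application master, two executors per remaining DPU,
and one executor reserved for the Spark driver.

import math

def max_allocated_executors(dpus):
    # One DPU is reserved for the application master; each remaining DPU runs
    # two executors, and one executor is reserved for the Spark driver.
    return (dpus - 1) * 2 - 1

def suggested_dpus(max_needed_executors, current_dpus):
    # Under-provisioning factor: needed vs. allocated executors, adding 1 to
    # both to account for the Spark driver.
    allocated = max_allocated_executors(current_dpus)
    factor = math.ceil((max_needed_executors + 1) / (allocated + 1))
    # Scale the executor-carrying DPUs by the factor and add the application master back.
    return factor * (current_dpus - 1) + 1

print(max_allocated_executors(10))   # 17 executors with 10 DPUs
print(suggested_dpus(107, 10))       # 55 DPUs to run with maximum parallelism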

Looking at the Amazon S3 bytes read (p. 207) and written (p. 208), notice that the job spends all six
minutes streaming in data from Amazon S3 and writing it out in parallel. All the cores on the allocated
DPUs are reading and writing to Amazon S3. The maximum number of needed executors being 107 also
matches the number of files in the input Amazon S3 path—428. Each executor can launch four Spark
tasks to process four input files (JSON gzipped).


Determine the Optimal DPU Capacity


Based on the results of the previous job run, you can increase the total number of allocated DPUs to
55, and see how the job performs. The job finishes in less than three minutes—half the time it required
previously. The job scale-out is not linear in this case because it is a short running job. Jobs with long-
lived tasks or a large number of tasks (a large number of max needed executors) benefit from a close-to-
linear DPU scale-out performance speedup.

As the following image shows, the total number of active executors reaches the maximum allocated
—107 executors. Similarly, the maximum needed executors is never above the maximum allocated
executors. The maximum needed executors number is computed from the actively running and pending
task counts, so it might be smaller than the number of active executors. This is because there can be
executors that are partially or completely idle for a short period of time and are not yet decommissioned.


This job run uses 6x more executors to read and write from Amazon S3 in parallel. As a result, this job
run uses more Amazon S3 bandwidth for both reads and writes, and finishes faster.

Identify Overprovisioned DPUs


Next, you can determine whether scaling out the job with 100 DPUs (99 * 2 = 198 executors) helps
to scale out any further. As the following graph shows, the job still takes three minutes to finish.
Similarly, the job does not scale out beyond 107 executors (55 DPUs configuration), and the remaining
91 executors are overprovisioned and not used at all. This shows that increasing the number of DPUs
might not always improve performance, as evident from the maximum needed executors.

Compare Time Differences


The three job runs shown in the following table summarize the job execution times for 10 DPUs, 55
DPUs, and 100 DPUs. You can find the DPU capacity to improve the job execution time using the
estimates you established by monitoring the first job run.

Job ID Number of DPUs Execution Time

jr_c894524c8ef5048a4d9... 10 6 min.

jr_1a466cf2575e7ffe6856... 55 3 min.

jr_34fa1ed4c6aa9ff0a814... 100 3 min.

Logging AWS Glue API Calls with AWS CloudTrail


AWS Glue is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user,
role, or an AWS service in AWS Glue. CloudTrail captures all API calls for AWS Glue as events. The calls
captured include calls from the AWS Glue console and code calls to the AWS Glue API operations. If you
create a trail, you can enable continuous delivery of CloudTrail events to an Amazon S3 bucket, including
events for AWS Glue. If you don't configure a trail, you can still view the most recent events in the
CloudTrail console in Event history. Using the information collected by CloudTrail, you can determine
the request that was made to AWS Glue, the IP address from which the request was made, who made the
request, when it was made, and additional details.

To learn more about CloudTrail, see the AWS CloudTrail User Guide.

AWS Glue Information in CloudTrail


CloudTrail is enabled on your AWS account when you create the account. When activity occurs in AWS
Glue, that activity is recorded in a CloudTrail event along with other AWS service events in Event history.
You can view, search, and download recent events in your AWS account. For more information, see
Viewing Events with CloudTrail Event History.
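
If you prefer to query Event history programmatically rather than in the console, the following sketch
uses the boto3 CloudTrail client to list recent events whose event source is AWS Glue; treat it as an
illustration rather than part of this guide's procedures.

import boto3

cloudtrail = boto3.client("cloudtrail")

# List recent events recorded for the AWS Glue event source.
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "glue.amazonaws.com"}
    ],
    MaxResults=10,
)
for event in response["Events"]:
    print(event["EventName"], event["EventTime"])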

For an ongoing record of events in your AWS account, including events for AWS Glue, create a trail. A trail
enables CloudTrail to deliver log files to an Amazon S3 bucket. By default, when you create a trail in the
console, the trail applies to all AWS Regions. The trail logs events from all Regions in the AWS partition
and delivers the log files to the Amazon S3 bucket that you specify. Additionally, you can configure
other AWS services to further analyze and act upon the event data collected in CloudTrail logs. For more
information, see the following:

• Overview for Creating a Trail


• CloudTrail Supported Services and Integrations
• Configuring Amazon SNS Notifications for CloudTrail
• Receiving CloudTrail Log Files from Multiple Regions and Receiving CloudTrail Log Files from Multiple
Accounts

All AWS Glue actions are logged by CloudTrail and are documented in the AWS Glue API (p. 371). For
example, calls to the CreateDatabase, CreateTable, and CreateScript actions generate entries in
the CloudTrail log files.

Every event or log entry contains information about who generated the request. The identity
information helps you determine the following:

• Whether the request was made with root or AWS Identity and Access Management (IAM) user
credentials.
• Whether the request was made with temporary security credentials for a role or federated user.
• Whether the request was made by another AWS service.

For more information, see the CloudTrail userIdentity Element.

However, CloudTrail doesn't log all information regarding calls. For example, it doesn't log certain
sensitive information, such as the ConnectionProperties used in connection requests, and it logs a
null instead of the responses returned by the following APIs:

BatchGetPartition, CreateScript, GetCatalogImportStatus, GetClassifier, GetClassifiers,
GetConnection, GetConnections, GetCrawler, GetCrawlers, GetCrawlerMetrics, GetDatabase,
GetDatabases, GetDataflowGraph, GetDevEndpoint, GetDevEndpoints, GetJob, GetJobs, GetJobRun,
GetJobRuns, GetMapping, GetObjects, GetPartition, GetPartitions, GetPlan, GetTable, GetTables,
GetTableVersions, GetTrigger, GetTriggers, GetUserDefinedFunction, GetUserDefinedFunctions

Understanding AWS Glue Log File Entries


A trail is a configuration that enables delivery of events as log files to an Amazon S3 bucket that you
specify. CloudTrail log files contain one or more log entries. An event represents a single request from
any source and includes information about the requested action, the date and time of the action, request
parameters, and so on. CloudTrail log files aren't an ordered stack trace of the public API calls, so they
don't appear in any specific order.

The following example shows a CloudTrail log entry that demonstrates the DeleteCrawler action.

{
"eventVersion": "1.05",
"userIdentity": {
"type": "IAMUser",
"principalId": "AKIAIOSFODNN7EXAMPLE",
"arn": "arn:aws:iam::123456789012:user/johndoe",
"accountId": "123456789012",
"accessKeyId": "AKIAIOSFODNN7EXAMPLE",
"userName": "johndoe"
},
"eventTime": "2017-10-11T22:29:49Z",
"eventSource": "glue.amazonaws.com",
"eventName": "DeleteCrawler",
"awsRegion": "us-east-1",
"sourceIPAddress": "72.21.198.64",
"userAgent": "aws-cli/1.11.148 Python/3.6.1 Darwin/16.7.0 botocore/1.7.6",
"requestParameters": {
"name": "tes-alpha"
},
"responseElements": null,
"requestID": "b16f4050-aed3-11e7-b0b3-75564a46954f",
"eventID": "e73dd117-cfd1-47d1-9e2f-d1271cad838c",
"eventType": "AwsApiCall",
"recipientAccountId": "123456789012"
}

This example shows a CloudTrail log entry that demonstrates a CreateConnection action.

{
"eventVersion": "1.05",
"userIdentity": {
"type": "IAMUser",
"principalId": "AKIAIOSFODNN7EXAMPLE",
"arn": "arn:aws:iam::123456789012:user/johndoe",
"accountId": "123456789012",
"accessKeyId": "AKIAIOSFODNN7EXAMPLE",
"userName": "johndoe"
},
"eventTime": "2017-10-13T00:19:19Z",
"eventSource": "glue.amazonaws.com",
"eventName": "CreateConnection",
"awsRegion": "us-east-1",
"sourceIPAddress": "72.21.198.66",
"userAgent": "aws-cli/1.11.148 Python/3.6.1 Darwin/16.7.0 botocore/1.7.6",
"requestParameters": {
"connectionInput": {
"name": "test-connection-alpha",
"connectionType": "JDBC",
"physicalConnectionRequirements": {
"subnetId": "subnet-323232",
"availabilityZone": "us-east-1a",
"securityGroupIdList": [
"sg-12121212"
]
}
}
},
"responseElements": null,

232
AWS Glue Developer Guide
Understanding AWS Glue Log File Entries

"requestID": "27136ebc-afac-11e7-a7d6-ab217e5c3f19",
"eventID": "e8b3baeb-c511-4597-880f-c16210c60a4a",
"eventType": "AwsApiCall",
"recipientAccountId": "123456789012"
}


AWS Glue Troubleshooting


Topics
• Gathering AWS Glue Troubleshooting Information (p. 234)
• Troubleshooting Connection Issues in AWS Glue (p. 234)
• Troubleshooting Errors in AWS Glue (p. 235)
• AWS Glue Limits (p. 242)

Gathering AWS Glue Troubleshooting Information


If you encounter errors or unexpected behavior in AWS Glue and need to contact AWS Support, you
should first gather information about names, IDs, and logs that are associated with the failed action.
Having this information available enables AWS Support to help you resolve the problems you're
experiencing.

Along with your account ID, gather the following information for each of these types of failures:

When a crawler fails, gather the following information:
• Crawler name

Logs from crawler runs are located in CloudWatch Logs under /aws-glue/crawlers.

When a test connection fails, gather the following information:
• Connection name
• Connection ID
• JDBC connection string in the form jdbc:protocol://host:port/database-name.

Logs from test connections are located in CloudWatch Logs under /aws-glue/testconnection.

When a job fails, gather the following information:
• Job name
• Job run ID in the form jr_xxxxx.

Logs from job runs are located in CloudWatch Logs under /aws-glue/jobs.
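
As one way to pull these logs without opening the CloudWatch console, the following sketch uses the
boto3 CloudWatch Logs client. It assumes that the crawler's log stream is named after the crawler; the
crawler name shown is a placeholder.

import boto3

logs = boto3.client("logs")

# Fetch recent log events for a crawler run from the /aws-glue/crawlers log group.
# "my-crawler" is a placeholder; the stream name is assumed to match the crawler name.
response = logs.filter_log_events(
    logGroupName="/aws-glue/crawlers",
    logStreamNames=["my-crawler"],
    limit=20,
)
for event in response["events"]:
    print(event["message"])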

Troubleshooting Connection Issues in AWS Glue


When an AWS Glue crawler or a job uses connection properties to access a data store, you might
encounter errors when you try to connect. AWS Glue uses private IP addresses in the subnet when it
creates elastic network interfaces in your specified virtual private cloud (VPC) and subnet. Security
groups specified in the connection are applied on each of the elastic network interfaces. Check to see
whether security groups allow outbound access and if they allow connectivity to the database cluster.

In addition, Apache Spark requires bi-directional connectivity among driver and executor nodes. One of
the security groups needs to allow ingress rules on all TCP ports. You can prevent it from being open to
the world by restricting the source of the security group to itself with a self-referencing security group.
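
One way to create such a self-referencing rule is with the boto3 EC2 client, as in the following sketch.
The security group ID is a placeholder, and you should confirm that the rule matches your own network
and security requirements.

import boto3

ec2 = boto3.client("ec2")
security_group_id = "sg-0123456789abcdef0"  # placeholder: the group used by your AWS Glue connection

# Allow inbound TCP traffic on all ports only when the source is the same
# security group, so the group is not open to the world.
ec2.authorize_security_group_ingress(
    GroupId=security_group_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }],
)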

Here are some typical actions you can take to troubleshoot connection problems:

• Check the port address of your connection.


• Check the user name and password string in your connection.


• For a JDBC data store, verify that it allows incoming connections.


• Verify that your data store can be accessed within your VPC.

Troubleshooting Errors in AWS Glue


If you encounter errors in AWS Glue, use the following solutions to help you find the source of the
problems and fix them.
Note
The AWS Glue GitHub repository contains additional troubleshooting guidance in AWS Glue
Frequently Asked Questions.

Topics
• Error: Resource Unavailable (p. 235)
• Error: Could Not Find S3 Endpoint or NAT Gateway for subnetId in VPC (p. 236)
• Error: Inbound Rule in Security Group Required (p. 236)
• Error: Outbound Rule in Security Group Required (p. 236)
• Error: Custom DNS Resolution Failures (p. 236)
• Error: Job Run Failed Because the Role Passed Should Be Given Assume Role Permissions for the AWS
Glue Service (p. 236)
• Error: DescribeVpcEndpoints Action Is Unauthorized. Unable to Validate VPC ID vpc-id (p. 237)
• Error: DescribeRouteTables Action Is Unauthorized. Unable to Validate Subnet Id: subnet-id in VPC id:
vpc-id (p. 237)
• Error: Failed to Call ec2:DescribeSubnets (p. 237)
• Error: Failed to Call ec2:DescribeSecurityGroups (p. 237)
• Error: Could Not Find Subnet for AZ (p. 237)
• Error: Job Run Exception When Writing to a JDBC Target (p. 237)
• Error: Amazon S3 Timeout (p. 238)
• Error: Amazon S3 Access Denied (p. 238)
• Error: Amazon S3 Access Key ID Does Not Exist (p. 238)
• Error: Job Run Fails When Accessing Amazon S3 with an s3a:// URI (p. 238)
• Error: Amazon S3 Service Token Expired (p. 240)
• Error: No Private DNS for Network Interface Found (p. 240)
• Error: Development Endpoint Provisioning Failed (p. 240)
• Error: Notebook Server CREATE_FAILED (p. 240)
• Error: Local Notebook Fails to Start (p. 240)
• Error: Notebook Usage Errors (p. 241)
• Error: Running Crawler Failed (p. 241)
• Error: Upgrading Athena Data Catalog (p. 241)
• Error: A Job is Reprocessing Data When Job Bookmarks Are Enabled (p. 241)

Error: Resource Unavailable


If AWS Glue returns a resource unavailable message, you can view error messages or logs to help you
learn more about the issue. The following tasks describe general methods for troubleshooting.

• For any connections and development endpoints that you use, check that your cluster has not run out
of elastic network interfaces.


Error: Could Not Find S3 Endpoint or NAT Gateway for subnetId in VPC

Check the subnet ID and VPC ID in the message to help you diagnose the issue.

• Check that you have an Amazon S3 VPC endpoint set up, which is required with AWS Glue. In addition,
check your NAT gateway if that's part of your configuration. For more information, see Amazon VPC
Endpoints for Amazon S3 (p. 29).

Error: Inbound Rule in Security Group Required


At least one security group must open all ingress ports. To limit traffic, the source security group in your
inbound rule can be restricted to the same security group.

• For any connections that you use, check your security group for an inbound rule that is self-
referencing. For more information, see Setting Up Your Environment to Access Data Stores (p. 28).
• When you are using a development endpoint, check your security group for an inbound rule that is
self-referencing. For more information, see Setting Up Your Environment to Access Data Stores (p. 28).

Error: Outbound Rule in Security Group Required


At least one security group must open all egress ports. To limit traffic, the source security group in your
outbound rule can be restricted to the same security group.

• For any connections that you use, check your security group for an outbound rule that is self-
referencing. For more information, see Setting Up Your Environment to Access Data Stores (p. 28).
• When you are using a development endpoint, check your security group for an outbound rule that is
self-referencing. For more information, see Setting Up Your Environment to Access Data Stores (p. 28).

Error: Custom DNS Resolution Failures


When using a custom DNS for internet name resolution, both forward DNS lookup and reverse DNS
lookup must be implemented. Otherwise, you might receive errors similar to: Reverse dns resolution of
ip failure or Dns resolution of dns failed. If AWS Glue returns a message, you can view error messages
or logs to help you learn more about the issue. The following tasks describe general methods for
troubleshooting.

• A custom DNS configuration without reverse lookup can cause AWS Glue to fail. Check your DNS
configuration. If you are using Route 53 or Microsoft Active Directory, make sure that there are forward
and reverse lookups. For more information, see Setting Up DNS in Your VPC (p. 28).

Error: Job Run Failed Because the Role Passed Should Be Given Assume Role Permissions for the AWS Glue Service

The user who defines a job must have permission for iam:PassRole for AWS Glue.


• When a user creates an AWS Glue job, confirm that the user's role contains a policy that contains
iam:PassRole for AWS Glue. For more information, see Step 3: Attach a Policy to IAM Users That
Access AWS Glue (p. 15).

Error: DescribeVpcEndpoints Action Is Unauthorized. Unable to Validate VPC ID vpc-id

• Check the policy passed to AWS Glue for the ec2:DescribeVpcEndpoints permission.

Error: DescribeRouteTables Action Is Unauthorized. Unable to Validate Subnet Id: subnet-id in VPC id: vpc-id

• Check the policy passed to AWS Glue for the ec2:DescribeRouteTables permission.

Error: Failed to Call ec2:DescribeSubnets


• Check the policy passed to AWS Glue for the ec2:DescribeSubnets permission.

Error: Failed to Call ec2:DescribeSecurityGroups


• Check the policy passed to AWS Glue for the ec2:DescribeSecurityGroups permission.

Error: Could Not Find Subnet for AZ


• The Availability Zone might not be available to AWS Glue. Create and use a new subnet in a different
Availability Zone from the one specified in the message.

Error: Job Run Exception When Writing to a JDBC Target

When you are running a job that writes to a JDBC target, the job might encounter errors in the following
scenarios:

• If your job writes to a Microsoft SQL Server table, and the table has columns defined as type Boolean,
then the table must be predefined in the SQL Server database. When you define the job on the AWS
Glue console using a SQL Server target with the option Create tables in your data target, don't map
any source columns to a target column with data type Boolean. You might encounter an error when
the job runs.

You can avoid the error by doing the following:


• Choose an existing table with the Boolean column.
• Edit the ApplyMapping transform and map the Boolean column in the source to a number or string
in the target.


• Edit the ApplyMapping transform to remove the Boolean column from the source.
• If your job writes to an Oracle table, you might need to adjust the length of names of Oracle objects.
In some versions of Oracle, the maximum identifier length is limited to 30 bytes or 128 bytes. This
limit affects the table names and column names of Oracle target data stores.

You can avoid the error by doing the following:


• Name Oracle target tables within the limit for your version.
• The default column names are generated from the field names in the data. To handle the case when
the column names are longer than the limit, use ApplyMapping or RenameField transforms to
change the name of the column to be within the limit.

Error: Amazon S3 Timeout


If AWS Glue returns a connect timed out error, it might be because it is trying to access an Amazon S3
bucket in another AWS Region.

• An Amazon S3 VPC endpoint can only route traffic to buckets within an AWS Region. If you need
to connect to buckets in other Regions, a possible workaround is to use a NAT gateway. For more
information, see NAT Gateways.

Error: Amazon S3 Access Denied


If AWS Glue returns an access denied error to an Amazon S3 bucket or object, it might be because the
IAM role provided does not have a policy with permission to your data store.

• An ETL job must have access to an Amazon S3 data store used as a source or target. A crawler must
have access to an Amazon S3 data store that it crawls. For more information, see Step 2: Create an IAM
Role for AWS Glue (p. 14).

Error: Amazon S3 Access Key ID Does Not Exist


If AWS Glue returns an access key ID does not exist error when running a job, it might be because of one
of the following reasons:

• An ETL job uses an IAM role to access data stores. Confirm that the IAM role for your job was not
deleted before the job started.
• An IAM role contains permissions to access your data stores. Confirm that any attached Amazon S3
policy containing s3:ListBucket is correct.

Error: Job Run Fails When Accessing Amazon S3 with an s3a:// URI

If a job run returns an error like Failed to parse XML document with handler class, it might be because of
a failure trying to list hundreds of files using an s3a:// URI. Access your data store using an s3:// URI
instead. The following exception trace highlights the errors to look for:

1. com.amazonaws.SdkClientException: Failed to parse XML document with handler class


com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
2. at
com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxPar
3. at
com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResp
4. at com.amazonaws.services.s3.model.transform.Unmarshallers
$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:70)
5. at com.amazonaws.services.s3.model.transform.Unmarshallers
$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:59)
6. at
com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
7. at
com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
8. at
com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
9. at com.amazonaws.http.AmazonHttpClient
$RequestExecutor.handleResponse(AmazonHttpClient.java:1554)
10. at com.amazonaws.http.AmazonHttpClient
$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1272)
11. at com.amazonaws.http.AmazonHttpClient
$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
12. at com.amazonaws.http.AmazonHttpClient
$RequestExecutor.doExecute(AmazonHttpClient.java:743)
13. at com.amazonaws.http.AmazonHttpClient
$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
14. at com.amazonaws.http.AmazonHttpClient
$RequestExecutor.execute(AmazonHttpClient.java:699)
15. at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access
$500(AmazonHttpClient.java:667)
16. at com.amazonaws.http.AmazonHttpClient
$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
17. at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
18. at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
19. at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
20. at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4266)
21. at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:834)
22. at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:971)
23. at
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1155)
24. at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1144)
25. at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
26. at org.apache.hadoop.fs.FSDataOutputStream
$PositionCache.close(FSDataOutputStream.java:74)
27. at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
28. at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:467)
29. at
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:117)
30. at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
31. at
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala
32. at org.apache.spark.sql.execution.datasources.FileFormatWriter
$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:252)
33. at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun
$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask
$3.apply(FileFormatWriter.scala:191)
34. at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun
$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask
$3.apply(FileFormatWriter.scala:188)
35. at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
36. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql
$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
37. at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$
$anonfun$3.apply(FileFormatWriter.scala:129)
38. at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$
$anonfun$3.apply(FileFormatWriter.scala:128)
39. at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
40. at org.apache.spark.scheduler.Task.run(Task.scala:99)
41. at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
42. at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
43. at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
44. at java.lang.Thread.run(Thread.java:748)

Error: Amazon S3 Service Token Expired


When moving data to and from Amazon Redshift, temporary Amazon S3 credentials, which expire after
1 hour, are used. If you have a long running job, it might fail. For information about how to set up your
long running jobs to move data to and from Amazon Redshift, see Moving Data to and from Amazon
Redshift (p. 253).

Error: No Private DNS for Network Interface Found


If a job fails or a development endpoint fails to provision, it might be because of a problem in the
network setup.

• If you are using the Amazon provided DNS, the value of enableDnsHostnames must be set to true.
For more information, see DNS.

Error: Development Endpoint Provisioning Failed


If AWS Glue fails to successfully provision a development endpoint, it might be because of a problem in
the network setup.

• When you define a development endpoint, the VPC, subnet, and security groups are validated to
confirm that they meet certain requirements.
• If you provided the optional SSH public key, check that it is a valid SSH public key.
• Check in the VPC console that your VPC uses a valid DHCP option set. For more information, see DHCP
option sets.
• If after a few minutes, the development endpoint Provisioning status changes to FAILED, and the
failure reason is DNS related (for example, Reverse dns resolution of ip 10.5.237.213
failed), check your DNS setup. For more information, see Setting Up DNS in Your VPC (p. 28).
• If the cluster remains in the PROVISIONING state, contact AWS Support.

Error: Notebook Server CREATE_FAILED


If AWS Glue fails to create the notebook server for a development endpoint, it might be because of one
of the following problems:

• AWS Glue passes an IAM role to Amazon EC2 when it is setting up the notebook server. The IAM role
must have a trust relationship to Amazon EC2.
• The IAM role must have an instance profile of the same name. When you create the role for Amazon
EC2 with the IAM console, the instance profile with the same name is automatically created. Check for
an error in the log regarding an invalid instance profile name iamInstanceProfile.name. For more
information, see Using Instance Profiles.
• Check that your role has permission to access aws-glue* buckets in the policy that you pass to create
the notebook server.

Error: Local Notebook Fails to Start


If your local notebook fails to start and reports errors that a directory or folder cannot be found, it might
be because of one of the following problems:


• If you are running on Microsoft Windows, make sure that the JAVA_HOME environment variable points
to the correct Java directory. It's possible to update Java without updating this variable, and if it points
to a folder that no longer exists, Zeppelin notebooks fail to start.

Error: Notebook Usage Errors


When using an Apache Zeppelin notebook, you might encounter errors due to your setup or
environment.

• You provide an IAM role with an attached policy when you created the notebook server. If the
policy does not include all the required permissions, you might get an error such as assumed-
role/name-of-role/i-0bf0fa9d038087062 is not authorized to perform some-
action AccessDeniedException. Check the policy that is passed to your notebook server in the
IAM console.
• If the Zeppelin notebook does not render correctly in your web browser, check the Zeppelin
requirements for browser support. For example, there might be specific versions and setup required for
the Safari browser. You might need to update your browser or use a different browser.

Error: Running Crawler Failed


If AWS Glue fails to successfully run a crawler to catalog your data, it might be because of one of the
following reasons. First check if an error is listed in the AWS Glue console crawlers list. Check if there is
an exclamation icon next to the crawler name and hover over the icon to see any associated messages.

• Check the logs for the crawler run in CloudWatch Logs under /aws-glue/crawlers.

Error: Upgrading Athena Data Catalog


If you encounter errors while upgrading your Athena Data Catalog to the AWS Glue Data Catalog, see the
Amazon Athena User Guide topic Upgrading to the AWS Glue Data Catalog Step-by-Step.

Error: A Job is Reprocessing Data When Job Bookmarks Are Enabled

There might be cases when you have enabled AWS Glue job bookmarks, but your ETL job is reprocessing
data that was already processed in an earlier run. Check for these common causes of this error:

Max Concurrency

Ensure that the maximum number of concurrent runs for the job is 1. For more information, see the
discussion of max concurrency in Adding Jobs in AWS Glue (p. 142). When multiple concurrent runs of a
job use job bookmarks (that is, when the maximum concurrency is greater than 1), the job bookmark
doesn't work correctly.

Missing Job Object

Ensure that your job run script ends with the following commit:

job.commit()

When you include this object, AWS Glue records the timestamp and path of the job run. If you run the
job again with the same path, AWS Glue processes only the new files. If you don't include this object and
job bookmarks are enabled, the job reprocesses the already processed files along with the new files and
creates redundancy in the job's target data store.

Spark DataFrame

AWS Glue job bookmarks don't work if you're using the Apache Spark DataFrame as the data sink for
your job. Job bookmarks only work when you use the AWS Glue DynamicFrame Class (p. 278). To
resolve this error, switch to the DynamicFrame using the fromDF (p. 279) construction.

Missing Transformation Context Parameter

Transformation context is an optional parameter in the GlueContext class, but job bookmarks don't
work if you don't include it. To resolve this error, add the transformation context parameter when you
create the DynamicFrame, as shown following:

sample_dynF = glueContext.create_dynamic_frame_from_catalog(database,
table_name, transformation_ctx="sample_dynF")

Input Source

If you are using a relational database (a JDBC connection) for the input source, job bookmarks work only
if the table's primary keys are in sequential order. Job bookmarks work for new rows, but not for updated
rows. That is because job bookmarks look for the primary keys, which already exist. This does not apply if
your input source is Amazon Simple Storage Service (Amazon S3).

Last Modified Time

For Amazon S3 input sources, job bookmarks check the last modified time of the objects, rather than the
file names, to verify which objects need to be reprocessed. If your input source data has been modified
since your last job run, the files are reprocessed when you run the job again.

AWS Glue Limits


Note
You can contact AWS Support to request a limit increase for the limits listed here. Unless
otherwise noted, each limit is region-specific.

Resource Default Limit

Number of databases per account 10,000

Number of tables per database 100,000

Number of partitions per table 1,000,000

Number of table versions per table 100,000

Number of tables per account 1,000,000

Number of partitions per account 10,000,000

Number of table versions per account 1,000,000

Number of connections per account 1,000

Number of crawlers per account 25

Number of jobs per account 25

Number of triggers per account 25

Number of concurrent job runs per account 30

Number of concurrent job runs per job 3

Number of jobs per trigger 10

Number of development endpoints per account 5

Number of security configurations per account 10

Maximum DPUs used by a development endpoint at one time 5

Maximum DPUs used by a role at one time 100


Programming ETL Scripts


AWS Glue makes it easy to write or autogenerate extract, transform, and load (ETL) scripts, as well as test
and run them. This section describes the extensions to Apache Spark that AWS Glue has introduced, and
provides examples of how to code and run ETL scripts in Python and Scala.

General Information about Programming AWS Glue ETL Scripts

The following sections describe techniques and values that apply generally to AWS Glue ETL
programming in any language.

Topics
• Special Parameters Used by AWS Glue (p. 244)
• Connection Types and Options for ETL in AWS Glue (p. 245)
• Format Options for ETL Inputs and Outputs in AWS Glue (p. 248)
• Managing Partitions for ETL Output in AWS Glue (p. 250)
• Reading Input Files in Larger Groups (p. 251)
• Reading from JDBC Tables in Parallel (p. 252)
• Moving Data to and from Amazon Redshift (p. 253)

Special Parameters Used by AWS Glue


There are a number of argument names that are recognized and used by AWS Glue, which you can use to
set up the script environment for your Jobs and JobRuns:

• --job-language — The script programming language. This must be either scala or python. If this
parameter is not present, the default is python.
• --class — The Scala class that serves as the entry point for your Scala script. This only applies if
your --job-language is set to scala.
• --scriptLocation — The S3 location where your ETL script is located (in a form like s3://path/
to/my/script.py). This overrides a script location set in the JobCommand object.
• --extra-py-files — S3 path(s) to additional Python modules that AWS Glue will add to the
Python path before executing your script. Multiple values must be complete paths separated by a
comma (,). Only individual files are supported, not a directory path. Note that only pure Python
modules will work currently. Extension modules written in C or other languages are not supported.
• --extra-jars — S3 path(s) to additional Java .jar file(s) that AWS Glue will add to the Java
classpath before executing your script. Multiple values must be complete paths separated by a comma
(,).
• --extra-files — S3 path(s) to additional files such as configuration files that AWS Glue will copy
to the working directory of your script before executing it. Multiple values must be complete paths
separated by a comma (,). Only individual files are supported, not a directory path.
• --job-bookmark-option — Controls the behavior of a job bookmark. The following option values
can be set:


• job-bookmark-enable: Keep track of previously processed data. When a job runs, process new data
since the last checkpoint.
• job-bookmark-disable: Always process the entire dataset. You are responsible for managing the
output from previous job runs.
• job-bookmark-pause: Process incremental data since the last run. Don't update the state information,
so that every subsequent run processes data since the last bookmark. You are responsible for managing
the output from previous job runs.

For example, to enable a job bookmark, pass the argument:

'--job-bookmark-option': 'job-bookmark-enable'

• --TempDir — Specifies an S3 path to a bucket that can be used as a temporary directory for the Job.

For example, to set a temporary directory, pass the argument:

'--TempDir': 's3-path-to-directory'

• --enable-metrics — Enables the collection of metrics for job profiling for this job run. These
metrics are available on the AWS Glue console and CloudWatch console. To enable metrics, only
specify the key, no value is needed.

For example, the following is the syntax for running a job and setting a special parameter with the
--arguments option:

$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py"'
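
You can pass the same special parameters programmatically when starting a job run. The following is a
minimal sketch using the boto3 Glue client; the job name and Amazon S3 paths are placeholders.

import boto3

glue = boto3.client("glue")

# Start a job run with job bookmarks and metrics enabled and a temporary directory set.
response = glue.start_job_run(
    JobName="CSV to CSV",                               # placeholder job name
    Arguments={
        "--job-bookmark-option": "job-bookmark-enable",
        "--TempDir": "s3://my-temp-bucket/temp-dir/",   # placeholder path
        "--enable-metrics": "",
    },
)
print(response["JobRunId"])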

There are also several argument names used by AWS Glue internally that you should never set:

• --conf — Internal to AWS Glue. Do not set!


• --debug — Internal to AWS Glue. Do not set!
• --mode — Internal to AWS Glue. Do not set!
• --JOB_NAME — Internal to AWS Glue. Do not set!

Connection Types and Options for ETL in AWS Glue


Various AWS Glue PySpark and Scala methods and transforms specify connection parameters using a
connectionType parameter and a connectionOptions parameter.

The connectionType parameter can take the following values, and the associated "connectionOptions"
parameter values for each type are documented below:

In general, these are for ETL input and do not apply to ETL sinks.

• "connectionType": "s3" (p. 246): Designates a connection to Amazon Simple Storage Service (Amazon
S3).


• "connectionType": "parquet" (p. 246): Designates a connection to files stored in Amazon S3 in the
Apache Parquet file format.
• "connectionType": "orc" (p. 247): Designates a connection to files stored in Amazon S3 in the Apache
Hive Optimized Row Columnar (ORC) file format.
• "connectionType": "mysql" (p. 247): Designates a connection to a MySQL database (see JDBC
connectionType values (p. 247)).
• "connectionType": "redshift" (p. 247): Designates a connection to an Amazon Redshift database (see
JDBC connectionType values (p. 247)).
• "connectionType": "oracle" (p. 247): Designates a connection to an Oracle database (see JDBC
connectionType values (p. 247)).
• "connectionType": "sqlserver" (p. 247): Designates a connection to a Microsoft SQL Server database
(see JDBC connectionType values (p. 247)).
• "connectionType": "postgresql" (p. 247): Designates a connection to a PostgreSQL database (see JDBC
connectionType values (p. 247)).
• "connectionType": "dynamodb" (p. 247): Designates a connection to Amazon DynamoDB;
(DynamoDB).

"connectionType": "s3"
Designates a connection to Amazon Simple Storage Service (Amazon S3).

Use the following connectionOptions with "connectionType": "s3":

• "paths": (Required) A list of the Amazon S3 paths from which to read.


• "exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. for
example "[\"**.pdf\"]" would exclude all pdf files. More information about the glob syntax supported
by AWS Glue can be found at Using Include and Exclude Patterns.
• "compressionType": or "compression": (Optional) Specifies how the data is compressed. Use
"compressionType" for Amazon S3 sources and "compression" for Amazon S3 targets. This is
generally not necessary if the data has a standard file extension. Possible values are "gzip" and
"bzip").
• "groupFiles": (Optional) Grouping files is enabled by default when the input contains more than
50,000 files. To enable grouping with fewer than 50,000 files, set this parameter to "inPartition".
To disable grouping when there are more than 50,000 files, set this parameter to "none".
• "groupSize": (Optional) The target group size in bytes. The default is computed based on the input
data size and the size of your cluster. When there are fewer than 50,000 input files, "groupFiles"
must be set to "inPartition" for this to take effect.
• "recurse": (Optional) If set to true, recursively reads files in all subdirectories under the specified
paths.
• "maxBand": (Optional, Advanced) This option controls the duration in seconds after which s3 listing is
likely to be consistent. Files with modification timestamps falling within the last maxBand seconds are
tracked specially when using JobBookmarks to account for S3 eventual consistency. Most users do not
need to set this option. The default is 900 seconds.
• "maxFilesInBand": (Optional, Advanced) This option specifies the maximum number of files to save
from the last maxBand seconds. If this number is exceeded, extra files are skipped and only processed
in the next job run. Most users do not need to set this option.
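
As an illustration of how several of these options fit together, the following sketch reads JSON files
from a placeholder Amazon S3 path, excluding PDF files and recursing into subdirectories.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read JSON files from a placeholder path, skipping PDFs and reading subdirectories.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/input/"],   # placeholder path
        "exclusions": "[\"**.pdf\"]",
        "recurse": True,
        "groupFiles": "inPartition",
    },
    format="json",
)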

"connectionType": "parquet"
Designates a connection to files stored in Amazon Simple Storage Service (Amazon S3) in the Apache
Parquet file format.


Use the following connectionOptions with "connectionType": "parquet":

• paths: (Required) A list of the Amazon S3 paths from which to read.


• (Other option name/value pairs): Any additional options, including formatting options, are passed
directly to the SparkSQL DataSource. For more information, see Redshift data source for Spark.

"connectionType": "orc"
Designates a connection to files stored in Amazon S3 in the Apache Hive Optimized Row Columnar (ORC)
file format.

Use the following connectionOptions with "connectionType": "orc":

• paths: (Required) A list of the Amazon S3 paths from which to read.


• (Other option name/value pairs): Any additional options, including formatting options, are passed
directly to the SparkSQL DataSource. For more information, see Redshift data source for Spark.

JDBC connectionType values


These include the following:

• "connectionType": "mysql": Designates a connection to a MySQL database.


• "connectionType": "redshift": Designates a connection to an Amazon Redshift database.
• "connectionType": "oracle": Designates a connection to an Oracle database.
• "connectionType": "sqlserver": Designates a connection to a Microsoft SQL Server database.
• "connectionType": "postgresql": Designates a connection to a PostgreSQL database.

Use these connectionOptions with JDBC connections:

• "url": (Required) The JDBC URL for the database.


• "dbtable": The database table to read from. For JDBC data stores that support schemas within a
database, specify schema.table-name. If a schema is not provided, then the default "public" schema
is used.
• "redshiftTmpDir": (Required for Amazon Redshift, optional for other JDBC types) The Amazon S3
path where temporary data can be staged when copying out of the database.
• "user": (Required) The username to use when connecting.
• "password": (Required) The password to use when connecting.

All other option name/value pairs that are included in connectionOptions for a JDBC connection,
including formatting options, are passed directly to the underlying SparkSQL DataSource. For more
information, see Redshift data source for Spark.
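
To show how these options are typically combined, here is a minimal sketch that reads a MySQL table
over JDBC. The URL, table, and credentials are placeholders; in practice you would usually supply
credentials from a catalog connection or a secrets store rather than hard-coding them.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read one table from a placeholder MySQL database.
jdbc_source = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://dbhost:3306/mydb",   # placeholder JDBC URL
        "dbtable": "mytable",
        "user": "db_user",
        "password": "db_password",
    },
)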

"connectionType": "dynamodb"
Designates a connection to Amazon DynamoDB (DynamoDB).

Use the following connectionOptions with "connectionType": "dynamodb":

• "dynamodb.input.tableName": (Required) The DynamoDB table from which to read.


• "dynamodb.throughput.read.percent": (Optional) The percentage of read capacity units (RCU)
to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
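
A minimal sketch of reading from DynamoDB with these options follows; the table name is a placeholder.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a placeholder DynamoDB table, consuming at most half of its read capacity.
ddb_source = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_ddb_table",
        "dynamodb.throughput.read.percent": "0.5",
    },
)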


Format Options for ETL Inputs and Outputs in AWS Glue

Various AWS Glue PySpark and Scala methods and transforms specify their input and/or output format
using a format parameter and a format_options parameter. These parameters can take the following
values:

format="avro"
This value designates the Apache Avro data format.

There are no format_options values for format="avro".

format="csv"
This value designates comma-separated-values as the data format (for example, see RFC 4180 and
RFC 7111).

You can use the following format_options values with format="csv":

• separator: Specifies the delimiter character. The default is a comma: ',', but any other character
can be specified.
• escaper: Specifies a character to use for escaping. The default value is "none". If enabled, the
character which immediately follows is used as-is, except for a small set of well-known escapes (\n,
\r, \t, and \0).
• quoteChar: Specifies the character to use for quoting. The default is a double quote: '"'. Set this to
'-1' to disable quoting entirely.
• multiline: A Boolean value that specifies whether a single record can span multiple lines. This can
occur when a field contains a quoted new-line character. You must set this option to "true" if any
record spans multiple lines. The default value is "false", which allows for more aggressive file-
splitting during parsing.
• withHeader: A Boolean value that specifies whether to treat the first line as a header. The default
value is "false". This option can be used in the DynamicFrameReader class.
• writeHeader: A Boolean value that specifies whether to write the header to output. The default
value is "true". This option can be used in the DynamicFrameWriter class.
• skipFirst: A Boolean value that specifies whether to skip the first data line. The default value is
"false".

format="ion"
This value designates Amazon Ion as the data format. (For more information, see the Amazon Ion
Specification.)

Currently, AWS Glue does not support ion for output.

There are no format_options values for format="ion".

format="grokLog"
This value designates a log data format specified by one or more Logstash grok patterns (for example,
see Logstash Reference (6.2]: Grok filter plugin).

Currently, AWS Glue does not support groklog for output.


You can use the following format_options values with format="grokLog":

• logFormat: Specifies the grok pattern that matches the log's format.
• customPatterns: Specifies additional grok patterns used here.
• MISSING: Specifies the signal to use in identifying missing values. The default is '-'.
• LineCount: Specifies the number of lines in each log record. The default is '1', and currently only
single-line records are supported.
• StrictMode: A Boolean value that specifies whether strict mode is enabled. In strict mode, the reader
doesn't do automatic type conversion or recovery. The default value is "false".

format="json"
This value designates a JSON (JavaScript Object Notation) data format.

You can use the following format_options values with format="json":

• jsonPath: A JsonPath expression that identifies an object to be read into records. This is particularly
useful when a file contains records nested inside an outer array. For example, the following JsonPath
expression targets the id field of a JSON object:

format="json", format_options={"jsonPath": "$.id"}

• multiline: A Boolean value that specifies whether a single record can span multiple lines. This can
occur when a field contains a quoted new-line character. You must set this option to "true" if any
record spans multiple lines. The default value is "false", which allows for more aggressive file-
splitting during parsing.

format="orc"
This value designates Apache ORC as the data format. (For more information, see the LanguageManual
ORC.)

There are no format_options values for format="orc". However, any options that are accepted by
the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.

format="parquet"
This value designates Apache Parquet as the data format.

There are no format_options values for format="parquet". However, any options that are accepted
by the underlying SparkSQL code can be passed to it by way of the connection_options map
parameter.

format="xml"
This value designates XML as the data format, parsed through a fork of the XML Data Source for Apache
Spark parser.

Currently, AWS Glue does not support "xml" for output.

You can use the following format_options values with format="xml":

• rowTag: Specifies the XML tag in the file to treat as a row. Row tags cannot be self-closing.
• encoding: Specifies the character encoding. The default value is "UTF-8".


• excludeAttribute: A Boolean value that specifies whether you want to exclude attributes in
elements or not. The default value is "false".
• treatEmptyValuesAsNulls: A Boolean value that specifies whether to treat white space as a null
value. The default value is "false".
• attributePrefix: A prefix for attributes to differentiate them from elements. This prefix is used for
field names. The default value is "_".
• valueTag: The tag used for a value when there are attributes in the element that have no child. The
default is "_VALUE".
• ignoreSurroundingSpaces: A Boolean value that specifies whether the white space that surrounds
values should be ignored. The default value is "false".
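
To illustrate, the following sketch reads XML files in which each record element is treated as one row;
the path and row tag are placeholders.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Treat each <record> element in the input files as one row of the DynamicFrame.
xml_source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/xml-input/"]},  # placeholder path
    format="xml",
    format_options={"rowTag": "record"},
)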

Managing Partitions for ETL Output in AWS Glue


Partitioning is an important technique for organizing datasets so they can be queried efficiently. It
organizes data in a hierarchical directory structure based on the distinct values of one or more columns.

For example, you might decide to partition your application logs in Amazon Simple Storage Service
(Amazon S3) by date, broken down by year, month, and day. Files that correspond to a single day's
worth of data are then placed under a prefix such as s3://my_bucket/logs/year=2018/month=01/
day=23/. Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these
partitions to filter data by partition value without having to read all the underlying data from Amazon
S3.

Crawlers not only infer file types and schemas, they also automatically identify the partition structure
of your dataset when they populate the AWS Glue Data Catalog. The resulting partition columns are
available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.

After you crawl a table, you can view the partitions that the crawler created by navigating to the table in
the AWS Glue console and choosing View Partitions.

For Apache Hive-style partitioned paths in key=val style, crawlers automatically populate the column
name using the key name. Otherwise, it uses default names like partition_0, partition_1, and so
on. To change the default names on the console, navigate to the table, choose Edit Schema, and modify
the names of the partition columns there.

In your ETL scripts, you can then filter on the partition columns.

Pre-Filtering Using Pushdown Predicates


In many cases, you can use a pushdown predicate to filter on partitions without having to list and read
all the files in your dataset. Instead of reading the entire dataset and then filtering in a DynamicFrame,
you can apply the filter directly on the partition metadata in the Data Catalog. Then you only list and
read what you actually need into a DynamicFrame.

For example, in Python you could write the following:

glue_context.create_dynamic_frame.from_catalog(
database = "my_S3_data_set",
table_name = "catalog_data_table",
push_down_predicate = my_partition_predicate)

This creates a DynamicFrame that loads only the partitions in the Data Catalog that satisfy the predicate
expression. Depending on how small a subset of your data you are loading, this can save a great deal of
processing time.

The predicate expression can be any Boolean expression supported by Spark SQL. Anything you
could put in a WHERE clause in a Spark SQL query will work. For example, the predicate expression
pushDownPredicate = "(year=='2017' and month=='04')" loads only the partitions in the
Data Catalog that have both year equal to 2017 and month equal to 04. For more information, see the
Apache Spark SQL documentation, and in particular, the Scala SQL functions reference.

In addition to Hive-style partitioning for Amazon S3 paths, Apache Parquet and Apache ORC file
formats further partition each file into blocks of data that represent column values. Each block also
stores statistics for the records that it contains, such as min/max for column values. AWS Glue supports
pushdown predicates for both Hive-style partitions and block partitions in these formats. In this way,
you can prune unnecessary Amazon S3 partitions in Parquet and ORC formats, and skip blocks that you
determine are unnecessary using column statistics.

Writing Partitions
By default, a DynamicFrame is not partitioned when it is written. All of the output files are written at
the top level of the specified output path. Until recently, the only way to write a DynamicFrame into
partitions was to convert it to a Spark SQL DataFrame before writing.

However, DynamicFrames now support native partitioning using a sequence of keys, using the
partitionKeys option when you create a sink. For example, the following Python code writes out a
dataset to Amazon S3 in the Parquet format, into directories partitioned by the type field. From there,
you can process these partitions using other systems, such as Amazon Athena.

glue_context.write_dynamic_frame.from_options(
frame = projectedEvents,
connection_type = "s3",
connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
format = "parquet")

Reading Input Files in Larger Groups


You can set properties of your tables to enable an AWS Glue ETL job to group files when they are read
from an Amazon S3 data store. These properties enable each ETL task to read a group of input files into
a single in-memory partition. This is especially useful when there are a large number of small files in your
Amazon S3 data store. When you set certain properties, you instruct AWS Glue to group files within an
Amazon S3 data partition and set the size of the groups to be read. You can also set these options when
reading from an Amazon S3 data store with the create_dynamic_frame_from_options method.

To enable grouping files for a table, you set key-value pairs in the parameters field of your table
structure. Use JSON notation to set a value for the parameter field of your table. For more information
about editing the properties of a table, see Viewing and Editing Table Details (p. 91).

You can use this method to enable grouping for tables in the Data Catalog with Amazon S3 data stores.

groupFiles

Set groupFiles to inPartition to enable the grouping of files within an Amazon S3 data partition.
AWS Glue automatically enables grouping if there are more than 50,000 input files. For example:

'groupFiles': 'inPartition'

groupSize

Set groupSize to the target size of groups in bytes. The groupSize property is optional. If it is not
provided, AWS Glue calculates a size that uses all the CPU cores in the cluster while still reducing the
overall number of ETL tasks and in-memory partitions.


For example, to set the group size to 1 MB:

'groupSize': '1048576'

Note that groupSize should be set to the result of a calculation. For example, 1024 * 1024 =
1048576 (1 MB).
recurse

Set recurse to True to recursively read files in all subdirectories when specifying paths as an
array of paths. You do not need to set recurse if paths is an array of object keys in Amazon S3. For
example:

'recurse':True

If you are reading from Amazon S3 directly using the create_dynamic_frame_from_options
method, add these connection options. For example, the following attempts to group files into 1 MB
groups:

df = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://s3path/"], 'recurse': True,
     'groupFiles': 'inPartition', 'groupSize': '1048576'},
    format="json")

Reading from JDBC Tables in Parallel


You can set properties of your JDBC table to enable AWS Glue to read data in parallel. When you set
certain properties, you instruct AWS Glue to run parallel SQL queries against logical partitions of your
data. You can control partitioning by setting a hash field or a hash expression. You can also control the
number of parallel reads that are used to access your data.

To enable parallel reads, you can set key-value pairs in the parameters field of your table structure.
Use JSON notation to set a value for the parameter field of your table. For more information
about editing the properties of a table, see Viewing and Editing Table Details (p. 91). You
can also enable parallel reads when you call the ETL (extract, transform, and load) methods
create_dynamic_frame_from_options and create_dynamic_frame_from_catalog. For
more information about specifying options in these methods, see from_options (p. 290) and
from_catalog (p. 291) .

You can use this method for JDBC tables, that is, most tables whose base data is a JDBC data store. These
properties are ignored when reading Amazon Redshift and Amazon S3 tables.

hashfield

Set hashfield to the name of a column in the JDBC table to be used to divide the data into
partitions. For best results, this column should have an even distribution of values to spread the
data between partitions. This column can be of any data type. AWS Glue generates non-overlapping
queries that run in parallel to read the data partitioned by this column. For example, if your data is
evenly distributed by month, you can use the month column to read each month of data in parallel:

'hashfield': 'month'


AWS Glue creates a query to hash the field value to a partition number and runs the query for all
partitions in parallel. To use your own query to partition a table read, provide a hashexpression
instead of a hashfield.
hashexpression

Set hashexpression to an SQL expression (conforming to the JDBC database engine grammar)
that returns a whole number. A simple expression is the name of any numeric column in the table.
AWS Glue generates SQL queries to read the JDBC data in parallel using the hashexpression in
the WHERE clause to partition data.

For example, use the numeric column customerID to read data partitioned by a customer number:

'hashexpression': 'customerID'

To have AWS Glue control the partitioning, provide a hashfield instead of a hashexpression.
hashpartitions

Set hashpartitions to the number of parallel reads of the JDBC table. If this property is not set,
the default value is 7.

For example, set the number of parallel reads to 5 so that AWS Glue reads your data with five
queries (or fewer):

'hashpartitions': '5'
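If you read through the ETL methods instead of setting table parameters, the same options can be passed in code. The following is a minimal sketch; it assumes that additional_options accepts these keys as described in from_catalog (p. 291), and the database and table names are placeholders for a JDBC-backed catalog table:

# A minimal sketch; "jdbc_db" and "jdbc_table" are placeholder catalog names,
# and glueContext is assumed to be an existing GlueContext.
df = glueContext.create_dynamic_frame.from_catalog(
    database = "jdbc_db",
    table_name = "jdbc_table",
    additional_options = {"hashexpression": "customerID", "hashpartitions": "5"})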

Moving Data to and from Amazon Redshift


When moving data to and from an Amazon Redshift cluster, AWS Glue jobs issue COPY and UNLOAD
statements against Amazon Redshift to achieve maximum throughput. These commands require that the
Amazon Redshift cluster access Amazon Simple Storage Service (Amazon S3) as a staging directory. By
default, AWS Glue passes in temporary credentials that are created using the role that you specified to
run the job. For security purposes, these credentials expire after 1 hour, which can cause long running
jobs to fail.

To address this issue, you can associate one or more IAM roles with the Amazon Redshift cluster itself.
COPY and UNLOAD can use the role, and Amazon Redshift refreshes the credentials as needed. For more
information about associating a role with your Amazon Redshift cluster, see IAM Permissions for COPY,
UNLOAD, and CREATE LIBRARY in the Amazon Redshift Database Developer Guide. Make sure that the
role you associate with your cluster has permissions to read from and write to the Amazon S3 temporary
directory that you specified in your job.

After you set up a role for the cluster, you need to specify it in ETL (extract, transform, and load)
statements in the AWS Glue script. The syntax depends on how your script reads and writes your dynamic
frame. If your script reads from an AWS Glue Data Catalog table, you can specify a role as follows:

glueContext.create_dynamic_frame.from_catalog(
database = "database-name",
table_name = "table-name",
redshift_tmp_dir = args["TempDir"],
additional_options = {"aws_iam_role": "arn:aws:iam::account-id:role/role-name"})


Similarly, if your script writes a dynamic frame to a table defined in the Data Catalog, you can specify the
role as follows:

glueContext.write_dynamic_frame.from_catalog(
database = "database-name",
table_name = "table-name",
redshift_tmp_dir = args["TempDir"],
additional_options = {"aws_iam_role": "arn:aws:iam::account-id:role/role-name"})

In these examples, role-name is the role that you associated with your Amazon Redshift cluster, and
database-name and table-name refer to an Amazon Redshift table in your Data Catalog.

You can also specify a role when you use a dynamic frame and you use copy_from_options. The
syntax is similar, but you put the additional parameter in the connection-options map:

connection_options = {
"url": "jdbc:redshift://host:port/redshift-database",
"dbtable": "redshift-table",
"user": "username",
"password": "password",
"redshiftTmpDir": args["TempDir"],
"aws_iam_role": "arn:aws:iam::account-id:role/role-name"
}

df = glueContext.create_dynamic_frame_from_options("redshift", connection-options)

The options are similar when writing to Amazon Redshift:

connection_options = {
"dbtable": "redshift-table",
"database": "redshift-database",
"aws_iam_role": "arn:aws:iam::account-id:role/role-name"
}

glueContext.write_dynamic_frame.from_jdbc_conf(
frame = input-dynamic-frame,
catalog_connection = "connection-name",
connection_options = connection-options,
redshift_tmp_dir = args["TempDir"])

Program AWS Glue ETL Scripts in Python


You can find Python code examples and utilities for AWS Glue in the AWS Glue samples repository on the
GitHub website.

Using Python with AWS Glue


AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load
(ETL) jobs. This section describes how to use Python in ETL scripts and with the AWS Glue API.

• Setting Up to Use Python with AWS Glue (p. 255)


• Calling AWS Glue APIs in Python (p. 256)


• Using Python Libraries with AWS Glue (p. 258)


• AWS Glue Python Code Samples (p. 259)

AWS Glue PySpark Extensions


AWS Glue has created the following extensions to the PySpark Python dialect.

• Accessing Parameters Using getResolvedOptions (p. 273)


• PySpark Extension Types (p. 274)
• DynamicFrame Class (p. 278)
• DynamicFrameCollection Class (p. 287)
• DynamicFrameWriter Class (p. 288)
• DynamicFrameReader Class (p. 290)
• GlueContext Class (p. 291)

AWS Glue PySpark Transforms


AWS Glue has created the following transform Classes to use in PySpark ETL operations.

• GlueTransform Base Class (p. 297)


• ApplyMapping Class (p. 299)
• DropFields Class (p. 300)
• DropNullFields Class (p. 301)
• ErrorsAsDynamicFrame Class (p. 303)
• Filter Class (p. 304)
• Join Class (p. 307)
• Map Class (p. 308)
• MapToCollection Class (p. 311)
• Relationalize Class (p. 312)
• RenameField Class (p. 313)
• ResolveChoice Class (p. 315)
• SelectFields Class (p. 316)
• SelectFromCollection Class (p. 317)
• Spigot Class (p. 318)
• SplitFields Class (p. 319)
• SplitRows Class (p. 321)
• Unbox Class (p. 322)
• UnnestFrame Class (p. 323)

Setting Up to Use Python with AWS Glue


Use Python 2.7 rather than Python 3 to develop your ETL scripts. The AWS Glue development endpoints
that provide interactive testing and development do not work with Python 3 yet.

To set up your system for using Python with AWS Glue

Follow these steps to install Python and to be able to invoke the AWS Glue APIs.


1. If you don't already have Python 2.7 installed, download and install it from the Python.org
download page.
2. Install the AWS Command Line Interface (AWS CLI) as documented in the AWS CLI documentation.

The AWS CLI is not directly necessary for using Python. However, installing and configuring it is a
convenient way to set up AWS with your account credentials and verify that they work.
3. Install the AWS SDK for Python (Boto 3), as documented in the Boto3 Quickstart.

Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be
used.

For more information about Boto 3, see AWS SDK for Python (Boto 3) Getting Started.

You can find Python code examples and utilities for AWS Glue in the AWS Glue samples repository on the
GitHub website.

Calling AWS Glue APIs in Python


Note that Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs
can be used.

AWS Glue API Names in Python


AWS Glue API names in Java and other programming languages are generally CamelCased. However,
when called from Python, these generic names are changed to lowercase, with the parts of the name
separated by underscore characters to make them more "Pythonic". In the AWS Glue API (p. 371)
reference documentation, these Pythonic names are listed in parentheses after the generic CamelCased
names.

However, although the AWS Glue API names themselves are transformed to lowercase, their parameter
names remain capitalized. It is important to remember this, because parameters should be passed by
name when calling AWS Glue APIs, as described in the following section.
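For example, the Data Catalog operation GetTables becomes get_tables when called through Boto 3, while its DatabaseName parameter keeps its capitalization. A minimal sketch (the database name shown is a placeholder):

import boto3

glue = boto3.client('glue')

# GetTables becomes get_tables, but the parameter name DatabaseName
# stays capitalized. "legislators" is a placeholder database name.
response = glue.get_tables(DatabaseName='legislators')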

Passing and Accessing Python Parameters in AWS Glue


In Python calls to AWS Glue APIs, it's best to pass parameters explicitly by name. For example:

job = glue.create_job(Name='sample', Role='Glue_DefaultRole',
          Command={'Name': 'glueetl',
                   'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})

It is helpful to understand that Python creates a dictionary of the name/value tuples that you specify
as arguments to an ETL script in a Job Structure (p. 451) or JobRun Structure (p. 459). Boto 3 then
passes them to AWS Glue in JSON format by way of a REST API call. This means that you cannot rely on
the order of the arguments when you access them in your script.

For example, suppose that you're starting a JobRun in a Python Lambda handler function, and you want
to specify several parameters. Your code might look something like the following:

from datetime import datetime, timedelta

import boto3

client = boto3.client('glue')

def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours = 1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d")
    hour_partition_value = last_hour_date_time.strftime("%-H")

    response = client.start_job_run(
        JobName = 'my_test_Job',
        Arguments = {
            '--day_partition_key': 'partition_0',
            '--hour_partition_key': 'partition_1',
            '--day_partition_value': day_partition_value,
            '--hour_partition_value': hour_partition_value } )

To access these parameters reliably in your ETL script, specify them by name using AWS Glue's
getResolvedOptions function and then access them from the resulting dictionary:

import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
['JOB_NAME',
'day_partition_key',
'hour_partition_key',
'day_partition_value',
'hour_partition_value'])
print "The day partition key is: ", args['day_partition_key']
print "and the day partition value is: ", args['day_partition_value']

Example: Create and Run a Job


The following example shows how to call the AWS Glue APIs using Python to create and run an ETL job.

To create and run a job

1. Create an instance of the AWS Glue client:

import boto3
glue = boto3.client(service_name='glue', region_name='us-east-1',
endpoint_url='https://glue.us-east-1.amazonaws.com')

2. Create a job. You must use glueetl as the name for the ETL command, as shown in the following
code:

myJob = glue.create_job(Name='sample', Role='Glue_DefaultRole',
            Command={'Name': 'glueetl',
                     'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})

3. Start a new run of the job that you created in the previous step:

myNewJobRun = glue.start_job_run(JobName=myJob['Name'])

4. Get the job status:

status = glue.get_job_run(JobName=myJob['Name'], RunId=myNewJobRun['JobRunId'])

5. Print the current state of the job run:

print status['JobRun']['JobRunState']


Using Python Libraries with AWS Glue


You can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they
are written in pure Python. C libraries such as pandas are not supported at the present time, nor are
extensions written in other languages.

Zipping Libraries for Inclusion


Unless a library is contained in a single .py file, it should be packaged in a .zip archive. The package
directory should be at the root of the archive, and must contain an __init__.py file for the package.
Python will then be able to import the package in the normal way.
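For example, a minimal sketch that builds such an archive with the Python standard library (the package name mylib is hypothetical) looks like this:

import shutil

# A minimal sketch: "mylib/" is a hypothetical package directory (containing
# __init__.py) in the current working directory. The resulting mylib.zip has
# mylib/ at the root of the archive, which is the layout AWS Glue expects.
shutil.make_archive("mylib", "zip", root_dir=".", base_dir="mylib")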

If your library only consists of a single Python module in one .py file, you do not need to place it in a
.zip file.

Loading Python Libraries in a Development Endpoint


If you are using different library sets for different ETL scripts, you can either set up a separate
development endpoint for each set, or you can overwrite the library .zip file(s) that your development
endpoint loads every time you switch scripts.

You can use the console to specify one or more library .zip files for a development endpoint when you
create it. After assigning a name and an IAM role, choose Script Libraries and job parameters (optional)
and enter the full Amazon S3 path to your library .zip file in the Python library path box. For example:

s3://bucket/prefix/site-packages.zip

If you want, you can specify multiple full paths to files, separating them with commas but no spaces, like
this:

s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip

If you update these .zip files later, you can use the console to re-import them into your development
endpoint. Navigate to the developer endpoint in question, check the box beside it, and choose Update
ETL libraries from the Action menu.

In a similar way, you can specify library files using the AWS Glue APIs. When you create a development
endpoint by calling CreateDevEndpoint Action (Python: create_dev_endpoint) (p. 475), you can specify
one or more full paths to libraries in the ExtraPythonLibsS3Path parameter, in a call that looks like this:

dep = glue.create_dev_endpoint(
    EndpointName="testDevEndpoint",
    RoleArn="arn:aws:iam::123456789012",
    SecurityGroupIds=["sg-7f5ad1ff"],
    SubnetId="subnet-c12fdba4",
    PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...",
    NumberOfNodes=3,
    ExtraPythonLibsS3Path="s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip")

When you update a development endpoint, you can also update the libraries it loads using a
DevEndpointCustomLibraries (p. 474) object and setting the UpdateEtlLibraries parameter to
True when calling UpdateDevEndpoint (update_dev_endpoint) (p. 477).
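For example, a minimal sketch of such a call (the endpoint name and library path are placeholders) might look like this:

# A minimal sketch; the endpoint name and library path are placeholders,
# and glue is assumed to be an existing Boto 3 AWS Glue client.
glue.update_dev_endpoint(
    EndpointName="testDevEndpoint",
    CustomLibraries={"ExtraPythonLibsS3Path": "s3://bucket/prefix/lib_A.zip"},
    UpdateEtlLibraries=True)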

If you are using a Zeppelin Notebook with your development endpoint, you will need to call the
following PySpark function before importing a package or packages from your .zip file:


sc.addPyFile("/home/glue/downloads/python/yourZipFileName.zip")

Using Python Libraries in a Job or JobRun


When you are creating a new Job on the console, you can specify one or more library .zip files by
choosing Script Libraries and job parameters (optional) and entering the full Amazon S3 library path(s)
in the same way you would when creating a development endpoint:

s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip

If you are calling CreateJob (create_job) (p. 455), you can specify one or more full paths to default
libraries using the --extra-py-files default parameter, like this:

job = glue.create_job(Name='sampleJob',
                      Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'},
                      DefaultArguments={'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'})

Then when you are starting a JobRun, you can override the default library setting with a different one:

runId = glue.start_job_run(JobName='sampleJob',
Arguments={'--extra-py-files': 's3://bucket/prefix/lib_B.zip'})

AWS Glue Python Code Samples


• Code Example: Joining and Relationalizing Data (p. 259)
• Code Example: Data Preparation Using ResolveChoice, Lambda, and ApplyMapping (p. 268)

Code Example: Joining and Relationalizing Data


This example uses a dataset that was downloaded from http://everypolitician.org/ to the sample-
dataset bucket in Amazon Simple Storage Service (Amazon S3): s3://awsglue-datasets/
examples/us-legislators/all. The dataset contains data in JSON format about United States
legislators and the seats that they have held in the US House of Representatives and Senate, and has
been modified slightly for purposes of this tutorial.

You can find the source code for this example in the join_and_relationalize.py file in the AWS
Glue samples repository on the GitHub website.

Using this data, this tutorial shows you how to do the following:

• Use an AWS Glue crawler to classify objects that are stored in an Amazon S3 bucket and save their
schemas into the AWS Glue Data Catalog.
• Examine the table metadata and schemas that result from the crawl.
• Write a Python extract, transform, and load (ETL) script that uses the metadata in the Data Catalog to do
the following:
• Join the data in the different source files together into a single data table (that is, denormalize the
data).
• Filter the joined table into separate tables by type of legislator.
• Write out the resulting data to separate Apache Parquet files for later analysis.


The easiest way to debug Python or PySpark scripts is to create a development endpoint and run your
code there. We recommend that you start by setting up a development endpoint to work in. For more
information, see the section called “Development Endpoints on the Console” (p. 157).

Step 1: Crawl the Data in the Amazon S3 Bucket


1. Sign in to the AWS Management Console, and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Following the steps in Working with Crawlers on the AWS Glue Console (p. 107), create a new
crawler that can crawl the s3://awsglue-datasets/examples/us-legislators/all dataset
into a database named legislators in the AWS Glue Data Catalog.
3. Run the new crawler, and then check the legislators database.

The crawler creates the following metadata tables:

• persons_json
• memberships_json
• organizations_json
• events_json
• areas_json
• countries_r_json

This is a semi-normalized collection of tables containing legislators and their histories.

Step 2: Add Boilerplate Script to the Development Endpoint Notebook


Paste the following boilerplate script into the development endpoint notebook to import the AWS Glue
libraries that you need, and set up a single GlueContext:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

Step 3: Examine the Schemas in the Data Catalog


Next, you can easily examine the schemas that the crawler recorded in the AWS Glue Data Catalog. For
example, to see the schema of the persons_json table, add the following in your notebook:

persons = glueContext.create_dynamic_frame.from_catalog(
database="legislators",
table_name="persons_json")
print "Count: ", persons.count()
persons.printSchema()

Here's the output from the print calls:

Count: 1961


root
|-- family_name: string
|-- name: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- note: string
| | |-- name: string
| | |-- lang: string
|-- sort_name: string
|-- images: array
| |-- element: struct
| | |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
|-- death_date: string

Each person in the table is a member of some US congressional body.

To view the schema of the memberships_json table, type the following:

memberships = glueContext.create_dynamic_frame.from_catalog(
database="legislators",
table_name="memberships_json")
print "Count: ", memberships.count()
memberships.printSchema()

The output is as follows:

Count: 10439
root
|-- area_id: string
|-- on_behalf_of_id: string
|-- organization_id: string
|-- role: string
|-- person_id: string
|-- legislative_period_id: string
|-- start_date: string
|-- end_date: string

The organizations are parties and the two chambers of Congress, the Senate and House of
Representatives. To view the schema of the organizations_json table, type the following:

orgs = glueContext.create_dynamic_frame.from_catalog(
database="legislators",


table_name="organizations_json")
print "Count: ", orgs.count()
orgs.printSchema()

The output is as follows:

Count: 13
root
|-- classification: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- image: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- lang: string
| | |-- note: string
| | |-- name: string
|-- id: string
|-- name: string
|-- seats: int
|-- type: string

Step 4: Filter the Data


Next, keep only the fields that you want, and rename id to org_id. The dataset is small enough that
you can view the whole thing.

The toDF() method converts a DynamicFrame to an Apache Spark DataFrame, so you can apply the
transforms that already exist in Apache Spark SQL:

orgs = orgs.drop_fields(['other_names',
'identifiers']).rename_field(
'id', 'org_id').rename_field(
'name', 'org_name')
orgs.toDF().show()

The following shows the output:

+--------------+--------------------+--------------------+--------------------+-----
+-----------+--------------------+
|classification| org_id| org_name| links|seats|
type| image|
+--------------+--------------------+--------------------+--------------------+-----
+-----------+--------------------+
| party| party/al| AL| null| null|
null| null|
| party| party/democrat| Democrat|[[website,http://...| null|
null|https://upload.wi...|
| party|party/democrat-li...| Democrat-Liberal|[[website,http://...| null|
null| null|
| legislature|d56acebe-8fdc-47b...|House of Represen...| null| 435|lower
house| null|


| party| party/independent| Independent| null| null|


null| null|
| party|party/new_progres...| New Progressive|[[website,http://...| null|
null|https://upload.wi...|
| party|party/popular_dem...| Popular Democrat|[[website,http://...| null|
null| null|
| party| party/republican| Republican|[[website,http://...| null|
null|https://upload.wi...|
| party|party/republican-...|Republican-Conser...|[[website,http://...| null|
null| null|
| party| party/democrat| Democrat|[[website,http://...| null|
null|https://upload.wi...|
| party| party/independent| Independent| null| null|
null| null|
| party| party/republican| Republican|[[website,http://...| null|
null|https://upload.wi...|
| legislature|8fa6c3d2-71dc-478...| Senate| null| 100|upper
house| null|
+--------------+--------------------+--------------------+--------------------+-----
+-----------+--------------------+

Type the following to view the organizations that appear in memberships:

memberships.select_fields(['organization_id']).toDF().distinct().show()

The following shows the output:

+--------------------+
| organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+

Step 5: Put It All Together


Now, use AWS Glue to join these relational tables and create one full history table of legislator
memberships and their corresponding organizations.

1. First, join persons and memberships on id and person_id.


2. Next, join the result with orgs on org_id and organization_id.
3. Then, drop the redundant fields, person_id and org_id.

You can do all these operations in one (extended) line of code:

l_history = Join.apply(orgs,
Join.apply(persons, memberships, 'id', 'person_id'),
'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
print "Count: ", l_history.count()
l_history.printSchema()

The output is as follows:

Count: 10439


root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
| |-- element: struct
| | |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
| |-- element: struct
| | |-- note: string
| | |-- name: string
| | |-- lang: string
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- death_date: string
|-- legislative_period_id: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- image: string
|-- given_name: string
|-- family_name: string
|-- id: string
|-- start_date: string
|-- end_date: string

You now have the final table that you can use for analysis. You can write it out in a compact, efficient
format for analytics—namely Parquet—that you can run SQL over in AWS Glue, Amazon Athena, or
Amazon Redshift Spectrum.

The following call writes the table across multiple files to support fast parallel reads when doing analysis
later:

glueContext.write_dynamic_frame.from_options(frame = l_history,
    connection_type = "s3",
    connection_options = {"path": "s3://glue-sample-target/output-dir/legislator_history"},
    format = "parquet")

To put all the history data into a single file, you must convert it to a data frame, repartition it, and write
it out:

s_history = l_history.toDF().repartition(1)
s_history.write.parquet('s3://glue-sample-target/output-dir/legislator_single')


Or, if you want to separate it by the Senate and the House:

l_history.toDF().write.parquet('s3://glue-sample-target/output-dir/legislator_part',
partitionBy=['org_name'])

Step 6: Write the Data to Relational Databases


AWS Glue makes it easy to write the data to relational databases like Amazon Redshift, even with semi-
structured data. It offers a transform relationalize, which flattens DynamicFrames no matter how
complex the objects in the frame might be.

Using the l_history DynamicFrame in this example, pass in the name of a root table (hist_root)
and a temporary working path to relationalize. This returns a DynamicFrameCollection. You can
then list the names of the DynamicFrames in that collection:

dfc = l_history.relationalize("hist_root", "s3://glue-sample-target/temp-dir/")


dfc.keys()

The following is the output of the keys call:

[u'hist_root', u'hist_root_contact_details', u'hist_root_links',
 u'hist_root_other_names', u'hist_root_images', u'hist_root_identifiers']

Relationalize broke the history table out into six new tables: a root table that contains a record
for each object in the DynamicFrame, and auxiliary tables for the arrays. Array handling in relational
databases is often suboptimal, especially as those arrays become large. Separating the arrays into
different tables makes the queries go much faster.

Next, look at the separation by examining contact_details:

l_history.select_fields('contact_details').printSchema()
dfc.select('hist_root_contact_details').toDF().where(
    "id = 10 or id = 75").orderBy(['id','index']).show()

The following is the output of the show call:

root
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 10| 0| fax| |
| 10| 1| | 202-225-1314|
| 10| 2| phone| |
| 10| 3| | 202-225-3772|
| 10| 4| twitter| |
| 10| 5| | MikeRossUpdates|
| 75| 0| fax| |
| 75| 1| | 202-225-7856|
| 75| 2| phone| |


| 75| 3| | 202-225-2711|
| 75| 4| twitter| |
| 75| 5| | SenCapito|
+---+-----+------------------------+-------------------------+

The contact_details field was an array of structs in the original DynamicFrame. Each element of
those arrays is a separate row in the auxiliary table, indexed by index. The id here is a foreign key into
the hist_root table with the key contact_details:

dfc.select('hist_root').toDF().where(
"contact_details = 10 or contact_details = 75").select(
['id', 'given_name', 'family_name', 'contact_details']).show()

The following is the output:

+--------------------+----------+-----------+---------------+
| id|given_name|family_name|contact_details|
+--------------------+----------+-----------+---------------+
|f4fc30ee-7b42-432...| Mike| Ross| 10|
|e3c60f34-7d1b-4c0...| Shelley| Capito| 75|
+--------------------+----------+-----------+---------------+

Notice in these commands that toDF() and then a where expression are used to filter for the rows that
you want to see.

So, joining the hist_root table with the auxiliary tables lets you do the following:

• Load data into databases without array support.


• Query each individual item in an array using SQL.

You already have a connection set up named redshift3. For information about how to create your own
connection, see the section called “Adding a Connection to Your Data Store” (p. 92).

Next, write this collection into Amazon Redshift by cycling through the DynamicFrames one at a time:

for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print "Writing to Redshift table: ", df_name
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df,
        catalog_connection = "redshift3",
        connection_options = {"dbtable": df_name, "database": "testdb"},
        redshift_tmp_dir = "s3://glue-sample-target/temp-dir/")

The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within
a database, specify schema.table-name. If a schema is not provided, then the default "public" schema
is used.
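For example, a minimal sketch of connection options for a hypothetical schema-qualified table looks like this:

# A minimal sketch; "myschema" and "mytable" are placeholder names.
connection_options = {"dbtable": "myschema.mytable", "database": "testdb"}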

For more information, see Connection Types and Options for ETL in AWS Glue (p. 245).

Here's what the tables look like in Amazon Redshift. (You connected to Amazon Redshift through psql.)

testdb=# \d


List of relations
schema | name | type | owner
--------+---------------------------+-------+-----------
public | hist_root | table | test_user
public | hist_root_contact_details | table | test_user
public | hist_root_identifiers | table | test_user
public | hist_root_images | table | test_user
public | hist_root_links | table | test_user
public | hist_root_other_names | table | test_user
(6 rows)

testdb=# \d hist_root_contact_details
Table "public.hist_root_contact_details"
Column | Type | Modifiers
---------------------------+--------------------------+-----------
id | bigint |
index | integer |
contact_details.val.type | character varying(65535) |
contact_details.val.value | character varying(65535) |

testdb=# \d hist_root
Table "public.hist_root"
Column | Type | Modifiers
-----------------------+--------------------------+-----------
role | character varying(65535) |
seats | integer |
org_name | character varying(65535) |
links | bigint |
type | character varying(65535) |
sort_name | character varying(65535) |
area_id | character varying(65535) |
images | bigint |
on_behalf_of_id | character varying(65535) |
other_names | bigint |
birth_date | character varying(65535) |
name | character varying(65535) |
organization_id | character varying(65535) |
gender | character varying(65535) |
classification | character varying(65535) |
legislative_period_id | character varying(65535) |
identifiers | bigint |
given_name | character varying(65535) |
image | character varying(65535) |
family_name | character varying(65535) |
id | character varying(65535) |
death_date | character varying(65535) |
start_date | character varying(65535) |
contact_details | bigint |
end_date | character varying(65535) |

Now you can query these tables using SQL in Amazon Redshift:

testdb=# select * from hist_root_contact_details where id = 10 or id = 75 order by id, index;

The following shows the result:

id | index | contact_details.val.type | contact_details.val.value


---+-------+--------------------------+---------------------------
10 | 0 | fax | 202-224-6020
10 | 1 | phone | 202-224-3744


10 | 2 | twitter | ChuckGrassley
75 | 0 | fax | 202-224-4680
75 | 1 | phone | 202-224-4642
75 | 2 | twitter | SenJackReed
(6 rows)

Conclusion
Overall, AWS Glue is very flexible. It lets you accomplish, in a few lines of code, what normally
would take days to write. You can find the entire source-to-target ETL scripts in the Python file
join_and_relationalize.py in the AWS Glue samples on GitHub.

Code Example: Data Preparation Using ResolveChoice, Lambda, and ApplyMapping
The dataset that is used in this example consists of Medicare Provider payment data downloaded
from two Data.CMS.gov sites: Inpatient Prospective Payment System Provider Summary for the Top
100 Diagnosis-Related Groups - FY2011, and Inpatient Charge Data FY 2011. After downloading it,
we modified the data to introduce a couple of erroneous records at the end of the file. This modified
file is located in a public Amazon S3 bucket at s3://awsglue-datasets/examples/medicare/
Medicare_Hospital_Provider.csv.

You can find the source code for this example in the data_cleaning_and_lambda.py file in the AWS
Glue examples GitHub repository.

The easiest way to debug Python or PySpark scripts is to create a development endpoint and run your
code there. We recommend that you start by setting up a development endpoint to work in. For more
information, see the section called “Development Endpoints on the Console” (p. 157).

Step 1: Crawl the Data in the Amazon S3 Bucket


1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Following the process described in Working with Crawlers on the AWS Glue Console (p. 107),
create a new crawler that can crawl the s3://awsglue-datasets/examples/medicare/
Medicare_Hospital_Provider.csv file, and can place the resulting metadata into a database
named payments in the AWS Glue Data Catalog.
3. Run the new crawler, and then check the payments database. You should find that the crawler has
created a metadata table named medicare in the database after reading the beginning of the file
to determine its format and delimiter.

The schema of the new medicare table is as follows:

Column name Data type


==================================================
drg definition string
provider id bigint
provider name string
provider street address string
provider city string
provider state string
provider zip code bigint
hospital referral region description string
total discharges bigint
average covered charges string
average total payments string
average medicare payments string


Step 2: Add Boilerplate Script to the Development Endpoint Notebook


Paste the following boilerplate script into the development endpoint notebook to import the AWS Glue
libraries that you need, and set up a single GlueContext:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

Step 3: Compare Different Schema Parsings


Next, you can see if the schema that was recognized by an Apache Spark DataFrame is the same as the
one that your AWS Glue crawler recorded. Run this code:

medicare = spark.read.format(
"com.databricks.spark.csv").option(
"header", "true").option(
"inferSchema", "true").load(
's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()

Here's the output from the printSchema call:

root
|-- DRG Definition: string (nullable = true)
|-- Provider Id: string (nullable = true)
|-- Provider Name: string (nullable = true)
|-- Provider Street Address: string (nullable = true)
|-- Provider City: string (nullable = true)
|-- Provider State: string (nullable = true)
|-- Provider Zip Code: integer (nullable = true)
|-- Hospital Referral Region Description: string (nullable = true)
|-- Total Discharges : integer (nullable = true)
|-- Average Covered Charges : string (nullable = true)
|-- Average Total Payments : string (nullable = true)
|-- Average Medicare Payments: string (nullable = true)

Next, look at the schema that an AWS Glue DynamicFrame generates:

medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
database = "payments",
table_name = "medicare")
medicare_dynamicframe.printSchema()

The output from printSchema is as follows:

root
|-- drg definition: string
|-- provider id: choice
| |-- long


| |-- string
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string

The DynamicFrame generates a schema in which provider id could be either a long or a string
type. The DataFrame schema lists Provider Id as being a string type, and the Data Catalog lists
provider id as being a bigint type.

Which one is correct? There are two records at the end of the file (out of 160,000 records) with string
values in that column. These are the erroneous records that were introduced to illustrate a problem.

To address this kind of problem, the AWS Glue DynamicFrame introduces the concept of a choice type.
In this case, the DynamicFrame shows that both long and string values can appear in that column.
The AWS Glue crawler missed the string values because it considered only a 2 MB prefix of the data.
The Apache Spark DataFrame considered the whole dataset, but it was forced to assign the most
general type to the column, namely string. In fact, Spark often resorts to the most general case when
there are complex types or variations with which it is unfamiliar.

To query the provider id column, resolve the choice type first. You can use the resolveChoice
transform method in your DynamicFrame to convert those string values to long values with a
cast:long option:

medicare_res = medicare_dynamicframe.resolveChoice(specs = [('provider id','cast:long')])
medicare_res.printSchema()

The printSchema output is now:

root
|-- drg definition: string
|-- provider id: long
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string

Where the value was a string that could not be cast, AWS Glue inserted a null.

Another option is to convert the choice type to a struct, which keeps values of both types.
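For example, a minimal sketch of that alternative uses the make_struct action of resolveChoice:

# A minimal sketch: keep both the long and the string values by resolving
# the choice into a struct instead of casting.
medicare_struct = medicare_dynamicframe.resolveChoice(
    specs = [('provider id', 'make_struct')])
medicare_struct.printSchema()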

Next, look at the rows that were anomalous:

medicare_res.toDF().where("'provider id' is NULL").show()

You see the following:


+--------------------+-----------+---------------+-----------------------+-------------
+--------------+-----------------+------------------------------------+----------------
+-----------------------+----------------------+-------------------------+
| drg definition|provider id| provider name|provider street address|provider city|
provider state|provider zip code|hospital referral region description|total discharges|
average covered charges|average total payments|average medicare payments|
+--------------------+-----------+---------------+-----------------------+-------------
+--------------+-----------------+------------------------------------+----------------
+-----------------------+----------------------+-------------------------+
|948 - SIGNS & SYM...| null| INC| 1050 DIVISION ST| MAUSTON|
WI| 53948| WI - Madison| 12|
$11961.41| $4619.00| $3775.33|
|948 - SIGNS & SYM...| null| INC- ST JOSEPH| 5000 W CHAMBERS ST| MILWAUKEE|
WI| 53210| WI - Milwaukee| 14|
$10514.28| $5562.50| $4522.78|
+--------------------+-----------+---------------+-----------------------+-------------
+--------------+-----------------+------------------------------------+----------------
+-----------------------+----------------------+-------------------------+

Now remove the two malformed records, as follows:

medicare_dataframe = medicare_res.toDF()
medicare_dataframe = medicare_dataframe.where("'provider id' is NOT NULL")

Step 4: Map the Data and Use Apache Spark Lambda Functions
AWS Glue does not yet directly support Lambda functions, also known as user-defined functions. But
you can always convert a DynamicFrame to and from an Apache Spark DataFrame to take advantage of
Spark functionality in addition to the special features of DynamicFrames.

Next, turn the payment information into numbers, so analytic engines like Amazon Redshift or Amazon
Athena can do their number crunching faster:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

chop_f = udf(lambda x: x[1:], StringType())

medicare_dataframe = medicare_dataframe.withColumn(
"ACC", chop_f(
medicare_dataframe["average covered charges"])).withColumn(
"ATP", chop_f(
medicare_dataframe["average total payments"])).withColumn(
"AMP", chop_f(
medicare_dataframe["average medicare payments"]))
medicare_dataframe.select(['ACC', 'ATP', 'AMP']).show()

The output from the show call is as follows:

+--------+-------+-------+
| ACC| ATP| AMP|
+--------+-------+-------+
|32963.07|5777.24|4763.73|
|15131.85|5787.57|4976.71|
|37560.37|5434.95|4453.79|
|13998.28|5417.56|4129.16|
|31633.27|5658.33|4851.44|
|16920.79|6653.80|5374.14|
|11977.13|5834.74|4761.41|
|35841.09|8031.12|5858.50|
|28523.39|6113.38|5228.40|


|75233.38|5541.05|4386.94|
|67327.92|5461.57|4493.57|
|39607.28|5356.28|4408.20|
|22862.23|5374.65|4186.02|
|31110.85|5366.23|4376.23|
|25411.33|5282.93|4383.73|
| 9234.51|5676.55|4509.11|
|15895.85|5930.11|3972.85|
|19721.16|6192.54|5179.38|
|10710.88|4968.00|3898.88|
|51343.75|5996.00|4962.45|
+--------+-------+-------+
only showing top 20 rows

These are all still strings in the data. We can use the powerful apply_mapping transform method to
drop, rename, cast, and nest the data so that other data programming languages and systems can easily
access it:

medicare_tmp_dyf = DynamicFrame.fromDF(medicare_dataframe, glueContext, "nested")


medicare_nest_dyf = medicare_tmp_dyf.apply_mapping([('drg definition', 'string', 'drg',
'string'),
('provider id', 'long', 'provider.id', 'long'),
('provider name', 'string', 'provider.name', 'string'),
('provider city', 'string', 'provider.city', 'string'),
('provider state', 'string', 'provider.state', 'string'),
('provider zip code', 'long', 'provider.zip', 'long'),
('hospital referral region description', 'string','rr', 'string'),
('ACC', 'string', 'charges.covered', 'double'),
('ATP', 'string', 'charges.total_pay', 'double'),
('AMP', 'string', 'charges.medicare_pay', 'double')])
medicare_nest_dyf.printSchema()

The printSchema output is as follows:

root
|-- drg: string
|-- provider: struct
| |-- id: long
| |-- name: string
| |-- city: string
| |-- state: string
| |-- zip: long
|-- rr: string
|-- charges: struct
| |-- covered: double
| |-- total_pay: double
| |-- medicare_pay: double

Turning the data back into a Spark DataFrame, you can show what it looks like now:

medicare_nest_dyf.toDF().show()

The output is as follows:

+--------------------+--------------------+---------------+--------------------+
| drg| provider| rr| charges|
+--------------------+--------------------+---------------+--------------------+
|039 - EXTRACRANIA...|[10001,SOUTHEAST ...| AL - Dothan|[32963.07,5777.24...|
|039 - EXTRACRANIA...|[10005,MARSHALL M...|AL - Birmingham|[15131.85,5787.57...|
|039 - EXTRACRANIA...|[10006,ELIZA COFF...|AL - Birmingham|[37560.37,5434.95...|
|039 - EXTRACRANIA...|[10011,ST VINCENT...|AL - Birmingham|[13998.28,5417.56...|

|039 - EXTRACRANIA...|[10016,SHELBY BAP...|AL - Birmingham|[31633.27,5658.33...|
|039 - EXTRACRANIA...|[10023,BAPTIST ME...|AL - Montgomery|[16920.79,6653.8,...|
|039 - EXTRACRANIA...|[10029,EAST ALABA...|AL - Birmingham|[11977.13,5834.74...|
|039 - EXTRACRANIA...|[10033,UNIVERSITY...|AL - Birmingham|[35841.09,8031.12...|
|039 - EXTRACRANIA...|[10039,HUNTSVILLE...|AL - Huntsville|[28523.39,6113.38...|
|039 - EXTRACRANIA...|[10040,GADSDEN RE...|AL - Birmingham|[75233.38,5541.05...|
|039 - EXTRACRANIA...|[10046,RIVERVIEW ...|AL - Birmingham|[67327.92,5461.57...|
|039 - EXTRACRANIA...|[10055,FLOWERS HO...| AL - Dothan|[39607.28,5356.28...|
|039 - EXTRACRANIA...|[10056,ST VINCENT...|AL - Birmingham|[22862.23,5374.65...|
|039 - EXTRACRANIA...|[10078,NORTHEAST ...|AL - Birmingham|[31110.85,5366.23...|
|039 - EXTRACRANIA...|[10083,SOUTH BALD...| AL - Mobile|[25411.33,5282.93...|
|039 - EXTRACRANIA...|[10085,DECATUR GE...|AL - Huntsville|[9234.51,5676.55,...|
|039 - EXTRACRANIA...|[10090,PROVIDENCE...| AL - Mobile|[15895.85,5930.11...|
|039 - EXTRACRANIA...|[10092,D C H REGI...|AL - Tuscaloosa|[19721.16,6192.54...|
|039 - EXTRACRANIA...|[10100,THOMAS HOS...| AL - Mobile|[10710.88,4968.0,...|
|039 - EXTRACRANIA...|[10103,BAPTIST ME...|AL - Birmingham|[51343.75,5996.0,...|
+--------------------+--------------------+---------------+--------------------+
only showing top 20 rows

Step 5: Write the Data to Apache Parquet


AWS Glue makes it easy to write the data in a format such as Apache Parquet that relational databases
can effectively consume:

glueContext.write_dynamic_frame.from_options(
frame = medicare_nest_dyf,
connection_type = "s3",
connection_options = {"path": "s3://glue-sample-target/output-dir/
medicare_parquet"},
format = "parquet")

AWS Glue PySpark Extensions Reference


AWS Glue has created the following extensions to the PySpark Python dialect.

• Accessing Parameters Using getResolvedOptions (p. 273)


• PySpark Extension Types (p. 274)
• DynamicFrame Class (p. 278)
• DynamicFrameCollection Class (p. 287)
• DynamicFrameWriter Class (p. 288)
• DynamicFrameReader Class (p. 290)
• GlueContext Class (p. 291)

Accessing Parameters Using getResolvedOptions


The AWS Glue getResolvedOptions(args, options) utility function gives you access to the
arguments that are passed to your script when you run a job. To use this function, start by importing it
from the AWS Glue utils module, along with the sys module:

import sys
from awsglue.utils import getResolvedOptions

getResolvedOptions(args, options)

• args – The list of arguments contained in sys.argv.


• options – A Python array of the argument names that you want to retrieve.


Example Retrieving arguments passed to a JobRun


Suppose that you created a JobRun in a script, perhaps within a Lambda function:

response = client.start_job_run(
JobName = 'my_test_Job',
Arguments = {
'--day_partition_key': 'partition_0',
'--hour_partition_key': 'partition_1',
'--day_partition_value': day_partition_value,
'--hour_partition_value': hour_partition_value } )

To retrieve the arguments that are passed, you can use the getResolvedOptions function as follows:

import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
['JOB_NAME',
'day_partition_key',
'hour_partition_key',
'day_partition_value',
'hour_partition_value'])
print "The day-partition key is: ", args['day_partition_key']
print "and the day-partition value is: ", args['day_partition_value']

PySpark Extension Types


The types that are used by the AWS Glue PySpark extensions.

DataType
The base class for the other AWS Glue types.

__init__(properties={})

• properties – Properties of the data type (optional).

typeName(cls)
Returns the type of the AWS Glue type class (that is, the class name with "Type" removed from the end).

• cls – An AWS Glue class instance derived from DataType.

jsonValue( )

Returns a JSON object that contains the data type and properties of the class:

{
"dataType": typeName,
"properties": properties
}

AtomicType and Simple Derivatives


Inherits from and extends the DataType (p. 274) class, and serves as the base class for all the AWS Glue
atomic data types.


fromJsonValue(cls, json_value)

Initializes a class instance with values from a JSON object.

• cls – An AWS Glue type class instance to initialize.


• json_value – The JSON object to load key-value pairs from.

The following types are simple derivatives of the AtomicType (p. 274) class:

• BinaryType – Binary data.


• BooleanType – Boolean values.
• ByteType – A byte value.
• DateType – A datetime value.
• DoubleType – A floating-point double value.
• IntegerType – An integer value.
• LongType – A long integer value.
• NullType – A null value.
• ShortType – A short integer value.
• StringType – A text string.
• TimestampType – A timestamp value (typically in seconds from 1/1/1970).
• UnknownType – A value of unidentified type.

DecimalType(AtomicType)
Inherits from and extends the AtomicType (p. 274) class to represent a decimal number (a number
expressed in decimal digits, as opposed to binary base-2 numbers).

__init__(precision=10, scale=2, properties={})

• precision – The number of digits in the decimal number (optional; the default is 10).
• scale – The number of digits to the right of the decimal point (optional; the default is 2).
• properties – The properties of the decimal number (optional).

EnumType(AtomicType)
Inherits from and extends the AtomicType (p. 274) class to represent an enumeration of valid options.

__init__(options)

• options – A list of the options being enumerated.

Collection Types
• ArrayType(DataType) (p. 276)
• ChoiceType(DataType) (p. 276)
• MapType(DataType) (p. 276)
• Field(Object) (p. 276)
• StructType(DataType) (p. 276)
• EntityType(DataType) (p. 277)


ArrayType(DataType)
__init__(elementType=UnknownType(), properties={})

• elementType – The type of elements in the array (optional; the default is UnknownType).
• properties – Properties of the array (optional).

ChoiceType(DataType)
__init__(choices=[], properties={})

• choices – A list of possible choices (optional).


• properties – Properties of these choices (optional).

add(new_choice)
Adds a new choice to the list of possible choices.

• new_choice – The choice to add to the list of possible choices.

merge(new_choices)
Merges a list of new choices with the existing list of choices.

• new_choices – A list of new choices to merge with existing choices.

MapType(DataType)
__init__(valueType=UnknownType, properties={})

• valueType – The type of values in the map (optional; the default is UnknownType).
• properties – Properties of the map (optional).

Field(Object)
Creates a field object out of an object that derives from DataType (p. 274).

__init__(name, dataType, properties={})

• name – The name to be assigned to the field.


• dataType – The object to create a field from.
• properties – Properties of the field (optional).

StructType(DataType)
Defines a data structure (struct).

__init__(fields=[], properties={})

• fields – A list of the fields (of type Field) to include in the structure (optional).
• properties – Properties of the structure (optional).


add(field)

• field – An object of type Field to add to the structure.

hasField(field)

Returns True if this structure has a field of the same name, or False if not.

• field – A field name, or an object of type Field whose name is used.

getField(field)

• field – A field name or an object of type Field whose name is used. If the structure has a field of the
same name, it is returned.
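
The following minimal sketch builds a StructType from Field objects and then uses add and hasField.
The awsglue.gluetypes import path and the field names are assumptions used only for illustration.

from awsglue.gluetypes import Field, IntegerType, StringType, StructType   # assumed import path

# Describe an "employee" record with two fields.
employee_schema = StructType([
    Field("name", StringType()),
    Field("age", IntegerType())
])

# Add a third field after construction and check for it by name.
employee_schema.add(Field("department", StringType()))
print(employee_schema.hasField("department"))   # True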

EntityType(DataType)
__init__(entity, base_type, properties)

This class is not yet implemented.

Other Types
• DataSource(object) (p. 277)
• DataSink(object) (p. 277)

DataSource(object)
__init__(j_source, sql_ctx, name)

• j_source – The data source.


• sql_ctx – The SQL context.
• name – The data-source name.

setFormat(format, **options)

• format – The format to set for the data source.


• options – A collection of options to set for the data source.

getFrame()

Returns a DynamicFrame for the data source.

DataSink(object)
__init__(j_sink, sql_ctx)

• j_sink – The sink to create.


• sql_ctx – The SQL context for the data sink.

setFormat(format, **options)

• format – The format to set for the data sink.


• options – A collection of options to set for the data sink.

setAccumulableSize(size)

• size – The accumulable size to set, in bytes.

writeFrame(dynamic_frame, info="")

• dynamic_frame – The DynamicFrame to write.


• info – Information about the DynamicFrame (optional).

write(dynamic_frame_or_dfc, info="")
Writes a DynamicFrame or a DynamicFrameCollection.

• dynamic_frame_or_dfc – Either a DynamicFrame object or a DynamicFrameCollection object to be written.
• info – Information about the DynamicFrame or DynamicFrames to be written (optional).

DynamicFrame Class
One of the major abstractions in Apache Spark is the SparkSQL DataFrame, which is similar to the
DataFrame construct found in R and Pandas. A DataFrame is similar to a table and supports functional-
style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).

DataFrames are powerful and widely used, but they have limitations with respect to extract, transform,
and load (ETL) operations. Most significantly, they require a schema to be specified before any data is
loaded. SparkSQL addresses this by making two passes over the data—the first to infer the schema,
and the second to load the data. However, this inference is limited and doesn't address the realities of
messy data. For example, the same field might be of a different type in different records. Apache Spark
often gives up and reports the type as string using the original field text. This might not be correct,
and you might want finer control over how schema discrepancies are resolved. And for large datasets, an
additional pass over the source data might be prohibitively expensive.

To address these limitations, AWS Glue introduces the DynamicFrame. A DynamicFrame is similar to a
DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS
Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a
choice (or union) type. You can resolve these inconsistencies to make your datasets compatible with data
stores that require a fixed schema.

Similarly, a DynamicRecord represents a logical record within a DynamicFrame. It is like a row in a Spark
DataFrame, except that it is self-describing and can be used for data that does not conform to a fixed
schema.

You can convert DynamicFrames to and from DataFrames once you resolve any schema
inconsistencies.


— Construction —
• __init__ (p. 279)
• fromDF (p. 279)
• toDF (p. 279)

__init__
__init__(jdf, glue_ctx, name)

• jdf – A reference to the data frame in the Java Virtual Machine (JVM).
• glue_ctx – A GlueContext Class (p. 291) object.
• name – An optional name string, empty by default.

fromDF
fromDF(dataframe, glue_ctx, name)

Converts a DataFrame to a DynamicFrame by converting DataFrame fields to DynamicRecord fields.
Returns the new DynamicFrame.

A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in a Spark
DataFrame, except that it is self-describing and can be used for data that does not conform to a fixed
schema.

• dataframe – The Apache Spark SQL DataFrame to convert (required).


• glue_ctx – The GlueContext Class (p. 291) object that specifies the context for this transform
(required).
• name – The name of the resulting DynamicFrame (required).

toDF
toDF(options)

Converts a DynamicFrame to an Apache Spark DataFrame by converting DynamicRecords into DataFrame fields.
Returns the new DataFrame.

A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in a Spark
DataFrame, except that it is self-describing and can be used for data that does not conform to a fixed
schema.

• options – A list of options. Specify the target type if you choose the Project and Cast action type.
Examples include the following:

>>>toDF([ResolveOption("a.b.c", "KeepAsStruct")])
>>>toDF([ResolveOption("a.b.c", "Project", DoubleType())])

— Information —
• count (p. 280)
• schema (p. 280)
• printSchema (p. 280)


• show (p. 280)

count
count( ) – Returns the number of rows in the underlying DataFrame.

schema
schema( ) – Returns the schema of this DynamicFrame, or if that is not available, the schema of the
underlying DataFrame.

printSchema
printSchema( ) – Prints the schema of the underlying DataFrame.

show
show(num_rows) – Prints a specified number of rows from the underlying DataFrame.

— Transforms —
• apply_mapping (p. 280)
• drop_fields (p. 281)
• filter (p. 281)
• join (p. 281)
• map (p. 282)
• relationalize (p. 282)
• rename_field (p. 283)
• resolveChoice (p. 283)
• select_fields (p. 284)
• spigot (p. 284)
• split_fields (p. 285)
• split_rows (p. 285)
• unbox (p. 285)
• unnest (p. 286)
• write (p. 286)

apply_mapping
apply_mapping(mappings, transformation_ctx="", info="", stageThreshold=0,
totalThreshold=0)

Applies a declarative mapping to this DynamicFrame and returns a new DynamicFrame with those
mappings applied.

• mappings – A list of mapping tuples, each consisting of: (source column, source type, target column,
target type). Required.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).


• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).
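
A minimal sketch of a call, assuming a hypothetical DynamicFrame named medicare_dyf that contains the
listed source columns:

mapped_dyf = medicare_dyf.apply_mapping([
    ("provider id", "string", "provider_id", "long"),
    ("provider name", "string", "provider_name", "string")
])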

drop_fields
drop_fields(paths, transformation_ctx="", info="", stageThreshold=0,
totalThreshold=0)
Calls the FlatMap Class (p. 306) transform to remove fields from a DynamicFrame. Returns a new
DynamicFrame with the specified fields dropped.

• paths – A list of strings, each containing the full path to a field node you want to drop.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

filter
filter(f, transformation_ctx="", info="", stageThreshold=0,
totalThreshold=0)
Returns a new DynamicFrame built by selecting all DynamicRecords within the input DynamicFrame
that satisfy the specified predicate function f.

• f – The predicate function to apply to the DynamicFrame. The function must take a DynamicRecord
as an argument and return True if the DynamicRecord meets the filter requirements, or False if not
(required).

A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in a Spark
DataFrame, except that it is self-describing and can be used for data that does not conform to a fixed
schema.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

For an example of how to use the filter transform, see Filter Class (p. 304).

join
join(paths1, paths2, frame2, transformation_ctx="", info="",
stageThreshold=0, totalThreshold=0)
Performs an equality join with another DynamicFrame and returns the resulting DynamicFrame.

• paths1 – A list of the keys in this frame to join.


• paths2 – A list of the keys in the other frame to join.


• frame2 – The other DynamicFrame to join.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).
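
For example, the following sketch joins two hypothetical DynamicFrames on an order's customer_id key
and a customer's id key (all names are assumptions):

joined_dyf = orders_dyf.join(["customer_id"], ["id"], customers_dyf)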

map
map(f, transformation_ctx="", info="", stageThreshold=0,
totalThreshold=0)
Returns a new DynamicFrame that results from applying the specified mapping function to all records in
the original DynamicFrame.

• f – The mapping function to apply to all records in the DynamicFrame. The function must take a
DynamicRecord as an argument and return a new DynamicRecord (required).

A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in an Apache
Spark DataFrame, except that it is self-describing and can be used for data that does not conform to a
fixed schema.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

For an example of how to use the map transform, see Map Class (p. 308).

relationalize
relationalize(root_table_name, staging_path, options,
transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)
Relationalizes a DynamicFrame by producing a list of frames that are generated by unnesting nested
columns and pivoting array columns. The pivoted array column can be joined to the root table using the
joinkey generated during the unnest phase.

• root_table_name – The name for the root table.


• staging_path – The path at which to store partitions of pivoted tables in CSV format (optional).
Pivoted tables are read back from this path.
• options – A dictionary of optional parameters.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).
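
The following sketch relationalizes a hypothetical nested DynamicFrame. The frame name, the root table
name, and the Amazon S3 staging path are placeholders.

# Flatten nested structures and pivot array columns into separate frames.
frames = nested_dyf.relationalize("root", "s3://my-bucket/glue-temp/", {})

print(frames.keys())              # for example: ["root", "root_orders"]
root_dyf = frames.select("root")  # the flattened root table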


rename_field
rename_field(oldName, newName, transformation_ctx="", info="",
stageThreshold=0, totalThreshold=0)

Renames a field in this DynamicFrame and returns a new DynamicFrame with the field renamed.

• oldName – The full path to the node you want to rename.

If the old name has dots in it, RenameField will not work unless you place back-ticks around it (`). For
example, to replace this.old.name with thisNewName, you would call rename_field as follows:

newDyF = oldDyF.rename_field("`this.old.name`", "thisNewName")

• newName – The new name, as a full path.


• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

resolveChoice
resolveChoice(specs = None, option="", transformation_ctx="", info="",
stageThreshold=0, totalThreshold=0)

Resolves a choice type within this DynamicFrame and returns the new DynamicFrame.

• specs – A list of specific ambiguities to resolve, each in the form of a tuple: (path, action).
The path value identifies a specific ambiguous element, and the action value identifies the
corresponding resolution. Only one of the specs and option parameters can be used. If the specs
parameter is not None, then the option parameter must be an empty string. Conversely, if option
is not an empty string, then the specs parameter must be None. If neither parameter is provided,
AWS Glue tries to parse the schema and use it to resolve ambiguities.

The action portion of a specs tuple can specify one of four resolution strategies:
• cast: Allows you to specify a type to cast to (for example, cast:int).
• make_cols: Resolves a potential ambiguity by flattening the data. For example, if columnA could
be an int or a string, the resolution would be to produce two columns named columnA_int and
columnA_string in the resulting DynamicFrame.
• make_struct: Resolves a potential ambiguity by using a struct to represent the data. For example,
if data in a column could be an int or a string, using the make_struct action produces a column
of structures in the resulting DynamicFrame that each contains both an int and a string.
• project: Resolves a potential ambiguity by projecting all the data to one of the possible data
types. For example, if data in a column could be an int or a string, using a project:string
action produces a column in the resulting DynamicFrame where all the int values have been
converted to strings.

If the path identifies an array, place empty square brackets after the name of the array to avoid
ambiguity. For example, suppose you are working with data structured as follows:

"myList": [


{ "price": 100.00 },
{ "price": "$100.00" }
]

You can select the numeric rather than the string version of the price by setting the path to
"myList[].price", and the action to "cast:double".
• option – The default resolution action if the specs parameter is None. If the specs parameter is not
None, then this must not be set to anything but an empty string.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

Example

df1 = df.resolveChoice(option = "make_cols")
df2 = df.resolveChoice(specs = [("a.b", "make_struct"), ("c.d", "cast:double")])

select_fields

select_fields(paths, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)

Returns a new DynamicFrame containing the selected fields.

• paths – A list of strings, each of which is a path to a top-level node that you want to select.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

spigot

spigot(path, options={})

Writes sample records to a specified destination during a transformation, and returns the input
DynamicFrame with an additional write step.

• path – The path to the destination to which to write (required).


• options – Key-value pairs specifying options (optional). The "topk" option specifies that the first k
records should be written. The "prob" option specifies the probability (as a decimal) of picking any
given record, to be used in selecting records to write.
• transformation_ctx – A unique string that is used to identify state information (optional).


split_fields
split_fields(paths, name1, name2, transformation_ctx="", info="",
stageThreshold=0, totalThreshold=0)

Returns a new DynamicFrameCollection that contains two DynamicFrames: the first containing all
the nodes that have been split off, and the second containing the nodes that remain.

• paths – A list of strings, each of which is a full path to a node that you want to split into a new
DynamicFrame.
• name1 – A name string for the DynamicFrame that is split off.
• name2 – A name string for the DynamicFrame that remains after the specified nodes have been split
off.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).
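
A minimal sketch, assuming a hypothetical customers_dyf that has name and email columns:

split_dfc = customers_dyf.split_fields(["name", "email"], "contact_info", "everything_else")

contact_dyf = split_dfc.select("contact_info")        # the fields that were split off
remaining_dyf = split_dfc.select("everything_else")   # the fields that remain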

split_rows
Splits one or more rows in a DynamicFrame off into a new DynamicFrame.

split_rows(comparison_dict, name1, name2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)

Returns a new DynamicFrameCollection containing two DynamicFrames: the first containing all the
rows that have been split off and the second containing the rows that remain.

• comparison_dict – A dictionary in which the key is the path to a column and the value is another
dictionary that maps comparators to the values against which the column value is compared. For example,
{"age": {">": 10, "<": 20}} splits off all rows whose value in the age column is greater than
10 and less than 20.
• name1 – A name string for the DynamicFrame that is split off.
• name2 – A name string for the DynamicFrame that remains after the specified nodes have been split
off.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

unbox
unbox(path, format, transformation_ctx="", info="", stageThreshold=0,
totalThreshold=0, **options)

Unboxes a string field in a DynamicFrame and returns a new DynamicFrame containing the unboxed
DynamicRecords.


A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in an Apache
Spark DataFrame, except that it is self-describing and can be used for data that does not conform to a
fixed schema.

• path – A full path to the string node you want to unbox.


• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).
• options – One or more of the following:
• separator – A string containing the separator character.
• escaper – A string containing the escape character.
• skipFirst – A Boolean value indicating whether to skip the first instance.
• withSchema – A string containing the schema; must be called using StructType.json( ).
• withHeader – A Boolean value indicating whether a header is included.

For example: unbox("a.b.c", "csv", separator="|")

unnest
Unnests nested objects in a DynamicFrame, making them top-level objects, and returns a new unnested
DynamicFrame.

unnest(transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)


• transformation_ctx – A unique string that is used to identify state information (optional).


• info – A string to be associated with error reporting for this transformation (optional).
• stageThreshold – The number of errors encountered during this transformation at which the
process should error out (optional: zero by default, indicating that the process should not error out).
• totalThreshold – The number of errors encountered up to and including this transformation at
which the process should error out (optional: zero by default, indicating that the process should not
error out).

For example: unnest( )

write
write(connection_type, connection_options, format, format_options,
accumulator_size)

Gets a DataSink(object) (p. 277) of the specified connection type from the GlueContext Class (p. 291)
of this DynamicFrame, and uses it to format and write the contents of this DynamicFrame. Returns the
new DynamicFrame formatted and written as specified.


• connection_type – The connection type to use. Valid values include s3, mysql, postgresql,
redshift, sqlserver, and oracle.
• connection_options – The connection option to use (optional). For a connection_type of s3, an
Amazon S3 path is defined.

connection_options = {"path": "s3://aws-glue-target/temp"}

For JDBC connections, several properties must be defined. Note that the database name must be part
of the URL. It can optionally be included in the connection options.

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• format_options – Format options for the specified format. See Format Options for ETL Inputs and
Outputs in AWS Glue (p. 248) for the formats that are supported.
• accumulator_size – The accumulable size to use (optional).
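
For example, the following sketch writes a hypothetical DynamicFrame to a placeholder Amazon S3 path as
JSON:

dyf.write(connection_type="s3",
          connection_options={"path": "s3://aws-glue-target/temp"},
          format="json",
          format_options={},
          accumulator_size=0)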

— Errors —
• assertErrorThreshold (p. 287)
• errorsAsDynamicFrame (p. 287)
• errorsCount (p. 287)
• stageErrorsCount (p. 287)

assertErrorThreshold
assertErrorThreshold( ) – An assert for errors in the transformations that created this
DynamicFrame. Returns an Exception from the underlying DataFrame.

errorsAsDynamicFrame
errorsAsDynamicFrame( ) – Returns a DynamicFrame that has error records nested inside.

errorsCount
errorsCount( ) – Returns the total number of errors in a DynamicFrame.

stageErrorsCount
stageErrorsCount( ) – Returns the number of errors that occurred in the process of generating this
DynamicFrame.

DynamicFrameCollection Class
A DynamicFrameCollection is a dictionary of DynamicFrame Class (p. 278) objects, in which the
keys are the names of the DynamicFrames and the values are the DynamicFrame objects.

__init__
__init__(dynamic_frames, glue_ctx)

• dynamic_frames – A dictionary of DynamicFrame Class (p. 278) objects.


• glue_ctx – A GlueContext Class (p. 291) object.

keys
keys( ) – Returns a list of the keys in this collection, which generally consists of the names of the
corresponding DynamicFrame values.

values
values(key) – Returns a list of the DynamicFrame values in this collection.

select

select(key)

Returns the DynamicFrame that corresponds to the specified key (which is generally the name of the
DynamicFrame).

• key – A key in the DynamicFrameCollection, which usually represents the name of a DynamicFrame.

map

map(callable, transformation_ctx="")

Uses a passed-in function to create and return a new DynamicFrameCollection based on the
DynamicFrames in this collection.

• callable – A function that takes a DynamicFrame and the specified transformation context as
parameters and returns a DynamicFrame.
• transformation_ctx – A transformation context to be used by the callable (optional).

flatmap

flatmap(f, transformation_ctx="")

Uses a passed-in function to create and return a new DynamicFrameCollection based on the
DynamicFrames in this collection.

• f – A function that takes a DynamicFrame as a parameter and returns a DynamicFrame or
DynamicFrameCollection.
• transformation_ctx – A transformation context to be used by the function (optional).
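
A minimal sketch of working with a collection, assuming two hypothetical DynamicFrames (sales_dyf and
customers_dyf), an existing GlueContext named glueContext, and the awsglue.dynamicframe import path:

from awsglue.dynamicframe import DynamicFrameCollection   # assumed import path

dfc = DynamicFrameCollection({"sales": sales_dyf, "customers": customers_dyf}, glueContext)

print(dfc.keys())                  # ["sales", "customers"] (order may vary)
sales_only = dfc.select("sales")   # pick one frame by name

# Apply the same transform to every frame in the collection, following the
# callable contract described above (frame, transformation context).
trimmed_dfc = dfc.map(lambda frame, ctx: frame.drop_fields(["internal_id"]))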

DynamicFrameWriter Class
Methods
• __init__ (p. 289)
• from_options (p. 289)
• from_catalog (p. 289)
• from_jdbc_conf (p. 290)


__init__

__init__(glue_context)

• glue_context – The GlueContext Class (p. 291) to use.

from_options

from_options(frame, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")

Writes a DynamicFrame using the specified connection and format.

• frame – The DynamicFrame to write.


• connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift,
sqlserver, and oracle.
• connection_options – Connection options, such as path and database table (optional). For a
connection_type of s3, an Amazon S3 path is defined.

connection_options = {"path": "s3://aws-glue-target/temp"}

For JDBC connections, several properties must be defined. Note that the database name must be part
of the URL. It can optionally be included in the connection options.

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas
within a database, specify schema.table-name. If a schema is not provided, then the default "public"
schema is used.

For more information, see Connection Types and Options for ETL in AWS Glue (p. 245).
• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• format_options – Format options for the specified format. See Format Options for ETL Inputs and
Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – A transformation context to use (optional).
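
The following sketch writes a hypothetical DynamicFrame to a placeholder Amazon S3 path as CSV. It
assumes the writer is reached through glueContext.write_dynamic_frame, mirroring the
glueContext.create_dynamic_frame reader accessor used in the Filter example later in this guide.

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-target/output"},
    format="csv")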

from_catalog

from_catalog(frame, name_space, table_name, redshift_tmp_dir="", transformation_ctx="")

Writes a DynamicFrame using the specified catalog database and table name.

• frame – The DynamicFrame to write.


• name_space – The database to use.
• table_name – The name of the table to use.
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – A transformation context to use (optional).


from_jdbc_conf
from_jdbc_conf(frame, catalog_connection, connection_options={},
redshift_tmp_dir = "", transformation_ctx="")

Writes a DynamicFrame using the specified JDBC connection information.

• frame – The DynamicFrame to write.


• catalog_connection – A catalog connection to use.
• connection_options – Connection options, such as path and database table (optional).
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – A transformation context to use (optional).

DynamicFrameReader Class
— Methods —
• __init__ (p. 290)
• from_rdd (p. 290)
• from_options (p. 290)
• from_catalog (p. 291)

__init__
__init__(glue_context)

• glue_context – The GlueContext Class (p. 291) to use.

from_rdd
from_rdd(data, name, schema=None, sampleRatio=None)

Reads a DynamicFrame from a Resilient Distributed Dataset (RDD).

• data – The dataset to read from.


• name – The name to read from.
• schema – The schema to read (optional).
• sampleRatio – The sample ratio (optional).

from_options
from_options(connection_type, connection_options={}, format=None,
format_options={}, transformation_ctx="", push_down_predicate="")

Reads a DynamicFrame using the specified connection and format.

• connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift,
sqlserver, oracle, and dynamodb.
• connection_options – Connection options, such as path and database table (optional). For a
connection_type of s3, Amazon S3 paths are defined in an array.


connection_options = {"paths": [ "s3://mybucket/object_a", "s3://mybucket/object_b"]}

For JDBC connections, several properties must be defined. Note that the database name must be part
of the URL. It can optionally be included in the connection options.

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

For a JDBC connection that performs parallel reads, you can set the hashfield option. For example:

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path" , "hashfield":
"month"}

For more information, see Reading from JDBC Tables in Parallel (p. 252).
• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• format_options – Format options for the specified format. See Format Options for ETL Inputs and
Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – The transformation context to use (optional).
• push_down_predicate – Filters partitions without having to list and read all the files in your dataset.
For more information, see Pre-Filtering Using Pushdown Predicates (p. 250).

from_catalog
from_catalog(name_space, table_name, redshift_tmp_dir="",
transformation_ctx="", push_down_predicate="", additional_options={})

Reads a DynamicFrame using the specified catalog namespace and table name.

• name_space – The database to read from.


• table_name – The name of the table to read from.
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – The transformation context to use (optional).
• push_down_predicate – Filters partitions without having to list and read all the files in your dataset.
For more information, see Pre-Filtering Using Pushdown Predicates (p. 250).
• additional_options – Additional options provided to AWS Glue. To use a JDBC connection that
performs parallel reads, you can set the hashfield, hashexpression, or hashpartitions options.
For example:

additional_options = {"hashfield": "month"}

For more information, see Reading from JDBC Tables in Parallel (p. 252).
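
A minimal sketch using the reader reached through glueContext.create_dynamic_frame (as in the Filter
example later in this guide). The database name, table name, and partition predicate are placeholders.

dyf = glueContext.create_dynamic_frame.from_catalog(
    "sales_db",
    "transactions",
    push_down_predicate="year == '2019'")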

GlueContext Class
Wraps the Apache SparkSQL SQLContext object, and thereby provides mechanisms for interacting with
the Apache Spark platform.


Creating
• __init__ (p. 292)
• getSource (p. 292)
• create_dynamic_frame_from_rdd (p. 292)
• create_dynamic_frame_from_catalog (p. 292)
• create_dynamic_frame_from_options (p. 293)

__init__
__init__(sparkContext)

• sparkContext – The Apache Spark context to use.

getSource
getSource(connection_type, transformation_ctx = "", **options)
Creates a DataSource object that can be used to read DynamicFrames from external sources.

• connection_type – The connection type to use, such as Amazon S3, Amazon Redshift, and JDBC.
Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.
• transformation_ctx – The transformation context to use (optional).
• options – A collection of optional name-value pairs. For more information, see Connection Types and
Options for ETL in AWS Glue (p. 245).

The following is an example of using getSource:

>>> data_source = context.getSource("file", paths=["/in/path"])
>>> data_source.setFormat("json")
>>> myFrame = data_source.getFrame()

create_dynamic_frame_from_rdd
create_dynamic_frame_from_rdd(data, name, schema=None, sample_ratio=None,
transformation_ctx="")
Returns a DynamicFrame that is created from an Apache Spark Resilient Distributed Dataset (RDD).

• data – The data source to use.


• name – The name of the data to use.
• schema – The schema to use (optional).
• sample_ratio – The sample ratio to use (optional).
• transformation_ctx – The transformation context to use (optional).

create_dynamic_frame_from_catalog
create_dynamic_frame_from_catalog(database, table_name, redshift_tmp_dir,
transformation_ctx = "", push_down_predicate= "", additional_options =
{}, catalog_id = None)
Returns a DynamicFrame that is created using a catalog database and table name.


• database – The database to read from.


• table_name – The name of the table to read from.
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – The transformation context to use (optional).
• push_down_predicate – Filters partitions without having to list and read all the files in your dataset.
For more information, see Pre-Filtering Using Pushdown Predicates (p. 250).
• additional_options – Additional options provided to AWS Glue.
• catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the
default account ID of the caller is used.
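
For example (the database and table names are placeholders):

dyf = glueContext.create_dynamic_frame_from_catalog(
    database="sales_db",
    table_name="transactions",
    redshift_tmp_dir="",
    transformation_ctx="read_transactions")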

create_dynamic_frame_from_options
create_dynamic_frame_from_options(connection_type, connection_options={},
format=None, format_options={}, transformation_ctx = "",
push_down_predicate= "")

Returns a DynamicFrame created with the specified connection and format.

• connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid
values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.
• connection_options – Connection options, such as paths and database table (optional). For a
connection_type of s3, a list of Amazon S3 paths is defined.

connection_options = {"paths": ["s3://aws-glue-target/temp"]}

For JDBC connections, several properties must be defined. Note that the database name must be part
of the URL. It can optionally be included in the connection options.

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas
within a database, specify schema.table-name. If a schema is not provided, then the default "public"
schema is used.

For more information, see Connection Types and Options for ETL in AWS Glue (p. 245).
• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• format_options – Format options for the specified format. See Format Options for ETL Inputs and
Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – The transformation context to use (optional).
• push_down_predicate – Filters partitions without having to list and read all the files in your dataset.
For more information, see Pre-Filtering Using Pushdown Predicates (p. 250).

Writing
• getSink (p. 294)
• write_dynamic_frame_from_options (p. 294)
• write_from_options (p. 295)
• write_dynamic_frame_from_catalog (p. 295)


• write_dynamic_frame_from_jdbc_conf (p. 296)


• write_from_jdbc_conf (p. 296)

getSink
getSink(connection_type, format = None, transformation_ctx = "",
**options)

Gets a DataSink object that can be used to write DynamicFrames to external sources. Check the
SparkSQL format first to be sure to get the expected sink.

• connection_type – The connection type to use, such as Amazon S3, Amazon Redshift, and JDBC.
Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
• format – The SparkSQL format to use (optional).
• transformation_ctx – The transformation context to use (optional).
• options – A collection of option name-value pairs.

For example:

>>> data_sink = context.getSink("s3")
>>> data_sink.setFormat("json")
>>> data_sink.writeFrame(myFrame)

write_dynamic_frame_from_options
write_dynamic_frame_from_options(frame, connection_type,
connection_options={}, format=None, format_options={}, transformation_ctx
= "")

Writes and returns a DynamicFrame using the specified connection and format.

• frame – The DynamicFrame to write.


• connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid
values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
• connection_options – Connection options, such as path and database table (optional). For a
connection_type of s3, an Amazon S3 path is defined.

connection_options = {"path": "s3://aws-glue-target/temp"}

For JDBC connections, several properties must be defined. Note that the database name must be part
of the URL. It can optionally be included in the connection options.

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas
within a database, specify schema.table-name. If a schema is not provided, then the default "public"
schema is used.

For more information, see Connection Types and Options for ETL in AWS Glue (p. 245).
• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.


• format_options – Format options for the specified format. See Format Options for ETL Inputs and
Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – A transformation context to use (optional).
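
For example, the following sketch writes a hypothetical DynamicFrame to a placeholder PostgreSQL table,
using the JDBC connection option keys shown above:

glueContext.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://dbhost:5432/salesdb",
        "user": "username",
        "password": "password",
        "dbtable": "public.transactions"})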

write_from_options
write_from_options(frame_or_dfc, connection_type, connection_options={},
format={}, format_options={}, transformation_ctx = "")

Writes and returns a DynamicFrame or DynamicFrameCollection that is created with the specified
connection and format information.

• frame_or_dfc – The DynamicFrame or DynamicFrameCollection to write.


• connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid
values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
• connection_options – Connection options, such as path and database table (optional). For a
connection_type of s3, an Amazon S3 path is defined.

connection_options = {"path": "s3://aws-glue-target/temp"}

For JDBC connections, several properties must be defined. Note that the database name must be part
of the URL. It can optionally be included in the connection options.

connection_options = {"url": "jdbc-url/database", "user": "username", "password":


"password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas
within a database, specify schema.table-name. If a schema is not provided, then the default "public"
schema is used.

For more information, see Connection Types and Options for ETL in AWS Glue (p. 245).
• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• format_options – Format options for the specified format. See Format Options for ETL Inputs and
Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – A transformation context to use (optional).

write_dynamic_frame_from_catalog
write_dynamic_frame_from_catalog(frame, database, table_name,
redshift_tmp_dir, transformation_ctx = "", additional_options = {},
catalog_id = None)

Writes and returns a DynamicFrame using a catalog database and a table name.

• frame – The DynamicFrame to write.


• database – The database to write to.
• table_name – The name of the table to write to.
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – The transformation context to use (optional).


• catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the
default account ID of the caller is used.

write_dynamic_frame_from_jdbc_conf

write_dynamic_frame_from_jdbc_conf(frame, catalog_connection,
connection_options={}, redshift_tmp_dir = "", transformation_ctx = "",
catalog_id = None)

Writes and returns a DynamicFrame using the specified JDBC connection information.

• frame – The DynamicFrame to write.


• catalog_connection – A catalog connection to use.
• connection_options – Connection options, such as path and database table (optional). For more
information, see Connection Types and Options for ETL in AWS Glue (p. 245).
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – A transformation context to use (optional).
• catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the
default account ID of the caller is used.

write_from_jdbc_conf

write_from_jdbc_conf(frame_or_dfc, catalog_connection,
connection_options={}, redshift_tmp_dir = "", transformation_ctx = "",
catalog_id = None)

Writes and returns a DynamicFrame or DynamicFrameCollection using the specified JDBC connection
information.

• frame_or_dfc – The DynamicFrame or DynamicFrameCollection to write.


• catalog_connection – A catalog connection to use.
• connection_options – Connection options, such as path and database table (optional). For more
information, see Connection Types and Options for ETL in AWS Glue (p. 245).
• redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
• transformation_ctx – A transformation context to use (optional).
• catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the
default account ID of the caller is used.

Extracting
• extract_jdbc_conf (p. 296)

extract_jdbc_conf

extract_jdbc_conf(connection_name, catalog_id = None)

Returns a dict with keys user, password, vendor, and url from the connection object in the Data
Catalog.

• connection_name – The name of the connection in the Data Catalog.


• catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the
default account ID of the caller is used.
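
For example (the connection name is a placeholder):

jdbc_conf = glueContext.extract_jdbc_conf("my-jdbc-connection")
print(jdbc_conf["url"])
print(jdbc_conf["user"])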

AWS Glue PySpark Transforms Reference


AWS Glue has created the following transform classes to use in PySpark ETL operations.

• GlueTransform Base Class (p. 297)


• ApplyMapping Class (p. 299)
• DropFields Class (p. 300)
• DropNullFields Class (p. 301)
• ErrorsAsDynamicFrame Class (p. 303)
• Filter Class (p. 304)
• Join Class (p. 307)
• Map Class (p. 308)
• MapToCollection Class (p. 311)
• Relationalize Class (p. 312)
• RenameField Class (p. 313)
• ResolveChoice Class (p. 315)
• SelectFields Class (p. 316)
• SelectFromCollection Class (p. 317)
• Spigot Class (p. 318)
• SplitFields Class (p. 319)
• SplitRows Class (p. 321)
• Unbox Class (p. 322)
• UnnestFrame Class (p. 323)

GlueTransform Base Class


The base class that all the awsglue.transforms classes inherit from.

The classes all define a __call__ method. They either override the GlueTransform class methods
listed in the following sections, or they are called using the class name by default.

Methods
• apply(cls, *args, **kwargs) (p. 297)
• name(cls) (p. 298)
• describeArgs(cls) (p. 298)
• describeReturn(cls) (p. 298)
• describeTransform(cls) (p. 298)
• describeErrors(cls) (p. 298)
• describe(cls) (p. 299)

apply(cls, *args, **kwargs)


Applies the transform by calling the transform class, and returns the result.


• cls – The self class object.

name(cls)
Returns the name of the derived transform class.

• cls – The self class object.

describeArgs(cls)
• cls – The self class object.

Returns a list of dictionaries, each corresponding to a named argument, in the following format:

[
{
"name": "(name of argument)",
"type": "(type of argument)",
"description": "(description of argument)",
"optional": "(Boolean, True if the argument is optional)",
"defaultValue": "(Default value string, or None)(String; the default value, or None)"
},
...
]

Raises a NotImplementedError exception when called in a derived transform where it is not implemented.

describeReturn(cls)
• cls – The self class object.

Returns a dictionary with information about the return type, in the following format:

{
"type": "(return type)",
"description": "(description of output)"
}

Raises a NotImplementedError exception when called in a derived transform where it is not implemented.

describeTransform(cls)
Returns a string describing the transform.

• cls – The self class object.

Raises a NotImplementedError exception when called in a derived transform where it is not implemented.

describeErrors(cls)
• cls – The self class object.


Returns a list of dictionaries, each describing a possible exception thrown by this transform, in the
following format:

[
{
"type": "(type of error)",
"description": "(description of error)"
},
...
]

describe(cls)
• cls – The self class object.

Returns an object with the following format:

{
"transform" : {
"name" : cls.name( ),
"args" : cls.describeArgs( ),
"returns" : cls.describeReturn( ),
"raises" : cls.describeErrors( ),
"location" : "internal"
}
}

ApplyMapping Class
Applies a mapping in a DynamicFrame.

Methods
• __call__ (p. 299)
• apply (p. 300)
• name (p. 300)
• describeArgs (p. 300)
• describeReturn (p. 300)
• describeTransform (p. 300)
• describeErrors (p. 300)
• describe (p. 300)

__call__(frame, mappings, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Applies a declarative mapping to a specified DynamicFrame.

• frame – The DynamicFrame in which to apply the mapping (required).


• mappings – A list of mapping tuples, each consisting of: (source column, source type, target column,
target type). Required.

If the source column has dots in it, the mapping will not work unless you place back-ticks around
it (``). For example, to map this.old.name (string) to thisNewName (string), you would use the
following tuple:


("`this.old.name`", "string", "thisNewName", "string")

• transformation_ctx – A unique string that is used to identify state information (optional).


• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns only the fields of the DynamicFrame specified in the "mapping" tuples.
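
A minimal sketch of calling the transform through apply, in the same style as the Filter example later in
this section. The frame and column names are hypothetical.

from awsglue.transforms import ApplyMapping

mapped_dyf = ApplyMapping.apply(
    frame=payments_dyf,
    mappings=[
        ("provider id", "string", "provider_id", "long"),
        ("average total payments", "string", "average_total_payments", "double")])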

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

DropFields Class
Drops fields within a DynamicFrame.

Methods
• __call__ (p. 301)
• apply (p. 301)
• name (p. 301)
• describeArgs (p. 301)
• describeReturn (p. 301)
• describeTransform (p. 301)
• describeErrors (p. 301)
• describe (p. 301)


__call__(frame, paths, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Drops nodes within a DynamicFrame.

• frame – The DynamicFrame in which to drop the nodes (required).


• paths – A list of full paths to the nodes to drop (required).
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a new DynamicFrame without the specified fields.

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

DropNullFields Class
Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null
values in every record in the DynamicFrame data set.

Methods
• __call__ (p. 302)
• apply (p. 302)
• name (p. 302)


• describeArgs (p. 302)


• describeReturn (p. 302)
• describeTransform (p. 302)
• describeErrors (p. 302)
• describe (p. 302)

__call__(frame, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null
values in every record in the DynamicFrame data set.

• frame – The DynamicFrame in which to drop null fields (required).


• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a new DynamicFrame with no null fields.

apply(cls, *args, **kwargs)


• cls – cls

name(cls)
• cls – cls

describeArgs(cls)
• cls – cls

describeReturn(cls)
• cls – cls

describeTransform(cls)
• cls – cls

describeErrors(cls)
• cls – cls

describe(cls)
• cls – cls


ErrorsAsDynamicFrame Class
Returns a DynamicFrame that contains nested error records leading up to the creation of the source
DynamicFrame.

Methods
• __call__ (p. 303)
• apply (p. 303)
• name (p. 303)
• describeArgs (p. 303)
• describeReturn (p. 303)
• describeTransform (p. 303)
• describeErrors (p. 303)
• describe (p. 303)

__call__(frame)
Returns a DynamicFrame that contains nested error records relating to the source DynamicFrame.

• frame – The source DynamicFrame (required).

apply(cls, *args, **kwargs)


• cls – cls

name(cls)
• cls – cls

describeArgs(cls)
• cls – cls

describeReturn(cls)
• cls – cls

describeTransform(cls)
• cls – cls

describeErrors(cls)
• cls – cls

describe(cls)
• cls – cls


Filter Class
Builds a new DynamicFrame by selecting records from the input DynamicFrame that satisfy a specified
predicate function.

Methods
• __call__ (p. 304)
• apply (p. 304)
• name (p. 304)
• describeArgs (p. 304)
• describeReturn (p. 304)
• describeTransform (p. 305)
• describeErrors (p. 305)
• describe (p. 305)
• Example Code (p. 305)

__call__(frame, f, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)

Returns a new DynamicFrame built by selecting records from the input DynamicFrame that satisfy a
specified predicate function.

• frame – The source DynamicFrame to apply the specified filter function to (required).
• f – The predicate function to apply to each DynamicRecord in the DynamicFrame. The function
must take a DynamicRecord as its argument and return True if the DynamicRecord meets the filter
requirements, or False if it does not (required).

A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in a Spark


DataFrame, except that it is self-describing and can be used for data that does not conform to a fixed
schema.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).


describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

AWS Glue Python Example


This example filters sample data using the Filter transform and a simple Lambda function. The dataset
used here consists of Medicare Provider payment data downloaded from two Data.CMS.gov sites:
Inpatient Prospective Payment System Provider Summary for the Top 100 Diagnosis-Related Groups -
FY2011, and Inpatient Charge Data FY 2011.

After downloading the sample data, we modified it to introduce a couple of erroneous records at
the end of the file. This modified file is located in a public Amazon S3 bucket at s3://awsglue-
datasets/examples/medicare/Medicare_Hospital_Provider.csv. For another example
that uses this dataset, see Code Example: Data Preparation Using ResolveChoice, Lambda, and
ApplyMapping (p. 268).

Begin by creating a DynamicFrame for the data:

%pyspark
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

dyF = glueContext.create_dynamic_frame.from_options(
's3',
{'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
'csv',
{'withHeader': True})

print "Full record count: ", dyF.count()


dyF.printSchema()

The output should be as follows:

Full record count: 163065L


root
|-- DRG Definition: string
|-- Provider Id: string
|-- Provider Name: string
|-- Provider Street Address: string
|-- Provider City: string
|-- Provider State: string
|-- Provider Zip Code: string
|-- Hospital Referral Region Description: string
|-- Total Discharges: string
|-- Average Covered Charges: string
|-- Average Total Payments: string
|-- Average Medicare Payments: string


Next, use the Filter transform to condense the dataset, retaining only those entries that are from
Sacramento, California, or from Montgomery, Alabama. The filter transform works with any filter
function that takes a DynamicRecord as input and returns True if the DynamicRecord meets the filter
requirements, or False if not.
Note
You can use Python’s dot notation to access many fields in a DynamicRecord. For example, you
can access the column_A field in dynamic_record_X as: dynamic_record_X.column_A.
However, this technique doesn't work with field names that contain anything besides
alphanumeric characters and underscores. For fields that contain other characters, such as
spaces or periods, you must fall back to Python's dictionary notation. For example, to access a
field named col-B, use: dynamic_record_X["col-B"].

You can use a simple Lambda function with the Filter transform to remove all DynamicRecords that
don't originate in Sacramento or Montgomery. To confirm that this worked, print out the number of
records that remain:

sac_or_mon_dyF = Filter.apply(frame = dyF,
                              f = lambda x: x["Provider State"] in ["CA", "AL"] and
                                            x["Provider City"] in ["SACRAMENTO", "MONTGOMERY"])
print "Filtered record count: ", sac_or_mon_dyF.count()

The output that you get looks like the following:

Filtered record count: 564L

FlatMap Class
Applies a transform to each DynamicFrame in a collection and flattens the results.

Methods
• __call__ (p. 306)
• apply (p. 307)
• name (p. 307)
• describeArgs (p. 307)
• describeReturn (p. 307)
• describeTransform (p. 307)
• describeErrors (p. 307)
• describe (p. 307)

__call__(dfc, BaseTransform, frame_name, transformation_ctx = "", **base_kwargs)
Applies a transform to each DynamicFrame in a collection and flattens the results.

• dfc – The DynamicFrameCollection over which to flatmap (required).


• BaseTransform – A transform derived from GlueTransform to apply to each member of the
collection (required).
• frame_name – The argument name to pass the elements of the collection to (required).
• transformation_ctx – A unique string that is used to identify state information (optional).
• base_kwargs – Arguments to pass to the base transform (required).


Returns a new DynamicFrameCollection created by applying the transform to each DynamicFrame in the
source DynamicFrameCollection.
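
As a rough sketch of how this can be used (the collection split_dfc and the filter predicate are
hypothetical, and the keyword arguments after frame_name are forwarded to the base transform as
described above), the following applies the Filter transform to every DynamicFrame in a collection:

# split_dfc is a hypothetical DynamicFrameCollection whose frames have a "Provider State" column.
ca_only_dfc = FlatMap.apply(dfc = split_dfc,
                            BaseTransform = Filter,
                            frame_name = "frame",
                            f = lambda x: x["Provider State"] == "CA")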

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

Join Class
Performs an equality join on two DynamicFrames.

Methods
• __call__ (p. 307)
• apply (p. 308)
• name (p. 308)
• describeArgs (p. 308)
• describeReturn (p. 308)
• describeTransform (p. 308)
• describeErrors (p. 308)
• describe (p. 308)

__call__(frame1, frame2, keys1, keys2, transformation_ctx = "")


Performs an equality join on two DynamicFrames.

• frame1 – The first DynamicFrame to join (required).


• frame2 – The second DynamicFrame to join (required).
• keys1 – The keys to join on for the first frame (required).
• keys2 – The keys to join on for the second frame (required).


• transformation_ctx – A unique string that is used to identify state information (optional).

Returns a new DynamicFrame obtained by joining the two DynamicFrames.
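
For example, the following sketch joins a hypothetical persons frame to a memberships frame on an id
column (the frame and column names are illustrative only, not taken from this guide):

# persons_dyF and memberships_dyF are hypothetical existing DynamicFrames.
joined_dyF = Join.apply(frame1 = persons_dyF,
                        frame2 = memberships_dyF,
                        keys1 = ["id"],
                        keys2 = ["person_id"])
joined_dyF.printSchema()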

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297)

name(cls)
Inherited from GlueTransform name (p. 298)

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298)

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298)

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298)

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298)

describe(cls)
Inherited from GlueTransform describe (p. 299)

Map Class
Builds a new DynamicFrame by applying a function to all records in the input DynamicFrame.

Methods
• __call__ (p. 308)
• apply (p. 309)
• name (p. 309)
• describeArgs (p. 309)
• describeReturn (p. 309)
• describeTransform (p. 309)
• describeErrors (p. 309)
• describe (p. 309)
• Example Code (p. 309)

__call__(frame, f, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)
Returns a new DynamicFrame that results from applying the specified function to all DynamicRecords
in the original DynamicFrame.


• frame – The original DynamicFrame to which to apply the mapping function (required).
• f – The function to apply to all DynamicRecords in the DynamicFrame. The function must take
a DynamicRecord as an argument and return a new DynamicRecord produced by the mapping
(required).

A DynamicRecord represents a logical record in a DynamicFrame. It is similar to a row in an Apache
Spark DataFrame, except that it is self-describing and can be used for data that does not conform to a
fixed schema.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a new DynamicFrame that results from applying the specified function to all DynamicRecords
in the original DynamicFrame.

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

AWS Glue Python Example


This example uses the Map transform to merge several fields into one struct type. The dataset that
is used here consists of Medicare Provider payment data downloaded from two Data.CMS.gov sites:
Inpatient Prospective Payment System Provider Summary for the Top 100 Diagnosis-Related Groups -
FY2011, and Inpatient Charge Data FY 2011.

After downloading the sample data, we modified it to introduce a couple of erroneous records at
the end of the file. This modified file is located in a public Amazon S3 bucket at s3://awsglue-
datasets/examples/medicare/Medicare_Hospital_Provider.csv. For another example


that uses this dataset, see Code Example: Data Preparation Using ResolveChoice, Lambda, and
ApplyMapping (p. 268).

Begin by creating a DynamicFrame for the data:

from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

dyF = glueContext.create_dynamic_frame.from_options(
's3',
{'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
'csv',
{'withHeader': True})

print "Full record count: ", dyF.count()


dyF.printSchema()

The output of this code should be as follows:

Full record count: 163065L


root
|-- DRG Definition: string
|-- Provider Id: string
|-- Provider Name: string
|-- Provider Street Address: string
|-- Provider City: string
|-- Provider State: string
|-- Provider Zip Code: string
|-- Hospital Referral Region Description: string
|-- Total Discharges: string
|-- Average Covered Charges: string
|-- Average Total Payments: string
|-- Average Medicare Payments: string

Next, create a mapping function to merge provider-address fields in a DynamicRecord into a struct,
and then delete the individual address fields:

def MergeAddress(rec):
rec["Address"] = {}
rec["Address"]["Street"] = rec["Provider Street Address"]
rec["Address"]["City"] = rec["Provider City"]
rec["Address"]["State"] = rec["Provider State"]
rec["Address"]["Zip.Code"] = rec["Provider Zip Code"]
rec["Address"]["Array"] = [rec["Provider Street Address"], rec["Provider City"],
rec["Provider State"], rec["Provider Zip Code"]]
del rec["Provider Street Address"]
del rec["Provider City"]
del rec["Provider State"]
del rec["Provider Zip Code"]
return rec

In this mapping function, the line rec["Address"] = {} creates a dictionary in the input
DynamicRecord that contains the new structure.
Note
Python map fields are not supported here. For example, you can't have a line like the following:
rec["Addresses"] = [] # ILLEGAL!


The lines that are like rec["Address"]["Street"] = rec["Provider Street Address"] add
fields to the new structure using Python dictionary syntax.

After the address lines are added to the new structure, the lines that are like del rec["Provider
Street Address"] remove the individual fields from the DynamicRecord.

Now you can use the Map transform to apply your mapping function to all DynamicRecords in the
DynamicFrame.

mapped_dyF = Map.apply(frame = dyF, f = MergeAddress)


mapped_dyF.printSchema()

The output is as follows:

root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
| |-- Zip.Code: string
| |-- City: string
| |-- Array: array
| | |-- element: string
| |-- State: string
| |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string

MapToCollection Class
Applies a transform to each DynamicFrame in the specified DynamicFrameCollection.

Methods
• __call__ (p. 311)
• apply (p. 312)
• name (p. 312)
• describeArgs (p. 312)
• describeReturn (p. 312)
• describeTransform (p. 312)
• describeErrors (p. 312)
• describe (p. 312)

__call__(dfc, BaseTransform, frame_name, transformation_ctx = "", **base_kwargs)
Applies a transform function to each DynamicFrame in the specified DynamicFrameCollection.

• dfc – The DynamicFrameCollection over which to apply the transform function (required).
• callable – A callable transform function to apply to each member of the collection (required).


• transformation_ctx – A unique string that is used to identify state information (optional).

Returns a new DynamicFrameCollection created by applying the transform to each DynamicFrame in the
source DynamicFrameCollection.

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297)

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

Relationalize Class
Flattens nested schema in a DynamicFrame and pivots out array columns from the flattened frame.

Methods
• __call__ (p. 312)
• apply (p. 313)
• name (p. 313)
• describeArgs (p. 313)
• describeReturn (p. 313)
• describeTransform (p. 313)
• describeErrors (p. 313)
• describe (p. 313)

__call__(frame, staging_path=None, name='roottable', options=None, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
Relationalizes a DynamicFrame and produces a list of frames that are generated by unnesting nested
columns and pivoting array columns. The pivoted array column can be joined to the root table using the
joinkey generated in the unnest phase.


• frame – The DynamicFrame to relationalize (required).


• staging_path – The path at which to store partitions of pivoted tables in CSV format (optional).
Pivoted tables are read back from this path.
• name – The name of the root table (optional).
• options – A dictionary of optional parameters.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a DynamicFrameCollection containing the DynamicFrames produced by the relationalize
operation.
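
For example, the following sketch flattens a hypothetical nested frame and counts the records in each
resulting table. The staging path is a placeholder, and the sketch assumes the keys() and select()
accessors on the returned DynamicFrameCollection:

# nested_dyF is a hypothetical DynamicFrame with nested struct and array columns.
dfc = Relationalize.apply(frame = nested_dyF,
                          staging_path = "s3://my-temp-bucket/relationalize/",
                          name = "root")
for frame_name in dfc.keys():
    print frame_name, ": ", dfc.select(frame_name).count()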

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

RenameField Class
Renames a node within a DynamicFrame.

Methods
• __call__ (p. 314)
• apply (p. 314)
• name (p. 314)


• describeArgs (p. 314)


• describeReturn (p. 314)
• describeTransform (p. 314)
• describeErrors (p. 314)
• describe (p. 315)

__call__(frame, old_name, new_name, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
Renames a node within a DynamicFrame.

• frame – The DynamicFrame in which to rename a node (required).


• old_name – Full path to the node to rename (required).

If the old name has dots in it, RenameField will not work unless you place back-ticks around it (``). For
example, to replace this.old.name with thisNewName, you would call RenameField as follows:

newDyF = RenameField.apply(frame = oldDyF, old_name = "`this.old.name`", new_name = "thisNewName")

• new_name – New name, including full path (required).


• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a DynamicFrame with the specified field renamed.

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).


describe(cls)
Inherited from GlueTransform describe (p. 299).

ResolveChoice Class
Resolves a choice type within a DynamicFrame.

Methods
• __call__ (p. 315)
• apply (p. 316)
• name (p. 316)
• describeArgs (p. 316)
• describeReturn (p. 316)
• describeTransform (p. 316)
• describeErrors (p. 316)
• describe (p. 316)

__call__(frame, specs = None, choice = "", transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
Provides information for resolving ambiguous types within a DynamicFrame. Returns the resulting
DynamicFrame.

• frame – The DynamicFrame in which to resolve the choice type (required).


• specs – A list of specific ambiguities to resolve, each in the form of a tuple: (path, action).
The path value identifies a specific ambiguous element, and the action value identifies the
corresponding resolution. Only one of the specs and choice parameters can be used. If the specs
parameter is not None, then the choice parameter must be an empty string. Conversely, if choice
is not an empty string, then the specs parameter must be None. If neither parameter is provided,
AWS Glue tries to parse the schema and use it to resolve ambiguities.

The action portion of a specs tuple can specify one of four resolution strategies:
• cast: Allows you to specify a type to cast to (for example, cast:int).
• make_cols: Resolves a potential ambiguity by flattening the data. For example, if columnA
could be an int or a string, the resolution is to produce two columns named columnA_int and
columnA_string in the resulting DynamicFrame.
• make_struct: Resolves a potential ambiguity by using a struct to represent the data. For example,
if data in a column could be an int or a string, using the make_struct action produces a column
of structures in the resulting DynamicFrame with each containing both an int and a string.
• project: Resolves a potential ambiguity by retaining only values of a specified type in the resulting
DynamicFrame. For example, if data in a ChoiceType column could be an int or a string,
specifying a project:string action drops values from the resulting DynamicFrame that are not of
type string.

If the path identifies an array, place empty square brackets after the name of the array to avoid
ambiguity. For example, suppose you are working with data structured as follows:

"myList": [
{ "price": 100.00 },
{ "price": "$100.00" }
]


You can select the numeric rather than the string version of the price by setting the path to
"myList[].price", and the action to "cast:double".
• choice – The default resolution action to apply if the specs parameter is None. If the specs parameter
is not None, then this must be an empty string.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a DynamicFrame with the resolved choice.

Example

df1 = ResolveChoice.apply(df, choice = "make_cols")


df2 = ResolveChoice.apply(df, specs = [("a.b", "make_struct"), ("c.d", "cast:double")])

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

SelectFields Class
Gets fields in a DynamicFrame.

Methods
• __call__ (p. 317)
• apply (p. 317)


• name (p. 317)


• describeArgs (p. 317)
• describeReturn (p. 317)
• describeTransform (p. 317)
• describeErrors (p. 317)
• describe (p. 317)

__call__(frame, paths, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
Gets fields (nodes) in a DynamicFrame.

• frame – The DynamicFrame in which to select fields (required).


• paths – A list of full paths to the fields to select (required).
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a new DynamicFrame containing only the specified fields.
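
For example, using the Medicare frame (dyF) from the Filter and Map examples earlier in this section,
the following sketch keeps only the provider name and state columns:

# dyF is the Medicare DynamicFrame created in the earlier examples in this section.
name_state_dyF = SelectFields.apply(frame = dyF,
                                    paths = ["Provider Name", "Provider State"])
name_state_dyF.printSchema()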

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

SelectFromCollection Class
Selects one DynamicFrame in a DynamicFrameCollection.


Methods
• __call__ (p. 318)
• apply (p. 318)
• name (p. 318)
• describeArgs (p. 318)
• describeReturn (p. 318)
• describeTransform (p. 318)
• describeErrors (p. 318)
• describe (p. 318)

__call__(dfc, key, transformation_ctx = "")


Gets one DynamicFrame from a DynamicFrameCollection.

• dfc – The DynamicFrameCollection from which to select the DynamicFrame (required).
• key – The key of the DynamicFrame to select (required).
• transformation_ctx – A unique string that is used to identify state information (optional).

Returns the specified DynamicFrame.
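
For example, the following sketch pulls one frame back out of a collection produced by SplitFields. The
collection and key names here are illustrative; in practice, inspect the collection's keys to find the
actual key values:

# split_dfc is a hypothetical DynamicFrameCollection created by SplitFields with name1 = "provider_info".
provider_info_dyF = SelectFromCollection.apply(dfc = split_dfc, key = "provider_info")
provider_info_dyF.printSchema()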

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

Spigot Class
Writes sample records to a specified destination during a transformation.

Methods
• __call__ (p. 319)


• apply (p. 319)


• name (p. 319)
• describeArgs (p. 319)
• describeReturn (p. 319)
• describeTransform (p. 319)
• describeErrors (p. 319)
• describe (p. 319)

__call__(frame, path, options, transformation_ctx = "")


Writes sample records to a specified destination during a transformation.

• frame – The DynamicFrame to spigot (required).


• path – The path to the destination to write to (required).
• options – JSON key-value pairs specifying options (optional). The "topk" option specifies that the
first k records should be written. The "prob" option specifies the probability (as a decimal) of picking
any given record, to be used in selecting records to write.
• transformation_ctx – A unique string that is used to identify state information (optional).

Returns the input DynamicFrame with an additional write step.
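
For example, the following sketch writes the first 10 records that pass through the transformation to a
placeholder Amazon S3 path, so that intermediate data can be spot-checked. Passing the documented
"topk" option as a plain dictionary is an assumption here:

# dyF is an existing DynamicFrame; the bucket path is a placeholder.
spigot_dyF = Spigot.apply(frame = dyF,
                          path = "s3://my-temp-bucket/spigot/",
                          options = {"topk": 10})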

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297)

name(cls)
Inherited from GlueTransform name (p. 298)

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298)

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298)

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298)

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298)

describe(cls)
Inherited from GlueTransform describe (p. 299)

SplitFields Class
Splits a DynamicFrame into two new ones, by specified fields.


Methods
• __call__ (p. 320)
• apply (p. 320)
• name (p. 320)
• describeArgs (p. 320)
• describeReturn (p. 320)
• describeTransform (p. 320)
• describeErrors (p. 321)
• describe (p. 321)

__call__(frame, paths, name1 = None, name2 = None, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
Splits one or more fields in a DynamicFrame off into a new DynamicFrame and creates another new
DynamicFrame containing the fields that remain.

• frame – The source DynamicFrame to split into two new ones (required).
• paths – A list of full paths to the fields to be split (required).
• name1 – The name to assign to the DynamicFrame that will contain the fields to be split off
(optional). If no name is supplied, the name of the source frame is used with "1" appended.
• name2 – The name to assign to the DynamicFrame that will contain the fields that remain after the
specified fields are split off (optional). If no name is provided, the name of the source frame is used
with "2" appended.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a DynamicFrameCollection containing two DynamicFrames: one contains only the specified
fields to split off, and the other contains the remaining fields.
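
For example, the following sketch splits the address columns of the Medicare frame (dyF) from the
earlier examples into their own frame, leaving the remaining columns in a second frame. The frame
names are illustrative:

# dyF is the Medicare DynamicFrame created in the earlier examples in this section.
split_dfc = SplitFields.apply(frame = dyF,
                              paths = ["Provider Street Address", "Provider City",
                                       "Provider State", "Provider Zip Code"],
                              name1 = "address_fields",
                              name2 = "payment_fields")
print split_dfc.keys()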

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).


describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

SplitRows Class
Splits a DynamicFrame in two by specified rows.

Methods
• __call__ (p. 321)
• apply (p. 321)
• name (p. 321)
• describeArgs (p. 322)
• describeReturn (p. 322)
• describeTransform (p. 322)
• describeErrors (p. 322)
• describe (p. 322)

__call__(frame, comparison_dict, name1="frame1", name2="frame2", transformation_ctx = "", info = None, stageThreshold = 0, totalThreshold = 0)
Splits one or more rows in a DynamicFrame off into a new DynamicFrame.

• frame – The source DynamicFrame to split into two new ones (required).
• comparison_dict – A dictionary where the key is the full path to a column, and the value is another
dictionary mapping comparators to the value to which the column values are compared. For example,
{"age": {">": 10, "<": 20}} splits rows where the value of "age" is between 10 and 20,
exclusive, from rows where "age" is outside that range (required).
• name1 – The name to assign to the DynamicFrame that will contain the rows to be split off (optional).
• name2 – The name to assign to the DynamicFrame that will contain the rows that remain after the
specified rows are split off (optional).
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns a DynamicFrameCollection that contains two DynamicFrames: one contains only the
specified rows to be split, and the other contains all remaining rows.
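
For example, reusing the comparison_dict format shown above, the following sketch assumes a
hypothetical frame with a numeric age column and separates rows with age between 10 and 20
(exclusive) from all other rows:

# people_dyF is a hypothetical DynamicFrame with a numeric "age" column.
split_rows_dfc = SplitRows.apply(frame = people_dyF,
                                 comparison_dict = {"age": {">": 10, "<": 20}},
                                 name1 = "teens",
                                 name2 = "everyone_else")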

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).


describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

Unbox Class
Unboxes a string field in a DynamicFrame.

Methods
• __call__ (p. 322)
• apply (p. 323)
• name (p. 323)
• describeArgs (p. 323)
• describeReturn (p. 323)
• describeTransform (p. 323)
• describeErrors (p. 323)
• describe (p. 323)

__call__(frame, path, format, transformation_ctx = "", info="", stageThreshold=0, totalThreshold=0, **options)
Unboxes a string field in a DynamicFrame.

• frame – The DynamicFrame in which to unbox a string field (required).


• path – The full path to the StringNode to unbox (required).
• format – A format specification (optional). This is used for an Amazon Simple Storage Service
(Amazon S3) or an AWS Glue connection that supports multiple formats. See Format Options for ETL
Inputs and Outputs in AWS Glue (p. 248) for the formats that are supported.
• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).


• separator – A separator token (optional).


• escaper – An escape token (optional).
• skipFirst – True if the first line of data should be skipped, or False if it should not be skipped
(optional).
• withSchema – A string containing schema for the data to be unboxed (optional). This should always be
created using StructType.json.
• withHeader – True if the data being unpacked includes a header, or False if not (optional).

Returns a new DynamicFrame with unboxed DynamicRecords.
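
For example, the following sketch parses a hypothetical string column named warranty, whose values
are JSON documents, into a nested structure:

# orders_dyF is a hypothetical DynamicFrame with a string column "warranty" holding JSON.
unboxed_dyF = Unbox.apply(frame = orders_dyF, path = "warranty", format = "json")
unboxed_dyF.printSchema()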

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

UnnestFrame Class
Unnests a DynamicFrame, flattens nested objects to top-level elements, and generates joinkeys for
array objects.

Methods
• __call__ (p. 324)
• apply (p. 324)
• name (p. 324)
• describeArgs (p. 324)
• describeReturn (p. 324)
• describeTransform (p. 324)
• describeErrors (p. 324)
• describe (p. 324)


__call__(frame, transformation_ctx = "", info="", stageThreshold=0, totalThreshold=0)
Unnests a DynamicFrame. Flattens nested objects to top-level elements, and generates joinkeys for
array objects.

• frame – The DynamicFrame to unnest (required).


• transformation_ctx – A unique string that is used to identify state information (optional).
• info – A string associated with errors in the transformation (optional).
• stageThreshold – The maximum number of errors that can occur in the transformation before it
errors out (optional; the default is zero).
• totalThreshold – The maximum number of errors that can occur overall before processing errors
out (optional; the default is zero).

Returns the unnested DynamicFrame.
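
For example, applying the transform to the mapped_dyF frame built in the Map example earlier in this
section (a sketch; the resulting output is not shown here) would flatten the nested Address struct back
into top-level columns and generate join keys for its Array field:

# mapped_dyF is the DynamicFrame with the nested Address struct from the Map example.
unnested_dyF = UnnestFrame.apply(frame = mapped_dyF)
unnested_dyF.printSchema()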

apply(cls, *args, **kwargs)


Inherited from GlueTransform apply (p. 297).

name(cls)
Inherited from GlueTransform name (p. 298).

describeArgs(cls)
Inherited from GlueTransform describeArgs (p. 298).

describeReturn(cls)
Inherited from GlueTransform describeReturn (p. 298).

describeTransform(cls)
Inherited from GlueTransform describeTransform (p. 298).

describeErrors(cls)
Inherited from GlueTransform describeErrors (p. 298).

describe(cls)
Inherited from GlueTransform describe (p. 299).

Programming AWS Glue ETL Scripts in Scala


You can find Scala code examples and utilities for AWS Glue in the AWS Glue samples repository on the
GitHub website.

AWS Glue supports an extension of the Apache Spark Scala dialect for scripting extract, transform, and load
(ETL) jobs. The following sections describe how to use the AWS Glue Scala library and the AWS Glue API
in ETL scripts, and provide reference documentation for the library.

Contents


• Using Scala to Program AWS Glue ETL Scripts (p. 329)


• Testing a Scala ETL Program in a Zeppelin Notebook on a Development Endpoint (p. 329)
• Testing a Scala ETL Program in a Scala REPL (p. 329)
• APIs in the AWS Glue Scala Library (p. 330)
• com.amazonaws.services.glue (p. 330)
• com.amazonaws.services.glue.types (p. 330)
• com.amazonaws.services.glue.util (p. 330)
• AWS Glue Scala ChoiceOption APIs (p. 331)
• ChoiceOption Trait (p. 331)
• ChoiceOption Object (p. 331)
• def apply (p. 331)
• case class ChoiceOptionWithResolver (p. 331)
• case class MatchCatalogSchemaChoiceOption (p. 331)
• Abstract DataSink Class (p. 331)
• def writeDynamicFrame (p. 332)
• def pyWriteDynamicFrame (p. 332)
• def supportsFormat (p. 332)
• def setFormat (p. 332)
• def withFormat (p. 332)
• def setAccumulableSize (p. 332)
• def getOutputErrorRecordsAccumulable (p. 333)
• def errorsAsDynamicFrame (p. 333)
• DataSink Object (p. 333)
• def recordMetrics (p. 333)
• AWS Glue Scala DataSource Trait (p. 333)
• AWS Glue Scala DynamicFrame APIs (p. 333)
• AWS Glue Scala DynamicFrame Class (p. 334)
• val errorsCount (p. 335)
• def applyMapping (p. 335)
• def assertErrorThreshold (p. 336)
• def count (p. 337)
• def dropField (p. 337)
• def dropFields (p. 337)
• def dropNulls (p. 337)
• def errorsAsDynamicFrame (p. 337)
• def filter (p. 337)
• def getName (p. 338)
• def getNumPartitions (p. 338)
• def getSchemaIfComputed (p. 338)
• def isSchemaComputed (p. 338)
• def javaToPython (p. 338)
• def join (p. 338)
• def map (p. 339)
• def printSchema (p. 339)
• def recomputeSchema (p. 339)
• def relationalize (p. 339)

• def renameField (p. 340)


• def repartition (p. 341)
• def resolveChoice (p. 341)
• def schema (p. 342)
• def selectField (p. 342)
• def selectFields (p. 342)
• def show (p. 342)
• def spigot (p. 343)
• def splitFields (p. 343)
• def splitRows (p. 343)
• def stageErrorsCount (p. 344)
• def toDF (p. 344)
• def unbox (p. 344)
• def unnest (p. 345)
• def withFrameSchema (p. 346)
• def withName (p. 346)
• def withTransformationContext (p. 346)
• The DynamicFrame Object (p. 346)
• def apply (p. 347)
• def emptyDynamicFrame (p. 347)
• def fromPythonRDD (p. 347)
• def ignoreErrors (p. 347)
• def inlineErrors (p. 347)
• def newFrameWithErrors (p. 347)
• AWS Glue Scala DynamicRecord Class (p. 347)
• def addField (p. 348)
• def dropField (p. 348)
• def setError (p. 348)
• def isError (p. 349)
• def getError (p. 349)
• def clearError (p. 349)
• def write (p. 349)
• def readFields (p. 349)
• def clone (p. 349)
• def schema (p. 349)
• def getRoot (p. 349)
• def toJson (p. 349)
• def getFieldNode (p. 350)
• def getField (p. 350)
• def hashCode (p. 350)
• def equals (p. 350)
• DynamicRecord Object (p. 350)
• def apply (p. 350)
• RecordTraverser Trait (p. 350)
• AWS Glue Scala GlueContext APIs (p. 351)

• def getCatalogSink (p. 351)


• def getCatalogSource (p. 352)
• def getJDBCSink (p. 352)
• def getSink (p. 353)
• def getSinkWithFormat (p. 353)
• def getSource (p. 354)
• def getSourceWithFormat (p. 355)
• def getSparkSession (p. 355)
• def this (p. 355)
• def this (p. 356)
• def this (p. 356)
• MappingSpec (p. 356)
• MappingSpec Case Class (p. 356)
• MappingSpec Object (p. 357)
• val orderingByTarget (p. 357)
• def apply (p. 357)
• def apply (p. 357)
• def apply (p. 357)
• AWS Glue Scala ResolveSpec APIs (p. 358)
• ResolveSpec Object (p. 358)
• def (p. 358)
• def (p. 358)
• ResolveSpec Case Class (p. 358)
• ResolveSpec def Methods (p. 359)
• AWS Glue Scala ArrayNode APIs (p. 359)
• ArrayNode Case Class (p. 359)
• ArrayNode def Methods (p. 359)
• AWS Glue Scala BinaryNode APIs (p. 360)
• BinaryNode Case Class (p. 360)
• BinaryNode val Fields (p. 360)
• BinaryNode def Methods (p. 360)
• AWS Glue Scala BooleanNode APIs (p. 360)
• BooleanNode Case Class (p. 360)
• BooleanNode val Fields (p. 360)
• BooleanNode def Methods (p. 360)
• AWS Glue Scala ByteNode APIs (p. 360)
• ByteNode Case Class (p. 360)
• ByteNode val Fields (p. 361)
• ByteNode def Methods (p. 361)
• AWS Glue Scala DateNode APIs (p. 361)
• DateNode Case Class (p. 361)
• DateNode val Fields (p. 361)
• DateNode def Methods (p. 361)
• AWS Glue Scala DecimalNode APIs (p. 361)
• DecimalNode Case Class (p. 361)

• DecimalNode val Fields (p. 361)


• DecimalNode def Methods (p. 361)
• AWS Glue Scala DoubleNode APIs (p. 362)
• DoubleNode Case Class (p. 362)
• DoubleNode val Fields (p. 362)
• DoubleNode def Methods (p. 362)
• AWS Glue Scala DynamicNode APIs (p. 362)
• DynamicNode Class (p. 362)
• DynamicNode def Methods (p. 362)
• DynamicNode Object (p. 363)
• DynamicNode def Methods (p. 363)
• AWS Glue Scala FloatNode APIs (p. 363)
• FloatNode Case Class (p. 363)
• FloatNode val Fields (p. 363)
• FloatNode def Methods (p. 363)
• AWS Glue Scala IntegerNode APIs (p. 363)
• IntegerNode Case Class (p. 363)
• IntegerNode val Fields (p. 363)
• IntegerNode def Methods (p. 364)
• AWS Glue Scala LongNode APIs (p. 364)
• LongNode Case Class (p. 364)
• LongNode val Fields (p. 364)
• LongNode def Methods (p. 364)
• AWS Glue Scala MapLikeNode APIs (p. 364)
• MapLikeNode Class (p. 364)
• MapLikeNode def Methods (p. 364)
• AWS Glue Scala MapNode APIs (p. 365)
• MapNode Case Class (p. 365)
• MapNode def Methods (p. 365)
• AWS Glue Scala NullNode APIs (p. 365)
• NullNode Class (p. 366)
• NullNode Case Object (p. 366)
• AWS Glue Scala ObjectNode APIs (p. 366)
• ObjectNode Object (p. 366)
• ObjectNode def Methods (p. 366)
• ObjectNode Case Class (p. 366)
• ObjectNode def Methods (p. 366)
• AWS Glue Scala ScalarNode APIs (p. 367)
• ScalarNode Class (p. 367)
• ScalarNode def Methods (p. 367)
• ScalarNode Object (p. 367)
• ScalarNode def Methods (p. 367)
• AWS Glue Scala ShortNode APIs (p. 368)
• ShortNode Case Class (p. 368)
• ShortNode val Fields (p. 368)

• ShortNode def Methods (p. 368)


• AWS Glue Scala StringNode APIs (p. 368)
• StringNode Case Class (p. 368)
• StringNode val Fields (p. 368)
• StringNode def Methods (p. 368)
• AWS Glue Scala TimestampNode APIs (p. 368)
• TimestampNode Case Class (p. 368)
• TimestampNode val Fields (p. 369)
• TimestampNode def Methods (p. 369)
• AWS Glue Scala GlueArgParser APIs (p. 369)
• GlueArgParser Object (p. 369)
• GlueArgParser def Methods (p. 369)
• AWS Glue Scala Job APIs (p. 369)
• Job Object (p. 369)
• Job def Methods (p. 369)

Using Scala to Program AWS Glue ETL Scripts


You can automatically generate a Scala extract, transform, and load (ETL) program using the AWS Glue
console, and modify it as needed before assigning it to a job. Or, you can write your own program from
scratch. For more information, see Adding Jobs in AWS Glue (p. 142). AWS Glue then compiles your Scala
program on the server before running the associated job.

To ensure that your program compiles without errors and runs as expected, it's important that you load it
on a development endpoint in a REPL (Read-Eval-Print Loop) or an Apache Zeppelin Notebook and test it
there before running it in a job. Because the compile process occurs on the server, you will not have good
visibility into any problems that happen there.

Testing a Scala ETL Program in a Zeppelin Notebook on a


Development Endpoint
To test a Scala program on an AWS Glue development endpoint, set up the development endpoint as
described in Using Development Endpoints (p. 156).

Next, connect it to an Apache Zeppelin Notebook that is either running locally on your machine or
remotely on an Amazon EC2 notebook server. To install a local version of a Zeppelin Notebook, follow
the instructions in Tutorial: Local Zeppelin Notebook (p. 164).

The only difference between running Scala code and running PySpark code in your Notebook is that you
should start each paragraph in the Notebook with the following:

%spark

This prevents the Notebook server from defaulting to the PySpark flavor of the Spark interpreter.

Testing a Scala ETL Program in a Scala REPL


You can test a Scala program on a development endpoint using the AWS Glue Scala REPL. Follow the
instructions in Tutorial: Use a REPL Shell (p. 170), except at the end of the SSH-to-REPL command,
replace -t gluepyspark with -t glue-spark-shell. This invokes the AWS Glue Scala REPL.

To close the REPL when you are finished, type sys.exit.


APIs in the AWS Glue Scala Library


AWS Glue supports an extension of the Apache Spark Scala dialect for scripting extract, transform, and load
(ETL) jobs. The following sections describe the APIs in the AWS Glue Scala library.

com.amazonaws.services.glue
The com.amazonaws.services.glue package in the AWS Glue Scala library contains the following APIs:

• ChoiceOption (p. 331)


• DataSink (p. 331)
• DataSource trait (p. 333)
• DynamicFrame (p. 333)
• DynamicRecord (p. 347)
• GlueContext (p. 351)
• MappingSpec (p. 356)
• ResolveSpec (p. 358)

com.amazonaws.services.glue.types
The com.amazonaws.services.glue.types package in the AWS Glue Scala library contains the following
APIs:

• ArrayNode (p. 359)


• BinaryNode (p. 360)
• BooleanNode (p. 360)
• ByteNode (p. 360)
• DateNode (p. 361)
• DecimalNode (p. 361)
• DoubleNode (p. 362)
• DynamicNode (p. 362)
• FloatNode (p. 363)
• IntegerNode (p. 363)
• LongNode (p. 364)
• MapLikeNode (p. 364)
• MapNode (p. 365)
• NullNode (p. 365)
• ObjectNode (p. 366)
• ScalarNode (p. 367)
• ShortNode (p. 368)
• StringNode (p. 368)
• TimestampNode (p. 368)

com.amazonaws.services.glue.util
The com.amazonaws.services.glue.util package in the AWS Glue Scala library contains the following
APIs:


• GlueArgParser (p. 369)


• Job (p. 369)

AWS Glue Scala ChoiceOption APIs


Topics
• ChoiceOption Trait (p. 331)
• ChoiceOption Object (p. 331)
• case class ChoiceOptionWithResolver (p. 331)
• case class MatchCatalogSchemaChoiceOption (p. 331)

Package: com.amazonaws.services.glue

ChoiceOption Trait

trait ChoiceOption extends Serializable

ChoiceOption Object

object ChoiceOption

A general strategy for resolving a choice that applies to all ChoiceType nodes in a DynamicFrame.

• val CAST
• val MAKE_COLS
• val MAKE_STRUCT
• val MATCH_CATALOG
• val PROJECT

def apply

def apply(choice: String): ChoiceOption

case class ChoiceOptionWithResolver

case class ChoiceOptionWithResolver(name: String, choiceResolver: ChoiceResolver)
  extends ChoiceOption {}

case class MatchCatalogSchemaChoiceOption

case class MatchCatalogSchemaChoiceOption() extends ChoiceOption {}

Abstract DataSink Class


Topics


• def writeDynamicFrame (p. 332)


• def pyWriteDynamicFrame (p. 332)
• def supportsFormat (p. 332)
• def setFormat (p. 332)
• def withFormat (p. 332)
• def setAccumulableSize (p. 332)
• def getOutputErrorRecordsAccumulable (p. 333)
• def errorsAsDynamicFrame (p. 333)
• DataSink Object (p. 333)

Package: com.amazonaws.services.glue

abstract class DataSink

The writer analog to a DataSource. DataSink encapsulates a destination and a format that a
DynamicFrame can be written to.

def writeDynamicFrame

def writeDynamicFrame( frame : DynamicFrame,
                       callSite : CallSite = CallSite("Not provided", "")
                     ) : DynamicFrame

def pyWriteDynamicFrame

def pyWriteDynamicFrame( frame : DynamicFrame,
                         site : String = "Not provided",
                         info : String = "" )

def supportsFormat

def supportsFormat( format : String ) : Boolean

def setFormat

def setFormat( format : String,
               options : JsonOptions
             ) : Unit

def withFormat

def withFormat( format : String,
                options : JsonOptions = JsonOptions.empty
              ) : DataSink

def setAccumulableSize

def setAccumulableSize( size : Int ) : Unit


def getOutputErrorRecordsAccumulable

def getOutputErrorRecordsAccumulable : Accumulable[List[OutputError], OutputError]

def errorsAsDynamicFrame

def errorsAsDynamicFrame : DynamicFrame

DataSink Object

object DataSink

def recordMetrics

def recordMetrics( frame : DynamicFrame,
                   ctxt : String
                 ) : DynamicFrame

AWS Glue Scala DataSource Trait


Package: com.amazonaws.services.glue

A high-level interface for producing a DynamicFrame.

trait DataSource {

  def getDynamicFrame : DynamicFrame

  def getDynamicFrame( minPartitions : Int,
                       targetPartitions : Int
                     ) : DynamicFrame

  def glueContext : GlueContext

  def setFormat( format : String,
                 options : String
               ) : Unit

  def setFormat( format : String,
                 options : JsonOptions
               ) : Unit

  def supportsFormat( format : String ) : Boolean

  def withFormat( format : String,
                  options : JsonOptions = JsonOptions.empty
                ) : DataSource
}

AWS Glue Scala DynamicFrame APIs


Package: com.amazonaws.services.glue

Contents
• AWS Glue Scala DynamicFrame Class (p. 334)
• val errorsCount (p. 335)


• def applyMapping (p. 335)


• def assertErrorThreshold (p. 336)
• def count (p. 337)
• def dropField (p. 337)
• def dropFields (p. 337)
• def dropNulls (p. 337)
• def errorsAsDynamicFrame (p. 337)
• def filter (p. 337)
• def getName (p. 338)
• def getNumPartitions (p. 338)
• def getSchemaIfComputed (p. 338)
• def isSchemaComputed (p. 338)
• def javaToPython (p. 338)
• def join (p. 338)
• def map (p. 339)
• def printSchema (p. 339)
• def recomputeSchema (p. 339)
• def relationalize (p. 339)
• def renameField (p. 340)
• def repartition (p. 341)
• def resolveChoice (p. 341)
• def schema (p. 342)
• def selectField (p. 342)
• def selectFields (p. 342)
• def show (p. 342)
• def spigot (p. 343)
• def splitFields (p. 343)
• def splitRows (p. 343)
• def stageErrorsCount (p. 344)
• def toDF (p. 344)
• def unbox (p. 344)
• def unnest (p. 345)
• def withFrameSchema (p. 346)
• def withName (p. 346)
• def withTransformationContext (p. 346)
• The DynamicFrame Object (p. 346)
• def apply (p. 347)
• def emptyDynamicFrame (p. 347)
• def fromPythonRDD (p. 347)
• def ignoreErrors (p. 347)
• def inlineErrors (p. 347)
• def newFrameWithErrors (p. 347)

AWS Glue Scala DynamicFrame Class


Package: com.amazonaws.services.glue


class DynamicFrame extends Serializable with Logging (
  val glueContext : GlueContext,
  _records : RDD[DynamicRecord],
  val name : String = s"",
  val transformationContext : String = DynamicFrame.UNDEFINED,
  callSite : CallSite = CallSite("Not provided", ""),
  stageThreshold : Long = 0,
  totalThreshold : Long = 0,
  prevErrors : => Long = 0,
  errorExpr : => Unit = {} )

A DynamicFrame is a distributed collection of self-describing DynamicRecord (p. 347) objects.

DynamicFrames are designed to provide a flexible data model for ETL (extract, transform, and load)
operations. They don't require a schema to create, and you can use them to read and transform data
that contains messy or inconsistent values and types. A schema can be computed on demand for those
operations that need one.

DynamicFrames provide a range of transformations for data cleaning and ETL. They also support
conversion to and from SparkSQL DataFrames to integrate with existing code and the many analytics
operations that DataFrames provide.

The following parameters are shared across many of the AWS Glue transformations that construct
DynamicFrames:

• transformationContext — The identifier for this DynamicFrame. The transformationContext


is used as a key for job bookmark state that is persisted across runs.
• callSite — Provides context information for error reporting. These values are automatically set
when calling from Python.
• stageThreshold — The maximum number of error records that are allowed from the computation of
this DynamicFrame before throwing an exception, excluding records that are present in the previous
DynamicFrame.
• totalThreshold — The maximum number of total error records before an exception is thrown,
including those from previous frames.

val errorsCount

val errorsCount

The number of error records in this DynamicFrame. This includes errors from previous operations.

def applyMapping

def applyMapping( mappings : Seq[Product4[String, String, String, String]],
                  caseSensitive : Boolean = true,
                  transformationContext : String = "",
                  callSite : CallSite = CallSite("Not provided", ""),
                  stageThreshold : Long = 0,
                  totalThreshold : Long = 0
                ) : DynamicFrame

• mappings — A sequence of mappings to construct a new DynamicFrame.


• caseSensitive — Whether to treat source columns as case sensitive. Setting this to false might help
when integrating with case-insensitive stores like the AWS Glue Data Catalog.

Selects, projects, and casts columns based on a sequence of mappings.


Each mapping is made up of a source column and type and a target column and type. Mappings can
be specified as either a four-tuple (source_path, source_type, target_path, target_type) or a
MappingSpec (p. 356) object containing the same information.

In addition to using mappings for simple projections and casting, you can use them to nest or unnest
fields by separating components of the path with '.' (period).

For example, suppose that you have a DynamicFrame with the following schema:

{{{
root
|-- name: string
|-- age: int
|-- address: struct
| |-- state: string
| |-- zip: int
}}}

You can make the following call to unnest the state and zip fields:

{{{
df.applyMapping(
Seq(("name", "string", "name", "string"),
("age", "int", "age", "int"),
("address.state", "string", "state", "string"),
("address.zip", "int", "zip", "int")))
}}}

The resulting schema is as follows:

{{{
root
|-- name: string
|-- age: int
|-- state: string
|-- zip: int
}}}

You can also use applyMapping to re-nest columns. For example, the following inverts the previous
transformation and creates a struct named address in the target:

{{{
df.applyMapping(
Seq(("name", "string", "name", "string"),
("age", "int", "age", "int"),
("state", "string", "address.state", "string"),
("zip", "int", "address.zip", "int")))
}}}

Field names that contain '.' (period) characters can be quoted by using backticks (``).
Note
Currently, you can't use the applyMapping method to map columns that are nested under
arrays.

def assertErrorThreshold

def assertErrorThreshold : Unit


An action that forces computation and verifies that the number of error records falls below
stageThreshold and totalThreshold. Throws an exception if either condition fails.

def count

lazy
def count

Returns the number of elements in this DynamicFrame.

def dropField

def dropField( path : String,
               transformationContext : String = "",
               callSite : CallSite = CallSite("Not provided", ""),
               stageThreshold : Long = 0,
               totalThreshold : Long = 0
             ) : DynamicFrame

Returns a new DynamicFrame with the specified column removed.

def dropFields

def dropFields( fieldNames : Seq[String], // The column names to drop.
                transformationContext : String = "",
                callSite : CallSite = CallSite("Not provided", ""),
                stageThreshold : Long = 0,
                totalThreshold : Long = 0
              ) : DynamicFrame

Returns a new DynamicFrame with the specified columns removed.

You can use this method to delete nested columns, including those inside of arrays, but not to drop
specific array elements.

def dropNulls

def dropNulls( transformationContext : String = "",
               callSite : CallSite = CallSite("Not provided", ""),
               stageThreshold : Long = 0,
               totalThreshold : Long = 0 )

Returns a new DynamicFrame with all null columns removed.


Note
This only removes columns of type NullType. Individual null values in other columns are not
removed or modified.

def errorsAsDynamicFrame

def errorsAsDynamicFrame

Returns a new DynamicFrame containing the error records from this DynamicFrame.

def filter

def filter( f : DynamicRecord => Boolean,
            errorMsg : String = "",
            transformationContext : String = "",
            callSite : CallSite = CallSite("Not provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame

Constructs a new DynamicFrame containing only those records for which the function 'f' returns true.
The filter function 'f' should not mutate the input record.

def getName

def getName : String

Returns the name of this DynamicFrame.

def getNumPartitions

def getNumPartitions

Returns the number of partitions in this DynamicFrame.

def getSchemaIfComputed

def getSchemaIfComputed : Option[Schema]

Returns the schema if it has already been computed. Does not scan the data if the schema has not
already been computed.

def isSchemaComputed

def isSchemaComputed : Boolean

Returns true if the schema has been computed for this DynamicFrame, or false if not. If this
method returns false, then calling the schema method requires another pass over the records in this
DynamicFrame.

def javaToPython

def javaToPython : JavaRDD[Array[Byte]]

def join

def join( keys1 : Seq[String],
          keys2 : Seq[String],
          frame2 : DynamicFrame,
          transformationContext : String = "",
          callSite : CallSite = CallSite("Not provided", ""),
          stageThreshold : Long = 0,
          totalThreshold : Long = 0
        ) : DynamicFrame

• keys1 — The columns in this DynamicFrame to use for the join.


• keys2 — The columns in frame2 to use for the join. Must be the same length as keys1.
• frame2 — The DynamicFrame to join against.

Returns the result of performing an equijoin with frame2 using the specified keys.


def map

def map( f : DynamicRecord => DynamicRecord,
         errorMsg : String = "",
         transformationContext : String = "",
         callSite : CallSite = CallSite("Not provided", ""),
         stageThreshold : Long = 0,
         totalThreshold : Long = 0
       ) : DynamicFrame

Returns a new DynamicFrame constructed by applying the specified function 'f' to each record in this
DynamicFrame.

This method copies each record before applying the specified function, so it is safe to mutate the
records. If the mapping function throws an exception on a given record, that record is marked as an error,
and the stack trace is saved as a column in the error record.

def printSchema

def printSchema : Unit

Prints the schema of this DynamicFrame to stdout in a human-readable format.

def recomputeSchema

def recomputeSchema : Schema

Forces a schema recomputation. This requires a scan over the data, but it may "tighten" the schema if
there are some fields in the current schema that are not present in the data.

Returns the recomputed schema.

def relationalize

def relationalize( rootTableName : String,
                   stagingPath : String,
                   options : JsonOptions = JsonOptions.empty,
                   transformationContext : String = "",
                   callSite : CallSite = CallSite("Not provided"),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : Seq[DynamicFrame]

• rootTableName — The name to use for the base DynamicFrame in the output. DynamicFrames that
are created by pivoting arrays start with this as a prefix.
• stagingPath — The Amazon Simple Storage Service (Amazon S3) path for writing intermediate data.
• options — Relationalize options and configuration. Currently unused.

Flattens all nested structures and pivots arrays into separate tables.

You can use this operation to prepare deeply nested data for ingestion into a relational database. Nested
structs are flattened in the same manner as the unnest (p. 345) transform. Additionally, arrays are
pivoted into separate tables with each array element becoming a row. For example, suppose that you
have a DynamicFrame with the following data:

{"name": "Nancy", "age": 47, "friends": ["Fred", "Lakshmi"]}


{"name": "Stephanie", "age": 28, "friends": ["Yao", "Phil", "Alvin"]}


{"name": "Nathan", "age": 54, "friends": ["Nicolai", "Karen"]}

Execute the following code:

{{{
df.relationalize("people", "s3:/my_bucket/my_path", JsonOptions.empty)
}}}

This produces two tables. The first table is named "people" and contains the following:

{{{
{"name": "Nancy", "age": 47, "friends": 1}
{"name": "Stephanie", "age": 28, "friends": 2}
{"name": "Nathan", "age": 54, "friends": 3)
}}}

Here, the friends array has been replaced with an auto-generated join key. A separate table named
people.friends is created with the following content:

{{{
{"id": 1, "index": 0, "val": "Fred"}
{"id": 1, "index": 1, "val": "Lakshmi"}
{"id": 2, "index": 0, "val": "Yao"}
{"id": 2, "index": 1, "val": "Phil"}
{"id": 2, "index": 2, "val": "Alvin"}
{"id": 3, "index": 0, "val": "Nicolai"}
{"id": 3, "index": 1, "val": "Karen"}
}}}

In this table, 'id' is a join key that identifies which record the array element came from, 'index' refers to
the position in the original array, and 'val' is the actual array entry.

The relationalize method returns the sequence of DynamicFrames created by applying this process
recursively to all arrays.
Note
The AWS Glue library automatically generates join keys for new tables. To ensure that join keys
are unique across job runs, you must enable job bookmarks.

def renameField

def renameField( oldName : String,


newName : String,
transformationContext : String = "",
callSite : CallSite = CallSite("Not provided", ""),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

• oldName — The original name of the column.


• newName — The new name of the column.

Returns a new DynamicFrame with the specified field renamed.

You can use this method to rename nested fields. For example, the following code would rename state
to state_code inside the address struct:

{{{


df.renameField("address.state", "address.state_code")
}}}

def repartition

def repartition( numPartitions : Int,


transformationContext : String = "",
callSite : CallSite = CallSite("Not provided", ""),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

Returns a new DynamicFrame with numPartitions partitions.

def resolveChoice

def resolveChoice( specs : Seq[Product2[String, String]] = Seq.empty[ResolveSpec],


choiceOption : Option[ChoiceOption] = None,
database : Option[String] = None,
tableName : Option[String] = None,
transformationContext : String = "",
callSite : CallSite = CallSite("Not provided", ""),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

• choiceOption — An action to apply to all ChoiceType columns not listed in the specs sequence.
• database — The Data Catalog database to use with the match_catalog action.
• tableName — The Data Catalog table to use with the match_catalog action.

Returns a new DynamicFrame by replacing one or more ChoiceTypes with a more specific type.

There are two ways to use resolveChoice. The first is to specify a sequence of specific columns and
how to resolve them. These are specified as tuples made up of (column, action) pairs.

The following are the possible actions:

• cast:type — Attempts to cast all values to the specified type.


• make_cols — Converts each distinct type to a column with the name columnName_type.
• make_struct — Converts a column to a struct with keys for each distinct type.
• project:type — Retains only values of the specified type.

The other mode for resolveChoice is to specify a single resolution for all ChoiceTypes. You can use
this in cases where the complete list of ChoiceTypes is unknown before execution. In addition to the
actions listed preceding, this mode also supports the following action:

• match_catalog — Attempts to cast each ChoiceType to the corresponding type in the specified
catalog table.

Examples:

Resolve the user.id column by casting to an int, and make the address field retain only structs:

{{{
df.resolveChoice(specs = Seq(("user.id", "cast:int"), ("address", "project:struct")))


}}}

Resolve all ChoiceTypes by converting each choice to a separate column:

{{{
df.resolveChoice(choiceOption = Some(ChoiceOption("make_cols")))
}}}

Resolve all ChoiceTypes by casting to the types in the specified catalog table:

{{{
df.resolveChoice(choiceOption = Some(ChoiceOption("match_catalog")),
database = Some("my_database"),
tableName = Some("my_table"))
}}}

def schema

def schema : Schema

Returns the schema of this DynamicFrame.

The returned schema is guaranteed to contain every field that is present in a record in this
DynamicFrame. But in a small number of cases, it might also contain additional fields. You can use the
unnest (p. 345) method to "tighten" the schema based on the records in this DynamicFrame.

def selectField

def selectField( fieldName : String,


transformationContext : String = "",
callSite : CallSite = CallSite("Not provided", ""),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

Returns a single field as a DynamicFrame.

def selectFields

def selectFields( paths : Seq[String],


transformationContext : String = "",
callSite : CallSite = CallSite("Not provided", ""),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

• paths — The sequence of column names to select.

Returns a new DynamicFrame containing the specified columns.


Note
You can only use the selectFields method to select top-level columns. You can use the
applyMapping (p. 335) method to select nested columns.
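
For example, the following call (with hypothetical column names) keeps only the name and age columns:

{{{
val projected = df.selectFields(Seq("name", "age"))
}}}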

def show

def show( numRows : Int = 20 ) : Unit


• numRows — The number of rows to print.

Prints rows from this DynamicFrame in JSON format.

def spigot

def spigot( path : String,


options : JsonOptions = new JsonOptions("{}"),
transformationContext : String = "",
callSite : CallSite = CallSite("Not provided"),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

Passthrough transformation that returns the same records but writes out a subset of records as a side
effect.

• path — The path in Amazon S3 to write output to, in the form s3://bucket/path.
• options — An optional JsonOptions map describing the sampling behavior.

Returns a DynamicFrame that contains the same records as this one.

By default, writes 100 arbitrary records to the location specified by path. You can customize this
behavior by using the options map. Valid keys include the following:

• topk — Specifies the total number of records written out. The default is 100.
• prob — Specifies the probability (as a decimal) that an individual record is included. Default is 1.

For example, the following call would sample the dataset by selecting each record with a 20 percent
probability and stopping after 200 records have been written:

{{{
df.spigot("s3://my_bucket/my_path", JsonOptions(Map("topk" -&gt; 200, "prob" -&gt; 0.2)))
}}}

def splitFields

def splitFields( paths : Seq[String],


transformationContext : String = "",
callSite : CallSite = CallSite("Not provided", ""),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : Seq[DynamicFrame]

• paths — The paths to include in the first DynamicFrame.

Returns a sequence of two DynamicFrames. The first DynamicFrame contains the specified paths, and
the second contains all other columns.
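
For example, the following sketch (with hypothetical column names) separates two columns from the rest of the frame:

{{{
val Seq(selectedFrame, remainingFrame) = df.splitFields(Seq("name", "email"))
}}}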

def splitRows

def splitRows( paths : Seq[String],
values : Seq[Any],
operators : Seq[String],
transformationContext : String,
callSite : CallSite,
stageThreshold : Long,
totalThreshold : Long
) : Seq[DynamicFrame]

Splits rows based on predicates that compare columns to constants.

• paths — The columns to use for comparison.


• values — The constant values to use for comparison.
• operators — The operators to use for comparison.

Returns a sequence of two DynamicFrames. The first contains rows for which the predicate is true and
the second contains those for which it is false.

Predicates are specified using three sequences: 'paths' contains the (possibly nested) column names,
'values' contains the constant values to compare to, and 'operators' contains the operators to use for
comparison. All three sequences must be the same length: The nth operator is used to compare the nth
column with the nth value.

Each operator must be one of "!=", "=", "<=", "<", ">=", or ">".

As an example, the following call would split a DynamicFrame so that the first output frame would
contain records of people over 65 from the United States, and the second would contain all other
records:

{{{
df.splitRows(Seq("age", "address.country"), Seq(65, "USA"), Seq("&gt;=", "="))
}}}

def stageErrorsCount

def stageErrorsCount

Returns the number of error records created while computing this DynamicFrame. This excludes errors
from previous operations that were passed into this DynamicFrame as input.

def toDF

def toDF( specs : Seq[ResolveSpec] = Seq.empty[ResolveSpec] ) : DataFrame

Converts this DynamicFrame to an Apache Spark SQL DataFrame with the same schema and records.
Note
Because DataFrames don't support ChoiceTypes, this method automatically converts
ChoiceType columns into StructTypes. For more information and options for resolving
choice, see resolveChoice (p. 341).
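
For example, the following sketch registers the converted DataFrame as a temporary view and queries it with Spark SQL. The view and column names are hypothetical, and the sketch assumes a glueContext value is in scope:

{{{
val sparkDf = df.toDF()
sparkDf.createOrReplaceTempView("people")
glueContext.getSparkSession.sql("SELECT name FROM people WHERE age > 21").show()
}}}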

def unbox

def unbox( path : String,
format : String,
optionString : String = "{}",
transformationContext : String = "",
callSite : CallSite = CallSite("Not provided"),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

• path — The column to parse. Must be a string or binary.


• format — The format to use for parsing.
• optionString — Options to pass to the format, such as the CSV separator.

Parses an embedded string or binary column according to the specified format. Parsed columns are
nested under a struct with the original column name.

For example, suppose that you have a CSV file with an embedded JSON column:

name, age, address


Sally, 36, {"state": "NE", "city": "Omaha"}
...

After an initial parse, you would get a DynamicFrame with the following schema:

{{{
root
|-- name: string
|-- age: int
|-- address: string
}}}

You can call unbox on the address column to parse the specific components:

{{{
df.unbox("address", "json")
}}}

This gives us a DynamicFrame with the following schema:

{{{
root
|-- name: string
|-- age: int
|-- address: struct
| |-- state: string
| |-- city: string
}}}

def unnest

def unnest( transformationContext : String = "",


callSite : CallSite = CallSite("Not Provided"),
stageThreshold : Long = 0,
totalThreshold : Long = 0
) : DynamicFrame

Returns a new DynamicFrame with all nested structures flattened. Names are constructed using the
'.' (period) character.

For example, suppose that you have a DynamicFrame with the following schema:

{{{


root
|-- name: string
|-- age: int
|-- address: struct
| |-- state: string
| |-- city: string
}}}

The following call unnests the address struct:

{{{
df.unnest()
}}}

The resulting schema is as follows:

{{{
root
|-- name: string
|-- age: int
|-- address.state: string
|-- address.city: string
}}}

This method also unnests nested structs inside of arrays. But for historical reasons, the names of such
fields are prepended with the name of the enclosing array and ".val".

def withFrameSchema

def withFrameSchema( getSchema : () => Schema ) : DynamicFrame

• getSchema — A function that returns the schema to use. Specified as a zero-parameter function to
defer potentially expensive computation.

Sets the schema of this DynamicFrame to the specified value. This is primarily used internally to avoid
costly schema recomputation. The passed-in schema must contain all columns present in the data.

def withName

def withName( name : String ) : DynamicFrame

• name — The new name to use.

Returns a copy of this DynamicFrame with a new name.

def withTransformationContext

def withTransformationContext( ctx : String ) : DynamicFrame

Returns a copy of this DynamicFrame with the specified transformation context.

The DynamicFrame Object


Package: com.amazonaws.services.glue


object DynamicFrame

def apply

def apply( df : DataFrame,


glueContext : GlueContext
) : DynamicFrame

def emptyDynamicFrame

def emptyDynamicFrame( glueContext : GlueContext ) : DynamicFrame

def fromPythonRDD

def fromPythonRDD( rdd : JavaRDD[Array[Byte]],


glueContext : GlueContext
) : DynamicFrame

def ignoreErrors

def ignoreErrors( fn : DynamicRecord => DynamicRecord ) : DynamicRecord

def inlineErrors

def inlineErrors( msg : String,


callSite : CallSite
) : (DynamicRecord => DynamicRecord)

def newFrameWithErrors

def newFrameWithErrors( prevFrame : DynamicFrame,


rdd : RDD[DynamicRecord],
name : String = "",
transformationContext : String = "",
callSite : CallSite,
stageThreshold : Long,
totalThreshold : Long
) : DynamicFrame

AWS Glue Scala DynamicRecord Class


Topics
• def addField (p. 348)
• def dropField (p. 348)
• def setError (p. 348)
• def isError (p. 349)
• def getError (p. 349)
• def clearError (p. 349)
• def write (p. 349)
• def readFields (p. 349)


• def clone (p. 349)


• def schema (p. 349)
• def getRoot (p. 349)
• def toJson (p. 349)
• def getFieldNode (p. 350)
• def getField (p. 350)
• def hashCode (p. 350)
• def equals (p. 350)
• DynamicRecord Object (p. 350)
• RecordTraverser Trait (p. 350)

Package: com.amazonaws.services.glue

class DynamicRecord extends Serializable with Writable with Cloneable

A DynamicRecord is a self-describing data structure that represents a row of data in the dataset that
is being processed. It is self-describing in the sense that you can get the schema of the row that is
represented by the DynamicRecord by inspecting the record itself. A DynamicRecord is similar to a
Row in Apache Spark.
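
For example, the following sketch inspects each record inside a map transform and fills in a default value when a field is missing. The country field is hypothetical, and the sketch assumes that com.amazonaws.services.glue.types.StringNode is imported:

{{{
df.map(rec => {
  if (rec.getField("country").isEmpty) {
    rec.addField("country", StringNode("unknown"))
  }
  rec
})
}}}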

def addField

def addField( path : String,


dynamicNode : DynamicNode
) : Unit

Adds a DynamicNode (p. 362) to the specified path.

• path — The path for the field to be added.


• dynamicNode — The DynamicNode (p. 362) to be added at the specified path.

def dropField

def dropField(path: String, underRename: Boolean = false): Option[DynamicNode]

Drops a DynamicNode (p. 362) from the specified path and returns the dropped node if there is not an
array in the specified path.

• path — The path to the field to drop.


• underRename — True if dropField is called as part of a rename transform, or false otherwise (false
by default).

Returns a scala.Option of the dropped DynamicNode (p. 362).

def setError

def setError( error : Error )

Sets this record as an error record, as specified by the error parameter.


Returns a DynamicRecord.

def isError

def isError

Checks whether this record is an error record.

def getError

def getError

Gets the Error if the record is an error record. Returns scala.Some(Error) if this record is an error record, or scala.None otherwise.

def clearError

def clearError

Sets the Error to scala.None.

def write

override def write( out : DataOutput ) : Unit

def readFields

override def readFields( in : DataInput ) : Unit

def clone

override def clone : DynamicRecord

Clones this record to a new DynamicRecord and returns it.

def schema

def schema

Gets the Schema by inspecting the record.

def getRoot

def getRoot : ObjectNode

Gets the root ObjectNode for the record.

def toJson

def toJson : String


Gets the JSON string for the record.

def getFieldNode

def getFieldNode( path : String ) : Option[DynamicNode]

Gets the field's value at the specified path as an option of DynamicNode.

Returns scala.Some(DynamicNode (p. 362)) if the field exists, or scala.None otherwise.

def getField

def getField( path : String ) : Option[Any]

Gets the field's value at the specified path, wrapped in an Option.

Returns scala.Some(value) if the field exists, or scala.None otherwise.

def hashCode

override def hashCode : Int

def equals

override def equals( other : Any )

DynamicRecord Object

object DynamicRecord

def apply

def apply( row : Row,


schema : SparkStructType )

Apply method to convert an Apache Spark SQL Row to a DynamicRecord (p. 347).

• row — A Spark SQL Row.


• schema — The Schema of that row.

Returns a DynamicRecord.

RecordTraverser Trait

trait RecordTraverser {
  def nullValue(): Unit
  def byteValue(value: Byte): Unit
  def binaryValue(value: Array[Byte]): Unit
  def booleanValue(value: Boolean): Unit
  def shortValue(value: Short): Unit
  def intValue(value: Int): Unit
  def longValue(value: Long): Unit
  def floatValue(value: Float): Unit
  def doubleValue(value: Double): Unit
  def decimalValue(value: BigDecimal): Unit
  def stringValue(value: String): Unit
  def dateValue(value: Date): Unit
  def timestampValue(value: Timestamp): Unit
  def objectStart(length: Int): Unit
  def objectKey(key: String): Unit
  def objectEnd(): Unit
  def mapStart(length: Int): Unit
  def mapKey(key: String): Unit
  def mapEnd(): Unit
  def arrayStart(length: Int): Unit
  def arrayEnd(): Unit
}

AWS Glue Scala GlueContext APIs


Package: com.amazonaws.services.glue

class GlueContext extends SQLContext(sc) (


@transient val sc : SparkContext,
val defaultSourcePartitioner : PartitioningStrategy )

GlueContext is the entry point for reading and writing a DynamicFrame (p. 333) from and to Amazon
Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, JDBC, and so on. This class provides
utility functions to create DataSource trait (p. 333) and DataSink (p. 331) objects that can in turn be
used to read and write DynamicFrames.

You can also use GlueContext to set a target number of partitions (default 20) in the DynamicFrame if
the number of partitions created from the source is less than a minimum threshold for partitions (default
10).
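
For example, a job script typically constructs a GlueContext from a SparkContext at startup; a minimal sketch:

{{{
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

val sc = new SparkContext()
val glueContext = new GlueContext(sc)
val sparkSession = glueContext.getSparkSession
}}}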

def getCatalogSink

def getCatalogSink( database : String,
tableName : String,
redshiftTmpDir : String = "",
transformationContext : String = "",
additionalOptions : JsonOptions = JsonOptions.empty,
catalogId : String = null
) : DataSink

Creates a DataSink (p. 331) that writes to a location specified in a table that is defined in the Data
Catalog.

• database — The database name in the Data Catalog.


• tableName — The table name in the Data Catalog.
• redshiftTmpDir — The temporary staging directory to be used with certain data sinks. Set to empty
by default.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.
• additionalOptions – Additional options provided to AWS Glue.
• catalogId — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default
account ID of the caller is used.


Returns the DataSink.
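
For example, the following sketch writes a DynamicFrame named dyf to the location defined by a catalog table. The database and table names are hypothetical, and the sketch assumes that the returned DataSink exposes the writeDynamicFrame method described in DataSink (p. 331):

{{{
val sink = glueContext.getCatalogSink(database = "my_database",
  tableName = "my_output_table", transformationContext = "datasink1")
sink.writeDynamicFrame(dyf)
}}}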

def getCatalogSource

def getCatalogSource( database : String,
tableName : String,
redshiftTmpDir : String = "",
transformationContext : String = "",
push_down_predicate : String = " ",
additionalOptions : JsonOptions = JsonOptions.empty,
catalogId : String = null
) : DataSource

Creates a DataSource trait (p. 333) that reads data from a table definition in the Data Catalog.

• database — The database name in the Data Catalog.


• tableName — The table name in the Data Catalog.
• redshiftTmpDir — The temporary staging directory to be used with certain data sinks. Set to empty
by default.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.
• push_down_predicate – Filters partitions without having to list and read all the files in your dataset.
For more information, see Pre-Filtering Using Pushdown Predicates (p. 250).
• additionalOptions – Additional options provided to AWS Glue.
• catalogId — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default
account ID of the caller is used.

Returns the DataSource.
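
For example, the following sketch reads a catalog table into a DynamicFrame. The database and table names are hypothetical, and the sketch assumes that the returned DataSource trait (p. 333) exposes a getDynamicFrame method:

{{{
val source = glueContext.getCatalogSource(database = "my_database",
  tableName = "my_table", transformationContext = "datasource0")
val dyf = source.getDynamicFrame()
}}}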

def getJDBCSink

def getJDBCSink( catalogConnection : String,


options : JsonOptions,
redshiftTmpDir : String = "",
transformationContext : String = "",
catalogId: String = null
) : DataSink

Creates a DataSink (p. 331) that writes to a JDBC database that is specified in a Connection object in
the Data Catalog. The Connection object has information to connect to a JDBC sink, including the URL,
user name, password, VPC, subnet, and security groups.

• catalogConnection — The name of the connection in the Data Catalog that contains the JDBC URL
to write to.
• options — A string of JSON name-value pairs that provide additional information that is required to
write to a JDBC data store. This includes:
• dbtable (required) — The name of the JDBC table. For JDBC data stores that support schemas within
a database, specify schema.table-name. If a schema is not provided, then the default "public"
schema is used. The following example shows an options parameter that points to a schema named
test and a table named test_table in database test_db.

options = JsonOptions("""{"dbtable": "test.test_table", "database": "test_db"}""")

• database (required) — The name of the JDBC database.


• Any additional options passed directly to the SparkSQL JDBC writer. For more information, see
Redshift data source for Spark.
• redshiftTmpDir — A temporary staging directory to be used with certain data sinks. Set to empty
by default.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.
• catalogId — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default
account ID of the caller is used.

Example code:

getJDBCSink(catalogConnection = "my-connection-name", options = JsonOptions("""{"dbtable":


"my-jdbc-table", "database": "my-jdbc-db"}"""), redshiftTmpDir = "", transformationContext
= "datasink4")

Returns the DataSink.

def getSink

def getSink( connectionType : String,


options : JsonOptions,
transformationContext : String = ""
) : DataSink

Creates a DataSink (p. 331) that writes data to a destination like Amazon Simple Storage Service
(Amazon S3), JDBC, or the AWS Glue Data Catalog.

• connectionType — The type of the connection.


• options — A string of JSON name-value pairs that provide additional information to establish the
connection with the data sink.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.

Returns the DataSink.

def getSinkWithFormat

def getSinkWithFormat( connectionType : String,


options : JsonOptions,
transformationContext : String = "",
format : String = null,
formatOptions : JsonOptions = JsonOptions.empty
) : DataSink

Creates a DataSink (p. 331) that writes data to a destination like Amazon S3, JDBC, or the Data Catalog,
and also sets the format for the data to be written out to the destination.

• connectionType — The type of the connection. Refer to DataSink (p. 331) for a list of supported
connection types.
• options — A string of JSON name-value pairs that provide additional information to establish a
connection with the data sink.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.


• format — The format of the data to be written out to the destination.


• formatOptions — A string of JSON name-value pairs that provide additional options for formatting
data at the destination. See Format Options (p. 248).

Returns the DataSink.
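
For example, the following sketch writes a DynamicFrame named dyf to a hypothetical Amazon S3 bucket as Parquet; it assumes that the returned DataSink exposes a writeDynamicFrame method:

{{{
val sink = glueContext.getSinkWithFormat(connectionType = "s3",
  options = JsonOptions("""{"path": "s3://my_bucket/output/"}"""),
  format = "parquet")
sink.writeDynamicFrame(dyf)
}}}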

def getSource

def getSource( connectionType : String,
connectionOptions : JsonOptions,
transformationContext : String = "",
pushDownPredicate : String = " "
) : DataSource

Creates a DataSource trait (p. 333) that reads data from a source like Amazon S3, JDBC, or the AWS
Glue Data Catalog.

• connectionType — The type of the data source. Can be one of “s3”, “mysql”, “redshift”, “oracle”,
“sqlserver”, “postgresql”, "dynamodb", “parquet”, or “orc”.
• connectionOptions — A string of JSON name-value pairs that provide additional information for
establishing a connection with the data source.
• connectionOptions when the connectionType is "s3":
• paths (required) — List of Amazon S3 paths to read.
• compressionType (optional) — Compression type of the data. This is generally not required if the
data has a standard file extension. Possible values are “gzip” and “bzip”.
• exclusions (optional) — A string containing a JSON list of glob patterns to exclude. For example
"[\"**.pdf\"]" excludes all PDF files.
• maxBand (optional) — This advanced option controls the duration in seconds after which AWS
Glue expects an Amazon S3 listing to be consistent. Files with modification timestamps falling
within the last maxBand seconds are tracked when using job bookmarks to account for Amazon S3
eventual consistency. It is rare to set this option. The default is 900 seconds.
• maxFilesInBand (optional) — This advanced option specifies the maximum number of files to save
from the last maxBand seconds. If this number is exceeded, extra files are skipped and processed
only in the next job run.
• groupFiles (optional) — Grouping files is enabled by default when the input contains more than
50,000 files. To disable grouping with fewer than 50,000 files, set this parameter to “inPartition”.
To disable grouping when there are more than 50,000 files, set this parameter to “none”.
• groupSize (optional) — The target group size in bytes. The default is computed based on the input
data size and the size of your cluster. When there are fewer than 50,000 input files, groupFiles
must be set to “inPartition” for this option to take effect.
• recurse (optional) — If set to true, recursively read files in any subdirectory of the specified paths.
• connectionOptions when the connectionType is "dynamodb":
• dynamodb.input.tableName (required) — The DynamoDB table from which to read.
• dynamodb.throughput.read.percent (optional) — The percentage of reserved capacity units (RCU)
to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
• connectionOptions when the connectionType is "parquet" or "orc":
• paths (required) — List of Amazon S3 paths to read.
• Any additional options are passed directly to the SparkSQL DataSource. For more information, see
Redshift data source for Spark.
• connectionOptions when the connectionType is "redshift":
• url (required) — The JDBC URL for an Amazon Redshift database.
• dbtable (required) — The Amazon Redshift table to read.


• redshiftTmpDir (required) — The Amazon S3 path where temporary data can be staged when
copying out of Amazon Redshift.
• user (required) — The username to use when connecting to the Amazon Redshift cluster.
• password (required) — The password to use when connecting to the Amazon Redshift cluster.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.
• pushDownPredicate — Predicate on partition columns.

Returns the DataSource.

def getSourceWithFormat

def getSourceWithFormat( connectionType : String,


options : JsonOptions,
transformationContext : String = "",
format : String = null,
formatOptions : JsonOptions = JsonOptions.empty
) : DataSource

Creates a DataSource trait (p. 333) that reads data from a source like Amazon S3, JDBC, or the AWS
Glue Data Catalog, and also sets the format of data stored in the source.

• connectionType — The type of the data source. Can be one of “s3”, “mysql”, “redshift”, “oracle”,
“sqlserver”, “postgresql”, "dynamodb", “parquet”, or “orc”.
• options — A string of JSON name-value pairs that provide additional information for establishing a
connection with the data source.
• transformationContext — The transformation context that is associated with the sink to be used
by job bookmarks. Set to empty by default.
• format — The format of the data that is stored at the source. When the connectionType is "s3", you
can also specify format. Can be one of “avro”, “csv”, “grokLog”, “ion”, “json”, “xml”, “parquet”, or “orc”.
• formatOptions — A string of JSON name-value pairs that provide additional options for parsing
data at the source. See Format Options (p. 248).

Returns the DataSource.
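
For example, the following sketch reads JSON files from a hypothetical Amazon S3 bucket into a DynamicFrame; it assumes that the returned DataSource exposes a getDynamicFrame method:

{{{
val source = glueContext.getSourceWithFormat(connectionType = "s3",
  options = JsonOptions("""{"paths": ["s3://my_bucket/input/"]}"""),
  format = "json")
val dyf = source.getDynamicFrame()
}}}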

def getSparkSession

def getSparkSession : SparkSession

Gets the SparkSession object associated with this GlueContext. Use this SparkSession object to
register tables and UDFs for use with DataFrame created from DynamicFrames.

Returns the SparkSession.

def this

def this( sc : SparkContext,


minPartitions : Int,
targetPartitions : Int )

Creates a GlueContext object using the specified SparkContext, minimum partitions, and target
partitions.


• sc — The SparkContext.
• minPartitions — The minimum number of partitions.
• targetPartitions — The target number of partitions.

Returns the GlueContext.

def this

def this( sc : SparkContext )

Creates a GlueContext object with the provided SparkContext. Sets the minimum partitions to 10
and target partitions to 20.

• sc — The SparkContext.

Returns the GlueContext.

def this

def this( sparkContext : JavaSparkContext )

Creates a GlueContext object with the provided JavaSparkContext. Sets the minimum partitions to
10 and target partitions to 20.

• sparkContext — The JavaSparkContext.

Returns the GlueContext.

MappingSpec
Package: com.amazonaws.services.glue

MappingSpec Case Class

case class MappingSpec( sourcePath: SchemaPath,


sourceType: DataType,
targetPath: SchemaPath,
targetType: DataType
) extends Product4[String, String, String, String] {
override def _1: String = sourcePath.toString
override def _2: String = ExtendedTypeName.fromDataType(sourceType)
override def _3: String = targetPath.toString
override def _4: String = ExtendedTypeName.fromDataType(targetType)
}

• sourcePath — The SchemaPath of the source field.


• sourceType — The DataType of the source field.
• targetPath — The SchemaPath of the target field.
• targetType — The DataType of the target field.

A MappingSpec specifies a mapping from a source path and a source data type to a target path and
a target data type. The value at the source path in the source frame appears in the target frame at the
target path. The source data type is cast to the target data type.


It extends from Product4 so that you can handle any Product4 in your applyMapping interface.
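
For example, the following sketch (with hypothetical paths) builds a mapping that moves a nested string field to a top-level field and casts it to long, using the string-based apply method shown in the MappingSpec object that follows:

{{{
val spec = MappingSpec("user.id", "string", "userid", "long")
}}}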

MappingSpec Object

object MappingSpec

The MappingSpec object has the following members:

val orderingByTarget

val orderingByTarget: Ordering[MappingSpec]

def apply

def apply( sourcePath : String,


sourceType : DataType,
targetPath : String,
targetType : DataType
) : MappingSpec

Creates a MappingSpec.

• sourcePath — A string representation of the source path.


• sourceType — The source DataType.
• targetPath — A string representation of the target path.
• targetType — The target DataType.

Returns a MappingSpec.

def apply

def apply( sourcePath : String,


sourceTypeString : String,
targetPath : String,
targetTypeString : String
) : MappingSpec

Creates a MappingSpec.

• sourcePath — A string representation of the source path.
• sourceTypeString — A string representation of the source data type.
• targetPath — A string representation of the target path.
• targetTypeString — A string representation of the target data type.

Returns a MappingSpec.

def apply

def apply( product : Product4[String, String, String, String] ) : MappingSpec

Creates a MappingSpec.


• product — The Product4 of the source path, source data type, target path, and target data type.

Returns a MappingSpec.

AWS Glue Scala ResolveSpec APIs


Topics
• ResolveSpec Object (p. 358)
• ResolveSpec Case Class (p. 358)

Package: com.amazonaws.services.glue

ResolveSpec Object
ResolveSpec

object ResolveSpec

def apply

def apply( path : String,


action : String
) : ResolveSpec

Creates a ResolveSpec.

• path — A string representation of the choice field that needs to be resolved.


• action — A resolution action. The action can be one of the following: Project, KeepAsStruct, or
Cast.

Returns the ResolveSpec.

def apply

def apply( product : Product2[String, String] ) : ResolveSpec

Creates a ResolveSpec.

• product — Product2 of: source path, resolution action.

Returns the ResolveSpec.

ResolveSpec Case Class

case class ResolveSpec extends Product2[String, String] (


path : SchemaPath,
action : String )

Creates a ResolveSpec.

• path — The SchemaPath of the choice field that needs to be resolved.


• action — A resolution action. The action can be one of the following: Project, KeepAsStruct, or
Cast.


ResolveSpec def Methods

def _1 : String

def _2 : String

AWS Glue Scala ArrayNode APIs


Package: com.amazonaws.services.glue.types

ArrayNode Case Class


ArrayNode

case class ArrayNode extends DynamicNode (


value : ArrayBuffer[DynamicNode] )

ArrayNode def Methods

def add( node : DynamicNode )

def clone

def equals( other : Any )

def get( index : Int ) : Option[DynamicNode]

def getValue

def hashCode : Int

def isEmpty : Boolean

def nodeType

def remove( index : Int )

def this

def toIterator : Iterator[DynamicNode]

def toJson : String

def update( index : Int,


node : DynamicNode )

AWS Glue Scala BinaryNode APIs


Package: com.amazonaws.services.glue.types

BinaryNode Case Class


BinaryNode

case class BinaryNode extends ScalarNode(value, TypeCode.BINARY) (


value : Array[Byte] )

BinaryNode val Fields

• ordering

BinaryNode def Methods

def clone

def equals( other : Any )

def hashCode : Int

AWS Glue Scala BooleanNode APIs


Package: com.amazonaws.services.glue.types

BooleanNode Case Class


BooleanNode

case class BooleanNode extends ScalarNode(value, TypeCode.BOOLEAN) (


value : Boolean )

BooleanNode val Fields

• ordering

BooleanNode def Methods

def equals( other : Any )

AWS Glue Scala ByteNode APIs


Package: com.amazonaws.services.glue.types

ByteNode Case Class


ByteNode


case class ByteNode extends ScalarNode(value, TypeCode.BYTE) (


value : Byte )

ByteNode val Fields

• ordering

ByteNode def Methods

def equals( other : Any )

AWS Glue Scala DateNode APIs


Package: com.amazonaws.services.glue.types

DateNode Case Class


DateNode

case class DateNode extends ScalarNode(value, TypeCode.DATE) (


value : Date )

DateNode val Fields

• ordering

DateNode def Methods

def equals( other : Any )

def this( value : Int )

AWS Glue Scala DecimalNode APIs


Package: com.amazonaws.services.glue.types

DecimalNode Case Class


DecimalNode

case class DecimalNode extends ScalarNode(value, TypeCode.DECIMAL) (


value : BigDecimal )

DecimalNode val Fields

• ordering

DecimalNode def Methods

def equals( other : Any )


def this( value : Decimal )

AWS Glue Scala DoubleNode APIs


Package: com.amazonaws.services.glue.types

DoubleNode Case Class


DoubleNode

case class DoubleNode extends ScalarNode(value, TypeCode.DOUBLE) (


value : Double )

DoubleNode val Fields

• ordering

DoubleNode def Methods

def equals( other : Any )

AWS Glue Scala DynamicNode APIs


Topics
• DynamicNode Class (p. 362)
• DynamicNode Object (p. 363)

Package: com.amazonaws.services.glue.types

DynamicNode Class
DynamicNode

class DynamicNode extends Serializable with Cloneable

DynamicNode def Methods

def getValue : Any

Gets the plain value bound to the current record.

def nodeType : TypeCode

def toJson : String

Method for debug:

def toRow( schema : Schema,


options : Map[String, ResolveOption]
) : Row


def typeName : String

DynamicNode Object
DynamicNode

object DynamicNode

DynamicNode def Methods

def quote( field : String,


useQuotes : Boolean
) : String

def quote( node : DynamicNode,


useQuotes : Boolean
) : String

AWS Glue Scala FloatNode APIs


Package: com.amazonaws.services.glue.types

FloatNode Case Class


FloatNode

case class FloatNode extends ScalarNode(value, TypeCode.FLOAT) (


value : Float )

FloatNode val Fields

• ordering

FloatNode def Methods

def equals( other : Any )

AWS Glue Scala IntegerNode APIs


Package: com.amazonaws.services.glue.types

IntegerNode Case Class


IntegerNode

case class IntegerNode extends ScalarNode(value, TypeCode.INT) (


value : Int )

IntegerNode val Fields

• ordering


IntegerNode def Methods

def equals( other : Any )

AWS Glue Scala LongNode APIs


Package: com.amazonaws.services.glue.types

LongNode Case Class


LongNode

case class LongNode extends ScalarNode(value, TypeCode.LONG) (


value : Long )

LongNode val Fields

• ordering

LongNode def Methods

def equals( other : Any )

AWS Glue Scala MapLikeNode APIs


Package: com.amazonaws.services.glue.types

MapLikeNode Class
MapLikeNode

class MapLikeNode extends DynamicNode (


value : mutable.Map[String, DynamicNode] )

MapLikeNode def Methods

def clear : Unit

def get( name : String ) : Option[DynamicNode]

def getValue

def has( name : String ) : Boolean

def isEmpty : Boolean

def put( name : String,


node : DynamicNode


) : Option[DynamicNode]

def remove( name : String ) : Option[DynamicNode]

def toIterator : Iterator[(String, DynamicNode)]

def toJson : String

def toJson( useQuotes : Boolean ) : String

Example: Given this JSON:

{"foo": "bar"}

If useQuotes == true, toJson yields {"foo": "bar"}. If useQuotes == false, toJson yields {foo: bar}.

AWS Glue Scala MapNode APIs


Package: com.amazonaws.services.glue.types

MapNode Case Class


MapNode

case class MapNode extends MapLikeNode(value) (


value : mutable.Map[String, DynamicNode] )

MapNode def Methods

def clone

def equals( other : Any )

def hashCode : Int

def nodeType

def this

AWS Glue Scala NullNode APIs


Topics
• NullNode Class (p. 366)
• NullNode Case Object (p. 366)

Package: com.amazonaws.services.glue.types


NullNode Class
NullNode

class NullNode

NullNode Case Object


NullNode

case object NullNode extends NullNode

AWS Glue Scala ObjectNode APIs


Topics
• ObjectNode Object (p. 366)
• ObjectNode Case Class (p. 366)

Package: com.amazonaws.services.glue.types

ObjectNode Object
ObjectNode

object ObjectNode

ObjectNode def Methods

def apply( frameKeys : Set[String],


v1 : mutable.Map[String, DynamicNode],
v2 : mutable.Map[String, DynamicNode],
resolveWith : String
) : ObjectNode

ObjectNode Case Class


ObjectNode

case class ObjectNode extends MapLikeNode(value) (


val value : mutable.Map[String, DynamicNode] )

ObjectNode def Methods

def clone

def equals( other : Any )

def hashCode : Int

def nodeType


def this

AWS Glue Scala ScalarNode APIs


Topics
• ScalarNode Class (p. 367)
• ScalarNode Object (p. 367)

Package: com.amazonaws.services.glue.types

ScalarNode Class
ScalarNode

class ScalarNode extends DynamicNode (


value : Any,
scalarType : TypeCode )

ScalarNode def Methods

def compare( other : Any,


operator : String
) : Boolean

def getValue

def hashCode : Int

def nodeType

def toJson

ScalarNode Object
ScalarNode

object ScalarNode

ScalarNode def Methods

def apply( v : Any ) : DynamicNode

def compare( tv : Ordered[T],


other : T,
operator : String
) : Boolean

def compareAny( v : Any,


y : Any,


o : String )

def withEscapedSpecialCharacters( jsonToEscape : String ) : String

AWS Glue Scala ShortNode APIs


Package: com.amazonaws.services.glue.types

ShortNode Case Class


ShortNode

case class ShortNode extends ScalarNode(value, TypeCode.SHORT) (


value : Short )

ShortNode val Fields

• ordering

ShortNode def Methods

def equals( other : Any )

AWS Glue Scala StringNode APIs


Package: com.amazonaws.services.glue.types

StringNode Case Class


StringNode

case class StringNode extends ScalarNode(value, TypeCode.STRING) (


value : String )

StringNode val Fields

• ordering

StringNode def Methods

def equals( other : Any )

def this( value : UTF8String )

AWS Glue Scala TimestampNode APIs


Package: com.amazonaws.services.glue.types

TimestampNode Case Class


TimestampNode


case class TimestampNode extends ScalarNode(value, TypeCode.TIMESTAMP) (


value : Timestamp )

TimestampNode val Fields

• ordering

TimestampNode def Methods

def equals( other : Any )

def this( value : Long )

AWS Glue Scala GlueArgParser APIs


Package: com.amazonaws.services.glue.util

GlueArgParser Object
GlueArgParser

object GlueArgParser

This is strictly consistent with the Python version of utils.getResolvedOptions in the AWSGlueDataplanePython package.

GlueArgParser def Methods

def getResolvedOptions( args : Array[String],


options : Array[String]
) : Map[String, String]
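
For example, a job's main method typically resolves its arguments as in the following sketch, where sysArgs is the Array[String] passed to main and JOB_NAME is a job parameter:

{{{
import com.amazonaws.services.glue.util.GlueArgParser

def main(sysArgs: Array[String]): Unit = {
  val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
  println(args("JOB_NAME"))
}
}}}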

def initParser( userOptionsSet : mutable.Set[String] ) : ArgumentParser

AWS Glue Scala Job APIs


Package: com.amazonaws.services.glue.util

Job Object
Job

object Job

Job def Methods

def commit

def init( jobName : String,


glueContext : GlueContext,
args : java.util.Map[String, String] = Map[String, String]().asJava


) : this.type

def init( jobName : String,


glueContext : GlueContext,
endpoint : String,
args : java.util.Map[String, String]
) : this.type

def isInitialized

def reset

def runId
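
For example, a job script that uses job bookmarks typically initializes the job before its transformations and commits it afterward; a minimal sketch, assuming that glueContext and the resolved args map are already in scope:

{{{
import com.amazonaws.services.glue.util.Job
import scala.collection.JavaConverters._

Job.init(args("JOB_NAME"), glueContext, args.asJava)
// ... read, transform, and write DynamicFrames here ...
Job.commit()
}}}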


AWS Glue API


Contents
• Security APIs in AWS Glue (p. 376)
• Data Types (p. 376)
• DataCatalogEncryptionSettings Structure (p. 377)
• EncryptionAtRest Structure (p. 377)
• ConnectionPasswordEncryption Structure (p. 377)
• EncryptionConfiguration Structure (p. 378)
• S3Encryption Structure (p. 378)
• CloudWatchEncryption Structure (p. 378)
• JobBookmarksEncryption Structure (p. 379)
• SecurityConfiguration Structure (p. 379)
• Operations (p. 379)
• GetDataCatalogEncryptionSettings Action (Python:
get_data_catalog_encryption_settings) (p. 379)
• PutDataCatalogEncryptionSettings Action (Python:
put_data_catalog_encryption_settings) (p. 380)
• PutResourcePolicy Action (Python: put_resource_policy) (p. 381)
• GetResourcePolicy Action (Python: get_resource_policy) (p. 381)
• DeleteResourcePolicy Action (Python: delete_resource_policy) (p. 382)
• CreateSecurityConfiguration Action (Python: create_security_configuration) (p. 382)
• DeleteSecurityConfiguration Action (Python: delete_security_configuration) (p. 383)
• GetSecurityConfiguration Action (Python: get_security_configuration) (p. 384)
• GetSecurityConfigurations Action (Python: get_security_configurations) (p. 384)
• Catalog API (p. 385)
• Database API (p. 385)
• Data Types (p. 385)
• Database Structure (p. 385)
• DatabaseInput Structure (p. 386)
• Operations (p. 386)
• CreateDatabase Action (Python: create_database) (p. 386)
• UpdateDatabase Action (Python: update_database) (p. 387)
• DeleteDatabase Action (Python: delete_database) (p. 387)
• GetDatabase Action (Python: get_database) (p. 388)
• GetDatabases Action (Python: get_databases) (p. 389)
• Table API (p. 389)
• Data Types (p. 389)
• Table Structure (p. 390)
• TableInput Structure (p. 391)
• Column Structure (p. 392)
• StorageDescriptor Structure (p. 392)
• SerDeInfo Structure (p. 393)
• Order Structure (p. 394)
• SkewedInfo Structure (p. 394)


• TableVersion Structure (p. 394)


• TableError Structure (p. 395)
• TableVersionError Structure (p. 395)
• Operations (p. 395)
• CreateTable Action (Python: create_table) (p. 396)
• UpdateTable Action (Python: update_table) (p. 396)
• DeleteTable Action (Python: delete_table) (p. 397)
• BatchDeleteTable Action (Python: batch_delete_table) (p. 398)
• GetTable Action (Python: get_table) (p. 399)
• GetTables Action (Python: get_tables) (p. 399)
• GetTableVersion Action (Python: get_table_version) (p. 400)
• GetTableVersions Action (Python: get_table_versions) (p. 401)
• DeleteTableVersion Action (Python: delete_table_version) (p. 402)
• BatchDeleteTableVersion Action (Python: batch_delete_table_version) (p. 402)
• Partition API (p. 403)
• Data Types (p. 403)
• Partition Structure (p. 403)
• PartitionInput Structure (p. 404)
• PartitionSpecWithSharedStorageDescriptor Structure (p. 405)
• PartitionListComposingSpec Structure (p. 405)
• PartitionSpecProxy Structure (p. 405)
• PartitionValueList Structure (p. 406)
• Segment Structure (p. 406)
• PartitionError Structure (p. 406)
• Operations (p. 406)
• CreatePartition Action (Python: create_partition) (p. 406)
• BatchCreatePartition Action (Python: batch_create_partition) (p. 407)
• UpdatePartition Action (Python: update_partition) (p. 408)
• DeletePartition Action (Python: delete_partition) (p. 409)
• BatchDeletePartition Action (Python: batch_delete_partition) (p. 409)
• GetPartition Action (Python: get_partition) (p. 410)
• GetPartitions Action (Python: get_partitions) (p. 411)
• BatchGetPartition Action (Python: batch_get_partition) (p. 414)
• Connection API (p. 414)
• Data Types (p. 414)
• Connection Structure (p. 415)
• ConnectionInput Structure (p. 416)
• PhysicalConnectionRequirements Structure (p. 416)
• GetConnectionsFilter Structure (p. 417)
• Operations (p. 417)
• CreateConnection Action (Python: create_connection) (p. 417)
• DeleteConnection Action (Python: delete_connection) (p. 418)
• GetConnection Action (Python: get_connection) (p. 418)
• GetConnections Action (Python: get_connections) (p. 419)
• UpdateConnection Action (Python: update_connection) (p. 420)

• BatchDeleteConnection Action (Python: batch_delete_connection) (p. 420)


• User-Defined Function API (p. 421)
• Data Types (p. 421)
• UserDefinedFunction Structure (p. 421)
• UserDefinedFunctionInput Structure (p. 422)
• Operations (p. 422)
• CreateUserDefinedFunction Action (Python:
create_user_defined_function) (p. 422)
• UpdateUserDefinedFunction Action (Python:
update_user_defined_function) (p. 423)
• DeleteUserDefinedFunction Action (Python:
delete_user_defined_function) (p. 424)
• GetUserDefinedFunction Action (Python: get_user_defined_function) (p. 424)
• GetUserDefinedFunctions Action (Python: get_user_defined_functions) (p. 425)
• Importing an Athena Catalog to AWS Glue (p. 426)
• Data Types (p. 426)
• CatalogImportStatus Structure (p. 426)
• Operations (p. 426)
• ImportCatalogToGlue Action (Python: import_catalog_to_glue) (p. 426)
• GetCatalogImportStatus Action (Python: get_catalog_import_status) (p. 427)
• Crawlers and Classifiers API (p. 427)
• Classifier API (p. 428)
• Data Types (p. 428)
• Classifier Structure (p. 428)
• GrokClassifier Structure (p. 428)
• XMLClassifier Structure (p. 429)
• JsonClassifier Structure (p. 429)
• CreateGrokClassifierRequest Structure (p. 430)
• UpdateGrokClassifierRequest Structure (p. 430)
• CreateXMLClassifierRequest Structure (p. 431)
• UpdateXMLClassifierRequest Structure (p. 431)
• CreateJsonClassifierRequest Structure (p. 431)
• UpdateJsonClassifierRequest Structure (p. 432)
• Operations (p. 432)
• CreateClassifier Action (Python: create_classifier) (p. 432)
• DeleteClassifier Action (Python: delete_classifier) (p. 433)
• GetClassifier Action (Python: get_classifier) (p. 433)
• GetClassifiers Action (Python: get_classifiers) (p. 434)
• UpdateClassifier Action (Python: update_classifier) (p. 434)
• Crawler API (p. 435)
• Data Types (p. 435)
• Crawler Structure (p. 435)
• Schedule Structure (p. 436)
• CrawlerTargets Structure (p. 436)
• S3Target Structure (p. 437)
• JdbcTarget Structure (p. 437)

• DynamoDBTarget Structure (p. 437)


• CrawlerMetrics Structure (p. 437)
• SchemaChangePolicy Structure (p. 438)
• LastCrawlInfo Structure (p. 438)
• Operations (p. 439)
• CreateCrawler Action (Python: create_crawler) (p. 439)
• DeleteCrawler Action (Python: delete_crawler) (p. 440)
• GetCrawler Action (Python: get_crawler) (p. 441)
• GetCrawlers Action (Python: get_crawlers) (p. 441)
• GetCrawlerMetrics Action (Python: get_crawler_metrics) (p. 442)
• UpdateCrawler Action (Python: update_crawler) (p. 442)
• StartCrawler Action (Python: start_crawler) (p. 443)
• StopCrawler Action (Python: stop_crawler) (p. 444)
• Crawler Scheduler API (p. 444)
• Data Types (p. 444)
• Schedule Structure (p. 444)
• Operations (p. 444)
• UpdateCrawlerSchedule Action (Python: update_crawler_schedule) (p. 445)
• StartCrawlerSchedule Action (Python: start_crawler_schedule) (p. 445)
• StopCrawlerSchedule Action (Python: stop_crawler_schedule) (p. 446)
• Autogenerating ETL Scripts API (p. 446)
• Data Types (p. 446)
• CodeGenNode Structure (p. 446)
• CodeGenNodeArg Structure (p. 447)
• CodeGenEdge Structure (p. 447)
• Location Structure (p. 447)
• CatalogEntry Structure (p. 448)
• MappingEntry Structure (p. 448)
• Operations (p. 448)
• CreateScript Action (Python: create_script) (p. 449)
• GetDataflowGraph Action (Python: get_dataflow_graph) (p. 449)
• GetMapping Action (Python: get_mapping) (p. 450)
• GetPlan Action (Python: get_plan) (p. 450)
• Jobs API (p. 451)
• Jobs (p. 451)
• Data Types (p. 451)
• Job Structure (p. 451)
• ExecutionProperty Structure (p. 453)
• NotificationProperty Structure (p. 453)
• JobCommand Structure (p. 453)
• ConnectionsList Structure (p. 453)
• JobUpdate Structure (p. 454)
• Operations (p. 455)
• CreateJob Action (Python: create_job) (p. 455)
• UpdateJob Action (Python: update_job) (p. 456)

• GetJob Action (Python: get_job) (p. 457)


• GetJobs Action (Python: get_jobs) (p. 457)
• DeleteJob Action (Python: delete_job) (p. 458)
• Job Runs (p. 458)
• Data Types (p. 458)
• JobRun Structure (p. 459)
• Predecessor Structure (p. 460)
• JobBookmarkEntry Structure (p. 461)
• BatchStopJobRunSuccessfulSubmission Structure (p. 461)
• BatchStopJobRunError Structure (p. 461)
• Operations (p. 462)
• StartJobRun Action (Python: start_job_run) (p. 462)
• BatchStopJobRun Action (Python: batch_stop_job_run) (p. 463)
• GetJobRun Action (Python: get_job_run) (p. 463)
• GetJobRuns Action (Python: get_job_runs) (p. 464)
• ResetJobBookmark Action (Python: reset_job_bookmark) (p. 465)
• Triggers (p. 465)
• Data Types (p. 465)
• Trigger Structure (p. 465)
• TriggerUpdate Structure (p. 466)
• Predicate Structure (p. 467)
• Condition Structure (p. 467)
• Action Structure (p. 467)
• Operations (p. 468)
• CreateTrigger Action (Python: create_trigger) (p. 468)
• StartTrigger Action (Python: start_trigger) (p. 469)
• GetTrigger Action (Python: get_trigger) (p. 470)
• GetTriggers Action (Python: get_triggers) (p. 470)
• UpdateTrigger Action (Python: update_trigger) (p. 471)
• StopTrigger Action (Python: stop_trigger) (p. 471)
• DeleteTrigger Action (Python: delete_trigger) (p. 472)
• Development Endpoints API (p. 472)
• Data Types (p. 472)
• DevEndpoint Structure (p. 472)
• DevEndpointCustomLibraries Structure (p. 474)
• Operations (p. 474)
• CreateDevEndpoint Action (Python: create_dev_endpoint) (p. 475)
• UpdateDevEndpoint Action (Python: update_dev_endpoint) (p. 477)
• DeleteDevEndpoint Action (Python: delete_dev_endpoint) (p. 477)
• GetDevEndpoint Action (Python: get_dev_endpoint) (p. 478)
• GetDevEndpoints Action (Python: get_dev_endpoints) (p. 478)
• Common Data Types (p. 479)
• Tag Structure (p. 479)
• DecimalNumber Structure (p. 479)
• ErrorDetail Structure (p. 480)

• PropertyPredicate Structure (p. 480)


• ResourceUri Structure (p. 480)
• String Patterns (p. 480)
• Exceptions (p. 481)
• AccessDeniedException Structure (p. 481)
• AlreadyExistsException Structure (p. 481)
• ConcurrentModificationException Structure (p. 481)
• ConcurrentRunsExceededException Structure (p. 481)
• CrawlerNotRunningException Structure (p. 482)
• CrawlerRunningException Structure (p. 482)
• CrawlerStoppingException Structure (p. 482)
• EntityNotFoundException Structure (p. 482)
• GlueEncryptionException Structure (p. 482)
• IdempotentParameterMismatchException Structure (p. 483)
• InternalServiceException Structure (p. 483)
• InvalidExecutionEngineException Structure (p. 483)
• InvalidInputException Structure (p. 483)
• InvalidTaskStatusTransitionException Structure (p. 483)
• JobDefinitionErrorException Structure (p. 484)
• JobRunInTerminalStateException Structure (p. 484)
• JobRunInvalidStateTransitionException Structure (p. 484)
• JobRunNotInTerminalStateException Structure (p. 484)
• LateRunnerException Structure (p. 485)
• NoScheduleException Structure (p. 485)
• OperationTimeoutException Structure (p. 485)
• ResourceNumberLimitExceededException Structure (p. 485)
• SchedulerNotRunningException Structure (p. 485)
• SchedulerRunningException Structure (p. 486)
• SchedulerTransitioningException Structure (p. 486)
• UnrecognizedRunnerException Structure (p. 486)
• ValidationException Structure (p. 486)
• VersionMismatchException Structure (p. 486)

Security APIs in AWS Glue


The Security API describes the security data types, and the API related to security in AWS Glue.

Data Types
• DataCatalogEncryptionSettings Structure (p. 377)
• EncryptionAtRest Structure (p. 377)
• ConnectionPasswordEncryption Structure (p. 377)
• EncryptionConfiguration Structure (p. 378)
• S3Encryption Structure (p. 378)
• CloudWatchEncryption Structure (p. 378)


• JobBookmarksEncryption Structure (p. 379)


• SecurityConfiguration Structure (p. 379)

DataCatalogEncryptionSettings Structure
Contains configuration information for maintaining Data Catalog security.

Fields

• EncryptionAtRest – An EncryptionAtRest (p. 377) object.

Specifies the encryption-at-rest configuration for the Data Catalog.


• ConnectionPasswordEncryption – A ConnectionPasswordEncryption (p. 377) object.

When connection password protection is enabled, the Data Catalog uses a customer-provided key
to encrypt the password as part of CreateConnection or UpdateConnection and store it in the
ENCRYPTED_PASSWORD field in the connection properties. You can enable catalog encryption or only
password encryption.

EncryptionAtRest Structure
Specifies the encryption-at-rest configuration for the Data Catalog.

Fields

• CatalogEncryptionMode – Required: UTF-8 string (valid values: DISABLED | SSE-KMS="SSEKMS").

The encryption-at-rest mode for encrypting Data Catalog data.


• SseAwsKmsKeyId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The ID of the AWS KMS key to use for encryption at rest.

ConnectionPasswordEncryption Structure
The data structure used by the Data Catalog to encrypt the password as part of CreateConnection or
UpdateConnection and store it in the ENCRYPTED_PASSWORD field in the connection properties. You
can enable catalog encryption or only password encryption.

When a CreateConnection request arrives containing a password, the Data Catalog first encrypts
the password using your AWS KMS key. It then encrypts the whole connection object again if catalog
encryption is also enabled.

This encryption requires that you set AWS KMS key permissions to enable or restrict access on the
password key according to your security requirements. For example, you might want only admin users to
have decrypt permission on the password key.

Fields

• ReturnConnectionPasswordEncrypted – Required: Boolean.

When the ReturnConnectionPasswordEncrypted flag is set to "true", passwords remain


encrypted in the responses of GetConnection and GetConnections. This encryption takes effect
independently from catalog encryption.


• AwsKmsKeyId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

An AWS KMS key that is used to encrypt the connection password.

If connection password protection is enabled, the caller of CreateConnection and UpdateConnection needs at least kms:Encrypt permission on the specified AWS KMS key, to encrypt passwords before storing them in the Data Catalog.

You can set the decrypt permission to enable or restrict access on the password key according to your
security requirements.

EncryptionConfiguration Structure
Specifies an encryption configuration.

Fields

• S3Encryption – An array of S3Encryption (p. 378) objects.

The encryption configuration for S3 data.


• CloudWatchEncryption – A CloudWatchEncryption (p. 378) object.

The encryption configuration for CloudWatch.


• JobBookmarksEncryption – A JobBookmarksEncryption (p. 379) object.

The encryption configuration for Job Bookmarks.

S3Encryption Structure
Specifies how S3 data should be encrypted.

Fields

• S3EncryptionMode – UTF-8 string (valid values: DISABLED | SSE-KMS="SSEKMS" | SSE-S3="SSES3").

The encryption mode to use for S3 data.


• KmsKeyArn – UTF-8 string, matching the Custom string pattern #10 (p. 481).

The AWS ARN of the KMS key to be used to encrypt the data.

CloudWatchEncryption Structure
Specifies how CloudWatch data should be encrypted.

Fields

• CloudWatchEncryptionMode – UTF-8 string (valid values: DISABLED | SSE-KMS="SSEKMS").

The encryption mode to use for CloudWatch data.


• KmsKeyArn – UTF-8 string, matching the Custom string pattern #10 (p. 481).

The AWS ARN of the KMS key to be used to encrypt the data.


JobBookmarksEncryption Structure
Specifies how Job bookmark data should be encrypted.

Fields

• JobBookmarksEncryptionMode – UTF-8 string (valid values: DISABLED | CSE-KMS="CSEKMS").

The encryption mode to use for Job bookmarks data.


• KmsKeyArn – UTF-8 string, matching the Custom string pattern #10 (p. 481).

The AWS ARN of the KMS key to be used to encrypt the data.

SecurityConfiguration Structure
Specifies a security configuration.

Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the security configuration.


• CreatedTimeStamp – Timestamp.

The time at which this security configuration was created.


• EncryptionConfiguration – An EncryptionConfiguration (p. 378) object.

The encryption configuration associated with this security configuration.

Operations
• GetDataCatalogEncryptionSettings Action (Python: get_data_catalog_encryption_settings) (p. 379)
• PutDataCatalogEncryptionSettings Action (Python: put_data_catalog_encryption_settings) (p. 380)
• PutResourcePolicy Action (Python: put_resource_policy) (p. 381)
• GetResourcePolicy Action (Python: get_resource_policy) (p. 381)
• DeleteResourcePolicy Action (Python: delete_resource_policy) (p. 382)
• CreateSecurityConfiguration Action (Python: create_security_configuration) (p. 382)
• DeleteSecurityConfiguration Action (Python: delete_security_configuration) (p. 383)
• GetSecurityConfiguration Action (Python: get_security_configuration) (p. 384)
• GetSecurityConfigurations Action (Python: get_security_configurations) (p. 384)

GetDataCatalogEncryptionSettings Action (Python: get_data_catalog_encryption_settings)
Retrieves the security configuration for a specified catalog.


Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog for which to retrieve the security configuration. If none is provided, the
AWS account ID is used by default.

Response

• DataCatalogEncryptionSettings – A DataCatalogEncryptionSettings (p. 377) object.

The requested security configuration.

Errors

• InternalServiceException
• InvalidInputException
• OperationTimeoutException

PutDataCatalogEncryptionSettings Action (Python: put_data_catalog_encryption_settings)
Sets the security configuration for a specified catalog. After the configuration has been set, the specified
encryption is applied to every catalog write thereafter.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog for which to set the security configuration. If none is provided, the AWS
account ID is used by default.
• DataCatalogEncryptionSettings – Required: A DataCatalogEncryptionSettings (p. 377) object.

The security configuration to set.

Response

• No Response parameters.

Errors

• InternalServiceException
• InvalidInputException
• OperationTimeoutException
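
The following is a minimal sketch of this call using the AWS SDK for Python (Boto3). The KMS key aliases and the choice to enable both encryption at rest and connection password protection are illustrative assumptions, not requirements of the API.

import boto3

glue = boto3.client('glue')

# Enable encryption at rest and connection password protection for the
# account's Data Catalog. The key aliases are placeholders.
glue.put_data_catalog_encryption_settings(
    DataCatalogEncryptionSettings={
        'EncryptionAtRest': {
            'CatalogEncryptionMode': 'SSE-KMS',
            'SseAwsKmsKeyId': 'alias/my-catalog-key'
        },
        'ConnectionPasswordEncryption': {
            'ReturnConnectionPasswordEncrypted': True,
            'AwsKmsKeyId': 'alias/my-password-key'
        }
    }
)

# Read the settings back to confirm what is now in effect.
settings = glue.get_data_catalog_encryption_settings()['DataCatalogEncryptionSettings']
print(settings['EncryptionAtRest']['CatalogEncryptionMode'])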


PutResourcePolicy Action (Python: put_resource_policy)
Sets the Data Catalog resource policy for access control.

Request

• PolicyInJson – Required: UTF-8 string, not less than 2 or more than 10240 bytes long.

Contains the policy document to set, in JSON format.


• PolicyHashCondition – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The hash value returned when the previous policy was set using PutResourcePolicy. Its purpose
is to prevent concurrent modifications of a policy. Do not use this parameter if no previous policy has
been set.
• PolicyExistsCondition – UTF-8 string (valid values: MUST_EXIST | NOT_EXIST | NONE).

A value of MUST_EXIST is used to update a policy. A value of NOT_EXIST is used to create a new
policy. If a value of NONE or a null value is used, the call will not depend on the existence of a policy.

Response

• PolicyHash – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

A hash of the policy that has just been set. This must be included in a subsequent call that overwrites
or updates this policy.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException
• ConditionCheckFailureException
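
As a rough Boto3 illustration of the PolicyHash workflow described above, the following sketch creates a policy and then updates it conditionally; the account ID, role name, and policy statement are placeholders.

import json
import boto3

glue = boto3.client('glue')

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analyst-role"},
        "Action": "glue:GetTable",
        "Resource": "arn:aws:glue:us-east-1:123456789012:*"
    }]
}

# Create the policy only if none exists yet, and keep the returned hash.
response = glue.put_resource_policy(
    PolicyInJson=json.dumps(policy),
    PolicyExistsCondition='NOT_EXIST'
)
policy_hash = response['PolicyHash']

# A later update passes the hash to guard against concurrent modifications.
glue.put_resource_policy(
    PolicyInJson=json.dumps(policy),
    PolicyHashCondition=policy_hash,
    PolicyExistsCondition='MUST_EXIST'
)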

GetResourcePolicy Action (Python: get_resource_policy)
Retrieves a specified resource policy.

Request

• No Request parameters.

Response

• PolicyInJson – UTF-8 string, not less than 2 or more than 10240 bytes long.

Contains the requested policy document, in JSON format.


• PolicyHash – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Contains the hash value associated with this policy.


• CreateTime – Timestamp.

The date and time at which the policy was created.


• UpdateTime – Timestamp.

The date and time at which the policy was last updated.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException

DeleteResourcePolicy Action (Python: delete_resource_policy)
Deletes a specified policy.

Request

• PolicyHashCondition – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The hash value returned when this policy was set.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException
• ConditionCheckFailureException

CreateSecurityConfiguration Action (Python: create_security_configuration)
Creates a new security configuration.


Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name for the new security configuration.


• EncryptionConfiguration – Required: An EncryptionConfiguration (p. 378) object.

The encryption configuration for the new security configuration.

Response

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name assigned to the new security configuration.


• CreatedTimestamp – Timestamp.

The time at which the new security configuration was created.

Errors

• AlreadyExistsException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• ResourceNumberLimitExceededException
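
For reference, a minimal Boto3 sketch of this call follows; the configuration name and KMS key ARN are placeholders, and the choice to encrypt S3 data, CloudWatch logs, and job bookmarks with the same key is an assumption made for brevity.

import boto3

glue = boto3.client('glue')

kms_key_arn = 'arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID'  # placeholder

glue.create_security_configuration(
    Name='etl-security-config',
    EncryptionConfiguration={
        'S3Encryption': [{
            'S3EncryptionMode': 'SSE-KMS',
            'KmsKeyArn': kms_key_arn
        }],
        'CloudWatchEncryption': {
            'CloudWatchEncryptionMode': 'SSE-KMS',
            'KmsKeyArn': kms_key_arn
        },
        'JobBookmarksEncryption': {
            'JobBookmarksEncryptionMode': 'CSE-KMS',
            'KmsKeyArn': kms_key_arn
        }
    }
)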

DeleteSecurityConfiguration Action (Python: delete_security_configuration)
Deletes a specified security configuration.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the security configuration to delete.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException


GetSecurityConfiguration Action (Python: get_security_configuration)
Retrieves a specified security configuration.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the security configuration to retrieve.

Response

• SecurityConfiguration – A SecurityConfiguration (p. 379) object.

The requested security configuration.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

GetSecurityConfigurations Action (Python: get_security_configurations)
Retrieves a list of all security configurations.

Request

• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of results to return.


• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.

Response

• SecurityConfigurations – An array of SecurityConfiguration (p. 379) objects.

A list of security configurations.


• NextToken – UTF-8 string.

A continuation token, if there are more security configurations to return.

Errors

• EntityNotFoundException


• InvalidInputException
• InternalServiceException
• OperationTimeoutException
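
The following Boto3 sketch pages through all security configurations with the continuation token; the page size is an arbitrary example value.

import boto3

glue = boto3.client('glue')

kwargs = {'MaxResults': 100}
while True:
    page = glue.get_security_configurations(**kwargs)
    for config in page['SecurityConfigurations']:
        print(config['Name'], config.get('CreatedTimeStamp'))
    if not page.get('NextToken'):
        break
    kwargs['NextToken'] = page['NextToken']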

Catalog API
The Catalog API describes the data types and API related to working with catalogs in AWS Glue.

Topics
• Database API (p. 385)
• Table API (p. 389)
• Partition API (p. 403)
• Connection API (p. 414)
• User-Defined Function API (p. 421)
• Importing an Athena Catalog to AWS Glue (p. 426)

Database API
The Database API describes database data types, and includes the API for creating, deleting, locating,
updating, and listing databases.

Data Types
• Database Structure (p. 385)
• DatabaseInput Structure (p. 386)

Database Structure
The Database object represents a logical grouping of tables that may reside in a Hive metastore or an
RDBMS.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the database. For Hive compatibility, this is folded to lowercase when it is stored.
• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the database.


• LocationUri – Uniform resource identifier (uri), not less than 1 or more than 1024 bytes long,
matching the URI address multi-line string pattern (p. 481).

The location of the database (for example, an HDFS path).


• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.


These key-value pairs define parameters and properties of the database.


• CreateTime – Timestamp.

The time at which the metadata database was created in the catalog.

DatabaseInput Structure
The structure used to create or update a database.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the database. For Hive compatibility, this is folded to lowercase when it is stored.
• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the database


• LocationUri – Uniform resource identifier (uri), not less than 1 or more than 1024 bytes long,
matching the URI address multi-line string pattern (p. 481).

The location of the database (for example, an HDFS path).


• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.

These key-value pairs define parameters and properties of the database.

Operations
• CreateDatabase Action (Python: create_database) (p. 386)
• UpdateDatabase Action (Python: update_database) (p. 387)
• DeleteDatabase Action (Python: delete_database) (p. 387)
• GetDatabase Action (Python: get_database) (p. 388)
• GetDatabases Action (Python: get_databases) (p. 389)

CreateDatabase Action (Python: create_database)


Creates a new database in a Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which to create the database. If none is supplied, the AWS account ID is
used by default.
• DatabaseInput – Required: A DatabaseInput (p. 386) object.


A DatabaseInput object defining the metadata database to create in the catalog.

Response

• No Response parameters.

Errors

• InvalidInputException
• AlreadyExistsException
• ResourceNumberLimitExceededException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException
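
A minimal Boto3 sketch of this call follows; the database name, description, S3 location, and parameters are placeholders.

import boto3

glue = boto3.client('glue')

glue.create_database(
    DatabaseInput={
        'Name': 'sales_db',
        'Description': 'Raw and curated sales data',
        'LocationUri': 's3://my-example-bucket/sales/',
        'Parameters': {'owner': 'data-engineering'}
    }
)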

UpdateDatabase Action (Python: update_database)


Updates an existing database definition in a Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the metadata database resides. If none is supplied, the AWS
account ID is used by default.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the database to update in the catalog. For Hive compatibility, this is folded to lowercase.
• DatabaseInput – Required: A DatabaseInput (p. 386) object.

A DatabaseInput object specifying the new definition of the metadata database in the catalog.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

DeleteDatabase Action (Python: delete_database)


Removes a specified Database from a Data Catalog.


Note
After completing this operation, you will no longer have access to the tables (and all table
versions and partitions that might belong to the tables) and the user-defined functions in the
deleted database. AWS Glue deletes these "orphaned" resources asynchronously in a timely
manner, at the discretion of the service.
To ensure immediate deletion of all related resources, before calling DeleteDatabase,
use DeleteTableVersion or BatchDeleteTableVersion, DeletePartition or
BatchDeletePartition, DeleteUserDefinedFunction, and DeleteTable or
BatchDeleteTable, to delete any resources that belong to the database.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the database resides. If none is supplied, the AWS account ID is
used by default.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the Database to delete. For Hive compatibility, this must be all lowercase.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
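
The note above recommends deleting a database's resources before the database itself. The following Boto3 sketch covers the table part of that cleanup for a small database (it assumes fewer than 100 tables and ignores user-defined functions, table versions, and partitions, which would need similar treatment); the database name is a placeholder.

import boto3

glue = boto3.client('glue')
database_name = 'sales_db'  # placeholder

# Delete the tables first so their resources are removed immediately,
# then delete the (now empty) database.
tables = glue.get_tables(DatabaseName=database_name)['TableList']
if tables:
    glue.batch_delete_table(
        DatabaseName=database_name,
        TablesToDelete=[t['Name'] for t in tables]
    )
glue.delete_database(Name=database_name)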

GetDatabase Action (Python: get_database)


Retrieves the definition of a specified database.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the database resides. If none is supplied, the AWS account ID is
used by default.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the database to retrieve. For Hive compatibility, this should be all lowercase.

Response

• Database – A Database (p. 385) object.

The definition of the specified database in the catalog.


Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

GetDatabases Action (Python: get_databases)


Retrieves all Databases defined in a given Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog from which to retrieve Databases. If none is supplied, the AWS account ID
is used by default.
• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of databases to return in one response.

Response

• DatabaseList – Required: An array of Database (p. 385) objects.

A list of Database objects from the specified catalog.


• NextToken – UTF-8 string.

A continuation token for paginating the returned list of databases, returned if the current segment of the list is not the last.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

Table API
The Table API describes data types and operations associated with tables.

Data Types
• Table Structure (p. 390)
• TableInput Structure (p. 391)


• Column Structure (p. 392)


• StorageDescriptor Structure (p. 392)
• SerDeInfo Structure (p. 393)
• Order Structure (p. 394)
• SkewedInfo Structure (p. 394)
• TableVersion Structure (p. 394)
• TableError Structure (p. 395)
• TableVersionError Structure (p. 395)

Table Structure
Represents a collection of related data organized in columns and rows.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the table. For Hive compatibility, this must be entirely lowercase.
• DatabaseName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the metadata database where the table metadata resides. For Hive compatibility, this must be
all lowercase.
• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the table.


• Owner – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Owner of the table.


• CreateTime – Timestamp.

Time when the table definition was created in the Data Catalog.
• UpdateTime – Timestamp.

Last time the table was updated.


• LastAccessTime – Timestamp.

Last time the table was accessed. This is usually taken from HDFS, and may not be reliable.
• LastAnalyzedTime – Timestamp.

Last time column statistics were computed for this table.


• Retention – Number (integer), not more than None.

Retention time for this table.


• StorageDescriptor – A StorageDescriptor (p. 392) object.

A storage descriptor containing information about the physical storage of this table.
• PartitionKeys – An array of Column (p. 392) objects.

A list of columns by which the table is partitioned. Only primitive types are supported as partition
keys.


If you create a table that will be used by Athena and you do not specify any partitionKeys, you must at least set the value of partitionKeys to an empty list. For example:

"PartitionKeys": []
• ViewOriginalText – UTF-8 string, not more than 409600 bytes long.

If the table is a view, the original text of the view; otherwise null.
• ViewExpandedText – UTF-8 string, not more than 409600 bytes long.

If the table is a view, the expanded text of the view; otherwise null.
• TableType – UTF-8 string, not more than 255 bytes long.

The type of this table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.).


• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.

These key-value pairs define properties associated with the table.


• CreatedBy – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Person or entity who created the table.

TableInput Structure
Structure used to create or update the table.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the table. For Hive compatibility, this is folded to lowercase when it is stored.
• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the table.


• Owner – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Owner of the table.


• LastAccessTime – Timestamp.

Last time the table was accessed.


• LastAnalyzedTime – Timestamp.

Last time column statistics were computed for this table.


• Retention – Number (integer), not more than None.

Retention time for this table.


• StorageDescriptor – A StorageDescriptor (p. 392) object.


A storage descriptor containing information about the physical storage of this table.
• PartitionKeys – An array of Column (p. 392) objects.

A list of columns by which the table is partitioned. Only primitive types are supported as partition
keys.

If you create a table that will be used by Athena and you do not specify any partitionKeys, you must at least set the value of partitionKeys to an empty list. For example:

"PartitionKeys": []
• ViewOriginalText – UTF-8 string, not more than 409600 bytes long.

If the table is a view, the original text of the view; otherwise null.
• ViewExpandedText – UTF-8 string, not more than 409600 bytes long.

If the table is a view, the expanded text of the view; otherwise null.
• TableType – UTF-8 string, not more than 255 bytes long.

The type of this table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.).


• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.

These key-value pairs define properties associated with the table.

Column Structure
A column in a Table.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the Column.


• Type – UTF-8 string, not more than 131072 bytes long, matching the Single-line string
pattern (p. 481).

The datatype of data in the Column.


• Comment – Comment string, not more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Free-form text comment.

StorageDescriptor Structure
Describes the physical storage of table data.

Fields

• Columns – An array of Column (p. 392) objects.


A list of the Columns in the table.


• Location – Location string, not more than 2056 bytes long, matching the URI address multi-line
string pattern (p. 481).

The physical location of the table. By default this takes the form of the warehouse location, followed
by the database location in the warehouse, followed by the table name.
• InputFormat – Format string, not more than 128 bytes long, matching the Single-line string
pattern (p. 481).

The input format: SequenceFileInputFormat (binary), or TextInputFormat, or a custom format.


• OutputFormat – Format string, not more than 128 bytes long, matching the Single-line string
pattern (p. 481).

The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat, or a custom format.
• Compressed – Boolean.

True if the data in the table is compressed, or False if not.


• NumberOfBuckets – Number (integer).

Must be specified if the table contains any dimension columns.


• SerdeInfo – A SerDeInfo (p. 393) object.

Serialization/deserialization (SerDe) information.


• BucketColumns – An array of UTF-8 strings.

A list of reducer grouping columns, clustering columns, and bucketing columns in the table.
• SortColumns – An array of Order (p. 394) objects.

A list specifying the sort order of each bucket in the table.


• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.

User-supplied properties in key-value form.


• SkewedInfo – A SkewedInfo (p. 394) object.

Information about values that appear very frequently in a column (skewed values).
• StoredAsSubDirectories – Boolean.

True if the table data is stored in subdirectories, or False if not.

SerDeInfo Structure
Information about a serialization/deserialization program (SerDe) which serves as an extractor and
loader.

Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).


Name of the SerDe.


• SerializationLibrary – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.
• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.

These key-value pairs define initialization parameters for the SerDe.

Order Structure
Specifies the sort order of a sorted column.

Fields

• Column – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The name of the column.


• SortOrder – Required: Number (integer), not more than 1.

Indicates whether the column is sorted in ascending order (== 1) or in descending order (== 0).

SkewedInfo Structure
Specifies skewed values in a table. Skewed values are those that occur with very high frequency.

Fields

• SkewedColumnNames – An array of UTF-8 strings.

A list of names of columns that contain skewed values.


• SkewedColumnValues – An array of UTF-8 strings.

A list of values that appear so frequently as to be considered skewed.


• SkewedColumnValueLocationMaps – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

A mapping of skewed values to the columns that contain them.

TableVersion Structure
Specifies a version of a table.


Fields

• Table – A Table (p. 390) object.

The table in question.


• VersionId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID value that identifies this table version. A VersionId is a string representation of an integer.
Each version is incremented by 1.

TableError Structure
An error record for table operations.

Fields

• TableName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the table. For Hive compatibility, this must be entirely lowercase.
• ErrorDetail – An ErrorDetail (p. 480) object.

Detail about the error.

TableVersionError Structure
An error record for table-version operations.

Fields

• TableName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the table in question.


• VersionId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID value of the version in question. A VersionID is a string representation of an integer. Each
version is incremented by 1.
• ErrorDetail – An ErrorDetail (p. 480) object.

Detail about the error.

Operations
• CreateTable Action (Python: create_table) (p. 396)
• UpdateTable Action (Python: update_table) (p. 396)
• DeleteTable Action (Python: delete_table) (p. 397)
• BatchDeleteTable Action (Python: batch_delete_table) (p. 398)
• GetTable Action (Python: get_table) (p. 399)
• GetTables Action (Python: get_tables) (p. 399)


• GetTableVersion Action (Python: get_table_version) (p. 400)


• GetTableVersions Action (Python: get_table_versions) (p. 401)
• DeleteTableVersion Action (Python: delete_table_version) (p. 402)
• BatchDeleteTableVersion Action (Python: batch_delete_table_version) (p. 402)

CreateTable Action (Python: create_table)


Creates a new table definition in the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which to create the Table. If none is supplied, the AWS account ID is
used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The catalog database in which to create the new table. For Hive compatibility, this name is entirely
lowercase.
• TableInput – Required: A TableInput (p. 391) object.

The TableInput object that defines the metadata table to create in the catalog.

Response

• No Response parameters.

Errors

• AlreadyExistsException
• InvalidInputException
• EntityNotFoundException
• ResourceNumberLimitExceededException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException
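
The following Boto3 sketch registers an external, comma-delimited text table. The database and table names, S3 location, column list, and the Hive input/output format and SerDe class names are illustrative values typical for CSV data, not the only valid choices.

import boto3

glue = boto3.client('glue')

glue.create_table(
    DatabaseName='sales_db',
    TableInput={
        'Name': 'daily_orders',
        'TableType': 'EXTERNAL_TABLE',
        'PartitionKeys': [{'Name': 'year', 'Type': 'string'}],
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'order_id', 'Type': 'string'},
                {'Name': 'amount', 'Type': 'double'}
            ],
            'Location': 's3://my-example-bucket/sales/daily_orders/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {'field.delim': ','}
            }
        }
    }
)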

UpdateTable Action (Python: update_table)


Updates a metadata table in the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the table resides. If none is supplied, the AWS account ID is used by
default.


• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database in which the table resides. For Hive compatibility, this name is
entirely lowercase.
• TableInput – Required: A TableInput (p. 391) object.

An updated TableInput object to define the metadata table in the catalog.


• SkipArchive – Boolean.

By default, UpdateTable always creates an archived version of the table before updating it. If
skipArchive is set to true, however, UpdateTable does not create the archived version.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• ConcurrentModificationException
• ResourceNumberLimitExceededException
• GlueEncryptionException
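
As a sketch of the SkipArchive behavior, the following Boto3 snippet reads the current definition, rebuilds a TableInput from the writable fields (a Table object also contains read-only fields such as CreateTime that TableInput does not accept), and writes it back without archiving the previous version. The database and table names and the added parameter are placeholders.

import boto3

glue = boto3.client('glue')

table = glue.get_table(DatabaseName='sales_db', Name='daily_orders')['Table']
table_input = {
    'Name': table['Name'],
    'TableType': table.get('TableType', 'EXTERNAL_TABLE'),
    'PartitionKeys': table.get('PartitionKeys', []),
    'StorageDescriptor': table['StorageDescriptor'],
    'Parameters': {**table.get('Parameters', {}), 'classification': 'csv'}
}

# SkipArchive=True suppresses the archived version that UpdateTable
# would otherwise create.
glue.update_table(DatabaseName='sales_db', TableInput=table_input, SkipArchive=True)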

DeleteTable Action (Python: delete_table)


Removes a table definition from the Data Catalog.
Note
After completing this operation, you will no longer have access to the table versions and
partitions that belong to the deleted table. AWS Glue deletes these "orphaned" resources
asynchronously in a timely manner, at the discretion of the service.
To ensure immediate deletion of all related resources, before calling DeleteTable, use
DeleteTableVersion or BatchDeleteTableVersion, and DeletePartition or
BatchDeletePartition, to delete any resources that belong to the table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the table resides. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database in which the table resides. For Hive compatibility, this name is
entirely lowercase.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).


The name of the table to be deleted. For Hive compatibility, this name is entirely lowercase.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

BatchDeleteTable Action (Python: batch_delete_table)


Deletes multiple tables at once.
Note
After completing this operation, you will no longer have access to the table versions and
partitions that belong to the deleted table. AWS Glue deletes these "orphaned" resources
asynchronously in a timely manner, at the discretion of the service.
To ensure immediate deletion of all related resources, before calling BatchDeleteTable,
use DeleteTableVersion or BatchDeleteTableVersion, and DeletePartition or
BatchDeletePartition, to delete any resources that belong to the table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the table resides. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the tables to delete reside. For Hive compatibility, this name
is entirely lowercase.
• TablesToDelete – Required: An array of UTF-8 strings, not more than 100 strings.

A list of the tables to delete.

Response

• Errors – An array of TableError (p. 395) objects.

A list of errors encountered in attempting to delete the specified tables.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException


• OperationTimeoutException

GetTable Action (Python: get_table)


Retrieves the Table definition in a Data Catalog for a specified table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the table resides. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the database in the catalog in which the table resides. For Hive compatibility, this name is
entirely lowercase.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the table for which to retrieve the definition. For Hive compatibility, this name is entirely
lowercase.

Response

• Table – A Table (p. 390) object.

The Table object that defines the specified table.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

GetTables Action (Python: get_tables)


Retrieves the definitions of some or all of the tables in a given Database.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the tables reside. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The database in the catalog whose tables to list. For Hive compatibility, this name is entirely lowercase.


• Expression – UTF-8 string, not more than 2048 bytes long, matching the Single-line string
pattern (p. 481).

A regular expression pattern. If present, only those tables whose names match the pattern are
returned.
• NextToken – UTF-8 string.

A continuation token, included if this is a continuation call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of tables to return in a single response.

Response

• TableList – An array of Table (p. 390) objects.

A list of the requested Table objects.


• NextToken – UTF-8 string.

A continuation token, present if the current list segment is not the last.

Errors

• EntityNotFoundException
• InvalidInputException
• OperationTimeoutException
• InternalServiceException
• GlueEncryptionException
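
The following Boto3 sketch lists the tables in a database whose names match a regular expression, paging with the continuation token; the database name and pattern are placeholders.

import boto3

glue = boto3.client('glue')

kwargs = {'DatabaseName': 'sales_db', 'Expression': 'daily_.*', 'MaxResults': 100}
while True:
    page = glue.get_tables(**kwargs)
    for table in page['TableList']:
        print(table['Name'])
    if not page.get('NextToken'):
        break
    kwargs['NextToken'] = page['NextToken']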

GetTableVersion Action (Python: get_table_version)


Retrieves a specified version of a table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the tables reside. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The database in the catalog in which the table resides. For Hive compatibility, this name is entirely
lowercase.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table. For Hive compatibility, this name is entirely lowercase.
• VersionId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID value of the table version to be retrieved. A VersionID is a string representation of an integer.
Each version is incremented by 1.


Response

• TableVersion – A TableVersion (p. 394) object.

The requested table version.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

GetTableVersions Action (Python: get_table_versions)


Retrieves a list of strings that identify available versions of a specified table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the tables reside. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The database in the catalog in which the table resides. For Hive compatibility, this name is entirely
lowercase.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table. For Hive compatibility, this name is entirely lowercase.
• NextToken – UTF-8 string.

A continuation token, if this is not the first call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of table versions to return in one response.

Response

• TableVersions – An array of TableVersion (p. 394) objects.

A list of strings identifying available versions of the specified table.


• NextToken – UTF-8 string.

A continuation token, if the list of available versions does not include the last one.

Errors

• EntityNotFoundException


• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

DeleteTableVersion Action (Python: delete_table_version)


Deletes a specified version of a table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the tables reside. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The database in the catalog in which the table resides. For Hive compatibility, this name is entirely
lowercase.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table. For Hive compatibility, this name is entirely lowercase.
• VersionId – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The ID of the table version to be deleted. A VersionID is a string representation of an integer. Each
version is incremented by 1.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

BatchDeleteTableVersion Action (Python: batch_delete_table_version)
Deletes a specified batch of versions of a table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).


The ID of the Data Catalog where the tables reside. If none is supplied, the AWS account ID is used by
default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The database in the catalog in which the table resides. For Hive compatibility, this name is entirely
lowercase.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table. For Hive compatibility, this name is entirely lowercase.
• VersionIds – Required: An array of UTF-8 strings, not more than 100 strings.

A list of the IDs of versions to be deleted. A VersionId is a string representation of an integer. Each
version is incremented by 1.

Response

• Errors – An array of TableVersionError (p. 395) objects.

A list of errors encountered while trying to delete the specified table versions.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

Partition API
The Partition API describes data types and operations used to work with partitions.

Data Types
• Partition Structure (p. 403)
• PartitionInput Structure (p. 404)
• PartitionSpecWithSharedStorageDescriptor Structure (p. 405)
• PartitionListComposingSpec Structure (p. 405)
• PartitionSpecProxy Structure (p. 405)
• PartitionValueList Structure (p. 406)
• Segment Structure (p. 406)
• PartitionError Structure (p. 406)

Partition Structure
Represents a slice of table data.


Fields

• Values – An array of UTF-8 strings.

The values of the partition.


• DatabaseName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the catalog database where the table in question is located.
• TableName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the table in question.


• CreationTime – Timestamp.

The time at which the partition was created.


• LastAccessTime – Timestamp.

The last time at which the partition was accessed.


• StorageDescriptor – A StorageDescriptor (p. 392) object.

Provides information about the physical location where the partition is stored.
• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.

These key-value pairs define partition parameters.


• LastAnalyzedTime – Timestamp.

The last time at which column statistics were computed for this partition.

PartitionInput Structure
The structure used to create and update a partition.

Fields

• Values – An array of UTF-8 strings.

The values of the partition. Although this parameter is not required by the SDK, you must specify this
parameter for a valid input.
• LastAccessTime – Timestamp.

The last time at which the partition was accessed.


• StorageDescriptor – A StorageDescriptor (p. 392) object.

Provides information about the physical location where the partition is stored.
• Parameters – A map array of key-value pairs.

Each key is a Key string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is a UTF-8 string, not more than 512000 bytes long.


These key-value pairs define partition parameters.


• LastAnalyzedTime – Timestamp.

The last time at which column statistics were computed for this partition.

PartitionSpecWithSharedStorageDescriptor Structure
A partition specification for partitions that share a physical location.

Fields

• StorageDescriptor – A StorageDescriptor (p. 392) object.

The shared physical storage information.


• Partitions – An array of Partition (p. 403) objects.

A list of the partitions that share this physical location.

PartitionListComposingSpec Structure
Lists related partitions.

Fields

• Partitions – An array of Partition (p. 403) objects.

A list of the partitions in the composing specification.

PartitionSpecProxy Structure
Provides a root path to specified partitions.

Fields

• DatabaseName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The catalog database in which the partitions reside.


• TableName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the table containing the partitions.


• RootPath – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The root path of the proxy for addressing the partitions.


• PartitionSpecWithSharedSD – A PartitionSpecWithSharedStorageDescriptor (p. 405) object.

A specification of partitions that share the same physical storage location.


• PartitionListComposingSpec – A PartitionListComposingSpec (p. 405) object.

Specifies a list of partitions.


PartitionValueList Structure
Contains a list of values defining partitions.

Fields

• Values – Required: An array of UTF-8 strings.

The list of values.

Segment Structure
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be executed in
parallel.

Fields

• SegmentNumber – Required: Number (integer), not more than None.

The zero-based index number of this segment. For example, if the total number of segments is 4,
SegmentNumber values will range from zero through three.
• TotalSegments – Required: Number (integer), not less than 1 or more than 10.

The total number of segments.

PartitionError Structure
Contains information about a partition error.

Fields

• PartitionValues – An array of UTF-8 strings.

The values that define the partition.


• ErrorDetail – An ErrorDetail (p. 480) object.

Details about the partition error.

Operations
• CreatePartition Action (Python: create_partition) (p. 406)
• BatchCreatePartition Action (Python: batch_create_partition) (p. 407)
• UpdatePartition Action (Python: update_partition) (p. 408)
• DeletePartition Action (Python: delete_partition) (p. 409)
• BatchDeletePartition Action (Python: batch_delete_partition) (p. 409)
• GetPartition Action (Python: get_partition) (p. 410)
• GetPartitions Action (Python: get_partitions) (p. 411)
• BatchGetPartition Action (Python: batch_get_partition) (p. 414)

CreatePartition Action (Python: create_partition)


Creates a new partition.


Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the catalog in which the partition is to be created. Currently, this should be the AWS account
ID.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the metadata database in which the partition is to be created.


• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the metadata table in which the partition is to be created.


• PartitionInput – Required: A PartitionInput (p. 404) object.

A PartitionInput structure defining the partition to be created.

Response

• No Response parameters.

Errors

• InvalidInputException
• AlreadyExistsException
• ResourceNumberLimitExceededException
• InternalServiceException
• EntityNotFoundException
• OperationTimeoutException
• GlueEncryptionException

BatchCreatePartition Action (Python: batch_create_partition)


Creates one or more partitions in a batch operation.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the catalog in which the partition is to be created. Currently, this should be the AWS account
ID.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the metadata database in which the partition is to be created.


• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the metadata table in which the partition is to be created.


• PartitionInputList – Required: An array of PartitionInput (p. 404) objects, not more than 100
structures.

A list of PartitionInput structures that define the partitions to be created.

Response

• Errors – An array of PartitionError (p. 406) objects.

Errors encountered when trying to create the requested partitions.

Errors

• InvalidInputException
• AlreadyExistsException
• ResourceNumberLimitExceededException
• InternalServiceException
• EntityNotFoundException
• OperationTimeoutException
• GlueEncryptionException
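
The following Boto3 sketch registers two partitions of a year-partitioned table in one call and reports any per-partition errors. The S3 layout and the format and SerDe class names mirror the earlier table sketch and are assumptions.

import boto3

glue = boto3.client('glue')

base = 's3://my-example-bucket/sales/daily_orders'
response = glue.batch_create_partition(
    DatabaseName='sales_db',
    TableName='daily_orders',
    PartitionInputList=[
        {
            'Values': [year],
            'StorageDescriptor': {
                'Location': '{}/year={}/'.format(base, year),
                'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
                'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
                'SerdeInfo': {
                    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
                }
            }
        }
        for year in ['2018', '2019']
    ]
)
for error in response.get('Errors', []):
    print(error['PartitionValues'], error['ErrorDetail'])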

UpdatePartition Action (Python: update_partition)


Updates a partition.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the partition to be updated resides. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database in which the table in question resides.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table where the partition to be updated is located.


• PartitionValueList – Required: An array of UTF-8 strings, not more than 100 strings.

A list of the values defining the partition.


• PartitionInput – Required: A PartitionInput (p. 404) object.

The new partition object to which to update the partition.

Response

• No Response parameters.


Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

DeletePartition Action (Python: delete_partition)


Deletes a specified partition.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the partition to be deleted resides. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database in which the table in question resides.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table where the partition to be deleted is located.


• PartitionValues – Required: An array of UTF-8 strings.

The values that define the partition.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

BatchDeletePartition Action (Python: batch_delete_partition)


Deletes one or more partitions in a batch operation.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the partition to be deleted resides. If none is supplied, the AWS
account ID is used by default.


• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database in which the table in question resides.
• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table where the partitions to be deleted are located.


• PartitionsToDelete – Required: An array of PartitionValueList (p. 406) objects, not more than 25
structures.

A list of PartitionValueList structures that define the partitions to be deleted.

Response

• Errors – An array of PartitionError (p. 406) objects.

Errors encountered when trying to delete the requested partitions.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
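
A minimal Boto3 sketch of this call follows; the database, table, and partition values are placeholders.

import boto3

glue = boto3.client('glue')

response = glue.batch_delete_partition(
    DatabaseName='sales_db',
    TableName='daily_orders',
    PartitionsToDelete=[
        {'Values': ['2015']},
        {'Values': ['2016']}
    ]
)
for error in response.get('Errors', []):
    print(error['PartitionValues'], error['ErrorDetail'])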

GetPartition Action (Python: get_partition)


Retrieves information about a specified partition.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the partition in question resides. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the partition resides.


• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the partition's table.


• PartitionValues – Required: An array of UTF-8 strings.

The values that define the partition.

Response

• Partition – A Partition (p. 403) object.

The requested information, in the form of a Partition object.


Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

GetPartitions Action (Python: get_partitions)


Retrieves information about the partitions in a table.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the partitions in question reside. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the partitions reside.


• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the partitions' table.


• Expression – Predicate string, not more than 2048 bytes long, matching the URI address multi-line
string pattern (p. 481).

An expression filtering the partitions to be returned.

The expression uses SQL syntax similar to the SQL WHERE filter clause. The SQL statement parser
JSQLParser parses the expression.

Operators: The following are the operators that you can use in the Expression API call:
=

Checks if the values of the two operands are equal or not; if yes, then the condition becomes true.

Example: Assume 'variable a' holds 10 and 'variable b' holds 20.

(a = b) is not true.
<>

Checks if the values of two operands are equal or not; if the values are not equal, then the
condition becomes true.

Example: (a <> b) is true.


>

Checks if the value of the left operand is greater than the value of the right operand; if yes, then
the condition becomes true.

Example: (a > b) is not true.



<

Checks if the value of the left operand is less than the value of the right operand; if yes, then the
condition becomes true.

Example: (a < b) is true.


>=

Checks if the value of the left operand is greater than or equal to the value of the right operand; if
yes, then the condition becomes true.

Example: (a >= b) is not true.


<=

Checks if the value of the left operand is less than or equal to the value of the right operand; if
yes, then the condition becomes true.

Example: (a <= b) is true.


AND, OR, IN, BETWEEN, LIKE, NOT, IS NULL

Logical operators.

Supported Partition Key Types: The following are the supported partition key types.
• string
• date
• timestamp
• int
• bigint
• long
• tinyint
• smallint
• decimal

If an invalid type is encountered, an exception is thrown.

When you define a crawler, the partitionKey type is created as a STRING, to be compatible with the catalog partitions.

Sample API Call:

Example

The table twitter_partition has three partitions:

year = 2015
year = 2016
year = 2017

Example

Get partitions where year equals 2015:

aws glue get-partitions --database-name dbname --table-name twitter_partition --expression "year='2015'"

Example

Get partitions where year is between 2016 and 2018 (exclusive):

aws glue get-partitions --database-name dbname --table-name twitter_partition --expression "year>'2016' AND year<'2018'"

Example

Get partitions where year is between 2015 and 2018 (inclusive). The following API calls are equivalent to each other:

aws glue get-partitions --database-name dbname --table-name twitter_partition --expression "year>='2015' AND year<='2018'"

aws glue get-partitions --database-name dbname --table-name twitter_partition --expression "year BETWEEN 2015 AND 2018"

aws glue get-partitions --database-name dbname --table-name twitter_partition --expression "year IN (2015,2016,2017,2018)"

Example

A wildcard partition filter. The following call returns only the partition year=2017. Regular expressions are not supported in LIKE.

aws glue get-partitions --database-name dbname --table-name twitter_partition --expression "year LIKE '%7'"

• NextToken – UTF-8 string.

A continuation token, if this is not the first call to retrieve these partitions.
• Segment – A Segment (p. 406) object.

The segment of the table's partitions to scan in this request.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of partitions to return in a single response.

Response

• Partitions – An array of Partition (p. 403) objects.

A list of requested partitions.


• NextToken – UTF-8 string.

A continuation token, if the returned list of partitions does not include the last one.

Errors

• EntityNotFoundException
• InvalidInputException
• OperationTimeoutException
• InternalServiceException


• GlueEncryptionException
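
Example

The following is a minimal Boto3 sketch that filters partitions with an Expression and pages through the results using NextToken; the names, the expression, and the page size are placeholders.

import boto3

glue = boto3.client('glue')

# Page through all partitions for 2015-2018, 100 at a time.
kwargs = {
    'DatabaseName': 'dbname',
    'TableName': 'twitter_partition',
    'Expression': "year BETWEEN 2015 AND 2018",
    'MaxResults': 100,
}
while True:
    response = glue.get_partitions(**kwargs)
    for partition in response['Partitions']:
        print(partition['Values'])
    if 'NextToken' not in response:
        break
    kwargs['NextToken'] = response['NextToken']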

BatchGetPartition Action (Python: batch_get_partition)


Retrieves partitions in a batch request.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the partitions in question reside. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the partitions reside.


• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the partitions' table.


• PartitionsToGet – Required: An array of PartitionValueList (p. 406) objects, not more than 1000
structures.

A list of partition values identifying the partitions to retrieve.

Response

• Partitions – An array of Partition (p. 403) objects.

A list of the requested partitions.


• UnprocessedKeys – An array of PartitionValueList (p. 406) objects, not more than 1000 structures.

A list of the partition values in the request for which partitions were not returned.

Errors

• InvalidInputException
• EntityNotFoundException
• OperationTimeoutException
• InternalServiceException
• GlueEncryptionException
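
Example

A minimal Boto3 sketch; each entry in PartitionsToGet is a PartitionValueList, and the names and values shown are placeholders.

import boto3

glue = boto3.client('glue')

# Fetch two partitions in one request.
response = glue.batch_get_partition(
    DatabaseName='dbname',
    TableName='twitter_partition',
    PartitionsToGet=[{'Values': ['2016']}, {'Values': ['2017']}]
)
print(len(response['Partitions']), 'partitions returned')

# Values that did not match an existing partition come back in UnprocessedKeys.
print(response.get('UnprocessedKeys', []))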

Connection API
The Connection API describes AWS Glue connection data types, and the API for creating, deleting,
updating, and listing connections.

Data Types
• Connection Structure (p. 415)


• ConnectionInput Structure (p. 416)


• PhysicalConnectionRequirements Structure (p. 416)
• GetConnectionsFilter Structure (p. 417)

Connection Structure
Defines a connection to a data source.

Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the connection definition.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

The description of the connection.


• ConnectionType – UTF-8 string (valid values: JDBC | SFTP).

The type of the connection. Currently, only JDBC is supported; SFTP is not supported.
• MatchCriteria – An array of UTF-8 strings, not more than 10 strings.

A list of criteria that can be used in selecting this connection.


• ConnectionProperties – A map array of key-value pairs, not more than 100 pairs.

Each key is a UTF-8 string (valid values: HOST | PORT | USERNAME="USER_NAME" | PASSWORD |
ENCRYPTED_PASSWORD | JDBC_DRIVER_JAR_URI | JDBC_DRIVER_CLASS_NAME | JDBC_ENGINE
| JDBC_ENGINE_VERSION | CONFIG_FILES | INSTANCE_ID | JDBC_CONNECTION_URL |
JDBC_ENFORCE_SSL).

Each value is a Value string, not more than 1024 bytes long.

These key-value pairs define parameters for the connection:


• HOST - The host URI: either the fully qualified domain name (FQDN) or the IPv4 address of the
database host.
• PORT - The port number, between 1024 and 65535, of the port on which the database host is
listening for database connections.
• USER_NAME - The name under which to log in to the database. The value string for USER_NAME is
"USERNAME".
• PASSWORD - A password, if one is used, for the user name.
• ENCRYPTED_PASSWORD - When you enable connection password protection by setting
ConnectionPasswordEncryption in the Data Catalog encryption settings, this field stores the
encrypted password.
• JDBC_DRIVER_JAR_URI - The Amazon S3 path of the JAR file that contains the JDBC driver to use.
• JDBC_DRIVER_CLASS_NAME - The class name of the JDBC driver to use.
• JDBC_ENGINE - The name of the JDBC engine to use.
• JDBC_ENGINE_VERSION - The version of the JDBC engine to use.
• CONFIG_FILES - (Reserved for future use).
• INSTANCE_ID - The instance ID to use.
• JDBC_CONNECTION_URL - The URL for the JDBC connection.
• JDBC_ENFORCE_SSL - A Boolean string (true, false) specifying whether Secure Sockets Layer (SSL)
with hostname matching will be enforced for the JDBC connection on the client. The default is false.


• PhysicalConnectionRequirements – A PhysicalConnectionRequirements (p. 416) object.

A map of physical connection requirements, such as virtual private cloud (VPC) and SecurityGroup,
that are needed to make this connection successfully.
• CreationTime – Timestamp.

The time that this connection definition was created.


• LastUpdatedTime – Timestamp.

The last time that this connection definition was updated.


• LastUpdatedBy – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The user, group, or role that last updated this connection definition.

ConnectionInput Structure
A structure that is used to specify a connection to create or update.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the connection.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

The description of the connection.


• ConnectionType – Required: UTF-8 string (valid values: JDBC | SFTP).

The type of the connection. Currently, only JDBC is supported; SFTP is not supported.
• MatchCriteria – An array of UTF-8 strings, not more than 10 strings.

A list of criteria that can be used in selecting this connection.


• ConnectionProperties – Required: A map array of key-value pairs, not more than 100 pairs.

Each key is a UTF-8 string (valid values: HOST | PORT | USERNAME="USER_NAME" | PASSWORD |
ENCRYPTED_PASSWORD | JDBC_DRIVER_JAR_URI | JDBC_DRIVER_CLASS_NAME | JDBC_ENGINE
| JDBC_ENGINE_VERSION | CONFIG_FILES | INSTANCE_ID | JDBC_CONNECTION_URL |
JDBC_ENFORCE_SSL).

Each value is a Value string, not more than 1024 bytes long.

These key-value pairs define parameters for the connection.


• PhysicalConnectionRequirements – A PhysicalConnectionRequirements (p. 416) object.

A map of physical connection requirements, such as virtual private cloud (VPC) and SecurityGroup,
that are needed to successfully make this connection.

PhysicalConnectionRequirements Structure
Specifies the physical requirements for a connection.


Fields

• SubnetId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The subnet ID used by the connection.


• SecurityGroupIdList – An array of UTF-8 strings, not more than 50 strings.

The security group ID list used by the connection.


• AvailabilityZone – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The connection's Availability Zone. This field is redundant because the specified subnet implies the
Availability Zone to be used. Currently the field must be populated, but it will be deprecated in the
future.

GetConnectionsFilter Structure
Filters the connection definitions that are returned by the GetConnections API operation.

Fields

• MatchCriteria – An array of UTF-8 strings, not more than 10 strings.

A criteria string that must match the criteria recorded in the connection definition for that connection
definition to be returned.
• ConnectionType – UTF-8 string (valid values: JDBC | SFTP).

The type of connections to return. Currently, only JDBC is supported; SFTP is not supported.

Operations
• CreateConnection Action (Python: create_connection) (p. 417)
• DeleteConnection Action (Python: delete_connection) (p. 418)
• GetConnection Action (Python: get_connection) (p. 418)
• GetConnections Action (Python: get_connections) (p. 419)
• UpdateConnection Action (Python: update_connection) (p. 420)
• BatchDeleteConnection Action (Python: batch_delete_connection) (p. 420)

CreateConnection Action (Python: create_connection)


Creates a connection definition in the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which to create the connection. If none is provided, the AWS account ID
is used by default.
• ConnectionInput – Required: A ConnectionInput (p. 416) object.

A ConnectionInput object defining the connection to create.


Response

• No Response parameters.

Errors

• AlreadyExistsException
• InvalidInputException
• OperationTimeoutException
• ResourceNumberLimitExceededException
• GlueEncryptionException
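
Example

A minimal Boto3 sketch of creating a JDBC connection; the endpoint, credentials, subnet, security group, and Availability Zone are placeholders, and the property keys follow the ConnectionProperties list shown earlier.

import boto3

glue = boto3.client('glue')

# Create a JDBC connection definition in the Data Catalog.
glue.create_connection(
    ConnectionInput={
        'Name': 'my-jdbc-connection',
        'Description': 'Connection to an example PostgreSQL database',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:postgresql://dbhost.example.com:5432/mydb',
            'USERNAME': 'glue_user',
            'PASSWORD': 'example-password',
            'JDBC_ENFORCE_SSL': 'false'
        },
        'PhysicalConnectionRequirements': {
            'SubnetId': 'subnet-0123456789abcdef0',
            'SecurityGroupIdList': ['sg-0123456789abcdef0'],
            'AvailabilityZone': 'us-east-1a'
        }
    }
)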

DeleteConnection Action (Python: delete_connection)


Deletes a connection from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the connection resides. If none is provided, the AWS account ID is
used by default.
• ConnectionName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the connection to delete.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• OperationTimeoutException

GetConnection Action (Python: get_connection)


Retrieves a connection definition from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the connection resides. If none is provided, the AWS account ID is
used by default.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the connection definition to retrieve.


• HidePassword – Boolean.

Allows you to retrieve the connection metadata without returning the password. For instance, the
AWS Glue console uses this flag to retrieve the connection, and does not display the password. Set
this parameter when the caller might not have permission to use the AWS KMS key to decrypt the
password, but does have permission to access the rest of the connection properties.

Response

• Connection – A Connection (p. 415) object.

The requested connection definition.

Errors

• EntityNotFoundException
• OperationTimeoutException
• InvalidInputException
• GlueEncryptionException
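
Example

A minimal Boto3 sketch that retrieves a connection without its password, as the console does; the connection name is a placeholder.

import boto3

glue = boto3.client('glue')

# Retrieve the connection metadata with the password withheld.
response = glue.get_connection(
    Name='my-jdbc-connection',
    HidePassword=True
)
connection = response['Connection']
print(connection['ConnectionType'])
print(connection['ConnectionProperties'].get('JDBC_CONNECTION_URL'))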

GetConnections Action (Python: get_connections)


Retrieves a list of connection definitions from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the connections reside. If none is provided, the AWS account ID is
used by default.
• Filter – A GetConnectionsFilter (p. 417) object.

A filter that controls which connections will be returned.


• HidePassword – Boolean.

Allows you to retrieve the connection metadata without returning the password. For instance, the
AWS Glue console uses this flag to retrieve the connection, and does not display the password. Set
this parameter when the caller might not have permission to use the AWS KMS key to decrypt the
password, but does have permission to access the rest of the connection properties.
• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of connections to return in one response.

Response

• ConnectionList – An array of Connection (p. 415) objects.

A list of requested connection definitions.


• NextToken – UTF-8 string.


A continuation token, if the list of connections returned does not include the last of the filtered
connections.

Errors

• EntityNotFoundException
• OperationTimeoutException
• InvalidInputException
• GlueEncryptionException

UpdateConnection Action (Python: update_connection)


Updates a connection definition in the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog in which the connection resides. If none is provided, the AWS account ID is
used by default.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the connection definition to update.


• ConnectionInput – Required: A ConnectionInput (p. 416) object.

A ConnectionInput object that redefines the connection in question.

Response

• No Response parameters.

Errors

• InvalidInputException
• EntityNotFoundException
• OperationTimeoutException
• GlueEncryptionException

BatchDeleteConnection Action (Python: batch_delete_connection)

Deletes a list of connection definitions from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).


The ID of the Data Catalog in which the connections reside. If none is provided, the AWS account ID is
used by default.
• ConnectionNameList – Required: An array of UTF-8 strings, not more than 25 strings.

A list of names of the connections to delete.

Response

• Succeeded – An array of UTF-8 strings.

A list of names of the connection definitions that were successfully deleted.


• Errors – A map array of key-value pairs.

Each key is a UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Each value is an ErrorDetail (p. 480) object.

A map of the names of connections that were not successfully deleted to error details.

Errors

• InternalServiceException
• OperationTimeoutException
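
Example

A minimal Boto3 sketch; the connection names are placeholders.

import boto3

glue = boto3.client('glue')

# Delete several connections in one call.
response = glue.batch_delete_connection(
    ConnectionNameList=['old-connection-1', 'old-connection-2']
)
print('Deleted:', response.get('Succeeded', []))

# Errors maps each connection that could not be deleted to an ErrorDetail.
for name, error in response.get('Errors', {}).items():
    print(name, error.get('ErrorCode'), error.get('ErrorMessage'))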

User-Defined Function API


The User-Defined Function API describes AWS Glue data types and operations used in working with
functions.

Data Types
• UserDefinedFunction Structure (p. 421)
• UserDefinedFunctionInput Structure (p. 422)

UserDefinedFunction Structure
Represents the equivalent of a Hive user-defined function (UDF) definition.

Fields

• FunctionName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the function.


• ClassName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The Java class that contains the function code.


• OwnerName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The owner of the function.


• OwnerType – UTF-8 string (valid values: USER | ROLE | GROUP).

The owner type.


• CreateTime – Timestamp.

The time at which the function was created.


• ResourceUris – An array of ResourceUri (p. 480) objects, not more than 1000 structures.

The resource URIs for the function.

UserDefinedFunctionInput Structure
A structure used to create or update a user-defined function.

Fields

• FunctionName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the function.


• ClassName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The Java class that contains the function code.


• OwnerName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The owner of the function.


• OwnerType – UTF-8 string (valid values: USER | ROLE | GROUP).

The owner type.


• ResourceUris – An array of ResourceUri (p. 480) objects, not more than 1000 structures.

The resource URIs for the function.

Operations
• CreateUserDefinedFunction Action (Python: create_user_defined_function) (p. 422)
• UpdateUserDefinedFunction Action (Python: update_user_defined_function) (p. 423)
• DeleteUserDefinedFunction Action (Python: delete_user_defined_function) (p. 424)
• GetUserDefinedFunction Action (Python: get_user_defined_function) (p. 424)
• GetUserDefinedFunctions Action (Python: get_user_defined_functions) (p. 425)

CreateUserDefinedFunction Action (Python: create_user_defined_function)

Creates a new function definition in the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).


The ID of the Data Catalog in which to create the function. If none is supplied, the AWS account ID is
used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database in which to create the function.


• FunctionInput – Required: An UserDefinedFunctionInput (p. 422) object.

A FunctionInput object that defines the function to create in the Data Catalog.

Response

• No Response parameters.

Errors

• AlreadyExistsException
• InvalidInputException
• InternalServiceException
• EntityNotFoundException
• OperationTimeoutException
• ResourceNumberLimitExceededException
• GlueEncryptionException
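
Example

A minimal Boto3 sketch; the database name, Java class name, and JAR location are placeholders.

import boto3

glue = boto3.client('glue')

# Register a Hive UDF definition in the Data Catalog.
glue.create_user_defined_function(
    DatabaseName='dbname',
    FunctionInput={
        'FunctionName': 'my_udf',
        'ClassName': 'com.example.hive.udf.MyUdf',
        'OwnerName': 'data-team',
        'OwnerType': 'GROUP',
        'ResourceUris': [
            {'ResourceType': 'JAR', 'Uri': 's3://my-bucket/jars/my-udf.jar'}
        ]
    }
)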

UpdateUserDefinedFunction Action (Python: update_user_defined_function)

Updates an existing function definition in the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the function to be updated is located. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the function to be updated is located.
• FunctionName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the function.


• FunctionInput – Required: An UserDefinedFunctionInput (p. 422) object.

A FunctionInput object that re-defines the function in the Data Catalog.

Response

• No Response parameters.


Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

DeleteUserDefinedFunction Action (Python: delete_user_defined_function)

Deletes an existing function definition from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the function to be deleted is located. If none is supplied, the AWS
account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the function is located.


• FunctionName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the function definition to be deleted.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

GetUserDefinedFunction Action (Python: get_user_defined_function)

Retrieves a specified function definition from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the function to be retrieved is located. If none is supplied, the AWS
account ID is used by default.


• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the function is located.


• FunctionName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the function.

Response

• UserDefinedFunction – An UserDefinedFunction (p. 421) object.

The requested function definition.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• GlueEncryptionException

GetUserDefinedFunctions Action (Python: get_user_defined_functions)

Retrieves multiple function definitions from the Data Catalog.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the Data Catalog where the functions to be retrieved are located. If none is supplied, the
AWS account ID is used by default.
• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the catalog database where the functions are located.
• Pattern – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

An optional function-name pattern string that filters the function definitions returned.
• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum number of functions to return in one response.

Response

• UserDefinedFunctions – An array of UserDefinedFunction (p. 421) objects.


A list of requested function definitions.


• NextToken – UTF-8 string.

A continuation token, if the list of functions returned does not include the last requested function.

Errors

• EntityNotFoundException
• InvalidInputException
• OperationTimeoutException
• InternalServiceException
• GlueEncryptionException
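
Example

A minimal Boto3 sketch that pages through the functions in a database; the database name and the name pattern are placeholders.

import boto3

glue = boto3.client('glue')

# List every UDF in the database whose name matches the pattern.
kwargs = {'DatabaseName': 'dbname', 'Pattern': 'my_*', 'MaxResults': 100}
while True:
    response = glue.get_user_defined_functions(**kwargs)
    for function in response['UserDefinedFunctions']:
        print(function['FunctionName'], function.get('ClassName'))
    if 'NextToken' not in response:
        break
    kwargs['NextToken'] = response['NextToken']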

Importing an Athena Catalog to AWS Glue


The Migration API describes AWS Glue data types and operations having to do with migrating an Athena
Data Catalog to AWS Glue.

Data Types
• CatalogImportStatus Structure (p. 426)

CatalogImportStatus Structure
A structure containing migration status information.

Fields

• ImportCompleted – Boolean.

True if the migration has completed, or False otherwise.


• ImportTime – Timestamp.

The time that the migration was started.


• ImportedBy – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the person who initiated the migration.

Operations
• ImportCatalogToGlue Action (Python: import_catalog_to_glue) (p. 426)
• GetCatalogImportStatus Action (Python: get_catalog_import_status) (p. 427)

ImportCatalogToGlue Action (Python: import_catalog_to_glue)


Imports an existing Athena Data Catalog to AWS Glue.


Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the catalog to import. Currently, this should be the AWS account ID.

Response

• No Response parameters.

Errors

• InternalServiceException
• OperationTimeoutException

GetCatalogImportStatus Action (Python: get_catalog_import_status)

Retrieves the status of a migration operation.

Request

• CatalogId – Catalog id string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the catalog to migrate. Currently, this should be the AWS account ID.

Response

• ImportStatus – A CatalogImportStatus (p. 426) object.

The status of the specified catalog migration.

Errors

• InternalServiceException
• OperationTimeoutException
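
Example

A minimal Boto3 sketch that starts the catalog migration and polls its status; the polling interval is arbitrary.

import boto3
import time

glue = boto3.client('glue')

# Start the one-time migration of the Athena Data Catalog.
glue.import_catalog_to_glue()

# Poll until the migration reports completion.
while True:
    status = glue.get_catalog_import_status()['ImportStatus']
    if status.get('ImportCompleted'):
        print('Import completed; started at', status.get('ImportTime'))
        break
    time.sleep(30)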

Crawlers and Classifiers API


The Crawlers and Classifiers API describes the AWS Glue crawler and classifier data types, and includes the
API for creating, deleting, updating, and listing crawlers and classifiers.

Topics
• Classifier API (p. 428)
• Crawler API (p. 435)
• Crawler Scheduler API (p. 444)


Classifier API
The Classifier API describes AWS Glue classifier data types, and includes the API for creating, deleting,
updating, and listing classifiers.

Data Types
• Classifier Structure (p. 428)
• GrokClassifier Structure (p. 428)
• XMLClassifier Structure (p. 429)
• JsonClassifier Structure (p. 429)
• CreateGrokClassifierRequest Structure (p. 430)
• UpdateGrokClassifierRequest Structure (p. 430)
• CreateXMLClassifierRequest Structure (p. 431)
• UpdateXMLClassifierRequest Structure (p. 431)
• CreateJsonClassifierRequest Structure (p. 431)
• UpdateJsonClassifierRequest Structure (p. 432)

Classifier Structure
Classifiers are triggered during a crawl task. A classifier checks whether a given file is in a format it can
handle, and if it is, the classifier creates a schema in the form of a StructType object that matches that
data format.

You can use the standard classifiers that AWS Glue supplies, or you can write your own classifiers to best
categorize your data sources and specify the appropriate schemas to use for them. A classifier can be a
grok classifier, an XML classifier, or a JSON classifier, as specified in one of the fields in the Classifier
object.

Fields

• GrokClassifier – A GrokClassifier (p. 428) object.

A GrokClassifier object.
• XMLClassifier – A XMLClassifier (p. 429) object.

An XMLClassifier object.
• JsonClassifier – A JsonClassifier (p. 429) object.

A JsonClassifier object.

GrokClassifier Structure
A classifier that uses grok patterns.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• Classification – Required: UTF-8 string.


An identifier of the data format that the classifier matches, such as Twitter, JSON, Omniture logs, and
so on.
• CreationTime – Timestamp.

The time this classifier was registered.


• LastUpdated – Timestamp.

The time this classifier was last updated.


• Version – Number (long).

The version of this classifier.


• GrokPattern – Required: UTF-8 string, not less than 1 or more than 2048 bytes long, matching the A
Logstash Grok string pattern (p. 481).

The grok pattern applied to a data store by this classifier. For more information, see built-in patterns in
Writing Custom Classifiers.
• CustomPatterns – UTF-8 string, not more than 16000 bytes long, matching the URI address multi-
line string pattern (p. 481).

Optional custom grok patterns defined by this classifier. For more information, see custom patterns in
Writing Custom Classifiers.

XMLClassifier Structure
A classifier for XML content.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• Classification – Required: UTF-8 string.

An identifier of the data format that the classifier matches.


• CreationTime – Timestamp.

The time this classifier was registered.


• LastUpdated – Timestamp.

The time this classifier was last updated.


• Version – Number (long).

The version of this classifier.


• RowTag – UTF-8 string.

The XML tag designating the element that contains each record in an XML document being parsed.
Note that this cannot identify a self-closing element (closed by />). An empty row element that
contains only attributes can be parsed as long as it ends with a closing tag (for example, <row
item_a="A" item_b="B"></row> is okay, but <row item_a="A" item_b="B" /> is not).

JsonClassifier Structure
A classifier for JSON content.


Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• CreationTime – Timestamp.

The time this classifier was registered.


• LastUpdated – Timestamp.

The time this classifier was last updated.


• Version – Number (long).

The version of this classifier.


• JsonPath – Required: UTF-8 string.

A JsonPath string defining the JSON data for the classifier to classify. AWS Glue supports a subset of
JsonPath, as described in Writing JsonPath Custom Classifiers.

CreateGrokClassifierRequest Structure
Specifies a grok classifier for CreateClassifier to create.

Fields

• Classification – Required: UTF-8 string.

An identifier of the data format that the classifier matches, such as Twitter, JSON, Omniture logs,
Amazon CloudWatch Logs, and so on.
• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the new classifier.


• GrokPattern – Required: UTF-8 string, not less than 1 or more than 2048 bytes long, matching the A
Logstash Grok string pattern (p. 481).

The grok pattern used by this classifier.


• CustomPatterns – UTF-8 string, not more than 16000 bytes long, matching the URI address multi-
line string pattern (p. 481).

Optional custom grok patterns used by this classifier.

UpdateGrokClassifierRequest Structure
Specifies a grok classifier to update when passed to UpdateClassifier.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the GrokClassifier.


• Classification – UTF-8 string.


An identifier of the data format that the classifier matches, such as Twitter, JSON, Omniture logs,
Amazon CloudWatch Logs, and so on.
• GrokPattern – UTF-8 string, not less than 1 or more than 2048 bytes long, matching the A Logstash
Grok string pattern (p. 481).

The grok pattern used by this classifier.


• CustomPatterns – UTF-8 string, not more than 16000 bytes long, matching the URI address multi-
line string pattern (p. 481).

Optional custom grok patterns used by this classifier.

CreateXMLClassifierRequest Structure
Specifies an XML classifier for CreateClassifier to create.

Fields

• Classification – Required: UTF-8 string.

An identifier of the data format that the classifier matches.


• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• RowTag – UTF-8 string.

The XML tag designating the element that contains each record in an XML document being parsed.
Note that this cannot identify a self-closing element (closed by />). An empty row element that
contains only attributes can be parsed as long as it ends with a closing tag (for example, <row
item_a="A" item_b="B"></row> is okay, but <row item_a="A" item_b="B" /> is not).

UpdateXMLClassifierRequest Structure
Specifies an XML classifier to be updated.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• Classification – UTF-8 string.

An identifier of the data format that the classifier matches.


• RowTag – UTF-8 string.

The XML tag designating the element that contains each record in an XML document being parsed.
Note that this cannot identify a self-closing element (closed by />). An empty row element that
contains only attributes can be parsed as long as it ends with a closing tag (for example, <row
item_a="A" item_b="B"></row> is okay, but <row item_a="A" item_b="B" /> is not).

CreateJsonClassifierRequest Structure
Specifies a JSON classifier for CreateClassifier to create.


Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• JsonPath – Required: UTF-8 string.

A JsonPath string defining the JSON data for the classifier to classify. AWS Glue supports a subset of
JsonPath, as described in Writing JsonPath Custom Classifiers.

UpdateJsonClassifierRequest Structure
Specifies a JSON classifier to be updated.

Fields

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the classifier.


• JsonPath – UTF-8 string.

A JsonPath string defining the JSON data for the classifier to classify. AWS Glue supports a subset of
JsonPath, as described in Writing JsonPath Custom Classifiers.

Operations
• CreateClassifier Action (Python: create_classifier) (p. 432)
• DeleteClassifier Action (Python: delete_classifier) (p. 433)
• GetClassifier Action (Python: get_classifier) (p. 433)
• GetClassifiers Action (Python: get_classifiers) (p. 434)
• UpdateClassifier Action (Python: update_classifier) (p. 434)

CreateClassifier Action (Python: create_classifier)


Creates a classifier in the user's account. This can be a GrokClassifier, an XMLClassifier, or a
JsonClassifier, depending on which field of the request is present.

Request

• GrokClassifier – A CreateGrokClassifierRequest (p. 430) object.

A GrokClassifier object specifying the classifier to create.


• XMLClassifier – A CreateXMLClassifierRequest (p. 431) object.

An XMLClassifier object specifying the classifier to create.


• JsonClassifier – A CreateJsonClassifierRequest (p. 431) object.

A JsonClassifier object specifying the classifier to create.


Response

• No Response parameters.

Errors

• AlreadyExistsException
• InvalidInputException
• OperationTimeoutException
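
Example

A minimal Boto3 sketch showing a grok classifier and a JSON classifier; the names, classification, grok pattern, and JsonPath values are placeholders.

import boto3

glue = boto3.client('glue')

# Create a grok classifier for Apache-style access logs.
glue.create_classifier(
    GrokClassifier={
        'Name': 'my-apache-logs',
        'Classification': 'apache-logs',
        'GrokPattern': '%{COMBINEDAPACHELOG}'
    }
)

# Alternatively, create a JSON classifier by passing the JsonClassifier field instead.
glue.create_classifier(
    JsonClassifier={
        'Name': 'my-json-records',
        'JsonPath': '$.records[*]'
    }
)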

DeleteClassifier Action (Python: delete_classifier)


Removes a classifier from the Data Catalog.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the classifier to remove.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• OperationTimeoutException

GetClassifier Action (Python: get_classifier)


Retrieve a classifier by name.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the classifier to retrieve.

Response

• Classifier – A Classifier (p. 428) object.

The requested classifier.

Errors

• EntityNotFoundException
• OperationTimeoutException


GetClassifiers Action (Python: get_classifiers)


Lists all classifier objects in the Data Catalog.

Request

• MaxResults – Number (integer), not less than 1 or more than 1000.

Size of the list to return (optional).


• NextToken – UTF-8 string.

An optional continuation token.

Response

• Classifiers – An array of Classifier (p. 428) objects.

The requested list of classifier objects.


• NextToken – UTF-8 string.

A continuation token.

Errors

• OperationTimeoutException
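
Example

A minimal Boto3 sketch that pages through all classifiers in the account; the page size is a placeholder.

import boto3

glue = boto3.client('glue')

kwargs = {'MaxResults': 50}
while True:
    response = glue.get_classifiers(**kwargs)
    for classifier in response['Classifiers']:
        # Exactly one of these fields is present, depending on the classifier type.
        for key in ('GrokClassifier', 'XMLClassifier', 'JsonClassifier'):
            if key in classifier:
                print(key, classifier[key]['Name'])
    if 'NextToken' not in response:
        break
    kwargs['NextToken'] = response['NextToken']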

UpdateClassifier Action (Python: update_classifier)


Modifies an existing classifier (a GrokClassifier, XMLClassifier, or JsonClassifier, depending
on which field is present).

Request

• GrokClassifier – An UpdateGrokClassifierRequest (p. 430) object.

A GrokClassifier object with updated fields.


• XMLClassifier – An UpdateXMLClassifierRequest (p. 431) object.

An XMLClassifier object with updated fields.


• JsonClassifier – An UpdateJsonClassifierRequest (p. 432) object.

A JsonClassifier object with updated fields.

Response

• No Response parameters.

Errors

• InvalidInputException
• VersionMismatchException
• EntityNotFoundException


• OperationTimeoutException

Crawler API
The Crawler API describes AWS Glue crawler data types, along with the API for creating, deleting,
updating, and listing crawlers.

Data Types
• Crawler Structure (p. 435)
• Schedule Structure (p. 436)
• CrawlerTargets Structure (p. 436)
• S3Target Structure (p. 437)
• JdbcTarget Structure (p. 437)
• DynamoDBTarget Structure (p. 437)
• CrawlerMetrics Structure (p. 437)
• SchemaChangePolicy Structure (p. 438)
• LastCrawlInfo Structure (p. 438)

Crawler Structure
Specifies a crawler program that examines a data source and uses classifiers to try to determine its
schema. If successful, the crawler records metadata concerning the data source in the AWS Glue Data
Catalog.

Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The crawler name.


• Role – UTF-8 string.

The IAM role (or ARN of an IAM role) used to access customer resources, such as data in Amazon S3.
• Targets – A CrawlerTargets (p. 436) object.

A collection of targets to crawl.


• DatabaseName – UTF-8 string.

The database where metadata is written by this crawler.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

A description of the crawler.


• Classifiers – An array of UTF-8 strings.

A list of custom classifiers associated with the crawler.


• SchemaChangePolicy – A SchemaChangePolicy (p. 438) object.

Sets the behavior when the crawler finds a changed or deleted object.


• State – UTF-8 string (valid values: READY | RUNNING | STOPPING).

Indicates whether the crawler is running, or whether a run is pending.


• TablePrefix – UTF-8 string, not more than 128 bytes long.

The prefix added to the names of tables that are created.


• Schedule – A Schedule (p. 444) object.

For scheduled crawlers, the schedule when the crawler runs.


• CrawlElapsedTime – Number (long).

If the crawler is running, contains the total time elapsed since the last crawl began.
• CreationTime – Timestamp.

The time when the crawler was created.


• LastUpdated – Timestamp.

The time the crawler was last updated.


• LastCrawl – A LastCrawlInfo (p. 438) object.

The status of the last crawl, and potentially error information if an error occurred.
• Version – Number (long).

The version of the crawler.


• Configuration – UTF-8 string.

Crawler configuration information. This versioned JSON string allows users to specify aspects of a
crawler's behavior. For more information, see Configuring a Crawler.
• CrawlerSecurityConfiguration – UTF-8 string, not more than 128 bytes long.

The name of the SecurityConfiguration structure to be used by this Crawler.

Schedule Structure
A scheduling object using a cron statement to schedule an event.

Fields

• ScheduleExpression – UTF-8 string.

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
• State – UTF-8 string (valid values: SCHEDULED | NOT_SCHEDULED | TRANSITIONING).

The state of the schedule.

CrawlerTargets Structure
Specifies data stores to crawl.

Fields

• S3Targets – An array of S3Target (p. 437) objects.



Specifies Amazon S3 targets.


• JdbcTargets – An array of JdbcTarget (p. 437) objects.

Specifies JDBC targets.


• DynamoDBTargets – An array of DynamoDBTarget (p. 437) objects.

Specifies DynamoDB targets.

S3Target Structure
Specifies a data store in Amazon S3.

Fields

• Path – UTF-8 string.

The path to the Amazon S3 target.


• Exclusions – An array of UTF-8 strings.

A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a
Crawler.

JdbcTarget Structure
Specifies a JDBC data store to crawl.

Fields

• ConnectionName – UTF-8 string.

The name of the connection to use to connect to the JDBC target.


• Path – UTF-8 string.

The path of the JDBC target.


• Exclusions – An array of UTF-8 strings.

A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a
Crawler.

DynamoDBTarget Structure
Specifies a DynamoDB table to crawl.

Fields

• Path – UTF-8 string.

The name of the DynamoDB table to crawl.

CrawlerMetrics Structure
Metrics for a specified crawler.


Fields

• CrawlerName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the crawler.


• TimeLeftSeconds – Number (double).

The estimated time left to complete a running crawl.


• StillEstimating – Boolean.

True if the crawler is still estimating how long it will take to complete this run.
• LastRuntimeSeconds – Number (double).

The duration of the crawler's most recent run, in seconds.


• MedianRuntimeSeconds – Number (double).

The median duration of this crawler's runs, in seconds.


• TablesCreated – Number (integer).

The number of tables created by this crawler.


• TablesUpdated – Number (integer).

The number of tables updated by this crawler.


• TablesDeleted – Number (integer).

The number of tables deleted by this crawler.

SchemaChangePolicy Structure
Crawler policy for update and deletion behavior.

Fields

• UpdateBehavior – UTF-8 string (valid values: LOG | UPDATE_IN_DATABASE).

The update behavior when the crawler finds a changed schema.


• DeleteBehavior – UTF-8 string (valid values: LOG | DELETE_FROM_DATABASE |
DEPRECATE_IN_DATABASE).

The deletion behavior when the crawler finds a deleted object.

LastCrawlInfo Structure
Status and error information about the most recent crawl.

Fields

• Status – UTF-8 string (valid values: SUCCEEDED | CANCELLED | FAILED).

Status of the last crawl.


• ErrorMessage – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

If an error occurred, the error information about the last crawl.


• LogGroup – UTF-8 string, not less than 1 or more than 512 bytes long, matching the Log group string
pattern (p. 481).

The log group for the last crawl.


• LogStream – UTF-8 string, not less than 1 or more than 512 bytes long, matching the Log-stream
string pattern (p. 481).

The log stream for the last crawl.


• MessagePrefix – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The prefix for a message about this crawl.


• StartTime – Timestamp.

The time at which the crawl started.

Operations
• CreateCrawler Action (Python: create_crawler) (p. 439)
• DeleteCrawler Action (Python: delete_crawler) (p. 440)
• GetCrawler Action (Python: get_crawler) (p. 441)
• GetCrawlers Action (Python: get_crawlers) (p. 441)
• GetCrawlerMetrics Action (Python: get_crawler_metrics) (p. 442)
• UpdateCrawler Action (Python: update_crawler) (p. 442)
• StartCrawler Action (Python: start_crawler) (p. 443)
• StopCrawler Action (Python: stop_crawler) (p. 444)

CreateCrawler Action (Python: create_crawler)


Creates a new crawler with specified targets, role, configuration, and optional schedule. At least one
crawl target must be specified, in the S3Targets field, the JdbcTargets field, or the DynamoDBTargets field.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the new crawler.


• Role – Required: UTF-8 string.

The IAM role (or ARN of an IAM role) used by the new crawler to access customer resources.
• DatabaseName – Required: UTF-8 string.

The AWS Glue database where results are written, such as: arn:aws:daylight:us-
east-1::database/sometable/*.
• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

A description of the new crawler.


• Targets – Required: A CrawlerTargets (p. 436) object.


A collection of targets to crawl.


• Schedule – UTF-8 string.

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
• Classifiers – An array of UTF-8 strings.

A list of custom classifiers that the user has registered. By default, all built-in classifiers are included in
a crawl, but these custom classifiers always override the default classifiers for a given classification.
• TablePrefix – UTF-8 string, not more than 128 bytes long.

The table prefix used for catalog tables that are created.
• SchemaChangePolicy – A SchemaChangePolicy (p. 438) object.

Policy for the crawler's update and deletion behavior.


• Configuration – UTF-8 string.

Crawler configuration information. This versioned JSON string allows users to specify aspects of a
crawler's behavior. For more information, see Configuring a Crawler.
• CrawlerSecurityConfiguration – UTF-8 string, not more than 128 bytes long.

The name of the SecurityConfiguration structure to be used by this Crawler.

Response

• No Response parameters.

Errors

• InvalidInputException
• AlreadyExistsException
• OperationTimeoutException
• ResourceNumberLimitExceededException
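
Example

A minimal Boto3 sketch; the crawler name, IAM role, database, Amazon S3 path, and exclusion pattern are placeholders.

import boto3

glue = boto3.client('glue')

# Create a scheduled crawler over an Amazon S3 prefix.
glue.create_crawler(
    Name='my-s3-crawler',
    Role='AWSGlueServiceRole-example',
    DatabaseName='dbname',
    Description='Crawls raw JSON data landed in S3',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/raw/', 'Exclusions': ['**.tmp']}]},
    Schedule='cron(15 12 * * ? *)',
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    }
)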

DeleteCrawler Action (Python: delete_crawler)


Removes a specified crawler from the Data Catalog, unless the crawler state is RUNNING.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the crawler to remove.

Response

• No Response parameters.

Errors

• EntityNotFoundException


• CrawlerRunningException
• SchedulerTransitioningException
• OperationTimeoutException

GetCrawler Action (Python: get_crawler)


Retrieves metadata for a specified crawler.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the crawler to retrieve metadata for.

Response

• Crawler – A Crawler (p. 435) object.

The metadata for the specified crawler.

Errors

• EntityNotFoundException
• OperationTimeoutException

GetCrawlers Action (Python: get_crawlers)


Retrieves metadata for all crawlers defined in the customer account.

Request

• MaxResults – Number (integer), not less than 1 or more than 1000.

The number of crawlers to return on each call.


• NextToken – UTF-8 string.

A continuation token, if this is a continuation request.

Response

• Crawlers – An array of Crawler (p. 435) objects.

A list of crawler metadata.


• NextToken – UTF-8 string.

A continuation token, if the returned list has not reached the end of those defined in this customer
account.

Errors

• OperationTimeoutException


GetCrawlerMetrics Action (Python: get_crawler_metrics)


Retrieves metrics about specified crawlers.

Request

• CrawlerNameList – An array of UTF-8 strings, not more than 100 strings.

A list of the names of crawlers about which to retrieve metrics.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum size of a list to return.


• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.

Response

• CrawlerMetricsList – An array of CrawlerMetrics (p. 437) objects.

A list of metrics for the specified crawlers.


• NextToken – UTF-8 string.

A continuation token, if the returned list does not contain the last metric available.

Errors

• OperationTimeoutException

UpdateCrawler Action (Python: update_crawler)


Updates a crawler. If a crawler is running, you must stop it using StopCrawler before updating it.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the crawler to update.


• Role – UTF-8 string.

The IAM role (or ARN of an IAM role) used by the new crawler to access customer resources.
• DatabaseName – UTF-8 string.

The AWS Glue database where results are stored, such as: arn:aws:daylight:us-
east-1::database/sometable/*.
• Description – UTF-8 string, not more than 2048 bytes long, matching the URI address multi-line
string pattern (p. 481).

A description of the new crawler.


• Targets – A CrawlerTargets (p. 436) object.

A list of targets to crawl.


• Schedule – UTF-8 string.


A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
• Classifiers – An array of UTF-8 strings.

A list of custom classifiers that the user has registered. By default, all built-in classifiers are included in
a crawl, but these custom classifiers always override the default classifiers for a given classification.
• TablePrefix – UTF-8 string, not more than 128 bytes long.

The table prefix used for catalog tables that are created.
• SchemaChangePolicy – A SchemaChangePolicy (p. 438) object.

Policy for the crawler's update and deletion behavior.


• Configuration – UTF-8 string.

Crawler configuration information. This versioned JSON string allows users to specify aspects of a
crawler's behavior. For more information, see Configuring a Crawler.
• CrawlerSecurityConfiguration – UTF-8 string, not more than 128 bytes long.

The name of the SecurityConfiguration structure to be used by this Crawler.

Response

• No Response parameters.

Errors

• InvalidInputException
• VersionMismatchException
• EntityNotFoundException
• CrawlerRunningException
• OperationTimeoutException

StartCrawler Action (Python: start_crawler)


Starts a crawl using the specified crawler, regardless of what is scheduled. If the crawler is already
running, returns a CrawlerRunningException.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the crawler to start.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• CrawlerRunningException


• OperationTimeoutException
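
Example

A minimal Boto3 sketch that starts a crawler and polls until it returns to the READY state; the crawler name and polling interval are placeholders.

import boto3
import time

glue = boto3.client('glue')

# Start the crawl, regardless of the crawler's schedule.
glue.start_crawler(Name='my-s3-crawler')

# Wait for the run to finish.
while True:
    state = glue.get_crawler(Name='my-s3-crawler')['Crawler']['State']
    if state == 'READY':
        break
    time.sleep(30)

# Inspect the outcome of the run that just finished.
last_crawl = glue.get_crawler(Name='my-s3-crawler')['Crawler'].get('LastCrawl', {})
print(last_crawl.get('Status'), last_crawl.get('MessagePrefix'))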

StopCrawler Action (Python: stop_crawler)


If the specified crawler is running, stops the crawl.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

Name of the crawler to stop.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• CrawlerNotRunningException
• CrawlerStoppingException
• OperationTimeoutException

Crawler Scheduler API


The Crawler Scheduler API describes the AWS Glue crawler scheduling data type, along with the API for
updating, starting, and stopping a crawler's schedule.

Data Types
• Schedule Structure (p. 444)

Schedule Structure
A scheduling object using a cron statement to schedule an event.

Fields

• ScheduleExpression – UTF-8 string.

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
• State – UTF-8 string (valid values: SCHEDULED | NOT_SCHEDULED | TRANSITIONING).

The state of the schedule.

Operations
• UpdateCrawlerSchedule Action (Python: update_crawler_schedule) (p. 445)
• StartCrawlerSchedule Action (Python: start_crawler_schedule) (p. 445)


• StopCrawlerSchedule Action (Python: stop_crawler_schedule) (p. 446)

UpdateCrawlerSchedule Action (Python: update_crawler_schedule)

Updates the schedule of a crawler using a cron expression.

Request

• CrawlerName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

Name of the crawler whose schedule to update.


• Schedule – UTF-8 string.

The updated cron expression used to specify the schedule (see Time-Based Schedules for Jobs and
Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InvalidInputException
• VersionMismatchException
• SchedulerTransitioningException
• OperationTimeoutException

StartCrawlerSchedule Action (Python: start_crawler_schedule)


Changes the schedule state of the specified crawler to SCHEDULED, unless the crawler is already running
or the schedule state is already SCHEDULED.

Request

• CrawlerName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

Name of the crawler to schedule.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• SchedulerRunningException


• SchedulerTransitioningException
• NoScheduleException
• OperationTimeoutException

StopCrawlerSchedule Action (Python: stop_crawler_schedule)


Sets the schedule state of the specified crawler to NOT_SCHEDULED, but does not stop the crawler if it is
already running.

Request

• CrawlerName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

Name of the crawler whose schedule state to set.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• SchedulerNotRunningException
• SchedulerTransitioningException
• OperationTimeoutException
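
Example

A minimal Boto3 sketch that updates a crawler's schedule and then turns scheduling on and off; the crawler name is a placeholder.

import boto3

glue = boto3.client('glue')

# Move the crawler to a daily schedule, then enable scheduling.
glue.update_crawler_schedule(
    CrawlerName='my-s3-crawler',
    Schedule='cron(15 12 * * ? *)'
)
glue.start_crawler_schedule(CrawlerName='my-s3-crawler')

# Later, pause scheduling without stopping a crawl that is already running.
glue.stop_crawler_schedule(CrawlerName='my-s3-crawler')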

Autogenerating ETL Scripts API


The ETL script-generation API describes the data types and the API for generating ETL scripts in AWS Glue.

Data Types
• CodeGenNode Structure (p. 446)
• CodeGenNodeArg Structure (p. 447)
• CodeGenEdge Structure (p. 447)
• Location Structure (p. 447)
• CatalogEntry Structure (p. 448)
• MappingEntry Structure (p. 448)

CodeGenNode Structure
Represents a node in a directed acyclic graph (DAG).

Fields

• Id – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Identifier string
pattern (p. 481).

A node identifier that is unique within the node's graph.


• NodeType – Required: UTF-8 string.

The type of node this is.


• Args – Required: An array of CodeGenNodeArg (p. 447) objects, not more than 50 structures.

Properties of the node, in the form of name-value pairs.


• LineNumber – Number (integer).

The line number of the node.

CodeGenNodeArg Structure
An argument or property of a node.

Fields

• Name – Required: UTF-8 string.

The name of the argument or property.


• Value – Required: UTF-8 string.

The value of the argument or property.


• Param – Boolean.

True if the value is used as a parameter.

CodeGenEdge Structure
Represents a directional edge in a directed acyclic graph (DAG).

Fields

• Source – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Identifier
string pattern (p. 481).

The ID of the node at which the edge starts.


• Target – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Identifier
string pattern (p. 481).

The ID of the node at which the edge ends.


• TargetParameter – UTF-8 string.

The target of the edge.

Location Structure
The location of resources.

Fields

• Jdbc – An array of CodeGenNodeArg (p. 447) objects, not more than 50 structures.

A JDBC location.


• S3 – An array of CodeGenNodeArg (p. 447) objects, not more than 50 structures.

An Amazon S3 location.
• DynamoDB – An array of CodeGenNodeArg (p. 447) objects, not more than 50 structures.

A DynamoDB Table location.

CatalogEntry Structure
Specifies a table definition in the Data Catalog.

Fields

• DatabaseName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The database in which the table metadata resides.


• TableName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the table in question.

MappingEntry Structure
Defines a mapping.

Fields

• SourceTable – UTF-8 string.

The name of the source table.


• SourcePath – UTF-8 string.

The source path.


• SourceType – UTF-8 string.

The source type.


• TargetTable – UTF-8 string.

The target table.


• TargetPath – UTF-8 string.

The target path.


• TargetType – UTF-8 string.

The target type.

Operations
• CreateScript Action (Python: create_script) (p. 449)
• GetDataflowGraph Action (Python: get_dataflow_graph) (p. 449)
• GetMapping Action (Python: get_mapping) (p. 450)


• GetPlan Action (Python: get_plan) (p. 450)

CreateScript Action (Python: create_script)


Transforms a directed acyclic graph (DAG) into code.

Request

• DagNodes – An array of CodeGenNode (p. 446) objects.

A list of the nodes in the DAG.


• DagEdges – An array of CodeGenEdge (p. 447) objects.

A list of the edges in the DAG.


• Language – UTF-8 string (valid values: PYTHON | SCALA).

The programming language of the resulting code from the DAG.

Response

• PythonScript – UTF-8 string.

The Python script generated from the DAG.


• ScalaCode – UTF-8 string.

The Scala code generated from the DAG.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
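
The following sketch (Boto3) shows how a small two-node DAG could be turned into a Python script. The
node types, argument names, and value quoting are illustrative assumptions modeled on the CodeGenNode
and CodeGenNodeArg structures described above.

    import boto3

    glue = boto3.client('glue')

    # Two hypothetical nodes: a Data Catalog source and an S3 sink.
    nodes = [
        {'Id': 'source_1', 'NodeType': 'DataSource',
         'Args': [{'Name': 'database', 'Value': '"mydb"'},
                  {'Name': 'table_name', 'Value': '"mytable"'}]},
        {'Id': 'sink_1', 'NodeType': 'DataSink',
         'Args': [{'Name': 'path', 'Value': '"s3://my-bucket/output/"'}]},
    ]
    edges = [{'Source': 'source_1', 'Target': 'sink_1'}]

    response = glue.create_script(DagNodes=nodes, DagEdges=edges, Language='PYTHON')
    print(response['PythonScript'])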

GetDataflowGraph Action (Python: get_dataflow_graph)
Transforms a Python script into a directed acyclic graph (DAG).

Request

• PythonScript – UTF-8 string.

The Python script to transform.

Response

• DagNodes – An array of CodeGenNode (p. 446) objects.

A list of the nodes in the resulting DAG.


• DagEdges – An array of CodeGenEdge (p. 447) objects.

A list of the edges in the resulting DAG.


Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
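
Conversely, an existing script can be decomposed back into nodes and edges. A minimal sketch using
Boto3, where the script text is read from a hypothetical local file:

    import boto3

    glue = boto3.client('glue')

    # Read a previously generated ETL script (hypothetical local file).
    with open('my_etl_script.py') as f:
        script_text = f.read()

    response = glue.get_dataflow_graph(PythonScript=script_text)
    for node in response['DagNodes']:
        print(node['Id'], node['NodeType'])
    for edge in response['DagEdges']:
        print(edge['Source'], '->', edge['Target'])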

GetMapping Action (Python: get_mapping)


Creates mappings.

Request

• Source – Required: A CatalogEntry (p. 448) object.

Specifies the source table.


• Sinks – An array of CatalogEntry (p. 448) objects.

A list of target tables.


• Location – A Location (p. 447) object.

Parameters for the mapping.

Response

• Mapping – Required: An array of MappingEntry (p. 448) objects.

A list of mappings to the specified targets.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• EntityNotFoundException

GetPlan Action (Python: get_plan)


Gets code to perform a specified mapping.

Request

• Mapping – Required: An array of MappingEntry (p. 448) objects.

The list of mappings from a source table to target tables.


• Source – Required: A CatalogEntry (p. 448) object.

The source table.


• Sinks – An array of CatalogEntry (p. 448) objects.

The target tables.


• Location – A Location (p. 447) object.


Parameters for the mapping.


• Language – UTF-8 string (valid values: PYTHON | SCALA).

The programming language of the code to perform the mapping.

Response

• PythonScript – UTF-8 string.

A Python script to perform the mapping.


• ScalaCode – UTF-8 string.

Scala code to perform the mapping.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
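
Taken together, GetMapping and GetPlan can take you from catalog tables to generated code. The
following is a minimal sketch (Boto3) under assumed database, table, and bucket names:

    import boto3

    glue = boto3.client('glue')

    source = {'DatabaseName': 'mydb', 'TableName': 'source_table'}      # hypothetical
    sinks = [{'DatabaseName': 'mydb', 'TableName': 'target_table'}]     # hypothetical
    location = {'S3': [{'Name': 'path', 'Value': 's3://my-bucket/output/'}]}

    # Propose a source-to-target field mapping.
    mapping = glue.get_mapping(Source=source, Sinks=sinks, Location=location)['Mapping']

    # Generate a script that performs that mapping.
    plan = glue.get_plan(Mapping=mapping, Source=source, Sinks=sinks,
                         Location=location, Language='PYTHON')
    print(plan['PythonScript'])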

Jobs API
The Jobs API describes jobs data types and contains APIs for working with jobs, job runs, and triggers in
AWS Glue.

Topics
• Jobs (p. 451)
• Job Runs (p. 458)
• Triggers (p. 465)

Jobs
The Jobs API describes the data types and API related to creating, updating, deleting, or viewing jobs in
AWS Glue.

Data Types
• Job Structure (p. 451)
• ExecutionProperty Structure (p. 453)
• NotificationProperty Structure (p. 453)
• JobCommand Structure (p. 453)
• ConnectionsList Structure (p. 453)
• JobUpdate Structure (p. 454)

Job Structure
Specifies a job definition.


Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name you assign to this job definition.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the job being defined.


• LogUri – UTF-8 string.

This field is reserved for future use.


• Role – UTF-8 string.

The name or ARN of the IAM role associated with this job.
• CreatedOn – Timestamp.

The time and date that this job definition was created.
• LastModifiedOn – Timestamp.

The last point in time when this job definition was modified.
• ExecutionProperty – An ExecutionProperty (p. 453) object.

An ExecutionProperty specifying the maximum number of concurrent runs allowed for this job.
• Command – A JobCommand (p. 453) object.

The JobCommand that executes this job.


• DefaultArguments – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

The default arguments for this job, specified as name-value pairs.

You can specify arguments here that your own job-execution script consumes, as well as arguments
that AWS Glue itself consumes.

For information about how to specify and consume your own Job arguments, see the Calling AWS Glue
APIs in Python topic in the developer guide.

For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special
Parameters Used by AWS Glue topic in the developer guide.
• Connections – A ConnectionsList (p. 453) object.

The connections used for this job.


• MaxRetries – Number (integer).

The maximum number of times to retry this job after a JobRun fails.
• AllocatedCapacity – Number (integer).

The number of AWS Glue data processing units (DPUs) allocated to runs of this job. From 2 to 100
DPUs can be allocated; the default is 10. A DPU is a relative measure of processing power that consists
of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue
pricing page.
• Timeout – Number (integer), at least 1.


The job timeout in minutes. This is the maximum time that a job run can consume resources before it is
terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).
• NotificationProperty – A NotificationProperty (p. 453) object.

Specifies configuration properties of a job notification.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this job.

ExecutionProperty Structure
An execution property of a job.

Fields

• MaxConcurrentRuns – Number (integer).

The maximum number of concurrent runs allowed for the job. The default is 1. An error is returned
when this threshold is reached. The maximum value you can specify is controlled by a service limit.

NotificationProperty Structure
Specifies configuration properties of a notification.

Fields

• NotifyDelayAfter – Number (integer), at least 1.

After a job run starts, the number of minutes to wait before sending a job run delay notification.

JobCommand Structure
Specifies code executed when a job is run.

Fields

• Name – UTF-8 string.

The name of the job command: this must be glueetl.


• ScriptLocation – UTF-8 string.

Specifies the S3 path to a script that executes a job (required).

ConnectionsList Structure
Specifies the connections used by a job.

Fields

• Connections – An array of UTF-8 strings.

A list of connections used by the job.


JobUpdate Structure
Specifies information used to update an existing job definition. Note that the previous job definition will
be completely overwritten by this information.

Fields

• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the job being defined.


• LogUri – UTF-8 string.

This field is reserved for future use.


• Role – UTF-8 string.

The name or ARN of the IAM role associated with this job (required).
• ExecutionProperty – An ExecutionProperty (p. 453) object.

An ExecutionProperty specifying the maximum number of concurrent runs allowed for this job.
• Command – A JobCommand (p. 453) object.

The JobCommand that executes this job (required).


• DefaultArguments – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

The default arguments for this job.

You can specify arguments here that your own job-execution script consumes, as well as arguments
that AWS Glue itself consumes.

For information about how to specify and consume your own Job arguments, see the Calling AWS Glue
APIs in Python topic in the developer guide.

For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special
Parameters Used by AWS Glue topic in the developer guide.
• Connections – A ConnectionsList (p. 453) object.

The connections used for this job.


• MaxRetries – Number (integer).

The maximum number of times to retry this job if it fails.


• AllocatedCapacity – Number (integer).

The number of AWS Glue data processing units (DPUs) to allocate to this Job. From 2 to 100 DPUs
can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of
4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing
page.
• Timeout – Number (integer), at least 1.

The job timeout in minutes. This is the maximum time that a job run can consume resources before it is
terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).
• NotificationProperty – A NotificationProperty (p. 453) object.


Specifies configuration properties of a job notification.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this job.

Operations
• CreateJob Action (Python: create_job) (p. 455)
• UpdateJob Action (Python: update_job) (p. 456)
• GetJob Action (Python: get_job) (p. 457)
• GetJobs Action (Python: get_jobs) (p. 457)
• DeleteJob Action (Python: delete_job) (p. 458)

CreateJob Action (Python: create_job)


Creates a new job definition.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name you assign to this job definition. It must be unique in your account.
• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

Description of the job being defined.


• LogUri – UTF-8 string.

This field is reserved for future use.


• Role – Required: UTF-8 string.

The name or ARN of the IAM role associated with this job.
• ExecutionProperty – An ExecutionProperty (p. 453) object.

An ExecutionProperty specifying the maximum number of concurrent runs allowed for this job.
• Command – Required: A JobCommand (p. 453) object.

The JobCommand that executes this job.


• DefaultArguments – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

The default arguments for this job.

You can specify arguments here that your own job-execution script consumes, as well as arguments
that AWS Glue itself consumes.

For information about how to specify and consume your own Job arguments, see the Calling AWS Glue
APIs in Python topic in the developer guide.


For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special
Parameters Used by AWS Glue topic in the developer guide.
• Connections – A ConnectionsList (p. 453) object.

The connections used for this job.


• MaxRetries – Number (integer).

The maximum number of times to retry this job if it fails.


• AllocatedCapacity – Number (integer).

The number of AWS Glue data processing units (DPUs) to allocate to this Job. From 2 to 100 DPUs
can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of
4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing
page.
• Timeout – Number (integer), at least 1.

The job timeout in minutes. This is the maximum time that a job run can consume resources before it is
terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).
• NotificationProperty – A NotificationProperty (p. 453) object.

Specifies configuration properties of a job notification.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this job.

Response

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The unique name that was provided for this job definition.

Errors

• InvalidInputException
• IdempotentParameterMismatchException
• AlreadyExistsException
• InternalServiceException
• OperationTimeoutException
• ResourceNumberLimitExceededException
• ConcurrentModificationException
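
For example, a job definition might be created as follows (Boto3). The job name, IAM role, script
location, and temporary directory are hypothetical placeholders.

    import boto3

    glue = boto3.client('glue')

    response = glue.create_job(
        Name='my-etl-job',                                   # hypothetical job name
        Role='AWSGlueServiceRole-Default',                   # hypothetical IAM role
        Command={'Name': 'glueetl',
                 'ScriptLocation': 's3://my-bucket/scripts/my_etl_script.py'},
        DefaultArguments={'--TempDir': 's3://my-bucket/temp/'},
        MaxRetries=1,
        AllocatedCapacity=10,
        Timeout=120
    )
    print(response['Name'])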

UpdateJob Action (Python: update_job)


Updates an existing job definition.

Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

Name of the job definition to update.


• JobUpdate – Required: A JobUpdate (p. 454) object.

Specifies the values with which to update the job definition.

Response

• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Returns the name of the updated job definition.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• ConcurrentModificationException

GetJob Action (Python: get_job)


Retrieves an existing job definition.

Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The name of the job definition to retrieve.

Response

• Job – A Job (p. 451) object.

The requested job definition.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException

GetJobs Action (Python: get_jobs)


Retrieves all current job definitions.

Request

• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum size of the response.

Response

• Jobs – An array of Job (p. 451) objects.

A list of job definitions.


• NextToken – UTF-8 string.

A continuation token, if not all job definitions have yet been returned.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
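
Because results are paginated through NextToken, retrieving every job definition typically requires a
loop such as the following sketch (Boto3):

    import boto3

    glue = boto3.client('glue')

    jobs = []
    kwargs = {'MaxResults': 100}
    while True:
        page = glue.get_jobs(**kwargs)
        jobs.extend(page['Jobs'])
        token = page.get('NextToken')
        if not token:
            break
        kwargs['NextToken'] = token

    print([job['Name'] for job in jobs])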

DeleteJob Action (Python: delete_job)


Deletes a specified job definition. If the job definition is not found, no exception is thrown.

Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The name of the job definition to delete.

Response

• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the job definition that was deleted.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException

Job Runs
The Job Runs API describes the data types and API related to starting, stopping, and viewing job runs,
and resetting job bookmarks, in AWS Glue.

Data Types
• JobRun Structure (p. 459)


• Predecessor Structure (p. 460)


• JobBookmarkEntry Structure (p. 461)
• BatchStopJobRunSuccessfulSubmission Structure (p. 461)
• BatchStopJobRunError Structure (p. 461)

JobRun Structure
Contains information about a job run.

Fields

• Id – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The ID of this job run.


• Attempt – Number (integer).

The number of the attempt to run this job.


• PreviousRunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the previous run of this job. For example, the JobRunId specified in the StartJobRun action.
• TriggerName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger that started this job run.


• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the job definition being used in this run.


• StartedOn – Timestamp.

The date and time at which this job run was started.
• LastModifiedOn – Timestamp.

The last time this job run was modified.


• CompletedOn – Timestamp.

The date and time this job run completed.


• JobRunState – UTF-8 string (valid values: STARTING | RUNNING | STOPPING | STOPPED | SUCCEEDED
| FAILED | TIMEOUT).

The current state of the job run.


• Arguments – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

The job arguments associated with this run. These override equivalent default arguments set for the
job.

You can specify arguments here that your own job-execution script consumes, as well as arguments
that AWS Glue itself consumes.


For information about how to specify and consume your own job arguments, see the Calling AWS Glue
APIs in Python topic in the developer guide.

For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special
Parameters Used by AWS Glue topic in the developer guide.
• ErrorMessage – UTF-8 string.

An error message associated with this job run.


• PredecessorRuns – An array of Predecessor (p. 460) objects.

A list of predecessors to this job run.


• AllocatedCapacity – Number (integer).

The number of AWS Glue data processing units (DPUs) allocated to this JobRun. From 2 to 100 DPUs
can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of
4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing
page.
• ExecutionTime – Number (integer).

The amount of time (in seconds) that the job run consumed resources.
• Timeout – Number (integer), at least 1.

The JobRun timeout in minutes. This is the maximum time that a job run can consume resources
before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours). This
overrides the timeout value set in the parent job.
• NotificationProperty – A NotificationProperty (p. 453) object.

Specifies configuration properties of a job run notification.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this job run.
• LogGroupName – UTF-8 string.

The name of the log group for secure logging that can be server-side encrypted in CloudWatch using
KMS. This name can be /aws-glue/jobs/, in which case the default encryption is NONE. If you add
a role name and SecurityConfiguration name (in other words, /aws-glue/jobs-yourRoleName-
yourSecurityConfigurationName/), then that security configuration will be used to encrypt the
log group.

Predecessor Structure
A job run that was used in the predicate of a conditional trigger that triggered this job run.

Fields

• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the job definition used by the predecessor job run.
• RunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The job-run ID of the predecessor job run.


JobBookmarkEntry Structure
Defines a point at which a job can resume processing.

Fields

• JobName – UTF-8 string.

Name of the job in question.


• Version – Number (integer).

Version of the job.


• Run – Number (integer).

The run ID number.


• Attempt – Number (integer).

The attempt ID number.


• JobBookmark – UTF-8 string.

The bookmark itself.

BatchStopJobRunSuccessfulSubmission Structure
Records a successful request to stop a specified JobRun.

Fields

• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the job definition used in the job run that was stopped.
• JobRunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The JobRunId of the job run that was stopped.

BatchStopJobRunError Structure
Records an error that occurred when attempting to stop a specified job run.

Fields

• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the job definition used in the job run in question.
• JobRunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The JobRunId of the job run in question.


• ErrorDetail – An ErrorDetail (p. 480) object.

Specifies details about the error that was encountered.


Operations
• StartJobRun Action (Python: start_job_run) (p. 462)
• BatchStopJobRun Action (Python: batch_stop_job_run) (p. 463)
• GetJobRun Action (Python: get_job_run) (p. 463)
• GetJobRuns Action (Python: get_job_runs) (p. 464)
• ResetJobBookmark Action (Python: reset_job_bookmark) (p. 465)

StartJobRun Action (Python: start_job_run)


Starts a job run using a job definition.

Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The name of the job definition to use.


• JobRunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The ID of a previous JobRun to retry.


• Arguments – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

The job arguments specifically for this run. They override the equivalent default arguments set in the
job definition itself.

You can specify arguments here that your own job-execution script consumes, as well as arguments
that AWS Glue itself consumes.

For information about how to specify and consume your own Job arguments, see the Calling AWS Glue
APIs in Python topic in the developer guide.

For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special
Parameters Used by AWS Glue topic in the developer guide.
• AllocatedCapacity – Number (integer).

The number of AWS Glue data processing units (DPUs) to allocate to this JobRun. From 2 to 100 DPUs
can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of
4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing
page.
• Timeout – Number (integer), at least 1.

The JobRun timeout in minutes. This is the maximum time that a job run can consume resources
before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours). This
overrides the timeout value set in the parent job.
• NotificationProperty – A NotificationProperty (p. 453) object.

Specifies configuration properties of a job run notification.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).


The name of the SecurityConfiguration structure to be used with this job run.

Response

• JobRunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The ID assigned to this job run.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• ResourceNumberLimitExceededException
• ConcurrentRunsExceededException
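
A run of an existing job definition might be started as in this sketch (Boto3; the job name and
argument are hypothetical), keeping the returned JobRunId for later status checks:

    import boto3

    glue = boto3.client('glue')

    response = glue.start_job_run(
        JobName='my-etl-job',                                   # hypothetical job name
        Arguments={'--source_path': 's3://my-bucket/input/'},   # hypothetical job argument
        Timeout=60
    )
    run_id = response['JobRunId']
    print('Started run', run_id)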

BatchStopJobRun Action (Python: batch_stop_job_run)


Stops one or more job runs for a specified job definition.

Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The name of the job definition for which to stop job runs.
• JobRunIds – Required: An array of UTF-8 strings, not less than 1 or more than 25 strings.

A list of the JobRunIds that should be stopped for that job definition.

Response

• SuccessfulSubmissions – An array of BatchStopJobRunSuccessfulSubmission (p. 461) objects.

A list of the JobRuns that were successfully submitted for stopping.


• Errors – An array of BatchStopJobRunError (p. 461) objects.

A list of the errors that were encountered in trying to stop JobRuns, including the JobRunId for which
each error was encountered and details about the error.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
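
Because this action reports per-run failures in the Errors list rather than raising an exception for
each run, callers usually inspect both lists, as in this sketch (Boto3, hypothetical job name and run
IDs):

    import boto3

    glue = boto3.client('glue')

    response = glue.batch_stop_job_run(
        JobName='my-etl-job',                        # hypothetical job name
        JobRunIds=['jr_0123abcd', 'jr_4567efgh']     # hypothetical run IDs
    )
    for ok in response['SuccessfulSubmissions']:
        print('Stopping', ok['JobRunId'])
    for err in response['Errors']:
        print('Failed to stop', err['JobRunId'], err['ErrorDetail'].get('ErrorMessage'))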

GetJobRun Action (Python: get_job_run)


Retrieves the metadata for a given job run.


Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

Name of the job definition being run.


• RunId – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The ID of the job run.


• PredecessorsIncluded – Boolean.

True if a list of predecessor runs should be returned.

Response

• JobRun – A JobRun (p. 459) object.

The requested job-run metadata.

Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
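
A common pattern is to poll this action until the run reaches a terminal state, as in the following
sketch (Boto3; the job name and run ID are hypothetical):

    import time
    import boto3

    glue = boto3.client('glue')

    TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'}

    state = None
    while state not in TERMINAL_STATES:
        time.sleep(30)
        run = glue.get_job_run(JobName='my-etl-job', RunId='jr_0123abcd')['JobRun']
        state = run['JobRunState']
        print('Current state:', state)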

GetJobRuns Action (Python: get_job_runs)


Retrieves metadata for all runs of a given job definition.

Request

• JobName – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-
line string pattern (p. 481).

The name of the job definition for which to retrieve all job runs.
• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.


• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum size of the response.

Response

• JobRuns – An array of JobRun (p. 459) objects.

A list of job-run metadata objects.


• NextToken – UTF-8 string.

A continuation token, if not all requested job runs have been returned.


Errors

• InvalidInputException
• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException

ResetJobBookmark Action (Python: reset_job_bookmark)


Resets a bookmark entry.

Request

• JobName – Required: UTF-8 string.

The name of the job in question.

Response

• JobBookmarkEntry – A JobBookmarkEntry (p. 461) object.

The reset bookmark entry.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

Triggers
The Triggers API describes the data types and API related to creating, updating, deleting, starting,
and stopping job triggers in AWS Glue.

Data Types
• Trigger Structure (p. 465)
• TriggerUpdate Structure (p. 466)
• Predicate Structure (p. 467)
• Condition Structure (p. 467)
• Action Structure (p. 467)

Trigger Structure
Information about a specific trigger.

Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).


Name of the trigger.


• Id – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Reserved for future use.


• Type – UTF-8 string (valid values: SCHEDULED | CONDITIONAL | ON_DEMAND).

The type of trigger that this is.


• State – UTF-8 string (valid values: CREATING | CREATED | ACTIVATING | ACTIVATED |
DEACTIVATING | DEACTIVATED | DELETING | UPDATING).

The current state of the trigger.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

A description of this trigger.


• Schedule – UTF-8 string.

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
• Actions – An array of Action (p. 467) objects.

The actions initiated by this trigger.


• Predicate – A Predicate (p. 467) object.

The predicate of this trigger, which defines when it will fire.

TriggerUpdate Structure
A structure used to provide the information needed to update a trigger. This object updates the previous
trigger definition by overwriting it completely.

Fields

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

Reserved for future use.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

A description of this trigger.


• Schedule – UTF-8 string.

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
• Actions – An array of Action (p. 467) objects.

The actions initiated by this trigger.


• Predicate – A Predicate (p. 467) object.

The predicate of this trigger, which defines when it will fire.


Predicate Structure
Defines the predicate of the trigger, which determines when it fires.

Fields

• Logical – UTF-8 string (valid values: AND | ANY).

Optional field if only one condition is listed. If multiple conditions are listed, then this field is required.
• Conditions – An array of Condition (p. 467) objects.

A list of the conditions that determine when the trigger will fire.

Condition Structure
Defines a condition under which a trigger fires.

Fields

• LogicalOperator – UTF-8 string (valid values: EQUALS).

A logical operator.
• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the Job to whose JobRuns this condition applies and on which this trigger waits.
• State – UTF-8 string (valid values: STARTING | RUNNING | STOPPING | STOPPED | SUCCEEDED |
FAILED | TIMEOUT).

The condition state. Currently, the values supported are SUCCEEDED, STOPPED, TIMEOUT and FAILED.

Action Structure
Defines an action to be initiated by a trigger.

Fields

• JobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of a job to be executed.


• Arguments – A map array of key-value pairs.

Each key is a UTF-8 string.

Each value is a UTF-8 string.

Arguments to be passed to the job run.

You can specify arguments here that your own job-execution script consumes, as well as arguments
that AWS Glue itself consumes.

For information about how to specify and consume your own Job arguments, see the Calling AWS Glue
APIs in Python topic in the developer guide.

For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special
Parameters Used by AWS Glue topic in the developer guide.


• Timeout – Number (integer), at least 1.

The JobRun timeout in minutes. This is the maximum time that a job run can consume resources
before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours). This
overrides the timeout value set in the parent job.
• NotificationProperty – A NotificationProperty (p. 453) object.

Specifies configuration properties of a job run notification.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this action.

Operations
• CreateTrigger Action (Python: create_trigger) (p. 468)
• StartTrigger Action (Python: start_trigger) (p. 469)
• GetTrigger Action (Python: get_trigger) (p. 470)
• GetTriggers Action (Python: get_triggers) (p. 470)
• UpdateTrigger Action (Python: update_trigger) (p. 471)
• StopTrigger Action (Python: stop_trigger) (p. 471)
• DeleteTrigger Action (Python: delete_trigger) (p. 472)

CreateTrigger Action (Python: create_trigger)


Creates a new trigger.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger.


• Type – Required: UTF-8 string (valid values: SCHEDULED | CONDITIONAL | ON_DEMAND).

The type of the new trigger.


• Schedule – UTF-8 string.

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For
example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

This field is required when the trigger type is SCHEDULED.


• Predicate – A Predicate (p. 467) object.

A predicate to specify when the new trigger should fire.

This field is required when the trigger type is CONDITIONAL.


• Actions – Required: An array of Action (p. 467) objects.

The actions initiated by this trigger when it fires.


• Description – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).


A description of the new trigger.


• StartOnCreation – Boolean.

Set to true to start SCHEDULED and CONDITIONAL triggers when created. True is not supported for
ON_DEMAND triggers.

Response

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the trigger.

Errors

• AlreadyExistsException
• InvalidInputException
• IdempotentParameterMismatchException
• InternalServiceException
• OperationTimeoutException
• ResourceNumberLimitExceededException
• ConcurrentModificationException
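
For example, a conditional trigger that starts a downstream job after an upstream job succeeds might be
created as follows (Boto3; the trigger and job names are hypothetical):

    import boto3

    glue = boto3.client('glue')

    glue.create_trigger(
        Name='start-reporting-after-etl',                    # hypothetical trigger name
        Type='CONDITIONAL',
        Predicate={
            'Logical': 'AND',
            'Conditions': [{'LogicalOperator': 'EQUALS',
                            'JobName': 'my-etl-job',         # hypothetical upstream job
                            'State': 'SUCCEEDED'}]
        },
        Actions=[{'JobName': 'my-reporting-job'}],           # hypothetical downstream job
        StartOnCreation=True
    )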

StartTrigger Action (Python: start_trigger)


Starts an existing trigger. See Triggering Jobs for information about how different types of trigger are
started.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger to start.

Response

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the trigger that was started.

Errors

• InvalidInputException
• InternalServiceException
• EntityNotFoundException
• OperationTimeoutException
• ResourceNumberLimitExceededException


• ConcurrentRunsExceededException

GetTrigger Action (Python: get_trigger)


Retrieves the definition of a trigger.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger to retrieve.

Response

• Trigger – A Trigger (p. 465) object.

The requested trigger definition.

Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

GetTriggers Action (Python: get_triggers)


Gets all the triggers associated with a job.

Request

• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.


• DependentJobName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the job for which to retrieve triggers. The trigger that can start this job will be returned,
and if there is no such trigger, all triggers will be returned.
• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum size of the response.

Response

• Triggers – An array of Trigger (p. 465) objects.

A list of triggers for the specified job.


• NextToken – UTF-8 string.

A continuation token, if not all the requested triggers have yet been returned.


Errors

• EntityNotFoundException
• InvalidInputException
• InternalServiceException
• OperationTimeoutException

UpdateTrigger Action (Python: update_trigger)


Updates a trigger definition.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger to update.


• TriggerUpdate – Required: A TriggerUpdate (p. 466) object.

The new values with which to update the trigger.

Response

• Trigger – A Trigger (p. 465) object.

The resulting trigger definition.

Errors

• InvalidInputException
• InternalServiceException
• EntityNotFoundException
• OperationTimeoutException
• ConcurrentModificationException

StopTrigger Action (Python: stop_trigger)


Stops a specified trigger.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger to stop.

Response

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the trigger that was stopped.


Errors

• InvalidInputException
• InternalServiceException
• EntityNotFoundException
• OperationTimeoutException
• ConcurrentModificationException

DeleteTrigger Action (Python: delete_trigger)


Deletes a specified trigger. If the trigger is not found, no exception is thrown.

Request

• Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The name of the trigger to delete.

Response

• Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The name of the trigger that was deleted.

Errors

• InvalidInputException
• InternalServiceException
• OperationTimeoutException
• ConcurrentModificationException

Development Endpoints API


The Development Endpoints API describes the AWS Glue API related to testing using a custom
DevEndpoint.

Data Types
• DevEndpoint Structure (p. 472)
• DevEndpointCustomLibraries Structure (p. 474)

DevEndpoint Structure
A development endpoint where a developer can remotely debug ETL scripts.

Fields

• EndpointName – UTF-8 string.


The name of the DevEndpoint.


• RoleArn – UTF-8 string, matching the AWS IAM ARN string pattern (p. 481).

The AWS ARN of the IAM role used in this DevEndpoint.


• SecurityGroupIds – An array of UTF-8 strings.

A list of security group identifiers used in this DevEndpoint.


• SubnetId – UTF-8 string.

The subnet ID for this DevEndpoint.


• YarnEndpointAddress – UTF-8 string.

The YARN endpoint address used by this DevEndpoint.


• PrivateAddress – UTF-8 string.

A private IP address to access the DevEndpoint within a VPC, if the DevEndpoint is created within one.
The PrivateAddress field is present only when you create the DevEndpoint within your virtual private
cloud (VPC).
• ZeppelinRemoteSparkInterpreterPort – Number (integer).

The Apache Zeppelin port for the remote Apache Spark interpreter.
• PublicAddress – UTF-8 string.

The public IP address used by this DevEndpoint. The PublicAddress field is present only when you
create a non-VPC (virtual private cloud) DevEndpoint.
• Status – UTF-8 string.

The current status of this DevEndpoint.


• NumberOfNodes – Number (integer).

The number of AWS Glue Data Processing Units (DPUs) allocated to this DevEndpoint.
• AvailabilityZone – UTF-8 string.

The AWS availability zone where this DevEndpoint is located.


• VpcId – UTF-8 string.

The ID of the virtual private cloud (VPC) used by this DevEndpoint.


• ExtraPythonLibsS3Path – UTF-8 string.

Path(s) to one or more Python libraries in an S3 bucket that should be loaded in your DevEndpoint.
Multiple values must be complete paths separated by a comma.

Please note that only pure Python libraries can currently be used on a DevEndpoint. Libraries that rely
on C extensions, such as the pandas Python data analysis library, are not yet supported.
• ExtraJarsS3Path – UTF-8 string.

Path to one or more Java Jars in an S3 bucket that should be loaded in your DevEndpoint.

Please note that only pure Java/Scala libraries can currently be used on a DevEndpoint.
• FailureReason – UTF-8 string.

The reason for a current failure in this DevEndpoint.


• LastUpdateStatus – UTF-8 string.

The status of the last update.


• CreatedTimestamp – Timestamp.

The point in time at which this DevEndpoint was created.


• LastModifiedTimestamp – Timestamp.

The point in time at which this DevEndpoint was last modified.


• PublicKey – UTF-8 string.

The public key to be used by this DevEndpoint for authentication. This attribute is provided for
backward compatibility, as the recommended attribute to use is public keys.
• PublicKeys – An array of UTF-8 strings, not more than 5 strings.

A list of public keys to be used by the DevEndpoints for authentication. The use of this attribute is
preferred over a single public key because the public keys allow you to have a different private key per
client.
Note
If you previously created an endpoint with a public key, you must remove that key to be able
to set a list of public keys: call the UpdateDevEndpoint API with the public key content in
the deletePublicKeys attribute, and the list of new keys in the addPublicKeys attribute.
• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this DevEndpoint.

DevEndpointCustomLibraries Structure
Custom libraries to be loaded into a DevEndpoint.

Fields

• ExtraPythonLibsS3Path – UTF-8 string.

Path(s) to one or more Python libraries in an S3 bucket that should be loaded in your DevEndpoint.
Multiple values must be complete paths separated by a comma.

Please note that only pure Python libraries can currently be used on a DevEndpoint. Libraries that rely
on C extensions, such as the pandas Python data analysis library, are not yet supported.
• ExtraJarsS3Path – UTF-8 string.

Path to one or more Java Jars in an S3 bucket that should be loaded in your DevEndpoint.

Please note that only pure Java/Scala libraries can currently be used on a DevEndpoint.

Operations
• CreateDevEndpoint Action (Python: create_dev_endpoint) (p. 475)
• UpdateDevEndpoint Action (Python: update_dev_endpoint) (p. 477)
• DeleteDevEndpoint Action (Python: delete_dev_endpoint) (p. 477)
• GetDevEndpoint Action (Python: get_dev_endpoint) (p. 478)
• GetDevEndpoints Action (Python: get_dev_endpoints) (p. 478)


CreateDevEndpoint Action (Python: create_dev_endpoint)
Creates a new DevEndpoint.

Request

• EndpointName – Required: UTF-8 string.

The name to be assigned to the new DevEndpoint.


• RoleArn – Required: UTF-8 string, matching the AWS IAM ARN string pattern (p. 481).

The IAM role for the DevEndpoint.


• SecurityGroupIds – An array of UTF-8 strings.

Security group IDs for the security groups to be used by the new DevEndpoint.
• SubnetId – UTF-8 string.

The subnet ID for the new DevEndpoint to use.


• PublicKey – UTF-8 string.

The public key to be used by this DevEndpoint for authentication. This attribute is provided for
backward compatibility, as the recommended attribute to use is public keys.
• PublicKeys – An array of UTF-8 strings, not more than 5 strings.

A list of public keys to be used by the DevEndpoints for authentication. The use of this attribute is
preferred over a single public key because the public keys allow you to have a different private key per
client.
Note
If you previously created an endpoint with a public key, you must remove that key to be able
to set a list of public keys: call the UpdateDevEndpoint API with the public key content in
the deletePublicKeys attribute, and the list of new keys in the addPublicKeys attribute.
• NumberOfNodes – Number (integer).

The number of AWS Glue Data Processing Units (DPUs) to allocate to this DevEndpoint.
• ExtraPythonLibsS3Path – UTF-8 string.

Path(s) to one or more Python libraries in an S3 bucket that should be loaded in your DevEndpoint.
Multiple values must be complete paths separated by a comma.

Please note that only pure Python libraries can currently be used on a DevEndpoint. Libraries that rely
on C extensions, such as the pandas Python data analysis library, are not yet supported.
• ExtraJarsS3Path – UTF-8 string.

Path to one or more Java Jars in an S3 bucket that should be loaded in your DevEndpoint.
• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure to be used with this DevEndpoint.

Response

• EndpointName – UTF-8 string.


The name assigned to the new DevEndpoint.


• Status – UTF-8 string.

The current status of the new DevEndpoint.


• SecurityGroupIds – An array of UTF-8 strings.

The security groups assigned to the new DevEndpoint.


• SubnetId – UTF-8 string.

The subnet ID assigned to the new DevEndpoint.


• RoleArn – UTF-8 string, matching the AWS IAM ARN string pattern (p. 481).

The AWS ARN of the role assigned to the new DevEndpoint.


• YarnEndpointAddress – UTF-8 string.

The address of the YARN endpoint used by this DevEndpoint.


• ZeppelinRemoteSparkInterpreterPort – Number (integer).

The Apache Zeppelin port for the remote Apache Spark interpreter.
• NumberOfNodes – Number (integer).

The number of AWS Glue Data Processing Units (DPUs) allocated to this DevEndpoint.
• AvailabilityZone – UTF-8 string.

The AWS availability zone where this DevEndpoint is located.


• VpcId – UTF-8 string.

The ID of the VPC used by this DevEndpoint.


• ExtraPythonLibsS3Path – UTF-8 string.

Path(s) to one or more Python libraries in an S3 bucket that will be loaded in your DevEndpoint.
• ExtraJarsS3Path – UTF-8 string.

Path to one or more Java Jars in an S3 bucket that will be loaded in your DevEndpoint.
• FailureReason – UTF-8 string.

The reason for a current failure in this DevEndpoint.


• SecurityConfiguration – UTF-8 string, not less than 1 or more than 255 bytes long, matching the
Single-line string pattern (p. 481).

The name of the SecurityConfiguration structure being used with this DevEndpoint.
• CreatedTimestamp – Timestamp.

The point in time at which this DevEndpoint was created.

Errors

• AccessDeniedException
• AlreadyExistsException
• IdempotentParameterMismatchException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException


• ValidationException
• ResourceNumberLimitExceededException
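
A development endpoint might be created as in the following sketch (Boto3). The endpoint name, role
ARN, account ID, and SSH public key file are hypothetical placeholders.

    import boto3

    glue = boto3.client('glue')

    with open('my_key.pub') as f:              # hypothetical SSH public key file
        public_key = f.read()

    response = glue.create_dev_endpoint(
        EndpointName='my-dev-endpoint',                               # hypothetical name
        RoleArn='arn:aws:iam::123456789012:role/AWSGlueServiceRole',  # hypothetical role ARN
        PublicKeys=[public_key],
        NumberOfNodes=2
    )
    print(response['Status'])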

UpdateDevEndpoint Action (Python: update_dev_endpoint)
Updates a specified DevEndpoint.

Request

• EndpointName – Required: UTF-8 string.

The name of the DevEndpoint to be updated.


• PublicKey – UTF-8 string.

The public key for the DevEndpoint to use.


• AddPublicKeys – An array of UTF-8 strings, not more than 5 strings.

The list of public keys for the DevEndpoint to use.


• DeletePublicKeys – An array of UTF-8 strings, not more than 5 strings.

The list of public keys to be deleted from the DevEndpoint.


• CustomLibraries – A DevEndpointCustomLibraries (p. 474) object.

Custom Python or Java libraries to be loaded in the DevEndpoint.


• UpdateEtlLibraries – Boolean.

True if the list of custom libraries to be loaded in the development endpoint needs to be updated, or
False otherwise.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException
• ValidationException

DeleteDevEndpoint Action (Python: delete_dev_endpoint)
Deletes a specified DevEndpoint.

Request

• EndpointName – Required: UTF-8 string.


The name of the DevEndpoint.

Response

• No Response parameters.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException

GetDevEndpoint Action (Python: get_dev_endpoint)


Retrieves information about a specified DevEndpoint.
Note
When you create a development endpoint in a virtual private cloud (VPC), AWS Glue returns
only a private IP address, and the public IP address field is not populated. When you create a
non-VPC development endpoint, AWS Glue returns only a public IP address.

Request

• EndpointName – Required: UTF-8 string.

Name of the DevEndpoint for which to retrieve information.

Response

• DevEndpoint – A DevEndpoint (p. 472) object.

A DevEndpoint definition.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException

GetDevEndpoints Action (Python: get_dev_endpoints)
Retrieves all the DevEndpoints in this AWS account.
Note
When you create a development endpoint in a virtual private cloud (VPC), AWS Glue returns
only a private IP address and the public IP address field is not populated. When you create a
non-VPC development endpoint, AWS Glue returns only a public IP address.


Request

• MaxResults – Number (integer), not less than 1 or more than 1000.

The maximum size of information to return.


• NextToken – UTF-8 string.

A continuation token, if this is a continuation call.

Response

• DevEndpoints – An array of DevEndpoint (p. 472) objects.

A list of DevEndpoint definitions.


• NextToken – UTF-8 string.

A continuation token, if not all DevEndpoint definitions have yet been returned.

Errors

• EntityNotFoundException
• InternalServiceException
• OperationTimeoutException
• InvalidInputException

Common Data Types


This section describes miscellaneous common data types used in AWS Glue.

Tag Structure
An AWS Tag.

Fields

• key – UTF-8 string, not less than 1 or more than 128 bytes long.

The tag key.


• value – UTF-8 string, not more than 256 bytes long.

The tag value.

DecimalNumber Structure
Contains a numeric value in decimal format.

Fields

• UnscaledValue – Blob.

The unscaled numeric value.


• Scale – Number (integer).


The scale that determines where the decimal point falls in the unscaled value.

ErrorDetail Structure
Contains details about an error.

Fields

• ErrorCode – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line
string pattern (p. 481).

The code associated with this error.


• ErrorMessage – Description string, not more than 2048 bytes long, matching the URI address multi-
line string pattern (p. 481).

A message describing the error.

PropertyPredicate Structure
Defines a property predicate.

Fields

• Key – Value string, not more than 1024 bytes long.

The key of the property.


• Value – Value string, not more than 1024 bytes long.

The value of the property.


• Comparator – UTF-8 string (valid values: EQUALS | GREATER_THAN | LESS_THAN |
GREATER_THAN_EQUALS | LESS_THAN_EQUALS).

The comparator used to compare this property to others.

ResourceUri Structure
URIs for function resources.

Fields

• ResourceType – UTF-8 string (valid values: JAR | FILE | ARCHIVE).

The type of the resource.


• Uri – Uniform resource identifier (uri), not less than 1 or more than 1024 bytes long, matching the URI
address multi-line string pattern (p. 481).

The URI for accessing the resource.

String Patterns
The API uses the following regular expressions to define what is valid content for various string
parameters and members:


• Single-line string pattern – "[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*"
• URI address multi-line string pattern – "[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*"
• A Logstash Grok string pattern – "[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\t]*"
• Identifier string pattern – "[A-Za-z_][A-Za-z0-9_]*"
• AWS Glue ARN string pattern – "arn:aws:glue:.*"
• AWS IAM ARN string pattern – "arn:aws:iam::\d{12}:role/.*"
• Version string pattern – "^[a-zA-Z0-9-_]+$"
• Log group string pattern – "[\.\-_/#A-Za-z0-9]+"
• Log-stream string pattern – "[^:*]*"
• Custom string pattern #10 – "arn:aws:kms:.*"
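
As an illustration, a client-side check against one of these patterns could look like the following
sketch (Python re module); the values being tested are hypothetical node IDs:

    import re

    # Identifier string pattern, as defined above.
    IDENTIFIER_PATTERN = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

    def is_valid_identifier(value):
        # fullmatch ensures that the entire string conforms to the pattern.
        return IDENTIFIER_PATTERN.fullmatch(value) is not None

    print(is_valid_identifier('source_1'))   # True
    print(is_valid_identifier('1_source'))   # False: must not start with a digit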

Exceptions
This section describes AWS Glue exceptions that you can use to find the source of problems and fix them.

AccessDeniedException Structure
Access to a resource was denied.

Fields

• Message – UTF-8 string.

A message describing the problem.

AlreadyExistsException Structure
A resource to be created or added already exists.

Fields

• Message – UTF-8 string.

A message describing the problem.

ConcurrentModificationException Structure
Two processes are trying to modify a resource simultaneously.

Fields

• Message – UTF-8 string.

A message describing the problem.

ConcurrentRunsExceededException Structure
Too many jobs are being run concurrently.


Fields

• Message – UTF-8 string.

A message describing the problem.

CrawlerNotRunningException Structure
The specified crawler is not running.

Fields

• Message – UTF-8 string.

A message describing the problem.

CrawlerRunningException Structure
The operation cannot be performed because the crawler is already running.

Fields

• Message – UTF-8 string.

A message describing the problem.

CrawlerStoppingException Structure
The specified crawler is stopping.

Fields

• Message – UTF-8 string.

A message describing the problem.

EntityNotFoundException Structure
A specified entity does not exist.

Fields

• Message – UTF-8 string.

A message describing the problem.

GlueEncryptionException Structure
An encryption operation failed.

Fields

• Message – UTF-8 string.


A message describing the problem.

IdempotentParameterMismatchException Structure
The same unique identifier was associated with two different records.

Fields

• Message – UTF-8 string.

A message describing the problem.

InternalServiceException Structure
An internal service error occurred.

Fields

• Message – UTF-8 string.

A message describing the problem.

InvalidExecutionEngineException Structure
An unknown or invalid execution engine was specified.

Fields

• message – UTF-8 string.

A message describing the problem.

InvalidInputException Structure
The input provided was not valid.

Fields

• Message – UTF-8 string.

A message describing the problem.

InvalidTaskStatusTransitionException Structure
Proper transition from one task to the next failed.

Fields

• message – UTF-8 string.

A message describing the problem.


JobDefinitionErrorException Structure
A job definition is not valid.

Fields

• message – UTF-8 string.

A message describing the problem.

JobRunInTerminalStateException Structure
The terminal state of a job run signals a failure.

Fields

• message – UTF-8 string.

A message describing the problem.

JobRunInvalidStateTransitionException Structure
A job run encountered an invalid transition from source state to target state.

Fields

• jobRunId – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string
pattern (p. 481).

The ID of the job run in question.


• message – UTF-8 string.

A message describing the problem.


• sourceState – UTF-8 string (valid values: STARTING | RUNNING | STOPPING | STOPPED | SUCCEEDED
| FAILED | TIMEOUT).

The source state.


• targetState – UTF-8 string (valid values: STARTING | RUNNING | STOPPING | STOPPED | SUCCEEDED
| FAILED | TIMEOUT).

The target state.

JobRunNotInTerminalStateException Structure
A job run is not in a terminal state.

Fields

• message – UTF-8 string.

A message describing the problem.
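
Operations that require a terminal state can poll GetJobRun until the run reaches one of the terminal values listed above (STOPPED, SUCCEEDED, FAILED, or TIMEOUT). The following is a minimal sketch using Boto3; the polling interval is an arbitrary placeholder.

import time
import boto3

glue = boto3.client("glue")
TERMINAL_STATES = {"STOPPED", "SUCCEEDED", "FAILED", "TIMEOUT"}

def wait_for_job_run(job_name: str, run_id: str, poll_seconds: int = 30) -> str:
    # Poll the run's JobRunState until it is terminal, then return that state.
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)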


LateRunnerException Structure
A job runner is late.

Fields

• Message – UTF-8 string.

A message describing the problem.

NoScheduleException Structure
There is no applicable schedule.

Fields

• Message – UTF-8 string.

A message describing the problem.

OperationTimeoutException Structure
The operation timed out.

Fields

• Message – UTF-8 string.

A message describing the problem.

ResourceNumberLimitExceededException Structure
A resource numerical limit was exceeded.

Fields

• Message – UTF-8 string.

A message describing the problem.

SchedulerNotRunningException Structure
The specified scheduler is not running.

Fields

• Message – UTF-8 string.

A message describing the problem.


SchedulerRunningException Structure
The specified scheduler is already running.

Fields

• Message – UTF-8 string.

A message describing the problem.

SchedulerTransitioningException Structure
The specified scheduler is transitioning.

Fields

• Message – UTF-8 string.

A message describing the problem.

UnrecognizedRunnerException Structure
The job runner was not recognized.

Fields

• Message – UTF-8 string.

A message describing the problem.

ValidationException Structure
A value could not be validated.

Fields

• Message – UTF-8 string.

A message describing the problem.

VersionMismatchException Structure
There was a version conflict.

Fields

• Message – UTF-8 string.

A message describing the problem.
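
When you call AWS Glue through an AWS SDK, these exceptions surface either as modeled exception classes or as a generic client error whose error code matches the structure name. The following is a minimal sketch using Boto3 and botocore; the database name is a hypothetical placeholder.

import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")

try:
    glue.get_database(Name="example_db")  # "example_db" is a placeholder
except ClientError as err:
    code = err.response["Error"]["Code"]        # For example, "EntityNotFoundException"
    message = err.response["Error"]["Message"]  # The Message field described above
    if code == "EntityNotFoundException":
        print(f"Database not found: {message}")
    else:
        raise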


Document History for AWS Glue


The following list describes important changes to the documentation for AWS Glue.

• Latest API version: 2018-12-12


• Latest documentation update: December 11, 2018

• Support for encrypting connection passwords (December 11, 2018) – Added information about encrypting passwords used in connection objects. For more information, see Encrypting Connection Passwords.

• Support for resource-level permissions and resource-based policies (October 15, 2018) – Added information about using resource-level permissions and resource-based policies with AWS Glue. For more information, see the topics within Security in AWS Glue.

• Support for Amazon SageMaker notebooks (October 5, 2018) – Added information about using Amazon SageMaker notebooks with AWS Glue development endpoints. For more information, see Managing Notebooks.

• Support for encryption (August 24, 2018) – Added information about using encryption with AWS Glue. For more information, see Encryption and Secure Access for AWS Glue and Setting Up Encryption in AWS Glue.

• Support for Apache Spark job metrics (July 13, 2018) – Added information about the use of Apache Spark metrics for better debugging and profiling of ETL jobs. You can easily track runtime metrics such as bytes read and written, memory usage and CPU load of the driver and executors, and data shuffles among executors from the AWS Glue console. For more information, see Monitoring AWS Glue Using CloudWatch Metrics, Job Monitoring and Debugging, and Working with Jobs on the AWS Glue Console.

• Support for DynamoDB as a data source (July 10, 2018) – Added information about crawling DynamoDB and using it as a data source for ETL jobs. For more information, see Cataloging Tables with a Crawler and Connection Parameters.

• Updates to the create notebook server procedure (July 9, 2018) – Updated information about how to create a notebook server on an Amazon EC2 instance associated with a development endpoint. For more information, see Creating a Notebook Server Associated with a Development Endpoint.

• Updates now available over RSS (June 25, 2018) – You can now subscribe to an RSS feed to receive notifications about updates to the AWS Glue Developer Guide.

• Support for delay notifications for jobs (May 25, 2018) – Added information about configuring a delay threshold when a job runs. For more information, see Adding Jobs in AWS Glue.

• Configure a crawler to append new columns (May 7, 2018) – Added information about a new configuration option for crawlers, MergeNewColumns. For more information, see Configuring a Crawler.

• Support for job timeouts (April 10, 2018) – Added information about setting a timeout threshold when a job runs. For more information, see Adding Jobs in AWS Glue.

• Support for Scala ETL scripts and for triggering jobs based on additional run states (January 12, 2018) – Added information about using Scala as the ETL programming language. In addition, the trigger API now supports firing when any conditions are met (in addition to all conditions). Also, jobs can be triggered based on a "failed" or "stopped" job run (in addition to a "succeeded" job run).

Earlier Updates
The following list describes the important changes in each release of the AWS Glue Developer Guide before January 2018.

• Support for XML data sources and a new crawler configuration option (November 16, 2017) – Added information about classifying XML data sources and a new crawler option for partition changes.

• New transforms, support for additional Amazon RDS database engines, and development endpoint enhancements (September 29, 2017) – Added information about the map and filter transforms, support for Amazon RDS Microsoft SQL Server and Amazon RDS Oracle, and new features for development endpoints.

• AWS Glue initial release (August 14, 2017) – This is the initial release of the AWS Glue Developer Guide.


AWS Glossary
For the latest AWS terminology, see the AWS Glossary in the AWS General Reference.
