
Overview

AWS Glue is a service that prepares data for analysis and reporting. It does the heavy lifting of data cleaning and the other tasks involved in extract, transform, and load (ETL) jobs. Glue uses machine learning to flatten data, infer data types, clean invalid records, combine multiple datastores, create database schemas, and fit the data into them.

The first step in using Glue is to add a datastore to the AWS Glue Data Catalog. Crawlers extract metadata about the datastore and save it in the catalog. Once a dataset is in the catalog, it can be queried immediately. S3 buckets, RDS, and Redshift are examples of datastores that can be added to the catalog.

Second, AWS Glue ETL creates and runs scripts that get the data ready for analysis. It automatically generates Python code that transforms the data, and the user can modify that code before the jobs run. Glue schedules jobs to run the code and prepare data for BI services. Jobs can run on a periodic schedule or be triggered by an event. Glue is serverless, so there is no need to provision any resources when scheduling jobs.

Pricing Guidelines

AWS Glue is priced at an hourly rate and billed in one-second increments. Check the Glue page for updated pricing. Glue charges by Data Processing Unit (DPU), a measure of the CPU and memory used to run jobs. One DPU provides 4 vCPUs and 16 GB of memory. Glue provides two types of jobs: Apache Spark and Python shell. Each job type has its own minimum billing duration and minimum DPU requirement. Please see the Glue page for further details.
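The billing model described above can be sketched as a small calculation. This is only an illustration: the function name, the rate of 0.44 per DPU-hour, and the 60-second minimum are placeholder assumptions, not current AWS prices. Take the real values from the Glue pricing page.

```python
def estimate_job_cost(dpus, runtime_seconds, rate_per_dpu_hour, minimum_seconds):
    """Estimate the cost of one job run under per-second DPU billing.

    All four inputs are supplied by the caller; nothing here is looked up
    from AWS. The minimum billing duration differs per job type.
    """
    # A run shorter than the minimum is billed as if it ran for the minimum.
    billed_seconds = max(runtime_seconds, minimum_seconds)
    # Convert to DPU-hours, the unit the hourly rate applies to.
    dpu_hours = dpus * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


# Hypothetical inputs: 10 DPUs for 15 minutes at a placeholder rate.
cost = estimate_job_cost(dpus=10, runtime_seconds=900,
                         rate_per_dpu_hour=0.44, minimum_seconds=60)
print(f"Estimated cost: {cost:.2f}")
```

Passing the rate and minimum as arguments keeps the sketch independent of any particular region or job type.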

Additional Charges

If you use Amazon S3, Amazon RDS, or Amazon Redshift as datastores for the catalog, you will be charged those services' standard request and data transfer rates. Additionally, if you use CloudWatch or CloudTrail, you will be charged those services' standard rates.

Architecture

AWS Glue is an ETL service that makes it easier to process data for BI analysis. Glue has four main components:

  • Glue Metadata Catalog: Stores metadata about the datasources to be processed (RDS, S3, Redshift).
  • Crawlers: Search datastores for inputs to be included in the analysis.
  • Glue ETL: Generates code that transforms data so it can be consumed by BI applications.
  • Jobs: Run Glue ETL code in a serverless manner. Jobs can be scheduled flexibly.

See AWS Glue Security Documentation for details. For a full and updated list of AWS Glue features, please visit AWS Glue Features and AWS Glue Resources.

Automated Safeguards

ClearDATA's Automated Safeguards for Glue ensure that all components of this application (Catalog, crawlers, database connections, jobs) are properly configured to meet ClearDATA's defined controls required to host and process PHI.  

ClearDATA reviews the default catalog settings to ensure the catalog metadata storage is encrypted at rest and that Glue resources are using encrypted storage and encrypted database connections.  If these settings are not properly configured, ClearDATA will modify the settings to meet the controls listed below.

Compliance Guidance

Encrypted Storage

HIPAA Technical Safeguard 45 CFR 164.312(a)(2)(iv), the addressable Encryption and Decryption implementation specification, strongly suggests that you implement a mechanism to encrypt and decrypt electronic protected health information (ePHI). ClearDATA's interpretation of this regulation is that all storage must be encrypted. ClearDATA Automated Safeguards for Glue will evaluate all appropriate resources for at-rest encryption and remediate those resources if necessary. ClearDATA evaluates the Security Configuration and the default Data Catalog for the encryption-at-rest setting. Furthermore, ClearDATA evaluates all Crawlers, Jobs, Triggers, and Development Endpoints to ensure that any Security Configuration applied to these resources meets these encryption guidelines.

Data Catalog

Data Catalog encryption settings are reviewed when the settings are updated and when Catalog resources (Connections, Crawlers, Development Endpoints, Jobs, Security Configurations, Tables, Triggers) are created or updated. When at-rest encryption is enabled for the catalog, all future catalog metadata will be encrypted.

It is sometimes difficult to verify in the AWS Glue Console that Data Catalog Encryption Settings have been saved. The Save button always appears to be available, and clicking it after the Automated Safeguard has enabled encryption may cause an error. This appears to be an inconsistency in the Glue Console, of which we have informed Amazon. To further verify these settings, you can use the GetDataCatalogEncryptionSettings API action.
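As a minimal sketch of that verification, the check below inspects a GetDataCatalogEncryptionSettings response. It assumes the response is fetched with boto3's `get_data_catalog_encryption_settings` call; the helper names are illustrative, and treating every mode other than `DISABLED` as encrypted is an assumption rather than ClearDATA's actual rule.

```python
def fetch_catalog_encryption_settings():
    """Call the Glue API (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the check below works without boto3
    return boto3.client("glue").get_data_catalog_encryption_settings()


def catalog_encrypted_at_rest(response):
    """Return True if the response reports at-rest encryption as enabled."""
    settings = response.get("DataCatalogEncryptionSettings", {})
    at_rest = settings.get("EncryptionAtRest", {})
    # A missing mode is treated as DISABLED (assumption for this sketch).
    return at_rest.get("CatalogEncryptionMode", "DISABLED") != "DISABLED"


# Sample response shaped like the API output, used here instead of a live call.
sample = {
    "DataCatalogEncryptionSettings": {
        "EncryptionAtRest": {"CatalogEncryptionMode": "SSE-KMS"}
    }
}
print(catalog_encrypted_at_rest(sample))
```

In an account with credentials configured, `catalog_encrypted_at_rest(fetch_catalog_encryption_settings())` would run the same check against the live catalog.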

Customers can replace the default KMS key with a Customer Master Key (CMK) at any time. See Encrypting Your Data Catalog for details.

If you remove the key, you won’t be able to decrypt the data. You must have permission to encrypt, decrypt and generate keys. Make sure those permissions are set up in the KMS policy.
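The sketch below shows one possible shape for such a KMS key policy statement, built as a Python dictionary. Mapping "encrypt, decrypt and generate keys" to the `kms:Encrypt`, `kms:Decrypt`, and `kms:GenerateDataKey` actions is our reading of the requirement, and the statement ID and role ARN are hypothetical placeholders. Confirm the exact actions against the AWS documentation on encrypting the Data Catalog.

```python
import json


def glue_kms_policy_statement(principal_arn):
    """Build a key policy statement granting the three permissions.

    principal_arn is whichever IAM role or user works with the encrypted
    catalog; the value used below is a made-up example.
    """
    return {
        "Sid": "AllowGlueCatalogEncryption",  # illustrative statement ID
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


statement = glue_kms_policy_statement(
    "arn:aws:iam::111122223333:role/ExampleGlueRole"  # hypothetical ARN
)
print(json.dumps(statement, indent=2))
```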

Remediation

If at-rest encryption is disabled on the default Data Catalog, it is marked as non-compliant and encryption is enabled transparently using the default encryption key.

Security Configuration

Security Configuration encryption settings are reviewed when a new Security Configuration is created. Data encryption settings for Crawlers, Jobs, Triggers, and Development Endpoints are reviewed when these resources are created or updated and are enforced through attached Security Configurations. If encryption for S3 buckets and/or CloudWatch logs is disabled on a Security Configuration that is attached to a Job, Crawler, or Development Endpoint, the resource is marked as non-compliant.
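The review described above can be approximated with a short check over a Security Configuration as returned by the Glue GetSecurityConfiguration API. This is a sketch, not ClearDATA's implementation: the function name is illustrative, and treating a missing S3 or CloudWatch section as disabled is an assumption.

```python
def security_configuration_findings(security_configuration):
    """Return a list of reasons the configuration would be non-compliant."""
    encryption = security_configuration.get("EncryptionConfiguration", {})
    findings = []

    # S3Encryption is a list; every entry must have a mode other than DISABLED.
    s3_modes = [entry.get("S3EncryptionMode", "DISABLED")
                for entry in encryption.get("S3Encryption", [])]
    if not s3_modes or "DISABLED" in s3_modes:
        findings.append("S3 encryption is disabled")

    cloudwatch = encryption.get("CloudWatchEncryption", {})
    if cloudwatch.get("CloudWatchEncryptionMode", "DISABLED") == "DISABLED":
        findings.append("CloudWatch Logs encryption is disabled")

    return findings


# Example: S3 is encrypted but CloudWatch Logs encryption is off.
example_config = {
    "Name": "example-config",  # hypothetical name
    "EncryptionConfiguration": {
        "S3Encryption": [{"S3EncryptionMode": "SSE-KMS"}],
        "CloudWatchEncryption": {"CloudWatchEncryptionMode": "DISABLED"},
    },
}
print(security_configuration_findings(example_config))
```

An empty list means the configuration passes both checks named in this section; job bookmark encryption is deliberately not checked because the text above does not mention it.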

Security Configurations are not yet a supported resource for AWS CloudFormation, so Automated Safeguards services for this resource type may produce a Warning in the CloudWatch Logs.

Remediation

If encryption for S3 buckets and/or CloudWatch logs is disabled on the Security Configuration, it is marked as non-compliant and the Security Configuration is deleted.

Development Endpoints

Development Endpoints that are marked non-compliant are deleted. In the case of ETL Jobs and Crawlers, ClearDATA's Automated Safeguards attempt to update the non-compliant resource by creating a new ClearDATA Security Configuration with information from a compliant Security Configuration already present in the Glue Catalog. Customers can replace the ClearDATA Security Configuration by attaching another compliant Security Configuration of their choice.

Remediation

If no compliant Security Configuration is available to use as a model, the Job or Crawler is deleted.

Triggers

Encryption for Trigger Actions is dictated by the Security Configuration attached to the Action's Job. An optional Trigger Security Configuration can be attached to an Action to override the Job's default security settings; doing so does not trigger remediation of the Job resources themselves. Any Trigger Security Configuration with encryption disabled for S3 buckets and/or CloudWatch logs is removed from the Trigger through an update.

Remediation

Any Trigger that has any non-compliant Jobs is marked as non-compliant and deleted.

Encrypted Connections

HIPAA Technical Safeguard 45 CFR 164.312(a)(2)(iv), the addressable Encryption and Decryption implementation specification, strongly suggests that you implement a mechanism to encrypt and decrypt electronic protected health information (ePHI).

ClearDATA's interpretation of this regulation is that all connections between data sources and Glue must be encrypted, ensuring that data transmitted between data sources and Glue is encrypted. That means the data from database connections, crawlers, and ETL jobs needs to be encrypted. Glue Connection settings are reviewed when a Connection resource is created or updated. The Connection must require Secure Sockets Layer (SSL).

JDBC Connections for Jobs and Crawlers are reviewed when these resources are created or updated. If these resources are configured with non-compliant Connections, they are marked as non-compliant and remediated. ClearDATA's Automated Safeguards update Jobs by removing non-compliant Connections. However, if no compliant Connections remain, the Job is deleted. Likewise, Automated Safeguards update Crawlers by removing Crawler Targets that have non-compliant JDBC Connections. If no Targets of any type remain, the Crawler is deleted. The customer may choose to reconfigure remediated Jobs and Crawlers with compliant Connections to make sure they are working as intended.
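The Job remediation rule above can be read as a small decision function. The sketch below is only an interpretation of that paragraph: the function and action names are hypothetical, and leaving a Job that has no Connections at all untouched is an assumption the text does not spell out.

```python
def plan_job_remediation(job_connections, compliant_connections):
    """Decide what to do with a Job given its Connection names.

    job_connections: names of the Connections attached to the Job.
    compliant_connections: set of names already known to be compliant.
    Returns an (action, remaining_connections) tuple.
    """
    remaining = [name for name in job_connections
                 if name in compliant_connections]
    if len(remaining) == len(job_connections):
        # Nothing non-compliant was attached, so nothing changes.
        return ("keep", remaining)
    if remaining:
        # Strip the non-compliant Connections and keep the Job.
        return ("update", remaining)
    # Every attached Connection was non-compliant.
    return ("delete", [])


compliant = {"orders-db-ssl"}  # hypothetical Connection names
print(plan_job_remediation(["orders-db-ssl", "legacy-db"], compliant))
print(plan_job_remediation(["legacy-db"], compliant))
```

The same shape would apply to Crawlers, with Crawler Targets taking the place of Connections.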

Remediation

Any Connection that does not have the enforce SSL option configured will be marked as non-compliant and deleted.
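A minimal sketch of that check follows, assuming the Connection is in the shape returned by the Glue GetConnection API, where the SSL requirement appears as the `JDBC_ENFORCE_SSL` connection property with a string value. The function name and sample Connections are illustrative.

```python
def connection_enforces_ssl(connection):
    """Return True if the Connection requires SSL.

    A missing JDBC_ENFORCE_SSL property is treated as not enforced
    (assumption for this sketch).
    """
    properties = connection.get("ConnectionProperties", {})
    # The property value is a string such as "true" or "false".
    return str(properties.get("JDBC_ENFORCE_SSL", "false")).lower() == "true"


# Hypothetical Connections used only to exercise the check.
secure = {"Name": "orders-db-ssl",
          "ConnectionProperties": {"JDBC_ENFORCE_SSL": "true"}}
insecure = {"Name": "legacy-db",
            "ConnectionProperties": {"JDBC_ENFORCE_SSL": "false"}}

for conn in (secure, insecure):
    status = "compliant" if connection_enforces_ssl(conn) else "non-compliant"
    print(conn["Name"], status)
```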

Shared Responsibility

ClearDATA will ensure that all Glue catalogs and Glue ETL resources created in accounts with Automated Safeguards meet the requirements outlined above under Compliance Guidance. If a Glue catalog or Glue ETL resource is created and found to violate any of the items listed, the Automated Safeguards will remediate it and send an alert with details of the violation to the ClearDATA SNS topic.

Please contact your ClearDATA team for a full copy of the Responsibilities Matrix.

Exclusion

Automated remediation can be disabled for Crawlers, Jobs, Triggers, and Development Endpoints in Glue. Please contact ClearDATA Support to request an exclusion that allows the resource to be created.

Reference Architecture


RACI

Item | ClearDATA | Customer
Enforce Automated Safeguards | R, A | I, C
Ensure any service excluded from automated remediation does not contain any PHI/PII | I, C | R, A
Configure, manage, and maintain all Glue Metadata Catalogs, Crawlers, Jobs, Triggers, ETL code, and any other Glue features | I, C | R, A




