AWS Athena, Glue, Redshift, and EMR Use-Cases

Updated: at 09:55 AM

data

Amazon Athena

Athena integrates with the AWS Glue Data Catalog out of the box. Use-Cases:

When to Use Athena:

AWS Glue

Use-Cases:

When to Use Glue:

Glue crawler

Glue crawlers are automated data discovery tools. A crawler scans a data source to classify, group, and catalog the data within it, and then creates new tables or updates existing ones in your AWS Glue Data Catalog.
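The crawl, classify, and create-or-update flow can be sketched in a few lines of plain JavaScript. This is a toy illustration only, not how Glue's classifiers work internally: `inferSchema`, `crawl`, the in-memory `catalog` map, and the `s3://example-bucket/houses/` location are all hypothetical names made up for the example.

```javascript
// Toy illustration only: this does NOT reproduce Glue's real classifiers.
// It mimics the flow: scan records -> infer a schema -> create or update a table.

// Infer a column -> type mapping from sample records (hypothetical helper).
function inferSchema(records) {
  const schema = {};
  for (const record of records) {
    for (const [column, value] of Object.entries(record)) {
      const type = value === null ? "null" : typeof value;
      if (!(column in schema) || schema[column] === "null") {
        // First sighting, or only nulls seen so far: take the observed type.
        schema[column] = type;
      } else if (type !== "null" && schema[column] !== type) {
        // Conflicting types across records: fall back to "string".
        schema[column] = "string";
      }
    }
  }
  return schema;
}

// Create the table if it is new, otherwise update its entry in place.
function crawl(catalog, tableName, location, records) {
  const existed = catalog.has(tableName);
  catalog.set(tableName, { location, schema: inferSchema(records) });
  return existed ? "updated" : "created";
}

const catalog = new Map(); // stand-in for the Data Catalog
const sample = [
  { houseNo: 1, houseTax: 120.5, owner: "A" },
  { houseNo: 2, houseTax: null, owner: "B" },
];

const firstRun = crawl(catalog, "houses", "s3://example-bucket/houses/", sample);
const secondRun = crawl(catalog, "houses", "s3://example-bucket/houses/", sample);
```

The first run reports `created` and the second `updated`, which mirrors the "creates new or updates existing tables" behaviour described above.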

Glue Data Catalog

The AWS Glue Data Catalog is an index of your data’s location, schema, and runtime metrics. You need this information to create and monitor your extract, transform, and load (ETL) jobs.
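As a rough mental model, a catalog entry can be pictured as a record holding those three things, which an ETL job looks up before it runs. The sketch below is an assumption-laden illustration: the field names, the `planJob` helper, and the sample values are hypothetical and do not reflect the actual Glue API shape.

```javascript
// Hypothetical catalog entry; field names are illustrative, not the Glue API shape.
const dataCatalog = {
  houses: {
    location: "s3://example-bucket/houses/", // where the data lives (made-up path)
    schema: [
      { name: "houseNo", type: "int" },
      { name: "houseTax", type: "double" },
    ],
    metrics: { recordCount: 2 }, // illustrative runtime metric
  },
};

// An ETL job resolves where to read from and which columns to expect.
function planJob(catalog, tableName) {
  const entry = catalog[tableName];
  if (!entry) throw new Error(`Table not found in catalog: ${tableName}`);
  return {
    readFrom: entry.location,
    columns: entry.schema.map(c => c.name),
  };
}

const plan = planJob(dataCatalog, "houses");
```

Keeping location and schema in one place means the job itself does not hard-code where the data lives or what it looks like.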

Amazon Redshift

Use-Cases:

When to Use Redshift:

EMR

Amazon EMR (Elastic MapReduce) is a hosted version of open-source tools such as Apache Hadoop, Apache Spark, HBase, Flink, Hudi, and Presto.

MapReduce is named after its two basic operations: map, which reads the data and puts it into a format suitable for analysis, and reduce, which performs an aggregate calculation on the result, e.g. summing the tax collected from houses.

const totalTax = data.results
  .map(i => ({ n: i.houseNo, tax: i.houseTax })) // map: keep only the fields we need
  .reduce((acc, i) => (!i.tax ? acc : acc + i.tax), 0); // reduce: sum the tax, skipping missing values
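In a distributed MapReduce job there is usually also a grouping step between the two phases, often called the shuffle, so that each reducer sees all values for one key. The following is a minimal single-machine sketch of that idea, assuming a made-up `houses` array with a hypothetical `street` field; it is not how Hadoop or EMR implement it.

```javascript
// Hypothetical input: one record per house.
const houses = [
  { street: "Elm", houseNo: 1, houseTax: 100 },
  { street: "Elm", houseNo: 2, houseTax: 150 },
  { street: "Oak", houseNo: 7, houseTax: 200 },
  { street: "Oak", houseNo: 9, houseTax: null }, // missing tax is skipped
];

// Map: emit a [key, value] pair for every usable record.
const pairs = houses
  .filter(h => typeof h.houseTax === "number")
  .map(h => [h.street, h.houseTax]);

// Shuffle: group the emitted values by key.
const grouped = new Map();
for (const [street, tax] of pairs) {
  if (!grouped.has(street)) grouped.set(street, []);
  grouped.get(street).push(tax);
}

// Reduce: collapse each group of values into a single result per key.
const taxByStreet = {};
for (const [street, taxes] of grouped) {
  taxByStreet[street] = taxes.reduce((sum, t) => sum + t, 0);
}

console.log(taxByStreet); // { Elm: 250, Oak: 200 }
```

On a real cluster the map and reduce steps run in parallel on different nodes, and the grouping is what moves data between them.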

When to Use EMR:

EMR vs Redshift

| Feature | Amazon EMR | Amazon Redshift |
| --- | --- | --- |
| Primary Use Case | Big data processing using frameworks like Hadoop, Spark. | Data warehousing and analytics. |
| Nodes | Cluster can be resized manually or automatically. | Cluster size can be adjusted, supports node types for different workloads. |
| Durability | Short Running (One Shot) | Long Running (24x7) |
| Security | Encryption in transit/at rest, IAM roles, network configurations for security. | Offers encryption, VPC, IAM roles, and also supports Redshift Spectrum for secure data querying. |
| In VPC | Can be launched within a VPC for isolated processing. | Runs in VPC by default for enhanced network security. |
| Backup | Depends on the data storage used (e.g., S3). Manual snapshot management for HDFS. | Automated and manual snapshots to S3, with configurable retention periods. |
| Cost | Pricing based on the type and number of instances, and the duration of cluster operation. Spot Instances for low cost. | Pricing based on node types, number of nodes, and hours run; also offers reserved instances for cost saving. In general expensive. |
| Data Sources | S3, DynamoDB, HDFS. | S3, DynamoDB, RDS, EMR, Glue, EC2. |
| Data Destination | S3, HDFS, or exported to other systems. | Results are typically stored within Redshift clusters but can be exported to S3 or other services. |
| Performance | Performance depends on the chosen hardware and software configuration. Optimized for parallel processing of big data. | Highly optimized for complex queries across large datasets using columnar storage and data compression. |
| Scalability | Highly scalable, supports adding/removing instances to the cluster as needed. | Scalable, with the ability to resize clusters and use elastic resize for quick adjustments. |
| Management | Managed Hadoop framework, but requires configuration and management of applications. | Fully managed service, with less operational overhead for scaling and maintenance. |

Summary of Athena, Glue, Redshift and EMR

Complex Use-Cases

AWS Glue, Amazon Athena, Amazon Redshift, and Amazon EMR complement each other, and several use cases benefit from combining them in a single data engineering or analytics workflow to meet complex data processing, storage, and analysis objectives.
