Design patterns to manage Amazon EMR on EKS workloads for Apache Spark


Amazon EMR on Amazon EKS allows you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (Amazon EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Kubernetes uses namespaces to provide isolation between groups of resources within a single Kubernetes cluster. Amazon EMR creates a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster. Amazon EMR can then run analytics workloads on that namespace.

In EMR on EKS, you can submit your Spark jobs to Amazon EMR virtual clusters using the AWS Command Line Interface (AWS CLI), SDK, or Amazon EMR Studio. Amazon EMR requests the Kubernetes scheduler on Amazon EKS to schedule pods. For every job you run, EMR on EKS creates a container with an Amazon Linux 2 base image, Apache Spark, and associated dependencies. Each Spark job runs in a pod on Amazon EKS worker nodes. If your Amazon EKS cluster has worker nodes in different Availability Zones, the Spark application driver and executor pods can spread across multiple Availability Zones. In this case, data transfer charges apply for cross-AZ communication, and data processing latency increases. If you want to reduce data processing latency and avoid cross-AZ data transfer costs, you should configure Spark applications to run only within a single Availability Zone.

In this post, we share four design patterns to manage EMR on EKS workloads for Apache Spark. We then show how to use a pod template to schedule a job with EMR on EKS, using Karpenter as our autoscaling tool.

Pattern 1: Manage Spark jobs by pod template

Customers often consolidate multiple applications on a shared Amazon EKS cluster to improve utilization and save costs. However, each application may have different requirements. For example, you may want to run performance-intensive workloads such as machine learning model training jobs on SSD-backed instances for better performance, or fault-tolerant and flexible applications on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for lower cost. In EMR on EKS, there are a few ways to configure how your Spark job runs on Amazon EKS worker nodes. You can utilize the Spark configurations on Kubernetes with the EMR on EKS StartJobRun API, or you can use Spark's pod template feature. Pod templates are specifications that determine how to run each pod on your EKS clusters. With pod templates, you have more flexibility and can use pod template files to define Kubernetes pod configurations that Spark doesn't support.
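To illustrate the configuration route, Spark on Kubernetes exposes node placement through properties such as spark.kubernetes.node.selector.[labelKey], which the StartJobRun API can carry in sparkSubmitParameters. The following fragment is a sketch only; the entry point path and the capacityType label value are illustrative assumptions, not values from this post:

```yaml
# Sketch of a StartJobRun job driver fragment (not a complete request).
# The entryPoint path and the ON_DEMAND label value are assumptions.
sparkSubmitJobDriver:
  entryPoint: "s3://<YourS3Bucket>/your-job.py"
  sparkSubmitParameters: >-
    --conf spark.kubernetes.node.selector.eks.amazonaws.com/capacityType=ON_DEMAND
    --conf spark.executor.instances=2
```

Because a node selector set this way applies to the driver and executor pods alike, pod templates remain the more flexible option when the two need different placement.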

You can use pod templates to achieve the following benefits:

  • Reduce costs – You can schedule Spark executor pods to run on EC2 Spot Instances while scheduling Spark driver pods to run on EC2 On-Demand Instances.
  • Improve monitoring – You can enhance your Spark workload's observability. For example, you can deploy a sidecar container via a pod template to your Spark job that forwards logs to your centralized logging application.
  • Improve resource utilization – You can support multiple teams running their Spark workloads on the same shared Amazon EKS cluster.

You can implement these patterns using pod templates and Kubernetes labels and selectors. Kubernetes labels are key-value pairs that are attached to objects, such as Kubernetes worker nodes, to identify attributes that are meaningful and relevant to users. You can then choose where Kubernetes schedules pods using nodeSelector or Kubernetes affinity and anti-affinity so that a pod can only run on specific worker nodes. nodeSelector is the simplest way to constrain pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define.
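As a minimal sketch of these two mechanisms (the capacity-type label and pod name are assumptions for illustration), a pod can be pinned to Spot-labeled nodes either with nodeSelector or with the equivalent, more expressive node affinity form:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: placement-example   # hypothetical pod for illustration
spec:
  # Simplest form: schedule only on nodes carrying this exact label
  nodeSelector:
    karpenter.sh/capacity-type: spot
  # Equivalent hard requirement expressed as node affinity
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: karpenter.sh/capacity-type
            operator: In
            values: ["spot"]
  containers:
  - name: main
    image: public.ecr.aws/docker/library/busybox:latest
    command: ["sleep", "3600"]
```

In practice you would use one of the two; affinity additionally supports preferred (soft) rules and operators such as NotIn that nodeSelector cannot express.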

Autoscaling in Spark workloads

Autoscaling is a function that automatically scales your compute resources up or down in response to changes in demand. For Kubernetes auto scaling, Amazon EKS supports two auto scaling products: the Kubernetes Cluster Autoscaler and the Karpenter open-source auto scaling project. Kubernetes autoscaling ensures your cluster has enough nodes to schedule your pods without wasting resources. If some pods fail to schedule on current worker nodes due to insufficient resources, it increases the size of the cluster and adds additional nodes. It also attempts to remove underutilized nodes when their pods can run elsewhere.

Pattern 2: Turn on Dynamic Resource Allocation (DRA) in Spark

Spark provides a mechanism called Dynamic Resource Allocation (DRA), which dynamically adjusts the resources your application occupies based on the workload. With DRA, the Spark driver spawns the initial number of executors and then scales up the number until the specified maximum number of executors is met to process the pending tasks. Idle executors are deleted when there are no pending tasks. It's particularly useful if you're not sure how many executors are needed for your job processing.

You can implement it in EMR on EKS by following the Dynamic Resource Allocation workshop.
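Under the hood, DRA comes down to a handful of Spark properties. The following spark-defaults classification is a sketch under stated assumptions: the executor bounds are illustrative, and shuffle tracking is shown because Spark on Kubernetes has no external shuffle service:

```yaml
# Illustrative spark-defaults classification for DRA; tune the bounds for your job.
classification: spark-defaults
properties:
  spark.dynamicAllocation.enabled: "true"
  spark.dynamicAllocation.shuffleTracking.enabled: "true"
  spark.dynamicAllocation.minExecutors: "1"
  spark.dynamicAllocation.initialExecutors: "1"
  spark.dynamicAllocation.maxExecutors: "10"
```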

Pattern 3: Fully control cluster autoscaling with Cluster Autoscaler

Cluster Autoscaler uses the concept of node groups as the element of capacity control and scale. In AWS, node groups are implemented by Auto Scaling groups. Cluster Autoscaler scales them by controlling the DesiredReplicas field of your Auto Scaling groups.

To save costs and improve resource utilization, you can use Cluster Autoscaler in your Amazon EKS cluster to automatically scale your Spark pods. The following are recommendations for autoscaling Spark jobs with Amazon EMR on EKS using Cluster Autoscaler:

  • Create Availability Zone bounded Auto Scaling groups to make sure Cluster Autoscaler only adds worker nodes in the same Availability Zone, avoiding cross-AZ data transfer costs and data processing latency.
  • Create separate node groups for EC2 On-Demand and Spot Instances. By doing this, you can grow or shrink driver pods and executor pods independently.
  • In Cluster Autoscaler, each node in a node group needs to have identical scheduling properties. That includes EC2 instance types, which should have a similar vCPU to memory ratio to avoid inconsistency and wasted resources. To learn more about Cluster Autoscaler node group best practices, refer to Configuring your Node Groups.
  • Adhere to Spot Instance best practices and maximize diversification to take advantage of multiple Spot pools. Create multiple node groups for Spark executor pods with different vCPU to memory ratios. This greatly increases the stability and resiliency of your application.
  • When you have multiple node groups, use pod templates and Kubernetes labels and selectors to manage Spark pod deployment to specific Availability Zones and EC2 instance types.

The following diagram illustrates Availability Zone bounded Auto Scaling groups.

When multiple node groups are created, Cluster Autoscaler has the concept of expanders, which provide different strategies for selecting which node group to scale. As of this writing, the following strategies are supported: random, most-pods, least-waste, and priority. With multiple node groups of EC2 On-Demand and Spot Instances, you can use the priority expander, which lets Cluster Autoscaler select the node group that has the highest priority assigned by the user. For configuration details, refer to Priority based expander for Cluster Autoscaler.
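The priority expander reads its priorities from a ConfigMap named cluster-autoscaler-priority-expander in the kube-system namespace, mapping priority values to regular expressions over node group names. The group name patterns below are assumptions for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*       # prefer Spot-backed node groups first
    10:
      - .*on-demand.*  # fall back to On-Demand groups
```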

Pattern 4: Group-less autoscaling with Karpenter

Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS. The overall goal is the same: auto scale Amazon EKS clusters to accommodate un-schedulable pods. However, Karpenter takes a different approach than Cluster Autoscaler, known as group-less provisioning. It observes the aggregate resource requests of unscheduled pods and decides to launch the minimal compute resources that fit the un-schedulable pods, for efficient binpacking and reduced scheduling latency. It can also delete nodes to reduce infrastructure costs. Karpenter works directly with Amazon EC2 Fleet.

To configure Karpenter, you create provisioners that define how Karpenter manages un-schedulable pods and expired nodes. You should utilize the concept of layered constraints to manage scheduling constraints. To reduce EMR on EKS costs and improve Amazon EKS cluster utilization, you can use Karpenter with similar constraints of Single-AZ, On-Demand Instances for Spark driver pods, and Spot Instances for executor pods, without creating multiple types of node groups. With its group-less approach, Karpenter allows you to be more flexible and diversify better.

The following are recommendations for auto scaling EMR on EKS with Karpenter:

  • Configure Karpenter provisioners to launch nodes in a single Availability Zone to avoid cross-AZ data transfer costs and reduce data processing latency.
  • Create a provisioner for EC2 Spot Instances and one for EC2 On-Demand Instances. You can reduce costs by scheduling Spark driver pods to run on EC2 On-Demand Instances and Spark executor pods to run on EC2 Spot Instances.
  • Limit the instance types by providing a list of EC2 instances, or let Karpenter choose from all the Spot pools available to it. The latter follows the Spot best practice of diversifying across multiple Spot pools.
  • Use pod templates and Kubernetes labels and selectors to allow Karpenter to spin up the right-sized nodes required for un-schedulable pods.
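For the instance-type recommendation, a provisioner can carry an explicit allow list through the well-known node.kubernetes.io/instance-type label. This requirements fragment is a sketch; the instance types are illustrative choices of similar size across several families:

```yaml
# Fragment of a karpenter.sh/v1alpha5 Provisioner spec (sketch only)
requirements:
  - key: "karpenter.sh/capacity-type"
    operator: In
    values: ["spot"]
  - key: "node.kubernetes.io/instance-type"
    operator: In
    values: ["m5.xlarge", "m5a.xlarge", "m5d.xlarge", "m4.xlarge"]
```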

The following diagram illustrates how Karpenter works.

Karpenter How it Works

To summarize the design patterns we discussed:

  1. Pod templates help tailor your Spark workloads. You can configure Spark pods in a single Availability Zone and utilize EC2 Spot Instances for Spark executor pods, resulting in better price-performance.
  2. EMR on EKS supports the DRA feature in Spark. It's useful when you don't know how many Spark executors are needed for your job processing; DRA dynamically adjusts the resources your application needs.
  3. Utilizing Cluster Autoscaler enables you to fully control how to autoscale your Amazon EMR on EKS workloads. It improves your Spark application availability and cluster efficiency by rapidly launching right-sized compute resources.
  4. Karpenter simplifies autoscaling with its group-less provisioning of compute resources. The benefits include reduced scheduling latency and efficient bin-packing to reduce infrastructure costs.

Walkthrough overview

In our example walkthrough, we show how to use a pod template to schedule a job with EMR on EKS. We use Karpenter as our autoscaling tool.

We complete the following steps to implement the solution:

  1. Create an Amazon EKS cluster.
  2. Prepare the cluster for EMR on EKS.
  3. Register the cluster with Amazon EMR.
  4. For Amazon EKS auto scaling, set up Karpenter auto scaling in Amazon EKS.
  5. Submit a sample Spark job using pod templates to run in a single Availability Zone and utilize Spot Instances for Spark executor pods.

The following diagram illustrates this architecture.

Prerequisites

To follow along with the walkthrough, make sure you have the following prerequisite resources:

  • An AWS account that provides access to AWS services.
  • An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS CLI, and permissions to create IAM roles, IAM policies, Amazon EKS IAM roles and service-linked roles, AWS CloudFormation stacks, and a VPC. For more information, see Actions, resources, and condition keys for Amazon Elastic Kubernetes Service and Using service-linked roles. You must complete all steps in this post as the same user.
  • An Amazon Simple Storage Service (Amazon S3) bucket to store your pod templates.
  • The AWS CLI, eksctl, and kubectl. Instructions for installing these tools are given in Step 1.

Create an Amazon EKS cluster

There are two ways to create an EKS cluster: you can use the AWS Management Console and AWS CLI, or you can install all the required resources for Amazon EKS using eksctl, a simple command line utility for creating and managing Kubernetes clusters on EKS. For this post, we use eksctl to create our cluster.

Let's start by installing the tools to set up and manage your Kubernetes cluster.

  1. Install the AWS CLI with the following commands (Linux OS) and confirm it works:
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install
    aws --version

    For other operating systems, see Installing, updating, and uninstalling the AWS CLI version 2.

  2. Install eksctl, the command line utility for creating and managing Kubernetes clusters on Amazon EKS:
    curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
    sudo mv -v /tmp/eksctl /usr/local/bin
    eksctl version

    eksctl is a tool jointly developed by AWS and Weaveworks that automates much of the experience of creating EKS clusters.

  3. Install the Kubernetes command-line tool, kubectl, which allows you to run commands against Kubernetes clusters:
    curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.18.8/2020-09-18/bin/linux/amd64/kubectl
    chmod +x ./kubectl
    sudo mv ./kubectl /usr/local/bin

  4. Create a new file called eks-create-cluster.yaml with the following:
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: emr-on-eks-blog-cluster
      region: us-west-2
    
    availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]
    
    managedNodeGroups: # On-Demand node group for Spark jobs
    - name: singleaz-ng-ondemand
      instanceType: m5.xlarge
      desiredCapacity: 1
      availabilityZones: ["us-west-2b"]
    

  5. Create an Amazon EKS cluster using the eks-create-cluster.yaml file:
    eksctl create cluster -f eks-create-cluster.yaml

    In this Amazon EKS cluster, we create a single managed node group with a general purpose m5.xlarge EC2 instance. Launching the Amazon EKS cluster, its managed node groups, and all dependencies typically takes 10–15 minutes.

  6. After you create the cluster, you can run the following to confirm all node groups were created:
    eksctl get nodegroups --cluster emr-on-eks-blog-cluster

    You can now use kubectl to interact with the created Amazon EKS cluster.

  7. After you create your Amazon EKS cluster, you need to configure the kubeconfig file for your cluster using the AWS CLI:
    aws eks --region us-west-2 update-kubeconfig --name emr-on-eks-blog-cluster
    kubectl cluster-info
    

You can now use kubectl to connect to your Kubernetes cluster.

Prepare your Amazon EKS cluster for EMR on EKS

Now we prepare our Amazon EKS cluster to integrate it with EMR on EKS.

  1. Let's create the namespace emr-on-eks-blog in our Amazon EKS cluster:
    kubectl create namespace emr-on-eks-blog

  2. We use the automation powered by eksctl to create role-based access control permissions and to add the EMR on EKS service-linked role into the aws-auth configmap:
    eksctl create iamidentitymapping --cluster emr-on-eks-blog-cluster --namespace emr-on-eks-blog --service-name "emr-containers"

  3. The Amazon EKS cluster already has an OpenID Connect provider URL. You enable IAM roles for service accounts by associating IAM with the Amazon EKS cluster OIDC provider:
    eksctl utils associate-iam-oidc-provider --cluster emr-on-eks-blog-cluster --approve

    Now let's create the IAM role that Amazon EMR uses to run Spark jobs.

  4. Create the file blog-emr-trust-policy.json:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "elasticmapreduce.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

    Set up an IAM role:

    aws iam create-role --role-name blog-emrJobExecutionRole --assume-role-policy-document file://blog-emr-trust-policy.json

    This IAM role contains all the permissions that the Spark job needs; for instance, we provide access to S3 buckets and Amazon CloudWatch to access necessary files (pod templates) and share logs.

    Next, we need to attach the required IAM policies to the role so it can write logs to Amazon S3 and CloudWatch.

  5. Create the file blog-emr-policy-document.json with the required IAM policies. Replace the bucket name with your S3 bucket ARN.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": ["arn:aws:s3:::<bucket-name>"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "logs:PutLogEvents",
            "logs:CreateLogStream",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams"
          ],
          "Resource": [
            "arn:aws:logs:::"
          ]
        }
      ]
    }

    Attach it to the IAM role created in the previous step:

    aws iam put-role-policy --role-name blog-emrJobExecutionRole --policy-name blog-EMR-JobExecution-policy --policy-document file://blog-emr-policy-document.json

  6. Now we update the trust relationship between the IAM role we just created and the Amazon EMR service identity. The namespace provided here in the trust policy needs to be the same when registering the virtual cluster in the next step:
    aws emr-containers update-role-trust-policy --cluster-name emr-on-eks-blog-cluster --namespace emr-on-eks-blog --role-name blog-emrJobExecutionRole --region us-west-2

Register the Amazon EKS cluster with Amazon EMR

Registering your Amazon EKS cluster is the final step to set up EMR on EKS to run workloads.

We create a virtual cluster and map it to the Kubernetes namespace created earlier:

aws emr-containers create-virtual-cluster \
    --region us-west-2 \
    --name emr-on-eks-blog-cluster \
    --container-provider '{
       "id": "emr-on-eks-blog-cluster",
       "type": "EKS",
       "info": {
          "eksInfo": {
              "namespace": "emr-on-eks-blog"
          }
       }
    }'

After you register, you should get confirmation that your EMR virtual cluster is created:

{
"arn": "arn:aws:emr-containers:us-west-2:142939128734:/virtualclusters/lwpylp3kqj061ud7fvh6sjuyk",
"id": "lwpylp3kqj061ud7fvh6sjuyk",
"name": "emr-on-eks-blog-cluster"
}

A virtual cluster is an Amazon EMR concept that means Amazon EMR is registered to a Kubernetes namespace and can run jobs in that namespace. If you navigate to the Amazon EMR console, you can see the virtual cluster listed.

Set up Karpenter in Amazon EKS

To get started with Karpenter, ensure there is some compute capacity available, and install it using the Helm charts provided in the public repository. Karpenter also requires permissions to provision compute resources. For more information, refer to Getting Started.

Karpenter's single responsibility is to provision compute for your Kubernetes clusters, which is configured by a custom resource called a provisioner. Once installed in your cluster, the Karpenter provisioner observes incoming Kubernetes pods that can't be scheduled due to insufficient compute resources in the cluster, and automatically launches new resources to meet their scheduling and resource requirements.

For our use case, we create two provisioners.

The first is a Karpenter provisioner for Spark driver pods to run on EC2 On-Demand Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ondemand
spec:
  ttlSecondsUntilExpired: 2592000

  ttlSecondsAfterEmpty: 30

  labels:
    karpenter.sh/capacity-type: on-demand

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider:
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

The second is a Karpenter provisioner for Spark executor pods to run on EC2 Spot Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 2592000

  ttlSecondsAfterEmpty: 30

  labels:
    karpenter.sh/capacity-type: spot

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider:
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

Note the requirements section of the provisioner config. There, we use the well-known labels for Amazon EKS and Karpenter to add constraints on how Karpenter launches nodes. We add constraints such that if a pod is looking for the label karpenter.sh/capacity-type: spot, it uses this provisioner to launch an EC2 Spot Instance only in Availability Zone us-west-2b. Similarly, we follow the same constraint for the karpenter.sh/capacity-type: on-demand label. We can be more granular and provide EC2 instance types in our provisioner, and they can be of different vCPU and memory ratios, giving you more flexibility and adding resiliency to your application. Karpenter launches nodes only when both the provisioner's and the pod's requirements are met. To learn more about the Karpenter provisioner API, refer to Provisioner API.

In the next step, we define pod requirements and align them with what we defined in Karpenter's provisioners.

Submit a Spark job using a pod template

In Kubernetes, labels are key-value pairs that are attached to objects, such as pods. Labels are intended to specify identifying attributes of objects that are meaningful and relevant to users. You can constrain a pod so that it can only run on a particular set of nodes. There are several ways to do this, and the recommended approaches all use label selectors to facilitate the selection.

Beginning with Amazon EMR versions 5.33.0 and 6.3.0, EMR on EKS supports Spark's pod template feature. We use pod templates to add specific labels where Spark driver and executor pods should be launched.

Create a pod template file for a Spark driver pod and save it in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as the Spark driver container

Create a pod template file for a Spark executor pod and save it in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor # This will be interpreted as the Spark executor container

Pod templates provide different fields to manage job scheduling. For additional details, refer to Pod template fields. Note the nodeSelector for the Spark driver and executor pods, which matches the labels we defined with the Karpenter provisioners.

For a sample Spark job, we use the following code, which creates multiple parallel threads and waits a few seconds:

cat << EOF > threadsleep.py
import sys
from time import sleep
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("threadsleep").getOrCreate()
def sleep_for_x_seconds(x): sleep(x*20)
sc = spark.sparkContext
sc.parallelize(range(1, 6), 5).foreach(sleep_for_x_seconds)
spark.stop()
EOF

Copy the sample Spark job into your S3 bucket:

aws s3 mb s3://<YourS3Bucket>
aws s3 cp threadsleep.py s3://<YourS3Bucket>

Before we submit the Spark job, let's get the required values of the EMR virtual cluster ID and the Amazon EMR job execution role ARN:

export S3blogbucket=s3://<YourS3Bucket>
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --region us-west-2 --output text)

export EMR_ROLE_ARN=$(aws iam get-role --role-name blog-emrJobExecutionRole --query Role.Arn --region us-west-2 --output text)

To enable the pod template feature with EMR on EKS, use configuration-overrides to specify the Amazon S3 paths to the pod templates:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-threadsleep-single-az \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-5.33.0-latest \
--region us-west-2 \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "'${S3blogbucket}'/threadsleep.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=2"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.driver.memory": "1G",
          "spark.kubernetes.driver.podTemplateFile": "'${S3blogbucket}'/spark_driver_podtemplate.yaml",
          "spark.kubernetes.executor.podTemplateFile": "'${S3blogbucket}'/spark_executor_podtemplate.yaml"
         }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-on-eks/emreksblog",
        "logStreamNamePrefix": "threadsleep"
      },
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3blogbucket"'/logs/"
      }
    }
}'

In the Spark job, we're requesting two cores for the Spark driver and one core each for the Spark executor pods. Because we only had a single EC2 instance in our managed node group, Karpenter looks at the un-schedulable Spark driver pods and uses the on-demand provisioner to launch EC2 On-Demand Instances for Spark driver pods in us-west-2b. Similarly, when the Spark executor pods are in pending state because there are no Spot Instances, Karpenter launches Spot Instances in us-west-2b.

This way, Karpenter optimizes your costs by starting from zero Spot and On-Demand Instances and only creating them dynamically when required. Additionally, Karpenter batches pending pods and then binpacks them based on the CPU, memory, and GPUs required, taking into account node overhead, VPC CNI resources required, and daemon sets that will be packed when bringing up a new node. This makes sure you're efficiently utilizing your resources with the least wastage.

Clean up

Don't forget to clean up the resources you created to avoid any unnecessary charges.

  1. Delete all the virtual clusters that you created:
    # List all the virtual cluster ids
    aws emr-containers list-virtual-clusters
    # Delete a virtual cluster by passing its virtual cluster id
    aws emr-containers delete-virtual-cluster --id <virtual-cluster-id>
    

  2. Delete the Amazon EKS cluster:
    eksctl delete cluster emr-on-eks-blog-cluster

  3. Delete the blog-emrJobExecutionRole role and its policies.

Conclusion

In this post, we saw how to create an Amazon EKS cluster, configure Amazon EKS managed node groups, create an EMR virtual cluster on Amazon EKS, and submit Spark jobs. Using pod templates, we saw how to ensure Spark workloads are scheduled in the same Availability Zone and utilize Spot Instances with Karpenter auto scaling to reduce costs and optimize your Spark workloads.

To get started, try out the EMR on EKS workshop.


About the author

Jamal Arif is a Solutions Architect at AWS and a containers specialist. He helps AWS customers in their modernization journey to build innovative, resilient, and cost-effective solutions. In his spare time, Jamal enjoys spending time outdoors with his family, hiking, and mountain biking.