Tuesday, August 16, 2022
HomeBig DataHow SumUp constructed a low-latency characteristic retailer utilizing Amazon EMR and Amazon...

How SumUp constructed a low-latency characteristic retailer utilizing Amazon EMR and Amazon Keyspaces

This submit was co-authored by Vadym Dolin, Information Architect at SumUp. In their very own phrases, SumUp is a number one monetary know-how firm, working throughout 35 markets on three continents. SumUp helps small companies achieve success by enabling them to simply accept card funds in-store, in-app, and on-line, in a easy, safe, and cost-effective method. In the present day, SumUp card readers and different monetary merchandise are utilized by greater than 4 million retailers world wide.

The SumUp Engineering crew is dedicated to creating handy, impactful, and safe monetary merchandise for retailers. To satisfy this imaginative and prescient, SumUp is more and more investing in synthetic intelligence and machine studying (ML). The inner ML platform in SumUp permits groups to seamlessly construct, deploy, and function ML options at scale.

One of many central components of SumUp’s ML platform is the net characteristic retailer. It permits a number of ML fashions to retrieve characteristic vectors with single-digit millisecond latency, and permits utility of AI for latency-critical use instances. The platform processes a whole lot of transactions each second, with quantity spikes throughout peak hours, and has regular progress that doubles the variety of transactions yearly. Due to this, the ML platform requires its low-latency characteristic retailer to be additionally extremely dependable and scalable.

On this submit, we present how SumUp constructed a millisecond-latency characteristic retailer. We additionally talk about the architectural concerns when establishing this answer so it could possibly scale to serve a number of use instances, and current outcomes showcasing the setups efficiency.

Overview of answer

To coach ML fashions, we’d like historic information. Throughout this part, information scientists experiment with completely different options to check which of them produce the perfect mannequin. From a platform perspective, we have to help bulk learn and write operations. Learn latency isn’t crucial at this stage as a result of the information is learn into coaching jobs. After the fashions are skilled and moved to manufacturing for real-time inference, we’ve the next necessities for the platform change: we have to help low-latency reads and use solely the newest options information.

To satisfy these wants, SumUp constructed a characteristic retailer consisting of offline and on-line information shops. These had been optimized for the necessities as described within the following desk.

Information RetailerHistorical past NecessitiesML Workflow NecessitiesLatency NecessitiesStorage NecessitiesThroughput NecessitiesStorage Medium
OfflineTotal Historical pastCoachingNot necessaryPrice-effective for big volumesBulk learn and writesAmazon S3
On-lineSolely the newest OptionsInferenceSingle-digit millisecondNot necessaryLearn optimizedAmazon Keyspaces

Amazon Keyspaces (for Apache Cassandra) is a serverless, scalable, and managed Apache Cassandra–appropriate database service. It’s constructed for constant, single-digit-millisecond response occasions at scale. SumUp makes use of Amazon Keyspaces as a key-value pair retailer, and these options make it appropriate for his or her on-line characteristic retailer. Delta Lake is an open-source storage layer that helps ACID transactions and is totally appropriate with Apache Spark, making it extremely performant at bulk learn and write operations. You’ll be able to retailer Delta Lake tables on Amazon Easy Storage Service (Amazon S3), which makes it a very good match for the offline characteristic retailer. Information scientists can use this stack to coach fashions towards the offline characteristic retailer (Delta Lake). When the skilled fashions are moved to manufacturing, we change to utilizing the net characteristic retailer (Amazon Keyspaces), which provides the newest options set, scalable reads, and far decrease latency.

One other necessary consideration is that we write a single characteristic job to populate each characteristic shops. In any other case, SumUp must preserve two units of code or pipelines for every characteristic creation job. We use Amazon EMR and create the options utilizing PySpark DataFrames. The identical DataFrame is written to each Delta Lake and Amazon Keyspaces, which eliminates the hurdle of getting separate pipelines.

Lastly, SumUp wished to make the most of managed companies. It was necessary to SumUp that information scientists and information engineers focus their efforts on constructing and deploying ML fashions. SumUp had experimented with managing their very own Cassandra cluster, and located it tough to scale as a result of it required specialised experience. Amazon Keyspaces supplied scalability with out administration and upkeep overhead. For operating Spark workloads, we determined to make use of Amazon EMR. Amazon EMR makes it straightforward to provision new clusters and routinely or manually add and take away capability as wanted. You too can outline a customized coverage for auto scaling the cluster to fit your wants. Amazon EMR model 6.0.0 and above helps Spark model 3.0.0, which is appropriate with Delta Lake.

It took SumUp 3 months from testing out AWS companies to constructing a production-grade characteristic retailer able to serving ML fashions. On this submit we share a simplified model of the stack, consisting of the next elements:

  • S3 bucket A – Shops the uncooked information
  • EMR cluster – For operating PySpark jobs for populating the characteristic retailer
  • Amazon Keyspaces feature_store – Shops the net options desk
  • S3 Bucket B – Shops the Delta Lake desk for offline options
  • IAM position feature_creator – For operating the characteristic job with the suitable permissions
  • Pocket book occasion – For operating the characteristic engineering code

We use a simplified model of the setup to make it straightforward to observe the code examples. SumUp information scientists use Jupyter notebooks for exploratory evaluation of the information. Function engineering jobs are deployed utilizing an AWS Step Capabilities state machine, which consists of an AWS Lambda perform that submits a PySpark job to the EMR cluster.

The next diagram illustrates our simplified structure.


To observe the answer, you want sure entry rights and AWS Id and Entry Administration (IAM) privileges:

  • An IAM person with AWS Command Line Interface (AWS CLI) entry to an AWS account
  • IAM privileges to do the next:
    • Generate Amazon Keyspaces credentials
    • Create a keyspace and desk
    • Create an S3 bucket
    • Create an EMR cluster
    • IAM Get Function

Arrange the dataset

We begin by cloning the challenge git repository, which comprises the dataset we have to place in bucket A. We use an artificial dataset, beneath Information/daily_dataset.csv. This dataset consists of vitality meter readings for households. The file comprises data just like the variety of measures, minimal, most, imply, median, sum, and std for every family every day. To create an S3 bucket (in case you don’t have already got one) and add the information file, observe these steps:

  1. Clone the challenge repository domestically by operating the shell command:
    git clone https://github.com/aws-samples/amazon-keyspaces-emr-featurestore-kit.git

  2. On the Amazon S3 console, select Create bucket.
  3. Give the bucket a reputation. For this submit, we use featurestore-blogpost-bucket-xxxxxxxxxx (it’s useful to append the account quantity to the bucket title to make sure the title is exclusive for frequent prefixes).
  4. Select the Area you’re working in.
    It’s necessary that you simply create all assets in the identical Area for this submit.
  5. Public entry is blocked by default, and we suggest that you simply preserve it that method.
  6. Disable bucket versioning and encryption (we don’t want it for this submit).
  7. Select Create bucket.
  8. After the bucket is created, select the bucket title and drag the folders Dataset and EMR into the bucket.

Arrange Amazon Keyspaces

We have to generate credentials for Amazon Keyspaces, which we use to attach with the service. The steps for producing the credentials are as follows:

  1. On the IAM console, select Customers within the navigation pane.
  2. Select an IAM person you wish to generate credentials for.
  3. On the Safety credentials tab, beneath Credentials for Amazon Keyspaces (for Apache Cassandra), select Generate Credentials.
    A pop-up seems with the credentials, and an choice to obtain the credentials. We suggest downloading a replica since you received’t have the ability to view the credentials once more.We additionally must create a desk in Amazon Keyspaces to retailer our characteristic information. We now have shared the schema for the keyspace and desk within the GitHub challenge information Keyspaces/keyspace.cql and Keyspaces/Table_Schema.cql.
  4. On the Amazon Keyspaces console, select CQL editor within the navigation pane.
  5. Enter the contents of the file Keyspaces/Keyspace.cql within the editor and select Run command.
  6. Clear the contents of the editor, enter the contents of Keyspaces/Table_Schema.cql, and select Run command.

Desk creation is an asynchronous course of, and also you’re notified if the desk is efficiently created. You too can view it by selecting Tables within the navigation pane.

Arrange an EMR cluster

Subsequent, we arrange an EMR cluster so we will run PySpark code to generate options. First, we have to arrange a belief retailer password. A truststore file comprises the Utility Server’s trusted certificates, together with public keys for different entities, this file is generated by the offered script and we have to present a password for safeguarding this file. Amazon Keyspaces offers encryption in transit and at relaxation to guard and safe information transmission and storage, and makes use of Transport Layer Safety (TLS) to assist safe connections with shoppers. To connect with Amazon Keyspaces utilizing TLS, we have to obtain an Amazon digital certificates and configure the Python driver to make use of TLS. This certificates is saved in a belief retailer; after we retrieve it, we have to present the right password.

  1. Within the file EMR/emr_bootstrap_script.sh, replace the next line to a password you wish to use:
    # Create a JKS keystore from the certificates

  2. To level the bootstrap script to the one we uploaded to Amazon S3, replace the next line to mirror the S3 bucket we created earlier:
    # Copy the Cassandra Connector config
    aws s3 cp s3://{your-s3-bucket}/EMR/app.config /house/hadoop/app.config

  3. To replace the app.config file to mirror the right belief retailer password, within the file EMR/app.config, replace the worth for truststore-password to the worth you set earlier:
        ssl-engine-factory {
          class = DefaultSslEngineFactory
          truststore-path = "/house/hadoop/.certs/cassandra_keystore.jks"
          truststore-password = "{your_password_here}"

  4. Within the file EMR/app.config, replace the next strains to mirror the Area and the person title and password generated earlier:
    contact-points = ["cassandra.<your-region>.amazonaws.com:9142"]
    load-balancing-policy.local-datacenter = <your-region>
    auth-provider {
        class = PlainTextAuthProvider
        username = "{your-keyspace-username}"
        password = "{your-keyspace-password}"

    We have to create default occasion roles, that are wanted to run the EMR cluster.

  5. Replace the contents S3 bucket created within the pre-requisite part by dragging the EMR folder into the bucket once more.
  6. To create the default roles, run the create-default-roles command:
    aws emr create-default-roles

    Subsequent, we create an EMR cluster. The next code snippet is an AWS CLI command that has Hadoop, Spark 3.0, Livy and JupyterHub put in. This additionally runs the bootstrapping script on the cluster to arrange the connection to Amazon Keyspaces.

  7. Create the cluster with the next code. Present the subnet ID to begin a Jupyter pocket book occasion related to this cluster, the S3 bucket you created earlier, and the Area you’re working in. You’ll be able to present the default Subnet, and to search out this navigate to VPC>Subnets and replica the default subnet id.
    aws emr create-cluster --termination-protected --applications Identify=Hadoop Identify=Spark Identify=Livy Identify=Hive Identify=JupyterHub --tags 'creator=feature-store-blogpost' --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"your-subnet-id"}' --service-role EMR_DefaultRole --release-label emr-6.1.0 --log-uri 's3n://{your-s3-bucket}/elasticmapreduce/' --name 'emr_feature_store' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","Identify":"Core - 2"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Identify":"Grasp - 1"}]' --bootstrap-actions '[{"Path":"s3://{your-s3-bucket HERE}/EMR/emr_bootstrap_script.sh","Name":"Execute_bootstarp_script"}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region your-region

    Lastly, we create an EMR pocket book occasion to run the PySpark pocket book Function Creation and loading-notebook.ipynb (included within the repo).

  8. On the Amazon EMR console, select Notebooks within the navigation pane.
  9. Select Create pocket book.
  10. Give the pocket book a reputation and select the cluster emr_feature_store.
  11. Optionally, configure the extra settings.
  12. Select Create pocket book.It will possibly take a couple of minutes earlier than the pocket book occasion is up and operating.
  13. When the pocket book is prepared, choose the pocket book and select both Open JupyterLab or Open Jupyter.
  14. Within the pocket book occasion import, open the pocket book Function Creation and loading-notebook.ipynb (included within the repo) and alter the kernel to PySpark.
  15. Comply with the directions within the pocket book and run the cells one after the other to learn the information from Amazon S3, create options, and write these to Delta Lake and Amazon Keyspaces.

Efficiency testing

To check throughput for our on-line characteristic retailer, we run a simulation on the options we created. We simulate roughly 40,000 requests per second. Every request queries information for a selected key (an ID in our characteristic desk). The method duties do the next:

  • Initialize a connection to Amazon Keyspaces
  • Generate a random ID to question the information
  • Generate a CQL assertion:
    SELECT * FROM feature_store.energy_data_features WHERE id=[list_of_ids[random_index between 0-5559]];

  • Begin a timer
  • Ship the request to Amazon Keyspaces
  • Cease the timer when the response from Amazon Keyspaces is obtained

To run the simulation, we begin 245 parallel AWS Fargate duties operating on Amazon Elastic Container Service (Amazon ECS). Every activity runs a Python script that makes 1 million requests to Amazon Keyspaces. As a result of our dataset solely comprises 5,560 distinctive IDs, we generate 1 million random numbers between 0–5560 initially of the simulation and question the ID for every request. To run the simulation, we included the code within the folder Simulation. You’ll be able to run the simulation in a SageMaker pocket book occasion by finishing the next steps:

  1. On the Amazon SageMaker console, create a SageMaker pocket book occasion (or use an present one).You’ll be able to select an ml.t3.massive occasion.
  2. Let SageMaker create an execution position for you in case you don’t have one.
  3. Open the SageMaker pocket book and select Add.
  4. Add the Simulation folder from the repository. Alternatively, open a terminal window on the pocket book occasion and clone the repository https://github.com/aws-samples/amazon-keyspaces-emr-featurestore-kit.git.
  5. Comply with the directions and run the steps and cells within the Simulation/ECS_Simulation.ipynb pocket book.
  6. On the Amazon ECS console, select the cluster you provisioned with the pocket book and select the Duties tab to watch the duties.

Every activity writes the latency figures to a file and strikes this to an S3 location. When the simulation ends, we acquire all the information to get aggregated stats and plot charts.

In our setup, we set the capability mode for Amazon Keyspaces to Provisioned RCU (learn capability items) at 40000 (fastened). After we begin the simulation, the RCU rise near 40000. After we begin the simulation, the RCU (learn capability items) rise near 40000, and the simulation takes round an hour to complete, as illustrated within the following visualization.

The primary evaluation we current is the latency distribution for the 245 million requests made throughout the simulation. Right here the 99% percentile falls inside single-digit millisecond latency, as we’d count on.

QuantileLatency (ms)

For the second evaluation, we current the next time collection charts for latency. The chart on the backside exhibits the uncooked latency figures from all of the 245 employees. The chart above that plots the common and minimal latency throughout all employees grouped over 1-second intervals. Right here we will see each the minimal and the common latency all through the simulation stays under 10 milliseconds. The third chart from the underside plots most latency throughout all employees grouped over 1-second intervals. This chart exhibits occasional spikes in latency however nothing constant we have to fear about. The highest two charts are latency distributions; the one on the left plots all the information, and the one on the fitting plots the 99.9% percentile. Because of the presence of some outliers, the chart on the left exhibits a peak near zero and a really tailed distribution. After we take away these outliers, we will see within the chart on the fitting that 99.9% of requests are accomplished in lower than 5.5 milliseconds. It is a nice consequence, contemplating we despatched 245 million requests.


A number of the assets we created on this blogpost would incur prices if left operating. Bear in mind to terminate the EMR cluster, empty the S3 bucket and delete it, delete the Amazon KeySpaces desk. Additionally delete the SageMaker and Amazon EMR notebooks. The Amazon ECS cluster is billed on duties and wouldn’t incur any extra prices.


Amazon EMR, Amazon S3, and Amazon Keyspaces present a versatile and scalable growth expertise for characteristic engineering. EMR clusters are straightforward to handle, and groups can share environments with out compromising compute and storage capabilities. EMR bootstrapping makes it straightforward to put in and take a look at out new instruments and rapidly spin up environments to check out new concepts. Having the characteristic retailer cut up into offline and on-line retailer simplifies mannequin coaching and deployment, and offers efficiency advantages.

In our testing, Amazon Keyspaces was in a position to deal with peak throughput learn requests inside our desired requirement of single digit latency. It’s additionally value mentioning that we discovered the on-demand mode to adapt to the utilization sample and an enchancment in learn/write latency a few days from when it was switched on.

One other necessary consideration to make for latency-sensitive queries is row size. In our testing, tables with decrease row size had decrease learn latency. Due to this fact, it’s extra environment friendly to separate the information into a number of tables and make asynchronous calls to retrieve it from a number of tables.

We encourage you to discover including safety features and adopting safety greatest practices in line with your wants and potential firm requirements.

In the event you discovered this submit helpful, take a look at Loading information into Amazon Keyspaces with cqlsh for tips about how one can tune Amazon Keyspaces, and Orchestrate Apache Spark purposes utilizing AWS Step Capabilities and Apache Livy on how one can construct and deploy PySpark jobs.

Concerning the authors

Shaheer Mansoor is a Information Scientist at AWS. His focus is on constructing machine studying platforms that may host AI options at scale. His curiosity areas are ML Ops, Function Shops, Mannequin Internet hosting and Mannequin Monitoring.

Vadym Dolinin is a Machine Studying Architect in SumUp. He works with a number of groups on crafting the ML platform, which permits information scientists to construct, deploy, and function machine studying options in SumUp. Vadym has 13 years of expertise within the domains of knowledge engineering, analytics, BI, and ML.

Oliver Zollikofer is a Information Scientist at AWS. He permits world enterprise prospects to construct and deploy machine studying fashions, in addition to architect associated cloud options.



Please enter your comment!
Please enter your name here

Most Popular