Monday, August 8, 2022
HomeBig DataFraud Detection with Cloudera Stream Processing Half 1

Fraud Detection with Cloudera Stream Processing Half 1

In a earlier weblog of this collection, Turning Streams Into Knowledge Merchandise, we talked concerning the elevated want for decreasing the latency between knowledge era/ingestion and producing analytical outcomes and insights from this knowledge. We mentioned how Cloudera Stream Processing (CSP) with Apache Kafka and Apache Flink could possibly be used to course of this knowledge in actual time and at scale. On this weblog we are going to present an actual instance of how that’s finished, how we will use CSP to carry out real-time fraud detection.

Constructing real-time streaming analytics knowledge pipelines requires the flexibility to course of knowledge within the stream. A essential prerequisite for in-stream processing is having the aptitude to gather and transfer the information as it’s being generated on the level of origin. That is what we name the first-mile downside. This weblog shall be printed in two elements. Partially one we are going to look into how Cloudera DataFlow powered by Apache NiFi solves the first-mile downside by making it simple and environment friendly to purchase, rework, and transfer knowledge in order that we will allow streaming analytics use instances with little or no effort. We may also briefly focus on the benefits of operating this circulate in a cloud-native Kubernetes deployment of Cloudera DataFlow.

Partially two we are going to discover how we will run real-time streaming analytics utilizing Apache Flink, and we are going to use Cloudera SQL Stream Builder GUI to simply create streaming jobs utilizing solely SQL language (no Java/Scala coding required). We may also use the knowledge produced by the streaming analytics jobs to feed totally different downstream programs and dashboards. 

The use case

Fraud detection is a good instance of a time-critical use case for us to discover. All of us have been by way of a state of affairs the place the small print of our bank card, or the cardboard of somebody we all know, has been compromised and illegitimate transactions had been charged to the cardboard. To attenuate the harm in that state of affairs, the bank card firm should be capable to determine potential fraud instantly in order that it will probably block the cardboard and make contact with the person to confirm the transactions and presumably difficulty a brand new card to interchange the compromised one.

The cardboard transaction knowledge normally comes from event-driven sources, the place new knowledge arrives as card purchases occur in the actual world. Apart from the streaming knowledge although, we even have conventional knowledge shops (databases, key-value shops, object shops, and many others.) containing knowledge which will have for use to counterpoint the streaming knowledge. In our use case, the streaming knowledge doesn’t include account and person particulars, so we should be a part of the streams with the reference knowledge to provide all the knowledge we have to examine towards every potential fraudulent transaction.

Relying on the downstream makes use of of the knowledge produced we could must retailer the information in numerous codecs: produce the checklist of potential fraudulent transactions to a Kafka subject in order that notification programs can motion them immediately; save statistics in a relational or operational dashboard, for additional analytics or to feed dashboards; or persist the stream of uncooked transactions to a sturdy long-term storage for future reference and extra analytics.

Our instance on this weblog will use the performance inside Cloudera DataFlow and CDP to implement the next:

  1. Apache NiFi in Cloudera DataFlow will learn a stream of transactions despatched over the community.
  2. For every transaction, NiFi makes a name to a manufacturing mannequin in Cloudera Machine Studying (CML) to attain the fraud potential of the transaction.
  3. If the fraud rating is above a sure threshold, NiFi instantly routes the transaction to a Kafka subject that’s subscribed by notification programs that can set off the suitable actions.
  4. The scored transactions are written to the Kafka subject that can feed the real-time analytics course of that runs on Apache Flink.
  5. The transaction knowledge augmented with the rating can be continued to an Apache Kudu database for later querying and feed of the fraud dashboard.
  6. Utilizing SQL Stream Builder (SSB), we use steady streaming SQL to investigate the stream of transactions and detect potential fraud based mostly on the geographical location of the purchases.
  7. The recognized fraudulent transactions are written to a different Kafka subject that feeds the system that can take the required actions.
  8. The streaming SQL job additionally saves the fraud detections to the Kudu database.
  9. A dashboard feeds from the Kudu database to indicate fraud abstract statistics.

Buying with Cloudera DataFlow

Apache NiFi is a element of Cloudera DataFlow that makes it simple to amass knowledge in your use instances and implement the required pipelines to cleanse, rework, and feed your stream processing workflows. With greater than 300 processors accessible out of the field, it may be used to carry out common knowledge distribution, buying and processing any kind of knowledge, from and to nearly any kind of supply or sink.

On this use case we created a comparatively easy NiFi circulate that implements all of the operations from steps one by way of 5 above, and we are going to describe these operations in additional element under.

In our use case, we’re processing monetary transaction knowledge from an exterior agent. This agent is sending every transaction because it occurs to a community handle. Every transaction comprises the next info:

  • The transaction time stamp
  • The ID of the related account
  • A singular transaction ID
  • The transaction quantity
  • The geographical coordinates of the place the transaction occurred (latitude and longitude)

The transaction message is in JSON format as seems like the instance under:


  "ts": "2022-06-21 11:17:26",

  "account_id": "716",

  "transaction_id": "e933787c-f0ff-11ec-8cad-acde48001122",

  "quantity": 1926,

  "lat": -35.40439536601375,

  "lon": 174.68080620053922


NiFi is ready to create community listeners to obtain knowledge coming over the community. For this instance we will merely drag and drop a ListenUDP processor into the NiFi canvas and configure it with the specified port. It’s potential to parameterize the configuration of processors to make flows reusable. On this case we outlined a parameter referred to as #{enter.udp.port}, which we will later set to the precise port we’d like.


Describing the information with a schema

A schema is a doc that describes the construction of the information. When sending and receiving knowledge throughout a number of functions in your surroundings and even processors in a NiFi circulate, it’s helpful to have a repository the place the schema for all various kinds of knowledge are centrally managed and saved. This makes it simpler for functions to speak to one another.

Cloudera Knowledge Platform (CDP) comes with a Schema Registry service. For our pattern use case, now we have saved the schema for our transaction knowledge within the Schema Registry service and have configured our NiFi circulate to make use of the right schema identify. NiFi is built-in with Schema Registry and it’ll mechanically hook up with it to retrieve the schema definition each time wanted all through the circulate.

The trail that the information takes in a NiFi circulate is decided by visible connections between the totally different processors. Right here, for instance, the information acquired beforehand by the ListenUDP processor is “tagged” with the identify of the schema that we need to use: “transaction.”

Scoring and routing transactions

We educated and constructed a machine studying (ML) mannequin utilizing Cloudera Machine Studying (CML) to attain every transaction in accordance with their potential to be fraudulent. CML supplies a service with a REST endpoint that we will use to carry out scoring. As the information flows by way of the NiFi knowledge circulate, we need to name the ML mannequin service for knowledge factors to get the fraud rating for every one in every of them.

We use the NiFi’s LookupRecord for this, which permits lookups towards a REST service. The response from the CML mannequin comprises a fraud rating, represented by an actual quantity between zero and one.

The output of the LookupRecord processor, which comprises the unique transaction knowledge merged with the response from the ML mannequin, was then linked to a really helpful processor in NiFi: the QueryRecord processor.

The QueryRecord processor means that you can outline a number of outputs for the processor and affiliate a SQL question with every of them. It applies the SQL question to the information that’s streaming by way of the processor and sends the outcomes of every question to the related output.

On this circulate we outlined three SQL queries to run concurrently on this processor:


Observe that some processors additionally outline further outputs, like “failure,” “retry,” and many others., so that you could outline your individual error-handling logic in your flows.

Feeding streams to different programs

At this level of the circulate now we have already enriched our stream with the ML mannequin’s fraud rating and remodeled the streams in accordance with what we’d like downstream. All that’s left to finish our knowledge ingestion is to ship the information to Kafka, which we are going to use to feed our real-time analytical course of, and save the transactions to a Kudu desk, which we’ll later use to feed our dashboard, in addition to for different non-real-time analytical processes down the road.

Apache Kafka and Apache Kudu are additionally a part of CDP, and it’s quite simple to configure the Kafka- and Kudu-specific processors to finish the duty for us.

Working the information circulate natively on the cloud

As soon as the NiFi circulate is constructed it may be executed in any NiFi deployment you might need. Cloudera DataFlow for the Public Cloud (CDF-PC) supplies a cloud-native elastic circulate runtime that may run flows effectively.

In comparison with fixed-size NiFi clusters, the CDF’s cloud-native circulate runtime has an a variety of benefits:

  • You don’t must handle NiFi clusters. You may merely hook up with the CDF console, add the circulate definition, and execute it. The mandatory NiFi service is mechanically instantiated as a Kubernetes service to execute the circulate, transparently to the person.
  • It supplies higher useful resource isolation between flows.
  • Move executions can auto-scale up and down to make sure the correct quantity of sources to deal with the present quantity of knowledge being processed. This avoids useful resource hunger and in addition saves prices by deallocating pointless sources when they’re not used.
  • Constructed-in monitoring with user-defined KPIs that may be tailor-made to every particular circulate are totally different granularities (system, circulate, processor, connection, and many others.).

Safe inbound connections

Along with the above, configuring safe community endpoints to behave as ingress gateways is a notoriously tough downside to resolve within the cloud, and the steps fluctuate with every cloud supplier. 

It requires organising load balancers, DNS information, certificates, and keystore administration. 

CDF-PC abstracts away these complexities with the inbound connections function, which permits the person to create an inbound connection endpoint by simply offering the specified endpoint identify and port quantity.

Parameterized and customizable deployments

Upon the circulate deployment you possibly can outline parameters for the circulate execution and in addition select the scale and auto-scaling traits of the circulate:


Native monitoring and alerting

Customized KPIs might be outlined to watch the facets of the circulate which are vital to you. Alerts might be additionally outlined to generate notifications when the configured thresholds are crossed:

After the deployment the metrics collected for the outlined KPI might be monitored on the CDF dashboard:

Cloudera DataFlow additionally supplies direct entry to the NiFi canvas for the circulate so that you could examine particulars of the execution or troubleshoot points, if needed. All of the performance from the GUI can be accessible programmatically, both by way of the CDP CLI or the CDF API. The method of making and managing circulate might be absolutely automated and built-in with CD/CI pipelines.


Gathering knowledge on the level of origination because it will get generated, and shortly making it accessible on the analytical platform, is essential for the success of any mission that requires knowledge streams to be processed in actual time. On this weblog we confirmed how Cloudera DataFlow makes it simple to create, check, and deploy knowledge pipelines within the cloud.

Apache NiFi’s graphical person interface and richness of processors permits customers to create easy and complicated knowledge flows with out having to jot down code. The interactive expertise makes it very simple to check and troubleshoot flows in the course of the improvement course of.

Cloudera DataFlow’s circulate runtime provides robustness and effectivity to the execution of the flows in manufacturing in a cloud-native and elastic surroundings, which permits it to increase and shrink to accommodate the workload demand.

Within the half two of this weblog we are going to have a look at how Cloudera Stream Processing (CSP) can be utilized to finish the implementation of our fraud detection use case, performing real-time streaming analytics on the information that now we have simply ingested.

What’s the quickest option to study extra about Cloudera DataFlow and take it for a spin? First, go to our new Cloudera DataFlow house web page. Then, take our interactive product tour or join a free trial



Please enter your comment!
Please enter your name here

Most Popular