Monte Carlo and Databricks Partner to Help Companies Build More Reliable Data Lakehouses


This is a collaborative post between Monte Carlo and Databricks. We thank Matt Sulkis, Head of Partnerships, Monte Carlo, for his contributions.

 
As companies increasingly leverage data-driven insights to innovate and maintain their competitive edge, it is essential that this data is accurate and reliable. With Monte Carlo and Databricks’ partnership, teams can trust their data through end-to-end data observability across their lakehouse environments.

Has your CTO ever told you that the numbers in a report you showed her seemed way off?

Has a data scientist ever pinged you when a critical Spark job didn’t run?

What about a rise in a field’s null rate that went unnoticed for days or even weeks, until it caused a significant error in an ML model downstream?

You’re not alone if you answered yes to any of these questions. Data downtime, in other words, periods of time when data is missing, inaccurate, or otherwise erroneous, is an all-too-familiar reality for even the best data teams. It costs millions of dollars in wasted revenue and up to 50 percent of a data engineering team’s time that could otherwise be spent building data products and ML models that move the needle for the business.

To help companies accelerate the adoption of more reliable data products, Monte Carlo and Databricks are excited to announce our partnership, bringing end-to-end data observability and data quality automation tools to the data lakehouse. Data engineering and analytics teams that depend on Databricks to derive critical insights about their business and build ML models can now leverage the power of automated data observability and monitoring to prevent bad data from affecting downstream consumers.

Achieving reliable Databricks pipelines with data observability

With our new partnership and recent integration, Monte Carlo provides full, end-to-end coverage across data lake and lakehouse environments powered by Databricks.

Over the past few years, Databricks has established the lakehouse category, revolutionizing how organizations store and process their data at unprecedented scale across nearly infinite use cases. Cloud data lakes like Delta Lake have become so powerful (and popular) that, according to Mordor Intelligence, the data lake market is expected to grow from $3.74 billion in 2020 to $17.60 billion by 2026, a compound annual growth rate of nearly 30%.

Monte Carlo itself is built on the Databricks Lakehouse Platform, enabling our data and engineering teams to build and train our anomaly detection models at unprecedented speed and scale. Building on top of Databricks has allowed us to focus on our core value of improving the observability and quality of data for our customers while leveraging the automation, infrastructure management, and analytics scaling tools of the lakehouse. This makes our resources more efficient and better able to serve our customers’ data quality needs. As our business grows, we’re confident it will scale with Databricks and increase the value of our core offering.

Now, with Monte Carlo and Databricks’ partnership, data teams can ensure that these investments leverage reliable, accurate data at every stage of the pipeline.

“As data pipelines become increasingly complex and companies ingest more and more data, often from third-party sources, it’s paramount that this data is reliable,” said Barr Moses, co-founder and CEO of Monte Carlo. “Monte Carlo is excited to partner with Databricks to help companies trust their data through end-to-end data observability across their lakehouse.”

With Monte Carlo, data teams get complete Databricks Lakehouse Platform coverage no matter the metastore.

Coupled with our new Databricks Unity Catalog and Delta Lake integrations, this partnership will make it easier for organizations to take full advantage of Monte Carlo’s data quality monitoring, alerting, and root cause analysis functionality. At the same time, Monte Carlo customers will benefit from Databricks’ speed, scale, and flexibility. With Databricks, analytics or machine learning tasks that previously took hours or even days to complete can now be delivered in minutes, making it faster and more scalable to build impactful data products for the business.
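To make the kind of monitoring described above concrete, here is a minimal sketch, assuming a hypothetical Unity Catalog table named main.analytics.orders, of how freshness and volume signals can be read from Delta Lake metadata. It uses standard Delta Lake commands only; it is illustrative and is not Monte Carlo’s actual implementation.

```python
# A minimal sketch of freshness and volume signals from Delta Lake
# metadata. The table name is hypothetical, and this is not Monte
# Carlo's API -- just the kind of metadata such tooling builds on.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
TABLE = "main.analytics.orders"  # hypothetical Unity Catalog table

# Freshness: timestamp of the most recent write in the Delta log.
last_write = (
    spark.sql(f"DESCRIBE HISTORY {TABLE}")
    .agg(F.max("timestamp"))
    .first()[0]
)

# Volume: current row count. An observability platform would track
# this metric over time and alert on deviations from a learned
# baseline, rather than relying on a fixed rule like the one below.
row_count = spark.table(TABLE).count()

print(f"Last write: {last_write}, current rows: {row_count}")
if row_count == 0:
    print("ALERT: table is empty -- possible data downtime")
```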

Here’s how teams on Databricks and Monte Carlo can benefit from our strategic partnership:

  • Achieve end-to-end data observability across your Databricks Lakehouse Platform without writing code. Get full, automated coverage across your data pipelines with a low-code implementation process. Access out-of-the-box visibility into data Freshness, Volume, Distribution, Schema, and Lineage by plugging Monte Carlo into your lakehouse.
  • Know when data breaks, as soon as it happens. Monte Carlo continuously monitors your Databricks assets and proactively alerts stakeholders to data issues. Monte Carlo’s machine learning-first approach gives data teams full coverage for freshness, volume, and schema changes, while opt-in distribution monitors and business-context-specific checks layered on top ensure you’re covered at every stage of your data pipeline.
  • Find the root cause of data quality issues fast. Pre-built machine learning-based monitoring and anomaly detection save time and resources, giving teams a single pane of glass to investigate and resolve data issues. By bringing all the information and context about your pipelines into one place, teams spend less time firefighting data issues and more time innovating for the business.
  • Immediately understand the business impact of bad data. With end-to-end Spark lineage on top of Unity Catalog, covering your pipelines from the point data enters Databricks (or further upstream!) all the way down to the business intelligence layer, data teams can triage and assess the business impact of their data issues, reducing risk and improving productivity throughout the organization (see the lineage sketch after this list).
  • Prevent data downtime. Give your teams full visibility into your Databricks pipelines and how they impact downstream reports and dashboards to make more informed development decisions. With Monte Carlo, teams can better manage breaking changes to ELTs, Spark models, and BI assets by identifying what’s impacted and who to notify.
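As referenced in the lineage bullet above, here is a hypothetical sketch of assessing downstream impact with Unity Catalog’s lineage system table, system.access.table_lineage, assuming system tables are enabled in your account. The table name is illustrative, and this is not Monte Carlo’s implementation.

```python
# A hypothetical sketch: find every table that reads from a given
# source table via Unity Catalog's lineage system table, so owning
# teams can be notified before a breaking change lands.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.analytics.orders'
      AND target_table_full_name IS NOT NULL
""")
downstream.show(truncate=False)
```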

In addition to supporting existing mutual customers, Monte Carlo provides end-to-end, automated coverage for teams migrating from their legacy stacks to the Databricks Lakehouse Platform. Moreover, Monte Carlo’s security-first approach to data observability ensures that data never leaves your Databricks Lakehouse Platform.

Monte Carlo can automatically monitor and alert for data schema, volume, freshness, and distribution anomalies within the Databricks Lakehouse Platform.

What our mutual customers have to say

Monte Carlo and Databricks customers like ThredUp, a leading online consignment marketplace, and Ibotta, a global cashback and rewards app, are excited to leverage the new Delta Lake and Unity Catalog integrations to improve data reliability at scale across their lakehouse environments.

ThredUp’s data engineering teams leverage Monte Carlo’s capabilities to understand where and how their data breaks in real time. The solution has enabled ThredUp to immediately identify bad data before it impacts the business, saving them time and resources otherwise spent manually firefighting data downtime.

“With Monte Carlo, my team is better positioned to understand the impact of a detected data issue and decide on next steps like stakeholder communication and resource prioritization. Monte Carlo’s end-to-end lineage helps the team draw the connections between critical data tables and the Looker reports, dashboards, and KPIs the company relies on to make business decisions,” said Satish Rane, Head of Data Engineering, ThredUp. “I’m excited to leverage Monte Carlo’s data observability for our Databricks environment.”

At Ibotta, Head of Data Jeff Hepburn and his team rely on Monte Carlo to deliver end-to-end visibility into the health of their data pipelines, starting with ingestion in Databricks all the way down to the business intelligence layer.

“Data-driven decision making is a huge priority for Ibotta, but our analytics are only as reliable as the data that informs them. With Monte Carlo, my team has the tools to detect and resolve data incidents before they affect downstream stakeholders, and their end-to-end lineage helps us understand the inner workings of our data ecosystem so that if issues arise, we know where and how to fix them,” said Jeff Hepburn, Head of Data, Ibotta. “I’m excited to leverage Monte Carlo’s data observability with Databricks.”

Pioneering the future of data observability for data lakes

These updates enable teams to leverage Databricks for data engineering, data science, and machine learning use cases while preventing data downtime at scale.

When it comes to ensuring data reliability on the lakehouse, Monte Carlo and Databricks are better together. For more details on how to set up these integrations, see our documentation.