The Battle for the Customer Data Platform: Publishing My RudderStack Deal Memo

The platform shift from the traditional CRM sales & marketing stack to a CDP with a cloud warehouse.

16 min read · Oct 18, 2020

With Twilio’s acquisition of Segment, it’s a good time to publish my deal memo for RudderStack, a Customer Data Platform for Developers.

I wrote the deal memo for RudderStack’s $5M Series Seed in May 2020, and partnered up with Founder and CEO Soumyadeb Mitra on his journey to improve the lives of developers working with data.

What’s the big deal about CDPs?

CDPs operate at the intersection of data engineering and sales & marketing. As the founder of a data infrastructure company, intermix.io, where I was also responsible for building out our revenue operations, I had unique exposure to both, with my fair share of bruises and successes.

To me, it seemed clear that the clock for the traditional CRM stack with Salesforce as the dominant industry player had started ticking. All of our customers at intermix.io were trying to move their customer logic out of their CRM into a data warehouse like Snowflake or Redshift, and commoditize the CRM into a last-mile visualization layer.

If you’ve ever worked with Salesforce, then you know that the amount of “human middleware” and maintenance of 1:1 tool integrations required to make the whole thing work is insane. I was the guy in charge of sales & marketing, and experienced that pain firsthand. Of course, we also used Segment, and if you’ve ever used it, then you’ll understand why Twilio is paying $3.2B.

The RudderStack deal memo

So why RudderStack, when the market for CDPs was already well developed and had a clear leader in Segment, with a 10-year head start?

The deal memo explains the decision in detail, but in short, three reasons:

separation of storage and compute — SaaS-based CDPs like Segment become an expensive proposition with a growing number of events. RudderStack separates storage and compute, and offers a faster, cheaper and more flexible way of processing data.
open source — companies want to be in charge of their data, for control, privacy and security reasons, in particular enterprises. RudderStack is an open source alternative to traditional SaaS-based CDPs that store your data in their proprietary platforms.
a thoughtful founder with a great team — in the run up to the round, Soumyadeb and I talked a lot about trends in the data infrastructure space, go-to-market approaches and sales execution for start-ups. With every interaction, I walked away with the impression “that team is so on top of their game”.

I did not edit the deal memo for this post, but did remove some of the confidential metrics. S28 Capital led the round; they were also investors in my previous company, intermix.io.

If you ever have the chance to work with the S28 team, I can’t say enough good things about them. They are your rock in the storm during bad times, and a multiplier of force in good times. And I write about both in my corner of the Internet. 👇🏻

And with that, let’s dive into the deal memo!

Introduction

RudderLabs represents a compelling Series Seed investment opportunity. The company’s goal is to become the primary product for enterprises to build their “Customer Data Platform” (CDP), and allow data engineering, data science and analytics teams to ingest, combine and analyze data from any source in real-time.

Much of the functionality and value generated by the “traditional” sales & marketing stack, built around a CRM at the core, is shifting away from the CRM into the CDP at an accelerating pace.

CEO Soumyadeb Mitra is a 2nd time founder. With “RudderStack”, he has built a fast-growing product that rides on three major trends.

explosion of systems and (real-time streaming) data sources
cheap and scalable data warehouse / data lake architectures
continued adoption of data science across all industries, with the need for real-time data processing

The company operates with an open source model, which facilitates bottom-up adoption by the developer community. For the enterprise, open source represents a more attractive option than pure play SaaS tools, for control, security and compliance reasons.

Unlike hosted, off-the-shelf CDPs, RudderStack can run in a customer’s cloud, passing on the cost for compute and storage. For the hosted version of RudderStack, the product pipes events into the customer’s infrastructure, where all processing takes place. This way, RudderStack has up to a ~10x cost advantage. This is especially relevant with Covid-19, which I think has an acute but also lasting impact on budgets for marketing / analytics tooling.

Deal

This is a Series Seed round, after a $1.5M pre-seed round in July 2019.

S28 Capital is leading a $3.5M round, with a $[redacted]M pre-money valuation.
S28 Capital is committing $[redacted]M of the $3.5M, leaving $[redacted]M for others.
I’m proposing to invest $208K in this round.

Lead investor at S28 Capital is Shvet Jain. Shvet has a history of incubating and funding successful open source companies, e.g. Gravitational, Mattermost.

[Note: the previous $1.5M pre-seed round converted as part of this seed]

Company

“First time founders think about product, second time founders think about distribution” — Justin Kan

Founder

With RudderStack, founder and CEO Soumyadeb Mitra is building his second company.

I’ve gotten to know Soumyadeb in the past 12 months, since we operate in the same market (analytics / data science) and share seed investors (S28 Capital).
Soumyadeb’s last company Mariana was acquired by 8x8 ($EGHT) in 2018. Post-acquisition at 8x8, Soumyadeb built up the Machine Learning team.
While at 8x8, Soumyadeb built the company’s CDP and the related use cases.

Soumyadeb has a PhD in Computer Science & Database Systems from Urbana-Champaign. The experience of setting up a CDP in a large public enterprise (8x8) with a complex data stack was his motivation for starting Rudder.

Team

The team consists of 17 full time engineers, of which 9 have previously worked together.
The engineering team is based in India and Greece, which is a massive cost advantage. More than half of the team comes with engineering degrees from the IITs (top 10 engineering institutes in India).
RudderLabs hired its first sales rep in Feb 2020.

The size and productivity of the team, and how far they’ve gotten in less than a year since their $1.5M pre-seed funding in July 2019, is nothing short of amazing.

Business

RudderStack is an open source Customer Data Platform. Developers can download the source, and deploy a CDP in their own cloud environment (e.g. AWS, GCP, Azure, etc.) or even on their laptop. RudderStack also has a hosted offering, which pipes events into a customer’s data infrastructure.
Every enterprise is dealing with an explosion in SaaS tools, yet is stuck working with a legacy CRM. Data is coupled with the tools, captured in inconsistent ways, and spread across silos. At that point, the company can either lock down the tools its employees work with, or face a “bad data” problem. Either one is a bad strategy.
RudderStack unifies and simplifies data collection for a multi-tool analytics stack. It offers an analytics API that’s compatible with Segment to capture data from websites, mobile apps, SaaS tools and production databases. Employees can work with the tools of their choice, while the company has one consistent data set to work with.
RudderStack acts as a real-time layer between event data and stream processing. It’s most attractive for CTOs and engineers who work with other real-time processing engines like Spark / Databricks.

Revenue Model

RudderLabs is following a classic open source business model. My lesson learned is that pricing always is and always *should* be an ongoing learning process. For now:

Anybody can use the open source version; the paid version includes premium enterprise features (e.g. dashboards, support). Important — RudderLabs uses a single code base, so the customer doesn’t have to switch to a different installation.
The team version starts at a $2,000 monthly flat licensing fee and scales with the size of the underlying compute nodes. The enterprise version is $4,000 / month. RudderLabs also offers a hosted version. For comparison, at the time of its IPO, Elastic had a $38K ASP.
About 10% of the people who download the source proactively reach out for a phone call / discovery call. This is without any marketing / sales follow-up.
Using directional math, assuming an initial ASP of $25K, the company needs to acquire 400 customers to get to $10M in ARR, as the next major milestone. I’ll assume that 1 out of 4 who go through a discovery call will eventually upgrade to a paid version. Therefore, to get to 400 customers, the company needs 400 * 4 * 10 = 16,000 “clones” of the open source.

This may seem daunting at first, but is absolutely within reach. The GitHub repo gets around 100 clones per week right now, i.e. ~5,000 / year and ramping up.
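The directional math above can be written out as a small calculation. This is only a sketch that plugs in the memo's own assumptions ($25K ASP, 1 in 4 discovery calls converting, about 10% of clones reaching out, roughly 100 clones per week); the function and variable names are illustrative.

```python
def required_clones(target_arr, asp, call_to_paid_rate, clone_to_call_rate):
    """Work backwards from an ARR target to the number of repo clones needed."""
    customers = target_arr / asp                     # paying customers needed
    discovery_calls = customers / call_to_paid_rate  # calls needed to win them
    clones = discovery_calls / clone_to_call_rate    # clones needed to get those calls
    return customers, discovery_calls, clones


customers, calls, clones = required_clones(
    target_arr=10_000_000,    # $10M ARR milestone
    asp=25_000,               # assumed initial ASP
    call_to_paid_rate=0.25,   # 1 out of 4 discovery calls upgrades
    clone_to_call_rate=0.10,  # ~10% of clones reach out for a call
)

clones_per_year = 100 * 52    # ~100 clones per week, held flat
years_at_current_pace = clones / clones_per_year

print(f"{customers:.0f} customers, {calls:.0f} calls, {clones:.0f} clones")
print(f"{years_at_current_pace:.1f} years at the current clone rate")
```

Held flat at 100 clones per week, the funnel takes roughly three years to fill, which is why the "ramping up" part of the assumption matters.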

Traction

RudderLabs does have paying customers with revenue in the lower 6-figure range.

At this stage I’m more concerned about building traction for the open source. After less than a year of existence, I think it’s impressive.

https://github.com/rudderlabs/rudder-server

1.7K stars on GitHub
67 forks
34 watch

Compare that with Apache Spark (Databricks), the most successful open source data science project, 7 years after its first funding in 2013, following another 4 years of incubation in Berkeley’s AMPLab, and with a total of $900M raised.

25.8K stars
21.5K forks
2.1K watch

Compared to Spark, RudderStack’s numbers may look small at first glance, but for less than 9 months of existence it’s impressive. Consider there were zero lines of code in July 2019.

Why RudderStack?

I think there are three differentiating angles that give RudderStack a strong shot at rolling up the market from a challenger position.

Business model: With an open source model, RudderStack is different from all other existing CDP vendors out there. Existing vendors offer a hosted product and cater to marketers. RudderLabs on the other hand caters to engineers and CTOs. They are used to trying out open source projects and starting with documentation. And all data science efforts rely on buy-in from engineering, as they need to run the underlying infrastructure.
Technology: Existing CDPs couple the control plane with the data plane. That means they’re coupling the product / the UI with the underlying compute and data infrastructure. All data is stuck in the CDP, and you’re at their grace to support data sources that may be exotic for them but industry-relevant for the customer. CDPs also don’t support real-time streaming. With RudderStack, the data and control plane are separate. A customer can run RudderStack in their infrastructure, couple it with other data processing engines, and also support real-time use cases.
Distribution: Because RudderStack can run in a customer’s infrastructure, the cloud platform providers have a strong incentive to recommend RudderStack, because it brings net-new compute and storage workloads to the clouds. This is a playbook that Databricks used very successfully.

RudderStack is different from “traditional” CDPs, which operate with a hosted model and cater to marketers.

With the rise of machine learning and data science in general, ownership of the analytics domain has shifted from Marketing, who use “drag & drop” tools, to Engineering, who apply software principles to working with data.

See the related RudderLabs discussions on Hacker News:

https://news.ycombinator.com/item?id=21081756

https://news.ycombinator.com/item?id=22637302

Risks

There are a variety of successful competitors already in the market, with 100x the funding and a much longer operating history.

The company needs to remain laser-focused on building out supporting functionality and sources / destinations that the enterprise is asking for. Reaching the $10M revenue mark requires building out an enterprise sales force. The company is currently only operating with a single sales rep.

Given the larger funding and headcount of competitors, forging distribution partnerships, e.g. with the cloud platforms, is crucial.

Market

RudderLabs is going after the emerging category “Customer Data Platforms”:

https://explodingtopics.com/topic/customer-data-platform

There’s a massive value shift away from the legacy CRM stack to CDPs. CRMs just don’t work anymore in a world of exploding SaaS tools and a data science approach to building customer experiences.

What’s a CDP?

CDPs help to collect, process and unify data from disparate sources to build a complete picture of customer activity across the entire lifecycle, and any channel. At the highest level, a CDP does three things:

Collect — ingest data from any channel that touches the customer
Consolidate — combine that data to create a unified customer profile
Activate — deploy profiles to “downstream” tools and processes to create a better customer experience.
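To make the three steps concrete, here is a toy sketch of collect, consolidate and activate. It is not how any particular CDP implements them: the event fields, sample values and destination names are all made up for illustration, and the destinations are plain functions rather than real tool APIs.

```python
from collections import defaultdict

# Collect: events arriving from different channels (all values are made up).
events = [
    {"source": "website", "user_id": "u1", "event": "Signed Up", "plan": "free"},
    {"source": "mobile_app", "user_id": "u1", "event": "Opened App"},
    {"source": "support_tool", "user_id": "u2", "event": "Ticket Created"},
    {"source": "website", "user_id": "u1", "event": "Upgraded", "plan": "team"},
]


def consolidate(events):
    """Fold raw events into one profile per user."""
    profiles = defaultdict(lambda: {"sources": set(), "events": [], "traits": {}})
    for e in events:
        profile = profiles[e["user_id"]]
        profile["sources"].add(e["source"])
        profile["events"].append(e["event"])
        # Later events overwrite earlier trait values (e.g. plan changes).
        for key, value in e.items():
            if key not in ("source", "user_id", "event"):
                profile["traits"][key] = value
    return dict(profiles)


def activate(profiles, destinations):
    """Hand the same set of profiles to every downstream tool."""
    return {name: send(profiles) for name, send in destinations.items()}


profiles = consolidate(events)

# Hypothetical destinations; real ones would call each tool's API.
destinations = {
    "email_tool": lambda p: [
        uid for uid, prof in p.items() if prof["traits"].get("plan") == "team"
    ],
    "crm": lambda p: sorted(p),
}
results = activate(profiles, destinations)
print(results)
```

Both destinations read from the same `profiles` object, which is the point of the "one consistent data set" argument: each tool gets a different slice, but nobody keeps their own copy of the truth.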

The benefit of a CDP is that every team in a company acts on one consistent, identical data set, while keeping the flexibility to use the SaaS tools of their choice.

This stands in contrast to the existing legacy CRM world where users are held hostage by a stack that has been cobbled together by vendor acquisitions (Salesforce, Oracle, Adobe) and years and years of system integrator work.

What’s the problem with the existing CRM?

The existing paradigm in the enterprise to manage customer data is the CRM, a technology that by now is 20 years old. The CRM is the system of record to manage customer interactions, business transactions and internal sales processes. It usually all starts with some form on a website, or a sales rep creating a customer record manually.

That approach has stopped working, for three reasons.

Tool explosion: According to Blissfully, a company that centralizes management of SaaS tools, a typical enterprise with 1,000+ employees uses 203 different apps. The purchase decision for these apps has been “consumerized”; there is no longer a central IT department that controls everything.
Data fragmentation: Customer data is scattered across the different tools, and is coupled with the tool. Each tool has its own way of defining and measuring a specific metric. The glue to get datasets from two or more different tools together are bespoke 1:1 integrations and spreadsheets.
Tool substitution: There’s constant onboarding and sunsetting of tools. Marketing teams especially are exploring new things all the time. You can’t hold up a world of 1:1 integrations if the half-life of your existing stack is less than a year.

Left unaddressed, these three factors together lead to a “bad data” problem.

Why does that matter?

Bad data is a huge productivity drain. Google the term, and you’ll find some ginormous numbers from reputable sources. But I think those big numbers are meaningless, because they don’t show the opportunity cost at the company and human level.

Analysts and data science teams need documented and clean data sets to do their work. Rich, clean data is the #1 success factor for all data science projects — analytics, machine learning, AI, etc.
With bad data, teams of highly paid analytics engineers and data scientists end up spending 70–80% of their time building custom data pipelines and cleaning up data. It’s the most unproductive use of their time.

Companies literally go out of business with bad data.

I’ve seen this myself with food delivery. With my company intermix.io, we had pretty much every delivery company under the sun as a customer at some point, using us as a vital part of monitoring their data infrastructure.

The “winners” hit the panic button very early on when they realized they had “bad data”. Think Postmates, DeliveryHero, DoorDash, Takeaway, Instacart. They killed all spreadsheets, did a full reset of their data infrastructure and analytics engineering teams, and made “data” a C-level agenda. Around 2015 they started building out what back then didn’t have a name but today would be called a “CDP”.
The “not so lucky ones” that went out of business had a data problem. This is my personal opinion, but I don’t think they failed because of a bad product. They failed because they couldn’t get a handle on their customer acquisition cost, COGS, number of successful deliveries, fraud, predicting food prep times on a Saturday night vs. a Monday night, etc., which were all data problems.

That’s just one example from one vertical — the same story plays out across pretty much all industries.

How do CDPs solve the problem?

I covered the “collect, consolidate, activate” steps above. Let’s go into a little more detail for better understanding.

The “collect” part happens via a single API. The big idea behind the concept of a single API for analytics is that any SaaS tool pretty much tries to answer two questions. 1) Who is the user, and 2) What does the user do? So rather than using 10 tools on your website or mobile app (e.g. for analytics, A/B testing, social network marketing pixels, etc.), each one with its own nomenclature, you use a single API, with one consistent framework to capture data.
The “consolidate” part happens either via the CDP in a hosted model, or via a data warehouse like Snowflake, Redshift or BigQuery. You also want to be able to join different data sets, clean them for data quality purposes and then “slice and dice” them in any possible way. Data warehouses deliver on that — they are cheap, flexible and scalable. Analytics teams can manipulate data with SQL, which is easy to learn. Having all your raw data in a single place, in one consistent and documented format is equivalent to having analytics superpowers.
The “activate” part happens either via the CDP, via the warehouse or via a real-time stream. Processed and cleaned data is routed to the corresponding tools (also called “destinations”), and each team (product, marketing, sales, support, etc.) can do their job with the tools of their choice, while all of them work off one consistent data set.
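The "single API" idea from the collect step can be sketched in a few lines. The class below is hypothetical and is not the RudderStack or Segment SDK; it only illustrates how two call types, one for "who is the user" and one for "what does the user do", produce events in one uniform shape. The field names and sample values are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone


class AnalyticsClient:
    """Hypothetical single-API client: every call yields one uniform event."""

    def __init__(self):
        self.queue = []  # events waiting to be shipped downstream

    def _enqueue(self, call_type, user_id, payload):
        event = {
            "type": call_type,
            "userId": user_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            **payload,
        }
        self.queue.append(event)
        return event

    def identify(self, user_id, traits):
        """Question 1: who is the user?"""
        return self._enqueue("identify", user_id, {"traits": traits})

    def track(self, user_id, event, properties=None):
        """Question 2: what does the user do?"""
        return self._enqueue(
            "track", user_id, {"event": event, "properties": properties or {}}
        )


client = AnalyticsClient()
client.identify("user-123", {"email": "jane@example.com", "plan": "team"})
client.track("user-123", "Report Exported", {"format": "csv"})

# One instrumentation, one schema: every tool can be fed the same JSON events
# instead of needing its own snippet and naming scheme.
for event in client.queue:
    print(json.dumps(event))
```

Instrumenting the app once against a client like this, rather than once per tool, is what removes the "each one with its own nomenclature" problem.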

The last part “activate” is where it really gets interesting.

In my experience, leading data teams prefer to collect, clean and standardize the data FIRST in the warehouse or in a streaming processing engine, and THEN pass it on to the tool.
It’s the data team that builds the models, aggregations, logic, etc. — they “productize” data. The tool itself is really only there as the last-mile visualization layer.
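A minimal sketch of that "warehouse first, tool second" order is below. An in-memory SQLite database stands in for a warehouse like Snowflake or Redshift, and the table, column and payload names are made up for illustration; a real setup would use the warehouse's own client and a real destination API.

```python
import sqlite3

# In-memory SQLite stands in for a warehouse such as Snowflake or Redshift.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("u1", "Order Completed", 40.0),
        ("u1", "order completed", 60.0),  # inconsistent casing from another tool
        ("u2", "Order Completed", 25.0),
        ("u2", "Page Viewed", None),
    ],
)

# Step 1: clean and standardize in the warehouse, and define the metric once.
rows = warehouse.execute(
    """
    SELECT user_id, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM raw_events
    WHERE LOWER(event) = 'order completed'
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

# Step 2: only the finished model is handed to the downstream tool.
crm_payload = [
    {"user_id": user_id, "orders": orders, "lifetime_revenue": revenue}
    for user_id, orders, revenue in rows
]
print(crm_payload)
```

The downstream tool never sees the inconsistent raw events. It only displays numbers the data team has already defined, which is what makes it a last-mile layer.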

So in a way, CDPs are commoditizing the existing SaaS tools, and are becoming the “choke point” for downstream consumption.

That puts CDPs in an extremely powerful position, and enables them to capture the value that today is captured by the CRM stack.

Competition

Companies in the CDP category started out as a single API / SDK to instrument web and mobile apps, to send data to various downstream marketing tools (e.g. Google Analytics, FB Ads, etc.).

The two relevant competitors in this category are Segment and mParticle.

Segment is based in San Francisco and came out of Y Combinator (YC S11). Last round of funding was in April 2019, $175M raise, total raised $284M. Segment positions itself as a CDP with a single API for engineering teams to collect data from web and mobile apps. Segment has ~590 employees. Segment has a stronghold in the developer community and claims to have 20,000+ companies using their product, which includes companies on a free tier. With their start-up / developer focus, I believe they have ASPs in the $120–150K range. I’m estimating that Segment is in the $80–90M ARR range.
mParticle was founded by former Yahoo Execs in 2013 and is based in NYC. Last round of funding was in March 2020, $45M raise, total raised ~$120M. mParticle focuses on selling to enterprises / brands, with vertical solutions for e.g. retail, travel, FinTech, media and gaming. mParticle has ~150 employees. With their focus on the enterprise, I believe they have ASPs in the $200–250K+ range. I’m estimating that mParticle is in the $30–35M ARR range.

Other companies in this category include players with less funding and employees like Simon Data, Lytics, Blueconic, ActionIQ and Amperity.

What the companies in this category have in common is the hosted SaaS model. Clearly the model has its merit, given the traction.

But I also believe the current model for CDPs has three major drawbacks.

Price / margin pressure — because hosted CDPs process and store data in their own platform, they also incur the cost to run that infrastructure. If hosted CDPs want to achieve 80–90% software margins, it means they need to apply a ~10x mark-up on cloud storage / processing costs, with ASPs in the 6- and 7-figure range. That’s an achievable range, but also a range where companies start to push back. In fact, I believe that gross margins for both Segment and mParticle are more in the 60–70% range. As data volume inevitably grows, this problem won’t go away.
Lack of streaming / real-time — If you route data from the API directly to a destination, data is available pretty much in real-time. For data warehouses it’s different, however: it can take anywhere from 4–24 hours from the time an event happens to the time the user has the data available in the platform. That’s just not good enough for a large number of use cases. It’s of course technically possible to get to real-time, but the bottleneck becomes the network transfer from the CDP into the customer’s data warehouse; especially for larger amounts of data, a higher sync frequency becomes a hurdle. Streaming data is practically impossible with hosted CDPs.
Privacy — many enterprises just don’t want to have their most valuable data pass through a 3rd party’s platform. That’s especially true for regulated industries like Financial Services, Insurance and Healthcare, but also for Government and Defense. No amount of GDPR or SOC2 compliance will change that.
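The relationship between gross margin and mark-up in the first drawback can be checked with a few lines. This sketch assumes, as a simplification that is mine and not the memo's, that cloud storage and processing is the only cost of goods sold; the function name is illustrative.

```python
def required_markup(gross_margin):
    """Price-to-cost multiple needed to hit a given gross margin.

    gross_margin = (price - cost) / price  =>  price / cost = 1 / (1 - gross_margin)
    """
    return 1 / (1 - gross_margin)


for margin in (0.60, 0.70, 0.80, 0.90):
    markup = required_markup(margin)
    print(f"{margin:.0%} gross margin -> {markup:.1f}x mark-up on infrastructure cost")
```

Under that assumption, a 90% margin needs a 10x mark-up and an 80% margin a 5x mark-up, while the 60–70% range I estimate for Segment and mParticle corresponds to roughly 2.5–3.3x.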

With that in mind, I think long-term the hosted CDPs will do well in the SMB segment with non-technical Marketing organizations, but will run into limits in the enterprise — which is where the money is.

RudderStack can run on top of existing data warehouse / data lake infrastructure, and has already built out tooling for the engineering persona, such as Grafana dashboards and Kubernetes support.

Exit

Ticker NYST:RDR

The two comparables that come to mind are Elastic and Databricks.

Both operate with an open source model, and capitalize on an ever-growing need in the enterprise to handle large amounts of data.
Elastic has a ~$5B market cap as of April 2020, and was up against enterprise giant Splunk when they started. Databricks’ post-money valuation from their last round in October 2019 was $6.2B.

In the general CDP segment, there has been M&A activity already (first company after the bullet is the one acquired):

However, these have all been acquisitions to build products for marketing teams.

Summary

RudderStack is a differentiated play in the fast-growing market for CDPs. The opportunity is the value shift as a result of the unbundling of the legacy monolithic CRM stack.

The trend in this market, however, is the shift of the analytics function from marketing into engineering. Existing CDPs can’t address engineering’s needs, such as real-time streaming, control over data flows and privacy / security. RudderStack also has a cost advantage due to the separation of the control plane and data plane.

RudderStack has strong early traction, with a strong team led by an experienced, 2nd time founder & CEO. I recommend taking our full allocation of $208K in this $3.5M Series Seed round, with a pre-money of $[redacted]M.

Alright, that’s it!

So how did it go? The syndicate closed within less than two days, and my hunch is that we would have landed at over $300K had we kept it open.

Since the round, RudderStack has been growing steadily, adding new logos to their customer base every week. I’m excited about what’s ahead in their journey, and am grateful to be part of it!

I also publish a weekly newsletter “Finding Distribution” on Substack. Click on the link below to subscribe!

findingdistribution.substack.com