OpenTelemetry vs Datadog: Open Standard or Managed Platform?
Compare OpenTelemetry and Datadog across total cost of ownership, instrumentation, vendor lock-in, and architecture. TCO at 10, 50, and 200 services, OTel Collector pipeline config, hybrid approach, and a phased migration guide.

Two Philosophies for the Same Problem
Every engineering team eventually hits the same wall: the system is too complex to debug by reading logs on individual machines. You need observability -- metrics, traces, and logs correlated across services. At that point, you face a fundamental choice. OpenTelemetry (OTel) is a vendor-neutral instrumentation framework that gives you full control over your telemetry pipeline. Datadog is a fully managed observability platform that handles collection, storage, querying, and alerting under one roof. They are not direct competitors -- one is plumbing, the other is the entire house -- but the choice between them shapes your architecture, your costs, and your vendor independence for years.
This guide breaks down both approaches with real cost numbers, instrumentation comparisons, and a practical migration path. No marketing fluff -- just the tradeoffs as they play out in production.
What Are OpenTelemetry and Datadog?
Definition: OpenTelemetry is an open-source, vendor-neutral observability framework (CNCF project) that provides APIs, SDKs, and a Collector for generating, processing, and exporting telemetry data -- traces, metrics, and logs. It standardizes instrumentation but does not store or visualize data. Datadog is a commercial SaaS observability platform that provides its own agents, integrations, storage backend, dashboards, alerting, and APM -- a complete managed solution from instrumentation to incident response.
The key distinction: OTel handles data collection. Datadog handles collection and everything after it. You can use OTel to feed data into Datadog, use Datadog's own agent exclusively, or build a fully open-source stack with OTel plus Prometheus, Grafana, and Tempo. Understanding this layering is the first step toward making a sound decision.
Architecture Comparison
The architectural differences between these two approaches affect deployment, operations, and long-term flexibility.
| Aspect | OpenTelemetry + OSS Stack | Datadog |
|---|---|---|
| Instrumentation | OTel SDKs (vendor-neutral APIs) | dd-trace libraries (proprietary) or OTel SDKs |
| Collection | OTel Collector (self-managed) | Datadog Agent (self-managed or serverless) |
| Storage | Prometheus, Tempo, Loki, ClickHouse (self-managed or cloud) | Datadog-managed (fully hosted) |
| Visualization | Grafana (self-hosted or Grafana Cloud) | Datadog dashboards (built-in) |
| Alerting | Alertmanager, Grafana Alerting | Datadog Monitors (built-in) |
| Data format | OTLP (open standard) | Proprietary + OTLP ingestion support |
| Operational burden | High -- you run the infrastructure | Low -- Datadog manages it |
Total Cost of Ownership at Three Scales
Cost is where this decision gets concrete. I've modeled TCO at three scales based on real-world deployments, including infrastructure, licensing, and engineering time to operate the stack. These numbers assume a containerized environment on AWS with average telemetry volume per service.
10-Service Startup
| Cost Component | OTel + Grafana Cloud | Datadog Pro |
|---|---|---|
| Platform/licensing | $0 (free tier covers it) | ~$690/mo (23 hosts x $15 infra + APM) |
| Infrastructure (Collector, storage) | ~$150/mo (small Collector + Grafana free tier) | $0 (Datadog-managed) |
| Engineering time (setup + maintenance) | ~40 hours initial, 4 hrs/mo ongoing | ~8 hours initial, 1 hr/mo ongoing |
| Estimated monthly TCO | ~$400-600 | ~$700-900 |
At this scale, Datadog is competitive. The engineering time savings nearly offset the licensing cost, and you get a polished experience from day one. For a startup with limited ops capacity, Datadog often wins here.
50-Service Mid-Stage Company
| Cost Component | OTel + Grafana Stack | Datadog Pro |
|---|---|---|
| Platform/licensing | ~$800/mo (Grafana Cloud Pro) | ~$5,500/mo (hosts + APM + log ingestion) |
| Infrastructure | ~$600/mo (Collector cluster, storage) | $0 |
| Engineering time | ~80 hours initial, 12 hrs/mo ongoing | ~20 hours initial, 4 hrs/mo ongoing |
| Estimated monthly TCO | ~$2,500-3,500 | ~$6,000-8,000 |
At 50 services, OTel + Grafana starts pulling ahead significantly. The engineering overhead is real but manageable for a team that has a dedicated platform or SRE function. The cost delta funds a significant portion of an SRE salary.
200-Service Enterprise
| Cost Component | OTel + Grafana Stack | Datadog Enterprise |
|---|---|---|
| Platform/licensing | ~$4,000/mo (Grafana Cloud Advanced) | ~$50,000+/mo (hosts + APM + logs + custom metrics) |
| Infrastructure | ~$3,000/mo (HA Collector, object storage) | $0 |
| Engineering time | 1-2 dedicated SREs | 0.5 SRE for agent management |
| Estimated monthly TCO | ~$12,000-18,000 | ~$50,000-80,000 |
At enterprise scale, the gap is dramatic. Datadog's per-host pricing model compounds relentlessly. Custom metrics pricing alone can add five figures monthly. This is where large organizations either negotiate aggressively with Datadog or migrate to an OTel-based stack.
Instrumentation: OTel SDKs vs. dd-trace
Both approaches offer auto-instrumentation for common frameworks and manual instrumentation APIs for custom business logic. Here is how they compare in a Node.js application.
OpenTelemetry Instrumentation
```typescript
// tracing.ts -- loaded before application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs instrumentation is noisy; disable it
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
```
Datadog dd-trace Instrumentation
```typescript
// tracing.ts -- loaded before application code
import tracer from 'dd-trace';

tracer.init({
  service: 'payment-service',
  env: 'production',
  version: '1.4.2',
  logInjection: true,
  runtimeMetrics: true,
  profiling: true,
});
```
The Datadog setup is undeniably simpler. Fewer packages, less configuration, and features like profiling and runtime metrics are built in. OTel requires more explicit configuration but gives you portability -- that same instrumentation code works with Jaeger, Tempo, Honeycomb, or any OTLP-compatible backend.
The OTel Collector: Your Telemetry Pipeline
The OpenTelemetry Collector is the architectural component that makes OTel powerful. It sits between your services and your backends, acting as a vendor-neutral telemetry router that can process, filter, sample, enrich, and fan out data to multiple destinations simultaneously.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert
  batch:
    timeout: 5s
    send_batch_size: 2048

exporters:
  otlphttp/grafana:
    endpoint: https://otlp-gateway-prod-us-east.grafana.net/otlp
    headers:
      Authorization: "Basic ${env:GRAFANA_OTLP_TOKEN}"
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter must run first, batch last
      processors: [memory_limiter, attributes, tail_sampling, batch]
      exporters: [otlphttp/grafana, datadog]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/grafana]
```
This configuration demonstrates the Collector's killer feature: multi-destination export. You can send traces to both Grafana Cloud and Datadog simultaneously, making migration incremental rather than all-or-nothing. The tail sampling processor keeps 100% of errors and slow traces while sampling 5% of routine traffic, drastically reducing storage costs.
Grafana Flexibility vs. Datadog Polish
On the visualization side, the tradeoff is customizability versus out-of-the-box experience.
| Feature | Grafana | Datadog |
|---|---|---|
| Dashboard building | Extremely flexible, any data source | Polished templates, guided setup |
| Data sources | 100+ plugins (Prometheus, Loki, Tempo, Postgres, etc.) | Datadog metrics/traces/logs only |
| Alerting | Multi-source, Alertmanager or Grafana-native | Integrated monitors with anomaly detection |
| Trace-to-log correlation | Manual config (Tempo + Loki linking) | Automatic, zero config |
| APM service map | Requires Tempo + service graph connector | Built-in, auto-generated |
| Learning curve | Steeper -- PromQL, LogQL, TraceQL | Lower -- unified query interface |
| Notebooks/collaboration | Basic annotations | Full notebooks, incident timelines |
Datadog's strength is correlation. Click a spike on a metric dashboard, pivot to traces for that time window, drill into a specific trace, jump to the associated logs -- all without leaving the platform. Grafana can do this too with Tempo, Loki, and Prometheus, but the linking requires configuration and the experience is less seamless. For teams that value speed-to-insight during incidents, Datadog's polish is real.
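That Tempo-to-Loki linking is a one-time provisioning step rather than per-dashboard work. Here is a minimal sketch of a Grafana datasource provisioning file using the `tracesToLogsV2` option -- the URLs, UIDs, and datasource names are assumptions for illustration:

```yaml
# grafana/provisioning/datasources/tempo.yaml (illustrative)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # UID of your Loki datasource
        spanStartTimeShift: '-5m'  # widen the log search window around the span
        spanEndTimeShift: '5m'
        filterByTraceID: true      # only show logs carrying the trace ID
```

Once provisioned, the "logs for this span" button appears in the trace view -- close to Datadog's pivot, but you had to wire it up yourself.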
Vendor Lock-In: The Hidden Cost
Vendor lock-in is the argument most cited for OTel, and it deserves a nuanced discussion rather than hand-waving.
Datadog lock-in is real and multifaceted:
- Instrumentation lock-in: dd-trace libraries use proprietary span formats and tags. Migrating means re-instrumenting every service.
- Dashboard lock-in: Datadog dashboards, monitors, and SLOs are defined in Datadog's proprietary format. They cannot be exported to Grafana or any other tool.
- Custom metrics lock-in: DogStatsD metric naming conventions differ from Prometheus/OTel conventions. Migration requires renaming and re-alerting.
- Workflow lock-in: Incident management, runbooks, and on-call workflows built in Datadog must be rebuilt elsewhere.
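To make the custom-metrics bullet concrete: DogStatsD names are dot-delimited (`payment.checkout.duration`) while Prometheus/OTel names are snake_case with a unit suffix (`payment_checkout_duration_seconds`). A hypothetical helper for bulk-renaming during a migration might look like this:

```typescript
// Hypothetical migration helper: map a DogStatsD-style metric name
// to Prometheus naming conventions.
function toPrometheusName(ddName: string, unit?: string): string {
  // Dots become underscores; any other illegal characters are sanitized.
  const base = ddName.replace(/\./g, '_').replace(/[^a-zA-Z0-9_]/g, '_');
  // Prometheus convention appends the unit to the metric name.
  return unit ? `${base}_${unit}` : base;
}

toPrometheusName('payment.checkout.duration', 'seconds');
// → 'payment_checkout_duration_seconds'
```

The rename itself is mechanical; the expensive part is updating every alert and dashboard query that references the old names.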
OTel avoids instrumentation lock-in by design:
- OTLP is an open standard supported by every major backend.
- Switching from Tempo to Honeycomb means changing one exporter config in the Collector.
- Your application code never changes when you swap backends.
- Grafana dashboards can be version-controlled as JSON and migrated between instances.
That said, OTel does not eliminate all lock-in. If you build heavily on Grafana Cloud's specific features (Adaptive Metrics, for example), you carry some platform dependency. The difference is that the instrumentation layer -- the part that touches every service -- remains portable.
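To illustrate the exporter-swap claim, here is a hedged Collector fragment. The Honeycomb endpoint and header follow their published OTLP ingest settings; the `otlp/tempo` exporter name being replaced is illustrative:

```yaml
# Swapping trace backends is a Collector config change, not a code change.
exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/honeycomb]  # was: [otlp/tempo] -- services untouched
```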
The Hybrid Approach: OTel Instrumentation with Datadog Backend
You don't have to choose one or the other at every layer. The most pragmatic approach for many teams is a hybrid: instrument with OpenTelemetry, send to Datadog.
```yaml
# Hybrid: OTel Collector sending to Datadog
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com
    traces:
      span_name_as_resource_name: true
    metrics:
      resource_attributes_as_tags: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [datadog]
```
This gives you Datadog's dashboards, APM, and alerting while keeping your instrumentation vendor-neutral. If you later decide to move off Datadog, you change the Collector's exporter -- not your application code. Datadog supports OTLP ingestion natively, so compatibility is solid.
Caveats of the hybrid approach:
- Some Datadog-specific features (Continuous Profiler, Error Tracking deep integration) work better with dd-trace.
- OTel metric naming conventions may not map perfectly to Datadog's expectations. Test your dashboards.
- You still pay Datadog's pricing -- the hybrid approach saves you from instrumentation lock-in, not from licensing costs.
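When a naming mismatch does break a dashboard, the Collector can rename metrics in flight rather than forcing a dashboard rewrite. A sketch using the `metricstransform` processor from the contrib distribution -- both metric names here are hypothetical:

```yaml
# Rename an OTel metric to the name an existing Datadog dashboard queries.
processors:
  metricstransform:
    transforms:
      - include: http.server.request.duration   # OTel semantic-convention name
        action: update
        new_name: trace.http.request.duration   # name the dashboard expects
```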
Migration Guide: Datadog to OTel + Grafana
If you're moving off Datadog, here is the phased approach that minimizes risk:
- Phase 1 -- Deploy the OTel Collector alongside the Datadog Agent. Configure it to receive OTLP and export to both Datadog and your target backend (e.g., Grafana Cloud). This lets you validate data parity without disrupting existing dashboards.
- Phase 2 -- Migrate instrumentation service by service. Replace dd-trace with OTel SDKs in non-critical services first. Verify traces and metrics appear correctly in both backends. Use feature flags to toggle between instrumentation libraries during the transition.
- Phase 3 -- Rebuild dashboards and alerts. Recreate your most critical Datadog dashboards in Grafana. Start with SLO dashboards and on-call views. This is the most time-consuming step -- budget 2-4 weeks for a 50-service deployment.
- Phase 4 -- Cut over and decommission. Once all services emit OTel telemetry and all critical dashboards exist in Grafana, remove the Datadog exporter from the Collector and cancel the contract. Keep Datadog read-only access for 30 days to handle any gaps.
Migration reality check: Plan for 3-6 months for a 50+ service deployment. The instrumentation swap is the easy part. Rebuilding institutional knowledge embedded in Datadog dashboards, monitors, and runbooks takes longer than anyone estimates. Do not underestimate phase 3.
Decision Framework
Use this framework to decide which approach fits your team:
| Choose | When |
|---|---|
| Datadog | Small team (fewer than 5 engineers), fewer than 20 services, no dedicated SRE, need observability fast, budget is not the primary constraint |
| OTel + Grafana | Platform/SRE team available, 30+ services, cost-sensitive, multi-cloud or hybrid environments, vendor independence is a strategic priority |
| Hybrid (OTel + Datadog) | Currently on Datadog and want to reduce future lock-in, planning eventual migration, need Datadog features today but want portable instrumentation |
Frequently Asked Questions
Can I use OpenTelemetry with Datadog?
Yes. Datadog natively supports OTLP ingestion for traces and metrics. You instrument with OTel SDKs, send data to the OTel Collector, and export to Datadog's OTLP endpoint. This gives you vendor-neutral instrumentation while using Datadog's platform. Some Datadog-specific features like Continuous Profiler work best with dd-trace, but core APM, dashboards, and alerting work well with OTel-sourced data.
Is OpenTelemetry really free?
The software is free and open source. The infrastructure to run it is not. You need compute for the OTel Collector (typically 2-4 vCPUs and 4-8 GB RAM for a mid-size deployment), a storage backend (Prometheus, Tempo, Loki -- either self-hosted or via Grafana Cloud), and engineering time to operate the pipeline. For small deployments, Grafana Cloud's free tier covers basic needs. At scale, the infrastructure and engineering costs are real but consistently lower than Datadog licensing.
What does Datadog cost for 100 hosts?
Datadog Pro list pricing for 100 hosts with Infrastructure Monitoring ($15/host), APM ($31/host), and log ingestion (estimated 100 GB/day at $0.10/GB) comes to roughly $4,900/month for those base components. Real-world bills typically land in the $15,000-20,000/month range once log indexing (billed per million indexed events), indexed spans, custom metrics, and Synthetics are added. Custom metric pricing ($0.05 per custom metric beyond the included per-host allotment) is the line item that surprises most teams. Negotiate annual contracts for 20-40% discounts on list price.
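Back-of-envelope, using the list prices quoted above (a sketch only -- real bills add log indexing, indexed spans, and custom metrics on top):

```typescript
// Rough monthly cost of the base Datadog components at list prices (illustrative).
function datadogBaseMonthly(hosts: number, logGBPerDay: number): number {
  const infra = hosts * 15;                  // Infrastructure Monitoring, $15/host
  const apm = hosts * 31;                    // APM, $31/host
  const logIngest = logGBPerDay * 30 * 0.1;  // log ingestion, $0.10/GB
  return infra + apm + logIngest;
}

datadogBaseMonthly(100, 100);
// → 4900 -- before log indexing, custom metrics, or Synthetics
```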
How does tail sampling in the OTel Collector reduce costs?
Tail sampling evaluates complete traces before deciding whether to store them. You configure policies to keep 100% of error traces and slow traces (which you always want for debugging) while sampling a small percentage (e.g., 5-10%) of successful, fast traces. This typically reduces trace storage volume by 80-95% with minimal loss of debugging capability. The OTel Collector's tail_sampling processor handles this natively. Datadog offers similar ingestion controls, but since you pay per indexed span, the savings mechanism differs.
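The arithmetic behind that reduction claim, as a quick sketch -- the traffic mix numbers are assumptions, not measurements:

```typescript
// Back-of-envelope tail-sampling retention: errors and slow traces kept 100%,
// everything else sampled at a baseline percentage.
function retainedTraces(
  total: number,
  errorRate: number,
  slowRate: number,
  baselinePct: number
): number {
  const alwaysKept = total * (errorRate + slowRate);
  const sampled = total * (1 - errorRate - slowRate) * (baselinePct / 100);
  return Math.round(alwaysKept + sampled);
}

// 1M traces/day, 1% errors, 2% slow, 5% baseline sampling:
retainedTraces(1_000_000, 0.01, 0.02, 5);
// → 78500 traces stored, a ~92% reduction
```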
How long does it take to migrate from Datadog to OpenTelemetry?
For a 10-service deployment, expect 4-6 weeks. For 50+ services, plan 3-6 months. The instrumentation swap (replacing dd-trace with OTel SDKs) is straightforward -- typically a day per service. The bottleneck is rebuilding dashboards, alerts, SLOs, and operational runbooks in the new stack. Parallel-run both systems during migration to validate data parity. The OTel Collector's multi-exporter capability makes this dual-write pattern easy.
Does Datadog support OpenTelemetry natively?
Datadog added native OTLP ingestion in 2023 and has steadily improved compatibility. The Datadog Agent can act as an OTLP receiver, and Datadog's backend maps OTel spans and metrics to its internal data model. However, some translations are imperfect -- OTel resource attributes may not map cleanly to Datadog tags, and metric naming conventions differ. Test your specific use cases. The Datadog exporter in the OTel Collector (contrib distribution) provides the best compatibility.
When should I avoid OpenTelemetry?
Avoid building an OTel-based stack if you have no platform engineering capacity, fewer than 10 services, or need production-ready observability within days rather than weeks. OTel's flexibility comes with operational complexity -- running the Collector at high availability, managing storage backends, configuring Grafana datasources, and troubleshooting pipeline issues all require engineering investment. If your team's strength is product development and you have budget for Datadog, the managed platform may be the right tradeoff.
Conclusion
OpenTelemetry and Datadog are not interchangeable alternatives -- they operate at different layers of the observability stack. OTel is an instrumentation standard and telemetry pipeline. Datadog is a complete managed platform. The right choice depends on your team size, service count, budget constraints, and how much operational complexity you're willing to absorb.
For most teams, the answer evolves over time. Start with Datadog if you need observability fast and have the budget. Instrument with OTel from day one if you can, using the hybrid approach to keep your options open. As you grow past 30-50 services, reassess -- the cost gap between Datadog and an OTel-based stack widens with every host you add, and that savings compounds month after month.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.