Running Enterprise Node.js Applications at Scale: Lessons from Production
Battle-tested strategies for building, deploying, and maintaining Node.js applications that handle millions of requests while staying reliable and maintainable.

The moment everything changed#
Our Node.js app was humming along nicely. 10k requests per minute, stable memory usage, happy users. Then we landed a major client. Overnight, traffic spiked to 500k requests per minute.
The app crashed. Hard.
That week taught me more about scalability than the previous three years combined. Let me share what I learned about running Node.js at enterprise scale.
Part 1: Foundation - Application Architecture#
The Cluster Module: Your First Line of Defense#
Node.js executes your JavaScript on a single thread. On a server with 8 CPU cores, one process uses at most 12.5% of the available compute. Unacceptable.
// server.ts
import cluster from "cluster";
import os from "os";
import { createServer } from "./app";

const numCPUs = os.cpus().length;

if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} is running`);
  console.log(`Forking ${numCPUs} workers...`);

  // Fork one worker per CPU core
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  // Replace workers that crash
  cluster.on("exit", (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died (${signal || code})`);
    console.log("Starting a new worker...");
    cluster.fork();
  });

  // Graceful shutdown
  process.on("SIGTERM", () => {
    console.log("SIGTERM received, shutting down gracefully");
    for (const id in cluster.workers) {
      cluster.workers[id]?.kill();
    }
  });
} else {
  // Workers share the same port
  const server = createServer();
  server.listen(3000, () => {
    console.log(`Worker ${process.pid} started`);
  });

  // Graceful shutdown for workers
  process.on("SIGTERM", () => {
    console.log(`Worker ${process.pid} shutting down...`);
    server.close(() => {
      process.exit(0);
    });
  });
}
But there's a better way: PM2.
// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: "api",
      script: "./dist/server.js",
      instances: "max", // One instance per CPU
      exec_mode: "cluster",
      // Memory management: restart a worker that exceeds the cap
      max_memory_restart: "512M",
      // Logging
      error_file: "./logs/error.log",
      out_file: "./logs/out.log",
      log_date_format: "YYYY-MM-DD HH:mm:ss Z",
      // Environment
      env: {
        NODE_ENV: "development",
      },
      env_production: {
        NODE_ENV: "production",
      },
      // Auto-restart on file changes (development only)
      watch: false,
      // Graceful shutdown
      kill_timeout: 5000,
      wait_ready: true,
      listen_timeout: 10000,
    },
  ],
};
# Start with PM2
pm2 start ecosystem.config.js --env production
# Monitor
pm2 monit
# Logs
pm2 logs
# Zero-downtime reload
pm2 reload api
Part 2: Request Handling - Making Every Millisecond Count#
Async Everything, Block Nothing#
// ❌ BAD: Blocking operations
import fs from "fs";
import { createHash } from "crypto";

app.get("/bad", (req, res) => {
  // Blocks the event loop!
  const data = fs.readFileSync("/large-file.json");
  // CPU-intensive operation, also on the event loop
  const hash = createHash("sha256").update(data).digest("hex");
  res.json({ hash });
});

// ✅ GOOD: Non-blocking
import { readFile } from "fs/promises";
import { Worker } from "worker_threads";

app.get("/good", async (req, res) => {
  try {
    // Non-blocking file read
    const data = await readFile("/large-file.json");
    // Offload CPU work to a worker thread
    const hash = await hashInWorker(data);
    res.json({ hash });
  } catch (error) {
    res.status(500).json({ error: "Internal server error" });
  }
});

function hashInWorker(data: Buffer): Promise<string> {
  return new Promise((resolve, reject) => {
    const worker = new Worker("./hash-worker.js", {
      workerData: data,
    });
    worker.on("message", resolve);
    worker.on("error", reject);
    worker.on("exit", (code) => {
      if (code !== 0) {
        reject(new Error(`Worker stopped with exit code ${code}`));
      }
    });
  });
}
Worker thread for CPU-intensive tasks:
// hash-worker.js
import { parentPort, workerData } from "worker_threads";
import { createHash } from "crypto";

const hash = createHash("sha256").update(workerData).digest("hex");
parentPort?.postMessage(hash);
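Spawning a fresh worker per request adds startup cost under load. A minimal sketch of reusing threads with a pool, here via the piscina package (the pool size and file layout are my assumptions, not from the original setup):
// hash-pool.ts (sketch)
import path from "path";
import Piscina from "piscina";

// A fixed pool of threads, reused across requests
const pool = new Piscina({
  filename: path.resolve(__dirname, "hash-worker.js"),
  maxThreads: 4, // tune to your workload
});

export function hashInPool(data: Buffer): Promise<string> {
  return pool.run(data);
}

// With Piscina, the worker file exports a function instead of using parentPort:
// const { createHash } = require("crypto");
// module.exports = (data) => createHash("sha256").update(data).digest("hex");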
Connection Pooling: Don't Create, Reuse#
// lib/db.ts
import { Client, Pool } from "pg";

// ❌ BAD: New connection per request
export async function queryBad(sql: string) {
  const client = new Client({ /* connection config */ });
  await client.connect(); // full TCP + auth handshake on every call
  const result = await client.query(sql);
  await client.end();
  return result;
}

// ✅ GOOD: Connection pool
export const pool = new Pool({
  host: process.env.DB_HOST,
  port: Number(process.env.DB_PORT),
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  // Pool configuration
  max: 20, // Maximum pool size
  min: 5, // Minimum pool size
  idleTimeoutMillis: 30000, // Close idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if all connections are busy
  // Performance
  statement_timeout: 30000, // Query timeout
  query_timeout: 30000,
});

export async function query(sql: string, params?: any[]) {
  const client = await pool.connect();
  try {
    return await client.query(sql, params);
  } finally {
    client.release(); // Return to pool
  }
}

// Monitor pool health
pool.on("connect", () => {
  console.log("New client connected to pool");
});

pool.on("error", (err) => {
  console.error("Unexpected error on idle client", err);
});

// Graceful shutdown
export async function closePool() {
  await pool.end();
  console.log("Database pool closed");
}
Caching: The Ultimate Performance Multiplier#
// lib/cache.ts
import Redis from "ioredis";

export const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: Number(process.env.REDIS_PORT),
  password: process.env.REDIS_PASSWORD,
  // Resilience
  maxRetriesPerRequest: 3,
  enableReadyCheck: true,
  retryStrategy: (times) => {
    const delay = Math.min(times * 50, 2000);
    return delay;
  },
});

export async function cached<T>(
  key: string,
  ttlSeconds: number,
  fn: () => Promise<T>,
): Promise<T> {
  // Try cache first
  const hit = await redis.get(key);
  if (hit !== null) {
    return JSON.parse(hit);
  }
  // Cache miss: compute the value
  const value = await fn();
  // Store in cache with a TTL
  await redis.setex(key, ttlSeconds, JSON.stringify(value));
  return value;
}

export async function invalidate(pattern: string) {
  // KEYS blocks Redis on large keyspaces; SCAN iterates incrementally
  const stream = redis.scanStream({ match: pattern });
  for await (const keys of stream) {
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}

// Usage
app.get("/api/posts/:slug", async (req, res) => {
  const post = await cached(
    `post:${req.params.slug}`,
    3600, // 1 hour
    async () => {
      return db.post.findUnique({
        where: { slug: req.params.slug },
        include: { author: true, tags: true },
      });
    },
  );
  if (!post) {
    return res.status(404).json({ error: "Not found" });
  }
  res.json(post);
});

// Invalidate on update
app.put("/api/posts/:slug", async (req, res) => {
  await db.post.update({
    where: { slug: req.params.slug },
    data: req.body,
  });
  // Clear cache
  await invalidate(`post:${req.params.slug}`);
  res.json({ success: true });
});
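One caveat: under load, many requests can miss the same key at once and stampede the database. A minimal in-process guard, building on the helpers above (cachedOnce and the inFlight map are names I'm introducing):
// lib/cache.ts (continued, sketch)
// De-duplicate concurrent cache misses so only one caller hits the database
const inFlight = new Map<string, Promise<unknown>>();

export async function cachedOnce<T>(
  key: string,
  ttlSeconds: number,
  fn: () => Promise<T>,
): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit);

  // If another request is already computing this key, await its promise
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;

  const promise = (async () => {
    try {
      const value = await fn();
      await redis.setex(key, ttlSeconds, JSON.stringify(value));
      return value;
    } finally {
      inFlight.delete(key);
    }
  })();
  inFlight.set(key, promise);
  return promise;
}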
Multi-level caching:
// lib/multi-cache.ts
import NodeCache from "node-cache";
// L2: Redis (shared across instances)
import { redis } from "./cache";

// L1: In-memory cache (fastest, per-instance)
const memoryCache = new NodeCache({
  stdTTL: 60, // 1 minute default
  checkperiod: 120, // Cleanup interval
  useClones: false, // Don't clone objects (faster)
});

export async function multiCached<T>(
  key: string,
  ttl: { memory: number; redis: number },
  fn: () => Promise<T>,
): Promise<T> {
  // L1: Check memory
  const memCached = memoryCache.get<T>(key);
  if (memCached !== undefined) {
    return memCached;
  }
  // L2: Check Redis
  const redisCached = await redis.get(key);
  if (redisCached !== null) {
    const value = JSON.parse(redisCached);
    memoryCache.set(key, value, ttl.memory);
    return value;
  }
  // Cache miss: compute
  const value = await fn();
  // Store in both caches
  memoryCache.set(key, value, ttl.memory);
  await redis.setex(key, ttl.redis, JSON.stringify(value));
  return value;
}
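The catch with an L1 memory cache: after one instance updates a record, other instances can keep serving the stale L1 copy until its TTL expires. One mitigation is broadcasting invalidations over Redis pub/sub; a sketch, where the channel name and invalidateEverywhere are my assumptions:
// lib/multi-cache.ts (continued, sketch)
// Pub/sub needs a dedicated connection; duplicate the existing one
const subscriber = redis.duplicate();

subscriber.subscribe("cache-invalidate");
subscriber.on("message", (_channel, key) => {
  // Drop the local L1 entry when any instance invalidates it
  memoryCache.del(key);
});

export async function invalidateEverywhere(key: string) {
  memoryCache.del(key);
  await redis.del(key);
  await redis.publish("cache-invalidate", key);
}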
Part 3: Error Handling - When Things Go Wrong#
Comprehensive error handling#
// lib/error-handler.ts
import { Request, Response, NextFunction } from "express";

export class AppError extends Error {
  constructor(
    public statusCode: number,
    public message: string,
    public isOperational: boolean = true,
  ) {
    super(message);
    Error.captureStackTrace(this, this.constructor);
  }
}

// Global error handler
export function errorHandler(
  err: Error,
  req: Request,
  res: Response,
  next: NextFunction,
) {
  // Log error
  console.error("Error:", {
    message: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    ip: req.ip,
  });

  // Send to error tracking service
  if (process.env.NODE_ENV === "production") {
    // trackError wraps your tracking SDK (Sentry, Datadog, etc.)
    trackError(err, {
      user: req.user?.id,
      url: req.url,
      method: req.method,
    });
  }

  // Operational errors (expected)
  if (err instanceof AppError && err.isOperational) {
    return res.status(err.statusCode).json({
      error: err.message,
    });
  }

  // Programming errors (unexpected): never leak details in production
  res.status(500).json({
    error:
      process.env.NODE_ENV === "production"
        ? "Internal server error"
        : err.message,
  });
}

// Async wrapper to catch promise rejections
export function asyncHandler(
  fn: (req: Request, res: Response, next: NextFunction) => Promise<any>,
) {
  return (req: Request, res: Response, next: NextFunction) => {
    Promise.resolve(fn(req, res, next)).catch(next);
  };
}

// Usage
app.get(
  "/api/posts/:id",
  asyncHandler(async (req, res) => {
    const post = await db.post.findUnique({
      where: { id: req.params.id },
    });
    if (!post) {
      throw new AppError(404, "Post not found");
    }
    res.json(post);
  }),
);

// Apply the error handler last, after all routes
app.use(errorHandler);
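Express middleware can't catch everything. Wire up last-resort process handlers so the process manager restarts a clean worker instead of limping on with corrupt state:
// Last-resort safety nets (place near startup)
process.on("unhandledRejection", (reason) => {
  console.error("Unhandled rejection:", reason);
  // Re-throw so it surfaces as an uncaughtException below
  throw reason;
});

process.on("uncaughtException", (err) => {
  console.error("Uncaught exception:", err);
  // State may be corrupt; exit and let PM2/Kubernetes restart the worker
  process.exit(1);
});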
Circuit breaker pattern#
// lib/circuit-breaker.ts
export class CircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  constructor(
    private threshold: number = 5, // failures before opening
    private resetTimeout: number = 30000, // 30s open before probing again
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      const now = Date.now();
      if (now - this.lastFailureTime >= this.resetTimeout) {
        this.state = "half-open";
        console.log("Circuit breaker: moving to half-open");
      } else {
        throw new Error("Circuit breaker is open");
      }
    }
    try {
      const result = await fn();
      if (this.state === "half-open") {
        this.reset();
      }
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private recordFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.threshold) {
      this.state = "open";
      console.log("Circuit breaker: opened after", this.failures, "failures");
    }
  }

  private reset() {
    this.failures = 0;
    this.state = "closed";
    console.log("Circuit breaker: reset to closed");
  }
}

// Usage with an external API: open after 5 failures, probe again after 30s
const apiCircuitBreaker = new CircuitBreaker(5, 30000);

async function callExternalAPI(data: any) {
  return apiCircuitBreaker.execute(async () => {
    const response = await fetch("https://external-api.com/endpoint", {
      method: "POST",
      body: JSON.stringify(data),
      headers: { "Content-Type": "application/json" },
    });
    if (!response.ok) {
      throw new Error(`API error: ${response.status}`);
    }
    return response.json();
  });
}
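A breaker only helps if slow calls actually fail. Pair it with a hard timeout so a hanging upstream counts as a failure (AbortSignal.timeout requires Node 18+; the 5s budget is an assumption):
async function callExternalAPIWithTimeout(data: any) {
  return apiCircuitBreaker.execute(async () => {
    const response = await fetch("https://external-api.com/endpoint", {
      method: "POST",
      body: JSON.stringify(data),
      headers: { "Content-Type": "application/json" },
      signal: AbortSignal.timeout(5000), // abort after 5s; rejects the fetch
    });
    if (!response.ok) {
      throw new Error(`API error: ${response.status}`);
    }
    return response.json();
  });
}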
Part 4: Monitoring & Observability#
Structured logging#
// lib/logger.ts
import winston from "winston";

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json(),
  ),
  defaultMeta: {
    service: "api",
    environment: process.env.NODE_ENV,
  },
  transports: [
    new winston.transports.File({
      filename: "logs/error.log",
      level: "error",
      maxsize: 10485760, // 10MB
      maxFiles: 5,
    }),
    new winston.transports.File({
      filename: "logs/combined.log",
      maxsize: 10485760,
      maxFiles: 5,
    }),
  ],
});

// Human-readable console output in development
if (process.env.NODE_ENV !== "production") {
  logger.add(
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple(),
      ),
    }),
  );
}

export default logger;

// Usage
logger.info("User logged in", { userId: "123", ip: "1.2.3.4" });
logger.error("Database connection failed", { error: err.message });
logger.warn("High memory usage", { usage: process.memoryUsage() });
Request tracing#
// middleware/tracing.ts
import { v4 as uuidv4 } from "uuid";
import { Request, Response, NextFunction } from "express";
// Local modules from the sections above/below
import logger from "@/lib/logger";
import { metrics } from "@/lib/metrics";

export function tracingMiddleware(
  req: Request,
  res: Response,
  next: NextFunction,
) {
  // Generate (or propagate) a unique request ID
  const requestId = (req.headers["x-request-id"] as string) || uuidv4();
  req.id = requestId;

  // Child logger stamps the request ID on every log line
  req.log = logger.child({ requestId });

  // Track timing
  const start = Date.now();
  res.on("finish", () => {
    const duration = Date.now() - start;
    req.log.info("Request completed", {
      method: req.method,
      url: req.url,
      statusCode: res.statusCode,
      duration,
      userAgent: req.headers["user-agent"],
      ip: req.ip,
    });
    // Send metrics
    metrics.recordHttpRequest({
      method: req.method,
      path: req.route?.path || req.url,
      statusCode: res.statusCode,
      duration,
    });
  });
  next();
}
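For req.id and req.log to type-check, Express's Request type needs augmenting. A minimal declaration file:
// types/express.d.ts
import type { Logger } from "winston";

declare global {
  namespace Express {
    interface Request {
      id: string;
      log: Logger;
    }
  }
}

export {};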
Metrics collection#
// lib/metrics.ts
import { Counter, Histogram, Registry } from "prom-client";

const registry = new Registry();

// HTTP request counter
const httpRequestCounter = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status"],
  registers: [registry],
});

// HTTP request duration
const httpRequestDuration = new Histogram({
  name: "http_request_duration_ms",
  help: "Duration of HTTP requests in ms",
  labelNames: ["method", "path", "status"],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000],
  registers: [registry],
});

// Database query duration
const dbQueryDuration = new Histogram({
  name: "db_query_duration_ms",
  help: "Duration of database queries in ms",
  labelNames: ["operation"],
  buckets: [1, 5, 10, 25, 50, 100, 250, 500, 1000],
  registers: [registry],
});

export const metrics = {
  recordHttpRequest(data: {
    method: string;
    path: string;
    statusCode: number;
    duration: number;
  }) {
    const labels = {
      method: data.method,
      path: data.path,
      status: data.statusCode.toString(),
    };
    httpRequestCounter.inc(labels);
    httpRequestDuration.observe(labels, data.duration);
  },
  recordDbQuery(operation: string, duration: number) {
    dbQueryDuration.observe({ operation }, duration);
  },
  getRegistry() {
    return registry;
  },
};

// Expose a metrics endpoint for Prometheus to scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});
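recordDbQuery needs a call site. One sketch, wrapping the pool helper from Part 2 (timedQuery is a name I'm introducing):
// lib/db.ts (continued, sketch)
import { metrics } from "./metrics";

export async function timedQuery(operation: string, sql: string, params?: any[]) {
  const start = Date.now();
  try {
    return await query(sql, params);
  } finally {
    // Record the duration whether the query succeeded or failed
    metrics.recordDbQuery(operation, Date.now() - start);
  }
}

// Usage: await timedQuery("posts.findBySlug", "SELECT * FROM posts WHERE slug = $1", [slug]);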
Health checks#
// routes/health.ts
import { Router } from "express";
import { pool } from "@/lib/db";
import { redis } from "@/lib/cache";

const router = Router();

// Liveness probe (is the app running?)
router.get("/health/live", (req, res) => {
  res.json({ status: "ok" });
});

// Readiness probe (is the app ready to serve traffic?)
router.get("/health/ready", async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
  };
  const allHealthy = Object.values(checks).every((c) => c.healthy);
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? "ready" : "not ready",
    checks,
    timestamp: new Date().toISOString(),
  });
});

async function checkDatabase(): Promise<{
  healthy: boolean;
  latency?: number;
}> {
  const start = Date.now();
  try {
    await pool.query("SELECT 1");
    return { healthy: true, latency: Date.now() - start };
  } catch (error) {
    return { healthy: false };
  }
}

async function checkRedis(): Promise<{ healthy: boolean; latency?: number }> {
  const start = Date.now();
  try {
    await redis.ping();
    return { healthy: true, latency: Date.now() - start };
  } catch (error) {
    return { healthy: false };
  }
}

export default router;
Part 5: Deployment Strategies#
Zero-downtime deployments#
// server.ts
import express from "express";
import { createServer } from "http";
import { Socket } from "net";
import { pool } from "./lib/db";

const app = express();
const server = createServer(app);

// Track active connections so we can force-close stragglers
const connections = new Set<Socket>();
server.on("connection", (conn) => {
  connections.add(conn);
  conn.on("close", () => {
    connections.delete(conn);
  });
});

// Graceful shutdown
function gracefulShutdown(signal: string) {
  console.log(`${signal} received, shutting down gracefully`);
  // Stop accepting new connections; in-flight requests finish
  server.close(() => {
    console.log("HTTP server closed");
    // Close database connections
    pool.end().then(() => {
      console.log("Database pool closed");
      process.exit(0);
    });
  });
  // Force close after timeout
  setTimeout(() => {
    console.error("Forced shutdown after timeout");
    connections.forEach((conn) => conn.destroy());
    process.exit(1);
  }, 30000); // 30 seconds
}

process.on("SIGTERM", () => gracefulShutdown("SIGTERM"));
process.on("SIGINT", () => gracefulShutdown("SIGINT"));

const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
  // Signal PM2 that the app is ready (pairs with wait_ready)
  if (process.send) {
    process.send("ready");
  }
});
Docker multi-stage build#
# Dockerfile
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app

# Copy package files
COPY package*.json ./
COPY tsconfig.json ./

# Install ALL dependencies (the TypeScript build needs devDependencies)
RUN npm ci

# Copy source
COPY src ./src

# Build TypeScript, then drop devDependencies from node_modules
RUN npm run build && \
    npm prune --omit=dev && \
    npm cache clean --force

# Stage 2: Production
FROM node:20-alpine
WORKDIR /app

# Security: Run as non-root
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Copy only necessary files from builder
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/package*.json ./

USER nodejs
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node -e "require('http').get('http://localhost:3000/health/live', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"

CMD ["node", "dist/server.js"]
Kubernetes deployment#
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myregistry/api:latest
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: production
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 3000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Part 6: Performance Optimization#
Memory management#
// Monitor memory usage
setInterval(() => {
  const usage = process.memoryUsage();
  logger.info("Memory usage", {
    rss: `${Math.round(usage.rss / 1024 / 1024)}MB`,
    heapTotal: `${Math.round(usage.heapTotal / 1024 / 1024)}MB`,
    heapUsed: `${Math.round(usage.heapUsed / 1024 / 1024)}MB`,
    external: `${Math.round(usage.external / 1024 / 1024)}MB`,
  });
  // Alert if heap usage is high
  if (usage.heapUsed / usage.heapTotal > 0.9) {
    logger.warn("High memory usage detected");
  }
}, 60000); // Every minute

// Force garbage collection in development
// (global.gc only exists when Node is started with --expose-gc)
if (process.env.NODE_ENV === "development" && global.gc) {
  setInterval(() => {
    global.gc();
    logger.debug("Manual garbage collection triggered");
  }, 300000); // Every 5 minutes
}
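When a leak does appear, an on-demand heap snapshot is invaluable (open it in Chrome DevTools). A sketch; the route path is an assumption, and you'd gate it behind auth:
import { writeHeapSnapshot } from "v8";

app.get("/debug/heap-snapshot", (req, res) => {
  // Note: writeHeapSnapshot blocks the event loop while it runs
  const file = writeHeapSnapshot();
  res.json({ file });
});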
Stream large responses#
// ❌ BAD: Load entire file into memory
app.get("/download", async (req, res) => {
  const data = await readFile("/large-file.zip");
  res.send(data); // OOM for large files!
});

// ✅ GOOD: Stream the file
import { createReadStream } from "fs";

app.get("/download", (req, res) => {
  const stream = createReadStream("/large-file.zip");
  res.setHeader("Content-Type", "application/zip");
  res.setHeader("Content-Disposition", "attachment; filename=file.zip");
  stream.pipe(res);
  stream.on("error", (err) => {
    logger.error("Stream error", { error: err.message });
    res.status(500).end();
  });
});

// Streaming database results in pages (Prisma's findMany returns an
// array, not an async iterator, so paginate with a cursor)
import { once } from "events";

app.get("/export", async (req, res) => {
  res.setHeader("Content-Type", "text/csv");
  res.setHeader("Content-Disposition", "attachment; filename=export.csv");
  // Write CSV header
  res.write("id,name,email\n");
  let cursor: string | undefined;
  while (true) {
    const users = await db.user.findMany({
      take: 1000,
      ...(cursor ? { skip: 1, cursor: { id: cursor } } : {}),
      orderBy: { id: "asc" },
      select: { id: true, name: true, email: true },
    });
    if (users.length === 0) break;
    for (const user of users) {
      // Respect backpressure: wait for drain when the socket buffer fills
      if (!res.write(`${user.id},${user.name},${user.email}\n`)) {
        await once(res, "drain");
      }
    }
    cursor = users[users.length - 1].id;
  }
  res.end();
});
Request payload limits#
import express from "express";
const app = express();

// Prevent large payloads from overwhelming the server
app.use(express.json({ limit: "10mb" }));
app.use(express.urlencoded({ extended: true, limit: "10mb" }));

// Rate limiting
import rateLimit from "express-rate-limit";

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per window
  message: "Too many requests from this IP",
  standardHeaders: true,
  legacyHeaders: false,
});

app.use("/api/", limiter);
Part 7: Security Hardening#
// middleware/security.ts
import helmet from "helmet";
import cors from "cors";

// Security headers
app.use(
  helmet({
    contentSecurityPolicy: {
      directives: {
        defaultSrc: ["'self'"],
        styleSrc: ["'self'", "'unsafe-inline'"],
        scriptSrc: ["'self'"],
        imgSrc: ["'self'", "data:", "https:"],
      },
    },
    hsts: {
      maxAge: 31536000,
      includeSubDomains: true,
      preload: true,
    },
  }),
);

// CORS
app.use(
  cors({
    origin: process.env.ALLOWED_ORIGINS?.split(",") || [],
    credentials: true,
    maxAge: 86400, // 24 hours
  }),
);

// Request sanitization
import mongoSanitize from "express-mongo-sanitize";
import xss from "xss-clean";

app.use(mongoSanitize()); // Prevent NoSQL injection
app.use(xss()); // Prevent XSS (note: xss-clean is unmaintained; consider escaping on output instead)

// Input validation
import { body, validationResult } from "express-validator";

app.post(
  "/api/users",
  body("email").isEmail().normalizeEmail(),
  body("password").isLength({ min: 12 }),
  body("name").trim().isLength({ min: 1, max: 100 }),
  async (req, res) => {
    const errors = validationResult(req);
    if (!errors.isEmpty()) {
      return res.status(400).json({ errors: errors.array() });
    }
    // Process request...
  },
);
The Enterprise Checklist#
✓ Architecture
- Multi-process with cluster/PM2
- Load balancing
- Service discovery
- Circuit breakers
✓ Performance
- Connection pooling
- Multi-level caching
- Async everywhere
- Streaming for large data
✓ Reliability
- Graceful shutdown
- Health checks
- Error boundaries
- Automatic restarts
✓ Observability
- Structured logging
- Request tracing
- Metrics collection
- APM integration
✓ Security
- Input validation
- Rate limiting
- Security headers
- Secrets management
✓ Deployment
- Zero-downtime deploys
- Blue-green deployments
- Auto-scaling
- Rollback strategy
The Real Numbers#
After implementing these patterns, here's what we saw:
- Throughput: 10k → 800k requests/min
- P99 latency: 850ms → 120ms
- Memory usage: 512MB → 280MB per instance
- Crash rate: 3-4/day → 0 (zero crashes in 90 days)
- MTTR (Mean Time To Recovery): 15min → 2min
Key Takeaways#
- Think in processes: One Node process ≠ one server
- Async is non-negotiable: Block the event loop = game over
- Cache aggressively: Memory > Redis > Database > External API
- Monitor everything: You can't improve what you don't measure
- Design for failure: Services will fail. Handle it gracefully.
Running Node.js at scale isn't magic. It's understanding the runtime, applying proven patterns, and monitoring relentlessly.
What's your biggest Node.js scaling challenge? Let's solve it together.
