
Web Scraping Best Practices in 2026: A Practitioner's Guide

4 min read by Richard Feng

After building and maintaining 10 production scrapers that serve over 2,700 users with >99% success rates, here are the practices that actually matter.

Architecture: Think in Pipelines, Not Scripts

The biggest mistake I see is treating scraping as a single-step process. Production scrapers are data pipelines:

  1. URL Discovery — find what to scrape (sitemaps, category pages, search, APIs)
  2. Request Execution — fetch the data with proper retry and rotation
  3. Parsing — extract structured fields from raw responses
  4. Normalization — clean, validate, and standardize the output
  5. Storage — push to datasets, databases, or downstream systems

Each step should be independently testable and retryable. When Sephora changes their product page layout, only step 3 needs updating — the rest of the pipeline stays stable.
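
As a minimal sketch, each stage can be a plain function so it is testable in isolation. The selectors and function names below are illustrative, not taken from any real scraper:

```javascript
// Stage 1: URL discovery from a sitemap (naive regex, for illustration only).
function discoverUrls(sitemapXml) {
  return [...sitemapXml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
}

// Stage 3: parsing. The only stage that changes when the page layout changes.
function parseProduct(html) {
  const title = (html.match(/<h1[^>]*>(.*?)<\/h1>/) || [])[1] ?? null;
  return { title };
}

// Stage 4: normalization. Clean and standardize the parsed fields.
function normalizeProduct(raw) {
  return { title: raw.title ? raw.title.trim() : null };
}
```

Because each stage is a pure function of its input, a layout change means updating `parseProduct` alone and replaying stored raw responses through it.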

Always Prefer APIs Over HTML Parsing

Before writing a single CSS selector, check if the site has:

  • Public APIs — documented endpoints that return JSON
  • Private APIs — XHR/fetch calls visible in browser DevTools
  • GraphQL endpoints — increasingly common, often with introspection enabled
  • Embedded JSON — __NEXT_DATA__, window.__INITIAL_STATE__, or JSON-LD in the HTML

API responses are structured, versioned, and far more stable than HTML layouts. My Sephora scraper converts every web URL into an API call — it hasn’t broken once from a frontend redesign.
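
Next.js sites, for example, embed their full page state in a __NEXT_DATA__ script tag. A sketch of pulling it out, assuming the conventional tag id (some sites rename or split it):

```javascript
// Extract the embedded JSON that Next.js ships in server-rendered pages.
// Assumes the standard <script id="__NEXT_DATA__"> tag.
function extractNextData(html) {
  const m = html.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  return m ? JSON.parse(m[1]) : null;
}
```

The returned object is already structured, so parsing it replaces an entire layer of CSS selectors.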

Proxy Strategy: Match the Protection

Not every site needs residential proxies. Here’s my decision framework:

Protection Level   Proxy Type                      Example Sites
None / Basic       Datacenter                      Most Shopify stores, small sites
Rate limiting      Rotating datacenter             Medium e-commerce, content sites
Fingerprinting     Residential                     Sephora, Farfetch, major brands
Advanced WAF       Residential + TLS fingerprint   Akamai, Cloudflare Enterprise

The key insight: proxy cost scales with protection level. Don’t waste money on residential proxies for sites that only check IP reputation. My Shopify scraper works fine with datacenter proxies because Shopify’s default protection is minimal.
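
One way to encode that framework is a lookup that defaults to the cheapest tier. The hostnames and relative cost numbers below are illustrative, not real prices:

```javascript
// Illustrative relative costs per proxy tier (not real pricing).
const PROXY_TIERS = {
  datacenter: { relativeCost: 1 },
  rotatingDatacenter: { relativeCost: 2 },
  residential: { relativeCost: 8 },
  residentialTls: { relativeCost: 12 },
};

// Per-site overrides; everything else falls back to the cheapest tier.
const SITE_PROTECTION = {
  "www.sephora.fr": "residential",
  "www.farfetch.com": "residential",
};

function pickProxyTier(hostname) {
  return SITE_PROTECTION[hostname] ?? "datacenter";
}
```

Defaulting cheap and upgrading only on evidence of blocking keeps the cost curve tied to the protection curve.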

Session Management Is Everything

The difference between a 60% and 99% success rate is usually session management:

  • Rotate sessions, not just IPs — a new IP with the same cookies looks suspicious
  • Warm up sessions — visit the homepage before hitting product pages
  • Respect rate limits — 5 concurrent requests beats 50 that get blocked
  • Exponential backoff — 1s, 2s, 4s, 8s retries, not immediate retries

My Sephora EU scraper manages guest tokens with automatic refresh and exponential backoff. It maintains persistent sessions that look like real browsing patterns.
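
The backoff schedule and session bundling above can be sketched like this (the session shape is illustrative):

```javascript
// 1s, 2s, 4s, 8s ... capped, matching the schedule above.
function backoffMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// A session ties cookies and proxy together so they rotate as a unit:
// a fresh IP carrying old cookies is exactly what looks suspicious.
function newSession(proxyPool) {
  const proxy = proxyPool[Math.floor(Math.random() * proxyPool.length)];
  return { proxy, cookies: new Map(), warmedUp: false };
}
```

The `warmedUp` flag lets the request layer route a session through the homepage once before it touches product pages.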

Normalize Your Output

Raw scraped data is messy. Normalize everything:

Prices

Store as integers (cents, not dollars). $29.99 becomes 2999. This avoids floating-point precision errors that corrupt financial data downstream. Every one of my e-commerce scrapers uses this convention.
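
A sketch of that conversion; it handles plain "$X.YY"-style strings only, and locale formats with comma decimals would need more care:

```javascript
// Parse a display price like "$29.99" into integer cents.
function priceToCents(display) {
  const digits = display.replace(/[^0-9.]/g, "");
  const m = digits.match(/^(\d+)(?:\.(\d{1,2}))?$/);
  if (!m) return null;
  return parseInt(m[1], 10) * 100 + parseInt((m[2] ?? "").padEnd(2, "0"), 10);
}
```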

URLs

Always store absolute URLs, never relative paths. Resolve them at extraction time.
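
Node's built-in WHATWG URL class handles the resolution; one sketch:

```javascript
// Resolve a scraped href against the page it came from, at extraction time.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).toString();
}
```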

Dates

ISO 8601 (2026-04-01T00:00:00Z), always with timezone. Never store locale-formatted dates.

Text

Strip excess whitespace, normalize Unicode, and decide on an HTML handling policy (strip tags vs. preserve formatting).
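
A sketch that applies all three; this version strips tags, which is one possible HTML policy, not the only one:

```javascript
// Strip tags, normalize Unicode to NFC, and collapse whitespace.
function normalizeText(raw) {
  return raw
    .replace(/<[^>]+>/g, " ") // one possible HTML policy: drop tags entirely
    .normalize("NFC")
    .replace(/\s+/g, " ") // \s also covers non-breaking spaces
    .trim();
}
```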

Error Handling: Expect Failure

Production scrapers fail constantly — the question is how gracefully. My approach:

Request fails (network error, timeout, 4xx/5xx)
  → Retry with exponential backoff (up to 5 attempts)
    → Rotate session/proxy on retry
      → Log failure with full context if all retries exhausted
        → Continue processing remaining URLs (don't crash the batch)
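
That flow, as a sketch: a wrapper that retries with exponential backoff, passes the attempt index so the caller can rotate session/proxy, and returns a result object instead of throwing, so one bad URL never kills the batch:

```javascript
// Retry an async task with exponential backoff; never throw to the batch loop.
async function withRetries(task, { attempts = 5, baseMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      // The attempt index lets the caller rotate session/proxy on retry.
      return { ok: true, value: await task(i) };
    } catch (err) {
      if (i === attempts - 1) return { ok: false, error: String(err) };
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
}
```

The batch loop then checks `ok`, logs failures with full context, and moves on to the next URL.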

Track success rates per URL pattern. If /category/* pages suddenly drop below 90% success, the site probably changed something — you’ll catch it before users report it.
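
Per-pattern tracking can be as simple as a counter map; the 90% threshold below mirrors the example above:

```javascript
// Track success rate per URL pattern so layout changes surface quickly.
function makeSuccessTracker(alertThreshold = 0.9) {
  const stats = new Map(); // pattern -> { ok, total }
  return {
    record(pattern, ok) {
      const s = stats.get(pattern) ?? { ok: 0, total: 0 };
      s.total += 1;
      if (ok) s.ok += 1;
      stats.set(pattern, s);
    },
    // Patterns whose success rate has dropped below the threshold.
    alerts() {
      return [...stats.entries()]
        .filter(([, s]) => s.ok / s.total < alertThreshold)
        .map(([pattern]) => pattern);
    },
  };
}
```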

Monitor and Alert

A scraper without monitoring is a scraper waiting to silently fail. Track:

  • Success rate per run and per URL pattern
  • Output count — sudden drops mean something broke
  • Data quality — null fields, unexpected values, schema violations
  • Cost — proxy usage, compute time, storage

My Apify actors all expose these metrics. When success rates dip, I get notified within hours — often before any user notices.

Start Simple, Add Complexity

Every scraper I build starts as the simplest thing that works:

  1. HTTP + Cheerio first (fastest, cheapest)
  2. Add fingerprinting only if blocked
  3. Add browser rendering only if JavaScript is required
  4. Add proxy rotation only if rate-limited

My Ulta scraper is pure Cheerio — no browser needed. My Universal Web Printer uses Playwright because it must render JavaScript. Right tool for the job.


These aren’t theoretical principles — they’re extracted from running production scrapers that process millions of requests. If you need a custom scraper built with these practices, let’s talk.

Richard Feng
Web scraping engineer with 12+ years of experience. Building production-grade data extraction tools.
