
Web Scraping Best Practices in 2026: A Practitioner's Guide

4 min read by Richard Feng

After building and maintaining 10 production scrapers that serve over 2,700 users with >99% success rates, here are the practices that actually matter.

Architecture: Think in Pipelines, Not Scripts

The biggest mistake I see is treating scraping as a single-step process. Production scrapers are data pipelines:

  1. URL Discovery — find what to scrape (sitemaps, category pages, search, APIs)
  2. Request Execution — fetch the data with proper retry and rotation
  3. Parsing — extract structured fields from raw responses
  4. Normalization — clean, validate, and standardize the output
  5. Storage — push to datasets, databases, or downstream systems

Each step should be independently testable and retryable. When Sephora changes their product page layout, only step 3 needs updating — the rest of the pipeline stays stable.
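
As a minimal sketch, each stage can be a plain function so it is testable in isolation. The selectors and function names below are illustrative, not taken from any real scraper:

```javascript
// Stage 1: URL discovery from a sitemap (naive regex, for illustration only).
function discoverUrls(sitemapXml) {
  return [...sitemapXml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
}

// Stage 3: parsing. The only stage that changes when the page layout changes.
function parseProduct(html) {
  const title = (html.match(/<h1[^>]*>(.*?)<\/h1>/) || [])[1] ?? null;
  return { title };
}

// Stage 4: normalization. Clean and standardize the parsed fields.
function normalizeProduct(raw) {
  return { title: raw.title ? raw.title.trim() : null };
}
```

Because each stage is a pure function of its input, a layout change means updating `parseProduct` alone and replaying stored raw responses through it.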

Always Prefer APIs Over HTML Parsing

Before writing a single CSS selector, check if the site has:

  • Public APIs — documented endpoints that return JSON
  • Private APIs — XHR/fetch calls visible in browser DevTools
  • GraphQL endpoints — increasingly common, often with introspection enabled
  • Embedded JSON — __NEXT_DATA__, window.__INITIAL_STATE__, or JSON-LD in the HTML

API responses are structured, versioned, and far more stable than HTML layouts. My Sephora scraper converts every web URL into an API call — it hasn’t broken once from a frontend redesign.
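
Next.js sites, for example, embed their full page state in a __NEXT_DATA__ script tag. A sketch of pulling it out, assuming the conventional tag id (some sites rename or split it):

```javascript
// Extract the embedded JSON that Next.js ships in server-rendered pages.
// Assumes the standard <script id="__NEXT_DATA__"> tag.
function extractNextData(html) {
  const m = html.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  return m ? JSON.parse(m[1]) : null;
}
```

The returned object is already structured, so parsing it replaces an entire layer of CSS selectors.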

Proxy Strategy: Match the Protection

Not every site needs residential proxies. Here’s my decision framework:

Protection Level   Proxy Type                      Example Sites
None / Basic       Datacenter                      Most Shopify stores, small sites
Rate limiting      Rotating datacenter             Medium e-commerce, content sites
Fingerprinting     Residential                     Sephora, Farfetch, major brands
Advanced WAF       Residential + TLS fingerprint   Akamai, Cloudflare Enterprise

The key insight: proxy cost scales with protection level. Don’t waste money on residential proxies for sites that only check IP reputation. My Shopify scraper works fine with datacenter proxies because Shopify’s default protection is minimal.
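
One way to encode that framework is a lookup that defaults to the cheapest tier. The hostnames and relative cost numbers below are illustrative, not real prices:

```javascript
// Illustrative relative costs per proxy tier (not real pricing).
const PROXY_TIERS = {
  datacenter: { relativeCost: 1 },
  rotatingDatacenter: { relativeCost: 2 },
  residential: { relativeCost: 8 },
  residentialTls: { relativeCost: 12 },
};

// Per-site overrides; everything else falls back to the cheapest tier.
const SITE_PROTECTION = {
  "www.sephora.fr": "residential",
  "www.farfetch.com": "residential",
};

function pickProxyTier(hostname) {
  return SITE_PROTECTION[hostname] ?? "datacenter";
}
```

Defaulting cheap and upgrading only on evidence of blocking keeps the cost curve tied to the protection curve.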

Session Management Is Everything

The difference between a 60% and 99% success rate is usually session management:

  • Rotate sessions, not just IPs — a new IP with the same cookies looks suspicious
  • Warm up sessions — visit the homepage before hitting product pages
  • Respect rate limits — 5 concurrent requests beats 50 that get blocked
  • Exponential backoff — 1s, 2s, 4s, 8s retries, not immediate retries

My Sephora EU scraper manages guest tokens with automatic refresh and exponential backoff. It maintains persistent sessions that look like real browsing patterns.
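
The backoff schedule and session bundling above can be sketched like this (the session shape is illustrative):

```javascript
// 1s, 2s, 4s, 8s ... capped, matching the schedule above.
function backoffMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// A session ties cookies and proxy together so they rotate as a unit:
// a fresh IP carrying old cookies is exactly what looks suspicious.
function newSession(proxyPool) {
  const proxy = proxyPool[Math.floor(Math.random() * proxyPool.length)];
  return { proxy, cookies: new Map(), warmedUp: false };
}
```

The `warmedUp` flag lets the request layer route a session through the homepage once before it touches product pages.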

Normalize Your Output

Raw scraped data is messy. Normalize everything:

Prices

Store as integers (cents, not dollars). $29.99 becomes 2999. This avoids floating-point precision errors that corrupt financial data downstream. Every one of my e-commerce scrapers uses this convention.
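
A sketch of that conversion; it handles plain "$X.YY"-style strings only, and locale formats with comma decimals would need more care:

```javascript
// Parse a display price like "$29.99" into integer cents.
function priceToCents(display) {
  const digits = display.replace(/[^0-9.]/g, "");
  const m = digits.match(/^(\d+)(?:\.(\d{1,2}))?$/);
  if (!m) return null;
  return parseInt(m[1], 10) * 100 + parseInt((m[2] ?? "").padEnd(2, "0"), 10);
}
```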

URLs

Always store absolute URLs, never relative paths. Resolve them at extraction time.
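
Node's built-in WHATWG URL class handles the resolution; one sketch:

```javascript
// Resolve a scraped href against the page it came from, at extraction time.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).toString();
}
```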

Dates

ISO 8601 (2026-04-01T00:00:00Z), always with timezone. Never store locale-formatted dates.

Text

Strip excess whitespace, normalize Unicode, and decide on an HTML handling policy (strip tags vs. preserve formatting).
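
A sketch that applies all three; this version strips tags, which is one possible HTML policy, not the only one:

```javascript
// Strip tags, normalize Unicode to NFC, and collapse whitespace.
function normalizeText(raw) {
  return raw
    .replace(/<[^>]+>/g, " ") // one possible HTML policy: drop tags entirely
    .normalize("NFC")
    .replace(/\s+/g, " ") // \s also covers non-breaking spaces
    .trim();
}
```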

Error Handling: Expect Failure

Production scrapers fail constantly — the question is how gracefully. My approach:

Request fails (network error, timeout, 4xx/5xx)
  → Retry with exponential backoff (up to 5 attempts)
    → Rotate session/proxy on retry
      → Log failure with full context if all retries exhausted
        → Continue processing remaining URLs (don't crash the batch)
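
That flow, as a sketch: a wrapper that retries with exponential backoff, passes the attempt index so the caller can rotate session/proxy, and returns a result object instead of throwing, so one bad URL never kills the batch:

```javascript
// Retry an async task with exponential backoff; never throw to the batch loop.
async function withRetries(task, { attempts = 5, baseMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      // The attempt index lets the caller rotate session/proxy on retry.
      return { ok: true, value: await task(i) };
    } catch (err) {
      if (i === attempts - 1) return { ok: false, error: String(err) };
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
}
```

The batch loop then checks `ok`, logs failures with full context, and moves on to the next URL.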

Track success rates per URL pattern. If /category/* pages suddenly drop below 90% success, the site probably changed something — you’ll catch it before users report it.
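
Per-pattern tracking can be as simple as a counter map; the 90% threshold below mirrors the example above:

```javascript
// Track success rate per URL pattern so layout changes surface quickly.
function makeSuccessTracker(alertThreshold = 0.9) {
  const stats = new Map(); // pattern -> { ok, total }
  return {
    record(pattern, ok) {
      const s = stats.get(pattern) ?? { ok: 0, total: 0 };
      s.total += 1;
      if (ok) s.ok += 1;
      stats.set(pattern, s);
    },
    // Patterns whose success rate has dropped below the threshold.
    alerts() {
      return [...stats.entries()]
        .filter(([, s]) => s.ok / s.total < alertThreshold)
        .map(([pattern]) => pattern);
    },
  };
}
```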

Monitor and Alert

A scraper without monitoring is a scraper waiting to silently fail. Track:

  • Success rate per run and per URL pattern
  • Output count — sudden drops mean something broke
  • Data quality — null fields, unexpected values, schema violations
  • Cost — proxy usage, compute time, storage

My Apify actors all expose these metrics. When success rates dip, I get notified within hours — often before any user notices.

Start Simple, Add Complexity

Every scraper I build starts as the simplest thing that works:

  1. HTTP + Cheerio first (fastest, cheapest)
  2. Add fingerprinting only if blocked
  3. Add browser rendering only if JavaScript is required
  4. Add proxy rotation only if rate-limited

My Ulta scraper is pure Cheerio — no browser needed. My Universal Web Printer uses Playwright because it must render JavaScript. Right tool for the job.


These aren’t theoretical principles — they’re extracted from running production scrapers that process millions of requests. If you need a custom scraper built with these practices, let’s talk.

Richard Feng
Web scraping engineer with 12+ years of experience. Building production-grade data extraction tools.
