Web Scraping Best Practices in 2026: A Practitioner's Guide
After building and maintaining 10 production scrapers that serve over 2,700 users with >99% success rates, here are the practices that actually matter.
Architecture: Think in Pipelines, Not Scripts
The biggest mistake I see is treating scraping as a single-step process. Production scrapers are data pipelines:
1. URL Discovery — find what to scrape (sitemaps, category pages, search, APIs)
2. Request Execution — fetch the data with proper retry and rotation
3. Parsing — extract structured fields from raw responses
4. Normalization — clean, validate, and standardize the output
5. Storage — push to datasets, databases, or downstream systems
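The stages above can be sketched as independent, individually testable functions. This is an illustrative skeleton, not production code; all names (discover_urls, parse_product, and the Product fields) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Product:
    url: str
    title: str
    price_cents: int

def discover_urls(sitemap: list) -> list:
    # Stage 1: URL discovery. Here we just filter a sitemap listing.
    return [u for u in sitemap if "/product/" in u]

def parse_product(url: str, raw: dict) -> Product:
    # Stage 3: parsing. Turn a raw response into structured fields.
    # When the site's layout changes, only this function needs updating.
    return Product(url=url, title=raw["title"], price_cents=raw["price_cents"])

def normalize(product: Product) -> Product:
    # Stage 4: normalization. Clean and standardize before storage.
    return Product(
        url=product.url,
        title=product.title.strip(),
        price_cents=product.price_cents,
    )
```

Because each stage is a plain function with explicit inputs and outputs, you can unit-test the parser against saved HTML fixtures and retry a failed stage without re-running the whole pipeline.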
Each step should be independently testable and retryable. When Sephora changes their product page layout, only step 3 needs updating — the rest of the pipeline stays stable.
Always Prefer APIs Over HTML Parsing
Before writing a single CSS selector, check if the site has:
- Public APIs — documented endpoints that return JSON
- Private APIs — XHR/fetch calls visible in browser DevTools
- GraphQL endpoints — increasingly common, often with introspection enabled
- Embedded JSON — __NEXT_DATA__, window.__INITIAL_STATE__, or JSON-LD in the HTML
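The embedded-JSON case is easy to check with just the standard library. A minimal sketch for the Next.js __NEXT_DATA__ blob (the HTML here is a stand-in, not taken from any real site):

```python
import json
import re
from typing import Optional

def extract_next_data(html: str) -> Optional[dict]:
    # Next.js embeds page data as JSON inside a script tag with a
    # well-known id. A regex is enough for this narrow, fixed pattern.
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

html = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"sku": "A1"}}</script></html>'
)
data = extract_next_data(html)
```

For anything more complex than a single well-known script tag, an HTML parser is safer than regex, but for this fixed pattern the sketch above is usually enough.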
API responses are structured, versioned, and far more stable than HTML layouts. My Sephora scraper converts every web URL into an API call — it hasn’t broken once from a frontend redesign.
Proxy Strategy: Match the Protection
Not every site needs residential proxies. Here’s my decision framework:
| Protection Level | Proxy Type | Example Sites |
|---|---|---|
| None / Basic | Datacenter | Most Shopify stores, small sites |
| Rate limiting | Rotating datacenter | Medium e-commerce, content sites |
| Fingerprinting | Residential | Sephora, Farfetch, major brands |
| Advanced WAF | Residential + TLS fingerprint | Akamai, Cloudflare Enterprise |
The key insight: proxy cost scales with protection level. Don’t waste money on residential proxies for sites that only check IP reputation. My Shopify scraper works fine with datacenter proxies because Shopify’s default protection is minimal.
Session Management Is Everything
The difference between a 60% and 99% success rate is usually session management:
- Rotate sessions, not just IPs — a new IP with the same cookies looks suspicious
- Warm up sessions — visit the homepage before hitting product pages
- Respect rate limits — 5 concurrent requests beats 50 that get blocked
- Exponential backoff — 1s, 2s, 4s, 8s retries, not immediate retries
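The backoff schedule above can be computed with a one-liner. A minimal sketch (delays are returned rather than slept, so the schedule is easy to test; the cap value is an assumption):

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    # Exponential backoff: 1s, 2s, 4s, 8s, ... capped at `cap` seconds.
    # Production code often adds random jitter to each delay so that
    # many clients retrying at once don't hit the server in lockstep.
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]
```

The caller sleeps for each delay between attempts and gives up when the list is exhausted.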
My Sephora EU scraper manages guest tokens with automatic refresh and exponential backoff. It maintains persistent sessions that look like real browsing patterns.
Normalize Your Output
Raw scraped data is messy. Normalize everything:
Prices
Store as integers (cents, not dollars). $29.99 becomes 2999. This avoids floating-point precision errors that corrupt financial data downstream. Every one of my e-commerce scrapers uses this convention.
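A minimal conversion sketch, assuming US-style price strings (a symbol, optional thousands commas, a dot decimal separator); locale-aware formats like "29,99 €" would need extra handling:

```python
import re
from decimal import Decimal

def price_to_cents(raw: str) -> int:
    # Strip everything except digits and the decimal point, then use
    # Decimal (not float) so 29.99 * 100 is exactly 2999, never 2998.99...
    cleaned = re.sub(r"[^0-9.]", "", raw)
    return int(Decimal(cleaned) * 100)
```

Going through Decimal rather than float is the whole point: float arithmetic is exactly the precision hazard the integer-cents convention is meant to avoid.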
URLs
Always store absolute URLs, never relative paths. Resolve them at extraction time.
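Python's standard library already does this correctly, including edge cases like already-absolute URLs (example.com here is a placeholder domain):

```python
from urllib.parse import urljoin

base = "https://example.com/category/lipstick"

# Relative path: resolved against the page it was extracted from.
product = urljoin(base, "/product/123")

# Already absolute: urljoin leaves it untouched.
image = urljoin(base, "https://cdn.example.com/img.jpg")
```

Resolving at extraction time, while the page URL is still in hand, is what makes this reliable; relative paths stored raw lose their base forever.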
Dates
ISO 8601 (2026-04-01T00:00:00Z), always with timezone. Never store locale-formatted dates.
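A small serialization helper as a sketch, converting any timezone-aware datetime to UTC and formatting it with the trailing Z (sub-second precision is deliberately dropped here):

```python
from datetime import datetime, timezone

def iso_utc(dt: datetime) -> str:
    # Convert to UTC first, then format as ISO 8601 with a 'Z' suffix.
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

stamp = iso_utc(datetime(2026, 4, 1, tzinfo=timezone.utc))
```

Accepting only timezone-aware datetimes is intentional: a naive datetime is exactly the ambiguous, locale-dependent value this convention exists to forbid.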
Text
Strip excess whitespace, normalize Unicode, and decide on an HTML handling policy (strip tags vs. preserve formatting).
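The whitespace and Unicode parts are a few lines of standard library; the chosen normalization form (NFC) is an assumption, and the HTML policy is left to a separate, explicit step:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    # Normalize Unicode to NFC so visually identical strings compare
    # equal, then collapse runs of whitespace and trim the ends.
    text = unicodedata.normalize("NFC", raw)
    return re.sub(r"\s+", " ", text).strip()
```

Whatever policy you pick for HTML tags, apply it before this function so stray inter-tag whitespace gets collapsed along with everything else.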
Error Handling: Expect Failure
Production scrapers fail constantly — the question is how gracefully. My approach centers on pattern-level monitoring:
Track success rates per URL pattern. If /category/* pages suddenly drop below 90% success, the site probably changed something — you’ll catch it before users report it.
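Per-pattern tracking needs very little machinery. A hypothetical sketch of the idea (the class name, method names, and 90% threshold are all illustrative):

```python
from collections import defaultdict

class SuccessTracker:
    """Track scrape success rates keyed by URL pattern."""

    def __init__(self):
        # pattern -> [successes, total attempts]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, pattern: str, ok: bool) -> None:
        self.counts[pattern][1] += 1
        if ok:
            self.counts[pattern][0] += 1

    def rate(self, pattern: str) -> float:
        ok, total = self.counts[pattern]
        return ok / total if total else 1.0

    def alerts(self, threshold: float = 0.9) -> list:
        # Patterns whose success rate has dropped below the threshold.
        return [p for p in self.counts if self.rate(p) < threshold]
```

Keying by pattern rather than exact URL is what makes the signal useful: one flaky product page is noise, but /category/* dropping as a group means the site changed.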
Monitor and Alert
A scraper without monitoring is a scraper waiting to silently fail. Track:
- Success rate per run and per URL pattern
- Output count — sudden drops mean something broke
- Data quality — null fields, unexpected values, schema violations
- Cost — proxy usage, compute time, storage
My Apify actors all expose these metrics. When success rates dip, I get notified within hours — often before any user notices.
Start Simple, Add Complexity
Every scraper I build starts as the simplest thing that works:
- HTTP + Cheerio first (fastest, cheapest)
- Add fingerprinting only if blocked
- Add browser rendering only if JavaScript is required
- Add proxy rotation only if rate-limited
My Ulta scraper is pure Cheerio — no browser needed. My Universal Web Printer uses Playwright because it must render JavaScript. Right tool for the job.
These aren’t theoretical principles — they’re extracted from running production scrapers that process millions of requests. If you need a custom scraper built with these practices, let’s talk.
Related Posts
Understanding Anti-Bot Protection: What Works in 2026
A technical deep-dive into modern anti-bot systems — Cloudflare, Akamai, Datadome — and the legitimate bypass techniques used in production scraping.