# Proooxy — Web Scraping Tools & Data-as-a-Service — Full Content (llms-full.txt)

> Professional web scraping tools and Data-as-a-Service solutions by Richard Feng. 10 production-grade Apify actors for e-commerce data extraction, SEO auditing, and more.

This file is a single-document concatenation of every public page on https://proooxy.com/, intended for direct ingestion by LLMs and AI agents. Each page is preceded by a fact-block of structured frontmatter (URL, type, category, technology, regions, dates) and followed by its full Markdown body. Generated automatically from the live site.

---

## Site metadata

- **Site:** Proooxy — Web Scraping Tools & Data-as-a-Service
- **URL:** https://proooxy.com/
- **Description:** Professional web scraping tools and Data-as-a-Service solutions by Richard Feng. 10 production-grade Apify actors for e-commerce data extraction, SEO auditing, and more.
- **Author:** Richard Feng
- **GitHub:** https://github.com/autofacts
- **Twitter / X:** https://twitter.com/chideat
- **Apify Store:** https://apify.com/autofacts
- **Tools:** 10 production actors
- **Posts:** 2
- **Last generated:** 2026-05-08

---

## Tool catalog index

- [Sephora Scraper (Global)](https://proooxy.com/tools/sephora-scraper/) — Scrape any Sephora storefront — 21 markets, one actor.
- [Boohoo Scraper](https://proooxy.com/tools/boohoo-scraper/) — Scrape Boohoo product data across 7 regional stores.
- [Farfetch Scraper](https://proooxy.com/tools/farfetch-scraper/) — Scrape luxury fashion products from Farfetch with multi-currency support.
- [Global API Load Tester](https://proooxy.com/tools/load-tester/) — Simulate 10K+ RPS with geo-distributed load testing.
- [Lululemon Scraper](https://proooxy.com/tools/lululemon-scraper/) — Extract product data with variants and media from Lululemon.
- [Schema Markup Scraper & SEO Auditor](https://proooxy.com/tools/schema-markup-scraper/) — Extract structured data and audit SEO for any website.
- [Sephora EU Scraper](https://proooxy.com/tools/sephora-eu-scraper/) — Extract product data from Sephora across 9 European markets.
- [Shopify Scraper](https://proooxy.com/tools/shopify-scraper/) — Extract product data from any Shopify store.
- [Ulta Beauty Scraper](https://proooxy.com/tools/ulta-scraper/) — Scrape complete product data from Ulta Beauty.
- [Universal Web Printer](https://proooxy.com/tools/web-printer/) — Convert URLs and HTML to PDF, PNG, JPEG, or WebP.

---

# Sephora Scraper (Global)

- **URL:** https://proooxy.com/tools/sephora-scraper/
- **Type:** tools
- **Description:** Apify actor that extracts complete Sephora product data — variants, prices, images, ingredients, and reviews — from 21 storefronts across the US, Canada, 9 EU markets, and 10 Asia-Pacific markets in a single normalized schema.
- **Summary:** Scrape any Sephora storefront — 21 markets, one actor.
- **Category:** ecommerce
- **Tech stack:** Python, Crawlee, curl_cffi
- **Markets / regions:** US, CA, FR, IT, DE, ES, PL, CZ, GR, RO, PT, NZ, AU, SG, MY, TH, ID, PH, HK, TW, BN
- **Anti-bot strategy:** Akamai bypass via residential proxies + curl_cffi TLS fingerprinting
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/sephora-scraper
- **Keywords:** sephora scraper, sephora global scraper, sephora product data, beauty product scraper, sephora api, cosmetics data extraction, sephora europe, sephora apac
- **Published:** 2026-04-18
- **Modified:** 2026-04-18

**Key features:**

- 21 storefronts in one actor — US, Canada, 9 EU markets, and 10 APAC markets covered by a single SKU
- Auto-detected market — paste any sephora.* URL and the dispatcher routes it to the right module
- Mixed multi-market runs — US + EU + SEA URLs in one startUrls list, streamed to a single dataset tagged with `market`
- Locale-correct pricing — NZD, EUR, USD, AUD and 12 other currencies returned by Sephora's own localization layer
- Normalized schema — every market emits the same
`source / brand / title / options / variants / medias / stats` shape
- Per-market session isolation — auth state cannot cross-contaminate between regions
- Global circuit breaker — 50 consecutive failures abort the run to avoid burning compute on a downed target

**Use cases:**

- Pan-regional pricing intelligence across US, EU, and APAC beauty markets
- Cross-market product availability and assortment monitoring
- Competitive analysis for brands launching in new Sephora regions
- Ingredient comparisons across regional formulations
- Review and rating tracking — including SEA wishlist signals and US AI sentiment summaries
- Loyalty / membership pricing audits per market

**Input parameters:**

- `startUrls` (array, required) — Product or category URLs from any sephora.* storefront. Market is auto-detected from the hostname.
- `market` (string, optional) — Optional market override (us, eu-fr, eu-it, eu-de, eu-es, eu-pl, eu-cz, eu-gr, eu-ro, eu-pt, sea-nz, sea-au, sea-sg, sea-my, sea-th, sea-id, sea-ph, sea-hk, sea-tw, sea-bn).
- `locale` (string, optional) — Optional BCP 47 locale (e.g. fr-FR, en-NZ) — overrides the market default.
- `categoryIds` (array, optional) — EU-only. SFCC category IDs like C479 — alternative to pasting category URLs.
- `proxy` (object, optional) — Apify proxy config. Residential strongly recommended; pin apifyProxyCountry to the target market.
- `maxConcurrency` (number, optional) — Concurrent requests. Default 5. US: 2-5. EU: 3. SEA: 8-16.
- `maxRequestsPerCrawl` (number, optional) — Global hard cap across all markets. 0 = unlimited.

**FAQ:**

**Q: Which Sephora storefronts does this scraper support?**
21 storefronts: US (sephora.com), Canada (sephora.ca), 9 EU markets (FR, IT, DE, ES, PL, CZ, GR, RO, PT), and 10 APAC markets (NZ, AU, SG, MY, TH, ID, PH, HK, TW, BN). Market is auto-detected from the hostname — no input changes needed when mixing markets.

**Q: Can I scrape multiple markets in a single run?**
Yes.
Mix sephora.com, sephora.fr, and sephora.nz URLs in one startUrls list. The dispatcher groups them by market, runs each module concurrently with market-appropriate auth, and tags every dataset item with a `market` field.

**Q: How does the scraper handle anti-bot protection?**
It uses residential proxies for all markets and curl_cffi for browser-grade TLS fingerprinting on EU and SEA traffic to bypass Akamai. US traffic uses Crawlee's HttpCrawler with a session pool that rotates on 403/429.

**Q: Will my existing v1.x US run configs keep working?**
Yes. Pre-2.0 inputs — startUrls, maxConcurrency, proxy, maxRequestsPerCrawl — behave identically. The only output change is a new `market` key on every item, which is a soft additive change.

**Q: Why are some fields null in SEA data?**
Sephora SEA's API doesn't expose a `lovesCount` counter, so APAC items have `stats.lovesCount = null`. Each variant has a boolean `wishlisted` field instead. Conversely, `sentiments` (AI review summaries) and `source.crawlUrl` are US-only.

**Q: Do I need separate API tokens or accounts per market?**
No. Your existing Apify API token works unchanged. The actor handles per-market guest tokens internally — no credentials required for EU/SEA, and US works on standard Apify residential proxies.
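
The hostname-based routing described in this FAQ can be sketched roughly as follows. This is a minimal illustration and not the actor's actual code: `HOST_TO_MARKET`, `detect_market`, and `group_by_market` are hypothetical names, the example URLs are made up, and only a handful of hostnames from the supported-markets table are included.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical lookup table: a few hostname -> market ID pairs taken from
# the supported-markets table. A real dispatcher would cover all 21 storefronts.
HOST_TO_MARKET = {
    "sephora.com": "us",
    "sephora.ca": "us",
    "sephora.fr": "eu-fr",
    "sephora.de": "eu-de",
    "sephora.nz": "sea-nz",
    "sephora.com.au": "sea-au",
}


def detect_market(url: str) -> str:
    """Return the market ID for a Sephora URL based on its hostname."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    try:
        return HOST_TO_MARKET[host]
    except KeyError:
        raise ValueError(f"unsupported Sephora hostname: {host!r}") from None


def group_by_market(start_urls: list[str]) -> dict[str, list[str]]:
    """Group a mixed startUrls list so each market can be crawled separately."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in start_urls:
        groups[detect_market(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative URLs only; the paths are placeholders.
    urls = [
        "https://www.sephora.com/product/example",
        "https://www.sephora.fr/p/example.html",
        "https://www.sephora.nz/products/example",
    ]
    for market, market_urls in group_by_market(urls).items():
        print(market, market_urls)
```

Raising on an unknown hostname, rather than guessing a default market, keeps a typo in `startUrls` from silently running against the wrong region.
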
## Supported markets

| Region | Market ID | Country | Currency | Hostname |
|---|---|---|---|---|
| Americas | `us` | United States | USD | sephora.com |
| Americas | `us` | Canada | CAD | sephora.ca |
| EU | `eu-fr` | France | EUR | sephora.fr |
| EU | `eu-it` | Italy | EUR | sephora.it |
| EU | `eu-de` | Germany | EUR | sephora.de |
| EU | `eu-es` | Spain | EUR | sephora.es |
| EU | `eu-pl` | Poland | PLN | sephora.pl |
| EU | `eu-cz` | Czech Republic | CZK | sephora.cz |
| EU | `eu-gr` | Greece | EUR | sephora.gr |
| EU | `eu-ro` | Romania | RON | sephora.ro |
| EU | `eu-pt` | Portugal | EUR | sephora.pt |
| APAC | `sea-nz` | New Zealand | NZD | sephora.nz |
| APAC | `sea-au` | Australia | AUD | sephora.com.au |
| APAC | `sea-sg` | Singapore | SGD | sephora.sg |
| APAC | `sea-my` | Malaysia | MYR | sephora.com.my |
| APAC | `sea-th` | Thailand | THB | sephora.co.th |
| APAC | `sea-id` | Indonesia | IDR | sephora.co.id |
| APAC | `sea-ph` | Philippines | PHP | sephora.ph |
| APAC | `sea-hk` | Hong Kong | HKD | sephora.hk |
| APAC | `sea-tw` | Taiwan | TWD | sephora.tw |
| APAC | `sea-bn` | Brunei | BND | sephora.bn |

## Output Example

```json
{
  "market": "sea-nz",
  "source": {
    "id": 58792,
    "canonicalUrl": "https://www.sephora.nz/products/rare-beauty-true-to-myself-natural-matte-longwear-foundation",
    "retailer": "SEPHORA",
    "currency": "NZD"
  },
  "brand": "Rare Beauty",
  "title": "True To Myself Natural Matte Longwear Foundation",
  "description": "A self-priming and self-setting foundation...",
  "ingredients": "Aqua/Water, Cyclopentasiloxane, Glycerin...",
  "currentSku": "770225",
  "categories": ["makeup/face/foundation"],
  "options": [
    {
      "name": "shade",
      "id": "66488",
      "values": [{"value": "1 Fair Neutral", "orderable": true}]
    }
  ],
  "variants": [
    {
      "id": "276343",
      "sku": "770225",
      "price": { "current": 77.0, "original": 77.0, "stockStatus": "IN_STOCK" },
      "options": [{"name": "shade", "value": "1 Fair Neutral"}],
      "highlights": ["NEW", "Only at Sephora"],
      "wishlisted": null
    }
  ],
  "medias": [{ "url": "https://www.sephora.nz/.../foundation-shade.jpg", "type": "image" }],
  "stats": { "reviewCount": 971, "rating": 4.8, "lovesCount": null }
}
```

## Tips

- **Pin proxy country to the target market.** A residential exit in a mismatched country is the single largest source of 403s from Sephora's Akamai layer. Set `apifyProxyCountry` to the storefront's ISO code (`US`, `FR`, `NZ`, etc.).
- **Smoke-test first.** Set `maxRequestsPerCrawl=10` before your first production run in a new market.
- **Tune concurrency per region.** US: 2-5. EU: 3. SEA: 8-16. Each market gets its own semaphore in mixed runs.

---

# Boohoo Scraper

- **URL:** https://proooxy.com/tools/boohoo-scraper/
- **Type:** tools
- **Description:** Extract product data from Boohoo e-commerce sites across 7 regions with automatic pagination, facet filtering, and multi-currency support.
- **Summary:** Scrape Boohoo product data across 7 regional stores.
- **Category:** ecommerce
- **Tech stack:** TypeScript, Cheerio, Fingerprint Generator
- **Markets / regions:** NL, SE, UK, IE, FR, AU, US
- **Anti-bot strategy:** Fingerprint generation for anti-bot bypass
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/boohoo-scraper
- **Keywords:** boohoo scraper, fast fashion data, boohoo product extraction, multi-region scraper, fashion data api
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- 7 regional store support — NL (EUR), SE (SEK), UK (GBP), IE (EUR), FR (EUR), AU (AUD), US (USD)
- Category and search scraping with automatic pagination
- Facet filter support — size, color, price range, style
- Full product details including variants and stock status
- Browser fingerprint generation for anti-bot bypass
- Multi-currency pricing based on regional store

**Use cases:**

- Fast fashion competitive pricing analysis
- Multi-region price comparison for the same products
- Trend monitoring in affordable fashion
- Inventory and stock tracking across regions
- Fashion market research across European and global markets

**Input parameters:**

- `startUrls` (array, required) — Boohoo product or category URLs
- `maxRequestsPerCrawl` (number, optional) — Request limit (default: 5)
- `maxConcurrency` (number, optional) — Parallel requests (default: 5)
- `proxy` (object, optional) — Proxy configuration

**FAQ:**

**Q: Which regional Boohoo stores are supported?**
Netherlands (EUR), Sweden (SEK), United Kingdom (GBP), Ireland (EUR), France (EUR), Australia (AUD), and United States (USD).

**Q: Can I filter products by size or color?**
Yes, the scraper supports facet filtering. You can provide filtered category URLs and the scraper will respect the applied filters.

**Q: How does the scraper handle pagination?**
Pagination is automatic. Provide a category or search URL and the scraper will follow all pagination links to extract every product.
## Output Example

```json
{
  "source": "https://www.boohoo.com/...",
  "brand": "boohoo",
  "title": "Oversized Hoodie",
  "description": "Stay cozy in this oversized hoodie...",
  "categories": ["Women", "Hoodies & Sweatshirts"],
  "price": { "current": 1500, "original": 3000, "currency": "GBP" },
  "variants": [
    { "sku": "BH-OH-BLK-S", "size": "S", "color": "Black", "inStock": true }
  ],
  "medias": [
    { "type": "image", "url": "https://..." }
  ]
}
```

---

# Farfetch Scraper

- **URL:** https://proooxy.com/tools/farfetch-scraper/
- **Type:** tools
- **Description:** Extract luxury fashion product data from Farfetch including multi-currency pricing, size/fit variations, and product recommendations.
- **Summary:** Scrape luxury fashion products from Farfetch with multi-currency support.
- **Category:** ecommerce
- **Tech stack:** TypeScript, Cheerio, Crawlee
- **Markets / regions:** Global
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/farfetch-scraper
- **Keywords:** farfetch scraper, luxury fashion data, farfetch product extraction, designer brand scraper, fashion data api
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Category and product detail page scraping
- Multi-currency pricing — auto-detects based on proxy location
- Optional size/fit variation extraction
- Up to 90 recommended products per item
- Full media galleries and detailed descriptions
- Brand, category, and extra info extraction

**Use cases:**

- Luxury fashion market intelligence
- Cross-platform price comparison for designer brands
- Fashion trend analysis and product discovery
- Competitive pricing for multi-brand retailers
- Product recommendation engine training data

**Input parameters:**

- `startUrls` (array, required) — Farfetch product or category URLs
- `proxy` (object, optional) — Proxy config — location affects currency
- `maxRequestsPerCrawl` (number, optional) — Request limit (default: 100)
- `maxConcurrency` (number, optional) — Parallel
requests (default: 5)
- `withSizeFit` (boolean, optional) — Include size/fit data (default: false)
- `withRecommends` (boolean, optional) — Include recommendations (default: false)

**FAQ:**

**Q: How does multi-currency pricing work?**
Farfetch displays prices based on your location. The scraper uses your proxy location to determine which currency is returned. Use a US proxy for USD, UK proxy for GBP, etc.

**Q: How many recommended products can be extracted?**
Up to 90 recommended products per item when withRecommends is enabled. This is useful for building product graphs and recommendation datasets.

## Output Example

```json
{
  "source": "https://www.farfetch.com/shopping/...",
  "brand": "Gucci",
  "title": "GG Marmont Matelasse Shoulder Bag",
  "description": "Crafted from matelasse leather...",
  "details": ["Made in Italy", "100% Calf Leather"],
  "categories": ["Women", "Bags", "Shoulder Bags"],
  "options": [
    { "name": "Size", "values": ["One Size"] }
  ],
  "variants": [
    { "sku": "FF-GU-001", "price": 229000, "currency": "USD", "inStock": true }
  ],
  "medias": [
    { "type": "image", "url": "https://..." }
  ]
}
```

---

# Global API Load Tester

- **URL:** https://proooxy.com/tools/load-tester/
- **Type:** tools
- **Description:** High-performance load testing tool simulating 10,000+ requests per second with geo-distributed traffic, weighted targets, and interactive HTML reports.
- **Summary:** Simulate 10K+ RPS with geo-distributed load testing.
- **Category:** utility
- **Tech stack:** Go, Vegeta
- **Markets / regions:** Global
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/global-api-load-tester
- **Keywords:** load testing, api load tester, stress test, performance testing, vegeta load test, geo-distributed testing
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Extreme performance — 10,000+ requests per second
- Geo-distributed testing from US, EU, and Asia
- Weighted multi-target attacks (e.g., 90% reads / 10% writes)
- Residential proxy support for realistic traffic
- Constant-rate pacing to prevent Coordinated Omission
- Interactive HTML reports via Vegeta Plots
- Detailed latency, throughput, and error metrics

**Use cases:**

- API performance benchmarking before launch
- Capacity planning and infrastructure sizing
- Finding breaking points and bottlenecks
- Testing CDN and load balancer configurations
- Geo-distributed latency testing
- Regression testing for performance-critical endpoints

**Input parameters:**

- `targets` (array, required) — Target endpoints with URL, method, body, headers, and weight
- `rate` (number, optional) — Requests per second (default: 50)
- `duration` (number, optional) — Test duration in seconds (default: 60)
- `geoDistribution` (array, optional) — Regions with country codes and traffic weights
- `useStickySessions` (boolean, optional) — Maintain session affinity (default: true)
- `maxCostLimit` (number, optional) — Cost ceiling for the test run

**FAQ:**

**Q: How does geo-distributed testing work?**
You specify country codes and traffic weights. The tool distributes requests across Apify proxy servers in those regions, simulating realistic global traffic patterns.

**Q: What is Coordinated Omission?**
It's a common load testing pitfall where the tool slows down when the target is overloaded, making results look better than reality. Vegeta uses constant-rate pacing to avoid this.
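
To make the Coordinated Omission point concrete, here is a small simulation, offered as a sketch rather than a description of how Vegeta or this actor is implemented. It assumes a hypothetical single-threaded server and compares latency measured from the moment a request was actually sent (a closed-loop client that waits for the previous response) with latency measured from the moment it was scheduled on a constant-rate timeline. The function name `simulate` and all numbers are illustrative assumptions.

```python
def simulate(service_times: list[float], interval: float) -> tuple[list[float], list[float]]:
    """Return (closed_loop_latencies, constant_rate_latencies) in seconds.

    service_times[i] is how long the hypothetical server needs for request i.
    interval is the intended gap between requests (1 / target rate).
    """
    closed_loop = []    # measured from when the request was actually sent
    constant_rate = []  # measured from when the request was scheduled
    server_free_at = 0.0
    for i, service in enumerate(service_times):
        intended_start = i * interval
        # A closed-loop client cannot send until the previous response arrives,
        # so its stopwatch starts late whenever the server stalls.
        actual_start = max(intended_start, server_free_at)
        finish = actual_start + service
        closed_loop.append(finish - actual_start)
        constant_rate.append(finish - intended_start)
        server_free_at = finish
    return closed_loop, constant_rate


if __name__ == "__main__":
    # One slow response (1 s) in an otherwise fast stream, target rate 10 rps.
    closed, paced = simulate([0.01, 1.0, 0.01, 0.01], interval=0.1)
    print("closed-loop latencies:  ", [round(x, 3) for x in closed])
    print("constant-rate latencies:", [round(x, 3) for x in paced])
```

In this toy model the closed-loop numbers show a single slow request, while the constant-rate numbers also charge the queueing delay to the requests that were stuck behind it, which is the effect the FAQ answer describes.
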

**Q: Can I test authenticated endpoints?**
Yes, include authorization headers in the target configuration. Each target can have its own headers, method, and body.

## Output Example

The load tester generates interactive HTML reports and structured metrics:

```json
{
  "summary": {
    "totalRequests": 50000,
    "duration": "60s",
    "rps": 833.33,
    "successRate": 99.8,
    "latency": {
      "mean": "12.4ms",
      "p50": "10.1ms",
      "p95": "28.7ms",
      "p99": "89.2ms",
      "max": "342.1ms"
    },
    "statusCodes": { "200": 49900, "503": 100 }
  }
}
```

---

# Lululemon Scraper

- **URL:** https://proooxy.com/tools/lululemon-scraper/
- **Type:** tools
- **Description:** Crawl and extract product details from Lululemon including variant data, color options, media galleries, and pricing information.
- **Summary:** Extract product data with variants and media from Lululemon.
- **Category:** ecommerce
- **Tech stack:** TypeScript, Crawlee
- **Markets / regions:** US
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/lululemon-scraper
- **Keywords:** lululemon scraper, athletic wear data, lululemon product extraction, activewear scraper
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Category and product page crawling
- Variant extraction with color and size options
- Full media galleries with color-specific images
- Price tracking with structured output
- Category hierarchy extraction
- Lightweight and fast with Crawlee framework

**Use cases:**

- Athletic wear market research and competitive analysis
- Price monitoring for resellers and comparison platforms
- Product catalog aggregation for fitness e-commerce
- Color and size availability tracking
- Trend analysis in activewear fashion

**Input parameters:**

- `startUrls` (array, required) — Lululemon product or category URLs
- `proxy` (object, optional) — Proxy configuration
- `maxConcurrency` (number, optional) — Parallel request limit

**FAQ:**

**Q: Does the scraper handle different color variants?**
Yes,
each color variant is extracted with its own images, SKU, and availability status. The media gallery is color-specific.

**Q: Can I scrape entire Lululemon categories?**
Yes, provide a category URL and the scraper will crawl all products within that category.

## Output Example

```json
{
  "source": "https://shop.lululemon.com/p/...",
  "brand": "lululemon",
  "title": "Align High-Rise Pant 25\"",
  "description": "Buttery-soft, weightless Nulu fabric...",
  "categories": ["Women", "Pants", "Yoga Pants"],
  "options": [
    { "name": "Color", "values": ["Black", "True Navy", "Dark Olive"] },
    { "name": "Size", "values": ["2", "4", "6", "8", "10", "12"] }
  ],
  "variants": [
    { "sku": "LL-AHR-BLK-6", "name": "Black / 6", "price": 9800, "currency": "USD", "inStock": true }
  ],
  "medias": [
    { "type": "image", "url": "https://...", "color": "Black" }
  ],
  "stats": { "rating": 4.7, "reviewCount": 15234 }
}
```

---

# Schema Markup Scraper & SEO Auditor

- **URL:** https://proooxy.com/tools/schema-markup-scraper/
- **Type:** tools
- **Description:** Extract JSON-LD, Microdata, RDFa, Open Graph, and Twitter Cards from any URL with a comprehensive SEO audit scoring system.
- **Summary:** Extract structured data and audit SEO for any website.
- **Category:** utility
- **Tech stack:** TypeScript, Crawlee
- **Markets / regions:** Global
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/schema-markup-scraper
- **Keywords:** schema markup scraper, seo auditor, json-ld extractor, structured data, open graph extractor, seo analysis tool
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Structured data extraction — JSON-LD, Microdata, and RDFa
- Social meta tags — Open Graph, Twitter Cards, Dublin Core
- SEO analysis with 0-100 scoring
- Canonical URL and hreflang validation
- Author extraction for EEAT signals
- LocalBusiness detection with 80+ subtypes
- Image alt text audit
- Breadcrumb schema validation
- Geo tags and NAP extraction

**Use cases:**

- Technical SEO auditing at scale
- Structured data validation for websites
- Competitive SEO analysis — compare schema markup across competitors
- EEAT signal assessment for content sites
- Local SEO auditing for businesses
- Pre-launch SEO checklist validation

**Input parameters:**

- `startUrls` (array, required) — URLs to analyze
- `proxy` (object, optional) — Proxy configuration
- `maxRequestsPerCrawl` (number, optional) — Limit total URLs to audit
- `maxConcurrency` (number, optional) — Parallel requests
- `extractMetaTags` (boolean, optional) — Extract meta tags (default: true)
- `extractSeoAnalysis` (boolean, optional) — Run SEO analysis (default: true)
- `computeSeoScore` (boolean, optional) — Calculate 0-100 SEO score (default: true)
- `extractGeoData` (boolean, optional) — Extract geo tags and NAP data

**FAQ:**

**Q: What structured data formats are supported?**
JSON-LD, Microdata, and RDFa. The scraper also extracts Open Graph, Twitter Cards, and Dublin Core metadata.

**Q: How is the SEO score calculated?**
The 0-100 score evaluates title tags, meta descriptions, heading hierarchy, image alt text, canonical URLs, mobile viewport, structured data presence, and more.
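
As a rough illustration of the JSON-LD support mentioned in this FAQ, the sketch below pulls `application/ld+json` blocks out of an HTML string using only the Python standard library. It is not the actor's code (its listed stack is TypeScript and Crawlee); the class name `JsonLdExtractor` and the helper `extract_json_ld` are hypothetical, and Microdata and RDFa are not covered.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.items: list = []
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so accumulate them.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the whole page
            # A block may hold a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD object found in the given HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    page = (
        '<html><head><script type="application/ld+json">'
        '{"@type": "Product", "name": "Example"}'
        "</script></head><body></body></html>"
    )
    print(extract_json_ld(page))
```

Skipping malformed blocks is a design choice for bulk audits: one broken script tag should not discard the rest of the page's structured data.
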

**Q: Can I audit multiple pages at once?**
Yes, provide multiple URLs in startUrls. The scraper processes them in parallel for fast bulk auditing.

## Output Example

```json
{
  "url": "https://example.com/product/...",
  "title": "Example Product Page",
  "linkedData": [
    { "@type": "Product", "name": "..." }
  ],
  "openGraph": { "og:title": "Example Product", "og:type": "product" },
  "twitterCard": { "card": "summary_large_image" },
  "seoAudit": {
    "score": 78,
    "issues": [
      "Missing alt text on 3 images",
      "No hreflang tags detected"
    ]
  },
  "headings": { "h1": ["Example Product"], "h2": ["Description", "Reviews"] }
}
```

---

# Sephora EU Scraper

- **URL:** https://proooxy.com/tools/sephora-eu-scraper/
- **Type:** tools
- **Description:** Scrape complete product data from Sephora Europe across 9 EU markets with multi-variant extraction, Akamai WAF bypass, and smart token management.
- **Summary:** Extract product data from Sephora across 9 European markets.
- **Category:** ecommerce
- **Tech stack:** TypeScript, Crawlee, Akamai Bypass
- **Markets / regions:** FR, IT, DE, ES, PL, CZ, GR, RO, PT
- **Anti-bot strategy:** Akamai WAF — browser-grade TLS fingerprinting
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/sephora-eu-scraper
- **Keywords:** sephora europe scraper, sephora eu data, european beauty data, akamai bypass scraper, multi-market scraper
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- 9 EU market support — FR, IT, DE, ES, PL, CZ, GR, RO, PT
- Multi-variant extraction with individual pricing and stock status
- High-resolution image galleries for each product
- Category browsing via category IDs for bulk extraction
- Browser-grade TLS fingerprinting to bypass Akamai WAF
- Guest token management with automatic refresh and exponential backoff

**Use cases:**

- Pan-European beauty market price comparison
- Cross-market product availability monitoring
- EU market expansion research for beauty brands
- Competitive
intelligence across European markets
- Regional pricing strategy analysis

**Input parameters:**

- `startUrls` (array, optional) — Sephora EU product URLs to scrape
- `categoryIds` (array, optional) — Category IDs for bulk product extraction
- `locale` (string, optional) — Target market locale (e.g., fr-FR, it-IT)
- `maxProducts` (number, optional) — Maximum products to extract
- `maxConcurrency` (number, optional) — Parallel request limit
- `proxyConfiguration` (object, optional) — Proxy settings — residential recommended

**FAQ:**

**Q: Which European Sephora markets are supported?**
France (fr-FR), Italy (it-IT), Germany (de-DE), Spain (es-ES), Poland (pl-PL), Czech Republic (cs-CZ), Greece (el-GR), Romania (ro-RO), and Portugal (pt-PT).

**Q: How does the scraper bypass Akamai WAF?**
It uses browser-grade TLS fingerprinting to mimic real browser connections, making requests indistinguishable from genuine user traffic.

**Q: Can I scrape entire categories?**
Yes, you can provide category IDs to extract all products within a category. This is the most efficient way to do bulk extraction.

## Output Example

```json
{
  "source": "https://www.sephora.fr/p/...",
  "brand": "Rare Beauty",
  "title": "Soft Pinch Liquid Blush",
  "description": "Un blush liquide longue tenue...",
  "shortDescription": "Blush liquide",
  "categories": ["Maquillage", "Teint", "Blush"],
  "options": [
    { "name": "Shade", "values": ["Joy", "Hope", "Grace"] }
  ],
  "variants": [
    { "sku": "EU-RB-001", "name": "Joy", "price": 2800, "currency": "EUR", "inStock": true }
  ],
  "medias": [
    { "type": "image", "url": "https://..." }
  ],
  "stats": { "rating": 4.7, "reviewCount": 3421 }
}
```

---

# Shopify Scraper

- **URL:** https://proooxy.com/tools/shopify-scraper/
- **Type:** tools
- **Description:** Professional-grade tool for extracting high-fidelity product data from any Shopify-powered store including collections, search, and product recommendations.
- **Summary:** Extract product data from any Shopify store.
- **Category:** ecommerce
- **Tech stack:** TypeScript, got-scraping
- **Markets / regions:** Global
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/shopify-scraper
- **Keywords:** shopify scraper, shopify product data, shopify store scraper, ecommerce data extraction, shopify api alternative
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Universal — works with any Shopify-powered store
- Store-wide catalog extraction and search support
- Product recommendations (up to 20 per product)
- Collection and individual product scraping
- Tag and category extraction
- Currency normalization (prices x100) for precision

**Use cases:**

- Market research across Shopify stores in any niche
- Competitive analysis for DTC brands
- Product catalog aggregation for comparison platforms
- Trend monitoring across independent e-commerce stores
- Building product recommendation datasets
- Price monitoring for resellers

**Input parameters:**

- `startUrls` (array, required) — Shopify store URLs — product, collection, or store home
- `proxy` (object, optional) — Residential proxy recommended for best results
- `maxRequestsPerCrawl` (number, optional) — Request limit (default: 100)
- `maxRecommendationsPerProduct` (number, optional) — Recommended products to fetch (default: 0, max: 20)
- `query` (string, optional) — Search query to find products within a store

**FAQ:**

**Q: Does this work with any Shopify store?**
Yes, the scraper works with any store powered by Shopify. It leverages Shopify's standard product data structure, which is consistent across all stores.

**Q: Can I search for specific products?**
Yes, use the query parameter to search within a specific store. This is useful for finding specific product types without scraping the entire catalog.

**Q: How are product recommendations extracted?**
Set maxRecommendationsPerProduct to fetch related products.
Up to 20 recommendations are available per product, useful for building product graphs.

## Output Example

```json
{
  "source": "https://store.example.com/products/...",
  "brand": "Example Brand",
  "title": "Premium Organic Cotton T-Shirt",
  "description": "Made from 100% organic cotton...",
  "categories": ["Tops", "T-Shirts"],
  "tags": ["organic", "sustainable", "cotton"],
  "options": [
    { "name": "Size", "values": ["S", "M", "L", "XL"] },
    { "name": "Color", "values": ["White", "Black", "Navy"] }
  ],
  "variants": [
    { "sku": "SHOP-OCT-WHT-M", "name": "White / M", "price": 4500, "currency": "USD", "inStock": true }
  ],
  "medias": [
    { "type": "image", "url": "https://..." }
  ]
}
```

---

# Ulta Beauty Scraper

- **URL:** https://proooxy.com/tools/ulta-scraper/
- **Type:** tools
- **Description:** Extract product details, pricing, images, and SKU information from Ulta Beauty including category pages, brand pages, and sale sections.
- **Summary:** Scrape complete product data from Ulta Beauty.
- **Category:** ecommerce
- **Tech stack:** TypeScript, Cheerio, Crawlee
- **Markets / regions:** US
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/ulta-scraper
- **Keywords:** ulta scraper, ulta beauty data, beauty product scraper, ulta product extraction, cosmetics data
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Supports category, product detail, brand, and sale pages
- Full product details with prices, descriptions, and images
- SKU-level data extraction with variant grouping
- Automatic detection of page type from URL
- Lightweight Cheerio-based parsing for speed
- Groups related SKUs under the same product

**Use cases:**

- Beauty industry competitive analysis — Ulta vs Sephora pricing
- Product catalog building for comparison shopping platforms
- Sale and promotion monitoring
- Brand discovery and market presence tracking
- SKU-level inventory monitoring

**Input parameters:**

- `startUrls` (array, required) — Ulta product,
category, brand, or sale URLs
- `proxy` (object, optional) — Proxy configuration
- `maxConcurrency` (number, optional) — Maximum parallel requests
- `maxRequestsPerCrawl` (number, optional) — Limit total requests per run

**FAQ:**

**Q: What types of Ulta pages can be scraped?**
The scraper supports product detail pages, category listing pages, brand pages, and sale/promotion pages. It automatically detects the page type from the URL.

**Q: How are product variants handled?**
Variants (different shades, sizes) are grouped under the same parent product. Each variant includes its own SKU, price, and availability status.

## Output Example

```json
{
  "source": "https://www.ulta.com/p/...",
  "brand": "NYX Professional Makeup",
  "title": "Butter Gloss",
  "description": "A buttery soft and silky lip gloss...",
  "categories": ["Makeup", "Lips", "Lip Gloss"],
  "variants": [
    { "sku": "ULTA-NYX-BG-001", "name": "Angel Food Cake", "price": 900, "currency": "USD", "inStock": true }
  ],
  "stats": { "rating": 4.5, "reviewCount": 8932 }
}
```

---

# Universal Web Printer

- **URL:** https://proooxy.com/tools/web-printer/
- **Type:** tools
- **Description:** Convert any URL or HTML to PDF, PNG, JPEG, or WebP with smart scroll-stitch, element extraction, PDF encryption, and watermarking.
- **Summary:** Convert URLs and HTML to PDF, PNG, JPEG, or WebP.
- **Category:** utility
- **Tech stack:** TypeScript, Playwright, PDF-lib, Sharp
- **Markets / regions:** Global
- **Reported success rate:** >99%
- **Apify listing:** https://apify.com/autofacts/universal-web-printer
- **Keywords:** web to pdf, html to pdf, screenshot api, url to image, web printer, pdf generator
- **Published:** 2026-04-04
- **Modified:** 2026-04-04

**Key features:**

- Multi-format output — PDF, PNG, JPEG, WebP
- Multiple view modes — viewport, full-page, CSS selector, readability
- Smart scroll-stitch for accurate full-page captures
- Element-level extraction via CSS selectors
- Page manipulation — remove elements, click buttons, inject CSS, hide fixed headers
- PDF encryption (RC4 128-bit) and watermarking
- PDF merging for multi-page documents
- Custom viewport and scale factor configuration

**Use cases:**

- Automated report generation from web dashboards
- Website archival and documentation
- Visual regression testing snapshots
- E-commerce product page screenshots for catalogs
- Legal compliance — capturing web content as evidence
- Generating PDFs from web applications

**Input parameters:**

- `startUrls` (array, optional) — URLs to render
- `htmlContent` (string, optional) — Raw HTML to render
- `outputFormat` (string, optional) — pdf, png, jpeg, or webp (default: pdf)
- `viewMode` (string, optional) — viewport, fullPage, selector, or readability
- `targetSelector` (string, optional) — CSS selector for element-level capture
- `viewportWidth` (number, optional) — Browser viewport width (default: 1280)
- `viewportHeight` (number, optional) — Browser viewport height (default: 720)
- `removeSelectors` (array, optional) — CSS selectors of elements to remove before capture
- `pdfPassword` (string, optional) — Encrypt PDF with RC4 128-bit encryption

**FAQ:**

**Q: Can I capture just a specific element on the page?**
Yes, use the targetSelector parameter with a CSS selector to capture only a specific element.
For example, use '#main-content' to capture just the main content area.

**Q: How does smart scroll-stitch work?**
For full-page captures, the tool scrolls the page in increments, capturing each viewport slice, then stitches them together. This ensures lazy-loaded content and animations are properly captured.

**Q: Can I remove cookie banners or ads before capture?**
Yes, use removeSelectors to specify CSS selectors of elements to remove. You can also use hideFixedElements to hide sticky headers and floating elements.

## Output Example

The tool generates files in your chosen format (PDF, PNG, JPEG, or WebP) and stores them in the Apify dataset. Each output includes metadata:

```json
{
  "url": "https://example.com",
  "format": "pdf",
  "fileName": "example-com.pdf",
  "fileSize": 245832,
  "viewMode": "fullPage",
  "viewport": { "width": 1280, "height": 720 },
  "encrypted": false
}
```

---

# Web Scraping Best Practices in 2026: A Practitioner's Guide

- **URL:** https://proooxy.com/blog/web-scraping-best-practices-2026/
- **Type:** blog
- **Description:** Battle-tested web scraping strategies from 12+ years of production experience — architecture patterns, error handling, proxy management, and output normalization.
- **Keywords:** web scraping best practices, production scraping, data extraction guide, scraper architecture
- **Published:** 2026-04-01
- **Modified:** 2026-04-01

After building and maintaining 10 production scrapers that serve over 2,700 users with >99% success rates, here are the practices that actually matter.

## Architecture: Think in Pipelines, Not Scripts

The biggest mistake I see is treating scraping as a single-step process. Production scrapers are data pipelines:

1. **URL Discovery** — find what to scrape (sitemaps, category pages, search, APIs)
2. **Request Execution** — fetch the data with proper retry and rotation
3. **Parsing** — extract structured fields from raw responses
4. **Normalization** — clean, validate, and standardize the output
5. **Storage** — push to datasets, databases, or downstream systems

Each step should be independently testable and retryable. When Sephora changes their product page layout, only step 3 needs updating — the rest of the pipeline stays stable.

## Always Prefer APIs Over HTML Parsing

Before writing a single CSS selector, check if the site has:

- **Public APIs** — documented endpoints that return JSON
- **Private APIs** — XHR/fetch calls visible in browser DevTools
- **GraphQL endpoints** — increasingly common, often with introspection enabled
- **Embedded JSON** — `__NEXT_DATA__`, `window.__INITIAL_STATE__`, or JSON-LD in the HTML

API responses are structured, versioned, and far more stable than HTML layouts. My [Sephora scraper](/tools/sephora-scraper/) converts every web URL into an API call — it hasn't broken once from a frontend redesign.

## Proxy Strategy: Match the Protection

Not every site needs residential proxies. Here's my decision framework:

| Protection Level | Proxy Type | Example Sites |
|-----------------|------------|---------------|
| None / Basic | Datacenter | Most Shopify stores, small sites |
| Rate limiting | Rotating datacenter | Medium e-commerce, content sites |
| Fingerprinting | Residential | Sephora, Farfetch, major brands |
| Advanced WAF | Residential + TLS fingerprint | Akamai, Cloudflare Enterprise |

The key insight: **proxy cost scales with protection level**. Don't waste money on residential proxies for sites with only basic protection or rate limiting. My [Shopify scraper](/tools/shopify-scraper/) works fine with datacenter proxies because Shopify's default protection is minimal.
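
As a rough illustration, the decision table above can be encoded as a lookup, so a scraper starts on the cheapest tier and only moves up when it keeps getting blocked. This is a minimal sketch, not how any of the actors described here are implemented: the `Protection` enum, the tier labels, and the `choose_proxy` and `escalate` helpers are hypothetical names introduced for the example.

```python
from enum import Enum
from typing import Optional


class Protection(Enum):
    """Protection levels, in the same order as the rows of the table."""
    BASIC = 1
    RATE_LIMITING = 2
    FINGERPRINTING = 3
    ADVANCED_WAF = 4


# Hypothetical tier labels; map them to whatever your proxy provider offers.
PROXY_TIER = {
    Protection.BASIC: "datacenter",
    Protection.RATE_LIMITING: "rotating-datacenter",
    Protection.FINGERPRINTING: "residential",
    Protection.ADVANCED_WAF: "residential+tls-fingerprint",
}


def choose_proxy(level: Protection) -> str:
    """Return the proxy tier the table recommends for a protection level."""
    return PROXY_TIER[level]


def escalate(level: Protection) -> Optional[Protection]:
    """Return the next level to assume after repeated blocks, or None at the top."""
    if level is Protection.ADVANCED_WAF:
        return None
    return Protection(level.value + 1)


if __name__ == "__main__":
    # Walk the tiers from cheapest to most expensive.
    level: Optional[Protection] = Protection.BASIC
    while level is not None:
        print(f"{level.name:<15} -> {choose_proxy(level)}")
        level = escalate(level)
```

Starting at `Protection.BASIC` and calling `escalate` only after a run of blocked requests keeps proxy spend proportional to what the target actually requires.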
## Session Management Is Everything

The difference between a 60% and 99% success rate is usually session management:

- **Rotate sessions, not just IPs** — a new IP with the same cookies looks suspicious
- **Warm up sessions** — visit the homepage before hitting product pages
- **Respect rate limits** — 5 concurrent requests beat 50 that get blocked
- **Exponential backoff** — 1s, 2s, 4s, 8s retries, not immediate retries

My [Sephora EU scraper](/tools/sephora-eu-scraper/) manages guest tokens with automatic refresh and exponential backoff. It maintains persistent sessions that look like real browsing patterns.

## Normalize Your Output

Raw scraped data is messy. Normalize everything:

### Prices

Store as integers (cents, not dollars). `$29.99` becomes `2999`. This avoids floating-point precision errors that corrupt financial data downstream. Every one of my e-commerce scrapers uses this convention.

### URLs

Always store absolute URLs, never relative paths. Resolve them at extraction time.

### Dates

ISO 8601 (`2026-04-01T00:00:00Z`), always with timezone. Never store locale-formatted dates.

### Text

Strip excess whitespace, normalize Unicode, and decide on an HTML handling policy (strip tags vs. preserve formatting).

## Error Handling: Expect Failure

Production scrapers fail constantly — the question is how gracefully. My approach:

```
Request fails (network error, timeout, 4xx/5xx)
  → Retry with exponential backoff (up to 5 attempts)
  → Rotate session/proxy on retry
  → Log failure with full context if all retries exhausted
  → Continue processing remaining URLs (don't crash the batch)
```

Track success rates per URL pattern. If `/category/*` pages suddenly drop below 90% success, the site probably changed something — you'll catch it before users report it.

## Monitor and Alert

A scraper without monitoring is a scraper waiting to silently fail.
Track:

- **Success rate** per run and per URL pattern
- **Output count** — sudden drops mean something broke
- **Data quality** — null fields, unexpected values, schema violations
- **Cost** — proxy usage, compute time, storage

My Apify actors all expose these metrics. When success rates dip, I get notified within hours — often before any user notices.

## Start Simple, Add Complexity

Every scraper I build starts as the simplest thing that works:

1. **HTTP + Cheerio** first (fastest, cheapest)
2. **Add fingerprinting** only if blocked
3. **Add browser rendering** only if JavaScript is required
4. **Add proxy rotation** only if rate-limited

My [Ulta scraper](/tools/ulta-scraper/) is pure Cheerio — no browser needed. My [Universal Web Printer](/tools/web-printer/) uses Playwright because it must render JavaScript. Right tool for the job.

---

These aren't theoretical principles — they're extracted from running production scrapers that process millions of requests. If you need a custom scraper built with these practices, [let's talk](/contact/).

---

# Understanding Anti-Bot Protection: What Works in 2026

- **URL:** https://proooxy.com/blog/bypassing-anti-bot-protection-guide/
- **Type:** blog
- **Description:** A technical deep-dive into modern anti-bot systems — Cloudflare, Akamai, Datadome — and the legitimate bypass techniques used in production scraping.
- **Keywords:** anti-bot bypass, cloudflare bypass, akamai bypass, datadome bypass, bot detection, web scraping protection
- **Published:** 2026-03-15
- **Modified:** 2026-03-15

Anti-bot protection is an arms race. As someone who builds production scrapers that bypass these systems daily, here's a practitioner's view of the landscape — what the protections actually check and what legitimate bypass techniques look like.

## The Detection Layers

Modern anti-bot systems operate in layers. Understanding these layers is the key to reliable bypass:

### Layer 1: IP Reputation

The simplest check.
Anti-bot services maintain databases of known datacenter IP ranges, VPN exits, and previously flagged IPs.

**What they check:**

- Is this IP from AWS, GCP, Azure, or a known hosting provider?
- Has this IP been flagged for bot activity before?
- How many requests have come from this IP recently?

**Counter-approach:** Residential proxies from services like Apify Proxy or Bright Data provide IP addresses that belong to real ISPs, making them indistinguishable from regular users at the IP level.

### Layer 2: TLS Fingerprinting

This is where it gets interesting. Every HTTP client has a unique TLS handshake signature based on:

- Supported cipher suites and their order
- TLS extensions and their order
- Supported TLS versions
- ALPN protocols

A standard `axios` or `requests` library has a TLS fingerprint that screams "bot" because it doesn't match any real browser. Services like Akamai and Cloudflare maintain fingerprint databases for every browser version.

**Counter-approach:** Libraries like `got-scraping` (which my [Shopify scraper](/tools/shopify-scraper/) uses) and specialized TLS clients can mimic browser-grade TLS fingerprints. My [Sephora EU scraper](/tools/sephora-eu-scraper/) uses browser-grade TLS fingerprinting to bypass Akamai WAF.

### Layer 3: HTTP/2 Fingerprinting

Beyond TLS, HTTP/2 settings reveal the client type:

- SETTINGS frame parameters (header table size, max concurrent streams)
- WINDOW_UPDATE frame values
- Priority tree structure
- Header compression (HPACK) patterns

Each browser has characteristic HTTP/2 settings. Chrome, Firefox, and Safari all look different at this level.
### Layer 4: JavaScript Challenges

Cloudflare's "checking your browser" page and similar challenges execute JavaScript that:

- Checks for browser APIs (canvas, WebGL, AudioContext)
- Measures execution timing
- Validates DOM properties
- Sends challenge responses back to the server

**Counter-approach:** Headless browsers (Playwright, Puppeteer) execute these challenges natively. The key is ensuring your headless browser doesn't leak automation signals (more on this below).

### Layer 5: Behavioral Analysis

The most sophisticated layer. These systems analyze:

- Mouse movement patterns (too linear = bot)
- Scroll behavior (instant scroll to bottom = bot)
- Time between actions (too consistent = bot)
- Navigation patterns (going directly to product pages without browsing = suspicious)
- Request cadence (perfectly uniform intervals = bot)

## Protection Profiles: Know What You're Facing

### Cloudflare

**Common on:** Small to medium sites, blogs, APIs

Cloudflare offers several protection levels:

- **Basic** — IP reputation + rate limiting. Datacenter proxies with rate respect usually work.
- **Managed Challenge** — JavaScript challenge + Turnstile. Needs browser or challenge solver.
- **Enterprise/Bot Management** — Full behavioral analysis + fingerprinting. Needs residential proxy + proper fingerprinting.

### Akamai Bot Manager

**Common on:** Enterprise e-commerce (Sephora EU, major retailers)

Akamai is one of the toughest to bypass because of:

- Aggressive TLS fingerprinting
- Sensor data collection via client-side JavaScript
- Session-level behavioral analysis
- Cookie integrity verification

My approach for Akamai: browser-grade TLS fingerprinting + guest token management + request pacing that mimics human browsing.
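
Request pacing that mimics human browsing can be approximated with a randomized delay between requests. The sketch below is illustrative only and is not taken from any of the actors: `human_delay` and `paced` are hypothetical names, and the defaults (a 2 to 5 second base delay with ±30% jitter) simply mirror the ranges this guide suggests under Request Pacing rather than any Akamai-specific threshold.

```python
import random
import time
from typing import Callable, Iterable, List, Optional, Tuple


def human_delay(rng: Optional[random.Random] = None,
                base_range: Tuple[float, float] = (2.0, 5.0),
                jitter: float = 0.30) -> float:
    """Return a delay in seconds: a random base delay scaled by +/- jitter."""
    rng = rng or random.Random()
    base = rng.uniform(*base_range)
    return base * (1.0 + rng.uniform(-jitter, jitter))


def paced(urls: Iterable[str],
          fetch: Callable[[str], object],
          rng: Optional[random.Random] = None,
          sleep: Callable[[float], None] = time.sleep) -> List[object]:
    """Call fetch(url) for each URL, sleeping a jittered delay between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No delay before the first request, a fresh random delay before each later one.
            sleep(human_delay(rng))
        results.append(fetch(url))
    return results
```

Passing `sleep` and `rng` as parameters keeps the pacing logic testable without real waiting; in production the defaults apply and `fetch` would be whatever HTTP client call the scraper uses.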
### Datadome

**Common on:** E-commerce, ticketing

Datadome focuses on:

- Device fingerprinting via JavaScript
- CAPTCHA challenges for suspicious traffic
- Real-time behavioral scoring

### PerimeterX (now HUMAN)

**Common on:** Retail, financial services

Known for aggressive JavaScript challenges and behavioral analysis.

## Legitimate Bypass Architecture

For production systems that need reliable, ongoing data extraction, here's the architecture pattern I use:

### 1. API-First Approach

Before attempting to bypass any protection, check if there's an API path that avoids the WAF entirely. Many protections only apply to browser-facing endpoints, not API routes.

My [Sephora scraper](/tools/sephora-scraper/) converts every web URL to an API call. The API endpoints have lighter protection than the website because they're designed for mobile apps.

### 2. Session Warming

Don't jump straight to the data page. Build a realistic browsing session:

```
Visit homepage → Browse categories → View product listing → Access product detail
```

Each step builds session credibility. The anti-bot system sees a pattern that matches real user behavior.

### 3. Fingerprint Consistency

This is critical: your fingerprint must be **internally consistent**. If your TLS says "Chrome 120" but your User-Agent says "Chrome 118", that's a detection signal. Align:

- TLS fingerprint
- HTTP/2 settings
- User-Agent header
- Accept-Language and other headers
- JavaScript browser properties (if using headless)

### 4. Request Pacing

Real humans don't make requests at precisely 1-second intervals. Introduce realistic variance:

- Base delay between requests (2-5 seconds)
- Random jitter (+/- 30%)
- Longer pauses after navigation events
- Occasional "idle" periods

### 5. Graceful Degradation

When you encounter a challenge or block:

1. Don't immediately retry — this confirms bot behavior
2. Back off exponentially
3. Rotate to a fresh session (new IP + new cookies)
4. Try a different proxy region
5. If persistent, switch to a browser-based approach

## What Doesn't Work (Anymore)

- **Just changing User-Agent** — detection systems check dozens of signals, not just one header
- **Random delays alone** — without proper fingerprinting, timing doesn't help
- **Headless Chrome with default settings** — automation signals leak everywhere (`navigator.webdriver`, missing plugins, Chrome DevTools Protocol artifacts)
- **Cookie replay** — modern systems tie cookies to TLS fingerprints and IP ranges

## Ethical Considerations

Anti-bot bypass is a tool. Like any tool, it can be used responsibly or irresponsibly.

**Legitimate use cases:**

- Price comparison for consumer benefit
- Market research with public data
- Accessibility (making data available in structured formats)
- Academic research
- Quality assurance and monitoring

**Always respect:**

- robots.txt directives
- Rate limits (even if you can exceed them, don't)
- Personal data regulations (GDPR, CCPA)
- Terms of service (understand the legal landscape in your jurisdiction)

All my [tools](/tools/) are designed for legitimate data extraction with built-in rate limiting and proxy best practices.

---

Understanding anti-bot systems makes you a better scraping engineer. If you need production-grade scrapers that handle these challenges reliably, check out my [tools](/tools/) or [get in touch](/contact/) for custom work.

---

# About

- **URL:** https://proooxy.com/about/
- **Type:** page
- **Description:** Richard Feng — web scraping engineer with 12+ years of coding experience specializing in data extraction, API reverse engineering, and anti-bot bypass.
- **Published:** 0001-01-01
- **Modified:** 0001-01-01

## Who I Am

I'm Richard Feng, a freelance web automation expert with 12+ years of coding experience. I specialize in **web scraping, data extraction, and API reverse engineering** — turning complex, protected websites into clean, structured data.
My toolkit spans **Node.js (TypeScript), Python, Golang, and Java**, with deep expertise in frameworks like **Crawlee, Playwright, and Cheerio**. I've built production systems that handle millions of requests with >99% success rates.

## What I Do

I build and maintain **10 production-grade scraping tools** on Apify, serving over **2,700 users** with a consistent **>99% success rate**. My tools focus on:

### E-Commerce Data Extraction

Scrapers for major retail platforms including [Sephora](/tools/sephora-scraper/), [Ulta Beauty](/tools/ulta-scraper/), [Farfetch](/tools/farfetch-scraper/), [Lululemon](/tools/lululemon-scraper/), [Boohoo](/tools/boohoo-scraper/), and a universal [Shopify scraper](/tools/shopify-scraper/) that works with any Shopify-powered store.

### Developer Utilities

Tools beyond scraping — [SEO auditing](/tools/schema-markup-scraper/), [web-to-PDF/image conversion](/tools/web-printer/), and [high-performance load testing](/tools/load-tester/).

## Specialties

- **Reverse engineering private APIs** — turning undocumented endpoints into reliable data sources
- **Anti-bot bypass** — Cloudflare, Datadome, Akamai WAF, and custom protections
- **Multi-region scraping** — handling different locales, currencies, and compliance requirements
- **High-reliability systems** — building scrapers that maintain >99% success rates at scale

## Tech Stack

| Category | Technologies |
|----------|-------------|
| Languages | TypeScript, Python, Go, Java |
| Scraping | Crawlee, Playwright, Cheerio, Parsel, got-scraping |
| Anti-Bot | Fingerprint generators, TLS fingerprinting, session rotation |
| Infrastructure | Apify Platform, Docker, GitHub Actions |
| Testing | Vegeta, custom load testing frameworks |

## Work With Me

I offer custom scraping solutions, data pipeline consulting, and ongoing data extraction services. If you need data from the web — [let's talk](/contact/).

---

# Contact

- **URL:** https://proooxy.com/contact/
- **Type:** page
- **Description:** Get in touch for custom web scraping solutions, data pipeline consulting, and Data-as-a-Service engagements.
- **Published:** 0001-01-01
- **Modified:** 0001-01-01

## Let's Build Your Data Pipeline

I build bespoke web scrapers and data extraction systems for businesses of all sizes. Whether you need a one-time data pull or an ongoing data pipeline, I can help.

### What I Can Do For You

- **Custom Scrapers** — purpose-built for your target websites with anti-bot bypass
- **Data Pipelines** — end-to-end extraction, transformation, and delivery to your systems
- **API Reverse Engineering** — turn undocumented private APIs into reliable data sources
- **Scraper Maintenance** — keep existing scrapers running when websites change
- **Technical Consulting** — architecture review for your scraping infrastructure

### How It Works

1. **Tell me what you need** — describe the data, the source, and the format
2. **I'll assess feasibility** — free initial evaluation of the target site's complexity
3. **Proposal & timeline** — clear scope, fixed pricing, and delivery date
4. **Build & deliver** — production-grade solution with documentation

### Get In Touch
### Or Reach Me Directly

- **Email**: [kvcnow@gmail.com](mailto:kvcnow@gmail.com)
- **GitHub**: [@autofacts](https://github.com/autofacts)
- **Twitter**: [@chideat](https://twitter.com/chideat)
- **Apify**: [apify.com/autofacts](https://apify.com/autofacts)

---