AI Scraper is a cutting-edge tool designed for data extraction from various web sources. It automates complex scraping tasks, enabling users to create efficient workflows for scalable AI scraping operations.
1. **Install and configure ai-scraper-py.** Run `pip install ai-scraper-py` and set up environment variables for proxies (if needed) and headers to mimic a real browser. Configure the tool to handle JavaScript-rendered content by enabling a headless browser (e.g., Playwright).
   *Tip:* Use `--headless-mode` for faster scraping, or `--browser-mode` if the target site requires JavaScript execution.
2. **Define the scraping target and data fields.** Specify the target URL (e.g., `https://example.com/products`) and the data fields to extract (e.g., `product_name`, `price`, `availability`). Use the tool's `--selectors` flag to map HTML elements to fields via CSS selectors or XPath.
   *Tip:* Inspect the target page with browser dev tools to identify stable selectors (e.g., `#product-title`, `.price`); avoid selectors with dynamic IDs like `data-testid-12345`.
3. **Handle pagination and anti-scraping measures.** Configure pagination by setting `--max-pages 5` and use `--delay 2` to avoid rate limiting. Enable `--proxy-rotation` if the site blocks your IP; for CAPTCHAs, integrate a solver service (e.g., 2Captcha) via `--captcha-solver`.
   *Tip:* Test the scraper on a single page first to validate selectors and error handling before scaling up.
4. **Output and process the extracted data.** Save the output to a file using `--output-format json --output-file products.json`. Validate the data with `--validate` to ensure all required fields are present; for large datasets, stream the output to a database (e.g., PostgreSQL) using `--db-connection-string`.
   *Tip:* Use `--log-level debug` to troubleshoot issues like failed requests or missing data.
5. **Automate and scale the scraping workflow.** Schedule the scraper using cron (Linux/macOS) or Task Scheduler (Windows) to run at set intervals. For enterprise use, deploy the scraper on a cloud platform (e.g., AWS Lambda) with `--cloud-mode` to handle high volumes.
   *Tip:* Monitor scraping performance with `--metrics` to track success rates, errors, and execution time.
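The selector-to-field mapping in step 2 can be sketched with the standard library alone. This is an illustrative stand-in, not ai-scraper-py's implementation: the HTML snippet, class names, and field mapping are assumptions chosen for the demo, and a real run would fetch live pages and honor the delay/pagination settings described above.

```python
# Minimal sketch of mapping HTML elements to named fields, using only the
# standard library. The static HTML, class names, and field mapping below
# are illustrative assumptions, not ai-scraper-py internals.
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect text from elements whose CSS class matches a field mapping."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field  # e.g. {"price": "price"}
        self.records = []
        self._current = None  # field currently being captured
        self._row = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self._current = self.class_to_field[cls]
        if "product" in classes:  # each product card starts a new record
            if self._row:
                self.records.append(self._row)
            self._row = {}

    def handle_data(self, data):
        if self._current and data.strip():
            self._row[self._current] = data.strip()
            self._current = None

    def close(self):
        super().close()
        if self._row:  # flush the last record
            self.records.append(self._row)


html = """
<div class="product"><span class="name">Headphones</span><span class="price">$79.99</span></div>
<div class="product"><span class="name">Desk Lamp</span><span class="price">$49.99</span></div>
"""
parser = FieldExtractor({"name": "product_name", "price": "price"})
parser.feed(html)
parser.close()
print(parser.records)
```

In practice you would prefer a selector library (e.g., BeautifulSoup or Playwright's locators) over hand-rolled parsing; the point here is only the mapping from stable class names to output fields.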
**Use Cases:**
- Extracting product data from e-commerce sites
- Gathering competitor pricing information
- Automating lead generation from social media
- Collecting market research data from multiple sources
1. Copy the install command and run it in your terminal:
   `claude install oxylabs/ai-scraper-py`
   or clone the repository directly: `git clone https://github.com/oxylabs/ai-scraper-py`
2. Launch Claude Code, Cursor, or your preferred AI coding agent.
3. Use the prompt template or examples below to test the skill.
4. Adapt the skill to your specific use case and workflow.
Use ai-scraper-py to extract structured data from [WEBSITE_URL] for [USE_CASE]. Focus on extracting [SPECIFIC_DATA_FIELDS] while handling dynamic content, pagination, and anti-scraping measures. Output the data in [OUTPUT_FORMAT] (e.g., CSV, JSON). Include error handling for failed requests or CAPTCHAs. Example: 'Use ai-scraper-py to extract product listings from https://example.com/electronics for a price comparison tool. Extract product name, price, availability, and ratings. Handle pagination up to 5 pages and output as JSON with error logging for failed requests.'
```json
{
  "extracted_data": [
    {
      "product_name": "Wireless Bluetooth Headphones",
      "price": 79.99,
      "availability": "In Stock",
      "rating": 4.5,
      "url": "https://example.com/electronics/headphones/12345",
      "last_updated": "2024-05-20T14:30:00Z"
    },
    {
      "product_name": "Smart LED Desk Lamp",
      "price": 49.99,
      "availability": "Out of Stock",
      "rating": 3.8,
      "url": "https://example.com/electronics/lamps/67890",
      "last_updated": "2024-05-19T09:15:00Z"
    }
  ],
  "metadata": {
    "total_items": 2,
    "pages_scraped": 1,
    "errors": [],
    "scraping_timestamp": "2024-05-20T15:00:00Z"
  }
}
```
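A check like the one the `--validate` flag is described as performing can be sketched in a few lines. The required-field list below is an assumption derived from the example output above, not a documented schema.

```python
# Hedged sketch of output validation: flag any record in "extracted_data"
# that is missing a required field. The REQUIRED set is an assumption
# based on the example JSON, not ai-scraper-py's actual schema.
import json

REQUIRED = {"product_name", "price", "availability", "rating", "url"}


def validate(payload):
    """Return a list of (index, missing_fields); an empty list means valid."""
    problems = []
    for i, item in enumerate(payload.get("extracted_data", [])):
        missing = REQUIRED - item.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems


doc = json.loads("""{"extracted_data": [
  {"product_name": "Wireless Bluetooth Headphones", "price": 79.99,
   "availability": "In Stock", "rating": 4.5,
   "url": "https://example.com/electronics/headphones/12345"}
]}""")
print(validate(doc))
```

Running this kind of check before loading data downstream catches selector drift early: when a site redesign breaks a selector, the field goes missing and the record is reported instead of silently propagating nulls.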
**Scraping Summary:**
- Successfully scraped 2 product listings from https://example.com/electronics in 4.2 seconds.
- No errors encountered during the process. The website did not trigger anti-scraping measures (e.g., CAPTCHA or rate-limiting).
- The output is structured JSON, ready for integration into a price comparison tool or database.
**Key Observations:**
1. The product listings include all requested fields (name, price, availability, rating, URL).
2. The `last_updated` field ensures data freshness for downstream applications.
3. The `metadata` section provides traceability for debugging or auditing purposes.
**Next Steps:**
- Schedule this script to run daily at 10 AM UTC to keep the dataset current.
- Integrate the output with a database (e.g., PostgreSQL) using the provided JSON structure.
- For larger datasets, consider parallelizing the scraping process across multiple pages or domains.
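The database hand-off mentioned in the next steps can be sketched with the standard library. SQLite stands in for PostgreSQL here purely for a self-contained demo (in production you would swap `sqlite3` for a PostgreSQL driver such as `psycopg2`); the table and column names are assumptions mirroring the example JSON above.

```python
# Sketch of loading scraped records into a database. An in-memory SQLite
# table stands in for the PostgreSQL target; table/column names mirror the
# example JSON output and are assumptions.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE products (
    product_name TEXT, price REAL, availability TEXT,
    rating REAL, url TEXT PRIMARY KEY, last_updated TEXT)""")

records = json.loads("""[
  {"product_name": "Wireless Bluetooth Headphones", "price": 79.99,
   "availability": "In Stock", "rating": 4.5,
   "url": "https://example.com/electronics/headphones/12345",
   "last_updated": "2024-05-20T14:30:00Z"}
]""")

# Upsert keyed on URL so a daily scheduled run refreshes existing rows
# instead of inserting duplicates.
conn.executemany(
    """INSERT INTO products VALUES
       (:product_name, :price, :availability, :rating, :url, :last_updated)
       ON CONFLICT(url) DO UPDATE SET
         price = excluded.price,
         availability = excluded.availability,
         rating = excluded.rating,
         last_updated = excluded.last_updated""",
    records)
conn.commit()
print(conn.execute("SELECT count(*) FROM products").fetchone()[0])
```

For the daily 10 AM UTC schedule, a crontab entry along the lines of `0 10 * * * python3 load_products.py` would work, where `load_products.py` is a hypothetical script name wrapping the scrape-and-load steps.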