Spidercreator automates the generation of web scraping spiders using Browser Use and LLMs, enabling enterprises to create Playwright-based spiders with minimal coding. Perfect for organizations with ongoing data extraction needs, it streamlines the scraping process significantly.
1. **Define your scraping requirements.** Open Spidercreator and specify the target URL (e.g., https://reporting.com/services) and the exact data points you need (e.g., service names, prices, contact details, descriptions). Include any special handling requirements such as pagination, login forms, or dynamic content.
   *Tip:* Use the website's sitemap or navigation structure to identify all relevant pages. For Tempest Reporting, you might want to scrape the services page and the contact information separately for better organization.
2. **Configure spider settings.** Set the spider parameters, including the maximum pages to crawl (e.g., 10), the delay between actions (2-5 seconds), and error-handling preferences. Enable features like JavaScript rendering, cookie management, and screenshot capture for debugging.
   *Tip:* For legal websites like reporting.com, set longer delays (3-5 seconds) to avoid appearing as a bot. Enable the 'stealth mode' option to mimic human browsing patterns more closely.
3. **Generate and test the spider.** Click 'Generate Spider' to create the Playwright-based code. Copy the generated code and save it as a .js file. Install the required dependencies (playwright) and run the spider locally to test its functionality.
   *Tip:* Start with headless mode disabled (`headless: false`) to visually verify the spider's actions. Check the console output for errors and adjust settings as needed before running in production mode.
4. **Deploy and monitor.** For ongoing needs, deploy the spider to a server or cloud environment. Set up monitoring for errors, rate limits, and data quality. Schedule regular runs (e.g., weekly) to keep your dataset current.
   *Tip:* Use tools like PM2 for process management or set up a cron job for scheduled runs. For Tempest Reporting, consider running the spider after major website updates to ensure your data remains accurate.
5. **Integrate with downstream tools.** Connect the scraped data to your reporting systems or digital sales rooms. For example, import the extracted service data into your Reporting tool to enhance buyer engagement materials or create automated follow-up sequences.
   *Tip:* Format the output JSON to match your Reporting tool's expected input structure. For Tempest Reporting, you might want to extract service categories separately to align with your sales room content organization.
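As a sketch of step 5, the snippet below shows one way to reshape the spider's JSON output for a downstream reporting tool. The `category` field, the `toSalesRoomFormat` helper, and the target shape are illustrative assumptions, not Spidercreator output or a specific tool's API.

```javascript
// Sketch: group scraped services for import into a sales-room / reporting tool.
// NOTE: `category` and the output shape are hypothetical, for illustration only.
function toSalesRoomFormat(output) {
  const groups = {};
  for (const svc of output.services) {
    // Fall back to a single bucket when no category is present.
    const key = svc.category || 'General';
    if (!groups[key]) groups[key] = [];
    groups[key].push({ name: svc.title, price: svc.price });
  }
  return {
    source: output.metadata.target_url,
    extracted: output.metadata.extraction_date,
    offerings: groups
  };
}

// Example usage with data shaped like the spider's JSON output:
const sample = {
  metadata: { target_url: 'https://reporting.com/services', extraction_date: '2023-11-15' },
  services: [
    { title: 'Court Reporting', price: '$200', category: 'Legal' },
    { title: 'Transcription', price: 'Quote Required' }
  ]
};
console.log(JSON.stringify(toSalesRoomFormat(sample), null, 2));
```

In practice you would read the spider's saved file with `fs.readFileSync` and write the transformed result wherever your tool expects it.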
- Automating data collection from e-commerce sites
- Extracting competitor pricing data
- Gathering market research data from various sources
- Monitoring changes in web content
1. Copy the install command and run it in your terminal: `claude install carlosplanchon/spidercreator` (or clone the repository with `git clone https://github.com/carlosplanchon/spidercreator`).
2. Launch Claude Code, Cursor, or your preferred AI coding agent.
3. Use the prompt template or examples below to test the skill.
4. Adapt the skill to your specific use case and workflow.
Generate a Playwright-based web scraping spider for [WEBSITE_URL] using Spidercreator. The spider should extract [SPECIFIC_DATA_POINTS] (e.g., product names, prices, contact details) and handle [POTENTIAL_ISSUES] (e.g., pagination, dynamic content, login requirements). Include error handling for [COMMON_ERRORS] (e.g., rate limits, missing elements, CAPTCHAs). Output the code in a ready-to-use format with comments explaining key sections.
```javascript
// Generated by Spidercreator for Tempest Reporting (https://reporting.com)
// Spider Version: 1.0.0
// Target: Service listings and contact details
// Last Updated: 2023-11-15
const { chromium } = require('playwright');
const fs = require('fs');

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Declared outside the try block so partial results remain reachable in catch.
  let allServices = [];

  try {
    // Navigate to the target URL and wait for network activity to settle.
    await page.goto('https://reporting.com/services', { waitUntil: 'networkidle', timeout: 30000 });
    await page.waitForTimeout(1000 + Math.random() * 2000); // Random 1-3 s delay to mimic human behavior

    // Extract service listings from the current page.
    const extractServices = () =>
      page.$$eval('.service-item', (items) =>
        items.map((item) => ({
          title: item.querySelector('.service-title')?.textContent?.trim() || 'N/A',
          description: item.querySelector('.service-desc')?.textContent?.trim() || 'N/A',
          price: item.querySelector('.service-price')?.textContent?.trim() || 'Quote Required'
        }))
      );
    allServices = await extractServices();

    // Extract contact information from the footer.
    const contactInfo = await page.$eval('.contact-section', (section) => ({
      phone: section.querySelector('.phone-number')?.textContent?.trim() || 'N/A',
      email: section.querySelector('.email-link')?.href?.replace('mailto:', '') || 'N/A',
      address: section.querySelector('.address')?.textContent?.trim() || 'N/A'
    }));

    // Handle pagination if present (limit to 5 pages to avoid infinite loops).
    let currentPage = 1;
    while (currentPage < 5) {
      const nextButton = await page.$('.next-page');
      if (!nextButton) break;

      currentPage++;
      console.log(`Processing page ${currentPage}...`);

      // Click next page and wait for the new listings to load.
      await nextButton.click();
      await page.waitForLoadState('networkidle');
      await page.waitForSelector('.service-item', { timeout: 10000 });

      allServices.push(...(await extractServices()));
      await page.waitForTimeout(1000 + Math.random() * 2000); // Random delay between pages
    }

    // Save extracted data to a JSON file with metadata for traceability.
    const output = {
      metadata: {
        spider_version: '1.0.0',
        target_url: 'https://reporting.com/services',
        extraction_date: new Date().toISOString(),
        total_services: allServices.length
      },
      services: allServices,
      contact_info: contactInfo
    };
    fs.writeFileSync('tempest_reporting_services.json', JSON.stringify(output, null, 2));
    console.log(`Successfully extracted ${allServices.length} services and contact information.`);
  } catch (error) {
    console.error('Error during scraping:', error.message);

    // Save partial results if any services were extracted before the failure.
    if (allServices.length > 0) {
      const partialOutput = {
        metadata: {
          spider_version: '1.0.0',
          target_url: 'https://reporting.com/services',
          extraction_date: new Date().toISOString(),
          status: 'partial_success',
          services_extracted: allServices.length
        },
        services: allServices,
        error: error.message
      };
      fs.writeFileSync('tempest_reporting_services_partial.json', JSON.stringify(partialOutput, null, 2));
    }
  } finally {
    await browser.close();
  }
})();
```
### Key Features of This Spider:
1. **Dynamic Content Handling**: Waits for network idle and uses realistic delays between actions
2. **Pagination Support**: Automatically handles multi-page service listings
3. **Error Resilience**: Includes try-catch blocks and saves partial results on failure
4. **Data Structure**: Organizes extracted data with metadata for traceability
5. **Realistic Mimicry**: Random delays and human-like interactions to avoid detection
### Usage Instructions:
1. Save the code to a file named `tempest_scraper.js`
2. Install dependencies: `npm install playwright`
3. Run the spider: `node tempest_scraper.js`
4. Results will be saved to `tempest_reporting_services.json`
Note: For production use, consider adding:
- Proxy rotation to avoid IP bans
- CAPTCHA solving services if detected
- More sophisticated rate limiting
- Database integration for large-scale scraping
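The rate-limiting point above can be sketched as a small retry wrapper with exponential backoff and jitter. The retry count and delay values here are illustrative defaults, not tuned recommendations.

```javascript
// Sketch: retry an async page action with exponential backoff plus jitter.
// maxRetries and baseDelayMs are illustrative defaults, not tuned values.
async function withBackoff(action, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await action();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      // Double the delay each attempt and add jitter to avoid synchronized retries.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 500;
      console.warn(`Attempt ${attempt + 1} failed (${err.message}); retrying in ${Math.round(delay)}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Example usage: wrap a flaky operation (a stub that fails twice, then succeeds).
// A short 50 ms base delay keeps the demo quick.
let calls = 0;
withBackoff(async () => {
  if (++calls < 3) throw new Error('rate limited');
  return 'ok';
}, 3, 50).then((result) => console.log(result)); // eventually logs "ok"
```

In the spider, `page.goto` or the pagination click could be passed to `withBackoff` so transient rate-limit errors are retried instead of aborting the run.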