diff --git a/api-reference/endpoint/smartcrawler/start.mdx b/api-reference/endpoint/smartcrawler/start.mdx index c28c7a6..cee1feb 100644 --- a/api-reference/endpoint/smartcrawler/start.mdx +++ b/api-reference/endpoint/smartcrawler/start.mdx @@ -229,7 +229,7 @@ sha256={your_webhook_secret} To verify that a webhook request is authentic: -1. Retrieve your webhook secret from the [dashboard](https://dashboard.scrapegraphai.com) +1. Retrieve your webhook secret from the [dashboard](https://scrapegraphai.com/dashboard) 2. Compare the `X-Webhook-Signature` header value with `sha256={your_secret}` @@ -305,5 +305,5 @@ The webhook POST request contains the following JSON payload: | result | string | The crawl result data (null if failed) | -Make sure to configure your webhook secret in the [dashboard](https://dashboard.scrapegraphai.com) before using webhooks. Each user has a unique webhook secret for secure verification. +Make sure to configure your webhook secret in the [dashboard](https://scrapegraphai.com/dashboard) before using webhooks. Each user has a unique webhook secret for secure verification. 
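The webhook docs above describe comparing the `X-Webhook-Signature` header directly against `sha256={your_secret}`. A minimal Python sketch of that check, under the assumption that the header really is the literal `sha256=`-prefixed secret (the function and variable names here are illustrative, not part of the API): using `hmac.compare_digest` avoids leaking timing information even though it is a plain string comparison.

```python
import hmac


def is_valid_webhook(headers: dict, webhook_secret: str) -> bool:
    """Check the X-Webhook-Signature header against the expected
    'sha256={your_secret}' value using a constant-time comparison."""
    received = headers.get("X-Webhook-Signature", "")
    expected = f"sha256={webhook_secret}"
    return hmac.compare_digest(received, expected)


# A request carrying the correct signature passes; anything else is rejected.
headers = {"X-Webhook-Signature": "sha256=my-secret"}
print(is_valid_webhook(headers, "my-secret"))   # True
print(is_valid_webhook(headers, "wrong"))       # False
```

Reject the request (e.g. respond 401) whenever this returns `False`, and only then parse the JSON payload fields (`result` may be null on failure, per the table above).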
diff --git a/api-reference/errors.mdx b/api-reference/errors.mdx index 5933a08..07846f7 100644 --- a/api-reference/errors.mdx +++ b/api-reference/errors.mdx @@ -139,17 +139,17 @@ except APIError as e: ``` ```javascript JavaScript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); -const response = await smartScraper(apiKey, { - website_url: 'https://example.com', - user_prompt: 'Extract data', -}); - -if (response.status === 'error') { - console.error('Error:', response.error); +try { + const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract data', + }); + console.log('Data:', data); +} catch (error) { + console.error('Error:', error.message); } ``` diff --git a/api-reference/introduction.mdx b/api-reference/introduction.mdx index bee4eb5..872bcb1 100644 --- a/api-reference/introduction.mdx +++ b/api-reference/introduction.mdx @@ -9,7 +9,7 @@ The ScrapeGraphAI API provides powerful endpoints for AI-powered web scraping an ## Authentication -All API requests require authentication using an API key. You can get your API key from the [dashboard](https://dashboard.scrapegraphai.com). +All API requests require authentication using an API key. You can get your API key from the [dashboard](https://scrapegraphai.com/dashboard). 
```bash SGAI-APIKEY: your-api-key-here diff --git a/cookbook/examples/pagination.mdx b/cookbook/examples/pagination.mdx index e078401..df05074 100644 --- a/cookbook/examples/pagination.mdx +++ b/cookbook/examples/pagination.mdx @@ -349,22 +349,20 @@ if __name__ == "__main__": ## JavaScript SDK Example ```javascript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import 'dotenv/config'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); -const response = await smartScraper(apiKey, { - website_url: 'https://www.amazon.in/s?k=tv&crid=1TEF1ZFVLU8R8&sprefix=t%2Caps%2C390&ref=nb_sb_noss_2', - user_prompt: 'Extract all product info including name, price, rating, and image_url', - total_pages: 3, -}); +const { data } = await sgai.extract( + 'https://www.amazon.in/s?k=tv&crid=1TEF1ZFVLU8R8&sprefix=t%2Caps%2C390&ref=nb_sb_noss_2', + { + prompt: 'Extract all product info including name, price, rating, and image_url', + totalPages: 3, + } +); -if (response.status === 'error') { - console.error('Error:', response.error); -} else { - console.log('Response:', JSON.stringify(response.data, null, 2)); -} +console.log('Response:', JSON.stringify(data, null, 2)); ``` ## Example Output diff --git a/cookbook/introduction.mdx b/cookbook/introduction.mdx index b888df7..ca0c3e8 100644 --- a/cookbook/introduction.mdx +++ b/cookbook/introduction.mdx @@ -87,7 +87,7 @@ Each example is available in multiple implementations: 4. Experiment and adapt the code for your needs -Make sure to have your ScrapeGraphAI API key ready. Get one from the [dashboard](https://dashboard.scrapegraphai.com) if you haven't already. +Make sure to have your ScrapeGraphAI API key ready. Get one from the [dashboard](https://scrapegraphai.com/dashboard) if you haven't already. 
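When `totalPages` is greater than 1, the extraction prompt above runs against several result pages, so downstream code typically needs to combine the per-page product lists. A rough, SDK-independent sketch of that merge step in plain Python — the record shape mirrors the fields the prompt asks for (name, price, rating, image_url), but the data and the de-duplication key are assumptions for illustration, not the SDK's actual response format:

```python
def merge_pages(pages):
    """Combine product lists from multiple result pages,
    de-duplicating by product name (first occurrence wins)."""
    seen, merged = set(), []
    for page in pages:
        for product in page:
            if product["name"] not in seen:
                seen.add(product["name"])
                merged.append(product)
    return merged


# Hypothetical per-page results; page 2 repeats a listing from page 1.
pages = [
    [{"name": "TV A", "price": "21999", "rating": 4.3, "image_url": "https://example.com/a.jpg"}],
    [{"name": "TV A", "price": "21999", "rating": 4.3, "image_url": "https://example.com/a.jpg"},
     {"name": "TV B", "price": "34490", "rating": 4.1, "image_url": "https://example.com/b.jpg"}],
]
print(len(merge_pages(pages)))  # 2
```

Search-result pages often overlap (sponsored listings, repeated items), so some de-duplication key — name, URL, or an ID if the site exposes one — is usually worth applying before further processing.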
## Additional Resources diff --git a/dashboard/overview.mdx b/dashboard/overview.mdx index df173d8..4c26957 100644 --- a/dashboard/overview.mdx +++ b/dashboard/overview.mdx @@ -19,21 +19,6 @@ The ScrapeGraphAI dashboard is your central hub for managing all your web scrapi - **Last Used**: Timestamp of your most recent API request - **Quick Actions**: Buttons to start new scraping jobs or access common features -## Usage Analytics - -Track your API usage patterns with our detailed analytics view: - - - Usage Analytics Graph - - -The usage graph provides: -- **Service-specific metrics**: Track usage for SmartScraper, SearchScraper, and Markdownify separately -- **Time-based analysis**: View usage patterns over different time periods -- **Interactive tooltips**: Hover over data points to see detailed information -- **Trend analysis**: Identify usage patterns and optimize your API consumption - - ## Key Features - **Usage Statistics**: Monitor your API usage and remaining credits @@ -43,7 +28,7 @@ The usage graph provides: ## Getting Started -1. Log in to your [dashboard](https://dashboard.scrapegraphai.com) +1. Log in to your [dashboard](https://scrapegraphai.com/dashboard) 2. View your API key in the settings section 3. Check your available credits 4. Start your first scraping job diff --git a/developer-guides/llm-sdks-and-frameworks/anthropic.mdx b/developer-guides/llm-sdks-and-frameworks/anthropic.mdx index f01b8e5..902608b 100644 --- a/developer-guides/llm-sdks-and-frameworks/anthropic.mdx +++ b/developer-guides/llm-sdks-and-frameworks/anthropic.mdx @@ -27,24 +27,23 @@ If using Node < 20, install `dotenv` and add `import 'dotenv/config'` to your co This example demonstrates a simple workflow: scrape a website and summarize the content using Claude. 
```typescript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import Anthropic from '@anthropic-ai/sdk'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY }); -const scrapeResult = await smartScraper(apiKey, { - website_url: 'https://scrapegraphai.com', - user_prompt: 'Extract all content from this page', +const { data } = await sgai.extract('https://scrapegraphai.com', { + prompt: 'Extract all content from this page', }); -console.log('Scraped content length:', JSON.stringify(scrapeResult.data.result).length); +console.log('Scraped content length:', JSON.stringify(data).length); const message = await anthropic.messages.create({ model: 'claude-haiku-4-5', max_tokens: 1024, messages: [ - { role: 'user', content: `Summarize in 100 words: ${JSON.stringify(scrapeResult.data.result)}` } + { role: 'user', content: `Summarize in 100 words: ${JSON.stringify(data)}` } ] }); @@ -56,12 +55,12 @@ console.log('Response:', message); This example shows how to use Claude's tool use feature to let the model decide when to scrape websites based on user requests. 
```typescript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import { Anthropic } from '@anthropic-ai/sdk'; import { z } from 'zod'; import { zodToJsonSchema } from 'zod-to-json-schema'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY }); @@ -91,12 +90,11 @@ if (toolUse && toolUse.type === 'tool_use') { const input = toolUse.input as { url: string }; console.log(`Calling tool: ${toolUse.name} | URL: ${input.url}`); - const result = await smartScraper(apiKey, { - website_url: input.url, - user_prompt: 'Extract all content from this page', + const { data } = await sgai.extract(input.url, { + prompt: 'Extract all content from this page', }); - console.log(`Scraped content preview: ${JSON.stringify(result.data.result)?.substring(0, 300)}...`); + console.log(`Scraped content preview: ${JSON.stringify(data)?.substring(0, 300)}...`); // Continue with the conversation or process the scraped content as needed } ``` @@ -106,11 +104,11 @@ if (toolUse && toolUse.type === 'tool_use') { This example demonstrates how to use Claude to extract structured data from scraped website content. 
```typescript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import Anthropic from '@anthropic-ai/sdk'; import { z } from 'zod'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY }); const CompanyInfoSchema = z.object({ @@ -119,9 +117,8 @@ const CompanyInfoSchema = z.object({ description: z.string().optional() }); -const scrapeResult = await smartScraper(apiKey, { - website_url: 'https://stripe.com', - user_prompt: 'Extract all content from this page', +const { data } = await sgai.extract('https://stripe.com', { + prompt: 'Extract all content from this page', }); const prompt = `Extract company information from this website content. @@ -135,7 +132,7 @@ Output ONLY valid JSON in this exact format (no markdown, no explanation): } Website content: -${JSON.stringify(scrapeResult.data.result)}`; +${JSON.stringify(data)}`; const message = await anthropic.messages.create({ model: 'claude-haiku-4-5', diff --git a/developer-guides/llm-sdks-and-frameworks/gemini.mdx b/developer-guides/llm-sdks-and-frameworks/gemini.mdx index 0e710c3..1910ef2 100644 --- a/developer-guides/llm-sdks-and-frameworks/gemini.mdx +++ b/developer-guides/llm-sdks-and-frameworks/gemini.mdx @@ -27,22 +27,21 @@ If using Node < 20, install `dotenv` and add `import 'dotenv/config'` to your co This example demonstrates a simple workflow: scrape a website and summarize the content using Gemini. 
```typescript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import { GoogleGenAI } from '@google/genai'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); -const scrapeResult = await smartScraper(apiKey, { - website_url: 'https://scrapegraphai.com', - user_prompt: 'Extract all content from this page', +const { data } = await sgai.extract('https://scrapegraphai.com', { + prompt: 'Extract all content from this page', }); -console.log('Scraped content length:', JSON.stringify(scrapeResult.data.result).length); +console.log('Scraped content length:', JSON.stringify(data).length); const response = await ai.models.generateContent({ model: 'gemini-2.5-flash', - contents: `Summarize: ${JSON.stringify(scrapeResult.data.result)}`, + contents: `Summarize: ${JSON.stringify(data)}`, }); console.log('Summary:', response.text); @@ -53,18 +52,17 @@ console.log('Summary:', response.text); This example shows how to analyze website content using Gemini's multi-turn conversation capabilities. 
```typescript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import { GoogleGenAI } from '@google/genai'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); -const scrapeResult = await smartScraper(apiKey, { - website_url: 'https://news.ycombinator.com/', - user_prompt: 'Extract all content from this page', +const { data } = await sgai.extract('https://news.ycombinator.com/', { + prompt: 'Extract all content from this page', }); -console.log('Scraped content length:', JSON.stringify(scrapeResult.data.result).length); +console.log('Scraped content length:', JSON.stringify(data).length); const chat = ai.chats.create({ model: 'gemini-2.5-flash' @@ -72,7 +70,7 @@ const chat = ai.chats.create({ // Ask for the top 3 stories on Hacker News const result1 = await chat.sendMessage({ - message: `Based on this website content from Hacker News, what are the top 3 stories right now?\n\n${JSON.stringify(scrapeResult.data.result)}` + message: `Based on this website content from Hacker News, what are the top 3 stories right now?\n\n${JSON.stringify(data)}` }); console.log('Top 3 Stories:', result1.text); @@ -88,22 +86,21 @@ console.log('4th and 5th Stories:', result2.text); This example demonstrates how to extract structured data using Gemini's JSON mode from scraped website content. 
```typescript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; import { GoogleGenAI, Type } from '@google/genai'; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_APIKEY }); const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); -const scrapeResult = await smartScraper(apiKey, { - website_url: 'https://stripe.com', - user_prompt: 'Extract all content from this page', +const { data } = await sgai.extract('https://stripe.com', { + prompt: 'Extract all content from this page', }); -console.log('Scraped content length:', JSON.stringify(scrapeResult.data.result).length); +console.log('Scraped content length:', JSON.stringify(data).length); const response = await ai.models.generateContent({ model: 'gemini-2.5-flash', - contents: `Extract company information: ${JSON.stringify(scrapeResult.data.result)}`, + contents: `Extract company information: ${JSON.stringify(data)}`, config: { responseMimeType: 'application/json', responseSchema: { diff --git a/docs.json b/docs.json index 282bcf6..60b1947 100644 --- a/docs.json +++ b/docs.json @@ -3,250 +3,313 @@ "theme": "mint", "name": "ScrapeGraphAI", "colors": { - "primary": "#9333ea", - "light": "#9f52eb", - "dark": "#1f2937" + "primary": "#AC6DFF", + "light": "#AC6DFF", + "dark": "#AC6DFF" }, "favicon": "/favicon.svg", "navigation": { - "tabs": [ + "versions": [ { - "tab": "Home", - "groups": [ + "version": "v2", + "default": true, + "tabs": [ { - "group": "Get Started", - "pages": [ - "introduction", - "install", + "tab": "Home", + "groups": [ { - "group": "Use Cases", + "group": "Get Started", "pages": [ - "use-cases/overview", - "use-cases/ai-llm", - "use-cases/lead-generation", - "use-cases/market-intelligence", - "use-cases/content-aggregation", - "use-cases/research-analysis", - "use-cases/seo-analytics" + "introduction", + "install", + "transition-from-v1-to-v2", + { + "group": "Use Cases", + "pages": [ + 
"use-cases/overview", + "use-cases/ai-llm", + "use-cases/lead-generation", + "use-cases/market-intelligence", + "use-cases/content-aggregation", + "use-cases/research-analysis", + "use-cases/seo-analytics" + ] + }, + { + "group": "Dashboard", + "pages": [ + "dashboard/overview", + "dashboard/settings" + ] + } ] }, { - "group": "Dashboard", + "group": "Services", "pages": [ - "dashboard/overview", - "dashboard/playground", - "dashboard/settings" + "services/scrape", + "services/extract", + "services/search", + "services/crawl", + "services/monitor", + { + "group": "CLI", + "icon": "terminal", + "pages": [ + "services/cli/introduction", + "services/cli/commands", + "services/cli/json-mode", + "services/cli/ai-agent-skill", + "services/cli/examples" + ] + }, + { + "group": "MCP Server", + "icon": "/logo/mcp.svg", + "pages": [ + "services/mcp-server/introduction", + "services/mcp-server/cursor", + "services/mcp-server/claude", + "services/mcp-server/smithery" + ] + }, + "services/toonify", + { + "group": "Additional Parameters", + "pages": [ + "services/additional-parameters/headers", + "services/additional-parameters/pagination", + "services/additional-parameters/proxy", + "services/additional-parameters/wait-ms" + ] + } + ] + }, + { + "group": "Official SDKs", + "pages": [ + "sdks/python", + "sdks/javascript", + "sdks/mocking" + ] + }, + { + "group": "LLM SDKs & Frameworks", + "pages": [ + "developer-guides/llm-sdks-and-frameworks/gemini", + "developer-guides/llm-sdks-and-frameworks/anthropic" + ] + }, + { + "group": "Contribute", + "pages": [ + "contribute/opensource" ] } ] }, { - "group": "Services", - "pages": [ - "services/smartscraper", - "services/searchscraper", - "services/markdownify", - "services/scrape", - "services/smartcrawler", - "services/sitemap", - "services/agenticscraper", + "tab": "Knowledge Base", + "groups": [ + { + "group": "Knowledge Base", + "pages": [ + "knowledge-base/introduction" + ] + }, + { + "group": "Scraping Tools", + "pages": [ + 
"knowledge-base/ai-tools/lovable", + "knowledge-base/ai-tools/v0", + "knowledge-base/ai-tools/bolt", + "knowledge-base/ai-tools/cursor" + ] + }, { "group": "CLI", - "icon": "terminal", "pages": [ - "services/cli/introduction", - "services/cli/commands", - "services/cli/json-mode", - "services/cli/ai-agent-skill", - "services/cli/examples" + "knowledge-base/cli/getting-started", + "knowledge-base/cli/json-mode", + "knowledge-base/cli/ai-agent-skill", + "knowledge-base/cli/command-examples" + ] + }, + { + "group": "Troubleshooting", + "pages": [ + "knowledge-base/troubleshooting/cors-error", + "knowledge-base/troubleshooting/empty-results", + "knowledge-base/troubleshooting/rate-limiting", + "knowledge-base/troubleshooting/timeout-errors" ] }, { - "group": "MCP Server", - "icon": "/logo/mcp.svg", + "group": "Scraping Guides", "pages": [ - "services/mcp-server/introduction", - "services/mcp-server/cursor", - "services/mcp-server/claude", - "services/mcp-server/smithery" + "knowledge-base/scraping/javascript-rendering", + "knowledge-base/scraping/pagination", + "knowledge-base/scraping/custom-headers", + "knowledge-base/scraping/proxy" ] }, - "services/toonify", { - "group": "Additional Parameters", + "group": "Account & Credits", "pages": [ - "services/additional-parameters/headers", - "services/additional-parameters/pagination", - "services/additional-parameters/proxy", - "services/additional-parameters/wait-ms" + "knowledge-base/account/pricing", + "knowledge-base/account/api-keys", + "knowledge-base/account/credits", + "knowledge-base/account/rate-limits" ] } ] }, { - "group": "Official SDKs", - "pages": [ - "sdks/python", - "sdks/javascript", - "sdks/mocking" - ] - }, - { - "group": "Integrations", - "pages": [ - "integrations/langchain", - "integrations/llamaindex", - "integrations/crewai", - "integrations/agno", - "integrations/langflow", - "integrations/vercel_ai", - "integrations/google-adk", - "integrations/x402" - ] - }, - { - "group": "LLM SDKs & 
Frameworks", - "pages": [ - "developer-guides/llm-sdks-and-frameworks/gemini", - "developer-guides/llm-sdks-and-frameworks/anthropic" - ] - }, - { - "group": "Contribute", - "pages": [ - "contribute/opensource" - ] - } - ] - }, - { - "tab": "Knowledge Base", - "groups": [ - { - "group": "Knowledge Base", - "pages": [ - "knowledge-base/introduction" - ] - }, - { - "group": "Scraping Tools", - "pages": [ - "knowledge-base/ai-tools/lovable", - "knowledge-base/ai-tools/v0", - "knowledge-base/ai-tools/bolt", - "knowledge-base/ai-tools/cursor" - ] - }, - { - "group": "CLI", - "pages": [ - "knowledge-base/cli/getting-started", - "knowledge-base/cli/json-mode", - "knowledge-base/cli/ai-agent-skill", - "knowledge-base/cli/command-examples" - ] - }, - { - "group": "Troubleshooting", - "pages": [ - "knowledge-base/troubleshooting/cors-error", - "knowledge-base/troubleshooting/empty-results", - "knowledge-base/troubleshooting/rate-limiting", - "knowledge-base/troubleshooting/timeout-errors" - ] - }, - { - "group": "Scraping Guides", - "pages": [ - "knowledge-base/scraping/javascript-rendering", - "knowledge-base/scraping/pagination", - "knowledge-base/scraping/custom-headers", - "knowledge-base/scraping/proxy" - ] - }, - { - "group": "Account & Credits", - "pages": [ - "knowledge-base/account/api-keys", - "knowledge-base/account/credits", - "knowledge-base/account/rate-limits" - ] - } - ] - }, - { - "tab": "Cookbook", - "groups": [ - { - "group": "Cookbook", - "pages": [ - "cookbook/introduction" + "tab": "Cookbook", + "groups": [ + { + "group": "Cookbook", + "pages": [ + "cookbook/introduction" + ] + }, + { + "group": "Examples", + "pages": [ + "cookbook/examples/company-info", + "cookbook/examples/github-trending", + "cookbook/examples/wired", + "cookbook/examples/homes", + "cookbook/examples/research-agent", + "cookbook/examples/chat-webpage", + "cookbook/examples/pagination" + ] + } ] }, { - "group": "Examples", - "pages": [ - "cookbook/examples/company-info", - 
"cookbook/examples/github-trending", - "cookbook/examples/wired", - "cookbook/examples/homes", - "cookbook/examples/research-agent", - "cookbook/examples/chat-webpage", - "cookbook/examples/pagination" + "tab": "API Reference", + "groups": [ + { + "group": "API Documentation", + "pages": [ + "api-reference/introduction", + "api-reference/errors" + ] + }, + { + "group": "SmartScraper", + "pages": [ + "api-reference/endpoint/smartscraper/start", + "api-reference/endpoint/smartscraper/get-status" + ] + }, + { + "group": "SearchScraper", + "pages": [ + "api-reference/endpoint/searchscraper/start", + "api-reference/endpoint/searchscraper/get-status" + ] + }, + { + "group": "Markdownify", + "pages": [ + "api-reference/endpoint/markdownify/start", + "api-reference/endpoint/markdownify/get-status" + ] + }, + { + "group": "SmartCrawler", + "pages": [ + "api-reference/endpoint/smartcrawler/start", + "api-reference/endpoint/smartcrawler/get-status" + ] + }, + { + "group": "Sitemap", + "pages": [ + "api-reference/endpoint/sitemap/start", + "api-reference/endpoint/sitemap/get-status" + ] + }, + { + "group": "User", + "pages": [ + "api-reference/endpoint/user/get-credits", + "api-reference/endpoint/user/submit-feedback" + ] + } ] } ] }, { - "tab": "API Reference", - "groups": [ - { - "group": "API Documentation", - "pages": [ - "api-reference/introduction", - "api-reference/errors" - ] - }, - { - "group": "SmartScraper", - "pages": [ - "api-reference/endpoint/smartscraper/start", - "api-reference/endpoint/smartscraper/get-status" - ] - }, - { - "group": "SearchScraper", - "pages": [ - "api-reference/endpoint/searchscraper/start", - "api-reference/endpoint/searchscraper/get-status" - ] - }, + "version": "v1", + "tabs": [ { - "group": "Markdownify", - "pages": [ - "api-reference/endpoint/markdownify/start", - "api-reference/endpoint/markdownify/get-status" - ] - }, - { - "group": "SmartCrawler", - "pages": [ - "api-reference/endpoint/smartcrawler/start", - 
"api-reference/endpoint/smartcrawler/get-status" - ] - }, - { - "group": "Sitemap", - "pages": [ - "api-reference/endpoint/sitemap/start", - "api-reference/endpoint/sitemap/get-status" + "tab": "Home", + "groups": [ + { + "group": "Get Started", + "pages": [ + "v1/introduction", + "v1/quickstart" + ] + }, + { + "group": "Services", + "pages": [ + "v1/smartscraper", + "v1/searchscraper", + "v1/markdownify", + "v1/scrape", + "v1/smartcrawler", + "v1/sitemap", + "v1/agenticscraper", + { + "group": "CLI", + "icon": "terminal", + "pages": [ + "v1/cli/introduction", + "v1/cli/commands", + "v1/cli/json-mode", + "v1/cli/ai-agent-skill", + "v1/cli/examples" + ] + }, + { + "group": "MCP Server", + "icon": "/logo/mcp.svg", + "pages": [ + "v1/mcp-server/introduction", + "v1/mcp-server/cursor", + "v1/mcp-server/claude", + "v1/mcp-server/smithery" + ] + }, + "v1/toonify", + { + "group": "Additional Parameters", + "pages": [ + "v1/additional-parameters/headers", + "v1/additional-parameters/pagination", + "v1/additional-parameters/proxy", + "v1/additional-parameters/wait-ms" + ] + } + ] + } ] }, { - "group": "User", - "pages": [ - "api-reference/endpoint/user/get-credits", - "api-reference/endpoint/user/submit-feedback" + "tab": "API Reference", + "groups": [ + { + "group": "API Documentation", + "pages": [ + "v1/api-reference/introduction" + ] + } ] } ] @@ -259,12 +322,7 @@ "href": "https://scrapegraphai.com/", "icon": "globe" }, - { - "anchor": "Community", - "href": "https://discord.gg/uJN7TYcpNa", - "icon": "discord" - }, - { +{ "anchor": "Blog", "href": "https://scrapegraphai.com/blog", "icon": "newspaper" @@ -273,13 +331,24 @@ } }, "logo": { - "light": "https://raw.githubusercontent.com/ScrapeGraphAI/docs-mintlify/main/logo/light.svg", - "dark": "https://raw.githubusercontent.com/ScrapeGraphAI/docs-mintlify/main/logo/dark.svg", + "light": "/logos/logo-light.svg", + "dark": "/logos/logo-dark.svg", "href": "https://docs.scrapegraphai.com" }, "background": { "color": { - 
"dark": "#101725" + "dark": "#242424", + "light": "#EFEFEF" + } + }, + "fonts": { + "heading": { + "family": "IBM Plex Sans", + "weight": 500 + }, + "body": { + "family": "IBM Plex Sans", + "weight": 400 } }, "navbar": { @@ -293,14 +362,14 @@ "href": "mailto:contact@scrapegraphai.com" }, { - "label": "⭐ 23.2k+", + "label": "⭐ 23.3k+", "href": "https://github.com/ScrapeGraphAI/Scrapegraph-ai" } ], "primary": { "type": "button", "label": "Dashboard", - "href": "https://dashboard.scrapegraphai.com" + "href": "https://scrapegraphai.com/dashboard" } }, "footer": { @@ -322,4 +391,4 @@ "vscode" ] } -} \ No newline at end of file +} diff --git a/favicon.svg b/favicon.svg index 33285d6..6fb828b 100644 --- a/favicon.svg +++ b/favicon.svg @@ -1,145 +1,15 @@ - - - - + + + + + + + + + + + + + + + diff --git a/images/dashboard/dashboard-1.png b/images/dashboard/dashboard-1.png index 1120f7e..2249c72 100644 Binary files a/images/dashboard/dashboard-1.png and b/images/dashboard/dashboard-1.png differ diff --git a/images/dashboard/settings-1.png b/images/dashboard/settings-1.png index 87ea08e..16f93b3 100644 Binary files a/images/dashboard/settings-1.png and b/images/dashboard/settings-1.png differ diff --git a/install.md b/install.md index 1f1165d..07cb1d5 100644 --- a/install.md +++ b/install.md @@ -1,11 +1,11 @@ --- title: Installation -description: 'Install and get started with ScrapeGraphAI SDKs' +description: 'Install and get started with ScrapeGraphAI v2 SDKs' --- ## Prerequisites -- Obtain your **API key** by signing up on the [ScrapeGraphAI Dashboard](https://dashboard.scrapegraphai.com) +- Obtain your **API key** by signing up on the [ScrapeGraphAI Dashboard](https://scrapegraphai.com/dashboard) --- @@ -22,10 +22,10 @@ from scrapegraph_py import Client client = Client(api_key="your-api-key-here") -# Scrape a website -response = client.smartscraper( - website_url="https://scrapegraphai.com", - user_prompt="Extract information about the company" +# Extract data from a 
website +response = client.extract( + url="https://scrapegraphai.com", + prompt="Extract information about the company" ) print(response) ``` @@ -40,6 +40,8 @@ For more advanced usage, see the [Python SDK documentation](/sdks/python). ## JavaScript SDK +Requires **Node.js >= 22**. + Install using npm, pnpm, yarn, or bun: ```bash @@ -59,20 +61,16 @@ bun add scrapegraph-js **Usage:** ```javascript -import { smartScraper } from "scrapegraph-js"; +import scrapegraphai from "scrapegraph-js"; -const apiKey = "your-api-key-here"; +const sgai = scrapegraphai({ apiKey: "your-api-key-here" }); -const response = await smartScraper(apiKey, { - website_url: "https://scrapegraphai.com", - user_prompt: "What does the company do?", -}); +const { data } = await sgai.extract( + "https://scrapegraphai.com", + { prompt: "What does the company do?" } +); -if (response.status === "error") { - console.error("Error:", response.error); -} else { - console.log(response.data.result); -} +console.log(data); ``` @@ -85,17 +83,20 @@ For more advanced usage, see the [JavaScript SDK documentation](/sdks/javascript ## Key Concepts -### SmartScraper -Extract specific information from any webpage using AI. Provide a URL and a prompt describing what you want to extract. [Learn more](/services/smartscraper) +### Scrape (formerly Markdownify) +Convert any webpage into markdown, HTML, screenshot, or branding format. [Learn more](/services/scrape) + +### Extract (formerly SmartScraper) +Extract specific information from any webpage using AI. Provide a URL and a prompt describing what you want to extract. [Learn more](/services/extract) -### SearchScraper -Search and extract information from multiple web sources using AI. Start with just a prompt - SearchScraper will find relevant websites and extract the information you need. [Learn more](/services/searchscraper) +### Search (formerly SearchScraper) +Search and extract information from multiple web sources using AI. 
Start with just a query - Search will find relevant websites and extract the information you need. [Learn more](/services/search) -### SmartCrawler -AI-powered extraction for any webpage with crawl capabilities. Automatically navigate and extract data from multiple pages. [Learn more](/services/smartcrawler) +### Crawl (formerly SmartCrawler) +Multi-page website crawling with flexible output formats. Traverse multiple pages, follow links, and return content in your preferred format. [Learn more](/services/crawl) -### Markdownify -Convert any webpage into clean, formatted markdown. Perfect for content aggregation and processing. [Learn more](/services/markdownify) +### Monitor +Scheduled web monitoring with AI-powered extraction. Set up recurring scraping jobs that automatically extract data on a cron schedule. [Learn more](/services/monitor) ### Structured Output with Schemas Both SDKs support structured output using schemas: @@ -119,34 +120,37 @@ class CompanyInfo(BaseModel): industry: str = Field(description="Industry sector") client = Client(api_key="your-api-key") -result = client.smartscraper( - website_url="https://scrapegraphai.com", - user_prompt="Extract company information", +response = client.extract( + url="https://scrapegraphai.com", + prompt="Extract company information", output_schema=CompanyInfo ) -print(result) +print(response) ``` ### JavaScript Example ```javascript -import { smartScraper } from "scrapegraph-js"; +import scrapegraphai from "scrapegraph-js"; import { z } from "zod"; +const sgai = scrapegraphai({ apiKey: "your-api-key" }); + const CompanySchema = z.object({ - company_name: z.string().describe("The company name"), + companyName: z.string().describe("The company name"), description: z.string().describe("Company description"), website: z.string().url().describe("Company website URL"), industry: z.string().describe("Industry sector"), }); -const apiKey = "your-api-key"; -const response = await smartScraper(apiKey, { - website_url: 
"https://scrapegraphai.com", - user_prompt: "Extract company information", - output_schema: CompanySchema, -}); -console.log(response.data.result); +const { data } = await sgai.extract( + "https://scrapegraphai.com", + { + prompt: "Extract company information", + schema: CompanySchema, + } +); +console.log(data); ``` --- diff --git a/integrations/claude-code-skill.mdx b/integrations/claude-code-skill.mdx index 4f85e3c..fedb452 100644 --- a/integrations/claude-code-skill.mdx +++ b/integrations/claude-code-skill.mdx @@ -60,7 +60,7 @@ export SGAI_API_KEY="sgai-..." ``` -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com). +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard). ## What's Included diff --git a/integrations/crewai.mdx b/integrations/crewai.mdx index dba500c..7288b59 100644 --- a/integrations/crewai.mdx +++ b/integrations/crewai.mdx @@ -100,7 +100,7 @@ SCRAPEGRAPH_API_KEY=your_api_key_here ``` -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Use Cases diff --git a/integrations/google-adk.mdx b/integrations/google-adk.mdx index 1d3c7f9..dc87dee 100644 --- a/integrations/google-adk.mdx +++ b/integrations/google-adk.mdx @@ -84,7 +84,7 @@ SGAI_API_KEY = "your-api-key-here" ``` -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Tool Filtering diff --git a/integrations/langchain.mdx b/integrations/langchain.mdx index aed504f..ad0a97f 100644 --- a/integrations/langchain.mdx +++ b/integrations/langchain.mdx @@ -25,20 +25,20 @@ pip install langchain-scrapegraph ## Available Tools -### SmartScraperTool +### ExtractTool Extract structured data from any webpage using natural language prompts: ```python -from langchain_scrapegraph.tools import SmartScraperTool +from langchain_scrapegraph.tools import ExtractTool # 
Initialize the tool (uses SGAI_API_KEY from environment) -tool = SmartscraperTool() +tool = ExtractTool() # Extract information using natural language result = tool.invoke({ - "website_url": "https://www.example.com", - "user_prompt": "Extract the main heading and first paragraph" + "url": "https://www.example.com", + "prompt": "Extract the main heading and first paragraph" }) ``` @@ -46,60 +46,51 @@ result = tool.invoke({ Define the structure of the output using Pydantic models: ```python -from typing import List from pydantic import BaseModel, Field -from langchain_scrapegraph.tools import SmartScraperTool +from langchain_scrapegraph.tools import ExtractTool class WebsiteInfo(BaseModel): - title: str = Field(description="The main title of the webpage") - description: str = Field(description="The main description or first paragraph") - urls: List[str] = Field(description="The URLs inside the webpage") + title: str = Field(description="The main title of the page") + description: str = Field(description="The main description") -# Initialize with schema -tool = SmartScraperTool(llm_output_schema=WebsiteInfo) +# Initialize with output schema +tool = ExtractTool(llm_output_schema=WebsiteInfo) result = tool.invoke({ - "website_url": "https://www.example.com", - "user_prompt": "Extract the website information" + "url": "https://example.com", + "prompt": "Extract the title and description" }) ``` -### SearchScraperTool +### SearchTool -Process HTML content directly with AI extraction: +Search the web and extract structured results using AI: ```python -from langchain_scrapegraph.tools import SearchScraperTool +from langchain_scrapegraph.tools import SearchTool - -tool = SearchScraperTool() +tool = SearchTool() result = tool.invoke({ - "user_prompt": "Find the best restaurants in San Francisco", + "query": "Find the best restaurants in San Francisco", }) - ``` - -```python -from typing import Optional -from pydantic import BaseModel, Field -from langchain_scrapegraph.tools 
import SearchScraperTool +### ScrapeTool -class RestaurantInfo(BaseModel): - name: str = Field(description="The restaurant name") - address: str = Field(description="The restaurant address") - rating: float = Field(description="The restaurant rating") +Scrape a webpage and return it in the desired format: +```python +from langchain_scrapegraph.tools import ScrapeTool -tool = SearchScraperTool(llm_output_schema=RestaurantInfo) +tool = ScrapeTool() -result = tool.invoke({ - "user_prompt": "Find the best restaurants in San Francisco" -}) +# Scrape as markdown (default) +result = tool.invoke({"url": "https://example.com"}) +# Scrape as HTML +result = tool.invoke({"url": "https://example.com", "format": "html"}) ``` - ### MarkdownifyTool @@ -112,34 +103,146 @@ tool = MarkdownifyTool() markdown = tool.invoke({"website_url": "https://example.com"}) ``` +### Crawl Tools + +Start and manage crawl jobs with `CrawlStartTool`, `CrawlStatusTool`, `CrawlStopTool`, and `CrawlResumeTool`: + +```python +import time +from langchain_scrapegraph.tools import CrawlStartTool, CrawlStatusTool + +start_tool = CrawlStartTool() +status_tool = CrawlStatusTool() + +# Start a crawl job +result = start_tool.invoke({ + "url": "https://example.com", + "depth": 2, + "max_pages": 5, + "format": "markdown", +}) +print("Crawl started:", result) + +# Check status +crawl_id = result.get("id") +if crawl_id: + time.sleep(5) + status = status_tool.invoke({"crawl_id": crawl_id}) + print("Crawl status:", status) +``` + +### Monitor Tools + +Create and manage monitors (replaces scheduled jobs) with `MonitorCreateTool`, `MonitorListTool`, `MonitorGetTool`, `MonitorPauseTool`, `MonitorResumeTool`, and `MonitorDeleteTool`: + +```python +from langchain_scrapegraph.tools import MonitorCreateTool, MonitorListTool + +create_tool = MonitorCreateTool() +list_tool = MonitorListTool() + +# Create a monitor +result = create_tool.invoke({ + "name": "Price Monitor", + "url": "https://example.com/products", + "prompt": 
"Extract current product prices", + "cron": "0 9 * * *", # Daily at 9 AM +}) +print("Monitor created:", result) + +# List all monitors +monitors = list_tool.invoke({}) +print("All monitors:", monitors) +``` + +### HistoryTool + +Retrieve request history: + +```python +from langchain_scrapegraph.tools import HistoryTool + +tool = HistoryTool() +history = tool.invoke({}) +``` + +### GetCreditsTool + +Check your remaining API credits: + +```python +from langchain_scrapegraph.tools import GetCreditsTool + +tool = GetCreditsTool() +credits = tool.invoke({}) +``` + ## Example Agent Create a research agent that can gather and analyze web data: ```python -from langchain.agents import initialize_agent, AgentType -from langchain_scrapegraph.tools import SmartScraperTool +from langchain.agents import AgentExecutor, create_openai_functions_agent +from langchain_core.messages import SystemMessage +from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder from langchain_openai import ChatOpenAI +from langchain_scrapegraph.tools import ExtractTool, GetCreditsTool, SearchTool -# Initialize tools +# Initialize the tools tools = [ - SmartScraperTool(), + ExtractTool(), + GetCreditsTool(), + SearchTool(), ] -# Create an agent -agent = initialize_agent( - tools=tools, - llm=ChatOpenAI(temperature=0), - agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, - verbose=True -) - -# Use the agent -response = agent.run(""" - Visit example.com, make a summary of the content and extract the main heading and first paragraph -""") +# Create the prompt template +prompt = ChatPromptTemplate.from_messages([ + SystemMessage( + content=( + "You are a helpful AI assistant that can analyze websites and extract information. " + "You have access to tools that can help you scrape and process web content. " + "Always explain what you're doing before using a tool." 
+ ) + ), + MessagesPlaceholder(variable_name="chat_history", optional=True), + ("user", "{input}"), + MessagesPlaceholder(variable_name="agent_scratchpad"), +]) + +# Initialize the LLM +llm = ChatOpenAI(temperature=0) + +# Create the agent +agent = create_openai_functions_agent(llm, tools, prompt) +agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True) + +# Example usage +response = agent_executor.invoke({ + "input": "Extract the main products from https://www.scrapegraphai.com/" +}) +print(response["output"]) ``` +## Migration from v1 + +If you're upgrading from v1, here are the key changes: + +| v1 Tool | v2 Tool | +|---------|---------| +| `SmartScraperTool` | `ExtractTool` | +| `SearchScraperTool` | `SearchTool` | +| `SmartCrawlerTool` | `CrawlStartTool` / `CrawlStatusTool` / `CrawlStopTool` / `CrawlResumeTool` | +| `CreateScheduledJobTool` | `MonitorCreateTool` | +| `GetScheduledJobsTool` | `MonitorListTool` | +| `GetScheduledJobTool` | `MonitorGetTool` | +| `PauseScheduledJobTool` | `MonitorPauseTool` | +| `ResumeScheduledJobTool` | `MonitorResumeTool` | +| `DeleteScheduledJobTool` | `MonitorDeleteTool` | +| `MarkdownifyTool` | `MarkdownifyTool` (unchanged) | +| `GetCreditsTool` | `GetCreditsTool` (unchanged) | +| `AgenticScraperTool` | Removed | +| -- | `HistoryTool` (new) | + ## Configuration Set your ScrapeGraph API key in your environment: @@ -156,7 +259,7 @@ os.environ["SGAI_API_KEY"] = "your-api-key-here" ``` -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Use Cases diff --git a/integrations/vercel_ai.mdx b/integrations/vercel_ai.mdx index 889df6b..f57a290 100644 --- a/integrations/vercel_ai.mdx +++ b/integrations/vercel_ai.mdx @@ -5,19 +5,19 @@ description: "Integrate ScrapeGraphAI into Vercel AI" ## Overview -[Vercel AI sdk](https://ai-sdk.dev/) is a very populate javascript/typescript framework to interact with various LLMs 
providers. This page shows how to integrate it with ScrapeGraph +[Vercel AI SDK](https://ai-sdk.dev/) is a popular JavaScript/TypeScript framework to interact with various LLM providers. This page shows how to integrate it with ScrapeGraph. - View the integration on LlamaHub + View the Vercel AI SDK documentation ## Installation -Follow out [javascript sdk installation steps](/sdks/javascript) using your favourite package manager: +Follow our [JavaScript SDK installation steps](/sdks/javascript) using your favourite package manager: ```bash # Using npm @@ -33,7 +33,7 @@ yarn add scrapegraph-js bun add scrapegraph-js ``` -Then, install [vercel ai](https://ai-sdk.dev/docs/getting-started) with their [openai provider](https://ai-sdk.dev/providers/ai-sdk-providers/openai) +Then, install [Vercel AI](https://ai-sdk.dev/docs/getting-started) with their [OpenAI provider](https://ai-sdk.dev/providers/ai-sdk-providers/openai): ```bash # Using npm @@ -51,15 +51,15 @@ bun add ai @ai-sdk/openai ## Usage -ScrapeGraph sdk can be used like any other tools, see [vercel ai tool calling doc](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling) +The ScrapeGraph SDK can be used like any other tool. See [Vercel AI tool calling docs](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling). 
```ts import { z } from "zod"; import { generateText, tool } from "ai"; import { openai } from "@ai-sdk/openai"; -import { smartScraper } from "scrapegraph-js"; +import { scrapegraphai } from "scrapegraph-js"; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: process.env.SGAI_API_KEY }); const ArticleSchema = z.object({ title: z.string().describe("The article title"), @@ -77,15 +77,14 @@ const result = await generateText({ model: openai("gpt-4.1-mini"), tools: { scrape: tool({ - description: "Get articles information for a given url.", + description: "Extract articles information from a given URL.", parameters: z.object({ - url: z.string().describe("The exact url."), + url: z.string().describe("The exact URL."), }), execute: async ({ url }) => { - const response = await smartScraper(apiKey, { - website_url: url, - user_prompt: "Extract the article information", - output_schema: ArticlesArraySchema, + const response = await sgai.extract(url, { + prompt: "Extract the article information", + schema: ArticlesArraySchema, }); return response.data; }, @@ -97,8 +96,6 @@ const result = await generateText({ console.log(result); ``` -**TODO ADD THE LOGS** - ## Support Need help with the integration? @@ -107,7 +104,7 @@ Need help with the integration? 
Report bugs and request features diff --git a/introduction.mdx b/introduction.mdx index d848a3c..ab84d8d 100644 --- a/introduction.mdx +++ b/introduction.mdx @@ -33,7 +33,7 @@ description: 'Welcome to ScrapeGraphAI - AI-Powered Web Data Extraction' - Sign up and access your API key from the [dashboard](https://dashboard.scrapegraphai.com) + Sign up and access your API key from the [dashboard](https://scrapegraphai.com/dashboard) Select from our specialized extraction services based on your needs diff --git a/knowledge-base/account/api-keys.mdx b/knowledge-base/account/api-keys.mdx index 71d593d..99a3bb0 100644 --- a/knowledge-base/account/api-keys.mdx +++ b/knowledge-base/account/api-keys.mdx @@ -7,7 +7,7 @@ Your API key authenticates every request you make to the ScrapeGraphAI API. Keep ## Finding your API key -1. Log in to the [ScrapeGraphAI dashboard](https://dashboard.scrapegraphai.com). +1. Log in to the [ScrapeGraphAI dashboard](https://scrapegraphai.com/dashboard). 2. Navigate to **Settings**. 3. Your API key is displayed in the **API Key** section. @@ -24,9 +24,10 @@ client = Client(api_key="your-api-key") ``` ```javascript JavaScript -import { smartScraper } from "scrapegraph-js"; +import { scrapegraphai } from "scrapegraph-js"; -const result = await smartScraper("your-api-key", url, prompt); +const sgai = scrapegraphai({ apiKey: "your-api-key" }); +const { data } = await sgai.extract(url, { prompt }); ``` ```bash cURL @@ -61,7 +62,7 @@ client = Client(api_key=os.getenv("SGAI_API_KEY")) If your key has been exposed or you want to rotate it for security: -1. Go to **Settings** in the [dashboard](https://dashboard.scrapegraphai.com). +1. Go to **Settings** in the [dashboard](https://scrapegraphai.com/dashboard). 2. Click **Regenerate API Key**. 3. Copy the new key immediately — it will only be shown once. 4. Update all services and environment variables that use the old key. 
diff --git a/knowledge-base/account/credits.mdx b/knowledge-base/account/credits.mdx index 72d38f0..2fd375d 100644 --- a/knowledge-base/account/credits.mdx +++ b/knowledge-base/account/credits.mdx @@ -7,14 +7,29 @@ ScrapeGraphAI uses a credit system to measure API usage. Each successful API cal ## Credit costs per service -| Service | Credits per request | +| Service | Credits per request | Details | +|---|---|---| +| **Scrape** (markdown) | 1 | Basic page scrape returning markdown | +| **Scrape** (screenshot) | 2 | Page scrape with a screenshot | +| **Scrape** (branding analysis) | 25 | Full branding analysis of a page | +| **Extract** | 5 | Structured data extraction | +| **Search** (no prompt) | 2 per result | Search results without LLM processing | +| **Search** (with prompt) | 5 per result | Search results processed by an LLM | +| **Crawl** | 2 startup + per-page scrape cost | Startup fee plus scrape cost for each page | +| **Monitor** | +5 | Additional credits when a change is detected | + +### Proxy modifiers + +Using a proxy adds extra credits on top of the base service cost: + +| Proxy mode | Additional credits | |---|---| -| SmartScraper | 1 | -| SearchScraper | 5 | -| Markdownify | 1 | -| SmartCrawler | 1 per page crawled | -| Sitemap | 1 | -| AgenticScraper | Variable | +| Fast / JS rendering | +0 | +| Stealth | +4 | +| JS + Stealth | +5 | +| Auto (worst case) | +9 | + +For a full breakdown of plans and monthly credit allowances, see [Plans & Pricing](/knowledge-base/account/pricing). Failed requests and requests that return an error are not charged. @@ -22,7 +37,7 @@ ScrapeGraphAI uses a credit system to measure API usage. 
Each successful API cal ## Checking your credit balance -Log in to the [dashboard](https://dashboard.scrapegraphai.com) to see: +Log in to the [dashboard](https://scrapegraphai.com/dashboard) to see: - **Remaining credits** for your current billing period - **Usage history** broken down by service and date @@ -54,7 +69,7 @@ When your credits are exhausted, the API returns an HTTP `402 Payment Required` } ``` -Upgrade your plan or purchase additional credits from the [dashboard](https://dashboard.scrapegraphai.com). +Upgrade your plan or purchase additional credits from the [dashboard](https://scrapegraphai.com/dashboard). ## Tips to reduce credit usage diff --git a/knowledge-base/account/pricing.mdx b/knowledge-base/account/pricing.mdx new file mode 100644 index 0000000..77124a4 --- /dev/null +++ b/knowledge-base/account/pricing.mdx @@ -0,0 +1,109 @@ +--- +title: Plans & Pricing +description: 'Overview of ScrapeGraphAI plans, pricing, and what each tier includes' +--- + +ScrapeGraphAI offers flexible plans to fit teams of every size — from hobbyists to enterprises. All plans include access to every service; higher tiers unlock more credits, throughput, and support. + +## Plans + + + + **$0 / month** + + - 500 API credits / month + - 10 requests / min + - 1 monitor + - 1 concurrent crawl + + + + **$17 / month** (or $204 / year — save $36) + + - 10,000 API credits / month + - 100 requests / min + - 5 monitors + - 3 concurrent crawls + + + + **$85 / month** (or $1,020 / year — save $180) + + - 100,000 API credits / month + - 500 requests / min + - 25 monitors + - 15 concurrent crawls + - Basic Proxy Rotation + + + + **$425 / month** (or $5,100 / year — save $900) + + - 750,000 API credits / month + - 5,000 requests / min + - 100 monitors + - 50 concurrent crawls + - Advanced Proxy Rotation + - Priority support + + + +Need more? **Enterprise** plans offer custom credit volumes, custom rate limits, dedicated support, and SLA guarantees. 
[Contact us](mailto:contact@scrapegraphai.com) for details. + +## Credit costs per service + +Every API call consumes credits. The exact cost depends on the service and the options you use. + +| Service | Base cost | Details | +|---|---|---| +| **Scrape** (markdown) | 1 credit | Basic page scrape returning markdown | +| **Scrape** (screenshot) | 2 credits | Page scrape with a screenshot | +| **Scrape** (branding analysis) | 25 credits | Full branding analysis of a page | +| **Extract** | 5 credits | Structured data extraction | +| **Search** (no prompt) | 2 credits / result | Search results without LLM processing | +| **Search** (with prompt) | 5 credits / result | Search results processed by an LLM | +| **Crawl** | 2 credits startup + per-page scrape cost | Startup fee plus scrape cost for each page | +| **Monitor** | +5 credits | Additional credits charged when a change is detected | + +### Proxy modifiers + +Using a proxy adds extra credits on top of the base service cost: + +| Proxy mode | Additional credits | +|---|---| +| Fast / JS rendering | +0 | +| Stealth | +4 | +| JS + Stealth | +5 | +| Auto (worst case) | +9 | + + + Failed requests and requests that return an error are **not** charged. + + +## Comparing plans at a glance + +| | Free | Starter | Growth | Pro | Enterprise | +|---|---|---|---|---|---| +| **Monthly price** | $0 | $17 | $85 | $425 | Custom | +| **Annual price** | $0 | $204 | $1,020 | $5,100 | Custom | +| **Credits / month** | 500 | 10,000 | 100,000 | 750,000 | Custom | +| **Requests / min** | 10 | 100 | 500 | 5,000 | Custom | +| **Monitors** | 1 | 5 | 25 | 100 | Custom | +| **Concurrent crawls** | 1 | 3 | 15 | 50 | Custom | +| **Proxy rotation** | — | — | Basic | Advanced | Custom | +| **Priority support** | — | — | — | Yes | Yes | +| **SLA guarantee** | — | — | — | — | Yes | + +## Upgrading or downgrading + +You can change your plan at any time from the [dashboard](https://scrapegraphai.com/dashboard). 
When upgrading mid-cycle, you receive the additional credits immediately. Downgrades take effect at the start of the next billing period. + +## Annual billing + +All paid plans offer an annual billing option with significant savings: + +- **Starter** — save $36 / year +- **Growth** — save $180 / year +- **Pro** — save $900 / year + +Switch to annual billing from the [dashboard](https://scrapegraphai.com/dashboard). diff --git a/knowledge-base/account/rate-limits.mdx b/knowledge-base/account/rate-limits.mdx index 5a495d5..8c54205 100644 --- a/knowledge-base/account/rate-limits.mdx +++ b/knowledge-base/account/rate-limits.mdx @@ -7,12 +7,15 @@ ScrapeGraphAI enforces rate limits to ensure reliable performance for all users. ## Limits overview -| Plan | Requests per minute | Concurrent jobs | Monthly credits | -|---|---|---|---| -| Free | 5 | 1 | 100 | -| Starter | 30 | 5 | 5,000 | -| Pro | 100 | 20 | 50,000 | -| Enterprise | Custom | Custom | Custom | +| Plan | Requests per minute | Concurrent crawls | Monitors | Monthly credits | +|---|---|---|---|---| +| Free | 10 | 1 | 1 | 500 | +| Starter | 100 | 3 | 5 | 10,000 | +| Growth | 500 | 15 | 25 | 100,000 | +| Pro | 5,000 | 50 | 100 | 750,000 | +| Enterprise | Custom | Custom | Custom | Custom | + +For full pricing details, see [Plans & Pricing](/knowledge-base/account/pricing). Contact [support](mailto:contact@scrapegraphai.com) for custom limits or high-volume plans. @@ -59,5 +62,5 @@ def scrape_with_backoff(client, url, prompt, max_retries=5): ## Increasing your limits -- **Upgrade your plan** from the [dashboard](https://dashboard.scrapegraphai.com) to get higher limits immediately. +- **Upgrade your plan** from the [dashboard](https://scrapegraphai.com/dashboard) to get higher limits immediately. - **Enterprise customers** can request custom rate limit configurations by contacting [support](mailto:contact@scrapegraphai.com). 
diff --git a/knowledge-base/ai-tools/cursor.mdx b/knowledge-base/ai-tools/cursor.mdx index 017d321..a2ade0a 100644 --- a/knowledge-base/ai-tools/cursor.mdx +++ b/knowledge-base/ai-tools/cursor.mdx @@ -53,14 +53,14 @@ Ask Cursor: > Write a JavaScript function using scrapegraph-js that extracts product details from an e-commerce page. ```javascript -import { smartScraper } from "scrapegraph-js"; +import { scrapegraphai } from "scrapegraph-js"; + +const sgai = scrapegraphai({ apiKey: "your-api-key" }); async function extractProduct(url) { - return await smartScraper( - "your-api-key", - url, - "Extract the product name, price, and availability" - ); + return await sgai.extract(url, { + prompt: "Extract the product name, price, and availability", + }); } ``` diff --git a/knowledge-base/ai-tools/lovable.mdx b/knowledge-base/ai-tools/lovable.mdx index ab252ff..4d8b928 100644 --- a/knowledge-base/ai-tools/lovable.mdx +++ b/knowledge-base/ai-tools/lovable.mdx @@ -13,7 +13,7 @@ Because Lovable apps run in the browser, API calls to ScrapeGraphAI must be made ### 1. Get your API key -Log in to the [ScrapeGraphAI dashboard](https://dashboard.scrapegraphai.com) and copy your API key from the Settings page. +Log in to the [ScrapeGraphAI dashboard](https://scrapegraphai.com/dashboard) and copy your API key from the Settings page. ### 2. Create a Supabase Edge Function diff --git a/knowledge-base/cli/getting-started.mdx b/knowledge-base/cli/getting-started.mdx index cb64ee3..d68c913 100644 --- a/knowledge-base/cli/getting-started.mdx +++ b/knowledge-base/cli/getting-started.mdx @@ -39,7 +39,7 @@ Package: [just-scrape](https://www.npmjs.com/package/just-scrape) on npm | [GitH ## Setting up your API key -The CLI needs a ScrapeGraphAI API key. Get one from the [dashboard](https://dashboard.scrapegraphai.com). The CLI checks for it in this order: +The CLI needs a ScrapeGraphAI API key. Get one from the [dashboard](https://scrapegraphai.com/dashboard). 
The CLI checks for it in this order: 1. **Environment variable** — `export SGAI_API_KEY="sgai-..."` 2. **`.env` file** — `SGAI_API_KEY=sgai-...` in the project root @@ -53,19 +53,14 @@ The easiest approach for a new machine is to just run any command — the CLI wi | Variable | Description | Default | |---|---|---| | `SGAI_API_KEY` | ScrapeGraphAI API key | — | -| `JUST_SCRAPE_API_URL` | Override the API base URL | `https://api.scrapegraphai.com/v1` | -| `JUST_SCRAPE_TIMEOUT_S` | Request/polling timeout in seconds | `120` | -| `JUST_SCRAPE_DEBUG` | Set to `1` to enable debug logging to stderr | `0` | +| `SGAI_API_URL` | Override the API base URL | `https://api.scrapegraphai.com` | +| `SGAI_TIMEOUT_S` | Request timeout in seconds | `30` | -## Verify your setup - -Run a quick health check to confirm the key is valid: +Legacy variables (`JUST_SCRAPE_API_URL`, `JUST_SCRAPE_TIMEOUT_S`, `JUST_SCRAPE_DEBUG`) are still supported and mapped to the new variables. -```bash -just-scrape validate -``` +## Verify your setup -Check your credit balance: +Check your credit balance to confirm the key is valid: ```bash just-scrape credits @@ -74,7 +69,7 @@ just-scrape credits ## Your first scrape ```bash -just-scrape smart-scraper https://news.ycombinator.com \ +just-scrape extract https://news.ycombinator.com \ -p "Extract the top 5 story titles and their URLs" ``` diff --git a/knowledge-base/scraping/custom-headers.mdx b/knowledge-base/scraping/custom-headers.mdx index fd4b482..b69d919 100644 --- a/knowledge-base/scraping/custom-headers.mdx +++ b/knowledge-base/scraping/custom-headers.mdx @@ -26,19 +26,18 @@ response = client.smartscraper( ``` ```javascript -import { smartScraper } from "scrapegraph-js"; +import { scrapegraphai } from "scrapegraph-js"; -const result = await smartScraper( - "your-api-key", - "https://example.com/protected-page", - "Extract the main content", - { +const sgai = scrapegraphai({ apiKey: "your-api-key" }); +const { data } = await sgai.extract("https://example.com/protected-page", { + 
prompt: "Extract the main content", + fetchConfig: { headers: { Authorization: "Bearer your-token-here", Cookie: "session=abc123", }, - } -); + }, +}); ``` See the [headers parameter documentation](/services/additional-parameters/headers) for the full reference. diff --git a/knowledge-base/scraping/javascript-rendering.mdx b/knowledge-base/scraping/javascript-rendering.mdx index 5ab0afe..732b6d8 100644 --- a/knowledge-base/scraping/javascript-rendering.mdx +++ b/knowledge-base/scraping/javascript-rendering.mdx @@ -26,14 +26,13 @@ response = client.smartscraper( ``` ```javascript -import { smartScraper } from "scrapegraph-js"; - -const result = await smartScraper( - "your-api-key", - "https://example.com/products", - "Extract all product names and prices", - { wait_ms: 2000 } -); +import { scrapegraphai } from "scrapegraph-js"; + +const sgai = scrapegraphai({ apiKey: "your-api-key" }); +const { data } = await sgai.extract("https://example.com/products", { + prompt: "Extract all product names and prices", + fetchConfig: { wait: 2000 }, +}); ``` See the [wait_ms parameter documentation](/services/additional-parameters/wait-ms) for more details. diff --git a/knowledge-base/scraping/pagination.mdx b/knowledge-base/scraping/pagination.mdx index b086aaa..48466d3 100644 --- a/knowledge-base/scraping/pagination.mdx +++ b/knowledge-base/scraping/pagination.mdx @@ -43,19 +43,17 @@ print(f"Total products extracted: {len(all_results)}") ``` ```javascript -import { smartScraper } from "scrapegraph-js"; +import { scrapegraphai } from "scrapegraph-js"; -const apiKey = "your-api-key"; +const sgai = scrapegraphai({ apiKey: "your-api-key" }); const allResults = []; for (let page = 1; page <= 5; page++) { const url = `https://example.com/products?page=${page}`; - const result = await smartScraper( - apiKey, - url, - "Extract all product names and prices on this page" - ); - allResults.push(...(result?.products ?? 
[])); + const { data } = await sgai.extract(url, { + prompt: "Extract all product names and prices on this page", + }); + allResults.push(...(data?.products ?? [])); } ``` diff --git a/knowledge-base/scraping/proxy.mdx b/knowledge-base/scraping/proxy.mdx index 1350b71..2713ad8 100644 --- a/knowledge-base/scraping/proxy.mdx +++ b/knowledge-base/scraping/proxy.mdx @@ -1,88 +1,131 @@ --- -title: Scraping behind a proxy -description: 'Route requests through your own proxy for geo-targeting or privacy' +title: Proxy & Fetch Configuration +description: 'Control proxy routing, stealth mode, and geo-targeting with FetchConfig' --- -Using a proxy lets you route ScrapeGraphAI requests through a specific IP address or geographic location. This is useful for accessing geo-restricted content, bypassing IP-based blocks, or testing region-specific pages. +In v2, all proxy and fetch behaviour is controlled through the `FetchConfig` object. You can set the proxy strategy (`mode`), country-based geotargeting (`country`), wait times, scrolling, custom headers, and more. -## How to pass a proxy +See the [full proxy reference](/services/additional-parameters/proxy) for all available options. 
-Use the `proxy` parameter available in SmartScraper, SearchScraper, and Markdownify: +## Choosing a fetch mode -```python -from scrapegraph_py import Client +The `mode` parameter controls how pages are retrieved: + +| Mode | Description | +|------|-------------| +| `auto` | Automatically selects the best strategy (default) | +| `fast` | Direct HTTP fetch, no JS rendering — fastest option | +| `js` | Headless browser for JavaScript-heavy pages | +| `direct+stealth` | Residential proxy with stealth headers (no JS) | +| `js+stealth` | JS rendering + residential/stealth proxy | + +## Examples + +### Geo-targeted content + +Access content from a specific country using the `country` parameter: + + + +```python Python +from scrapegraph_py import Client, FetchConfig client = Client(api_key="your-api-key") -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main content", - proxy="http://username:password@proxy-host:8080", +response = client.extract( + url="https://example.com", + prompt="Extract the main content", + fetch_config=FetchConfig(country="de"), # Route through Germany ) ``` -```javascript -import { smartScraper } from "scrapegraph-js"; - -const result = await smartScraper( - "your-api-key", - "https://example.com", - "Extract the main content", - { - proxy: "http://username:password@proxy-host:8080", - } -); +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract the main content', + fetchConfig: { country: 'de' }, +}); ``` -See the [proxy parameter documentation](/services/additional-parameters/proxy) for the full reference. 
+ -## Proxy URL format +### Stealth mode for protected sites -``` -http://username:password@host:port -socks5://username:password@host:port -``` +Use stealth modes to bypass anti-bot protections: -If the proxy does not require authentication: + -``` -http://host:port +```python Python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + +response = client.scrape( + url="https://protected-site.com", + format="markdown", + fetch_config=FetchConfig( + mode="js+stealth", + wait=3000, + scrolls=3, + country="us", + ), +) ``` -## Common use cases +```javascript JavaScript +const { data } = await sgai.scrape('https://protected-site.com', { + format: 'markdown', + fetchConfig: { + mode: 'js+stealth', + wait: 3000, + scrolls: 3, + country: 'us', + }, +}); +``` -### Geo-targeted content + -Access content that is only available in a specific country: +### Custom headers and cookies -```python -# Using a proxy located in Germany -proxy = "http://user:pass@de-proxy.example.com:8080" -``` +Pass custom HTTP headers or cookies with your requests: -### Bypassing IP-based rate limits + -If the target website blocks your IP after too many requests, rotate through a pool of proxy IPs: +```python Python +from scrapegraph_py import Client, FetchConfig -```python -import itertools +client = Client(api_key="your-api-key") -proxies = itertools.cycle([ - "http://user:pass@proxy1.example.com:8080", - "http://user:pass@proxy2.example.com:8080", - "http://user:pass@proxy3.example.com:8080", -]) +response = client.extract( + url="https://example.com", + prompt="Extract product details", + fetch_config=FetchConfig( + headers={"Accept-Language": "en-US"}, + cookies={"session": "abc123"}, + ), +) +``` -for url in urls_to_scrape: - response = client.smartscraper( - website_url=url, - user_prompt="Extract the product details", - proxy=next(proxies), - ) +```javascript JavaScript +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract 
product details', + fetchConfig: { + headers: { 'Accept-Language': 'en-US' }, + cookies: { session: 'abc123' }, + }, +}); ``` + + ## Tips -- Use a reputable proxy provider for reliable uptime and performance. -- Test your proxy connection independently before passing it to ScrapeGraphAI to rule out proxy-side issues. -- Do not use public/free proxies for sensitive data — they may log or modify your traffic. +- Start with `mode: "auto"` and only switch to a specific mode if you need to. +- Use `js+stealth` for sites with strong anti-bot protections. +- Add `wait` time for pages that load content dynamically after the initial render. +- Use `scrolls` to trigger lazy-loaded content on infinite-scroll pages. +- The `country` parameter doesn't affect pricing — credits are charged the same regardless of proxy location. diff --git a/knowledge-base/troubleshooting/empty-results.mdx b/knowledge-base/troubleshooting/empty-results.mdx index b74e467..0163d04 100644 --- a/knowledge-base/troubleshooting/empty-results.mdx +++ b/knowledge-base/troubleshooting/empty-results.mdx @@ -47,10 +47,10 @@ If you define an `output_schema` with required fields, the LLM will return `null If you have exhausted your credits or are being rate-limited, the API may return an empty or error response. -**Fix:** Check your [dashboard](https://dashboard.scrapegraphai.com) for remaining credits and current usage. +**Fix:** Check your [dashboard](https://scrapegraphai.com/dashboard) for remaining credits and current usage. ## Debugging tips - Log the full API response — the `result` key contains the extracted data; `status` and `error` keys may contain useful information. - Test the URL with a simple prompt like `"What is the main heading of this page?"` to verify that extraction works at all. -- Use the [interactive playground](https://dashboard.scrapegraphai.com) to test your URL and prompt before integrating. 
+- Use the [interactive playground](https://scrapegraphai.com/dashboard) to test your URL and prompt before integrating. diff --git a/knowledge-base/troubleshooting/rate-limiting.mdx b/knowledge-base/troubleshooting/rate-limiting.mdx index 6a5732f..0363681 100644 --- a/knowledge-base/troubleshooting/rate-limiting.mdx +++ b/knowledge-base/troubleshooting/rate-limiting.mdx @@ -28,7 +28,7 @@ When you exceed the rate limit, the API returns an HTTP `429 Too Many Requests` | Enterprise | Custom | Custom | - Check the [dashboard](https://dashboard.scrapegraphai.com) for up-to-date limits for your current plan. + Check the [dashboard](https://scrapegraphai.com/dashboard) for up-to-date limits for your current plan. ## How to handle rate limits in code @@ -56,12 +56,14 @@ def scrape_with_retry(url: str, prompt: str, max_retries: int = 3): ### JavaScript — with retry ```javascript -import { smartScraper } from "scrapegraph-js"; +import { scrapegraphai } from "scrapegraph-js"; -async function scrapeWithRetry(apiKey, url, prompt, retries = 3) { +const sgai = scrapegraphai({ apiKey: "your-api-key" }); + +async function scrapeWithRetry(url, prompt, retries = 3) { for (let i = 0; i < retries; i++) { try { - return await smartScraper(apiKey, url, prompt); + return await sgai.extract(url, { prompt }); } catch (err) { if (err.status === 429) { const wait = Math.pow(2, i) * 1000; diff --git a/logo/dark.svg b/logo/dark.svg deleted file mode 100644 index 33285d6..0000000 --- a/logo/dark.svg +++ /dev/null @@ -1,145 +0,0 @@ - - - - diff --git a/logo/light.svg b/logo/light.svg deleted file mode 100644 index 33285d6..0000000 --- a/logo/light.svg +++ /dev/null @@ -1,145 +0,0 @@ - - - - diff --git a/logos/logo-color.svg b/logos/logo-color.svg new file mode 100644 index 0000000..6fb828b --- /dev/null +++ b/logos/logo-color.svg @@ -0,0 +1,15 @@ + + + + + + + + + + + + + + + diff --git a/logos/logo-dark-alt.svg b/logos/logo-dark-alt.svg new file mode 100644 index 0000000..cd47e15 --- /dev/null 
+++ b/logos/logo-dark-alt.svg @@ -0,0 +1,15 @@ + + + + + + + + + + + + + + + diff --git a/logos/logo-dark.svg b/logos/logo-dark.svg new file mode 100644 index 0000000..8545571 --- /dev/null +++ b/logos/logo-dark.svg @@ -0,0 +1,15 @@ + + + + + + + + + + + + + + + diff --git a/logos/logo-light.svg b/logos/logo-light.svg new file mode 100644 index 0000000..6fb828b --- /dev/null +++ b/logos/logo-light.svg @@ -0,0 +1,15 @@ + + + + + + + + + + + + + + + diff --git a/resources/blog.mdx b/resources/blog.mdx index 0d4aa0a..2076c5e 100644 --- a/resources/blog.mdx +++ b/resources/blog.mdx @@ -44,7 +44,7 @@ Master the art of prompt engineering for AI web scraping. This comprehensive gui ## Additional Resources - **Complete Guide**: [The Art of Prompting](https://scrapegraphai.com/blog/prompt-engineering-guide) -- **Practice in Playground**: [Test your prompts](https://dashboard.scrapegraphai.com/playground) +- **Practice in Playground**: [Test your prompts](https://scrapegraphai.com/dashboard) - **Community Support**: [Discord discussions](https://discord.gg/uJN7TYcpNa) - **Examples**: Check our [Cookbook](/cookbook/introduction) for real-world implementations diff --git a/sdks/javascript.mdx b/sdks/javascript.mdx index ed4fb55..4dd33f2 100644 --- a/sdks/javascript.mdx +++ b/sdks/javascript.mdx @@ -1,6 +1,6 @@ --- title: "JavaScript SDK" -description: "Official JavaScript/TypeScript SDK for ScrapeGraphAI" +description: "Official JavaScript/TypeScript SDK for ScrapeGraphAI v2" icon: "js" --- @@ -22,8 +22,6 @@ icon: "js" ## Installation -Install the package using npm, pnpm, yarn or bun: - ```bash # Using npm npm i scrapegraph-js @@ -38,82 +36,77 @@ yarn add scrapegraph-js bun add scrapegraph-js ``` -## Features + +v2 requires **Node.js >= 22**. 
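+If your project targets multiple runtimes, a tiny pre-flight check can fail fast with a clear message instead of a confusing SDK error. The helper below is illustrative only and not part of `scrapegraph-js`:
+
+```javascript
+// Illustrative pre-flight check for the Node.js >= 22 requirement.
+// Not part of the SDK; adapt or drop as needed.
+function meetsNodeRequirement(version, minMajor = 22) {
+  const major = Number(version.split(".")[0]);
+  return Number.isInteger(major) && major >= minMajor;
+}
+
+if (!meetsNodeRequirement(process.versions.node)) {
+  console.warn(
+    `scrapegraph-js v2 requires Node.js >= 22; detected ${process.versions.node}`
+  );
+}
+```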
+ -- **AI-Powered Extraction**: Smart web scraping with artificial intelligence -- **Async by Design**: Fully asynchronous architecture -- **Type Safety**: Built-in TypeScript support with Zod schemas -- **Zero Exceptions**: All errors wrapped in `ApiResult` — no try/catch needed -- **Developer Friendly**: Comprehensive error handling and debug logging +## What's New in v2 -## Quick Start +- **Factory pattern**: Create a client with `scrapegraphai({ apiKey })` instead of importing individual functions +- **Renamed methods**: `smartScraper` → `extract`, `searchScraper` → `search` +- **camelCase parameters**: All params are now camelCase (e.g., `fetchConfig` instead of `fetch_config`) +- **Throws on error**: Methods return `{ data, requestId }` and throw on failure (no more `ApiResult` wrapper) +- **Native Zod support**: Pass Zod schemas directly to `schema` parameter +- **Namespace methods**: `crawl.start()`, `monitor.create()`, etc. +- **Removed**: `agenticScraper`, `generateSchema`, `sitemap`, `checkHealth`, `markdownify` -### Basic example + +v2 is a breaking release. If you're upgrading from v1, see the [Migration Guide](https://github.com/ScrapeGraphAI/scrapegraph-js/blob/main/MIGRATION.md). + - - Store your API keys securely in environment variables. Use `.env` files and - libraries like `dotenv` to load them into your app. - +## Quick Start ```javascript -import { smartScraper } from "scrapegraph-js"; -import "dotenv/config"; +import { scrapegraphai } from "scrapegraph-js"; -const apiKey = process.env.SGAI_APIKEY; +const sgai = scrapegraphai({ apiKey: "your-api-key" }); -const response = await smartScraper(apiKey, { - website_url: "https://example.com", - user_prompt: "What does the company do?", -}); +const { data, requestId } = await sgai.extract( + "https://example.com", + { prompt: "What does the company do?" 
} +); -if (response.status === "error") { - console.error("Error:", response.error); -} else { - console.log(response.data.result); -} +console.log(data); ``` + +Store your API keys securely in environment variables. Use `.env` files and +libraries like `dotenv` to load them into your app. + + +### Client Options + +| Parameter | Type | Default | Description | +| ---------- | ------ | -------------------------------- | ------------------------------- | +| apiKey | string | Required | Your ScrapeGraphAI API key | +| baseUrl | string | `https://api.scrapegraphai.com` | API base URL | +| timeout | number | `30000` | Request timeout in ms | +| maxRetries | number | `2` | Maximum number of retries | + ## Services -### SmartScraper +### extract() -Extract specific information from any webpage using AI: +Extract structured data from any webpage using AI. Replaces the v1 `smartScraper` function. ```javascript -const response = await smartScraper(apiKey, { - website_url: "https://example.com", - user_prompt: "Extract the main content", -}); -``` - -All functions return an `ApiResult` object: -```typescript -type ApiResult = { - status: "success" | "error"; - data: T | null; - error?: string; - elapsedMs: number; -}; +const { data, requestId } = await sgai.extract( + "https://example.com", + { prompt: "Extract the main heading and description" } +); ``` #### Parameters -| Parameter | Type | Required | Description | -| --------------- | ------- | -------- | ----------------------------------------------------------------------------------- | -| apiKey | string | Yes | The ScrapeGraph API Key (first argument). | -| user_prompt | string | Yes | A textual description of what you want to extract. | -| website_url | string | No* | The URL of the webpage to scrape. *One of `website_url`, `website_html`, or `website_markdown` is required. | -| output_schema | object | No | A Zod schema (converted to JSON) that describes the structure of the response. 
| -| number_of_scrolls | number | No | Number of scrolls for infinite scroll pages (0-50). | -| stealth | boolean | No | Enable anti-detection mode (+4 credits). | -| headers | object | No | Custom HTTP headers. | -| mock | boolean | No | Enable mock mode for testing. | -| wait_ms | number | No | Page load wait time in ms (default: 3000). | -| country_code | string | No | Proxy routing country code (e.g., "us"). | - - -Define a simple schema using Zod: +| Parameter | Type | Required | Description | +| -------------------- | ----------- | -------- | -------------------------------------------------------- | +| url | string | Yes | The URL of the webpage to scrape | +| options.prompt | string | Yes | A description of what you want to extract | +| options.schema | ZodSchema / object | No | Zod schema or JSON schema for structured response | +| options.fetchConfig | FetchConfig | No | Fetch configuration | +| options.llmConfig | LlmConfig | No | LLM configuration | + ```javascript import { z } from "zod"; @@ -122,301 +115,222 @@ const ArticleSchema = z.object({ author: z.string().describe("The author's name"), publishDate: z.string().describe("Article publication date"), content: z.string().describe("Main article content"), - category: z.string().describe("Article category"), }); -const ArticlesArraySchema = z - .array(ArticleSchema) - .describe("Array of articles"); +const { data } = await sgai.extract( + "https://example.com/blog/article", + { + prompt: "Extract the article information", + schema: ArticleSchema, + } +); -const response = await smartScraper(apiKey, { - website_url: "https://example.com/blog/article", - user_prompt: "Extract the article information", - output_schema: ArticlesArraySchema, -}); - -console.log(`Title: ${response.data.result.title}`); -console.log(`Author: ${response.data.result.author}`); -console.log(`Published: ${response.data.result.publishDate}`); +console.log(`Title: ${data.title}`); +console.log(`Author: ${data.author}`); ``` - - 
-Define a complex schema for nested data structures: - + ```javascript -import { z } from "zod"; - -const EmployeeSchema = z.object({ - name: z.string().describe("Employee's full name"), - position: z.string().describe("Job title"), - department: z.string().describe("Department name"), - email: z.string().describe("Email address"), -}); - -const OfficeSchema = z.object({ - location: z.string().describe("Office location/city"), - address: z.string().describe("Full address"), - phone: z.string().describe("Contact number"), -}); - -const CompanySchema = z.object({ - name: z.string().describe("Company name"), - description: z.string().describe("Company description"), - industry: z.string().describe("Industry sector"), - foundedYear: z.number().describe("Year company was founded"), - employees: z.array(EmployeeSchema).describe("List of key employees"), - offices: z.array(OfficeSchema).describe("Company office locations"), - website: z.string().url().describe("Company website URL"), -}); +const { data } = await sgai.extract( + "https://example.com", + { + prompt: "Extract the main heading", + fetchConfig: { + mode: 'js+stealth', + wait: 2000, + scrolls: 3, + }, + llmConfig: { + temperature: 0.3, + maxTokens: 1000, + }, + } +); +``` + -const response = await smartScraper(apiKey, { - website_url: "https://example.com/about", - user_prompt: "Extract detailed company information including employees and offices", - output_schema: CompanySchema, -}); +### search() -console.log(`Company: ${response.data.result.name}`); -console.log("\nKey Employees:"); -response.data.result.employees.forEach((employee) => { - console.log(`- ${employee.name} (${employee.position})`); -}); +Search the web and extract information. Replaces the v1 `searchScraper` function. 
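+The `numResults` option accepts 3 to 20 results (see the parameter table in this section). If you compute the value at runtime, a small clamp keeps it in bounds; this is a hypothetical convenience, not an SDK function:
+
+```javascript
+// Hypothetical guard for the documented numResults range (3-20, default 5).
+// Keeps dynamically computed values inside the accepted bounds.
+function clampNumResults(n = 5) {
+  return Math.min(20, Math.max(3, Math.trunc(n)));
+}
+```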
-console.log("\nOffice Locations:"); -response.data.result.offices.forEach((office) => { - console.log(`- ${office.location}: ${office.address}`); -}); +```javascript +const { data } = await sgai.search( + "What are the key features and pricing of ChatGPT Plus?", + { numResults: 5 } +); ``` - +#### Parameters - -For modern web applications built with React, Vue, Angular, or other JavaScript frameworks: +| Parameter | Type | Required | Description | +| -------------------- | ----------- | -------- | -------------------------------------------------------- | +| query | string | Yes | The search query | +| options.numResults | number | No | Number of results (3-20). Default: 5 | +| options.schema | ZodSchema / object | No | Schema for structured response | +| options.fetchConfig | FetchConfig | No | Fetch configuration | +| options.llmConfig | LlmConfig | No | LLM configuration | + ```javascript -import { smartScraper } from 'scrapegraph-js'; -import { z } from 'zod'; - -const apiKey = 'your-api-key'; +import { z } from "zod"; const ProductSchema = z.object({ - name: z.string().describe('Product name'), - price: z.string().describe('Product price'), - description: z.string().describe('Product description'), - availability: z.string().describe('Product availability status') + name: z.string().describe("Product name"), + price: z.string().describe("Product price"), + features: z.array(z.string()).describe("Key features"), }); -const response = await smartScraper(apiKey, { - website_url: 'https://example-react-store.com/products/123', - user_prompt: 'Extract product details including name, price, description, and availability', - output_schema: ProductSchema, -}); +const { data } = await sgai.search( + "Find information about iPhone 15 Pro", + { + schema: ProductSchema, + numResults: 5, + } +); -if (response.status === 'error') { - console.error('Error:', response.error); -} else { - console.log('Product:', response.data.result.name); - console.log('Price:', 
response.data.result.price); - console.log('Available:', response.data.result.availability); -} +console.log(`Product: ${data.name}`); +console.log(`Price: ${data.price}`); ``` - -### SearchScraper +### scrape() -Search and extract information from multiple web sources using AI: +Convert any webpage to markdown, HTML, screenshot, or branding format. ```javascript -const response = await searchScraper(apiKey, { - user_prompt: "Find the best restaurants in San Francisco", - location_geo_code: "us", - time_range: "past_week", -}); +const { data } = await sgai.scrape("https://example.com"); +console.log(data); ``` #### Parameters -| Parameter | Type | Required | Description | -| ------------------ | ------- | -------- | ---------------------------------------------------------------------------------- | -| apiKey | string | Yes | The ScrapeGraph API Key (first argument). | -| user_prompt | string | Yes | A textual description of what you want to achieve. | -| num_results | number | No | Number of websites to search (3-20). Default: 3. | -| extraction_mode | boolean | No | **true** = AI extraction mode (10 credits/page), **false** = markdown mode (2 credits/page). | -| output_schema | object | No | Zod schema for structured response format (AI extraction mode only). | -| location_geo_code | string | No | Geo code for location-based search (e.g., "us"). | -| time_range | string | No | Time range filter. Options: "past_hour", "past_24_hours", "past_week", "past_month", "past_year". 
| +| Parameter | Type | Required | Description | +| -------------------- | ----------- | -------- | -------------------------------------------------------- | +| url | string | Yes | The URL of the webpage to scrape | +| options.format | string | No | `"markdown"`, `"html"`, `"screenshot"`, `"branding"` | +| options.fetchConfig | FetchConfig | No | Fetch configuration | - -Define a simple schema using Zod: +### crawl -```javascript -import { z } from "zod"; +Manage multi-page crawl operations asynchronously. -const ArticleSchema = z.object({ - title: z.string().describe("The article title"), - author: z.string().describe("The author's name"), - publishDate: z.string().describe("Article publication date"), - content: z.string().describe("Main article content"), - category: z.string().describe("Article category"), +```javascript +// Start a crawl +const job = await sgai.crawl.start("https://example.com", { + maxDepth: 2, + maxPages: 10, + includePatterns: ["/blog/*", "/docs/**"], + excludePatterns: ["/admin/*", "/api/*"], }); +console.log(`Crawl started: ${job.data.id}`); -const response = await searchScraper(apiKey, { - user_prompt: "Find news about the latest trends in AI", - output_schema: ArticleSchema, - location_geo_code: "us", - time_range: "past_week", -}); +// Check status +const status = await sgai.crawl.status(job.data.id); +console.log(`Status: ${status.data.status}`); -console.log(`Title: ${response.data.result.title}`); -console.log(`Author: ${response.data.result.author}`); -console.log(`Published: ${response.data.result.publishDate}`); +// Stop / Resume +await sgai.crawl.stop(job.data.id); +await sgai.crawl.resume(job.data.id); ``` - +### monitor - -Define a complex schema for nested data structures: +Create and manage site monitoring jobs. 
```javascript -import { z } from "zod"; - -const EmployeeSchema = z.object({ - name: z.string().describe("Employee's full name"), - position: z.string().describe("Job title"), - department: z.string().describe("Department name"), - email: z.string().describe("Email address"), +// Create a monitor +const monitor = await sgai.monitor.create({ + url: "https://example.com", + prompt: "Track price changes", + schedule: "daily", }); -const OfficeSchema = z.object({ - location: z.string().describe("Office location/city"), - address: z.string().describe("Full address"), - phone: z.string().describe("Contact number"), -}); - -const RestaurantSchema = z.object({ - name: z.string().describe("Restaurant name"), - address: z.string().describe("Restaurant address"), - rating: z.number().describe("Restaurant rating"), - website: z.string().url().describe("Restaurant website URL"), -}); +// List all monitors +const monitors = await sgai.monitor.list(); -const response = await searchScraper(apiKey, { - user_prompt: "Find the best restaurants in San Francisco", - output_schema: RestaurantSchema, - location_geo_code: "us", - time_range: "past_month", -}); +// Get / Pause / Resume / Delete +const details = await sgai.monitor.get(monitor.data.id); +await sgai.monitor.pause(monitor.data.id); +await sgai.monitor.resume(monitor.data.id); +await sgai.monitor.delete(monitor.data.id); ``` - +### credits() - -Use markdown mode for cost-effective content gathering: +Check your account credit balance. 
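+A common pattern is to alert before a long batch job runs out of budget. The sketch below assumes the `remainingCredits` field from this section's example response; the helper and threshold are hypothetical, not part of the SDK:
+
+```javascript
+// Hypothetical low-balance check built on the credits() response shape.
+function isLowOnCredits(credits, threshold = 100) {
+  return credits.remainingCredits < threshold;
+}
+
+// Usage with a stubbed response object:
+const stub = { remainingCredits: 42, totalCreditsUsed: 958 };
+if (isLowOnCredits(stub)) {
+  console.warn(`Only ${stub.remainingCredits} credits left`);
+}
+```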
```javascript -import { searchScraper } from 'scrapegraph-js'; +const { data } = await sgai.credits(); +console.log(`Remaining: ${data.remainingCredits}`); +console.log(`Used: ${data.totalCreditsUsed}`); +``` -const apiKey = 'your-api-key'; +### history() -const response = await searchScraper(apiKey, { - user_prompt: 'Latest developments in artificial intelligence', - num_results: 3, - extraction_mode: false, - location_geo_code: "us", - time_range: "past_week", +Retrieve paginated request history. + +```javascript +const { data } = await sgai.history({ + endpoint: "extract", + status: "completed", + limit: 20, + offset: 0, }); -if (response.status === 'error') { - console.error('Error:', response.error); -} else { - const markdownContent = response.data.markdown_content; - console.log('Markdown content length:', markdownContent.length); - console.log('Reference URLs:', response.data.reference_urls); - console.log('Content preview:', markdownContent.substring(0, 500) + '...'); -} +data.items.forEach((entry) => { + console.log(`${entry.createdAt} - ${entry.endpoint} - ${entry.status}`); +}); ``` -**Markdown Mode Benefits:** -- **Cost-effective**: Only 2 credits per page (vs 10 credits for AI extraction) -- **Full content**: Get complete page content in markdown format -- **Faster**: No AI processing overhead -- **Perfect for**: Content analysis, bulk data collection, building datasets +## Configuration Objects - +### FetchConfig - -Filter search results by date range to get only recent information: +Controls how pages are fetched. See the [proxy configuration guide](/services/additional-parameters/proxy) for details on modes and geotargeting. 
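+If you assemble fetch configs dynamically, a small client-side check against the documented ranges (timeout 1000-60000 ms, wait 0-30000 ms, scrolls 0-100) can catch typos before a request is spent. This is a hypothetical helper, not part of the SDK, and the API still validates server-side:
+
+```javascript
+// Hypothetical client-side sanity check against the documented FetchConfig ranges.
+// Throws RangeError on out-of-bounds values; undefined fields are left to defaults.
+function assertFetchConfig(cfg) {
+  const check = (name, value, lo, hi) => {
+    if (value !== undefined && (value < lo || value > hi)) {
+      throw new RangeError(`${name} must be between ${lo} and ${hi}, got ${value}`);
+    }
+  };
+  check("timeout", cfg.timeout, 1000, 60000);
+  check("wait", cfg.wait, 0, 30000);
+  check("scrolls", cfg.scrolls, 0, 100);
+  return cfg;
+}
+```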
```javascript -import { searchScraper } from 'scrapegraph-js'; - -const apiKey = 'your-api-key'; - -const response = await searchScraper(apiKey, { - user_prompt: 'Latest news about AI developments', - num_results: 5, - time_range: 'past_week', // Options: 'past_hour', 'past_24_hours', 'past_week', 'past_month', 'past_year' -}); - -if (response.status === 'error') { - console.error('Error:', response.error); -} else { - console.log('Recent AI news:', response.data.result); - console.log('Reference URLs:', response.data.reference_urls); +{ + mode: 'js+stealth', // Proxy strategy: auto, fast, js, direct+stealth, js+stealth + timeout: 15000, // Request timeout in ms (1000-60000) + wait: 2000, // Wait after page load in ms (0-30000) + scrolls: 3, // Number of scrolls (0-100) + country: 'us', // Proxy country code (ISO 3166-1 alpha-2) + headers: { 'X-Custom': 'header' }, + cookies: { key: 'value' }, + mock: false, // Enable mock mode for testing } ``` -**Time Range Options:** -- `past_hour` - Results from the past hour -- `past_24_hours` - Results from the past 24 hours -- `past_week` - Results from the past week -- `past_month` - Results from the past month -- `past_year` - Results from the past year - -**Use Cases:** -- Finding recent news and updates -- Tracking time-sensitive information -- Getting latest product releases -- Monitoring recent market changes - - - -### Markdownify +### LlmConfig -Convert any webpage into clean, formatted markdown: +Controls LLM behavior for AI-powered methods. 
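+The `chunker` option controls how page content is split before it reaches the model. The splitting itself happens server-side (and `size: "dynamic"` lets the service choose); the local sketch below only illustrates what a fixed `size` with `overlap` means:
+
+```javascript
+// Rough local illustration of fixed-size chunking with overlap.
+// The SDK's chunker runs server-side; this is conceptual only.
+function chunkText(text, size, overlap) {
+  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
+  const chunks = [];
+  for (let start = 0; start < text.length; start += size - overlap) {
+    chunks.push(text.slice(start, start + size));
+    if (start + size >= text.length) break; // final chunk reached
+  }
+  return chunks;
+}
+```
+
+Each chunk repeats the tail of the previous one, so extractions that straddle a chunk boundary are less likely to be cut off.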
```javascript -const response = await markdownify(apiKey, { - website_url: "https://example.com", -}); +{ + model: "gpt-4o-mini", // LLM model to use + temperature: 0.3, // Response creativity (0-1) + maxTokens: 1000, // Maximum response tokens + chunker: { // Content chunking strategy + size: "dynamic", // Chunk size (number or "dynamic") + overlap: 100, // Overlap between chunks + }, +} ``` -#### Parameters - -| Parameter | Type | Required | Description | -| ----------- | ------- | -------- | ---------------------------------------------- | -| apiKey | string | Yes | The ScrapeGraph API Key (first argument). | -| website_url | string | Yes | The URL of the webpage to convert to markdown. | -| wait_ms | number | No | Page load wait time in ms (default: 3000). | -| stealth | boolean | No | Enable anti-detection mode (+4 credits). | -| country_code| string | No | Proxy routing country code (e.g., "us"). | +## Error Handling -## API Credits - -Check your available API credits: +v2 throws errors instead of returning `ApiResult`. Use try/catch: ```javascript -import { getCredits } from "scrapegraph-js"; - -const credits = await getCredits(apiKey); - -if (credits.status === "error") { - console.error("Error fetching credits:", credits.error); -} else { - console.log("Remaining credits:", credits.data.remaining_credits); - console.log("Total used:", credits.data.total_credits_used); +try { + const { data, requestId } = await sgai.extract( + "https://example.com", + { prompt: "Extract the title" } + ); + console.log(data); +} catch (err) { + console.error(`Request failed: ${err.message}`); } ``` @@ -438,9 +352,3 @@ if (credits.status === "error") { Get help from our development team - - - This project is licensed under the MIT License. See the - [LICENSE](https://github.com/ScrapeGraphAI/scrapegraph-js/blob/main/LICENSE) - file for details. 
- diff --git a/sdks/mocking.mdx b/sdks/mocking.mdx index 592de50..ef82ad4 100644 --- a/sdks/mocking.mdx +++ b/sdks/mocking.mdx @@ -1,594 +1,265 @@ --- title: 'Mocking & Testing' -description: 'Test ScrapeGraphAI functionality in an isolated environment without consuming API credits' +description: 'Test ScrapeGraphAI v2 functionality without consuming API credits' icon: 'test-tube' --- -ScrapeGraph API Banner - - - Test your code without making real API calls + + Use familiar testing tools for mocking - - Override responses for specific endpoints + + Test without consuming API credits ## Overview -A mock environment is an isolated test environment. You can use mock mode to test ScrapeGraphAI functionality in your application, and experiment with new features without affecting your live integration or consuming API credits. For example, when testing in mock mode, the scraping requests you create aren't processed by our servers or counted against your credit usage. - -## Use cases - -Mock mode provides an environment for testing various functionalities and scenarios without the implications of real API calls. Below are some common use cases for mocking in your ScrapeGraphAI integrations: - -| Scenario | Description | -|----------|-------------| -| **Simulate scraping responses to test without real API calls** | Use mock mode to test scraping functionality without real API calls. Create mock responses in your application to test data processing logic or use custom handlers to simulate various response scenarios. | -| **Scale isolated testing for teams** | Your team can test in separate mock environments to make sure that data and actions are completely isolated from other tests. Changes made in one mock configuration don't interfere with changes in another. 
| -| **Test without API key requirements** | You can test your integration without providing real API keys, making it easier for external developers, implementation partners, or design agencies to work with your code without access to your live API credentials. | -| **Test in development or CI/CD pipelines** | Access mock mode from your development environment or continuous integration pipelines. Test ScrapeGraphAI functionality directly in your code or use familiar testing frameworks and fixtures. | +In v2, the built-in mock mode (`mock=True`, `mock_handler`, `mock_responses`) has been removed from the SDKs. Instead, use standard mocking libraries for your language to test ScrapeGraphAI integrations without making real API calls or consuming credits. -## Test in mock mode + +If you're migrating from v1, replace `Client(mock=True)` with standard mocking patterns shown below. + -You can simulate scraping responses and use mock data to test your integration without consuming API credits. Learn more about using mock responses to confirm that your integration works correctly. 
+## Python SDK Testing -## Basic Mock Usage - -Enable mock mode by setting `mock=True` when initializing the client: +### Using `unittest.mock` ```python +from unittest.mock import patch, MagicMock from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Set logging level for better visibility -sgai_logger.set_logging(level="INFO") - -def basic_mock_usage(): - # Initialize the client with mock mode enabled - client = Client.from_env(mock=True) - - print("\n-- get_credits (mock) --") - print(client.get_credits()) - - print("\n-- markdownify (mock) --") - md = client.markdownify(website_url="https://example.com") - print(md) - - print("\n-- get_markdownify (mock) --") - md_status = client.get_markdownify("00000000-0000-0000-0000-000000000123") - print(md_status) - - print("\n-- smartscraper (mock) --") - ss = client.smartscraper(user_prompt="Extract title", website_url="https://example.com") - print(ss) - -if __name__ == "__main__": - basic_mock_usage() -``` - - -When mock mode is enabled, all API calls return predefined mock responses instead of making real HTTP requests. This ensures your tests run quickly and don't consume API credits. 
- -## Custom Response Overrides +def test_extract(): + client = Client(api_key="test-key") -You can override specific endpoint responses using the `mock_responses` parameter: - -```python -def mock_with_path_overrides(): - # Initialize the client with mock mode and custom responses - client = Client.from_env( - mock=True, - mock_responses={ - "/v1/credits": {"remaining_credits": 42, "total_credits_used": 58, "mock": true} + mock_response = { + "data": { + "title": "Test Page", + "content": "This is test content" }, - ) - - print("\n-- get_credits with override (mock) --") - print(client.get_credits()) -``` + "request_id": "test-request-123" + } - -You can override responses for any endpoint by providing the path and expected response: + with patch.object(client, "extract", return_value=mock_response): + response = client.extract( + url="https://example.com", + prompt="Extract title and content" + ) -```python -client = Client.from_env( - mock=True, - mock_responses={ - "/v1/credits": { - "remaining_credits": 100, - "total_credits_used": 0, - "mock": true - }, - "/v1/smartscraper/start": { - "job_id": "mock-job-123", - "status": "processing", - "mock": true - }, - "/v1/smartscraper/status/mock-job-123": { - "job_id": "mock-job-123", - "status": "completed", - "result": { - "title": "Mock Title", - "content": "Mock content from the webpage", - "mock": true - } - }, - "/v1/markdownify/start": { - "job_id": "mock-markdown-456", - "status": "processing", - "mock": true - }, - "/v1/markdownify/status/mock-markdown-456": { - "job_id": "mock-markdown-456", - "status": "completed", - "result": "# Mock Markdown\n\nThis is mock markdown content.", - "mock": true - } - } -) + assert response["data"]["title"] == "Test Page" + assert response["request_id"] == "test-request-123" ``` - -## Custom Handler Functions +### Using `responses` Library -For more complex mocking scenarios, you can provide a custom handler function: +Mock HTTP requests at the transport layer: ```python -def 
mock_with_custom_handler(): - def handler(method, url, kwargs): - return {"handled_by": "custom_handler", "method": method, "url": url} - - # Initialize the client with mock mode and custom handler - client = Client.from_env(mock=True, mock_handler=handler) +import responses +from scrapegraph_py import Client - print("\n-- searchscraper via custom handler (mock) --") - resp = client.searchscraper(user_prompt="Search something") - print(resp) -``` +@responses.activate +def test_extract_http(): + responses.post( + "https://api.scrapegraphai.com/api/v2/extract", + json={ + "data": {"title": "Mock Title"}, + "request_id": "mock-123" + }, + status=200, + ) - -Create sophisticated mock responses based on request parameters: + client = Client(api_key="test-key") + response = client.extract( + url="https://example.com", + prompt="Extract the title" + ) -```python -def advanced_custom_handler(): - def smart_handler(method, url, kwargs): - # Handle different endpoints with custom logic - if "/v1/credits" in url: - return { - "remaining_credits": 50, - "total_credits_used": 50, - "mock": true - } - elif "/v1/smartscraper" in url: - # Extract user_prompt from kwargs to create contextual responses - user_prompt = kwargs.get("user_prompt", "") - if "title" in user_prompt.lower(): - return { - "job_id": "mock-title-job", - "status": "completed", - "result": { - "title": "Extracted Title", - "content": "This is the extracted content", - "mock": true - } - } - else: - return { - "job_id": "mock-generic-job", - "status": "completed", - "result": { - "data": "Generic extracted data", - "mock": true - } - } - else: - return {"error": "Unknown endpoint", "url": url} - - client = Client.from_env(mock=True, mock_handler=smart_handler) - - # Test different scenarios - print("Credits:", client.get_credits()) - print("Title extraction:", client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the title" - )) - print("Generic extraction:", client.smartscraper( - 
website_url="https://example.com", - user_prompt="Extract some data" - )) + assert response["data"]["title"] == "Mock Title" ``` - -## Testing Best Practices - -### Unit Testing with Mocks +### Using `pytest` Fixtures ```python -import unittest -from unittest.mock import patch +import pytest +from unittest.mock import MagicMock from scrapegraph_py import Client -class TestScrapeGraphAI(unittest.TestCase): - def setUp(self): - self.client = Client.from_env(mock=True) - - def test_get_credits(self): - credits = self.client.get_credits() - self.assertIn("remaining_credits", credits) - self.assertIn("total_credits_used", credits) - - def test_smartscraper_with_schema(self): - from pydantic import BaseModel, Field - - class TestSchema(BaseModel): - title: str = Field(description="Page title") - content: str = Field(description="Page content") - - response = self.client.smartscraper( - website_url="https://example.com", - user_prompt="Extract title and content", - output_schema=TestSchema - ) - - self.assertIsInstance(response, TestSchema) - self.assertIsNotNone(response.title) - self.assertIsNotNone(response.content) - -if __name__ == "__main__": - unittest.main() -``` - -### Integration Testing - -```python -def test_integration_flow(): - """Test a complete workflow using mocks""" - client = Client.from_env( - mock=True, - mock_responses={ - "/v1/credits": {"remaining_credits": 10, "total_credits_used": 90, "mock": true}, - "/v1/smartscraper/start": { - "job_id": "test-job-123", - "status": "processing", - "mock": true - }, - "/v1/smartscraper/status/test-job-123": { - "job_id": "test-job-123", - "status": "completed", - "result": { - "title": "Test Page", - "content": "Test content", - "mock": true - } - } - } - ) - - # Test the complete flow - credits = client.get_credits() - assert credits["remaining_credits"] == 10 - - # Start a scraping job - job = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract title and content" +@pytest.fixture 
+def mock_client(): + client = Client(api_key="test-key") + client.extract = MagicMock(return_value={ + "data": {"title": "Mock Title"}, + "request_id": "mock-123" + }) + client.search = MagicMock(return_value={ + "data": {"results": []}, + "request_id": "mock-456" + }) + client.credits = MagicMock(return_value={ + "remaining_credits": 100, + "total_credits_used": 0 + }) + return client + +def test_extract(mock_client): + response = mock_client.extract( + url="https://example.com", + prompt="Extract the title" ) - - # Check job status - status = client.get_smartscraper("test-job-123") - assert status["status"] == "completed" - assert "title" in status["result"] -``` - -## Environment Variables - -You can also control mocking through environment variables: - -```bash -# Enable mock mode via environment variable -export SGAI_MOCK=true + assert response["data"]["title"] == "Mock Title" -# Set custom mock responses (JSON format) -export SGAI_MOCK_RESPONSES='{"\/v1\/credits": {"remaining_credits": 100, "mock": true}}' +def test_credits(mock_client): + credits = mock_client.credits() + assert credits["remaining_credits"] == 100 ``` -```python -# The client will automatically detect mock mode from environment -client = Client.from_env() # Will use mock mode if SGAI_MOCK=true -``` - -## Async Mocking - -Mocking works seamlessly with async clients: +### Async Testing with `aioresponses` ```python +import pytest import asyncio +from aioresponses import aioresponses from scrapegraph_py import AsyncClient -async def async_mock_example(): - async with AsyncClient(mock=True) as client: - # All async methods work with mocks - credits = await client.get_credits() - print(f"Mock credits: {credits}") - - response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract data" +@pytest.mark.asyncio +async def test_async_extract(): + with aioresponses() as mocked: + mocked.post( + "https://api.scrapegraphai.com/api/v2/extract", + payload={ + "data": 
{"title": "Async Mock"}, + "request_id": "async-123" + }, ) - print(f"Mock response: {response}") - -# Run the async example -asyncio.run(async_mock_example()) -``` - -## HTTP Method Mocking with cURL - -You can also test ScrapeGraphAI endpoints directly using cURL with mock responses. This is useful for testing API integrations without using SDKs. - -### Basic cURL Mock Usage - -```bash -# Enable mock mode via environment variable -export SGAI_MOCK=true - -# Test credits endpoint with mock -curl -X GET "https://api.scrapegraph.ai/v1/credits" \ - -H "Authorization: Bearer $SGAI_API_KEY" \ - -H "Content-Type: application/json" -``` -### Custom Mock Responses with cURL + async with AsyncClient(api_key="test-key") as client: + response = await client.extract( + url="https://example.com", + prompt="Extract data" + ) -```bash -# Set custom mock responses via environment variable -export SGAI_MOCK_RESPONSES='{ - "/v1/credits": { - "remaining_credits": 100, - "total_credits_used": 0, - "mock": true - }, -}' - -# Test smartscraper endpoint -curl -X POST "https://api.scrapegraph.ai/v1/smartscraper/" \ - -H "Authorization: Bearer $SGAI_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "website_url": "https://example.com", - "user_prompt": "Extract title and content" - "mock": true - }' + assert response["data"]["title"] == "Async Mock" ``` -### Testing Different HTTP Methods +## JavaScript SDK Testing -```bash -# POST request - to smartscraper -curl --location 'https://api.scrapegraphai.com/v1/smartscraper' \ ---data '{ - "website_url": "https://www.scrapegraphai.com//", - "user_prompt": "Extract founder info ", - "mock":true -}' -``` +### Using Jest / Vitest -```bash -# POST request - to Markdownify -curl --location 'https://api.scrapegraphai.com/v1/markdownify' \ ---data '{ - "website_url": "https://www.scrapegraphai.com//", - "mock":true -}' -``` - -```bash -# POST request - to SearchScraper -curl --location 'https://api.scrapegraphai.com/v1/searchscraper' \ 
---data '{ - "website_url": "https://www.scrapegraphai.com//", - "mock":true - "output_schema":{}, - "num_results":3, -}' +```javascript +import { describe, it, expect, vi } from "vitest"; +import { scrapegraphai } from "scrapegraph-js"; + +// Mock the module +vi.mock("scrapegraph-js", () => ({ + scrapegraphai: vi.fn(() => ({ + extract: vi.fn().mockResolvedValue({ + data: { title: "Mock Title" }, + requestId: "mock-123", + }), + search: vi.fn().mockResolvedValue({ + data: { results: [] }, + requestId: "mock-456", + }), + credits: vi.fn().mockResolvedValue({ + data: { remainingCredits: 100 }, + }), + })), +})); + +describe("ScrapeGraphAI", () => { + const sgai = scrapegraphai({ apiKey: "test-key" }); + + it("should extract data", async () => { + const { data } = await sgai.extract("https://example.com", { + prompt: "Extract the title", + }); + expect(data.title).toBe("Mock Title"); + }); + + it("should check credits", async () => { + const { data } = await sgai.credits(); + expect(data.remainingCredits).toBe(100); + }); +}); ``` +### Using MSW (Mock Service Worker) -## JavaScript SDK Mocking - -The JavaScript SDK supports per-request mocking via the `mock` parameter. Pass `mock: true` in the params object of any function to receive mock data instead of making a real API call. 
- -### Per-Request Mock Mode +Mock at the network level for more realistic testing: ```javascript -import { smartScraper, scrape, searchScraper, getCredits } from 'scrapegraph-js'; - -const API_KEY = 'your-api-key'; - -// SmartScraper with mock -const smartResult = await smartScraper(API_KEY, { - website_url: 'https://example.com', - user_prompt: 'Extract the title', - mock: true, -}); -console.log('SmartScraper mock:', smartResult.data); - -// Scrape with mock -const scrapeResult = await scrape(API_KEY, { - website_url: 'https://example.com', - mock: true, -}); -console.log('Scrape mock:', scrapeResult.data); - -// SearchScraper with mock -const searchResult = await searchScraper(API_KEY, { - user_prompt: 'Find AI news', - mock: true, +import { http, HttpResponse } from "msw"; +import { setupServer } from "msw/node"; +import { scrapegraphai } from "scrapegraph-js"; + +const server = setupServer( + http.post("https://api.scrapegraphai.com/api/v2/extract", () => { + return HttpResponse.json({ + data: { title: "MSW Mock Title" }, + requestId: "msw-123", + }); + }), + http.get("https://api.scrapegraphai.com/api/v2/credits", () => { + return HttpResponse.json({ + data: { remainingCredits: 50, totalCreditsUsed: 50 }, + }); + }) +); + +beforeAll(() => server.listen()); +afterAll(() => server.close()); +afterEach(() => server.resetHandlers()); + +test("extract returns mocked data", async () => { + const sgai = scrapegraphai({ apiKey: "test-key" }); + const { data } = await sgai.extract("https://example.com", { + prompt: "Extract the title", + }); + expect(data.title).toBe("MSW Mock Title"); }); -console.log('SearchScraper mock:', searchResult.data); ``` - -The JavaScript SDK does not have global mock functions like `enableMock()` or `setMockResponses()`. Mock mode is controlled per-request via the `mock: true` parameter. All functions return `ApiResult` — errors are never thrown. 
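Framework mocks and MSW cover most cases, but a hand-rolled spy is sometimes enough to assert the arguments your code passes to the SDK. The sketch below uses a stand-in client object rather than the real `scrapegraph-js` export, so it runs under plain Node with no test framework; `makeSpy` and `fetchTitle` are illustrative names, not SDK APIs:

```javascript
// Stand-in spy: records every call and resolves with a canned response.
function makeSpy(result) {
  const calls = [];
  const fn = async (...args) => {
    calls.push(args);
    return result;
  };
  fn.calls = calls;
  return fn;
}

// Hypothetical client shape mirroring the examples above (not the real SDK).
const sgai = {
  extract: makeSpy({ data: { title: "Spy Title" }, requestId: "spy-1" }),
};

// Code under test.
async function fetchTitle(client, url) {
  const { data } = await client.extract(url, { prompt: "Extract the title" });
  return data.title;
}

(async () => {
  const title = await fetchTitle(sgai, "https://example.com");
  if (title !== "Spy Title") throw new Error("unexpected title: " + title);

  // Assert the request shape the code actually sent.
  const [url, opts] = sgai.extract.calls[0];
  if (url !== "https://example.com") throw new Error("wrong url");
  if (opts.prompt !== "Extract the title") throw new Error("wrong prompt");

  console.log("spy assertions passed");
})();
```

The same pattern drops straight into Jest or Vitest — replace the manual throws with `expect` assertions.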
- - -## SDK Comparison - - - - - `Client(mock=True)` initialization - - `mock_responses` parameter for overrides - - `mock_handler` for custom logic - - Environment variable: `SGAI_MOCK=true` - - - - `mock: true` in per-request params - - All functions support mock parameter - - Native async/await - - - - Environment variable: `SGAI_MOCK=true` - - `SGAI_MOCK_RESPONSES` for custom responses - - Direct HTTP method testing - - No SDK dependencies required - - - -### Feature Comparison - -| Feature | Python SDK | JavaScript SDK | cURL/HTTP | -|---------|------------|----------------|-----------| -| **Global Mock Mode** | `Client(mock=True)` | N/A | `SGAI_MOCK=true` | -| **Per-Request Mock** | `{mock: True}` in params | `mock: true` in params | N/A | -| **Custom Responses** | `mock_responses` dict | N/A | `SGAI_MOCK_RESPONSES` | -| **Custom Handler** | `mock_handler` function | N/A | N/A | -| **Environment Variable** | `SGAI_MOCK=true` | N/A | `SGAI_MOCK=true` | -| **Async Support** | `AsyncClient(mock=True)` | Native async/await | N/A | -| **Dependencies** | Python SDK required | JavaScript SDK required | None | - -## Limitations - -* You can't test real-time scraping performance in mock mode. -* Mock responses don't reflect actual website changes or dynamic content. -* Rate limiting and credit consumption are not simulated in mock mode. -* Some advanced features may behave differently in mock mode compared to live mode. 
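Error paths deserve the same coverage as success paths. The stdlib-only sketch below simulates an API failure with `unittest.mock` side effects; `APIError` and the bare `MagicMock` client are stand-ins so the snippet runs without the SDK installed — with the real SDK you would mock `Client.extract` the same way:

```python
from unittest.mock import MagicMock

class APIError(Exception):
    """Stand-in for the SDK's error type (illustrative only)."""

# Stand-in client: with the real SDK you would mock Client.extract instead.
client = MagicMock()
client.extract.side_effect = APIError("insufficient credits")

def fetch_title(client, url):
    """Code under test: returns None instead of crashing on API errors."""
    try:
        response = client.extract(url=url, prompt="Extract the title")
        return response["data"]["title"]
    except APIError:
        return None

# The mocked call raises, and the code under test degrades gracefully.
assert fetch_title(client, "https://example.com") is None
client.extract.assert_called_once_with(
    url="https://example.com", prompt="Extract the title"
)
print("error-path assertions passed")
```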
- -## Troubleshooting - - +## Testing with cURL -### Mock responses not working -- Ensure `mock=True` is set when initializing the client -- Check that your mock response paths match the actual API endpoints -- Verify the response format matches the expected schema +Test API endpoints directly using cURL against a local mock server or staging environment: -### Custom handler not being called -- Make sure you're passing the `mock_handler` parameter correctly -- Check that your handler function accepts the correct parameters: `(method, url, kwargs)` -- Ensure the handler returns a valid response object +```bash +# Test extract endpoint +curl -X POST "https://api.scrapegraphai.com/api/v2/extract" \ + -H "Authorization: Bearer your-api-key" \ + -H "Content-Type: application/json" \ + -d '{ + "url": "https://example.com", + "prompt": "Extract the title" + }' -### Schema validation errors -- Mock responses must match the expected Pydantic schema structure -- Use the same field names and types as defined in your schema -- Test your mock responses with the actual schema classes +# Test credits endpoint +curl -X GET "https://api.scrapegraphai.com/api/v2/credits" \ + -H "Authorization: Bearer your-api-key" +``` - +## SDK Comparison -## Examples +| Feature | Python | JavaScript | +|---------|--------|------------| +| **Mock library** | `unittest.mock`, `responses` | Jest/Vitest mocks, MSW | +| **HTTP-level mocking** | `responses`, `aioresponses` | MSW (Mock Service Worker) | +| **Async mocking** | `aioresponses`, `unittest.mock` | Native async/await | +| **Fixture support** | pytest fixtures | beforeEach/afterEach | - -Here's a complete example showing all mocking features: +## Best Practices -```python -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger -from pydantic import BaseModel, Field -from typing import List - -# Set up logging -sgai_logger.set_logging(level="INFO") - -class ProductInfo(BaseModel): - name: str = 
Field(description="Product name") - price: str = Field(description="Product price") - features: List[str] = Field(description="Product features") - -def complete_mock_demo(): - # Initialize with comprehensive mock responses - client = Client.from_env( - mock=True, - mock_responses={ - "/v1/credits": { - "remaining_credits": 25, - "total_credits_used": 75, - "mock": true - }, - "/v1/smartscraper/start": { - "job_id": "demo-job-789", - "status": "processing", - "mock": true - }, - "/v1/smartscraper/status/demo-job-789": { - "job_id": "demo-job-789", - "status": "completed", - "result": { - "name": "iPhone 15 Pro", - "price": "$999", - "features": [ - "A17 Pro chip", - "48MP camera system", - "Titanium design", - "Action Button" - ], - "mock": true - } - } - } - ) - - print("=== ScrapeGraphAI Mock Demo ===\n") - - # Test credits endpoint - print("1. Checking credits:") - credits = client.get_credits() - print(f" Remaining: {credits['remaining_credits']}") - print(f" Used: {credits['total_credits_used']}\n") - - # Test smartscraper with schema - print("2. Extracting product information:") - product = client.smartscraper( - website_url="https://apple.com/iphone-15-pro", - user_prompt="Extract product name, price, and key features", - output_schema=ProductInfo - ) - - print(f" Product: {product.name}") - print(f" Price: {product.price}") - print(" Features:") - for feature in product.features: - print(f" - {feature}") - - print("\n3. 
Testing markdownify:") - markdown = client.markdownify(website_url="https://example.com") - print(f" Markdown length: {len(markdown)} characters") - - print("\n=== Demo Complete ===") - -if __name__ == "__main__": - complete_mock_demo() -``` - +- Mock at the **client method level** for unit tests (fastest, simplest) +- Mock at the **HTTP level** for integration tests (validates request/response shapes) +- Use **fixtures** to share mock configurations across tests +- Keep mock responses **realistic** - match the actual API response structure +- Test both **success and error** scenarios ## Support - + Report bugs or request features @@ -596,4 +267,4 @@ if __name__ == "__main__": -Need help with mocking? Check out our [Python SDK documentation](/sdks/python) or join our [Discord community](https://discord.gg/uJN7TYcpNa) for support. +Need help with testing? Join our [Discord community](https://discord.gg/uJN7TYcpNa) for support. diff --git a/sdks/python.mdx b/sdks/python.mdx index 43da3f2..a780001 100644 --- a/sdks/python.mdx +++ b/sdks/python.mdx @@ -1,15 +1,9 @@ --- title: 'Python SDK' -description: 'Official Python SDK for ScrapeGraphAI' +description: 'Official Python SDK for ScrapeGraphAI v2' icon: 'python' --- -ScrapeGraph API Banner - [![PyPI version](https://badge.fury.io/py/scrapegraph-py.svg)](https://badge.fury.io/py/scrapegraph-py) @@ -21,23 +15,23 @@ icon: 'python' ## Installation -Install the package using pip: - ```bash pip install scrapegraph-py ``` -## Features +## What's New in v2 -- **AI-Powered Extraction**: Advanced web scraping using artificial intelligence -- **Flexible Clients**: Both synchronous and asynchronous support -- **Type Safety**: Structured output with Pydantic schemas -- **Production Ready**: Detailed logging and automatic retries -- **Developer Friendly**: Comprehensive error handling +- **Renamed methods**: `smartscraper()` → `extract()`, `searchscraper()` → `search()` +- **Unified config objects**: `FetchConfig` and `LlmConfig` 
replace scattered parameters +- **Namespace methods**: `crawl.start()`, `crawl.status()`, `monitor.create()`, etc. +- **New endpoints**: `credits()`, `history()`, `crawl.stop()`, `crawl.resume()` +- **Removed**: `markdownify()`, `agenticscraper()`, `sitemap()`, `healthz()`, `feedback()`, built-in mock mode -## Quick Start + +v2 is a breaking release. If you're upgrading from v1, see the [Migration Guide](https://github.com/ScrapeGraphAI/scrapegraph-py/blob/main/MIGRATION_V2.md). + -Initialize the client with your API key: +## Quick Start ```python from scrapegraph_py import Client @@ -49,30 +43,54 @@ client = Client(api_key="your-api-key-here") You can also set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()` +### Client Options + +| Parameter | Type | Default | Description | +| ------------- | ------ | -------------------------------- | ------------------------------- | +| api_key | string | `SGAI_API_KEY` env var | Your ScrapeGraphAI API key | +| base_url | string | `https://api.scrapegraphai.com` | API base URL | +| verify_ssl | bool | `True` | Verify SSL certificates | +| timeout | int | `30` | Request timeout in seconds | +| max_retries | int | `3` | Maximum number of retries | +| retry_delay | float | `1.0` | Delay between retries (seconds) | + +You can also use the `Client.from_env()` class method to create a client from the `SGAI_API_KEY` environment variable: + +```python +client = Client.from_env() +``` + +Both `Client` and `AsyncClient` support context managers for automatic session cleanup: + +```python +with Client(api_key="your-api-key") as client: + response = client.extract(url="https://example.com", prompt="Extract data") +``` + ## Services -### SmartScraper +### Extract -Extract specific information from any webpage using AI: +Extract structured data from any webpage using AI. Replaces the v1 `smartscraper()` method. 
```python -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main heading and description" +response = client.extract( + url="https://example.com", + prompt="Extract the main heading and description" ) +print(response) ``` #### Parameters -| Parameter | Type | Required | Description | -| ---------------- | ------- | -------- | ---------------------------------------------------------------------------------- | -| website_url | string | Yes | The URL of the webpage that needs to be scraped. | -| user_prompt | string | Yes | A textual description of what you want to achieve. | -| output_schema | object | No | The Pydantic object that describes the structure and format of the response. | - - -Define a simple schema for basic data extraction: +| Parameter | Type | Required | Description | +| ------------ | ----------- | -------- | -------------------------------------------------------- | +| url | string | Yes | The URL of the webpage to scrape | +| prompt | string | Yes | A description of what you want to extract | +| output_schema| object | No | Pydantic model for structured response | +| fetch_config | FetchConfig | No | Fetch configuration (stealth, rendering, etc.) 
| + ```python from pydantic import BaseModel, Field @@ -81,93 +99,38 @@ class ArticleData(BaseModel): author: str = Field(description="The author's name") publish_date: str = Field(description="Article publication date") content: str = Field(description="Main article content") - category: str = Field(description="Article category") -response = client.smartscraper( - website_url="https://example.com/blog/article", - user_prompt="Extract the article information", +response = client.extract( + url="https://example.com/blog/article", + prompt="Extract the article information", output_schema=ArticleData ) -print(f"Title: {response.title}") -print(f"Author: {response.author}") -print(f"Published: {response.publish_date}") +print(f"Title: {response['data']['title']}") +print(f"Author: {response['data']['author']}") ``` - -Define a complex schema for nested data structures: +### Search -```python -from typing import List -from pydantic import BaseModel, Field - -class Employee(BaseModel): - name: str = Field(description="Employee's full name") - position: str = Field(description="Job title") - department: str = Field(description="Department name") - email: str = Field(description="Email address") - -class Office(BaseModel): - location: str = Field(description="Office location/city") - address: str = Field(description="Full address") - phone: str = Field(description="Contact number") - -class CompanyData(BaseModel): - name: str = Field(description="Company name") - description: str = Field(description="Company description") - industry: str = Field(description="Industry sector") - founded_year: int = Field(description="Year company was founded") - employees: List[Employee] = Field(description="List of key employees") - offices: List[Office] = Field(description="Company office locations") - website: str = Field(description="Company website URL") - -# Extract comprehensive company information -response = client.smartscraper( - website_url="https://example.com/about", - 
user_prompt="Extract detailed company information including employees and offices", - output_schema=CompanyData -) - -# Access nested data -print(f"Company: {response.name}") -print("\nKey Employees:") -for employee in response.employees: - print(f"- {employee.name} ({employee.position})") - -print("\nOffice Locations:") -for office in response.offices: - print(f"- {office.location}: {office.address}") -``` - - -### SearchScraper - -Search and extract information from multiple web sources using AI: +Search the web and extract information from multiple sources. Replaces the v1 `searchscraper()` method. ```python -from scrapegraph_py.models import TimeRange - -response = client.searchscraper( - user_prompt="What are the key features and pricing of ChatGPT Plus?", - time_range=TimeRange.PAST_WEEK # Optional: Filter results by time range +response = client.search( + query="What are the key features and pricing of ChatGPT Plus?" ) ``` #### Parameters -| Parameter | Type | Required | Description | -| ---------------- | ------- | -------- | ---------------------------------------------------------------------------------- | -| user_prompt | string | Yes | A textual description of what you want to achieve. | -| num_results | number | No | Number of websites to search (3-20). Default: 3. | -| extraction_mode | boolean | No | **True** = AI extraction mode (10 credits/page), **False** = markdown mode (2 credits/page). Default: True | -| output_schema | object | No | The Pydantic object that describes the structure and format of the response (AI extraction mode only) | -| location_geo_code| string | No | Optional geo code for location-based search (e.g., "us") | -| time_range | TimeRange| No | Optional time range filter for search results. 
Options: TimeRange.PAST_HOUR, TimeRange.PAST_24_HOURS, TimeRange.PAST_WEEK, TimeRange.PAST_MONTH, TimeRange.PAST_YEAR | - - -Define a simple schema for structured search results: +| Parameter | Type | Required | Description | +| ------------- | ----------- | -------- | -------------------------------------------------------- | +| query | string | Yes | The search query | +| num_results | number | No | Number of results (3-20). Default: 5 | +| output_schema | object | No | Pydantic model for structured response | +| fetch_config | FetchConfig | No | Fetch configuration | + ```python from pydantic import BaseModel, Field from typing import List @@ -177,174 +140,154 @@ class ProductInfo(BaseModel): description: str = Field(description="Product description") price: str = Field(description="Product price") features: List[str] = Field(description="List of key features") - availability: str = Field(description="Availability information") - -from scrapegraph_py.models import TimeRange -response = client.searchscraper( - user_prompt="Find information about iPhone 15 Pro", +response = client.search( + query="Find information about iPhone 15 Pro", output_schema=ProductInfo, - location_geo_code="us", # Optional: Geo code for location-based search - time_range=TimeRange.PAST_MONTH # Optional: Filter results by time range + num_results=5, ) -print(f"Product: {response.name}") -print(f"Price: {response.price}") -print("\nFeatures:") -for feature in response.features: - print(f"- {feature}") +print(f"Product: {response['data']['name']}") +print(f"Price: {response['data']['price']}") ``` - -Define a complex schema for comprehensive market research: +### Scrape -```python -from typing import List -from pydantic import BaseModel, Field +Convert any webpage into markdown, HTML, screenshot, or branding format. 
-class MarketPlayer(BaseModel): - name: str = Field(description="Company name") - market_share: str = Field(description="Market share percentage") - key_products: List[str] = Field(description="Main products in market") - strengths: List[str] = Field(description="Company's market strengths") - -class MarketTrend(BaseModel): - name: str = Field(description="Trend name") - description: str = Field(description="Trend description") - impact: str = Field(description="Expected market impact") - timeframe: str = Field(description="Trend timeframe") - -class MarketAnalysis(BaseModel): - market_size: str = Field(description="Total market size") - growth_rate: str = Field(description="Annual growth rate") - key_players: List[MarketPlayer] = Field(description="Major market players") - trends: List[MarketTrend] = Field(description="Market trends") - challenges: List[str] = Field(description="Industry challenges") - opportunities: List[str] = Field(description="Market opportunities") - -from scrapegraph_py.models import TimeRange - -# Perform comprehensive market research -response = client.searchscraper( - user_prompt="Analyze the current AI chip market landscape", - output_schema=MarketAnalysis, - location_geo_code="us", # Optional: Geo code for location-based search - time_range=TimeRange.PAST_MONTH # Optional: Filter results by time range +```python +response = client.scrape( + url="https://example.com" ) - -# Access structured market data -print(f"Market Size: {response.market_size}") -print(f"Growth Rate: {response.growth_rate}") - -print("\nKey Players:") -for player in response.key_players: - print(f"\n{player.name}") - print(f"Market Share: {player.market_share}") - print("Key Products:") - for product in player.key_products: - print(f"- {product}") - -print("\nMarket Trends:") -for trend in response.trends: - print(f"\n{trend.name}") - print(f"Impact: {trend.impact}") - print(f"Timeframe: {trend.timeframe}") ``` - - -Use markdown mode for cost-effective content 
gathering: +#### Parameters -```python -from scrapegraph_py import Client +| Parameter | Type | Required | Description | +| ------------- | ----------- | -------- | -------------------------------------------------------- | +| url | string | Yes | The URL of the webpage to scrape | +| format | string | No | Output format: `"markdown"`, `"html"`, `"screenshot"`, `"branding"` | +| fetch_config | FetchConfig | No | Fetch configuration | -client = Client(api_key="your-api-key") +### Crawl -from scrapegraph_py.models import TimeRange +Manage multi-page crawl operations asynchronously. -# Enable markdown mode for cost-effective content gathering -response = client.searchscraper( - user_prompt="Latest developments in artificial intelligence", - num_results=3, - extraction_mode=False, # Enable markdown mode (2 credits per page vs 10 credits) - location_geo_code="us", # Optional: Geo code for location-based search - time_range=TimeRange.PAST_WEEK # Optional: Filter results by time range +```python +# Start a crawl +job = client.crawl.start( + url="https://example.com", + depth=2, + include_patterns=["/blog/*", "/docs/**"], + exclude_patterns=["/admin/*", "/api/*"], ) +print(f"Crawl started: {job['id']}") -# Access the raw markdown content -markdown_content = response['markdown_content'] -reference_urls = response['reference_urls'] +# Check status +status = client.crawl.status(job["id"]) +print(f"Status: {status['status']}") -print(f"Markdown content length: {len(markdown_content)} characters") -print(f"Reference URLs: {len(reference_urls)}") +# Stop a crawl +client.crawl.stop(job["id"]) -# Process the markdown content -print("Content preview:", markdown_content[:500] + "...") +# Resume a crawl +client.crawl.resume(job["id"]) +``` -# Save to file for analysis -with open('ai_research_content.md', 'w', encoding='utf-8') as f: - f.write(markdown_content) +#### crawl.start() Parameters -print("Content saved to ai_research_content.md") -``` +| Parameter | Type | Required | 
Description | +| ---------------- | ----------- | -------- | -------------------------------------------------------- | +| url | string | Yes | The starting URL to crawl | +| depth | int | No | Crawl depth level | +| include_patterns | list[str] | No | URL patterns to include (`*` any chars, `**` any path) | +| exclude_patterns | list[str] | No | URL patterns to exclude | +| fetch_config | FetchConfig | No | Fetch configuration | -**Markdown Mode Benefits:** -- **Cost-effective**: Only 2 credits per page (vs 10 credits for AI extraction) -- **Full content**: Get complete page content in markdown format -- **Faster**: No AI processing overhead -- **Perfect for**: Content analysis, bulk data collection, building datasets +### Monitor - +Create and manage site monitoring jobs. + +```python +# Create a monitor +monitor = client.monitor.create( + url="https://example.com", + prompt="Track price changes", + schedule="daily", +) + +# List all monitors +monitors = client.monitor.list() - -Filter search results by date range to get only recent information: +# Get a specific monitor +details = client.monitor.get(monitor["id"]) + +# Pause / Resume / Delete +client.monitor.pause(monitor["id"]) +client.monitor.resume(monitor["id"]) +client.monitor.delete(monitor["id"]) +``` + +### Credits + +Check your account credit balance. ```python -from scrapegraph_py import Client -from scrapegraph_py.models import TimeRange +credits = client.credits() +print(f"Remaining: {credits['remaining_credits']}") +print(f"Used: {credits['total_credits_used']}") +``` -client = Client(api_key="your-api-key") +### History -# Search for recent news from the past week -response = client.searchscraper( - user_prompt="Latest news about AI developments", - num_results=5, - time_range=TimeRange.PAST_WEEK # Options: PAST_HOUR, PAST_24_HOURS, PAST_WEEK, PAST_MONTH, PAST_YEAR -) +Retrieve paginated request history with optional service filtering. 
-print("Recent AI news:", response['result']) -print("Reference URLs:", response['reference_urls']) +```python +history = client.history(endpoint="extract", status="completed", limit=20, offset=0) +for entry in history["items"]: + print(f"{entry['created_at']} - {entry['endpoint']} - {entry['status']}") ``` -**Time Range Options:** -- `TimeRange.PAST_HOUR` - Results from the past hour -- `TimeRange.PAST_24_HOURS` - Results from the past 24 hours -- `TimeRange.PAST_WEEK` - Results from the past week -- `TimeRange.PAST_MONTH` - Results from the past month -- `TimeRange.PAST_YEAR` - Results from the past year +## Configuration Objects -**Use Cases:** -- Finding recent news and updates -- Tracking time-sensitive information -- Getting latest product releases -- Monitoring recent market changes +### FetchConfig - +Controls how pages are fetched. See the [proxy configuration guide](/services/additional-parameters/proxy) for details on modes and geotargeting. -### Markdownify +```python +from scrapegraph_py import FetchConfig + +config = FetchConfig( + mode="js+stealth", # Proxy strategy: auto, fast, js, direct+stealth, js+stealth + timeout=15000, # Request timeout in ms (1000-60000) + wait=2000, # Wait after page load in ms (0-30000) + scrolls=3, # Number of scrolls (0-100) + country="us", # Proxy country code (ISO 3166-1 alpha-2) + headers={"X-Custom": "header"}, + cookies={"key": "value"}, + mock=False, # Enable mock mode for testing +) +``` -Convert any webpage into clean, formatted markdown: +### LlmConfig + +Controls LLM behavior for AI-powered methods. 
```python -response = client.markdownify( - website_url="https://example.com" +from scrapegraph_py import LlmConfig + +config = LlmConfig( + model="gpt-4o-mini", # LLM model to use + temperature=0.3, # Response creativity (0.0-2.0) + max_tokens=1000, # Maximum response tokens + chunker="auto", # Content chunking strategy ("auto" or custom config) ) ``` ## Async Support -All endpoints support asynchronous operations: +All methods are available on the async client: ```python import asyncio @@ -352,38 +295,32 @@ from scrapegraph_py import AsyncClient async def main(): async with AsyncClient() as client: - response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main content" + # Extract + response = await client.extract( + url="https://example.com", + prompt="Extract the main content" ) print(response) -asyncio.run(main()) -``` + # Crawl + job = await client.crawl.start("https://example.com", depth=2) + status = await client.crawl.status(job["id"]) + print(status) -## Feedback + # Credits + credits = await client.credits() + print(credits) -Help us improve by submitting feedback programmatically: - -```python -client.submit_feedback( - request_id="your-request-id", - rating=5, - feedback_text="Great results!" -) +asyncio.run(main()) ``` ## Support - + Report issues and contribute to the SDK Get help from our development team - - - This project is licensed under the MIT License. See the [LICENSE](https://github.com/ScrapeGraphAI/scrapegraph-sdk/blob/main/LICENSE) file for details. 
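The async client is most useful when fanning out over many URLs at once. The sketch below shows the `asyncio.gather` pattern with a stand-in client so it runs without the SDK installed; with the real `AsyncClient` you would call `client.extract` inside the same gather:

```python
import asyncio

class FakeAsyncClient:
    """Stand-in with the same call shape as AsyncClient.extract (illustrative)."""
    async def extract(self, url, prompt):
        await asyncio.sleep(0.01)  # simulate network latency
        return {"data": {"title": f"Title of {url}"}, "request_id": url}

async def main():
    client = FakeAsyncClient()
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    # Fan out all requests concurrently and collect results in order.
    results = await asyncio.gather(
        *(client.extract(url, prompt="Extract the title") for url in urls)
    )
    return [r["data"]["title"] for r in results]

titles = asyncio.run(main())
print(titles[0])  # Title of https://example.com/page/0
```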
- diff --git a/services/additional-parameters/headers.mdx b/services/additional-parameters/headers.mdx index 53446b5..0046076 100644 --- a/services/additional-parameters/headers.mdx +++ b/services/additional-parameters/headers.mdx @@ -77,9 +77,9 @@ response = client.markdownify( ``` ```javascript JavaScript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); // Define custom headers const headers = { @@ -88,11 +88,10 @@ const headers = { 'Sec-Ch-Ua-Platform': '"Windows"', }; -// Use with SmartScraper -const response = await smartScraper(apiKey, { - website_url: 'https://example.com', - user_prompt: 'Extract the main content', - headers: headers, +// Use with extract (SmartScraper) +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract the main content', + fetchConfig: { headers }, }); ``` @@ -139,9 +138,9 @@ response = client.smartscraper( ``` ```javascript JavaScript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); // Example with session cookies const headers = { @@ -149,10 +148,9 @@ const headers = { 'Cookie': 'session_id=abc123; user_id=12345; theme=dark', }; -const response = await smartScraper(apiKey, { - website_url: 'https://example.com/dashboard', - user_prompt: 'Extract user information', - headers: headers, +const { data } = await sgai.extract('https://example.com/dashboard', { + prompt: 'Extract user information', + fetchConfig: { headers }, }); ``` diff --git a/services/additional-parameters/pagination.mdx b/services/additional-parameters/pagination.mdx index 207833f..b4e312b 100644 --- a/services/additional-parameters/pagination.mdx +++ b/services/additional-parameters/pagination.mdx @@ -65,15 +65,14 @@ response = client.smartscraper( ### JavaScript SDK 
```javascript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); // Basic pagination - scrape 3 pages -const response = await smartScraper(apiKey, { - website_url: 'https://example-store.com/products', - user_prompt: 'Extract all product information', - total_pages: 3, +const { data } = await sgai.extract('https://example-store.com/products', { + prompt: 'Extract all product information', + totalPages: 3, }); ``` diff --git a/services/additional-parameters/proxy.mdx b/services/additional-parameters/proxy.mdx index adcedd8..35be03f 100644 --- a/services/additional-parameters/proxy.mdx +++ b/services/additional-parameters/proxy.mdx @@ -1,6 +1,6 @@ --- title: 'Proxy Configuration' -description: 'Configure proxy settings and geotargeting for web scraping requests' +description: 'Configure proxy settings, fetch modes, and geotargeting for web scraping requests' icon: 'globe' --- @@ -10,10 +10,12 @@ icon: 'globe' ## Overview -The ScrapeGraphAI API uses an intelligent proxy system that automatically handles web scraping requests through multiple proxy providers. The system uses a fallback strategy to ensure maximum reliability - if one provider fails, it automatically tries the next one. +The ScrapeGraphAI API uses an intelligent proxy system that automatically handles web scraping requests through multiple proxy providers. The system uses a fallback strategy to ensure maximum reliability — if one provider fails, it automatically tries the next one. **No configuration required**: The proxy system is fully automatic and transparent to API users. You don't need to configure proxy credentials or settings yourself. +In v2, all proxy and fetch behaviour is controlled through the `FetchConfig` object, which you can pass to any service method (`extract`, `scrape`, `search`, `crawl`, etc.). 
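To make the shape of that object concrete, here is a plain-Python sketch mirroring the documented fields, defaults, and ranges — an illustration only, not the class exported by `scrapegraph_py`:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FetchConfigSketch:
    """Illustrative stand-in for FetchConfig (not the SDK's actual class)."""

    mode: str = "auto"          # auto | fast | js | direct+stealth | js+stealth
    timeout: int = 30000        # request timeout in ms (1000-60000)
    wait: int = 0               # ms to wait after page load (0-30000)
    scrolls: int = 0            # number of page scrolls (0-100)
    country: Optional[str] = None  # two-letter ISO 3166-1 alpha-2 code
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)
    mock: bool = False

    def __post_init__(self):
        allowed = {"auto", "fast", "js", "direct+stealth", "js+stealth"}
        if self.mode not in allowed:
            raise ValueError(f"mode must be one of {sorted(allowed)}")
        if not 1000 <= self.timeout <= 60000:
            raise ValueError("timeout must be 1000-60000 ms")
        if not 0 <= self.wait <= 30000:
            raise ValueError("wait must be 0-30000 ms")
        if not 0 <= self.scrolls <= 100:
            raise ValueError("scrolls must be 0-100")
        if self.country is not None and len(self.country) != 2:
            raise ValueError("country must be a two-letter ISO code")
```

The real SDK performs equivalent validation server-side; the ranges here simply restate the documented limits.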
+ ## How It Works The API automatically routes your scraping requests through multiple proxy providers in a smart order: @@ -21,11 +23,58 @@ The API automatically routes your scraping requests through multiple proxy provi 1. The system tries different proxy providers automatically 2. If one provider fails, it automatically falls back to the next one 3. Successful providers are cached for each domain to improve performance -4. Everything happens transparently - you just make your API request as normal +4. Everything happens transparently — you just make your API request as normal + +## Fetch Modes + +The `mode` parameter inside `FetchConfig` controls how pages are retrieved and which proxy strategy is used: + +| Mode | Description | JS Rendering | Stealth Proxy | Best For | +|------|-------------|:------------:|:-------------:|----------| +| `auto` | Automatically selects the best provider chain | Adaptive | Adaptive | General use (default) | +| `fast` | Direct HTTP fetch via impit | No | No | Static pages, maximum speed | +| `js` | Headless browser rendering | Yes | No | JavaScript-heavy SPAs | +| `direct+stealth` | Residential proxy with stealth headers | No | Yes | Anti-bot sites (static) | +| `js+stealth` | JS rendering + residential proxy | Yes | Yes | Anti-bot sites (dynamic) | + + + +```python Python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + +# Use stealth mode with JS rendering +response = client.extract( + url="https://example.com", + prompt="Extract product information", + fetch_config=FetchConfig( + mode="js+stealth", + wait=2000, + ), +) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +// Use stealth mode with JS rendering +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract product information', + fetchConfig: { + mode: 'js+stealth', + wait: 2000, + }, +}); +``` + + ## Country 
Selection (Geotargeting) -You can optionally specify a country code to route requests through proxies in a specific country. This is useful for: +You can optionally specify a two-letter country code via `FetchConfig.country` to route requests through proxies in a specific country. This is useful for: - Accessing geo-restricted content - Getting localized versions of websites @@ -34,46 +83,46 @@ You can optionally specify a country code to route requests through proxies in a ### Using Country Code -Include the `country_code` parameter in your API request: - ```python Python -from scrapegraph_py import Client +from scrapegraph_py import Client, FetchConfig client = Client(api_key="your-api-key") -# Request with country code -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - country_code="us" # Route through US proxies +# Route through US proxies +response = client.extract( + url="https://example.com", + prompt="Extract product information", + fetch_config=FetchConfig(country="us"), ) ``` ```javascript JavaScript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); -// Request with country code -const response = await smartScraper(apiKey, { - website_url: 'https://example.com', - user_prompt: 'Extract product information', - country_code: 'us', +// Route through US proxies +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract product information', + fetchConfig: { country: 'us' }, }); ``` ```bash cURL curl -X 'POST' \ - 'https://api.scrapegraphai.com/v1/smartscraper' \ + 'https://api.scrapegraphai.com/api/v2/extract' \ -H 'accept: application/json' \ + -H 'Authorization: Bearer your-api-key' \ -H 'SGAI-APIKEY: your-api-key' \ -H 'Content-Type: application/json' \ -d '{ - "website_url": "https://example.com", - "user_prompt": "Extract product 
information", - "country_code": "us" + "url": "https://example.com", + "prompt": "Extract product information", + "fetchConfig": { + "country": "us" + } }' ``` @@ -106,16 +155,52 @@ And many more! The API supports over 100 countries. Use standard ISO 3166-1 alph -## Available Parameters +## FetchConfig Reference + +All proxy and fetch behaviour is configured through the `FetchConfig` object: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `mode` | string | `"auto"` | Fetch/proxy mode: `auto`, `fast`, `js`, `direct+stealth`, `js+stealth` | +| `timeout` | int | `30000` | Request timeout in milliseconds (1000–60000) | +| `wait` | int | `0` | Milliseconds to wait after page load before scraping (0–30000) | +| `scrolls` | int | `0` | Number of page scrolls to perform (0–100) | +| `country` | string | — | Two-letter ISO country code for geo-located proxy routing (e.g. `"us"`) | +| `headers` | object | — | Custom HTTP headers to send with the request | +| `cookies` | object | — | Cookies to send with the request | +| `mock` | bool | `false` | Enable mock mode for testing (no real request is made) | + + + +```python Python +from scrapegraph_py import FetchConfig + +config = FetchConfig( + mode="js+stealth", # Proxy strategy + timeout=15000, # 15s timeout + wait=2000, # Wait 2s after page load + scrolls=3, # Scroll 3 times + country="us", # Route through US proxies + headers={"Accept-Language": "en-US"}, + cookies={"session": "abc123"}, + mock=False, +) +``` -The following parameters in API requests can affect proxy behavior: +```javascript JavaScript +const fetchConfig = { + mode: 'js+stealth', // Proxy strategy + timeout: 15000, // 15s timeout + wait: 2000, // Wait 2s after page load + scrolls: 3, // Scroll 3 times + country: 'us', // Route through US proxies + headers: { 'Accept-Language': 'en-US' }, + cookies: { session: 'abc123' }, + mock: false, +}; +``` -### `country_code` (optional) -- **Type**: String -- 
**Description**: Two-letter ISO country code to route requests through proxies in a specific country -- **Example**: `"us"`, `"uk"`, `"de"`, `"it"`, `"fr"` -- **Default**: No specific country (uses optimal routing) -- **Format**: ISO 3166-1 alpha-2 (e.g., `us`, `gb`, `de`) + ## Usage Examples @@ -128,22 +213,21 @@ from scrapegraph_py import Client client = Client(api_key="your-api-key") -# Automatic proxy selection - no configuration needed -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information" +# Automatic proxy selection — no configuration needed +response = client.extract( + url="https://example.com", + prompt="Extract product information", ) ``` ```javascript JavaScript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); // Automatic proxy selection -const response = await smartScraper(apiKey, { - website_url: 'https://example.com', - user_prompt: 'Extract product information', +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract product information', }); ``` @@ -154,35 +238,79 @@ const response = await smartScraper(apiKey, { ```python Python -from scrapegraph_py import Client +from scrapegraph_py import Client, FetchConfig client = Client(api_key="your-api-key") # Route through US proxies -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - country_code="us" +response = client.extract( + url="https://example.com", + prompt="Extract product information", + fetch_config=FetchConfig(country="us"), ) # Route through UK proxies -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - country_code="uk" +response = client.extract( + url="https://example.com", + prompt="Extract product information", + 
fetch_config=FetchConfig(country="gb"), ) ``` ```javascript JavaScript -import { smartScraper } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); // Route through US proxies -const response = await smartScraper(apiKey, { - website_url: 'https://example.com', - user_prompt: 'Extract product information', - country_code: 'us', +const { data } = await sgai.extract('https://example.com', { + prompt: 'Extract product information', + fetchConfig: { country: 'us' }, +}); + +// Route through UK proxies +const { data: ukData } = await sgai.extract('https://example.com', { + prompt: 'Extract product information', + fetchConfig: { country: 'gb' }, +}); +``` + + + +### Stealth Mode with JS Rendering + + + +```python Python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + +response = client.scrape( + url="https://heavily-protected-site.com", + format="markdown", + fetch_config=FetchConfig( + mode="js+stealth", + wait=3000, + scrolls=5, + country="us", + ), +) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const { data } = await sgai.scrape('https://heavily-protected-site.com', { + format: 'markdown', + fetchConfig: { + mode: 'js+stealth', + wait: 3000, + scrolls: 5, + country: 'us', + }, }); ``` @@ -192,75 +320,105 @@ const response = await smartScraper(apiKey, { #### Accessing Geo-Restricted Content -```python -from scrapegraph_py import Client + + +```python Python +from scrapegraph_py import Client, FetchConfig client = Client(api_key="your-api-key") # Access US-only content -response = client.smartscraper( - website_url="https://us-only-service.com", - user_prompt="Extract available services", - country_code="us" +response = client.extract( + url="https://us-only-service.com", + prompt="Extract available services", + 
fetch_config=FetchConfig(country="us"), ) ``` +```javascript JavaScript +const { data } = await sgai.extract('https://us-only-service.com', { + prompt: 'Extract available services', + fetchConfig: { country: 'us' }, +}); +``` + + + #### Getting Localized Content ```python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + # Get German version of a website -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product prices in local currency", - country_code="de" +response = client.extract( + url="https://example.com", + prompt="Extract product prices in local currency", + fetch_config=FetchConfig(country="de"), ) # Get French version -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product prices in local currency", - country_code="fr" +response = client.extract( + url="https://example.com", + prompt="Extract product prices in local currency", + fetch_config=FetchConfig(country="fr"), ) ``` #### E-commerce Price Comparison ```python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + # Compare prices from different regions -countries = ["us", "uk", "de", "fr"] +countries = ["us", "gb", "de", "fr"] for country in countries: - response = client.smartscraper( - website_url="https://ecommerce-site.com/product/123", - user_prompt="Extract product price and availability", - country_code=country + response = client.extract( + url="https://ecommerce-site.com/product/123", + prompt="Extract product price and availability", + fetch_config=FetchConfig(country=country), ) - print(f"{country}: {response['result']}") + print(f"{country}: {response['data']}") ``` ## Best Practices -### 1. Use Country Code When Needed +### 1. 
Choose the Right Fetch Mode + +Pick the mode that matches your target site: +- **`auto`** (default) — let the system decide; works for most sites +- **`fast`** — use for simple, static HTML pages +- **`js`** — use for SPAs and JavaScript-rendered content +- **`direct+stealth`** — use for anti-bot sites that don't require JS +- **`js+stealth`** — use for anti-bot sites with dynamic content + +### 2. Use Country Code When Needed Only specify a country code if you have a specific requirement: -- ✅ Accessing geo-restricted content -- ✅ Getting localized versions of websites -- ✅ Complying with regional requirements -- ❌ Don't specify if you don't need it - let the system optimize automatically +- Accessing geo-restricted content +- Getting localized versions of websites +- Complying with regional requirements +- Don't specify if you don't need it — let the system optimize automatically -### 2. Let the System Handle Routing +### 3. Let the System Handle Routing The API automatically selects the best proxy provider for each request: - No manual proxy selection needed - Automatic failover ensures reliability - Performance is optimized automatically -### 3. Handle Errors Gracefully +### 4. Handle Errors Gracefully If a request fails, the system has already tried multiple providers: -```python -from scrapegraph_py import Client + + +```python Python +from scrapegraph_py import Client, FetchConfig import time client = Client(api_key="your-api-key") @@ -268,10 +426,10 @@ client = Client(api_key="your-api-key") def scrape_with_retry(url, prompt, max_retries=3): for attempt in range(max_retries): try: - response = client.smartscraper( - website_url=url, - user_prompt=prompt, - country_code="us" + response = client.extract( + url=url, + prompt=prompt, + fetch_config=FetchConfig(country="us"), ) return response except Exception as e: @@ -282,7 +440,29 @@ def scrape_with_retry(url, prompt, max_retries=3): raise e ``` -### 4. 
Monitor Rate Limits +```javascript JavaScript +async function scrapeWithRetry(url, prompt, maxRetries = 3) { + for (let attempt = 0; attempt < maxRetries; attempt++) { + try { + return await sgai.extract(url, { + prompt, + fetchConfig: { country: 'us' }, + }); + } catch (err) { + if (attempt < maxRetries - 1) { + console.log(`Attempt ${attempt + 1} failed: ${err.message}`); + await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)); + } else { + throw err; + } + } + } +} +``` + + + +### 5. Monitor Rate Limits Be aware of your API rate limits: - The proxy system respects these limits automatically @@ -298,8 +478,9 @@ If your scraping request fails: 1. **Verify the URL**: Make sure the URL is correct and accessible 2. **Check the website**: Some websites may block automated access regardless of proxy -3. **Retry the request**: The system uses automatic retries, but you can manually retry after a delay -4. **Try a different country**: If geo-restriction is the issue, try a different `country_code` +3. **Try a different mode**: Switch to `js+stealth` for heavily-protected sites +4. **Retry the request**: The system uses automatic retries, but you can manually retry after a delay +5. 
**Try a different country**: If geo-restriction is the issue, try a different `country` ### Rate Limiting @@ -318,21 +499,21 @@ If you receive rate limit errors (HTTP 429): If you're trying to access geo-restricted content: -- Use the `country_code` parameter to specify the required country +- Use the `country` parameter inside `FetchConfig` to specify the required country - Make sure the content is available in that country - Some content may still be restricted regardless of proxy location - Try multiple country codes if one doesn't work -### Proxy Selection Issues +### Anti-Bot Protection - -If you're experiencing proxy-related issues: + +If a website is blocking your requests: -- The system automatically tries multiple providers -- No manual configuration is needed -- If issues persist, contact support with your request ID -- Check if the issue is specific to certain websites or domains +- Use `mode: "direct+stealth"` or `mode: "js+stealth"` in `FetchConfig` +- Add a `wait` time to let the page fully load +- Use `scrolls` to trigger lazy-loaded content +- Add custom `headers` if the site expects specific ones ## FAQ @@ -341,42 +522,30 @@ If you're experiencing proxy-related issues: **A**: No, the proxy system is fully managed and automatic. You don't need to provide any proxy credentials or configuration. - -**A**: No, the system automatically selects the best proxy provider for each request. This ensures optimal performance and reliability. + +**A**: Use the `mode` parameter in `FetchConfig`. Set it to `auto` (default), `fast`, `js`, `direct+stealth`, or `js+stealth` depending on your needs. - -**A**: The proxy selection is handled automatically and transparently. You don't need to know which proxy was used - just use the API as normal. - - - -**A**: The API uses managed proxy services. If you have specific proxy requirements, please contact support. + +**A**: No, the system automatically selects the best proxy provider for each request. 
You can influence the strategy by setting the `mode` parameter. **A**: The API will return an error. The system tries multiple providers with automatic fallback, so this is rare. If it happens, verify the URL and try again. - -**A**: No, the `country_code` parameter doesn't affect pricing. Credits are charged the same regardless of proxy location. + +**A**: No, the `country` parameter doesn't affect pricing. Credits are charged the same regardless of proxy location. - -**A**: Yes, `country_code` is available for all scraping services including SmartScraper, SearchScraper, SmartCrawler, and Markdownify. + +**A**: Yes, `FetchConfig` is available for all services including `extract`, `scrape`, `search`, `crawl`, and `monitor`. **A**: Both `uk` and `gb` refer to the United Kingdom. The API accepts both codes for compatibility. -## API Reference - -For detailed API documentation, see: -- [SmartScraper Start Job](/api-reference/endpoint/smartscraper/start) -- [SearchScraper Start Job](/api-reference/endpoint/searchscraper/start) -- [SmartCrawler Start Job](/api-reference/endpoint/smartcrawler/start) -- [Markdownify Start Job](/api-reference/endpoint/markdownify/start) - ## Support & Resources diff --git a/services/additional-parameters/wait-ms.mdx b/services/additional-parameters/wait-ms.mdx index 45a4646..db961ea 100644 --- a/services/additional-parameters/wait-ms.mdx +++ b/services/additional-parameters/wait-ms.mdx @@ -67,27 +67,25 @@ response = client.markdownify( ### JavaScript SDK ```javascript -import { smartScraper, scrape, markdownify } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; -const apiKey = 'your-api-key'; +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); -// SmartScraper with custom wait time -const response = await smartScraper(apiKey, { - website_url: 'https://example.com', - user_prompt: 'Extract product information', - wait_ms: 5000, +// Extract with custom wait time +const { data } = await 
sgai.extract('https://example.com', { + prompt: 'Extract product information', + fetchConfig: { wait: 5000 }, }); // Scrape with custom wait time -const scrapeResponse = await scrape(apiKey, { - website_url: 'https://example.com', - wait_ms: 5000, +const { data: scrapeData } = await sgai.scrape('https://example.com', { + fetchConfig: { wait: 5000 }, }); // Markdownify with custom wait time -const mdResponse = await markdownify(apiKey, { - website_url: 'https://example.com', - wait_ms: 5000, +const { data: mdData } = await sgai.scrape('https://example.com', { + format: 'markdown', + fetchConfig: { wait: 5000 }, }); ``` diff --git a/services/agenticscraper.mdx b/services/agenticscraper.mdx index 45e0c65..b2167d6 100644 --- a/services/agenticscraper.mdx +++ b/services/agenticscraper.mdx @@ -17,7 +17,7 @@ Agentic Scraper is our most advanced service for automating browser actions and - **Optionally** use AI to extract structured data according to a schema -Try it instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) – no coding required! +Try it instantly in our [interactive playground](https://scrapegraphai.com/dashboard) – no coding required! 
## Difference: With vs Without AI Extraction @@ -39,7 +39,7 @@ const apiKey = process.env.SGAI_APIKEY; // Basic scraping without AI extraction const response = await agenticScraper(apiKey, { - url: 'https://dashboard.scrapegraphai.com/', + url: 'https://scrapegraphai.com/dashboard', steps: [ 'Type email@gmail.com in email input box', 'Type test-password@123 in password inputbox', @@ -52,7 +52,7 @@ console.log(response.data); // With AI extraction const aiResponse = await agenticScraper(apiKey, { - url: 'https://dashboard.scrapegraphai.com/', + url: 'https://scrapegraphai.com/dashboard', steps: [ 'Type email@gmail.com in email input box', 'Type test-password@123 in password inputbox', @@ -86,7 +86,7 @@ curl -X 'POST' \ -H 'SGAI-APIKEY: your-api-key' \ -H 'Content-Type: application/json' \ -d '{ - "url": "https://dashboard.scrapegraphai.com/", + "url": "https://scrapegraphai.com/dashboard", "use_session": true, "steps": ["Type email@gmail.com in email input box", "Type test-password@123 in password inputbox", "click on login"], "ai_extraction": false @@ -99,7 +99,7 @@ curl -X 'POST' \ -H 'SGAI-APIKEY: your-api-key' \ -H 'Content-Type: application/json' \ -d '{ - "url": "https://dashboard.scrapegraphai.com/", + "url": "https://scrapegraphai.com/dashboard", "use_session": true, "steps": ["Type email@gmail.com in email input box", "Type test-password@123 in password inputbox", "click on login", "wait for dashboard to load completely"], "user_prompt": "Extract user info, dashboard sections, and remaining credits", @@ -132,7 +132,7 @@ client = Client(api_key=api_key) # Basic example: login and scrape without AI response = client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", + url="https://scrapegraphai.com/dashboard", use_session=True, steps=[ "Type email@gmail.com in email input box", @@ -157,7 +157,7 @@ output_schema = { } } ai_response = client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", + url="https://scrapegraphai.com/dashboard", 
use_session=True, steps=[ "Type email@gmail.com in email input box", @@ -175,12 +175,12 @@ client.close() ```bash CLI # Basic scraping without AI extraction -just-scrape agentic-scraper https://dashboard.scrapegraphai.com/ \ +just-scrape agentic-scraper https://scrapegraphai.com/dashboard \ -s "Type email@gmail.com in email input box,Type test-password@123 in password inputbox,Click login" \ --use-session # With AI extraction -just-scrape agentic-scraper https://dashboard.scrapegraphai.com/ \ +just-scrape agentic-scraper https://scrapegraphai.com/dashboard \ -s "Type email@gmail.com in email input box,Type test-password@123 in password inputbox,Click login,wait for dashboard to load" \ --ai-extraction -p "Extract user info, dashboard sections, and remaining credits" \ --use-session @@ -201,7 +201,7 @@ just-scrape agentic-scraper https://dashboard.scrapegraphai.com/ \ | ai_extraction | bool | No | true = AI extraction, false = raw content only | -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Use Cases @@ -245,6 +245,6 @@ For technical details see: - + Get your API key and start using Agentic Scraper now! diff --git a/services/cli.mdx b/services/cli.mdx index ab551d2..a0c8b4e 100644 --- a/services/cli.mdx +++ b/services/cli.mdx @@ -6,10 +6,10 @@ icon: 'terminal' ## Overview -`just-scrape` is the official CLI for [ScrapeGraph AI](https://scrapegraphai.com) — AI-powered web scraping, data extraction, search, and crawling, straight from your terminal. +`just-scrape` is the official CLI for [ScrapeGraph AI](https://scrapegraphai.com) — AI-powered web scraping, data extraction, search, and crawling, straight from your terminal. Uses the **v2 API**. 
-Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Installation @@ -58,110 +58,81 @@ The CLI needs a ScrapeGraph API key. Four ways to provide it (checked in order): | Variable | Description | Default | |---|---|---| | `SGAI_API_KEY` | ScrapeGraph API key | — | -| `JUST_SCRAPE_API_URL` | Override API base URL | `https://api.scrapegraphai.com/v1` | -| `JUST_SCRAPE_TIMEOUT_S` | Request/polling timeout in seconds | `120` | -| `JUST_SCRAPE_DEBUG` | Set to `1` to enable debug logging | `0` | +| `SGAI_API_URL` | Override API base URL | `https://api.scrapegraphai.com` | +| `SGAI_TIMEOUT_S` | Request timeout in seconds | `30` | + +Legacy variables (`JUST_SCRAPE_API_URL`, `JUST_SCRAPE_TIMEOUT_S`, `JUST_SCRAPE_DEBUG`) are still bridged. ## JSON Mode All commands support `--json` for machine-readable output. Banner, spinners, and interactive prompts are suppressed — only minified JSON on stdout. Saves tokens when piped to AI agents. ```bash -just-scrape credits --json | jq '.remaining_credits' -just-scrape smart-scraper https://example.com -p "Extract data" --json > result.json +just-scrape credits --json | jq '.remainingCredits' +just-scrape extract https://example.com -p "Extract data" --json > result.json ``` ## Commands -### SmartScraper - -Extract structured data from any URL using AI. [Full docs →](/services/smartscraper) - -```bash -just-scrape smart-scraper -p -just-scrape smart-scraper -p --schema -just-scrape smart-scraper -p --scrolls -just-scrape smart-scraper -p --pages -just-scrape smart-scraper -p --stealth -just-scrape smart-scraper -p --cookies --headers -just-scrape smart-scraper -p --plain-text -``` - -### SearchScraper +### Extract -Search the web and extract structured data from results. [Full docs →](/services/searchscraper) +Extract structured data from any URL using AI (replaces `smart-scraper`). 
[Full docs →](/api-reference/extract) ```bash -just-scrape search-scraper -just-scrape search-scraper --num-results -just-scrape search-scraper --no-extraction -just-scrape search-scraper --schema -just-scrape search-scraper --stealth --headers +just-scrape extract -p +just-scrape extract -p --schema +just-scrape extract -p --scrolls +just-scrape extract -p --mode direct+stealth +just-scrape extract -p --cookies --headers +just-scrape extract -p --country ``` -### Markdownify - -Convert any webpage to clean markdown. [Full docs →](/services/markdownify) - -```bash -just-scrape markdownify -just-scrape markdownify --stealth -just-scrape markdownify --headers -``` - -### Crawl +### Search -Crawl multiple pages and extract data from each. [Full docs →](/services/smartcrawler) +Search the web and extract structured data from results (replaces `search-scraper`). [Full docs →](/api-reference/search) ```bash -just-scrape crawl -p -just-scrape crawl -p --max-pages -just-scrape crawl -p --depth -just-scrape crawl --no-extraction --max-pages -just-scrape crawl -p --schema -just-scrape crawl -p --rules -just-scrape crawl -p --no-sitemap -just-scrape crawl -p --stealth +just-scrape search +just-scrape search --num-results +just-scrape search -p +just-scrape search --schema +just-scrape search --headers ``` ### Scrape -Get raw HTML content from a URL. [Full docs →](/services/scrape) +Scrape content from a URL in various formats: markdown (default), html, screenshot, or branding. [Full docs →](/api-reference/scrape) ```bash just-scrape scrape -just-scrape scrape --stealth -just-scrape scrape --branding -just-scrape scrape --country-code +just-scrape scrape -f html +just-scrape scrape -f screenshot +just-scrape scrape -f branding +just-scrape scrape -m direct+stealth +just-scrape scrape --country ``` -### Sitemap - -Get all URLs from a website's sitemap. 
[Full docs →](/services/sitemap) - -```bash -just-scrape sitemap -just-scrape sitemap --json | jq -r '.urls[]' -``` - -### Agentic Scraper +### Markdownify -Browser automation with AI — login, click, navigate, fill forms. [Full docs →](/services/agenticscraper) +Convert any webpage to clean markdown (convenience wrapper for `scrape --format markdown`). [Full docs →](/api-reference/scrape) ```bash -just-scrape agentic-scraper -s -just-scrape agentic-scraper -s --ai-extraction -p -just-scrape agentic-scraper -s --schema -just-scrape agentic-scraper -s --use-session +just-scrape markdownify +just-scrape markdownify -m direct+stealth +just-scrape markdownify --headers ``` -### Generate Schema +### Crawl -Generate a JSON schema from a natural language description. +Crawl multiple pages. The CLI starts the crawl and polls until completion. [Full docs →](/api-reference/crawl) ```bash -just-scrape generate-schema -just-scrape generate-schema --existing-schema +just-scrape crawl +just-scrape crawl --max-pages +just-scrape crawl --max-depth +just-scrape crawl --max-links-per-page +just-scrape crawl --allow-external +just-scrape crawl -m direct+stealth ``` ### History @@ -176,7 +147,7 @@ just-scrape history --page-size just-scrape history --json ``` -Services: `markdownify`, `smartscraper`, `searchscraper`, `scrape`, `crawl`, `agentic-scraper`, `sitemap` +Services: `scrape`, `extract`, `search`, `monitor`, `crawl` ### Credits @@ -184,14 +155,7 @@ Check your credit balance. ```bash just-scrape credits -``` - -### Validate - -Validate your API key. 
- -```bash -just-scrape validate +just-scrape credits --json | jq '.remainingCredits' ``` ## AI Agent Integration @@ -214,7 +178,7 @@ bunx skills add https://github.com/ScrapeGraphAI/just-scrape Join our Discord community - + Get your API key diff --git a/services/cli/ai-agent-skill.mdx b/services/cli/ai-agent-skill.mdx index 50ee527..17ae68c 100644 --- a/services/cli/ai-agent-skill.mdx +++ b/services/cli/ai-agent-skill.mdx @@ -17,9 +17,10 @@ Browse the skill: [skills.sh/scrapegraphai/just-scrape/just-scrape](https://skil Once installed, your coding agent can: -- Scrape a website to gather data needed for a task +- Extract structured data from any website using AI - Convert documentation pages to markdown for context - Search the web and extract structured results +- Crawl multiple pages and collect data - Check your credit balance mid-session - Browse request history @@ -28,13 +29,13 @@ Once installed, your coding agent can: Agents call `just-scrape` in `--json` mode for clean, token-efficient output: ```bash -just-scrape smart-scraper https://api.example.com/docs \ +just-scrape extract https://api.example.com/docs \ -p "Extract all endpoint names, methods, and descriptions" \ --json ``` ```bash -just-scrape search-scraper "latest release notes for react-query" \ +just-scrape search "latest release notes for react-query" \ --num-results 3 --json ``` @@ -76,15 +77,14 @@ This project uses `just-scrape` (ScrapeGraph AI CLI) for web scraping. The API key is set via the SGAI_API_KEY environment variable. 
Available commands (always use --json flag): -- `just-scrape smart-scraper -p --json` — AI extraction from a URL -- `just-scrape search-scraper --json` — search the web and extract data +- `just-scrape extract -p --json` — AI extraction from a URL +- `just-scrape search --json` — search the web and extract data - `just-scrape markdownify --json` — convert a page to markdown -- `just-scrape crawl -p --json` — crawl multiple pages -- `just-scrape scrape --json` — get raw HTML -- `just-scrape sitemap --json` — get all URLs from a sitemap +- `just-scrape crawl --json` — crawl multiple pages +- `just-scrape scrape --json` — get page content (markdown, html, screenshot, branding) Use --schema to enforce a JSON schema on the output. -Use --stealth for sites with anti-bot protection. +Use --mode direct+stealth or --mode js+stealth for sites with anti-bot protection. ``` ### Example prompts for Claude Code @@ -120,7 +120,7 @@ claude -p "Use just-scrape to scrape https://example.com/changelog \ - Pass `--schema` with a JSON schema to get typed, predictable output: ```bash -just-scrape smart-scraper https://example.com \ +just-scrape extract https://example.com \ -p "Extract company info" \ --schema '{"type":"object","properties":{"name":{"type":"string"},"founded":{"type":"number"}}}' \ --json diff --git a/services/cli/commands.mdx b/services/cli/commands.mdx index 566f827..524084a 100644 --- a/services/cli/commands.mdx +++ b/services/cli/commands.mdx @@ -3,110 +3,79 @@ title: 'Commands' description: 'Full reference for every just-scrape command and its flags' --- -## smart-scraper +## extract -Extract structured data from any URL using AI. [Full docs →](/services/smartscraper) +Extract structured data from any URL using AI (replaces `smart-scraper`). 
[Full docs →](/api-reference/extract) ```bash -just-scrape smart-scraper -p -just-scrape smart-scraper -p --schema -just-scrape smart-scraper -p --scrolls # infinite scroll (0-100) -just-scrape smart-scraper -p --pages # multi-page (1-100) -just-scrape smart-scraper -p --stealth # anti-bot bypass (+4 credits) -just-scrape smart-scraper -p --cookies --headers -just-scrape smart-scraper -p --plain-text # plain text instead of JSON +just-scrape extract -p +just-scrape extract -p --schema +just-scrape extract -p --scrolls # infinite scroll (0-100) +just-scrape extract -p --mode js+stealth # anti-bot bypass +just-scrape extract -p --cookies --headers +just-scrape extract -p --country # geo-targeting ``` -## search-scraper +## search -Search the web and extract structured data from results. [Full docs →](/services/searchscraper) +Search the web and extract structured data from results (replaces `search-scraper`). [Full docs →](/api-reference/search) ```bash -just-scrape search-scraper -just-scrape search-scraper --num-results # sources to scrape (3-20, default 3) -just-scrape search-scraper --no-extraction # markdown only (2 credits vs 10) -just-scrape search-scraper --schema -just-scrape search-scraper --stealth --headers +just-scrape search +just-scrape search -p # extraction prompt for results +just-scrape search --num-results # sources to scrape (1-20, default 3) +just-scrape search --schema +just-scrape search --headers ``` ## markdownify -Convert any webpage to clean markdown. [Full docs →](/services/markdownify) +Convert any webpage to clean markdown (convenience wrapper for `scrape --format markdown`). [Full docs →](/api-reference/scrape) ```bash just-scrape markdownify -just-scrape markdownify --stealth +just-scrape markdownify -m direct+stealth # anti-bot bypass just-scrape markdownify --headers ``` -## crawl - -Crawl multiple pages and extract data from each. 
[Full docs →](/services/smartcrawler) - -```bash -just-scrape crawl -p -just-scrape crawl -p --max-pages # max pages (default 10) -just-scrape crawl -p --depth # crawl depth (default 1) -just-scrape crawl --no-extraction --max-pages # markdown only (2 credits/page) -just-scrape crawl -p --schema -just-scrape crawl -p --rules # include_paths, same_domain -just-scrape crawl -p --no-sitemap # skip sitemap discovery -just-scrape crawl -p --stealth -``` - ## scrape -Get raw HTML content from a URL. [Full docs →](/services/scrape) - -```bash -just-scrape scrape -just-scrape scrape --stealth # anti-bot bypass (+4 credits) -just-scrape scrape --branding # extract branding (+2 credits) -just-scrape scrape --country-code # geo-targeting -``` - -## sitemap - -Get all URLs from a website's sitemap. [Full docs →](/services/sitemap) +Scrape content from a URL in various formats. [Full docs →](/api-reference/scrape) ```bash -just-scrape sitemap -just-scrape sitemap --json | jq -r '.urls[]' +just-scrape scrape # markdown (default) +just-scrape scrape -f html # raw HTML +just-scrape scrape -f screenshot # screenshot +just-scrape scrape -f branding # extract branding info +just-scrape scrape -m direct+stealth # anti-bot bypass +just-scrape scrape --country # geo-targeting ``` -## agentic-scraper - -Browser automation with AI — login, click, navigate, fill forms. [Full docs →](/services/agenticscraper) - -```bash -just-scrape agentic-scraper -s -just-scrape agentic-scraper -s --ai-extraction -p -just-scrape agentic-scraper -s --schema -just-scrape agentic-scraper -s --use-session # persist browser session -``` - -## generate-schema +## crawl -Generate a JSON schema from a natural language description. +Crawl multiple pages. The CLI starts the crawl and polls until completion. 
[Full docs →](/api-reference/crawl) ```bash -just-scrape generate-schema -just-scrape generate-schema --existing-schema +just-scrape crawl +just-scrape crawl --max-pages # max pages (default 50) +just-scrape crawl --max-depth # crawl depth (default 2) +just-scrape crawl --max-links-per-page # max links per page (default 10) +just-scrape crawl --allow-external # allow external domains +just-scrape crawl -m direct+stealth # anti-bot bypass ``` ## history -Browse request history for any service. Interactive by default — arrow keys to navigate, select to view details. +View request history for a service. Interactive by default — arrow keys to navigate, select to view details. ```bash just-scrape history -just-scrape history -just-scrape history --page # start from page (default 1) -just-scrape history --page-size # results per page (max 100) +just-scrape history --page # start from page (default 1) +just-scrape history --page-size # results per page (max 100) just-scrape history --json ``` -Services: `markdownify`, `smartscraper`, `searchscraper`, `scrape`, `crawl`, `agentic-scraper`, `sitemap` +Services: `scrape`, `extract`, `search`, `monitor`, `crawl` ## credits @@ -117,14 +86,6 @@ just-scrape credits -just-scrape credits --json | jq '.remaining_credits' +just-scrape credits --json | jq '.remainingCredits' ``` -## validate - -Validate your API key.
- -```bash -just-scrape validate -``` - ## Global flags All commands support these flags: diff --git a/services/cli/examples.mdx b/services/cli/examples.mdx index 68c7e3a..708fb18 100644 --- a/services/cli/examples.mdx +++ b/services/cli/examples.mdx @@ -3,39 +3,39 @@ title: 'Examples' description: 'Practical examples for every just-scrape command' --- -## smart-scraper +## extract ```bash # Extract product listings -just-scrape smart-scraper https://store.example.com/shoes \ +just-scrape extract https://store.example.com/shoes \ -p "Extract all product names, prices, and ratings" # Enforce output schema + scroll to load more content -just-scrape smart-scraper https://news.example.com \ +just-scrape extract https://news.example.com \ -p "Get all article headlines and dates" \ --schema '{"type":"object","properties":{"articles":{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"date":{"type":"string"}}}}}}' \ --scrolls 5 # Anti-bot bypass for JS-heavy SPAs -just-scrape smart-scraper https://app.example.com/dashboard \ +just-scrape extract https://app.example.com/dashboard \ -p "Extract user stats" \ - --stealth + --mode js+stealth ``` -## search-scraper +## search ```bash # Research across multiple sources -just-scrape search-scraper "What are the best Python web frameworks in 2025?" \ +just-scrape search "What are the best Python web frameworks in 2025?" 
\ --num-results 10 -# Get raw markdown only (cheaper — 2 credits vs 10) -just-scrape search-scraper "React vs Vue comparison" \ - --no-extraction --num-results 5 - # Structured output with schema -just-scrape search-scraper "Top 5 cloud providers pricing" \ +just-scrape search "Top 5 cloud providers pricing" \ --schema '{"type":"object","properties":{"providers":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"free_tier":{"type":"string"}}}}}}' + +# With extraction prompt +just-scrape search "React vs Vue comparison" \ + -p "Summarize the key differences" ``` ## markdownify @@ -46,93 +46,57 @@ just-scrape markdownify https://blog.example.com/my-article # Save to a file just-scrape markdownify https://docs.example.com/api \ - --json | jq -r '.result' > api-docs.md + --json | jq -r '.markdown' > api-docs.md # Bypass Cloudflare -just-scrape markdownify https://protected.example.com --stealth -``` - -## crawl - -```bash -# Crawl a docs site and collect code examples -just-scrape crawl https://docs.example.com \ - -p "Extract all code snippets with their language" \ - --max-pages 20 --depth 3 - -# Crawl only blog pages, skip everything else -just-scrape crawl https://example.com \ - -p "Extract article titles and summaries" \ - --rules '{"include_paths":["/blog/*"],"same_domain":true}' \ - --max-pages 50 - -# Raw markdown from all pages (no AI extraction, cheaper) -just-scrape crawl https://example.com \ - --no-extraction --max-pages 10 +just-scrape markdownify https://protected.example.com -m js+stealth ``` ## scrape ```bash -# Get raw HTML +# Get markdown (default format) just-scrape scrape https://example.com -# Geo-targeted + anti-bot bypass -just-scrape scrape https://store.example.com \ - --stealth --country-code DE - -# Extract branding info (logos, colors, fonts) -just-scrape scrape https://example.com --branding -``` +# Get raw HTML +just-scrape scrape https://example.com -f html -## sitemap +# Take a screenshot +just-scrape 
scrape https://example.com -f screenshot -```bash -# List all pages on a site -just-scrape sitemap https://example.com +# Extract branding info (logos, colors, fonts) +just-scrape scrape https://example.com -f branding -# Pipe URLs to another tool -just-scrape sitemap https://example.com --json | jq -r '.urls[]' +# Geo-targeted + anti-bot bypass +just-scrape scrape https://store.example.com \ + -m direct+stealth --country DE ``` -## agentic-scraper +## crawl ```bash -# Log in and extract dashboard data -just-scrape agentic-scraper https://app.example.com/login \ - -s "Fill email with user@test.com,Fill password with secret,Click Sign In" \ - --ai-extraction -p "Extract all dashboard metrics" - -# Navigate a multi-step form -just-scrape agentic-scraper https://example.com/wizard \ - -s "Click Next,Select Premium plan,Fill name with John,Click Submit" - -# Persistent browser session across multiple runs -just-scrape agentic-scraper https://app.example.com \ - -s "Click Settings" --use-session -``` - -## generate-schema +# Crawl a docs site +just-scrape crawl https://docs.example.com \ + --max-pages 20 --max-depth 3 -```bash -# Generate a schema from a description -just-scrape generate-schema "E-commerce product with name, price, ratings, and reviews array" +# Allow external links +just-scrape crawl https://example.com \ + --max-pages 50 --allow-external -# Refine an existing schema -just-scrape generate-schema "Add an availability field" \ - --existing-schema '{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}}}' +# Anti-bot bypass for protected sites +just-scrape crawl https://example.com -m direct+stealth ``` ## history ```bash # Interactive history browser -just-scrape history smartscraper +just-scrape history extract -# Fetch a specific request by ID -just-scrape history smartscraper abc123-def456-7890 +# Export last 100 extract jobs as JSON +just-scrape history extract --json --page-size 100 \ + | jq '.[] | {id: .request_id, 
status}' -# Export last 100 crawl jobs as JSON -just-scrape history crawl --json --page-size 100 \ - | jq '.requests[] | {id: .request_id, status}' +# Browse crawl history +just-scrape history crawl --json ``` diff --git a/services/cli/introduction.mdx index 7ccc61b..74322fd 100644 --- a/services/cli/introduction.mdx +++ b/services/cli/introduction.mdx @@ -7,7 +7,7 @@ icon: 'terminal' `just-scrape` is the official CLI for [ScrapeGraph AI](https://scrapegraphai.com) — AI-powered web scraping, data extraction, search, and crawling, straight from your terminal. - Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) + Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Installation @@ -56,21 +56,21 @@ The CLI needs a ScrapeGraph API key. Four ways to provide it (checked in order): | Variable | Description | Default | |---|---|---| | `SGAI_API_KEY` | ScrapeGraph API key | — | -| `JUST_SCRAPE_API_URL` | Override API base URL | `https://api.scrapegraphai.com/v1` | -| `JUST_SCRAPE_TIMEOUT_S` | Request/polling timeout in seconds | `120` | -| `JUST_SCRAPE_DEBUG` | Set to `1` to enable debug logging | `0` | +| `SGAI_API_URL` | Override API base URL | `https://api.scrapegraphai.com` | +| `SGAI_TIMEOUT_S` | Request timeout in seconds | `30` | + +Legacy variables (`JUST_SCRAPE_API_URL`, `JUST_SCRAPE_TIMEOUT_S`, `JUST_SCRAPE_DEBUG`) are still read as fallbacks for backwards compatibility.
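The lookup order implied by the table and the legacy-variable note can be sketched as follows. This is an illustrative sketch, not the CLI's actual internals: `resolve_timeout` is a hypothetical helper, and the only grounded facts are the variable names, the "checked in order" behavior, and the `30` second default from the table.

```python
import os

DEFAULT_TIMEOUT_S = 30  # default from the environment-variable table above

def resolve_timeout(env=os.environ):
    # Sketch of the documented precedence: the new SGAI_* name is checked
    # first, then the bridged legacy JUST_SCRAPE_* name, then the default.
    for name in ("SGAI_TIMEOUT_S", "JUST_SCRAPE_TIMEOUT_S"):
        value = env.get(name)
        if value:
            return int(value)
    return DEFAULT_TIMEOUT_S

print(resolve_timeout({"JUST_SCRAPE_TIMEOUT_S": "120"}))  # legacy name still honored
print(resolve_timeout({"SGAI_TIMEOUT_S": "60",
                       "JUST_SCRAPE_TIMEOUT_S": "120"}))  # new name wins
```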
## Verify your setup ```bash -just-scrape validate # check your API key just-scrape credits # check your credit balance ``` ## Quick start ```bash -just-scrape smart-scraper https://news.ycombinator.com \ +just-scrape extract https://news.ycombinator.com \ -p "Extract the top 5 story titles and their URLs" ``` diff --git a/services/cli/json-mode.mdx b/services/cli/json-mode.mdx index 60d13c8..1b56590 100644 --- a/services/cli/json-mode.mdx +++ b/services/cli/json-mode.mdx @@ -23,7 +23,7 @@ just-scrape [args] --json ### Save results to a file ```bash -just-scrape smart-scraper https://store.example.com \ +just-scrape extract https://store.example.com \ -p "Extract all product names and prices" \ --json > products.json ``` @@ -31,18 +31,16 @@ just-scrape smart-scraper https://store.example.com \ ### Extract a specific field with jq ```bash -just-scrape credits --json | jq '.remaining_credits' +just-scrape credits --json | jq '.remainingCredits' -just-scrape sitemap https://example.com --json | jq -r '.urls[]' - -just-scrape history smartscraper --json | jq '.requests[] | {id: .request_id, status}' +just-scrape history extract --json | jq '.[].status' ``` ### Convert a page to markdown and save it ```bash just-scrape markdownify https://docs.example.com/api \ - --json | jq -r '.result' > api-docs.md + --json | jq -r '.markdown' > api-docs.md ``` ### Chain commands in a shell script @@ -50,7 +48,7 @@ just-scrape markdownify https://docs.example.com/api \ ```bash #!/bin/bash while IFS= read -r url; do - just-scrape smart-scraper "$url" \ + just-scrape extract "$url" \ -p "Extract the page title and main content" \ --json >> results.jsonl done < urls.txt @@ -69,8 +67,8 @@ Credits response: ```json { - "remaining_credits": 4820, - "total_credits": 5000 + "remainingCredits": 4820, + "totalCredits": 5000 } ``` diff --git a/services/crawl.mdx b/services/crawl.mdx new file mode 100644 index 0000000..58b63c6 --- /dev/null +++ b/services/crawl.mdx @@ -0,0 +1,195 @@ +--- +title: 
'Crawl' +description: 'Multi-page website crawling with flexible output formats' +icon: 'spider' +--- + +## Overview + +Crawl is an advanced web crawling service that traverses multiple pages, follows links, and returns content in your preferred format (markdown or HTML). It provides namespaced operations for starting, monitoring, stopping, and resuming crawl jobs. + + +Try Crawl instantly in our [interactive playground](https://scrapegraphai.com/dashboard) + + +## Getting Started + +### Quick Start + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +# Start a crawl +response = client.crawl.start( + "https://example.com", + depth=2, + max_pages=10, + format="markdown", +) +print("Crawl started:", response) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const job = await sgai.crawl.start('https://example.com', { + maxDepth: 2, + maxPages: 10, + format: 'markdown', +}); + +console.log('Crawl started:', job.data.id); + +// Check status +const status = await sgai.crawl.status(job.data.id); + +// Stop / Resume +await sgai.crawl.stop(job.data.id); +await sgai.crawl.resume(job.data.id); +``` + +```bash cURL +curl -X POST https://api.scrapegraphai.com/api/v2/crawl \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer your-api-key" \ + -d '{ + "url": "https://example.com", + "depth": 2, + "max_pages": 10, + "format": "markdown" + }' +``` + + + +#### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| url | string | Yes | The starting URL to crawl. | +| depth | int | No | How many levels deep to follow links. | +| max_pages | int | No | Maximum number of pages to crawl. | +| format | string | No | Output format: `"markdown"` or `"html"`. Default: `"markdown"`. | +| include_patterns | list | No | URL patterns to include (e.g., `["/blog/*"]`). 
| +| exclude_patterns | list | No | URL patterns to exclude (e.g., `["/admin/*"]`). | +| fetch_config | FetchConfig | No | Configuration for page fetching (headers, stealth, etc.). | + + +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) + + +## Managing Crawl Jobs + +### Check Status + +```python +status = client.crawl.status(crawl_id) +print("Status:", status) +``` + +### Stop a Running Crawl + +```python +client.crawl.stop(crawl_id) +``` + +### Resume a Stopped Crawl + +```python +client.crawl.resume(crawl_id) +``` + +## Advanced Usage + +### With FetchConfig + +```python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + +response = client.crawl.start( + "https://example.com", + depth=2, + max_pages=10, + format="markdown", + include_patterns=["/blog/*"], + exclude_patterns=["/admin/*"], + fetch_config=FetchConfig( + render_js=True, + stealth=True, + wait_ms=1000, + headers={"User-Agent": "MyBot"}, + ), +) +``` + +### Async Support + +```python +import asyncio +from scrapegraph_py import AsyncClient + +async def main(): + async with AsyncClient(api_key="your-api-key") as client: + job = await client.crawl.start( + "https://example.com", + depth=2, + max_pages=5, + ) + + status = await client.crawl.status(job["id"]) + print("Crawl status:", status) + +asyncio.run(main()) +``` + +## Key Features + + + + Traverse entire websites following links automatically + + + Get results in markdown or HTML format + + + Start, stop, resume, and monitor crawl jobs + + + Include or exclude pages by URL patterns + + + +## Integration Options + +### Official SDKs +- [Python SDK](/sdks/python) - Perfect for data science and backend applications +- [JavaScript SDK](/sdks/javascript) - Ideal for web applications and Node.js + +### AI Framework Integrations +- [LangChain Integration](/integrations/langchain) - Use Crawl in your LLM workflows +- [LlamaIndex Integration](/integrations/llamaindex) - Build powerful search 
and QA systems + +## Support & Resources + + + + Comprehensive guides and tutorials + + + Detailed API documentation + + + Join our Discord community + + + Check out our open-source projects + + diff --git a/services/extensions/chrome.mdx b/services/extensions/chrome.mdx index 65006d0..07ef294 100644 --- a/services/extensions/chrome.mdx +++ b/services/extensions/chrome.mdx @@ -44,7 +44,7 @@ First, you'll need to get your ScrapeGraphAI API key: Chrome Extension API Key Setup -1. Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +1. Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) 2. Click the ScrapeGraphAI icon in your toolbar 3. Enter your API key in the settings 4. Click Save to store your API key securely diff --git a/services/extensions/firefox.mdx b/services/extensions/firefox.mdx index 05846d3..e1b4597 100644 --- a/services/extensions/firefox.mdx +++ b/services/extensions/firefox.mdx @@ -48,7 +48,7 @@ First, you'll need to get your ScrapeGraphAI API key: Firefox Extension API Key Setup -1. Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +1. Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) 2. Click the ScrapeGraphAI icon in your toolbar 3. Enter your API key in the settings 4. Click Save to store your API key securely diff --git a/services/extract.mdx b/services/extract.mdx new file mode 100644 index 0000000..e06c9a1 --- /dev/null +++ b/services/extract.mdx @@ -0,0 +1,257 @@ +--- +title: 'Extract' +description: 'AI-powered structured data extraction from any webpage' +icon: 'robot' +--- + + + Extract Service + + +## Overview + +Extract is our flagship LLM-powered web scraping service that intelligently extracts structured data from any website. Using advanced LLM models, it understands context and content like a human would, making web data extraction more reliable and efficient than ever. 
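To make "structured data" concrete before the SDK examples, here is a consumer-side sketch of handling an Extract-style response. The `id`/`status`/`result` envelope follows the sample response shown further down this page; the hard-coded values are illustrative stand-ins, not a live API call.

```python
import json

# Stand-in for an Extract response body (envelope shape per the sample
# response on this page; values are illustrative).
raw = """
{
  "id": "sg-req-abc123",
  "status": "completed",
  "result": {
    "company_name": "ScrapeGraphAI",
    "features": ["AI-powered extraction", "Proxy rotation"]
  }
}
"""

payload = json.loads(raw)

# Check the job status before trusting the extracted fields.
if payload["status"] == "completed":
    result = payload["result"]
    print(result["company_name"])   # ScrapeGraphAI
    print(len(result["features"]))  # 2
```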
+ + +Try Extract instantly in our [interactive playground](https://scrapegraphai.com/dashboard) + + +## Getting Started + +### Quick Start + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.extract( + url="https://scrapegraphai.com/", + prompt="Extract info about the company" +) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const { data, requestId } = await sgai.extract( + 'https://scrapegraphai.com', + { prompt: 'What does the company do?' } +); + +console.log(data); +``` + +```bash cURL +curl -X POST https://api.scrapegraphai.com/api/v2/extract \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer your-api-key" \ + -d '{ + "url": "https://example.com", + "prompt": "Extract product details including name, price, and availability." + }' +``` + + + +#### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| url | string | Yes | The URL of the webpage to scrape. | +| prompt | string | Yes | A textual description of what you want to extract. | +| output_schema | object | No | Pydantic or Zod schema for structured response format. | +| fetch_config | FetchConfig | No | Configuration for page fetching (headers, cookies, stealth, etc.). | +| llm_config | LlmConfig | No | Configuration for the AI model (model, temperature, max_tokens, etc.). 
| + + +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) + + + +```json +{ + "id": "sg-req-abc123", + "status": "completed", + "result": { + "company_name": "ScrapeGraphAI", + "description": "ScrapeGraphAI is a powerful AI scraping API designed for efficient web data extraction...", + "features": [ + "Effortless, cost-effective, and AI-powered data extraction", + "Handles proxy rotation and rate limits", + "Supports a wide variety of websites" + ] + } +} +``` + + +## FetchConfig + +Use `FetchConfig` to control how the page is fetched: + +```python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + +response = client.extract( + url="https://example.com", + prompt="Extract the main content", + fetch_config=FetchConfig( + headers={"User-Agent": "MyBot"}, + cookies={"session": "abc123"}, + scrolls=3, + render_js=True, + stealth=True, + wait_ms=2000, + ), +) +``` + +| Parameter | Type | Description | +|-----------|------|-------------| +| headers | dict | Custom HTTP headers to send. | +| cookies | dict | Cookies to include in the request. | +| scrolls | int | Number of page scrolls (0-100). | +| render_js | bool | Render heavy JavaScript before extraction. | +| stealth | bool | Enable stealth mode to avoid bot detection. | +| wait_ms | int | Milliseconds to wait before capturing page content. | +| country | string | Two-letter ISO country code for geo-targeted proxy routing. 
| + +## Custom Schema Example + +Define exactly what data you want to extract: + + + +```python Python +from pydantic import BaseModel, Field +from scrapegraph_py import Client + +class ArticleData(BaseModel): + title: str = Field(description="Article title") + author: str = Field(description="Author name") + content: str = Field(description="Main article content") + publish_date: str = Field(description="Publication date") + +client = Client(api_key="your-api-key") + +response = client.extract( + url="https://example.com/article", + prompt="Extract the article information", + output_schema=ArticleData +) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; +import { z } from 'zod'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const schema = z.object({ + title: z.string().describe('The title of the webpage'), + description: z.string().describe('The description of the webpage'), + summary: z.string().describe('A brief summary of the webpage'), +}); + +const { data } = await sgai.extract( + 'https://scrapegraphai.com/', + { + prompt: 'What does the company do?', + schema: schema, + } +); + +console.log(data); +``` + + + +## Async Support + +For applications requiring asynchronous execution: + + + +```python Python +import asyncio +from scrapegraph_py import AsyncClient + +async def main(): + async with AsyncClient(api_key="your-api-key") as client: + urls = [ + "https://scrapegraphai.com/", + "https://github.com/ScrapeGraphAI/Scrapegraph-ai", + ] + + tasks = [ + client.extract( + url=url, + prompt="Summarize the main content", + ) + for url in urls + ] + + responses = await asyncio.gather(*tasks, return_exceptions=True) + + for i, response in enumerate(responses): + if isinstance(response, Exception): + print(f"Error for {urls[i]}: {response}") + else: + print(f"Result for {urls[i]}: {response}") + +if __name__ == "__main__": + asyncio.run(main()) +``` + + + +## Key Features + + + + Works with any website structure, 
including JavaScript-rendered content + + + Contextual understanding of content for accurate extraction + + + Returns clean, structured data in your preferred format + + + Define custom output schemas using Pydantic or Zod + + + +## Integration Options + +### Official SDKs +- [Python SDK](/sdks/python) - Perfect for data science and backend applications +- [JavaScript SDK](/sdks/javascript) - Ideal for web applications and Node.js + +### AI Framework Integrations +- [LangChain Integration](/integrations/langchain) - Use Extract in your LLM workflows +- [LlamaIndex Integration](/integrations/llamaindex) - Build powerful search and QA systems + +## Support & Resources + + + + Comprehensive guides and tutorials + + + Detailed API documentation + + + Join our Discord community + + + Check out our open-source projects + + diff --git a/services/markdownify.mdx b/services/markdownify.mdx index 74ba35c..b3eb13b 100644 --- a/services/markdownify.mdx +++ b/services/markdownify.mdx @@ -13,7 +13,7 @@ icon: 'markdown' Markdownify is our specialized service that transforms web content into clean, well-formatted markdown. It intelligently preserves the content's structure while removing unnecessary elements, making it perfect for content migration, documentation creation, and knowledge base building. -Try Markdownify instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) +Try Markdownify instantly in our [interactive playground](https://scrapegraphai.com/dashboard) ## Getting Started @@ -78,7 +78,7 @@ just-scrape markdownify https://example.com/article | country_code | string | No | Proxy routing country code (e.g., "us"). | -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) @@ -310,6 +310,6 @@ For detailed API documentation, see: - + Sign up now and get your API key to begin converting web content to clean markdown! 
diff --git a/services/mcp-server.mdx b/services/mcp-server.mdx index 8993e6d..54b12e6 100644 --- a/services/mcp-server.mdx +++ b/services/mcp-server.mdx @@ -29,7 +29,7 @@ A production‑ready Model Context Protocol (MCP) server that connects LLMs to t ## Get Your API Key -Create an account and copy your API key from the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com). +Create an account and copy your API key from the [ScrapeGraph Dashboard](https://scrapegraphai.com/dashboard). --- diff --git a/services/mcp-server/introduction.mdx b/services/mcp-server/introduction.mdx index b7ad1c8..6b8fea8 100644 --- a/services/mcp-server/introduction.mdx +++ b/services/mcp-server/introduction.mdx @@ -53,7 +53,7 @@ The MCP server exposes 8 enterprise-ready tools: - Create an account and copy your API key from the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com) + Create an account and copy your API key from the [ScrapeGraph Dashboard](https://scrapegraphai.com/dashboard) Select your preferred AI assistant: Cursor or Claude Desktop diff --git a/services/mcp-server/smithery.mdx b/services/mcp-server/smithery.mdx index 683da98..bbe550b 100644 --- a/services/mcp-server/smithery.mdx +++ b/services/mcp-server/smithery.mdx @@ -45,7 +45,7 @@ After installation, Smithery will configure your client's MCP settings. You may ## API Key -You'll need a ScrapeGraph API key to use the MCP server. Get one from the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com) if you haven't already. +You'll need a ScrapeGraph API key to use the MCP server. Get one from the [ScrapeGraph Dashboard](https://scrapegraphai.com/dashboard) if you haven't already. 
## Alternative Installation Methods diff --git a/services/monitor.mdx b/services/monitor.mdx new file mode 100644 index 0000000..6fabcb4 --- /dev/null +++ b/services/monitor.mdx @@ -0,0 +1,219 @@ +--- +title: 'Monitor' +description: 'Scheduled web monitoring with AI-powered extraction' +icon: 'clock' +--- + +## Overview + +Monitor enables you to set up recurring web scraping jobs that automatically extract data on a schedule. Create monitors that run on a cron schedule and extract structured data from any webpage. + + +Try Monitor in our [dashboard](https://scrapegraphai.com/dashboard) + + +## Getting Started + +### Quick Start + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +# Create a monitor +monitor = client.monitor.create( + name="Price Tracker", + url="https://example.com/products", + prompt="Extract current product prices", + cron="0 9 * * *", # Daily at 9 AM +) +print("Monitor created:", monitor) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const monitor = await sgai.monitor.create({ + name: 'Price Tracker', + url: 'https://example.com/products', + prompt: 'Extract current product prices', + cron: '0 9 * * *', +}); + +console.log('Monitor created:', monitor.data.id); + +// List / Get / Pause / Resume / Delete +const monitors = await sgai.monitor.list(); +await sgai.monitor.pause(monitor.data.id); +await sgai.monitor.resume(monitor.data.id); +await sgai.monitor.delete(monitor.data.id); +``` + +```bash cURL +curl -X POST https://api.scrapegraphai.com/api/v2/monitor \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer your-api-key" \ + -d '{ + "name": "Price Tracker", + "url": "https://example.com/products", + "prompt": "Extract current product prices", + "cron": "0 9 * * *" + }' +``` + + + +#### Parameters + +| Parameter | Type | Required | Description | 
+|-----------|------|----------|-------------| +| name | string | Yes | A descriptive name for the monitor. | +| url | string | Yes | The URL to monitor. | +| prompt | string | Yes | What data to extract on each run. | +| cron | string | Yes | Cron expression for the schedule (e.g., `"0 9 * * *"` for daily at 9 AM). | +| output_schema | object | No | Pydantic or Zod schema for structured response format. | +| fetch_config | FetchConfig | No | Configuration for page fetching (headers, stealth, etc.). | +| llm_config | LlmConfig | No | Configuration for the AI model (temperature, max_tokens, etc.). | + + +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) + + +## Managing Monitors + +### List All Monitors + +```python +monitors = client.monitor.list() +print("All monitors:", monitors) +``` + +### Get a Specific Monitor + +```python +monitor = client.monitor.get(monitor_id) +print("Monitor details:", monitor) +``` + +### Pause a Monitor + +```python +client.monitor.pause(monitor_id) +``` + +### Resume a Monitor + +```python +client.monitor.resume(monitor_id) +``` + +### Delete a Monitor + +```python +client.monitor.delete(monitor_id) +``` + +## Advanced Usage + +### With Output Schema and Config + +```python +from pydantic import BaseModel, Field +from scrapegraph_py import Client, FetchConfig, LlmConfig + +class ProductPrice(BaseModel): + name: str = Field(description="Product name") + price: float = Field(description="Current price") + in_stock: bool = Field(description="Whether the product is in stock") + +client = Client(api_key="your-api-key") + +monitor = client.monitor.create( + name="Product Price Monitor", + url="https://example.com/products", + prompt="Extract product names, prices, and stock status", + cron="0 */6 * * *", # Every 6 hours + output_schema=ProductPrice, + fetch_config=FetchConfig(stealth=True), + llm_config=LlmConfig(temperature=0.1), +) +``` + +### Async Support + +```python +import asyncio +from scrapegraph_py import 
AsyncClient + +async def main(): + async with AsyncClient(api_key="your-api-key") as client: + monitor = await client.monitor.create( + name="Tracker", + url="https://example.com", + prompt="Extract prices", + cron="0 9 * * *", + ) + + monitors = await client.monitor.list() + print("All monitors:", monitors) + +asyncio.run(main()) +``` + +## Key Features + + + + Run extraction jobs on any cron schedule + + + Use natural language prompts to define what to extract + + + Create, pause, resume, and delete monitors easily + + + Define structured output with Pydantic or Zod schemas + + + +## Common Cron Expressions + +| Expression | Schedule | +|-----------|----------| +| `0 9 * * *` | Daily at 9 AM | +| `0 */6 * * *` | Every 6 hours | +| `0 9 * * 1` | Every Monday at 9 AM | +| `0 0 1 * *` | First day of every month | +| `*/30 * * * *` | Every 30 minutes | + +## Integration Options + +### Official SDKs +- [Python SDK](/sdks/python) - Perfect for data science and backend applications +- [JavaScript SDK](/sdks/javascript) - Ideal for web applications and Node.js + +### AI Framework Integrations +- [LangChain Integration](/integrations/langchain) - Use Monitor in your LLM workflows + +## Support & Resources + + + + Comprehensive guides and tutorials + + + Detailed API documentation + + + Join our Discord community + + + Check out our open-source projects + + diff --git a/services/scrape.mdx b/services/scrape.mdx index f2a2373..7c73afc 100644 --- a/services/scrape.mdx +++ b/services/scrape.mdx @@ -1,6 +1,6 @@ --- title: 'Scrape' -description: 'Extract raw HTML content from web pages with JavaScript rendering support' +description: 'Scrape web pages in markdown, HTML, or screenshot format' icon: 'code' --- @@ -10,10 +10,10 @@ icon: 'code' ## Overview -The Scrape service provides direct access to raw HTML content from web pages, with optional JavaScript rendering support. 
This service is perfect for applications that need the complete HTML structure of a webpage, including dynamically generated content. +The Scrape service fetches web page content and returns it in your preferred format: markdown, HTML, screenshot, or branding. It replaces the previous Markdownify service and supports flexible output through a simple `format` parameter. -Try the Scrape service instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) +Try the Scrape service instantly in our [interactive playground](https://scrapegraphai.com/dashboard) ## Getting Started @@ -24,211 +24,115 @@ Try the Scrape service instantly in our [interactive playground](https://dashboa ```python Python from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger -sgai_logger.set_logging(level="INFO") +client = Client(api_key="your-api-key") -# Initialize the client -sgai_client = Client(api_key="your-api-key") +# Get markdown (default) +response = client.scrape("https://example.com") -# Scrape request -response = sgai_client.htmlify( - website_url="https://example.com", - branding=True # Set to True to extract brand design and metadata -) +# Get HTML +response = client.scrape("https://example.com", format="html") -print("HTML Content:", response.html) -print("Request ID:", response.scrape_request_id) -print("Status:", response.status) -# Optional branding result -if response.branding: - print("Branding extracted") +# Get screenshot +response = client.scrape("https://example.com", format="screenshot") ``` ```javascript JavaScript -import { scrape } from 'scrapegraph-js'; +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); -const apiKey = 'your-api-key'; +// Get markdown (default) +const { data } = await sgai.scrape('https://example.com'); -const response = await scrape(apiKey, { - website_url: 'https://example.com', - branding: true, +// Get HTML +const { data: htmlData } = await 
sgai.scrape('https://example.com', { + format: 'html', }); -if (response.status === 'error') { - console.error('Error:', response.error); -} else { - console.log('HTML Content:', response.data.html); - console.log('Request ID:', response.data.scrape_request_id); - console.log('Status:', response.data.status); - if (response.data.branding) { - console.log('Branding extracted'); - } -} +console.log(data); ``` ```bash cURL -curl -X POST https://api.scrapegraphai.com/v1/scrape \ +curl -X POST https://api.scrapegraphai.com/api/v2/scrape \ -H "Content-Type: application/json" \ - -H "SGAI-APIKEY: your-api-key" \ + -H "Authorization: Bearer your-api-key" \ -d '{ - "website_url": "https://example.com", - "branding": true + "url": "https://example.com", + "format": "markdown" }' ``` -```bash CLI -just-scrape scrape https://example.com --branding -``` - #### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| -| apiKey | string | Yes | The ScrapeGraph API Key. | -| website_url | string | Yes | The URL of the webpage to scrape. | -| branding | boolean | No | Return extracted brand design and metadata. Default: false | -| stealth | boolean | No | Enable stealth mode for anti-bot protection. Adds additional credits. Default: false | -| wait_ms | integer | No | Milliseconds to wait before capturing page content. Default: 3000 | -| country_code | string | No | Two-letter ISO country code for geo-targeted proxy routing (e.g., "us", "gb", "de"). | +| url | string | Yes | The URL of the webpage to scrape. | +| format | string | No | Output format: `"markdown"` (default), `"html"`, `"screenshot"`, or `"branding"`. | +| fetch_config | FetchConfig | No | Configuration for page fetching (headers, stealth, etc.). 
| -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) - -```json -{ - "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a", - "status": "completed", - "html": "<html><head><title>Example Page</title></head><body><h1>Welcome to Example.com</h1><p>This is the raw HTML content...</p></body></html>", - "error": "" -} -``` - -The response includes: - `scrape_request_id`: Unique identifier for tracking your request - `status`: Current status of the scraping operation - `html`: Raw HTML content of the webpage - `error`: Error message (if any occurred during scraping) -
- - + ```json { - "scrape_request_id": "2f0f7a7e-7eb3-4bd2-8f8d-ae8a7f2d9c1a", - "status": "completed", - "html": "...", - "error": "", - "branding": { - "branding": { - "colorScheme": "light", - "colors": { - "primary": "#0B5FFF", - "accent": "#FF8A00", - "background": "#FFFFFF", - "textPrimary": "#111827", - "link": "#0B5FFF" - }, - "fonts": [ - { "family": "Inter", "role": "body" } - ], - "typography": { - "fontFamilies": { "primary": "Inter", "heading": "Inter" }, - "fontStacks": { "heading": ["Inter"], "body": ["Inter"] }, - "fontSizes": { "h1": "32px", "h2": "24px", "body": "16px" } - }, - "spacing": { "baseUnit": 4, "borderRadius": "6px" }, - "components": { - "input": { "borderColor": "#E5E7EB", "borderRadius": "6px" }, - "buttonPrimary": { - "background": "#0B5FFF", - "textColor": "#FFFFFF", - "borderRadius": "6px", - "shadow": "..." - } - }, - "images": { - "logo": "https://example.com/logo.svg", - "favicon": "https://example.com/favicon.ico", - "ogImage": "https://example.com/og.png" - }, - "designSystem": { "framework": "tailwind", "componentLibrary": null }, - "confidence": { "overall": 0.86 } - }, - "metadata": { - "title": "Example", - "language": "en", - "favicon": "https://example.com/favicon.ico" - } + "id": "0d6c4b31-931b-469b-9a7f-2f1e002e79ca", + "format": "markdown", + "content": [ + "# Example Domain\n\nThis domain is for use in documentation examples..." + ], + "metadata": { + "contentType": "text/html" } } ``` - -When `branding=true` is passed, the response includes a `branding` object with brand design data and page metadata. 
-## Key Features - - - - Get complete HTML structure including all elements - - - Optionally extract brand colors, fonts, typography, UI components, images, and metadata - - - Quick extraction for simple HTML content - - - Consistent results across different websites - - - -## Use Cases +## Output Formats -### Web Development -- Extract HTML templates -- Analyze page structure -- Test website rendering -- Debug HTML issues +| Format | Description | +|--------|-------------| +| `markdown` | Clean markdown conversion of the page content (default). | +| `html` | Raw HTML content of the page. | +| `screenshot` | Screenshot image of the rendered page. | +| `branding` | Brand design extraction: colors, fonts, typography, logos. | -### Data Analysis -- Parse HTML content -- Extract specific elements -- Monitor website changes -- Build web scrapers - -### Content Processing -- Process dynamic content -- Handle JavaScript-heavy sites -- Extract embedded data -- Analyze page performance +## Advanced Usage - -Want to learn more about our AI-powered scraping technology? Visit our [main website](https://scrapegraphai.com) to discover how we're revolutionizing web data extraction. 
- +### With FetchConfig -## Advanced Usage +```python +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") + +response = client.scrape( + "https://example.com", + format="markdown", + fetch_config=FetchConfig( + render_js=True, + stealth=True, + wait_ms=2000, + headers={"User-Agent": "MyBot"}, + ), +) +``` ### Async Support -For applications requiring asynchronous execution, the Scrape service provides async support: - ```python -from scrapegraph_py import AsyncClient import asyncio +from scrapegraph_py import AsyncClient async def main(): async with AsyncClient(api_key="your-api-key") as client: - response = await client.htmlify( - website_url="https://example.com" - ) + response = await client.scrape("https://example.com") print(response) -# Run the async function asyncio.run(main()) ``` @@ -239,41 +143,44 @@ Process multiple URLs concurrently for better performance: ```python import asyncio from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") async def main(): - # Initialize async client - sgai_client = AsyncClient(api_key="your-api-key") - - # URLs to scrape - urls = [ - "https://example.com", - "https://scrapegraphai.com/", - "https://github.com/ScrapeGraphAI/Scrapegraph-ai", - ] - - tasks = [sgai_client.htmlify(website_url=url) for url in urls] - - # Execute requests concurrently - responses = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - for i, response in enumerate(responses): - if isinstance(response, Exception): - print(f"\nError for {urls[i]}: {response}") - else: - print(f"\nPage {i+1} HTML:") - print(f"URL: {urls[i]}") - print(f"HTML Length: {len(response['html'])} characters") - - await sgai_client.close() - -if __name__ == "__main__": - asyncio.run(main()) + async with AsyncClient(api_key="your-api-key") as client: + urls = [ + "https://example.com", + "https://scrapegraphai.com/", + 
"https://github.com/ScrapeGraphAI/Scrapegraph-ai", + ] + + tasks = [client.scrape(url) for url in urls] + responses = await asyncio.gather(*tasks, return_exceptions=True) + + for i, response in enumerate(responses): + if isinstance(response, Exception): + print(f"Error for {urls[i]}: {response}") + else: + print(f"Page {i+1}: {len(str(response))} characters") + +asyncio.run(main()) ``` +## Key Features + + + + Get content as markdown, HTML, screenshot, or branding data + + + Handle JavaScript-heavy sites with render_js support + + + Quick extraction for simple content + + + Consistent results across different websites + + + ## Integration Options ### Official SDKs @@ -284,37 +191,6 @@ if __name__ == "__main__": - [LangChain Integration](/integrations/langchain) - Use Scrape in your content pipelines - [LlamaIndex Integration](/integrations/llamaindex) - Create searchable knowledge bases -## Best Practices - -### Performance Optimization -1. Process multiple URLs concurrently -3. Cache results when possible -4. Monitor API usage and costs - -### Error Handling -- Always check the `status` field -- Handle network timeouts gracefully -- Implement retry logic for failed requests -- Log errors for debugging - -### Content Processing -- Validate HTML structure before parsing -- Handle different character encodings -- Extract only needed content sections -- Clean up HTML for further processing - -## Example Projects - -Check out our [cookbook](/cookbook/introduction) for real-world examples: -- Web scraping automation tools -- Content monitoring systems -- HTML analysis applications -- Dynamic content extractors - -## API Reference - -For detailed API documentation, see the [API Reference](/api-reference/introduction). 
- ## Support & Resources @@ -330,11 +206,8 @@ For detailed API documentation, see the [API Reference](/api-reference/introduct Check out our open-source projects - - Visit our official website - - + Sign up now and get your API key to begin scraping web content! diff --git a/services/search.mdx b/services/search.mdx new file mode 100644 index 0000000..0ad6668 --- /dev/null +++ b/services/search.mdx @@ -0,0 +1,172 @@ +--- +title: 'Search' +description: 'AI-powered web search with structured data extraction' +icon: 'magnifying-glass' +--- + +## Overview + +Search enables you to search the web and extract structured results using AI. It combines web search capabilities with intelligent data extraction, returning clean, structured data from search results. + + +Try Search instantly in our [interactive playground](https://scrapegraphai.com/dashboard) + + +## Getting Started + +### Quick Start + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.search( + query="What are the key features of ChatGPT Plus?" +) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const { data } = await sgai.search( + 'What are the key features of ChatGPT Plus?' +); + +console.log(data); +``` + +```bash cURL +curl -X POST https://api.scrapegraphai.com/api/v2/search \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer your-api-key" \ + -d '{ + "query": "What are the key features of ChatGPT Plus?" + }' +``` + + + +#### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| query | string | Yes | The search query to execute. | +| num_results | int | No | Number of search results to return (3-20). Default: 5. | +| output_schema | object | No | Pydantic or Zod schema for structured response format. 
| +| llm_config | LlmConfig | No | Configuration for the AI model (model, temperature, max_tokens, etc.). | + + +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) + + +## Custom Schema Example + +Define the structure of the output using Pydantic or Zod: + + + +```python Python +from pydantic import BaseModel, Field +from scrapegraph_py import Client + +class SearchResult(BaseModel): + title: str = Field(description="The result title") + summary: str = Field(description="Brief summary of the result") + url: str = Field(description="Source URL") + +client = Client(api_key="your-api-key") + +response = client.search( + query="Latest AI developments 2024", + num_results=10, + output_schema=SearchResult, +) +``` + +```javascript JavaScript +import { scrapegraphai } from 'scrapegraph-js'; +import { z } from 'zod'; + +const sgai = scrapegraphai({ apiKey: 'your-api-key' }); + +const schema = z.object({ + title: z.string().describe('The result title'), + summary: z.string().describe('Brief summary'), + url: z.string().describe('Source URL'), +}); + +const { data } = await sgai.search( + 'Latest AI developments 2024', + { + schema: schema, + numResults: 10, + } +); +``` + + + +## Async Support + +```python +import asyncio +from scrapegraph_py import AsyncClient + +async def main(): + async with AsyncClient(api_key="your-api-key") as client: + response = await client.search( + query="Best practices for web scraping" + ) + print(response) + +asyncio.run(main()) +``` + +## Key Features + + + + Intelligent extraction from search results + + + Returns clean, structured data in your preferred format + + + Define custom output schemas using Pydantic or Zod + + + Control the number of search results returned + + + +## Integration Options + +### Official SDKs +- [Python SDK](/sdks/python) - Perfect for data science and backend applications +- [JavaScript SDK](/sdks/javascript) - Ideal for web applications and Node.js + +### AI Framework Integrations +- 
[LangChain Integration](/integrations/langchain) - Use Search in your LLM workflows +- [LlamaIndex Integration](/integrations/llamaindex) - Build powerful search and QA systems + +## Support & Resources + + + + Comprehensive guides and tutorials + + + Detailed API documentation + + + Join our Discord community + + + Check out our open-source projects + + diff --git a/services/searchscraper.mdx b/services/searchscraper.mdx index e4127d0..cdede96 100644 --- a/services/searchscraper.mdx +++ b/services/searchscraper.mdx @@ -17,7 +17,7 @@ SearchScraper offers two modes: - **Markdown Mode**: Returns raw markdown content from scraped pages (2 credits per page) -Try SearchScraper instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) +Try SearchScraper instantly in our [interactive playground](https://scrapegraphai.com/dashboard) ## Getting Started @@ -931,7 +931,7 @@ For detailed API documentation, see: - + Sign up now and get your API key to begin searching and extracting data with SearchScraper! diff --git a/services/sitemap.mdx b/services/sitemap.mdx index 3db5765..5b1fbac 100644 --- a/services/sitemap.mdx +++ b/services/sitemap.mdx @@ -13,7 +13,7 @@ icon: 'sitemap' Sitemap is our service that extracts all URLs from a website's sitemap.xml file automatically. The API discovers the sitemap from robots.txt, common locations like /sitemap.xml, or sitemap index files—perfect for discovering pages for bulk scraping, content inventory, or combining with other endpoints. -Try Sitemap instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) - no coding required! +Try Sitemap instantly in our [interactive playground](https://scrapegraphai.com/dashboard) - no coding required! ## Getting Started @@ -78,7 +78,7 @@ just-scrape sitemap https://scrapegraphai.com | stealth | boolean | No | Enable stealth mode for anti-bot protection. Adds +4 credits. 
Default: false | -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) diff --git a/services/smartcrawler.mdx b/services/smartcrawler.mdx index f84670a..0a7dbb2 100644 --- a/services/smartcrawler.mdx +++ b/services/smartcrawler.mdx @@ -14,7 +14,7 @@ SmartCrawler is our advanced web crawling service that offers two modes: Unlike SmartScraper, which extracts data from a single page, SmartCrawler can traverse multiple pages, follow links, and either extract structured data or convert content to clean markdown from entire websites or sections. -Try SmartCrawler instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) +Try SmartCrawler instantly in our [interactive playground](https://scrapegraphai.com/dashboard) ## Getting Started @@ -116,7 +116,7 @@ just-scrape crawl https://scrapegraphai.com/ -p "Extract info about the company" -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) ## Markdown Conversion Mode @@ -578,7 +578,7 @@ SmartCrawler supports webhook notifications for job completion. When you provide ### Setting Up Webhooks -1. Configure your webhook secret in the [dashboard](https://dashboard.scrapegraphai.com) +1. Configure your webhook secret in the [dashboard](https://scrapegraphai.com/dashboard) 2. Provide a `webhook_url` in your crawl request 3. Verify incoming webhooks using the `X-Webhook-Signature` header @@ -617,6 +617,6 @@ For detailed API documentation, see: - + Sign up now and get your API key to begin extracting data with SmartCrawler! 
\ No newline at end of file diff --git a/services/smartscraper.mdx b/services/smartscraper.mdx index ce9f208..d10eb70 100644 --- a/services/smartscraper.mdx +++ b/services/smartscraper.mdx @@ -13,7 +13,7 @@ icon: 'robot' SmartScraper is our flagship LLM-powered web scraping service that intelligently extracts structured data from any website. Using advanced LLM models, it understands context and content like a human would, making web data extraction more reliable and efficient than ever. -Try SmartScraper instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) +Try SmartScraper instantly in our [interactive playground](https://scrapegraphai.com/dashboard) ## Getting Started @@ -83,7 +83,7 @@ just-scrape smart-scraper https://scrapegraphai.com/ -p "Extract info about the | country_code | string | No | Proxy routing country code (e.g., "us"). | -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) +Get your API key from the [dashboard](https://scrapegraphai.com/dashboard) @@ -632,6 +632,6 @@ For detailed API documentation, see: - + Sign up now and get your API key to begin extracting data with SmartScraper! diff --git a/transition-from-v1-to-v2.mdx b/transition-from-v1-to-v2.mdx new file mode 100644 index 0000000..5e4a8f3 --- /dev/null +++ b/transition-from-v1-to-v2.mdx @@ -0,0 +1,203 @@ +--- +title: Transition Guide from v1 to v2 +description: Move from v1 to v2 quickly and safely +--- + +## Transition from v1 to v2 + +If you are coming from the legacy v1 docs, use this page as your migration checkpoint. + +Before anything else, log in to the dashboard at [scrapegraphai.com/login](https://scrapegraphai.com/login). + +## Method-by-method migration + +Use this table to map old entry points to new ones. Details and examples follow below. 
+ +| v1 | v2 | Notes | +|----|-----|------| +| `markdownify` | **`scrape`** with `format="markdown"` (Python) or `format: "markdown"` (JS) | HTML → markdown and related “raw page” outputs live under **`scrape`**. | +| `smartscraper` / `smartScraper` | **`extract`** | Same job: structured extraction from a URL. Rename params and pass extra fetch/LLM options via config objects. | +| `searchscraper` / `searchScraper` | **`search`** | Web search + extraction; use `query` (or positional string in JS). | +| `crawl` (single start call) | **`crawl.start`**, then **`crawl.status`**, **`crawl.stop`**, **`crawl.resume`** | Crawl is explicitly async: you poll or track job id. | +| Monitors (if you used them) | **`monitor.create`**, **`monitor.list`**, **`monitor.get`**, pause/resume/delete | Same product, namespaced API. | +| `sitemap` | **Removed from v2 SDKs** | Discover URLs with **`crawl.start`** and URL patterns, or call the REST sitemap endpoint if your integration still requires it—see [Sitemap](/services/sitemap) and SDK release notes. | +| `agenticscraper` | **Removed** | Use **`extract`** with `FetchConfig` (e.g. `render_js`, `wait_ms`, `stealth`) for hard pages, or **`crawl.start`** for multi-page flows. | +| `healthz` / `checkHealth`, `feedback`, built-in mock helpers | **Removed or changed** | Use **`credits`**, **`history`**, and dashboard features; check the SDK migration guides for replacements. | + +## Code-level transition + +### 1. Markdownify → `scrape` + +**Before:** `markdownify(url)`. + +**After:** `scrape(url, format="markdown")` (Python) or `scrape(url, { format: "markdown" })` (JS). 
+ + + +```python Python (v2) +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") +response = client.scrape( + "https://example.com", + format="markdown", +) + +print(response) +``` + +```javascript JavaScript (v2) +import { scrapegraphai } from "scrapegraph-js"; + +const sgai = scrapegraphai({ apiKey }); +const { data, requestId } = await sgai.scrape("https://example.com", { + format: "markdown", +}); + +console.log(data); +``` + + + +### 2. SmartScraper → `extract` + +**Before (v1):** `website_url` + `user_prompt`, optional flags on the same object. + +**After (v2):** `url` + `prompt`; move fetch-related flags into `FetchConfig` / `fetchConfig`. + + + +```python Python (v1) +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract the title and price", + stealth=True, +) +``` + +```python Python (v2) +from scrapegraph_py import Client, FetchConfig + +client = Client(api_key="your-api-key") +response = client.extract( + url="https://example.com", + prompt="Extract the title and price", + fetch_config=FetchConfig(stealth=True), +) +``` + +```javascript JavaScript (v1) +import { smartScraper } from "scrapegraph-js"; + +const response = await smartScraper(apiKey, { + website_url: "https://example.com", + user_prompt: "Extract the title and price", + stealth: true, +}); +``` + +```javascript JavaScript (v2) +import { scrapegraphai } from "scrapegraph-js"; + +const sgai = scrapegraphai({ apiKey }); +const { data, requestId } = await sgai.extract("https://example.com", { + prompt: "Extract the title and price", + fetchConfig: { stealth: true }, +}); +``` + + + +### 3. SearchScraper → `search` + +**Before:** `searchscraper` / `searchScraper` with a prompt-style query. + +**After:** `search` with `query` (Python keyword argument; JS first argument is the query string). 
+ + + +```python Python (v2) +response = client.search( + query="Latest pricing for product X", + num_results=5, +) +``` + +```javascript JavaScript (v2) +const { data } = await sgai.search("Latest pricing for product X", { + numResults: 5, +}); +``` + + + +### 4. Crawl jobs + +**Before:** One-shot `crawl(...)` style usage depending on SDK version. + +**After:** Start a job, then poll or webhook as documented: + + + +```python Python (v2) +job = client.crawl.start( + url="https://example.com", + depth=2, + include_patterns=["/blog/*"], + exclude_patterns=["/admin/*"], +) +status = client.crawl.status(job["id"]) +``` + +```javascript JavaScript (v2) +const job = await sgai.crawl.start("https://example.com", { + maxDepth: 2, + includePatterns: ["/blog/*"], + excludePatterns: ["/admin/*"], +}); +const status = await sgai.crawl.status(job.data.id); +``` + + + +### 5. REST calls + +If you call the API with `curl` or a generic HTTP client: + +- Use the v2 host and path pattern: **`https://api.scrapegraphai.com/api/v2/`** (e.g. `/api/v2/scrape`, `/api/v2/extract`, `/api/v2/search`, `/api/v2/crawl`, `/api/v2/monitor`). +- Replace JSON fields to match v2 bodies (e.g. `url` and `prompt` instead of `website_url` and `user_prompt` on extract). +- Keep using the **`SGAI-APIKEY`** header unless the endpoint docs specify otherwise. + +Exact paths and payloads are listed under each service (for example [Scrape](/services/scrape)) and in the [API reference](/api-reference/introduction). + +## What else changed in v2 (docs & product) + +- Unified and clearer API documentation +- Updated service pages and endpoint organization +- New guides for MCP server and SDK usage + +## Recommended path + +1. Log in at [scrapegraphai.com/login](https://scrapegraphai.com/login) +2. Start from [Introduction](/introduction) +3. Follow [Installation](/install) +4. 
Upgrade packages: `pip install -U scrapegraph-py` / `npm i scrapegraph-js@latest` (Node **≥ 22** for JS v2) + +## SDK migration guides (detailed changelogs) + +- [Python SDK Migration Guide](https://github.com/ScrapeGraphAI/scrapegraph-py/blob/main/MIGRATION_V2.md) +- [JavaScript SDK Migration Guide](https://github.com/ScrapeGraphAI/scrapegraph-js/blob/main/MIGRATION.md) + +Full method documentation: + +- [Python SDK](/sdks/python) +- [JavaScript SDK](/sdks/javascript) + +## Legacy v1 docs + +You can still access v1 documentation here: + +- [v1 Introduction](/v1/introduction) diff --git a/use-cases/ai-llm.mdx b/use-cases/ai-llm.mdx index 15f8607..6387cee 100644 --- a/use-cases/ai-llm.mdx +++ b/use-cases/ai-llm.mdx @@ -33,13 +33,13 @@ class ArticleSchema(BaseModel): summary: Optional[str] = Field(description="Article summary or description") # Initialize the client -client = Client() +client = Client(api_key="your-api-key") try: # Scrape relevant content - response = client.smartscraper( - website_url="https://example.com/article", - user_prompt="Extract the main article content, title, author, and publication date", + response = client.extract( + url="https://example.com/article", + prompt="Extract the main article content, title, author, and publication date", output_schema=ArticleSchema ) @@ -72,15 +72,14 @@ class ResearchResults(BaseModel): articles: List[ResearchData] # Initialize the client -client = Client() +client = Client(api_key="your-api-key") try: # Search and scrape multiple sources - search_results = client.searchscraper( - user_prompt="What are the latest developments in artificial intelligence?", + search_results = client.search( + query="What are the latest developments in artificial intelligence?", output_schema=ResearchResults, - num_results=5, # Number of websites to search (3-20) - extraction_mode=True # Use AI extraction mode for structured data + num_results=5 ) # Process with your AI model diff --git a/use-cases/content-aggregation.mdx 
b/use-cases/content-aggregation.mdx index ba77e83..16a6061 100644 --- a/use-cases/content-aggregation.mdx +++ b/use-cases/content-aggregation.mdx @@ -43,7 +43,7 @@ class NewsAggregationResult(BaseModel): sources: List[str] = Field(description="List of source URLs") timestamp: str = Field(description="Aggregation timestamp") -client = Client() +client = Client(api_key="your-api-key") # Define news sources to aggregate news_sources = [ @@ -56,9 +56,9 @@ news_sources = [ aggregated_results = [] for source in news_sources: - response = client.smartscraper( - website_url=source, - user_prompt="Extract all news articles from the homepage, including title, content, author, publication date, category, and tags. Also extract featured images if available.", + response = client.extract( + url=source, + prompt="Extract all news articles from the homepage, including title, content, author, publication date, category, and tags. Also extract featured images if available.", output_schema=NewsAggregationResult ) aggregated_results.append(response) @@ -110,50 +110,47 @@ class BlogMonitorResult(BaseModel): blog_url: str = Field(description="Blog homepage URL") last_updated: str = Field(description="Last monitoring timestamp") -client = Client() +client = Client(api_key="your-api-key") # Start the crawler job -job = client.smartcrawler_initiate( +job = client.crawl.start( url="https://example-blog.com", - user_prompt="Extract all blog posts from the last 7 days, including title, content, author, publication date, categories, and metadata. 
Calculate estimated reading time based on content length.", - extraction_mode="ai", depth=2, # Crawl up to 2 levels deep - same_domain_only=True + include_patterns=["/blog/*"], + exclude_patterns=["/tag/*", "/author/*"] ) # Wait for job completion and get results -job_id = job.job_id +job_id = job["id"] +response = None # default so the check below is safe if the job fails while True: - status = client.smartcrawler_get_status(job_id) - if status.state == "completed": - response = status.result + status = client.crawl.status(job_id) + if status.get("status") == "completed": + response = status.get("data", {}) break - elif status.state in ["failed", "cancelled"]: - print(f"Job failed: {status.error}") + elif status.get("status") in ["failed", "cancelled", "error"]: + print(f"Job failed: {status.get('error')}") break time.sleep(5) # Wait 5 seconds before checking again # Process the crawled content if successful -if response: - print(f"Blog: {response.blog_url}") - print(f"Total Posts Found: {response.total_posts}") - print(f"Last Updated: {response.last_updated}\n") +if response and response.get("pages"): + print(f"Total Pages Found: {len(response['pages'])}") - for post in response.posts: + for post in response["pages"]: # Check if post is recent - post_date = datetime.strptime(post.publication_date, "%Y-%m-%d") + post_date = datetime.strptime(post["publication_date"], "%Y-%m-%d") if post_date > datetime.now() - timedelta(days=7): - print(f"Title: {post.title}") - print(f"Author: {post.author}") - print(f"Published: {post.publication_date}") - print(f"Reading Time: {post.reading_time} minutes") - if post.categories: - print(f"Categories: {', '.join(post.categories)}") - if post.tags: - print(f"Tags: {', '.join(post.tags)}") - if post.excerpt: - print(f"\nExcerpt: {post.excerpt}") - print(f"URL: {post.url}\n") + print(f"Title: {post['title']}") + print(f"Author: {post.get('author', 'Unknown')}") + print(f"Published: {post['publication_date']}") + print(f"Reading Time: {post.get('reading_time', 'N/A')} minutes") + if
post.get("categories"): + print(f"Categories: {', '.join(post['categories'])}") + if post.get("tags"): + print(f"Tags: {', '.join(post['tags'])}") + if post.get("excerpt"): + print(f"\nExcerpt: {post['excerpt']}") + print(f"URL: {post['url']}\n") ``` ## Best Practices diff --git a/use-cases/lead-generation.mdx b/use-cases/lead-generation.mdx index 34bf61f..9ab551d 100644 --- a/use-cases/lead-generation.mdx +++ b/use-cases/lead-generation.mdx @@ -39,9 +39,9 @@ class CompanyContacts(BaseModel): client = Client(api_key="your-api-key") # Scrape company website -response = client.smartscraper( - website_url="https://company.com/about", - user_prompt="Extract all contact information for decision makers and leadership team", +response = client.extract( + url="https://company.com/about", + prompt="Extract all contact information for decision makers and leadership team", output_schema=CompanyContacts ) @@ -79,11 +79,10 @@ client = Client(api_key="your-api-key") try: # Search for businesses in a specific category - search_results = client.searchscraper( - user_prompt="Find software companies in San Francisco with their contact details", + search_results = client.search( + query="Find software companies in San Francisco with their contact details", output_schema=BusinessSearchResults, - num_results=10, # Number of websites to search (3-20) - extraction_mode=True # Use AI extraction mode for structured data + num_results=10 ) # Extract and validate leads @@ -94,9 +93,9 @@ try: try: # Get more detailed information from company website - details = client.smartscraper( - website_url=business.website, - user_prompt="Extract detailed company information including team size, tech stack, and all contact methods", + details = client.extract( + url=business.website, + prompt="Extract detailed company information including team size, tech stack, and all contact methods", output_schema=CompanyContacts # Defined earlier in the file ) diff --git a/use-cases/market-intelligence.mdx 
b/use-cases/market-intelligence.mdx index fc1ce56..8ee46b3 100644 --- a/use-cases/market-intelligence.mdx +++ b/use-cases/market-intelligence.mdx @@ -43,9 +43,9 @@ class PriceMonitorResult(BaseModel): client = Client() # Monitor competitor prices -response = client.smartscraper( - website_url="https://competitor-store.com/category/products", - user_prompt="Extract pricing information for all products including name, current price, original price if available, and availability status", +response = client.extract( + url="https://competitor-store.com/category/products", + prompt="Extract pricing information for all products including name, current price, original price if available, and availability status", output_schema=PriceMonitorResult ) @@ -86,8 +86,8 @@ class TrendAnalysisResult(BaseModel): client = Client() # Search and analyze market trends -response = client.searchscraper( - user_prompt="Analyze market trends and sentiment in the electric vehicle industry. Focus on pricing trends, consumer preferences, and technological advancements.", +response = client.search( + query="Analyze market trends and sentiment in the electric vehicle industry. Focus on pricing trends, consumer preferences, and technological advancements.", num_results=10, # Number of sources to analyze output_schema=TrendAnalysisResult ) diff --git a/use-cases/research-analysis.mdx b/use-cases/research-analysis.mdx index 062db98..0aed058 100644 --- a/use-cases/research-analysis.mdx +++ b/use-cases/research-analysis.mdx @@ -47,8 +47,8 @@ class ResearchCollectionResult(BaseModel): client = Client() # Search and collect research papers -response = client.searchscraper( - user_prompt="Find recent research papers on machine learning applications in healthcare, focusing on papers published in the last year. 
Extract complete paper details including abstract, citations, and DOI.", +response = client.search( + query="Find recent research papers on machine learning applications in healthcare, focusing on papers published in the last year. Extract complete paper details including abstract, citations, and DOI.", num_results=15, # Number of papers to collect output_schema=ResearchCollectionResult ) @@ -115,9 +115,9 @@ class IndustryAnalysis(BaseModel): client = Client() # Collect industry analysis data -response = client.smartscraper( - website_url="https://industry-research-site.com/sector-analysis", - user_prompt="Extract comprehensive industry analysis including detailed market metrics, company profiles, trends, and regulatory factors. Focus on quantitative data where available.", +response = client.extract( + url="https://industry-research-site.com/sector-analysis", + prompt="Extract comprehensive industry analysis including detailed market metrics, company profiles, trends, and regulatory factors. Focus on quantitative data where available.", output_schema=IndustryAnalysis ) diff --git a/use-cases/seo-analytics.mdx b/use-cases/seo-analytics.mdx index 3201d77..cfa5bc8 100644 --- a/use-cases/seo-analytics.mdx +++ b/use-cases/seo-analytics.mdx @@ -63,9 +63,9 @@ target_keywords = [ for keyword in target_keywords: # Analyze SERP data - response = client.smartscraper( - website_url=f"https://www.google.com/search?q={keyword}", - user_prompt="Extract detailed search results including positions, titles, descriptions, and all rich results. Also analyze ad presence and total result counts.", + response = client.extract( + url=f"https://www.google.com/search?q={keyword}", + prompt="Extract detailed search results including positions, titles, descriptions, and all rich results. 
Also analyze ad presence and total result counts.", output_schema=SERPAnalysis ) @@ -151,9 +151,9 @@ target_urls = [ for url in target_urls: # Extract content metrics - response = client.smartscraper( - website_url=url, - user_prompt="Perform comprehensive content analysis including meta tags, headings structure, internal/external links, and structured data. Calculate content quality score based on best practices.", + response = client.extract( + url=url, + prompt="Perform comprehensive content analysis including meta tags, headings structure, internal/external links, and structured data. Calculate content quality score based on best practices.", output_schema=ContentMetrics ) diff --git a/v1/additional-parameters/headers.mdx b/v1/additional-parameters/headers.mdx new file mode 100644 index 0000000..b202338 --- /dev/null +++ b/v1/additional-parameters/headers.mdx @@ -0,0 +1,23 @@ +--- +title: 'Custom Headers' +description: 'Pass custom HTTP headers with your requests (v1)' +icon: 'heading' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, use `FetchConfig(headers={...})`. See the [v2 documentation](/services/additional-parameters/headers). + + +## Custom Headers + +Pass custom HTTP headers with your scraping requests: + +```python +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract data", + headers={"Authorization": "Bearer token", "Accept-Language": "en-US"} +) +``` + +For v2 usage with `FetchConfig`, see the [v2 documentation](/services/additional-parameters/headers). diff --git a/v1/additional-parameters/pagination.mdx b/v1/additional-parameters/pagination.mdx new file mode 100644 index 0000000..a32d8dd --- /dev/null +++ b/v1/additional-parameters/pagination.mdx @@ -0,0 +1,15 @@ +--- +title: 'Pagination' +description: 'Handle paginated content (v1)' +icon: 'arrow-right' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 documentation](/services/additional-parameters/pagination). 
+ + +## Pagination + +Handle paginated content using the `number_of_scrolls` parameter or by specifying pagination logic in your prompt. + +For v2 usage, see the [v2 documentation](/services/additional-parameters/pagination). diff --git a/v1/additional-parameters/proxy.mdx b/v1/additional-parameters/proxy.mdx new file mode 100644 index 0000000..57744d9 --- /dev/null +++ b/v1/additional-parameters/proxy.mdx @@ -0,0 +1,23 @@ +--- +title: 'Proxy' +description: 'Route requests through specific countries (v1)' +icon: 'shield' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, use `FetchConfig(country="us")`. See the [v2 documentation](/services/additional-parameters/proxy). + + +## Proxy Routing + +Route scraping requests through proxies in specific countries using the `country_code` parameter: + +```python +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract data", + country_code="us" +) +``` + +For v2 usage with `FetchConfig`, see the [v2 documentation](/services/additional-parameters/proxy). diff --git a/v1/additional-parameters/wait-ms.mdx b/v1/additional-parameters/wait-ms.mdx new file mode 100644 index 0000000..93fbbe6 --- /dev/null +++ b/v1/additional-parameters/wait-ms.mdx @@ -0,0 +1,23 @@ +--- +title: 'Wait Time' +description: 'Configure page load wait time (v1)' +icon: 'clock' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, use `FetchConfig(wait_ms=3000)`. See the [v2 documentation](/services/additional-parameters/wait-ms). + + +## Wait Time + +Configure how long to wait for the page to load before scraping: + +```python +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract data", + wait_ms=5000 # Wait 5 seconds +) +``` + +For v2 usage with `FetchConfig`, see the [v2 documentation](/services/additional-parameters/wait-ms). 
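For migration reference, here is a sketch of the v2 configuration hinted at in the note above. The `extract()` method name appears in the v2 docs; the import path for `FetchConfig` and the `fetch_config` keyword are assumptions, so confirm both against the v2 page linked above:

```python
from scrapegraph_py import Client
# Assumption: FetchConfig is exported from the package root in the v2 SDK.
from scrapegraph_py import FetchConfig

client = Client(api_key="your-api-key")

# In v2 the page-load wait moves into a FetchConfig object
# rather than a top-level wait_ms argument.
response = client.extract(
    url="https://example.com",
    prompt="Extract data",
    fetch_config=FetchConfig(wait_ms=5000),  # wait 5 seconds, matching the v1 example
)
```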
diff --git a/v1/agenticscraper.mdx b/v1/agenticscraper.mdx new file mode 100644 index 0000000..9b1ee49 --- /dev/null +++ b/v1/agenticscraper.mdx @@ -0,0 +1,39 @@ +--- +title: 'AgenticScraper' +description: 'Agent-based multi-step scraping (v1)' +icon: 'robot' +--- + + +You are viewing the **v1 (legacy)** documentation. AgenticScraper has been removed in v2. Use `extract()` with `FetchConfig` for advanced scraping, or `crawl.start()` for multi-page extraction. See the [v2 documentation](/services/agenticscraper). + + +## Overview + +AgenticScraper uses an AI agent to perform multi-step scraping operations, navigating through pages and interacting with elements as needed. + +## Usage + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.agenticscraper( + website_url="https://example.com", + user_prompt="Navigate to the pricing page and extract all plan details" +) +``` + +```javascript JavaScript +import { agenticScraper } from "scrapegraph-js"; + +const response = await agenticScraper(apiKey, { + website_url: "https://example.com", + user_prompt: "Navigate to the pricing page and extract all plan details", +}); +``` + + diff --git a/v1/api-reference/introduction.mdx b/v1/api-reference/introduction.mdx new file mode 100644 index 0000000..660053e --- /dev/null +++ b/v1/api-reference/introduction.mdx @@ -0,0 +1,51 @@ +--- +title: 'API Reference' +description: 'ScrapeGraphAI v1 API Reference' +icon: 'book' +--- + + +You are viewing the **v1 (legacy)** API documentation. The v1 API uses `/v1/*` endpoints. Please migrate to the [v2 API](/api-reference/introduction) which uses `/api/v2/*` endpoints. 
+ + +## Base URL + +``` +https://api.scrapegraphai.com/v1 +``` + +## Authentication + +All v1 API requests require the `SGAI-APIKEY` header: + +```bash +curl -X POST "https://api.scrapegraphai.com/v1/smartscraper" \ + -H "SGAI-APIKEY: your-api-key" \ + -H "Content-Type: application/json" \ + -d '{"website_url": "https://example.com", "user_prompt": "Extract data"}' +``` + + +In v2, authentication uses the `Authorization: Bearer` header instead. + + +## v1 Endpoints + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/v1/smartscraper` | POST | Start a SmartScraper job | +| `/v1/smartscraper/{id}` | GET | Get SmartScraper job status | +| `/v1/searchscraper` | POST | Start a SearchScraper job | +| `/v1/searchscraper/{id}` | GET | Get SearchScraper job status | +| `/v1/markdownify` | POST | Start a Markdownify job | +| `/v1/markdownify/{id}` | GET | Get Markdownify job status | +| `/v1/smartcrawler` | POST | Start a SmartCrawler job | +| `/v1/smartcrawler/{id}` | GET | Get SmartCrawler job status | +| `/v1/sitemap` | POST | Start a Sitemap job | +| `/v1/sitemap/{id}` | GET | Get Sitemap job status | +| `/v1/credits` | GET | Get remaining credits | +| `/v1/feedback` | POST | Submit feedback | + +## Migration to v2 + +See the [v2 API Reference](/api-reference/introduction) for the latest endpoints and authentication methods. diff --git a/v1/cli/ai-agent-skill.mdx b/v1/cli/ai-agent-skill.mdx new file mode 100644 index 0000000..eea9b7e --- /dev/null +++ b/v1/cli/ai-agent-skill.mdx @@ -0,0 +1,15 @@ +--- +title: 'AI Agent Skill' +description: 'Use CLI as an AI agent skill (v1)' +icon: 'robot' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 AI Agent Skill documentation](/services/cli/ai-agent-skill). + + +## Overview + +The ScrapeGraphAI CLI can be used as a skill within AI agent frameworks, enabling agents to scrape and extract web data. + +For detailed usage, see the [v2 documentation](/services/cli/ai-agent-skill). 
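As a concrete sketch of the pattern described above — an agent shelling out to the CLI and consuming machine-readable output — the wrapper below invokes `sgai smartscraper` with the `--json` flag (both shown on the CLI pages in this section). The shape of the JSON response is not documented here, so the parsing step is an assumption:

```python
import json
import shutil
import subprocess

def build_sgai_command(url: str, prompt: str) -> list[str]:
    # --json requests machine-readable output (see the JSON Mode page).
    return ["sgai", "smartscraper", "--url", url, "--prompt", prompt, "--json"]

def run_skill(url: str, prompt: str) -> dict:
    """Invoke the CLI and parse its JSON output for the calling agent."""
    if shutil.which("sgai") is None:
        raise RuntimeError("sgai CLI not found; install with `pip install scrapegraph-py`")
    completed = subprocess.run(
        build_sgai_command(url, prompt),
        capture_output=True, text=True, check=True,
    )
    # Assumption: with --json, stdout is a single JSON document.
    return json.loads(completed.stdout)
```

An agent framework would register `run_skill` as a tool; `build_sgai_command` is split out so the invocation can be inspected or logged before running.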
diff --git a/v1/cli/commands.mdx b/v1/cli/commands.mdx new file mode 100644 index 0000000..6eddc1a --- /dev/null +++ b/v1/cli/commands.mdx @@ -0,0 +1,20 @@ +--- +title: 'CLI Commands' +description: 'Available CLI commands (v1)' +icon: 'terminal' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 CLI commands](/services/cli/commands). + + +## Available Commands + +| Command | Description | +|---------|-------------| +| `sgai smartscraper` | Extract data from a webpage using AI | +| `sgai searchscraper` | Search and extract from multiple sources | +| `sgai markdownify` | Convert webpage to markdown | +| `sgai credits` | Check remaining API credits | + +For detailed usage, see the [v2 CLI documentation](/services/cli/commands). diff --git a/v1/cli/examples.mdx b/v1/cli/examples.mdx new file mode 100644 index 0000000..af5365f --- /dev/null +++ b/v1/cli/examples.mdx @@ -0,0 +1,31 @@ +--- +title: 'CLI Examples' +description: 'CLI usage examples (v1)' +icon: 'play' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 CLI examples](/services/cli/examples). + + +## Examples + +### Extract company info + +```bash +sgai smartscraper --url "https://example.com/about" --prompt "Extract the company name and description" +``` + +### Search the web + +```bash +sgai searchscraper --prompt "Latest AI news" --num-results 5 +``` + +### Convert to markdown + +```bash +sgai markdownify --url "https://example.com/article" +``` + +For more examples, see the [v2 CLI documentation](/services/cli/examples). diff --git a/v1/cli/introduction.mdx b/v1/cli/introduction.mdx new file mode 100644 index 0000000..161dace --- /dev/null +++ b/v1/cli/introduction.mdx @@ -0,0 +1,27 @@ +--- +title: 'CLI Introduction' +description: 'ScrapeGraphAI Command Line Interface (v1)' +icon: 'terminal' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 CLI documentation](/services/cli/introduction). 
+ + +## Overview + +The ScrapeGraphAI CLI provides a command-line interface for interacting with ScrapeGraphAI services directly from your terminal. + +## Installation + +```bash +pip install scrapegraph-py +``` + +## Quick Start + +```bash +sgai smartscraper --url "https://example.com" --prompt "Extract the title" +``` + +For more details, see the [v2 CLI documentation](/services/cli/introduction). diff --git a/v1/cli/json-mode.mdx b/v1/cli/json-mode.mdx new file mode 100644 index 0000000..932ca48 --- /dev/null +++ b/v1/cli/json-mode.mdx @@ -0,0 +1,17 @@ +--- +title: 'JSON Mode' +description: 'CLI JSON output mode (v1)' +icon: 'code' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 JSON mode documentation](/services/cli/json-mode). + + +## JSON Output + +Use the `--json` flag to get structured JSON output from CLI commands: + +```bash +sgai smartscraper --url "https://example.com" --prompt "Extract data" --json +``` diff --git a/v1/introduction.mdx b/v1/introduction.mdx new file mode 100644 index 0000000..23eaab7 --- /dev/null +++ b/v1/introduction.mdx @@ -0,0 +1,88 @@ +--- +title: Introduction +description: 'Welcome to ScrapeGraphAI v1 - AI-Powered Web Data Extraction' +--- + + +You are viewing the **v1 (legacy)** documentation. v1 is deprecated and will be removed in a future release. Please migrate to [v2](/introduction) for the latest features and improvements. + + + + +## Overview + +[ScrapeGraphAI](https://scrapegraphai.com) is a powerful suite of LLM-driven web scraping tools designed to extract structured data from any website and HTML content. Our API is designed to be easy to use and integrate with your existing workflows. 
+ +### Perfect For + + + + Feed your AI agents with structured web data for enhanced decision-making + + + Extract and structure web data for research and analysis + + + Build comprehensive datasets from web sources + + + Create scraping-powered platforms and applications + + + +## Getting Started + + + + Sign up and access your API key from the [dashboard](https://scrapegraphai.com/dashboard) + + + Select from our specialized extraction services based on your needs + + + Begin extracting data using our SDKs or direct API calls + + + +## Core Services + +- **SmartScraper**: AI-powered extraction for any webpage +- **SearchScraper**: Find and extract any data using AI starting from a prompt +- **SmartCrawler**: AI-powered extraction across multiple pages of a website +- **Markdownify**: Convert web content to clean Markdown format +- **Sitemap**: Extract sitemaps from websites +- **AgenticScraper**: Agent-based multi-step scraping +- **Toonify**: Convert images to cartoon style + +## v1 SDKs + +### Python +```python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract the main content" +) +``` + +### JavaScript +```javascript +import { smartScraper } from "scrapegraph-js"; + +const response = await smartScraper(apiKey, { + website_url: "https://example.com", + user_prompt: "What does the company do?", +}); +``` + +## Migrate to v2 + +v2 brings significant improvements including renamed methods, unified configuration objects, and new endpoints.
See the migration guides: +- [Python SDK Migration Guide](https://github.com/ScrapeGraphAI/scrapegraph-py/blob/main/MIGRATION_V2.md) +- [JavaScript SDK Migration Guide](https://github.com/ScrapeGraphAI/scrapegraph-js/blob/main/MIGRATION.md) diff --git a/v1/markdownify.mdx b/v1/markdownify.mdx new file mode 100644 index 0000000..2a4028e --- /dev/null +++ b/v1/markdownify.mdx @@ -0,0 +1,46 @@ +--- +title: 'Markdownify' +description: 'Convert web content to clean markdown (v1)' +icon: 'markdown' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, Markdownify has been replaced by `scrape()` with `output_format="markdown"`. See the [v2 documentation](/services/markdownify). + + +## Overview + +Markdownify converts any webpage into clean, formatted markdown. + +## Usage + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.markdownify( + website_url="https://example.com" +) +``` + +```javascript JavaScript +import { markdownify } from "scrapegraph-js"; + +const response = await markdownify(apiKey, { + website_url: "https://example.com", +}); +``` + + + +### Parameters + +| Parameter | Type | Required | Description | +| ------------ | ------- | -------- | ---------------------------------------- | +| website_url | string | Yes | The URL of the webpage to convert | +| wait_ms | number | No | Page load wait time in ms | +| stealth | boolean | No | Enable anti-detection mode | +| country_code | string | No | Proxy routing country code | diff --git a/v1/mcp-server/claude.mdx b/v1/mcp-server/claude.mdx new file mode 100644 index 0000000..d03af08 --- /dev/null +++ b/v1/mcp-server/claude.mdx @@ -0,0 +1,11 @@ +--- +title: 'Claude Integration' +description: 'Use ScrapeGraphAI MCP with Claude (v1)' +icon: 'message-bot' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 Claude integration](/services/mcp-server/claude). 
+ + +For Claude MCP setup, see the [v2 documentation](/services/mcp-server/claude). diff --git a/v1/mcp-server/cursor.mdx b/v1/mcp-server/cursor.mdx new file mode 100644 index 0000000..dbcdfa4 --- /dev/null +++ b/v1/mcp-server/cursor.mdx @@ -0,0 +1,11 @@ +--- +title: 'Cursor Integration' +description: 'Use ScrapeGraphAI MCP with Cursor (v1)' +icon: 'code' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 Cursor integration](/services/mcp-server/cursor). + + +For Cursor MCP setup, see the [v2 documentation](/services/mcp-server/cursor). diff --git a/v1/mcp-server/introduction.mdx b/v1/mcp-server/introduction.mdx new file mode 100644 index 0000000..3162ccf --- /dev/null +++ b/v1/mcp-server/introduction.mdx @@ -0,0 +1,15 @@ +--- +title: 'MCP Server Introduction' +description: 'ScrapeGraphAI MCP Server (v1)' +icon: 'server' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 MCP Server documentation](/services/mcp-server/introduction). + + +## Overview + +The ScrapeGraphAI MCP (Model Context Protocol) Server enables AI assistants and tools to use ScrapeGraphAI as a data source. + +For setup and usage, see the [v2 MCP Server documentation](/services/mcp-server/introduction). diff --git a/v1/mcp-server/smithery.mdx b/v1/mcp-server/smithery.mdx new file mode 100644 index 0000000..edee161 --- /dev/null +++ b/v1/mcp-server/smithery.mdx @@ -0,0 +1,11 @@ +--- +title: 'Smithery Integration' +description: 'Use ScrapeGraphAI MCP with Smithery (v1)' +icon: 'hammer' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 Smithery integration](/services/mcp-server/smithery). + + +For Smithery MCP setup, see the [v2 documentation](/services/mcp-server/smithery). 
diff --git a/v1/quickstart.mdx new file mode 100644 index 0000000..34fe5d8 --- /dev/null +++ b/v1/quickstart.mdx @@ -0,0 +1,69 @@ +--- +title: Quickstart +description: 'Get started with ScrapeGraphAI v1 SDKs' +--- + + +You are viewing the **v1 (legacy)** documentation. Please migrate to [v2](/install) for the latest features. + + +## Prerequisites + +- Obtain your **API key** by signing up on the [ScrapeGraphAI Dashboard](https://scrapegraphai.com/dashboard) + +--- + +## Python SDK + +```bash +pip install scrapegraph-py +``` + +```python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key-here") + +response = client.smartscraper( + website_url="https://scrapegraphai.com", + user_prompt="Extract information about the company" +) +print(response) +``` + + +You can also set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()` + + +--- + +## JavaScript SDK + +```bash +npm i scrapegraph-js +``` + +```javascript +import { smartScraper } from "scrapegraph-js"; + +const apiKey = "your-api-key-here"; + +const response = await smartScraper(apiKey, { + website_url: "https://scrapegraphai.com", + user_prompt: "What does the company do?", +}); + +if (response.status === "error") { + console.error("Error:", response.error); +} else { + console.log(response.data.result); +} +``` + +--- + +## Next Steps + +- Explore the [SmartScraper](/v1/smartscraper) service +- Check out [SearchScraper](/v1/searchscraper) for search-based extraction +- Use [Markdownify](/v1/markdownify) for HTML-to-markdown conversion diff --git a/v1/scrape.mdx new file mode 100644 index 0000000..5d6ba07 --- /dev/null +++ b/v1/scrape.mdx @@ -0,0 +1,37 @@ +--- +title: 'Scrape' +description: 'Basic webpage scraping service (v1)' +icon: 'spider-web' +--- + + +You are viewing the **v1 (legacy)** documentation. See the [v2 documentation](/services/scrape).
+ + +## Overview + +The Scrape service provides basic webpage scraping capabilities, returning the raw content of a webpage. + +## Usage + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.scrape( + website_url="https://example.com" +) +``` + +```javascript JavaScript +import { scrape } from "scrapegraph-js"; + +const response = await scrape(apiKey, { + website_url: "https://example.com", +}); +``` + + diff --git a/v1/searchscraper.mdx b/v1/searchscraper.mdx new file mode 100644 index 0000000..eba7042 --- /dev/null +++ b/v1/searchscraper.mdx @@ -0,0 +1,52 @@ +--- +title: 'SearchScraper' +description: 'Search and extract information from multiple web sources (v1)' +icon: 'magnifying-glass' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, SearchScraper has been renamed to `search()`. See the [v2 documentation](/services/searchscraper). + + +## Overview + +SearchScraper enables you to search the web and extract structured information from multiple sources using AI. 
+ +## Usage + + + +```python Python +from scrapegraph_py import Client +from scrapegraph_py.models import TimeRange + +client = Client(api_key="your-api-key") + +response = client.searchscraper( + user_prompt="What are the key features of ChatGPT Plus?", + time_range=TimeRange.PAST_WEEK +) +``` + +```javascript JavaScript +import { searchScraper } from "scrapegraph-js"; + +const response = await searchScraper(apiKey, { + user_prompt: "Find the best restaurants in San Francisco", + location_geo_code: "us", + time_range: "past_week", +}); +``` + + + +### Parameters + +| Parameter | Type | Required | Description | +| ----------------- | --------- | -------- | -------------------------------------------------------- | +| user_prompt | string | Yes | Search query description | +| num_results | number | No | Number of websites to search (3-20) | +| extraction_mode | boolean | No | AI extraction (true) or markdown mode (false) | +| output_schema | object | No | Schema for structured response | +| location_geo_code | string | No | Geo code for location-based search | +| time_range | TimeRange | No | Time range filter for results | diff --git a/v1/sitemap.mdx b/v1/sitemap.mdx new file mode 100644 index 0000000..284f64f --- /dev/null +++ b/v1/sitemap.mdx @@ -0,0 +1,37 @@ +--- +title: 'Sitemap' +description: 'Extract sitemaps from websites (v1)' +icon: 'sitemap' +--- + + +You are viewing the **v1 (legacy)** documentation. The Sitemap endpoint has been removed in v2. Use `crawl.start()` with URL patterns instead. See the [v2 documentation](/services/sitemap). + + +## Overview + +The Sitemap service extracts and parses sitemap data from any website. 
+ +## Usage + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.sitemap( + website_url="https://example.com" +) +``` + +```javascript JavaScript +import { sitemap } from "scrapegraph-js"; + +const response = await sitemap(apiKey, { + website_url: "https://example.com", +}); +``` + + diff --git a/v1/smartcrawler.mdx b/v1/smartcrawler.mdx new file mode 100644 index 0000000..6ceb27f --- /dev/null +++ b/v1/smartcrawler.mdx @@ -0,0 +1,41 @@ +--- +title: 'SmartCrawler' +description: 'AI-powered multi-page crawling service (v1)' +icon: 'spider' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, crawling uses `crawl.start()`, `crawl.status()`, `crawl.stop()`, and `crawl.resume()`. See the [v2 documentation](/services/smartcrawler). + + +## Overview + +SmartCrawler enables AI-powered extraction across multiple pages of a website, automatically navigating and collecting structured data. + +## Usage + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.crawl( + website_url="https://example.com", + user_prompt="Extract all blog post titles", + depth=2 +) +``` + +```javascript JavaScript +import { smartCrawler } from "scrapegraph-js"; + +const response = await smartCrawler(apiKey, { + website_url: "https://example.com", + user_prompt: "Extract all blog post titles", + depth: 2, +}); +``` + + diff --git a/v1/smartscraper.mdx b/v1/smartscraper.mdx new file mode 100644 index 0000000..cc082cc --- /dev/null +++ b/v1/smartscraper.mdx @@ -0,0 +1,52 @@ +--- +title: 'SmartScraper' +description: 'AI-powered web scraping for any website (v1)' +icon: 'robot' +--- + + +You are viewing the **v1 (legacy)** documentation. In v2, SmartScraper has been renamed to `extract()`. See the [v2 documentation](/services/smartscraper). 
+ + +## Overview + +SmartScraper is our flagship LLM-powered web scraping service that intelligently extracts structured data from any website. + +## Usage + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract the main heading and description" +) +``` + +```javascript JavaScript +import { smartScraper } from "scrapegraph-js"; + +const response = await smartScraper(apiKey, { + website_url: "https://example.com", + user_prompt: "Extract the main content", +}); +``` + + + +### Parameters + +| Parameter | Type | Required | Description | +| ------------- | ------- | -------- | --------------------------------------------------------------------------- | +| website_url | string | Yes | The URL of the webpage to scrape | +| user_prompt | string | Yes | A textual description of what you want to extract | +| output_schema | object | No | Pydantic/Zod schema for structured response | +| stealth | boolean | No | Enable anti-detection mode | +| headers | object | No | Custom HTTP headers | +| mock | boolean | No | Enable mock mode for testing | +| wait_ms | number | No | Page load wait time in ms | +| country_code | string | No | Proxy routing country code | diff --git a/v1/toonify.mdx b/v1/toonify.mdx new file mode 100644 index 0000000..ab5293e --- /dev/null +++ b/v1/toonify.mdx @@ -0,0 +1,25 @@ +--- +title: 'Toonify' +description: 'Convert images to cartoon style (v1)' +icon: 'palette' +--- + + +You are viewing the **v1 (legacy)** documentation. Toonify has been removed in v2. See the [v2 documentation](/services/toonify). + + +## Overview + +Toonify converts images into cartoon-style illustrations using AI. + +## Usage + +```python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.toonify( + website_url="https://example.com/image.jpg" +) +```