284 changes: 103 additions & 181 deletions README.md


4 changes: 2 additions & 2 deletions bun.lock


4 changes: 2 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "just-scrape",
"version": "0.2.1",
"version": "0.3.0",
"description": "ScrapeGraph AI CLI tool",
"type": "module",
"main": "dist/cli.mjs",
@@ -28,7 +28,7 @@
"chalk": "^5.4.1",
"citty": "^0.1.6",
"dotenv": "^17.2.4",
"scrapegraph-js": "^1.0.0"
"scrapegraph-js": "github:ScrapeGraphAI/scrapegraph-js#b570a57"
},
"devDependencies": {
"@biomejs/biome": "^1.9.4",
221 changes: 77 additions & 144 deletions skills/just-scrape/SKILL.md
@@ -1,6 +1,6 @@
---
name: just-scrape
description: "CLI tool for AI-powered web scraping, data extraction, search, and crawling via ScrapeGraph AI. Use when the user needs to scrape websites, extract structured data from URLs, convert pages to markdown, crawl multi-page sites, search the web for information, automate browser interactions (login, click, fill forms), get raw HTML, discover sitemaps, or generate JSON schemas. Triggers on tasks involving: (1) extracting data from websites, (2) web scraping or crawling, (3) converting webpages to markdown, (4) AI-powered web search with extraction, (5) browser automation, (6) generating output schemas for scraping. The CLI is just-scrape (npm package just-scrape)."
description: "CLI tool for AI-powered web scraping, data extraction, search, and crawling via ScrapeGraph AI v2 API. Use when the user needs to scrape websites, extract structured data from URLs, convert pages to markdown, crawl multi-page sites, or search the web for information. Triggers on tasks involving: (1) extracting data from websites, (2) web scraping or crawling, (3) converting webpages to markdown, (4) AI-powered web search with extraction. The CLI is just-scrape (npm package just-scrape)."
---

# Web Scraping with just-scrape
@@ -30,191 +30,132 @@ API key resolution order: `SGAI_API_KEY` env var → `.env` file → `~/.scrapeg

| Need | Command |
|---|---|
| Extract structured data from a known URL | `smart-scraper` |
| Search the web and extract from results | `search-scraper` |
| Extract structured data from a known URL | `extract` |
| Search the web and extract from results | `search` |
| Scrape a page (markdown, html, screenshot, branding) | `scrape` |
| Convert a page to clean markdown | `markdownify` |
| Crawl multiple pages from a site | `crawl` |
| Get raw HTML | `scrape` |
| Automate browser actions (login, click, fill) | `agentic-scraper` |
| Generate a JSON schema from description | `generate-schema` |
| Get all URLs from a sitemap | `sitemap` |
| Check credit balance | `credits` |
| Browse past requests | `history` |
| Validate API key | `validate` |

## Common Flags

All commands support `--json` for machine-readable output (suppresses banner, spinners, prompts).

Scraping commands share these optional flags:
- `--stealth` — bypass anti-bot detection (+4 credits)
- `--mode <mode>` / `-m <mode>` — fetch mode: `auto` (default), `fast`, `js`, `direct+stealth`, `js+stealth`
- `--headers <json>` — custom HTTP headers as JSON string
- `--schema <json>` — enforce output JSON schema
- `--country <iso>` — ISO country code for geo-targeting
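
The shared flags above compose on a single call; a minimal sketch (flag names are those documented here, the URL and prompt are placeholders, and `.data` follows the JSON output shape shown later in this document):

```shell
# Combine shared flags on one extract call: machine-readable output,
# JS rendering with anti-bot bypass, geo-targeted to Germany.
just-scrape extract https://store.example.com -p "Extract product names" \
  --mode js+stealth \
  --country DE \
  --json | jq '.data'
```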

## Commands

### Smart Scraper
### Extract

Extract structured data from any URL using AI.

```bash
just-scrape smart-scraper <url> -p <prompt>
just-scrape smart-scraper <url> -p <prompt> --schema <json>
just-scrape smart-scraper <url> -p <prompt> --scrolls <n> # infinite scroll (0-100)
just-scrape smart-scraper <url> -p <prompt> --pages <n> # multi-page (1-100)
just-scrape smart-scraper <url> -p <prompt> --stealth # anti-bot (+4 credits)
just-scrape smart-scraper <url> -p <prompt> --cookies <json> --headers <json>
just-scrape smart-scraper <url> -p <prompt> --plain-text
just-scrape extract <url> -p <prompt>
just-scrape extract <url> -p <prompt> --schema <json>
just-scrape extract <url> -p <prompt> --scrolls <n> # infinite scroll (0-100)
just-scrape extract <url> -p <prompt> --mode js+stealth # anti-bot bypass
just-scrape extract <url> -p <prompt> --cookies <json> --headers <json>
just-scrape extract <url> -p <prompt> --country <iso> # geo-targeting
```

```bash
# E-commerce extraction
just-scrape smart-scraper https://store.example.com/shoes -p "Extract all product names, prices, and ratings"
just-scrape extract https://store.example.com/shoes -p "Extract all product names, prices, and ratings"

# Strict schema + scrolling
just-scrape smart-scraper https://news.example.com -p "Get headlines and dates" \
just-scrape extract https://news.example.com -p "Get headlines and dates" \
--schema '{"type":"object","properties":{"articles":{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"date":{"type":"string"}}}}}}' \
--scrolls 5

# JS-heavy SPA behind anti-bot
just-scrape smart-scraper https://app.example.com/dashboard -p "Extract user stats" \
--stealth
just-scrape extract https://app.example.com/dashboard -p "Extract user stats" \
--mode js+stealth
```

### Search Scraper
### Search

Search the web and extract structured data from results.

```bash
just-scrape search-scraper <prompt>
just-scrape search-scraper <prompt> --num-results <n> # sources to scrape (3-20, default 3)
just-scrape search-scraper <prompt> --no-extraction # markdown only (2 credits vs 10)
just-scrape search-scraper <prompt> --schema <json>
just-scrape search-scraper <prompt> --stealth --headers <json>
just-scrape search <query>
just-scrape search <query> --num-results <n> # sources to scrape (1-20, default 3)
just-scrape search <query> -p <prompt> # extraction prompt
just-scrape search <query> --schema <json>
just-scrape search <query> --headers <json>
```

```bash
# Research across sources
just-scrape search-scraper "Best Python web frameworks in 2025" --num-results 10

# Cheap markdown-only
just-scrape search-scraper "React vs Vue comparison" --no-extraction --num-results 5
just-scrape search "Best Python web frameworks in 2025" --num-results 10

# Structured output
just-scrape search-scraper "Top 5 cloud providers pricing" \
just-scrape search "Top 5 cloud providers pricing" \
--schema '{"type":"object","properties":{"providers":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"free_tier":{"type":"string"}}}}}}'
```

### Markdownify

Convert any webpage to clean markdown.

```bash
just-scrape markdownify <url>
just-scrape markdownify <url> --stealth # +4 credits
just-scrape markdownify <url> --headers <json>
```

```bash
just-scrape markdownify https://blog.example.com/my-article
just-scrape markdownify https://protected.example.com --stealth
just-scrape markdownify https://docs.example.com/api --json | jq -r '.result' > api-docs.md
```

### Crawl

Crawl multiple pages and extract data from each.

```bash
just-scrape crawl <url> -p <prompt>
just-scrape crawl <url> -p <prompt> --max-pages <n> # default 10
just-scrape crawl <url> -p <prompt> --depth <n> # default 1
just-scrape crawl <url> --no-extraction --max-pages <n> # markdown only (2 credits/page)
just-scrape crawl <url> -p <prompt> --schema <json>
just-scrape crawl <url> -p <prompt> --rules <json> # include_paths, same_domain
just-scrape crawl <url> -p <prompt> --no-sitemap
just-scrape crawl <url> -p <prompt> --stealth
```

```bash
# Crawl docs site
just-scrape crawl https://docs.example.com -p "Extract all code snippets" --max-pages 20 --depth 3

# Filter to blog pages only
just-scrape crawl https://example.com -p "Extract article titles" \
--rules '{"include_paths":["/blog/*"],"same_domain":true}' --max-pages 50

# Raw markdown, no AI extraction (cheaper)
just-scrape crawl https://example.com --no-extraction --max-pages 10
```

### Scrape

Get raw HTML content from a URL.
Scrape content from a URL in various formats.

```bash
just-scrape scrape <url>
just-scrape scrape <url> --stealth # +4 credits
just-scrape scrape <url> --branding # extract logos/colors/fonts (+2 credits)
just-scrape scrape <url> --country-code <iso>
just-scrape scrape <url> # markdown (default)
just-scrape scrape <url> -f html # raw HTML
just-scrape scrape <url> -f screenshot # screenshot
just-scrape scrape <url> -f branding # extract branding info
just-scrape scrape <url> -m direct+stealth # anti-bot bypass
just-scrape scrape <url> --country <iso> # geo-targeting
```

```bash
just-scrape scrape https://example.com
just-scrape scrape https://store.example.com --stealth --country-code DE
just-scrape scrape https://example.com --branding
just-scrape scrape https://example.com -f html
just-scrape scrape https://store.example.com -m direct+stealth --country DE
just-scrape scrape https://example.com -f branding
```

### Agentic Scraper
### Markdownify

Browser automation with AI — login, click, navigate, fill forms. Steps are comma-separated strings.
Convert any webpage to clean markdown (convenience wrapper for `scrape --format markdown`).

```bash
just-scrape agentic-scraper <url> -s <steps>
just-scrape agentic-scraper <url> -s <steps> --ai-extraction -p <prompt>
just-scrape agentic-scraper <url> -s <steps> --schema <json>
just-scrape agentic-scraper <url> -s <steps> --use-session # persist browser session
just-scrape markdownify <url>
just-scrape markdownify <url> -m direct+stealth
just-scrape markdownify <url> --headers <json>
```

```bash
# Login + extract dashboard
just-scrape agentic-scraper https://app.example.com/login \
-s "Fill email with user@test.com,Fill password with secret,Click Sign In" \
--ai-extraction -p "Extract all dashboard metrics"

# Multi-step form
just-scrape agentic-scraper https://example.com/wizard \
-s "Click Next,Select Premium plan,Fill name with John,Click Submit"

# Persistent session across runs
just-scrape agentic-scraper https://app.example.com \
-s "Click Settings" --use-session
just-scrape markdownify https://blog.example.com/my-article
just-scrape markdownify https://protected.example.com -m js+stealth
just-scrape markdownify https://docs.example.com/api --json | jq -r '.markdown' > api-docs.md
```

### Generate Schema
### Crawl

Generate a JSON schema from a natural language description.
Crawl multiple pages. The CLI starts the crawl and polls until completion.

```bash
just-scrape generate-schema <prompt>
just-scrape generate-schema <prompt> --existing-schema <json>
just-scrape crawl <url>
just-scrape crawl <url> --max-pages <n> # default 50
just-scrape crawl <url> --max-depth <n> # default 2
just-scrape crawl <url> --max-links-per-page <n> # default 10
just-scrape crawl <url> --allow-external # allow external domains
just-scrape crawl <url> -m direct+stealth # anti-bot bypass
```

```bash
just-scrape generate-schema "E-commerce product with name, price, ratings, and reviews array"

# Refine an existing schema
just-scrape generate-schema "Add an availability field" \
--existing-schema '{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}}}'
```

### Sitemap
# Crawl docs site
just-scrape crawl https://docs.example.com --max-pages 20 --max-depth 3

Get all URLs from a website's sitemap.
# Crawl staying within domain
just-scrape crawl https://example.com --max-pages 50

```bash
just-scrape sitemap <url>
just-scrape sitemap https://example.com --json | jq -r '.urls[]'
# Get crawl results as JSON
just-scrape crawl https://example.com --json --max-pages 10
```

### History
@@ -225,67 +166,59 @@ Browse request history. Interactive by default (arrow keys to navigate, select t
just-scrape history <service> # interactive browser
just-scrape history <service> <request-id> # specific request
just-scrape history <service> --page <n>
just-scrape history <service> --page-size <n> # max 100
just-scrape history <service> --page-size <n> # default 20, max 100
just-scrape history <service> --json
```

Services: `markdownify`, `smartscraper`, `searchscraper`, `scrape`, `crawl`, `agentic-scraper`, `sitemap`
Services: `scrape`, `extract`, `search`, `monitor`, `crawl`

```bash
just-scrape history smartscraper
just-scrape history crawl --json --page-size 100 | jq '.requests[] | {id: .request_id, status}'
just-scrape history extract
just-scrape history crawl --json --page-size 100 | jq '.[].status'
```

### Credits & Validate
### Credits

```bash
just-scrape credits
just-scrape credits --json | jq '.remaining_credits'
just-scrape validate
just-scrape credits --json | jq '.remainingCredits'
```

## Common Patterns

### Generate schema then scrape with it

```bash
just-scrape generate-schema "Product with name, price, and reviews" --json | jq '.schema' > schema.json
just-scrape smart-scraper https://store.example.com -p "Extract products" --schema "$(cat schema.json)"
```

### Pipe JSON for scripting

```bash
just-scrape sitemap https://example.com --json | jq -r '.urls[]' | while read url; do
just-scrape smart-scraper "$url" -p "Extract title" --json >> results.jsonl
done
just-scrape extract https://example.com -p "Extract all links" --json | jq '.data'
```

### Protected sites

```bash
# JS-heavy SPA behind Cloudflare
just-scrape smart-scraper https://protected.example.com -p "Extract data" --stealth
just-scrape extract https://protected.example.com -p "Extract data" --mode js+stealth

# With custom cookies/headers
just-scrape smart-scraper https://example.com -p "Extract data" \
just-scrape extract https://example.com -p "Extract data" \
--cookies '{"session":"abc123"}' --headers '{"Authorization":"Bearer token"}'
```

## Credit Costs
## Fetch Modes

| Feature | Extra Credits |
| Mode | Description |
|---|---|
| `--stealth` | +4 per request |
| `--branding` (scrape only) | +2 |
| `search-scraper` extraction | 10 per request |
| `search-scraper --no-extraction` | 2 per request |
| `crawl --no-extraction` | 2 per page |
| `auto` | Automatic selection (default) |
| `fast` | Fastest, no JS rendering |
| `js` | Full JS rendering |
| `direct+stealth` | Direct fetch with anti-bot bypass |
| `js+stealth` | JS rendering with anti-bot bypass |
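
As a rule of thumb (a sketch, not official guidance), escalate to heavier modes only when a cheaper one fails, since each step adds latency and cost. The mode names are those in the table above; the URL is a placeholder:

```shell
# Try the default auto mode first; on a non-zero exit, fall back
# to full JS rendering, then to JS rendering with stealth.
just-scrape scrape https://example.com --json \
  || just-scrape scrape https://example.com -m js --json \
  || just-scrape scrape https://example.com -m js+stealth --json
```

The `||` chain short-circuits, so later (more expensive) invocations run only if the earlier ones fail.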

## Environment Variables

```bash
SGAI_API_KEY=sgai-... # API key
JUST_SCRAPE_TIMEOUT_S=300 # Request timeout in seconds (default 120)
JUST_SCRAPE_DEBUG=1 # Debug logging to stderr
SGAI_API_URL=... # Override API base URL (default: https://api.scrapegraphai.com)
SGAI_TIMEOUT_S=30 # Request timeout in seconds (default 30)
```

Legacy variables (`JUST_SCRAPE_API_URL`, `JUST_SCRAPE_TIMEOUT_S`, `JUST_SCRAPE_DEBUG`) are still bridged.
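
These variables can also be set per invocation rather than exported shell-wide; a minimal sketch using the variables documented above (the URL is a placeholder):

```shell
# One-off override: raise the timeout for a single slow crawl
# without changing the shell environment.
SGAI_TIMEOUT_S=300 just-scrape crawl https://docs.example.com --max-pages 20
```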
10 changes: 3 additions & 7 deletions src/cli.ts
@@ -12,17 +12,13 @@ const main = defineCommand({
description: "ScrapeGraph AI CLI tool",
},
subCommands: {
"smart-scraper": () => import("./commands/smart-scraper.js").then((m) => m.default),
"search-scraper": () => import("./commands/search-scraper.js").then((m) => m.default),
extract: () => import("./commands/extract.js").then((m) => m.default),
search: () => import("./commands/search.js").then((m) => m.default),
scrape: () => import("./commands/scrape.js").then((m) => m.default),
markdownify: () => import("./commands/markdownify.js").then((m) => m.default),
crawl: () => import("./commands/crawl.js").then((m) => m.default),
sitemap: () => import("./commands/sitemap.js").then((m) => m.default),
scrape: () => import("./commands/scrape.js").then((m) => m.default),
"agentic-scraper": () => import("./commands/agentic-scraper.js").then((m) => m.default),
"generate-schema": () => import("./commands/generate-schema.js").then((m) => m.default),
history: () => import("./commands/history.js").then((m) => m.default),
credits: () => import("./commands/credits.js").then((m) => m.default),
validate: () => import("./commands/validate.js").then((m) => m.default),
},
});
