Brave Search Extractor

Small, async-friendly toolkit to query the Brave Search API and extract readable article text using Mozilla Readability — with optional auto‑archiving of results to JSON files.

Executive Summary

Brave Search Extractor helps you: (1) query Brave Search efficiently, (2) fetch 10–20 links per request, and (3) extract readable article text locally. This design keeps your unit cost very low (often 1/8–1/5 of hosted alternatives for typical workflows) while supporting high concurrency.

Features

Async Brave Search client (aiohttp)
Simple rate limiting (requests/second)
Readability‑based content extraction (readability-lxml + lxml)
Batch extraction with bounded concurrency
Optional auto‑archiving to archives/ (daily search logs and extracted content files)

Project Layout

search/
├── brave_client.py         # Brave Search API client
├── content_extractor.py    # Readability-based content extractor
├── archive_manager.py      # JSON archive writer
├── config_loader.py        # Configuration loader
├── demo.py                 # End-to-end demo script
├── search_config.yaml      # Local config (ignored by Git)
└── archives/               # Auto-generated archives directory

Note: The project ships a small demo and a minimal public API. No secrets are committed; use the example config to create your local search_config.yaml.

Requirements

Python 3.9+ (tested on 3.9.6 and newer)
Install dependencies:

pip install -r requirements.txt

Configuration

Copy the example config and set your Brave API key (do not commit the real key):

cp search_config.example.yaml search_config.yaml
# Edit search_config.yaml and set brave_search.api_key

Used configuration keys (from search_config.yaml):

brave_search.api_key (required)
brave_search.base_url (optional, default: https://api.search.brave.com/res/v1)
brave_search.rate_limit.requests_per_second (optional, default: 1.0)
brave_search.enable_archive (default: true)
brave_search.archive_path (default: ./archives)

Quick Start

Example usage (Python):

import asyncio
from search.brave_client import BraveSearchClient
from search.content_extractor import ContentExtractor

async def main():
    client = BraveSearchClient()
    results = await client.search("Bitcoin news", count=10)

    extractor = ContentExtractor()
    content = await extractor.extract(results[0].url)
    print(content.title)
    print(content.text[:400])

asyncio.run(main())

Run the demo (single way):

From the repo root: python run_demo.py "bitcoin whale"

API Overview

Public types exposed by the package:

search.brave_client.BraveSearchClient — async client for Brave Search.
- Method search(query: str, **params) -> List[SearchResult].
- Common params: count (default 10), offset (0–9). Rate‑limited via a simple client‑side limiter.
search.brave_client.SearchResult — structure of a search result.
- Fields: url, title, description, snippet, age?, extra_snippets?, source_type? (e.g., web/news).
search.content_extractor.ContentExtractor — async extractor using Readability.
- Method extract(url: str) -> ExtractedContent — returns one result.
- Method extract_batch(urls: List[str], max_concurrent: int = 5) -> List[ExtractedContent] — bounded concurrency; may auto‑archive.
search.content_extractor.ExtractedContent — structure of extracted content.
- Fields: url, title?, text?, success (bool), error?.

Cost & Throughput Advantages

The project is optimized for a common workflow: “fetch 10–20 links per search, then extract the article text ourselves.” Under this pattern, Brave Search + self‑hosted extraction achieves excellent unit economics and high throughput.

TL;DR

For “1 search returning 10–20 links + extract 10 pages,” Brave typically costs about 1/8–1/5 of Tavily (PAYG), while offering higher peak RPS and larger monthly ceilings, which makes it well‑suited for high concurrency and scale.

Key Metrics at a Glance (as of 2025‑10; see vendor docs for details)

Dimension	Brave Free	Brave Base	Brave Pro	Tavily (Free / PAYG)
Unit price	$0 / 1,000 requests	$3 / 1,000 requests	$5 / 1,000 requests	Free: 1,000 credits / month; $0.008 / credit (PAYG)
Peak rate	1 rps	20 rps	50 rps	Dev: 100 rpm (≈ 1.67 rps); Prod: 1,000 rpm (≈ 16.7 rps)
Monthly cap	2,000 / month	20M / month	Unlimited	Varies by plan; Free 1,000 credits / month
Results per request (`count`)	Up to 20	Up to 20	Up to 20	`max_results` typically up to 20 (billed by `search_depth`)
Pagination offset	Up to 9	Up to 9	Up to 9	—
Billing granularity	Per request	Per request	Per request	Per credit: `search basic = 1`, `advanced = 2`; `extract/crawl` = 1 or 2 credits per 5 successful URLs

Sources: Brave pricing and API docs; Tavily pricing, rate limits, and endpoint docs.

Concrete Workflow Cost Example

Scenario: 1 search returns 10 links; we extract 10 pages ourselves.

Brave path (we only pay for search; extraction is self‑hosted):
- Base: 1 request = $0.003 ($3 / 1,000)
- Pro: 1 request = $0.005 ($5 / 1,000)
- Note: one request can return 10–20 links and still counts as a single request.
Tavily path (use Tavily for search + hosted extraction):
- basic: search = 1 credit; extract basic (10 URLs) = 2 credits (1 credit per 5 successful URLs) → total 3 credits × $0.008 = $0.024
- advanced: search advanced = 2 credits; extract advanced (10 URLs) = 4 credits → total 6 credits × $0.008 = $0.048

Cost ratio (lower is better):

Brave Base vs Tavily basic: $0.003 / $0.024 ≈ 12.5% (≈ 1/8). Pro vs basic: ≈ 1/4.8.
Brave Base vs Tavily advanced: $0.003 / $0.048 ≈ 6.25% (≈ 1/16). Pro vs advanced: ≈ 1/9.6.

These calculations follow the vendors’ published billing models and do not include network jitter or retries. Always validate against the latest pricing.

Throughput & Scalability (single API key)

Brave: Free 1 rps / 2,000 per month; Base 20 rps / 20M per month; Pro 50 rps / unlimited per month.
Tavily: Dev 100 rpm and Prod 1,000 rpm. Good for medium‑to‑high concurrency, but peak is below Brave Pro in raw RPS.

Executive Takeaways

Unit economics: For the same “1 search + extract 10 pages,” Brave is typically ≈ 1/8–1/5 the cost of Tavily PAYG; the gap widens for Tavily “advanced”.
Headroom: Brave offers 20–50 rps and very large monthly caps (20M/month on Base, unlimited on Pro), enabling batch and scale‑out workloads.
Strategy: Decouple “search” and “extraction”—use Brave for low‑cost bulk link retrieval (10–20 per request), keep extraction in‑house for cost control and throughput. Use Tavily advanced as a complementary retrieval option when needed.

Note: Pricing, limits, and features can change. Check the official vendor documentation for the latest information.

Error Model

BraveSearchClient.search(...) raises on HTTP errors (e.g., invalid key, rate limiting). Wrap calls with try/except as needed.
ContentExtractor.extract(...) returns an ExtractedContent with success=False and a short error string on failure; it does not raise.
ContentExtractor.extract_batch(...) aggregates results and converts per‑task exceptions into ExtractedContent(success=False, error=...) entries so the overall batch completes.

Limitations

Extraction relies on static HTML; sites that require heavy client‑side rendering or aggressive anti‑bot protections may fail or return partial text.
Default headers use a generic desktop User‑Agent and Accept-Language: en-US; tune these if you target non‑English pages.
Readability works best on article‑like pages; quality varies by site layout and markup.

Archiving

If enable_archive is true, the code writes:

Daily search logs: archives/daily/YYYY-MM-DD_searches.json
Extraction batch index: archives/extracted/YYYY-MM-DD_extractions.json
Individual extracted articles: archives/extracted/YYYY-MM-DD_HH-MM-SS_<hash>.json

Directories are created automatically. Files in archives/ are ignored by Git.

Troubleshooting

Missing config: Configuration file not found: search_config.yaml → Copy the example and set your key.
API key error: Brave API key not configured! → Edit search_config.yaml and replace the placeholder.
ModuleNotFoundError: No module named 'search' → Run python run_demo.py ... from the repository root.
HTTP errors (403/429/etc.): Sites may block scraping; Brave API enforces rate limits (≈1 req/s).
Timeouts on extraction: Adjust ContentExtractor(timeout=...) or try again later.
Archive disabled: If enable_archive=false or you instantiate ContentExtractor(auto_archive=False), the demo runs but skips writing to archives/.

Logging tip for your own scripts:

import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")

Security

Never commit real API keys. search_config.yaml is Git‑ignored; use search_config.example.yaml as a template.
Review any generated data under archives/ before sharing.

Compliance

Use this project in accordance with websites’ Terms of Service and applicable laws. Respect robots.txt and rate limits.

License

MIT — see LICENSE.

Roadmap (nice to have)

Honor default_params from config automatically
Global search/extraction indexes and URL de‑duplication
Additional search endpoints and richer result fields

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Brave Search Extractor

Executive Summary

Features

Project Layout

Requirements

Configuration

Quick Start

API Overview

Cost & Throughput Advantages

TL;DR

Key Metrics at a Glance (as of 2025‑10; see vendor docs for details)

Concrete Workflow Cost Example

Throughput & Scalability (single API key)

Executive Takeaways

Error Model

Limitations

Archiving

Troubleshooting

Security

Compliance

License

Roadmap (nice to have)

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
archive_manager.py		archive_manager.py
brave_client.py		brave_client.py
config_loader.py		config_loader.py
content_extractor.py		content_extractor.py
demo.py		demo.py
requirements.txt		requirements.txt
run_demo.py		run_demo.py
search_config.example.yaml		search_config.example.yaml

License

friuo/llm-low-cost-search

Folders and files

Latest commit

History

Repository files navigation

Brave Search Extractor

Executive Summary

Features

Project Layout

Requirements

Configuration

Quick Start

API Overview

Cost & Throughput Advantages

TL;DR

Key Metrics at a Glance (as of 2025‑10; see vendor docs for details)

Concrete Workflow Cost Example

Throughput & Scalability (single API key)

Executive Takeaways

Error Model

Limitations

Archiving

Troubleshooting

Security

Compliance

License

Roadmap (nice to have)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages