What are Loaders?

CAMEL’s Loaders provide flexible ways to ingest and process all kinds of data: structured files, unstructured text, web content, and even OCR from images. They power your agent’s ability to interact with the outside world. Additionally, several data readers are available, including the Apify Reader, Chunkr Reader, Firecrawl Reader, Jina URL Reader, and Mistral Reader, which retrieve external data for improved data integration and analysis.

Types

Get Started

Using Base IO

This module is designed to read files of various formats, extract their contents, and represent them as File objects, each tailored to handle a specific file type.

```python
from io import BytesIO
from camel.loaders import create_file_from_raw_bytes

# Read a PDF file from disk
with open("test.pdf", "rb") as file:
    file_content = file.read()

# Use create_file_from_raw_bytes to create a File object based on the file extension
file_obj = create_file_from_raw_bytes(file_content, "test.pdf")

# Once you have the File object, you can access its content
print(file_obj.docs[0]["page_content"])
```


Using Unstructured IO

To get started with the Unstructured IO module, just import and initialize it. You can parse, clean, extract, chunk, and stage data from files or URLs. Here’s how you use it step by step:


**1. Parse a file or URL:**

```python
from camel.loaders import UnstructuredIO

uio = UnstructuredIO()
example_url = (
    "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-"
    "philadelphia-eagles-spt-intl/index.html"
)
elements = uio.parse_file_or_url(example_url)
print("\n\n".join([str(el) for el in elements]))
```

**2. Clean text data:**

```python
example_dirty_text = ("Some dirty text ’ with  extra spaces and – dashes.")

options = [
    ('replace_unicode_quotes', {}),
    ('clean_dashes', {}),
    ('clean_non_ascii_chars', {}),
    ('clean_extra_whitespace', {}),
]
cleaned_text = uio.clean_text_data(text=example_dirty_text, clean_options=options)
print(cleaned_text)
```

```markdown cleaned_text.md
>>> Some dirty text with extra spaces and dashes.
```
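Conceptually, the cleaning options run as an ordered pipeline: each `(name, kwargs)` pair names a cleaning step that is applied to the text in turn. A stdlib-only sketch of that idea (the functions below are local illustrations that mirror the option names, not Unstructured's actual implementations):

```python
import re

# Local stand-ins for the cleaning steps named in the options list.
def replace_unicode_quotes(text):
    # Map curly quotes to their ASCII equivalents.
    return text.translate(str.maketrans({"\u2018": "'", "\u2019": "'",
                                         "\u201c": '"', "\u201d": '"'}))

def clean_dashes(text):
    # Replace en/em dashes and hyphens with a space.
    return re.sub(r"[\u2013\u2014-]+", " ", text)

def clean_non_ascii_chars(text):
    return text.encode("ascii", errors="ignore").decode("ascii")

def clean_extra_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def clean_text_data(text, clean_options):
    # Apply each named step in order, passing any per-step kwargs.
    steps = {f.__name__: f for f in (replace_unicode_quotes, clean_dashes,
                                     clean_non_ascii_chars, clean_extra_whitespace)}
    for name, kwargs in clean_options:
        text = steps[name](text, **kwargs)
    return text

options = [
    ("replace_unicode_quotes", {}),
    ("clean_dashes", {}),
    ("clean_non_ascii_chars", {}),
    ("clean_extra_whitespace", {}),
]
print(clean_text_data("Some dirty text  with  extra spaces and \u2013 dashes.", options))
# → Some dirty text with extra spaces and dashes.
```

Because each step is a plain `text -> text` function, you can reorder or drop steps by editing the options list alone.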

**3. Extract data from text (for example, emails):**

```python unstructured_io_extract.py
example_email_text = "Contact me at [email protected]."

extracted_text = uio.extract_data_from_text(
    text=example_email_text,
    extract_type="extract_email_address"
)
print(extracted_text)
```


**4. Chunk the content by title:**

```python
chunks = uio.chunk_elements(elements=elements, chunk_type="chunk_by_title")

for chunk in chunks:
    print(chunk)
    print("\n" + "-" * 80)
```


**5. Stage the elements (for example, for Baseplate):**

```python
staged_element = uio.stage_elements(elements=elements, stage_type="stage_for_baseplate")
print(staged_element)
```

This guide gets you started with Unstructured IO. For more, see the Unstructured IO Documentation.


Using Apify Reader

Initialize the Apify client, set up the required actors and parameters, and run the actor.

```python
from camel.loaders import Apify

apify = Apify()

run_input = {
    "startUrls": [{"url": "https://www.camel-ai.org/"}],
    "maxCrawlDepth": 0,
    "maxCrawlPages": 1,
}
actor_result = apify.run_actor(
    actor_id="apify/website-content-crawler", run_input=run_input
)
dataset_result = apify.get_dataset_items(
    dataset_id=actor_result["defaultDatasetId"]
)
print(dataset_result)
```
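Each dataset item is a plain dict whose keys depend on the actor you ran. A hedged sketch of post-processing the crawl results, assuming items carry `url` and `text` keys (typical for `apify/website-content-crawler`, but verify against your actor's output):

```python
# Hypothetical dataset items, shaped like website-content-crawler output.
dataset_result = [
    {"url": "https://www.camel-ai.org/", "text": "CAMEL is a multi-agent framework."},
    {"url": "https://www.camel-ai.org/empty", "text": ""},
]

# Keep only items that actually extracted some text.
pages = [(item["url"], item["text"]) for item in dataset_result if item.get("text")]
for url, text in pages:
    print(f"{url}: {len(text)} chars")
```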


Using Firecrawl Reader

Firecrawl Reader provides a simple way to turn any website into LLM-ready markdown format. Here’s how you can use it step by step:

**1. Initialize the Firecrawl client and start a crawl**

First, create a Firecrawl client and crawl a specific URL.

```python
from camel.loaders import Firecrawl

firecrawl = Firecrawl()
response = firecrawl.crawl(url="https://www.camel-ai.org/about")
print(response["status"])  # Should print "completed" when done
```

When the status is “completed”, the content extraction is done and you can retrieve the results.

**2. Retrieve the extracted markdown content**

Once finished, access the LLM-ready markdown directly from the response:

```python
print(response["data"][0]["markdown"])
```

That’s it. With just a couple of lines, you can turn any website into clean markdown, ready for LLM pipelines or further processing.
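To persist the results, each crawled page can be written to its own markdown file. A minimal sketch, assuming each entry in `response["data"]` carries a `"markdown"` key as in the snippet above (the response dict here is a hypothetical stand-in for a real crawl result):

```python
from pathlib import Path

# Hypothetical crawl response, shaped like the snippet above.
response = {
    "status": "completed",
    "data": [
        {"markdown": "# About CAMEL\n\nCAMEL is a multi-agent framework."},
        {"markdown": "# Team\n\nWho we are."},
    ],
}

# Write each crawled page to its own markdown file.
out_dir = Path("crawl_output")
out_dir.mkdir(exist_ok=True)
for i, page in enumerate(response["data"]):
    (out_dir / f"page_{i}.md").write_text(page["markdown"], encoding="utf-8")

print(sorted(p.name for p in out_dir.iterdir()))
```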


Using Chunkr Reader

Chunkr Reader allows you to process PDFs (and other docs) in chunks, with built-in OCR and format control.
Below is a basic usage pattern:

Initialize the ChunkrReader and ChunkrReaderConfig, set the file path and chunking options, then submit your task and fetch results:

```python
import asyncio
from camel.loaders import ChunkrReader, ChunkrReaderConfig

async def main():
    chunkr = ChunkrReader()

    config = ChunkrReaderConfig(
        chunk_processing=512,      # Example: target chunk length
        ocr_strategy="Auto",       # Example: OCR strategy
        high_resolution=False      # False for faster processing (old "Fast" model)
    )

    # Replace with your actual file path.
    file_path = "/path/to/your/document.pdf"
    try:
        task_id = await chunkr.submit_task(
            file_path=file_path,
            chunkr_config=config,
        )
        print(f"Task ID: {task_id}")

        # Poll and fetch the output.
        if task_id:
            task_output_json_str = await chunkr.get_task_output(task_id=task_id)
            if task_output_json_str:
                print("Task Output:")
                print(task_output_json_str)
            else:
                print(f"Failed to get output for task {task_id}, or task did not succeed/was cancelled.")
    except ValueError as e:
        print(f"An error occurred during task submission or retrieval: {e}")
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}. Please check the path.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    print("To run this example, replace '/path/to/your/document.pdf' with a real file path, ensure CHUNKR_API_KEY is set, and uncomment 'asyncio.run(main())'.")
    # asyncio.run(main()) # Uncomment to run the example
```

A successful task returns a chunked structure like this:

```
>>> Task ID: 7becf001-6f07-4f63-bddf-5633df363bbb
>>> Task Output:
{ "task_id": "7becf001-6f07-4f63-bddf-5633df363bbb", "status": "Succeeded", "created_at": "2024-11-08T12:45:04.260765Z", "finished_at": "2024-11-08T12:45:48.942365Z", "expires_at": null, "message": "Task succeeded", "output": { "chunks": [ { "segments": [ { "segment_id": "d53ec931-3779-41be-a220-3fe4da2770c5", "bbox": { "left": 224.16666, "top": 370.0, "width": 2101.6665, "height": 64.166664 }, "page_number": 1, "page_width": 2550.0, "page_height": 3300.0, "content": "Large Language Model based Multi-Agents: A Survey of Progress and Challenges", "segment_type": "Title", "ocr": null, "image": "https://chunkmydocs-bucket-prod.storage.googleapis.com/.../d53ec931-3779-41be-a220-3fe4da2770c5.jpg?...", "html": "<h1>Large Language Model based Multi-Agents: A Survey of Progress and Challenges</h1>", "markdown": "# Large Language Model based Multi-Agents: A Survey of Progress and Challenges\n\n" } ], "chunk_length": 11 }, { "segments": [ { "segment_id": "7bb38fc7-c1b3-4153-a3cc-116c0b9caa0a", "bbox": { "left": 432.49997, "top": 474.16666, "width": 1659.9999, "height": 122.49999 }, "page_number": 1, "page_width": 2550.0, "page_height": 3300.0, "content": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020", "segment_type": "Text", "ocr": null, "image": "https://chunkmydocs-bucket-prod.storage.googleapis.com/.../7bb38fc7-c1b3-4153-a3cc-116c0b9caa0a.jpg?...", "html": "<p>Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020</p>", "markdown": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020\n\n" } ], "chunk_length": 100 } ] } }
```
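Because `get_task_output` returns the task as a JSON string, the per-chunk markdown can be recovered with the stdlib `json` module. A sketch against an abbreviated, hypothetical output (field names taken from the sample above):

```python
import json

# Abbreviated task output, using the field names shown in the sample above.
task_output_json_str = json.dumps({
    "task_id": "7becf001-6f07-4f63-bddf-5633df363bbb",
    "status": "Succeeded",
    "output": {
        "chunks": [
            {"segments": [{"markdown": "# Large Language Model based Multi-Agents\n\n"}],
             "chunk_length": 11},
            {"segments": [{"markdown": "Taicheng Guo, Xiuying Chen, ...\n\n"}],
             "chunk_length": 100},
        ],
    },
})

task = json.loads(task_output_json_str)
chunk_texts = []
if task["status"] == "Succeeded":
    # Concatenate every segment's markdown, chunk by chunk.
    chunk_texts = [
        "".join(seg["markdown"] for seg in chunk["segments"])
        for chunk in task["output"]["chunks"]
    ]
print(len(chunk_texts))  # → 2
```

The resulting `chunk_texts` list is ready to feed into an embedding or retrieval pipeline one chunk at a time.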

Using Jina Reader

Jina Reader provides a convenient interface to extract clean, LLM-friendly content from any URL in a chosen format (like markdown):

```python
from camel.loaders import JinaURLReader
from camel.types.enums import JinaReturnFormat

jina_reader = JinaURLReader(return_format=JinaReturnFormat.MARKDOWN)
response = jina_reader.read_content("https://docs.camel-ai.org/")
print(response)
```

Using MarkitDown Reader

MarkitDown Reader lets you convert files (like HTML or docs) into LLM-ready markdown with a single line.

```python
from camel.loaders import MarkItDownLoader

loader = MarkItDownLoader()
response = loader.convert_file("demo.html")

print(response)
```

Example output:

```
>>> Welcome to CAMEL’s documentation! — CAMEL 0.2.61 documentation

[Skip to main content](https://docs.camel-ai.org/#main-content)
...
```

Using Mistral Reader

Mistral Reader offers OCR and text extraction from both PDFs and images, whether local or remote. Just specify the file path or URL:

```python
from camel.loaders import MistralReader

mistral_reader = MistralReader()

# Extract text from a PDF URL
url_ocr_response = mistral_reader.extract_text(
    file_path="https://arxiv.org/pdf/2201.04234", pages=[5]
)
print(url_ocr_response)
```

You can also extract from images or local files:

```python
# Extract text from an image URL
image_ocr_response = mistral_reader.extract_text(
    file_path="https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/receipt.png",
    is_image=True,
)
print(image_ocr_response)

# Extract text from a local PDF file
local_ocr_response = mistral_reader.extract_text("path/to/your/document.pdf")
print(local_ocr_response)
```

The response includes structured page data, markdown content, and usage details.

```
>>> pages=[OCRPageObject(index=5, markdown='![img-0.jpeg](./images/img-0.jpeg)\n\nFigure 2: Scatter plot of predicted accuracy versus (true) OOD accuracy. Each point denotes a dif...',
images=[OCRImageObject(id='img-0.jpeg', ...)], dimensions=OCRPageDimensions(...))] model='mistral-ocr-2505-completion' usage_info=...
```
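The per-page markdown can be stitched into a single document. A minimal sketch, using a hypothetical stand-in class for `OCRPageObject` (the real response exposes `.pages`, each with `.index` and `.markdown` as shown above; the stand-in is only for illustration):

```python
from dataclasses import dataclass

@dataclass
class OCRPage:  # Hypothetical stand-in for OCRPageObject.
    index: int
    markdown: str

# Hypothetical pages, shaped like the OCR response above.
pages = [
    OCRPage(index=6, markdown="Figure 3: Ablation results."),
    OCRPage(index=5, markdown="Figure 2: Scatter plot of predicted accuracy."),
]

# Join page markdown in page order into one document.
document_md = "\n\n".join(p.markdown for p in sorted(pages, key=lambda p: p.index))
print(document_md)
```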