Parsing options
Set language
LlamaParse uses OCR to extract text from images. Our OCR supports a long list of languages. You can specify one or more languages by separating them with a comma. This only affects text extracted from images.
- Python
- API
parser = LlamaParse(
language="fr"
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'language="fr"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
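Because several languages are passed as one comma-separated string, it can be convenient to build that string from a list. The sketch below uses a hypothetical helper, language_param, which is not part of the LlamaParse client; the codes shown are only examples and are assumed to be in the supported list.

```python
def language_param(codes):
    """Join language codes into a single comma-separated string.

    Hypothetical helper for illustration; not part of LlamaParse.
    """
    cleaned = []
    for code in codes:
        code = code.strip()
        # Drop blanks and duplicates while keeping the original order.
        if code and code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return ",".join(cleaned)


print(language_param(["fr", "en"]))  # prints fr,en
```

The resulting string can then be passed as the language value in either the Python or the API form.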
Disable OCR
By default, LlamaParse runs OCR on images embedded in the document. You can disable it with disable_ocr=True.
- Python
- API
parser = LlamaParse(
disable_ocr=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'disable_ocr="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
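The curl examples send boolean options as the lowercase string "true" in the multipart form, while the Python client takes True. If you assemble the form fields yourself, a small conversion step along these lines may help. to_form_fields is a hypothetical helper written for this illustration; sending "false" for False and omitting None values are assumptions, not documented behavior.

```python
def to_form_fields(options):
    """Convert Python option values to form-field strings.

    Hypothetical helper; mirrors the "true" strings used in the curl examples.
    """
    fields = {}
    for name, value in options.items():
        if value is None:
            # Assumption: unset options are simply left out of the form.
            continue
        if isinstance(value, bool):
            # Assumption: False is sent as "false", by analogy with "true".
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


print(to_form_fields({"disable_ocr": True, "language": "fr"}))
```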
Skip diagonal text
By default, LlamaParse attempts to parse text that runs diagonally across the page. This can be useful for some documents, but it can also introduce noise and errors. To skip diagonal text, set skip_diagonal_text=True.
- Python
- API
parser = LlamaParse(
skip_diagonal_text=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'skip_diagonal_text="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Do not unroll columns
By default, LlamaParse tries to unroll columns into reading order. Set do_not_unroll_columns=True to prevent LlamaParse from doing so.
- Python
- API
parser = LlamaParse(
do_not_unroll_columns=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'do_not_unroll_columns="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Target pages
By default, all pages are extracted. To parse specific pages only, set target_pages to a comma-separated string of page numbers. Page numbering starts at 0.
- Python
- API
parser = LlamaParse(
target_pages="0,2,7"
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'target_pages="0,2,7"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
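Since page numbering starts at 0, page numbers as shown in a PDF viewer (which usually start at 1) need to be shifted by one. A minimal sketch of that conversion follows; target_pages_param is a hypothetical helper, not part of LlamaParse, and sorting and de-duplicating the pages is a choice made here for tidiness.

```python
def target_pages_param(one_based_pages):
    """Build a target_pages string from 1-based page numbers.

    Hypothetical helper for illustration; not part of LlamaParse.
    """
    # Shift to 0-based numbering, dropping duplicates and sorting.
    zero_based = sorted({page - 1 for page in one_based_pages})
    if zero_based and zero_based[0] < 0:
        raise ValueError("page numbers must be 1 or greater")
    return ",".join(str(page) for page in zero_based)


print(target_pages_param([1, 3, 8]))  # prints 0,2,7
```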
Page separator
By default, LlamaParse separates pages in the markdown and text output with \n---\n. You can change this separator by setting page_separator to the desired string.
You can also include the page number in the separator by using {pageNumber} in the string. It will be replaced by the page number of the next page.
- Python
- API
parser = LlamaParse(
page_separator="\n=================\n",
# page_separator="\n== {pageNumber} ==\n" # Will transform to "\n== 4 ==\n" to separate page 3 and 4.
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'page_separator="\n== {pageNumber} ==\n"' \
--form 'page_prefix="START OF PAGE: {pageNumber}\n"' \
--form 'page_suffix="\nEND OF PAGE: {pageNumber}"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
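A numbered separator makes it straightforward to split the output back into pages afterwards. The sketch below assumes the separator "\n== {pageNumber} ==\n" from the example and works on a plain string; split_pages is a hypothetical helper, and the sample text is made up. Because the separator carries the number of the next page, the first chunk has no number attached and is returned with None.

```python
import re

# Matches the separator "\n== {pageNumber} ==\n" after substitution.
SEPARATOR_PATTERN = re.compile(r"\n== (\d+) ==\n")


def split_pages(text):
    """Split output into (page_number, page_text) pairs.

    Hypothetical helper; the first page has no preceding separator,
    so its page number is reported as None.
    """
    # With one capture group, re.split returns
    # [first_text, "2", second_text, "3", third_text, ...].
    parts = SEPARATOR_PATTERN.split(text)
    pages = [(None, parts[0])]
    for number, body in zip(parts[1::2], parts[2::2]):
        pages.append((int(number), body))
    return pages


sample = "alpha\n== 2 ==\nbeta\n== 3 ==\ngamma"
print(split_pages(sample))
```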
Page prefix and suffix
You can specify a prefix or a suffix to be added to each page. These strings can also contain {pageNumber}, which is replaced by the current page number. Both parameters are optional and empty by default.
- Python
- API
parser = LlamaParse(
page_prefix="START OF PAGE: {pageNumber}\n",
page_suffix="\nEND OF PAGE: {pageNumber}",
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'page_prefix="START OF PAGE: {pageNumber}\n"' \
--form 'page_suffix="\nEND OF PAGE: {pageNumber}"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
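With both a prefix and a suffix in place, each page is wrapped in markers that can be matched afterwards. The sketch below assumes the prefix and suffix strings from the example and assumes they sit directly around the page content; pages_by_number is a hypothetical helper and the sample text is made up.

```python
import re

# Prefix "START OF PAGE: N\n", page body, suffix "\nEND OF PAGE: N".
# \1 requires the suffix to carry the same number as the prefix, and
# (?!\d) prevents "1" from matching the start of "10".
PAGE_PATTERN = re.compile(
    r"START OF PAGE: (\d+)\n(.*?)\nEND OF PAGE: \1(?!\d)",
    re.DOTALL,
)


def pages_by_number(text):
    """Map each page number to the text found between its markers."""
    return {int(number): body for number, body in PAGE_PATTERN.findall(text)}


sample = (
    "START OF PAGE: 1\nhello\nEND OF PAGE: 1"
    "\n---\n"
    "START OF PAGE: 2\nworld\nEND OF PAGE: 2"
)
print(pages_by_number(sample))
```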
Bounding box
Specify the area of a document that you want to parse. This can be helpful to remove headers and footers. To do so, provide the bounding box margins in clockwise order from the top (top, right, bottom, left) as a comma-separated string. Each margin is expressed as a fraction of the page size, a number between 0 and 1.
Examples:
- To exclude the top 10% of a document: bounding_box="0.1,0,0,0"
- To exclude the top 10% and bottom 20% of a document: bounding_box="0.1,0,0.2,0"
- Python
- API
parser = LlamaParse(
bounding_box="0.1,0,0.2,0"
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'bounding_box="0.1,0,0.2,0"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
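The margin order is easy to get wrong when writing the string by hand, so it may help to build it from named values. The following is a minimal sketch; bounding_box_param is a hypothetical helper, not part of LlamaParse, and it only checks that each margin lies between 0 and 1.

```python
def bounding_box_param(top=0.0, right=0.0, bottom=0.0, left=0.0):
    """Build a bounding_box string in clockwise order from the top.

    Hypothetical helper for illustration; not part of LlamaParse.
    """
    margins = (top, right, bottom, left)
    for margin in margins:
        if not 0 <= margin <= 1:
            raise ValueError("each margin must be a fraction between 0 and 1")
    # The "g" format renders 0.0 as "0" and 0.1 as "0.1".
    return ",".join(format(margin, "g") for margin in margins)


print(bounding_box_param(top=0.1, bottom=0.2))  # prints 0.1,0,0.2,0
```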
Take screenshot
Take a screenshot of each page and add it to the JSON output in the following format:
{
"images": [
{
"name": "page_1.jpg",
"height": 792,
"width": 612,
"x": 0,
"y": 0,
"type": "full_page_screenshot"
}
]
}
- Python
- API
parser = LlamaParse(
take_screenshot=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'take_screenshot="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
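Once the JSON output is available, the screenshot entries can be picked out by their type field. The sketch below works on an object shaped like the one shown above; where that object sits inside the full JSON result is not specified here, so the function takes it directly. full_page_screenshots is a hypothetical helper.

```python
import json

# Sample object in the format shown above.
SAMPLE = """
{
  "images": [
    {
      "name": "page_1.jpg",
      "height": 792,
      "width": 612,
      "x": 0,
      "y": 0,
      "type": "full_page_screenshot"
    }
  ]
}
"""


def full_page_screenshots(payload):
    """Return the names of images marked as full-page screenshots."""
    return [
        image["name"]
        for image in payload.get("images", [])
        if image.get("type") == "full_page_screenshot"
    ]


print(full_page_screenshots(json.loads(SAMPLE)))  # prints ['page_1.jpg']
```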
Disable image extraction
You can disable image extraction for better performance by setting disable_image_extraction=True.
- Python
- API
parser = LlamaParse(
disable_image_extraction=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'disable_image_extraction="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Extract multiple tables per sheet in spreadsheets
By default, LlamaParse extracts each sheet of a spreadsheet as one table. With spreadsheet_extract_sub_tables=True, LlamaParse will try to identify sheets that contain multiple tables and return them as separate tables.
- Python
- API
parser = LlamaParse(
spreadsheet_extract_sub_tables=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'spreadsheet_extract_sub_tables="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Output table as HTML in markdown
A common issue with markdown tables is that they do not handle merged cells well. You can ask LlamaParse to return tables as HTML with colspan and rowspan attributes to get a better representation of the table. When output_tables_as_HTML=True, tables present in the markdown will be output as HTML tables.
- Python
- API
parser = LlamaParse(
output_tables_as_HTML=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'output_tables_as_HTML="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
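To show what colspan and rowspan convey, the sketch below expands a single HTML table into a rectangular grid in which a merged cell's text is repeated in every position it covers. It uses only Python's standard html.parser module. TableGridParser and table_to_grid are hypothetical names, the sample table is made up, and the code assumes one simple, non-nested table per input.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells as (text, colspan, rowspan), row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text parts of the open cell, None outside a cell
        self._span = (1, 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attributes = dict(attrs)
            colspan = int(attributes.get("colspan") or 1)
            rowspan = int(attributes.get("rowspan") or 1)
            self._span = (colspan, rowspan)
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            self.rows[-1].append((text, *self._span))
            self._cell = None


def table_to_grid(html):
    """Expand merged cells so every grid position holds its cell text."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}
    for r, row in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


SAMPLE_TABLE = (
    '<table><tr><th colspan="2">Name</th><th>Age</th></tr>'
    '<tr><td rowspan="2">A</td><td>x</td><td>1</td></tr>'
    "<tr><td>y</td><td>2</td></tr></table>"
)
for line in table_to_grid(SAMPLE_TABLE):
    print(line)
```

A markdown table has no way to say that "Name" spans two columns or that "A" spans two rows, which is the information the HTML form keeps.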
Preserve alignment across pages
If preserve_layout_alignment_across_pages=True is set, LlamaParse will try to keep text aligned across pages in text mode. This is useful for documents with tables or alignment that continue across pages.
- Python
- API
parser = LlamaParse(
preserve_layout_alignment_across_pages=True
)
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'preserve_layout_alignment_across_pages="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'