Pdfminersix Readthedocs Io en Latest
Release __VERSION__
1 Content
1.1 Tutorials
1.2 How-to guides
1.3 Topics
1.4 API Reference
1.5 Frequently asked questions
2 Features
3 Installation instructions
4 Contributing
Index
pdfminer.six, Release __VERSION__
We fathom PDF.
Pdfminer.six is a Python package for extracting information from PDF documents.
Check out the source on GitHub.
CHAPTER 1
Content
This documentation is organized into four sections (according to the Diátaxis documentation framework). The
Tutorials section helps you set up and use pdfminer.six for the first time. Read this section if this is your first time
working with pdfminer.six. The How-to guides offer specific recipes for solving common problems. Take a look at the
Topics if you want more background information on how pdfminer.six works internally. The API Reference provides
detailed API documentation for all the common classes and functions in pdfminer.six.
1.1 Tutorials
To use pdfminer.six for the first time, you need to install the Python package in your Python environment.
This tutorial requires you to have a system with a working Python and pip installation. If you don’t have one and don’t
know how to install it, take a look at The Hitchhiker’s Guide to Python.
Run the following command on the command line to install pdfminer.six as a Python package:
$ pip install pdfminer.six
Now you can use pdfminer.six as a Python package. But pdfminer.six also comes with a couple of useful command-line
tools. To test if these tools are correctly installed, run the following on your command line:
$ pdf2txt.py --version
pdfminer.six <installed version>
pdfminer.six has several tools that can be used from the command line. The command-line tools are aimed at users
who occasionally want to extract text from a PDF.
Take a look at the high-level or composable interface if you want to use pdfminer.six programmatically.
Examples
pdf2txt.py
$ pdf2txt.py example.pdf
all the text from the pdf appears on the command line
The pdf2txt.py tool extracts all the text from a PDF. It uses layout analysis with sensible defaults to order and group
the text.
dumppdf.py
$ dumppdf.py -a example.pdf
<pdf><object id="1">
...
</object>
...
</pdf>
The dumppdf.py tool can be used to extract the internal structure from a PDF. This tool is primarily for debugging
purposes, but it can be useful to anybody working with PDFs.
>>> print(text)
Hello
World
Hello
World
H e l l o
W o r l d
H e l l o
W o r l d
The command-line tools and the high-level API are just shortcuts for frequently used combinations of pdfminer.six
components. You can use these components to adapt pdfminer.six to your own needs.
For example, to extract the text from a PDF file and save it in a Python variable:
output_string = StringIO()
with open('samples/simple1.pdf', 'rb') as in_file:
print(output_string.getvalue())
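A fuller sketch of this composition is given here, assuming the usual pdfminer.six component classes (PDFParser, PDFDocument, PDFResourceManager, TextConverter, PDFPageInterpreter and PDFPage). The function name extract_with_components is hypothetical and only serves the illustration; the imports are kept inside the function so the sketch can be loaded on a system where pdfminer.six is not installed.

```python
from io import StringIO


def extract_with_components(path):
    """Return the text of the PDF at `path`, built from individual components.

    `extract_with_components` is a hypothetical helper name, not part of pdfminer.six.
    """
    # Imported lazily so that defining this function does not require pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open(path, "rb") as in_file:
        parser = PDFParser(in_file)        # reads the raw PDF syntax
        doc = PDFDocument(parser)          # exposes the document structure
        rsrcmgr = PDFResourceManager()     # holds shared resources such as fonts
        # The device receives the rendered page and writes text to output_string.
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    return output_string.getvalue()
```

With pdfminer.six installed, print(extract_with_components('samples/simple1.pdf')) should print the text of that sample. Swapping TextConverter for another device is the kind of modification this composable interface is meant to allow.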
The high-level functions can be used to achieve common tasks. In this case, we can use extract_pages to iterate over
the layout elements of each page.
Each element will be an LTTextBox, LTFigure, LTLine, LTRect or an LTImage. Some of these can be
iterated further, for example iterating through an LTTextBox will give you an LTTextLine, and these in turn can
be iterated through to get an LTChar. See the diagram here: Layout analysis algorithm.
Let’s say we want to extract all of the text. We could iterate over the elements of each page and call get_text() on
every text container.
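A minimal sketch of that loop follows. It uses duck typing (any element with a get_text method counts as text) in place of an isinstance check against LTTextContainer, so that it also runs without pdfminer.six. The helper name iter_text and the FakeTextBox and FakeFigure classes are hypothetical stand-ins used only to exercise the loop.

```python
def iter_text(pages):
    """Yield the text of every text-bearing element of every page layout."""
    for page_layout in pages:
        for element in page_layout:
            # Duck typing stands in for isinstance(element, LTTextContainer).
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                yield get_text()


# Hypothetical stand-ins for layout objects, used only for this demonstration.
class FakeTextBox:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    """An element without text, which the loop should skip."""


demo_pages = [[FakeTextBox("Hello\n"), FakeFigure(), FakeTextBox("World\n")]]
demo_text = "".join(iter_text(demo_pages))
```

With pdfminer.six installed, the same helper can be fed the real layout, for example "".join(iter_text(extract_pages("example.pdf"))) after importing extract_pages from pdfminer.high_level.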
1.2 How-to guides
Before you start, make sure you have installed pdfminer.six. The second thing you need is a PDF with images. If you
don’t have one, you can download this research paper with images of cats and dogs and save it as example.pdf:
$ curl https://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/parkhi12a.pdf --output example.pdf
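Then run pdf2txt.py with an output directory for the images:
$ pdf2txt.py example.pdf --output-dir cats-and-dogs
This command extracts all the images from the PDF and saves them into the cats-and-dogs directory.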
1.2.2 How to extract AcroForm interactive form fields from a PDF using PDFMiner
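from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSKeyword, PSLiteral
from pdfminer.utils import decode_text

data = {}

def decode_value(value):
    # decode PSLiteral, PSKeyword
    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name
    # decode bytes
    if isinstance(value, bytes):
        value = decode_text(value)
    return value

with open('example.pdf', 'rb') as fp:  # placeholder path to a PDF with form fields
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    res = resolve1(doc.catalog)
    if 'AcroForm' not in res:
        raise ValueError("No AcroForm Found")
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for f in fields:
        field = resolve1(f)
        name, values = field.get('T'), field.get('V')
        # decode name
        name = decode_text(name)
        # resolve indirect object
        values = resolve1(values)
        # decode value(s)
        if isinstance(values, list):
            values = [decode_value(v) for v in values]
        else:
            values = decode_value(values)
        data.update({name: values})
        print(name, values)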
This code snippet will print all the fields’ names and values and save them in the “data” dictionary.
How it works:
• Initialize the parser and the PDFDocument objects
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
• Get the catalog
    res = resolve1(doc.catalog)
• Check if the catalog contains the AcroForm key and raise ValueError if not
(the PDF does not contain AcroForm interactive forms if this key is missing in the catalog, see section
12.7.2 of the PDF 32000-1:2008 specification)
    if 'AcroForm' not in res:
        raise ValueError("No AcroForm Found")
• Get the field list by resolving the entry in the catalog
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for f in fields:
        field = resolve1(f)
• Get the field name and the field value(s)
    name, values = field.get('T'), field.get('V')
• Decode the field name
    name = decode_text(name)
• Resolve indirect field value objects
    values = resolve1(values)
• Call the value decoding method as needed
    if isinstance(values, list):
        values = [decode_value(v) for v in values]
    else:
        values = decode_value(values)
(the decode_value method takes care of decoding the field’s value, returning a string)
• Decode PSLiteral and PSKeyword field values
    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name
• Decode bytes field values
    if isinstance(value, bytes):
        value = decode_text(value)
1.3 Topics
Most PDF files look like they contain well-structured text. But the reality is that a PDF file does not contain anything
that resembles paragraphs, sentences or even words. When it comes to text, a PDF file is only aware of the characters
and their placement.
This makes extracting meaningful pieces of text from PDF files difficult. The characters that compose a paragraph are
no different from those that compose a table, the page footer or the description of a figure. Unlike other document
formats, like a .txt file or a Word document, the PDF format does not contain a stream of text.
A PDF document consists of a collection of objects that together describe the appearance of one or more pages,
possibly accompanied by additional interactive elements and higher-level application data. A PDF file contains the
objects making up a PDF document along with associated structural information, all represented as a single
self-contained sequence of bytes. [1]
PDFMiner attempts to reconstruct some of those structures by using heuristics on the positioning of characters. This
works well for sentences and paragraphs because meaningful groups of nearby characters can be made.
The layout analysis consists of three different stages: it groups characters into words and lines, then it groups lines
into boxes and finally it groups textboxes hierarchically. These stages are discussed in the following sections. The
resulting output of the layout analysis is an ordered hierarchy of layout objects on a PDF page.
The output of the layout analysis heavily depends on a couple of parameters. All these parameters are part of the
LAParams class.
[1] Adobe Systems Inc. (2007). PDF Reference: Adobe Portable Document Format, version 1.7.
The first step in going from characters to text is to group characters in a meaningful way. Each character has an
x-coordinate and a y-coordinate for its bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
uses these bounding boxes to decide which characters belong together.
Characters that are both horizontally and vertically close are grouped onto one line. How close they should be is
determined by the char_margin (M in the figure) and the line_overlap (not in figure) parameters. The horizontal
distance between the bounding boxes of two characters should be smaller than the char_margin, and the vertical
overlap between the bounding boxes should be larger than the line_overlap.
The values of char_margin and line_overlap are relative to the size of the bounding boxes of the characters. The
char_margin is relative to the maximum width of either one of the bounding boxes, and the line_overlap is relative to
the minimum height of either one of the bounding boxes.
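The character grouping test can be sketched in a few lines of plain Python. This is an illustration and not the library's implementation: the Box type and the function names are hypothetical, the default values 2.0 and 0.5 are assumed to match the LAParams defaults, and the sketch assumes that two characters belong on the same line when their vertical overlap exceeds the relative line_overlap.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box: bottom-left (x0, y0) and upper-right (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def horizontal_distance(a: Box, b: Box) -> float:
    """Gap between the boxes along the x-axis; 0 if they overlap horizontally."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))


def vertical_overlap(a: Box, b: Box) -> float:
    """Length of the shared y-range; 0 if the boxes do not overlap vertically."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def on_same_line(a: Box, b: Box, char_margin: float = 2.0, line_overlap: float = 0.5) -> bool:
    # char_margin is relative to the larger width, line_overlap to the smaller height.
    max_gap = char_margin * max(a.width, b.width)
    min_overlap = line_overlap * min(a.height, b.height)
    return horizontal_distance(a, b) < max_gap and vertical_overlap(a, b) > min_overlap
```

For example, on_same_line(Box(0, 0, 10, 12), Box(12, 0, 22, 12)) holds because the gap of 2 is below the absolute margin of 20 and the characters overlap fully in height.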
Spaces need to be inserted between characters because the PDF format has no notion of the space character. A space
is inserted if the characters are further apart than the word_margin (W in the figure). The word_margin is relative to
the maximum width or height of the new character. Having a smaller word_margin creates smaller words. Note that
the word_margin should be smaller than the char_margin; otherwise none of the characters will be separated
by a space.
The result of this stage is a list of lines. Each line consists of a list of characters. These characters are either original
LTChar characters that originate from the PDF file or inserted LTAnno characters that represent spaces between words
or newlines at the end of each line.
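The space insertion rule can be sketched as follows. This is a simplified illustration under stated assumptions: characters are given as hypothetical (text, x0, x1, height) tuples that are already sorted left to right on one line, the default of 0.1 is assumed to match the LAParams default for word_margin, and a plain " " string plays the role of an inserted LTAnno.

```python
def join_with_spaces(chars, word_margin=0.1):
    """Join the characters of one line, inserting spaces at large gaps.

    chars: list of (text, x0, x1, height) tuples sorted from left to right.
    """
    pieces = []
    prev_x1 = None
    for text, x0, x1, height in chars:
        # The margin is relative to the larger of the new character's width and height.
        margin = word_margin * max(x1 - x0, height)
        if prev_x1 is not None and x0 - prev_x1 > margin:
            pieces.append(" ")  # stands in for an inserted LTAnno
        pieces.append(text)
        prev_x1 = x1
    return "".join(pieces)


line = [("H", 0, 6, 10), ("i", 6.5, 9, 10), ("y", 14, 20, 10), ("o", 20.2, 26, 10)]
spaced = join_with_spaces(line)
unspaced = join_with_spaces(line, word_margin=10)
```

Raising word_margin far above the gaps on the line, as in the second call, suppresses every space, which is the failure mode that an overly large word_margin produces.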
The second step is grouping lines in a meaningful way. Each line has a bounding box that is determined by the
bounding boxes of the characters that it contains. Like grouping characters, pdfminer.six uses the bounding boxes to
group the lines.
Lines that are both horizontally overlapping and vertically close are grouped. How vertically close the lines should be
is determined by the line_margin. This margin is specified relative to the height of the bounding box. Lines are close
if the gaps between the tops (see L1 in the figure) and bottoms (see L2 in the figure) of the bounding boxes are smaller
than the absolute line margin, i.e. the line_margin multiplied by the height of the bounding box.
The result of this stage is a list of text boxes. Each box consists of a list of lines.
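The line grouping test can be sketched under one reading of the rule above: the bounding box of the first line is extended up and down by the absolute line margin, and a second line counts as close if it reaches into that extended box and overlaps horizontally. This is an interpretation for illustration only, the function name is hypothetical, and the default of 0.5 is taken from the line-margin default.

```python
def lines_are_close(a, b, line_margin=0.5):
    """a, b: (x0, y0, x1, y1) bounding boxes of two lines; a sets the margin."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    overlaps_horizontally = min(ax1, bx1) > max(ax0, bx0)
    # Absolute margin: the relative line_margin times the height of line a.
    absolute_margin = line_margin * (ay1 - ay0)
    # Vertical gap between the two boxes; 0 if they overlap vertically.
    vertical_gap = max(0.0, max(ay0, by0) - min(ay1, by1))
    return overlaps_horizontally and vertical_gap < absolute_margin


first = (0, 100, 200, 112)          # a line of height 12
next_line = (0, 86, 180, 98)        # 2 units below the first line
far_below = (0, 60, 200, 72)        # 28 units below the first line
other_column = (300, 86, 400, 98)   # close vertically, but no horizontal overlap
```

With a line height of 12 and the default margin, the absolute margin is 6, so only next_line is grouped with first.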
The last step is to group the text boxes in a meaningful way. This step repeatedly merges the two text boxes that are
closest to each other.
The closeness of bounding boxes is computed as the area that is between the two text boxes (the blue area in the
figure). In other words, it is the area of the bounding box that surrounds both lines, minus the area of the bounding
boxes of the individual lines.
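The distance measure described above translates directly into code. The sketch below uses hypothetical function names and plain (x0, y0, x1, y1) tuples; it computes the area of the enclosing bounding box minus the areas of the two boxes, and picks the pair with the smallest distance, which is the pair that would be merged first.

```python
from itertools import combinations


def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def box_distance(a, b):
    """Area of the box enclosing a and b, minus the areas of a and b."""
    enclosing = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return area(enclosing) - area(a) - area(b)


def closest_pair(boxes):
    """Return the indices (i, j) of the two boxes with the smallest distance."""
    return min(
        combinations(range(len(boxes)), 2),
        key=lambda pair: box_distance(boxes[pair[0]], boxes[pair[1]]),
    )


boxes = [(0, 0, 10, 10), (0, 12, 10, 22), (50, 0, 60, 10)]
pair = closest_pair(boxes)
```

Here the first two boxes are stacked with a small gap, so they are closer to each other (distance 20) than either is to the third box.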
The algorithm described above assumes that all characters have the same orientation. However, any writing direction
is possible in a PDF. To accommodate this, pdfminer.six allows detecting vertical writing with the detect_vertical
parameter. This applies all the grouping steps as if the PDF were rotated 90 (or 270) degrees.
1.4 API Reference
pdf2txt.py
A command-line tool for extracting text and images from a PDF and outputting them as plain text, HTML, XML or tags.
Positional Arguments
Named Arguments
Parser
Layout analysis
--line-margin, -L   If two lines are close together they are considered to be part of the same paragraph.
                    The margin is specified relative to the height of a line.
                    Default: 0.5
--boxes-flow, -F    Specifies how much the horizontal and vertical position of a text matters when
                    determining the order of lines. The value should be within the range of -1.0 (only
                    horizontal position matters) to +1.0 (only vertical position matters). You can also
                    pass disabled to disable advanced layout analysis, and instead return text based
                    on the position of the bottom left corner of the text box.
                    Default: 0.5
--all-texts, -A     Whether layout analysis should be performed on text in figures.
                    Default: False
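The effect of --boxes-flow can be illustrated with a weighted sort key. The key below is a hypothetical choice, picked so that it reproduces the two extremes described for this option (only the horizontal position matters at -1.0, only the vertical position at +1.0); the ordering that pdfminer.six actually applies may differ in detail. Boxes are (x0, y0, x1, y1) tuples in PDF coordinates, where y grows upwards.

```python
def reading_order(boxes, boxes_flow=0.5):
    """Sort text boxes with a weighted mix of horizontal and vertical position."""
    def key(box):
        x0, y0, x1, y1 = box
        # A small x0 (left) and a large y0 + y1 (near the top) sort first.
        return (1 - boxes_flow) * x0 - (1 + boxes_flow) * (y0 + y1)

    return sorted(boxes, key=key)


top_right = (300, 700, 500, 720)
bottom_left = (50, 100, 250, 120)

vertical_only = reading_order([bottom_left, top_right], boxes_flow=1.0)
horizontal_only = reading_order([top_right, bottom_left], boxes_flow=-1.0)
```

With boxes_flow=1.0 the box nearest the top of the page comes first regardless of its x-position, and with boxes_flow=-1.0 the leftmost box comes first regardless of its height on the page.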
Output
dumppdf.py
Positional Arguments
Named Arguments
Parser
Output
extract_text
extract_text_to_fp
• output_type – May be ‘text’, ‘xml’, ‘html’, ‘hocr’, ‘tag’. Only ‘text’ works properly.
• codec – Text decoding codec
• laparams – An LAParams object from pdfminer.layout. Default is None but may not
layout correctly.
• maxpages – How many pages to stop parsing after
• page_numbers – zero-indexed page numbers to operate on.
• password – For encrypted PDFs, the password to decrypt.
• scale – Scale factor
• rotation – Rotation factor
• layoutmode – Default is ‘normal’, see pdfminer.converter.HTMLConverter
• output_dir – If given, creates an ImageWriter for extracted images.
• strip_control – Whether to strip control characters from the XML output
• debug – Output more logging data
• disable_caching – Whether to disable the caching of resources such as fonts
• other –
Returns nothing, acting as it does on two streams. Use StringIO to get strings.
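Because extract_text_to_fp writes to a stream instead of returning a value, a StringIO buffer is a convenient way to get the result as a string. The helper below is a minimal sketch: the name pdf_to_string is hypothetical, and the import is kept inside the function so the sketch can be loaded where pdfminer.six is not installed.

```python
from io import StringIO


def pdf_to_string(path):
    """Return the text that extract_text_to_fp writes for the PDF at `path`.

    `pdf_to_string` is a hypothetical helper name, not part of pdfminer.six.
    """
    # Imported lazily so that defining this function does not require pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    output = StringIO()
    with open(path, "rb") as in_file:
        # The function returns nothing; the text ends up in the output stream.
        extract_text_to_fp(in_file, output, laparams=LAParams())
    return output.getvalue()
```

Passing an LAParams object enables layout analysis, which the parameter list above suggests is needed for output that is laid out correctly.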
extract_pages
LAParams
Todo:
• PDFDevice
– TextConverter
– PDFPageAggregator
• PDFPageInterpreter
1.5 Frequently asked questions
Pdfminer.six is a fork of the original pdfminer created by Euske. Almost all of the code and architecture were in fact
created by Euske. But until 2020 the original pdfminer only supported Python 2. The original goal of pdfminer.six
was to add support for Python 3. This was done with the six package, which helps to write code that is compatible
with both Python 2 and Python 3. Hence, pdfminer.six.
As of 2020, pdfminer.six dropped the support for Python 2 because it was end-of-life. While the .six part is no longer
applicable, we kept the name to prevent breaking changes for existing users.
The current punchline “We fathom PDF” is a whimsical reference to the six. To fathom means to understand something
deeply, and a fathom is also a unit of length equal to six feet.
Pdfminer.six is now an independent and community-maintained package for extracting text from PDFs with Python.
We actively fix bugs (also for PDFs that don’t strictly follow the PDF Reference), add new features and improve the
usability of pdfminer.six. This community separates pdfminer.six from the other forks of the original pdfminer. PDF
as a format is very diverse and there are countless deviations from the official format. The only way to support all the
PDFs out there is to have a community that actively uses and improves pdfminer.
Since 2020, the original pdfminer is dormant, and pdfminer.six is the fork which Euske recommends if you need an
actively maintained version of pdfminer.
One of the most common issues with pdfminer.six is that the textual output contains raw character IDs (cid:x). This
is often confusing because the text is shown fine in a PDF viewer and other text from the same PDF is
extracted properly.
The underlying problem is that a PDF has two different representations of each character. Each character is mapped to
a glyph that determines how the character is shown in a PDF viewer. And each character is also mapped to its Unicode
value that is used when copy-pasting the character. Some PDFs have incomplete Unicode mappings and therefore it
is impossible to convert the character to Unicode. In these cases pdfminer.six defaults to showing the raw character ID
(cid:x).
A quick test to see if pdfminer.six should be able to do better is to copy-paste the text from a PDF viewer to a text
editor. If the result is proper text, pdfminer.six should also be able to extract proper text. If the result is gibberish,
pdfminer.six will also not be able to convert the characters to unicode.
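To check how badly an extracted text is affected, the raw character IDs can be found with a regular expression. The sketch below uses only the standard library; the pattern assumes the (cid:x) form shown above with x being a decimal number, and the names CID_PATTERN and unmapped_cids are hypothetical.

```python
import re

# Matches markers such as "(cid:72)"; the decimal form of x is an assumption.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def unmapped_cids(text):
    """Return the character IDs that appear as raw (cid:x) markers in the text."""
    return [int(cid) for cid in CID_PATTERN.findall(text)]


sample = "Total (cid:72)(cid:88)r 12"
found = unmapped_cids(sample)
```

An empty result means that no marker of this form is present in the text; a long list suggests that the PDF has incomplete Unicode mappings for at least one of its fonts.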
References:
1. Chapter 5: Text, PDF Reference 1.7
2. Text: PDF, Wikipedia
CHAPTER 2
Features
CHAPTER 3
Installation instructions
from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)
CHAPTER 4
Contributing
We welcome any contributors to pdfminer.six! But, before doing anything, take a look at the contribution guide.
Index
E
extract_pages() (in module pdfminer.high_level)
extract_text() (in module pdfminer.high_level)
extract_text_to_fp() (in module pdfminer.high_level)
L
LAParams (class in pdfminer.layout)