Commit ac583a1

Dean Malmgren authored
Merge pull request deanmalmgren#411 from jhale1805/non_agpl_epub_extractor
Remove EbookLib dependency
2 parents 902028f + e81913b commit ac583a1

4 files changed: 45 additions and 34 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,9 @@ var/
 pip-log.txt
 pip-delete-this-directory.txt
 
+# Virtual environments
+**/venv*
+
 # Unit test / coverage reports
 htmlcov/
 .tox/

requirements/python

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@ argcomplete~=1.10.0
 beautifulsoup4~=4.8.0
 chardet==3.*
 docx2txt~=0.8
-EbookLib==0.*
 extract-msg<=0.29.* #Last with python2 support
 pdfminer.six==20191110 #Last with python2 support
 python-pptx~=0.6.18

tests/epub/raw_text.txt

Lines changed: 0 additions & 19 deletions
@@ -1,11 +1,7 @@
-
 Epub testing
 With subtitle...
-
 Introduction
 Welcome here! All the text have ben generate with the Samuel L lorem ipsum.
-
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -16,7 +12,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -27,7 +22,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -38,7 +32,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -49,18 +42,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
-We happy?
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-We happy?
-The lysine contingency - it's intended to prevent the spread of the animals is case they ever got off the island. Dr. Wu inserted a gene that makes a single faulty enzyme in protein metabolism. The animals can't manufacture the amino acid lysine. Unless they're continually supplied with lysine by us, they'll slip into a coma and die.
-Oh... what I'm gon' do?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-No man, I don't eat pork
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-Is she dead, yes or no?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
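
Note on the fixture change: eight of the nineteen deleted lines are blank. That is consistent with the parser change in this commit, which strips each tag's text and appends only non-empty results. The following is a minimal sketch of that filter in isolation, not the committed code; `join_nonempty` is a hypothetical name and the sample inputs are illustrative.

```python
def join_nonempty(texts):
    """Strip each text fragment and join the non-empty ones, one per line.

    Mirrors the strip-then-skip logic added to the parser in this commit,
    but operates on plain strings so it can run without bs4 or lxml.
    """
    result = ''
    for text in texts:
        # Guard against None as well as empty or whitespace-only strings.
        inner_text = text.strip() if text else ""
        if inner_text:
            result += inner_text + '\n'
    return result


# Illustrative input: whitespace-only fragments no longer produce blank lines.
sample = ["", "Epub testing", "  With subtitle...  ", None, "\n"]
print(join_nonempty(sample), end="")
```

With the previous `result = result + child.text + '\n'`, every empty fragment would have contributed its own newline, which would account for the blank lines dropped from the expected output.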

textract/parsers/epub_parser.py

Lines changed: 42 additions & 14 deletions
@@ -1,25 +1,53 @@
-from ebooklib import epub, ITEM_DOCUMENT
+import zipfile
 from bs4 import BeautifulSoup
 
 from .utils import BaseParser
 
 
 class Parser(BaseParser):
-    """Extract text from epub using python epub library
-    """
+    """Extract text from epub"""
 
     def extract(self, filename, **kwargs):
-        book = epub.read_epub(filename)
+        book = zipfile.ZipFile(filename)
         result = ''
-        for id, _ in book.spine:
-            item = book.get_item_with_id(id)
-            # Don't fail with some AttributeError exception when the item is of NoneType
-            # (i.e. at the last position).
-            if item is None:
+        for text_name in self.__epub_sections(book):
+            if not text_name.endswith("html"):
                 continue
-            soup = BeautifulSoup(item.content, 'lxml')
-            for child in soup.find_all(
-                ['title', 'p', 'div', 'h1', 'h2', 'h3', 'h4']
-            ):
-                result = result + child.text + '\n'
+            soup = BeautifulSoup(book.open(text_name), features='lxml')
+            html_content_tags = ['title', 'p', 'h1', 'h2', 'h3', 'h4']
+            for child in soup.find_all(html_content_tags):
+                inner_text = child.text.strip() if child.text else ""
+                if inner_text:
+                    result += inner_text + '\n'
         return result
+
+    def __epub_sections(self, book):
+        opf_paths = self.__get_opf_paths(book)
+        item_paths = self.__get_item_paths(book, opf_paths)
+        return item_paths
+
+    def __get_opf_paths(self, book):
+        meta_inf = book.open("META-INF/container.xml")
+        meta_soup = BeautifulSoup(meta_inf, features='lxml')
+        return [f["full-path"] for f in meta_soup.rootfiles.find_all("rootfile")]
+
+    def __get_item_paths(self, book, opf_paths):
+        item_paths = []
+        for opf_path in opf_paths:
+            opf_soup = BeautifulSoup(book.open(opf_path), "lxml")
+            epub_items = opf_soup.spine.find_all("itemref")
+            for epub_item in epub_items:
+                item = self.__get_item(opf_soup, epub_item["idref"])
+                item_paths.append(self.__get_full_item_path(book, item["href"]))
+        return item_paths
+
+    def __get_item(self, opf_soup, item_id):
+        for item in opf_soup.manifest.find_all("item"):
+            if item["id"] == item_id:
+                return item
+        return None
+
+    def __get_full_item_path(self, book, partial_path):
+        for filename in book.namelist():
+            if filename.endswith(partial_path):
+                return filename
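
Note on the lookup: the new parser reads `META-INF/container.xml` to find each OPF file, walks the OPF spine in order, and maps every `itemref` to a manifest `href`. The sketch below shows the same lookup using only the standard library so it can run without bs4 or lxml. It is not the committed implementation. `spine_paths` and `make_sample_epub` are hypothetical names, and the in-memory archive is an illustrative fixture. Two choices here are assumptions of the sketch rather than behavior of the commit: hrefs are resolved relative to the OPF file's directory instead of by suffix matching against `namelist()`, and a dangling `idref` is skipped instead of raising.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the container file and the OPF package document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_paths(book):
    """Return archive paths of spine items, in reading order."""
    container = ET.fromstring(book.read("META-INF/container.xml"))
    paths = []
    for rootfile in container.findall(".//c:rootfile", CONTAINER_NS):
        opf_path = rootfile.attrib["full-path"]
        opf = ET.fromstring(book.read(opf_path))
        # Build the id -> href map once instead of rescanning per itemref.
        hrefs = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            href = hrefs.get(itemref.attrib["idref"])
            if href is None:
                continue  # dangling idref: skip rather than raise
            paths.append(posixpath.normpath(posixpath.join(base, href)))
    return paths


def make_sample_epub():
    """Build a tiny EPUB-like archive in memory (illustrative fixture)."""
    container = (
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles>'
        '</container>'
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf">'
        '<manifest>'
        '<item id="c1" href="ch1.xhtml"/>'
        '<item id="c2" href="ch2.xhtml"/>'
        '</manifest>'
        '<spine>'
        '<itemref idref="c2"/><itemref idref="missing"/><itemref idref="c1"/>'
        '</spine>'
        '</package>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("META-INF/container.xml", container)
        archive.writestr("OEBPS/content.opf", opf)
        archive.writestr("OEBPS/ch1.xhtml", "<html><body><p>One</p></body></html>")
        archive.writestr("OEBPS/ch2.xhtml", "<html><body><p>Two</p></body></html>")
    buf.seek(0)
    return zipfile.ZipFile(buf)


book = make_sample_epub()
print(spine_paths(book))
```

The fixture lists `c2` before `c1` in the spine to show that the result follows spine order rather than manifest order.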
