Commit 03214b8

Merge pull request #230 from Roche/dev
version 1.2.2
2 parents 06dbeec + fe3f86d commit 03214b8

18 files changed, with 1179 additions and 1002 deletions.

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ authors:
 given-names: "Otto"
 orcid: "https://orcid.org/0000-0002-3363-9287"
 title: "Pyreadstat"
-version: 1.2.1
+version: 1.2.2
 doi: 10.5281/zenodo.6612282
 date-released: 2018-09-24
 url: "https://github.com/Roche/pyreadstat"

README.md

Lines changed: 10 additions & 1 deletion
@@ -330,7 +330,8 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=["variab
 #### Reading files in parallel processes

 A challenge when reading large files is the time consumed in the operation. In order to alleviate this
-pyreadstat provides a function "read_file_multiprocessing" to read a file in parallel processes using the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
+pyreadstat provides a function "read\_file\_multiprocessing" to read a file in parallel processes using
+the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
 that is not the case look at Reading rows in chunks (next section)

 Speed ups in the process will depend on a number of factors such as number of processes available, RAM,
@@ -351,6 +352,11 @@ import multiprocessing
 num_processes = multiprocessing.cpu_count()
 ```

+**Notes for Xport, Por and some defective SAV files not having the number of rows in the metadata**
+1. In all Xport, Por and some defective SAV files, the number of rows cannot be determined from the metadata. In such cases,
+you can use the parameter num\_rows to be equal or larger to the number of rows in the dataset. This number can be obtained
+reading the file without multiprocessing, reading in another application, etc.
+
 **Notes for windows**

 1. For this to work you must include a __name__ == "__main__" section in your script. See [this issue](#85)
@@ -410,6 +416,9 @@ for df, meta in reader:
 # do some cool calculations here for the chunk
 ```

+**If using multiprocessing, please read the notes in the previous section regarding Xport, Por and some defective SAV files not
+having the number of rows in the metadata**
+
 **For Windows, please check the notes on the previous section reading files in parallel processes**

 #### Reading value labels
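
The new README note says the row count "can be obtained reading the file without multiprocessing". A minimal sketch of that two-pass approach for an Xport file follows. The helper names `count_rows` and `read_xport_in_parallel` are hypothetical and not part of pyreadstat, and the cap of 4 processes is an assumption made for this example; only `read_file_in_chunks`, `read_file_multiprocessing`, `read_xport` and the `num_rows` parameter come from the library.

```python
import multiprocessing


def count_rows(chunks):
    """Sum the row counts of the (df, meta) pairs yielded by a chunked reader."""
    return sum(len(df) for df, _meta in chunks)


def read_xport_in_parallel(file_path, num_processes=None):
    """Count the rows in a first serial pass, then read with multiprocessing."""
    # Imported lazily so that count_rows can be used without pyreadstat installed.
    import pyreadstat

    if num_processes is None:
        # Assumption for this sketch: use at most 4 processes.
        num_processes = min(4, multiprocessing.cpu_count())

    # First pass without multiprocessing: only one chunk is in memory at a time.
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_xport, file_path)
    num_rows = count_rows(reader)

    # Second pass: num_rows supplies the row count that the metadata lacks.
    return pyreadstat.read_file_multiprocessing(
        pyreadstat.read_xport,
        file_path,
        num_processes=num_processes,
        num_rows=num_rows,
    )
```

A call would look like `df, meta = read_xport_in_parallel("/path/to/a/file.xpt")`. On Windows it has to sit inside a `__name__ == "__main__"` section, as the Windows notes in the README require. The first pass costs a full serial read, so this likely only pays off when the same file is read repeatedly or the count can be cached.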

change_log.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+# 1.2.1 (github, pypi and conda 2023.06.01)
+* added num_rows to multiprocessing to allow processing of xport, por and
+sav files not having the number of rows in the metadata.
+
 # 1.2.1 (github, pypi and conda 2023.02.22)
 * Readstat source updated to version 1.1.9
 * introduced recognition for pandas datatype datetime64[ns, UTC] and other datetime64 types when writing,
13 Bytes
Binary file not shown.

docs/_build/doctrees/index.doctree

5.65 KB
Binary file not shown.

docs/_build/html/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: dc63e4405a0437fb9efe8c4f5ffb3848
+config: 321f78fa88a773e9f9ed9c32944f2233
 tags: 645f666f9bcd5a90fca523b33c5a78b7

docs/_build/html/_static/documentation_options.js

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 var DOCUMENTATION_OPTIONS = {
 URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
-VERSION: '1.2.1',
+VERSION: '1.2.2',
 LANGUAGE: 'None',
 COLLAPSE_INDEX: false,
 BUILDER: 'html',

docs/_build/html/genindex.html

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 <head>
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-<title>Index &mdash; pyreadstat 1.2.1 documentation</title>
+<title>Index &mdash; pyreadstat 1.2.2 documentation</title>
 <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
 <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
 <!--[if lt IE 9]>

docs/_build/html/index.html

Lines changed: 15 additions & 3 deletions
@@ -4,7 +4,7 @@
 <meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />

 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.2.1 documentation</title>
+<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.2.2 documentation</title>
 <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
 <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
 <!--[if lt IE 9]>
@@ -172,6 +172,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <dt class="sig sig-object py" id="pyreadstat.pyreadstat.read_file_in_chunks">
 <span class="sig-prename descclassname"><span class="pre">pyreadstat.pyreadstat.</span></span><span class="sig-name descname"><span class="pre">read_file_in_chunks</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_in_chunks" title="Permalink to this definition"></a></dt>
 <dd><p>Returns a generator that will allow to read a file in chunks.</p>
+<p>If using multiprocessing, for Xport, Por and some defective sav files where the number of rows in the dataset canot be obtained from the metadata,
+the parameter num_rows must be set to a number equal or larger than the number of rows in the dataset. That information must
+be obtained by the user before running this function.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
@@ -182,6 +185,11 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <li><p><strong>limit</strong> (<em>integer</em><em>, </em><em>optional</em>) – stop reading the file after certain number of rows, will be added to offset</p></li>
 <li><p><strong>multiprocess</strong> (<em>bool</em><em>, </em><em>optional</em>) – use multiprocessing to read each chunk?</p></li>
 <li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – in case multiprocess is true, how many workers/processes to spawn?</p></li>
+<li><p><strong>num_rows</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of rows in the dataset. If using multiprocessing it is obligatory for files where
+the number of rows cannot be obtained from the medatata, such as xport, por and
+some defective sav files. The user must obtain this value by reading the file without multiprocessing first or any other means. A number
+larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata or not using
+multiprocessing.</p></li>
 <li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.</p></li>
 </ul>
 </dd>
@@ -200,14 +208,18 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <dt class="sig sig-object py" id="pyreadstat.pyreadstat.read_file_multiprocessing">
 <span class="sig-prename descclassname"><span class="pre">pyreadstat.pyreadstat.</span></span><span class="sig-name descname"><span class="pre">read_file_multiprocessing</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_multiprocessing" title="Permalink to this definition"></a></dt>
 <dd><p>Reads a file in parallel using multiprocessing.
-Xport and Por files are not supported as they do not have the number of rows recorded in the metadata,
-information needed for this function.</p>
+For Xport, Por and some defective sav files where the number of rows in the dataset canot be obtained from the metadata,
+the parameter num_rows must be set to a number equal or larger than the number of rows in the dataset. That information must
+be obtained by the user before running this function.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
 <li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
 <li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
 <li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the min 4 and the max cores on the computer</p></li>
+<li><p><strong>num_rows</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of rows in the dataset. Obligatory for files where the number of rows cannot be obtained from the medatata, such as xport, por and
+some defective sav files. The user must obtain this value by reading the file without multiprocessing first or any other means. A number
+larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata.</p></li>
 <li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.</p></li>
 </ul>
 </dd>
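
The docstrings above state that the row count is "information needed for this function" but not why. One plausible reading is that the rows have to be divided among the worker processes before any of them starts reading. The sketch below illustrates that idea only; `split_rows` is a hypothetical helper, and the splitting scheme is an assumption rather than pyreadstat's actual implementation. Each pair it returns could be handed to a read function as `row_offset` and `row_limit`.

```python
def split_rows(num_rows, num_processes):
    """Return (row_offset, row_limit) pairs that together cover num_rows rows."""
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")

    # Give every worker `base` rows; the first `extra` workers take one more.
    base, extra = divmod(num_rows, num_processes)
    jobs = []
    offset = 0
    for worker in range(num_processes):
        limit = base + (1 if worker < extra else 0)
        if limit == 0:
            continue  # fewer rows than workers: leave this worker idle
        jobs.append((offset, limit))
        offset += limit
    return jobs


print(split_rows(10, 4))
```

Under this scheme an overestimated `num_rows` would only make the last ranges extend past the end of the file, which is consistent with the docstring's remark that a larger number works as well.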

docs/_build/html/py-modindex.html

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 <head>
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-<title>Python Module Index &mdash; pyreadstat 1.2.1 documentation</title>
+<title>Python Module Index &mdash; pyreadstat 1.2.2 documentation</title>
 <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
 <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
 <!--[if lt IE 9]>

docs/_build/html/search.html

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 <head>
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-<title>Search &mdash; pyreadstat 1.2.1 documentation</title>
+<title>Search &mdash; pyreadstat 1.2.2 documentation</title>
 <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
 <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />

docs/_build/html/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@
 # The short X.Y version
 version = ''
 # The full version, including alpha/beta/rc tags
-release = '1.2.1'
+release = '1.2.2'


 # -- General configuration ---------------------------------------------------

pyreadstat/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -20,5 +20,5 @@
 from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
 from ._readstat_parser import ReadstatError, metadata_container

-__version__ = "1.2.1"
+__version__ = "1.2.2"
