#### Reading files in parallel processes

A challenge when reading large files is the time consumed in the operation. In order to alleviate this,
pyreadstat provides a function `read_file_multiprocessing` to read a file in parallel processes using
the python multiprocessing library. As it reads the whole file in one go, you need to have enough RAM
for the operation. If that is not the case, look at Reading rows in chunks (next section).

Speed ups in the process will depend on a number of factors such as the number of processes available
and the amount of RAM, among others.
```python
import multiprocessing

num_processes = multiprocessing.cpu_count()
```
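
For orientation, a minimal sketch of a complete call is shown below; the file path is hypothetical and `read_sav` stands in for whichever pyreadstat reading function matches your file:

```python
import multiprocessing

import pyreadstat

num_processes = multiprocessing.cpu_count()

# Hypothetical path; any pyreadstat reading function can be passed as the first argument.
df, meta = pyreadstat.read_file_multiprocessing(
    pyreadstat.read_sav,
    "/path/to/a/file.sav",
    num_processes=num_processes,
)
```
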
**Notes for Xport, Por and some defective SAV files not having the number of rows in the metadata**

1. In all Xport, Por and some defective SAV files, the number of rows cannot be determined from the metadata. In such cases,
you can set the parameter `num_rows` to a number equal to or larger than the number of rows in the dataset. This number can be obtained by
reading the file without multiprocessing, reading it in another application, etc. (a sketch follows below).
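
As a sketch of point 1, the call below passes `num_rows` for a hypothetical xport file; the value 50000 is an assumed upper bound on the row count obtained beforehand:

```python
import pyreadstat

# num_rows must be equal to or larger than the real number of rows; 50000 is a
# hypothetical bound obtained beforehand (e.g. from a previous single-process read).
df, meta = pyreadstat.read_file_multiprocessing(
    pyreadstat.read_xport,
    "/path/to/a/file.xpt",
    num_rows=50000,
)
```
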
**Notes for windows**

1. For this to work you must include an `if __name__ == "__main__":` section in your script, as sketched below. See [this issue](#85).
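
A minimal sketch of such a guarded script, with a hypothetical file path:

```python
import multiprocessing

import pyreadstat

if __name__ == "__main__":
    # On Windows the multiprocessing entry point must sit behind this guard.
    num_processes = multiprocessing.cpu_count()
    df, meta = pyreadstat.read_file_multiprocessing(
        pyreadstat.read_sav,
        "/path/to/a/file.sav",
        num_processes=num_processes,
    )
    print(df.shape)
```
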
```python
for df, meta in reader:
    # do some cool calculations here for the chunk
```

**If using multiprocessing, please read the notes in the previous section regarding Xport, Por and some defective SAV files not having the number of rows in the metadata.**

**For Windows, please check the notes in the previous section, Reading files in parallel processes.**
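
Putting both notes together, here is a sketch of a parallel chunked read of a hypothetical xport file; the path, chunk size and `num_rows` bound are assumed values:

```python
import pyreadstat

# num_rows is a hypothetical upper bound on the row count, needed here because
# xport files do not record the number of rows in their metadata.
reader = pyreadstat.read_file_in_chunks(
    pyreadstat.read_xport,
    "/path/to/a/file.xpt",
    chunksize=10000,
    multiprocess=True,
    num_processes=4,
    num_rows=50000,
)

for df, meta in reader:
    # each df holds up to 10000 rows of the file
    print(df.shape)
```
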
`pyreadstat.pyreadstat.read_file_in_chunks()`

Returns a generator that allows reading a file in chunks.

If using multiprocessing, for Xport, Por and some defective sav files where the number of rows in the dataset cannot be obtained from the metadata, the parameter num_rows must be set to a number equal to or larger than the number of rows in the dataset. That information must be obtained by the user before running this function.
Parameters:

- **limit** (*integer*, *optional*) – stop reading the file after a certain number of rows; will be added to offset.
- **multiprocess** (*bool*, *optional*) – use multiprocessing to read each chunk?
- **num_processes** (*integer*, *optional*) – in case multiprocess is true, how many workers/processes to spawn?
- **num_rows** (*integer*, *optional*) – number of rows in the dataset. If using multiprocessing it is obligatory for files where the number of rows cannot be obtained from the metadata, such as xport, por and some defective sav files. The user must obtain this value by reading the file without multiprocessing first or by any other means. A number larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata or if not using multiprocessing.
- **kwargs** (*dict*, *optional*) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.
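
As a sketch of how offset and limit interact (the path and values here are hypothetical):

```python
import pyreadstat

# Skip the first 1000 rows, then read at most 5000 more, yielded in chunks of 1000 rows.
reader = pyreadstat.read_file_in_chunks(
    pyreadstat.read_sav,
    "/path/to/a/file.sav",
    chunksize=1000,
    offset=1000,
    limit=5000,
)

for df, meta in reader:
    print(len(df))
```
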
`pyreadstat.pyreadstat.read_file_multiprocessing()`

Reads a file in parallel using multiprocessing. For Xport, Por and some defective sav files where the number of rows in the dataset cannot be obtained from the metadata, the parameter num_rows must be set to a number equal to or larger than the number of rows in the dataset. That information must be obtained by the user before running this function.

Parameters:

- **read_function** (*pyreadstat function*) – a pyreadstat reading function.
- **file_path** (*string*) – path to the file to be read.
- **num_processes** (*integer*, *optional*) – number of processes to spawn; by default the minimum of 4 and the number of cores on the computer.
- **num_rows** (*integer*, *optional*) – number of rows in the dataset. Obligatory for files where the number of rows cannot be obtained from the metadata, such as xport, por and some defective sav files. The user must obtain this value by reading the file without multiprocessing first or by any other means. A number larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata.
- **kwargs** (*dict*, *optional*) – any other keyword argument to pass to the read_function.
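
For illustration, a sketch showing that extra keyword arguments are forwarded to the reading function; the path, column names and `num_rows` bound are hypothetical, and it is assumed that read_por accepts usecols like the other readers:

```python
import pyreadstat

# usecols is forwarded to read_por; num_rows is a hypothetical upper bound, required
# because por files do not store the number of rows in their metadata.
df, meta = pyreadstat.read_file_multiprocessing(
    pyreadstat.read_por,
    "/path/to/a/file.por",
    num_processes=4,
    num_rows=100000,
    usecols=["var1", "var2"],
)
```
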