build_index.py results in UnicodeDecodeError if terminal encoding is not set to UTF-8 #30

Open
@bmerkle

Description

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run python build_index.py on a Windows machine with German regional settings (the default code page there is cp1252).
The program fails with a UnicodeDecodeError, as shown in the log below.

The problem can be fixed by setting the console encoding to UTF-8 in the terminal/shell/PowerShell,
e.g. in PowerShell:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8

We should add this information to the docs.
I can create a PR for this if you consider the information useful (I do :-) )
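
As background, here is a minimal Python sketch of why the decode step fails. The byte 0x81 is unassigned in cp1252, so UTF-8 output containing that byte cannot be decoded with that code page. The sample character and the child command are illustrative assumptions and are not taken from build_index.py. Passing encoding="utf-8" to subprocess.run is one possible code-side workaround, not something the project is known to do.

```python
import subprocess
import sys

# "\u00c1" (A with acute) encodes to b"\xc3\x81" in UTF-8; 0x81 has no mapping in cp1252.
raw = "\u00c1".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning a short failure note instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: byte {data[exc.start]:#04x} at position {exc.start}"


cp1252_result = try_decode(raw, "cp1252")  # fails on byte 0x81
utf8_result = try_decode(raw, "utf-8")     # decodes cleanly

# Hypothetical child process that writes the same UTF-8 bytes to stdout.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'\\xc3\\x81')"]

# An explicit encoding makes the reader independent of the locale code page.
completed = subprocess.run(child, capture_output=True, encoding="utf-8", check=True)

print(cp1252_result)
print(utf8_result == completed.stdout)
```

Without the explicit encoding, text-mode subprocess pipes fall back to the locale's preferred encoding, which matches the cp1252.py frame in the traceback below.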

Any log messages given by the failure

Failed Build
(.venv) PS C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial> python build_index.py
Data directory 'C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/' exists and contains 20 files.
Crack and chunk files from local path: C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/
Start embedding using connection with id = ...
Start creating index from embeddings.
Successfully created index at C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\tutorial-index-mlindex
Method indexes: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class Index: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Exception in thread Thread-19 (_readerthread):
Traceback (most recent call last):
File "c:\Program Files\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
self.run()
File "c:\Program Files\Python311\Lib\threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "c:\Program Files\Python311\Lib\subprocess.py", line 1599, in _readerthread
buffer.append(fh.read())
^^^^^^^^^
File "c:\Program Files\Python311\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 271: character maps to <undefined>
Uploading tutorial-index-mlindex (0.0 MBs): 100%|#####################################################################################################################################| 1296/1296 [00:00<00:00, 1996.11it/s]

Fix, e.g. for PowerShell:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8
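
To check which encodings a Python process actually sees before and after changing the console settings, a small diagnostic like the following may help. It is a sketch using only the standard library. The function name encoding_report is hypothetical, and the values it prints depend on the machine and shell.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python uses for text I/O in this process."""
    return {
        # Default used for text-mode files and subprocess pipes.
        "preferred": locale.getpreferredencoding(False),
        # Encodings of the standard streams (None if a stream is replaced or missing).
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        # True when Python's UTF-8 mode is enabled (e.g. via PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If "preferred" reports cp1252 on the affected machine, that is consistent with the cp1252.py frame in the traceback above.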

Expected/desired behavior

With the fix above it runs fine, e.g.:
(.venv) PS C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial> python build_index.py
Data directory 'C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/' exists and contains 20 files.
Crack and chunk files from local path: C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/
Start embedding using connection with id = ...
Start creating index from embeddings.
Successfully created index at C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\tutorial-index-mlindex
Method indexes: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class Index: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
not OS specific

Versions

not version specific

Mention any other details that might be useful


Thanks! We'll be in touch soon.
