@kmike commented Oct 2, 2013

I think that there are two separate issues in #45:

  1. non-ascii input in scrapely.tool leads to UnicodeDecodeError;
  2. non-ascii data is not readable when printed.

This PR addresses (1).

The <text> and <data> arguments are parsed by the parse_criteria function
(which uses shlex and optparse). The data passed to parse_criteria
is extracted from the "line" argument of the do_<…> methods.
cmd.Cmd reads this "line" from self.stdin and passes it to the
do_ methods. In Python 2.x, sys.stdin (the default for
cmd.Cmd.stdin) is binary, so "line" is a bytestring encoded in
self.stdin.encoding. That is why the <text> and <data> argument
values were previously bytestrings; when passed to other scrapely
functions they eventually got implicitly decoded using
sys.getdefaultencoding() (ASCII by default in Python 2), which
usually raises UnicodeDecodeError if the input text is non-ascii.
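
To make the failure mode concrete, here is a minimal sketch of what the implicit decode amounts to. It is not scrapely code: implicit_decode is a hypothetical helper that spells out the ASCII decode Python 2 performs on its own, and the sample strings are made up.

```python
# Illustration only: Python 2 performed this ASCII decode implicitly when a
# bytestring met a unicode string; here it is spelled out explicitly.

def implicit_decode(bytestring, default_encoding="ascii"):
    """Hypothetical helper mimicking Python 2's implicit str -> unicode step."""
    return bytestring.decode(default_encoding)

ascii_line = b"plain text"
utf8_line = u"caf\u00e9".encode("utf-8")  # bytes as a UTF-8 terminal would send them

ascii_result = implicit_decode(ascii_line)  # pure ASCII decodes without trouble

try:
    implicit_decode(utf8_line)
    raised = False
except UnicodeDecodeError:
    raised = True  # the non-ascii byte 0xc3 cannot be decoded as ASCII
```

ASCII-only input passes through unchanged, which is why the problem only shows up with non-ascii text.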

The fix is to decode these arguments using self.stdin.encoding
before passing them to scrapely. Decoding happens after the shlex
call because shlex doesn't support unicode. Non-ascii "field"
arguments are still unsupported.
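
The "split first, decode after" order can be sketched roughly as follows. This is not the code from the PR: decode_args and stream_encoding are hypothetical names, the token list stands in for what shlex would return, and the utf-8 fallback for streams without an encoding is an assumption of this sketch.

```python
import io

def stream_encoding(stream, fallback="utf-8"):
    """Return the stream's encoding; the fallback is an assumption of this sketch."""
    return getattr(stream, "encoding", None) or fallback

def decode_args(args, encoding):
    """Decode every bytestring token; leave already-decoded text untouched."""
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]

# Stand-in for tokens that shlex has already split out of a bytestring line.
raw_tokens = [u"caf\u00e9".encode("utf-8"), b"plain"]

# io.BytesIO has no "encoding" attribute, so the fallback is used here.
encoding = stream_encoding(io.BytesIO())
decoded = decode_args(raw_tokens, encoding)
```

Decoding token by token after splitting keeps the splitter working on bytes only, and hands text to everything downstream.
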
pablohoffman added a commit that referenced this pull request Oct 10, 2013
[MRG] scrapely.tool: add support for non-ascii <text> and <data> arguments
@pablohoffman merged commit 576d3db into scrapy:master Oct 10, 2013