
Fixes conversion of userData and headers fields in Apify-Scrapy request translation #179

Merged
merged 7 commits into from
Jan 23, 2024

Conversation

vdusek
Contributor

@vdusek vdusek commented Jan 23, 2024

Description

  • Based on the bug report by @JJetmar on Slack:

There is a question regarding the RequestQueue and Python SDK:
Link to Slack Conversation

I tried what I could, but I couldn't save any custom attributes to a Request in the RequestQueue via the Python SDK:

await rq.add_request(request={'url': url, 'method': 'GET', 'userData': { 'myTest': 'test' } })

What I get is:

{
  "id": "7R37ec5G62ZfQHW",
  "json": "{\n  \"url\": \"https://www.apify.com/\",\n  \"method\": \"GET\",\n  \"id\": \"7R37ec5G62ZfQHW\",\n  \"uniqueKey\": \"https://www.apify.com/\",\n  \"userData\": {\n    \"scrapy_request\": \"gASVMgIAAAAAAAB9lCiMA3VybJSMFWh0dHBzOi8vd3d3LmFwaWZ5LmNvbZSMCGNhbGxiYWNrlE6M\\nB2VycmJhY2uUTowHaGVhZGVyc5R9lChDBkFjY2VwdJRdlEM/dGV4dC9odG1sLGFwcGxpY2F0aW9u\\nL3hodG1sK3htbCxhcHBsaWNhdGlvbi94bWw7cT0wLjksKi8qO3E9MC44lGFDD0FjY2VwdC1MYW5n\\ndWFnZZRdlEMCZW6UYUMKVXNlci1BZ2VudJRdlEMjU2NyYXB5LzIuMTEuMCAoK2h0dHBzOi8vc2Ny\\nYXB5Lm9yZymUYUMPQWNjZXB0LUVuY29kaW5nlF2UQw1nemlwLCBkZWZsYXRllGF1jAZtZXRob2SU\\njANHRVSUjARib2R5lEMAlIwHY29va2llc5R9lIwEbWV0YZR9lCiMEGFwaWZ5X3JlcXVlc3RfaWSU\\njA83UjM3ZWM1RzYyWmZRSFeUjBhhcGlmeV9yZXF1ZXN0X3VuaXF1ZV9rZXmUjBVodHRwczovL3d3\\ndy5hcGlmeS5jb22UjBBkb3dubG9hZF90aW1lb3V0lEdAZoAAAAAAAIwNZG93bmxvYWRfc2xvdJSM\\nDXd3dy5hcGlmeS5jb22UjBBkb3dubG9hZF9sYXRlbmN5lEc/3EanAAAAAHWMCGVuY29kaW5nlIwF\\ndXRmLTiUjAhwcmlvcml0eZRLAIwLZG9udF9maWx0ZXKUiYwFZmxhZ3OUXZSMCWNiX2t3YXJnc5R9\\nlHUu\\n\"\n  },
  "method": "GET",
  "orderNo": null,
  "retryCount": 0,
  "uniqueKey": "https://www.apify.com/",
  "url": "https://www.apify.com/"
}

When I decoded the scrapy_request attribute, I got:

'2}(urlhttps://www.apify.comcallbacknerrbacknheaders/}(CAccept]C?text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8aCAccept-Language]CenaC
User-Agent]C#Scrapy/2.11.0 (+https://scrapy.org/)aCAccept-Encoding]C
gzip, deflateaumethodGETbodyCcookies}meta}(apify_request_id7R37ec5G62ZfQHWapify_request_unique_keyhttps://www.apify.comdownload_timeoutG@f/
download_slot
www.apify.comdownload_latencyG?ÜF§uencodingutf-8priorityK�dont_filterflags]	cb_kwargs}u.

This still doesn't contain the myTest custom attribute.

So, if I understand it correctly, this is not currently supported via the Python SDK, right?
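
For reviewers who want to repeat the decoding step from the report, here is a minimal sketch. It assumes the scrapy_request value is a base64-encoded pickle, which the gASV prefix of the payload suggests. The helper name decode_scrapy_request and the sample data are hypothetical and are not part of the SDK.

```python
import base64
import pickle


def decode_scrapy_request(encoded: str) -> dict:
    """Decode a base64-encoded pickle payload back into a Python object.

    Only use this on data you trust: unpickling can execute arbitrary code.
    """
    # With the default validate=False, b64decode skips characters outside
    # the base64 alphabet, so the embedded newlines do not need stripping.
    raw = base64.b64decode(encoded)
    return pickle.loads(raw)


# Stand-in payload (hypothetical data) encoded the same way, so the sketch
# runs without access to a real request queue.
sample = {'url': 'https://www.apify.com', 'method': 'GET', 'meta': {}}
encoded = base64.encodebytes(pickle.dumps(sample)).decode('ascii')

decoded = decode_scrapy_request(encoded)
print(decoded)

# The custom attribute from the report is absent from the decoded request.
print('myTest' in decoded.get('meta', {}))
```

Inspecting the decoded dictionary this way makes it easy to check whether a custom attribute survived the translation.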

Review

  • @jirimoravcik Could you please review this, since you were investigating the problem? Thank you.
  • Additional notes to the changes:
    • I moved request-related functions from scrapy/utils.py to a separate module.
    • I rewrote the unit test files and added new test cases there (regarding the optional fields userData and headers).
    • I tried to structure the commits in a meaningful way, so hopefully you can use them to review the changes to the to_{apify, scrapy}_request functions.
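
To illustrate the intended behavior for the optional userData and headers fields, here is a rough sketch using plain dictionaries instead of real Scrapy and Apify request objects. The function names with the _sketch suffix are hypothetical, and carrying the user data under meta['userData'] is an assumption made for this example. It is not the code from this PR.

```python
from typing import Any


def to_apify_request_sketch(scrapy_like: dict[str, Any]) -> dict[str, Any]:
    """Translate a Scrapy-like request description into an Apify-style dict."""
    apify_request: dict[str, Any] = {
        'url': scrapy_like['url'],
        'method': scrapy_like.get('method', 'GET'),
    }

    # Optional field: copy headers only when they are present.
    headers = scrapy_like.get('headers')
    if headers:
        apify_request['headers'] = dict(headers)

    # Assumption: custom user data travels in meta['userData'].
    user_data = scrapy_like.get('meta', {}).get('userData')
    apify_request['userData'] = dict(user_data) if isinstance(user_data, dict) else {}

    return apify_request


def to_scrapy_request_sketch(apify_request: dict[str, Any]) -> dict[str, Any]:
    """Translate an Apify-style dict back into a Scrapy-like description."""
    scrapy_like: dict[str, Any] = {
        'url': apify_request['url'],
        'method': apify_request.get('method', 'GET'),
        'meta': {},
    }

    if apify_request.get('headers'):
        scrapy_like['headers'] = dict(apify_request['headers'])

    # Restore the user data so that a round trip does not lose it.
    if apify_request.get('userData'):
        scrapy_like['meta']['userData'] = dict(apify_request['userData'])

    return scrapy_like


original = {
    'url': 'https://www.apify.com/',
    'method': 'GET',
    'headers': {'Accept-Language': 'en'},
    'meta': {'userData': {'myTest': 'test'}},
}

apify_request = to_apify_request_sketch(original)
round_tripped = to_scrapy_request_sketch(apify_request)
print(apify_request['userData'])
print(round_tripped == original)
```

A round-trip check like the last line is the kind of property the new test cases for the optional fields can assert.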

@vdusek vdusek added this to the 81st sprint - Tooling team milestone Jan 23, 2024
@vdusek vdusek self-assigned this Jan 23, 2024
@vdusek vdusek added adhoc Ad-hoc unplanned task added during the sprint. bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team. labels Jan 23, 2024
@vdusek vdusek requested a review from jirimoravcik January 23, 2024 17:46
@vdusek vdusek force-pushed the scrapy-request-fix branch from 01d380f to 5e19b6e Compare January 23, 2024 17:52
@vdusek vdusek force-pushed the scrapy-request-fix branch from 5e19b6e to 7a48669 Compare January 23, 2024 18:27
@vdusek vdusek merged commit 1c68f62 into master Jan 23, 2024
@vdusek vdusek deleted the scrapy-request-fix branch January 23, 2024 22:05