Scrapy Spider Notes
This post was automatically translated by an LLM. The original post is in Chinese. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!
Request Spoofing
While crawling Tencent weather data, the spider could not retrieve the data. After some experimentation, the cause turned out to be the request headers. By default, Scrapy sends a User-Agent of the form `Scrapy/VERSION (+https://scrapy.org)`, which identifies the client as a crawler rather than a browser.
Many websites block this default request header, so the spider cannot fetch the page data. The fix is to spoof the request headers so that the website believes a browser is making the request. The steps are as follows:
- Create a new file `HeaderMidWare.py` in the project directory and write the following code:
```python
# encoding: utf-8
```
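A minimal sketch of what such a middleware could look like. The class name `HeaderMidWare` and the random selection are assumptions, not the original code; `USER_AGENT_LIST` is the custom setting named in the next step. A Scrapy downloader middleware is a plain class, so this sketch needs no Scrapy imports.

```python
# encoding: utf-8
import random


class HeaderMidWare:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        # Copy into a list so random.choice can index it.
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware. USER_AGENT_LIST is
        # a custom setting name (an assumption), not a built-in Scrapy setting.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # With no configured user agents, leave the request untouched.
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to continue handling the request.
        return None
```

Assigning the header directly, rather than calling `setdefault`, replaces any User-Agent already present on the request.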
- Add the following code to `settings.py`:
```python
USER_AGENT_LIST = [...]
```
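A sketch of how these settings might look in full. The User-Agent strings are illustrative examples, `myproject` is a hypothetical package name, and the priority `400` is an arbitrary choice. The middleware also has to be registered in `DOWNLOADER_MIDDLEWARES`, otherwise Scrapy never calls it.

```python
# settings.py (fragment)

# Candidate User-Agent strings; these values are illustrative examples.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

DOWNLOADER_MIDDLEWARES = {
    # "myproject" is a hypothetical package name; use the real module path.
    "myproject.HeaderMidWare.HeaderMidWare": 400,
    # Disable the built-in middleware that applies the default User-Agent.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
```

Setting the built-in `UserAgentMiddleware` to `None` is optional here, since it only fills in a User-Agent when none is set, but it makes the intent explicit.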
After adding this, rerun the spider. It now automatically selects a User-Agent from the list for each request and successfully crawls the content of the web page.