
Title: A Beginner's Guide to Web Scraping: One Article Is Enough

In the previous article, we talked about using the Web Scraper browser plug-in to build a crawler. Some friends still found it confusing, so this article explains it in more depth. I hope it helps with your study and work.

Crawling a single page

This is the most basic kind of crawl: all the desired information sits on a single page, with no pagination. We just use Web Scraper to crawl it directly. Example:

Crawl the data on the Bilibili rankings page, including each video's title, author, play count, and danmaku (on-screen comment) count. All of the data is on one page.

Click Create new sitemap to create the crawler.

Next, click Add new selector to define the content to crawl. Here I created four selectors, corresponding to the video title, author, play count, and danmaku count.
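For reference, a sitemap like this can be exported and imported as JSON. Below is a minimal sketch of the flat version we just built. The overall structure (_id, startUrl, selectors) is Web Scraper's real export format, but the start URL and the CSS selector strings are placeholders, since the real ones depend on Bilibili's current page markup:

```json
{
  "_id": "bilibili-ranking",
  "startUrl": ["https://www.bilibili.com/v/popular/rank/all"],
  "selectors": [
    { "id": "title",   "type": "SelectorText", "parentSelectors": ["_root"], "selector": "a.title",         "multiple": true },
    { "id": "author",  "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.up-name",    "multiple": true },
    { "id": "plays",   "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.play-count", "multiple": true },
    { "id": "danmaku", "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.dm-count",   "multiple": true }
  ]
}
```

Note that all four selectors hang directly off _root, each with multiple set to true. This is exactly what causes the messy export described below.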


After the configuration is complete, click Scrape to start crawling.

After the crawl is finished, export the results.

After exporting, you will find the data is quite messy. This is because the four fields were all created at the same level, so Web Scraper has no way to tell which title, author, and counts belong to the same video. How do we solve this?

Understanding containers

Web Scraper has the concept of a container, which is analogous to a div in HTML: each repeated block on the page goes into one container, and the data fields are then read from inside each container. The specific steps are as follows:

Click Add new selector to create the container: set its Type to Element, then select the repeated div region for each item.

Next, click into the container and re-create the fields you want to crawl inside it.

The overall structure is as follows.
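As JSON, the container version of the sitemap looks roughly like this (again, the selector strings are placeholders):

```json
{
  "_id": "bilibili-ranking",
  "startUrl": ["https://www.bilibili.com/v/popular/rank/all"],
  "selectors": [
    { "id": "video",   "type": "SelectorElement", "parentSelectors": ["_root"], "selector": "li.rank-item",    "multiple": true },
    { "id": "title",   "type": "SelectorText",    "parentSelectors": ["video"], "selector": "a.title",         "multiple": false },
    { "id": "author",  "type": "SelectorText",    "parentSelectors": ["video"], "selector": "span.up-name",    "multiple": false },
    { "id": "plays",   "type": "SelectorText",    "parentSelectors": ["video"], "selector": "span.play-count", "multiple": false },
    { "id": "danmaku", "type": "SelectorText",    "parentSelectors": ["video"], "selector": "span.dm-count",   "multiple": false }
  ]
}
```

The key difference from the flat version: only the container is marked multiple, and every field's parentSelectors points at the container instead of _root, so each video's fields stay together in one row.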

The final crawl now returns one clean row per video.

Crawling second-level pages

In the example above, we only obtained the play count and danmaku count. Bilibili's like and favorite counts are not shown on the rankings page; they live on each video's own page, the second-level page.

So after crawling the first-level page, the crawler needs to jump into each second-level page. The specific steps are as follows:

Inside the container, set the title field's type to Link. (Clicking the title is what takes you to the second-level page.)

Then click into the title field and create the like, coin, and favorite fields under it.
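In sitemap terms, the title simply changes from a Text selector to a Link selector, and the second-level fields list it as their parent. A sketch (selector strings are placeholders):

```json
{
  "selectors": [
    { "id": "video",     "type": "SelectorElement", "parentSelectors": ["_root"], "selector": "li.rank-item",   "multiple": true },
    { "id": "title",     "type": "SelectorLink",    "parentSelectors": ["video"], "selector": "a.title",        "multiple": false },
    { "id": "likes",     "type": "SelectorText",    "parentSelectors": ["title"], "selector": "span.like-info", "multiple": false },
    { "id": "coins",     "type": "SelectorText",    "parentSelectors": ["title"], "selector": "span.coin-info", "multiple": false },
    { "id": "favorites", "type": "SelectorText",    "parentSelectors": ["title"], "selector": "span.fav-info",  "multiple": false }
  ]
}
```

Because the like, coin, and favorite selectors are parented to the link selector, Web Scraper opens each video's own page before reading them.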


The final result now includes the like, coin, and favorite counts from each video's page.

Crawling paginated data

In practice, a lot of information is paginated. For example, suppose we want to crawl all of an author's video information on Bilibili.

Pagination with a regular URL pattern

In many cases, we can click the next-page button and observe how the URL changes in response. For example:

Page 1: https://space.bilibili.com/430579369/video?tid=0&pn=1&keyword=&order=pubdate
Page 2: https://space.bilibili.com/430579369/video?tid=0&pn=2&keyword=&order=pubdate

Through observation, it is not hard to see that the pn= parameter controls the page number, so we just need to replace it with a range. If there are ten pages of data in total, we can set it to pn=[1-10].
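Web Scraper supports this range notation directly in the start URL, so no extra selector is needed for the pagination itself. A sketch using the author page above:

```json
{
  "_id": "bilibili-author-videos",
  "startUrl": ["https://space.bilibili.com/430579369/video?tid=0&pn=[1-10]&keyword=&order=pubdate"]
}
```

The extension expands [1-10] into ten start URLs and crawls them one after another.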

Example: crawl all of the videos by the author Xiaoyaozi Big Cousin. Create the crawler as before, paying attention to the start URL with the page range; the result follows the same pattern as above.

Pagination with no URL pattern

When the URL has no usable pattern, we can use a container that simulates clicks: the crawler jumps to the next page by simulating a click on the next-page button.

First, create the crawler (the sitemap).

Next, create the container, this time setting its type to Element click so that it can simulate clicking the next-page button.

The structure is as follows.
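Exported as JSON, the click-driven container is an Element click selector, where clickElementSelector points at the next-page button. As before, the concrete selector strings are placeholders:

```json
{
  "selectors": [
    {
      "id": "video",
      "type": "SelectorElementClick",
      "parentSelectors": ["_root"],
      "selector": "li.video-item",
      "multiple": true,
      "clickElementSelector": "li.be-pager-next",
      "clickType": "clickMore",
      "clickElementUniquenessType": "uniqueText",
      "delay": 2000
    },
    { "id": "title", "type": "SelectorText", "parentSelectors": ["video"], "selector": "a.title", "multiple": false }
  ]
}
```

The clickMore click type tells Web Scraper to keep clicking the button as long as new elements keep appearing, and the delay gives each page time to load before scraping.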

The final crawl clicks through each page in turn and collects every video.

Summary

Web Scraper is entirely capable of handling simple crawling tasks, and it is relatively easy to get started with. However, it may not work on sites with anti-crawling mechanisms.
