
Title: A Beginner's Guide to Web Scraping: One Article Is Enough

In the previous article, we talked about using the Web Scraper browser plug-in to build a crawler. Some friends still found it confusing, so this article explains it in more depth. I hope it helps with your study and work.

Crawling a single page

This is the most basic kind of crawl: all the desired information sits on a single page, with no pagination. We just use Web Scraper to crawl it directly. Example:

Crawl the data on the Bilibili rankings page, including each video's title, author, play count, and danmaku (on-screen comment) count. All of the data is on one page.

Click Create new sitemap to create the crawler.

Next, click Add new selector to define the content to crawl. Here I created four selectors, corresponding to the video title, author, play count, and danmaku count.
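For reference, a sitemap like this can be exported and imported as JSON. Below is a minimal sketch of the flat version we just built. The overall structure (_id, startUrl, selectors) is Web Scraper's real export format, but the start URL and the CSS selector strings are placeholders, since the real ones depend on Bilibili's current page markup:

```json
{
  "_id": "bilibili-ranking",
  "startUrl": ["https://www.bilibili.com/v/popular/rank/all"],
  "selectors": [
    { "id": "title",   "type": "SelectorText", "parentSelectors": ["_root"], "selector": "a.title",         "multiple": true },
    { "id": "author",  "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.up-name",    "multiple": true },
    { "id": "plays",   "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.play-count", "multiple": true },
    { "id": "danmaku", "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.dm-count",   "multiple": true }
  ]
}
```

Note that all four selectors hang directly off _root, each with multiple set to true. This is exactly what causes the messy export described below.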


After the configuration is complete, click Scrape to start crawling.

After the crawl is finished, export the results.

After exporting, you will find the data is quite messy. This is because the four fields were all created at the same level, so Web Scraper has no way to tell which title, author, and counts belong to the same video. How do we solve this?

Understanding containers

Web Scraper has the concept of a container, which is analogous to a div in HTML: each repeated block on the page goes into one container, and the data fields are then read from inside each container. The specific steps are as follows:

Click Add new selector to create the container: set its Type to Element, then select the repeated div region for each item.

Next, click into the container and re-create the fields you want to crawl inside it.

The overall structure is as follows.
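As JSON, the container version of the sitemap looks roughly like this (again, the selector strings are placeholders):

```json
{
  "_id": "bilibili-ranking",
  "startUrl": ["https://www.bilibili.com/v/popular/rank/all"],
  "selectors": [
    { "id": "video",   "type": "SelectorElement", "parentSelectors": ["_root"], "selector": "li.rank-item",    "multiple": true },
    { "id": "title",   "type": "SelectorText",    "parentSelectors": ["video"], "selector": "a.title",         "multiple": false },
    { "id": "author",  "type": "SelectorText",    "parentSelectors": ["video"], "selector": "span.up-name",    "multiple": false },
    { "id": "plays",   "type": "SelectorText",    "parentSelectors": ["video"], "selector": "span.play-count", "multiple": false },
    { "id": "danmaku", "type": "SelectorText",    "parentSelectors": ["video"], "selector": "span.dm-count",   "multiple": false }
  ]
}
```

The key difference from the flat version: only the container is marked multiple, and every field's parentSelectors points at the container instead of _root, so each video's fields stay together in one row.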

The final crawl now returns one clean row per video.

Crawling second-level pages

In the example above, we only obtained the play count and danmaku count. Bilibili's like and favorite counts are not shown on the rankings page; they live on each video's own page, the second-level page.

So after crawling the first-level page, the crawler needs to jump into each second-level page. The specific steps are as follows:

Inside the container, set the title field's type to Link. (Clicking the title is what takes you to the second-level page.)

Then click into the title field and create the like, coin, and favorite fields under it.
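In sitemap terms, the title simply changes from a Text selector to a Link selector, and the second-level fields list it as their parent. A sketch (selector strings are placeholders):

```json
{
  "selectors": [
    { "id": "video",     "type": "SelectorElement", "parentSelectors": ["_root"], "selector": "li.rank-item",   "multiple": true },
    { "id": "title",     "type": "SelectorLink",    "parentSelectors": ["video"], "selector": "a.title",        "multiple": false },
    { "id": "likes",     "type": "SelectorText",    "parentSelectors": ["title"], "selector": "span.like-info", "multiple": false },
    { "id": "coins",     "type": "SelectorText",    "parentSelectors": ["title"], "selector": "span.coin-info", "multiple": false },
    { "id": "favorites", "type": "SelectorText",    "parentSelectors": ["title"], "selector": "span.fav-info",  "multiple": false }
  ]
}
```

Because the like, coin, and favorite selectors are parented to the link selector, Web Scraper opens each video's own page before reading them.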


The final result now includes the like, coin, and favorite counts from each video's page.

Crawling paginated data

In practice, a lot of information is paginated. For example, suppose we want to crawl all of an author's video information on Bilibili.

Pagination with a regular URL pattern

In many cases, we can click the next-page button and observe how the URL changes in response. For example:

Page 1: https://space.bilibili.com/430579369/video?tid=0&pn=1&keyword=&order=pubdate
Page 2: https://space.bilibili.com/430579369/video?tid=0&pn=2&keyword=&order=pubdate

Through observation, it is not hard to see that the pn= parameter controls the page number, so we just need to replace it with a range. If there are ten pages of data in total, we can set it to pn=[1-10].
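Web Scraper supports this range notation directly in the start URL, so no extra selector is needed for the pagination itself. A sketch using the author page above:

```json
{
  "_id": "bilibili-author-videos",
  "startUrl": ["https://space.bilibili.com/430579369/video?tid=0&pn=[1-10]&keyword=&order=pubdate"]
}
```

The extension expands [1-10] into ten start URLs and crawls them one after another.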

Example: crawl all of the videos by the author Xiaoyaozi Big Cousin. Create the crawler as before, paying attention to the start URL with the page range; the result follows the same pattern as above.

Pagination with no URL pattern

When the URL has no usable pattern, we can use a container that simulates clicks: the crawler jumps to the next page by simulating a click on the next-page button.

First, create the crawler (the sitemap).

Next, create the container, this time setting its type to Element click so that it can simulate clicking the next-page button.

The structure is as follows.
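Exported as JSON, the click-driven container is an Element click selector, where clickElementSelector points at the next-page button. As before, the concrete selector strings are placeholders:

```json
{
  "selectors": [
    {
      "id": "video",
      "type": "SelectorElementClick",
      "parentSelectors": ["_root"],
      "selector": "li.video-item",
      "multiple": true,
      "clickElementSelector": "li.be-pager-next",
      "clickType": "clickMore",
      "clickElementUniquenessType": "uniqueText",
      "delay": 2000
    },
    { "id": "title", "type": "SelectorText", "parentSelectors": ["video"], "selector": "a.title", "multiple": false }
  ]
}
```

The clickMore click type tells Web Scraper to keep clicking the button as long as new elements keep appearing, and the delay gives each page time to load before scraping.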

The final crawl clicks through each page in turn and collects every video.

Summary

Web Scraper is entirely capable of handling simple crawling tasks, and it is relatively easy to get started with. However, it may not work on sites with anti-crawling mechanisms.
