Scrapy is an application framework widely used for crawling websites and extracting structured data, for purposes such as data mining and information processing. Although it was designed for website crawling, it can also extract data through APIs, making it a general-purpose web crawling tool.
Installation
In Kali, the Python environment is already installed, so we can install Scrapy directly with the following command.
pip install Scrapy
Isn't the installation easy?
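If you want to confirm the installation succeeded, Scrapy ships a built-in version subcommand:
scrapy version
It prints the installed Scrapy version; scrapy version -v also lists the versions of its dependencies.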
Now we will demonstrate crawling with the small demo from the official documentation.
Save the following code as a file named 22.py:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Then execute the following command:
scrapy runspider 22.py -o quotes.jl
The crawl results will be saved to the quotes.jl file in JSON Lines format (one JSON object per line).
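Using the Einstein quote analyzed below as an illustration, a line of quotes.jl will look roughly like this (the exact quotes you get depend on the tag page being crawled):
{"author": "Albert Einstein", "text": "\"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\""}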
Crawler results
Code Analysis
Now let's analyze the code.
First, let's take a look at the demo page provided by the official site.
Its HTML is as follows:
<div class='quote' itemscope='' itemtype='http://schema.org/CreativeWork'>
    <span class='text' itemprop='text'>"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>
    <span>by <small class='author' itemprop='author'>Albert Einstein</small>
        <a href='/author/Albert-Einstein'>(about)</a>
    </span>
    <div class='tags'>
        Tags:
        <meta class='keywords' itemprop='keywords' content='change,deep-thoughts,thinking,world'>
        <a class='tag' href='/tag/change/page/1/'>change</a>
        <a class='tag' href='/tag/deep-thoughts/page/1/'>deep-thoughts</a>
        <a class='tag' href='/tag/thinking/page/1/'>thinking</a>
        <a class='tag' href='/tag/world/page/1/'>world</a>
    </div>
</div>
Now we analyze the crawler code.
# Import the crawler module
import scrapy


class QuotesSpider(scrapy.Spider):
    # Define two variables, name and start_urls.
    # start_urls is the list of target websites for the crawler.
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    def parse(self, response):
        # Iterate over every element with the CSS class "quote"
        for quote in response.css('div.quote'):
            # Generate a dictionary containing the extracted quote text and author
            # Get the values of author and text under the div
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Find the link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
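Before we turn to the selectors, note the last two lines. response.follow takes the relative URL extracted from the next link and schedules a new request whose response is fed back into the same parse() callback. A rough sketch of the equivalent using the lower-level API (response.urljoin and scrapy.Request are both standard Scrapy):
            # Roughly what response.follow(next_page, self.parse) does:
            # turn the relative href into an absolute URL, then queue a
            # request whose response will be handled by parse() again.
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
This is how the spider walks through every page of results until no next link is found.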
quote.xpath('span/small/text()') traverses deep into the target div: it selects the span tag under the div, then the small tag under that span, and text() selects its text node. The get() function returns that text value.
The relevant HTML is as follows:
<span>by <small class='author' itemprop='author'>Albert Einstein</small>
quote.css('span.text::text').get() selects the span element whose CSS class is text and extracts its text value.
The HTML for this is as follows:
<span class='text' itemprop='text'>"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>
Similarly, we can write the extraction for the tag values.
<div class='tags'>
    <a class='tag' href='/tag/humor/page/1/'>humor</a>
</div>
'tags': quote.css('a.tag::text').getall()
Here getall() retrieves all matching values rather than just the first.
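You can experiment with all of these selectors interactively before putting them in a spider. Scrapy's built-in shell command opens a session with a ready-made response object; assuming the page has not changed, the Einstein quote gives output like this:
scrapy shell 'https://quotes.toscrape.com/'
>>> quote = response.css('div.quote')[0]
>>> quote.xpath('span/small/text()').get()
'Albert Einstein'
>>> quote.css('span.text::text').get()
'"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."'
>>> quote.css('a.tag::text').getall()
['change', 'deep-thoughts', 'thinking', 'world']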
A quick test
As an example, let's crawl the member rankings on the Big Cousin forum (bbskali.cn).
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://bbskali.cn/portal.php',
    ]

    def parse(self, response):
        for quote in response.css('div.z'):
            yield {
                'z': quote.xpath('p/a/text()').get(),
                'z1': quote.css('p::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
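Run it exactly like the demo. The file name members.py and the output file members.jl below are just examples; substitute whatever you saved the spider as:
scrapy runspider members.py -o members.jl
Keep in mind that the div.z and p selectors are specific to the forum's current markup, so if bbskali.cn changes its layout you will need to adjust them.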
How about it? It's that simple!