Python crawlers often run into anti-crawler mechanisms: the content is loaded dynamically and only appears when the page is opened in a real browser. I previously described how to handle this with Selenium, but its configuration is rather troublesome, so this post introduces another method.

Here we use Python + Playwright to get around the anti-crawler mechanism. On newer Python versions, installing modules system-wide with pip can report an error, so using a virtual environment is recommended.

For the Selenium approach, see the earlier article 《Python+Selenium 反爬虫实战》 (Python + Selenium Anti-Crawler in Practice).

Create a virtual environment

python3 -m venv biaoge # biaoge is the name of the virtual environment

source biaoge/bin/activate # activate the virtual environment
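
To confirm that the environment is really active before installing anything, a quick check from Python can help. The sketch below relies on the fact that inside a venv `sys.prefix` differs from `sys.base_prefix`; the function name `in_virtualenv` is chosen only for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```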

Next, we need to install relevant tools in the virtual environment.

Install Playwright

pip3 install playwright beautifulsoup4 # beautifulsoup4 provides the bs4 module used in the sample code

After the installation is complete, install the browser that Playwright will drive (skip this step if it has already been installed):

playwright install firefox

Because the packages are installed in the virtual environment, the system Python environment is not affected.
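
Before running the crawler, it may be worth checking that the modules it imports can actually be found in the active environment. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the check covers the Python modules only, not the browser binary downloaded by `playwright install`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "playwright" and "bs4" are the import names used by the sample code.
    missing = missing_modules(["playwright", "bs4"])
    if missing:
        print("not installed:", ", ".join(missing))
    else:
        print("all required modules are importable")
```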

Sample code:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

# Use the sync_playwright context manager
with sync_playwright() as p:
    # Start Firefox in headless mode
    browser = p.firefox.launch(headless=True)
    # Create a new page that sends a desktop browser User-Agent
    page = browser.new_page(
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    )
    # Open the target page
    page.goto('https://data.xxxx.cn/easyquery.htm?cn=A01&zb=A010G06&sj=202502')
    # Wait until network activity settles so the dynamic content has loaded
    page.wait_for_load_state('networkidle')
    # Get the rendered HTML
    html = page.content()
    soup = BeautifulSoup(html, 'html.parser')
    # Close the browser
    browser.close()

# Parse the data
table_main = soup.find('table', id='table_main')
if table_main:
    # The time label is read from the second header cell
    time_th = table_main.find_all('th')[1]
    period = time_th.find('strong').get_text(strip=True)
    tbody = table_main.find('tbody')
    data = []
    for tr in tbody.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) < 2:
            continue  # skip rows without an indicator/price pair
        indicator = tds[0].get_text(strip=True).replace('\n', '')
        price = tds[1].get_text(strip=True)
        data.append((period, indicator, price))
    for period, indicator, price in data:
        print(f'Time: {period}')
        print(f'Indicator: {indicator}')
        print(f'Price: {price} yuan/kg\n')
else:
    print('Table data not found')
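
The target URL in the sample carries three query parameters (`cn`, `zb`, `sj`). If you want to fetch more than one page, the URL can be assembled rather than typed by hand. The following is only a sketch: it assumes the parameters are joined with `&` in the usual way, `build_query_url` is a hypothetical helper, and treating `sj` as the value that changes between requests is my guess from the sample value, not something the site documents here.

```python
from urllib.parse import urlencode


def build_query_url(base_url, cn, zb, sj):
    """Append the three query parameters to base_url with proper escaping."""
    query = urlencode({"cn": cn, "zb": zb, "sj": sj})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # The host is redacted in this post, so the placeholder is kept as-is.
    base = "https://data.xxxx.cn/easyquery.htm"
    for sj in ("202501", "202502"):
        print(build_query_url(base, "A01", "A010G06", sj))
```

Each generated URL can then be passed to `page.goto()` in place of the hard-coded string.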

Notes

The virtual environment must be activated every time before the script is run, so it is worth writing a small launcher script that takes care of this.
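
One possible shape for such a launcher is sketched below. Instead of activating the environment in the shell, it calls the interpreter inside the venv directory directly, which uses the packages installed there. The helper names `venv_python` and `run_in_venv`, as well as the script name `crawler.py` in the usage note, are placeholders for illustration.

```python
import os
import subprocess
from pathlib import Path


def venv_python(venv_dir):
    """Return the path of the Python interpreter inside a venv directory."""
    venv_dir = Path(venv_dir)
    if os.name == "nt":
        # Windows venvs keep the interpreter under Scripts\
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def run_in_venv(venv_dir, script, *args):
    """Run `script` with the venv's interpreter and return its exit code."""
    command = [str(venv_python(venv_dir)), str(script), *args]
    return subprocess.run(command).returncode
```

For example, `run_in_venv("biaoge", "crawler.py")` would run the crawler with the interpreter from the `biaoge` environment created earlier, assuming the crawler was saved as `crawler.py`.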