clemfromspace / scrapy-selenium

Scrapy middleware to handle javascript pages using selenium

How to perform a click button with scrapy-selenium?

Houssemaster opened this issue

Hello, I want to perform some actions after getting the response from a page, like clicking, hovering, scrolling, etc.

Requests have an additional meta key named driver, containing the selenium driver that processed the request.
You can perform those actions with it, like this:

import scrapy
from scrapy_selenium import SeleniumRequest


class WhateverSpider(scrapy.Spider):
	name = 'whatever'

	def start_requests(self):
		urls = ['https://www.google.com']
		for url in urls:
			yield SeleniumRequest(
				url=url,
				callback=self.parse,
				wait_time=10)

	def parse(self, response):
		driver = response.request.meta['driver']
		# Do some stuff..
		# Click a button.
		button = driver.find_element_by_xpath('//*[@id="clickable-button-foo"]')
		button.click()
		# Do more stuff
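
For this to work, the middleware has to be enabled in settings.py as described in the project README; roughly (adjust the driver name and path to your setup):

from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}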

Hello, I think your solution solves part of the problem. However, there is still a problem with this snippet of code, since downloading requests and parsing responses are asynchronous in Scrapy. Thus, it is possible that Scrapy invoked

driver.get(another_url)

in the middleware's process_request method before Scrapy reached the line:

driver.find_element_by_xpath('//*[@id="clickable-button-foo"]')

which means that by the time Scrapy reached that line, the page source may already have changed.

Because the code runs asynchronously, this will cause problems.
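
For context, the middleware keeps a single shared driver and navigates it for every SeleniumRequest; a simplified sketch (see the library source for the full version, which also handles cookies, screenshots, scripts and wait_until):

from scrapy.http import HtmlResponse
from scrapy_selenium import SeleniumRequest


class SeleniumMiddleware:
    def __init__(self, driver):
        # One webdriver instance is created when the middleware starts
        # and reused for the whole crawl.
        self.driver = driver

    def process_request(self, request, spider):
        if not isinstance(request, SeleniumRequest):
            return None
        # The same self.driver is reused for every request, so a later request
        # can navigate it away before an earlier response has been parsed.
        self.driver.get(request.url)
        request.meta.update({'driver': self.driver})
        return HtmlResponse(
            self.driver.current_url,
            body=str.encode(self.driver.page_source),
            encoding='utf-8',
            request=request,
        )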

But there is another solution: you can use the request option wait_until to perform such actions, like this:

from selenium.webdriver.common.by import By
from scrapy_selenium import SeleniumRequest


def some_action(driver):
    # wait_until_conditions stands for whatever check you need; the callback
    # is polled until it returns a truthy value (or wait_time runs out).
    if wait_until_conditions:
        driver.find_element(By.CLASS_NAME, 'klass')
        # ...
        return True


SeleniumRequest(
    url='http://xxx.ofg',
    wait_until=some_action,
)

# If you forget to return True in the wait_until callback, it will run again and again.
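
Putting this together for the original question, the click can be done inside the wait_until callback, so it happens while the driver is still on the page for that request, before the middleware grabs the page source. A minimal sketch; the URL and XPath are placeholders:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


def click_the_button(driver):
    # Polled by the middleware's WebDriverWait until it returns a truthy value
    # or wait_time expires.
    try:
        button = driver.find_element(By.XPATH, '//*[@id="clickable-button-foo"]')
    except NoSuchElementException:
        return False  # button not rendered yet, keep polling
    button.click()
    return True  # stop waiting; the response body is taken after this point


class ClickSpider(scrapy.Spider):
    name = 'click_spider'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.google.com',
            callback=self.parse,
            wait_time=10,
            wait_until=click_the_button,
        )

    def parse(self, response):
        # response.body here reflects the page source right after the click.
        pass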

I have the same requirement.
You can check this repo until the pull request is accepted.

You are right. There is only one driver, so response.request.meta['driver'] points at whatever URL the driver is currently on, which can differ from response.url. See #22
Any solution to this?
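
One way to at least detect the race in the callback (not from this thread, just a sketch): compare the driver's current page with the response it was built from, before touching any elements.

def parse(self, response):
    driver = response.request.meta['driver']
    if driver.current_url != response.url:
        # The shared driver has already been navigated to another page by a
        # later SeleniumRequest, so element lookups here would hit the wrong page.
        self.logger.warning('driver is on %s, but this response is for %s',
                            driver.current_url, response.url)
        return
    button = driver.find_element_by_xpath('//*[@id="clickable-button-foo"]')
    button.click()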

Note: the correct driver method is find_element_by_xpath, not get_element_by_xpath.
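
Worth noting: Selenium 4 deprecates (and later removes) the find_element_by_* helpers, so on newer versions the equivalent call is:

from selenium.webdriver.common.by import By

button = driver.find_element(By.XPATH, '//*[@id="clickable-button-foo"]')
button.click()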