Create a loop for web scraping data with selenium in python

I am new to Python. I get an error when I scrape data from a web page: running it once or twice works, but it returns an error when I run it over a large array. Can anyone please tell me why and how to fix it? Thanks a lot.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.service import Service
# point Chrome at the chromedriver executable
path = Service('C:/Program Files (x86)/chromedriver.exe')
driver = webdriver.Chrome(service=path)
url = 'https://diemthi.mobiedu.vn/?typeExam=TOAN'
driver.get(url)
SBD = 25000001
# for i in range(1,5):
sbd = driver.find_element(By.XPATH, '/html/body/app-root/app-full-layout/app-home/div/div[1]/div/div[3]/input')
search = driver.find_element(By.XPATH, '/html/body/app-root/app-full-layout/app-home/div/div[1]/div/div[3]/button')
sbd.send_keys(SBD)
search.click()
sbd = driver.find_element(By.XPATH, '/html/body/app-root/app-full-layout/app-home/div/div[1]/div/div[3]/input')
sbd.clear()
time.sleep(2)
The following error is displayed:
Traceback (most recent call last):
File "D:\pythonProject\ScrapingWeb\Main.py", line 19, in <module>
sbd = driver.find_element(By.XPATH, '/html/body/app-root/app-full-layout/app-home/div/div[1]/div/div[3]/input')
File "D:\pythonProject\ScrapingWeb\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 857, in find_element
return self.execute(Command.FIND_ELEMENT, {
File "D:\pythonProject\ScrapingWeb\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 435, in execute
self.error_handler.check_response(response)
File "D:\pythonProject\ScrapingWeb\venv\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/app-root/app-full-layout/app-home/div/div[1]/div/div[3]/input"}
(Session info: chrome=103.0.5060.114)
Stacktrace:
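The thread does not include an answer to this question. One plausible cause, not confirmed by the source, is that the page (apparently an Angular app, judging by the app-root element in the XPath) re-renders after each search, so find_element sometimes runs before the input exists again. Selenium's own WebDriverWait with expected_conditions handles this by polling. The sketch below shows the same polling idea in plain Python so it can be run without a browser; the name wait_for, its parameters, and the timing defaults are hypothetical.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5, ignored=(Exception,)):
    """Call find() repeatedly until it returns a truthy value or time runs out.

    find is any zero-argument callable, for example
    lambda: driver.find_element(By.XPATH, "...") in a Selenium script.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = find()
            if result:
                return result
        except ignored:
            pass  # element not there yet; try again after a short pause
        if time.monotonic() >= deadline:
            raise TimeoutError(f"gave up after {timeout} seconds")
        time.sleep(poll)
```

In the loop from the question, each iteration would re-locate the input and the button through such a wait, or through WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, ...))), instead of calling find_element directly.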

Related

How to find the CloudFlare human verification element using Selenium

The browser is Firefox and the language is Python. I am unable to complete the CloudFlare human verification.
On this website (https://chat.openai.com/chat), I am unable to find the "mark" element with this code:
verify=WebDriverWait(driver, 10,0.1).until(EC.presence_of_element_located((By.CLASS_NAME, 'mark')))
HTML:
Error Message:
Traceback (most recent call last):
File ,
verify=WebDriverWait(driver, 10,0.1).until(EC.presence_of_element_located((By.CLASS_NAME, 'mark')))
File "...Python310\lib\site-packages\selenium\webdriver\support\wait.py", line 90, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
RemoteError#chrome://remote/content/shared/RemoteError.jsm:12:1
WebDriverError#chrome://remote/content/shared/webdriver/Errors.jsm:192:5
NoSuchElementError#chrome://remote/content/shared/webdriver/Errors.jsm:404:5
element.find/</<#chrome://remote/content/marionette/element.js:291:16
Why does this happen, and how can I fix it?

The element <span class="mark">...</span> contains visible text. So to locate it, instead of presence_of_element_located() you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "label.ctp-checkbox-label span.mark")))
Using XPATH:
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//label[@class='ctp-checkbox-label']//span[@class='mark']")))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

response status is not 200 on running webdriver to get url and on running beautiful soup to extract content, it throws attribute error

I have been trying to web scrape hotel reviews but on multiple page jumps, the url of the webpage doesn't change. So I am using webdriver from selenium to work this out. It is not showing any error but on checking if the response status is 200, it is showing false. In addition to that, running the line of code which I have mentioned below generates an error. If anyone can fix the issue, effort will be highly appreciated!
!pip install selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
# install chromium, its driver, and selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
wd = webdriver.Chrome('chromedriver',options=options)
code = wd.get('https://www.goibibo.com/hotels/highland-park-hotel-in-trivandrum-1383427384655815037/?hquery={%22ci%22:%2220211209%22,%22co%22:%2220211210%22,%22r%22:%221-2-0%22,%22ibp%22:%22v15%22}&hmd=766931490eb7863d2f38f56c6185a1308de782c89dfeeea59d262b827ca15441bf50472cbfdc1ee84aeed8af756809a2e89cfd6eaea0fa308c1ca839e8c313d016ac0f5948658353cf30f1cd83050fd8e6adb2e55f2a5470cadeb0c28b7becc92ac44d81966b82408effde826d40fbff47525e09b5f145e321fe6d104e12933c066323798e33a911e0cbed7312fc1634f8f92fe502c8602556c9a02f34c047d04ff1400c995799156776c1a04e218d6486493edad5b0f7e51a5ea25f5f1cb4f5ed497ee9368137f6ec73b3b1166ee7c1a885920b90c98542e0270b4fa9004005cfe87a4d1efeaedc8e33a848f73345f09bec19153e8bf625cc7f9216e692a1bcc313e7f13a7fc091328b1fb43598bd236994fdc988ab35e70cf3a5d1856c0b0fa9794b23a1a958a5937ac6d258d121a75b7ce9fc70b9a820af43a8e9a3f279be65b5c6fbfff2ba20bfb0f3e3ee425f0b930bf671c50878a540c6a9003b197622b6ab22ae39e07b5174cb12bebbcd2a132bb8570e01b9e253c1bd83cb292de97a&cc=IN&reviewType=gi&vcid=3877384277955108166&srpFilters={%22type%22:[%22Hotel%22]}')
str(code) == "<Response [200]>"
**Output: ** False
Running the following line of code raises an error:
soup = BeautifulSoup(code.content,'html.parser')
AttributeError Traceback (most recent call last) in () ----> 1 soup = BeautifulSoup(code.content,'html.parser')
AttributeError: 'NoneType' object has no attribute 'content'

get()
get(url: str) loads a web page in the current browser session and does not return anything.
Hence, in your code, code will always be None.
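Because get() returns None, the rendered HTML has to be read from the driver's page_source attribute afterwards. With BeautifulSoup that would be BeautifulSoup(wd.page_source, 'html.parser'). The sketch below illustrates the pattern without needing a browser. FakeDriver is a hypothetical stand-in that mimics only get() and page_source, and the standard-library HTMLParser is swapped in for BeautifulSoup to keep the example self-contained.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (not part of Selenium)."""

    page_source = ""

    def get(self, url):
        # Like WebDriver.get(), this loads the page and returns nothing.
        self.page_source = "<html><head><title>Hotel reviews</title></head></html>"


def page_title(driver, url):
    result = driver.get(url)         # result is None; do not use it
    parser = TitleParser()
    parser.feed(driver.page_source)  # the HTML lives on the driver itself
    return result, parser.title
```

With a real driver the same two steps apply: call wd.get(url), then parse wd.page_source.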
Solution
To validate the response you can adopt either of the following two approaches:
Using requests.head():
import requests
request_response = requests.head('https://www.goibibo.com/hotels/highland-park-hotel-in-trivandrum-1383427384655815037/?hquery={%22ci%22:%2220211209%22,%22co%22:%2220211210%22,%22r%22:%221-2-0%22,%22ibp%22:%22v15%22}&hmd=766931490eb7863d2f38f56c6185a1308de782c89dfeeea59d262b827ca15441bf50472cbfdc1ee84aeed8af756809a2e89cfd6eaea0fa308c1ca839e8c313d016ac0f5948658353cf30f1cd83050fd8e6adb2e55f2a5470cadeb0c28b7becc92ac44d81966b82408effde826d40fbff47525e09b5f145e321fe6d104e12933c066323798e33a911e0cbed7312fc1634f8f92fe502c8602556c9a02f34c047d04ff1400c995799156776c1a04e218d6486493edad5b0f7e51a5ea25f5f1cb4f5ed497ee9368137f6ec73b3b1166ee7c1a885920b90c98542e0270b4fa9004005cfe87a4d1efeaedc8e33a848f73345f09bec19153e8bf625cc7f9216e692a1bcc313e7f13a7fc091328b1fb43598bd236994fdc988ab35e70cf3a5d1856c0b0fa9794b23a1a958a5937ac6d258d121a75b7ce9fc70b9a820af43a8e9a3f279be65b5c6fbfff2ba20bfb0f3e3ee425f0b930bf671c50878a540c6a9003b197622b6ab22ae39e07b5174cb12bebbcd2a132bb8570e01b9e253c1bd83cb292de97a&cc=IN&reviewType=gi&vcid=3877384277955108166&srpFilters={%22type%22:[%22Hotel%22]}')
status_code = request_response.status_code
if status_code == 200:
    print("URL is valid/up")
else:
    print("URL is invalid/down")
Using urlopen():
import urllib.request
status_code = urllib.request.urlopen('https://www.goibibo.com/hotels/highland-park-hotel-in-trivandrum-1383427384655815037/?hquery={%22ci%22:%2220211209%22,%22co%22:%2220211210%22,%22r%22:%221-2-0%22,%22ibp%22:%22v15%22}&hmd=766931490eb7863d2f38f56c6185a1308de782c89dfeeea59d262b827ca15441bf50472cbfdc1ee84aeed8af756809a2e89cfd6eaea0fa308c1ca839e8c313d016ac0f5948658353cf30f1cd83050fd8e6adb2e55f2a5470cadeb0c28b7becc92ac44d81966b82408effde826d40fbff47525e09b5f145e321fe6d104e12933c066323798e33a911e0cbed7312fc1634f8f92fe502c8602556c9a02f34c047d04ff1400c995799156776c1a04e218d6486493edad5b0f7e51a5ea25f5f1cb4f5ed497ee9368137f6ec73b3b1166ee7c1a885920b90c98542e0270b4fa9004005cfe87a4d1efeaedc8e33a848f73345f09bec19153e8bf625cc7f9216e692a1bcc313e7f13a7fc091328b1fb43598bd236994fdc988ab35e70cf3a5d1856c0b0fa9794b23a1a958a5937ac6d258d121a75b7ce9fc70b9a820af43a8e9a3f279be65b5c6fbfff2ba20bfb0f3e3ee425f0b930bf671c50878a540c6a9003b197622b6ab22ae39e07b5174cb12bebbcd2a132bb8570e01b9e253c1bd83cb292de97a&cc=IN&reviewType=gi&vcid=3877384277955108166&srpFilters={%22type%22:[%22Hotel%22]}').getcode()
if status_code == 200:
    print("URL is valid/up")
else:
    print("URL is invalid/down")
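Note that urlopen() raises urllib.error.HTTPError for 4xx and 5xx responses rather than returning them, so the else branch is not reached for those statuses. A minimal sketch that reports the status either way follows; the helper name http_status and its timeout default are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; the status code is on the exception.
        return err.code
```

This only confirms that the URL responds. The rendered page content still has to come from wd.page_source.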

Cannot run the code and print out the result, using xpath and webdriver to click the pulldown menu

I cannot run the following code, which uses XPath and WebDriver to click through a pull-down menu and then print the result:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('URL')
driver.maximize_window()
wait = WebDriverWait(driver,40)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,'div.combobox-input-wrap a[data-value="rbAll"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH,'//div[@class="droplist-item"]/a[contains(.,"Headline Category")]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH,'//div[@id="rbAfter2006"]//div[@class="combobox-input-wrap"]/a[contains(.,"ALL")]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH,'//div[@class="droplist-group"]//ul[@class="droplist-items"]//li/a[contains(.,"Announcements and Notices")]'))).click()
ele=wait.until(EC.presence_of_element_located((By.XPATH,'//div[@class="droplist-group droplist-submenu level2"]//ul//li/a[contains(.,"New Listings (Listed Issuers/New Applicants)")]')))
ele.location_once_scrolled_into_view
ele.click()
ele2=wait.until(EC.presence_of_element_located((By.XPATH,'//div[@class="droplist-group droplist-submenu level3"]//ul//li/a[contains(.,"Allotment Results")]')))
ele2.location_once_scrolled_into_view
ele2.click()
html = driver.page_source
print html
The error log below is shown when I run it:
File "run.py", line 6, in <module>
driver = webdriver.Firefox()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/webdriver.py", line 167, in __init__
keep_alive=True)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 251, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 320, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status 1

You should print the value inside parentheses, i.e.
html = driver.page_source
print(html)

Selenium Grid: WebDriverException with Windows 8.1/Firefox

I have a simple Selenium Grid remote driver initialization
which is failing with a WebDriverException as below:
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
selenium_grid_url='http://localhost:4444/wd/hub'
caps = webdriver.DesiredCapabilities.FIREFOX.copy()
caps['platform']="WINDOWS"
caps['version']="8.1"
remotedriver = webdriver.Remote(desired_capabilities=caps,
command_executor=selenium_grid_url)
I get the following error:
Traceback (most recent call last):
File "D:/Selenium/ChromeBrowser/loginscript.py", line 16, in <module>
command_executor=selenium_grid_url)
File "D:\Python\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 98, in __init__
self.start_session(desired_capabilities, browser_profile)
File "D:\Python\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 185, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "D:\Python\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 249, in execute
self.error_handler.check_response(response)
File "D:\Python\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: None
Stacktrace:
at java.util.HashMap.putMapEntries (:-1)
at java.util.HashMap.putAll (:-1)
at org.openqa.selenium.remote.DesiredCapabilities.<init> (DesiredCapabilities.java:55)
at org.openqa.grid.web.servlet.handler.RequestHandler.process (RequestHandler.java:104)
at org.openqa.grid.web.servlet.DriverServlet.process (DriverServlet.java:83)
at org.openqa.grid.web.servlet.DriverServlet.doPost (DriverServlet.java:67)
at javax.servlet.http.HttpServlet.service (HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service (HttpServlet.java:790)
at org.seleniumhq.jetty9.servlet.ServletHolder.handle (ServletHolder.java:841)
at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle (ServletHandler.java:543)
at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle (ScopedHandler.java:188)
at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle (SessionHandler.java:1584)
at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle (ScopedHandler.java:188)
at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle (ContextHandler.java:1228)
at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope (ScopedHandler.java:168)
at org.seleniumhq.jetty9.servlet.ServletHandler.doScope (ServletHandler.java:481)
at org.seleniumhq.jetty9.server.session.SessionHandler.doScope (SessionHandler.java:1553)
at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope (ScopedHandler.java:166)
at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope (ContextHandler.java:1130)
at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle (ScopedHandler.java:141)
at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle (HandlerWrapper.java:132)
at org.seleniumhq.jetty9.server.Server.handle (Server.java:564)
at org.seleniumhq.jetty9.server.HttpChannel.handle (HttpChannel.java:320)
at org.seleniumhq.jetty9.server.HttpConnection.onFillable (HttpConnection.java:251)
at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded (AbstractConnection.java:279)
at org.seleniumhq.jetty9.io.FillInterest.fillable (FillInterest.java:112)
at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run (ChannelEndPoint.java:124)
at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob (QueuedThreadPool.java:672)
at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run (QueuedThreadPool.java:590)
at java.lang.Thread.run (:-1)
Process finished with exit code 1
Is something wrong with my invocation?
EDIT:
Selenium standalone server version: 3.4.0
Java version 1.8.0 (121)
Both hub and node of the grid on the same machine.
Python version 3.5.2

python-twitter in google app engine

I am trying to use the python-twitter API in GAE.
I need to import OAuth2 and httplib2.
Here is what I did:
For OAuth2, I downloaded github.com/simplegeo/python-oauth2/tree/master/oauth2. For httplib2, I downloaded code.google.com/p/httplib2/wiki/Install and extracted the folder python2/httplib2 to the project root folder.
My views.py:
import twitter
def index(request):
    api = twitter.Api(consumer_key='XNAUYmsmono4gs3LP4T6Pw',consumer_secret='xxxxx',access_token_key='xxxxx',access_token_secret='iHzMkC6RRDipon1kYQtE5QOAYa1bVfYMhH7GFmMFjg',cache=None)
    return render_to_response('fbtwitter/index.html')
I got the following error (paste.shehas.net/show/jbXyx2MSJrpjt7LR2Ksc):
AttributeError
AttributeError: 'module' object has no attribute 'SignatureMethod_PLAINTEXT'
Traceback (most recent call last)
File "D:\PythonProj\fbtwitter\kay\lib\werkzeug\wsgi.py", line 471, in __call__
return app(environ, start_response)
File "D:\PythonProj\fbtwitter\kay\app.py", line 478, in __call__
response = self.get_response(request)
File "D:\PythonProj\fbtwitter\kay\app.py", line 405, in get_response
return self.handle_uncaught_exception(request, exc_info)
File "D:\PythonProj\fbtwitter\kay\app.py", line 371, in get_response
response = view_func(request, **values)
File "D:\PythonProj\fbtwitter\fbtwitter\views.py", line 39, in index
access_token_secret='iHzMkC6RRDipon1kYQtE5QOAYa1bVfYMhH7GFmMFjg',cache=None)
File "D:\PythonProj\fbtwitter\fbtwitter\twitter.py", line 2235, in __init__
self.SetCredentials(consumer_key, consumer_secret, access_token_key, access_token_secret)
File "D:\PythonProj\fbtwitter\fbtwitter\twitter.py", line 2264, in SetCredentials
self._signature_method_plaintext = oauth.SignatureMethod_PLAINTEXT()
AttributeError: 'module' object has no attribute 'SignatureMethod_PLAINTEXT'
It seems I did not import OAuth2 correctly, judging by the line in twitter.py where the error occurs:
self._signature_method_plaintext = oauth.SignatureMethod_PLAINTEXT()
I even went into twitter.py and added import oauth2 as oauth, but that did not solve the problem.
Can anybody help?

I fixed it. In twitter.py,
try:
    from hashlib import md5
except ImportError:
    from md5 import md5
import oauth
CHARACTER_LIMIT = 140
# A singleton representing a lazily instantiated FileCache.
DEFAULT_CACHE = object()
REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth/request_token'
ACCESS_TOKEN_URL = 'https://api.twitter.com/oauth/access_token'
AUTHORIZATION_URL = 'https://api.twitter.com/oauth/authorize'
SIGNIN_URL = 'https://api.twitter.com/oauth/authenticate'
You need to change import oauth to import oauth2 as oauth.

I'm using tweepy with my GAE application and it works well.
https://github.com/tweepy/tweepy
You can find sample code for tweepy on GAE with a Google search.
