Python Selenium Failing to Acquire data - selenium-webdriver

I am trying to download the 24-month data from www1.nseindia.com, and it fails with both the Chrome and Firefox drivers. It just freezes after filling in all the required values and never clicks; the webpage does not respond...
Below is the code that I am trying to execute:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
id_list = ['ACC', 'ADANIENT']
# Chrome
def EOD_data_Chrome():
    driver = webdriver.Chrome(executable_path="C:\Py388\Test\chromedriver.exe")
    driver.get('https://www1.nseindia.com/products/content/equities/equities/eq_security.htm')
    s1 = Select(driver.find_element_by_id('dataType'))
    s1.select_by_value('priceVolume')
    s2 = Select(driver.find_element_by_id('series'))
    s2.select_by_value('EQ')
    s3 = Select(driver.find_element_by_id('dateRange'))
    s3.select_by_value('24month')
    driver.find_element_by_name("symbol").send_keys("ACC")
    driver.find_element_by_id("get").click()
    time.sleep(9)
    s6 = Select(driver.find_element_by_class_name("download-data-link"))
    s6.click()
# FireFox(Gecko)
def EOD_data_Gecko():
    driver = webdriver.Firefox(executable_path="C:\Py388\Test\geckodriver.exe")
    driver.get('https://www1.nseindia.com/products/content/equities/equities/eq_security.htm')
    s1 = Select(driver.find_element_by_id('dataType'))
    s1.select_by_value('priceVolume')
    s2 = Select(driver.find_element_by_id('series'))
    s2.select_by_value('EQ')
    s3 = Select(driver.find_element_by_id('dateRange'))
    s3.select_by_value('24month')
    driver.find_element_by_name("symbol").send_keys("ACC")
    driver.find_element_by_id("get").click()
    time.sleep(9)
    s6 = Select(driver.find_element_by_class_name("download-data-link"))
    s6.click()
EOD_data_Gecko()
# Change the above final line to "EOD_data_Chrome()" and still it just remains stuck...
Kindly help with what is missing in that code to download the 24-month data... When I perform the same in a normal browser, with manual clicks, it is successful...
When you are doing it manually in a browser, you set the values as below:
Set the first drop-down to: Security wise price volume data
"Enter Symbol": ACC
"Select Series": EQ
"Period" (radio button "For Past"): 24 Months
Then click the "Get Data" button; in about 3-5 seconds the data loads, and when you then click "Download file in CSV format", the CSV file lands in your downloads.
I need help using any library you know for scraping in Python: Selenium, BeautifulSoup, Requests, Scrapy, etc. It doesn't really matter, as long as it is Python...
Edit: @Patrick Bormann, please find the screenshot... The Get Data button works.

When you say that it works manually, have you tried to simulate a click with ActionChains instead of the internal click function?
from selenium.webdriver.common.action_chains import ActionChains
easy_apply = driver.find_element_by_id('dateRange')
actions = ActionChains(driver)
actions.move_to_element(easy_apply)
actions.click(easy_apply)
actions.perform()
and then you simulate a mouse movement to the specific value?
In addition, I tried it on my own and I didn't get any data when pushing the Get Data button. It seems to have a class of "get", as you mentioned, but this button doesn't work. As you can see, there is a second button called "full download"; perhaps you could try that one instead, because the Get Data button didn't work in Firefox or Chrome when I tested it.
Did you already try to catch it through the link?
Update
As the OP asked for help in this urgent matter, I delivered a working solution.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.webdriver.support.ui import Select
chrome_driver_path = "../chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver_path)
driver.get('https://www1.nseindia.com/products/content/equities/equities/eq_security.htm')
driver.execute_script("document.body.style.zoom='zoom 25%'")
time.sleep(2)
price_volume = driver.find_element_by_xpath('//*[@id="dataType"]/option[2]').click()
time.sleep(2)
date_range = driver.find_element_by_xpath('//*[@id="dateRange"]/option[8]').click()
time.sleep(2)
series = driver.find_element_by_name('series')
time.sleep(2)
drop = Select(series)
drop.select_by_value("EQ")
time.sleep(2)
driver.find_element_by_name("symbol").send_keys("ACC")
ez_download = driver.find_element_by_xpath('//*[@id="wrapper_btm"]/div[1]/div[3]/a')
actions = ActionChains(driver)
actions.move_to_element(ez_download)
actions.click(ez_download)
actions.perform()
Here you go, sorry, took a little, had to bring my son to bed...
This solution provides this output: I hope it's correct. If you want to select other drop-down values, you can change the string in the Select (a string, because there are too many indexes to handle) or the number in the XPath, as that number is the option index. The sleeps are normally only needed for elements which take time to build themselves up on the page, but in my experience changing values too quickly sometimes causes errors. Feel free to change the time limits and see if it still works.
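If the fixed sleeps turn out to be flaky, an explicit wait is a common alternative (just a sketch, not part of the solution above; it reuses element locators that appear elsewhere in this thread):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 20)  # poll for up to 20 seconds instead of sleeping a fixed time
# wait until the dropdown exists before selecting from it
wait.until(EC.presence_of_element_located((By.ID, "dataType")))
# wait until the CSV link has actually been rendered after "Get Data"
download = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "download-data-link")))
download.click()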
I hope you can now go on making some money for your living in India.
All the best, Patrick.
Do not hesitate to ask if you have any questions.
UPDATE2
After one long night and another day, we figured out that the freezing originates from the website itself, as it uses:
Boomerang | Akamai Developer (developer.akamai.com/tools/…): Boomerang is a JavaScript library for Real User Monitoring (commonly called RUM). Boomerang measures the performance characteristics of real-world page loads and interactions. The documentation on this page is for mPulse's Boomerang. General API documentation for Boomerang can be found at docs.soasta.com/boomerang-api/.
This is what I discovered from the HTML header.
This is clearly a bot-detection network/JavaScript. With the help of this SO post:
Can a website detect when you are using Selenium with chromedriver?
And the second paragraph from that post: https://piprogramming.org/articles/How-to-make-Selenium-undetectable-and-stealth--7-Ways-to-hide-your-Bot-Automation-from-Detection-0000000017.html
I finally solved the issue:
we changed the variable key in chromedriver to something else, like:
var key = '$dsjfgsdhfdshfsdiojisdjfdsb_';
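If you do not want to patch the binary by hand in a hex editor, the same renaming can be scripted (just a sketch; it assumes a replacement string of exactly the same length as the original marker, so the binary layout is preserved):
# Replace the well-known "cdc_" marker inside the chromedriver binary with a same-length string.
# This only rewrites bytes; keep a backup of the original file.
path = "chromedriver.exe"    # assumed path to the driver binary
old, new = b"cdc_", b"dog_"  # the replacement must have the same length as the original
with open(path, "rb") as f:
    data = f.read()
with open(path, "wb") as f:
    f.write(data.replace(old, new))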
In addition I changed the code to:
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
options = webdriver.ChromeOptions()
chrome_driver_path = "../chromedriver.exe"
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(executable_path=chrome_driver_path, options=options)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.get('http://www1.nseindia.com/products/content/equities/equities/eq_security.htm')
driver.execute_script("document.body.style.zoom='zoom 25%'")
time.sleep(5)
price_volume = driver.find_element_by_xpath('//*[@id="dataType"]/option[2]').click()
time.sleep(3)
date_range = driver.find_element_by_xpath('//*[@id="dateRange"]/option[8]').click()
time.sleep(5)
series = driver.find_element_by_name('series')
time.sleep(3)
drop = Select(series)
drop.select_by_value("EQ")
time.sleep(4)
driver.find_element_by_name("symbol").send_keys("ACC")
actions = ActionChains(driver)
ez_download = driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[2]/div[1]/div[3]/div/div[1]/form/div[2]/div[3]/p/img')
actions.move_to_element(ez_download)
actions.click(ez_download)
actions.perform()
# essential because the button has to be loaded
time.sleep(5)
driver.find_element_by_class_name('download-data-link').click()
The code finally worked and the OP is happy.

I have edited chromedriver.exe using a hex editor, replaced cdc_ with dog_, and saved it. Then I executed the below code using the Chrome driver.
import selenium
from selenium import webdriver
from selenium.webdriver.support.select import Select
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-blink-features")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
# Open the website
driver.get('https://www1.nseindia.com/products/content/equities/equities/eq_security.htm')
symbol_box = driver.find_element_by_id('symbol')
symbol_box.send_keys('20MICRONS')
driver.implicitly_wait(10)
#rd_period=driver.find_element_by_id('rdPeriod')
#rd_period.click()
list_daterange=driver.find_element_by_id('dateRange')
list_daterange=Select(list_daterange)
list_daterange.select_by_value('24month')
driver.implicitly_wait(10)
btn_getdata=driver.find_element_by_xpath('//*[@id="get"]')
btn_getdata.click()
driver.implicitly_wait(100)
print("Clicked button")
lnk_downloadData=driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[2]/div[1]/div[3]/div/div[3]/div[1]/span[2]/a')
lnk_downloadData.click()
This code is working fine as of now, but the problem is that this is not a permanent solution: NSE keeps updating its logic to detect bot execution. Like NSE, we will also have to keep updating our code. Please let me know if this code stops working; I will figure out some other solution.
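One way to reduce that maintenance burden (not something the code above uses, just a suggestion) is the undetected-chromedriver package, which re-patches the driver binary and applies the usual stealth options automatically. A minimal sketch:
# pip install undetected-chromedriver
import undetected_chromedriver as uc
options = uc.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)  # patches the chromedriver binary on the fly
driver.get('https://www1.nseindia.com/products/content/equities/equities/eq_security.htm')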

Related

How to make Selenium-Wire perform an indirect GraphQL AJAX request I expect and need?

Background story: I need to obtain the handles of the tagged Twitter users from an attached Twitter media. There's no current API method to do that unfortunately (see https://twittercommunity.com/t/how-to-get-tags-of-a-media-in-a-tweet/185614 and https://github.com/twitterdev/open-evolution/issues/34).
I have no other choice but to scrape, this is an example URL: https://twitter.com/justinwood_/status/1626275168157851650/media_tags. This is the page which pops up when you click on the tags link under the media of the parent Tweet: https://twitter.com/justinwood_/status/1626275168157851650/
The React-generated DOM is deep and ugly, but would be scrapeable; however, I do not want to log in with any account and get it banned. Unfortunately, when you visit https://twitter.com/justinwood_/status/1626275168157851650/media_tags in an Incognito window, the popup shows up dead empty. However, when I dig into the network requests, the /TweetDetail GraphQL endpoint is full of messages about the anonymous page visit, and fortunately it still contains the list of handles I need despite all of this.
So what I need to have is a scraper which is able to process JavaScript, and capture the response for that specific GraphQL call. Selenium uses a headless Chrome under the hood, so it is able to process JavaScript, and Selenium-Wire offers the ability to capture the response.
Unfortunately, my crafted Selenium-Wire script only sees the TweetResultByRestId and UsersByRestId GraphQL requests and is missing TweetDetail. I don't know what to tweak to make all the requests happen. I have iterated over a ton of Chrome options. Here is a variation of my script:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless") # for Jenkins
chrome_options.add_argument("--disable-dev-shm-usage") # Jenkins
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('--window-size=1900,1080')
chrome_options.add_argument('--ignore-certificate-errors-spki-list')
chrome_options.add_argument('--ignore-ssl-errors')
selenium_options = {
    'request_storage_base_dir': '/tmp',  # Use /tmp to store captured data
    'exclude_hosts': ''
}
ser = Service('/usr/bin/chromedriver')
ser.service_args=["--verbose", "--log-path=test.log"]
driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)
tweet_id = "1626275168157851650"
twitter_media_url = f"https://twitter.com/justinwood_/status/{tweet_id}/media_tags"
driver.get(twitter_media_url)
driver.wait_for_request("/TweetDetail", timeout=10)
Any ideas?
Apparently it looks like I'd rather need to scrape the parent Tweet URL https://twitter.com/justinwood_/status/1626275168157851650/; with that URL the GraphQL call I am after does seem to happen. Probably I got confused while trying 100 combinations.
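For reference, once the target request is captured, its body can be read from Selenium-Wire roughly like this (just a sketch, assuming the /TweetDetail request is actually issued on the page being loaded):
import json
from seleniumwire.utils import decode
request = driver.wait_for_request("/TweetDetail", timeout=10)
# Selenium-Wire stores the raw (possibly gzip/br compressed) body, so decode it first
body = decode(request.response.body,
              request.response.headers.get('Content-Encoding', 'identity'))
payload = json.loads(body.decode('utf-8'))
print(payload.keys())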

Unable to get Selenium Code Working to POST data to Court (PACER) Practice/Training Site

I have been trying to create a script to do the following while I have access to a Training Web-site I am using for electronic case filings. This requires the following steps:
Going to the following site: https://ecf-train.nvb.uscourts.gov/
Clicking the HTML link on that site: "District of Nevada Train Database - Document Filing System"
This then redirects me to this site, where I POST my login credentials (username/pw/testclientcode/flag=1): (https://train-login.uscourts.gov/csologin/login.jsf?pscCourtId=NVTBK&appurl=https://ecf-train.nvb.uscourts.gov/cgi-bin/login.pl)
the "flag" checks a "redaction" clicks in and then is supposed to land me or give me the main access page to select what I want to e-file in the testing system (this is all approved by the Court for testing, FYI). I am attaching a screenshot of what it should look like when logged in successfully, so I can then click the "Bankruptcy" header link ECF Header Links, which then brings me to the next target page (https://ecf-train.nvb.uscourts.gov/cgi-bin/DisplayMenu.pl?BankruptcyEvents&id=1227536), where I can select the link for "Case Upload" (see attached screenshot for 'CaseUpload')Case Upload Selection Link
I am new to Python, but I believe I took all the correct steps to download the Selenium Chrome Browser and install Selenium using "pip install selenium". However, my code is not finding the webdriver "PATH", which is a local desktop directory saved directly on my "C:" drive.
Any help getting this work is HUGELY appreciated. Here is the code I have been using python with the errors as well:
CODE:
import requests
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import sys
import os
import json
import time
# pass in header info which is "Content-Type:application/json"
url = "https://ecf-train.nvb.uscourts.gov/"
# payload contains the header information for Content Type and Accept
payload = {"Content-Type":"application/json", "Accept":"application/json"}
r = requests.post(url, data=payload)
pacer_target = requests.get(url)
pacer_target.text
print(r.text)
# defining the api-endpoint
API_ENDPOINT = 'url'
requests.get(API_ENDPOINT.text)
# data to be sent to api - I removed by userid and pw from this post
data = {
    "loginId": "XXXXXXXXX",
    "password": "XXXXXXXXX",
    "clientCode": "anycode",
    "redactFlag": "1"}
# sending post request and saving response as response object
r = requests.post(url = API_ENDPOINT, data = data)
print(r.text, "STEP 1 - logged into main Traing screen!!")
# Import Selenium Executable File for Chrome from C:\\ drive on my desktop and use the location path
PATH = 'C:\chromedriver.exe'
driver = webdriver.Chrome(PATH)
# browser = webdriver.Chrome()
continue_link = browser.find_element(By.PARTIAL_LINK_TEXT, 'cgi-bin/login')
continue_link.click()
# get response from the clicked URL and print the response
response = requests.get(url)
print(url.text, "this is step 2 - just clicked the main login button!!!!!")
# sending post request and saving response as response object
r = requests.post(url = API_ENDPOINT, data = data)
# extracting response text
pastebin_url = r.text
print("The pastebin URL is:%s"%pastebin_url)

How to get Hikvision DeepinViews license plate number from URL?

I can't find the solution anywhere, and mine doesn't seem to work.
I just want to see the last plate string in the browser, or the last few plates; it doesn't matter.
http://login:password@MY.IP/ISAPI/Traffic/channels/1/vehicleDetect/plates/
<AfterTime><picTime>2021-12-09T09:07:15Z</picTime></AfterTime>
I do have a plate captured exactly at the time I am using in picTime, but the result I am getting is:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<ResponseStatus xmlns="
http://www.hikvision.com/ver20/XMLSchema
" version="2.0">
<requestURL>
/ISAPI/Traffic/channels/1/vehicleDetect/plates/
<AfterTime>
<picTime>2021-12-09T09:01:15Z</picTime>
</AfterTime>
</requestURL>
<statusCode>4</statusCode>
<statusString>Invalid Operation</statusString>
<subStatusCode>invalidOperation</subStatusCode>
</ResponseStatus>
(Postman screenshot)
Edit:
Are you certain that the ISAPI setting is enabled in the camera configuration?
It's not possible in the browser without some tool to send and process your API request.
Have you tried using Postman?
Don't forget to use a Digest Auth header.
from requests.auth import HTTPDigestAuth
import requests
url = 'http://<Your IP>/ISAPI/Traffic/channels/1/vehicleDetect/plates/'
data = "<AfterTime><picTime>20220912T192011+0400</picTime></AfterTime>"
r = requests.get(url, data=data, auth=HTTPDigestAuth('admin', 'password'))
print(r.text)
Try this one after enabling this setting in the camera (see the attached screenshot).
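Once the request succeeds, the plate strings can be pulled out of the XML in Python; just a sketch, assuming the response lists plates in <plateNumber> elements (element names may differ by firmware):
import xml.etree.ElementTree as ET
root = ET.fromstring(r.text)
# ignore the XML namespace and grab every element whose tag ends with 'plateNumber'
plates = [el.text for el in root.iter() if el.tag.endswith('plateNumber')]
print(plates[-1] if plates else "no plates in response")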

Python-selenium action chains not working in Safari

I am trying to click on a menu and then a submenu in selenium-python on a Safari Webdriver. No matter what I do, I cannot seem to make the ActionChains do anything whatsoever. Am I doing something wrong or is this an issue with Safari?
I have tried a number of different actions using ActionChains, but none of them seem to be working.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Safari()
wait = WebDriverWait(driver, 20)
url = "someurl.com"
link_text = "link text"
driver.get(url)
driver.maximize_window()
wait.until(EC.element_to_be_clickable((By.LINK_TEXT, link_text)))
ActionChains(driver).move_to_element(driver.find_element(By.LINK_TEXT, link_text)).click().perform()
print('Hello World')
I expect to see the browser clicking on the element, but I see only the terminal of my program printing 'Hello World'.
You could try clicking with JavaScript instead of ActionChains here:
element_to_click = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, link_text)))
driver.execute_script("arguments[0].click();", element_to_click)

How to get links using selenium and scrape using beautifulsoup?

I want to collect articles from this particular website. I was using BeautifulSoup only earlier, but it was not grabbing the links, so I tried to use Selenium. I wrote the code below, but it gives the output 'None'. I have never used Selenium before, so I don't have much idea about it. What should I change in this code to make it work and give the desired results?
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get(url)
link = browser.find_elements_by_class_name('gs-title')
for links in link:
    links.get_attribute('href')
soup = BeautifulSoup(browser.page_source, 'lxml')
date = soup.find('span', {'class': 'post-date'})
title = soup.find('h1', {'class':'headline'})
content = soup.find('div',{'class':'article-body'})
print(date)
print(title)
print(content)
time.sleep(3)
browser.close()
I want to collect the date, title, and content from all the articles on this page, and from the other pages as well (pages 7 to 18).
Thank you.
Instead of using Selenium to get the anchors, I tried to extract the page source first with the help of Selenium and then used Beautiful Soup on it.
So, to put it in perspective:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
#wait = WebDriverWait(browser, 10) #Not actually required
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser') #Get the Page Source
anchors = soup.find_all("a", class_ = "gs-title") #Now find the anchors
for anchor in anchors:
    browser.get(anchor['href']) #Connect to the News Link, and extract its Page Source
    sub_soup = BeautifulSoup(browser.page_source, 'html.parser')
    date = sub_soup.find('span', {'class': 'post-date'})
    title = sub_soup.find('h1', {'class':'post-title'}) #Note that the class attribute for the heading is 'post-title' and not 'headline'
    content = sub_soup.find('div',{'class':'article-body'})
    print([date.string, title.string, content.string])
    #time.sleep(3) #Even this I don't believe is required
browser.close()
With this modification, I believe you can get your required contents.
You can use the same API as the page uses. Alter the parameters to get all pages of results:
import requests
import json
import re
r = requests.get('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc&gss=.uk&start=60&cselibv=5d7bf4891789cfae&cx=012545676297898659090:wk87ya_pczq&q=cybersecurity&safe=off&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340&filter=0&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732')
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(r.text)[0])
links = [item['clicktrackUrl'] for item in data['results']]
print(links)
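To cover more result pages, step the start parameter in that URL in increments of 10 (just a sketch; note the cse_tok token in the query string is session-bound and will likely need to be refreshed from a live page):
import requests
import json
import re
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
all_links = []
for start in range(60, 180, 10):  # e.g. result pages 7 to 18, 10 results per page
    url = ('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc'
           '&gss=.uk&start={}&cselibv=5d7bf4891789cfae&cx=012545676297898659090:wk87ya_pczq'
           '&q=cybersecurity&safe=off&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340'
           '&filter=0&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732').format(start)
    data = json.loads(p.findall(requests.get(url).text)[0])
    all_links.extend(item['clicktrackUrl'] for item in data['results'])
print(all_links)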
