Scrapy or Selenium or Mechanize to scrape web data? - selenium-webdriver

I want to scrape some data from a website.
Basically, the website has a tabular display that shows around 50 records. For more records, the user has to click a button, which makes an AJAX call to get and show the next 50 records.
I have previous knowledge of Selenium WebDriver (Python), and I could do this very quickly in Selenium. But Selenium is more of an automated-testing tool, and it is very slow.
I did some R&D and found that using Scrapy or Mechanize, I can also do the same thing.
Should I go for Scrapy, Mechanize, or Selenium for this?

I would recommend a combination of Mechanize and ExecJS (https://github.com/sstephenson/execjs) to execute any JavaScript you might come across. I have used those two gems in combination for quite some time now and they do a great job.
You should choose this instead of Selenium because it will be a lot faster than rendering the entire page in a headless browser.

I'd definitely choose Scrapy. If the page depends on JavaScript that Scrapy alone can't handle, you can try Scrapy + Splash.
Scrapy is by far the fastest tool for web scraping that I'm aware of.
Good luck!
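Whichever library you pick, the "next 50 records" button described in the question usually just calls an endpoint that can be requested directly. Here is a minimal sketch of that idea using only the Python standard library. It assumes a hypothetical JSON endpoint that accepts `offset` and `limit` query parameters and returns a list of records; the real site may use different parameter names or return HTML fragments, so check the request in the browser's network tab first.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 50  # the question mentions roughly 50 records per AJAX call


def fetch_page(base_url, offset, limit=PAGE_SIZE):
    """Fetch one page of records from a hypothetical JSON endpoint.

    The `offset` and `limit` parameter names are assumptions, not the
    real site's API.
    """
    query = urlencode({"offset": offset, "limit": limit})
    with urlopen(f"{base_url}?{query}") as response:
        return json.loads(response.read().decode("utf-8"))


def iter_records(fetch, limit=PAGE_SIZE, max_pages=1000):
    """Yield records page by page until a short or empty page comes back.

    `fetch` is any callable taking (offset, limit) and returning a list,
    which keeps the paging logic independent of the HTTP library.
    """
    for page_number in range(max_pages):
        page = fetch(page_number * limit, limit)
        yield from page
        if len(page) < limit:
            break  # last page reached
```

With a real endpoint this could be driven as `iter_records(lambda offset, limit: fetch_page(url, offset, limit))`, where `url` is whatever address the button's AJAX call actually hits.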

Related

Selenium WebDriver and ZK framework application

On a previous project I worked on, I was able to write Selenium scripts conveniently by targeting HTML elements by name, id, cssSelector, XPath, etc. Now I'm working on another project aimed at automating the regression tests for an application. This application was built using the ZK Framework (mainly because of its security features). One of the features of ZK is the dynamic id attribute: it generates a new id upon login or refresh. This makes the Selenium development work difficult, and this is a huge application. I have tried using XPath, but that hasn't been successful. Are there other solutions that work specifically for ZK-based applications from a Selenium WebDriver perspective? Often the only things present in the HTML are the id (which changes) and the type.
Java 8
Selenium 3.11.0
You have several options when testing a ZK client with Selenium.
Basically, you either use an ID generator to set fixed IDs during testing, or you use component IDs with the zk.$('$id') and jq('$id') client-side selectors.
You can go further, but that should already cover 99% of the use cases.
More info here:
https://www.zkoss.org/wiki/ZK_Developer%27s_Reference/Testing/Testing_Tips
and there:
https://www.zkoss.org/wiki/ZK_Client-side_Reference/General_Control/Client-side_selection_of_elements_and_widgets
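As a rough illustration of the second option, the jq('$id') selector can be evaluated through WebDriver's script execution so the test never touches the generated DOM id. The sketch below is in Python for brevity (Java's `JavascriptExecutor.executeScript` plays the same role as `execute_script`). It assumes `driver` is a Selenium WebDriver on a page where ZK's client-side `jq` is loaded, and that indexing the result with `[0]` yields the DOM node; `find_by_zk_id` and the component ID `loginBtn` are hypothetical names.

```python
def find_by_zk_id(driver, component_id):
    """Return the DOM element of the ZK component with this component ID.

    Relies on ZK's client-side jq('$id') selector, so it does not depend
    on the generated DOM id that changes on every login or refresh.
    """
    # The ID is passed as arguments[0] rather than spliced into the JS source.
    script = "return jq('$' + arguments[0])[0];"
    element = driver.execute_script(script, component_id)
    if element is None:
        raise LookupError(f"No ZK component with id {component_id!r} on the page")
    return element
```

A call such as `find_by_zk_id(driver, "loginBtn").click()` would then work regardless of which DOM id ZK generated for that session.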

What can developers do to make the browser ui easy to automate for testing?

I'm curious what a developer can do to make the creation of automated tests easier for testers using Selenium WebDriver. The only thing I can think of is using unique IDs for fields, buttons, etc. Can anyone think of anything else that can be done?
From my experience, these things really help to automate the whole process:
Provide unique IDs for at least the important buttons (form submit, search buttons...).
Do not use HTTP Basic Authentication. Use a normal login form instead.
Get rid of CAPTCHA fields, at least in the test environment.
Provide friendly URLs, so that specific areas of the app can be reached directly.
While the page is loading, show a loading image. The best option is to provide a small element that appears only once the whole page has loaded.
Get rid of hover-only menus (where you have to hover over one element to see another).
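The "small element that appears once the page has loaded" tip pays off on the test side as a simple wait. The following is a minimal sketch of such a wait, assuming a hypothetical marker element with the id `page-loaded` and a `driver` that exposes Selenium's `find_elements(by, value)`. Selenium's own `WebDriverWait` does the same job; the hand-rolled helper is only here to show the mechanism.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


def wait_for_page_loaded(driver, marker_selector="#page-loaded", timeout=10.0):
    """Block until the (hypothetical) load-marker element exists in the DOM."""
    # "css selector" is the strategy string behind Selenium's By.CSS_SELECTOR.
    return wait_until(
        lambda: driver.find_elements("css selector", marker_selector),
        timeout=timeout,
    )
```

Because the marker only exists after loading finishes, the test needs no fixed sleeps and no guessing about which AJAX calls are still pending.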

How do I make selenium see the network requests made by a web browser?

I have a .NET Selenium WebDriver app.
When I'm testing the page one of the things I need to confirm is that a flash object on the page has pulled correct content from a content store on my site. (i.e. the flash object should be loading content from /stuff/info.txt and including that content within the animation.)
As a human looking at this I can use the chrome network tab and see that /stuff/info.txt has been accessed.
How can I make Selenium execute a similar watch and see the network requests made by a web browser?
I did not write this, nor have I tested it, but someone has done it here: http://www.softwareishard.com/blog/firebug/automate-page-load-performance-testing-with-firebug-and-selenium/
Basically, all the requests are exported via the NetExport and Firebug plugins into a HAR (HTTP Archive) file.
Please give us your feedback if you give it a try!
Cheers !
I assume you want to automate what the browser's developer tools do: something like Firebug, but for verification in code.
I don't believe Selenium has such features. For now, you will not be able to achieve this.
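If the HAR export route mentioned above works in your setup, checking for the request becomes a matter of reading a JSON file, since a HAR document stores each request under `log.entries[].request.url`. Below is a minimal sketch in Python (the same structure can be read from .NET with any JSON library). The file name `session.har` in the usage note is a hypothetical placeholder, and how the HAR file gets produced is left to the export tooling.

```python
import json
from urllib.parse import urlsplit


def load_har(filename):
    """Parse a HAR file from disk (utf-8-sig also tolerates a leading BOM)."""
    with open(filename, encoding="utf-8-sig") as handle:
        return json.load(handle)


def requested_paths(har):
    """Return the URL path of every request recorded in a parsed HAR document."""
    return [
        urlsplit(entry["request"]["url"]).path
        for entry in har["log"]["entries"]
    ]


def was_requested(har, path):
    """True if any recorded request targeted exactly this URL path."""
    return path in requested_paths(har)
```

A test could then assert `was_requested(load_har("session.har"), "/stuff/info.txt")` after the page with the Flash object has finished loading.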

Google Apps Script using jQuery datepicker extremely slow

Two existing topics ask the same thing, but they are several years old, so I wanted to create a new one:
jQuery + datepicker extreme slowness in Google Apps Script
JQuery UI in Google Apps Script HTML Service very slow
Is using jQuery with Google Apps Script just a bad idea, or has anyone had any success speeding it up?
Are there any suggestions for what I can use as a date picker with the HtmlService (I have read that the GUI will probably be phased out eventually)?
Currently my Google Apps Script app has only 2 fields, and loading jQuery and jQuery UI with the CSS takes over 10 seconds in dev mode. I have not tested it in a published state, but more importantly, should I be looking at a different solution?
If most of your users use a Google Chrome or any webkit-based browser, you might use
<input type="date">
But the styling differs between browsers, and it doesn't work in IE or Firefox (use JS there).
Helpful links:
w3.org/input.dat
Try it in different browsers
Is there any way to change input type=“date” format?

Browser Automation with Selenium: Fingerprints, recognizability and traceability?

I want to use Selenium/WebDriver to simulate a browser and scrape some website content with it. Even if it's not the fastest method, for me it has many advantages, such as executing scripts.
Many websites, for example search engines like Google or Bing, forbid access via automated methods.
For one tool I need to scrape the estimated result stats from Google for several keywords. It would work like this: simulate a browser that visits google.com, types in a keyword and scrapes the results, then after a short pause types in the next keyword, scrapes the results, and so on...
My question is: is it possible for a website to recognize that I'm using Selenium to simulate the browser instead of using the browser by hand? The Google case in particular gives me some doubts. I know Selenium is partly developed by Google, or at least by some people working for Google. So does Selenium leave fingerprints, or is it impossible to tell whether I'm using the browser myself or driving it with Selenium, even for Google?
No, with WebDriver nobody can actually see that you're using Selenium rather than operating the browser by hand. I'm not sure about the old Selenium RC, but it should be the same. Here's how it works:
Selenium opens up a browser with a clean profile (or with a profile you selected)
Selenium is hooked up to the browser so it can steer it, control it. But the browser still does most of the work. Basically, Selenium replaces the user inputs to the browser, but not more.
You can easily verify this by reading the contents of the HTTP headers sent by your browser.
If you ever actually needed Selenium to be recognized by your server, you can use Browsermob-proxy and add a custom header to your requests.
All that said, there is one thing you must be aware of. While there's no way to detect Selenium directly, the website you're visiting can pick up some indirect clues. These usually include too many requests made in virtually no time, which might be an issue for you. Make sure your Selenium script behaves like a user.
EDIT 2016/04:
Apparently it is possible, as https://stackoverflow.com/a/33403473/2930045 states that a company can do it. My guess - and it is nothing but a guess - is that they can look for some JS that Selenium installs into the browser in order to operate it.
Signs point to yes: sites are able to recognize that you are using Selenium.
Counterexample: www.stubhub.com detects and blocks my browser instance launched using Selenium, while "normal" manual browsing (not using the browser launched by Selenium WebDriver) works without issue.
See this Stack Overflow question for additional details:
Can a website detect when you are using selenium with chromedriver?
