Reasons why SeleniumRC CSS locators might be slower than XPath? - screen-scraping

I've got some code that does a simulated recursion tree walk to scrape stuff from an HTML tree using SeleniumRC. I've run the code using both Xpath and CSS locators.
The tree is represented as a series of nested tables. If it matters at all, some of the tree content starts out not visible as branches are "collapsed". For both Xpath and CSS, the tree is in the same state in terms of visible vs. not visible.
To get node values, my code starts with a "root" expression, adds "branch" tokens that can be incremented for each successive sibling node, and then uses a "node" token to get the text content.
It all works, but much slower using the CSS expressions I've come up with.
I suppose it is a kludgy way to make locator expressions, although it works for my purposes. I'm just trying to figure out how to best use CSS to get closer to the times involved using Xpath.
The loop tests many invalid expressions (keeps looking for nth sibling until not found) and the expressions get really long, due to the way I am incrementally drilling further and further into nested tables.
Below follows the bits of expression and examples that come from the recursion. If anyone can provide some insight as to what I am doing that is making CSS take so much longer than Xpath, that would be very helpful.
I am a total newb at doing this kind of manipulation of HTML content, if you see something dumb in terms of how I've moved from Xpath to CSS, please say so.
XPath “tokens”:
final String rootbase = "//*[contains(@id,\"treeBox\")]/div";
// in next string, "{branchIncrement}" will be replaced with integer values from 2 to get to text content, and skip graphical elements
final String leveltoken = "/table/tbody/tr[{branchIncrement}]/td[2]";
final String nodetoken = "/table/tbody/tr/td[4]/span";
CSS “tokens”:
final String rootbase = "css=[id*=treeBox]>div";
// in next string, "{branchIncrement}" will be replaced with integer values from 2 to get to text content, and skip graphical elements
final String leveltoken = ">table>tbody>tr:nth-child({branchIncrement})>td:nth-child(2)";
final String nodetoken = ">table>tbody>tr>td:nth-child(4)>span";
The first XPath expression for the content at the "root" is:
//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr/td[4]/span
The last XPath expression for a 40 node tree with four levels, three sibling each level below the root (1+3+3x3+3x3x3) is:
//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[4]/span
The first CSS expression is:
[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
The last CSS expression is:
[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(3)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
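For readers who want to see the token scheme end to end, here is a minimal sketch of how the root, level, and node tokens could be concatenated into the expressions listed above. The class name LocatorBuilder and its build method are hypothetical, not the asker's actual code, and the attribute test is written as @id, the standard XPath form.

```java
/** Hypothetical helper: assembles a locator from root, level, and node tokens. */
final class LocatorBuilder {
    private final String rootBase;
    private final String levelToken;
    private final String nodeToken;

    LocatorBuilder(String rootBase, String levelToken, String nodeToken) {
        this.rootBase = rootBase;
        this.levelToken = levelToken;
        this.nodeToken = nodeToken;
    }

    /** branchPath holds one row index per nesting level, outermost first. */
    String build(int... branchPath) {
        StringBuilder locator = new StringBuilder(rootBase);
        for (int branch : branchPath) {
            // Substitute the sibling index into the level token for this depth.
            locator.append(levelToken.replace("{branchIncrement}", Integer.toString(branch)));
        }
        return locator.append(nodeToken).toString();
    }

    public static void main(String[] args) {
        LocatorBuilder xpath = new LocatorBuilder(
                "//*[contains(@id,\"treeBox\")]/div",
                "/table/tbody/tr[{branchIncrement}]/td[2]",
                "/table/tbody/tr/td[4]/span");
        LocatorBuilder css = new LocatorBuilder(
                "css=[id*=treeBox]>div",
                ">table>tbody>tr:nth-child({branchIncrement})>td:nth-child(2)",
                ">table>tbody>tr>td:nth-child(4)>span");
        System.out.println(xpath.build(2));             // root node
        System.out.println(css.build(2, 2, 3, 2, 2));   // deepest node in the example
    }
}
```

Because every probe re-sends the whole expression, the string grows by one level token per depth, which is why the deepest locators get so long.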

In Firefox, Selenium RC's XPath locators are processed by the browser's native XPath engine, while the CSS locators are processed by a JavaScript library (Dean Edwards' cssQuery.js). Later Selenium releases (e.g., the 2.0b* series) use jQuery's Sizzle library for CSS, but they still do it in JavaScript. On top of that implied difference in speed, you're doing pattern-matching in the root expression (i.e., [id*=treeBox]), which requires enumerating the entire DOM tree to locate the matches, even before you descend from there. Think about how you'd write that in pure JavaScript and you'll start to see the problem.
If it makes you feel any better, IE still doesn't have a native XPath implementation, so Selenium uses one of several JavaScript implementations in that browser, and it's anywhere from one-half to one-tenth the speed of XPath in Firefox 3.6 because of that.
Long answer short, there's not much you can do to make CSS locators faster in this particular case.

Usually, it's not something you can help. The XPath selector mechanism in Selenium makes use of the browser's XPath tools. Even IE6 has one of those. I'm not aware of a browser that provides CSS selector tools through JavaScript, so Selenium has to use its own code. As their code is all JavaScript and internal browser XPath parsing is usually done in native code, it's much slower (especially in IE6).

Thanks for that feedback. After reading your note, I wondered if I could get a substantial improvement by using a tiny bit of code to resolve a literal id value to replace the contains expression used repeatedly.
Here are four different locators I've used for the same thing. Two of the locators are XPath, and two are CSS. For each of those pairs, one uses a contains expression, and one resolves to a literal first. In each case, the example locators are for the last node of a three-level, 1307-node tree.
XPath with contains:
//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[26]/td[2]/table/tbody/tr/td[4]/span
XPath where literal replaces contains expression:
id('ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox')/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[24]/td[2]/table/tbody/tr/td[4]/span
CSS with contains:
css=[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(24)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
CSS where literal replaces contains expression:
css=[id=ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(24)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
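As a rough sketch of the "resolve to a literal first" step: once the full id is known, the two root expressions can be built like this. The class and method names are hypothetical, and how the id gets resolved is an assumption (for example, one up-front attribute lookup through Selenium RC); only the string construction is shown.

```java
/** Hypothetical helper: builds root expressions from an id resolved once, up front. */
final class LiteralRoot {
    /** XPath root using the id() function instead of contains(@id, ...). */
    static String xpathRoot(String resolvedId) {
        return "id('" + resolvedId + "')/div";
    }

    /** CSS root using an exact attribute match instead of the substring match [id*=...]. */
    static String cssRoot(String resolvedId) {
        return "css=[id=" + resolvedId + "]>div";
    }

    public static void main(String[] args) {
        // The id below is the one quoted in the post; in practice it would be looked up at run time.
        String id = "ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox";
        System.out.println(xpathRoot(id));
        System.out.println(cssRoot(id));
    }
}
```

The rest of each locator (level and node tokens) is appended to these roots exactly as before.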
Working with two different sized trees, one 102 nodes, the other 1307 nodes, I found the following.
102 nodes:
      | contains  | literal
XPath | 15 sec.   | 13 sec.
CSS   | 19 sec.   | 19 sec.
1307 nodes:
      | contains  | literal
XPath | 255 sec.  | 145 sec.
CSS   | 1893 sec. | 1811 sec.
Clearly, a native implementation (XPath on Firefox with Se-RC) is much faster than a JavaScript implementation. The trade-off is that it might not work as well across browsers.
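To put the measurements above on a common scale, here is a small sketch that converts them into seconds per node and CSS-to-XPath ratios. The class and method names are made up for illustration; the inputs are the timings reported in this post.

```java
/** Hypothetical helper: normalizes the reported timings for comparison. */
final class TimingSummary {
    static double secondsPerNode(double totalSeconds, int nodes) {
        return totalSeconds / nodes;
    }

    static double ratio(double slowerSeconds, double fasterSeconds) {
        return slowerSeconds / fasterSeconds;
    }

    public static void main(String[] args) {
        // 1307-node tree, contains-style root
        System.out.printf("XPath s/node: %.3f%n", secondsPerNode(255, 1307));
        System.out.printf("CSS   s/node: %.3f%n", secondsPerNode(1893, 1307));
        // How many times slower CSS is than XPath on the large tree
        System.out.printf("contains ratio: %.1f%n", ratio(1893, 255));
        System.out.printf("literal ratio:  %.1f%n", ratio(1811, 145));
    }
}
```

On the 1307-node tree, CSS takes roughly 7 times as long as XPath with the contains root and roughly 12 times as long with the literal root, while on the 102-node tree the gap is well under 2 times. The per-node cost of CSS therefore grows with tree size rather than staying constant.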

Related

How to verify words one by one on a locator in Robot Framework?

I am still a bit new to the robot framework but please rest assured I am constantly reading its User Guide. I am a bit stuck now with one test case.
I do have a list of individual words, that I need to verify on a page, mostly German translations of field labels if they appear correctly or are found in an element at all.
I have created a list variable as follows:
@{GERMAN_WORDS} | Benutzer | Passwort | Sendung | Transaktionen | Notiz
I have the following locator that contains the text labels on the webpage, and the one I need to verify:
${GENERAL_GERMAN_BOARD} | xpath=//*[@id="generalAndIncidents:generalAndIncidentsPanel"]
I would like to check every single word one by one from the list variable, whether they are present in the locator above.
I did create the following keyword for this purpose, however I might be missing something because it calls the entire content of my list variable, instead of checking the words from it one by one:
Block Text Verification
    [Arguments]    ${text_list_variable}    ${locator_to_check}
    Wait Until Element is Visible    ${locator_to_check}
    FOR    ${label}    IN    ${text_list_variable}
        ${labelTostring}    Convert to String    ${label}
        ${isMatching} =    Run Keyword and Return Status    Element Should Contain    ${locator_to_check}    ${labelTostring}
        Log    ${label}
        Log    ${isMatching}
        Exit For Loop If    '${isMatching}' == 'False'
    END
I am getting the following output for this:
Element 'xpath=//*[@id="generalAndIncidents:generalAndIncidentsPanel"]' should have contained text '['Benutzer', 'Passwort', 'Sendung', 'Transaktionen', 'Notiz']' but its text was.... (and it lists all the text from my locator)
So, it is basically not checking the words one by one.
Am I doing something wrong here? Is this a bad approach I am trying to do here?
I would be grateful if anyone could provide me some hint on what I should do here instead!
Thank you very much!
You've made one small but crucial mistake. The variable in this line:
FOR ${label} IN ${text_list_variable}
should be accessed with @:
FOR ${label} IN @{text_list_variable}
FOR-IN loops in Robot Framework expect the looped-over values as one or more separate arguments, and @ expands a list variable into its members.
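If the difference between passing a list as one value and expanding it into its members is unfamiliar, a loose analogy in Java is a varargs method. This is only an illustration of the idea, not how Robot Framework is implemented; the class and method names are invented.

```java
import java.util.Arrays;
import java.util.List;

/** Analogy only: a varargs call sees either one list object or its expanded members. */
final class ListExpansionDemo {
    /** Returns how many separate values the "loop" would iterate over. */
    static int countLoopValues(Object... values) {
        return values.length;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Benutzer", "Passwort", "Sendung", "Transaktionen", "Notiz");

        // Like ${list}: the whole list arrives as a single value.
        System.out.println(countLoopValues(words));

        // Like @{list}: each member arrives as its own value.
        System.out.println(countLoopValues(words.toArray()));
    }
}
```

The first call corresponds to the failing loop, which ran once with the entire list as ${label}; the second corresponds to the fixed loop, which runs once per word.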

Appending a big number of nodes to an xml tree

I'm using libxml via C to create and parse XML files. Until recently everything worked smoothly, but a case emerged where a single tree has a subnode, let's call it S, with approximately 200,000 children. This case is surprisingly slow, and I suspect the function:
xmlNewChild(/**/);
which I'm using to build the tree, has to iterate over every child of S to add one more child. Since a function that also accepts a hint (a pointer to the last added child) doesn't seem to exist, is there a better way to build the tree (maybe a batch build method)? In case such numbers are insignificant and I should search for deficiencies elsewhere, please let me know.
Yeah, rather than keeping the entire XML in memory with xmlTree, you may want to use a combination of libxml's xmlReader and xmlWriter APIs. They're both streaming, so it won't have to keep the entire document in memory and won't have any scaling problems based on the number of elements.
Examples of both xmlReader and xmlWriter can be found here:
http://www.xmlsoft.org/examples/index.html
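To illustrate what "streaming" buys you, here is the same idea sketched with Java's StAX writer rather than libxml's C API, so none of the calls below are libxml functions. Each child is emitted as soon as it is produced, and no in-memory list of siblings ever has to be walked. The element names S and child are placeholders.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Sketch of streaming output: children are written one by one, never stored in a tree. */
final class StreamingXmlSketch {
    static String render(int childCount) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("S");
            for (int i = 0; i < childCount; i++) {
                // Constant work per child, independent of how many were written before.
                xml.writeStartElement("child");
                xml.writeCharacters(Integer.toString(i));
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML output failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(3));
    }
}
```

For a real 200,000-child document the writer would target a file stream instead of a StringWriter, so memory use stays flat as well.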

Selenium CSS selector by id AND multiple classes

Using selenium for the first time here, I was wondering why:
final WebElement justAnId = findElement(By.cssSelector("#someId"));
final WebElement whatIWant = justAnId.findElement(
    By.cssSelector(".aClass.andAnother input[type=text]")
);
works, but not:
final WebElement whatIWant = findElement(By.cssSelector(
    "div#someId.aClass.andAnother input[type=text]"
));
Although they seem equivalent to me I get:
org.openqa.selenium.NoSuchElementException: Unable to locate element:
{"method":"css selector","selector":"div#someId.aClass.andAnother input[type=text]"}
Is this intended behaviour or a bug in Selenium? I had a quick look in the bug tracker in Selenium but I didn't see anything about that. I wanted to ask here before raising an issue that doesn't need to be. Also as far as I understand it doesn't work in IE6 but who cares. I was using firefox for this run.
Actually the two are quite different selectors.
Here is your cssSelector:
div#someId.aClass.andAnother input[type=text]
But what you really wanted to write was:
div#someId .aClass.andAnother input[type=text]
Notice the space between the ID and the classes. You need that.
findElement() finds an element in the current context, which means your first snippet of code is really finding an element that matches .aClass.andAnother input[type=text], which is contained within #someId. The element with that ID may or may not contain the two classes; WebDriver doesn't assume you're referring to the same element; it just finds the input as long as its ancestors are #someId and .aClass.andAnother.
This is completely different from div#someId.aClass.andAnother input[type=text], which finds any input[type=text] within div#someId.aClass.andAnother only (i.e. it's a div that contains both the ID and the classes).

Eclipse - How do I view java arrays / collections better in debugger

Viewing/Searching java arrays and collections in the Eclipse Java debugger is tedious and time-consuming.
I tried this promising plugin (in alpha as of Aug 2012)
http://www.cvast.tuwien.ac.at/projects/visualdebugging/ArrayExplorer
But it freezes Eclipse for simple arrays beyond a few hundred elements.
I do use Detail formatters, but that still needs clicking on each element to see the values.
Are there any better ways to view this array/collection data?
Use the 'Expressions' tab.
There you can type in any number of expressions and have them evaluated in the current scope.
e.g. collection.size(), collection.getValueAt(i), etc.
Eclipse > Preferences > Java > Debug > Detail Formatters
This may be close to what you are looking for. It is tedious to set up, but once done you can see the values of objects in the Expressions window.
Here is link to start
Override the toString method of your class and you will be able to see what you want to see. I'm attaching an example to show you exactly that.
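A minimal sketch of such an override, using an invented Customer class purely for illustration. The debugger shows the returned string next to each instance, including instances stored in arrays and collections.

```java
/** Hypothetical class: the debugger displays toString() for each instance. */
final class Customer {
    private final int id;
    private final String name;

    Customer(int id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public String toString() {
        // Keep it short: this text appears on a single line in the Variables view.
        return "Customer{id=" + id + ", name=" + name + "}";
    }
}
```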
Even though I could not find a way to see them in a nice table/array, I found a halfway workaround.
The solution is to define a static method in a throwaway class that takes the array as input and returns a string of concatenated values that one wants to quickly glance at. It could include the array index and newlines to view results formatted nicely. It can be fine-tuned to print out only certain array indices to reduce clutter.
This static method can then be used in the watch area.
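A possible shape for that throwaway helper, as a sketch. The names DebugDump and dump are made up, and the index range parameters are one way to do the "only certain indices" filtering.

```java
/** Hypothetical throwaway helper for use in the debugger's watch/Expressions view. */
final class DebugDump {
    /** Dumps every element, one per line, prefixed with its index. */
    static String dump(Object[] values) {
        return dump(values, 0, values.length);
    }

    /** Dumps elements in [from, to), clamped to the array bounds. */
    static String dump(Object[] values, int from, int to) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(to, values.length);
        for (int i = Math.max(from, 0); i < end; i++) {
            sb.append('[').append(i).append("] ").append(values[i]).append('\n');
        }
        return sb.toString();
    }
}
```

In the watch area you would then enter something like DebugDump.dump(myArray, 100, 120), where myArray is whatever array is in scope at the breakpoint.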

dojo.query "," (or) for combining multiple selections does not work on IE 7

I am using
dojo.query('input,select',myDiv)[0].focus();
to focus the first input element found in a div container.
This will work in Firefox, but not in IE 7.
IE 7 only takes the first query into consideration:
dojo.query('input,select')[0] will select the first input element,
even if a select element is first.
dojo.query('select,input')[0] will select the first select element,
even if an input element is first.
Does anybody know a workaround for this?
If I recall correctly, dojo.query does not necessarily guarantee "chronological" order within the NodeList it returns, especially for complex queries. This is generally due to the fact that for some browsers / in some scenarios, it does have to cobble multiple disparate result sets together, and trying to reorder this based on where each element is in the document would probably be far more of a performance hit than it's worth.
That said, off the cuff I'm not sure what to suggest as an alternative. It'd be easy enough to find the first of one OR the other separately, just not while looking for both within the same query.
If your form has some kind of consistent markup around your inputs (e.g. each field is inside let's say, a div with class="field"), I suppose you could do something like this:
dojo.query('.field:first-child select, .field:first-child input')
