How would a web analytics package such as piwik/google analytics/omniture etc determine what are unique pages from a set of urls?
E.g. a) a site could have the following pages for a product catalogue
http://acme.com/products/foo
http://acme.com/products/bar
or b) use query string
http://acme.com/catalogue.xxx?product=foo
http://acme.com/catalogue.xxx?product=bar
In either case you can have extra query string vars for things like affiliate links or other uses so how could you determine that its the same page?
e.g. both of these are for the foo product pages listed above.
http://acme.com/products/foo?aff=somebody
http://acme.com/catalogue.xxx?product=foo&aff=somebody
If you ignore all the query string then all products in catalogue.xxx are collated into one page view.
If you don't ignore the query string then any extra query string params look like different pages.
If you're dealing with 3rd party sites then you can't assume that they are using either method or rely on something like canonicallinks being correct.
How could you tackle this?
different tracking tools handle it differently, but you can explicitly set the reporting URL for all the tools.
For instance, Omniture doesn't care about the query string. It will chop it off, even if you don't specify a pageName and it defaults to the URL in the pages report, it still chops off the query string.
GA will record the full url including query string every time.
Yahoo Web Analytics only records the query string on first page of the visit and every page afterwards it removes it.
But as mentioned, all of the tools have a way to explicitly specify the URL to reported, and it is easy to write a bit of javascript to remove the Query string from the URL and pass that as the URL to report.
You mentioned giving your tracking code to 3rd parties. Since you are already giving them tracking code, it's easy enough to throw that extra bit of javascript into the tracking code you are already giving them.
For example, with GA (async version), instead of
_gaq.push(['_trackPageview']);
you would do something like
var page = location.href.split('?');
_gaq.push(['_trackPageview',page[0]]);
edit:
Or...for GA you can actually specify to exclude them within the report tool. Different tools may or may not do this for you, so code example can be applied to any of the tools (but popping their specific URL variable, obviously)
If you're dealing with third-party sites, you can't assume that their URLs follow any specific format either. You can try downloading the pages and comparing them locally, but even that is unreliable because of issues like rotating advertisement, timestamps, etc.
If you are dealing with a single site (or a small group of them), you can make a pattern to match each URL to a canonical (for you) form. However, this will get unmanageable quickly.
Of course, this is the reason that search engines like Google recommend the use of rel='canonical' links in the page header; if Google has issues telling the pages apart, it's not a trivial problem.
Related
I'm looking into how to process customisation fields for Amazon orders and according to their MWS API Docs, if a customer chooses to personalise his order, then a URL to download this data comes down in the Order Item XML's BuyerCustomizedInfo node:
<OrderItem xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<ASIN>ABC123</ASIN>
...
<ConditionSubtypeId>New</ConditionSubtypeId>
<BuyerCustomizedInfo>
<CustomizedURL>https://zme-caps.amazon.com/t/ABC123/ABC123/1</CustomizedURL>
</BuyerCustomizedInfo>
</OrderItem>
My client has given me two such orders to look at, and when I click on those links all I get is
NoSuchURL: Url id 'ABC123' has expired or does not exist!
I know that the ZIP will contain JSON which I will have to parse and may also contain references to SVGs, and that I must also make the code extra robust when dealing with customisation fields.
Am I getting this error because these links are time sensitive or one time use only? Or is it something else?
First off I'm not a Developer, I'm an Amazon Seller - I found your question while doing research as I'm trying to figure out what is possible and sketch a plan for a similar system and then hire a Developer.
I've pasted some info that I had found from the US below - the implementation of Amazon Custom in the European Marketplaces may not be the same as in the US though.
In general it is very hard to get good info on anything to do with Amazon Custom and it seems to have a messed up logic of its own - feel free to ask anything though and I will help if I can.
First of all, make sure you have the most up to date Amazon MWS Orders API SDK. If you don’t and refuse to update, you can make a reports API for orders, and that’ll include the ZIP URL, but you’ll have to parse it and life will be hell.
Next, for the order, call ListOrderItems which you probably already do. You’ll see the customization in the response XML under BuyerCustomizedInfo -> CustomizedURL.
This is a ZIP. Download the zip using CURL, put plenty of checks and fallbacks in place because it will fail sometimes.
Extract the ZIP to a folder. Inside that folder there will be a json file.
Parse that JSON file and you’ll probably know where to go from there for putting that information into your system.
Depending on how you’ve configured your product, there may also be an SVG file that you’ll want to parse to get some customization info. Specially json->{‘version3.0’}->customizationInfo->surfaces (each surface)->areas. Each area should be a text line or image. At least that is how it is for how we’ve set up products.
As always, put lots of checks, try catches, fallbacks, and error alerts.
The links are time sensitive and expire after 6 months I think.
The links should be a little more complex and if that is exactly the link you are seeing it's incorrect.
You don't require any auth to download them and the easiest way to test them is via the MWS Scratchpad.
I am going to be hosting 100's of thousands of pages, only a few KB, each with the exact same format. I originally thought of using a central database, but then thought that as the content on each page will never change, is it worth the unnecessary database requests? I originally thought of using something like memcache, but then, as stated above, thought it wouldn't be too efficient as the content will never change.
Question:
As the content is only a few KB per page, is it actually that inefficient to serve as static pages?
Could it affect the ability to search through the content to find the page, and how? Eg:
There will be a description on the page, when the user is using the search function it will need to search through the descriptions of each page.
One thing on the individual pages will change, but only once every few weeks/months, should I use a database for that, or serve that statically?
How can I find out (any language but better if Python) when Google indexed a specific html page?
Ideally I would have a list of URLs to check for.
I have already tried the WayBack machine but it doesn't have the majority of the pages I need. Also if anyone can suggest an API to extract dates in multiple language from text.
You can use this pattern to access the cached version of your webpage.
http://webcache.googleusercontent.com/search?q=cache:<URL>
Say for example, you can see the cache version of my blog datafireball.com like this, as you can see, it is indexed 20141020, 23:33:30, strip= will avoid loading javascript, css..etc. To get the time when the indexed was index, you can use some browser automation tool like Selenium or Phantomjs... etc. to get the page.
I'm stumped and need some ideas on how to do this or even whether it can be done at all.
I have a client who would like to build a website tailored to English-speaking travelers in a specific country (Thailand, in this case). The different modes of transportation (bus & train) have good web sites for providing their respective information. And both are very static in terms of the data they present (the schedules rarely change). Here's one of the sites I would need to get info from: train schedules The client wants to provide users the ability to search for a beginning and end location and determine, using the external website's information, how they can best get there, being provided a route with schedule times for the different modes of chosen transport.
Now, in my limited experience, I would think the way to do that would be to retrieve the original schedule info from the external site's server (via API or some other means) and retain the info in a database, which can be queried as needed. Our first thought was to contact the respective authorities to determine how/if this can be done, but this has proven to be problematic due to the language barrier, mainly.
My client suggested what is basically "screen scraping", but that sounds like it would be complicated at best, downloading the web page(s) and filtering through the HTML for relevant/necessary data to put into the database. My worry is that the info on these mainly static sites is so static, that the data isn't even kept in a database to build the page and the web page itself is updated (hard-coded) when something changes.
I could really use some help and suggestions here. Thanks!
Screen scraping is always problematic IMO as you are at the mercy of the person who wrote the page. If the content is static, then I think it would be easier to copy the data manually to your database. If you wanted to keep up to date with changes, you could then snapshot the page when you transcribe the info and run a job to periodically check whether the page has changed from the snapshot. When it does, it sends an email for you to update it.
The above method could also be used in conjunction with some sort of screen scaper which could fall back to a manual process if the page changes too drastically.
Ultimately, it is a case of how much effort (cost) is your client willing to bear for accuracy
I have done this for the following site: http://www.buscatchers.com/ so it's definitely more than doable! A key feature of a web scraping solution for travel sites is that it must send you emails if anything went wrong during the scraping process. On the site, I use a two day window so that I have two days to fix the code if the design changes. Only once or twice have I had to change my code, and it's very easy to do.
As for some examples. There is some simplified source code here: http://www.buscatchers.com/about/guide. The full source code for the project is here: https://github.com/nicodjimenez/bus_catchers. This should give you some ideas on how to get started.
I can tell that the data is dynamic, it's to well structured. It's not hard for someone who is familiar with xpath to scrape this site.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screenscraping is any program that grabs the display data of another program and ingests it for it's own use.
Quite often, screenscaping refers to a web client that parses the HTML pages of targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general a screen scraper is a program that captures output from a server program by mimicing the actions of a person sitting in front of the workstation using a browser or terminal access program. at certain key points the program would interpret the output and then take an action or extract certain amounts of information from the output.
Originally this was done with character/terminal outputs from mainframes for extracting data or updating systems that were archaic or not directly accessible to the end user. in modern terms it usually means parsing the output from an HTTP request to extract data or to take some other action. with the advent of web services this sort of thing should have died away, but not all apps provide a nice api to interact with.
A screen scraper downloads the html page, and pulls out the data interested either by searching for known tokens or parsing it as XML or some such.
In the early days of PC's, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract, update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With emergence of SOA, screenscraping is a convenient way in which to services enable applications that aren't. In those cases, the web page scraping is the more common approach taken.
Here's a tiny bit of screen scraping implemented in Javascript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
//Show My SO Reputation Score
var repval = $('span.reputation-score:first'); alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() + '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code and paste it into the Console and see it in action right here on this Question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Technically, screenscraping is any program that grabs the display data of another program and ingests it for it's own use.In the early days of PC's, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract, update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With emergence of SOA, screenscraping is a convenient way in which to services enable applications that aren't. In those cases, the web page scraping is the more common approach taken.
Quite often, screenscaping refers to a web client that parses the HTML pages of targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
Typically You have an HTML page that contains some data you want. What you do is you write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
You have an HTML page that contains some data you want. What you do is you write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decents APIs. I've worked with screen scraping companies and often the APIs are so problematic (ranging from cryptic errors to bad results) and often don't give the full functionality that the website provides that it can be better to screen scrape (web scrape if you will). The extranet/website portals are used my more customers/brokers than API clients and thus are better supported. In big companies changes to extranet portals etc.. are infrequent, usually because it was originally outsourced and now its just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on particular route and time, an insurance quote, a shipping quote etc..
In terms of doing it, it can be as simple as web client to pull the page contents into a string and using a series of regular expressions to extract the information you want.
string pageContents = new WebClient("www.stackoverflow.com").DownloadString();
int numberOfPosts = // regex match
Obviously in a large scale environment you'd be writing more robust code than the above.
A screen scraper downloads the html
page, and pulls out the data
interested either by searching for
known tokens or parsing it as XML or
some such.
That is cleaner approach than regex... in theory.., however in practice its not quite as easy, given that most documents will need normalized to XHTML before you can XPath through it, in the end we found the fine tuned regular expressions were more practical.