How can I find out (in any language, but preferably Python) when Google indexed a specific HTML page?
Ideally I would have a list of URLs to check.
I have already tried the Wayback Machine, but it doesn't have the majority of the pages I need. Also, can anyone suggest an API for extracting dates in multiple languages from text?
You can use this pattern to access the cached version of your webpage.
http://webcache.googleusercontent.com/search?q=cache:<URL>
For example, you can view the cached version of my blog datafireball.com this way; as you can see, it was indexed on 2014-10-20 at 23:33:30. Appending strip=1 avoids loading JavaScript, CSS, etc. To get the time at which the page was indexed, you can use a browser automation tool like Selenium or PhantomJS to fetch the page and read the banner.
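For illustration, here is a minimal Python sketch of that approach. The cache URL pattern is the one above; the banner wording "as it appeared on ... GMT" is an assumption based on how the cache page looked at the time, and Google may throttle or CAPTCHA plain HTTP clients, which is why browser automation is suggested.

import re
import requests

def google_cache_date(url):
    # fetch Google's cached copy; append &strip=1 to skip JavaScript/CSS
    cache_url = "http://webcache.googleusercontent.com/search?q=cache:" + url
    resp = requests.get(cache_url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    # the cache banner reads like "... as it appeared on 20 Oct 2014 23:33:30 GMT"
    match = re.search(r"as it appeared on (.+? GMT)", resp.text)
    return match.group(1) if match else None

for url in ["datafireball.com"]:
    print(url, google_cache_date(url))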
Related
I would like to (programmatically) convert a text file with questions into a Google Form. I want to specify the questions, the question types, and their options. Example: a scale question should go from 1 to 7 and should have the label 'not important' for 1 and 'very important' for 7.
I was looking into the Google Spreadsheet API but did not see a solution.
(The Google form API at http://code.lancepollard.com/introducing-the-google-form-api is not an answer to this question)
Google has released an API for this: https://developers.google.com/apps-script/reference/forms/
This service allows scripts to create, access, and modify Google Forms.
Until Google satisfies this feature request (star the feature on Google's site if you want to vote for it), you could try a non-API approach.
iMacros allows you to record, modify, and play back macros that control your web browser. My experiments with Google Drive showed that the basic version (without DirectScreen technology) doesn't record macros properly. I tried it with both the plug-in for IE (basic and advanced click mode) and Chrome (the latter has limited iMacros support). FYI, I was able to get the iMacros IE plug-in to create questions on mentimeter.com, but the macro recorder gets some input fields wrong, which requires hand-editing the macro (double-checking the ATTR= of the TAG commands with Chrome's 'Inspect element' feature, for example).
Assuming that you can get the TAG commands to produce clicks in the right places in Google Drive, the approach is that you basically write (ideally record) a macro, going through the steps you need to create the form as you would using a browser. Then the macro can be edited (you can use variables in iMacros, get the question/questiontype data from a CSV or user-input dialogs, etc.). Looping in iMacros is crude, however. There's no EOF for a CSV (you basically have to know how many lines are in the file and hard-code the loop in your macro).
There's a way to integrate iMacros calls with VB, etc., but I'm not sure whether that's possible with the free versions. Another angle is to generate code (JavaScript) from a macro and then modify it from there.
Of course, all of these things are more fragile than an API approach long-term. Google could change its presentation layer and it will break your macros.
It seems Apps Script now has a REST API and SDKs for it, and through Apps Script you can generate Google Forms. This API was really hard to find by googling for it, and I haven't tested it myself yet, but I am going to build something with it today (hopefully). So far everything looks good.
EDIT: The REST API I am using works very well for fully automated usage.
In March 2022, Google released a REST API for Google Forms. The API allows basic CRUD operations and also supports registering watches on a form, which notify you whenever the form is updated or a new response is received.
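As a rough illustration, creating a form with the 1-to-7 scale question from the original question via that REST API might look like the following Python sketch (using google-api-python-client; the credentials file, form title, and question text are placeholders, and the field names follow the Forms API v1 reference):

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# assumes you already completed the OAuth flow and saved a token with the
# https://www.googleapis.com/auth/forms.body scope
creds = Credentials.from_authorized_user_file(
    "token.json", scopes=["https://www.googleapis.com/auth/forms.body"])
service = build("forms", "v1", credentials=creds)

# forms.create accepts only a title; questions are added via batchUpdate
form = service.forms().create(body={"info": {"title": "Survey"}}).execute()

service.forms().batchUpdate(
    formId=form["formId"],
    body={"requests": [{
        "createItem": {
            "item": {
                "title": "How important is this?",
                "questionItem": {"question": {"scaleQuestion": {
                    "low": 1, "high": 7,
                    "lowLabel": "not important",
                    "highLabel": "very important",
                }}},
            },
            "location": {"index": 0},
        }
    }]},
).execute()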
As of now (March 2016), the Google Forms APIs allow us to create forms and store them in Google Drive. However, they do not let you programmatically modify a form (for example, modify content, add or delete questions, or pre-fill answers). In other words, the form is static. To serve custom forms, external APIs are needed.
I want to create a search engine, so I have been using Nutch and Solr to develop it.
But Nutch is not able to crawl each and every URL of a website, and the search results are not as good as Google's, so I started using jcrawler to get a list of URLs.
Now I have a list of URLs, but I have to index them.
Is there any way to index a list of URLs stored line by line in a file, and show the results via Lucene, Solr, or any other Java API?
How you programmatically do something really depends on which language you plan on writing your code in - fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use Nutch with the Solr backend: give it the list of URLs as input and set -depth to 1 (so that it doesn't spider anything further).
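With an older Nutch 1.x release that still ships the all-in-one crawl command, that would look something like this (the paths and the Solr URL are placeholders; urls/ is a directory containing your seed file):

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 1000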
There are also other "ready" options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a good description of what you want to accomplish and how to approach that (keep in mind that Search is a core product for Google and they have a very, very large set of custom technologies for handling search). If you have specific issues with your own data and how to display that (usually you can do more useful results as you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use Data Import Handler to load the list of URLs from file and then read and index them.
You'd need to use a nested entity, with the outer entity having the rootEntity flag set to false.
You'd need to practice a little with DIH, so I recommend that you first learn how to import just the URLs into individual Solr documents and then extend that to actually parse the URL content.
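As a sketch of what such a DIH configuration might look like (the file path and the content field name are assumptions you'd adapt to your schema; LineEntityProcessor exposes each line as rawLine, and the import handler itself must be registered in solrconfig.xml):

<dataConfig>
  <dataSource type="FileDataSource" name="fileReader" />
  <dataSource type="URLDataSource" name="urlSource" />
  <document>
    <!-- outer entity: one row per line of the file, not indexed itself -->
    <entity name="urls" processor="LineEntityProcessor"
            url="/path/to/urls.txt" dataSource="fileReader"
            rootEntity="false">
      <!-- inner entity: fetch each URL and index its text content -->
      <entity name="page" processor="PlainTextEntityProcessor"
              url="${urls.rawLine}" dataSource="urlSource">
        <field column="plainText" name="content" />
      </entity>
    </entity>
  </document>
</dataConfig>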
First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me for scraping (css_parser, Nokogiri, etc.; I'm using Ruby to do the scraping).
This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.
The two solutions I've begun to create are these:
Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e., H1 text that is image-replaced with the logo). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words 'header' or 'logo' in the file names.
Solution two is problematic because of the many idiosyncrasies of how people write CSS for websites. They use 'header' instead of 'logo' in the file name. Sometimes the file name is random and says nothing about a logo. Other times, it's just the wrong image.
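For what it's worth, the scanning step of solution two could be sketched in a few lines of Python (the regex is simplistic; it ignores @import rules and data URIs):

import re

CSS_URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")

def logo_candidates(css_text):
    # keep only url(...) references whose file name hints at a logo
    return [u for u in CSS_URL_RE.findall(css_text)
            if re.search(r"logo|header", u, re.I)]

print(logo_candidates("h1 { background: url('/img/site-logo.png'); }"))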
I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.
So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)
Thanks!
Check out this API by Clearbit. It's super simple to use:
Just send a query to:
https://logo.clearbit.com/[enter-domain-here]
For example:
https://logo.clearbit.com/www.stackoverflow.com
and get back the logo image!
More about it here
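For example, a quick Python sketch to download a logo with it (the domain is a placeholder; the endpoint returns a 404 if Clearbit has no logo for it):

import requests

domain = "stackoverflow.com"
resp = requests.get("https://logo.clearbit.com/" + domain)
if resp.ok:
    with open(domain + ".png", "wb") as f:
        f.write(resp.content)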
I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting the image with "logo" in the URL. My variation was I loaded each webpage in webkit so that all images were loaded from CSS or JavaScript. This technique gave me logos for ~40% of websites.
Then I considered creating an app, as Nick suggested, to manually select the logo for the remaining websites, but I realized it was more cost-effective to just hand these over to someone inexpensive (whom I found via Elance) to do the work manually.
So I suggest you don't bother solving this with a fully technical solution: outsource the manual labour.
Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.
Have your application store in a database a link to all images on a website that are larger than a specified dimension so that you can weed out small icons.
Then you can setup a form to access these results. You may want to setup the database table to store the website url and relationship between the url and image links.
Even if it were possible to write an application that could truly figure out whether an image is a logo, it would likely require a massive amount of code. In the end it would probably weed out even more than the above, but you have to take into account that it could be faster for a human to visually parse the results than for you to write and test that complex code.
Yet another simple way to approach this is to get all the leaf nodes of the HTML and take the first
<a><img src="http://example.com/a/file.png" /></a>
You can look for existing projects that extract HTML leaf nodes, or use regular expressions to get all the HTML tags.
I used a C# console app with the HtmlAgilityPack NuGet package to scrape logos from over 600 sites.
The algorithm is to take all images that have "logo" in the URL.
The challenges you will face during such extraction are:
Relative image URLs
The base URL being a CDN, and HTTP vs. HTTPS (if you don't know the protocol before you make a request)
Images with ? or & query strings at the end
With those things in mind I achieved roughly 70% success, but some of the images retrieved were not actual logos.
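The same heuristic, sketched in Python rather than C# (stdlib html.parser plus requests; urljoin takes care of the relative-URL and protocol issues mentioned above):

from html.parser import HTMLParser
from urllib.parse import urljoin

import requests

class ImgCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

def find_logo_candidates(site_url):
    parser = ImgCollector()
    parser.feed(requests.get(site_url, timeout=10).text)
    # urljoin resolves relative paths and protocol-relative //cdn... URLs
    absolute = (urljoin(site_url, s) for s in parser.srcs)
    return [u for u in absolute if "logo" in u.lower()]

print(find_logo_candidates("https://stackoverflow.com"))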
How would a web analytics package such as piwik/google analytics/omniture etc determine what are unique pages from a set of urls?
E.g. a) a site could have the following pages for a product catalogue
http://acme.com/products/foo
http://acme.com/products/bar
or b) use query string
http://acme.com/catalogue.xxx?product=foo
http://acme.com/catalogue.xxx?product=bar
In either case you can have extra query-string variables for things like affiliate links or other uses, so how could you determine that it's the same page?
e.g. both of these are for the foo product page listed above:
http://acme.com/products/foo?aff=somebody
http://acme.com/catalogue.xxx?product=foo&aff=somebody
If you ignore all the query string then all products in catalogue.xxx are collated into one page view.
If you don't ignore the query string then any extra query string params look like different pages.
If you're dealing with third-party sites, you can't assume that they are using either method, or rely on something like canonical links being correct.
How could you tackle this?
Different tracking tools handle it differently, but you can explicitly set the reporting URL for all of them.
For instance, Omniture doesn't care about the query string: it chops it off. Even if you don't specify a pageName and it defaults to the URL in the pages report, it still chops off the query string.
GA will record the full URL, including the query string, every time.
Yahoo Web Analytics only records the query string on the first page of the visit; on every page afterwards it removes it.
But as mentioned, all of the tools have a way to explicitly specify the URL to be reported, and it is easy to write a bit of JavaScript to remove the query string from the URL and pass the result as the URL to report.
You mentioned giving your tracking code to 3rd parties. Since you are already giving them tracking code, it's easy enough to throw that extra bit of JavaScript into it.
For example, with GA (async version), instead of
_gaq.push(['_trackPageview']);
you would do something like
// keep only the part of the URL before the query string
var page = location.href.split('?');
_gaq.push(['_trackPageview', page[0]]);
Edit: Alternatively, for GA you can specify query parameters to exclude within the reporting tool itself. Different tools may or may not do this for you, so the code example above can be adapted to any of them (popping in their specific URL variable, obviously).
If you're dealing with third-party sites, you can't assume that their URLs follow any specific format either. You can try downloading the pages and comparing them locally, but even that is unreliable because of issues like rotating advertisement, timestamps, etc.
If you are dealing with a single site (or a small group of them), you can make a pattern to match each URL to a canonical (for you) form. However, this will get unmanageable quickly.
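A minimal Python sketch of that per-site normalization (the ignored-parameter list is hypothetical and would have to be maintained for each site):

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# site-specific noise parameters; illustrative only
IGNORED = {"aff", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

# both affiliate variants collapse to http://acme.com/catalogue.xxx?product=foo
print(canonicalize("http://acme.com/catalogue.xxx?product=foo&aff=somebody"))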
Of course, this is the reason that search engines like Google recommend the use of rel='canonical' links in the page header; if Google has issues telling the pages apart, it's not a trivial problem.
My site will feature dozens and dozens of front-end live demos (HTML pages with cross-browser bugs), but instead of just throwing them on jsfiddle.net and linking to the demos from articles, I would actually like to store them in a database or as organized, dynamically generated flat files.
Example:
http://site/css-bug/ will feature an article on a certain bug in browser X. I can have many (demos) to one (bug/article). They will contain HTML, CSS, and some JavaScript.
Another possibility I was pondering was making my own jsfiddle.net clone, in which case I would have to mimic the way jsFiddle stores them (however it does that). I'm thinking this is the best route to take, but I would appreciate advice.
Background info:
As of now I am manually creating static HTML files in directories and linking to them, and I am using Django for my application, which links to these demos (they reside on a media server).
You can use jsFiddle for that.
To get the files locally, you can retrieve the parts of a saved fiddle using an undocumented API (which means it may disappear or stop being valid): add /show_js/, /show_html/, or /show_css/ to the end of the fiddle's URL.
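So fetching the pieces of a saved fiddle could be as simple as this Python sketch (the fiddle URL is a placeholder, and since the API is undocumented the endpoints may change or disappear):

import requests

fiddle_url = "http://jsfiddle.net/someuser/abc123"  # hypothetical fiddle
for part in ("show_js", "show_html", "show_css"):
    resp = requests.get(fiddle_url + "/" + part + "/")
    print(part, "->", len(resp.text), "bytes")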
You may have to wait a while until we add export to Gists on GitHub (implementing this shouldn't take very long, but we want to hit beta first).
To speed up loading of your examples, it would be great if you loaded the embedded version on demand: display a [Show example] button which creates an iframe with the embedded fiddle. We plan to write cross-browser support for that as well.