How to scrape logos from websites? - screen-scraping

First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me for scraping (css_parser, nokogiri, etc.; I'm using Ruby to do the scraping).
This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.
The two solutions I've begun to create are these:
Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e., H1 text that is image-replaced with the logo). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words header or logo in the file names.
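A minimal sketch of that CSS scan might look like this (Python for illustration, though the question uses Ruby; the regex and the filename heuristic are rough assumptions):

import re
import requests

URL_RE = re.compile(r"url\(['\"]?([^'\")]+)['\"]?\)", re.IGNORECASE)

def logo_candidates_from_css(css_urls):
    # Scan each stylesheet for url() declarations whose file name
    # hints at a logo -- a heuristic, not a guarantee.
    candidates = []
    for css_url in css_urls:
        css = requests.get(css_url, timeout=10).text
        for ref in URL_RE.findall(css):
            name = ref.rsplit("/", 1)[-1].lower()
            if "logo" in name or "header" in name:
                candidates.append(ref)
    return candidates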
Solution two is problematic because of the many idiosyncrasies of the people who write CSS for websites: they use header instead of logo in the file name, sometimes the file name is random and says nothing about a logo, and other times it's simply the wrong image.
I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.
So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)
Thanks!

Check this API by Clearbit. It's super simple to use:
Just send a query to:
https://logo.clearbit.com/[enter-domain-here]
For example:
https://logo.clearbit.com/www.stackoverflow.com
and get back the logo image!
More about it on Clearbit's site.
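A quick sketch of calling it from Python (using requests; error handling kept minimal):

import requests

def fetch_logo(domain, out_path="logo.png"):
    # Clearbit returns the logo bytes directly; a miss comes back
    # as a non-200 status.
    resp = requests.get("https://logo.clearbit.com/" + domain, timeout=10)
    if resp.ok:
        with open(out_path, "wb") as f:
            f.write(resp.content)
    return resp.ok

fetch_logo("stackoverflow.com")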

I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting the image with "logo" in the URL. My variation was that I loaded each webpage in WebKit, so that images referenced from CSS or JavaScript were loaded as well. This technique gave me logos for ~40% of websites.
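For illustration, a rough modern sketch of that load-everything-in-a-real-browser idea (Playwright in Python; the names and the crude "logo in the URL" filter are assumptions, not what the answer actually used):

from playwright.sync_api import sync_playwright

def logo_candidates(url):
    image_urls = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record every image the page actually loads, including ones
        # pulled in by CSS backgrounds or JavaScript.
        page.on("response", lambda r: image_urls.append(r.url)
                if r.request.resource_type == "image" else None)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return [u for u in image_urls if "logo" in u.lower()]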
Then I considered creating an app, as Nick suggested, to manually select the logo for the remaining websites; however, I realized it was more cost-effective to just give these to someone cheap (whom I found via Elance) to do the work manually.
So I suggest don't bother solving this properly with a fully technical solution - outsource the manual labour.

Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.
Have your application store in a database links to all images on a website that are larger than a specified dimension, so that you can weed out small icons.
Then you can set up a form to access these results. You may want to set up the database table to store the website URL and the relationship between the URL and the image links.
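As an illustrative sketch of that first step in Python (the size threshold and schema are placeholders):

import sqlite3
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

MIN_W, MIN_H = 64, 32  # weed out small icons

def collect_candidates(site_url, db_path="candidates.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS images (site TEXT, img TEXT)")
    html = requests.get(site_url, timeout=10).text
    for tag in BeautifulSoup(html, "html.parser").find_all("img", src=True):
        img_url = urljoin(site_url, tag["src"])
        try:
            w, h = Image.open(BytesIO(requests.get(img_url, timeout=10).content)).size
        except OSError:  # not a decodable image
            continue
        if w >= MIN_W and h >= MIN_H:
            conn.execute("INSERT INTO images VALUES (?, ?)", (site_url, img_url))
    conn.commit()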
Even if it were possible to write an application that could truly figure out whether an image is a logo or not, it seems like it would take a massive amount of code. In the end it would probably weed out even more than the above, but you have to take into account that it could be faster for a human to visually parse the results than the time it would take you to write and test that complex code.

Yet another simple way to solve this problem is to get all leaf nodes and get the first
<a><img src="http://example.com/a/file.png" /></a>
You can look for projects that extract HTML leaf nodes on the net, or use regular expressions to get all HTML tags.
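A sketch of that heuristic using a parser rather than regular expressions (Python/BeautifulSoup; illustrative):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def first_linked_image(url):
    # Return the first <img> wrapped in an <a>, which is very often
    # the site logo linking back to the home page.
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for a in soup.find_all("a"):
        img = a.find("img", src=True)
        if img:
            return urljoin(url, img["src"])
    return None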

I used a C# console app with the HtmlAgilityPack NuGet package to scrape logos from over 600 sites.
The algorithm is to take all images that have "logo" in the URL.
The challenges you will face during such extraction are:
Relative image URLs
Base URLs on a CDN, over HTTP or HTTPS (if you don't know the protocol before you make a request)
Images with ? or & query strings at the end
With those things in mind I got approximately a 70% success rate, but some of the images were not actual logos.
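The original was C# with HtmlAgilityPack; here is a Python sketch of the same algorithm that handles the three cases above (the details are assumptions):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def logo_urls(page_url):
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    results = []
    for tag in soup.find_all("img", src=True):
        src = tag["src"]
        # Protocol-relative CDN URLs such as //cdn.example.com/logo.png:
        if src.startswith("//"):
            src = urlparse(page_url).scheme + ":" + src
        # Relative paths resolve against the page URL:
        src = urljoin(page_url, src)
        # Match on the path only, so ?v=3&x=y query strings don't get in the way:
        if "logo" in urlparse(src).path.lower():
            results.append(src)
    return results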

Related

How to export React JS components to static HTML

Are there any utilities or approaches to export a regular React component into email-friendly static HTML?
For example, I have a dashboard using react-table and would love it if there was a way to auto-magically translate that to static HTML I could insert into an email body.
I can think of a few approaches using a headless browser to render it as pure HTML, but it would be awesome if there was a solution with more email-friendly HTML.
Because the layout of these gets fairly complex, it may also be advantageous to render the page as an image and insert that image into the email body?
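A minimal sketch of that headless-browser idea (Playwright in Python; the URL and selector are placeholders):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 600, "height": 800})
    page.goto("https://example.com/dashboard", wait_until="networkidle")
    # Screenshot just the table component rather than the whole page.
    page.locator("#dashboard-table").screenshot(path="table.png")
    browser.close()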
I only really know HTML Email, and basics of React - and SO is not great for recommendations of software type questions - so I'll just speak to the email side.
If you can get HTML, you need to consider a few things.
First, anything over two columns is likely to run out of space. You would need to consider a stackable column structure with repeated headers. That would require hiding the duplicated headers for desktop views, since the tables would be separate due to the way we do stacked columns in emails (as inline-blocks without media queries). See https://medium.com/@nathankeenmelb/bulletproof-responsive-datatables-in-html-emails-64248b9e18f5 for full details.
Second, only some approaches would work like that. Images then would be your go-to option. A nice output for table images would be:
<img src="https://via.placeholder.com/600x500" width="600" style="vertical-align:middle;width:100%;border:0">
The link goes to the image itself so you can zoom and move around more easily, and get maximum real estate.
An alternative built on that idea would be to have a link from the image to landing page with full web capabilities. That would take longer to load, but may be well worth it.
Since that's probably the most viable, I'll explain these choices of attributes and styles:
Use the width attribute width="600" because that's what Outlook desktop uses
Use inline styles for those email clients that do not support <style> blocks
Vertical-align:middle (or display:block) removes the space underneath the image that some email clients add
width:100% makes it responsive to mobiles
border:0 ensures no border is shown because of the link
Third, datatables are finicky and particular in HTML email. Each table is unique, because the tables hold different data that responds differently. In normal web design, you can just use a nice reset and get everything working without much thought. In HTML email, everything needs to be inline and supported, with fallbacks for the things that are unsupported. So even the core data often needs editing - e.g. if it has long URLs, emails, or words, you need to add a wrapping span with word-break CSS, but also <wbr>s in the middle of it for some email clients to wrap properly.
Datatables don't often come up, and because of these considerations, it's hard to see how they could be automated easily - and hard to build a case for it financially.
On a related note, if you can show the information using card UI, that seems to me to be a much nicer, simpler, more accessible and easier to code solution than datatables. This is about taking the information and redesigning it into card blocks. I talk about that in detail here: https://medium.com/@nathankeenmelb/responsive-datatables-through-card-ui-design-for-email-aca6f3c395a2

Need to screen scrape browser as opposed to webpage

I have a webpage that needs to be scraped to look for certain text. The problem is that it's not really webscraping I am trying to achieve. The website is opened by a separate process. I am specifically talking about a webpage, but really it is more of a universal screen-scraping issue. Conceptually, it's more like I am scraping the browser instead of the page itself. Is there a program that can scan any open process and look for and match text? To put it another way, it would be like having a find function separate from the browser's built-in Ctrl+F. I just need a simple utility to tell me if a given text is present, in a boolean fashion. I realize this is a very broad question, but I haven't been able to find anything about it. Maybe I don't quite know how to articulate it in a Google search, because my research keeps coming up empty.
If you already know the structure of the page, like it's always Google search results, or always an Amazon product, you might look at Selenium or one of the many Chrome screen-scraping add-ons.
If you want to grab data off of any page without knowing the format in advance, I don't know a way.
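If you can open the page yourself rather than attaching to the other process's browser, a boolean text check in Selenium is only a few lines (Python sketch; the URL and text are placeholders):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/page-to-check")
# True if the text appears anywhere in the rendered page source.
found = "text to find" in driver.page_source
driver.quit()
print(found)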

Export content from an ecommerce site without using the Backend

I have a site that I'm looking to transfer to Volusion. Importing tabled content into Volusion is a breeze; it's getting it tabled that's the issue. The old site has no real ability to export, nor do I know how to get at its database. I'm thinking there must be some sort of script I can write to take the content from the frontend and download it as some sort of list that I can put into a CSV and import into Volusion.
www.twincitygreetings.com
Any suggestions? I'm hoping to get into the image directory as well and download all of the images for upload to the new site.
You are going to need at the very least a file with product code, product name, weight and price.
Looking at the URL you provided, it doesn't appear that the products there follow any type of orderly structure where you can target the images folder or products based on a known piece of information like a product code. Unless the back-end has some type of product export function, you may have no choice but to recreate it from scratch.
I don't know if you solved this yet or not, but I would suggest scraping the data, provided you still have the information on the old site. This can be done easily using VBScript and Excel, or if you aren't very savvy at coding you could look at a piece of software called Mozenda. There are a whole variety of methods that can be used to scrape data, all of them pretty easy to learn with a bit of research. Basically, you write a script that will crawl the DOM and extract the data (to XML works best in my experience).
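As an illustration of that kind of script in Python (the selectors and field names are hypothetical; you would adapt them to the site's actual markup):

import csv
import requests
from bs4 import BeautifulSoup

def export_products(product_urls, out_path="products.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["code", "name", "price"])  # fields Volusion needs
        for url in product_urls:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            # Placeholder selectors -- inspect the real product pages.
            code = soup.select_one(".product-code").get_text(strip=True)
            name = soup.select_one("h1").get_text(strip=True)
            price = soup.select_one(".price").get_text(strip=True)
            writer.writerow([code, name, price])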
Hope this helps.

Mapping without Google Maps (on a stand-alone server)

I've been asked to create a stand-alone site/app that's not connected to the web (all on a local server).
One part of it is to have a map of a natural reserve with a bunch of links that will show footpaths, different animals habitat areas, visitor centres and such.
So there's a map (static picture) and when you click on it some overlay goes on top of it.
At least that's the way I see it now.
I've looked here: http://www.carto.net/williams/yosemite/ but it just looks mucho ugly.
Getting Maps Premium is not an option, as it's not that cheap. And the reason they don't want to use the free Maps/Earth API is that the internet connection there is still very slow (satellite internet only, and nobody knows when the optic cable will be hooked up).
Looking for some recommendations as to how to proceed. Drawing paths/areas on the picture of the map by hand seems extremely inefficient and time-consuming.
I'd need some way to use coordinates to automatically draw areas and lines over the map, and then somehow export that as a graphics file (or SVG) that'll be layered on top of the original map simply using Ajax.
Would the ArcGIS Pro edition be the way to go, or should I start learning SVG? Do you know some good SVG books/tutorials (as related to mapping)? Maybe there's some other way around it altogether...
They do have detailed maps of the area in ArcGIS (whatever format they are in, I don't know yet).
Just looking for some ideas, any help will be appreciated. Thanks in advance.
Do you know GeoServer? More or less all-in-one, compatible with different types of datasets, widely customisable.
Starting from "raw" SVG and write the whole thing yourself will probably be prohibitively time consuming.
If you have very little data (say less than 50 geometries) that is fixed, you could also use OpenLayers without any backend server.
For the data you could use an OpenLayers.Layer.Image if your (overlay) map consists of a small raster image. For vector data, you can use an OpenLayers.Layer.Text or an OpenLayers.Layer.Vector together with the KML or JSON protocols.
You can click through the current release examples.
I admit that this is not an easy task for a beginner, but it's fun hacking the maps together.

Question on serving Images on App Engine ( 2 Alternatives )

I'm planning to launch a comic site which serves comic strips (images).
I have little prior experience with serving/caching images.
So these are the 2 methods I'm considering:
1. Using LinkProperty
class Comic(db.Model):
    # Link (URL) to a statically served image file.
    image_link = db.LinkProperty()
    timestamp = db.DateTimeProperty(auto_now=True)
Advantages:
The images are served straight from disk (and disk space is cheap, I take it?)
I can easily set up app.yaml with an expiration date to cache the content in user's browser
I can set up memcache to retrieve the entities faster (for high traffic)
2. Using BlobProperty
I used this tutorial , it worked pretty neat. http://code.google.com/appengine/articles/images.html
Side question: Can I say that using BlobProperty sort of "protects" my images from outside linkage? That means people can't just link directly to the comic strips
I have a few worries for method 2.
I can obviously memcache these entities for faster reads.
But then:
Is memcaching images a good thing? My images are large (100-200 KB per image). I think memcache allows only up to 4 GB of cached data? Or is it 1 MB per memcached entity, with unlimited entities...
What if appengine's memcache fails? -> Solution: I'd have to go back to the datastore.
How do I cache these images in the user's browser? If I was doing method no. 1, I could just easily add to my app.yaml the expiration date for the content, and pictures get cached user side.
would like to hear your thoughts.
Should I use method 1 or 2? method 1 sounds dead simple and straightforward, should I be wary of it?
[EDITED]
How do I solve this dilemma?
Dilemma: the last thing I want is people grabbing the direct link to the image and putting it up on bit.ly, because then the user gets directed to just the image on my server
(and not the advertising/content around it, as they would if they had accessed it from the main page itself).
You're going to be using a lot of bandwidth to transfer all these images from the server to the clients (browsers). Remember that App Engine has a maximum number of files you can upload; I think it is 1000, but it may have increased recently. And if you want to control access to the files, I do not think you can use option #1.
Option #2 is good, but your bandwidth and storage costs are going to be high if you have a lot of content. To solve this problem, people usually turn to Content Delivery Networks (CDNs). Amazon S3 and edgecast.com are two such CDNs that support token-based access URLs. Meaning, you can generate a token in your App Engine app that is good for an IP address, a time window, a geography, and some other criteria, and then hand out your CDN URL with this token to the requestor. The CDN serves your images and does the access checks based on the token. This will help you control access, but remember: if there is a will, there is a way, and you can't 100% secure anything - but you can probably get reasonably close.
So instead of storing the content in appengine, you would store it on the cdn, and use appengine to create urls with tokens pointing to the content on the cdn.
Here are some links about the signed URLs. I've used both of these:
http://jets3t.s3.amazonaws.com/toolkit/code-samples.html#signed-urls
http://www.edgecast.com/edgecast_difference.htm - look at 'Content Security'
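Each CDN has its own signing scheme, but the general shape is an HMAC over the path plus an expiry; a generic sketch (the secret and URL layout are hypothetical):

import hashlib
import hmac
import time

SECRET = b"shared-with-the-cdn"  # hypothetical shared secret

def signed_url(path, ttl_seconds=300):
    expires = str(int(time.time()) + ttl_seconds)
    token = hmac.new(SECRET, (path + expires).encode(), hashlib.sha256).hexdigest()
    # The CDN recomputes the HMAC and rejects expired or tampered URLs.
    return "https://cdn.example.com%s?expires=%s&token=%s" % (path, expires, token)

print(signed_url("/comics/strip-42.png"))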
In terms of solving your dilemma, I think that there are a couple of alternatives:
You could cause the images to be rendered in a Flash object that would download the images from your server in some kind of encrypted format that it would know how to decode. This would involve quite a bit of up-front work.
You could have a valid-one-time link for the image. Each time that you generated the surrounding web page, the link to the image would be generated randomly, and the image-serving code would invalidate that link after allowing it one time. If you have a high-traffic web-site, this would be a very resource-intensive scheme.
Really, though, you want to consider just how much work it is worth to force people to see ads, especially when a goodly number of them will be coming to your site via Firefox, and there's almost nothing that you can do to circumvent AdBlock.
In terms of choosing between your two methods, there are a couple of things to think about. With option one, where you are storing the images as static files, you will only be able to add new images by doing an appcfg.py update. Since App Engine applications do not allow you to write to the filesystem, you will need to add new images to your development code and do a code deployment. This might be difficult from a site-management perspective. Also, serving the images from memcache would likely not offer you an improvement in performance over having them served as static files.
Your second option, putting the images in the datastore, protects your images from linking only to the extent that you have some power to control, through logic, whether they are served or not. The problem you will encounter is that making that decision is difficult. Remember that HTTP is stateless, so finding a way to distinguish a request coming from a link external to your application from one internal to it is going to require trickery.
My personal feeling is that jumping through hoops to make sure that people can't see your comics without seeing ads is solving the problem the wrong way. If the content that you are publishing is worth protecting, people will flock to your website to enjoy it anyway. Through high volumes of traffic, you will more than make up for anyone who directly links to your image, thus circumventing a few ad serves. Don't try to outsmart your consumers. Deliver outstanding content, and you will make plenty of money.
Your method #1 isn't practical: You'd need to upload a new version of your app for each new comic strip.
Your method #2 should work fine. It doesn't automatically "protect" your images from being hotlinked - they're still served up on a URL like any other image - but you can write whatever code you want in the image serving handler to try and prevent abuse.
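For example, a sketch of such a handler for method 2 (old-style GAE webapp; it assumes the Comic model has an image BlobProperty, and the Referer check is a crude, spoofable deterrent):

from google.appengine.ext import db, webapp

class ImageHandler(webapp.RequestHandler):
    def get(self, comic_id):
        referer = self.request.headers.get("Referer", "")
        if referer and "yoursite.example" not in referer:
            self.error(403)  # naive hotlink check, easily spoofed
            return
        comic = db.get(db.Key.from_path("Comic", int(comic_id)))
        self.response.headers["Content-Type"] = "image/png"
        # Let browsers cache the strip for a day.
        self.response.headers["Cache-Control"] = "public, max-age=86400"
        self.response.out.write(comic.image)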
A third option, and a variant of #2, is to use the new Blobstore API. Instead of storing the image itself in the datastore, you can store the blob key, and your image handler just instructs the blobstore infrastructure which image to serve.
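A minimal sketch of that variant, following the Blobstore serving pattern:

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class ServeHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        blob_info = blobstore.BlobInfo.get(resource)
        # App Engine's blobstore infrastructure streams the bytes;
        # the handler never loads the image into memory itself.
        self.send_blob(blob_info)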
