I'm planning to launch a comic site that serves comic strips (images).
I have little prior experience with serving/caching images, so these are the two methods I'm considering:
1. Using LinkProperty
class Comic(db.Model):
    image_link = db.LinkProperty()
    timestamp = db.DateTimeProperty(auto_now=True)
Advantages:
The images are served as static files straight from disk (and disk space is cheap, I take it?)
I can easily set up app.yaml with an expiration so the content is cached in the user's browser
I can set up memcache to retrieve the entities faster (for high traffic)
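For reference, here is roughly how I picture the memcache read-through working, falling back to the datastore on a miss (which is also my tentative answer to the "what if memcache fails" worry below). The helper name and the one-hour lifetime are just placeholders, and as far as I know memcache caps each cached value at about 1 MB, so a 100-200 KB image or entity should fit:

from google.appengine.api import memcache

def get_comic(comic_id):
    # Comic is the model defined above
    cache_key = 'comic:%s' % comic_id
    comic = memcache.get(cache_key)
    if comic is None:
        # cache miss, or memcache unavailable: fall back to the datastore
        comic = Comic.get_by_id(int(comic_id))
        if comic is not None:
            memcache.set(cache_key, comic, time=3600)  # re-cache for an hour
    return comic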
2. Using BlobProperty
I used this tutorial, and it worked pretty neatly: http://code.google.com/appengine/articles/images.html
Side question: can I say that using BlobProperty sort of "protects" my images from outside linkage, i.e. that people can't just link directly to the comic strips?
I have a few worries about method 2.
I can obviously memcache these entities for faster reads.
But then:
Is memcaching images a good thing? My images are large (100-200 KB per image). I think memcache only allows up to 4 GB of cached data in total? Or is it 1 MB per memcached entry, with an unlimited number of entries?
What if App Engine's memcache fails? -> Solution: I'd have to go back to the datastore.
How do I cache these images in the user's browser? With method 1, I could just add an expiration for the content to my app.yaml, and the pictures would get cached on the user's side.
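My tentative understanding (please correct me if this is wrong) is that with method 2 I would have to set the caching headers myself in the image handler, something like this sketch (webapp-style; load_comic_image() is a made-up helper standing in for the memcache/datastore lookup):

from google.appengine.ext import webapp

class ComicImageHandler(webapp.RequestHandler):
    def get(self, comic_id):
        image_data = load_comic_image(comic_id)  # hypothetical helper: memcache/datastore lookup
        self.response.headers['Content-Type'] = 'image/png'
        # tell the browser it may cache the image for ~30 days
        self.response.headers['Cache-Control'] = 'public, max-age=2592000'
        self.response.out.write(image_data)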
I'd like to hear your thoughts. Should I use method 1 or 2? Method 1 sounds dead simple and straightforward; should I be wary of it?
[EDITED]
How do I solve this dilemma?
Dilemma: the last thing I want is for people to grab the direct link to an image and put it up on bit.ly, because then users get directed straight to the image on my server
(and not to the advertising/content around it, which they would see if they had accessed it from the main page itself).
You're going to be using a lot of bandwidth to transfer all these images from the server to the clients (browsers). Remember that App Engine has a maximum number of files you can upload; I think it is 1000, but it may have increased recently. And if you want to control access to the files, I do not think you can use option #1.
Option #2 is good, but your bandwidth and storage costs are going to be high if you have a lot of content. To solve this problem, people usually turn to Content Delivery Networks (CDNs). Amazon S3 and edgecast.com are two such CDNs that support token-based access URLs. Meaning, you can generate a token in your App Engine app that is good for an IP address, a time window, a geography and some other criteria, and then give out your CDN URL with this token to the requester. The CDN serves your images and does the access checks based on the token. This will help you control access, but remember: if there is a will, there is a way, and you can't 100% secure anything - though you can probably get reasonably close.
So instead of storing the content in App Engine, you would store it on the CDN, and use App Engine to create URLs with tokens pointing to the content on the CDN.
Here are some links about signed URLs. I've used both of these:
http://jets3t.s3.amazonaws.com/toolkit/code-samples.html#signed-urls
http://www.edgecast.com/edgecast_difference.htm - look at 'Content Security'
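To make the token idea concrete, here is a rough sketch of S3-style query-string signing (signature version 2, the scheme the first link describes); the credentials, bucket name and 5-minute lifetime are placeholders:

import base64, hmac, time, urllib
from hashlib import sha1

ACCESS_KEY = 'YOUR-ACCESS-KEY'   # placeholder credentials
SECRET_KEY = 'YOUR-SECRET-KEY'
BUCKET = 'my-comics'             # hypothetical bucket name

def signed_url(key, lifetime=300):
    expires = int(time.time()) + lifetime
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, BUCKET, key)
    signature = base64.b64encode(hmac.new(SECRET_KEY, string_to_sign, sha1).digest())
    return ("https://%s.s3.amazonaws.com/%s?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (BUCKET, key, ACCESS_KEY, expires, urllib.quote_plus(signature)))

Once the Expires time has passed, or if the signature doesn't match, the CDN refuses the request, so the raw URL is of little use to anyone who copies it.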
In terms of solving your dilemma, I think that there are a couple of alternatives:
You could cause the images to be rendered in a Flash object that would download the images from your server in some kind of encrypted format that it would know how to decode. This would involve quite a bit of up-front work.
You could have a valid-one-time link for the image. Each time that you generated the surrounding web page, the link to the image would be generated randomly, and the image-serving code would invalidate that link after allowing it one time. If you have a high-traffic web site, this would be a very resource-intensive scheme.
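A rough sketch of that second alternative on App Engine, using memcache to remember outstanding tokens (the names and the 5-minute expiry are illustrative, and memcache can evict entries early, so this is best-effort):

import os
from google.appengine.api import memcache

def make_one_time_link(comic_id):
    token = os.urandom(16).encode('hex')
    memcache.set('imgtoken:' + token, comic_id, time=300)  # valid for at most 5 minutes
    return '/img/%s?token=%s' % (comic_id, token)

def token_is_valid(token, comic_id):
    key = 'imgtoken:' + token
    if memcache.get(key) != comic_id:
        return False
    memcache.delete(key)  # one use only: invalidate immediately
    return True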
Really, though, you want to consider just how much work it is worth to force people to see ads, especially when a goodly number of them will be coming to your site via Firefox, and there's almost nothing that you can do to circumvent AdBlock.
In terms of choosing between your two methods, there are a couple of things to think about. With option one, where you are storing the images as static files, you will only be able to add new images by doing an appcfg.py update. Since App Engine applications do not allow you to write to the filesystem, you will need to add new images to your development code and do a code deployment. This might be difficult from a site-management perspective. Also, serving the images from memcache would likely not offer you a performance improvement over having them served as static files.
Your second option, putting the images in the datastore, protects your images from linking only to the extent that you have some power to decide in code whether they are served or not. The problem you will encounter is that making that decision is difficult. Remember that HTTP is stateless, so finding a way to distinguish a request coming from a link external to your application from one internal to it is going to require trickery.
My personal feeling is that jumping through hoops to make sure that people can't see your comics without seeing ads is solving the problem the wrong way. If the content that you are publishing is worth protecting, people will flock to your website to enjoy it anyway. Through high volumes of traffic, you will more than make up for anyone who links directly to your images, thus circumventing a few ad serves. Don't try to outsmart your consumers. Deliver outstanding content, and you will make plenty of money.
Your method #1 isn't practical: You'd need to upload a new version of your app for each new comic strip.
Your method #2 should work fine. It doesn't automatically "protect" your images from being hotlinked - they're still served up on a URL like any other image - but you can write whatever code you want in the image serving handler to try and prevent abuse.
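For example, a crude Referer check in the serving handler; this is only a sketch (the Comic model with a BlobProperty is assumed per method 2, the domain is a placeholder, and Referer headers can be absent or spoofed, so treat it as a deterrent rather than real protection):

from google.appengine.ext import db, webapp

class Comic(db.Model):
    image = db.BlobProperty()    # method #2: image bytes stored in the datastore

class ComicHandler(webapp.RequestHandler):
    def get(self, comic_id):
        referer = self.request.headers.get('Referer', '')
        if referer and 'yourcomicsite.example' not in referer:
            self.error(403)      # looks like a hotlink from another site
            return
        comic = Comic.get_by_id(int(comic_id))
        if comic is None or not comic.image:
            self.error(404)
            return
        self.response.headers['Content-Type'] = 'image/png'
        self.response.out.write(comic.image)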
A third option, and a variant of #2, is to use the new Blob API. Instead of storing the image itself in the datastore, you can store the blob key, and your image handler just instructs the blobstore infrastructure what image to serve.
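A minimal sketch of that variant (model and handler names are made up; the entity stores only the blob key, and send_blob() lets the blobstore infrastructure stream the image without it passing through your handler code):

from google.appengine.ext import blobstore, db
from google.appengine.ext.webapp import blobstore_handlers

class ComicStrip(db.Model):
    blob_key = blobstore.BlobReferenceProperty()

class ComicStripHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, comic_id):
        comic = ComicStrip.get_by_id(int(comic_id))
        if comic is None:
            self.error(404)
            return
        self.send_blob(comic.blob_key)  # blobstore serves the bytes directly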
1. The setup
I'm currently initiating a GET request to an S3 bucket (not important) to download a very large file using the browser's fetch(). This file is, in its stored form, raw unstructured binary data that isn't directly usable.
2. The task and problem
There are a few things I want to do on the client-side with this data:
I need to process this data as it streams into the client to perform transformations on it (decryption, for example).
Once the data is processed and downloaded, it might still not be of any immediate use to the user outside the context of the web UI. Maybe the data should stay stored within the web app's sandbox disk space unless a user explicitly exports it?
3. The question
Where can I store this blob of unstructured data in both or either of the use cases listed above? There appear to be many options but none that fit this use case precisely. Any thoughts?
EDIT:
I feel like an idiot. I totally forgot about the FileSystem API. I'll take a look and answer my own question with a pseudo-implementation if it works.
EDIT 2:
I feel the need to reiterate what I stated in 2.2 above:
within the web app's sandbox disk space
I don't care about accessing the user's whole file system. I just want a space on disk that I can work with large files in, similar to the app-space directories provided to mobile applications by Android and iOS.
If you want to save and process a file at the client level, and Blob is not an option, you may consider the File System Access API (https://developer.mozilla.org/en-US/docs/Web/API/File_System_Access_API#writing_to_files), even though this will introduce an interaction with the user.
Another option would be to take advantage of PWA client-side storage (https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Client-side_storage); this is also a question of your application architecture.
Before checking whether processing your file at the client level can be done the way you need with existing technologies, check whether you really need to do it there because it is the only option, or whether you could instead move that logic to the server level, depending on your use cases.
I am struggling with some details about finding a solution to make an Angular / C# app available offline.
My idea would be to use upup.js to get the business logic for the Angular SPA available offline. upup.js uses service workers to do so. I would store the data required for the offline app using angular-localForage, which uses IndexedDB and falls back to WebSQL and localStorage if needed.
The problem is that I also have to make files and images available offline, without requiring the user to visit the page they are used on, and I am worried about the maximum quotas. I could either store them with upup.js by adding those files as assets, or store them with angular-localForage as blobs. IndexedDB is supposed to be effectively unlimited by now, if I am informed correctly? I couldn't find any maximum quota for a service worker though, which the upup.js solution would use. AppCache is deprecated, so I wouldn't use that... Or maybe I have understood something completely wrong, or there is an even better solution? Anyhow, the question is:
TL;DR: What is the better way to store files for an AngularJS offline application: angular-localForage (IndexedDB etc.) or a service worker (upup.js), and what are the maximum quotas for each solution? Or is there an even better solution?
In my opinion a Service Worker (SW) is better than traditional local storage. Plus, a SW can also use IndexedDB.
For the implementation, it depends a lot: how is your app structured, which front-end technologies are used with your Angular app, and what are your main goals in using a SW?
1. Traditional JS loading, where you likely merge all the JS files into one, e.g. an app.js that contains everything.
And you also don't care about push notifications or any of the other cool features that SWs offer.
=> In this case it seems like upup.js suits you best.
NOTE: beware that upup.js attempts to register the SW on its own, so it is likely to block or complicate your work if you later expand the SW's features.
2. Advanced AMD user, where almost all of your JS is chopped into small pieces, like fooCtrl.js, barCtrl.js, etc.
You surely don't want to configure 100+ JS files one by one, and furthermore you will have a lot of HTML templates to load.
=> In this case I suggest you use sw-toolbox, a very powerful and lightweight tool made by Google. If you are not familiar with SW concepts yet, you will initially have a bit of trouble setting it up for your site (but it shouldn't take more than a day if you are an experienced JS developer).
Once everything is configured, it all becomes very simple. For example, this is how I cache all the static content on my site:
self.toolbox.router.get(/\.(js|css|png|jpg|json|html)$/, self.toolbox.fastest);
3. You don't care what technology is used on the front-end side; you are just interested in the SW.
=> Simply go for sw-toolbox; it's a real time-saver for basic configuration. And if you want to expand your usage of the SW later, you can do so however you like.
I am planning to develop a hybrid mobile app using Ionic. One of the features I need is an offline Google map. Is there a way to do it?
It depends on the requirements of your application whether this will be possible. Are your users on "modern" devices, i.e. is HTML5 fully supported? Do your users need to view/edit the map globally, or just in a specific area? Does the map really need to be provided by Google? I'll address some issues below to point you at possible takes on this problem.
Do you really need Google Maps? (the optimal scenario)
First off, do you really need Google Maps? Also relevant: how far do your users need to zoom their maps? If any map source will do, and deep zooming is not a high priority (if it is, including all map tiles will make the app eat all available storage), you could probably ship map tiles as a packaged part of your app and display them with a library like http://leafletjs.com/. The library is well documented and provides a map interface for a variety of map providers. It will be doable to configure it to use your own local map tiles. You could include map tiles for multiple zoom levels if necessary, and limit the min/max zoom levels to the tiles you actually have available. This will make your maps work offline.
I can't or don't want to provide my own tiles
Make sure that you have really looked into the option above; there are systems out there that provide map tiles you could use (check https://www.mapbox.com/ for example).
Okay, so you really don't want to do what I suggested. What are the options now? JavaScript mapping solutions typically render tiles based on the location of the map you want to see and the zoom level. These tiles are requested from the tile provider. I do not know how to implement this for Google exactly, so you might need some research on this - I'll try to help you see a direction. There will be requests to get the tiles from the servers. I checked on http://maps.google.com which images are loaded when navigating the map; find out which URLs are used in your situation (just inspect the network tab in your browser console and see which requests are made when scrolling the map). If your users only need to work in a certain area when offline, you could use service workers to cache the responses of these requests while you are online, and serve those cached responses when you are offline. Read up on service workers if they are new to you.
Advantage: real offline map functionality for any tile you visited before (as long as your cache wasn't overflowed, depending on your implementation of the service workers, and only on service-worker-supported browsers/devices).
Disadvantages: no support for tiles that were never put in the cache (i.e. never seen before). Another one: this will only work for devices that support service workers, so it might only be an option in situations where you either don't care about users on "older" devices, or where you can control the users' device choices. Note that using Crosswalk could ease your development effort here, since you only have to consider one browser runtime then; but Crosswalk also doesn't support older devices.
However: this solution could be fine for people who need to work in a specific area, which might be true for the case described by #vipul-r. If you or your users know in advance where they need their maps to work, you can instruct/help them in loading & caching their maps correctly.
If you can't work with either of these two solutions, then I highly doubt there will be a way to do it. I don't see any other way to the best of my knowledge.
First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me for scraping (css_parser, nokogiri, etc.); I'm using Ruby to do the scraping.
This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.
The two solutions I've begun to create are these:
Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e. H1 text that is image-replaced with the logo). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words header or logo in the file names.
Solution two is problematic because of the many idiosyncrasies of all the people who write CSS for websites. They use header instead of logo in the file name. Sometimes the file name is random, saying nothing about a logo. Other times, it's just the wrong image.
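To sketch what I mean for the CSS scan (in Python here purely to illustrate the idea; I'm actually using Ruby, and the keyword list and regex are obviously naive):

import re
import urllib2
from urlparse import urljoin

URL_RE = re.compile(r'url\(([^)]+)\)', re.IGNORECASE)
KEYWORDS = ('logo', 'header', 'brand')   # guesses at likely file-name hints

def candidate_logos(css_url):
    css = urllib2.urlopen(css_url).read()
    for raw in URL_RE.findall(css):
        ref = raw.strip().strip('\'"')
        if any(k in ref.lower() for k in KEYWORDS):
            yield urljoin(css_url, ref)   # resolve relative paths against the stylesheet URL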
I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.
So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)
Thanks!
Check this API by Clearbit. It's super simple to use:
Just send a query to:
https://logo.clearbit.com/[enter-domain-here]
For example:
https://logo.clearbit.com/www.stackoverflow.com
and get back the logo image!
More about it in Clearbit's Logo API documentation.
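Since the endpoint is just an image URL per domain, fetching a logo is an ordinary HTTP GET; a quick sketch (deliberately minimal, no error handling):

import urllib2

def fetch_logo(domain, out_path):
    data = urllib2.urlopen('https://logo.clearbit.com/' + domain).read()
    with open(out_path, 'wb') as f:
        f.write(data)

fetch_logo('www.stackoverflow.com', 'stackoverflow.png')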
I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting the image with "logo" in the URL. My variation was that I loaded each webpage in WebKit, so that images pulled in via CSS or JavaScript were loaded as well. This technique gave me logos for ~40% of websites.
Then I considered creating an app like Nick suggested to manually select the logo for the remaining websites; however, I realized it was more cost-effective to just give these to someone cheap (whom I found via Elance) to do the work manually.
So I suggest you don't bother solving this properly with a fully technical solution - outsource the manual labour.
Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.
Have your application store in a database a link to every image on a website that is larger than a specified dimension, so that you can weed out small icons.
Then you can set up a form to review these results. You may want to set up the database table to store the website URL and the relationship between the URL and the image links.
Even if it were possible to write an application that could truly figure out whether an image is a logo or not, it seems like it would take a massive amount of code. In the end it would probably weed out even more than the above, but you have to take into account that it could be faster for a human to visually parse the results than the time it would take you to write and test the complex code.
Yet another simple way to solve this problem is to get all leaf nodes and take the first
<a><img src="http://example.com/a/file.png" /></a>
You can look for existing projects that extract HTML leaf nodes, or use regular expressions to get all HTML tags.
I used a C# console app with the HtmlAgilityPack NuGet package to scrape logos from over 600 sites.
The algorithm is to get all images that have "logo" in the URL.
The challenges you will face during such an extraction are:
Relative image URLs
The base URL being a CDN, and HTTP vs. HTTPS (if you don't know the protocol before you make a request)
Images having a ? or & with a query string at the end
With those things in mind I got approximately 70% success, but some of the images were not actual logos.
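For anyone not on .NET, roughly the same idea as a Python sketch (standard library only; the keyword filter and URL handling are deliberately simplistic):

import urllib2
from HTMLParser import HTMLParser
from urlparse import urljoin

class LogoFinder(HTMLParser):
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.logos = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        src = dict(attrs).get('src') or ''
        if 'logo' in src.lower():
            # urljoin resolves relative and protocol-relative ("//cdn...") URLs
            self.logos.append(urljoin(self.base_url, src))

def find_logos(page_url):
    parser = LogoFinder(page_url)
    parser.feed(urllib2.urlopen(page_url).read())
    return parser.logos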
Is there an elegant way to determine the size of data downloaded from a website, bearing in mind that not all requests will go to the same domain that you originally visited, and that other browsers may be polling in the background at the same time? Ideally I'd like to look at the size of each individual page, or for a Flash site the total downloaded over time.
I'm looking for some kind of browser plug-in or Fiddler script. I'm not sure Fiddler would work due to the issues pointed out above.
I want to compare sites similar to mine for total filesize - and keep track of my own site also.
Firebug and HttpFox are two Firefox plugins that can be used to determine the size of data downloaded from a website for a single page. While Firebug is a great tool for any web developer, HttpFox is a more specialized plugin for analyzing HTTP requests/responses (with their sizes).
You can install both and try them out; just be sure to disable one while enabling the other.
If you need a website wide measurement:
If the website is made of plain HTML and assets (like CSS, images, Flash, ...) you can check how big the folder containing the website is on the server (this assumes you can log in to the server)
You can mirror the website locally using wget, curl or some GUI-based application like SiteSucker, and check how big the folder containing the mirror is
If you know the website is huge but you don't know by how much, you can estimate its size, e.g. www.mygallery.com has 1000 galleries; each gallery has an average of 20 images; every image is stored in 2 different sizes (thumbnail and full size) at an average of n KB per image; ...
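As a back-of-the-envelope sketch with made-up numbers for that gallery example:

galleries = 1000
images_per_gallery = 20
sizes_per_image = 2          # thumbnail + full size
avg_kb_per_image = 150       # assumed average

total_kb = galleries * images_per_gallery * sizes_per_image * avg_kb_per_image
print "roughly %.1f GB" % (total_kb / 1024.0 / 1024.0)   # ~5.7 GB for these numbers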
Keep in mind that if you download or estimate a dynamic website, you are dealing with what the website produces, not with the real size of the website on the server. A small PHP script can produce tons of HTML.
Have you tried Firebug for Firefox?
The "Net" panel in Firebug will tell you the size and fetch time of each fetched file, along with the totals.
You can download the entire site and then you will know for sure!
https://www.httrack.com/