I am trying to design a basic web crawler in Google App Engine that will grab documents from websites. The problem I have is that I need to share the set of already-visited pages between task queue tasks. I attempted to do this with Memcache, but it doesn't appear to be working. What is the best approach?
from google.appengine.api import memcache
from google.appengine.ext import deferred

def spin_worker(root_url, url):
    memcache_results = memcache.get(root_url, namespace='charlotte')
    if memcache_results:
        if url in memcache_results:
            return False
        else:
            pass
    else:
        memcache_results = []
    memcache.add(root_url, memcache_results.append(url), 60 * 60, namespace='charlotte')
    print memcache_results

    charlotte = SpiderWorker(root_url)
    charlotte.get_page(url)
    links = charlotte.grab_links()
    for url in links:
        deferred.defer(spin_worker, root_url=root_url, url=url, _queue='doc-repo')
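A likely culprit: list.append() returns None, so the memcache.add() call above stores None rather than the updated list, and add() also never overwrites a key that already exists. A minimal sketch of a corrected check-and-record step (helper name hypothetical; memcache entries can still be evicted, and concurrent tasks can race between the get and the set):

from google.appengine.api import memcache

def seen_before(root_url, url):
    # Fetch the visited list for this crawl, defaulting to an empty one
    visited = memcache.get(root_url, namespace='charlotte') or []
    if url in visited:
        return True
    visited.append(url)
    # set() overwrites the existing value; add() silently no-ops
    # once the key exists
    memcache.set(root_url, visited, 60 * 60, namespace='charlotte')
    return False

Because memcache offers no durability guarantee, a datastore entity keyed on the URL is the more reliable record of visited pages, with memcache acting as a cache in front of it.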
I have a Flask app where a user can upload an image, and the image is saved in a static folder on the filesystem.
Currently I'm using Google App Engine for hosting, and found that it's not possible to save to the static folder on the standard environment. Here is the code:
import os
from PIL import Image

def save_picture(form_picture, name):
    picture_fn = name + '.jpg'
    picture_path = os.path.join(app.instance_path, 'static/image/' + picture_fn)
    output_size = (1000, 1000)
    i = Image.open(form_picture)
    i.thumbnail(output_size)
    i.save(picture_path)
    return picture_path
@app.route('/image/add', methods=['GET', 'POST'])
def addimage():
    form = Form()
    if form.validate_on_submit():
        name = 'randomname'
        try:
            picture_file = save_picture(form.image.data, name)
            return redirect(url_for('addimage'))
        except Exception:
            flash("unsuccess")
            return redirect(url_for('addimage'))
My question is: if I change from the standard to the flexible environment, would it be possible to save to a static folder? If not, what other hosting options should I consider? Do you have any suggestions?
Thanks in advance.
Following your advice, I'm changing to Cloud Storage. I'm wondering which I should use: upload_from_file(), upload_from_filename(), or upload_from_string(). The source_file takes its data from form.photo.data from a Flask-WTF form. I'm not successfully saving to Cloud Storage yet. This is my code:
from google.cloud import storage

def upload_blob(bucket_name, source_file, destination_blob_name):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file)
    return destination_blob_name
@app.route('/image/add', methods=['GET', 'POST'])
def addimage():
    form = Form()
    if form.validate_on_submit():
        name = 'randomname'
        try:
            filename = 'foldername/' + name + '.jpg'
            picture_file = upload_blob('mybucketname', form.photo.data, filename)
            return redirect(url_for('addimage'))
        except Exception:
            flash("unsuccess")
            return redirect(url_for('addimage'))
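Of the three methods, upload_from_file() is the one that matches this case: form.photo.data from Flask-WTF is a werkzeug FileStorage, i.e. a file-like object, whereas upload_from_filename() expects a path to a file on disk (which is why the call above fails) and upload_from_string() expects a string or bytes. A minimal sketch of the helper adjusted accordingly (bucket and argument names hypothetical):

from google.cloud import storage

def upload_blob(bucket_name, file_obj, destination_blob_name):
    """Upload a file-like object (e.g. a werkzeug FileStorage) to the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    # upload_from_file reads from the current stream position; rewind first
    file_obj.seek(0)
    blob.upload_from_file(file_obj, content_type=file_obj.content_type)
    return destination_blob_name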
I was finally able to save the file to Google Cloud Storage by changing the save_picture function, in case anyone has trouble with this in the future:
import os
from PIL import Image
from google.cloud import storage
from werkzeug.utils import secure_filename

app.config['BUCKET'] = 'yourbucket'
app.config['UPLOAD_FOLDER'] = '/tmp'

def save_picture(form_picture, name):
    picture_fn = secure_filename(name + '.jpg')
    # /tmp is the only writable location on App Engine standard
    picture_path = os.path.join(app.config['UPLOAD_FOLDER'], picture_fn)
    output_size = (1000, 1000)
    i = Image.open(form_picture)
    i.thumbnail(output_size)
    i.save(picture_path)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(app.config['BUCKET'])
    blob = bucket.blob('static/image/' + picture_fn)
    blob.upload_from_filename(picture_path)
    return picture_path
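Note that the returned picture_path points at instance-local /tmp, which other instances can't read, so for serving the image later it is more useful to store a Cloud Storage URL. A possible variant (a sketch, assuming publicly readable objects are acceptable; the function name is hypothetical):

def save_picture_url(form_picture, name):
    # Reuse save_picture() above, then publish and return a servable URL
    picture_path = save_picture(form_picture, name)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(app.config['BUCKET'])
    blob = bucket.blob('static/image/' + os.path.basename(picture_path))
    blob.make_public()
    return blob.public_url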
The problem with storing it in some folder is that it would live on that one instance and other instances would not be able to access it. Furthermore, GAE instances come and go, so you would eventually lose the image.
You should use Google Cloud Storage for this:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# blob() builds a reference locally; get_blob() would return None
# if the object does not exist yet, which breaks a first upload
blob = bucket.blob('remote/path/to/file.txt')
blob.upload_from_string('New contents!')
https://googleapis.dev/python/storage/latest/index.html
With Flask on App Engine (Python 3.7), I save files to a bucket in the following way, because I want to loop over many files:
import uuid
from flask import request

for key, upload in request.files.items():
    file_storage = upload
    content_type = None
    identity = str(uuid.uuid4())  # or uuid.uuid4().hex
    try:
        upload_blob("f00b4r42.appspot.com", request.files[key], identity,
                    content_type=upload.content_type)
    except Exception:
        pass  # the original snippet ends mid-try; log or re-raise as needed
The helper function:
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name,
                content_type="application/octet-stream"):
    """Uploads a file to the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_file(source_file_name, content_type=content_type)
    blob.make_public()
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))
Changing from the Google App Engine standard environment to the flexible environment will allow you to write to disk, as well as to choose a Compute Engine machine type with more memory for your specific application [1]. If you are interested in following this path, you can find all the relevant documentation on migrating a Python app here.
Nonetheless, as user Alex explained in his answer, instances are created (the number of instances is scaled up) or deleted (scaled down) according to your load, so the better option in your particular case would be to use Cloud Storage. You can find an example of uploading objects to Cloud Storage with Python here.
I have mp3 files stored in Google Cloud Storage from a Google App Engine app, and I want to get their durations.
I wrote this code with help from one guy here, but unfortunately the class AudioSystem doesn't work on Google App Engine.
Does someone know a way to do it?
GcsService gcsService = GcsServiceFactory.createGcsService(RetryParams.getDefaultInstance());
ListResult lr = gcsService.list(myBucketName, ListOptions.DEFAULT);
while (lr.hasNext() && playlistLength > 0) {
    ListItem li = lr.next();
    String fileName = li.getName();
    GcsInputChannel readChannel = gcsService.openPrefetchingReadChannel(
            new GcsFilename(myBucketName, fileName), 0, 1024 * 1024);
    AudioInputStream audioInputStream;
    try (InputStream in = Channels.newInputStream(readChannel)) {
        audioInputStream = AudioSystem.getAudioInputStream(in);
    }
    AudioFormat format = audioInputStream.getFormat();
    long frames = audioInputStream.getFrameLength();
    double durationInSeconds = (frames + 0.0) / format.getFrameRate();
    playlistLength -= (int) durationInSeconds / 60;
}
Here is the error returned:
Error for /hello java.lang.NoClassDefFoundError: javax.sound.sampled.AudioSystem is a restricted class.
Please see the Google App Engine developer's guide for more details.
at com.google.apphosting.runtime.security.shared.stub.javax.sound.sampled.AudioSystem.<clinit>(AudioSystem.java)
Looking into the issue and the doc here, there's a good chance your issue is not solvable with this library, as it is currently restricted: it makes some kind of system call that the platform won't allow.
You have multiple solutions available. I would suggest storing an entity in the datastore containing the file's metadata when you upload it, and retrieving that instead.
You can also parse the file yourself and read the duration from it (mp3 being a pretty simple format, as Igor Artamonov points out).
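One way to implement the metadata suggestion is to compute the duration once, at upload time, and store it next to the file, so the restricted javax.sound API is never needed when building playlists. A sketch of the idea (shown in Python with the mutagen library purely to illustrate; the helper name is hypothetical):

from mutagen.mp3 import MP3

def mp3_duration_seconds(path):
    # mutagen parses the mp3 frames itself, with no audio system calls
    return MP3(path).info.length

Storing that value in a datastore entity alongside the object name at upload time means the playlist code only ever reads metadata.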
I am new to SEO and just want to get an idea of how it works for a single-page application with dynamic content.
In my case, I have a single-page application (powered by AngularJS, using the router to show different states) that provides location-based search functionality, similar to Zillow, Redfin, or Yelp. On my site, a user can type in a location name, and the site will return some results based on the location.
I am trying to figure out a way to make it work well with Google. For example, if I type "Apartment San Francisco" into Google, the results are deep links into sites like those (a screenshot of the search results was shown here), and when a user clicks one of these links, the site displays the correct result. I am thinking about having similar SEO for my site.
The question is, the page content depends purely on the user's query. A user can search by city name, state name, zip code, etc. to get different results, and it's not possible to put them all into a sitemap. How can Google crawl the content for these kinds of dynamic page results?
I don't have experience with SEO and am not sure how to do it for my site. Please share some experience or pointers to help me get started. Thanks a lot!
===========
Follow-up question:
I saw that Googlebot can now run JavaScript. I want to understand this a bit more. When a specific URL of my SPA is opened, it makes some network queries (XHR requests) for a few seconds and then the page content is displayed. In this case, will Googlebot wait for the HTTP responses?
I saw some tutorials saying we need to prepare static HTML specifically for search engines. If I only want to deal with Google, does that mean I no longer have to serve static HTML, because Google can run JavaScript?
Thanks again.
If a search engine comes across your JavaScript application, we are permitted to redirect it to another URL that serves the fully rendered version of the page.
For this job you can either use SEOSERVER, a tool by Thomas Davis available on GitHub, or you can use the code below, which does the same job (it is also available here).
Implementation using Phantom.js
We can set up a Node.js server that, given a URL, fully renders the page content; we then redirect bots to this server to retrieve the correct content.
We will need to install Node.js and PhantomJS onto a box, then start up the server below. There are two files: one is the web server, and the other is a phantomjs script that renders the page.
// web.js
// Express is our web server that can handle requests
var express = require('express');
var app = express();

var getContent = function(url, callback) {
    var content = '';
    // Here we spawn a phantom.js process; the first element of the
    // array is our phantomjs script and the second element is our url
    var phantom = require('child_process').spawn('phantomjs', ['phantom-server.js', url]);
    phantom.stdout.setEncoding('utf8');
    // Our phantom.js script simply logs the output, and
    // we access it here through stdout
    phantom.stdout.on('data', function(data) {
        content += data.toString();
    });
    phantom.on('exit', function(code) {
        if (code !== 0) {
            console.log('We have an error');
        } else {
            // once our phantom.js script exits, call our callback,
            // which outputs the contents to the page
            callback(content);
        }
    });
};

var respond = function(req, res) {
    // Because we use [P] in .htaccess we have access to this header
    var url = 'http://' + req.headers['x-forwarded-host'] + req.params[0];
    getContent(url, function(content) {
        res.send(content);
    });
};

app.get(/(.*)/, respond);
app.listen(3000);
The script below is phantom-server.js and is in charge of fully rendering the content. We don't return the content until the page is fully rendered; we hook into the resource listeners to do this.
var page = require('webpage').create();
var system = require('system');

var lastReceived = new Date().getTime();
var requestCount = 0;
var responseCount = 0;
var requestIds = [];
var startTime = new Date().getTime();

page.onResourceReceived = function (response) {
    if (requestIds.indexOf(response.id) !== -1) {
        lastReceived = new Date().getTime();
        responseCount++;
        requestIds[requestIds.indexOf(response.id)] = null;
    }
};

page.onResourceRequested = function (request) {
    if (requestIds.indexOf(request.id) === -1) {
        requestIds.push(request.id);
        requestCount++;
    }
};

// Open the page
page.open(system.args[1], function () {});

var checkComplete = function () {
    // We don't allow it to take longer than 5 seconds, but we
    // don't return until all requests are finished
    if ((new Date().getTime() - lastReceived > 300 && requestCount === responseCount) ||
            new Date().getTime() - startTime > 5000) {
        clearInterval(checkCompleteInterval);
        console.log(page.content);
        phantom.exit();
    }
};

// Check periodically to see if the page has finished rendering
var checkCompleteInterval = setInterval(checkComplete, 1);
Once we have this server up and running, we just redirect bots to it in our client's web server configuration.
Redirecting bots
If you are using Apache, we can edit our .htaccess such that Google requests are proxied to our middleman phantom.js server.
RewriteEngine on
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
RewriteRule (.*) http://webserver:3000/%1? [P]
We could also include other RewriteCond directives, for example on the user agent, to redirect other search engines we wish to be indexed on.
Though Google won't use _escaped_fragment_ unless we tell it to, either by including a meta tag, <meta name="fragment" content="!">, or by using #! URLs in our links.
You will most likely have to use both.
This has been tested with the Google Webmaster Tools fetch tool. Make sure you include #! in your URLs when using the fetch tool.
So I have this web app using AngularJS and Node.js. I don't want to just use localhost to demo my project, because it doesn't look cool at all when I type "node server.js" and then go to localhost...
Since I intend to use Firebase for the data, I noticed that Firebase provides hosting. I tried it, but it seems to host only the index.html, not anything through/using server.js. I have customized files for the server to use and update. So how can I tell Firebase Hosting to use my server and related files when hosting?
Is it possible to tell Firebase, "hey, run node server.js to host my index.html"?
I'm guessing by the way you are wording the question that you want to see this site from "the internet".
Two routes you could go here.
a) Serve your index through Firebase Hosting. Firebase only hosts assets, so if your Angular app is being served through Node then you will need to change your architecture to be more SPA-ish.
SPA-ish would be an index bootstrap that interacts with the backend purely through APIs.
You would host the API server on something more appropriate, like Nodejitsu.
b) Serve the whole thing through something like Nodejitsu (a hosting platform), or your very own VM managed by a different kind of hosting company like BuyVM.net.
Another idea: if your Node.js app is independent of the AngularJS app (though they use shared data and perform operations on that data model), you could separate the two and connect them only via Firebase.
Firebase hosting -> index.html and necessary angularjs files.
Locally (your PC) -> server.js, which just connects to Firebase and triggers on changed data.
I have done this for a few projects and it's a handy way to access the outside world (internet) while maintaining some semblance of security by not opening ports blindly.
I was able to do this to control a Chromecast at my house while at a friend's house.
Here's an example from my most recent project (I'm trying to make a DVR).
https://github.com/onaclov2000/webdvr/blob/master/app.js
var FB_URL = '';
var Firebase = require('firebase');
var os = require('os');

var myRootRef = new Firebase(FB_URL);
var interfaces = os.networkInterfaces();
var addresses = [];
for (var k in interfaces) {
    for (var k2 in interfaces[k]) {
        var address = interfaces[k][k2];
        if (address.family == 'IPv4' && !address.internal) {
            addresses.push(address.address);
        }
    }
}

// Push my IP to firebase
// Perhaps a common "devices" location would be handy
var ipRef = myRootRef.push({
    "type": "local",
    "ip": addresses[0]
});

myRootRef.on('child_changed', function(childSnapshot, prevChildName) {
    // code to handle child data changes
    var data = childSnapshot.val();
    var localref = childSnapshot.ref();
    if (data["commanded"] == "new") {
        console.log("New Schedule Added");
        var schedule = require('node-schedule');
        var date = new Date(data["year"], data["month"], data["day"], data["hh"], data["mm"], 0);
        console.log(date);
        var j = schedule.scheduleJob(date, function(channel, program, length) {
            console.log("Recording Channel " + channel + " and program " + program + " for " + length + "ms");
        }.bind(null, data["channel"], data["program"], data["length"]));
        localref.update({"commanded": "waiting"});
    }
});
When I change my "commanded" data at the FB_URL, to "new" (which can be accomplished by angularjs VERY Simply, using an ng-click operation for example) it'll schedule a recording for a particular date and time (not all actually functional at the moment).
I might be late, but since 3 years have passed, there is now a solution available from Firebase in the form of Cloud Functions.
It's not straightforward, but it looks promising if one can refactor their code a bit.
I am currently working on a web app using webapp2 that deals with restaurants in several cities. Some of the URLs would look like:
1. www.example.com/newyork
2. www.example.com/newyork/fastfood
3. www.example.com/newyork/fastfood/tacobell
To handle the first URL, I used the following:
CITY_RE = r'(/(?:[a-zA-Z0-9]+/?)*)'
app = webapp2.WSGIApplication([(CITY_RE, CityHandler)], debug = True)
How would I handle URLs with multiple parameters, such as 2 and 3?
I take a similar approach to match URLs like /<country>/<region>/<city>/<category>, e.g. /usa/california/losangeles/restaurants, where I use this regex:
app = webapp2.WSGIApplication([('/([^/]+)/?([^/]*)/?([^/]*)', RegionSearch)], config=settings.w2config, debug=True)
Then declare the relevant parameters in the handler class.
class RegionSearch(SearchBaseHandler):
    """Handles regional search requests."""

    def get(
        self,
        region=None,
        city=None,
        category=None,
        subcategory='For sale',
        PAGESIZE=50,  # items on page
        limit=60,     # number of days
        year=2012,
        month=1,
        day=1,
        next_page=None,
    ):
I think that you could even do it this way:
webapp2.Route('/passwdresetcomplete/<city>/<category>/<name>', handler=RegionSearch, name='regionsearch')
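Applied to the original city/category/restaurant URLs, a minimal sketch along those lines (reusing the CityHandler name from the question; one Route per depth keeps each parameter explicit instead of a single catch-all regex):

import webapp2

class CityHandler(webapp2.RequestHandler):
    def get(self, city, category=None, name=None):
        # webapp2 passes named route variables as keyword arguments
        self.response.write('%s / %s / %s' % (city, category, name))

app = webapp2.WSGIApplication([
    webapp2.Route('/<city>', handler=CityHandler),
    webapp2.Route('/<city>/<category>', handler=CityHandler),
    webapp2.Route('/<city>/<category>/<name>', handler=CityHandler),
], debug=True)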