Script is only working correctly on the first run - google-app-engine

The script has to fetch posts from an API and save them into the database.
After a successful run, it won't fetch new posts for another 5-24+ hours, even if new ones exist.
It returns the same old response every time in a fraction of a second, as if it were served from a cache (if I remove the old posts from the database, it still adds them back).
What is interesting is that if I deploy the same script again, it runs fine the first time, and then again I have to wait another 5-24+ hours.
When it runs successfully it takes about 3-10 seconds, otherwise it finishes in under a second.
I'm really confused by this. Is there something like response caching, or could this be a problem on the Reddit API side? Would adding any of these options help?
CURLOPT_RETURNTRANSFER => true,
CURLOPT_CONNECTTIMEOUT => 100,
CURLOPT_TIMEOUT => 100
I'm currently using the requests library for the request:
r = requests.get(url, headers = {'User-agent': 'My App 12345'})
response = r.json()
Here is the GAE part of my script
class MainHandler(webapp2.RequestHandler):
    def get(self):
        # --------------- Database Connection ---------------
        global db
        global cursor
        if os.getenv('SERVER_SOFTWARE', '').startswith('Google App Engine/'):
            db = MySQLdb.connect(xxx)
        else:
            db = MySQLdb.connect(xxx)
        cursor = db.cursor()
        # ---------------------------------------------------
        fetchFromReddit("")  # Start fetching script
        self.response.write("Finished !")
        db.close()
        cursor.close()

app = webapp2.WSGIApplication([
    ('/url', MainHandler)
], debug=True)

The App Engine URL Fetch service does appear to cache responses. As mentioned in this Google App Engine group thread, to bypass/disable the cache you need to add this to your request headers:
headers={'Cache-Control': 'no-cache,max-age=0', 'Pragma': 'no-cache'}
where "max-age" is the oldest you want data returned from the cache to be.
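Applied to the requests call from the question, a minimal sketch of the combined headers might look like this (the URL is a placeholder for whatever endpoint the script actually fetches):

import requests

url = 'https://www.reddit.com/r/python/new.json'  # placeholder endpoint

r = requests.get(url, headers={
    'User-agent': 'My App 12345',            # same identifier as in the question
    'Cache-Control': 'no-cache,max-age=0',   # ask caches not to serve a stored copy
    'Pragma': 'no-cache',                    # HTTP/1.0 equivalent for older caches
})
response = r.json()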

Related

aws - should I integrate s3 upload and store s3 url in dynamodb in one single request?

I have a table called "Banner".
I have a banner upload function in my UI.
AWS API Gateway is used, with 2 resources created: /s3 and /banner.
I am using 2 separate requests to do this.
1. POST request, resource: /s3
This request runs the Lambda function below to upload the banner image to S3.
UploadBannerToS3.js
...
const s3 = new AWS.S3();
...
const data = await s3.upload(params).promise();
...
This returns an S3 URL pointing to the stored banner image.
2. POST request, resource: /banner
This request takes the S3 URL above as a parameter and stores the banner information, including the URL, in DynamoDB.
The Lambda function looks like this:
CreateBanner.js
...
const { url } = JSON.parse(event.body);
const params = {
    TableName: "Banner",
    Item: {
        id: id,
        url: url,
        createdAt: date,
    }
};
...
const data = await documentClient.put(params).promise();
...
My frontend code (I am using React) looks like this:
handleUploadBanner = async (banner) => {
    const image = await toBase64(banner);
    const payload = { "banner": image };
    try {
        // request 1
        const uploadResponse_S3 = await APIHandler.uploadBannerToS3(payload);
        const s3Url = uploadResponse_S3.data.Location;
        // request 2
        const response = await APIHandler.createBanners({
            url: s3Url,
        });
        console.log(response);
    } catch (error) {
        console.log(error);
    }
}
If request 1 succeeds but request 2 fails to return a successful status, would that be a mess for development?
Should I combine these 2 requests into one single Lambda function?
What is the best practice here?
If the end user (front end) wants a "synchronized" response from the API, then we need to design the 2 APIs as synchronous ones. But that doesn't mean we need to merge them.
If the end user only cares about the first API's response and not the second one, we can design the second API as asynchronous and use a pipeline like:
a. Lambda 1 -> performs its logic -> publishes an SNS message and returns to the end user
b. SNS -> SQS -> Lambda 2
The more we design the system around "single responsibility", the better it is for development and maintenance.
Thanks,
If request 1 succeeds but request 2 fails to return a successful status, would that be a mess for development?
Not necessarily. You could add a retry function in the front end for simplicity. But it depends, because "mess" is a very abstract concept. What is the requirement? Is it of vital importance that the requests never fail? What do you want to do if they fail?
Should I combine these 2 requests into one single Lambda function?
Either way, it is better to keep them small and short; that is how you work with AWS Lambdas.
But if you want more control over the outcome, there is a better fail-over approach.
SQS is one way of doing it, but it is complex for this case. I would configure a trigger from S3 to Lambda; that way you only write to the database when the image has been uploaded successfully.
So in summary (see the sketch below):
Call Lambda 1 -> upload to S3. Successful?
S3 triggers Lambda 2
Lambda 2 saves to DB
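A rough sketch of that flow in Python/boto3 (the question's Lambdas are Node.js; the URL construction and item attributes here are assumptions): Lambda 2 is triggered by the S3 ObjectCreated event and writes the record itself, so the client no longer calls it through API Gateway.

import uuid
from datetime import datetime, timezone

import boto3

table = boto3.resource('dynamodb').Table('Banner')  # table name taken from the question

def handler(event, context):
    # Runs only after the upload has actually succeeded, because S3 fires the event.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        url = 'https://{}.s3.amazonaws.com/{}'.format(bucket, key)  # assumes the standard S3 URL layout
        table.put_item(Item={
            'id': str(uuid.uuid4()),
            'url': url,
            'createdAt': datetime.now(timezone.utc).isoformat(),
        })
    return {'statusCode': 200}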
I would prefer to handle both the S3 upload and the DB write in one Lambda (see the sketch below). It is simpler, and it makes sense to abstract the failure response in one place.
I mean, the app client reads the banner item from DynamoDB, not from S3. So whether the process succeeds or fails, we don't need to worry about the app getting a wrong link. Some scenarios:
upload succeeds, DB write succeeds: the app client gets the correct link
upload succeeds, DB write fails: the app client never gets the link (no item)
upload fails, DB write fails: same as point #2
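A rough sketch of that combined handler, again in Python/boto3 rather than the question's Node.js (the bucket name, key scheme and payload shape are assumptions):

import base64
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('Banner')
BUCKET = 'my-banner-bucket'  # placeholder bucket name

def handler(event, context):
    body = json.loads(event['body'])
    image_bytes = base64.b64decode(body['banner'])  # the frontend sends base64, as in the question
    key = 'banners/{}.png'.format(uuid.uuid4())
    # Step 1: upload. If this raises, nothing has been written to DynamoDB.
    s3.put_object(Bucket=BUCKET, Key=key, Body=image_bytes, ContentType='image/png')
    url = 'https://{}.s3.amazonaws.com/{}'.format(BUCKET, key)
    # Step 2: record. If this raises, you are left with an orphaned object in S3,
    # which the client never sees and which can be cleaned up later.
    table.put_item(Item={
        'id': str(uuid.uuid4()),
        'url': url,
        'createdAt': datetime.now(timezone.utc).isoformat(),
    })
    return {'statusCode': 200, 'body': json.dumps({'url': url})}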

GAE gunicorn flask best way to act with requests that takes 30-60 minutes

I built an application with Flask that crawls some data. The first step is to use the YouTube Data API to get some data about a user, including a list of all videos the user ever uploaded. That works fine! After I get the list of video IDs, I scrape each of those videos on YouTube to extract the likes and views and sum them into 2 big numbers. I tested it locally, without gunicorn and outside App Engine, and it works fine. But when a user has uploaded 6700 videos it may take 30 minutes to complete the request (locally it works).
When I try to run the same code in GAE it returns 502 Bad Gateway after several minutes, but in the logs I can see it is still crawling.
This is the GET 502; the worker continued scraping for several minutes afterwards.
Here is the code I wrote to crawl.
This is my app.yaml. With -t 36000, workers can be silent for 36000 seconds before they are killed and restarted.
runtime: python37
service: crawler
entrypoint: . ./env.inc.sh && gunicorn -t 36000 -b :$PORT app:app
This is the route in my app.py which is called:
@app.route('/youtube/username/<user>')
def youtubeStatistics(user):
    response = crawler.crawl_youtube_user(os.environ['YOUTUBE_API_KEY'], user)
    if response:
        return jsonify(response), 200
    else:
        return jsonify({"prettyMessage": "Quota Limit maybe Exceeded"}), 403
These are my crawler functions I use:
def scrape_url(url):
    r = requests.get(url)
    page = r.text
    soup = bs(page, 'html.parser')
    return soup

def crawl_youtube_user(KEY, username):
    youtube = set_up(KEY)
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        forUsername=username
    )
    uploadPlaylistId = ""
    data = {}
    try:
        response = request.execute()
    except:
        return {}
    if (response["pageInfo"]["totalResults"] > 0):
        stats = response["items"][0]["statistics"]
        data["subscriberCount"] = stats["subscriberCount"]
        data["videoCount"] = stats["videoCount"]
        data["publishedAt"] = response["items"][0]["snippet"]["publishedAt"]
        uploadPlaylistId = response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
        request = youtube.playlistItems().list(
            part="snippet,contentDetails",
            maxResults=50,
            playlistId=uploadPlaylistId
        )
        videoIds = []
        while True:
            try:
                response = request.execute()
            except:
                return {}
            for vid in response["items"]:
                videoIds.append(vid["snippet"]["resourceId"]["videoId"])
            if "nextPageToken" not in response:
                break
            else:
                request = youtube.playlistItems().list(
                    part="snippet,contentDetails",
                    maxResults=50,
                    playlistId=uploadPlaylistId,
                    pageToken=response["nextPageToken"]
                )
        data.update(crawl_youtube_videos(videoIds))
    return data

def crawl_youtube_videos(ids):
    data = {'viewCount': 0, 'videoLikes': 0}
    counter = 0
    idlength = len(ids)
    for id in ids:
        counter += 1
        print('{}/{}: Scraping Youtube videoId {}'.format(counter, idlength, id))
        soup = scrape_url('https://www.youtube.com/watch?v={}&gl=DE&hl=de'.format(id))
        try:
            data['viewCount'] += int(soup.find('div', class_='watch-view-count').getText().split(' ')[0].replace('.', '').replace(',', ''))
        except:
            print("Error while trying to extract the views of a Video: {}.".format(sys.exc_info()[0]))
        try:
            data['videoLikes'] += int(soup.find("button", {"title": "Mag ich"}).find("span").getText().replace('.', '').replace(',', ''))
        except:
            print("Error while trying to extract the likes of a Video: {}.".format(sys.exc_info()[0]))
    return data
I don't want to use more threads or anything like that to make the whole process faster; I'm worried about my IP getting blocked if I scrape too many pages in a short time. I just want to keep the request alive until I get the response I want.
So are there more mechanisms that protect the GAE app from long response times? And what would be the best way to handle requests that take 30-60 minutes?
You should consider using a background task queue like Celery or RQ.
When in place, your request would queue a job. You can then query the task queue and get the job status as you wish.
Here is a great resource for getting started with either of these options.
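A rough sketch of that pattern with RQ, reusing the names from the question. It assumes a reachable Redis instance and a separate rq worker process; on App Engine standard that would typically mean running the worker in another service, or swapping in Cloud Tasks.

import os

from flask import Flask, jsonify
from redis import Redis
from rq import Queue
from rq.job import Job

import crawler  # the module from the question

app = Flask(__name__)
redis_conn = Redis()                  # assumption: Redis reachable with default settings
queue = Queue(connection=redis_conn)

@app.route('/youtube/username/<user>')
def youtube_statistics(user):
    # Enqueue the long crawl and return immediately with a job id.
    job = queue.enqueue(crawler.crawl_youtube_user,
                        os.environ['YOUTUBE_API_KEY'], user,
                        job_timeout=2 * 60 * 60)  # allow up to 2 hours
    return jsonify({"jobId": job.get_id()}), 202

@app.route('/youtube/job/<job_id>')
def job_status(job_id):
    # The client polls this endpoint until the crawl has finished.
    job = Job.fetch(job_id, connection=redis_conn)
    if job.is_finished:
        return jsonify(job.result), 200
    return jsonify({"status": job.get_status()}), 202

The client then keeps calling the status endpoint until it gets a 200 with the crawled totals, so no single HTTP request ever has to stay open for 30-60 minutes.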

Creating a cluster before sending a job to dataproc programmatically

I'm trying to schedule a PySpark job. I followed the GCP documentation and ended up deploying a little Python script to App Engine which does the following:
authenticate using a service account
submit a job to a cluster
The problem is, I need the cluster to be up and running, otherwise the job won't be sent (duh!), but I don't want the cluster to always be up and running, especially since my job only needs to run once a month.
I wanted to add the creation of a cluster to my Python script, but the call is asynchronous (it makes an HTTP request), so my job is submitted after the cluster creation call but before the cluster is really up and running.
How can I do this?
I'd like something cleaner than just waiting for a few minutes in my script!
Thanks
EDIT: Here's what my code looks like so far.
To launch the job:
class EnqueueTaskHandler(webapp2.RequestHandler):
    def get(self):
        task = taskqueue.add(
            url='/run',
            target='worker')
        self.response.write(
            'Task {} enqueued, ETA {}.'.format(task.name, task.eta))

app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job
class CronEventHandler(webapp2.RequestHandler):
    def create_cluster(self, dataproc, project, zone, region, cluster_name):
        zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
        cluster_data = {...}
        dataproc.projects().regions().clusters().create(
            projectId=project,
            region=region,
            body=cluster_data).execute()

    def wait_for_cluster(self, dataproc, project, region, clustername):
        print('Waiting for cluster to run...')
        while True:
            result = dataproc.projects().regions().clusters().get(
                projectId=project,
                region=region,
                clusterName=clustername).execute()
            # Handle exceptions
            if result['status']['state'] != 'RUNNING':
                time.sleep(60)
            else:
                return result

    def wait_for_job(self, dataproc, project, region, job_id):
        print('Waiting for job to finish...')
        while True:
            result = dataproc.projects().regions().jobs().get(
                projectId=project,
                region=region,
                jobId=job_id).execute()
            # Handle exceptions
            print(result['status']['state'])
            if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
                return result
            else:
                time.sleep(60)

    def submit_job(self, dataproc, project, region, clusterName):
        job = {...}
        result = dataproc.projects().regions().jobs().submit(projectId=project, region=region, body=job).execute()
        return result['reference']['jobId']

    def post(self):
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        project = '...'
        region = "..."
        zone = "..."
        clusterName = '...'
        self.create_cluster(dataproc, project, zone, region, clusterName)
        self.wait_for_cluster(dataproc, project, region, clusterName)
        job_id = self.submit_job(dataproc, project, region, clusterName)
        self.wait_for_job(dataproc, project, region, job_id)
        dataproc.projects().regions().clusters().delete(projectId=project, region=region, clusterName=clusterName).execute()
        self.response.write("JOB SENT")

app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works up to the deletion of the cluster. At that point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any ideas?
In addition to general polling, either through list or get requests on the Cluster or on the Operation returned by the CreateCluster request, for single-use clusters like this you can also consider using the Dataproc Workflows API, and possibly its InstantiateInline interface if you don't want to use full-fledged workflow templates. With this API you use a single request to specify the cluster settings along with the jobs to submit; the jobs automatically run as soon as the cluster is ready, and the cluster is deleted automatically afterwards.
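A rough sketch of the inline-workflow idea using the same discovery client as the question; treat the field names and placeholders as assumptions to verify against the WorkflowTemplates documentation:

import googleapiclient.discovery

project = '...'
region = '...'

dataproc = googleapiclient.discovery.build('dataproc', 'v1')

template = {
    'id': 'monthly-pyspark-workflow',                # placeholder template id
    'placement': {
        'managedCluster': {
            'clusterName': 'monthly-job-cluster',    # placeholder cluster name
            'config': {},                            # same shape as cluster_data['config'] in the question
        }
    },
    'jobs': [{
        'stepId': 'monthly-pyspark-job',
        'pysparkJob': {'mainPythonFileUri': 'gs://my-bucket/my_job.py'},  # placeholder job file
    }],
}

# One request: Dataproc creates the cluster, runs the job, then deletes the cluster.
operation = dataproc.projects().regions().workflowTemplates().instantiateInline(
    parent='projects/{}/regions/{}'.format(project, region),
    body=template).execute()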
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN The cluster state is unknown.
CREATING The cluster is being created and set up. It is not ready for use.
RUNNING The cluster is currently running and healthy. It is ready for use.
ERROR The cluster encountered an error. It is not ready for use.
DELETING The cluster is being deleted. It cannot be used.
UPDATING The cluster is being updated. It continues to accept and process jobs.
To prevent plain waiting between the (repeated) list invocations (in general not a good thing to do on GAE) you can enqueue delayed tasks in a push task queue (with the relevant context information) allowing you to perform such list operations at a later time. For example, in python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
If at the task execution time the result indicates the operation of interest is still in progress simply enqueue another such delayed task - effectively polling but without an actual wait/sleep.
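A rough sketch of that idea applied to the cluster wait from the question: the handler checks the state once and, if the cluster is not ready, re-enqueues itself with a countdown instead of sleeping (the handler URLs and parameter names are illustrative):

import logging

import googleapiclient.discovery
import webapp2
from google.appengine.api import taskqueue

class PollClusterHandler(webapp2.RequestHandler):
    def post(self):
        project = self.request.get('project')
        region = self.request.get('region')
        cluster_name = self.request.get('cluster_name')
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        result = dataproc.projects().regions().clusters().get(
            projectId=project, region=region,
            clusterName=cluster_name).execute()
        state = result['status']['state']
        params = {'project': project, 'region': region, 'cluster_name': cluster_name}
        if state == 'RUNNING':
            # Cluster is ready: hand off to a separate job-submission task.
            taskqueue.add(url='/submit-job', target='worker', params=params)
        elif state in ('CREATING', 'UNKNOWN'):
            # Not ready yet: check again in 60 seconds, without blocking this request.
            taskqueue.add(url='/poll-cluster', target='worker', countdown=60, params=params)
        else:
            logging.error('Cluster entered unexpected state: %s', state)

app = webapp2.WSGIApplication([('/poll-cluster', PollClusterHandler)], debug=True)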

GAE taskqueue access application storage

My GAE application is written in Python with webapp2. It analyzes a user's online social network. Users can log in and authorize my application, and the access token is stored for further crawling of the data. I then use the task queue to launch a backend task, since the crawling process is time consuming. However, when I access the datastore from the task to fetch the access token, I cannot get it. I wonder whether there is a way to access the data of the frontend, rather than the temporary storage of the task queue.
The handler that processes the HTTP request from the user:
class Callback(webapp2.RequestHandler):
    def get(self):
        global client
        global r
        code = self.request.get('code')
        try:
            client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET, redirect_uri=CALLBACK_URL)
            r = client.request_access_token(code)
            access_token = r.access_token
            record = model.getAccessTokenByUid(r.uid)
            if record is None or r.access_token != record.accessToken:
                # logging.debug("access token stored")
                model.insertAccessToken(long(r.uid), access_token, r.expires_in, "uncrawled", datetime.datetime.now())  # data stored here
            session = self.request.environ['beaker.session']
            session['uid'] = long(r.uid)
            self.redirect(CLUSTER_PAGE % ("true"))
        except Exception, e:
            logging.error("callback:%s" % (str(e)))
            self.redirect(CLUSTER_PAGE % ("false"))
The handler that processes the task submitted to the task queue:
class CrawlWorker(webapp2.RequestHandler):
    def post(self):  # should run at most 1/s
        uid = self.request.get('uid')
        logging.debug("start crawling uid:%s in the backend" % (str(uid)))
        global client
        global client1
        global r
        tokenTuple = model.getAccessTokenByUid(uid)
        if tokenTuple is None:  # here I always get a None
            logging.error("CounterWorker:oops, authorization token is missed.")
            return
The question is not clear (is it "can" or "can't"?), but if you want to access frontend data from the task queue, pass it as parameters to the task.
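A minimal sketch of that suggestion applied to the code above: pass the values the worker needs when the task is enqueued instead of re-reading them from the datastore.

from google.appengine.api import taskqueue

# In Callback.get(), right after the token has been obtained/stored:
taskqueue.add(url='/crawl',
              params={'uid': str(r.uid), 'access_token': access_token})

# In CrawlWorker.post(), read the values back from the request:
#   uid = self.request.get('uid')
#   access_token = self.request.get('access_token')

Also note that self.request.get() returns strings; if getAccessTokenByUid expects the long that was stored (long(r.uid) above), cast the uid back before querying.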

What response times can be expected from GAE/NDB?

We are currently building a small and simple central HTTP service that maps "external identities" (like a facebook id) to an "internal (uu)id", unique across all our services to help with analytics.
The first prototype in "our stack" (flask+postgresql) was done within a day. But since we want the service to (almost) never fail and scale automagically, we decided to use Google App Engine.
After a week of reading&trying&benchmarking this question emerges:
What response times are considered "normal" on App Engine (with NDB)?
We are getting response times that are consistently above 500ms on average and well above 1s at the 90th percentile.
I've attached a stripped down version of our code below, hoping somebody can point out the obvious flaw. We really like the autoscaling and the distributed storage, but we can not imagine 500ms really is the expected performance in our case. The sql based prototype responded much faster (consistently), hosted on one single Heroku dyno using the free, cache-less postgresql (even with an ORM).
We tried both synchronous and asynchronous variants of the code below and looked at the appstats profile. It's always RPC calls (both memcache and datastore) that take very long (50ms-100ms), made worse by the fact that there are always multiple calls (eg. mc.get() + ds.get() + ds.set() on a write). We also tried deferring as much as possible to the task queue, without noticeable gains.
import json
import uuid
from google.appengine.ext import ndb
import webapp2
from webapp2_extras.routes import RedirectRoute

def _parse_request(request):
    if request.content_type == 'application/json':
        try:
            body_json = json.loads(request.body)
            provider_name = body_json.get('provider_name', None)
            provider_user_id = body_json.get('provider_user_id', None)
        except ValueError:
            return webapp2.abort(400, detail='invalid json')
    else:
        provider_name = request.params.get('provider_name', None)
        provider_user_id = request.params.get('provider_user_id', None)
    return provider_name, provider_user_id

class Provider(ndb.Model):
    name = ndb.StringProperty(required=True)

class Identity(ndb.Model):
    user = ndb.KeyProperty(kind='GlobalUser')

class GlobalUser(ndb.Model):
    uuid = ndb.StringProperty(required=True)

    @property
    def identities(self):
        return Identity.query(Identity.user==self.key).fetch()

class ResolveHandler(webapp2.RequestHandler):

    @ndb.toplevel
    def post(self):
        provider_name, provider_user_id = _parse_request(self.request)
        if not provider_name or not provider_user_id:
            return self.abort(400, detail='missing provider_name and/or provider_user_id')
        identity = ndb.Key(Provider, provider_name, Identity, provider_user_id).get()
        if identity:
            user_uuid = identity.user.id()
        else:
            user_uuid = uuid.uuid4().hex
            GlobalUser(
                id=user_uuid,
                uuid=user_uuid
            ).put_async()
            Identity(
                parent=ndb.Key(Provider, provider_name),
                id=provider_user_id,
                user=ndb.Key(GlobalUser, user_uuid)
            ).put_async()
        return webapp2.Response(
            status='200 OK',
            content_type='application/json',
            body=json.dumps({
                'provider_name': provider_name,
                'provider_user_id': provider_user_id,
                'uuid': user_uuid
            })
        )

app = webapp2.WSGIApplication([
    RedirectRoute('/v1/resolve', ResolveHandler, 'resolve', strict_slash=True)
], debug=False)
For completeness sake the (almost default) app.yaml
application: GAE_APP_IDENTIFIER
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: .*
  script: main.app

libraries:
- name: webapp2
  version: 2.5.2
- name: webob
  version: 1.2.3

inbound_services:
- warmup
In my experience, RPC performance fluctuates by orders of magnitude, between 5ms-100ms for a datastore get. I suspect it's related to the GAE datacenter load. Sometimes it gets better, sometimes it gets worse.
Your operation looks very simple. I expect that with 3 requests, it should take about 20ms, but it could be up to 300ms. A sustained average of 500ms sounds very high though.
ndb does local caching when fetching objects by ID. That should kick in if you're accessing the same users, and those requests should be much faster.
I assume you're doing perf testing on the production and not dev_appserver. dev_appserver performance is not representative.
Not sure how many iterations you've tested, but you might want to try a larger number to see if 500ms is really your average.
When you're blocked on simple RPC calls, there's not much optimizing you can do.
The 1st obvious thing I see: do you really need a transaction on every request?
I believe that unless most of your requests create new entities, it's better to do .get_by_id() outside of a transaction. If the entity is not found, then start a transaction, or even better, defer creation of the entity.
def request_handler(key, data):
    entity = key.get()
    if entity:
        return 'ok'
    else:
        defer(_deferred_create, key, data)
        return 'ok'

def _deferred_create(key, data):
    @ndb.transactional
    def _tx():
        entity = key.get()
        if not entity:
            entity = CreateEntity(data)
            entity.put()
    _tx()
That should give much better response time for user facing requests.
The 2nd and only optimization I see is to use ndb.put_multi() to minimize RPC calls.
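A minimal sketch of that, reusing the model classes and variables from the question's ResolveHandler:

from google.appengine.ext import ndb

# GlobalUser, Identity, Provider, provider_name, provider_user_id and user_uuid
# are the names from the question's handler.
new_user = GlobalUser(id=user_uuid, uuid=user_uuid)
new_identity = Identity(
    parent=ndb.Key(Provider, provider_name),
    id=provider_user_id,
    user=ndb.Key(GlobalUser, user_uuid))
ndb.put_multi([new_user, new_identity])  # one batched RPC instead of two separate puts
# or, to overlap the writes with building the response:
# futures = ndb.put_multi_async([new_user, new_identity])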
P.S. Not 100% sure, but you can try disabling multithreading (threadsafe: no) to get a more stable response time.
