I would like some information about publisher settings in the GCP Pub/Sub environment. I want to enqueue messages that will be consumed by a Google Cloud Function. To achieve this, publishing should trigger when a certain number of messages is reached or after a certain amount of time.
I set up the topic as follows:
topic.PublishSettings = pubsub.PublishSettings{
    ByteThreshold:  1e6,              // Publish a batch when its size in bytes reaches this value (1e6 = 1 MB).
    CountThreshold: 100,              // Publish a batch when it has this many messages.
    DelayThreshold: 10 * time.Second, // Publish a non-empty batch after this delay has passed.
}
When I call the publish function, I get a 10-second delay on each call. The messages are not being added to the queue...
for _, v := range list {
    ctx := context.Background()
    res := a.Topic.Publish(ctx, &pubsub.Message{Data: v})
    // Block until the result is returned and a server-generated
    // ID is returned for the published message.
    serverID, err = res.Get(ctx)
    if err != nil {
        return "", err
    }
}
Can someone help me?
Cheers
Batching on the publisher side is designed to allow for more cost efficiency when sending messages to Google Cloud Pub/Sub. Given that the minimum billing unit for the service is 1KB, it can be cheaper to send multiple messages in the same Publish request. For example, sending two 0.5KB messages as separate Publish requests would result in being charged for sending 2KB of data (1KB for each). If one were to batch them into a single Publish request, it would be charged as 1KB of data.
The tradeoff with batching is latency: in order to fill up batches, the publisher has to wait to receive more messages to batch together. The three batching properties (ByteThreshold, CountThreshold, and DelayThreshold) allow one to control the level of that tradeoff. The first two properties control how much data or how many messages we put in a single batch. The last property controls how long the publisher should wait to send a batch.
As an example, imagine you have CountThreshold set to 100. If you are publishing few messages, it could take a while to receive 100 messages to send as a batch. This means that the latency for messages in that batch will be higher because they are sitting in the client, waiting to be sent. With DelayThreshold set to 10 seconds, a batch would be sent if it had 100 messages in it or if the first message in the batch was received at least 10 seconds ago. This therefore puts a limit on the amount of latency introduced in order to fit more data into an individual batch.
The code as you have it is going to result in batches with only a single message that each take 10 seconds to publish. The reason is the call to res.Get(ctx), which will block until the message has been successfully sent to the server. With CountThreshold set to 100 and DelayThreshold set to 10 seconds, the sequence that is happening inside your loop is:
1. A call to Publish puts a message in a batch to publish.
2. That batch waits to receive 99 more messages, or for 10 seconds to pass, before being sent to the server.
3. The code waits for this message to be sent to the server and to return with a serverID.
4. Given that the code doesn't call Publish again until res.Get(ctx) returns, the batch waits the full 10 seconds before being sent.
5. res.Get(ctx) returns with a serverID for the single message.
6. Go back to 1.
If you actually want to batch messages together, you can't call res.Get(ctx) before the next Publish call. You'll want to either call Publish inside a goroutine (one goroutine per message; see the sketch after the example below) or amass the res objects in a list and then call Get on them outside the loop, e.g.:
var res []*pubsub.PublishResult
ctx := context.Background()
for _, v := range list {
    res = append(res, a.Topic.Publish(ctx, &pubsub.Message{Data: v}))
}
for _, r := range res {
    serverID, err = r.Get(ctx)
    if err != nil {
        return "", err
    }
}
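For completeness, here is a minimal sketch of the goroutine variant mentioned above, assuming the same a.Topic and list as in the question (it additionally needs "sync" imported):

ctx := context.Background()
var (
    wg       sync.WaitGroup
    mu       sync.Mutex
    firstErr error
)
for _, v := range list {
    r := a.Topic.Publish(ctx, &pubsub.Message{Data: v}) // returns immediately; batching happens in the background
    wg.Add(1)
    go func(r *pubsub.PublishResult) {
        defer wg.Done()
        if _, err := r.Get(ctx); err != nil { // blocks only this goroutine, not the publishing loop
            mu.Lock()
            if firstErr == nil {
                firstErr = err
            }
            mu.Unlock()
        }
    }(r)
}
wg.Wait()
if firstErr != nil {
    return "", firstErr
}

Either way, the key point is that all Publish calls are issued before any Get blocks, so the client can actually fill up batches.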
Something to keep in mind is that batching will optimize cost on the publish side, not on the subscribe side. Cloud Functions is built with push subscriptions. This means that messages must be delivered to the subscriber one at a time (since the response code is what is used to ack or nack each message), which means there is no batching of messages delivered to the subscriber.
Related
My Google App Engine application (Python 3, standard environment) serves requests from users: if the requested record does not exist in the database, it creates it.
Here is the problem with database overwriting:
When one user (via a browser) sends a request to the database, the running GAE instance may temporarily fail to respond, so a new process is created to handle the request. The result is that two instances respond to the same request. Both instances query the database at almost the same time, each finds there is no matching record, and each creates a new one. This results in two duplicate records.
Another scenario is that, for some reason, the user's browser sends the request twice within less than 0.01 seconds; the two requests are processed by two instances on the server side, and again duplicate records are created.
I am wondering how to temporarily lock the database from one instance so as to prevent another instance from overwriting it.
I have considered the following schemes but have no idea whether they are efficient:
For Python 2, Google App Engine provides "memcache", which can be used to mark the status of a query for the purpose of database locking. But for Python 3, it seems that one has to set up a Redis server to rapidly exchange database status among different instances. So, how efficient is database locking using Redis?
The session module of Flask. The session module can be used to share data (in most cases, the login status of users) among different requests and thus different instances. I am wondering about the speed of exchanging data between different instances this way.
Appended information (1)
I followed the advice to use a transaction, but it did not work.
Below is the code I used to verify the transaction.
The reason for the failure may be that the transaction only works for the CURRENT client. For multiple requests arriving at the same time, the GAE server side will create different processes or instances to respond to them, and each process or instance has its own independent client.
@staticmethod
def get_test(test_key_id, unique_user_id, course_key_id, make_new=False):
    client = ndb.Client()
    with client.context():
        from google.cloud import datastore
        from datetime import datetime
        client2 = datastore.Client()
        print("transaction started at: ", datetime.utcnow())
        with client2.transaction():
            print("query started at: ", datetime.utcnow())
            my_test = MyTest.query(MyTest.test_key_id == test_key_id, MyTest.unique_user_id == unique_user_id).get()
            import time
            time.sleep(5)
            if make_new and not my_test:
                print("data to create started at: ", datetime.utcnow())
                my_test = MyTest(test_key_id=test_key_id, unique_user_id=unique_user_id, course_key_id=course_key_id, status="")
                my_test.put()
                print("data created at: ", datetime.utcnow())
        print("transaction ended at: ", datetime.utcnow())
        return my_test
Appended information (2)
Here is new information about the usage of memcache (Python 3).
I have tried the following code to lock the database using memcache, but it still failed to prevent overwriting.
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.gets(cache_key_id)
        if result is None or result == "":
            client.cas(cache_key_id, "LOCKED")  # note: the return value of cas() is not checked here
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            print("failed after 500 tries.")
            break
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.cas(cache_key_id, "")
    memcache.delete(cache_key_id)
If the problem is duplication rather than overwriting, maybe you should specify the data id when creating new entries, rather than letting GAE generate a random one for you. Then the application will write to the same entry twice instead of creating two entries. The data id can be anything unique, such as a session id, a timestamp, etc.
The problem with transactions is that they prevent you from modifying the same entry in parallel, but they do not stop you from creating two new entries in parallel.
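With ndb, the usual way to do this is a deterministic key plus Model.get_or_insert, which runs a transaction internally and either returns the existing entity or creates it, so two concurrent requests cannot create duplicates. A minimal sketch, reusing the MyTest model from the question (the key format is an assumption):

key_name = "%s_%s" % (test_key_id, unique_user_id)  # deterministic id built from the fields that must be unique
my_test = MyTest.get_or_insert(
    key_name,
    test_key_id=test_key_id,
    unique_user_id=unique_user_id,
    course_key_id=course_key_id,
    status="")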
I used memcache in the following way (using get/set) and succeeded in locking the database writes.
It seems that gets/cas does not work well. In a test, I set the value with cas(), but it then failed to read the value back with gets() later.
Memcache API: https://cloud.google.com/appengine/docs/standard/python3/reference/services/bundled/google/appengine/api/memcache
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.get(cache_key_id)
        if result is None or result == "":
            client.set(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            return "failed after 500 tries of memcache checking."
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.delete(cache_key_id)
...
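Note that get() followed by set() is still a check-then-act race: two instances can both see the key as empty and both "acquire" the lock. GAE memcache's add() only succeeds if the key does not exist yet, which makes the acquire step atomic. A sketch of the retry loop rewritten with add() (the 30-second expiry is an assumption, to avoid a lock that never gets released):

client = memcache.Client()
counter = 0
while not client.add(cache_key_id, "LOCKED", time=30):  # add() fails if the key already exists
    time.sleep(0.01)
    counter += 1
    if counter > 500:
        return "failed after 500 tries of memcache checking."
my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
client.delete(cache_key_id)  # release the lock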
Transactions:
https://developers.google.com/appengine/docs/python/datastore/transactions
When two or more transactions simultaneously attempt to modify entities in one or more common entity groups, only the first transaction to commit its changes can succeed; all the others will fail on commit.
You should be updating your values inside a transaction. App Engine's transactions will prevent two updates from overwriting each other as long as your read and write are within a single transaction. Be sure to pay attention to the discussion about entity groups.
You have two options:
1. Implement your own logic for transaction failures (how many times to retry, etc.); see the sketch below.
2. Instead of writing to the datastore directly, create a task to modify an entity. Run a transaction inside the task. If it fails, App Engine will retry the task until it succeeds.
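For option 1, ndb can run the retries for you; here is a minimal sketch, assuming a key-based read-modify-write (update_test and the status value are illustrative):

@ndb.transactional(retries=3)  # re-run the function up to 3 times on commit conflicts
def update_test(key, new_status):
    my_test = key.get()   # read inside the transaction
    my_test.status = new_status
    my_test.put()         # write inside the same transaction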
I have a Gatling scenario where I retrieve 1000 documents from a database via a RESTful API.
I then modify the documents and send update requests for each.
This is how I'm currently doing it:
...
val scrollQueries = scenario("Enrichment Topologies").exec(ScrollQueryInitiator.query, repeat(numberOfPagesToScrollThrough, "scrollQueryCounter"){
exec(ScrollQuery.query, pause(10 seconds).foreach("${hitsJson}", "hit"){ exec(HitProcessor.query) })
})
...
Here are the main features of interest:
ScrollQuery.query fetches the 1000 results and saves them into hitsJson in the session.
It then pauses for 10 seconds to simulate longer-term processing.
The 1000 results are iterated over, and for each item a HitProcessor is run, which sends the update request.
In effect, the foreach loop means that each update request is sent one after the other.
Question
What I really want is to work through the 1000 results in groups of 10, sending update requests in parallel 10 at a time.
How can I achieve this?
Try moving the fetch part to a before hook.
Now that you have the data, you can start 10 users at once:
setUp(scn.inject(atOnceUsers(10)))
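A minimal sketch of what that could look like, assuming a hypothetical fetchDocuments helper that retrieves the 1000 documents directly over the REST API (partitioning the hits across the 10 users is left out for brevity):

import io.gatling.core.Predef._

class UpdateSimulation extends Simulation {
  var hits: Seq[String] = Nil

  // Runs once, before any virtual user starts: do the expensive fetch here.
  before {
    hits = fetchDocuments() // hypothetical helper that calls the REST API directly
  }

  val scn = scenario("Parallel updates")
    .exec(session => session.set("hitsJson", hits)) // expose the shared data to each user
    .foreach("${hitsJson}", "hit") { exec(HitProcessor.query) }

  // 10 users injected at once -> 10 update streams running in parallel
  setUp(scn.inject(atOnceUsers(10)))
}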
I am trying to implement a script with a server socket that will also periodically poll for data from several sensors (i.e., on the 59th second of every minute). I do not want to serialize the data to disk, but rather keep it in a table, which the socket will respond with when polled.
Here's a sketch of the code to illustrate what I am trying to do (I've not included the client code that accesses this server, but that part is OK):
#!/usr/bin/env lua
local socket = require("socket")

local server = assert(socket.bind("*", 0))
local ip, port = server:getsockname()

local data = {}
local count = 1

local function pollSensors()
    -- I do the sensor polling here and add to the table, e.g. os.time()
    table.insert(data, os.time() .. "\t" .. tostring(count))
    count = count + 1
end

while true do
    local client = server:accept()
    client:settimeout(2)
    local line, err = client:receive()
    -- I process the received line to determine the response;
    -- for illustration I'll just send the number of items in the table
    if not err then client:send("Records: " .. table.getn(data) .. "\n") end
    client:close()
    -- os.time() returns a plain number, so os.date("*t") is needed to get the seconds field
    if os.date("*t").sec == 59 then
        pollSensors()
    end
end
I am concerned that the server may occasionally block and that I'll therefore miss the 59th second.
Is this a good way to implement this, or is there a (simpler) better way (say, using coroutines)? If coroutines would be better, how do I implement them for my scenario?
To accomplish this, you need some sort of multitasking.
I'd use a network-aware scheduler, e.g. cqueues, which would look like this:
local cqueues = require "cqueues"
local cs = require "cqueues.socket"

local data = {}
local count = 1

local function pollSensors()
    -- I do the sensor polling here and add to the table, e.g. os.time()
    table.insert(data, os.time() .. "\t" .. tostring(count))
    count = count + 1
end

local function handle_client(client)
    client:setmode("b", "bn") -- turn on binary mode for the socket and turn off buffering
    -- ported code from the question:
    client:settimeout(2) -- I'm not sure why you chose a 2-second timeout
    local line, err = client:read("*l") -- with cqueues, this read will not block the whole program, but just yield the current coroutine until data arrives
    -- I process the received line to determine the response;
    -- for illustration I'll just send the number of items in the table
    if not err then
        assert(client:write(string.format("Records: %d\n", #data)))
    end
    client:close()
end

local cq = cqueues.new() -- create a new scheduler

-- create the first coroutine, which waits for incoming clients
cq:wrap(function()
    local server = cs.listen{host = "0.0.0.0"; port = "0"}
    local fam, ip, port = server:localname()
    print(string.format("Now listening on ip=%s port=%d", ip, port))
    for client in server:clients() do -- iterates over `accept`ed clients
        -- create a new coroutine for each client, passing the client in
        cqueues.running():wrap(handle_client, client)
    end
end)

-- create the second coroutine, which reads the sensors
cq:wrap(function()
    while true do
        -- I assume you just wanted to read every 60 seconds, rather than actually *on* the 59th second of each minute.
        pollSensors()
        cqueues.sleep(60)
    end
end)

-- Run the scheduler until all threads exit
assert(cq:loop())
Periodically launching apps/code like this is often done with 'cron' libraries, which exist for different languages.
For instance, you can download a cron lib for Lua here.
I would like to change my GAE app logic and start sending e-mails using the task queue.
Currently I have a cron job which runs every 15 minutes and reads the messages to be sent from the datastore:
class SendMessagesHandler(webapp2.RequestHandler):
    def get(self):
        emails_quota_exceeded = models.get_system_value('emails_quota_exceeded')
        if emails_quota_exceeded == 0 or emails_quota_exceeded == None:
            messages = models.get_emails_queue()
            for message in messages:
                try:
                    ...
                    email.send()
                    models.update_email_status(message.key.id())  # update email status indicating that the mail has been sent
                except apiproxy_errors.OverQuotaError, error_message:
                    models.set_system_value(what='emails_quota_exceeded', val=1)
                    logging.warning('E-mails quota exceeded for today: %s' % error_message)
                    break
        else:
            logging.info('Free quota to send e-mails is exceeded')
If I use task queues, then I'll get something like:
for message in messages:
    taskqueue.add(url='/sendmsg', payload=message)
In this scenario it is possible that the same message will be sent twice (or even more times): for example, if it hasn't been sent yet but the cron job executes a second time.
If I update the e-mail status immediately after adding the message to the queue:
for message in messages:
    taskqueue.add(url='/sendmsg', payload=message)
    models.update_email_status(message.key.id())  # update email status indicating that the mail has been sent
then it is possible that the message will never be sent, for example if an exception happens during e-mail sending. I understand that the task will be retried, but if the quota is already exceeded for today, retries will not help.
I think I could also re-read the status of each message in the task queue before trying to send it, but that would cost additional read operations.
What's the best way to handle it?
Giving your task a name that includes the key.id() will prevent it from being sent twice:
task_name = ''.join(['myemail-', str(mykey)])
try:
    taskqueue.Task(
        url="/someURL/send-single-email",
        name=task_name,
        method="POST",
        params={
            "subject": subject,
            "body": body,
            "to": to,
            "from": sender}  # 'from' is a reserved word in Python, so the value must come from a differently named variable
    ).add(queue_name="mail-queue")
except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
    pass  # a duplicate or tombstoned task name means this message was already enqueued
There may be times when you want to send follow-up e-mails for messages with the same key. Therefore, I would recommend adding a date or datetime stamp to the task name. This allows you to send other messages with the same key at a later time:
task_name = ''.join(['myemail-', str(mykey), str(datetime.utcnow()-timedelta(hours=8))]).translate(string.maketrans('.:_ ', '----'))
I want to do some heavy processing in the map() call of the mapper.
I was going through the source file MapReduceServlet.java:
// Amount of time to spend on actual map() calls per task execution.
public static final int PROCESSING_TIME_PER_TASK_MS = 10000;
Does it mean the map call can only last for 10 seconds? What happens after 10 seconds?
Can I increase this to a larger number, like 1 minute or 10 minutes?
-Aswath
MapReduce operations are executed in tasks using Push Queues, and as stated in the documentation, the task deadline is currently 10 minutes (a limit after which you will get a DeadlineExceededException).
If a task fails to execute, App Engine by default retries it until it succeeds. If you need a deadline longer than 10 minutes, you can use a Backend for executing your tasks.
Looking at the actual usage of PROCESSING_TIME_PER_TASK_MS in Worker.java, this value is used to limit the number of map calls done in a single task.
After each map call, if more than 10 seconds have elapsed since the beginning of the task, a new task is spawned to handle the rest of the map calls:
1. Worker.scheduleWorker spawns a new task for each given shard.
2. Each task calls Worker.processMapper.
3. processMapper executes one map call.
4. If less than PROCESSING_TIME_PER_TASK_MS has elapsed since step 2, go back to step 3.
5. Otherwise, if processing is not finished, reschedule a new worker task.
In the worst-case scenario, the default task request deadline (10 minutes) should apply to each of your individual map calls.
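In code form, the loop described above is roughly the following (a simplified sketch, not the actual Worker.java source; hasMoreInput, processNextRecord, and scheduleContinuationTask are illustrative names):

long start = System.currentTimeMillis();
while (hasMoreInput()) {
    processNextRecord(); // one map() call; an individual call may itself run long
    if (System.currentTimeMillis() - start >= PROCESSING_TIME_PER_TASK_MS) {
        scheduleContinuationTask(); // queue a new worker task for the remaining input
        break;
    }
}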