In distributed mode, can Locust know if another worker has already run a task for a particular user? - distributed

I need users created by different workers not to be duplicated, i.e. if worker_1 creates user fake1 and runs a task for that user, worker_2 must not be able to create user fake1.
Currently this works only in standalone mode.
from random import shuffle

from locust import HttpUser, constant_pacing

credentials = [(f"fake{i}", 'password') for i in range(100)]

class MobileUser(HttpUser):
    tasks = [RealTimeTask]  # RealTimeTask is defined elsewhere in the locustfile
    wait_time = constant_pacing(1)

    def __init__(self, environment):
        super().__init__(environment)
        if len(credentials) > 0:
            shuffle(credentials)
            self.username, self.password = credentials.pop()

Not really. Workers are blissfully unaware of each other.
I think your best option is to set an env var at worker start time:
WORKER_ID=1 locust --worker ...
(and then insert os.environ['WORKER_ID'] into your user names to make them unique)
Every worker does have a unique id (accessible from a task using self.environment.runner.client_id), but the ids are not ordered (they contain the hostname plus a UUID), so I don't know if that really helps you.
Another option might be to use hostname + process id as an identifier.
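A minimal sketch of the env-var approach, assuming each worker is started with its own WORKER_ID (e.g. WORKER_ID=1 locust --worker ...) and that the RealTimeTask class from the question exists:

import os
from random import shuffle

from locust import HttpUser, constant_pacing

# Assumption: each worker is launched as e.g. `WORKER_ID=1 locust --worker ...`
WORKER_ID = os.environ.get("WORKER_ID", "0")

# Prefixing usernames with the worker id keeps the credential pools disjoint,
# so two workers can never create or run the same fake user.
credentials = [(f"fake{WORKER_ID}_{i}", "password") for i in range(100)]

class MobileUser(HttpUser):
    tasks = [RealTimeTask]  # RealTimeTask as defined in the question
    wait_time = constant_pacing(1)

    def __init__(self, environment):
        super().__init__(environment)
        if credentials:
            shuffle(credentials)
            self.username, self.password = credentials.pop()

If you would rather not manage WORKER_ID yourself, the same prefix could instead be built from self.environment.runner.client_id or from hostname + process id as mentioned above, at the cost of less readable usernames.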

Related

How to prevent overwriting of database for requests from different instances (Google App Engine using NDB)

My Google App Engine application (Python 3, standard environment) serves requests from users: if there is no matching record in the database, it creates one.
Here is the problem with database overwriting:
When one user (via a browser) sends a request, the running GAE instance may temporarily fail to respond, and GAE then spins up a new process to handle the request. The result is that two instances respond to the same request. Both instances query the database at almost the same time, each finds that there is no matching record, and each creates a new one. The result is two duplicate records.
Another scenario is that, for some reason, the user's browser sends the request twice within less than 0.01 seconds; the two requests are processed by two instances on the server side, and again duplicate records are created.
I am wondering how one instance can temporarily lock the database to prevent it being overwritten by another instance.
I have considered the following schemes but have no idea whether they are efficient or not (a sketch of the first one follows below):
For Python 2, Google App Engine provides "memcache", which can be used to mark the status of a query for the purpose of database locking. For Python 3, it seems one has to set up a Redis server to rapidly exchange database status among different instances. How efficient is database locking using Redis?
Using Flask's session module. The session module can be used to share data (in most cases the login status of users) among different requests and thus different instances. I am wondering how fast the data exchange between instances is.
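For the first scheme, a minimal sketch of what a Redis-based lock could look like with the redis-py client; the host, key names and timeout are illustrative assumptions, not values from the question:

import time

import redis  # assumes a Redis instance reachable from GAE, e.g. Memorystore

r = redis.Redis(host="10.0.0.3", port=6379)  # illustrative host

def with_lock(lock_name, work, timeout=5):
    # SET with nx=True acquires the lock atomically; ex=timeout lets it expire
    # automatically so a crashed instance cannot hold it forever.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if r.set(lock_name, "LOCKED", nx=True, ex=timeout):
            try:
                return work()
            finally:
                r.delete(lock_name)
        time.sleep(0.01)
    raise RuntimeError("could not acquire lock %s" % lock_name)

A request handler could then wrap the get-or-create call, e.g. with_lock("test_" + cache_key_id, lambda: MyTest.get_test(...)), so that only one instance at a time runs the query-and-insert sequence.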
Appended information (1)
I followed the advice to use a transaction, but it did not work.
Below is the code I used to verify the transaction.
The reason for the failure may be that the transaction only works for the CURRENT client. For multiple simultaneous requests, the GAE server side creates different processes or instances to respond, and each process or instance has its own independent client.
@staticmethod
def get_test(test_key_id, unique_user_id, course_key_id, make_new=False):
    client = ndb.Client()
    with client.context():
        from google.cloud import datastore
        from datetime import datetime
        client2 = datastore.Client()
        print("transaction started at: ", datetime.utcnow())
        with client2.transaction():
            print("query started at: ", datetime.utcnow())
            my_test = MyTest.query(MyTest.test_key_id == test_key_id,
                                   MyTest.unique_user_id == unique_user_id).get()
            import time
            time.sleep(5)
            if make_new and not my_test:
                print("data to create started at: ", datetime.utcnow())
                my_test = MyTest(test_key_id=test_key_id,
                                 unique_user_id=unique_user_id,
                                 course_key_id=course_key_id,
                                 status="")
                my_test.put()
                print("data created at: ", datetime.utcnow())
        print("transaction ended at: ", datetime.utcnow())
        return my_test
Appended information (2)
Here is new information about the usage of memcache (Python 3).
I tried the following code to lock the database using memcache, but it still failed to prevent the overwriting.
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.gets(cache_key_id)
        if result is None or result == "":
            client.cas(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            print("failed after 500 tries.")
            break
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.cas(cache_key_id, "")
    memcache.delete(cache_key_id)
If the problem is duplication rather than overwriting, maybe you should specify the data ID yourself when creating new entries instead of letting GAE generate a random one for you. Then the application will write to the same entry twice instead of creating two entries. The data ID can be anything unique, such as a session ID, a timestamp, etc.
The problem with a transaction is that it prevents you from modifying the same entry in parallel, but it does not stop you from creating two new entries in parallel.
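A minimal sketch of that idea with Cloud NDB (the library used in the question's code); the key name built from the test and user IDs is an illustrative choice, and get_or_insert is atomic, so concurrent requests resolve to the same entity instead of two random-ID entities:

from google.cloud import ndb  # Cloud NDB, as in the question's code

class MyTest(ndb.Model):
    test_key_id = ndb.IntegerProperty()
    unique_user_id = ndb.StringProperty()
    course_key_id = ndb.IntegerProperty()
    status = ndb.StringProperty()

def get_or_create_test(test_key_id, unique_user_id, course_key_id):
    # Deterministic key name: duplicate requests compute the same name and
    # therefore target the same entity instead of creating two.
    key_name = "%s_%s" % (test_key_id, unique_user_id)
    client = ndb.Client()
    with client.context():
        # get_or_insert only creates the entity if no entity with this key exists.
        return MyTest.get_or_insert(
            key_name,
            test_key_id=test_key_id,
            unique_user_id=unique_user_id,
            course_key_id=course_key_id,
            status="")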
I used memcache in the following way (using get/set) and succeeded in locking the database writes.
It seems that gets/cas does not work well. In one test I set the value with cas(), but then failed to read it back with gets() later.
Memcache API: https://cloud.google.com/appengine/docs/standard/python3/reference/services/bundled/google/appengine/api/memcache
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.get(cache_key_id)
        if result is None or result == "":
            client.set(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            return "failed after 500 tries of memcache checking."
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.delete(cache_key_id)
...
Transactions:
https://developers.google.com/appengine/docs/python/datastore/transactions
When two or more transactions simultaneously attempt to modify entities in one or more common entity groups, only the first transaction to commit its changes can succeed; all the others will fail on commit.
You should be updating your values inside a transaction. App Engine's transactions will prevent two updates from overwriting each other as long as your read and write are within a single transaction. Be sure to pay attention to the discussion about entity groups.
You have two options:
1. Implement your own retry logic for transaction failures (how many times to retry, etc.); a minimal sketch of this option follows below.
2. Instead of writing to the datastore directly, create a task that modifies the entity and run the transaction inside that task. If it fails, App Engine will retry the task until it succeeds.
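A minimal sketch of option 1 using the legacy Python ndb library from the linked documentation, assuming a MyTest model as in the question; the transaction is retried a few times on commit conflicts, and the deterministic key keeps the existence check a key lookup (plain non-ancestor queries are not allowed inside a transaction):

from google.appengine.ext import ndb  # legacy App Engine ndb

@ndb.transactional(retries=3)  # ndb re-runs the function on commit conflicts
def get_or_create_test(test_key_id, unique_user_id, course_key_id):
    # Deterministic key, so "does it already exist?" is a key lookup.
    key = ndb.Key(MyTest, "%s_%s" % (test_key_id, unique_user_id))
    my_test = key.get()
    if my_test is None:
        my_test = MyTest(key=key,
                         test_key_id=test_key_id,
                         unique_user_id=unique_user_id,
                         course_key_id=course_key_id,
                         status="")
        my_test.put()
    return my_test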

Linux Samba/Winbind - groups not refreshing when using SSH - Active Directory membership

We have Linux hosts that are bound to our Active Directory domain using Samba/Winbind as member servers. To give users access to the servers, we put a domain group into sshd_config. So a user gets added to the group and then, in theory, they can log in - that was the plan.
At the moment we can add a user to the group, and if that group has never been used before, the server will reach out and grab the group membership. But once that has happened, the group membership does not refresh - short of going to the extreme of removing the tdb files and re-joining the machine to the domain, which is a mess.
Has anyone ever got around this problem?
What is annoying is that if I SSH onto the box, add a user to the AD group and then 'su' to the user, the groups are refreshed. However, that does not work if you 'sudo su' (I don't want people's passwords).
workgroup = INTERNAL
realm = INTERNAL.NETWORK
netbios name = no1
security = ADS
dns forwarder = { 123.123.123.123; 123.123.123.123 }
idmap config * : backend = tdb
idmap config *:range = 50000-1000000
template homedir = /home/%U
template shell = /bin/bash
winbind use default domain = true
winbind nss info = rfc2307
winbind enum users = yes
winbind enum groups = yes
winbind use nested groups = yes
pam password change = yes
vfs objects = acl_xattr
map acl inherit = yes
store dos attributes = yes
encrypt passwords = true
winbind cache time = 10
I am wondering if the issue is that SSH looks at the group, but this does not trigger an event to go and check the group membership in AD?
We can get around this by using local groups and domain users, but it is annoying - surely this is something that should fundamentally just work.
Thanks
Your smb.conf is borked, or you are using sssd. If the latter, this has nothing to do with Samba; if the former, please read this:
https://wiki.samba.org/index.php/Setting_up_Samba_as_a_Domain_Member

Apache Flink (How to uniquely tag Jobs)

Is it possible to tag jobs with a unique name so I can stop them at a later date? I don't really want to grep and persist Job IDs.
In a nutshell, I want to stop a job as part of my deployment and deploy the new one.
You can name jobs when you start them via the execute(name: String) call, e.g.,
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment()
val result: DataStream[_] = ??? // your job logic
result.addSink(new YourSinkFunction) // add a sink
env.execute("Name of your job") // execute and assign a name
The REST API of the JobManager provides a list of job details, which includes the name of each job and its JobID.
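For the deployment use case, a minimal sketch of looking up a running job by its name through the REST API and cancelling it; the JobManager address and the endpoint layout (/jobs/overview, PATCH /jobs/<jobid>?mode=cancel) are assumptions that should be checked against the REST API reference for your Flink version:

import requests  # hypothetical helper script, not part of Flink itself

FLINK_REST = "http://localhost:8081"  # assumed JobManager address

def cancel_job_by_name(job_name):
    # /jobs/overview lists all jobs with their id ("jid"), name and state.
    jobs = requests.get(FLINK_REST + "/jobs/overview").json()["jobs"]
    for job in jobs:
        if job["name"] == job_name and job["state"] == "RUNNING":
            # Terminate the job; parameters may differ between Flink versions.
            requests.patch(FLINK_REST + "/jobs/" + job["jid"], params={"mode": "cancel"})
            return job["jid"]
    return None

cancel_job_by_name("Name of your job")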

GAE Datastore ID

I've created two different entities, a User and a Message which they can create. I assign each User an ID and then want to assign that ID to each Message the user creates. How can I go about this? Do I have to do it in a query?
Thanks
Assuming that you are using Python NDB, you can have something like the following:
from google.appengine.ext import ndb

class User(ndb.Model):
    # put your fields here
    pass

class Message(ndb.Model):
    owner = ndb.KeyProperty()
    # other fields
Create and save a User:
user = User(field1=value1, ....)
user.put()
Create and save a Message:
message = Message(owner=user.key, ...)
message.put()
Query a message based on user:
messages = Message.query().filter(Message.owner==user.key).fetch() # returns a list of messages that have this owner
For more information about NDB, take a look at Python NDB API.
Also, you should take a look at Python Datastore in order to get a better understanding of data modeling in App Engine.

Trying to fill the GAE datastore, but the code consumes too much CPU time. How can I optimize this?

I am trying to get the list of images in Amazon EC2 into the Google datastore. I want to do this with a cron job inside GAE.
import boto.ec2
from google.appengine.ext import db, webapp

class AmazonEC2uswest(db.Model):
    ami = db.StringProperty(required=True)
    mani = db.StringProperty()
    typ = db.StringProperty()
    arch = db.StringProperty()
    state = db.StringProperty()
    owner = db.StringProperty()

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        aws_access_key_id_admin = "<secret>"
        aws_secret_access_key_admin = "<secret>"
        conn_us_west = boto.ec2.connect_to_region('us-west-1',
                                                  aws_access_key_id=aws_access_key_id_admin,
                                                  aws_secret_access_key=aws_secret_access_key_admin,
                                                  is_secure=False)
        liste_images_us_west = conn_us_west.get_all_images()
        laenge_liste_images_us_west = len(liste_images_us_west)
        for i in range(laenge_liste_images_us_west):
            datastore_uswest_AMIs = AmazonEC2uswest(ami=liste_images_us_west[i].id,
                                                    mani=str(liste_images_us_west[i].location),
                                                    typ=liste_images_us_west[i].type,
                                                    arch=liste_images_us_west[i].architecture,
                                                    state=liste_images_us_west[i].state,
                                                    owner=liste_images_us_west[i].ownerId)
            datastore_uswest_AMIs.put()
The problem: getting the list with get_all_images() takes only a few seconds, but writing the data to the Google datastore takes far too much CPU time.
My IBM T42p (P4M with 2 GHz) needs approximately 1 minute for that piece of code!
Is it possible to optimize my code so that it needs less CPU time?
First possible optimisation: create all the entities in your loop, and then call db.put() with a list of all of them after you're finished. Something like:
entities = []
for i in range(laenge_liste_images_us_west):
    datastore_uswest_AMIs = AmazonEC2uswest(...)
    entities.append(datastore_uswest_AMIs)
db.put(entities)
or:
db.put([AmazonEC2uswest(...) for image in liste_images_us_west])
If that's still too slow, the right thing to do is probably:
1. Get the list of images.
2. Divide these up into small batches, each of which can complete comfortably in under 30 seconds. In your example, which currently takes a minute, you want at least 4 batches, maybe more, and the number of batches should depend on the number of images you get.
3. For each batch, add a task to a task queue, specifying which images to add to the DB. This might be done by passing all the data, or just a range of images to handle. Which you choose depends on being able to store the data temporarily: there is a limit to what you can store in a task; if you exceed it you could use memcache, or store only the image id rather than all the fields, or create more tasks so that the data for each batch stays under the limit.
4. In the task handler, process just that batch. If you already have all the data, great; otherwise fetch it again with get_all_images. Then generate and store just the entities that belong to this batch.
You don't have to use tasks; a cron job alone could handle it if you can remember how far you got the last time the job ran and continue from there next time. But tasks seem appropriate to me; a rough sketch of the task-queue approach follows.
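A rough sketch on the old Python 2 runtime, reusing the AmazonEC2uswest model and assuming the boto connection conn_us_west from the question is available at module level; the batch size, task URL and handler name are illustrative:

from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp

BATCH_SIZE = 50  # illustrative; tune so each task finishes well within its deadline

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        liste_images_us_west = conn_us_west.get_all_images()
        # Enqueue one task per batch; only the AMI ids are passed so the
        # payload stays small, and the handler re-fetches the details.
        ami_ids = [image.id for image in liste_images_us_west]
        for start in range(0, len(ami_ids), BATCH_SIZE):
            taskqueue.add(url='/tasks/store_amis',
                          params={'ami_ids': ','.join(ami_ids[start:start + BATCH_SIZE])})

class StoreAMIsTask(webapp.RequestHandler):  # mapped to /tasks/store_amis
    def post(self):
        ami_ids = self.request.get('ami_ids').split(',')
        images = conn_us_west.get_all_images(image_ids=ami_ids)
        entities = [AmazonEC2uswest(ami=img.id,
                                    mani=str(img.location),
                                    typ=img.type,
                                    arch=img.architecture,
                                    state=img.state,
                                    owner=img.ownerId)
                    for img in images]
        db.put(entities)  # one batched write per task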
