JMH: passing parameters between benchmarks

My scenario:
I have to run benchmarks on createAccount with a unique id, e.g. a UUID, as the key.
I then have to run benchmarks on createBill on all the accounts created (with the unique IDs generated by createAccount).
How can I pass or get the UUIDs created by createAccount to createBill? I tried @Setup, but that runs before/after each benchmark. @Param doesn't seem to help in this case.


How can I make custom auto-generated IDs for documents in Firestore?

For my project I need IDs that can be easily shared, so Firestore's default auto-generated IDs won't work.
I am looking for a way to auto-generate IDs like 8329423 that are incremented or randomly chosen in the range 0 to 9999999.
Firestore's auto-ID fields are designed to statistically guarantee that no two clients will ever generate the same value. This is why they're as long as they are: it's to ensure there is enough randomness (entropy) in them.
This allows Firestore to determine these keys completely client-side without needing to look up on the server whether the key it generated was already generated on another client before. And this in turn has these main benefits:
Since the keys are generated client-side, they can also be generated when the client is not connected to any server.
Since the keys are generated client-side, there is no need for a roundtrip to the server to generate a new key. This significantly speeds up the process.
Since the keys are generated client-side, there is no contention between clients generating keys. Each client just generates keys as needed.
If these benefits are important to your use-case, then you should strongly consider whether you're likely to create a better unique ID than Firestore already does. For example, Firestore's IDs have 62^20 possible values, which is why they're statistically guaranteed to never generate the same value over a very long period of time. Your proposed range of 0 - 9999999 has only 10 million unique values, which makes duplicates far more likely.
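For contrast, the shape of those client-side IDs can be sketched in a few lines of Python. This is only an illustration of the 62-character, 20-position key space, not Firestore's actual implementation (which draws from a cryptographically strong random source):

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # 62 possible characters

def auto_id(length=20):
    # 62**20 possible values: collisions are statistically negligible,
    # and no server round-trip or client coordination is needed.
    return "".join(random.choice(ALPHABET) for _ in range(length))

print(auto_id())  # 20 random alphanumeric characters
```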
If you really want this scheme for IDs, you will need to store the IDs that you've already given out on the server (likely in Firestore), so that you can check against it when generating a new key. A very common way to do this is to keep a counter of the last ID you've already handed out in a document. To generate a new unique ID, you:
Read the latest counter value from the document.
Increment the counter.
Write the updated counter value to the document.
Use the updated counter value in your code.
Since this read-update-write happens from multiple clients, you will need to use a transaction for it. Also note that the clients now are coordinating the key-generation, so you're going to experience throughput limits on the number of keys you can generate.
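The read-increment-write cycle above can be sketched as follows; the in-memory class stands in for the Firestore counter document and the lock stands in for the transaction (the names here are illustrative, not a Firestore API):

```python
import threading

class CounterDoc:
    """In-memory stand-in for a Firestore counter document."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the transaction
        self.last_id = 0

    def next_id(self, max_id=9999999):
        with self._lock:
            value = self.last_id + 1   # read the latest value, increment it
            if value > max_id:
                raise OverflowError("ID space exhausted")
            self.last_id = value       # write the updated counter back
            return value               # use the updated value in your code

counter = CounterDoc()
print(counter.next_id())  # -> 1
print(counter.next_id())  # -> 2
```

Because every new ID funnels through this single document, write throughput on it is limited (on the order of one write per second per document in Firestore), which is the throughput caveat mentioned above.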

Elasticsearch - workaround for the missing unique constraint

I am thinking about a smart workaround for the "no unique constraint" problem in Elasticsearch.
I can't use _id to store my unique field, because I am using _id for another purpose.
I crawl Internet pages and store them in an Elasticsearch index. My rule is that the url must be unique (only one document with a given url in the index). Since Elasticsearch doesn't let you set a unique constraint on a field, I must query the index before inserting a new page to check whether a document with that url already exists.
So adding a new page to the index looks like this:
Query (match) the index in ES to check whether there is a document with the given url field.
If not, insert the new document.
This solution has two disadvantages:
I must execute an extra query to check whether a document with the given url already exists. It slows down the inserting process and generates extra load.
If I try to add 2 documents with the same url in a short amount of time and the index doesn't refresh before the second document is added, the second query reports that there is no document with the given url, and I end up with two documents with the same url.
So I am looking for something else. Please tell me if you have any ideas, or what you think about these solutions:
Solution 1
Use another database system (or maybe another ES index with the url as _id) where I store only urls, and query it to check whether a given url already exists.
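Since _id is already taken for another purpose in the main index, Solution 1 can lean on a second index whose _id is derived deterministically from the url; inserting there with op_type=create then fails on a duplicate url without any prior read. A sketch of the key derivation (the second-index layout itself is an assumption, not from the question):

```python
import hashlib

def url_doc_id(url):
    # Deterministic _id: the same url always maps to the same document,
    # so a second insert with op_type=create is rejected as a conflict.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

print(url_doc_id("http://example.com/page"))  # 40-char hex digest
```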
Solution 2
Queue documents before inserting, and disable index refreshing while another process works through the queue and adds the queued documents to the index.
You've hit upon one of the things that Elasticsearch does not do well (secondary indexes and constraints) when compared to some other NoSQL solutions. In addition to Solution 1 and Solution 2 I'd suggest you look at Elasticsearch Rivers:
A river is a pluggable service running within an elasticsearch cluster
pulling data (or being pushed with data) that is then indexed into the cluster.
For example, you could use the MongoDB river and then insert your data into MongoDB. MongoDB supports secondary unique indexes so you could prevent insertion of duplicate urls. The River will then take care of pushing the data to Elasticsearch in realtime.
ES supports the CouchDB river officially, and there are a number of other databases that have rivers too.

How to re-index documents in Solr without knowing last modified time?

How to handle the following scenario in Solr's DataImportHandler? We do a full import of all our documents once daily (the full indexing takes about 1 hour to run). All our documents are in two classes, say A and B. Only 3% of the documents belong to class A and these documents get modified often. We re-index documents in class A every 10 mins via deltaQuery by using the modified time. All fine till here.
Now, we also want to re-index ALL documents in class A once every hour (because we have a view_count column in a different table and the document modified time does not change when we update the view_count). How to do this?
Update (short-term solution): For now we decided to not use the modified time in the delta at all and simply re-index all documents in class A every 10 mins. It takes only 3 mins to index class A docs so we are OK for now. Any solution will be of help though.
Rather than using separate query and deltaQuery parameters in your DIH DB config, I chose to follow the suggestion found here, which allows you to use the same query logic for both full and partial updates by passing different parameters to Solr to perform either a full import or a delta import.
In both cases you would pass ?command=full-import, but for a full import you would add &clean=true as a URL parameter, and for a delta you would add &clean=false. This affects the number of records returned from the query and tells Solr whether or not to flush the index and start over.
I found one can use ExternalFileField to store the view count and use a function query to sort the results based on that field. (I asked another question about this on SO: ExternalFileField in Solr 3.6.) However, I found that these fields cannot be returned in the Solr result set, which meant I needed to do a DB call to get the values for the fields. I don't want to do that.
Found an alternate solution: When trying to understand Mike Klostermeyer's answer, I found that command=full-import can also take an additional query param: entity. So now I set up two top-level entities inside the <document> tag in data-config.xml - the first one will only index docs in class A and the second one will only index docs in class B. For class A documents we do a delta import based on last modified time every 5 mins and a full import every hour (to get the view_count updated). For class B documents, we only do one full import every day and no delta imports.
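A data-config.xml along those lines might look like the sketch below; the entity names, table, and column names are assumptions, not taken from the question:

```xml
<document>
  <!-- Class A: delta-imported every 5 min, fully re-imported hourly -->
  <entity name="classA"
          query="SELECT * FROM docs WHERE class = 'A'"
          deltaQuery="SELECT id FROM docs WHERE class = 'A'
                      AND modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM docs WHERE id = '${dih.delta.id}'"/>
  <!-- Class B: one full import per day, no deltas -->
  <entity name="classB"
          query="SELECT * FROM docs WHERE class = 'B'"/>
</document>
```

Each schedule then targets its entity, e.g. /dataimport?command=full-import&entity=classA&clean=false.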
This essentially gives three different execution plans to run at different time intervals.
There is also one caveat though: I need to pass the query param clean=false every time I run the import for an entity; otherwise the docs in the other entity get deleted after indexing completes.
One thing I don't like about the approach is the copy-pasting of all the queries and sub-entities from one top entity to the other. The only difference between the queries in the top-level entities is whether the doc is in class A or class B.

Django: efficient database search

I need an efficient way to search through my models to find a specific User, here's a list,
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I have a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, a user who is available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Not sure if this method will solve your issue for all 4 cases, but at least it should help with the first one: querying user data efficiently.
I usually find the values or values_list query functions faster, because they slim down the SELECT part of the actual SQL, so you get results faster. See the Django docs regarding this.
Also worth mentioning: starting with the current development version, values and values_list can traverse any type of relationship, including many-to-one.
And finally you might find in_bulk useful. For a complex query, you can fetch the ids of some models first using values or values_list, and then use in_bulk to get the model instances faster. See the Django docs about that.
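The two-step pattern described above (narrow the SELECT down to ids first, then fetch full records in bulk) looks roughly like this. Plain dicts stand in for the ORM here; in real Django the two steps would be User.objects.values_list('id', flat=True) followed by User.objects.in_bulk(ids), and the model/field names are assumptions:

```python
# Stand-in rows; in Django these would come from the User model.
users = [
    {"id": 1, "name": "Ann", "skills": {"x", "y", "z"}},
    {"id": 2, "name": "Bob", "skills": {"x"}},
    {"id": 3, "name": "Cal", "skills": {"x", "y", "z"}},
]

required = {"x", "y", "z"}

# Step 1 (values_list-style): pull only the ids, keeping the SELECT narrow.
qualifying_ids = [u["id"] for u in users if required <= u["skills"]]

# Step 2 (in_bulk-style): map ids to full records in a single pass.
by_id = {u["id"]: u for u in users if u["id"] in set(qualifying_ids)}

print(sorted(by_id))  # -> [1, 3]
```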

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there anyway to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query 1), but to me that still seems wasteful because I will have to fetch the x entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains. The first string is the SLD and the second the TLD (no subdomains). The application can be used to perform a request like http://[...]/available/. The application will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is, that these are pretty big quantities and I have to figure out a way to keep costs down and add/remove 2*~100000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
1: 'A maximum of 30 datastore queries are allowed for any single GQL query.'
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
    pass  # no properties needed; the key_name itself stores the domain
Then you will populate your datastore like (placeholder domains used here):
UnavailableDomain.get_or_insert('example.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('example.com') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the datastore first, like:
free_domains = ['example.com', 'example.org']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching the deletes into something like 200 per RPC if your free_domains list is really big.
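That batching is only a few lines. The chunking helper below is generic Python; the commented lines show where the db.delete call from the answer would go (free_domains as above):

```python
def chunks(seq, size=200):
    # Yield successive slices of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# for batch in chunks(free_domains, 200):
#     db.delete(db.Key.from_path('UnavailableDomain', name) for name in batch)

print(list(chunks(list(range(5)), 2)))  # -> [[0, 1], [2, 3], [4]]
```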
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could use both to:
Create a pipeline for the overall task that you will run via cron every 24 hrs.
The 'overall' pipeline would start a mapper that filters your entities and yields the delete operations.
After the delete mapper completes, the 'overall' pipeline could call an 'import' pipeline to start running your entity creation part.
The pipeline API can then send you an email to report on its status.