I'm putting some data in the datastore via entity.put(), then soon thereafter reading from the datastore (getting data that includes the just-put entity) via a .get().
The .get() returns the correct data, but often in an order that doesn't make sense:
SELECT * FROM entityName
WHERE someThing = 'value'
ORDER BY votes DESC, lastTouchedTimestamp DESC
Will return the correct entities (updated to include the new data from the aforementioned .put()), but in an incorrect order (i.e. the votes and/or lastTouchedTimestamp values aren't actually in order).
Pretty new to GAE so sorry if there is some simple thing I'm overlooking.
EDIT/ADDITION:
Each entity has a votes integer. The SELECT should return entities in order of votes, like 10, 8, 7, 7, 1, but instead sometimes returns, for example, 10, 7, 8, 7, 1.
What you're describing is, in App Engine terms, not a .get() call but a query. Proper .get() calls specify a key and are not subject to this race. (Nor are ancestor queries.) For more background on this topic, read https://developers.google.com/appengine/docs/python/datastore/overview#Datastore_Writes_and_Data_Visibility
You're lucky that you're getting the updated entity in your query results at all -- that's because the entity as it existed before your .put() call still matched the query. You're getting the correct value in the entity because query results (except for projection queries, as @tesdal mentioned) are accessed by key; but you're getting the wrong ordering because the ordering is taken from the index.
App Engine has no guarantee concerning index update timing.
In your example it means that the index data is 10, 7, 7, 7, 1, but the returned results are the actual (updated) objects, so you notice the ordering is off because you expected an 8 for one of the entries.
If you use a projection query, you'll see 10,7,7,7,1.
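To make the distinction concrete, here is a minimal Python NDB sketch; the Entry kind and its properties are made-up stand-ins for the entityName in the question:

from google.appengine.ext import ndb

class Entry(ndb.Model):
    # Hypothetical kind standing in for entityName in the question.
    someThing = ndb.StringProperty()
    votes = ndb.IntegerProperty()
    lastTouchedTimestamp = ndb.DateTimeProperty(auto_now=True)

key = Entry(someThing='value', votes=8).put()

# Strongly consistent: a get by key always sees the latest write.
fresh = key.get()

# Eventually consistent ordering: the sort order comes from the index,
# which may not have caught up with the put() yet.
results = Entry.query(Entry.someThing == 'value').order(
    -Entry.votes, -Entry.lastTouchedTimestamp).fetch()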
I need to find a document in MongoDB using its ID. This operation should be fast, so we need to get exactly the document that has the given ID. Is there any way to do that? I am a beginner here, so I would be very thankful if you could give an in-depth answer.
Okay, so you really are a beginner.
The first thing you should know is that getting any kind of record from a database is done by querying the database, and this is called a search.
It simply means that when you want any data from your database, the database engine has to search for it using the query you provided.
So whenever you ask the database (using a query) to give you some records, it will perform a search based on the conditions you provided. It doesn't matter whether
you provide a condition with a single unique key or a complex combination of columns or joins of multiple tables, or
your database contains no records or billions of records:
it has to search the database.
The above explanation holds true, as far as I know, for just about every database.
Now coming to MongoDB
So, referring to the explanation above, the MongoDB engine queries the database to get a result.
Now the main question is: how do you get the result fast?
And I think that's what your main concern should be.
Query speed (search speed) mainly depends on 2 things:
Query.
Number of records in your database.
1. Query
The factors affecting it here are:
a. Nature of the parameters used in the query (indexed or unindexed)
If you use indexed parameters in your query, the search operation will always be faster for the database.
For example, the _id field is indexed by default by MongoDB, so searching for a document in a collection by the _id field alone is always going to be a fast search (see the sketch after this list).
b. Combination of parameters with operators
This refers to the number of parameters used in the query (the more parameters, the slower the search) and the kind of query operators used (simple query operators give results faster compared to aggregation operators with pipelines).
c. Read Preferences
Read preference describes how MongoDB routes read operations to the members of a replica set. It effectively describes your preferred level of confidence in the data you are getting.
Those are the main factors, but there are many other things as well, such as:
the schema of your collection,
your understanding of the schema (specifically the data types of the documents),
your understanding of the query operators you use, for example when to use the $or and $and operators and when to use the $in and $nin operators.
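To illustrate point (a), here is a minimal pymongo sketch; the connection string and the database/collection names are made up, and the ObjectId value is just an example:

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
collection = client["mydb"]["items"]               # hypothetical names

# _id is indexed by default, so this lookup uses the index and stays fast
# even as the collection grows.
doc = collection.find_one({"_id": ObjectId("89e6dd2eb4494ed008d595bd")})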
2. Number of records in your database.
This matters when you have an enormous amount of data in the database: with a single database server, more records means slower searches.
In such cases, sharding (clustering) your data across multiple database servers will give you faster search performance.
MongoDB has the mongos component, which routes your query to the right database server in the cluster. To perform this routing it uses config servers, which store metadata about your collections in the form of indexes and the shard key.
Hence, in a sharded environment, choosing a proper shard key plays an important role in fast query response.
I hope this gives you a decent idea of how a search is affected by various parameters.
I will improve this answer in the future.
It's pretty straightforward; you can try the following:
var id = "89e6dd2eb4494ed008d595bd";
Model.findById(id, function (err, user) { ... } );
with mongoose:
router.get("/:id", (req, res) => {
  // Reject malformed ObjectId strings before hitting the database.
  if (!mongoose.Types.ObjectId.isValid(req.params.id)) {
    return res.send("Please provide valid id");
  }
  // findById expects the id itself; Mongoose casts the string for you.
  Item.findById(req.params.id)
    .then(item => {
      res.json(item);
    })
    .catch(err => res.status(404).json({ success: false }));
});
Using Google App Engine's NDB datastore, how do I ensure a strongly consistent read of a list of entities after creating a new entity?
The example use case is that I have entities of the Employee kind.
Create a new employee entity
Immediately load a list of employees (including the one that was added)
I understand that the approach below will yield an eventually consistent read of the list of employees which may or may not contain the new employee. This leads to a bad experience in the case of the latter.
e = Employee(...)
e.put()
# Eventually consistent: may or may not include the new employee yet.
Employee.query().fetch(...)
Now here are a few options I've thought about:
IMPORTANT QUALIFIERS
I only care about a consistent list read for the user who added the new employee. I don't care if other users have an eventual consistent read.
Let's assume I do not want to put all the employees under an Ancestor to enable a strongly consistent ancestor query. In the case of thousands and thousands of employee entities, the 5 writes / second limitation is not worth it.
Let's also assume that I want the write and the list read to be the result of two separate HTTP requests. I could theoretically put both write and read into a single transaction (?) but then that would be a very non-RESTful API endpoint.
Option 1
Create a new employee entity in the datastore
Additionally, write the new employee object to memcache, local browser cookie, local mobile storage.
Query datastore for list of employees (eventually consistent)
If new employee entity is not in this list, add it to the list (in my application code) from memcache / local memory
Render results to user. If user selects the new employee entity, retrieve the entity using key.get() (strongly consistent).
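For reference, a rough sketch of Option 1's read path, assuming Python NDB and memcache; the cache key name, the fetch limit, and the assumption that the create handler stored the new key in memcache are all mine:

from google.appengine.api import memcache
from google.appengine.ext import ndb

def list_employees_for(user_id):
    # Eventually consistent global query.
    employees = Employee.query().fetch(100)

    # Hypothetical cache entry written right after the put() in the create handler.
    new_key = memcache.get('last_employee_key_%s' % user_id)
    if new_key and new_key not in [e.key for e in employees]:
        fresh = new_key.get()  # a get by key is strongly consistent
        if fresh:
            employees.append(fresh)
    return employees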
Option 2
Create a new employee entity using a transaction
Query datastore for list of employees in a transaction
I'm not sure Option #2 actually works.
Technically, does the previous write transaction get propagated to all the servers before the read transaction of that entity occurs? Or is that not the guaranteed behavior?
Transactions (including XG) have a limit on the number of entity groups involved, and a list of employees (each in its own entity group) could exceed this limit.
What are the downsides of read-only transactions vs. normal reads?
Thoughts? Option #1 seems like it would work, but it seems like a lot of work to ensure consistency on a follow-on read.
If you do not use an entity group, you can do a keys-only query followed by a get_multi(keys) lookup for entity consistency. For the new employee, you have to add the new key to the key list passed to get_multi.
Docs: A combination of the keys-only, global query with a lookup method will read the latest entity values. But it should be noted that a keys-only global query can not exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.
More info and magic here : Balancing Strong and Eventual Consistency with Google Cloud Datastore
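A minimal NDB sketch of this pattern, reusing the Employee kind from the question; new_employee_key is assumed to be the key returned by the earlier put():

from google.appengine.ext import ndb

# Keys-only global query: cheap, but the index may lag behind recent writes.
keys = Employee.query().fetch(100, keys_only=True)

# Make sure the just-created employee is present even if the index is stale.
if new_employee_key not in keys:
    keys.append(new_employee_key)

# get_multi looks up by key, so the entity values are the latest ones.
employees = ndb.get_multi(keys)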
I had the same problem, option #2 doesn't really work: a read using the key will work, but a query might still miss the new employee.
Option #1 could work, but only within the same request. The saved memcache key can disappear at any time, and a subsequent query on the same instance, or on another instance potentially running on different hardware, would still miss the new employee.
The only "solution" that comes to mind for consistent query results is to not attempt to force the new employee into the results and rather let things flow naturally until it shows up. I'd just add a warning that creating the new user will take "a while". If tolerable, maybe keep polling/querying in the original request until it shows up - that is the only place where the employee creation event is known with certainty.
This question is old as I write this. However, it is a good question and will be relevant long term.
Option #2 from the original question will not work.
If the entity creation and the subsequent query are truly independent, with no context linking them, then you are really just stuck - or you don't care. The trick is that there is almost always some relationship or some use case that must be covered. In other words if the query is truly some kind of, essentially, ad hoc query, then you really don't care. In that case, you just quote CAP theorem and remind the client executing the query how great it is that this system scales. However, almost always, if you are worried about the eventual consistency, there is some use case or set of cases that must be handled. For example, if you have a high score list, the highest score must be at the top of the list. The highest score may have just been achieved by the user who is now looking at the list. Another example might be that when an employee is created, that employee must be on the "new employees" list.
So what you usually do is exploit these known cases to balance the throughput needed with consistency. For example, for the high score example, you may be able to afford to keep a secondary index (an entity) that is the list of the high scores. You always get it by key and you can write to it as frequently as needed because high scores are not generated that often presumably. For the new employee example, you might use an approach that you started to suggest by storing the timestamp of the last employee in memcache. Then when you query, you check to make sure your list includes that employee ... or something along those lines.
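As an illustration of the high-score idea (a sketch under my own assumptions, not the poster's actual code), assuming Python NDB, a made-up singleton key, and a top-10 cutoff:

from google.appengine.ext import ndb

class HighScores(ndb.Model):
    # A single entity, fetched by a well-known key, holding the top scores
    # as a list of [score, player] pairs.
    scores = ndb.JsonProperty()

HIGH_SCORES_KEY = ndb.Key(HighScores, 'singleton')  # assumed well-known key

def record_score(player, score):
    # Gets and puts by key are strongly consistent; the transaction keeps
    # concurrent updates from clobbering each other.
    @ndb.transactional
    def txn():
        board = HIGH_SCORES_KEY.get() or HighScores(key=HIGH_SCORES_KEY, scores=[])
        board.scores = sorted(board.scores + [[score, player]], reverse=True)[:10]
        board.put()
    txn()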
The price in balancing write throughput and consistency on App Engine and similar systems is always the same. It requires increased model complexity / code complexity to bridge the business needs.
I have a model named UserModel and I know that it will never grow beyond 10000 entities. I don't have anything unique in UserModel that I can use for creating a key, hence I decided to have string keys in the format USRXXXXX,
where XXXXX represents the serial count, e.g. USR00001, USR12345.
Hence I chose the following way to generate the IDs:
def generate_unique_id():
    qry = UserModel.query()
    num = qry.count() + 1
    id = 'USR' + '%0.5d' % num
    return id

def create_entity(model, id, **kwargs):
    ent = model.get_or_insert(id, **kwargs)
    # check whether it's the newly created record or an existing one
    if ent.key.id() != id:
        raise InsertError('failed to add new user, please retry the operation')
    return True
Questions:
Is this the best way of achieving a serial count of fixed width? Is this solution optimal and idiomatic?
Does using get_or_insert like above guarantee that I will never have duplicate records?
Will it increase my billing? For counting the number of records I am doing UserModel.query() without any filters, so in a way I am fetching all the records. Or does billing not come into the picture until I use the fetch API on the qry object?
Since you only need a unique key for the UserModel entities, I don't quite understand why you need to create the key manually. The IDs that App Engine generates automatically are guaranteed to be unique.
Regarding your questions, we have the following:
I think not. Maybe you should first allocate IDs (check the section Using Numeric Key IDs), then format and use them; see the sketch after this list.
Even though get_or_insert is strongly consistent, the query you perform (qry = UserModel.query()) is not. Thus, you may end up overwriting existing entities. For more information about eventual consistency, take a look here.
No, it will not increase your billing. When you execute Model.query().count(), the datastore under the hood executes a Model.query().fetch(keys_only=True) and counts the number of results. Keys-only queries generate small datastore operations, which, per Google's latest pricing changes, are not billable.
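For what it's worth, a minimal NDB sketch of the allocate_ids approach; the formatting is mine and the kwargs are omitted:

from google.appengine.ext import ndb

# allocate_ids reserves numeric IDs the datastore will never assign again,
# so concurrent requests cannot collide the way the count()-based scheme can.
first, last = UserModel.allocate_ids(size=1)
user_id = 'USR' + '%0.5d' % first

# Note: allocated IDs are guaranteed unique, but not strictly serial, so a
# gapless USR00001, USR00002, ... sequence is not guaranteed.
user = UserModel(id=user_id)
user.put()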
Probably not. You might get away with what you are trying to do if your UserModel entities have ancestors for stronger consistency.
No, get_or_insert does not guarantee that you won't have duplicates. Although you are unlikely to get duplicates in this case, you are more likely to lose data. Say you are inserting two entities with no ancestors: Model.query().count() might take some time to reflect the creation of the first entity, causing the second entity to get the same ID as the first one and thus overwrite it (i.e. you end up with only the 2nd entity, carrying the ID of the first).
Model.query().count() is short for len(Model.query().fetch()) (although with some optimizations), so every time you generate an ID you fetch all entities.
I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency - entity gets by id are consistent with the latest version in the datastore, while queries are eventually consistent.
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
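For example, a minimal NDB projection-query sketch; the Post kind and its properties are made-up names, and note that projected properties must be indexed:

from google.appengine.ext import ndb

class Post(ndb.Model):
    author_id = ndb.StringProperty()
    title = ndb.StringProperty()

user_id = 'some-user'  # assumed: the ID of the user whose posts we want

# Projection queries read the requested values straight from the index,
# which is billed as a small operation rather than a full entity read.
titles = Post.query(Post.author_id == user_id).fetch(projection=[Post.title])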
There are several cases.
First, if you keep track of all the IDs of a user's posts: you must use an entity group for consistency, which means a write rate to the datastore of roughly 1 entity per second. The cost is 1 read for the object holding the IDs plus 1 read per entity.
Second, if you just use a query: this doesn't need consistency. The cost is 1 read plus 1 read per entity retrieved.
Third, if you query for keys only and fetch the entities afterwards: the cost is 1 read plus 1 small operation per key retrieved. See Keys-Only Queries. In cost this is equal to projection queries.
And if you have many results and use pagination, then you should use Query Cursors, which prevent wasteful datastore usage.
The most economical solution is the third case. See Batch Operations.
In case you have a list of IDs because they are stored with your entity, a call to ndb.get_multi (if you are using NDB, but it would be similar with any other framework that uses memcache to cache single entities) would save you further datastore calls if all (or most) of the entities corresponding to the keys are already in memcache.
So in the best possible case (everything is in the memcache), the datastore wouldn't be touched at all, while using a query would.
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.
I'm using the HRD on Appengine.
Say I have a query that cuts across entity groups (i.e. not an ancestor query). I understand that the set of results returned by this query may not be consistent:
For example, the query may return 4 entities {A, B, C, D} even though a 5th entity, E, also matches the query. This makes sense.
However, in the inconsistent query above, is it ALSO the case that any of the results in the set may themselves be inconsistent (i.e. their fields are not the freshest)? That is, if A has a property called foo, is foo consistent?
My question boils down to, which part of the query is inconsistent - the set of results, the properties of the returned results, or both?
Eventual consistency applies to both the entities themselves and the indexes. This means that if you modify an entity, then query with a filter that matches only the modified one (not the value before modification), you could get no records. It also means that potentially you could get entities back from a query whose current versions do not match the index criteria they were fetched for.
You can ensure you have the latest copy of an entity by doing a consistent get (though outside a transaction, this is fairly meaningless, since it could have changed the moment you do the get), but there's no equivalent way to do a consistent index lookup.
I think the answer is that inconsistency can occur both in the set of results and in the properties of the returned results, because inconsistency occurs when you query a replica (or data center, as the Google docs put it) that doesn't yet know about some write you made earlier. And the write can be anything: creating a new entity or updating an existing one.
So if you have, for example, an entity A with property x and you:
update x on A to 50 (previously it was 40)
query for entities with x >= 30
Then you will certainly get this entity in the result set, but it can have an old value of x (40) if the replica you queried didn't yet know about your update.
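To make the scenario concrete, a small NDB sketch using the kind A and property x from the example above:

from google.appengine.ext import ndb

class A(ndb.Model):
    x = ndb.IntegerProperty()

a = A.query().get()  # some existing entity with x == 40
a.x = 50
a.put()

# Shortly afterwards, possibly served by a replica that hasn't seen the write:
results = A.query(A.x >= 30).fetch()
# The entity is certainly in the result set (both 40 and 50 match x >= 30),
# but per this answer the returned copy may still carry the old value, 40.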