Do composite index updates maintain same order as original updates - google-app-engine

I have read the documentation at the link below, but have this important question still open. Do composite indexes become consistent in the same order as the original entity update? For example, let's say the same indexed property that is part of a composite index gets updated for rec1, rec2, and rec3. The recs get updates one second apart (rec1=T0, rec2=T0+1, rec3=T0+2). As the index updates get fanned out, can one assume that the indexes become eventually consistent in the same order as the updates? IOW, the index consistency for rec1 precedes the consistency for rec2 which precedes the consistency for rec3. Not asking if the consistency is the same one second apart (that is not important), but more simply whether the order for becoming consistent stays the same. Or is it possible that rec3's index will become consistent before rec2 or rec1. Many thanks. -stevep
Link: http://code.google.com/appengine/articles/life_of_write.html

Only Google can reliably answer this question.
OTOH, if you look at the link you provided, under the Apply Steps it says:
Since each index can live in a separate location in Bigtable, these writes
can be fanned out in parallel to multiple tablet servers.
Since the indexes are written in parallel on multiple servers, I'd say there is no guarantee that they are written in some order.

Related

Disadvantages of creating multiple indexes in a PostgreSQL table

After reading documentation on indexes I thought
Hey, since (in my case) almost always reading from database is performed much more often than writing to it, why not create indexes on most fields in the table?
Is it right attitude? Are there any other downsides of doing this, except longer inserting?
Of course, indexes would be limited to fields that I actually use in conditions of SELECT statements.
Indexes have several disadvantages.
First, they consume space. This may be inconsequential, but if your table is particularly large, it may have an impact.
Second, and more importantly, you need to remember that indexes have a performance penalty when it comes to INSERTing new rows, DELETEing old ones or UPDATEing existing values of the indexed column, as now the DML statement need not only modify the table's data, but the index's one too. Once again, this depends largely on your application usecase. If DMLs are so rare that the performance is a non-issue, this may not be a consideration.
Third (although this ties in storngly to my first point), remember that each time you create another database object, you are creating an additional maintenance overhead - it's another index you'd have to occasionally rebuild, collect statistics for (depending on the RDBMS you're using, of course), another objet to clatter the data dictionary, etc.
The bottom line all comes down to your usecase. If you have important queries that you run often and that can be improved by this index - go for it. If you're running this query once in a blue moon, you probably wouldn't want to slow down all your INSERT statements.
There also is a (small) performance downside of having many indexes during read operations: the more you have, the more time Postgres spends evaluating which query plan is best.
As you correctly state, there's no point in having an index on fields that you don't use on conditions. It merely slows down inserts.
Note that there also is little point in having an index on fields that offer dubious selectivity. Picture a table that is roughly split in two depending on whether rows have a boolean set to true or false. An index on that field offers little benefit for queries if true and false values distributed mostly evenly — the index could be useful, from lack of better options, if true and false rows are clustered together. If, in contrast, that field is 90% true, a partial index on the 10% that are false is useful.
And lastly, there is the storage issue: each index takes space. Sometimes (GIN in particular) a few times more than the indexed data.

Do Cursors deal with Eventual Consistency?

In the App Engine Documentation I found an interesting strategy for keeping up to date with changes in the datastore by using Cursors:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
However, I'm not quite sure how this can always work. After all, when using the High Replication Datastore, queries are only eventually consistent. So if two entities are put, and only the later of the two is seen by the query, it will move the cursor past both of them. Which will mean that the first of the two new entities will remain unseen.
So is this an actual issue? Or is there some other way that cursors work around this?
Having an index, builtin or composite, on a property that contains a monotonically increasing value (such as the current timestamp) may not perform as well as you may want at high write rates. This type of workload will generate a hotspot, as the tail of the index is constantly being updated as opposed to the load being distributed throughout the sorted index. However, for low write-rates, this will work fine.
The rest of the answer will depend on whether you are in the same entity group or separate entity groups.
If your query is an ancestor query, and thus in the same entity group it can be strongly consistent (by default they are), and the described method should always be accurate. The query will immediately see any writes (changes to an entity inside the entity group).
If you are querying over many entities groups, which is always eventually consistent, then there is no guarantee what order the writes are applied/visible. For example:
- Time1 - Write EntityA
- Time2 - Write EntityB
- Time3 - Query only sees EntityB
- Time4 - Query sees EntityA and EntityB
So the method of using a cursor to detect a change is correct, but it may "skip" over some changes.
For more information on eventual/strong consistency, see Balancing Strong and Eventual consistency with Google Cloud Datastore
You'll probably be best informed if you could ask someone who's worked on it, but after thinking about it a bit and re-reading Paxos a bit, I think it should not be a problem, though it would depend on how the datastore's actually implemented.
A cursor is essentially a position in the index. In theory you can re-read the same cursor over and over, and see new entities start appearing after it. In the real world case, you'll generally move on to the newest cursor position and forget about the old cursor position.
Eventual consistency "problems" appear because there's multiple copies of the index spread across multiple machines. Depending on which index you read from, you may get stale results.
You describe a problem case where there are two (exact) copies of an index I, and two new entities are created, E1, and E2. Say I1 = I + E1 and I2 = I + E2, so depending on the index you read from, you might get E1 or E2 as the new entity, move your cursor, and miss an entity when the index gets "patched" with the other index, ie I2 eventually gets patched to I + E1 + E2.
If the datastore actually happens that way, then I suspect, yes, you can get a problem. However, it sounds very difficult to operate that way, and I suspect the datastore indexes only get updated after the Paxos voting comes to an agreement. So you'll never seen an out-of-order index, you'll only see entities show up late: ie, you'll never see I + E2, you'll only ever see (I) or (I + E1) or (I + E1 + E2)
I suspect though, you might get a problem where you may be able to have a cursor that's too new for an index that hasn't caught up yet.

The order of a SQL Select statement without Order By clause

As I know, from the relational database theory, a select statement without an order by clause should be considered to have no particular order. But actually in SQL Server and Oracle (I've tested on those 2 platforms), if I query from a table without an order by clause multiple times, I always get the results in the same order. Does this behavior can be relied on? Anyone can help to explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be in primary key order or the order they were created, or some other random order. more complex queries, such as select * from foo where bar < 10 may instead be returned in order of a different column, based on an index read, or by the table order, for a table scan. even more elaborate queries, with multipe where conditions, group by clauses, unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because of data that has changed between those queries. a "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it. RDBMS systems have the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk as well as over the network to send data to you), minimizing CPU and keeping the size of its working set small (using methods that require minimal temporary storage).
without an ORDER BY clause, you will have not asked exactly for a particular order, and so the RDBMS will give you those rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH use ORDER BY and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns there, you can get different results. You may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I've got answer from Tom Kyte and I post it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer on the question was placed below here only for the sake of the history. It's wrong answer. The right answer is placed above)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered
collection of rows. These rows will come out in a seemingly random
order, and depending on other options being used (parallel query,
different optimizer modes and so on), they may come out in a different
order with the same query. Do not ever count on the order of rows from
a query unless you have an ORDER BY statement on your query!
But note he only talks about heap-organized tables. But there is also index-orgainzed tables. In that case you can rely on order of the select without ORDER BY because order implicitly defined by primary key. It is true for Oracle.
For SQL Server clustered indexes (index-organized tables) created by default. There is also possibility for PostgreSQL store information aligning by index. More information can be found here
UPDATE:
I see, that there is voting down on my answer. So I would try to explain my point a little bit.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of index, all data is stored in specific order, I believe same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me please give me a link on the documenation. I'll be happy to know that there is something to learn for me.

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if i'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of say 10, then transactionally update the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Does the work indexes have something to do with the efficiency querying for all unapplied markers? i.e., it's better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server
and developed in asp.Net/C# with access databases (Not a choice so don't laugh at this limitation, it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual or is there an actual difference internally. More specifically, the reason I ask is that the way it is currently designed all records (for all databases in this project) are ordered by a row identifying key (which is an auto number field) ascending but since over 80% of my queries will be querying fields that should be towards the end of the table would it increase the query performance if I set the table to showing the most recent record at the top instead of at the end?
2- Are there any other performance tuning that can be done to help with access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes and have a primary key (automatically indexes) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing, I'm already doing all that can be done for access performance. I'll give the question "accepted answer" to the one that was the fastest to answer.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes there is an actual order to the records in the database. Setting the defaults on the table preference isn't going to change that.
I would ensure there are indexes on all your where clause columns. This is a rule of thumb. It would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with legacy access system that can be reasonably fast with concurrent users, but only for smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL server you won't need to include the primary key in the index (assuming it's clustering on this) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if only 80% of your table rows are not accessed (as implied in your question), create an archive table for older records. Or, if the bulk of your performance hits are from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table date in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field Length, create your fields as large as you'll need them but don't go over - for instance, if you have a number field and the value will never be over 1000 (for the sake of argument) then don't type it as a Long Integer, something smaller like Integer would be more appropriate, or use a single instead of a double for decimal numbers, etc. By the same token, if you have a text field that won't have more than 50 chars, don't set it up for 255, etc, etc. Sounds obvious, but it's done, often times with the idea that "I might need that space in the future" and your app suffers in the mean time.
Not to beat the indexing thing to death...but, tables that you're joining together in your queries should have relationships established, this will create indexes on the foreign keys which greatly increases the performance of table joins (NOTE: Double check any foreign keys to make sure they did indeed get indexed, I've seen cases where they haven't been - so apparently a relationship doesn't explicitly mean that the proper indexes have been created)
Apparently compacting your DB regularly can help performance as well, this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under tools Analyze > Performance, it might be worth running it on your tables & queries at least to see what it comes up with. The table analyzer (available from the same menu) can help you split out tables with alot of redundant data, obviously, use with caution - but it's could be helpful.
This link has a bunch of stuff on access performance optimization on pretty much all aspects of the database, tables, queries, forms, etc - it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here it is useful to consider how access works, in an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed by the virtue of the fact that Access / the JET engine is an ISAM database it's the other way around. (http://en.wikipedia.org/wiki/ISAM) That's rather moot however as I would never suggest putting frequently accessed values at the top of a table, it is best as others have said to rely on useful indexes.

Resources