How can I store the date with datastore? - google-app-engine

Datastore documentation is very clear that there is an issue with "hotspots" if you include 'monotonically increasing values' (like the current unix time), however there isn't a good alternative mentioned, nor is it addressed whether storing the exact same (rather than increasing values) would create "hotspots":
"Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates."
https://cloud.google.com/datastore/docs/best-practices
I would like to store the time when each particular entity is inserted into the datastore, if that's not possible though, storing just the date would also work.
That almost seems more likely to cause "hotspots" though, since every new entity for 24 hours would get added to the same index (that's my understanding anyway).
Perhaps there's something more going on with how indexes work (I am having trouble finding great explanations of exactly how they work) and having the same value index over and over again is fine, but incrementing values is not.
I would appreciate if anyone has an answer to this question, or else better documentation for how datastore indexes work.

Is your application actually planning on querying the date? If not, consider simply not indexing that property. If you only need to read that property infrequently, consider writing a mapreduce rather than indexing.
That advice is given due to the way BigTable tablets work, which is described here: https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
To the best of my knowledge, it's more important to have the primary key of an entity not be a monotonically increasing number. It would be better to have a string key, so the entity can be stored with better distribution.
But saying this as a non-expert, I can't imagine that indexes on individual properties with monotonic values would be as problematic, if it's legitimately needed. I know with the Nomulus codebase for example, we had a legitimate need for an index on time, because we wanted to delete commit logs older than a specific time.
One cool thing I think happens with these monotonic indexes is that, when these tablet splits don't happen, fetching the leftmost or rightmost element in the index actually has better latency properties than fetching stuff in the middle of the index. For example, if you do a query that just grabs the first result in the index, it can actually go faster than a key lookup.

There is a key quote in the page that Justine linked to that is very helpful:
As a developer, what can you do to avoid this situation? ... Lower your write rate, or figure out how to better distribute values.
It is ok to store an indexed time stamp as long as that entity has a low write rate.
If you have an entity where you want to store an indexed time stamp and the entity has a high write rate, then the solution is to split the entity into two entities. Entity A will have properties that need to be updated frequently and entity B will have the time stamp and properties that don't get updated often.
When I do this, I have a common ID for the two entities to make it really easy to get from one to the other.

You could try storing just the date and put random hours, minutes, and seconds into the timestamp, then throw away that extra data later. (Or keep the hours and minutes and use random seconds, for example). I'm not 100% sure this would work but if you need to index the date it's worth trying.

Related

If I don’t use a tag in a query, does this decrease cardinality?

I have the following problem with InfluxDb: it overwires values with the same tag set and timestamp (a terrible design choice in my view).
Now to sidestep this in a cost effective manner one idea is to make a tag (e.g. value_id) which is unique and constantly increasing.
I know this will bloat cardinality to a point where query time will be super slow.
My question is: if I don’t use this random tag (value_id) in my query, but have it in the db, will this still-affect the speed of my queries?
If it does not it sounds like a “solution” to my problem.
P.S. I am aware that adding a nanosecond or arbitrary tag are two "solutions" suggested by InfluxDB, but neither sounds good and neither work reliably without a large cost.
Can you explain your use-case and why you need to write different values with the same time and tagset?
To answer your question: yes, this is damage your write and query time.
InfluxDB has a Seriesfile that stores a mapping if your series key to a unique identifier. This lookup and potential write to the Seriesfile happens on every write and read. The biggest this file gets, the slower these operations get.
It's not actually cardinality from a query POV that is bad, TSI enables billions of series; however the Seriesfile hasn't been optimised for these workloads yet.

What is the best moment to create SQL indexes?

When starting a project, should SQL indexes be created at the beginning?
I have a project where I haven´t created any indexes yet in production. The table that will grow most has 30000 rows and I have measured the time of the queries against this table creating an index and deleting it afterwards. The times are very similar.
I have decided to postpone the creation of the indexes in production until I notice a reduction of the response time in queries when creating them.
Is my approach correct? Or should I create them now?
I'm pretty deep into the topic of database indexing (it's actually my full-time job, also wrote a book about it (SQL Performance Explained) which is available for free here).
In my opinion, indexes should be created at the time you write the query because this is the time you have all the required information needed to decide which indexes to create in your head. In other words, if you do it at that time, it doesn't take you any extra effort. Another reason is that indexing sometimes affects the way you have to write the query so it can actually take benefit of that index.
However, the above statement assumes that you know how indexes work so you can decide which indexes to create. If you don't know that, I'd really suggest to learn about proper indexing first. Again, the book I've written is available for free on the web (Table of Contents). According to a recent survey, it takes you about 4-5 hours to read through it. Well-spent time, I'd say.
However, due to the ludicrous speed of modern hardware and vast amount of memory—even cheap commodity hardware—it is absolutely possible that you cannot measure any difference with these small tables (30k is small in DB world) yet. Nevertheless, you because you cannot measure this difference with a timers resolution of maybe 10ms, it doesn't mean the difference isn't there. Further: did you verify that the index was actually used? Are you sure the index you created was a good index for the given query?
Never the less, if the overall system is fast enough for you at the moment, sure you can go on without indexes. The risk remains, however, that it isn't fast enough on the day a major news outlet covers your app. What is supposed to be your best day might turn out to become your worst day :(
You didn't tell us a lot about your app, so I've to do some guesswork. I guess it is more like an OLTP app like an online website (as opposed to BI/OLAP). Although indexes add some overhead to write operations (insert, update, delete and merge), this is typically small compared to the benefit they bring to select (still assuming OLTP). Sure you can misuse indexes (e.g., creating hundreds on a single table) so that the overhead becomes a major problem too. But adding "a few" indexes on an OLTP table will most certainly not cause any problems due to the maintenance overhead.
Coming to an end: if you already know which indexes are good for your queries (verify it using explain), add them now before it is too late. If you are not sure, I'd still suggest to put some effort on that now. If you are not afraid of load peaks taking your app down, go on without indexes.
If you need more help, create a new question containing your query, table & index definitions as well as the explain output and people will be happy to help you figuring out if that index is fine or not.
Just create them now based on sensible choices: start with primary and foreign keys - thst'll keep your joins fast - then add indexes on single columns you'll be searching on (name, phone, etc) you are using.
Avoid creating multiple column indexes until you have a demonstrated performance problem and you can prove that an index helps. Often, reworking the query will fix the problem better than some complicated index.
The only time I delay creating indexes is if I'm about to load a heap of data and building indexes before loading means a much slower load as the index is updated for every row addition, although some databases allow the index rebuild to be deferred until after the load, so even then there's no point in waiting.

Time to retrieve a single record via a SQL Server index in a large table

Short version of the question:
If you have a table with a large number of small rows and you want to retrieve a single record from this table via an index probably consisting of two columns is this likely to be something that wil be low cost and fast or high cost and slow
Longer version of question and background:
I am a consultant working with a software development company and I have an argument with them about the performance implications of a piece of functionality that I want to add to the application they are building (and I am designing).
At the moment, we write out a log record every time somebody retrieves a client record. I want to put the name and time of the last person prevously to access that record onto the client page each time that record is retrieved.
They are saying that the performance implications of this will be high but based on my reasonable but not expert knowledge of how B trees work, this doesn't seem right even if the table is very large.
If you create an index on the GUID of the client record and the date/time of access (descending), then you ought to be able to retrieve the required record via an index scan which would just need to find the first entry for that GUID and then stop? And that with a b-tree index, most of the index would be cached so the number of physical disc accesses needed would be very small and the query time therefore significantly less than 1s.
Or have I got this completely wrong
You will have problems with GUID index fragmentation but because your rows do not increase in size (as you said in the comments) you will not have page-splitting problems. The random insert issue is fixable by doing reorganizing and rebuilding.
Besides that, there is nothing wrong with your approach. If the table is larger than RAM you will likely have a single disk IO per access (the intermediate index levels will be cached). If your data fits in RAM you will pay about 0.2 to 0.5ms per query. If your data is on a magnetic disk a seek will likely require 8-12ms. On an SSD you are back to 0.2ms to 0.5ms (maybe 0.05ms more).
Why don't you just create some test data (by selecting a cross product from sys.object of 1M rows) and measure it. It takes little time and you will find out for sure.
should be low cost and fast since the columns are indexed and that would be O(n) I think
You say last person to access? You mean that for every read you will have a write?
And that write is going to change an indexed date time column?
Then I would be worried too.
Writing on each record read will cause you lots of extra disk writes. This will block reads and it might be bad to your caching too. You also need to update your index a lot, and since you change the indexed data your index will be very fragmented.
It depends.
A single retrieval will be low cost and fast
on a decent indexed table
running on decent hardware
over a decent network
On the other hand, it takes time nonetheless.
If we are talking about one retrieval per hour, don't sweat over it. If we are talking about thousands of retrievals per second (as opposed to currently none) it will start to add up to the point it would be noticable.
Some questions you need to adress
Is my hardware up to spec
Does adding two fields result in a page split (unlikely)
How many extra pages need to be read for your regular result sets
How many retrievals/sec will be made
How many inserts/sec (triggering an index update) will be made
After you've adressed these questions, you should be able to make the determination yourself. As far as my gut feelings go, I would be surprised you would notice the performance difference.

database row/ record pointers

I don't know the correct words for what I'm trying to find out about and as such having a hard time googling.
I want to know whether its possible with databases (technology independent but would be interested to hear whether its possible with Oracle, MySQL and Postgres) to point to specific rows instead of executing my query again.
So I might initially execute a query find some rows of interest and then wish to avoid searching for them again by having a list of pointers or some other metadata which indicates the location on a database which I can go to straight away the next time I want those results.
I realise there is caching on databases, but I want to keep these "pointers" else where and as such caching doesn't ultimately solve this problem. Is this just an index and I store the index and look up by this? most of my current tables don't have indexes and I don't want the speed decrease that sometimes comes with indexes.
So whats the magic term I've been trying to put into google?
Cheers
In Oracle it is called ROWID. It identifies the file, the block number, and the row number in that block. I can't say that what you are describing is a good idea, but this might at least get you started looking in the right direction.
Check here for more info: http://www.orafaq.com/wiki/ROWID.
By the way, the "speed decrease that comes with indexes" that you are afraid of is only relevant if you do more inserts and updates than reads. Indexes only speed up reads, so if the read ratio is high, you might not have an issue and an index might be your best solution.
most of my current tables don't have
indexes and I don't want the speed
decrease that sometimes comes with
indexes.
And you also don't want the speed increase which usually comes with indexes but you want to hand-roll a bespoke pseudo-cache instead?
I'm not being snarky here, this is a serious point. Database designers have expended a great deal of skill and energy into optimizing their products. Wouldn't it be more sensible to learn how to take advantage of their efforts rather re-implementing some core features?
In general, the best way to handle this sort of requirement is to use the primary key (or in fact any convenient, compact unique identifier) as the 'pointer', and rely on the indexed lookup to be swift - which it usually will be.
You can use ROWID in more DBMS than just Oracle, but it generally isn't recommended for a variety or reasons. If you succumb to the 'every table has an autoincrement column' school of database design, then you can record the autoincrement column values as the identifiers.
You should have at least one index on (almost) all of your tables - that index will be for the primary key. The exception might be for a table so small that it fits in memory easily and won't be updated and will be used enough not to be evicted from memory. Then an index might be a distraction; however, such tables are typically seldom updated so the index won't harm anything, and the optimizer will ignore it if the index doesn't help (and it may not).
You may also have auxilliary indexes. In a system where most of the activity is reading the data, you may want to erro on the side of having more indexes rather than fewer, because access time is most critical. If your system was update intensive, then you would go with fewer indexes because there is a cost associated with updating indexes when data is added, removed or updated. Clearly, you need to design the indexes to work well with the queries that your users actually perform (or your applications perform).
You may also be interested in cursors. (Note that the index debate is still valid with cursors.)
Wikipedia definition here.

should nearly unique fields have indexes

I have a field in a database that is nearly unique: 98% of the time the values will be unique, but it may have a few duplicates. I won't be doing many searches on this field; say twice a month. The table currently has ~5000 records and will gain about 150 per month.
Should this field have an index?
I am using MySQL.
I think the 'nearly unique' is probably a red herring. The data is either unique, or it's not, but that doesn't determine whether you would want to index it for performance reasons.
Answer:
5000 records is really not many at all, and regardless of whether you have an index, searches will still be fast. At that rate of inserts, it'll take you 3 years to get to 10000 records, which is still also not many.
I personally wouldn't bother with adding an index, but it wouldn't matter if you did.
Explanation:
What you have to think about when deciding to add an index is the trade-off between insertion speed, and selection speed.
Without an index, doing a select on that field means MySQL has to walk over every single row and read every single field. Adding an index prevents this.
The downside of the index is that each time data gets inserted, the DB has to update the index in addition to adding the data. This is usually a small overhead, but you'd really notice it if you had loads of indexes, and were doing a lot of writes.
By the time you get this many rows in your database, you'd want an index anyway as otherwise your selects would take all day, but it's just something to be aware about so that you don't end up adding indexes on fields "just in case I need it"
That's not very many records at all; I wouldn't bother making any indexes on that table. The relative uniqueness of the field is irrelevant - even on years-old commodity hardware I'd expect a query on that table to take a fraction of a second.
you can use the general rule of thumb: optimize when it becomes a problem. Just don't use an index until you notice you need one.
From what you say, it doesn't sound like an index is necessary. Rule of thumb is index fields that are being used in SELECTS a lot to speed up the searching, which in turn (can) slows down INSERTS and UPDATES.
On a recordset as small as yours, I don't think you will see much of a real world hit either way.
If you'll only be doing searches on it twice a month and its that few rows then I would say don't index it. Its all but useless.
No. There aren't many records and it's not going to be frequently queried. No need to index.
It's really a judgement call. With such a small table you can search reasonably quickly without an index, so you could get by without it.
On the other hand, the cost of creating an index you don't really need is pretty low, so you're not saving yourself much by not doing it.
Also, if you do create the index, you're covered for the future if you suddenly start getting 1000 new records/week. Possibly you know enough about the situation to say for certain that that will never happen, but requirements do have a way of changing when you least expect.
EDIT: As far as changing requirements, the thing to consider is this: If the DB does grow and you find out later that you do need an index, can you simply create the index and be done? Or will you also need to change lots of code to make use of the new index?
It depends. As others have responded, there's a trade off between table update speed and selection speed. Table update includes inserts, updates, and deletes on the table.
One question you didn't address. Does the table have a primary key, and a corresponding index? A table with no indexes usually benefits form having at least one index. The most common way of getting that index is to declare a primary key, and rely on the DBMS to generate an index accordingly.
If a table has no candidates for primary key, that usually indicates a serious flaw in table design. That's a separate issue and should get a spearate discussion.

Resources