Concurrent queries in PostgreSQL - what is actually happening? - database

Let us say we have two users running a query against the same table in PostgreSQL. So,
User 1: SELECT * FROM table WHERE year = '2020' and
User 2: SELECT * FROM table WHERE year = '2019'
Are they going to be executed at the same time as opposed to executing one after the other?
I would expect that if I have 2 processors, I can run both at the same time. But I am thinking that matters become far more complicated depending on where the data is located (e.g. disk) given that it is the same table, whether there is partitioning, configurations, transactions, etc. Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned? Under which circumstances will I get my desired behaviour and under which circumstances will I not?
EDIT: I have found this other question which is very close to what I was asking - https://dba.stackexchange.com/questions/72325/postgresql-if-i-run-multiple-queries-concurrently-under-what-circumstances-wo. It is a bit old and doesn't have much answers, would appreciate a fresh outlook on it.

If the two users have two independent connections and they don't go out of their way to block each other, then the queries will execute at the same time. If they need to access the same buffer at the same time, or read the same disk page into a buffer at the same time, they will use very fast locking/coordination methods (LWLocks, spin locks, or atomic operations like CAS) to coordinate that. The exact techniques vary from version to version, as better methods become widely available on supported platforms and as people find the time to change the implementation to use those better methods.
I can ensure that I get my desired behaviour as far as PostgreSQL is concerned?
You should always get the correct answer to your query (Or possibly some kind of ERROR indicating a failure to serialize if you are using the highest (and non-default) isolation level, but that doesn't seem to be a risk if each of those queries is run in a single-statement transaction.)
I think you are overthinking this. The point of using a database management system is that you don't need to micromanage it.
Also, "parallel-query" refers to a single query using multiple CPUs, not to different queries running at the same time.

Related

Index on array element attribute extremely slow

I'm new to mongodb but not new to databases. I created a collection of documents that look like this:
{_id: ObjectId('5e0d86e06a24490c4041bd7e')
,
,
match[{
_id: ObjectId(5e0c35606a24490c4041bd71),
ts: 1234456,
,
,}]
}
So there is a list of objects on the documents and within the list there might be many objects with the same _id field. I have a handful of documents in this collection and my query that selects on selected match._id's is horribly slow. I mean unnaturally slow.
Query is simply this: {match: {$elemMatch: {_id:match._id }}} and literally hangs the system for like 15 seconds returning 15 matching documents out of 25 total!
I put an index on the collection like this:
collection.createIndex({"match._id" : 1}) but that didn't help.
Explain says execution time is 0 and says it's using the index but it still takes 15 seconds or longer to complete.
I'm getting the same slowness in nodejs and in compass.
Explain Output:
{"explainVersion":"1","queryPlanner":{"namespace":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef.lead","indexFilterSet":false,"parsedQuery":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"maxIndexedOrSolutionsReached":false,"maxIndexedAndSolutionsReached":false,"maxScansToExplodeReached":false,"winningPlan":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"inputStage":{"stage":"IXSCAN","keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]}}},"rejectedPlans":[]},"executionStats":{"executionSuccess":true,"nReturned":15,"executionTimeMillis":0,"totalKeysExamined":15,"totalDocsExamined":15,"executionStages":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"docsExamined":15,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]},"keysExamined":15,"seeks":1,"dupsTested":15,"dupsDropped":0}},"allPlansExecution":[]},"command":{"find":"lead","filter":{"match":{"$elemMatch":{"_id":"5e0c3560e5a9e0cbd994fa52"}}},"skip":0,"limit":0,"maxTimeMS":60000,"$db":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef"},"serverInfo":{"host":"Dans-MacBook-Pro.local","port":27017,"version":"5.0.9","gitVersion":"6f7dae919422dcd7f4892c10ff20cdc721ad00e6"},"serverParameters":{"internalQueryFacetBufferSizeBytes":104857600,"internalQueryFacetMaxOutputDocSizeBytes":104857600,"internalLookupStageIntermediateDocumentMaxSizeBytes":104857600,"internalDocumentSourceGroupMaxMemoryBytes":104857600,"internalQueryMaxBlockingSortMemoryUsageBytes":104857600,"internalQueryProhibitBlockingMergeOnMongoS":0,"internalQueryMaxAddToSetBytes":104857600,"internalDocumentSourceSetWindowFieldsMaxMemoryBytes":104857600},"ok":1}
The explain output confirms that the operation that was explained is perfectly efficient. In particular we see:
The expected index being used with tight indexBounds
Efficient access of the data (totalKeysExamined == totalDocsExamined == nReturned)
No meaningful duration ("executionTimeMillis":0 which implies that the operation took less than 0.5ms for the database to execute)
Therefore the slowness that you're experiencing for that particular operation is not related to the efficiency of the plan itself. This doesn't always rule out the database (or its underlying server) as the source of the slowness completely, but it is usually a pretty strong indicator that either the problem is elsewhere or that there are multiple factors at play.
I would suggest the following as potential next steps:
Check the mongod log file (you can confirm its location by running db.adminCmd("getCmdLineOpts") via the shell connected to the instance). By default any operation slower than 100ms is captured. This will help in a variety of ways:
If there is a log entry (with a meaningful duration) then it confirms that the slowness is being introduced while the database is processing the operation. It could also give some helpful hints as to why that might be the case (waiting for locks or server resources such as storage for example).
If an associated entry cannot be found, then that would be significantly stronger evidence that we are looking in the wrong place for the source of the slowness.
Is the operation that you gathered explain for the exact one that the application and Compass are observing as being slow? Were you connected to the same server and namespace? Is the explained operation simplified in some way, such as the original operation containing sort, projection, collation, etc?
As a relevant example that combines these two, I notice that there are skip and limit parameters applied to the command explained on a mongod seemingly running on a laptop. Are those parameters non-zero when running the application and does the application run against a different database with a larger data set?
The explain command doesn't include everything that an application would. Notably absent is the actual time it takes to send the results across the network. If you had particularly large documents that could be a factor, though it seems unlikely to be the culprit in this particular situation.
How exactly are you measuring the full execution time? Does it potentially include the time to connect to the database? In this case you mentioned that Compass itself also demonstrates the slowness, so that may rule out most of this.
What else is running on the server hosting the database? Is there a container or VM involved? Would the database or the underlying server be experiencing resource contention due to concurrency?
Two additional minor asides:
25 total documents in a collection is extremely small. I would expect even the smallest hardware to be able to process such a request without an index unless there was some complicating factor.
Assuming that match is always an array then the $elemMatch operator is not strictly necessary for this particular query. You can read more about that here. I would not expect this to have a performance impact for your situation.

PostgreSQL performance testing - precautions?

I have some performance tests for an index structure on some data. I will be comparing 2 indexes side-by-side (still not decided if I will be using 2 VMs). I require results to be as neutral as possible of course, so I have these kinds of questions which I would appreciate any input about... How can I ensure/control what is influencing the test? For example, caching effects/order of arrival from one test to another will influence the result. How can I measure these influences? How do I create a suitable warm-up? Or what kind of statistical techniques can I use to nullify such influences (I don't think just averages is enough)?
Before you start:
Make sure your tables and indices have just been freshly created and populated. This avoids issues with regard to fragmentation. Otherwise, if the data in one test is heavily fragmented, and the other is not, you might not be comparing apples to apples.
Make sure your tables are properly ANALYZEd. This makes sure that the query planner has proper statistics in all cases.
If you just want a comparison, and not a test under realistic use, I'd just do:
Cold-start your (virtual) machine. Wait a reasonable but fixed time (let's say 5 min, or whatever is reasonable for your system) so that all startup processes have taken place and do not interfere with the DB execution.
Perform test with index1, and measure time (this is timing where you don't have anything cached by either the database nor the OS).
If you're interested in results when there are cache effects: Perform test again 10 times (or any number of times as big as reasonable). Measure each time, to account for variability due to other processes running on the VM, and other contingencies.
Reboot your machine, and repeat the whole process for test2. There are methods to clean the OS cache; but they're very system dependent, and you don't have a way to clean the database cache. Check See and clear Postgres caches/buffers?.
If you are really (or mostly) interested in performance when there are no cache effects, you should perform the whole process several times. It's slow and tedious. If you're only interested in the case where there's (most probably) a cache effect, you don't need to restart again.
Perform an ANOVA (or any other statistical hypothesis test you might think more suited) to decide if your average time is statistically different or not.
You can see an example of performing several tests in the answer to a question about NOT NULL versus CHECK(xx NOT NULL).
As neutral as possible, then create two databases on the same instance of your database management system, then create the same tablespaces with data, using indexes on one instance but not the other.
The challenge with a VM is you have arbitrated access to your disk resources ( unless you have each VM pinned to a specific interface and disk set ). Because of this, your arbitration model could vary from one test to the next. The most neutral course, which removes the arbitration, is on physical hardware....and the same hardware in both cases.

Synchronizing Database Access in a Distributed App

A common bit of programming logic I find myself implementing often is something like the following pseudo-code:
Let X = some value
Let Database = some external Database handle
if !Database.contains(X):
SomeCalculation()
Database.insert(X)
However, in a multi-threaded program we have a race condition here. Thread A might check if X is in Database, find that it's not, and then proceed to call SomeCalculation(). Meanwhile, Thread B will also check if X is in Database, find that it's not, and insert a duplicate entry.
So of course, this needs to be synchronized like:
Let X = some value
Let Database = some external Database handle
LockMutex()
if !Database.contains(X):
SomeCalculation()
Database.insert(X)
UnlockMutex()
This is fine, except what if the application is a distributed app, running across multiple computers, all of which communicate with the same back-end database machine? In this case, a Mutex is useless, because it only synchronizes a single instance of the app with other local threads. To make this work, we'd need some kind of "global" distributed synchronization technique. (Assume that simply disallowing duplicates in Database is not a feasible strategy.)
In general, what are some practical solutions to this problem?
I realize this question is very generic, but I don't want to make this a language-specific question because this is an issue that comes up across multiple languages and multiple Database technologies.
I intentionally avoided specifying whether I'm talking about an RDBMS or SQL Database, versus something like a NoSQL Database, because again - I'm looking for generalized answers based on industry practices. For example, is this situation something that Atomic Stored Procedures might solve? Or Atomic Transactions? Or is this something that requires something like a "Distributed Mutex"? Or more generally, is this problem generally addressed by the Database system, or is it something the Application itself should handle?
If it turns out this question is impossible to answer at all without further information, please tell me so I can modify it.
One sure way to ensure against data stomping is to lock the data row. Many databases allow you to do that, via transactions. Some don't support transactions.
However, this is overkill for most cases, where contention is low in general. You might want to read up on Isolation levels to get more background on the topic.
A better general approach is often Optimistic Concurrency. The idea behind it is that each data row includes a signature, a timestamp works fine but the signature need not be time oriented. It could be a hash value, for example. This is a general concurrency management approach and is not limited to relational stores.
The app that changes data first reads the row, and then performs whatever calculations it requires, and then at some point, writes the updated data back to the data store. Via Optimistic concurrency, the app writes the update with the stipulation (expressed in SQL if it is a SQL database) that the data row must be updated only if the signature has not changed in the interim. And, each time a data row is updated, the signature must be updated as well.
The result is that updates don't get stomped on. But for a more rigorous explanation of the concurrency issues, refer to that article on DB Isolation levels.
All distributed updaters must follow the OCC convention (or something stronger, like transactional locking) in order for this to work.
You can obviously move the "synch" part to the DB layer itself, using an exclusive lock on a specific resource.
This is a bit extreme (in most cases, attempting the insert and managing the exception when you actually discover that someone already inserted the row) would be more adequate, I think.
Well, since you ask an general question, I will try to provide another option. Its not very orthodox, but may be useful: You could "define" a machine or a process responsible for doing that. For example:
Let X = some value
Let Database = some external Database handle
xResposible = Definer.defineResponsibleFor(X);
if(xResposible == me)
if !Database.contains(X):
SomeCalculation()
Database.insert(X)
The trick here is to make defineResponsibleFor always return the same value independent of who is calling. So, if you have a fair distributed range of X and a fair Definer all machines will have work to do. And you could use simple threading mutex to avoid race conditions. Of course, now you have to take care of fail-tolerance (if a machine or process is out of business your Definer must know and not define any job for it). But you should do this anyawy... :)

Design of database to record failed logins to prevent brute forcing

There have been a couple of questions about limiting login attempts, but none have really discussed the advantages or disadvantages or different ways of storing the record of login attempts (most have focused on the issue of throttling vs captchas, for instance, which is not what I'm interested in). Instead, I'm trying to figure out the best way to record this information in a database that would be useful but not impact performance too much.
Approach 1) Create a table that would record: ip address, username tried, and attempt time. Then when a user tries to login, I run a select to determine how many times a username was tried in the last 5 minutes (or how many times a particular ip address tried in the last 5 minutes). If over a certain amount, then either throttle the login attempts or display a captcha.
This allows throttling based both on username and on ip address (if the attack is trying many different usernames with a single password), and creates an easily auditable record of what happened, but requires an INSERT for every login attempt, and at least one extra SELECT.
Approach 2) Create a similar table that records the number of failed attempts and the last login time. This could result in a smaller database table and faster SELECT statements (no COUNT needed), but offers a little less control and also less data to analyze.
Approach 3) Similar to approach 2 but store this information in the User table itself. This means that an additional SELECT statement would not be needed, although it would add extra, perhaps unnecessary information to the users table. It also wouldn't allow the extra control and information that approach 1 would offer.
Additional approaches: store the login attempts in a mysql MEMORY table or a sqlite in-memory table. This would reduce performance loss due to disk performance, but still requires database calls, and wouldn't allow long term auditing of login attempts because the data wouldn't be persistent.
Any thoughts on a good way to do this or what you yourself have implemented?
It sounds like you're optimizing prematurely.
Implement it in the way that other developers would expect it to be implemented, and then measure performance. If you find you need an optimization at that point, then take some steps based on your measurements.
I would expect Approach #1 to be the most natural solution.
As a side note: I'd neatly separate the portion of code that a) decides if someone is hacking, and b) decides what action to take. Chances are you'll make several attempts to get this portion correct over optimizing the queries.
A couple of extra simple statements aren't going to make any noticeable difference. Logins are rare - relatively speaking. If you want to make sure automated login attempts don't cause a performance problem, you just have to have a second threshold after which you
Timestamp the abuse
Stop counting (and thus stop inserting/updating)
Block the ip for a few hours.
This threshold must be much higher than the one that throttles/introduces a captcha.

Am i right to sacrifice database design fundamentals in this case for the sake of speed?

I work in a company that uses single table Access database for its outbound cms, which I moved to a SQL server based system. There's a data list table (not normalized) and a calls table. This has about one update per second currently. All call outcomes along with date, time, and agent id are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists sorted to give an even spread throughout their set). Note a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is not obviously the speed at which a new record could be added to the calls table live, but when the user app is closed/opened and loads the user's data set again I need to check which records have not been called today - I would need to run a stored proc on the server that picked the last most call from the calls table and check if its calldate didn't match today's date. I believe a more expensive query than checking if a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design, the main limitation is my inexperience. This is my first SQL server system. It's pretty critical, and I had to ensure it would work and I could easily dump data back to access db during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working Database may become the major flaw here. Rather try to optimize what you have got running currently instead if starting from scratch. Think of indices, referential integrity, key assigning methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now lets get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking nomalisation.
The classic example is an Order table and an Order-Detail table and the Order header table has "total price" where that value was derived from the Order-Detail and related tables. Having total price on Order in this case still make sense, but it breaks normalisation.
A normalised database is meant to give your database high maintainability and flexibility. But optimising for performance is one of the considerations that physical design considers. Look at reporting databases for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long, is there some features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure, you should be able to use a select statement to get the max(id) by the user id, or the max(id) in the table, depending on what you need to do.
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed precisely because speed is only determined by physical design choices, the prime desision factor in which is precisely speed and performance.

Resources