I'm writing an application where there are multiple users, and users can upload reports within the application.
Currently, I have a 'reports' table that holds all of the submitted reports which has an 'id' field that is a serial primary key on the table.
The requirements I have specify that users need to be able to specify a prefix and a number for their reports to start counting from. For example, a user should be able to say that their reports start at ABC-100, then the next one is ABC-101, ABC-102, and so on and so forth.
The way I'm thinking of achieving this is that when a user creates an account, he can specify the prefix and start number, and I will create a Postgres sequence named after the specified prefix, with a MINVALUE of the number the user wants reports to start at.
Then when a user submits a new report, I can set the report_number to nextval('prefix_sequence'). In theory this will work, but I am pretty new to Postgres, and I want some advice and feedback on whether this is a good use of sequences or if there is a better way.
This is an area where you probably don't need the key benefit of sequences - that they can be used concurrently by multiple transactions. You may also not want the corresponding downside, that gaps in sequences are normal. It's quite normal to get output like 1, 2, 4, 7, 8, 12, ... if you have concurrent transactions, rollbacks, etc.
In this case you are much better off with a counter. When a user creates an account, create a row in an account_sequences table like (account_id, counter). Do not store it in the main table of accounts, because you'll be locking and updating it a lot, and you want to minimise VACUUM workload.
e.g.
CREATE TABLE account_sequences
(
    account_id integer PRIMARY KEY REFERENCES account(id),
    counter integer NOT NULL DEFAULT 1
);
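For example, when an account signs up wanting its reports to start at 100, you might seed its row like this (a hypothetical INSERT; the stored value is one less than the desired start because the function shown next increments before returning):
-- store 99 so the first generated report number is 100
INSERT INTO account_sequences (account_id, counter) VALUES (42, 99);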
Now write a simple LANGUAGE SQL function like
CREATE OR REPLACE FUNCTION account_get_next_id(integer)
RETURNS integer VOLATILE LANGUAGE sql AS
$$
    UPDATE account_sequences
    SET counter = counter + 1
    WHERE account_id = $1
    RETURNING counter
$$;
You can then use this in place of nextval.
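For example, submitting a report might look like this (the reports columns here are assumptions; account 42 is the hypothetical account seeded above):
INSERT INTO reports (account_id, report_number, title)
VALUES (42, account_get_next_id(42), 'First report');
-- the first call returns 100, the next 101, and so on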
This will work because each transaction that UPDATEs the relevant account_sequences row takes a lock on the row that it holds until it commits or rolls back. Other transactions that try to get IDs for the same account will wait for it to finish.
For more info search for "postgresql gapless sequence".
If you want, you can make your SQL function fetch the prefix too, concatenate it with the generated value using format, and return a text result. This will be easier if you add a prefix text NOT NULL column to your account_sequences table, so you can do something like:
CREATE OR REPLACE FUNCTION account_get_next_id(integer)
RETURNS text VOLATILE LANGUAGE sql AS
$$
    UPDATE account_sequences
    SET counter = counter + 1
    WHERE account_id = $1
    RETURNING format('%s%s', prefix, counter)
$$;
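As a hypothetical end-to-end sketch, assuming the text-returning variant above is the one installed: add the prefix column, register an account (id 43 here) whose reports should start at ABC-100, and call the function:
ALTER TABLE account_sequences ADD COLUMN prefix text NOT NULL DEFAULT '';
INSERT INTO account_sequences (account_id, prefix, counter) VALUES (43, 'ABC-', 99);

SELECT account_get_next_id(43);  -- 'ABC-100'
SELECT account_get_next_id(43);  -- 'ABC-101'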
By the way, do not take the naïve approach of using a subquery with SELECT max(id) .... It's completely concurrency-unsafe, it'll produce wrong results or errors if multiple transactions run at once. Plus it's slow.
Related
I have a Java EE Web Application and a SQL Server Database.
I intend to cluster my database later.
Now, I have two tables:
- Users
- Places
But I don't want to use SQL Server's auto-generated IDs.
I want to generate my own id because of the cluster.
So, I've created a new table Parameter. The parameter table has two columns: TableName and LastId. My parameter table stores the last id. When I add a new user, my method addUser does this:
Query the last id from the parameter table and increment it by 1;
Insert the new User;
Update the last id in the parameter table.
It's working. But it's a web application, so what about 1000 people using it simultaneously? Some of them may get the same last id. How can I solve this? I've tried synchronized, but it's not working.
What do you suggest? Yes, I have to avoid auto-increment.
I know that the user has to wait.
Automatic ID may work better in a cluster, but if you want to be database-portable or implement the allocator yourself, the basic approach is to work in an optimistic loop.
I prefer 'Next ID', since it makes the logic cleaner, so I'm going to use that in this example.
SELECT the NextID from your allocator table.
UPDATE NextID SET NextID=NextID+Increment WHERE NextID=the value you read
Loop while RowsAffected != 1.
Of course, you'll also use the TableName condition when selecting/updating to pick the appropriate allocator row.
You should also look at allocating in blocks -- Increment=200, say -- and caching them in the appserver. This will give better concurrency & be a lot faster than hitting the DB each time.
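To make the loop concrete, here is a rough sketch in Postgres plpgsql (kept Postgres-flavoured to match the rest of this page, even though the question targets SQL Server; the same compare-and-swap UPDATE applies there too). The id_allocator table and its table_name/next_id columns are assumptions, and in a real application the loop would live in application code so each attempt runs in its own transaction:
DO $$
DECLARE
    allocated_id  bigint;
    rows_affected integer;
BEGIN
    LOOP
        -- 1. read the current NextID for the table we are allocating for
        SELECT next_id INTO allocated_id
        FROM id_allocator
        WHERE table_name = 'Users';

        -- 2. compare-and-swap: only succeeds if nobody moved it in between
        UPDATE id_allocator
        SET next_id = next_id + 1              -- or + 200 to allocate a block
        WHERE table_name = 'Users'
          AND next_id = allocated_id;

        GET DIAGNOSTICS rows_affected = ROW_COUNT;
        EXIT WHEN rows_affected = 1;           -- lost the race, so retry
    END LOOP;

    RAISE NOTICE 'allocated id %', allocated_id;
END
$$;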
So I currently have a primary Postgres database that handles multiple users from different apps. One of the issues regarding concurrency is what happens when, say, AppOne and AppTwo want to add users at the same time.
Currently, AppOne generates a random number (it must be 10 digits long) and then checks if the value exists in the database; if it doesn't exist, it inserts the user with that value in a column called user_url (used for their URL).
Now, as you can imagine, if AppTwo makes a request to add a user in between the generation, check, or insertion, we can end up with duplicate "unique" values (it's happened). I want to solve that issue, potentially using Postgres triggers.
I know that I can use transactions, but I don't want to hold up the database; I'd rather it created the unique number through a function and trigger on the database side, so as I scale I don't have to worry about race conditions. Once the trigger does its thing, I can then get the newly added user with all of its data, including the unique id.
So, ideally:
CREATE OR REPLACE FUNCTION set_unique_number(...) RETURNS trigger AS $$
DECLARE
BEGIN
....something here
RETURN new;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER insert_unique_url_id BEFORE INSERT ... PROCEDURE
set_unique_number(...);
It would be a function that generates the number and inserts it into the row, run by a BEFORE INSERT trigger. I may be wrong.
Any help/suggestions would be helpful
EDIT: I want it so that there is no sequence to the numbers. This way people cannot guess the next user's URL.
Thanks
9,000,000,000 is a small enough number that the birthday problem guarantees you'll start to see collisions fairly soon.
I think you can work around this problem while still allowing concurrent inserts by using advisory locking. Your procedure might look like this (in pseudocode):
while (true) {
    start transaction;
    bigint new_id = floor(random() * 9000000000) + 1000000000;  -- random 10-digit candidate
    if select pg_try_advisory_xact_lock(new_id) {
        if select not exists (select 1 from url where id = new_id) {
            insert into url (id, ...) values (new_id, ...);
            commit;
            break;
        }
    }
    commit;
}
This procedure would never end once you had 9,000,000,000 rows in the database. You'd have to implement it externally, as Postgres procedures do not allow multiple transactions within a procedure. It might be possible to work around that by using exceptions, but it'll be rather complicated.
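A single attempt could be expressed in plain SQL roughly like this (a sketch only; the url table with a bigint primary key id is an assumption, and the retry loop is kept in the application):
WITH candidate AS (
    SELECT (floor(random() * 9000000000) + 1000000000)::bigint AS id
)
INSERT INTO url (id)
SELECT c.id
FROM candidate c
-- take a transaction-scoped advisory lock on the candidate so a concurrent
-- session trying the same number cannot race past the existence check
WHERE pg_try_advisory_xact_lock(c.id)
  AND NOT EXISTS (SELECT 1 FROM url u WHERE u.id = c.id)
RETURNING id;
-- zero rows returned means the attempt failed; retry in a new transaction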
Why don't you use the uuid-ossp extension? It will allow you to generate UUIDs from within Postgres itself.
Here's a good tutorial on how to use them even as primary keys.
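For example (table and column names here are made up; on newer Postgres versions gen_random_uuid() is another option):
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE app_user (
    user_url uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    name     text NOT NULL
);

-- the generated UUID comes back without an extra round trip
INSERT INTO app_user (name) VALUES ('alice') RETURNING user_url;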
I have a table called ticket, and it has a field called number and a foreign key called client that needs to work much like an auto-field (incrementing by 1 for each new record), except that the client chain needs to be able to specify the starting number. This isn't a unique field because multiple clients will undoubtedly use the same numbers (e.g. start at 1001). In my application I'm fetching the row with the highest number, and using that number + 1 to make the next record's number. This all takes place inside of a single transaction (the fetching and the saving of the new record). Is it true that I won't have to worry about a ticket ever getting an incorrect (duplicate) number under a high-load situation, or will the transaction protect from that possibility? (note: I'm using PostgreSQL 9.x)
Without locking the whole table on every insert/update, no. The way transactions work in PostgreSQL means that new rows created by concurrent transactions never conflict with each other, and that's exactly what would be happening here.
You need to make sure that the updates actually cause the same rows to conflict. You would basically need to implement something similar to the mechanism used by PostgreSQL's native sequences.
What I would do is add another column to the table referenced by your client column to represent the last_val of the sequence's you'll be using. So each transaction would look sort of like this:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

UPDATE clients
SET footable_last_val = footable_last_val + 1
WHERE clients.id = :client_id;

INSERT INTO footable(somecol, client_id, number)
VALUES (:somevalue,
        :client_id,
        (SELECT footable_last_val
         FROM clients
         WHERE clients.id = :client_id));

COMMIT;
That way, if another transaction has already incremented the counter, the UPDATE on the clients table fails with a serialization conflict before the INSERT is reached.
You do have to worry about duplicate numbers.
The typical problematic scenario is: transaction T1 reads N, and creates a new row with N+1. But before T1 commits, another transaction T2 sees N as the max for this client and creates another new row with N+1 => conflict.
There are many ways to avoid this; here is a simple piece of plpgsql code that implements one of them, assuming a unique index on (client, number). The idea is to let the inserts run concurrently, but on a unique-index violation, retry with refreshed values until the insert is accepted. (It's not a busy loop, though, since concurrent inserts are blocked until the other transactions commit.)
do
$$
begin
  loop
    BEGIN
      -- client number is assumed to be 1234 for the sake of simplicity
      insert into the_table(client, number)
        select 1234, 1 + coalesce(max(number), 0) from the_table where client = 1234;
      exit;
    EXCEPTION
      when unique_violation then
        -- nothing (keep looping)
    END;
  end loop;
end$$;
This example is a bit similar to the UPSERT implementation from the PG documentation.
It's easily transferable into a plpgsql function taking the client id as input.
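As a rough illustration only (reusing the placeholder table above and assuming the unique index on (client, number)), such a function might look like this:
CREATE OR REPLACE FUNCTION next_ticket_number(p_client integer)
RETURNS integer
LANGUAGE plpgsql AS
$$
DECLARE
    v_number integer;
BEGIN
    LOOP
        BEGIN
            INSERT INTO the_table (client, number)
            SELECT p_client, 1 + coalesce(max(number), 0)
            FROM the_table
            WHERE client = p_client
            RETURNING number INTO v_number;

            RETURN v_number;
        EXCEPTION
            WHEN unique_violation THEN
                NULL;  -- another transaction took that number; retry
        END;
    END LOOP;
END;
$$;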
I'm new to pagination, so I'm not sure I fully understand how it works. But here's what I want to do.
Basically, I'm creating a search engine of sorts that generates results from a database (MySQL). These results are merged together algorithmically, and then returned to the user.
My question is this: When the results are merged on the backend, do I need to create a temporary view with the results that is then used by the PHP pagination? Or do I create a table? I don't want a bunch of views and/or tables floating around for each and every query. Also, if I do use temporary tables, when are they destroyed? What if the user hits the "Back" button on his/her browser?
I hope this makes sense. Please ask for clarification if you don't understand. I've provided a little bit more information below.
MORE EXPLANATION: The database contains English words and phrases, each of which is mapped to a concept (Example: "apple" is 0.67 semantically-related to the concept of "cooking"). The user can enter in a bunch of keywords, and find the closest matching concept to each of those keywords. So I am mathematically combining the raw relational scores to find a ranked list of the most semantically-related concepts for the set of words the user enters. So it's not as simple as building a SQL query like "SELECT * FROM words WHERE blah blah..."
It depends on your database engine (i.e. what kind of SQL), but nearly every SQL flavor has support for paginating a query.
For example, MySQL has LIMIT and MS SQL has ROW_NUMBER.
So you build your SQL as usual, and then you just add the database engine-specific pagination stuff and the server automatically returns only, say, row 10 to 20 of the query result.
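For example, in MySQL (table and column names here are made up):
SELECT id, score
FROM results
ORDER BY score DESC
LIMIT 10 OFFSET 10;  -- rows 11-20, i.e. page 2 with 10 rows per page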
EDIT:
So the final query (which selects the data that is returned to the user) selects data from some tables (temporary or not), as I expected.
It's a SELECT query, which you can page with LIMIT in MySQL.
Your description sounds to me as if the actual calculation is way more resource-hogging than the final query which returns the results to the user.
So I would do the following:
get the individual results tables for the entered words, and save them in a table in a way that lets you get the data for this specific query later (for example, with an additional column like SessionID or QueryID). No pagination here.
query these result tables again for the final query that is returned to the user.
Here you can do paging by using LIMIT.
So you have to do the actual calculation (the resource-hogging queries) only once when the user "starts" the query. Then you can return paginated results to the user by just selecting from the already populated results table.
EDIT 2:
I just saw that you accepted my answer, but still, here's more detail about my usage of "temporary" tables.
Of course this is only one possible way to do it. If the expected result is not too large, returning the whole resultset to the client, keeping it in memory and doing the paging client side (as you suggested) is possible as well.
But if we are talking about real huge amounts of data of which the user will only view a few (think Google search results), and/or low bandwidth, then you only want to transfer as little data as possible to the client.
That's what I was thinking about when I wrote this answer.
So: I don't mean a "real" temporary table, I'm talking about a "normal" table used for saving temporary data.
I'm way more proficient in MS SQL than in MySQL, so I don't know much about temp tables in MySQL.
I can tell you how I would do it in MS SQL, but maybe there's a better way to do this in MySQL that I don't know.
When I have to page a resource-intensive query, I want to do the actual calculation once, save it in a table and then query that table several times from the client (to avoid doing the calculation again for each page).
The problem is: in MS SQL, a temp table only exists in the scope of the query where it is created.
So I can't use a temp table for that because it would be gone when I want to query it the second time.
So I use "real" tables for things like that.
I'm not sure whether I understood your algorithm example correctly, so I'll simplify the example a bit. I hope that I can make my point clear anyway:
This is the table (this is probably not valid MySQL, it's just to show the concept):
create table AlgorithmTempTable
(
QueryID guid,
Rank float,
Value float
)
As I said before - it's not literally a "temporary" table, it's actually a real permanent table that is just used for temporary data.
Now the user opens your application, enters his search words and presses the "Search" button.
Then you start your resource-heavy algorithm to calculate the result once, and store it in the table:
insert into AlgorithmTempTable (QueryID, Rank, Value)
select '12345678-9012-3456789', foo, bar
from Whatever
insert into AlgorithmTempTable (QueryID, Rank, Value)
select '12345678-9012-3456789', foo2, bar2
from SomewhereElse
The Guid must be known to the client. Maybe you can use the client's SessionID for that (if he has one and if he can't start more than one query at once...or you generate a new Guid on the client each time the user presses the "Search" button, or whatever).
Now all the calculation is done, and the ranked list of results is saved in the table.
Now you can query the table, filtering by the QueryID:
select Rank, Value
from AlgorithmTempTable
where QueryID = '12345678-9012-3456789'
order by Rank
limit 0, 10
Because of the QueryID, multiple users can do this at the same time without interfering with each other's queries. If you create a new QueryID for each search, the same user can even run multiple queries at once.
Now there's only one thing left to do: delete the temporary data when it's not needed anymore (only the data! The table is never dropped).
So, if the user closes the query screen:
delete
from AlgorithmTempTable
where QueryID = '12345678-9012-3456789'
This is not ideal in some cases, though. If the application crashes, the data stays in the table forever.
There are several better ways. Which one is the best for you depends on your application. Some possibilities:
You can add a datetime column with the current time as default value, and then run a nightly (or weekly) job that deletes everything older than X (see the sketch after this list)
Same as above, but instead of a weekly job you can delete everything older than X every time someone starts a new query
If you have a session per user, you can save the SessionID in an additional column in the table. When the user logs out or the session expires, you can delete everything with that SessionID in the table
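For the first option, the cleanup might look roughly like this in MySQL, assuming a created_at datetime column that defaults to the current time:
DELETE FROM AlgorithmTempTable
WHERE created_at < NOW() - INTERVAL 7 DAY;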
Paging results can be very tricky. The way I have done this is as follows. Set an upper-bound limit for any query that may be run, for example 5,000. If a query returns more than 5,000 rows, limit the results to 5,000.
This is best done using a stored procedure.
Store the results of the query into a temp table.
Select page X's worth of data from the temp table.
Also return the current page and the total number of pages.
I have a list of unique numbers in a file. Whenever a user of my site makes a request, I need to fetch a value from that list such that no other user gets the same value. There could be many concurrent requests, and hence the major issue is guaranteeing that no two users get the same value. I am not that concerned with performance, in the sense that the user doesn't expect any reply after making a request, so everything can happen in the background. From what I have read, this could be implemented using file locks, or I could store the data in the db and use db locks. I just want to get an idea of what the best way of doing this is. I am using PostgreSQL as my database and was wondering if this could be done using sequences, where I store the current row number in the sequence so that my program knows which row to read from the db. But again, I am not sure how to prevent multiple processes from reading the same row before any one of them has a chance to update it.
Databases don't actually do many things, but what they do, they do very well. In particular, they handle concurrent access to data. You could use this to your advantage:
Create a table: create table special_number (id serial, num varchar(100) not null);
Ensure uniqueness: create unique index special_number_num on special_number(num);
Load your numbers into the table using COPY, letting id count up from 1 automatically
Create a sequence: create sequence num_seq;
Use the fact that Postgres's nextval function is guaranteed concurrent-safe to safely pick a number from the list: select num from special_number where id = (select nextval('num_seq'));. This will return no rows once you run out of numbers.
This approach is 100% safe and easy - Postgres's internal strengths as a database do all the heavy lifting.
Here's the SQL extracted from above:
create table special_number (id serial, num varchar(100) not null);
create unique index special_number_num on special_number(num);
copy special_number(num) from '/tmp/data.txt'; -- 1 special number per file line
create sequence num_seq;
select num from special_number where id = (select nextval('num_seq'));
This SQL has been tested on postgres and works as expected.