The website I'm creating has "Events". Events have a title, a date, and the user IDs of the people involved. An event can be anything from following someone to creating a new post. I was thinking of storing all events in a single table, but I could see this getting very big very quickly.
Am I doing it right? When I need to search the table for, say, events pertaining to a certain user, how bad a toll would that be on the system? Could I optimise it somehow?
You would add indexes on the columns you most frequently use in WHERE clauses, e.g. if you are frequently selecting all events that pertain to a certain user, you should create an index on the user_id column.
http://www.postgresql.org/docs/9.1/static/sql-createindex.html
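For example, a minimal sketch in PostgreSQL, assuming the table is called events and the column is user_id (both names are assumptions):

-- Hypothetical events table; an index on user_id speeds up per-user lookups
CREATE INDEX idx_events_user_id ON events (user_id);

-- This query can now use the index instead of scanning the whole table
SELECT * FROM events WHERE user_id = 42;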
As long as the data in that table is normalized, you should be OK. If you find that read queries on that table slow down, you can add an index to some of the columns, but you should keep in mind that this will slow down writes to that table.
If you find that the performance is too slow, you can switch to using some NoSQL database that's better optimized for large tables.
If the table will be really big, you can use partitioning (a sketch follows the list of candidate keys below):
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
but you must choose a good partition key - good candidates are:
event timestamp
event type
user_id
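Here is a minimal sketch of what this could look like on PostgreSQL 9.1, which only offers inheritance-based partitioning (declarative partitioning came much later); all table and column names here are assumptions:

-- Parent table (hypothetical columns)
CREATE TABLE events (
    id         bigserial   PRIMARY KEY,
    user_id    integer     NOT NULL,
    event_type text        NOT NULL,
    created_at timestamptz NOT NULL
);

-- One child table per month, constrained on the partition key
CREATE TABLE events_2012_01 (
    CHECK (created_at >= DATE '2012-01-01' AND created_at < DATE '2012-02-01')
) INHERITS (events);

-- A trigger routes inserts on the parent into the right child
CREATE OR REPLACE FUNCTION events_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.created_at >= DATE '2012-01-01' AND NEW.created_at < DATE '2012-02-01' THEN
        INSERT INTO events_2012_01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for created_at %', NEW.created_at;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_insert
    BEFORE INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE events_insert_router();

With constraint_exclusion left at its default setting of "partition", queries that filter on created_at will skip child tables whose CHECK constraint rules them out.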
I have two tables in my database, one for logins and a second for user details (the database is not only two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber, ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo, ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the size of the table big.
So this leads to my question: what number of columns makes a table big? Is there a definite or approximate number at which a table becomes big and we should stop adding columns and create another table? Or is it up to the programmer to decide such a number?
The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each separate row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
Point here is, we can't really give you any advice since just the number of tables and columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is if we exceed the maximum number of columns that the database engine can support for a single table. As can be seen here, for SQL Server that is 1024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and their size and scope.
I take it that the logins table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operations that do need all the data would not have to perform a join of two tables.
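To illustrate, here is a hedged sketch in SQL Server syntax; the table and column names are assumptions based on the question:

-- One combined Users table (hypothetical subset of the 35 columns)
CREATE TABLE Users (
    Id           int IDENTITY PRIMARY KEY,
    Email        nvarchar(256) NOT NULL,
    PasswordHash varbinary(64) NOT NULL,
    PhoneNumber  nvarchar(32)  NULL,
    Job          nvarchar(100) NULL,
    City         nvarchar(100) NULL
    -- ...remaining detail columns
);

-- A "login" covering index: a lookup by Email that only needs these columns
-- is answered entirely from the index, never touching the full row
CREATE NONCLUSTERED INDEX IX_Users_Login
    ON Users (Email)
    INCLUDE (PasswordHash, PhoneNumber);

-- This query is covered by IX_Users_Login
SELECT PasswordHash FROM Users WHERE Email = 'joe@example.com';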
Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance now, each SELECT query will need to ensure they do not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECTs, just get rows WHERE deleted_at IS NULL.
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
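For example (the table and column names here are assumptions):

CREATE VIEW active_records AS
    SELECT *
    FROM   records
    WHERE  is_deleted = '0';

-- Application code then reads from the view instead of the base table
SELECT * FROM active_records WHERE user_id = 42;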
Having an is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan all 101000 rows.
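A sketch of the DDL, with a hypothetical table name and columns:

CREATE TABLE records (
    id         NUMBER PRIMARY KEY,
    payload    VARCHAR2(4000),
    is_deleted NUMBER(1) DEFAULT 0 NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_active  VALUES (0),
    PARTITION p_deleted VALUES (1)
);

-- Needed so a row can physically move partitions when is_deleted is updated
ALTER TABLE records ENABLE ROW MOVEMENT;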
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
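A small sketch of the nullable deleted_at approach in T-SQL; the Products table and ProductId column are placeholders:

ALTER TABLE Products
    ADD deleted_at DATETIME NULL;

-- Soft-delete a row by stamping the deletion time
UPDATE Products SET deleted_at = GETDATE() WHERE ProductId = 42;

-- Live rows: the IS NULL test is resolved cheaply from the row's NULL bitmap
SELECT * FROM Products WHERE deleted_at IS NULL;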
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint span both the login and the deleted_at timestamp columns (sketched at the end of this answer).
My own opinion is +1 for moving to a new table. It takes a lot of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
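Technique 2 sketched in T-SQL (the Users table and column names are assumptions). Since SQL Server treats NULL as a value in a unique index, only one live JOE can exist, while deleted JOEs keep their distinct deletion timestamps:

CREATE UNIQUE INDEX UX_Users_Login
    ON Users (login, deleted_at);

-- In PostgreSQL, where NULLs never collide in a unique constraint,
-- a partial unique index expresses the intent more directly:
-- CREATE UNIQUE INDEX ux_users_login ON users (login) WHERE deleted_at IS NULL;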
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly, and hinder the usage of any of the indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted = 0, mitigating some of the negatives of this particular approach.
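For example, a hedged sketch for SQL Server 2008; the table and key columns are assumptions:

-- Only the live rows are kept in this index, so it stays small and
-- queries that filter on is_deleted = 0 can be satisfied from it
CREATE NONCLUSTERED INDEX IX_Records_Active
    ON Records (user_id, created_on)
    WHERE is_deleted = 0;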
Something that I use on projects is a statusInd tinyint not null default 0 column
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
This scales well and is data centric, keeping the data footprint pretty small - key for 350 GB+ databases with realtime concerns. Using alternatives such as tables and triggers has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, e.g. published, private, deleted, needsApproval...
Create another schema and grant it all privileges on your data schema.
Implement VPD on your new schema so that each and every query has a predicate appended to it that allows selection of the non-deleted rows only.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete
I have a scenario wherein users upload their custom preferences for the conversion being done by our utility. The problem we are facing is:
should we keep these custom mappings in one table with an extra column for a UID (or maybe a GUID),
or
should we create tables on the fly for every user that registers with us and provide him/her with a separate table for their custom mappings altogether?
We're more inclined towards having just an extra column and one table for the custom mappings, but does the redundant data in that one column have any effect on performance? On the other hand, logic tells us to normalize our tables and have separate tables. I'm rather confused as to what to do.
Common practice is to use a single table with a user id. If there are commonly reused preferences, you can instead refactor these out into a separate table and reference them in one go, but that is up to you.
Why is creating one table per user a bad idea? Well, I don't think you want one million tables in your database if you end up with one million users. Doing it this way will also guarantee that your queries are unwieldy at best, often slower than they ought to be, and often dynamically generated and EXECed - usually a last-resort option and not necessarily safe.
I inherited a system where the original developers had chosen to go with the one table per client approach. We are now rewriting the system! From a dev perspective, you would end up with a tangled mess of dynamic SQL and looping. The DBAs will not be happy because they will wake up to an entirely different database every morning, which makes performance tuning and maintenance impossible.
You could have an additional BLOB or XML column in the users table, storing the preferences. This is called the property bag approach, but I would not recommend this either, as it will hamper query performance in many cases.
The best approach in this scenario is to solve the problem with normalisation: a separate table for the properties, joined to the users table with a PK/FK relationship. I would not recommend using a GUID. That GUID would likely end up as the PK, meaning you would have a 16-byte clustered index that is also carried through to all non-clustered indexes on the table. You would probably also be building a non-clustered index on it in the Users table. You also run the risk of falling into the pitfall of generating these GUIDs with NEWID() rather than NEWSEQUENTIALID(), which leads to massive fragmentation.
If you take the normalisation approach and you have concerns about returning the results across the two tables in a tabular format, then you can simply use a PIVOT operator.
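A sketch of the normalised layout being recommended; every name here is hypothetical:

CREATE TABLE Users (
    UserId   int IDENTITY PRIMARY KEY,
    UserName nvarchar(100) NOT NULL
);

CREATE TABLE UserMappings (
    MappingId    int IDENTITY PRIMARY KEY,
    UserId       int NOT NULL REFERENCES Users (UserId),
    MappingKey   nvarchar(100) NOT NULL,
    MappingValue nvarchar(400) NOT NULL
);

-- A narrow index supports the common "all mappings for one user" lookup
CREATE NONCLUSTERED INDEX IX_UserMappings_UserId ON UserMappings (UserId);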
Our database is part of a (specialized) desktop application.
The primary goal is to keep data about certain events.
Events happen every few minutes.
The data collected about events changes frequently with new data groups being added in and old ones swapped out almost monthly (the data comes in definite groups).
I have to put together a database to track the events. A first stab at that might be to simply have a single big table where each row is an event, and that is basically what our data looks like, but this seems undesirable because of our constantly changing groups of data (i.e. the number of columns would either keep growing perpetually, or we would constantly have this month's database incompatible with last month's database - ugh!). Because of this I am leaning toward the following, even though it creates circular references. (But maybe this is a stupid idea.)
Create tables like
Table Events
Table Group of the Month 1
Table Group of the Month 2
...
Table Events has:
A primary key whose deletion cascades to delete rows with foreign keys referencing it
A nullable foreign key for each data group table
Each data group table has:
A primary key, whose deletion cascades to null out foreign keys referencing it
Columns for the data in that group
A non-nullable foreign key back to the event
This still leaves you with a growing, changing Events table (as you need to add a new foreign key column for each new data group), just much less drastically. However, it seems more modular to me than one giant table. Is this a good solution to this situation? If not, what is?
Any suggestions?
P.S. We are using SQL Express or SQL Compact (we are currently experimenting with which one suits us best)
Why not use basically the single table approach and store the changing event data as XML in an XML column? You can even use XSD schemas to account for the changing data types, and you can add indexes on XML data if fast query performance on some XML data is required.
A permanently changing DB schema wouldn't really be an option if I were to implement such a database.
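A sketch of that idea in T-SQL (SQL Server Express supports the xml type; SQL Server Compact does not); the names are assumptions:

CREATE TABLE Events (
    EventId    int IDENTITY PRIMARY KEY,
    OccurredOn datetime NOT NULL,
    GroupData  xml NULL   -- this month's data groups, optionally validated by an XSD schema collection
);

-- A primary XML index speeds up queries into GroupData
CREATE PRIMARY XML INDEX PXI_Events_GroupData ON Events (GroupData);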
You should not have foreign keys for each data group table in the event table.
The event table already has an event_id which is in each data group table. So you can get from event to the child tables. Furthermore there will be old rows in the event table that are not in the latest data group table. So you really can't have a foreign key.
That said, I would wonder whether there is additional structure in the data group tables that can be used to clean up your design. Without knowing anything about what they look like I can't say. But if there is, consider taking advantage of it! (A schema that changes every month is a pretty bad code smell.)
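A sketch of that layout, with hypothetical names: each data group table carries the event key, and the Events table never changes when a new group appears.

CREATE TABLE Events (
    EventId    int IDENTITY PRIMARY KEY,
    OccurredOn datetime NOT NULL
);

-- One table per data group; only the child knows about the parent
CREATE TABLE GroupOfTheMonth1 (
    GroupId   int IDENTITY PRIMARY KEY,
    EventId   int NOT NULL REFERENCES Events (EventId) ON DELETE CASCADE,
    SomeValue int NULL
);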
Store your data at as granular a level as possible. It might be as simple as:
EventSource int FK
EventType int FK
Duration int
OccuredOn datetime
Get the data right and as simple as possible in the first place, and then
Aggregate via views or queries. Your instincts are correct about the ever changing nature of the columns - better to control that in T-SQL than in DDL.
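A minimal sketch under those assumptions, for SQL Server Express (Compact does not support views); the lookup tables and all names are hypothetical:

CREATE TABLE EventSources (EventSourceId int IDENTITY PRIMARY KEY, Name nvarchar(100) NOT NULL);
CREATE TABLE EventTypes   (EventTypeId   int IDENTITY PRIMARY KEY, Name nvarchar(100) NOT NULL);

CREATE TABLE EventLog (
    EventLogId  int IDENTITY PRIMARY KEY,
    EventSource int NOT NULL REFERENCES EventSources (EventSourceId),
    EventType   int NOT NULL REFERENCES EventTypes (EventTypeId),
    Duration    int NOT NULL,
    OccuredOn   datetime NOT NULL
);
GO  -- CREATE VIEW must start its own batch

-- Aggregation lives in T-SQL, not in the table definition
CREATE VIEW MonthlyEventSummary AS
    SELECT EventType,
           DATEADD(month, DATEDIFF(month, 0, OccuredOn), 0) AS EventMonth,
           COUNT(*)      AS EventCount,
           AVG(Duration) AS AvgDuration
    FROM   EventLog
    GROUP  BY EventType, DATEADD(month, DATEDIFF(month, 0, OccuredOn), 0);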
I faced this problem a number of years ago with log files for massive armies of media players, and where I ultimately ended up was taking this data and creating an OLAP cube out of it. OLAP is another approach to database design where the important thing is optimizing it for reporting and "sliceability". It sounds like you're on that track, where it would be very useful to be able to look at a quick month's view of data, then a quarter's, and then back down to a week's. This is what OLAP is for.
Microsoft's technology for this is Analysis Services, which comes as part of SQL Server. If you didn't want to take the entire plunge (OLAP has a pretty steep learning curve), you could also look at doing a selectively denormalized database that you populated each night with ETL from your source database.
HTH.
I need to store entries with a schema like (ID: int, description: varchar, updatetime: DateTime). ID is the unique primary key. The usage scenario is: I will frequently insert new entries, frequently query entries by ID, and less frequently remove expired entries (by the updatetime field, using a SQL job run daily to keep the database from growing forever). Each entry is about 0.5 KB in size.
My question is how to optimize the database schema design (e.g. tricks for adding indexes, transaction/lock levels or other options) in my scenario to improve performance. Currently I plan to store all information in a single table, but I am not sure whether that is the best option.
BTW: I am using SQL Server 2005/2008.
thanks in advance,
George
In addition to your primary key, just add an index on updatetime.
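For instance (the Entries table name and the 30-day retention period are assumptions):

CREATE NONCLUSTERED INDEX IX_Entries_updatetime ON Entries (updatetime);

-- The daily cleanup job can then range-scan the index instead of the whole table
DELETE FROM Entries WHERE updatetime < DATEADD(day, -30, GETDATE());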
Your decision to store everything in a single table needs to be reviewed. There are very few subject matters that can really be well modeled by just one table.
The problems that arise from using just one table are usually less obvious than the problems that arise from not creating the right indexes and things like that.
I'm interested in the "description" column (field). Do all descriptions describe the same kind of thing? Do you ever retrieve sets of descriptions, rather than just one description at a time? How do you group descriptions into sets?
How do you know the ID for the description you are trying to retrieve? Do you store copies of the ID in some other place, in order to reference which ones you want?
Do you know what a "foreign key" is? Was your choice not to include any foreign keys in this table deliberate?
These are some of the questions that need to be answered before you can know whether a single table design really suits your case.
Your ID is your primary key, and it automatically has an index.
You can put another index on the expiration date. Indexes are going to help you with searching but decrease performance when inserting, deleting and updating. Anyway, one index is not an issue.
It sounds somewhat strange to me - I am not saying it is an error - that you have ALL the information in one table. Re-think that point and see if you can refactor something.
It sounds about as simple as it gets, except for possibly adding an index on updatetime as OMax suggested (which I recommend).
If you would also like to fetch items by description, you should also consider a text index or full-text index on that column.
Other than that - you're ready to go :)
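If you do go the full-text route, here is a hedged sketch for SQL Server 2005/2008; the catalog name and the primary key index name (PK_Entries) are assumptions:

CREATE FULLTEXT CATALOG EntriesCatalog;

-- The KEY INDEX must be the table's unique, single-column, non-nullable index (here the PK on ID)
CREATE FULLTEXT INDEX ON Entries (description)
    KEY INDEX PK_Entries
    ON EntriesCatalog;

-- Then queries can use CONTAINS / FREETEXT
SELECT ID FROM Entries WHERE CONTAINS(description, 'backup');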