PHP Geek needs Database Design Assistance - database

Like most developers I think I am always striving to create the most optimal code and database schemas.
However - Ive got the feeling, that Im over engineering my database schema that I want to create.
I have a web app that, in a short space of time, will hold a lot of users. The users are in the form of customers, suppliers, system users. Its in an industry where it likely to grow rapidly.
In previous schemas I have those users separated in different tables.
However, I am now thinking of going down the route of having one table called: PEOPLE.
There will be these tables:
People,
Contact Details,
Residences
They are related via PivotTables ie:
PivotContacts
PivotResidences.
My Question is this considered good/bad design.?
Am I over thinking, over engineering a simple setup.
The table People will grow exponentially and will hold ALOT of data - and other tables will relate to it.
I would really welcome opinions.
Will my design scale to 100 thousand records and maintain moderate speed.? * will initially start with 1000 records and will likely grow to approx 100,000 in 1 year.

For users that can log-in, and maybe are traced (last login, failed password retries) it is optimal to have a small table and maybe a separate table for writing (distinction between reading and writing data).
Any table with people in general has a tendency to collect a tremendous number of fields. Functional distinctions kept in different tables keep the data tidy, indexing on a suppliers table is nicer/maybe more optimal, as are changes to supplier data. SQL JOINs are manageable, and could be done with SQL views.
So I would go for a thin base table People and 1:1 tables SupplierPeople, SystemUsingPeople and so on. And consider which changes do happen: how often tables are updated, inserted into, being read.
Also consider having to modify the database scheme, adding a field.

If you're only worried about the scalability of your solution, 100K records is not a particularly large number, subject to some (important) assumptions.
Modern database software (I assume you're going to use MySQL as you say you're a PHP hand) running on modern hardware can easily handle databases with millions of records, as long as you have a well-designed table layout, and can use indices.
Your design - linking "people" to "contacts" and "residences" can use primary/foreign keys to join; that should easily scale to your requirements.
It's worth considering the likely queries you're going to be running, though - I'm guessing you will need to be able to search "people" by name, or by address, or by city, or by last contact date etc. This suggest you may need free text searching - once you get to large numbers of records, using where name like '%Jones%' can be slow.
You may also want to consider archiving/history strategies - do you need to store the history of someone's residences (so you can find out where they lived when they placed the order)?

Related

SQLite performance advice for .net

I am using SQLite in my application. The scenario is that I have stock market data and each company is a database with 1 table. That table stores records which can range from couple thousand to half a million.
Currently when I update the data in real time I - open connection, check if that particular data exists or not. If not, I then insert it and close the connection. This is then done in a loop and each database (representing a company) is updated. The number of records inserted is low and is not the problem. But is the process okay?
An alternate way is to have 1 database with many tables (each company can be a table) and each table can have a lot of records. Is this better or not?
You can expect at around 500 companies. I am coding in VS 2010. The language is VB.NET.
The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
I did something similar, with similar sized data in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (1 table per file, representing a cohesive unit, in you case one company). Plus you gain the advantage of each company table being the same name, versus having x tables of different names that have the same scheme (and no sanitizing of company names to make new tables required).
Internally, other DBMSs often keep at least one file per table in their internal structure, SQL is thus just a layer of abstraction above that. SQLite (despite its conceptors' boasting) is meant for small projects and querying larger data models will get more finicky in order to make it work well.

database design for large number of users

which one is better a or b:
a). 7 tables for each user e.g. user7 messages, user7mail etc.In this case if we have 1000 users the there will be 7000 tables.
b). 7 tables e.g. messages, mails etc. all the messasges or mails of every usr will be on same table.
in this case for 1000 users we have only 7 tables.
In most cases, on modern hardware and with reasonable tuning, your database should be able to support tens of millions of records without too much pain, as long as your data really is relational. If you're searching for text, or storing hierarchical data, or storing documents, or running reports, there are alternative options (e.g. NoSQL).
Where at all possible, stick with the orthodox way of using relational databases; that means normalization, query tuning, using caches and throwing hardware at the problem.
Only once you've proven you have a performance problem is it worth looking at more exotic solutions. Within RDBMS world, that might mean partitioning the data (sorta kinda similar to your "table per user" idea). Alternatively, you might jump to NoSQL.
The problems with your "table per user" strategy is that you gain almost no benefit when querying by index (on a modern RDBMS, searching a table with 1 row or a table with million rows when hitting the index makes almost no difference for finding the data). For actions that don't hit the index, you should see a decent gain - but that's usually a sign you're not really relational in the first place...
It makes developing the client application rather error prone, and more complicated than it needs to be, especially when creating moderately complex SQL queries (e.g. multi-table joins) - and tuning those queries will become much harder as a result. You won't be able to use the tools available to manage database queries (e.g. ORM tools), as these are all based on the "standard" relational model.
The biggest problem is changing the database - if you have to add an attribute to "message", you have to repeat that change over 7000 tables. You'll either spend a lot of time writing custom database management scripts, or have a human being repeat the same thing thousands of times (and make hard-to-spot mistakes).
Case B will be much better, just make sure that your users have a user_id type field that increments automatically, and link your tables together via that ID e.g.
user_id email
1000 hello
This will improve lookup speed because you do not have to iclude functionality to choose a specific piece of data from a search of 1000's of tables (in this case it would be searching columns of tables until it found the right table with the right column, it would be ludicrous)
but if you are searching a specific table (e.g. you only need messages) only 1 table will be included in the lookup, much faster and easier to manage all the tables at an admin level.
but and even better idea would be 1 table with several columns, say a 'communications' table which could be like
user_id email messages
1000 hello hi

Database design: one large table versus several smaller tables

I have to create a database to store information being sent and received to / from a 3rd party web service portal. There are about 150 fields of information to be sent though I can remove about 50 of those fields by normalising (there are three sets addresses that can be saved in an address table, for example). However, this still leaves a table that could potentially have 100 columns.
I've come up with two ways of handling this though I'm not sure which to use:
1. Have a table with 100 columns and three references to an address table.
2. Break it down into maybe 15-20 separate dedicated tables.
Option 1 seems the quickest as it involves the fewest joins but the idea of a table with 100 columns doesn't feel right.
Option 2 feels better and would break things down in to more managable chunks but it won't save any database space and will increase the number of joins. Pretty much all the columns in the database will have a value and I cannot normalise these columns any further.
My question is, in this situation is it acceptable to have a table with c.100 columns in it or should I try and break it down over several tables for presentation?
Please note: The table structure will not change over the course of it's useage, a new database would be created for a new version of the web service portal. I have no control over the web service data structure.
Edit: #Oded's answer below has made me think a bit more about how the data will be accessed; it will really only be accessed in whole and not in part. I wouldn't for example, need to return columns 5-20 on a regular basis.
Answer: I accepted Oded's answer based on the comments after he posted it helped me make my mind up and I decided to go with option 1. As the data is accessed in full then having one table seems the better solution. If, for example, I regularly wanted to access columns 5-20 rather than the full table row then I'd see about breaking it up into separate tables for performance reasons.
Speaking from a relational purist point of view - first, there is nothing against having 100 columns in a table, if they are related. The point here is that if after normalizing you still have 100 columns, that's OK.
But you should normalize, and in the process you may very well end up with 15-20 separate dedicated tables, which most relational database professionals would agree is a better design (avoid data duplication with the update/delete issues associated, smaller data footprint etc...).
Pragmatically, however, if there is a measurable performance problem, it may be sensible to denormalize your design for performance benefit. The key here - measureable. Don't optimize before you have an actual problem.
In that respect, I'd say you should go with the set of 15-20 tables as an initial design.
From MSDN:Maximum Capacity Specifications for SQL Server :
Columns per nonwide table: 1,024
Columns per wide table: 30,000
So I think 100 columns is ok in your case. And also maybe you need to note(from same link):
Columns per primary key: 16
Of course this is only in the case if need data only as Log for a service.
If after reading from service you need to maintain data -> then normalising seems better...
If you find it easier to "manage" tables with fewer columns, however you happen to define manageability (e.g. less horizontal scrolling when looking at the table data in SSMS), you can break the table up into several tables with 1-to-1 relationships without violating the rules of normalization.

How should I store extremely large amounts of traffic data for easy retrieval?

for a traffic accounting system I need to store large amounts of datasets about internet packets sent through our gateway router (containing timestamp, user id, destination or source ip, number of bytes, etc.).
This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.
What is a good way to do this? I already have some ideas:
Create a file for each user and day and append every dataset to it.
Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
Use a database
Advantage: It's very easy to find specific data with the right SQL query.
Disadvantage: I'm not sure if there is a database engine that can efficiently handle a table with possibly hundreds of millions datasets.
Perhaps it's possible to combine the two approaches: Using an SQLite database file for each user.
Advantage: It would be easy to get information for one user using SQL queries on his file.
Disadvantage: Getting overall information would still be difficult.
But perhaps someone else has a very good idea?
Thanks very much in advance.
First, get The Data Warehouse Toolkit before you do anything.
You're doing a data warehousing job, you need to tackle it like a data warehousing job. You'll need to read up on the proper design patterns for this kind of thing.
[Note Data Warehouse does not mean crazy big or expensive or complex. It means Star Schema and smart ways to handle high volumes of data that's never updated.]
SQL databases are slow, but that slow is good for flexible retrieval.
The filesystem is fast. It's a terrible thing for updating, but you're not updating, you're just accumulating.
A typical DW approach for this is to do this.
Define the "Star Schema" for your data. The measurable facts and the attributes ("dimensions") of those facts. Your fact appear to be # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.
Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.) Each dimension will have all the attributes you might ever want to know. This grows, people are always adding attributes to dimensions.
Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.
Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.
When someone wants to do analysis, build them a datamart.
For the selected IP address or time frame or whatever, get all the relevant facts, plus the associated master dimension data and bulk load a datamart.
You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SELECT SUM(*) with various GROUP BY and HAVING and WHERE clauses.
I think the proper answer really depends on the definition of a "dataset". As you mention in your question you are storing individual sets of information for each record; timestamp, userid, destination ip, source ip, number of bytes etc..
SQL Server is perfectly capable of handing this type of data storage with hundreds of millions of records without any real difficulty. Granted this type of logging is going to require some good hardware to handle it, but it shouldn't be too complex.
Any other solution in my opinion is going to make reporting very hard, and from the sounds of it that is an important requirement.
So you are in one of the cases where you have much more write activity than read, you want your writes not to block you, and you want your reads to be "reasonably fast" but not critical. It's a typical business intelligence use case.
You should probably use a database and store your data in as a "denormalized" schema to avoid complex joins and multiple inserts for each record. Think of your table as a huge log file.
In this case, some of the "new and fancy" NoSQL databases are probably what you're looking for: they provide relaxed ACID constraints, which you should not terribly mind here (in case of crash, you can loose the last lines of your log), but they perform much better for insertion, because they don't have to sync journals on disk at each transaction.

What's the better database design: more tables or more columns?

A former coworker insisted that a database with more tables with fewer columns each is better than one with fewer tables with more columns each. For example rather than a customer table with name, address, city, state, zip, etc. columns, you would have a name table, an address table, a city table, etc.
He argued this design was more efficient and flexible. Perhaps it is more flexible, but I am not qualified to comment on its efficiency. Even if it is more efficient, I think those gains may be outweighed by the added complexity.
So, are there any significant benefits to more tables with fewer columns over fewer tables with more columns?
I have a few fairly simple rules of thumb I follow when designing databases, which I think can be used to help make decisions like this....
Favor normalization. Denormalization is a form of optimization, with all the requisite tradeoffs, and as such it should be approached with a YAGNI attitude.
Make sure that client code referencing the database is decoupled enough from the schema that reworking it doesn't necessitate a major redesign of the client(s).
Don't be afraid to denormalize when it provides a clear benefit to performance or query complexity.
Use views or downstream tables to implement denormalization rather than denormalizing the core of the schema, when data volume and usage scenarios allow for it.
The usual result of these rules is that the initial design will favor tables over columns, with a focus on eliminating redundancy. As the project progresses and denormalization points are identified, the overall structure will evolve toward a balance that compromises with limited redundancy and column proliferation in exchange for other valuable benefits.
It doesn't sound so much like a question about tables/columns, but about normalization. In some situations have a high degree of normalization ("more tables" in this case) is good, and clean, but it typically takes a high number of JOINs to get relevant results. And with a large enough dataset, this can bog down performance.
Jeff wrote a little about it regarding the design of StackOverflow. See also the post Jeff links to by Dare Obasanjo.
I would argue in favor of more tables, but only up to a certain point. Using your example, if you separated your user's information into two tables, say USERS and ADDRESS, this gives you the flexibility to have multiple addresses per user. One obvious application of this is a user who has separate billing and shipping addresses.
The argument in favor of having a separate CITY table would be that you only have to store each city's name once, then refer to it when you need it. That does reduce duplication, but in this example I think it's overkill. It may be more space efficient, but you'll pay the price in joins when you select data from your database.
A fully normalized design (i.e, "More Tables") is more flexible, easier to maintain, and avoids duplication of data, which means your data integrity is going to be a lot easier to enforce.
Those are powerful reasons to normalize. I would choose to normalize first, and then only denormalize specific tables after you saw that performance was becoming an issue.
My experience is that in the real world, you won't reach the point where denormalization is necessary, even with very large data sets.
Each table should only include columns that pertain to the entity that's uniquely identified by the primary key. If all the columns in the database are all attributes of the same entity, then you'd only need one table with all the columns.
If any of the columns may be null, though, you would need to put each nullable column into its own table with a foreign key to the main table in order to normalize it. This is a common scenario, so for a cleaner design, you're likley to be adding more tables than columns to existing tables. Also, by adding these optional attributes to their own table, they would no longer need to allow nulls and you avoid a slew of NULL-related issues.
It depends on your database flavor. MS SQL Server, for example, tends to prefer narrower tables. That's also the more 'normalized' approach. Other engines might prefer it the other way around. Mainframes tend to fall in that category.
The multi-table database is a lot more flexible if any of these one to one relationships may become one to many or many to many in the future. For example, if you need to store multiple addresses for some customers, it's a lot easier if you have a customer table and an address table. I can't really see a situation where you might need to duplicate some parts of an address but not others, so separate address, city, state, and zip tables may be a bit over the top.
Like everything else: it depends.
There is no hard and fast rule regarding column count vs table count.
If your customers need to have multiple addresses, then a separate table for that makes sense. If you have a really good reason to normalize the City column into its own table, then that can go, too, but I haven't seen that before because it's a free form field (usually).
A table heavy, normalized design is efficient in terms of space and looks "textbook-good" but can get extremely complex. It looks nice until you have to do 12 joins to get a customer's name and address. These designs are not automatically fantastic in terms of performance that matters most: queries.
Avoid complexity if possible. For example, if a customer can have only two addresses (not arbitrarily many), then it might make sense to just keep them all in a single table (CustomerID, Name, ShipToAddress, BillingAddress, ShipToCity, BillingCity, etc.).
Here's Jeff's post on the topic.
There are advantages to having tables with fewer columns, but you also need to look at your scenario above and answer these questions:
Will the customer be allowed to have more than 1 address? If not, then a separate table for address is not necessary. If so, then a separate table becomes helpful because you can easily add more addresses as needed down the road, where it becomes more difficult to add more columns to the table.
i would consider normalizing as the first step, so cities, counties, states, countries would be better as separate columns... the power of SQL language, together with today DBMS-es allows you to group your data later if you need to view it in some other, non-normalized view.
When the system is being developed, you might consider 'unnormalizing' some part if you see that as an improvement.
I think balance is in order in this case. If it makes sense to put a column in a table, then put it in the table, if it doesn't, then don't. Your coworkers approach would definately help to normalize the database, but that might not be very useful if you have to join 50 tables together to get the information you need.
I guess what my answer would be is, use your best judgement.
There are many sides to this, but from an application efficiency perspective mote tables can be more efficient at times. If you have a few tables with a bunch of columns every time the db as to do an operation it has a chance of making a lock, more data is made unavailable for the duration of the lock. If locks get escalated to page and tables (well hopefully not tables :) ) you can see how this can slow down the system.
Hmm.
I think its a wash and depends on your particular design model. Definitely factor out entities that have more than a few fields out into their own table, or entities whose makeup will likely change as your application's requirements changes (for instance - I'd factor out address anyways, since it has so many fields, but I'd especially do it if you thought there was any chance you'd need to handle foreign country addresses, which can be of a different form. The same with phone numbers).
That said, when you're got it working, keep an eye out on performance. If you've spun an entity out that requires you to do large, expensive joins, maybe it becomes a better design decision to spin that table back into the original.
When you design your database, you should be as close as possible from the meaning of data and NOT your application need !
A good database design should stand over 20 years without a change.
A customer could have multiple adresses, that's the reality. If you decided that's your application is limited to one adresse for the first release, it's concern the design of your application not the data !
It's better to have multiple table instead of multiple column and use view if you want to simplify your query.
Most of time you will have performance issue with a database it's about network performance (chain query with one row result, fetch column you don't need, etc) not about the complexity of your query.
There are huge benefits to queries using as few columns as possible. But the table itself can have a large number. Jeff says something on this as well.
Basically, make sure that you don't ask for more than you need when doing a query - performance of queries is directly related to the number of columns you ask for.
I think you have to look at the kind of data you're storing before you make that decision. Having an address table is great but only if the likelihood of multiple people sharing the same address is high. If every person had different addresses, keeping that data in a different table just introduces unnecessary joins.
I don't see the benefit of having a city table unless cities in of themselves are entities you care about in your application. Or if you want to limit the number of cities available to your users.
Bottom line is decisions like this have to take the application itself into considering before you start shooting for efficiency. IMO.
First, normalize your tables. This ensures you avoid redundant data, giving you less rows of data to scan, which improves your queries. Then, if you run into a point where the normalized tables you are joining are causing the query to take to long to process (expensive join clause), denormalize where more appropriate.
Good to see so many inspiring and well based answers.
My answer would be (unfortunately): it depends.
Two cases:
* If you create a datamodel that is to be used for many years and thus possibly has to adept many future changes: go for more tables and less rows and pretty strict normalization.
* In other cases you can choose between more tables-less rows or less tables-more rows. Especially for people relatively new to the subject this last approach can be more intuitive and easy to comprehend.
The same is valid for the choosing between the object oriented approach and other options.

Resources