Facebook database design...why have a profile table?

Facebook database design...why have a profile table? - database

See image below
Since 1 account has 1 profile relationship, Why have a profile table? what is the purpose of the profile table, apart from storing the status. Why not include status in the Account table and make a direct relationship from the "account" table to BasicInformation, PersonalInformation etc.
http://i.stack.imgur.com/u7GKB.jpg

If, at some future time, you change the model so that one account can have more than one profile, you are much better off with two tables than with just one.
With regard to the cost of joins, you need to quantify that, and decide where a speed difference just isn't worth worrying about. Excessive fear of slowing things down with joins is one of the most common newbie mistakes with relational databases.

Some ideas and educated guesses.
At the conceptual level, an account
and a profile are two different
things.
Adding the profile status to the
account table makes that table wider
and slower.
Since status holds only your most
recent post (is that right?), that
table can be put on a separate
tablespace, probably on an insanely
fast disk array for fast lookups.
Status is probably looked up much
more often than anything in the
account table.
Security is simpler to administer.
Lots of third-party apps might be
allowed access to your status, but
they shouldn't necessarily have
access to your email address and
password. Physical isolation (separate tables) is pretty easy to get obviously right.

I guess it's because not every Account will have a profile associated with it. i.e. the relationship is actually 1:0/1, not 1:1.

It's just a matter of abstraction.
An account has profile data in it. So, it has an instance (table) of a profile.
This way you can access profile data seperately, and maybe in the future add more data to the account.

Related

Database table creation for large data

I am making a client management application in which I am storing the data of employee , admin and company. In the future the database will have hundreds of companies registered. I am thinking to go for the best approach to database design.
I can think of 2 approaches:
Making all tables of app separately for each company
Storing all data in app database
Can you suggest the best way to do that?
Please note that all 3 tables are linked on the basis of ids and there will be hundreds of companies and each company will have many admin and each admin will have hundreds of employee . What would be the best approach to do with security and query performance

With the partial information you provided, it look like 3 normalized tables is what you need, plus the auxiliar data like lookups and other stuff.
But when you design a database you would need to consider many more point like, security, visibility, client access methods, etc
For example if you want to ensure isolation, and don't allow users to have any visibility to other's data, you could create dynamically a schema per company, create user and access rights for each schema dynamically. Then you'll need support these stuff in the DAL, which in fact will be quite fat.
Another approach for the DAl could be exposing views that always return subsets for one company.
A big reason reason that I would suggest going for the normalized approach is that maintenance will be much easier this way.
From a SQL point of view I don't see any performance advantage having many tables or just 3, efficiency of the indexes, and smart DAL will make the difference.

The performance of the query doesn't much depends on the size of table but it depends more on the indexes you have on that table. so you need to put clustered and non clustered indexes as per your requirement and i can guarantee that up to 10 GB of data you will not face any problem

This is a classic problem shared my most web business services: for discussions of the factors involved, Google "multi-tenant architecture."
You almost certainly want to put all companies into a common set of tables: each data table should reference the company key, and all queries should join on that key, among their other criteria. This allows the best overall performance, and saves you the potential maintenance nightmare of duplicating views, stored procedures and so on hundreds of times, or of having to apply the same structural changes to hundreds of tables should you wish to add a field or a table.
To help assure that you don't inadvertently intermingle data from different customers, it might be useful to do all data access through a validated set of stored procedures (all of which take the company ID as a parameter).
Hundreds of parallel databases will not scale very well: the DB server will constantly be pushing tables and indexes out of memory to accommodate the next query, resulting in disk thrashing and poor performance, as well. There is only pain down that path.

depending on the use-cases of your application there is no "best" way.
Please explain the operations your application will provide so we can get further insight into your problem.
The data to be stored seemed to be structured so a relational database at a first glance would work out well, but stick to the point i marked above.

You have not said how this data links at all or if there are even any links between them. However, at a guess, you need 3 tables.
EmployeeTable
AdminTable
CompanyTable
Each with the required properties in there, without additional information I'm not able to provide any more guidance.

Securely store data for multiple party's in a single table

I'm certainly no DBA and only a beginner when it comes to software development, so any help is appreciated. What is the most secure structure for storing the data from multiple parties in one database? For instance if three people have access to the same tables, I want to make sure that each person can only see their data. Is it best to create a unique ID for each person and store that along with the data then query based on that ID? Are there other considerations I should take into account as well?

You are on the right track, but mapping the USER ID into the table is probably not what you want, because in practice many users have access to the corporations data. In those cases you would store "CorpID" as a column, or more generically "ContextID". But yes, to limit access to data, each row should be able to convey who the data is for, either directly (the row actually contains a reference to CorpID, UserID, ContextID or the like) or it can be inferred by joining to other tables that reference the qualifier.
In practice, these rules are enforced by a middle tier that queries the database, providing the user context in some way so that only the correct records are selected out of the database and ultimately presented to the user.

...three people have access to the same tables...
If these persons can query the tables directly through some query tool like toad then we have a serious problem. if not, that is like they access through some middle tier/service layer or so then #wagregg's solution above holds.
coming to the case when they have direct access rights then one approach is:
create database level user accounts for each of the users.
have another table with row level grant information. say your_table has a primary key column MY_PK_COL then the structure of the GRANTS_TABLE table would be like {USER_ID; MY_PK_COL} with MY_PK_COL a foreign key to your_table.
Remove all privileges of concerned users from your_table
Create a view. SELECT * FROM your_table WHERE user_id=getCurrentUserID();
give your users SELECT/INSERT/UPDATE rights on this view.
Most of the database systems (MySQL, Oracle, SQLServer) provide way to get current logged user. (the one used in the connection string). They also provide ways to restrict access to certain tables. now for your users the view will behave as a normal table. they will never know the difference.
a problem happens when there are too many users. provisioning a database level uer account to every one of them may turn difficult. but then DBMS like MsSQLServer can use windows authentication, there by reducing the user/creation problem.
In most of the scenarios the filter at middle tier approach is the best way. but there are times when security is paramount. Also a bug in the middle tier may allow malicious users to bypass the security. SQL injection is one thing to name. then you have to do what you have to do.

It sounds like you're talking about a multi-tenant architecture, but I can't tell for sure.
This SO answer has a summary of the issues, and links to an online article containing details about the trade-offs.

Database Design: does it make sense to duplicate info in this case?

I'm building a service, kind of a social network, that is expected to attract trillions of users. Those users will be able to follow other users. For the case, let's imagine that I'm building Facebook. hah!
Next to each user's name, there will be the number of followers that he has. Something like
SELECT COUNT(*) FROM users_vs_users
WHERE user_followed_id = 'xxx' GROUP BY user_followed;
would work, but doing that for each page reload and checking trillions of users would kill my server.
Is it reasonable to have a field named num_of_followers in the users table for each user, that is updated every time somebody is followed or unfollowed?
Thanks

Yes. Effectively, you are denormalising for performance reasons.

I have another opinion here
Some databases can use memory (plus disk sync) like Oracle times ten and MySQL Cluster
Using memory based database only for data that is frequently accessed usually give great performance that simply make hassles of managing "counting" fields history
Another BIG tip, never optimise unless you have to, try to predict expected traffic for the next couple of months, not years, then you can monitor which queries actually are killing performance or doing too much disk access, just then you'll be able to de-normalize tables according to realistic information, not guesses

In my opinion, any self-respecting DBMS should internally perform such an optimization on its own accord. Or maybe they already do? Is COUNT(*) actually slow? I don't know.
Anyway, why not? Just make sure that "users_vs_users" and "users.num_of_followers" are synchronized at any time.

What is the best database design for thousand rows

I'm about to start a Database Design that will simply manage users under companies.
Each company will have a admin area that can manage users
Each company will have around 25.000 users
Client believes to have around 50 companies to start
My main question is
Should I create tables based on Companies? like
users_company_0001 users_company_0002 users_company_0003 ...
as each company will never use "other" users and nothing will need to sum/count different tables in all user_company (a simple JOIN will do the trick, though it's more expensive (time) it will work as having the main picture, this will never be needed.
or should I just create a users table to have (50 x 25000) 1 250 000 users (and growing).
I'm thinking about the first option, though, I'm not sure how would I use Entity Framework on such layout... I would probably need to go back to the 90's and generate my Data Logic Layer by hand.
has it will be a simple call to Store Procedures containing the Company Id
What will you suggest?
The system application will be ASP.NET (probably MVC, I'm still trying to figure this out as all my knowledge is on webforms, though I saw Scott Hanselman MVC videos - seams easy - but I know it will not be that easy as problems will come and I will take more time to fix them), plus Microsoft SQL.

Even though you've described this as a 1-many relationship, I'd still design the DB as many-to-many to guard against a future change in requirements. Something like:

Having worked with a multi-terabyte SQL Server database, and having experience with hundreds of tables over the course of my career with multi-million rows, I can tell you with full assurance that SQL Server can handle a your company and users tables without partitioning. It's always there when you need it, but your worry shouldn't be about your tables - pick the simplest schema that meets your needs. If you want to do something to optimize performance, your bottleneck will almost assuredly be your disks. Don't buy large, slow disks. Get yourself a bunch of small, high RPM disks and spread your data out across them as much as possible, and don't share disks with your logs and your data. With databases, you're almost always better off achieving performance with good hardware, a good disk subsystem, and proper indexing. Don't compromise and over complicate your schema trying to anticipate performance - you'll regret it. I've seen really big databases where that sort of thing was necessary, but yours ain't it.

re: Should I create tables based on Companies?
yes
like
users_company_0001 users_company_0002 users_company_0003
no, like
companyID companyName, contactID
or should I just create a users table to have (50 x 25000) 1 250 000 users (and growing)
yes

I think you should create separate tables for Company and User. Then
a third table to connect the two: CompanyAdmin. Something like:
Company(Company_Id, Company_name, ...)
User(User_Id, User_name, ...)
CompanyAdmin(Company_id, User_id)
This way you can add users and/or companies without affecting the number
of tables you need to manage. It is generally a bad design where you need
to modify the database (ie. add tables) when new data (companies) are added to the system.
With proper indexing, the join costs in a database containing
a few million rows should not be a problem.
Finally, if you ever need to change or record additional information about
Companies, Users or the relationship between them, this setup should
have the least amount of impact on your application.

Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
You could have default values specified on the table for things that are optional.
With difficulty. With one set of tables it will be a lot easier, and probably faster.
That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.

Generally the idea of creating separate tables for each entity (in this case users) is not a good idea. If each table is separate querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by users, instead using indexes and partitions, would deal with scalability while maintaining performance. if you don't split up the tables manually, this also makes that points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm

It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.

It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case it might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySql, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS Sql Server you can create a partition function on fields on a table; MySql 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.

I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to share the data over e.g. 20 shards -> migrating data between shards.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight