Database Design: does it make sense to duplicate info in this case? - database

I'm building a service, kind of a social network, that is expected to attract trillions of users. Those users will be able to follow other users. For the case, let's imagine that I'm building Facebook. hah!
Next to each user's name, there will be the number of followers that he has. Something like
SELECT COUNT(*) FROM users_vs_users
WHERE user_followed_id = 'xxx' GROUP BY user_followed_id;
would work, but doing that for each page reload and checking trillions of users would kill my server.
Is it reasonable to have a field named num_of_followers in the users table for each user, that is updated every time somebody is followed or unfollowed?
Thanks

Yes. Effectively, you are denormalising for performance reasons.

I have another opinion here
Some databases can work from memory (with disk sync), like Oracle TimesTen and MySQL Cluster.
Using a memory-based database only for the data that is frequently accessed usually gives performance good enough to spare you the hassle of maintaining "counter" fields.
Another BIG tip: never optimise unless you have to. Try to predict the expected traffic for the next couple of months, not years, and then monitor which queries are actually killing performance or doing too much disk access. Only then will you be able to denormalise tables based on realistic information, not guesses.

In my opinion, any self-respecting DBMS should internally perform such an optimization on its own accord. Or maybe they already do? Is COUNT(*) actually slow? I don't know.
Anyway, why not? Just make sure that "users_vs_users" and "users.num_of_followers" stay synchronized at all times.
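One hedged way to keep them synchronized, assuming PostgreSQL 11+ and a users table with an id primary key (the id column is my assumption; the other names come from the question), is to let a trigger apply the increment/decrement in the same transaction as the follow/unfollow:

-- Sketch only: keep users.num_of_followers in step with users_vs_users.
CREATE OR REPLACE FUNCTION sync_follower_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE users SET num_of_followers = num_of_followers + 1
        WHERE id = NEW.user_followed_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE users SET num_of_followers = num_of_followers - 1
        WHERE id = OLD.user_followed_id;
    END IF;
    RETURN NULL;  -- AFTER row trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_follower_count
AFTER INSERT OR DELETE ON users_vs_users
FOR EACH ROW EXECUTE FUNCTION sync_follower_count();

The same bookkeeping can live in application code instead; the important part is that the counter update and the follow/unfollow happen in one transaction.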

Related

How to query when data is spread into different microservices?

I'm new on microservices architecture and I'm facing this problem:
I have a platform where basically Users manage the accounting of their Clients.
I have one microservice in charge of the security. This one manages which Users have access to which Clients.
Then I have another microservice that manages the Invoices of the Clients.
One of the functions here would be: given that a User is logged in, list all the Invoices of all the Clients that the User has access to.
For that, I thought that I should ask the Security microservice to give me the list of the Clients the User has access to. And then, I go to the database of Invoices and query, filtering by all those clients.
The problem is that I end up with a horrible query, as it's something like:
SELECT * FROM Invoice WHERE clientId IN (CLI1, CLI2, CLI3, ...) -- Potentially 200 clients
I thought about keeping a copy of the User-Client relation in the Invoice database, or having both microservices share the same database. But neither option convinces me, as I have more microservices that may face the same problem, leading to a huge repetition of data or to a big monolithic database.
Is there a better way to do this?
Thanks in advance!
In general, database access is restricted across services, and keeping information you do not own tends to be tiresome, because you will always be fighting to keep that data in sync with its intended source of truth.
So the only option that you have is what you have already mentioned in the question.
You end up with a horrible query.
But is it horrible to write, or does it actually lag in performance?
It depends. If you are using MySQL, you can always perform joins over subqueries,
but joins have their own cost.
Even then, it is worth checking whether subqueries give you the expected performance in these cases.
If it's feasible, you can explore other databases that are better optimised for queries like these.
Or, worst case, you can copy the data, but you will have to put in a lot of effort to keep it in sync.
None of these choices is straightforward, and there are trade-offs you will need to make a call on.
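If the long IN (...) list itself is what bothers you, one variation (a sketch only, assuming PostgreSQL-style syntax and the column names from the question) is to feed the client ids returned by the security microservice into the query as a derived table:

-- Sketch: the ids from the security service become a small inline relation.
SELECT i.*
FROM Invoice AS i
JOIN (VALUES ('CLI1'), ('CLI2'), ('CLI3')) AS allowed(clientId)
  ON i.clientId = allowed.clientId;

Whether this beats the plain IN list is something only EXPLAIN on your real data can tell you; the bigger trade-off (copying the relation versus querying across services) stays the same.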

Storing large amounts of data in a database

I'm currently working on a home-automation project which provides the user with the possibility to view their energy usage over a period of time. Currently we request data every 15 minutes and we are expecting around 2000 users for our first big pilot.
My boss is requesting that we store at least half a year of data. A quick sum leads to estimates of around 35 million records. Though these records are small (around 500 bytes each), I'm still wondering whether storing these in our database (Postgres) is a correct decision.
Does anyone have some good reference material and/or advice about how to deal with this amount of information?
For now, 35M records of 0.5K each is about 17.5G of data. This fits in a database for your pilot, but you should also think of the next step after the pilot. Your boss will not be happy when the pilot is a big success and you tell him that you cannot add 100,000 users to the system in the coming months without redesigning everything. Moreover, what about a new feature for VIP users to request data every minute...
This is a complex issue and the choice you make will restrict the evolution of your software.
For the pilot, keep it as simple as possible to get the product out as cheaply as possible --> ok for a database. But tell your boss that you cannot open up the service like that, and that you will have to change things before taking on 10,000 new users per week.
One thing for the next release: have several data repositories: one for your user data that is updated frequently, one for your query/statistics system, ...
You could look at RRD for your next release.
Also keep in mind the update frequency: 2000 users updating data every 15 minutes means 2.2 updates per second --> ok; 100,000 users updating data every 5 minutes means 333.3 updates per second. I am not sure a simple database can keep up with that, and a single web service server definitely cannot.
We frequently hit tables that look like this. Obviously structure your indexes based on usage (do you read or write a lot, etc), and from the start think about table partitioning based on some high level grouping of the data.
Also, you can implement an archiving scheme to keep the live table thin. Historical records are either never touched or only reported on, and neither belongs in a live table, in my opinion.
It's worth noting that we have tables around 100m records and we don't perceive there to be a performance problem. A lot of these performance improvements can be made with little pain afterwards, so you could always start with a common-sense solution and tune only when performance is proven to be poor.
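As an illustration of the partitioning and archiving points above, here is a sketch assuming PostgreSQL 10+ declarative partitioning; the table, columns, and date boundaries are invented for the example:

-- Sketch: range-partition the measurements by month so the live partition stays small.
CREATE TABLE readings (
    user_id     bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY RANGE (recorded_at);

CREATE TABLE readings_2024_01 PARTITION OF readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE readings_2024_02 PARTITION OF readings
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Archiving becomes detaching an old partition rather than deleting rows one by one.
ALTER TABLE readings DETACH PARTITION readings_2024_01;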
With appropriate indexes to avoid slow queries, I wouldn't expect any decent RDBMS to struggle with that kind of dataset. Lots of people are using PostgreSQL to handle far more data than that.
It's what databases are made for :)
First of all, I would suggest that you run a performance test - write a program that generates test entries corresponding to the number of entries you'll see over half a year, insert them, and check that query times are satisfactory. If not, try indexing as suggested by the other answers. It is, by the way, also worth testing write performance, to ensure that you can actually insert the amount of data you're generating every 15 minutes in... 15 minutes or less.
Making a test will avoid the mother of all problems - assumptions :-)
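A minimal sketch of such a test, assuming PostgreSQL (all names here are invented): generate roughly half a year of 15-minute samples for 2000 users, around 35 million rows, and then time the queries the application will actually run.

-- Sketch: load ~35M synthetic samples and measure a typical lookup.
CREATE TABLE readings_test (
    user_id     bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     jsonb
);

INSERT INTO readings_test (user_id, recorded_at, payload)
SELECT u, ts, jsonb_build_object('kwh', random())
FROM generate_series(1, 2000) AS u,
     generate_series(now() - interval '6 months', now(), interval '15 minutes') AS ts;

CREATE INDEX ON readings_test (user_id, recorded_at);

EXPLAIN ANALYZE
SELECT recorded_at, payload->>'kwh'
FROM readings_test
WHERE user_id = 42
  AND recorded_at >= now() - interval '7 days';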
Also think about production performance - your pilot will have 2000 users - will your production environment have 4000 users or 200000 users in a year or two?
If we're talking a really big environment, you need to think about a solution that allows you to scale out by adding more nodes instead of relying on always being able to add more CPU, disk and memory to a single machine. You can either do this in your application by keeping track on which out of multiple database machines is hosting details for a specific user, or you can use one of the Postgresql clustering methods, or you could go a completely different path - the NoSQL approach, where you walk away completely from RDBMS and use systems which are built to scale horizontally.
There are a number of such systems. I only have personal experience of Cassandra. You have to think completely differently compared to what you're used to from the RDBMS world, which is something of a challenge - think more about how you want to access the data rather than how to store it. For your example, I think storing the data with the user id as the key, then adding a column per sample, with the column name being the timestamp and the column value being your data for that timestamp, would make sense. You can then ask for slices of those columns, for example for graphing results in a Web UI - Cassandra has good enough response times for UI applications.
The upside of investing time in learning and using a NoSQL system is that when you need more space, you just add a new node. The same goes for when you need more write performance, or more read performance.
Are you not better off not keeping individual samples for the full period? You could implement some sort of consolidation mechanism that combines weekly/monthly samples into one record, and run that consolidation on a schedule.
Your decision has to depend on the type of queries you need to be able to run on the database.
There are lots of techniques to handle this problem. You will only get good performance if you touch the minimum number of records. In your case, you can use the following techniques.
Try to keep old data in a separate table. Here you can use table partitioning, or take a different approach and store the old data in the file system, serving it directly from your application without connecting to the database; this way your database stays free. I am doing this for one of my projects, which already has more than 50GB of data, and it is running very smoothly.
Try to index table columns, but be careful, as it will affect your insertion speed.
Try batch processing for your insert or select queries; you can handle this issue very smartly here.
Example: suppose you get a request to insert a record into some table every second. You can build a mechanism that processes these requests in batches of 5 records; this way you hit your database every 5 seconds, which is much better. Yes, you can make users wait about 5 seconds for their record to be inserted, like Gmail, which asks you to wait while it is processing after you send an email. For selects, you can periodically write your result set to the file system and serve it directly to users without touching the database, as most stock market data companies do.
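As a tiny illustration of the batching point (table and column names are invented), five queued records become one round trip instead of five:

-- Sketch: one multi-row INSERT per batch instead of one statement per record.
INSERT INTO readings (user_id, recorded_at, kwh) VALUES
    (1, '2024-01-01 00:00', 0.12),
    (2, '2024-01-01 00:00', 0.34),
    (3, '2024-01-01 00:00', 0.08),
    (4, '2024-01-01 00:00', 0.51),
    (5, '2024-01-01 00:00', 0.27);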
You can also use an ORM like Hibernate; it can use caching techniques to boost the speed of your data access.
For any further questions you can mail me at ranjeet1985#gmail.com

Will creating separate databases in SQL Server give me better performance?

All, I'm a programmer by trade, but for this particular project I'm finding myself being the DBA as well. Here is the scenario I'm faced with:
Web app with anywhere from 400-1000 customers. A customer is a "physical company", each of which has n-number of users. Each customer (company) has on average 1GB worth of data (a total of about 200 million rows). Each company has probably 80% similar data in terms of the type of data stored. The other 20% is custom data that the companies can define themselves (basically custom fields).
I am trying to figure out the best way to scale this on the cheap when you consider that the customers need pretty good reaction time. For example, customer X might want to grab all records where last name is like 'smith' and phone is like '555', whereas customer Y might want to grab all records where the account number equals '1526A'.
Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI.
My question is, what would you do? Do you think it would be wise to break each customer out into its own DB? Total DB size at the moment is around 400GB.
It is a complete re-write so I have the fortune of being able to start fresh if needed. Any thoughts, hints would be greatly appreciated.
Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI.
Bottom line, you're ceding your DB performance to the whims of your clients. If they're able to "create their own query", then they're able to "create their own REALLY BAD queries".
So, if you run this in a shared environment (i.e. the same hardware), then customer A's awful table scans can saturate the I/O for everyone else.
If they're on the same database server, then Customer A's scans get to flush all of your other customers' data from the data cache.
Basically, the more you "share", the more one customer can impact the operations of other customers. If you give customers the capability to do expensive things, and share much of it, then everyone suffers.
So, the options are a) don't let the customers do silly things or b) keep the customers as separated as practical so that when one does do silly things, the phones don't light up from all of the other customers.
If you don't know "what to index" then you are not offering much control over what the customers can do, and thus the silly thing factor goes way up.
You would probably get quite far by offering several popular, pre-made SQL views that the customers can select from, and then they're limited to simply filtering and possibly ordering the results. Then you optimize around execution of those views.
It's likely that surprisingly few "general" views can cover a large amount of the use cases.
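A sketch of what such a pre-made view might look like; every table, column, and view name here is invented for illustration:

-- Sketch: a curated view that fixes the joins; the UI only filters and orders it.
CREATE VIEW customer_record_search AS
SELECT r.client_id,
       p.last_name,
       p.phone,
       r.account_number
FROM   customer_record AS r
JOIN   person AS p ON p.person_id = r.person_id;

-- The customer's ad-hoc query is then limited to something like:
SELECT *
FROM customer_record_search
WHERE client_id = 42
  AND last_name LIKE 'smith%'
  AND phone LIKE '%555%'
ORDER BY last_name;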
Generic, silly queries can be delegated to a batch process that runs overnight, during off hours, or to a separate machine that doesn't impact transactional performance, such as a nightly snapshot with "everything but today's data" on it. Let them run historic queries against that.
The SO question How to design a multi tenant database has a link to a decent article on the tradeoffs along the spectrum from "shared nothing" to "shared everything". Also, SO has a tag for those kinds of questions; I added it for you.
Creating separate databases on the same server won't help you get better performance. The performance optimisations available to you with multiple databases are just the same as you can achieve with one database.
Separate databases might make sense for administrative reasons - if different backup or availability requirements apply to different customers for example.
It's still probably sensible to build your application so that it can support multiple databases so that you have the option of scaling out over multiple DB servers.
If you have separate databases, the 80% that is the same becomes almost impossible to keep the same over time. You will end up spending far more money on maintenance.
Luckily, SQL Server has some options for you. First, put the customer-specific information in the same database in a separate schema and the common stuff in a different schema (create a common schema and a schema for each client).
Next, set up data partitioning by client. This can require the proper hardware to do effectively.
Now you have one code base for the common parts, which will propagate changes to all clients at once, and clients are separated for performance using the partitions.
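A rough T-SQL sketch of that layout; the schema names, table, and partition boundaries are placeholders, not anything from the question:

-- Sketch: one common schema plus one schema per client, with the big shared table partitioned by client id.
CREATE SCHEMA common;
GO
CREATE SCHEMA client_acme;
GO

CREATE PARTITION FUNCTION pf_client (int)
    AS RANGE RIGHT FOR VALUES (100, 200, 300);  -- boundary values are placeholders
CREATE PARTITION SCHEME ps_client
    AS PARTITION pf_client ALL TO ([PRIMARY]);
GO

CREATE TABLE common.CustomerRecord (
    ClientId  int           NOT NULL,
    RecordId  bigint        NOT NULL,
    LastName  nvarchar(100) NULL,
    Phone     nvarchar(30)  NULL,
    CONSTRAINT PK_CustomerRecord PRIMARY KEY (ClientId, RecordId)
) ON ps_client (ClientId);

-- The 20% of client-specific custom fields lives in that client's own schema.
CREATE TABLE client_acme.CustomField (
    RecordId   bigint        NOT NULL,
    FieldName  nvarchar(100) NOT NULL,
    FieldValue nvarchar(400) NULL
);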

Facebook database design...why have a profile table?

See image below
Since an account has a 1:1 relationship with a profile, why have a profile table? What is the purpose of the profile table, apart from storing the status? Why not include status in the Account table and make a direct relationship from the "account" table to BasicInformation, PersonalInformation, etc.?
http://i.stack.imgur.com/u7GKB.jpg
If, at some future time, you change the model so that one account can have more than one profile, you are much better off with two tables than with just one.
With regard to the cost of joins, you need to quantify that, and decide where a speed difference just isn't worth worrying about. Excessive fear of slowing things down with joins is one of the most common newbie mistakes with relational databases.
Some ideas and educated guesses:
At the conceptual level, an account and a profile are two different things.
Adding the profile status to the account table makes that table wider and slower.
Since status holds only your most recent post (is that right?), that table can be put on a separate tablespace, probably on an insanely fast disk array for fast lookups. Status is probably looked up much more often than anything in the account table.
Security is simpler to administer. Lots of third-party apps might be allowed access to your status, but they shouldn't necessarily have access to your email address and password. Physical isolation (separate tables) is pretty easy to get obviously right.
I guess it's because not every Account will have a profile associated with it, i.e. the relationship is actually 1:0..1, not 1:1.
It's just a matter of abstraction.
An account has profile data in it. So, it has an instance (table) of a profile.
This way you can access profile data separately, and maybe in the future add more data to the account.
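A small sketch of the split being discussed, with invented column names, showing how a UNIQUE foreign key keeps the relationship at 1:0..1:

-- Sketch: account and profile as separate tables, at most one profile per account.
CREATE TABLE account (
    account_id    bigint       PRIMARY KEY,
    email         varchar(255) NOT NULL UNIQUE,
    password_hash varchar(255) NOT NULL
);

CREATE TABLE profile (
    profile_id bigint PRIMARY KEY,
    account_id bigint NOT NULL UNIQUE REFERENCES account (account_id),  -- UNIQUE enforces 1:0..1
    status     text
);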

'Followers' and efficiency

I am designing an app that would involve users 'following' each other's activity, in the twitter sense, but I am not very experienced with database/query design/efficiency. Are there best practices for managing this, pitfalls to avoid, etc.? I gather this can create a very large load on the db if not done properly (or maybe even then?).
If it makes a difference it is likely that people will 'follow' only a relatively small number of people (but a person may have many followers). However this is not certain, and I wouldn't want to count on it.
Any advice gratefully received. Thanks.
Pretty simple and easy to do with full normalisation. If you have a table of users, each with a unique ID, you would have a TABLE_FOLLOWERS table with the columns USERID and FOLLOWERID, which describes all the followers for each user as a one-to-many relationship.
Even with millions of associations, this will perform well and fast on a half-decent database server, as long as you are using a good database (i.e., not MS Access).
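A minimal sketch of that table, assuming the column names from this answer and a users table with a bigint id primary key (the id column is my assumption):

-- Sketch: the follower relationship as a plain join table.
CREATE TABLE TABLE_FOLLOWERS (
    USERID     bigint NOT NULL REFERENCES users (id),
    FOLLOWERID bigint NOT NULL REFERENCES users (id),
    PRIMARY KEY (USERID, FOLLOWERID)  -- covers "who follows user X"
);

-- Reverse lookup: "which users does follower Y follow".
CREATE INDEX idx_followers_by_follower ON TABLE_FOLLOWERS (FOLLOWERID);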
The model is fairly simple. The problem is in the size of the Subscription table; if there are 1 million users, and each subscribes to 1000, then the Subscription table has 1 billion rows.
That depends on how many users you expect to need to support; how many followers you expect users to have; and what sort of funding/development-effort you expect to have access to should your answers to the previous questions prove optimistic.
For a small-scale project I would likely ignore the database and design the application as a simple object model with User objects that maintain a List[followers]. Keep it all in RAM for normal operation and use an ORM to persist to a database periodically (probably PostgreSQL or MySQL).
For a larger project I would not be using a relational database at all; but exactly what I would use would depend on the specific details of the project.
If you are only trying to spike the concept, go with the ORM approach; but, keep in mind it won't scale.
You probably should read http://highscalability.com/ and its articles on how this is managed by the big sites.
