Say there is a website with 100,000 users, each of whom has up to 1000 unique strings attached to them, so that there are at most 100,000,000 strings in total.
Would it be better to have one table in which each string is one record along with its owner's id, so that you end up with a single table of 100,000,000 records with two fields (text and user id)?
Or to have 100,000 tables, one table per user, where the table's name is the user's id and each table holds 1000 records with just one field (the text)?
Or, instead of storing the strings in a database (each string would have a character limit of about the length of an SMS message), to store links to text files, where there are 100,000,000 text files in a directory, each with a unique name (random numbers and/or letters) and containing one of the strings? (Or where each user has a directory and their strings live in files inside it?)
Which would be the most efficient option, the directory approach or the database, and which sub-option of each would be the most efficient?
(This question is obviously theoretical in my case, but what does a site like Twitter do?)
(By efficiency I mean using the least amount of resources and time.)
"Or have 100,000 tables"
For the love of $DEITY, no! This will lead to horrible code - it's not what databases are designed for.
You should have one table with 100,000,000 records. Database servers are built to handle large tables, and you can use indexes and partitioning etc to improve performance if necessary.
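For a concrete sketch of that single table (purely illustrative; the table and column names and MySQL-style syntax are my own assumptions, not from the question):

    -- Hypothetical schema: one row per string, keyed by its owner.
    CREATE TABLE user_strings (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id INT UNSIGNED    NOT NULL,   -- the owning user
        text    VARCHAR(160)    NOT NULL,   -- roughly SMS-length, per the question
        PRIMARY KEY (id),
        KEY idx_user_id (user_id)           -- makes "all strings for user X" an indexed lookup
    );

    -- Fetching one user's strings then touches only the rows it needs:
    SELECT text FROM user_strings WHERE user_id = 42;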
Option #1
It would be simpler to have one table with a user id and the text. It would not be more efficient to create a table for every user.
Though in practice you would want something like a Mongo sharded cluster instead of a lone server running MySQL.
You'd have one table, with an index on USER_ID.
For speed, you can partition the table, replicate it, use caching, the cloud, sharding, and so on.
Please consider NoSQL databases: http://nosql-database.org/
Definitely one table, filled with records keyed by user id. The OS will crawl with a directory structure of 100,000 file names to sort through... the directory management alone will KILL your performance (at the OS level).
It depends on how much activity the server has to handle.
A few months ago we built a system that indexed ~20 million Medline article abstracts, each of which is longer than your Twitter-length message.
We put everything in a single Lucene index that was ~40 GB.
Even though we had modest hardware (2 GB of RAM and no SSDs - poor interns), we were able to run searches for ~3 million terms against the index in a few days.
A single table (or Lucene index) should be the way to go.
Related
I am using SQLite in my application. The scenario is that I have stock market data, and each company is a database with one table. That table stores records which can range from a couple of thousand to half a million.
Currently, when I update the data in real time, I open a connection, check whether that particular record already exists, insert it if it does not, and then close the connection. This is done in a loop so that each database (representing a company) is updated. The number of records inserted is low and is not the problem, but is the process okay?
An alternative would be to have one database with many tables (each company could be a table), where each table can have a lot of records. Is this better or not?
You can expect around 500 companies. I am coding in VS 2010. The language is VB.NET.
The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
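A minimal sketch of that normalized layout in SQLite (the table and column names here are invented, not from the question):

    -- One table for all companies; the company is just a column.
    CREATE TABLE quotes (
        company TEXT    NOT NULL,
        ts      INTEGER NOT NULL,   -- e.g. a Unix timestamp
        price   REAL    NOT NULL,
        UNIQUE (company, ts)        -- also gives you an index usable for per-company queries
    );

    -- The "check if it exists, insert if not" loop collapses into one statement:
    INSERT OR IGNORE INTO quotes (company, ts, price) VALUES ('ACME', 1672531200, 12.34);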
I did something similar, with similarly sized data in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (one table per file, each file representing a cohesive unit, in your case one company). You also gain the advantage of each company's table having the same name, versus having many differently named tables that share the same schema (and no sanitizing of company names is required to create new tables).
Internally, other DBMSs often keep at least one file per table in their internal structure; SQL is thus just a layer of abstraction above that. SQLite (despite its creators' boasting) is meant for smaller projects, and querying larger data models gets more finicky to make it work well.
I am writing a service that will be creating and managing user records. 100+ million of them.
For each new user, the service will generate a unique user id and write it to the database. The database is sharded on the unique user id that gets generated.
Each user record has several fields. One of the requirements is that the service be able to check whether a user with a matching field value exists, so those fields are declared as indexes in the database schema.
However, since the database is sharded on the primary key (the unique user id), I would need to search every shard to find a user record that matches a particular column.
To make that lookup fast, one thing I am thinking of doing is setting up an ElasticSearch cluster. The service will write to the ES cluster every time it creates a new user record, and the ES cluster will index the user record on the relevant fields.
My questions are:
-- What kind of performance can I expect from ES here, assuming I have 100+ million user records and 5 columns of each record need to be indexed? I know it depends on the hardware configuration as well, but please assume well-tuned hardware.
-- Here I am trying to use ES as a memcache alternative that supports multiple keys, so I want the whole dataset to be in memory and it does not need to be durable. Is ES the right tool for that?
Any comments/recommendations based on experience with ElasticSearch on large datasets are very much appreciated.
ES is not explicitly designed to run completely in memory - you normally wouldn't want to do that with large unbounded datasets in a Java application (though you can, using off-heap memory). Rather, it'll cache what it can and rely on the OS's disk cache for the rest.
100+ million records shouldn't be an issue at all, even on a single machine. I run an index consisting of 15 million records of ~100 small fields (no large text fields), amounting to 65 GB of data on disk, on a single machine. Fairly complex queries that just return id/score execute in less than 500 ms; queries that require loading the documents return in 1-1.5 seconds on a warmed-up VM against a single SSD. I tend to give the JVM 12-16 GB of memory - any more and I find it's better to scale out via a cluster than a single huge VM.
I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another faster database system.
I use intermediary tables to calculate the scores but in some cases, I have to use SUM() or AVERAGE() directly on some filtered rowsets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage it by themselves.
I have to use SUM() or AVERAGE()
thanks
Just make sure you have the correct indexes, and selecting should be quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems in selecting, filtering or upserting data if you index on relevant keys as #Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently then the quickest way to improve their performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns x 8 bytes x millions of rows ≈ hundreds of MB - not really an issue (unless you're on a shared host). Most engines will let you tune this; if your RDBMS does not support tuning it for a specific table, then simply put the table in a different database schema - that shouldn't be a problem since you're not doing any joins on this table.
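To make the indexing advice concrete (all names below are hypothetical; MySQL syntax): a composite index on the filtering column plus the aggregated column lets the engine answer SUM()/AVG() from the index rather than with a full table scan, and a unique key enables the upsert asked about in the question.

    CREATE TABLE scores (
        user_id    INT UNSIGNED NOT NULL,
        game_id    INT UNSIGNED NOT NULL,
        created_at INT UNSIGNED NOT NULL,     -- timestamp stored as an integer
        score      INT          NOT NULL,
        UNIQUE KEY uq_user_game (user_id, game_id),
        KEY idx_game_score (game_id, score)   -- covers the aggregate query below
    ) ENGINE=InnoDB;

    -- With idx_game_score, this can be answered from the index
    -- instead of scanning the whole table:
    SELECT SUM(score), AVG(score) FROM scores WHERE game_id = 7;

    -- The upsert mentioned in the question:
    INSERT INTO scores (user_id, game_id, created_at, score)
    VALUES (1, 7, 1672531200, 100)
    ON DUPLICATE KEY UPDATE score = VALUES(score), created_at = VALUES(created_at);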
I would like to be able to swap one table partition for another, just by replacing partitionN.ndf before starting up the server.
The general aim is to be able to split some sets of table rows out into different files so that when the app is installed, it ships with only one set. There are some rows that are always needed, so:
Scenario A
ID Game Strategy
1 Squash Stick to the T
2 Racketball Drop it at the back
3 Tennis Serve to the backhand
1000 Croquet The key is to be really mean
1001 Billiards Glare a lot
Scenario B
ID Game Strategy
1 Squash Stick to the T
2 Racketball Drop it at the back
3 Tennis Serve to the backhand
1000 Baseball Favour third
1002 Pool Snooker them, be irritating
Here I would partition out the IDs from 1000 upwards and keep the low numbers in the common database. There will be lots of IDs in the scenario-specific partitions that need to maintain referential integrity with tables in the common database.
Would that work? Or would I need to issue some partitioning command to the server to replace it while the server is running? I suppose part of the question is: does the server just start up and read the files, or does it maintain caches and other things that would be sensitive to the replacement?
I do not think it will work at all. The file is a far more complex structure than a single table (GAM, SGAM, PFS, and file header pages), and a partitioned table has a HoBT ID per partition within the table; your new file will not have the same HoBT ID for the IAM, etc.
Edit:
Your example is not the problem that partitioning is designed to solve. You are basically trying to have a table pre-populated with a certain portion of rows that are fixed and a number of rows that vary based on installation criteria.
Personally, I suggest you rule out partitioned tables for this right away; they are not the right tool for the job. You could instead split the values into two physical tables and then place a view on top of them, unioning the two tables together.
This at least means you are only trying to replace a table, not an individual partition - but I still wouldn't like that approach. If I have enough privileges post-install to add and remove filegroups / files, then I have enough privileges to use a proper data-loading routine and just load the data as required.
If you needed physical separation of the fixed and variable portions of the values, you could still use the view approach afterwards, if required.
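A sketch of that two-tables-plus-view arrangement, using the shape of the example rows (all object names here are invented):

    CREATE TABLE dbo.GameStrategy_Common (
        ID       INT           NOT NULL PRIMARY KEY,
        Game     NVARCHAR(100) NOT NULL,
        Strategy NVARCHAR(400) NOT NULL
    );

    CREATE TABLE dbo.GameStrategy_Scenario (   -- loaded per installation (scenario A or B)
        ID       INT           NOT NULL PRIMARY KEY,
        Game     NVARCHAR(100) NOT NULL,
        Strategy NVARCHAR(400) NOT NULL
    );
    GO

    CREATE VIEW dbo.GameStrategy
    AS
        SELECT ID, Game, Strategy FROM dbo.GameStrategy_Common
        UNION ALL
        SELECT ID, Game, Strategy FROM dbo.GameStrategy_Scenario;
    GO

Note that foreign keys cannot reference a view, so the referential-integrity links mentioned in the question would still have to point at the underlying base tables.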
Let's say you're creating a database to store messages for a chat room application. There's an infinite number of chat rooms (they're created at run-time on-demand), and all messages need to be stored in the database.
Would it be a mistake to create one giant table to store messages for all chat rooms, knowing that there could eventually be billions of records in that one table?
Would it be more prudent to dynamically create a table for each room created, and store that room's messages only in that table?
It would be proper to have a single table. When you have n tables that grow with application usage, you're describing using the database itself as a table of tables, which is not how an RDBMS is designed to work. Billions of records in a single table is trivial on a modern database. At that level, your only performance concerns are good indexes and how you do joins.
Billions of records?
Assuming you constantly have 1,000 active users posting 1 message per minute, this results in about 1.5 million messages per day and roughly 500 million messages per year.
If you still need to store chat messages several years old (what for?), you could archive them into year-based tables.
I would definitely argue against dynamic creation of room-based tables.
Whilst a table per chat room is possible, each database has limits on the number of tables that can be created, so given an effectively infinite number of chat rooms you would need an infinite number of tables, which is not going to work.
You can, on the other hand, store billions of rows of data; storage is not normally the issue, given the space. Retrieving the information within a sensible time frame is, however, and requires careful planning.
You could partition the messages by a date range, and if planned out, you can use LUN migration to move older data onto slower storage, whilst leaving more recent data on the faster storage.
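For example (a sketch only, with MySQL-style syntax and invented names), a single messages table range-partitioned by date keeps old partitions easy to move or drop:

    CREATE TABLE messages (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        room_id    BIGINT UNSIGNED NOT NULL,
        created_at DATETIME        NOT NULL,
        body       TEXT            NOT NULL,
        PRIMARY KEY (id, created_at),              -- the partition key must appear in every unique key
        KEY idx_room_created (room_id, created_at)
    )
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );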
Strictly speaking, your design is right: a single table. For fields with low entropy (e.g. 'userid') you want to link out to ID tables, i.e. follow normal database normalization patterns.
You might want to think about range-based partitioning, e.g. 'copies' of your table with a year prefix, or maybe even just a 'current' table and an archive table.
Both of these approaches mean your query semantics are more complex (consider someone doing a multi-year search): you would have to query multiple tables.
However, the upside is that your 'current' table will remain at a roughly constant size, and archiving is more straightforward (you can just drop the 2005_Chat table when you want to archive the 2005 data).
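A rough sketch of that year-prefixed approach (table and column names are adapted for illustration; MySQL syntax):

    -- Current messages live in chat_current; at year end, old rows are moved
    -- into a per-year table that can later be dropped wholesale.
    CREATE TABLE chat_2005 LIKE chat_current;
    INSERT INTO chat_2005
        SELECT * FROM chat_current WHERE created_at < '2006-01-01';
    DELETE FROM chat_current WHERE created_at < '2006-01-01';

    -- A multi-year search has to union the tables:
    SELECT * FROM chat_current WHERE room_id = 42
    UNION ALL
    SELECT * FROM chat_2005    WHERE room_id = 42;

    -- Retiring a whole year later becomes a single statement:
    DROP TABLE chat_2005;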
-Ace