I'm planning a web app backed by a database that stores large numbers. I'm wondering whether an integer could actually need more space than storing the same number as a string.
Normally an integer is stored as a base-2 number:
That means for 0 and 1 I need 1 bit, while I would need 8 bits to write them as a char.
Writing 2 I would need 2 bits, but still 8 bits as a char.
Is there something like a break-even point for this? If so, at what number does it occur?
Thanks so far.
Optimizing storage at the bit level is not something that people using databases do (with perhaps some minor exceptions).
You are using the database for its ACID properties, and perhaps for its ability to query and manage data. You are using it because it scales easily, manages multiple processors, manages multiple disks, and manages memory hierarchies. You are not using it because it stores the data in the smallest amount of space.
You should worry about other aspects of your application and the data model you want to use.
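That said, for a rough feel of where the break-even point the question asks about would sit, here is a back-of-the-envelope sketch. It assumes a fixed-width 4-byte integer column and one byte per character for the textual form; real engines add their own per-value overhead, so treat the exact numbers as illustrative.

```python
# Rough comparison of fixed-width integer storage vs. textual storage.
# Assumes a 4-byte INT column and 1 byte per character for the string form;
# actual on-disk sizes depend on the database engine and its per-value overhead.

INT_BYTES = 4

for n in (7, 99, 999, 9999, 2_000_000_000):
    as_text = len(str(n))          # one byte per digit
    verdict = "text smaller" if as_text < INT_BYTES else "int smaller or equal"
    print(f"{n:>13}: int={INT_BYTES} B, text={as_text} B -> {verdict}")

# Only numbers with fewer than 4 digits take less space as text than as a 4-byte int.
```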
Related
I am working on a project in which we collect a lot of data from many sensors. These sensors, in many cases, return low-precision floats represented as 1- and 2-byte integers. These integers are mapped back to floats via some simple relation; for instance,
x_{float} = (x_{int} + 5) / 3
Each sensor returns 70+ variables of this kind.
Currently we expect to store at least 10+ million entries per day, possibly even 100+ million. However, we only need 2 or 3 of these variables on a daily basis; the others will rarely be used (we need them for modeling purposes).
So, in order to save some space, I was considering storing these low-precision integers directly in the DB instead of the float values (with the exception of the 2-3 variables we read regularly, which will be stored as floats to avoid the constant overhead of mapping them back from ints). In theory, this should reduce the size of the database by almost half.
My question is: is this a good idea? Will it backfire when we have to map all the data back to floats to train models?
Thanks in advance.
P.S. We are using Cassandra; I don't know if this is relevant to the question.
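If you do store the raw integers, the mapping is cheap to apply at read time. Below is a minimal sketch using the example relation from the question; the constants 5 and 3 are taken from that example, and in practice each variable would carry its own offset/scale pair.

```python
# Encode/decode helpers for the example relation x_float = (x_int + 5) / 3.
# In a real deployment, each sensor variable would carry its own offset/scale constants.

OFFSET = 5
SCALE = 3

def decode(x_int: int) -> float:
    """Map the stored low-precision integer back to its float value."""
    return (x_int + OFFSET) / SCALE

def encode(x_float: float) -> int:
    """Map a float back to the integer the sensor would have reported."""
    return round(x_float * SCALE) - OFFSET

assert decode(encode(3.0)) == 3.0
print(decode(10))   # 5.0
```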
We want to use artificial keys in our data warehouse.
Is it a good idea to use bigint as the data type for the artificial keys? I think that, because it's 64-bit, the CPU can process it very fast. Or is this wrong?
I basically use the smallest type possible, in order to read less data when operations are performed.
So, depending on the range, choose the best type for you. I suspect INT will be too small for a data warehouse (remember when the YouTube view counter broke after a video hit 2,147,483,647 views, the upper bound of the INT range).
How fast will CPU processing be if the key is BIGINT? Don't worry about such things; processing speed depends on other factors: indexes, row size, execution plans, whether the engine can process rows in batches, and so on. How well the engine handles a particular data type is the least of your worries, and even if it were slow, I doubt you could use INT anyway because of the business requirements.
While your computer will have 64-bit registers, much of the performance challenge is about fitting things into memory. It isn't helpful to your overall performance to make objects larger than they need to be, and 64-bit systems do 32-bit arithmetic just fine.
int is a pretty useful data type, and even if you stick to positive values (which I suggest you do, as they compress better), you still have well over 2 billion values. If you're dealing with entities like customers or products, that's never going to be too small.
If you're at a gigantic site and deal with very large numbers of transactions, you might well want to use bigint for those.
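To put the size difference in perspective, here is a rough back-of-the-envelope calculation. The row counts and the assumption that the key is stored in three places (the table itself plus two indexes/foreign-key copies) are made up for illustration, and the figures ignore page overhead, fill factor and compression, which vary by engine.

```python
# Back-of-the-envelope: extra storage for BIGINT (8-byte) vs INT (4-byte) keys.
# "references" counts the places the key is stored (the table plus any indexes
# or foreign-key copies); the factor 3 below is purely illustrative.

def extra_bytes(rows: int, references: int = 1) -> int:
    return rows * 4 * references     # 4 extra bytes per stored copy of the key

for rows in (1_000_000, 100_000_000, 2_000_000_000):
    extra_mib = extra_bytes(rows, references=3) / 1024**2
    print(f"{rows:>13,} rows: ~{extra_mib:,.0f} MiB extra")
```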
I'm a bit confused here... I've been offered a spot on a project involving an array of sensors that give off a reading every millisecond (yes, 1,000 readings per second). Each reading is a 3- or 4-digit number, for example 818 or 1529. These readings need to be stored in a database on a server and accessed remotely.
I have never worked with such large amounts of data. What do you think, how much would one sensor's readings come to per day, in MB? 4 (digits) × 1000 × 60 × 60 × 24 = 345,600,000 bits... right? About 42 MB per day... doesn't seem too bad, right?
Therefore a DB of, say, 1 GB would hold 23 days of data from 1 sensor, correct?
I understand that MySQL and PHP probably would not be able to handle it... What would you suggest, maybe some apps? Azure? Oracle?
A 3- or 4-digit number takes:
4 bytes if you store it as a string;
2 bytes if you store it as a 16-bit (0-65535) integer.
1,000/sec → 60,000/minute → 3,600,000/hour → 86,400,000/day
As string: 86,400,000 × 4 bytes ≈ 329 MB/day
As integer: 86,400,000 × 2 bytes ≈ 165 MB/day
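The same arithmetic as a runnable sketch; it counts only the raw value payload, so timestamps, row headers and indexes would come on top of these figures.

```python
# Daily raw payload for one sensor sampling at 1 kHz.
# Counts only the value itself; timestamps, row headers and indexes come on top.

READINGS_PER_DAY = 1000 * 60 * 60 * 24        # 86,400,000

as_string = READINGS_PER_DAY * 4              # 4 characters at 1 byte each
as_uint16 = READINGS_PER_DAY * 2              # fits a 16-bit integer (0-65535)

print(f"as string: {as_string / 2**20:.1f} MiB/day")   # ~329.6, matching the figure above
print(f"as uint16: {as_uint16 / 2**20:.1f} MiB/day")   # ~164.8
```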
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. Optimizing a DB for large-scale retrieval slows down fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database and doing an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
Most likely you will not need to keep the data at such a high resolution for long, and there are several options for minimizing the volume. First, after some period of time you can collapse the raw data into hourly min/max/avg values; you keep detailed records only for detected anomalies or for situations that by definition require detailed data. Also, many things can be turned into event logging. These approaches were implemented and used successfully a couple of decades ago in some industrial automation systems provided by the company I worked for at the time, when available storage devices were many times smaller than what you can find today.
So, first analyse the data you will be storing, and then decide how to optimize its storage.
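To make the collapse step concrete, here is a minimal sketch that buckets raw (timestamp, value) readings by hour and keeps only min/max/avg per bucket; the one-hour interval and the chosen statistics are just an example policy.

```python
# Collapse raw (timestamp, value) readings into per-hour min/max/avg summaries.
# The one-hour bucket and the statistics kept are only an example policy.

from collections import defaultdict
from statistics import mean

def collapse_hourly(readings):
    """readings: iterable of (unix_timestamp_seconds, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts // 3600].append(value)           # bucket key = hour number
    return {
        hour: {"min": min(vals), "max": max(vals), "avg": mean(vals)}
        for hour, vals in buckets.items()
    }

sample = [(0, 810), (1, 825), (3600, 1529), (3601, 1490)]
print(collapse_hourly(sample))
# {0: {'min': 810, 'max': 825, 'avg': 817.5}, 1: {'min': 1490, 'max': 1529, 'avg': 1509.5}}
```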
Following @MarcB's numbers, 2 bytes at 1 kHz is just 2 KB/s, or 16 Kbit/s. That is not really much of a problem.
I think a sensible and flexible approach is to construct a queue of sensor readings that the database side simply drains until it is empty. At these data rates the problem is not the throughput (which a dial-up modem could handle) but the gap between readings. Any system caching values will need to get out of the way fast enough for the next value to be stored; 1 ms is not a long time to return in, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
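A minimal sketch of that shape is below; `insert_batch` is a placeholder for whatever bulk-insert call your driver provides, and the batch size and flush interval are arbitrary. The sensor side only appends to an in-memory queue, while a background worker drains it and writes in bulk.

```python
# Producer/consumer sketch: sensor readings go into a queue, a background
# worker drains the queue and writes them to the database in batches.
# insert_batch is a placeholder for the real bulk-insert call of your driver.

import queue
import threading
import time

readings: "queue.Queue[tuple[float, int]]" = queue.Queue()

def insert_batch(batch):
    print(f"inserted {len(batch)} rows")      # stand-in for the real DB call

def writer(batch_size: int = 1000, flush_interval: float = 1.0):
    batch, last_flush = [], time.monotonic()
    while True:
        try:
            batch.append(readings.get(timeout=flush_interval))
        except queue.Empty:
            pass
        if len(batch) >= batch_size or (batch and time.monotonic() - last_flush >= flush_interval):
            insert_batch(batch)
            batch, last_flush = [], time.monotonic()

threading.Thread(target=writer, daemon=True).start()

# The sensor side stays cheap: appending to the queue never blocks on the database.
readings.put((time.time(), 818))
time.sleep(2)    # give the background writer a chance to flush in this demo
```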
If you do not need a relational database, you can use a NoSQL database like MongoDB, or even a much simpler solution like JDBM2 if you are using Java.
What are good sizes for data types in SQL Server? When defining columns, I see data types with sizes of 50 as one of the defaults (e.g. nvarchar(50), binary(50)). What is the significance of 50? I'm tempted to use sizes that are powers of 2; is that better, or just pointless?
Update 1
Alright, thanks for your input, guys. I just wanted to know the best way to define the size of a data type for a column.
There is no reason to use powers of 2 for performance or anything else. The length should be determined by the size of the data being stored.
Why not the traditional powers of 2 minus 1, such as 255...
Seriously, the length should match what you need and what is suitable for your data.
Nothing else matters: not how the client uses it, not alignment to a 32-bit word boundary, not powers of 2, birthdays, Scorpio rising in Uranus, or a roll of the dice...
The reason so many fields have a length of 50 is that SQL Server defaults to 50 as the length for most data types where length is an issue.
As has been said, the length of a field should be appropriate to the data being stored there, not least because there is a limit to the length of a single record in SQL Server (it's ~8,000 bytes). It is possible to blow past that limit.
Also, the lengths of your fields can be considered part of your documentation. I can't count how many times I've met lazy programmers who claim they don't need to document because the code is self-documenting, and then don't bother doing the things that would make the code self-documenting.
You won't gain anything from using powers of 2. Make the fields as long as your business needs really require them to be - let SQL Server handle the rest.
Also, since the SQL Server page size is limited to 8K (of which 8,060 bytes are available for user data), making your variable-length strings as small as possible (but as long as needed, from a requirements perspective) is a plus.
That 8K limit is a fixed SQL Server system setting which cannot be changed.
Of course, SQL Server these days can handle more than 8K of data in a row, using so-called "overflow" pages, but it's less efficient, so trying to stay within 8K is generally a good idea.
Marc
The size of a field should be appropriate for the data you are planning to store there, global defaults are not a good idea.
It's a good idea for the whole row to fit into a page several times over without leaving too much free space. A row cannot span two pages, and a page has 8,096 bytes of free space, so two rows that take 4,049 bytes each will occupy two pages.
See the docs on how to calculate the space occupied by one row.
Also note that the VAR in VARCHAR and VARBINARY stands for "varying", so if you put a 1-byte value into a 50-byte column, it will take only 1 byte (plus a small length overhead).
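To make the page arithmetic concrete, here is a small sketch of how many fixed-size rows fit on one page, using the ~8,096 bytes of usable space mentioned above and ignoring per-row overhead (row header, null bitmap, slot array), which the referenced docs cover in detail.

```python
# How many fixed-size rows fit on one data page?
# Uses ~8,096 bytes of usable space per page; per-row overhead is ignored here.

PAGE_FREE_BYTES = 8096

def rows_per_page(row_bytes: int) -> int:
    return PAGE_FREE_BYTES // row_bytes

for row_bytes in (400, 2000, 4049, 4048):
    print(f"{row_bytes:>5}-byte rows: {rows_per_page(row_bytes)} per page")
# 4,049-byte rows fit only once per page, so nearly half of every page is wasted;
# 4,048-byte rows fit exactly twice.
```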
This totally depends on what you are storing.
If you need x chars, use x, not some arbitrary predefined amount.
I am creating a little hobby database-driven browser-based game and I stumbled across this problem: I store money owned by users in a 32-bit integer field (to be precise, two fields: one stores the money in the player's hand, the other the money stored in the bank). We all know that the maximum value that can be stored in 32 bits is 2^32-1.
I am absolutely sure that 95% of players will not be able to reach the upper limit, but on the other hand (and after doing some calculations today) good players will be able to accumulate that much.
Having that in mind I came with the following ideas:
Store money in 64 bits, which doubles the space of each record.
Store money as a string and convert to/from a long long at runtime.
Change the game mechanics so players will not be able to gain that amount of wealth.
I know that the existence of a reachable upper limit is rather limiting for some players, so for me the third option is the worst of those proposed.
Are there any other ways of dealing with this kind of problem? Which one would you go for?
Taking an example from the real world, why not have different denominations of coins, e.g. a column for millions of units of the currency?
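If you take that route, splitting and recombining the balance is straightforward. Below is a hypothetical sketch with two 32-bit-sized columns, one counting whole millions and one the remainder; the column names are made up for illustration.

```python
# Split a large balance across two 32-bit-sized columns: whole millions and remainder.
# Column names (balance_millions, balance_units) are illustrative only.

MILLION = 1_000_000

def to_columns(total: int) -> tuple[int, int]:
    return total // MILLION, total % MILLION        # (balance_millions, balance_units)

def from_columns(millions: int, units: int) -> int:
    return millions * MILLION + units

total = 7_654_321_098
millions, units = to_columns(total)
assert from_columns(millions, units) == total
print(millions, units)   # 7654 321098
```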
Changing to a larger data type is likely the easiest solution, and disk space/memory considerations aren't likely to be significant unless your game is huge in scale. Have 5,000 users playing your game? Changing from 32 bits to 64 bits will consume roughly 20 KB extra per column. That's not enough to lose any sleep over.
The best answer would likely come from someone familiar with how banks handle these types of situations, though their requirements may be far more complicated than what you need.
Memory shouldn't be a problem, depending on the number of players you'll have online simultaneously, but storing the value as a string will definitely use more disk space.
But seriously, 4,294,967,296 rupees/simoleons/furlongs? Who are they? Sim Gates?
Why not store money the way it should be stored, as a money data type? This assumes, of course, that you are using SQL Server. The money data type won't have this limitation and won't be affected by rounding issues.