Hardware specialized for bitmap indexes?

This is just an out-of-curiosity question. Let's say you have a database table with 1M rows in it, and you often want to run queries like male or female, US or non-US, voter or non-voter, etc. It's clearly very efficient to define a bitmap index for the table in which each bit represents one either-or condition.
However, to execute a query, you still have to scan through (probably) all of the index, doing a bitwise AND to select the matching rows.
My question is: is there some kind of bitmap-optimized storage in which the bit 'channels' are pre-created in the hardware? I'm envisaging something similar to knitting needles lifting punched cards out of an old library catalog system. In other words, rather than going row by row through memory locations, the chip can just pull out the matching rows electronically, because there are hardware connections for each bit channel?
I have a feeling the brain must work something like this. If I think of 'all blue objects', then restrict that to 'all long blue objects' and then 'all long blue heavy objects', my brain does it effortlessly, and I'm sure it's not scanning through all the objects I know about every time. It seems as if there are neurons that provide pathways for different dimensions, allowing quick retrieval. I'm just wondering if there's anything like this in the hardware world?
Thanks!

Why invent something that's already there?
Content-addressable memory

You could certainly wire up some logic to perform this (e.g. using programmable logic devices) but you'll need a large number of logic elements and connections, making such circuits probably expensive to build for large databases.
For example, one would have to build matching logic (is this bit being selected on? what is the required value?) into each 'row', giving you one signal (selected/not selected) per row.
You would then have a logic circuit with one million output lines (telling you which records were selected), which you would probably have to 'serialize' at some point anyway, e.g. when you interface with the PCI bus inside a computer (i.e. first transmit the result for record 0, then record 1, etc., or transmit the numbers of the selected records).
As bitwise operations in modern CPUs are fast (logical operations such as bitwise AND, OR and XOR should take only one clock cycle), you're probably not gaining much by using such a custom circuit compared to optimized software (not to mention the hardware development and testing effort), unless you have a very special use case.
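To make the software side concrete, here is a minimal sketch in Python (the row data and column names are made up) of what a bitmap-index scan boils down to: one wide bitwise AND per query, followed by 'serializing' the set bits back into row numbers.

```python
# Minimal sketch of a software bitmap-index scan (hypothetical columns).
# Each index is one big integer; bit i is set if row i matches the condition.

rows = [
    {"name": "Alice", "male": False, "us": True,  "voter": True},
    {"name": "Bob",   "male": True,  "us": False, "voter": True},
    {"name": "Carol", "male": False, "us": True,  "voter": False},
]

def build_bitmap(rows, predicate):
    """Pack one bit per row into a single Python integer."""
    bitmap = 0
    for i, row in enumerate(rows):
        if predicate(row):
            bitmap |= 1 << i
    return bitmap

female = build_bitmap(rows, lambda r: not r["male"])
us     = build_bitmap(rows, lambda r: r["us"])
voter  = build_bitmap(rows, lambda r: r["voter"])

# "female AND US AND voter" is one bitwise AND across the whole table.
match = female & us & voter

# 'Serialize' the result: recover the row numbers of the set bits.
selected = [i for i in range(len(rows)) if match & (1 << i)]
print([rows[i]["name"] for i in selected])   # ['Alice']
```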

What is being distributed in a distributed database?

Processing logic: processing logic or processing elements are distributed
Data: data used by a number of applications may be distributed to a number of processing sites
Control: the control of the execution of various tasks might be distributed instead of being performed by one computer system
Can you please explain these three parts a bit more?
Processing logic: processing logic or processing elements are distributed
This is so you can optimise the query itself when you know that the data the query will be retrieving is 'scattered' across different places (usually across the network).
Consider a situation where you want to, say, get all the employees from a DB, but the actual DB has been broken down horizontally into two fragments, and the employees you want exist in only one of the two fragments. If you don't distribute the processing logic, you will have to put the two fragments back together by taking their union, only to end up using just one half of the data. The cost of transferring the other half, which isn't really required, is wasted computing in the form of longer response time, overall wait time, etc.
Data: data used by a number of applications may be distributed to a number of processing sites
We mentioned the idea of fragments just before, but the idea of a fragment is really just a formal way of defining how the data should be 'broken down'.
Usually, the fragments will be either horizontal fragments or vertical fragments.
A fragmentation should have a property known as the correctness property. The correctness property demands that three conditions hold true for any set of fragments; a somewhat simplified interpretation of these conditions is:
Reconstruction
When you put back the fragments, you get the original table.
Completeness
All the records from the original table should be present in some fragment, otherwise data will be lost.
Disjointness
Each record shows up in only one fragment.
A rough analogy: think of tearing a piece of paper up out of anger, then suddenly realising you had something important written on it and really wanting to put it back into its original state. If all the pieces are disjoint, everything that was written is completely contained on those pieces, and you have all of the original pieces you have just torn in front of you, then you can reconstruct the original. (A concrete sketch of fragmentation follows at the end of this answer.)
Control: The control of the execution of various tasks might be distributed instead of being performed by one computer system.
I think this mostly ties into the performance aspect of a DDB and some aspects of access control: instead of running and coordinating every query in one place, control over execution is spread across the participating sites.
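To make the fragmentation ideas above concrete, here is a small sketch using Python's built-in sqlite3 module (all table, column and fragment names are made up): two horizontal fragments, a query whose processing logic only touches the relevant fragment, and reconstruction of the original table via UNION ALL.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Two horizontal fragments of a hypothetical 'employees' table,
# split on the 'site' column (disjoint and complete by construction).
con.executescript("""
    CREATE TABLE employees_ny (id INTEGER PRIMARY KEY, name TEXT, site TEXT);
    CREATE TABLE employees_la (id INTEGER PRIMARY KEY, name TEXT, site TEXT);
    INSERT INTO employees_ny VALUES (1, 'Ann', 'NY'), (2, 'Bob', 'NY');
    INSERT INTO employees_la VALUES (3, 'Cho', 'LA');
""")

# Distributed processing logic: this query only needs NY employees,
# so only the NY fragment is touched -- no union, no wasted transfer.
print(con.execute("SELECT name FROM employees_ny").fetchall())

# Reconstruction: putting the fragments back together yields the
# original table (the 'reconstruction' correctness condition).
print(con.execute("""
    SELECT * FROM employees_ny
    UNION ALL
    SELECT * FROM employees_la
""").fetchall())
```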

Database System for timeseries & aggregation

I'm currently creating a Raspberry Pi-based logging device for logging the power which is fed into the grid by a solar array.
The "main table" will be growing at ~ 20 entries representing the "current" power produced by several parts of the array.
Basically this isn't that much and can be handled with acceptable performance on a Raspberry Pi, but with a growing amount of data, queries like "select the last 10 years, grouped by month" probably wouldn't be very efficient... (the data should be displayed via an interactive web interface)
I thought of doing some "background aggregation" and maintaining several tables containing the aggregated data for various timeframes, but this seems like a problem that has probably been dealt with by many people before.
What do you suggest I do?
You do not know how much data growth is needed to affect performance.
You do not know by how much performance will be affected then.
You do not know if performance will be affected at all.
As long as you do not have even an estimate of how much performance improvement you need, it does not make sense to try to do optimizations.
Or, as said by Donald Knuth:
premature optimization is the root of all evil
If you really do want to create caches of aggregated values, I'd suggest using triggers to keep the cache consistent after any change to the original data.
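As an illustration, here is a minimal sketch using Python's built-in sqlite3 module (SQLite runs fine on a Raspberry Pi; the table and column names are made up) in which a trigger keeps a monthly aggregate table consistent with the raw readings:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE readings (
        ts    TEXT NOT NULL,      -- ISO timestamp of the sample
        watts REAL NOT NULL
    );

    -- Cache of pre-aggregated values, one row per month.
    CREATE TABLE monthly (
        month     TEXT PRIMARY KEY,   -- e.g. '2024-06'
        n_samples INTEGER NOT NULL,
        sum_watts REAL NOT NULL
    );

    -- Keep the cache consistent on every insert into the raw table.
    CREATE TRIGGER readings_ai AFTER INSERT ON readings
    BEGIN
        INSERT OR IGNORE INTO monthly VALUES (strftime('%Y-%m', NEW.ts), 0, 0);
        UPDATE monthly
           SET n_samples = n_samples + 1,
               sum_watts = sum_watts + NEW.watts
         WHERE month = strftime('%Y-%m', NEW.ts);
    END;
""")

con.execute("INSERT INTO readings VALUES ('2024-06-01 12:00:00', 310.5)")
con.execute("INSERT INTO readings VALUES ('2024-06-01 12:01:00', 295.0)")

# "last 10 years, grouped by month" now reads the small cache table.
print(con.execute("SELECT month, sum_watts / n_samples FROM monthly").fetchall())
```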

If we record a number every millisecond, how much data would we have in a day?

I'm a bit confused here... I've been offered a spot on a project where there would be an array of sensors, each giving off a reading every millisecond (yes, 1000 readings per second). A reading would be a 3- or 4-digit number, for example 818 or 1529. These readings need to be stored in a database on a server and accessed remotely.
I've never worked with such large amounts of data. What do you think, how many MB would the readings from one sensor come to in a day? ... 4 (digits) × 1000 × 60 × 60 × 24 ... = 345,600,000 bits ... right? About 42 MB per day... doesn't seem too bad, right?
Therefore a DB of, say, 1 GB would hold 23 days of info from one sensor, correct?
I understand that MySQL & PHP probably would not be able to handle it... What would you suggest, maybe some apps? Azure? Oracle?
A 3- or 4-digit number is:
4 bytes if you store it as a string,
2 bytes if you store it as a 16-bit (0-65535) integer.
1000/sec -> 60,000/minute -> 3,600,000/hour -> 86,400,000/day.
As string: 86,400,000 × 4 bytes ≈ 329 MiB/day.
As integer: 86,400,000 × 2 bytes ≈ 165 MiB/day.
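A quick sanity check of those numbers in plain Python:

```python
# Back-of-the-envelope storage estimate for one sensor at 1 reading/ms.
readings_per_day = 1000 * 60 * 60 * 24           # 86,400,000

as_string  = readings_per_day * 4                 # 4 bytes per 4-digit string
as_integer = readings_per_day * 2                 # 2 bytes per 16-bit integer

print(as_string  / 2**20, "MiB/day as strings")   # ~329.6 MiB
print(as_integer / 2**20, "MiB/day as integers")  # ~164.8 MiB
print(2**30 / as_integer, "days per GiB (integers)")  # ~6.2 days
```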
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. Optimizing a DB for large-scale retrieval slows things down for fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database, and do an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
Most likely you will not need to keep the data at such a fine granularity for a long time, and you can use several options to minimize the volumes. First, after some period of time you can collapse the detailed data into hourly min/max/avg values, keeping detailed info only for unstable situations that were detected, or for situations that by definition require detailed data to be kept. Also, many things can be turned into event logging. These approaches were implemented and successfully used a couple of decades ago in some industrial automation systems provided by the company I was working for at the time, and the storage devices available then were many times smaller than anything you can find today.
So, first you need to analyse the data you will be storing and then decide how to optimize its storage.
Following MarcB's numbers: 2 bytes at 1 kHz is just 2 KB/s, or 16 kbit/s. This is not really too much of a problem.
I think a sensible and flexible approach would be to construct a queue of sensor readings which the database can simply pop until it is empty. At these data rates, the problem is not the throughput (which could be handled by a dial-up modem) but the gap between the timings. Any system caching values will need to be able to get out of the way fast enough for the next value to be stored; 1 ms is not long, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
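Here is a minimal sketch of that producer/consumer shape, assuming Python and only the standard library (the sensor values and table name are made up): the sensor side just appends to an in-memory queue, and a writer drains it and inserts in bulk.

```python
import queue, sqlite3, threading, time

readings = queue.Queue()          # cheap to append to at 1 kHz

def sensor_loop():
    """Stand-in for the real sensor: push (timestamp, value) pairs."""
    for value in (818, 1529, 997):
        readings.put((time.time(), value))
        time.sleep(0.001)

def writer_loop():
    """Drain the queue and insert the readings in one bulk statement."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE samples (ts REAL, value INTEGER)")
    time.sleep(0.1)               # let some readings accumulate
    batch = []
    while not readings.empty():
        batch.append(readings.get())
    con.executemany("INSERT INTO samples VALUES (?, ?)", batch)
    con.commit()
    print(con.execute("SELECT COUNT(*) FROM samples").fetchone())

t = threading.Thread(target=sensor_loop)
t.start()
writer_loop()
t.join()
```

In a real system the writer would loop forever and flush, say, once per second, but the shape is the same: fast appends on one end, bulk inserts on the other.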
If you do not need a relational database, you can use a NoSQL database like MongoDB, or even a much simpler solution like JDBM2 if you are using Java.

Need for speed: Best database solution

What I want to create is a huge index over an even bigger collection of data. The data is a huge collection of images (and I mean millions of photos!) and I want to build an index on all unique images.
So I calculate a hash value of every image and combine it with the width, height and file size of the image. This would give a practically unique key for every image, which would be stored together with the location of the image, or locations in the case of duplicates.
Technically speaking, this would fit perfectly in a single database table. A unique index on file name, plus an additional non-unique index on hash-width-height-size, would be enough. However, I could use an existing database system to solve this, or just write my own, optimized version. It will be a single-user application anyway, and the main purpose is to detect when I add a duplicate image to the collection, so it will warn me that I already have it in my collection and display the locations of the other copies. I can then decide to still add the duplicate or to discard it.
I've written hash-table implementations before and it's not that difficult once you know what you have to be aware of. So I could just implement my own file format for this data. It's unlikely that I'll ever need to add more information to these images and I'm not interested in similar images, just exact images. I'm not storing the original images in this file either, just the hash, size and location.
From experience, I know this could run extremely fast. I've done it before and have been doing similar things for nearly three decades, so it's likely that I will choose this solution.
But I do wonder... Doing the same with an existing database system like SQL Server, Oracle, Interbase or MySQL, would performance still be high enough? There would be about 750 TB of images indexed in this database, which roughly translates to around 30 million records in a single, small table. Is it even worth considering the use of a regular database?
I have doubts about the usability of a database for this project. The amount of data is huge, yet the structure is really simple. I don't need multi-user support or most of the other features that most databases provide, so I don't see a need for a database. But I'm interested in the opinions of other programmers about this. (Although I expect most will agree with me here.)
The project itself, which is still just an idea in my head, is supposed to be some tool or add-on for Explorer or whatever. Basically, it builds an index for any external hard disk that I attach to the system, and when I copy an image somewhere on this disk, it's supposed to tell me whether the image already exists on that disk. It will allow me to avoid filling up my backup disks with duplicates, although I sometimes would like to add duplicates (e.g. because they're part of a series). Since I like to create my own rendered artwork, I have plenty of images. Plus, I've been taking pictures with digital cameras since 1996, so I also have a huge collection of photos. Add some other large collections to this and you'll soon realise that the amount of data will be huge. (And yes, there are already plenty of duplicates in my collection...)
Since it's a single-user application that you are considering, I'd probably have a look at SQLite. It ought to fit your other requirements rather nicely, I'd say.
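For illustration, a rough sketch of what the index could look like with Python's built-in sqlite3 module (the file names, schema and function names are made up; the width/height columns are left out because reading them requires an image library):

```python
import hashlib, os, sqlite3

con = sqlite3.connect("image_index.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS images (
        path TEXT PRIMARY KEY,     -- unique index on the file name
        hash TEXT NOT NULL,
        size INTEGER NOT NULL      -- width/height would go here as well
    )
""")
con.execute("CREATE INDEX IF NOT EXISTS idx_dupe ON images (hash, size)")

def fingerprint(path):
    """Hash of the file contents plus its size on disk."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest, os.path.getsize(path)

def add_image(path):
    """Warn about likely duplicates, then record the image's location."""
    digest, size = fingerprint(path)
    dupes = con.execute(
        "SELECT path FROM images WHERE hash = ? AND size = ?",
        (digest, size)).fetchall()
    if dupes:
        print(f"{path} looks like a duplicate of: {dupes}")
    con.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?)",
                (path, digest, size))
    con.commit()

# add_image("/path/to/photo.jpg")   # hypothetical usage
```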
I just tested the performance of PostgreSQL on my laptop (Core 2 Duo T5800, 2.0 GHz, 3.0 GiB RAM). I have a table with slightly more than 100M records, 5 columns and some indexes. I performed a range query on one indexed column (not the primary key) and returned all columns. The mean query returned 75 rows and executed in 750 ms. You have to decide if this is fast enough.
I would avoid DIY-ing it unless you know all the repercussions of what you're doing.
Transactional Consistency for example, is not trivial.
I would suggest designing your code in such a way that the backend can easily be replaced later. Then run with something sane (SQLite is a good starting choice), develop it in the most sane and rational way possible, and then try slotting in the alternative backing store.
Then profile the differences, and run regression tests against it to make sure your database is not worse than SQLite.
Existing database solutions tend to win because they've had years of improvement and fine-tuning to get their benefits; a naïve attempt will likely be slower, buggier and do less, all the while increasing your development load to purely MONUMENTAL proportions.
http://fetter.org/optimization.html
The first rule of Optimization is, you do not talk about Optimization.
The second rule of Optimization is, you DO NOT talk about Optimization.
If your app is running faster than the underlying transport protocol, the optimization is over.
One factor at a time.
No marketroids, no marketroid schedules.
Testing will go on as long as it has to.
If this is your first night at Optimization Club, you have to write a test case.
Also, with databases, there is one thing you utterly MUST get ingrained:
Speed is unimportant
Your data being there when you need it, that is important.
When you have the assuredness that your data will always be there, then you may worry about trivial concerns like speed.
Hashes
You also mention that you'll be using image SHAs/MD5s etc. to deduplicate images. This is a fallacious notion of its own: hashes of files can only tell you that files are different, not that they're the same.
The logic is akin to asking 30 people to flip a coin, seeing the first one get heads, and then deciding to delete every other person who gets heads, because they're obviously the same person.
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
Although you may think it unlikely you'd have 2 different files with the same hash, your odds are about as good as winning the lotto. The chances of you winning the lotto are low, but somebody wins the lotto every day. Don't let it be you.
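In line with that advice, a hash match can be treated as a candidate only and confirmed with a byte-for-byte comparison. A sketch using only Python's standard library (the function names are made up):

```python
import filecmp, hashlib

def sha256_of(path):
    """Cheap first pass: hash the file contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def definitely_duplicates(path_a, path_b):
    """Hash equality only nominates a candidate; the byte-for-byte
    comparison is what actually proves the files are identical."""
    if sha256_of(path_a) != sha256_of(path_b):
        return False              # different hashes => different files
    return filecmp.cmp(path_a, path_b, shallow=False)
```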

How do you measure SQL Fill Factor value

Usually when I'm creating indexes on tables, I set the Fill Factor based on an educated guess of how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors from 70% (very high write) to 95% (very high read).
It's a bit of an art form.
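If you want to move from guessing toward measuring on SQL Server, one approach is to rebuild the index with a candidate fill factor, replay a realistic write workload, and then look at fragmentation (the visible symptom of page splits eating into the free space). A hedged sketch using pyodbc; the connection string, table and index names are placeholders:

```python
import pyodbc

# Placeholder connection string -- adjust for your server/database.
con = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                     "SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;")
cur = con.cursor()

# Try a candidate fill factor on a hypothetical index.
cur.execute("ALTER INDEX IX_Orders_CustomerId ON dbo.Orders "
            "REBUILD WITH (FILLFACTOR = 90);")
con.commit()

# ... run a realistic write workload here, then measure fragmentation.
cur.execute("""
    SELECT avg_fragmentation_in_percent, page_count
    FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED')
""")
for row in cur.fetchall():
    print(row.avg_fragmentation_in_percent, row.page_count)
```

Repeating that loop with a few candidate values, and comparing read/write timings alongside the fragmentation numbers, is about as scientific as fill factor tuning gets.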
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere: tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal, and I don't know anyone who can say that.