Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification: when stored in a simple table / array, access to an entry is O(1) - just a memory read (or a read from disk). As I understand it, databases store their records in trees (e.g. B-trees), so they cannot achieve identical performance - reaching an average record takes a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work all day long with databases that have millions of rows, and they are efficient enough.
I'm not sure what your definition of "achieve similar efficiency using a database" means. In a database (in my experience), what exactly you are trying to do matters a great deal for performance.
If you simply need a single record based on a primary key, then the database should be naturally efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the aforementioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no -- you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and many reads/writes. In your description, the data in an in-memory array is accessed by a single program, and that program owns the memory; it's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: The absolute closest thing you could get to this, in SQL Server specifically, is a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL Server's "array". Any regular table write or create statement prompts the RDBMS to write to disk (I think, first the log and then the data files). And large data reads can also cause the DB to write to private temp tables to store data for later use and the like.
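As a hedged sketch (the table and column names are mine, not from the question), a table variable keyed on the integer id might look like this:

-- Illustrative only: a SQL Server table variable keyed on the integer id.
-- Table variables start out in memory but can still spill to tempdb under pressure.
DECLARE @Dense TABLE (
    Id  INT NOT NULL PRIMARY KEY,  -- keyed lookup on the id
    Val BIT NOT NULL
);

INSERT INTO @Dense (Id, Val) VALUES (0, 1), (1, 0), (2, 1);

SELECT Val FROM @Dense WHERE Id = 1;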
There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indexes are stored separately or the data is stored in one index tree (a clustered index, as Brian described).
But in your case this does not matter at all, for two reasons:
All databases make heavy use of caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
If you are reading a single record at a time, the main overhead you will see is accessing the data over the database protocol. The process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes SQL (parse SQL, checks if SQL command is previously compiled, compiles command if it is first time issued, ...)
database executes the SQL. After a few executions the data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database into the application. Even if you could force the database to store the data in an array, there would be no visible gain.
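To illustrate the parse/compile step above: a server-side prepared statement (MySQL-style syntax; the table and column names are illustrative) lets repeated lookups skip most of the parsing work, although the network round trip still dominates:

-- Prepare once, then execute many times with different ids.
PREPARE get_row FROM 'SELECT val FROM dense_table WHERE id = ?';
SET @id = 123456;
EXECUTE get_row USING @id;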
If you've got a decent amount of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed-record-length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a DB with a table that has an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is somewhat inefficient in terms of space (you're using n log n space instead of n space).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as a counter in most DB systems, in which case you can just insert 1,000,000 items and it will do the counting for you. I am not sure if such a DB avoids explicitly storing the counter's value, but if it does then you'd only end up using n space.
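A minimal sketch of that layout in SQL Server syntax (names are illustrative); IDENTITY plays the role of the counter mentioned above:

CREATE TABLE dbo.DenseValues (
    Id  INT IDENTITY(0, 1) NOT NULL,          -- counter starts at 0, steps by 1
    Val BIT NOT NULL,
    CONSTRAINT PK_DenseValues PRIMARY KEY CLUSTERED (Id)
);

-- Insert rows in ascending Id order so the clustered index grows sequentially
-- instead of splitting pages; Id is generated automatically.
INSERT INTO dbo.DenseValues (Val) VALUES (1);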
When your primary key is an integer sequence, it can be a good idea to have a reverse-key index. This makes sure that contiguous values are spread apart in the index tree.
However, there is a catch - with reverse indexes you will not be able to do range searching.
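In Oracle, for example, the syntax looks like this (table and index names are illustrative):

CREATE TABLE seq_rows (
    id  NUMBER(10) NOT NULL,
    val NUMBER(1)
);

-- Reverse-key index: the key bytes are stored reversed, so consecutive ids land
-- on different leaf blocks (less hot-block contention, but no range scans).
CREATE INDEX seq_rows_id_rix ON seq_rows (id) REVERSE;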
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: an index-organized table (this might be what you are looking for - see the sketch after this list)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
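A hedged sketch of the index-organized option (names are illustrative): the whole row lives in the primary-key B-tree, so a lookup by id needs no separate table access.

CREATE TABLE dense_values (
    id  NUMBER(10) PRIMARY KEY,
    val NUMBER
) ORGANIZATION INDEX;

SELECT val FROM dense_values WHERE id = 123456;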
For the actual question, precisely as asked: write all rows into a single blob (the table would contain one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on blobs. Even if it works, I don't think this approach would be useful in any realistic scenario.
This is a question about database access performance vs. code simplicity and best practices.
Let's say I have a Users table and an Addresses table. Every user can have more than one address, which will be stored in the Addresses table with a foreign key to the Users table.
What would be the best way to read users from the database, assuming that I always want to get the addresses along with the users?
First option would be to query the user, say by his username, and once I have the object, use the user's id to query the Addresses table for all the user's addresses.
Pros:
Simple code
No duplicate data is transferred
Cons:
Requires two queries to the database
Second option would be to write a query that joins Users with Addresses and returns a user result line for every address the user has. All the columns, except for the address column, would be exactly the same for every line. I would then aggregate all the lines into a single user object with a list of addresses.
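A sketch of what that query could look like (table and column names are assumptions based on the description):

SELECT u.id, u.username, a.address
FROM Users u
LEFT JOIN Addresses a ON a.user_id = u.id
WHERE u.username = 'alice';
-- One row per address; every user column is repeated on each row.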
Pros:
Requires a single query to the database
Cons:
Relatively complicated code (aggregating the users)
A lot of the data transferred is redundant
Those are the two ways I could think of, both have their pros and cons. Which of the options would you suggest?
Maybe another solution altogether?
My first rule of thumb is usually to let the database engine do what it is good at. Joining of tables is a basic function that the database performs with maximum efficiency. A join by the DB will always be faster than what you can do by making multiple calls.
The point you make about the fact that it fetches a lot of user data is true only if you have real problems with data transfer or the data is really massive.
In exchange, you are making just one call to the database instead of multiple calls. That saving can well outweigh the possible downside of data size.
I'm not quite sure what you meant by "aggregating the user data" since you just take it from the first entry of that user and skip the rest.
At the end of the day, let the database do its work unless there is a really good reason not to do so.
In really serious cases there are ways to return NULLs in the user columns for all but the first row. However, this complicates the SQL query greatly and, once again, is generally not worth the overhead.
I just had a long debate about this with Microsoft on GitHub and a discussion with an MS-SQL MVP.
Summarizing that thread (from my perspective):
To SQL Server it doesn't matter whether you have a single query or 10; the redundant fields returned have zero impact on SQL Server itself.
Splitting the queries is what SQL Server does internally anyway, and when people try to optimize by hand it usually turns out worse, as SQL Server does better when it is not forced to act in a specific way.
Having multiple queries does add overhead on SQL Server.
The only thing that splitting the queries actually solves is bandwidth on the network, as fewer bytes are transferred over the wire, and he says that is negligible compared to the cost of having multiple queries.
When you have massive numbers of returned rows, you'll want to split the queries because of table spools and because of the bandwidth.
In the end, I decided to use
GROUP_CONCAT(DISTINCT addresses.address SEPARATOR ' | ') addresses
...
GROUP BY userId
I then split the addresses into a list in the client (specifically, in my custom BeanPropertyRowMapper).
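For context, a fuller (hedged) version of that idea in MySQL might look like the following; the table and column names are assumptions:

SELECT u.userId, u.username,
       GROUP_CONCAT(DISTINCT a.address SEPARATOR ' | ') AS addresses
FROM Users u
LEFT JOIN Addresses a ON a.userId = u.userId
GROUP BY u.userId, u.username;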
I'm pretty proficient with VBA, but I know almost nothing about Access! I'm running a complex simulation using arrays in VBA, and I want to store the results somewhere. Since the results of the simulation will be quite large (~1 GB in memory), I'd like to store them in Access rather than Excel.
I currently have a large number of Arrays populated with my data, but I'm not sure how to write these to a database, or even how to create one with VBA. Here's what I need to do, in a nutshell, with VBA:
Create a new Access Database
Create a new Access Table (the db will be only a single table)
Create ~1200 fields programmatically
Copy the results from my arrays to the new Access table.
I've looked at a number of answers on here, but none of them seem to answer my question fully. For instance, Adding field to MS Access Table using VBA talks about adding fields to a database, but I don't see doubles listed there. Most of my arrays are doubles. Will this be a problem?
EDIT:
Here are a few more details about the project:
I am running a network design simulation. I start by generating ~150,000 unique networks. Then I run a lot of calculations (no, these can't be simplified to queries, unfortunately!) of characteristics for each network. There end up being ~1,200 of them for each possible network (unique record). I would therefore like to store these in an Access database: each record will be a unique network, and each field will be a specific characteristic associated with that network.
Virtually all of the fields (arrays at this point!) are doubles.
You (almost?) never want a database with one table. You might as well store it in a text file. One of the main benefits of databases is relating data in different tables, and with one table you don't need it.
Fortunately for you, you need more than one table and a database might be the way to go. You (almost) never need to create permanent tables in code (temp tables, sure, but not permanent ones). If your field names are variable, you need to change your design. When data is variable, it goes in the data part of a database. When it's fixed, it can be a table or a field. Based on what you've said, I think you need this:
In Access, create a table called tblNetworks with these fields:
NetworkID AutoNumber
NetworkName Short Text
Then create another table called tblCalculations with these fields:
CalcID Autonumber
NetworkID Long (Relates to tblNetworks, one to many)
CalcDesc Short Text
Result Number (Double)
What you were going to name your fields in your Access table will become the CalcDesc data. You'll use ADODB to execute INSERT INTO SQL statements that put the data in the tables.
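As a hedged sketch, the DDL and a single insert could look like this in Access/Jet SQL run through an ADODB connection (exact type keywords vary with how you execute it; the CalcDesc value here is just an example):

CREATE TABLE tblCalculations (
    CalcID    AUTOINCREMENT PRIMARY KEY,
    NetworkID LONG,
    CalcDesc  TEXT(50),
    Result    DOUBLE
);

INSERT INTO tblCalculations (NetworkID, CalcDesc, Result)
VALUES (1, 'MeanPathLength', 3.14159);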
You'll end up with tblNetworks holding 150k records and tblCalculations holding 1,200 x 150k records or so. When your tables grow longer rather than wider as things change, that's a good indication you designed them right.
If you're really unfamiliar with Access, I recommend learning how to create Tables, setting up Relationships, and Referential Integrity. If you don't know SQL, search for INSERT INTO. And if you haven't used ADO before in Excel, search for ADODB Connections and the Execute method.
Update
You can definitely get away with a CSV for this. Like you said, it's pretty low overhead. Whether a text file or a database is the right answer probably depends more on how you're going to use the data and how often.
If you're going to pull this into Excel a small number of times, do a few sorts or filters, maybe a pivot table, then any performance hit you get from a CSV isn't going to be that bad. And if you only need to deal with a subset of the data at a time, you can use ADO to read a text file and only pull in the data you want at that time, further mitigating the slowness of sorting and filtering 150k rows. Not to mention if you have a few gigs of RAM, 150k x 1,200 probably won't be bad at all.
If you find that the performance of a CSV stinks because your hardware isn't up to the task, you have to access it often, or you're doing a ton of different queries against the data, it may be to your benefit to use the database. If your fields are structured as you say, you may benefit from even more tables. You'd still have the network table and the calc table, but you'd also have Market, Slot, and Characteristic tables. Then your Calc table would look like:
CalcID
CalcDesc
NetworkID
MarketID
SlotID
CharacteristicID
Result
If you're looking up data a lot and you need it quickly, you're not going to do better than a bunch of INNER JOINs on those tables and a WHERE clause that limits what you want.
But only you can decide if it's worth all the setup and overhead of using a database. And because of that, I would start down the CSV path until a reason to change presented itself. I would design my code in a way that switching from CSV to a database only touched a few procedures (like by using class modules), so that the change didn't affect any already-tested business logic.
The project requires storing binary data in a PostgreSQL database (a project requirement). For that purpose we made a table with the following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data), and after that updated (concatenating additional binary data to the data field).
Simple tests have shown that this yields better performance.
Depending on your input, we will make the client use concurrent threads to insert/update data (with different DB connections), or a single thread with only one DB connection.
We don't have much experience with PostgreSQL, so could you help us with some pointers concerning possible bottlenecks, and whether using multiple threads to insert data is better than using a single thread?
Thank you :)
Edit 1:
More detailed information:
there will be only one client accessing the database, using only one Linux process
the database and client are on the same high-performance server, but this should not matter; the client must be fast no matter the machine, without additional client configuration
we will get a new stream of data every 10 seconds; each stream provides 16,000 new bytes every 0.5 seconds (CBR, but we can use buffering and only do inserts every 4 seconds at most)
a stream will last anywhere between 10 seconds and 5 minutes
It makes extremely little sense that you should get better performance inserting a row and then appending to it if you are using bytea.
PostgreSQL's MVCC design means that an UPDATE is logically equivalent to a DELETE plus an INSERT. When you insert the row and then update it, what happens is that the original tuple you inserted is marked as deleted and a new tuple is written that contains the concatenation of the old and added data.
I question your testing methodology - can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
Beyond that, I think this question is too broad as written to really say much of use. You've given no details or numbers; no estimates of binary data size, rowcount estimates, client count estimates, etc.
bytea insert performance is no different to any other insert performance tuning in PostgreSQL. All the same advice applies: Batch work into transactions, use multiple concurrent sessions (but not too many; rule of thumb is number_of_cpus + number_of_hard_drives) to insert data, avoid having transactions use each others' data so you don't need UPDATE locks, use async commit and/or a commit_delay if you don't have a disk subsystem with a safe write-back cache like a battery-backed RAID controller, etc.
Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it'd require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.
With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
I haven't benchmarked different scenarios for this, so I can only speak in general terms. If designing this myself, I'd be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can so you should be OK.
Here's the first approach I'd test, benchmark and load-test, as it's in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay.
INSERT each 16 kB record into a staging table as it comes in (UNLOGGED or TEMPORARY, if possible and if you can afford to lose incomplete records) and let Pg synchronize and group up the commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.
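A minimal sketch of that staging setup (illustrative column names, and assuming you can afford to lose a partial stream after a crash):

-- Per session: don't wait for the WAL flush on every commit.
SET synchronous_commit = off;

-- UNLOGGED skips WAL for this table; its contents are lost after a crash.
CREATE UNLOGGED TABLE temp_stream_table (
    stream_id  integer NOT NULL,
    block_no   integer NOT NULL,   -- preserves arrival order of the 16 kB blocks
    data_block bytea   NOT NULL
);

-- One INSERT per incoming block:
INSERT INTO temp_stream_table (stream_id, block_no, data_block)
VALUES (1, 0, '\x00ff'::bytea);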
For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it's likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:
CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
INSERT INTO final_table SELECT stream_id, bytea_agg(data_block) FROM temp_stream_table;
You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you'd need to make sure it was being quite aggressively VACUUMed.
See also:
Whats the fastest way to do a bulk insert into Postgres?
How to speed up insertion performance in PostgreSQL
OK everyone, I have an excellent challenge for you. Here is the format of my data:
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The datamodel is very simple.
My problem is as follows:
I have 64 GB+ of data as defined above, and in a real-time application I need to query my database and retrieve the data instantly. I was thinking about two solutions, but both seem impossible to set up.
The first is to use SQLite or MySQL. One table is needed, with one index on the ID column. The problem is that the database will be too large to have good performance, especially for SQLite.
The second is to store everything in memory in a huge hash table. RAM is the limit.
Do you have another suggestion? How about serializing everything to the filesystem and then, on each query, storing the queried data in a cache system?
When I say real-time, I mean about 100-200 queries/second.
A thorough answer would take data access patterns into account. Since we don't have these, we just have to assume an equal probability that any given row will be accessed next.
I would first try using a real RDBMS, either embedded or a local server, and measure the performance. If this gives 100-200 queries/sec then you're done.
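As a starting point, a minimal single-table layout to benchmark (column names and types are assumptions based on the format above):

CREATE TABLE records (
    id   BIGINT PRIMARY KEY,   -- the only key you query on
    col1 DOUBLE PRECISION,
    col2 DOUBLE PRECISION
    -- ...and so on up to COL-P
);

SELECT col1, col2 FROM records WHERE id = 42;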
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized more by creating a separate index, and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other (e.g. in succession.) This will ensure that you get the most back for a cache miss.
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do direct access on the file and leave caching to the OS. To get record number N (0-based), you multiply N by the size of a record (in bytes) and read the record directly from that offset in the file.
Since you're in read-only mode, and assuming you're storing your file on random-access media, this scales very well and does not depend on the size of the data: each fetch is a single read in the file. You could try some fancy caching system, but I doubt this would gain you much in terms of performance unless you have a lot of requests for the same data row (and the OS you're using is doing poor caching). Make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.
I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0 GB), I get between 60 and 70 writes per millisecond. After say 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? Index everywhere a WHERE clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload? Benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to the extreme in database tables (and I've seen these in the wild) that index every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumptions about your usage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
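One knob worth checking, where your engine supports it, is how much free space the B-tree leaves in each page; in SQL Server terms that is the fill factor (the names below are illustrative, and H2/Berkeley DB/SQL Anywhere have their own equivalents where available):

-- Build the index with 20% free space per page to absorb later inserts
-- with fewer page splits.
CREATE CLUSTERED INDEX ix_readings_time
ON dbo.Readings (ReadingTime)
WITH (FILLFACTOR = 80);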
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
The type of indexes applied also influences insertion performance - for example, with SQL Server a clustered index update's I/O is used for data distribution as well as the index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (skew, page distribution, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.