I'm searching to store(in some DB) a matrix of millions rows and columns of boolean values.
The rows and columns need to be added constantly.
My goal is to do boolean operations on specific row or column.
I would like to do the work in the DB and not fetch the data to in memory of the application.
All operations will done in memory not disk. DBMS also fetches the data in memory and calculate the operation.
Your question may change to this that you want to leave it to DBMS by just writing a query. ALL DBMS can do it by Bitwise operators or User Defined Functions.
SELECT sum(yourColumn) FROM yourTable;
SELECT your_bitwise_AND(yourColumn) FROM yourTable;
But I suggest you to use Columnar databases (such as Cassandra or HBase) when working on a column of entire (or a big partition of) table, where your data is really big.
As another solution you can save your data in text format and use Apache Spark to read and calculate your operation simply and efficient.
Related
I'm pretty proficient with VBA, but I know almost nothing about Access! I'm running a complex simulation using Arrrays in VBA, and I want to store the results somewhere. Since the results of the simulation will be quite large (~1GB in memory), I'd like to store this in Access rather than Excel.
I currently have a large number of Arrays populated with my data, but I'm not sure how to write these to a database, or even how to create one with VBA. Here's what I need to do, in a nutshell, with VBA:
Create a new Access Database
Create a new Access Table (the db will be only a single table)
Create ~1200 fields programmatically
Copy the results from my arrays to the new Access table.
I've looked at a number of answers on here, but none of them seem to answer my question fully. For instance, Adding field to MS Access Table using VBA talks about adding fields to a database. But I don't see doubles listed here. Most of my arrays are doubles. Will this be a problem?
EDIT:
Here are a few more details about the project:
I am running a network design simulation. Thus, I start by generating ~150,000 unique networks. Then, I run a lot of calculations (no, these can't be simplified to queries unfortunately!) of characteristics for the network. There end up being ~1200 for each possible network (unique record). Thus, I would like to store these in an Access database. Each record will be a unique network, and each field will be a specific characteristic associated with that network.
Virtually all of the fields (arrays at this point!) are doubles.
You (almost?) never want a database with one table. You might as well store it in a text file. One of the main benefits of databases is relating data in different tables, and with one table you don't need it.
Fortunately for you, you need more than one table and a database might be the way to go. You (almost) never need to create permanent tables in code (temp tables, sure, but not permanent ones). If your field names are variable, you need to change your design. When data is variable, it goes in the data part of a database. When it's fixed, it can be a table or a field. Based on what you've said, I think you need this:
In Access create a tabled called tblNetworks with these fields
NetworkID AutoNumber
NetworkName Short Text
Then create another tabled called tblCalculations with these fields
CalcID Autonumber
NetworkID Long (Relates to tblNetworks, one to many)
CalcDesc Short Text
Result Number (Double)
What you were going to name your fields in your Access table will be the CalcDesc data. You'll use ADODB to execute INSERT INTO sql statements that put the data in the tables.
You'll end with tblNetworks with 150k records and tblCalculations with 1,200 x 150k records or so. When you're tables grow longer and not wider as things change, that a good indication you designed it right.
If you're really unfamiliar with Access, I recommend learning how to create Tables, setting up Relationships, and Referential Integrity. If you don't know SQL, search for INSERT INTO. And if you haven't used ADO before in Excel, search for ADODB Connections and the Execute method.
Update
You can definitely get away with a CSV for this. Like you said, it's pretty low overhead. Whether a text file or a database is the right answer probably depends more on how you're going to use the data and how often.
If you're going to pull this into Excel a small number of times, do a few sorts or filters, maybe a pivot table, then any performance hit you get from a CSV isn't going to be that bad. And if you only need to deal with a subset of the data at a time, you can use ADO to read a text file and only pull in the data you want at that time, further mitigating the slowness of sorting and filtering 150k rows. Not to mention if you have a few gigs of RAM, 150k x 1,200 probably won't be bad at all.
If you find that the performance of a CSV stinks because your hardware isn't up to the task, you have to access it often, or you doing a ton of different queries against the data, it may be to your benefit to use the database. If you fields are structured as you say, you may benefit from even more tables. You'd still have the network table and the calc table, but you'd also have Market, Slot, and Characteristic tables. Then your Calc table would look like:
CalcID
CalcDesc
NetworkID
MarketID
SlotID
CharacteristicID
Result
If you looking for data a lot of times and you need it quickly, you're not going to do better than a bunch of INNER JOINs on those tables and a WHERE clause that limits what you want.
But only you can decide if it's worth all the setup and overhead of using a database. And because of that, I would start down the CSV path until the reason to change presented itself. I would design my code in a way that switching from CSV to database only touched a few procedure (like by using class modules) so that the change didn't affect any already-tested business logic.
This is a question of what is the best practice and best performance.
I have inherited a database that contains data for turbine engines. I have found 20 data points that are calculated from several fields from the turbine. The way it was done in the past is a view was create to pull data for some turbines and calculate some of the 20 data point. Then other views for the same turbines but different data point and then other views for different turbines and data point. So the same equations are used over and over.
I want to consolidate all of the equations (20 data point) into one place. My debate is either creating a user function that will do all 20 calculations or creating them as computed columns in the table. With a function it would calculate all 20 for each turbine even thou I might only need 2 or 3 for a view. But as a computed column it would only calculate the columns the view pulled.
The answer is probably "it depends".
The factors when making this determination include:
Is the column deterministic? (e.g. can you persist it or not)
How often is data inserted into the table?
How often is data retrieved from the table?
The trade offs for computed and specifically persisted computed columns are similar to that when considering an index on your table. Having persisted columns will increase the amount of time an insert takes on the table, but allows retrieval to happen faster. Whereas on the other end, computed columns (that aren't persisted), or a function you would have faster on the insert but slower on the retrieval.
The end solution would likely depend on the utilization of the table (how often writes and reads occur) - which is something that you would need to determine.
Personally, I wouldn't do a function for the columns, but rather I'd persist them, or write a view/computed columns that accomplished them, depending on the nature of the usage on the table.
I've seen a few questions on this topic already but I'm looking for some insight on the performance differences between these two techniques.
For example, lets say I am recording a log of events which will come into the system with a dictionary set of key/value pairs for the specific event. I will record an entry in an Events table with the base data but then I need a way to also link the additional key/value data. I will never know what kinds of Keys or Values will come in so any sort of predefined enum table seems out of the question.
This event data will be constantly streaming in so insert times is just as important as query times.
When I query for specific events I will be using some fields on the Event as well as data from the key/value data. For the XML way I would simply use a Attributes.exists('xpath') statement as part of the where clause to filter the records.
The normalized way would be to use a Table with basically Key and Value fields with a foreign link to the Event record. This seems clean and simple but I worry about the amount of data that is involved.
You've got three major options for a 'flexible' storage mechanism.
XML fields are flexible but put you in the realm of blob storage, which is slow to query. I've seen queries against small data sets of 30,000 rows take 5 minutes when it was digging stuff out of the blobs with Xpath queries. This is the slowest option by far but it is flexible.
Key/value pairs are a lot faster, particularly if you put a clustered index on the event key. This means that all attributes for a single event will be physically stored together in the database, which will minimise the I/O. The approach is less flexible than XML but substantially faster. The most efficient queries to report against it would involve pivoting the data (i.e. a table scan to make an intermediate flattened result); joining to get individual fields will be much slower.
The fastest approach is to have a flat table with a set of user defined fields (Field1 - Field50) and hold some metadata about the contents of the fields. This is the fastest to insert and fastest and easiest to query, but the contents of the table are opaque to anything that does not have access to the metadata.
The problem I think the key/value table approach is regarding the datatypes - if a value could be a datetime, or a string or a unicode string or an integer, then how do you define the column? This dilemma means the value column has to be a datatype which can contain all the different types of data in it which then begs the question of efficiency/ease of querying. Alternatively, you have multiple columns of specific datatypes, but I think this is a bit clunky.
For a true flexible schema, I can't think of a better option than XML. You can index XML columns.
This article off MSDN discusses XML storage in more detail.
I'd assume the normalized way would be faster for both INSERT and SELECT operations, if only because that's what any RDBMS would be optimized for. The "amount of data involved" part might be an issue too, but a more solvable one - how long do you need that data immediately on hand, can you archive it after a day, or a couple weeks, or 3 months, etc? SQL Server can handle an awful lot.
This event data will be constantly streaming in so insert times is just as important as query times.
Option 3: If you really have a lot of data constantly streaming - create a separate queue in shared memory, in-process sqlite, separate db table, or even it's own server, to store the incoming raw event & attributes, and have another process (scheduled task, windows service, etc) parse that queue into whatever preferred format tuned for speedy SELECTs. Optimal input, optimal output, ready to scale in either direction, everyone's happy.
I am trying to decide between two possible implementations and am eager to choose the best one :)
I need to add an optional BLOB field to a table which currently only has 3 simple fields. It is predicted that the new field will be used in fewer than 10%, maybe even less than 5% of cases so it will be null for most rows - in fact most of our customers will probably never have any BLOB data in there.
A colleague's first inclination was to add a new table to hold just the BLOBs, with a (nullable) foreign key in the first table. He predicts this will have performance benefits when querying the first table.
My thoughts were that it is more logical and easier to store the BLOB directly in the original table. None of our queries do SELECT * from that table so my intuition is that storing it directly won't have a significant performance overhead.
I'm going to benchmark both choices but I was hoping some SQL gurus had any advice from experience.
Using MSSQL and Oracle.
For MSSQL, the blobs will be stored on a separate page in the database so they should not affect performance if the column is null.
If you use the IMAGE data type then the data is always stored out of the row.
If you use the varbinary(max) data type then if the data is > 8kb it is stored outside the row, otherwise it may be stored in the row depending on the table options.
If you only have a few rows with blobs the performance should not be affected.
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification - When stored in a simple table / array, access to entries are O(1) - just a memory read (or read from disk). As I understand, all databases store their nodes in trees, so they cannot achieve identical performance - access to an average node will take a few hops.
Perhaps I don't understand your question but a database is designed to handle data. I work with database all day long that have millions of rows. They are efficiency enough.
I don't know what your definition of "achieve similar efficiency using a database" means. In a database (from my experience) what are exactly trying to do matters with performance.
If you simply need a single record based on a primary key, the the database should be naturally efficient enough assuming it is properly structure (For example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL server or MySql, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the a fore mentioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no -- you can't. You are asking about mechanizes designed for two completely different things. A database persist data over a long amount of time and is usually optimized for many connections and data read/writes. In your description the data in an array, in memory is for a single program to access and that program owns the memory. It's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: The absolute closest thing you could get to this, in SQL server specifically, is using a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL server's "array". Any regular table write or create statements prompts the RDMS to write to the disk (I think, first the log and then to the data files). And large data reads can also cause the DB to write to private temp tables to store data for later or what-have.
There is not much you can do to specify how data will be physically stored in database. Most you can do is to specify if data and indices will be stored separately or data will be stored in one index tree (clustered index as Brian described).
But in your case this does not matter at all because of:
All databases heavily use caching. 1.000.000 of records hardly can exceed 1GB of memory, so your complete database will quickly be cached in database cache.
If you are reading single record at a time, main overhead you will see is accessing data over database protocol. Process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes SQL (parse SQL, checks if SQL command is previously compiled, compiles command if it is first time issued, ...)
database executes SQL. After few executions data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing SQL and finding records needs very little time compared to total time needed to get data from database to application. Even if you could force database to store data into array, there will be no visible gain.
If you've got a decent amount of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed record length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combintations
variable length columns
nullable columns
editiablility
restructuring
concurrency control
transaction control
etc., etc.
Create a DB with an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is kind of inefficient in terms of space (you're using nlgn space instead of n space).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as being a counter in most DB systems, in which case you can just insert 1000000 items and it will do the counting for you. I am not sure if such a DB avoids explicitely storing the counter's value, but if it does then you'd only end up using n space)
When you have your primary key as a integer sequence it would be a good idea to have reverse index. This kind of makes sure that the contiguous values are spread apart in the index tree.
However, there is a catch - with reverse indexes you will not be able to do range searching.
The big question is: efficient for what?
for oracle ideas might include:
read access by id: index organized table (this might be what you are looking for)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
for the actual question, precisely as asked: write all rows in a single blob (the table contains one column, one row. You might be able to access this like an array, but I am not sure since I don't know what operations are possible on blobs. Even if it works I don't think this approach would be useful in any realistic scenario.