I'm looking to store (in some DB) a matrix with millions of rows and columns of boolean values.
New rows and columns need to be added constantly.
My goal is to perform boolean operations on a specific row or column.
I would like to do the work in the DB rather than fetch the data into the application's memory.
All operations will be done in memory, not on disk; a DBMS also fetches the data into memory and computes the operation there.
So your question really becomes whether you can leave the work to the DBMS by just writing a query. Most DBMSs can do this with bitwise operators or user-defined functions:
SELECT sum(yourColumn) FROM yourTable;
SELECT your_bitwise_AND(yourColumn) FROM yourTable;
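For example, MySQL (one possible engine; the aggregate names vary between DBMSs) has built-in bitwise aggregates. A minimal sketch, assuming a hypothetical table bool_matrix with row_id, col_id and a 0/1 value column:

-- AND over every value in column 42: returns 1 only if all bits are 1
SELECT BIT_AND(value) FROM bool_matrix WHERE col_id = 42;

-- OR over every value in row 7: returns 1 if any bit is 1
SELECT BIT_OR(value) FROM bool_matrix WHERE row_id = 7;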
But I suggest using a columnar database (such as Cassandra or HBase) when you are working on a column of the entire table (or a big partition of it) and your data is really big.
As another solution, you can save your data in text format and use Apache Spark to read it and compute your operations simply and efficiently.
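A minimal sketch of the Spark approach using Spark SQL, assuming a hypothetical CSV file matrix.csv with row_id, col_id and value (0/1) columns:

-- expose the file as a queryable view
CREATE TEMPORARY VIEW bool_matrix
USING csv
OPTIONS (path "matrix.csv", header "true", inferSchema "true");

-- AND over one column: 1 only if no zero is present; use MAX for an OR
SELECT MIN(value) FROM bool_matrix WHERE col_id = 42;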
Suppose a distributed database has a recovery mechanism such that, if it is split by a network partition that allows both sides to keep working, both sides continue to commit transactions and, when the partition ends, they agree to roll back to the last common state.
Should we say that this system:
is A+P, as it has a built-in (not very clever) mechanism to deal with partitions, and consistency is sort of broken by the fact that we can revoke a commit that has already been acknowledged?
is C+A, as it does not really deal with partitions, and the state is always consistent (never a mixed, weird thing recorded)?
or another, more complicated option?
"when the partition is ending, they agree to go back to the last common state."
Do you mean that all the writes from time t1 (when the partition occurs) to time t2 (when the partition ends) are lost?
That makes your database neither consistent nor durable.
However, you'd be providing availability and partition tolerance.
I want to know how a sequence for a table works internally in an Oracle database, specifically how the value of the sequence is incremented.
Does Oracle use a trigger to increment the value of the sequence, or something else?
Oracle does not handle sequences the way it handles other objects, like tables.
If you insert 100 records using NEXTVAL and then issue a ROLLBACK,
the sequence does not get rolled back.
Those 100 inserts will still have incremented the sequence,
so the next insert will get the 101st value of the sequence.
This leads to gaps in the sequence, but it is what allows multiple sessions to
safely use sequences without the risk of duplicates.
If two users grab NEXTVAL simultaneously, they will each be
assigned a unique number. Oracle also caches sequence values in memory.
The init.ora parameter SEQUENCE_CACHE_ENTRIES defines how many sequences can be cached at once; the number of values cached per sequence is set with the CACHE clause of CREATE SEQUENCE.
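A minimal sketch of this behaviour (the sequence, table and column names here are just for illustration):

CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1 CACHE 20;

INSERT INTO orders (id, note) VALUES (order_seq.NEXTVAL, 'first');   -- uses value 1
ROLLBACK;                                                            -- the insert is undone, the sequence is not
INSERT INTO orders (id, note) VALUES (order_seq.NEXTVAL, 'second');  -- uses value 2, leaving a gap at 1
SELECT order_seq.CURRVAL FROM dual;                                  -- 2 in this session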
We are building a warehouse using SSIS.
I am doing an aggregation and I received the following message:
The Aggregate transformation has encountered 2000037 key combinations. It has to re-hash data because the number of key combinations is more than expected. The component can be configured to avoid data re-hash by adjusting the Keys, KeyScale, and AutoExtendFactor properties
It did not write the output, and most disturbingly, it did not throw any exceptions.
What is going on here? How do I fix this?
Key combinations are the number of groups in the data flow. Set the Keys property (an advanced property of the transformation) to a number greater than 2,000,037, or adjust KeysScale accordingly. Try to estimate as closely as possible so that the transformation can allocate enough memory up front to perform the aggregation.
I would avoid using this transformation whenever possible. Not only is it memory intensive, but it holds up the entire data flow task until it completes.
Where is the data coming from? If it's a relational database, move the GROUP BY, ORDER BY, and any aggregates into the source database query, for example:
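A sketch with hypothetical table and column names; the source component runs the grouped query, so the Aggregate transformation is no longer needed:

SELECT customer_id,
       order_date,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   dbo.sales
GROUP BY customer_id, order_date
ORDER BY customer_id, order_date;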
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification - when stored in a simple table/array, access to an entry is O(1): just a memory read (or a read from disk). As I understand it, databases store their rows in trees, so they cannot achieve identical performance; accessing an average node will take a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work all day long with databases that have millions of rows, and they are efficient enough.
I don't know exactly what your definition of "achieve similar efficiency using a database" means. In a database (from my experience), what exactly you are trying to do is what matters for performance.
If you simply need a single record based on a primary key, the database should naturally be efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the WHERE clause, and the overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right for you. I've heard (but have never tested the idea) that SQLite is very quick, faster than either of the aforementioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no -- you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and many reads/writes. In your description, the data in an in-memory array is accessed by a single program, and that program owns the memory; it's not (usually) shared. I don't see how you could achieve the same performance.
Another thought: the absolute closest thing you could get to this, in SQL Server specifically, is a table variable. A table variable (in theory) is held in memory only; I've heard people refer to table variables as SQL Server's "array". Any regular table write or CREATE statement prompts the RDBMS to write to disk (I think first to the log and then to the data files), and large data reads can also cause the DB to write to private temp tables to store data for later, or what have you.
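A minimal sketch of that idea in T-SQL (illustrative names; note that table variables can still spill to tempdb, so "memory only" is not guaranteed):

DECLARE @dense TABLE (
    id    INT PRIMARY KEY,   -- dense integer key
    value BIT NOT NULL
);

INSERT INTO @dense (id, value) VALUES (0, 1), (1, 0), (2, 1);

SELECT value FROM @dense WHERE id = 2;   -- single-key lookup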
There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indexes will be stored separately or whether the data will be stored in one index tree (a clustered index, as Brian described).
But in your case this does not matter at all, because:
All databases make heavy use of caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly be cached in the database cache.
If you are reading a single record at a time, the main overhead you will see is accessing the data over the database protocol. The process goes something like this:
connect to the database - open a communication channel
send the SQL text from the application to the database
the database analyzes the SQL (parses it, checks whether the command was previously compiled, compiles it if it is issued for the first time, ...)
the database executes the SQL; after a few executions, the data from your example will be cached in memory, so execution will be very fast
the database packs the fetched records for transport to the application
the data is sent over the communication channel
the database component in the application unpacks the received data into some dataset representation (e.g., an ADO.NET DataSet)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store the data in an array, there would be no visible gain.
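As an illustration of that point, a single-record fetch like the one below (table and column names are assumed) is a trivial index seek once the data is cached; almost all of its end-to-end cost is the communication steps listed above rather than the lookup itself:

SELECT value
FROM   dense_table       -- hypothetical table with an integer primary key
WHERE  id = 123456;      -- primary-key lookup, served from the database cache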
If you've got a decent number of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed-record-length flat files. And yes, they are super-efficient compared to databases, but like struct/value arrays vs. classes, they just don't have the kinds of features we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a table with an ID column and a bit column, and use a clustered index on the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is somewhat inefficient in terms of space (you're using n·lg n space instead of n space).
I don't claim this is efficient, but the data will be stored in a manner similar to how an array would have been stored.
Note that the ID column can be marked as a counter (identity) column in most DB systems, in which case you can just insert 1,000,000 items and it will do the counting for you. (I am not sure whether such a DB avoids explicitly storing the counter's value, but if it does, you'd only end up using n space.)
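A minimal sketch of that setup in T-SQL (assuming SQL Server; names are illustrative):

CREATE TABLE dense_bits (
    id    INT IDENTITY(0, 1) NOT NULL,                   -- counter column starting at 0
    value BIT NOT NULL,
    CONSTRAINT pk_dense_bits PRIMARY KEY CLUSTERED (id)  -- rows are stored in id order
);

-- insert the values in order; the identity column does the counting for you
INSERT INTO dense_bits (value) VALUES (1), (0), (1);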
When your primary key is an integer sequence, it can be a good idea to use a reverse key index. This makes sure that contiguous values are spread apart in the index tree, which reduces contention on the rightmost index block under heavy concurrent inserts.
However, there is a catch: with reverse key indexes you will not be able to do range searches.
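In Oracle (an assumption here; syntax differs in other systems), a reverse key index is just an option on CREATE INDEX. A minimal sketch with illustrative names:

CREATE TABLE dense_rows (
    id    NUMBER,
    value NUMBER(1) NOT NULL
);

-- reverse key index: neighbouring ids land on different index blocks
CREATE UNIQUE INDEX dense_rows_pk ON dense_rows (id) REVERSE;
ALTER TABLE dense_rows ADD CONSTRAINT dense_rows_pk PRIMARY KEY (id) USING INDEX dense_rows_pk;

-- equality lookups still work; range predicates (id BETWEEN 10 AND 20) will not use this index
SELECT value FROM dense_rows WHERE id = 123456;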
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: an index-organized table (this might be what you are looking for; see the sketch after this list)
insert only, no updates: no indexes, no spare space
read access via full table scan: table compression
highly concurrent writes when the id comes from a sequence: a reverse key index
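For the index-organized table idea, a minimal sketch in Oracle syntax (names are illustrative):

-- an index-organized table stores the rows inside the primary key B-tree itself,
-- so a lookup by id is a single index traversal with no separate table access
CREATE TABLE dense_iot (
    id    NUMBER PRIMARY KEY,
    value NUMBER(1) NOT NULL
)
ORGANIZATION INDEX;

SELECT value FROM dense_iot WHERE id = 123456;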
For the actual question, precisely as asked: write all the rows into a single BLOB (the table contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on BLOBs. Even if it works, I don't think this approach would be useful in any realistic scenario.