Context on the problem statement.
Scroll to the bottom for questions.
Note: The tables are not relational; joins can be done at the application level.
Classes
Record
Most atomic unit of the database (each record has key, value, id)
Page
Each page file can store multiple records. A page is a fixed-size chunk (8 KB?), and it stores an offset at the top for retrieving each id within the page.
Index
A B-tree data structure that supports O(log n) lookups to find which page a given id lives in.
We can also insert (id, page) entries into this B-tree.
Table
Each Table is an abstraction over a directory that stores multiple pages.
Table also stores Index.
Database
Database is an abstraction over a directory which includes all tables that are a part of that database.
Database Manager
Gives ability to switch between different databases, create new databases, and drop existing databases.
Communication In Main Process
Initiates the Database Manager as its own process.
When the process quits, it saves the indexes back to disk.
The process also flushes the indexes to disk on a regular interval.
To interact with this DB process, we communicate with it over HTTP.
Database Manager stores a reference to the current database being used.
The current database attribute stored in the Database Manager holds references to all Tables in a hashmap.
Each Table stores a reference to its index, which is read from the index page on disk and kept in memory.
Each Table exposes public methods to set and get key-value pairs.
The get method navigates the B-tree to find the right page; on that page it finds the key-value pair via the offset stored on the first line, and returns a Record.
The set method writes a key-value pair to a page and then updates the index for that table.
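As a rough sketch of the get/set path described above (not the actual design): a plain dict stands in for the B-tree index, the offsets are kept in memory rather than written to the first line of each page, and the names (Table, page_N.dat, the 8 KB limit) are illustrative assumptions.

    import os

    PAGE_SIZE = 8 * 1024  # assumed 8 KB pages

    class Table:
        def __init__(self, directory):
            self.directory = directory
            self.index = {}        # id -> (page number, byte offset); dict stands in for the B-tree
            self.current_page = 0
            os.makedirs(directory, exist_ok=True)

        def _page_path(self, page_no):
            return os.path.join(self.directory, "page_%d.dat" % page_no)

        def set(self, record_id, key, value):
            # assumes keys and values contain no tabs or newlines
            record = ("%s\t%s\t%s\n" % (record_id, key, value)).encode()
            path = self._page_path(self.current_page)
            size = os.path.getsize(path) if os.path.exists(path) else 0
            if size + len(record) > PAGE_SIZE:   # page full: start a new one
                self.current_page += 1
                path = self._page_path(self.current_page)
                size = 0
            with open(path, "ab") as f:          # append; the record starts at the old end of the file
                f.write(record)
            self.index[record_id] = (self.current_page, size)  # update the per-table index

        def get(self, record_id):
            page_no, offset = self.index[record_id]  # B-tree lookup in the real design
            with open(self._page_path(page_no), "rb") as f:
                f.seek(offset)
                rid, key, value = f.readline().decode().rstrip("\n").split("\t")
            return rid, key, value                   # the Record

Saving the index to disk on quit or on an interval, as described above, would then just be a matter of serializing self.index.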
Outstanding Questions:
Am I making any logical errors in my design above?
How should I go about figuring out what the data page size should be (not sure why relational DBs settled on 8 KB)?
How should I store the Index B-tree to disk?
Should the Database load all indexes for its tables into memory at the very start?
A couple of notes off the top of my head:
How many records do you anticipate storing? What are the maximum key and value sizes? I ask, because with a file per page scheme, you might find yourself exhausting available file handles.
Are the database/table distinctions necessary? What does this separation gain you? Truly asking the question, not being Socratic.
I would define page size in terms of multiples of your maximum key and value sizes so that you can get good read/write alignment and not have too much fragmentation. It might be worth having a naive, but space inefficient, implementation that aligns all writes.
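For illustration only, a back-of-the-envelope version of that sizing advice; the MAX_KEY/MAX_VALUE limits below are made-up numbers you would replace with your own.

    # Hypothetical limits; pick values that fit your actual data.
    MAX_KEY = 64                       # bytes
    MAX_VALUE = 192                    # bytes
    SLOT = MAX_KEY + MAX_VALUE         # one fixed-size slot per record: 256 bytes
    SLOTS_PER_PAGE = 32
    PAGE_SIZE = SLOT * SLOTS_PER_PAGE  # 8192 bytes, so every write lands on a slot boundary

    def slot_offset(slot_index):
        """Byte offset of a record slot within its page."""
        return slot_index * SLOT

Wasting the unused tail of each slot is the "space inefficient" part; the payoff is that reads and writes never straddle a record boundary.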
I would recommend starting with the simplest possible design (load all indices, align writes, flat schema) to launch your database, then layer on complexity and optimizations as they become needed, but not a moment before. Hope this helps!
Related
I'm interested to hear other developers' views on creating and loading data, as the current site I'm working on has a completely different take on DWH loading.
The protocol currently used to load a fact table has a number of steps:
1. Drop the old table
2. Recreate the table with no PK/clustered index
3. Load the cleaned/new data
4. Create the PK & indexes
I'm wondering how much work really goes on under the covers in step 4? The data is loaded without a clustered index, so I'm assuming that the natural order of the data load defines its order on disk. When step 4 creates a (clustered) primary key, it will reorder the data on disk into that order. Would it not be better to load the data with the PK/clustered index already defined, thereby reducing server workload?
When inserting a large number of records, the overhead of updating the index can often be larger than simply creating it from scratch. The performance gain comes from inserting into a heap, which is the most efficient way to get data into a table.
The only way you can know whether your import strategy is faster with the indexes left intact is to test both approaches in your own environment and compare.
In my view, indexes are good for SELECTs and may be bad for DML operations.
If you are loading a huge amount of data, the indexes need to be updated for every insert. This can drag performance down, sometimes severely.
I know that switching between partitions requires both partitions to reside in the same filegroup, but I am not able to find any proper explanation of the reasons behind that requirement.
Source and target tables must share the same filegroup.
The source and the target table of the ALTER TABLE...SWITCH statement must reside in the same filegroup, and their large-value columns must be stored in the same filegroup. Any corresponding indexes, index partitions, or indexed view partitions must also reside in the same filegroup. However, the filegroup can be different from that of the corresponding tables or other corresponding indexes.
http://technet.microsoft.com/en-us/library/ms191160(v=sql.105).aspx
In one of my partition implementations:
I keep my archival table in the same filegroup, perform the SWITCH, then drop and recreate the clustered index to move the data to a different filegroup. This is costing me a lot!
I want the old data moved to a different table, i.e. an archival table (for analysis purposes), residing in a different filegroup (a different drive). But due to this restriction I have implemented it as described above.
I understand the concept (the data is not physically moved), but why?
I'm expecting an answer along the lines of "due to the SQL Server page size limitation" or "the paging concepts overlap", something like that.
Please help me find or understand this!
The switch statement is that efficient because it's essentially just replacing addresses on-disk instead of moving the data around. Hence, both sets of data need to be in the same file-group in order to facilitate this "trick".
Consider the following setup
      FG1               FG2
       |                 |
  ------------      ------------
  |          |      |          |
  F1        F2      F3        F4
Where the FGs are file groups and the Fs below are individual files.
We have a partition whose data, by chance, all happens to currently exist within F1. After we've performed the switch, all of its data will still be within F1. Despite the restriction only being specified in terms of file groups, the restriction is actually "the data has to stay inside the same file".
Why? Because that's the whole reason we can do this efficiently. We can't take the extents (or even individual data pages) within F1 and suddenly make them part of F2 (or F3 or F4) because those other files might be located on other disks. You can't say "this page of this disk here is now part of that file, located over there on another disk" - that's not how (most traditional) file systems work - certainly the ones that SQL Server works on.
And in the case of wanting to move across file groups, you can't suddenly say F1 (or part of it) is now part of FG2 as well as, or instead of, it belonging to FG1. Files only belong to one file group because File Groups are the level of management for several features.
If we want to move a set of rows between two tables but we want to write it in a way that doesn't have data movement restrictions, we've already got the tools to do that - INSERT and DELETE (possibly writing it as one cute combined statement using OUTPUT from DELETE as the source of rows for the INSERT).
Someone at Microsoft could sit down and write an implementation of ALTER TABLE ... SWITCH that does allow data movement to occur - but they haven't seen the need to implement that. Instead, they've documented the current restriction.
(I'll note that I'm still not linking to any "official sources" or really adding much new here that can't be gleaned from understanding what filegroups actually are, which I'd hope someone would have at least some familiarity with before they encounter a situation where they're actually trying to move data between them)
One filegroup can store data for multiple tables. The statement refers to a table, not to a filegroup. When a filegroup stores data for multiple tables and you are trying to move only one table's data (switching only one table to another FG), the original filegroup would have to be split, which requires a lot of resources (to physically move the data). In that case the operation is similar to moving an index or clustered key to another filegroup.
Edit: Additionally, the source and target FGs and their associated partition schemes and functions are not necessarily the same, and the data you are trying to move does not necessarily fit the partition definition in the new filegroup.
From the comments below
Why would it be impossible to update the metadata that stores the filegroup information, just as SQL Server updates the metadata that points to the B-tree root? In this answer I do not find a reason why SWITCH cannot switch across filegroups.
It is not impossible, but one physical file cannot be part of multiple FGs. If you changed only the metadata to point to another FG while the original FG stores data for multiple tables (which means a physical file stores data for multiple tables), either a file split would be necessary or, if the data is not moved, a file would end up being part of multiple FGs.
The use case for SWITCH is not to relocate data to different storage but to move data into a different table as a metadata-only operation. This is by design. There is no technical reason that prevents SQL Server from moving the partition to a different filegroup, but then it would no longer be just metadata, and the operation could be quite time consuming due to extensive data movement. This would basically be the same costly operation you currently do manually.
This raises the question of why you are moving the data to a different table at all. I would think you'd just want to move the partition to a different filegroup but keep it in the original table.
I decided to use HBase in a project to store user activities in a social network. Despite the fact that HBase has a simple way to express data (column-oriented), I'm facing some difficulty deciding how to represent the data.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, vote, etc. I basically thought of two approaches for an Activity HBase table:
1. The key is the user reference + the timestamp of the activity creation, and the value is all the activity metadata (most of the time a fixed size).
2. The key is the user reference, and each activity is stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the 2nd approach. The 1st approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact on the way I access the data for these two approaches?
In general, you are asking whether your table should be wide or tall. HBase works with both, up to a point. Wide tables should never have a row that exceeds the region size (256 MB by default), so a really prolific user may crash the system if you store large chunks of data for their actions. However, if you are only storing a few bytes per action, then putting all of a user's activity in one row will let you get their full history with a single get. You will be retrieving the full row, though, which could cause some slowdown for a long history (tens of seconds for rows over 100 MB).
Going with a tall table and an inverse timestamp would allow you to get a user's recent activity very quickly (start a scan with key = user id).
Using timestamps at the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always go to the most recent region in the system, causing hot-spotting).
You might also want to consider putting more information (such as the activity type) in the key so that you can pick up all activity of a particular type more easily.
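As a rough sketch of what such a row key could look like in plain Python (the user id + activity type + inverted timestamp layout is just one possible choice, not something HBase prescribes):

    import struct
    import time

    MAX_LONG = 2 ** 63 - 1

    def activity_row_key(user_id, activity_type, ts_millis=None):
        """Build a row key of the form user|type|inverted-timestamp."""
        if ts_millis is None:
            ts_millis = int(time.time() * 1000)
        inverted = MAX_LONG - ts_millis            # newer activity sorts first
        return (user_id.encode("utf-8") + b"|"
                + activity_type.encode("utf-8") + b"|"
                + struct.pack(">q", inverted))     # big-endian so byte order matches numeric order

A scan starting at b"user42|" would then return all of user42's activity, newest first, and b"user42|comment|" would narrow it to comments only.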
Another example to look at is OpenTSDB
Ok everyone, I have an excellent challenge for you. Here is the format of my data :
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The data model is very simple.
My problem is as follows:
I have 64 GB+ of data as defined above, and in a real-time application I need to query my database and retrieve the data almost instantly. I was thinking about two solutions, but neither seems feasible to set up.
First, use SQLite or MySQL. One table is needed, with one index on the ID column. The problem is that the database will be too large to get good performance, especially with SQLite.
Second, store everything in memory in a huge hashtable. RAM is the limit.
Do you have another suggestion? What about serializing everything on the filesystem and then, on each query, storing the queried data in a cache system?
When I say real-time, I mean about 100-200 queries/second.
A thorough answer would take the data access patterns into account. Since we don't have these, we just have to assume a uniform probability that any given row will be accessed next.
I would first try using a real RDBMS, either embedded or a local server, and measure the performance. If this gives you 100-200 queries/sec, then you're done.
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized further by creating a separate index and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first) and rows that are often read in succession are placed close to each other. This will ensure that you get the most back from each cache miss.
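As a rough illustration of the memory-mapped approach (my assumptions, not part of the question): fixed-width records sorted by a 64-bit ID stored in the first 8 bytes of each record.

    import mmap
    import struct

    RECORD_SIZE = 256   # assumed bytes per row: 8-byte ID + payload
    ID_FMT = ">q"       # big-endian signed 64-bit ID at the start of each record

    def lookup(path, wanted_id):
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, len(mm) // RECORD_SIZE
            while lo < hi:                             # binary search over the sorted records
                mid = (lo + hi) // 2
                off = mid * RECORD_SIZE
                (rec_id,) = struct.unpack_from(ID_FMT, mm, off)
                if rec_id < wanted_id:
                    lo = mid + 1
                elif rec_id > wanted_id:
                    hi = mid
                else:
                    return bytes(mm[off:off + RECORD_SIZE])  # copy the row out of the map
        return None

The OS page cache does the rest: hot regions of the file stay in memory across calls.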
Given the way the data is used, you should do the following:
1. Create a fixed-size record structure that is large enough to contain one full row of data.
2. Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (ascending).
3. Do direct access on the file and leave caching to the OS. To get record number N (0-based), multiply N by the size of a record (in bytes) and read the record directly from that offset in the file (sketched below).
Since you're in read-only mode and assuming you're storing your file on random-access media, this scales very well and does not depend on the size of the data: each fetch is a single read from the file. You could try some fancy caching system, but I doubt it would gain you much in terms of performance unless you get a lot of requests for the same data rows (and the OS you're using does poor caching). Make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.
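Step 3 might look roughly like this, assuming the fixed record size chosen in step 1 (the size itself is just an example):

    RECORD_SIZE = 256   # must match the fixed layout exported in step 2

    def read_record(path, n):
        with open(path, "rb") as f:     # read-only, so the OS can cache aggressively
            f.seek(n * RECORD_SIZE)     # offset = record number * record size
            return f.read(RECORD_SIZE)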
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification: when stored in a simple table/array, access to entries is O(1), just a memory read (or a read from disk). As I understand it, all databases store their nodes in trees, so they cannot achieve identical performance: access to an average node will take a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work all day long with databases that have millions of rows. They are efficient enough.
I don't know exactly what "achieve similar efficiency using a database" means to you. In a database (in my experience), what exactly you are trying to do matters for performance.
If you simply need a single record based on a primary key, then the database should naturally be efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from more than 15 minutes to 1 or 2 seconds simply by optimizing my joins, the WHERE clause, and the overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick, faster than either of the aforementioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no, you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and data reads/writes. In your description, the data in an in-memory array is for a single program to access, and that program owns the memory. It's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: the absolute closest thing you could get to this, in SQL Server specifically, is a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL Server's "array". Any regular table write or create statement prompts the RDBMS to write to disk (I think first to the log and then to the data files). And large data reads can also cause the DB to write to private temp tables to store data for later use, or what have you.
There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indices are stored separately or the data is stored in one index tree (a clustered index, as Brian described).
But in your case this does not matter at all, because:
All databases make heavy use of caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
If you are reading a single record at a time, the main overhead you will see is the cost of accessing data over the database protocol. The process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes the SQL (parses it, checks whether the command was previously compiled, compiles it if it is issued for the first time, ...)
database executes the SQL. After a few executions, the data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store data in an array, there would be no visible gain.
If you've got a decent number of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed record length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a DB with an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is somewhat inefficient in terms of space (you're using O(n log n) space instead of O(n) space).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as a counter in most DB systems, in which case you can just insert 1,000,000 items and it will do the counting for you. I am not sure whether such a DB avoids explicitly storing the counter's value, but if it does, then you'd only end up using O(n) space.
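As a rough illustration of the insert-in-order idea, here is a sketch using SQLite (not necessarily the DB in question): an INTEGER PRIMARY KEY aliases the rowid, so the table is stored in ID order, much like a clustered index, and inserting ascending IDs keeps the B-tree appends cheap.

    import sqlite3

    conn = sqlite3.connect("dense.db")
    conn.execute("CREATE TABLE IF NOT EXISTS flags (id INTEGER PRIMARY KEY, flag INTEGER NOT NULL)")
    with conn:                                        # one transaction for the whole bulk insert
        conn.executemany(
            "INSERT OR REPLACE INTO flags (id, flag) VALUES (?, ?)",
            ((i, 1) for i in range(1000000)),         # insert in ascending ID order
        )
    row = conn.execute("SELECT flag FROM flags WHERE id = ?", (123456,)).fetchone()
    conn.close()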
When your primary key is an integer sequence, it can be a good idea to use a reverse index. This ensures that contiguous values are spread apart in the index tree.
However, there is a catch - with reverse indexes you will not be able to do range searching.
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: index organized table (this might be what you are looking for)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
For the actual question, precisely as asked: write all rows into a single BLOB (the table contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on BLOBs. Even if it works, I don't think this approach would be useful in any realistic scenario.