I have been reading up on Hadoop and HBase lately, and came across this term:
HBase is an open-source, distributed, sparse, column-oriented store...
What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence, would like to know more about it.
In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).
This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
Sparse columns incur a performance penalty and are unlikely to save you much disk space, because the space required to indicate NULL is smaller than the 64-bit pointer required for the linked-list style of chained-pointer architecture typically used to implement very large non-contiguous storage.
Storage is cheap. Performance isn't.
Sparse with respect to HBase is indeed used in the same sense as for a sparse matrix. It basically means that fields that are null are free to store (in terms of space).
I found a couple of blog posts that touch on this subject in a bit more detail:
http://blog.rapleaf.com/dev/2008/03/11/matching-impedance-when-to-use-hbase/
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
At the storage level, all data is stored as a key-value pair. Each storage file contains an index so that it knows where each key-value starts and how long it is.
As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
See:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
for more information on HBase storage
Data stored in tables can be either sparse data or dense data.
An example of sparse data:
Suppose we run a query on a table containing sales data for transactions by employees between January 2015 and November 2015; the query returns the data that satisfies that timestamp condition.
If an employee did not make any transaction, the whole row comes back blank.
e.g.
EMPNo  Name  Product  Date        Quantity
1234   Mike  HBase    2014/12/01  1
5678
3454   Jole  Flume    2015/09/12  3
The row for EMPNo 5678 has no data while the rest of the rows do. If we consider the whole table, blank rows and populated rows together, we can term it sparse data.
If we take only the populated data, it is termed dense data.
The best article I have seen, which explains many database terms as well:
> http://jimbojw.com/#understanding%20hbase
I'm trying to get my head around Snowflake's capabilities with wide tables.
I have a table of the form:
userId  metricName          value  asOfDate
1       'meanSessionTime'   30     2022-01-04
1       'meanSessionSpend'  20     2022-01-04
2       'meanSessionTime'   34     2022-01-05
...     ...                 ...    ...
However, for my analysis I usually pull big subsets of this table into Python and pivot out the metric names:
userId  asOfDate    meanSessionTime  meanSessionSpend  ...
1       2022-01-04  30               20                ...
2       2022-01-05  43               12                ...
...     ...         ...              ...               ...
I am thinking of generating this pivot in Snowflake (via DBT; the SQL itself is not hard), but I'm not sure whether this is a good idea.
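For concreteness, the pivot I have in mind is just conditional aggregation, something like the sketch below (metrics_long is a placeholder name; the real query would list all the metrics, generated via DBT/Jinja):

```sql
-- Tall-to-wide pivot via conditional aggregation, one line per metric.
SELECT
    userId,
    asOfDate,
    MAX(IFF(metricName = 'meanSessionTime',  value, NULL)) AS meanSessionTime,
    MAX(IFF(metricName = 'meanSessionSpend', value, NULL)) AS meanSessionSpend
FROM metrics_long
GROUP BY userId, asOfDate;
```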
Any good reasons to keep the data in the long format? Any good reasons to go wide?
Note that I don't plan to always SELECT * from the wide table, so it may be a good use case for the columnar storage.
Note: these are big tables (billions of records, hundreds of metrics), so I am looking for a sense-check before I burn a few hundred dollars in credits doing an experiment.
Thanks for the additional details provided in the comments, and apologies for the delayed response. A few thoughts.
I've used both Wide and Tall tables to represent feature/metric stores in Snowflake. You can also potentially use semi-structured column(s) to store the Wide representation, or, in the Tall format, store the metric value in a single VARIANT column if your metrics can be of different data types (e.g. numeric and character).
With ~600 metrics (columns), you are still within the limits of Snowflake's row width, but the wider the table gets, generally the less usable/manageable it becomes when writing queries against it, or just retrieving the results for further analysis.
The wide format will typically result in a smaller storage footprint than the tall format, due to the repetition of the key (e.g. user-id, asOfDate) and metricName, plus any additional columns you might need in the tall form. I've seen 3-5x greater storage in the Tall format in some implementations so you should see some storage savings if you move to the Wide model.
In the Tall table this can be minimised by clustering the table so that the same key and/or metric column values are gathered into the same micro-partitions, which then favours better compression and access. Also, as referenced in my comments/questions, if some metrics are sparse, or have a dominant default value distribution, or change value at significantly different rates, moving to a sparse-tall form can enable much more efficient storage and processing. In the wide form, if only one metric value out of 600 changes on a given day, you still need to write a new record with all 599 unchanged values, whereas in the tall form you could write a single record for the metric whose value changed.
In the wide format, Snowflake's columnar storage/access should effectively eliminate physical scanning of columns not included in the queries, so they should be at least as efficient as the tall format, and columnar compression techniques can effectively minimise the physical storage.
Assuming your data is not being inserted into the tall table in the optimal sequence for your analysis patterns, the table will need to be clustered to get the best performance, using CLUSTER BY. For example, if you are always filtering on a subset of user-ids, user-id should take precedence in your CLUSTER BY, but if you are mainly going after a subset of metrics for all (or a subset of all) user-ids, then the metricName should take precedence. Clustering has an additional service cost, which may become a factor in using the tall format.
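For illustration, a minimal clustering sketch on a hypothetical tall table (metrics_tall and the column names are assumptions, not from the question):

```sql
-- Cluster so rows for the same metric and date land in the same micro-partitions.
ALTER TABLE metrics_tall CLUSTER BY (metricName, asOfDate);

-- If queries mostly filter on a subset of users instead, put userId first:
-- ALTER TABLE metrics_tall CLUSTER BY (userId, asOfDate);
```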
In the tall format, having a well-defined standard for metric names enables a programmatic approach to column selection (e.g. column names as contracts). This makes working with groups of columns as a unit very effective, using the WHERE clause to 'select' the column groups (e.g. with LIKE) and apply operations on them efficiently. IMO this enables much more concise and maintainable SQL to be written, without necessarily needing a templating tool like Jinja or DBT.
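For example, a sketch of selecting a 'group' of metrics by name pattern (assuming the tall table above and a naming convention where related metrics share a prefix):

```sql
-- Operate on the whole 'meanSession*' group of metrics in one statement.
SELECT userId, asOfDate, metricName, AVG(value) AS avg_value
FROM metrics_tall
WHERE metricName LIKE 'meanSession%'
GROUP BY userId, asOfDate, metricName;
```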
Similar flexibility can be achieved in the wide format by grouping and storing the metric name/value pairs within OBJECT columns, rather than as individual columns. They can be gathered (pivoted) into an OBJECT with OBJECT_AGG, and Snowflake's semi-structured functionality can then be used on the object. Snowflake implicitly columnarises semi-structured columns, up to a point/limit, but with 600+ metrics some of your data will not benefit from this, which may impact performance. If you know which columns are most commonly used for filtering or returned in queries, you could use a hybrid of the two approaches.
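A minimal sketch of the OBJECT_AGG approach (again with assumed table/column names):

```sql
-- Gather the tall name/value pairs into one OBJECT per user/date.
SELECT
    userId,
    asOfDate,
    OBJECT_AGG(metricName, value::VARIANT) AS metrics
FROM metrics_tall
GROUP BY userId, asOfDate;

-- Individual metrics can then be pulled back out of the object, e.g.
-- metrics['meanSessionTime']::NUMBER in a subsequent query.
```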
I've also used Snowflake UDFs to effectively perform commonly required filter, project or transform operations over the OBJECT columns using Javascript, but noting that you're using Python, the new Python UDF functionality may be a better option for you. When you retrieve the data to Python for further analysis you can easily convert the OBJECT to a DICT in Python for further iteration. You could also take a look at Snowpark for Python, which should enable you to push further analysis and processing from Python into Snowflake.
You could of course not choose between the two options, but go with both. If CPU dominates storage in your cloud costs, then you might get the best bang for your buck by maintaining the data in both forms, and picking the best target for any given query.
You can even consider creating views that present the one form as the other, if query convenience outweighs other concerns.
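For instance, a sketch of a view that exposes a wide table in tall form via UNPIVOT (metrics_wide and the two metric columns are placeholders):

```sql
-- UNPIVOT turns the listed metric columns back into (metricName, value) rows.
CREATE OR REPLACE VIEW metrics_tall_v AS
SELECT userId, asOfDate, metricName, value
FROM metrics_wide
    UNPIVOT (value FOR metricName IN (meanSessionTime, meanSessionSpend));
```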
Another option is to split your measures out by volatility. Store the slow-moving ones with a date-range key in a narrow (6NF) table and the fast-moving ones with snapshot dates in a wide (3NF) table. Again, a view can help present a simpler access point for users (although I guess the Snowflake optimizer won't do join pruning over range predicates, so YMMV on the view idea).
Non-1NF gives you more options too on DBMSes like Snowflake that have native support for "semi-structured" ARRAY, OBJECT, and VARIANT column values.
BTW do update us if you do any experiments and get comparison numbers out. Would make a good blog post!
I'm new to databases. Recently I started using TimescaleDB, which is an extension of PostgreSQL, so I guess this is also PostgreSQL related.
I observed some strange behavior. I worked out my table structure: 1 timestamp and 2 doubles, so 24 bytes per row in total. I imported (via psycopg2 copy_from) 2,750,182 rows from a CSV file. I manually calculated that the size should be 63 MB, but when I query TimescaleDB it tells me the table size is 137 MB, the index size is 100 MB, and the total is 237 MB. I was expecting the table size to match my calculation, but it doesn't. Any idea?
There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here, but basically Postgres uses 23 (+ padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). That gets you pretty darn close to the 137 MB you are seeing. The rest might be due to either the fill factor setting of the table or any dead rows still included in the table from, say, a previous insert and subsequent delete.
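A rough back-of-the-envelope check, assuming an 8-byte timestamp, two 8-byte doubles, the 23-byte header padded to 24 bytes, and a 4-byte line pointer per tuple (my_table is a placeholder; for a TimescaleDB hypertable you would look at the chunks or use TimescaleDB's own size functions):

```sql
-- Approximate per-row cost: 24 bytes data + 24 bytes tuple header (23 + padding)
--                           + 4 bytes line pointer ~= 52 bytes.
-- 2,750,182 rows * 52 bytes ~= 143 MB (~136 MiB), close to the 137 MB reported.

-- Check the actual sizes directly:
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_size,
       pg_size_pretty(pg_indexes_size('my_table'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('my_table')) AS total_size;
```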
Index Size: Unlike some other DBMSs Postgres does not organize its tables on disk around an index, unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather it keeps its indices separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using your index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is going in with some ordering (see https://www.postgresql.org/docs/10/static/brin-intro.html).
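For example, a minimal BRIN sketch (table and column names are placeholders):

```sql
-- BRIN stores only per-block-range summaries, so it is tiny compared to a
-- B-tree and works well when rows arrive in roughly time order.
CREATE INDEX my_table_ts_brin ON my_table USING brin (ts);
```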
I have a table with a column that contains HTML content and is relatively large compared to the other columns.
Can a column of that size slow down queries on this table?
Do I need to put this big field in another table?
The TOAST technique should handle this for you: past a given size, the value is transparently stored in a _toast table, and some internal things are done to avoid slowing down your queries (see the TOAST documentation).
But of course, if you always retrieve the whole content, you'll lose time in the retrieval. It's also clear that queries on this table that don't use this column won't suffer from its size.
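If you want to see how much of the table actually lives out-of-line in TOAST, a small sketch (the table name is a placeholder):

```sql
-- pg_table_size includes TOAST (and FSM/VM); pg_relation_size is the main heap
-- only, so the difference approximates the TOASTed portion.
SELECT pg_size_pretty(pg_relation_size('pages'))                          AS main_heap,
       pg_size_pretty(pg_table_size('pages') - pg_relation_size('pages')) AS toast_etc;
```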
All else being equal, the bigger the database, the slower the queries.
It's likely that if you have a large column, there is going to be more disk I/O, since caching the column itself takes more space. However, putting it in a different table won't likely alleviate this issue (other than the issue below). When you don't explicitly need the actual HTML data, be sure not to SELECT it.
Sometimes the ordering of the columns can matter because of the way rows are stored; if you're really worried about it, store the HTML as the last column so it doesn't get paged in when selecting other columns.
You would have to look at how Postgres internally stores things to see if you need to split this out, but a very large field could cause the data to be broken up on disk, which adds to the time it takes to access it.
Further, returning 100 bytes of data versus 10,000 bytes of data for one record is clearly going to be slower, and the more records, the slower. If you are doing SELECT * this is clearly a problem, especially if you usually do not need the HTML.
Another consideration could be putting the HTML information in a NoSQL database. This kind of document information is what they excel at. There's no reason you can't use both a relational database for some info and a NoSQL database for other info.
Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway merge sort is the keyword for sorting amounts of data far too large to fit in memory.
As far as I know, most indexes use some form of B-tree, which does not need to have everything in memory. You can simply put nodes of the tree in a file and then jump to various positions in the file. This can also be used for sorting.
Are you building a database engine?
Edit: I built a disc based database system back in the mid '90's.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed record size data, the index can just keep track of the record number. For variable length data records, the index just needs to store the offset within the file where the record starts, and each record needs to begin with a structure that specifies its length.
You would have to partition your data set in some way and spread each partition across separate servers' RAM. If I had a billion 32-bit ints, that's 4 GB of RAM right there, and that's only the index.
For low-cardinality data, such as gender (only two values: male, female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
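A minimal sketch of such an index in Oracle (table and column names are made up):

```sql
-- Bitmap index on a low-cardinality column: each distinct value gets a
-- bitmap with one bit per row, so the per-entry storage cost is tiny.
CREATE BITMAP INDEX employees_gender_bix ON employees (gender);
```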
Hmm... Interesting question.
I think most widely used database management systems rely on the operating system's memory-management mechanisms, and when physical memory runs out, in-memory tables go to swap.
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification: when stored in a simple table / array, access to an entry is O(1), just a memory read (or a read from disk). As I understand it, databases store their records in trees, so they cannot achieve identical performance: access to an average record will take a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work with databases all day long that have millions of rows. They are efficient enough.
I don't know what your definition of "achieve similar efficiency using a database" means. In a database (from my experience), what exactly you are trying to do matters for performance.
If you simply need a single record based on a primary key, the database should be naturally efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick, faster than either of the aforementioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no, you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and data reads/writes. In your description, the data in an in-memory array is for a single program to access, and that program owns the memory. It's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: The absolute closest thing you could get to this, in SQL Server specifically, is using a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL Server's "array". Any regular table write or create statement prompts the RDBMS to write to disk (I think first to the log and then to the data files). Large data reads can also cause the DB to write to private temp tables to store data for later, or what have you.
There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indexes will be stored separately, or whether the data will be stored in one index tree (a clustered index, as Brian described).
But in your case this does not matter at all, because:
All databases make heavy use of caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
If you are reading a single record at a time, the main overhead you will see is in accessing data over the database protocol. The process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes the SQL (parses it, checks whether the command was previously compiled, compiles it if this is the first time it is issued, ...)
database executes the SQL; after a few executions the data from your example will be cached in memory, so execution will be very fast
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store data in an array, there would be no visible gain.
If you've got a decent number of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed record length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a DB with an ID column and a bit column. Use a clustered index on the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is somewhat inefficient in terms of space (the index tree adds overhead on top of the raw n records).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as a counter (identity/auto-increment) in most DB systems, in which case you can just insert 1,000,000 items and it will do the counting for you. I am not sure whether such a DB avoids explicitly storing the counter's value, but if it does, you'd only end up using n space.
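A minimal sketch of that layout in SQL Server (table and column names are made up; IDENTITY plays the role of the counter):

```sql
-- id is the clustered primary key, so rows are stored in id order on disk,
-- giving array-like seeks by key; IDENTITY(0,1) generates 0, 1, 2, ... for us.
CREATE TABLE dense_flags (
    id  INT IDENTITY(0,1) NOT NULL PRIMARY KEY CLUSTERED,
    val BIT NOT NULL
);
```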
When you have your primary key as an integer sequence, it can be a good idea to have a reverse key index. This makes sure that contiguous values are spread apart in the index tree.
However, there is a catch: with reverse key indexes you will not be able to do range searches.
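In Oracle that looks like the following (names are placeholders):

```sql
-- A reverse key index byte-reverses each key before storing it, scattering
-- sequential ids across leaf blocks (less hot-block contention), at the cost
-- of losing index range scans on id.
CREATE INDEX dense_tbl_id_rix ON dense_tbl (id) REVERSE;
```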
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: index-organized table (this might be what you are looking for; see the sketch after this list)
insert only, no updates: no indexes, no spare space
read access via full table scan: compressed table
highly concurrent writes when the id comes from a sequence: reverse key index
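For the first idea, a minimal sketch of an index-organized table in Oracle (names are made up):

```sql
-- The whole row lives in the primary key's B-tree, so a lookup by id goes
-- straight to the data with no separate table access.
CREATE TABLE dense_values (
    id  NUMBER PRIMARY KEY,
    val NUMBER NOT NULL
) ORGANIZATION INDEX;
```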
For the actual question, precisely as asked: write all rows into a single BLOB (the table contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on BLOBs. Even if it works, I don't think this approach would be useful in any realistic scenario.