Long Binary Array Compression

There is an array of binary values, and the number of elements in this array is around 10^20.
The number of "ones" in the array is around 10^10, and their positions are randomly distributed.
Once the data is generated and saved, it won't be edited: it will remain read-only for its whole life cycle.
With this data saved, requests will be received. Each request contains an index into the array, and the response should be the value at that particular index. The indexes of these requests are not in order (they may be random).
The question is: how can this information be encoded to save space while still giving good performance when serving requests?
My thoughts so far are:
Have an array of indexes, one for each of the "ones". So I would have an array of 10^10 elements, containing indexes in the range 0 - 10^20. Maybe not the best compression method, but it is easy to decode.
The optimal solution in terms of compression: enumerate each of the combinations (select 10^10 numbers from a set of 10^20 available numbers); then the data is just the "id" of this enumeration... but this could be a problem to decode, I think.

Look up "sparse array". If access speed is important, a good solution is a hash table of the indices of the "ones". Each index needs 9 bytes (10^20 needs 67 bits), so allocating about 2x the space gives roughly a 180 GB table. The access time would be O(1).
You could instead keep just a sorted 90 GB array of indices (10^10 entries at 9 bytes each) and do a binary search for an index. The access time would be O(log n), if you're happy with that speed.
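As a rough illustration of that second option, here is a minimal C sketch of the membership test over a sorted array of "one" positions. It uses the GCC/Clang unsigned __int128 extension because 10^20 does not fit in 64 bits; the real table would more likely pack each index into 9 bytes rather than 16:

    #include <stddef.h>
    #include <stdbool.h>

    /* Sketch: membership test against a sorted array of "one" positions.
     * 10^20 does not fit in 64 bits, so this uses the GCC/Clang __int128
     * extension; a packed 9-byte-per-entry layout would be used for the
     * real 90 GB table. */
    typedef unsigned __int128 idx_t;

    bool bit_is_set(const idx_t *sorted_ones, size_t n, idx_t query)
    {
        size_t lo = 0, hi = n;            /* search the half-open range [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (sorted_ones[mid] == query)
                return true;              /* index found => the bit is 1 */
            if (sorted_ones[mid] < query)
                lo = mid + 1;
            else
                hi = mid;
        }
        return false;                     /* not found => the bit is 0 */
    }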
You can pack the indices more tightly, to 67 bits each, which brings the single-table approach down to a bit under 84 GB - about the minimum for this layout.
You can break it up into multiple tables. E.g. if you had eight tables, each representing a possible value of the high three bits of the index, then each entry needs only 8 bytes and the tables would take 80 GB.
You can break it up further. E.g. with 2048 tables, each representing the high 11 bits of the index, entries shrink to 7 bytes and the total would be 70 GB, plus some very small amount for the table of pointers to the sub-tables.
Even further, with 524,288 tables (the high 19 bits), you can do six bytes per entry for 60 GB, plus the table-of-tables overhead. That would still be small in comparison, just megabytes.
Dividing by another factor of 256 should still be a win. With about 134 million sub-tables (the high 27 bits), entries shrink to five bytes each, getting the data down to 50 GB, plus less than a GB for the table of tables. So less than 51 GB in total. Then you could, for example, keep the table of tables in memory, and load a sub-table into memory for each binary search. You could keep a cache of sub-tables in memory, throwing out old ones when you run out of space. Each sub-table would have, on average, only about 75 entries, so the binary search is around seven steps, after one step to find the sub-table. Most of the time will be spent getting the sub-tables into memory, assuming that you don't have 64 GB of RAM. Then again, maybe you do.
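To make the sub-table idea concrete, here is a hypothetical lookup for the 2^27-sub-table variant: the high 27 bits pick a sub-table, and the remaining 40 bits are kept as sorted 5-byte little-endian entries inside it. The struct and field names are invented for the sketch:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical layout for the 2^27-sub-table variant: the high 27 bits of
     * a 67-bit index select a sub-table, and the low 40 bits are kept as
     * sorted 5-byte little-endian entries inside that sub-table. */
    typedef unsigned __int128 idx_t;

    typedef struct {
        const uint8_t *entries;    /* n_entries * 5 bytes, sorted ascending */
        size_t         n_entries;  /* ~75 entries on average */
    } subtable_t;

    static uint64_t read40(const uint8_t *p)     /* decode one 5-byte entry */
    {
        uint64_t v = 0;
        memcpy(&v, p, 5);                        /* little-endian host assumed */
        return v;
    }

    bool bit_is_set_subtable(const subtable_t *tables, idx_t query)
    {
        uint32_t hi = (uint32_t)(query >> 40);   /* top 27 bits: which sub-table */
        uint64_t lo = (uint64_t)(query & (((uint64_t)1 << 40) - 1));
        const subtable_t *t = &tables[hi];

        size_t a = 0, b = t->n_entries;          /* binary search the low bits */
        while (a < b) {
            size_t mid = a + (b - a) / 2;
            uint64_t v = read40(t->entries + mid * 5);
            if (v == lo) return true;
            if (v < lo)  a = mid + 1;
            else         b = mid;
        }
        return false;
    }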

Related

Efficiency of Querying 10 Billion Rows (with High Cardinality) in ScyllaDB

Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
Scylla requires that you specify all columns of the partition key in the WHERE clause.
If you execute a query specifying only some of those columns, you get a warning, as this requires a full table scan:
SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y
In the above query, we only specify 2 out of 3 columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model to think about the cost/performance of this query?
My assumption is that the cost is high: it is equivalent to executing ten billion separate queries on the data set, since 1) there is no logical association between the rows in the way they are stored on disk, as each row has a different partition key (high cardinality), and 2) in order for Scylla to determine which rows match the query, it has to scan all 10 billion rows (even though the result set only matches 1 billion rows).
Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approx. 1,000 seconds to process the query (consuming all of the cluster resources).
Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?
Thanks
As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - containing three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions and check, for each of them, whether its partition key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. There's a lot of work that goes into a random-access read - the request needs to reach a coordinator which then sends it to replicas, each replica needs to find the specific position in its on-disk data files (often multiple files), it often needs to over-read from the disk (as disk and compression alignments require), and so on. Compare this to a scan, which can read long contiguous swathes of data sorted by token (partition-key hash) from disk, and can return many rows fairly quickly with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know which exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node instead of your estimate of 100,000. So your example use case could potentially be done in just 8.3 seconds, instead of the 1,000 seconds you calculated.
Finally, please don't forget (this is also mentioned in the aforementioned blog post) that if you do a large scan you should explicitly parallelize, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more-or-less unavoidable. Second, scanning returns partitions in token order, which (as I explained above) sit contiguously on individual replicas - which is great for peak throughput but also means that only one node (or even one CPU) will be active at any time during the scan. So it's important to split the scan into pieces and do all of them in parallel. We also had a blog post about the importance of parallel scans, and how to do them: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
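To make the splitting concrete, here is a small C sketch that computes N contiguous sub-ranges of the default Murmur3 partitioner's token range [-2^63, 2^63 - 1]; each printed range corresponds to one parallel scan query of the form SELECT ... WHERE token(pk1,pk2,pk3) >= lo AND token(pk1,pk2,pk3) <= hi. The actual driver calls are omitted, and the function name is invented for the example:

    #include <stdint.h>
    #include <stdio.h>

    /* Split the default Murmur3 token range [-2^63, 2^63 - 1] into n_splits
     * contiguous sub-ranges, each of which could be scanned by a separate
     * worker with its own token(...) >= lo AND token(...) <= hi query.
     * Uses __int128 so the range arithmetic cannot overflow. */
    void print_token_splits(unsigned n_splits)
    {
        __int128 lo_bound = INT64_MIN;
        __int128 hi_bound = INT64_MAX;
        __int128 span = hi_bound - lo_bound + 1;   /* 2^64 tokens in total */
        __int128 step = span / n_splits;

        for (unsigned i = 0; i < n_splits; i++) {
            __int128 lo = lo_bound + (__int128)i * step;
            __int128 hi = (i == n_splits - 1) ? hi_bound : lo + step - 1;
            printf("range %u: token >= %lld AND token <= %lld\n",
                   i, (long long)lo, (long long)hi);
        }
    }

Calling print_token_splits(100), for example, would emit one WHERE-clause range per worker, which each client thread can then scan independently.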
Another option is to move one of the partition-key columns to be a clustering key instead; this way, if you have the first two PKs, you'll be able to locate the partition and just search within it.

How to determine what size is required to hold data in a table space?

I'm somewhat confused about tablespaces and what determines the size needed to hold the data.
I have read the documentation and many articles, including answers here on Stack Overflow about tablespaces, but I still don't get it.
Let's say I want to create 3 tables:
customer
product
sales
Does the above schema affect the size you choose for your tablespace, or is it completely irrelevant? If it is irrelevant, then what is relevant in this case?
Can someone please explain this in simple terms for people who are new to the subject?
The size (and number) of data files assigned to a tablespace depends on the amount of data that you're going to be storing in your tables. In most organizations, it also depends on what size chunks your storage admins prefer to use, how long it takes to get additional storage space, and other organization-specific bits of information.
Estimating the size of a table can get a bit complicated depending on how close you want to get and how much knowledge you have about your data. For estimating the size of data files to allocate to a tablespace, though, you can generally get away with a pretty basic estimate and then just monitor actual utilization.
Let's say that your customer table has a customer_id column that is a numeric identifier, a name column that averages, say, 30 characters, and a create_date that tells you when the row was created. Roughly, that means every row requires 7 bytes for the create_date, 30 bytes for the name, and let's say an average of 5 bytes for the customer_id, for a total of 42 bytes. If we expect to have, say, 1,000,000 customers in the first 6 months (we're an optimistic bunch), we'd expect our table to be about 42 MB in size. If we repeat the process for the other tables in the tablespace and add up the result, that gives us a guess as to how big the data files we'd need to allocate to cover the first 6 months of operation would be.
Of course, in reality, there are lots of complications. You can't just add up the size of the columns to get the size of a row. You'd have to figure out how many rows would be in a block which may depend on patterns of how data changes over time. I'm ignoring things like pctfree that reserve space for future updates to rows. Plus your estimates for how many rows you're going to have and how big various strings will be are rarely particularly accurate. So the estimate you're coming up with is extremely rough. In this case, though, even if you're off by a factor of 2, it's not that big of a deal in general. Once you do the initial allocation, you'll want to monitor how much space is actually used. So you can always go in later and add files, increase the size of files, etc. if you're using more space than you guessed.
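For illustration only, the back-of-the-envelope estimate above could be written as a few lines of C. The 8 KB block size, the ~100 bytes of per-block overhead and the 10% PCTFREE figure are assumptions made up for the example, not exact Oracle internals:

    #include <stdio.h>

    /* Rough tablespace sizing sketch in the spirit of the estimate above.
     * Block size, block overhead and PCTFREE are illustrative assumptions. */
    double estimate_table_bytes(double avg_row_bytes, double expected_rows)
    {
        const double block_size   = 8192.0;
        const double block_header = 100.0;           /* assumed overhead */
        const double pctfree      = 0.10;            /* space kept free for updates */
        double usable = (block_size - block_header) * (1.0 - pctfree);
        double rows_per_block = usable / avg_row_bytes;
        double blocks = expected_rows / rows_per_block;
        return blocks * block_size;
    }

    int main(void)
    {
        /* the customer example: ~42 bytes/row, 1,000,000 rows */
        printf("customer: ~%.1f MB\n",
               estimate_table_bytes(42.0, 1e6) / (1024.0 * 1024.0));
        return 0;
    }

The point is not the exact number it prints but that the inputs (average row size, expected rows, free space) are visible and easy to revise once you start monitoring real utilization.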

Logic: Best way to sample & count bytes of a 100MB+ file

Let's say I have a 170 MB file (roughly 180 million bytes). What I need to do is create a table that lists:
all 4096-byte combinations found [column 'bytes'], and
the number of times each byte combination appeared in it [column 'occurrences']
Assume two things:
I can save data very fast, but
I can update my saved data very slow.
How should I sample the file and save the needed information?
Here're some suggestions that are (extremely) slow:
Go through each 4096-byte combination in the file and save each one, but search the table first for existing combinations and update its values. This is unbelievably slow.
Go through each 4096-byte combination in the file, saving up to 1 million rows of data in a temporary table. Go through that table and fix the entries (combine repeating byte combinations), then copy them to the big table. Repeat with the next 1 million rows. This is faster by a bit, but still unbelievably slow.
This is kind of like taking the statistics of the file.
NOTE:
I know that sampling the file can generate tons of data (around 22 GB from experience), and I know that any solution posted will take a while to finish. I need the most efficient saving process.
The first solution you've provided could be sped up significantly if you also hash the data and store the hash of the 4096-byte segment in your database, then compare to that. Comparing to a string that's 4096 bytes long would take forever, but this would be significantly faster:
For each 4096-byte segment in the file
    Hash the segment into something short (even MD5 is fine, and it's quick)
    Look up the hash in your database
    If it exists (segment may have already been found)
        Compare the actual segment to see if there's a match
    If it doesn't exist
        It's a new segment - save it to your database
Hashing the segment isn't free, but it's pretty cheap, and the comparisons between hashes will be orders of magnitude cheaper than comparing the full byte segments to each other repeatedly. Hashes are useful for many applications - this is definitely one of them.
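For illustration, here is a dependency-free C sketch of the windowing and hashing step. It substitutes a 64-bit FNV-1a hash for MD5 purely to keep the example self-contained, and the database lookup/verify step is left as a hypothetical callback:

    #include <stdint.h>
    #include <stddef.h>

    /* FNV-1a stands in for MD5 here just to keep the sketch dependency-free;
     * any short, fast hash works for the "hash first, verify bytes only on a
     * hash match" strategy described above. */
    static uint64_t fnv1a64(const uint8_t *data, size_t len)
    {
        uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= data[i];
            h *= 1099511628211ULL;               /* FNV prime */
        }
        return h;
    }

    /* Walk every 4096-byte window of the buffer and hand its hash to a
     * (hypothetical) store_or_verify() that does the database lookup. */
    void index_windows(const uint8_t *buf, size_t file_len,
                       void (*store_or_verify)(uint64_t hash, const uint8_t *win))
    {
        const size_t win = 4096;
        for (size_t off = 0; off + win <= file_len; off++)
            store_or_verify(fnv1a64(buf + off, win), buf + off);
    }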
It's a bit late and I can't think straight, so my algorithm complexity calculation is kind of off :) But if you manage to fit it into memory, you might have a very very quick implementation with a trie. If you can optimize each trie node to take as little memory as possible, it just might work.
Another thing is basically #rwmnau's suggestion, but don't use a predefined hash function like MD5 - use running totals. Unlike hashes, this is almost free, without any downside for such a large block size (there's plenty of randomness in 4096 bytes). With each new block, you gain one byte and you lose one byte. So calculate the sum of the first 4096 bytes; for each subsequent window, simply subtract the lost byte and add the new one. Depending on the size of the integer you do the sums in, you will have plenty of buckets. Then you will have a much smaller number of blocks to compare byte-for-byte.
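A minimal sketch of that running-total idea, with the bucketing/verification step left as a hypothetical callback:

    #include <stdint.h>
    #include <stddef.h>

    /* Rolling sum over every 4096-byte window: O(1) work per position instead
     * of rehashing the whole window. The sum is only a coarse bucket key;
     * windows that land in the same bucket still need a byte-for-byte check. */
    void rolling_window_sums(const uint8_t *buf, size_t file_len,
                             void (*bucket)(uint32_t sum, size_t offset))
    {
        const size_t win = 4096;
        if (file_len < win)
            return;

        uint32_t sum = 0;
        for (size_t i = 0; i < win; i++)         /* sum of the first window */
            sum += buf[i];
        bucket(sum, 0);

        for (size_t off = 1; off + win <= file_len; off++) {
            sum += buf[off + win - 1];           /* incoming byte */
            sum -= buf[off - 1];                 /* outgoing byte */
            bucket(sum, off);
        }
    }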

Statistical analysis of the performance impact of using bigger datatypes where they are not required at all

If I use a larger datatype where I know a smaller one would be sufficient for the possible values I will insert into a table, will it affect performance in SQL Server in terms of speed or in any other way?
e.g.
IsActive can be 0, 1, 2 or 3 - never more than 3 in any case.
I know I should use tinyint, but due to some reasons (consider it a compulsion), I am making every numeric field bigint and every character field nvarchar(max).
Please give statistics if possible, to help me overcome that compulsion.
I need some solid analysis that can really make someone rethink before choosing a datatype.
EDIT
Say, I am using
SELECT * FROM tblXYZ where IsActive=1
How will this be affected? Assume I have 1 million records.
Will it only waste memory, or will it hurt performance as well?
I know that more pages mean more indexing effort, so performance will also be affected. But I need some statistics if possible.
You are basically wasting 7 bytes per row for each bigint; this will make your tables bigger, so fewer rows will be stored per page and more IO will be needed to bring the same number of rows back than if you had used tinyint. If you have a billion-row table, it adds up.
Defining this in statistical terms is somewhat difficult, but you can literally do the maths and work out the additional IO overhead.
Let's take a table with 1 million rows, assume no page padding or compression, and use some simple figures.
Take a table whose row size is 100 bytes and that contains 10 tinyints. The number of rows per page (assuming no padding / fragmentation) is 80 (8096 / 100).
By using bigints, a total of 70 bytes would be added to the row size (10 fields that are 7 bytes more each), giving a row size of 170 bytes and reducing the rows per page to 47.
For the 1 million rows this results in 12,500 pages for the tinyints and 21,277 pages for the bigints.
Taking a single disk, reading sequentially, we might expect 300 IOs per second sequential reading, and each read is 8k (e.g. a page).
The respective read times for this theoretical disk are then about 41.7 seconds and 70.9 seconds - for a very theoretical scenario of a made-up table / row.
That however only applies to a scan, under an index seek, the increase in IO would be relatively small, depending on how many of the bigint's were in the index or clustered key. In terms of backup and restore as mentioned, the data is expanded out and the time loss can be calculated as linear unless compression is at play.
In terms of memory caching, each byte wasted on a page on disk is a byte wasted in memory, but it only applies to the pages that are in memory - this is where it gets more complex, since the memory wastage depends on how many of the pages are sitting in the buffer pool. For the above example it would be roughly 97.6 MB of data vs 166 MB of data, and assuming the entire table was scanned and thus in the buffer pool, you would be wasting roughly 68 MB of memory.
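For reference, the arithmetic above can be reproduced with a few lines of C, which also makes the assumptions (8,096 usable bytes per page, 300 sequential 8 KB reads per second, full-scan caching) explicit and easy to change:

    #include <stdio.h>

    /* The worked example above as code: 1M rows, 100-byte rows with 10
     * tinyints vs 170-byte rows with 10 bigints, 8,096 usable bytes per page,
     * and a theoretical disk doing 300 sequential 8 KB reads per second. */
    int main(void)
    {
        const double rows        = 1e6;
        const double page_bytes  = 8096.0;    /* usable data bytes per 8 KB page */
        const double ios_per_sec = 300.0;

        double sizes[2]     = { 100.0, 170.0 };   /* tinyint row vs bigint row */
        const char *name[2] = { "tinyint", "bigint" };

        for (int i = 0; i < 2; i++) {
            double rows_per_page = (double)(int)(page_bytes / sizes[i]);
            double pages = rows / rows_per_page;
            printf("%-7s rows/page=%.0f pages=%.0f scan=%.1f s buffer=%.1f MB\n",
                   name[i], rows_per_page, pages, pages / ios_per_sec,
                   pages * 8192.0 / (1024.0 * 1024.0));
        }
        return 0;
    }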
A lot of it comes down to space. Your bigints are going to take 8 times the space (8 bytes vs 1 byte for tinyint). Your nvarchar is going to take twice as many bytes as a varchar. Making it max won't affect much of anything.
This will really come into play if you're doing lookups on values. The indexes you will (hopefully) be applying will be much larger.
I'd at least pare it down to int. Bigint is way overkill. But something about this field is calling out to me that something else is wrong with the table as well. Maybe it's just the column name — IsActive sounds like it should be a boolean/bit column.
More than that, though, I'm concerned about your varchar(max) fields. Those will add up even faster.
All the 'wasted' space also comes into play for DR (if you are 4-6 times the size due to poor data type choices, your recovery will take correspondingly longer).
Not only do the extra pages/extents require more IO to serve, you also waste memory cache in proportion to the size. With billions of rows, depending on your server, you could be dealing with constant memory pressure and cache evictions simply because you chose a datatype that was 8 times the size it needed to be.

Data Structure to store billions of integers

What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority). Is it a Patricia tree? Is there anything better than that?
The search key is an integer (say, a 32-bit random integer), and all records are in RAM (assuming enough RAM is available).
In C, platform Linux.
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search for and delete records efficiently. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes or are they uniformly distributed? Only if they have common prefixes, a patricia tree is useful.
Do you look up indexes in order (gang lookup), or randomly? If everything is uniform, random and no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
my educated guess is a B-Tree (but I could be wrong ...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix tree to get started.
For any specific problem, you can do much better than a btree, a hash table, or a patricia trie. Describe the problem a bit better, and we can suggest what might work
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate it for the expected final size so it doesn't have to rehash.
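As an illustration (not a drop-in implementation), a pre-sized open-addressing table keyed by the 32-bit random key might look like the sketch below. The names are made up for the example, the load factor is kept near 0.5 by sizing to roughly 2x the expected entries, and deletion would additionally need tombstones:

    #include <stdint.h>
    #include <stdlib.h>

    /* Minimal open-addressing (linear probing) table keyed by a 32-bit key,
     * pre-sized so it never needs to rehash. */
    typedef struct {
        uint32_t key;
        void    *record;   /* pointer to the user record */
        int      used;
    } slot_t;

    typedef struct {
        slot_t *slots;
        size_t  mask;      /* capacity - 1, capacity is a power of two */
    } table_t;

    static size_t next_pow2(size_t n) { size_t p = 1; while (p < n) p <<= 1; return p; }

    table_t table_create(size_t expected)
    {
        table_t t;
        size_t cap = next_pow2(expected * 2);    /* keep load factor near 0.5 */
        t.slots = calloc(cap, sizeof(slot_t));
        t.mask  = cap - 1;
        return t;
    }

    static size_t hash32(uint32_t k)             /* cheap integer mix */
    {
        k ^= k >> 16; k *= 0x45d9f3b; k ^= k >> 16;
        return k;
    }

    void table_put(table_t *t, uint32_t key, void *record)
    {
        size_t i = hash32(key) & t->mask;
        while (t->slots[i].used && t->slots[i].key != key)
            i = (i + 1) & t->mask;               /* linear probe */
        t->slots[i].key = key; t->slots[i].record = record; t->slots[i].used = 1;
    }

    void *table_get(const table_t *t, uint32_t key)
    {
        size_t i = hash32(key) & t->mask;
        while (t->slots[i].used) {
            if (t->slots[i].key == key) return t->slots[i].record;
            i = (i + 1) & t->mask;
        }
        return NULL;
    }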
We can use a bitwise trie where each node branches on a single bit (0/1) of the integer value. With this we can ensure that the depth of the tree is 32/64, so fetch time is constant, with sub-linear space complexity.
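A minimal sketch of such a bitwise trie for 32-bit keys, assuming the naive one-node-per-bit layout (a real implementation would compress single-child paths to save memory):

    #include <stdint.h>
    #include <stdlib.h>

    /* Bitwise trie over 32-bit keys: each level consumes one bit, so every
     * lookup touches exactly 32 nodes. */
    typedef struct bit_node {
        struct bit_node *child[2];
        void            *record;   /* non-NULL only at depth 32 (a stored key) */
    } bit_node;

    void trie_put(bit_node *root, uint32_t key, void *record)
    {
        bit_node *n = root;
        for (int b = 31; b >= 0; b--) {
            int bit = (key >> b) & 1;
            if (!n->child[bit])
                n->child[bit] = calloc(1, sizeof(bit_node));
            n = n->child[bit];
        }
        n->record = record;
    }

    void *trie_get(const bit_node *root, uint32_t key)
    {
        const bit_node *n = root;
        for (int b = 31; b >= 0; b--) {
            n = n->child[(key >> b) & 1];
            if (!n) return NULL;
        }
        return n->record;
    }

Usage would start from a zero-initialized root node, e.g. bit_node root = {0}; trie_put(&root, key, rec);.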
