Logic: Best way to sample & count bytes of a 100MB+ file - database

Let's say I have a 170 MB file (roughly 180 million bytes). What I need to do is create a table that lists:
all 4096 byte combinations found [column 'bytes'], and
the number of times each byte combination appeared in it [column 'occurrences']
Assume two things:
I can save data very fast, but
I can update my saved data very slowly.
How should I sample the file and save the needed information?
Here are some approaches that are (extremely) slow:
Go through each 4096-byte combination in the file and save each one, but search the table first for existing combinations and update its values. This is unbelievably slow.
Go through each 4096-byte combination in the file and save rows into a temporary table until it holds 1 million rows. Go through that table and consolidate the entries (combine repeating byte combinations), then copy them to the big table. Repeat with the next 1 million rows of data and repeat the process. This is a bit faster, but still unbelievably slow.
This is kind of like taking the statistics of the file.
NOTE:
I know that sampling the file can generate tons of data (around 22 GB from experience), and I know that any solution posted would take a while to finish. I need the most efficient saving process.

The first solution you've provided could be sped up significantly if you also hash the data and store the hash of the 4096-byte segment in your database, then compare to that. Comparing to a string that's 4096 bytes long would take forever, but this would be significantly faster:
For each 4096-byte segment in the file
    Hash the segment into something short (even MD5 is fine, and it's quick)
    Look up the hash in your database
    If it exists (segment may have already been found)
        Compare the actual segment to see if there's a match
    If it doesn't exist
        It's a new segment - save it to your database
Hashing the segment isn't free, but it's pretty cheap, and the comparisons between hashes will be orders of magnitude cheaper than comparing the full byte segments to each other repeatedly. Hashes are useful for many applications - this is definitely one of them.
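A minimal sketch of that flow in Python, assuming the file is read with a sliding window that advances one byte at a time (as the rolling-total answer below implies) and that an in-memory dict stands in for the database table; genuine MD5 collisions are still settled by comparing the actual bytes:

```python
import hashlib

BLOCK = 4096

def count_segments(path):
    # digest -> list of (segment, count); the dict stands in for the DB table.
    table = {}
    data = open(path, "rb").read()                 # a 170 MB file fits in memory
    for start in range(len(data) - BLOCK + 1):     # sliding window, one byte per step
        segment = data[start:start + BLOCK]
        digest = hashlib.md5(segment).digest()
        bucket = table.setdefault(digest, [])
        for i, (seg, count) in enumerate(bucket):
            if seg == segment:                     # hash matched; confirm on the bytes
                bucket[i] = (seg, count + 1)
                break
        else:                                      # hash not seen before: new segment
            bucket.append((segment, 1))
    return table
```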

It's a bit late and I can't think straight, so my algorithm complexity calculation is kind of off :) But if you manage to fit it into memory, you might have a very very quick implementation with a trie. If you can optimize each trie node to take as little memory as possible, it just might work.
Another option is basically #rwmnau's suggestion, but don't use a predefined hash function like MD5 - use a running total. Unlike hashing, this is almost free, with no downside for such a large block size (there's plenty of randomness in 4096 bytes). With each new block, you gain one byte and lose one byte. So calculate the sum of the first 4096 bytes; for each subsequent block, simply subtract the lost byte and add the new one. Depending on the size of the integer you do the sums in, you will have plenty of buckets. Then you will have a much smaller number of blocks to compare byte-for-byte.
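A sketch of the running total in Python; the bucketing and the final byte-for-byte comparison within each bucket are left out, and `data` is assumed to be the whole file read as bytes:

```python
BLOCK = 4096

def rolling_sums(data):
    # Yield (offset, total) for every 4096-byte window of `data` (a bytes object).
    # The total is updated in O(1): subtract the byte that leaves, add the one
    # that enters. Windows with different totals can never be identical, so only
    # windows that share a total need a byte-for-byte comparison later.
    total = sum(data[:BLOCK])
    yield 0, total
    for start in range(1, len(data) - BLOCK + 1):
        total += data[start + BLOCK - 1] - data[start - 1]
        yield start, total
```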

Related

Long Binary Array Compression

There is an Array of Binary Numbers and the number of elements in this array is around 10^20.
The number of "ones" in the array is around 10^10, and these numbers are randomly distributed.
Once the data is generated and saved, it won't be edited: it will remain in read-only mode in its whole life cycle.
Having this data saved, requests will be received. Each request contains the index of the array, and the response should be the value in that particular index. The indexes of these requests are not in order (they may be random).
Question is: how to encode this info saving space and at the same time have a good performance when serving requests?
My thoughts so far are:
To have an array of indexes for each of the "ones". So, I would have an array of 10^10 elements, containing indexes in the range 0 - 10^20. Maybe not the best compression method, but it is easy to decode.
The optimal solution for compression: enumerate each of the combinations (select 10^10 numbers from a set of 10^20 available positions); the data is then just the "id" of this enumeration... but this could be a problem to decode, I think.
Look up "sparse array". If access speed is important, a good solution is a hash table of indices. You should allocate about 2x the space, requiring a 180 GB table. The access time would be O(1).
You could have just a 90 GB table and do a binary search for an index. The access time would be O(log n), if you're happy with that speed.
You can pack the indices more tightly, to less than 84 GB to minimize the size of the single-table approach.
You can break it up into multiple tables. E.g. if you had eight tables, each representing the possible high three bits of the index, then the tables would take 80 GB.
You can break it up further. E.g. if you have 2048 tables, each representing the high 11 bits of the index, the total would be 70 GB, plus some very small amount for the table of pointers to the sub-tables.
Even further, with 524288 tables, you can do six bytes per entry for 60 GB, plus the table of tables overhead. That would still be small in comparison, just megabytes.
The next multiple of 256 should still be a win. With 134 million subtables, you could get it down to 50 GB, plus less than a GB for the table of tables. So less than 51 GB. Then you could, for example, keep the table of tables in memory, and load a sub-table into memory for each binary search. You could have a cache of sub-tables in memory, throwing out old ones when you run out of space. Each sub-table would have, on average, only 75 entries. Then the binary search is around seven steps, after one step to find the sub-table. Most of the time will be spent getting the sub-tables into memory, assuming that you don't have 64 GB of RAM. Then again, maybe you do.
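As an illustration of the single sorted table searched with binary search (the 90 GB variant above), here is a scaled-down Python sketch; the class name and the toy data are mine:

```python
import bisect

class SparseBitArray:
    # Membership test for a huge, mostly-zero bit array: store only the sorted
    # positions of the ones and answer each request with a binary search, O(log n).
    def __init__(self, sorted_one_indices):
        self.ones = sorted_one_indices

    def __getitem__(self, index):
        pos = bisect.bisect_left(self.ones, index)
        return 1 if pos < len(self.ones) and self.ones[pos] == index else 0

# Toy example: ones at three positions out of a conceptual 10**20.
bits = SparseBitArray([3, 1_000_000, 10**19])
assert bits[3] == 1 and bits[4] == 0
```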

Reading chunks from an SQLite blob when a chunk might span two blobs

I have a situation where I need to read arbitrarily-sized (but generally small) chunks of binary data from an SQLite database. The database lives on disk, and the data is stored in rows consisting of an id and a read-only blob of between 256 and 64k bytes (the length will always be a power of 2). I use the SQLite incremental I/O to read the chunks into a rewritable buffer, then take the average of the values in the chunk, and cache the result.
The problem I have is that since the chunks are of arbitrary size the blob size will only very occasionally be an integer multiple of the chunk size. This means that a chunk will span two blobs quite frequently.
What I am looking for is a simple and elegant (since 'elegance is not optional') way to handle this slightly awkward scenario. I have a read-chunk function which is fairly dumb, simply reading the chunks and computing averages. So far I have tried the following strategies:
Read only the first part of an overlapping chunk, discarding the second.
Make read-chunk aware of blob boundaries, so that it can move to the next blob where appropriate.
Use something like a ring buffer, so that overlapping chunks can just wrap around the edges.
The first option is the simplest but is unsatisfactory because it discards potentially important information. Since read-chunk is called frequently I don't want to overburden it with too much branching logic, so the second option also isn't appealing. Using a ring buffer (or something like it) seems like an elegant solution. What I envisage is a producer which reads intermediately-sized (say, 256 byte) chunks from the blob into a 1k buffer, then a consumer which calls read-chunk on the buffer, wrapping around where appropriate. Since I will always be dealing with powers of 2 the producer will always align to the edges of the buffer, and I can also avoid using mod to compute the indices for both producer and consumer.
I am using Lisp (CL), but since this seems to be a general algorithmic or data-structure question I have left it language-agnostic. What I am interested in is clarifying what options I have - is there another option besides the ones I've listed?
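Since the question is language-agnostic, here is a rough Python sketch of the ring-buffer option, assuming a 1k buffer filled in 256-byte pieces; because the capacity is a power of two, the wrap-around is a bitwise AND rather than a mod (overflow and underflow checks are omitted):

```python
class RingBuffer:
    # Fixed-size byte ring buffer; capacity must be a power of two so that index
    # arithmetic is a bitwise AND instead of a modulo.
    def __init__(self, capacity=1024):
        assert capacity & (capacity - 1) == 0
        self.buf = bytearray(capacity)
        self.mask = capacity - 1
        self.write_pos = 0      # total bytes produced so far
        self.read_pos = 0       # total bytes consumed so far

    def produce(self, piece):
        # The producer copies a 256-byte piece read from a blob, wrapping as needed.
        for b in piece:
            self.buf[self.write_pos & self.mask] = b
            self.write_pos += 1

    def consume(self, n):
        # The consumer (read-chunk) takes the next n bytes; a chunk that spans
        # two blobs is reassembled transparently across the wrap-around.
        out = bytes(self.buf[(self.read_pos + i) & self.mask] for i in range(n))
        self.read_pos += n
        return out
```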

Statistical analysis of the performance impact of using larger datatypes than required

If I use a larger datatype than is needed for the possible values I will insert into a table, will it affect performance in SQL Server in terms of speed or in any other way?
e.g. IsActive can only be 0, 1, 2, or 3 - never more than 3 in any case.
I know I should use tinyint, but for reasons beyond my control (consider it a compulsion) I am making every numeric field bigint and every character field nvarchar(max).
Please give statistics if possible, to help me try to overcome that compulsion.
I need some solid analysis that can really make someone rethink before choosing a datatype.
EDIT
Say, I am using
SELECT * FROM tblXYZ where IsActive=1
How will this query be affected? Assume I have 1 million records.
Will it only waste memory, or will performance suffer as well?
I know that more pages mean more indexing effort, so performance will also be affected, but I need some statistics if possible.
You are basically wasting 7 bytes per row for each bigint; this will make your tables bigger, so fewer rows will be stored per page and more IO will be needed to bring back the same number of rows than if you had used tinyint. If you have a billion-row table it will add up.
Defining this in statistical terms is somewhat difficult, you can literally do the maths and work out the additional IO overhead.
Let's take a table with 1 million rows, and assume no page padding, compression and use some simple figures.
Given a table with a row size of 100 bytes that contains 10 tinyints, the number of rows per page (assuming no padding / fragmentation) is 80 (8096 / 100).
By using bigints, a total of 70 bytes would be added to the row size (10 fields that are each 7 bytes bigger), giving a row size of 170 bytes and reducing the rows per page to 47.
For the 1 million rows this results in 12,500 pages for the tinyints and 21,277 pages for the bigints.
Taking a single disk reading sequentially, we might expect 300 IOs per second of sequential reading, and each read is 8k (i.e. a page).
The respective read times on this theoretical disk are then about 41.7 seconds and 70.9 seconds - for a very theoretical scenario with a made-up table / row.
That however only applies to a scan; under an index seek, the increase in IO would be relatively small, depending on how many of the bigints were in the index or clustered key. In terms of backup and restore, as mentioned, the data is expanded out and the time loss can be calculated as linear unless compression is at play.
In terms of memory caching, each byte wasted on a page on disk is a byte wasted in memory, but it only applies to the pages in memory - this is where it gets more complex, since the memory wastage depends on how many of the pages are sitting in the buffer pool. For the above example it would be broadly 97.6 MB of data vs 166 MB of data, and assuming the entire table was scanned and thus in the buffer pool, you would be wasting roughly 68 MB of memory.
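The arithmetic above is easy to reproduce; a small Python sketch using the answer's assumed figures (8096 usable bytes per page, 300 sequential IOs per second):

```python
PAGE_DATA_BYTES = 8096        # usable bytes per data page (the answer's figure)
ROWS = 1_000_000
IOS_PER_SECOND = 300          # theoretical sequential-read rate from the answer

def scan_cost(row_bytes):
    rows_per_page = PAGE_DATA_BYTES // row_bytes
    pages = -(-ROWS // rows_per_page)            # ceiling division
    return rows_per_page, pages, round(pages / IOS_PER_SECOND, 1)

print(scan_cost(100))   # tinyint rows: (80, 12500, 41.7)
print(scan_cost(170))   # bigint rows:  (47, 21277, 70.9)
```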
A lot of it comes down to space. Your bigints are going to take 8 times the space (8 bytes vs 1 byte for tinyint). Your nvarchar is going to take twice as many bytes as a varchar. Making it max won't affect much of anything.
This will really come into play if you're doing lookups on values. The indexes you will (hopefully) be applying will be much larger.
I'd at least pare it down to int. Bigint is way overkill. But something about this field is calling out to me that something else is wrong with the table as well. Maybe it's just the column name — IsActive sounds like it should be a boolean/bit column.
More than that, though, I'm concerned about your varchar(max) fields. Those will add up even faster.
All the 'wasted' space also comes into play for DR (if you are 4-6 times the size due to poor datatype choices, your recovery can take that much longer).
Not only do the larger pages/extents require more IO to serve, you also reduce how much fits in your memory cache. With billions of rows, depending on your server, you could be dealing with constant memory pressure and cache eviction simply because you chose a datatype that was 8 times the size you needed it to be.

Data Structure to store billions of integers

What is the best data structure to store millions/billions of records (assume a record contains a name and an integer) in memory (RAM)?
Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority)? Is it a Patricia tree? Is there anything better than that?
The search key is an integer (say a 32-bit random integer), and all records are in RAM (assuming that enough RAM is available).
In C, on Linux.
Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search for/delete the record efficiently. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes, or is it uniformly distributed? A Patricia tree is only useful if there are common prefixes.
Do you look up indexes in order (gang lookup) or randomly? If everything is uniform and random with no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
my educated guess is a B-Tree (but I could be wrong ...):
B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
Instead of a hash you can at least use a radix to get started.
For any specific problem, you can do much better than a B-tree, a hash table, or a Patricia trie. Describe the problem a bit better, and we can suggest what might work.
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate it for the expected final size so it doesn't need to rehash.
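A toy Python sketch of that idea, using a fixed-capacity open-addressing table pre-allocated to roughly 2x the expected entry count so it never rehashes; the class and names are illustrative, and deletion (which needs tombstones) is left out:

```python
class FixedHashTable:
    # Open-addressing table for 32-bit integer keys, sized up front to ~2x the
    # expected number of entries so inserts never trigger a rehash.
    def __init__(self, expected_entries):
        size = 1
        while size < 2 * expected_entries:   # round capacity up to a power of two
            size *= 2
        self.mask = size - 1
        self.keys = [None] * size
        self.values = [None] * size

    def _slot(self, key):
        i = key & self.mask
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) & self.mask          # linear probing
        return i

    def put(self, key, record):
        i = self._slot(key)
        self.keys[i], self.values[i] = key, record

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None
```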
We can use a trie where each node branches on a 1/0 bit to store the integer values. With this we can ensure that the depth of the tree is 32/64, so fetch time is constant, with sub-linear space complexity.
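A sketch of that bitwise trie in Python (in C each node would be a struct with two child pointers); the depth is fixed at 32 for a 32-bit key, so lookups take a constant number of steps:

```python
class BitTrie:
    # Binary trie keyed on the bits of a 32-bit integer, most significant bit first.
    def __init__(self):
        self.root = [None, None, None]       # [child-for-0, child-for-1, value]

    def insert(self, key, record):
        node = self.root
        for shift in range(31, -1, -1):
            bit = (key >> shift) & 1
            if node[bit] is None:
                node[bit] = [None, None, None]
            node = node[bit]
        node[2] = record

    def search(self, key):
        node = self.root
        for shift in range(31, -1, -1):
            node = node[(key >> shift) & 1]
            if node is None:
                return None
        return node[2]
```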

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway merge sort is the keyword for sorting huge amounts of data that do not fit in memory.
As far as I know most indexes use some form of B-tree, which does not need to keep everything in memory. You can simply put the nodes of the tree in a file and then jump to various positions in the file. This can also be used for sorting.
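A minimal Python sketch of the multiway merge idea, assuming the records are plain integers: sort fixed-size runs in memory, spill each run to a temporary file, then stream-merge the runs with a heap so that only one value per run is held in memory at a time:

```python
import heapq, itertools, tempfile

def external_sort(numbers, run_size=100_000):
    # Phase 1: cut the input into runs that fit in memory, sort each, spill to disk.
    run_files = []
    numbers = iter(numbers)
    while True:
        run = sorted(itertools.islice(numbers, run_size))
        if not run:
            break
        run_file = tempfile.TemporaryFile(mode="w+t")
        run_file.writelines(f"{n}\n" for n in run)
        run_file.seek(0)
        run_files.append(run_file)
    # Phase 2: heapq.merge keeps only the head of each run in memory.
    return heapq.merge(*(map(int, rf) for rf in run_files))

# Usage: for n in external_sort(huge_iterable_of_ints): ...
```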
Are you building a database engine?
Edit: I built a disc-based database system back in the mid '90s.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed-record-size data, the index can just keep track of the record number. For variable-length data records, the index just needs to store the offset within the file where the record starts, and each record needs to begin with a structure that specifies its length.
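For the fixed-size case, the offset arithmetic is the whole trick; a small Python sketch, where the 20-byte record layout (a 16-byte name plus a 32-bit integer) is just an illustrative assumption:

```python
import struct

RECORD = struct.Struct("<16si")   # 16-byte name + 32-bit int = 20 bytes per record

def read_record(f, record_number):
    # The file offset of any record is simply record_number * record_size,
    # so a record can be fetched with one seek and one read.
    f.seek(record_number * RECORD.size)
    name, value = RECORD.unpack(f.read(RECORD.size))
    return name.rstrip(b"\0").decode(), value
```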
You would have to partition your data set in some way and spread each partition out in a separate server's RAM. If I had a billion 32-bit ints, that's 4 GB of RAM right there - and that's only the index.
For low-cardinality data, such as Gender (only two possible values - Male, Female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
Hmm... Interesting question.
I think that most widely used database management systems rely on operating system mechanisms for memory management, and when physical memory runs out, in-memory tables go to swap.
