Databases, RAM and performance - database

I have a 5 GB dictionary in which each key is a word and each value is a 300-dimensional vector of numbers, but I have only 1 GB of RAM (minus 200 MB for the server) and a 50 GB SSD. My goal is relatively fast retrieval (1-3 seconds) of the vector for every word in an input sentence.
What kind of storage system would be best for this kind of task? Is a NoSQL database like MongoDB a good option?
If so, is there a way to calculate the minimum cache memory that MongoDB will need, and is this solution feasible with the given hardware?
Thank you.

Assuming single-precision floats of 32 bits each and 32-bit word keys, each entry takes 300 * 4 + 4 = 1,204 bytes, so 5 GB comes to roughly 4.1 million vectors.
You could keep a <word, word> dictionary (32-bit key to 32-bit value) with these 4.1 million entries in RAM; at 8 bytes per entry that is about 33 MB. The value part of each entry points to a combination of file and offset within the file stored on the SSD. If your assumptions are different, the calculation should stay similar.
It is probably not practical to store all the vectors in a single file. It might be sufficient to store the vectors in a database, provided the tablespace resides on the SSD.
Example: You could have 32 files with 130,000 vectors each. Then the highest 5 bits of the 32-bit value indicate the file, and the lowest 27 bits are the offset, or vector number, within the file.
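The file/offset split in the example above can be sketched in C as follows. This is only an illustration under the same assumptions (32 files, 300 single-precision floats per vector, vectors stored back to back); the names encode_location, decode_location and VectorLocation are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Layout assumed from the example: the 5 high bits select one of 32 files,
   the 27 low bits give the vector number inside that file. */
enum { OFFSET_BITS = 27, FILE_COUNT = 32, VECTOR_DIMS = 300 };

typedef struct {
    uint32_t file_index;     /* which of the 32 files holds the vector */
    uint32_t vector_number;  /* position of the vector inside that file */
    uint64_t byte_offset;    /* where to seek before reading 300 floats */
} VectorLocation;

/* Pack a file index and a vector number into one 32-bit dictionary value. */
uint32_t encode_location(uint32_t file_index, uint32_t vector_number)
{
    assert(file_index < FILE_COUNT);
    assert(vector_number < (UINT32_C(1) << OFFSET_BITS));
    return (file_index << OFFSET_BITS) | vector_number;
}

/* Split a 32-bit dictionary value back into file, vector number and offset. */
VectorLocation decode_location(uint32_t packed)
{
    VectorLocation loc;
    loc.file_index = packed >> OFFSET_BITS;
    loc.vector_number = packed & ((UINT32_C(1) << OFFSET_BITS) - 1);
    loc.byte_offset = (uint64_t)loc.vector_number * VECTOR_DIMS * sizeof(float);
    return loc;
}
```

A lookup would then open the file numbered file_index, seek to byte_offset and read 1,200 bytes, which is a single random read on the SSD per word.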

Related

Fastest disk based solution to cache trillions of unique md5 hashes

Is there a very low latency disk based caching solution that I can use to store only unique values (NOT key+value)?
My script needs to keep track of which files it has processed so it doesn't redo any work. I check the cache for the md5 hash of the file; if it doesn't exist, I process the file and add the hash to the cache.
Is there a faster disk based caching solution than using a key-value based solution?
Try LevelDB.
It's a key-value store, but a compact one: keys are kept in sorted, prefix-compressed blocks that can additionally be compressed with Snappy.
Less space usage => less I/O => better performance.
Not sure about "trillions" (a trillion MD5 hashes would be 16 TB), but Bitcoin Core as well as Ethereum implementations use LevelDB.
In your case, there is no need for an "Ordered Key-Value Store". That is, you can rely on plain key-value stores (direct dbm successors).
Good candidates are:
Tokyo Cabinet, which has a hash-based format that might be faster in your case.
gdbm
In the case where the dataset fits into memory, you might want to try LMDB.
I do not recommend LevelDB because it is slow.
Do the math. 1 trillion MD5s, without any tricks, would take 16TB of disk space. This is, I assume, far more than your RAM size.
Since each MD5 lookup is essentially a 'random' probe into the disk, there will necessarily be about 1 disk hit per check.
If, say, an SSD read is 1ms, that is 1e9 seconds to insert (or check) a trillion hashes. That's 30 years.
There are a lot of flaws in my math, but I think this says that it is not practical today to store and check a trillion of anything random.
If you want to crank it down to a billion MD5s, now we are getting in the range of RAM sizes. But you probably want to have the data persisted? So you really need some database-like tool that will do the persisting for you, while making the checks purely in RAM (CPU-speed).
In any case, I would consider writing code that breaks the MD5 into 2 or 3 chunks, then uses the chunks like a directory structure. At the bottom level, you have a variable-length bunch of values for the last chunk. Each is perhaps 8 bytes long. That would need a linear or binary search into a bunch of numbers that are half the size of an MD5. The savings here help compensate for the various overheads in the rest of the structure, plus the need for writing blocks to disk. Hence, I would still expect to need about 16GB of RAM to house a billion MD5s.
Given that approach, virtually any database engine is already geared up to do most of the work reasonably efficiently. The lowest level would be some type of BLOB containing multiple 8-byte chunks.
Another trick to use... Let's look at just the first 5 bytes of an MD5. There are about a trillion (2^40) different values in 5 bytes. If you have only a billion entries in your dataset, then for an md5 that is not in the dataset, checking the 5 bytes has a 99.9% chance of correctly saying "the md5 is not in the dataset" versus about a 0.1% chance of saying "the md5 might be in the dataset". In the former case, you get a quick answer with only 5GB for a billion items. In the latter case, you may have to go to disk and be slower. Still the average time is better. This helps with the speed of checking. (But does not address the speed of loading.)
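A rough C sketch of that 5-byte prefix check, assuming the prefixes are held in RAM as a sorted array. The function names are invented for illustration, and for simplicity each prefix sits in an 8-byte slot, so this version would need about 8GB for a billion items; packing to exactly 5 bytes per entry is what gets it down to 5GB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First 5 bytes of a 16-byte MD5 digest, packed big-endian into 40 bits. */
uint64_t md5_prefix40(const uint8_t digest[16])
{
    uint64_t prefix = 0;
    for (int i = 0; i < 5; i++) {
        prefix = (prefix << 8) | digest[i];
    }
    return prefix;
}

/* Binary search over prefixes sorted in ascending order.
   false: the digest is definitely not in the dataset.
   true:  the digest might be present, so the full check on disk is needed. */
bool might_contain(const uint64_t *sorted_prefixes, size_t count,
                   const uint8_t digest[16])
{
    assert(sorted_prefixes != NULL || count == 0);

    uint64_t target = md5_prefix40(digest);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted_prefixes[mid] < target) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo < count && sorted_prefixes[lo] == target;
}
```

Only when might_contain returns true does the slower lookup of the full 16-byte hash have to happen.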

Understanding O(1) vs O(n) Time Complexity Intuitively

I understand that O(1) is constant-time, which means that the operation does not depend on the input size, and O(n) is linear time, which means that the operation changes linearly with input size.
If I had an algorithm that could simply go directly to an array index rather than going through each index one-by-one to find the required one, that would be considered constant-time rather than linear-time, right? This is what the textbooks say. But, intuitively, I don't understand how a computer could work this way: Wouldn't the computer still need to go through each index one-by-one, from 0 to (potentially) n, in order to find the specific index given to it? But then, is this not the same as what a linear-time algorithm does?
Edit
My response to ElKamina's answer elaborates on how my confusion extends to hardware:
But wouldn't the computer have to check where it is on its journey to
the index? For instance, if it's trying to find index 3, "I'm at index
0 so I need address 0 + 3", "Ok, now I'm at address 1, so I have to
move forward 2", "Ok, now I'm at address 2, so I have to move forward
1", "Ok, now I'm at index 3". Isn't that the same thing as what
linear-time algorithms do? How can the computer not do it
sequentially?
Theory
Imagine you have an array which stores events in the order they happened. If each event takes the same amount of space in a computer's memory, you know where that array begins, and you know what number event you're interested in, then you can precalculate the location of each event.
Imagine you want to store records and key them by telephone numbers. Since there are many numbers, you can calculate a hash of each one. The simplest hash you might apply is to treat the telephone number like a regular number and take it modulo the length of the array you'd like to store the records in. Again, you can assume each record takes the same amount of space, you know the number of records, you know where the array begins, and you know the offset of the record of interest. From these, you can calculate the location of each record.
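As a minimal C sketch of the telephone-number case: the Record type, the table length of 1000 and the function names are assumptions made up for this example, and collisions (two numbers landing in the same slot) are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TABLE_SLOTS = 1000 };   /* hypothetical table length */

typedef struct {
    uint64_t phone_number;
    char     name[24];
} Record;                      /* every record has the same size */

/* Simplest hash from the text: the number modulo the table length. */
size_t slot_for(uint64_t phone_number)
{
    return (size_t)(phone_number % TABLE_SLOTS);
}

/* Location = start of the array + slot * size of one record.
   Nothing is scanned; the address is computed directly. */
Record *locate(Record *table, uint64_t phone_number)
{
    assert(table != NULL);
    return &table[slot_for(phone_number)];
}
```

The work done by locate is the same whether the table has a thousand slots or a billion, which is the sense in which knowing where the item is costs constant time.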
If array items have different sizes, then instead fill the array with pointers to the actual items. Your lookup then has two stages: find the appropriate array element and then follow it to the item in question.
Much like we can use shmancy GPS systems to tell us where an address is, but we still need to do the work of driving there, the problem with accessing memory is not knowing where an item is, it's getting there.
Answer to your question
With this in mind, the answer to your question is that look-up is almost never free, but it also is rarely O(N).
Tape memory: O(N)
Tape memory requires O(N) seeks, for obvious reasons: you have to spool and unspool the tape to position it to the needed location. It's slow. It's also cheap and reliable, so it's still in use today in long-term back-up systems. Special algorithms which account for the physical nature of the tape can speed up operations on it slightly.
Notice that, per the foregoing, the problem with tape is not that we don't know where the thing is we're trying to find. The problem is getting the physical medium to get there. The nature of a good tape algorithm is to try to minimize the total amount of tape spooled and unspooled over a grouping of operations.
Speaking of which, what if, instead of having one long tape, we had two shorter tapes: this would reduce the point-to-point travel time. What if we had four tapes?
Disk memory: O(N), but smaller
Hard drives make a huge reduction in seek time by turning the tape into a series of rings. Now, even though there are N memory spaces on a disk, any one can be accessed in short order by moving the drive head and the disk to the appropriate point. (Figuring out how to express this in big-oh notation is a challenge.)
Again, if you use faster disks or smaller disks, you can optimize performance.
RAM: O(1), but with caveats
Pretty much everyone who answers this question is going to fixate on RAM, since that's what programmers work with most frequently. Look to their answers for fuller explanations.
But, just briefly, RAM is a natural extension of the ideas developed above. The RAM holds N items and we know where the item we want is. However, this time there's nothing that needs to mechanically move in order for us to get to that item. In addition, we saw that by having more short tapes or smaller, faster drives, we could get to the memory we wanted faster. RAM takes this idea to its extreme.
For practical purposes, you can think of RAM as being a collection of little memory stores, all strung together. Your computer doesn't know exactly where in RAM a particular item is, just the collection it belongs to. So it grabs the whole collection, consisting of thousands or millions of bytes. It stashes this in something like an L3 cache.
But where is a particular item in that cache? Again, you can think of the computer as not really knowing; it just grabs a subset which is guaranteed to include the item and passes it to the L2 cache.
And again, for the L1 cache.
And, at this point, we've gone from gigabytes (or terabytes) of RAM to something like 3-30 kilobytes. And, at this level, your computer (finally) knows exactly where the item is and grabs it for processing.
This kind of hierarchical behavior means that accessing adjacent items in RAM is much faster than randomly accessing different points all across RAM. That was also true of tape drives and hard disks.
However, unlike tape drives and hard disks, the worst-case time where all the caches are missed is not dependent on the amount of memory (or, at least, is very weakly dependent: path lengths, speed of light, &c)! For this reason, you can treat it as an O(1) operation in the size of the memory.
Comparing speeds
Knowing this, we can talk about access speed by looking at Latency Numbers Every Programmer Should Know:
Latency Comparison Numbers
--------------------------
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns                      14x L1 cache
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy              3,000 ns        3 us
Send 1K bytes over 1 Gbps network        10,000 ns       10 us
Read 4K randomly from SSD*              150,000 ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory      250,000 ns      250 us
Round trip within same datacenter       500,000 ns      500 us
Read 1 MB sequentially from SSD*      1,000,000 ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                            10,000,000 ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk     20,000,000 ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA     150,000,000 ns  150,000 us  150 ms
In more human terms (every figure multiplied by a billion, so 1 ns becomes 1 s), these look like:
Minute:
L1 cache reference                  0.5 s       One heart beat (0.5 s)
Branch mispredict                   5 s         Yawn
L2 cache reference                  7 s         Long yawn
Mutex lock/unlock                   25 s        Making a coffee
Hour:
Main memory reference               100 s       Brushing your teeth
Compress 1K bytes with Zippy        50 min      One episode of a TV show (including ad breaks)
Day:
Send 2K bytes over 1 Gbps network   5.5 hr      From lunch to end of work day
Week:
SSD random read                     1.7 days    A normal weekend
Read 1 MB sequentially from memory  2.9 days    A long weekend
Round trip within same datacenter   5.8 days    A medium vacation
Read 1 MB sequentially from SSD     11.6 days   Waiting for almost 2 weeks for a delivery
Year:
Disk seek                           16.5 weeks  A semester in university
Read 1 MB sequentially from disk    7.8 months  Almost producing a new human being
The above 2 together                1 year
Decade:
Send packet CA->Netherlands->CA     4.8 years   Average time it takes to complete a bachelor's degree
Underlying any calculation of time complexity is a cost model. Cost models tend to be oversimplified; for example, we generally talk about the time complexity of sort algorithms in terms of how many comparisons between elements we have to make.
The assumption behind the conclusion that indexing into an array is O(1) is that of random access memory: we can access location N by encoding N on the address lines of the memory bus, and the contents of that location come back on the data bus. If memory were sequential access (e.g., read off a magnetic tape), we'd assume a different cost model.
Imagine computer memory as buckets; say you have 10 buckets in front of you.
If someone tells you to pick something up from bucket number 8, you will not first stick your hand into buckets 1 to 7. You would simply put your hand directly into bucket 8.
Arrays work the same way: in most languages they map to some form of contiguous memory layout. So, e.g., if you have a byte array of length 10, that would be 10 sequential bytes.
Other types could vary in size, depending on whether the content is a value type/struct or a reference type, in which case the array would consist of pointers.
We assume that the memory is "Random Access Memory" (also known as RAM), not the tape or disk memory. In RAM you can access any address in constant time. See the corresponding wiki article for more information on how it works.
Also, elements of the array are stored sequentially. Say we want to store integers in Java which take up 4 bytes. If we wanted to look for kth element, we would directly look at start + 4 * k location in the memory.
You could implement an array in other ways as well. For example, you could implement the array with a linked list, in which case it would take O(n) time to access an element. But this is not how arrays are implemented typically.
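To make the contrast concrete, here is a small C sketch (the answer talks about Java, but the arithmetic is the same). array_get and list_get are illustrative names, and a 4-byte int32_t stands in for the Java int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array: the address of element k is computed, not searched for.
   Reading start[k] reads the 4 bytes at (char *)start + 4 * k. */
int32_t array_get(const int32_t *start, size_t k)
{
    assert(start != NULL);
    return start[k];
}

typedef struct Node {
    int32_t value;
    struct Node *next;
} Node;

/* Linked list: reaching element k means following k pointers, one after
   the other, so the cost grows with k. This is the O(n) case. */
int32_t list_get(const Node *head, size_t k)
{
    while (k > 0) {
        assert(head != NULL);
        head = head->next;
        k--;
    }
    assert(head != NULL);
    return head->value;
}
```

array_get does one multiplication and one addition regardless of k, while list_get runs its loop k times.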
No one here has explained (IMO) in sufficient detail why you can access it in O(1) time, so I will try to:
As a note before I do, this is probably trivializing how complex the hardware in the computer has become, but hopefully it's something along the right path. You would cover this in a Computer Organization course that goes into the guts of the hardware.
When you have circuits, the voltage passed through the computer propagates very fast, and the results that come back depend on the pulse of the clock. Take this diagram for example:
https://upload.wikimedia.org/wikipedia/commons/3/3d/Square_array_of_mosfet_cells_read.png
The following is missing parts that you would learn properly from a textbook or course (or online), but omitting those details should still leave you with enough of a high-level overview to get a rough idea of how this works:
The address you send as bits will go up the left side of the image, and based on the address you send, the voltage will be routed to the memory cell that has the data you want. Upon receiving voltage, the cell will emit its value back down to the bottom (which is also basically instant), and now you've read the 'value stored in memory' since the data you want has arrived. Because voltage changes propagate through circuits so quickly, you get the result almost instantly. This means it does not depend on traversing the elements before it, since you can just go to it, which is the idea behind RAM. The bottleneck comes from the clock pulse with the latches; when you take a computer organization course you will see what we do and why we do it.
This is why we consider it doable in O(1) time.
Now an Operating Systems and Computer Organization course would show you all about how this is connected under the hood and why it's way more complex than what I've written (which might not even be that accurate anymore), but hopefully this gives you an intuition as to why we can do it in constant time.
Since complexity notation hides the constants under the hood (and from the above, we can assume it takes constant time to go to any offset in memory), it makes sense that we can jump to any array offset in O(1) time from a high-level point of view, which is what complexity analysis aims to describe for us. This is also why we don't need to traverse every element in memory to get where we want, which as you said would be O(n).
Assuming the data structure you are talking about is a vector/array, you can reach any index directly by adding an offset to the address where the data starts.
Say you have a vector of struct "A" where A occupies 20 bytes, you want to get to index 28, and you know the vector starts at memory location 'x'. Then you simply need to go to x + 28 * 20 = x + 560 bytes, and that is your element.
With a data structure like a list, the lookup time will be O(n): since it's not contiguously allocated, you have to jump from pointer to pointer.
With a balanced binary tree it's O(log2(n)) ... etc.
So the answer here is that it depends on your structure. I would recommend reading some books about fundamental data structures; those might help you greatly in gaining more theoretical understanding of the various concepts you are using.

Using a filesystem in place of a Database to perform quick lookups

I was recently posed a simple question to which I responded with an unusual answer. I suspect my answer was particularly bad but am not certain what the performance characteristics truly would be.
Suppose you are given inputs each in the form of a hash code (just a bunch of bits.) Uniquely corresponding to each hash code is an integer value which you would like to return. Your system knows the most likely queries and caches them in memory. For the remaining, less frequent lookups you will have to access a hard drive (disk I/O.) There exists a least recently used policy to replace the cache in memory but that shouldn't be terribly important here.
For the hashes on disk, the conventional way to store them would be in a Database (keyed on the hashes) in a tree shape. This would grant you O(log(n)) lookup time once at the database stage.
My answer seemed odd to the asker and a little odd to me. Suppose instead of a database, you simply kept the values on disk in a file system with a directory structure that exactly mirrored the bits of the hash values. For instance, if we had three-bit hashes (and only had entries for 100 => 42 and 010 => 314159), your file system would look like:
\0
  .\00
    .\000
    .\100
      42.justanumber
  .\10
    .\010
      314159.justanumber
    .\110
\1
  .\01
    .\001
    .\101
  .\11
    .\011
    .\111
The x.justanumber files are empty. The filenames themselves contain the information you're looking for.
Further assume that updates never occur (the entire DB/file system is re-written weekly.) I'd think that a filesystem set up this way would give you O(1) lookup time instead of the O(log(n)) lookup time of a tree-based DB. Am I missing something? Why would this not be preferable?
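The naming scheme in the listing above (the directory at depth k is named after the last k bits of the hash) could be sketched in C like this. hash_to_path is a hypothetical helper, the hash is passed as a string of '0' and '1' characters, and '/' is used as the separator purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build the directory path for a hash given as a string of '0'/'1' chars.
   The directory at depth k is named after the last k bits, as in the
   listing: "100" -> "0/00/100".
   Returns 0 on success, -1 if the output buffer is too small. */
int hash_to_path(const char *bits, char *out, size_t out_size)
{
    assert(bits != NULL && out != NULL);

    size_t n = strlen(bits);
    size_t used = 0;

    if (out_size == 0) {
        return -1;
    }
    for (size_t k = 1; k <= n; k++) {
        size_t need = k + (k < n ? 1 : 0);    /* name plus separator */
        if (used + need + 1 > out_size) {     /* +1 keeps room for '\0' */
            return -1;
        }
        memcpy(out + used, bits + (n - k), k);
        used += k;
        if (k < n) {
            out[used++] = '/';
        }
    }
    out[used] = '\0';
    return 0;
}
```

The value is then the name of the single x.justanumber file found in that directory. Resolving the path means one directory lookup per bit of the hash.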
I believe I've come to an answer.
If the system always uses a folder for every possible bit combination and you have a 32-bit hash, you will have 2^32 folders at the outer layer and roughly the same amount within all inner layers so you'll have 2^33 folders each of size 4kB due to page size limitations. Additionally, each "empty" hash file will occupy 4kB on disk for the same reason. So assuming, let's say, 5% occupancy you would have:
2^33 * 4kB + 2^32 * 4kB * 0.05 = 35.22 TB of storage needed.
The savings are O(1) lookup time instead of O(log(# hashes)) lookup time which almost certainly doesn't justify the storage space requirements. (Also remember this is for the uncommon case where you have a memory/cache miss.)
Admittedly, if you only created the minimum number of folders needed to support the 5% occupancy you'd end up with less space needed. The exact number of folders needed would depend on the distribution of the hashes. Assuming a good hash function, this should be close to random, which I believe means we'll end up needing the majority of the inner layers and definitely 5% of the outermost layers of directories. This instead gives something like:
2^32 * 4 kB * 0.05 + 2^32 * 4kB * 0.40 + 2^32 * 4kB * 0.05 = 8.59 TB
...which is still way too much space for such a simple thing. (I made up the 40% figure for the inner folders. If anyone can come up with a rigorous figure there, please comment/answer.)

Storing Large Integers/Values in an Embedded System

I'm developing an embedded system that can test a large number of wires (up to 360) - essentially a continuity checking system. The system works by clocking in a test vector and reading the output from the other end. The output is then compared with a stored result (which would be on an SD card) that tells what the output should have been. The test vectors are just walking ones, so there's no need to store them anywhere. The process would be a bit like this:
1. Clock out test-vector (walking ones).
2. Read in output test-vector.
3. Read corresponding output test-vector from SD Card which tells what the output vector should be.
4. Compare the test-vectors from step 2 and 3.
5. Note down the errors/faults in a separate array.
6. Continue back to step 1 unless all wires are checked.
7. Output the errors/faults to the LCD.
My hardware consists of a large shift register that's clocked into the AVR microcontroller. For every test vector (which would also be 360 bits), I will need to read in 360 bits. So, for 360 wires the total amount of data would be 360*360 bits = 16kB or so. I already know I cannot do this in one pass (i.e. read the entire data and then compare), so it will have to be test-vector by test-vector.
As there are no inherent types that can hold such large numbers, I intend to use a bit-array of length 360 bits. Now, my question is, how should I store this bit array in a txt file?
One way is to store raw values, i.e. on each line store the raw binary data that I read in from the shift register. So, for 8 wires, it would be 0b10011010. But this can get ugly for up to 360 wires - each line would contain 360 characters.
Another way is to store hex values - this would just be two characters for 8 bits (9A for the above) and about 90 characters for 360 bits. This would, however, require me to read in the text - line by line - and convert the hex value to be represented in the bit-array, somehow.
So what's the best solution for this sort of problem? I need the solution to be completely "deterministic" - I can't have calls to malloc or such. They are a bit of a no-no in embedded systems from what I've read.
SUMMARY
I need to store large values that can't be represented by any traditional variable types. Currently I intend to store these values in a bitarray. What's the best way to store these values in a text file on an SD Card?
These are not integer values but rather bit maps; they have no arithmetic meaning. What you are suggesting is simply a byte array of length 360/8, and not related to "large integers" at all. However some more appropriate data structure or representation may be possible.
If the test vector is a single bit in 360, then it is both inefficient and unnecessary to store 360 bits for each vector; a value 0 to 359 is sufficient to unambiguously define each vector. If the correct output is also a single bit, then that could also be stored as a bit index; if not, then you could store it as a list of indices for each bit that should be set, with some sentinel value >=360 or <0 to indicate the end of the list. Where most vectors contain fewer than 22 set bits, this structure will be more efficient than storing a 45-byte array.
From any bit index value, you can determine the address and mask of the individual wire by:
byte_address = base_address + bit_index / 8 ;
bit_mask = 0x01 << (bit_index % 8) ;
You could either test each of the 360 bits iteratively or generate a 360 bit vector on the fly from the list of bits.
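Putting the pieces above together, a sketch in C could look like the following. It uses only fixed-size buffers (no malloc); the function names, the 16-bit index type and the 0xFFFF sentinel are choices made for this example rather than anything prescribed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { WIRE_COUNT = 360, VECTOR_BYTES = WIRE_COUNT / 8 };  /* 45 bytes */
#define END_OF_LIST UINT16_C(0xFFFF)  /* sentinel >= 360, ends an index list */

/* Set one bit using the byte address / bit mask arithmetic shown above. */
void set_wire(uint8_t vector[VECTOR_BYTES], uint16_t bit_index)
{
    assert(bit_index < WIRE_COUNT);
    vector[bit_index / 8] |= (uint8_t)(0x01u << (bit_index % 8));
}

bool wire_is_set(const uint8_t vector[VECTOR_BYTES], uint16_t bit_index)
{
    assert(bit_index < WIRE_COUNT);
    return (vector[bit_index / 8] & (0x01u << (bit_index % 8))) != 0;
}

/* Expand a sentinel-terminated list of bit indices into a 45-byte vector. */
void expand_expected(const uint16_t *indices, uint8_t vector[VECTOR_BYTES])
{
    memset(vector, 0, VECTOR_BYTES);
    for (; *indices != END_OF_LIST; indices++) {
        set_wire(vector, *indices);
    }
}

/* Count the wires whose measured state differs from the expected state. */
uint16_t count_faults(const uint8_t measured[VECTOR_BYTES],
                      const uint8_t expected[VECTOR_BYTES])
{
    uint16_t faults = 0;
    for (uint16_t i = 0; i < WIRE_COUNT; i++) {
        if (wire_is_set(measured, i) != wire_is_set(expected, i)) {
            faults++;
        }
    }
    return faults;
}
```

For each test vector, the expected index list would be read from the SD card, expanded with expand_expected, and compared against the 45 bytes shifted in from the hardware.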
I can see no need for dynamic memory allocation in this, but whether or not it is advisable in an embedded system is largely dependent on the application and target resources. A typical AVR system has very little memory, and dynamic memory allocation carries an overhead for heap management and block alignment that you may not be able to afford. Dynamic memory allocation is not suited to situations where hard real-time deterministic timing is required. And in all cases you should have a well-defined strategy or architecture for avoiding memory leak issues (repeatedly allocating memory that never gets released).

Representing a very large array of bits in little memory

I would like to represent a structure containing 250 M states (1 bit each) in as little memory as possible (100 k maximum). The operations on it are set/get. I could not say whether it's dense or sparse; it may vary.
The language I want to use is C.
I looked at other threads here to find something suitable also. A probabilistic structure like Bloom filter for example would not fit because of the possible false answers.
Any suggestions please?
If you know your data might be sparse, then you could use run-length encoding. But otherwise, there's no way you can compress it.
The size of the structure depends on the entropy of the information. You cannot squeeze information into less than a given size if you have no repeated pattern. The worst case would still be about 31 MB of storage in your case. If you know something about the relation between the bits then it's maybe possible...
I don't think it's possible to do what you're asking. If you need to cover 250 million states of 1 bit each, you'd need 250Mbits/8 = 31.25MBytes. A far cry from 100KBytes.
You'd typically create a large array of bytes, and use functions to determine the byte (index >> 3) and bit position (index & 0x07) to set/clear/get.
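A minimal sketch of such functions in C, using exactly the index >> 3 and index & 0x07 arithmetic described above. The function names are illustrative, and the caller is assumed to provide a buffer large enough for the highest index used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The byte holding bit `index` is index >> 3; its position inside that
   byte is index & 0x07. */
void bit_set(uint8_t *bits, size_t index)
{
    assert(bits != NULL);
    bits[index >> 3] |= (uint8_t)(1u << (index & 0x07));
}

void bit_clear(uint8_t *bits, size_t index)
{
    assert(bits != NULL);
    bits[index >> 3] &= (uint8_t)~(1u << (index & 0x07));
}

bool bit_get(const uint8_t *bits, size_t index)
{
    assert(bits != NULL);
    return ((bits[index >> 3] >> (index & 0x07)) & 1u) != 0;
}

/* Bytes needed to hold n bits, rounded up to a whole byte. */
size_t bytes_for_bits(size_t n)
{
    return (n + 7) / 8;
}
```

bytes_for_bits(250000000) gives 31,250,000 bytes, the 31.25 MB figure quoted above.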
250M bits will take 31.25 megabytes to store (assuming 8 bits/byte, of course), much much more than your 100k goal.
The only way to beat that is to start taking advantage of some sparseness or pattern in your data.
The max number of bits you can store in 100K of mem is 819,200 bits. This is assuming that 1 K = 1024 bytes, and 1 byte = 8 bits.
Are files possible in your environment?
If so, you might swap segments of the bit buffer to and from disk, say 4k at a time.
Your solution should then access those bits in a serialized way to minimize disk load/save operations.

Resources