Best option to store big key-value mapping on disk for high read-only throughput? - database

I have a dictionary (a simple key-value mapping between small strings) of ~1GB in size (which may grow over time) that I want to use for spell correction and autocompletion. I was planning on keeping it in RAM, but until I gain some traffic I want to stick to free hosting plans, so that is not really an option for the time being.
The alternative is to store it on disk (SSD) and use some limited (e.g. up to 128MB) clever caching (e.g. with a combined LRU/LFU eviction policy) to keep access times bearable. However, I'm not sure which form of disk storage I should use to maximize throughput. The options I have considered so far are:
Use a database:
MongoDB
BerkeleyDB ( https://en.wikipedia.org/wiki/Berkeley_DB )
Use a custom solution:
https://github.com/spotify/sparkey
Just use the filesystem:
have a file for each entry in the dictionary, whose name is the key and whose contents are the value
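To make the caching idea concrete, here is a minimal sketch of the kind of bounded LRU cache I have in mind in front of whichever disk store wins. It uses Java's LinkedHashMap purely for illustration (my actual stack is MEAN/Node), and MAX_ENTRIES and lookupOnDisk are placeholders for the real memory budget and store:
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache in front of a slower on-disk key-value store.
public class CachedDictionary {
    private static final int MAX_ENTRIES = 500_000; // tune to fit the ~128MB budget

    private final Map<String, String> cache =
        new LinkedHashMap<String, String>(16, 0.75f, true) { // accessOrder = true -> LRU
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    public String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = lookupOnDisk(key);   // hit the on-disk store only on a cache miss
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    private String lookupOnDisk(String key) {
        // placeholder for the actual disk-backed lookup (Sparkey, BerkeleyDB, ...)
        return null;
    }
}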
Before I get my hands dirty and evaluate the above by implementing and profiling, I wanted to ask: if anyone has done something similar before, what was your approach and what were the results? Please note that the dictionary is created only once and not modified afterwards. That is, after creation there will only be read (lookup) operations, and one 'correction/suggestion query' typically needs 15-20 lookups.
Thanks in advance for any useful input!
PS: I'm developing using the MEAN stack.

Related

Lightweight solution to keep one table with million records

The brief: I need access to a simple table with only one column, a million rows, no relationships, and just simple 6-character entries: postal codes. I will check user-entered postal codes against it to find out if they are valid. This will be a temporary solution for a few months until I can remove this validation and leave it to web services. So right now I am looking for a solution to this.
What I have:
Web portal built on top of Adobe CQ5 (Java, OSGi, Apache Sling, CRX)
Linux environment where it is all situated
a plain text file (9 MB) with these million rows
What I want:
to have fast access to this data (read only, no writes) for only one purpose: to find a row with a specific value (six characters long, containing only Latin letters and digits).
to create this solution as easily as possible, i.e. using software preinstalled on Linux, or something I can quickly install and start without lengthy setup and configuration.
Currently I see two options: use a database, or use something like a HashSet to keep these million records. The first requires additional steps for installing and configuring a database; the second drives me crazy when I think about holding a whole million records in a HashSet. Right now I am considering trying SQLite, but I want to hear some suggestions on this problem.
Thanks a lot.
Storing in the content repository
You could store it in the CQ5 repository to eliminate the external dependency on sqlite. If you do, I would recommend structuring the storage hierarchically to limit the number of peer nodes. For example, the postcode EC4M 7RF would be stored at:
/content/postcodes/e/c/4/m/ec4m7rf
This is similar to the approach used for users and groups under /home.
This kind of structure might also help with autocomplete if you need it. If the user typed ec, you could return all of the possible subsequent characters for postcodes in your set by requesting something like:
/content/postcodes/e/c.1.json
This will show you the 4 (and the next character for any other postcode in EC).
You can control the depth using a numeric selector:
/content/postcodes/e/c.2.json
This will go down two levels showing you the 4 and the M and any postcodes in those 'zones'.
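A minimal sketch of how such paths could be generated; the repository root and the four-level fan-out simply follow the example above, and the method name is illustrative:
// Builds a repository path like /content/postcodes/e/c/4/m/ec4m7rf from a raw postcode.
public static String postcodePath(String postcode) {
    String normalized = postcode.replaceAll("\\s+", "").toLowerCase();
    StringBuilder path = new StringBuilder("/content/postcodes");
    for (int i = 0; i < 4 && i < normalized.length(); i++) {
        path.append('/').append(normalized.charAt(i));
    }
    return path.append('/').append(normalized).toString();
}

// postcodePath("EC4M 7RF") -> "/content/postcodes/e/c/4/m/ec4m7rf"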
Checking for non-existence using a Bloom Filter
Also, have you considered using a Bloom Filter? A bloom filter is a space efficient probabilistic data structure that can quickly tell you whether an item is definitely not in a set. There is a chance of false positives, but you can control the probability vs size trade-off during the creation of the bloom filter. There is no chance of false negatives.
There is a tutorial that demonstrates the concept here.
Guava provides an implementation of the Bloom filter that is easy to use. It will work like the HashSet, but without needing to hold the whole dataset in memory.
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import java.nio.charset.StandardCharsets;

// A Funnel tells the filter how to turn a Person into bytes (assumes Person exposes getName())
Funnel<Person> personFunnel =
    (person, into) -> into.putString(person.getName(), StandardCharsets.UTF_8);

// 500 expected insertions, 1% desired false-positive probability
BloomFilter<Person> friends = BloomFilter.create(personFunnel, 500, 0.01);
for (Person friend : friendsList) {
    friends.put(friend);
}
// much later
if (friends.mightContain(dude)) {
    // the probability that dude reached this place if he isn't a friend is 1%
    // we might, for example, start asynchronously loading things for dude
    // while we do a more expensive exact check
}
Essentially, the Bloom filter could sit in front of the check and obviate the need to make the full check for items that are definitely not in the set. For items that may be in the set (~99% accurate depending on the setup), the full check is then made to rule out a false positive.
I would try Redis, an in-memory database which can handle millions of key/value pairs and is blazing fast for loading and reading. Connectors exist for most languages, and an Apache module also exists (mod_redis).
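A minimal sketch of that lookup, assuming the Jedis client and a Redis set named "postcodes" that was loaded once with SADD (both names are illustrative):
import redis.clients.jedis.Jedis;

public class RedisPostcodeValidator {
    // Validation becomes a single O(1) SISMEMBER call per request.
    public static boolean isValidPostcode(String userInput) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            return jedis.sismember("postcodes", userInput.trim().toUpperCase());
        }
    }
}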
You said that this is a temporary solution/requirement - so do you need a database?
You already have this as a text file - why not just load it into memory as part of your program, as it's only 9 MB (assuming your process is persistent and always resident), and reference it as an array or just a table of values.
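A minimal sketch of that in Java, assuming one postal code per line in the file and that the codes are stored uppercase (the file path is a placeholder):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PostcodeValidator {
    private final Set<String> postcodes;

    // Read the ~9 MB file once at startup; a million 6-character strings
    // fit comfortably in memory, and contains() is then an O(1) lookup.
    public PostcodeValidator(String file) throws IOException {
        postcodes = new HashSet<>(Files.readAllLines(Paths.get(file)));
    }

    public boolean isValid(String entered) {
        return postcodes.contains(entered.trim().toUpperCase());
    }
}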

Big Data Database

I am collecting a large amount of data which is most likely going to be a format as follows:
User 1: (a,o,x,y,z,t,h,u)
Where all the variables change dynamically with respect to time, except u, which stores the user name. My background is not very strong in "big data", and what I am trying to determine is this: the array I end up with will be very large, something like 108000 x 3500, and since I will be performing analysis on each timestep and graphing it, what would be an appropriate database to manage it in? Since this is for scientific research I was looking at CDF and HDF5, and based on what I read from NASA I think I will want to use CDF. But is this the correct way to manage such data for speed and efficiency?
The final data set will have all the users as columns, and the rows will be timestamped, so my analysis program would read row by row to interpret the data and make entries into the dataset. Maybe I should be looking at things like CouchDB or an RDBMS; I just don't know a good place to start. Advice would be appreciated.
This is an extended comment rather than a comprehensive answer ...
With respect, a dataset of size 108000*3500 doesn't really qualify as big data these days, not unless you've omitted a unit such as GB. If each of those 108000*3500 cells is an 8-byte value, that's only 3GB plus change. Any of the technologies you mention will cope with that with ease. I think you ought to make your choice on the basis of which approach will speed your development rather than speeding your execution.
But if you want further suggestions to consider, I suggest:
SciDB
Rasdaman, and
MonetDB
all of which have some traction in the academic big data community and are beginning to be used outside that community too.
I have been using CDF for some similarly sized data and I think it should work nicely. You will need to keep a few things in mind though. Considering I don't really know the details of your project, this may or may not be helpful...
3GB of data is right around the file size limit for the older version of CDF, so make sure you are using an up-to-date library.
While 3GB isn't that much data, depending on how you read and write it, things may be slow going. Make sure you use the hyper read/write functions whenever possible.
CDF supports meta-data (called global/variable attributes) that can hold information such as username and data descriptions.
It is easy to break data up into multiple files. I would recommend using one file per user. This will mean that you can write the user name just once for the whole file as an attribute, rather than in each record.
You will need to create an extra variable called epoch. This is a well-defined timestamp for each record. I am not sure if the timestamp you have now would be appropriate, or if you will need to process it a bit, but it is something you need to think about. Also, the epoch variable needs to have a specific type assigned to it (epoch, epoch16, or TT2000). TT2000 is the most recent version, which gives nanosecond precision and handles leap seconds, but most CDF readers that I have run into don't handle it well yet. If you don't need that kind of precision, I recommend epoch16, as that has been the standard for a while.
Hope this helps, if you go with CDF, feel free to bug me with any issues you hit.

database vs flat file, which is a faster structure for "regex" matching with many simultaneous requests

Which structure returns results faster and/or is less taxing on the host server: a flat file or a database (MySQL)?
Assume many users (100) are simultaneously querying the file/db.
Searches involve pattern matching against a static file/db.
File has 50,000 unique lines (same data type).
There could be many matches.
There is no writing to the file/db, just read.
Is it possible to have a duplicate of the file/db and write a logic switch to use the backup if the main one is in use?
Which language is best for this type of structure? Perl for the flat file and PHP for the db?
Additional info:
Say I want to find all the cities that have the pattern "cis" in their names.
Which is better/faster, using regex or string functions?
Please recommend a strategy
TIA
I am a huge fan of simple solutions, and thus prefer -- for simple tasks -- flat file storage. A relational DB with its indexing capabilities won't help you much with arbitrary regex patterns at all, and the filesystem's caching ensures that this rather small file is in memory anyway. I would go the flat file + perl route.
Edit: (taking your new information into account) If it's really just about finding a substring in one known attribute, then using a full-text index (which a DB provides) will help you a bit (depending on the type of index applied) and might provide an easy and reasonably fast solution that fits your requirements. Of course, you could implement an index yourself on the file system, e.g. using a variation of a Suffix Tree, which is hard to beat speed-wise.
Still, I would go the flat file route (and if it fits your purpose, have a look at awk), because if you had started implementing it, you'd be finished already ;) Further, I suspect that the number of users you mention won't make the system feel the difference (your CPU will be bored most of the time anyway).
If you are uncertain, just try it! Implement that regex+perl solution, it takes a few minutes if you know perl, loop 100 times and measure with time. If it is sufficiently fast, use it, if not, consider another solution. You have to keep in mind that your 50,000 unique lines are really a low number in terms of modern computing. (compare with this: Optimizing Mysql Table Indexing for Substring Queries )
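The advice above is framed in terms of Perl, but as a rough sketch the same load-once, regex-in-memory, time-100-loops experiment looks like this in Java (the file name and pattern are placeholders):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class RegexBench {
    public static void main(String[] args) throws Exception {
        // Load the 50,000 lines once; they will sit in the OS page cache anyway.
        List<String> lines = Files.readAllLines(Paths.get("cities.txt"));
        Pattern p = Pattern.compile("cis");   // or whatever pattern users submit

        long start = System.nanoTime();
        int matches = 0;
        for (int i = 0; i < 100; i++) {        // simulate 100 back-to-back queries
            for (String line : lines) {
                if (p.matcher(line).find()) {
                    matches++;
                }
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(matches + " matches in " + elapsedMs + " ms for 100 scans");
    }
}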
HTH,
alexander
Depending on what your queries and your data look like, a full-text search engine like Lucene or Sphinx could be a good idea.

Practical to save thousands of data structures in a file and do specific lookups?

There's been a discussion between me and some colleagues who are taking the same class as me (and thus have the same project) about saving data to files and reading from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using a hash table to save the users' profile data. Some of them argue that only some specific information should be saved in the data structures, like an ID that represents a user. Everything else should be put in files, and we should access the files to get the data we want when we want it.
I don't think this is practical... It could be if we were using some library for a database like SQLite, but we are not, and I don't think we are supposed to. We are only supposed to code everything ourselves using C functions, like these. Nor do I think we are supposed to do perfect memory management. The requirements of the project are not for us to code a database, or even a pseudo-database. What this project demands of us are the best data structures (as long as we know how to justify why we picked those instead of others) to store the type of data, and all the data, specified for the project.
I should let you know that we had 2 classes before this, whose knowledge is meant to be applied to this project. One of them dealt with the basics of C: functions, structures, arrays, strings, file IO, recursion, pointers and simple data structures like binary trees and linked lists. The other one was about more complex data structures: hash tables, AVL trees, heaps, graphs, etc. It also covered time complexity, big O notation and the like.
For instance, let's say all I have in memory is the IDs of the users and then I need to find all friends of a specific user. I'll have to process the whole file (or files) finding out the friends of that user. It would be much easier if I could have all that data in memory already.
It makes no sense to me that we need to pick (and justify) the data structures we see as best fit for the project and then only use them to look up an ID. We would then need to do a second lookup to get the real data we need, which will take its time, won't it? Why did we bother with the data structures in the first place if we still need to search a bunch of files on the hard drive?
How could it be possible, using standard C functions, coding everything manually and still simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities," and not as much about how you store them. I don't think storing data off in files would be a good solution - file IO will be much slower than accessing things in memory. If you had the need to persist data on the disk, you'd probably want to just use a database, rather than files (I know it's an academic course though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or store them in a tree, or a graph, but each will have its implications on how fast you can find users, etc. Depending on what features you want in your social networking site, there will be different designs that will allow different types of behavior to perform better than it would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing that I would ask the teacher is if you're optimizing for time or space complexity (there will be a trade off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (general use of the term, don't think SQL!). (I think the interface was actually written in assembly, but I could do it in C.)
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at some more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
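A rough sketch of the "records at predictable locations" idea, shown with Java's RandomAccessFile for brevity; fopen/fseek/fread in C follow exactly the same pattern, and the header and record sizes are made up for the example:
import java.io.RandomAccessFile;

// Fixed-size records: record i starts at HEADER_SIZE + i * RECORD_SIZE,
// so a lookup by index is a single seek + read, no scanning.
public class RecordFile {
    static final int HEADER_SIZE = 16;   // made-up header length
    static final int RECORD_SIZE = 64;   // made-up fixed record length

    public static byte[] readRecord(String path, long index) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) { // binary mode access
            f.seek(HEADER_SIZE + index * RECORD_SIZE);
            byte[] record = new byte[RECORD_SIZE];
            f.readFully(record);
            return record;
        }
    }
}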
Hmm... what about persistent storage?
If your project requires you to remember friend data between two restarts of the app, then file storage (or whatever other persistence) becomes an issue, don't you think?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
I may have completely misunderstood the question you are trying to ask...

Store static data in an array, or in a database?

In our web-based projects we always have some static data, which can be stored in a file as an array or stored in a database table. So which should be preferred?
In my opinion, arrays have some advantages:
More flexible (it can be any structure, which specifies a really complex relation)
Better performance (it will be loaded in memory, which will have better read/write performance compared with a database's I/O operations)
But my colleague argued that he preferred DB approach, since it can keep a uniform data persistence interface, and be more flexible.
So which should be preferred? How can we choose? Or should we prefer one in some scenarios and the other in others? What are those scenarios?
EDIT:
Let me clarify something. Just as Benjamin's change to the title suggests, the data we want to store in an array (file) won't change frequently, which means the code won't change the value of the array at runtime. If the data changed very frequently I would undoubtedly use a DB. That's why I made this post.
And sometimes it's hard to store some really complex relations like:
Task = {
    "1": {
        "name": "xx",
        "requirement": {
            "level": 5,
            "money": 100,
        },
        # ...
    },
}
As in the above code sample (a Python dict, or you can think of it as an array), the requirement field is hard to store in a DB (store a structure like a pickled object directly in the DB? not so good, I think). So under such conditions, I would prefer arrays.
So what's your view? In such a scenario, we should prefer arrays to a DB, right?
Regards.
Let's be pragmatic/objective:
Do you write to your data at runtime? Yes: DB, No: File
Do you update your data more than once per week? Yes: DB, No: File
Is it a pain to release an updated data file? Yes: DB, No: File
Do you read that data often? Yes: File/Cache, No: DB
Does updating that data file require extra tools? Yes: DB, No: File
For sure I've forgotten other points, but I guess the basics are there.
The "flexiable" array in a file is fraught with a zillion issues already delt with by using a DB. Unless you can prove that the DB is really going to way slower than using the other approach use a DB. Move on and start solving business problems.
Edit
A comment from the OP asks what the issues with using a file might be; here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard but it becomes a bottleneck.
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing the file back each time a small modification is made isn't going to perform that well.
Scalability: Really a function of the first two. In order to achieve any scalability goals you need to be able to quickly modify small bits of the persisted data. In other words, if you don't use a DB you will end up writing one. If you find you need more than one web server to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very quick one).
Structure: If you use an array, your code will be responsible for managing the structure of the data, querying it, etc. How will you do that in a way which achieves greater "flexibility" than using a DB? All manner of choices and complexity are needed here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure, your array/file code would need to ensure that the data is at least not so corrupt that the application cannot continue.
Your colleague is correct, BUT this is where you need to put aside the comp-sci textbook and be pragmatic. How often will you be accessing this data from your application? If it's fairly frequent, don't incur the access overhead every time. You can still gain the advantages of a DB instead of reading from a flat file, but use a caching strategy in your application. Depending on your development language you could look at something like memcache or jtreecache.
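A rough sketch of that cache-aside pattern, assuming the spymemcached client; the key name, TTL and loadFromDatabase() helper are placeholders:
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class StaticDataLookup {
    // Cache-aside: check memcached first, fall back to the DB only on a miss,
    // then populate the cache so subsequent reads skip the DB entirely.
    public static Object getStaticData(MemcachedClient cache, String key) {
        Object value = cache.get(key);
        if (value == null) {
            value = loadFromDatabase(key);       // hypothetical helper doing the expensive DB read
            if (value != null) {
                cache.set(key, 3600, value);     // keep it for an hour
            }
        }
        return value;
    }

    private static Object loadFromDatabase(String key) {
        // placeholder for the real DB query
        return null;
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        Object config = getStaticData(cache, "static-config");
        cache.shutdown();
    }
}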
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases will scale so much more easily than a flat file; if your system starts to grow fast, the flat file might become a burden to your system.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean, and changed easily.
But that's only if you're developing in Java.
Yeah, I agree with your implied assessment that databases are overused and basic flat files work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts) I would definitely go with the file. Even if the application writes to the file, but only in append mode (vs random inserts/updates) and in one thread, I would also use a file. Anything else needs a real database with random updates, queries, concurrency control, etc.
