I want to try writing a dictionary application. From other questions on SO, I learned about data structures that should be considered:
DAWG
ternary search tree
double-array trie
However, it may be just my prejudice, but storing the data in a binary file and mapping it in memory directly onto a data structure seems like a rather heavy-handed approach. I'd normally consider putting it in a graph database, but I imagine that the DB-related data overhead (ids, hashes and the like) could destroy the space gains of the aforementioned structures when compared to a trie.
The dictionary will be large, containing all words in English. Should I store it in some sort of database or refrain from it?
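For concreteness, here is a rough sketch of what I mean by the "binary file mapped directly to a structure" approach; the flat node layout here is made up purely for illustration:

```python
# Sketch: pack a trie into one flat binary blob (no per-record DB overhead).
# Node layout (illustrative): is_word flag, child count, then (char, child_offset) pairs.
import struct

def build_trie(words):
    root = {"children": {}, "is_word": False}
    for w in words:
        node = root
        for ch in w:
            node = node["children"].setdefault(ch, {"children": {}, "is_word": False})
        node["is_word"] = True
    return root

def pack(node, out):
    """Serialize depth-first; return the byte offset of this node in the blob."""
    child_offsets = {ch: pack(c, out) for ch, c in node["children"].items()}
    offset = len(out)
    out += struct.pack("<BI", node["is_word"], len(child_offsets))
    for ch, off in child_offsets.items():
        out += struct.pack("<cI", ch.encode("ascii"), off)
    return offset

blob = bytearray()
root_offset = pack(build_trie(["cat", "car", "dog"]), blob)
# `blob` can be written to disk and later mmapped; lookups follow offsets instead of pointers.
```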
Related
Lee Byron makes this point in the video, but I can't seem to find the part where he explains it.
https://www.youtube.com/watch?v=I7IdS-PbEgI&t=1604s
Is this because when you update a node you have to traverse log(n) nodes to get to it, whereas with an immutable structure it must copy, in the worst case, n nodes? That is as far as I get in my thinking.
If you attempted to create an immutable list the simple way, the obvious solution would be to copy the whole list into a new list and replace that single item. A larger list would then take longer to copy, right? The result would be at least O(n).
Immutable.js, on the other hand, uses a trie (see Wikipedia), which allows it to reuse most of the structure while making sure that existing references are not mutated.
Simply put, you create a new tree structure and create new branches only for the modified parts. When a branch is unchanged, the tree can just link to the original structure instead of copying it.
The Immutable.js documentation starts with two links to long descriptions; the one about vector tries in particular is nice:
These data structures are highly efficient on modern JavaScript VMs by using structural sharing via hash maps tries and vector tries as popularized by Clojure and Scala, minimizing the need to copy or cache data.
If you want to know more of the details, you might also want to take a look at the question about How Immutability is Implemented.
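As a rough sketch of the idea (in Python rather than JavaScript, and with a plain binary tree instead of the 32-way trie Immutable.js actually uses), a path-copying update only copies the O(log n) nodes along the path and shares everything else:

```python
# Sketch: persistent (immutable) update by path copying.
# Only the nodes on the path from the root to the target are copied;
# all untouched subtrees are shared between the old and new versions.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(node: Optional[Node], key: int, value: str) -> Node:
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # replace value, reuse both subtrees

v1 = insert(insert(insert(None, 2, "b"), 1, "a"), 3, "c")
v2 = insert(v1, 3, "C")          # new version with key 3 updated
assert v1.left is v2.left        # the untouched left subtree is shared, not copied
```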
I have been reading a bit about tries, and how they are a good structure for typeahead designs. Aside from the trie, you usually also have a key/value pair for nodes and pre-computed top-n suggestions to improve response times.
Usually, from what I've gathered, it is ideal to keep them in memory for fast searches, such as what was suggested in this question: Scrabble word finder: building a trie, storing a trie, using a trie?. However, what if your Trie is too big and you have to shard it somehow? (e.g. perhaps a big e-commerce website).
The key/value pairs for pre-computed suggestions can obviously be implemented in a key/value store (either kept in memory, like memcached/Redis, or in a database, and horizontally scaled as needed), but what is the best way to store a trie if it can't fit in memory? Should it be done at all, or should distributed systems each hold part of the trie in memory, while also replicating it so that it is not lost?
Alternatively, a search service (e.g. Solr or Elasticsearch) could be used to produce search suggestions/auto-complete, but I'm not sure whether the performance is up to par for this particular use-case. The advantage of the Trie is that you can pre-compute top-N suggestions based on its structure, leaving the search service to handle actual search on the website.
I know there are off-the-shelf solutions for this, but I'm mostly interested in learning how to re-invent the wheel on this one, or at least catch a glimpse of the best practices if one wants to broach this topic.
What are your thoughts?
Edit: I also saw this post: https://medium.com/@prefixyteam/how-we-built-prefixy-a-scalable-prefix-search-service-for-powering-autocomplete-c20f98e2eff1, which basically covers using Redis (skip lists) as the primary data store, with MongoDB for least-recently-used prefixes. It seems an OK approach, but I would still want to learn whether there are other viable/better approaches.
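For reference, here is a toy in-memory sketch of the pre-computed top-N idea I have in mind (purely illustrative; a real service would shard this and keep the suggestion lists in Redis or similar):

```python
# Sketch: trie where each node caches its top-N completions by score.
import heapq

class Node:
    def __init__(self):
        self.children = {}
        self.top = []          # list of (score, word), at most N entries

def insert(root, word, score, n=5):
    node = root
    for ch in word + "$":      # "$" marks the terminal node
        node = node.children.setdefault(ch, Node())
        heapq.heappush(node.top, (score, word))
        if len(node.top) > n:  # keep only the N highest-scoring completions
            heapq.heappop(node.top)

def suggest(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [w for _, w in sorted(node.top, reverse=True)]

root = Node()
for word, score in [("card", 10), ("car", 25), ("care", 7), ("cart", 3)]:
    insert(root, word, score)
print(suggest(root, "car"))    # highest-scoring completions of "car" first
```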
I was reading 'System Design Interview: An Insider's Guide' by Alex Xu, and he briefly covers this topic.
Trie DB. Trie DB is the persistent storage. Two options are available to store the data:
Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are good fits for serialized data.
Key-value store: A trie can be represented in a hash table form [4] by applying the following logic:
• Every prefix in the trie is mapped to a key in a hash table.
• Data on each trie node is mapped to a value in a hash table.
Figure 13-10 shows the mapping between the trie and hash table.
The numbers on each prefix node represent the frequency of searches for that specific prefix/word.
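Roughly, that prefix-to-key mapping could look like this (a toy Python sketch, not from the book; a plain dict stands in for the key-value store and the value layout is illustrative):

```python
# Sketch: flatten frequency-annotated words into key-value pairs,
# where each prefix is a key and the completions with their counts are the value.
word_frequencies = {"tree": 10, "true": 35, "try": 29, "toy": 14}  # made-up counts

kv_store = {}   # stand-in for a real key-value store
for word, freq in word_frequencies.items():
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        kv_store.setdefault(prefix, {})[word] = freq

# A prefix lookup is now a single key-value GET:
print(sorted(kv_store["tr"].items(), key=lambda kv: -kv[1]))
# [('true', 35), ('try', 29), ('tree', 10)]
```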
He then suggests possibly scaling the storage by sharding the data by first letter (or by groups of letters, like a-m and n-z). But this would distribute the data unevenly, since there are more words that start with 'a' than with 'z'.
So he recommends using some type of shard manager that keeps track of query frequency and assigns shards based on that. If there are twice as many queries for 's' as for 'z' and 'x' combined, two shards can be used: one for 's', and another for both 'z' and 'x'.
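A toy sketch of that shard-manager idea: group letters into shards so that each shard carries roughly the same share of observed query traffic (the query counts below are fabricated):

```python
# Sketch: greedily group first letters into shards with roughly equal query load.
def assign_shards(letter_load, num_shards):
    shards = [{"letters": [], "load": 0} for _ in range(num_shards)]
    # place the heaviest letters first, always onto the currently lightest shard
    for letter, load in sorted(letter_load.items(), key=lambda kv: -kv[1]):
        target = min(shards, key=lambda s: s["load"])
        target["letters"].append(letter)
        target["load"] += load
    return shards

letter_load = {"s": 40, "a": 25, "t": 20, "x": 3, "z": 2}   # fabricated query counts
for shard in assign_shards(letter_load, 3):
    print(shard)
# 's' ends up alone on one shard, while 'x' and 'z' share a shard with a busier letter
```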
I am trying to write a search-engine for a large collection, for learning purposes. I started with my own intuitions. Then I researched and am finally arriving at a working model.
I am constructing a giant hash-table to hold all the terms in my collection. It is very expensive to construct this from the collection. Once I have computed the table I want to save it to disk, so that whenever I want to access this hash-table in my program later, I can load it again from disk.
Is there any standard way of doing it or do I have to invent my own file-format and hacks to do this?
Note: The hash-table is only for storing all term occurrences; I am planning to store the main ranking data in a postings file and keep a pointer to it in the corresponding term entry of the hash-table.
I am working in C.
BDB (Berkeley DB) is a library for efficiently managing flat-file databases. In particular, a hash table format is supported. B-Trees are also available, in case ordered access is required.
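Since the question is in C you would use Berkeley DB's C API directly; as a language-neutral illustration of the same disk-backed-hash idea, here is a sketch using Python's standard dbm module (the file name and the offsets are made up):

```python
# Sketch: a disk-backed hash mapping each term to the byte offset of its
# postings in a separate postings file (offsets here are illustrative).
import dbm

with dbm.open("terms.db", "c") as db:        # "c" = create the file if missing
    db["hello"] = str(1024)                  # term -> offset into postings file
    db["world"] = str(2048)

with dbm.open("terms.db", "r") as db:        # reopen later without rebuilding
    offset = int(db["hello"])                # then seek() to this offset in the postings file
    print(offset)
```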
I have been working on inverted indexing, which indexes a document collection, stores each term with its information, and also stores its references in a postings file (document id, location, etc.).
Currently I store it in a .txt file format, which requires string matching against that file for each and every query; this takes more time and is also rather complex.
Now I want to store that information in a file using a linked-list-style data structure. Is this possible for this type of scenario? (I am using PHP for the indexing.)
Any help will be appreciated, thanks.
The point of an inverted index is to allow for extremely fast access to the list of occurrences (the postings list) for any given term. If you want to implement it using simple, readily-available data structures, then the best you can probably do is
Use a hash to store the mapping from terms to postings lists
Store each postings list as a contiguous block of sorted integers (e.g. something like ArrayList in Java or std::vector in C++); see the sketch below. Do not use a linked list, because that wastes a huge amount of space on pointers
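A minimal sketch of those two points, with plain Python lists standing in for contiguous integer arrays (the documents are made up):

```python
# Sketch: inverted index as a hash from term -> sorted list of document ids.
from collections import defaultdict

docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat ran"}

index = defaultdict(list)                 # term -> postings list (sorted doc ids)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)        # appended in increasing doc_id order

# AND query: intersect two sorted postings lists in one linear pass
def intersect(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(index["cat"], index["sat"]))   # -> [1]
```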
A more proper (and more sophisticated) implementation would take into account:
That postings lists can get very large, so you would have to break them up into multiple chunks, each stored as one contiguous block
That postings lists can and should be compressed (a variable-byte sketch follows after this list)
Detailed descriptions of these techniques can be found in the classic book Managing Gigabytes.
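One common compression scheme of the kind described there is variable-byte coding of the gaps between doc ids; a rough sketch:

```python
# Sketch: compress a sorted postings list by storing gaps between doc ids,
# each gap encoded as a variable number of 7-bit bytes (high bit = continuation).
def encode(postings):
    out = bytearray()
    prev = 0
    for doc_id in postings:
        gap = doc_id - prev
        prev = doc_id
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(byte | 0x80)   # more bytes follow for this gap
            else:
                out.append(byte)          # last byte of this gap
                break
    return bytes(out)

def decode(data):
    postings, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            postings.append(prev)
            current, shift = 0, 0
    return postings

ids = [3, 7, 150, 151, 100000]
assert decode(encode(ids)) == ids   # round-trips; small gaps take only one byte
```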
We have the Trie structure to efficiently access data when the key to that data set is a string. What would be the best possible index if the key to a data set is an image?
By key, I mean something that uniquely distinguishes the data. Is accessing data by an image a less frequently used scenario? I do feel there are applications where it is used, like a fingerprint database.
Does hashing help in this case? I mean hashing the image into a unique number, based on pixel values.
Please share any pointers on this.
cheers
You could use a hash function to find an item based on an image. But I see little practical use for this scenario.
Applications such as fingerprint recognition, face recognition, or object identification perform a feature extraction process. This means they convert the complex image structure into simpler feature vectors that can be compared against stored patterns.
The real hard work is the feature extraction process, which must separate the important information from the 'noise' in the image.
Just hashing the image will yield no usable features. The only situation in which I would think about hashing an image to find some information is building an image database. But even in this case, a common hash function such as SHA-1 or MD5 will be of little use, because modifying a single pixel or some metadata such as the author will change the hash and make it impossible to match the two images based on such a hash.
I'm not 100% sure what you're trying to do, but hashing should give you a unique string to identify an image with. You didn't specify your language, but most have a function to hash an entire file's data, so you could just run the image file through that. (For example, PHP has md5_file())
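For example, a rough Python equivalent of PHP's md5_file() (the file name is just a placeholder):

```python
# Sketch: derive a lookup key by hashing the raw bytes of an image file.
import hashlib

def image_key(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# key = image_key("photo.jpg")   # same bytes -> same key; any change -> different key
```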
It's unclear what problem you're trying to solve. You can definitely obtain a hash for an entire image and use that as a key in a Trie structure, although I think in this case the Trie structure would give you almost no performance benefit over a regular hash table, because you are performing a (large) hash every time you do a lookup.
If you are implementing something where you want to compare two images or find similar images in the tree quickly, you might consider using the GIF or JPEG header of the image as the beginning of the key. This would cause images with similar type, size, index colors, etc. to be grouped near each other within the Trie structure. You could then compute a hash for the image only if there was a collision (that is, multiple images in the Trie with the exact same header).
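A rough sketch of that header-as-key-prefix idea; the number of header bytes used and the choice of SHA-1 for the disambiguating suffix are arbitrary here:

```python
# Sketch: build a trie key whose prefix is the raw image file header,
# so files of similar type/size/palette cluster near each other in the trie,
# and whose suffix is a content hash to disambiguate header collisions.
import hashlib

HEADER_BYTES = 32   # arbitrary; enough to cover e.g. a GIF logical screen descriptor

def trie_key(path):
    with open(path, "rb") as f:
        data = f.read()
    header = data[:HEADER_BYTES].hex()
    digest = hashlib.sha1(data).hexdigest()
    return header + ":" + digest

# trie_key("photo.gif") -> "47494638...:<sha1>"  (all GIFs share the "GIF8" prefix)
```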