File format needed for an inverted index

I have been working on inverted indexing, which indexes a document collection, stores each term with its information, and also stores its references in a postings file (document ID, location, etc.).
Currently I store it in a .txt file, which requires string matching against that file for each and every query. This takes more time and is also rather complex.
Now I want to store that information in a file using a linked-list-style data structure. Is this possible for this type of scenario? (I am using PHP for the indexing.)
Any help will be appreciated, thanks.

The point of an inverted index is to allow extremely fast access to the list of occurrences (the postings list) for any given term. If you want to implement it using simple, readily available data structures, then the best you can probably do is:
Use a hash table to store the mapping from terms to postings lists
Store each postings list as a contiguous block of sorted integers (i.e. something like ArrayList in Java or std::vector in C++). Do not use a linked list, because that wastes a huge amount of space on pointers. A minimal sketch of both points follows.
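Assuming plain C (all names here are invented for illustration; a real implementation would add error handling and deallocation), the two points above might look like this:

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 4096

    /* One term's postings: a contiguous, growable array of doc IDs. */
    typedef struct Postings {
        char *term;
        int *doc_ids;               /* contiguous block of sorted doc IDs */
        size_t len, cap;
        struct Postings *next;      /* chaining for hash collisions */
    } Postings;

    static Postings *buckets[NBUCKETS];

    static unsigned hash(const char *s) {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Append doc_id to the postings list for term, creating the list if
       needed. Assumes doc IDs arrive in increasing order (as they do when
       documents are indexed one after another), so the block stays sorted. */
    void add_posting(const char *term, int doc_id) {
        unsigned h = hash(term);
        Postings *p = buckets[h];
        while (p && strcmp(p->term, term) != 0)
            p = p->next;
        if (!p) {
            p = malloc(sizeof *p);
            p->term = strdup(term);
            p->len = 0;
            p->cap = 4;
            p->doc_ids = malloc(p->cap * sizeof *p->doc_ids);
            p->next = buckets[h];
            buckets[h] = p;
        }
        if (p->len == p->cap) {
            p->cap *= 2;
            p->doc_ids = realloc(p->doc_ids, p->cap * sizeof *p->doc_ids);
        }
        if (p->len == 0 || p->doc_ids[p->len - 1] != doc_id)
            p->doc_ids[p->len++] = doc_id;  /* skip repeat hits in one doc */
    }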
A more proper (and more sophisticated) implementation would take into account:
That postings lists can get very large, so you would have to break each one into multiple chunks, each stored as one contiguous block
That postings lists can and should be compressed
Detailed descriptions of these techniques can be found in the classic book Managing Gigabytes.

Related

C data structure suitable for fast search and simple add/remove

As stated in the question title, I need a data structure suitable for fast and efficient searching. The data structure should also allow adding/removing elements at any position inside it.
Currently I'm using a linked list. But the problem is that I have to walk through the list to find a desired element. General search algorithms (binary search, jump search and ...) are not directly usable on linked lists, as there is no random access to the list elements. Sorting the list elements, which these algorithms require, is also a problem.
On the other hand, I can't use arrays, as it's hard to add/remove an element at any desired index.
I've looked for search algorithms on linked lists and came across skip lists. Now I'm here to ask if there is a better data structure for my case, or a better search algorithm for linked lists.
I would use an AVL binary search tree.
For an example of a binary search tree you can take a look at https://www.geeksforgeeks.org/avl-tree-set-1-insertion/ and https://www.geeksforgeeks.org/avl-tree-set-2-deletion/
They are well detailed, with C code and diagrams.
An AVL tree is efficient to search in, and it allows you to add and delete values.
It works for both numeric keys and character-based keys (such as a dictionary).
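For reference, a minimal insertion-only AVL tree in C might look like the sketch below; it follows the same scheme as the linked articles but is only a sketch, not a tuned implementation:

    #include <stdlib.h>

    typedef struct Node {
        int key;
        int height;
        struct Node *left, *right;
    } Node;

    static int height(Node *n) { return n ? n->height : 0; }
    static int imax(int a, int b) { return a > b ? a : b; }
    static int balance(Node *n) { return n ? height(n->left) - height(n->right) : 0; }

    static Node *new_node(int key) {
        Node *n = malloc(sizeof *n);
        n->key = key;
        n->height = 1;
        n->left = n->right = NULL;
        return n;
    }

    static Node *rotate_right(Node *y) {
        Node *x = y->left;
        y->left = x->right;
        x->right = y;
        y->height = 1 + imax(height(y->left), height(y->right));
        x->height = 1 + imax(height(x->left), height(x->right));
        return x;
    }

    static Node *rotate_left(Node *x) {
        Node *y = x->right;
        x->right = y->left;
        y->left = x;
        x->height = 1 + imax(height(x->left), height(x->right));
        y->height = 1 + imax(height(y->left), height(y->right));
        return y;
    }

    /* Insert key and rebalance on the way back up; duplicates are ignored. */
    Node *avl_insert(Node *node, int key) {
        if (!node) return new_node(key);
        if (key < node->key)
            node->left = avl_insert(node->left, key);
        else if (key > node->key)
            node->right = avl_insert(node->right, key);
        else
            return node;

        node->height = 1 + imax(height(node->left), height(node->right));
        int b = balance(node);

        if (b > 1 && key < node->left->key)   return rotate_right(node); /* LL */
        if (b < -1 && key > node->right->key) return rotate_left(node);  /* RR */
        if (b > 1 && key > node->left->key) {                            /* LR */
            node->left = rotate_left(node->left);
            return rotate_right(node);
        }
        if (b < -1 && key < node->right->key) {                          /* RL */
            node->right = rotate_right(node->right);
            return rotate_left(node);
        }
        return node;
    }

Search is the usual binary-search-tree walk, O(lg N) because the tree stays balanced.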

Storing Inverted Index

I know that an inverted index is a good way to index words, but what I'm confused about is how search engines actually store it. For example, if the word "google" appears in documents 2, 4, 6 and 8 with different frequencies, where should those postings be stored? Would a database table with a one-to-many relation do any good for storing them?
It is highly unlikely that full-fledged SQL-like databases are used for this purpose. It is called an inverted index because it is just an index: each entry is just a reference. This access pattern is also why non-relational databases and key-value stores came up as a favourite topic in relation to web technology.
You only ever have one way of accessing the data (by query word). That is why it's called an index.
Each entry is a list/array/vector of references to documents, so each element of that list is very small. The only other information worth storing besides the document ID would be a tf-idf score for each element.
How to use it:
If you have a single query word ("google"), you look up in the inverted index which documents this word turns up in (2, 4, 6, 8 in your example). If you have tf-idf scores, you can sort the results to report the best-matching document first. You then look up which documents the IDs 2, 4, 6, 8 refer to and report their URL, a snippet, etc. URLs, snippets, etc. are probably best stored in another table or key-value store.
If you have multiple query words ("google" and "altavista"), you look into the inverted index for both query words and get two lists of document IDs (2,4,6,8 and 3,7,8,11,19). You take the intersection of the two lists, which in this case is (8): the list of documents in which both query words occur. A sketch of this step follows.
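In C, the intersection of two sorted postings lists is a simple linear merge (doc IDs taken from the example above):

    #include <stdio.h>

    /* Write the doc IDs present in both sorted lists into out;
       return how many there were. */
    size_t intersect(const int *a, size_t na,
                     const int *b, size_t nb, int *out) {
        size_t i = 0, j = 0, k = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])
                i++;
            else if (a[i] > b[j])
                j++;
            else {
                out[k++] = a[i];
                i++;
                j++;
            }
        }
        return k;
    }

    int main(void) {
        int google[]    = {2, 4, 6, 8};
        int altavista[] = {3, 7, 8, 11, 19};
        int both[4];
        size_t n = intersect(google, 4, altavista, 5, both);
        for (size_t i = 0; i < n; i++)
            printf("%d\n", both[i]);   /* prints 8 */
        return 0;
    }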
It's a fair bet that each of the major search engines has its own technology for handling inverted indexes. It's also a moderately good bet that they're not based on standard relational database technology.
In the specific case of Google, it is a reasonable guess that the current technology used is derived from the BigTable technology described in 2006 by Fay Chang et al in Bigtable: A Distributed Storage System for Structured Data. There's little doubt that the system has evolved since then, though.
Traditionally, an inverted index is written directly to a file and stored on disk somewhere. If you want to do Boolean retrieval (either a document contains all the words in the query or it does not), the postings might look like this, stored contiguously in the file:
Term_ID_1:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.Term_ID_2:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.Term_ID_N:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N
The term ID identifies a term, the frequency is the number of docs the term appears in (in other words, the length of the postings list), and the doc IDs identify the documents that contain the term.
Along with the index, you need to know where everything is in the file, so the mappings also have to be stored somewhere, in another file. For instance, given a term_id, the map needs to return the file position of that term's entry, and then it is possible to seek to that position. Since the frequency is recorded in the postings, you know how many doc_ids to read from the file. In addition, there need to be mappings from the IDs to the actual term/doc names. A sketch of the lookup follows.
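Assuming a binary variant of the layout above (the on-disk format, types, and names here are invented for illustration), the lookup might look like this in C:

    #include <stdio.h>

    /* One row of the separately stored term_id -> file offset map. */
    typedef struct {
        int term_id;
        long offset;   /* byte position of the term's entry in the postings file */
    } OffsetEntry;

    /* Seek to a term's entry, read its doc frequency, then read that many
       doc IDs. Returns the number of doc IDs read, or -1 on error. */
    int read_postings(FILE *postings, long offset, int *doc_ids, int max_docs) {
        int freq, n = 0;
        if (fseek(postings, offset, SEEK_SET) != 0)
            return -1;
        if (fread(&freq, sizeof freq, 1, postings) != 1)
            return -1;
        while (n < freq && n < max_docs &&
               fread(&doc_ids[n], sizeof doc_ids[n], 1, postings) == 1)
            n++;
        return n;
    }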
If you have a small use case, you may be able to pull this off with SQL by using blobs for the postings list and handling the intersection yourself when querying.
Another strategy for a very small use case is to use a term document matrix.
Possible Solution
One possible solution would be to use a positional index. It's basically an inverted index, but we augment it by adding more information. You can read more about it at Stanford NLP.
Example
Say a word "hello" appeared in docs 1 and 3, in positions (3,5,6,200) and (9,10) respectively.
Basic inverted index (note there's no way to find word frequencies, nor their positions)
"hello" => [1,3]
Positional index (note that we not only have frequencies for each doc, we also know exactly where the term appeared in the doc)
"hello" => [1:<3,5,6,200> , 3:<9,10>]
Heads Up
Will your index take a lot more size now? You bet!
That's why it's a good idea to compress the index. There are multiple options for compressing the postings lists using gap encoding (sketched below), and even more options for compressing the dictionary, using general string-compression algorithms.
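A small C sketch of gap encoding combined with variable-byte compression, one common scheme among the options mentioned above:

    #include <stddef.h>

    /* Encode sorted doc IDs as gaps, 7 bits per byte, high bit set on all
       but the last byte of each gap. Returns the compressed size in bytes.
       out must be large enough (5 bytes per doc ID is a safe upper bound). */
    size_t vbyte_encode(const int *doc_ids, size_t n, unsigned char *out) {
        size_t k = 0;
        int prev = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned gap = (unsigned)(doc_ids[i] - prev); /* small for dense lists */
            prev = doc_ids[i];
            while (gap >= 128) {
                out[k++] = (unsigned char)(gap & 0x7F) | 0x80; /* more bytes follow */
                gap >>= 7;
            }
            out[k++] = (unsigned char)gap; /* final byte: high bit clear */
        }
        return k;
    }

Because postings are sorted, the gaps are small for frequent terms, so most gaps fit in a single byte.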
Related Readings
Index compression
Postings file compression
Dictionary compression

Should I store large tree in a database?

I want to try writing a dictionary application. From other questions on SO, I learned about data structures that should be considered (a lookup sketch for one of them follows the list):
DAWG
ternary search tree
double-array trie
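For concreteness, here is roughly what lookup in a ternary search tree looks like in C (a sketch, not a tuned implementation):

    /* Ternary search tree node: each node splits on one character. */
    typedef struct TstNode {
        char ch;                      /* split character */
        struct TstNode *lo, *eq, *hi; /* <, ==, > subtrees */
        int is_word;                  /* nonzero if a word ends here */
    } TstNode;

    /* Return nonzero if the non-empty string s is in the tree. */
    int tst_contains(const TstNode *n, const char *s) {
        while (n) {
            if (*s < n->ch)
                n = n->lo;
            else if (*s > n->ch)
                n = n->hi;
            else {
                if (s[1] == '\0')
                    return n->is_word;
                s++;
                n = n->eq;
            }
        }
        return 0;
    }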
However, it may just be my prejudice, but storing the data in a binary file and mapping it into memory directly as a data structure seems like a heavy-handed approach. I'd normally consider putting it in a graph database, but I imagine that the DB-related overhead (IDs, hashes and the like) could destroy the space gains that come from using the aforementioned structures, compared to a trie.
The dictionary will be large, containing all the words in English. Should I store it in some sort of database or refrain from doing so?

How to save and load a giant hash-table to-n-fro from disk?

I am trying to write a search engine for a large collection, for learning purposes. I started with my own intuitions, then researched, and have finally arrived at a working model.
I am constructing a giant hash table to hold all the terms in my collection. It is very expensive to build it from the collection. Once I have computed the table, I want to save it to disk, so that whenever I want to access it in my program later, I can load it from disk again.
Is there any standard way of doing this, or do I have to invent my own file format and hacks to do it?
Note: the hash table only stores the term occurrences; I am planning to store the main ranking data in a postings file and keep a pointer to it in the corresponding term's hash-table entry.
I am working in C.
BDB (Berkeley DB) is a library for efficiently managing flat-file databases. In particular, a hash table format is supported. B-trees are also available, in case ordered access is required. A minimal sketch follows.
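Roughly, with the classic Berkeley DB C API (a sketch, untested; check the BDB docs for the exact flags your version expects, and link with -ldb):

    #include <string.h>
    #include <db.h>

    /* Open (or create) an on-disk hash table. */
    DB *open_index(const char *path) {
        DB *dbp;
        if (db_create(&dbp, NULL, 0) != 0)
            return NULL;
        if (dbp->open(dbp, NULL, path, NULL, DB_HASH, DB_CREATE, 0664) != 0) {
            dbp->close(dbp, 0);
            return NULL;
        }
        return dbp;
    }

    /* Persist one term -> postings-file-offset pair. */
    int store_term(DB *dbp, const char *term, long postings_offset) {
        DBT key, data;
        memset(&key, 0, sizeof key);
        memset(&data, 0, sizeof data);
        key.data = (void *)term;
        key.size = strlen(term) + 1;
        data.data = &postings_offset;
        data.size = sizeof postings_offset;
        return dbp->put(dbp, NULL, &key, &data, 0);
    }

This matches the question's plan: the hash table on disk holds the terms, and each value is just a pointer (offset) into the postings file.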

C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, let's say I want to create a Hall of Fame for a game, sortable by top score (independent of username), by username (with all scores by the same user grouped together before users are ranked by their highest scores), or by level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by each player's top score, searching on the other fields -- like level and non-highest scores -- becomes iterative (i.e. iterating over everything looking for the level, or for a specific score range), unless I conceive some other way to keep the information sorted as I enter new data.
The question is whether there is a more efficient (albeit more complicated and memory-consumptive) method or database structure in C/C++ that is suited to this kind of multi-field sort. Linked lists seem fine for simple score rankings, and I could even organize a hash table keyed on a single field (player name, or level reached) to sort by that field, but then the other fields take O(N) to find, and worse to sort. With just three fields, I wonder if there is a way (like sets or secondary lists) to avoid iterating for certain pre-desired sorts that we know about beforehand.
Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main body. A sketch follows below.)
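A minimal C sketch of the idea, with invented field names: one array owns the records, and each index is an array of pointers sorted with a different comparator:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        char name[32];
        int score;
        int level;
    } Entry;

    /* Comparators receive pointers to the index slots, i.e. Entry**. */
    static int by_score(const void *a, const void *b) {
        const Entry *x = *(const Entry * const *)a;
        const Entry *y = *(const Entry * const *)b;
        return y->score - x->score;   /* descending score */
    }

    static int by_level(const void *a, const void *b) {
        const Entry *x = *(const Entry * const *)a;
        const Entry *y = *(const Entry * const *)b;
        return y->level - x->level;   /* descending level */
    }

    int main(void) {
        Entry main_data[] = {
            {"alice", 9000, 12}, {"bob", 7500, 15}, {"carol", 8800, 9}
        };
        size_t n = sizeof main_data / sizeof main_data[0];

        /* Two index structures: pointers into main_data, not copies. */
        Entry *score_idx[3], *level_idx[3];
        for (size_t i = 0; i < n; i++)
            score_idx[i] = level_idx[i] = &main_data[i];

        qsort(score_idx, n, sizeof *score_idx, by_score);
        qsort(level_idx, n, sizeof *level_idx, by_level);

        for (size_t i = 0; i < n; i++)
            printf("%s %d\n", score_idx[i]->name, score_idx[i]->score);
        return 0;
    }

Sorted arrays stand in here for the tree-based indexes suggested below; the pointer-index idea is the same either way.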
Using ordered linked list for your index structures will give you a fast and simple way to go through the records in order, but it will be slow if you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the values of the fields in a single entry are independent, the only way to get what you want is to keep three different lists (using the data structure of your choice), each sorted on a different field. You'll use three lists' worth of pointers instead of one.
As for what data structure each of the lists should be, a binary max heap is effective. Insertion is O(lg N); peeking at the top entry is O(1), and extracting each entry in order is O(lg N), so O(N lg N) to see all of them. If in some of the list copies the entries need to be sub-sorted by another field, just account for that in the comparison function (see the sketch below).
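A sketch of such a comparison function, reusing the hypothetical Entry fields from the earlier sketch:

    #include <string.h>

    typedef struct {
        char name[32];
        int score;
        int level;
    } Entry;

    /* Group entries by username, then order each user's scores from
       highest to lowest: the "all scores by the same user placed
       together" ordering from the question. */
    int by_name_then_score(const void *a, const void *b) {
        const Entry *x = a, *y = b;
        int c = strcmp(x->name, y->name);
        if (c != 0)
            return c;
        return y->score - x->score;
    }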
