I know that an inverted index is a good way to index words, but what I'm confused about is how search engines actually store it. For example, if the word "google" appears in documents 2, 4, 6, and 8 with different frequencies, where should that be stored? Would a database table with a one-to-many relation do any good for storing it?
It is highly unlikely that full-fledged SQL-style relational databases are used for this purpose. It is called an inverted index because it is just an index: each entry is just a reference. This access pattern is a big part of why non-relational databases and key-value stores came up as a favourite topic in relation to web technology.
You only ever have one way of accessing the data (by query word). That is why it's called an index.
Each entry is a list/array/vector of references to documents, so each element of that list is very small. The only other information to store besides the document ID would be a tf-idf score for each element.
How to use it:
If you have a single query word ("google"), you look up in the inverted index which documents this word turns up in (2, 4, 6, 8 in your example). If you have tf-idf scores, you can sort the results to report the best-matching document first. You then look up which documents the IDs 2, 4, 6, 8 refer to, and report their URLs as well as snippets etc. URLs, snippets and so on are probably best stored in another table or key-value store.
If you have multiple query words ("google" and "altavista"), you look up both words in the inverted index and get two lists of document IDs (2,4,6,8 and 3,7,8,11,19). You take the intersection of the two lists, which in this case is (8): the documents in which both query words occur.
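As a rough sketch of that lookup-and-intersect step, here is a minimal in-memory version in Python; the document IDs and tf-idf scores are invented purely for illustration:

```python
# Toy inverted index: term -> {doc_id: tf-idf score}.
# Document IDs and scores are invented for illustration.
index = {
    "google":    {2: 0.8, 4: 0.3, 6: 0.5, 8: 0.9},
    "altavista": {3: 0.4, 7: 0.2, 8: 0.6, 11: 0.1, 19: 0.7},
}

def search(query_terms):
    """Return doc IDs that contain every query term, best score first."""
    postings = [index.get(t, {}) for t in query_terms]
    if not postings:
        return []
    doc_ids = set(postings[0])
    for p in postings[1:]:          # intersect the postings lists
        doc_ids &= set(p)
    # Rank by summed tf-idf across the query terms.
    return sorted(doc_ids, key=lambda d: sum(p[d] for p in postings), reverse=True)

print(search(["google"]))               # [8, 2, 6, 4]
print(search(["google", "altavista"]))  # [8]
```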
It's a fair bet that each of the major search engines has its own technology for handling inverted indexes. It's also a moderately good bet that they're not based on standard relational database technology.
In the specific case of Google, it is a reasonable guess that the current technology used is derived from the BigTable technology described in 2006 by Fay Chang et al in Bigtable: A Distributed Storage System for Structured Data. There's little doubt that the system has evolved since then, though.
Traditionally, an inverted index is written directly to a file and stored on disk somewhere. If you want to do Boolean retrieval (a document either contains all the words in the query or it does not), the postings might look like the following, stored contiguously in the file:
Term_ID_1:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.Term_ID_2:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.Term_ID_N:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N
The term ID is the ID of a term, the frequency is the number of docs the term appears in (in other words, the length of its postings list), and each doc ID is a document that contains the term.
Along with the index, you need to know where everything sits in the file, so the mappings also have to be stored somewhere, in another file. For instance, given a term_id, the map returns the file position of that term's postings so you can seek straight to it. Since the frequency is recorded in the postings, you know how many doc_ids to read from the file. In addition, there need to be mappings from the IDs to the actual term and document names.
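To make that layout concrete, here is a small sketch in Python; the binary format (little-endian 32-bit integers) is an assumption for illustration, not what any particular engine uses:

```python
import struct

def write_index(postings, path):
    """Write {term_id: [doc_id, ...]} contiguously; return term_id -> byte offset."""
    offsets = {}
    with open(path, "wb") as f:
        for term_id, doc_ids in postings.items():
            offsets[term_id] = f.tell()
            f.write(struct.pack("<I", len(doc_ids)))             # frequency
            f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))  # the postings
    return offsets  # in practice this map is itself persisted in another file

def read_postings(path, offset):
    """Seek to a term's postings and read its doc IDs."""
    with open(path, "rb") as f:
        f.seek(offset)
        (freq,) = struct.unpack("<I", f.read(4))
        return list(struct.unpack(f"<{freq}I", f.read(4 * freq)))

offsets = write_index({1: [2, 4, 6, 8], 2: [3, 7, 8]}, "postings.bin")
print(read_postings("postings.bin", offsets[1]))  # [2, 4, 6, 8]
```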
If you have a small use case, you may be able to pull this off with SQL by using blobs for the postings list and handling the intersection yourself when querying.
Another strategy for a very small use case is to use a term-document matrix.
Possible Solution
One possible solution would be to use a positional index. It's basically an inverted index, but we augment it by adding more information. You can read more about it at Stanford NLP.
Example
Say a word "hello" appeared in docs 1 and 3, in positions (3,5,6,200) and (9,10) respectively.
Basic Inverted Index (note there's no way to find word frequencies or their positions)
"hello" => [1,3]
Positional Index (note that we not only have frequencies for each doc, we also know exactly where the term appeared in each doc)
"hello" => [1:<3,5,6,200> , 3:<9,10>]
Heads Up
Will your index take up a lot more space now? You bet!
That's why it's a good idea to compress the index. There are multiple options to compress the postings list using gap encoding, and even more options to compress the dictionary, using general string compression algorithms.
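As an illustration, here is a minimal sketch of gap encoding combined with variable-byte compression, the textbook scheme covered in the readings below; it is not tied to any particular engine:

```python
def vb_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]          # least significant 7 bits first
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80            # continuation-stop bit on the final byte
        out.extend(reversed(chunk))
    return bytes(out)

def compress_postings(doc_ids):
    """Store the gaps between sorted doc IDs instead of the IDs themselves."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vb_encode(gaps)

print(compress_postings([824, 829, 215406]).hex())  # '06b8850d0cb1': 6 bytes for 3 doc IDs
```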
Related Readings
Index compression
Postings file compression
Dictionary compression
Related
I'm new to databases and don't have a firm grasp on how indexing works.
I'm looking into indexing a column in my table that contains a weighted tsvector (the title is given the greatest weight, followed by the subheading and then the paragraph contents). According to the Postgres documentation, GIN is the best index type to use for full text search, followed by GiST. However, there is a note in chapter 12.9:
GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.
Does this mean that GIN is inefficient in my use case and I should go with GiST, or is it still the best one to use? I'm using the latest Postgres version (12).
No, you should stick with GIN indexes.
The index scan acts as a filter and hopefully eliminates most of the rows, so that only few have to be rechecked.
You probably have to fetch the table rows anyway, so unless there are many false positives found during the index scan, that won't be a lot of extra work.
The best thing would be to run some benchmarks on your data set; that would give you an authoritative answer as to which index is better in your case.
To find out how many false positives were eliminated during the bitmap heap scan, you can examine the output of EXPLAIN (ANALYZE, BUFFERS) for the query.
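For example, a sketch of pulling that plan from Python with psycopg2; the DSN, table, and column names are placeholders:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        EXPLAIN (ANALYZE, BUFFERS)
        SELECT id FROM articles
        WHERE body_tsv @@ to_tsquery('english', 'google & altavista')
    """)
    for (line,) in cur.fetchall():
        print(line)
# Look for "Rows Removed by Index Recheck" in the Bitmap Heap Scan node
# to see how many false positives the recheck had to discard.
```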
The implementation of GiST indexes for tsvector is lossy, so they also need to consult the table. That part of the documentation is odd, as it seems to be contrasting GIN with GiST, but neither GIN nor GiST stores the weights, so there is nothing to contrast. (GiST doesn't even store the values, much less the weights, just a hashed signature of the value.)
Also, weights are only used when ranking, not when searching.
About the only time GiST would be preferred for tsvector is if you want a multicolumn index where you will be ANDing together selective criteria on the different columns.
New to search databases and working with one. What is the difference between full text and free text search/index?
They are kind of the same; more precisely, they are just synonyms.
They are techniques used by search engines to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have a large number of documents to search and can't use LIKE queries with a normal RDBMS for performance reasons.
Mainly it follows two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
Suppose you typed John and Ryan; the query will return all the documents that contain either "John" or "Ryan". Order and case don't matter.
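As a toy illustration of those two stages (a real engine like Lucene also does analysis, stemming, scoring, and much more):

```python
from collections import defaultdict

docs = {1: "John met Ryan", 2: "ryan went home", 3: "nobody here"}

# Indexing stage: scan the text once and build term -> doc IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Search stage: only the index is consulted, not the original text.
hits = set().union(*(index.get(t.lower(), set()) for t in ["John", "Ryan"]))
print(sorted(hits))  # [1, 2] -- either term matches, regardless of case or order
```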
In a nutshell, unless you are using the terms in some specific, specialised sense, they are just different names for the same thing.
Call him Cristiano or CR7, he's the same person :)
I am building a search engine. I am using NoSQL variety key-value datastores, specifically Amazon SimpleDB, and not a regular RDBMS. I have a table of URLs that point to web pages. I think I need to build another table which can be used to look up which pages contain a given English word.
The structure of this table is: Search (String word, String URL) and my queries would look like select from Search where word = "foo"
Should I hash the words before storing them and before lookup? I.e., should I use a table Search (String word_hash, String URL) and queries like select from Search where word = "acbd18db4cc2f85cedef654fccc4a4d8"?
Unless you are doing this as an exercise, don't build your own. Use Sphinx or something similar.
If this is an exercise, points for ambition! A search engine is a big project.
I don't see any value in hashing the words yourself. The hash table already does that internally (it's a hash table). Later on you might want to do basic spelling corrections, or allow "books" to also match "book", for example, and at that point it will help to have plain text words.
The jury is still out for the general case. While it seems that the database would hash internally, there is one important counter-example: BigTable lists it as a specific benefit that URL keys such as "com.example.foo/*.html" cluster together, which makes it easier to build the Google search index (see the Bigtable paper for details).
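A tiny illustration of that clustering benefit, with made-up URLs: keys stored in reversed-hostname form sort so that one site's pages sit next to each other, which hashing the key would destroy.

```python
urls = [
    "http://www.example.com/index.html",
    "http://maps.example.com/about.html",
    "http://www.other.org/page.html",
]

def row_key(url):
    """BigTable-style row key: reverse the hostname so one site's pages sort together."""
    host, _, path = url.split("://", 1)[1].partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

for key in sorted(map(row_key, urls)):
    print(key)
# com.example.maps/about.html
# com.example.www/index.html
# org.other.www/page.html
```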
We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
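As a rough sketch of that known-types idea, assuming the types are kept as regular expressions (the patterns here are invented from the example records):

```python
import re

# Hypothetical "known types": each is a regex with the variable parts generalised.
KNOWN_TYPES = [
    ("transaction-assignment",
     re.compile(r"^Change Transaction \w+ Assigned To Server \w+$")),
]

def classify(record):
    """Return the first known type a record matches, or None if it looks new."""
    for name, pattern in KNOWN_TYPES:
        if pattern.match(record):
            return name
    return None  # candidate for a new "known type"

print(classify("Change Transaction ABC123 Assigned To Server US91"))  # transaction-assignment
print(classify("Disk quota exceeded on /var/log"))                    # None
```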
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of two consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare a query in a similar way; the search engine should give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
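Outside of Lucene, generating those word n-grams is straightforward; a minimal Python sketch:

```python
def word_ngrams(text, n):
    """Return n-grams of consecutive words, joined with underscores."""
    words = text.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

entry = "Change Transaction XYZ789 Assigned To Server GB47"
print(word_ngrams(entry, 2))
# ['Change_Transaction', 'Transaction_XYZ789', 'XYZ789_Assigned',
#  'Assigned_To', 'To_Server', 'Server_GB47']
print(word_ngrams(entry, 3)[:2])
# ['Change_Transaction_XYZ789', 'Transaction_XYZ789_Assigned']
```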
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. Usually, though, the "information retrieval" approach is only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (that may however be sparse) e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return you some similar existing log lines, without much quantitative information about how common that kind of line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
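If you go the clustering route, a rough sketch with scikit-learn (my substitution for illustration; the answers above don't prescribe a particular library) could look like this; eps and min_samples would need tuning on real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

logs = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "User login failed for account jsmith",
    "User login failed for account mbrown",
]

# Turn each line into a sparse TF-IDF vector; word bigrams help with
# machine-generated text, as suggested above.
vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(logs)

# Density-based clustering on cosine distance.
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: the two transaction lines and the two login lines group together
```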
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
I'm currently looking at developing a Solr application to index products on our e-commerce website.
Some example fields in the schema are:
ProductID
ProductName
Description
Price
Categories (multi-value)
Attributes
Attributes are a list of key-value pairs.
For example:
Type = Rose
Position = Full Sun
Position = Shade
Colour = Red
I am going to store the fields, so that my pages can be generated from a search result.
How is it best to represent these?
I was thinking of maybe having some dynamic fields for indexing:
attribute_* (for example, attribute_position)
And then an “attribute” field for the stored values (for returning/displaying), storing multiple values per document.
The value of each “attribute” entry could be, for example, Position|Full Sun, and then let the client handle the display?
Are there any better ways of doing this?
As a footnote- I will be using Solrnet as a client for querying (probably not relevant)
First, I would not recommend storing your entire document in your search engine. The only thing you should store in Solr is those things that you wish to search on. Yes, it supports storing more, however, taking advantage of this can cause issues down the road with index size, master/slave replication time, etc. Ideally, the only thing in Solr is things you wish to search/sort on and a document ID that is unique enough to fetch document data with from another source that is optimized for storing .... documents.
However, if you decide to ignore this advice, then you can easily store your name value pairs in a single field. If your name value pairs have a limited character set, you can easily concatenate the name value pairs into a single string. Then, parse them on the way out when you are forming your web page for display. There's no need to come up with a more complex schema to support this. Multiple fields for storing these will only increase your index overhead, which buys you nothing.
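As a sketch of that concatenation, assuming the "|" separator proposed in the question never occurs in the attribute data:

```python
def encode_attributes(attributes):
    """Flatten (name, value) pairs into strings for one multi-valued stored field."""
    return [f"{name}|{value}" for name, value in attributes]

def decode_attributes(stored_values):
    """Parse the stored strings back into (name, value) pairs for display."""
    return [tuple(v.split("|", 1)) for v in stored_values]

attrs = [("Type", "Rose"), ("Position", "Full Sun"),
         ("Position", "Shade"), ("Colour", "Red")]

stored = encode_attributes(attrs)
print(stored)                     # ['Type|Rose', 'Position|Full Sun', 'Position|Shade', 'Colour|Red']
print(decode_attributes(stored))  # back to the original pairs
```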