Database for Google-type searches

We need to be able to perform fast searches against 10 million tweets we have stored. Any suggestions for a good database to use for this? We'd prefer to be able to run regular-expression searches, but it's sufficient to find all entries that contain a given word.
thanks - dave

Answer from the Microsoft MSDN forum (database for Bing-type searches):
Full-Text queries perform a linguistic search against this data,
operating on words and phrases based on rules of a particular
language.
A LIKE query against millions of rows of text data can
take minutes to return; whereas a full-text query can take only
seconds or less against the same data, depending on the number of rows
that are returned. We can use Full-Text Search to perform a fuzzy
search and then use LIKE clause to return the records that have an
exact match of our search conditions.
For more information, please refer to the following links:
Full-Text Search Overview:
http://msdn.microsoft.com/en-us/library/ms142571.aspx
SQL Server 2008 Full-Text Search: Internals and Enhancements
http://technet.microsoft.com/en-us/library/cc721269(SQL.100).aspx
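To illustrate why a word lookup can be so much faster than a LIKE scan: a full-text index is, at its core, an inverted index that maps each word to the rows containing it. The following is a minimal Python sketch of that idea only, not SQL Server's actual implementation; the sample tweets and all function names are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(rows):
    """Map each word to the set of row ids whose text contains that word."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in tokenize(text):
            index[word].add(row_id)
    return index

def search_word(index, word):
    """Whole-word lookup: one dictionary access, no scan of the rows."""
    return sorted(index.get(word.lower(), set()))

def search_like(rows, fragment):
    """Rough equivalent of LIKE '%fragment%': every row must be inspected."""
    fragment = fragment.lower()
    return sorted(rid for rid, text in rows.items() if fragment in text.lower())

# Hypothetical sample data
tweets = {
    1: "Lucene makes search fast",
    2: "Searching tweets with LIKE is slow",
    3: "Fast search needs an index",
}
index = build_index(tweets)
print(search_word(index, "search"))   # whole-word matches only
print(search_like(tweets, "search"))  # substring matches, including "Searching"
```

Note the trade-off the sketch makes visible: the index only answers whole-word queries, which is why partial-word matches still fall back to a scan.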

You could use Lucene.NET (http://incubator.apache.org/lucene.net/), which is used by Stack Overflow and RavenDB.

Related

Azure SQL Server - Full Text Search - Partial Words/Leading Wildcard

I've seen several questions on SO about the possibility of matching partial words in a Full-Text Search on SQL Server but they are all quite old so I'm posting to see if there is an update on the situation...
The Problem:
I have a keyword search running on a single field in a table that is using Full-Text Search.
I want to be able to match a partial word, not just a wildcard search from the start of a given word.
So, I know I can do:
Contains(table.myfield, '"par*"'), which will match things like party, partner, etc.
I also want to be able to say:
Contains(table.myfield, '"*par*"') to match things like spartan, sparing, etc.
Is it true to say that FTS cannot achieve this and I would have to resort to LIKE '%par%' to get the results I require?
Full-Text Search still does not allow a double wildcard (a wildcard on both sides of the term). However, you can now use Azure Search to run regular-expression searches on multiple columns at the same time using the Lucene query syntax. For example, to search for all jobs with either the term Senior or Junior, you can run the following search:
&queryType=full&$select=business_title&search=business_title:/(Sen|Jun)ior/

Fulltext search with partial strings on Postgresql

I was assigned to develop full-text search functionality on PostgreSQL 9.3, and I'd be very glad to hear other opinions and advice on this matter.
The problem is that I need to implement partial word matching. A user will send a string that can contain partial words, separated by spaces, in any order.
For example, the string "lue ped zeb" should find a row with "Blue striped zebra" in it (in one column). It should be case-insensitive and the order of words should not matter (but these conditions are insignificant for this question).
The problem is performance. There are over 5 million rows in the table being searched, and I need very small execution times.
An example query would be "SELECT * FROM table WHERE LOWER(text) LIKE ('%lue%ped%zeb%');", which I suspect will be VERY slow because the leading wildcard will cause the query to ignore indexes.
So far I've found http://www.sai.msu.su/~megera/wiki/wildspeed, which is an index that could help me (the size of the index doesn't really matter in this case), but the production server is running MS Windows and I don't know whether this extension will compile on Windows. (I will try it and update my question.)
I'm not a database developer and usually use Postgres only from applications, so I don't have much experience with database optimization and lower-level operations.
Does anyone have experience with a similar problem, a word of advice, or an example that can help me with this task?
Trigram (pg_trgm) is a contrib module for Postgres that can help you achieve your goal. There is a complete example of its usage in the docs.
Beginning in 9.1, trigram indexes support searches with the LIKE and ILIKE operators.
Beginning in 9.3, they also support regular-expression matches (the ~ and ~* operators).
But if you want to search for the provided partial words in any order, you should query for each word separately:
...
WHERE LOWER(text) LIKE '%lue%'
OR LOWER(text) LIKE '%ped%'
OR LOWER(text) LIKE '%zeb%'
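To show why a trigram index helps with patterns like '%lue%', here is a rough Python sketch of the idea: index every 3-character substring, use the index to find candidate rows, then recheck each candidate against the actual fragment. This is a simplification, not how pg_trgm is implemented (the real module pads and normalizes words, among other things), and the class and method names are hypothetical. The sketch assumes every fragment must match (AND); use a union instead of an intersection in `search_all` to get the OR behaviour of the query above.

```python
from collections import defaultdict

def trigrams(s):
    """Return the set of 3-character substrings of s, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    def __init__(self):
        self.rows = {}                    # row id -> original text
        self.postings = defaultdict(set)  # trigram -> row ids containing it

    def add(self, row_id, text):
        self.rows[row_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(row_id)

    def search_fragment(self, fragment):
        """Row ids whose text contains the fragment (case-insensitive)."""
        grams = trigrams(fragment)
        if grams:
            # Only rows containing every trigram of the fragment can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Fragments shorter than 3 characters cannot use the index.
            candidates = set(self.rows)
        needle = fragment.lower()
        # Recheck: sharing trigrams does not guarantee a real substring match.
        return {rid for rid in candidates if needle in self.rows[rid].lower()}

    def search_all(self, query):
        """Rows containing every space-separated fragment, in any order."""
        result = set(self.rows)
        for fragment in query.split():
            result &= self.search_fragment(fragment)
        return sorted(result)

# Hypothetical sample data
idx = TrigramIndex()
idx.add(1, "Blue striped zebra")
idx.add(2, "Red striped horse")
idx.add(3, "Blue whale")
print(idx.search_all("lue ped zeb"))
```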

Algorithms for key value pair, where key is string

I have a problem where there is a huge list of strings or phrases; it might scale from 100,000 to 100 million. When I search for a phrase and it is found, I get back the ID or index into the database for further operations. I know a hash table can be used for this, but I am looking for another algorithm that can generate an index based on strings and also be useful for other features such as autocomplete.
I read about suffix trees/arrays in some SO threads; they serve the purpose but consume a lot more memory than I can afford. Are there any alternatives?
My search is only over a huge list of millions of strings: no docs, no web pages, so I'm not interested in a search engine like Lucene.
I also read about inverted indexes, which sound helpful, but which algorithm do I need to study for them?
If this database index is within MS SQL Server, you may get good results with SQL Full-Text Indexing. Other SQL providers may have a similar function, but I would not be able to help with those.
Check out: http://www.simple-talk.com/sql/learn-sql-server/understanding-full-text-indexing-in-sql-server/
and
http://msdn.microsoft.com/en-us/library/ms142571.aspx
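Outside a database, one low-memory structure that covers both exact lookup and prefix autocomplete is a sorted array searched with binary search. The Python sketch below is only one possible approach under that assumption, not something proposed in the thread; the class name and sample phrases are hypothetical, and the "id" is simply the phrase's position in the input list.

```python
import bisect

class SortedPhraseIndex:
    def __init__(self, phrases):
        # Sort (phrase, id) pairs by phrase; id is the position in the input.
        self.entries = sorted((p.lower(), i) for i, p in enumerate(phrases))
        self.keys = [phrase for phrase, _ in self.entries]

    def lookup(self, phrase):
        """Exact match via binary search; returns the id or None."""
        key = phrase.lower()
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.entries[pos][1]
        return None

    def autocomplete(self, prefix, limit=10):
        """Up to `limit` phrases starting with the prefix, in sorted order."""
        key = prefix.lower()
        start = bisect.bisect_left(self.keys, key)
        matches = []
        for phrase in self.keys[start:start + limit]:
            if not phrase.startswith(key):
                break  # sorted order: no later phrase can match either
            matches.append(phrase)
        return matches

# Hypothetical sample data
phrase_index = SortedPhraseIndex(
    ["apple pie", "apple tart", "banana split", "apricot jam"]
)
print(phrase_index.lookup("banana split"))
print(phrase_index.autocomplete("ap"))
```

Lookups cost O(log n) comparisons and the structure stores each string once, but it only handles prefixes; matching in the middle of a string needs something like the suffix or n-gram structures mentioned in the question.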

full index checkbox while creating new database

I am creating a new database, designed basically for logging/history purposes. I'll make around 8-10 tables in this database, which will keep the data, and I'll retrieve it to show history information to the user.
I am creating the database from SQL Server 2005 and I can see that there is a check box labeled "Use full Indexing". I am not sure whether I should check it. As I am not very familiar with databases, can you tell me whether checking it will improve retrieval performance?
I think that is the check box for FULLTEXT indexing.
You turn it on only if you plan to do some natural language queries or a lot of text-based queries.
See here for a description of what it is used to support.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
From that base link, you can follow through to http://msdn.microsoft.com/en-us/library/ms142547.aspx (amongst others). This quote is interesting:
Comparison of LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works
on character patterns only. Also, you cannot use the LIKE predicate to
query formatted binary data. Furthermore, a LIKE query against a large
amount of unstructured text data is much slower than an equivalent
full-text query against the same data. A LIKE query against millions
of rows of text data can take minutes to return; whereas a full-text
query can take only seconds or less against the same data, depending
on the number of rows that are returned.
There is a cost for this, of course: the storage of the patterns and relationships between words in the same record. It is really useful if you are storing articles, for example, where you want to enable searching by "contains a, b and c". A LIKE pattern would be complicated and extremely slow to process, like LIKE '%A%B%C%' OR LIKE '%B%A%C%' OR ... and all the permutations for the order of appearance of A, B and C.
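To make the permutation cost concrete, this small Python sketch generates one LIKE pattern per ordering of the search terms. The function names and the `body` column are hypothetical, and the string formatting is for illustration only (it is not safe for untrusted input).

```python
from itertools import permutations

def like_patterns(terms):
    """One LIKE pattern per ordering of the terms, e.g. %A%B%C%."""
    return ["%" + "%".join(order) + "%" for order in permutations(terms)]

def like_where_clause(column, terms):
    """OR together every ordering; the number of branches grows as n!."""
    return " OR ".join(f"{column} LIKE '{p}'" for p in like_patterns(terms))

print(like_where_clause("body", ["A", "B", "C"]))
print(len(like_patterns(list("ABCDE"))), "patterns for 5 terms")
```

Separate predicates joined with AND (LIKE '%A%' AND LIKE '%B%' AND LIKE '%C%') avoid the permutations, but each one still scans every row.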

How to implement an Enterprise Search

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
- Column search
  - ability to search by column
  - we flag which columns in a table are searchable
- Keep some relation between db column and data
  - we provide advanced filtering on the results
  - facilitates (Amazon-style) filtering
  - filter provided by grouping of results and allowing the user to filter them via a checkbox
  - this is a great feature, users like it very much
- Partial Word Match
  - we have a lot of unique identifiers (product id, etc.)
  - the unique ids can have sub-parts with meaning (location, etc.)
  - or only a portion may be available (when the user is searching)
  - or (by a decidedly poor design decision) there may be white space in the id
  - this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (Oracle)
  - using the char index functions turned out to give equivalent performance (+/-) on MSSQL compared to full text
  - didn't test on Oracle
  - however, searches against both types of db are very fast
- We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
  - this is a huge win; Oracle Materialized views are better than MSSQL Indexed views
  - both provide speedups in read-only join situations (like a search combining company and product)
- A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
  - Full Text Search is not the best in this area (slow and inconsistent matching)
  - partial matching (see "Partial Word Match")
Nice to have:
- Search database in real time
  - skip the indexing step; this is not a hard requirement
- Spelling suggestion
  - Xapian has this: http://xapian.org/docs/spelling.html
  - similar to Google's "Did you mean:"
What we don't need:
- We don't need to index documents
  - at this point searching our data sources is the most important thing
  - even when we do search documents, we will be looking for partial word matching, etc.
- Ranking
  - Our own simple ranking algorithm has proven much better than an FTS equivalent.
  - Users understand it, we understand it, it's almost always relevant.
- Stemming
  - Just don't need to get [run|ran|running]
- Advanced search operators
  - phrase matching, or/and, etc.
  - according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
    - most users are using simple search phrases
    - very few use advanced searches (when available)
  - also in Information Architecture 3rd edition, page 185
    - "few users take advantage of them [advanced search functions]"
    - http://oreilly.com/catalog/9780596000356
  - our Amazon-like filtering allows better filtering anyway (via user testing)
Full Text Search
- We've found that results don't always "make sense" to the user
- Searching with FTS is hard to tune (which set of operators matches the users' expectations)
- Advanced search operators are a no-go
  - we don't need them because
  - users don't understand them
- Performance has been very close (+/-) to the char index functions
  - but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have an n-to-n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
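The answer does not show its code, so the following Python sketch is only a guess at the general shape of "dictionaries and a lot of memory": records held in a dictionary, substring matching per column, and facet counts computed from the hits. Every name and the sample data are hypothetical.

```python
from collections import Counter

class InMemorySearch:
    def __init__(self, records):
        self.records = records  # record id -> {column name: value}

    def search(self, fragment, columns=None):
        """Ids of records where any searched column contains the fragment."""
        needle = fragment.lower()
        hits = []
        for rid, rec in self.records.items():
            cols = columns or rec.keys()
            if any(needle in str(rec[c]).lower() for c in cols if c in rec):
                hits.append(rid)
        return hits

    def facets(self, ids, column):
        """Count the values of one column across the hits (checkbox filters)."""
        return Counter(
            self.records[i][column] for i in ids if column in self.records[i]
        )

    def filter(self, ids, column, value):
        """Narrow the hits to records whose column equals the chosen value."""
        return [i for i in ids if self.records[i].get(column) == value]

# Hypothetical sample data; note the white space inside one product id
engine = InMemorySearch({
    1: {"product_id": "NY-100 A", "name": "Widget", "location": "New York"},
    2: {"product_id": "NY-200", "name": "Gadget", "location": "New York"},
    3: {"product_id": "LA-100", "name": "Widget", "location": "Los Angeles"},
})
hits = engine.search("100", columns=["product_id"])
print(hits, dict(engine.facets(hits, "location")))
```

This sketch scans every record on each search; a production version would presumably add indexes over the searchable columns.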
I recommend looking into Solr; I believe it will meet your needs:
http://lucene.apache.org/solr/
For an off-the-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project, and it is open source. You can also try Elasticsearch, and there are a lot of off-the-shelf products that offer good customization abilities and search features, such as Coveo, SharePoint FAST, Google ...

Resources