I am developing an advanced search engine using .Net where users can build their query based on several Fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To modified Date
Owner
Location
Other Metadata
I am using lucene to index Document Content and their Corresponding IDs. However, the other metadata resides in MS SQL DB (to avoid enlarging the index, and keep updating the index on any modification of the metadata).
How I can Perform the Search?
when any user search for a term:
Narrow down the search results according to criteria selected by user by looking up in the SQL DB.
Return the matching IDs to the lucene searcher web service, which search for keyword entered in the DocumnentIDs returned From the Adv Search web service.
Then Get the relevant metadata for the Document ids (returned from lucence) by looking again in the DB.
AS you notice here, there is one lookup in DB, then Lucene, and Finally DB to get the values to be displayed in Grid.
Questions:
How can overcome this situation? I thought to begin searching lucene but this has a drawback if the Documents indexed reached 2 million. (i think narrowing down the results using the DB first have large effect on performance).
Another issue is passing IDs to lucene Search Service, how effective is passing hundred thousands of IDs? and what is the alternative solution?
I welcome any idea, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing large BooleanQuery containing 100,000s of ID clauses in app (you'll need to first override the default limit of 1024 clauses to even measure this effect)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results usually about ~10-25 records)
There are two assumptions you may be making that would be worth reconsidering
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index has mostly to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each lucene document will internally reference their owner through a symbol-table lookup (4-byte integer * 2MM docs + 500 strings = < 8MB additional).
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: With your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs query MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.
Related
I am using solr search engine. I had defined a schema initially and imported data from SQL db to solr using DIH. I have got a new column in sql db and value to which is getting populated using some of the previous columns. Now, I have to index this new column into solr.
My question is: do I perform update for all records or do I delete all records from solr and rebuild index again using DIH? I am asking this question because I have read that if we perform update for any document, solr first deletes the index and then rebuild it again.
The answer regarding speed is, as always, "it depends". But it's usually easier to just reindex. It doesn't require all fields to be stored in Solr and it's something you'll have to support anyway - so it doesn't require any additional code.
It also offers a bit more flexibility in regards to the index, since as you note, if you are going to do partial updates, the actual implementation is delete+add internally (since there might be fields that depend on the field you're changing, update processors, distribution across the cluster, etc.) - which requires all fields to be stored. This can have a huge impact on index size, which might not be necessary - especially if you have all the content in the DB for all other uses anyway.
So in regards to speed you're probably just going to have to try (document sizes, speed of DB, field sizes, etc. is going to affect that for each single case) - but usually the speed of a reindex isn't the most important part.
If you update your index don't forget to optimize it afterwards (via the Admin console for example) to get rid of all the deleted documents.
I have been using postgresql for full text search, matching a list of articles against documents containing a particular word. The performance for which degraded with a rise in the no. of rows. I had been using postgresql support for full text searches which made the performance faster, but over time resulted in slower searches as the articles increased.
I am just starting to implement with solr for searching. Going thru various resources on the net I came across that it can do much more than searching and give me finer control over my results.
Solr seems to use an inverted index, wouldn't the performance degrade over time if many documents (over 1 million) contain a search term begin queried by the user? Also if I am limiting the results via pagination for the searched term, while calculating the score for the documents, wouldn't it need to load all of the 1 million+ documents first and then limit the results which would dampen the performance with many documents having the same word?
Is there a way to sort the index by the score itself in the first place which would avoid loading of the documents later?
Lucene has been designed to solve all the problems you mentioned. Apart from inverted index, there is also posting lists, docvalues, separation of indexed and stored value, and so on.
And then you have Solr on top of that to add even more goodies.
And 1 million documents is an introductory level problem for Lucene/Solr. It is being routinely tested on indexing a Wikipedia dump.
If you feel you actually need to understand how it works, rather than just be reassured about this, check books on Lucene, including the old ones. Also check Lucene Javadocs - they often have additional information.
I'm using DSE for Cassandra/Solr integration so that data are stored in Cassandra and indexed in Solr. It's very natural to use Cassandra to handle CRUD operation and use Solr for full text search respectively, and DSE can really simplify data synchronization between Cassandra and Solr.
When it comes to query, however, there are actually two ways to go: Cassandra secondary/manual configured index vs. Solr. I want to know when to use which method and what's the performance difference in general, especially under DSE setup.
Here is one example use case in my project. I have a Cassandra table storing some item entity data. Besides the basic CRUD operation, I also need to retrieve items by equality on some field (say category) and then sort by some order (in my case here, a like_count field).
I can think of three different ways to handle it:
Declare 'indexed=true' in Solr schema for both category and like_count field and query in Solr
Create a denormalized table in Cassandra with primary key (category, like_count, id)
Create a denormalized table in Cassandra with primary key (category, order, id) and use an external component, such as Spark/Storm,to sort the items by like_count
The first method seems to be the simplest to implement and maintain. I just write some trivial Solr accessing code and the rest heavy lifting are handled by Solr/DSE search.
The second method requires manual denormalization on create and update. I also need to maintain a separate table. There is also tombstone issue as the like_count can possibly be updated frequently. The good part is that the read may be faster (if there are no excessive tombstones).
The third method can alleviate the tombstone issue at the cost of one extra component for sorting.
Which method do you think is the best option? What is the difference in performance?
Cassandra secondary indexes have limited use cases:
No more than a couple of columns indexed.
Only a single indexed column in a query.
Too much inter-node traffic for high cardinality data (relatively unique column values)
Too much inter-node traffic for low cardinality data (high percentage of rows will match)
Queries need to be known in advance so data model can be optimized around them.
Because of these limitations, it is common for apps to create "index tables" which are indexed by whatever column is desired. This requires either that data be duplicated from the main table to each index table, or an extra query will be needed to read the index table and then read the actual row from the main table after reading the main key from the index table. Queries on multiple columns will have to be manually indexed in advance, making ad hoc queries problematic. And any duplicated will have to be manually updated by the app into each index table.
Other than that... they will work fine in cases where a "modest" number of rows will be selected from a modest number of nodes, and queries are well specified in advance and not ad hoc.
DSE/Solr is better for:
A moderate number of columns are indexed.
Complex queries with a number of columns/fields referenced - Lucene matches all specified fields in a query in parallel. Lucene indexes the data on each node, so nodes query in parallel.
Ad hoc queries in general, where the precise queries are not known in advance.
Rich text queries such as keyword search, wildcard, fuzzy/like, range, inequality.
There is a performance and capacity cost to using Solr indexing, so a proof of concept implementation is recommended to evaluate how much additional RAM, storage, and nodes are needed, which depends on how many columns you index, the amount of text indexed, and any text filtering complexity (e.g., n-grams need more.) It could range from 25% increase for a relatively small number of indexed columns to 100% if all columns are indexed. Also, you need to have enough nodes so that the per-node Solr index fits in RAM or mostly in RAM if using SSD. And vnodes are not currently recommended for Solr data centers.
Why are results from search index queries limited to 200 rows, whereas standard view queries seem to have no limit?
Fundamentally because we hold a 200 item array in memory as we stream over all hits, preserving the top 200 scoring hits. A standard view just streams all rows between a start and end point. The intent of a search is to typically to find the needle in a haystack, so you don't generally fetch thousands of results (compare with Google, who clicks through to page 500?). If you don't find what you want, you refine your search and look again.
There are cases when retrieving all matches makes sense (and we can stream this in the order we find them, so there's no RAM issue). That's a feature we can (and should) add, but it's not currently available.
It's also worth noting that the _view API (aka "mapreduce") is fundamentally different than search because of the ordering of results on disk. Materialized views are persisted in CouchDB b+ trees, so they are essentially sorted by key. That allows for efficient range queries (start/end key), and makes limit/paging trivial. However, it also means that you have to order the view rows on disk, which restricts the types of boolean queries that you can perform against the materialized views. That's where search helps, but Bob (aka "The Lucene Expert") notes the limitations.
I have an application which needs to store a huge volume of data (around 200,000 txns per day), each record around 100 kb to 200 kb size. The format of the data is going to be JSON/XML.
The application should be highly available , so we plan to store the data on S3 or AWS DynamoDB.
We have use-cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on few common attributes but there may be some arbitrary queries for certain operational use cases.
I researched the ways to search non-relational data and so far found two ways being used by most technologies
1) Build an index (Solr/CloudSearch,etc.)
2) Run a Map Reduce job (Hive/Hbase, etc.)
Our requirement is for the search results to be reliable (consistent with data in S3/DB - something like a oracle query, it is okay to be slow but when we get the data, we should have everything that matched the query returned or atleast let us know that some results were skipped)
At the outset it looks like the index based approach would be faster than the MR. But I am not sure if it is reliable - index may be stale? (is there a way to know the index was stale when we do the search so that we can correct it? is there a way to have the index always consistent with the values in the DB/S3? Something similar to the indexes on Oracle DBs).
The MR job seems to be reliable always (as it fetches data from S3 for each query), is that assumption right? Is there anyway to speed this query - may be partition data in S3 and run multiple MR jobs based on each partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query too under a minute). I just asked a former coworker and it's still doing fine a year later.
I can't speak to the map reduce software, though.
You should think about having one Solr core per week/month for instance, this way older cores will be read only, and easier to manager and very easy to spread over several Solr instances. If 200k docs are to be added per day for ever you need either that or Solr sharding, a single core will not be enough for ever.