Cloudant Search Index Query Limit

Why are results from search index queries limited to 200 rows, whereas standard view queries seem to have no limit?

Fundamentally, because we hold a 200-item array in memory as we stream over all hits, preserving the top 200 scoring hits. A standard view just streams all rows between a start and end point. The intent of a search is typically to find the needle in a haystack, so you don't generally fetch thousands of results (compare with Google: who clicks through to page 500?). If you don't find what you want, you refine your search and look again.
There are cases when retrieving all matches makes sense (and we can stream this in the order we find them, so there's no RAM issue). That's a feature we can (and should) add, but it's not currently available.
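For illustration, here's a minimal sketch of that bounded top-k idea in plain Java (this is not Cloudant's actual implementation; the Hit type and names are made up). A fixed-size min-heap keeps memory constant no matter how many hits stream past:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopKCollector {
    record Hit(String id, double score) {}

    // Stream over every hit, but keep only the best `limit` in memory.
    static List<Hit> topK(Iterable<Hit> hits, int limit) {
        // Min-heap by score: the weakest of the current top-k sits on top.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(limit, Comparator.comparingDouble(Hit::score));
        for (Hit hit : hits) {
            if (heap.size() < limit) {
                heap.offer(hit);
            } else if (hit.score() > heap.peek().score()) {
                heap.poll();      // evict the current weakest hit
                heap.offer(hit);  // memory stays bounded at `limit` entries
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(Hit::score).reversed());
        return result;
    }
}
```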

It's also worth noting that the _view API (aka "mapreduce") is fundamentally different from search because of how results are ordered on disk. Materialized views are persisted in CouchDB B+ trees, so they are essentially sorted by key. That allows for efficient range queries (start/end key) and makes limit/paging trivial. However, it also means that the view rows must be ordered on disk, which restricts the types of boolean queries you can perform against materialized views. That's where search helps, though Bob (aka "The Lucene Expert") notes its limitations.
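By contrast, a view query is just a key-sorted slice of that B+ tree. A hedged example over HTTP (the account, database, design document, and view names below are placeholders; startkey/endkey/limit are the standard CouchDB/Cloudant view parameters):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ViewRangeQuery {
    public static void main(String[] args) throws Exception {
        // Keys are JSON values, hence the URL-encoded quotes around the dates.
        String url = "https://example.cloudant.com/mydb/_design/docs/_view/by_date"
                + "?startkey=%222020-01-01%22"
                + "&endkey=%222020-12-31%22"
                + "&limit=20";
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // rows come back already sorted by key
    }
}
```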

Related

Best Practice to Combine both DB and Lucene Search

I am developing an advanced search engine using .NET where users can build their query based on several fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To Modified Date
Owner
Location
Other Metadata
I am using Lucene to index document content and the corresponding IDs. However, the other metadata resides in an MS SQL DB (to avoid enlarging the index, and to avoid having to update the index whenever the metadata changes).
How can I perform the search?
When a user searches for a term:
Narrow down the search results according to the criteria selected by the user by looking them up in the SQL DB.
Return the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the Document IDs returned from the Adv Search web service.
Then get the relevant metadata for the Document IDs (returned from Lucene) by looking in the DB again.
As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to be displayed in the grid.
Questions:
How can I overcome this situation? I thought about beginning with the Lucene search, but this has a drawback if the number of indexed documents reaches 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)
Another issue is passing IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs? And what is the alternative solution?
I welcome any idea, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing a large BooleanQuery containing 100,000s of ID clauses in your app (you'll need to first override the default limit of 1024 clauses to even measure this effect; see the sketch after this list)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually about 10-25 records)
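To make the [!!] item in step 2 concrete, here is roughly what that giant ID filter looks like (Java Lucene shown; Lucene.NET mirrors this API, and the "id" field name and ID list are hypothetical):

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class IdFilterSketch {
    static Query idFilter(List<String> idsFromSql) {
        // The default clause limit is 1024; it must be raised before a
        // 100,000-clause query will even execute.
        BooleanQuery.setMaxClauseCount(200_000);
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String id : idsFromSql) {
            builder.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
        }
        return builder.build(); // large and slow to build, hold, and execute
    }
}
```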
There are two assumptions you may be making that are worth reconsidering:
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index mostly depends on the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings are stored, and each Lucene document internally references its owner through a symbol-table lookup (4-byte integer × 2MM docs + 500 strings < 8 MB additional).
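As a sketch of what "try it out" might look like (modern Java Lucene; Lucene.NET has analogous field classes, and all field names here are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class MetadataIndexing {
    static Document toLuceneDoc(String id, String fullText, String owner, long created) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));          // DB primary key
        doc.add(new TextField("content", fullText, Field.Store.NO));  // analyzed text
        doc.add(new StringField("owner", owner, Field.Store.NO));     // exact-match filter
        doc.add(new LongPoint("created", created));                   // numeric date, range-queryable
        return doc;
    }
}
```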
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: with your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene versus querying MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc.; see the sketch below)
Return IDs
Materialize documents from MS-SQL using returned IDs.
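A sketch of what step 2 might look like (same hypothetical field names as above, Java Lucene): one query that combines the text search with the metadata filters, so no ID lists ever cross the wire.

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class FilteredSearch {
    static BooleanQuery budgetDocsByOwner() {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("content", "budget")), Occur.MUST) // text match
                .add(LongPoint.newRangeQuery("created", 20200101L, 20201231L),
                        Occur.FILTER)                                          // date range
                .add(new TermQuery(new Term("owner", "jsmith")), Occur.FILTER) // owner filter
                .build();
    }
}
```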
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.

Querying large not indexed tables

We are developing a CRUD-like web interface for our application. For this, we need to show data from different tables. Some are huge and very "alive", with millions of rows. Some are small configuration tables.
Now we want to allow our users filtering, refinement, sorting, pagination, etc. on the grids we show. Based on the user's selections, we build SELECT queries.
For obvious reasons, filtering on non-indexed fields will produce rather long-running queries. On the other hand, indexing every column of a table looks a bit "weird". And we do have tables with more than 50 columns.
We are looking into Apache Lucene, but as far as I understand, it will help us with text indexing. But what about numbers, dates, ranges? Are there any solutions or discussions available for this issue?
Also, I should point out that this issue is specific to the UX; for the application's own needs, we are doing fine.
You are correct: in general, you don't want to allow arbitrary predicates on non-indexed fields. However, how much effect this has depends heavily on the table size, the database engine being used, and the machine driving the database. Some engines are not too bad with non-indexed columns, but in the worst case each query degenerates to a sequential scan. Sequential scans aren't always as bad as they sound, either.
Some ideas
Investigate using a column-store database engine. These store data column-wise rather than row-wise, which can be much faster for arbitrary predicates on non-indexed columns. Column stores aren't a universal solution, though, if you often need all the fields of a row.
Index the main columns that users will query, and indicate in the UX layer that queries on other columns will be slower. Users will be more accepting, especially if they know in advance that a column query will be slow.
If possible, just throw memory at it. Engines like Oracle or SQL Server perform quite well while most of the database fits in memory. The only problem is that once your database exceeds available memory, performance falls off a cliff (without warning).
Consider using vertical partitioning if possible. This lets you split a row into two or more pieces for storage, which can reduce I/O for predicates (see the sketch after this list).
I'm sure you know this, but make sure the columns used for joins are indexed.
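A minimal sketch of the vertical-partitioning idea from the list above, assuming an in-memory H2 database purely for demonstration (all table and column names are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VerticalPartitionSketch {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement s = c.createStatement()) {
            // Narrow "hot" table holds only the columns users filter on.
            s.execute("CREATE TABLE orders_hot (id BIGINT PRIMARY KEY, "
                    + "status VARCHAR(16), created DATE)");
            // Wide "cold" table holds the rarely-filtered bulk of the row.
            s.execute("CREATE TABLE orders_cold (id BIGINT PRIMARY KEY, payload CLOB)");
            s.execute("CREATE INDEX idx_hot_status ON orders_hot(status)");
            // Filters scan only the narrow table; details are joined back in:
            //   SELECT h.id, c.payload FROM orders_hot h
            //     JOIN orders_cold c ON c.id = h.id
            //    WHERE h.status = 'OPEN' AND h.created >= DATE '2020-01-01'
        }
    }
}
```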

GAE — Performance of queries on indexed properties

If I had an entity with an indexed property, say "name," what would the performance of == queries on that property be like?
Of course, I understand that no exact answers are possible, but how does the performance correlate with the total number of entities for which name == x for some x, the total number of entities in the datastore, etc.?
How much slower would a query on name == x be if I had 1000 entities with name equalling x, versus 100 entities? Has any sort of benchmarking been done on this?
Some not very strenuous testing on my part indicated response times increased roughly linearly with the number of results returned. Note that even if you have 1000 entities, if you add a limit=100 to your query, it'll perform the same as if you only had 100 entities.
This is in line with the documentation which indicates that perf varies with the number of entities returned.
When I say not very strenuous, I mean that the response times were all over the place, and it was a very very rough estimate to draw a line through. I'd often see an order of magnitude difference in perf on the same request.
AppEngine does queries in a very optimized way, so it is virtually irrelevant from a performance standpoint whether you query on the name property or just do a batch-get with the keys only. Either will be linear in the number of entities returned (so 1000 entities returned will be pretty much exactly 10 times slower than 100). The total number of entities stored in your database does not make a difference. What does make a tiny difference, though, is the number of different values for "name" that occur in your database.
The way this is done is via the indexes (or indices, as preferred) stored along with your data. The index for the "name" property is a table that has all names sorted in alphabetical order (and a second one sorted in reverse alphabetical order, if you use descending order in any of your queries). A query then simply finds the first occurrence of the queried name in the table and starts returning results in order. This is called a "scan".
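For concreteness, the kind of query being discussed might look like this in the classic App Engine Java datastore API ("Person" and "name" are hypothetical); note how the limit caps the scan, matching the earlier observation that a limit=100 query performs like a 100-entity result set:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import java.util.List;

public class NameQuery {
    static List<Entity> findByName(String x) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        // The index scan finds the first "name == x" entry, then reads forward.
        Query q = new Query("Person")
                .setFilter(new FilterPredicate("name", FilterOperator.EQUAL, x));
        // Cost is linear in entities returned, so cap it.
        return ds.prepare(q).asList(FetchOptions.Builder.withLimit(100));
    }
}
```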
This video is a bit technical, but it explains in detail how all this works, and if you're concerned about coding for maximum performance, it might be a good time investment:
Google I/O 2008: Under the Covers of the Google App Engine Datastore
(the video quality is fairly bad, but the slides are also available online)

Reaching an appropriate balance between performance and scalability in a large database

I'm trying to determine which of the many database models would best support probabilistic record comparison. Specifically, I have approximately 20 million documents defined by a variety of attributes (name, type, author, owner, etc.). Text attributes dominate the data set, yet there are still plenty of images. Read operations are the most crucial vis-à-vis performance, but I expect roughly 20,000 new documents to be inserted each week. Luckily, insert speed does not matter at all, and I am comfortable queuing the incoming documents for controlled processing.
Database queries will most typically take the following forms:
Find documents containing at least five sentences that reference someone who's a member of the military
Predict whether User A will comment on a specific document written by User B, given User A's entire comment history
Predict an author for Document X by comparing vocabulary, word ordering, sentence structure, and concept flow
My first thought was to use a simple document store like MongoDB, since each document does not necessarily contain the same data. However, complex queries effectively degrade this to a file-system wrapper, since I cannot construct a query yielding the results I desire. As such, this approach corners me into walking the entire database and processing each file separately. Although document stores scale well horizontally, the benefits are not realized here.
This led me to realize that my granularity isn't at the document level, but rather the entity-relationship level. As such, graph databases seemed like a logical choice, since they facilitate relating each word in a sentence to the next word, the next paragraph, the current paragraph, its part of speech, etc. Graph databases limit data replication, increase the speed of statistical clustering, and scale horizontally, among other things. Unfortunately, ensuring a definitive answer to a query still necessitates traversing the entire graph. Even so, indexing will help with performance.
I've also evaluated the use of relational databases, which are very efficient when designed properly (i.e., by avoiding unnecessary normalization). A relational database excels at finding all documents authored by User A, but fails at structural comparisons (which involves expensive joins). Relational databases also enforce constraints (primary keys, foreign keys, uniqueness, etc.) efficiently--a task with which some NoSQL solutions struggle.
After considering the above-listed requirements, are there any database models that combine the "exactness" of relational models (viz., efficient exhaustion of the domain) with the flexibility of graph databases?
This is not really an answer, just a discussion.
The database you are talking about is a large database. You don't mention the nature of the documents, but newspaper articles are typically in the 2-3k range, so you are talking about hundreds of gigabytes of raw data.
If query performance is an issue, you are talking about a large, rather expensive system.
Your requirements are also quite complex, and not likely to be met out of the box. I would be thinking of a hybrid system: store the document metadata in a relational database so you can quickly access it with simple queries. You can store the documents themselves in the database as blobs.
Some of your requirements can be met with text add-ins on relational databases, so simple searching is feasible using inverted-index technology. That handles the first of your three scenarios.
The other two are much more challenging. The third ("predict an author") can probably be handled by having a parallel system that stores author information, summarized from the documents when they are loaded. Then it is a question of comparing a document to the author, using simple statistical analysis (naive Bayesian, anyone?).
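A toy sketch of that "simple statistical analysis", assuming per-author word counts accumulated at load time (everything here is hypothetical illustration, not a production model): score a document against each author with add-one-smoothed log probabilities and pick the highest.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AuthorScorer {
    // author -> (word -> count), built as documents are loaded
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    void observe(String author, List<String> words) {
        Map<String, Integer> wc = counts.computeIfAbsent(author, a -> new HashMap<>());
        for (String w : words) wc.merge(w, 1, Integer::sum);
        totals.merge(author, words.size(), Integer::sum);
    }

    // Naive-Bayes-style log likelihood with add-one smoothing; higher is better.
    double logLikelihood(String author, List<String> docWords, int vocabSize) {
        Map<String, Integer> wc = counts.getOrDefault(author, Map.of());
        double total = totals.getOrDefault(author, 0);
        double sum = 0;
        for (String w : docWords) {
            sum += Math.log((wc.getOrDefault(w, 0) + 1.0) / (total + vocabSize));
        }
        return sum;
    }
}
```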
The middle one is tricky, but it suggests yet another component for managing comments on documents. Depending on the volume, this may be easy or hard.
Finally, how likely are the requirements to change? Do you really know what the system should be doing? Or will the functionality be radically different once you get it up and running?

Searching across shards?

Short version
If I split my users into shards, how do I offer a "user search"? Obviously, I don't want every search to hit every shard.
Long version
By shard, I mean have multiple databases where each contains a fraction of the total data. For (a naive) example, the databases UserA, UserB, etc. might contain users whose names begin with "A", "B", etc. When a new user signs up, I simply examine his name and put him into the correct database. When a returning user signs in, I again look at his name to determine the correct database to pull his information from.
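That routing rule is just a pure function applied at both sign-up and sign-in; a trivial sketch of the naive first-letter scheme described above:

```java
public class ShardRouter {
    // Assumes a non-empty name; "UserA", "UserB", ... are the shard names above.
    static String shardFor(String userName) {
        char first = Character.toUpperCase(userName.charAt(0));
        return "User" + first; // e.g. "Alex" -> "UserA", "Brian" -> "UserB"
    }
}
```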
The advantage of sharding vs read replication is that read replication does not scale your writes. All the writes that go to the master have to go to each slave. In a sense, they all carry the same write load, even though the read load is distributed.
Meanwhile, shards do not care about each other's writes. If Brian signs up on the UserB shard, the UserA shard does not need to hear about it. If Brian sends a message to Alex, I can record that fact on both the UserA and UserB shards. In this way, when either Alex or Brian logs in, he can retrieve all his sent and received messages from his own shard without querying all shards.
So far, so good. What about searches? In this example, if Brian searches for "Alex" I can check UserA. But what if he searches for Alex by his last name, "Smith"? There are Smiths in every shard. From here, I see two options:
Have the application search for Smiths on each shard. This can be done slowly (querying each shard in succession) or quickly (querying each shard in parallel), but either way, every shard needs to be involved in every search. In the same way that read replication does not scale writes, having searches hit every shard does not scale your searches. You may reach a time when your search volume is high enough to overwhelm each shard, and adding shards does not help you, since they all get the same volume.
Some kind of indexing that itself is tolerant of sharding. For example, let's say I have a constant number of fields by which I want to search: first name and last name. In addition to UserA, UserB, etc. I also have IndexA, IndexB, etc. When a new user registers, I attach him to each index I want him to be found on. So I put Alex Smith into both IndexA and IndexS, and he can be found by either "Alex" or "Smith", but not by substrings. In this way, you don't need to query each shard, so search might be scalable.
So can search be scaled? If so, is this indexing approach the right one? Is there any other?
There is no magic bullet.
Searching each shard in succession is out of the question, obviously, due to the incredibly high latency you will incur.
So you want to search in parallel, if you have to.
There are two realistic options, and you already listed them -- indexing, and parallelized search. Allow me to go into a little more detail on how you would go about designing them.
The key insight you can use is that in search, you rarely need the complete set of results. You only need the first (or nth) page of results. So there is quite a bit of wiggle room you can use to decrease response time.
Indexing
If you know the attributes on which the users will be searched, you can create custom, separate indexes for them. You can build your own inverted index, which will point to the (shard, recordId) tuple for each search term, or you can store it in the database. Update it lazily, and asynchronously. I do not know your application requirements, it might even be possible to just rebuild the index every night (meaning you will not have the most recent entries on any given day -- but that might be ok for you). Make sure to optimize this index for size so it can fit in memory; note that you can shard this index, if you need to.
Naturally, if people can search for something like "lastname='Smith' OR lastname='Jones'", you can read the index for Smith, read the index for Jones, and compute the union -- you do not need to store all possible queries, just their building parts.
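A bare-bones sketch of that structure (all names are illustrative): each term maps to its (shard, recordId) postings, and an OR query is just the union of two postings sets.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NameIndex {
    record Posting(String shard, long recordId) {}

    private final Map<String, Set<Posting>> postings = new HashMap<>();

    void add(String term, String shard, long recordId) {
        postings.computeIfAbsent(term.toLowerCase(), t -> new HashSet<>())
                .add(new Posting(shard, recordId));
    }

    // lastname='Smith' OR lastname='Jones' is the union of the two entries.
    Set<Posting> or(String a, String b) {
        Set<Posting> result =
                new HashSet<>(postings.getOrDefault(a.toLowerCase(), Set.of()));
        result.addAll(postings.getOrDefault(b.toLowerCase(), Set.of()));
        return result;
    }
}
```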
Parallel Search
For every query, send off requests to every shard, unless you know which shard to look in because the search happens to be on the distribution key. Make the requests asynchronous. Reply to the user as soon as you get the first page-worth of results; collect the rest and cache it locally, so that if the user hits "next" you will have the results ready and do not need to re-query the servers. This way, if some of the servers are taking longer than others, you do not need to wait on them to service the request.
While you are at it, log the response times of the sharded servers to observe potential problems with uneven data and/or load distribution.
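A simplified fan-out sketch (queryShard is a stand-in for whatever per-shard client call you have; a real implementation would answer as soon as a page-worth arrives and cache the rest, as described above):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ScatterGather {
    static List<String> search(List<String> shards, String query, int pageSize) {
        // Fire one asynchronous query per shard.
        List<CompletableFuture<List<String>>> futures = shards.stream()
                .map(s -> CompletableFuture.supplyAsync(() -> queryShard(s, query)))
                .toList();
        // Simplification: join() waits for every shard before merging.
        return futures.stream()
                .flatMap(f -> f.join().stream())
                .sorted()          // merge step: rank/sort/group as needed
                .limit(pageSize)   // only the first page goes back to the user
                .toList();
    }

    static List<String> queryShard(String shard, String query) {
        return List.of(); // hypothetical per-shard search call
    }
}
```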
I'm assuming you are talking about shards a la:
http://highscalability.com/unorthodox-approach-database-design-coming-shard
If you read that article, he goes into some detail on exactly your question, but long story short, you write custom application code to bring your disparate shards together. You can do some smart hashing to both query individual shards and insert data into shards. You need to ask a more specific question to get a more specific answer.
You actually do need every search to hit every shard, or at least every search needs to be performed against an index that contains the data from all shards, which boils down to the same thing.
Presumably you shard based on a single property of the user, probably a hash of the username. If your search feature allows the user to search based on other properties of the user it is clear that there is no single shard or subset of shards that can satisfy a query, because any shard could contain users that match the query. You can't rule out any shards before performing the search, which implies that you must run the query against all shards.
You may want to look at Sphinx (http://www.sphinxsearch.com/articles.html). It supports distributed searching. GigaSpaces has parallel query and merge support. This can also be done with MySQL Proxy (http://jan.kneschke.de/2008/6/2/mysql-proxy-merging-resultsets).
To build a non-sharded index kind of defeats the purpose of the shard in the first place :-) A centralized index probably won't work if sharding was necessary.
I think all the shards need to be hit in parallel. The results need to be filtered, ranked, sorted, grouped and the results merged from all the shards. If the shards themselves become overwhelmed you have to do the usual (reshard, scale up, etc) to underwhelm them again.
RDBMSs are not a good tool for textual search. You will be much better off looking at Solr. The performance difference between Solr and a database will be on the order of 100x.