I have an application which requires very flexible searching functionality. As part of this, users will need to have the ability to do full-text searching over a number of text fields, but also to filter by a number of numeric fields whose data is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text data, and I am aware that I can also add numeric values to these documents for indexing. However, given the dynamic nature of these numeric fields, I would be constantly updating (deleting and recreating) the documents in the Search API index. Even if I allowed the Search API to use older data for a period, it would still need to be updated a few times a day. This doesn't seem like an efficient way to store this data for searching, particularly given that the number of search queries will be considerably lower than the number of updates to the data.
Is there a way to deal with this dynamic data that is more efficient than constantly revising the search documents?
My only thought so far is to implement a two-step process in which the results of a full-text search are either used in a query against the NDB datastore or filtered manually in Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
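For concreteness, the two-step idea would look roughly like the sketch below (the entity kind, field names, and the assumption that the doc id is the urlsafe NDB key are all hypothetical):

```python
# Hypothetical sketch of the two-step approach: full-text search first,
# then fetch the entities and filter on the fast-changing numeric data.
from google.appengine.api import search
from google.appengine.ext import ndb

class Item(ndb.Model):
    title = ndb.StringProperty()
    price = ndb.FloatProperty()  # updated frequently, so not kept in the search index

def search_items(text_query, max_price):
    index = search.Index(name='items')
    results = index.search(search.Query(text_query))
    # Assumes each search document's doc_id is the urlsafe NDB key of the entity.
    keys = [ndb.Key(urlsafe=doc.doc_id) for doc in results]
    entities = ndb.get_multi(keys)
    # Step 2: filter on the frequently updated numeric field in Python.
    return [e for e in entities if e is not None and e.price <= max_price]
```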
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well, either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.
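A minimal sketch of that cross-referenced design, assuming the doc id and the entity key name are the same string (the kind, field names and cache key prefix here are made up for illustration):

```python
from google.appengine.api import memcache
from google.appengine.ext import ndb

class ItemStats(ndb.Model):
    # The key name (string id) is the same string as the search document's doc_id.
    price = ndb.FloatProperty()

def update_stats(doc_id, price):
    # Write the frequently changing numeric data and keep the cache in sync.
    ItemStats(id=doc_id, price=price).put()
    memcache.set('stats:' + doc_id, price)

def numeric_for_docs(docs):
    """Fetch the numeric data for a batch of search results, cache first."""
    doc_ids = [d.doc_id for d in docs]
    found = memcache.get_multi(doc_ids, key_prefix='stats:')
    missing = [i for i in doc_ids if i not in found]
    if missing:
        entities = ndb.get_multi([ndb.Key(ItemStats, i) for i in missing])
        for ent in filter(None, entities):
            found[ent.key.id()] = ent.price
            memcache.set('stats:' + ent.key.id(), ent.price)
    return found  # dict mapping doc_id -> numeric value
```

Using get_multi for the cache misses keeps the Datastore reads batched, so even a mostly cold cache only costs one round trip per query.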
Related
I'm currently investigating the tools necessary to add fast, full-text search to our ERP SaaS application, with the aim of providing a single search entry point in the application that can search over the many different kinds of objects that make up the domain of the software.
The application (a Spring Java web application) is backed by a SQL Server RDBMS (using Hibernate as the ORM). There are hundreds of different tables, dozens of which (but maybe more) should be searchable (usually there are one or more varchar columns in every table that should be indexed/searched).
Think, for example, of a single search bar where I can search customers, contracts, employees, articles and so on. This data is also updated very often (new inserts, deletes, updates).
I found this article (www.chrisumbel.com/article/lucene_solr_sql_server) that shows how to connect a SQL Server database with Solr, including an example query against the database that extracts the data Solr uses during the data import.
Since we have dozens (or more) of tables containing searchable data, does that mean that, as a first step, we should wire all the SQL queries that extract this data into Solr in order to build the index?
Second question: not all the data is searchable by everyone (permissions and ad hoc filters), so how could we complement the full-text search provided by Solr with the more complex queries (joins on other tables, for example) that this requires?
Thanks
You are nearly asking for a full blown consulting project :-) But a few suggestions are possible.
Define Search Result Types: Search engines use denormalized data, i.e. you won't do any joins while querying (if you think you do, stick to your DB :-)). That means you need to do the necessary joins while filling the index. This defines what you can search for. Most people "just" index documents or log lines, so there is just one type of result. Sometimes people's profiles are included, sometimes a distinction is made between results from the different source systems the documents come from, but in the end there is a limited number of types of search result. Even so, they are all indexed into one and the same schema (and schemas are very malleable for search engines).
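A sketch of what "joining at index time" might look like (table names, columns and the 'type' discriminator are purely illustrative, and any DB-API driver would do in place of pyodbc):

```python
import pyodbc  # or any DB-API driver for SQL Server

# The join happens here, at index time, producing one flat document per row.
JOIN_SQL = """
SELECT c.id, c.name, c.city, u.username AS account_manager
FROM customers c
LEFT JOIN users u ON u.id = c.account_manager_id
"""

def build_customer_docs(conn):
    cursor = conn.cursor()
    cursor.execute(JOIN_SQL)
    cols = [d[0] for d in cursor.description]
    for row in cursor:
        doc = dict(zip(cols, row))
        doc['type'] = 'customer'               # the "result type" discriminator
        doc['id'] = 'customer-%s' % doc['id']  # keep ids unique across types
        yield doc
```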
Index: You know your SQL statements to extract your data. Converting the rows to JSON and shovelling them into a search engine is not difficult. One thing to watch out for: while your DB changes, you keep indexing; whether that is an incremental or a full "crawl" depends on how much logic you want to add. The trickiest part is getting deletes on the DB side into the index. If it's gone, it's gone: how do you know there was something that needs to be purged from the index? :-)
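For illustration, pushing such denormalized rows into Solr could look something like the following (the core name and URL are assumptions; older Solr versions expose the JSON handler at /update/json instead):

```python
import json
import requests

SOLR_UPDATE = 'http://localhost:8983/solr/erp/update?commit=true'  # assumed core

def index_docs(docs):
    # Solr accepts a JSON array of documents on the update handler.
    resp = requests.post(SOLR_UPDATE,
                         data=json.dumps(list(docs)),
                         headers={'Content-Type': 'application/json'})
    resp.raise_for_status()

def delete_doc(doc_id):
    # Deletes have to be sent explicitly; the DB won't tell the index by itself.
    resp = requests.post(SOLR_UPDATE,
                         data=json.dumps({'delete': {'id': doc_id}}),
                         headers={'Content-Type': 'application/json'})
    resp.raise_for_status()
```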
Secure Search: Since you don't really join, applying access rights at query time requires two steps. During indexing, write the principal (group, user) names of those who may read a document into its search record. At query time, get the user ID and expand it, recursively, to get all of the user's groups. Add this as a query filter. Make sure to cache the filter, or even pre-compute it for all users fairly regularly and store it in a fast store (the search index is one place, the DB would do too :-)). Obviously you need to re-index if access rights change. The good thing is: as long as things only change in LDAP/AD, you don't need to re-index the data, only the expanded groups of the affected users.
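A sketch of the query-time side of this (the 'acl' field name and the group-expansion callback are assumptions, not part of any particular product):

```python
def acl_filter(user_id, expand_groups):
    """Build a Solr filter query from the user's expanded principals."""
    principals = [user_id] + list(expand_groups(user_id))  # e.g. resolved from LDAP/AD
    escaped = ['"%s"' % p.replace('"', '\\"') for p in principals]
    return 'acl:(%s)' % ' OR '.join(escaped)

# Usage: params = {'q': user_query, 'fq': acl_filter(current_user, ldap_groups)}
```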
Ad hoc filters: If you want to filter on X, put X as a field into the index. At query time, apply the filter.
Is there any way, through Elasticsearch or Lucene metadata, to store a count of how many times a particular document has satisfied queries, even if the document has never been recalled for processing?
For example, say you issue a query and get 100 results. You process the first 10 and go no further. We would like to flag ALL the documents (100) that satisfied the search criteria for later analysis.
Thanks
Currently, Azure Search does not expose this information (and neither does Elasticsearch or Lucene). However, we're working on building better ranking models, and we're thinking about capturing (and potentially exposing) this type of data.
We'd be very interested in learning more about your scenario. Could you email me at eugenesh at the usual Microsoft domain? Thanks!
I have read that it is best practice to only return an ID when querying for results, and then populate metadata from the database. Is this true? I am worried about performance.
In my opinion, it is almost always best to store and return the fewest fields possible — preferably just the ID, unless you explicitly need a feature such as highlighting.
Storing a lot of data in your index can have a negative impact on your search performance as your index grows. There is no data that loads faster than no data. Plus, looking up objects by their IDs should be a very cheap operation in your primary data store of choice.
Most importantly, if your application is using an ORM to interact with its data store, then the sheer utility of reusing all your domain modeling consistently throughout your application would be hard to overstate.
Returning values straight from your search engine can be useful. But, short of using the search engine as a primary data store, I would need a very compelling reason to fragment my domain logic by foregoing an ORM.
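To make the pattern concrete, the "ids from the index, data from the primary store" approach might look roughly like this (shown in Python for brevity even though any ORM works; the Solr core name and the fetch_by_ids callback are placeholders):

```python
import requests

SOLR_SELECT = 'http://localhost:8983/solr/products/select'  # assumed core

def search_ids(q, rows=10):
    # Ask Solr for ranked ids only (fl=id keeps the response tiny).
    params = {'q': q, 'fl': 'id', 'rows': rows, 'wt': 'json'}
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return [doc['id'] for doc in resp.json()['response']['docs']]

def search_products(q, fetch_by_ids):
    """fetch_by_ids is whatever your ORM/DAO exposes for primary-key lookups."""
    ids = search_ids(q)
    by_id = {obj.id: obj for obj in fetch_by_ids(ids)}
    return [by_id[i] for i in ids if i in by_id]  # preserve the ranking order
```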
IMO, if you can retrieve the search results and the data within a single call, that would be a huge boost to performance compared with getting just the IDs and then making a DB call to retrieve the metadata for them.
Also, Solr/ES provide built-in caching, so responses would be faster for subsequent queries. For the DB you may have to put a separate caching solution or some other option in place.
This all depends on your specific scenario.
In some cases, what you say might be true. For instance, Etsy does exactly that (or at least used to): their rationale was that they had a very capable MySQL cluster that they knew very well how to manage, and it was very fast, so Solr returning only the id was enough for them.
But, you might be in a totally different scenario, and maybe calling the db will take longer than storing everything needed in Solr and hitting just Solr.
In my experience Solr performs badly at retrieving results when you either have highlighting on, or the fields you retrieve are very large and the network serialization/deserialization overhead increases. If that is the case, you might be better off retrieving these fields asynchronously from the DB.
I want to create a prospective search subscription with an empty query, but GAE raises an exception:
QuerySyntaxError: query:'' detail:'Query is empty.'
which is inconsistent with the Search API, which allows empty queries. Any workarounds? Should I file an issue?
The Prospective Search Service is intended to support applications that filter a stream of documents; applications that want less than all documents matched. In such an application, an "empty query" would normally be considered evidence of a bug. Admittedly, empty queries might sometimes be useful for various debugging purposes, however, the decision was made to design the interface's contracts with production use in mind.
As suggested by Will Brown, if you want a subscription that will match all documents, then insert some dummy field with a constant value into your documents and then create a query that matches just that field and value. Given that there is such an easy work-around available for those rare cases when "all documents" are needed, I think it unlikely that we would provide support for empty queries. It might also be interesting to note that the prohibition against empty queries is not just in the AppEngine code but also in the backend servers that AppEngine accesses to provide the Prospective Search Service.
Although the "Search API" (which really should be called the "Retrospective Search API") may support empty queries, it is important to realize that resource utilization patterns for prospective search are very, very different from those of retrospective search. For instance, you might have an application that is streaming hundreds of documents per second into both a document index (using retrospective search) and through a query index (using prospective seach). In such a system, an empty retrospective query is only going to return just a few documents whenever that query is submitted. On the other hand, a prospective query would generate a real-time stream of all documents. The presence of just a few prospective queries could thus generate significant loads on your application. In general, if you want a firehose, real-time push feed of everything published, it is best to code that up explicitly.
You can file a feature request for this, but it is by design (I don't know why). If you know that incoming documents will have something in common, you can write a query for those; for example, if you add a field "alldocuments" with content "yes" to the document when you send the request, you could register a query like "alldocuments:yes" to match all documents.
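A rough sketch of that workaround, assuming the Python prospective_search module's db.Model-based subscribe() and match() calls (check the exact signatures for your SDK version; the model and subscription id here are made up):

```python
from google.appengine.api import prospective_search
from google.appengine.ext import db

class Article(db.Model):
    body = db.TextProperty()
    alldocuments = db.StringProperty(default='yes')  # constant dummy field

# Subscribe with a query on the dummy field instead of an empty query.
prospective_search.subscribe(Article, 'alldocuments:yes', 'all-articles')

# Every document streamed through match() now hits that subscription.
prospective_search.match(Article(body='some new content'))
```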
Consider the following situation. We have a database which stores writers and books in two separate tables. One book obviously stores the reference to the writer who wrote the book.
For Solr I have to denormalize this structure into one big document where every book contains the details of the associated writer. This index is then used for querying books.
One user of the system now decides to update a writer record. Because many books can be associated with it, I have to update every document in Solr that has embedded data from this writer record. This is very painful because, as far as I know, I have to delete and re-add every affected document.
Is there any better way of doing this? I need near-real-time updates of the index if any of the referenced data gets modified.
This would be a perfect use case for nested documents. As far as I know, Lucene supports nested documents but Solr doesn't; I'm not totally sure about the current state of this feature.
This feature is available in Elasticsearch, though. You might want to have a look at it; there's an article I just wrote that may be interesting if you want to know what I find so cool about Elasticsearch. Your question just reminded me that I didn't mention the nested documents feature in my article, which is really cool too. You can use the nested type in your mapping. If you want to know more, have a look at this article; by the way, it contains exactly the books/authors example.
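For illustration, a nested mapping for the books/authors case could look something like this (index name, field names and the exact syntax for your Elasticsearch version are assumptions):

```python
import json
import requests

mapping = {
    "mappings": {
        "book": {
            "properties": {
                "title":  {"type": "string"},
                "author": {
                    "type": "nested",  # author indexed as a nested document
                    "properties": {
                        "name": {"type": "string"}
                    }
                }
            }
        }
    }
}

# Create the index with the nested mapping.
resp = requests.put('http://localhost:9200/books',
                    data=json.dumps(mapping),
                    headers={'Content-Type': 'application/json'})
resp.raise_for_status()
```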
Elasticsearch also helps you when updating documents. You don't need to re-index the whole document; you can send only the changes through a script. Thanks to the fact that it stores the source document that was indexed, it internally retrieves it, updates it by running the script and re-indexes it. That's how Lucene works internally, since its index segments are write-once. With Solr 4, which will be released soon, you can update documents by providing only the changes, but as far as I know this works only if all your fields are stored; fields that are not stored cannot be retrieved from the index.
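A sketch of such a partial update (the URL, document id and script syntax follow the older _update API and will vary with your Elasticsearch version):

```python
import json
import requests

update = {
    "script": "ctx._source.author.name = new_name",  # only the change is sent
    "params": {"new_name": "Updated Writer Name"}
}
resp = requests.post('http://localhost:9200/books/book/1/_update',
                     data=json.dumps(update),
                     headers={'Content-Type': 'application/json'})
resp.raise_for_status()
```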
As for near-real-time updates, Elasticsearch uses the Lucene near-real-time API and automatically refreshes the index reader every second. Solr 3 doesn't use those APIs yet, but Solr 4 does.
For updating this kind of denormalized data in Solr you can use the DataImportHandler and delta imports. The example at https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously you would then need to give Solr access to your database.
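Once the deltaQuery is configured in data-config.xml, the delta import can be triggered over HTTP whenever a writer record changes, for example (core and handler names are assumptions):

```python
import requests

# Ask the DataImportHandler to re-index only the rows changed since the last import.
resp = requests.get('http://localhost:8983/solr/books/dataimport',
                    params={'command': 'delta-import', 'clean': 'false'})
resp.raise_for_status()
```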