Azure Search Keeping Record of Documents Satisfying Queries - azure-cognitive-search

Is there any way, through Elasticsearch or Lucene metadata, to store a count of how many times a particular document has satisfied queries, even if the document was never retrieved for processing?
For example, say you issue a query and get 100 results. You process the first 10 and do not go any further. We would like to flag ALL 100 documents that satisfied the search criteria for later analysis.
Thanks

Currently, Azure Search does not expose this information (and neither does Elasticsearch or Lucene). However, we're working on building better ranking models, and we're thinking about capturing (and potentially exposing) this type of data.
We'd be very interested in learning more about your scenario. Could you email me at eugenesh at the usual Microsoft domain? Thanks!
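Until the service exposes such data, one possible workaround is to record matches in the application itself: fetch every matching document ID for each query and increment a per-document counter, regardless of how many results are actually processed. The following is only a minimal sketch under stated assumptions. `run_query` is a hypothetical callable that returns all matching IDs (in practice it would page through the search API), and the in-memory `match_counts` stands in for a durable store.

```python
from collections import Counter
from typing import Callable, Iterable, List

# Hypothetical store: document ID -> number of queries it has satisfied.
match_counts: Counter = Counter()

def search_and_record(query: str,
                      run_query: Callable[[str], Iterable[str]],
                      process_limit: int = 10) -> List[str]:
    """Record every matching ID, but return only the first `process_limit`."""
    all_ids = list(run_query(query))   # assumed to yield ALL matching IDs
    match_counts.update(all_ids)       # flag all matches for later analysis
    return all_ids[:process_limit]     # only these are processed now

# Stand-in for a real search call, for illustration only.
def fake_search(query: str) -> List[str]:
    return [f"doc{i}" for i in range(100)]

first_page = search_and_record("example", fake_search)
```

Note that this trades extra work per query (retrieving all IDs rather than one page) for the ability to analyze matches later.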

Related

Is Solr Better than the normal RDBMS in case of searching normal queries i.e not full text search?

I am developing a web application where I want to use Solr for search only and keep my data in another database.
I will have two databases: a relational one (SQL Server) and a copy of it in Solr, used as a NoSQL store.
I'll be searching on specific fields in the Solr documents (e.g. by id, name, type, and join queries), i.e. NOT full-text search.
I know Solr's strength is full-text search, which it achieves by creating an inverted index on the document data. What I want to know is whether it also helps in my case, by creating another type of index on my documents that makes normal searching faster than a SQL Server index.
Yes, it will help you.
You need to consider your requirements and priorities.
If you add Solr as an additional component for searching the application data, keep in mind that you have to keep Solr constantly updated, and you will need additional infrastructure.
If performance is your main criterion and you don't want to put any search load on your RDBMS, then you can add Solr to your system. Also consider how big your data in the RDBMS is, because RDBMS systems are also strong enough to support searching data.
Weigh these aspects to make the decision.

Full Text Search in ERP application - is Apache Lucene\Solr the right choice?

I'm currently investigating the tools necessary to add fast full-text search to our ERP SaaS application, with the aim of providing a single search entry point that can search over the many different kinds of objects that make up the domain of the software.
The application (a Spring Java web application) is backed by a SQL Server RDBMS (using Hibernate as ORM). There are hundreds of different tables, dozens of which (maybe more) should be searchable; usually there are one or more varchar columns in every table that should be indexed and searched.
Think, for example, of a single search bar where I can search customers, contracts, employees, articles, and so on. This data is also updated very often (inserts, deletes, updates).
I found this article (www.chrisumbel.com/article/lucene_solr_sql_server) that shows how to connect a SQL Server database with Solr; it includes an example query against the database that extracts the data Solr uses during the data import.
Since we have dozens (or more) of tables containing the searchable data, does that mean that, as a first step, we have to integrate all the SQL queries that extract this data with Solr in order to build the index?
Second question: not all the data is searchable by everyone (permissions and ad hoc filters), so how could we complement the full-text search provided by Solr with more complex queries (joins on other tables, for example) on this data?
Thanks
You are nearly asking for a full-blown consulting project :-) But a few suggestions are possible.
Define Search Result Types: Search engines use denormalized data, i.e. you won't do any joins while querying (if you think you need to, stick to your DB :-). That means you need to do the necessary joins while filling the index. This defines what you can search for. Most people "just" index documents or log lines, so there is just one type of result. Sometimes people's profiles are included, and sometimes a distinction is made between results from the different source systems the documents come from, but in the end there is a limited number of types of search results. Moreover, they are nevertheless indexed into one and the same schema (schemas are very malleable in search engines).
Index: You know the SQL statements that extract your data. Converting the results to JSON and shoveling them into a search engine is not difficult. One thing to watch out for: as your DB changes, you have to keep indexing; whether you do an incremental or a full "crawl" depends on how much logic you want to add. The trickiest part is getting deletes on the DB side into the index. If it's gone, it's gone: how do you know there was something that needs to be purged from the index? :-)
Secure Search: Since you don't really join, applying access rights at query time requires two steps. During indexing, write the principal (group, user) names of those who may read each search result into the index. At query time, get the user ID and expand it, recursively, to get all groups of the user. Add this as a query filter. Make sure to cache the filter, or even pre-compute it for all users regularly and store it in a fast store (the search index is one place, a DB would do too :-). Obviously you need to re-index if access rights change. The good thing is: as long as things only change in LDAP/AD, you don't need to re-index the data, only recompute the expanded groups of the affected users.
Ad hoc filters: If you want to filter for X, put X as a field into the index. At query time, apply the filter.
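The two-step access-rights approach described above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the `MEMBER_OF` mapping stands in for an LDAP/AD lookup, and the field name `acl` is a hypothetical index field holding the principals allowed to read each document. The resulting string uses Lucene-style query syntax and could, for instance, be passed as a filter query.

```python
from typing import Dict, List, Set

# Hypothetical directory data: principal -> groups it directly belongs to.
MEMBER_OF: Dict[str, List[str]] = {
    "alice": ["sales"],
    "sales": ["staff"],
    "staff": [],
}

def expand_principals(user: str, member_of: Dict[str, List[str]]) -> Set[str]:
    """Return the user plus all groups reachable through nested membership."""
    seen: Set[str] = set()
    stack = [user]
    while stack:
        principal = stack.pop()
        if principal in seen:      # guards against cyclic group definitions
            continue
        seen.add(principal)
        stack.extend(member_of.get(principal, []))
    return seen

def acl_filter(principals: Set[str], field: str = "acl") -> str:
    """Build a Lucene-style filter clause over the (assumed) ACL field."""
    terms = " OR ".join(f'"{p}"' for p in sorted(principals))
    return f"{field}:({terms})"

filter_query = acl_filter(expand_principals("alice", MEMBER_OF))
# filter_query == 'acl:("alice" OR "sales" OR "staff")'
```

Since the expansion depends only on directory data, its result can be cached or pre-computed per user, as suggested above.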

Using Neo4j and Lucene in a distributed system

I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated: we have attached a great deal of metadata to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of metadata, we believe a graph database is in order.
In a perfect world, I could use one platform to perform both queries. I appreciate that neither Neo4j nor OrientDB, Arango, etc. is designed as a full-text search database, but I'm trying to understand their limitations.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed search and connection-search functionality. We first went with Neo4j and were very impressed by the Cypher query language: we could express any request. However, when you throw billions of nodes at it, you start to pay the price, and we started considering another graph DB. This time we did a lot of research and tests, and OrientDB was clearly the winner. OrientDB is highly scalable, but you have to code your "search algorithm" yourself if you want to do advanced things (e.g. what do these two nodes have in common?). Otherwise you have its SQL-like query language (I don't know/remember whether it has a name), and you can do some interesting things with it.
So, in conclusion, I would definitely go with OrientDB.
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think Solr and Neo4j are not comparable because they're trying to do completely different things. This site isn't about software recommendations, so I can't tell you what you should use, other than to say you should read up on Solr and Neo4j and figure out which set of functionality you want. As far as I know, this is an exclusive decision, as I'm not aware of people integrating Solr with Neo4j.
Your question is very difficult to answer. I'd recommend expanding on what you are trying to do and what you have tried; you'll probably get better responses.

GAE and Prospective search: empty query

I want to create a prospective search subscription with an empty query, but GAE raises an exception:
QuerySyntaxError: query:'' detail:'Query is empty.'
This is inconsistent with the Search API, which allows empty queries. Any workarounds? Should I file an issue?
The Prospective Search Service is intended to support applications that filter a stream of documents, i.e. applications that want fewer than all documents matched. In such an application, an "empty query" would normally be considered evidence of a bug. Admittedly, empty queries might sometimes be useful for various debugging purposes; however, the decision was made to design the interface's contracts with production use in mind.
As suggested by Will Brown, if you want a subscription that will match all documents, then insert some dummy field with a constant value into your documents and then create a query that matches just that field and value. Given that there is such an easy work-around available for those rare cases when "all documents" are needed, I think it unlikely that we would provide support for empty queries. It might also be interesting to note that the prohibition against empty queries is not just in the AppEngine code but also in the backend servers that AppEngine accesses to provide the Prospective Search Service.
Although the "Search API" (which really should be called the "Retrospective Search API") may support empty queries, it is important to realize that resource utilization patterns for prospective search are very, very different from those of retrospective search. For instance, you might have an application that is streaming hundreds of documents per second into both a document index (using retrospective search) and through a query index (using prospective search). In such a system, an empty retrospective query is only going to return a few documents whenever that query is submitted. On the other hand, a prospective query would generate a real-time stream of all documents. The presence of just a few prospective queries could thus generate significant loads on your application. In general, if you want a firehose, real-time push feed of everything published, it is best to code that up explicitly.
You can file a feature request for this, but it is by design (I don't know why). If you know that incoming documents will have something in common, you can write a query for those; for example, if you add a field "alldocuments" with content "yes" to the document when you send the request, you could register a query like "alldocuments:yes" to match all documents.
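The constant-field workaround can be illustrated with a toy model. This sketch does not call the Prospective Search API; `with_marker` and `matches` are hypothetical helpers that only mimic tagging documents with a marker field and matching a single `field:value` query, using the field name and value from the example above.

```python
from typing import Dict

MARKER_FIELD = "alldocuments"   # field name taken from the example above
MARKER_VALUE = "yes"

def with_marker(doc: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the document with the constant marker field added."""
    tagged = dict(doc)
    tagged[MARKER_FIELD] = MARKER_VALUE
    return tagged

def matches(doc: Dict[str, str], query: str) -> bool:
    """Toy matcher for a single `field:value` query; not the real service."""
    field, _, value = query.partition(":")
    return doc.get(field) == value

catch_all_query = f"{MARKER_FIELD}:{MARKER_VALUE}"
docs = [with_marker({"title": "first"}), with_marker({"body": "second"})]
all_matched = all(matches(d, catch_all_query) for d in docs)
```

The point is that every document must pass through the tagging step before it is submitted; any document that skips it will not match the catch-all subscription.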

Tracking user read/unread of a link/document in solr

I am using Solr to index reports from a DB, and that part works. However, I also need to track user activity to report whether a document has been read by the user or not. I am aware that Solr is not built to index or keep track of user activity, but is there a good approach to this?
Any suggestions?
No, as you say, there is no support for this in Solr. From a Solr perspective, it's more related to how you build your web application. I would recommend asking yourself this:
When tracking the reading statistics of my users, do I need to index that information into Solr too?
The answer to this question depends on whether you need the information to facet, search, or use in the relevance model. Say, for example, you want a facet that allows your users to filter on read or unread documents; then of course you need to index this into Solr.
If you only want to present whether a document has been read (in the web interface), you might as well store this information in a SQL database and fetch it when presenting the results.
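For the display-only case, a minimal sketch of keeping read status in a SQL database and merging it into the search results at presentation time might look like this. The table name `read_status`, its columns, and the helper functions are assumptions made for illustration; an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from typing import Dict, List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE read_status (user_id TEXT, doc_id TEXT, "
    "PRIMARY KEY (user_id, doc_id))"
)

def mark_read(user_id: str, doc_id: str) -> None:
    """Record that the user opened the document (idempotent)."""
    conn.execute(
        "INSERT OR IGNORE INTO read_status (user_id, doc_id) VALUES (?, ?)",
        (user_id, doc_id),
    )

def annotate(user_id: str, doc_ids: List[str]) -> Dict[str, bool]:
    """Map each search-result ID to True if this user has read it."""
    if not doc_ids:
        return {}
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        f"SELECT doc_id FROM read_status "
        f"WHERE user_id = ? AND doc_id IN ({placeholders})",
        [user_id, *doc_ids],
    )
    read = {row[0] for row in rows}
    return {doc_id: doc_id in read for doc_id in doc_ids}

mark_read("u1", "report-2")
status = annotate("u1", ["report-1", "report-2"])
# status == {"report-1": False, "report-2": True}
```

Here `doc_ids` would be the IDs of the documents returned by Solr for the current results page, so only one extra lookup per page is needed.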
