Azure Search Service Duplicate Results


Hello Azure Search Team,
We are seeing an issue in our facade, which uses the .NET SDK to query the Azure Search service: paginated search results include duplicate documents.
Everything worked fine until yesterday, while we had a single Azure Search instance with no partitioning or replication of the index. After we changed the service to 3 replicas and 2 partitions to improve performance, we started seeing this behavior.
Our queries are simple, for example search=*&$skip=0&$top=50 (page 1) and search=*&$skip=50&$top=50 (page 2). Page 2 now contains some of the documents from page 1.
Can someone suggest what could cause this behavior? Our expectation was that, irrespective of the number of results, using the correct skip and top values would let us page through the content.
NOTE:
We have already verified that the index data doesn't contain duplicate entries.
It works as expected when there is a single Azure Search service instance with no replication or partitioning.
We also eliminated the facade from the picture by querying the Azure Search service directly from the portal.

Related

Azure Search Design Question: Omit result if User has seen this result

I'm trying to design a solution where I don't have to use the SQL Server Database to answer a question: Show me Azure Index search results where the user has never seen this search result.
I can keep track of user document "views" in my SQL database, but how do I extend this functionality to Azure Search Index queries?
I mean I could do a $filter where document id is not in (1,2,3,etc), or I could filter the Index results before the user ever sees them from the server.
I'm just wondering if there's a more clever way to do this?
Thanks for your help!

The best way to achieve this is the first option you mentioned. Once the first query comes in on a user session, save the document ids that were returned, then build a filter that excludes those ids from subsequent queries in the same session.
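A minimal sketch of building such an exclusion filter. It assumes the key field is called id, is marked filterable, and that ids contain no commas (all assumptions, not details from the question). It uses the search.in OData function; the helper name build_exclusion_filter is made up for illustration.

```python
def build_exclusion_filter(seen_ids, key_field="id"):
    """Return an OData $filter string that excludes already-seen document keys.

    Returns None when there is nothing to exclude, so the caller can
    simply omit the $filter parameter on the first query of a session.
    """
    if not seen_ids:
        return None
    # OData string literals escape a single quote by doubling it.
    joined = ",".join(str(i).replace("'", "''") for i in seen_ids)
    # search.in(field, 'v1,v2,...', ',') matches any value in the list;
    # "not" inverts it so that only unseen documents are returned.
    return f"not search.in({key_field}, '{joined}', ',')"


seen = ["1", "2", "3"]  # ids collected from earlier pages in this session
print(build_exclusion_filter(seen))
```

The resulting string would be passed as the $filter parameter of each later query. For long sessions the list grows with every page, so it may be worth capping how many ids are tracked.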

Data in database and google analytics does not match

Why are the counts I see in my database different from what I see in Google Analytics? The goal conversion number shown in Google Analytics is much lower than what I see in the database. This has been the case for several months.

A few possible reasons:
Sampled data vs. unsampled data: You can read about it here: https://support.google.com/analytics/answer/1042498?hl=en - For API work I normally use a web query explorer to check that my API calls are being sent and that the responses match, to verify the data: https://ga-dev-tools.appspot.com/explorer/
Adblockers: You might get hits/submissions from people who are using an ad blocker. Their submissions reach your database but are never reported to Google Analytics, hence more entries in the database.
Users vs. Sessions vs. Hits: You may be looking at Unique Visitors/Sessions in Google Analytics instead of the total number of "Events". I'm not sure how your Goal is set up, but it is best to use events and look at "Total Events" and "Unique Events" to get a sense.
Implementation: You may be firing the JavaScript after the person has hit the button without waiting for the page change. This can happen on some sites where you take them to a thank-you page or similar. It is best to check how this is set up and the order in which the tag fires and the page changes.

Azure search: use a single index on multiple data sources

I have multiple Azure tables across multiple Azure storage accounts that have the exact same format. Is it possible to configure several data sources in Azure Search to use a single index, so that a search on this index would return the results aggregated from all data sources (Azure tables)?
So far, each time I configure a new data source, I must create a new index (with a new index name) for it. Attempting to reuse an existing index name results in an error stating "Another index with this name already exists".
Thank you for any help or pointer you might provide.

Yes, it's possible, but we don't currently support it in the Azure Portal.
When you go through the "import data" flow in the portal, it'll create a data source, indexer and index for you.
If you want more sources for that index, you need to create new data sources and indexers, with the new indexers pointing at the existing index. Unfortunately this is not currently supported from the portal. You can do it using the .NET SDK (if you're using .NET), directly using the REST API from your app, or using any tool that can make HTTP requests such as PowerShell, curl or Fiddler.
The documentation that describes the indexer-related REST APIs is here:
https://msdn.microsoft.com/en-us/library/azure/dn946891.aspx
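As a rough sketch of what that looks like against the REST API, the snippet below only builds the JSON bodies for one data source and one indexer per table, with every indexer pointing at the same existing index. All names (shared-index, ds-tablea, ix-tablea, the placeholder connection strings) are hypothetical. Actually sending the bodies (a POST to the service's datasources and indexers endpoints with an api-key header and an api-version parameter) is left out.

```python
import json

# Hypothetical name of the index that already exists on the service.
SHARED_INDEX = "shared-index"


def build_data_source(name, table_name, connection_string):
    """Request body for creating a data source over one Azure table."""
    return {
        "name": name,
        "type": "azuretable",
        "credentials": {"connectionString": connection_string},
        "container": {"name": table_name},
    }


def build_indexer(name, data_source_name, target_index_name=SHARED_INDEX):
    """Request body for an indexer that writes into an existing index."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
    }


# One (table, connection string) pair per storage account; placeholders only.
tables = [
    ("tablea", "<connection string for storage account A>"),
    ("tableb", "<connection string for storage account B>"),
]

payloads = []
for table_name, conn in tables:
    ds = build_data_source(f"ds-{table_name}", table_name, conn)
    ix = build_indexer(f"ix-{table_name}", ds["name"])
    payloads.append((ds, ix))

for ds, ix in payloads:
    print(json.dumps(ds, indent=2))
    print(json.dumps(ix, indent=2))
```

One thing to watch with this layout: documents from different tables that end up with the same key value in the index would overwrite each other, so the keys need to be distinct across sources.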

Cloudant CDTDatastore to pull only part of the database

We're using Cloudant as the remote database for our app. The database contains documents for each user of the app. When the app launches, we need to query the database for all the documents belonging to a user. What we found is the CDTDatastore API only allows pulling the entire database and storing it inside the app then performing the query in the local copy. The initial pulling to the local datastore takes about 10 seconds and I imagine will take longer when adding more users.
Is there a way I can save only part of the remote database to the local datastore? Or, are we using the wrong service for our app?

You can use a server-side replication filter function; you'll need to add information about your filter to the pull replicator. However, replication will take a performance hit when using the function.
That being said, a common pattern is to use one database per user. This has other trade-offs, though, and it is something you should read up on. There is some information on the one database per user pattern here.
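For illustration, a server-side filter lives in a design document on the remote database. The sketch below builds such a document as JSON. The design document name (app), the filter name (by_owner) and the owner field on each document are all assumptions, not something from the question. The filter body itself is JavaScript, because that is what the server evaluates.

```python
import json

# JavaScript filter evaluated by the server for every candidate document.
# Assumes each document carries an "owner" field (hypothetical schema).
FILTER_JS = """function (doc, req) {
  // Replicate only the documents owned by the requested user.
  return doc.owner === req.query.owner;
}"""

# Design document to store in the remote database.
design_doc = {
    "_id": "_design/app",
    "filters": {"by_owner": FILTER_JS},
}

# What the pull replicator would need to be told (names are illustrative):
filter_name = "app/by_owner"          # "<design doc>/<filter>"
filter_params = {"owner": "user-42"}  # becomes req.query in the filter

print(json.dumps(design_doc, indent=2))
```

The filter runs on the server for each document considered during replication, which is where the performance hit mentioned above comes from.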

Data security in result sets from Elastic Search, Solr or

I need to add full-text search capabilities to my existing database. Naturally, the first thing to turn to is something like Solr or Elastic Search. The blocking point I’ve hit is how to securely display the results returned by the underlying search engine (let’s think about Solr or Elastic Search for now, although any other solution or engine that solves the problem is also appreciated).
The tricky part is that my system has, for example, Personal Profile records that are to be indexed. One of the fields in a personal profile is the manager’s feedback. Normally that field is visible only to the employee’s direct manager and those higher in the hierarchy, i.e. a ‘manager’ from another branch cannot see it. However, I want that field to be searchable via full-text search, but only by people who can actually see it.
Now suppose I query Solr for ‘stupid’ (that is the query string) and it returns N documents. Before returning them to the end user I’ll remove the ‘Manager’s feedback’ field, because the end user is not the manager of those people, but the mere presence of a document in the result set is already evidence of who the ‘stupid’ guys are …
The question is: what is a workable approach to handle this use case? Is it possible to plug a home-grown security filter for outputs into Solr/ES?
Caveats:
filtering out only fields does not work because of the scenario described above
filtering out complete documents will not work because:
the search engine does not tell which fields matched, so there is no way to manually filter the result set by field http://elasticsearch-users.115913.n3.nabble.com/Best-way-to-return-which-field-matched-td2713071.html
even if this did work, removing documents from the result set would distort the facets (e.g. number of matches by department) returned by the engine; I would either have to recalculate the facets manually, or they would not match the manually filtered records and would reveal what I do not want to show to end users

In Solr you can create multiValued fields. In your case you can use one to store de-normalized values of the organization structure.
In the described scenario you would create a multiValued field ouId (Organization Unit Id) and store in it the employee's ouId and all parent ouIds. In other words, this field holds the ouIds that are allowed to see the document.
At search time you would use a filter query (the fq parameter) to filter by the manager's ouId.
Example:
..&fq=ouId:12
where 12 is the organization unit id of the selected manager.
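A small sketch of both halves of this idea, under assumed data: the parent_of map and the unit ids 40, 12 and 1 are made up. It computes the ouId values to store on a profile at index time and the query string to send at search time. The helper names are hypothetical and no Solr client library is involved.

```python
from urllib.parse import urlencode


def allowed_ou_ids(ou_id, parent_of):
    """Return the unit itself plus every ancestor, walking up the org tree."""
    ids = []
    current = ou_id
    while current is not None:
        ids.append(current)
        current = parent_of.get(current)
    return ids


def build_query(text, manager_ou_id):
    """Build the q/fq part of a Solr select URL for the given manager."""
    return urlencode({"q": text, "fq": f"ouId:{manager_ou_id}"})


# Hypothetical org tree: unit 40 reports to 12, which reports to root unit 1.
parent_of = {40: 12, 12: 1, 1: None}

# Index time: an employee in unit 40 gets ouId = [40, 12, 1].
print(allowed_ou_ids(40, parent_of))

# Search time: a manager of unit 12 only matches documents tagged with 12.
print(build_query("excellent", 12))
```

With this layout a manager of unit 12 matches the employee in unit 40, while a manager of an unrelated unit does not, because that unit's id never appears in the employee's ouId field.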

Maybe this is helpful for you: https://github.com/salyh/elasticsearch-security-plugin. It adds document-level security to Elasticsearch.
"Currently for user based authentication and authorization Kerberos/SPNEGO and NTLM are supported through 3rd party library waffle (only on windows servers). For UNIX servers Kerberos/SPNEGO is supported through tomcat build in SPNEGO Valve (Works with any Kerberos implementation. For authorization either Active Directory and generic LDAP is supported). PKI/SSL client certificate authentication is also supported (CLIENT-CERT method). SSL/TLS is also supported without client authentication.
You can use this plugin also without Kerberos/NTLM/PKI but then only host based authentication is available.
As of now two security modules are implemented:
Actionpathfilter: Restrict actions against Elasticsearch on a coarse-grained level like who is allowed to READ, WRITE or even ADMIN rest api calls
Document level security (dls): Restrict actions on document level like who is allowed to query for which fields within a document"