Getting a more friendly result highlight on Azure Search for PDFs

I'm indexing PDFs into an Azure Cognitive Search index and created a search page.
I'm using the highlights feature, but the returned content is pretty ugly. Sometimes it even returns content from the footer of the files. Spaces are removed and it looks like a mess.
To index the files I'm using the default configuration. The 'content' field uses a Standard Analyzer. I have configured an Indexer connected to a Data Source that points to an Azure Storage folder.
What is the best approach to display a more user-friendly excerpt of the file?
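For reference, this is roughly how I request highlights today, sketched with the REST API from Python; the service name, index name, key and api-version are placeholders:

import requests

# Placeholders: service URL, index name, query key and api-version.
SERVICE = "https://<my-service>.search.windows.net"
INDEX = "pdf-index"
API_KEY = "<query-key>"

resp = requests.post(
    f"{SERVICE}/indexes/{INDEX}/docs/search?api-version=2020-06-30",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "search": "contract renewal",      # example user query
        "highlight": "content",            # ask for hit-highlighted fragments of 'content'
        "highlightPreTag": "<em>",
        "highlightPostTag": "</em>",
        "top": 10,
    },
)
for doc in resp.json().get("value", []):
    # '@search.highlights' holds the raw fragments that currently look messy.
    print(doc.get("@search.highlights", {}).get("content"))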

Related

How to store the operations performed by a user in Elastic Search?

I am making a React application that will have a few buttons for operations like sorting, removing duplicates, etc. There will be an input field for the text (for example, the text to be sorted) and an output field to display the result. The application also has login and sign-up pages, so that only registered users can use it. Now I want a history view of all the operations performed by each user. To store the history, I will be using Elasticsearch. I'm absolutely new to Elasticsearch, so I would like to get a rough idea or a blueprint of the steps I will have to follow. Much appreciated.
I've used Java on the server side.
Below is a high-level idea you can implement for user tracking:
Log each user action to a log file.
Use the Logstash File Input plugin or the Filebeat Log input to read the file and index it into Elasticsearch via the Elasticsearch output plugin.
Use Kibana for visualization.
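To make the moving parts concrete, here is a minimal sketch of what one indexed "operation" event could look like, shown with the official Python Elasticsearch client purely for illustration (your backend is Java, and in the pipeline above Logstash/Filebeat would do the indexing; the index name and fields are made up):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: a local single-node cluster

# Hypothetical shape of one event logged per user action.
event = {
    "user_id": "user-123",
    "operation": "sort",                 # e.g. sort, remove_duplicates, ...
    "input_text": "banana apple",
    "output_text": "apple banana",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Index the event; in the pipeline above, Logstash/Filebeat would read such
# events from your log file and perform this indexing step for you.
es.index(index="user-operations", document=event)

# History view: the latest operations of one user, newest first.
history = es.search(
    index="user-operations",
    query={"term": {"user_id.keyword": "user-123"}},
    sort=[{"timestamp": {"order": "desc"}}],
    size=20,
)["hits"]["hits"]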

Logic Apps SharePoint Connector - When a file is created or modified in a folder

I have an integration scenario where I need to pull documents from SharePoint Online and submit the content (PDF) to a downstream API. I'm using the When a file is created or modified in a folder trigger to pull documents from a SharePoint library. My question is whether there is a way to set a filter on the trigger so that only PDF documents are retrieved i.e. (*.pdf)?
For this requirement, you can open the "Settings" of your trigger.
Then add the expression @contains(triggerBody()?['{FilenameWithExtension}'], '.pdf') to "Trigger Conditions".
After that, the logic app will only be triggered for files whose name contains .pdf.
Update:
You can change the expression to:
@contains(trigger().outputs?['headers']?['x-ms-file-name'], '.pdf')

Azure search: use a single index on multiple data sources

I have multiple Azure tables across multiple Azure storage accounts that have the exact same format. Is it possible to configure several data sources in Azure Search to use a single index, so that a search on this index would return results aggregated from all data sources (Azure tables)?
So far, each time I configure a new data source and the corresponding indexer, I must create a new index (with a new index name). Attempting to reuse an existing index name results in an error stating "Another index with this name already exists".
Thank you for any help or pointer you might provide.
Yes, it's possible, but we don't currently support it in the Azure Portal.
When you go through the "import data" flow in the portal, it'll create a data source, indexer and index for you.
If you want more sources for that index, you need to create new data sources and indexers, with the new indexers pointing at the existing index. Unfortunately this is not currently supported from the portal. You can do it using the .NET SDK (if you're using .NET), directly using the REST API from your app, or using any tool that can make HTTP requests such as PowerShell, curl or Fiddler.
The documentation that describes the indexer-related REST APIs is here:
https://msdn.microsoft.com/en-us/library/azure/dn946891.aspx
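As a rough sketch, the two REST calls for one additional table could look like this from Python; the service name, key, connection string, resource names and api-version are placeholders:

import requests

# Placeholders: service URL, admin key, connection string, names and api-version.
SERVICE = "https://<my-service>.search.windows.net"
HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
API = "api-version=2020-06-30"

# 1. A second data source pointing at a table in another storage account.
requests.post(
    f"{SERVICE}/datasources?{API}",
    headers=HEADERS,
    json={
        "name": "table-datasource-2",
        "type": "azuretable",
        "credentials": {"connectionString": "<storage-account-2-connection-string>"},
        "container": {"name": "mytable"},
    },
)

# 2. A second indexer that targets the SAME existing index.
requests.post(
    f"{SERVICE}/indexers?{API}",
    headers=HEADERS,
    json={
        "name": "table-indexer-2",
        "dataSourceName": "table-datasource-2",
        "targetIndexName": "my-existing-index",   # reuse the index created earlier
    },
)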

How to delete a search Index itself

The Search Index class has a method to delete documents:
https://cloud.google.com/appengine/docs/standard/python/search/indexclass (Python)
https://cloud.google.com/appengine/docs/standard/java/javadoc/com/google/appengine/api/search/Index (Java)
But how do I delete the index itself?
The empty index is still listed in the Text Search panel of the Admin Console, but there is no button to delete it.
Since you have tagged gae-search I assume your question refers to an index of the Search API (i.e. full text search service, not NDB/HRD datastore index).
Currently you can only delete the documents in an index, but you can't delete the index itself; see e.g. issues 8235 and 8490. This restriction of the Search API applies to all languages supported in Google App Engine.
The vacuum_indexes command only prompts you for datastore indexes; something similar is missing for the search service.
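Until that changes, the usual workaround is to delete the documents in batches so the index is at least empty; a minimal Python sketch of that loop (the index name is just an example):

from google.appengine.api import search

def delete_all_documents(index_name):
    """Delete every document in a Search API index; the (empty) index itself remains."""
    index = search.Index(name=index_name)
    while True:
        # delete() accepts at most 200 ids per call, so fetch ids in batches of 200.
        document_ids = [doc.doc_id for doc in index.get_range(ids_only=True, limit=200)]
        if not document_ids:
            break
        index.delete(document_ids)

delete_all_documents('myIndex')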
When running a local dev environment with version 1.9.x (and possibly earlier), you can pass this argument to dev_appserver.py to simply clear ALL of them regardless of whether there are documents in them:
--clear_search_indexes [CLEAR_SEARCH_INDEXES]
It doesn't look like there's a way to clear an individual index yet though based on the issue statuses posted above by Ani.

Data security in result sets from Elastic Search, Solr, or similar

I need to add full-text search capabilities to my existing database. Of course the first thought is something like Solr or Elasticsearch. The blocking point I've reached is: how to securely display results returned from the underlying search engine (let's think about Solr or Elasticsearch for now, although any other solution or engine that addresses the point is also appreciated).
The tricky context is that my system has, for example, Personal Profile records that are to be indexed. One of the fields in a personal profile is the manager's feedback. Normally that field is visible only to the employee's direct manager and higher in the hierarchy, i.e. a 'manager' from another branch will not be able to see it. However, I want that field to be searchable via full-text search, but only for people who can actually see it.
Now I query Solr for 'stupid' (that is the query string) and it returns N documents. When returning them to the end user I'll remove the 'Manager's feedback' field, because the end user is not the manager of those people; but the mere presence of a document in the result set is already evidence of the 'stupid' guys…
The question is: what is a workable approach to handle this use case? Is it possible to plug a home-grown security filter for outputs into Solr/ES?
Caveats:
filtering out only fields does not work because of the scenario mentioned above
filtering out complete documents will not work either, because:
the search engine does not tell which fields matched, so there is no way to manually filter the result set by field http://elasticsearch-users.115913.n3.nabble.com/Best-way-to-return-which-field-matched-td2713071.html
even if this did work, removing documents from the result set would spoil the facets (e.g. number of matches by department) returned by the engine; I'd have to either recalculate facets manually, or they would not match the manually filtered records and would reveal what I actually do not want to show to end users
In Solr you can create multiValued fields. In your case you can use one to store de-normalized values of the organization structure.
In the described scenario you would create a multi-valued field ouId (organization unit id) and store the employee's ouId and all parent ouIds. In other words, you save the allowed ouIds in this field.
In the search scenario you then use a filter query, the fq parameter, filtering by the manager's ouId.
Example:
..&fq=ouId:12
where 12 is the organization unit id of the selected manager.
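For illustration, the same query issued from application code, here via Python requests against Solr's select handler (the core name, field list and values are just examples):

import requests

SOLR_SELECT = "http://localhost:8983/solr/profiles/select"   # 'profiles' core is an example

params = {
    "q": "stupid",       # the user's full-text query
    "fq": "ouId:12",     # filter: only documents whose allowed ouIds contain the manager's ouId
    "fl": "id,name",     # return only fields this user is allowed to see
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params).json()
for doc in response["response"]["docs"]:
    print(doc)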
Maybe this is helpful for you: https://github.com/salyh/elasticsearch-security-plugin. It adds document-level security to Elasticsearch.
"Currently for user based authentication and authorization Kerberos/SPNEGO and NTLM are supported through 3rd party library waffle (only on windows servers). For UNIX servers Kerberos/SPNEGO is supported through tomcat build in SPNEGO Valve (Works with any Kerberos implementation. For authorization either Active Directory and generic LDAP is supported). PKI/SSL client certificate authentication is also supported (CLIENT-CERT method). SSL/TLS is also supported without client authentication.
You can use this plugin also without Kerberos/NTLM/PKI but then only host based authentication is available.
As of now two security modules are implemented:
Actionpathfilter: Restrict actions against Elasticsearch on a coarse-grained level like who is allowed to to READ, WRITE or even ADMIN rest api calls
Document level security (dls): Restrict actions on document level like who is allowed to query for which fields within a document"
