Data security in result sets from Elasticsearch or Solr

I need to add full-text search capabilities to my existing database, and the natural first stop is something like Solr or Elasticsearch. The point I'm blocked on is how to securely display the results returned from the underlying search engine (let's focus on Solr or Elasticsearch for now, though any other solution or engine that addresses the point is also appreciated).
The tricky context: my system contains Personal Profile records that are to be indexed. One of the fields in a personal profile is the manager's feedback. Normally that field is visible only to the employee's direct manager and those higher up the hierarchy, i.e. a 'manager' from another branch cannot see it. However, I want the field to be searchable via full-text search, but only by people who are actually allowed to see it.
Now suppose I query Solr for 'stupid' (that is the query string) and it returns N documents. When returning them to the end user I can strip the 'Manager's feedback' field, since the end user is not the manager of the people in question – but the mere presence of a document in the result set is already evidence that someone was called 'stupid'…
The question is: what is a workable approach to this use case? Is it possible to plug a home-grown security filter for outputs into Solr/ES?
Caveats:
filtering out only the fields does not work, because of the scenario described above
filtering out complete documents will not work, because:
the search engine does not tell you which fields matched, so there is no way to manually filter the result set by field: http://elasticsearch-users.115913.n3.nabble.com/Best-way-to-return-which-field-matched-td2713071.html
even if it did, removing documents from the result set would spoil the facets returned by the engine (e.g. number of matches by department) – I would have to either recalculate facets manually, or they would no longer match the manually filtered records and would reveal exactly what I do not want to show to end users

In Solr you can create multiValued fields. In your case you can use one to store denormalized values from the organization structure.
In the described scenario you would create a multiValued field ouId (Organization Unit Id) and store the employee's ouId plus all parent ouIds. In other words, you save the set of allowed ouIds into this field.
At search time you would then use a filter query – the fq parameter – filtering by the manager's ouId.
Example:
..&fq=ouId:12
where 12 is organization unit id of selected manager.
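To make this concrete, here is a minimal sketch of the approach in Python using the requests library against Solr's HTTP API; the core name (profiles), field names and ids are illustrative placeholders, not taken from the question:

import requests

SOLR = "http://localhost:8983/solr/profiles"

# Index a profile: ouId is the multiValued field holding the employee's own
# organization unit plus every parent unit up the hierarchy, i.e. the set
# of units whose managers are allowed to see this document.
doc = {
    "id": "employee-42",
    "feedback": "...",        # the sensitive manager's-feedback field
    "ouId": [12, 7, 1],
}
requests.post(SOLR + "/update?commit=true", json=[doc])

# Search as a manager belonging to organization unit 12. The fq clause
# silently drops every document this manager may not see, so both the
# hit list and the facet counts stay consistent with the user's rights.
params = {"q": "stupid", "fq": "ouId:12", "wt": "json"}
results = requests.get(SOLR + "/select", params=params).json()

Because the filtering happens inside the engine rather than on the returned result set, the facet problem from the question's caveats does not arise.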

Maybe this is helpful for you: https://github.com/salyh/elasticsearch-security-plugin. It adds document-level security to Elasticsearch.
"Currently for user based authentication and authorization Kerberos/SPNEGO and NTLM are supported through the 3rd party library waffle (only on Windows servers). For UNIX servers Kerberos/SPNEGO is supported through Tomcat's built-in SPNEGO Valve (works with any Kerberos implementation; for authorization either Active Directory or generic LDAP is supported). PKI/SSL client certificate authentication is also supported (CLIENT-CERT method). SSL/TLS is also supported without client authentication.
You can use this plugin also without Kerberos/NTLM/PKI, but then only host-based authentication is available.
As of now two security modules are implemented:
Actionpathfilter: restricts actions against Elasticsearch on a coarse-grained level, such as who is allowed to make READ, WRITE or even ADMIN REST API calls
Document level security (dls): restricts actions on the document level, such as who is allowed to query for which fields within a document"

Related

Azure AD provisioning requires two runs to succeed with custom app

We've created an application using SCIM 2 SDK from PingIdentity for provisioning with Azure AD. Custom mapping is set up and working.
However, when the user is CREATED, all of the fields are included in the import, but only a few fields are included in the provisioning step and sent to our application. Provisioning needs to run a second time on that user to UPDATE in order for all the fields to be included. Amongst other things, this means that first and last name are not split and it only sends the displayname (which ends up as firstname on our end).
For some users in normal provisioning, it can take days between the create and update runs so we're missing data for a long time.
Does anyone know how we can test for what's causing this and fix it so that all fields are included in the initial CREATE run for a user?
Here are the attribute mapping settings: https://imgur.com/ypfAAmD
And an example log of when the user is created with only basic fields: https://imgur.com/iOXACJh
vs. when the user is updated with all the other fields: https://imgur.com/UqDNyCv
I'm a product manager at Microsoft who works on the provisioning service and our SCIM client.
The behavior you're seeing occurs when you have attributes that are not part of the SCIM core schema included as "short" names. Attributes not defined in the SCIM core schema (RFC 7643) should have full URN syntax. Something to the effect of urn:ietf:params:scim:schemas:extension:appName:2.0:User:attributeName is commonly used by other implementations. The shaky behavior you're seeing where the AAD provisioning service fails to send these attribute values via a POST but later includes them in a PATCH comes down to different code paths in the AAD provisioning service, and the PATCH code happens to handle this differently than the POST code. This is purely by chance, however, and isn't an intentional design choice. At some future point I'm hoping we'll make this more consistent and disallow incorrectly structured attribute names entirely.
If you adjust your attribute names to align with the guidance in the SCIM spec's schema RFC and provide the attributes with fully defined URNs, you should see consistent behavior that works on both POST and PATCH.
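For illustration, here is a hedged sketch of what a user payload with a fully namespaced extension attribute looks like on the wire; the appName extension schema and the costCenter attribute are placeholders, not from the question:

{
  "schemas": [
    "urn:ietf:params:scim:schemas:core:2.0:User",
    "urn:ietf:params:scim:schemas:extension:appName:2.0:User"
  ],
  "userName": "jdoe@example.com",
  "name": { "givenName": "Jane", "familyName": "Doe" },
  "urn:ietf:params:scim:schemas:extension:appName:2.0:User": {
    "costCenter": "4130"
  }
}

With the extension attribute expressed under its full URN, both the POST (create) and PATCH (update) code paths should resolve it consistently.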

Microsoft Graph AD Users or people API to search all users?

I'm trying to build functionality into my app for 'admins' to assign users from their AD group to certain groups that are further assigned to app-specific roles. Basically a simple management component.
Adding a user by oid to a group is easy; the problem I'm facing is finding the actual user.
Currently, the only option I'm seeing is making multiple API requests to v1.0/users (999 items max per page), collecting them all in memory, and then providing a simple search function to narrow the list down.
I have also tried the v1.0/me/people endpoint to search for users, but it does not return all users from the AD group, just the relevant users the caller deals with, so it's not too useful.
Is there any other API endpoint I could tap into to do a search ONLY on members of the same Active Directory?
Using the startsWith filter on multiple properties is probably the closest we can get to user search in MS Graph at the moment:
https://graph.microsoft.com/v1.0/users?$filter=startswith(displayName,'sarah') or startswith(givenName,'sarah') or startswith(surname,'sarah') or startswith(mail,'sarah') or startswith(userPrincipalName,'sarah')
Ended up switching to the old AD Graph API and implementing a query on the endpoint as follows:
https://graph.windows.net/{ tenant ID }/users?api-version=1.6&$select=mail,displayName,objectId,givenName,surname&$filter=startswith(givenName,'SEARCH TERM') or startswith(surname,'SEARCH TERM')
If the function receives a single parameter, it searches for it in both givenName and surname, but you could configure this to search across any other supported fields.
You could also ditch the $select= entirely to get the whole data set. I didn't want the clutter, though, and those keys are enough for me.
Instead of going with startswith, you may get a better experience using the $search keyword:
https://learn.microsoft.com/en-us/graph/api/user-list?view=graph-rest-1.0&tabs=http#example-6-use-search-to-get-users-with-display-names-that-contain-the-letters-wa-including-a-count-of-returned-objects
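As a sketch (Python with the requests library; acquiring the bearer token via MSAL or similar is assumed to happen elsewhere), a $search call looks like this – note the ConsistencyLevel: eventual header, which Graph requires for $search on directory objects:

import requests

token = "..."  # bearer token acquired elsewhere (e.g. via MSAL)
resp = requests.get(
    "https://graph.microsoft.com/v1.0/users",
    params={"$search": '"displayName:wa"', "$count": "true"},
    headers={
        "Authorization": "Bearer " + token,
        "ConsistencyLevel": "eventual",  # required for $search on users
    },
)
users = resp.json()["value"]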

Azure search: use a single index on multiple data sources

I have multiple Azure tables across multiple Azure storage that have the exact same format. Is it possible to configure several data sources in Azure-search to use a unique Index so that a search on this Index would return the results aggregated from all data sources (Azure tables)?
So far, each time I configure a new data source, I must also create a new index (with a new index name); attempting to reuse an existing index name results in the error "Another index with this name already exists".
Thank you for any help or pointer you might provide.
Yes, it's possible, but we don't currently support it in the Azure Portal.
When you go through the "import data" flow in the portal, it'll create a data source, indexer and index for you.
If you want more sources for that index, you need to create new data sources and indexers, with the new indexers pointing at the existing index. Since the portal doesn't support this, you can do it using the .NET SDK (if you're using .NET), directly using the REST API from your app, or using any tool that can make HTTP requests, such as PowerShell, curl or Fiddler.
The documentation that describes the indexer-related REST APIs is here:
https://msdn.microsoft.com/en-us/library/azure/dn946891.aspx
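As an illustration, here is a hedged sketch (Python with requests) of adding a second data source and a second indexer that targets the existing index; the service URL, api-version, and all resource names are placeholders:

import requests

SERVICE = "https://<your-service>.search.windows.net"
HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
PARAMS = {"api-version": "2015-02-28"}  # the version current when this was written

# A second data source pointing at the other storage account's table
requests.post(SERVICE + "/datasources", headers=HEADERS, params=PARAMS, json={
    "name": "table-source-2",
    "type": "azuretable",
    "credentials": {"connectionString": "<storage-account-2-connection-string>"},
    "container": {"name": "mytable"},
})

# A second indexer whose targetIndexName is the EXISTING index, so documents
# from both sources end up aggregated in one searchable index
requests.post(SERVICE + "/indexers", headers=HEADERS, params=PARAMS, json={
    "name": "table-indexer-2",
    "dataSourceName": "table-source-2",
    "targetIndexName": "existing-index",
})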

How to find Uniqueness in Azure Usage Records from Billing APIs

I am building an Azure chargeback solution, and for that I am pulling Azure usage data from the Azure Billing REST APIs for multiple subscriptions and different dates. I need to store this in a custom MS SQL database as per the customer's requirements. I get various usage records back from Azure.
Problem: from these usage records, I am not able to find any combination of columns in the data I receive that gives me a unique key to identify a usage record for a particular subscription and a specific date. The only column I see differing is Quantity, but even that can be duplicated. E.g. if there are 2 VMs of type A1 with no data or applications on them, in the same cloud service, then they will have exactly the same usage quantity. I do not get the exact name of the VM or any other resource via the Usage APIs.
One custom solution (ineffective): I can append a counter or unique ID to the usage records, but if I fetch the data again the order may shuffle or new data may be introduced, breaking the logic for uniqueness. Any logic I build to check whether data is missing in the DB will be buggy if there is any change in the order in which usage records are returned (for a specific subscription for a specific date).
I am sure that Microsoft stores this data in some database. I can’t find the unique id to identify a usage record from many records returned by the Billing API. Maybe I am missing something here.
I would appreciate any help or pointers on this.
When you call the Usage API set the ShowDetails parameter to true: &showDetails=true
MSDN Doc
This will populate the instance data in the returned JSON with the unique URI for the resource which includes the name for example:
Website:
"instanceData": "{\"Microsoft.Resources\":{\"resourceUri\":\"/subscriptions/xxx-xxxx/resourceGroups/mygoup/providers/Microsoft.Web/sites/WebsiteName\",\"...
Virtual Machine:
"instanceData": "{\"Microsoft.Resources\":{\"resourceUri\":\"/subscriptions/xxx-xxxx/resourceGroups/TESTINGBillGroup/providers/Microsoft.Compute/virtualMachines/TestingBillVM\",\...
If ShowDetails is false, all your resources will be aggregated on the server side based on resource type; all your websites, for example, will show as one entry.
The resource URI, date range and meterId will form a unique key as far as I know.
NOTE: If you are using the legacy API your VMs will be aggregated under the cloud service hosting them.
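As a sketch of how such a key could be assembled (Python; the property names follow the Usage API's JSON shown above, but treat the exact composite as an assumption to validate against your own data):

import json

def usage_record_key(record):
    # instanceData is a JSON-encoded string nested inside the JSON record,
    # so it needs its own parse before the resourceUri can be read.
    instance = json.loads(record["properties"]["instanceData"])
    resource_uri = instance["Microsoft.Resources"]["resourceUri"]
    return (
        resource_uri,
        record["properties"]["usageStartTime"],
        record["properties"]["usageEndTime"],
        record["properties"]["meterId"],
    )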

GAE datastore -- proper ways to implement search/data retrieval in response to a user request?

I am writing a web app and I am trying to improve the performance of search/displaying results. I am relatively new to programming this sort of thing, so I apologize in advance if these are simple questions/concepts.
Right now I have a database of ~20,000 sites, each with properties, and I have a search form that (for now) just asks the database to pull all sites within a set distance (for this example, say 50km). I have put the data into an index and use the Search API to find sites.
I am noticing that the database search takes ~2-3 seconds to:
1) Search the index
2) Get a list of key names (this is stored in the search index)
3) Using key names, pull from datastore (in a loop) and extract data properties to be displayed to the user
4) Transmit data to the user via jinja template variables
This is also only getting 20 results (the default maximum for a Search API query; I haven't implemented cursors here yet, although I will have to).
For whatever reason, it feels quite slow. I am wondering what websites do to make the process seem faster. Do they implement some kind of asynchronous search, where a page loads while the search/data pulls are processed in the background and the results are subsequently shown to the user?
Are there "standard" ways of performing searches here where the processing/loading feels seamless to the user?
Thanks.
edit
Would something like passing a "query ID" via the page work, and then using AJAX to get the data from the datastore as JSON? Like... can App Engine redirect the user to the final page with only a "query ID", run the search in the meantime, and then, once the data is ready, pass the information to the user via JSON?
Make sure you are getting entities from the datastore in parallel. Since you already have the key names, you just have to pass your list of keys to the appropriate method.
For db:
MyModel.get_by_key_name(key_names)
For ndb:
ndb.get_multi([ndb.Key('MyModel', key_name) for key_name in key_names])
If you needed to do datastore queries, you could enable parallel fetches with the query.run (db) and query.fetch_async (ndb) methods.
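Putting the pieces together, a minimal sketch (Python on GAE; the index name 'sites', the geo field 'location', and MyModel are placeholders from the discussion above) of fetching all matched entities in one parallel batch after the Search API query:

from google.appengine.api import search
from google.appengine.ext import ndb

# Geo-restricted search against the index (50 km expressed in meters)
index = search.Index(name='sites')
results = index.search('distance(location, geopoint(40.714, -74.006)) < 50000')

# One batched datastore fetch instead of a get() call per loop iteration;
# this assumes each search document's doc_id holds the entity's key name.
key_names = [doc.doc_id for doc in results]
entities = ndb.get_multi([ndb.Key('MyModel', kn) for kn in key_names])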
