Azure search across 37 languages - azure-cognitive-search

We have a site which keeps English content and translated content (37 languages) in blob storage. Each in their own folder structure in the blob.
Examples below:
Insider-progams/windows-insider/en-us/…bunch of json files to index
Insider-progams/windows-insider/ru-ru/…bunch of json files to index
Insider-progams/windows-insider/es-es/…bunch of json files to index
Insider-progams/windows-insider/fr-fr/…bunch of json files to index
We currently have a .NET service which acts as an API to get this content and return it to our Angular web app. However, we are getting into scenarios where we need to search the content across all languages.
Does this mean we would need 37 separate indexes, one for each language? Or could we just pass a parameter to the API to get the data for the languages we want?
I am also concerned about how big an index might get and how long it would take to build. We will need to rebuild the index on demand when new content arrives.
We also intend to use the Search APIs for basic retrieval of our content, along with paging.

You don't need one index for each language, only one field per language. See https://learn.microsoft.com/en-us/azure/search/search-language-support on how to create the index with a different analyzer for each field/language.
Indexers can automatically track the blobs in the storage account to periodically pick up new blobs and incrementally build the index without reprocessing everything.
There are a few options for populating the index, depending on your requirements and use cases:
- The simplest option is to create one indexer per language, filtered by blob folder, targeting the corresponding field in the index.
- If you need more language capabilities or control structures, consider adding a skillset to the indexer, which can detect language, translate, extract key phrases, make decisions based on conditions, or plug in a custom skill.
- The last option is to manage adding/updating/deleting documents in the index yourself, which has the lowest latency between a change in the storage account and its reflection in the index.
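As a sketch of the one-index, field-per-language layout: the analyzer names below (`en.microsoft`, `ru.microsoft`, etc.) are real Azure Cognitive Search language analyzers, but the index and field names are made up, and this is just the shape of a REST API index definition expressed as a Python dict.

```python
# Hypothetical index definition: one searchable field per language,
# each with its language-specific analyzer, in a single index.
languages = {
    "en_us": "en.microsoft",
    "ru_ru": "ru.microsoft",
    "es_es": "es.microsoft",
    "fr_fr": "fr.microsoft",
}

index_definition = {
    "name": "site-content",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
    ] + [
        {
            "name": f"content_{lang}",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": analyzer,
        }
        for lang, analyzer in languages.items()
    ],
}
```

Adding the other 33 languages is then just a matter of extending the `languages` mapping, not creating new indexes.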
See the search API for searching documents and paging results.
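For paging, `search`, `top`, `skip` and `count` are the relevant parameters of the REST search API; a small sketch of building the paging portion of a query body (the query text itself is just an example):

```python
def page_params(page, page_size=20):
    """Build a paged Azure Search query body (REST API shape).

    'search', 'top', 'skip' and 'count' are real REST search
    parameters; the query text is a placeholder.
    """
    return {
        "search": "insider",
        "count": True,             # include total hit count in the response
        "top": page_size,          # page size
        "skip": page * page_size,  # zero-based page offset
    }
```

Note that deep paging via `skip` has service-side limits, so very deep result sets may need a different strategy (e.g. range filters on a sort key).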

Related

Full Text Search in ERP application - is Apache Lucene/Solr the right choice?

I'm currently investigating the tools necessary to add a fast, full-text search to our ERP SaaS application, with the aim of providing a single search entry point in the application that could search over the many different kinds of objects that compose the domain of the software.
The application (a Spring Java web application) is backed by a SQL Server RDBMS (using Hibernate as the ORM). There are hundreds of different tables, dozens of which (maybe more) should be searchable (usually there are one or more varchar columns in every table that should be indexed/searched).
Think for example of a single search bar where I can search customers, contracts, employees, articles...). This data is also updated very often (new inserts, deletes, updates...).
I found this article (www.chrisumbel.com/article/lucene_solr_sql_server) that shows how to connect a Sql Server db with Solr, posting a query example on the database that extracts the data used by Solr during the data import.
Since we have dozens (or more) of tables containing the searchable data, does that mean that, as a first step, we would have to integrate all the SQL queries that extract this data with Solr, in order to build the index?
Second question: not all the data is searchable by everyone (permissions and ad hoc filters), so how could we combine the full-text search provided by Solr with the more complex queries (joins on other tables, for example) this requires?
Thanks
You are nearly asking for a full-blown consulting project :-) But a few suggestions are possible.
Define Search Result Types: Search engines use denormalized data, i.e. you won't do any joins while querying (if you think you do, stick to your DB:-) That means you need to do the necessary joins while filling the index. This defines what you can search for. Most people "just" index documents or log-lines, so there is just one type of result. Sometimes people's profiles are included, sometimes a difference is made between results from different source systems where the documents come from, but in the end, there is a limited number of types of search results. And even more, they are nevertheless indexed into one and the same schema (where schemas are very malleable for search engines).
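To make the denormalization step concrete, here is a sketch in Python (all table and field names are invented): the join between customers and contracts happens while filling the index, so each search document is already flat and needs no join at query time.

```python
# Hypothetical rows from two tables; in practice these come from the
# SQL queries that extract the searchable data.
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
contracts = [
    {"id": 10, "customer_id": 1, "title": "Support 2024"},
    {"id": 11, "customer_id": 2, "title": "License renewal"},
]

by_id = {c["id"]: c for c in customers}

# One flat, denormalized search document per contract: the "join"
# is done here, at indexing time, not at query time.
docs = [
    {
        "id": f"contract-{c['id']}",
        "type": "contract",  # the search-result type
        "title": c["title"],
        "customer_name": by_id[c["customer_id"]]["name"],
    }
    for c in contracts
]
```

All result types can still go into one schema; the `type` field is what lets you group or filter them in the UI.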
Index: You know your SQL statements to extract your data. Converting to JSON and shoveling it into a search engine is not difficult. One thing to watch out for: while your DB changes, you keep indexing; whether that is an incremental or full "crawl" depends on how much logic you want to add. The trickiest part is to get deletes on the DB side into the index. If it's gone, it's gone: how do you know there was something that needs to be purged from the index :-)
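One common way to catch deletes, sketched as a set difference (assuming you can enumerate the IDs on both sides cheaply): anything present in the index but no longer in the database was deleted and must be purged.

```python
def ids_to_purge(db_ids, index_ids):
    """Return the IDs still in the search index but gone from the DB.

    A sketch: compare the full ID sets from both sides after a crawl;
    the difference is what the delete pass must remove from the index.
    """
    return set(index_ids) - set(db_ids)
```

For very large tables you would compare per-partition (e.g. per table, per ID range) rather than all IDs at once, but the principle is the same.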
Secure Search: Since you don't really join, applying access rights at query time requires two steps. During indexing, write the principal (group, user) names of those who may read your search result. At query time, get the user ID and expand it, recursively, to get all groups of the user. Add this as a query filter. Make sure to cache the filter, or even pre-compute it for all users quite regularly and store it in a fast store (the search index is one place, the DB would do too:-) Obviously you need to re-index if access rights change. The good thing is: as long as things only change in LDAP/AD, you don't need to re-index the data, only re-expand the groups of the affected users.
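The two steps can be sketched in plain Python (the group names and directory layout are invented; in practice the expansion would query LDAP/AD and the filter would be part of the Solr query):

```python
# Index time: each document carries the principals allowed to read it.
docs = [
    {"id": "d1", "text": "payroll report", "acl": ["hr-group"]},
    {"id": "d2", "text": "price list", "acl": ["everyone"]},
]

# Directory stand-in: principal -> groups it belongs to (possibly nested).
member_of = {
    "alice": ["hr-group"],
    "hr-group": ["everyone"],
    "bob": ["everyone"],
}

def expand_principals(user):
    """Recursively expand a user to all their principals (self + groups)."""
    seen = set()
    stack = [user]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(member_of.get(p, []))
    return seen

def visible(user, documents):
    """Query-time filter: keep docs whose ACL intersects the user's principals."""
    principals = expand_principals(user)
    return [d for d in documents if principals.intersection(d["acl"])]
```

The `expand_principals` result is exactly what you would cache or pre-compute per user.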
Ad hoc filters: If you want to filter for X, put X as a field into the index. At query time, apply the filter.

What's the difference between GAE Search API and Datastore queries?

I'm trying to understand which of the two options (the Search API or querying the Datastore) will be the most accurate for a search engine in my app.
I want something very scalable and fast. That means being able to find something among up to millions of results, very quickly. As soon as a piece of data has been registered, it must be immediately searchable.
I'm seeking to build an autocomplete search system inspired by Google's search (with suggested results in real time).
So, what is the more appropriate option for a Google App Engine user?
Thanks
Both Search API and the Datastore are very fast, and their performance is not dependent on the number of documents that you index.
Search API was specifically designed to index text documents. This means you can search for text inside these text documents, and any other fields that you store in the index. Search API only supports full-word search.
The Datastore will not allow you to index text inside your documents. The Datastore, however, allows searching for entries that start with a specific character sequence.
Neither of these platforms will meet your requirements out of the box. Personally, I used the Datastore to build a search engine for music meta-data (artist names, album and recording titles) which supports both search inside a text string and suggestions (entries that contain words at any position which start with a specific character sequence). I had to implement my own inverted index to accomplish that. It works very well, especially in combination with Memcache.
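A minimal sketch of such a hand-rolled inverted index (far simpler than a production one, and ignoring ranking, normalization and the Memcache layer mentioned above):

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index supporting full-word search and prefix suggestions."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of doc ids

    def add(self, doc_id, text):
        """Index every whitespace-separated word of the text."""
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        """Exact full-word lookup (what the Search API supports)."""
        return self.postings.get(word.lower(), set())

    def suggest(self, prefix):
        """Doc ids containing any word starting with the prefix
        (words at any position, not just the start of the string)."""
        prefix = prefix.lower()
        hits = set()
        for word, ids in self.postings.items():
            if word.startswith(prefix):
                hits |= ids
        return hits
```

A real implementation would store the postings in the Datastore (one entity per word or prefix) and cache hot entries, but the lookup structure is the same.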

Search API, create documents and indexes

I need help in the search API
I am Brazilian and I'm using Google Translate to communicate.
My question is:
For each item persisted in the datastore, do I create a document and an index?
And for those objects that are already persisted in the datastore, I go all the bank to create a document and an index for each, if I want to search for Search API?
I am using java.
It's reasonable to use the Search API to search for objects that are also stored in the Datastore. You can create a Search document for each Datastore entity (so that there's a one-to-one correspondence between them). But you don't need to use a separate Search Index for each one: all the Search documents can be added to one index. Or, if you have a huge number of documents, and if there is some natural partitioning between them, you could distribute them over some modest number of indexes. Assuming you can know via some external means which (single) index to choose for searching, keeping the indexes from getting too big can help performance.
I've tried to answer the question that I think you're asking. It's difficult for me to understand the English that the Google translator has produced. In particular, what does "I go all the bank ..." mean?

Trying to understand the idea of Search Documents on the Google App Engine

I'm trying to understand the concept of Documents on Google App Engine's Search API. The concept I'm having trouble with is the idea behind storing documents. So for example, say in my database I have this:
class Business(ndb.Model):
    name = ndb...
    description = ndb...
For each business, I am storing a document so I can do full-text searches on the name and description.
My questions are:
Is this right? Does this mean we are essentially storing each entity TWICE, in two different places, just to make it searchable?
If the answer to above is yes, is there a better way to do it?
And again if the answer to number 1 is yes, where do the documents get stored? To the high-rep DS?
I just want to make sure I am thinking about this concept correctly. Storing entities in docs means I have to maintain each entity in two separate places... doesn't seem very optimal just to keep it searchable.
You have it worked out already.
Full Text Search Overview
The Search API allows your application to perform Google-like searches
over structured data. You can search across several different types of
data (plain text, HTML, atom, numbers, dates, and geographic
locations). Searches return a sorted list of matching text. You can
customize the sorting and presentation of results.
As you don't get to search "inside" the contents of the models in the datastore, the Search API provides the ability to do that for text and HTML.
So to link a searchable text document (e.g. a product description) to a model in the datastore (e.g. that product's price), you have to "manually" make that link between the documents and the datastore objects they relate to. The Search API and the datastore can also be used totally independently of each other, so you have to build that link in yourself. AFAIK there is no automatic linkage between them.
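That manual link can be sketched in plain Python (dicts standing in for the Search API and the datastore; all names are invented): the search document carries the key of the entity it describes, and after the search you fetch the entity by that key.

```python
# Stand-in for the datastore: entity key -> entity properties.
datastore = {"product-42": {"price": 19.99}}

# Stand-in for the search index: each document stores the key
# of the datastore entity it relates to.
search_docs = [
    {"doc_id": "product-42", "description": "waterproof hiking boots"},
]

def search_with_prices(term):
    """Full-text match on the docs, then join back to the datastore by key."""
    results = []
    for doc in search_docs:
        if term in doc["description"]:
            entity = datastore[doc["doc_id"]]  # the manual link
            results.append({**doc, "price": entity["price"]})
    return results
```

Keeping the doc id and the entity key name identical (as here) makes the round trip a single keyed lookup.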

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need the ability to do full-text searching of a number of text fields, but also to filter by a number of numeric fields which record data that is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only thoughts on the idea is to implement a two-step process where the results of a full-text search are then either used in a query against the NDB datastore or manually filtered using Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well: either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
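A sketch of that cross-referencing (a dict standing in for memcache, and the entity/doc layout invented), where the doc id doubles as the entity key name:

```python
# Datastore stand-in: entity key name == search doc id.
entities = {
    "item-1": {"doc_id": "item-1", "stock": 7},
}

cache = {}  # memcache stand-in: doc id -> numeric fields

def numeric_for_doc(doc_id):
    """Fetch the numeric data for a search doc, via the cache when possible."""
    if doc_id not in cache:
        cache[doc_id] = {"stock": entities[doc_id]["stock"]}
    return cache[doc_id]

def update_stock(doc_id, stock):
    """Write to the entity and keep the cache consistent,
    without touching the search document at all."""
    entities[doc_id]["stock"] = stock
    cache[doc_id] = {"stock": stock}
```

The point of the design is visible in `update_stock`: the frequent numeric updates never rewrite a search document, only the entity and its cache entry.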
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.
