Is there a way to determine which files were indexed by the indexer in Azure Cognitive Search? - azure-cognitive-search

Is there any way to determine the files that were indexed by the indexer? I see an API that returns the indexer history and the number of files that were indexed, but not the actual file names themselves.

The content in your index represents the indexed files. You could simply search for * (wildcard) to list all files that were indexed.
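For example, a minimal sketch of that wildcard query against the Search REST API might look like the following. The service name, index name, and the metadata_storage_name field (a standard field the blob indexer can populate, but only if your index defines it) are assumptions here:

import requests

# Hypothetical service/index names and key -- replace with your own.
SERVICE = "my-search-service"
INDEX = "my-blob-index"
API_KEY = "<query-or-admin-key>"

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/search?api-version=2020-06-30"
payload = {
    "search": "*",                          # wildcard: match every indexed document
    "select": "metadata_storage_name",      # assumes this field exists in your index schema
    "count": True,
}

resp = requests.post(url, json=payload, headers={"api-key": API_KEY})
resp.raise_for_status()
for doc in resp.json()["value"]:
    print(doc["metadata_storage_name"])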

Related

Azure Search : Blob only Index Creation

We would like to enable Azure Search only for Blob data, including its Contents and Meta Attributes stamped on the blob.
Is it possible to have such an Indexer & Index without any reference to the database? How are the fields of the Index specified in this case? Will the fields be the same as the meta attributes stamped on the blob?
Also, we have certain fields which may contain data from two different languages. Is it possible to add the same field twice in the Index, with a different language analyzer specified on each?
Is it possible to relate the same Indexer to two different Indexes?
Is it possible to specify more than one Storage Account Container as the data source for the same Index?
Ideally, we would like to be able to do the following:
Utilize the same Indexer in multiple Indexes
Enable the same Indexer/Index to search across multiple languages (with language analyzers)
Enable an Index based only on Blob content & its meta attribute data
This doc topic explains how to set up search for blob data: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage
The default dataToExtract parameter value is contentAndMetadata, meaning all text content and metadata will be indexed. You should be able to set up field mappings from metadata and contents to your index (the details are outlined in this same doc topic).
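As a rough sketch of what that looks like through the REST API (the service, data source, index, and target field names below are assumptions, not values from the question), an indexer definition with dataToExtract and a couple of field mappings could be created like this:

import requests

SERVICE = "my-search-service"   # hypothetical names -- adjust to your setup
API_KEY = "<admin-key>"

indexer = {
    "name": "blob-indexer",
    "dataSourceName": "blob-datasource",      # an existing blob data source
    "targetIndexName": "blob-index",          # an existing index
    "parameters": {
        "configuration": {"dataToExtract": "contentAndMetadata"}
    },
    "fieldMappings": [
        {"sourceFieldName": "metadata_storage_name", "targetFieldName": "fileName"},
        {"sourceFieldName": "content", "targetFieldName": "fileContent"},
    ],
}

url = f"https://{SERVICE}.search.windows.net/indexers/blob-indexer?api-version=2020-06-30"
resp = requests.put(url, json=indexer, headers={"api-key": API_KEY})
resp.raise_for_status()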
The indexer points to the index it should output to, so I don't think it would be possible to re-use the same indexer for multiple indexes; you'd have to create a copy of the indexer for each index instead.
Similarly, the indexer specifies what datasource it takes its data from, so only one data source per indexer. You'd need to aggregate your data into a single source first if you want to build an index from the data of multiple sources.
It is possible to index multiple languages in a single index, by specifying the relevant analyzer for each index field. More details can be found in this topic: https://learn.microsoft.com/en-us/azure/search/search-language-support
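As a sketch of the multi-language approach (the field names here are just examples; en.microsoft and fr.microsoft are built-in Azure Search analyzer names), an index can expose the same text in two fields, each with its own analyzer:

import requests

SERVICE = "my-search-service"   # hypothetical
API_KEY = "<admin-key>"

index = {
    "name": "blob-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        # The same content mapped into two fields, each analyzed for a different language
        {"name": "description_en", "type": "Edm.String", "searchable": True, "analyzer": "en.microsoft"},
        {"name": "description_fr", "type": "Edm.String", "searchable": True, "analyzer": "fr.microsoft"},
    ],
}

url = f"https://{SERVICE}.search.windows.net/indexes/blob-index?api-version=2020-06-30"
requests.put(url, json=index, headers={"api-key": API_KEY}).raise_for_status()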

Difference between full text and free text search in solr (other search db)

New to search databases and working with one. What is the difference between full text and free text search/index?
They are kind of the same; more precisely, they are just synonyms.
Both terms describe the technique search engines use to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have a large number of documents to be searched and you can't use LIKE queries with a normal RDBMS for performance reasons.
It mainly follows two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
Suppose you typed John and Ryan; the query will return all the items in the document collection that contain either "John" or "Ryan". Order and case sensitivity don't matter.
In a nutshell, unless you are using the terms in a specific context, they are just different names for the same thing.
Call him Cristiano or CR7, he's the same person :)
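To make the two stages concrete, here is a toy sketch of an inverted index (just the idea, not how Lucene is implemented internally):

from collections import defaultdict

docs = {
    1: "John met Ryan at the airport",
    2: "Ryan wrote the report",
    3: "Nothing relevant here",
}

# Indexing stage: map each lower-cased term to the set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Search stage: only the index is consulted, never the original texts
def search(query):
    terms = query.lower().split()
    return sorted(set().union(*(index.get(t, set()) for t in terms)))

print(search("John Ryan"))   # -> [1, 2]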

How can we retrieve tokens of a particular property from search engine?

Community version. When content is added in Alfresco, the search engine tokenizes properties (name, description) and stores them in its indexes. I would like to know if there is a way to retrieve a list of the keywords associated with a particular piece of content.
E.g., fetch the tokens from the "Name" of the "abc.txt" content.
I see there are APIs exposed by Solr to get the overall status of the indexes and to fix transactions, but nothing that meets my needs.
I had a similar experience: I needed to find out what the tokenizer was doing with the indexes because a particular file name was not found during search.
I finally used Luke, the Lucene index toolbox, which is:
Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:
browse by document number, or by term
view documents / copy to clipboard
retrieve a ranked list of most frequent terms
execute a search, and browse the results
analyze search results
selectively delete documents from the index
reconstruct the original document fields, edit them and re-insert to the index
optimize indexes
open indexes consisting of multiple parts, and/or located on Hadoop filesystem
and much more...
Simply open the index files and you can peek at how the properties and data were tokenized.
As reported in this post, it can easily be used for Solr indexes as well.
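If you prefer querying the embedded Solr directly rather than opening the index files, Solr's built-in Luke request handler can list the top indexed terms per field. A rough sketch follows; the core URL, the cm:name field name, and whether your Alfresco version exposes this handler without extra authentication are all assumptions about your setup:

import requests

# Hypothetical Alfresco/Solr core and field -- adjust to your installation
SOLR = "http://localhost:8080/solr/alfresco"

resp = requests.get(
    f"{SOLR}/admin/luke",
    params={"fl": "cm:name", "numTerms": 50, "wt": "json"},
)
resp.raise_for_status()
field_info = resp.json()["fields"]["cm:name"]
print(field_info.get("topTerms", []))   # indexed tokens for the field, with frequencies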

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, the user can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When the user updates the custom metadata, the document index has to be updated to reflect the metadata changes in the search.
But during the index update, even though the content of the file has not changed, it is re-indexed as well, which causes delays in the metadata update.
So I wanted to check: is there a way to avoid content indexing and update just the metadata?
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in one index, and documentId and custom metadata in another? In that case, how can I query across these two different indexes and return the result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. Its like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two solr indexes are considered as two databases, not two tables, if you had DB-like joins in mind. I just answered on it on How to start and Stop SOLR from A user created windows service .
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
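A rough sketch of this two-documents-per-file idea, using the pysolr client purely as an illustration (Solr 3.3 itself would normally take XML update messages; the IDs and fields follow the watson example above):

import pysolr

solr = pysolr.Solr("http://localhost:8080/app")

# One "contents" document and one "metadata" document per file
solr.add([
    {"id": "docs_contents_watson", "type": "contents", "text": "text of the file"},
    {"id": "docs_metadata_watson", "type": "metadata", "author": "A J Crown", "year": 1984},
])
solr.commit()

# Later, a metadata change only touches the small metadata document,
# leaving the large contents document untouched.
solr.add([{"id": "docs_metadata_watson", "type": "metadata", "author": "A J Crown", "year": 1985}])
solr.commit()

# Metadata search
results = solr.search('type:metadata AND year:1985')
print([r["id"] for r in results])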
We did try this, and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Compress and serialize the object, then assign it to one more field in your schema. Make that field a binary field.
So when you want to update the information in one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene.
Never forget to store the XML extracted by Tika (the text used for search/indexing) as one of the fields inside the SolrInputDocument.
The only negative: your index size will grow a little, but you will get what you want without re-feeding the data.
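A hedged sketch of that snapshot idea in Python (the field names, the gzip+JSON+base64 scheme, and the pysolr usage are all assumptions for illustration; the original answer works with SolrJ's SolrInputDocument on the Java side):

import base64, gzip, json
import pysolr

solr = pysolr.Solr("http://localhost:8080/app")

# Original document, including the text extracted by Tika
doc = {"id": "watson", "text": "text extracted by Tika", "author": "A J Crown", "year": 1984}

# Snapshot: serialize + compress the document and keep it in a binary (base64) field
doc["raw_snapshot"] = base64.b64encode(gzip.compress(json.dumps(doc).encode())).decode()
solr.add([doc])
solr.commit()

# Later: restore the snapshot, update only the metadata, and re-feed the whole document
stored = next(iter(solr.search("id:watson", fl="raw_snapshot")))
restored = json.loads(gzip.decompress(base64.b64decode(stored["raw_snapshot"])))
restored["year"] = 1985   # metadata change; the extracted text is reused, not re-extracted
restored["raw_snapshot"] = base64.b64encode(gzip.compress(json.dumps(restored).encode())).decode()
solr.add([restored])
solr.commit()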

Problem with spellchecker

Why don't I get any suggestions when I execute this query against Solr:
q=%2B%28text%3A%28gasal%29%29&suggestField=contentOriginal&ontologySeed=gasal&spellcheck.build=true&spellcheck.q=gasal&spellcheck=true&spellcheck.collate=true&hl=true&hl.snippets=5&hl.fl=text&hl.fl=text&rows=12&start=0&qt=%2Fsuggestprobabilistic
I am searching for gasal and it should suggest gasol.
Thanks in advance
By default, the spellchecker works by taking the indexed content of a source field (in Solr) and storing it in an external Lucene index. That external index is the dictionary. Each word of the source field is stored in the dictionary in a format that allows matching words that are close to each other. When asking for suggestions, Solr looks into that dictionary, NOT into the Solr index.
So in order for the dictionary to be built, you have to specify the source field. It should be defined in your schema using an appropriate analyzer (usually no stemming). That field should contain enough words to build a good dictionary. A good practice is to populate it from your text fields using copyField instructions.
Then the dictionary has to be built. This is the operation where the content of the source field is used to build the actual dictionary. It can be done automatically at each commit or manually using the "build" parameter.
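For example, once a source field is wired into the spellcheck component, a request along these lines (the Solr base URL and handler path are assumptions about your setup; your query above uses a custom /suggestprobabilistic handler instead) would build the dictionary and then ask for suggestions:

import requests

SOLR = "http://localhost:8983/solr"   # hypothetical Solr base URL

params = {
    "q": "gasal",
    "spellcheck": "true",
    "spellcheck.q": "gasal",
    "spellcheck.build": "true",    # triggers the dictionary build from the source field
    "spellcheck.collate": "true",
    "wt": "json",
}
resp = requests.get(f"{SOLR}/select", params=params)
resp.raise_for_status()
print(resp.json().get("spellcheck", {}).get("suggestions"))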
