We store some CMS content in an Azure database and need to index the HTML content it contains.
What are the best practices for indexing this in Azure Search so that only the content is indexed, not the HTML? Or so that the index recognizes it as HTML and ignores the markup?
I know one option would be to manipulate it before it gets to the index, or on its way there, but I was hoping there were some built-in capabilities in Azure Search.
Currently, the Azure blob indexer is the only Azure Search indexer that can parse HTML and strip the markup. The Azure SQL indexer treats HTML text as just a chunk of text.
You have several potential options:
Use the SQL indexer and accept that HTML markup will be indexed - depending on your documents, your search quality may still be good.
Pre-process your data to strip the HTML markup, then either put the parsed text back into SQL (and use the SQL indexer) or use the indexing API to push the data into a search index (a sketch of this approach follows this list).
Store HTML data in blob storage and use the blob indexer to index HTML data, while continuing to use SQL indexer to index the rest of the data. Multiple indexers can write into the same search index, in effect "assembling" documents from multiple data sources.
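If you go the pre-processing route, here is a minimal sketch, assuming the standard library's html.parser for stripping markup and the Azure Search REST indexing API; the service URL, index name, field names, and api-key are placeholders:

import html.parser

import requests  # pip install requests


class TextExtractor(html.parser.HTMLParser):
    """Collects text content only, discarding all HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    extractor = TextExtractor()
    extractor.feed(markup)
    return " ".join(extractor.chunks)


# Hypothetical service, index, and key; replace with your own.
ENDPOINT = "https://my-service.search.windows.net"
API_KEY = "<admin-key>"

def push_document(doc_id, html_body):
    payload = {"value": [{
        "@search.action": "mergeOrUpload",
        "id": doc_id,
        "content": strip_html(html_body),  # plain text, no markup
    }]}
    requests.post(
        f"{ENDPOINT}/indexes/cms-content/docs/index?api-version=2020-06-30",
        json=payload,
        headers={"api-key": API_KEY},
    ).raise_for_status()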
You could try a Custom Analyzer with a custom Char Filter.
Char Filters can be used to "clean" the input with either a mapping or a pattern replace (Regular Expression).
The pattern replace internally uses the PatternReplaceCharFilter.
Please keep in mind that complex expressions will probably result in longer indexing times.
I'm using the following custom analyzer to index HTML. I don't know if it's the best way.
{
  "name": "bodyHtml",
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "tokenizer": "standard_v2",
  "tokenFilters": [
    "lowercase", "asciifolding"
  ],
  "charFilters": [
    "html_strip"
  ]
}
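To use the analyzer, register it in the index definition and reference it from the field that holds the HTML. A minimal sketch of creating such an index through the REST API from Python; the service URL, key, index name, and field names are assumptions:

import requests

ENDPOINT = "https://my-service.search.windows.net"  # hypothetical service
API_KEY = "<admin-key>"

index_definition = {
    "name": "cms-content",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        # The char filter strips markup at analysis time, so the raw HTML
        # is stored as-is but only the text ends up in the inverted index.
        {"name": "body", "type": "Edm.String",
         "searchable": True, "analyzer": "bodyHtml"},
    ],
    "analyzers": [{
        "name": "bodyHtml",
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "tokenizer": "standard_v2",
        "tokenFilters": ["lowercase", "asciifolding"],
        "charFilters": ["html_strip"],
    }],
}

requests.put(
    f"{ENDPOINT}/indexes/cms-content?api-version=2020-06-30",
    json=index_definition,
    headers={"api-key": API_KEY},
).raise_for_status()

One caveat: the stored field value remains the raw HTML; the char filter only affects what gets tokenized, so retrieved documents will still contain markup.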
I'm trying to search through all of my pages on the backend to find instances of URLs with outdated domain names. But when I search, it only seems to get hits based on the titles of the pages, and not their content. Is there something I need to configure to make this work?
Thank you!
To search on fields beyond the ones that are common to all page types (such as title), you'll need to define a search_fields property on your page model, such as:
from wagtail.core.fields import StreamField
from wagtail.core.models import Page
from wagtail.search import index

class MyPage(Page):
    body = StreamField(...)

    search_fields = Page.search_fields + [
        index.SearchField('body'),
    ]
After setting this up, you'll need to update the search index to contain this new data, by running ./manage.py update_index.
If you're running 2.14.x or earlier, you'll also need to set up an alternative search backend such as Postgres or Elasticsearch; the default wagtail.search.backends.db search backend was very limited and only supported searching on the 'core' fields such as title. As of 2.15, a new wagtail.search.backends.database backend is available, enabled out of the box on new projects, and has full support for searching across all fields.
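For reference, the backend is selected in your Django settings; a minimal sketch using the 2.15+ database backend mentioned above:

# settings.py
WAGTAILSEARCH_BACKENDS = {
    "default": {
        "BACKEND": "wagtail.search.backends.database",
    },
}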
Azure Search by default highlights search results with the <em> tag. I've run into a situation where a user uploads a document that already contains that tag:
<em>Today</em> topic will be...
When I search for "topic" I get:
<em>Today</em> <em>topic</em> will be...
and I can't distinguish the real highlight.
I know that I can modify highlight_pre_tag and highlight_post_tag to avoid this in this particular situation. But is there another way to encode these tags before applying highlights?
EDIT:
By encoding I mean getting something like this:
&lt;em&gt;Today&lt;/em&gt; <em>topic</em> will be...
so I can send it to the frontend, display the encoded &lt;em&gt; around "Today" as literal text, and use the real <em> around "topic" to highlight it in yellow.
Azure Search doesn't provide any built-in mechanism to modify the "raw" content of a document if you are using the Index API directly. However, if you are using one of our built-in indexers, you can look into the field mapper functions (such as the UrlEncode function), or create your own custom skill (if you want to apply only very specific rules), to transform documents in transit from your data source to the search index.
Alternatively, we've seen customers use custom highlight pre and post tags that are easily recognizable (and unlikely to be mistaken for original content), and then use a simple search-and-replace in their client application to transform those back into the desired tag.
For example, using
pre-tag: "!HIGHLIGHT_START!" and post-tag: "!HIGHLIGHT_END!"
and then using
String.Replace("!HIGHLIGHT_START!", "<em>")
before displaying the results in their application. That way, any client-side logic that requires finding the actual highlights can use the custom tags, while still showing the desired tag in the UX.
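A minimal sketch of that approach against the REST API from Python; the service URL, index name, and key are placeholders:

import requests

ENDPOINT = "https://my-service.search.windows.net"  # hypothetical
API_KEY = "<query-key>"

response = requests.post(
    f"{ENDPOINT}/indexes/documents/docs/search?api-version=2020-06-30",
    json={
        "search": "topic",
        "highlight": "content",
        # Sentinels that will not occur in user content.
        "highlightPreTag": "!HIGHLIGHT_START!",
        "highlightPostTag": "!HIGHLIGHT_END!",
    },
    headers={"api-key": API_KEY},
)
response.raise_for_status()

for result in response.json()["value"]:
    for fragment in result.get("@search.highlights", {}).get("content", []):
        # Swap the sentinels back into the tag the UI expects.
        print(fragment
              .replace("!HIGHLIGHT_START!", "<em>")
              .replace("!HIGHLIGHT_END!", "</em>"))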
I am creating a tag-based search engine for various kinds of things in MongoDB.
I have blogs, testimonials, comments, books, and images collections, and all of these have an array-of-tags field.
Now when I fetch a book that has certain tags associated with it, I would also like to fetch blogs, testimonials, and comments with those tags.
I would like to do the same when I fetch a blog: fetch the rest using the tags that blog has.
I am designing my database model. What is the best way to handle this kind of tag-based search?
Currently what I am thinking is:
add tags in each document
at fetch time, take the tags and search through all the other documents
take the results and then send them together
Is this the best way? How should I design the model?
Update :
I will be performing searches frequently.
If you need to repeat tags in multiple collections, I would rather create a tags collection itself.
Why would I move tags into their own collection?
Think about it: if you need to change the name of a tag in the future, maybe because of a mistake like a typo, you'll have to iterate over all your collections searching for that tag to fix it. Wouldn't it be easier if you only needed to replace it in one place?
Embedding arrays and objects in one document is a powerful tool, but there are times when it's not the best solution. This case is one of them, and you should avoid repetition as much as you can.
The official documentation talks about avoiding repetition.
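For example, with a dedicated tags collection, fixing a typo becomes a single update. A sketch using PyMongo; the connection string and database name are placeholders:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Rename the tag once; every document that references its ObjectId
# picks up the corrected title automatically.
db.tags.update_one({"title": "trneding"}, {"$set": {"title": "trending"}})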
Collection Structure
Create a tags collection and reference each tag's ObjectId in the tags array of the other documents, instead of the tag itself, like below:
// tags collection
{
  _id: <ObjectId1>,
  title: "trending"
}

// all other documents (blogs, testimonials...)
{
  _id: <ObjectId2>,
  tags: [
    <ObjectId1>
  ],
  // other stuff...
}
Fetching tag-related documents in one hit
When you fetch one document you can get all its tags and look for other documents with related tags using the operator $in, like this:
db.blogs.find({
  tags: {
    $in: [
      <ObjectId1>,
      <ObjectIdX>,
      // other tags ids
    ]
  }
})
And this will return at once all the documents matching one or more tags.
More about $in operator.
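Putting it together in application code, a sketch with PyMongo; the collection names follow the examples above and the connection details are placeholders:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Fetch some book (in practice you'd look it up by _id or a slug).
book = db.books.find_one()

# One $in query per collection returns every document sharing a tag with it.
related = {
    name: list(db[name].find({"tags": {"$in": book["tags"]}}))
    for name in ("blogs", "testimonials", "comments")
}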
Other tips
Well-used indexes have a great impact on performance. This isn't the place to teach how they work, but MongoDB has multikey indexes, and in your concrete case it's obvious which field to index: tags.
Example:
db.blogs.createIndex( { "tags": 1 } )
I'm trying to use the Document Conversion service to capture JSON key/value pairs from PDF documents such as W-2/1040/etc. forms.
The content of such forms in the JSON response comes as part of "text" under "content". The form data is missing; mostly the form labels are rendered as a single string.
I would like to know if there is any way to capture the form data of these PDFs (W-2/1040/etc.) as key/value pairs in JSON instead of a single string.
Thanks.
Unfortunately, the Document Conversion Service currently does not support forms in PDFs. At most, it may recognize some of the forms as tables, but not as key/value pairs.
If it recognizes a form as a table, you still would need to do some non-trivial post-processing to map it to key/value pairs.
I have a requirement where the incoming update request has metadata like "link":"http://example.pdf" (along with some other metadata), and I have to parse the PDF document and index its contents in another field like "link_value":"PDF extracted contents". Is this possible in Solr using Tika?
NOTE: I cannot use the Data Import Handler, since the incoming requests do not come from a single source and are sent by external sources.
So, if I understand correctly:
you are getting some /update call to add a doc
the doc contains a 'link' field with a URL; you want to retrieve it, extract the text with Tika, and index that text into another field
Yes, you can do this in Solr, but you need to do some work:
set up an UpdateRequestProcessor; you could start from TikaLanguageIdentifierUpdateProcessorFactory, as it uses Tika too and maybe you can reuse some stuff
you wire your URP so it is used by the /update handler
that URP will kick in every time a doc is added
in the URP code you: retrieve the PDF, programmatically extract the text with Tika, and add it to the target field
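The URP itself is written in Java, but the retrieve-and-extract step it performs looks like this sketch, done here client-side in Python against a running Tika server before pushing the doc to Solr; the URLs, core name, and field names are assumptions:

import requests

TIKA_URL = "http://localhost:9998/tika"  # a running Apache Tika server
SOLR_UPDATE = "http://localhost:8983/solr/mycore/update?commit=true"

def index_doc(doc):
    # 1. Retrieve the PDF that the 'link' field points to.
    pdf_bytes = requests.get(doc["link"]).content
    # 2. Extract plain text with Tika; the /tika endpoint returns it directly.
    text = requests.put(
        TIKA_URL,
        data=pdf_bytes,
        headers={"Accept": "text/plain"},
    ).text
    # 3. Add the extracted text to the target field and send the doc to Solr.
    doc["link_value"] = text
    requests.post(SOLR_UPDATE, json=[doc]).raise_for_status()

index_doc({"id": "1", "link": "http://example.com/test.pdf"})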
You can map content to a specific field and supply specific field values when you're using the ExtractingRequestHandler (if you're using Tika yourself, you'll include the content as a regular document field).
To map the content to a different field, use fmap: fmap.content=link_value, and to include a literal value (i.e. the URL of the document you're indexing), use literal: literal.link=http://example.com/test.pdf (apply URL escaping as necessary).
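A sketch of that request from Python; the Solr URL and core name are placeholders:

import requests

params = {
    "fmap.content": "link_value",  # map Tika's extracted text to link_value
    "literal.link": "http://example.com/test.pdf",  # stored in the 'link' field
    "literal.id": "1",
    "commit": "true",
}

with open("test.pdf", "rb") as f:
    requests.post(
        "http://localhost:8983/solr/mycore/update/extract",
        params=params,
        files={"file": f},
    ).raise_for_status()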