Using Azure Search for PDFs in Azure Blob Storage - azure-cognitive-search

We are trying to enable full-text search. Our application stores PDF files in Azure Blob Storage, which is the data source for Azure Search. Most of this works fine; however, the indexer is not able to extract text from a couple of the PDFs. Are there specific kinds of PDFs that the Azure Search indexer can extract? If yes, what are they?
Any information or help in this regard is greatly appreciated.

Azure Search can extract all text from PDF text elements. Extracting text from embedded images (which requires OCR) or tables is not yet integrated in Azure Search, but it is on the roadmap.
If your PDFs contain images and you want to extract text from those as well, then you can try following the steps here.

Are there any specific kinds of PDFs that Azure Search Indexer can extract?
Based on my experience, there are no specific kinds of PDFs that the Azure Search indexer can't extract. From your description, I assume you are hitting an Azure Search limit. For more detailed information, please refer to Indexing Documents in Azure Blob Storage with Azure Search.
Azure Search limits how much text it extracts depending on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, and 4 million for Standard, Standard S2 and Standard S3 tiers. A warning is included in the indexer status response for truncated documents.
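To check whether a particular PDF is likely to be affected, you could compare the length of its text against the limits quoted above. This is only a rough sketch: the tier keys and the function name are made up for illustration, and the numbers are simply the ones from this answer, so verify them against the current documentation.

```python
# Character extraction limits per pricing tier, as quoted in the answer above.
# The dictionary keys are illustrative names, not official tier identifiers.
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard_s2": 4_000_000,
    "standard_s3": 4_000_000,
}


def would_be_truncated(text: str, tier: str) -> bool:
    """Return True if `text` is longer than the extraction limit for `tier`."""
    return len(text) > EXTRACTION_LIMITS[tier.lower()]
```

If this returns True for a document that seems incompletely indexed, the truncation warning in the indexer status response is the next thing to look for.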

I recently wrote a blog post about my experience with this. I ended up using a Python-based script running in a Docker container within Azure. It is somewhat complicated, but the blog lays it out pretty clearly, and the results have been very good as far as OCR and searchability go.
http://martyice.github.io/docker-in-azure/

Related

How can I publish my dataset on Google Dataset Search?

I want to publish a dataset I have prepared on Google Dataset Search, but I couldn't find any procedure for doing this online.
The question "How can I post my doctoral lab's datasets on Google Dataset Search?", asked on Google support, has answers, but the links they give are not of much use.
The easiest way to make your dataset eligible to be included in Dataset Search results is to upload it to a repository that adheres to metadata standards used by Dataset Search. There are many such repositories. You can try the following:
https://zenodo.org
https://figshare.com
https://dataverse.harvard.edu/
https://www.kaggle.com/datasets
If you don't want to use an external repository you will need to embed metadata in the webpage you host that describes the datasets. More information on this here.
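For the self-hosted route, the embedded metadata is typically schema.org `Dataset` markup in JSON-LD. Below is a minimal sketch of generating such a snippet; the function name and the choice of fields are assumptions for illustration, and a real page would usually need more properties (license, creator, and so on) than shown here.

```python
import json


def dataset_jsonld(name: str, description: str, url: str) -> str:
    """Build a <script> tag holding minimal schema.org Dataset metadata."""
    metadata = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

The returned string would be pasted into the HTML of the page that describes the dataset.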

Azure search no longer indexing documents in blob storage

Up until a couple of weeks ago, I was successfully setting up a data source, index and indexer for documents stored within Azure blob storage. The documents were being indexed as I expected. Now, however, no matter what I do, the same documents are no longer being indexed. I've tried pretty much all possible combinations, re-run the indexer, used different blob storage and even deleted and created a new Azure Search service, but all to no avail. Whenever I run the indexer it just tells me it has been a success with 0/0 documents.
I have no file extension exclusions, only about 20 out of 700 documents have AzureSearch_Skip metadata set to true.
I set up the data source, indexer and index using the default settings in the Azure Search web interface in the Azure portal.
The Azure Search service is called KulaHub if anyone from Microsoft is reading.
Is there an issue with Azure Search for indexing documents in blob storage? I know this question lacks specifics but I wish I could provide more details.
Many thanks
Tim
The issue is that your indexer batch size got set to 0. Please edit indexer properties in the portal and set the Batch Size to some reasonable number (10 is the default, but if your documents are small, something like 100-500 may be better).
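If you manage the indexer through the REST API rather than the portal, the batch size lives in the indexer definition. The sketch below only builds the JSON body and makes no request; the property names (`dataSourceName`, `targetIndexName`, `parameters.batchSize`) reflect my understanding of the indexer definition, and the resource names in the usage line are placeholders, so check them against the API reference before relying on this.

```python
import json


def indexer_definition(name: str, data_source: str, index: str,
                       batch_size: int = 10) -> str:
    """Return a JSON indexer definition with an explicit, non-zero batch size."""
    if batch_size < 1:
        # A batch size of 0 is what produced the "success, 0/0 documents" runs.
        raise ValueError("batch_size must be at least 1")
    definition = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "parameters": {"batchSize": batch_size},
    }
    return json.dumps(definition, indent=2)
```

For example, `indexer_definition("blob-indexer", "blob-datasource", "documents-index", batch_size=100)` yields a body with a batch size in the range suggested above for small documents.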

Suggestion for choice of database/design

Okay, I'm building a search engine based on URLs stored in a database.
| link_id || link_url || link_tags| <== schema
link_tags for a site, say w3schools.com, hold values such as [web-design,html,php,js].
The database (MySQL) has 1,000,000+ rows.
Now I need them to be searchable by a search engine that also takes link_tags into account when processing queries such as "best html tutorial", so that it returns optimal results. The entire web content of each URL would also need to be stored, to give the engine additional keyword-based input.
Which open-source search engine or previous implementation should I be looking at to achieve this?
There is a small open-source search engine here. It is written in PHP and uses MySQL. It may be possible to stretch it to fit your needs.
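Whichever engine you pick, the core idea of letting link_tags influence ranking can be sketched in a few lines. This is a toy illustration, not how any particular engine scores results: the function name and the `tag_weight` value are invented, and real engines use far better relevance models than raw term counts.

```python
def score(query: str, tags: list[str], content: str,
          tag_weight: float = 3.0) -> float:
    """Score one link: tag matches count more than matches in page content."""
    tag_set = {tag.lower() for tag in tags}
    words = content.lower().split()
    total = 0.0
    for term in query.lower().split():
        if term in tag_set:
            total += tag_weight       # the query term is one of the link's tags
        total += words.count(term)    # plus one point per occurrence in content
    return total
```

Links would then be sorted by this score in descending order for each query.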

Developing a web directory search engine for company information: is it better to use a database or files?

I want to develop a web app for storing information about companies, so that this information can be searched by keyword as well as by category, but mainly by keyword, because the interface is going to be as simple as Google's. My question is: is it better to store this information in a database or in text files?
If you want full text search, probably neither. You should look into a search index such as Elasticsearch (http://www.elasticsearch.org/overview/). A search index stores data in a way that is optimized for searching.
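To make "optimized for searching" a little more concrete: a search index is usually built around an inverted index, which maps each word to the documents that contain it. The following is a deliberately tiny sketch of that idea and not a description of how Elasticsearch is implemented; the function names are made up.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map every lower-cased word to the set of document ids containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the ids of documents that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A keyword lookup then costs a few set intersections instead of a scan over every stored record, which is the main reason a search index beats both plain files and naive database queries for this use case.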

Where are the specifications/XSDs for Amazon MWS feed XML processing reports?

Amazon provides a batch of documents describing the format of the feeds we can send via MWS. However, we also need to know what to expect in the responses: which status codes may be reported, what the structure of the XML is when errors are reported, and so on.
Where can I get the information?
The MWS XML schemata are documented within the Selling on Amazon Guide to XML linked from the Developer Guides section in the Amazon Marketplace Web Service (Amazon MWS) Documentation.
I'm omitting a direct link to the PDF, as this might change once in a while. For the same reason the XSD files you are looking for are not publicly linked by Amazon as well, rather you'll find the links to the most current schema documents within the respective sections of the Selling on Amazon Guide to XML.
You might also be interested in the Amazon MWS Developer Guide, the Feeds API Reference and the guide for the Amazon MWS Scratchpad, which are all available there as well.
Good luck!
I know this is a rather old question but I just wanted to look at the actual XML schema files myself today.
There is an XML Documentation PDF hosted on images-na.ssl-images-amazon.com which I assume will stay there for a while. This PDF contains links to the core schema files amzn-envelope.xsd, amzn-header.xsd, and amzn-base.xsd and some other API schemas like Product.xsd which all appear to be relative to https://images-na.ssl-images-amazon.com/images/G/01/rainier/help/xsd/release_1_9/.
The PDF explicitly states that
The XSD samples shown on the Help pages may not reflect the latest XSDs. We recommend using the provided XSD links to obtain the latest [ve]rsions.
However, the official MWS Feeds API documentation also links to some XSDs but these are relative to https://images-na.ssl-images-amazon.com/images/G/01/rainier/help/xsd/release_4_1/ now, e.g. Price.xsd. Schema references also seem to be relative to this path. For example, Price.xsd includes amzn-base.xsd via <xsd:include schemaLocation="amzn-base.xsd"/> and sure enough there it is.
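If you want to collect every schema a given XSD depends on, the relative `schemaLocation` values can be resolved against the URL the XSD was downloaded from. Here is a small sketch using only the standard library; the function name is my own, and it only inspects a schema you have already fetched rather than downloading anything.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def included_schema_urls(xsd_text: str, base_url: str) -> list[str]:
    """Resolve xsd:include / xsd:import schemaLocation values against base_url."""
    root = ET.fromstring(xsd_text)
    urls = []
    for tag in ("include", "import"):
        for element in root.iter(f"{{{XSD_NS}}}{tag}"):
            location = element.get("schemaLocation")
            if location:
                urls.append(urljoin(base_url, location))
    return urls
```

Called with the text of Price.xsd and its own URL as `base_url`, this should give the absolute URL of amzn-base.xsd in the same release directory.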
Unfortunately, I have no idea whether release_4_1 is the latest release of the schemas but the link from the MWS API documentation is a good indicator to me.
Another way to get the XSDs, which I think is the most "official" way, is to go to your Seller Central account and navigate to Help > XML & data exchange > Reference > XSDs.
There you can download all the XSDs available to your account.
Hope it helps!
It seems that these XSD files are outdated.
I just checked the official Seller Central help page for the XSD files: https://sellercentral-europe.amazon.com/gp/help/G1611
For the OrderReport, release_4_1 is still referenced.
Some time ago Amazon added a new field, IsSoldByAB, to the OrderReport for EU markets.
I have been using the XSD files for many years for automatic code generation, and this fails from time to time because of new fields like this one. The field is not described in either of these XSD releases:
release_1_9 ($Revision: #7 $, $Date: 2006/05/23 $)
release_4_1 ($Revision: #10 $, $Date: 2007/09/06 $)
and I have not been able to find a version that includes it.
For some years now I have extended the XSD file myself to generate my code. IsSoldByAB is just a boolean field like IsPrime or IsBusinessOrder, so this was an easy task, but it is not "official".
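One way to notice such additions early is to compare the element names that appear in a received report with the names declared in your local XSD. The sketch below is a rough check under simplifying assumptions: the function name is invented, it ignores namespaces and nesting, and it does not follow xsd:include or `ref=` declarations, so a real OrderReport schema would need its included files merged first.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def undeclared_elements(xsd_text: str, xml_text: str) -> list[str]:
    """List element names used in the XML that the XSD never declares by name."""
    schema = ET.fromstring(xsd_text)
    declared = {
        element.get("name")
        for element in schema.iter(f"{{{XSD_NS}}}element")
        if element.get("name")
    }
    document = ET.fromstring(xml_text)
    # Drop any "{namespace}" prefix so only local names are compared.
    used = {element.tag.split("}")[-1] for element in document.iter()}
    return sorted(used - declared)
```

Any name this returns is a candidate for a manual extension of the XSD before regenerating code.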

Resources