Up until a couple of weeks ago, I was successfully setting up a data source, index and indexer for documents stored within Azure blob storage. The documents were being indexed as I expected. Now, however, no matter what I do, the same documents are no longer being indexed. I've tried pretty much all possible combinations, re-run the indexer, used different blob storage and even deleted and created a new Azure Search service, but all to no avail. Whenever I run the indexer it just tells me it has been a success with 0/0 documents.
I have no file extension exclusions, and only about 20 of the 700 documents have the AzureSearch_Skip metadata set to true.
I set up the data source, indexer and index using the default settings in the Azure Search web interface in the Azure portal.
The Azure Search service is called KulaHub if anyone from Microsoft is reading.
Is there an issue with Azure Search for indexing documents in blob storage? I know this question lacks specifics but I wish I could provide more details.
Many thanks
The issue is that your indexer batch size got set to 0. Please edit indexer properties in the portal and set the Batch Size to some reasonable number (10 is the default, but if your documents are small, something like 100-500 may be better).
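If you would rather fix this outside the portal, the same change can be sketched as an indexer definition with a non-zero batch size. This only builds the request body; the indexer, data source and index names below are placeholders, and the `parameters.batchSize` field is how I understand the Create/Update Indexer REST API to expose the setting, so check it against the API version you are using.

```python
import json


def indexer_definition(name, data_source, target_index, batch_size=10):
    """Build an indexer definition body with an explicit, non-zero batch size."""
    if batch_size < 1:
        # A batch size of 0 is what produces the "success, 0/0 documents" runs
        raise ValueError("batch_size must be at least 1")
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
        "parameters": {"batchSize": batch_size},
    }


# Placeholder names; substitute the ones from your own service
body = indexer_definition("blob-indexer", "blob-datasource", "documents-index", batch_size=100)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of the indexer update request, after which the indexer needs to be run again.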
This could entirely be a case of me misunderstanding how Azurite works, but I can't seem to find the answers by searching.
I've downloaded Azurite through the VS Code extension and uploaded some data to a local blob source on my hard drive using Windows Storage Explorer; that's now visible in the Azurite __blobstorage__ folder. I've tried initialising a new function to search over the data, but the project I'm working on specifically phrased it as:
"Set-up local version of Storage and Cognitive Search and index a sample set of documents"
Is this possible and I'm just missing something somewhere? Or have I misunderstood the task, and you can't actually run Cognitive Search locally without at some stage attaching to the subscription? I'm waiting for the PM to get back from annual leave, so I thought I'd carry on trying to find the answer while I wait, hoping someone here might be able to help me out!
I've hunted through both the Microsoft VS Code Local Development How-to Guide and the Git repository for Azurite, so I'm not sure if I'm just reading the information wrong or if it's just not there to find.
Azure Search does not currently offer a localhost emulator. Azurite is for localhost storage emulation. It is not possible for an Azure Search Indexer to index data from a local emulator, but you can write data to Azure Search directly via the Index Docs REST APIs. You would need to write a script to read from your local storage and make an API call to index the data into a Search instance in Azure.
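A minimal sketch of such a script, assuming the local documents are plain `.txt` files and that the index has `id`, `file_name` and `content` fields (those field names are hypothetical and must match your own index schema). The service name, index name, API key and API version are left as parameters for you to supply.

```python
import base64
import json
import urllib.request
from pathlib import Path


def build_index_batch(folder):
    """Build an 'upload' batch from the .txt files in a local folder."""
    actions = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Document keys only allow letters, digits, '_', '-' and '=',
        # so the file name is URL-safe base64 encoded to make a valid key.
        key = base64.urlsafe_b64encode(path.name.encode("utf-8")).decode("ascii")
        actions.append({
            "@search.action": "upload",
            "id": key,                     # hypothetical key field
            "file_name": path.name,        # hypothetical field
            "content": path.read_text(encoding="utf-8"),  # hypothetical field
        })
    return {"value": actions}


def post_batch(service, index, api_key, api_version, batch):
    """POST a batch to the Index Docs endpoint of a Search service in Azure."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/index?api-version={api_version}")
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

You would call `build_index_batch` on the folder holding your sample documents and pass the result to `post_batch`. Reading the blobs out of Azurite itself rather than from plain files would need a storage client in place of the `Path` loop.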
We have to migrate customers from standalone Solr to SolrCloud. I know that the cleanest way is to re-import the data, but that would take a really long time.
We tried creating a collection in SolrCloud and then copying the standalone Solr indexes into it. It works.
The question is whether it is worth trying this in production. What can go wrong?
We are trying to enable full-text search. The application stores PDF files in Azure Blob Storage, which is the data source for Azure Search. Most of this works fine; however, the indexer is not able to extract text from a couple of the PDFs. Are there any specific kinds of PDFs that the Azure Search indexer can extract? If yes, what are they?
Any information or help in this regard would be greatly appreciated.
Azure Search can extract all text from PDF text elements. Extracting text from embedded images (which requires OCR) or tables is not yet integrated in Azure Search, but it is on the roadmap.
If your PDFs contain images and you want to extract text from those as well, then you can try following the steps here.
Are there any specific kinds of PDFs that Azure Search Indexer can extract?
Based on my experience, there are no specific kinds of PDFs that the Azure Search indexer can't extract. From your description, I assume you are hitting the Azure Search extraction limit. For more detailed information please refer to Indexing Documents in Azure Blob Storage with Azure Search.
Azure Search limits how much text it extracts depending on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, and 4 million for Standard, Standard S2 and Standard S3 tiers. A warning is included in the indexer status response for truncated documents.
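To get a rough idea of whether a given document is affected, you could compare the length of its extracted text against the limits quoted above. This is only a sketch: the tier keys are names I made up for the lookup, and the text has to come from your own PDF extraction, which is not shown here.

```python
# Character limits per pricing tier, as quoted above
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,  # Standard, Standard S2 and Standard S3
}


def would_be_truncated(text, tier):
    """Return True if the text is longer than the tier's extraction limit."""
    return len(text) > EXTRACTION_LIMITS[tier.lower()]


sample = "x" * 50_000  # stand-in for text extracted from a PDF
for tier in EXTRACTION_LIMITS:
    print(tier, would_be_truncated(sample, tier))
```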
I recently wrote a blog post about my experience with this. I ended up using a Python-based script running in a Docker container within Azure. It's somewhat complicated, but the blog lays it out pretty clearly (and the results have been very good as far as OCR/searchability goes).
I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using nutchwax (a customised version of apache Nutch, tailored to index .warc files, as generated by heritrix). Nutchwax dumps out a Lucene index and for using it in Solr, all that has to be done is to generate a correct schema.
This is all done and it's running as it should; however, the archive is not static and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one and import it back into Solr. However, to do that Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first shot at this was generating .xml files out of the Lucene index file and posting them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably leverage the use of multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could leverage the MergeIndexes capability or the ability to Swap cores for a better experience in your scenario.
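As a rough illustration, both operations are plain HTTP requests against the CoreAdmin handler. The sketch below only builds the URLs; the host, core names and index directory are made up, so substitute your own, and as far as I know the target core needs a commit after a merge before the new documents become visible.

```python
from urllib.parse import urlencode


def core_admin_url(base_url, **params):
    """Build a CoreAdmin request URL from a Solr base URL and query parameters."""
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(params)}"


SOLR = "http://localhost:8983/solr"  # assumed local Solr instance

# Merge a freshly generated Lucene index directory into the core named "live"
merge_url = core_admin_url(SOLR, action="MERGEINDEXES", core="live",
                           indexDir="/data/new-index/index")

# Or build the new index in a "staging" core and swap it with "live"
swap_url = core_admin_url(SOLR, action="SWAP", core="staging", other="live")

print(merge_url)
print(swap_url)
```

Either URL can then be requested with any HTTP client. Neither approach requires restarting Solr.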
I have a 10 MB CSV file of Geolocation data that I tried to upload to my App Engine datastore yesterday. I followed the instructions in this blog post and used the bulkloader/appcfg tool. The datastore indicated that records were uploaded but it took several hours and used up my entire CPU quota for the day. The process broke down in errors towards the end before I actually exceeded my quota. But needless to say, 10 MB of data shouldn't require this much time and power.
So, is there some other way to get this CSV data into my App Engine datastore (for a Java app)?
I saw a post by Ikai Lan about using a mapper tool he created for this purpose but it looks rather complicated.
Instead, what about uploading the CSV to Google Docs - is there a way to transfer it to the App Engine datastore from there?
I do daily uploads of 100000 records (20 megs) through the bulkloader. Settings I played with:
- bulkloader.yaml config: set to auto generate keys.
- include a header row in the raw CSV file.
- speed parameters are set to max (not sure if reducing them would reduce the CPU consumed)
These settings burn through my 6.5 hrs of free quota in about 4 minutes -- but it gets the data loaded (maybe it's from the indexes being generated).
appcfg.py upload_data --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/remote_api --filename=data.csv --kind=yourtablename --bandwidth_limit=999999 --rps_limit=100 --batch_size=50 --http_limit=15
(I autogenerate this line with a script and use Autohotkey to send my credentials).
I wrote this gdata connector to pull data out of a Google Docs Spreadsheet and insert it into the datastore, but it uses Bulkloader, so it kind of takes you back to square one of your problem.
What you could do however is take a look at the source to see how I pull data out of gdocs and create a task(s) that does that, instead of going through bulkloader.
Also you could upload your document into the blobstore and similarly create a task that reads csv data out of blobstore and creates entities. (I think this would be easier and faster than working with gdata feeds)
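A minimal sketch of the reading side of such a task, assuming the CSV has a header row. It only splits the rows into batches; the actual datastore write is left out, since that depends on your entity model, and the column names in the sample are invented.

```python
import csv
import io


def batched_rows(csv_text, batch_size=50):
    """Yield lists of row dicts (keyed by the header row), batch_size rows at a time."""
    reader = csv.DictReader(io.StringIO(csv_text))
    batch = []
    for row in reader:
        batch.append(dict(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch
        yield batch


# Invented sample data standing in for the uploaded file's contents
sample_csv = "city,lat,lng\nA,1.0,2.0\nB,3.0,4.0\nC,5.0,6.0\nD,7.0,8.0\nE,9.0,10.0\n"
batches = list(batched_rows(sample_csv, batch_size=2))
print([len(b) for b in batches])
```

Each batch would then be turned into entities and written in one call, ideally with one task per batch so that a failure only retries a small slice of the file.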