Automating the Solr indexing mechanism

I have indexed a few PDF files in Solr; I have used the curl command for now. My requirement is that if files are pushed to a particular directory, those files must be indexed, with no manual indexing. When files come in, they must be indexed. Is there any way to achieve this? I am new to Solr. Please give brief suggestions. Thanks in advance.

I can see two options:
Create a cron job (or something like that)
Try to use the DataImportHandler's scheduler
I would probably lean more towards the cron-like solution (option 1).
That way, after a file has been indexed it can be moved to a separate folder. This is a very basic solution; a proper queueing system would give you the option to process many files at once.
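
As a rough illustration of the cron-like approach, here is a minimal Java sketch (my own illustration, not a tested solution): it scans a drop directory, posts each PDF to Solr's /update/extract (Solr Cell) handler via SolrJ, and then moves the file to a "done" folder so it is not picked up again. The directory names, core name, and the literal.id choice are assumptions you would adjust for your setup.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DropFolderIndexer {

    public static void main(String[] args) throws Exception {
        File dropDir = new File("/data/solr-drop");      // watched folder (example path)
        Path doneDir = Paths.get("/data/solr-done");     // indexed files are moved here (example path)

        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            File[] pdfs = dropDir.listFiles((dir, name) -> name.toLowerCase().endsWith(".pdf"));
            if (pdfs == null) {
                return;
            }
            for (File pdf : pdfs) {
                // Send the PDF to the extracting request handler (Solr Cell / Tika).
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                req.addFile(pdf, "application/pdf");
                req.setParam("literal.id", pdf.getName());   // use the file name as the document id
                solr.request(req);

                // Move the file out of the drop folder so it is only indexed once.
                Files.move(pdf.toPath(), doneDir.resolve(pdf.getName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            solr.commit();
        }
    }
}

Run it from cron every minute or so; if you need near-real-time indexing, you could replace the single scan with Java's WatchService.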

How to write to different files based on content for batch processing in Flink?

I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before the job starts. The thing is that I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, it seems that DataSet does not have a similar API. I have found some Q&As on Stack Overflow (1, 2, 3). Now I think I have two options:
Use the Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read the files as a stream and use BucketingSink.
My question is how to choose between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this.
We faced the same problem. We too were surprised that DataSet does not support addSink().
I recommend not switching to streaming mode; you might give up some optimizations (e.g. memory pools) that are available in batch mode.
You may have to implement your own OutputFormat to do the bucketing.
You can extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to open/write/close the output streams.
That's what we did and it's working great.
I hope Flink will support this in batch mode soon.
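
For what it's worth, here is a minimal sketch of that idea (my own illustration, not the exact code we used): a RichOutputFormat that derives the target file from the record's content and keeps one open stream per bucket. The bucketFor() rule and the base path are made-up placeholders, and the exact FileSystem.create() signature may differ between Flink versions.

import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class BucketingTextOutputFormat extends RichOutputFormat<String> {

    private final String basePath;                         // e.g. "hdfs:///output"
    private transient Map<String, FSDataOutputStream> openStreams;
    private transient int taskNumber;

    public BucketingTextOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) {
        // nothing to configure
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        this.taskNumber = taskNumber;
        this.openStreams = new HashMap<>();
    }

    // Hypothetical bucketing rule: the first comma-separated token of the record picks the target file.
    private String bucketFor(String record) {
        return record.split(",", 2)[0];
    }

    @Override
    public void writeRecord(String record) throws IOException {
        String bucket = bucketFor(record);
        FSDataOutputStream out = openStreams.get(bucket);
        if (out == null) {
            Path path = new Path(basePath + "/" + bucket + "/part-" + taskNumber);
            out = path.getFileSystem().create(path, FileSystem.WriteMode.OVERWRITE);
            openStreams.put(bucket, out);
        }
        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : openStreams.values()) {
            out.close();
        }
    }
}

You would then attach it with something like dataSet.output(new BucketingTextOutputFormat("hdfs:///output")).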

Is there an easy way to delete a complete Vespa document set?

Playing with Yahoo's vespa.ai, I'm now at a point where I have a search definition I am happy with, but I still have a bunch of garbage test documents stored.
Is there an easy way to delete/purge/drop all of them at once, à la SQL's DROP TABLE or DELETE FROM X?
The only place I have found so far where deleting documents is clearly mentioned is the Document JSON format page. As far as I understand, it requires deleting documents one by one, which is fine, but gets a bit cumbersome when one is just playing around.
I tried deleting the application via the Deploy API using the default tenant, but the data is still there when issuing search requests.
Did I miss something, or is this by design?
There's no API available to do this, but the vespa-remove-index command-line tool could help you out. I.e., to drop everything:
$ vespa-stop-services
$ vespa-remove-index
$ vespa-start-services
You could also play around with using garbage collection for this, but I wouldn't go down this path unless you are unable to use vespa-remove-index.

FileTable External Process Like Zip using T-SQL

Has anyone ever tried to use T-SQL to launch external processes against files in a FileTable? I have not been able to find anything, so it may not be possible.
In particular, I am looking into PGP and ZIP operations. My backup plan is to use C# in combination with queries against the FileTable.
I am curious to know if it can be done with T-SQL instead. I have looked at xp_cmdshell to launch a process, but many people recommend against this.
Thoughts and ideas are much appreciated.
Can you set the folder to compressed (or put it on EFS)?
See http://blog.brucejackson.info/2013/04/sql-file-table-step-by-step.html

solr/browse gives page not found error.

How do I make the browse page load? I have added the handler as described on this page:
https://wiki.apache.org/solr/VelocityResponseWriter
It is still not working. Can anyone brief me on this? Thanks in advance.
A couple of things to check:
Have you restarted Solr?
Is the core you are trying to 'browse' a default core? If not, you need to include the core name in the URL. E.g. /solr/collection1/browse
Are your library statements in solrconfig.xml pointing at the right Velocity jar? Use an absolute path unless you are very sure you know what your base directory is for the relative paths.
Are you getting any errors in the server logs?
If all else fails, start comparing what you have with the collection1 example in the Solr distribution. It works there, so you can compare the relevant entries nearly line by line and even experiment with collection1 to make it more like your failing setup.

Indexing document files

I want to store and index document files like .doc, .pdf, and .txt, and then search these files from a basic web application and let users download them.
I have found two ways. The first is to store these files in MSSQL (FILESTREAM) and search them with the power of full-text search, but this way scares me because the backup file will keep getting bigger. The second is to index these files with the Windows Search Service and search them with a remote query, but this way lacks the power of full-text search.
What is the best way to do this? Is there any better alternative?
Thanks in advance.
You can use the built-in functionality for full-text search or build your own index. I would recommend building your own as an inverted index for document search capabilities. A quite similar question was asked here: Building Big Search Engine. Hopefully it helps you further design what you want.
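
If you do go the custom route, the core data structure is simple: a map from each term to the set of documents containing it. Below is a toy Java sketch of that idea (illustration only; a real index would also need proper tokenization, stemming, ranking, and persistence, and the text would first have to be extracted from the .doc/.pdf/.txt files).

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index: term -> set of document ids.
public class InvertedIndex {

    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the extracted plain text of one document.
    public void add(String docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
            }
        }
    }

    // Return the ids of documents that contain every query term (simple AND query).
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}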
