Is there an easy way to delete a complete Vespa document set? - vespa

While playing with Yahoo's vespa.ai, I've reached a point where I have a search definition I'm happy with, but I still have a bunch of garbage test documents stored.
Is there an easy way to delete/purge/drop all of them at once, à la SQL DROP TABLE or DELETE FROM X?
The only place I've found so far where deleting documents is clearly mentioned is the Document JSON format page. As far as I understand, it requires deleting documents one by one, which is fine but gets a bit cumbersome when one is just playing around.
I tried deleting the application via the Deploy API using the default tenant, but the data is still there when I issue search requests.
Did I miss something, or is this by design?

There's no API available to do this, but the vespa-remove-index command-line tool could help you out. For example, to drop everything:
$ vespa-stop-services
$ vespa-remove-index
$ vespa-start-services
You could also play around with using garbage collection for this, but I wouldn't go down this path unless you are unable to use vespa-remove-index.
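For repeated test runs, the three commands above could be wrapped in a small script. This is only a minimal sketch: the function name reset_vespa_index and the injectable run parameter are made up for illustration, and it assumes the three tools are on the PATH of the user running it.

```python
import subprocess
from typing import Callable, Sequence

# The three commands from the answer above, in the order they must run.
RESET_COMMANDS = (
    ("vespa-stop-services",),
    ("vespa-remove-index",),
    ("vespa-start-services",),
)


def reset_vespa_index(run: Callable[[Sequence[str]], int] = subprocess.call) -> bool:
    """Run each command in order and stop at the first non-zero exit code.

    `run` is a hypothetical hook so the sequence can be dry-run without
    touching a real Vespa node; by default it is subprocess.call.
    """
    for command in RESET_COMMANDS:
        if run(list(command)) != 0:
            print("command failed:", " ".join(command))
            return False
    return True
```

Calling reset_vespa_index() with no arguments runs the real commands; passing a function that only records its argument shows what would be executed.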

Related

Posting large directory of files to SOLR using post tool, how to commit after every file

I am using the Java post tool for Solr to upload and index a directory of documents. There are several thousand documents. Solr only commits at the very end of the process, and sometimes things stop before it completes, so I lose all the work.
Does anyone have a technique to fetch the name of each doc and call post on it, so that each document gets its own commit rather than one large commit of all the docs at the end?
From the help page for the post tool:
Other options:
..
-params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
This should allow you to use -params "commitWithin=1000" to make sure each document shows up within one second of being added to the index.
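If the documents are sent over HTTP instead of through the post tool, the same commitWithin parameter can be put on the update URL. The sketch below only builds that URL; the helper name update_url and the base URL and collection name in the usage note are assumptions, not something from the question.

```python
from urllib.parse import urlencode


def update_url(solr_base: str, collection: str, commit_within_ms: int = 1000) -> str:
    """Build an update URL that asks Solr to commit within the given time.

    commitWithin is given in milliseconds, so 1000 means "within one second".
    """
    query = urlencode({"commitWithin": commit_within_ms})
    return f"{solr_base.rstrip('/')}/{collection}/update?{query}"
```

For example, update_url("http://localhost:8983/solr", "gettingstarted") gives http://localhost:8983/solr/gettingstarted/update?commitWithin=1000.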
Committing after each document is overkill and hurts performance. In any case, it's quite strange that you have to resubmit everything from the start if something goes wrong. I suggest seriously reconsidering the indexing strategy you're using instead of looking for a different way to commit.
That said, if you have no option other than changing the commit configuration, I suggest configuring autoCommit in your Solr collection/index or using the commitWithin parameter, as suggested by @MatsLindh. Just check whether the tool you're using lets you add this parameter.
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to autoCommit is to use commitWithin, which can be defined when making the update request to Solr (i.e., when pushing documents), or in an update RequestHandler.

How to reset solr back to first use?

I'm having a lot of problems running my Solr server. When I have problems committing my CSV file (it's a 500 MB CSV), it throws an error that I am never able to fix. That is why I try to clean up the entire index using
http://10.96.94.98:8983/solr/gettingstarted/update?stream.body=<delete><query>*:*</query></delete>&commit=true
But sometimes it just doesn't delete. In those cases, I use
bin/solr stop -all
and then try again, but it still gives me errors when updating. Then I decided to extract the install tarball again, deleting all my previous Solr files. And that works!
I was wondering if there is a shorter way to go about it. I'm sure the index files aren't the only ones that are generated. Is there any option to revert to a fresh installation?
If you are calling the update command against the right collection and you are committing, you should see the content deleted/reset. If that is not happening, I would check that the server/collection you are querying is actually the same one you are executing your delete command against (here gettingstarted). If that does not work, you may have found a bug, but that is unlikely.
If you really want to delete the collection, you can unload it on the Admin UI's Core page and then delete it from disk. To see where the collection is, look at the right-hand side of the core's Overview page. You will see the Instance variable with the path to your core's directory, for example: .../solr-6.1.0/example/techproducts/solr/techproducts. Deleting that directory after unloading the core will get rid of everything there.
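The delete-by-query mentioned above can also be sent as a POST body rather than through stream.body in the URL, which avoids having to URL-encode the XML by hand. This is a sketch under assumptions: the helper name build_delete_all_request is made up, and the base URL and collection name must match the server you are actually querying.

```python
import urllib.request


def build_delete_all_request(solr_base: str, collection: str) -> urllib.request.Request:
    """Build (but do not send) a delete-everything request with commit=true."""
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    # Same delete-by-query XML as in the question, sent as the request body.
    body = b"<delete><query>*:*</query></delete>"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )
```

To execute it, pass the result to urllib.request.urlopen, for example urllib.request.urlopen(build_delete_all_request("http://10.96.94.98:8983/solr", "gettingstarted")).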

How is ElasticSearch supposed to work in CakePHP 3?

I've been trying my very best not to ask any nosy question here in stackoverflow, but it has been almost one week since I got stuck in this problem and I couldn't find any solution.
I already have my working website built with CakePHP 3.2. What the website basically does is scrape Twitter for tweets containing a given search term, check if it's already in my database, and store it if it doesn't yet exist. Twitter's JSON response has this "tweet_id" property, and I've been using that value to check for whether I should ignore or append a specific tweet to my DB. While this might be okay while my database is small, I suspect it's going to slow things down considerably when my tables grow bigger. Thus my need for ElasticSearch.
My ElasticSearch server is running on my Arch Linux install, and I've configured my app to point to that server. Also, I have my "Type" object named the same way as my "Tweets" table (I followed the documentation up to the overview part: http://book.cakephp.org/3.0/en/elasticsearch.html). This fails with an "Unknown method 'alias'" error, and subsequent Google searches led me to create an alternate pagination class, since that was what some found to be the cause of the error (https://github.com/lorenzo/audit-stash/issues/4), which still doesn't fix things.
I'm not sure if I got this right. I installed the ElasticSearch plugin with the assumption that all I have to do is name the Types the same name as my tables, since to me the documentation "implies" that this should be done on top of the Blog Tutorial they did to "improve query performance".
TLDR, how is this supposed to work? Is my above assumption right? Do I name the Types differently and index everything myself? I'm not sure if there's just too much automagic, or I'm just poor at these sort of things. And yes, I'm new to frameworks (but not PHP, among other languages)
Thanks in advance!

Easy GUI way to add data to a Solr index

I'm working on the front end of an app that uses Solr for data storage. Currently I have an empty index, but it'd (understandably) be a lot easier for me if some dummy data was returned so I could make sure that it's output correctly on the front end.
If I were working with an RDBMS (let's say postgres), I'd open up a GUI (e.g. pgadmin) and type data manually into a few rows to achieve this goal. I have access to the Solr web interface, but I can't see any obvious call to action saying INSERT YOUR DATA HERE. The closest thing I can find to an answer on the web is this SO thread, but it's still not quite the easy GUI-based solution I'm looking for.
So, my question is: Is there a way to quickly and easily insert some data, equivalent to the RDBMS method mentioned above?
Make sure you have defined a schema in schema.xml.
SOLR does indeed have a (limited) html GUI, which on a local installation is probably found at localhost:8983/solr (default). If you can get to the base admin page, then on the left there is a small combobox where you can select a core/collection. If you click on THAT, then you get a list of options that emerges, and you can pick 'documents' to get a similar GUI to what I think you expect from postgres/RDBMS/whatnot.
http://localhost:8983/solr/#/collection1/documents is the URL on a default SOLR installation that I have. This should work as long as you don't have default cores. (Replace collection1 with your collection name and localhost:8983 with wherever your solr is hosted/the port).
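If typing documents by hand gets tedious, a few dummy documents can be generated as JSON and then pasted into that Documents screen or POSTed to the collection's update handler. This is a rough sketch: the function name sample_documents and the fields id and title are hypothetical and have to match whatever is defined in your schema.xml.

```python
import json


def sample_documents(count: int) -> str:
    """Return a JSON array of `count` dummy documents.

    The field names ("id", "title") are placeholders; replace them with
    fields that exist in your own schema.
    """
    docs = [
        {"id": f"dummy-{i}", "title": f"Dummy document {i}"}
        for i in range(1, count + 1)
    ]
    return json.dumps(docs, indent=2)
```

Printing sample_documents(3) gives a three-element JSON array that can be copied as is.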

How to implement Solr into Sitecore

I have to implement a Solr index in Sitecore and would like to know the best approach.
I looked at the following approaches:
1. Capture the publish end event (or other events) and then push the item to the Solr index.
2. Implement a custom database crawler and get all changes from the history table, then push the data to Solr using a custom index.
The second approach sounds like the way to go (in my opinion). In this case, do I need to create a new search index or search manager?
If anyone has done this before, can you point me in the right direction? Links to articles about Sitecore-Solr implementations would also be appreciated.
UPDATE
OK, after reading the Sitecore documentation, this is what I came up with:
1. Create a custom SolrConfiguration class where you can set properties like solrserviceurl and add indexes and their definitions (custom Solr indexes).
2. Create a SolrIndex and add it (in the config file) to your SolrConfiguration. When instantiated, the SolrIndex should subscribe to the AddEntry event of the Sitecore History Manager and communicate with the Solr crawlers.
3. Create a custom processor and hook it into the Sitecore initialisation pipeline. The processor should initialize SolrConfiguration (from step 1).
4. Since everything in your config file will be built using reflection, you can get an instance of your configuration based on your config file.
How does that sound? Any comments would be welcome.
We've done this on a few sites and tend to have a new "published" Solr index and an "unpublished" index.
We intercept the following events:
OnItemSaving
Used to push things into the unpublished index (you may not need this; it depends on whether you want things in preview mode).
OnPublishItemProcessed
We process additions and updates to the published index here. I'm not sure what we do about deletions here without digging right into the code, but we certainly deal with deletions in OnItemDelete (mentioned below).
OnItemDelete
We intercept this to remove things from the published and unpublished indexes (I think we remove from the published index here because Sitecore makes you publish the parent node in order to publish deletions out to the web database).
I hope that helps, I'd post the code if I could (but I'd be scowled at).
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things kind of like how it was doing it.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object that has properties that correspond to our Solr schema. For actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items (starting with the site root) recursively and map each one to the appropriate type based on its template. Then we go through an indexing process for that item (some items need to index multiple documents to Solr in our implementation).
This approach is working very well for us, although I will note that because we are indexing everything at once, it tends to introduce a slight lag between publishing and the site reflecting any changes made to the index. One oversight we made in the beginning, which we will be working to fix soon, is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution that much, but I can definitely see where it would affect others, so keep that in mind.
We didn't particularly want to get into the deletion of items from the index so we do the indexing as a publish:end event.
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.
