I'm working on the front end of an app that uses Solr for data storage. Currently I have an empty index, but it'd (understandably) be a lot easier for me if some dummy data was returned so I could make sure that it's output correctly on the front end.
If I was working with and RDBMS (let's say postgres) I'd open up a GUI (e.g. pgadmin) and type data manually into a few rows to achieve this goal. I have access to the Solr web interface, but I can't see any obvious call to action saying INSERT YOUR DATA HERE. The closest thing I can find to an answer on the web is this SO thread, but it's still not quite the droids easy GUI-based solution I'm looking for.
So, my question is: Is the a way to quickly and easily insert some data equivalent to the RDBMS method mentioned above?
Make sure you have defined a schema in schema.xml.
SOLR does indeed have a (limited) html GUI, which on a local installation is probably found at localhost:8983/solr (default). If you can get to the base admin page, then on the left there is a small combobox where you can select a core/collection. If you click on THAT, then you get a list of options that emerges, and you can pick 'documents' to get a similar GUI to what I think you expect from postgres/RDBMS/whatnot.
http://localhost:8983/solr/#/collection1/documents is the URL on a default SOLR installation that I have. This should work as long as you don't have default cores. (Replace collection1 with your collection name and localhost:8983 with wherever your solr is hosted/the port).
Related
Playing with Yahoo's vespa.ai, I'm now at a point where I have a search definition with which I am happy, but still, have a bunch of garbage test documents stored.
Is there an easy way to delete/purge/drop all of them at once, ala SQL DROP TABLE or DELETE FROM X?
The only place I found at this point where deleting documents is clearly mentioned in the Document JSON format page. As far as I understand it requires deleting documents one by one, which is fine, but gets a bit cumbersome when one is just playing around.
I tried deleting the application via the Deploy API using the default tenant, but the data is still there when issuing search requests.
Did I miss something? or is this by design?
There's no API available to do this, but the vespa-remove-index command line tool could help you out. Ie, to drop everything:
$ vespa-stop-services
$ vespa-remove-index
$ vespa-start-services
You could also play around with using garbage collection for this, but I wouldn't go down this path unless you are unable to use vespa-remove-index.
I've been trying my very best not to ask any nosy question here in stackoverflow, but it has been almost one week since I got stuck in this problem and I couldn't find any solution.
I already have my working website built with CakePHP 3.2. What the website basically does is scrape Twitter for tweets containing a given search term, check if it's already in my database, and store it if it doesn't yet exist. Twitter's JSON response has this "tweet_id" property, and I've been using that value to check for whether I should ignore or append a specific tweet to my DB. While this might be okay while my database is small, I suspect it's going to slow things down considerably when my tables grow bigger. Thus my need for ElasticSearch.
My ElasticSearch server is running on my Arch Linux install, and I've configured my app to point to the said server. Also, I have my "Type" object named the same way as my "Tweets" table (I followed the documentation until the overview part http://book.cakephp.org/3.0/en/elasticsearch.html). This craps out an "Unknown method "alias" error, and following Google searches led me to creating an alternate pagination class since that was what some found to be the cause of the error (https://github.com/lorenzo/audit-stash/issues/4), which still doesn't fix things.
I'm not sure if I got this right. I installed the ElasticSearch plugin with the assumption that all I have to do is name the Types the same name as my tables, since to me the documentation "implies" that this should be done on top of the Blog Tutorial they did to "improve query performance".
TLDR, how is this supposed to work? Is my above assumption right? Do I name the Types differently and index everything myself? I'm not sure if there's just too much automagic, or I'm just poor at these sort of things. And yes, I'm new to frameworks (but not PHP, among other languages)
Thanks in advance!
I want to create search engine. So I had used nutch and solr for the developing it.
But it is not able to crawl each and every url of the website and search results are not as
good as Google.So I started using jcrawler to get list of url.
Now I have list of urls.But I have to index them.
So is there any way where I can index list of urls stored line by line in a file.
and show results vis lucene or solr or any other Java API
How you programmatically do something really depends on which language you plan on writing your code in - fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use nutch with the Solr backend - give it the list of urls as input and set --depth to 1 (so that it doesn't spider anything further).
There are also other "ready" options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a good description of what you want to accomplish and how to approach that (keep in mind that Search is a core product for Google and they have a very, very large set of custom technologies for handling search). If you have specific issues with your own data and how to display that (usually you can do more useful results as you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use Data Import Handler to load the list of URLs from file and then read and index them.
You'd need to use nested entity with outside entity having rootEntity flag set to false.
You'd need to practice a little bit with DIH. So, I recommend that you first learn how to import just the URLs into individual Solr documents and then enhance it with actually parsing of URL content.
When I try to create some entities I don't see the option to input fields. I just see the SaveEntity button.
However I can view all the existing entities.
What is very strange is - there is another entity called VideoEntity for which the create did not work yesterday but works today.
Can somebody help me with this seemingly unpredictable tool ?
Regards,
Sathya
i think the console knows what properties each entity has based on existing data, rather then your models. and the data is only updated periodically. when did you upload your app? maybe waiting a few hours will give the console time to update.
alternatively, you could use the remote api to add your entities, or write a small snippet and upload such as ...
VideoStatsEntity(app='home', ip='116.89.52.67', params='tag=20130210').put()
Writing a simple interface to the data-store to allow you to edit/create models is probably the best thing to do in this case. You know what they contain so you can adjust your interface accordingly, rather then waiting for the admin interface to "catch up" as Gwyn notes.
I believe that there are some property types that are impossible to add via the admin interface that you are using so you'll probably get to the point sooner rather then later of creating a custom interface.
The admin datastore view is good for quickly checking out the contents of the datastore, but ever tried paging through 100's of entries? Not fun.
I have to implement Solr index into Sitecore and I would like to know what is the best approach?
I looked at following approaches:
Capture publish end event (or other events) and then push item to solr index
Implement custom database crawler and get all changes from history table. Then using custom index push data to solr.
Second approach sounds like a way to go (in my opinion). In this case do I need to create a new search index, or search manager?
If anyone's done it before, can you point me into the right direction? Also if you could post some links to articles about sitecore-solr implementation.
UPDATE
Ok, after reading sitecore documentation this is what I came up with :
Create your custom SolrConfiguration class where you can set properties like solrserviceurl, add indexes and its definition (custom solr indexes)
Create SolrIndex and add it (in the config file) to your SolrConfiguration. Which instantiating, solrindex should subscribe to AddEntry event of Sitecore History Manager, and communicate with solr crawlers.
Create custom processor and hook into sitecore initialisation pipeline. Processor should initialize SolrConfiguration (from step 1)
Since everything in your config file in will be build using refrection, you can get instance of your cofiguration based on your config file
How does that sound like. Can I have any comments please?
We've done this on a few sites and tend to have a new "published" solr index and "unpublished" index
We interrupt:
OnItemSaving
Event to push things into the unpublished index (you may not need this, it depends if you want things in preview mode)
OnPublishItemProcessed
We process additions and updates to the published index here, I'm not sure what we do about deletions here without digging right into the code but certainly deal with deletions on the OnItemDelete (mentioned below)
OnItemDelete
We interrupt here to remove things from the published and non-published index (I think we remove from the published index here because Sitecore makes you publish the parent node in order to publish out deletions to the web database)
I hope that helps, I'd post the code if I could (but I'd be scowled at).
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things kind of like how it was doing it.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object that has properties that correspond to our Solr schema. For actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items (starting with the site root) recursively and map them to the appropriate type based on its template. Then we go through an indexing process for that item (some items need to index multiple documents to Solr in our implementation).
This approach is working very well for us except I will note that because we are indexing everything at once, it tends to introduce a slight bit of lag time between publish and the site reflecting any changes made to the index. One oversight we made in the beginning but will be working to fix soon is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution that much really, but I can definitely see where it would others, so keep that in mind.
We didn't particularly want to get into the deletion of items from the index so we do the indexing as a publish:end event.
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.