Retrieving documents from Vespa at scale - vespa

I am looking for an overview of what is required, and how to connect to Vespa, to retrieve indexed data at scale.
I've run stress tests on the Vespa document RESTful API and, as the documentation suggests, it has an upper bound.
http://docs.vespa.ai/documentation/document-api-guide.html indicates the way forward but assumes a head start on the subject matter.
I can figure out
com.yahoo.documentapi.messagebus.MessageBusDocumentAccess
and the related bus creation, etc.
MessageBusDocumentApiTestCase
adds some more understanding.
The jrt package (https://github.com/vespa-engine/vespa/tree/master/jrt) and a few other resources help, but I have to admit the trail is tough to piece together :)
The trouble is that I can't find a guide, if one is documented, that clearly explains how to invoke Vespa from an external system or, if that's not possible, how to run an embedded client and how it talks to the Vespa cluster.
Please point me to such an overview if it exists.
Edit:
vespaclient-java/src/main/java/com/yahoo/vespaget/DocumentRetriever.java
is another example. Thoughts?

This seems like a duplicate of a question that has already been answered in a GitHub issue: https://github.com/vespa-engine/vespa/issues/3628
For feeding Vespa clusters from external systems that are not part
of your Vespa cluster, we recommend
http://docs.vespa.ai/documentation/vespa-http-client.html.
For single GET operations against Vespa, the HTTP RESTful API for
GET described in http://docs.vespa.ai/documentation/document-api.html
is the best option. The RESTful API for GET is built on top of the
Document API (http://docs.vespa.ai/documentation/document-api-guide.html), which is a
low-level API meant for nodes that are already part of a Vespa cluster
and have access to configuration such as the schema, the content
clusters, and the number of nodes.
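A minimal sketch of a single GET from an external system, using the /document/v1/<namespace>/<document-type>/docid/<id> path layout of the RESTful document API. The endpoint, namespace, document type, and ID used with it are hypothetical placeholders, not values from the question:

```python
import json
import urllib.request
from urllib.parse import quote


def document_get_url(endpoint, namespace, doctype, doc_id):
    """Build the /document/v1 URL for one document.

    All four arguments are placeholders for your own deployment.
    """
    # Percent-encode each path segment so IDs with spaces or slashes stay valid.
    parts = [quote(p, safe="") for p in (namespace, doctype, doc_id)]
    return "{}/document/v1/{}/{}/docid/{}".format(endpoint.rstrip("/"), *parts)


def fetch_document(endpoint, namespace, doctype, doc_id, timeout=5.0):
    """Issue the GET and decode the JSON body (needs a reachable Vespa container)."""
    url = document_get_url(endpoint, namespace, doctype, doc_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

To push throughput, many such GETs would typically be issued concurrently over several connections; how far that scales depends on the cluster and is not something this sketch measures.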

Related

Advice for implementing a user-search form with a React UI and a Spring Boot server

I have a Spring Boot/React application. My database holds a list of users that will already have been populated from LDAP.
As part of a form, I need to let users specify a list of users. Since they could be searching (and technically selecting) from up to 400,000 users (most cases will be in the 10k-or-less range), I assume I need to handle this on both the client and the server side.
Does anyone have any recommendations on the approach or technologies?
I'm not using a small amount of data, but I don't want to over-engineer it either (tips are mostly for server-side, but any are welcome).
If you are using Hibernate as the ORM in your application, you may also want to check out Hibernate Search. It seems to fit your purpose, since searching through a list of users can be done with a normal text-based index. Hibernate Search leverages Lucene, which is well suited to text-based indexing and searching.
The other answer is good and works perfectly fine when you have a small set of data, but be aware of a few design issues with it.
Lucene is not distributed, and you can't easily scale it horizontally across multiple machines without duplicating the whole index. That is perfectly fine for a small set of data, and in fact it's pretty fast because there is no network call (with Elasticsearch, there is).
If you want to build a stateless application that is easy to scale horizontally, Lucene will not help: it is stateful, and a newly spawned app server cannot serve searches until it has finished building its local Lucene index.
Elasticsearch (ES) is REST-based, is written in Java, and has a very good Java client that you can easily use for simple to complex use cases.
Last but not least, please read the Stack Overflow answer by none other than Shay Banon, the creator of Elasticsearch, who explains why he created ES in the first place :) It gives more trade-offs and insights for choosing the best solution for your use case.

What's the Best Way to Query Multiple REST APIs and Deliver Relevant Search Results?

My organization has multiple databases that we need to provide search results for. Right now you have to search each database individually. I'm trying to create a web interface that will query all the databases at once and sort the results based upon relevance.
Some of the databases I have direct access to. Others I can only access via a REST API.
My challenge isn't knowing how to query each individual database. I understand how to make API calls. It's how to sort the results by relevance.
On the surface it looks like Elasticsearch would be a good option. Its inverted index seems like a good way to figure out which results are going to be the most relevant to our users. It's also super fast.
The problem is that I don't see a way (so far) to include results from an external API into Elasticsearch so it can do its magic.
Is there a better option that I'm not aware of? Or is it possible to have Elasticsearch evaluate the relevance of results from an external API while also including data from its own internal indices?
I did find an answer, although nobody replied. :\
The answer is to use the http_poller input plugin with Logstash. It makes an API call and ingests the results into Elasticsearch.
Another option could be some form of microservices orchestration for the various API calls, then merging them into a final result set.
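The "merge them into a final result set" option can be sketched roughly as below. It assumes each source returns hits with a numeric score; since scores from different engines are not directly comparable, the sketch min-max scales each source to [0, 1] before sorting. That normalization is a crude assumption chosen for illustration, not something the answer prescribes, and the `id` and `score` field names are hypothetical:

```python
def normalize(results):
    """Min-max scale one source's scores to [0, 1] (a crude assumption)."""
    if not results:
        return []
    scores = [r["score"] for r in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # When every score is identical, treat all hits as equally relevant.
    return [dict(r, score=(r["score"] - lo) / span if span else 1.0)
            for r in results]


def merge(*sources):
    """Combine hits from several sources into one list, best score first."""
    merged = [hit for source in sources for hit in normalize(source)]
    # sorted() is stable, so ties keep the order in which sources were passed.
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)
```

Ingesting everything into one index, as with the Logstash approach, avoids the normalization problem entirely because a single engine scores all documents.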

Is it possible to get data from other companies' databases?

I was wondering how so many job sites have so many job offers/information regarding other companies' offers. For instance, if I were to start my own job searching engine, how would I be able to get the information that sites like indeed.com have in my own databases? One site (jobmaps.us) says that it's "powered by indeed" and seems to be following the same format as indeed.com (as do all other job searching websites). Is there some universal job searching template that I can use?
Thanks in advance.
Some services offer an API which allows you to "federate" searches (relay them to multiple data sources, then gather all the results together for display in one place). Alternatively, some offer a mechanism that would allow you to download/retrieve data, so you can load it into your own search index.
The latter approach is usually faster and gives you total control but requires you to maintain a search index and track when data items are updated/added/deleted on remote systems. That's not always trivial.
In either case, some APIs will be open/free and some will require registration and/or a license. Most will have rate limits. It's all down to whoever owns the data.
It's possible to emulate a user browsing a website, sending HTTP requests and analysing the response from a web server. By knowing the structure of the HTML, it's possible to extract ("scrape") the information you need.
This approach is often against site policies and is likely to get you blocked. If you do go for this approach, ensure that you respect any robots.txt policies to avoid being blacklisted.
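As a small illustration of respecting robots.txt, Python's standard `urllib.robotparser` can check a URL against a site's rules before any request is sent. The rules, user agent, and URLs here are made-up examples:

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, url):
    """Return True if robots.txt rules (given as lines) permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content; a real crawler would download the site's file.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]
```

A real crawler would typically call `set_url(...)` and `read()` on the parser to load the live file, and would also throttle its request rate.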

Apatar for feeding data into Solr

I need to fetch data from a normalized MSSQL database and feed it into a Solr index.
I was wondering whether Apatar can be used for the job. I've gone through its documentation but didn't find the information I'm looking for. It states that it can fetch data from SQL Server and post it over HTTP, but I'm still not sure whether it can post the fetched data as XML over HTTP.
Any advice will be highly valuable. Thank you.
I am not familiar with Apatar, but seeing as it is a Java application, it may be a bit challenging to set up in a Windows environment. For the various scenarios where I need to fetch data from an MSSQL database and feed it to Solr, I have written custom C# code that leverages the SolrNet client. The code tends to be simple and straightforward, and in the cases where we need to load data at specified intervals, we use scheduled tasks that call a console application. I would recommend checking out the Create/Update section of the SolrNet site for some examples of loading and updating data with the .NET client.
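Regarding the XML-over-HTTP part of the question: whatever tool does the fetching, Solr's XML update format is an `<add>` element wrapping one `<doc>` per row, with a `<field name="...">` per column. A rough sketch of building that payload from already-fetched rows follows; the field names are hypothetical and would have to match your Solr schema:

```python
import xml.etree.ElementTree as ET


def solr_add_xml(rows):
    """Render rows (dicts of field name -> value) as a Solr <add> message."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be POSTed to the Solr update handler with an XML content type, followed by a commit.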

Trending Words in Solr 4.0

Is anyone aware of any upcoming or plugin support for Solr 4.0 trending word/topic functionality?
I am aware of various DIY algorithmic approaches and some external frameworks that could perhaps be used (Mahout, etc.), but given its popularity I'd imagine there are already efforts to make this a pluggable part of Solr.
Failing that, if anyone can point to a resource that details using an external framework that would be much appreciated.
If you want a really hacky way to do it, you can capture incoming searches in your Solr front end (PHP, or whatever), send them to an external database, and then query that database for the top searches within a certain time period, as well as for outlier searches that are up-and-coming.
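A rough sketch of that idea, with an in-memory list of (timestamp, term) pairs standing in for the external database. The "up-and-coming" rule used here (a term counts as rising when its recent count is at least `min_ratio` times its earlier count plus one) is an arbitrary assumption for illustration:

```python
from collections import Counter


def top_searches(log, since, limit=3):
    """Most frequent search terms with timestamp >= since."""
    counts = Counter(term.lower() for ts, term in log if ts >= since)
    return counts.most_common(limit)


def rising_searches(log, split, min_ratio=2.0):
    """Terms searched far more often after `split` than before it."""
    before = Counter(term.lower() for ts, term in log if ts < split)
    after = Counter(term.lower() for ts, term in log if ts >= split)
    # The +1 keeps brand-new terms from qualifying on a single search.
    return sorted(term for term, n in after.items()
                  if n >= min_ratio * (before[term] + 1))
```

With a real database, both functions would become grouped count queries over a timestamp-indexed table of captured searches.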
