We are trying to use an ML model in Vespa. We have textual data stored in Vespa; can somebody help us with the questions below?
An example of an ONNX model trained using scikit-learn being used in Vespa.
Where to add preprocessing steps before model training and prediction when using an ONNX model in Vespa, with an example.
This is a very broad question and the answer very much depends on what your goals are. In general, the documentation for using an ONNX model in Vespa can be found here:
https://docs.vespa.ai/documentation/onnx.html
An example that uses an ONNX BERT model for ranking can be found in the Transformers sample application:
https://github.com/vespa-engine/sample-apps/tree/master/transformers
Note that both these links assume that you have an existing model. In general, Vespa is a serving platform and is not usually used in the model training process. As such, Vespa doesn't really care where your model comes from, be that scikit-learn, PyTorch, or any other system; ONNX is a general format for exchanging ML models between systems.
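To address the scikit-learn part of the question, here is a minimal sketch of training a model and exporting it to ONNX with the skl2onnx package (the dataset, feature count, and file name are placeholders):

```python
# Train a scikit-learn model and export it to ONNX with skl2onnx.
# The dataset, feature count, and file name here are placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Declare the input signature: a float tensor with a dynamic batch
# dimension and 4 features (the iris feature count).
initial_type = [("float_input", FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

The resulting model.onnx file is what you would add to your Vespa application package and reference from a ranking expression, as described in the documentation linked above.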
However, there are some foundational ideas that I should get across that may clarify things a bit. Vespa currently considers all ML models to have numeric inputs and outputs (in the form of tensors). This means you can't feed text directly to your model and have text come out on the other side. Most textual data these days is encoded to some numeric representation such as embedding vectors, or, as the BERT example above shows, tokenized so that each token gets its own vector representation. After model computation, embedding vectors or token-set representations can be decoded back to text.
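For example, a BERT-style model consumes token IDs rather than raw text. A minimal sketch using the Hugging Face transformers tokenizer (the model name is just an example; use whatever tokenizer your model was trained with):

```python
# Turn raw text into the numeric token IDs a BERT-style ONNX model
# expects. "bert-base-uncased" is just an example; use the tokenizer
# that matches your model.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer.encode("is it raining in tokyo?", add_special_tokens=True)
print(token_ids)  # list of ints; 101/102 are the [CLS]/[SEP] markers
```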
Vespa currently handles the computational part; the (pre-)processing of encoding/decoding text to embeddings or other representations is up to the user. Vespa does offer a rich set of features to help out in this regard in the form of document and query processors. So you can create a document processor that encodes the text of each incoming document to some representation before storing it. Likewise, a searcher (query processor) can be created that encodes incoming textual queries to a compatible representation before documents are scored against it.
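If you prefer not to write a Java document processor, the same encoding can also happen in the feed client before the document reaches Vespa. A sketch against Vespa's /document/v1 HTTP API, where the namespace, document type, field names, and the encode() helper are all hypothetical placeholders:

```python
# Encode a document's text client-side, then feed it via Vespa's
# /document/v1 REST API. Namespace ("mynamespace"), document type
# ("mydoc"), field names, and encode() are hypothetical placeholders.
import requests

def encode(text):
    # Placeholder: replace with a real sentence-embedding model that
    # produces what your ranking model expects.
    return [0.0] * 384

text = "some document text"
doc = {
    "fields": {
        "body": text,
        # Assumes "embedding" is a dense tensor field in the schema.
        "embedding": {"values": encode(text)},
    }
}
resp = requests.post(
    "http://localhost:8080/document/v1/mynamespace/mydoc/docid/1",
    json=doc,
)
resp.raise_for_status()
```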
So, in general, you would train your models outside of Vespa using whatever embedding or tokenization strategies your model requires. When deploying the Vespa application, you add the models along with any required custom processing code, which is used when feeding or querying Vespa.
If you have a more concrete example of what you are trying to achieve I could be more specific.
Related
I need to extract entities from Word and PDF documents. Documents can be in the range of 10 to 20 pages. Are there scalable libraries/APIs available that we can plug into our processing pipeline? Any comparative study of different solutions would be helpful.
Take a look at Watson Natural Language Understanding (you'll need to get an IBM ID and then log in to see this content - don't worry, the cost is $0). With Watson Natural Language Understanding you will want to look at the API Explorer to find the correct API syntax to get the results you are looking for.
I also noticed that you mention Word/PDF documents. You will need to convert those using the Watson Discovery service; you can then pass the converted documents to Watson Natural Language Understanding, which takes JSON, text, or HTML input.
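Once you have plain text, entity extraction is a single call. A sketch using the watson_developer_cloud Python SDK of that era (credentials and the version date are placeholders; newer SDK versions may require .get_result() on the response):

```python
# Extract entities from plain text with Watson Natural Language
# Understanding via the watson_developer_cloud Python SDK.
# Username/password and the version date are placeholders.
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import (
    Features, EntitiesOptions)

nlu = NaturalLanguageUnderstandingV1(
    version="2018-03-16",
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
)
response = nlu.analyze(
    text="IBM is headquartered in Armonk, New York.",
    features=Features(entities=EntitiesOptions(limit=10)),
)
for entity in response["entities"]:
    print(entity["type"], entity["text"])
```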
I am trying to compare data ingested into both Accumulo and Solr from the same source XML. The Accumulo ingest is legacy code, while the Solr ingest is new code. I can easily extract data from Solr using SolrCloud and choosing CSV or JSON, which is easily readable. But I'm at a loss for how to easily view the data in Accumulo. I used scan to view the data, but it is not easily readable. Is there a way to export the data in Accumulo to CSV or something similar so it will be easy to read and compare with other datasets?
As I understand it, Apache Solr is a document store which uses Lucene indexes to make search fast via a web-based REST interface. On the other hand, Apache Accumulo is a massively scalable sorted key-value store, which stores arbitrary key-value pairs with cell-level security labels, in accordance with the user's application, queryable with a Java API. It makes no sense to compare the two. They are entirely different applications. Accumulo is a low-level infrastructure application, upon which you can build complex systems, such as a search engine comparable to Solr, but it is not directly comparable to Solr because Accumulo is not a search engine.
To answer your question about how to view data in Accumulo: use its Java API. I recommend starting with the Tour on its web page for some examples of how to query it. As for how the data is presented, and in what form, that depends on the application that ingested it in the first place. It can be arbitrary binary data in byte arrays and may not be directly viewable. Accumulo is agnostic to the nature of the data stored in its key-value pairs.
When you said "I used scan to view the data", you were probably referring to the scan command in Accumulo's shell. Be aware that the shell is not the primary query interface; it is intended for system administration and triage of data ingest. The Java API is the primary means of querying.
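That said, if you'd rather not write Java just to eyeball the data, one option is the community pyaccumulo client, which talks to Accumulo's Thrift proxy. A sketch that dumps a table to CSV (host, port, credentials, and table name are placeholders, and the proxy must be running; as noted above, values may be arbitrary binary data):

```python
# Dump an Accumulo table to CSV through the Thrift proxy using the
# community pyaccumulo client. Host, port, credentials, and table name
# are placeholders; values are written as-is and may be binary.
import csv

from pyaccumulo import Accumulo

conn = Accumulo(host="localhost", port=42424,
                user="root", password="secret")

with open("dump.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["row", "column_family", "column_qualifier", "value"])
    for entry in conn.scan("mytable"):
        writer.writerow([entry.row, entry.cf, entry.cq, entry.val])
```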
The Accumulo open source community is pretty responsive to questions. If you're having trouble figuring out how best to use it for your needs, I would advise asking on their community mailing lists, which can be found on their website. Stack Overflow is more suitable for very specific questions than for generalized "getting started" tutorials.
I am currently exploring three services for identifying whether a person's tweets or Facebook posts are helpful or not:
Personality Insights
Natural Language Understanding
Discovery
Will I need to write my own wrapper on top of these services to identify the helpfulness characteristic, or is there another way to just query and get a result?
Can anyone please advise which service I should use for this task?
Thanks
As Neil says, it all depends on how you define helpfulness.
Discovery:
If you want to use Discovery, you need some data to work with, and you can narrow down what you want with filters. Discovery uses data analysis combined with cognitive intuition to take your unstructured data and enrich it so you can discover the information you need.
Personality Insights:
If you want to use Personality Insights, it helps you understand personality characteristics, needs, and values in written text. The service uses linguistic analytics to infer individuals' intrinsic personality characteristics, including Big Five, Needs, and Values, from digital communications such as email, text messages, tweets, and forum posts.
Watson Knowledge Studio:
If you want to build custom models for tweets, you can use WKS (Watson Knowledge Studio). This service provides easy-to-use tools for annotating unstructured domain literature and uses those annotations to create a custom machine-learning model that understands the language of the domain. The accuracy of the model improves through iterative testing, ultimately resulting in an algorithm that can learn from the patterns it sees and recognize those patterns in large collections of new documents. For example, if you want a model that understands the automotive domain, you can simply give WKS some annotated examples.
It all depends on how you define helpfulness: whether it is helpfulness in general, helpfulness in answering a question, etc.
For Personality Insights, have a look at https://www.ibm.com/watson/developercloud/doc/personality-insights/models.html which has all the traits, as well as what they mean. The closest trait to helpfulness is probably Conscientiousness.
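For illustration, here is a sketch of scoring a user's collected posts and reading out that Conscientiousness trait with the watson_developer_cloud Python SDK (credentials and the sample posts are placeholders; the service needs a reasonable amount of text to produce meaningful scores, and newer SDK versions may require .get_result()):

```python
# Score a batch of tweets/posts with Personality Insights and read out
# the Conscientiousness trait. Credentials and posts are placeholders.
from watson_developer_cloud import PersonalityInsightsV3

pi = PersonalityInsightsV3(
    version="2017-10-13",
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
)
collected_posts = ["example tweet one ...", "example tweet two ..."]
text = " ".join(collected_posts)  # concatenate the user's posts
profile = pi.profile(text, content_type="text/plain", raw_scores=True)

for trait in profile["personality"]:
    if trait["trait_id"] == "big5_conscientiousness":
        print("Conscientiousness percentile:", trait["percentile"])
```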
Neil
When will IBM make its Watson Q&A API capable of accepting a custom corpus?
Is there a roadmap I can see?
The Question and Answer service currently doesn't provide a way to use your own data, but you can get similar or better results by combining Document Conversion and Retrieve and Rank.
You use Document Conversion to convert your corpus documents (PDF, DOCX, HTML) into answer units, which are then indexed by the Retrieve and Rank service.
The Retrieve and Rank service is built on top of Apache Solr; once you load your data into the Solr index, you can create and train a ranker (a machine-learning model that knows how to sort results).
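To make the flow concrete, here is a rough sketch of the two steps using plain HTTP calls. The endpoints, version date, config values, and response fields shifted over the services' lifetimes, so treat all of them as assumptions to verify against the current API reference:

```python
# Sketch: convert a PDF into answer units with the Document Conversion
# REST API, then index them into the Solr collection behind Retrieve
# and Rank. Endpoint, version date, config values, and response fields
# are assumptions based on the docs of the time; verify before use.
import json
import requests

AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD")

# 1) PDF/DOCX/HTML -> answer units (JSON).
conv = requests.post(
    "https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document",
    params={"version": "2015-12-15"},
    auth=AUTH,
    files={
        "config": (None, json.dumps({"conversion_target": "answer_units"}),
                   "application/json"),
        "file": open("manual.pdf", "rb"),
    },
)
units = conv.json().get("answer_units", [])

# 2) Index each answer unit into the Retrieve and Rank Solr collection
#    (cluster id and collection name are placeholders).
solr_url = ("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/"
            "solr_clusters/YOUR_CLUSTER_ID/solr/YOUR_COLLECTION/update")
docs = [{"id": u["id"], "body": u["content"][0]["text"]} for u in units]
requests.post(solr_url, params={"commit": "true"}, auth=AUTH, json=docs)
```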
To expand on German's answer, also take a look at the Watson Natural Language Classifier (NLC) and Dialog services, which are additional building blocks for creating a custom Question and Answer application. NLC classifies text and allows you to trigger an action, and Dialog allows you to create and manage virtual conversations with your users.
Here is a great blog with an introduction to both NLC and Dialog, and another good blog that introduces the Watson Document Conversion and Retrieve and Rank services.
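To give a flavor of NLC, classifying a user utterance is a single call once a classifier has been trained. A sketch with the watson_developer_cloud Python SDK (classifier id and credentials are placeholders):

```python
# Classify an incoming user question with Natural Language Classifier.
# The classifier must already be trained; its id and the credentials
# are placeholders.
from watson_developer_cloud import NaturalLanguageClassifierV1

nlc = NaturalLanguageClassifierV1(
    username="YOUR_USERNAME",
    password="YOUR_PASSWORD",
)
result = nlc.classify("YOUR_CLASSIFIER_ID", "How do I reset my password?")
print(result["top_class"])  # best matching class
for c in result["classes"]:
    print(c["class_name"], c["confidence"])
```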
I am working on building a document similarity graph for a collection. I already do all the basic things (tokenization, stemming, stop-word removal) and use a bag-of-words representation for the documents, computing similarity with the Jaccard coefficient. I am now trying to extract named entities and evaluate whether they would help improve the quality of the document similarity graph. I have spent much of my time looking for ground-truth datasets for this analysis and have been very disappointed with the Message Understanding Conference (MUC) datasets: they are cryptic to understand and require substantial data cleaning/massaging before they can be used on a different platform (like Scala).
More specifically, my questions are:
Are there tutorials on getting started with the MUC datasets that would make it easier to analyze the results using open-source NLP tools like OpenNLP?
Are there other datasets available?
Tools like OpenNLP and Stanford CoreNLP employ approaches that are essentially supervised. Correct?
GATE is a great tool for hand-annotating your own text corpus. Correct?
For a new test dataset (that I hand-create), how can I compute a baseline (vocabulary transfer), or what kinds of metrics can I compute?
First of all, I have a few concerns about using the Jaccard coefficient to compute similarity; I'd expect TF-IDF with cosine similarity to give better results.
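To illustrate the difference, a small sketch comparing the two measures on a toy document pair with scikit-learn:

```python
# Compare binary Jaccard similarity with TF-IDF cosine similarity on a
# toy pair of documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the cat lay on the rug"]

# Jaccard on binary bag-of-words: |intersection| / |union| of term sets.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
print("Jaccard:", jaccard_score(binary[0], binary[1]))

# TF-IDF cosine: down-weights terms like "the" that occur in every document.
tfidf = TfidfVectorizer().fit_transform(docs)
print("TF-IDF cosine:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```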
Some answers to your questions:
See the CoNLL 2003 evaluation campaign: it also provides data, evaluation tools, etc. You may also have a look at ACE.
Yes
GATE is also a pipeline that automatically annotates text, but as far as I know its NER component is rule-based.
A baseline is most of the time a very simple algorithm (e.g., predicting the majority class), so it is not a baseline for comparing corpora, but for comparing approaches.