WKS - Training model to identify entities on tables - ibm-watson

We are trying to train our model to identify values from multiple types of tables that contain different numbers of rows and columns. A text row was extracted to begin the training; first we configured our system types, then marked the entities and the relation “AllInOne”. We are able to annotate 10 relations in a training set, but when the model is tested we only see 8 relations, even after creating other document sets for training and testing the model multiple times. Is there another way to associate the column value with the row values in a single relation, considering there isn't a standard for the types of tables we are analyzing with the Discovery service?
We expect the Discovery service response to look like the following:
"relations": [
{
"type": "AllInOne",
"sentence": "…",
"arguments": [
{
"entities": [
{
"“text": "””",
"type": "entity1"
}
]
},
{
"entities": [
{
"“text": "””",
"type": "entity2"
}
]
},
{
"entities": [
{
"“text": "””",
"type": "\"entity..n”,"
}
]
},
{ "..." }
]
}

The machine learning model that is trained in Watson Knowledge Studio targets unstructured natural language text. It may not be suitable for (semi-)structured formats like tables, especially for relations.

Related

Which should I choose, Document Model or Relation Model?

In my application, I chose the Document Model, but I still have some questions.
Here is my example document:
{
    "catalogs": {
        "cat-id1": {
            "name": "catalog-1",
            "createdAt": 123,
            "products": {
                "pro-id1": {
                    "name": "product-1",
                    "createdAt": 321,
                    "ingredients": {}
                },
                "pro-id2": {
                    "name": "product-2",
                    "createdAt": 654,
                    "ingredients": {}
                }
            }
        },
        "cat-id2": {
            "name": "catalog-2",
            "createdAt": 456,
            "products": {
                "pro-id3": {
                    "name": "product-3",
                    "createdAt": 322,
                    "ingredients": {}
                },
                "pro-id4": {
                    "name": "product-4",
                    "createdAt": 655,
                    "ingredients": {}
                }
            }
        }
    }
}
But the ingredients in a product refer to another document:
{
    "ingredients": {
        "ing-id1": {},
        "ing-id2": {}
    }
}
The Document Model has several benefits:
Easy to edit the schema, e.g. if (user.first_name) user.first_name = user.name.split(' ')[0]
No need to join; you can easily fetch all the data at once.
Also I know that:
On updates to a document, the entire document usually needs to be rewritten.
For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document.
The main idea is: which data model leads to simpler application code?
My questions are:
What document size should I aim to keep?
My application already has a relational DB; should I combine the Document Model with the relational DB to reduce complexity?
Since you already have a relational database in use, I don't see a real benefit to using a document-based DB as well.
Your database schema seems simple enough to be served by a relational DB. If catalog entries were very different from each other, you might consider a document-based model, but that does not seem to be the case.
Therefore, my advice is to stick with a relational model.
I would design the model like this (a SQL sketch follows the example below):
A table for each entity (catalog, product, ingredient) where each entry has a unique id
A relation table for each n:m relationship (catalogProduct, productIngredient) that contains only the ids of the entities in the relationship
An example:
The ingredients ing1, ing2 and ing3 are stored in the table ingredient.
The products prod1 and prod2 are stored in product.
ing1 and ing2 are needed for prod1
ing2 and ing3 for prod2
In productIngredient, each entry stores the ID of an ingredient and the ID of the product it is used in:
prod1 : ing1
prod1 : ing2
prod2 : ing2
prod2 : ing3
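
In SQL, that design could look like the following minimal sketch (table and column names are illustrative, not prescribed):

CREATE TABLE catalog (
    catalogId  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    createdAt  BIGINT
);

CREATE TABLE product (
    productId  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    createdAt  BIGINT
);

CREATE TABLE ingredient (
    ingredientId  INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);

-- The n:m relation tables hold only the ids of the related entities.
CREATE TABLE catalogProduct (
    catalogId  INTEGER NOT NULL REFERENCES catalog (catalogId),
    productId  INTEGER NOT NULL REFERENCES product (productId),
    PRIMARY KEY (catalogId, productId)
);

CREATE TABLE productIngredient (
    productId     INTEGER NOT NULL REFERENCES product (productId),
    ingredientId  INTEGER NOT NULL REFERENCES ingredient (ingredientId),
    PRIMARY KEY (productId, ingredientId)
);

Fetching all ingredients of a product is then a single join through productIngredient.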

Representing JSON data in relational database tables

I have a problem where I need to convert a JSON payload into SQL tables while maintaining the relationships established in the payload, so that I can later query the tables and recreate the JSON payload structure.
For example:
{
    "batchId": "batch1",
    "payees": [
        {
            "payeeId": "payee1",
            "payments": [
                {
                    "paymentId": "paymentId1",
                    "amount": 200,
                    "currency": "USD"
                },
                {
                    "paymentId": "paymentId2",
                    "amount": 200,
                    "currency": "YEN"
                },
                {
                    "paymentId": "paymentId3",
                    "amount": 200,
                    "currency": "EURO"
                }
            ]
        }
    ]
}
For the above payload, I have a batch with payments grouped by payees. At its core it all boils down to a batch and its payments, but within that you can have groupings; in the example above, it's grouped by payees.
One thing to note is that the payload may not always follow the above structure. Instead of grouping by payees, it could be by something else, like currency. Or there could be no grouping at all: just a root-level batch and an array of payments.
I want to know if there are conventions/rules I can follow for representing such data in relational tables. Thanks.
edit:
I am primarily looking to use Postgres and have looked into the jsonb feature that it provides for storing json data. However, I'm still struggling to figure out how/where (in terms of which table) to best store the grouping info.
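
For illustration, one conventional Postgres shape, assuming a payment always belongs to exactly one batch and that the grouping dimension can vary per batch (table and column names are hypothetical, not a definitive design):

CREATE TABLE batch (
    batchId  TEXT PRIMARY KEY
);

CREATE TABLE payment (
    paymentId  TEXT PRIMARY KEY,
    batchId    TEXT NOT NULL REFERENCES batch (batchId),
    amount     NUMERIC NOT NULL,
    currency   TEXT NOT NULL
);

-- Records how each payment was grouped in the original payload.
-- groupType names the grouping dimension ('payee', 'currency', ...)
-- and groupKey holds its value ('payee1', 'USD', ...); a batch with
-- no grouping simply has no rows here.
CREATE TABLE paymentGrouping (
    paymentId  TEXT NOT NULL REFERENCES payment (paymentId),
    groupType  TEXT NOT NULL,
    groupKey   TEXT NOT NULL,
    PRIMARY KEY (paymentId, groupType)
);

Recreating the payload is then a join from batch to payment plus a GROUP BY on groupKey; if the grouping info is only ever read back whole, it could instead live in a single jsonb column on batch.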

Ludwig preprocessing

I'm running a model with Ludwig.
Dataset is Adult Census:
Features
workclass has almost 70% instances of Private; the unknown (?) values can be imputed with this value.
native_country: 90% of the instances are United States, which can be used to impute the unknown (?) values. The same cannot be said about the occupation column, as its values are more distributed.
capital_gain has 72% zero values for <=50K instances and 19% zero values for >50K instances.
capital_loss has 73% zero values for <=50K instances and 21% zero values for >50K instances.
When I define the model what is the best way to do it for the above cases?
{
    "name": "workclass",
    "type": "category",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
{
    "name": "native_country",
    "type": "category",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
{
    "name": "capital_gain",
    "type": "numerical",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
{
    "name": "capital_loss",
    "type": "numerical",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
Questions:
1) For category features, how do I define: if you find ?, replace it with X?
2) For numerical features, how do I define: if you find 0, replace it with the mean?
Ludwig currently considers only values that are actually missing in the CSV file (e.g. two consecutive commas) for its replacement strategies. In your case I would suggest doing some minimal preprocessing of your dataset, replacing the zeros and ?s with missing values, depending on the type of feature. You can do it easily in pandas with something like:
df.loc[df.my_column == <value>, 'my_column'] = <new_value>
The alternative is to perform the replacement in your own code (for instance, replacing 0s with averages) so that Ludwig doesn't have to do it and you keep full control of the replacement strategy.
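
For instance, a minimal pandas sketch for this dataset (column names follow the question; the file path is hypothetical, and the imputation choices come from the statistics above rather than from Ludwig defaults):

import numpy as np
import pandas as pd

df = pd.read_csv("adult.csv")  # hypothetical path to the Adult Census CSV

# Category features: replace '?' with the dominant category.
df.loc[df.workclass == "?", "workclass"] = "Private"
df.loc[df.native_country == "?", "native_country"] = "United-States"  # spelling as in the raw dataset

# Numerical features: treat 0 as missing, then fill with the mean
# of the remaining non-zero values.
for col in ["capital_gain", "capital_loss"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].mean())

df.to_csv("adult_preprocessed.csv", index=False)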

Building a database of average speed from two cameras using cloudant entries

I'm very new to IBM Cloud and, specifically, Cloudant DB. The database holds records of when cars pass one of two speed cameras (camera_id 20 or 21). The documents have the following format:
{
    "_id": "006994989f0914a7fb1ca44fae00fe75",
    "_rev": "1-e9b9afcb45f6ff703825d4be6d331f73",
    "payload": {
        "license_plate": "GNX834",
        "camera_id": 20,
        "date_time_string": "2019-05-08T15:20:04.134Z",
        "date_time_UTC_milliseconds": 1557328804134
    },
    "qos": 2,
    "retain": false
}
I wish to create a search index function to build another database full of objects containing the average speed of cars as they travel between the two cameras (they're 3 miles apart). I know I need to sort the records by number plate, but I'm struggling to understand how to do this in Cloudant.
Any help on this subject would be great!
If you want to extract the documents where the license_plate is equal to a value that you know, you need to create an index on that field:
POST /db/_index HTTP/1.1
Content-Type: application/json
{
    "index": {
        "fields": ["payload.license_plate"]
    },
    "name": "plate-index",
    "type": "json"
}
Then you can query by license plate:
POST /db/_find HTTP/1.1
Content-Type: application/json
{
    "selector": {
        "payload.license_plate": "NH67992"
    }
}
which would return you the matching documents.
Alternatively, if you want to extract the data en masse, ordered by the license plate field, you could create a MapReduce View with the license plate as the key (map function below):
function(doc) {
    emit(doc.payload.license_plate, doc.payload.date_time_string);
}
The index that this map function generates is ordered by license plate and can be used to extract data in license plate order.
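
Assuming the map function is saved in a design document named speed with a view named by_plate (both names hypothetical), you could then pull out all sightings of a single plate in the same HTTP style as above:

GET /db/_design/speed/_view/by_plate?key="GNX834" HTTP/1.1

Each row returned carries the timestamp emitted as the value, so the two sightings of a plate can be subtracted and divided into the 3-mile distance to get that car's average speed.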

Optimizing seemingly simple couchbase query for "items whose children satisfy"

I'm developing a system to store our translations using Couchbase.
I have about 15,000 entries in my bucket that look like this:
{
    "classifications": [
        {
            "documentPath": "Test Vendor/Test Project/Ordered",
            "position": 1
        }
    ],
    "id": "message-Test Vendor/Test Project:first",
    "key": "first",
    "projectId": "project-Test Vendor/Test Project",
    "translations": {
        "en-US": [
            {
                "default": {
                    "owner": "414d6352-c26b-493e-835e-3f0cf37f1f3c",
                    "text": "first"
                }
            }
        ]
    },
    "type": "message",
    "vendorId": "vendor-Test Vendor"
}
And I want, as an example, to find all messages that are classified with a "documentPath" of "Test Vendor/Test Project/Ordered".
I use this query:
SELECT message.*
FROM couchlate message UNNEST message.classifications classification
WHERE classification.documentPath = "Test Vendor/Test Project/Ordered"
AND message.type="message"
ORDER BY classification.position
But I'm very surprised that the query takes 2 seconds to execute!
Looking at the query execution plan, it seems that Couchbase is looping over all the messages and then filtering on "documentPath".
I'd like it to first filter on "documentPath" (because there are in reality only 2 documentPaths matching my query) and then find the messages.
I've tried to create an index on "classifications" but it did not change anything.
Is there something wrong with my index setup, or should I structure my data differently to get fast results?
I'm using Couchbase 4.5 beta, if that matters.
Your query filters on the documentPath field, so an index on classifications doesn't actually help. You need to create an array index on the documentPath field itself using the new array index syntax on Couchbase 4.5:
CREATE INDEX ix_documentPath ON myBucket ( DISTINCT ARRAY c.documentPath FOR c IN classifications END ) ;
Then you can query on documentPath with a query like this:
SELECT * FROM myBucket WHERE ANY c IN classifications SATISFIES c.documentPath = "your path here" END ;
Add EXPLAIN to the start of the query to see the execution plan and confirm that it is indeed using the index ix_documentPath.
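For example:

EXPLAIN SELECT * FROM myBucket WHERE ANY c IN classifications SATISFIES c.documentPath = "your path here" END ;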
More details and examples here: http://developer.couchbase.com/documentation/server/4.5-dp/indexing-arrays.html
