How to define Input tensor that has thousands of possible categorical values? - tensorflow.js

I understand that if I have a categorical input that has several possible values (e.g. country or color), I can use a one-hot tensor (represented as multiple 0s and only one 1).
I also understand that if the variable has many possible values (e.g. thousands of possible zip codes or school ids), a one-hot tensor might not be efficient and we should use other representations (hash based?). But I have not found documentation or examples on how to do this with the JavaScript version of TensorFlow.
Any hints?
UPDATE
@edkeveked gave me the right suggestion of using embeddings, but now I need some help on how to actually use embeddings with TensorFlow.js.
Let me try with a concrete example:
Let's assume that I have records for people, for which I have age (integer), state (an integer from 0 to 49) and risk (0 or 1).
const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1},
  ...
]
When I wanted to create a classifier model with TensorFlow.js, I would encode the state as a one-hot tensor, encode the risk label as a one-hot tensor (risk: [0, 1], no risk: [1, 0]), and build a model with dense layers such as the following:
const inputTensorAge = tf.tensor(data.map(d => d.age), [data.length, 1])
const inputTensorState = tf.oneHot(data.map(d => d.state), 50)
const labelTensor = tf.oneHot(data.map(d => d.risk), 2)
// 50 one-hot state columns + 1 age column
const inputDims = 51;
const model = tf.sequential({
  layers: [
    tf.layers.dense({units: 8, inputDim: inputDims, activation: 'relu'}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});
model.compile({loss: 'categoricalCrossentropy', optimizer: 'adam', metrics: ['accuracy']});
model.fit(tf.concat([inputTensorState, inputTensorAge], 1), labelTensor, {epochs: 10})
(BTW ... I am new to TensorFlow, so there may be much better approaches ... but this has worked for me.)
Now ... my challenge. If I want a similar model but now I have a postcode instead of state (let's say that there are 10000 possible values for the postcode):
const data = [
  {age: 20, postcode: 0, risk: 0},
  {age: 30, postcode: 11, risk: 0},
  {age: 60, postcode: 11, risk: 1},
  {age: 75, postcode: 9876, risk: 1},
  ...
]
If I want to use embeddings, to represent the postcode, I understand that I should use an embedding layer such as:
tf.layers.embedding({inputDim:10000, outputDim: 20})
So, if I were using only the postcode as an input and omitted the age, the model would be:
const model = tf.sequential({
  layers: [
    tf.layers.embedding({inputDim: 10000, outputDim: 20}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});
If I create the input tensor as
const inputTensorPostcode = tf.tensor(data.map(d => d.postcode));
And try
model.fit(inputTensorPostcode, labelTensor, {epochs:10})
It will not work ... so I am obviously doing something wrong.
Any hints on how should I create my model and do the model.fit with embeddings?
Also ... if I want to combine multiple inputs (let's say postcode and age), how should I do it?

For categorical data, one might use a one-hot encoding to solve the problem. The issue with one-hot encoding is that it often leads to sparse data with a lot of zeros.
The other way to deal with categorical data is to reduce the dimensionality of the input data. This technique is known as embedding. For creating models involving categorical data, one might use the embedding layer offered in the JS API.
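For reference, a minimal sketch of how that layer could be wired up for the postcode example above (an illustration under assumptions, not an official recipe: it reuses the data and labelTensor from the question, and all dimensions are taken from there). The embedding layer consumes integer indices of shape [batch, inputLength] and emits one 20-dimensional vector per index, so its output has to be flattened before the dense classifier:
// Sketch: classify risk from the postcode alone via an embedding.
const model = tf.sequential({
  layers: [
    // 10000 possible postcodes, each mapped to a learned 20-dim vector
    tf.layers.embedding({inputDim: 10000, outputDim: 20, inputLength: 1}),
    tf.layers.flatten(), // [batch, 1, 20] -> [batch, 20]
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy'});

// The input must be 2D: one integer index per example.
const inputTensorPostcode = tf.tensor2d(
  data.map(d => [d.postcode]), [data.length, 1], 'int32');
model.fit(inputTensorPostcode, labelTensor, {epochs: 10});
This is likely why the attempt in the question failed: a 1D tensor was passed where the embedding layer expects a 2D tensor of integer indices, and the 3D embedding output was never flattened.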
Edit:
The data here is not really categorical, though it is possible to encode it as such; there is no reason for doing so. A classic example of categorical data, from recommender systems, is data recording which movies a user has or has not watched. The data will look like the following:
       | movie 1 | movie 2 | movie 3 | --- | movie n |
user 1 |    0    |    1    |    1    | --- |    0    |
user 2 |    0    |    0    |    1    | --- |    0    |
user 3 |    0    |    1    |    0    | --- |    0    |
  ...  |   ...   |   ...   |   ...   | --- |   ...   |
The input dimension here is the number of movies, n. Such data can be very sparse, with a lot of zeros: the database may contain hundreds of thousands of movies, while the average user has hardly watched more than a thousand of them. In that case there will be about a thousand fields set to 1 and all the rest set to 0. Such data needs to be compressed using embeddings in order to lower the dimension from n to something smaller.
That is not the case here. The input data has only two features, age and postcode, so the input dimension is 2. The label is the risk property, which takes one of two values, so the one-hot label also has a size of 2. The range of values of the postcode does not affect our classification.
const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1}
]
const model = tf.sequential()
model.add(tf.layers.dense({inputShape: [2], units: 10, activation: 'relu'}))
model.add(tf.layers.dense({activation: 'softmax', units: 2}))
const x = tf.tensor2d(data.map(e => [e.age, e.state]), [data.length, 2])
const y = tf.oneHot(tf.tensor1d(data.map(e => e.risk), 'int32'), 2)
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy'})
model.fit(x, y, {epochs: 10}).then(() => {
  // A prediction will look like [p, 1-p] with 0 <= p <= 1.
  // Predictions [p, 1-p] such that p > 0.5 are in the first category;
  // predictions [p, 1-p] such that 1-p > 0.5 are in the second category.
  // The prediction for age 30 and state 35 will fall in the same category
  // as the one for age 0 and state 20 (both will have p > 0.5 or both p < 0.5);
  // the prediction for age 75 and state 17 will be different.
  model.predict(tf.tensor2d([[30, 35], [0, 20], [75, 17]])).print()
})
<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.13.0"></script>
  </head>
  <body>
  </body>
</html>
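The follow-up question about combining the embedded postcode with the numeric age is not covered above. A hedged sketch of one way to do it with the functional tf.model API (all names and dimensions are illustrative; inputTensorPostcode and inputTensorAge as built earlier):
// Sketch: merge an embedded categorical input with a numeric input.
const postcodeInput = tf.input({shape: [1], dtype: 'int32'});
const ageInput = tf.input({shape: [1]});

// Embed the postcode, then flatten [batch, 1, 20] to [batch, 20]
const embedded = tf.layers.embedding({inputDim: 10000, outputDim: 20})
  .apply(postcodeInput);
const flattened = tf.layers.flatten().apply(embedded);

// Concatenate the 20-dim postcode embedding with the raw age
const merged = tf.layers.concatenate().apply([flattened, ageInput]);
const hidden = tf.layers.dense({units: 8, activation: 'relu'}).apply(merged);
const output = tf.layers.dense({units: 2, activation: 'softmax'}).apply(hidden);

const model = tf.model({inputs: [postcodeInput, ageInput], outputs: output});
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy'});

// fit takes one tensor per declared input
model.fit([inputTensorPostcode, inputTensorAge], labelTensor, {epochs: 10});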

Related

Last WHEN expression overrides result of first WHEN expression in CASE WHEN in SQL SELECT statement

This will look familiar to most in this community, as it is all about the SQL CASE expression. But I am in triage mode now rather than doing the actual implementation. I would appreciate it if there is an optimal way to work around this.
SCENARIO:
I have one SELECT statement wherein I retrieve multiple columns from a table. The table has columns mostly of the numeric(10, 3) datatype. I chose this datatype because I thought that if I need to display an int value, the conversion would be easier than vice versa. Here is the table structure.
Name: FleetRange
Columns:
CaptionID INT NOT NULL
Caption NVARCHAR(50)
FleetRange_1 numeric(10,3)
FleetRange_2_4 numeric(10,3)
.......
Total numeric(10,3)
Criteria:
Current result:
My SQL query:
SELECT
CaptionId, Caption,
CASE
WHEN CaptionId IN (1, 2, 4, 5, 6, 7, 8, 9, 14, 15, 17, 18, 20, 21, 22, 23)
THEN CONVERT(INT, FleetRange_1)
WHEN CaptionId IN (11, 12, 13)
THEN CONVERT(NUMERIC(10, 2), FleetRange_1)
WHEN CaptionId IN (3, 10, 16, 19)
THEN CONVERT(NUMERIC(10, 3), FleetRange_1)
END AS 'FleetRange_1'
FROM
FleetRange
NOTE:
What is currently happening is that the last WHEN is overriding the previous evaluations, and hence every row displays values with 3 decimal places even if there is an integer value.
I have applied the same CASE structure for the other numeric(10,3) columns, hence I have shortened the query.
Instead of the CASE written within the above query, I tried the syntax below too, but no difference.
WHEN
CaptionId = 11 OR
CaptionId = 12 OR
CaptionId = 13
THEN...
My expectation (desired actual result): my objective is that a particular row's value should be converted to int or to numeric with the given precision when the WHEN expression for its specific CaptionID matches.
Something like below:
CaptionID | Caption     | FleetRange_1 | FleetRange_2_4 | .....
1         | SafetyFirst | 0            | 1              |
11        | DriveSafe   | 2.15         | null           |
3         | Caution     | 1.025        | 2.174          |
Every expression in a query has a single data type. The data type of a CASE expression is:
"the highest precedence type from the set of types in result_expressions and the optional else_result_expression."
CASE (TSQL)
And see Data Type Precedence.
If you want different display formatting for different rows in a single resultset, you'll have to convert them to strings and Format them yourself.
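For example, a sketch of that approach against the FleetRange table from the question (the string lengths and formatting choices here are illustrative assumptions). Every branch now returns the same type, a string, so data type precedence no longer promotes everything to numeric(10, 3):
-- Every branch returns a string, so no type promotion occurs.
SELECT
    CaptionId, Caption,
    CASE
        WHEN CaptionId IN (1, 2, 4, 5, 6, 7, 8, 9, 14, 15, 17, 18, 20, 21, 22, 23)
            THEN CONVERT(VARCHAR(20), CONVERT(INT, FleetRange_1))
        WHEN CaptionId IN (11, 12, 13)
            THEN CONVERT(VARCHAR(20), CONVERT(NUMERIC(10, 2), FleetRange_1))
        WHEN CaptionId IN (3, 10, 16, 19)
            THEN CONVERT(VARCHAR(20), CONVERT(NUMERIC(10, 3), FleetRange_1))
    END AS FleetRange_1
FROM
    FleetRange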

Adding Zero Arrays to Pandas Dataframe

I'm looking for code to add [0,0] arrays to the 'Mar' column if the shape is smaller than (3,)
Sample Dataframe visualized:
df:
account   | Jan | Feb | Mar
Jones LLC | 150 | 200 | [[.332, .326], [.058, .138]]
Alpha Co  | 200 | 210 | [[.234, .246], [.234, .395], [.013, .592]]
Blue Inc  | 50  | 90  | [[.084, .23], [.745, .923]]
Again: I'm looking for code to add [0,0] arrays to the 'Mar' column so that any row with shape smaller than (3,x) is modified, resulting in the following df
df:
account   | Jan | Feb | Mar
Jones LLC | 150 | 200 | [[.332, .326], [.058, .138], [0, 0]]
Alpha Co  | 200 | 210 | [[.234, .246], [.234, .395], [.013, .592]]
Blue Inc  | 50  | 90  | [[.084, .23], [.745, .923], [0, 0]]
Code to create the sample dataframe:
import pandas as pd

Sample = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': [[.332, .326], [.058, .138]]},
          {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': [[.234, .246], [.234, .395], [.013, .592]]},
          {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': [[.084, .23], [.745, .923]]}]
df = pd.DataFrame(Sample)
You could add extra values,
df['Mar'] = df['Mar']+[[0, 0], [0, 0], [0, 0]]
then trim it down, using a method from here.
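Alternatively, a sketch that pads only the short rows directly, so no trimming step is needed (assuming the df built above and a target of three pairs per row):
# Pad each 'Mar' list with [0, 0] pairs until it holds three of them;
# rows that already have three pairs are left untouched.
df['Mar'] = df['Mar'].apply(lambda pairs: pairs + [[0, 0]] * (3 - len(pairs)))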

Solr Aggregation / Facet on DynamicField

Table schema:
CREATE TABLE attributes_v1 (
profile_id bigint,
attributes map<text, int>,
solr_query text,
PRIMARY KEY ((profile_id))
)
Data inside the table looks like this:
profile_id | attributes
------------+---------------------------------------------
2 | {'a101': 1, 'a11': 1, 'a12322': 1, 'a51': 3}
3 | {'a1': 1, 'a10': 1, 'a11': 3, 'a51': 1}
1 | {'a1': 1, 'a10': 1, 'a2322': 1, 'a5': 3}
I can't figure out how to accomplish the following (using Solr, either via CQL or Java).
Desired aggregation / facet :
a1 count: 2
a1 sum: 2
a101 count: 1
a101 sum: 1
a11 count: 2
a11 sum: 4
a12322 count: 1
a12322 sum: 1
a2322 count: 1
a2322 sum: 1
a10 count: 2
a10 sum: 2
a51 count: 2
a51 sum: 4
a5 count: 1
a5 sum: 3
Any ideas?
Thank you!
I believe you should have prefixed your map keys with the name of the field itself. This is a requirement/limitation as mentioned in the following links:
link1
link2
link3
So, for example, your 'a1' element should have been called 'attributes1' and your 'a12322' should have been called 'attributes12322'.
Then in your Solr schema, define a dynamicField as follows:
<dynamicField name="attributes*" ... />
You can then query the map elements by referring to them directly. e.g.
q=attributes12322:1
Now, for your question about aggregating: since you also need the sum in addition to the count, I think you need to use stats instead of facet.
stats=true&stats.field=attributes12322
You can specify multiple stats.field parameters, like:
stats=true&stats.field=attributes12322&stats.field=attributes1&stats.field=attributes51
You can then retrieve the sum from the 'sum' attribute and the count from the 'count' attribute of each stats_fields response item.
EDIT:
I didn't notice immediately that you were specifically asking about querying Solr via CQL or Java. I'm not sure whether 'stats' is supported by CQL Solr queries.

How to iterate through table in Lua?

So, I have a table something along these lines:
arr =
{
    apples = { 'a', "red", 5 },
    oranges = { 'o', "orange", 12 },
    pears = { 'p', "green", 7 }
}
It doesn't seem like it's possible to access them based on their index, and the values themselves are tables, so I just made the first value of the nested table the index of it, so it now looks like this:
arr =
{
    apples = { 0, 'a', "red", 5 },
    oranges = { 1, 'o', "orange", 12 },
    pears = { 2, 'p', "green", 7 }
}
So, now any time I use one of these tables, I know what the index is, but still can't get to the table using the index, so I started to write a function that loops through them all, and check the indexes until it finds the right one. Then I realized... how can I loop through them if I can't already refer to them by their index? So, now I'm stuck. I really want to be able to type arr.apples vs arr[1] most of the time, but of course it's necessary to do both at times.
To iterate over all the key-value pairs in a table you can use pairs:
for k, v in pairs(arr) do
    print(k, v[1], v[2], v[3])
end
outputs:
pears 2 p green
apples 0 a red
oranges 1 o orange
Edit: Note that Lua doesn't guarantee any iteration order for the associative part of the table. If you want to access the items in a specific order, retrieve the keys from arr, sort them, and then access arr through the sorted keys:
local ordered_keys = {}

for k in pairs(arr) do
    table.insert(ordered_keys, k)
end

table.sort(ordered_keys)
for i = 1, #ordered_keys do
    local k, v = ordered_keys[i], arr[ordered_keys[i]]
    print(k, v[1], v[2], v[3])
end
outputs:
apples a red 5
oranges o orange 12
pears p green 7
If you want to refer to a nested table by multiple keys you can just assign them to separate keys. The tables are not duplicated, and still reference the same values.
arr = {}
apples = {'a', "red", 5 }
arr.apples = apples
arr[1] = apples
This code block lets you iterate through all the key-value pairs in a table (http://lua-users.org/wiki/TablesTutorial):
for k, v in pairs(t) do
    print(k, v)
end
For those wondering why ipairs doesn't always print all the values of a table, here's why (I would comment this, but I don't have enough good boy points).
The function ipairs only works on tables which have an element with the key 1. If there is an element with the key 1, ipairs will try to go as far as it can in sequential order, 1 -> 2 -> 3 -> 4 etc., until it can't find an element with the next key in the sequence. The order in which the elements are defined does not matter.
Tables that do not meet those requirements will not work with ipairs; use pairs instead.
Examples:
ipairsCompatable = {"AAA", "BBB", "CCC"}
ipairsCompatable2 = {[1] = "DDD", [2] = "EEE", [3] = "FFF"}
ipairsCompatable3 = {[3] = "work", [2] = "does", [1] = "this"}
notIpairsCompatable = {[2] = "this", [3] = "does", [4] = "not"}
notIpairsCompatable2 = {[2] = "this", [5] = "doesn't", [24] = "either"}
ipairs will go as far as it can with its iteration but won't iterate over any other element in the table.
kindofIpairsCompatable = {[2] = 2, ["cool"] = "bro", [1] = 1, [3] = 3, [5] = 5 }
When printing these tables, these are the outputs. I've also included pairs outputs for comparison.
ipairs + ipairsCompatable
1 AAA
2 BBB
3 CCC
ipairs + ipairsCompatable2
1 DDD
2 EEE
3 FFF
ipairs + ipairsCompatable3
1 this
2 does
3 work
ipairs + notIpairsCompatable
(no output)
pairs + notIpairsCompatable
2 this
3 does
4 not
ipairs + notIpairsCompatable2
(no output)
pairs + notIpairsCompatable2
2 this
5 doesn't
24 either
ipairs + kindofIpairsCompatable
1 1
2 2
3 3
pairs + kindofIpairsCompatable
1 1
2 2
3 3
5 5
cool bro
All the answers here suggest using ipairs, but beware: it does not work all the time.
t = {[2] = 44, [4] = 77, [6] = 88}

-- This for loop prints the table
for key, value in next, t, nil do
    print(key, value)
end

-- This one does not print the table
for key, value in ipairs(t) do
    print(key, value)
end
Hi guys, I am new to Lua, but these answers only helped me about halfway, so I wrote my own.
for i in pairs(actions) do
    if actions[i][3] ~= nil then
        -- do something
    end
end
i - the key of the entry in the table, similar to an index in C#
actions - just the name of the table
actions[i][3] - for each entry in the table, checks whether its third value is not nil

How to store testing options into database?

Let's say I have some parameters such as a, b, c, and I need to store the test results obtained by changing them.
The thing is that the number of parameters will keep increasing, so I can't keep them as static columns.
For example :
Test 1 : a = 10, b = 20, c = 1
Test 2 : a = 11, b = 21, c = 11
Test 3 : a = 11, b = 20, c = 1
...
Test 1001 : d = 30
I thought about having a table for parameters as follows.
id  | name | value
1   | a    | 10
2   | b    | 20
3   | c    | 1
4   | a    | 11
5   | b    | 21
6   | c    | 11
... |      |
100 | d    | 30
And a table recording which options each test used (the order is not important):
id | usage
1  | 1-2-3
2  | 4-5-6
3  | 4-5-3
The problem with this approach is that the number of options used for each test is not fixed. It can be 1, but it can also be 1-2-3-4-5-6-7.
Questions:
Is there any better method for this problem? Something not using two tables?
If I have to use this method, how can I deal with the variable number of elements? Use a string or equivalent?
Take a look at this discussion.
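In short, the usual normalization for a variable-length list like 1-2-3 is a junction table rather than a delimited string. A sketch of that shape (table and column names are illustrative, not taken from the linked discussion):
-- One row per (test, parameter) pair; a test can use any number of parameters.
CREATE TABLE test (
    id INT PRIMARY KEY
);

CREATE TABLE parameter (
    id    INT PRIMARY KEY,
    name  VARCHAR(50),
    value INT
);

CREATE TABLE test_parameter (
    test_id      INT REFERENCES test(id),
    parameter_id INT REFERENCES parameter(id),
    PRIMARY KEY (test_id, parameter_id)
);

-- Test 1 used parameters 1, 2 and 3
INSERT INTO test_parameter (test_id, parameter_id) VALUES (1, 1), (1, 2), (1, 3);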
