Elasticsearch query strategy for nested array elements

I am trying to find results by color. In the database, it is recorded in rgb format: an array of three numbers representing red, green, and blue values respectively.
Here is how it is stored in the db and elasticsearch record (storing 4 rgb colors in an array):
"color_data":
[
[253, 253, 253],
[159, 159, 159],
[102, 102, 102],
[21, 21, 21]
]
Is there a query strategy that will allow me to find similar colors? i.e. exact match or within a close range of rgb values?
Here is a query I am trying, but this way of addressing the array values doesn't work:
curl -X GET 'http://localhost:9200/_search' -d '{
  "from": 0,
  "size": 50,
  "range": {
    "color_data.0.0": {
      "gte": "#{b_lo}",
      "lte": "#{b_hi}"
    },
    "color_data.0.1": {
      "gte": "#{g_lo}",
      "lte": "#{g_hi}"
    }
  }
}'
(r_lo, r_hi, et al. are set to ±10 of the RGB values recorded in the color_data variable)

First, you should move the channel data to separate fields (or at least to an object field).
If you need a simple matching algorithm (± deviation without scoring), you can run plain filter > range queries, passing your fuzziness threshold in the query.
If you need scoring (how similar the docs are), then you need scripted queries. Take a look at this article.
Btw, I strongly recommend working in HSL space for operations like this; you'll get much better results. Take a look at this example.
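For illustration, a minimal sketch of the range-filter approach with the Python Elasticsearch client, assuming the colors are re-indexed as nested documents with hypothetical per-channel fields (colors.r, colors.g, colors.b); the field names, index name, and ±10 tolerance are assumptions, not part of the original mapping:

# Minimal sketch; the nested mapping for "colors" is assumed and not shown.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

target = {"r": 253, "g": 253, "b": 253}
tol = 10  # +/- fuzziness threshold per channel

query = {
    "query": {
        "nested": {
            "path": "colors",
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"colors.r": {"gte": target["r"] - tol, "lte": target["r"] + tol}}},
                        {"range": {"colors.g": {"gte": target["g"] - tol, "lte": target["g"] + tol}}},
                        {"range": {"colors.b": {"gte": target["b"] - tol, "lte": target["b"] + tol}}},
                    ]
                }
            }
        }
    }
}

# body= works on the 7.x client; newer clients also accept these keys as keyword arguments.
hits = es.search(index="colors", body=query)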

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
Lookup Activity fetching an array of IDs to process
ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 results in an array containing IDs:
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2 When the lookup returns a lot of IDs, the individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the input array into a new array with comma-separated fields? This would reduce the number of Activities and the time to run.
Expecting something like this:
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Dec 22
Using an Until Activity and keeping track of positions, I managed to do it in plain ADF. It would still be nice if this could be done with some simple array manipulation in a code snippet.
The ideal approach might be to do the manipulation with a Data Flow.
My sample input:
First, in a Data Flow, I added a Surrogate Key transformation after the source; say the new key field is 'SrcKey'.
Data preview of Surrogate key 1
Add an Aggregate transformation where you group by mod(SrcKey/3). This will put rows with the same remainder into the same bucket.
In the same Aggregate, add a collect column to gather the ids into an array, with the expression trim(toString(collect(id)), '[]').
Data preview of Aggregate 1
Store the output in a single file in blob storage.
OUTPUT
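Outside ADF, the "simple array manipulation in a code snippet" mentioned in the edit could look roughly like this in Python; the batch size of 3 mirrors the mod(SrcKey/3) grouping above, and the last three IDs are made-up placeholders:

def batch_ids(rows, batch_size=3):
    """Turn [{"id": "..."}, ...] into [{"ids": "a,b,c"}, ...]."""
    ids = [row["id"] for row in rows]
    return [{"ids": ",".join(ids[i:i + batch_size])}
            for i in range(0, len(ids), batch_size)]

# The first three IDs come from the lookup example above; the rest are placeholders.
lookup_value = [{"id": "799128160"}, {"id": "817379102"}, {"id": "859061172"},
                {"id": "100000001"}, {"id": "100000002"}, {"id": "100000003"}]

batched = batch_ids(lookup_value)
result = {"count": len(batched), "value": batched}
# result["value"] == [{"ids": "799128160,817379102,859061172"},
#                     {"ids": "100000001,100000002,100000003"}]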

Database schema design for stock market financial data

I'm figuring out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Filtering symbols by their current values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. query all dividend.yield between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol, and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example of the data for one symbol.
{
"symbol": "AAPL",
"info": {
"company_name": "Apple Inc.",
"description": "some long text",
"website": "http://apple.com",
"logo_url": "http://apple.com"
},
"quotes": {
"open": 111,
"close": 321,
"high": 111,
"low": 100
},
"dividends": {
"amountPaid": 0.5,
"exDate": "2020-01-01",
"yieldOnCost": 10,
"growth": { value: 111, pct_chg: 10 } /* some fields could be more attributes than just k/v */
"yield": 123
},
"fundamentals": {
"num_employees": 123213213,
"shares": 123123123123,
....
}
}
What approach would you take for storing this data?
Based upon the info (the sample data and the use cases) you have posted, I think storing the historical data as a separate collection sounds fine.
Some of the important factors that affect the database design (or data model) are the amount of data and the kind of queries, particularly the most important queries you plan to run on the data. Assuming that the JSON data you posted (for a stock symbol) can serve the first two queries, you can start with the idea of storing the historical data as a separate collection. A historical data document for a symbol can cover a year or a range of years, depending upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store at most 16 MB of data.
For reference, see MongoDB Data Model Design.
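As a rough illustration of that split, here is a minimal sketch using pymongo; the collection names (symbols, quotes_history), the index, and the query shape are assumptions layered on top of the sample document, not a prescribed design:

from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["market"]

# Current snapshot per symbol (one document per symbol), covering use cases 1 and 2.
db.symbols.update_one(
    {"symbol": "AAPL"},
    {"$set": {
        "info": {"company_name": "Apple Inc."},
        "quotes": {"open": 111, "close": 321, "high": 111, "low": 100},
        "dividends": {"amountPaid": 0.5, "yield": 123},
    }},
    upsert=True,
)

# Historical values in a separate collection, one document per symbol per day,
# with a real date type rather than a string.
db.quotes_history.insert_one({
    "symbol": "AAPL",
    "date": datetime(2020, 1, 1),
    "quotes": {"open": 111, "close": 321, "high": 111, "low": 100},
    "dividends": {"yield": 123},
})
db.quotes_history.create_index([("symbol", ASCENDING), ("date", ASCENDING)])

# Use case 3: chart dividend yield for one symbol between 2010 and 2020.
cursor = db.quotes_history.find(
    {"symbol": "AAPL", "date": {"$gte": datetime(2010, 1, 1), "$lt": datetime(2021, 1, 1)}},
    {"date": 1, "dividends.yield": 1, "_id": 0},
).sort("date", ASCENDING)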
Stock market data is huge by itself. Keep it all in one place per company, otherwise you'll end up with a mess sooner or later.
In your example above: logo URLs should point to '.png' files etc., not '.html' pages.
Your "quotes" section will get way too big; keep quotes at the top level, that's the nice thing with Mongo. Each quotes document should have a date ;) associated with it, and use a date type Mongo understands, not a string.

Ludwig preprocessing

I'm running a model with Ludwig.
Dataset is Adult Census:
Features
workclass has almost 70% of instances as Private, so the unknown (?) values can be imputed with this value.
For native_country, 90% of the instances are United States, which can be used to impute the unknown (?) values. The same cannot be said about the occupation column, as its values are more evenly distributed.
capital_gain has 72% instances with zero values for less than 50K and 19% instances with zero values for >50K.
capital_loss has 73% instances with zero values for less than 50K and 21% instances with zero values for >50K.
When I define the model what is the best way to do it for the above cases?
{
  "name": "workclass",
  "type": "category",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
{
  "name": "native_country",
  "type": "category",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
{
  "name": "capital_gain",
  "type": "numerical",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
{
  "name": "capital_loss",
  "type": "numerical",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
Questions:
1) For category features how to define: If you find ?, replace it with X.
2) For numerical features how to define: If you find 0, replace it with mean?
Ludwig currently only considers values that are missing in the CSV file (for instance two consecutive commas) for its replacement strategies. In your case I would suggest doing some minimal preprocessing of your dataset, replacing the zeros and ? with missing values depending on the type of feature. You can easily do it in pandas with something like:
df.loc[df.my_column == <value>, "my_column"] = <new_value>
The alternative is to perform the replacement in your own code (for instance replacing 0s with averages) so that Ludwig doesn't have to do it and you have full control of the replacement strategy.
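For concreteness, a minimal sketch of that preprocessing for the Adult Census columns mentioned above; the file names, the fill values, and the exact category strings are assumptions:

import numpy as np
import pandas as pd

df = pd.read_csv("adult.csv")  # hypothetical path to the Adult Census data

# Category features: "If you find ?, replace it with X."
# Exact strings depend on how the dataset encodes them (some copies have leading spaces).
df.loc[df["workclass"].str.strip() == "?", "workclass"] = "Private"
df.loc[df["native_country"].str.strip() == "?", "native_country"] = "United-States"

# Numerical features: "If you find 0, replace it with the mean."
for col in ["capital_gain", "capital_loss"]:
    df[col] = df[col].replace(0, np.nan)      # mark zeros as missing
    df[col] = df[col].fillna(df[col].mean())  # or leave NaN for Ludwig's fill_with_mean

df.to_csv("adult_clean.csv", index=False)     # feed this file to Ludwig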

search sub-arrays with index and value to return document in solr

I have a data set in the following format, and I want it to be searchable through Solr. The following example shows how each of my documents would look:
{
'key': <unique key>,
'val_arr': [
['laptop', 'macbook pro', '16gb', 'i9', 'spacegrey'],
['cellphone', 'iPhone', '4gb', 't2', 'rose gold'],
['laptop', 'macbook air', '8gb', 'i5', 'black'],
['router', 'huawei', '10x10', 'white'],
['laptop', 'macbook', '8gb', 'i5', 'silve']
]
}
I would be getting search requests with an element value and its index (two elements per request).
E.g. index1=0, val1=laptop, index2=2, val2=16gb. That matches one of the sub-arrays in the document above, so the whole document is pulled into the search results.
I tried using copyField and a custom query parser, but that searches across sub-arrays, i.e. it may fetch a 4gb phone when the request was for a 4gb laptop. Any help would be appreciated.
If you're only performing exact matches, index the values with the index as part of the value, and use a string / non-processed field type:
val_arr: ["0_laptop", "1_macbook pro", "2_16gb", ...]
That can be queried with the exact index, value combination - val_arr:0_laptop AND val_arr:2_16gb.
If you need to perform regular matching (and processing / tokenisation) of the field, you can use dynamic field names instead:
"val_arr_0": "laptop",
"val_arr_1": "macbook pro",
"val_arr_2": "16gb",
..
And then query the field - val_arr_0:laptop AND val_arr_2:16gb.
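For illustration, a minimal sketch that builds both representations in Python and indexes them with pysolr; the core URL, the use of 'key' as the unique field, the val_arr_* dynamic fields, and the choice to index one Solr document per sub-array are all assumptions on top of the answer above:

import pysolr

doc = {
    "key": "doc-1",
    "val_arr": [
        ["laptop", "macbook pro", "16gb", "i9", "spacegrey"],
        ["cellphone", "iPhone", "4gb", "t2", "rose gold"],
    ],
}

# Strategy 1: position-prefixed values in a multi-valued string field.
prefixed = [f"{i}_{v}" for sub in doc["val_arr"] for i, v in enumerate(sub)]
# e.g. ["0_laptop", "1_macbook pro", "2_16gb", ...]
# Query: val_arr:0_laptop AND val_arr:2_16gb

# Strategy 2: dynamic fields per position, here one Solr document per sub-array
# so that positions from different sub-arrays never mix inside one document.
solr_docs = [
    {"key": f"{doc['key']}_{n}", **{f"val_arr_{i}": v for i, v in enumerate(sub)}}
    for n, sub in enumerate(doc["val_arr"])
]
# Query: val_arr_0:laptop AND val_arr_2:16gb

solr = pysolr.Solr("http://localhost:8983/solr/products", always_commit=True)
solr.add(solr_docs)
results = solr.search("val_arr_0:laptop AND val_arr_2:16gb")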

Is there any standard for NOT repeating json keys in arrays

I need to return a list of specific objects from an API call in JSON format.
Because all objects have the same structure, it would be good to separate the keys (which are the same for every object in the array) from the values (which differ per object) to save network bandwidth.
For example:
[
{"ID": 10, "Name": "Name1"},
{"ID": 11, "Name": "Name2"}
]
could be something like:
{
  "Schema": ["ID", "Name"],
  "Values": [[10, "Name1"], [11, "Name2"]]
}
For big results (for example 100 records per response, which is common, with objects that have more than 10 keys), this type of response could save a lot of bandwidth. In a real example with 10 records containing 42 keys each, the response size was reduced from 9,172 bytes to 3,180 bytes, i.e. 65% smaller.
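For illustration, a minimal sketch of the packing and unpacking described above in plain Python; the helper names are hypothetical and no particular serialization library is assumed:

def pack(records):
    """Collapse a list of same-shaped dicts into {"Schema": [...], "Values": [...]}."""
    schema = list(records[0].keys()) if records else []
    return {"Schema": schema, "Values": [[r[k] for k in schema] for r in records]}

def unpack(payload):
    """Rebuild the original list of dicts from the packed form."""
    return [dict(zip(payload["Schema"], row)) for row in payload["Values"]]

records = [{"ID": 10, "Name": "Name1"}, {"ID": 11, "Name": "Name2"}]
packed = pack(records)
# {"Schema": ["ID", "Name"], "Values": [[10, "Name1"], [11, "Name2"]]}
assert unpack(packed) == records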
