Sort Solr response by value in subdocument collection

I'm using DSE Solr to index a Cassandra table that contains a collection of UDTs. I want to be able to sort search results by a value inside those UDTs.
Given a simplistic example table...
create type test_score (
    test_name text,
    percentile double,
    score int,
    description text
);

create table students (
    id int,
    name text,
    test_scores set<frozen<test_score>>,
    ...
);
... and assuming I'm auto-generating the Solr schema via dsetool, I want to be able to write a Solr query that finds students who have taken a test (by a specific test_name) and sorts them by that test's score (or percentile, or whatever).
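For reference, the dsetool invocation I mean is something like this (keyspace name assumed):

dsetool create_core mykeyspace.students generateResources=true reindex=true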

Unfortunately you can't sort by UDT fields.
However, I'm not sure what the value of a UDT is here; perhaps I don't know enough about your use case. Another issue I see is that each partition key is a student id, so you can only store one test result per student. A better approach might be to use the test name as a clustering column, so you can store all the test results for a student in a single partition. Something like this:
CREATE TABLE students (
    id int,
    student_name text,
    test_name text,
    score int,
    percentile double,
    description text,
    PRIMARY KEY (id, student_name, test_name)
);
Student name is kind of redundant (it should be the same for every row in each partition), but it doesn't have to be a clustering column.
Then you can sort on any field like so:
SELECT * FROM students WHERE solr_query='{"q":"test_name:Biology", "sort":"percentile desc"}' LIMIT 10;
I've used the JSON syntax described here: https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchJSON.html

OK, so basically you want to do a JOIN between test_score and students, right?
According to the official doc (http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchQueryJoin.html), joining Solr cores is possible only if the two tables share the same partition key, which is not the case in your example...

Related

Query of records based on relationships

I have an object Contract that has a lookup to another object IndexationType. I have another object IndexationEntry that has a master-detail to IndexationType. Now I would like to get the value of the percentage field from IndexationEntry onto Contract based on the year fields: the year in IndexationEntry matches the Year in Contract. How should I achieve this?
From Contract "up" to IndexationType__c, then "down" to IndexationEntry__c?
If there's no direct link between them it's not going to be pretty. One way would be something like this
SELECT Id, Name,
(SELECT Id, ContractNumber FROM Contracts__r WHERE Year__c = '2021'),
(SELECT Id, Percent__c FROM IndexationEntries__r WHERE Year__c = '2021')
FROM IndexationType__c
You'd have to run it once for each year. Or (since you tagged it Apex) maybe you can prepare the reference data a bit: query the Indexation Types + Entries and build something like Map<Id, Map<Integer, IndexationEntry__c>> (the first key is the Indexation Type Id, the second is the year). Query them, populate the map, then loop through your contracts and use map.get() to fetch your values, as sketched below.
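A rough sketch of that approach; the lookup field names (IndexationType__c on both objects) and a text Year__c field on Contract are assumptions based on the question:

Map<Id, Map<Integer, IndexationEntry__c>> entriesByType = new Map<Id, Map<Integer, IndexationEntry__c>>();
// index every entry by its parent type, then by year
for (IndexationEntry__c e : [SELECT Percent__c, Year__c, IndexationType__c FROM IndexationEntry__c]) {
    if (!entriesByType.containsKey(e.IndexationType__c)) {
        entriesByType.put(e.IndexationType__c, new Map<Integer, IndexationEntry__c>());
    }
    entriesByType.get(e.IndexationType__c).put(Integer.valueOf(e.Year__c), e);
}
// walk the contracts and pick the entry that matches type + year
for (Contract c : [SELECT Id, Year__c, IndexationType__c FROM Contract]) {
    Map<Integer, IndexationEntry__c> byYear = entriesByType.get(c.IndexationType__c);
    IndexationEntry__c match = (byYear == null) ? null : byYear.get(Integer.valueOf(c.Year__c));
    if (match != null) {
        // match.Percent__c is the value you wanted on the contract
    }
}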

How to improve or index PostgreSQL's jsonb array field?

I usually use a jsonb field to store array data.
For example, to store a customer's barcode info, I will create a table like this:
create table customers(fcustomerid bigint, fcodes jsonb);
One customer has one row, with all barcode info stored in the fcodes field, just like below:
[
    {
        "barcode": "000000001",
        "codeid": 1,
        "product": "Coca Cola",
        "createdate": "2021-01-19",
        "lottorry": true,
        "lottdate": "2021-01-20",
        "bonus": 50
    },
    {
        "barcode": "000000002",
        "codeid": 2,
        "product": "Coca Cola",
        "createdate": "2021-01-19",
        "lottorry": false,
        "lottdate": "",
        "bonus": 0
    },
    ...
    {
        "barcode": "000500000",
        "codeid": 500000,
        "product": "Pepsi Cola",
        "createdate": "2021-01-19",
        "lottorry": false,
        "lottdate": "",
        "bonus": 0
    }
]
The jsonb array may store millions of barcode objects with the same structure. Perhaps this is not a good idea, but you know, when I have thousands of customers, I can store all the data in one table: one customer has one row, with all its data stored in one field. It looks very compact and easy to manage.
For this kind of application scenario, how can I efficiently insert, modify, and query the data?
I can use jsonb_insert to append one object (the final true argument inserts after the {-1} position, i.e. after the last element), just like:
update customers
set fcodes = jsonb_insert(fcodes, '{-1}', '{...}'::jsonb, true)
where fcustomerid = 999;
When I want to modify some object, I find it a little difficult: I need to know the index of the object first. If I use the incremental key codeid as the array index, things look easier, and I can use jsonb_set, just like below:
update customers
set fcodes = jsonb_set(fcodes, concat('{', (mycodeid - 1)::text, ',lottorry}')::text[], 'true'::jsonb)
where fcustomerid = 999;
But if I want to query the objects in the jsonb array by createdate or bonus or lottorry or product, I have to use jsonpath operators, just like:
select jsonb_path_query_array(fcodes, '$[*] ? (@.product == "Pepsi Cola")')
from customers
where fcustomerid = 999;
or like:
select jsonb_path_query_array(fcodes, '$[*] ? (@.lottdate.datetime() >= "2021-01-01".datetime() && @.lottdate.datetime() <= "2021-01-31".datetime())')
from customers
where fcustomerid = 999;
A jsonb GIN index looks useful, but it helps when filtering across different rows, while my operations mostly work inside one row's jsonb field.
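(I mean an index like this, which finds matching rows rather than objects inside one row:)
create index on customers using gin (fcodes jsonb_path_ops);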
I am very worried about efficiency: with millions of objects stored in one row's jsonb field, is this a good idea? And how can I improve efficiency in this scenario, especially for queries?
You are right to worry. With a huge JSON like that, you will never get good performance.
Your data doesn't need JSON at all. Create a table that stores a single barcode per row and has a foreign key reference to customers. Then everything will be simple and efficient.
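For example, a minimal sketch (names mirror the question's columns; the barcodes table replaces the fcodes field, and the codeid in the UPDATE is illustrative):

CREATE TABLE customers (
    fcustomerid bigint PRIMARY KEY
);

CREATE TABLE barcodes (
    fcustomerid bigint REFERENCES customers (fcustomerid),
    codeid int,
    barcode text,
    product text,
    createdate date,
    lottorry boolean,
    lottdate date,
    bonus int,
    PRIMARY KEY (fcustomerid, codeid)
);

-- the jsonb_set update becomes an ordinary UPDATE:
UPDATE barcodes SET lottorry = true
WHERE fcustomerid = 999 AND codeid = 12345;

-- and the jsonpath query becomes an ordinary SELECT on indexed columns:
SELECT * FROM barcodes
WHERE fcustomerid = 999 AND product = 'Pepsi Cola';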
Using JSON in the database is almost always the wrong choice, judging from the questions in this forum.

Retrieving more data if found data is not sufficient - Cassandra

My table looks like this,
CREATE TABLE IF NOT EXISTS names (
    firstname text,
    surname text,
    id text,
    PRIMARY KEY (firstname, surname)
)
Let's say I want to return a minimum of 10 names.
I do
select * from names where firstname = 'something' and surname = 'something';
But if this only returns 6 people, I then want it to do:
select * from names where firstname = 'something' limit 4;
But I want to avoid returning the same row twice.
And possibly do it in one query only.
Is this possible?
You may use the SELECT DISTINCT feature of CQL; it returns the unique set of partition key values. Also, please refer to the documentation below for more details:
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlSelect.html
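For example (firstname is the partition key, so this is valid):
SELECT DISTINCT firstname FROM names;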
You can rely on the paging implemented by the driver(s), for example, in Java.
In your case, you can perform the query and use .setFetchSize to set the page size you need; the driver will read approximately that number of rows (or fewer) as a first page, and if you need more, you can keep iterating over the results and the driver will fetch the next page, until either you stop or there is no more data.
But be very careful with too low a page size: if you have a lot of data in a partition, the driver will need to go to Cassandra very often, and this will impact performance.
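A minimal sketch with the DataStax Java driver 3.x (contact point and keyspace name are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagingExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) { // keyspace assumed
            Statement stmt = new SimpleStatement(
                "SELECT * FROM names WHERE firstname = 'something'");
            stmt.setFetchSize(10); // ask for ~10 rows per page
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) { // iterating past the first page fetches the next one
                System.out.println(row.getString("surname"));
            }
        }
    }
}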
P.S. You can't have 10 records for the query where firstname = 'something' and surname = 'something', because both columns comprise the full primary key, and there can be only one record for a given primary key. You may use something like where firstname = 'something' and surname >= 'something' to fetch data starting with a given surname.
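For example (a sketch of that range query):
SELECT * FROM names WHERE firstname = 'something' AND surname >= 'something' LIMIT 10;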

Cassandra: map collections with multiple datatypes

As stated and discussed by Mr. Ellis in dynamic-columns/wide-rows, a dynamic table is possible through the Map collection. However, I can see that this is only applicable to data of the same type.
Example from the link:
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    birth_year int,
    phone_numbers map<text, bigint>
);

INSERT INTO users (user_id, name, birth_year, phone_numbers)
VALUES ('jbellis', 'Jonathan Ellis', 1976, {'home': 1112223333, 'work': 2223334444});
Both the home and work phone_numbers values are integers. But we need a collection with various datatypes. Say,
Create table storage ( mobile_id int PRIMARY KEY, date timestamp, data map );
Then data contains these:
{'state': String, 'protocol': Integer, 'weight': Double, 'frame': Blob ... }
So my question is that, do we have an alternative for this? Is this possible with CQL?
At this time, I believe that it is not possible. You would be better off using a string with some sort of type information embedded in it,
e.g. {'home': 'int:1112223333', 'work': 'str:222-333-4444'},
or alternatively save a language-specific map into Cassandra using the blob type and language-specific serialization.
Maps are typed, but you can still create one map field per type: map_int, map_text, etc.
This has the added benefit that it'll be a bit faster: collections are loaded as a whole when read, so splitting up by type will load less data on each query. You should be able to find out the type of what you're looking for up front, I hope.
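A sketch of that per-type layout (column names are illustrative):

CREATE TABLE storage (
    mobile_id int PRIMARY KEY,
    date timestamp,
    data_text map<text, text>,      -- e.g. {'state': 'active'}
    data_int map<text, int>,        -- e.g. {'protocol': 17}
    data_double map<text, double>,  -- e.g. {'weight': 1.5}
    data_blob map<text, blob>       -- e.g. {'frame': 0xcafebabe}
);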

Which database can do this?

For a recent project, I need a database that can do this:
each item is a key-value pair, and each key is a multi-dimensional string, so for example
item 1:
key :['teacher','professor']
value: 'david'
item 2:
key :['staff', 'instructor', 'professor']
value: 'shawn'
so each key's length is not necessarily the same. I can do queries like:
anyone with both ['teacher','staff'] as keys.
Also, I can easily add another item later, for example a key-value pair like:
item 3:
key :['female', 'instructor', 'professor','programmer']
value: 'annie'
so the idea is that I can tag any array of keys to a value, and I can search by a subset of keys.
Since (judging by your comments) you don't need to enforce uniqueness, these are not actually "keys", and can be more appropriately thought of as "tags" whose primary purpose is to be searched on (not unlike StackOverflow.com tags).
The typical way of implementing tags in a relational database looks something like this:
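A minimal DDL sketch of that structure (column types assumed):

CREATE TABLE ITEM (
    ITEM_ID int PRIMARY KEY,
    ITEM_VALUE varchar(100)  -- e.g. 'david'
);

CREATE TABLE TAG (
    TAG_ID int PRIMARY KEY,
    TAG_NAME varchar(100) UNIQUE  -- e.g. 'teacher'
);

CREATE TABLE TAG_ITEM (
    TAG_ID int REFERENCES TAG (TAG_ID),
    ITEM_ID int REFERENCES ITEM (ITEM_ID),
    PRIMARY KEY (TAG_ID, ITEM_ID)
);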
Note the order of fields in the junction table TAG_ITEM's primary key: since our goal is to find items with a given tag (not tags of a given item), the leading edge of the index "underneath" the PK is TAG_ID. This facilitates an efficient index range scan on a given TAG_ID.
Cluster TAG_ITEM if your DBMS supports it.
You can then search for items with any of the given tags like this:
SELECT [DISTINCT] ITEM_ID
FROM
TAG
JOIN TAG_ITEM ON TAG.TAG_ID = TAG_ITEM.TAG_ID
WHERE
TAG_NAME = 'teacher'
OR TAG_NAME = 'professor'
And if you need any other fields from ITEM, you can:
SELECT * FROM ITEM WHERE ITEM_ID IN (<query above>)
You can search for items with all of the given tags like this:
SELECT ITEM_ID
FROM
TAG
JOIN TAG_ITEM ON TAG.TAG_ID = TAG_ITEM.TAG_ID
WHERE
TAG_NAME = 'teacher'
OR TAG_NAME = 'professor'
GROUP BY
ITEM_ID
HAVING
COUNT(*) = 2
PostgreSQL can do something similar with its hstore data type: http://www.postgresql.org/docs/9.1/static/hstore.html
Or maybe search on arrays: http://postgresguide.com/sexy/arrays.html
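For the array route, a small sketch (table and column names are illustrative):

CREATE TABLE items (
    item_value text,
    tags text[]
);
CREATE INDEX items_tags_idx ON items USING GIN (tags);

INSERT INTO items VALUES ('david', ARRAY['teacher', 'professor']);
INSERT INTO items VALUES ('shawn', ARRAY['staff', 'instructor', 'professor']);

-- items carrying ALL of the given tags (@> is array "contains"):
SELECT item_value FROM items WHERE tags @> ARRAY['teacher', 'professor'];

-- items carrying ANY of the given tags (&& is array "overlaps"):
SELECT item_value FROM items WHERE tags && ARRAY['teacher', 'staff'];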
