Cassandra: map collections with multiple datatypes

As stated and discussed by Mr. Ellis in dynamic-columns/wide-rows, dynamic tables are possible through map collections. However, as far as I can see, this only works for data of the same type.
Example from the link:
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    birth_year int,
    phone_numbers map<text, int>
);
INSERT INTO users (user_id, name, birth_year, phone_numbers)
VALUES ('jbellis', 'Jonathan Ellis', 1976, {'home': 1112223333, 'work': 2223334444});
Both home and work phone_numbers are of type int. But we need a collection with various datatypes. Say,
CREATE TABLE storage (
    mobile_id int PRIMARY KEY,
    date timestamp,
    data map
);
Then data contains these:
{'state': String, 'protocol': Integer, 'weight': Double, 'frame': Blob ... }
So my question is: do we have an alternative for this? Is this possible with CQL?

At this time, I believe it is not possible. You would be better off using a string with some sort of type information embedded in it,
e.g. {'home': 'int:1112223333', 'work': 'str:222-333-4444'},
or alternatively saving a language-specific map into Cassandra as a blob, using language-specific serialization.
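For illustration, a minimal Python sketch of that type-prefix idea; the encode/decode helpers and the prefix names are made up for this example and are not part of any Cassandra API:

# Hypothetical helpers: store every value in a map<text, text>, prefixed
# with its type, and decode it again on the way out.
def encode_value(value):
    if isinstance(value, bool):
        return 'bool:' + ('1' if value else '0')
    if isinstance(value, int):
        return 'int:' + str(value)
    if isinstance(value, float):
        return 'double:' + repr(value)
    return 'str:' + str(value)

def decode_value(encoded):
    kind, _, raw = encoded.partition(':')
    decoders = {'bool': lambda s: s == '1', 'int': int, 'double': float, 'str': str}
    return decoders[kind](raw)

row = {'state': 'OK', 'protocol': 7, 'weight': 1.5}
encoded = dict((k, encode_value(v)) for k, v in row.items())   # written to the map<text, text>
decoded = dict((k, decode_value(v)) for k, v in encoded.items())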

Maps are typed, but you could still create one map column per type: map_int, map_text, and so on.
This has the added benefit that it will be a bit faster: collections are loaded as a whole when read, so splitting up by type means less data is loaded on each query. You should be able to find out the type of what you're looking for up front, I hope.
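A rough client-side sketch of that per-type split; the column names (map_text, map_int, map_double) and the routing rules are assumptions for illustration, not taken from the question:

# Assumed table layout, one typed map column per value type:
#   CREATE TABLE storage (
#       mobile_id  int PRIMARY KEY,
#       date       timestamp,
#       map_text   map<text, text>,
#       map_int    map<text, int>,
#       map_double map<text, double>
#   );
def split_by_type(data):
    buckets = {'map_text': {}, 'map_int': {}, 'map_double': {}}
    for key, value in data.items():
        if isinstance(value, bool) or isinstance(value, int):
            buckets['map_int'][key] = int(value)
        elif isinstance(value, float):
            buckets['map_double'][key] = value
        else:
            buckets['map_text'][key] = str(value)   # blobs would go to a separate map_blob column
    return buckets

# Each bucket is then written to its own map column in a single INSERT/UPDATE.
buckets = split_by_type({'state': 'idle', 'protocol': 4, 'weight': 2.5})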

Related

How to improve or index postgresql's jsonb array field?

I usually use a jsonb field to store array data.
For example, to store customers' barcode info, I create a table like this:
create table customers(fcustomerid bigint, fcodes jsonb);
Each customer has one row, with all barcode info stored in its fcodes field, just like below:
[
    {
        "barcode": "000000001",
        "codeid": 1,
        "product": "Coca Cola",
        "createdate": "2021-01-19",
        "lottorry": true,
        "lottdate": "2021-01-20",
        "bonus": 50
    },
    {
        "barcode": "000000002",
        "codeid": 2,
        "product": "Coca Cola",
        "createdate": "2021-01-19",
        "lottorry": false,
        "lottdate": "",
        "bonus": 0
    },
    ...
    {
        "barcode": "000500000",
        "codeid": 500000,
        "product": "Pepsi Cola",
        "createdate": "2021-01-19",
        "lottorry": false,
        "lottdate": "",
        "bonus": 0
    }
]
The jsonb array may store millions of barcode objects with the same structure. Perhaps this is not a good idea, but you know, when I have thousands of customers I can store all the data in one table, with one row per customer and all of that customer's data in one field; it looks very terse and easy to manage.
For this kind of application scenario, how can I efficiently insert, modify, or query the data?
I can use jsonb_insert to insert one object, just like:
update customers
set fcodes=jsonb_insert(fcodes,'{-1}','{...}'::jsonb)
where fcustomerid=999;
When I want to modify an object, I found it a little difficult: I need to know the index of the object first. If I use the incremental key codeid as the array index, things look easier. I can use jsonb_set, just like below:
update customers
set fcodes=jsonb_set(fcodes, concat('{', (mycodeid-1)::text, ',lottorry}')::text[], 'true'::jsonb)
where fcustomerid=999;
But if I want to query the objects in the jsonb array by createdate or bonus or lottorry or product, I have to use jsonpath operators, just like:
select jsonb_path_query_array(fcodes, '$[*] ? (@.product == "Pepsi Cola")')
from customers
where fcustomerid=999;
or like:
select jsonb_path_query_array(fcodes, '$[*] ? (@.lottdate.datetime() >= "2021-01-01".datetime() && @.lottdate.datetime() <= "2021-01-31".datetime())')
from customers
where fcustomerid=999;
The jsonb index looks useful, but it helps across different rows, while my operations mostly work within one row's single jsonb field.
I am very worried about efficiency: with millions of objects stored in one row's single jsonb field, is this a good idea? And how can I improve efficiency in this scenario, especially for queries?
You are right to worry. With a huge JSON like that, you will never get good performance.
Your data don't need JSON at all. Create a table that stores a single barcode and has a foreign key reference to customers. Then everything will be simple and efficient.
Using JSON in the database is almost always the wrong choice, judging from the questions in this forum.
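For illustration, a minimal Python/psycopg2 sketch of that normalized layout; the table and column names below are assumptions modelled on the JSON keys in the question, not an established schema:

import psycopg2

# Hypothetical normalized schema: one row per barcode, referencing the customer.
DDL = """
CREATE TABLE IF NOT EXISTS barcodes (
    fcodeid     bigint PRIMARY KEY,
    fcustomerid bigint,          -- FK to customers(fcustomerid) once that column is unique
    fbarcode    text,
    fproduct    text,
    fcreatedate date,
    flottorry   boolean,
    flottdate   date,
    fbonus      integer
);
CREATE INDEX IF NOT EXISTS barcodes_customer_idx ON barcodes (fcustomerid);
"""

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
cur = conn.cursor()
cur.execute(DDL)

# Inserting a barcode becomes an ordinary row insert instead of jsonb_insert.
cur.execute(
    "INSERT INTO barcodes (fcodeid, fcustomerid, fbarcode, fproduct,"
    " fcreatedate, flottorry, flottdate, fbonus)"
    " VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
    (1, 999, '000000001', 'Coca Cola', '2021-01-19', True, '2021-01-20', 50))

# Queries use plain indexed SQL instead of jsonpath over one huge value.
cur.execute(
    "SELECT fbarcode, fbonus FROM barcodes"
    " WHERE fcustomerid = %s AND fproduct = %s",
    (999, 'Pepsi Cola'))
rows = cur.fetchall()
conn.commit()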

How to make dynamic references to tables in Anylogic?

I've modeled six machines. Each of them has a different electricity load profile. The load profile is provided in a table in AnyLogic; every machine has its own table storing these values. I iterate through the values to implement the same in TableFunctions. Now I face the following challenge: how can I make a dynamic reference to the relevant table? I would like to pick a specific table depending on a machine index. How can I define a variable that dynamically refers to the relevant table object?
Thank you for your help!
Not sure it is really necessary in your case, but here goes:
You can store a reference to a database table to a variable of the following type:
com.mysema.query.sql.RelationalPathBase
When selecting values of double (int, String, etc.) type in a particular column, you can get the column by index by calling variable.getColumns().get(index). Then you need to cast it to the corresponding type, like below:
List<Double> resultRows = selectFrom(variable)
    .where(((com.mysema.query.types.path.NumberPath<Double>) variable.getColumns().get(1)).eq(2.0))
    .list(((com.mysema.query.types.path.NumberPath<Double>) variable.getColumns().get(1)));
Are you always going to have a finite number of machines and how is your load profile represented? If you have a finite number of machines, and the load profile is a set of individual values - or indeed as long as you can hold those values in a single field per element - then you can create a single table, e.g. machine_load_profile, where the first column is load_profile_element and holds element IDs and further columns are named machine_0, machine_1, machine_2 etc., holding the values for each load profile element. You can then get the load profile elements for a single machine like this:
List<Double> dblReturnLPEs = main.selectValues(
"SELECT machine_" + oMachine.getIndex()
+ " FROM machine_load_profile"
+ " ORDER BY load_profile_element;"
);
and either iterate that list or convert them into an array:
dblLPEValues = dblReturnLPEs.stream().mapToDouble(Double::doubleValue).toArray();
and iterate that.
Of course you could also use the opposite orientation for your columns and rows, using WHERE; I simply had a handy example oriented in this manner.

Django Query Optimisation

I am currently working on a telecom analytics project and am a newbie at query optimisation. Showing the result in the browser takes a full minute, while only 45,000 records need to be accessed. Could you please suggest ways to reduce the time taken to show results?
I wrote the following query to find the call duration for people in an age group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
           for i in range(popn)]
for card in card_list:
    dic=Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma+=dic['duration__sum']
avgDur=sigma/popn
The above code is inside a for loop that iterates over age groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration')) #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you make, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, but also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print "Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg'])
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.
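(For reference, the snippets above assume the standard Django aggregate imports:)
from django.db.models import Avg, Sum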

How to use ndb key with integer_id?

I see the document
https://developers.google.com/appengine/docs/python/ndb/keyclass#Key_integer_id
Returns the integer id in the last (kind, id) pair, or None if the key
has a string id or is incomplete.
So I think the id of a key can be an int, so I wrote:
r = ndb.Key(UserSession, int(id)).get()
if r:
    return r.session
but the dev_server.py will always raise:
File "/home/bitcoin/down/google_appengine/google/appengine/datastore/datastore_stub_util.py", line 346, in CheckReference
raise datastore_errors.BadRequestError('missing key id/name')
BadRequestError: missing key id/name
I changed int(id) -> str(id) and it seems to work.
So my question is: how do I use an ndb key with an integer id?
The model is:
class UserSession(ndb.Model):
    session = ndb.BlobProperty()
The type of the id you use when reading the entity must match the type of the id you used when you wrote the entity. Normally, integer ids are assigned automatically when you write a new entity without specifying an id or key; you then get the id out of the key returned by entity.put(). It is generally not recommended to assign your own integer ids; when the app assigns the keys, the convention is that they should be strings.
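As a small illustration of that type-matching rule (the literal id 42 and the payload are made up for this sketch):

from google.appengine.ext import ndb

class UserSession(ndb.Model):
    session = ndb.BlobProperty()

# Write with an explicit integer id, then read it back using the same type.
UserSession(id=42, session='payload').put()
r = ndb.Key(UserSession, 42).get()          # found: integer id matches
missing = ndb.Key(UserSession, '42').get()  # None: the string '42' names a different key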
There's an easier way to fetch:
UserSession.get_by_id(int(id))
https://developers.google.com/appengine/docs/python/ndb/modelclass#Model_get_by_id
If that doesn't work, I suspect that id is wrong or empty.
There must be something wrong with your variable 'id'.
Your code here should be no problem, and it's better to use long instead of int.
You can try your code on interactive console of development server with specific integer id.
It may be easier to identify your entities in the sessions with their keys instead of their ids. There really is no need to extract the ID from the key to identify the session (other than maybe saving a bit of memory). I think the way you're thinking is based on an RDB. I learned that using the key actually makes entity/session identification easier.
'id' is also a Python builtin function. Maybe you are picking that up by mistake.

ndb retrieving entity key by ID without parent

I want to get an entity key knowing entity ID and an ancestor.
ID is unique within entity group defined by the ancestor.
It seems to me that it's not possible using the ndb interface. As I understand the datastore, this may be because the operation would require a full index scan.
The workaround I used is to create a computed property in the model, which contains the id part of the key. I'm now able to do an ancestor query and get the key:
class SomeModel(ndb.Model):
    ID = ndb.ComputedProperty(lambda self: self.key.id())

    @classmethod
    def id_to_key(cls, identifier, ancestor):
        return cls.query(cls.ID == identifier,
                         ancestor=ancestor.key).get(keys_only=True)
It seems to work, but are there any better solutions to this problem?
Update
It seems that for the datastore the natural solution is to use full paths instead of identifiers. Initially I thought it would be too burdensome. After reading dragonx's answer I redesigned my application. To my surprise, everything looks much simpler now. Additional benefits are that my entities will use less space and I won't need additional indexes.
I ran into this problem too. I think you do have the solution.
The better solution would be to stop using IDs to reference entities, and store either the actual key or a full path.
Internally, I use keys instead of IDs.
On my REST API, I used to do http://url/kind/id (where id looked like "123") to fetch an entity. I modified that to provide the complete ancestor path to the entity: http://url/kind/ancestor-ancestor-id (789-456-123). I then parse that string, generate a key, and get by key.
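A rough sketch of that parse-and-rebuild step; the helper name and the kind names ('Grandparent', 'Parent', 'Kind') are hypothetical placeholders, and only the '789-456-123' format comes from the answer above:

from google.appengine.ext import ndb

def key_from_path_string(path_str, kinds=('Grandparent', 'Parent', 'Kind')):
    # '789-456-123' -> [789, 456, 123], outermost ancestor first
    ids = [int(part) for part in path_str.split('-')]
    flat = []
    for kind, numeric_id in zip(kinds, ids):
        flat.extend([kind, numeric_id])
    return ndb.Key(flat=flat)

entity = key_from_path_string('789-456-123').get()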
Since you have full information about your ancestor and you know your id, you could directly create your key and get the entity, as follows:
my_key = ndb.Key(Ancestor, ancestor.key.id(), SomeModel, id)
entity = my_key.get()
This way you avoid making a query that costs more than a get operation both in terms of money and speed.
Hope this helps.
I want to make a little addition to dragonx's answer.
In my application, on the front end I use the string representation of keys:
str(instance.key())
When I need to make some changes to an instance, even if it is a descendant, I use only the string representation of its key. For example, I have key_str, an argument from the request, to delete an instance:
instance = Kind.get(key_str)
instance.delete()
My solution is to use urlsafe to get the item without worrying about the parent id:
pk = ndb.Key(Product, 1234)
usafe = LocationItem.get_by_id(5678, parent=pk).key.urlsafe()
# now we can get the item by its urlsafe key
item = ndb.Key(urlsafe=usafe).get()
print item
