Which database can do this?

In a recent project, I need a database that does the following.
Each item is a key-value pair, and each key is an array of strings. For example:
item 1:
key :['teacher','professor']
value: 'david'
item 2:
key :['staff', 'instructor', 'professor']
value: 'shawn'
So the keys do not necessarily all have the same length. I want to be able to run a query like
anyone with both ['teacher','staff'] as keys.
I also want to be able to add another item later easily, for example a key-value pair like this:
item 3:
key :['female', 'instructor', 'professor','programmer']
value: 'annie'
So the idea is that I can tag a value with any array of keys, and I can search by a subset of those keys.

Since (judging by your comments) you don't need to enforce uniqueness, these are not actually "keys". They are more appropriately thought of as "tags" whose primary purpose is to be searched on (not unlike StackOverflow.com tags).
The typical way of implementing tags in a relational database uses three tables: ITEM (keyed by ITEM_ID), TAG (keyed by TAG_ID, with a TAG_NAME column), and a junction table TAG_ITEM whose primary key is {TAG_ID, ITEM_ID}.
Note the order of fields in the TAG_ITEM primary key: since our goal is to find the items of a given tag (not the tags of a given item), the leading edge of the index "underneath" the PK is TAG_ID. This allows an efficient index range scan on a given TAG_ID.
Cluster TAG_ITEM if your DBMS supports it.
You can then search for items with any of the given tags like this:
SELECT [DISTINCT] ITEM_ID
FROM
    TAG
    JOIN TAG_ITEM ON TAG.TAG_ID = TAG_ITEM.TAG_ID
WHERE
    TAG_NAME = 'teacher'
    OR TAG_NAME = 'professor'
And if you need any other fields from ITEM, you can:
SELECT * FROM ITEM WHERE ITEM_ID IN (<query above>)
You can search for items with all of the given tags like this:
SELECT ITEM_ID
FROM
    TAG
    JOIN TAG_ITEM ON TAG.TAG_ID = TAG_ITEM.TAG_ID
WHERE
    TAG_NAME = 'teacher'
    OR TAG_NAME = 'professor'
GROUP BY
    ITEM_ID
HAVING
    COUNT(*) = 2
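As a rough, runnable illustration of the schema and the "all of the given tags" query above, here is a minimal sketch using Python's built-in sqlite3 module. The column ITEM.VALUE and the helper names build_db, add_item and items_with_all_tags are assumptions made for this example, and SQLite's WITHOUT ROWID is used as a stand-in for clustering TAG_ITEM; other DBMSs spell this differently.

```python
import sqlite3


def build_db():
    """Create the three-table tag schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ITEM (ITEM_ID INTEGER PRIMARY KEY, VALUE TEXT NOT NULL);
        CREATE TABLE TAG (TAG_ID INTEGER PRIMARY KEY, TAG_NAME TEXT NOT NULL UNIQUE);
        -- TAG_ID leads the primary key so lookups by tag are index range scans.
        CREATE TABLE TAG_ITEM (
            TAG_ID INTEGER NOT NULL REFERENCES TAG(TAG_ID),
            ITEM_ID INTEGER NOT NULL REFERENCES ITEM(ITEM_ID),
            PRIMARY KEY (TAG_ID, ITEM_ID)
        ) WITHOUT ROWID;
    """)
    return conn


def add_item(conn, value, tags):
    """Insert one item and attach every tag in `tags` to it."""
    item_id = conn.execute("INSERT INTO ITEM (VALUE) VALUES (?)", (value,)).lastrowid
    for name in set(tags):
        conn.execute("INSERT OR IGNORE INTO TAG (TAG_NAME) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT TAG_ID FROM TAG WHERE TAG_NAME = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO TAG_ITEM (TAG_ID, ITEM_ID) VALUES (?, ?)", (tag_id, item_id))
    return item_id


def items_with_all_tags(conn, tags):
    """Return the values of items that carry every tag in `tags` (non-empty)."""
    tags = sorted(set(tags))
    placeholders = ",".join("?" for _ in tags)
    rows = conn.execute(
        f"""
        SELECT ITEM.VALUE
        FROM TAG
        JOIN TAG_ITEM ON TAG.TAG_ID = TAG_ITEM.TAG_ID
        JOIN ITEM ON ITEM.ITEM_ID = TAG_ITEM.ITEM_ID
        WHERE TAG.TAG_NAME IN ({placeholders})
        GROUP BY ITEM.ITEM_ID
        HAVING COUNT(*) = ?
        ORDER BY ITEM.VALUE
        """,
        (*tags, len(tags)),
    )
    return [row[0] for row in rows]


conn = build_db()
add_item(conn, "david", ["teacher", "professor"])
add_item(conn, "shawn", ["staff", "instructor", "professor"])
add_item(conn, "annie", ["female", "instructor", "professor", "programmer"])
print(items_with_all_tags(conn, ["instructor", "professor"]))
```

The HAVING count is tied to the number of distinct tags requested, which is why the helper deduplicates the tag list before building the query.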

PostgreSQL can do something similar with its hstore data type: http://www.postgresql.org/docs/9.1/static/hstore.html
Or maybe you are looking for arrays?: http://postgresguide.com/sexy/arrays.html


Retrieving more data if found data is not sufficient. Cassandra

My table looks like this,
CREATE TABLE IF NOT EXISTS names (
    firstname text,
    surname text,
    id text,
    PRIMARY KEY (firstname, surname)
);
Let's say I want to return a minimum of 10 names.
I do
select * from names where firstname = 'something' and surname = 'something';
But if this only returns 6 people, I then want it to do:
select * from names where firstname = 'something' limit 4;
But I want to avoid returning the same row twice.
And possibly do it in one query only.
Is this possible?
You can use the SELECT DISTINCT feature of CQL, which returns unique values for partition keys. Please also refer to the documentation below for more detail:
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlSelect.html
You can rely on the paging implemented by the driver(s), for example in Java.
In your case, you can call .setFetchSize on the specific query with the value you need. The driver will then read approximately that number of rows (or fewer) as the first page. If you need more, keep iterating over the results and the driver will fetch the next page, until either you stop or there is no more data.
But be very careful with page sizes that are too low: if you have a lot of data in the partition, the driver will need to go to Cassandra very often, and this will hurt performance.
P.S. You can't get 10 records for a query with firstname = 'something' and surname = 'something', because the two columns together make up the full primary key, and there can be only one record for a given primary key. You could use something like where firstname = 'something' and surname >= 'something' to fetch data starting from the given surname.

AppEngine Datastore: Hierarchical queries

If you're dealing with a hierarchy of records, where most of the keys have ancestors, will you have to create a chain of all of the keys before you can retrieve a leaf?
Example (in Go):
rootKey := datastore.NewKey(ctx, "EntityType", "", id1, nil)
secondGenKey := datastore.NewKey(ctx, "EntityType", "", id2, rootKey)
thirdGenKey := datastore.NewKey(ctx, "EntityType", "", id3, rootKey)
How do you get the record described by thirdGenKey without having to declare the keys for all of the levels of the hierarchy above it?
In order to get an individual entity, its key must be globally unique - this is enforced through each entity key being unique within its entity group. The ancestor path forms an intrinsic part of the entity key for this reason.
So, the only way to get a single entity with strong consistency is to specify its ancestor path. This can be done either with a get-by-key or with an ancestor query.
If you don't know the full ancestor path, your only option is to query on a property of the entity, but bear in mind that:
- this may not be unique within your application
- you will be subject to eventual consistency.
To supplement tx802's answer:
If you want to load an entity by key, you need its key. If the key has a parent, then to construct it you must first construct the parent key. The parent key is part of the key, just like the numeric ID or the string name.
Looking from the implementation perspective: datastore.Key is a struct:
type Key struct {
	kind      string
	stringID  string
	intID     int64
	parent    *Key
	appID     string
	namespace string
}
In order to construct a Key which has a parent, you must construct the parent key too, recursively. If you're finding it too verbose to always create the key hierarchy, you can create a helper function for it.
For the sake of simplicity, let's assume all keys use the same entity name, and we only use numeric IDs. It may look like this:
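// createKey builds a chain of keys of the same kind: each ID becomes the parent of the next.
// The IDs are int64 because datastore.NewKey takes its numeric ID as int64.
func createKey(ctx context.Context, entity string, ids ...int64) (k *datastore.Key) {
	for _, id := range ids {
		k = datastore.NewKey(ctx, entity, "", id, k)
	}
	return
}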
With this helper function, your example is reduced to this:
k2 := createKey(ctx, "EntityType", id1, id2)
k3 := createKey(ctx, "EntityType", id1, id3)
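To see why the parent cannot be left out, it can help to model the key as plain data. The following is a language-neutral sketch in Python, not the datastore API: the Key class, its path method and create_key are all hypothetical names that mirror the struct and helper above, restricted to numeric IDs and a single kind.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Key:
    """Toy model of a hierarchical key: kind, numeric ID, optional parent."""
    kind: str
    int_id: int
    parent: Optional["Key"] = None

    def path(self) -> Tuple:
        """Flatten the chain into (kind, id, kind, id, ...) from root to leaf."""
        prefix = self.parent.path() if self.parent else ()
        return prefix + (self.kind, self.int_id)


def create_key(kind: str, *ids: int) -> Optional[Key]:
    """Build a chain of keys of one kind; each ID is the parent of the next."""
    key = None
    for int_id in ids:
        key = Key(kind, int_id, key)
    return key


k3 = create_key("EntityType", 1, 3)
orphan = create_key("EntityType", 3)
print(k3.path())     # the full ancestor path
print(k3 == orphan)  # same kind and ID, different parent
```

In this model two keys are equal only when their whole paths match, which mirrors the point made above that the parent key is part of the key.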

Sort solr response by value in subdocument collection

I'm using DSE Solr to index a Cassandra table that contains a collection of UDTs. I want to be able to sort search results based on a value inside those UDTs.
Given a simplistic example table...
create type test_score (
    test_name text,
    percentile double,
    score int,
    description text
);
create table students (
    id int,
    name text,
    test_scores set<frozen<test_score>>,
    ...
);
... and assuming I'm auto-generating the solr schema via dsetool, I want to be able to write a solr query that finds students who have taken a test (by a specific test_name), and sort them by that test's score (or percentile, or whatever).
Unfortunately you can't sort by UDT fields.
However, I'm not sure what the value of a UDT is here. Perhaps I don't know enough about your use case. Another issue I see is that each partition key is a student id, so you can only store one test result per student. A better approach might be to use test id as a clustering column so you can store all the test results for a student in a single partition. Something like this:
CREATE TABLE students (
    id int,
    student_name text,
    test_name text,
    score int,
    percentile double,
    description text,
    PRIMARY KEY (id, student_name, test_name)
);
Student name is kind of redundant (it should be the same for every row in each partition), but it doesn't have to be a clustering column.
Then you can sort on any field like so:
SELECT * FROM students WHERE solr_query='{"q":"test_name:Biology", "sort":"percentile desc"}' LIMIT 10;
I've used the JSON syntax described here: https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchJSON.html
OK, so basically you want to do a JOIN between test_score and students, right?
According to the official doc: http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchQueryJoin.html
Joining Solr cores is possible only if the two tables share the same partition key, which is not the case in your example.

JPA search by Key without Knowing Parent Key

Ok so I have an application that uses GAE and consequently the datastore.
Say I have multiple companies A, B and C and I have within each company Employees X,Y and Z. The relationship between a company and employee will be OneToMany, with the company being the owner. This results in the Company Key being of the form
long id = 4504699138998272; // Random Example
Key CompanyKey = KeyFactory.createKey(Company.class.getSimpleName(), id);
and the employee key would be of the form
long id2 = 5630599045840896;
Key EmployeeKey = KeyFactory.createKey(CompanyKey,Employee.class.getSimpleName(),id2);
This is all fine and well, and there is no problem until the front end, during JSP rendering. Sometimes I need to generate a report or open an Employee's profile, in which case the div containing the employee's information gets an id as follows:
<div class="employeeInfo" id="<%=employee.getKey().getId()%>" > .....</div>
This div has an onclick / submit event that sends the modifications to the employee profile to a servlet via Ajax. At that point I have to specify the primary key of the employee (which I thought I could easily get from the div id), but it didn't work server side.
The problem is that I know the kind (String) portion and the long id portion of the Employee's key, but not the parent key. To save time I tried this, and it didn't work:
Key key = KeyFactory.createKey(Employee.class.getSimpleName(), id);
Employee X = em.find(Employee.class,key);
X is always null.
I would really appreciate any idea of how to find or "query" entities by key without knowing their parent's key (as I would hate to have to readjust the entity classes).
Thanks a lot!
An entity key and its parents cannot be separated. Together they form what is called an ancestor path, a chain composed of entity kinds and ids.
So, in your example ancestor paths will look like this:
CompanyKey: ("Company", 4504699138998272)
EmployeeKey: ("Company", 4504699138998272, "Employee", 5630599045840896)
A key composed only of ("Employee", 5630599045840896) is a completely different key from EmployeeKey, even though both keys end with the same values. Think of concatenating the elements into a single "string" and comparing the final values: they will never match.
One thing you can do is use encoded keys instead of their id values:
String encodedKey = KeyFactory.keyToString(EmployeeKey);
Key decodedKey = KeyFactory.stringToKey(encodedKey);
decodedKey.equals(EmployeeKey); // true
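The idea behind the encoded key is that the whole ancestor path travels to the page and back as one opaque string, so the server never has to guess the parent. The sketch below illustrates only that round trip in Python; key_to_string and string_to_key here are hypothetical stand-ins, and the JSON-plus-base64 encoding is an assumption for illustration, not the format KeyFactory actually produces.

```python
import base64
import json


def key_to_string(path):
    """Encode a full ancestor path, e.g. ("Company", 1, "Employee", 2), as a URL-safe string."""
    raw = json.dumps(list(path), separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def string_to_key(encoded):
    """Decode a string produced by key_to_string back into the ancestor path tuple."""
    raw = base64.urlsafe_b64decode(encoded.encode("ascii"))
    return tuple(json.loads(raw))


employee_path = ("Company", 4504699138998272, "Employee", 5630599045840896)
encoded = key_to_string(employee_path)   # value that could go into the div id
decoded = string_to_key(encoded)         # value recovered on the server side
print(decoded == employee_path)
```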
More about Ancestor Paths:
https://developers.google.com/appengine/docs/java/datastore/entities#Java_Ancestor_paths
KeyFactory Java doc:
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/datastore/KeyFactory#keyToString(com.google.appengine.api.datastore.Key)

How to best represent items with variable # of attributes in a database?

Let's say you want to create a listing of widgets.
The widget manufacturers all create widgets with different numbers and types of attributes, and the widget sellers all have different preferences about which attributes, and how many, they want to store in the database and display.
The problem is that each time you add a new widget, it may have attributes that do not currently exist for any other widget. Currently you handle this by modifying the table to add a new column for that attribute, and then modifying all forms and reports to reflect the change.
How do you go about creating a database that takes into account that the attributes of a widget are fluid and can change from widget to widget?
Ideally, the widget attributes should be something the user can define according to his/her preferences and needs.
I would have a table for widgets and one for widget attributes. For example:
Widgets
- Id
- Name
WidgetAttributes
- Id
- Name
Then, you would have another table which records which widgets have which attributes:
WidgetAttributeMap
- Id
- WidgetId (a value from the Id column in the Widgets table)
- WidgetAttributeId (a value from the Id column in the WidgetAttributes table)
This way, you can add attributes to widgets by modifying rows in the WidgetAttributeMap table, not by modifying the structure of your widget table.
casperOne is showing the way, although I would personally add yet one more table for the attribute values, ending up with
Widgets
-WidgetID (pk)
-Name
WidgetAttributes
-AttributeID (pk)
-Name
WidgetHasAttribute
-WidgetID (pk)
-AttributeID (pk)
WidgetAttributeValues
-ValueID (pk)
-WidgetID
-AttributeID
-Value
In order to retrieve the results, you want to join the tables and perform an aggregate concatenation, so you can end up with data looking like (for example):
Name     Properties
Widget1  Attr1:Value1;Attr2:Value2;...etc
Then you could split the Properties string in your Business Logic Layer and use as you wish.
A suggestion on how to join the data:
SELECT w.Name, wa.Name + ':' + wav.Value
FROM ((
Widgets w
INNER JOIN
WidgetHasAttribute wha
ON w.WidgetID = wha.WidgetID)
INNER JOIN WidgetAttributes wa
ON wha.AttributeID = wa.AttributeID)
INNER JOIN WidgetAttributeValues wav
ON (w.WidgetID = wav.WidgetID AND wa.AttributeID = wav.AttributeID)
You can read more on aggregate concatenation here.
As far as performance is concerned, it shouldn't be a problem as long as you make sure to index all columns that will be frequently read, that is:
- All the ID columns, as they will be compared in the join clauses
- WidgetAttributes.Name and WidgetAttributeValues.Value, as they will be concatenated
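For a concrete feel of the aggregate concatenation and the split in the business logic layer, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite's GROUP_CONCAT and || operator in place of the + concatenation shown above, leaves out the WidgetHasAttribute table for brevity, and uses made-up sample data; build_widget_db and widget_properties are hypothetical helper names. The split also assumes that attribute names and values contain no ':' or ';'.

```python
import sqlite3


def build_widget_db():
    """Create a small attribute/value schema with sample rows (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Widgets (WidgetID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
        CREATE TABLE WidgetAttributes (AttributeID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
        CREATE TABLE WidgetAttributeValues (
            ValueID INTEGER PRIMARY KEY,
            WidgetID INTEGER NOT NULL REFERENCES Widgets(WidgetID),
            AttributeID INTEGER NOT NULL REFERENCES WidgetAttributes(AttributeID),
            Value TEXT NOT NULL
        );
        INSERT INTO Widgets VALUES (1, 'Widget1'), (2, 'Widget2');
        INSERT INTO WidgetAttributes VALUES (1, 'Color'), (2, 'Weight');
        INSERT INTO WidgetAttributeValues (WidgetID, AttributeID, Value)
        VALUES (1, 1, 'red'), (1, 2, '2kg'), (2, 1, 'blue');
    """)
    return conn


def widget_properties(conn):
    """Return {widget name: {attribute: value}} via one aggregated row per widget."""
    rows = conn.execute("""
        SELECT w.Name, GROUP_CONCAT(wa.Name || ':' || wav.Value, ';')
        FROM Widgets w
        JOIN WidgetAttributeValues wav ON wav.WidgetID = w.WidgetID
        JOIN WidgetAttributes wa ON wa.AttributeID = wav.AttributeID
        GROUP BY w.WidgetID
    """)
    # Split "Attr1:Value1;Attr2:Value2" back into a dictionary per widget.
    return {
        name: dict(pair.split(":", 1) for pair in props.split(";"))
        for name, props in rows
    }


conn = build_widget_db()
print(widget_properties(conn))
```

Adding a new attribute to a widget is then a matter of inserting rows, with no change to the table structure.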
