How to deal with compound keys using DIH in Solr

I am importing data from a MySQL database into Solr documents. Everything works except for one table that has a compound key (a pair of columns that together form the primary key): the primary key of the post_locations table is (post_id, location_id).
However, post_id is the unique key of my Solr document, so when data is imported from the post_locations table, the location_ids overwrite each other. Is it possible to get the location_ids (which are of type int) as an array, since there can be more than one location_id per post?

For MySQL you can use GROUP BY and GROUP_CONCAT to get all the values for a field grouped together in a single column, separated by commas. You can then use the RegexTransformer with splitBy on that field to index it as multiValued (in practice, as an array). I posted an example of this in a previous answer. You could also do this with dependent entity entries in DIH, but that requires more SQL queries than a GROUP BY with GROUP_CONCAT.
If you want one row for each entry instead, you can build a custom uniqueKey, using CONCAT to build the aggregate / compound key on the MySQL side.
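Both options can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL because it also provides GROUP_CONCAT, the sample rows are made up, and the split in Python mimics what splitBy would do inside DIH rather than running DIH itself.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); both provide GROUP_CONCAT.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_locations (
        post_id     INTEGER NOT NULL,
        location_id INTEGER NOT NULL,
        PRIMARY KEY (post_id, location_id)
    );
    INSERT INTO post_locations VALUES (1, 10), (1, 20), (2, 30);
""")

# Option 1: one row per post, all location_ids collapsed into one column.
grouped = conn.execute("""
    SELECT post_id, GROUP_CONCAT(location_id) AS location_ids
    FROM post_locations
    GROUP BY post_id
    ORDER BY post_id
""").fetchall()

# Splitting on "," mimics what splitBy="," would do for a multiValued field.
# The values are sorted because GROUP_CONCAT does not guarantee an order here.
docs = [
    {"post_id": post_id, "location_ids": sorted(int(v) for v in ids.split(","))}
    for post_id, ids in grouped
]

# Option 2: one row per (post_id, location_id) with a concatenated unique key.
# MySQL would use CONCAT(post_id, '-', location_id); SQLite uses ||.
compound = conn.execute("""
    SELECT post_id || '-' || location_id AS id, post_id, location_id
    FROM post_locations
    ORDER BY post_id, location_id
""").fetchall()

print(docs)
print(compound)
```

With the first option each post stays a single Solr document; with the second, every (post_id, location_id) pair becomes its own document keyed by the concatenated id.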

Related

Should I add a unique constraint to a UUID column?

I'm adding a UUID column to one of my tables so I can easily provision API keys. Should I bother adding a unique constraint to the column? I don't want to have duplicate API keys, but on the other hand, the odds of a collision when generating the UUID values are infinitesimal.
I think you need to consider whether you are going to join tables on this column or filter by it. If so, create a unique key on the UUID column: a unique constraint is backed by an index, which helps retrieve data faster, and it also guarantees that a duplicate key can never be stored.
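A small sketch of both effects, assuming SQLite as a stand-in database and a hypothetical api_clients table (neither comes from the question): the constraint rejects a repeated key, and the query planner uses the index behind it for lookups.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical table; the UNIQUE constraint creates an index behind the scenes.
conn.execute(
    "CREATE TABLE api_clients (id INTEGER PRIMARY KEY, api_key TEXT NOT NULL UNIQUE)"
)

key = str(uuid.uuid4())
conn.execute("INSERT INTO api_clients (api_key) VALUES (?)", (key,))

# Inserting the same key again violates the constraint.
try:
    conn.execute("INSERT INTO api_clients (api_key) VALUES (?)", (key,))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Ask the planner how it would look up a key; the last field is the plan text.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM api_clients WHERE api_key = ?", (key,)
).fetchall()
uses_index = any("INDEX" in row[-1].upper() for row in plan)

print(duplicate_rejected, uses_index)
```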

Sql: Joining view on computed columns vs performance

I have some SQL tables whose primary key consists of several columns. I created a view on these tables and added a computed column that is the concatenation of the table's primary key columns, separated by a separator (for example, ColumnA$ColumnB$ColumnC is the concatenation of columns A, B and C, which form the table's key).
When I use this view, I filter on the computed column to work with the primary key.
In other cases I have a query that joins several views. The foreign key in each view is computed the same way as the primary key, and the joins are on the computed columns.
The purpose of this work is to simplify the keys in order to simplify integration with other software.
Could this scenario significantly affect performance?
Thanks in advance
Luca
A better idea would be to keep these columns separate, just as they are in your tables; you can then create your index/PK on all three columns rather than on a single concatenated one. For performance, I would suggest using an indexed view here. Alternatively, if these are three string columns, you can use a hashing technique, as long as your application can handle the extremely rare case of a hash collision.
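The difference can be observed in a query plan. The sketch below is an assumption-laden illustration, not a measurement of the asker's system: it uses SQLite instead of the original database, and the items table and items_view view are hypothetical names. Filtering on the concatenated column forces a scan, while filtering on the separate key columns lets the planner search the primary key index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        column_a TEXT NOT NULL,
        column_b TEXT NOT NULL,
        column_c TEXT NOT NULL,
        payload  TEXT,
        PRIMARY KEY (column_a, column_b, column_c)
    );
    -- View with the concatenated key, as described in the question.
    CREATE VIEW items_view AS
        SELECT column_a || '$' || column_b || '$' || column_c AS computed_key,
               column_a, column_b, column_c, payload
        FROM items;
    INSERT INTO items VALUES ('x', 'y', 'z', 'first'), ('x', 'y', 'w', 'second');
""")

def plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

# Filter on the computed column: no index can serve the concatenation.
computed_plan = plan(
    "SELECT * FROM items_view WHERE computed_key = ?", ("x$y$z",)
)

# Filter on the separate key columns: the primary key index is usable.
separate_plan = plan(
    "SELECT * FROM items WHERE column_a = ? AND column_b = ? AND column_c = ?",
    ("x", "y", "z"),
)

print(computed_plan)
print(separate_plan)
```

Other engines offer ways to index a computed expression, so the exact behavior depends on the database in use.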

Cassandra: Why do I not have to include all partition keys in query?

Currently, I am working with Cassandra.
A blog post I was reading says:
When issuing a CQL query, you must include all partition key columns,
at a minimum.
(https://shermandigital.com/blog/designing-a-cassandra-data-model/)
However, in my database it seems to be possible without including all partition keys. Here is the table:
CREATE TABLE usertable (
    personid text,
    name text,
    "timestamp" timestamp,
    active boolean,
    PRIMARY KEY ((personid, name), timestamp)
) WITH
    CLUSTERING ORDER BY ("timestamp" DESC)
    AND comment=''
    AND read_repair_chance=0
    AND dclocal_read_repair_chance=0.1
    AND gc_grace_seconds=864000
    AND bloom_filter_fp_chance=0.01
    AND compaction={ 'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
        'max_threshold':'32',
        'min_threshold':'4' }
    AND compression={ 'chunk_length_in_kb':'64',
        'class':'org.apache.cassandra.io.compress.LZ4Compressor' }
    AND caching={ 'keys':'ALL',
        'rows_per_partition':'NONE' }
    AND default_time_to_live=0
    AND id='23ff16b0-c400-11e8-55c7-2b453518a213'
    AND min_index_interval=128
    AND max_index_interval=2048
    AND memtable_flush_period_in_ms=0
    AND speculative_retry='99PERCENTILE';
So I can do select * from usertable where personid = 'ABC-02';. However, according to the blog post, I have to include timestamp as well.
Can someone explain this?
In Cassandra, the partition key determines how data is spread around the cluster. Cassandra computes the hash of the partition key and uses it to determine the location of the data in the cluster.
One exception: if you use ALLOW FILTERING or a secondary index, you are not required to include all partition key columns in the WHERE clause.
For further information, take a look at this blog post:
The purpose of a partition key is to split the data into partitions
where an entire partition is stored on a single node in the cluster
(with each node storing many partitions). When data is read or written
from the cluster, a function called Partitioner is used to compute the
hash value of the partition key. This hash value is used to determine
the node/partition which contains that row. The clustering key is used
further to search for a row within a given partition.
Select queries in Apache Cassandra look a lot like select queries from
a relational database. However, they are significantly more
restricted. The attributes allowed in ‘where’ clause of Cassandra
query must include the full partition key and additional clauses may
only reference the clustering key columns or a secondary index of the
table being queried.
Requiring the partition key attributes in the ‘where’ helps Cassandra
to maintain constant result-set retrieval time as the cluster is
scaled-out by allowing Cassandra to determine the partition, and thus
the node (and even data files on disk), that the query must be
directed to.
If a query does not specify the values for all the columns from the
primary key in the ‘where’ clause, Cassandra will not execute it and
give the following warning :
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW
FILTERING"
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
According to your schema, your timestamp column is the clustering column (the sorting column), not part of the partition key. That's why it is not required.
(personid, name) are your partition key columns.
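As a rough illustration of why the partition key columns matter and timestamp does not, here is a simplified sketch. It rests on several assumptions: MD5 stands in for the partitioner's real hash function, a modulo over a hypothetical three-node list stands in for the token ring, and the name values are invented.

```python
import hashlib

# Hypothetical cluster; real clusters assign token ranges on a ring.
NODES = ["node-a", "node-b", "node-c"]

def token(partition_key):
    """Hash the full partition key, here (personid, name), into an integer.

    MD5 is only a stand-in for the partitioner's hash (assumption).
    """
    raw = "\x00".join(partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")

def node_for(partition_key):
    """Pick the node that owns the partition (simplified modulo placement)."""
    return NODES[token(partition_key) % len(NODES)]

# With both partition key columns the target node is known immediately,
# and the clustering column "timestamp" plays no part in that decision.
owner = node_for(("ABC-02", "Alice"))

# With only personid, the token cannot be computed: every possible name
# gives a different token, so the partition could live on any node.
token_alice = token(("ABC-02", "Alice"))
token_bob = token(("ABC-02", "Bob"))

print(owner, token_alice != token_bob)
```

The clustering column only orders rows inside a partition that has already been located, which is why it can be left out of the WHERE clause.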

Multi-Column b-tree index logic?

I know how to implement a b-tree for single-column indexes, but how do I implement a b-tree for multi-column indexes in my rdbms project?
For example, I have a table consisting of documents records:
Documents
-------------
id
serial_no
order_no
record_sequence
If I make an index with 3 columns, for example:
CREATE UNIQUE INDEX myindex ON Documents (serial_no, order_no, record_sequence);
then I have a key name for my b-tree structure in this format:
serial_no*order_no*record_sequence.
I can request a record via this index and this query:
SELECT * FROM Documents WHERE serial_no='ABC' AND order_no=500 AND record_sequence=0;
Note: I am creating an index record ABC*500*0 as b-tree key name.
But when I call all records of a document, for example:
SELECT * FROM Documents WHERE serial_no='ABC' AND order_no=500;
I cannot use my index to search for records because record_sequence is missing in this example.
As a result, what is the method of creating and searching multi-column indexes?
As far as I know my b-tree object does not support searching for "ABC*500*ANY". I am using a RaptorDB_v2.7.5 b-tree object:
RaptorDB - the Document Store
NoSql, JSON based, Document store database with compiled .net map functions and automatic hybrid bitmap indexing and LINQ query filters
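One common way to support such queries is to keep the key as an ordered tuple of column values and answer a query on the leading columns with a range scan over the sorted keys. The sketch below only illustrates that idea: a sorted Python list stands in for the B-tree's ordered key space, prefix_scan is a hypothetical helper, and none of it is RaptorDB's API.

```python
import bisect

# A sorted list of tuple keys stands in for the B-tree's ordered keys.
# Tuples compare column by column, so 500 sorts before 1000, which a
# concatenated string key such as "ABC*500*0" would not guarantee.
index = sorted([
    ("ABC", 500, 0),
    ("ABC", 500, 1),
    ("ABC", 501, 0),
    ("ABD", 100, 0),
])

def prefix_scan(index, prefix):
    """Return all keys whose leading columns equal `prefix`."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching key (if any).
    start = bisect.bisect_left(index, prefix)
    matches = []
    for key in index[start:]:
        if key[:len(prefix)] != prefix:
            break  # keys are sorted, so the matching range has ended
        matches.append(key)
    return matches

# WHERE serial_no='ABC' AND order_no=500 AND record_sequence=0
print(prefix_scan(index, ("ABC", 500, 0)))
# WHERE serial_no='ABC' AND order_no=500
print(prefix_scan(index, ("ABC", 500)))
```

This works only for leading columns of the index: a query on order_no alone cannot be turned into a single contiguous range.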

Why is the DSE Search Unique Key the Partition key in Cassandra?

I have a column family that I expose to some applications via DataStax Enterprise Search's Solr HTTP API. In some use cases, I thought it might be preferable to access the CQL layer directly.
When taking a closer look at the underlying data model, though, I see that the unique key in Solr is mapped to the partition key in Cassandra, without making use of compound keys with clustering columns.
Won't this produce a single wide row per partition?
And isn't that a "poor" data model for large data sets?
The unique key in your Solr schema should be a comma-separated list of all of the partition and clustering columns, enclosed within parentheses. Composite partition keys are supported as well as compound primary keys.
See the doc:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/srch/srchConfSkema.html
Yes, you do get a single wide storage row for each partition key, but it's your choice whether a column in your Cassandra primary key should be used as a clustering column or in the partition key. If you feel that your storage rows in Cassandra are too wide, move one of the clustering columns into a composite partition key, or add another column for that purpose.
Balancing the number of partitions and partition width is of course critical, but DSE/Solr is not restricting your choice.
