Multi-Column b-tree index logic?

I know how to implement a b-tree for single-column indexes, but how do I implement a b-tree for multi-column indexes in my RDBMS project?
For example, I have a table consisting of document records:
Documents
-------------
id
serial_no
order_no
record_sequence
If I make an index with 3 columns, for example:
CREATE UNIQUE INDEX myindex ON Documents (serial_no, order_no, record_sequence);
then I have a key name for my b-tree structure in this format:
serial_no*order_no*record_sequence.
I can look up a single record via this index with this query:
SELECT * FROM Documents WHERE serial_no='ABC' AND order_no=500 AND record_sequence=0;
Note: for this row I create the index entry ABC*500*0 as the b-tree key.
But when I fetch all the records of a document, for example:
SELECT * FROM Documents WHERE serial_no='ABC' AND order_no=500;
I cannot use my index to search for records, because record_sequence is missing from the query.
So what is the method for creating and searching multi-column indexes?
As far as I know, my b-tree object does not support searching for "ABC*500*ANY". I am using the RaptorDB_v2.7.5 b-tree object:
RaptorDB - the Document Store
NoSql, JSON based, Document store database with compiled .net map functions and automatic hybrid bitmap indexing and LINQ query filters
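One way to picture the missing piece, as a sketch rather than an answer specific to RaptorDB: with the flattened key format above, the query for serial_no='ABC' AND order_no=500 is a prefix search, and a prefix search over sorted keys is just a range scan. In SQL terms (key_text is a hypothetical column standing in for the stored b-tree key):
-- Hypothetical illustration only: key_text stands in for the flattened
-- 'serial_no*order_no*record_sequence' b-tree key.
SELECT *
FROM Documents
WHERE key_text >= 'ABC*500*'    -- first possible key for this prefix
  AND key_text <  'ABC*500+';   -- '+' is the character right after '*', so this closes the prefix range
The same range-scan idea applies inside the b-tree itself: seek to the first key not less than the prefix, then walk the leaves until a key falls outside the upper bound. For the string comparison to agree with numeric order, numeric columns such as order_no would need a fixed-width (zero-padded) encoding in the key.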

Related

What kind of index should I set up for a key/value pair table?

I have a db that supports a free-form addition of values to products. I'm using a key-value structure because individual products can have wildly different key/value pairs associated with them, otherwise I'd just make a column.
Given a table for products and a table for key-value pairs, I want to know what kind of indexes to set up to best support this.
Tables:
Products: productId(pk), name, category
ProductDetails: productId(fk), name, value(text)
Frequently used queries I want to be fast:
SELECT * from ProductDetails pd where pd.productId = NNN
SELECT * from ProductDetails pd where pd.name='advantages' and pd.value like '%forehead laser%'
I'd encourage you to comb through this answer: How to create composite primary key in SQL Server 2008
This all depends on your querying and data constraint needs. You can have a clustered index on just the productId and add multiple non-clustered indexes on other composite keys (ProductDetails.name and .value and/or productId too). These can also enforce the uniqueness of the data being inserted so you don't get duplicates.
Be aware, though, that there are diminishing returns when you add too many indexes to large tables that also take inserts and updates: the database has to maintain every index on each write.
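A rough T-SQL sketch of the kind of indexes described above; the index names are made up and the exact column choices are assumptions to adapt to your workload:
-- Illustrative only.
CREATE CLUSTERED INDEX IX_ProductDetails_ProductId
    ON ProductDetails (productId);              -- serves: WHERE pd.productId = NNN

CREATE NONCLUSTERED INDEX IX_ProductDetails_Name_ProductId
    ON ProductDetails (name, productId);        -- narrows: WHERE pd.name = 'advantages'
Note that the leading wildcard in LIKE '%forehead laser%' prevents a seek on value; the second index can narrow the rows to the matching name, but the contains-style match still scans within that range (full-text search is the usual tool if that becomes a bottleneck).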

Querying a High Cardinality Field

I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.
My partition key is the orderId, so getByOrderId is not a problem. But getByOrderNumber is - there's a one-to-one mapping between the orderId and the orderNumber (a high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering was that I could create a new table with the orderNumber as the partition key and the orderId as the only column (kind of a secondary index but maintained by me). So now, a getByOrderNumber query can be resolved in two calls.
Bear with me if the above solution is egregiously wrong, I am extremely new to Cassandra. As I understand, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought why not create another table that stores the mapping.
What would I be missing by managing this index myself? One thing I can see is that for every write, I'll now have to update two tables. Anything else?
I thought why not create another table that stores the mapping.
That's okay. From the Cassandra documentation:
Do not use an index in these situations: on high-cardinality columns, because you then query a huge volume of records for a small number of results. See "Problems using a high-cardinality column index" below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many distinct values, a query between the fields incurs many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient.
It would probably be more efficient to manually maintain the table as a form of an index instead of using the built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
It's normal for Cassandra data modelling to involve denormalized data.
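For illustration, a minimal CQL sketch of the manually maintained lookup table; the table and column names here are assumptions:
-- One row per order, keyed by the user-friendly number.
CREATE TABLE orders_by_number (
    order_number text PRIMARY KEY,
    order_id uuid
);

-- getByOrderNumber then becomes two single-partition reads:
SELECT order_id FROM orders_by_number WHERE order_number = 'ON-12345';
SELECT * FROM orders WHERE order_id = ?;   -- bind the id returned by the first query
Beyond doing two writes per order, the other cost is that you own the consistency of this "index" yourself; a logged batch over the two inserts is the usual way to keep the pair atomic.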

Cassandra: Why do I not have to include all partition keys in query?

Currently, I am working with Cassandra. A blog post I was reading says:
"When issuing a CQL query, you must include all partition key columns, at a minimum."
(https://shermandigital.com/blog/designing-a-cassandra-data-model/)
However, in my database it seems to be possible without including all partition key columns. Here is the table:
CREATE TABLE usertable (
    personid text,
    name text,
    "timestamp" timestamp,
    active boolean,
    PRIMARY KEY ((personid, name), timestamp)
) WITH CLUSTERING ORDER BY ("timestamp" DESC)
    AND comment=''
    AND read_repair_chance=0
    AND dclocal_read_repair_chance=0.1
    AND gc_grace_seconds=864000
    AND bloom_filter_fp_chance=0.01
    AND compaction={ 'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
                     'max_threshold':'32',
                     'min_threshold':'4' }
    AND compression={ 'chunk_length_in_kb':'64',
                      'class':'org.apache.cassandra.io.compress.LZ4Compressor' }
    AND caching={ 'keys':'ALL',
                  'rows_per_partition':'NONE' }
    AND default_time_to_live=0
    AND id='23ff16b0-c400-11e8-55c7-2b453518a213'
    AND min_index_interval=128
    AND max_index_interval=2048
    AND memtable_flush_period_in_ms=0
    AND speculative_retry='99PERCENTILE';
So I can do select * from usertable where personid = 'ABC-02';. However, according to the blog post, I have to include timestamp as well.
Can someone explain this?
In Cassandra, the partition key spreads data around the cluster: Cassandra computes the hash of the partition key and uses it to determine where the data lives in the cluster.
One exception is that if you use ALLOW FILTERING or a secondary index, you are not required to include all partition key columns in the WHERE clause.
For further information, take a look at this blog post:
The purpose of a partition key is to split the data into partitions where an entire partition is stored on a single node in the cluster (with each node storing many partitions). When data is read or written from the cluster, a function called Partitioner is used to compute the hash value of the partition key. This hash value is used to determine the node/partition which contains that row. The clustering key is used further to search for a row within a given partition.
Select queries in Apache Cassandra look a lot like select queries from a relational database. However, they are significantly more restricted. The attributes allowed in the ‘where’ clause of a Cassandra query must include the full partition key, and additional clauses may only reference the clustering key columns or a secondary index of the table being queried.
Requiring the partition key attributes in the ‘where’ helps Cassandra to maintain constant result-set retrieval time as the cluster is scaled out, by allowing Cassandra to determine the partition, and thus the node (and even data files on disk), that the query must be directed to.
If a query does not specify the values for all the columns from the primary key in the ‘where’ clause, Cassandra will not execute it and gives the following warning:
‘InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"’
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
According to your schema, your timestamp column is a clustering column (the sorting column), not part of the partition key. That's why it is not required.
(personid, name) are your partition key columns.
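A few illustrative queries against the usertable schema above (the name and timestamp values are made up):
-- Accepted: the full partition key (personid, name) is supplied; the clustering column is optional.
SELECT * FROM usertable WHERE personid = 'ABC-02' AND name = 'some-name';

-- Accepted: the clustering column narrows results within that partition.
SELECT * FROM usertable WHERE personid = 'ABC-02' AND name = 'some-name' AND "timestamp" >= '2018-01-01';

-- Only part of the partition key: rejected unless ALLOW FILTERING (or a secondary index) is used.
SELECT * FROM usertable WHERE personid = 'ABC-02' ALLOW FILTERING;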

How to deal with compound keys using DIH in Solr

I am importing data from a MySQL db into Solr documents. All is fine, but I have one table with a compound key (a pair of columns together as the primary key): the primary key for the post_locations table is (post_id, location_id).
But post_id is the primary key for my Solr document, so when data is imported from the post_locations table the location_ids are overwritten. Is it possible to get location_ids (which is of type int) as an array (as there can be more than one location_id for a post)?
For MySQL you can use GROUP BY and GROUP_CONCAT to get all the values for a field grouped together in a single column, separated by ,. You can then use the RegexTransformer and splitBy for that field to index the field as multiValued (in practice indexing it as an array). I posted an example of this in a previous answer. You might also do this by having dependent entity entries in DIH, but it will require more SQL queries than doing a GROUP BY and GROUP_CONCAT.
If you want one row for each entry, you can build a custom uniqueKey instead, using CONCAT to build the aggregate/compound key on the MySQL side.
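A rough MySQL sketch of both options; the table and column names follow the question, and the compound-key format is just an example:
-- Option 1: one Solr document per post, locations collected into one comma-separated column.
SELECT post_id,
       GROUP_CONCAT(location_id) AS location_ids
FROM post_locations
GROUP BY post_id;

-- Option 2: one Solr document per (post, location) pair, with a synthetic compound uniqueKey.
SELECT CONCAT(post_id, '_', location_id) AS id,
       post_id,
       location_id
FROM post_locations;
With option 1, the DIH entity would declare the RegexTransformer and give the location_ids field a splitBy="," attribute so it is indexed as a multiValued field.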

What are the different SQL Server index types?

Getting info on our table in Squirrel returns the index types as ints. I found Types of Indexes on Microsoft's site, but it has no mapping to numeric values.
I'm on Linux so I can't exactly pull up SQL Management Studio. Is there anywhere that actually maps the number values to Microsoft's named types?
Specifically, I want to know what index type 1 and index type 3 are.
There are different mappings available.
The sp_indexes stored procedure returns the following index types:
0 = Statistics for a table
1 = Clustered
2 = Hashed
3 = Other
On the other hand, the sys.indexes catalog view uses the following map:
0 = Heap
1 = Clustered
2 = Nonclustered
3 = XML
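The sys.indexes mapping can also be read straight off the server: the view exposes a type_desc column next to the numeric type, which answers the number-to-name question directly (replace the table name with your own):
-- Lists every index on a table with its numeric type and its name.
SELECT name AS index_name,
       type,
       type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.YourTable');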
In SQL Server 2005+, the types are (from the sys.indexes catalog view):
0 = Heap
1 = Clustered
2 = Nonclustered
3 = XML
4 = Spatial
Spatial indexes are available from SQL Server 2008 onward.
MSDN Page
Hash - With a hash index, data is accessed through an in-memory hash table. Hash indexes consume a fixed amount of memory, which is a function of the bucket count.
Memory-optimized nonclustered - For memory-optimized nonclustered indexes, memory consumption is a function of the row count and the size of the index key columns.
Clustered - A clustered index sorts and stores the data rows of the table or view in order based on the clustered index key. The clustered index is implemented as a B-tree index structure that supports fast retrieval of the rows, based on their clustered index key values.
Nonclustered - A nonclustered index can be defined on a table or view with a clustered index or on a heap. Each index row in the nonclustered index contains the nonclustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value. The rows in the index are stored in the order of the index key values, but the data rows are not guaranteed to be in any particular order unless a clustered index is created on the table.
Unique - A unique index ensures that the index key contains no duplicate values and therefore every row in the table or view is in some way unique.
Columnstore - An in-memory columnstore index stores and manages data by using column-based data storage and column-based query processing.
Columnstore indexes work well for data warehousing workloads that primarily perform bulk loads and read-only queries. Use the columnstore index to achieve up to 10x query performance gains over traditional row-oriented storage, and up to 7x data compression over the uncompressed data size.
Index with included columns - A nonclustered index that is extended to include nonkey columns in addition to the key columns.
Index on computed columns - An index on a column that is derived from the value of one or more other columns, or certain deterministic inputs.
Filtered - An optimized nonclustered index, especially suited to cover queries that select from a well-defined subset of data. It uses a filter predicate to index a portion of rows in the table. A well-designed filtered index can improve query performance, reduce index maintenance costs, and reduce index storage costs compared with full-table indexes.
Spatial - A spatial index provides the ability to perform certain operations more efficiently on spatial objects (spatial data) in a column of the geometry data type. The spatial index reduces the number of objects on which relatively costly spatial operations need to be applied.
XML - A shredded, and persisted, representation of the XML binary large objects (BLOBs) in the xml data type column.
Full-text - A special type of token-based functional index that is built and maintained by the Microsoft Full-Text Engine for SQL Server. It provides efficient support for sophisticated word searches in character string data.
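As a small, made-up T-SQL illustration of how a few of these combine (a filtered nonclustered index with included columns); the table, columns, and index name are assumptions:
CREATE NONCLUSTERED INDEX IX_Orders_OpenByCustomer
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalDue)     -- index with included columns
    WHERE Status = 'Open';            -- filtered index predicate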
The MSDN page's own list of index types covers: Clustered, Nonclustered, Unique, Index with included columns, Indexed views, Full-text, and XML.
