Efficiency of indexes for a field with low cardinality - database

For example, there is a nullable field in a Postgres database that stores an enum value, and that enum has only two values, A and B.
All of my SELECT queries have a WHERE clause on this field.
Will adding an index on this field be a good approach, or will it not improve performance, since each row contains either A, B, or NULL?
Is there a way I can improve the performance of all GET calls?
Please help.

No. In most cases, an index on a low-cardinality column (or on a set of columns with low combined cardinality) is useless. Instead, you could use a conditional (partial) index. As an example, here is my tweets table, with a handful of boolean columns:
twitters=# \d tweets
Table "public.tweets"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+---------
seq | bigint | | not null |
id | bigint | | not null |
user_id | bigint | | not null |
in_reply_to_id | bigint | | not null | 0
parent_seq | bigint | | not null | 0
sucker_id | integer | | not null | 0
created_at | timestamp with time zone | | |
fetch_stamp | timestamp with time zone | | not null | now()
is_dm | boolean | | not null | false
is_reply_to_me | boolean | | not null | false
is_retweet | boolean | | not null | false
did_resolve | boolean | | not null | false
is_stuck | boolean | | not null | false
need_refetch | boolean | | not null | false
is_troll | boolean | | not null | false
body | text | | |
zoek | tsvector | | |
Indexes:
"tweets_pkey" PRIMARY KEY, btree (seq)
"tweets_id_key" UNIQUE CONSTRAINT, btree (id)
"tweets_stamp_idx" UNIQUE, btree (fetch_stamp, seq)
"tweets_du_idx" btree (created_at, user_id)
"tweets_id_idx" btree (id) WHERE need_refetch = true
"tweets_in_reply_to_id_created_at_idx" btree (in_reply_to_id, created_at) WHERE is_retweet = false AND did_resolve = false AND in_reply_to_id > 0
"tweets_in_reply_to_id_fp" btree (in_reply_to_id)
"tweets_parent_seq_fk" btree (parent_seq)
"tweets_ud_idx" btree (user_id, created_at)
"tweets_userid_id" btree (user_id, id)
"tweets_zoek" gin (zoek)
Foreign-key constraints:
...
The "tweets_in_reply_to_id_created_at_idx" index only has entries for rows that fulfill the condition. Once the reference has been refetched (or the refetch has failed), the rows are removed from the index. So this index will usually only hold a few pending records.
A different example: a gender column. You'd expect a 50/50 distribution of male/female. Assuming a row size of ~100 bytes, there are ~70 rows on an 8K page. There will probably be both males and females on the same page, so even a search for males only or females only would need to read all the pages. (Having to read the index as well would make this worse, but the optimiser will wisely decide to ignore the index.) A clustered index may help, but will need a lot of maintenance work. Not worth the while.
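The page arithmetic above can be checked with a short calculation. This is a minimal sketch, assuming matching rows are spread independently across pages (real tables are often physically correlated) and using the ~70 rows per page figure from the answer. The function name fraction_of_pages_hit is made up for illustration.

```python
def fraction_of_pages_hit(match_fraction: float, rows_per_page: int = 70) -> float:
    """Expected fraction of heap pages holding at least one matching row.

    Assumes each row matches independently with probability match_fraction.
    """
    return 1.0 - (1.0 - match_fraction) ** rows_per_page


# 50/50 column: practically every page contains a match.
even_split = fraction_of_pages_hit(0.5)
# Very rare value (0.01% of rows): only a small share of pages has one.
rare_value = fraction_of_pages_hit(0.0001)

print(f"50% match:   {even_split:.6f} of pages")
print(f"0.01% match: {rare_value:.6f} of pages")
```

Under these assumptions, an index on a 50/50 column saves no page reads at all, while an index that isolates a very rare value lets the database skip almost all pages.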

An index just on that column is unlikely to be useful, unless the distribution of values is very skewed (e.g. 99% A, 0.99% NULL, 0.01% B). But in that case you would probably be better off with a partial index on some other field WHERE this_field='B'.
But even with a more uniform distribution of values (33.33% A, 33.33% NULL, 33.33% B) it could be useful to include that column as the leading column in some multicolumn indexes. For example, for WHERE this_field='A' AND other_field=7945, the index on (this_field, other_field) would generally be about 3 times more efficient than one on just (other_field) if the distribution of values is even.
Where it could make a huge difference is with something like WHERE this_field='A' ORDER BY other_field LIMIT 5. With the index on (this_field, other_field) it can jump right to the proper spot in the index, read off the first 5 rows (which pass checking for visibility) already in order, and then stop. If the index were just on (other_field) it might, if the two columns are not statistically independent of each other, have to skip over an arbitrary number of 'B' or NULL rows before finding 5 with 'A'.
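Both suggestions can be tried out in a few lines. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made so the example runs anywhere; both engines support partial and multicolumn indexes, but their planners differ). The table name t and the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, this_field TEXT, other_field INTEGER)"
)

# Skewed data: mostly 'A', some NULL, very few 'B'.
rows = []
for i in range(10_000):
    if i % 1000 == 0:
        value = "B"
    elif i % 100 == 1:
        value = None
    else:
        value = "A"
    rows.append((value, i))
conn.executemany("INSERT INTO t (this_field, other_field) VALUES (?, ?)", rows)

# Partial index on another field: only the rare 'B' rows get index entries.
conn.execute("CREATE INDEX t_other_b_idx ON t (other_field) WHERE this_field = 'B'")
# Multicolumn index with the low-cardinality column leading.
conn.execute("CREATE INDEX t_this_other_idx ON t (this_field, other_field)")

rare = conn.execute(
    "SELECT id FROM t WHERE this_field = 'B' AND other_field = 7000"
).fetchall()

top5 = conn.execute(
    "SELECT other_field FROM t WHERE this_field = 'A' ORDER BY other_field LIMIT 5"
).fetchall()

# Show what the SQLite planner chose; the output depends on the SQLite version.
for sql in (
    "SELECT id FROM t WHERE this_field = 'B' AND other_field = 7000",
    "SELECT other_field FROM t WHERE this_field = 'A' ORDER BY other_field LIMIT 5",
):
    print(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
```

In Postgres the same statements would be checked with EXPLAIN instead of EXPLAIN QUERY PLAN.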

If NULLs are rare in the column, you can partition the table on this field. Queries that use the field in a condition will then automatically process only the required partition, without any additional indexes.

Related

Why does Solr change record position after updating a field

I am new to Solr and ran into odd behavior when I update a field and then search.
Here's the scenario:
I have 300 records in my core, and a search query that filters the results with
fq=IsSoldHidden:false AND IsDeleted:false AND StoreId:60
and sorts them by DateInStock asc.
Everything returns the results I expect.
Here are the top 3 results of my query:
--------------------------------------------------------------------------------------
id | Price | IsSoldHidden | IsDeleted | StoreId | StockNo | DateInStock
--------------------------------------------------------------------------------------
27236 | 15000.0 | false | false | 60 | A00059 | 2021-06-07T00:00:00Z
--------------------------------------------------------------------------------------
37580 | 0.0 | false | false | 60 | M9202 | 2021-06-08T00:00:00Z
--------------------------------------------------------------------------------------
37581 | 12000 | false | false | 60 | M9173 | 2021-06-08T00:00:00Z
But when I update the Price field of the 2nd row (with an atomic update, to be specific) and run the search again with the same filters, the results change to this:
--------------------------------------------------------------------------------------
id | Price | IsSoldHidden | IsDeleted | StoreId | StockNo | DateInStock
--------------------------------------------------------------------------------------
27236 | 15000.0 | false | false | 60 | A00059 | 2021-06-07T00:00:00Z
--------------------------------------------------------------------------------------
37581 | 0.0 | false | false | 60 | M9173 | 2021-06-08T00:00:00
--------------------------------------------------------------------------------------
37582 | 0.0 | false | false | 60 | M1236 | 2021-06-08T00:00:00Z
The 2nd row of the first result set (37580) has moved to the last position (document #300).
I searched online and found this:
Solr changes document's score when its random field value altered
but I think that situation is different from mine, since I do not sort by score.
I am not sure why it behaves like this.
Am I missing something?
Can anyone explain it?
Thanks in advance.
Since the dates are identical, their internal sort order depends on their position in the index.
Updating the document marks the original document as deleted and adds a new document at the end of the index, so its position in the index changes.
If you want to have it stable, sort by date and id instead - that way the lower id will always be first when the dates are identical, and the sort will be stable.
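The effect can be reproduced outside Solr. The following is a toy model in plain Python, not Solr code: a list stands in for the index, and an update is modeled as removing the old copy and appending the new one at the end, as described above. The function names are made up for illustration, and the ids and dates come from the question.

```python
docs = [
    {"id": 27236, "DateInStock": "2021-06-07T00:00:00Z"},
    {"id": 37580, "DateInStock": "2021-06-08T00:00:00Z"},
    {"id": 37581, "DateInStock": "2021-06-08T00:00:00Z"},
]


def reindex_after_update(index, doc_id):
    """Mimic an update: drop the old copy and append the new one at the end."""
    updated = next(d for d in index if d["id"] == doc_id)
    return [d for d in index if d["id"] != doc_id] + [dict(updated)]


def sort_by_date(index):
    # Python's sort is stable, so ties keep their position in the index.
    return [d["id"] for d in sorted(index, key=lambda d: d["DateInStock"])]


def sort_by_date_then_id(index):
    # The id breaks ties, so index position no longer matters.
    return [d["id"] for d in sorted(index, key=lambda d: (d["DateInStock"], d["id"]))]


after = reindex_after_update(docs, 37580)

print(sort_by_date(docs))           # order before the update
print(sort_by_date(after))          # tie order changes after the update
print(sort_by_date_then_id(after))  # tie-breaker restores a fixed order
```

In Solr itself, the tie-breaker would be written as sort=DateInStock asc, id asc. If id is a string field, it compares lexicographically, which is still deterministic.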

Create/Update table in MS Access dynamically

EDIT:
Here's what I have: An Access database made up of 3 tables linked from SQL server. I need to create a new table in this database by querying the 3 source tables. Here are examples of the 3 tables I'm using:
PlanTable1
+------+------+------+------+---------+---------+
| Key1 | Key2 | Key3 | Key4 | PName | MainKey |
+------+------+------+------+---------+---------+
| 53 | 1 | 5 | -1 | Bikes | 536681 |
| 53 | 99 | -1 | -1 | Drinks | 536682 |
| 53 | 66 | 68 | -1 | Balls | 536683 |
+------+------+------+------+---------+---------+
SpTable
+----+---------+---------+
| ID | MainKey | SpName |
+----+---------+---------+
| 10 | 536681 | Wing1 |
| 11 | 536682 | Wing2 |
| 12 | 536683 | Wing3 |
+----+---------+---------+
LocTable
+-------+-------------+--------------+
| LocID | CenterState | CenterCity |
+-------+-------------+--------------+
| 10 | IN | Indianapolis |
| 11 | OH | Columbus |
| 12 | IL | Chicago |
+-------+-------------+--------------+
You can see the relationships between the tables. The NewMasterTable I need to create based off of these will look something like this:
NewMasterTable
+-------+--------+-------------+------+--------------+-------+-------+-------+
| LocID | PName | CenterState | Key4 | CenterCity | Wing1 | Wing2 | Wing3 |
+-------+--------+-------------+------+--------------+-------+-------+-------+
| 10 | Bikes | IN | -1 | Indianapolis | 1 | 0 | 0 |
| 11 | Drinks | OH | -1 | Columbus | 0 | 1 | 0 |
| 12 | Balls | IL | -1 | Chicago | 0 | 0 | 1 |
+-------+--------+-------------+------+--------------+-------+-------+-------+
The hard part for me is making this new table dynamic. In the future, rows may be added to the source tables. I need my NewMasterTable to reflect any changes/additions to the source. How do I go about building the NewMasterTable as described? Does this make any sort of sense?
Since an Access table is a hard requirement, probably the only way to go about it is to create a set of Update and Insert queries that are executed periodically. There is no built-in "dynamic" feature of Access that will monitor and update the table.
First, create the table. You could either 1) do this manually from scratch by defining the columns and constraints yourself, or 2) create a make-table query (i.e. SELECT... INTO) that generates most of the schema, then add any additional columns, edit necessary details and add appropriate indexes.
Define and save Update and Insert (and optional Delete) queries to keep the table synced. I'm not sharing actual code here, because that goes beyond your primary issue I think and requires specifics that you need to define. Due to some ambiguity with your key values (the field names and sample data still are not sufficient to reveal precise relationships and constraints), it is likely that you'll need multiple Update statements.
In particular, the "Wing" columns will likely require a transform statement.
You may not be able to update all columns appropriately using a single query. I recommend not trying to force such an "artificial" requirement. Multiple queries can actually be easier to understand and maintain.
In the event that you experience "query is not updateable" errors, you may need to define other "temporary" tables with appropriate indexes, into which you do initial inserts from the linked tables, then subsequent queries to update your master table from those.
Finally, and I think this is the key to solving your problem, you need to define some Access form (or other code) that periodically runs your set of "sync" queries. Access forms have a [Timer Interval] property and corresponding Timer event that fires periodically. Add VBA code in the Form_Timer sub that runs all your queries. I would suggest "wrapping" such VBA in a transaction and adding appropriate error handling and error logging, etc.

SQLite database structure for time series

My VB.NET application produces simulation data which I want to store in an SQLite Database. The data consists of hundreds of variables which have values for up to 50k time steps (occurrences/measurements). Number of variables is variable. Time steps can vary from 10 to 50k.
So far, I have one table. The first column contains the timestamp (primary key) and the following contain variable values for each variable (column name = variable name). The rows are filled with the timestamps and the variable values for each time step:
timestamp | var1 | var2 | var3 | ...
----------------------------------------------
1 | var1(1) | var2(1) | var3(1) | ...
2 | var1(2) | var2(2) | var3(2) | ...
3 | var1(3) | var2(3) | var3(3) | ...
... | ... | ... | ... | ...
I use:
CREATE TABLE variables(timestamp INTEGER PRIMARY KEY, var1 REAL, var2 REAL, ...);
This works. I use the database to save the simulation data for later evaluation. I need to plot selected time series and copy the values of some variables for specific time spans to Excel (calculate sums, maxima, etc.).
I've read not to add too many columns (I may have more than 500 variables/columns). Regarding performance, is it better to structure it differently? For example one table with four columns: ID (primary key), timestamp, variable name and variable value.
ID | timestamp | varName | varValue
------------------------------------
1 | 1 | var1 | var1(1)
2 | 2 | var1 | var1(2)
...| ... | ... | ...
50 | 50 | var1 | var1(50)
51 | 1 | var2 | var2(1)
52 | 2 | var2 | var2(2)
...| ... | ... | ...
In this case I would have 50k time steps * 500 variables = 25 million rows, but a fixed number of columns. Is there any better way?
What happens regarding performance (for read queries) if I have inserted the rows not in ascending timestamp order?
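For reference, the four-column layout proposed above can be sketched as follows. This is an illustration only, written with Python's sqlite3 module rather than VB.NET (an assumption made for brevity). The table name samples and the index on (varName, timestamp) are additions that do not appear in the question. Rows are deliberately inserted out of timestamp order, and the query relies on ORDER BY to return them in order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    " ID INTEGER PRIMARY KEY,"
    " timestamp INTEGER NOT NULL,"
    " varName TEXT NOT NULL,"
    " varValue REAL)"
)
# Hypothetical index so that one variable's time span can be located directly.
conn.execute("CREATE INDEX samples_var_time_idx ON samples (varName, timestamp)")

# Rows inserted out of timestamp order on purpose.
rows = [
    (3, "var1", 0.3),
    (1, "var1", 0.1),
    (2, "var2", 2.2),
    (2, "var1", 0.2),
    (1, "var2", 2.1),
]
conn.executemany(
    "INSERT INTO samples (timestamp, varName, varValue) VALUES (?, ?, ?)", rows
)

# One variable over a time span, returned in timestamp order.
series = conn.execute(
    "SELECT timestamp, varValue FROM samples"
    " WHERE varName = ? AND timestamp BETWEEN ? AND ?"
    " ORDER BY timestamp",
    ("var1", 1, 2),
).fetchall()
print(series)
```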

Why does Neo4j hit every indexed record when only returning a count?

I am using version 3.0.3, and running my queries in the shell.
I have ~58 million record nodes with 4 properties each, specifically an ID string, an epoch time integer, and lat/lon floats.
When I run a query like profile MATCH (r:record) RETURN count(r); I get a very quick response:
+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
29 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| +ProduceResults | 7644 | 1 | 0 | count(r) | count(r) |
| | +----------------+------+---------+-----------+--------------------------------+
| +NodeCountFromCountStore | 7644 | 1 | 0 | count(r) | count( (:record) ) AS count(r) |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
Total database accesses: 0
The Total database accesses: 0 and NodeCountFromCountStore tells me that neo4j uses a counting mechanism here that avoids iterating over all the nodes.
However, when I run profile MATCH (r:record) WHERE r.time < 10000000000 RETURN count(r);, I get a very slow response:
+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
151278 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| +ProduceResults | 1324 | 1 | 0 | count(r) | count(r) |
| | +----------------+----------+----------+-----------+------------------------------+
| +EagerAggregation | 1324 | 1 | 0 | count(r) | |
| | +----------------+----------+----------+-----------+------------------------------+
| +NodeIndexSeekByRange | 1752922 | 58430739 | 58430740 | r | :record(time) < { AUTOINT0} |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
Total database accesses: 58430740
The count is correct, as I chose a time value larger than all of my records. What surprises me here is that Neo4j is accessing EVERY single record. The profiler states that Neo4j is using the NodeIndexSeekByRange as an alternative method here.
My question is, why does Neo4j access EVERY record when all it is returning is a count? Are there no intelligent mechanisms inside the system to count a range of values after seeking the boundary/threshold value within the index?
I use Apache Solr for the same data, and returning a count after searching an index is extremely fast (about 5 seconds). If I recall correctly, both platforms are built on top of Apache Lucene. While I don't know much about that software internally, I would assume that the index support is fairly similar for both Neo4j and Solr.
I am working on a proxy service that will deliver results in a paginated form (using the SKIP n LIMIT m technique) by first getting a count, and then iterating over results in chunks. This works really well for Solr, but I am afraid that Neo4j may not perform well in this scenario.
Any thoughts?
The latter query does a NodeIndexSeekByRange operation. It goes through all matched nodes with the record label, looks up the value of each node's time property, and checks whether that value is less than 10000000000.
So this query actually has to fetch every node and read its data for the comparison, which is why it is much slower.

Magento - What are the differences between an EAV attribute in the entity table and an attribute in its type table

I have the following question:
In the EAV data structure, an attribute can be stored either in the entity table or in the table for its associated type.
For Example:
catalog_product_entity
| entity_id | entity_type_id | attribute_set_id | type_id | sku | has_option | ....
| 1 | 4 | 34 | simple | 0912132 | 0 |
catalog_product_entity_datetime
| value_id | entity_type_id | attribute_id | store_id | entity_id | value
| 1 | 4 | 71 | 0 | 1 | NULL
catalog_product_entity_decimal
| value_id | entity_type_id | attribute_id | store_id | entity_id |value
| 1 | 4 | 69 | 1 | 1 | 29.009
You can find the attribute sku in the flat table catalog_product_entity, and the attributes with attribute_id 71 and 69 in the tables catalog_product_entity_datetime and catalog_product_entity_decimal.
Why is the attribute sku in the entity table? Is it for optimization reasons?
If I want to speed up the loading of an attribute, could I put it in the entity table? What is a reasonable number of attributes in the entity table?
These attributes (with type static) are necessary in order to do a proper save of any EAV object without overwriting the save() method or other methods involved in the save for each entity type.
You can take a look at this method Mage_Eav_Model_Entity_Abstract::_collectSaveData(). All EAV entity models extend Mage_Eav_Model_Entity_Abstract and that method is called when saving an instance.
All the static attributes are added to the array $entityRow. This means that they will be saved in the main table, but you still have the option to attach a backend_model to the static attribute that will process the value before the actual save is done.

Resources