I have a table with 3 columns: "Id", "A", "B".
All of them are searchable. Id is an identity column used only to look up exact rows, so that one is clear. But I have doubts about "A" and "B". There are 3 cases to search in my application: search by "A", search by "B", and search by "A" and "B" simultaneously. So I'm not sure which index type to choose. Should I use two single-column indexes or one multi-column index? Or maybe it's better to combine single-column indexes with a multi-column one (3 indexes in total)? I don't really care about INSERT/UPDATE/DELETE duration; my top priority is to make SELECT as fast as possible.
I use SQL Server 2017.
Thank you.
I think two additional indexes will be enough:
CREATE INDEX IDX_YourTable_AB ON YourTable(A,B) -- put the column with more distinct values first
CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A)
If you have other columns in this table, you can add them as included columns:
CREATE INDEX IDX_YourTable_AB ON YourTable(A,B) INCLUDE(C,D,E,...)
CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A,C,D,E,...)
Index IDX_YourTable_AB can be used for conditions such as WHERE A='...', WHERE A='...' AND B='...', or WHERE A LIKE '...%' AND B='...' - that is, filters on column A alone or on A and B together.
Index IDX_YourTable_B can be used for conditions on column B alone (WHERE B='...' or WHERE B LIKE '...%').
Also test CREATE INDEX IDX_YourTable_BA ON YourTable(B,A) instead of CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A); it may perform better.
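A quick way to compare the variants is to run your three search cases with I/O and time statistics enabled (or look at the actual execution plans) and see which indexes get picked and how many reads they need. A sketch; the literal search values are placeholders:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Case 1: search by A (expected to use IDX_YourTable_AB)
SELECT Id, A, B FROM YourTable WHERE A = 'a-value';

-- Case 2: search by B (expected to use IDX_YourTable_B or IDX_YourTable_BA)
SELECT Id, A, B FROM YourTable WHERE B = 'b-value';

-- Case 3: search by A and B (expected to use IDX_YourTable_AB)
SELECT Id, A, B FROM YourTable WHERE A = 'a-value' AND B = 'b-value';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;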
I usually use a jsonb field to store array data.
For example, to store customers' barcode info, I create a table like this:
create table customers(fcustomerid bigint, fcodes jsonb);
One customer has one row, and all of its barcode info is stored in the fcodes field, like below:
[
  {
    "barcode":"000000001",
    "codeid":1,
    "product":"Coca Cola",
    "createdate":"2021-01-19",
    "lottorry":true,
    "lottdate":"2021-01-20",
    "bonus":50
  },
  {
    "barcode":"000000002",
    "codeid":2,
    "product":"Coca Cola",
    "createdate":"2021-01-19",
    "lottorry":false,
    "lottdate":"",
    "bonus":0
  }
  ...
  {
    "barcode":"000500000",
    "codeid":500000,
    "product":"Pepsi Cola",
    "createdate":"2021-01-19",
    "lottorry":false,
    "lottdate":"",
    "bonus":0
  }
]
The jsonb array may store millions of barcode objects with the same structure. Perhaps this is not a good idea, but you know, when I have thousands of customers I can store all the data in one table, with one row per customer and all of a customer's data in one field; it looks very compact and easy to manage.
For this kind of application scenario, how do I efficiently insert, modify, or query the data?
I can use jsonb_insert to insert one object, like this:
update customers
set fcodes=jsonb_insert(fcodes,'{-1}','{...}'::jsonb, true)
where fcustomerid=999;
When I want to modify an object, I find it a little difficult: I need to know the index of the object in the array first. If I use the incremental key codeid as the array index, things look easy. I can use jsonb_set, like below:
update customers
set fcodes=jsonb_set(fcodes,concat('{',(mycodeid-1)::text,',lottorry}'),'true'::jsonb)
where fcustomerid=999;
But if I want to query the objects in the jsonb array by createdate, bonus, lottorry, or product, I have to use jsonpath operators, like:
select jsonb_path_query_array(fcodes,'$[*] ? (@.product=="Pepsi Cola")')
from customers
where fcustomerid=999;
or like:
select jsonb_path_query_array(fcodes,'$[*] ? (@.lottdate.datetime()>="2021-01-01".datetime() && @.lottdate.datetime()<="2021-01-31".datetime())')
from customers
where fcustomerid=999;
The jsonb index looks useful, but it helps when filtering across different rows, while my operations mostly work inside a single row's jsonb field.
I am very worried about efficiency: with millions of objects stored in one row's jsonb field, is this a good idea? And how can I improve efficiency in this scenario, especially for queries?
You are right to worry. With a huge JSON like that, you will never get good performance.
Your data don't need JSON at all. Create a table that stores a single barcode and has a foreign key reference to customers. Then everything will be simple and efficient.
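As a sketch of what that could look like (all names and types here are guesses based on your sample data, not a prescription):

create table customers(
    fcustomerid bigint primary key
);

create table barcodes(
    codeid      bigint primary key,
    fcustomerid bigint not null references customers,
    barcode     text   not null,
    product     text,
    createdate  date,
    lottorry    boolean,
    lottdate    date,
    bonus       integer
);

-- ordinary B-tree indexes then support the per-customer lookups directly
create index on barcodes (fcustomerid, product);

select barcode, bonus
from barcodes
where fcustomerid = 999
  and product = 'Pepsi Cola';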
Using JSON in the database is almost always the wrong choice, judging from the questions in this forum.
I'm currently researching how to create indexes for our tables.
I found out about multicolumn indexes, but I'm not sure about their impact.
Example:
We have queries for findById, findByIdAndStatus, and findByResult.
What I read says that the column used most often in WHERE clauses should be listed first in the column list. But I was wondering whether it makes a big difference if I create indexes for the different combinations of WHERE clauses.
This: (creating one index for all)
CREATE INDEX CONCURRENTLY ON Students (id, status, result)
vs.
This: (creating different indexes on different queries)
CREATE INDEX CONCURRENTLY ON Students (id)
CREATE INDEX CONCURRENTLY ON Students (id, status)
CREATE INDEX CONCURRENTLY ON Students (result)
Thank you so much in advance!
Creating one index for all and creating different indexes will have a completely different impact on the queries.
You can use EXPLAIN to see if indexes are getting used for the queries.
This video is really good for learning about DB indexes.
The index CREATE INDEX CONCURRENTLY ON Students (id, status, result) will be used only if the query filters on id, on (id, status), or on (id, status, result) in the WHERE clause. A query that filters only on status will not use this index at all.
Indexes are basically balanced trees (B-trees). A multicolumn index orders rows by id; rows with the same id are then further ordered by status, and then by result, and so on.
You can see that in such an index there is no standalone ordering by status; it exists only within groups of rows that share the same id.
Do have a look at the video; it explains all this pretty well.
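You can verify this yourself with EXPLAIN; a sketch (the exact plans depend on your data and settings, and the filter values are placeholders):

-- Can use the (id, status, result) index: the leading column id is filtered on.
EXPLAIN SELECT * FROM Students WHERE id = 42 AND status = 'ACTIVE';

-- Cannot seek into the (id, status, result) index: the leading column id is missing,
-- so expect a sequential scan unless a separate index on status exists.
EXPLAIN SELECT * FROM Students WHERE status = 'ACTIVE';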
The rule of thumb you read is wrong.
A better rule is: create such an index only if it is useful and gets used often enough that it is worth the performance hit on data modification that comes with every index.
A multi-column B-tree index on (a, b, c) is useful in several cases:
if the query looks like this:
SELECT ... FROM tab
WHERE a = $1 AND b = $2 AND c <operator> $3
where <operator> is an operator supported by the index and $1, $2 and $3 are constants.
if the query looks like this:
SELECT ... FROM tab
WHERE a = $1 AND b = $2
ORDER BY c;
or like this
SELECT ... FROM tab
WHERE a = $1
ORDER BY b, c;
Any decorations in the ORDER BY clause must be reflected in the CREATE INDEX statement. For example, for ORDER BY b, c DESC the index must be created on (a, b, c DESC) or (a, b DESC, c) (indexes can be read in both directions).
if the query looks like this:
SELECT c
FROM tab
WHERE a = $1 AND b <operator> $2;
If the table is newly VACUUMed, this can get you an index only scan, because all required information is in the index.
In recent PostgreSQL versions, such an index is better created as
CREATE INDEX ON tab (a, b) INCLUDE (c);
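A small sketch of that last case (placeholder table and column names; PostgreSQL 11 or later for INCLUDE):

CREATE TABLE tab (a integer, b integer, c integer);
CREATE INDEX ON tab (a, b) INCLUDE (c);

-- after loading data:
VACUUM tab;

-- with an up-to-date visibility map, EXPLAIN should show an "Index Only Scan",
-- because a, b and c can all be read from the index
EXPLAIN SELECT c FROM tab WHERE a = 1 AND b < 100;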
My questions are: which indexes are used, in what order, and why, for the following sample?
Query:
SELECT House
FROM myTable
WHERE 1=1
and City='myCity'
and Street='myStreet'
and Color='myColor'
Indexes:
Ind1: City
Ind2: Street
Ind3: Color
Ind4: Street,Color
It depends... The server keeps statistics, so it will choose the index with the most effective filtering. For example, if
City='myCity' returns 100 rows,
Street='myStreet' returns 1,000 rows, and
Color='myColor' returns 10,000 rows,
then the City index will be used. This logic is valid for composite indexes as well.
The optimizer will try to get the smallest set first; the other filters are then applied to that set.
This requires up-to-date statistics, otherwise the wrong index might be used.
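To see which of Ind1-Ind4 the optimizer actually picks for this query, look at the execution plan. A sketch (EXPLAIN syntax varies between database systems; in SQL Server you would look at the graphical or estimated execution plan instead):

-- refresh statistics first if in doubt (ANALYZE in PostgreSQL, UPDATE STATISTICS in SQL Server)
EXPLAIN
SELECT House
FROM myTable
WHERE City = 'myCity'
  AND Street = 'myStreet'
  AND Color = 'myColor';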
I am going to express the idea in SQL:
SELECT key,value
FROM table1
WHERE value > 10
Or do we always need to know the key?
I suppose you can use secondary indexes, which are available since version 0.7 of Cassandra.
You might also checkout the following answer: Cassandra and Secondary-Indexes, how do they work internally?
It is recommended to use secondary indexes only for low-cardinality columns, which means columns that do not have many different values (e.g. columns like 'status' or 'priority', which usually have only a handful of different values such as 'high', 'medium', 'low').
In case you are using Hector as your Cassandra client, you can find information on how to use them here:
https://github.com/rantav/hector/wiki/User-Guide
Yes, of course; for example, you can use:
select * from CF where value = 10
If you use the Hector API (e.g. CqlQuery), you can get a list of rows back from this query.
Note that currently, for secondary indexes, you must have at least one equality condition, so your query with just value > 10 would not work. See this question.
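In CQL terms the idea looks roughly like this (a sketch with made-up names; in the Cassandra 0.7 era you would issue the equivalent calls through Thrift or Hector rather than CQL):

CREATE TABLE cf (
    key   text PRIMARY KEY,
    value int
);

-- secondary index on the value column
CREATE INDEX ON cf (value);

-- works: equality on the indexed column
SELECT key, value FROM cf WHERE value = 10;

-- does NOT work on its own: a pure range predicate needs at least one
-- equality condition on an indexed column alongside it
-- SELECT key, value FROM cf WHERE value > 10;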
I have two tables.
In one table there are two columns: one has the ID and the other the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify which of the acronyms from table 2 are present in it.
Finally, the program will create a separate table whose first column contains the content of the first column of table 1 (i.e. the ID) along with the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl, or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time is roughly the product of the number of documents, the words per abstract, and the number of acronyms, because acronyms.index performs a linear search (of our largest list, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
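For instance, in a database that understands standard string concatenation, the join could be sketched like this (table and column names are assumptions, and the space-padded LIKE is only a crude word match that ignores punctuation):

-- match every acronym that appears, surrounded by spaces, in an abstract
SELECT d.id, a.acronym
FROM documents d
JOIN acronyms a
  ON ' ' || d.abstract || ' ' LIKE '% ' || a.acronym || ' %';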
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc. However, it did not work in Microsoft Access.
The second and third snippets are Python, I assume. Can I feed the acronyms and documents as input files?
babru
It didn't work in Access because it was only pseudo SQL (explode() is not a real function) and because Access references identifiers differently (e.g. acronym.[id]); you would need a LIKE-based join instead.
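A rough Access-flavoured version of the same join might look like this (a sketch with assumed table and column names; Access SQL uses & for concatenation, * as the wildcard in its default query mode, and brackets around identifiers):

SELECT document.[id], acronym.[value]
FROM document, acronym
WHERE ' ' & document.[abstract] & ' ' LIKE '* ' & acronym.[value] & ' *';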