How to get max value from a Cassandra table with a WHERE clause

I have a little design problem. I have the following query:
SELECT MAX(idt) FROM table WHERE idt < 2018
but I can't figure out how to create the table to support this query.
idt must be a clustering key to be able to do greater-than or lower-than operations as well as the MAX aggregation, but I don't know what I should use as the partition key (I don't want to use ALLOW FILTERING).
The only solution I've found is to use a constant value as the partition key, but I know that's considered bad design.
Any help?
Thank you,

You will need to partition your data somehow. If you do not, you are left with exactly what you describe: either read everything from the whole cluster (ALLOW FILTERING) or put everything in a single partition (constant key). Not knowing anything about your data, design or goals, a common setup is to partition by date, like:
SELECT id FROM table WHERE bucket = '2018' AND id < 100 limit 1;
Then your key would look like ((bucket), id), ordering id DESC so the largest value sits at the head of the partition. In this case buckets are by year, so you end up making one query per year that you're looking for. If idt is not unique you might need to do something like:
((uuid), idt) or ((bucket), uuid, idt), sorting by idt DESC (once again, issues if it is not unique for that record); a table sketch along these lines is at the end of this answer. Then you can do things like
SELECT max(idt) FROM table GROUP BY bucket
although it is still better to do
SELECT max(idt) FROM table WHERE bucket = '2018' GROUP BY bucket
which will give you the max per bucket, so you would have to page through the results and compute the global max yourself. But that is better for the cluster, as it naturally throttles the work a little versus a single query slamming the whole cluster. On that query it might also be a good idea to limit the fetch size to something like 10 or 100 instead of the default 5000, so the result set pages more slowly (preventing too much work on the coordinator).
To have all of this work done somewhere else, you might want to consider Spark, as it can give you much richer queries and will do them as efficiently as it can (which might not be efficient, but it will try).
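As a minimal sketch of the bucketed layout described above (table and column names are illustrative):
CREATE TABLE IF NOT EXISTS events_by_bucket (
    bucket text,   -- illustrative partition key, e.g. a year such as '2018'
    idt int,       -- the value you want the max of
    PRIMARY KEY ((bucket), idt)
) WITH CLUSTERING ORDER BY (idt DESC);

-- because idt is clustered descending, the first row returned is that bucket's max below 2018
SELECT idt FROM events_by_bucket WHERE bucket = '2018' AND idt < 2018 LIMIT 1;
You run that once per bucket you care about and take the largest result yourself, as described above.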

Related

Removing PAGELATCH with randomized ID instead of GUID

We have two tables which receive 1 million+ insertions per minute. These tables are heavily indexed, and the indexes can't be removed because they support business requirements. Due to such a high volume of insertions, we are seeing PAGELATCH_EX and PAGELATCH_SH waits. These latches further slow down insertions.
A commonly accepted solution is to change the identity column to a GUID so that insertions land on a random page every time. We could do this, but changing the IDs would require a development cycle for migration scripts so that existing production data can be changed.
I tried another approach which seems to be working well in our load tests. Instead of changing to GUIDs, we now generate IDs in a randomized pattern using the following logic:
DECLARE @ModValue INT;
-- pick one of 14 multipliers based on the current nanosecond value; half produce large positive IDs, half large negative ones
SELECT @ModValue = (DATEPART(NANOSECOND, GETDATE()) % 14);
INSERT xxx(id)
SELECT NEXT VALUE FOR Sequence * (@ModValue + IIF(@ModValue IN (0,1,2,3,4,5,6), 100000000000, -100000000000));
It has eliminated the PAGELATCH_EX and PAGELATCH_SH waits and our insertions are quite fast now. I also think a GUID as the PK of such a critical table is less efficient than a bigint ID column.
However, some team members are skeptical about this, since randomly generated IDs with negative values are not a common solution. Also, there is an argument that the support team may struggle with large negative IDs: a common habit of writing select * from table order by 1 will need to change.
I am wondering what the community's take on this solution is. If you could point out any disadvantages of the suggested approach, that would be highly appreciated.
However, some team members are skeptical about this, since randomly generated IDs with negative values are not a common solution
You have an uncommon problem, and so uncommon solutions might need to be entertained.
Also, there is an argument that the support team may struggle with large negative IDs. A common habit of writing select * from table order by 1 will need to change.
Sure. The system as it exists now has a high (but not perfect) correlation between IDs and time. That is, in general a higher ID for a row means that it was created after one with a lower ID. So it's convenient to order by IDs as a stand-in for ordering by time. If that's something that they need to do (i.e. order the data by time), give them a way to do that in the new proposal. Conversely, play out the hypothetical scenario where you're explaining to your CTO why you didn't fix performance on this table for your end users. Would "so that our support personnel don't have to change the way they do things" be an acceptable answer? I know that it wouldn't be for me but maybe the needs of support outweigh the needs of end users in your system. Only you (and your CTO) can answer that question.
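If ordering by time is what support actually needs, one sketch of giving them that in the new proposal (the CreatedAt column, constraint and index names here are hypothetical and not part of the original post; note too that an index on an ever-increasing datetime key can itself become an insert hotspot, so test it under your load):
ALTER TABLE xxx
ADD CreatedAt datetime2(3) NOT NULL
    CONSTRAINT DF_xxx_CreatedAt DEFAULT SYSUTCDATETIME();  -- hypothetical column and constraint

CREATE NONCLUSTERED INDEX IX_xxx_CreatedAt ON xxx (CreatedAt);  -- lets support sort by time

-- support can then write this instead of relying on ORDER BY 1:
SELECT TOP (100) * FROM xxx ORDER BY CreatedAt DESC;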

SQL Server - what kind of index should I create?

I need to make queries such as
SELECT
    Url, COUNT(*) AS requests, AVG(TS) AS avg_timeSpent
FROM
    myTable
WHERE
    Url LIKE '%/myController/%'
GROUP BY
    Url
run as fast as possible.
The selected and grouped columns are almost always the same; the only difference is an extra column in the SELECT and GROUP BY (the tenantId column).
What kind of index should I create to help me run this scenario?
Edit 1:
If I change my base query to use '/myController/%' (note there is no % at the beginning), would it be better?
This is a query that cannot be sped up with an index. The DBMS cannot know beforehand how many records will match the condition. It may be 100% or 0.001%. There is no clue for the DBMS to guess this. And access via an index only makes sense when a small percentage of rows gets selected.
Moreover, how can such an index be structured and useful? Think of a telephone book and you want to find all names that contain 'a' or 'rs' or 'ems' or whatever. How would you order the names in the book to find all these and all other thinkable letter combinations quickly? It simply cannot be done.
So the DBMS will read the whole table, record by record, no matter whether you provide an index or not.
There may be one exception: with an index on Url and TS, you'd have both columns in the index, so the DBMS might decide to read the whole index rather than the whole table. This may make sense, for instance, when the table has hundreds of columns or when the table is very fragmented; I don't know. A table is usually much easier to read sequentially than an index. You can still just try, of course. It doesn't really hurt to create an index; the DBMS either uses it for a query or it doesn't.
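For example, a covering index along those lines might look like this (the index name is illustrative and it assumes the table is dbo.myTable):
CREATE INDEX IX_myTable_Url_TS ON dbo.myTable (Url) INCLUDE (TS);  -- illustrative name
The full scan then only has to read Url and TS from the index pages rather than the whole rows.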
Columnstore indexes can be quite fast at such tasks (aggregates over global scans), but even they will have trouble handling a LIKE '%/myController/%' predicate. I recommend you parse the URL once into an additional computed column that extracts the controller from the URL. But the truth is that looking at the global time spent per URL reveals very little information: it will contain data since the beginning of time, much of it long since made obsolete by newer deployments, and it will not capture recent trends. A filter based on time, say per hour or per day, now that is a very useful analysis. And such a filter can be served excellently by a columnstore, because of the natural time order and segment elimination.
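A sketch of that computed-column idea (it assumes the stored Url begins with the controller segment, e.g. '/myController/action'; the column, index name and parsing expression are illustrative):
ALTER TABLE dbo.myTable
ADD Controller AS (
    CASE WHEN CHARINDEX('/', Url, 2) > 0
         THEN SUBSTRING(Url, 2, CHARINDEX('/', Url, 2) - 2)  -- text between the first two slashes
    END) PERSISTED;

CREATE INDEX IX_myTable_Controller ON dbo.myTable (Controller, Url) INCLUDE (TS);

The aggregate can then filter on WHERE Controller = 'myController' instead of a leading-wildcard LIKE, which an index can actually seek on.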
Based on your posted query, you should have an index on the Url column. In general, columns which are involved in WHERE, HAVING, ORDER BY and JOIN ON conditions should be indexed.
You should get the generated query plan for that query and see where it's taking more time. Also, based on the datatype of the Url column, you may consider having a FULLTEXT index on that column.

Data modeling with counters in Cassandra, expiring columns

The question is directed to experienced Cassandra developers.
I need to count how many times and when each user accessed some resource.
I have data structure like this (CQL):
CREATE TABLE IF NOT EXISTS access_counter_table (
    access_number counter,
    resource_id varchar,
    user_id varchar,
    dateutc varchar,
    PRIMARY KEY (user_id, dateutc, resource_id)
);
I need to get information about how many times a user has accessed resources over the last N days. So, to get the last 7 days I make requests like this:
SELECT * FROM access_counter_table
WHERE user_id = 'user_1'
  AND dateutc > '2015-04-03'
  AND dateutc <= '2015-04-10';
And I get something like this:
user_1 : 2015-04-10 : [resource1:1, resource2:4]
user_1 : 2015-04-09 : [resource1:3]
user_1 : 2015-04-08 : [resource1:1, resource3:2]
...
So, my problem is: old data must be deleted after some time, but Cassandra does not allow setting a TTL on counter tables.
I have millions of access events per hour (and it could be billions), and after 7 days those records are useless.
How can I clear them? Or build something like a garbage collector in Cassandra? Is that even a good approach?
Maybe I need to use another data model for this? What could it be?
Thanks.
As you've found, Cassandra does not support TTLs on Counter columns. In fact, deletes on counters in Cassandra are problematic in general (once you delete a counter, you essentially cannot reuse it for a while).
If you need automatic expiration, you can model it using an int field, and perhaps use external locking (such as zookeeper), request routing (only allow one writer to access a particular partition), or Lightweight transactions to safely increment that integer field with a TTL.
Alternatively, you can page through the table of counters and remove "old" counters manually with DELETE on a scheduled task. This is less elegant, and doesn't scale as well, but may work in some cases.
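A minimal sketch of that int-plus-TTL alternative (the table name and the 7-day TTL here are illustrative, and the increment assumes you read the current value first and retry if the conditional update is rejected):
CREATE TABLE IF NOT EXISTS access_count_int (
    user_id varchar,
    dateutc varchar,
    resource_id varchar,
    access_number int,
    PRIMARY KEY (user_id, dateutc, resource_id)
) WITH default_time_to_live = 604800;  -- rows expire automatically after 7 days

-- read the current value...
SELECT access_number FROM access_count_int
WHERE user_id = 'user_1' AND dateutc = '2015-04-10' AND resource_id = 'resource1';

-- ...then increment with a lightweight transaction; retry if it reports [applied] = false
UPDATE access_count_int
SET access_number = 5              -- previously read value + 1
WHERE user_id = 'user_1' AND dateutc = '2015-04-10' AND resource_id = 'resource1'
IF access_number = 4;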

How to handle joins between huge tables in PostgreSQL?

I have two tables:
urls (a table with indexed pages; host is an indexed column; 30 million rows)
hosts (a table with information about hosts; host is an indexed column; 1 million rows)
One of the most frequent SELECT in my application is:
SELECT urls.* FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects which have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query executes slower and slower. I've read a lot about NoSQL databases (like MongoDB) which are designed to handle tables this big, but changing my database from PgSQL to MongoDB is a big issue for me. Right now I would like to try to optimize the PgSQL solution. Do you have any advice? What should I do?
This query should be fast in combination with the provided indexes:
CREATE INDEX hosts_host_idx ON hosts (host)
WHERE is_spam IS NULL;
CREATE INDEX urls_projects_id_idx ON urls (projects_id, id DESC);
SELECT *
FROM urls u
WHERE u.projects_id = ?
AND EXISTS (
    SELECT 1
    FROM hosts h
    WHERE h.host = u.host
    AND h.is_spam IS NULL
)
ORDER BY u.id DESC
LIMIT ?;
The indexes are the more important ingredient. The JOIN syntax as you have it may be just as fast. Note that the first index is a partial index and the second is a multicolumn index with DESC order on the second column.
It depends a lot on the specifics of your data distribution; you will have to test (as always) with EXPLAIN ANALYZE to find out about performance and whether the indexes are used.
General advice about performance optimization applies, too. You know the drill.
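For instance, with placeholder values for the parameters:
EXPLAIN ANALYZE
SELECT *
FROM urls u
WHERE u.projects_id = 123      -- placeholder value
AND EXISTS (
    SELECT 1
    FROM hosts h
    WHERE h.host = u.host
    AND h.is_spam IS NULL
)
ORDER BY u.id DESC
LIMIT 50;                      -- placeholder value
The plan output shows whether the partial and multicolumn indexes above are actually used and where the time goes.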
Add an index on the hosts.host column (primarily in the hosts table; this matters), and a composite index on urls (projects_id, id), run the ANALYZE statement to update all statistics, and observe sub-second performance regardless of the spam percentage.
A slightly different piece of advice would apply if almost everything is always spam, or if the "projects", whatever they are, are few in number and each very big.
Explanation: updating the statistics makes it possible for the optimizer to recognize that the urls and hosts tables are both quite big (well, you didn't show us the schema, so we don't know your row sizes). The composite index starting with projects_id will hopefully (1) rule out most of the urls content, and its second component will immediately feed the rest of urls in the desired order, so it is quite likely that an index scan of urls will be the basis for the query plan chosen by the planner. It is then essential to have an index on hosts.host to make the hosts lookups efficient; the majority of this big table will never be accessed at all.
(1) This is where we assume that projects_id is reasonably selective (that it is not the same value throughout the whole table).
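A sketch of that setup (index names are illustrative, and chosen so they don't clash with the ones suggested above):
CREATE INDEX hosts_host_full_idx ON hosts (host);               -- illustrative name
CREATE INDEX urls_projects_id_id_idx ON urls (projects_id, id); -- illustrative name
ANALYZE hosts;
ANALYZE urls;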

The most efficient way to get all data from SQL Server table with varchar(max) column

This question is for SQL Server 2005.
I have a table with 2 columns.
Table_A
Id Guid (PrimaryKey)
TextContent varchar(max)
The table contains around 7,000 records, and TextContent ranges from 0 to 150K+ characters.
When I do a select statement
SELECT Id, TextContent FROM Table_A, it took a very long time, around 10 minutes.
Is there a better way to get all data out of the table?
During the main execution I only load certain records, for example: SELECT Id, TextContent FROM Table_A WHERE Id IN (#id0, #id1, #id2, #id3....#id20). This query is not slow, but not very fast either. I want to see if I can optimize the process by pulling the TextContent ahead of run time. It is okay for this process to run for a minute or two, but 10 minutes is not acceptable.
The GUID is the primary key, which by default will also be your clustering key, and that will no doubt cause heavy fragmentation. But given the nature of the columns, the varchar(max) will regularly be stored off-page in the LOB storage rather than on the page, unless it fits while keeping the row within the 8060-byte limit.
So fragmentation is not going to be helped by having a GUID as the primary key if you have also made it the clustered key; you can check fragmentation levels using the DMV sys.dm_db_index_physical_stats.
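For example (this assumes the table is dbo.Table_A in the current database):
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Table_A'), NULL, NULL, 'LIMITED');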
I wouldn't think fragmentation is really the problem unless the average amount of data per row is high, e.g. regularly above 8K.
If it were, the fragmentation would start hurting. The worst case is 1 row per page: 7K I/Os, which is not ideal, but at 100K average per LOB entry you could be looking at a further 87K I/Os, and the order in which the data was written would turn what is supposed to be a sequential scan of the table (and the disk) into a massive random I/O fest as the disk heads long-stroke back and forth between the page holding the row plus LOB pointer and the LOB pages.
Added to that is the chance that the GUID is the clustering key, so it couldn't even scan the data pages without quite a bit of disk head movement.
I also have to agree with Erich that the quantity of data you are trying to shift across the wire will cause quite a delay on an insufficient link, and you should look to filter your data properly at the server level through paging or suitable queries.
I know you're looking to pre-cache the data, which can work at times, but doing it on such a large entity tends to indicate that something else is wrong and that you are fixing the wrong issue.
A.
This is the correct way to get the data out of the table, unless you only want 1 row. If all you need is 1 row, just use the correct query.
What sort of network connection are you using? Let's put it this way: you have 7,000 records, each containing on average 100K of data (for ease; if it is more or less than this, that's fine, my point still stands). The total query will return about 700 MB of data! Even over a reasonably fast connection, that can easily be 10 minutes of download time.
Even under a PERFECT 100 Megabit connection, the transfer would take nearly a minute! In addition, you have to get that data off the physical disk, which is going to take a while on top of that.
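To put numbers on the arithmetic above (using the same 100K-per-row assumption): 7,000 rows x 100 KB is roughly 700 MB, or about 5,600 megabits. At 100 Mbit/s that is around 56 seconds of pure transfer time; on a slower 10 Mbit/s link it is closer to 9-10 minutes, before any disk or protocol overhead.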
I would recommend doing some sort of paging in order to take the data in smaller bites.
I doubt it. If you want to "get all the data out of the table", then you have to read every byte that is stored within the table, and that may well require a lot of physical disk I/O.
Perhaps what you want is to only retrieve some of the data from the table?
Your Id column is a GUID. Are you using a default? Is it NewID()? I assume it's clustered on the PK.
If you use NewSequentialID() as the default, you'll get fewer page splits, so your data will be spread across fewer physical pages.
With that huge amount of data, that's the only thing I can see that would help the performance.
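A sketch of that change (the constraint name is illustrative, and it assumes Id is a uniqueidentifier column with no default currently bound to it):
ALTER TABLE Table_A
ADD CONSTRAINT DF_Table_A_Id DEFAULT NEWSEQUENTIALID() FOR Id;  -- illustrative constraint name
Existing rows keep their current GUIDs; only newly inserted rows get sequential values, so the benefit builds up as new data arrives.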
As many others have mentioned, you're fetching a lot of data. First make sure you really need all of the rows.
If you do, don't fetch everything at once; page through the data instead. This will actually make the total transfer a little slower, but if anything fails you'll only have to reload a small chunk instead of waiting another 10 minutes. SQL Server 2005 has no LIMIT clause, but you can page with ROW_NUMBER():
SELECT Id, TextContent
FROM (SELECT Id, TextContent, ROW_NUMBER() OVER (ORDER BY Id) AS rn FROM Table_A) AS t
WHERE rn BETWEEN 1 AND 30;
This fetches the first 30 entries of your table; change the range to BETWEEN 31 AND 60 for the next piece, and so on.
Maybe you could provide us with a bit more info, like what you want to do with the data and which programming language you use?
Yes
1) Don't ever use SELECT *; always list your columns, whether there are 1, 2 or 100.
2) Try looking into indexes.
150K characters? In that field? Is that what you are referring to?
