How to select lots of counts of lots of different criterias - sql-server

I'm trying to do a detailed Member Search page. It uses Ajax in every aspect like Linkedin did on search pages.
But I don't know how I can select counts of multiple criterias. You can see what I meant by the attachment. I mean, if I select every count with different queries it's gonna take forever.
Should I store the count values on another table? Then, further development will be hard and time consuming.
I need your advices.
In this web site, you enter just a keyword and it shows you the all available fields order by count DESC;

You can create an Indexed View that groups by your criteria and uses COUNT_BIG to get totals.
CREATE VIEW dbo.TagCount
WITH SCHEMABINDING
AS
SELECT Tag, COUNT_BIG(*) AS CountOfDocs
FROM dbo.Docs
GROUP BY Tag
GO
CREATE UNIQUE CLUSTERED INDEX IX_TagCount ON dbo.TagCount (Tag)

Related

Snowflake/Snowsight dashboard custom filters cannot select desired values

We have many dashboards that require a user to filter to on a specific ID in our database. We have billions of IDs to select from but Snowsight will only all us to display 10,000 fields in the drop down for selection.
Is there any way that we can just paste in the ID and run rather than having to find and select? We'll never be able to write a filter query that'll dynamically cover all bases.
select id from table
Snowsight gives a warning that only 10,000 rows can be shown so I tried hard-typing the ID in the filter option and nothing happens. It forces you to either select an ID shown (the one I want isn't in the list of 10,000) or the "All" option.

How don't select the same row twice in two select cassandra queries?

I am working on a social networking project with cassandra. Users can subscribe to a profile and have access to the list of people who have subscribed to that same profile. My goal is to retrieve in a table called user_follows the list of people subscribed to a profile.
CREATE TABLE users_follows (to_id text, from_id text, followed_at timestamp, PRIMARY KEY(to_id, from_id))
The problem is that some profiles can have thousands of subscribers and I don't want to get them all at once. That's why I'd like to get the list in increments of 20 depending on how far down the user goes. My problem is that I can't see how to retrieve the other parts of the list after the first select because Cassandra always returns the same users.
SELECT * FROM users_follows where to_id = 'xxxxx'
A possible solution was to sort with a timestamp but in case I want to retrieve the list of people to whom a user is subscribed (the reverse query) this would not work. One solution would be to use materialized views but I'm not sure that it would be very optimal given the size of the table. Or to create a different table, one user_follows and another user_followers, but I don't think this is very recommended....

One large table or many small ones in database?

Say I want to create a typical todo-webApp using a db like postgresql. A user should be able to create todo-lists. On this lists he should be able to make the actual todo-entries.
I regard the todo-list as an object which has different properties like owner, name, etc, and of course the actual todo-entries which have their own properties like content, priority, date ... .
My idea was to create a table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the questions which arises is how to store the todo-entries themselves? Of course in an additional table, but should I rather:
1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:
todo-list: id, owner, ...
todo-entries: list.id, content, ...
which would give 2 tables in total. The todo-entries table could get very large. Although we know that entries expire, hence the table only grows with more usage but not over time. Then we would write something like SELECT * FROM todo-entries WHERE todo-list-id=id where id is the of the list we are trying to retrieve.
OR
2. Create a todo-entries table on a per user basis.
todo-list: id, owner, ...
todo-entries-owner: list.id, content,. ..
Number of entries table depends on number of users in the system. Something like SELECT * FROM todo-entries-owner. Mid-sized tables depending on the number of entries users do in total.
OR
3. Create one todo-entries-table for each todo-list and then store a generated table name in a field for the table. For instance could we use the todos-list unique id in the table name like:
todo-list: id, owner, entries-list-name, ...
todo-entries-id: content, ... //the id part is the id from the todo-list id field.
In the third case we could potentially have quite a large number of tables. A user might create many 'short' todo-lists. To retrieve the list we would then simply go along the lines SELECT * FROM todo-entries-id where todo-entries-id should be either a field in the todo-list or it could be done implicitly by concatenating 'todo-entries' with the todos-list unique id. Btw.: How do I do that, should this be done in js or can it be done in PostgreSQL directly? And very related to this: in the SELECT * FROM <tablename> statement, is it possible to have the value of some field of some other table as <tablename>? Like SELECT * FROM todo-list(id).entries-list-name or so.
The three possibilities go from few large to many small tables. My personal feeling is that the second or third solutions are better. I think they might scale better. But I'm not sure quite sure of that and I would like to know what the 'typical' approach is.
I could go more in depth of what I think of each of the approaches, but to get to the point of my question:
Which of the three possibilities should I go for? (or anything else, has this to do with normalization?)
Follow up:
What would the (PostgreSQL) statements then look like?
The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.
Image you have 1 million users, with an average of 3 to-do lists each, with an average of 5 entries per list.
Scenario 1
In the first scenario you have three tables:
todo_users: 1 million records
todo_lists: 3 million records
todo_entries: 15 million records
Such table sizes are no problem for PostgreSQL and with the right indexes you will be able to retrieve any data in less than a second (meaning just simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made less than 3 todo_lists in the 3-month period with the highest todo_entries entered) it will obviously be slower (as in the other scenarios). The queries are very straightforward:
-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;
-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;
-- Find entries for a to-do list based on its todoid
SELECT * FROM todo_entries WHERE listid = ?;
You can also combine the three queries into one:
SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;
Use of the LEFT JOINs means that you will also get data for users without lists or lists without entries (but column values will be NULL).
Inserting, updating and deleting records can be done with very similar statements and similarly fast.
PostgreSQL stores data on "pages" (typically 4kB in size) and most pages will be filled, which is a good thing because reading a writing a page are very slow compared to other operations.
Scenario 2
In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.
1 million todo_lists tables with a few records each
1 million todo_entries tables with a few dozen records each
The only practical solution to that is to construct the full table names from a "basename" related to the username or some other persistent authentication data from your web site. So something like this:
username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';
And then you query with those table names. More likely you will need a todo_users table anyway to store personal data, usernames and passwords of your 1 million users.
In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely build your queries in code and then feed them to PostgreSQL, meaning that it cannot optimize a query plan. A bigger problem is creating the tables for new users (todo_list and todo_entries) or deleting obsolete lists or users. This typically requires behind-the scenes housekeeping that you avoid with the previous scenario. And the biggest performance penalty will be that most pages have only little content so you waste disk space and lots of time reading and writing those partially filled pages.
Scenario 3
This scenario is even worse that scenario 2. Don't do it, it's madness.
3 million tables todo_entries with a few records each
So...
Stick with option 1. It is your only real option.

Counting the number of occurances of something in the database

For my website, I want to make something that works a bit like the tags on Stackoverflow - so some fields will have an autocompleter, and the autocompleter will display the number of times that other users have selected each suggested value. I suppose I'd have a database structure like this:
Articles
ArticleID
Content
TagId
Tags
TagId
TagName
Occurances
With the idea being that Occurances represents the number of times each TagId is referenced from the Articles table.
What is the best way to implement this? I could add/subtract from the occurances column on each of the stored procedures that update the article table, but I might miss one, and anyway, there is are some difficulties with this if a user removes a tag from something (as its easy to add 1 to the field for the newly added tag, but harder to work out which tag is being replaced.)
There is lots I don't understand about sql-server. Is there a more robust way of counting occurances like this, that the database system will deal with itself? It would be ok if the data was cached once a day or something.
To be able to have more than one tag attached to an article, you will have to add another table that connects the article table to the tag table. It's called a 'many to many' relation.
article
article_id
content
article_tag
article_id
tag_id
tag
tag_id
tagname
Doing like this, article 1 can be attached to tag 2, and the next row can be 1 and 3 and so on, so one article points to many tags. To count a certain tag, you join the Article_Tag and Tag tables, and and count the rows in Article_Tag where Tag.tagname = 'mysql', for examle.
You can create an indexes view that aggregates all the counts you need and is automatically maintained:
create view TagCounts
with schemabinding
as select TagId, count_big(*) as Occurances
from dbo.ArticleTags
group by TagId;
go
create unique clustered index cdxTagCounts on TagCounts (TagId);
go
Now the TagCounts.Occurances field is automatically maintained by SQL Server whenever you insert/delete/update the Articles table. You can query it like:
select Occurances from dbo.TagCounts with (noexpand) where TagId = ...;
And you can cache the result with LinqToCache, as such a query matches the restrictions of Query Notifications.
The trade off of using a pre-aggregated indexed view is scalability: as update of any article updates the count of Occurances for the tags of the article, an exclusive lock is required to update this count. Which implies that only one transaction can use a TagId at any moment. Depending on your traffic and on other elements of your design this restriction may or may not be acceptable.
The other alternative is a table of counts. Front ends (your ASP.Net farm) read this counts and then they update the in-memory count for each operation, keeping track of the delta from the counts in the table. Periodically the front ends merge their deltas into the table (eg. every 5 minutes) and refresh the in-memory table. This way front ends see a stale version of the truth, but an user sees immediate feedback of its actions: because of session stickiness his HTTP requests are processed by the same front end, and thus he see immediately his own article updates triggering modifications to the tag counts. User though do no immediately see the updates from other users that are load-balanced to another front end. Because a crash of the front end (or a process recycle...) will loose the deltas kept so far, the count table will drift in time away from the truth and would have to be periodically updated to the true count in the database.
If you which even more accuracy (all users see the true count immediately) then you can do something based on fast in-memory key value stores, which would be basically the same as my first proposal but with much higher throughput/lower latency, perhaps something based on memcached + redis. I'm not acquainted with SO architecture, but I believe they may be doing something similar.
You could use this query to get the number of occurances by tag:
SELECT Tags.TagId, COUNT(Articles.TagId) as Occurances
FROM Articles
JOIN Tags ON Articles.TagId
GROUP BY Tags.TagId
It could be used in a view or stored procedure, and you can set up your website's cache to requery it as often as required.
If you are using a relational database, the correct way to handle this problem is to NOT store the occurrences on the table itself, but rather dynamically query the number of occurrences on the articles table.
If you don't do it this way, you're stuck coding update queries every time you add/delete a row...generally not nice. If you query dynamically, you won't have an occurrences column in the table, but rather will get that information in your eg. presentation/model layer code.
Use:
SELECT COUNT(*) FROM ARTICLES WHERE TagId = 'xxx' ;
This line is part of iterating code.

Activity list ala SO

We are building a set of features for our application. One of which is a list of recent user activities ala on SO. I'm having a little problem finding the best way to design the table for these activities.
Currently we have an Activities table with the following columns
UserId (Id of the user the activity is for)
Type (Type of activity - i.e. PostedInForum, RepliedInForum, WroteOnWall - it's a tinyint with values taken from an enumerator in C#)
TargetObjectId (An id of the target of the activity. For PostedInForum this will be the Post ID, for WroteOnWall this will be the ID of the User whose wall was written on)
CreatedAtUtc (Creationdate)
My problem is that TargetObjectId column doesn't feel right. It's a soft link - no foreign keys and only a knowledge about the Type tells you what this column really contains.
Does any of you have a suggestion on an alternate/better way of storing a list of user activites?
I should also mention that the site will be multilingual, so you should be able to see the activity list in a range of languages - that's why we haven't chosen for instance to just put the activity text/html in the table.
Thanks
You can place all content to a single table with a discriminator column and then just select top 20 ... from ... order by CreatedAtUtc desc.
Alternatively, if you store different type of content in different tables, you can try something like (not sure about exact syntax):
select top 20 from (
select top 20 ID, CreatedAtUtc, 'PostedToForum' from ForumPosts order by CreatedAtUtc
union all
select top 20 ID, CreatedAtUtc, 'WroteOnWalll' from WallPosts order by CreatedAtUtc) t
order by t.CreatedAtUtc desc
You might want to check out http://activitystrea.ms/ for inspiration, especially the schema definition. If you look at that spec you'll see that there is also the concept of a "Target" object. I have recently done something very similar but I had to create my own database to encapsulate all of the activity data and feed data into it because I was collecting activity data from multiple applications with disparate data sources in different databases.
Max

Resources