Let's say we have a table with 100 million records, which are sales transactions from multiple sellers. Each record has around 14 columns:
TABLE SellerTransactions
    string SellerId,
    string ProductId,
    DateTime CreateDate,
    string BankNumber,
    string Name,        -- name + ' ' + surname + ' ' + alias
    string Comments,
    decimal Amount,
    etc...
Each year we will add around 60 million new records, and the total record count will grow by roughly 10% per year.
Searches will be done by seller Id, then by product Id or product Ids (for multiple products in a time period for that seller).
Every search will also be filtered by time, usually a one-week period (mostly the last week), but worst-case scenarios are possible where we search across all the data we have: years.
A search should also be possible by bank number, or a full-text search by name or part of it.
SELECT *
FROM SellerTransactions
WHERE SellerId = 'Seller1Guid'
  AND ProductId IN ('ProductGuid1', 'ProductGuid2')
  AND CreateDate <= GETDATE()
  AND CreateDate >= DATEADD(day, -7, GETDATE())
  AND CONTAINS(Name, '"BoB Skynet"')
ORDER BY CreateDate
OFFSET 20 ROWS FETCH NEXT 20 ROWS ONLY;
So, search time in these scenarios:
After filtering by ids & time range, the search runs over up to 10k records
After filtering by ids & time range, the search runs over up to 100k records
After filtering by ids & time range, the search runs over up to 1 million records
After filtering by ids & time range, the search runs over 1 to 10 million records
In these cases, would it take 0.5 s, 1 s, or up to 20 s?
Also, how would search performance change if we added a search on the Comments column as well as Name?
CONTAINS(Name, '"BoB Skynet"') OR CONTAINS(Comments, 'online')
In most cases the search will run over small record counts; in very rare cases we would go through millions of rows. But how much time would that take?
When would it be a good idea to move to Elasticsearch, for example?
Well, we are storing quite a bit of data, but the request count for this data is usually small.
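For reference, a minimal sketch of the indexes such a query would typically lean on, assuming SQL Server (implied by CONTAINS) and the columns above; the index names, including the key index PK_SellerTransactions, are hypothetical:

-- Composite index so the SellerId + date-range filter narrows rows before anything else runs
CREATE NONCLUSTERED INDEX IX_SellerTransactions_Seller_Date
    ON SellerTransactions (SellerId, CreateDate, ProductId)
    INCLUDE (BankNumber, Name, Amount);

-- Full-text search on Name/Comments needs a full-text catalog and index
CREATE FULLTEXT CATALOG SellerTransactionsCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON SellerTransactions (Name, Comments)
    KEY INDEX PK_SellerTransactions;  -- hypothetical unique key index name

With a composite index like this, the SellerId and date-range predicates can reduce the candidate rows before the full-text predicate is applied, which is what keeps the common one-week searches cheap.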
I have read in different articles that a cursor pagination query has time complexity O(1) or O(limit), where limit is the item limit in the SQL query. Some example articles:
https://uxdesign.cc/why-facebook-says-cursor-pagination-is-the-greatest-d6b98d86b6c0 and
https://dev.to/jackmarchant/offset-and-cursor-pagination-explained-b89
But I cannot find references explaining why the time complexity is O(limit). Say I have a table consisting of 3 columns,
id, name, created_at, where id is the primary key.
If I use created_at as the cursor (which is unique and sequential), can someone explain why the time complexity is O(limit)?
Is it related to the data structure used to store created_at?
After some reading, my guess is that the time complexity refers to the cost of producing the final required records once the starting point has been located.
For the offset case, rows are read in order, the database discards the first x rows (where x is the offset), and then returns y rows (where y = limit), so the time complexity is O(offset + limit).
For the cursor case, only rows matching the cursor WHERE condition are considered, and the database returns the next y rows (where y = limit), so the time complexity is O(limit).
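As a rough sketch of the difference (hypothetical table name items, columns taken from the question; LIMIT/OFFSET syntax and a B-tree index on created_at assumed):

-- Offset pagination: the database still has to walk past the first 10000 rows in order
SELECT id, name, created_at
FROM items
ORDER BY created_at
LIMIT 20 OFFSET 10000;

-- Cursor pagination: the index lets the database seek straight to the cursor value
-- and read the next 20 rows, so the cost is roughly O(limit)
SELECT id, name, created_at
FROM items
WHERE created_at > '2021-04-01 12:00:00'   -- cursor = created_at of the last row already shown
ORDER BY created_at
LIMIT 20;

So the O(limit) claim is indeed tied to the data structure: the B-tree index on the cursor column makes finding the starting row a seek rather than a scan of offset rows.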
Please find my use case below and suggest the appropriate solution:
We have set up declarative partitioning on our table based on a key column, and further range sub-partitioned each partition by created_date into ranges of 90 days.
In the search query that displays the last 90 days of data, I supply both the key column and created_date, which lets the planner pick the required partitions.
But how can I display the last 90 days of data such that the most recently modified record is shown at the top, even though its created_date might not lie within the last 90-day range?
Basically, I need to partition my huge table both on a key and on a date to split the data into smaller sets for faster queries. The UI needs to display the latest modified records first, but created_date also needs to be given in the query to pick the right partitions.
Can we partition differently to achieve this?
Should I find the max modified_date and use it in the created_date range?
Yes, I have a B-tree index on the modified_date column currently.
SELECT * FROM partitioned_table
WHERE A = 'Value'
  AND created_date >= '2021-03-01 08:16:13.589'
  AND created_date <= '2021-04-02 08:16:13.589'
ORDER BY viewpriority DESC
OFFSET 0 ROWS FETCH NEXT 200 ROWS ONLY;
Here viewpriority is basically a long value containing created_date in milliseconds.
The issue is: I want to somehow include modified_date in this query as well, so that the records come back sorted by modified_date. But that sorting would happen only within the specified created_date range, and I also want the latest modified records whose created_date might not lie in this range.
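For context, a minimal sketch of the layout described above, assuming PostgreSQL declarative partitioning; all table and column names here are hypothetical stand-ins:

CREATE TABLE partitioned_table (
    a             text,
    created_date  timestamp,
    modified_date timestamp,
    viewpriority  bigint
) PARTITION BY LIST (a);

-- One partition per key value, sub-partitioned by created_date in ~90-day ranges
CREATE TABLE partitioned_table_value
    PARTITION OF partitioned_table FOR VALUES IN ('Value')
    PARTITION BY RANGE (created_date);

CREATE TABLE partitioned_table_value_2021q1
    PARTITION OF partitioned_table_value
    FOR VALUES FROM ('2021-01-01') TO ('2021-04-01');

Partition pruning only happens when created_date appears in the WHERE clause, which is why the query above has to supply the created_date range even though the desired ordering is by modified_date.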
I am new to SQL and Stack Overflow and have a question about SQL Server syntax. I have searched online but am not finding what I need, and I would appreciate your assistance in this matter.
I have data in a source table for meal orders, each with a specific ID (e.g. 12345C), and the items of each order (e.g. sandwich, drink, chips), each with an associated item number starting at 1. For instance, the sandwich would have an item number of 1, the chips would be item #2, and the drink would be item #3 for the same orderID 12345C. The prior example would therefore have 3 rows of data in the source table for orderID 12345C.
My questions are these:
How do I use a SQL expression to determine the number of items in each order (e.g. 3 for the above example, which is also the maximum item number for that orderID)?
And then how do I add up these per-order item counts across hundreds of orders per day to determine the daily total number of items ordered?
So, if I had 3 orders in one day - one with 2 items, the second with 3 items, and the third with 4 items, I would like my final number to be 9.
This number is for use in a Sisense dashboard that allows SQL syntax in the field definition. Thank you for your help!
It is a bit difficult to explain, but I am not able to run a query against a table directly because I am working with a dashboard in Sisense: I am adding fields in a pivot display, and one of the fields I would like to include is the total count of order items per day (across several dozen orderIDs).
Here is an example of the data in the table: from the example, I would like the final answer for orderID 1787588 to be 3 (there are 3 items within that order).
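Outside the Sisense field editor, the underlying calculation could be sketched in plain SQL roughly like this; the table and column names (MealOrders, OrderID, ItemNumber, OrderDate) are hypothetical:

-- Items per order: count the rows (equivalently, take MAX(ItemNumber)) per orderID
SELECT OrderID, COUNT(*) AS ItemsPerOrder
FROM MealOrders
GROUP BY OrderID;

-- Daily total of items across all orders: every row is one item, so just count rows per day
SELECT CAST(OrderDate AS date) AS OrderDay, COUNT(*) AS TotalItems
FROM MealOrders
GROUP BY CAST(OrderDate AS date);

For the three-order example above (2 + 3 + 4 items), the second query would return 9 for that day.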
This should be simple, so I think I am missing something. I have a simple line chart that shows users per day over 28 days (X axis is date, Y axis is number of users). I am using a hard-coded 28 days here just to get it to work.
I want to add a scorecard for average daily users over the 28-day time frame. I tried a calculated field AVG(Users), but this shows an error about re-aggregating an aggregated value. Then I tried Users/28, but the result, oddly, is the value of Users for today; the division seems to be completely ignored.
What is the best way to show the average number of daily users over a time frame? Average daily users over 10 days, 20 days, etc.
Try to create a new metric that counts the dates, e.g.
Count of Date = COUNT(Date), or
Count of Date = COUNT_DISTINCT(Date) in case you have duplicate dates.
Then create another metric for average users:
Users AVG = Users / Count of Date
The average depends on the timeframe you have selected: if you select the last 28 days, the average is over those 28 days (dates); if you filter to 20 days, the average is over those 20 days, etc.
Hope that helps.
I have been able to do this in an extremely crude and ugly manner, using Google Sheets to do the calculation and serve as a data source for Data Studio.
This may be useful for other people trying to do the same thing. This assumes you know how to work with GA data in Sheets and are starting with a Report Configuration. There must be a better way.
Example for Average Number of Daily Users over the last 7 days:
Edit the Report Configuration fields:
Report Name: create one report per day, in this case 7 reports. Name them (for example) Users-1 through Users-7. These are your Row 2 values. You'll have 7 columns, with the first report name in column B.
Start Date and End Date: use TODAY()-X where X is the number of days previous to define the start and end dates for each report. Each report will contain the user count for one day. Report Users-1 will use TODAY()-1 for start and end, etc.
Metrics: enter the metrics, e.g. ga:users and ga:newUsers.
Create the reports
Use 'Run reports' to have the result sheets created and populated.
Create a sheet for an interim data set you will use as the basis for the average calculation. The first column is date, the remaining columns are for the metrics, in this case Users and New Users.
Populate the interim data set with the dates and values. You will reference the Report Configuration to get the dates, and you will pull the metrics from each of the individual reports. At this stage you have a sheet with the date in the first column and the metric values in subsequent columns, with a row for each day's values. Be sure to use a header row.
Finally, create a sheet that averages the values in the interim data set. This sheet will have a column for each metric, with one value per column. Each value is calculated from the series in the interim data set, for example =AVERAGE(interim_sheet_reference:range), or any other calculation you'd like to do.
At last, you can use Data Studio to connect to this data source and use the values. For counts of users such as this example, you would use Sum as the aggregation field type when you are creating the data source.
It's super ugly but it works.
I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie).
I have two applications: intraday stock data and sensor data.
The stock data will be saved with a time resolution of one minute.
Seven data fields make up one timeframe:
Symbol, Datetime, Open, High, Low, Close, Volume
I will query the data mostly by Symbol and Date. e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31 ordered by Datetime.
The recommendation for Cassandra queries is to query whole columns. So you could create five rows with the keys Open, High, Low, Close, Volume, and a separate column for each Symbol and Minute, e.g. "AAPL:2013-01-04T130400Z".
This would result in a table of five rows and n*nT columns, where n = number of symbols and nT = number of minutes.
Most of the time I will query date ranges, i.e. all minutes of a day. So I could rearrange the data to have columns named "AAPL:2013-01-04" and rows OpenT130400Z, HighT130400Z, LowT130400Z, CloseT130400Z, VolumeT130400Z.
This would result in a table with n*nD columns (n: number of symbols, nD: number of days) and 5*nM rows (nM: number of minutes/entries per day).
To sum up: I would have columns which hold the information for a whole day for one symbol.
I have found a description of how to deal with time series data in Cassandra here: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
But I don't really get whether they use the hour (1332960000) as a column name or as a row key.
I understood that they use the hour as the row key and the small timesteps as columns, so they would have a fixed number of columns. But that would have disadvantages for reading, because I would have to do a range query on keys. Am I right?
Second question:
If I have sensor data that is much more fine-grained than 1-minute stock data (let's say I have to save timesteps with a resolution of microseconds), how would I deal with this?
If I use columns for a composite of sensor channel and hour, and rows for the microseconds since the last hour, this would result in 3,600,000,000 rows and n*nH columns (n: number of sensors, nH: number of hours).
I could not use the microseconds since the last hour as columns, because I would have 3.6 billion points, which is higher than the allowed limit of 2 billion columns.
Did I get that right?
What do you think about this problem? How would you solve it?
Thank you!
Best,
Malte
So I have a suggestion for your first question about the stock data. A naive implementation might look like this:
Row key: the stock symbol (e.g. AAPL)
Column format:
Name: the current datetime, granular to a minute
Value: a composite column of Open, High, Low, Close, Volume
So you would have something like
AAPL = [2013-05-02-15:38:00 | 441.78:448.59:440.63:15066146:445.52] ... [2013-05-02-15:39:00 | 441.78:448.59:440.63:15066146:445.52] ... [2013-05-02-15:40:00 | 441.78:448.59:440.63:15066146:445.52]
That would give you roughly half a million columns in one year, so it might be OK for maybe 4 years; I wouldn't attempt to hit the 2 billion limit. What you could do is define a splitting factor on the row key. It all depends on your usage pattern, but a simple one might be the year, so a column family entry might look like this, with a composite row key, and that would guarantee that you always have fewer than a million columns per row.
AAPL:2013 = [05-02-15:38:00 | 441.78:448.59:440.63:15066146:445.52] ... [05-02-15:39:00 | 441.78:448.59:440.63:15066146:445.52] ... [05-02-15:40:00 | 441.78:448.59:440.63:15066146:445.52]
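In current CQL terms, the same idea could be sketched roughly as follows; the table and column names are hypothetical, the composite row key becomes a compound partition key (symbol, year), and the minute timestamp becomes a clustering column:

CREATE TABLE intraday_quotes (
    symbol text,
    year   int,
    ts     timestamp,   -- minute-resolution timestamp
    open   decimal,
    high   decimal,
    low    decimal,
    close  decimal,
    volume bigint,
    PRIMARY KEY ((symbol, year), ts)
);

-- All AAPL minutes for January 2013, ordered by time within the partition
SELECT * FROM intraday_quotes
WHERE symbol = 'AAPL' AND year = 2013
  AND ts >= '2013-01-01' AND ts < '2013-02-01';

Each (symbol, year) partition stays well under the 2-billion-cell limit discussed above, which is the same effect the composite row key achieves in the Thrift-style layout.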