I have a Python implementation of streaming data to BigQuery, similar to the example in the docs. Every task logs "Loaded 1 row into...", but when I query the table I only see around 30-35 rows per day (the table is partitioned by day), even though I'm averaging 25k streaming requests to the table. Looking at the "get" API call, it shows around 800 rows in the streaming buffer, but it has been like this for 4 days and I still can't see my data from 4 days ago in the table.
Are you supplying a deduplication insertId for each row when you call tabledata.insertAll? If you're re-using the same insertId for all the inserted rows, you'll observe behaviors similar to this.
I have a single MSSQL 2017 Standard table, let's call it myTable, with data going back to 2015, containing 206.4 million rows. Once INSERTed by the application, these rows are never modified or deleted. The table is actively collecting data, 24/7.
My goal is to reduce the data in this table to only the most recent full 6 months plus the current month, in monthly partitions for easy monthly pruning. myTable.dateCreated would determine which partition the data ultimately resides in.
(Unrelated, but mentioning in case it ends up being relevant: I have an existing application that replicates all data that gets stored in myTable out to a data warehouse for long term storage every 15 minutes; the main application is able to query myTable for recent data and the data warehouse for older data as needed.)
Because I want to prune the oldest month's worth of data out of myTable each time a new month starts, partitioning myTable by month makes the most sense - I can simply SWITCH the oldest partition to a staging table, then truncate that staging table, without causing downtime or a performance hit on the main table.
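For illustration, the monthly prune I have in mind would look roughly like this sketch (the partition function, scheme, and staging table names are placeholders, and the staging table is assumed to already exist with an identical structure on the same filegroup):

-- Switch the oldest partition out; this is a metadata-only operation, near-instant
ALTER TABLE myTable SWITCH PARTITION 1 TO myTable_staging;

-- Discard the switched-out month without touching myTable
TRUNCATE TABLE myTable_staging;

-- Slide the window: remove the emptied boundary, then add next month's boundary
ALTER PARTITION FUNCTION pf_myTable_monthly() MERGE RANGE ('2022-07-01');
ALTER PARTITION SCHEME ps_myTable_monthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_myTable_monthly() SPLIT RANGE ('2023-02-01');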
I've come up with the following plan, and my questions are simple: Is this the best way to approach this task, and will it keep downtime/performance degradation to a minimum?
Create a new table, myTable_pending, with the same exact table structure as myTable, EXCEPT that it will have a total of 7 monthly partitions (6 months retention plus current month) configured;
In one complete step: rename myTable to myTable_transfer, and rename myTable_pending to myTable. This should have the net effect of allowing incoming data to continue being stored, but now it will be in a partition for the month of 2023-01;
Step 3 is where I need advice... which of the following might be best to get the remaining 6mos + current data back into the now-partitioned myTable, or are there additional options I should consider?
OPTION 1: Run a Bulk Insert of just the most recent 6 months of data from myTable_transfer back into myTable, causing the data to end up in the correct partitions in the process (with the understanding that this may still take some time, but not as long as a bunch of INSERTs that would end up chewing on the transaction log);
OPTION 2: Run a DELETE against myTable_transfer, getting rid of all data except the most recent full 6 months + current, then set up and apply partitions on THIS table, which would cause SQL Server to reorganize the data into those partitions without affecting access or performance on myTable; after that I could just SWITCH the partitions from myTable_transfer into myTable for immediate access. (Related issue: since myTable is still collecting current data, and myTable_transfer will contain data from the current month as well, can the current-month partitions be merged?)
OPTION 3: Any other way to do this, so that myTable ends up with 6 months worth of data, properly partitioned, without significant downtime?
We ended up revising our solution. Since the original table was replicated to a data warehouse anyway, we simply renamed the table and created a new partitioned one to start collecting new data from the rename point onward. This provided the least downtime, the fastest schema changes, and gave us the partitioning we needed to maintain the table efficiently going forward.
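A simplified sketch of the cut-over we ran (object names, boundary dates, data types, and column definitions below are illustrative only; the real table has many more columns):

CREATE PARTITION FUNCTION pf_myTable_monthly (datetime2)
AS RANGE RIGHT FOR VALUES ('2022-08-01', '2022-09-01', '2022-10-01',
                           '2022-11-01', '2022-12-01', '2023-01-01');

CREATE PARTITION SCHEME ps_myTable_monthly
AS PARTITION pf_myTable_monthly ALL TO ([PRIMARY]);

BEGIN TRANSACTION;

-- Rename the original table out of the way...
EXEC sp_rename 'dbo.myTable', 'myTable_transfer';

-- ...and recreate myTable on the partition scheme so new data lands in monthly partitions
CREATE TABLE dbo.myTable (
    id          bigint IDENTITY(1,1) NOT NULL,
    dateCreated datetime2 NOT NULL,
    -- remaining columns identical to the original table
    CONSTRAINT PK_myTable PRIMARY KEY CLUSTERED (id, dateCreated)
) ON ps_myTable_monthly (dateCreated);

COMMIT TRANSACTION;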
Our application shows near-real-time IoT data (up to 5 minute intervals) for our customers' remote equipment.
The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database.
The table looks something like this:
Measurements: (DeviceId, Property, Value, DateTime).
Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:
"Device latest value" (95% of queries): looking at the latest value only
"Device daily snapshot" (5% of queries): looking at a single representative value for each day
We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.
The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.
[EDIT #1: To make it less opinion-based]
What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years worth of "historic" IoT values?
One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
[EDIT #2: In response to feedback]
In our test database, seeded with 50 million records, and with the following index applied:
CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurements (DeviceId ASC, DateTime DESC)
a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:
SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
    SELECT MAX(DateTime)
    FROM Measurements m2
    WHERE m2.DeviceId = m.DeviceId)
This is a very broad question - and as such, it's unlikely you'll get a definitive answer.
However, I have been in a similar situation, and I'll run through my thinking and eventual approach. In summary though - I did option B but in a way to mirror option A: I used a filtered index to 'mimic' the separate smaller table.
My original thinking was to have two tables - one with the 'latest data only' for most reporting, and one with all historical values. An alternative was to have two tables - one with all records, and one with just the latest.
When inserting a new row, it would therefore typically need to update at least two rows, if not more (depending on how it's stored).
Instead, I went for a slightly different route:
Put all the data into one table
On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
This filtered index is similar to a second copy of the table with just the latest rows only
The insert process therefore has two steps in a transaction (sketched below)
'Unflag' the last Latest_Flag for that device, etc
Insert the new row
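A minimal sketch of that two-step insert as a stored procedure, assuming the Measurements columns from the question plus the new Latest_Flag column (the procedure name and parameter types are guesses):

CREATE PROCEDURE dbo.InsertMeasurement
    @DeviceId int,
    @Property nvarchar(50),
    @Value    float,
    @DateTime datetime2
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    -- Step 1: unflag the previous 'latest' row for this device/property
    UPDATE dbo.Measurements
    SET    Latest_Flag = 0
    WHERE  DeviceId = @DeviceId
      AND  Property = @Property
      AND  Latest_Flag = 1;

    -- Step 2: insert the new reading as the latest value
    INSERT INTO dbo.Measurements (DeviceId, Property, Value, DateTime, Latest_Flag)
    VALUES (@DeviceId, @Property, @Value, @DateTime, 1);

    COMMIT TRANSACTION;
END;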
It still makes the writes a bit slower (as it needs to do several row updates as well as index updates) but fundamentally it does the pre-calculation for later reads.
When reading from the table, however, you need to then specify WHERE Latest_Flag = 1. Alternatively, you may want to put it into a view or similar.
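For example, a thin view over the flagged rows (the view name is just an illustration):

CREATE VIEW dbo.MeasurementsLatest
AS
SELECT DeviceId, Property, Value, DateTime
FROM dbo.Measurements
WHERE Latest_Flag = 1;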
For the filtered index, it may be something like
CREATE INDEX ix_measurements_deviceproperty_latest
ON Measurements (DeviceId, Property)
INCLUDE (Value, DateTime, Latest_Flag)
WHERE (Latest_Flag = 1)
Note - another version of this can be done in a trigger e.g., when inserting a new row, it invalidates (sets Latest_Flag = 0) any previous rows. It means you don't need to do the two-step inserts; but you do then rely on business/processing logic being within triggers.
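A rough sketch of that trigger variant, using the same assumed columns (illustrative only, not production-hardened):

CREATE TRIGGER trg_Measurements_SetLatest
ON dbo.Measurements
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Unflag any older rows for the device/property combinations that just arrived
    UPDATE m
    SET    Latest_Flag = 0
    FROM   dbo.Measurements AS m
    INNER JOIN inserted     AS i
            ON  i.DeviceId = m.DeviceId
            AND i.Property = m.Property
    WHERE  m.Latest_Flag = 1
      AND  m.DateTime < i.DateTime;
END;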
I have to fetch n records from the database. The count of these records may vary from manager to manager; one manager may have 100+ records while another may have only 50+.
If it were just about fetching data from a single table, it would be super easy.
In my case the main pain point is that I only get my result after using many joins, temp tables, functions, math on some columns, and date filters using CASE expressions, and each table has 100k+ records with proper indexing.
I have added pagination on the UI side so that only 20 records are shown on screen at a time. When a page number is clicked, I offset the records to get the next 20; for example, clicking page number 3 should fetch only records 41-60 from the db.
The UI part is not a big deal; the point is how to optimise the query so that each call returns only 20 records.
My current implementation calls the same procedure every time with an index value to offset the data. Is that the correct way - re-running the same complex query, with all its functions, CTEs, CASE filters, and inner/left joins, again and again just to fetch one piece of the recordset?
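For reference, the pattern I'm describing is roughly the sketch below, assuming SQL Server 2012+ OFFSET/FETCH; the procedure, parameter, and column names are placeholders, and the real query has far more joins, CTEs, and filters:

CREATE PROCEDURE dbo.GetManagerRecordsPage
    @ManagerId  int,
    @PageNumber int,
    @PageSize   int = 20
AS
BEGIN
    SET NOCOUNT ON;

    SELECT r.RecordId, r.CreatedDate, r.SomeValue
    FROM dbo.Records AS r            -- stand-in for the real multi-join query
    WHERE r.ManagerId = @ManagerId
    ORDER BY r.CreatedDate DESC
    OFFSET (@PageNumber - 1) * @PageSize ROWS
    FETCH NEXT @PageSize ROWS ONLY;
END;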
Here is my problem: I need to fetch a large record set from various tables; to be exact, it involves 30 tables. I did the join across the 30 tables, and it took 20 min just to fetch 200 rows.
I was thinking of creating a stored procedure that makes transactional DB calls to fetch the data bit by bit and store it in a new report table.
Here is the nature of my business process:
On my web screen, I have 10 questionnaire tabs that need to be filled in by an insurance client. Basically I need to fetch all the questions and answers and put them in one row.
The problem is, my client won't finish all 10 tabs in one day; they might finish all the tabs in 3 days max.
Initially I wanted to put an insert trigger on the primary table to fetch everything and put it in a reporting table, but I can only get the record for t+0, not t+1 or t+n. How am I going to update the same row if the user updates another tab on another day?
To simplify my requirement for discussion: I have 10 questionnaire tabs, each tab has its own table, and completing the questionnaire doesn't require you to finish it in one day.
How am I going to fetch all the data using transactional SQL in a stored procedure?
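For discussion, the "update the same row on a later day" part would presumably need some kind of per-client upsert into the reporting table, roughly like the sketch below (every object, column, and parameter name here is made up):

CREATE PROCEDURE dbo.UpsertQuestionnaireTab1
    @ClientId   int,
    @Tab1Answer nvarchar(max)
AS
BEGIN
    SET NOCOUNT ON;

    -- Insert the client's report row the first time, update it on later days
    MERGE dbo.QuestionnaireReport AS target
    USING (SELECT @ClientId AS ClientId, @Tab1Answer AS Tab1Answer) AS source
        ON target.ClientId = source.ClientId
    WHEN MATCHED THEN
        UPDATE SET target.Tab1Answer  = source.Tab1Answer,
                   target.LastUpdated = SYSDATETIME()
    WHEN NOT MATCHED THEN
        INSERT (ClientId, Tab1Answer, LastUpdated)
        VALUES (source.ClientId, source.Tab1Answer, SYSDATETIME());
END;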
I have a table in SQL Server 2012 that contains call detail records. A simplified version of the schema is shown in this SQLFiddle.
It's trivial to count calls for a given region, but I would like to further break the data down into discrete half-hour buckets. I am then feeding the data into a chart, so I need the query to be able to return all buckets, even if there were no calls in those buckets.
Any thoughts?
Additionally, I can't lose the offsets on those values (note they are DATETIMEOFFSET type). Most solutions I've found out there involve throwing away that data because they can only handle DATETIME.
Create a dim_time table or something to that effect and insert time ranges into it (one row for each half-hour slot; this population can be automated).
select time_id, time_start, time_end
from dim_time
Now you have a table with all the time slots you are interested in...left join it to a count query to get the counts associated with each time slot.
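For example, assuming dim_time stores DATETIMEOFFSET ranges (so the offsets are preserved) and the CDR table is called Calls with CallId/CallTime columns - both assumptions, since the real schema is in the fiddle:

-- Populate one day's worth of half-hour slots (automate per day/date range as needed)
DECLARE @day_start datetimeoffset = '2023-01-01 00:00:00 +00:00';

;WITH n AS (
    SELECT TOP (48) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS slot
    FROM sys.all_objects
)
INSERT INTO dbo.dim_time (time_id, time_start, time_end)
SELECT slot,
       DATEADD(MINUTE, slot * 30,      @day_start),
       DATEADD(MINUTE, slot * 30 + 30, @day_start)
FROM n;

-- Left join the slots to the calls so empty buckets still come back with a zero count
SELECT  dt.time_id,
        dt.time_start,
        dt.time_end,
        COUNT(c.CallId) AS CallCount
FROM    dbo.dim_time AS dt
LEFT JOIN dbo.Calls   AS c
        ON  c.CallTime >= dt.time_start   -- DATETIMEOFFSET comparisons respect the offset
        AND c.CallTime <  dt.time_end
GROUP BY dt.time_id, dt.time_start, dt.time_end
ORDER BY dt.time_start;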
Once your code is in...you can alter this to 15-minute blocks, or 2-hour blocks, or whatever, by manipulating the dim_time entries.