How to calculate the hourly number of observations? - loops

I need to calculate the hourly number of observations for a data set consisting of around 100,000 dates; each date-time represents one observation.
For example, I need to calculate the hourly events from 2018-11-25-20-35-00 to 2018-11-27-01-35-00.
Additionally, where there is no event there should be an output of zero.
The sample data set is shown in this figure
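A minimal sketch of one way to do this without an explicit loop, assuming the timestamps sit in a pandas Series and use the YYYY-MM-DD-HH-MM-SS format from the example (the variable names here are hypothetical):

import pandas as pd

# Hypothetical input: one timestamp string per observation, in the question's format.
timestamps = pd.Series([
    "2018-11-25-20-35-00",
    "2018-11-25-20-50-12",
    "2018-11-26-03-10-05",
])

# Parse the custom format and truncate each timestamp to its hour.
events = pd.to_datetime(timestamps, format="%Y-%m-%d-%H-%M-%S")
hourly = events.dt.floor("H").value_counts().sort_index()

# Reindex over the full hourly range so hours with no events come out as 0.
full_range = pd.date_range(hourly.index.min(), hourly.index.max(), freq="H")
hourly = hourly.reindex(full_range, fill_value=0)
print(hourly)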

Related

Is there a possible way to get data sampling in continuous data batches?

We have a data stream that continuously dumps data into our data lake. Is there a good solution with minimal running cost to get a 10% random data sample from the data?
I'm currently using the code snippet below, but this will outgrow 10% total sampling as new batches arrive. I've also tried to calculate 10 batches of 100 records each with a 0.1 mean, but it resulted in ~32% sampling.
select id,
(uniform(0::float, 1::float, random(1)) < .10)::boolean as sampling
from temp_hh_mstr;
Prior to this, I thought of getting the sample via Snowflake's TABLESAMPLE, subtracting the IDs already in the sample from the total count in the table. That requires recalculating every time a new batch arrives, which will increase the cost.
Some additional references I've been thinking about:
Wilson Score Interval With Continuity Correction
Binomial Confidence Interval
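One common way to keep the sample at roughly 10% as new batches arrive is to make the decision deterministic per ID (hash the ID into a fixed bucket) instead of drawing a fresh random number each run. A minimal Python sketch of the idea, with a hypothetical record structure; the same idea can be expressed in SQL with a stable hash of the id column in place of random():

import hashlib

def in_sample(record_id: str, pct: int = 10) -> bool:
    """Deterministically decide whether an ID belongs to the ~pct% sample.

    The same ID always gets the same answer, so the overall sample stays at
    roughly pct% no matter how many new batches arrive, and is never re-drawn.
    """
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < pct

# Hypothetical batch of records keyed by id.
batch = [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]
sampled = [row for row in batch if in_sample(row["id"])]
print(sampled)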

Getting an average from a nested calculation with AVERAGEIFS in Excel

I am trying to pull performance data together from a CSV. We have a design team with a list of tasks which arrive on their list, sit there until dealt with, and then get completed. For any given day I want to calculate:
1/ The number of tasks on the list. I'm running COUNTIFS, with criteria being 1: the design team, 2: the task had arrived on the list by the given day, and 3: it had not been completed by the given day. This works OK.
2/ I also want to know the average number of days those tasks had been on the list on that given day. In the data set I have the day they arrived and the day they got completed, and I know the given day I want to calculate for. So I think I need to nest a formula that calculates (given date - arrival date) into an AVERAGEIFS formula with the same conditions as the task count.
I have set this up in a sheet using a helper column to give me the days elapsed, but in practice I want to run this for a number of given days, so I cannot add a helper column for each query.
So I have =AVERAGEIFS($G:$G,B:B,"Design",$C:$C, "<"&I$1, $E:$E, ">="&I$1)
where $G:$G is a calculation of the days elapsed: =IF(C2<>"",SUM(I$1-C2),"")
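For reference, the same calculation is easy to express outside the sheet. A minimal pandas sketch, with hypothetical column names mirroring the ones implied above (team in B, arrival date in C, completion date in E, the given day in I$1):

import pandas as pd

# Hypothetical columns mirroring the sheet: Team (B), Arrived (C), Completed (E).
tasks = pd.DataFrame({
    "Team":      ["Design", "Design", "Other"],
    "Arrived":   pd.to_datetime(["2024-02-20", "2024-02-25", "2024-02-22"]),
    "Completed": pd.to_datetime(["2024-03-05", pd.NaT, "2024-02-28"]),
})

given_day = pd.Timestamp("2024-03-01")  # the "given day" (I$1 in the sheet)

# Tasks that had arrived before the given day and were not yet completed then.
open_on_day = tasks[
    (tasks["Team"] == "Design")
    & (tasks["Arrived"] < given_day)
    & (tasks["Completed"].isna() | (tasks["Completed"] >= given_day))
]

count_open = len(open_on_day)                                        # the COUNTIFS result
avg_days_open = (given_day - open_on_day["Arrived"]).dt.days.mean()  # days elapsed, no helper column
print(count_open, avg_days_open)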

How can I set up a database with one annual dependent variable (Y) and many daily independent variables?

The dependent variable can only be measured once a year, so I have one value covering each 365-day period. The independent variables are daily or weekly (I have data for about 10 years). My goal is to predict the value of Y (one value per year) using the daily data.
I tried to do a regression with R, but did not get any significant results.
For each year in the column "dependent variable" I have only one value and 364 empty cells.
It is not possible to replace the 365 daily values of an independent variable with an average, because each day influences the result differently and I would not know how to assign such weights.
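One common way to set this up is to collapse each year's daily series into a small number of yearly features (mean, sum, sub-period means) and regress the single annual Y on those, rather than keeping 365 mostly empty cells per year. A minimal pandas sketch with hypothetical column names and placeholder values:

import pandas as pd

# Hypothetical daily data: one row per day, with daily predictors x1, x2.
daily = pd.DataFrame({
    "date": pd.date_range("2010-01-01", "2012-12-31", freq="D"),
})
daily["x1"] = range(len(daily))     # placeholder daily values
daily["x2"] = daily["x1"] * 0.5     # placeholder daily values
daily["year"] = daily["date"].dt.year

# Collapse the daily predictors into a few yearly features per year.
yearly_features = daily.groupby("year").agg(
    x1_mean=("x1", "mean"),
    x1_sum=("x1", "sum"),
    x2_mean=("x2", "mean"),
)

# Hypothetical annual dependent variable, one value per year.
y = pd.Series({2010: 1.2, 2011: 1.5, 2012: 1.1}, name="y")

# Join so each annual Y lines up with its yearly features; regress on this table.
model_table = yearly_features.join(y)
print(model_table)

More refined day-by-day weighting (e.g. distributed-lag or regularized models) is possible, but with only about 10 annual Y values you need far fewer features than days.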

SSIS Percentage sampling of 50% split isn't even

I have a stored procedure as a data source run through a RESULTS SET.
The second step is to use percentage sampling to split the data gathered 50/50. One half to go down one output, the remainder down the second output. The end result after some other tasks is two files that get uploaded to two separate destinations.
The source query is getting 11 rows of data for the day's activities in question, but the percentage sampling is splitting it as 10 rows down the Trustpilot output and 1 row down the Feefo output.
How can it not understand the concept of 50%? Is there something I'm missing?
According to Microsoft on the documentation page for this task, the specified percentage is not always the only factor in choosing which rows to send to the output.
In addition to the specified percentage, the Percentage Sampling transformation uses an algorithm to determine whether a row should be included in the sample output. This means that the number of rows in the sample output may not exactly reflect the specified percentage. For example, specifying 10 percent for an input data set that has 25,000 rows may not generate a sample with 2,500 rows; the sample may have a few more or a few less rows.
If you need a specific number of rows, you could use the Row Sampling Transformation. In this case, you'd want to get a row count of the data set and then use an expression to set the number of rows property of the Row Sampling Transformation task equal to half the row count.
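To see the difference in principle (outside SSIS), here is a small Python sketch contrasting an independent roughly-50% decision per row, which can easily give 10/1 on 11 rows, with an exact split at half the row count:

import random

rows = list(range(11))  # stand-in for the 11 rows returned by the source query

# Percentage sampling, conceptually: an independent ~50% decision per row,
# so a small input can easily come out 8/3 or even 10/1.
random.seed(1)
percentage_split = [r for r in rows if random.random() < 0.5]

# Row sampling with "number of rows = row count // 2", conceptually: an exact half.
random.shuffle(rows)
half = len(rows) // 2
first_output, second_output = rows[:half], rows[half:]
print(len(percentage_split), len(first_output), len(second_output))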

Web stats: Calculating/estimating unique visitors for arbitrary time intervals

I am writing an application which is recording some 'basic' stats -- page views, and unique visitors. I don't like the idea of storing every single view, so have thought about storing totals with a hour/day resolution. For example, like this:
Tuesday 500 views 200 unique visitors
Wednesday 400 views 210 unique visitors
Thursday 800 views 420 unique visitors
Now, I want to be able to query this data set on chosen time periods -- ie, for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is how do I determine or estimate unique visitors for any time period without storing each individual hit. Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for every time period!?
I can't seem to find any useful information on the net about this. My initial instinct is that I would need to store 2 sets of values with different resolutions (ie day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
Thanks,
Brendon.
If you are OK with approximations, I think tom10 is onto something, but his notion of a random subsample is not the right one, or needs a clarification. If I have a visitor that comes on day 1 and day 2, but is sampled only on day 2, that is going to introduce a bias in the estimation. What I would do is store full information for a random subsample of users (say, all users whose hash(id) % 100 == 1). Then you do the full calculations on the sampled data and multiply by 100.

Yes, tom10 said just about that, but there are two differences: he said "for example" sample based on the ID, and I say that's the only way you should sample, because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever, you would sample accordingly. The quality of the estimation can be assessed using the normal approximation to the binomial if your sample is big enough.

Beyond this, you can try to use a model of user loyalty: for example, you observe that over 2 days 10% of visitors visit on both days, over 3 days 11% of visitors visit twice and 5% visit once, and so forth, up to a maximum number of days. These numbers unfortunately can depend on the time of week and the season, and even if you model those, loyalty changes over time as the user base matures and changes in composition, and as the service changes as well, so any model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.
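A minimal sketch of that sampling idea, with hypothetical visitor IDs and a stable hash so the same user is always either in or out of the sample:

import zlib

# Store full visit history only for users whose stable hash lands in a fixed 1% bucket.
def sampled(user_id: str) -> bool:
    return zlib.crc32(user_id.encode("utf-8")) % 100 == 1

# Hypothetical raw logs: (day, user_id) pairs; only the sampled users are actually stored.
# (With five toy hits the sample will usually be empty; in practice this runs over many IDs.)
raw_hits = [("mon", "u1"), ("mon", "u2"), ("tue", "u2"), ("tue", "u3"), ("wed", "u4")]
stored = [(day, uid) for day, uid in raw_hits if sampled(uid)]

# Unique visitors over any chosen period, estimated from the sample and scaled up by 100.
period = {"mon", "tue"}
unique_in_sample = len({uid for day, uid in stored if day in period})
estimated_unique_visitors = unique_in_sample * 100
print(estimated_unique_visitors)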
You could store a random subsample of the data, for example, 10% of the visitor IDs, then compare these between days.
The easiest way to do this is to store a random subsample of each day for future comparisons, but then, for the current day, temporarily store all your IDs and compare them to the subsampled historical data and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full dataset for a given day and not comparing two subsamples -- it's possible to compare two subsamples and get an estimate for the total but the math would be a bit trickier.)
You don't need to store every single view, just each unique session ID per hour or day depending on the resolution you need in your stats.
You can keep these log files containing session IDs sorted to count unique visitors quickly, by merging multiple hours/days. One file per hour/day, one unique session ID per line.
In *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.
You can calculate the uniqueness factor (UF) for each day and use it to calculate the composite UF (for a week, for example).
Let's say that you counted:
100 visits and 75 unique session IDs on Monday (you have to store the session IDs at least for a day, or for whatever period you use as the unit).
200 visits and 100 unique session IDs on Tuesday.
If you want to estimate the UF for the period Mon+Tue you can do:
UV = UVmonday + UVtuesday = TVmonday*UFmonday + TVtuesday*UFtuesday
where:
UV = Unique Visitors
TV = Total Visits
UF = Uniqueness Factor
So...
UV = (Sum(TVi*UFi))
UF = UV / TV
TV = Sum(TVi)
I hope it helps...
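For concreteness, the same arithmetic with the numbers above, as a small Python sketch:

# Worked example using the numbers from this answer.
tv = {"mon": 100, "tue": 200}            # total visits per day
uv = {"mon": 75,  "tue": 100}            # unique session IDs per day
uf = {d: uv[d] / tv[d] for d in tv}      # per-day uniqueness factor

composite_uv = sum(tv[d] * uf[d] for d in tv)    # 100*0.75 + 200*0.5 = 175
composite_uf = composite_uv / sum(tv.values())   # 175 / 300 ≈ 0.58
print(composite_uv, composite_uf)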
Note that this math counts the same person visiting on two different days as two unique visitors. I think that's OK if the only way you have to identify somebody is via the session ID.
