How to analyze/display raw web analytics data? - sql-server

I've created a web tracking system that simply inserts event information (a click or page view) into a simple SQL Server table:
Column | Type | NULL?
-------------------------------------
RequestId | bigint | NOT NULL
PagePath | varchar(50) | NOT NULL
EventName | varchar(50) | NULL
Label | varchar(50) | NULL
Value | float | NULL
UserId | int | NOT NULL
LoggedDate | datetime | NOT NULL
How can I harvest/analyze/display this raw information?

First decide which trends you are most interested in. It may help to look at some existing web analytics software (there is free software available) to see what options exist.
If your requirements are simple, you have enough data. If you want a breakdown of which countries are accessing your website, you need to log IP addresses and get a database that ties IP ranges to countries - these are not 100% reliable but will get you fairly good accuracy.
Some simple examples of reporting you can do with your current data:
Number of hits per hour, day, week, month
Top 20 accessed pages
Top Users
Number of users accessing the site per hour, day, week, month
etc.
Most of these you can pull with a single SQL query using the group by clause and date functions.
Example MS SQL Server query to achieve hits per day (untested):
SELECT COUNT(RequestID) AS NumberOfHits,
YEAR(LoggedDate) AS EventYear,
MONTH(LoggedDate) AS EventMonth,
DAY(LoggedDate) AS EventDay
FROM MyTable
GROUP BY YEAR(LoggedDate), MONTH(LoggedDate), DAY(LoggedDate)
ORDER BY YEAR(LoggedDate), MONTH(LoggedDate), DAY(LoggedDate)
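As a rough, runnable sketch of the same idea, the reports listed above can be tried out against an in-memory SQLite database from Python. SQLite is only a stand-in here: its date() function replaces the YEAR/MONTH/DAY calls used on SQL Server, LIMIT replaces TOP, and the sample rows are invented purely for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        RequestId  INTEGER NOT NULL,
        PagePath   TEXT    NOT NULL,
        EventName  TEXT,
        Label      TEXT,
        Value      REAL,
        UserId     INTEGER NOT NULL,
        LoggedDate TEXT    NOT NULL
    )
""")

# Invented sample events, for illustration only.
rows = [
    (1, "/home",    "pageview", None,     None, 10, "2024-01-01 09:15:00"),
    (2, "/home",    "click",    "signup", 1.0,  10, "2024-01-01 09:16:00"),
    (3, "/pricing", "pageview", None,     None, 11, "2024-01-01 17:40:00"),
    (4, "/home",    "pageview", None,     None, 12, "2024-01-02 08:05:00"),
    (5, "/docs",    "pageview", None,     None, 10, "2024-01-02 11:30:00"),
]
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Hits and distinct users per day.
hits_per_day = conn.execute("""
    SELECT date(LoggedDate)       AS EventDate,
           COUNT(*)               AS NumberOfHits,
           COUNT(DISTINCT UserId) AS NumberOfUsers
    FROM MyTable
    GROUP BY date(LoggedDate)
    ORDER BY EventDate
""").fetchall()

# Top 20 accessed pages (ties broken alphabetically).
top_pages = conn.execute("""
    SELECT PagePath, COUNT(*) AS NumberOfHits
    FROM MyTable
    GROUP BY PagePath
    ORDER BY NumberOfHits DESC, PagePath
    LIMIT 20
""").fetchall()

print(hits_per_day)
print(top_pages)
```

Grouping by hour, week or month follows the same pattern with a different grouping expression.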

Maybe Logparser is sufficient for your needs: http://www.microsoft.com/downloads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en

Related

How to forecast count based on a day?

I am new to SQL Server world. I have a table as below:
alert_id | create_date | Status
---------+-------------+---------
1231 | 4/15/2017 | Open
1232 | 4/15/2017 | Open
1234 | 4/15/2017 | Closed
1235 | 4/16/2017 | Open
All of these alerts should be closed in 30 days. I need to get a forecast report which shows how many alerts are open for past 30 days.
I would like to write a select query whose output would be 2 columns. First would be Date and 2nd would be count. The date column should display all the dates for next 30 days and Count column should display the number of records which are due to expire on that day. Something like below would work. Please assist.
date | Count
----------+---------
5/15/2017 | 2
5/16/2017 | 3
5/17/2017 | 0
5/18/2017 | 0
.
.
.
6/14/2017 | 0
This is a job for GROUP BY and date arithmetic. In MySQL:
SELECT DATE(create_date) + INTERVAL 30 DAY expire_date, COUNT(*) num
FROM tbl
WHERE status = 'Open'
GROUP BY DATE(create_date)
DATE(create_date) + INTERVAL 30 DAY gets you the create date values with thirty days added.
GROUP BY DATE(create_date) groups your data by values of your create date, truncated to midnight.
And, COUNT(*) goes with GROUP BY to tell you how many records in each group.
Edit: In recent versions of Microsoft SQL Server:
SELECT DATEADD(day, 30, CAST(create_date AS DATE)) expire_date, COUNT(*) num
FROM tbl
WHERE status = 'Open'
GROUP BY CAST(create_date AS DATE)
Notice, please, that date arithmetic varies between make and model of SQL server software. That's why you get hassled by Stack Overflow users in comments when you use more than one tag like [oracle] [mysql] [sql-server] on your questions.
Cool, huh? You should read up on aggregate queries, sometimes called summary queries.
These queries won't give you the missing dates with zero counts, though. That's quite a bit harder to do with SQL.
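One common way to fill in the missing dates is to generate a calendar of the 30 days and LEFT JOIN the grouped data onto it. The following is a minimal sketch of that technique using an in-memory SQLite database from Python; the recursive CTE and SQLite's date() function are stand-ins, and on SQL Server the same shape would use DATEADD and a recursive CTE or a calendar table. The start date 2017-05-15 is an assumption taken from the sample output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (alert_id INTEGER, create_date TEXT, status TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1231, "2017-04-15", "Open"),
    (1232, "2017-04-15", "Open"),
    (1234, "2017-04-15", "Closed"),
    (1235, "2017-04-16", "Open"),
])

# Build a 30-day calendar, then LEFT JOIN the open alerts onto it so that
# days with no expiring alerts still appear, with a count of 0.
forecast = conn.execute("""
    WITH RECURSIVE calendar(expire_date) AS (
        SELECT :start
        UNION ALL
        SELECT date(expire_date, '+1 day')
        FROM calendar
        WHERE expire_date < date(:start, '+29 days')
    )
    SELECT c.expire_date, COUNT(t.alert_id) AS num
    FROM calendar AS c
    LEFT JOIN tbl AS t
           ON date(t.create_date, '+30 days') = c.expire_date
          AND t.status = 'Open'
    GROUP BY c.expire_date
    ORDER BY c.expire_date
""", {"start": "2017-05-15"}).fetchall()

for expire_date, num in forecast[:4]:
    print(expire_date, num)
```

COUNT(t.alert_id) rather than COUNT(*) is what produces the zeros: it counts only rows where the join actually found an alert.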

Subquery for calculated field giving invalid argument to function error

I have a table with a list of stores and attributes that dictate the age of the store in weeks and the order volume of the store. The second table lists the UPLH goals based on age and volume. I want to return the stores listed in the first table along with its associated UPLH goal. The following works correctly:
SELECT store, weeksOpen, totalItems,
(
SELECT max(UPLH)
FROM uplhGoals as b
WHERE b.weeks <= a.weeksOpen AND 17000 between b.vMIn and b.vmax
) as UPLHGoal
FROM weekSpecificUPLH as a
But this query, which is replacing the hard coded value of totalItems with the field from the first table, gives me the "Invalid argument to function" error.
SELECT store, weeksOpen, totalItems,
(
SELECT max(UPLH)
FROM uplhGoals as b
WHERE b.weeks <= a.weeksOpen AND a.totalItems between b.vMIn and b.vmax
) as UPLHGoal
FROM weekSpecificUPLH as a
Any ideas why this doesn't work? Are there any other options? I can easily use a dmax() and cycle through every record to create a new table, but that seems the long way around for something that a query should be able to produce.
SQLFiddle: http://sqlfiddle.com/#!9/e123a8/1
It appears that the SQLFiddle output (below) is what I was looking for, even though Access gives the error.
| store | weeksOpen | totalItems | UPLHGoal |
|-------|-----------|------------|----------|
| 1 | 15 | 13000 | 30 |
| 2 | 37 | 4000 | 20 |
| 3 | 60 | 10000 | 30 |
EDIT:
weekSpecificUPLH is a query, not a table. If I create a new test table in Access with identical fields, it works. This would indicate to me that it has something to do with the [totalItems] field, which is actually a calculated result. So instead I replaced that field with [a.IPO * a.OPW]. Same error. It's as if it's not treating it as the correct type of number.
I've tried:
SELECT store, weeksOpen, (opw * ipo) as totalItems,
(
SELECT max(UPLH)
FROM uplhGoals as b
WHERE 17000 between b.vMIn and b.vmax AND b.weeks <= a.weeksOpen
) as UPLHGoal
FROM weekSpecificUPLH as a
which works. But replace the '17000' with 'totalitems' and I get the same error. I even tried using val(totalItems), to no avail.
Try turning the BETWEEN into a pair of explicit comparisons (inclusive, to match BETWEEN):
b.vmin <= a.totalItems AND b.vmax >= a.totalItems
That said, there are questions about your DB design.
For future questions, it would be very helpful if you showed your DB structure.
For example, it seems the records in the weekSpecificUPLH table are not related to the records in the UPLHGoals table, are they?
Or, more generally: these tables are not related in any way except through rules described by the data itself in the goals table (which is "external" to the DB model).
Thus, when you call it "associated" you get yourself and others confused, I presume, because everyone immediately starts thinking of a classical relation in terms of the relational model.
Something was changing the data type of totalItems. To solve it, I:
Copied the weekSpecificUPLH query results to a new table, 'tempUPLH'
Used that table in place of the query, which correctly pulled the UPLHGoal from the 'uplhGoals' table
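For reference, here is a small sketch of the correlated subquery running against plain tables, using an in-memory SQLite database from Python rather than Access. The store rows come from the output shown in the question; the uplhGoals rows are hypothetical, invented only so that the query has something to match. The sketch also checks that BETWEEN and the explicit inclusive comparisons (`<=` and `>=`) select the same goals. It does not reproduce the Access type problem, which depends on weekSpecificUPLH being a query with a calculated field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weekSpecificUPLH (store INTEGER, weeksOpen INTEGER, totalItems INTEGER);
    CREATE TABLE uplhGoals (weeks INTEGER, vMin INTEGER, vMax INTEGER, UPLH INTEGER);
    INSERT INTO weekSpecificUPLH VALUES (1, 15, 13000), (2, 37, 4000), (3, 60, 10000);
    -- Hypothetical goal rows, not taken from the original fiddle.
    INSERT INTO uplhGoals VALUES (0, 0, 4999, 10), (26, 0, 4999, 20), (0, 5000, 20000, 30);
""")

QUERY = """
    SELECT store, weeksOpen, totalItems,
           (SELECT MAX(UPLH)
            FROM uplhGoals AS b
            WHERE b.weeks <= a.weeksOpen AND {range_test}) AS UPLHGoal
    FROM weekSpecificUPLH AS a
    ORDER BY store
"""

# Same range test written two ways: BETWEEN is inclusive at both ends.
with_between = conn.execute(
    QUERY.format(range_test="a.totalItems BETWEEN b.vMin AND b.vMax")
).fetchall()
with_comparisons = conn.execute(
    QUERY.format(range_test="b.vMin <= a.totalItems AND b.vMax >= a.totalItems")
).fetchall()

print(with_between)
print(with_comparisons)
```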

Pivottable on SSAS cube: hide rows without result on measure

I'm having trouble hiding rows that have no data for certain dimension members for the selected measure, however there are rows for that member in the measuregroup.
Consider the following datasource for the measuregroup:
+---------+----------+----------+----------+--------+
| invoice | customer | subtotal | shipping | total |
+---------+----------+----------+----------+--------+
| 1 | a | 12.95 | 2.50 | 15.45 |
| 2 | a | 7.50 | | 7.50 |
| 3 | b | 125.00 | | 125.00 |
+---------+----------+----------+----------+--------+
When trying to create a pivottable based on a measuregroup in a SSAS-cube, this might result in the following:
However, I would like to hide the row for Customer b, since there are no results in the given pivottable. I have tried using Pivottable Options -> Display -> Show items with no data on rows but that only works for showing/hiding a Customer that's not at all referenced in the given measuregroup, say Customer c.
Is there any way to hide rows without results for the given measure, without creating a separate measuregroup WHERE shipping IS NOT NULL?
I'm using SQL-server 2008 and Excel 2013
Edit:
For clarification, I'm using Excel to connect to the SSAS cube. The resulting pivottable (in Excel) looks like the given image.
In the DSV find the table with the shipping column and add a calculated column with expression:
Case when shipping <> 0 then shipping end
Please go to the properties window for the Shipping measure in the cube designer in BIDS and change the NullHandling property to Preserve. Change the source column to the new calculated column. Then redeploy and I am hopeful that row in your pivot will disappear.
And ensure Pivottable Options -> Display -> Show items with no data on rows is unchecked.
If that still doesn't do it, connect the Object Explorer window of Management Studio to SSAS and tell us what version number it shows in the server node. It could be that you aren't on the latest service pack and this is a bug.
Go in your pivot table in Excel, click on the dropdown on your customer column. In the dropdown menu you will find an option called Value Filters. There you can set something like Shipping greater than 0.1.
This filters your Customer dimension column based on the measure values.
Maybe something like this (but I don't see the pivot...)
DECLARE @tbl TABLE(invoice INT,customer VARCHAR(10),subtotal DECIMAL(8,2),shipping DECIMAL(8,2),total DECIMAL(8,2));
INSERT INTO @tbl VALUES
(1,'a',12.95,2.50,15.45)
,(2,'a',7.50,NULL,7.50)
,(3,'b',125.00,NULL,125.00);
SELECT customer, SUM(shipping) AS SumShipping
FROM @tbl
GROUP BY customer
HAVING SUM(shipping) IS NOT NULL

Possible to query a database into excel on a cell by cell basis? Or another solution..?

I have various large views/stored procedures that basically churn out a lot of data into an Excel spreadsheet. There was a problem where not all of the company amounts were flowing through. I narrowed it down to a piece of code in a stored procedure (note this is cut down for simplicity):
LEFT OUTER JOIN view_creditrating internal_creditrating
ON creditparty.creditparty =
internalrating.company
LEFT OUTER JOIN (SELECT company, contract, SUM(amount) amount
FROM COMMON_OBJ.amount
WHERE status = 'Active'
GROUP BY company, contract) col
ON vd.contract = col.contract
Table with issue:
company | contract | amount
--------+----------+--------
TVC     | NULL     | 1006
KS      | 10070    | -2345
NYC-G   | 10060    | 334000
NYC-G   | 100216   | 4000
UECR    | NULL     | 0
SP      | 10090    | 84356
Basically, some of the contracts are NULL. So when there is a LEFT OUTER JOIN on contract, the NULL values in contract drop out and don't flow through. So I decided to join on company instead.
This also causes problems because company appears in the table more than once in order to show different contracts. With this change the query becomes ambiguous because it doesn't know whether I want contract 10060's amount or contract 100216's amount, and more often than not it gives me the incorrect amount. I thought about leaving the final ON clause as company = company; this causes the fewest issues. I would then somehow query directly for each cell value that is inconsistent, because it only affects a few cells. I've searched, though, and I don't think this is possible.
Is this possible? Or is there another way to fix this on the database end?
As you've worked out, the problem is in the ON clause, and its use of NULL.
One way to alter the NULL to be a value you can match against is to use COALESCE, which would alter the clause to:
ON coalesce(vd.contract,'No Contract') = coalesce(col.contract,'No Contract')
This will turn all NULLs into 'No Contract', which changes the NULL = NULL test (which evaluates to NULL, so the rows never match) into 'No Contract' = 'No Contract', which evaluates to true.
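The effect can be seen in a small sketch, run here against an in-memory SQLite database from Python. Several things in it are assumptions rather than part of the original schema: the table name amounts, the contents of vd, the numeric sentinel -1 (used instead of 'No Contract' because contract is modelled as an integer, and assumed never to be a real contract number), and the extra company condition, which is added so that NULL contracts belonging to different companies do not match each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vd (company TEXT, contract INTEGER);
    CREATE TABLE amounts (company TEXT, contract INTEGER, amount REAL, status TEXT);
    INSERT INTO vd VALUES
        ('TVC', NULL), ('NYC-G', 10060), ('NYC-G', 100216), ('UECR', NULL);
    INSERT INTO amounts VALUES
        ('TVC',   NULL,   1006,   'Active'),
        ('NYC-G', 10060,  334000, 'Active'),
        ('NYC-G', 100216, 4000,   'Active'),
        ('UECR',  NULL,   0,      'Active');
""")

QUERY = """
    SELECT vd.company, vd.contract, col.amount
    FROM vd
    LEFT OUTER JOIN (SELECT company, contract, SUM(amount) AS amount
                     FROM amounts
                     WHERE status = 'Active'
                     GROUP BY company, contract) AS col
           ON {on_clause}
    ORDER BY vd.company, vd.contract
"""

# Plain equality: NULL = NULL is not true, so NULL contracts find no match.
plain = conn.execute(QUERY.format(
    on_clause="vd.contract = col.contract")).fetchall()

# NULL-safe join: map NULL to a sentinel on both sides and also match company.
null_safe = conn.execute(QUERY.format(
    on_clause="vd.company = col.company "
              "AND COALESCE(vd.contract, -1) = COALESCE(col.contract, -1)")).fetchall()

print(plain)
print(null_safe)
```

With the plain join, the TVC and UECR rows come back with no amount; with the NULL-safe join, each picks up its own company's amount.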

Detecting Correlated Columns in Data

Suppose I have the following data:
OrderNumber | CustomerName | CustomerAddress | CustomerCode
1 | Chris | 1234 Test Drive | 123
2 | Chris | 1234 Test Drive | 123
How can I detect that the columns "CustomerName", "CustomerAddress", and "CustomerCode" all correlate perfectly? I'm thinking that SQL Server data mining is probably the right tool for the job, but I don't have much experience with it.
Thanks in advance.
UPDATE:
By "correlate", I mean in the statistics sense: whenever column a is x, column b will be y. In the above data, the last three columns correlate with each other, and the first column does not.
The input of the operation would be the name of the table, and the output would be something like :
Column 1 | Column 2 | Certainty
CustomerName | CustomerAddress | 100%
CustomerAddress | CustomerCode | 100%
There is a 'functional dependency' test built in to the SQL Server Data Profiling component (which is an SSIS component that ships with SQL Server 2008). It is described pretty well on this blog post:
http://blogs.conchango.com/jamiethomson/archive/2008/03/03/ssis-data-profiling-task-part-7-functional-dependency.aspx
I have played a little bit with accessing the data profiler output via some (under-documented) .NET APIs and it seems doable. However, since my requirement dealt with the distribution of column values, I ended up going with something much simpler based on the output of DBCC SHOW_STATISTICS. I was quite impressed by what I saw of the profiler component and the output viewer.
What do you mean by correlate? Do you just want to see if they're equal? You can do that in T-SQL by joining the table to itself:
select distinct
case when a.OrderNumber < b.OrderNumber then a.OrderNumber
else b.OrderNumber
end as FirstOrderNumber,
case when a.OrderNumber < b.OrderNumber then b.OrderNumber
else a.OrderNumber
end as SecondOrderNumber
from
MyTable a
inner join MyTable b on
a.CustomerName = b.CustomerName
and a.CustomerAddress = b.CustomerAddress
and a.CustomerCode = b.CustomerCode
This would return you:
FirstOrderNumber | SecondOrderNumber
1 | 2
Correlation is defined on metric spaces, and your values are not metric.
This will give you the fraction of customers whose customerAddress is not uniquely determined by customerName:
SELECT AVG(imperfect)
FROM (
SELECT
customerName,
-- 0 when the name maps to a single address, 1 when it maps to several;
-- decimal literals keep AVG from doing integer arithmetic
CASE
WHEN COUNT(DISTINCT customerAddress) <= 1
THEN 0.0
ELSE 1.0
END AS imperfect
FROM orders
GROUP BY
customerName
) q
Substitute other columns instead of customerAddress and customerName into this query to find discrepancies between them.
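To get something closer to the pairwise report asked for in the question, the per-group check can be run for every ordered pair of columns. This is a minimal sketch using an in-memory SQLite database from Python; the function name dependency_strength and the report dictionary are hypothetical names made up for the example, and the score is simply the fraction of distinct values in one column that map to exactly one value in the other.

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        OrderNumber INTEGER, CustomerName TEXT,
        CustomerAddress TEXT, CustomerCode INTEGER
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "Chris", "1234 Test Drive", 123),
    (2, "Chris", "1234 Test Drive", 123),
])


def dependency_strength(conn, table, determinant, dependent):
    """Fraction of distinct `determinant` values that map to one `dependent` value.

    Identifiers are interpolated into the SQL text, so pass only trusted names.
    """
    sql = f"""
        SELECT AVG(CASE WHEN n <= 1 THEN 1.0 ELSE 0.0 END)
        FROM (
            SELECT COUNT(DISTINCT "{dependent}") AS n
            FROM "{table}"
            GROUP BY "{determinant}"
        )
    """
    return conn.execute(sql).fetchone()[0]


columns = ["OrderNumber", "CustomerName", "CustomerAddress", "CustomerCode"]
report = {
    (a, b): dependency_strength(conn, "orders", a, b)
    for a, b in permutations(columns, 2)
}

for (a, b), score in report.items():
    print(f"{a} -> {b}: {score:.0%}")
```

Note that the relationship is directional: a unique column such as OrderNumber trivially determines every other column, while CustomerName does not determine OrderNumber. Only pairs that score 100% in both directions match the "correlate perfectly" sense used in the question.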
