What are some suggested LogParser queries to run to detect sources of high network traffic? - logparser

In looking at the network in/out metrics for our AWS/EC2 instance, I would like to find the sources of the high network traffic occurrences.
I have installed Log Parser Studio and run a few queries, primarily looking for responses that took a while:
SELECT TOP 10000 * FROM '[LOGFILEPATH]' WHERE time-taken > 1000
I am also targeting time spans that cover when the network in/out spikes have occurred:
WHERE [date] BETWEEN TIMESTAMP('2013-10-20 02:44:00', 'yyyy-MM-dd hh:mm:ss')
AND TIMESTAMP('2013-10-20 02:46:00', 'yyyy-MM-dd hh:mm:ss')
One issue is that the log files are 2-7 gigs (targeting single files per query). In trying Log Parser Lizard, it crashed with an out of memory exception on large files (boo).
What are some other queries, and methodologies I should follow to identify the source of the high network traffic, which would hopefully help me figure out how to plug the hole?

One function that may be of particular use to you is the QUANTIZE() function. This allows you to aggregate stats for a period of time thus allowing you to see spikes in a given time period. Here is one query I use that allows me to see when we get scanned:
SELECT QUANTIZE(TO_LOCALTIME(TO_TIMESTAMP(date, time)), 900) AS LocalTime,
COUNT(*) AS Hits,
SUM(sc-bytes) AS TotalBytesSent,
DIV(MUL(1.0, SUM(time-taken)), Hits) AS LoadTime,
SQRROOT(SUB(DIV(MUL(1.0, SUM(SQR(time-taken))), Hits), SQR(LoadTime))) AS StandardDeviation
FROM '[LOGFILEPATH]'
GROUP BY LocalTime
ORDER BY LocalTime
I usually output this to a .csv file and then chart in Excel to visually see where a period of time is out of normal range. This particular query breaks things down to 15 min segments based on the 900 passed to QUANTIZE. The TotalBytesSent, LoadTime and StandardDeviation allow me to see other aberrations in downloaded content or response times.
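If it helps to see what the bucketing is doing, here is a rough Python sketch of the same idea. It is not Log Parser code: the `quantize` and `summarize` helpers and the tuple layout are made up for illustration, and it assumes timestamps are already epoch seconds. The standard deviation uses the same formula as the query, the square root of the mean of squares minus the squared mean.

```python
import math
from collections import defaultdict


def quantize(epoch_seconds, bucket=900):
    """Round a timestamp down to the start of its bucket (900 s = 15 min)."""
    return epoch_seconds - (epoch_seconds % bucket)


def summarize(rows, bucket=900):
    """rows: iterable of (epoch_seconds, sc_bytes, time_taken) tuples (hypothetical layout)."""
    # Per bucket: hits, total bytes, sum of time-taken, sum of squared time-taken.
    acc = defaultdict(lambda: [0, 0, 0, 0])
    for ts, sc_bytes, taken in rows:
        a = acc[quantize(ts, bucket)]
        a[0] += 1
        a[1] += sc_bytes
        a[2] += taken
        a[3] += taken * taken

    out = {}
    for start, (hits, total_bytes, s, s2) in sorted(acc.items()):
        mean = s / hits
        # Clamp at zero to guard against tiny negative values from rounding.
        std = math.sqrt(max(s2 / hits - mean * mean, 0.0))
        out[start] = {"hits": hits, "bytes": total_bytes, "load_time": mean, "std": std}
    return out


sample = [(0, 100, 10), (899, 300, 30), (900, 50, 20)]
result = summarize(sample)
```

The first two sample rows land in the bucket starting at 0 and the third in the bucket starting at 900, which is the kind of per-period row you would then chart.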
Another thing to look at is the number of requests a particular client has made to your site. The following query can help identify scanning or DoS activity coming in:
SELECT DISTINCT c-ip AS ClientIP,
COUNT(*) AS Hits,
PROPCOUNT(*) AS Percentage
FROM '[LOGFILEPATH]'
GROUP BY ClientIP
HAVING (Hits > 50)
ORDER BY Percentage DESC
Adjusting the HAVING clause sets the minimum number of requests an IP needs to make before it shows up. Depending on the activity and the WHERE clause, 50 may be too low. The PROPCOUNT() function gives the share of the overall total that a particular field value accounts for. In this case, it gives the share of all requests to the site that came from a particular IP. Typically this will surface the IP addresses of search engines as well, but those are pretty easy to weed out.
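For anyone who wants to check the numbers outside Log Parser, the same per-client report can be sketched in a few lines of Python. This is only an illustration: `top_clients` and the sample addresses are invented, and the share is returned as a fraction of all requests rather than formatted as a percentage.

```python
from collections import Counter


def top_clients(client_ips, min_hits=50):
    """Return (ip, hits, share) for clients with more than min_hits requests,
    sorted by their share of all requests, largest first."""
    counts = Counter(client_ips)
    total = sum(counts.values())
    rows = [
        (ip, hits, hits / total)
        for ip, hits in counts.items()
        if hits > min_hits  # plays the role of the HAVING clause
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical sample: ten requests from three clients.
ips = ["10.0.0.1"] * 6 + ["10.0.0.2"] * 3 + ["10.0.0.3"]
report = top_clients(ips, min_hits=2)
```

Note that the share is computed against all requests, not just the ones that pass the threshold, so the shares in the report do not have to add up to 1.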
I hope that gives you some ideas on what you can do.


Best way to handle time consuming queries in InfluxDB

We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in a cache.
The most time-consuming part of the query is probably comparison and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In cases like this, where users select a range in days, I would suggest having another table that stores the result (min, max and avg) for each day as a document. This table can be populated by a job that runs after the end of each day.
You can also consider changing the document granularity from per day to per week or per month, based on how you plot the values. You can also add more fields, such as tagname in your case.
Why this is superior to using a cache: a cache can only store the result of a query that has already run, so every new combination still has to be computed in real time. In this case, however, the cumulative results are already available, leaving a much smaller dataset to compute over.
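To make the idea concrete, here is a small Python sketch of combining per-day summaries into a report for an arbitrary date range. Everything in it is hypothetical (`DailyRollup`, `rollup`, `report`, the sample values), and it makes one assumption beyond what is described above: each day stores a count and a sum rather than the average itself, so that the average over several days is weighted correctly.

```python
from dataclasses import dataclass


@dataclass
class DailyRollup:
    day: str        # ISO date such as "2024-01-01"; ISO dates sort correctly as strings
    count: int
    total: float
    minimum: float
    maximum: float


def rollup(day, values):
    """What the end-of-day job would store for one day of raw values."""
    return DailyRollup(day, len(values), sum(values), min(values), max(values))


def report(rollups, start_day, end_day):
    """Combine the rollups with start_day <= day < end_day into one summary."""
    chosen = [r for r in rollups if start_day <= r.day < end_day]
    if not chosen:
        return None
    count = sum(r.count for r in chosen)
    return {
        "mean": sum(r.total for r in chosen) / count,
        "min": min(r.minimum for r in chosen),
        "max": max(r.maximum for r in chosen),
    }


days = [
    rollup("2024-01-01", [1.0, 3.0]),
    rollup("2024-01-02", [5.0]),
    rollup("2024-01-03", [100.0]),
]
summary = report(days, "2024-01-01", "2024-01-03")
```

A report over eight weeks then touches 56 small rows instead of every raw point, whatever start and end dates the user picks.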
Based on your query, I assume you are using InfluxDB v1.X. You could try Continuous Queries which are InfluxQL queries that run automatically and periodically on realtime data and store query results in a specified measurement.
In your case, for each report, you could generate a CQ and let your users query it.
Step 1: create a CQ
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
WHERE "field" = 'myFieldValue' -- note that the time filter is not here
GROUP BY time(1h), "tagname" -- here you can define the job interval
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.

Flink DataStream - execute SQL query on a window, do orderBy

I'm simulating a streaming task using Flink DataStream and I want to execute an SQL query on each window.
Let's say this is the query
SELECT name, age, sum(days), avg(salary)
FROM employees
WHERE age > 25
GROUP BY name, age
ORDER BY name, age
I'm having a hard time translating it to Flink. From my understanding, to calculate the average I need to do it manually using .apply() and a WindowFunction. But how do I calculate the sum then? Also manually in the same WindowFunction?
I'm also wondering if it is possible to do order by on the whole window?
Below is the pseudocode of what I thought of so far. Any help would be appreciated! Thanks!
.filter(new FilterFunction() ....) // where clause
.keyBy(nameIndex, ageIndex) // group by??
.timeWindow(Time.seconds(10), Time.seconds(1))
.apply(new WindowFunction() ....) // calculate average (and sum?)
// order by??
I checked the Table API, but it seems that not many operations are supported for streaming, e.g. orderBy.
Ordering in streaming is not trivial. How do you want to sort something that is never ending? In your example you want to calculate an average or a sum, which is just one value per window. You cannot sort one value.
Another possibility is to buffer all values and wait for an indicator of completeness to start sorting. Thanks to event-time and watermarks, it is possible to sort a stream if you know that you have seen all values until a certain time (aka watermarks).
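To illustrate the buffering idea, here is a toy sketch in plain Python. It is not Flink code and does not use any Flink API; `EventTimeSorter` and its methods are made up for the example. Events are held in a min-heap keyed by event time, and a watermark releases everything at or before it, in order.

```python
import heapq


class EventTimeSorter:
    """Buffer out-of-order events and release them sorted once a watermark passes."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal timestamps never compare payloads

    def add(self, timestamp, value):
        heapq.heappush(self._heap, (timestamp, self._seq, value))
        self._seq += 1

    def advance_watermark(self, watermark):
        """Return, in timestamp order, every buffered event with timestamp <= watermark."""
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            timestamp, _, value = heapq.heappop(self._heap)
            ready.append((timestamp, value))
        return ready


sorter = EventTimeSorter()
for ts, name in [(5, "e"), (2, "b"), (9, "i"), (1, "a")]:
    sorter.add(ts, name)

first = sorter.advance_watermark(5)    # events with timestamps 1, 2 and 5
sorter.add(7, "g")                     # arrives late, but still after watermark 5
second = sorter.advance_watermark(10)  # events with timestamps 7 and 9
```

The sketch only works because the watermark promises that nothing older will show up afterwards; an event arriving with a timestamp below an already emitted watermark would come out too late to be sorted into place.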
Event-time sort has been introduced recently and will be part of Flink 1.4 Table API. See here for an example.

Tableau – Using Nested Aggregations to Establish a Weekday/Hour Baseline

Background Information: We have an incident time tracker that tracks how long each user spends with a representative before the issue can be closed. We want to determine the average volume of incidents that are being handled for each hour. To put this another way: we want to get an hourly baseline for each day of the week that will show us the average total call length within the specific time period. E.g., we want to average the total length of every call on Monday from 9AM-10AM for all the weeks in the database, and the same for other hourly intervals.
The simplest way to think of this is that I want AVG(SUM) for the specific time periods, but Tableau does not allow me to do this.
Tableau Output:
This is the desired, target visualization that I am looking for from Tableau.
SQL Query:
I have written a SQL query that returns the answer:
We are looking at two columns: start_time (timestamp) and interval_seconds (float).
In the inner query I use the hour_start function which truncates the date/time value to the hour start, so I can group by the hour and day of the week in the outer query.
SQL Results:
Is there a way to solve this problem ENTIRELY in Tableau that would get me the result that I am looking for without having to write any SQL code?
Files Stored on Drive
CSV File:
Tableau Worksheet:
You can use Level of Detail expressions to compute the SUM(interval_seconds) at the hour level and then use AVG to calculate the number you are looking for.
I created a couple of calculations:
hour which is defined as: DATETRUNC('hour',[start_time])
this should be equivalent to your hour_start(start_time).
and interval_hours which is defined as {FIXED [hour] : SUM([interval_seconds])/3600 }
This calculates the aggregate for each start_time truncated to the hour.
After this, you simply calculate AVG(interval_hours) and use it in your view.
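As a cross-check on the logic (sum per truncated hour first, then average those sums per weekday and hour of day), here is a small Python sketch. It is not Tableau syntax; `hourly_baseline` and the sample calls are invented, and weekdays are numbered with Monday as 0.

```python
from collections import defaultdict
from datetime import datetime


def hourly_baseline(records):
    """records: iterable of (start_time: datetime, interval_seconds: float).

    Returns {(weekday, hour_of_day): average of the per-hour totals, in hours}.
    """
    # Step 1: total per calendar hour, like {FIXED [hour] : SUM(...)/3600}.
    per_hour = defaultdict(float)
    for start_time, seconds in records:
        hour = start_time.replace(minute=0, second=0, microsecond=0)
        per_hour[hour] += seconds / 3600

    # Step 2: average those totals across all weeks for each weekday/hour slot.
    slots = defaultdict(list)
    for hour, total in per_hour.items():
        slots[(hour.weekday(), hour.hour)].append(total)
    return {slot: sum(totals) / len(totals) for slot, totals in slots.items()}


calls = [
    (datetime(2024, 1, 1, 9, 5), 1800.0),   # Monday, 9AM hour
    (datetime(2024, 1, 1, 9, 40), 1800.0),  # same Monday, same hour
    (datetime(2024, 1, 8, 9, 15), 7200.0),  # the following Monday
]
baseline = hourly_baseline(calls)
```

One thing to be aware of: an hour with no incidents at all has no row, so it is left out of the average rather than counted as zero.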
I put a workbook in dropbox: https://www.dropbox.com/s/3hfvz8w529g9f46/Interval%20Time%20Baseline.twbx?dl=0
Although the chart looks similar to yours, the numbers I came up with are somewhat different from the "SQL Results" you show. Was the data you provided slightly different?

How to join splayed table in KDB?

I have 2 very large (billions of rows) splayed tables, Trades and StockPrices, on a remote server. I want to do an asof join
h:hopen `:RemoteServer:Port
select from Trades where Date within 2014.04.01 2014.04.13,
But I just get the error (I'm using Studio for KDB+)
An error occurred during execution of the query.
The server sent the response:
Studio Hint: Possibly this error refers to nyi op on splayed table
So what would be the correct way to do such a join?
Also, performance and efficiency are an issue with such a big table -- what should I be doing to ensure the query doesn't take hours and doesn't consume too much of the server's system resources?
You need to map the splayed StockPrices table into memory. This can be done by using a select query:
q)(`::6060)"aj[`sym`time;select from trade;quote]" / bad
q)(`::6060)"aj[`sym`time;select from trade;select from quote]" / good
sym time prx bid ask
aea 01:01:16.347 637.7554 866.0131 328.1476
aea 01:59:14.108 819.5301 115.053 208.1114
aea 02:42:44.724 69.38325 641.8554 333.3092
This page may be useful for looking up errors from Kdb+: http://code.kx.com/q/ref/error-list/
Regarding optimising performance of aj see http://code.kx.com/q/ref/joins/#aj-aj0-asof-join
Also, if there isn't an overlap of data between days, it may be faster to run the query on a day by day basis, possibly in parallel.
If there is an overlap of data across days, combining the date & time columns into a single timestamp column would speed up the lookup.
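For readers less familiar with what an asof join does, here is a rough Python sketch of the lookup. It is not q and makes no claim about how kdb+ implements aj; `asof_join` and the sample rows are invented. It assumes each row carries a single sortable timestamp, so finding the prevailing quote is one binary search per trade.

```python
from bisect import bisect_right


def asof_join(trades, quotes):
    """For each trade, attach the latest quote for the same sym at or before the trade time.

    trades: list of (sym, ts, price); quotes: list of (sym, ts, bid, ask).
    """
    # Index quotes per sym, sorted by timestamp.
    by_sym = {}
    for sym, ts, bid, ask in sorted(quotes, key=lambda q: (q[0], q[1])):
        times, rows = by_sym.setdefault(sym, ([], []))
        times.append(ts)
        rows.append((bid, ask))

    joined = []
    for sym, ts, price in trades:
        times, rows = by_sym.get(sym, ([], []))
        i = bisect_right(times, ts)  # number of quotes with time <= ts
        bid, ask = rows[i - 1] if i else (None, None)
        joined.append((sym, ts, price, bid, ask))
    return joined


quotes = [("aea", 100, 10.0, 11.0), ("aea", 200, 12.0, 13.0)]
trades = [("aea", 150, 10.5), ("aea", 200, 12.5), ("aea", 50, 9.0)]
result = asof_join(trades, quotes)
```

The last trade happens before any quote exists, so it comes back with empty bid and ask values.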

Paginated searching... does performance degrade heavily after N records?

I just tried the following query on YouTube:
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder, does performance start to degrade with paginated searches after N records (specifically wondering about with ROW_NUMBER() in SQL Server or similar feature in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is specifically put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50
Yes. High offsets are slow and inefficient.
The only way to find the records at an offset, is to compute all records that came before and then discard them.
(I don't know ROW_NUMBER(), but the equivalent in MySQL would be LIMIT.) So with
SELECT * FROM table LIMIT 1999,20
the first 1999 records have to be fetched first and then discarded. Generally the database can't skip ahead or use indexes to jump right to the correct location in the data, because normally there would be a 'WHERE' clause filtering the results.
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are over a 'small' set of known tags, so it's quite feasible to cache them. An arbitrary search query will have far too many variations to cache, making it impractical.)
(Alternatively it might be using some other implementation that does allow arbitrary offsets)
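A small sketch of the "compute and discard" point, using Python's built-in sqlite3 rather than MySQL (the `results` table and both helper functions are made up for the example). The first helper asks the database for an offset; the second does by hand what the database has to do in the general case, which is walk past the skipped rows.

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO results (id, title) VALUES (?, ?)",
    [(i, f"result {i}") for i in range(1, 3001)],
)


def page_by_offset(conn, offset, size):
    """Same meaning as MySQL's LIMIT offset,size: skip `offset` rows, return `size`."""
    return conn.execute(
        "SELECT id FROM results ORDER BY id LIMIT ? OFFSET ?", (size, offset)
    ).fetchall()


def page_by_discarding(conn, offset, size):
    """Read rows in order and throw away the first `offset` of them."""
    cursor = conn.execute("SELECT id FROM results ORDER BY id")
    return list(islice(cursor, offset, offset + size))


page = page_by_offset(conn, 1999, 20)
same_page = page_by_discarding(conn, 1999, 20)
```

Both return rows 2000 through 2019, and the work in the second version grows with the offset rather than with the page size.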
Other places talking about similar things
Back-of-the-envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well. If indexes can be used, the difference is less pronounced and harder to see. But in a production system running lots of queries, a 1 or 2 ms difference is huge.)
Update: (to show an indexed query)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
10 rows in set (1.70 sec)
It's a TOP clause designed to limit the amount of physical reads that the database has to perform, which limits the amount of time that the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google has to limit the volume returned so other users aren't affected by generic searches. What's faster, returning 1,000 results or 82,000,000,000 results?
