Use tcpreplay with a real internet trace dataset

I have a CAIDA internet trace dataset that contains more than 200,000 unique IPv4 addresses and almost 1 million flows.
I'm currently using Mininet to emulate my SDN project, and I would like to use this dataset in my simulation.
My plan is to use tcpreplay to replay the dataset in Mininet. My questions are:
1. Do I have to manually configure more than 200,000 unique IPv4 hosts in order to mimic the real network as in the dataset?
2. Or is there another way?
I'd appreciate it if anyone with experience using tcpreplay together with real internet datasets could share their knowledge.
Thanks.

You can use tcpreplay with the --unique-ip option, which creates unique IP addresses for each pass through the pcap file. This option is very fast and allows you to replay at a controlled rate, all the way up to 10GigE full wire speed.
tcpreplay -i eth7 -tK --loop 50 --unique-ip bigFlows.pcap
I suggest that you read my article How to do a Performance Test for an IP Flow Appliance. I also suggest using my bigFlows.pcap file, which contains a large capture of actual traffic on a corporate WAN.
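In case it helps to see the idea concretely: below is a minimal Python/Scapy sketch of per-pass address rewriting, i.e. shifting every IPv4 address by a per-iteration offset so each replayed copy of the trace appears to come from fresh hosts. This only illustrates the concept and is not tcpreplay's actual implementation; the file names and offset scheme are made up, and for a multi-gigabyte CAIDA trace you would stream with PcapReader/PcapWriter rather than load everything with rdpcap.

    import socket
    import struct

    from scapy.all import IP, TCP, UDP, rdpcap, wrpcap

    def shift_ip(addr, offset):
        # Add a constant offset to a dotted-quad IPv4 address (mod 2^32).
        n = struct.unpack("!I", socket.inet_aton(addr))[0]
        return socket.inet_ntoa(struct.pack("!I", (n + offset) & 0xFFFFFFFF))

    for iteration in range(1, 4):               # three rewritten passes
        packets = rdpcap("bigFlows.pcap")       # fresh copy each pass
        offset = iteration * 0x10000            # arbitrary per-pass shift
        for pkt in packets:
            if IP in pkt:
                pkt[IP].src = shift_ip(pkt[IP].src, offset)
                pkt[IP].dst = shift_ip(pkt[IP].dst, offset)
                del pkt[IP].chksum              # force checksum recompute
                for layer in (TCP, UDP):
                    if layer in pkt:
                        del pkt[layer].chksum
        wrpcap("bigFlows_pass%d.pcap" % iteration, packets)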

Related

MQTT-Broker meets WebServer with Database

I have a question about combining an MQTT broker and a web server.
Please check out the image below.
Is this a good way to save data from different sensors in a database?
In the picture, the web server that communicates with the database is an MQTT client. The web server simply subscribes to all topics via #.
Is this scalable? I mean, if there are 100,000 sensors out there and all of them send messages to this one web server?
If you want to keep a record of all sensor data, then it's about the only option (unless you use different clients for different sensor types to split things up a bit). The only alternative to a separate client subscribed to # would be to use a broker like HiveMQ, which has a plugin mechanism that can record all messages in a database.
Also, # should probably be sensors/# in order to skip any other messages that may be on the system.
100,000 sensors isn't the deciding factor here; the rate at which these sensors deliver messages is the important point, as it determines the actual load.
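For what it's worth, the single "recorder" client is only a few lines. Here is a minimal sketch using Python's paho-mqtt (v1-style client API) and SQLite; the broker address and table schema are assumptions for the example:

    import sqlite3
    import time

    import paho.mqtt.client as mqtt

    # check_same_thread=False because paho invokes callbacks from its
    # own network thread, not the thread that opened the connection.
    db = sqlite3.connect("sensors.db", check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS readings"
               " (ts REAL, topic TEXT, payload BLOB)")

    def on_message(client, userdata, msg):
        # One row per publication; committing per message is the
        # simplest correct thing and the first bottleneck to revisit.
        db.execute("INSERT INTO readings VALUES (?, ?, ?)",
                   (time.time(), msg.topic, msg.payload))
        db.commit()

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("broker.example.com")       # assumed broker host
    client.subscribe("sensors/#")              # scoped, as suggested above
    client.loop_forever()

As noted above, the sustainable rate is set by how often the sensors publish, not by how many there are; once rates grow, batching inserts instead of committing per message would be the first optimization.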

Processing a million records as a batch in BizTalk

I am looking for suggestions on how to tackle this, and on whether I am using the right tool for the job. I work primarily on BizTalk, and we are currently using BizTalk 2013 R2 with SQL Server 2014.
Problem:
We would be receiving positional flat files every day (around 50), and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k, which later ballooned to 100k, and is now at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the flat file disassembler and adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment). With these I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the gather/aggregator orchestration, which collects all the messages processed by the workers from the MessageBox via correlation and creates the two files to be routed to Agency A and Agency B.
So, every file that gets dropped has its own set of workers and an aggregator that processes it.
This works well for files with a small number of records, but if a file has over 100k records, I see throttling kick in, and the file takes a long time to process and generate the two output files.
I have put the receive location, the workers, and the aggregator/send port on separate hosts.
It appears that the gatherer stays dehydrated and does not really aggregate the records processed by the workers until all of them are processed, and I think that because the ratio of messages published vs. processed is very large, it triggers throttling.
Attempt 2:
Assuming that the aggregator orchestration was the bottleneck, instead of accumulating the records in an orchestration, I pushed the processed records into a SQL database and 'split' the records into two XML files (basically concatenating the messages going to Agency A/B, wrapping them in an XML declaration, and using the correct message type, based on some of the context properties written to the SQL table along with each record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal posts/requirements have again changed with regard to expected volume, I am trying to determine whether BizTalk is even a feasible choice anymore.
I have indicated that BizTalk is not the right tool to perform such a task, but the client is suggesting we add more servers to make it work. I am also looking at SSIS.
Meanwhile, some observations from testing:
Increasing the number of workers improved processing (duh):
It looks like when each worker has fewer records in its queue/subscription, it finishes its queue quickly. When testing the 100k record file, using 100 workers completed in under 3 hours, with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them whether they can handle 1,000 calls; based on my observations, the existing solution might then scale.
I have adjusted a few host settings with regard to message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to touch these settings before and could use advice on which particular counters to monitor.
The post is a bit long, but I hope it gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks, but I would love to hear about other options as well.
I will try to answer and give more detail if you want me to clarify something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for... well, whatever needs to be done: calling the service, etc.
Update the SQL record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
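To make the shape of steps 2 and 3 concrete, here is a rough sketch of the drain/update loop in Python; the table, column names, DSN, and service URL are all hypothetical, and in BizTalk this would of course be a port plus a map or simple orchestration rather than a script:

    import pyodbc
    import requests

    conn = pyodbc.connect("DSN=BatchDb")          # assumed ODBC DSN
    cursor = conn.cursor()

    # Step 2: drain the unprocessed records loaded by SSIS.
    rows = cursor.execute(
        "SELECT RecordId, LookupKey FROM Records WHERE Result IS NULL"
    ).fetchall()

    # Step 3: call the YES/NO service per record and write the result back.
    for record_id, lookup_key in rows:
        resp = requests.get("https://lookup.example.com/check",
                            params={"key": lookup_key}, timeout=30)
        result = "Y" if resp.text.strip().upper() == "YES" else "N"
        cursor.execute("UPDATE Records SET Result = ? WHERE RecordId = ?",
                       result, record_id)
    conn.commit()

    # Step 4 then reduces to two set-based queries:
    #   SELECT ... FROM Records WHERE Result = 'Y'
    #   SELECT ... FROM Records WHERE Result = 'N'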
In such scenarios, you should consider the following approach:
De-batch the file and store individual records in MSMQ. You can achieve this without any extra coding effort; all you need is to create a send port using the MSMQ adapter, or WCF-Custom with netMsmqBinding. If required, you can also create separate queues depending on different criteria in your messages.
Receive the messages from MSMQ using a receive location on a separate host.
Send them to the web service from a different BizTalk host.
Try to stick to messaging-only scenarios; you can handle the service response in a pipeline component if required, and you can apply a map on the send port itself. In the worst case, if you do need an orchestration, it should only handle the processing of one message, without any complex pattern.
You can push the messages back to two MSMQ queues for the two different agencies, based on the web service response.
You can then receive those messages again and write them to file: simply use a send port with the file append option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need a complex orchestration pattern, which usually ends up creating many persistence points.
If the web service becomes a bottleneck, you can control the rate of messages received from MSMQ by 1) enabling Ordered Delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling: change the Message Count in DB threshold to a very low number (e.g. 1,000 from the 50k default) and increase the Spool and Tracking Data multipliers accordingly (e.g. to 500 from the default of 10), so that the product of the two numbers is large enough that messages already within BizTalk don't cause throttling. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup; it is usually installed by default, and if not, you can add it via Add/Remove Features. You could also use IBM MQ if your organization has that infrastructure, but for one million messages, MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table. Since the lookup web service is part of the same organization and network (although using a different stack), they have agreed to let us query the lookup table their web service is based on, and we use a 'merge' between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped using BizTalk. A 1.5 million record file is now processed, and the split files sent, within a couple of minutes.
Appreciate all the advice provided here.

Fastest way to store tiny data on the server

I'm looking for the best and fastest way to store data on my web server.
My idea is to log the IP address of every incoming request to any website on the server, and if an IP reaches a certain number of requests within a set time, redirect that user to a page where they need to enter a code to regain access.
I created an Apache module that does just that. It attempts to create files on a ramdisk; however, I constantly run into permission problems, since another module switches users before my module has a chance to run.
Using a physical disk is an option, but it is too slow.
So my only options are as follows:
1. Create folders on the RAM drive for each website so IP addresses can be logged independently.
2. Somehow figure out how to make my Apache module execute its functionality before all other modules.
3. Allocate a huge amount of RAM and store everything in it.
If I choose option #2, then I'll continue to beat around the bush, as I have already attempted that.
If I choose option #1, then I might need a lot more RAM, as tons of duplicate IP addresses would be stored across several folders on the RAM drive.
If I choose option #3, then the Apache module will have to constantly search through the allocated RAM to find an IP address, and searching takes time.
People say that memory access is faster than file access, but I'm not sure whether direct memory access via malloc is actually faster than storing data on a RAM drive.
The reason I expect to collect a lot of IP addresses is that I want to block script kiddies from constantly accessing my server at a very high rate.
So what I'm asking is: what is the best way to store my data, and why?
You can use a hashmap instead of a huge amount of RAM. It will be pretty fast.
Maybe use a separate hashmap for each website, or use a composite key like string_hash(website_name) + (int_hash(ip) << 32).
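For example, a fixed-window counter keyed by (website, IP) costs one hash lookup per request. A rough sketch in Python follows; a C implementation inside the Apache module would have the same structure with a hash table, and the window length and threshold here are arbitrary:

    import time

    WINDOW_SECONDS = 60      # arbitrary window
    MAX_REQUESTS = 100       # arbitrary threshold

    counters = {}            # (website, ip) -> (window_start, count)

    def allow_request(website, ip):
        # One dict lookup per request; returns False once an IP exceeds
        # MAX_REQUESTS within the current window for that website.
        now = time.time()
        key = (website, ip)
        window_start, count = counters.get(key, (now, 0))
        if now - window_start > WINDOW_SECONDS:
            window_start, count = now, 0     # window expired: reset
        count += 1
        counters[key] = (window_start, count)
        return count <= MAX_REQUESTS

Requests for which allow_request returns False would then be redirected to the code-entry page.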
If the problem is with permissions, why not solve it at that level? Use a common user account or group, or make everything on the RAM disk world readable/writable.
If you want to solve it at the Apache level, you might want to look into mod_security and mod_evasive.

WP7, Calculating Internet Speed

I am building an application where URL sources are chosen based on the speed of the internet connection (bandwidth).
Is there a way to ascertain the speed of the internet connection?
For example: if the user of my application is on a low-speed connection (2G), I need to provide them with a suitable source.
If the user is on a 3G connection, I will provide better-quality content through another URL.
So for this I need to ascertain the speed of the user's internet connection.
Kindly give me some guidance on this.
Send some data of a known size to the user and measure the total time it takes to finish sending. Then calculate bandwidth as data size / time taken.
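On WP7 you would implement this with the platform's HTTP classes, but the measurement logic itself is tiny. Here is a language-agnostic sketch in Python; the test URL and the speed cutoff are placeholders:

    import time
    import urllib.request

    TEST_URL = "https://example.com/100KB.bin"   # hypothetical test file

    start = time.monotonic()
    data = urllib.request.urlopen(TEST_URL, timeout=30).read()
    elapsed = time.monotonic() - start

    kbps = len(data) * 8 / 1000 / elapsed        # kilobits per second
    # Arbitrary cutoff: treat anything under ~200 kbit/s as a 2G-class link.
    source = "low-bandwidth URL" if kbps < 200 else "high-bandwidth URL"
    print("%.0f kbit/s -> use %s" % (kbps, source))

A small payload keeps the probe cheap but makes the estimate noisier; a common compromise is to measure throughput during the first real download and switch sources afterwards.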

Load Testing a Database attached to a Network storage device

We are looking at stress testing a NAS system for a database; basically, we want to see how much abuse it can take and how much it affects database performance. Here is what we have planned:
A test tool I'm building that kicks off a configurable number of threads, each running a SQL query (also configurable, and I'm thinking about having it run multiple queries); see the sketch below.
Using the SQLIOSim utility to simulate SQL Server activity.
Copying very large amounts of data onto and off of the device (at the same time).
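For reference, here is a minimal sketch of the kind of threaded query driver described in the first item, written in Python with pyodbc; the DSN, query, thread count, and duration are the configurable parts and are placeholders here:

    import threading
    import time

    import pyodbc

    CONN_STR = "DSN=NasTestDb"                    # assumed ODBC DSN
    QUERY = "SELECT COUNT(*) FROM LoadTestTable"  # hypothetical query
    THREADS = 16
    DURATION_SECONDS = 300

    def worker(counts):
        # Each thread opens its own connection and hammers the query
        # until the deadline, recording how many queries it completed.
        conn = pyodbc.connect(CONN_STR)
        cursor = conn.cursor()
        deadline = time.monotonic() + DURATION_SECONDS
        done = 0
        while time.monotonic() < deadline:
            cursor.execute(QUERY).fetchall()
            done += 1
        counts.append(done)

    counts = []
    threads = [threading.Thread(target=worker, args=(counts,))
               for _ in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%d queries in %ds across %d threads"
          % (sum(counts), DURATION_SECONDS, THREADS))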
Can anyone think of anything else we could do (that's repeatable) to place a load on the system?
You'll also want to simulate network conditions between your database and the NAS. As more traffic hits a network, its realizable utilization drops, and this will seriously affect your performance.
By way of example: if you have 50 machines on a 1Gbps network and the network is approaching 100% utilization, packet collisions and retries at the data link layer mean that your effective total transfer is a fraction of what would be achieved if there were only two communicators on the net. Worse yet, as retries increase, so does the effective load: you get an ugly feedback loop in the face of peak demand.
There are a number of network traffic simulators and generators out there, though I'm afraid I've never used any of them.
You could look at Pole Position, a generic database performance testing suite: http://www.polepos.org/
Depending on the goals of your load testing, you may also want to look at SQLIO, not just SQLIOSim. SQLIOSim is very good for stress testing and simulating SQL Server load, and it will go from green to red if it detects any IO errors. Its output is a bit cryptic, although KKline gives some insight.
SQLIO is useful if you want to perform one operation continuously, such as large random reads, large sequential writes, or anything in between. It will also give you some useful output stats which you can graph and use for comparison.
You can try IoMeter from Intel.
