Approaches to scan and fetch all DNS entries

I'd like to conduct an experiment, and I need a full database of all DNS entries on the Internet.
Is it practical to scan the Internet and fetch all DNS entries?
What is the limiting factor: storage, time, or network bandwidth?
Any good approaches to start with?
(I could always brute-force scan the IP space and do a reverse DNS lookup on each address, but I guess that's not the most efficient way to do it.)

Downloading databases like RIPE's or ARIN's will not get you the reverse DNS entries you want. In fact, you'll only get the autonomous systems and the DNS servers resolving those ranges, nothing else. Check this one: ftp://ftp.ripe.net/ripe/dbase/ripe.db.gz
Reverse DNS queries will only get you a fraction of all the DNS entries. In fact, no one can have all of them, as most domains don't accept AXFR (zone transfer) requests, and performing them could even be considered illegal in some countries. To get access to the complete list of .com/.net/.org domain names you'd have to be ICANN or maybe an ICANN-accredited registrar, and you'll never get the TLDs that aren't publicly available (those of several countries).
So the best possible approach is to mix techniques: brute-force reverse IP resolution, become an Internet giant like Google and run your own public DNS servers, and attempt an AXFR request against every domain name you manage to detect.
Mixing all these options is the only way to get a significant portion of all the DNS entries, but never 100%, and probably no more than 5 to 10%. Forget about brute-forcing whois servers to get the list of domain names: it's forbidden by their terms and conditions.
We're brute-forcing reverse IPv4 resolution right now, because it's the only legal way to do it without being Google. We started two weeks ago.
After two weeks of tuning, we've completed about 20% of the Internet. We've developed a Python script that launches thousands of threads, scanning /24 ranges in parallel from several different nodes.
It's much faster than nmap -sL, but not as reliable, so we'll need a second pass to fill in the gaps (around 85% of the IPs resolved on the first attempt). Regular rescanning must also be performed to keep the database complete and consistent.
Right now we have several servers, each running at about 2 Mbps of DNS queries (from 300 to 4,000 queries/second per node, depending mostly on the RTT between our servers and the remote DNS servers).
We expect to complete the first pass over the whole IPv4 space in around 30 days.
The text files where we store the preliminary results hold an average of 3M entries per /8 range (e.g. 111.0.0.0/8). The files are just "IP\tname\n" lines, and we only store IPs that actually resolved.
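The script itself isn't published, but the approach is easy to sketch. Here is a minimal Python illustration (not the authors' code) of a threaded PTR sweep over a single /24, appending "IP\tname" lines for resolved IPs only; the prefix and output filename are made up:

    # Hypothetical sketch: resolve PTR records for a /24 in parallel and
    # append "IP\tname" lines for the IPs that actually resolved.
    import socket
    from concurrent.futures import ThreadPoolExecutor

    def resolve_ptr(ip):
        """Return (ip, hostname), or None if the IP has no PTR record."""
        try:
            name, _, _ = socket.gethostbyaddr(ip)
            return ip, name
        except (socket.herror, socket.gaierror):
            return None

    def scan_24(prefix, out_path, workers=256):
        """Scan e.g. prefix='111.0.0' and append the results to out_path."""
        ips = [f"{prefix}.{host}" for host in range(256)]
        with ThreadPoolExecutor(max_workers=workers) as pool, open(out_path, "a") as out:
            for result in pool.map(resolve_ptr, ips):
                if result:
                    out.write(f"{result[0]}\t{result[1]}\n")

    if __name__ == "__main__":
        scan_24("111.0.0", "ptr-111.txt")

Scaling this to a /8 is then a matter of looping scan_24 over all 65,536 /24s in the range, spread across nodes.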
We needed to run a DNS resolver on every server, because we were affecting our provider's DNS service and it blocked us. We also did a bit of benchmarking on different DNS servers along the way: forget about BIND, it's too heavy, and you'll hardly get more than 300 resolutions/second.
Once we finish the scan, we'll publish an article and share the database :)
Follow me on Twitter: #kaperuzito
One conclusion we have already drawn is that people should think twice about the names they put in their DNS PTR entries. You really shouldn't name an IP "payroll", "ldap", "intranet", "test", "sql", "VPN" and so on... and there are millions of those :(


Equal connection distribution is not happening in Aurora autoscaled instances

We are running a REST API-based Spring Boot application using AWS Aurora as the database. Our application connects to read-only Aurora MySQL RDS instances.
We are load testing it. Initially we have one database instance, with autoscaling in place that is triggered on high CPU.
We expect that if we get some throughput X with one DB instance, we should get approximately 1.8X when autoscaling kicks in, with connections distributed equally among the newly created database instances.
But that is not happening. Instead, DB connections go up and down erratically on both database instances, so our load is not distributed equally and we don't get the desired throughput. Sometimes one database runs at 100% CPU while the other sits at 20%, and a few minutes later it's reversed.
Below is the database connection configuration:
Driver: com.mysql.jdbc.Driver
Maximum active connections: 100
Max age: 300000
Initial pool size: 10
The Tomcat JDBC pool is used for connection pooling.
NOTE:
1) We have also disabled JVM network DNS caching.
2) We also tried refreshing the database connections every 5 minutes, even the active ones.
3) We have tried everything suggested by AWS, but nothing is working.
4) We have even written a Lambda function to update Route 53 when a new DB instance comes up, to avoid cluster endpoint caching, but we still see the same issue.
Can anyone please advise on the best practice here? As it stands, we cannot take this into production.
This is not a great answer, but since you haven't gotten any replies yet, here are some thoughts.
1) The behavior you are seeing replicates the bad routing logic of load balancers.
This is no surprise to you, but it used to be much more common with small web server deployments, especially with long-running queries. With connection pooling, you mirror that situation.
2) Taking this assumption forward, we need to guess how Amazon chose to balance traffic to read-only replicas.
Even in their white paper, they don't mention how they do routing: https://www.allthingsdistributed.com/files/p1041-verbitski.pdf
The likely options are Route53 or an NLB.
My best guess would be an NLB. NLBs only became available in Q3 2017, and Aurora launched about two years before that, but it's still a reasonable guess.
An NLB would let them balance based on least connections (far better than round robin).
3) Validating assumptions
If Route53 is being used, then we should be able to see it with DNS.
I ran dig against the Route53 endpoint and got an answer:
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-0.yyy.us-east-1.rds.amazonaws.com.
zzz-0.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.8.33
I ran it again and got a different answer:
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-2.yyy.us-east-1.rds.amazonaws.com.
zzz-2.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.7.97
What you can see is that the read-only endpoint returns a CNAME to one of the replicas.
zzz is the name of my cluster, xxx came from my CloudFormation stack, and yyy comes from Amazon.
Note: zzz-0 and zzz-2 are the two read only replicas.
What we can see here is that Route53 is doing the load balancing.
4) Route53 Load Balancing
They are likely setting up Route53 with round robin across all healthy read-only replicas.
The TTL is likely 5s.
Unhealthy nodes will get removed, but there is no balancing based on load.
5) Ramifications
A) Using the read-only endpoint can only balance traffic away from unhealthy instances.
B) DB pools keep connections for a long time, which means new read replicas won't be touched.
If we have a small number of servers, we will be unbalanced, and there isn't much we can do about that directly.
6) Thoughts on what you can do
A) Verify with dig that you are getting DNS resolution that keeps rotating between replicas every 5s.
If you aren't, that is something you need to fix.
B) Periodically recycle DB clients
New replicas will get used, and while you will still be somewhat unbalanced, recycling helps by keeping the assignments changing.
What is critical, though, is that you MUST NOT have all your clients recycle at the same time; otherwise, they all risk landing on the same replica. I would suggest a random TTL per client, within a min/max bound (see the sketch just below).
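To make that concrete, here is a hypothetical Python sketch of per-connection max-age with jitter; the connect callable and the 3-7 minute bounds are placeholders, not a specific driver API:

    # Hypothetical sketch: give each DB connection its own random lifetime so
    # clients never recycle in lockstep. `connect` is any zero-argument
    # callable returning a connection object that has close().
    import random
    import time

    class JitteredPool:
        def __init__(self, connect, min_age=180.0, max_age=420.0):
            self._connect = connect
            self._min, self._max = min_age, max_age
            self._idle = []  # (connection, expiry) pairs

        def acquire(self):
            now = time.monotonic()
            while self._idle:
                conn, expiry = self._idle.pop()
                if now < expiry:
                    return conn, expiry
                # Past its jittered lifetime: drop it, forcing a fresh connect
                # (and hence a fresh DNS lookup of the cluster endpoint).
                conn.close()
            expiry = time.monotonic() + random.uniform(self._min, self._max)
            return self._connect(), expiry

        def release(self, conn, expiry):
            self._idle.append((conn, expiry))

Because each connection's lifetime is drawn independently, the recycles (and the DNS lookups they trigger) spread out over time instead of stampeding.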
C) Manage it yourself
Summary: when you connect, connect directly to the read replica with the least connections / lowest CPU.
How you do this is not entirely simple. I would suggest a Lambda function that keeps the preferred connection string in a queryable location and updates it at some frequency; I would say the frequency of updating the preferred DB should be about 1/10 of the frequency at which you recycle DB connections (a rough sketch follows). You could add logic so that while the DBs are running at similar load you hand out the read-only endpoint, and only hand out an explicit instance when there is significant inequity.
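A rough boto3 sketch of such a Lambda is below. It lists the cluster's reader instances and returns the endpoint with the lowest recent CPU from CloudWatch; the cluster identifier reuses the placeholder from the dig output above, and persisting the result somewhere queryable (Parameter Store, DynamoDB, ...) is left out:

    # Hypothetical sketch: pick the read replica with the lowest recent CPU.
    from datetime import datetime, timedelta
    import boto3

    def least_loaded_reader(cluster_id="zzz-databasecluster-xxx"):
        rds = boto3.client("rds")
        cw = boto3.client("cloudwatch")
        cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
        readers = [m["DBInstanceIdentifier"] for m in cluster["DBClusterMembers"]
                   if not m["IsClusterWriter"]]

        def cpu(instance_id):
            stats = cw.get_metric_statistics(
                Namespace="AWS/RDS", MetricName="CPUUtilization",
                Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
                StartTime=datetime.utcnow() - timedelta(minutes=5),
                EndTime=datetime.utcnow(), Period=300, Statistics=["Average"])
            points = stats["Datapoints"]
            # Treat missing data (e.g. a brand-new instance) as idle.
            return points[0]["Average"] if points else 0.0

        best = min(readers, key=cpu)
        inst = rds.describe_db_instances(DBInstanceIdentifier=best)["DBInstances"][0]
        return inst["Endpoint"]["Address"]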
One caution: when a new instance comes up, you want to be careful not to flood it with every new connection at once.
D) Increase the number of clients or the number of read-only copies
Both of these decrease the chance that two boxes end up with significantly different load.

Restrict number of requests from an IP

I am writing an application where there is a requirement to restrict the number of logins a user can have from a single IP address (as a way to stop spam).
We can't use a captcha, for various reasons!
The only two ways I could think of to make this work were either to store the number of requests coming in from each IP in the database,
OR
to store a tracking cookie holding the same information.
Now, the downside of the first approach is that there would be too much DB traffic: the application is going to be used by a ton of people.
The downside of storing this info in a cookie is that users can clear their cookies and start fresh again.
I need suggestions for a way to handle both the high DB traffic and the loose bond of cookie-based tracking.
You're talking about "logins" and a web application, so you have some sort of session persisted somewhere. When creating those sessions, you need to keep track of the number of active sessions per IP and refuse to allocate new sessions once that threshold is reached.
Without more specific information about your framework / environment, that's about the best answer anyone can provide.
Also be aware that this approach fails in numerous ways because of NAT (network address translation). For example, our office has exactly one public IP address for X hundred people. The internal network is on private IP space.
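To make the session-counting idea concrete, here is a minimal, framework-agnostic Python sketch. In production this counter would have to live in shared storage (memcached, Redis, or the DB) rather than a per-process dict, and the cap of 5 is arbitrary:

    # Illustrative sketch: cap the number of active sessions per IP.
    from collections import defaultdict
    import threading

    MAX_SESSIONS_PER_IP = 5          # arbitrary threshold
    _sessions = defaultdict(int)     # ip -> active session count
    _lock = threading.Lock()

    def try_start_session(ip):
        """Return True and count the session, or False if the IP is at its cap."""
        with _lock:
            if _sessions[ip] >= MAX_SESSIONS_PER_IP:
                return False
            _sessions[ip] += 1
            return True

    def end_session(ip):
        with _lock:
            if _sessions[ip] > 0:
                _sessions[ip] -= 1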
If you want to get the IP and store it somewhere, you could use $_SERVER['REMOTE_ADDR'] to get the user's IP, add a field like "ip" to your database, and run a SQL query to check whether that IP has already been used.
There are also other ways of tracking, like Flash cookies; people usually don't know they exist, so most wouldn't know how to clear them.

One database, multiple front ends, maintaining request ordering

Assuming I have one database keeping a simple history, with multiple front ends talking to it (one front end per server), I wonder what the common solutions are for dealing with time. As soon as I have multiple servers, I cannot assume a globally consistent clock, and I am interested in possible solutions for maintaining some kind of ordering between requests.
For a concrete example, let's say I want to record histories of customers, where a history is a time-ordered set of records. The record table would be as simple as (customer_id, time, data), and a history would be all the rows where customer_id == the requested id. Each request sent by a user contains one record for one customer. Ideally, the time should refer to the "actual" time the request was sent to the front end by the customer (as that's the time as seen from the user's POV). To be exact, I only care about preserving the ordering between records for each customer, not about absolute time.
I am aware of solutions such as vector clocks, etc., but those seem rather complex, and I would expect this to be a fairly common issue.
Solutions which are not acceptable in my case:
Changing the requests arriving at the front end: I unfortunately have to work under the constraint that the requests are passed as-is. I do have complete control of whatever communication protocol is needed between the front ends and the database, though.
Assuming the server clocks are synchronized
Requiring that all requests which need to be ordered relative to each other are handled by the same front-end server
[EDIT]: the question may sound a bit like a red herring, so here is my rationale for asking it: while this is not my issue right now, I am interested in the possibility of moving to a platform like Google App Engine, which explicitly says that its servers are not guaranteed to be time-synchronized. The solution to the request-ordering issue there does not seem obvious to me; maybe something like a vector clock is actually the only "good" solution?
When you perform any action that records history data to the database, you could record two pieces of datetime info:
the datetime set by the DB when the record was inserted
the datetime passed through with the data as a legitimate piece of metadata.
The former gives you a central view of the world if you ever need it, and the latter lets you reconstruct the datetime from the customer's perspective.
If you were ultra-keen you could also pass through the datetime from the user's browser by filling in some sort of parameter/field using JavaScript.
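As a concrete illustration of the two-timestamp idea, here is a minimal sketch using sqlite3 as a stand-in for whatever database is actually in use; the schema mirrors the (customer_id, time, data) table from the question, with the time column split in two:

    # Sketch: record both the DB's own insert time and the client-supplied time.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE history (
            customer_id INTEGER,
            server_time TEXT DEFAULT CURRENT_TIMESTAMP,  -- central view of the world
            client_time TEXT,                            -- customer's POV, passed as metadata
            data        TEXT
        )""")

    def record(customer_id, client_time, data):
        conn.execute(
            "INSERT INTO history (customer_id, client_time, data) VALUES (?, ?, ?)",
            (customer_id, client_time, data))
        conn.commit()

    record(42, "2010-11-12T09:30:01Z", "order placed")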
"As soon as I have multiple servers, I cannot assume a global consistent clock"
Well, you can configure servers to sync their clocks to a time server. You could also configure your database server to sync to a time server, and configure the other servers to sync to your database server as often as you need. (I'm not saying that's a great idea, just that it's possible, if you have access to all the servers.)
Anyway... so the front ends are the only pieces of software you have that actually know when a request arrives. Is that right?
If that's right, then it's the front end's job to record the time of the customer's request, preferably in UTC, and then forward that timestamp to the database.
If you can't synchronize the servers' clocks, then I think your only hope is to have every front end ask just one specific server (maybe your database server, maybe not) what time it is when a customer request arrives. A front end can do that by asking for the daytime on port 13 (DAYTIME protocol, RFC 867), the time on port 37 (TIME protocol, RFC 868), or a time server on port 123 (either the NTP or SNTP protocol, RFC 1305 and RFC 2030).
But after reading your edit, I think what you want is impossible. You seem to be saying that
what the front ends send doesn't contain enough information to reconstruct the "true" ordering
what the front ends send cannot be changed
If the front ends can't send you any other information, vector clocks and interval tree clocks won't help.

Advice on using a web server as a cache

I'd like advice on the following design. Is it reasonable? Is it stupid/insane?
Requirements:
We have some distributed calculations that work on chunks of data that are sometimes up to 50 MB in size.
Because the calculations take a long time, we like to parallelize them on a small grid (around 20 nodes).
We "produce" around 10,000 of these "chunks" of binary data each day and want to keep them around for up to a year... Most of the items aren't 50 MB, though, so the total daily space requirement is more like 5 GB... But we'd like to keep stuff around for as long as possible (a year or more)... But hey, you can get 2 TB hard disks nowadays.
Although we'd like to keep the data around, this is essentially a "cache". It's not the end of the world if we lose data; it just has to be recalculated, which takes some time (an hour or two).
We need to be able to efficiently get a list of all "chunks" that were produced on a particular day.
We often need, from a support point of view, to delete all chunks created on a particular day, or remove all chunks created within the last hour.
We're a Windows shop - we can't easily switch to Linux/some other OS.
We use SQLServer for existing database requirements.
However, it's a large and reasonably bureaucratic company that has some policies that limit our options: for example, conventional database space using SQLServer is charged internally at extremely expensive prices. Allocating 2 terabytes of SQL Server space is prohibitively expensive. This is mainly because our SQLServer instances are backed up, archived for 7 years, etc. etc. But we don't need this "gold-plated" functionality because we can just recreate the stuff if it goes missing. At heart, it's just a cache, that can be recreated on demand.
Running our own SQLServer instance on a machine that we maintain is not allowed (all SQLServer instances must be managed by a separate group).
We do have a fairly small transactional requirement: if a process that was producing a chunk dies halfway through, we'd like to be able to detect such "failed" transactions.
I'm thinking of the following solution, mainly because it seems like it would be very simple to implement:
We run a web server on top of a Windows filesystem (NTFS).
Clients "save" and "load" files by using HTTP requests, and when processes need to send blobs to each other, they just pass the URLs.
Filenames are allocated using GUIDs, but there is a directory for each date. So all of the files created on 12th November 2010 would go in a directory called "20101112" or something like that. This way, by getting the "directory" for a date, we can find all of the files produced on that date using normal file copy operations.
Indexing is done by a traditional SQL Server table, with a "URL" column instead of a "varbinary(max)" column.
To preserve the transactional requirement, a process that is creating a blob only inserts the corresponding "index" row into the SQL Server table after it has successfully finished uploading the file to the web server. So if it fails or crashes halfway, such a file "doesn't exist" yet because the corresponding row used to find it does not exist in the SQL server table(s).
I like the fact that the large chunks of data can be produced and consumed over a TCP socket.
In summary, we implement "blobs" on top of SQL Server much the same way that they are implemented internally - but in a way that does not use very much actual space on an actual SQL server instance.
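As a sketch (under the stated assumptions, not a prescribed implementation), the write path could look like the following; the blobcache host name and the sqlite3 stand-in for the SQL Server index table are made up:

    # Sketch: upload the chunk over HTTP first; insert the index row last,
    # so the row acts as the commit point for the whole "transaction".
    import datetime
    import sqlite3
    import uuid

    import requests

    db = sqlite3.connect("chunk_index.db")
    db.execute("CREATE TABLE IF NOT EXISTS chunks (id TEXT, created TEXT, url TEXT)")

    def save_chunk(data: bytes) -> str:
        day = datetime.date.today().strftime("%Y%m%d")   # e.g. "20101112"
        name = str(uuid.uuid4())                         # GUID filename
        url = f"http://blobcache/{day}/{name}"
        resp = requests.put(url, data=data)
        resp.raise_for_status()                          # failed upload: no index row
        # Only now does the chunk officially "exist".
        db.execute("INSERT INTO chunks VALUES (?, ?, ?)", (name, day, url))
        db.commit()
        return url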
So my questions are:
Does this sound reasonable, or is it insane?
How well do you think this would work on top of a typical Windows NTFS filesystem? (5,000 files per "dated" directory, several hundred directories, one for each day.) There would eventually be many hundreds of thousands of files, but not too many directly underneath any one particular directory. Would we have to start worrying about hard disk fragmentation, etc.?
What if 20 processes, via the one web server, all try to write 20 different "chunks" at the same time? Would that start thrashing the disk?
Which web server would be best to use? It needs to be rock solid, run on Windows, and handle lots of concurrent users.
As you might have guessed, outside of the corporate limitations I would probably just set up a SQLServer instance and have a table with a "varbinary(max)" column... But given that's not an option, how well do you think this would work?
This is all somewhat outside my usual scope, so I freely admit I'm a bit of a noob in this department. Maybe this is an appalling design... but it seems like it would be very simple to understand how it works, and to maintain and support it.
Your reasons behind the design are insane, but they're not yours :)
NTFS can handle what you're trying to do. This shouldn't be much of a problem. Yes, you might eventually have fragmentation problems if you run low on disk space, but make sure that you have copious amounts of space and you shouldn't have a problem. If you're a Windows shop, just use IIS.
I really don't think you will have much of a problem with this architecture. Just keep it simple like you're doing and things should be fine.

How is MapReduce a good method to analyse HTTP server logs?

I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I've read a lot of papers and articles on the topic, installed Hadoop on an array of virtual machines, and done some very interesting tests. I really think I understand the map and reduce steps.
But here is my problem: I can't figure out how it can help with HTTP server log analysis.
My understanding is that big companies (Facebook, for instance) use MapReduce to process their HTTP logs in order to speed up the extraction of audience statistics. The company I work for, while smaller than Facebook, has a big volume of web logs to process every day (100 GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computation instantly comes to mind as a soon-to-be-useful optimization.
Here are the questions I can't answer right now; any help would be greatly appreciated:
Can the MapReduce concept really be applied to weblogs analysis ?
Is MapReduce the most clever way of doing it ?
How would you split the web log files between the various computing instances ?
Thank you.
Nicolas
Can the MapReduce concept really be applied to weblogs analysis ?
Yes.
You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk size for your type of logfile; for Apache logfiles I'd go for a larger number), feed them to some mappers that extract something specific (like browser, IP address, ..., username, ...) from each log line, then reduce by counting the number of times each value appeared (simplified):
192.168.1.1,FireFox x.x,username1
192.168.1.1,FireFox x.x,username1
192.168.1.2,FireFox y.y,username1
192.168.1.7,IE 7.0,username1
You can extract browsers, ignoring version, using a map operation to get this list:
FireFox
FireFox
FireFox
IE
Then reduce to get this :
FireFox,3
IE,1
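For illustration, here is that same flow as a tiny single-machine Python sketch in the style of a Hadoop streaming job, using the sample lines above:

    # Sketch: map each log line to a (browser, 1) pair, then reduce by summing.
    from collections import defaultdict

    log_lines = [
        "192.168.1.1,FireFox x.x,username1",
        "192.168.1.1,FireFox x.x,username1",
        "192.168.1.2,FireFox y.y,username1",
        "192.168.1.7,IE 7.0,username1",
    ]

    def map_browser(line):
        browser = line.split(",")[1].split(" ")[0]   # drop the version
        return browser, 1

    def reduce_counts(pairs):
        totals = defaultdict(int)
        for key, count in pairs:
            totals[key] += count
        return dict(totals)

    print(reduce_counts(map(map_browser, log_lines)))  # {'FireFox': 3, 'IE': 1}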
Is MapReduce the most clever way of doing it ?
It's clever, but you would need to be very big to gain any benefit from it... splitting petabytes of logs.
For this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue; jobs not executed within some timeframe are made available for other clients to process. These clients would be small programs that each do something specific.
You could start with 1 client and expand to 1,000... You could even have a client that runs as a screensaver on every PC on the LAN, run 8 clients on your 8-core servers, 2 on your dual-core PCs...
With pull: you could have 100 or 10 clients working; multicore machines could run multiple clients; and whatever a client finishes is available for the next step. You don't need to do any hashing or assignment of the work. It's 100% dynamic.
http://img355.imageshack.us/img355/7355/mqlogs.png
How would you split the web log files between the various computing instances ?
By number of elements or lines, if it's a text-based logfile.
To test MapReduce, I'd suggest you play with Hadoop.
Can the MapReduce concept really be applied to weblogs analysis ?
Sure. What sort of data are you storing?
Is MapReduce the most clever way of doing it ?
It would allow you to query across many commodity machines at once, so yes, it can be useful. Alternatively, you could try sharding.
How would you split the web log files between the various computing instances ?
Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database: a user id, an IP address, a referrer, page, or advert; whatever the topic of your logging is.
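A minimal sketch of a consistent hash ring in Python (illustrative, not a specific library): each node appears at several points on the ring, and a key is assigned to the first node clockwise from its hash, so adding a node only remaps roughly 1/n of the keys:

    # Sketch: consistent hashing with virtual nodes for smoother balance.
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, vnodes=100):
            # Place each node at `vnodes` pseudo-random points on the ring.
            self._ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes for i in range(vnodes))
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def node_for(self, key):
            # First ring point at or after the key's hash, wrapping around.
            idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
            return self._ring[idx][1]

    ring = HashRing(["node1", "node2", "node3"])
    print(ring.node_for("10.0.0.42"))  # the same key always maps to the same node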
