Fastest way to store tiny data on the server - c

I'm looking for a better and faster way to store data on my webserver at the best speed possible.
My idea is to log the IP address of every incoming request to any website on the server and if it reaches a certain number within a set time then users will be redirected to a page where they need to enter a code to regain access.
I created an apache module that does just that. It attempts to create files on a ramdisk however I constantly run into permission problems since there is another module that switches users before my module has a chance to run.
Using a physical disk is an option that is too slow.
So my only options are as follows:
Either create folders on the ramdrive for each website so IP addresses can be logged independently.
Somehow figure out how to make my apache module execute its functionality before all other modules.
OR
Allocate a huge amount of ram and store everything in it.
If I choose option #2, then I'll continue to beat around the bush as I have already attempted that.
If I choose option #1, then I might need lots more ram as tons of duplicate IP addresses are expected to be stored across several folders on the ramdrive.
If I choose option #3, then the apache module will have to constantly seek through the ram space allocated in order to find the IP address, and seeking takes time.
People say that memory access is faster than file access but I'm not sure if just a direct memory access via malloc is faster than storing data to a ram drive.
The reason why I expect to collect alot of IP addresses is to block script-kiddies from constantly accessing my server at a very high rate.
So what I'm asking is what is the best way I should store my data and why?

You can use hashmap instead of huge amount of ram. It will be pretty fast.
May be several hashmaps for each websits. Or use composite like string_hash(website_name) + (int_hash(ip) << 32)

If the problem is with permissions, why not solve it at that level? Use a common user account or group. Or make everything on the RAM disk world readable/writable.
If you want to solve it at the Apache level, you might want to look into mod_security and mod_evasive.

Related

How to prevent people or a program from extracting data out of a system?

Let us say, there is a system containing data, where the user can view or manipulate it, using the options in the system, but should not be able to copy/ extract/ export the data out of the system. Also, any bots such as RPA or crawlers should not be exporting too. The data strictly recides in the system.
Eg: VDI - Virtual Desktop Infrastructure, does some sort of this work. People can connect to remote machines and do some work, but cannot extract data out of it to their local machine, unless it allows the user to do so. Even RPA bots will not be allowed to run in that remote system, only can be run in local system but it will be tedious to build such a bot, providing a closer solution to the above problem.
I am just looking for alterate light-weight options. Please let me know, if there is any solution available.
There is simply no way of stopping all information export.
A user could just take a photo to the screen and share the info.
If by exporting you mean exporting files, then simply do not allow exporting the files in your program or restrict the option, if you need to store data on the disk, store it encrypted.
The best options would be to configure a machine only to use that software, so on boot it would lauch the software fullscreen, deny any usb autorun keys and have something like Veyon insyalled to be remotely controlled and have some config data on the disk but pretty much all the data on a remote server.
If you need a local cache, you can keep it encrypted.
That said theoretically if a user had access to the ram physically, he/she could retrieve that data but it is highly unlikely.
First of all, you'll have to make ssh and ftp useless! this is to prevent scp or other ftp software from being used to move things from inside the system out and vice versa, block ports 20, 21 and 22!
If possible, I'd block access to cloud storage services (DNS/Firewall), so that no one with access to the machine would be able to upload stuff to common cloud services or if you have a known address that might be a potential goal for your protected data. Make sure that online code repositories are also blocked! if the data can be stored as text, it can be also transfered to github/gitlab/bitbucket as a normal repo... you can block them also at DNS level. Make sure that users don't have the previlage to change network settings, otherwise they can bypass your DNS blocks!
You should prevent any kind of external storage connectivity.. by disallowing your VM from connecting to the server's USB ports or even bluetooth if exists.
That's off the top of my head... I'll edit this answer if I remember any more things to block.

Restrict number of requests from an IP

I am writing an application wherein its a requirements to restrict the number of logins a user can have from a single IP address (as a way to stop spam).
We can't use captcha for some reason!
The only 2 ways I could think of to make this work was to either store in the database, the number of requests coming in from each IP.
OR
To store a tracking cookie which has the information regarding the same.
Now, the downside of the first mode is that there would be too much of db traffic - the application is going to be used by a ton of people.
The downside of storing this info as a cookie is that users can clear them up ad start fresh again.
I need suggestions, if there could be a way wherein the high db traffic and the loose bond with cookie based tracking can be handled.
You're talking about "logins" and a web-application therefore you have some sort of a session persisted somwhere. When creating those sessions you need to keep track of the number of active sessions per IP and not allocate new sessions when that threshold is reached.
Without more specific information about your framework / environment, that's about the best answer anyone can provide.
Also be aware that this approach fails in numerous ways because of NAT (network address translation). For example, our office has exactly one public IP address for X hundred people. The internal network is on private IP space.
if you want to get the IP and store somewhere, you could use $_SERVER['REMOTE_ADDR'] to get the IP of the user, make a field like "ip" in your database and you make a query in your SQL to check if the IP was used.
There are also other ways of tracking, like Flash Cookie, people usually don't know the existance of it, so most people wouldn't know how to clear it.

Charts or Stats comparing Database vs. HTTP vs. Direct File Access Performance?

I am wondering what the stats are for different ways of storing (and therefore retrieving) content. Are there any charts out there, or do you guys have any quick tests to show, the requests per second, etc., of:
Direct (local) database access, vs.
HTTP Access to cached data, vs.
HTTP Access to uncached data (remote database), vs.
Direct File access
I am wondering to judge how necessary it is to locally cache data if I'm using remote services.
Thanks!
.. what the stats are ...
Although some people may have published their findings, this will not map directly to your experience - you may find the opposite of they discovered.
Sometimes it may be faster to retrieve files from a database than a file - it depends on the size of the file, the filesystem or DBMS it resides on, the other data which affects the access path (e.g. indexes, number of I/O operations to dereference the start of the file...) the underlying hardware, the amount of caching available, the presence of the data or information relating to its location in the cache and the interaction between each of these factors.
And that's before you start considering the additional variables introduced when you start talking about HTTP, which also implies remote network access.
While ultimately any file would need to be read from the filesystem at some point, this suggests that direct file access would be the fastest method (but only on the local machine) however if you consider centralized caching and concurrency this is not necessarily the case.
I am wondering to judge how necessary it is to locally cache data if I'm using remote services.
Rather hard to say. How remote? what are your bandwidth costs? Latency? What level of service do you hope to provide? Does the remote system provide caching information already? How do you deal with cache invalidations?
If we knew everything about your application, the data source, your customers and networks connecting them and your budget for implementing the service then we might hazard a guess. And, yes, caching on the MITM server probably is a good idea but only if you know that you're not breaking anything by using caching.
C.

File writes per second

I want to log visits to my website with a high visits rate to file. How much writes to log file can I perform per second?
If you can't use Analytics, why wouldn't you use your webserver's existing logging system? If you are using a real webserver, it almost certainly as a logging mechanism that is already optimized for maximum throughput.
Your question is impossible to answer in all other respects. The number of possible writes is governed by hardware, operating system and contention from other running software.
Don't do that, use Google Analytics instead. You'd end up running into many problems trying to open files, write to them, close them, so on and so forth. Problems would arise when you overwrite data that hasn't yet been committed, etc.
If you need your own local solution (within a private network, etc) you can look into an option like AWStats which operates off of crawling through your log files.
Or just analyze the Apache access log files. For example with AWStats.
File writes are not expensive until you actually flush the data to disk. Usually your operating system will cache things aggressively so you can have very good write performance if you don't try to fsync() your data manually (but of course you might lose the latest log entries if there's a crash).
Another problem however is that file I/O is not necessarily thread-safe, and writing to the same file from multiple threads or processes (which will probably happen if we're talking about a Web app) might produce the wrong results: missing or duplicate or intermingled log lines, for example.
If your hard disk drive can write 40 MB/s and your log file lines are approx. 300 bytes in length, I'd assume that you can write 140000 HTTP requests per second to your logfile if you keep it open.
Anyway, you should not do that on your own, since most web servers already write to logfiles and they know very good how to do that, how to roll the files if a maximum limit is reached and how to format the log lines according to some well-known patterns.
File access is very expensive, especially when doing writes. I would recommend saving them to RAM (using whatever cache method suits you best) and periodically writing the results to disk.
You could also use a database for this. Something like:
UPDATE stats SET hits = hits + 1
Try out a couple different solutions, benchmark the performance, and implement whichever works fast enough with minimal resource usage.
If using Apache, I'd recommend using the rotatelogs utility supplied as a part of the standard kit.
We use this to allow rotating the server logs out on a daily basis without having to stop and start the server. N.B. Use the new "||" syntax when declaring the log directive.
The site I'm involved with is one of the largest on the Internet with hit rates peaking in the millions per second for extended periods of time.
Edit: I forgot to say that the site uses standard Apache logging directives and we have not needed to customise the Apache logging code at all.
Edit: BTW Unless you really need it, don't log bytes served as this causes all sorts of issues around the midnight boundary.
Let Apache do it; do the analysis work on the back-end.

Advice on using a web server as a cache

I'd like advice on the following design. Is it reasonable? Is it stupid/insane?
Requirements:
We have some distributed calculations that work on chunks of data that are sometimes up to 50Mb in size.
Because the calculations take a long time, we like to parallelize the calculations on a small grid (around 20 nodes)
We "produce" around 10000 of these "chunks" of binary data each day - and want to keep them around for up to a year... Most of the items aren't 50Mb in size though, so the total daily space requirement is more around 5Gb... But we'd like to keep stuff around for as long as possible, (a year or more)... But hey, you can get 2TB hard disks nowadays.
Although we'd like to keep the data around, this is essentially a "cache". It's not the end of the world if we lose data - it just has to get recalculated, which just takes some time (an hour or two).
We need to be able to efficiently get a list of all "chunks" that were produced on a particular day.
We often need to, from a support point of view, delete all chunks created on a particular day or remove all chunks created within the last hour.
We're a Windows shop - we can't easily switch to Linux/some other OS.
We use SQLServer for existing database requirements.
However, it's a large and reasonably bureaucratic company that has some policies that limit our options: for example, conventional database space using SQLServer is charged internally at extremely expensive prices. Allocating 2 terabytes of SQL Server space is prohibitively expensive. This is mainly because our SQLServer instances are backed up, archived for 7 years, etc. etc. But we don't need this "gold-plated" functionality because we can just recreate the stuff if it goes missing. At heart, it's just a cache, that can be recreated on demand.
Running our own SQLServer instance on a machine that we maintain is not allowed (all SQLServer instances must be managed by a separate group).
We do have fairly small transactional requirement: if a process that was producing a chunk dies halfway through, we'd like to be able to detect such "failed" transactions.
I'm thinking of the following solution, mainly because it seems like it would be very simple to implement:
We run a web server on top of a windows filesystem (NTFS)
Clients "save" and "load" files by using HTTP requests, and when processes need to send blobs to each other, they just pass the URLs.
Filenames are allocated using GUIDS - but have a directory for each date. So all of the files created on 12th November 2010 would go in a directory called "20101112" or something like that. This way, by getting a "directory" for a date we can find all of the files produced for that date using normal file copy operations.
Indexing is done by a traditional SQL Server table, with a "URL" column instead of a "varbinary(max)" column.
To preserve the transactional requirement, a process that is creating a blob only inserts the corresponding "index" row into the SQL Server table after it has successfully finished uploading the file to the web server. So if it fails or crashes halfway, such a file "doesn't exist" yet because the corresponding row used to find it does not exist in the SQL server table(s).
I like the fact that the large chunks of data can be produced and consumed over a TCP socket.
In summary, we implement "blobs" on top of SQL Server much the same way that they are implemented internally - but in a way that does not use very much actual space on an actual SQL server instance.
So my questions are:
Does this sound reasonable. Is it insane?
How well do you think that would work on top of a typical windows NT filesystem? - (5000 files per "dated" directory, several hundred directories, one for each day). There would eventually be many hundreds of thousands of files, (but not too many directly underneath any one particular directory). Would we start to have to worry about hard disk fragmentation etc?
What about if 20 processes are all, via the one web server, trying to write 20 different "chunks" at the same time - would that start thrashing the disk?
What web server would be the best to use? It needs to be rock solid, runs on windows, able to handle lots of concurrent users.
As you might have guessed, outside of the corporate limitations, I would probably set up a SQLServer instance and just have a table with a "varbinary(max)" column... But given that is not an option, how well do you think this would work?
This is all somewhat out of my usual scope so I freely admit I'm a bit of a Noob in this department. Maybe this is an appalling design... but it seems like it would be very simple to understand how it works, and to maintain and support it.
Your reasons behind the design are insane, but they're not yours :)
NTFS can handle what you're trying to do. This shouldn't be much of a problem. Yes, you might eventually have fragmentation problems if you run low on disk space, but make sure that you have copious amounts of space and you shouldn't have a problem. If you're a Windows shop, just use IIS.
I really don't think you will have much of a problem with this architecture. Just keep it simple like you're doing and things should be fine.

Resources