This is not so much a technical question as a reflection on the subject.
REST has democratized a good way of serving resources over HTTP and lets developers build cleaner projects by separating resources from the user interface (with REST APIs, the back end now really deals with back-end concerns only).
With those APIs, most of the time we use GET, PUT/PATCH (hmm, meh?), POST and DELETE, which together mimic database CRUD operations.
But as time goes by on our projects, we feel the UX can be improved by adding tons of great features. For example, why should the user be afraid of deleting a resource? Why not just add a recovery system (like the one in the Google Keep app, which lets us undo a deletion, and which I think is awesome in terms of UX)?
One practice that prevents unintentional deletion is to add a column to the table that represents the resource. For example, when I am about to delete a book, clicking the delete button only flags the row as "deleted = TRUE" in my database, and rows flagged as deleted are left out when browsing the list of resources (GET).
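As a minimal sketch of the flag-column approach described above, assuming an SQLite table named books with a deleted column (the table, column and function names are illustrative, not taken from any particular project):

```python
import sqlite3


def make_db():
    # In-memory database with a hypothetical "books" table and a soft-delete flag.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE books ("
        "id INTEGER PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO books (title) VALUES (?)", [("Dune",), ("Emma",)])
    return conn


def list_books(conn):
    # What a GET on the collection would show: flagged rows are filtered out.
    rows = conn.execute("SELECT id, title FROM books WHERE deleted = 0 ORDER BY id")
    return rows.fetchall()


def soft_delete(conn, book_id):
    # "Delete" only sets the flag; the row stays in the table.
    conn.execute("UPDATE books SET deleted = 1 WHERE id = ?", (book_id,))


def recover(conn, book_id):
    # Undo is just clearing the flag again.
    conn.execute("UPDATE books SET deleted = 0 WHERE id = ?", (book_id,))
```

Calling soft_delete(conn, 1) hides the first book from list_books(conn), and recover(conn, 1) brings it back.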
This conflicts with our dear and beloved REST pattern, as there is no distinction between DELETE and DESTROY "methods".
What I mean is: should we think about making REST evolve to fit our UX needs (which also means making the HTTP protocol evolve), or should it stay a purist approach to resource management, so that we follow the HTTP protocol as it is and adapt to it with workarounds (like using PATCH for soft deletion)?
Personally, I would like to see at least 4 new methods, as we are trying to qualify a resource as precisely as possible:
DELETE becomes a way to prevent other methods from having an impact on the resource
DESTROY becomes more dramatic by completely removing every trace of the resource
RECOVER is a way to say to the other methods "hey guys, he is coming back, stay tuned"
TRASH is like GET, but only for DELETED resources
What made me think about this is my search for a clean REST solution to deal with this resource behavior. I have seen some posts on the subject, including
https://www.pandastrike.com/posts/20161004-soft-deletes-http-api
https://philsturgeon.uk/rest/2014/05/25/restful-deletions-restorations-and-revisions/
...
They advise us to use PUT or PATCH to make soft deletion usable, but I feel that does not sound right, does it?
My thoughts about this problem:
Is there a big step between proposing new HTTP methods and updating existing ones? (I heard HTTP/2 is a thing; maybe we could ship those in?)
Does it make sense outside the web development realm? I mean, could these changes impact domains other than ours?
I'm not sure this makes sense even within the web development realm; the starting premises seem to be wrong.
RFC 7231 offers this explanation for POST
The POST method requests that the target resource process the representation enclosed in the request according to the resource's own specific semantics.
Riddle: if this is the official definition of POST, why do we need GET? Anything that the target can do with GET can also be done with POST.
The answer is that the additional constraints on GET allow participants to make intelligent decisions using only the information included in the message.
For example, because the header data informs the intermediary component that the method is GET, the intermediary knows that the action upon receiving the message is safe, so if a response is lost the message can be repeated.
The entire notion of crawling the web depends upon the fact that you can follow any safe links to discover new resources.
The browser can pre-fetch representations, because the information encoded in the link tells it that the message to do so is safe.
The way that Jim Webber describes it: "HTTP is an application, but it's not your application". What the HTTP specification does is define the semantics of messages, so that a generic client can be understood by a generic server.
To use your example; the API consumer may care about the distinction between delete and destroy, but the browser doesn't; the browser just wants to know what message to send, what the retry rules are, what the caching rules are, how to react to various error conditions, and so on.
That's the power of REST -- you can use any browser that understands the media-types of the representations, and get correct behavior, even though the browser is completely ignorant of the application semantics.
The browser doesn't know that it is talking to an internet message board or the control panel of a router.
In summary: your idea looks to me as though you are trying to achieve richer application semantics by changing the messaging semantics, which violates separation of concerns.
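One possible reading of this, sketched under assumptions: the delete/destroy/recover distinction can live in the resource model while the messages stay standard HTTP. The URL layout (/books/{id} and /trash/{id}), the choice of POST for recovery, and the in-memory store are all hypothetical; nothing here comes from the HTTP specification beyond the method names.

```python
# Hypothetical in-memory store: book id -> {"title": ..., "deleted": bool}
BOOKS = {1: {"title": "Dune", "deleted": False}}


def handle(method, path):
    """Dispatch using standard methods only; the trash is just another resource."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in ("books", "trash") or not parts[1].isdigit():
        return 404, None
    in_trash = parts[0] == "trash"
    book_id = int(parts[1])
    book = BOOKS.get(book_id)
    # A book is visible under /books or under /trash, never both.
    if book is None or book["deleted"] != in_trash:
        return 404, None
    if method == "GET":
        return 200, book["title"]
    if method == "DELETE" and not in_trash:
        book["deleted"] = True      # soft delete: the book moves to the trash
        return 204, None
    if method == "DELETE" and in_trash:
        del BOOKS[book_id]          # deleting from the trash removes it for good
        return 204, None
    if method == "POST" and in_trash:
        book["deleted"] = False     # recovery, with semantics defined by the resource
        return 204, None
    return 405, None
```

A generic client only ever sees GET, POST and DELETE; what "trash" means stays an application concern.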
I understand this is probably going to sound like an extremely rudimentary question to some of the more senior people on here, but I had to ask. I understand Google is there for my use, but I wanted to make sure I found the right answer. My friend is concerned that when we post a coming-soon page for our project, someone could post malicious code into our web form (a username, DOB, email, contact information sort of form), and he wants to know the best way to protect against such malicious code injection.
Do you just get SSL protection on your database or is there something more programming intensive we need to produce to protect ourselves and the information we collect?
You sanitize your inputs before storing them, which generally means, for example, escaping special characters in SQL (or whatever you are using) before building a query.
Pretty much every language and platform for building web applications provides ways for you to do this. Typically you'll use whatever version of parameterized queries your infrastructure provides, and it will handle escaping for you. But "sanitizing inputs" is the Google keyword here.
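To illustrate what a parameterized query looks like, here is a minimal sketch using Python's built-in sqlite3 module. The signups table and the function names are assumptions made for the example; your own platform will have its own equivalent.

```python
import sqlite3


def make_db():
    # Hypothetical table for the coming-soon form.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signups (username TEXT, email TEXT)")
    return conn


def save_signup(conn, username, email):
    # The ? placeholders pass user input as data; it is never spliced
    # into the SQL text, so it cannot change the statement.
    conn.execute(
        "INSERT INTO signups (username, email) VALUES (?, ?)",
        (username, email),
    )
```

A hostile value such as "x'); DROP TABLE signups; --" is simply stored as a string, and the table is left intact.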
This has nothing to do with SSL at all. SSL encrypts network traffic so third parties can't spy on it. The use of SSL is completely independent of your site's susceptibility to code injection attacks.
I would go into more detail about sanitizing inputs but without more information there's not much I can say, and also info about this is readily available in docs, tutorials, and examples all over the Internet.
So far, when logging user logins, I have always stored the complete user agent in addition to the already parsed information (browser, version, OS, etc.). The user agent is usually just a TEXT field in the table.
While implementing another, similar thing, I asked myself: what's even the point of doing that? Obviously, the user agent can easily be manipulated in any case, and the only relevant information (browser, version and operating system) is already parsed and stored separately anyway.
Is there some actual benefit in still storing it, other than being able to trace back data that could have been faked anyway? What other relevant information does the user agent contain to justify the (over the years, quite large) amount of storage it takes up?
And of course I realize that the user agent contains a lot more than just the browser specifications, but how many times have you really had to go back and analyze the user agent itself?
Just to clarify: I'm asking about reasons to store the raw user-agent string after parsing the "relevant" information out of it (browser, OS, etc.). What is the point of the user agent after that?
The user agent string contains information about the environment including operating system and browser. It is something I frequently check. There are two main reasons to store it.
First, if you are following up on a bug report or error, this information is useful or even essential for determining what went wrong. Imagine trying to find an error that occurs only on IE8 without the user agent! This information can also help you prioritize a bug fix: you will want to fix an issue that is present on 93% of environments before you fix one that is present on 7%.
Secondly, it provides very useful stats on the profile of your user. You might only want to support environments of more than a certain percentage of your user base. For example, if you are designing a new version of your software and, on examining your user agent logs, you find no one using IE, you might not bother to optimize or design for IE.
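As a rough sketch of how such stats could be pulled from logged raw user-agent strings: the substring rules below are deliberately naive and purely illustrative (real user-agent parsing is much messier), and the function names are made up for this example.

```python
from collections import Counter


def browser_family(user_agent):
    # Naive, illustrative rules only; a real parser needs many more cases.
    if "MSIE" in user_agent or "Trident/" in user_agent:
        return "IE"
    if "Firefox/" in user_agent:
        return "Firefox"
    if "Chrome/" in user_agent:
        return "Chrome"
    return "other"


def family_share(user_agents):
    # Fraction of logged requests per browser family.
    counts = Counter(browser_family(ua) for ua in user_agents)
    total = sum(counts.values())
    return {family: count / total for family, count in counts.items()}
```

Because the input is the raw string, the classification rules can be changed later and the old logs re-analyzed.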
You seem to be concerned that the user agent string can be faked. While this is possible, unless there is some specific reason someone might do this in your app, it seems rather paranoid to worry about it. You make a good point, though, to remember what information is possible to fake.
UPDATE: I see your point. In fact, in the logging I recently implemented, I removed the parsed string because of the data overhead. There is little point in storing both the raw string and the parsed string. The only real reason to do that would be to make querying the logs slightly easier, which is not a good enough reason for me. Personally, I store the whole raw user agent, which means no loss of data, future-proofs the logs against new browsers, OSes and user-agent formats, and eliminates the possibility of making mistakes when parsing.
From Wikipedia:
For this reason, most Web browsers use a User-Agent value as follows:
Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]
If you have already stored all the fields you need from it, then by all means discard the rest. The amount of data to log, how long to keep logs for, and in what form to keep them is a fairly personal thing that will differ in some ways from company to company and project to project.
I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (if not on the list with the Google crawler, etc), and send bogus information to the scraping crawler.
If, as a defender, I detect an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
If, as a defender, I detect the same agent/IP, the attacker will randomize the agent.
And this is where I get lost - if an attacker randomizes the intervals and the agent, how would I not discriminate against proxies and machines hitting the site from the same network?
I am thinking of checking the suspect agent with javascript and cookie support. If the bogey can't do either consistently, then it's a bad guy.
What else can I do? Are there any algorithms, or even systems designed for quick on-the-fly analysis of historical data?
My solution would be to make a trap. Put some pages on your site to which access is disallowed by robots.txt. Put a link to them on your page, but hide it with CSS, then IP-ban anybody who goes to that page.
This will force the offender to obey robots.txt, which means that you can put important information or services permanently away from him, which will make his carbon-copy clone useless.
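The trap could be sketched roughly as follows. The path /secret-archive/, the in-memory ban set and the handler function are all hypothetical; a real site would hook this into its web server or framework and persist the bans.

```python
# Hypothetical trap path: disallowed in robots.txt, linked only via a hidden anchor.
TRAP_PATH = "/secret-archive/"
ROBOTS_TXT = "User-agent: *\nDisallow: /secret-archive/\n"
HIDDEN_LINK = '<a href="/secret-archive/" style="display:none">archive</a>'

BANNED_IPS = set()


def handle_request(ip, path):
    """Return an HTTP status code; visiting the trap bans the caller's IP."""
    if ip in BANNED_IPS:
        return 403
    if path.startswith(TRAP_PATH):
        # Only a client that ignores robots.txt and follows hidden links gets here.
        BANNED_IPS.add(ip)
        return 403
    return 200
```

Well-behaved crawlers never request the trap path because robots.txt tells them not to, and human visitors never see the link.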
Don't try and recognize by IP and timing or intervals--use the data you send to the crawler to trace them.
Create a whitelist of known good crawlers--you'll serve them your content normally. For the rest, serve pages with an extra bit of unique content that only you will know how to look for. Use that signature to later identify who has been copying your content and block them.
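A minimal sketch of the signature idea, assuming the marker is an HMAC of some client identifier (for example the requesting IP) appended to the page. The secret, the whitelist contents and the function names are placeholders; note also that a crawler name taken from the User-Agent header can be spoofed, so a real whitelist would need a stronger check.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"   # placeholder secret, known only to you
WHITELIST = {"Googlebot"}                # hypothetical list of trusted crawlers


def signature_for(client_id):
    # Short, innocuous-looking marker derived from the client identifier.
    digest = hmac.new(SECRET, client_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ref-" + digest[:12]


def render(page_text, client_id, crawler_name=None):
    # Whitelisted crawlers get the page unchanged; everyone else gets a marker.
    if crawler_name in WHITELIST:
        return page_text
    return page_text + "\n" + signature_for(client_id)


def find_source(copied_text, known_client_ids):
    # Given text found on a clone site, list which clients it was served to.
    return [c for c in known_client_ids if signature_for(c) in copied_text]
```

If a copy of the content shows up elsewhere, find_source can be run over the logged client identifiers to see who fetched that particular copy.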
And how do you keep someone from hiring a person in a country with low wages to use a browser to access your site and record all of the information? Set up a robots.txt file, invest in a security infrastructure to prevent DoS attacks, obfuscate your code (if accessible, like javascript), patent your inventions, and copyright your site. Let the legal people worry about someone ripping you off.
Premise: The requirements for an upcoming project include the fact that no one except authorized users may have access to certain data. This is usually fine, but this circumstance is not usual. The requirements state that there must be no way for even the programmer or any other IT employee to access this information. (They want me to store it without being able to see it, ever.)
In all of the scenarios I've come up with, I can always find a way to access the data. Let me describe some of them.
Scenario I: Restrict the table on the live database so that only the SQL Admin can access it directly.
Hack I: I roll out a change that sends the data to a different table for later viewing. Also, the SQL Admin can see the data, which breaks the requirement.
Scenario II: Encrypt the data so that it requires a password to decrypt. This password would be known by the users only. It would be required each time a new record is created as well as each time the data from an old record was retrieved. The encryption/decryption would happen in JavaScript so that the password would never be sent to the server, where it could be logged or sniffed.
Hack II: Roll out a change that logs keypresses in JavaScript and posts them back to the server so that I can retrieve the password. Or, roll out a change that simply stores the unencrypted data in a hidden field that can be posted to the server for later viewing.
Scenario III: Do the same as Scenario II, except that the encryption/decryption happens on a website that we do not control. This magic website would allow a user to input a password and the encrypted or plain-text data, then use JavaScript to decrypt or encrypt that data. Then, the user could just copy the encrypted text and put it in the field for new records. They would also have to use this site to see the plain text for old records.
Hack III: Besides installing a full-fledged key logger on their system, I don't know how to break this one.
So, Scenario III looks promising, but it's cumbersome for the users. Are there any other possibilities that I may be overlooking?
If you can have javascript on the page, then I don't think there's anything you can do. If you can see it in a browser, then that means it's in the DOM, which means you can write a script to get it and send it to you after it has been decrypted.
Aren't these problems usually solved via controls:
All programmers need a certain level of clearance and background checks
They are trained to understand that rolling out code to access the data is a fireable or worse offense
Every change in certain areas needs some kind of signoff
For example -- no JavaScript on page without signoff.
If you are allowed to add any code you want, then there's always a way, IMO.
Ask the client to provide a non-disclosure agreement for you to sign, sign it, then look at as much data as you want.
What I'm wondering is, what exactly will you be able to do with encrypted data anyway? Pretty much all apps require you to do some processing of the data, whether that means moving it to a required place, modifying it, sanitizing it, or displaying it. Otherwise, you're just a glorified pipe, and you don't have to do any work.
The only way I can think of where you wouldn't be looking at the data or doing anything with it would be a simple form to table mapping with CRUD options. If you know what format the data will be coming in as you should be able to roll something out with RoR, a simple skin, put SSL into the mix, and roll it out. Test with dummy data in the same format, and you're set.
In fact, is your client unable to supply dummy data for testing? If they can, then your life is simple as all you do is provide an "installable" and tell them how to edit a config file.
I think you could still create the app in the following way:
1. Create a dev database and set up a user for it.
2. Ask them for: the data type, size, and name of each field that needs to be on the screen.
3. Set up the screens, create columns in the database that accept the data type and size they specify.
4. Deploy the app to production, hooked up to an empty database. Get someone with permission (not you) to go in and set the password on the database user and set the password for the DB user in the web app.
5. Authorized users can then do whatever they want and you never saw what any of the data looked like.
Of course, maintaining the app and debugging is gonna be a bitch!
--In answer to comments:
6. Ok, so after setting up the password for the Username in the database and in the web app's config, write a program that connects to the database, sets a randomized password, then writes that same randomized password to the web config.
7. Prevent any outgoing packets from the machine except to a set of authorized workstations - so you can't install your spyware.
8. Then set the Admin password on both servers to the same random password, then delete all other users on the servers, delete the program, and delete the program source code.
9. Wipe the hard drives of the developer machines with the DOD algorithm, and then toss them into an industrial shredder.
10. If the server ever needs debugging, toss it in the trash, buy a new one, and start back at #1.
But seriously - this is an insolvable problem. The best answer to this really is:
Tell them they can't have an application. Write your stuff on paper. Put it in a folder. Lock it in a vault. Thrust, repeat.
Wouldn't scenario 3 just expose all the data to the magic website? This doesn't sound like a solvable problem (at least I can't think of a solution).
Go with whatever solution is easiest for you to implement. I think the requirements show that the client does not understand software development, so it should be easy to sell any approach you take.
I have to say I really don't like the idea of using JavaScript on the client to decrypt the data. That is a huge hole as any script (hacker, GreaseMonkey, IE7Pro, etc.) can access the DOM and get data out of the page.
Also, it is very hard to get around the problem of keystroke loggers. If you throw those into the mix, then your options are limited. At that point you need a security fob such as RSA (commonly used with corporate VPNs) to generate truly random PINs. That will probably be expensive, and it is a pain, and I have only seen it used with VPNs, but I assume it could work with websites as well.
As far as the website goes, I'd stick with HTTPS and find a way to encrypt/decrypt on the web server rather than relying on JavaScript. SSL traffic isn't very prone to sniffing (it is very difficult to decrypt), so that allows the encryption and decryption to happen server-side, which (IMHO) is more secure.
Look at banking scenarios and other financial institutions for a starting point, and then go from there. Try not to over-complicate if possible.
You can't guarantee against hacking into the data as long as you have access to the server it lives on. So tell the employer they have to host the data somewhere else and grant access to the client's browser via a secure HTTPS connection.
You can design your web page to dynamically load an XML data stream securely, and format it into a web page using an XSLT script on the client.
See http://www.w3schools.com/xsl/xsl_client.asp for examples
That way you produce the code, but you never have access to the data. Only the user has access to their own data.
As for how the employer is going to host the data without granting any IT people access to it, that's their problem. It's a foolish requirement.
I think that I'll just tell them that they either have to trust a couple of us to have access (and not look at it) or they don't get a project.
Thanks for the answers. Feel free to post more thoughts if you have them.
You can never have 100% security, and extra security comes at a cost of speed/price/convenience etc.
Let's suppose you take scenario 3 - one of your programmers can use social engineering to get the password from one of the users. Goodbye security.
There's no point having a high-security iron door as a gate if people can just walk around it. Just implement a decent level of security.
(They want me to store it without being able to see it, ever.)
Hey, the recording industry wants people to be able to listen to their music, but not copy it. Sounds like they should get together sometime!
Their idea won't work for the same reason DRM doesn't work: the trust chain is inherently compromised. Encryption examples often use Alice, Bob, and Charlie where Alice is trying to communicate with Bob without Charlie listening in. With DRM, the trust chain is compromised because Bob and Charlie are the same person. With your situation, Charlie is the guy writing the software that Alice and Bob use to communicate. There's an implied trust, because if you don't trust Charlie then you can't trust Charlie's software, either.
That's the root of the issue: trust. If they can't trust the programmer, the game is over before it starts.
There are lots of options based on what their goal really is, but I am confused by their paranoia, er, intent:
Is this their (and end-user) data that they wish to keep private or end-user data to be kept private from everyone?
Is it just that your (or any contracted) company is suspect?
Are they afraid of over-the-wire snooping?
Are they afraid of DOM access through JavaScript or browser plugins?
Are they planning staged deployment? In that case you work on test/dev server w/o real data but have no access to the production server with the real data, and DNS logging and/or firewall rules inhibit all of your hacks from working undetected.
Ultimately if the data is stored in a DB then the programmer and DB admin can, by working together, get it. Period. A good audit should uncover that, though.
If this is truly a requirement, the only way to guard against this is to hire an outside firm to audit the code prior to releasing the software, and that's going to be very expensive.