How does the Google App Engine Firewall work?

With this feature I can now test my app on App Engine and allow or deny access for particular sets of IPs. That is a relief for developers.
But how can I set the priority so that I only allow IP addresses in a certain range, as shown in the image below? Also, this is in beta; will it remain consistent with the upcoming stable version?
And how can I avoid conflicting rules in App Engine? There is a large range of allowed priority values; does that affect performance in any case? Does it work fine with multiple domains?

Hopefully these points may help:
All traffic that is not matched by a rule is permitted. While not one of your questions, this is critical to remember.
A request that matches multiple rules will honor the rule with the numerically lowest priority.
The default rule has priority 2147483647, so you are in good shape there; it is also the maximum value and the last rule to be applied.
Priority values must be unique, which avoids conflicts but requires careful attention to the ordering of rules. Once a rule matches, rules with numerically higher priority values are not applied.
App Engine firewall performance should not be affected by a reasonable number of rules, but as it is in beta, Google does not offer SLAs and it is subject to deprecation without notice or to changes that break backwards compatibility.
Note that you cannot edit rules once they are created; you have to delete and recreate them. Due to the ordered nature of these rules you may also want to increase the priority of the one rule you have. If you want to add more critical rules you will be limited to a maximum of 99 under the current config.
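To make the ordering concrete, here is a minimal sketch in plain Python (not the actual firewall implementation; the ranges and priorities are invented for illustration) of the evaluation described above: the matching rule with the numerically lowest priority wins, and the implicit default rule at priority 2147483647 allows everything else.

```python
import ipaddress

# Hypothetical rule set for illustration only; ranges and priorities are made up.
RULES = [
    {"priority": 1000, "action": "ALLOW", "source": "192.0.2.0/24"},
    {"priority": 2000, "action": "DENY", "source": "0.0.0.0/0"},
    {"priority": 2147483647, "action": "ALLOW", "source": "*"},  # default allow-all rule
]

def evaluate(ip):
    """Return the action of the matching rule with the numerically lowest priority."""
    addr = ipaddress.ip_address(ip)
    for rule in sorted(RULES, key=lambda r: r["priority"]):
        if rule["source"] == "*" or addr in ipaddress.ip_network(rule["source"]):
            return rule["action"], rule["priority"]

print(evaluate("192.0.2.10"))   # ('ALLOW', 1000) - matched by the narrow range rule
print(evaluate("203.0.113.5"))  # ('DENY', 2000)  - caught by the broad deny rule
```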

Related

Costs generated by old versions of my appengine application

I have many App Engine versions that are still active but not used because they are old. What costs can they generate? How do you deal with old versions of your App Engine app? Do you delete them or deactivate them?
In the documentation I can't find any reference to the costs of old versions.
https://cloud.google.com/appengine/pricing?hl=it
UPDATE:
(GAE STANDARD)
Thank you
It's a poorly documented aspect of App Engine. What you describe as versions that are "not used" are more specifically versions that don't receive traffic. But depending on your scaling configuration (essentially defined in your app.yaml file), there may not be a 1:1 relationship between traffic and the number of active instances serving a version.
For example, I'm familiar with using "automatic_scaling" with min_instances = 1. This prevents a service from scaling to zero instances and adding latency to the first request after some idle time, but it also means that any version, until it is deleted, will generate a baseline cost of one instance running 24/7.
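For reference, a minimal app.yaml sketch of that kind of configuration; the runtime and instance counts are placeholders, and the exact keys depend on your runtime and environment, so verify them against the scaling documentation:

```yaml
runtime: python39        # hypothetical runtime, adjust to yours
service: default

automatic_scaling:
  min_instances: 1       # keeps one instance warm 24/7 for this version -> baseline cost
  max_instances: 3
```

Removing min_instances (or setting it to 0) lets an idle version scale to zero, at the price of cold-start latency on the next request.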
Also, I've found that the estimated number of instances displayed in the dashboard you screenshotted can be misleading (more specifically, it can show 0 instances while there is actually one running).
Note that if you do not have any scaling-related configuration in your app.yaml file, you should check which default values App Engine currently applies.
It's tricky when you get started and I'm sure I'm not the only one who lost most of the free trial budget because of this.
There is actually a limit on the number of versions you can have, depending on your app's pricing:
Max versions: 15 for a free app, 210 for a paid app.
It seems that you can keep old versions active in case you want to migrate or split traffic between versions, and you won't be charged for the versions themselves as long as you stay within the limit of 15 for a free app.

When is data consistency not an issue?

I am new to distributed systems and I have read about the CAP theorem; I am interested in AP systems such as Cassandra.
My question is: in what cases can you actually sacrifice consistency? Effectively, sacrificing consistency means serving inaccurate data. In what cases, then, would you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.
By an AP system, I assume you at least aim to ensure eventual consistency.
Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed occasionally lags by five minutes (the feed list is eventually consistent). Missing two or three very recent updates in the news feed is okay in this scenario as long as those updates eventually appear. And in fact, Cassandra was originally developed at Facebook.
Imagine a distributed key-value cache system where updates are very rare. If there are almost no update operations, ensuring strong consistency is unnecessary, so you can focus on availability. The occasional cache miss (a key-value entry that has not been populated yet) and the resulting request to the database due to eventual consistency should be okay.
My question is in what cases can you actually sacrifice consistency?
One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially the aggregation of many, many users to determine purchasing/viewing patterns.
For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.
Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?
tl;dr: The real question to ask is whether getting a somewhat-accurate list of 5 product recommendations in less than 10 ms is better than getting a 100% accurate list of 5 product recommendations in 100 ms.
Both result sets will help drive sales. But the one that is returned fast enough not to hinder the user experience is much preferred.
'C' in CAP refers to linearizability, which is a very strong form of consistency that you don't need most of the time.
Linearizability is a recency guarantee which makes it appear that there is a single copy of data. As soon as you make a change in the data, all subsequent reads will return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, viz.
Leader election
Allowing end users to create their unique user id
Distributed locking etc.
When you have these use cases, you'd use something like ZooKeeper, etcd, etc. Cassandra also has Lightweight Transactions (LWT), which use an extension of the classic Paxos algorithm to implement linearizability. This feature can be used to address those rare use cases where you must have linearizability and serializability, but it is expensive. In the vast majority of cases you are just fine with slightly weaker consistency; you trade a little bit of consistency for scalability and performance.
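A hedged sketch of the "unique user id" case with the Python cassandra-driver: the IF NOT EXISTS clause turns the insert into a lightweight transaction, so only one of two concurrent registrations for the same id will be applied (the keyspace, table, and column names here are made up for illustration).

```python
from cassandra.cluster import Cluster

# Assumes a local Cassandra node and a pre-created table, e.g.:
#   CREATE TABLE accounts.users (user_id text PRIMARY KEY, email text);
session = Cluster(["127.0.0.1"]).connect("accounts")

result = session.execute(
    "INSERT INTO users (user_id, email) VALUES (%s, %s) IF NOT EXISTS",
    ("alice", "alice@example.com"),
)

# For conditional (LWT) statements the driver reports whether the write won the Paxos round.
print("user id claimed" if result.was_applied else "user id already taken")
```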
Some e-commerce websites send apology letters to customers for not being able to fulfill their orders. That happens because the last copy of a product was sold to more than one customer due to the lack of linearizability. They prefer dealing with that over not being able to scale with their customer base and not being able to respond to requests within stringent SLAs.
Cassandra is said to have tunable consistency. You may want to record user clicks or activities for analysis; you are okay if some data is lost, but you cannot compromise on performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).
If you want a little more consistency, you'd use the QUORUM consistency level for both reads and writes, along with hints and read repair. In the vast majority of cases all nodes are updated almost instantaneously. Even if one or two nodes go down, a majority of nodes will still have the data, and the failed nodes will be repaired when they come back using hints, read repair, and anti-entropy repair.
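As a rough sketch of that tuning with the Python cassandra-driver (the keyspace, table, and column names are made up for illustration), the consistency level can be set per statement:

```python
import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("analytics")

# Fire-and-forget click tracking: ANY succeeds even if only a hint could be stored.
click = SimpleStatement(
    "INSERT INTO clicks (user_id, ts, url) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ANY,
)
session.execute(click, ("alice", datetime.datetime.utcnow(), "/home"))

# Stronger guarantees: QUORUM on both the write and the read.
update = SimpleStatement(
    "UPDATE profiles SET email = %s WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(update, ("alice@example.com", "alice"))

read = SimpleStatement(
    "SELECT email FROM profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(read, ("alice",)).one())
```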
Cassandra is particularly useful for cases where you don't have many concurrent updates to the same data. The reason is that, unlike the Dynamo architecture, it does not use vector clocks for conflict resolution between replicas; instead it uses Last Write Wins (LWW) based on timestamps. If timestamps are equal, it falls back to lexicographical order. Since clocks on different nodes cannot be perfectly accurate even with NTP, there is a possibility of data loss, although Cassandra has taken some steps to mitigate that, e.g. client-side timestamps instead of server-side timestamps.
The CAP theorem says that, given partition tolerance, you can choose either availability or consistency in a distributed database (no one would want to give up partition tolerance in any case). So if you want maximum availability, you'll have to give up on consistency. How far you go depends, of course, on how critical the business is.
You answered something on SO but the answer doesn't show up when you visit the page? That can be tolerated. SO being down? That can't. Critical financial systems would rather have strong consistency than availability; every once in a while, my bank's servers go offline when I try to make a payment.
Normally, you choose availability and eventual consistency. The answer you wrote on SO would eventually show up.
Apart from the above-mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to resolve the inconsistency.
For example, if we find two different versions of someone's address in the database, we can prompt the user to identify the correct one.

Should I set max limits on as many entities as possible in my webapp?

I am working on a CRUD webapp where users create and manage their data assets.
Since I have no desire to be bombarded with tons of data right from the start, I think it would be reasonable to set limits where possible in the database.
For example, limit the number of created items to A = 10, B = 20, C = 50; then, if a user reaches a limit, have a look at their account and decide whether to relax the rules, provided that doesn't break the code or performance.
Is it good practice at all to set such limits for performance/maintenance reasons (not business reasons), or should I assume data entities are unlimited and try to make the app perform well with lots of data from the start?
You are suggesting testing your application's performance on real users, which is a bad idea. In addition, your solution will inconvenience users by limiting them when there is no reason for it (at least from the user's point of view), which decreases user satisfaction.
Instead, you should test performance before you release. It will give you an understanding of your application's and infrastructure's limits under high load. It will also help you find and eliminate bottlenecks in your code. You can perform such testing with tools like JMeter and many others.
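If you prefer staying in code rather than setting up JMeter, a minimal load-test sketch could look like the following; the URL, request count, and concurrency are placeholders to adjust for your app:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party: pip install requests

URL = "https://staging.example.com/items"  # placeholder endpoint
TOTAL_REQUESTS = 200
CONCURRENCY = 20

def hit(_):
    """Issue one GET and return its status code and latency in seconds."""
    start = time.monotonic()
    status = requests.get(URL, timeout=10).status_code
    return status, time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(TOTAL_REQUESTS)))

latencies = sorted(latency for _, latency in results)
errors = sum(1 for status, _ in results if status >= 500)
print("p50=%.3fs p95=%.3fs errors=%d" % (
    latencies[len(latencies) // 2],
    latencies[int(len(latencies) * 0.95)],
    errors,
))
```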
Also, if you are afraid of tons of data at launch, you can release your application as a private beta: just make a simple form where users can ask for early access and receive an invite. By sending invites you can easily control the growth of your user base and therefore the load on your app.
But you should, of course, create limits where they are genuinely necessary, for example a limit on items per page, etc.

What exactly is the throughput restriction on an entity group in Google App Engine's datastore?

The documentation describes a limitation on the throughput to an entity group in the datastore, but is vague on what exactly the limitation is. My confusion is in two parts:
1. What is being restricted?
Specifically, is it:
The number of writes?
Number of transactions that write to the datastore?
Number of transactions, regardless of whether they read from or write to the datastore?
2. What is the type of the restriction?
Specifically, is it:
An artificially enforced one-per-second hard rule?
An empirically observed max throughput, that may in practice be better based on factors like network load, etc.?
There's no throughput restriction per se, but to guarantee atomicity in transactions, updates must be serialized and applied sequentially and in order, so if you make enough of them things will start to fail/timeout. This is called datastore contention:
Datastore contention occurs when a single entity or entity group is updated too rapidly. The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception. If you're expecting to update a single entity or write to an entity group more than several times per second, it's best to re-work your design early-on to avoid possible contention once your application is deployed.
To directly answer your question in simple terms: it's specifically the number of writes per entity group (5-ish per second), and it's just a rule of thumb; your mileage may vary (greatly).
Some people have reported no contention at all, while others have trouble getting more than 1 update per second. As you can imagine, this depends on the complexity of the operation and the load on all the machines involved in its execution.
Limits:
writes per second to an entity group
entity groups per cross-entity-group transaction (XG transaction)
There is a limit of 1 write per second per entity group. This is a documented limit that in practice appears to be a 'soft' limit, in that it is possible to exceed it but not guaranteed that you will be allowed to. Transactions 'block' if the entity has been written to in the last second; however, the API allows transient exceptions to occur as well. Obviously you would be susceptible to timeouts as well.
This does not affect the overall number of transactions for your app, just those related to that specific entity group. If you need to, you can design portions of your data model to get around this limitation (see the sharded-counter sketch after this answer).
There is a limit of 25 entity groups per XG transaction, meaning a transaction cannot incorporate more than 25 entity groups in its context (reads, writes, etc.). This used to be a limit of 5 but was recently increased.
So to answer your direct questions:
Writes to the entire entity group (as defined by the root key) within a one-second window (which is not strict)
An artificially enforced one-per-second soft rule
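As a concrete illustration of the "design around the limit" point above, the classic pattern is a sharded counter: spread writes across N child entities so that each entity group stays under the ~1 write/second guideline. A minimal sketch with the legacy App Engine ndb library (the model and shard count are invented for illustration):

```python
import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # writes are spread across 20 separate entity groups

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_name):
    """Bump one randomly chosen shard; the per-entity-group write rate stays low."""
    shard_id = "%s-%d" % (counter_name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def total(counter_name):
    """Sum all shards to get the overall count; reads are not subject to the write limit."""
    keys = [ndb.Key(CounterShard, "%s-%d" % (counter_name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)
```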
If you have to ask that question, then Google Datastore is probably not for you.
The Google Datastore is an experimental database whose API can be changed at any time; it is also meant for retail apps and non-critical applications.
You get a clear indication of this when you sign up for the Datastore: something along the lines of no commitment to backwards compatibility, etc. Another indication is the lack of clear examples and the lack of wrappers providing a simple API for accessing the Datastore; the examples on the net are a soup of complicated installations and procedures just to make a simple query.
My own conclusion so far, after days of research, is that Google Datastore is not ready for commercial use, but it looks promising once it is finished and in a stable release version.
When you search the net and look at the few Google examples, if there are any at all, what stands out is what is not mentioned rather than what is, which amounts to Google mentioning almost nothing. And if you look at the vendors "supporting" Google Datastore, they simply link back to the Google Datastore site for further information, which mentions nothing concrete, so you end up going in circles.

What is the best way to do basic View tracking on a web page?

I have a web-facing, anonymously accessible blog directory and blogs, and I would like to track the number of views each blog post receives.
I want to keep this as simple as possible; accuracy need only be an approximation. This is not for analytics (we have Google for that) and I don't want to do any log analysis to pull out the stats, as running background tasks in this environment is tricky and I want the numbers to be as fresh as possible.
My current solution is as follows:
A web control that simply records a view in a table for each GET.
Excludes a list of known web crawlers using a regex and UserAgent string
Provides for the exclusion of certain IP Addresses (known spammers)
Provides for locking down some posts (when the spammers come for it)
This actually seems to do a pretty good job, but a couple of things annoy me. The spammers still hit some posts, thereby skewing the views, and I still have to manually monitor the views and update my list of "bad" IP addresses.
Does anyone have some better suggestions for me? Anyone know how the views on StackOverflow questions are tracked?
It sounds like your current solution is actually quite good.
We implemented one where the server code which delivered the view content also updated a database table which stored the URL (actually a special ID code for the URL since the URL could change over time) and the view count.
This was actually for a system with user-written posts that others could comment on but it applies equally to the situation where you're the only user creating the posts (if I understand your description correctly).
We had to do the following to minimise (not eliminate, unfortunately) skew.
For logged-in users, each user could only add one view point to a post. EVER. NO exceptions.
For anonymous users, each IP address could only add one view point to a post each month. This was slightly less reliable as IP addresses could be 'shared' (NAT and so on) from our point of view. The reason we relaxed the "EVER" requirement above was for this sharing reason.
The posts themselves were limited to having one view point added per time period (the period started low (say, 10 seconds) and gradually increased (to, say, 5 minutes) so new posts were allowed to accrue views faster, due to their novelty). This took care of most spam-bots, since we found that they tend to attack long after the post has been created.
Removal of a spam comment on a post, or a failed attempt to bypass CAPTCHA (see below), automatically added that IP to the blacklist and reduced the view count for that post.
If a blacklisted IP hadn't tried to leave a comment in N days (configurable), it was removed from the blacklist. This rule, and the previous rule, minimised the manual intervention in maintaining the blacklist, we only had to monitor responses for spam content.
CAPTCHA. This solved a lot of our spam problems, especially since we didn't just rely on OCR-type challenges (like "what's this word?" -> 'optionally'); we actually asked questions (like "what's 2 multiplied by half of 8?") that break the dumb character-recognition bots. It won't beat the hordes of cheap-labour CAPTCHA breakers (unless their maths is really bad :-) but the improvement over no CAPTCHA was impressive.
Logged-in users weren't subject to CAPTCHA but spam got the account immediately deleted, IP blacklisted and their view subtracted from the post.
I'm ashamed to admit we didn't actually discount the web crawlers (I hope the client isn't reading this :-). To be honest, they're probably only adding a minimal number of view points each month due to our IP address rule (unless they're swarming us with multiple IP addresses).
So basically, I'm suggesting the following as possible improvements. You should, of course, always monitor them to see whether they're working or not.
CAPTCHA.
Automatic blacklist updates based on user behaviour.
Limiting view count increases from identical IP addresses.
Limiting view count increases to a certain rate.
No scheme you choose will be perfect (e.g., our one-month rule) but, as long as all posts follow the same rule set, you still get a good comparative value. As you said, accuracy need only be an approximation.
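For what it's worth, here is a rough in-memory Python sketch of the kind of per-IP and per-post throttling described above; a real implementation would back these structures with your database, and the UA pattern, blacklist, and rate limit are placeholders.

```python
import re
import time

BOT_UA = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)
BLACKLIST = {"203.0.113.7"}        # known spammer IPs (placeholder)
MIN_SECONDS_BETWEEN_VIEWS = 10     # per-post rate limit; could grow as the post ages

view_counts = {}                   # post_id -> count
last_counted = {}                  # post_id -> timestamp of the last counted view
seen = set()                       # (post_id, ip) pairs already counted this period

def record_view(post_id, ip, user_agent, now=None):
    """Count a view unless it comes from a bot, a blacklisted IP, a repeat visitor,
    or arrives faster than the per-post rate limit allows."""
    now = now or time.time()
    if BOT_UA.search(user_agent or "") or ip in BLACKLIST:
        return False
    if (post_id, ip) in seen:
        return False
    if now - last_counted.get(post_id, 0) < MIN_SECONDS_BETWEEN_VIEWS:
        return False
    seen.add((post_id, ip))
    last_counted[post_id] = now
    view_counts[post_id] = view_counts.get(post_id, 0) + 1
    return True
```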
Suggestions:
Move the hit count logic from a user control into a base Page class.
Redesign the exclusions list to be dynamically updatable (i.e. store it in a database or even in an XML file)
Record all hits. On a regular interval, have a cron job run through the new hits and determine whether they are included or excluded. If you do the exclusion for each hit, each user has to wait for the matching logic to take place.
Come up with some algorithm to automatically detect spammers/bots and add them to your blacklist. And/Or subscribe to a 3rd party blacklist.
