Set default settings to 'no-cache' on Google Cloud Storage - google-app-engine

Is there a way to set all public links to have 'no-cache' in Google Cloud Storage?
I've seen solutions to use gsutil to set the "Cache-Control" upon file-upload, but I'm looking for a more permanent solution.
There was a conversation about providing a cache invalidation feature but I didn't quite follow the reasoning. Any explanations would be greatly appreciated!
it would be difficult to provide a cache invalidation feature because once served with a non-0 cache TTL any cache on the Internet (not just those under Google's control) is allowed (per HTTP spec) to cache the data
Thanks!

For a more permanent one-time-effort solution, with the current offerings on GCP, you can do this with Cloud Functions.
Create a new Funciton, set the Event type to "On (finalizing/creating) file in the selected bucket" - google.storage.object.finalize. Make sure to select the bucket you want this on. In the body of the function, set the cacheControl / Cache-Control attribute for the blob. The attribute name depends on the language. Here's my version in Python, using cache_control:
main.py:
match the function name below to the Entry point
from google.cloud import storage
def set_file_uncached(event, context):
file = event # auto-generated
print(f"Processing file: {file=}") # logging, if you want it
storage_client = storage.Client()
# we expect just one with that name
blob = storage_client.bucket(file["bucket"]).get_blob(file["name"])
if not blob:
# in case the blob is deleted before this executes
print(f"blob not found")
return None
blob.cache_control = "public, max-age=0" # or whatever you need
blob.patch()
requirements.txt
google-cloud-storage
From the logs: Function execution took 1712 ms, finished with status: 'ok'. This could have been faster but I've set the minimum to 0 instances so it needs to spin-up for each upload. Depending on your usage and cost constraints, you can set it to 1 or something higher.
Other settings:
Retry on failure: No/False
Region: [wherever your bucket is]
Memory allocated: 128 MB (smallest available currently)
Timeout: 5 seconds (smallest available currently, function shouldn't take longer)
Minimum instances: 0
Maximum instances: 1

Related

How to configure checkpointing on an XTDB node using AWS S3

I am using XTDB 1.21.0 deployed on AWS/ECS (Fargate) with checkpoints configured (frequency 30 minutes) and stored on an S3 bucket (RocksDB). After a couple of successful checkpoints, they seem to be constantly failing with an XTDB warning due to an exception in the HTTP request to AWS, as shown below:
This leaves the S3 buckets with incomplete checkpoints (i.e., a Folder containing a set of SSTs and other RocksDB files and no associated EDN index file):
XTDB documentation mentions the fact that an optional S3configurator can be passed to the node configuration and after a bit of Googling around I figured that makeClient should be overridden so that connectionAcquisitionTimeout can be set:
NettyNioAsyncHttpClient.builder()
.maxConcurrency(200)
.connectionAcquisitionTimeout(Duration.ofMillis(20000))
I am not too familiar with NETTY so would appreciate if someone could help with the right incantation.
Also I am configuring the XT node from an EDN file, and haven't figure out how to write a S3 configurator in an EDN file (or if it is even possible).
Thanks in advance!
This can happen for large datasets where the default S3 client used will create a new async request for each object (for which the number of objects may be very large, particularly if using the RockDBs index). Internally it uses the connectionAcquisitionTimeout as a type of backpressure to ensure that incoming requests don't wait indefinitely for a connection from the connection pool, however, in this case we're the only source of these requests and we definitely want the requests to complete before starting the nodes so it's reasonable to set the connectionAcquisitionTimeout to something very high (the default is only 10 seconds). A good choice of limit might be something like the maximum amount of time you want to wait for the node to start before failing.
This appears to be a non-optional parameter of the SDK for what I can only assume is a sensible default strategy for requests coming from an external source, in our case we essentially want it to behave as if it was a synchronous operation.
Configuring this in Clojure with xtdb would look something like this:
(ns foo.db
(:require
[xtdb.api :as xtdb]
[xtdb.checkpoint]
[xtdb.rocksdb]
[xtdb.s3.checkpoint])
(:import
(java.time Duration)
(software.amazon.awssdk.http.nio.netty NettyNioAsyncHttpClient)
(software.amazon.awssdk.services.s3 S3AsyncClient)
(xtdb.checkpoint Checkpointer)
(xtdb.s3 S3Configurator)))
(def s3-configurator
(reify S3Configurator
(makeClient [this]
(.. (S3AsyncClient/builder)
(httpClientBuilder
(.. (NettyNioAsyncHttpClient/builder)
(connectionAcquisitionTimeout
(Duration/ofSeconds 600)) ;; Set a high limit here
;; We can rely on the defaults for maxConcurrency and
;; maxPendingConnectionAcquires
;; (maxConcurrency (Integer. 200))
;; (maxPendingConnectionAcquires (Integer. 10000))
))
(build)))))
(defn start-node!
[]
(xtdb/start-node
{:xtdb/index-store
{:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
:db-dir "/var/xtdb/idxs"
:checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
:store {:xtdb/module 'xtdb.s3.checkpoint/->cp-store
:configurator (constantly s3-configurator)
:bucket "checkpoints"}
:approx-frequency "PT3H"}}}}))

What are the possible values for X-AppEngine-TaskRetryReason header in Google app engine request headers?

Basically, I am facing an issue while n number of taskqueues are running in the Google Cloud Platform. There is no error in code or server but the execution of the taskqueues got terminated due to instance unavailability by which it will trigger a taskqueue again and again.
I know a few reasons by which this kind of termination process takes place.
Reasons:
Instance Unavailable
App Error / AppEngine Error
Memory Exceeded
I want to know the other possible values for the X-AppEngine-TaskRetryReason header.
For example (the response of GAE):
self.request.headers {'Content_Length': '432', 'Content-Length': '432', 'X-Appengine-Current-Namespace': '75f4910a-b925-4945-92f0-b214a459f0be', 'X-Appengine-Taskexecutioncount': '1', 'X-Appengine-Tasketa': '1624452214.545367', 'User-Agent': 'AppEngine-Google; (+http://code.google.com/appengine)', 'X-Appengine-Taskpreviousresponse': '503', 'Host': 'payqa-dot-hw-pay.qa.appspot.com', 'X-Appengine-Taskretrycount': '2', 'Referer': 'http://payqa-dot-hw-pay.qa-.appspot.com/pay/runpayroll', 'Content_Type': 'application/octet-stream', 'X-Cloud-Trace-Context': 'd44fdfd56bc7733afb3169fb354b80ed/6659926505008598367', 'Traceparent': '00-d44fdfd56bc7733afb3169fb354b80ed-5c6ccfded93f0d5f-00', 'X-Appengine-Queuename': 'payroll', 'X-Appengine-Taskname': '21925984910338157231', 'Content-Type': 'application/octet-stream', 'X-Appengine-Country': 'ZZ', **'X-Appengine-Taskretryreason': 'Instance Unavailable'**}
Like I mentioned in the comments there is no listing in the documentation for the possible values of X-AppEngine-TaskRetryReason and it only states that it represents:
The reason for retrying the task.
That being said there is two possibilities why this happens, either this has no specific value and just spits out whatever message it is passed to it by the actual class or component that generated the failure of the execution of the tasks or this is not being shared because the Google Cloud team did not considered it necessary.
Either way if you want to know why this happens and what values you can expect, you should open a Customer issue in Google's Issue Tracker so you can check why this is not shared in the documentation with their Engineering team.

Google AppEngine application log assigned to the wrong request log

When I look at the logs in the Google Log Viewer for my GAE project, I see that often the logs that I write myself in the code are assigned to the wrong request. Most of the time the log is assigned to the request directly after the request that produced the log entry.
As the root of every application log in GAE must be a request, this means that the wrong request is sometimes marked as error, because another request before produced an error, but the log is somehow assigned to the request after that.
I don't really do anything special, I use Ktor as my servlet and have an interceptor that creates a log when an exception occurs before returning status 500.
I use Java logging via SLF4J with the google cloud logging handler, but before that I used logback via SLf4J and had the same problem.
The content of the logs itself is also correct, the returned status of the request, the level of the log entry, the message, everything is ok.
I thought that it may be because I use kotlin and switch coroutine contexts during a single request, but in some cases the point where I write the log and where I send the response are exactly next to each other, so I'm not sure if kotlin has anything to do with it.
My logging.properties:
# To use this configuration, add to system properties : -Djava.util.logging.config.file="/path/to/file"
#
.level = INFO
# it is recommended that io.grpc and sun.net logging level is kept at INFO level,
# as both these packages are used by Stackdriver internals and can result in verbose / initialization problems.
io.grpc.netty.level=INFO
sun.net.level=INFO
handlers=com.google.cloud.logging.LoggingHandler
# default : java.log
com.google.cloud.logging.LoggingHandler.log=custom_log
# default : INFO
com.google.cloud.logging.LoggingHandler.level=INFO
# default : ERROR
com.google.cloud.logging.LoggingHandler.flushLevel=WARNING
# default : auto-detected, fallback "global"
#com.google.cloud.logging.LoggingHandler.resourceType=container
# custom formatter
com.google.cloud.logging.LoggingHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n
#optional enhancers (to add additional fields, labels)
#com.google.cloud.logging.LoggingHandler.enhancers=com.example.logging.jul.enhancers.ExampleEnhancer
My logging relevant dependencies:
implementation "org.slf4j:slf4j-jdk14:1.7.30"
implementation "com.google.cloud:google-cloud-logging:1.100.0"
An example logging call:
exception<Throwable> { e ->
logger().error("Error", e)
call.respondText(e.message ?: "", ContentType.Text.Plain, HttpStatusCode.InternalServerError)
}
with logger() being:
import org.slf4j.Logger
import org.slf4j.LoggerFactory
inline fun <reified T : Any> T.logger(): Logger = LoggerFactory.getLogger(T::class.java)
Edit:
An example of the log in Google cloud. The first request has the query parameter GAID=cdda802e-fb9c-47ad-0794d394c913, but as you can see the error log for that request is in the one below, marked in red.

Using JanusGraph with Solr

Setting up JanusGraph i noticed the following in the console:
09:04:12,175 INFO ReflectiveConfigOptionLoader:173 - Loaded and initialized config classes: 10 OK out of 12 attempts in PT0.023S
09:04:12,230 INFO Reflections:224 - Reflections took 28 ms to scan 1 urls, producing 2 keys and 2 values
09:04:12,291 WARN GraphDatabaseConfiguration:1445 - Local setting index.search.index-name=entity (Type: GLOBAL_OFFLINE) is overridden by globally managed value (janusgraph). Use the ManagementSystem interface instead of the local configuration to control this setting.
09:04:12,294 WARN GraphDatabaseConfiguration:1445 - Local setting index.search.backend=solr (Type: GLOBAL_OFFLINE) is overridden by globally managed value (elasticsearch). Use the ManagementSystem interface instead of the local configuration to control this setting.
09:04:12,300 INFO CassandraThriftStoreManager:628 - Closed Thrift connection pooler.
and then i see the following:
Exception in thread "main" java.lang.IllegalArgumentException: Could not instantiate implementation: org.janusgraph.diskstorage.es.ElasticSearchIndex
How do i stop using elasticsearch and switch to Solr?
My properties file is as follows:
index.search.backend=solr
index.search.directory=/path/to/directory/for/solr/index/something
index.search.index-name=something
index.search.solr.mode=http
index.search.solr.http-urls=http://127.0.0.1:8983/solr
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.25
The answer to this basically the same as this one for Titan. JanusGraph was forked from Titan.
You are probably trying to connect to an existing graph that was previously configured to use Elasticsearch. By default, the keyspace is named janusgraph.
1) You could connect to a different keyspace by updating conf/janusgraph-cassandra.properties
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
storage.cassandra.keyspace=mygraph
2) You could drop the existing keyspace. If you used bin/janusgraph.sh start from the quick start directions (which starts a single node Cassandra and a single node Elasticsearch),
bin/janusgraph.sh clean
Or if you have a standalone Cassandra installation:
$CASSANDRA_HOME/bin/cqlsh -e 'drop keyspace if exists janusgraph'
Then you would be able to connect with the default conf/janusgraph-cassandra.properties.

Appengine deployments are extraodinarily slow today?

We have a small java project need to deploy
it include 9000+ files
command : mvn gcloud:deploy
but I get the Log:
...
[INFO] INFO: Uploading [/home/steven/work/idigisign/target/appengine-staging/__static__/node_modules/rx/src/core/linq/observable/when.js] to [7dfb30ad32893c5042dba03601f006a40419fab0]
[INFO] DEBUG: Uploading [/home/steven/work/idigisign/target/appengine-staging/assets/global/plugins/bootstrap-switch/js/bootstrap-switch.min.js] to [7e0725897d7b99c3c33b56915d202e2dde552ea9]
[INFO] INFO: Uploading [/home/steven/work/idigisign/target/appengine-staging/assets/global/plugins/bootstrap-switch/js/bootstrap-switch.min.js] to [7e0725897d7b99c3c33b56915d202e2dde552ea9]
[INFO] DEBUG: Uploading [/home/steven/work/idigisign/target/appengine-staging/node_modules/is-redirect/index.js] to [7e0afe4775bf7f8558665760171c01948c22f771]
[INFO] INFO: Uploading [/home/steven/work/idigisign/target/appengine-staging/node_modules/is-redirect/index.js] to [7e0afe4775bf7f8558665760171c01948c22f771]
[INFO] DEBUG: Uploading [/home/steven/work/idigisign/target/appengine-staging/node_modules/rxjs/src/util/Map.ts] to [7e11722f4cd9ce91ec99b97710fbc4e7f40be09d]
...
About 50 per minute
So it will spent 180 minute...
It is extraodinarily slow
anybody can help me?
Set the environment variable CLOUDSDK_APP_USE_GSUTIL=1 and try again; this uses a less-reliable but faster codepath for file upload (there are plans to speed up the default codepath).
We have the same issue, it's very slow.
Guess we have solved it.
First, we traced the gcloud logs and we found many files had been uploaded again, these files all are no modified. So we try to trace the source code of gcloud and we found the issue is caused by "Google Cloud Storage JSON API".
When it queried the List of Bucket, it returned 1000 items but we have 1325 items so I guess we find the issue.
Then, we look for the api reference, and we find a parameter - maxResults, so we try to modify the source code(cloud_storage.py), and we find it has No Effect when its value is over 1000.
Finally, we find another parameter - nextPageToken, and we query list until the "nextPageToken" is None, now it got all items from "Google Cloud Storage" and the exists files not be uploaded again.
def ListBucket(bucket_ref, client):
request = STORAGE_MESSAGES.StorageObjectsListRequest(bucket=bucket_ref.bucket)
items = set()
try:
response = client.objects.List(request)
for item in response.items:
items.add(item.name)
while response.nextPageToken:
request = STORAGE_MESSAGES.StorageObjectsListRequest(bucket=bucket_ref.bucket,pageToken=response.nextPageToken)
response = client.objects.List(request)
for item in response.items:
items.add(item.name)
except api_exceptions.HttpError as e:
raise UploadError('Error uploading files: {e}'.format(e=e))
return items

Resources