Mismatch between content node storage ratio and server storage - Vespa

I am running Vespa in Docker with a single content node on an Ubuntu server. The total storage is:
[root@vespa-container /]# df -h .
Filesystem      Size  Used Avail Use% Mounted on
overlay         485G  118G  343G  26% /
Apparently, 26% is far below the default 80% (0.8) limit ratio in the Vespa settings, but I still got a NO_SPACE error:
ReturnCode(NO_SPACE, External feed is blocked due to resource exhaustion: memory on node 0 [vespa-container] (0.802 > 0.800))
How can I fix this? Thanks!

In this case you are limited by memory, not disk:
memory on node 0 [vespa-container] (0.802 > 0.800)
The NO_SPACE return code covers any resource exhaustion that blocks feeding; here memory utilization on node 0 is 0.802, just above the default 0.8 limit, so the feed is blocked no matter how much disk is free.
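If you cannot free memory on the node (or give the container more), the limit itself can be raised in services.xml. A minimal sketch, based on Vespa's resource-limits tuning element and assuming a content cluster with id mycluster (the default memory limit is 0.8):

<content id="mycluster" version="1.0">
  <tuning>
    <resource-limits>
      <memory>0.90</memory>
    </resource-limits>
  </tuning>
  <!-- rest of the content cluster definition -->
</content>

Raising the limit only buys headroom; the durable fix is more memory on the node or a smaller memory footprint for the document set.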

Related

How to configure checkpointing on an XTDB node using AWS S3

I am using XTDB 1.21.0 deployed on AWS/ECS (Fargate) with checkpoints configured (frequency 30 minutes) and stored on an S3 bucket (RocksDB). After a couple of successful checkpoints, they seem to be constantly failing with an XTDB warning due to an exception in the HTTP request to AWS, as shown below:
This leaves the S3 bucket with incomplete checkpoints (i.e., a folder containing a set of SSTs and other RocksDB files, and no associated EDN index file).
The XTDB documentation mentions that an optional S3Configurator can be passed to the node configuration, and after a bit of Googling around I figured that makeClient should be overridden so that connectionAcquisitionTimeout can be set:
NettyNioAsyncHttpClient.builder()
.maxConcurrency(200)
.connectionAcquisitionTimeout(Duration.ofMillis(20000))
I am not too familiar with Netty, so I would appreciate it if someone could help with the right incantation.
Also, I am configuring the XT node from an EDN file and haven't figured out how to write an S3 configurator in an EDN file (or whether it is even possible).
Thanks in advance!
This can happen for large datasets, where the default S3 client creates a new async request for each object (and the number of objects may be very large, particularly if using the RocksDB index). Internally it uses the connectionAcquisitionTimeout as a form of backpressure, to ensure that incoming requests don't wait indefinitely for a connection from the connection pool. However, in this case we are the only source of these requests and we definitely want them to complete before starting the node, so it is reasonable to set connectionAcquisitionTimeout to something very high (the default is only 10 seconds). A good choice of limit is the maximum amount of time you are willing to wait for the node to start before failing.
This appears to be a non-optional parameter of the SDK, for what I can only assume is a sensible default strategy for requests coming from an external source; in our case we essentially want it to behave as if it were a synchronous operation.
Configuring this in Clojure with xtdb would look something like this:
(ns foo.db
  (:require
   [xtdb.api :as xtdb]
   [xtdb.checkpoint]
   [xtdb.rocksdb]
   [xtdb.s3.checkpoint])
  (:import
   (java.time Duration)
   (software.amazon.awssdk.http.nio.netty NettyNioAsyncHttpClient)
   (software.amazon.awssdk.services.s3 S3AsyncClient)
   (xtdb.checkpoint Checkpointer)
   (xtdb.s3 S3Configurator)))

(def s3-configurator
  (reify S3Configurator
    (makeClient [this]
      (.. (S3AsyncClient/builder)
          (httpClientBuilder
           (.. (NettyNioAsyncHttpClient/builder)
               (connectionAcquisitionTimeout
                (Duration/ofSeconds 600)) ;; Set a high limit here
               ;; We can rely on the defaults for maxConcurrency and
               ;; maxPendingConnectionAcquires
               ;; (maxConcurrency (Integer. 200))
               ;; (maxPendingConnectionAcquires (Integer. 10000))
               ))
          (build)))))

(defn start-node!
  []
  (xtdb/start-node
   {:xtdb/index-store
    {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                :db-dir "/var/xtdb/idxs"
                :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                               :store {:xtdb/module 'xtdb.s3.checkpoint/->cp-store
                                       :configurator (constantly s3-configurator)
                                       :bucket "checkpoints"}
                               :approx-frequency "PT3H"}}}}))

Using JanusGraph with Solr

Setting up JanusGraph, I noticed the following in the console:
09:04:12,175 INFO ReflectiveConfigOptionLoader:173 - Loaded and initialized config classes: 10 OK out of 12 attempts in PT0.023S
09:04:12,230 INFO Reflections:224 - Reflections took 28 ms to scan 1 urls, producing 2 keys and 2 values
09:04:12,291 WARN GraphDatabaseConfiguration:1445 - Local setting index.search.index-name=entity (Type: GLOBAL_OFFLINE) is overridden by globally managed value (janusgraph). Use the ManagementSystem interface instead of the local configuration to control this setting.
09:04:12,294 WARN GraphDatabaseConfiguration:1445 - Local setting index.search.backend=solr (Type: GLOBAL_OFFLINE) is overridden by globally managed value (elasticsearch). Use the ManagementSystem interface instead of the local configuration to control this setting.
09:04:12,300 INFO CassandraThriftStoreManager:628 - Closed Thrift connection pooler.
and then I see the following:
Exception in thread "main" java.lang.IllegalArgumentException: Could not instantiate implementation: org.janusgraph.diskstorage.es.ElasticSearchIndex
How do I stop using Elasticsearch and switch to Solr?
My properties file is as follows:
index.search.backend=solr
index.search.directory=/path/to/directory/for/solr/index/something
index.search.index-name=something
index.search.solr.mode=http
index.search.solr.http-urls=http://127.0.0.1:8983/solr
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.25
The answer to this is basically the same as this one for Titan, since JanusGraph was forked from Titan.
You are probably trying to connect to an existing graph that was previously configured to use Elasticsearch. By default, the keyspace is named janusgraph.
1) You could connect to a different keyspace by updating conf/janusgraph-cassandra.properties
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
storage.cassandra.keyspace=mygraph
2) You could drop the existing keyspace. If you used bin/janusgraph.sh start from the quick start directions (which starts a single-node Cassandra and a single-node Elasticsearch), run:
bin/janusgraph.sh clean
Or if you have a standalone Cassandra installation:
$CASSANDRA_HOME/bin/cqlsh -e 'drop keyspace if exists janusgraph'
Then you would be able to connect with the default conf/janusgraph-cassandra.properties.
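The GLOBAL_OFFLINE warnings also point at a third route: changing the stored values through the ManagementSystem instead of the local properties file. This only helps if the graph can still be opened (i.e., Elasticsearch is still reachable); a rough Gremlin console sketch, reusing the property values from your config:

graph = JanusGraphFactory.open('conf/janusgraph-cassandra.properties')
mgmt = graph.openManagement()
mgmt.set('index.search.backend', 'solr')
mgmt.set('index.search.index-name', 'something')
mgmt.commit()
graph.close()

GLOBAL_OFFLINE settings can only be changed when no other instances are open, and the new values take effect after the graph is reopened.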

How to send CSV to a file in streaming mode where I get input in "java.io.PipedInputStream#1bca8e6" format in Mule ESB?

My requirement is: I get an input XML and, based on some condition checking (using choice), I need to send it to 2 different files. As I'm getting big files (100-400 MB), I'm sending them in streaming mode (by enabling streaming in the File and DataMapper components).
It works fine for small input XML (10-20 MB), and with a large input XML the condition checking and XML-to-CSV conversion still work fine, but while writing the CSV data I get this error:
INFO 2015-09-08 12:03:49,227 [[simplebatch_1].simplebatchFlow.stage1.02] org.mule.api.processor.LoggerMessageProcessor: default logger java.io.PipedInputStream#1bca8e6
INFO 2015-09-08 12:03:49,258 [[simplebatch_1].File1.dispatcher.01] org.mule.lifecycle.AbstractLifecycleManager: Initialising: 'File1.dispatcher.29118412'. Object is: FileMessageDispatcher
INFO 2015-09-08 12:03:49,258 [[simplebatch_1].File1.dispatcher.01] org.mule.lifecycle.AbstractLifecycleManager: Starting: 'File1.dispatcher.29118412'. Object is: FileMessageDispatcher
INFO 2015-09-08 12:03:49,258 [[simplebatch_1].File1.dispatcher.01] org.mule.transport.file.FileConnector: Writing file to: D:\MulePOC's\output\myoutput1
ERROR 2015-09-08 12:03:54,999 [XML_READER0_0] org.jetel.graph.Node: java.lang.OutOfMemoryError: Java heap space
ERROR 2015-09-08 12:03:55,000 [WatchDog_0] org.jetel.graph.runtime.WatchDog: Component [XML READER:XML_READER0] finished with status ERROR.
Java heap space
Please advise. Thanks.
You need to increase the JVM memory for Mule. You can find the config file at $MULE_HOME/conf/wrapper.conf.
You will find something like this:
# Increase Permanent Generation Size from default of 64m
# Increase this value if you get "Java.lang.OutOfMemoryError: PermGen space error"
# This property is not used when running java 8 and may cause a warning.
wrapper.java.additional.7=-XX:PermSize=256m
wrapper.java.additional.8=-XX:MaxPermSize=256m
# GC settings
wrapper.java.additional.9=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.10=-XX:+AlwaysPreTouch
wrapper.java.additional.11=-XX:+UseParNewGC
wrapper.java.additional.12=-XX:NewSize=512m
wrapper.java.additional.13=-XX:MaxNewSize=512m
wrapper.java.additional.14=-XX:MaxTenuringThreshold=8
You can change these configurations as you like.
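Since the error in your logs is Java heap space rather than PermGen, the heap settings are the ones that matter most here. A minimal sketch, assuming the standard heap entries in Mule's wrapper.conf (the values are examples; size them to your host and payloads):

# Initial and maximum Java heap size, in MB
wrapper.java.initmemory=1024
wrapper.java.maxmemory=2048

Restart Mule after editing wrapper.conf so the wrapper picks up the new values.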

Set default settings to 'no-cache' on Google Cloud Storage

Is there a way to set all public links to have 'no-cache' in Google Cloud Storage?
I've seen solutions to use gsutil to set the "Cache-Control" upon file-upload, but I'm looking for a more permanent solution.
There was a conversation about providing a cache invalidation feature but I didn't quite follow the reasoning. Any explanations would be greatly appreciated!
it would be difficult to provide a cache invalidation feature because once served with a non-0 cache TTL any cache on the Internet (not just those under Google's control) is allowed (per HTTP spec) to cache the data
Thanks!
For a more permanent one-time-effort solution, with the current offerings on GCP, you can do this with Cloud Functions.
Create a new Function, set the Event type to "On (finalizing/creating) file in the selected bucket" - google.storage.object.finalize. Make sure to select the bucket you want this on. In the body of the function, set the cacheControl / Cache-Control attribute for the blob. The attribute name depends on the language. Here's my version in Python, using cache_control:
main.py:
(Match the function name below to the Entry point.)
from google.cloud import storage

def set_file_uncached(event, context):
    file = event  # auto-generated
    print(f"Processing file: {file=}")  # logging, if you want it
    storage_client = storage.Client()
    # we expect just one with that name
    blob = storage_client.bucket(file["bucket"]).get_blob(file["name"])
    if not blob:
        # in case the blob is deleted before this executes
        print("blob not found")
        return None
    blob.cache_control = "public, max-age=0"  # or whatever you need
    blob.patch()
requirements.txt
google-cloud-storage
From the logs: Function execution took 1712 ms, finished with status: 'ok'. This could have been faster, but I've set minimum instances to 0, so it needs to spin up for each upload. Depending on your usage and cost constraints, you can set it to 1 or something higher.
Other settings:
Retry on failure: No/False
Region: [wherever your bucket is]
Memory allocated: 128 MB (smallest available currently)
Timeout: 5 seconds (smallest available currently, function shouldn't take longer)
Minimum instances: 0
Maximum instances: 1
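Note that the function only covers objects uploaded after it is deployed. For objects already in the bucket, the gsutil route mentioned in the question can be applied in bulk; a rough sketch (the bucket name is a placeholder):

gsutil -m setmeta -h "Cache-Control:no-cache" gs://your-bucket/**

The -m flag parallelizes the metadata updates, which helps when the bucket holds many objects.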

ASIHTTPRequest misses files

I'm using ASIHTTPRequest to download a list of 30 files, but 2 or 3 (different ones each time) are always lost.
Is it possible to set the maximum number of connections per second? I've tried:
- [[ASIHTTPRequest sharedQueue] setMaxConcurrentOperationCount:1];
- [cola setMaxConcurrentOperationCount:1];
But I don't have any luck...
Any help?
Thank you
I've solved this problem with:
[request setPersistentConnectionTimeoutSeconds:80];
[request setShouldAttemptPersistentConnection:NO];
The problem may be that the installed Apache doesn't support persistent connections.
See the "Configuring persistent connections" section in http://allseeing-i.com/ASIHTTPRequest/How-to-use for more info.
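Put together with the shared queue from the question, the download setup might look roughly like this (fileURLs and downloadDir are placeholders for your own list of URLs and target directory):

NSOperationQueue *queue = [ASIHTTPRequest sharedQueue];
[queue setMaxConcurrentOperationCount:1];

for (NSURL *url in fileURLs) {
    ASIHTTPRequest *request = [ASIHTTPRequest requestWithURL:url];
    [request setPersistentConnectionTimeoutSeconds:80];
    [request setShouldAttemptPersistentConnection:NO];
    [request setDownloadDestinationPath:
        [downloadDir stringByAppendingPathComponent:[url lastPathComponent]]];
    [queue addOperation:request];
}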
