Apache Flink - Dataset api - Side outputs

Does Flink support the side outputs feature in the DataSet (batch) API? If not, how should valid and invalid records be handled when loading from a file?

You can always do something like this:
DataSet<EventOrInvalidRecord> goodAndBadTogether = input.map(new CreateObjectIfPossible());
goodAndBadTogether.filter(new KeepOnlyGood())...
goodAndBadTogether.filter(new KeepOnlyBad())...
Another reasonable option in some cases is to go ahead and use the DataStream API, even if you don't have streaming sources.
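The map-then-filter idea above can be illustrated without Flink at all. The following is a minimal plain-Java sketch that uses java.util.stream in place of a DataSet; ParsedRecord, SplitRecords, and the "a valid record is an integer" rule are hypothetical stand-ins for EventOrInvalidRecord, CreateObjectIfPossible, KeepOnlyGood, and KeepOnlyBad, not Flink API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical wrapper: holds either a parsed value or only the raw line that failed to parse.
final class ParsedRecord {
    final Integer value;   // non-null only when parsing succeeded
    final String rawLine;  // always the original input line

    ParsedRecord(Integer value, String rawLine) {
        this.value = value;
        this.rawLine = rawLine;
    }

    boolean isValid() {
        return value != null;
    }

    // Stand-in for CreateObjectIfPossible: never throws, records the failure instead.
    static ParsedRecord parse(String line) {
        try {
            return new ParsedRecord(Integer.valueOf(line.trim()), line);
        } catch (NumberFormatException e) {
            return new ParsedRecord(null, line);
        }
    }
}

final class SplitRecords {
    // The "map" step: every input line becomes a wrapper, valid or not.
    static List<ParsedRecord> parseAll(List<String> lines) {
        return lines.stream().map(ParsedRecord::parse).collect(Collectors.toList());
    }

    // The "keep only good" filter.
    static List<Integer> good(List<ParsedRecord> all) {
        return all.stream().filter(ParsedRecord::isValid)
                .map(r -> r.value).collect(Collectors.toList());
    }

    // The "keep only bad" filter over the same intermediate result.
    static List<String> bad(List<ParsedRecord> all) {
        return all.stream().filter(r -> !r.isValid())
                .map(r -> r.rawLine).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ParsedRecord> all = parseAll(Arrays.asList("1", " 2 ", "x", ""));
        System.out.println("good: " + good(all));
        System.out.println("bad:  " + bad(all));
    }
}
```

The point of the wrapper is that parsing happens once and both filters read the same intermediate result, which is the role goodAndBadTogether plays in the snippet above.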

Related

How to send custom DocumentOperation to DocumentProcessing pipeline from a Processor?

Scenario: I've been stuck on this for way too long, and I think the solution might be easy but I just can't see it. This is the scenario:
cURL POST to http://localhost:8080/my_imports (raw JSON data on body)
->
MyImportsCustomHandler (extends ThreadedHttpRequestHandler) [Validations]
->
MyObjectProcessor (extends Processor) [JSON deserialize and data massage]
->
MyFirstDocumentProcessor (extends DocumentProcessor) [Set some fields and save]
The problem is that execution never reaches MyFirstDocumentProcessor, likely because the request didn't start from the document_api endpoints (intentionally).
No errors are thrown; the processing route just never reaches the document processor chain. I think it should, because in MyObjectProcessor I'm doing:
DocumentType type =
localDocHandler.getDocumentTypeManager().getDocumentType("my_doc");
DocumentId id = new DocumentId("id:default:my_doc::2");
Document document = new Document(type, id);
DocumentPut docPut = new DocumentPut(document);
Processing proc = com.yahoo.docproc.Processing.of(docPut);
I got this idea from here: https://github.com/vespa-engine/vespa/blob/master/docproc/src/test/java/com/yahoo/docproc/util/SplitterJoinerTestCase.java
but in that test I see the line splitter.process(p);, for which I can't find a suitable replacement that works inside a Processor. In that context I only have the Request, Execution and DocumentProcessingHandler.
I hope somebody versed in Vespa can shed some light on this. It's just the last hop in the processing chain that I can't bridge :|

To write documents from Java code, you need to use the Document Access API:
http://docs.vespa.ai/documentation/document-api-guide.html#document-access
A working solution is in https://github.com/vespa-engine/sample-apps/pull/44

Why aren't my queries and batch gets executed in parallel?

Based on the documentation for Objectify and Google Cloud Datastore, I would expect the queries and the batch loads in the following code to execute in parallel:
List<Iterable<Key<MyType>>> results = new ArrayList<>();
for (...) {
results.add(ofy().load()
.type(MyType.class)
.filter(...)
.keys()
.iterable());
}
...
Iterable<Key<MyType>> keys = ...;
Collection<MyType> c = ofy().load().keys(keys).values();
But the trace makes it look like each query and each entity load executes in sequence:
What gives?

It looks like this only happens when doing a cached get from Memcache. With similar code I see the expected async behavior for datastore_v3.Get/Put/Delete:
It seems the reason for this is that Objectify doesn't use AsyncMemcacheService. Indeed, there is an open issue for this on the project page, and this can also be confirmed by checking out the source and doing a grep -r AsyncMemcacheService.
Regarding the serial datastore_v3.RunQuery calls, calls to ofy().load().type(...).filter(...).iterable() are 'asynchronous' in that they return immediately, however the actual Datastore queries themselves get executed serially as the App Engine Datastore API doesn't expose an explicitly async API for queries.

Make a solr query from Geotools through geoserver

As the title says, I am trying to run a query from GeoTools (through GeoServer) to get features from a Solr index.
To be more precise:
I saw in the GeoServer user manual that I can query Solr over HTTP like this:
http://localhost:8080/geoserver/wfs?service=WFS&version=1.1.0&request=GetFeature
&typeName=mySolrLayer
&format="xxx"
&viewparams=q:"mySolrQuery"
The important part of this URL is the viewparams parameter, which I want to use directly from GeoTools.
I have already tested this case (this is part of my code):
url = new URL(
"http://localhost:8080/geoserver/wfs?request=GetCapabilities&VERSION=1.1.0"
);
Map<String, Object> params = new HashMap<>();
params.put(WFSDataStoreFactory.URL.key, url);
params.put("viewparams", "q:myquery");
Hints hints = new Hints();
hints.put(Hints.VIRTUAL_TABLE_PARAMETERS, viewParams);
query.setHints(hints);
...
featureSource.getFeatures(query);
But this doesn't seem to work: the URL sent to GeoServer is a normal GetFeature request without the viewparams parameter.
I tried this with geotools-12.2, geotools-13.2 and geotools-15-SNAPSHOT, but I couldn't get the query passed through. GeoServer sends me all the features in my database and doesn't take "viewparams" as a parameter.
I need to do it this way because the query actually comes from another program, and I would like to pass it easily to another part of the project.
Can someone help me?

There doesn't currently seem to be a way to do this in the GeoTools WFSDataStore implementations, because the GetFeature request is constructed from the URL provided by the GetCapabilities document. This is what the standard requires, but it may be worth filing an enhancement request to allow clients to override this string (as QGIS does, for example). That would let you specify the additional parameter in your base URL, which would then be passed to the server as you need.
Unfortunately, the WFS module lives in Unsupported land at present, so unless you have the resources to work on this issue yourself and can provide a PR to implement it, there is not a great chance of it being implemented.

AppEngine - Optimize read/write count on POST request

I need to optimize the read/write count for a POST request that I'm using.
Some info about the request:
The user sends a JSON array of ~100 items
The servlet needs to check if any of the received items is newer than its counterpart in the datastore, using a single long attribute
I'm using JDO
What I currently do is (pseudocode):
foreach(item : json.items) {
storedItem = persistenceManager.getObjectById(item.key);
if(item.long > storedItem.long) {
// Update storedItem
}
}
Which obviously results in ~100 read requests per request.
What is the best way to reduce the read count for this logic? Using a JDO Query? I read that "IN" queries simply result in multiple queries executed one after another, so I don't think that would help me :(
There also is PersistenceManager.getObjectsById(Collection). Does that help in any way? Can't find any documentation of how many requests this will issue.

I think you can use the call below to do a batch get:
Query q = pm.newQuery("select from " + Content.class.getName() + " where contentKey == :contentKeys");
A query like the one above should return all the objects you need.
You can handle the rest from there.

Best bet is
pm.getObjectsById(ids);
since that is intended for getting multiple objects in a call (particularly since you have the ids, hence keys). Certainly current code (2.0.1 and later) ought to do a single datastore call for getEntities(). See this issue
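The overall shape of the batch approach can be sketched in plain Java as follows. This is only an illustration under stated assumptions: InMemoryStore is a hypothetical stand-in for the PersistenceManager (its getAll plays the role of pm.getObjectsById(ids) and counts calls), Item and its version field stand in for the stored entity and its long attribute, and treating a missing stored item as "newer" is an assumption the question does not address.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical entity: a key plus the single long attribute used for comparison.
final class Item {
    final String key;
    final long version;

    Item(String key, long version) {
        this.key = key;
        this.version = version;
    }
}

// Hypothetical stand-in for the datastore; counts how many batch reads were issued.
final class InMemoryStore {
    private final Map<String, Item> data = new HashMap<>();
    int batchReads = 0;

    void put(Item item) {
        data.put(item.key, item);
    }

    // One call for many keys, analogous in spirit to pm.getObjectsById(ids).
    Map<String, Item> getAll(Collection<String> keys) {
        batchReads++;
        Map<String, Item> found = new HashMap<>();
        for (String k : keys) {
            Item item = data.get(k);
            if (item != null) {
                found.put(k, item);
            }
        }
        return found;
    }
}

final class BatchUpdate {
    // Returns the incoming items that are newer than their stored counterpart.
    // Assumption: an item with no stored counterpart is also returned.
    static List<Item> findNewer(InMemoryStore store, List<Item> incoming) {
        List<String> keys = new ArrayList<>();
        for (Item i : incoming) {
            keys.add(i.key);
        }
        Map<String, Item> stored = store.getAll(keys); // single batch read
        List<Item> newer = new ArrayList<>();
        for (Item i : incoming) {
            Item s = stored.get(i.key);
            if (s == null || i.version > s.version) {
                newer.add(i);
            }
        }
        return newer;
    }
}
```

The comparison loop is unchanged from the per-item version in the question; only the lookup moves out of the loop, so the number of read calls no longer grows with the number of items.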

Fiddler: is it possible to compress/gzip the request body?

Great tool, does everything I need. Love its Transform tab that allows compression of the response. But what about request? Seems like a simple thing but I don't see that functionality. Am I missing something?
Fiddler Web Debugger, V2.3.4.4.

You can write a bit of script to compress the request body. Click Rules > Customize Rules, and add something like this:
static function OnBeforeRequest(oSession: Session){
    if (oSession.requestBodyBytes != null && oSession.requestBodyBytes.Length>0){
        oSession.requestBodyBytes = Utilities.GzipCompress(oSession.requestBodyBytes);
        // Request headers are set through oSession.oRequest; oSession["..."] sets session flags.
        oSession.oRequest["Content-Length"] = oSession.requestBodyBytes.Length.ToString();
        oSession.oRequest["Content-Encoding"] = "gzip";
    }
}
However, I'm not aware of any servers that actually support compressed requests. There's no good way for a server to signal that it supports compressed requests, and Zip Bomb attacks are a real threat for servers.
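For comparison, here is a minimal Java sketch of the same transformation done on the client side, together with the inverse a receiving server would have to perform. It uses only java.util.zip from the standard library; the GzipBody class and the maxBytes cap are hypothetical, with the cap included only to illustrate one way a receiver might guard against the zip bomb problem mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBody {
    // Compresses a request body; the caller would also send "Content-Encoding: gzip"
    // and a Content-Length equal to the length of the returned array.
    static byte[] compress(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompresses a body, refusing to produce more than maxBytes of output.
    // The cap is a hypothetical safeguard, not part of any standard.
    static byte[] decompress(byte[] compressed, int maxBytes) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IllegalStateException("decompressed body exceeds " + maxBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether a given server accepts such a request at all still depends on the server, as noted above.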