How to generate dynamic path in dataset during the output method - apache-flink

Is there a way to create a dynamic DataSink output path in Flink?
The DataSet has the data type Tuple2<String, String>.
When we used streaming, I could generate a dynamic path with a custom Bucketer like the one below:
@Override
public Path getBucketPath(Clock clock, Path basePath, Tuple2<String, String> element) {
    return new Path(basePath + "/schema=" + element.f0.toLowerCase().trim() + "/");
}
I would like to know whether there is a similar way to generate a custom path for a DataSet.

I poked around a bit and didn't find anything similar for batch processing. So I think you'd have to create your own OutputFormat class that wraps a regular FileOutputFormat and does the bucketing, using the same Bucketer interface.
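To make the bucketing part concrete, here is a minimal sketch in plain Java with no Flink classes involved. It only shows how records could be routed to per-schema paths using the same path logic as the Bucketer in the question. The class name BucketingSketch, the method names, and the String[] record shape are all made up for illustration; a real wrapper OutputFormat would open one underlying writer per path instead of collecting values in a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink.
class BucketingSketch {

    // Same path rule as the streaming Bucketer: one sub-directory per schema value.
    static String bucketPath(String basePath, String schema) {
        return basePath + "/schema=" + schema.toLowerCase().trim() + "/";
    }

    // Groups (schema, payload) pairs by the bucket path they would be written to.
    // record[0] is the schema (f0 of the Tuple2), record[1] is the payload (f1).
    static Map<String, List<String>> groupByBucket(String basePath, List<String[]> records) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String[] record : records) {
            String path = bucketPath(basePath, record[0]);
            buckets.computeIfAbsent(path, p -> new ArrayList<>()).add(record[1]);
        }
        return buckets;
    }
}
```

In a wrapping OutputFormat, the map lookup would happen in writeRecord, and every writer opened along the way would be closed in close.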

Related

Create PDF file from Text String or HTML String

My codenameone-app is producing some data, which I like to be able to summarize in a PDF-file for documentation purposes.
Would it be possible to either use a Java library as a cn1 library, or to use a web service that converts an HTML String into a PDF file, like this one?
https://www.html2pdfrocket.com/convert-android-html-to-pdf
Maybe someone else already figured out a best-practice for this.
Thanks a lot!
There is no current builtin solution for that, it should be easy enough to wrap native libs or maybe even port a JavaSE lib that does that. Most of the developers who do something like this use a server side process to generate the PDF.
After giving it a quick-and-dirty try with html2pdfrocket - instead of using or porting a java library - I was simply amazed by the simplicity of the possibility with codenameone. I wasn't expecting this to be so easy AT ALL.
This class and its single method are all you need to save the PDF file to FileSystemStorage.
import com.codename1.io.FileSystemStorage;
import com.codename1.io.Util;

public class PDFHandler {
    private final static String URL = "http://api.html2pdfrocket.com/pdf";
    private final static String APIKEY = "<YOURAPI-KEY>";

    /**
     * Stores the given HTML String or URL to Storage under the given filename.
     * @param value URL or HTML; add quotes if it contains spaces, and use single quotes instead of double quotes
     * @param filename name of the file to create in the app home path
     */
    public void getFile(String value, String filename) {
        // Validate parameters
        if (value == null || value.length() < 1)
            return;
        if (filename == null || filename.length() < 1)
            return;
        // URL-encode the value so it can be passed as a query parameter
        value = Util.encodeUrl(value);
        String fullPathToFile = FileSystemStorage.getInstance().getAppHomePath() + filename;
        // Download the generated PDF in the background straight into the file
        Util.downloadUrlToFileSystemInBackground(URL + "?apikey=" + APIKEY + "&value=" + value, fullPathToFile);
    }
}
I hope this helps some other codenameone-newbie!

Save result from Objectify in human readable form in datastore

I am trying to create an Eventlog (ORMSLOG in example), that saves events in human readable form in Datastore.
Doing this should write readable event:
List<Device> devices = ofy().transactionless().load().type(Device.class).list();
ORMSLOG.log(ORMSLOG.GET_ALL_DEVICES, "Devices found: " + String.valueOf(devices));
The ORMSLOG is a simple class.
public class ORMSLOG {
public final static String CREATE_DEVICE = "Create Device";
public final static String GET_ALL_DEVICES = "Get all Devices";
public static void log(final String event, final String data) {
ofy().save().entity(new Event(event, data)).now();
}
}
But the data saved in Datastore is not readable and looks like this:
[screenshot: ORMSLOG data]
I need to transform the reference to the object into human readable text.
You are just logging the String representation of the objects, which is produced by calling the toString method. Since you did not override toString in the Device class, you get the default from Object: the class name followed by @ and the hash code in hex, not the object's state. If you override toString in your Device class to return whatever state you want to see, you will get a much better result. Most IDEs (e.g. Eclipse) have an option to generate the toString method for you.
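For example, a minimal sketch of such an override. The fields id and name are assumptions made for illustration, since the question does not show what Device actually contains, and the Objectify annotations are left out.

```java
// Hypothetical Device; the real entity's fields are not shown in the question.
class Device {
    private final Long id;      // assumed field
    private final String name;  // assumed field

    Device(Long id, String name) {
        this.id = id;
        this.name = name;
    }

    // Returns the state that should appear in the event log.
    @Override
    public String toString() {
        return "Device{id=" + id + ", name='" + name + "'}";
    }
}
```

With this in place, String.valueOf(devices) on a list of devices yields the bracketed, comma-separated toString output of each element instead of the class name and hash code.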

Apache Camel - java DSL - transform body to one of its fields

First, I'm fairly new to Camel so if what (or how) I'm trying to do here is dumb, let me know.
CODE:
from("direct:one")
    .to("mock:two")
    .process(new Processor() {
        @Override
        public void process(Exchange exchange) throws Exception {
            MyCustomObject obj = exchange.getIn().getBody(MyCustomObject.class);
            exchange.getOut().setBody(obj.getOneOfTheFields());
        }
    })
    .to("mock:three");
QUESTION:
This processor transforms an object into one of its fields. I know that I could replace it with a simple expression, but that would require me to put 'oneOfTheFields' in a string, and I don't want to do that.
Is there a shorter way to do this using java code only?
This can be easily achieved using setBody and Camel simple:
from("direct:one")
.to("mock:two")
.setBody(simple("${body.fieldName}"))
.to("mock:three");
You specify the name of the field, and Camel will use the standard accessor mechanism to set the body appropriately.
Can you not simply do this:
from("direct:one")
.to("mock:two")
.setBody(body().getOneOfTheFields())
.to("mock:three");
Let me know if this works.

solr: read stopword.txt in Custom Handler

I want to read stopword.txt in my custom handler. How can I do that? I know that it is used in filtering and can be read from there, but I need to read that list in my custom UpdateRequestProcessorFactory. Also, can I read any other custom file that I created?
I was aware of that limitation; I overlooked that you are asking about an update processor.
I looked into the code, and here is an existing class you can use as an example. SolrCoreAware is the interface you are after.
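public class StatelessScriptUpdateProcessorFactory extends UpdateRequestProcessorFactory implements SolrCoreAware {
    private ResourceLoader resourceLoader;
    // ... rest of the factory omitted ...
    @Override
    public void inform(SolrCore core) {
        // Called once the core is available; keep the loader to open files under conf later
        resourceLoader = core.getResourceLoader();
    }
}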
Classes that implement org.apache.lucene.analysis.util.ResourceLoaderAware can read files under the conf directory. However, what is your use case anyway?
This looks like an XY problem.
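Once you hold the resource loader, reading the file itself is plain Java. Below is a minimal sketch that parses a stopword list from an InputStream. It assumes one word per line, with blank lines and lines starting with # ignored, as in the stock stopwords.txt. The class name StopwordFileReader is made up; in the factory, the stream would presumably come from something like resourceLoader.openResource("stopwords.txt"), which is not exercised here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Solr or Lucene.
class StopwordFileReader {

    // Reads one stopword per line, skipping blank lines and '#' comment lines.
    static Set<String> read(InputStream in) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty() || word.startsWith("#")) {
                    continue;
                }
                words.add(word);
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers such as inform() need no throws clause
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

The same method works for any other custom file placed under conf, as long as it follows the one-entry-per-line layout.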

Training own model in opennlp

I am finding it difficult to create my own model in OpenNLP.
Can anyone tell me how to create my own model?
How should the training be done?
What should the input be, and where will the output model file be stored?
https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html
This website is very useful, shows both in code, and using the OpenNLP application to train models for all different types, like entity extraction and part of speech etc.
I could give you some code examples in here, but the page is very clear to use.
Theory-wise:
Essentially you create a file which lists the stuff you want to train
eg.
Sport [whitespace] this is a page about football, rugby and stuff
Politics [whitespace] this is a page about tony blair being prime minister.
The format is described on the page above (each model expects a different format). Once you have created this file, you run it through either the API or the opennlp application (via the command line), and it generates a .bin file. Once you have this .bin file, you can load it into a model and start using it (as per the API described on the above website).
First you need to train the data with the required Entity.
Sentences should be separated by a newline character (\n). Values should be separated from the <START:...> and <END> tags by a space character.
Let's say you want to create medicine entity model, so data should be something like this:
<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and
<START:medicine> potassium clavulanate <END>. They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.
You can refer to a sample dataset for an example. The training data should contain at least 15000 sentences to get good results.
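If you generate the training file from existing text, the tagging convention can be produced by a small helper. This is only a sketch: the class and method names are made up, it is not part of OpenNLP, and the plain String.replace it uses would also match the entity inside longer words, so real data needs proper tokenization.

```java
// Hypothetical helper for building annotated training lines.
class NameSampleFormatter {

    // Wraps an entity in the tags, with a space on each side of the value.
    static String tag(String type, String entity) {
        return "<START:" + type + "> " + entity + " <END>";
    }

    // Replaces every literal occurrence of the entity in the sentence with its tagged form.
    static String annotate(String sentence, String type, String entity) {
        return sentence.replace(entity, tag(type, entity));
    }
}
```

Each annotated sentence would then be written on its own line of the training file.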
Next, you can use the OpenNLP TokenNameFinderTrainer.
The output file will be in the .bin format.
Here is an example: Writing a custom NameFinder model in OpenNLP
For more details, refer to the OpenNLP documentation.
Perhaps this article will help you out. It describes how to do TokenNameFinder training from data extracted from Wikipedia...
nuxeo - blog - Mining Wikipedia with Hadoop and Pig for Natural Language Processing
Copy the training data into data.txt and run the code below to get your own mymodel.bin.
For sample data, see https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt
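import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Written against the OpenNLP 1.5.x API; later releases changed these signatures.
public class Training {
    // where the trained model is written
    static String onlpModelPath = "mymodel.bin";
    // training data set
    static String trainingDataFilePath = "data.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;
        try {
            // Alternative overload with explicit iterations (100) and cutoff (4):
            // model = NameFinderME.train("en", "drugs", sampleStream, Collections.<String, Object>emptyMap(), 100, 4);
            model = NameFinderME.train("en", "drugs", sampleStream, Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }
        // Serialize the trained model to mymodel.bin
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}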
