How to pipe the complete graph to Giraph through the TinkerPop 3 stack?

I have a graph with different types of nodes and relationships, and each type of node has 3-4 properties. For testing purposes on HDFS, I'm storing this graph in a GraphSON file. Now I want to analyze this graph using Giraph. I've explored Giraph's IO classes and also found that Gremlin can directly load GraphSON. Could you please explain how to load the graph into Giraph using the TinkerPop stack?

See the Giraph sample in the docs; it does almost exactly what you're looking for. Instead of hadoop-gryo.properties, use hadoop-graphson.properties (and of course adjust the input location setting, gremlin.hadoop.inputLocation, in the configuration file).
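For reference, a sketch of what such a configuration might look like, modeled on the hadoop-graphson.properties file shipped with recent TinkerPop releases (the exact keys can vary by version, and the input path here is illustrative):

    # HadoopGraph reading GraphSON input and writing GraphSON output
    gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
    gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
    gremlin.hadoop.inputLocation=my-graph.json
    gremlin.hadoop.outputLocation=output
    gremlin.hadoop.jarsInDistributedCache=true

With that in place, you can open the graph in the Gremlin console via GraphFactory.open('conf/hadoop/hadoop-graphson.properties') and hand computations to Giraph through the GiraphGraphComputer.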

Related

Sagemaker Inference Endpoint with fitted Encoder

Since reading documentation and blog posts hasn't gotten me anywhere, I'll ask over here:
I want to deploy a SageMaker endpoint backed by a PipelineModel built from a fitted SageMaker Pipeline. This PipelineModel should consist of two models: a fitted model which encodes my data and a model which predicts with an XGBoost estimator. I am following along with this documentation, but the example doesn't show how to integrate the fitted preprocessor model in a pipeline step. What step do I have to use? A TrainingStep? Thanks in advance. I am desperate.
Check out this official example: Train, register, and deploy a pipeline model.
There are two variations to keep in mind:
For models that need training (usually those based on TensorFlow/PyTorch), a TrainingStep must be used so that the output (the model artifact) is correctly (and automatically) generated, with the ability to use it later for inference.
For models produced by a simple fitting on the data (e.g., a scaler with sklearn), you could create a TrainingStep in disguise (an extra component in the pipeline; not really the correct thing to do, but a workaround that works). The more correct method is to configure the preprocessing script so that it internally saves a model.tar.gz file with the necessary files (e.g., pickle or joblib objects) inside; that archive can then be properly used in later steps as model_data. In fact, once you have a model.tar.gz, you can define a Model of various types (e.g., an SKLearnModel) that is already fitted.
At this point, you define your PipelineModel with the trained/fitted models and can either proceed to direct endpoint deployment or go through the model registry for a more robust approach. A minimal sketch of that final wiring follows.
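Assuming the preprocessor's model.tar.gz was produced as described above (the bucket paths, entry-point script names, and framework versions below are placeholders):

    from sagemaker import get_execution_role
    from sagemaker.sklearn.model import SKLearnModel
    from sagemaker.xgboost.model import XGBoostModel
    from sagemaker.pipeline import PipelineModel

    role = get_execution_role()

    # Already-fitted preprocessor, packaged as model.tar.gz by the preprocessing script
    preprocessor = SKLearnModel(
        model_data="s3://my-bucket/preprocessor/model.tar.gz",
        role=role,
        entry_point="preprocessor_inference.py",  # implements model_fn / input_fn / predict_fn
        framework_version="1.2-1",
    )

    # Trained XGBoost model (artifact produced by the TrainingStep)
    xgb = XGBoostModel(
        model_data="s3://my-bucket/xgboost/model.tar.gz",
        role=role,
        entry_point="xgb_inference.py",
        framework_version="1.7-1",
    )

    # Chain them: at inference time each request flows through the models in order
    pipeline_model = PipelineModel(
        name="encode-then-predict",
        role=role,
        models=[preprocessor, xgb],
    )
    pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")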

Is there a way to show a PDF in its original structure during human review for custom entity labelling in AWS SageMaker?

I have modified this sample to read PDFs in tabular format. I would like to keep the tabular structure of the original PDF during the human review process. I notice the custom worker task template uses the crowd-entity-annotation element, which seems to handle plain text only. I am aware that the human review process reads from an S3 key containing raw text written by the Textract process.
I have been considering writing to S3 using tabulate, but I don't think that is the best solution. I would like to keep the structure and still be able to annotate custom entities.
Comprehend now natively supports detecting custom-defined entities in PDF documents. To do so, you can try the following steps:
Follow this GitHub readme to start the annotation process for PDF documents.
Once the annotations are produced, use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
Once the entity recognizer is trained, you can use the StartEntitiesDetectionJob API to run inference on PDF documents (a sketch of both calls follows).
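A minimal boto3 sketch of those two API calls (the role ARN, bucket paths, and entity type are placeholders):

    import boto3

    comprehend = boto3.client("comprehend")

    # Train a custom entity recognizer from the PDF annotations
    response = comprehend.create_entity_recognizer(
        RecognizerName="my-pdf-entity-recognizer",
        LanguageCode="en",
        DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
        InputDataConfig={
            "EntityTypes": [{"Type": "MY_ENTITY"}],
            "Documents": {"S3Uri": "s3://my-bucket/docs/", "InputFormat": "ONE_DOC_PER_FILE"},
            "Annotations": {"S3Uri": "s3://my-bucket/annotations/"},
        },
    )
    recognizer_arn = response["EntityRecognizerArn"]

    # Training is asynchronous; wait until the recognizer status is TRAINED,
    # then run inference on new PDF documents
    comprehend.start_entities_detection_job(
        JobName="pdf-entity-detection",
        EntityRecognizerArn=recognizer_arn,
        DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
        LanguageCode="en",
        InputDataConfig={"S3Uri": "s3://my-bucket/new-docs/", "InputFormat": "ONE_DOC_PER_FILE"},
        OutputDataConfig={"S3Uri": "s3://my-bucket/inference-output/"},
    )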

Generate a series of documents based on SQL table

I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do, and I am trying to see if there is a platform/application that already exists for such a task.
For example:
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1,5,7,8
Person 2 is due for document #: 2,6
Person 3 is due for document #: 7,8,10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each person, including basic demographic info like name, DOB, etc.
Like I said at the top, I already have all of the needed information in a database; I am just trying to figure out the best approach to move that information onto a document.
I have done some research and found that some people have used mail merge in Word or Access as a front end, but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated.
If I understand your problem correctly, it is two-fold: firstly, you need to find a way to generate documents based on data (mail merge), and secondly, you might need to print them too.
For document generation you have two basic approaches: template-based and programmatically from scratch. I suppose you will opt for a template-based approach, which basically means that you design (in MS Word) a template document (Word, RTF, ...) containing placeholders and other tags that designate the »dynamic« parts of the document. Then, at document-generation time, you need a .NET library/processor to which you pass this template document and the data; the processor populates the template with the data and returns the resulting document.
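To make the template-based idea concrete, here is a small illustration in Python rather than .NET, purely to show the shape of the workflow (it uses the docx-mailmerge package; the template path and field names are made up):

    # pip install docx-mailmerge
    # template.docx is a Word document containing MERGEFIELDs: name, dob, document_title
    from mailmerge import MailMerge

    document = MailMerge("template.docx")
    print(document.get_merge_fields())  # e.g. {'name', 'dob', 'document_title'}

    # Populate the placeholders with one person's data and save the result
    document.merge(
        name="Person 1",
        dob="1980-01-01",
        document_title="Document #1",
    )
    document.write("person1_document1.docx")

A .NET processor works the same way conceptually: load the template, bind the data, write out the populated document.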
One way to achieve this functionality would be to employ MS Word's native mail merge, but you should know that this involves using Office COM and Word application automation, which should almost always be avoided.
Another option is to build such a system on top of the Open XML SDK. This is a valid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail merge out of the box – been there, done that. But of course, the good side here is that you will be able to tailor the solution to your needs. If you go down this road, I recommend that you use Content Controls for tagging documents/templates. The solution with CCs will be much easier to implement than a solution based on bookmarks.
I'm not very familiar with the open source solutions, and I'm not sure how many there are that can do mail merge. One I know of is FlexDoc (on CodePlex), but its problem is that it uses a construct (XmlControl) for tagging that is deprecated in Word 2010+.
Then there are commercial solutions. Again, I don't know them in detail, but I know that the majority are general-purpose document processing libraries. Our company has been using this document generation toolkit for some time now, and I can say it covers all our »template-based document generation« needs. It doesn't require MS Word at doc-generation time, has a really helpful add-in for MS Word, and you only need several lines of code to integrate it into your project. Templating is very powerful, and you can set up a template in a very short time. While templates are Word documents, you can generate PDF or XPS docs as well. XPS is useful because you can use the .NET/WPF printing framework, which works with XPS docs, to print documents. This is a very high-end solution, but of course, the downside is that it is not free.

Where does Jackrabbit store its tree information?

I am a newbie to Jackrabbit, and I wonder where Jackrabbit stores its tree information. I want to access the tree information in real time, even after I restart my program; it should behave as if the tree information were stored permanently in the file system. But right now, if I stop my program, all the Jackrabbit tree information is lost. How can I solve this problem?
Thanks!
Jackrabbit can use different PersistenceManagers to store the content. If you're using a TransientRepository, for example, the content isn't persisted, as that is meant for testing only.
See http://wiki.apache.org/jackrabbit/PersistenceManagerFAQ for more info.
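For a repository that survives restarts, point Jackrabbit at a fixed repository home directory and configure a persistent PersistenceManager in repository.xml. As a sketch, the workspace section of the stock Jackrabbit 2.x configuration uses an embedded Derby database stored under the workspace home:

    <!-- inside the <Workspace> element of repository.xml -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
      <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
      <param name="schemaObjectPrefix" value="${wsp.name}_"/>
    </PersistenceManager>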

DB objects relations visualization

I'm no guru in DBA matters, so I'll try to explain what I want in the terms I imagine it.
I have an Oracle DB with network devices. Each device has ports, and each port has a parent device/port.
I want a tool that will automatically create a visual map of these device relations,
i.e. build a "network map" based on them.
It would be better if the tool had output ready for web publishing, or were web-based from the beginning. It would also be nice if it automatically updated the "picture" as soon as I add a new relation/object.
From afar it looks something like Gource: http://youtu.be/E5xPMW5fg48
But that's not exactly what I need.
Hope to get some suggestions.
Thanks in advance!
UPD: found another tool: Gephi.
You could try Graphviz. It was created specifically for visualising large graphs of network nodes.
It's not out of the box; you'll have to write some code that:
Reads data on the devices and their relationships.
Creates the Graphviz input (DOT) file.
Generates the diagram by calling the Graphviz binary.
There are many ways to do that. One of the easiest is to use Python with the pydot library; a minimal sketch follows.
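For instance (the device rows are hard-coded here, but in practice you would read the parent/child pairs straight from the Oracle DB, e.g. with cx_Oracle):

    # pip install pydot  (also requires the Graphviz binaries on the PATH)
    import pydot

    # (parent, child) pairs as they might come back from a query
    relations = [
        ("router-01", "switch-01"),
        ("router-01", "switch-02"),
        ("switch-01", "host-a"),
    ]

    graph = pydot.Dot("network_map", graph_type="digraph", rankdir="LR")
    for parent, child in relations:
        graph.add_node(pydot.Node(parent, shape="box"))
        graph.add_node(pydot.Node(child, shape="box"))
        graph.add_edge(pydot.Edge(parent, child))

    # Writes a static image; rerun this whenever a relation/object is added
    graph.write_png("network_map.png")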
Note that Graphviz generates static images (JPEG/TIFF etc.), so you'd have to regenerate on demand.
There are more interactive toolkits available, e.g. Protovis/InfoVis. Both are JavaScript-based and render directly in the browser.
hth.
