Content Validation in JCR - jackrabbit

We are evaluating a few technologies to build a repository of the WSDLs and XSDs used within our organization. One of the options is Apache Jackrabbit, which implements JCR 1.0 and 2.0. It almost meets our expectations for uploading content, authentication and versioning. However, we are also planning to upload several pieces of metadata (e.g., createdBy, lastModifiedBy, lastModifiedTime) along with the WSDLs and XSDs. We have read through several posts on StackOverflow, the JCR specs and the wiki pages on the Jackrabbit website, but did not quite understand how the metadata we are uploading can be validated. For example, if we upload the metadata as XML-formatted content, we want the repository to validate that XML against an XML schema.
In terms of the JCR API, is there a way to enable validation of XML while importing it through Session.importXML?

As Randall says, the JCR API doesn't provide hooks to validate content while you're storing it.
One common pattern is to upload data to an intermediate location in the JCR tree, say /incoming, and have JCR observers watch this incoming data, validate it and move it to its final location if valid.
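As a rough sketch of that pattern (the /incoming and /approved paths and the validation logic are placeholders, not part of any particular product), a plain JCR observation listener might look like this:
import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

// Sketch only: watch /incoming for new nodes, validate them, and move the
// valid ones to a hypothetical /approved area.
public class IncomingValidator implements EventListener {

    private final Session session;

    public IncomingValidator(Session session) {
        this.session = session;
    }

    public void register() throws Exception {
        session.getWorkspace().getObservationManager().addEventListener(
                this,
                Event.NODE_ADDED,   // only care about newly added nodes
                "/incoming",        // watch this subtree...
                true,               // ...including all descendants
                null, null,         // no UUID or node-type filtering
                false);             // include events from this session as well
    }

    @Override
    public void onEvent(EventIterator events) {
        try {
            while (events.hasNext()) {
                Event event = events.nextEvent();
                Node node = session.getNode(event.getPath());
                if (isValid(node)) {
                    session.move(node.getPath(), "/approved/" + node.getName());
                    session.save();
                }
            }
        } catch (Exception e) {
            // log and decide whether to discard or quarantine the content
        }
    }

    private boolean isValid(Node node) {
        return true; // placeholder for the real validation logic
    }
}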
Another option is to use Apache Sling [1] which provides an OSGi-based scriptable application layer on top of a JCR repository. With Sling you can intercept HTTP POST requests, for example, to validate data before storing it.
[1] http://sling.apache.org
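A very rough sketch of the Sling route (the servlet registration details and the "file" request parameter are assumptions that depend on your setup, not Sling requirements):
import java.io.IOException;
import java.io.InputStream;
import javax.servlet.ServletException;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.request.RequestParameter;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;

// Sketch only: this servlet would be registered as an OSGi service with
// sling.servlet.* properties (e.g. bound to a path or resource type); the
// exact registration mechanism depends on the Sling/OSGi annotation style in use.
public class UploadValidationServlet extends SlingAllMethodsServlet {

    @Override
    protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        // "file" is a hypothetical form field carrying the uploaded WSDL/XSD
        RequestParameter upload = request.getRequestParameter("file");
        if (upload == null || !isValid(upload.getInputStream())) {
            response.sendError(SlingHttpServletResponse.SC_BAD_REQUEST, "Upload failed validation");
            return;
        }
        // Valid: store the content via request.getResourceResolver(), or let
        // Sling's default POST handling take over.
    }

    private boolean isValid(InputStream content) {
        return true; // placeholder for schema validation
    }
}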

You might try looking at ModeShape. It's also an open source (LGPL-licensed) JCR implementation, but it has the notion of 'sequencers' that automatically derive information from uploaded files and store that information as structured content (e.g., subgraphs of nodes and properties) in the repository, where it can be searched, queried, and accessed like any other repository content. ModeShape has quite a few sequencers already, but doesn't yet have WSDL or XSD sequencers (they're scheduled to appear in the next release, around the end of May 2011).
I'm the project lead for ModeShape, and I too am using it for storage of WSDL and XSD files (as well as other file formats). In fact, we're using JCR repositories to store all kinds of structured metadata.
As you mention, JCR does provide a way to import content, but the XML files that are imported must be in one of two formats defined by the JCR specification (system view and document view). The system view format uses JCR-specific elements and attributes, whereas the document view maps elements onto nodes and attributes onto properties (it's actually a bit more nuanced than that). And because this import process results in additional repository content (nodes and properties), JCR repositories do validate this structure using JCR's node type mechanism.
Here's an example of an XML file in Document View format:
<?xml version="1.0" encoding="UTF-8"?>
<Hybrid xmlns:car="http://www.modeshape.org/examples/cars/1.0"
xmlns:jcr="http://www.jcp.org/jcr/1.0"
xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
xmlns:mix="http://www.jcp.org/jcr/mix/1.0"
jcr:primaryType="nt:unstructured"
jcr:uuid="7e999653-e558-4131-8889-af1e16872f4d"
jcr:mixinTypes="mix:referenceable">
<Toyota_x0020_Prius jcr:primaryType="car:Car"
jcr:mixinTypes="mix:referenceable"
jcr:uuid="e92eddc1-d33a-4bd4-ae36-fe0a761b8d89"
car:year="2008" car:msrp="$21,500" car:mpgHighway="45"
car:model="Prius" car:valueRating="5" car:maker="Toyota"
car:mpgCity="48" car:userRating="4"/>
<Toyota_x0020_Highlander jcr:primaryType="car:Car"
jcr:mixinTypes="mix:referenceable"
jcr:uuid="f6348fbe-a0ba-43c4-9ae5-3faff5c0f6ec"
car:year="2008" car:msrp="$34,200" car:mpgHighway="25"
car:model="Highlander" car:valueRating="5" car:maker="Toyota"
car:mpgCity="27" car:userRating="4"/>
</Hybrid>
Here, 'Hybrid' is an 'nt:unstructured' node that contains two nodes of type 'car:Car'. The 'car:Car' node type is defined as follows:
[car:Car] > nt:unstructured, mix:created
- car:maker (string)
- car:model (string)
- car:year (string) < '(19|20)\d{2}' // any 4 digit number starting with '19' or '20'
- car:msrp (string) < '[$]\d{1,3}[,]?\d{3}([.]\d{2})?' // of the form "$X,XXX.ZZ", "$XX,XXX.ZZ" or "$XXX,XXX.ZZ"
// where '.ZZ' is optional
- car:userRating (long) < '[1,5]' // any value from 1 to 5 (inclusive)
- car:valueRating (long) < '[1,5]' // any value from 1 to 5 (inclusive)
- car:mpgCity (long) < '(0,]' // any value greater than 0
- car:mpgHighway (long) < '(0,]' // any value greater than 0
- car:lengthInInches (double) < '(0,]' // any value greater than 0
- car:wheelbaseInInches (double) < '(0,]' // any value greater than 0
- car:engine (string)
- car:alternateModels (reference) < 'car:Car'
If this node type is registered within the JCR repository, it will ensure that your imported content structure is valid per the node type definition.
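For example, here is a sketch of registering such a CND file and importing the document-view XML shown above. Jackrabbit's CndImporter is used here, and the file names and the /vehicles parent path are made up; other implementations offer their own CND import or the standard JCR 2.0 NodeTypeManager API.
import java.io.FileInputStream;
import java.io.FileReader;
import javax.jcr.ImportUUIDBehavior;
import javax.jcr.Session;
import org.apache.jackrabbit.commons.cnd.CndImporter;

public class CarImporter {

    // Sketch only: register the node types, then import the document-view XML.
    public static void importCars(Session session) throws Exception {
        CndImporter.registerNodeTypes(new FileReader("cars.cnd"), session);

        session.importXML(
                "/vehicles",                                // parent node (must already exist)
                new FileInputStream("cars.xml"),            // the document-view XML from above
                ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW); // how to treat the jcr:uuid values
        session.save();
    }
}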
If you're talking about validating the values of content (e.g., metadata values, structure of binary files), I'm not aware of any JCR repository implementation that can do this out of the box. JCR repositories are more general purpose, so this is something your application would do: use JCR event listeners to observe when new XML files (or other content) are uploaded into the repository, fetch the binary content that was just uploaded, and use other libraries to perform the validation.
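As an illustration of that last step, the validation itself could use the standard javax.xml.validation API to check an uploaded 'nt:file' node against an XSD. This is only a sketch; the schema file and the error handling are placeholders.
import java.io.File;
import java.io.InputStream;
import javax.jcr.Node;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class XmlContentValidator {

    // Sketch only: validate the binary content of an uploaded 'nt:file' node
    // against an XML schema.
    public static boolean isValid(Node fileNode, File xsd) {
        try (InputStream content = fileNode.getNode("jcr:content")
                                           .getProperty("jcr:data")
                                           .getBinary()
                                           .getStream()) {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(xsd);
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(content));
            return true;
        } catch (Exception e) {
            return false; // parse or validation error: treat the content as invalid
        }
    }
}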
Finally, you talk about storing extra properties on your uploaded files. I wrote a blog post some time ago that talks about how to define and use mixin node types to do this with JCR 'nt:file' and 'nt:folder' nodes.
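For example (a sketch with made-up namespace and property names), a mixin carrying your metadata could be defined in CND like this:
<acme = 'http://www.example.com/acme/1.0'>
[acme:metadata] mixin
- acme:createdBy (string)
- acme:lastModifiedBy (string)
- acme:lastModifiedTime (date)
and then added to an existing 'nt:file' node after upload:
Node file = session.getNode("/wsdls/service.wsdl"); // hypothetical path
file.addMixin("acme:metadata");
file.setProperty("acme:createdBy", "jsmith");
file.setProperty("acme:lastModifiedTime", java.util.Calendar.getInstance());
session.save();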
Hope this helps.

Related

In Snowflake: How to access internally staged pre-trained model from UDF, syntax dilemma?

What is the syntax for referencing a staged zip file from a UDF? Specifically, I created a UDF in Snowpark and it needs to load the s-bert sentence_transformers pre-trained model (I downloaded the model, zipped it, and uploaded it to an internal stage).
The "SentenceTransformer" method (see the code line below) takes a parameter that can either be the name of the model, in which case the pre-trained model is downloaded from the web, or a directory path to a folder that contains the already downloaded pre-trained model files.
Downloading the files from the web within a UDF is not an option in Snowflake.
So, what is the directory path to the internally staged file that I can use as a parameter to the SentenceTransformer method so it can access the already downloaded zip model? "@stagename/filename.zip" is not working.
@udf(....)
def create_embedding(...):
    ....
    model = SentenceTransformer('all-MiniLM-L6-v2')  # THIS IS THE LINE IN THE QUESTION
    ....
    ....
UDFs need to specify the specific files they use at creation time (for now):
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-creating.html#loading-a-file-from-a-stage-into-a-python-udf
Check the example from the docs, which uses imports and snowflake_import_directory to open(import_dir + 'file.txt'):
create or replace function my_udf()
returns string
language python
runtime_version=3.8
imports=('@my_stage/file.txt')
handler='compute'
as
$$
import sys
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

def compute():
    with open(import_dir + 'file.txt', 'r') as file:
        return file.read()
$$;

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write a CSV, or to a BigQuery table or whatever, from there. It certainly helped us in ZenML with our BatchInferencePipeline.
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """

    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )

How to set the Ontology ID of an anonymous Ontology using the OWL API

I have a file containing an ontology without an ontology ID (the ontology tag <Ontology/> is empty). The serialization format used is RDF/XML. My goal is to load the file, set an ontology ID and write the file back using the OWL API. Unfortunately I don't know how to do this. I tried the following:
ontology = ontologyManager.loadOntologyFromOntologyDocument(new File("filename"));
ontologyManager.setOntologyDocumentIRI(ontology, IRI.create("http://www.mydesiredIri.com/abc"));
ontologyManager.saveOntology(ontology,new FileOutputStream(new File("outputfile")));
When I run the code, the ontology ID is not added to the ontology: instead of <Ontology rdf:about="http://www.mydesiredIri.com/abc"/>, the tag is still empty. What am I doing wrong?
Thank you!
Kind regards
OWLOntologyManager.setOntologyDocumentIRI() is for setting the document IRI of the ontology, not the ontology IRI itself. The difference between the two is that the document IRI is a resolvable URL or a file path (i.e., it can be used to parse the ontology), while the ontology IRI is the symbolic name of the ontology (it does not need to be resolvable, and it can even be missing, which is the case for anonymous ontologies).
To set the ontology IRI, use:
//versionIRI can be null
OWLOntologyID newOntologyID = new OWLOntologyID(ontologyIRI, versionIRI);
// Create the change that will set our version IRI
SetOntologyID setOntologyID = new SetOntologyID(ontology, newOntologyID);
// Apply the change
manager.applyChange(setOntologyID);
After this, save the ontology as usual.
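Putting it together, a minimal sketch (assuming the OWL API 3.x style used in the snippet above; the file names and the IRI are placeholders taken from the question):
import java.io.File;
import java.io.FileOutputStream;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyID;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.model.SetOntologyID;

public class SetOntologyIriExample {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("filename"));

        // give the anonymous ontology a proper ontology IRI (no version IRI here)
        IRI ontologyIRI = IRI.create("http://www.mydesiredIri.com/abc");
        manager.applyChange(new SetOntologyID(ontology, new OWLOntologyID(ontologyIRI, null)));

        // save it back, in the same RDF/XML format it was loaded in
        manager.saveOntology(ontology, new FileOutputStream(new File("outputfile")));
    }
}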

How to get an array from an XML file in Swift

I am new to Swift but I have made an Android app where a string array is selected from an XML file. This is a large XML file that contains a lot of string arrays, and the app gets the relevant string array based on a user selection.
I am now trying to develop the same app for iOS using Swift. I would like to use the same XML file, but I cannot see an easy way to get the correct array. For example, part of the XML looks like this:
<string-array name="OCR_Businessstudies_A_Topics">
<item>1. Business objectives and strategic decisions</item>
<item>2. External influences facing businesses</item>
<item>3. Marketing and marketing strategies</item>
<item>4. Operational strategy</item>
<item>5. Human resources</item>
<item>6. Accounting and financial considerations</item>
<item>7. The global environment of business</item>
</string-array>
<string-array name="OCR_Businessstudies_AS_Topics">
<item>1. Business objectives and strategic decisions</item>
<item>2. External influences facing businesses</item>
<item>3. Marketing and marketing strategies</item>
<item>4. Operational strategy</item>
<item>5. Human resources</item>
<item>6. Accounting and financial considerations</item>
</string-array>
If I have the string "OCR_Businessstudies_A_Topics", how do I get the corresponding "OCR_Businessstudies_A_Topics" array from the XML file?
This is very straightforward in Android, and although I have used online tutorials for Swift, it seems I have to parse the XML file, but I do not seem to be getting anywhere.
Is there a better approach than trying to parse the whole XML file?
Thanks
Barry
You can write your own XML parser using NSXMLParser (with a delegate conforming to NSXMLParserDelegate), or use a library like HTMLReader:
let fileURL = NSBundle.mainBundle().URLForResource("data", withExtension: "xml")!
let xmlData = NSData(contentsOfURL: fileURL)!

let topic = "OCR_Businessstudies_A_Topics"
let document = HTMLDocument(data: xmlData, contentTypeHeader: "text/xml")

for item in document.nodesMatchingSelector("string-array[name='\(topic)'] item") {
    print(item.textContent)
}

Is there a less-database-intensive way to get data from my extended Django Site model?

I run a site that operates the same on many URLs except for the name and the display of objects tied to a given site. Because of this I have extended the Site model to include various other bits of information about a site and created a middleware to put the standard Site object information into the request object. Previously the only info I needed in the request object was the site's name, which I could get from the Site models Django provides. I now need bits of information that reside in my extended Site model (which previously was only used by my other various app models).
This goes from adding one query to each page (request.site = Site.objects.get_current()) to adding two, as I need to get the current site, then get the associated extended Site object from my model.
Is there a way to get this information without using two queries? Or even without using one?
models.py:
from django.contrib.sites.models import Site
from django.db import models


class SiteMethods(Site):
    """
    Extended site model
    """
    colloquial_name = models.CharField(max_length=32,)
    ...
middleware.py:
class RequestContextMiddleware(object):
    """
    Puts the Site into each request object
    """
    def process_request(self, request):
        # This runs two queries on every page, instead of just one
        request.site = SiteMethods.objects.get(id=Site.objects.get_current().id)
        return None
In my settings.py file, I have all shared configuration data. My server instances (gunicorn) are configured to load [site]_settings.py, which holds all site-specific settings (to include Django's SITE_ID), and at the bottom:
try:
    from settings import *
except ImportError:
    pass
I am looking for options (if they exist) that do not include referencing the hard-coded SITE_ID in [site]_settings.py.
Update:
As suggested below, subclassed objects should still have access to their parent objects and all the parent object's functionality. For the Site object, strangely, this seems to not be the case.
>>> Site.objects.get_current()
<Site: website.com>
>>> SiteMethods.objects.get_current()
Traceback (most recent call last):
File "<console>", line 1, in <module>
AttributeError: 'Manager' object has no attribute 'get_current'
>>> SiteMethods.objects.select_related('site').get_current() # as suggested below
Traceback (most recent call last):
File "<console>", line 1, in <module>
AttributeError: 'QuerySet' object has no attribute 'get_current'
>>> dir(SiteMethods)
['DoesNotExist', 'MultipleObjectsReturned', '__class__', '__delattr__', '__dict__',
'__doc__', '__eq__', '__format__', '__getattribute__', '__hash__', '__init__',
'__metaclass__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__',
'__weakref__', '_base_manager', '_default_manager', '_deferred', '_get_FIELD_display',
'_get_next_or_previous_by_FIELD', '_get_next_or_previous_in_order', '_get_pk_val',
'_get_unique_checks', '_meta', '_perform_date_checks', '_perform_unique_checks',
'_set_pk_val', 'clean', 'clean_fields', 'date_error_message', 'delete', 'full_clean',
'objects', 'pk', 'prepare_database_save', 'save', 'save_base', 'serializable_value',
'site_ptr', 'sitemethods', 'unique_error_message', 'validate_unique',]
Since you subclassed Site you should be able to just do SiteMethods.objects.get_current(), which will net you an instance of SiteMethods. Since Django's implementation of MTI (Multiple Table Inheritance) uses a OneToOneField to the parent class, you should also be able to use select_related for site. So, try the following:
SiteMethods.objects.select_related('site').get_current()
