rdflib "EnumeratedClass" to create enumerated datatype - owl

I am new to rdflib (and RDF/OWL), so I'm having some trouble understanding the rdflib documentation in the absence of concrete examples / layman's guides.
I have an OWL subclass for which I want to enumerate three possible values. I found this official tennis example with the "end result" I assume I need, and wanted to recreate it with my own data using rdflib.
I found rdflib's "rdflib.extras.infixowl.EnumeratedClass", and tried modifying and running the sample code, but my serialized output doesn't resemble the tennis example output.
My code:
from rdflib import URIRef, BNode, Literal, Namespace, Graph
from rdflib.namespace import RDF, RDFS, OWL
from rdflib.extras.infixowl import EnumeratedClass, Individual
from rdflib.collection import Collection
from rdflib.util import first
n = Namespace("http://example.org/example/")
g = Graph()
g.bind("owl",OWL)
g.bind("", n)
my_class = n.my_class
g.add((my_class, RDF.type, OWL.Class))
g.add((my_class, RDFS.subClassOf, OWL.Thing))
g.add((my_class, RDF.ID, Literal("my_class")))
my_subclass = n.my_subclass
g.add((my_subclass, RDF.type, OWL.Class))
g.add((my_subclass, RDF.ID, Literal("my_subclass")))
g.add((my_subclass, OWL.subClassOf, my_class))
Individual.factoryGraph = g
my_list = EnumeratedClass(n.my_list,
                          members=[n.listitem1,
                                   n.listitem2,
                                   n.listitem3])
col = Collection(g, first(
    g.objects(predicate=OWL.oneOf,
              subject=my_list.identifier)))
[g.qname(item) for item in col]
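For completeness, here is the serialization step; the exact format argument is an assumption on my part:
# assumed serialization call that produced the output below;
# "pretty-xml" is the rdflib serializer that nests lists inline
print(g.serialize(format="pretty-xml"))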
This gives me:
<owl:Class rdf:about="http://example.org/example/my_subclass">
  <rdf:ID>my_subclass</rdf:ID>
  <owl:oneOf rdf:parseType="Collection">
    <rdf:Description rdf:about="http://example.org/example/listitem1"/>
    <rdf:Description rdf:about="http://example.org/example/listitem2"/>
    <rdf:Description rdf:about="http://example.org/example/listitem3"/>
  </owl:oneOf>
  <owl:subClassOf rdf:resource="http://example.org/example/my_class"/>
</owl:Class>
and not the "first/rest" format seen in the official tennis example.
First: Is this wrong?
Second: if someone can explain the effective difference between these two formats, and whether one is preferred, I would really appreciate it.
Thanks!

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one by one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dicts into a Pandas DataFrame, or store them as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlow Model Analysis functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery) or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood", but that doesn't say much about how I should use them together.) The syntax in those documents looks different enough from my TFX code that I'm not sure whether they're compatible.
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec before-hand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ...,
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name')]))]))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read the schema produced by SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read the inferred results written by BulkInferrer
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse the dataset with the feature spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point the dataset is like any other parsed dataset, so it's trivial to write it out to a CSV, a BigQuery table, or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
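For instance, a minimal sketch of the CSV route, assuming all features in the spec parse to dense tensors (the output filename is arbitrary):
import pandas as pd

# each element of the parsed dataset is a dict of feature name -> value
rows = [{name: value for name, value in record.items()}
        for record in dataset.as_numpy_iterator()]
pd.DataFrame(rows).to_csv('predictions.csv', index=False)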
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to submit a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
from typing import List, Text

import pandas as pd
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes; here I extract the top-10 classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val)[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0])
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"])
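Usage might look something like this; the glob pattern is an assumption about where your BulkInferrer artifact lives:
import tensorflow as tf

files = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
df = parse_prediction_logs(files)
print(df.head())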

OWL API: Traverse imported ontology

I'm trying to import an ontology into the primary ontology and traverse all of its classes:
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("data/prim.owl"));
OWLDataFactory factory = manager.getOWLDataFactory();
OWLImportsDeclaration im = factory.getOWLImportsDeclaration(IRI.create("https://protege.stanford.edu/ontologies/pizza/pizza.owl"));
manager.applyChange(new AddImport(ontology, im));
OWLReasoner reasoner = OpenlletReasonerFactory.getInstance().createReasoner(ontology);
I’m running this code to get all classes:
//*********************
Set<OWLClass> allCls = ontology.getClassesInSignature();
allCls.forEach(System.out::println);
Classes belonging to prim.owl are returned, but classes in the imported ontology (pizza.owl) are not returned.
The code in the question contains a mistake: it does not load the desired imported ontology (pizza) into the manager.
OWLImportsDeclaration im = factory.getOWLImportsDeclaration(IRI.create("https://protege.stanford.edu/ontologies/pizza/pizza.owl"));
manager.applyChange(new AddImport(ontology,im));
These lines just add an owl:imports declaration for the pizza IRI into the ontology header (_:x a owl:Ontology).
To make the code work, you need to load the pizza ontology separately:
OWLOntology pizza = manager.loadOntology(IRI.create("https://protege.stanford.edu/ontologies/pizza/pizza.owl"));
OWLImportsDeclaration im = factory.getOWLImportsDeclaration(pizza.getOntologyID().getOntologyIRI().orElseThrow(AssertionError::new));
manager.applyChange(new AddImport(ontology, im));
Now you can check that all imports and references are really present and correct, and, therefore, your ontology has a reference to the pizza ontology:
Assert.assertEquals(1, ontology.importsDeclarations().count());
Assert.assertEquals(1, ontology.imports().count());
Assert.assertEquals(2, manager.ontologies().count());
Then you can get all OWL classes from both ontologies as a single collection or Java Stream:
ontology.classesInSignature(Imports.INCLUDED).forEach(System.err::println);
Also please note: the method Set<OWLClass> getClassesInSignature(boolean includeImportsClosure) is deprecated (in OWL-API v5).
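Putting the pieces together, a minimal sketch of the corrected loading code (import statements and exception handling elided):
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("data/prim.owl"));

// load pizza into the same manager, then declare the import
OWLOntology pizza = manager.loadOntology(
        IRI.create("https://protege.stanford.edu/ontologies/pizza/pizza.owl"));
OWLImportsDeclaration im = manager.getOWLDataFactory()
        .getOWLImportsDeclaration(pizza.getOntologyID().getOntologyIRI().orElseThrow(AssertionError::new));
manager.applyChange(new AddImport(ontology, im));

// prints classes from prim.owl and its whole imports closure
ontology.classesInSignature(Imports.INCLUDED).forEach(System.out::println);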

Which geopandas datasets (maps) are available?

I just created a very simple geopandas example (see below). It works, but I noticed that it is important for me to be able to use a custom part of the world: sometimes Germany, sometimes only Berlin. (Also, I want to aggregate the data I have by areas which I define as polygons in a geopandas file, but I'll add this in another question.)
How can I get a different "base map" than
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
for visualizations?
Example
# 3rd party modules
import pandas as pd
import geopandas as gpd
import shapely
# needs 'descartes'
import matplotlib.pyplot as plt

df = pd.DataFrame({'city': ['Berlin', 'Paris', 'Munich'],
                   'latitude': [52.518611111111, 48.856666666667, 48.137222222222],
                   'longitude': [13.408333333333, 2.3516666666667, 11.575555555556]})
gdf = gpd.GeoDataFrame(df.drop(['latitude', 'longitude'], axis=1),
                       crs={'init': 'epsg:4326'},
                       geometry=[shapely.geometry.Point(xy)
                                 for xy in zip(df.longitude, df.latitude)])
print(gdf)

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
base = world.plot(color='white', edgecolor='black')
gdf.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()
As written in the geopandas.datasets.get_path(...) documentation, one has to execute
>>> geopandas.datasets.available
['naturalearth_lowres', 'naturalearth_cities', 'nybb']
where
naturalearth_lowres: contours of countries
naturalearth_cities: positions of cities
nybb: the New York City borough boundaries
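For example, the nybb data loads and plots the same way as the example above:
import geopandas as gpd
import matplotlib.pyplot as plt

nybb = gpd.read_file(gpd.datasets.get_path('nybb'))
nybb.plot()
plt.show()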
Other data sources
Searching for "germany shapefile" gave an arcgis.com url which used the "Bundesamt für Kartographie und Geodäsie" as a source. The result of using vg2500_geo84/vg2500_krs.shp looks like this:
Source:
© Bundesamt für Kartographie und Geodäsie, Frankfurt am Main, 2011
Reproduction, distribution and making publicly available, including in extracts, is permitted with attribution to the source.
I also had to set base.set_aspect(1.4), otherwise it looked wrong. The value 1.4 was found by trial and error.
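A minimal sketch of that plot; the shapefile path is an assumption about where the downloaded archive was extracted:
import geopandas as gpd
import matplotlib.pyplot as plt

# path assumes the archive was extracted next to the script
krs = gpd.read_file('vg2500_geo84/vg2500_krs.shp')
base = krs.plot(color='white', edgecolor='black')
base.set_aspect(1.4)  # found by trial and error; corrects the distortion
plt.show()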
Another source for such data for Berlin is daten.berlin.de
When geopandas reads the shapefile, the result is a GeoDataFrame with the columns
['USE', 'RS', 'RS_ALT', 'GEN', 'SHAPE_LENG', 'SHAPE_AREA', 'geometry']
with:
USE=4 for all elements
RS is a string like 16077 or 01003
RS_ALT is a string like 160770000000 or 010030000000
GEN is a string like 'Saale-Holzland-Kreis' or 'Erlangen'
SHAPE_LENG is a float like 202986.1998816 or 248309.91235015
SHAPE_AREA is a float like 1.91013141e+08 or 1.47727769e+09
geometry is a shapely geometry - mostly POLYGON

How does one call external datasets into scikit-learn?

For example consider this dataset:
(1)
https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data
Or
(2)
http://data.worldbank.org/topic
How does one call such external datasets into scikit-learn to do anything with them?
The only kind of dataset loading I have seen in scikit-learn is via commands like:
from sklearn.datasets import load_digits
digits = load_digits()
You need to learn a little pandas, which is a data frame implementation in Python. Then you can do:
import pandas
my_data_frame = pandas.read_csv("/path/to/my/data")
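For example, the UCI annealing data from the question can be read straight from its URL; it is comma-separated and, I believe, has no header row:
import pandas

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data"
my_data_frame = pandas.read_csv(url, header=None)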
To create model matrices from your data frame, I recommend the patsy library, which implements a model specification language similar to R formulas:
import patsy
y, X = patsy.dmatrices("my_response ~ my_model_formula", my_data_frame)
The design matrix X can then be passed as the X into the various sklearn models, with y as the target. (Note that a two-sided formula needs patsy.dmatrices, which returns both matrices; patsy.dmatrix only handles formulas without a response term.)
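A short end-to-end sketch under those assumptions (the column names are hypothetical):
import pandas
import patsy
from sklearn.linear_model import LinearRegression

my_data_frame = pandas.read_csv("/path/to/my/data")
# dmatrices returns the response vector y and the design matrix X
y, X = patsy.dmatrices("my_response ~ my_predictor", my_data_frame)
model = LinearRegression().fit(X, y)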
scikit-learn can also fetch a number of well-known datasets itself, but only those it ships a downloader for (for example fetch_20newsgroups or fetch_openml). Replace 'EXTERNALDATASETNAME' below with the name of one of those built-in fetchers:
import sklearn.datasets
data = sklearn.datasets.fetch_EXTERNALDATASETNAME()

OWL-API Intersection of more concepts

I'm new at OWL-API.
I need to represent an intersection of N concepts
So, intersectionOf (C1, C2, ..... CN).
IntersectionOf has two arguments, but how can I build the general-purpose solution?
Is it good enough to build a HashSet and then pass it as the argument?
OWLDataFactory has methods for obtaining an intersection that accept collections (usually sets) of class expressions. I believe that's what you're after.
Just like Ignazio said, but with code:
Java Code:
OWLDataFactory factory = manager.getOWLDataFactory();
Set<OWLClassExpression> mySet = new HashSet<>();
// add C1 ... CN to mySet, then:
OWLObjectIntersectionOf intersection = factory.getOWLObjectIntersectionOf(mySet);
HTH
