How to delete lines with specific subjects from an RDF file?

I have a file containing RDF triples (subject-predicate-object) in Turtle syntax (.ttl), and I have another file that contains only some subjects.
For example:
<http://dbpedia.org/resource/AlbaniaHistory> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaHistory"@en .
<http://dbpedia.org/resource/AsWeMayThink> <http://www.w3.org/2000/01/rdf-schema#label> "AsWeMayThink"@en .
<http://dbpedia.org/resource/AlbaniaEconomy> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaEconomy"@en .
<http://dbpedia.org/resource/AlbaniaGovernment> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaGovernment"@en .
And in the other file I have, for example:
<http://dbpedia.org/resource/AlbaniaHistory>
<http://dbpedia.org/resource/AlbaniaGovernment>
<http://dbpedia.org/resource/Pérotin>
<http://dbpedia.org/resource/ArtificalLanguages>
I would like to get:
<http://dbpedia.org/resource/AlbaniaHistory> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaHistory"@en .
<http://dbpedia.org/resource/AlbaniaGovernment> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaGovernment"@en .
So, I would like to remove from the first file the triples whose subjects are not in the second file. How could I do this?
In Java, I tried reading the contents of the second file into an ArrayList and using the contains method to check whether the subject of each triple in the first file matches any line in the second file; however, this is too slow, since the files are very big.
Thank you very much for helping

In Java, you could use an RDF library to read/write in streaming fashion and do some basic filtering.
For example, using RDF4J's Rio parser you could create a simple SubjectFilter class that checks, for each triple, whether it has a required subject:
public class SubjectFilter extends RDFHandlerWrapper {

    private final Set<Resource> subjectsToKeep;

    public SubjectFilter(RDFHandler wrappedHandler, Set<Resource> subjectsToKeep) {
        super(wrappedHandler);
        this.subjectsToKeep = subjectsToKeep;
    }

    @Override
    public void handleStatement(Statement st) throws RDFHandlerException {
        // only write the statement if it has a subject we want;
        // a Set gives constant-time lookups, unlike ArrayList.contains
        if (subjectsToKeep.contains(st.getSubject())) {
            super.handleStatement(st);
        }
    }
}
And then connect a parser to a writer that spits out the filtered content, something along these lines:
RDFParser rdfParser = Rio.createParser(RDFFormat.TURTLE);
RDFWriter rdfWriter = Rio.createWriter(RDFFormat.TURTLE,
        new FileOutputStream("/path/to/example-output.ttl"));
// link our parser to our writer, wrapping the writer in our subject filter
rdfParser.setRDFHandler(new SubjectFilter(rdfWriter, subjectsToKeep));
// start processing
rdfParser.parse(new FileInputStream("/path/to/input-file.ttl"), "");
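For completeness, here is one way to populate the subjectsToKeep set used above. This is a sketch, assuming the second file holds exactly one angle-bracketed IRI per line (the file path is a placeholder); a HashSet is what avoids the slow ArrayList.contains lookups mentioned in the question:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;
import org.eclipse.rdf4j.model.Resource;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;

Set<Resource> subjectsToKeep = new HashSet<>();
ValueFactory vf = SimpleValueFactory.getInstance();
try (BufferedReader reader = new BufferedReader(new FileReader("/path/to/subjects.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
            // strip the surrounding < and > before creating the IRI
            subjectsToKeep.add(vf.createIRI(line.substring(1, line.length() - 1)));
        }
    }
}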
For more details on how to use RDF4J and the Rio parsers, see the documentation.
As an aside: although this is perhaps more work than doing some command line magic with things like grep and awk, the advantage is that this is semantically robust: you leave the interpretation of which bit of your data is the triple's subject to a processor that understands RDF, rather than taking an educated guess with a regex ("it's probably the first URL on each line"), which may break in cases where the input file uses a slightly different syntax variation.
(disclosure: I am on the RDF4J development team)

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
1. Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
2. Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
3. Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec; tf.io.parse_example expects a batch of
# serialized protos, so batch the records before mapping
def parse(raw_records):
    return tf.io.parse_example(raw_records, spec)

dataset = dataset.batch(32).map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
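For instance, a minimal sketch of dumping the parsed dataset to CSV via pandas (this assumes the feature spec yields dense, one-value-per-example features; the output file name is a placeholder):
import pandas as pd

rows = []
for batch in dataset.as_numpy_iterator():
    # each batch is a dict of feature name -> numpy array
    rows.append(pd.DataFrame({name: values.ravel() for name, values in batch.items()}))

df = pd.concat(rows, ignore_index=True)
df.to_csv('inference_results.csv', index=False)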
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
from typing import List, Text

import pandas as pd
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [
            x.decode()
            for x in message.predict_log.response.outputs['output_2'].string_val[:10]
        ]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(
            message.predict_log.request.inputs['input_1'].string_val[0]
        )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"]
    )
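Usage would then be along these lines (the artifact path is a placeholder):
# glob the BulkInferrer output shards and build the DataFrame
filenames = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
df = parse_prediction_logs(filenames)
print(df.head())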

How to use Collections.binarySearch() in a CodenameOne project

I am used to being able to perform a binary search of a sorted list of, say, Strings or Integers, with code along the lines of:
Vector<String> vstr = new Vector<String>();
// etc...
int index = Collections.binarySearch (vstr, "abcd");
I'm not clear on how Codename One handles standard Java methods and classes, but it looks like this could be fixed easily if classes like Integer and String (or the Codename One versions of these) implemented the Comparable interface.
Edit: I now see that code along the lines of the following will do the job.
int index = Collections.binarySearch(vstr, "abcd", new Comparator<String>() {
    @Override
    public int compare(String object1, String object2) {
        return object1.compareTo(object2);
    }
});
Adding the Comparable interface (to the various primitive "wrappers") would also make it easier to use Collections.sort (another very useful method :-))
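Until then, the same explicit-comparator trick covers sorting too. A sketch mirroring the binarySearch workaround above:
// sort with an explicit comparator; the list must be sorted like this
// before Collections.binarySearch will work correctly
Collections.sort(vstr, new Comparator<String>() {
    @Override
    public int compare(String object1, String object2) {
        return object1.compareTo(object2);
    }
});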
You can also sort with a comparator, but I agree: this is one of the important enhancements we need to provide in the native VMs on the various platforms. Personally, this is my biggest peeve in our current VM.
Can you file an RFE on that and mention it as a comment in the Number issue?
If we are doing that change, we might as well do both.

MATLAB extract data from mat-files using logical conditions

I have a lot of data in several hundred .mat-files that I want to extract specific data from. All of my .mat-file names contain specific numbers that identify the content, in the form Number1_Number2_Number3_Number4.mat:
01_33_06_121.mat
01_24_12_124.mat
02_45_15_118.mat
02_33_11_190.mat
01_33_34_142.mat
Now I want to extract for example all the data from files with Number1=01 or Number1=02 and Number2=33.
Before I start to write a program from scratch, I would like to know, if there is a simple way to do this with Matlab. Does anybody know how I can solve this problem in a fast way?
Thanks a lot!
There are multiple ways you can do this; off the top of my head, the following can work:
Obtain all the file names into a cell array:
allFiles = dir('folder/*.mat');
allNames = { allFiles.name };
Loop through the file names and compare each against your condition using a regular expression:
% e.g. keep files where Number1 is 01 or 02 and Number2 is 33
pattern = '^(0[12])_33_';
for i = 1:numel(allNames)
    if ~isempty(regexp(allNames{i}, pattern, 'once'))
        disp(allNames{i});
    end
end
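From there, loading the matching files is straightforward. A small sketch (assuming the files live in 'folder'; load returns a struct with one field per variable saved in the .mat-file):
for i = 1:numel(allNames)
    if ~isempty(regexp(allNames{i}, pattern, 'once'))
        data = load(fullfile('folder', allNames{i}));
        % access your variables as fields, e.g. data.myVariable
    end
end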

Open linked data: a data set

I downloaded a data set which is supposed to be in RDF format (http://iw.rpi.edu/wiki/Dataset_1329). Using Notepad++ I opened it, but I can't read it. Any suggestions?
The file, uncompressed, is about 140MB. Notepad++ is probably failing due to the size of the file. The RDF format used in this dataset is N-Triples: one triple per line, with three components (subject, predicate, object), very human-readable. Sample data from the file:
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/race_other_multi_racial> "0" .
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/race_black_and_white> "0" .
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/national_origin_hispanic> "0" .
<http://data-gov.tw.rpi.edu/raw/1329/data-1329-00017.rdf#entry8389> <http://data-gov.tw.rpi.edu/vocab/p/1329/filed_cases> "1" .
If you want to have a look at the data then try to open it with a tool that streams the file rather than loading it all at once, for instance less or head.
If you want to use the data you might want to look into loading it into a triple store (4store, Virtuoso, Jena TDB, ...) and using SPARQL to query it.
Try Google Refine (possibly with RDF extension: http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/ )

The best way to make a CodeIgniter website multi-language: calling from lang arrays depending on a lang session?

I've been researching for hours and hours, but I could not find any clear, efficient way to do it :/
I have a CodeIgniter-based website in English and I have to add a Polish language now. What is the best way to make my site work in two languages, depending on the visitor's selection?
Is there a way to create array files for each language and call them in view files depending on a session value from the language selection? I don't want to use a database.
I appreciate the help! I'm running up against a deadline :/ Thanks!!
Have you seen CodeIgniter's Language library?
The Language Class provides functions to retrieve language files and lines of text for purposes of internationalization.

In your CodeIgniter system folder you'll find one called language containing sets of language files. You can create your own language files as needed in order to display error and other messages in other languages.

Language files are typically stored in your system/language directory. Alternately you can create a folder called language inside your application folder and store them there. CodeIgniter will look first in your application/language directory. If the directory does not exist or the specified language is not located there, CI will instead look in your global system/language folder.
In your case...
you need to create a polish_lang.php inside application/language/polish and an english_lang.php inside application/language/english
then create your keys inside that file (e.g. $lang['hello'] = "Witaj";)
then load it in your controller like $this->lang->load('polish_lang', 'polish');
then fetch the line like $this->lang->line('hello'); Just store the return value of this function in a variable so you can use it in your view, as in the sketch below.
Repeat the steps for the English language and all other languages you need.
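A minimal sketch of that last step (the controller and view names here are hypothetical):
// in your controller
$this->lang->load('polish_lang', 'polish');
$data['hello'] = $this->lang->line('hello');
$this->load->view('welcome_view', $data);

// in application/views/welcome_view.php
echo $hello; // prints "Witaj"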
Also, to add the language to the session, I would define some constants for each language, then make sure you have the session library autoloaded in config/autoload.php, or load it whenever you need it. Add the user's desired language to the session:
$this->session->set_userdata('language', ENGLISH);
Then you can grab it anytime like this:
$language = $this->session->userdata('language');
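The language constants mentioned above could be defined once, for instance in application/config/constants.php (the names and values here are assumptions, matching the ENGLISH constant used above):
// application/config/constants.php
define('ENGLISH', 'english');
define('POLISH', 'polish');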
In the controller, add the following lines in the constructor,
i.e. after
parent::Controller();
add the lines below:
$this->load->helper('lang_translate');
$this->lang->load('nl_site', 'nl'); // ('filename', 'directory')
Create a helper file lang_translate_helper.php with the following function and put it in the directory system\application\helpers:
function label($label, $obj)
{
    $return = $obj->lang->line($label);
    if ($return) {
        echo $return;
    } else {
        echo $label;
    }
}
For each language, create a directory with the language abbreviation, like en, nl, fr, etc., under
system\application\languages
and create a language file in the respective directory, containing a $lang array holding label => language_value pairs, as given below.
nl_site_lang.php
$lang['welcome'] = 'Welkom';
$lang['hello word'] = 'worde Witaj';
en_site_lang.php
$lang['welcome'] = 'Welcome';
$lang['hello word'] = 'Hello Word';
You can store multiple files for the same language under different names, as per your requirements.
E.g., if you want a separate language file for managing the backend (administrator section), you can use it in a controller as $this->lang->load('nl_admin', 'nl');
nl_admin_lang.php
$lang['welcome'] = 'Welkom';
$lang['hello word'] = 'worde Witaj';
And finally,
to print a label in the desired language, access the labels as below in your view:
label('welcome', $this);
OR
label('hello word', $this);
Note the space in 'hello word': you can use keys with spaces this way as well :)
When there is no label defined in the language file, it will simply print whatever you passed to the function label.
I second Randell's answer.
However, one could always integrate a GeoIP service such as http://www.maxmind.com/app/php or http://www.ipinfodb.com/, then save the results with the CodeIgniter session class.
If you want to use the ipinfodb.com API, you can add the ip2locationlite.class.php file to your CodeIgniter application library folder and then create a model function to do whatever GeoIP logic you need for your application, such as:
function geolocate()
{
    $ipinfodb = new ipinfodb;
    $ipinfodb->setKey('API KEY');

    // Get errors and locations
    $locations = $ipinfodb->getGeoLocation($this->input->ip_address());
    $errors = $ipinfodb->getError();

    // Pick out the country code (initialised so we never return an undefined variable)
    $place = '';
    if (empty($errors)) {
        foreach ($locations as $field => $val) {
            if ($field === 'CountryCode') {
                $place = $val;
            }
        }
    }
    return $place;
}
For easier use, CI has updated this so you can just use
$this->load->helper('language');
and to translate text
lang('language line');
and if you want to wrap it inside a label then use the optional parameter
lang('language line', 'element_id');
This will output:
// becomes <label for="element_id">language line</label>
For further reading:
http://ellislab.com/codeigniter/user-guide/helpers/language_helper.html
I've used Wiredesignz's MY_Language class with great success.
I've just published it on github, as I can't seem to find a trace of it anywhere.
https://github.com/meigwilym/CI_Language
My only changes are to rename the class to CI_Lang, in accordance with the new v2 changes.
When managing the actual files, things can get out of sync pretty easily unless you're really vigilant, so we've launched a free (beta) service called String which allows you to keep track of your language files easily and collaborate with translators.
You can either import existing language files (in PHP array, PHP Define, ini, po or .strings formats) or create your own sections from scratch and add content directly through the system.
String is totally free so please check it out and tell us what you think.
It's actually built on Codeigniter too! Check out the beta at http://mygengo.com/string
Follow this: https://github.com/EllisLab/CodeIgniter/wiki/CodeIgniter-2.1-internationalization-i18n
It's simple and clear; also check out the documentation at http://ellislab.com/codeigniter/user-guide/libraries/language.html.
It's way simpler than the alternatives above.
I am using code like this in config.php:
$lang = 'ru'; // this language will be used if there is no language information from the user agent (for example, from the command line, wget, etc.)

if (!empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
    $lang = substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2);
}
if (!empty($_COOKIE['language'])) {
    $lang = $_COOKIE['language'];
}

switch ($lang)
{
    case 'ru':
        $config['language'] = 'russian';
        setlocale(LC_ALL, 'ru_RU.UTF-8');
        break;
    case 'uk':
        $config['language'] = 'ukrainian';
        setlocale(LC_ALL, 'uk_UA.UTF-8');
        break;
    case 'foo':
        $config['language'] = 'foo';
        setlocale(LC_ALL, 'foo_FOO.UTF-8');
        break;
    default:
        $config['language'] = 'english';
        setlocale(LC_ALL, 'en_US.UTF-8');
        break;
}
.... and then I'm using the usual internal mechanism of CI.
Oh, almost forgot! In my views I use buttons which set the 'language' cookie to the language preferred by the user.
So, first this code tries to detect the preferred language set in the user's user agent (browser). Then it tries to read the 'language' cookie. And finally, the switch sets the language for the CI application.
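A sketch of the controller method such a button could call (the method name and cookie lifetime are assumptions; the config.php snippet above reads this cookie on the next request):
// in any controller; requires the cookie helper: $this->load->helper('cookie');
public function set_language($lang_code)
{
    // persist the user's choice for 30 days
    set_cookie('language', $lang_code, 60 * 60 * 24 * 30);
    redirect($_SERVER['HTTP_REFERER']);
}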
You can make a function like this:
function translateTo($language, $word) {
    global $lang; // the $lang arrays from your language files must be in scope
    $defaultLang = 'english';
    // return the translation if it exists, otherwise fall back to the default language
    if (isset($lang[$language][$word])) {
        return $lang[$language][$word];
    }
    return $lang[$defaultLang][$word];
}
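Usage would then be along these lines (assuming $lang arrays like those defined in the answers above):
echo translateTo('polish', 'welcome'); // prints the Polish entry, e.g. "Witaj"
echo translateTo('german', 'welcome'); // no German entry, so falls back to the English one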
Friend, don't worry. If you have an application built in CodeIgniter and you want to add a language pack, just follow these steps:
1. Add the language files in the folder application/language/arabic (I added an Arabic language pack to sma2, built in CI).
2. Go to the file named setting.php in application/modules/settings/views/setting.php. Here you find the array:
$lang = array (
    'english' => 'English',
    'arabic'  => 'Arabic', // I added this here
    'spanish' => 'Español'
);
Now save and run the application. It works fine.
