How does TYPO3 generate the index of a file, as seen in its XML website export data?

The following XML code is an excerpt from a file generated using the TYPO3 website export feature. The export was configured to include the files used on the website:
...
<files_fal type="array">
<file index="008f35a8201e50eb24a9667092782ec0" type="array">
<filename>somefilename.jpg</filename>
<filemtime>1603259011</filemtime>
<content base64="1">...</content>
<content_sha1>...</content_sha1>
</file>
</files_fal>
...
What I'd like to know is how exactly the value of the index attribute (008f35a8201e50eb24a9667092782ec0) is generated in such a case. Thank you for any clues in that regard.

This looks pretty much like one of the hashes stored in the sys_file table. It should be identifier_hash for it to make any sense.

I found the answer myself now (by looking into the PHP source code of TYPO3).
The string is created during export by concatenating the storage uid of the file (from the sys_file table) and the identifier_hash from the same table, with a : in between.
The resulting string is then turned into a hash using md5.
In pseudo-code:
hash = make_md5( storage_uid + ':' + identifier_hash )
Depending on how your MD5 function works, it may be necessary to encode the string as bytes beforehand.
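The same computation as a minimal sketch, in Python for illustration (the input values are placeholders):
import hashlib

def export_file_index(storage_uid: int, identifier_hash: str) -> str:
    # both values come from the sys_file table; the export joins them
    # with ':' and hashes the result with MD5
    raw = f"{storage_uid}:{identifier_hash}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()

# placeholder values for illustration
print(export_file_index(1, "40bd001563085fc35165329ea1ff5c5ecbdbbeef"))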

Related

Loading pre-trained CBOW/skip-gram embeddings from a file that has unknown encoding?

I'm trying to load pre-trained word embeddings for the Arabic language (Mazajak embeddings: http://mazajak.inf.ed.ac.uk:8000/). The embeddings file does not have a particular extension and I'm struggling to get it to load. What's the usual process to load these embeddings?
I've tried reading it with open("get_sg250", encoding = encoding) as file: file.readlines() for different encodings, but none of them seems to be the answer (utf-8 does not work at all). With windows-1256 I get gibberish, e.g.:
['8917028 300\n',
'</s> Hل®:0\x16ء:؟X§؛R8ڈ؛\xa0سî9K\u200fƒ::m¤9¼»“8¤p\u200c؛tعA:UU¾؛...',
...]
I've also tried using pickle but that also doesn't work.
Any suggestions on what I could try out?
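One thing worth checking first: the 8917028 300 header looks like a word2vec binary header (vocabulary size followed by vector dimensionality), which would explain why no text encoding works. A minimal sketch with gensim, assuming the file really is in word2vec binary format:
from gensim.models import KeyedVectors

# Assumption: "8917028 300" means 8,917,028 vocabulary entries with
# 300-dimensional vectors, i.e. the standard word2vec binary format.
# unicode_errors="ignore" guards against vocabulary bytes that do not
# decode cleanly as UTF-8.
vectors = KeyedVectors.load_word2vec_format(
    "get_sg250", binary=True, unicode_errors="ignore"
)
print(vectors.vector_size)  # expected: 300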

Returning the filename of the current sketch

I am trying to write a GUI that will display the name of the sketch it was generated from using a simple text() command. However, I am running into trouble getting any of the general JS solutions to work for me. Many solutions I have found use the filename reserved word, but that does not seem to be reserved in Processing 3.5.4. I have also tried parsing the strings using a similar method to what can be found here. I am very new to Processing and this is only my second attempt at using it.
Any advice would be greatly appreciated.
You can get the path (as a string) to the sketch with sketchPath().
From there you could either parse the string (pull off everything after the last slash) to get the sketch name, or you can use sketchFile() to get a reference to the file itself and get the name from there:
String path = sketchPath();
File file = sketchFile(path);
String sketchName = file.getName();
println(sketchName);
You could combine this all into one line like so:
String sketchName = sketchFile(sketchPath()).getName();

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one by one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dicts into a Pandas DataFrame, or store them as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlow Model Analysis functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery) or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name',
                )]
            )
        )]
    )
)
statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or whatever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
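For example, a rough sketch of that last step (assuming the parsed features are dense tensors; sparse features would need an extra conversion step):
import pandas as pd

rows = []
for record in dataset:
    # each parsed record is a dict mapping feature names to tensors;
    # .numpy() assumes the tensors are dense
    rows.append({name: tensor.numpy() for name, tensor in record.items()})

pd.DataFrame(rows).to_csv('inference_results.csv', index=False)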
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get this to a real public / production state.
I'm a little late to this party, but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a DataFrame of userIds, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-10 classes
        predictions = [
            x.decode() for x in
            message.predict_log.response.outputs['output_2'].string_val[:10]
        ]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(
            message.predict_log.request.inputs['input_1'].string_val[0]
        )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"]
    )
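A hypothetical call site (the artifact path and file pattern are placeholders):
# glob the BulkInferrer artifact directory, then build the frame
inference_files = tf.io.gfile.glob(
    '/path/to/bulkinferrer/inference_result/prediction_logs-*.gz'
)
df = parse_prediction_logs(inference_files)
print(df.head())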

ColdFusion server file with apostrophe character

When I try to upload a file with an apostrophe in its name, I get the error:
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
If the file name is test's.pdf, I get the error, but if I change the name to test.pdf, there is no error.
Does anyone know why?
Thanks
I had a similar situation where I was dynamically creating filenames for pages that created excel files from query results. The approach I took was to create a function that replaced all the bad characters with something. Here is part of that function.
<!--- wrapper reconstructed for completeness; the function name is illustrative --->
<cffunction name="cleanFileName" returntype="string" output="no">
    <cfargument name="fileNameIn" required="yes" type="string">
    <cfargument name="replacementString" required="no" default=" ">
    <cfscript>
        var inValidFileNameCharacters = "[/\\*'?[\]:><""|]";
        return reReplace(arguments.fileNameIn, inValidFileNameCharacters, arguments.replacementString, "all");
    </cfscript>
</cffunction>
You might want to consider an opposite approach. Instead of declaring invalid characters and replacing them, declare valid ones and replace anything that is not in the list of valid characters.
I suggest making this a function that's available on all appropriate pages. How you do that depends on your situation.
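For illustration, a minimal sketch of that whitelist approach, in Python rather than CFML (the allowed character class is an assumption):
import re

def sanitize_filename(name: str, replacement: str = "_") -> str:
    # whitelist: keep letters, digits, dot, hyphen and underscore;
    # everything else (including apostrophes) is replaced
    return re.sub(r"[^A-Za-z0-9._-]", replacement, name)

print(sanitize_filename("test's.pdf"))  # -> test_s.pdf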
My guess is that the apostrophe is one of those typographic (curly) apostrophes that Microsoft Word often substitutes. A character like that may not be a valid character for your OS file system.
You may want to re-code the system to use a temporary file name on upload and then rename it to a valid file name after the upload is successful.
Here's some basic troubleshooting info.
Wrap your code in a try/catch block and dump the full error to the page output. The examples below use try/catch/dump and force an error by dividing by zero.
For tag-based CFML:
<cftry>
<cfset offendingCode = 1 / 0>
<cfcatch type="any">
<cfdump var="#cfcatch#" label="cfcatch">
</cfcatch>
</cftry>
For cfscript CFML:
<cfscript>
try {
offendingCode = 1 / 0;
} catch (any e) {
writeDump(var=e, label="Exception");
}
</cfscript>

How to get an array from an XML file in Swift

I am new to Swift, but I have made an Android app where a string array is selected from an XML file. This is a large XML file that contains a lot of string arrays, and the app gets the relevant string array based on a user selection.
I am now trying to develop the same app for iOS using Swift. I would like to use the same XML file, but I cannot see an easy way to get the correct array. For example, part of the XML looks like this:
<string-array name="OCR_Businessstudies_A_Topics">
<item>1. Business objectives and strategic decisions</item>
<item>2. External influences facing businesses</item>
<item>3. Marketing and marketing strategies</item>
<item>4. Operational strategy</item>
<item>5. Human resources</item>
<item>6. Accounting and financial considerations</item>
<item>7. The global environment of business</item>
</string-array>
<string-array name="OCR_Businessstudies_AS_Topics">
<item>1. Business objectives and strategic decisions</item>
<item>2. External influences facing businesses</item>
<item>3. Marketing and marketing strategies</item>
<item>4. Operational strategy</item>
<item>5. Human resources</item>
<item>6. Accounting and financial considerations</item>
</string-array>
If I have the string "OCR_Businessstudies_A_Topics", how do I get the corresponding "OCR_Businessstudies_A_Topics" array from the XML file?
This is very straightforward in Android, and although I have used online tutorials for Swift, it seems like I have to parse the XML file but I do not seem to be getting anywhere.
Is there a better approach than trying to parse the whole XML file?
Thanks
Barry
You can write your own XML parser using NSXMLParser (implementing NSXMLParserDelegate), or use a library like HTMLReader:
let fileURL = NSBundle.mainBundle().URLForResource("data", withExtension: "xml")!
let xmlData = NSData(contentsOfURL: fileURL)!
let topic = "OCR_Businessstudies_A_Topics"
let document = HTMLDocument(data: xmlData, contentTypeHeader: "text/xml")
for item in document.nodesMatchingSelector("string-array[name='\(topic)'] item") {
print(item.textContent)
}
