Can't load bad data with AnzoGraph - graph-databases

I'm trying to load a filtered Wikidata dump into AnzoGraph using LOAD WITH 'global' <file:wdump-749.nt.gz> INTO GRAPH <WD_749>. The file exists, but AnzoGraph gives this error:
Error - At Turtle production subject=http://www.wikidata.org/entity/Q144> predicate=http://www.wikidata.org/prop/direct/P1319> file=wdump-749.nt.gz line=3229 details: -34000-01-01T00:00:00Z:Datum is not a datetime, use setting 'load_normalize_datetime' to patch bad data
I've set load_normalize_datetime=true in settings.conf and settings_anzograph.conf inside AnzoGraph's filesystem and restarted the server, but I still can't load the dump; I get the exact same error.

load_normalize_datetime does not take a boolean. It takes the datetime value that bad datetimes are changed to during loads, e.g. 0001-01-01T00:00:00Z.
So instead try setting:
load_normalize_datetime=0001-01-01T00:00:00Z
in your settings.conf, which worked for me on that specific file using the command you listed.
WD_749 has 38,131,614 statements and loaded in 372 seconds on my ThinkPad. It was relatively slow (102k triples per second) because it is a single file. If you break it up into smaller pieces (you can do this with the COPY command to dump the graph to a dir: location, e.g. dir:/mydir/wdump-749.nt.gz), it will load in parallel (for me 114 seconds, 335k tps).
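If you would rather split the dump outside the database instead of using COPY, N-Triples is line-oriented (each line is a complete triple), so a small script can chunk the file safely. A minimal sketch, assuming Python, with made-up file names and chunk size (not AnzoGraph-specific):
import gzip
import itertools
import os

CHUNK_LINES = 5_000_000  # triples per output file; tune to your memory/CPU budget

os.makedirs("mydir", exist_ok=True)
with gzip.open("wdump-749.nt.gz", "rt", encoding="utf-8") as src:
    for part in itertools.count():
        lines = list(itertools.islice(src, CHUNK_LINES))
        if not lines:
            break
        out_name = "mydir/wdump-749-part%03d.nt.gz" % part
        with gzip.open(out_name, "wt", encoding="utf-8") as out:
            out.writelines(lines)
The resulting pieces can then be loaded in parallel in the same way.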

Related

How To Upload A Large File (>6MB) To Salesforce Through A Lightning Component Using Apex Aura Methods

I am aiming to take a file a user attaches through a Lightning Component and create a document object containing the data.
So far I have overcome the request size limits by chunking the data being uploaded into 1MB chunks. When the Apex Aura method receives these chunks of data it will either create a new document (if it is the first chunk), or will retrieve the existing document and add the new chunk to the end.
Data is received Base64 encoded, and then decoded server-side.
As the document data is stored as a Blob, the original file contents will be read as a String, and then appended with the chunk received. The new contents are then converted back into a Blob to be stored within the ContentVersion object.
The problem I'm having is that strings in Apex have a maximum length of 6,000,000 characters or so. Whenever the file size exceeds 6MB, this limit is hit during the concatenation, which causes the file upload to halt.
I have attempted to avoid this limit by converting the Blob to a String only when necessary for the concatenation (as suggested here https://developer.salesforce.com/forums/?id=906F00000008w9hIAA), but this hasn't worked. I'm guessing it was patched, because it's still technically allocating a string larger than the limit.
The appending code is really simple so far:
// Fetch the document created from the first chunk
ContentVersion originalDocument = [SELECT Id, VersionData FROM ContentVersion WHERE Id = :<existing_file_id> LIMIT 1];
Blob originalData = originalDocument.VersionData;
// Decode the incoming chunk
Blob appendedData = EncodingUtil.base64Decode(<base_64_data_input>);
// Concatenation via String - this is where the ~6,000,000 character limit is hit
Blob newData = Blob.valueOf(originalData.toString() + appendedData.toString());
originalDocument.VersionData = newData;
You will have a hard time with it.
You could try offloading the concatenation to an asynchronous process (@future/Queueable/Schedulable/Batchable); they get 12MB of heap instead of 6. It could buy you some time.
You could try cheating by embedding an iframe (Visualforce or lightning:container tag? Or maybe a "canvas app") that would grab your file and do some manual JavaScript magic calling the normal REST API for document upload: https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/dome_sobject_insert_update_blob.htm (the last code snippet is about multiple documents). Maybe jsforce?
Can you upload it somewhere else (SharePoint? Heroku?) and have that system call into SF to push the files (no Apex = no heap size limit)? Or even look up "Files Connect".
Can you send an email with attachments? Crude, but if you write a custom Email-to-Case handler class you'll have 36 MB of heap.
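For the option of having an external system push the file into Salesforce through the REST API, here is a rough sketch, assuming Python with the requests library on that external system; the instance URL, token handling, and file name are placeholders, and larger files need the multipart form described in the linked docs:
import base64
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"  # placeholder
ACCESS_TOKEN = "00D...your_oauth_token"                  # placeholder, obtained via any OAuth flow

with open("big_file.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    INSTANCE_URL + "/services/data/v52.0/sobjects/ContentVersion",
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    json={
        "Title": "big_file",
        "PathOnClient": "big_file.pdf",
        "VersionData": encoded,  # base64-encoded blob; no Apex heap limit involved
    },
)
resp.raise_for_status()
print(resp.json())  # contains the Id of the new ContentVersion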
You wrote "we needed multiple files to be uploaded and the multi-file-upload component provided doesn't support all extensions". That may be caused by these:
In Experience Builder sites, the file size limits and types allowed follow the settings determined by site file moderation.
lightning-file-upload doesn't support uploading multiple files at once on Android devices.
If the Don't allow HTML uploads as attachments or document records security setting is enabled for your organization, the file uploader cannot be used to upload files with the following file extensions: .htm, .html, .htt, .htx, .mhtm, .mhtml, .shtm, .shtml, .acgi, .svg.

SageMaker Studio trial component chart not showing

I am wondering why I am unable to see the loss and accuracy curves in the SageMaker Studio trial components chart.
I am using TensorFlow's Keras API for training.
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point="sm_entrypoint.sh",
    source_dir=".",
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    framework_version="2.4",
    py_version="py37",
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'loss: ([0-9.]+)'},
        {'Name': 'val:loss', 'Regex': 'val_loss: ([0-9.]+)'},
        {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9.]+)'},
        {'Name': 'val:accuracy', 'Regex': 'val_accuracy: ([0-9.]+)'}
    ],
    enable_sagemaker_metrics=True
)
estimator.fit(
    inputs="s3://xxx",
    experiment_config={
        "ExperimentName": "urbansounds-20211027",
        "TrialName": "tf-classical-NN-20211027",
        "TrialComponentDisplayName": "Train"
    }
)
The regexes appear to be capturing the metrics correctly: under the Metrics tab it shows 12 values for each metric, corresponding to the 12-epoch cycle I specified.
However, the chart is empty. The x-axis is in time here, but it is also empty when I switch it to epoch.
tldr: in your entry_point source code sm_entrypoint.sh, you need to explicitly tell the experiment tracker which epoch each metric value belongs to, using the log_metric() function.
There are two ways a tracker can work with SageMaker Experiments: (1) you log the metric in your entry_point code and use the metric_definitions argument in the estimator to teach SageMaker to parse the metric from the logs, the way you did it, or (2) you explicitly create a Tracker instance inside your entry_point and invoke its log_metric() method. Apparently only method (2) tells the SageMaker tracker which epoch each metric entry is registered to.
I found the answer in a random video https://youtu.be/gMnkfPztIHU?t=141, after days of searching :(
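A minimal sketch of method (2), assuming the Python training script that sm_entrypoint.sh launches uses Keras; the callback class name and the model/dataset variables are illustrative, not from the question:
import tensorflow as tf
from smexperiments.tracker import Tracker

class ExperimentLogger(tf.keras.callbacks.Callback):
    def __init__(self, tracker):
        super().__init__()
        self.tracker = tracker

    def on_epoch_end(self, epoch, logs=None):
        # iteration_number is what gives the Studio chart its epoch x-axis
        for name, value in (logs or {}).items():
            self.tracker.log_metric(metric_name=name, value=float(value),
                                    iteration_number=epoch)

# Inside a SageMaker training job, Tracker.load() picks up the trial component
# the job is running under.
with Tracker.load() as tracker:
    # model, train_ds and val_ds are assumed to be defined earlier in the script
    model.fit(train_ds, validation_data=val_ds, epochs=12,
              callbacks=[ExperimentLogger(tracker)])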
Oh, there is also a catch: SageMaker images do not have the Tracker package installed, so if you just have from smexperiments.tracker import Tracker in your entry_point source code, your SageMaker Estimator will complain. You will need to install sagemaker-experiments for your image by:
creating a requirements.txt file that has sagemaker-experiments==0.1.35 in it;
specifying the source dir by including source_dir="./dir_that_contains_requirements.txt" in your estimator creation.

How can I get the real img file name by src in Selenium?

Using LogoImg.GetAttribute("src") I get the following src:
https://scol.stage-next.sc.local/lspprofile/5a2e7338d6e9a927741175e2/image?id=5a2fbc98d6e9a9177c8c1592
But the real name of the file is: TestImage - 9fb0c49d-69b1-49ed-8c63-4283e405b781.jpg
If I enter the src in my browser, the file is downloaded with its real name.
How can I get the real name of the file in Selenium, as I need it for a test?
Well, the task was solved by other means; I just compared the differences in the src. But an answer to the question would still be interesting.
The src attribute you are able to retrieve is as follows:
https://scol.stage-next.sc.local/lspprofile/5a2e7338d6e9a927741175e2/image?id=5a2fbc98d6e9a9177c8c1592
This is a reference to the resource stored in the database, so it wouldn't be possible to retrieve the name 9fb0c49d-69b1-49ed-8c63-4283e405b781.jpg before the file gets downloaded.
To ensure the download is completed and then read the filename, you will need to use either of the following:
glob.glob() or fnmatch:
https://stackoverflow.com/a/4296148/771848
Watchdog module to monitor changes within a directory:
python selenium, find out when a download has completed?
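A minimal sketch of the glob-based idea (the question's snippet looks like C#, but the linked approaches are Python, so this assumes the Python bindings; the download directory, timeout, and temp-file extensions are assumptions):
import glob
import os
import time

def wait_for_download(download_dir, timeout=30):
    """Return the name of the newest fully-downloaded file in download_dir."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = glob.glob(os.path.join(download_dir, "*"))
        # skip in-progress downloads (browser-specific temporary extensions)
        files = [f for f in files if not f.endswith((".crdownload", ".part", ".tmp"))]
        if files:
            return os.path.basename(max(files, key=os.path.getmtime))
        time.sleep(0.5)
    raise TimeoutError("download did not finish in time")

# after triggering the download, e.g. driver.get(logo_img.get_attribute("src")):
# real_name = wait_for_download("/path/to/configured/download/dir")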

Python Google App Engine 'Attachment' object does not support indexing

Since sometime after 3pm EST on January 9th I am getting
TypeError: 'Attachment' object does not support indexing errors when trying to access the data portion of an email attachment:
attach = mail_message.attachments.pop()
encodedAttachment = attach[1]
The format of the emails I am processing has not changed in that time, and this code worked flawlessly up until then.
The latest version (1.8.9) has introduced an Attachment class that is now returned instead of the (filename, content) tuple that was returned previously. The class does implement __iter__, so unpacking works exactly the same:
filename, content = attachment
But it doesn't implement __getitem__, so accessing it by index, as you're doing, will cause the error you're seeing. It's possible that creating an issue will get the code changed to be completely backwards-compatible, but the practical thing to do would be to change your code.
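Applied to the snippet from the question, a sketch of that change would be:
attach = mail_message.attachments.pop()
# unpack instead of indexing; works on both the old tuple and the new Attachment class
filename, encodedAttachment = attach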

Rollback Doctrine data:load when insert from fixtures fails

I have often noticed that when a database insert for a model fails, the data loaded previously stays in the database. So when you try to load the same fixture file again, it gives an error.
Is there any way the data:load process can be made atomic, i.e. go or no-go for all data, so that data is never inserted halfway?
Hopefully this should work:
Write a task that does the same as data:load but wrap it in:
$databaseManager = new sfDatabaseManager($this->configuration);
$conn = $databaseManager->getDatabase('doctrine')->getDoctrineConnection();
$conn->beginTransaction();
try {
    // ............... (the same loading work data:load does)
    $conn->commit();
} catch (Exception $e) { // maybe you can be more specific about the exception thrown
    echo $e->getMessage();
    $conn->rollback();
}
Fixtures are meant for loading initial data, which means you should be able to run doctrine:build --all --and-load, or in other words, clear all data and re-load the fixtures. It doesn't take any longer.
One option you have is to break your fixtures into multiple files and load them individually. This is also what I'd do if you first need to load large amounts of data via a script or from a CSV (i.e. something bigger than just a few fixtures). This way you don't need to redo it if you had a fixtures problem somewhere else.
