Reading edge list data set in apache giraph? - giraph

I'm using SNAP dataset for social network analysis. SNAP uses simple edge list as a data format. How to read SNAP dataset in Apache Giraph?

As per I know SNAP has various data formats depending upon which dataset you are looking at. If the dataset that you are looking at has the format : sourceid destinationid on each line then you might want to use IntNullTextEdgeInputFormat (it's in giraph-core/src/main/java/org/apache/giraph/io/formats ).
Also take a look at various predefined formats available in the same folder. If none of those fit for your dataset format then you can write your own input format class (it will be really simple if you start from the predefined formats and edit it as you need).

use -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat

Yes, SNAP uses Simple Edge List format for representing graph databases. You can use this code for converting it to a JSON format which is accepted by Apache Giraph.

Related

How to Extract .owl and save to mysql

I have a file ontobible.owl. how to extract that file and then save data to mysql (because I want display data from ontobible.owl in website). can anyone help me?
edited:
here is my ontobible.owl file (https://teamtrainit.com/ontobible.owl)
i've try open ontobible.owl with sublime text 3 and contains like this
<Verse rdf:about="http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#HOS5_2">
<verseID>HOS5_2</verseID>
<verse_text>And the revolters are profound to make slaughter, though I have been a rebuker of them all.</verse_text>
</Verse>
<Verse rdf:about="http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#2CH2_1">
<hasPerson rdf:resource="http://semanticbible.org/ns/2006/NTNames#god_1324"/>
<hasPerson rdf:resource="http://www.co-ode.org/roberts/family-tree.owl#solomon_2762"/>
<verseID>2CH2_1</verseID>
<verse_text>And Solomon determined to build an house for the name of the LORD, and an house for his kingdom.</verse_text>
</Verse>
how to convert that xml tag to array or json so I cant save it to mysql database
you have several options for extracting data from owl
use owl-api and write java code (i think owl api is accessible in other languages) to extract data and pack it in the format you need. also you can use sparql queries for extracting data via jena api
install protege, open your file in protege and save it in format json-dl. this format is very similar to the regular json and you can easily transform it for your needs
install fuseki server, add your file and using sparql queries extract data from there
i think that the second option is the easiest for start if you don't want to write queries or code and it won't take long

how to visualize geodata (Points) in databricks builtin map?

I simply have a dataframe of "lat" and "lon" coordinates that I would like to visualize on a map.
I am trying to view my data data on the map. But nothing is being displayed.
In my case, the mmsi is the id.
Please note that I have around 2M points to display. Is that possible in databricks?
If there is no way around (plotting 2,000,000 points) in databricks, then what tool can handle large amount of data?
Any help is much appreciated!!
The built in map does not support latitude / longitude data points.
Delivering 2 million data points directly to a browser is problematic. Databricks has a protective limit of 20 MB for HTML display. Your viewers won't be able to take in and visually process that level of detail at any point in time.
I recommend implementing a filter/summarization strategy to enable the display within Databricks notebooks. Some ideas can be taken from this Databricks post: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Also this Anaconda post gives an excellent survey of data visualization packages, some (like Datashader, Vaex) implement a visual summarization strategies before rendering images: https://www.anaconda.com/blog/python-data-visualization-2018-why-so-many-libraries
The error message clearly says that " Unrecognizable values in the first column. The values should be either country codes in ISO 3166-1 alpha-3 format (e.g. "GBR") or US state abbreviations (e.g. "TX")."
Note: To plot a graph of the world, use country codes in ISO 3166-1 alpha-3 format as the key.
A Map Graph is a way to visualize your data on a map.
Plot Options... was used to configure the graph below.
Keys should contain the field with the location.
Series groupings is always ignored for World Map graphs.
Values should contain exactly one field with a numerical value.
Since there can multiple rows with the same location key, choose "Sum", "Avg", "Min", "Max", "COUNT" as the way to combine the values for a single key.
Different values are denoted by color on the map, and ranges are always spaced evenly.
Reference: Databricks - Charts and Graph
Hope this helps.

Converting sensor tag data in DSX

I'm working on converting the existing recipe for Data Science Experience (DSX) to use data from a connected Sensor Tag device. However the mobile applications for that device send the data as strings rather than numerics - this is causing the DSX recipe that calculates a Z score to choke. The data is coming from a cloudant db used as a histtorian for Watson IoT Platform so I cant simply reformat it there. Is there a simple way to convert the data inside a DSX notebook ?
Just access the row object and convert it:
cloudantdata.rdd.map(lambda row : float(row.temperature)).take(10)
EDIT 30.1.17:
To directly address your question:
df = cloudantdata.selectExpr("timestamp as timestamp", "data.d.objectTemp as temperature").map(lambda row : (row.timestamp,float(row.temperature)))
That way you get a tuple RDD which IMHO anyway is more usable as a RowRDD
I'm not familiar with DSX but you can use node red to parse the information from devices then store it in cloudant db in numeric format

pattern matching data sets without regexs

i have a large data set, tcpdump capture file exported to text.
I want to find any like unknown pattern in the data set and return the fields
and data sorted. basically we wish to look for unknown "patterns" in the data set.
i know we can use a regex with like awk and grep, but from a pure research perspective, im given a data set 5Gb and 10Gb, in which i want to find unknown like matching patterns....
they could be from any field or data structure. Ive used wireshark in various modes to sort structures

How can I extract human-readable text from a code snippet?

I need to write a T-SQL query against a text column where some of the values are html or asp.net coding but include normal human-readable text. For example:
{\colortbl ;\red31\green73\blue125;\red0\green0\blue0;} \viewkind4\uc1\pard\ltrpar\lang1033\f0\fs22 All invoices to be emailed to Jack Jack.Marsman#brampton.ca
I don't need that information I need the real text; in this case I want to get just All invoices to be emailed to Jack Jack.Marsman#brampton.ca
Any suggestions on how to go about extracting the text without getting the coding?
Short answer is that there is no easy standard way to do this. I’d try creating a CLR since this kind of parsing is easier in C# or VB.NET.
You can also try using regex to strip out everything that’s not human readable.
Is all of your data in similar format like you already shown? If that’s the case then it comes down to calling substring several times…

Resources