How do I output only a subset of a graph? - giraph

I have a graph computation that starts with a subset of vertices of a certain type and propagates information through the graph to a set of target vertices, which are also a subset of the graph. I want to output information only from those particular vertices, but I don't see a way to do this in the various VertexOutputFormat subclasses, which all seem oriented to outputting something for every vertex in the graph. How do I do this? E.g., are there hooks for the output phase where I can filter output? Or am I supposed to write a VertexOutputFormat implementation that generates no output for the vertices that have no data? Thanks in advance.

You can simply extend the class and add an if-condition; that will do the trick.
For instance, here is a class which will print out only vertices with even ids:
public class ExampleTextVertexOutputFormat extends
    TextVertexOutputFormat<LongWritable, LongWritable, NullWritable> {

  @Override
  public TextVertexWriter createVertexWriter(
      TaskAttemptContext context) throws IOException, InterruptedException {
    return new ExampleTextVertexLineWriter();
  }

  /**
   * Writes one line per vertex with an even id; all other vertices
   * produce no output.
   */
  private class ExampleTextVertexLineWriter extends TextVertexWriterToEachLine {
    @Override
    protected Text convertVertexToLine(
        Vertex<LongWritable, LongWritable, NullWritable> vertex) throws IOException {
      if (vertex.getId().get() % 2 == 0) {
        return new Text(vertex.getId().toString());
      }
      // Returning null produces no output line for this vertex (the line writer skips null).
      return null;
    }
  }
}
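For the case in the original question, where only target vertices that actually received data should be written, the same pattern can be applied to the vertex value instead of the id. Here is a minimal sketch, assuming (this is an assumption, not from the code above) that the vertex value is a LongWritable hop count and that unreached vertices keep a sentinel value of -1:

private class TargetOnlyVertexLineWriter extends TextVertexWriterToEachLine {
  @Override
  protected Text convertVertexToLine(
      Vertex<LongWritable, LongWritable, NullWritable> vertex) throws IOException {
    // Only vertices that actually received a hop count are written out;
    // the sentinel -1 marks vertices that were never reached (assumed convention).
    if (vertex.getValue().get() >= 0) {
      return new Text(vertex.getId() + "\t" + vertex.getValue());
    }
    return null; // no output line for vertices without data
  }
}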

Related

Summing a number from a random number source

I'm just starting to learn Flink and trying to build a very basic toy example which sums an integer over time and periodically prints the total sum so far.
I've created a random number generator source class like this:
// RandomNumberSource.java
public class RandomNumberSource implements SourceFunction<Integer> {
  public volatile boolean isRunning = true;
  private Random rand;

  public RandomNumberSource() {
    this.rand = new Random();
  }

  @Override
  public void run(SourceContext<Integer> ctx) throws Exception {
    while (isRunning) {
      ctx.collect(rand.nextInt(200));
      Thread.sleep(1000L);
    }
  }

  @Override
  public void cancel() {
    this.isRunning = false;
  }
}
As you can see, it generates a random number every second.
Now how would I go about summing the numbers as they are generated?
// StreamJob.java
public class StreamingJob {
  public static void main(String[] args) throws Exception {
    // set up the streaming execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());

    // pseudo code:
    // randomNumber
    //   .window(Time.seconds(5))
    //   .reduce(0, (acc, i) => acc + i) // (initial value, reducer)
    //   .sum()

    // execute program
    env.execute("Flink Streaming Random Number Sum Aggregation");
  }
}
I've added pseudo code to explain what I'm trying to do, i.e. every 5 seconds, sum all the numbers seen in that interval and print the total.
I feel like I'm missing something in my approach and could use some guidance on how to do this.
The window operator is used for keyed streams. You should instead use windowAll for this task. Here's the snippet:
randomNumber
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .sum(0)
    .print()
    .setParallelism(1);
Also check this for reference on various window considerations.
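For completeness, here is a minimal sketch of how that snippet could be wired into the job from the question. It simply reuses the RandomNumberSource class above; nothing else is assumed beyond the standard Flink imports shown.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StreamingJob {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());

    // Every 5 seconds of processing time, sum the integers seen in that window and print the result.
    randomNumber
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .sum(0)
        .print()
        .setParallelism(1);

    env.execute("Flink Streaming Random Number Sum Aggregation");
  }
}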

How to split and merge data (vectors) in Apache's Flink, without using windows

I need to split a cube of integers into vectors, perform some operation on each vector (a simple addition, say), and then merge the vectors back into a cube. The vector operations should be performed in parallel (i.e. a vector per stream). The cubes are objects that contain an ID.
I can split the cube into vectors, create a tuple using the cube ID, then use keyBy(id) to create a partition per cube's vectors. However, it seems like I have to use a window of some time unit to do this. The application is very latency sensitive, so I would prefer to combine the vectors as they arrive, perhaps using some kind of logical clock (I know how many vectors are in a cube), and when the last vector arrives, send the reassembled cube downstream. Is this possible in Flink?
Here's a code snippet exemplifying this idea:
//Stream topology..
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Cube> stream = env
    //Take cubes from the collection and send them downstream
    .fromCollection(cubes)
    //Split the cube (int[][][]) into vectors (int[]) and send them downstream
    .flatMap(new VSplitter()) //returns a tuple with the id at pos 1
    .keyBy(1)
    //For each element in each vector, add one to its value.
    .map(new MapFunction<Tuple2<CubeVector, Integer>, Tuple2<CubeVector, Integer>>() {
      @Override
      public Tuple2<CubeVector, Integer> map(Tuple2<CubeVector, Integer> cVec) throws Exception {
        CubeVector cv = cVec.getField(0);
        cv.cubeVectorAdd(1);
        cVec.setField(cv, 0);
        return cVec;
      }
    })
    //** Merge vectors back to a cube **//
    .
    .
    .
//The cube splitter to vectors..
public static class VSplitter implements FlatMapFunction<Cube, Tuple2<CubeVector, Integer>> {
  @Override
  public void flatMap(Cube cube, Collector<Tuple2<CubeVector, Integer>> out) throws Exception {
    for (CubeVector cv : cubeVSplit(cube)) {
      //out.assignTimestamp()
      out.collect(new Tuple2<CubeVector, Integer>(cv, cube.getId()));
    }
  }
}
You could use a RichFlatMapFunction which keeps collecting CubeVectors in keyed state until it has seen enough of them to reconstruct a Cube. The following code snippet should do the trick:
DataStream<Tuple2<CubeVector, Integer>> input = ...

input.keyBy(1).flatMap(
    new RichFlatMapFunction<Tuple2<CubeVector, Integer>, Cube>() {

      private final ListStateDescriptor<CubeVector> cubeVectorsStateDescriptor =
          new ListStateDescriptor<CubeVector>(
              "cubeVectors",
              new CubeVectorTypeInformation());

      private final ValueStateDescriptor<Integer> cubeVectorCounterDescriptor =
          new ValueStateDescriptor<>(
              "cubeVectorCounter",
              BasicTypeInfo.INT_TYPE_INFO);

      private ListState<CubeVector> cubeVectors;
      private ValueState<Integer> cubeVectorCounter;

      @Override
      public void open(Configuration parameters) {
        cubeVectors = getRuntimeContext().getListState(cubeVectorsStateDescriptor);
        cubeVectorCounter = getRuntimeContext().getState(cubeVectorCounterDescriptor);
      }

      @Override
      public void flatMap(Tuple2<CubeVector, Integer> cubeVectorIntegerTuple2, Collector<Cube> collector) throws Exception {
        cubeVectors.add(cubeVectorIntegerTuple2.f0);

        // The counter state is null the first time a key is seen.
        final Integer oldCounterValue = cubeVectorCounter.value();
        final int newCounterValue = (oldCounterValue == null ? 0 : oldCounterValue) + 1;

        if (newCounterValue == NUMBER_CUBE_VECTORS) {
          Cube cube = createCube(cubeVectors.get());
          cubeVectors.clear();
          cubeVectorCounter.update(0);
          collector.collect(cube);
        } else {
          cubeVectorCounter.update(newCounterValue);
        }
      }
    });

How to define an array in hadoop partitioner

I am new to Hadoop and MapReduce programming and don't know what I should do. I want to define an array of int in a Hadoop partitioner; I want to fill this array in the main function and use its contents in the partitioner. I have tried IntWritable and an array of it, but neither worked. I also tried IntArrayWritable, but that didn't work either. I would appreciate it if someone could help. Thank you so much.
public static IntWritable h = new IntWritable[1];

public static void main(String[] args) throws Exception {
  h[0] = new IntWritable(1);
}

public static class CaderPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return h[0].get();
  }
}
If you have a limited number of values, you can do it in the way below.
Set the values on the configuration object in the main method, like this:
Configuration conf = new Configuration();
conf.setInt("key1", value1);
conf.setInt("key2", value2);
Then implement the Configurable interface in your Partitioner class, get the configuration object, and read the keys/values from it inside your Partitioner:
public class testPartitioner extends Partitioner<Text, IntWritable> implements Configurable {

  private Configuration config = null;

  @Override
  public int getPartition(Text arg0, IntWritable arg1, int arg2) {
    //get your values based on the keys in the partitioner
    int value = getConf().getInt("key1", 0);
    //do stuff on value
    return 0;
  }

  @Override
  public Configuration getConf() {
    return this.config;
  }

  @Override
  public void setConf(Configuration configuration) {
    this.config = configuration;
  }
}
Supporting link:
https://cornercases.wordpress.com/2011/05/06/an-example-configurable-partitioner/
Note: if you have a huge number of values in a file, it is better to find a way to get cache files from the job object in the Partitioner; a rough sketch of that idea follows.
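As an illustration of that idea (not from the original answer), one simple variant is to store the file's path in the job configuration and load the values inside setConf(). The property name partition.values.path, the class name FilePartitioner, and the one-integer-per-line file format are all assumptions made for this sketch:

// In the driver: record where the values file lives (hypothetical path and property name).
Configuration conf = new Configuration();
conf.set("partition.values.path", "/user/me/partition-values.txt");

// In the partitioner: load the values once, when the framework injects the configuration.
public class FilePartitioner extends Partitioner<Text, IntWritable> implements Configurable {

  private Configuration config;
  private int[] values = new int[0];

  @Override
  public void setConf(Configuration configuration) {
    this.config = configuration;
    try {
      Path path = new Path(configuration.get("partition.values.path"));
      FileSystem fs = path.getFileSystem(configuration);
      List<Integer> loaded = new ArrayList<>();
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          loaded.add(Integer.parseInt(line.trim()));
        }
      }
      values = new int[loaded.size()];
      for (int i = 0; i < values.length; i++) {
        values[i] = loaded.get(i);
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not load partition values", e);
    }
  }

  @Override
  public Configuration getConf() {
    return this.config;
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Placeholder logic: use the loaded array however your partitioning scheme requires.
    if (values.length == 0 || numReduceTasks == 0) {
      return 0;
    }
    return Math.abs(values[0]) % numReduceTasks;
  }
}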
Here's a refactored version of the partitioner. The main changes are:
Removed the main(), which isn't needed; initialization should be done in the constructor
Removed static from the class and member variables
public class CaderPartitioner extends Partitioner<Text, IntWritable> {

  private IntWritable[] h;

  public CaderPartitioner() {
    h = new IntWritable[1];
    h[0] = new IntWritable(1);
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return h[0].get();
  }
}
Notes:
h doesn't need to be a Writable, unless you have additional logic not included in the question.
It isn't clear what the h[] is for. Are you going to configure it? In that case the partitioner will probably need to implement Configurable so you can use a Configuration object to set the array up in some way.

Protected? getSource and getTarget methods on JGraphT DefaultEdge class

The methods getSource() and getTarget() of org.jgrapht.graph.DefaultEdge are protected.
How should I access source and target vertices of each of the edges returned by the edgeSet() of org.jgrapht.graph.SimpleGraph ?
The code below shows what is happening.
import java.util.Set;

import org.jgrapht.UndirectedGraph;
import org.jgrapht.graph.DefaultEdge;
import org.jgrapht.graph.SimpleGraph;

public class TestEdges
{
    public static void main(String[] args)
    {
        UndirectedGraph<String, DefaultEdge> g =
            new SimpleGraph<String, DefaultEdge>(DefaultEdge.class);

        String A = "A";
        String B = "B";
        String C = "C";

        // add the vertices
        g.addVertex(A);
        g.addVertex(B);
        g.addVertex(C);

        g.addEdge(A, B);
        g.addEdge(B, C);
        g.addEdge(A, C);

        Set<DefaultEdge> edges = g.edgeSet();
        for (DefaultEdge edge : edges) {
            String v1 = edge.getSource(); // Error: getSource() is a protected method
            String v2 = edge.getTarget(); // Error: getTarget() is a protected method
        }
    }
}
The "correct" method to access edges source and target, according to JGraphT mailing list is to use the method getEdgeSource(E) and getEdgeTarget(E) from the interface Interface Graph<V,E> of org.jgrapht
the modification of the code is then
for (DefaultEdge edge : edges) {
    String v1 = g.getEdgeSource(edge);
    String v2 = g.getEdgeTarget(edge);
}
I was having a similar issue when trying to extract the values of the edges, and although it is not the OP's case, this might be helpful for anyone else facing this issue.
When I instantiated my graph and passed it an edge class:
DirectedGraph graph = new SimpleDirectedGraph(DefaultEdge.class);
NetBeans gave me the option of which DefaultEdge class file to import, and I chose the wrong one: I used the org.jgraph library instead of org.jgrapht.
If you are using the DefaultEdge class make sure you are using the one from jgrapht.
import org.jgrapht.graph.DefaultEdge;

Using Sencha GXT 3, generate a line chart populated with a dynamic number of line series fields?

Using Sencha GXT 3.0 is it possible to generate a line chart and populate it with a dynamic number of line series fields, and if so, what is the recommended method?
I know multiple series fields can be added to a chart, but the line chart examples (and the other chart examples, for that matter) make use of an interface which extends PropertyAccess<?>, and the interface specifies a static number of expected fields (e.g. data1(), data2(), data3(), etc.). If the interface is to be used to specify the fields to add to the chart, how could you account for a chart which may require n fields (i.e. n line series on a given chart)?
Example provided on Sencha's site:
http://www.sencha.com/examples/#ExamplePlace:linechart
I ran into the same issue. It would be a much nicer design if each series had a store instead of having one store per chart.
I had one long list of metric values in metricDataStore. Each metric value has a description. I wanted all the metric values with the same description displayed on one (and only one) series. I had my value providers for each series return null for both the x and y axis if the value wasn't supposed to be in the series.
This seems like a hack to me but it works for my usage:
myChart = new Chart<MetricData>();
myChart.setStore(metricDataStore);
.
.
.
for (MetricInfo info : metricInfoData) {
    LineSeries<MetricData> series = new LineSeries<MetricData>();
    series.setChart(myChart);
    series.setSmooth(false);
    series.setShownInLegend(true);
    series.setHighlighting(true);
    series.setYAxisPosition(Chart.Position.LEFT);
    series.setYField(new MetricValueProvider(info.getName()));
    series.setXAxisPosition(Chart.Position.BOTTOM);
    series.setXField(new MetricTimeProvider(info.getName()));
    myChart.addSeries(series);
}
.
.
.
private class MetricTimeProvider extends Object implements ValueProvider<MetricData, Long> {
    private String metricName;

    public MetricTimeProvider(String metricName) {
        this.metricName = metricName;
    }

    @Override
    public Long getValue(MetricData m) {
        if (metricName != null && metricName.equals(m.getLongDesc()))
            return m.getId();
        else
            return null;
    }

    @Override
    public void setValue(MetricData m, Long value) {
    }

    @Override
    public String getPath() {
        return null;
    }
}

private class MetricValueProvider extends Object implements ValueProvider<MetricData, Double> {
    private String metricName;

    public MetricValueProvider(String metricName) {
        this.metricName = metricName;
    }

    @Override
    public Double getValue(MetricData m) {
        if (metricName != null && metricName.equals(m.getLongDesc()))
            return m.getMetricValue();
        else
            return null;
    }

    @Override
    public void setValue(MetricData m, Double value) {
    }

    @Override
    public String getPath() {
        return null;
    }
}
