Appending values to DataSet in Apache Flink

I am currently writing a (simple) analysis code to sum time-connected power readings. Since the data is presumably raw (e.g. disturbances from the measuring device have not been filtered out), I have to account for disturbances by calculating the mean of the first one thousand samples. The calculation of the mean itself is not a problem. I am only unsure of how to generate the appropriate DataSet.
For now it looks roughly like this:
DataSet<Tupel2<long,double>>Gyrotron_1=ECRH.includeFields('11000000000'); // obviously the line to declare the first gyrotron, continues for the next ten lines, assuming separation of not occupied space
DataSet<Tupel2<long,double>>Gyrotron_2=ECRH.includeFields('10100000000');
DataSet<Tupel2<long,double>>Gyrotron_3=ECRH.includeFields('10010000000');
DataSet<Tupel2<long,double>>Gyrotron_4=ECRH.includeFields('10001000000');
DataSet<Tupel2<long,double>>Gyrotron_5=ECRH.includeFields('10000100000');
DataSet<Tupel2<long,double>>Gyrotron_6=ECRH.includeFields('10000010000');
DataSet<Tupel2<long,double>>Gyrotron_7=ECRH.includeFields('10000001000');
DataSet<Tupel2<long,double>>Gyrotron_8=ECRH.includeFields('10000000100');
DataSet<Tupel2<long,double>>Gyrotron_9=ECRH.includeFields('10000000010');
DataSet<Tupel2<long,double>>Gyrotron_10=ECRH.includeFields('10000000001');
for (int=1,i<=10;i++) {
DataSet<double> offset=Gyroton_'+i+'.groupBy(1).first(1000).sum()/1000;
}
It's the part in the for-loop I'm unsure of. Does anybody know if it is possible to append values to DataSets, and if so, how?
In case of doubt, I could always put the values into an array, but I do not know if that is the wise thing to do.

This code will not work for many reasons. I'd recommend looking into the fundamentals of Java and its basic data structures, and also into Flink.
It's really hard to understand what you are actually trying to achieve, but this is the closest I could come up with:
String[] codes = { "11000000000", ..., "10000000001" };
DataSet<Tuple2<Long, Double>> result = null; // env.fromElements() with no arguments fails, so start with null and union below
for (final String code : codes) {
    DataSet<Tuple2<Long, Double>> codeResult = ECRH.includeFields(code)
            .groupBy(1)
            .first(1000)
            .sum(0)
            .map(sum -> new Tuple2<>(sum.f0, sum.f1 / 1000d));
    result = (result == null) ? codeResult : codeResult.union(result);
}
result.print();
But please take the time to understand the basics before delving deeper. I also recommend using an IDE like IntelliJ, which would point out at least 6 issues in your code.

Related

(C) - How would one compare 2 txt files REQUESTS.txt and AVAILABLE.txt, separating each str read into a (STR6, STR3, STR3, INT) formatted Structure?

I have been working on this program for over a week with no breakthrough. The question states as follows:
A disc file 'REQUESTS.TXT' contains airline flight data formatted
(STR6, STR3, STR3, INT).
Example:
AA1011SFxLAx34 (American Airlines 1010, SF to LA, 34 seats)
W0924DNVDFW101 (Western 0924, DNV to DFW, 101 seats)
Another file 'AVAILABL.TXT' contains an unspecified number of reservation request records formatted identically as described above, except the Seats Available field is a Seats Requested field.
Guidelines:
Read reservation flights and process requests. If the request can be fulfilled (i.e. it is in AVAILABL and REQUESTS) then print "Reservation Processed", otherwise print "Reservation Denied".
Print out flight data file before and after reservations are processed, ordered by flight ID in a four(4) column format.
Print an overall outcome report for all requests processed (present totals for the number of requests satisfied and denied).
I have tried a few different approaches. I tried to split up the first STR6 by isalpha/isdigit and combine the pieces to make the FlightID (AA + 1011). I then proceeded to split the remaining characters between the two STR3 fields via isalpha and a for loop. Lastly, I tried to take the trailing digits for the number of seats on each loop iteration, multiplying the first digit by 100 (for a 3-digit value) or 10 (for a 2-digit value) and adding it to a running total for availSeats (INT). This, at least I thought, would produce:
AA+1011 = AA1011(STR6) // W+0924 = W0924(STR6)
SFx(STR3) // DNV(STR3)
LAx(STR3) // DFW(STR3)
(3*10)+(4*1) = 34(INT) // (1*100)+(0*10)+(1*1) = 101(INT)
All of this stored within a struct array, i.e.:
FlightData Flight;                  FlightData Flight;
Flight[0].flightID = AA1011;        Flight[1].flightID = W0924;
Flight[0].fromCity = SFx;           Flight[1].fromCity = DNV;
Flight[0].toCity = LAx;             Flight[1].toCity = DFW;
Flight[0].seatsAvail = 34;          Flight[1].seatsAvail = 101;
I am really at a loss right now and have no way to progress other than searching for different techniques/methods to make this work. I am clearly a beginner and will continue to practice and progress in C, but if anyone could give me a push in the right direction on how one would read this from a .txt file into a struct, that would be amazing. Also, if anyone has another method they used to solve this problem, I would love to analyze it. Thanks!
(This is my first post; I spent a lot of time formatting it to be clear on Stack Overflow, so if I messed up in areas, some constructive criticism would be useful! This applies to my posting and my coding practices. Thanks again!)
EDIT: The question I am asking here is how to successfully take a string such as AA1011SFxLAx34 and turn it into a structure like the above diagram. It must also work for the second string W0924DNVDFW101, which has only one letter in its ID (rather than two in AA1011). I'm not sure what else I am supposed to edit after reading the guidelines.
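Not an attempt at the full assignment, but as a rough illustration of the isalpha/isdigit splitting described above, parsing one record string into a struct could look something like this (the struct and function names are made up for the sketch, not taken from the original post):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for the (STR6, STR3, STR3, INT) layout. */
typedef struct {
    char flightID[7];   /* up to 6 characters plus '\0' */
    char fromCity[4];
    char toCity[4];
    int  seats;
} FlightData;

/* Split a record such as "AA1011SFxLAx34" or "W0924DNVDFW101":
   leading letters plus the following digits form the flight ID, the next
   two 3-character fields are the cities, the trailing digits are the seats.
   Returns 0 on success, -1 if the record is too short or malformed. */
static int parseFlight(const char *rec, FlightData *f) {
    size_t i = 0, id = 0, n = strlen(rec);

    while (i < n && id < 6 && isalpha((unsigned char)rec[i]))  /* airline letters */
        f->flightID[id++] = rec[i++];
    while (i < n && id < 6 && isdigit((unsigned char)rec[i]))  /* flight number   */
        f->flightID[id++] = rec[i++];
    f->flightID[id] = '\0';

    if (n - i < 7)                      /* need 3 + 3 + at least one digit */
        return -1;

    memcpy(f->fromCity, rec + i, 3); f->fromCity[3] = '\0'; i += 3;
    memcpy(f->toCity,   rec + i, 3); f->toCity[3]   = '\0'; i += 3;

    return sscanf(rec + i, "%d", &f->seats) == 1 ? 0 : -1;
}

int main(void) {
    FlightData f;
    if (parseFlight("W0924DNVDFW101", &f) == 0)
        printf("%s %s %s %d\n", f.flightID, f.fromCity, f.toCity, f.seats);
    return 0;
}

Reading the files line by line with fgets and calling a parser like this for each line would then fill the struct array described above.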
I consider this a homework question, so I am answering according to
How do I ask and answer homework questions?
Find a tutorial on C, work through it.
Then take a HelloWorld and modify it in small steps towards your goal, going from working program to working program. This way you should at least get to being able to read text from a file and print it.
Then learn to store parts of what you print into basic variables.
Then learn about structures.
And so on.
This way you will get quite close to the solution.
If that is not completely what you need, show the code you have at that point and ask a specific question about the first problem, explaining what you suspect the problem to be. Show code which has exactly that one problem, makes it visible, and produces no other warnings (compiling with at least e.g. gcc -Wall mycode).
Fix it with the help of the comments/answers you receive, and repeat.

how to use arima.rob

Does anyone use the arima.rob() function described by Eric Zivot and Jiahui Wang in "Modelling Financial Time Series with S-PLUS"?
I have a question about it:
I used a dataset of network traffic flows that contains anomalies, and I tried to predict the last part of the dataset with the robust ARIMA method (the arima.rob() function). I compared this model with arima.mle of S-PLUS. But unexpectedly, arima.rob's prediction was not better.
I'm not sure my code is correct, and the fault may lie there.
Please tell me if I used arima.rob inappropriately.
tmp.rr<-arima.rob((tmh75)~1,p=2,d=1,q=2,freq=24,maxiter=4,max.fcal=80000)
tmp.for<-predict(tmp.rr,n.predict=10,newdata=df1,se=T)
plot(tmp.for,tmh75)
summary(tmp.for)
My code for the classic ARIMA:
model <- list(list(order=c(2,1,2)),list(order=c(3,1,2),period=24))
fith <- arima.mle(tmh75-mean(tmh75),model=model)
foreh <- arima.forecast(tmh75,n=25,model=fith$model)
tsplot(tmh75,foreh$mean,foreh$mean+foreh$std.err,foreh$mean-foreh$std.err)

Filtering "Smoothing" an array of numbers in C

I am writing an application in Xcode. It gathers sensor data (gyroscope) and then transforms it through FFTW. At the end I get the result in an array. In the app I am plotting the graph, but there are so many peaks (see the graph in red) and I would like to smooth it.
My array:
double magnitude[S];
...
magnitude[i]=sqrt((fft_result[i][0])*(fft_result[i][0])+ (fft_result[i][1])*(fft_result[i][1]) );
An example array (for 30 samples, normally I am working with 256 samples):
"0.9261901713034604",
"2.436272348237486",
"1.618854900218465",
"1.849221286218342",
"0.8495016887742839",
"0.5716796354304043",
"0.4229791869017677",
"0.3731843430827401",
"0.3254446111798023",
"0.2542702545675339",
"0.25237940627189",
"0.2273716541964159",
"0.2012780334451323",
"0.2116151847259499",
"0.1921943719520009",
"0.1982429400169304",
"0.18001770452247",
"0.1982429400169304",
"0.1921943719520009",
"0.2116151847259499",
"0.2012780334451323",
"0.2273716541964159",
"0.25237940627189",
"0.2542702545675339",
"0.3254446111798023",
"0.3731843430827401",
"0.4229791869017677",
"0.5716796354304043",
"0.8495016887742839",
"1.849221286218342"
How can I filter/smooth it? What about a Gaussian filter? Any ideas on how to begin, or even a code sample, would help.
Thank you for your help!
best regards
josef
The simplest way to smooth would be to replace each sample with the average of it and its two neighbors.
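For instance, a minimal sketch of that three-point moving average, reusing the magnitude array and size S from the question and leaving the two edge samples unchanged:

double smoothed[S];

smoothed[0] = magnitude[0];             /* keep the edge samples as they are */
smoothed[S - 1] = magnitude[S - 1];
for (int i = 1; i < S - 1; i++)         /* each sample becomes the mean of itself and its neighbors */
    smoothed[i] = (magnitude[i - 1] + magnitude[i] + magnitude[i + 1]) / 3.0;

Applying the loop several times, or widening the window, smooths more strongly at the cost of flattening real peaks.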
The simplest idea would be taking the average of two adjacent points and putting it into an array. Something like:
double smooth_array[S];
for (i = 0; i < S-1; i++)
    smooth_array[i] = (magnitude[i] + magnitude[i+1])/2;
smooth_array[S-1]=magnitude[S-1];
It is not the best one, but I think it should be OK.
If you need a more scientific approach, use some kind of approximation algorithm, such as least-squares function approximation or even full SE13/SE35 etc. algorithms.

Testing an algorithm's speed. How?

I'm currently testing different algorithms that determine whether an integer is a perfect square or not. During my research I found this question on SO:
Fastest way to determine if an integer's square root is an integer
I'm comparably new to the programming scene. When testing the different algorithms that are presented in the question, I found out that this one
bool istQuadratSimple(int64 x)
{
    int32 tst = (int32)sqrt(x);
    return tst*tst == x;
}
actually runs faster than the one provided by A. Rex in the question I posted. I used an NSTimer object for this testing, printing my results with NSLog.
My question now is: How is speed-testing done in a professional way? How can I achieve equivalent results to the ones provided in the question I posted above?
The problem with calling just this function in a loop is that everything will be in the cache (both the data and the instructions). You wouldn't measure anything sensible; I wouldn't do that.
Given how small this function is, I would try to look at the generated assembly code of this function and the other one and I would try to reason based on the assembly code (number of instructions and the cost of the individual instructions, for example).
Unfortunately, it only works in trivial / near trivial cases. For example, if the assembly codes are identical then you know there is no difference, you don't need to measure anything. Or if one code is like the other plus additional instructions; in that case you know that the longer one takes longer to execute. And then there are the not so clear cases... :(
(See the update below.)
You can get the assembly with the -S flag from both clang and gcc (with clang, adding -emit-llvm gives LLVM IR instead of native assembly).
Hope this helps.
UPDATE: Response to Prateek's question in the comment "is there any way to determine the speed of one particular algorithm?"
Yes, it is possible, but it gets horribly complicated really quickly. Long story short, ignoring the complexity of modern processors and simply accumulating some predefined cost associated with each instruction can lead to very inaccurate results (the estimate can be off by a factor of 100, due to the cache and the pipeline, among other things). If you try to take into consideration the complexity of modern processors, the hierarchical cache, the pipeline, etc., things get very difficult. See for example Worst Case Execution Time Prediction.
Unless you are in a clear situation (trivial / near trivial case), for example the generated assembly codes are identical or one is like the other plus a few instructions, it is also hard to compare algorithms based on their generated assembly.
However, a simple two-line function is shown here, and for that, looking at the assembly could help. Hence my answer.
I am not sure if there is any professional way of checking the speed (if there is, let me know as well). For the method that you pointed to in your question, I would probably do something like this in Java.
package Programs;

import java.math.BigDecimal;
import java.math.RoundingMode;

public class SquareRootInteger {

    public static boolean isPerfectSquare(long n) {
        if (n < 0)
            return false;
        long tst = (long) (Math.sqrt(n) + 0.5);
        return tst * tst == n;
    }

    public static void main(String[] args) {
        long iterator = 1;
        int precision = 10;
        long startTime = System.nanoTime(); // Getting system time before calling the isPerfectSquare method repeatedly
        while (iterator < 1000000000) {
            isPerfectSquare(iterator);
            iterator++;
        }
        long endTime = System.nanoTime(); // Getting system time after the 1000000000 executions of isPerfectSquare method
        long duration = endTime - startTime;
        BigDecimal dur = new BigDecimal(duration);
        BigDecimal iter = new BigDecimal(iterator);
        System.out.println("Speed "
                + dur.divide(iter, precision, RoundingMode.HALF_UP).toString()
                + " nano secs"); // Getting average time taken for 1 execution of method.
    }
}
You can test your method in a similar fashion and check which one outperforms the other.
Record the time value before your massive calculation and the value after it. The difference is the elapsed time.
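A rough sketch of that idea in plain C, using the standard clock() call (this measures CPU time, not wall-clock time; the loop is only a stand-in for the calculation being timed):

#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t start = clock();                  /* time before the calculation */

    volatile long sink = 0;                   /* volatile so the compiler keeps the loop */
    for (long i = 0; i < 100000000L; i++)
        sink += i;

    clock_t end = clock();                    /* time after the calculation */
    printf("elapsed: %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}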
Write a shell script that runs the program, and run 'time ./xxx.sh' to get its running time.

Plotting a word-cloud by date for a twitter search result? (using R)

I wish to search Twitter for a word (let's say #google), and then be able to generate a tag cloud of the words used in tweets, but according to dates (for example, having a moving window of an hour that moves by 10 minutes each time and shows me how different words became more frequently used throughout the day).
I would appreciate any help on how to go about doing this regarding: resources for the information, code for the programming (R is the only language I am apt in using) and ideas on visualization. Questions:
How do I get the information?
In R, I found that the twitteR package has the searchTwitter command, but I don't know how big an "n" I can get from it. Also, it doesn't return the dates on which the tweets originated.
I see here that I could get up to 1500 tweets, but this requires me to do the parsing manually (which leads me to step 2). Also, for my purposes, I would need tens of thousands of tweets. Is it even possible to get them retrospectively? (for example, requesting older posts each time through the API URL?) If not, there is the more general question of how to create a personal store of tweets on your home computer (a question which might be better left to another SO thread, although any insights from people here would be very interesting for me to read).
How do I parse the information (in R)? I know that R has functions that could help in the RCurl and twitteR packages, but I don't know which, or how to use them. Any suggestions would be of help.
How do I analyse the data? How do I remove all the "not interesting" words? I found that the "tm" package in R has this example:
reuters <- tm_map(reuters, removeWords, stopwords("english"))
Would this do the trick? Or should I do something else/more?
Also, I imagine I would like to do that after cutting my dataset according to time, which will require some POSIX-like date functions (I am not exactly sure which would be needed here, or how to use them).
And lastly, there is the question of visualization. How do I create a tag cloud of the words? I found a solution for this here; any other suggestions/recommendations?
I believe I am asking a huge question here but I tried to break it to as many straightforward questions as possible. Any help will be welcomed!
Best,
Tal
Word/Tag cloud in R using "snippets" package
www.wordle.net
Using the openNLP package you could POS-tag the tweets (POS = part of speech) and then extract just the nouns, verbs, or adjectives for visualization in a word cloud.
Maybe you can query twitter and use the current system-time as a time-stamp, write to a local database and query again in increments of x secs/mins, etc.
There is historical data available at http://www.readwriteweb.com/archives/twitter_data_dump_infochimp_puts_1b_connections_up.php and http://www.wired.com/epicenter/2010/04/loc-google-twitter/
As for the plotting piece: I did a word cloud here: http://trends.techcrunch.com/2009/09/25/describe-yourself-in-3-or-4-words/ using the snippets package, my code is in there. I manually pulled out certain words. Check it out and let me know if you have more specific questions.
I note that this is an old question, and there are several solutions available via web search, but here's one answer (via http://blog.ouseful.info/2012/02/15/generating-twitter-wordclouds-in-r-prompted-by-an-open-learning-blogpost/):
require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)
##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus = function(df, my.stopwords=c()){
  #Install the textmining library
  require(tm)
  #The following is cribbed and seems to do what it says on the can
  tw.corpus = Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  # normalise case
  tw.corpus = tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
  tw.corpus
}
wordcloud.generate = function(corpus, min.freq=3){
  require(wordcloud)
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc = wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}
print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))
##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()
#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber = function(searchTerm, num=500){
  require(twitteR)
  rdmTweets = searchTwitter(searchTerm, n=num)
  tw.df = twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)
I would like to answer your question about making a big word cloud.
What I did is:
Use s0.tweet <- searchTwitter(KEYWORD,n=1500) for 7 days or more, such as THIS.
Combine them with this command:
rdmTweets = c(s0.tweet,s1.tweet,s2.tweet,s3.tweet,s4.tweet,s5.tweet,s6.tweet,s7.tweet)
The result:
This Square Cloud consists of about 9000 tweets.
Source: People voice about Lynas Malaysia through Twitter Analysis with R CloudStat
Hope it helps!
