How to write the content of a Flink var to screen in Zeppelin? - apache-flink

I am trying to run the following simple commands in Apache Zeppelin:
%flink
var rabbit = env.fromElements(
"ARTHUR: What, behind the rabbit?",
"TIM: It is the rabbit!",
"ARTHUR: You silly sod! You got us all worked up!",
"TIM: Well, that's no ordinary rabbit. That's the most foul, cruel, and bad-tempered rodent you ever set eyes on.",
"ROBIN: You tit! I soiled my armor I was so scared!",
"TIM: Look, that rabbit's got a vicious streak a mile wide, it's a killer!")
var counts = rabbit.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
counts.print()
I am trying to print the results in the notebook, but unfortunately I only get the following output:
rabbit: org.apache.flink.api.scala.DataSet[String] = org.apache.flink.api.scala.DataSet@37fdb65c
counts: org.apache.flink.api.scala.AggregateDataSet[(String, Int)] = org.apache.flink.api.scala.AggregateDataSet@1efc7158
res103: org.apache.flink.api.java.operators.DataSink[(String, Int)] = DataSink '<unnamed>' (Print to System.out)
How can I print the content of counts to the notebook in Zeppelin?

The way to print the result of such a computation in Zeppelin is:
%flink
counts.collect().foreach(println(_))
//or one might prefer
//counts.collect foreach println
Output:
(a,3)
(all,1)
(and,1)
(armor,1)
...

The reason for the observed behaviour lies in the interplay between Apache Zeppelin and Apache Flink. Zeppelin captures all output that goes through Scala's Console. Flink's counts.print(), however, writes directly to System.out, which bypasses Console and is therefore not captured. bzz's solution works because collect().foreach(println) prints through Console.
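The distinction is easy to reproduce in plain Scala; a minimal sketch (illustrative only, not Zeppelin's actual capture mechanism):
import java.io.ByteArrayOutputStream

val buf = new ByteArrayOutputStream()
// redirecting scala.Console simulates Zeppelin's output capture
Console.withOut(buf) {
  Console.println("via Console")       // lands in buf
  System.out.println("via System.out") // bypasses the redirection
}
print(buf.toString) // prints only "via Console"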
I opened a JIRA issue [1] and a pull request [2] to correct this behaviour, so that counts.print() also works.
[1] https://issues.apache.org/jira/browse/ZEPPELIN-287
[2] https://github.com/apache/incubator-zeppelin/pull/288

Related

How do I convert a binary field to text in Ruby?

I have an MSSQL table with a field of type image that has some text stored in it.
The field's data looks like this:
54004800490053002000490053002000410020004c00490047004800540041005200540020004f0052004400450052002e00200020004c004900470048005400410052005400200049005300200044004f0049004e004700200054004800450020004600410042002e000d000a004c00490047004800540041005200540020005300480049005000500049004e004700200054004f00200043005500530054004f004d004500520020003c0038002d00320033002d00310037003e000d000a000d000a0043006f006e006e00690065002c00200070006c00650061007300650020007000720069006e007400200073007400690063006b00650072007300200066006f0072002000650061006300680020006f007500740065007200200062006f00780020007400680061007400200069006e0063006c0075006400650073002000740068006500200069006e0066006f003a000d000a0028003100290020006f00660020002800310029000d000a004c004100320020005400680072006500650020004c00610072006700650020000d000a00380036005c0022004c0020007800200036005c002200570020007800200038005c00220048000d000a004e00610074007500720061006c000d000a005000320030003900380031003000350020004d004f0044002000500069007a007a00610020005300750067006100720068006f007500730065002c00200055005400
In PHP I can write a SQL query to convert that data like this: SELECT CAST(CAST(CUST_ORDER_BINARY.BITS as VARBINARY(8000)) as VARCHAR(8000)) as result FROM CUST_ORDER_BINARY WHERE CUST_ORDER_ID = 'CO-299403S';
When I try the same thing in Ruby I get a result like this:
specs = VisualCustomer.connection.exec_query(sql).first
{"result"=>"T\u0000H\u0000I\u0000S\u0000 \u0000I\u0000S\u0000 \u0000A\u0000 \u0000L\u0000I\u0000G\u0000H\u0000T\u0000A\u0000R\u0000T\u0000 \u0000O\u0000R\u0000D\u0000E\u0000R\u0000.\u0000 \u0000 \u0000L\u0000I\u0000G\u0000H\u0000T\u0000A\u0000R\u0000T\u0000 \u0000I\u0000S\u0000 \u0000D\u0000O\u0000I\u0000N\u0000G\u0000 \u0000T\u0000H\u0000E\u0000 \u0000F\u0000A\u0000B\u0000.\u0000\r\u0000\n\u0000L\u0000I\u0000G\u0000H\u0000T\u0000A\u0000R\u0000T\u0000 \u0000S\u0000H\u0000I\u0000P\u0000P\u0000I\u0000N\u0000G\u0000 \u0000T\u0000O\u0000 \u0000C\u0000U\u0000S\u0000T\u0000O\u0000M\u0000E\u0000R\u0000 \u0000<\u00008\u0000-\u00002\u00003\u0000-\u00001\u00007\u0000>\u0000\r\u0000\n\u0000\r\u0000\n\u0000C\u0000o\u0000n\u0000n\u0000i\u0000e\u0000,\u0000 \u0000p\u0000l\u0000e\u0000a\u0000s\u0000e\u0000 \u0000p\u0000r\u0000i\u0000n\u0000t\u0000 \u0000s\u0000t\u0000i\u0000c\u0000k\u0000e\u0000r\u0000s\u0000 \u0000f\u0000o\u0000r\u0000 \u0000e\u0000a\u0000c\u0000h\u0000 \u0000o\u0000u\u0000t\u0000e\u0000r\u0000 \u0000b\u0000o\u0000x\u0000 \u0000t\u0000h\u0000a\u0000t\u0000 \u0000i\u0000n\u0000c\u0000l\u0000u\u0000d\u0000e\u0000s\u0000 \u0000t\u0000h\u0000e\u0000 \u0000i\u0000n\u0000f\u0000o\u0000:\u0000\r\u0000\n\u0000(\u00001\u0000)\u0000 \u0000o\u0000f\u0000 \u0000(\u00001\u0000)\u0000\r\u0000\n\u0000L\u0000A\u00002\u0000 \u0000T\u0000h\u0000r\u0000e\u0000e\u0000 \u0000L\u0000a\u0000r\u0000g\u0000e\u0000 \u0000\r\u0000\n\u00008\u00006\u0000\\\u0000\"\u0000L\u0000 \u0000x\u0000 \u00006\u0000\\\u0000\"\u0000W\u0000 \u0000x\u0000 \u00008\u0000\\\u0000\"\u0000H\u0000\r\u0000\n\u0000N\u0000a\u0000t\u0000u\u0000r\u0000a\u0000l\u0000\r\u0000\n\u0000P\u00002\u00000\u00009\u00008\u00001\u00000\u00005\u0000 \u0000M\u0000O\u0000D\u0000 \u0000P\u0000i\u0000z\u0000z\u0000a\u0000 \u0000S\u0000u\u0000g\u0000a\u0000r\u0000h\u0000o\u0000u\u0000s\u0000e\u0000,\u0000 \u0000U\u0000T\u0000"}
So the data is "almost" there. :)
I've tried gsub-ing to remove the \u0000 from the result, but that's not working, obviously.
** EDIT 1 **
So, for some reason, getting the data from MSSQL into Ruby is causing some kind of partial translation. I never get the raw data from the field; instead I get the "semi-translated" data. Even if I just query it, it still comes out like:
"T\x00H\x00I\x00S\x00 \x00I\x00S\x00 \x00A\x00...
I tried to convert it back by doing:
s = order_specs.each_byte.map { |b| b.to_s(16) }.join
Then, when I do:
order_specs = s.scan(/.{2}(?=0{2})/).map{|s| s.to_i(16)}.pack("c*").tr("\x02", " ")
I just get an empty string. :/
That only happens when you're inspecting the data; when you write it out, it will be fine.
Example:
$ ruby -e 'bin = File.read("/bin/ls");p bin; File.open("/tmp/file","w+"){|f| f.write bin}'
"\u007FELF\u0002\u0001\u0001\u0000\u0000\u0000 ...
....
$ md5sum /bin/ls
84b7b042405dfc79f2afe9b12d6b931d /bin/ls
$ md5sum /tmp/file
84b7b042405dfc79f2afe9b12d6b931d /tmp/file
So here we read the binary file /bin/ls and wrote it to another file, /tmp/file; as you can see, the checksums are identical.
s = "54004800490053002000490053002000410020004c00490047004800540041005200540020004f0052004400450052002e00200020004c004900470048005400410052005400200049005300200044004f0049004e004700200054004800450020004600410042002e000d000a004c00490047004800540041005200540020005300480049005000500049004e004700200054004f00200043005500530054004f004d004500520020003c0038002d00320033002d00310037003e000d000a000d000a0043006f006e006e00690065002c00200070006c00650061007300650020007000720069006e007400200073007400690063006b00650072007300200066006f0072002000650061006300680020006f007500740065007200200062006f00780020007400680061007400200069006e0063006c0075006400650073002000740068006500200069006e0066006f003a000d000a0028003100290020006f00660020002800310029000d000a004c004100320020005400680072006500650020004c00610072006700650020000d000a00380036005c0022004c0020007800200036005c002200570020007800200038005c00220048000d000a004e00610074007500720061006c000d000a005000320030003900380031003000350020004d004f0044002000500069007a007a00610020005300750067006100720068006f007500730065002c00200055005400"
Code:
puts s.scan(/.{2}(?=0{2})/).map{|s| s.to_i(16)}.pack("c*")
Output:
THISISALIGHTARTORDER.LIGHTARTISDOINGTHEFAB.
LIGHTARTSHIINGTOCUSTOMER<8-23-17>
Connie,leaserintstickersforeachouterboxthatincludestheinfo:
(1)of(1)
LA2ThreeLarge
86\"Lx6\"Wx8\"H
Natural
29815MODizzaSugarhouse,UT
Note: some characters are unprintable, so they do not appear above.
Or, if you replace "\x02" with a space,
puts s.scan(/.{2}(?=0{2})/).map{|s| s.to_i(16)}.pack("c*").tr("\x02", " ")
you get:
THIS IS A LIGHTART ORDER. LIGHTART IS DOING THE FAB.
LIGHTART SHIING TO CUSTOMER <8-23-17>
Connie, lease rint stickers for each outer box that includes the info:
(1) of (1)
LA2 Three Large
86\"L x 6\"W x 8\"H
Natural
29815 MOD izza Sugarhouse, UT
I finally figured this out. I needed to do string.gsub("\u0000", '')
So I was getting the data from the MSSQL database correctly, it seems, but the null bytes were throwing things off and were being sent to the front end, where they appeared on the page. I swear I tried gsub-ing before, but for whatever reason it wasn't working. I tried it again now, at the point where the response is formed, and it is now being sent correctly.
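For completeness: the \u0000 bytes appear because the column stores UTF-16LE text, in which every ASCII character is followed by a null high byte; decoding the dump as UTF-16LE recovers the text exactly, including the characters the scan approach loses (the "PP" in "SHIPPING", for instance). Incidentally, the empty string in Edit 1 comes from b.to_s(16) dropping leading zeros, which misaligns the hex pairs. A minimal sketch of the decoding idea, in Scala for illustration, using the first characters of the dump above:
import java.nio.charset.StandardCharsets

// the first few bytes of the hex dump from the question
val hex = "5400480049005300200049005300200041002000"
// split into two-digit pairs, parse each pair as a byte, decode as UTF-16LE
val bytes = hex.sliding(2, 2).map(Integer.parseInt(_, 16).toByte).toArray
println(new String(bytes, StandardCharsets.UTF_16LE)) // prints "THIS IS A "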

PowerShell script returning the wrong screen resolution

I just wrote a simple PowerShell script to get the screen resolution of my monitor, but it seems to be returning the wrong values.
# Returns the screen width and height of the primary screen's resolution
Add-Type -AssemblyName System.Windows.Forms
function Get-ScreenSize {
    $screen = [System.Windows.Forms.Screen]::PrimaryScreen
    $width = $screen.Bounds.Width
    $height = $screen.Bounds.Height
    return $width, $height
}
Get-ScreenSize
I am running this script on a 4k monitor with the resolution set at 3840 x 2160, but it is giving me the following output:
1536
864
Is there anything that would cause System.Windows.Forms.Screen to get the wrong "Bounds" values?
Well, I didn't exactly find out why I was getting such strange results... but I did find another approach that seems simpler and appears to be accurate.
$vc = Get-WmiObject -class "Win32_VideoController"
$vc.CurrentHorizontalResolution
$vc.CurrentVerticalResolution
This prints the current screen resolution and appears to give accurate results, which is what I was actually looking for. If anyone figures out what causes the other approach to produce inaccurate results, I would still really like to know why it happens...
It's because that command gives you the scaled resolution. If you're running 3840 x 2160 but not at 100% scaling, you'll get a different value. Your output corresponds to 250% scaling: 3840 / 2.5 = 1536 and 2160 / 2.5 = 864.
That's odd.
Why on earth has Microsoft only provided the Get-DisplayResolution cmdlet with Server Core?
That edition ships without a Start button... and given the comment above about the returned display size (minus the taskbar), I won't be surprised to hear that cmdlet uses the same .NET code library.
A quick search in my HKLM\SYSTEM\CurrentControlSet\Control lists a few keys for monitors and values per screen, but nothing useful.
Edit: see Q7967699.
PS D:\Scripts> Add-Type -AssemblyName System.Windows.Forms
PS D:\Scripts> [System.Windows.Forms.Screen]::AllScreens
BitsPerPixel : 32
Bounds : {X=0,Y=0,Width=3840,Height=2160}
DeviceName : \\.\DISPLAY1
Primary : True
WorkingArea : {X=0,Y=0,Width=3840,Height=2120}
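Incidentally, the same scaled-versus-physical distinction can be observed from the JVM; a minimal Scala sketch (assumes Java 9 or later, where AWT exposes the display's scaling transform):
import java.awt.GraphicsEnvironment

val gc = GraphicsEnvironment.getLocalGraphicsEnvironment
  .getDefaultScreenDevice
  .getDefaultConfiguration
// the default transform carries the OS scaling factor, e.g. 2.5 at 250%
val scale = gc.getDefaultTransform.getScaleX
// getBounds is reported in scaled, user-space coordinates
val bounds = gc.getBounds
println(s"scale: $scale, bounds: ${bounds.width}x${bounds.height}")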

SWIFT OS X - multiple statements inside a closure statement, a debugging tool?

I am using the following code to filter a large array:
var arrayOfSelectedRowDetails = self.projectRowDetails.filter(
{ $0.projectNumber == self.projectNumberArray[selectedRow] }
)
Normally the code runs fine and I have no issues. But in one scenario (after I have deleted some managed objects from the persistent store), rerunning the code gives an EXC_BAD_ACCESS (code=1, address=0x0) error at runtime.
I have set a breakpoint and stepped through the execution of this statement. It is a large array built from a Core Data entity (using a fetch statement), and it therefore takes a long time. When I step through the first dozen or so indexes, the code runs OK; when I remove the breakpoint and let it run, it then presents the error.
Is it possible to println() from within the closure statement to assist with debugging? I have tried a number of different syntaxes and cannot get it to work.
Alternatively, is it possible to set an error capture statement within the closure so that the code ceases through a break or an abort() statement?
Fundamentally, I am trying to identify the index of the array at the point the error occurs, so that I have enough information to debug the delete function (which is where I think the error is). I do not seem to be able to ascertain the index from the information available to me when the error occurs.
This is the first time I have tried programming in Swift and making use of closures so I am learning as I go. Apologies if I am asking fundamental questions. I have not been able to find a similar question elsewhere here with an answer that works.
You can set an exception breakpoint in Xcode.
Also, I suggest that you move the access to self.projectNumberArray out of the closure:
let pn = self.projectNumberArray[selectedRow]
var arrayOfSelectedRowDetails = self.projectRowDetails.filter(
{ $0.projectNumber == pn }
)
The change might not solve the issue, but it will at least help the debugging.
Lastly, if you want to print the index, the following approach will probably work:
let pn = self.projectNumberArray[selectedRow]
var index = 0
var arrayOfSelectedRowDetails = self.projectRowDetails.filter(
{ println(index++); return $0.projectNumber == pn }
)
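For comparison, the same index-logging trick in Scala, where zipWithIndex avoids the manual counter (the names mirror the question and the data is hypothetical):
case class Row(projectNumber: Int)
val projectRowDetails = Seq(Row(1), Row(2), Row(1)) // stand-in data
val pn = 1

// zipWithIndex pairs each element with its index, so the closure can
// report how far it got before any crash
val arrayOfSelectedRowDetails = projectRowDetails.zipWithIndex.filter {
  case (row, i) =>
    println(i)
    row.projectNumber == pn
}.map(_._1)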

NDB query giving different results on local versus production environment

I am banging my head into a wall over this and hoping you can tell me the very simple thing I have overlooked in my sleep-deprived/noob state.
Very simply I am doing a query and the type of object returned is different on my local machine than what gets returned once I deploy the application.
match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]
On my local machine the above produces a MatchRealTimeStatsModel object, so I can run the following two lines without a problem:
logging.info(match) # outputs a MatchRealTimeStatsModel object
logging.info(match.match) # outputs a dictionary from json data
When the above two lines are run on Google's machines, however, I get the following:
logging.info(match) # outputs a dictionary from json data
logging.info(match.match) # AttributeError: 'dict' object has no attribute 'match'
Any suggestions as to what might be causing this? I cleared the data store and did everything I could think of to clean the GAE environment.
Edit #1: Adding MatchRealTimeStatsModel code:
class MatchRealTimeStatsModel(ndb.Model):
    match = ndb.JsonProperty()

    @classmethod
    def queryMatch(cls, ancestor_key):
        return cls.query(ancestor=ancestor_key).fetch()
And here is the actual call:
ancestor_key = ndb.Key('MatchRealTimeStatsModel', matchUniqueUrl)
match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]
Perhaps you are running a different version of your code locally than in production? Try redeploying the same source code in both places.

R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

Having learned loads from answers on this site (thanks!), it's finally time to ask my own question.
I'm using R (tm and lsa packages) to create, clean and simplify, and then run LSA (latent semantic analysis) on, a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6.
For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (backend database support provided by the 'filehash' package), or the newer 'tm.plugin.dc' option for so-called 'distributed' corpus processing. But I don't really understand how either one works under the bonnet.
An apparent bug using DCorpus with tm_map (not relevant right now) led me to do some of the preprocessing work with the PCorpus option instead. And it takes hours. So I use R CMD BATCH to run a script doing things like:
> # load corpus from predefined directory path,
> # and create backend database to support processing:
> bigCcorp = PCorpus(bigCdir, readerControl = list(load=FALSE), dbControl = list(useDb = TRUE, dbName = "bigCdb", dbType = "DB1"))
> # converting to lower case:
> bigCcorp = tm_map(bigCcorp, tolower)
> # removing stopwords:
> stoppedCcorp = tm_map(bigCcorp, removeWords, stoplist)
Now, supposing my script crashes soon after this point, or I just forget to export the corpus in some other form, and then I restart R. The database is still there on my hard drive, full of nicely tidied-up data. Surely I can reload it back into the new R session, to carry on with the corpus processing, instead of starting all over again?
It feels like a noodle question... but no amount of dbInit() or dbLoad() or variations on the 'PCorpus()' function seem to work. Does anyone know the correct incantation?
I've scoured all the related documentation, and every paper and web forum I can find, but total blank - nobody seems to have done it. Or have I missed it?
The original question is from 2013. Meanwhile, in February 2015, a duplicate (or at least similar) question was answered:
How to reconnect to the PCorpus in the R tm package?. The answer in that post is essential, although pretty minimalist, so I'll try to augment it here.
These are some comments I've just discovered while working on a similar problem:
Note that the dbInit() function is not part of the tm package.
First you need to install the filehash package, which the tm documentation only "suggests" to install. This means it is not a hard dependency of tm.
Supposedly, you can also use the filehashSQLite package with library("filehashSQLite") instead of library("filehash"); both packages have the same interface and work seamlessly together, thanks to their object-oriented design. So also install "filehashSQLite" (edit 2016: some functions, such as tm::content_transformer(), are not implemented for filehashSQLite).
Then this works:
library(filehashSQLite)

# this string becomes the filename, so it must not contain dots
# (e.g. "mydata.sqlite" is not permitted)
s <- "sqldb_pcorpus_mydata" # replace "mydata" with something more descriptive

if (!file.exists(s)) {
    # csv is a data frame of 900 documents, 18 cols/features
    pc <- PCorpus(DataframeSource(csv), readerControl = list(language = "en"),
                  dbControl = list(dbName = s, dbType = "SQLite"))
    dbCreate(s, "SQLite")
    db <- dbInit(s, "SQLite")
    set.seed(234)
    # add another record, just to show we can
    # (key = "test", value = "hi there")
    dbInsert(db, "test", "hi there")
} else {
    db <- dbInit(s, "SQLite")
    pc <- dbLoad(db)
}
show(pc)
# <<PCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 900
dbFetch(db, "test")
# remove it
rm(db)
rm(pc)
#reload it
db <- dbInit(s, "SQLite")
pc <- dbLoad(db)
# the corpus entries are now accessible, but not loaded into memory.
# now 900 documents are bound via "Active Bindings", created by makeActiveBinding() from the base package
show(pc)
# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
# ...
# [900]
#[883] "883" "884" "885" "886" "887" "888" "889" "890" "891" "892"
#"893" "894" "895" "896" "897" "898" "899" "900"
#[901] "test"
dbFetch(db, "900")
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 33
dbFetch(db, "test")
#[1] "hi there"
This is what the database backend looks like: you can see that the documents from the data frame have been encoded somehow inside the SQLite table.
[Screenshot: the SQLite table, as shown in the RStudio IDE.]
