What's the simplest reliable way to encode multiple JPEG images in a single byte string? - multipart/form-data

I need to publish a Google Cloud Pub/Sub message containing multiple JPEG images. The images need to go in the data body; putting them as base64-encoded strings in attributes won't work, because attribute values are limited to 1024 bytes:
https://cloud.google.com/pubsub/quotas#resource_limits
What's a simple and reliable pattern for doing that? Choosing some fixed delimiter might seem possible, but I want to avoid any chance of the delimiter occurring inside an image. Could a sequence like |||| occur in a JPEG byte array? Another possibility would be to encode the payload as multipart MIME, but I haven't found any general-purpose non-HTTP libraries for that. I need implementations in both Java/Scala and Python. Or can I just concatenate the JPEG byte arrays without any external delimiter, and split them based on their header identifiers?

You probably want to store data in some kind of schema-based message using something like Avro or Protocol Buffers. Both can generate code that can be used to serialize and deserialize messages in Java/Scala and Python.
For example, in protocol buffers, you could create a message in a file image.proto:
syntax = "proto3";
message Images {
  repeated bytes images = 1;
}
You can generate the Python code for this with the protoc compiler:
$ protoc -I=. --python_out=. image.proto
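For the Java/Scala side, the same compiler can emit Java classes (directly usable from Scala) with the --java_out flag:
$ protoc -I=. --java_out=. image.proto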
In Python3, to add images, serialize the message, and send it, you would do the following:
import image_pb2
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(<project name>, <topic name>)

def send_images(images):
    img_msg = image_pb2.Images()
    for i in images:
        img_msg.images.append(i)
    msg_data = img_msg.SerializeToString()
    message_future = publisher.publish(topic_path, data=msg_data)
    print(message_future.result())
To receive the images and process them:
import image_pb2
from google.cloud import pubsub_v1

def receive(message):
    images = image_pb2.Images()
    images.ParseFromString(message.data)
    for i in images.images:
        pass  # Process the image
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(<project name>, <subscription name>)
subscribe_future = subscriber.subscribe(subscription_path, receive)
print(subscribe_future.result())

It looks like the following approach may work, written in Scala, using only the JPEG format's own natural delimiters:
import scala.collection.mutable

def serializeJpegs(jpegs: Seq[Array[Byte]]): Array[Byte] =
  jpegs.foldLeft(Array.empty[Byte])(_ ++ _)

def deserializeJpegs(bytes: Array[Byte]): Seq[Array[Byte]] = {
  val JpegHeader = Array(0xFF.toByte, 0xD8.toByte)
  val JpegFooter = Array(0xFF.toByte, 0xD9.toByte)
  val Delimiter = JpegFooter ++ JpegHeader
  val jpegs: mutable.Buffer[Array[Byte]] = mutable.Buffer.empty
  var (start, end) = (0, 0)

  end = bytes.indexOfSlice(Delimiter, start) + JpegFooter.length
  while (end > JpegFooter.length) {
    jpegs += bytes.slice(start, end)
    start = end
    end = bytes.indexOfSlice(Delimiter, start) + JpegFooter.length
  }
  if (start < bytes.length) {
    jpegs += bytes.drop(start)
  }
  jpegs
}
I'm sure there's a more efficient and functional implementation, but that's for another day!
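Since the question asks for Python as well, here is a rough Python port of the same idea (a sketch, not battle-tested). One caveat for both versions: an image that embeds another JPEG, such as an EXIF thumbnail, can contain FF D9 FF D8 internally and would break the split; the schema-based approach above avoids that.
def serialize_jpegs(jpegs):
    # Concatenate the raw JPEG byte strings; each already starts with
    # FF D8 and ends with FF D9, so no extra delimiter is added.
    return b"".join(jpegs)

def deserialize_jpegs(data):
    # Split on the footer+header boundary between consecutive images.
    delimiter = b"\xff\xd9\xff\xd8"
    jpegs = []
    start = 0
    idx = data.find(delimiter, start)
    while idx != -1:
        end = idx + 2  # keep the FF D9 footer with the current image
        jpegs.append(data[start:end])
        start = end
        idx = data.find(delimiter, start)
    if start < len(data):
        jpegs.append(data[start:])
    return jpegs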

Related

How to resolve AttributeError "'array.array' object has no attribute 'read'" in Python

I am getting an "object has no attribute" error: AttributeError: 'array.array' object has no attribute 'read'. The code is attached below. Kindly guide me on how to resolve the error.
import os
from functools import reduce
import numpy
import array
from utils import *

def load_raw_data_with_mhd(filename):
    meta_dict = read_meta_header(filename)
    dim = int(meta_dict['NDims'])
    assert(meta_dict['ElementType'] == 'MET_FLOAT')
    arr = [int(i) for i in meta_dict['DimSize'].split()]
    volume = reduce(lambda x, y: x*y, arr[0:dim-1], 1)
    pwd = os.path.split(filename)[0]
    if pwd:
        data_file = pwd + '/' + meta_dict['ElementDataFile']
    else:
        data_file = meta_dict['ElementDataFile']
    print(data_file)
    fid = open(data_file, 'rb')
    binvalues = array.array('f')
    binvalues.read(fid, volume*arr[dim-1])
    if is_little_endian():  # assume data in file is always big endian
        binvalues.byteswap()
    fid.close()
    data = numpy.array(binvalues, numpy.float)
    data = numpy.reshape(data, (arr[dim-1], volume))
    return (data, meta_dict)
It looks like the error is caused by the line binvalues.read(fid, volume*arr[dim-1]). Here you are calling the read method on an array.array object, but that class has no read method; it was removed in Python 3 in favor of fromfile.
To fix this error, use the fromfile method of array.array to read the data from the file, like this:
binvalues = array.array('f')
binvalues.fromfile(fid, volume*arr[dim-1])
This should read the binary data from the file into the array.array object.
You may also want to consider reading the binary data directly into a NumPy array with numpy.fromfile, which avoids the intermediate array.array object entirely. Specifying a big-endian 32-bit float dtype (matching the MET_FLOAT element type) also makes the explicit byteswap unnecessary:
data = numpy.fromfile(fid, dtype='>f4', count=volume*arr[dim-1])
# '>' marks big-endian byte order, 'f4' a 4-byte float
You can then use the data array in the rest of your code.
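Putting it together, a minimal sketch of the corrected read path (assuming data_file, volume, arr, and dim are set as in the original function):
import numpy

with open(data_file, 'rb') as fid:
    # read volume*arr[dim-1] big-endian float32 values in one call;
    # numpy handles the byte order, so no byteswap() is needed
    data = numpy.fromfile(fid, dtype='>f4', count=volume*arr[dim-1])
data = numpy.reshape(data, (arr[dim-1], volume))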

Kotlin Parsing json array with new line separator

I'm using OKHttpClient in a Kotlin app to post a file to an API that gets processed. While the process is running the API is sending back messages to keep the connection alive until the result has been completed. So I'm receiving the following (this is what is printed out to the console using println())
{"status":"IN_PROGRESS","transcript":null,"error":null}
{"status":"IN_PROGRESS","transcript":null,"error":null}
{"status":"IN_PROGRESS","transcript":null,"error":null}
{"status":"DONE","transcript":"Hello, world.","error":null}
The objects seem to be separated by a newline character, not a comma.
I figured out how to extract the data with the following, but it seems error-prone to me; is there a more technically correct way to transform this?
data class Status(val status: String?, val transcript: String?, val error: String?)

val myClient = OkHttpClient().newBuilder().build()
val myBody = MultipartBody.Builder().build() // plus some stuff
val myRequest = Request.Builder().url("localhost:8090").method("POST", myBody).build()
val myResponse = myClient.newCall(myRequest).execute()
val myString = myResponse.body?.string()
val myJsonString = "[${myString!!.replace("}", "},")}]".replace(",]", "]")
// Forces the response from "{key:value}{key:value}"
// into a readable json format "[{key:value},{key:value},{key:value}]"
// but hoping there is a more technically sound way of doing this
val myTranscriptions = gson.fromJson(myJsonString, Array<Status>::class.java)
An alternative to your solution would be to use a JsonReader in lenient mode. This allows parsing JSON which does not strictly comply with the specification, such as in your case multiple top level values. It also makes other aspects of parsing lenient, but maybe that is acceptable for your use case.
You could then use a single JsonReader wrapping the response stream, repeatedly call Gson.fromJson and collect the deserialized objects in a list yourself. For example:
val gson = GsonBuilder().setLenient().create()
val myTranscriptions = myResponse.body!!.use {
    val jsonReader = JsonReader(it.charStream())
    jsonReader.isLenient = true
    val transcriptions = mutableListOf<Status>()
    while (jsonReader.peek() != JsonToken.END_DOCUMENT) {
        transcriptions.add(gson.fromJson(jsonReader, Status::class.java))
    }
    transcriptions
}
Though, if the server continuously provides status updates until processing is done, it might make more sense to process each parsed status as it arrives instead of collecting them all in a list first.

Gmail API .NET: Get full message

How do I get the full message, and not just the metadata, using the Gmail API?
I have a service account and I am able to retrieve a message, but only in the metadata, raw, and minimal formats. How do I retrieve the message in the full format? The following code works fine:
var request = service.Users.Messages.Get(userId, messageId);
request.Format = UsersResource.MessagesResource.GetRequest.FormatEnum.Metadata;
Message message = request.Execute();
However, when I omit the format (so the default format, which is FULL, is used) or change the format to UsersResource.MessagesResource.GetRequest.FormatEnum.Full, I get the error: Metadata scope doesn't allow format FULL
I have included the following scopes:
https://www.googleapis.com/auth/gmail.readonly,
https://www.googleapis.com/auth/gmail.metadata,
https://www.googleapis.com/auth/gmail.modify,
https://mail.google.com/
How do I get the full message?
I had to remove the metadata scope to be able to get the full message format.
The user in the SO post had the same error.
Try this out first:
1. Go to https://security.google.com/settings/security/permissions
2. Choose the app you are working with.
3. Click Remove > OK
Next time, request only the permissions you need.
Another thing: try to use gmailMessage.payload.parts[0].body.data and, to decode it into readable text, do the following from the SO post:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
System.out.println(StringUtils.newStringUtf8(Base64.decodeBase64(gmailMessage.payload.parts[0].body.data)));
You can also check this for further reference.
Try something like this:
public String getMessage(string user_id, string message_id)
{
    Message temp = service.Users.Messages.Get(user_id, message_id).Execute();
    var parts = temp.Payload.Parts;
    string s = "";
    foreach (var part in parts)
    {
        byte[] data = FromBase64ForUrlString(part.Body.Data);
        s += Encoding.UTF8.GetString(data);
    }
    return s;
}
public static byte[] FromBase64ForUrlString(string base64ForUrlInput)
{
    int padChars = (base64ForUrlInput.Length % 4) == 0 ? 0 : (4 - (base64ForUrlInput.Length % 4));
    StringBuilder result = new StringBuilder(base64ForUrlInput, base64ForUrlInput.Length + padChars);
    result.Append(String.Empty.PadRight(padChars, '='));
    result.Replace('-', '+');
    result.Replace('_', '/');
    return Convert.FromBase64String(result.ToString());
}
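For reference, the two fix-ups in FromBase64ForUrlString (restore '=' padding, then map '-' and '_' back to '+' and '/') are exactly what URL-safe base64 decoding does. As a quick cross-check, the Python standard library does the same in one call; data here is a stand-in for part.Body.Data:
import base64

def from_base64_url(data):
    # urlsafe_b64decode handles '-' and '_'; pad to a multiple of 4
    return base64.urlsafe_b64decode(data + '=' * (-len(data) % 4))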

Saving users and items features to HDFS in Spark Collaborative filtering RDD

I want to extract users and items features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:
[myname@host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
...
It is dumping the hash value of the array instead of the entire array. I did the following to print the desired values:
for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}
This does print what I want but I can't then write to HDFS (this prints on the console).
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
@JohnTitusJungao is right, and the following lines work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
Here is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The arrays are printed as the tokens you see in the output, e.g. [D@3c3137b5: [D means "array of double", followed by @ and a hex hash code; this is what Java's default toString produces for arrays. More on that here.
val users: RDD[(Int, Array[Double])] = model.userFeatures
To solve that, you'll need to turn the array into a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.

Unzip from a SQL Server text column to an image column

I have images of various formats (.png, .jpg, .bmp, etc.) stored as compressed text in a text column in a SQL Server 2005 table. I need to read the row, unzip the image and store it in an image column in another table.
I am using the SharpZip library, and all of the examples deal with file sources and destinations. I can't find anything that covers unzipping from a variable to another variable. A code snippet illustrating this or a link to a relevant resource would be much appreciated.
EDIT: A bit more information - the data is stored in a TEXT column. It appears as follows (text column abbreviated for display):
ImageID ImageData
1 FORMAT-ZIPV3 UEsDBBQAAAAIAOV6wzxdTnDvshs...
2 FORMAT-ZIPV3 UEsDBBQAAAAIAAF2yjxGncjOLgA...
3 FORMAT-ZIPV3 UEsDBBQAAAAIAKd6yjyjnQNr6gg...
4 FORMAT-ZIPV3 UEsDBBQAAAAIALdNyzyrPC8EMJw...
5 FORMAT-ZIPV3 UEsDBBQAAAAIAA1rOD1nZY1t0f0...
6 FORMAT-ZIPV3 UEsDBBQAAAAIANZplj2seyJ+VmM...
7 FORMAT-ZIPV3 UEsDBBQAAAAIAC5vhD27LPbPcv8...
8 FORMAT-ZIPV3 UEsDBBQAAAAIAK1qKz5DJNH3xMg...
9 FORMAT-ZIPV3 UEsDBBQAAAAIAHVkEztC3th/9hs...
10 FORMAT-ZIPV3 UEsDBBQAAAAIAEtXKz7DXHUdvow...
What I know for certain is that the images were compressed at some point in the process using SharpZip before being inserted into the table. It appears that the format information was added to the beginning of the data prior to inserting.
Looking at this data, would anyone have any insight on how this image data has been manipulated? Again, I need to get the uncompressed image data into a column of a data type conducive to reading for display on a web page.
EDIT: Ok, I'm stumped. Executing the following code produces the error, "Failed to convert parameter value from a Int32 to a Byte[]". It appears to be placing the length of the byte array into the byte array's value...
commandUncompressed.Connection = connectionUncompressed;
commandUncompressed.Parameters.Add("@Image_k", SqlDbType.VarChar, 10);
commandUncompressed.Parameters.Add("@ImageContents", SqlDbType.Image);
commandUncompressed.CommandText = sqlSaveImage;
connectionUncompressed.Open();
reader = command.ExecuteReader();
if (reader.HasRows)
{
    while (reader.Read())
    {
        Console.WriteLine(reader["Image_k"].ToString()); // Merely for testing
        String format = reader["ImageContents_Compressed"].ToString().Substring(0, 12);
        var offset = 13; // "FORMAT-ZIPV3 ".Length;
        var s = reader["ImageContents_Compressed"].ToString().Substring(offset);
        var bytes = Convert.FromBase64String(s);
        if (format == "FORMAT-ZIPV2") // no trailing space; format is only 12 chars
        {
            bytes = ConvertStringToBytes(s); // Not a Base-64 encoded string? External conversion function utilized.
        }
        using (var zis = new ZipInputStream(new MemoryStream(bytes)))
        {
            ZipEntry zipEntry = zis.GetNextEntry(); // Doesn't seem to work unless an entry has been referenced
            byte[] buffer = new byte[zis.Length];
            commandUncompressed.Parameters["@Image_k"].Value = reader["Image_k"].ToString();
            commandUncompressed.Parameters["@ImageContents"].Value = zis.Read(buffer, 0, buffer.Length);
            commandUncompressed.ExecuteNonQuery();
        }
    }
}
It appears to be reading the data from the source text column just fine. I just cannot figure out how to get that into the image type parameter. The value for buffer variable shows the length of the byte array, rather than the actual bytes. Maybe that's what the value property typically shows for byte arrays? I'm so close and yet so far away. :/
EDIT: Ok, I'm a knucklehead. I made the following correction, and it works!
zis.Read(buffer, 0, buffer.Length);
commandUncompressed.Parameters["@ImageContents"].Value = buffer;
At this point I am only able to process FORMAT-ZIPV3 data, as I haven't figured out how to decode the FORMAT-ZIPV2 strings yet. Following is a sampling of the V2 data. If anyone is able to determine the encoding, let me know. Would it be different if zipped using BZIP instead of ZIP format?
ImageID ImageData
1 FORMAT-ZIPV2 504B03041400020008005157422A2E25FDBAF26701008D6901000E...
2 FORMAT-ZIPV2 504B03041400020008009159422A7FC94BA2B2540500D35705000E...
3 FORMAT-ZIPV2 504B0304140002000800685A422A0CAA51F4473A0600B97206000E...
4 FORMAT-ZIPV2 504B03041400020008001D5D422A770BD3ED201902002C4A02000E...
5 FORMAT-ZIPV2 504B0304140002000800325E422A4B6C2FB4045001001C6E01000E...
6 FORMAT-ZIPV2 504B03041400020008006F72422A5F793AC1A1F00200ECF302000E...
7 FORMAT-ZIPV2 504B0304140002000800D572422A1B348A731DE5000085EB00000E...
8 FORMAT-ZIPV2 504B03041400020008003D73422A8AEBB7F855640300DD1B04000E...
9 FORMAT-ZIPV2 504B03041400020008006368D528C5D0A6BA794900004A2502000E...
10 FORMAT-ZIPV2 504B03041400020008008E5B6C2A2D9E9C33D7AF05005CEC05000E...
In response to a similar question, someone on sqlmonster.com provided a nifty VarBinaryStream class. It works with a column type of varbinary(max).
If your data is stored in a varbinary(max), and is in zip format, you could use that class to instantiate a VarBinaryStream, then instantiate a ZipInputStream around that, and ba-da-boom, you're there. Just read from the ZipInputStream.
In C# it might look like this:
using (var imageSrc = new VarBinarySource(connection,
                                          "Table.Name",
                                          "Column",
                                          "KeyColName",
                                          1))
{
    using (var s = new VarBinaryStream(imageSrc))
    {
        using (var zis = new ZipInputStream(s))
        {
            ....
        }
    }
}
If the images are small, then you probably wouldn't want all this streaming stuff. If the column is a binary(n) or a varbinary(n) where n is less than 8000, just use the SqlBinary type and read in all the data into memory, then instantiate a MemoryStream around that. Simpler. In VB.NET it looks something like this:
Dim bytes As Byte()
bytes = dr.GetSqlBinary(columnNumber).Value
Using ms As New MemoryStream(bytes)
    Using zis As New ZipInputStream(ms)
        ...
    End Using
End Using
Finally, I'm going to question the wisdom of applying zip compression to .jpg images, and similar. The jpg format is already compressed; compressing it again before putting the data into SQL Server won't cause the data to become appreciably smaller. It only increases processing time. If possible, I'd suggest you reconsider your design for storage of compressed images.
OK, with the update you provided containing the data format, you can draw some conclusions.
The data is an actual string. Suspecting that it was a Base64-encoded string, I ran a small test and used Convert.ToBase64String() on a byte stream containing a zip file. It looks like this: UEsDBBQAAAAIAJJyYyk3M56F+QIAA...
Aha! You have a base64-encoded (string) version of the byte data for a bona fide zip file. To decode it, strip the prefix, use Convert.FromBase64String() to get the byte array, wrap it in a MemoryStream, and read that with ZipInputStream.
something like this:
var offset = "FORMAT-ZIPV3 ".Length;
var s = sqlReader["CompressedImage"].ToString().Substring(offset);
var bytes = Convert.FromBase64String(s);
using (var zis = new ZipInputStream(new MemoryStream(bytes)))
{
    ...
    zis.Read(...);
    ...
}
If the data is really long, you'll want to stream it out of that table rather than read it into one big string and convert it. I don't know how large text columns can be, but supposing one could be 500 MB, you don't want a 500 MB string, and you don't want to run Convert.FromBase64String() on a 500 MB string. In that case you need to decode in chunks, for example with the FromBase64Transform class in the System.Security.Cryptography namespace.
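For what it's worth, the same chunked idea looks like this in Python (a sketch; it assumes the base64 text contains no whitespace, so every chunk length stays a multiple of 4 and each chunk decodes independently):
import base64

def stream_decode_base64(reader, chunk_chars=64 * 1024):
    # 4 base64 chars always decode to 3 bytes, so chunks sized in
    # multiples of 4 can be decoded independently and concatenated
    while True:
        chunk = reader.read(chunk_chars)
        if not chunk:
            break
        yield base64.b64decode(chunk)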
Editorial comment. It is sort of backwards to zip-compress image data. The images are probably compressed already. But to compound that backwardsness by then doing a base64 encode, thereby expanding the data... ??? That is triple backwards. That makes noooooo sense at all. I understand that's how your vendor supplied it.
OK, with your further update, using this as the format:
ImageID ImageData
1 FORMAT-ZIPV2 504B03041400020008005157422A2E25FDBAF26701008D6901000E...
2 FORMAT-ZIPV2 504B03041400020008009159422A7FC94BA2B2540500D35705000E...
That data is still zipfile data, but it is encoded as simple hex digits. You need to convert that to a byte array. Here's some code to do it.
public static class ConvertEx
{
    static readonly String prefix = "FORMAT-ZIPV2 ";

    public static string ToHexString(byte[] b)
    {
        System.Text.StringBuilder sb1 = new System.Text.StringBuilder();
        int i = 0;
        for (i = 0; i < b.Length; i++)
        {
            sb1.Append(System.String.Format("{0:X2}", b[i]));
        }
        return sb1.ToString().ToLower();
    }

    public static byte[] ToByteArray(string s)
    {
        if (s.StartsWith(prefix))
        {
            System.Console.WriteLine("removing prefix");
            s = s.Substring(prefix.Length);
        }
        s = s.Trim(); // whitespace
        System.Console.WriteLine("length: {0}", s.Length);
        var r = new byte[s.Length/2];
        for (int i = 0; i < s.Length; i += 2)
        {
            r[i/2] = (byte) Convert.ToUInt32(s.Substring(i, 2), 16);
        }
        return r;
    }
}
You can use that this way:
string s = GetStringContentFromDatabase();
var decoded = ConvertEx.ToByteArray(s);
using (var ms = new MemoryStream(decoded))
{
    // use DotNetZip to read the zip file
    // SharpZipLib is something similar...
    using (var zip = ZipFile.Read(ms))
    {
        // print out the list of entries in the zipfile
        foreach (var e in zip)
        {
            System.Console.WriteLine("{0}", e.FileName);
        }
    }
}
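As a quick sanity check outside .NET, the same hex-decode-and-unzip round trip takes a few lines of Python (3.9+ for removeprefix; s stands in for the column text):
import io
import zipfile

def check_v2(s):
    # strip the format prefix, decode the hex digits, open the zip in memory
    decoded = bytes.fromhex(s.removeprefix("FORMAT-ZIPV2 ").strip())
    with zipfile.ZipFile(io.BytesIO(decoded)) as z:
        print(z.namelist())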
The examples on the SharpZip wiki use Stream objects: while the sample uses a File, you could just as easily use a MemoryStream there and the sample would work the same.
