Stream of Char to Stream of Byte/Byte Array - arrays

The following code takes a String s, converts it into a char array, filters the digits from it, converts the result back to a string, and finally converts that into a byte array.
char charArray[] = s.toCharArray();
StringBuffer sb = new StringBuffer(charArray.length);
for(int i = 0; i < charArray.length; i++) {
if (Character.isDigit(charArray[i]))
sb.append(charArray[i]);
}
byte[] bytes = sb.toString().getBytes(Charset.forName("UTF-8"));
I'm trying to change the above code to a streams approach. The following works:
s.chars()
.sequential()
.mapToObj(ch -> (char) ch)
.filter(Character::isDigit)
.collect(StringBuilder::new,
StringBuilder::append, StringBuilder::append)
.toString()
.getBytes(Charset.forName("UTF-8"));
I think there could be a better way to do it.
Can we directly convert the Stream<Character> to byte[] and skip the conversion to a String in between?

First, both of your variants have the problem of not handling characters outside the BMP correctly.
To support these characters, there is codePoints() as an alternative to chars(). You can use appendCodePoint on the target StringBuilder to consistently use codepoints throughout the entire operation. For this, you have to remove the unnecessary .mapToObj(ch -> (char) ch) step, whose removal also eliminates the overhead of creating a Stream<Character>.
Then, you can avoid the conversion to a String in both cases, by encoding the StringBuilder using the Charset directly. In the case of the stream variant:
StringBuilder sb = s.codePoints()
.filter(Character::isDigit)
.collect(StringBuilder::new,
StringBuilder::appendCodePoint, StringBuilder::append);
ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
byte[] utf8Bytes = new byte[bb.remaining()];
bb.get(utf8Bytes);
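The loop variant can get the same treatment. A minimal sketch (class and method names are illustrative) that iterates by code point, so characters outside the BMP stay intact, and encodes the StringBuilder directly:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class DigitsToUtf8 {
    static byte[] filterDigitsUtf8(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // iterate by code point so surrogate pairs are handled correctly
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isDigit(cp)) sb.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        // encode the StringBuilder directly; no intermediate String is created
        ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
        byte[] utf8 = new byte[bb.remaining()];
        bb.get(utf8);
        return utf8;
    }

    public static void main(String[] args) {
        byte[] bytes = filterDigitsUtf8("abc123");
        System.out.println(bytes.length); // 3
    }
}
```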
Performing the conversion directly with the stream of codepoints is not easy. Not only is there no such support in the Charset API, there is no straightforward way to collect a Stream into a byte[] array.
One possibility is
byte[] utf8Bytes = s.codePoints()
.filter(Character::isDigit)
.flatMap(c -> c<128? IntStream.of(c):
c<0x800? IntStream.of((c>>>6)|0xC0, c&0x3f|0x80):
c<0x10000? IntStream.of((c>>>12)|0xE0, (c>>>6)&0x3f|0x80, c&0x3f|0x80):
IntStream.of((c>>>18)|0xF0, (c>>>12)&0x3f|0x80, (c>>>6)&0x3f|0x80, c&0x3f|0x80))
.collect(
() -> new Object() { byte[] array = new byte[8]; int size;
byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
},
(b,i) -> {
if(b.array.length == b.size) b.array=Arrays.copyOf(b.array, b.size*2);
b.array[b.size++] = (byte)i;
},
(a,b) -> {
if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
System.arraycopy(b.array, 0, a.array, a.size, b.size);
a.size+=b.size;
}).result();
The flatMap step converts the stream of codepoints to a stream of UTF-8 units (compare with the UTF-8 description on Wikipedia). The collect step collects the int values into a byte[] array.
It’s possible to eliminate the flatMap step by creating a dedicated collector which collects a stream of codepoints directly into a byte[] array:
byte[] utf8Bytes = s.codePoints()
.filter(Character::isDigit)
.collect(
() -> new Object() { byte[] array = new byte[8]; int size;
byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
void put(int c) {
if(array.length == size) array=Arrays.copyOf(array, size*2);
array[size++] = (byte)c;
}
},
(b,c) -> {
if(c < 128) b.put(c);
else {
if(c<0x800) b.put((c>>>6)|0xC0);
else {
if(c<0x10000) b.put((c>>>12)|0xE0);
else {
b.put((c>>>18)|0xF0);
b.put((c>>>12)&0x3f|0x80);
}
b.put((c>>>6)&0x3f|0x80);
}
b.put(c&0x3f|0x80);
}
},
(a,b) -> {
if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
System.arraycopy(b.array, 0, a.array, a.size, b.size);
a.size+=b.size;
}).result();
but it doesn’t add to readability.
You can test the solutions using a String like
String s = "some test text 1234 ✔ 3 𝟝";
and printing the result as
System.out.println(Arrays.toString(utf8Bytes));
System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
which should produce
[49, 50, 51, 52, -17, -68, -109, -16, -99, -97, -99]
12343𝟝
It should be obvious that the first variant is the simplest, and it will have reasonable performance, even if it doesn’t create a byte[] array directly. Further, it’s the only variant which can be adapted for getting other result charsets.
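To illustrate that adaptability, switching the result charset in the first variant is only a matter of passing a different Charset (the helper method below is made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OtherCharset {
    static byte[] filterDigits(String s, Charset cs) {
        StringBuilder sb = s.codePoints()
            .filter(Character::isDigit)
            .collect(StringBuilder::new,
                     StringBuilder::appendCodePoint, StringBuilder::append);
        // only the Charset argument changes; the collection step stays the same
        ByteBuffer bb = cs.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // "12" is 2 bytes in UTF-8 but 4 bytes in UTF-16BE
        System.out.println(filterDigits("a1b2", StandardCharsets.UTF_8).length);
        System.out.println(filterDigits("a1b2", StandardCharsets.UTF_16BE).length);
    }
}
```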
But even the
byte[] utf8Bytes = s.codePoints()
.filter(Character::isDigit)
.collect(StringBuilder::new,
StringBuilder::appendCodePoint, StringBuilder::append)
.toString().getBytes(StandardCharsets.UTF_8);
is not so bad, even though the toString() operation entails a copying operation.


How to search emoji with proper text via Entity Framework Core

Here is my code:
var emoji = "⭐";
var query = myContext.Products.Where(x => x.Name.Contains(emoji));
var queryString = query.ToQueryString();
var list = query.ToList();
The query returns all table records. If I replace Contains with an equality comparison it works great, but I have to search for something like this:
"this is my emoji ⭐"
This is the SQL query:
DECLARE @__emoji_0 nvarchar(4000) = N'⭐'
SELECT [p].[Id], [p].[Name], [p].[Quantity]
FROM [Products] AS [p]
WHERE (@__emoji_0 LIKE N'') OR (CHARINDEX(@__emoji_0, [p].[Name]) > 0)
Is there any way to do this in EF Core or raw SQL?
Your main issue is the fact that emojis and strings are represented differently.
Before you can search for emojis, you need to decide how you are going to unify them, both in the search query and in the DB.
First of all, an emoji is a pair of chars. What does that mean? Here is a quote from the Microsoft docs:
"🐂".Length = 2
s[0] = '�' ('\ud83d')
s[1] = '�' ('\udc02')
These examples show that the value of string.Length, which indicates the number of char instances, doesn't necessarily indicate the number of displayed characters. A single char instance by itself doesn't necessarily represent a character.
The char pairs that map to a single character are called surrogate pairs. To understand how they work, you need to understand Unicode and UTF-16 encoding.
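Although the question is about C#, the UTF-16 representation is identical in Java, so the same surrogate-pair behavior can be observed there as well (a small illustrative sketch):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String ox = "🐂"; // U+1F402, outside the BMP
        System.out.println(ox.length());        // 2: two UTF-16 code units
        System.out.println((int) ox.charAt(0)); // 55357 (0xD83D, high surrogate)
        System.out.println((int) ox.charAt(1)); // 56322 (0xDC02, low surrogate)
        System.out.println(ox.codePointAt(0));  // 128002 (0x1F402, the real code point)
    }
}
```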
Having this in mind I would go as follows:
Define a method which will convert emojis to a UTF16 string[] which will keep the two surrogate chars representation.
internal static string[] EmojiToUtf16Pair(string emoji)
{
string[] arr = new string[2];
for (int i = 0; i < emoji.Length; i++)
{
arr[i] = emoji[i].ToString();
}
return arr;
}
This could be used when you persist emojis in the DB. Depending on how you decide to persist them, the method could be modified, e.g. to return a concatenated string.
If you need the reverse operation, another method can convert the UTF-16 representation back to an emoji:
internal static string UTF16PairToEmoji(string[] codes)
{
var test = string.Empty;
foreach (var i in codes)
{
test += i;
}
return test;
}
Here is the full code example:
class Program
{
static void Main()
{
var str = "🚴";
var utf16 = string.Join("",EmojiToUtf16Pair(str));
Console.WriteLine(utf16);
var testEmoji = UTF16PairToEmoji(EmojiToUtf16Pair(str));
Console.WriteLine(testEmoji);
}
internal static string[] EmojiToUtf16Pair(string emoji)
{
string[] arr = new string[2];
for (int i = 0; i < emoji.Length; i++)
{
arr[i] = emoji[i].ToString();
}
return arr;
}
internal static string UTF16PairToEmoji(string[] codes)
{
var test = string.Empty;
foreach (var i in codes)
{
test += i;
}
return test;
}
}
You have to use the LIKE command:
SELECT * FROM emoticon where emoji_utf like '👨🏫';
with EF in .net core
Emoticon emoticon = db_context.Emoticons.Where(a => EF.Functions.Like(a.EmojiUtf, "%" + item.emojiString + "%")).FirstOrDefault();

Serious performance regression upon porting bubble sort from C to Rust [duplicate]

I was playing around with binary serialization and deserialization in Rust and noticed that binary deserialization is several orders of magnitude slower than with Java. To eliminate the possibility of overhead due to, for example, allocations, I'm simply reading a binary stream in each program. Each program reads from a binary file on disk which contains a 4-byte integer holding the number of input values, followed by a contiguous chunk of 8-byte big-endian IEEE 754-encoded floating point numbers. Here's the Java implementation:
import java.io.*;
public class ReadBinary {
public static void main(String[] args) throws Exception {
DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(args[0])));
int inputLength = input.readInt();
System.out.println("input length: " + inputLength);
try {
for (int i = 0; i < inputLength; i++) {
double d = input.readDouble();
if (i == inputLength - 1) {
System.out.println(d);
}
}
} finally {
input.close();
}
}
}
Here's the Rust implementation:
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;
fn main() {
let args = std::env::args_os();
let fname = args.skip(1).next().unwrap();
let path = Path::new(&fname);
let mut file = BufReader::new(File::open(&path).unwrap());
let input_length: i32 = read_int(&mut file);
for i in 0..input_length {
let d = read_double_slow(&mut file);
if i == input_length - 1 {
println!("{}", d);
}
}
}
fn read_int<R: Read>(input: &mut R) -> i32 {
let mut bytes = [0; std::mem::size_of::<i32>()];
input.read_exact(&mut bytes).unwrap();
i32::from_be_bytes(bytes)
}
fn read_double_slow<R: Read>(input: &mut R) -> f64 {
let mut bytes = [0; std::mem::size_of::<f64>()];
input.read_exact(&mut bytes).unwrap();
f64::from_be_bytes(bytes)
}
I'm outputting the last value to make sure that all of the input is actually being read. On my machine, when the file contains (the same) 30 million randomly-generated doubles, the Java version runs in 0.8 seconds, while the Rust version runs in 40.8 seconds.
Suspicious of inefficiencies in Rust's byte interpretation itself, I retried it with a custom floating point deserialization implementation. The internals are almost exactly the same as what's being done in Rust's Reader, without the IoResult wrappers:
fn read_double<R : Reader>(input: &mut R, buffer: &mut [u8]) -> f64 {
use std::mem::transmute;
match input.read_at_least(8, buffer) {
Ok(n) => if n > 8 { fail!("n > 8") },
Err(e) => fail!(e)
};
let mut val = 0u64;
let mut i = 8;
while i > 0 {
i -= 1;
val += buffer[7-i] as u64 << i * 8;
}
unsafe {
transmute::<u64, f64>(val)
}
}
The only change I made to the earlier Rust code in order to make this work was create an 8-byte slice to be passed in and (re)used as a buffer in the read_double function. This yielded a significant performance gain, running in about 5.6 seconds on average. Unfortunately, this is still noticeably slower (and more verbose!) than the Java version, making it difficult to scale up to larger input sets. Is there something that can be done to make this run faster in Rust? More importantly, is it possible to make these changes in such a way that they can be merged into the default Reader implementation itself to make binary I/O less painful?
For reference, here's the code I'm using to generate the input file:
import java.io.*;
import java.util.Random;
public class MakeBinary {
public static void main(String[] args) throws Exception {
DataOutputStream output = new DataOutputStream(new BufferedOutputStream(System.out));
int outputLength = Integer.parseInt(args[0]);
output.writeInt(outputLength);
Random rand = new Random();
for (int i = 0; i < outputLength; i++) {
output.writeDouble(rand.nextDouble() * 10 + 1);
}
output.flush();
}
}
(Note that generating the random numbers and writing them to disk only takes 3.8 seconds on my test machine.)
When you build without optimisations, it will often be slower than it would be in Java. But build it with optimisations (rustc -O or cargo build --release) and it should be very much faster. If the standard version still ends up slower, it's something that should be examined carefully to figure out where the slowness is: perhaps something is being inlined that shouldn't be, or something that should be isn't, or perhaps some optimisation that was expected is not occurring.

Building a String array from a text file without collection classes

I am trying to build an array from a text file read in with a buffered reader. This class is used by another class with a main method. What I have only prints the file... what I need is an array of strings, built line by line, mirroring the text file. I then need to be able to search that array using a String from user input (that part will be in the main method too) that names a product, and find the corresponding price. I can't use things like ArrayList, Maps, Vectors, etc. This is in Java 8.
/**
* A class that reads in inventory from vmMix1 text file using BufferedReader
* # author Michelle Merritt
*/
import java.io.*;
public class VendingMachine1
{
BufferedReader inInvFile1 = new BufferedReader(
new FileReader("vmMix1.txt"));
/**
* A method to print vending machine 1 inventory
*/
public void printVM1()
{
try
{
String vm1Line;
while((vm1Line = inInvFile1.readLine()) != null)
{
// This is what I was using for now to simply print my file
System.out.println(vm1Line);
}
}
catch(IOException e)
{
System.out.println("I/O Error: " + e);
}
}
}
This is the code that created my text file, since I don't see a way to attach the text file itself.
/**
* A class that creates the inventory found in vending machine #1, using
* a PrintWriter stream.
* # author Michelle Merritt
*/
import java.io.*;
public class VMMix1
{
public static void main(String[] args)
{
String [] product = {"Coke", "Dr. Pepper", "Sprite", "RedBull",
"Cool Ranch Doritos", "Lay's Potato Chips",
"Pretzels", "Almonds", "Snickers", "Gummi Bears",
"Milky Way", "KitKat"};
String [] packaging = {"bottle", "can", "can", "can", "bag", "bag",
"bag", "bag", "wrapper", "bag", "wrapper",
"wrapper"};
float [] price = {2.25f, 1.75f, 1.75f, 2.00f, 1.00f, 1.00f, 0.75f, 1.50f,
1.25f, 1.00f, 1.25f, 1.25f};
int [] quantity = {10, 10, 10, 12, 8, 10, 12, 9, 7, 11, 10, 8};
try(PrintWriter outFile = new PrintWriter("vmMix1.txt"))
{
for (int index = 0; index < product.length; index++)
{
outFile.printf("%-18s %-10s: $%.2f qty: %3d\n", product[index],
packaging[index], price[index], quantity[index]);
}
}
catch (IOException except)
{
System.out.println("IOException: " + except.getMessage());
}
}
}
I need for this thing to be dynamic. As the program runs, and something is purchased, I will have to account for losing inventory and changing the amount of money in the vending machine (there's another class for currency that houses quantities of denominations of money). I have to maintain the values in the array and reprint the updated array. Any help is much appreciated.
You may use the Java 8 Stream API:
String[] array = reader.lines().toArray(String[]::new);
You could even skip creating the reader yourself by using Files.lines:
try (Stream<String> stream = Files.lines(Paths.get("vmMix1.txt"))) {
String [] array = stream.toArray(String[]::new);
}
Pre-Java 8, probably one of the shortest ways is to read the entire file into a string and split it (ways to read a reader into a string can be found here):
String[] array = fileAsString.split("\n");
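One of those ways, as a minimal sketch (names are illustrative), reads the reader into a StringBuilder in chunks:

```java
import java.io.*;

public class ReadAll {
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        // read in chunks until end of stream
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String content = readAll(new StringReader("line1\nline2"));
        String[] array = content.split("\n");
        System.out.println(array.length); // 2
    }
}
```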
Of course you could also build the array in your loop and grow it for every line using System.arraycopy (which can be quite slow in that case).
String[] array = new String[0];
while((vm1Line = inInvFile1.readLine()) != null) {
String[] newArray = new String[array.length + 1];
System.arraycopy(array, 0, newArray, 0, array.length);
newArray[array.length] = vm1Line;
array = newArray;
}
You may optimize this approach by creating a larger array first, fill in the lines, increase size of the array as needed (using arraycopy), and finally shrink the array to the number of written lines.
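That optimization might look roughly like this (the growth factor and initial capacity are arbitrary choices):

```java
import java.io.*;
import java.util.Arrays;

public class GrowingLines {
    static String[] readLines(BufferedReader in) throws IOException {
        String[] array = new String[16]; // start larger than needed
        int size = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (size == array.length) {
                // grow geometrically instead of copying on every line
                array = Arrays.copyOf(array, array.length * 2);
            }
            array[size++] = line;
        }
        // shrink to the number of lines actually read
        return Arrays.copyOf(array, size);
    }

    public static void main(String[] args) throws IOException {
        String[] lines = readLines(new BufferedReader(new StringReader("a\nb\nc")));
        System.out.println(lines.length); // 3
    }
}
```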
That's more or less what an ArrayList does. So if you were allowed to use the Collections API, you could simply do:
List<String> list = new ArrayList<>();
while((vm1Line = inInvFile1.readLine()) != null) {
list.add(vm1Line);
}
String[] array = list.toArray(new String[list.size()]);
Hope it helps.
try (Stream<String> lines = Files.lines(Paths.get("C:/SelfStudy/Input.txt"))) {
String[] array = lines.toArray(String[]::new);
}

Any Efficient way to parse large text files and store parsing information?

My purpose is to parse text files and store the information in respective tables.
I have to parse around 100 folders containing more than 8,000 files, approximately 20 GB in total.
When I tried to store the whole file contents in a string, an out-of-memory exception was thrown.
That is
using (StreamReader objStream = new StreamReader(filename))
{
string fileDetails = objStream.ReadToEnd();
}
Hence I tried one logic like
using (StreamReader objStream = new StreamReader(filename))
{
// Getting total number of lines in a file
int fileLineCount = File.ReadLines(filename).Count();
if (fileLineCount < 90000)
{
fileDetails = objStream.ReadToEnd();
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
//call respective method for parsing and insertion
}
else
{
while ((firstLine = objStream.ReadLine()) != null)
{
lineCount++;
fileDetails = (fileDetails != string.Empty) ? string.Concat(fileDetails, "\n", firstLine)
: string.Concat(firstLine);
if (lineCount == 90000)
{
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
lineCount = 0;
//call respective method for parsing and insertion
}
}
//when the line count is e.g. 90057, parse the remaining 57 lines
if (lineCount < 90000)
{
string[] fileInfo = fileDetails.ToString().Split('\n');
lineCount = 0;
//call respective method for parsing and insertion
}
}
}
Here 90,000 is the bulk size that is safe to process without an out-of-memory exception in my case.
Still, the process takes more than 2 days to complete. I observed this is because of reading line by line.
Is there a better approach to handle this?
Thanks in advance :)
You can use a profiler to detect what is hurting your performance. In this case it's obvious: disk access and string concatenation.
Do not read a file more than once. Let's take a look at your code. First of all, the line int fileLineCount = File.ReadLines(filename).Count(); means you read the whole file and discard what you've read. That's bad. Throw away your if (fileLineCount < 90000) and keep only else.
It almost doesn't matter if you read line-by-line in consecutive order or the whole file because reading is buffered in any case.
Avoid string concatenation, especially for long strings.
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
It's really bad. You read the file line-by-line, why do you do this replacement/split? File.ReadLines() gives you a collection of all lines. Just pass it to your parsing routine.
If you do this properly, I expect a significant speedup. It can be optimized further by reading files in a separate thread while processing them in the main thread. But that is another story.

How do I run an encryption program multiple times to strengthen the encode?

Here is my code so far. I need to run the encode part of the code 5 times and then decode the result the same number of times. I figured out how to encode the message, but now I can't figure out how to run the "encode" or "decode" variable back through the code to strengthen the encryption.
public class Codes
{
/**
* Encode and decode a message using a key of values stored in
* a queue.
*/
public static void main(String[] args)
{
int[] key = {7, 6, 5, 2, 8, 5, 8, 6, 4, 1};
Integer keyValue;
String encoded = "", decoded = "";
String message = "Queues are useful for encoding messages.";
Queue<Integer> encodingQueue = new LinkedList<Integer>();
Queue<Integer> decodingQueue = new LinkedList<Integer>();
// load key queues
for (int scan = 0; scan < key.length; scan++)
{
encodingQueue.add(key[scan]);
decodingQueue.add(key[scan]);
}
// encode message
for (int scan = 0; scan < message.length(); scan++)
{
keyValue = encodingQueue.remove();
encoded += (char) (message.charAt(scan) + keyValue);
encodingQueue.add(keyValue);
}
System.out.println ("Encoded Message:\n" + encoded + "\n");
// decode message
for (int scan = 0; scan < encoded.length(); scan++)
{
keyValue = decodingQueue.remove();
decoded += (char) (encoded.charAt(scan) - keyValue);
decodingQueue.add(keyValue);
}
System.out.println ("Decoded Message:\n" + decoded);
}
}
as of right now I am receiving this output:
Encoded Message:
X{jwmx(gvf'{xgnzt&jpy&jpktlorh'sju{fokw/
Decoded Message:
Queues are useful for encoding messages.
In order to complete this program I need the output to look like this:
Encoded Message 1: X{jwmx(gvf'{xgnzt&jpy&jpktlorh'sju{fokw/
Encoded Message 2: _?oyu}0mzg.?}iv•|,nq?,orsytuvi.yow?kwq{0
Encoded Message 3: f?t{}?8s~h5??k~??2rr?2tt{~|{zj5•ty?p•w•1
Encoded Message 4: m?y}??#y?i<??m???8vs?8yv????~k<?y{?u?}?2
Encoded Message 5: t?~•??H•?jC??o???>zt?>~x?????lC?~}?z???3
Decoded Message 5: m?y}??#y?i<??m???8vs?8yv????~k<?y{?u?}?2
Decoded Message 4: f?t{}?8s~h5??k~??2rr?2tt{~|{zj5•ty?p•w•1
Decoded Message 3: _?oyu}0mzg.?}iv•|,nq?,orsytuvi.yow?kwq{0
Decoded Message 2: X{jwmx(gvf'{xgnzt&jpy&jpktlorh'sju{fokw/
Decoded Message 1: Queues are useful for encoding messages.
I estimate that in order to make this happen I need to use a loop to run the "encode" and "decode" variables back through the program. However, I cannot figure out how to make that happen.
This will be easier if you use separate functions for the encode() and decode() operations:
class Codes {
public static void main(String[] args) {
...
}
private static String encode(String plaintext, Queue<Integer> encodingQueue) {
...
}
private static String decode(String ciphertext, Queue<Integer> decodingQueue) {
...
}
}
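A rough sketch of that structure, assuming a single shift helper used for both directions (the method name and the sign parameter are illustrative, not part of the original assignment):

```java
import java.util.LinkedList;
import java.util.Queue;

public class Codes5 {
    // shifts every character by the rotating key; sign = +1 encodes, -1 decodes
    static String shift(String text, Queue<Integer> keyQueue, int sign) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            Integer keyValue = keyQueue.remove();
            sb.append((char) (text.charAt(i) + sign * keyValue));
            keyQueue.add(keyValue); // rotate the key back onto the queue
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] key = {7, 6, 5, 2, 8, 5, 8, 6, 4, 1};
        Queue<Integer> queue = new LinkedList<>();
        for (int k : key) queue.add(k);

        String message = "Queues are useful for encoding messages.";
        String text = message;
        // encode 5 times, feeding each result back in
        for (int round = 1; round <= 5; round++) {
            text = shift(text, queue, +1);
            System.out.println("Encoded Message " + round + ": " + text);
        }
        // decode 5 times, undoing the rounds in reverse order
        for (int round = 5; round >= 1; round--) {
            text = shift(text, queue, -1);
            System.out.println("Decoded Message " + round + ": " + text);
        }
        // text now equals the original message again
    }
}
```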
Does that help?
