What's the proper implementation for hardware emulation? - c

I'm going to program a Game Boy emulator (the Z80 is the CPU, in case somebody is not familiar with it), and while doing my research I've found some things I'm not so sure about.
The first one was that C is the programming language to choose here. That's not so much of a problem, but I'd like to hear your opinion from today's point of view. Even C++ was not recommended.
The second thing I found out was that everybody was using one function per opcode. That seems logical, since it's just one function call and probably better optimised than having a single function for the "ADD" instruction which then has to work out which registers are involved. But how necessary is that today? Is it something I should stick to, or should I rewrite my emulator later if I notice that another, more convenient approach just doesn't cut it (more or less modern gaming consoles come to mind)?
Also, it's kind of demotivating to write a function for "add that register to this register" over and over again. Is there a way to automate that from an opcode map or something like that?

I mostly agree with WingsOfIcarus. I wrote a few emulators already so here is my insight:
The use of function pointers is a good idea (for speed and clarity of code)
OOP is not a problem
Yes, member calls are a little bit slower, but if you are careful it will not affect performance too much. On the other hand, OOP emulation code is much easier to manage/read/understand.
Use an instruction database instead of fixed instruction decoding.
I am using a single text file which contains all the necessary information for all instructions. The emulator parses it during initialization (feeding the arrays of function pointers, operands, ...). With this architecture it is very easy to correct errors in the instruction set without any code change.
Documentation for complex instruction sets is almost always faulty to some degree. The worst case is the Z80 (I have never seen a 100% error-free instruction set). So use several instruction set references, compare them, and create an error-free set (if you can).
Add sound, video, keyboard and mouse to your emulation
This is usually not a problem. On Windows use WaveOut instead of DirectSound. It's more stable and much faster (usable latencies of DSound are sometimes even > 400 ms). With WaveOut I was able to lower latency to 20-80 ms, which is OK.
Limit speed by T cycles of the emulated CPU per second
I am using machine-cycle-correct timing, which is much slower, but allows me to correctly implement any hardware peripheral emulation (FDC, DMAC, sound chips, ...) without any hacks.
Add load/save of files for the emulated platform
For example, this is part of my instruction set (which is fed directly to the CPU emulation):
opc T0 T1 MC1 MC2 MC3 MC4 MC5 MC6 MC7 mnemonic
B8 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,B
B9 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,C
BA 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,D
BB 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,E
BC 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,H
BD 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,L
BE 07 00 M1R 4 MRD 3 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,(HL)
BF 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,A
C0 11 05 M1R 5 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 RET NZ
C1 10 00 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 POP BC
C2L2H2 10 10 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 JP NZ,U16
C3L1H1 10 00 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 JP U16
C4L2H2 17 10 M1R 4 MRD 3 MRD 4 MWR 3 MWR 3 ... 0 ... 0 CALL NZ,U16
C5 11 00 M1R 5 MWR 3 MWR 3 ... 0 ... 0 ... 0 ... 0 PUSH BC
C6U2 07 00 M1R 4 MRD 3 ... 0 ... 0 ... 0 ... 0 ... 0 ADD A,U8
C7 11 00 M1R 5 MWR 3 MWR 3 ... 0 ... 0 ... 0 ... 0 RST 00H
C8 11 05 M1R 5 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 RET Z
C9 10 00 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 RET
CAL2H2 10 10 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 JP Z,U16
opc: operation code [hex]
L1,H1,U1,S1 means first operand direct number or address
L2,H2,U2,S2 means second operand direct number or address
L3,H3,U3,S3 means third operand direct number or address
H,L ... U16 high and low byte
U ... U8 unsigned byte
S ... S8 signed byte
T0 normal instruction duration [T] always 2 decimal digits
T1 instruction duration if condition not met [T] always 2 decimal digits
MC1++ machine cycles: first value is the type, second is the duration [T], always 1 decimal digit
... unused
M1R M1 cycle
MRD memory read
MWR memory write
IOR IO read
IOW IO write
NON no external operation (internal computation)
INT interrupt cycle
mnem instruction text (mnemonic)
opc is used for the address in an array of pointers
mnemonic is used to select the proper function pointer, and operands type
T0 and T1 are used for instruction timing (this is enough for rough emulation)
MC1++ are used for correct MC timings (to implement correct hardware emulation and contention timing)
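Below is a minimal, simplified sketch (hypothetical structure and field names, with the MC columns and the mnemonic-to-handler matching left out for brevity) of how such a text instruction table could be parsed at start-up into a lookup table indexed by the opcode byte:

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint8_t t0, t1;   /* T0/T1 durations in T-states */
    /* function pointer, operand types, MC list, ... would go here */
} OpInfo;

static OpInfo optable[256];

/* Parse lines such as "B8 04 00 M1R 4 ... 0 ... CP A,B".
   The first token may carry operand markers (e.g. "C2L2H2"); only its
   first two hex digits are the opcode byte used as the array index. */
static int load_optable(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char line[256], tok[16];
    while (fgets(line, sizeof line, f)) {
        unsigned opc, t0, t1;
        if (sscanf(line, "%15s %2u %2u", tok, &t0, &t1) != 3)
            continue;                      /* header or malformed line */
        if (sscanf(tok, "%2x", &opc) != 1)
            continue;
        optable[opc].t0 = (uint8_t)t0;
        optable[opc].t1 = (uint8_t)t1;
        /* match the trailing mnemonic against a name->handler table here */
    }
    fclose(f);
    return 0;
}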
Here is my complete Zilog Z80A instruction set with machine-cycle timing, available for download via the link. Feel free to use it (just mention my nick somewhere). After porting to this I was finally able to pass the ZEXALL test 100%. For more info see Writing a graphical Z80 emulator in C or C++.

First suggestion: you shouldn't use nested switch statements; use an array of function pointers instead. It's a lot faster -> better emulation, and nicer code; nested switches can also get a bit messy. Here are some links where you can read more about these arrays:
http://www.newty.de/fpt/fpt.html
http://www.multigesture.net/wp-content/uploads/mirror/zenogais/FunctionPointers.htm
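As a rough illustration of the idea (handler names and the register layout are made up here, not taken from the links above), an opcode byte can index straight into a 256-entry table of handlers:

#include <stdint.h>

typedef struct { uint8_t a, b; /* ... remaining registers, flags ... */ } Cpu;
typedef void (*Handler)(Cpu *cpu);

static void op_nop(Cpu *cpu)     { (void)cpu; }
static void op_add_a_b(Cpu *cpu) { cpu->a = (uint8_t)(cpu->a + cpu->b); }

static Handler dispatch[256] = {
    [0x00] = op_nop,
    [0x80] = op_add_a_b,   /* ADD A,B on the Game Boy CPU / Z80 */
    /* ... one entry per opcode ... */
};

static void execute(Cpu *cpu, uint8_t opcode) {
    if (dispatch[opcode])
        dispatch[opcode](cpu);   /* one indirect call instead of nested switches */
}

Unimplemented opcodes are simply left as NULL and skipped in this sketch; a real core would log or trap them instead.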
Second suggestion: yes, you could do it in C#, Java or C++, but since you want every single bit of your CPU cycles so you can get as close an emulation as possible - emulating one CPU cycle of the target architecture with the least number of CPU cycles on the current architecture - OOP isn't so good in this case, from what I've heard/read from people. One thing is performance, and the second is pretty much obvious: emulation is, as you probably noticed, a really complex task, and wrapping it in OOP can be an unnecessary pain in the neck.

Here's a pretty cool implementation of working with some opcodes for an NES emulator:
http://bisqwit.iki.fi/jutut/kuvat/programming_examples/nesemu1/
Here are the accompanying YouTube videos, which have a little more explanation of what's going on:
http://www.youtube.com/watch?v=y71lli8MS8s
It uses C++ templates and some additional C++11 features. Whether you choose C++ or C is up to you, but it shouldn't really matter a whole lot. If you're just emulating a Game Boy, I doubt that speed is going to be an issue on modern processors, so try to just use whatever you're comfortable with.

Related

Common stack-based VM bytecode optimizations?

OK, I'm posting this fearing that it might be closed before anyone ever reads it - I'm quite used to that - but I'll give it a try... even pointing me in the right direction, or to an existing answer that addresses this specifically, would definitely do...
So, after this brief intro...
I'm currently writing a bytecode interpreter in C (a stack-based VM) for a programming language I have designed.
If you want to have a look at the supported opcodes, feel free to check them out here: https://github.com/arturo-lang/arturo/blob/master/src/vm/opcodes.h
There is nothing really special about the stack machine. Values are being pushed and popped, and operators and functions work on them, pushing the evaluation result back to the stack. So far so good.
Now, I'm at the point where all the core functionality is in and I'm trying to give it an extra boost by doing further optimizations.
Here's an example (and hopefully a rather illustrative one).
Input:
fibo: $(x){
    if x<2 {
        return 1
    } {
        return [fibo x-1] + [fibo x-2]
    }
}
i: 0
loop i<34 {
    print "fibo(" + i + ") = " + [fibo i]
    i: i+1
}
Bytecode produced:
|== Data Segment /======================>
0 : [Func ]= function <5,1>
1 : [Int ]= 34
2 : [String]= fibo(
3 : [String]= ) =
==/ Data Segment =======================|
|== Bytecode Listing /======================>
0 :0 JUMP [Dword] 31
1 :5 LLOAD0
2 :6 IPUSH2
3 :7 CMPLT
4 :8 JMPIFNOT [Dword] 20
5 :13 IPUSH1
6 :14 RET
7 :15 JUMP [Dword] 30
8 :20 LLOAD0
9 :21 IPUSH1
10 :22 SUB
11 :23 GCALL0
12 :24 LLOAD0
13 :25 IPUSH2
14 :26 SUB
15 :27 GCALL0
16 :28 ADD
17 :29 RET
18 :30 RET
19 :31 CPUSH0
20 :32 GSTORE0
21 :33 IPUSH0
22 :34 GSTORE1
23 :35 GLOAD1
24 :36 CPUSH1
25 :37 CMPLT
26 :38 JMPIFNOT [Dword] 61
27 :43 CPUSH2
28 :44 GLOAD1
29 :45 ADD
30 :46 CPUSH3
31 :47 ADD
32 :48 GLOAD1
33 :49 GCALL0
34 :50 ADD
35 :51 DO_PRINT
36 :52 GLOAD1
37 :53 IPUSH1
38 :54 ADD
39 :55 GSTORE1
40 :56 JUMP [Dword] 35
41 :61 END
==/ Bytecode Listing =======================|
For anyone who has worked with compilers, bytecode interpreters or even JVM, the code above should be familiar.
What do I want?
Ideas - general or specific - about how to further optimize my bytecode.
For example, every *2 (that is, an IPUSH2 followed by a MUL instruction) is converted to IPUSH1, SHL, since that's a faster operation.
What else would you suggest? Is there a list of such optimizations anywhere? Could you suggest something concrete?
Thanks in advance! :)
The example you give is not a particularly good one, because the performance gain for an interpreter is very low if it does a shift instead of a multiplication. The overhead of executing a single bytecode instruction at all outweighs the gain of this particular optimization by several orders of magnitude.
The highest performance gain for an interpreter comes from minimizing the number of instructions that need to be executed. For example, combine two successive additions or subtractions on the same register into a single operation where possible.
To be able to make this kind of optimization, you should try to identify so-called basic blocks (blocks where either all or none of the instructions are executed, i.e. no jumps into or out of the block happen) and reduce the number of instructions in those blocks by collapsing several instructions into one while preserving the code's semantics. A sketch of such a pass follows below.
If you are really serious about it, you can also try to write a GCC backend for your language to compile it to bytecode; that way you can benefit from GCC's sophisticated optimization passes on the intermediate code representation (RTL).
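As a rough sketch of the kind of in-block rewriting described above (the opcode values and the INC_GLOBAL1 superinstruction are made up for illustration; a real pass also has to decode the [Dword] operands and fix up jump targets after the code shrinks), a pass that collapses the common "load, add 1, store" sequence from your listing might look like this:

#include <stddef.h>

enum { OP_GLOAD1, OP_IPUSH1, OP_ADD, OP_GSTORE1, OP_INC_GLOBAL1 /* ... */ };

/* Rewrite GLOAD1; IPUSH1; ADD; GSTORE1 into a single INC_GLOBAL1 within one
   basic block, compacting the code in place and returning the new length. */
size_t fuse_increment(unsigned char *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 3 < n &&
            code[i]     == OP_GLOAD1 && code[i + 1] == OP_IPUSH1 &&
            code[i + 2] == OP_ADD    && code[i + 3] == OP_GSTORE1) {
            code[out++] = OP_INC_GLOBAL1;   /* i: i+1 becomes one instruction */
            i += 4;
        } else {
            code[out++] = code[i++];
        }
    }
    return out;
}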

Sqlite optimization for MAX query on non-leftmost column on index

I noticed that the online page about SQLite's query optimizer guarantees that queries of the form SELECT MAX(colA) FROM TABLE can be optimized if there is an index whose leftmost column is colA.
However, I'm less clear about what happens when an index is used to narrow the table based on an equality in WHERE clause, such that the next column in the index is the one that I'm taking a MAX on. Based on the structure of the index, the maximum value should be quickly accessible as the last row in the subset of the index satisfying the WHERE clause. For example, given an index on colA and colB, it should be possible to find SELECT MAX(colB) FROM SillyTable WHERE colA = 1 without scanning all 6 rows associated with colA = 1:
Index of SillyTable on colA, colB:
colA colB rowid
1 1 4
1 2 5
1 4 2
1 5 8
1 6 3 # This is the one
2 1 1
2 5 6
2 8 7
Does SQLite actually optimize a query like this, or will it scan all the rows that satisfy the WHERE clause? If it does a scan, how can I change the query to make it run faster?
My specific use case is similar to the SillyTable example. I created the following table:
CREATE TABLE Product(
ProductTypeID INTEGER NOT NULL,
ProductID INTEGER NOT NULL,
PRIMARY KEY(ProductTypeID, ProductID),
FOREIGN KEY(ProductTypeID)
REFERENCES ProductType(ProductTypeID)
);
ProductTypeID is not particularly selective for the table; I might have many rows with the same ProductTypeID but different ProductID. EXPLAIN QUERY PLAN tells me that my query uses an index automatically built for the composite primary key, but that is true whether it scans or binary-searches the subset of rows found with the index:
EXPLAIN QUERY PLAN SELECT MAX(ProductID) FROM Product
WHERE ProductTypeID = ?;
=>
SEARCH TABLE Product USING COVERING INDEX sqlite_autoindex_Product_1(ProductTypeID=?)
This is shown in the EXPLAIN output:
sqlite> EXPLAIN SELECT MAX(ProductID) FROM Product WHERE ProductTypeID = ?;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 17 0 00 Start at 17
1 Null 0 1 2 00 r[1..2]=NULL
2 OpenRead 1 3 0 k(2,,) 02 root=3 iDb=0; sqlite_autoindex_Product_1
3 Variable 1 3 0 00 r[3]=parameter(1,)
4 IsNull 3 13 0 00 if r[3]==NULL goto 13
5 Affinity 3 1 0 D 00 affinity(r[3])
6 SeekLE 1 13 3 1 00 key=r[3]
7 IdxLT 1 13 3 1 00 key=r[3]
8 Column 1 1 4 00 r[4]=Product.ProductID
9 CollSeq 0 0 0 (BINARY) 00
10 AggStep0 0 4 1 max(1) 01 accum=r[1] step(r[4])
11 Goto 0 13 0 00 max() by index
12 Prev 1 7 0 00
13 AggFinal 1 1 0 max(1) 00 accum=r[1] N=1
14 Copy 1 5 0 00 r[5]=r[1]
15 ResultRow 5 1 0 00 output=r[5]
16 Halt 0 0 0 00
17 Transaction 0 0 1 0 01 usesStmtJournal=0
18 Goto 0 1 0 00
To make the code generator simpler, SQLite always creates a loop for the aggregation (lines 6 to 12). However, for max(), this loop aborts after the first successful step (line 11).
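If you want to double-check this from C, a small test harness (the database file name and the bound value are just placeholders here) can run both the MAX() form and the equivalent ORDER BY ... DESC LIMIT 1 form, which SQLite should likewise be able to satisfy with a single probe of the (ProductTypeID, ProductID) index:

/* Build with: cc max_query.c -lsqlite3 */
#include <sqlite3.h>
#include <stdio.h>

static void run(sqlite3 *db, const char *sql, int product_type) {
    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) {
        fprintf(stderr, "prepare failed: %s\n", sqlite3_errmsg(db));
        return;
    }
    sqlite3_bind_int(stmt, 1, product_type);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s\n  -> %d\n", sql, sqlite3_column_int(stmt, 0));
    sqlite3_finalize(stmt);
}

int main(void) {
    sqlite3 *db = NULL;
    if (sqlite3_open("products.db", &db) != SQLITE_OK) return 1;   /* placeholder file name */

    run(db, "SELECT MAX(ProductID) FROM Product WHERE ProductTypeID = ?", 1);
    run(db, "SELECT ProductID FROM Product WHERE ProductTypeID = ? "
            "ORDER BY ProductID DESC LIMIT 1", 1);

    sqlite3_close(db);
    return 0;
}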

Summing up multiple variable scores depending on their score

tl;dr: I need to first dichotomize a set of variables to 0/1, then sum up these values. I need to do this for 14x8 variables, so I am looking for a way to do this in a loop.
Hi guys,
I have a very specific problem I need your help with:
Description of problem:
In my dataset I have 14 sets of 8 variables each (e.g. a1 to a8, b1 to b8, c1 to c8, etc.) with scores ranging from 1 to 6. Note that the variables are non-contiguous, with string variables in between them (which I need for a different purpose).
I now want to compute scores for each set of these variables (e.g. scoreA, scoreB, scoreC). The score should be computed according to the following rule:
scoreA = 0.
If a1 > 1 then increment scoreA by 1.
If a2 > 1 then increment scoreA by 1.
... etc.
Example:
Dataset:
1 5 6 3 2 1 1 5
1 1 1 3 4 6 2 3
scores:
5
5
My previous attempts:
I know I could do this task by first recoding the variables to dichotomize them and then summing up these values. This has two large drawbacks for me: firstly, it creates a lot of new variables which I don't need; secondly, it is a very tedious and repetitive task, since I have multiple sets of variables (with different variable names) on which I need to do the same thing.
I took a look at the DO REPEAT and LOOP with VECTOR commands, but I don't seem to fully understand how they work. I was not able to transfer solutions from other examples I read online to my problem.
I would be happy with a solution that only loops through one set of variables and does the task, then I would adjust the syntax appropriately for my other 13 sets of variables. Hope you can help me out.
See two solutions below: the first loops over each of the sets, the second is a macro which loops over a list of sets:
* creating some sample data.
DATA LIST list/a1 to a8 b1 to b8 c1 to c8 hello1 to hello8.
BEGIN DATA
1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 3 3 3 1 1 1 1 4 4 4 4
1 1 1 1 2 3 4 5 1 1 1 2 3 4 1 0 0 0 0 0 1 2 1 2 3 2 1 2 3 2 1 6
END DATA.
* solution 1: a loop for each set (example for sets a, b and c).
compute scoreA=0.
compute scoreB=0.
compute scoreC=0.
do repeat
a=a1 a2 a3 a4 a5 a6 a7 a8
/b=b1 b2 b3 b4 b5 b6 b7 b8
/c=c1 c2 c3 c4 c5 c6 c7 c8. /* if variable names are consecutive, replace with "a1 to a8" etc.
compute scoreA=scoreA+(a>1).
compute scoreB=scoreB+(b>1).
compute scoreC=scoreC+(c>1).
end repeat.
execute.
Doing this for 14 different sets is no fun, so assuming your sets are always named $1 to $8, you can use the following macro:
define DoSets (SetList=!cmdend)
!do !set !in (!SetList)
compute !concat("Score_",!set)=0.
do repeat !set=!concat(!set,"1") !concat(!set,"2") !concat(!set,"3") !concat(!set,"4") !concat(!set,"5") !concat(!set,"6") !concat(!set,"7") !concat(!set,"8").
compute !concat("Score_",!set)=!concat("Score_",!set)+(!set>1).
end repeat.
!doend
execute.
!enddefine.
* now call the macro and list all set names.
DoSets SetList= a b c hello.
The DO REPEAT loop above works perfectly, but with a lot of sets of variables it would be tedious to create. Using Python programmability, it can be generated automatically without regard to the variable order. The code below assumes an unlimited number of variables with names of the form lowercase letter followed by a digit, occurring in sets of 8, and generates and runs the DO REPEAT. For simplicity it generates one loop for each output variable, but these will all be executed in a single data pass. If the name pattern is different, this code can be adjusted if you say what it is.
begin program.
import spss, spssaux
vars = sorted(spssaux.VariableDict(pattern="[a-z]\d").variables)
cmd = """compute %(score)s = 0.
do repeat index = %(vlist)s.
compute %(score)s = %(score)s + (index > 1).
end repeat."""
if len(vars) % 8 != 0:
    raise ValueError("Number of input variables not a multiple of 8")
for v in range(0, len(vars), 8):
    score = "score" + vars[v][0]
    vlist = " ".join(vars[v:v+8])
    spss.Submit(cmd % locals())
end program.
execute.

XBee packet format

I have two IEEE 802.15.4 devices running. The question is about the XBee-PRO.
Firmware: XBEE PRO 802.15.4 (Version: 10e6)
Hardware: XBEE (Version: 1744)
Both units are configured to the same channel (15) and the same PAN ID (0x1234). The XBee is hooked to my machine's COM port and can actually transmit data when I connect picocom to it. (It responds to AT commands properly and can be configured fully through the moltosenso Network Manager - I'm on a Mac.) All other registers are at their defaults, apart from the serial baud rate.
The XBee side source address is 0x1, the destination address is 0x2. Now when I type an ASCII character into picocom, this is what I see received on the other device, which is running in promiscuous mode:
-- Typing "a"
E 61 88 7E 34 12 2 0 1 0 2B 0 61 E1
E 61 88 7E 34 12 2 0 1 0 2B 0 61 E1
E 61 88 7E 34 12 2 0 1 0 2B 0 61 E1
E 61 88 7E 34 12 2 0 1 0 2B 0 61 E1
-- Typing "b"
E 61 88 7F 34 12 2 0 1 0 2C 0 62 58
E 61 88 7F 34 12 2 0 1 0 2C 0 62 58
E 61 88 7F 34 12 2 0 1 0 2C 0 62 58
E 61 88 7F 34 12 2 0 1 0 2C 0 62 58
--- Typing "a" again
E 61 88 80 34 12 2 0 1 0 2D 0 61 A9
E 61 88 80 34 12 2 0 1 0 2D 0 61 A9
...
ln pc pan da sa ct pl ck
So for every data payload sent, I see four frames sent out (nobody is picking them up, of course). I suppose three of these are 802.15.4 retries, and the XBee adds another one for kicks (although the RR register is clearly zero...).
What's the packet format here and where is this specified?
I've looked at XBee API packets and this does look vaguely similar, but I don't see 0x7e delimiters or anything like that here.
I guess what I am seeing is:
ln = length
61 = ??
88 = ??
pc = some sort of packet counter
pan = 16 bits of PAN ID
da = 16 bits of destination address
sa = 16 bits of source address
ct = another counter?
0 = ??
pl = my ASCII character payload
ck = probably a checksum
I tried setting the PAN to 0xFFFF and setting the destination address to 0xFF or broadcast, and I see pretty much the same thing. These 0x61 and 0x88 don't seem to correspond to much of anything in the XBee documentation...
It doesn't directly look like an 802.15.4 MAC-level data frame either - or if it does, what are the missing fields and where are they specified? Pointers?
EDIT:
Actually, hmm. After importing a hex-formatted dump into Wireshark, it told me exactly that it is an 802.15.4 MAC frame and how to read it.
IEEE 802.15.4 Data, Dst: 0x0002, Src: 0x0001, Bad FCS
Frame Control Field: Data (0x8861)
.... .... .... .001 = Frame Type: Data (0x0001)
.... .... .... 0... = Security Enabled: False
.... .... ...0 .... = Frame Pending: False
.... .... ..1. .... = Acknowledge Request: True
.... .... .1.. .... = Intra-PAN: True
.... 10.. .... .... = Destination Addressing Mode: Short/16-bit (0x0002)
..00 .... .... .... = Frame Version: 0
10.. .... .... .... = Source Addressing Mode: Short/16-bit (0x0002)
Sequence Number: 126
Destination PAN: 0x1234
Destination: 0x0002
Source: 0x0001
I still don't know where the second 16-bit counter in front of the actual data byte comes from, or why the FCS is messed up (I had to strip the leading len field to get Wireshark to read it - that's probably it).
I think the second counter ct is a counter for the application layer in the Zigbee protocol, so it can notice that it is receiving new data and should update accordingly :)
For more information about frame formats in the Zigbee stack, try downloading this:
Newnes.ZigBee.Wireless.Networks.and.Transceivers.Sep.2008.eBook-DDU.pdf
Have a nice day :)
Have you tried to read the packets with the X-CTU software?
I suggest you read this post: http://www.tunnelsup.com/xbee-guide/
The PDF with the "Quick Reference Guide" is really useful and indicates some of the data formats.
Also, it's always good to study the real documentation from the developer (Digi in this case).
The frame looks like:
API Frame
But only if you have previously configured the XBee to work in API mode, with the command:
ATAP 1
Or with XCTU.
Try monitoring communication between two XBee modules to see what the acknowledgement frame looks like.
Try sending a sequence of bytes.
Try performing a Node Discovery (ATND) to see what those frames look like.
Try sending a remote AT command from X-CTU to see what those frames and responses look like.
When reverse engineering a protocol, it's useful to see both sides of the conversation. You can test various theories by emulating each side of the protocol, and trying out variations on what you see. For example, "What if I change this byte, does the remote end still respond?".
My guess is that you're correct about the ct byte being a counter. The following zero byte could be flags, or it could identify the type of packet sent (serial data, remote AT command/response, node discovery/response, etc.).
As you build up an understanding of the structure, you can write a program to parse and dump the contents of the frames. Dump an interpreted version of what you know, and leave the unknown bytes as a sequence of hex bytes. Continue to experiment until you can narrow down the meaning of the remaining bytes.
The extra 2 bytes in the payload (0x2D 0x00) are the MaxStream header (the MM setting in X-CTU). If you disable the MaxStream headers by setting the MM command to "without MaxStream headers", these two bytes become part of the 802.15.4 payload, so your full payload would be 2B 0 61 instead of just 61.
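Putting the pieces above together, here is a small hedged sketch in C that decodes one of the captured frames using the layout inferred in this thread (this is an interpretation of the sniffer dump, not an official frame definition):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Captured while typing "a": E 61 88 7E 34 12 2 0 1 0 2B 0 61 E1 */
    const uint8_t frame[] = {0x0E,0x61,0x88,0x7E,0x34,0x12,0x02,0x00,
                             0x01,0x00,0x2B,0x00,0x61,0xE1};

    uint8_t  len = frame[0];                                   /* ln: length reported by the sniffer   */
    uint16_t fcf = (uint16_t)(frame[1] | (frame[2] << 8));     /* frame control, LSB first -> 0x8861   */
    uint8_t  seq = frame[3];                                   /* pc: 802.15.4 sequence number         */
    uint16_t pan = (uint16_t)(frame[4] | (frame[5] << 8));     /* destination PAN 0x1234               */
    uint16_t dst = (uint16_t)(frame[6] | (frame[7] << 8));     /* da: short destination address        */
    uint16_t src = (uint16_t)(frame[8] | (frame[9] << 8));     /* sa: short source address             */
    uint16_t mm  = (uint16_t)(frame[10] | (frame[11] << 8));   /* ct + 0: MaxStream (MM) header        */
    uint8_t  pl  = frame[12];                                  /* payload byte                         */
    uint8_t  ck  = frame[13];                                  /* ck: check byte (FCS looks truncated) */

    printf("len=%u fcf=0x%04X seq=%u pan=0x%04X dst=0x%04X src=0x%04X\n",
           len, fcf, seq, pan, dst, src);
    printf("mm=0x%04X payload=0x%02X ('%c') check=0x%02X\n", mm, pl, pl, ck);
    return 0;
}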

how to use arm neon vbit intrinsics?

I don't understand how to differentiate between vbit, vbsl and vbif with NEON intrinsics. I need to do the vbit operation, but if I use the vbslq intrinsic I don't get what I want.
For example I have a source vector like this:
uint8x16_t source = 39 62 9b 52 34 5b 47 48 47 35 0 0 0 0 0 0
The destination vector is:
uint8x16_t destination = 0 0 0 0 0 0 0 0 0 0 0 0 c3 c8 c5 d5
I would like to have as an output this:
39 62 9b 52 34 5b 47 48 47 35 0 0 c3 c8 c5 d5
meaning that I want to copy the first ten bytes from the source and leave the other 6 unchanged.
I'm using this mask:
{0,0,0,0,0,0,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF};
What is the correct way to use the vbslq_u8?
The ARM documentation is not very clear, but it looks like you would need to use the intrinsic like this:
uint8x16_t src = {0x39,0x62,0x9b,0x52,0x34,0x5b,0x47,0x48,
0x47,0x35,0x00,0x00,0x00,0x00,0x00,0x0};
uint8x16_t dest = {0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0xc3,0xc8,0xc5,0xd5};
uint8x16_t mask = {0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,
0xff,0xff,0x00,0x00,0x00,0x00,0x00,0x00};
dest = vbslq_u8(mask, src, dest);
Note that the order of bytes in the mask needs to correspond with the order in the source/dest registers (they seem to be swapped in your question?).
Also note that the first parameter of the intrinsic is the selection mask, where a 1 bit selects the corresponding bit from the second parameter and a 0 bit selects the corresponding bit from the third parameter.
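For completeness, here is a small self-contained test of that call (assuming an ARM/AArch64 toolchain with NEON enabled), which loads the three vectors from arrays, applies vbslq_u8 and prints the result so you can check it against the expected output:

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    const uint8_t src_bytes[16]  = {0x39,0x62,0x9b,0x52,0x34,0x5b,0x47,0x48,
                                    0x47,0x35,0x00,0x00,0x00,0x00,0x00,0x00};
    const uint8_t dst_bytes[16]  = {0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
                                    0x00,0x00,0x00,0x00,0xc3,0xc8,0xc5,0xd5};
    const uint8_t mask_bytes[16] = {0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,
                                    0xff,0xff,0x00,0x00,0x00,0x00,0x00,0x00};

    uint8x16_t src  = vld1q_u8(src_bytes);
    uint8x16_t dest = vld1q_u8(dst_bytes);
    uint8x16_t mask = vld1q_u8(mask_bytes);

    /* where a mask bit is 1 take the bit from src, otherwise keep dest */
    uint8x16_t result = vbslq_u8(mask, src, dest);

    uint8_t out[16];
    vst1q_u8(out, result);
    for (int i = 0; i < 16; i++)
        printf("%02x ", out[i]);
    printf("\n");  /* expected: 39 62 9b 52 34 5b 47 48 47 35 00 00 c3 c8 c5 d5 */
    return 0;
}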
