I need to select a compression algorithm when configuring a "well-known application".
Also, as part of my day job, my company is developing a distributed application that deals with a fair amount of data. We've been looking into compressing data to try to reduce network bandwidth, but we are hitting a wall on which algorithm to use. There are too many choices.
How do I decide between LZ4 and Snappy?
TL;DR The answer is always LZ4.
First, let's discuss what they have in common
They are both algorithms that are designed to operate at "wire" speed (order of 1 GB/s per core) when compressing and decompressing.
The main use case is to apply compression before writing data to disk or to the network (which usually operate nowhere near GB/s). Compressing the data reduces IO, and it's effectively transparent because the compression algorithm is so fast - faster than reading from or writing to the medium.
Both algorithms appeared in the early 2010s and can be considered relatively recent. It takes a good decade for newer technologies to gain adoption and for optimized, stable libraries to emerge in all popular languages.
They are both widely usable now and have good libraries available (I am writing this in 2021); however, that wasn't the case a few years ago.
They both compress at a similar speed and achieve a similar compression ratio (the exception is decompression speed, where LZ4 is much faster).
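If you want to check the "similar speed and ratio, much faster LZ4 decompression" claim against your own data, a microbenchmark takes a few minutes to write. Here is a minimal sketch in Python, assuming the lz4 and python-snappy packages are installed (the sample payload is made up; use a representative chunk of your real data):

```python
import time
import lz4.frame
import snappy  # from the python-snappy package

# Illustrative payload: repetitive text compresses well; substitute your own data.
data = b"GET /api/v1/users?page=42 HTTP/1.1\r\nHost: example.com\r\n" * 50_000

def bench(name, compress, decompress):
    t0 = time.perf_counter()
    compressed = compress(data)
    t1 = time.perf_counter()
    decompress(compressed)
    t2 = time.perf_counter()
    print(f"{name:8s} ratio={len(data) / len(compressed):5.2f} "
          f"compress={t1 - t0:.4f}s decompress={t2 - t1:.4f}s")

bench("lz4", lz4.frame.compress, lz4.frame.decompress)
bench("snappy", snappy.compress, snappy.decompress)
```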
For historical reference, there is a third algorithm called LZO that plays in the same league; it's much older (the paper dates from 1996) and not widely used.
Second, let's discuss the differences.
While they are both extremely fast, LZ4 is (slightly) faster and stronger, hence it should be preferred.
In particular when it comes to decompression speed, LZ4 is multiple times faster.
LZ algorithms are generally extremely fast at decompression (they can operate in constant time), that's one of the reasons they are popular. LZ4 was constructed to fully take advantage of that property and saturate CPU/memory bandwidth.
In addition, LZ4 is tunable: the compression level can be finely tuned from 1 to 16, which allows stronger compression if you have CPU to spare. It would be great if all software supporting LZ4 exposed the compression level as a setting, but not all do.
"Faster is better", of course. Yet you may be tempted to ask whether it really matters at these kinds of speeds: do we care whether we get 1 GB/s or 2 GB/s per core?
The answer is yes, because the effect is noticeable and on-the-wire compression should keep up with the hardware it's running on, including NVMe SSDs (750+ MB/s) and local networks (1.25+ GB/s).
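Coming back to the tunability point: with the Python lz4 bindings, for instance (package name assumed, and the input payload below is purely illustrative), the level is a single keyword argument; higher levels trade CPU for a better ratio:

```python
import lz4.frame

# Illustrative payload; substitute your own data.
data = b"timestamp=1700000000 level=INFO msg=request served\n" * 20_000

fast = lz4.frame.compress(data)                          # default level: fastest
strong = lz4.frame.compress(data, compression_level=12)  # high level: slower, smaller

print(len(data), len(fast), len(strong))
```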
For client-server applications where the server will be receiving and decompressing many streams from many clients, the cost of decompression can add up very quickly. One practical example is distributed queues like Kafka that have to decompress/recompress data on the fly, adapting to whatever formats the many clients can send/receive.
Another major use case is databases, where data can be compressed before being stored on disk. A well-known example is Elasticsearch: data is compressed with LZ4 out of the box (internally the data is immutable/append-only, which works very well with compression and logs). When you run a query over the last month of logs, it could be decompressing terabytes of data on the fly (1 GB/s doesn't sound that quick anymore ;) ).
Third, compatibility and availability of libraries
Last but not least, you will need to find some libraries to support whichever compression you intend to use.
Or, if we are talking about tuning a third party application/database, you will need to see what algorithms can be configured.
As of 2021, when I am writing this answer, there are mature libraries available in all popular languages for LZ4 (and Snappy, and ZSTD).
If you're developing software that could benefit from wire-speed compression, you should use LZ4. If you are looking for stronger compression (albeit slower) you can look into ZSTD instead. Forget about Snappy.
One exception may be some Java software that supports Snappy but not LZ4.
A Bit of History and Software Archeology
There is one edge case around Java software. Snappy had an optimized Java implementation for much longer, notably driven by Kafka. There's a good chance you ended up on this post because you were looking into tuning Kafka compression.
Kafka settled on Snappy compression early on and required all Kafka clients (in all languages) to support Snappy. That drove Snappy adoption and further optimization.
You may see old comparisons that put Snappy ahead, for example this extensive Kafka benchmark from Cloudflare from 2018. The reason is that the article is old and LZ4 wasn't equally supported/optimized at the time (Cloudflare couldn't use LZ4 in the end because not all clients supported it at the time).
It's hard work to retrofit more compression algorithms into an existing system. LZ4 (and ZSTD) should be supported now, but your mileage may vary. You may need to upgrade your cluster and your client libraries, and you may find that some client libraries don't support it. With the difference between Snappy and LZ4 being thin, it's not worth the hassle to switch if you've already got either working fine.
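For what it's worth, on the producer side this is usually a one-line setting. A sketch with the kafka-python client (broker address and topic are placeholders; whether "zstd" is accepted depends on your client and broker versions):

```python
from kafka import KafkaProducer

# Batches are compressed on the producer before being sent; consumers that
# understand the codec decompress them transparently.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder address
    compression_type="lz4",              # or "snappy", "gzip", "zstd"
)
producer.send("logs", b"example message")
producer.flush()
```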
On a side note: if you run across multiple datacenters and find yourself heavily limited by the network, you should have a look at ZSTD, which has much stronger compression (it can reduce network traffic by a factor of 2 or 3).
LZ4 is mature and widely usable now; it wasn't as much pre-2020 (the same is true for Snappy outside of Java). A lot of software has seen noticeable performance improvements by adopting LZ4, and then further improvements as the libraries get deeply optimized.
Related
Our company has five million users. We store users' code files. Users can edit and add their files, just like in a web IDE, and the web IDE lists the user's files. We use PHP functions to implement these operations, such as readdir, file_get_contents and file_put_contents. We use MooseFS, but reading the files from the program is slow; the loading speed in particular is a problem.
So we need to replace the file system, and I hope someone can give me some advice. We have a huge number of small files; which distributed file system should be used?
Five million entries is small for a relational database. I'd wonder why you feel the need to store these in a file system.
Does every user require that all files be loaded on startup? If yes, I'd wonder about the design of the system. That operation is O(N) no matter how you design it.
If you put those five million small files into a relational or NoSQL database, and then let each user connect to it and query for the particular ones they want, then you eliminate the need to load them repeatedly on startup. Problem solved.
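A minimal sketch of that idea in Python, using SQLite purely as a stand-in for whatever relational or NoSQL store you pick (the table name and schema are made up):

```python
import sqlite3

conn = sqlite3.connect("files.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS user_files (
        user_id INTEGER NOT NULL,
        path    TEXT    NOT NULL,
        content BLOB    NOT NULL,
        PRIMARY KEY (user_id, path)
    )
""")

def put_file(user_id, path, content):
    # Equivalent of file_put_contents(): insert or overwrite one file.
    conn.execute(
        "INSERT OR REPLACE INTO user_files VALUES (?, ?, ?)",
        (user_id, path, content),
    )
    conn.commit()

def get_file(user_id, path):
    # Equivalent of file_get_contents(): fetch exactly the file requested.
    row = conn.execute(
        "SELECT content FROM user_files WHERE user_id = ? AND path = ?",
        (user_id, path),
    ).fetchone()
    return row[0] if row else None

def list_files(user_id):
    # Equivalent of readdir(): list one user's paths, not all five million files.
    return [r[0] for r in conn.execute(
        "SELECT path FROM user_files WHERE user_id = ?", (user_id,))]

put_file(42, "src/main.py", b"print('hello')")
print(list_files(42), get_file(42, "src/main.py"))
```

The point is that listing and fetching become indexed queries scoped to one user instead of directory scans over millions of small files.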
In any distributed filesystem, one of the most crucial aspects for operations on small files is network latency - it should be as small as possible (say 0.1 ms) between the distributed filesystem components. The best way to achieve that is to use a reliable switch and connect all machines to the same switch.
Also, in distributed filesystems (especially in MooseFS) the best thing is scalability - the more nodes you have (and the more your workload is distributed, i.e. done simultaneously on more than one mount), the faster the cluster is.
If you use MooseFS, please check out MooseFS 3.0, because operations on small files have been improved since version 3.0. This is an easy step for now, because you don't have to make a "revolution" (before the upgrade, remember to back up /var/lib/mfs on the Master Server - i.e. the metadata). MooseFS can handle small files well, so maybe there's a problem in the configuration?
In MooseFS, additionally (still considering small-file operations), one of the most important things is to have a high CPU clock (e.g. 3.7 GHz) with a small number of CPU cores, and energy-saving options disabled in the BIOS, for the Master Server (because the Master Server is a single-threaded process). For Chunkservers and Clients the situation is different - they are multi-threaded, so you'll get better results with multicore CPUs.
Additionally, as stated in MooseFS Best practices in paragraph 4. "Virtual Machines and MooseFS":
[...] we do not recommend running MooseFS components (especially Master Server(s)) on Virtual Machines.
So if you run MFS on VMs, you may in fact get poor results.
According to the Prometheus webpage, one main difference between Prometheus and InfluxDB is the use case: while Prometheus stores only time series, InfluxDB is better geared towards storing individual events. Since there has been some major work done on the storage engine of InfluxDB, I wonder if this is still true.
I want to set up a time series database, and apart from the push/pull model (and probably a difference in performance) I can see nothing big that separates the two projects. Can someone explain the difference in use cases?
InfluxDB CEO and developer here. The next version of InfluxDB (0.9.5) will have our new storage engine. With that engine we'll be able to efficiently store either single event data or regularly sampled series. i.e. Irregular and regular time series.
InfluxDB supports int64, float64, bool, and string data types using different compression schemes for each one. Prometheus only supports float64.
For compression, the 0.9.5 version will have compression competitive with Prometheus. For some cases we'll see better results since we vary the compression on timestamps based on what we see. The best-case scenario is a regular series sampled at exact intervals. In that case, by default, we can compress the timestamps of 1k points as an 8-byte starting time, a delta (zig-zag encoded) and a count (also zig-zag encoded).
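To illustrate the general idea of delta + zig-zag timestamp encoding (this is a toy sketch, not InfluxDB's actual on-disk format):

```python
def zigzag(n: int) -> int:
    # Map signed integers to unsigned so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) ^ (n >> 63)

def encode_timestamps(timestamps):
    # Store the first timestamp verbatim, then zig-zag encoded deltas.
    start = timestamps[0]
    deltas = [zigzag(b - a) for a, b in zip(timestamps, timestamps[1:])]
    return start, deltas

# A perfectly regular 10-second series: every delta encodes to the same
# tiny value, which is why regular series compress so well.
ts = [1_700_000_000 + 10 * i for i in range(1000)]
start, deltas = encode_timestamps(ts)
print(start, set(deltas))   # {20}: one distinct, very small symbol
```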
Depending on the shape of the data we've seen < 2.5 bytes per point on average after compactions.
YMMV based on your timestamps, the data type, and the shape of the data. Random floats with nanosecond scale timestamps with large variable deltas would be the worst, for instance.
The variable precision in timestamps is another feature that InfluxDB has. It can represent second, millisecond, microsecond, or nanosecond scale times. Prometheus is fixed at milliseconds.
Another difference is that writes to InfluxDB are durable after a success response is sent to the client. Prometheus buffers writes in memory and by default flushes them every 5 minutes, which opens a window of potential data loss.
Our hope is that once 0.9.5 of InfluxDB is released, it will be a good choice for Prometheus users to use as long term metrics storage (in conjunction with Prometheus). I'm pretty sure that support is already in Prometheus, but until the 0.9.5 release drops it might be a bit rocky. Obviously we'll have to work together and do a bunch of testing, but that's what I'm hoping for.
For single server metrics ingest, I would expect Prometheus to have better performance (although we've done no testing here and have no numbers) because of their more constrained data model and because they don't append writes to disk before writing out the index.
The query languages of the two are very different. I'm not sure what they support that we don't yet, or vice versa, so you'd need to dig into the docs on both to see if there's something one can do that you need. Longer term, our goal is to have InfluxDB's query functionality be a superset of Graphite, RRD, Prometheus and other time series solutions. I say superset because we want to cover those in addition to more analytic functions later on. It'll obviously take us time to get there.
Finally, a longer term goal for InfluxDB is to support high availability and horizontal scalability through clustering. The current clustering implementation isn't feature complete yet and is only in alpha. However, we're working on it and it's a core design goal for the project. Our clustering design is that data is eventually consistent.
To my knowledge, Prometheus' approach is to use double writes for HA (so there's no eventual consistency guarantee) and to use federation for horizontal scalability. I'm not sure how querying across federated servers would work.
Within an InfluxDB cluster, you can query across the server boundaries without copying all the data over the network. That's because each query is decomposed into a sort of MapReduce job that gets run on the fly.
There's probably more, but that's what I can think of at the moment.
We've got the marketing message from the two companies in the other answers. Now let's ignore it and get back to the sad real world of time-series data.
Some History
InfluxDB and Prometheus were made to replace old tools from a past era (RRDtool, Graphite).
InfluxDB is a time series database. Prometheus is a sort-of metrics collection and alerting tool, with a storage engine written just for that. (I'm actually not sure you could [or should] reuse the storage engine for something else)
Limitations
Sadly, writing a database is a very complex undertaking. The only way both these tools manage to ship something is by dropping all the hard features relating to high-availability and clustering.
To put it bluntly, each is a single application running on only a single node.
Prometheus has no goal to support clustering and replication whatsoever. The official way to support failover is to "run 2 nodes and send data to both of them". Ouch. (Note that this is seriously the ONLY existing way; it's written countless times in the official documentation.)
InfluxDB had been talking about clustering for years... until it was officially abandoned in March. Clustering ain't on the table anymore for InfluxDB. Just forget it. When it is done (supposing it ever is) it will only be available in the Enterprise Edition.
https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/
Within the next few years, we will hopefully have a well-engineered time-series database that is handling all the hard problems relating to databases: replication, failover, data safety, scalability, backup...
At the moment, there is no silver bullet.
What to do
Evaluate the volume of data to be expected.
100 metrics * 100 sources * 1 sample per second => 10,000 datapoints per second => 864 mega-datapoints per day.
The nice thing about times series databases is that they use a compact format, they compress well, they aggregate datapoints, and they clean old data. (Plus they come with features relevant to time data series.)
Supposing that a datapoint is treated as 4 bytes, that's only a few Gigabytes per day. Lucky for us, there are systems with 10 cores and 10 TB drives readily available. That could probably run on a single node.
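The same back-of-the-envelope estimate as a tiny script, so you can plug in your own numbers (the 4 bytes per point is an assumption, not a measurement):

```python
metrics_per_source = 100
sources = 100
interval_s = 1                 # one sample per metric per second
bytes_per_point = 4            # assumed average size after compression

points_per_sec = metrics_per_source * sources / interval_s
points_per_day = points_per_sec * 86_400
gb_per_day = points_per_day * bytes_per_point / 1e9

print(f"{points_per_sec:,.0f} points/s, "
      f"{points_per_day / 1e6:,.0f} M points/day, "
      f"{gb_per_day:.1f} GB/day")
# -> 10,000 points/s, 864 M points/day, 3.5 GB/day
```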
The alternative is to use a classic NoSQL database (Cassandra, ElasticSearch or Riak) then engineer the missing bits in the application. These databases may not be optimized for that kind of storage (or are they? modern databases are so complex and optimized, can't know for sure unless benchmarked).
You should evaluate the capacity required by your application. Write a proof of concept with these various databases and measures things.
See if it falls within the limitations of InfluxDB. If so, it's probably the best bet. If not, you'll have to make your own solution on top of something else.
InfluxDB simply cannot hold production load (metrics) from 1000 servers. It has some real problems with data ingestion and ends up stalled/hung and unusable. We tried to use it for a while, but once the data amount reached a critical level it could not be used anymore. No memory or CPU upgrades helped.
Therefore, from our experience: definitely avoid it; it's not a mature product and has serious architectural design problems. And I am not even talking about the sudden shift to commercial licensing by Influx.
Next we researched Prometheus, and while it required rewriting our queries, it now ingests 4 times more metrics without any problems whatsoever compared to what we tried to feed to Influx. And all that load is handled by a single Prometheus server; it's fast, reliable, and dependable. This is our experience running a huge international internet shop under pretty heavy load.
IIRC current Prometheus implementation is designed around all the data fitting on a single server. If you have gigantic quantities of data, it may not all fit in Prometheus.
I am currently searching for a good distributed file system.
It should:
be open-source
be horizontally scalable (replication and sharding)
have no single point of failure
have a relatively small footprint
Here are the four most promising candidates in my opinion:
GridFS (based on MongoDB)
GlusterFS
Ceph
HekaFS
The filesystem will be used mainly for media files (images and audio). The files range from very small to medium sized (1 KB - 10 MB), and there should be several million of them.
Are there any benchmarks regarding performance, CPU-load, memory-consumption and scalability? What are your experiences using these or other distributed filesystems?
I'm not sure your list is quite correct. It depends on what you mean by a file system.
If you mean a file system that is mountable in an operating system and usable by any application that reads and writes files using POSIX calls, then GridFS doesn't really qualify. It is just how MongoDB stores BSON-formatted objects. It is an Object system rather than a File system.
There is a project to make GridFS mountable, but it is a little weird because GridFS doesn't have concepts for things like hierarchical directories, although paths are allowed. Also, I'm not sure how distributed writes on gridfs-fuse would be.
GlusterFS and Ceph are comparable and are distributed, replicable mountable file systems. You can read a comparison between the two here (and followup update of comparison), although keep in mind that the benchmarks are done by someone who is a little biased. You can also watch this debate on the topic.
As for HekaFS, it is GlusterFS that is set up for cloud computing, adding encryption and multitenancy as well as an administrative UI.
After working with Ceph for 11 months I came to the conclusion that it utterly sucks, so I suggest avoiding it. I tried XtreemFS, RozoFS and QuantcastFS but did not find them good enough either.
I wholeheartedly recommend LizardFS which is a fork of now proprietary MooseFS. LizardFS features data integrity, monitoring and superior performance with very few dependencies.
2019 update: situation has changed and LizardFS is not actively maintained any more.
MooseFS is stronger than ever and free from most LizardFS bugs. MooseFS is well maintained and it is faster than LizardFS.
RozoFS has matured and maybe worth a try.
GfarmFS has its niche, but today I would choose MooseFS for most applications.
OrangeFS, anyone?
I am looking for an HPC DFS and found this discussion here:
http://forums.gentoo.org/viewtopic-t-901744-start-0.html
Lots of good data and comparisons :)
After some discussion the OP decided on OrangeFS, quoting:
"OrangeFS. It does not support quotas nor file locks (though all I/O operations are atomic and this way consistency is kept without locks). But it works, and works well and stably. Furthermore, this is not a general file-storage-oriented system, but an HPC-dedicated one, targeted at parallel I/O including ROMIO support. All tests were done for stripe data distribution.
a) No quotas - to hell with quotas. I gave up on them anyway; even glusterfs supports not common uid/gid-based quotas, but directory size limitations, more like how LVM works.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata storage (single node) this gives +50% performance on small files and no significant difference on large ones.
c) Excellent performance on large data chunks (dd bs=1M). It is limited by the sum of the local hard drive speed (do not forget each node participates as a data server as well) and the available network bandwidth. CPU consumption under such load is decent: about 50% of a single core on the client node and about 10% on each of the other data server nodes.
d) Fair performance on large sets of small files. For the test I untarred linux kernel 3.1. It took 5 minutes over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison. CPU load is about 50% of a single core (of course, it is actually distributed between cores) on the client and a few percent on each node.
e) Support for the ROMIO MPI I/O API. This is a sweet treat for MPI-aware applications, which allows using PVFS2/OrangeFS parallel input-output features directly from applications.
f) No support for special files (sockets, fifos, block devices). Thus it can't safely be used as /home, and I use NFSv4 for that task, providing users a quota-restricted small home space. Though most distributed filesystems don't support special files anyway."
I do not know about the other systems you posted, but I have made a comparison of 3 PHP CMS/frameworks on local storage vs GlusterFS to see if it does better on real-world tests than on raw benchmarks. Sadly, it does not.
http://blog.lavoie.sl/2013/12/glusterfs-performance-on-different-frameworks.html
I am trying to create a key/value database with 300,000,000 key/value pairs of 8 bytes each (both for the key and the value). The requirement is to have a very fast key/value mechanism which can handle about 500,000 queries per second.
I tried BDB, Tokyo DB, Kyoto DB, and LevelDB, and they all perform very badly when it comes to databases of that size. (Their performance is not even close to their benchmarked rate at 1,000,000 entries.)
I cannot store my database in memory because of hardware limitations (32-bit software), so memcached is out of the question.
I cannot use external server software either (only a database module), and there is no need for multi-user support at all. Of course server software cannot handle 500,000 queries per second from a single endpoint anyway, so that leaves out Redis, Tokyo Tyrant, etc.
David Segleau, here. Product Manager for Berkeley DB.
The most common problem with BDB performance is that people don't configure the cache size, leaving it at the default, which is pretty small. The second most common problem is that people write application behavior emulators that do random look-ups (even though their application is not really completely random), which forces them to read data out of cache. The random I/O then takes them down a path of conclusions about performance that are based on the simulated application rather than the actual application behavior.
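For example, with the Python bsddb3 bindings (assuming that's what you're using; the 2 GB figure below is purely illustrative - size the cache to your working set), the cache is configured on the handle before open():

```python
from bsddb3 import db   # assumed: the bsddb3 Python bindings for Berkeley DB

d = db.DB()
# The cache must be configured before open(); the default is only a few
# hundred KB, far too small for 300M records. 2 GB here is illustrative.
d.set_cachesize(2, 0)                       # (gigabytes, extra bytes)
d.open("kv.db", None, db.DB_HASH, db.DB_CREATE)

key = (12345).to_bytes(8, "big")            # 8-byte key
d.put(key, (67890).to_bytes(8, "big"))      # 8-byte value
print(int.from_bytes(d.get(key), "big"))
d.close()
```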
From your description, I'm not sure if you're running into these common problems or maybe into something else entirely. In any case, our experience is that Berkeley DB tends to perform and scale very well. We'd be happy to help you identify any bottlenecks and improve your BDB application throughput. The best place to get help in this regard is the BDB forums at: http://forums.oracle.com/forums/forum.jspa?forumID=271. When you post to the forum, it would be useful to show the critical query segments of your application code and the db_stat output showing the performance of the database environment.
It's likely that you will want to use BDB HA/Replication in order to load-balance the queries across multiple servers. 500K queries/second is probably going to require a larger multi-core server or a series of smaller replicated servers. We've frequently seen BDB applications with 100-200K queries/second on commodity hardware, but 500K queries per second on 300M records in a 32-bit application is likely going to require some careful tuning. I'd suggest focusing on optimizing the performance of the queries in the BDB application running on a single node, and then using HA to distribute that load across multiple systems in order to scale your queries/second throughput.
I hope that helps.
Good luck with your application.
Regards,
Dave
I found a good benchmark comparison web page that basically compares 5 renowned databases:
LevelDB
Kyoto TreeDB
SQLite3
MDB
BerkeleyDB
You should check it out before making your choice: http://symas.com/mdb/microbench/.
P.S. - I know you've already tested some of them, but consider that your configuration in each of those tests may not have been optimal; the benchmark suggests these databases can do better.
Try ZooLib.
It provides a database with a C++ API, that was originally written for a high-performance multimedia database for educational institutions called Knowledge Forum. It could handle 3,000 simultaneous Mac and Windows clients (also written in ZooLib - it's a cross-platform application framework), all of them streaming audio, video and working with graphically rich documents created by the teachers and students.
It has two low-level APIs for actually writing your bytes to disk. One is very fast but is not fault-tolerant. The other is fault-tolerant but not as fast.
I'm one of ZooLib's developers, but I don't have much experience with ZooLib's database component. There is also no documentation - you'd have to read the source to figure out how it works. That's my own damn fault, as I took on the job of writing ZooLib's manual over ten years ago, but barely started it.
ZooLib's primary developer, Andy Green, is a great guy and always happy to answer questions. What I suggest you do is subscribe to ZooLib's developer list at SourceForge, then ask on the list how to use the database. Most likely Andy will answer you himself, but maybe one of our other developers will.
ZooLib is Open Source under the MIT License, and is really high-quality, mature code. It has been under continuous development since 1990 or so, and was placed in Open Source in 2000.
Don't be concerned that we haven't released a tarball since 2003. We probably should, as this leads lots of potential users to think it's been abandoned, but it is very actively used and maintained. Just get the source from Subversion.
Andy is a self-employed consultant. If you don't have time but you do have a budget, he would do a very good job of writing custom, maintainable top-quality C++ code to suit your needs.
I would too, if it were any part of ZooLib other than the database, which as I said I am unfamiliar with. I've done a lot of my own consulting work with ZooLib's UI framework.
300M * 8 bytes = 2.4 GB. That will probably fit into memory (if the OS does not restrict the address space to 31 bits).
Since you'll also need to handle overflow (either by a rehashing scheme or by chaining), memory gets even tighter: for linear probing you probably need > 400M slots, and chaining will increase the size of an item to 12 bytes (bit fiddling might gain you a few bits back). That would increase the total footprint to circa 3.6 GB.
In any case you would need a specially crafted kernel that restricts its own "reserved" address space to a few hundred MB. Not impossible, but a major operation. Escaping to something disk-based would be too slow in all cases. (PAE could save you, but it is tricky.)
IMHO your best choice would be to migrate to a 64-bit platform.
500,000 entries per second without holding the working set in memory? Wow.
In the general case this is not possible using HDDs, and it is difficult even with SSDs.
Have you any locality properties that might help to make the task a bit easier? What kind of queries do you have?
We use Redis. Written in C, it's only slightly more complicated than memcached by design. We've never tried to use that many rows, but for us latency is very important; it handles those latencies well and lets us store the data on disk.
Here is a benchmark blog entry comparing Redis and memcached.
Berkeley DB could do it for you.
I achieved 50,000 inserts per second about 8 years ago, with a final database of 70 billion records.
I will be in my final year (Electrical and Computer Engineering) next semester, and I am searching for a graduation project in embedded systems or hardware design. My professor advised me to look for an existing system and try to improve it using hardware/software codesign, and he gave me the example of an "Automated License Plate Recognition" system, where I could use dedicated hardware written in VHDL or Verilog to make the system perform better.
I have searched a bit and found some YouTube videos showing the system working OK.
So I don't know if there is any room for improvement. How do I know whether certain algorithms or systems are slow and can benefit from codesign?
How to know if certain algorithms or systems are slow and can benefit from codesign?
In many cases, this is an architectural question that is only answered with large amounts of experience or even larger amounts of system modeling and analysis. In other cases, 5 minutes on the back of an envelope could show you that a specialized co-processor adds weeks of work but no performance improvement.
An example of a hard case is any modern mobile phone processor. Take a look at the TI OMAP5430. Notice it has at least 10 processors of varying types (the PowerVR block alone has multiple execution units) and dozens of full-custom peripherals. Any time you wish to offload something from the 'main' CPUs, there is a potential bus bandwidth/silicon area/time-to-market cost that has to be considered.
An easy case would be something like what your professor mentioned. A DSP/GPU/FPGA will perform image processing tasks, like 2D convolution, orders of magnitude faster than a CPU. But 'housekeeping' tasks like file-management are not something one would tackle with an FPGA.
In your case, I don't think that your professor expects you to do something 'real'. I think what he's looking for is your understanding of what CPUs/GPUs/DSPs are good at, and what custom hardware is good at. You may wish to look for an interesting niche problem, such as those in bioinformatics.
I don't know what codesign is, but I did some Verilog before; I think simple image (or signal) processing tasks are good candidates for such embedded systems, because many of them involve real-time processing of massive loads of data (preferably SIMD operations).
Image processing tasks often look easy, because our brain does mind-bogglingly complex processing for us, but actually they are very challenging. I think this challenge is what's important, not whether such a system has been implemented before. I would go with implementing the Hough transform (first for lines and circles, then the generalized one - it's considered a slow algorithm in image processing) and do some real-time segmentation. I'm sure it will be a challenging task as it evolves.
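As a starting point for the software reference you would later compare against a Verilog implementation, a naive line Hough transform is only a few lines of NumPy. This is an unoptimized sketch:

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Naive line Hough transform over a binary edge image.

    Every edge pixel votes for all (rho, theta) lines passing through it;
    peaks in the accumulator correspond to dominant lines in the image.
    """
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    accumulator = np.zeros((2 * diag, n_theta), dtype=np.uint32)

    ys, xs = np.nonzero(edges)                  # coordinates of edge pixels
    for theta_idx, theta in enumerate(thetas):
        rhos = (xs * np.cos(theta) + ys * np.sin(theta)).round().astype(int)
        np.add.at(accumulator, (rhos + diag, theta_idx), 1)
    return accumulator, thetas

# Tiny synthetic test: a diagonal line should produce one strong peak.
img = np.eye(64, dtype=bool)
acc, thetas = hough_lines(img)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
print("peak votes:", acc.max(), "theta (deg):", np.degrees(thetas[theta_idx]))
```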
First thing to do when partitioning is to look at the dataflows. Draw a block diagram of where each of the "subalgorithms" fits, along with the data going in and out. Anytime you have to move large amounts of data from one domain to another, start looking to move part of the problem to the other side of the split.
For example, consider an image processing pipeline which does an edge-detect followed by a compare with threshold, then some more processing. The output of the edge-detect will be (say) 16-bit signed values, one for each pixel. The final output is a binary image (a bit set indicates where the "significant" edges are).
One (obviously naive, but it makes the point) implementation might be to do the edge detect in hardware, ship the edge image to software and then threshold it. That involves shipping a whole image of 16-bit values "across the divide".
Better, do the threshold in hardware as well. Then you only have to ship 8 "1-bit pixels" per byte (or even run-length encode it).
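The payoff is easy to quantify. A rough sketch for a VGA-sized stream (the resolution and frame rate are just example numbers):

```python
width, height, fps = 640, 480, 30          # example video format

pixels = width * height
edge_image_bytes = pixels * 2              # 16-bit signed value per pixel
binary_image_bytes = pixels // 8           # 8 one-bit pixels packed per byte

print(f"edge image  : {edge_image_bytes * fps / 1e6:6.1f} MB/s across the divide")
print(f"binary image: {binary_image_bytes * fps / 1e6:6.1f} MB/s across the divide")
# Thresholding in hardware cuts the transferred data by a factor of 16,
# before even considering run-length encoding.
```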
Once you have a sensible bandwidth partition, you have to find out if the blocks that fit in each domain are a good fit for that domain, or maybe consider a different partition.
I would add that in general, HW/SW codesign is useful when it reduces cost.
There are 2 major cost factors in embedded systems:
development cost
production cost
The higher your production volume, the more important the production cost becomes, and the less important the development cost.
Today it is harder to develop hardware than software. That means that the development cost of a codesign solution will be higher today, which means it is useful mostly for high-volume production. However, you need FPGAs (or similar) to do codesign today, and they cost a lot.
That means that codesign is useful when the cost of the necessary FPGA is lower than that of an existing solution for your type of problem (CPU, GPU, DSP, etc.), assuming both solutions meet your other requirements. And that will be the case (mostly) for high-performance systems, because FPGAs are costly today.
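A crude way to see where the break-even volume sits; every number below is made up purely for illustration, so plug in your own estimates:

```python
# Hypothetical costs: a pure-software solution on a high-end CPU vs. a
# codesigned solution pairing an FPGA with a smaller CPU.
sw_dev_cost, sw_unit_cost = 50_000, 120   # cheaper to develop, pricier per unit
hw_dev_cost, hw_unit_cost = 250_000, 60   # costly HW development, cheaper units

for volume in (1_000, 5_000, 20_000, 100_000):
    sw_total = sw_dev_cost + sw_unit_cost * volume
    hw_total = hw_dev_cost + hw_unit_cost * volume
    winner = "codesign" if hw_total < sw_total else "software-only"
    print(f"{volume:>7} units: software={sw_total:>10,}  codesign={hw_total:>10,}  -> {winner}")
```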
So, basically you will want to codesign your system if it will be produced in high volumes and it is a high-performance device.
This is a bit simplified and might become false in a decade or so. There is ongoing research on HW/SW synthesis from high-level specifications, and FPGA prices are falling. That means that in a decade or so codesign might become useful for most embedded systems.
For any project you end up doing, my suggestion would be to make a software version and a hardware version of the algorithm to do a performance comparison. You can also compare development time, etc. This will make your project a lot more scientific and helpful to everyone else, should you choose to publish anything. Blindly assuming hardware is faster than software is not a good idea, so profiling is important.