Hardware specification recommendation for Solr

I am looking for a hardware specification for the Solr search engine. Our requirement is to build a search system which indexes about 5 to 9 million documents. The peak load is around 50 queries per second. I checked the Dell website and think that a rack server might be a good fit, so I put together a sample configuration. What do you think of my choice? Do you have any experience with hardware specifications for Solr systems?
PowerEdge R815
Chassis: R815 Chassis for Up to Six 2.5 Inch Hard Drives
Processor: 2x AMD Opteron 6276, 2.3GHz, 16C, Turbo CORE, 16M L2/16M L3, 1600MHz Max Mem
Additional Processor: No 3rd/4th Processors
Operating System: No Operating System
OS Media Kits: None
OS and SW Client Access Licenses: None
Memory: 64GB Memory (8x8GB), 1333MHz, Dual Ranked LV RDIMMs for 2 Processors
Hard Drive Configuration: No RAID for PERC H200 Controllers (Non-Mixed Drives)
Internal Controller: PERC H200 Integrated RAID Controller
Hard Drives: 1TB 7.2K RPM SATA 2.5in Hot-plug Hard Drive
Data Protection Offers: None
Embedded Management: iDRAC6 Express
System Documentation: Electronic System Documentation and OpenManage DVD Kit
Network Adapter: Intel® Gigabit ET NIC, Dual Port, Copper, PCIe-4
Network Adapter: Intel® Gigabit ET NIC, Dual Port, Copper, PCIe-4
Host Bus Adapter/Converged Network Adapter: None
Power Supply: 1100 Watt Redundant Power Supply
Power Cords: NEMA 5-15P to C13 Wall Plug, 125 Volt, 15 AMP, 10 Feet (3m), Power Cord
BIOS Setting: Performance BIOS Setting
Rails: No Rack Rail or Cable Management Arm
Bezel: PowerEdge R815 Bezel
Internal Optical Drive: DVD ROM, SATA, Internal

I agree with Marko (not myself, the other Marko :).
You should use e.g. jMeter to test the capabilities of your configuration (the most important metric of course being how response time changes with the number of parallel users) and then make an educated decision based on those results.
Be prepared to play with JVM memory settings in order to see how that affects overall performance.
I'd also test various application servers to see how that decision affects response time.
PS: If you choose to use jMeter, you should definitely make use of jMeter Plugins, whose Composite Graph will allow you to show the number of parallel users and the response time together with the server's processor, memory and network load on the same graph.

This is a hugely open-ended question with far too many unknown details - the straw-man hardware spec is really not very useful (TL;DR).
There is only one sensible way to go about tackling this problem, and that is empirically.

Related

How do I increase the speed of my USB CDC device?

I am upgrading the processor in an embedded system for work. This is all in C, with no OS. Part of the upgrade includes migrating the processor-to-PC communications interface from IEEE-488 to USB. I finally got the USB firmware written and have been testing it. It was going great until I tried to push through lots of data, only to discover that my USB connection is slower than the old IEEE-488 connection. I have the USB device enumerating as a CDC device with a baud rate of 115200 bps, but it is clear that I am not even reaching that throughput. I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong. I control every aspect of this, from the front end on the PC to the firmware on the embedded system.
I am assuming my issue is how I write to the USB on the embedded-system side. Right now my USB_Write function runs in free time and is just a while loop that writes one char to the USB port until the write buffer is empty. Is there a more efficient way to do this?
One of my concerns is that in the old system we had a board dedicated to communications. The CPU would just write data across a bus to this board, which would handle the communications, so the CPU didn't have to waste free time handling the actual communications but could offload them to a "co-processor" (not a CPU, but functionally the same here). Even with this concern, I figured I should be getting faster speeds, given that full-speed USB is on the order of MB/s while IEEE-488 is on the order of kB/s.
In short, is this more likely a fundamental system constraint or a software optimization issue?
I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong.
You are correct, the baud number is a dummy value. If you were creating a CDC-to-RS232 adapter you would use it to configure your RS232 hardware; in this case it means nothing.
Is there a more efficient way to do this?
Absolutely! You should be writing chunks of data the same size as your USB endpoint for maximum transfer speed. Depending on the device you are using, your stream of single-byte writes may be gathered into a single packet before sending, but from my experience (and your results) this is unlikely.
Depending on your latency requirements, you can stick in a circular buffer and only issue data from it to the USB_Write function when you have ENDPOINT_SZ bytes. If this results in excessive latency, or your interface is not always communicating, you may want to implement Nagle's algorithm.
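To make that concrete, here is a minimal sketch of the circular-buffer approach, assuming a bare-metal setup with an ENDPOINT_SZ-byte data endpoint and a low-level usb_write_packet() call that queues one full packet on the IN endpoint - both names are placeholders for whatever your USB stack actually provides, not a real API.

/* Minimal sketch of endpoint-sized buffering. ENDPOINT_SZ and
 * usb_write_packet() are placeholders - substitute the equivalents
 * from your USB stack. */

#include <stdint.h>

#define ENDPOINT_SZ 64u          /* full-speed CDC data endpoint size       */
#define TX_BUF_SZ   1024u        /* power of two so the index wrap is cheap */

static uint8_t  tx_buf[TX_BUF_SZ];
static volatile uint32_t tx_head;  /* producer index (application)      */
static volatile uint32_t tx_tail;  /* consumer index (USB service loop) */

/* Hypothetical low-level call provided by the USB stack: queues 'len'
 * bytes (up to ENDPOINT_SZ) on the IN endpoint in one transaction. */
extern void usb_write_packet(const uint8_t *data, uint16_t len);

/* Application side: drop bytes into the ring buffer instead of calling
 * the USB stack one character at a time. */
int tx_enqueue(uint8_t byte)
{
    uint32_t next = (tx_head + 1u) & (TX_BUF_SZ - 1u);
    if (next == tx_tail)
        return -1;               /* buffer full - caller decides what to do */
    tx_buf[tx_head] = byte;
    tx_head = next;
    return 0;
}

/* Called from the free-time loop (where USB_Write used to be): send one
 * endpoint-sized packet per pass rather than one byte. */
void usb_tx_service(void)
{
    uint8_t  packet[ENDPOINT_SZ];
    uint16_t n = 0;

    while (n < ENDPOINT_SZ && tx_tail != tx_head) {
        packet[n++] = tx_buf[tx_tail];
        tx_tail = (tx_tail + 1u) & (TX_BUF_SZ - 1u);
    }
    if (n > 0)
        usb_write_packet(packet, n);   /* ideally n == ENDPOINT_SZ */
}

If latency matters, you can additionally flush a partial packet from the free-time loop after a timeout, which is essentially the Nagle-style behaviour mentioned above.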
One of my concerns is that in the old system we had a board dedicated to communications.
The NXP part you mentioned in the comments is without a doubt fast enough to saturate a USB full speed connection.
In short, is this more likely a fundamental system constraint or a software optimization issue?
I would consider this a software design issue rather than an optimisation one, but no, it is unlikely you are fundamentally stuck.
Do take care to figure out exactly what sort of USB connection you are using, though: with USB 1.1 you will be limited to 64KB/s, and with USB 2.0 full speed you will be limited to 512KB/s. If you require higher throughput, you should migrate to using a separate bulk endpoint for the data transfer.
I would recommend reading through the USB made simple site to get a good overview of the various USB speeds and their capabilities.
One final issue: vendor CDC libraries are not always the best, and implementations of the CDC standard can vary. You can theoretically get more data through a CDC endpoint by using larger endpoints, but I have seen this bring host-side drivers to their knees - if you go this route, create a custom driver using bulk endpoints.
Try testing your device on multiple systems; you may find you get quite different results between Windows and Linux. This will help to point the finger at the host end.
And finally, make sure you are doing big buffered reads on the host side; USB will stop transferring data once the host-side buffers are full.
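As an illustration of what "big buffered reads" means in practice, here is a minimal sketch for a Linux host, where a CDC ACM device typically shows up as /dev/ttyACM0; the device path, buffer size and raw-mode settings are assumptions to adapt to your setup.

/* Minimal host-side sketch: open the CDC serial device in raw mode and
 * drain it in large chunks so the kernel/host controller never stalls
 * waiting for the application to empty its buffers. */

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <termios.h>

int main(void)
{
    int fd = open("/dev/ttyACM0", O_RDONLY | O_NOCTTY);  /* adjust path */
    if (fd < 0) { perror("open"); return 1; }

    /* Raw mode: no line editing, return as soon as data is available. */
    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) { perror("tcgetattr"); return 1; }
    cfmakeraw(&tio);
    tio.c_cc[VMIN]  = 1;   /* block until at least one byte arrives */
    tio.c_cc[VTIME] = 0;
    if (tcsetattr(fd, TCSANOW, &tio) != 0) { perror("tcsetattr"); return 1; }

    unsigned char buf[64 * 1024];   /* one large read instead of many tiny ones */
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* process n bytes here */
        fprintf(stderr, "got %zd bytes\n", n);
    }

    close(fd);
    return 0;
}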

Why does my WPF application (in software-only render mode) run slowly on a Citrix server?

I have a WPF application. Because our Citrix server doesn't have a dedicated graphics card, I have to set the render mode to software-only with:
RenderOptions.ProcessRenderMode = System.Windows.Interop.RenderMode.SoftwareOnly;
It still runs fluently on my Sony Duo 13:
Windows 8.1
Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz (4 CPUs), ~1.6GHz
8192MB RAM
Intel(R) HD Graphics Family; approx. 1792 MB total memory
DirectDraw acceleration enabled; Direct3D acceleration enabled; AGP texture acceleration enabled
But it runs badly on the Citrix server (with only one application running):
Windows Server 2008 R2 Standard 64-bit (6.1, build 7601)
VMware Virtual Platform
Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (4 CPUs), ~2.9GHz
4096MB RAM
No graphics card info
I verified that my application was running in software mode using the Windows Performance Toolkit: the entire window of my application is tinted purple (which indicates software rendering).
Apart from the graphics card, our Citrix server is more powerful than my Sony ultrabook, so why does it perform badly?
Thanks
Because your Sony doesn't have to send all the graphical updates over the network to another machine. The network kills graphical performance when you're remoting applications. This is particularly noticeable on mobile devices using 3G or WiFi, since the latency is quite high. Every single graphical update your app makes is encoded into a JPEG and then sent over the wire to the receiver, so you're looking at tens if not hundreds of milliseconds for updates to be seen on the client, depending on network conditions.
Performance for graphically intensive apps is getting better on Citrix. Later versions of the server/receiver support H.264 encoding, which improves performance considerably, e.g.:
http://blogs.citrix.com/2013/11/06/go-supersonic-with-xendesktop-7-x-bandwidth-supercodecs/
Other technology like Framehawk, which improves performance in poor network conditions, is also being integrated into the Citrix stack:
http://blogs.citrix.com/2014/01/08/framehawk-will-take-our-hdx-technology-to-the-limit/

Is the shared L2 cache in multicore processors multiported? [duplicate]

The Intel Core i7 has per-core L1 and L2 caches and a large shared L3 cache. I need to know what kind of interconnect connects the multiple L2s to the single L3. I am a student and need to write a rough behavioral model of the cache subsystem.
Is it a crossbar? A single bus? A ring? The references I came across mention structural details of the caches, but none of them mention what kind of on-chip interconnect exists.
Thanks,
-neha
Modern i7's use a ring. From Tom's Hardware:
"Earlier this year, I had the chance to talk to Sailesh Kottapalli, a senior principal engineer at Intel, who explained that he'd seen sustained bandwidth close to 300 GB/s from the Xeon 7500-series' LLC, enabled by the ring bus. Additionally, Intel confirmed at IDF that every one of its products currently in development employs the ring bus."
Your model will be very rough, but you may be able to glean more detail from the public information on the i7 performance counters pertaining to the L3.
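If it helps, a rough behavioural model of the ring can be as simple as charging a fixed cost per ring stop between the requesting core and the L3 slice that owns the line. The sketch below is purely illustrative: the stop layout, hop cost, slice-selection hash and latency numbers are made-up parameters, not Intel's actual values.

/* Illustrative ring-interconnect latency model: each core and each L3
 * slice sits at a ring stop; a request pays a fixed cost per hop to the
 * slice that owns the cache line, plus the slice access time.  All
 * numbers here are placeholders, not measured Intel values. */

#include <stdio.h>
#include <stdint.h>

#define NUM_STOPS        8   /* assume 4 cores + 4 L3 slices, alternating */
#define HOP_CYCLES       1   /* assumed cost of moving one ring stop      */
#define SLICE_ACCESS_CYC 30  /* assumed L3 slice lookup latency           */

/* Pick the L3 slice that owns an address.  Real CPUs use an undocumented
 * hash of the physical address; a simple modulo stands in for it here.   */
static int slice_for_addr(uint64_t addr)
{
    return (int)((addr >> 6) % 4);          /* 64-byte lines, 4 slices */
}

/* Ring distance: traffic can go either direction, so take the shorter way. */
static int ring_hops(int from_stop, int to_stop)
{
    int d = (to_stop - from_stop + NUM_STOPS) % NUM_STOPS;
    return d < NUM_STOPS - d ? d : NUM_STOPS - d;
}

/* Model an L2 miss from 'core' (cores at even stops, slices at odd stops). */
static int l3_access_cycles(int core, uint64_t addr)
{
    int core_stop  = core * 2;
    int slice_stop = slice_for_addr(addr) * 2 + 1;
    return ring_hops(core_stop, slice_stop) * HOP_CYCLES + SLICE_ACCESS_CYC;
}

int main(void)
{
    for (int core = 0; core < 4; core++)
        printf("core %d -> addr 0x1040: %d cycles\n",
               core, l3_access_cycles(core, 0x1040));
    return 0;
}

Calibrating the per-hop and slice-access constants against published L3 latencies (or your own measurements via those performance counters) would make the model a little less rough.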

Raspberry Pi cluster, neural networks and brain simulation

Since the RBPI (Raspberry Pi) has very low power consumption and a very low production price, one could build a very big cluster out of them. I'm not sure, but a cluster of 100,000 RBPIs would take little power and little room.
Now, I think it might not be as powerful as existing supercomputers in terms of FLOPS or other sorts of computing measurements, but could it allow better neural network simulation?
I'm not sure if saying "1 CPU = 1 neuron" is a reasonable statement, but it seems valid enough.
So would such a cluster be more efficient for neural network simulation, since it's far more parallel than other classical clusters?
Using Raspberry Pi itself doesn't solve the whole problem of building a massively parallel supercomputer: how to connect all your compute cores together efficiently is a really big problem, which is why supercomputers are specially designed, not just made of commodity parts. That said, research units are really beginning to look at ARM cores as a power-efficient way to bring compute power to bear on exactly this problem: for example, this project that aims to simulate the human brain with a million ARM cores.
http://www.zdnet.co.uk/news/emerging-tech/2011/07/08/million-core-arm-machine-aims-to-simulate-brain-40093356/ "Million-core ARM machine aims to simulate brain"
http://www.eetimes.com/electronics-news/4217840/Million-ARM-cores-brain-simulator "A million ARM cores to host brain simulator"
It's very specialist, bespoke hardware, but conceptually, it's not far from the network of Raspberry Pis you suggest. Don't forget that ARM cores have all the features that JohnB mentioned the Xeon has (Advanced SIMD instead of SSE, can do 64-bit calculations, overlap instructions, etc.), but sit at a very different MIPS-per-Watt sweet-spot: and you have different options for what features are included (if you don't want floating-point, just buy a chip without floating-point), so I can see why it's an appealing option, especially when you consider that power use is the biggest ongoing cost for a supercomputer.
Seems unlikely to be a good/cheap system to me.
Consider a modern Xeon CPU. It has 8 cores running at 5 times the clock speed, so just on that basis it can do 40 times as much work. Plus it has SSE, which seems suited to this application and will let it calculate 4 things in parallel. So we're up to maybe 160 times as much work. Then it has multithreading, can do 64-bit calculations, overlaps instructions, etc. I would guess it would be at least 200 times faster for this kind of work.
Then finally, the results of at least 200 local "neurons" would be in local memory, but on the Raspberry Pi network you'd have to communicate between 200 of them... which would be very much slower.
I think the Raspberry Pi is great and I certainly plan to get at least one :P But you're not going to build a cheap and fast network of them that will compete with a network of "real" computers :P
Anyway, the fastest hardware for this kind of thing is likely to be a graphics card (GPU), as it's designed to run many copies of a small program in parallel. Or just program an FPGA with a few hundred copies of a "hardware" neuron.
GPUs and FPGAs do this kind of thing much better than a CPU. The Nvidia GPUs that support CUDA programming have, in effect, hundreds of separate processing units. Or at least they can use the evolution of the pixel pipelines (where the card could render multiple pixels in parallel) to produce huge increases in speed. A CPU gives you a few cores that can carry out relatively complex steps; a GPU gives you hundreds of threads that can carry out simple steps.
So for tasks where you have simple threads, something like a single GPU will outperform a cluster of beefy CPUs (or a stack of Raspberry Pis).
However, for creating a cluster running something like Condor, which can be used for things like disease-outbreak modelling, where you run the same mathematical model millions of times with variable starting points (size of outbreak, wind direction, how infectious the disease is, etc.), something like the Pi would be ideal, as you are generally looking for a full-blown CPU that can run standard code. http://research.cs.wisc.edu/condor/
Some well-known uses of this approach are SETI and Folding@home (searching for aliens and cancer research).
A lot of universities have a cluster like this, so I can see some of them trying the approach of multiple Raspberry Pis.
But for simulating neurons in a brain you require very low latency between the nodes; these are special OSes and applications that make multiple systems act as one. You also need special networks to link it all together, giving latencies between nodes of under 1 millisecond.
http://en.wikipedia.org/wiki/InfiniBand
The Raspberry Pi just will not manage this in any way.
So yes, I think people will make clusters out of them and I think they will be very pleased, but mostly universities and small organisations. They aren't going to compete with the top supercomputers.
That said, we are going to get a few and test them against the current nodes in our cluster to see how they compare with a desktop that has a dual-core 3.2GHz CPU and cost £650! I reckon for that we could get 25 Raspberry Pis, and they will use much less power, so it will be interesting to compare. This will be for disease-outbreak modelling.
I am undertaking a large amount of neural network research in the area of chaotic time series prediction (with echo state networks).
Although I think using Raspberry Pis in this way will offer little to no benefit over, say, a strong CPU or a GPU, I have been using a Raspberry Pi to manage the distribution of simulation jobs to multiple machines. The processing power of a big core comfortably outstrips anything possible on the Raspberry Pi; not only that, but running multiple Pis in this configuration will generate large overheads of waiting for them to sync, transfer data, etc.
Due to the low cost and robustness of the Pi, I have it hosting the source of the network data as well as mediating the jobs to the agent machines. It can also hard-reset and restart a machine if a simulation fails and takes the machine down with it, allowing for optimal uptime.
Neural networks are expensive to train but very cheap to run. While I would not recommend using these (even clustered) to iterate over a learning set for endless epochs, once you have the weights you can transfer the learning effort onto them.
Used in this way, one Raspberry Pi should be useful for much more than a single neuron. Given the ratio of memory to CPU, it will likely be memory-bound in its scale. Assuming about 300 MB of free memory to work with (which will vary with OS/drivers/etc.) and 8-byte double-precision weights, you will have an upper limit on the order of 5000 "neurons" (before becoming storage-bound), although so many other factors can change this that it is like asking "How long is a piece of string?"
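For what it's worth, here is the back-of-the-envelope arithmetic behind that kind of bound, assuming (as the estimate above appears to) a fully connected network where every neuron holds an 8-byte weight for every other neuron; both the 300 MB budget and the full connectivity are assumptions, not measurements.

/* Back-of-the-envelope estimate: with a fully connected net, N neurons
 * need roughly N*N weights, so the memory budget bounds N at about
 * sqrt(bytes_free / bytes_per_weight).  Build with: cc estimate.c -lm */

#include <stdio.h>
#include <math.h>

int main(void)
{
    double bytes_free       = 300.0 * 1024 * 1024;  /* ~300 MB usable RAM (assumed) */
    double bytes_per_weight = 8.0;                  /* double precision             */

    double max_neurons = sqrt(bytes_free / bytes_per_weight);
    printf("upper bound: ~%.0f fully connected neurons\n", max_neurons);
    return 0;
}

With those numbers it comes out at just over 6000 neurons, the same ballpark as the ~5000 quoted above; sparser connectivity raises the bound accordingly.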
Some engineers at Southampton University built a Raspberry Pi supercomputer:
I have ported a spiking network (see http://www.raspberrypi.org/phpBB3/viewtopic.php?f=37&t=57385&e=0 for details) to the Raspberry Pi and it runs about 24 times slower than on my old Pentium-M notebook from 2005 with SSE and prefetch optimizations.
It all depends on the type of computing you want to do. If you are doing very numerically intensive algorithms without much memory movement between the processor caches and RAM, then a GPU solution is indicated. The middle ground is an Intel PC chip using the SIMD assembly instructions - you can still easily end up limited by the rate at which you can transfer data to and from RAM. For nearly the same cost you can get 50 ARM boards with, say, 4 cores per board and 2GB RAM per board. That's 200 cores and 100GB of RAM. The amount of data that can be shuffled between the CPUs and RAM per second is very high. It could be a good option for neural nets that use large weight vectors. Also, the latest ARM GPUs and the new Nvidia ARM-based chip (used in slate tablets) have GPU compute as well.

Open-source embedded web server in C supporting SOAP / JSON-RPC web services and compatible with ARM processors

I am working on a project to embed a web server written in C into a device. The requirement is that it should support web services (SOAP / JSON-RPC) and be compatible with an ARM processor. Any suggestions for specific products, or where to look first?
Given your description - a Linux-based platform with 256MB RAM - you can basically use any web server you like. 256MB of RAM takes your device out of typical embedded territory and into server space.
Don't worry about ARM support too much, because it is well supported by the Linux community; it is one of the architectures officially supported by Debian. I myself run a couple of web servers on ARM hardware running Debian and lighttpd, with only 32MB RAM.
The top three most popular web servers (and popularity is quite important, since it means you can easily Google for answers if you have a problem):
lighttpd - very light on RAM usage since it is single-threaded, and very light on CPU usage as well. The disadvantage is that it can be slow to respond if you try to run heavyweight, CPU-intensive CGI applications on it, since it is single-threaded.
Apache2 - heavy on RAM usage. Apache's default operating mode is to keep threads alive as long as possible to handle heavy loads, which means most of the time you use up RAM on sleeping processes. But if you DO need to handle heavy loads, this is a good thing. Good for heavy-duty CGI apps.
Nginx - the new kid on the block. Not as well documented (at the moment; obviously documentation improves with time) as either lighttpd or Apache, but people have been saying that it outperforms both. It is multithreaded like Apache2 but non-blocking like lighttpd, so it has the best of both worlds: it uses less RAM than Apache2 (though more than lighttpd) in general and performs at least as well as, if not better than, Apache2 under load. The only real downside for me is the documentation.
If the device is really short on resources, consider an embedded web server library like Mongoose or libsoup (using GLib). However, note that services like SOAP, and XML parsing in general, are pretty heavy on resources.
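For a sense of scale, here is a minimal sketch of a Mongoose-based responder that answers every HTTP request with a canned JSON-RPC-style body. It is written against the Mongoose 7-style API; the event handler signature and event names have changed between Mongoose releases (and the canned response is obviously a placeholder), so treat this as an outline rather than copy-paste code.

#include "mongoose.h"   /* https://github.com/cesanta/mongoose */

/* Note: this callback signature matches the Mongoose 7.x API assumed here;
 * other releases differ, so check the docs for your version. */
static void handler(struct mg_connection *c, int ev, void *ev_data, void *fn_data)
{
    (void) ev_data; (void) fn_data;
    if (ev == MG_EV_HTTP_MSG) {
        /* Placeholder: a real JSON-RPC endpoint would parse the request
         * body and dispatch on its "method" field. */
        mg_http_reply(c, 200, "Content-Type: application/json\r\n",
                      "%s\n", "{\"jsonrpc\":\"2.0\",\"result\":\"ok\",\"id\":1}");
    }
}

int main(void)
{
    struct mg_mgr mgr;
    mg_mgr_init(&mgr);                                   /* init event manager  */
    mg_http_listen(&mgr, "http://0.0.0.0:8000", handler, NULL);
    for (;;) mg_mgr_poll(&mgr, 1000);                    /* event loop, 1s tick */
    return 0;
}

A real JSON-RPC service would parse the request body and dispatch on the "method" field; for SOAP you would additionally need an XML parser, which is where the resource cost mentioned above comes in.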

Resources