Uploading Large(8GB) File Issue using Weka


I am trying to load an 8 GB file into Weka to run the Apriori algorithm. The server configuration is as follows:
It is an 8-processor server with 4 cores per processor; physical address space = 40 bits and virtual address space = 48 bits. The processors are 64-bit.
Physical memory = 26 GB and swap = 27 GB.
JVM = 64-bit. We have allocated 32 GB for the JVM heap using the -Xmx option. Our concern is that loading such a huge file takes a very long time (around 8 hours). Java is using 107% CPU and 91% memory; it has not thrown an out-of-memory error, and Weka still shows that it is reading from the file.
How should I handle such a huge file, and what exactly is happening here?
Regards,
Aniket

I can't speak to Weka, and I don't know your data set or how many elements are in it. The number of elements matters because in a 64-bit JVM the pointers are huge, and they add up.
But do NOT create a JVM heap larger than physical RAM. Swap is simply not an option for Java. A swapping JVM is a dead JVM. Swap is for idle, rarely used processes.
Also note that the -Xmx value and the physical size of the JVM process are not the same: the physical size will always be larger than the -Xmx size.
You should pre-allocate your JVM heap (-Xms == -Xmx) and try out various values until MOST of your physical RAM is consumed. This will limit full GCs and memory fragmentation. It also helps (a little) to do this on a freshly booted system if you're allocating such a large portion of the total memory space.
But whatever you do, do not let Java swap. Swapping and garbage collectors do not mix.
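To make the sizing rule above concrete, here is a minimal sketch that checks whether a heap setting leaves room in physical RAM. The function name and the 2 GB non-heap overhead are assumptions for illustration only; the real overhead of a JVM process depends on the workload and JVM version.

```python
def heap_fits_in_ram(xmx_gb, physical_ram_gb, non_heap_overhead_gb=2.0):
    """Return True if the heap plus an ASSUMED non-heap overhead fits in RAM.

    non_heap_overhead_gb is a made-up placeholder for thread stacks,
    code cache, GC structures and the OS itself; it is not a measured value.
    """
    return xmx_gb + non_heap_overhead_gb <= physical_ram_gb


# Numbers from the question: -Xmx32g on a machine with 26 GB of physical memory.
question_setup_ok = heap_fits_in_ram(32, 26)

# A hypothetical smaller heap that leaves headroom.
smaller_heap_ok = heap_fits_in_ram(20, 26)

print(question_setup_ok, smaller_heap_ok)
```

With the question's numbers the check fails, which is consistent with the JVM being pushed into swap.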

Related

malloc fragmentation in lots of 64MB arenas

I am struggling to profile what looks like internal malloc memory fragmentation in a database server application. To rule out a leak, all malloc, realloc and free calls are wrapped with our own accounting, which prepends a header to each allocation to keep track of the memory balance, and the code is run under valgrind with a fairly large test suite. Moreover, most of the time we use our custom allocator, which mmaps pools of memory directly and does its own administration.
glibc's malloc is used only for some small allocations that don't fit the scheme of our own allocator.
Running a test for a few days that just keeps allocating and freeing a lot of memory in the server (lots of short connections coming and going, lots of DDL operations modifying global catalogs) results in the "RES" memory creeping up and staying up, way above our internal accounting.
After these few days of testing, we count a total of about 400 TB of memory malloced and freed, with the balance reported by our accounting varying from a few hundred megabytes to 2-3 GB most of the time (with spikes up to 15 GB). The "RES" memory of the process, however, never goes below 8.3-8.4 GB.
Parsing /proc/$PID/maps shows that practically all of it is in "rw-p" mappings of exactly 64 MB (or "rw-p" plus a "---p" reserved "tail"). In a captured snapshot, 143 such arenas account almost exactly for those 8.3-8.4 GB of resident memory.
Googling around suggests that malloc allocates memory in such 64 MB arenas, and that multiple arenas can cause excessive "VIRT" memory:
https://infobright.com/blog/malloc_arena_max/
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
However, in my case most of the arenas are full and actually count towards RES, not VIRT (only 9 of the 143 arenas have a "---p" tail of more than 1 MB).
In this case it is just a few GB of memory, but in actual production systems we've seen the discrepancy grow to numbers like 40-50 GB (on a server with 512 GB of RAM).
Is there a way to get more insight into this fragmentation? The malloc_info output seems to be somewhat corrupted, reporting odd numbers like:
<unsorted from="321" to="847883883078550" total="140585643867701" count="847883883078876"/>
This exact line (with the same "to", "total", and "count") repeats in every heap.
I'm going to test the behaviour of different allocators (jemalloc, tcmalloc) in a similar fashion.
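The /proc/$PID/maps inspection described in the question can be sketched roughly as below. This is only an illustration: the function names are hypothetical, the sample text is synthetic rather than taken from a real process, and the matching rule (an "rw-p" mapping of exactly 64 MiB, or an "rw-p" mapping immediately followed by a "---p" tail that completes 64 MiB) is an assumption based on the description above.

```python
ARENA_SIZE = 64 * 1024 * 1024  # 64 MiB, the arena size reported in the question


def parse_maps(text):
    """Yield (start, end, perms) for each line of /proc/<pid>/maps-style text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        start_hex, end_hex = fields[0].split("-")
        yield int(start_hex, 16), int(end_hex, 16), fields[1]


def count_arena_like_regions(text):
    """Count 64 MiB regions made of an rw-p mapping plus an optional ---p tail."""
    regions = list(parse_maps(text))
    count = 0
    i = 0
    while i < len(regions):
        start, end, perms = regions[i]
        if perms == "rw-p":
            if end - start == ARENA_SIZE:
                count += 1
            elif (i + 1 < len(regions)
                  and regions[i + 1][2] == "---p"
                  and regions[i + 1][0] == end
                  and regions[i + 1][1] - start == ARENA_SIZE):
                count += 1
                i += 1  # skip the reserved tail we just consumed
        i += 1
    return count


# Synthetic sample: one full 64 MiB mapping, one mapping with a 1 MiB
# reserved tail, and one small unrelated mapping.
sample = """\
7f0000000000-7f0004000000 rw-p 00000000 00:00 0
7f0008000000-7f000bf00000 rw-p 00000000 00:00 0
7f000bf00000-7f000c000000 ---p 00000000 00:00 0
7f0010000000-7f0010021000 rw-p 00000000 00:00 0
"""

arena_count = count_arena_like_regions(sample)
print(arena_count)
```

On a live Linux system the same function could be fed the contents of /proc/<pid>/maps for the server process.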

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64 MB. The block size on disk is generally 4 KB.
What does a 64 MB block size mean? Does it mean that the smallest unit that can be read from disk is 64 MB?
If yes, what is the advantage of doing that? Does it make continuous access to large files in HDFS easier?
Could we do the same using the disk's original 4 KB block size?

What does 64MB block size mean?
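The block size is the smallest unit of data that a file system can store. Whether you store a file that's 1 KB or 60 MB, it takes up one block. Once you cross the 64 MB boundary, you need a second block. (In HDFS, a file smaller than the block size does not use a full block's worth of disk space.)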
If yes, what is the advantage of doing that?
HDFS is meant to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to get that file (one request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the NameNode to determine where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, significantly reducing the overhead and the load on the NameNode.
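The request counts in this answer can be reproduced with a few lines of arithmetic. This is just a sketch of the calculation, assuming binary units (1 MB = 1024 KB) and one request per block; the function name is made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def blocks_needed(file_size, block_size):
    """Number of blocks (and hence per-block requests) needed for a file."""
    return -(-file_size // block_size)  # integer ceiling division


file_size = 1000 * MB
requests_4k = blocks_needed(file_size, 4 * KB)
requests_64m = blocks_needed(file_size, 64 * MB)

print(requests_4k, requests_64m)
```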

HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the reasons for large block sizes as stated in the original GFS paper (note 1 on GFS terminology vs HDFS terminology: chunk = block, chunkserver = datanode, master = namenode; note 2: bold formatting is mine):
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).

In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed in your cluster.
So what's the point of choosing a higher block size instead of some low value? While in theory an even distribution of data is a good thing, having too small a block size has some significant drawbacks. The NameNode's capacity is limited, so a 4 KB block size instead of 128 MB also means 32768 times more block information to store. MapReduce could also profit from evenly distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice the theoretical benefits are lost because sequential, buffered reads are no longer possible and because of the latency of each map task.
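The factor of 32768 is simply the ratio of the two block sizes. A small sketch, using a hypothetical 1 TB dataset (not a figure from the thread) to show how many block entries the NameNode would have to track in each case:

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB

small_block = 4 * KB
large_block = 128 * MB

# Ratio of block sizes: how many more entries small blocks would need.
metadata_ratio = large_block // small_block

# Hypothetical dataset size, chosen only for illustration.
dataset = 1 * TB
entries_small = dataset // small_block
entries_large = dataset // large_block

print(metadata_ratio, entries_small, entries_large)
```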

In a normal OS the block size is 4 KB, and in Hadoop it is 64 MB.
The reason is to make the metadata in the NameNode easier to maintain.
Suppose the block size in Hadoop were only 4 KB and we tried to load 100 MB of data: we would need a very large number of 4 KB blocks, and the NameNode would have to maintain metadata for every one of them.
If we use a 64 MB block size, the data is loaded into only two blocks (64 MB and 36 MB), so the amount of metadata is much smaller.
Conclusion:
To reduce the burden on the NameNode, HDFS prefers a block size of 64 MB or 128 MB. The default block size is 64 MB in Hadoop 1.0 and 128 MB in Hadoop 2.0.

It has more to do with the disk seeks of HDDs (hard disk drives). Over time, disk seek time has not improved much compared to disk throughput. So when the block size is small (which leads to too many blocks), there will be too many disk seeks, which is not very efficient. As we move from HDD to SSD, disk seek time matters much less, as there are no moving parts in an SSD.
Also, if there are too many blocks, it will strain the NameNode. Note that the NameNode has to store all of the metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB, and in Cloudera Hadoop the default is 128 MB.

If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would force the NameNode to manage an enormous amount of metadata.
Since we need a Mapper for each block, there would be a lot of Mappers, each processing only a small piece of data, which isn't efficient.

Hadoop chose 64 MB because Google chose 64 MB. Google chose 64 MB based on a Goldilocks argument.
Having a much smaller block size would cause seek overhead to increase.
Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.
See Google Research Publication: MapReduce
http://research.google.com/archive/mapreduce.html

Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45).
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to
minimize the cost of seeks. By making a block large enough, the time
to transfer the data from the disk can be significantly longer than
the time to seek to the start of the block. Thus the time to transfer
a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10 ms and
the transfer rate is 100 MB/s, to make the seek time 1% of the
transfer time, we need to make the block size around 100 MB. The
default is actually 64 MB, although many HDFS installations use 128 MB
blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in
MapReduce normally operate on one block at a time, so if you have too
few tasks (fewer than nodes in the cluster), your jobs will run slower
than they could otherwise.
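The "quick calculation" in the quoted passage can be written out as follows. This is a sketch of the arithmetic only, using the quote's figures (10 ms seek, 100 MB/s transfer, seek time at 1% of transfer time); the function names are made up for illustration.

```python
def block_size_for_seek_fraction(seek_ms, transfer_mb_per_s, seek_percent):
    """Block size (MB) at which seek time is seek_percent of transfer time."""
    transfer_ms = seek_ms * 100 / seek_percent
    return transfer_mb_per_s * transfer_ms / 1000


def seek_percent_for_block(seek_ms, transfer_mb_per_s, block_mb):
    """Seek time as a percentage of the time to transfer one block."""
    transfer_ms = block_mb / transfer_mb_per_s * 1000
    return seek_ms / transfer_ms * 100


# Figures from the quoted passage.
target_block_mb = block_size_for_seek_fraction(10, 100, 1)

# Seek overhead for the two commonly used block sizes.
overhead_64 = seek_percent_for_block(10, 100, 64)
overhead_128 = seek_percent_for_block(10, 100, 128)

print(target_block_mb, overhead_64, overhead_128)
```

Under these figures a 64 MB block spends roughly 1.6% of its transfer time on the seek and a 128 MB block roughly 0.8%, which is why both sit close to the 1% target.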

Why is memory fragmentation an issue on a 64-bit machine?

In a 32-bit machine each process gets a 4GB virtual space. In this case one can worry that we might face trouble due to fragmentation. But in the case of a 64-bit machine we theoretically have a huge addressable virtual memory, so why is memory fragmentation still an issue (if it is) in a 64-bit machine?

Each virtual address that you try to access is mapped by the operating system to physical memory. Physical memory is allocated in pages (e.g. 4K in size). If you manage to allocate a byte at offset 1000000*n and do it for n from 1 to 1000000 (I think you could do that with mmap), then the OS will have to back that with a million pages of physical memory, which is something like 4G. That physical memory will not be available for anything else. If you had allocated the bytes contiguously, you'd only need about 1M of physical memory (256 pages) for your million bytes.
You can get in a similar bad situation if you allocate 4G for legitimate reasons, and then deallocate parts of it, keeping a bit of every page allocated. The OS will not be able to actually reuse the freed memory for anything else because there are no physical pages that are fully free. So that's a fragmentation problem.
In theory, you could imagine that virtual addresses 1000000 and 2000000 would map to the same page of physical memory, avoiding the fragmentation. But in practice, and for good reasons, the virtual memory mapping is done on a page by page basis. You can read more about it here: http://en.wikipedia.org/wiki/Page_table.
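The page arithmetic in this answer can be checked with a short sketch. It assumes 4 KB pages and only counts which pages a set of byte offsets would touch; it does not allocate any real memory, and the function name is hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_touched(offsets, page_size=PAGE_SIZE):
    """Number of distinct pages covered by the given byte offsets."""
    return len({offset // page_size for offset in offsets})


# One byte at offset 1000000*n for n = 1..1000000, as in the answer.
scattered_pages = pages_touched(1_000_000 * n for n in range(1, 1_000_001))

# The same million bytes laid out contiguously from offset 0.
contiguous_pages = pages_touched(range(1_000_000))

scattered_bytes = scattered_pages * PAGE_SIZE

print(scattered_pages, contiguous_pages, scattered_bytes)
```

The scattered layout touches a million pages (about 4 GB of physical memory), while the contiguous layout touches 245 pages, in line with the answer's rough figure of 256 pages for about 1 MB.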

Because all that memory is "wasted". Consider an application with a lot of internal fragmentation. That process requires more pages in memory because its working set is now scattered across memory, which means its memory footprint is much higher. If this application is contending for physical slots in RAM (machines still really only have about 4-8 GB of RAM in a typical home setup), it causes more page swapping. Generally you want to reduce your application's memory footprint to avoid memory pressure and contention with other applications.
There are cases, though, where it doesn't really matter: it won't kill you to use an extra megabyte here or there, but it all adds up in larger applications. Whether it is important to have as little fragmentation as possible depends on what you're coding and what the aim of your project is.

outOfMemoryException while reading excel data

I am trying to read data from an Excel file (xlsx format) that is 100 MB in size. While reading the Excel data I get an OutOfMemoryError. I tried increasing the JVM heap size to 1024 MB, but it did not help, and I can't increase it beyond that. I also tried triggering garbage collection, but that did not help either. Can anyone help me resolve this issue?
Thanks
Pavan Kumar O V S.

By default a JVM places an upper limit on the amount of memory available to the current process in order to prevent runaway processes gobbling system resources and making the machine grind to a halt. When reading or writing large spreadsheets, the JVM may require more memory than has been allocated to the JVM by default - this normally manifests itself as a java.lang.OutOfMemoryError.
For command line processes, you can allocate more memory to the JVM using the -Xms and -Xmx options. For example, to allocate an initial heap of 10 MB with 100 MB as the upper bound, you can use:
java -Xms10m -Xmx100m -classpath jxl.jar spreadsheet.xls
You can refer to http://www.andykhan.com/jexcelapi/tutorial.html#introduction for further details

How much data can be malloced at a time? what is the limit in modern OS such as Linux?

How much data can be malloced, and how is the limit determined? I am writing an algorithm in C that repeatedly uses some data stored in arrays. My idea is to keep this data in dynamically allocated arrays, but I am not sure whether it's possible to malloc such amounts.
I use 200 arrays of size 2046 holding complex data of 8 bytes each. I use these throughout the process, so I do not want to recalculate them over and over.
What are your thoughts about feasibility of such an approach?
Thanks
Mir

How much memory malloc() can allocate depends on:
How much memory your program can address directly
How much physical memory is available
How much swap space is available
On a modern, flat-memory-model 32-bit system, your program can address 4 gigabytes, but some of the address space (usually 2 gigabytes, sometimes 1 gigabyte) is reserved for the kernel. So, as a rule of thumb, you should be able to allocate almost two gigabytes at once, assuming you have the physical memory and swap space to back it up.
On a 64-bit system, running a 64-bit OS and a 64-bit program, your addressable memory is essentially unlimited.
200 arrays of 2048 bytes each is only 400 KB, which should fit in cache (even on a smartphone). Even at 8 bytes per element, the total is only about 3.1 MB.
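For reference, the total for the question's own numbers (200 arrays, 2046 elements each, 8 bytes per element) works out as below. This is a plain arithmetic sketch with a made-up function name; it says nothing about any particular malloc implementation.

```python
def total_bytes(num_arrays, elements_per_array, bytes_per_element):
    """Total payload size of the arrays, ignoring allocator overhead."""
    return num_arrays * elements_per_array * bytes_per_element


# Numbers taken from the question.
question_total = total_bytes(200, 2046, 8)
question_total_mib = question_total / (1024 * 1024)

print(question_total, round(question_total_mib, 2))
```

Either way, the total is a few megabytes at most, far below any of the limits discussed here.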

A 32-bit OS has a limit of 4 GB, and typically some of that (up to half on Win32) is reserved for the operating system, for mapping the address space of graphics card memory, etc.
Linux supports 64 GB of physical address space (using Intel's 36-bit PAE) on 32-bit versions.
EDIT: although each process is limited to 4 GB
The main problem with allocating large amounts of memory is if you need it to be locked in RAM: then you obviously need a lot of RAM. Or if you need it all to be contiguous: it's much easier to get 4 * 1 GB chunks of memory than a single 4 GB chunk with nothing else in the way.
A common approach is to allocate all the memory you need at the start of the program, so that if the run isn't going to be possible it fails immediately rather than after 90% of the work is done.
Don't run other memory-intensive apps at the same time.
There are also a bunch of flags you can use to suggest to the kernel that this app should get priority in memory or keep memory locked in RAM. Sorry, it's been too long since I did HPC on Linux, and I'm probably out of date with modern kernels.

I think that on most modern (64-bit) systems you can allocate 4 GB at a time with a malloc( size_t ) call if that much memory is available. How big is each of those 'complex data' entries? If they are 256 bytes each, then you'll only need to allocate 100 MB.
256 bytes × 200 arrays × 2048 entries = 104857600 bytes
104857600 bytes / 1024 / 1024 = 100 MB.
So at 4096 bytes each that's still only 1600 MB, or ≃ 1.6 GB, so it is feasible on most systems today; my four-year-old laptop has 3 GB of internal memory. Sometimes I do image manipulation with GIMP and it takes up over 2 GB of memory.

With some implementations of malloc(), the regions are not actually backed by memory until they really get used so you can in theory carry on forever (though in practice of course the list of allocated regions assigned to your process in the kernel takes up space, so you might find you can only call malloc() a few million times even if it never actually gives you any memory). It's called "optimistic allocation" and is the strategy used by Linux (which is why it then has the OOM killer, for when it was over-optimistic).
