Hello everyone, and thanks for your help.
First, I'm a beginner in databases and IT, so please be patient with me. I'm an Ubuntu user on the Xfce environment.
I'm trying to use OpenRefine with a dataset of approximately 11,000 rows and 8 columns. When I try to process it, I run into a memory problem: "Memory usage: 100% (1517/1517MB)".
It looks like this: (screenshot of the memory usage warning)
I've tried to allocate more memory to OpenRefine with the command:
./refine -m 1800m
I've read that I can't allocate more than half of my free memory, which is 3800 MB. But even with the extra memory I waited a whole night and OpenRefine still didn't finish processing the dataset. I don't understand why, because OpenRefine is supposed to be able to handle about 100,000 rows with a few columns.
I was using the Firefox browser. I also tried Opera, which OpenRefine considers a better fit, but I get the same result.
Could someone more experienced with dataset processing help me?
To give an "official" answer to your question: ODS sometimes it is quite a burdon on the parser.
Therefore you can get around some limitations by exporting/importing your data as CSV, which is easier to read.
As described in the OpenRefine documentation about increasing memory allocation, you may also benefit from turning off automatic cell type parsing.
I'm working with a heavy dataset, which I chose to store in plain binary format and load into memory in chunks. However, even the smallest chunk is about to exceed my computer's memory (16 GB), so I'll have to fragment the chunks further or find another solution. The whole dataset is around half a terabyte.
I've seen on the wiki that SQLite can work with up to 32 TB of data, which is cool. However, I could not figure out whether you have to have 32 TB of memory to use it, or whether you can have less memory and store the data on a hard drive. My understanding is that it should be possible; I only need very simple operations like adding a row, reading a row, picking all rows with a given value, etc.
I would appreciate it if you folks could help me, because I would hate to invest time in studying SQLite only to learn that it does not work the way I thought. Also, if you have any insight that you think might help, please share.
Most databases, including SQLite, can work on the data in chunks, i.e., they need to load only a single record into memory at once. (But using more memory for caching makes things much faster.) Furthermore, if intermediate data becomes too large, it can be moved into temporary files. This is the default behaviour, so it is not explicitly advertised.
Having the entire database in memory is possible, but would make sense only for temporary data that is not to be saved anyway.
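To make the disk-based behaviour concrete, here is a minimal sketch (not code from the thread) using the sqlite-jdbc driver, which is assumed to be on the classpath; the chunks.db file and the samples table are invented for illustration. The database is an ordinary file, inserts are batched inside a transaction, and the query result is consumed one row at a time, so memory use stays far below the size of the data.

import java.sql.*;

public class SqliteOnDiskSketch {
    public static void main(String[] args) throws SQLException {
        // The database is a file on disk; only the rows being touched
        // (plus SQLite's page cache) sit in memory.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:chunks.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS samples (id INTEGER PRIMARY KEY, label TEXT, value REAL)");
            }

            // "Add line": batch the inserts inside one transaction for speed.
            conn.setAutoCommit(false);
            try (PreparedStatement ins = conn.prepareStatement("INSERT INTO samples (label, value) VALUES (?, ?)")) {
                for (int i = 0; i < 100000; i++) {
                    ins.setString(1, i % 2 == 0 ? "even" : "odd");
                    ins.setDouble(2, Math.random());
                    ins.addBatch();
                }
                ins.executeBatch();
            }
            conn.commit();

            // "Pick all lines with a given value": iterate the result set and
            // process one row at a time instead of loading everything at once.
            try (PreparedStatement sel = conn.prepareStatement("SELECT id, value FROM samples WHERE label = ?")) {
                sel.setString(1, "even");
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        double value = rs.getDouble("value");
                        // ... process the row ...
                    }
                }
            }
        }
    }
}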
My typical use of Fortran begins with reading in a file of unknown size (usually 5-100 MB). My current approach to array allocation involves reading the file twice: first to determine the size of the problem (so the arrays can be allocated), and a second time to read the data into those arrays.
Are there better approaches to size determination and array allocation? I just read about automatic array allocation on assignment (example below) in another post, and it seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
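The question is about Fortran, but the metadata-header idea behind option 3 is language-agnostic, so here is a minimal sketch in Java (the class, method names, and file layout are invented for illustration): the first line of the file records how many values follow, so the reader can size its array up front and fill it in a single pass.

import java.io.*;

public class HeaderSizedFile {

    // Write the data with a one-line header recording how many values follow.
    static void write(File f, double[] data) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(f))) {
            out.println(data.length);                        // the metadata header
            for (double v : data) {
                out.println(v);
            }
        }
    }

    // Read it back: size the array from the header, then fill it in one pass.
    static double[] read(File f) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            int n = Integer.parseInt(in.readLine().trim()); // no second pass over the file
            double[] data = new double[n];
            for (int i = 0; i < n; i++) {
                data[i] = Double.parseDouble(in.readLine().trim());
            }
            return data;
        }
    }
}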
I would add to the previous answer:
If your data has a regular structure and it's possible to open it as a text file, press Ctrl+End, subtract the header lines from the total row count, and there it is. Although you may waste time opening it if it's very large.
I'm using Arduino-IRremote code to read in an AC unit remote on an Arduino Uno R3.
There's an unsigned int rawCodes[RAWBUF]. When I set RAWBUF to 255 it all works great. When I push it to 256 it uploads just fine, but there's no response.
Is this a memory limitation? According to this, it's not. I should be able to get ~400 elements.
Also, the fact that the limit I'm hitting is exactly 255 makes me believe there's something else going on.
Thanks, Justin
You shouldn't have that limitation.
You're working with large amounts of memory. Are you sure you have enough available memory to do so?
It looks like you're talking about runtime errors (of the memory leak/segfault type) here.
You can check total available memory or check this great article (and code) on how to debug AVR.
Also, if you're using a lot of static strings, you can reduce RAM usage by moving them to PROGMEM storage (at the cost of some of the flash space available for the sketch).
Found that the variable controlling the buffer size was a uint8_t, so it was a simple change to a 16-bit type, and now we've got the length I was looking for.
https://github.com/shirriff/Arduino-IRremote/issues/49
I have a csv file that has about 2 million lines, and about 150 columns of data. Total file size is about 1.3 GB. That's about 300 million array members.
I started with a 3.5 million line file, and through trial and error learned that the Fortran program would not even work unless the array was dimensioned at 3.9 million or less. At 4 million, no go: bus error/core dumps.
So anyway, I thought my 2 million line file would work. I read a few posts about a 2 GB limit. However, when I print out the line number while reading the data in, I only get to 250,000 or so before it just ends. Strangely enough, I have an almost identical file (made with the split command), and it only gets to 85,000 before conking out. Not sure why they're so different; they have the same number of characters per line.
Is there anything I can do to get this data read in? It would be a major pain to compile all the data hundreds of times.
This isn't a property of Fortran per se, but of your particular compiler and OS. Which is why you should provide that information.
Re the bus error: likely the array is being placed on the stack and you have run out of stack space. Various OS'es provide ways of increasing the stack size. Many compilers provide options so that large arrays are placed into the heap. You can also try declaring the array "allocatable" and allocating it. That last suggestion assumes that you are using a Fortran 95 compiler, rather than a FORTRAN 77 one.
There is also the question of how you declare the integer variable used for indexing. If a loop index in your program exceeds 2,147,483,647 you will need to use a variable more than four bytes in size. We can only guess, since you don't show any of your source code.
I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I'm consistently receiving OutOfMemoryError: Java heap space when trying to index large text files.
Example 1: Indexing a 5 MB text file runs out of memory with a 64 MB max. heap size. So I increased the max. heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do this. Why so much?
The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far according to JConsole and the TPTP Memory Profiling plugin for Eclipse Ganymede.
Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB max. heap size. Increasing the max. heap size to 1024 MB works, but Lucene uses 826 MB of heap space while performing this. Still seems like way too much memory is being used. I'm sure even larger files would cause the error, as memory use seems to scale with file size.
I’m on a Windows XP SP2 platform with 2 GB of RAM. So what is the best practice for indexing large files? Here is a code snippet that I’m using:
// Index the content of a text file.
private Boolean saveTXTFile(File textFile, Document textDocument) throws MyException {
    try {
        Boolean isFile = textFile.isFile();
        Boolean hasTextExtension = textFile.getName().endsWith(".txt");
        if (isFile && hasTextExtension) {
            System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
            Reader textFileReader = new FileReader(textFile);
            if (textDocument == null)
                textDocument = new Document();
            textDocument.add(new Field("content", textFileReader));
            indexWriter.addDocument(textDocument); // BREAKS HERE!!!!
        }
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
        return false;
    } catch (CorruptIndexException cie) {
        throw new MyException("The index has become corrupt.");
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
        return false;
    }
    return true;
}
In response to a comment by Gandalf:
I can see you are setting the mergeFactor to 1000.
the API says:
public void setMergeFactor(int mergeFactor)
Determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained.
This is a convenience method; it uses more RAM as you increase the mergeFactor.
What I would suggest is to set it to something like 15 or so (on a trial-and-error basis), complemented with setRAMBufferSizeMB. Also call commit(), then optimize(), and then close() on the IndexWriter object (you could wrap all of these calls in one method and call that method when you are closing the index).
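As a rough sketch of that combination against the Lucene 2.4 API (the index path is a placeholder and the numbers are starting points to tune by trial and error):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class BatchIndexerSketch {
    static void indexBatch() throws IOException {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index"),   // placeholder index location
                new StandardAnalyzer(),
                true,                                         // create a new index
                IndexWriter.MaxFieldLength.UNLIMITED);

        writer.setMergeFactor(15);         // modest merge factor, tuned by trial and error
        writer.setRAMBufferSizeMB(64.0);   // flush buffered documents once roughly 64 MB is used

        // ... addDocument() calls go here ...

        writer.commit();    // flush pending changes to the index
        writer.optimize();  // merge remaining segments (optional and I/O heavy)
        writer.close();     // always release the IndexWriter
    }
}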
Post back with your results and feedback. =]
For Hibernate users (using MySQL) who are also using Grails (via the Searchable plugin):
I kept getting OOM errors when indexing 3M rows and 5 GB of data in total.
These settings seem to have fixed the problem without requiring me to write any custom indexers.
Here are some things to try:
Compass settings:
'compass.engine.mergeFactor':'500',
'compass.engine.maxBufferedDocs':'1000'
And for Hibernate (not sure if it's necessary, but it might be helping, especially with MySQL, which has JDBC result streaming disabled by default; see the MySQL Connector/J implementation notes [1]):
hibernate.jdbc.batch_size = 50
hibernate.jdbc.fetch_size = 30
hibernate.jdbc.use_scrollable_resultset=true
Also, specifically for MySQL, it seems I had to add some URL parameters to the JDBC connection string.
url = "jdbc:mysql://127.0.0.1/mydb?defaultFetchSize=500&useCursorFetch=true"
(Update: with the URL parameters, memory doesn't go above 500 MB.)
In any case, I'm now able to build my Lucene / Compass index with less than 2 GB of heap. Previously I needed 8 GB to avoid OOM. Hope this helps someone.
[1]: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html (MySQL Connector/J notes on streaming JDBC result sets)
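For anyone doing the same thing with plain JDBC rather than Grails, here is a minimal sketch of the streaming idea (the URL and fetch size mirror the settings above; the documents table, its columns, and the credentials are invented):

import java.sql.*;

public class StreamingQuerySketch {
    public static void main(String[] args) throws SQLException {
        // useCursorFetch + defaultFetchSize make Connector/J pull rows in batches
        // instead of materialising the whole result set in memory.
        String url = "jdbc:mysql://127.0.0.1/mydb?defaultFetchSize=500&useCursorFetch=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement("SELECT id, body FROM documents");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                long id = rs.getLong("id");
                String body = rs.getString("body");
                // ... feed each row to the indexer ...
            }
        }
    }
}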
Profiling is the only way to determine the cause of such large memory consumption.
Also, in your code you are not closing the file readers, IndexReaders, and IndexWriters, which is perhaps the culprit for the OOM.
You can set the IndexWriter to flush based on memory usage or on the number of documents. I would suggest setting it to flush based on memory and seeing if this fixes your issue. My guess is that your entire index is living in memory because you never flush it to disk.
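A minimal sketch of the flush-by-memory setting with the Lucene 2.4 IndexWriter (the 32 MB figure is just an example to tune; writer stands for the IndexWriter from the question):

writer.setRAMBufferSizeMB(32.0);                            // flush once ~32 MB of documents are buffered
writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);  // don't also trigger flushes by document count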
We experienced some similar "out of memory" problems earlier this year when building our search indexes for our maven repository search engine at jarvana.com. We were building the indexes on a 64 bit Windows Vista quad core machine but we were running 32 bit Java and 32 bit Eclipse. We had 1.5 GB of RAM allocated for the JVM. We used Lucene 2.3.2. The application indexes about 100GB of mostly compressed data and our indexes end up being about 20GB.
We tried a bunch of things, such as flushing the IndexWriter, explicitly calling the garbage collector via System.gc(), trying to dereference everything possible, etc. We used JConsole to monitor memory usage. Strangely, we would quite often still run into “OutOfMemoryError: Java heap space” errors when they should not have occurred, based on what we were seeing in JConsole. We tried switching to different versions of 32 bit Java, and this did not help.
We eventually switched to 64 bit Java and 64 bit Eclipse. When we did this, our heap memory crashes during indexing disappeared when running with 1.5GB allocated to the 64 bit JVM. In addition, switching to 64 bit Java let us allocate more memory to the JVM (we switched to 3GB), which sped up our indexing.
Not sure exactly what to suggest if you're on XP. For us, our OutOfMemoryError issues seemed to relate to something about Windows Vista 64 and 32 bit Java. Perhaps switching to running on a different machine (Linux, Mac, different Windows) might help. I don't know if our problems are gone for good, but they appear to be gone for now.