I updated to Solr 8.4.0 (from 6.x) on a test server and reindexed (this is an index of a complicated Moodle system, mainly lots of very small documents). It worked initially, but the server later ran out of disk space, so I deleted everything and tried indexing a smaller subset of the data; it still ran out of disk space.
Looking at the segment info chart, the first segment seems reasonable:
Segment _2a1u:
#docs: 603,564
#dels: 1
size: 5,275,671,226 bytes
age: 2020-11-25T22:10:05.023Z
source: merge
That's 8,740 bytes per document - a little high but not too bad.
Segment _28ow:
#docs: 241,082
#dels: 31
size: 5,251,034,504 bytes
age: 2020-11-25T18:33:59.636Z
source: merge
21,781 bytes per document
Segment _2ajc:
#docs: 50,159
#dels: 1
size: 5,222,429,424 bytes
age: 2020-11-25T23:29:35.391Z
source: merge
104,117 bytes per document!
And it gets worse, looking at the small segments near the end:
Segment _2bff:
#docs: 2
#dels: 0
size: 23,605,447 bytes
age: 2020-11-26T01:36:02.130Z
source: flush
None of our search documents will have anywhere near that much text.
On our production Solr 6.6 server, which has similar but slightly larger data (some of it gets replaced with short placeholder text on the test server for privacy reasons), the large 5 GB-ish segments contain between 1.8 million and 5 million documents.
Does anyone know what could have gone wrong here? We are using Solr Cell/Tika and I'm wondering if somehow it started storing the whole files instead of just the extracted text?
It turns out that a 10 MB English-language PowerPoint file, containing mostly pictures and only about 50 words of text in the whole thing, is indexed (with metadata turned off) as nearly half a million terms, most of which are Chinese characters. Presumably, Tika has incorrectly extracted some of the binary content of the PowerPoint file as if it were text.
I was only able to find this by reducing the index by trial and error until there were only a handful of documents in it (3 documents, but using 13 MB of disk space); the Luke 'Overview' tab then let me see that one field (called solr_filecontent in my schema), which contains the indexed Tika results, has 451,029 terms. Clicking 'Show top terms' then shows a bunch of Chinese characters.
I am not sure if there is a less laborious way than trial and error to find this problem, e.g. whether there is any way to find documents that have a large number of terms associated with them. (Obviously, it could be a very large PDF or something that legitimately has that many terms, but in this case it isn't.) This would be useful because even if there are only a few such instances across our system, they could contribute quite a lot to the overall index size.
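One thing that might reduce the trial and error: the same top-terms information that Luke shows can also be fetched from Solr's terms component over HTTP, which makes the narrowing-down scriptable. A rough Python sketch, where the URL and core name are placeholders for your own setup and the default /terms handler is assumed:

import requests

SOLR = "http://localhost:8983/solr/mycore"   # placeholder URL and core name

resp = requests.get(SOLR + "/terms", params={
    "terms.fl": "solr_filecontent",   # the field holding the extracted Tika text
    "terms.limit": 20,                # top terms, sorted by document frequency
    "wt": "json",
})
resp.raise_for_status()
terms = resp.json()["terms"]["solr_filecontent"]
# With the default JSON layout the list alternates term, count, term, count, ...
for term, count in zip(terms[::2], terms[1::2]):
    print(count, term)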
As for resolving the problem:
1 - I could hack something to stop it indexing this individual document (which is used repeatedly in our test data, otherwise I probably wouldn't have noticed), but presumably the problem may occur in other cases as well.
2 - I considered excluding the terms in some way but we do have a small proportion of content in various languages, including Chinese, in our index so even if there is a way to configure it to only use ASCII text or something, this wouldn't help.
3 - My next step was to try different versions to see what happens with the same file, in case it is a bug in specific Tika versions. However, I've tested with a range of Solr versions - 6.6.2, 8.4.0, 8.6.3, and 8.7.0 - and the same behaviour occurs on all of them.
So my conclusion from all this is that:
Contrary to my initial assumption that this was related to the version upgrade, it isn't actually worse now than it was in the older Solr version.
In order to get this working now, I will probably have to do a hack to stop it indexing that specific PowerPoint file (which occurs frequently in our test dataset). Presumably the real dataset doesn't have too many files like that, otherwise it would already have run out of disk space there...
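One possible shape for that hack, sketched in Python: Solr Cell accepts extractOnly=true, which returns the Tika-extracted text without indexing it, so the upload code could skip files whose extracted text is wildly out of proportion to the file size. The URL, core name, and threshold below are placeholders, not our actual Moodle code:

import requests

SOLR = "http://localhost:8983/solr/mycore"   # placeholder URL and core name
MAX_CHARS = 200_000                          # arbitrary threshold, tune for your content

def extraction_looks_sane(path):
    """Return False if Tika's extracted text is suspiciously large for this file."""
    with open(path, "rb") as f:
        resp = requests.post(
            SOLR + "/update/extract",
            params={"extractOnly": "true", "wt": "json"},
            data=f,
            headers={"Content-Type": "application/octet-stream"},
        )
    resp.raise_for_status()
    # With extractOnly=true the extracted body comes back as one or more string values;
    # just measure everything that looks like text.
    extracted = "".join(v for v in resp.json().values() if isinstance(v, str))
    return len(extracted) <= MAX_CHARS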
Related
We are using Apache IoTDB Server in version 0.11.2 in a scenario and observe a data directory / TsFiles that are way bigger than they should be (about 130 sensors with 4 million double values for each sensor, but the files total about 200 GB).
Are there known issues, or do you have any ideas what could cause this or how to track it down?
The only thing we could think of is some merge artefacts, as we write many data points out of order, so merging has to happen frequently.
Are there any ideas or tools on how to debug / inspect the TsFiles to get an idea of what's happening here?
Any help or hints are appreciated!
This may be due to the compaction strategy.
You could fix this in either of two ways (no need to do both):
(1) upgrade to 0.12.2 version
(2) open the configuration in iotdb-engine.properties: force_full_merge=true
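For option (2), that is a single line in the engine configuration (assuming the default conf layout; it is read when the server starts):

# conf/iotdb-engine.properties
force_full_merge=true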
The reason is:
The unsequenced data compaction in the 0.11.2 version has two strategies.
E.g.,
Chunks in a sequence TsFile: [1,3], [4,5]
Chunks in a unsequence TsFile: [2]
(I use [1,3] to indicate the timestamp of 2 data points in a Chunk)
(1) When using full merge(rewrite all data): we get a tidy sequence file: [1,2,3],[4,5]
(2) However, to speed up the compaction, we use append merge by default, and we get a sequence TsFile: [1,3],[4,5],[1,2,3]. In this TsFile, [1,3] does not have metadata at the end of the file; it is garbage data.
So, if you have lots of out-of-order data merged frequently, this will happen (you get a very big TsFile).
The big TsFile will be tidy after a new compaction.
You could also use TsFileSketchTool or example/tsfile/TsFileSequenceRead to see the structure of the TsFile.
I am using Redis to store some information and detect changes in that information over time (for example, think users and locations). What is the value of using a longer or shorter key name? Using a longer key is clearer, but is there much memory or performance cost to using a longer key name?
Here are examples:
SET L:123456 "<name> <latitude> <longitude> ..."
HSET U:987654321 loc 123456 time <epoch>
or
SET loc:{123456} "<name> <latitude> <longitude> ..."
HSET user:{U987654321} loc 123456 time <epoch>
It all depends on how you are going to use it.
If every byte counts, for example when you have to pay for each kB transferred to a cloud service, you can calculate the costs. The maths is simple; a byte is a byte 'on the wire'. Inside redis, for larger values it is equally simple. For smaller values, Redis does some memory optimization.
In your HSET example, you split out the members, which only makes sense if you need them separated from each other most of the time. A better approach -might- be: HSET user:data 987654321 '{"loc": "123456", "time": "2014-01-01T13:00:00"}'. Separate keys/members 'cost' a lot more than longer strings, performance-wise. You can even put a whole table or dataset in one member if it's only going to be used as one complete semi-static entity.
Speed and Size: There is a notable difference between keys and values.
Keys:
Shorter is generally more memory efficient as well as speed efficient. If you use a Redis Sorted Set you can even use 'numbers' as keys (sorted set 'members' plus 'scores'). I say 'numbers' because a score is technically a float64, but to be used as an ID it has to be between -999999999999999 and 999999999999999 inclusive (that's 15 digits), without any fractional part. This can be really helpful, since Redis does fast and scalable O(log(n)) on-the-fly sorting of Sorted Sets (using skiplists, simplified).
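A quick redis-py sketch of that sorted-set pattern (the key name and scores below are made up for illustration):

import redis

r = redis.Redis()
# member = numeric user ID stored as a string, score = e.g. a last-seen epoch
r.zadd("UserLastSeen", {"987654321": 1700000000, "123456789": 1700000100})
# O(log(n)) sorted retrieval, newest first
print(r.zrevrange("UserLastSeen", 0, 9, withscores=True))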
Values:
The MsgPack format (uncompressed) takes up the least space, especially if you store the definitions once and the values many times. JSON is a bit less memory efficient, but is of course such a common IPC format that it should not be left out. Raw strings, character-separated, fixed length (ugh), whatever your desire, it's possible to use. You can always compress your data before storing it in Redis. So much for memory efficiency. When it comes to speed, it's less simple. If you want to use Lua server-side scripting (which you should), you can't do anything with compressed data. JSON and MsgPack can be deserialized, but only 'as a whole', which is fine in most scenarios. Most flexible is storing separate values (for example as members of an HSET), but this comes at a price as well (most of the time: too high a price). You can also combine all of these. What we use most: a prefix of two or three delimiter-separated values, followed by a MsgPack payload.
My general advice is: start with using only HSET's and ZSET's, don't split out data that belongs together, use descriptive PascalCased names for your keys between 10-25 chars, use ':' if you need delimiters in your keys (namespaces), serialize as JSON (for simplicity, but code for easy switching to MsgPack), use Lua scripting (even if you don't know Lua, the subset you use in Redis is tiny).
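To make that concrete, a minimal redis-py sketch of the HSET-plus-JSON approach with a trivial server-side Lua call (all names here are just examples):

import json
import redis

r = redis.Redis(decode_responses=True)

# one hash per logical 'table', JSON blobs as values
r.hset("UserData", "987654321", json.dumps({"loc": "123456", "time": "2014-01-01T13:00:00"}))
user = json.loads(r.hget("UserData", "987654321"))

# a trivial Lua script run server-side; real scripts would combine several commands atomically
script = "return redis.call('HGET', KEYS[1], ARGV[1])"
print(r.eval(script, 1, "UserData", "987654321"))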
I wouldn't worry about it too much in the startup phase of your project, you can always change it later on and do some A/B comparisons as soon as you have some interpolatable data.
Hope this helps, TW
Now that Redis v3.2 is almost here, you should consider switching to the new geo hashing functionality: http://redis.io/commands/geoadd
I am going through the database sizes of open dictionaries like WordNet (1: http://wordnet.princeton.edu/), which has a database size of almost 52 MB. But I have seen some offline dictionary applications on Google Play, like an English Dictionary app which uses the Wiktionary database, and I do not know how they manage to provide an offline dictionary with more than 167,000 words in only 15 MB.
What might be their way of keeping the words in the database?
WordNet packs quite a punch into a small memory footprint.
How? Here is a brief picture:
Words are stored in index files for fast search - index.noun, index.adj, etc.
The relation between each word and its offset in the definition file - data.noun, etc. - is provided.
Each line in a definition file corresponds to one definition, and relationships between words are marked by a symbol and an offset, e.g. ! for antonym, # for kind of, etc.
This makes the whole thing pretty compact.
For more info on this read: man 5 wndb.
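To illustrate the index-plus-offset idea in a few lines of Python (this is a simplified sketch, not the real wndb record layout, and the file names are hypothetical):

# Build a word -> byte-offset index once; every lookup is then a dict hit plus one seek,
# and each definition is stored exactly once no matter how many words reference it.
index = {}
with open("index.simplified") as f:               # hypothetical lines of "word offset"
    for line in f:
        word, offset = line.split()
        index[word] = int(offset)

def lookup(word):
    with open("data.simplified", "rb") as data:   # hypothetical definition file
        data.seek(index[word])                    # jump straight to the definition line
        return data.readline().decode()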
Regarding size:
52 MB = 52,000 KB
For the roughly 180K words in WordNet, that is 52,000 KB / 180 K ≈ 300 bytes per word.
An average of about 300 bytes to represent each word's definitions plus relations - good enough.
e.g. on average, roughly 4 definitions (~20 chars each), 2 usages (~20 chars each), plus the overhead of about 10 relations.
I want to use GUIDs (UUIDs) for naming folders in a huge file store. Each storage item gets its own folder and GUID.
The easiest way would be "x:\items\uuid\{uuid}..."
example: "x:\items\uuid\F3B16318-4236-4E45-92B3-3C2C3F31D44F..."
I see one problem here. What if you expect to get at least 10,000 items, probably a few 100,000, or more than 1 million? I don't want to put that many items (subfolders) in one folder.
I thought to solve this by splitting up the GUID: taking the first 2 chars to create subfolders at the first level, then taking the next 2 chars to create subfolders at the second level.
The above example would be --> "x:\items\uuid\F3\B1\6318-4236-4E45-92B3-3C2C3F31D44F..."
If the first 4 chars of GUIDs are really as random as expected, then after a while I get 256 folders within 256 folders, and I always end up with a reasonable number of items in each of these folders.
For example, if you have 1 million items you get --> 1 000 000 / 256 / 256 = 15.25 items per folder
In the past I've already tested the randomness of the first chars (via a vb.net app). Result: the items were spread quite evenly over the folders.
Somebody else also came to the same conclusion; see How evenly spread are the first four bytes of a Guid created in .NET?
Possible splits I thought of (1 million items as example)
C1 = character 1 of GUID, C2 = character 2, etc
C1\C2\Rest of GUID --> 16 * 16 * 3906 (almost 4,000 is still a lot of folders)
C1\C2\C3\C4\Rest of GUID --> 16 * 16 * 16 * 16 * 15 (unnecessary splitting up of folders)
C1C2\C3C4\Rest of GUID --> 256 * 256 * 15 (for me the best option?)
C1C2C3\Rest of GUID --> 4096 * 244 (too many folders at the first level??)
C1C2C3C4\Rest of GUID --> 65536 * 15 (too many folders at the first level!)
My questions are:
Does anyone see drawbacks to this kind of implementation (scheme: C1C2\C3C4\Rest of GUID)?
Is there some standard for splitting up GUIDs, or a general way of doing this?
What happens if you put a few hundred thousand subfolders in one folder? (I would still prefer not to use any splitting, if possible.)
Thanks, Mumblic
This is fairly similar to the method git uses for sharding its object database (although with SHA1 hashes instead of GUIDs...). As with any algorithm, there are pros and cons, but I don't think there are any significant cons in this case that would outweigh the definite pros. There's a little extra CPU overhead to calculate the directory structure, but in the long term, that overhead is probably significantly less than what is necessary to repeatedly search through a single directory of a million files.
Regarding how to do it, it depends a bit on what library you are using to generate the GUIDs - do you get them in a byte-array (or even a struct) format that then needs to be converted to a character representation in order to display it, or do you get them in an already formatted ASCII array? In the first case, you need to extract the appropriate bytes and format them yourself, in the second you just need to extract a substring.
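For the substring case it really is just a couple of slices. A sketch in Python (the .NET version would be the equivalent Substring calls; the base path is taken from the question's example):

import os
import uuid

BASE = r"x:\items\uuid"

def item_dir(guid: uuid.UUID) -> str:
    s = str(guid).upper()                  # "F3B16318-4236-4E45-92B3-3C2C3F31D44F"
    return os.path.join(BASE, s[:2], s[2:4], s[4:])

path = item_dir(uuid.UUID("F3B16318-4236-4E45-92B3-3C2C3F31D44F"))
# -> x:\items\uuid\F3\B1\6318-4236-4E45-92B3-3C2C3F31D44F
os.makedirs(path, exist_ok=True)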
As far as putting an extreme number of sub-folders (or even files) in one folder, the exact performance characteristics are highly dependent on the actual file system in use. Some perform better than others, but almost all will show significant performance degradation the more entries each directory has.
I'm trying to make a 2D online game (with Z positions), and currently I'm working on loading a map from a txt file. I have three different map files: one contains an int for each tile saying what kind of floor there is, one saying what kind of decoration there is, and one saying what might be covering the tile. The problem is that the current map (20, 20, 30) takes 200 ms to load, and I want it to be much, much bigger. I have tried to find a good solution for this and have so far come up with some ideas.
Recently I've thought about storing all tiles in separate files, one file per tile. I'm not sure if this is a good idea (it feels wrong somehow), but it would mean that I wouldn't have to store any unnecessary tiles as "-1" in a text file, and I would be able to just pick the right tile from the folder easily during run time (read the file named mapXYZ). If the tile is empty, I would just be able to catch the FileNotFoundException. Could anyone tell me a reason for this being a bad solution? Other solutions I've thought about would be to split the map into smaller parts or to read the map during startup in a BackgroundWorker.
Try making a much larger map in the same format as your current one first - it may be that the 200ms is mostly just overhead of opening and initial processing of the file.
If I'm understanding your proposed solution (opening one file per X,Y or X,Y,Z coordinate of a single map), this is a bad idea for two reasons:
There will be significant overhead to opening so many files.
Catching a FileNotFoundException and eating it will be significantly slower - there is actually a lot of overhead with catching exceptions, so you shouldn't rely on them to perform application logic.
Are you loading the file from a remote server? If so, that's why it's taking so long. Instead you should embed the file into the game. I'm saying this because you probably take 2-3 bytes per tile, so the file's about 30kb and 200ms sounds like a reasonable download time for that size of file (including overhead etc, and depending on your internet connection).
Regarding how to lower the filesize - there are two easy techniques I can think of that will decrease the filesize a bit:
1) If you have mostly empty squares and only some significant ones, your map is what is often referred to as 'sparse'. When storing a sparse array of data you can use a simple compression technique (formally known as 'run-length encoding') where, each time you come across empty squares, you specify how many of them there are. So, for example, instead of {0,0,0,0,0,0,0,0,0,0,1,1,2,3,0,0,0,0,0,0,0,0,0,0,0,0,1} you could store {10 0's, 1, 1, 2, 3, 12 0's, 1}.
2) To save space, I recommend that you store everything as binary data. The exact layout of the file mainly depends on how many possible tile types there are, but this is a better solution than storing the ASCII characters corresponding to the base-10 representation of the numbers, separated by delimiters.
Example Binary Format
File is organized into segments which are 3 or 4 bytes long, as explained below.
First segment indicates the version of the game for which the map was created. 3 bytes long.
Segments 2, 3, and 4 indicate the dimensions of the map (x, y, z). 3 bytes long each.
The remaining segments each indicate a tile number and are 3 bytes long with an MSB of 0. The exception to this follows.
If one of the tile segments is an empty tile, it is 4 bytes long with an MSB of 1, and indicates the number of empty tiles including that tile that follow.
The reason I suggest the MSB flag is so that you can distinguish between segments which are for tiles, and segments which indicate the number of empty tiles which follow that segment. For those segments I increase the length to 4 bytes (you might want to make it 5) so that you can store larger numbers of empty tiles per segment.
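A rough Python sketch of a writer and reader for this format (the EMPTY sentinel, big-endian byte order, and 31-bit run length are assumptions of the sketch, not requirements; the same logic translates to whatever language your game uses):

import struct

EMPTY = -1   # in-memory marker for an empty tile (an assumption of this sketch)

def write_u24(f, value):
    f.write(value.to_bytes(3, "big"))          # 3-byte segment; MSB stays 0 for values < 2**23

def write_map(path, version, dims, tiles):
    """dims = (x, y, z); tiles = flat list of tile numbers, EMPTY for empty squares."""
    with open(path, "wb") as f:
        write_u24(f, version)
        for d in dims:
            write_u24(f, d)
        i = 0
        while i < len(tiles):
            if tiles[i] == EMPTY:
                run = 0
                while i < len(tiles) and tiles[i] == EMPTY:
                    run += 1
                    i += 1
                f.write(struct.pack(">I", 0x80000000 | run))   # 4-byte segment, MSB = 1
            else:
                write_u24(f, tiles[i])                         # 3-byte segment, MSB = 0
                i += 1

def read_map(path):
    with open(path, "rb") as f:
        data = f.read()
    version = int.from_bytes(data[0:3], "big")
    dims = tuple(int.from_bytes(data[3 + 3 * i:6 + 3 * i], "big") for i in range(3))
    tiles, pos = [], 12
    while pos < len(data):
        if data[pos] & 0x80:                                   # empty-run segment
            run = struct.unpack(">I", data[pos:pos + 4])[0] & 0x7FFFFFFF
            tiles.extend([EMPTY] * run)
            pos += 4
        else:                                                  # tile segment
            tiles.append(int.from_bytes(data[pos:pos + 3], "big"))
            pos += 3
    return version, dims, tiles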