Text mining: export clusters in RapidMiner

I am working with approximately 300 articles (txt files), and my cluster analysis has so far been successful.
Now I wish to save/export these 5 clusters separately so that I can do further analysis on each cluster.
How is that possible?
I would appreciate any answer!
/LK

You could export the result of the clustering operator to Excel and examine the results there.
Alternatively, you could use the Filter Examples operator with a suitable filter to create a separate example set for each value of the cluster attribute.
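For instance, a sketch assuming RapidMiner's attribute_value_filter condition and the default cluster labels (cluster_0 through cluster_4): configure one Filter Examples operator per cluster with

condition class:  attribute_value_filter
parameter string: cluster = cluster_0

and write each filtered example set out with Write Excel or Write CSV for the follow-up analysis.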

Related

How do I perform a clustering algorithm (like DBSCAN) in Snowflake?

I need to cluster some data in Snowflake using DBSCAN. I created a UDF, but its results don't match a local run, so I ran a UDF that just builds a list of the row numbers being processed; the result is a list with repeated values whose maximum is much smaller than the number of rows in my table. (The expected result was unique values up to the number of rows.)
Can this be a parallelization issue?
If so, is there a way to cluster data using DBSCAN in Snowflake?
Thanks!
EDIT -> code example:
import pandas as pd
from sklearn.cluster import DBSCAN
from snowflake.snowpark import functions as funcs, types

@funcs.pandas_udf(name='DBSCAN_TEST', is_permanent=True,
                  stage_location='@UDF', replace=True,
                  packages=['scikit-learn==1.0.2', 'pandas', 'numpy'])
def DBSCAN_TEST(data_x: types.PandasDataFrame[float, float]) -> types.PandasSeries[float]:
    # fit DBSCAN on the rows this invocation receives
    DBSCAN_cluster = DBSCAN(eps=2.5, min_samples=4)
    DBSCAN_cluster.fit(data_x)
    # one cluster label per input row
    return pd.Series(DBSCAN_cluster.labels_, index=data_x.index)
As input data I used this to test DBSCAN.
If I run that locally (using the exact same code as in the UDF) I end up with 24 clusters, which is the expected result. But if I use the UDF, the count scales up to 71.
I've tried changing the input types to string, as a coworker suggested, but it didn't work.
The only clustering option in Snowflake is a clustering key, and the general rule of thumb is that you don't typically need one until a table eclipses a terabyte of data and/or automatic clustering proves deficient in performance.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html
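On the parallelization question: vectorized (pandas) UDFs hand the function one batch of rows at a time, so a DBSCAN fit inside the UDF would run per batch and its labels cannot be compared across batches, which would explain both the repeated row numbers and the inflated cluster count. A minimal local baseline for comparison, assuming the test data sits in a two-column CSV (the file name is hypothetical):

import pandas as pd
from sklearn.cluster import DBSCAN

# the same two float columns that are fed to the UDF
data_x = pd.read_csv('dbscan_test_data.csv')

labels = DBSCAN(eps=2.5, min_samples=4).fit(data_x).labels_
print(pd.Series(labels).nunique())  # a single-process run is expected to find 24 clusters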

OR search in Solr

I have a situation where I have to search for documents in Solr with multiple OR keywords. The number of keywords may reach 5,000, which results in an awfully large query with 5,000 OR conditions and causes the Solr server to hang. Is there any other way I can design the query to work? A short sample of the query is given below.
tweet_id:337931022601699328 OR 337931064293081089 OR 337931089538584576 OR 337931098761871361 OR 337931138851016704 OR 337931143099854848 OR 337931160082591745 OR 337931163857453056 OR 337931230819516416 OR 337931239996665857 OR 337931287518126080 OR 337931322850951168 OR 337931325648535553 OR 337931331398934528 OR 337931413057830912 OR 337931442363441152 OR 337931448629731329 OR 337931453344129025 OR 337931465016877056 OR 337931482066726912 OR 337931514388029442 OR 337931533149155328 OR 337931645527130114 OR 337931704935256064 OR 337931784459268096 OR 337931845545103360 OR 337931889086185472 OR 337931892668108801 OR 337931963983855617 OR 337932154212319233 OR 337932176454721536 OR 337932193198374912 OR 337932229659459584 OR 337932437290090496 OR 337932436807749632 OR 337932436828725250 OR 337932437449474048 OR 337932448518250496 OR 337932458832035843 OR 337932458634915840 OR 337932458278387712 OR 337932474246119425 OR 337932476209041409 OR 337932477408620544 OR 337932480478842880 OR 337932478775959554 OR 337932480566931456 OR 337932478763376640 OR 337932481841999872 OR 337932479337992192 OR 337932479296045057 OR 337932479333797889 OR 337932484614434816 OR 337932484606038017 OR 337932482777317376 OR 337932484664758272 OR 337932482785718273 OR 337932484589273088 OR 337932487399444481 OR 337932489031032833 OR 337932489114923008 OR 337932486573166592 OR 337932490704560130 OR 337932489144270848 OR 337932488762601472 OR 337932492097069056 OR 337932497780355072 OR 337932498900230144 OR 337932499722321921 OR 337932514431729665 OR 337932561806409731 OR 337932567284154368 OR 337932567300935680 OR 337932574603214848 OR 337932571134533632 OR 337932574674518016 OR 337932575484026881 OR 337932578206121984 OR 337932582215892994 OR 337932586653454336 OR 337932584917024768 OR 337932592986865664 OR 337932597017587712 ....
I intend to facet the result based on a few fields.
I'm not sure whether this solution will help you or not, but I tried something for your problem.
Whatever query you give Solr, it first parses it into its internal representation and then executes it. You can do a little arithmetic before querying Solr. Consider the following scenario for your use case.
Suppose you have 5,000 tweet_ids in total and need to OR together about 4,000 of them. In that case it is better to query for the other 1,000 (5000 - 4000) tweet_ids as a negated (NOT) query, so the query carries fewer values.
So, try querying with the rest of the tweet_ids in a negated query instead of the OR query.
If I were you, I'd create a new field for this, say custom_list_id. Whenever you generate a new list, index the new data and then query by that list id.
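Another angle, if your Solr is 4.10 or newer: the terms query parser was added for exactly this kind of long ID list, and it sidesteps the boolean-clause machinery (maxBooleanClauses defaults to 1024, one reason thousands of ORs hurt). A sketch in Python; the core name and facet field are assumptions:

import requests

tweet_ids = ['337931022601699328', '337931064293081089']  # ... up to the full list

params = {
    'q': '{!terms f=tweet_id}' + ','.join(tweet_ids),  # one comma-separated term list
    'facet': 'true',
    'facet.field': 'user_id',  # hypothetical facet field
    'wt': 'json',
}
resp = requests.get('http://localhost:8983/solr/tweets/select', params=params)
print(resp.json()['response']['numFound'])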

HBase batch query example

I read in "hadoop design pattern" book, "HBase supports batch queries, so it would be ideal to buffer all the queries we want to execute up to some predetermined size. This constant depends on how many records you can comfortably store in memory before querying HBase."
I tried to search for some examples online but couldn't find any. Can someone show me an example using Java MapReduce?
Thanks.
Dan
Is this what you want? You can collect HBase Get objects in a list and submit the whole list in one call. It's a little better than invoking table.get(get) multiple times.
Configuration conf = HBaseConfiguration.create();
HTablePool pool = new HTablePool(conf, 5);
HTableInterface table = pool.getTable("table");
List<Get> gets = new ArrayList<Get>();
gets.add(new Get(Bytes.toBytes("row1")));  // buffer as many Gets as comfortably fit in memory
gets.add(new Get(Bytes.toBytes("row2")));
Result[] results = table.get(gets);        // one batched round trip instead of one per Get

Row count of a column family in Cassandra

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.
For instance, if I have a column family containing users and want to get the number of users, how could I do it? Each user is its own row.
If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:
nodetool --host <hostname> cfstats
This will dump out a list for each column family looking like this:
Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Positives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634
The "Number of Keys (estimate)" row is a good guess across the cluster and the performance is a lot faster than explicit count approaches.
If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.
If you are not, you will need to store your user ids in a special row.
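For completeness, here is a rough sketch of a full range-scan count with the pycassa client (the keyspace and column family names are made up); note that it walks every key, so it is slow on large data sets:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')
users = pycassa.ColumnFamily(pool, 'Users')

# fetch zero columns per row and count the keys as they stream past
row_count = sum(1 for _ in users.get_range(column_count=0, filter_empty=False))
print(row_count)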
I found an excellent article on this here: http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra
select count(*) from cf limit 1000000
The above statement can be used if an approximate upper bound is known beforehand. I found it useful in my case.
[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]
I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.
I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.
Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.
http://www.mail-archive.com/user@cassandra.apache.org/msg03965.html
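A minimal sketch of that counter pattern with the python-memcached client (the key name is made up); every code path that inserts a user would call incr:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])
mc.add('user_row_count', '0')   # create the counter only if it does not exist yet
mc.incr('user_row_count')       # atomic server-side increment on each user insert
print(mc.get('user_row_count'))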
There is always map/reduce, but that probably goes without saying. If you have it set up with Hive or Pig, you can count any table across the cluster, though I am not sure the task trackers know about Cassandra locality; the whole table may have to be streamed across the network, so even task trackers on Cassandra nodes may receive their data from other Cassandra nodes :(. I would love to hear from anyone who knows for sure.
NOTE: We are setting up map/reduce on Cassandra mainly because, if we want an index later, we can map/reduce one into Cassandra.
I have been getting the counts like this after I convert the data into a hash in PHP.

Autocomplete Dropdown - too much data, timing out

So, I have an autocomplete dropdown with a list of townships. Initially I just had the 20 or so that we had in the database... but recently we noticed that some of our data lies in other counties... even other states. So the answer to that was to buy one of those databases with all towns in the US (yes, I know, geocoding is the answer, but due to time constraints we are doing this until we have time for that feature).
So, when we had 20-25 towns the autocomplete worked stellarly... now that there are 80,000 it's not as easy.
As I type this, I am thinking that the best way to do this is to default to our own state, so there will be much less data. I will add a state selector to the page that defaults to NJ; you can pick another state if need be, which will narrow the list down to under 1,000. Though I may still have the same issue? Does anyone know of a workaround for an autocomplete with a lot of data?
should I post teh codez of my webservice?
Are you trying to autocomplete after only 1 character is typed? Maybe wait until 2 or more...?
Also, can you just return the top 10 rows, or something?
Sounds like your application is suffocating on the amount of data being returned and then rendered by the browser.
I assume that your database has the proper indexes, and you don't have a performance problem there.
I would limit the results of your service to no more than, say, 100. Users will not look at any more than that anyhow.
I would also only retrieve data from the service once 2 or 3 characters have been entered, which will further reduce the scope of the query.
Good Luck!
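A sketch of what that looks like on the service side, using Flask purely for illustration (query_towns is a hypothetical data-access helper):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/towns')
def towns():
    prefix = request.args.get('term', '')
    state = request.args.get('state', 'NJ')
    if len(prefix) < 2:
        return jsonify([])  # don't hit the database until the input is selective
    # cap the result set so the browser never renders thousands of rows
    return jsonify(query_towns(state, prefix, limit=100))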
Stupid question maybe, but... have you checked to make sure you have an index on the town name column? I wouldn't think 80K names should be stressing your database...
I think you're on the right track. Use a series of cascading inputs, State -> County -> Township where each succeeding one grabs the potential population based on the value of the preceding one. Each input would validate against its potential population to avoid spurious inputs. I would suggest caching the intermediate results and querying against them for the autocomplete instead of going all the way back to the database each time.
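One way to sketch that caching idea in Python (fetch_towns stands in for the one-time database call):

from functools import lru_cache

@lru_cache(maxsize=None)
def towns_for_state(state):
    return fetch_towns(state)  # hypothetical query; runs once per state, then cached

def autocomplete(state, prefix, limit=10):
    prefix = prefix.lower()
    return [t for t in towns_for_state(state) if t.lower().startswith(prefix)][:limit]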
If you have control of the underlying SQL, you may want to try several "UNION" queries instead of one query with several "OR like" lines in its where clause.
Check out this article on optimizing SQL.
I'd just limit the SQL query with a TOP clause. I also like using a "less than" instead of a LIKE:
select top 10 name from cities where @partialname < name order by name;
that "Ce" will give you "Cedar Grove" and "Cedar Knolls" but also "Chatham" & "Cherry Hill" so you always get ten.
In LINQ:
var q = (from c in db.Cities
         where c.Name.CompareTo(partialname) > 0
         orderby c.Name
         select c.Name).Take(10);
