Sunday, May 25, 2008

Feature elimination in RapidMiner

Recently I discovered RapidMiner, a data mining application. It has a good but huge tutorial; reading sections 3 and 4 is enough to get an impression of what the tool can do for you. Essentially, it takes sample data as input, where each instance is described by a vector of features (e.g. size, weight, color) and an indication of which class the instance belongs to. The processing typically consists of pre-processing (converting feature values into a more appropriate format, eliminating features that do not bring any additional information, etc.), machine learning (it has quite a few algorithms) and post-processing. The software provides something similar to an IDE, where you can visually construct a process from existing building blocks. A typical problem with this approach is how to map the outputs of one block to the inputs of the next. They seem to have chosen the simplest solution: every output has a certain type, and there can be only one input or output of each type. Obviously they were targeting data mining specialists, so the user should not have to worry about such details.
I tried to apply it to my task, which is essentially a subset of the whole process. I have a vector of features, and I know for sure how to classify it: take the maximum value among the features (which is either -1, 0 or 1) and use it as the class. I need the smallest possible subset of features that still gives a 100% match with the initial classification. It appears that there are two ways of doing that in RapidMiner:

  • Directly alter the model file saved on disk. This is extremely difficult, because the model file is just a serialized XML dump of a Java class, so even the smallest one is 70K.
  • Use the ability to generate new attributes based on the existing set, and use the result as a model.

And here I ran into a problem typical of all WYSIWYG editors that help you build a program from building blocks. As long as you stick to the samples, everything works fine. Take one step off-road and you are in the dust. I can create a new attribute, but it will have the type regular, while all validators expect to see labels of type predicted(label). And the user is not supposed to change the types of attributes, because it is dangerous.
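For reference, the task itself can be stated outside RapidMiner in a few lines of plain Python. This is only a sketch under my own assumptions: the function names and the sample rows are made up for illustration, every feature value is assumed to be -1, 0 or 1, and the search is a plain brute force over feature subsets, so it is only practical for a small number of features. It has nothing to do with how RapidMiner does feature selection internally.

```python
from itertools import combinations


def label(row, cols):
    """Class of an instance: the maximum value among the chosen features."""
    return max(row[c] for c in cols)


def smallest_matching_subset(rows):
    """Return the smallest tuple of column indices whose maximum reproduces,
    for every row, the class computed from all columns.

    Subsets are tried in order of increasing size, so the first match is
    a smallest one. The full set always matches, so the loop always returns.
    """
    n = len(rows[0])
    all_cols = range(n)
    target = [label(r, all_cols) for r in rows]
    for size in range(1, n + 1):
        for cols in combinations(all_cols, size):
            if all(label(r, cols) == t for r, t in zip(rows, target)):
                return cols


# Hypothetical sample data: four instances, four features each.
rows = [
    (1, 0, -1, 0),
    (0, 0, -1, 0),
    (-1, -1, -1, -1),
    (0, 1, 1, 0),
]
print(smallest_matching_subset(rows))
```

On this made-up data, no single feature reproduces the classification, but features 0 and 1 together do, so the last two columns can be eliminated.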
I apologize if there is a simple way of solving my task, but I think it is a very good illustration of how programming works. If you are building a new development environment, even one targeted at non-developers, you need to make sure there is an "advanced" way that has no constraints. No one can predict all the ways your system will be used. Just hide the candy deep enough that only geeks can find it.
