Wednesday, October 18, 2006

Awkward behavior of System.Collections.Generic.List.BinarySearch

I was trying to build my own morphoanalyzer for Russian language. First step is to search through the list of word radicals – prefixes which will remain constant in all word forms like "wom" for "woman"/"women". In Russian I get more 100K+ radicals, so search efficiency is definitely an issue. Since my initial list is presorted, but can contain duplicate radicals, I decided to choose .NET 2.0 Generic List over SortedList/Dictionary and use BinarySearch. One of the surprising results I've got is that BinarySearch (Int32, Int32, T, Generic IComparer) does not always return first occurrence of item, if there are duplicates. I realize this is caused by nature of binary search algorithm. After some time I even found corresponding note in MSDN, but I cannot come up with any reasons why someone might want behavior like that. Combination of List and BinarySearch is the only container that accepts duplicate items, and it is quite obvious that most of the people will use BinarySearch to find all occurrences of items, not the random one. I do realize that it will hurt performance insignificantly, but it will become right-out-of-the-box usable solution. I created my wrapper around BinarySearch, which produce correct results. The assumption is that number of duplicate items is significantly less than total number of items in the collection:


public static int TrueBinarySearch(string prefix, int startIdx) {
int idx = List.BinarySearch(startIdx + 1, List.Count - startIdx - 1,
prefix, null);

if (idx >= 0) {
while (idx > startIdx && List[idx].Key == prefix) {
idx--;
}
}

return ++idx;
}

No comments: