Blog-Archiv

Mittwoch, 7. Oktober 2015

Natural Sort in Java, First Performance Tuning

This is about performance tuning of the natural-sort class I started in my previous article. Simple and comprehensible code is rarely the fastest, and sorting is always about speed. The challenge is to achieve a fast sorter without hacking on long methods that use countless variables with abbreviated names no one understands. In other words, I want to prove that simplicity and good performance are not contradictory.

But before talking about a way to make the sorter faster I need to specify how I perceive that something is fast. I wrote a unit test for that. This test does not assert anything, it only outputs its benchmark at end. It measures in nanoseconds, 1000000 nanoseconds make up a millisecond, 1000 millis make up a second. It performs several repeats and calculates an average result then. It uses the NaturalSortTest.names data of the unit test from previous Blog and blows them up a little.

import org.junit.Test;

/**
 * Compares different implementations of natural sort orders.
 */
public class NaturalSortPerformanceTest
{
    @Test
    public void testPerformance() {
        final int REPEATS = 5;
        long time0 = 0;
        long time1 = 0;
        long time2 = 0;
        long time3 = 0;
        for (int i = 0; i < REPEATS; i++) {
            time0 += performance(null);
            time1 += performance(new PoolPaourComparator());
            time2 += performance(new NaturalSortComparator2());
            time3 += performance(new NaturalSortComparator());
        }
        System.err.println("Alphabetical comparator"+" yields "+(time0 / REPEATS)+" millis");
        System.err.println(PoolPaourComparator.class.getSimpleName()+" yields "+(time1 / REPEATS)+" millis");
        System.err.println(NaturalSortComparator2.class.getSimpleName()+" yields "+(time2 / REPEATS)+" millis");
        System.err.println(NaturalSortComparator.class.getSimpleName()+" yields "+(time3 / REPEATS)+" millis");
    }

    private long performance(Comparator<String> comparator) {
        final int FACTOR = 10000;
        final List<String> toSort = new ArrayList<>(FACTOR * NaturalSortTest.names.size());
        for (int i = 0; i < FACTOR; i++) {
            for (int j = 0; j < NaturalSortTest.names.size(); j++) {
                String name = NaturalSortTest.names.get(j);
                final int MULTIPLY = 3;
                for (int k = 0; k < MULTIPLY; k++)
                    name += "/"+name;
                toSort.add(name);
            }
        }
        final long start = System.nanoTime();
        Collections.sort(toSort, comparator);
        final long end = System.nanoTime();
        
        return (end - start) / 1000000l;
    }

}

This will tell me the truth about all sorters I want to throw into the game.

As you can see there are already three comparators in the race. The null comparator is the alfabetical order. This will clearly be the fastest. As it does not produce natural sort order, it is here just to show the gap.

The PoolPaourComparator you can download from its GitHub. Might not be what you are looking for, but you can fix it, it is not so hard to understand. At least it is fast.

The NaturalSortComparator2 is subject of the following article. It is a rewrite of the naive NaturalSortComparator from my previous Blog.

Here is the result of one of the test runs:

Alphabetical comparator yields 45 millis
PoolPaourComparator yields 560 millis
NaturalSortComparator2 yields 1367 millis
NaturalSortComparator yields 2597 millis

As you see my NaturalSortComparator is far behind PoolPaourComparator. So this article might have another follower that tries to achieve even better performance. But at least it is two times faster than the latest.

First Optimized Natural Sorter

What were the weaknesses of the first NaturalSortComparator? It had to read all strings in the list entirely, as many times as they were compared to each other, so it had to scan lots of strings several times repeatedly. The new comparator just reads until the first difference is detected, not the entire string. Nevertheless it reads as far as a letter- or digit-sequence lasts, so it also does more than needed. And this is the reason why it still significantly slower than a comparator that reads just characters, not string parts. Such a comparator could be implemented as finite automaton. Might be the next NaturalSortComparator Blog:-)

I must rewrite the split(String string) method to return an Iterator, and I must implement this iterator to never read more characters than needed. Here is what I've done to integrate that Iterator into the old code.

public class NaturalSortComparator2 implements Comparator<String>
{
    private enum Type
    {
        SPACE,
        DIGIT,
        LETTER,
        SEPARATOR;
    }
    
    private static class Part
    {
        final Type type;
        final String content;
        
        Part(Type type, String content) {
            this.type = type;
            this.content = content;
        }
    }
    
    private final boolean caseSensitive;
    
    /** Comparator that compares case-insensitive. */
    public NaturalSortComparator2() {
        this(false);
    }
    
    /** Comparator that treats case-sensitivity according to given parameter. */
    public NaturalSortComparator2(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }
    
    /**
     * Splits the given strings and then compares them,
     * according to their nature, either as numbers or others.
     */
    @Override
    public int compare(String string1, String string2) {
        final Iterator<Part> iterator1 = split(string1);
        final Iterator<Part> iterator2 = split(string2);
        
        while (iterator1.hasNext() || iterator2.hasNext()) {
            // first has no more parts -> comes first
            if ( ! iterator1.hasNext() && iterator2.hasNext())
                return -1;
            
            // second has no more parts -> comes first
            if (iterator1.hasNext() && ! iterator2.hasNext())
                return 1;

            // get parts to compare
            final Part part1 = iterator1.next();
            final Part part2 = iterator2.next();
            
            int result;
            // if part is a number
            if (part1.type == Type.DIGIT && part2.type == Type.DIGIT) {
                result = Long.compare(Long.valueOf(part1.content), Long.valueOf(part2.content));
                
                // if numbers are equal, then shorter comes first
                if (result == 0)
                    result = Integer.compare(part1.content.length(), part2.content.length());
            }
            else    {   // part is name or separator
                result = caseSensitive ? part1.content.compareTo(part2.content) : part1.content.compareToIgnoreCase(part2.content);
            }

            if (result != 0)    // found difference
                return result;
        }
        
        return 0;
    }

    ......

}

This is mostly the same as the old Comparator, with following differences:

  • it uses an inner class Part to represent the split results, and it tags any such part with a Type (enum) that expresses its nature (number, word, separator)
  • the Part instances come from a split() method that returns an Iterator<Part> which should never read more than needed
  • it distinguishes between spaces and separators
  • it ignores spaces completely

The rest is the same as it was before. The shorter string wins. When not both are numbers, the parts are compared like they were strings.

Now here is the missing rewritten split(String) method that produces a "lazy" iterator.

    /** Splits given string into a list of names, numbers and separators (others). */
    private Iterator<Part> split(final String string) {
        return new Iterator<Part>()    {
            private char currentChar;
            private int position;
            private final int length = string.length();
            private final StringBuilder sb = new StringBuilder();
            
            @Override
            public boolean hasNext() {
                // read away spaces
                while (position < length &&
                        getState(currentChar = string.charAt(position)) == Type.SPACE)
                    position++;
                
                return position < length;
            }
            
            @Override
            public Part next() {
                if (hasNext() == false) // read away spaces
                    throw new IllegalStateException("Iterator called when no next item exists!");
                
                final Type initialState = getState(currentChar);
                Type state = initialState;
                while (position < length && state == initialState) {
                    sb.append(currentChar);
                    position++;
                    if (position < length)
                        state = getState(currentChar = string.charAt(position));
                }
                final String part = sb.toString();
                sb.setLength(0);
                return new Part(initialState, part);
            }
            
            private Type getState(char c) {
                return Character.isWhitespace(c) ? Type.SPACE
                        : Character.isDigit(c) ? Type.DIGIT
                        : Character.isLetter(c) ? Type.LETTER
                        : Type.SEPARATOR;
            }
        };
    }

An anonymous inner class implements the Iterator interface to lazily return split parts of a string. The final string parameter is always visible and available for the inner implementation. The hasNext() method reads away spaces and returns false when no next split-part is available. The next() method delivers another split-part when the caller demands it, not earlier. This makes the sorter two times faster than the old version. The next() implementation also determines the type of the part, whether it is a number, a separator or a word. Spaces are ignored completely here.

The next performance tuning attempt will be to rewrite this to always compare characters, not string parts, unless a digit sequence is starting on both input strings. Let's see how fast this will evolve :-)




Keine Kommentare: