Nikolai Obedin


ghcid: a must-have tool for Haskell developers

Aug 18, 2016 in Haskell, Programming

Usually when programming in Haskell, I rely a lot on the compiler. It catches many errors caused by inattention, which occur especially often when refactoring large portions of code. In some sense the famous “it compiles, therefore it works” mantra is true (assuming that you don’t have logical errors). The problem with this approach was that for a long time I ran the compiler manually and then read its output. The output was usually quite long, and I had to read it and start fixing errors from the top, otherwise I might end up fixing the wrong error. It was a tedious routine, even with the help of bash scripts and file watchers, but thanks to ghcid, I am free from it now.

ghcid can be read as “GHCi daemon” or “GHC with a bit of IDE”. It is a rather small program which starts GHCi, loads your modules into it, evaluates some Haskell expression there, watches your source code, and fires :reload and re-evaluates the expression once something changes. It also captures the warnings and errors produced by the compiler and presents them in a way that lets you see the topmost ones right away, without scrolling the whole output to the top. Last but not least, it plays really well with stack, the de facto standard tool for building and testing Haskell projects nowadays.

How to use it

First, install it:

$ stack install ghcid

Now you can run it with stack exec. At this point I usually make a bash script and tune ghcid the way I need it for my application.

The first thing we need to set up is the expression that will be evaluated in the GHCi session. You can do it with the -T flag. If you want to run something other than GHCi, you can pass the command via the -c flag. By default ghcid detects whether you’re using stack or plain cabal and runs either stack repl or cabal repl respectively.

Another thing I usually turn on for my projects is the -Wall -Werror pair of compiler flags. It catches errors such as redundant imports, orphan instances and unused variables, and helps you keep the codebase clean. However, when working in REPL mode, you usually concentrate on the big picture and leave those warnings to be fixed later, e.g. before committing the code. When the -W option is passed, ghcid doesn’t take such warnings into account and evaluates the given expression anyway.

By default ghcid watches only the source files that were loaded by GHCi. Sometimes that is not enough. For example, I’m now working on an application which keeps Hamlet templates in the templates directory, and I’d like to reload the app each time one of the templates changes. Luckily, it’s easy to make ghcid do so with the --restart=<PATH> or --reload=<PATH> options. The first one rebuilds the project and restarts the whole GHCi process, while the second one just fires the :reload command in the existing GHCi session whenever any of the files and directories under <PATH> changes.
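Putting it all together, my launcher script usually boils down to a single command like the following sketch (the :main expression and the templates directory come from my setup, they are not defaults):

$ ghcid -c "stack repl" -T ":main" -W --reload=templates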

Issues and ways to improve

One thing that would be nice to have is error highlighting. Sure, errors are printed in bold now, but having line numbers and error messages in different colors would help even more. Right now this can be achieved by running ghcid inside a terminal emulator in Neovim or Emacs.

Currently, highlighting is the only improvement that comes to my mind. Overall, ghcid is a great development tool, and even minor bugs (usually caused by GHCi) can’t really outweigh its advantages. Definitely a must-have.


Algorithms 101: Sorting

Aug 12, 2016 in Algorithms

This article covers several sorting algorithms: those working in quadratic time (bubble, insertion and selection sort), in loglinear time (merge sort, quicksort and heap sort), and in linear time, though only for special kinds of data (counting and radix sort). Source code of the examples used in this article is available on Github.

Quadratic time algorithms

There is not much to say about quadratic time algorithms. Usually they are not used on their own; a loglinear algorithm is preferred instead. For example, insertion sort is sometimes used inside merge sort or quicksort to sort small subarrays. As for bubble sort, it is usually considerably slower than the other quadratic time algorithms and is not used in practice.

Bubble sort

Idea: repeatedly iterate over the array, comparing items pairwise and exchanging them if they are out of order. It is easy to notice that each pass puts the biggest remaining item at the end of the array, thus at most n passes are needed to put all items in their places. This algorithm is stable.

package com.dancingrobot84.algorithms.sorting;

import com.dancingrobot84.algorithms.util.ArraysExt;

public class BubbleSort {
    public static void sort(int[] array) {
        boolean isSorted;
        do {
            isSorted = true;
            for (int i = 0; i < array.length - 1; i++) {
                if (array[i] > array[i+1]) {
                    ArraysExt.swap(array, i, i+1);
                    isSorted = false;
                }
            }
        } while (!isSorted);
    }
}

Insertion sort

Idea: iterate over the array keeping its left side sorted. When the next item is encountered, push it to the left until the item before it is less than or equal to it. There are two loops, each of at most n iterations, thus quadratic time. This algorithm is stable.

package com.dancingrobot84.algorithms.sorting;

import com.dancingrobot84.algorithms.util.ArraysExt;

public class InsertionSort {
    public static void sort(int[] array) {
        for (int i = 1; i < array.length; i++) {
            for (int j = i; j > 0 && array[j] < array[j-1]; j--) {
                ArraysExt.swap(array, j, j-1);
            }
        }
    }
}

Selection sort

Idea: iterate over all positions of the array. For each position i, search for the minimal item in the range [i, n-1] and put it at position i. On each iteration, searching for the minimal item takes O(n) time, giving quadratic time overall. This algorithm can be implemented as a stable one; the swap-based implementation below, however, is not stable.

package com.dancingrobot84.algorithms.sorting;

import com.dancingrobot84.algorithms.util.ArraysExt;

public class SelectionSort {
    public static void sort(int[] array) {
        for (int i = 0; i < array.length; i++) {
            int curMinIndex = i;
            for (int j = i; j < array.length; j++) {
                if (array[j] < array[curMinIndex]) {
                    curMinIndex = j;
                }
            }
            ArraysExt.swap(array, curMinIndex, i);
        }
    }
}

Loglinear time algorithms

Loglinear algorithms are the ones widely used in practice. It is proven that arbitrary data cannot be sorted using comparisons faster than in O(nlog(n)) time, so these algorithms are asymptotically the fastest. Which of them to use in a program, however, depends on its goals and constraints.

Merge sort

Idea (top-down merge sort): split the array in halves, sort each half recursively and then merge them. By the master theorem, T(n) = 2T(n/2) + O(n) = O(nlog(n)). This algorithm is stable.

Bottom-up merge sort treats an n-element array as n singleton arrays. It merges them pairwise, producing n/2 two-element arrays, and keeps merging again and again until it merges two n/2-element arrays into one. It can be implemented without recursion.
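For illustration, here is a minimal sketch of the bottom-up variant (a hypothetical BottomUpMergeSort class, not part of the repository linked above):

public class BottomUpMergeSort {
    public static void sort(int[] array) {
        int n = array.length;
        int[] buffer = new int[n];
        // Merge runs of width 1, 2, 4, ... until a single run covers the array.
        for (int width = 1; width < n; width *= 2) {
            for (int start = 0; start + width < n; start += 2 * width) {
                int mid = start + width - 1;
                int end = Math.min(start + 2 * width - 1, n - 1);
                merge(array, buffer, start, mid, end);
            }
        }
    }

    private static void merge(int[] array, int[] buffer, int start, int mid, int end) {
        int left = start, right = mid + 1, cur = start;
        while (left <= mid && right <= end) {
            // "<=" keeps the merge stable: ties go to the left run.
            buffer[cur++] = array[left] <= array[right] ? array[left++] : array[right++];
        }
        while (left <= mid) { buffer[cur++] = array[left++]; }
        while (right <= end) { buffer[cur++] = array[right++]; }
        System.arraycopy(buffer, start, array, start, end - start + 1);
    }
}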

The top-down implementation below uses O(n) space for the buffer array into which two subarrays are merged. It is the standard recursive formulation.

package com.dancingrobot84.algorithms.sorting;

import java.util.Arrays;

public class MergeSort {
    public static void sort(int[] array) {
        new Sorter(array).sort();
    }

    private static class Sorter {

        private final int[] originalArray;
        private final int[] buffer;

        public Sorter(int[] array) {
            originalArray = array;
            buffer = new int[array.length];
        }

        public void sort() {
            sort(0, originalArray.length-1);
        }

        private void sort(int start, int end) {
            if (start >= end) {
                return;
            }

            int mid = (start + end) / 2;
            sort(start, mid);
            sort(mid+1, end);
            merge(start, end);
        }

        private void merge(int start, int end) {
            int mid = (start + end) / 2;
            int left = start;
            int right = mid+1;
            int cur = 0;

            for (; left <= mid && right <= end; cur++) {
                if (originalArray[left] <= originalArray[right]) { // "<=" keeps the sort stable
                    buffer[cur] = originalArray[left];
                    left++;
                } else {
                    buffer[cur] = originalArray[right];
                    right++;
                }
            }

            for (; left <= mid; cur++, left++) {
                buffer[cur] = originalArray[left];
            }

            for (; right <= end; cur++, right++) {
                buffer[cur] = originalArray[right];
            }

            System.arraycopy(buffer, 0, originalArray, start, cur);
        }
    }
}

Merge sort is an easily parallelizable algorithm. It is also more efficient than quicksort on linked lists, where it can be implemented using only constant extra space.

Quicksort

Idea: split the array in two parts by comparing each element to a pivot, then recursively sort these parts. Because quicksort’s behaviour depends on the contents of the array, it is possible to construct an array that forces the algorithm to run in quadratic time. Although this problem is largely mitigated by choosing a random pivot, to be precise one has to say that this algorithm has O(nlog(n)) time complexity in the average case and O(n^2) in the worst case. Space complexity is O(log(n)) in the average case and O(n) in the worst case, since the algorithm is recursive.

Another delicate part of quicksort is how to partition the array in constant space. To do that, create a hiStart pointer which stores the index of the beginning of the second part of the array. Then iterate over the array comparing items to the pivot: if the current item is less than the pivot, swap it with the item at hiStart and increment hiStart.

To speed up quicksort on inputs with many repeated items, one could split the array into three parts: less than, equal to, and greater than the pivot. It requires one more pointer, loEnd, which stores the index of the end of the first part, as the sketch below shows.
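Here is a sketch of such a three-way (“Dutch national flag”) partition, assuming the same ArraysExt.swap helper used by the rest of the code; it is an illustration, not part of the implementation below:

    // Rearranges [start, end] so that items < pivot come first, then items
    // == pivot, then items > pivot; returns {loEnd, hiStart}, the bounds of
    // the middle part. The outer parts are then sorted recursively.
    private static int[] partition3(int[] array, int start, int end, int pivot) {
        int loEnd = start;     // next free slot for an item < pivot
        int hiStart = end + 1; // first slot of the "greater than" part
        int i = start;
        while (i < hiStart) {
            if (array[i] < pivot) {
                ArraysExt.swap(array, loEnd++, i++);
            } else if (array[i] > pivot) {
                ArraysExt.swap(array, i, --hiStart);
            } else {
                i++;
            }
        }
        return new int[] { loEnd, hiStart };
    }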

The implementation below chooses a random pivot and uses two-way partitioning. It is not stable; quicksort can be made stable, but that requires O(n) additional space. This implementation uses O(1) additional space for partitioning.

package com.dancingrobot84.algorithms.sorting;

import com.dancingrobot84.algorithms.util.ArraysExt;
import java.util.Arrays;
import java.util.Random;

public class QuickSort {
    public static void sort(int[] array) {
        sort(array, 0, array.length-1);
    }

    private static Random generator = new Random();

    private static void sort(int[] array, int start, int end) {
        if (start >= end) {
            return;
        }

        int pivotIndex = partition(array, start, end);
        sort(array, start, pivotIndex - 1); // the pivot is already in its final place
        sort(array, pivotIndex+1, end);
    }

    private static int partition(int[] array, int start, int end) {
        int pivotIndex = start + generator.nextInt(end - start);
        int pivot = array[pivotIndex];
        ArraysExt.swap(array, pivotIndex, end);

        int hiStart = start;
        for (int i = start; i < end; i++) {
            if (array[i] < pivot) {
                ArraysExt.swap(array, hiStart, i);
                hiStart += 1;
            }
        }

        ArraysExt.swap(array, hiStart, end);
        return hiStart;
    }
}

Quicksort has a better constant factor than merge sort and heap sort. When implemented well, it can outperform its competitors drastically. It is possible to parallelize quicksort too, but it is a bit harder to do than for merge sort.

Heap sort

Idea: build a heap on the input array, then repeatedly take the minimal element out of it until no items are left. Since building a heap takes O(n) time and retrieving the minimal element takes O(log(n)) time, the overall time complexity of this sorting algorithm is O(nlog(n)). This algorithm is not stable.

This idea could be applied to any data structure which supports building in O(nlog(n)) or faster and retrieving the minimum/maximum in O(log(n)). A heap is preferred because it can be built in place on the input array, thus spending only O(1) extra space.

package com.dancingrobot84.algorithms.sorting;

import com.dancingrobot84.algorithms.datastructures.Heap;
import com.dancingrobot84.algorithms.datastructures.impl.BinaryHeap;
import com.dancingrobot84.algorithms.util.ArraysExt;
import java.util.Arrays;

public class HeapSort {
    public static void sort(Integer[] array) {
        Heap<Integer> heap = BinaryHeap.fromArray(array);
        int i = array.length - 1;
        while (heap.size() > 0) {
            array[i] = heap.pop();
            i--;
        }
        ArraysExt.reverse(array);
    }
}

Linear time algorithms

Linear time sorting algorithms do not compare elements; otherwise they would be subject to the O(nlog(n)) lower bound. Instead, they either put constraints on the input or use some knowledge about the items of the array. They are usually used inside more complex algorithms to cut some time here and there, e.g. when building suffix arrays or sorting data on parallel clusters.

Counting sort

Assumption: all items of the input array belong to some finite set, or can be uniquely associated with a key from such a set. Denote the size of this set as k.

Idea: iterate over the input array counting how many times each distinct item occurs, then use this information to put the items in their places. To store the occurrences we need an array of integers of length k and an output array of length n, thus O(n+k) space. Both of these arrays are traversed during the execution of the algorithm, so its time complexity is O(n+k) too. Note that this prefix-sum formulation is stable, a property that radix sort below relies on.

package com.dancingrobot84.algorithms.sorting;

import java.util.Arrays;

public class CountingSort {
    public static final int K = 100;

    public static void sort(int[] array) {
        int[] counter = new int[K];
        for (int item: array) {
            counter[item]++;
        }

        int totalCount = 0;
        for (int i = 0; i < K; i++) {
            int oldCount = counter[i];
            counter[i] = totalCount;
            totalCount += oldCount;
        }

        int[] arrayCopy = Arrays.copyOf(array, array.length);
        for (int item: arrayCopy) {
            array[counter[item]] = item;
            counter[item]++;
        }
    }
}

Radix sort

Assumption: each item of the input array can be represented in positional notation, i.e. as a sequence of characters from a finite alphabet of size k, where the weight of each character depends on its position within the sequence. Examples of positional notation are binary and decimal numbers, and strings.

Idea: sort the input items by the character at a fixed position, for every possible position: sort by the 1st character, then by the 2nd, and continue until all m positions have been used. Sorting at a given position can be performed with counting sort, since the alphabet is finite, and that per-position sort must be stable for the whole scheme to work. Overall it takes O(m(n+k)) time and O(n+k) extra space to sort the array.

Depending on the starting position, radix sort can be LSD (least significant digit first) or MSD (most significant digit first). The former sorts in numeric order (correct for decimals), the latter in lexicographic order. The implementation below is LSD radix sort.

package com.dancingrobot84.algorithms.sorting;

import java.util.Arrays;

public class RadixSort {
    public static final int K = 2;
    public static final int M = 32;
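
    // Bits are taken as-is, so this implementation assumes non-negative
    // inputs: for negative numbers the set sign bit would sort them after
    // all positive ones.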

    public static void sort(int[] array) {
        int[] temp = array;
        for (int i = 0; i < M; i++) {
            temp = sort(temp, i);
        }
        System.arraycopy(temp, 0, array, 0, array.length);
    }

    private static int[] sort(int[] array, int bit) {
        int[] counter = new int[K];
        for (int i = 0; i < array.length; i++) {
            counter[getBit(array[i], bit)]++;
        }

        int totalCount = 0;
        for (int i = 0; i < K; i++) {
            int oldCount = counter[i];
            counter[i] = totalCount;
            totalCount += oldCount;
        }

        int[] result = new int[array.length];
        for (int item: array) {
            result[counter[getBit(item, bit)]] = item;
            counter[getBit(item, bit)]++;
        }
        return result;
    }

    private static int getBit(int n, int k) {
        return (n >> k) & 1;
    }
}

Algorithms comparison chart

Algorithm        Avg. time    Worst time   Avg. space  Stable  General
Bubble sort      O(n^2)       O(n^2)       O(1)        Yes     Yes
Insertion sort   O(n^2)       O(n^2)       O(1)        Yes     Yes
Selection sort   O(n^2)       O(n^2)       O(1)        No      Yes
Merge sort       O(nlog(n))   O(nlog(n))   O(n)        Yes     Yes
Quicksort        O(nlog(n))   O(n^2)       O(log(n))   No      Yes
Heap sort        O(nlog(n))   O(nlog(n))   O(1)        No      Yes
Counting sort    O(n+k)       O(n+k)       O(n+k)      Yes     No
Radix sort       O(m(n+k))    O(m(n+k))    O(n+k)      Yes     No

  1. Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms

  2. Sorting - topcoder

Algorithms 101: Basic Data Structures Pt. 2

Aug 12, 2016 in Algorithms

This article covers another batch of must-know data structures: heaps, priority queues, associative arrays and hash tables. Source code of the examples used in this article is available on Github.

Heaps

A heap is a tree-like data structure satisfying the following property: the key of a parent node is ordered with respect to the keys of its child nodes. According to the order, heaps are divided into max heaps (the parent key is greater than the child’s one) and min heaps (the parent key is less than the child’s one). According to the structure of the underlying tree, heaps can be binary, d-ary, binomial, Fibonacci and many others. The following interface is used to represent heaps:

package com.dancingrobot84.algorithms.datastructures;

public interface Heap<T extends Comparable<? super T>> {
    /** Get top item without removing it from heap **/
    public T peek();

    /** Get top item and remove it from heap **/
    public T pop();

    /** Return number of items in heap **/
    public int size();

    /** Insert new item into heap **/
    public void insert(T item);
}

Heaps are usually used to implement priority queues, which in turn appear in some graph algorithms (Prim’s, Dijkstra’s). The heapsort algorithm is another example of heap usage.

AFAIK there is no dedicated heap data structure in the Java and C++ standard libraries. However, C++ has the std::make_heap function, which rearranges the elements of a given container so that they form a heap. As for Java, there is the PriorityQueue class, which uses a heap under the hood.
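For instance, Java’s PriorityQueue behaves as a min-heap out of the box (a tiny demo; the class name is hypothetical):

import java.util.PriorityQueue;

public class PriorityQueueDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap by natural ordering
        heap.add(3);
        heap.add(1);
        heap.add(2);
        System.out.println(heap.poll()); // prints 1
        System.out.println(heap.poll()); // prints 2
    }
}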

Binary min heap implementation

Idea: this implementation does not have an explicit tree structure. Instead, an array is used to store the nodes, and the index of a given node determines its parent and children (the getParent, getLeftChild and getRightChild methods). The image below gives an example of how to store a binary heap of 7 items in an array.

Binary tree stored in array

The most important part of the heap implementation is the pair of methods maintaining the heap property. In the case of a binary heap there are two of them: bubbleUp and bubbleDown. bubbleUp compares a given element with its parent and, if they violate the heap property, exchanges them and moves up the tree. bubbleDown does the same thing for the children of the given element, moving down the tree. The first method is used to implement insert: we put the new item at the end of the array and run bubbleUp on it to move it into place. The second method is used to implement pop: we exchange the top item with the last one and run bubbleDown on the new top item to move it into place.

The last thing to figure out is how to heapify a given array efficiently. The naive approach is to create an empty heap and perform n insertions, which gives O(nlog(n)) time and O(n) space. However, it is possible to do it in O(n) time and O(1) space using the bubbleDown method: just iterate over the array in reverse order and run bubbleDown on each item.

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Heap;
import com.dancingrobot84.algorithms.util.ArraysExt;

public class BinaryHeap<T extends Comparable<? super T>> implements Heap<T> {

    public static <T extends Comparable<? super T>> Heap<T> emptyWithCapacity(int capacity) {
        return new BinaryHeap<T>(capacity);
    }

    public static <T extends Comparable<? super T>> Heap<T> fromArray(T[] array) {
        return new BinaryHeap<T>(array);
    }

    @Override
    public T peek() {
        assert size() > 0 : "Heap is empty";
        return array[0];
    }

    @Override
    public T pop() {
        T min = peek();
        ArraysExt.swap(array, 0, lastPos);
        lastPos -= 1;
        bubbleDown(0);
        return min;
    }

    @Override
    public int size() {
        return lastPos + 1;
    }

    @Override
    public void insert(T item) {
        assert size() < capacity() : "Heap is full";
        lastPos += 1;
        array[lastPos] = item;
        bubbleUp(lastPos);
    }

    public int capacity() {
        return array.length;
    }

    private final T[] array;
    private int lastPos;

    @SuppressWarnings("unchecked")
    private BinaryHeap(int capacity) {
        array = (T[])new Comparable[Math.max(capacity, 0)];
        lastPos = -1;
    }

    private BinaryHeap(T[] originalArray) {
        array = originalArray;
        lastPos = array.length - 1;
        for (int i = lastPos; i >= 0; i--) {
            bubbleDown(i);
        }
    }

    private void bubbleDown(int pos) {
        if (pos < 0 || pos > lastPos) {
            return;
        }

        int minPos = pos;

        int leftPos = getLeftChild(pos);
        if (leftPos <= lastPos && array[minPos].compareTo(array[leftPos]) > 0) { // compareTo may return any positive value, not just 1
            minPos = leftPos;
        }

        int rightPos = getRightChild(pos);
        if (rightPos <= lastPos && array[minPos].compareTo(array[rightPos]) > 0) {
            minPos = rightPos;
        }

        if (minPos != pos) {
            ArraysExt.swap(array, pos, minPos);
            bubbleDown(minPos);
        }
    }

    private void bubbleUp(int pos) {
        int parentPos = getParent(pos);
        if (pos < 0 || pos > lastPos || parentPos < 0 || parentPos > lastPos) {
            return;
        }

        if (array[parentPos].compareTo(array[pos]) > 0) {
            ArraysExt.swap(array, parentPos, pos);
            bubbleUp(parentPos);
        }
    }

    private int getLeftChild(int parentPos) {
        return 2 * parentPos + 1;
    }

    private int getRightChild(int parentPos) {
        return 2 * parentPos + 2;
    }

    private int getParent(int childPos) {
        childPos = childPos % 2 == 0 ? childPos - 2 : childPos - 1;
        return childPos / 2;
    }
}

peek and size have O(1) time complexity. Because both bubbleUp and bubbleDown take O(log(n)) time, the insert and pop methods have the same complexity. As for the BinaryHeap(T[]) constructor: though it is just n runs of bubbleDown, which should seemingly give O(nlog(n)) time, it actually works faster, in O(n), because most of the heapification takes place in the lower levels of the tree. A complete proof of its time complexity can be found on Wikipedia.

Priority Queues

A priority queue is an abstract data structure similar to a stack or a queue, except that the order in which items are retrieved depends on their priorities. Usually the item with the highest priority is served first.

Priority queues are used in many popular algorithms: Dijkstra’s, Prim’s, Huffman encoding, etc. There are implementations of priority queues in C++ and Java: the std::priority_queue and PriorityQueue classes respectively.

Implementation of priority queue using binary min heap

No additional comments, as the implementation is quite straightforward.

package com.dancingrobot84.algorithms.datastructures;

import com.dancingrobot84.algorithms.datastructures.impl.BinaryHeap;

public class PriorityQueue<T extends Comparable<? super T>> {
    private final Heap<T> heap;

    public PriorityQueue(int capacity) {
        heap = BinaryHeap.<T>emptyWithCapacity(capacity);
    }

    public void insert(T item) {
        heap.insert(item);
    }

    public T pop() {
        return heap.pop();
    }
}

Associative arrays

An associative array, or map, is an abstract data structure. It stores key-value pairs in such a way that each possible key appears at most once. The following interface is used to describe an associative array:

package com.dancingrobot84.algorithms.datastructures;

public interface Map<K extends Comparable<? super K>, V> {
    /** Return value associated with a key or null otherwise **/
    public V get(K key);

    /** Insert value into the map under specified key or rewrite if such key already exists **/
    public void put(K key, V value);
}

Associative arrays are used in a great variety of computing tasks: removing duplicates, counting items, etc. There are even special databases called “key-value storages” which keep key-value pairs either in memory or on disk (Redis, Memcached, etc).

Hash Tables

A hash table is one of the data structures implementing an associative array. It uses an array to store the key-value pairs and a hash function to compute the index of a particular pair from its key. As an implementation of a map it usually performs better than a binary search tree (both put and get work in O(1) amortized time compared to O(log(n)) in a BST), but this depends heavily on choosing the right hash function, load factor and collision resolution method.

Hash tables are used to implement maps, sets, database indexes and caches. They are available in almost any modern programming language. Some languages provide them as syntactic construct (Python, Ruby), others have specific classes in standard libraries (std::unordered_map in C++ and HashMap in Java).

Implementation of hash table

Idea: a hash table is implemented on top of an array of entries. Depending on the collision resolution method, an entry is either a single key-value pair or a chain of them. There are several methods of collision resolution; the most popular ones are open addressing and chaining.

Open addressing implements the following strategy: when a collision occurs during insertion, try to find a different place in the array for the key by sequentially probing other slots until a free one is found. When searching for a value, slots are traversed in exactly the same way. There are several ways of choosing which slots to probe; the simplest one, called “linear probing”, just looks at the slot following the current one, as the sketch below illustrates.
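A minimal sketch of linear probing (a hypothetical class, not the implementation below; resizing is omitted, so it assumes the table never fills up completely):

public class LinearProbingMap<K, V> {
    private final Object[] keys = new Object[16];
    private final Object[] values = new Object[16];

    public void put(K key, V value) {
        int i = (key.hashCode() & 0x7fffffff) % keys.length;
        // Probe forward until we find the key or an empty slot.
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = (key.hashCode() & 0x7fffffff) % keys.length;
        // An empty slot means the key was never inserted.
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                return (V) values[i];
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }
}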

Chaining solves the collision problem differently. It stores buckets of pairs in the underlying array; when a collision occurs, the new pair is appended to the associated bucket. When searching for a value, the bucket is scanned. Usually buckets are implemented as linked lists of pairs. The implementation below uses this method.

Another important part of a hash table implementation is choosing the right hash function. The main requirement for a hash function is a uniform distribution of its values. The uniformity is crucial; otherwise the number of collisions rises, ruining the performance of the hash table. For a particular implementation, though, this condition can be loosened to uniformity at specific table sizes. The implementation below relies on the hashCode method of Java’s Object for simplicity and performs no additional hashing. This may hurt performance, as programmers writing classes are not obliged to guarantee the uniformity of hashCode.

As the number of items in a hash table grows, the chance of collision rises too; that’s why from time to time it is necessary to resize the underlying array and re-insert all key-value pairs. To decide when to perform this action, the load factor is used: the number of key-value pairs divided by the size of the underlying array. In the implementation below, a load factor of 0.75 triggers resizing; the number was chosen manually, though in the real world you may want to perform some experiments to choose it more precisely.

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Map;

public class HashMap<K extends Comparable<? super K>, V> implements Map<K, V> {

    public static final int DEFAULT_CAPACITY = 2;

    public static <K extends Comparable<? super K>, V> Map<K, V> empty() {
        return new HashMap<K, V>(DEFAULT_CAPACITY);
    }

    public static <K extends Comparable<? super K>, V> Map<K, V> emptyWithCapacity(int capacity) {
        return new HashMap<K, V>(capacity);
    }

    @Override
    public V get(K key) {
        checkKey(key);
        Entry<K, V> entry = entries[findIndex(key)];
        for (; entry != null; entry = entry.next) {
            if (key.compareTo(entry.key) == 0) {
                return entry.value;
            }
        }
        return null;
    }

    @Override
    public void put(K key, V value) {
        checkKey(key);
        int index = findIndex(key);
        Entry<K, V> entry = entries[index];
        for (; entry != null; entry = entry.next) {
            if (key.compareTo(entry.key) == 0) {
                entry.value = value;
                return;
            }
            if (entry.next == null) {
                break;
            }
        }

        Entry<K, V> newEntry = new Entry<K, V>(key, value);
        if (entry == null) {
            entries[index] = newEntry;
        } else {
            entry.next = newEntry;
        }
        size++;

        if (isFull()) {
            resize(entries.length * 2);
        }
    }

    private int size;
    private Entry<K, V>[] entries;

    @SuppressWarnings("unchecked")
    private HashMap(int capacity) {
        size = 0;
        entries = (Entry<K, V>[])new Entry[Math.max(capacity, DEFAULT_CAPACITY)];
    }

    private static class Entry<K, V> {
        public K key;
        public V value;
        public Entry<K, V> next;

        public Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private void checkKey(K key) {
        assert key != null : "Nullary keys are forbidden";
    }

    private int findIndex(K key) {
        // Mask out the sign bit: hashCode() may be negative.
        return (key.hashCode() & 0x7fffffff) % entries.length;
    }

    private boolean isFull() {
        return size >= 0.75 * entries.length;
    }

    @SuppressWarnings("unchecked")
    private void resize(int newCapacity) {
        Entry<K, V>[] oldEntries = entries;
        entries = (Entry<K, V>[])new Entry[newCapacity];
        size = 0;

        for (int i = 0; i < oldEntries.length; i++) {
            for (Entry<K, V> entry = oldEntries[i]; entry != null; entry = entry.next) {
                put(entry.key, entry.value);
            }
        }
    }
}

  1. Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms

Algorithms 101: Basic Data Structures Pt. 1

Aug 12, 2015 in Algorithms

This article covers some basic data structures: arrays, linked lists, stacks and queues; their applications and implementations in Java. Source code of the implementations used in this article is available on Github.

Abstract and concrete data structures

An abstract data structure is a mathematical model, an interface declaring desired behaviour. A concrete data structure is an implementation of such an interface. For example, a collection, or container, is an abstract data structure: it only says that items of a particular type are stored inside it, and declares methods for accessing and updating them. Arrays and lists, on the other hand, are concrete data structures. They describe the way items are stored (a contiguous chunk of memory versus a chain of pointers) and implement the methods declared in the collection interface.

Collections

A collection (container) is an abstract data structure. It holds items of some specified type T and offers methods for accessing and updating them. The following interface will be used throughout this article to describe containers:

package com.dancingrobot84.algorithms.datastructures;

public interface Collection<T> {
    /** Insert item at the end of collection **/
    void append(T item);

    /** Insert item at the beginning of collection **/
    void prepend(T item);

    /** Insert item at specified place of collection **/
    void insert(int index, T item);

    /** Retrieve item at specified place **/
    T get(int index);

    /** Return the size of collection **/
    int size();

    /** Remove item at specified place from collection **/
    void remove(int index);
}

Arrays

An array is a container that holds a number of items in a contiguous area of memory. In its simplest form, the static array, the capacity of the container is fixed at the moment of creation and may not be changed. Dynamic arrays, however, can grow automatically when the current capacity is exceeded.

Arrays are used both on their own and as an underlying data structure for more sophisticated structures, e.g. heaps, associative arrays and adjacency matrices.

Arrays are supported out of the box in the overwhelming majority of modern programming languages. Static arrays are usually present as a syntactic feature. Here is an example of creating a static array of 10 ints in Java:

int[] arrayOfInts = new int[10];

Dynamic arrays are often shipped as utility classes of the standard library (std::vector in C++ and ArrayList in Java), as in the example below.
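For instance, Java’s ArrayList grows on demand while keeping O(1) random access (a small demo class):

import java.util.ArrayList;
import java.util.List;

public class ArrayListDemo {
    public static void main(String[] args) {
        List<Integer> xs = new ArrayList<>(); // grows automatically
        for (int i = 0; i < 100; i++) {
            xs.add(i); // amortized O(1) append
        }
        System.out.println(xs.get(50)); // O(1) random access
    }
}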

Dynamic array implementation

Idea: start with a small static array. If there is no free space left to insert a new item, allocate a new static array with extended capacity and copy all items from the old array into the new one. It is proven that allocating a new array twice as large as the old one gives O(1) amortized complexity for inserting a new item at the end of the array: over n appends, the copies cost at most 1 + 2 + 4 + ... + n/2 < n operations in total.

Here is a simple implementation of dynamic array in Java:

package com.dancingrobot84.algorithms.datastructures.impl;

import java.util.Arrays;
import com.dancingrobot84.algorithms.datastructures.Collection;

public class DynamicArray<T> implements Collection<T> {

    public static <T> Collection<T> empty() {
        return new DynamicArray<T>();
    }

    @Override
    public void append(T item) {
        ensureCapacity(size + 1);
        items[size] = item;
        size += 1;
    }

    @Override
    public void prepend(T item) {
        ensureCapacity(size + 1);
        System.arraycopy(items, 0, items, 1, size);
        items[0] = item;
        size += 1;
    }

    @Override
    public void insert(int index, T item) {
        assert index >= 0 : "Index is out of range: " + Integer.toString(index);
        size = Math.max(index + 1, size + 1);
        ensureCapacity(size);
        System.arraycopy(items, index, items, index+1, size-index-1);
        items[index] = item;
    }

    @Override
    public void remove(int index) {
        checkRange(index);
        System.arraycopy(items, index+1, items, index, size-index-1);
        size -= 1;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T get(int index) {
        checkRange(index);
        return (T)items[index];
    }

    public void set(int index, T item) {
        checkRange(index);
        items[index] = item;
    }

    @Override
    public int size() {
        return size;
    }

    private int size = 0;
    private Object[] items = new Object[1];

    private DynamicArray() {
        // Disable subclassing
    }

    private void ensureCapacity(int capacity) {
        if (items.length < capacity) {
            int enlargeTo = Math.max(2 * items.length, capacity);
            items = Arrays.copyOf(items, enlargeTo);
        }
    }

    private void checkRange(int index) {
        assert index >= 0 && index < size : "Index is out of range: " + Integer.toString(index);
    }
}

A more advanced implementation can also shrink the array when the number of items drops below some point, say, half of the capacity. However, this should be implemented with caution, because behaviour that is too eager can drastically change the complexity. Imagine an array which shrinks itself in half as soon as the number of items drops below half of its capacity: when there are n/2 items and you remove one, then insert one, and repeat these actions m times, the overall complexity will be O(mn). It is better to shrink the array at a lower threshold, a third or a fourth of the capacity for example.

Linked lists

A linked list is also a container. It stores its items in nodes. Each node consists of at least two fields: a value to store and a pointer to another node. Nodes are chained one to another and thus can be located in arbitrary places in memory (contrary to arrays). Linked lists can be divided into several groups according to the number of pointers in their nodes: singly-linked, doubly-linked and multiply-linked.

Linked lists can be used either on their own or as an underlying data structure for such ADTs as adjacency lists, stacks and queues, and for collision resolution in hash tables.

Linked lists are present in many programming languages, for example, std::list in C++ and LinkedList in Java. Usually they are implemented as doubly-linked lists.

Doubly-linked list implementation

Idea: the key idea of this implementation is to use head and tail sentinel nodes. They are constant for a particular linked list object; they point to each other when the list is empty, and to the first/last node otherwise. The head sentinel’s prev is always null, just as the tail sentinel’s next. The main purpose of the sentinel nodes is to save us from playing around with null checks: this way we can be sure the list always has head and tail nodes, though they are merely decorative.

Here is a simple implementation of doubly-linked list in Java:

package com.dancingrobot84.algorithms.datastructures.impl;

import java.util.Arrays;
import com.dancingrobot84.algorithms.datastructures.Collection;

public class DoublyLinkedList<T> implements Collection<T> {

    public static <T> Collection<T> empty() {
        return new DoublyLinkedList<T>();
    }

    @Override
    public void append(T item) {
        insertBefore(tail, new Node(item));
        size++;
    }

    @Override
    public void prepend(T item) {
        insertAfter(head, new Node(item));
        size++;
    }

    @Override
    public void insert(int index, T item) {
        Node target = searchForInsertion(index);
        insertBefore(target, new Node(item));
        size++;
    }

    @Override
    public void remove(int index) {
        Node target = search(index);
        target.prev.next = target.next;
        target.next.prev = target.prev;
        size--;
    }

    @Override
    public T get(int index) {
        return search(index).value;
    }

    @Override
    public int size() {
        return size;
    }

    private class Node {
        T value;
        Node prev;
        Node next;

        public Node(T value) {
            this.value = value;
        }
    }

    private final Node head = new Node(null);
    private final Node tail = new Node(null);
    private int size = 0;

    private DoublyLinkedList() {
        head.next = tail;
        tail.prev = head;
    }

    private void insertAfter(Node target, Node node) {
        node.next = target.next;
        node.prev = target;
        if (target.next != null) {
            target.next.prev = node;
        }
        target.next = node;
    }

    private void insertBefore(Node target, Node node) {
        node.next = target;
        node.prev = target.prev;
        if (target.prev != null) {
            target.prev.next = node;
        }
        target.prev = node;
    }

    private Node search(int index) {
        Node target = searchForInsertion(index);
        assert target != head && target != tail : "Item at index " + Integer.toString(index) + " is not found";
        return target;
    }

    private Node searchForInsertion(int index) {
        assert index >= 0 && index <= size : "Index is out of range: " + Integer.toString(index);
        int mid = size / 2;
        return index > mid ? searchBackward(size - index) : searchForward(index);
    }

    private Node searchForward(int index) {
        Node target = head.next;
        int position = 0;
        while (target != tail && position < index) {
            target = target.next;
            position++;
        }
        return target;
    }

    private Node searchBackward(int index) {
        Node target = tail;
        int position = 0;
        while (target != head && position < index) {
            target = target.prev;
            position++;
        }
        return target;
    }
}

Complexity comparison

Here is a complexity comparison of the dynamic array and the doubly-linked list for all operations of the Collection interface. As you can see, the linked list wins on prepending, while the dynamic array wins on random access.

Operation            Dynamic array   Doubly-linked list
append(item)         O(1) amortized  O(1)
prepend(item)        O(n)            O(1)
insert(index, item)  O(n)            O(n)
remove(index)        O(n)            O(n)
get(index)           O(1)            O(n)
size()               O(1)            O(1)
Additional space     O(n)            O(n)

Stacks and queues

Stacks and queues are abstract data structures that support limited access to their elements: they allow only appending and retrieval, in a particular order. A stack is a LIFO container (“last in, first out”) and a queue is a FIFO container (“first in, first out”). The interfaces of stack and queue are presented below:

package com.dancingrobot84.algorithms.datastructures;

public interface Stack<T> {
    /** Insert new item at the end of stack **/
    void push(T item);

    /** Retrieve and remove item from the end of stack **/
    T pop();

    /** Return number of elements in stack **/
    int size();
}

package com.dancingrobot84.algorithms.datastructures;

public interface Queue<T> {
    /** Insert new item at the end of queue **/
    void enqueue(T item);

    /** Retrieve and remove item from the beginning of queue **/
    T dequeue();

    /** Return number of elements in queue **/
    int size();
}

Both stacks and queues are widely applicable in modern software engineering. Examples of stack usage are:

  • Function calls are stored in stack while program is executed
  • Many parsing and expression evaluation algorithms use stacks to hold values or currently processed lexemes
  • The JVM is a stack-based virtual machine: instead of registers, it keeps its operands on a stack

Examples of queue usage are:

  • Concurrent applications; for example, concurrent access to some resource means that there is a queue of requests and they are processed in order
  • Buffering and data transfer

Usually stacks and queues are implemented as adapters over other containers, for example linked lists or dynamic arrays. In Java, Queue is an interface implemented by ArrayDeque, LinkedList and other classes. Stack is a concrete class, but it is obsolete: it is recommended to use the Deque interface and its implementations instead, since it has the methods of both stacks and queues, as the example below shows.
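A quick example of ArrayDeque used both ways (the demo class name is made up):

import java.util.ArrayDeque;
import java.util.Deque;

public class DequeDemo {
    public static void main(String[] args) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);                           // stack usage: LIFO
        stack.push(2);
        System.out.println(stack.pop());         // prints 2

        Deque<Integer> queue = new ArrayDeque<>();
        queue.addLast(1);                        // queue usage: FIFO
        queue.addLast(2);
        System.out.println(queue.pollFirst());   // prints 1
    }
}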

Stack implementation

Explanations are skipped, as the implementation is quite straightforward.

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Collection;
import com.dancingrobot84.algorithms.datastructures.Stack;

public class ArrayStack<T> implements Stack<T> {

    public static <T> Stack<T> empty() {
        return new ArrayStack<T>();
    }

    @Override
    public void push(T item) {
        items.append(item);
    }

    @Override
    public T pop() {
        assert items.size() > 0 : "Stack is empty";
        int index = items.size() - 1;
        T item = items.get(index);
        items.remove(index);
        return item;
    }

    @Override
    public int size() {
        return items.size();
    }

    private Collection<T> items = DynamicArray.empty();

    private ArrayStack() {
        // Disable subclassing
    }
}

An implementation of a stack using a doubly-linked list would look exactly the same. The complexity of both operations is O(1) amortized for the array version and O(1) for the list version.

Queue implementation using dynamic array

Idea: this is the standard two-pointer implementation of a queue on top of a dynamic array.

Because a queue retrieves elements from its beginning, it would be inefficient to shift the whole array back each time dequeue is called. Instead, let’s keep a pointer to the real beginning of the queue and move it forward on dequeue. On enqueue, however, let’s insert the new item not necessarily at the end of the array but, if possible, into the free space between the array’s beginning and the queue’s beginning. To achieve this, we keep another pointer storing the real end of the queue. When these two pointers meet, the array is full, and we relocate the whole queue so that its beginning and end coincide with the underlying array’s.

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Queue;

public class ArrayQueue<T> implements Queue<T> {

    public static <T> Queue<T> empty() {
        return new ArrayQueue<T>();
    }

    @Override
    public void enqueue(T item) {
        if (items.size() == queueEnd) {
            if (queueStart == 0) {
                items.append(item);
            } else {
                // Wrap around: reuse the free space before queueStart
                queueEnd = 0;
                items.set(0, item);
            }
            queueEnd++;
        } else {
            items.set(queueEnd, item);
            queueEnd++;
        }
        // The queue is full when the end pointer catches up with the start
        // pointer; relocate so the queue occupies the whole array again.
        if (queueEnd == queueStart) {
            relocate();
        }
    }

    @Override
    public T dequeue() {
        assert queueEnd != queueStart : "Queue is empty";

        T item;
        item = items.get(queueStart);
        queueStart++;

        if (queueStart == queueEnd) {
            queueStart = 0;
            queueEnd = 0;
        } else if (queueStart == items.size()) {
            queueStart = 0;
        }

        return item;
    }

    @Override
    public int size() {
        if (queueEnd == queueStart) {
            return 0;
        } else if (queueEnd > queueStart) {
            return queueEnd - queueStart;
        } else {
            return items.size() - (queueStart - queueEnd);
        }
    }

    @SuppressWarnings("unchecked")
    private DynamicArray<T> items = (DynamicArray<T>) DynamicArray.empty();
    private int queueStart = 0;
    private int queueEnd = 0;

    private ArrayQueue() {
        // Disable subclassing
    }

    private void relocate() {
        @SuppressWarnings("unchecked")
        DynamicArray<T> newItems = (DynamicArray<T>) DynamicArray.empty();

        queueEnd--;
        for (int i = queueStart; i != queueEnd; i = (i+1) % items.size()) {
            newItems.append(items.get(i));
        }
        newItems.append(items.get(queueEnd));

        items = newItems;
        queueStart = 0;
        queueEnd = items.size();
    }
}

Complexity of this solution is O(1) amortized for both operations.

Queue implementation using doubly-linked list

This implementation is much simpler, because a doubly-linked list can remove an item from both the beginning and the end in O(1).

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Collection;
import com.dancingrobot84.algorithms.datastructures.Queue;

public class ListQueue<T> implements Queue<T> {

    public static <T> Queue<T> empty() {
        return new ListQueue<T>();
    }

    @Override
    public void enqueue(T item) {
        items.append(item);
    }

    @Override
    public T dequeue() {
        assert items.size() > 0 : "Queue is empty";
        T item = items.get(0);
        items.remove(0);
        return item;
    }

    @Override
    public int size() {
        return items.size();
    }

    private Collection<T> items = DoublyLinkedList.empty();

    private ListQueue() {
        // Disable subclassing
    }
}

Complexity of this solution is O(1) for both operations.


  1. Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms

  2. Skiena. The Algorithm Design Manual

Let's write DBMS in Haskell: project management with Cabal

Mar 8, 2015 in Haskell, Programming

Last term has passed, and I’ve got a lot of time to spend on writing my master’s thesis and doing other stuff. I’ve decided that improving my FP and Haskell skills is the thing I’d like to do, and the best way to do this is to write a couple of small but rather serious projects. We had some interesting ones at university but, to be honest, my architecture decisions were not always good; by the end of a term the projects became surrounded with all kinds of workarounds, and it was too late to rewrite them from scratch. Now I have no hard deadlines and requirements, so why not write these projects properly?

The projects I’m talking about are a database management system (like SQLite) and a virtual machine with JIT compilation and some simple optimizations (like HotSpot). The former I wrote more than a year ago, the latter a couple of months ago. I’ll start with the first one because it was written in Haskell, so it will be easier to fix my previous mistakes. Another reason is that I’m starting to forget things about DBMSs, and it would be great to freshen them up.

I will store my code on Github and keep a link to the state of the project at the time of writing at the top of each post. Current specifications and requirements are here; in short, it will be a simple database supporting three fixed-size types (ints, doubles and varchars), b-tree indexes, cross joins and a ‘vacuum’ operation. I will cover the internal architecture in later posts; now I’m going to talk about project management with Cabal.

Cabal is the recommended build tool for Haskell. Its name stands for “Common Architecture for Building Applications and Libraries”. It is split in two parts: the cabal-install package, which contains the cabal executable, and the Cabal library, which contains some advanced things for complex builds. Be sure to have the former installed in order to proceed with the instructions below.

Project configuration

Creating a new Cabal project is simple: create a new directory, cd into it, run

$ cabal init

and answer some questions about your project. A couple of files will be generated: <your-project>.cabal, a declarative definition of your project’s build, and Setup.hs, a build script. Here are the possible contents of the generated .cabal file:

-- Initial my-project.cabal generated by cabal init.  For further
-- documentation, see http://haskell.org/cabal/users-guide/

name:                my-project
version:             0.1.0.0
synopsis:            My precious project
-- description:
license:             MIT
license-file:        LICENSE
author:              Nikolay Obedin
maintainer:          dancingrobot84@gmail.com
-- copyright:
-- category:
build-type:          Simple
-- extra-source-files:
cabal-version:       >=1.10

executable my-project
  main-is:             Main.hs
  -- other-modules:
  -- other-extensions:
  build-depends:       base >=4.7 && <4.8
  -- hs-source-dirs:

-- ... other sections

The syntax is pretty straightforward: indentation is used to distinguish entries and their contents, and “--” comments things out. Global properties are at the top of the file, and they are pretty simple too: the name of your project, version, short description, license, category on Hackage, etc. One interesting field is build-type, which tells Cabal what type of build is going to be used: Simple is for builds configured via the .cabal file, Custom is for more complex builds driven by Setup.hs. I’ll stick with the Simple build type from here on.
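With the Simple build type, the generated Setup.hs is just a stub that delegates everything to Cabal:

import Distribution.Simple
main = defaultMain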

A set of sections follows the global properties. Each section belongs to one of four types: library, executable, test suite or benchmark. I’m going to take a look at the first three of them.

Library section and common settings

library
  other-modules:    Foo.Bar
  exposed-modules:  Foo.Baz
  hs-source-dirs:   src
  default-extensions:
        CPP, ForeignFunctionInterface
  build-tools:      alex, happy
  build-depends:
        base
      , array

A library is a collection of Haskell modules that can be reused by other libraries and applications. A project may contain at most one library, and its name matches the name of the project. Each module of a library should be listed either in the other-modules entry or in the exposed-modules entry, based on its visibility to the end user: the latter are visible and the former aren’t. If you chose “library” in cabal init, then all your existing modules have been automatically added to exposed-modules.

The rest of the settings in this example are common to all types of sections. hs-source-dirs is a comma-separated list of directories where Cabal will search for source files. default-extensions is a list of language extensions enabled by default for all source files.

build-tools is a list of programs used to preprocess specific files before compiling them. Note that these executables, for some reason, ARE NOT installed automatically; you should do it manually. In this example alex processes files with the .x extension and happy processes files with the .y extension, generating a lexer and a parser respectively.

build-depends is a list of packages your project depends on. Each package may be given version constraints (see the example below), but be careful with them, as inadequate constraints may leave Cabal unable to install your dependencies. Usually I use Stackage to constrain dependencies for me, while leaving them unconstrained in build-depends. However, this approach is useful only if you’re developing an internal library or application; if you’re going to publish it on Hackage, then you should set sane constraints on your dependencies, but, I guess, by that time you will know these things better than me :)
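Constraints are written next to the package names; the bounds here are purely illustrative:

  build-depends:
        base >=4.7 && <4.8
      , containers ==0.5.*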

Executable and Test-Suite sections

executable my-program
  hs-source-dirs:   src
  build-depends:    base
  main-is:          Main.hs

An executable is a standalone program. Its name is not required to match the package’s name, and a project may have many executables. The only thing an executable requires is an entry point, the main :: IO () function. The source file containing this function is specified in the main-is entry.
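A minimal Main.hs for such an executable could be just:

module Main where

main :: IO ()
main = putStrLn "Hello from my-project!"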

test-suite tests
  type:             exitcode-stdio-1.0
  main-is:          Main.hs
  hs-source-dirs:   test
  build-depends:
        base
      , hspec

A project may also contain many test suite sections, and each of them should use one of the supported testing interfaces. The interface used by a particular test-suite is defined in its type field. Cabal supports two testing interfaces out of the box: exitcode-stdio and detailed. I prefer the first one because it is simpler: it just compiles and runs your test application, checking its exit code; if it is non-zero, the tests have failed. The only required field for exitcode-stdio is main-is, which means exactly the same thing as in the executable section: it points to the source file of your test program.
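A trivial test program for the exitcode-stdio interface could be as small as the sketch below (a real suite would use hspec from the build-depends above):

import System.Exit (exitFailure, exitSuccess)

main :: IO ()
main = if 2 + 2 == (4 :: Int)
          then exitSuccess
          else exitFailure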

Installing dependencies, building and running

Now that the project is configured, it is time to build it. But first you need to install the dependencies stated in the build-depends fields. I strongly recommend using Cabal sandboxes to avoid possible conflicts. To create a new sandbox, run:

$ cabal sandbox init

and then install dependencies:

$ cabal install --dependencies-only

If you’re using build-tools install them manually:

$ cabal install alex happy <any-other-build-tool>

Now you can build your project:

$ cabal configure && cabal build

Then you can test and run it:

$ cabal test && cabal run

If you want to try something in ghci you can start it with all dependencies of your project available by running:

$ cabal repl

These are the commands you’re going to use rather frequently; the rest of available commands and flags can be found by running cabal --help.

In the end

Haskell is a great language, and Cabal is a sane build tool if it is used properly. Use sandboxes and careful version constraints, do not install many packages in the global scope (remember that Cabal is not a package manager), and you’ll be fine.