Essence Java Framework
This web site is being migrated to the Essence Wiki

How can components be tested in a cluster for failover?

This is a question I find ends up in the too-hard basket for most projects. However, it doesn't need to be. An article discussing clustering with Terracotta, Using Terracotta DSO, inspired me to produce a similar example which I believe shows how a cluster can function with two nodes, with alternating nodes failing, and how quickly the cluster recovers from disk.

The full code and configuration are available in the SourceForge SVN WordSearch sample.

The first thing to note is how different the code is. In both cases the design and the framework are non-invasive, and yet they still yield very different coding styles. Both frameworks use convention instead of a specific API, resulting in a different structure. It is not clear from the article how multiple nodes in a cluster are tested, but I would assume it is done very differently, as Terracotta uses a custom invocation of the JVM.

Source for the WordSearch component.

The acquireList() method is simpler than in the Terracotta article. This is because Terracotta requires synchronized blocks to ensure thread safety and to keep the nodes in the cluster in sync. Essence's Maps are implemented as Java 5 style ConcurrentMaps and do not require explicit synchronization. getList() is added to simplify testing: it determines whether a prefix is already in the WordSearch cache via cluster replication. clear() is added to clear the cache between tests. In the configuration the cache is durable, so for some tests to be meaningful the cache has to be cleared explicitly.

@ThreadSafe
public class WordSearch {
    // cache controlled externally.
    private final Map<String, List<String>> prefixCache;
    private final String wordsFile;
    // caseless form to cased form.
    private final SortedMap<String, String> words = new TreeMap<String, String>();

    public WordSearch(Map<String, List<String>> prefixCache, String wordsFile) throws IOException {
        this.prefixCache = prefixCache;
        this.wordsFile = wordsFile;
        BufferedReader br = new BufferedReader(new InputStreamReader(IOUtils.getInputStream(wordsFile)));
        try {
            String line;
            while ((line = br.readLine()) != null)
                words.put(line.toLowerCase(), line);
        } finally {
            br.close(); // release the file handle once the words are loaded
        }
    }

    public List<String> acquireList(String prefix) {
        List<String> list = prefixCache.get(prefix);
        if (list == null)
            prefixCache.put(prefix, list = buildList(prefix));
        return list;
    }

    public List<String> getList(String prefix) {
        return prefixCache.get(prefix);
    }

    private List<String> buildList(String prefix) {
        List<String> ret = new ArrayList<String>();
        for (Map.Entry<String, String> word : words.tailMap(prefix).entrySet()) {
            if (!word.getKey().startsWith(prefix)) break;
            ret.add(word.getValue());
        }
        return ret;
    }

    public void clear() {
        prefixCache.clear();
    }
}
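
acquireList() above does a get followed by a put, so two threads asking for the same uncached prefix may each build the list; both results are equal, so this is harmless. If a single cached instance matters, the standard ConcurrentMap.putIfAbsent() can be used. The following is a minimal sketch, assuming java.util.concurrent.ConcurrentHashMap as a stand-in for the Essence-managed map; the class name PrefixCacheSketch and the sample words are made up for illustration and are not part of the WordSearch sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for WordSearch; not part of the Essence sample.
final class PrefixCacheSketch {
    // ConcurrentHashMap stands in for the Essence-managed cache (an assumption).
    private final ConcurrentMap<String, List<String>> prefixCache =
            new ConcurrentHashMap<String, List<String>>();
    // caseless form to cased form, as in WordSearch.
    private final SortedMap<String, String> words = new TreeMap<String, String>();

    PrefixCacheSketch(String... wordList) {
        for (String word : wordList)
            words.put(word.toLowerCase(), word);
    }

    List<String> acquireList(String prefix) {
        List<String> list = prefixCache.get(prefix);
        if (list == null) {
            List<String> built = buildList(prefix);
            // putIfAbsent returns the value already present, or null if ours was stored.
            List<String> previous = prefixCache.putIfAbsent(prefix, built);
            list = previous != null ? previous : built;
        }
        return list;
    }

    List<String> getList(String prefix) {
        return prefixCache.get(prefix);
    }

    private List<String> buildList(String prefix) {
        List<String> ret = new ArrayList<String>();
        for (Map.Entry<String, String> word : words.tailMap(prefix).entrySet()) {
            if (!word.getKey().startsWith(prefix)) break;
            ret.add(word.getValue());
        }
        return ret;
    }

    public static void main(String[] args) {
        PrefixCacheSketch search = new PrefixCacheSketch("Apple", "apricot", "Banana");
        System.out.println(search.getList("ap"));     // not cached yet
        System.out.println(search.acquireList("ap")); // built, cached and returned
    }
}
```

Whichever thread loses the race simply returns the list stored by the winner, so every caller sees the same instance without a synchronized block.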

Testing that two nodes in a cluster are in sync.

Notes:
  • The compress property, which is referenced in the configuration file, determines whether the communication between nodes is compressed.
  • The same configuration file is used to describe all the nodes in the cluster. As all nodes perform the same service, they share the same configuration file. The container name, "cA" or "cB", determines how one container differs from another. By setting properties for each environment, development, UAT, production and contingency can share the same configuration files. This also holds for multi-region applications.
  • The configuration file is available here.
  • In this test, the "word-search" component in container A is used first, and the same component in container B is checked to ensure it is in sync.
  • A dummy lookup is performed to ensure the container has completely started and to give consistent timings. Without it, the first timing varies by about 200 ms. The dummy lookup can be dropped in a real use case.
    public static void testBuildListCache() throws IOException, InstantiationException {
        System.setProperty("compress", "");
        Container cA = Main.start("cA", "config/tdso-cluster.config");
        Container cB = Main.start("cB", "config/tdso-cluster.config");
        WordSearch wordSearchA = cA.getComponent("word-search", WordSearch.class);
        WordSearch wordSearchB = cB.getComponent("word-search", WordSearch.class);
        wordSearchA.clear();
        wordSearchB.clear();

        assertEquals(0, wordSearchA.acquireList("zzz").size());

        long starttime = System.nanoTime();
        runTest(wordSearchA, wordSearchB);
        runTest(wordSearchB, wordSearchA);
        long time = System.nanoTime() - starttime;
        System.out.printf("%s: %3.0f ms\n", "All prefixes twice", time * 1.0e-6);
        assertTrue("Long time=" + time, time < 1e9);
        wordSearchA.clear();
        wordSearchB.clear();
    }
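
The runTest() helper is not listed in this article; the real one is in the SVN sample. Judging only from the output below, it looks up each one-letter prefix on the first component, prints the list size and the time taken, and checks that the second component sees the same list. The following is a hypothetical reconstruction under those assumptions: the PrefixLookup interface, the node() factory and the shared map standing in for a replicated cache are all invented for illustration, and the sketch assumes replication has completed by the time acquireList() returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reconstruction; the real runTest lives in the SVN sample.
final class RunTestSketch {
    // The two WordSearch methods the helper needs (hypothetical interface).
    interface PrefixLookup {
        List<String> acquireList(String prefix);
        List<String> getList(String prefix);
    }

    // Two "nodes" sharing one map: a stand-in for a replicated cache.
    static PrefixLookup node(final Map<String, List<String>> shared, final List<String> words) {
        return new PrefixLookup() {
            public List<String> acquireList(String prefix) {
                List<String> list = shared.get(prefix);
                if (list == null) {
                    list = new ArrayList<String>();
                    for (String w : words)
                        if (w.toLowerCase().startsWith(prefix)) list.add(w);
                    shared.put(prefix, list);
                }
                return list;
            }

            public List<String> getList(String prefix) {
                return shared.get(prefix);
            }
        };
    }

    // Looks up every one-letter prefix on 'first', then checks 'second' holds the same list.
    // Returns the total number of words found across all prefixes.
    static int runTest(PrefixLookup first, PrefixLookup second) {
        int total = 0;
        for (char c = 'a'; c <= 'z'; c++) {
            String prefix = String.valueOf(c);
            long start = System.nanoTime();
            List<String> list = first.acquireList(prefix);
            long time = System.nanoTime() - start;
            System.out.printf("%s: %d (%.1f ms)", prefix, list.size(), time * 1.0e-6);
            if (!list.equals(second.getList(prefix)))
                throw new AssertionError("Out of sync for prefix " + prefix);
            total += list.size();
        }
        System.out.println();
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<String>> shared = new ConcurrentHashMap<String, List<String>>();
        List<String> words = Arrays.asList("Apple", "apricot", "Banana");
        PrefixLookup a = node(shared, words);
        PrefixLookup b = node(shared, words);
        runTest(a, b); // fills the cache through node a
        runTest(b, a); // served from the shared cache
    }
}
```

The sketch walks the prefixes in alphabetical order, whereas the recorded output shows them in a different order on each pass. If replication were asynchronous, the sync check would need to wait or retry rather than fail immediately.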

The output of the two node cluster test

The first two messages show the containers have found each other. The next two lines show the output and timings of the tests. Clearly the test for container B is faster, as its lists were already built by container A and replicated.
26-Jan-2007 21:00:00 org.jtoolkit.essence.data.impl.ClusterImpl connectionMonitor
INFO: cB: All masters [cA] available.
26-Jan-2007 21:00:00 org.jtoolkit.essence.data.impl.ClusterImpl connectionMonitor
INFO: cA: All masters [cB] available.
s: 13259 (62.1 ms)t: 6783 (32.0 ms)l: 2469 (10.8 ms)e: 1147 (4.5 ms)r: 3185 (7.5 ms)n: 1255 (10.1 ms)g: 2587 (6.4 ms)v: 1156 (4.0 ms)x: 18 (0.8 ms)h: 2906 (7.2 ms)p: 5389 (16.2 ms)i: 1071 (4.7 ms)q: 1320 (3.9 ms)y: 411 (1.8 ms)m: 2207 (12.5 ms)u: 602 (4.3 ms)z: 159 (1.2 ms)j: 498 (1.8 ms)k: 548 (2.1 ms)d: 3212 (12.8 ms)o: 1240 (9.5 ms)a: 3070 (7.8 ms)w: 5041 (23.8 ms)f: 4276 (136.3 ms)b: 6091 (24.5 ms)c: 6275 (38.2 ms)
m: 2207 (0.0 ms)h: 2906 (0.0 ms)l: 2469 (0.0 ms)x: 18 (0.0 ms)j: 498 (0.0 ms)b: 6091 (0.0 ms)d: 3212 (0.0 ms)y: 411 (0.0 ms)k: 548 (0.0 ms)v: 1156 (0.0 ms)i: 1071 (0.0 ms)u: 602 (0.0 ms)r: 3185 (0.0 ms)a: 3070 (0.0 ms)f: 4276 (0.0 ms)n: 1255 (0.0 ms)g: 2587 (0.0 ms)z: 159 (0.0 ms)p: 5389 (0.0 ms)w: 5041 (0.0 ms)o: 1240 (0.0 ms)t: 6783 (0.0 ms)e: 1147 (0.0 ms)s: 13259 (0.0 ms)q: 1320 (0.0 ms)c: 6275 (0.0 ms)
All prefixes twice: 499 ms
Without durable caching this test takes about 350 ms.

Testing failover.

Alternating node test

Hopefully it is clear what this test is doing, as it is just an extension of the first test.
    public static void testReplication() throws IOException, InstantiationException {
        System.setProperty("compress", "");
        // fill the cache using Container A
        Container cA = Main.start("cA", "config/tdso-cluster.config");
        System.out.println("==== Container A started, Container B is down.");
        WordSearch wordSearchA = cA.getComponent("word-search", WordSearch.class);
        wordSearchA.clear();
        runTest(wordSearchA, wordSearchA);
        {
            // start container B and let it get a copy of the cache.
            Container cB = Main.start("cB", "config/tdso-cluster.config");
            System.out.println("==== Container A & B started.");
            WordSearch wordSearchB = cB.getComponent("word-search", WordSearch.class);

            assertEquals(0, wordSearchB.acquireList("z-b2").size());

            // test the container B times are much faster than before.
            long starttime = System.nanoTime();
            runTest(wordSearchB, wordSearchA);
            long time = System.nanoTime() - starttime;
            System.out.printf("%s: %3.0f ms\n", "test container B started after A", time * 1.0e-6);
            assertTrue("Long time=" + time, time < 1e8);

            // stop A and check the test still works.
            cA.close();
            System.out.println("==== Container A stopped, Container B is running.");

            assertEquals(0, wordSearchB.acquireList("zzy").size());
            Thread.yield();

            // test the container B times are much faster than before.
            long starttime2 = System.nanoTime();
            runTest(wordSearchB, wordSearchB);
            long time2 = System.nanoTime() - starttime2;
            System.out.printf("%s: %3.0f ms\n", "test container B after A stopped", time2 * 1.0e-6);
            assertTrue("Long time=" + time2, time2 < 1e8);

            // restart B and show it is still fast.
            cB.close();
            System.out.println("==== Container A & B stopped.");
        }

        Container cB2 = Main.start("cB", "config/tdso-cluster.config");
        System.out.println("==== Container B restarted, Container A is stoppped.");
        WordSearch wordSearchB2 = cB2.getComponent("word-search", WordSearch.class);

        assertEquals(0, wordSearchB2.acquireList("zz-b2").size());
        Thread.yield();

        // test the container B times are much faster than before.
        long starttime3 = System.nanoTime();
        runTest(wordSearchB2, wordSearchB2);
        long time3 = System.nanoTime() - starttime3;
        System.out.printf("%s: %3.0f ms\n", "test container B after restart", time3 * 1.0e-6);
        assertTrue("Long time=" + time3, time3 < 1e8);

        wordSearchA.clear();
        wordSearchB2.clear();
    }
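
Both tests repeat the same pattern: a dummy lookup to warm up, a System.nanoTime() measurement, a printf, and an assertion against a budget in nanoseconds (1e8 ns is 100 ms, 1e9 ns is 1 s). A minimal sketch of how that pattern could be factored into a helper follows; TimingSketch, timeNanos() and assertFasterThan() are hypothetical names and are not part of Essence or the sample.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; names are illustrative only.
final class TimingSketch {
    // Runs the task once and returns the elapsed time in nanoseconds.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    // Times the task, prints it in the same style as the tests above,
    // and fails if it did not finish within the budget.
    static long assertFasterThan(String label, long budgetMillis, Runnable task) {
        long nanos = timeNanos(task);
        System.out.printf("%s: %3.0f ms%n", label, nanos * 1.0e-6);
        if (nanos >= TimeUnit.MILLISECONDS.toNanos(budgetMillis))
            throw new AssertionError(label + " took " + nanos + " ns, budget " + budgetMillis + " ms");
        return nanos;
    }

    public static void main(String[] args) {
        Runnable work = new Runnable() {
            public void run() {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 1000; i++) sb.append(i);
            }
        };
        work.run(); // warm-up, like the dummy lookup in the tests
        assertFasterThan("trivial work", 1000, work);
    }
}
```

Expressing the budget in milliseconds and converting with TimeUnit keeps the threshold readable next to the printed timing.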

The output of the failover test

==== Container A started, Container B is down.
26-Jan-2007 21:00:02 org.jtoolkit.essence.data.impl.ClusterImpl checkConnections
INFO: cA: connecting to cB as netObject was null
26-Jan-2007 21:00:03 org.jtoolkit.essence.data.impl.ClusterImpl waitUntilAllMastersTried
INFO: cA: .... Waiting for test-cluster to have allMastersTried to get listenerSet
26-Jan-2007 21:00:03 org.jtoolkit.essence.data.impl.ClusterImpl connectionMonitor
INFO: cA: tried masters [] available.
w: 5041 (1.9 ms)y: 411 (0.5 ms)l: 2469 (0.8 ms)m: 2207 (0.7 ms)i: 1071 (0.5 ms)v: 1156 (0.5 ms)z: 159 (0.2 ms)p: 5389 (2.2 ms)t: 6783 (1.9 ms)f: 4276 (1.3 ms)a: 3070 (1.0 ms)x: 18 (0.1 ms)j: 498 (0.2 ms)c: 6275 (1.9 ms)k: 548 (0.4 ms)n: 1255 (0.4 ms)b: 6091 (1.9 ms)q: 1320 (0.5 ms)r: 3185 (1.1 ms)h: 2906 (0.8 ms)s: 13259 (3.8 ms)d: 3212 (1.0 ms)o: 1240 (0.5 ms)g: 2587 (0.9 ms)e: 1147 (0.4 ms)u: 602 (0.3 ms)
==== Container A & B started.
26-Jan-2007 21:00:04 org.jtoolkit.essence.data.impl.ClusterImpl acquireMaster
INFO: cB: Perfomed bootstrap from cA updated 26 entries.
26-Jan-2007 21:00:04 org.jtoolkit.essence.data.impl.ClusterImpl connectionMonitor
INFO: cB: All masters [cA] available.
p: 5389 (0.0 ms)n: 1255 (0.0 ms)s: 13259 (0.0 ms)u: 602 (0.0 ms)q: 1320 (0.0 ms)l: 2469 (0.0 ms)t: 6783 (0.0 ms)r: 3185 (0.0 ms)f: 4276 (0.0 ms)k: 548 (0.0 ms)a: 3070 (0.0 ms)x: 18 (0.0 ms)d: 3212 (0.0 ms)j: 498 (0.0 ms)m: 2207 (0.0 ms)e: 1147 (0.2 ms)g: 2587 (0.0 ms)z: 159 (0.0 ms)y: 411 (0.0 ms)h: 2906 (0.0 ms)c: 6275 (0.0 ms)i: 1071 (0.0 ms)o: 1240 (0.0 ms)b: 6091 (0.0 ms)v: 1156 (0.0 ms)w: 5041 (0.0 ms)
test container B started after A:  19 ms
==== Container A stopped, Container B is running.
26-Jan-2007 21:00:05 org.jtoolkit.essence.data.impl.ClusterImpl$ClusterCommitCallback completeChangesToOtherServers
WARNING: test-cluster: Unable to send change to cA java.net.ConnectException: Connection refused: connect
m: 2207 (0.0 ms)d: 3212 (0.0 ms)g: 2587 (0.0 ms)u: 602 (0.0 ms)p: 5389 (0.0 ms)r: 3185 (0.0 ms)h: 2906 (0.0 ms)z: 159 (0.0 ms)c: 6275 (0.0 ms)k: 548 (0.0 ms)y: 411 (0.0 ms)i: 1071 (0.0 ms)n: 1255 (0.0 ms)w: 5041 (0.0 ms)q: 1320 (0.0 ms)b: 6091 (0.0 ms)t: 6783 (0.0 ms)j: 498 (0.0 ms)f: 4276 (0.0 ms)l: 2469 (0.0 ms)s: 13259 (0.0 ms)x: 18 (0.0 ms)o: 1240 (0.0 ms)e: 1147 (0.0 ms)v: 1156 (0.0 ms)a: 3070 (0.0 ms)
test container B after A stopped:  11 ms
==== Container A & B stopped.
==== Container B restarted, Container A is stoppped.
26-Jan-2007 21:00:07 org.jtoolkit.essence.data.impl.ClusterImpl waitUntilAllMastersTried
INFO: cB: .... Waiting for test-cluster to have allMastersTried to get test-cluster.prefix-cache
26-Jan-2007 21:00:07 org.jtoolkit.essence.data.impl.ClusterImpl connectionMonitor
INFO: cB: Waiting for [cA] to become available. 2 of 2 retries
26-Jan-2007 21:00:08 org.jtoolkit.essence.data.impl.ClusterImpl checkConnections
INFO: cB: connecting to cA as netObject was null
26-Jan-2007 21:00:09 org.jtoolkit.essence.data.impl.ClusterImpl connectionMonitor
INFO: cB: tried masters [] available.
b: 6091 (0.0 ms)u: 602 (0.0 ms)y: 411 (0.0 ms)w: 5041 (0.0 ms)k: 548 (0.0 ms)h: 2906 (0.0 ms)d: 3212 (0.0 ms)j: 498 (0.0 ms)x: 18 (0.0 ms)q: 1320 (0.0 ms)e: 1147 (0.0 ms)p: 5389 (0.0 ms)i: 1071 (0.0 ms)t: 6783 (0.0 ms)v: 1156 (0.0 ms)z: 159 (0.0 ms)m: 2207 (0.0 ms)s: 13259 (0.0 ms)l: 2469 (0.0 ms)g: 2587 (0.0 ms)o: 1240 (0.0 ms)c: 6275 (0.0 ms)a: 3070 (0.0 ms)f: 4276 (0.0 ms)n: 1255 (0.0 ms)r: 3185 (0.0 ms)
test container B after restart:  12 ms

Copyright 2006 Peter Lawrey Essence Email