Design Pattern for Data Synchronisation
Ian Tree 10 November 2008 13:05:19
Data Synchronisation
I have, over the years, come across many implementations of data synchronisation, both within Domino/Notes and between Domino/Notes and other platforms. Many of these implementations are:
- Flawed by design
- Inefficient
- Not scalable
- Error prone
- Impossible to maintain
- All of the above
I will present below a generic design pattern for data synchronisation that eliminates all of the above problems and provides a simple, stable basis for applications that need this type of functionality.
The Problem
I have 1-n data sources that contain records that need to be synchronised across the data sources, based on a key (the synchronisation key) that is shared across the data sources. I assume that I can access all of the data sources in the order defined by the synchronisation key. There are two variants of the data synchronisation problem: the "Master/Slave" situation, where one of the data sources is designated as the master and all other data sources must represent the state of that data source; and the "Peer" situation, where additions, deletions and updates can be made to any one of the data sources and must be propagated to all of the others. The design pattern presented here is suitable for either situation.
A Typical (and trivial) Solution
The simplest form of the problem has two data sources: one master and one slave. A typical example would be a Domino Directory that requires (partial) synchronisation with some external directory. A common implementation would use the following logic.
1) Make a sequential pass of the Master Directory and look up (by key) each entry in the Slave Directory; if the key is not found, add the record from the Master Directory to the Slave Directory.
2) Make a sequential pass of the Slave Directory and look up (by key) each entry in the Master Directory; if the key is not found, remove the entry from the Slave Directory.
The logic for the first pass can be extended to compare timestamps and determine if content updates need to be applied to the Slave Directory entries.
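As an illustration only, a minimal Java sketch of this two-pass approach might look like the following. The Directory and DirectoryEntry types are hypothetical stand-ins for whatever API the real directories expose (Domino or otherwise), not part of any actual library.

// Hypothetical abstractions over the two directories; a real implementation
// would wrap the Domino or external directory APIs behind something similar.
interface DirectoryEntry {
    String getKey();
}

interface Directory {
    Iterable<DirectoryEntry> entries();     // sequential pass over all entries
    DirectoryEntry lookup(String key);      // returns null if the key is absent
    void add(DirectoryEntry entry);
    void remove(DirectoryEntry entry);
}

class TwoPassSynchroniser {

    // Pass 1: any master entry missing from the slave is added to the slave.
    static void addMissing(Directory master, Directory slave) {
        for (DirectoryEntry e : master.entries()) {
            if (slave.lookup(e.getKey()) == null) {
                slave.add(e);
            }
        }
    }

    // Pass 2: any slave entry no longer present in the master is removed.
    static void removeOrphans(Directory master, Directory slave) {
        for (DirectoryEntry e : slave.entries()) {
            if (master.lookup(e.getKey()) == null) {
                slave.remove(e);
            }
        }
    }
}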
There is nothing fundamentally wrong with this solution; a little safety logic needs to be added to detect duplicate records in either directory, but the pattern will work. It starts to unwind when there are multiple directories with large numbers of entries: the added complexity and processing requirements quickly mount up until the solution becomes unworkable.
The Conceptual Solution
The design pattern needed to simplify this picture is based on taking a different view of the data being processed. Imagine merging all of the data from all of the data sources into a single data source, purely for the purpose of the synchronisation processing. It then becomes possible to see a relatively simple core logic that processes all incoming records in key sequence and, on detecting a key change, looks at the count and source of the records that share a particular synchronisation key to determine what action should be taken. Consider, for example, three data sources containing the following keys.
Data Source #1 | Data Source #2 | Data Source #3
Key #1         | Key #1         | Key #2
Key #2         | Key #2         | Key #3
Key #4         | Key #3         | Key #4
Key #5         | Key #5         | Key #5
We can create an abstraction of the problem data as a single ordered sequential stream of records.
DataSource#1-Key#1
DataSource#2-Key#1
DataSource#1-Key#2
DataSource#2-Key#2
DataSource#3-Key#2
DataSource#2-Key#3
DataSource#3-Key#3
DataSource#1-Key#4
DataSource#3-Key#4
DataSource#1-Key#5
DataSource#2-Key#5
DataSource#3-Key#5
Processing the data in this way creates a very simple logic pattern to implement and has the great performance and scalability advantage that the complete synchronisation is performed with only a single sequential pass of each of the data sources.
The Design Pattern
The first and most important component in the design pattern is the "multi-source iterator". This component must be capable of iterating over 1 to n sequential (but ordered) data sources and, on each iteration, returning the record with the lowest key value present on any one of the streams. The data object returned by the iterator may need to be more complex depending on the implementation; the calling routine will require access to the following entities associated with a returned record (document), as sketched after the list below.
- The data record (document)
- An identifier to indicate which data source the current record belongs to
- A handle to the data source (e.g. database) that will allow the creation of new documents
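A minimal sketch of such a returned object might look like this. The name SyncRecord and the field types are purely illustrative; the document and source-handle types would be whatever the implementation actually uses (for example a Notes document and its parent database).

// Illustrative record object returned by the multi-source iterator.
class SyncRecord {
    final Object document;      // the data record (document) itself
    final int    sourceId;      // which data source (0 .. n-1) the record came from
    final Object sourceHandle;  // handle to the data source, e.g. a database, for creating new documents
    final String key;           // the synchronisation key extracted from the record

    SyncRecord(Object document, int sourceId, Object sourceHandle, String key) {
        this.document = document;
        this.sourceId = sourceId;
        this.sourceHandle = sourceHandle;
        this.key = key;
    }
}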
The iterator needs to support a minimum of two methods. The first tests whether there is a next record available in the multi-source stream (equivalent to hasNext() in the Java Iterator interface); a return value of false indicates end-of-file on all data sources. The second, next(), returns the next data record object in sequence. A Java implementation can use the classical Iterator interface.
The logic of the iterator is fairly simple. It must keep a buffer holding one record and its key for each data source, together with an end-of-file indicator for each. Each call to hasNext() should check whether all data sources are at end-of-file (EOF); if so it returns false, otherwise true. Each call to next() searches the key buffers of all data sources that are not at EOF, compares the keys, returns the data record with the lowest key, and reads a new record into the buffer that has just been emptied (i.e. from the same data source from which the returned record came). It is important to note that the insertion or deletion of keys in any of the data sources must NOT alter the reading position of the readers within the iterator; if the implementation does not allow this, then additional logic must be added to the iterator to re-position the individual reader whenever an addition or deletion is performed on that data source.
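One possible sketch of that logic, assuming the hypothetical SyncRecord object above and a hypothetical per-source SourceReader abstraction (which in a Domino implementation might wrap a key-ordered view):

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical per-source reader: returns records in ascending key order,
// or null once the data source is exhausted.
interface SourceReader {
    SyncRecord read();
}

// Merges 1..n key-ordered source readers into a single stream of SyncRecord
// objects in ascending key order.
class MultiSourceIterator implements Iterator<SyncRecord> {

    private final List<SourceReader> readers;
    private final SyncRecord[] buffer;   // one look-ahead record per source; null means EOF

    MultiSourceIterator(List<SourceReader> readers) {
        this.readers = readers;
        this.buffer = new SyncRecord[readers.size()];
        for (int i = 0; i < readers.size(); i++) {
            buffer[i] = readers.get(i).read();      // prime each buffer
        }
    }

    // false only when every data source has reached end-of-file
    @Override
    public boolean hasNext() {
        for (SyncRecord r : buffer) {
            if (r != null) {
                return true;
            }
        }
        return false;
    }

    // Return the buffered record with the lowest key, then refill that buffer
    // from the same source, so the reading position of every other source is
    // left untouched.
    @Override
    public SyncRecord next() {
        int lowest = -1;
        for (int i = 0; i < buffer.length; i++) {
            if (buffer[i] == null) {
                continue;                            // this source is at EOF
            }
            if (lowest < 0 || buffer[i].key.compareTo(buffer[lowest].key) < 0) {
                lowest = i;
            }
        }
        if (lowest < 0) {
            throw new NoSuchElementException();
        }
        SyncRecord result = buffer[lowest];
        buffer[lowest] = readers.get(lowest).read(); // refill from the same source
        return result;
    }
}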
The main processing (synchronisation) logic can now be built around a simple iteration loop. The routine reads the data records sequentially from the iterator. On each iteration it must first test the key of the returned data record to determine whether it has changed. If the key has not changed, the data record is buffered ready for the next key-change event; note that buffers must be bound to an individual data source. While buffering the data record it is possible to detect and deal with duplicate key conditions on an individual data source, according to the business logic. When a key-change (or EOF) event is detected, the main synchronisation logic comes into play: comparing the keys and records across the buffers for all data sources gives a complete definition of the synchronisation state for the current key value, and the appropriate actions can be taken, according to the business rules, to make all data sources fully synchronised for that key value. After the current key has been fully synchronised, the buffers are emptied and buffer filling starts again with the record that triggered the key change.
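A skeleton of that loop, again assuming the illustrative SyncRecord and MultiSourceIterator sketches above, might be:

import java.util.Arrays;
import java.util.Iterator;

// Skeleton of the main synchronisation loop, built on the sketches above.
// reconcile() and handleDuplicate() are where the business rules
// (master/slave or peer) would live.
class Synchroniser {

    void run(Iterator<SyncRecord> stream, int sourceCount) {
        SyncRecord[] buffered = new SyncRecord[sourceCount];   // one buffer per data source
        String currentKey = null;

        while (stream.hasNext()) {
            SyncRecord rec = stream.next();

            // Key change: synchronise the key that has just completed, then reset.
            if (currentKey != null && !currentKey.equals(rec.key)) {
                reconcile(currentKey, buffered);
                Arrays.fill(buffered, null);
            }
            currentKey = rec.key;

            if (buffered[rec.sourceId] != null) {
                handleDuplicate(rec);                // duplicate key within a single data source
            } else {
                buffered[rec.sourceId] = rec;        // buffer is bound to the record's own source
            }
        }
        if (currentKey != null) {
            reconcile(currentKey, buffered);         // flush the final key at end-of-file
        }
    }

    // Compare the buffers across all data sources for this key and apply whatever
    // additions, deletions or updates the business rules require.
    void reconcile(String key, SyncRecord[] buffered) { /* business rules */ }

    // A second record with the same key arrived from the same data source.
    void handleDuplicate(SyncRecord rec) { /* business rules */ }
}

All of the business rules are concentrated in reconcile(), which sees one buffered record per data source for the key that has just completed and can therefore decide in one place which additions, deletions or updates are needed.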