Why

Many reasons why, one is because the master is running out of space, another would be to safely facilitate an OS upgrade. In the first instance we could resize the droplet, but that’ll take time during which the DB will not be available and therefore result in enforced downtime, in the latter case we could upgrade the OS on the running server but you know not what perils may lie in wait which could result in much downtime.

By following this process we should be able to complete the task with minimal, < 30 minutes, downtime – probably even less with good planning. By adding a new slave we can also get the benefit of rolling out a new droplet running the latest OS.

Topology

The basic topology is one master with 2 slaves directly attached, this provides redundancy for running backups if a slave fails, and provides scope for a reasonably quick fail over by promoting a slave should the master fail.

[add diagrams]

Step 1: – Re-structuring

When we promote one slave to be the new master, we want the other slave to become a slave of this new master. The slaves are all configured to write bin logs already so all we need to do is select which role each of our slaves will have, stop the salves and ensure they are exactly the same point in respect of the current master, update slave which will not be promoted to the new master as a slave of the other one. Essentially we are moving to a chain structure, which will allow us to move the application connections to down the chain to the promoted master, with it’s initial slave and then deprecate the old master from the top of the chain. Later we can create a new slave which we will also attach to the new master restoring the original architecture of one master with two directly linked slaves.

Selecting the slave to promote

If 1 slave has a higher spec, then this would be the one to select – memory and disk space being of most importance.

Otherwise if the slaves are only being used for backups it doesn’t really matter, but if we using the slaves for reads to reduce load on the master then the best bet is to pick the slave being used for this purpose. If both slaves are being used for reads then migrate all the reads to the slave to be promoted first. Traffic can be checked using

mysql> SHOW PROCESS LIST;

or on the command line

sudo tcpdump -i any port 3306

In both cases we want the slave which is not being promoted to only be displaying replication connections, the reasons for this is are :

  1. allow us to shutdown this server and take a snapshot to create a new slave from without impacting the application(s)
  2. minimise impact on the application as modify the hierarchy, we can change connections back to a slave later with minimal to no impact on the user experience.

Moving the slave

The objective here is to create a chain, where the slave to be promoted will be in the middle. This will allow us to redirect traffic from the current slave and then remove the old master from the top of the chain leaving us with a new master with at least 1 slave already in place. Adding a second during this process is also recommended. For this example lets say we are moving slave (id: 3) to become a slave if salve (id: 2).

To do this we need to set up a replication user, bring the 2 slaves to a stop at exactly the same point in time and then change the master config of slave (id: 3) to point to slave (id:2).

  1. On the slave to be promoted (id: 2), create a user for replication to use:
CREATE USER 'slave_repl'@'10.131.248.74' IDENTIFIED BY 'PASSWORD'; GRANT REPLICATION SLAVE ON . TO 'slave_repl'@'10.131.248.74'

2. On the slave to be moved (id : 3), stop the slave from running and check the Master_Log_File and Exec_Master_Log_Pos – these indicate at what point transactions replicated from the master have been executed on this slave. There is also Read_Master_Log_Pos, however this indicates the point at which the server has read transactions on to the local relay log – these have not yet been committed on this slave!

STOP SLAVE;
SHOW SLAVE STATUS\G
*********************** 1. row ***********************
 Master_Log_File: mysql-bin.000002
 Exec_Master_Log_Pos: 93425058

3. On the slave to be promoted (id: 2), also stop the slave and check the log position, the important thing here is that it must be ahead of the instance which will become a slave of this one (id :3)

STOP SLAVE;
SHOW SLAVE STATUS;
*********************** 1. row ************************
Master_Log_File: mysql-bin.000002 Exec_Master_Log_Pos: 93427899

4. restart the slave to be moved (id: 3), but have it stop at the same position as the one to be promoted, i.e. Master_Log_File = mysql-bin.000002, Master_Log_Pos = 93427899

START SLAVE UNTIL 
MASTER_LOG_FILE = <Master_Log_File>, 
MASTER_LOG_POS = <Exec_Master_Log_Pos>;
SHOW SLAVE STATUS;

5. Prepare the new slave to be promoted (id: 2) by purging the binary logs (requires pre-planning of having the slave write bin logs and have the log-slave-updates setting in the mysql config file, otherwise this server will need to be configured and restarted with the settings.

RESET MASTER; // if server current has NO replicas 
PURGE_BINARY_LOGS; // if server already has replicas; 
SHOW MASTER STATUS;
+------------------+----------+ 
| File             | Position | 
+------------------+----------+ 
| mysql-bin.000001 |      154 |
+------------------+----------+ 

6. Now we have both slaves at the same point in time we can switch slave (id: 3) to use slave (id: 2) as it’s master creating the new structure. On the slave being moved (id: 3)

STOP SLAVE;  -- stop both threads SQL & IO
RESET SLAVE; -- clean up relay logs read from the old master 
               -- (even if we roll back from here these can be pulled again from the master)*

-- Copy slave status somewhere, so you know the state of the server if you need 
-- to work back
SHOW SLAVE STATUS\G 

-- change for details for the server to use as master
CHANGE MASTER TO
        MASTER_HOST='<IP OF NEW MASTER (slave id:2)',
        MASTER_USER='slave_repl',
        MASTER_PASSWORD='PASSWORD',
        MASTER_PORT=3306,
        MASTER_LOG_FILE='<BIN FILE NAME>',   -- i.e. mysql-bin.000001
        MASTER_LOG_POS=<MASTER POSITION,     -- i.e. 154
        MASTER_CONNECT_RETRY=10;
        
START SLAVE;
SHOW SLAVE STATUS;      

-- check LAST_IO_ERROR, if empty this process should be almost complete


7. Test – make an update we can check on the slave to be promoted (id: 2)- remember as new master is still also a slave itself we need to take care not to make a change which could affect replication from the current actual master at the top of the chain. You could create a new table for this and drop it after, or in the case of the savewatersavemoney databases you can update the heartbeat table.

-- Slave to be promoted (id: 2)
USE heartbeat;
  
-- as we stopped this slave it should not be being updated
SELECT * FROM heartbeat; 
SELECT * FROM heartbeat; 

UPDATE heartbeat SET ts = NOW();
SELECT * FROM heartbeat;
  
-- Slave moved (id: 3), timestamp should match (maybe to test again to be sure)
SELECT * FROM heartbeat.heartbeat; 

8. Restart slave on the slave to be promoted (id: 2) and now all updates on the master should trickle down through the salve to be promoted (id: 2) and on to the migrated slave (id: 3). The topology will now be as follows:

Step 2 – Promoting the new master, slave (id: 2)

This is less about DB work and more about about service management, we need to :

1. Identify all services reading and writing to the master (id: 1), the tcpdump tool can help here

sudo tcp dump -i any -p 3306

2. Disable applications, this may require shutting down services or load balancers – to ensure we are not at risk of data loss ALL interactions should be stopped

3. Update necessary configurations to direct all master (read/write and write) traffic to the new master, slave (id:2)

4. check traffic to the master again, if all is quite with the exception of the slave pulling the binary logs we are good to continue.

5. Stop the slave thread on the slave to be promoted

6. Start re-enabling services, 1 at a time and verify traffic is going to the new master. N.B. if you have multiple application servers behind a load balancer you can remove one from the load balancer and update the configuration, and then switch the traffic between the servers creating a almost seamless switchover to the new master.

7. Continue to bring back up the applications and services until everything is running again.

8. Test, Test, Test – new servers can have different default configurations, check SQL_MODE, which may only affect certain type of query – so try to check a varied selection of use cases, most importantly any thing client facing.

9. Once everything is moved over check for activity on the original master (id: 1) and if you’re happy the it’s time to shutdown the mysql service and watch out for any reported errors. All good we can now power down the old master, leaving with our promoted master and it’s slave.

! DO NOT DESTROY the old master yet, if you really want to save the bucks be sure to snapshot it first…. just incase we missed some STORED PROCEDURES or FUNCTIONS or …. something we later regret being so rash on the clean up for.

3. Create a new second

If you have time I’d suggest doing this as part of the preparation so by the end of the previous step we already have our triage of master with 2 direct slaves, but as long as the slave is not yet responsible for serving reads we can still do this – if it is then we need to redirect that traffic first.

1 – shutdown slave

sudo service mysql stop
sudo shutdown -h now 

2. snapshot the droplet, if DigitalOcean otherwise its plan B, build a new server and restore from backup of the the slave (id :3) which may be preferable in any case.

3. Create a new droplet from the sanpshot

4. Logon and shutdown server, delete auto.cnf from the data path to ensure server gets a fresh UUID and update the server_id in the mysql configuration file

5. Create a slave user of the new server

6. Start the slave – all being well it’l start ticking a way nicely, if not double check the other slave (id: 3) as if the UUID was not updated it’ll have stopped as well due to the conflict – you can just restart that slave once the new one has been stop again.

7. If you find you get a lot of update/write errors chances are the slave stopped at a later point in time – you can check the error log of the salve teh snapshot was taken from to find out at what position it restarted – set the master log details of the new to thses and try starting again

4 . Done

At this point we should be back to square one, and as the new slave is not yet being used by any services it the perfect candidate for upgrading to the latest OS – unless you created a fresh one and restored from backup in which case hopefully everything is running the latest stable versions.