How to ensure double write consistency between the cache and the database?

Interview question

What the interviewer is probing for

As soon as you use a cache, you are likely to store and write the same data in two places: the cache and the database. As soon as you double write, you will have data consistency problems. So how do you solve the consistency problem?

Analysis of interview questions

In general, if you can tolerate the cache occasionally being slightly inconsistent with the database, that is, if your system does not strictly require "cache + database" to be consistent, it is best not to use the following scheme: serializing read requests and write requests by stringing them into an in-memory queue.

Serialization guarantees that there will be no inconsistency, but it also significantly reduces system throughput, so supporting the same production traffic takes several times more machines than normal.

Cache Aside Pattern

The most classic cache + database read and write mode is the Cache Aside Pattern.

When reading, read the cache first. If the value is not in the cache, read the database, put the data into the cache, and return the response.
When updating, update the database first, then delete the cache.
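
The two rules above can be sketched as follows. This is a minimal illustration rather than a production implementation: the class name `CacheAside` is hypothetical, and plain dictionaries stand in for the real cache and the real database.

```python
class CacheAside:
    """Toy Cache Aside wrapper; dicts stand in for the cache and the database."""

    def __init__(self, db):
        self.db = db        # stand-in database: key -> value
        self.cache = {}     # stand-in cache: key -> value

    def read(self, key):
        # 1. Try the cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. On a miss, read the database and populate the cache.
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def update(self, key, value):
        # Update the database first, then delete (not update) the cache entry.
        self.db[key] = value
        self.cache.pop(key, None)


store = CacheAside({"item:1": 10})
first = store.read("item:1")    # cache miss, loads 10 from the database
store.update("item:1", 7)       # database changes, cache entry is dropped
second = store.read("item:1")   # cache miss again, reloads the new value
```

After the update, the next read misses the cache and reloads the fresh value from the database.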

Why delete the cache instead of updating it?

The reason is simple. In many cases, especially in more complex caching scenarios, the cached value is not just a value fetched directly from the database.

For example, a single field of one table may be updated, but to compute the latest value of the corresponding cache entry you also have to query two other tables and combine the results.

In addition, updating the cache is sometimes very expensive. Must every database modification also update the corresponding cache entry? In some scenarios, yes, but not where the cached data is expensive to compute. If the tables behind a cache entry are modified frequently, the cache is also updated frequently. The question is: will this cache entry actually be read frequently?

For example, if a field of a table behind a cache entry is modified 20 or 100 times in 1 minute, the cache is updated 20 or 100 times, yet the entry may be read only once in that minute. A lot of the data is cold. If you just delete the cache instead, the entry is recomputed only once within that minute, and the overhead drops sharply. The cache is computed only when it is actually used.
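
The difference can be counted with a small simulation. This is only a sketch: the function `count_recomputations` and the event encoding ("w" for a write to an underlying table, "r" for a read of the cache entry) are assumptions made for illustration.

```python
def count_recomputations(events, eager):
    """Count how often the expensive cache value is recomputed.

    events: sequence of "w" (write to an underlying table) and "r" (cache read).
    eager:  True  -> recompute the cache on every write (update strategy)
            False -> drop the cache on write, recompute on the next read
                     (delete strategy)
    """
    recomputations = 0
    cached = False
    for event in events:
        if event == "w":
            if eager:
                recomputations += 1   # recompute immediately on every write
                cached = True
            else:
                cached = False        # just invalidate
        elif event == "r" and not cached:
            recomputations += 1       # recompute lazily on the first read
            cached = True
    return recomputations


# 100 writes followed by a single read within one minute.
one_minute = ["w"] * 100 + ["r"]
eager_cost = count_recomputations(one_minute, eager=True)
lazy_cost = count_recomputations(one_minute, eager=False)
```

With 100 writes and one read, the update strategy recomputes the value 100 times and the delete strategy recomputes it once.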

Deleting the cache instead of updating it is a form of lazy computation. Don't redo a complex calculation every time regardless of whether the result will be used; let it be recomputed when it is actually needed. MyBatis and Hibernate use the same idea with lazy loading. When you query a department that has a list of employees, there is no need to load the data of its 1000 employees on every query. In 80% of cases, you query the department only to access the department's own information. The department is queried first, and only when you actually access its employees does the framework go to the database to query the 1000 employees.

The most basic cache inconsistency problem and solution

Problem: you update the database first, then delete the cache. If deleting the cache fails, the database holds the new data while the cache holds the old data, and the two are inconsistent.

Solution: delete the cache first, then update the database. If the database update fails, the database still holds the old data and the cache is empty, so there is no inconsistency. The next read finds nothing in the cache, reads the old data from the database, and writes it back into the cache.
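
The failure case described in the solution can be walked through with a small sketch. The class `DeleteFirstStore` and the `db_fails` flag are hypothetical; the flag only simulates a failed database update.

```python
class DeleteFirstStore:
    """Toy store that deletes the cache before updating the database."""

    def __init__(self):
        self.db = {"stock": 10}
        self.cache = {"stock": 10}

    def read(self, key):
        # Cache miss: reload from the database and repopulate the cache.
        if key not in self.cache:
            self.cache[key] = self.db[key]
        return self.cache[key]

    def write(self, key, value, db_fails=False):
        self.cache.pop(key, None)          # step 1: delete the cache entry
        if db_fails:                       # simulated database failure
            raise RuntimeError("simulated database failure")
        self.db[key] = value               # step 2: update the database


s = DeleteFirstStore()
try:
    s.write("stock", 9, db_fails=True)     # cache deleted, database unchanged
except RuntimeError:
    pass
after_failure = s.read("stock")            # reloads the old value 10
```

Even though the write failed halfway, the cache and the database agree again after the next read.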

More complex data inconsistency analysis

The data changes: the cache is deleted first, and then the database is to be modified, but the modification has not happened yet. A read request arrives, finds the cache empty, queries the database, gets the old data from before the modification, and puts it into the cache. The data change then completes its database modification. Now the data in the database and the data in the cache are different.
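
The interleaving above can be replayed step by step. This is a single-threaded sketch with the steps of the writer and the reader written out in the problematic order; the dictionaries and the key name `stock` are assumptions for illustration.

```python
db = {"stock": 10}
cache = {"stock": 10}

# Writer, step 1: delete the cache entry.
cache.pop("stock", None)

# Reader: cache miss, so it reads the database, which still holds the old value,
stale = db["stock"]
# and puts that old value back into the cache.
cache["stock"] = stale

# Writer, step 2: the database update finally completes.
db["stock"] = 9

inconsistent = cache["stock"] != db["stock"]
```

At the end the cache holds 10 while the database holds 9, and nothing will correct the cache until the entry is deleted or expires.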

Why does this problem show up in high-concurrency scenarios with hundreds of millions of requests?

This problem can only occur when the same piece of data is being read and written concurrently. If your concurrency is very low, especially read concurrency, say 10,000 visits per day, the inconsistency just described will occur only rarely. But with hundreds of millions of requests per day and tens of thousands of concurrent reads per second, as long as data update requests arrive every second, the database + cache inconsistency described above can occur.

The solution is as follows:

When updating data, route the operation by the unique identifier of the data and send it to an in-memory queue inside the JVM. When reading data, if the data is not in the cache, the "re-read the data + update the cache" operation is routed by the same unique identifier and sent to the same JVM-internal queue.

Each queue has one worker thread, which takes operations off the queue and executes them serially, one by one. Consider a data change operation that deletes the cache first and then updates the database, but has not finished the update yet. If a read request arrives at this point and finds nothing in the cache, it can send a cache update request to the queue, where it waits behind the pending write; the read request then waits synchronously for the cache update to complete.

There is an optimization here. Stringing several cache update requests for the same data together in one queue is pointless, so you can filter them. If the queue already contains a cache update request for that data, there is no need to enqueue another one; simply wait for the earlier update request to complete.

After the worker thread for the queue finishes the database modification of the previous operation, it executes the next operation, the cache update, which reads the latest value from the database and writes it into the cache.

If the read request is still within its waiting time and polling finds that the value is now available, it returns the value directly. If the request has waited longer than a certain time, it reads the current (possibly old) value directly from the database.
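
The whole scheme (routing by key, one worker thread per queue, filtering duplicate cache updates, and a bounded wait on the read side) can be sketched as follows. The text describes a JVM in-memory queue; this sketch uses Python threads only to show the control flow. The class `SerializedStore`, its method names, the queue count, and the timeout values are all assumptions, and dictionaries stand in for the cache and the database.

```python
import queue
import threading
import zlib


class SerializedStore:
    """Runs writes and cache refreshes for one key on one worker queue."""

    def __init__(self, db, queue_count=4):
        self.db = db                    # stand-in database: key -> value
        self.cache = {}                 # stand-in cache: key -> value
        self.lock = threading.Lock()    # guards self.pending
        self.pending = {}               # key -> Event of a refresh already queued
        self.queues = [queue.Queue() for _ in range(queue_count)]
        for q in self.queues:
            threading.Thread(target=self._work, args=(q,), daemon=True).start()

    def _queue_for(self, key):
        # The same key always maps to the same queue, hence the same worker.
        return self.queues[zlib.crc32(key.encode("utf-8")) % len(self.queues)]

    def _work(self, q):
        while True:
            task = q.get()              # tasks run strictly one after another
            task()

    def write(self, key, value):
        def task():
            self.cache.pop(key, None)   # delete the cache first
            self.db[key] = value        # then update the database
        self._queue_for(key).put(task)

    def _refresh(self, key, event):
        value = self.db.get(key)        # runs after all earlier queued writes
        if value is not None:
            self.cache[key] = value
        with self.lock:
            self.pending.pop(key, None)
        event.set()                     # wake every reader waiting on this key

    def read(self, key, timeout=0.2):
        value = self.cache.get(key)
        if value is not None:
            return value
        with self.lock:
            event = self.pending.get(key)
            if event is None:
                # No refresh queued for this key yet: enqueue exactly one.
                event = threading.Event()
                self.pending[key] = event
                self._queue_for(key).put(lambda: self._refresh(key, event))
        # Wait for the queued refresh, but never longer than the timeout.
        if event.wait(timeout):
            value = self.cache.get(key)
            if value is not None:
                return value
        return self.db.get(key)         # timed out: read the database directly


store = SerializedStore({"item:1": 10})
store.write("item:1", 9)                    # queued: delete cache, update database
value = store.read("item:1", timeout=2.0)   # refresh is queued behind the write
```

Because the cache refresh sits behind the write in the same queue, the read returns the new value 9 rather than the stale 10 from the earlier race. A real service would also need error handling inside the worker so that a failing task does not kill the thread.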

Under high concurrency, this solution has several issues to watch for:

Read requests blocked for a long time

Because read requests are now handled slightly asynchronously, you must pay attention to read timeouts: every read request must return within its timeout.

The biggest risk of this solution is that data may be updated very frequently, leaving a large backlog of update operations in the queue. Read requests then time out in large numbers, and in the end a large number of requests go directly to the database. Be sure to run tests that simulate real traffic to see how frequently the data is actually updated.

In addition, because one queue may hold a backlog of update operations for many data items, you need to test against your own business conditions. You may need to deploy multiple service instances, each handling the update operations for a share of the data. If one memory queue has a backlog of inventory modifications for 100 items and each modification takes 10ms to complete, the read request for the last item may wait 10 * 100 = 1000ms = 1s before it gets its data. That is a long block for a read request.

Be sure to run stress tests based on how the actual business system behaves, simulating the production environment, to see how many update operations may pile up in a memory queue at the busiest time and how long the read request behind the last update operation may therefore hang. If read requests must return within 200ms, and your calculation shows that even at the busiest time the backlog is 10 update operations and the wait is at most 200ms, that is acceptable.

If a memory queue may accumulate a particularly large backlog of update operations, add machines so that the service instance on each machine handles less data, which leaves fewer backlogged updates in each memory queue.

In practice, based on previous project experience, data is generally written infrequently, so the backlog of update operations in a queue should normally be very small. For projects like this, with a read-heavy, high-concurrency, read-cache architecture, write requests are generally few; a write QPS of a few hundred is already on the high side.

Let's do a rough calculation.

Suppose there are 500 write operations per second. Split one second into 5 time slices of 200ms, so there are 100 write operations per slice. Spread over 20 memory queues, each queue may have a backlog of 5 write operations. If performance testing shows that each write operation generally completes in about 20ms, a read request for data in any queue hangs for at most about 5 * 20 = 100ms and can definitely return within 200ms.

This simple calculation shows that a single machine can support a write QPS of a few hundred without problems. If the write QPS grows 10 times, scale out to 10 times as many machines, each with 20 queues.
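
The arithmetic above can be written down as two small helper functions. This is a sketch that assumes writes are spread evenly across time slices, queues, and machines; the function names are hypothetical.

```python
import math


def backlog_per_queue(writes_per_second, slice_ms, queues_per_machine, machines=1):
    """Writes that may pile up in one queue during one time slice."""
    writes_per_slice = writes_per_second * slice_ms / 1000
    return math.ceil(writes_per_slice / (queues_per_machine * machines))


def worst_case_wait_ms(backlog, write_ms):
    """A read queued behind `backlog` writes waits for all of them."""
    return backlog * write_ms


# 500 writes/s, 200ms slices, 20 queues on one machine, 20ms per write.
single = backlog_per_queue(500, 200, 20)
single_wait = worst_case_wait_ms(single, 20)

# 10 times the write QPS on 10 times the machines keeps the backlog the same.
scaled = backlog_per_queue(5000, 200, 20, machines=10)

# The earlier inventory example: 100 backlogged writes at 10ms each.
hot_queue_wait = worst_case_wait_ms(100, 10)
```

Under these assumptions each queue holds 5 writes and a read waits about 100ms, while the 100-item backlog from the inventory example costs a full second.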

Read request concurrency is too high

Stress testing is required here as well. When the situation above happens to occur, there is another risk: a large number of read requests suddenly hang on the service with delays of tens of milliseconds. Check whether the service can withstand this, and how many machines are needed to handle the peak of the worst case.

However, because not all data is updated at the same moment, cache entries do not all become invalid at the same moment. Each time, only a small number of entries are invalidated, so the concurrency of the read requests for those entries should not be particularly high.

Request routing for multiple service instance deployments

The service may be deployed as multiple instances, so you must guarantee that the data update request and the cache update request for the same data are routed through the Nginx server to the same service instance.

For example, read and write requests for the same item are all routed to the same machine. You can implement hash routing between services yourself based on a request parameter, or you can use Nginx's hash routing feature.
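
Hash routing on a request parameter can be sketched as follows. The instance names and the `route` function are hypothetical. A stable hash such as CRC32 is used instead of Python's built-in `hash()`, which is randomized per process for strings and would route the same item differently on different routers.

```python
import zlib

# Hypothetical service instance names.
INSTANCES = ["service-a", "service-b", "service-c"]


def route(item_id, instances=INSTANCES):
    """Pick the instance that owns all reads and writes for this item."""
    digest = zlib.crc32(str(item_id).encode("utf-8"))
    return instances[digest % len(instances)]


read_target = route("item:42")
write_target = route("item:42")
```

Reads and writes for the same item always land on the same instance. Note that plain modulo routing remaps most items when the number of instances changes, so a real deployment may prefer consistent hashing.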

Hot-item routing skews the request load

If read and write requests for a particular item are especially high and all land on the same queue on the same machine, that machine may come under excessive pressure. That said, the cache is cleared only when the item data is updated, and only then does read/write concurrency arise, so it depends on the business system: if the update frequency is not too high, the impact of this problem is not particularly large. Some machines may indeed carry a higher load, though.
