Wishbone: [...]
I'd love to go into more detail with the technical discussion, but unfortunately I don't have time right now.
Aemenyn: So.. what you're saying is that we should throw more hardware at you to get the explanation? Right?
Hehe, sure. You got a spare GTX 1080 lying around? :-D
No, I was at work at the time, so I couldn't really delve into long technical explanations.
The thing is, the way the problem is described, it sounds like it could very well be a deadlock issue. Let me try to describe the scenario I see in my mind. Keep in mind that this is just one possible explanation for the problem, and that I'm basically pulling it out of my ass (figuratively speaking) as I go along, so don't treat it as gospel.
A customer puts a game in his basket, goes to the checkout, enters his payment information and clicks "Go" (or whatever the button is called). His browser sends an order request to a server, which forwards the payment information to a payment processor. Once the payment clears, the actual order processing begins.
Because of the sale giveaway, an extra step is triggered at some point during the order processing: check whether this is the customer's first purchase during the sale, and if it is, generate a game key and send an email containing it. So as not to make the customer wait any longer than necessary before the server can report back to his browser that the order is complete, this extra step is spawned in an asynchronous thread of its own, so the normal order processing doesn't have to wait for it to finish before continuing.
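To make that a bit more concrete, here's a rough Python sketch of what such an order handler might look like. I'm making up every function name here; I obviously have no idea what their actual code looks like.

```python
import threading

# A rough sketch of the flow described above. Every name here is invented
# for illustration; the real system's code is unknown.

def charge_payment(order):            # stand-in for the payment processor call
    pass

def process_order(order):             # stand-in for the normal order processing
    pass

def is_first_purchase_this_sale(customer):
    return True                       # stand-in for the actual check

def generate_game_key():
    return "XXXX-XXXX-XXXX"           # stand-in for real key generation

def send_key_email(customer, key):
    print(f"emailing {key} to {customer}")

def giveaway_step(order, customer):
    # Runs in its own thread; the order handler never waits for it.
    if is_first_purchase_this_sale(customer):
        send_key_email(customer, generate_game_key())

def handle_order(order, customer):
    charge_payment(order)
    process_order(order)
    # Fire-and-forget: the browser gets its "order completed" reply
    # without waiting for the giveaway step to finish.
    threading.Thread(target=giveaway_step, args=(order, customer)).start()
    return "order completed"

handle_order({"items": ["some game"]}, "customer@example.com")
```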
As part of the giveaway process, some data will eventually need to be saved to a database, most likely by way of a stored procedure call. Let's say that the stored procedure writes rows to 4 separate tables. In order to secure data integrity, the stored procedure will ask the database to lock the tables it needs to write to before it executes, so no other process can change the data at the same time, as that can lead to "really bad things" happening. Once the call is done, the tables are unlocked again.
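Here's a rough sketch of what that lock-write-unlock pattern could look like from the application side, assuming PostgreSQL and psycopg2 purely for the sake of example; the connection string, table names and columns are all invented.

```python
import psycopg2

# Sketch of the lock-write-unlock pattern: lock the tables, write the rows,
# release the locks when the transaction ends. All names are made up.

def save_giveaway(customer_id, game_key):
    conn = psycopg2.connect("dbname=shop")
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            with conn.cursor() as cur:
                # Lock every table we intend to write to; the locks are held
                # until the transaction commits or rolls back.
                cur.execute("LOCK TABLE giveaways IN EXCLUSIVE MODE")
                cur.execute("LOCK TABLE game_keys IN EXCLUSIVE MODE")
                cur.execute("LOCK TABLE emails IN EXCLUSIVE MODE")
                cur.execute("LOCK TABLE audit_log IN EXCLUSIVE MODE")
                cur.execute("INSERT INTO giveaways (customer_id) VALUES (%s)", (customer_id,))
                cur.execute("INSERT INTO game_keys (customer_id, game_key) VALUES (%s, %s)", (customer_id, game_key))
                cur.execute("INSERT INTO emails (customer_id, body) VALUES (%s, %s)", (customer_id, game_key))
                cur.execute("INSERT INTO audit_log (customer_id, action) VALUES (%s, 'giveaway')", (customer_id,))
    finally:
        conn.close()
```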
However, if two calls come in simultaneously (or nearly so), what sometimes happens is that one call locks two of the tables while the other call locks the other two. Each process is now waiting for the two remaining tables it needs, but those are exactly the tables the other process is holding, so neither can do anything: all 4 tables stay locked, and no data is written to the database. This is what is known as a deadlock. Any other calls to the same procedure that come in at this time will also halt and wait for the tables they need to become available.
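If you want to see the mechanics of that in miniature, here's a tiny Python script where two locks stand in for the tables and two threads grab them in opposite order:

```python
import threading
import time

# The same deadlock in miniature: two plain Python locks stand in for the
# tables, and two threads grab them in opposite order.

table_ab = threading.Lock()   # think "tables A and B"
table_cd = threading.Lock()   # think "tables C and D"

def call_one():
    with table_ab:
        time.sleep(0.1)            # the other call sneaks in right here
        with table_cd:             # blocks forever: call_two already holds it
            print("call one wrote its data")

def call_two():
    with table_cd:
        time.sleep(0.1)
        with table_ab:             # blocks forever: call_one already holds it
            print("call two wrote its data")

t1 = threading.Thread(target=call_one, daemon=True)
t2 = threading.Thread(target=call_two, daemon=True)
t1.start(); t2.start()
t1.join(timeout=2); t2.join(timeout=2)
# Neither print is ever reached. Unlike a real database, Python won't detect
# the deadlock and kill one of the threads; the script just gives up and exits.
print("still deadlocked:", t1.is_alive() and t2.is_alive())
```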
At this point, one of two things can happen: either enough time passes that one of the procedure calls times out and is cancelled, or the database detects the deadlock and kills one of the processes in order to resolve it. Either way, one of the procedure calls fails, and the data it was supposed to put into the database never makes it in. The asynchronous thread that made the failed call notices the failure, but since the order process that spawned it has long since completed, it has nowhere to report the failure, so the user is never notified.
Meanwhile, in the database, the other procedure calls that were waiting for the deadlock to be resolved are notified that the tables they need are now available, and promptly try to secure them for themselves so they can proceed. This immediately creates another deadlock, and so the vicious cycle continues for as long as the orders come in faster than the database can process them, resulting in lots of lost data.
So what can be done about this?
Well, many different things, but scaling up the hardware isn't going to make much of a difference, if any. Paralleling the hardware isn't really an option either. Load balancing works fine for many things, but not really for databases, unless you can separate your data into segments that don't rely on each other in any way, and don't rely on any shared data either.
Database performance can be optimized in many ways without fiddling with the hardware at all. The techniques involved range from "what every DB admin should know" to "occult midnight rituals requiring years of preparation and personal sacrifice", and experts in the field are few and far between, not to mention very highly paid. However, in scenarios such as this, performance optimization can only alleviate the problem slightly, by making each call finish in a shorter time, so two calls have to arrive a bit closer together to create a deadlock.
Proper error handling on the calling side can help. If a call fails, make sure that it is logged somewhere (and make sure that this somewhere is being monitored), and that the data is saved, so it can be resent or reconstructed later. An automatic retry mechanism can also be helpful, but beware:
Thou shalt not make infinite retries of failed calls! A call can fail for many reasons, not all of which are temporary. For instance, if a call contains invalid data (e.g. a wrongly formatted date), no amount of retries will make it succeed. All that results in is the poor database being hammered by more and more doomed calls over time, each endlessly trying to accomplish what it never can. Retries should always be finite, for instance 3 retries 5 minutes apart, and if the call still fails, log the error and the data.
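Something along these lines, say. The numbers are just the example ones from above, and `call` is a placeholder for whatever function actually talks to the database:

```python
import logging
import time

# Sketch of the finite-retry idea: one initial try, a fixed number of retries
# with a delay between them, and a loud log entry (error plus data) if it
# still fails, so nothing vanishes silently.

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("giveaway")

def save_with_retries(call, data, retries=3, delay=5 * 60):
    attempts = 1 + retries                 # one initial try plus the retries
    for attempt in range(1, attempts + 1):
        try:
            return call(data)
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay)          # e.g. 5 minutes between tries
    # Out of retries: log the error *and* the data, so the record can be
    # resent or reconstructed by hand later.
    log.error("giving up; unsaved giveaway data: %r", data)
```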
A much better way is, instead of calling the database directly, to send the call to a queueing mechanism, which forwards the calls to the database one at a time. Sure, it may mean that the calls from the initial 5-minute activity spike take 20 minutes to process, but once the spike has flattened out, calls will be processed almost instantaneously, and no data will be lost.
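A rough sketch of that idea in Python, with a placeholder standing in for the real database call:

```python
import queue
import threading

# Order handlers drop their calls onto a queue and return immediately; a
# single worker feeds them to the database one at a time, so the calls can
# never overlap and never deadlock each other.

db_queue = queue.Queue()

def save_giveaway_to_db(data):
    print("saved", data)               # stand-in for the real stored procedure call

def enqueue_giveaway(data):
    db_queue.put(data)                 # near-instant, even during a spike

def db_worker():
    while True:
        data = db_queue.get()
        try:
            save_giveaway_to_db(data)
        except Exception as exc:
            print("call failed, keep the data somewhere safe:", data, exc)
        finally:
            db_queue.task_done()

threading.Thread(target=db_worker, daemon=True).start()

# Usage: the order-processing threads just enqueue and move on.
enqueue_giveaway({"customer": "someone@example.com", "game": "some game"})
db_queue.join()                        # wait for the worker to drain the queue (demo only)
```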
Of course, best of all is to design your processes in such a way that deadlocks simply can't happen. This isn't always feasible or easy, but with just a little forward thinking in your solution design, a lot of deadlocks can certainly be prevented. First and foremost, try to design your stored procedures so that they need to lock as few resources as possible in order to do what they have to do, especially procedures that are likely to be called many times within a very short timeframe.
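One concrete trick for this, which I didn't spell out above but which is pretty standard practice, is to make every caller grab its locks in the same fixed order, so a circular wait can never form. Here's a little Python sketch of that idea, reusing the two stand-in locks from the earlier deadlock example:

```python
import threading
import time

# Deadlock prevention by consistent lock ordering: everyone acquires locks
# in the same global order, so nobody ever holds a "later" lock while
# waiting for an "earlier" one.

table_ab = threading.Lock()
table_cd = threading.Lock()
LOCK_ORDER = [table_ab, table_cd]      # the one order everybody uses

def locked_call(name, needed):
    # Sort whatever locks this call needs into the global order before taking any.
    ordered = [lock for lock in LOCK_ORDER if lock in needed]
    for lock in ordered:
        lock.acquire()
    try:
        time.sleep(0.1)
        print(name, "wrote its data")
    finally:
        for lock in reversed(ordered):
            lock.release()

# Even though the two calls ask for the locks in opposite order, both end up
# taking them in LOCK_ORDER, so the deadlock from the earlier sketch can't happen.
t1 = threading.Thread(target=locked_call, args=("call one", [table_ab, table_cd]))
t2 = threading.Thread(target=locked_call, args=("call two", [table_cd, table_ab]))
t1.start(); t2.start(); t1.join(); t2.join()
```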
Phew, that took a while to write.