I can't say I care about some delay. Or about Rebel Galaxy.

But I'd *love* to learn more about these technical difficulties.
Wishbone: [...]
I'd love to go into more detail with the technical discussion, but unfortunately I don't have time right now.
Aemenyn: So.. what you're saying is that we should throw more hardware at you to get the explanation? Right?
Hehe, sure. You got a spare 1080 GTX lying around? :-D

No, I was at work at the time, so I couldn't really delve into long technical explanations.

The thing is, the way the problem is described, it sounds like it could very well be a deadlock issue. Let me try to describe the scenario I see in my mind. Keep in mind that this is just one possible explanation for the problem, and that I'm basically pulling it out of my ass (figuratively speaking) as I go along, so don't treat it as gospel.

A customer puts a game in his basket, goes to the checkout, enters his payment information and clicks "Go" (or whatever the button is called). His browser sends an order request to a server, which forwards the payment information to a payment processor. Once the payment clears, the actual order processing begins.

Because of the sale giveaway, an extra step is triggered at some point during order processing: check whether this is the customer's first purchase during the sale and, if it is, generate a game key and send an email containing it. So as not to make the customer wait any longer than necessary before the server reports back to his browser that the order is complete, this extra step is spawned in an asynchronous thread of its own, so the normal order processing doesn't have to wait for it to finish before continuing.
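
To make that a bit more concrete, here's a minimal Python sketch of the fire-and-forget pattern I'm describing. Every name in it is made up for illustration; this is not GOG's actual code, just the general shape of the idea.

```python
import threading
import time

def charge_payment(order):
    """Stand-in for the round trip to the payment processor."""
    time.sleep(0.1)

def finalize_order(order):
    """Stand-in for the rest of the normal order processing."""
    time.sleep(0.1)

def is_first_purchase_this_sale(customer_id):
    return True  # stand-in for a real database lookup

def handle_sale_giveaway(customer_id):
    # Runs on its own thread: generate a key and mail it without holding up
    # the order response the customer is waiting for.
    if is_first_purchase_this_sale(customer_id):
        print(f"generated and mailed a free-game key to customer {customer_id}")

def process_order(order):
    charge_payment(order)
    # Fire-and-forget: normal order processing does not wait for this thread.
    threading.Thread(target=handle_sale_giveaway,
                     args=(order["customer_id"],), daemon=True).start()
    finalize_order(order)
    return "order completed"  # this is what gets reported back to the browser

print(process_order({"customer_id": 42}))
```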

As part of the giveaway process, eventually some data will need to be saved to a database, most likely by way of a stored procedure call. Let's say that the stored procedure writes rows to 4 separate tables. To protect data integrity, the stored procedure call will ask the database to lock the tables it needs to write to before it executes, so no other processes can change the data simultaneously, as that can lead to "really bad things" happening. Once the call is done, the tables are unlocked again.
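
Something like this, roughly. I'm using an in-memory SQLite database and invented table names purely to show the shape of it; the real thing would be a stored procedure on a server database, but the principle is the same: one transaction, several writes, locks held until it commits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE giveaway_claims (customer_id INTEGER, sale_id TEXT);
    CREATE TABLE game_keys       (customer_id INTEGER, game_key TEXT);
    CREATE TABLE emails_queued   (customer_id INTEGER, template TEXT);
    CREATE TABLE audit_log       (customer_id INTEGER, event TEXT);
""")

def record_giveaway(customer_id, game_key):
    # One transaction, four writes; the database holds its locks until the
    # "with" block commits (or rolls everything back on an error).
    with conn:
        conn.execute("INSERT INTO giveaway_claims VALUES (?, 'summer_sale')", (customer_id,))
        conn.execute("INSERT INTO game_keys VALUES (?, ?)", (customer_id, game_key))
        conn.execute("INSERT INTO emails_queued VALUES (?, 'free_game_key')", (customer_id,))
        conn.execute("INSERT INTO audit_log VALUES (?, 'giveaway_key_issued')", (customer_id,))

record_giveaway(42, "AAAAA-BBBBB-CCCCC")
print(conn.execute("SELECT COUNT(*) FROM audit_log").fetchone())
```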

However, if two calls come in simultaneously (or nearly so), what sometimes happens is that one call will lock two of the tables, while the other call locks the other two. The result is that each process is now waiting for the remaining two tables it needs to become available, but since the other process is waiting for the other two tables, neither process can do anything, all 4 tables stay locked, and no data is put into the database. This is what is known as a deadlock. Any other calls to the same procedure that come in at this time will also halt and start waiting for the tables they need to become available.
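
Here's the same trap reproduced in miniature, with two plain Python locks standing in for two of the tables. Each "call" grabs one lock and then waits for the one the other call is holding. A real database call would simply sit there waiting; I've added timeouts only so the demo script actually terminates.

```python
import threading
import time

table_a = threading.Lock()  # stand-ins for two of the four tables
table_b = threading.Lock()

def worker(name, first, second):
    with first:
        time.sleep(0.1)                     # give the other call time to grab its lock
        got_it = second.acquire(timeout=2)  # a real DB call would just keep waiting
        if got_it:
            second.release()
            print(f"{name}: finished normally")
        else:
            print(f"{name}: stuck -- the other call holds the table I need")

# Call 1 locks A then B; call 2 locks B then A. Neither can finish.
t1 = threading.Thread(target=worker, args=("call 1", table_a, table_b))
t2 = threading.Thread(target=worker, args=("call 2", table_b, table_a))
t1.start(); t2.start(); t1.join(); t2.join()
```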

At this point, one of two things can happen. Either enough time passes that one of the procedure calls times out and is cancelled, or the database detects that a deadlock is happening and kills one of the processes in order to resolve it. Either way, one of the procedure calls fails, and the data it was supposed to put into the database never gets there. The asynchronous thread that made the failed call notices the failure, but since the order process that spawned it has long since completed, it has nowhere to report the failure, so the user is never notified.
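
From the caller's side, a timed-out call just looks like an error coming back from the driver. Here's a small demonstration using SQLite (a server database would instead report a lock wait timeout or pick a "deadlock victim", but the effect on the caller is the same): the second call fails, and its data never reaches the table.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "giveaway.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE game_keys (customer_id INTEGER, game_key TEXT)")
# This INSERT opens a transaction that is deliberately never committed,
# so the writer keeps holding the write lock.
writer.execute("INSERT INTO game_keys VALUES (1, 'AAAAA')")

# A second call arrives while the first one still holds the lock.
latecomer = sqlite3.connect(path, timeout=0.5)
try:
    latecomer.execute("INSERT INTO game_keys VALUES (2, 'BBBBB')")
except sqlite3.OperationalError as err:
    # The call failed; unless someone logs and saves this data, it is simply gone.
    print("second call failed, its data was never written:", err)
```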

Meanwhile, in the database, the other procedure calls that were waiting for the deadlock to be resolved are notified that the tables they need are now available, and promptly try to secure them for themselves so they can proceed. This immediately creates another deadlock, and so the vicious cycle continues for as long as the orders come in faster than the database can process them, resulting in lots of lost data.

So what can be done about this?

Well, many different things, but scaling up the hardware isn't going to make much of a difference, if any. Parallelizing the hardware isn't really an option either. Load balancing works fine for many things, but not really for databases, unless you can separate your data into segments that don't rely on each other in any way, and don't rely on any shared data either.

Database performance can be optimized in many ways without fiddling with the hardware at all. The techniques involved range from "what every DB admin should know" to "occult midnight rituals requiring years of preparation and personal sacrifice", and experts in the field are few and far between, not to mention very highly paid. However, in scenarios such as this, performance optimization can only alleviate the problem slightly, by making each call finish in a shorter time, so two calls have to arrive a bit closer together to create a deadlock.

Proper error handling on the calling side can help. If a call fails, make sure that it is logged somewhere (and make sure that this somewhere is being monitored), and that the data is saved, so it can be resent or reconstructed later. An automatic retry mechanism can also be helpful, but beware: Thou shalt not make infinite retries of failed calls! A call can fail for many reasons, not all of which are temporary. For instance, if a call contains data that is invalid (e.g. a wrongly formatted date), no amount of retries will make it succeed. All it results in is the poor database being hammered by more and more doomed calls over time, all endlessly trying to accomplish what they never can. Retries should always be finite, for instance 3 retries 5 minutes apart, and if the call still fails, log the error and the data.
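
In code, the "finite retries, then log the error and the data" rule looks something like this. The retry count, the wait time and the always-failing call are all placeholders; the point is only that the loop is bounded and that a permanently failing call ends up in a log together with its payload.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("giveaway")

def call_with_retries(db_call, payload, attempts=3, wait_seconds=300):
    """Try a database call a finite number of times, then log error + data."""
    for attempt in range(1, attempts + 1):
        try:
            return db_call(payload)
        except Exception as err:  # a real system would catch the driver's specific error types
            log.warning("attempt %d/%d failed: %s", attempt, attempts, err)
            if attempt < attempts:
                time.sleep(wait_seconds)  # e.g. 5 minutes between tries
    # Out of retries: record the error *and* the payload so it can be resent later.
    log.error("giving up; saving payload for manual replay: %r", payload)
    return None

def always_fails(payload):  # stand-in for a call that will never succeed
    raise RuntimeError("invalid date format in payload")

call_with_retries(always_fails, {"customer_id": 42, "date": "31/02/2017"}, wait_seconds=0)
```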

A much better way is to not call the database directly at all, but instead send the call to a queueing mechanism, which forwards the calls to the database one at a time. Sure, it may mean that the calls from the initial 5-minute activity spike take 20 minutes to process, but once the spike has flattened out, calls will be processed almost instantaneously, and no data will be lost.
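
The queue idea in miniature: producers drop their calls onto a queue and return immediately, while a single worker drains it and talks to the database one call at a time, so two writes can never collide. Here queue.Queue stands in for a real message broker, and save_to_database is a placeholder.

```python
import queue
import threading

work = queue.Queue()

def save_to_database(payload):
    # Placeholder for the real stored-procedure call.
    print("written to database:", payload)

def worker():
    # A single consumer: the database only ever sees one of these calls at a time.
    while True:
        payload = work.get()
        save_to_database(payload)
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

# The order-processing side just drops its call on the queue and moves on.
for customer_id in (1, 2, 3):
    work.put({"customer_id": customer_id, "game_key": f"KEY-{customer_id}"})

work.join()  # demo only: wait for the backlog to drain before the script exits
```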

Of course, best of all is to design your processes in such a way that deadlocks simply can't happen. This isn't always feasible or easy, but with just a little forward thinking in your solution design, a lot of deadlocks can certainly be prevented. First and foremost, try to design your stored procedures so that they need to lock as few resources as possible in order to do what they have to do, especially procedures that are likely to be called many times within a very short timeframe.
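
One standard trick along these lines (the textbook fix for the exact scenario I described above, not something I know GOG actually does) is to always acquire locks in one agreed-upon, fixed order, and to hold them as briefly as possible. With Python locks once again standing in for tables, both calls below end up taking table A before table B, so neither can ever be left holding the one the other is waiting for.

```python
import threading

table_a = threading.Lock()  # stand-ins for two frequently written tables
table_b = threading.Lock()

LOCK_ORDER = {"table_a": (0, table_a), "table_b": (1, table_b)}

def locks_in_fixed_order(*names):
    # Whatever order the caller asks for, hand the locks back in one agreed order.
    return [LOCK_ORDER[name][1] for name in sorted(names, key=lambda n: LOCK_ORDER[n][0])]

def write_giveaway(call_name):
    locks = locks_in_fixed_order("table_b", "table_a")  # deliberately requested "backwards"
    for lock in locks:
        lock.acquire()
    try:
        print(f"{call_name}: wrote its rows while briefly holding both tables")
    finally:
        for lock in reversed(locks):
            lock.release()

threads = [threading.Thread(target=write_giveaway, args=(f"call {i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```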

Phew, that took a while to write.
Oh my god, I did not get my FREE GAME the second I should have. I had to wait *GASP* hours or even *DAYS*.

Get a grip.
SamyMerchi: Oh my god, I did not get my FREE GAME the second I should have. I had to wait *GASP* hours or even *DAYS*.

Get a grip.
You completely and utterly missed the point.
Wishbone: snip
You know, having a bit of a CS background myself, I'm quite aware that more hardware is often not the solution (I was mostly just making a stupid joke). Still, I'm not too familiar with DB stuff, so it was still an interesting read.
Thanks for making the effort to actually type it all out :)

*Throws imaginary GTX*
Wishbone: *snip*

Phew, that took a while to write.
Dude, thanks. That at the very least was informative. I appreciate you taking the time to write it.

See, as I said this is not my area of expertise, and normally in such cases I'd say "Well, the guys at {Insert Company Name Here} know this better; they're the experts. So if they haven't fixed it for the past ten sales, then it's probably not cost-effective to do it." But this is GOG. It's the people who have not fixed the proper sorting of games, despite this being a job that even I could do in a couple of hours at most. Or relied on their users manually deleting spam attacks for days. So I was wondering whether it's a genuine case of something being not very feasible or another one of GOG not fixing things because... things.
Post edited June 08, 2017 by ZFR
tinyE: If they had learned from their mistakes I seriously doubt I would still be allowed in here. :P
Nothing is stopping you from leaving, though.
adaliabooks: But I'd imagine it mostly comes down to traffic and that's why it happens with every big sale.
avatar
ZFR: But after n sales, it should be easy to predict the amount of traffic they'd get. Nothing really unexpected happened that would make the increase in traffic this sale larger than anticipated (the GWENT release might have brought extra people, but again, that's something that could have been foreseen).

So if GOG know they're going to get x (+/-) extra traffic during the sale, can they not get extra (servers/hardware/staff/whateverittakes) for those 2 sale weeks? This is a genuine question on my part since I don't know how feasible it would be to accommodate the extra temporary traffic. But given the number of frustrated customers this is producing, and the fact that it doesn't seem to happen on other sites, I think GOG should do something about it.
You do realize Steam tends to literally go down every sale?
ZFR:
Pax11: You do realize Steam tends to literally go down every sale?
Just like a prostitute! *rimshot*
Wishbone: *snip*

Phew, that took a while to write.
ZFR: Dude, thanks. That at the very least was informative. I appreciate you taking the time to write it.
You're welcome. I do enjoy technical discussions (as you well know), so it was my pleasure ;-)
ZFR: See, as I said this is not my area of expertise, and normally in such cases I'd say "Well, the guys at {Insert Company Name Here} know this better; they're the experts. So if they haven't fixed it for the past ten sales, then it's probably not cost-effective to do it." But this is GOG. It's the people who have not fixed the proper sorting of games, despite this being a job that even I could do in a couple of hours at most. Or relied on their users manually deleting spam attacks for days. So I was wondering whether it's a genuine case of something being not very feasible or another one of GOG not fixing things because... things.
Oh, it's definitely the latter. Throwing hardware at the problem won't help, but fixing the problem another way should definitely be possible. The question is just how awkward the existing codebase is. Occasionally I get the impression that the reason so many problems on GOG go unfixed for such a long time is that the core systems are a convoluted mess of interconnected spaghetti, designed by someone who had no business designing software in the first place, so that fixing even the simplest of things would require major rewrites of significant parts of the codebase.