"whipslash" is Logan Abbott [twitter.com], President of SourceForge, Slashdot, and others. He said, "I appreciate the concern, conspiracy theories, and even the anger and vitriol. It's nice to see people care."
Rather unnecessary, though (Score:3)
I've been active on Slashdot for many years. This is my second account; the newer one is only five years old or so. I have an affection for this site.
I've been managing servers far longer, since 1997 or so.
I've owned two hosting companies and consulted for several others. I've had the opportunity to contribute code to the Apache server, the Linux kernel, and a lot of the other software we all use. I've been writing code in Perl, the language Slashdot uses, the whole time. I was once the only person allowed to touch…
Slashdot is extremely important. (Score:2)
Slashdot is extremely important to the technology community.
What Ray Morris said in the parent comment seems reasonable to me.
Re: (Score:2)
> What Ray Morris said in the parent comment seems reasonable to me.
Well, not to me. He claims to know exactly what went wrong while stating a few obvious facts about DNS, TTLs, and RAID. I seriously doubt the /. engineers don't know about these things.
He's been managing servers for years; I've been managing migrations like these, and high-priority incidents, for years. I have very little faith in engineers who come to the table claiming to know it all without any investigation into what really went wrong.
I agree that on the surface it doesn't look very good. But we really don't know what actually happened.
I didn't say I know exactly what went wrong (Score:2)
I didn't say I know exactly what went wrong.
I said I'm sure they ran into some bad luck with a drive failing or something. On a well-run server migration, a drive failure doesn't result in a week of downtime, or any downtime at all: drives are redundant in RAID. During a properly managed server migration, you can always switch back to the old server, which has been working for years, by updating the A record. Rolling back takes no more than five minutes, because you lower the TTL ahead of time.
Worst case wou…
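To make the TTL point concrete, here's a minimal sketch of the kind of pre-cutover check I mean. It assumes the dnspython library, and the domain and threshold are hypothetical stand-ins, not anything Slashdot actually runs:

    # Pre-migration sanity check: is the TTL low enough for a fast rollback?
    # Assumes dnspython (pip install dnspython); the domain is a placeholder.
    import dns.resolver

    DOMAIN = "example.org"   # stand-in for the zone being migrated
    MAX_TTL = 300            # five minutes, matching the rollback window above

    answers = dns.resolver.resolve(DOMAIN, "A")
    ttl = answers.rrset.ttl
    for rdata in answers:
        print(f"{DOMAIN} A {rdata.address} (TTL {ttl}s)")

    if ttl > MAX_TTL:
        print(f"WARNING: TTL is {ttl}s; lower it well before cutover, "
              "or a DNS rollback will not take effect in minutes.")

If that check passes ahead of the move, reverting the A record really is a five-minute operation.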
Re: (Score:2)
Yes, I was wrong to say that you claimed to know exactly what went wrong, but you (again) make a lot of assumptions that I see no proof of.
I am fully aware that you can roll back by reverting the DNS record, after making sure in advance that the TTL was set to one minute. We had to do one recently, and it was a bit more complex than that: we also had to replay the transactions that came in during the short period we were on the new system. Fortunately we had planned for that, so we were actually able to do so.
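For the curious, a toy sketch of what that replay step looks like -- the journal format and the apply function here are entirely hypothetical, just to show the shape of the plan:

    # Re-apply writes that landed on the new system between cutover and
    # rollback, in commit order. The journal format is a made-up example:
    # one JSON object per line with "id", "action", and "committed_at".
    import json

    def apply_transaction(txn):
        # In a real system this would re-run the write against the
        # rolled-back-to database, ideally through the application's own
        # code path so constraints and validation still apply.
        print(f"replaying {txn['id']}: {txn['action']}")

    def replay_journal(path, window_start, window_end):
        with open(path) as fh:
            for line in fh:
                txn = json.loads(line)
                # ISO-8601 UTC timestamps compare correctly as strings
                if window_start <= txn["committed_at"] <= window_end:
                    apply_transaction(txn)

    replay_journal("writes.jsonl",
                   "2017-01-20T02:00:00Z", "2017-01-20T02:45:00Z")

The point isn't the code; it's that capturing those writes has to be decided before the cutover, not after.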
Re: (Score:2)
> This can be technical, it can also be financial (if we roll back we have to pay two data centers for another month or even a year).
That's why you do the migration a week before the billing period ends. That's part of my checklist. What do you want to bet these guys didn't have that on their checklist, if they had a checklist at all?
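The rule is trivial to write down, which is rather the point: it belongs on a checklist, not in someone's head. A throwaway illustration (the renewal date is made up):

    # Schedule the cutover at least a week before the old data center's
    # billing period ends, so a rollback doesn't buy another full period.
    from datetime import date, timedelta

    billing_period_ends = date(2017, 2, 1)   # hypothetical renewal date
    latest_safe_cutover = billing_period_ends - timedelta(days=7)
    print(f"Cut over no later than {latest_safe_cutover}; "
          f"that leaves a week to roll back before {billing_period_ends}.")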
Shit happens. Good engineers *know* that shit happens. So they have rollback plans, backups, etc., so that even when shit happens, the site isn't down for a week.
Re:I didn't say I know exactly what went wrong (Score:2)
I don't know exactly what happened, and even the boss may not know, because techs may be trying to cover their ass. (In fact, trying to cover your ass during an incident is a VERY common mistake.) That's why I suggested he may want to have a third party look into what happened and how robust the setup is now. Because the evidence suggests there is little reason to be confident.
I'd certainly not rule out that the mistakes predate the current owners, too -- or that there were issues outside their control. The two may overlap if it's something along the lines of "the previous data center has alleged backups," or "somebody failed to make sure the code was properly documented, and also failed to notice that among the badly documented parts there was a 'clever' trick that depended on the old server's hardware configuration and would break on different hardware." I leave it to you how much fault to attribute where in such situations; I suspect some of this might reasonably not be noticed until you're in the middle of the move, though I would note that it would have been polite to stick a notice on the front page itself.