These days I spend most of my time in the cloud and the thousand-and-one other things that go with it, such as build pipelines, Docker, Kubernetes, Terraform, and while these also involve a lot of searching, and there are days when the problem sucks your soul dry (I’m looking at you oauth), somehow I find it more fun. I think it’s because once the problem is solved, the effect is immediate. And whatever the new thing is, it somehow still feels familiar. All these new technologies are analogous to concepts from 30-40 years ago (that’s a different topic).
Anyway, last week one of the dev teams deployed a microservice to prod and it promptly started erroring out. After half a day of troubleshooting, I was called in. At first I’d heard that it was having trouble connecting to the database, and one of the developers stated that the issue was the database version.
“We are using Entity Relation Framework an it is does not support v5.6,” he said on a call. “The version in staging is 5.7, which was released in 2015. The version in production was released in 2012! This worked fine in staging, so that is the only difference.”
He wanted us to upgrade the DB version. In prod. For that microservice.
Looking at the logs, it turned out that the actual error was that the query was timing out. I couldn’t understand what optimizations or backwards-incompatible features the framework was using that could be causing an issue on a query. The rest of our application code worked just fine.
I asked the developer for the query that was timing out, and he posted it to the channel. I ran it directly against the database and it completed in 0.03 seconds. So could it really be the version?
Then I noticed that the query he posted seemed to be written by a person, because it had table aliases and other telltale signs. So I asked for the actual query written by the framework. This time I had a strong suspicion that there was another difference between staging and prod that might be at play.
The developer went off to change the code to log the query, and when he posted it, it had ten times as many lines and 5 times as many tables in the join. I ran this in a connection window and it took 564 seconds to complete.
The difference between staging and prod? The lower environment table had 100K records, the prod table had 10M records. The query involved a full table scan and so took inordinately long.
Even assuming staging and prod matched in the number of records at present, it is a mistake to not analyze the query and check for things like full table scans. If the business is successful and grows a hundred-fold, what is acceptable performance now will become unacceptable later, when the code may be baked in deep in the application and changing it may be much more expensive.