This article first appeared in the DataPlatformGeeks Magazine.
Author: Michelle Gutzait
As infrastructure and technology keep getting stronger, and serverless cloud platforms become very compelling for highly variable, intense workloads, it is tempting to think that database performance no longer matters much.
When we start developing a new system or application, we usually know the initial number of users, processes and data sizes, but it is very hard to estimate how quickly those numbers will eventually grow. And this growth can also be sudden. As an example: you create an application for an SMB with only dozens of users and a few GBs of data, but later on the company grows fast or is bought by a huge corporation, and your system must now serve hundreds or even thousands of users with many GBs or even TBs of data.
This is why it is critical to ALWAYS develop database applications with intense workloads and large datasets in mind, just in case things grow surprisingly big.
Relational Databases and Capacity
Relational database technology is built for transactional processing. That means it needs locking mechanisms so that processes can update and query the data in parallel. Each database technology is different, and the locking mechanisms differ enough that resolving lock, blocking and deadlock bottlenecks will be very specific to the particular engine.
In addition, each database platform handles caching differently, but the amount of data that can be cached depends on the machine capacity. When it comes to Serverless, scaling up resources may take time, so if performance is important, we need to take this into account and keep the data sets that must be cached as small as possible.
Lastly, the database engine requires resources such as memory, processing power, storage and network bandwidth; the less of these your processes consume, the better performance you will get.
Is creating a “database agnostic” application good or bad?
Building a generic, “database agnostic” application can be a positive thing because you will be able to use almost any relational database in the back end. However, different database platforms may behave differently when the same code is executed. For example, some databases (such as SQL Server) will perform better in certain scenarios if you execute Stored Procedures rather than sending ad-hoc queries. While SQL Server works nicely with set-based transactions, Oracle may perform better using cursors. Sometimes it is a good idea to create temporary tables or objects in the database engine to achieve better performance; however, each database engine has its own proprietary way of handling those, and the code will not be compatible between the platforms.
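To make the Stored Procedure versus ad-hoc contrast concrete, here is a minimal sketch in Python with pyodbc against SQL Server. The server, database, table and procedure names (Sales.Orders, dbo.usp_GetCustomerOrders) are illustrative assumptions, not from a real system:

```python
import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

# Server, database, table and procedure names below are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=SalesDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Ad-hoc query: the engine parses the text and may compile a fresh plan,
# and the plan cache can fill with near-duplicate single-use entries.
cursor.execute(
    "SELECT OrderID, OrderDate FROM Sales.Orders WHERE CustomerID = ?", 42
)

# Stored Procedure: a named, parameterized unit whose compiled plan is
# typically reused across calls (ODBC CALL escape syntax).
cursor.execute("{CALL dbo.usp_GetCustomerOrders (?)}", 42)
rows = cursor.fetchall()
```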
If you are using a framework in your application, test your application with large sets of data to ensure that the generic code it produces will perform well, even under pressure.
Communicating With the Database Server
Application developers often prefer row-by-row processing because it can be easier to understand and code. Code in a loop may look clean and neat, but you should ask yourself what the implications of such code are for overall performance. A very basic Best Practice is to keep the “back and forth” communication between the application and the database server to a minimum. In a nutshell: DO NOT process row by row if you don’t have to.
Furthermore, ensure that intense database operations are executed inside the database engine as much as possible (for example, by using Stored Procedures and set-based operations).
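As an illustration, the following sketch (Python with pyodbc; the DSN, table and column names are hypothetical) contrasts a row-by-row loop, which pays one network round trip per row, with the equivalent single set-based statement that runs entirely inside the engine:

```python
import pyodbc

conn = pyodbc.connect("DSN=SalesDB")  # DSN name is a placeholder
cursor = conn.cursor()

# Row-by-row: one network round trip per order. Thousands of rows mean
# thousands of round trips, each paying latency and per-statement overhead.
cursor.execute("SELECT OrderID FROM Sales.Orders WHERE Status = 'new'")
for (order_id,) in cursor.fetchall():
    cursor.execute(
        "UPDATE Sales.Orders SET Status = 'processed' WHERE OrderID = ?",
        order_id,
    )

# Set-based: the same work as a single statement, executed entirely inside
# the database engine in one round trip.
cursor.execute(
    "UPDATE Sales.Orders SET Status = 'processed' WHERE Status = 'new'"
)
conn.commit()
```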
Controlling Number of Sessions/Connections
In most if not all database platforms, every time a new connection/login is made to the database engine, a minimum set of resources is reserved for that connection, and security checks are run to establish and track the permissions of the specific user or credential. Also, some database platforms (e.g., Cloud Database Services) limit the number of connected users/sessions, and you may get an error when trying to exceed this limit. For these reasons, it is important to control the number of logins and logouts (or connects/disconnects) against the database server, for example by using connection pooling in the application.
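For example, here is a minimal pooling sketch using SQLAlchemy’s built-in connection pool. The connection URL and pool numbers are illustrative assumptions and should be tuned to your own platform and its session limits:

```python
from sqlalchemy import create_engine, text

# The engine keeps up to pool_size connections open and reuses them, so the
# application logs in a handful of times instead of once per request.
engine = create_engine(
    "mssql+pyodbc://app_user:secret@myserver/SalesDB"
    "?driver=ODBC+Driver+18+for+SQL+Server",
    pool_size=10,       # connections kept open in the pool
    max_overflow=5,     # extra connections allowed under burst load
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1800,  # refresh long-lived connections periodically
)

with engine.connect() as conn:  # borrows from the pool, returns it on exit
    rows = conn.execute(text("SELECT COUNT(*) FROM Sales.Orders")).all()
```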
The Cost of Performance Tuning to the Business
Any change to an already existing application can be costly, especially when it comes to troubleshooting. Firstly, it may take a while to find the culprit: the piece of code responsible for a bottleneck.
Subsequently, if your system is already in Production, and depending on the type of code change required, you may need to rewrite some code and then pass the change through the regular CI/CD process.
Both scenarios can be expensive, as you will sometimes need to spend hours or even days on troubleshooting, tests and trials, and eventually on code modification.
The earlier you “make things right” by following performance Best Practices, the lower the chance that this will happen. When code follows Best Practices, there is a better chance that an issue can be solved at the database level (e.g., modifying indexes, Table Partitioning, read-only replicas, etc.).
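As a simple illustration of such a database-level fix, the sketch below (Python submitting T-SQL; the index, table and column names are hypothetical) adds a covering index without touching any application code:

```python
import pyodbc

conn = pyodbc.connect("DSN=SalesDB")  # placeholder DSN
cursor = conn.cursor()

# A covering index for a frequent query that filters on Status and reads
# OrderDate: a database-level fix that needs no application code change.
cursor.execute(
    "CREATE NONCLUSTERED INDEX IX_Orders_Status "
    "ON Sales.Orders (Status) INCLUDE (OrderDate);"
)
conn.commit()
```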
Summary
Remember that even though hardware, infrastructure and software have improved significantly, you should always aim for applications that transfer as little data as possible between the different layers: Cloud Services, Virtual Machines, Operating Systems, network, storage, Application/Web servers, containers, database engines and more. Each layer needs to process the data or pass it through to another layer, and each hop may add latency.