Category Archives: SRE

cmemgzip Python tool to compress files in memory when there is no free space on the disk

Rationale

All the Operation Engineers and SREs that work with systems have found the situation of having a Server with the disk full of logs and needing to keep those logs, and at the same time needing the system to keep running.

This is an uncomfortable situation.

I remember when I was being interviewed in Facebook, in Menlo Park, for a SDM position in the SRE (Software Development Manager) back in 2013-2014. They asked me about a situation where they have the Server disk full, and they deleted a big log file from Apache, but the space didn’t come back. They told me that nobody ever was able to solve this.

I explained that what happened is that Apache still had the fd (file descriptor), and that he will try to write to end of that file, even if they removed the huge log file with rm command, from the system they will not get back any free space. I explained that the easiest solution was to stop apache. They agreed and asked me how we could do the same without restarting the Webserver and I said that manipulating the file descriptors under /proc. They told me what I was the first person to solve this.

How it works

Basically cmemgzip will read a file, as binary, and will load it completely in to Memory.

Then it will compress it also in Memory. Then it will release the memory used to keep the original, will validate write permissions on the folder, will check that the compressed file is smaller than the original, and will delete the original and, using the new space now available in disk, write the compressed and smaller version of the file in gzip format.

Since version 0.3 you can specify an amount of memory that you will use for the blocks of data read from the file, so you can limit greatly the memory usage and compress files much more bigger than the amount of memory.

If for whatever reason the gz version cannot be written to disk, you’ll be asked for another route.

I mentioned before about File Descriptors, and programs that may keep those files open.

So my advice here, is that if you have to compress Apache logs or logs from a multi-thread program, and disk is full, and several instances may be trying to write to the log file: to stop Apache service if you can, and then run cmemgzip. I want to add it the future to auto-release open fd, but this is delicate and requires a lot of time to make sure it will be reliable in all the circumstances and will obey the exact desires of the SRE realizing the operation, without unexpected undesired side effects. It can be implemented with a new parameter, so the SysAdmin will know what is requesting.

Get the source code

You can decompress it later with gzip/gunzip.

So about cmemgzip you can git clone the project from here:

https://gitlab.com/carles.mateo/cmemgzip

git clone https://gitlab.com/carles.mateo/cmemgzip

The README.md is very clear:

https://gitlab.com/carles.mateo/cmemgzip/-/blob/master/README.md

The program is written in Python 3, and I gave it License MIT, so you can use it and the Open Source really with Freedom.

Do you want to test in other platforms?

This is a version 0.3.

I have only tested it in:

  • Ubuntu 20.04 LTS Linux for x64
  • Ubuntu 20.04 LTS 64 bits under Raspberry Pi 4 (ARM Processors)
  • Windows 10 Professional x64
  • Mac OS X
  • CentOS

It should work in all the platforms supporting Python, but if you want to contribute testing for other platforms, like Windows 32 bit, Solaris or BSD, let me know.

Alternative solutions

You can create a ramdisk and compress it to there. Then delete the original and move the compressed file from ramdisk to the hard drive, and unload the ramdrive Kernel Module. However we find very often with this problems in Docker containers or in instances that don’t have the Kernel module installed. Is much more easier to run cmemgzip.

Another strategy you can do for the future is to have a folder based on ZFS and compression. Again, ZFS should be installed on the system, and this doesn’t happen with Docker containers.

cmemgzip is designed to work when there is no free space, if there is free space, you should use gzip command.

In a real emergency when you don’t have enough RAM, neither disk space, neither the possibility to send the log files to another server to be compressed there, you could stop using the swap, and fdisk the swap partition to be a ext4 Linux format, format it, mount is, and use the space to compress the files. And after moving the files compressed to the original folder, fdisk the old swap partition to change type to Swap again, and enable swap again (swapon).

Memory requirements

As you can imagine, the weak point of cmemgzip, is that, if the file is completely loaded into memory and then compressed, the requirements of free memory on the Server/Instance/VM are at least the sum of the size of the file plus the sum of the size of the file compressed. You guess right. That’s true.

If there is not enough memory for loading the file in memory, the program is interrupted gracefully.

I decided to keep it simple, but this can be an option for the future.

So if your VM has 2GB of Available Memory, you will be able to use cmemgzip in uncompressed log files around 1.7GB.

In version 0.3 I implemented the ability to load chunks of the original file, and compress into memory, so I would be able use less memory. But then the compression is less efficient and initial tests point that I’ll have to keep a separate file for each compressed chunk. So I will need to created a uncompress tool as well, when now is completely compatible with gzip/gunzip, zcat, the file extractor from Ubuntu, etc…

For a big Server with a logfile of 40TB, around 300GB of RAM should be sufficient (the Servers I use have 768 GB of RAM normally).

Honestly, nowadays we find ourselves more frequently with VMs or Instances in the Cloud with small drives (10 to 15GB) and enough Available RAM, rather than Servers with huge mount points. This kind of instances, which means scaling horizontally, makes more difficult to have NFS Servers were we can move those logs, for security.

So cmemgzip covers very well some specific cases, while is not useful for all the scenarios.

I think it’s safe to say it covers 95% of the scenarios I’ve found in the past 7 years.

cmemgzip will not help you if you run out inodes.

Usage

Usage is very simple, and I kept it very verbose as the nature of the work is Operations, Engineers need to know what is going on.

I return error level/exit code 0 if everything goes well or 1 on errors.

./cmemgzip.py /home/carles/test_extract/SherlockHolmes.txt
 
 cmemgzip.py v.0.1

 Verifying access to: /home/carles/test_extract/SherlockHolmes.txt
 Size of file: /home/carles/test_extract/SherlockHolmes.txt is 553KB (567,291 bytes)
 Reading file: /home/carles/test_extract/SherlockHolmes.txt (567,291 bytes) to memory.
 567,291 bytes loaded.
 Compressing to Memory with maximum compression level…
 Size compressed: 204KB (209,733 bytes). 36.97% of the original file
 Attempting to create the gzip file empty to ensure write permissions
 Deleting the original file to get free space
 Writing compressed file /home/carles/test_extract/SherlockHolmes.txt.gz
 Verifying space written match size of compressed file in Memory
 Write verification completed.

You can also simulate, without actually delete or write to disk, just in order to know what will be the

Installation

There are no third party libraries to install. I only use the standard ones: os, sys, gzip

So clone it with git in your preferred folder and just create a symbolic link with your favorite name:

sudo ln --symbolic /home/carles/code/cmemgzip/cmemgzip.py /usr/bin/cmemgzip

I like to create the link without the .py extension.

This way you can invoke the program from anywhere by just typing: cmemgzip

Solving Oracle error ORA 600 [KGL-heap-size-exceeded]

Time ago there was a web page that was rendered in blank for certain group of users.

The errors were coming from an Oracle instance. One SysAdmin restarted the instance, but the errors continued.

Often there are problems due to having two different worlds: Development and Production/Operations.

What works in Development, or even in Docker, may not work at Scale in Production.

That query that works with 100,000 products, may not work with 10,000,000.

I have programmed a lot for web, so when I saw a blank page I knew it was an internal error as the headers sent by the Web Server indicated 500. DBAs were seeing elevated number of errors in one of the Servers.

So I went straight to the Oracle’s logs for that Servers.

I did a quick filter in bash:

cat /u01/app/oracle/diag/rdbms/world7c/world7c2/alert/log.xml | grep "ERR" -B4 -A3

This returned several errors of the kind “ORA 600 [ipc_recreate_que_2]” but this was not the error our bad guy was:

‘ORA 600 [KGL-heap-size-exceeded]’

The XML fragment was similar to this:

<msg time='2016-01-24T13:28:33.263+00:00' org_id='oracle' comp_id='rdbms'
msg_id='7725874800' type='INCIDENT_ERROR' group='Generic Internal Error'
level='1' host_id='gotham.world7c.justice.cat' host_addr='10.100.100.30'
pid='281279' prob_key='ORA 600 [KGL-heap-size-exceeded]' downstream_comp='LIBCACHE'
errid='726175' detail_path='/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc'>
<txt>Errors in file /u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc  (incident=726175):
ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x14D22C0C30], [0], [524288008], [], [], [], [], [], [], [], []
</txt></msg>

Just before this error, there was an error with a Query, and the PID matched, so it seemed cleared to me that the query was causing the crash at Oracle level.

Checking the file:

/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc

The content was something like this:

<msg time='2016-01-24T13:28:33.263+00:00' org_id='oracle' comp_id='rdbms'
msg_id='7725874800' type='INCIDENT_ERROR' group='Generic Internal Error'
level='1' host_id='gotham.world7c.justice.cat' host_addr='10.100.100.30'
pid='281279' prob_key='ORA 600 [KGL-heap-size-exceeded]' downstream_comp='LIBCACHE'
errid='726175' detail_path='/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc'>
<txt>Errors in file /u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc  (incident=726175):
ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x14D22C0C30], [0], [524288008], [], [], [], [], [], [], [], []
</txt>
</msg>

Basically in our case, the query that was launched by the BackEnd was using more memory than allowed, which caused Oracle to kill it.

That is a tunnable that you can modify introduced in Oracle 10g.

You can see the current values first:

SQL> select
2 nam.ksppinm NAME,
3 nam.ksppdesc DESCRIPTION,
4 val.KSPPSTVL
5 from
6 x$ksppi nam,
7 x$ksppsv val
8 where nam.indx = val.indx and nam.ksppinm like '%kgl_large_heap_%_threshold%';

NAME                              | DESCRIPTION                       | KSPPSTVL
=============================================================================================
_kgl_large_heap_warning_threshold | maximum heap size before KGL      | 4194304
                                    writes warnings to the alert log
---------------------------------------------------------------------------------------------
_kgl_large_heap_assert_threshold  | maximum heap size before KGL      | 4194304
                                    raises an internal error

So, _kgl_large_heap_warning_threshold is the maximum heap before getting a warning, and _kgl_large_heap_assert_threshold is the maximum heap before getting the error.

Depending in your case the solution can be either:

  • Breaking your query in several to reduce the memory used
  • Use paginating or LIMIT
  • Set a bigger value for those tunnables.

It will work setting 0 for these to variables, although I don’t recommend it to you, as you want your Server to kill queries that are taking more memory than you want.

To increase the value of , you have to update it. Please note it is in bytes, so for 32MB is 32 * 1024 * 1024, so 33,554,432, and using spfile:

SQL> alter system set "_kgl_large_heap_warning_threshold"=33554432
scope=spfile ;
 
SQL> shutdown immediate 

SQL> startup
 
SQL> show parameter _kgl_large_heap_warning_threshold
NAME                               TYPE      VALUE
==================================|=========|===============
 
_kgl_large_heap_warning_threshold | integer | 33554432
 

Or if using the parameter file, set:

_kgl_large_heap_warning_threshold=33554432

Adding a swapfile on the fly as a temporary solution for a Server with few memory

Here is an easy trick that you can use for adding swap temporarily to a Server, VMs or Workstations, if you are in an emergency.

In this case I had a cluster composed from two instances running out of memory.

I got an alert for one of the Servers, reporting that only had 7% of free memory.

Immediately I checked it, but checked also any other forming part of the cluster.

Another one appeared, had just only a bit more memory than the other, but was considered in Critical condition too.

The owner of the Service was contacted and asked if we can hold it until US Business hours. Those Servers were going to be replaced next day in US Business hours, and when possible it would be nice not to wake up the Team. It was day in Europe, but night in US.

I checked the status of the Server with those commands:

# df -h

There are 13GB of free space in /. More than enough to be safe as this service doesn’t use much.

# free -h
              total        used        free      shared  buff/cache   available
Mem:           5.7G        4.8G        139M        298M        738M        320M
Swap:            0B          0B          0B

I checked the memory, ok, there are only 139MB free in this node, but 738MB are buff/cache. Buff/Cache is memory used by Linux to optimize I/O as long as it is not needed by application. These 738 MB in buff/cache (or most of it) will be used if needed by the System. The field available corresponds to the memory that is available for starting new applications (not counting the swap if there was any), and basically is the free memory plus a fragment of the buff/cache. I’m sure we could use more than 320MB and there is a lot if buff/cache, but to play safe we play by the book.

With that in mind it seemed that it would hold perfectly to Business hours.

I checked top. It is interesting to mention the meaning of the Column RES, which is resident memory, in other words, the real amount of memory that the process is using.

I had a Java process using 4.57GB of RAM, but a look at how much Heap Memory was reserved and actually being used showed a Heap of 4GB (Memory reserved) and 1.5GB actually being used for real, from the Heap, only.

It was unlikely that elastic search would use all those 4GB, and seemed really unlikely that the instance will suffer from memory starvation with 2.5GB of 4GB of the Heap free, ~1GB of RAM in buffers/cache plus free, so looked good.

To be 100% sure I created a temporary swap space in a file on the SSD.

# fallocate -l 1G /swapfile-temp

# dd if=/dev/zero of=/swapfile-temp bs=1024 count=1048576 status=progress
1034236928 bytes (1.0 GB) copied, 4.020716 s, 257 MB/s
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 4.26152 s, 252 MB/s

If you ask me why I had to dd, I will tell you that I needed to. I checked with command blkid and filesystem was xfs. I believe that was the reason.

The speed writing to the file is fair enough for a swap.

# chmod 600 /swapfile-temp

# mkswap /swapfile-temp
Setting up swapspace version 1, size = 1048572 KiB
no label, UUID=5fb12c0c-8079-41dc-aa20-21477808619a

# swapon /swapfile-temp

I check that memory is good:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           5.7G        4.8G        117M        298M        770M        329M
Swap:          1.0G          0B        1.0G

And finally I check that the Kernel parameter swappiness is not too aggressive:

# sysctl vm.swappiness
vm.swappiness = 30

Cool. 30 is a fair enough value.

Post-Mortem: The mystery of the duplicated Transactions into an e-Commerce

Me, with 4 more Senior BackEnd Engineers wrote the new e-Commerce for a multinational.

The old legacy Software evolved into a different code for every country, making it impossible to be maintained.

The new Software we created used inheritance to use the same base code for each country and overloaded only the specific different behavior of every country, like for the payment methods, for example Brazil supporting “parcelados” or Germany with specific payment players.

We rewrote the old procedural PHP BackEnd into modern PHP, with OOP and our own Framework but we had to keep the transactional code in existing MySQL Procedures, so the logic was split. There was a Front End Team consuming our JSONs. Basically all the Front End code was cached in Akamai and pages were rendered accordingly to the JSONs served from out BackEnd.

It was a huge success.

This e-Commerce site had Campaigns that started at a certain time, so the amount of traffic that would come at the same time would be challenging.

The project was working very well, and after some time the original Team was split into different projects in the company and a Team for maintenance and evolutives was hired.

At certain point they started to encounter duplicate transactions, and nobody was able to solve the mystery.

I’m specialized into fixing impossible problems. They used to send me to Impossible Missions, and I am famous for solving impossible problems easily.

So I started the task with a SRE approach.

The System had many components and layers. The problem could be in many places.

I had in my arsenal of tools, Software like mysqldebugger with which I found an unnoticed bug in decimals calculation in the past surprising everybody.

Previous Engineers involved believed the problem was in the Database side. They were having difficulties to identify the issue by the random nature of the repetitions.

Some times the order lines were duplicated, and other times were the payments, which means charging twice to the customer.

Redis Cluster could also play a part on this, as storing the session information and the basket.

But I had to follow the logic sequence of steps.

If transactions from customer were duplicated that mean that in first term those requests have arrived to the System. So that was a good point of start.

With a list of duplicated operations, I checked the Webservers logs.

That was a bit tricky as the Webserver was recording the Ip of the Load Balancer, not the ip of the customer. But we were tracking the sessionid so with that I could track and user request history. A good thing was also that we were using cookies to stick the user to the same Webserver node. That has pros and cons, but in this case I didn’t have to worry about the logs combined of all the Webservers, I could just identify a transaction in one node, and stick into that node’s log.

I was working with SSH and Bash, no log aggregators existing today were available at that time.

So when I started to catch web logs and grep a bit an smile was drawn into my face. :)

There were no transactions repeated by a bad behavior on MySQL Masters, or by BackEnd problems. Actually the HTTP requests were performed twice.

And the explanation to that was much more simple.

Many Windows and Mac User are used to double click in the Desktop to open programs, so when they started to use Internet, they did the same. They double clicked on the Submit button on the forms. Causing two JavaScript requests in parallel.

When I explained it they were really surprised, but then they started to worry about how they could fix that.

Well, there are many ways, like using an UUID in each request and do not accepting two concurrents, but I came with something that we could deploy super fast.

I explained how to change the JavaScript code so the buttons will have no default submit action, and they will trigger a JavaScript method instead, that will set a boolean to True, and also would disable the button so it can not be clicked anymore. Only if the variable was False the submit would be performed. It was almost impossible to get a double click as the JavaScript was so fast disabling the button, that the second click will not trigger anything. But even if that could be possible, only one request would be made, as the variable was set to True on the first click event.

That case was very funny for me, because it was not necessary to go crazy inspecting the different layers of the system. The problem was detected simply with HTTP logs. :)

People often forget to follow the logic steps while many problems are much more simple.

As a curious note, I still see people double clicking on links and buttons on the Web, and some Software not handling it. :)