Category Archives: Troubleshoot

Some weird things from Python 3 that you may not know

Last Update: 2021-09-29 18:12 IST

You can find those bizarre things and more in my book Python 3 Combat Guide.

I’m not talking about the wonderful things, like how big can the Integers be, but about the bizarre things that may ruin your day.

What sums 0.1 + 0.1 + 0.1 in Python?

0.3?

Wrong answer.

A bit of humor

Well, to be honest the computer was wrong. They way programming languages handle the Floats tend to be less than ideal.

Floats

Maybe you know JavaScript and its famous NaN (Not a number).

You are probably sure that Python is much more exact than that…

…well, until you do a big operation with Floats, like:

10.0**304 * 10.0**50 

and

It returns infinite

I see your infinite and I add one :)

However If we try to define a number too big directly it will return OverflowError:

Please note Integers are handled in a much more robust cooler way:

Negative floats

Ok. What happens if we define a number with a negative power, like 10 ** -300 ?

And if we go somewhere a bit more far? Like 10 ** -329

It returns 0.0

Ups!

I mention in my books why is better to work with Integers, and in fact most of the eCommerces, banks and APIs work with Integers. For example, if the amount in USD 10.00 they send multiplied by 100, so they will send 1000. All the actor know that they have to divide by 2.

Breaking the language innocently

I mentioned always that I use the MT Notation, the prefix notation I invented, inspired by the Hungarian Notation and by an amazing C++ programmer I worked with in Volkswagen and in la caixa (now caixabank), that passed away many years ago.

Well, that system of prefixes will name a variable with a prefix for its type.

It’s very useful and also prevents the next weird thing from Python.

Imagine a Junior wants to print a String and they put in a variable. And unfortunately they call this variable print. Well…

print = "Hello World!"
print("That will hurt")

Observe the output of this and try not to scream:

Variables and Functions named equally

Well, most of languages are able to differentiate a function, with its parenthesis, from a variable.

The way Python does it hurts my coder heart:

Another good reason to use MT Notation for the variables, and for taking seriously doing Unit Testing and giving a chance to using getters and setters and class Constructor for implementing limits and sanitation.

Nested Loops

This will work in Python, it doesn’t work in other languages (but please never do it).

for i in range(3):
    print("First Loop", i)
    for i in range(4):
        print("Second Loop", i)

The code will not crash by overwriting i used in the first loop, but the new i will mask the first variable.

And please, name variables properly.

Import… once?

Imports are imported only once. Even if different files imported do import the same file.

So don’t have code in the middle of them, outside functions/classes, unless you’re really know what you’re doing.

Define functions first, and execute code after if __name__ == “__main__”:

Take a look at this code:

def first_function():
    print("Inside first function")
    second_function()

first_function()

def second_function():
    print("Inside second function")

Well, this will crash as Python executes the code from top to bottom, and when it gets to first_function() it will attempt to call second_function() which has not been read by Python yet. This example will throw an error.

You’ll get an error like:

Inside first function
Traceback (most recent call last):
  File "/home/carles/Desktop/code/carles/python_combat_guide/src/structure_dont_do_this.py", line 14, in <module>
    first_function()
  File "/home/carles/Desktop/code/carles/python_combat_guide/src/structure_dont_do_this.py", line 12, in first_function
    second_function()
NameError: name 'second_function' is not defined

Process finished with exit code 1

Add your code at the bottom always, under:

if __name__ == "__main__":
    main()

The code inside this if will only be executed if you directly call this code as main file, but will not be executed if you import this file from another one.

You don’t have this problem with classes in Python, as they are defined first, completely read, and then you instantiate or use them. To avoid messing and creating bugs, have the imports always on the top of your file.

Cloning a Windows Application running in Wine

I’ve some very old Windows Applications running in Wine in my Linux workstations.

It’s Software I bought years ago and that is not available anymore.

Keeping and migrating or cloning to another Linux Workstation or Virtual machine is really easy.

I share the steps with you.

You just have to copy the contents from your /home/username/.wine folder.

Then, in the new workstation install wine. For Ubuntu this is:

apt update && apt install wine

Run winecfg so basic links and structures are created.

Then simply copy the .wine folder backup to your new machine /home/username/

Your programs will be in /home/username/.wine/drive_c_/Program Files/ or /home/username/.wine/drive_c_/Program Files (x86)/

If you want you can just copy your programs folder.

Remember that to cd to a directory with spaces you have to use “

For example:

$ pwd
/home/carles/.wine/drive_c
$ cd "Program Files"
$ pwd
/home/carles/.wine/drive_c/Program Files

You can also use \ (slash space) to escape space.

Then start your favorite program with:

wine yourprogram.exe

If that fails is very probably that creating a new configuration, for a new user, will make things right.

Solving Oracle error ORA 600 [KGL-heap-size-exceeded]

Time ago there was a web page that was rendered in blank for certain group of users.

The errors were coming from an Oracle instance. One SysAdmin restarted the instance, but the errors continued.

Often there are problems due to having two different worlds: Development and Production/Operations.

What works in Development, or even in Docker, may not work at Scale in Production.

That query that works with 100,000 products, may not work with 10,000,000.

I have programmed a lot for web, so when I saw a blank page I knew it was an internal error as the headers sent by the Web Server indicated 500. DBAs were seeing elevated number of errors in one of the Servers.

So I went straight to the Oracle’s logs for that Servers.

I did a quick filter in bash:

cat /u01/app/oracle/diag/rdbms/world7c/world7c2/alert/log.xml | grep "ERR" -B4 -A3

This returned several errors of the kind “ORA 600 [ipc_recreate_que_2]” but this was not the error our bad guy was:

‘ORA 600 [KGL-heap-size-exceeded]’

The XML fragment was similar to this:

<msg time='2016-01-24T13:28:33.263+00:00' org_id='oracle' comp_id='rdbms'
msg_id='7725874800' type='INCIDENT_ERROR' group='Generic Internal Error'
level='1' host_id='gotham.world7c.justice.cat' host_addr='10.100.100.30'
pid='281279' prob_key='ORA 600 [KGL-heap-size-exceeded]' downstream_comp='LIBCACHE'
errid='726175' detail_path='/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc'>
<txt>Errors in file /u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc  (incident=726175):
ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x14D22C0C30], [0], [524288008], [], [], [], [], [], [], [], []
</txt></msg>

Just before this error, there was an error with a Query, and the PID matched, so it seemed cleared to me that the query was causing the crash at Oracle level.

Checking the file:

/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc

The content was something like this:

<msg time='2016-01-24T13:28:33.263+00:00' org_id='oracle' comp_id='rdbms'
msg_id='7725874800' type='INCIDENT_ERROR' group='Generic Internal Error'
level='1' host_id='gotham.world7c.justice.cat' host_addr='10.100.100.30'
pid='281279' prob_key='ORA 600 [KGL-heap-size-exceeded]' downstream_comp='LIBCACHE'
errid='726175' detail_path='/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc'>
<txt>Errors in file /u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc  (incident=726175):
ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x14D22C0C30], [0], [524288008], [], [], [], [], [], [], [], []
</txt>
</msg>

Basically in our case, the query that was launched by the BackEnd was using more memory than allowed, which caused Oracle to kill it.

That is a tunnable that you can modify introduced in Oracle 10g.

You can see the current values first:

SQL> select
2 nam.ksppinm NAME,
3 nam.ksppdesc DESCRIPTION,
4 val.KSPPSTVL
5 from
6 x$ksppi nam,
7 x$ksppsv val
8 where nam.indx = val.indx and nam.ksppinm like '%kgl_large_heap_%_threshold%';

NAME                              | DESCRIPTION                       | KSPPSTVL
=============================================================================================
_kgl_large_heap_warning_threshold | maximum heap size before KGL      | 4194304
                                    writes warnings to the alert log
---------------------------------------------------------------------------------------------
_kgl_large_heap_assert_threshold  | maximum heap size before KGL      | 4194304
                                    raises an internal error

So, _kgl_large_heap_warning_threshold is the maximum heap before getting a warning, and _kgl_large_heap_assert_threshold is the maximum heap before getting the error.

Depending in your case the solution can be either:

  • Breaking your query in several to reduce the memory used
  • Use paginating or LIMIT
  • Set a bigger value for those tunnables.

It will work setting 0 for these to variables, although I don’t recommend it to you, as you want your Server to kill queries that are taking more memory than you want.

To increase the value of , you have to update it. Please note it is in bytes, so for 32MB is 32 * 1024 * 1024, so 33,554,432, and using spfile:

SQL> alter system set "_kgl_large_heap_warning_threshold"=33554432
scope=spfile ;
 
SQL> shutdown immediate 

SQL> startup
 
SQL> show parameter _kgl_large_heap_warning_threshold
NAME                               TYPE      VALUE
==================================|=========|===============
 
_kgl_large_heap_warning_threshold | integer | 33554432
 

Or if using the parameter file, set:

_kgl_large_heap_warning_threshold=33554432

Fixing Wifi not enabled in Linux Kali with Dell XPS laptop

This is a trick I share, as I see many students having problems with this.

Assuming that your Kali distribution is recent (Linux Kernel bigger than Kernel 5.3), the most typical problem student have is that laptops xps from Dell and other brands have a combination of keys to enable or disable the Wifi.

On the Dell xps is on the key PrtScr, so if your Wifi is disabled, you can enable it in Kali Linux press:

CTRL + ALT + Fn + PrtScr

As you can see in the PrtScr the is an icon of Wifi Signal.

The Fn key is on the bottom left, next to Ctrl.

This is a very simple to fix problem, but many people suffer this problem and go crazy trying to update drivers or even having to use an external USB dongle.

A simple Bash one line script to log the temperature of your HDDs and CPUs in Ubuntu

I’ve been helping to troubleshoot the reason one Commodity Server (with no iDrac/Ilo ipmi) is powering off randomly. One of the hypothesis is the temperature.

This is a very simple script that will print the temperature of the HDDs and the CPU and keep to a log file.

First you need to install hddtemp and lm-sensors:

sudo apt install hddtemp lm-sensors

Then this is the one line script, that you should execute as root:

while [ true ]; do date | tee -a /var/log/hddtemp.log; hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd | tee -a /var/log/hddtemp.log; date | tee -a /var/log/cputemp.log; sensors | tee -a /var/log/cputemp.log; sleep 2; done

Feel free to change sleep 2 for the number of seconds you want to wait, like sleep 10.

Press CTRL + C to interrupt the script at any time.

You can execute this inside a screen session and leave it running in Background.

Note that I use tee command, so the output is print to the screen and to the log file.

News from the blog 2020-10-16

  • I’ve been testing and adding more instances to CMIPS. I’m planning on testing the Azure instance with 120 cores.
  • News: Microsoft makes an option to permanently remote work

https://www.bbc.com/news/business-54482245

  • One of my colleagues showed me dstat, a very nice tool for system monitoring, and bandwidth of a drive monitoring. Also ifstat, as complement to iftop is very cool for Network too. This functionality is also available in CTOP.py
  • As I shared in the past news of the blog, I’m resuming my contributions to ZFS Community.

Long time ago I created some ZFS tools that I want to share soon as Open Source.

I equipped myself with the proper Hardware to test on SAS and SATA:

  • 12G Internal PCI-E SAS/SATA HBA RAID Controller Card, Broadcom’s SAS 3008, compatible for SAS 9300-8I.
    This is just an HDA (Host Data Adapter), it doesn’t support RAID. Only connects up to 8 drives or 1024 through expander, to my computer.
    It has a bandwidth of 9,600 MB/s which guarantees me that I’ll be able to add 12 SAS SSD Enterprise grade at almost the max speed of the drives. Those drives perform at 900 MB/s so if I’m using all of them at the same time, like if I have a pool of 8 + 3 and I rebuild a broken drive or I just push Data, I would be using 12×900 = 10,800 MB/s. Close. Fair enough.
  • VANDESAIL Mini-SAS Cables, 1m Internal Mini-SAS to 4x SAS SATA Forward Breakout Cable Hard Drive Data Transfer Cable (SAS Cable).
  • SilverStone SST-FS212B – Aluminium Trayless Hot Swap Mobile Rack Backplane / Internal Hard Drive Enclosure for 12x 2.5 Inch SAS/SATA HDD or SSD, fit in any 3x 5.25 Inch Drive Bay, with Fan and Lock, black
  • Terminator is here.
    I ordered this T-800 head a while ago and finally arrived.

Finally I will have my empty USB keys located and protected. ;)

Remember to be always nice to robots. :)

Fixing problems with audio not sounding after upgrade from Ubuntu 18.04 LTS to 20.04.1 LTS

Two days ago I upgraded my Ubuntu Linux 18.04 LTS Workstation to Ubuntu 20.04.1 LTS and I experienced some audio problems.

Basically I noticed that the system was not playing any sound.

When I checked the audio config I noticed that only the external output of my motherboard was detected, but not the HDMI output from the monitor.

I have a 28″ Asus monitor with speakers embedded.

It didn’t make any sense, so I decided to restart pulseaudio:

pulseaudio -k

That fixed my problem.

However I noticed that when I lock my session, and so the monitor goes off for power saving my HDMI monitor output disappears from the list again.

Repeating the command pulseaudio -k will fix that again.

I checked that the power saving was enabled:

cat /sys/module/snd_hda_intel/parameters/power_save
cat /sys/module/snd_hda_intel/parameters/power_save_controller

I had 1 and Y.

To make the change permanently, I change the power mode settings:

sudo sh -c "echo 0 > /sys/module/snd_hda_intel/parameters/power_save" 
sudo sh -c "echo N > /sys/module/snd_hda_intel/parameters/power_save_controller"

Bash Script: Count repeated lines in the logs

This small script will count repeated patterns in the Logs.

Ideal for checking if there are errors that you’re missing while developing.

#!/usr/bin/env bash
# count_repeated_pattern_in_logs.sh
# By Carles Mateo
# Helps to find repeated lines in Logs
LOGFILE_MESSAGES="/var/log/messages"
LOGFILE_SYSLOG="/var/log/syslog"
if [[ -f "${LOGFILE_MESSAGES}" ]]; then
LOGFILE=${LOGFILE_MESSAGES}
else
LOGFILE=${LOGFILE_SYSLOG}
fi
echo "Using Logfile: ${LOGFILE}"
CMD_OUTPUT=`cat ${LOGFILE} | awk '{ $1=$2=$3=$4=""; print $0 }' | sort | uniq --
count | sort --ignore-case --reverse --numeric-sort`
echo -e "$CMD_OUTPUT"

Basically it takes out the non relevant fields that can prevent from detecting repetition, like the time, and prints the rest.
Then you will launch it like this:

count_repeated_pattern_in_logs.sh | head -n20

If you are checking a machine with Ubuntu UFW (Firewall) and want to skip those likes:

./repeated.sh | grep -v "UFW BLOCK" | head -n20

Troubleshooting a shell prompt irresponsible that locks/hangs intermittently

You do df -h or ls / and the terminal freezes and not even CTRL + C works, you have a lock.

Normally this is due to a lock of the system trying to perform an IO.

Could be a physical spinning disk failing, but the most probably nowadays is that you have a network mount point and it is timing out.

If you execute mount and you get a timeout, and when you finally see the list you see a NFS, iSCSI or another kind of Network mount (you will see an Ip Address), check for errors.

To do this in CentOS/RHEL you can do as root:

dmesg | grep -i "timed"

or depending on the System

cat /var/log/messages | grep -i "timed"

You’ll get something like this:

[root@compute01 carles]# dmesg -T | grep timed | head -n5
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:45 2020] nfs: server storage07 not responding, timed out

Please note I use dmesg -T in order to have human readable date instead of Unix Epoch.

You can count the errors today:

[root@compute01 carles]# dmesg -T | grep time | grep "Mon Apr 6" | wc --lines
3123

CTOP.py

Current version is v.0.8.6 updated on 2021 June.

Version 0.8.0 add compatibility with Python 2, for older Systems.

Find the source code in: https://gitlab.com/carles.mateo/ctop

Clone it with:

git clone https://gitlab.com/carles.mateo/ctop.git

ctop.py is an Open Source tool for Linux System Administration that I’ve written in Python3. It uses only the System (/proc), and not third party libraries, in order to get all the information required.
I use only this modules, so it’s ideal to run in all the farm of Servers and Dockers:

  • os
  • sys
  • time
  • shutil (for getting the Terminal width and height)

The purpose of this tool is to help to troubleshot and to identify problems with a single view to a single tool that has all the typical indicators.

It provides in a single view information that is typically provided by many programs:

  • top, htop for the CPU usage, process list, memory usage
  • meminfo
  • cpuinfo
  • hostname
  • uptime
  • df to see the free space in / and the free inodes
  • iftop to see real-time bandwidth usage
  • ip addr list to see the main Ip for the interfaces
  • netstat or lsof to see the list of listening TCP Ports
  • uname -a to see the Kernel version

Other cool things it does is:

  • Identifying if you’re inside an Amazon VM, Google GCP, OpenStack VMs, Virtual Box VMs, Docker Containers or lxc.
  • Compatible with Raspberry Pi (tested on 3 and 4, on Raspbian and Ubuntu 20.04LTS)
  • Uses colors, and marks in yellow the warnings and in red the errors, problems like few disk space reaming or high CPU usage according to the available cores and CPUs.
  • Redraws the screen and adjust to the size of the Terminal, bigger terminal displays more information
  • It doesn’t use external libraries, and does not escape to shell. It reads everything from /proc /sys or /etc files.
  • Identifies the Linux distribution
  • Supports Plugins loaded on demand.
  • Shows the most repeated binaries, so you can identify DDoS attacks (like having 5,000 apache instances where you have normally 500 or many instances of Python)
  • Indicates if an interface has the cable connected or disconnected
  • Shows the Speed of the Network Connection (useful for Mellanox cards than can operate and 200Gbit/sec, 100, 50, 40, 25, 10…)
  • It displays the local time and the Linux Epoch Time, which is universal (very useful for logs and to detect when there was an issue, for example if your system restarted, your SSH Session would keep latest Epoch captured)
  • No root required
  • Displays recent errors like NFS Timed outs or Memory Read Errors.
  • You can enforce the output to be in a determined number of columns and rows, for data scrapping.
  • You can specify the number of loops (1 for scrapping, by default is infinite)
  • You can specify the time between screen refreshes, for long placed SSH sessions
  • You can specify to see the output in b/w or in color (default)

Plugins allow you to extend the functionality effortlessly, without having to learn all the code. I provide a Plugin sample for starting lights on a Raspberry Pi, depending on the CPU Load, and playing a message “The system is healthy” or “Warning. The CPU is at 80%”.

Limitations:

  • It only works for Linux, not for Mac or for Windows. Although the idea is to help with Server’s Linux Administration and Troubleshot, and Mac and Windows do not have /proc
  • The list of process of the System is read every 30 seconds, to avoid adding much overhead on the System, other info every second
  • It does not run in Python 2.x, requires Python 3 (tested on 3.5, 3.6, 3.7, 3.8, 3.9)

I decided to code name the version 0.7 as “Catalan Republic” to support the dreams and hopes and democratic requests of the Catalan people, to become and independent republic.

I created this tool as Open Source and if you want to help I need people to test under different versions of:

  • Atypical Linux distributions

If you are a Cloud Provider and want me to implement the detection of your VMs, so the tool knows that is a instance of the Amazon, Google, Azure, Cloudsigma, Digital Ocean… contact me through my LinkedIn.

Monitoring an Amazon Instance, take a look at the amount of traffic sent and received

Some of the features I’m working on are parsing the logs checking for errors, kernel panics, processed killed due to lack of memory, iscsi disconnects, nfs errors, checking the logs of mysql and Oracle databases to locate errors