Category Archives: Troubleshoot

Solving Oracle error ORA 600 [KGL-heap-size-exceeded]

Time ago there was a web page that was rendered in blank for certain group of users.

The errors were coming from an Oracle instance. One SysAdmin restarted the instance, but the errors continued.

Often there are problems due to having two different worlds: Development and Production/Operations.

What works in Development, or even in Docker, may not work at Scale in Production.

That query that works with 100,000 products, may not work with 10,000,000.

I have programmed a lot for web, so when I saw a blank page I knew it was an internal error as the headers sent by the Web Server indicated 500. DBAs were seeing elevated number of errors in one of the Servers.

So I went straight to the Oracle’s logs for that Servers.

I did a quick filter in bash:

cat /u01/app/oracle/diag/rdbms/world7c/world7c2/alert/log.xml | grep "ERR" -B4 -A3

This returned several errors of the kind “ORA 600 [ipc_recreate_que_2]” but this was not the error our bad guy was:

‘ORA 600 [KGL-heap-size-exceeded]’

The XML fragment was similar to this:

<msg time='2016-01-24T13:28:33.263+00:00' org_id='oracle' comp_id='rdbms'
msg_id='7725874800' type='INCIDENT_ERROR' group='Generic Internal Error'
level='1' host_id='' host_addr=''
pid='281279' prob_key='ORA 600 [KGL-heap-size-exceeded]' downstream_comp='LIBCACHE'
errid='726175' detail_path='/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc'>
<txt>Errors in file /u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc  (incident=726175):
ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x14D22C0C30], [0], [524288008], [], [], [], [], [], [], [], []

Just before this error, there was an error with a Query, and the PID matched, so it seemed cleared to me that the query was causing the crash at Oracle level.

Checking the file:


The content was something like this:

<msg time='2016-01-24T13:28:33.263+00:00' org_id='oracle' comp_id='rdbms'
msg_id='7725874800' type='INCIDENT_ERROR' group='Generic Internal Error'
level='1' host_id='' host_addr=''
pid='281279' prob_key='ORA 600 [KGL-heap-size-exceeded]' downstream_comp='LIBCACHE'
errid='726175' detail_path='/u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc'>
<txt>Errors in file /u01/app/oracle/diag/rdbms/world7c/world7c2/trace/world7c2_ora_281279.trc  (incident=726175):
ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x14D22C0C30], [0], [524288008], [], [], [], [], [], [], [], []

Basically in our case, the query that was launched by the BackEnd was using more memory than allowed, which caused Oracle to kill it.

That is a tunnable that you can modify introduced in Oracle 10g.

You can see the current values first:

SQL> select
2 nam.ksppinm NAME,
3 nam.ksppdesc DESCRIPTION,
5 from
6 x$ksppi nam,
7 x$ksppsv val
8 where nam.indx = val.indx and nam.ksppinm like '%kgl_large_heap_%_threshold%';

NAME                              | DESCRIPTION                       | KSPPSTVL
_kgl_large_heap_warning_threshold | maximum heap size before KGL      | 4194304
                                    writes warnings to the alert log
_kgl_large_heap_assert_threshold  | maximum heap size before KGL      | 4194304
                                    raises an internal error

So, _kgl_large_heap_warning_threshold is the maximum heap before getting a warning, and _kgl_large_heap_assert_threshold is the maximum heap before getting the error.

Depending in your case the solution can be either:

  • Breaking your query in several to reduce the memory used
  • Use paginating or LIMIT
  • Set a bigger value for those tunnables.

It will work setting 0 for these to variables, although I don’t recommend it to you, as you want your Server to kill queries that are taking more memory than you want.

To increase the value of , you have to update it. Please note it is in bytes, so for 32MB is 32 * 1024 * 1024, so 33,554,432, and using spfile:

SQL> alter system set "_kgl_large_heap_warning_threshold"=33554432
scope=spfile ;
SQL> shutdown immediate 

SQL> startup
SQL> show parameter _kgl_large_heap_warning_threshold
NAME                               TYPE      VALUE
_kgl_large_heap_warning_threshold | integer | 33554432

Or if using the parameter file, set:


Fixing Wifi not enabled in Linux Kali with Dell XPS laptop

This is a trick I share, as I see many students having problems with this.

Assuming that your Kali distribution is recent (Linux Kernel bigger than Kernel 5.3), the most typical problem student have is that laptops xps from Dell and other brands have a combination of keys to enable or disable the Wifi.

On the Dell xps is on the key PrtScr, so if your Wifi is disabled, you can enable it in Kali Linux press:

CTRL + ALT + Fn + PrtScr

As you can see in the PrtScr the is an icon of Wifi Signal.

The Fn key is on the bottom left, next to Ctrl.

This is a very simple to fix problem, but many people suffer this problem and go crazy trying to update drivers or even having to use an external USB dongle.

A simple Bash one line script to log the temperature of your HDDs and CPUs in Ubuntu

I’ve been helping to troubleshoot the reason one Commodity Server (with no iDrac/Ilo ipmi) is powering off randomly. One of the hypothesis is the temperature.

This is a very simple script that will print the temperature of the HDDs and the CPU and keep to a log file.

First you need to install hddtemp and lm-sensors:

sudo apt install hddtemp lm-sensors

Then this is the one line script, that you should execute as root:

while [ true ]; do date | tee -a /var/log/hddtemp.log; hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd | tee -a /var/log/hddtemp.log; date | tee -a /var/log/cputemp.log; sensors | tee -a /var/log/cputemp.log; sleep 2; done

Feel free to change sleep 2 for the number of seconds you want to wait, like sleep 10.

Press CTRL + C to interrupt the script at any time.

You can execute this inside a screen session and leave it running in Background.

Note that I use tee command, so the output is print to the screen and to the log file.

News from the blog 2020-10-16

  • I’ve been testing and adding more instances to CMIPS. I’m planning on testing the Azure instance with 120 cores.
  • News: Microsoft makes an option to permanently remote work

  • One of my colleagues showed me dstat, a very nice tool for system monitoring, and bandwidth of a drive monitoring. Also ifstat, as complement to iftop is very cool for Network too. This functionality is also available in
  • As I shared in the past news of the blog, I’m resuming my contributions to ZFS Community.

Long time ago I created some ZFS tools that I want to share soon as Open Source.

I equipped myself with the proper Hardware to test on SAS and SATA:

  • 12G Internal PCI-E SAS/SATA HBA RAID Controller Card, Broadcom’s SAS 3008, compatible for SAS 9300-8I.
    This is just an HDA (Host Data Adapter), it doesn’t support RAID. Only connects up to 8 drives or 1024 through expander, to my computer.
    It has a bandwidth of 9,600 MB/s which guarantees me that I’ll be able to add 12 SAS SSD Enterprise grade at almost the max speed of the drives. Those drives perform at 900 MB/s so if I’m using all of them at the same time, like if I have a pool of 8 + 3 and I rebuild a broken drive or I just push Data, I would be using 12×900 = 10,800 MB/s. Close. Fair enough.
  • VANDESAIL Mini-SAS Cables, 1m Internal Mini-SAS to 4x SAS SATA Forward Breakout Cable Hard Drive Data Transfer Cable (SAS Cable).
  • SilverStone SST-FS212B – Aluminium Trayless Hot Swap Mobile Rack Backplane / Internal Hard Drive Enclosure for 12x 2.5 Inch SAS/SATA HDD or SSD, fit in any 3x 5.25 Inch Drive Bay, with Fan and Lock, black
  • Terminator is here.
    I ordered this T-800 head a while ago and finally arrived.

Finally I will have my empty USB keys located and protected. ;)

Remember to be always nice to robots. :)

Fixing problems with audio not sounding after upgrade from Ubuntu 18.04 LTS to 20.04.1 LTS

Two days ago I upgraded my Ubuntu Linux 18.04 LTS Workstation to Ubuntu 20.04.1 LTS and I experienced some audio problems.

Basically I noticed that the system was not playing any sound.

When I checked the audio config I noticed that only the external output of my motherboard was detected, but not the HDMI output from the monitor.

I have a 28″ Asus monitor with speakers embedded.

It didn’t make any sense, so I decided to restart pulseaudio:

pulseaudio -k

That fixed my problem.

However I noticed that when I lock my session, and so the monitor goes off for power saving my HDMI monitor output disappears from the list again.

Repeating the command pulseaudio -k will fix that again.

I checked that the power saving was enabled:

cat /sys/module/snd_hda_intel/parameters/power_save
cat /sys/module/snd_hda_intel/parameters/power_save_controller

I had 1 and Y.

To make the change permanently, I change the power mode settings:

sudo sh -c "echo 0 > /sys/module/snd_hda_intel/parameters/power_save" 
sudo sh -c "echo N > /sys/module/snd_hda_intel/parameters/power_save_controller"

Bash Script: Count repeated lines in the logs

This small script will count repeated patterns in the Logs.

Ideal for checking if there are errors that you’re missing while developing.

#!/usr/bin/env bash
# By Carles Mateo
# Helps to find repeated lines in Logs
if [[ -f "${LOGFILE_MESSAGES}" ]]; then
echo "Using Logfile: ${LOGFILE}"
CMD_OUTPUT=`cat ${LOGFILE} | awk '{ $1=$2=$3=$4=""; print $0 }' | sort | uniq --
count | sort --ignore-case --reverse --numeric-sort`
echo -e "$CMD_OUTPUT"

Basically it takes out the non relevant fields that can prevent from detecting repetition, like the time, and prints the rest.
Then you will launch it like this: | head -n20

If you are checking a machine with Ubuntu UFW (Firewall) and want to skip those likes:

./ | grep -v "UFW BLOCK" | head -n20

Troubleshooting a shell prompt irresponsible that locks/hangs intermittently

You do df -h or ls / and the terminal freezes and not even CTRL + C works, you have a lock.

Normally this is due to a lock of the system trying to perform an IO.

Could be a physical spinning disk failing, but the most probably nowadays is that you have a network mount point and it is timing out.

If you execute mount and you get a timeout, and when you finally see the list you see a NFS, iSCSI or another kind of Network mount (you will see an Ip Address), check for errors.

To do this in CentOS/RHEL you can do as root:

dmesg | grep -i "timed"

or depending on the System

cat /var/log/messages | grep -i "timed"

You’ll get something like this:

[root@compute01 carles]# dmesg -T | grep timed | head -n5
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:45 2020] nfs: server storage07 not responding, timed out

Please note I use dmesg -T in order to have human readable date instead of Unix Epoch.

You can count the errors today:

[root@compute01 carles]# dmesg -T | grep time | grep "Mon Apr 6" | wc --lines

Current version is v.0.8.0 updated on 2021-03-06.

Version 0.8.0 add compatibility with Python 2, for older Systems.

Find the source code in:

Clone it with:

git clone is an Open Source tool for Linux System Administration that I’ve written in Python3. It uses only the System (/proc), and not third party libraries, in order to get all the information required.
I use only this modules, so it’s ideal to run in all the farm of Servers and Dockers:

  • os
  • sys
  • time
  • shutil (for getting the Terminal width and height)

The purpose of this tool is to help to troubleshot and to identify problems with a single view to a single tool that has all the typical indicators.

It provides in a single view information that is typically provided by many programs:

  • top, htop for the CPU usage, process list, memory usage
  • meminfo
  • cpuinfo
  • hostname
  • uptime
  • df to see the free space in / and the free inodes
  • iftop to see real-time bandwidth usage
  • ip addr list to see the main Ip for the interfaces
  • netstat or lsof to see the list of listening TCP Ports
  • uname -a to see the Kernel version

Other cool things it does is:

  • Identifying if you’re inside an Amazon VM, Google GCP, OpenStack VMs, Virtual Box VMs, Docker Containers or lxc.
  • Compatible with Raspberry Pi (tested on 3 and 4, on Raspbian and Ubuntu 20.04LTS)
  • Uses colors, and marks in yellow the warnings and in red the errors, problems like few disk space reaming or high CPU usage according to the available cores and CPUs.
  • Redraws the screen and adjust to the size of the Terminal, bigger terminal displays more information
  • It doesn’t use external libraries, and does not escape to shell. It reads everything from /proc /sys or /etc files.
  • Identifies the Linux distribution
  • Supports Plugins loaded on demand.
  • Shows the most repeated binaries, so you can identify DDoS attacks (like having 5,000 apache instances where you have normally 500 or many instances of Python)
  • Indicates if an interface has the cable connected or disconnected
  • Shows the Speed of the Network Connection (useful for Mellanox cards than can operate and 200Gbit/sec, 100, 50, 40, 25, 10…)
  • It displays the local time and the Linux Epoch Time, which is universal (very useful for logs and to detect when there was an issue, for example if your system restarted, your SSH Session would keep latest Epoch captured)
  • No root required
  • Displays recent errors like NFS Timed outs or Memory Read Errors.
  • You can enforce the output to be in a determined number of columns and rows, for data scrapping.
  • You can specify the number of loops (1 for scrapping, by default is infinite)
  • You can specify the time between screen refreshes, for long placed SSH sessions
  • You can specify to see the output in b/w or in color (default)

Plugins allow you to extend the functionality effortlessly, without having to learn all the code. I provide a Plugin sample for starting lights on a Raspberry Pi, depending on the CPU Load, and playing a message “The system is healthy” or “Warning. The CPU is at 80%”.


  • It only works for Linux, not for Mac or for Windows. Although the idea is to help with Server’s Linux Administration and Troubleshot, and Mac and Windows do not have /proc
  • The list of process of the System is read every 30 seconds, to avoid adding much overhead on the System, other info every second
  • It does not run in Python 2.x, requires Python 3 (tested on 3.5, 3.6, 3.7, 3.8, 3.9)

I decided to code name the version 0.7 as “Catalan Republic” to support the dreams and hopes and democratic requests of the Catalan people, to become and independent republic.

I created this tool as Open Source and if you want to help I need people to test under different versions of:

  • Atypical Linux distributions

If you are a Cloud Provider and want me to implement the detection of your VMs, so the tool knows that is a instance of the Amazon, Google, Azure, Cloudsigma, Digital Ocean… contact me through my LinkedIn.

Monitoring an Amazon Instance, take a look at the amount of traffic sent and received

Some of the features I’m working on are parsing the logs checking for errors, kernel panics, processed killed due to lack of memory, iscsi disconnects, nfs errors, checking the logs of mysql and Oracle databases to locate errors

Upgrading my new HP 14-bp060sa

As the company I was working for, Sanmina, has decided to move all the Software Development to Colorado, US, and closing the offices in Bishopstown, Cork, Ireland I found myself with the need to get a new laptop. At work I was using two Dell laptops, one very powerful and heavy equipped with an Intel Xeon processor and 32 GB of RAM. The other a lightweight one that I updated to 32 GB of RAM.

I had an accident around 8 months ago, that got my spine damaged, and so I cannot carry much weight.

My personal laptops at home, in Ireland are a 15″ with 16 GB of RAM, too heavy, and an Acer 11,6″ with 8GB of RAM and SSD (I upgraded it), but unfortunately the screen crashed. I still use it through the HDMI port. My main computer is a tower with a Core i7, 64GB of RAM and a Samsung NVMe SSD drive. And few Raspberrys Pi 4 and 3 :)

I was thinking about what ultra-lightweight laptop to buy, but I wanted to buy it in Barcelona, as I wanted a Catalan keyboard (the layout with the broken ç and accents). I tried by but I have problems to have shipped the Catalan keyboard layout laptops to my address in Ireland.

I was trying to find the best laptop for me.

While I was investigating I found out that none of the laptops in the market were convincing me.

The ones in around 1Kg, which was my initial target, were too big, and lack a proper full size HDMI port and Gigabit Ethernet. Honestly, some models get the HDMI or the Ethernet from an USB 3.1, through an adapter, or have mini-HDMI, many lack the Gigabit port, which is very annoying. Also most of the models come with 8GB of RAM only and were impossible to upgrade. I enrolled my best friend in my quest, in the research, and had the same conclusions.

I don’t want to have to carry adapters with me to just plug to a monitor or projector. I don’t even want to carry the power charger. I want a laptop that can work with me for a complete day, a full work session, without needing to recharge.

So while this investigation was going on, I decided to buy a cheap laptop with a good trade off of weight and cost, in order to be able to work on the coffee. I needed it for writing documents in Google Docs, creating microservices architectures, programming in Java and PHP, and writing articles in my blog. I also decided that this would be my only laptop with Windows, as honestly I missed playing Star Craft 2, and my attempts with Wine and Linux did not success.

Not also, for playing games :) , there are tools that are only available for Windows or for Mac Os X and Windows, like: POSTMAN, Kitematic for managing dockers visually, vSphere…

(Please note, as I reviewed the article I realized that POSTMAN is available for Linux too)

Please note: although I use mainly Linux everywhere (Ubuntu, CentOS, and RedHat mainly) and I contribute to Open Source projects, I do have Windows machines.

I created my Start up in 2004, and I still have Windows Servers, physical machines in a Data Center in Barcelona, and I still have VMs and Instances in Public Clouds with Windows Servers. Also I programmed some tools using Visual Studio and Visual Basic .NET, ASP.NET and C#, but when I needed to do this I found more convenient spawn an instance in Amazon or Azure and pay for its use.

When I created my Start up I offered my infrastructure as a way to get funding too, and I offered VMs with VMWare. I found that having my Mail Servers in VMs was much more convenient for Backups, cloning, to scale up, to avoid disruption and for Disaster and Recovery.

I wanted a cheap laptop that will not make feel bad if transporting it in a daily basis gets a hit and breaks, or that if it rains (and this happens more than often in Ireland) and it breaks is not super-hurtful, or even if it gets stolen. Yes, I’m from a big city, like is Barcelona, Catalonia, and thieves are a real problem. I travel, so I want a laptop decent enough that I can take to travel, and for going for a coffee, coding anything, and I feel comfortable enough that if something happens to it is not the end of the world.

Cork is not a big city, so the options were reduced. I found a laptop that meets my needs.

I got a HP s14-bp060sa for 439€.

It is equipped with a Intel® Core™ i3-6006U (2 GHz, 3 MB cache, 2 cores) , a 500GB SATA HDD, and 4 GB of DDR4 RAM.

The information on HP webpage is really scarce, but checking other pages I was able to see that the motherboard has 2 memory banks, accepting a max of 16GB of RAM.

I saw that there was an slot, unclear if supporting NVMe SSD drives, but supporting M.2 SSD for sure.

So I bought in Amazon 2x8GB and a M.2 500GB drive.

Since I was 5 years old I’ve been upgrading and assembling by myself all the computers. And this is something that I want to keep doing. It keeps me sharp, knowing the new ports, CPUs, and motherboard architectures, and keeps me in contact with the Hardware. All my life I’ve thought that specializing Software Engineers and Systems Engineers, like if computers were something separate, is a mistake, so I push myself to stay up to date of the news in all the fields.

I removed the spinning 500 GB SATA HDD, cause it’s slow and it consumes a lot of energy. With the M.2 SSD the battery last forever.

The interesting part is how I cloned the drive from the Spinning HDD to the new M.2.

I did:

  • Open the computer (see pics below) and Insert the new drive M.2
  • Boot with an USB Linux Rescue distribution (to do that I had to enable Legacy Boot on BIOS and boot with the USB)
  • Use lsblk command to identify the HDD drive, it was easy as it was the one with partitions
  • dd from if=/dev/sda to of=/dev/sdb with status=progress to see live status and speed (around 70MB/s) and estimated time to complete.
  • Please note that the new drive should be bigger or at least have the same number of bytes to avoid problems with the last partition.
  • I removed the HDD drive, this reduces the weight of my laptop by 100 grams
  • Disable Legacy Boot, and boot the computer. Windows started perfectly :)

I found so few information about this model, that I wanted to share the pictures with the Community. Here are the pictures of the upgrade process.

Here you can see the Crucial M.2 SSD installed and the Spinning HDD removed. Yes, I did in a coffee :)
Final step, installing the 2x8GB RAM memory modules

A sample forensic post mortem for a iSCSI Initiator (client) that had connectivity problems to the Server

My Team in The States report an issue with a Red Hat iSCSI Initiator having issues connecting to a Volume exported by a ZFS Server.

There is an issue on GitLab.

As I always do when I troubleshot a problem, I create a forensics post-mortem document recording everything I do, so later, others can learn how I fix it, or they can learn the steps I did in order to troubleshoot.

Please note: Some Ip addresses have been manually edited.

2019-08-09 10:20:10 Start of the investigation

I log into the Server, with Ip Address: xxx.yyy.16.30. Is an All-Flash-Array Server with RHEL6.10 and DRAID v.08091350.

Htop shows normal/low activity.

I check the addresses in the iSCSI Initiator (client), to make sure it is connecting to the right Server.

[root@Host-164 ~]# ip addr list 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:c5:1e:ea brd ff:ff:ff:ff:ff:ff
    inet xxx.yyy.13.164/16 brd xxx.yyy.255.255 scope global eno1
    valid_lft forever preferred_lft forever
    inet6 fe80::225:90ff:fec5:1eea/64 scope link
    valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 00:25:90:c5:1e:eb brd ff:ff:ff:ff:ff:ff 
4: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 24:8a:07:a4:94:9c brd ff:ff:ff:ff:ff:ff
    inet brd scope global enp3s0f0
    valid_lft forever preferred_lft forever
    inet6 fe80::268a:7ff:fea4:949c/64 scope link
    valid_lft forever preferred_lft forever 
5: enp3s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 24:8a:07:a4:94:9d brd ff:ff:ff:ff:ff:ff
    inet brd scope global enp3s0f1
    valid_lft forever preferred_lft forever inet6 fe80::268a:7ff:fea4:949d/64 scope link
    valid_lft forever preferred_lft forever                                                                                                       

I see the luns on the host, connecting to the 10Gbps of the Server:

[root@Host-164 ~]# iscsiadm -m session
 tcp: [10],1 (non-flash)
 tcp: [11],1 (non-flash)
 tcp: [7],1 (non-flash)
 tcp: [8],1 (non-flash)
 tcp: [9],1 (non-flash)

Finding the misteries…

Executing cat /proc/partitions is a bit strange respect mount:

[root@Host-164 ~]# cat /proc/partitions
 major minor #blocks name
 8  0 125034840 sda
 8  1 512000 sda1
 8  2 124521472 sda2
 253 0 12505088 dm-0
 253 1 112013312 dm-1
 8 32 104857600 sdc
 8 16 104857600 sdb
 8 48 104857600 sdd
 8 64 104857600 sde
 8 80 104857600 sdf

As mount has this:

/dev/sdg1 on /mnt/large type ext4 (ro,relatime,seclabel,data=ordered)

Lsblk shows that /dev/sdg is not present:

[root@Host-164 ~]# lsblk
 sda 8:0 0 119.2G 0 disk
 ├─sda1 8:1 0 500M 0 part /boot
 └─sda2 8:2 0 118.8G 0 part
  ├─rhel-swap 253:0 0 11.9G 0 lvm [SWAP]
  └─rhel-root 253:1 0 106.8G 0 lvm /
 sdb 8:16 0 100G 0 disk
 sdc 8:32 0 100G 0 disk
 sdd 8:48 0 100G 0 disk
 sde 8:64 0 100G 0 disk
 sdf 8:80 0 100G 0 disk

And as expected:

[root@Host-164 ~]# ls -al /mnt/large
 ls: reading directory /mnt/large: Input/output error
 total 0

I see that the Volumes appear to not having being partitioned:

[root@Host-164 ~]# fdisk /dev/sdf
 Welcome to fdisk (util-linux 2.23.2).
 Changes will remain in memory only, until you decide to write them.
 Be careful before using the write command.
 Device does not contain a recognized partition table
 Building a new DOS disklabel with disk identifier 0xddf99f40.
 Command (m for help): p
 Disk /dev/sdf: 107.4 GB, 107374182400 bytes, 209715200 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 512 bytes
 I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disk label type: dos
 Disk identifier: 0xddf99f40
 Device Boot
 Blocks Id System
 Command (m for help): q

I create a partition and format with ext2

[root@Host-164 ~]# mke2fs /dev/sdb1
 mke2fs 1.42.9 (28-Dec-2013)
 Filesystem label=
 OS type: Linux
 Block size=4096 (log=2)
 Fragment size=4096 (log=2)
 Stride=0 blocks, Stripe width=0 blocks
 6553600 inodes, 26214144 blocks
 1310707 blocks (5.00%) reserved for the super user
 First data block=0
 Maximum filesystem blocks=4294967296
 800 block groups
 32768 blocks per group, 32768 fragments per group
 8192 inodes per group
 Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872
 Allocating group tables: done
 Writing inode tables: done
 Writing superblocks and filesystem accounting information: done

I mount:

[root@Host-164 ~]# mount /dev/sdb1 /mnt/vol1

I fill the volume from the client, and it works. I check the activity in the Server with iostat and there are more MB/s written to the Server’s drives than actually speed copying in the client.

I completely fill 100GB but speed is slow. We are working on a 10Gbps Network so I expected more speed.

I check the connections to the Server:

[root@obs4602-1810 ~]# netstat | grep -v "unix"
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0 xxx.yyy.18.10:ssh            xxx.yyy.12.154:57137         ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0 xxx.yyy.18.10:ssh            xxx.yyy.12.154:56395         ESTABLISHED
tcp        0      0 xxx.yyy.18.10:ssh            xxx.yyy.14.52:57330          ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0 xxx.yyy.18.10:ssh            xxx.yyy.12.154:57133         ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED
tcp        0      0        ESTABLISHED

I see many connections from Host 180, I check that and another member of the Team is using that client to test with vdbench against the Server.

This explains the slower speed I was getting.


  1. There was a local problem on the Host. The problems with the disconnection seem to be related to a connection that was lost (sdg). All that information was written to iSCSI buffer, not to the Server. In fact, that volume was mapped in the system with another letter, sdg was not in use.
  2. Speed was slow due to another client pushing Data to the Server too
  3. Windows clients with auto reconnect option are not reporting timeout reports while in Red Hat clients iSCSI connection timeouts. It should be increased

2020-03-10 22:16 IST TIP: At that time we were using Google suite and Skype to communicate internally with the different members across the world. If we had used a tool like Slack, and we had a channel like #engineering for example or #sanjoselab, then I could have paged and asked “Is somebody using obs4602-1810?