Troubleshoot: fatal: open lock file /var/lib/postfix/master.lock: unable to set exclusive lock

Error Message & Trace details:

One of my customer came with an error saying the postfix in his server isn’t working. The server was running CentOS 7, and the system postfix status was inactive, means not running. Although, the system queue was running I could see. The error that was returning while restarting/checking status was the following:

# service postfix status
Redirecting to /bin/systemctl status postfix.service
● postfix.service - Postfix Mail Transport Agent
Loaded: loaded (/usr/lib/systemd/system/postfix.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2018-01-09 04:04:05 UTC; 1s ago
Process: 9201 ExecStart=/usr/sbin/postfix start (code=exited, status=1/FAILURE)
Process: 9197 ExecStartPre=/usr/libexec/postfix/chroot-update (code=exited, status=0/SUCCESS)
Process: 9194 ExecStartPre=/usr/libexec/postfix/aliasesdb (code=exited, status=0/SUCCESS)
Main PID: 1358 (code=killed, signal=TERM)

Jan 09 04:04:03 twin7.hifrank.biz systemd[1]: Starting Postfix Mail Transport Agent...
Jan 09 04:04:03 twin7.hifrank.biz postfix/master[9273]: fatal: open lock file /var/lib/postfix/master.lock: unable to set exclusive lock: Resource tempo...vailable
Jan 09 04:04:04 twin7.hifrank.biz postfix/master[9272]: fatal: daemon initialization failure
Jan 09 04:04:05 twin7.hifrank.biz postfix/postfix-script[9274]: fatal: mail system startup failed
Jan 09 04:04:05 twin7.hifrank.biz systemd[1]: postfix.service: control process exited, code=exited status=1
Jan 09 04:04:05 twin7.hifrank.biz systemd[1]: Failed to start Postfix Mail Transport Agent.
Jan 09 04:04:05 twin7.hifrank.biz systemd[1]: Unit postfix.service entered failed state.
Jan 09 04:04:05 twin7.hifrank.biz systemd[1]: postfix.service failed.

How to fix:

The error to note here is the following:

fatal: open lock file /var/lib/postfix/master.lock

I first killed the smtp and smtpd processes that runs by postfix:

# killall -9 smtp
# killall -9 smtpd

But that didn’t solve the problem. I then used the fuser command to check which process holds the lock file:

# fuser /var/lib/postfix/master.lock
/var/lib/postfix/master.lock: 18698

Then we check the process 18698 and kill the responsible process:

# ps -axwww|grep 18698
9333 pts/0 S+ 0:00 grep --color=auto 18698

18698 ? Ss 4:28 /usr/libexec/postfix/master -w
# killall -9 /usr/libexec/postfix/master
or
# kill -9 18698

Once the process is killed, you can now start the postfix:

# service postfix start
# service postfix status|grep Active
Redirecting to /bin/systemctl status postfix.service
Active: active (running) since Tue 2018-01-09 04:15:50 UTC; 4min 45s ago

Troubleshoot: killall command not found in centos 7

Problem:

If you are using a centos 7 minimal installation, and trying to kill process by name using the command ‘killall’, you are most likely going to see the error:

# killall -9 php
killall: command not found

The error appears because CentOS 7 encourages you to use pkill instead of killall to kill process by name. pkill has versatile application, although, it can be used to kill process by name same as killall.

How to kill process by name in Centos 7

You can use pkill. pkill is a simple command. It’s syntax is as following:

# pkill processname

For example, if you want to kill all the php process, run:

# pkill php

Note: It will kill all the processes that match php. To list the process that pkill going to kill, you can use pgrep as following:

# pgrep -l php

How to use killall in Centos 7

If you do not want to use pkill, and keep using killall commands in centos 7, this is also possible. killall is a part of psmisc yum package. All you have to do, is to install psmisc in your system using yum

# yum install psmisc
# killall -9 php

 

Troubleshooting: Imunify360 database is corrupt. Application cannot run with corrupt database

Error Message:

# service imunify360 start
Starting imunify360: WARNING [+ 3743ms] defence360agent.utils.check_db|DatabaseError detected: database disk image is malformed
WARNING [+ 3766ms] defence360agent.cli.subparsers.common.server|Imunify360 database is corrupt. Application cannot run with corrupt database. Please, contact Imunify360 support team at https://cloudlinux.zendesk.com

Detail Information & Explanation:

If you are using imunify360, an application firewall for linux servers by Cloudlinux team, you might incur an error where it says the database is corrupt. You might first see ‘Imunify360 is not started’ error from the WHM panel and end up getting the above error message as stated. Imunify360 uses a SQL database, located under ‘/var/imunify360/imunify360.db’. This image is checked everytime Imunfi360 tries to start, and if the database is malformed, it would not start. Fortunately, imunify360 comes with tools to handle this database and recover if corrupted.

How to Fix:

First, we start by running database integrity check. This can be done using the following:

imunfiy360-agent checkdb

(From Imunify360 Doc: checkdb  – Check database integrity)

Once done, you can now use ‘migratedb’ to repair and restore if the database is corrupted.

imunify360-agent migratedb

(From Imunify360 Doc: migratedb – Check and repair database if it is corrupted.)

If migratedb fails, the only way to recover this is to reinstall imunify360.

Linux: Assertion failed on job for iptables.service.

If you are using Centos 7 or RHEL 7 or any of it’s variant, you are probably using ‘Firewalld’ by default. Although, if you are a iptables fan like me, who likes it’s simplicity and manipulative nature instead of a full form firewall, then you probably have disabled firewalld from your CentOS 7 instance and using iptables. There are couple of servers, where I use runtime iptables rules for postrouting and masquerading. These rules are dynamically generated by my scripts instead of the sysconfig file under:

/etc/sysconfig/iptables

This file is generated upon running the iptables save command:

service iptables save

which I rarely do so.

Error Details

Which is why, I don’t have a /etc/sysconfig/iptables file in those servers and a common error I see while restarting iptables in those system is the following:

# systemctl restart iptables.service
Assertion failed on job for iptables.service.

How to Fix The Error

The error appears because you don’t have any rule in /etc/sysconfig/iptables or the file doesn’t exist either. You can ignore the error as iptables would still run. To eradicate the error, simply make sure you have some iptables rules loaded on your system using the status command:

iptables -S

And then, run:

service iptables save

Once done, restarting iptables shouldn’t show the error any longer.

SMTP Error: 550 Please turn on SMTP Authentication in your mail client – IP is not permitted to relay through this server without authentication

We had a customer complaining about a commonly seen error of the following type:

550 Please turn on SMTP Authentication in your mail client. mail-pf0-f172.google.com [209.85.192.172]:38632 is not permitted to relay through this server without authentication.

Diagnostic-Code: smtp; 550-Please turn on SMTP Authentication in your mail client. 550-mail-pf0-f172.google.com [209.85.192.172]:38632 is not permitted to relay 550 through this server without authentication.

reason: 550-Please turn on SMTP Authentication in your mail client.
550-mout.kundenserver.de [212.227.17.24]:49392 is not permitted to relay
550 through this server without authentication.

They were all basically the same error. This is a common error and the solution is pretty simple as it looks like. Enabling ‘SMTP Authentication’ on the outlook or the mail client should solve the problem. But interestingly, the client was smart and he wasn’t doing any mistake with ‘SMTP authentication’. The error was actually showing up when someone was trying to send the mail to him (As a receiver SMTP). We then tried digging the error further.

There is something we need to remember. SMTP is not only authenticated using username and password, it also goes through a dns authentication check too. If your dkim/domainkeys/spf/dmarc do not match as the mail server has advised, the mail will get denied with the same type of error (Error code 550). We then realized the customer account was transfered earlier from a different server and the old domainkeys were still there in it’s DNS zone file. As domainkeys are RSA keys generated per server, it is important to regenerate the keys after the server change. Otherwise, the old key check through the DNS can trigger the 550 error from the receiver relay. We had deleted and generated a new domainkeys for the customer and the error went off.

phpMyAdmin Coming Blank in Cpanel

One of the customer reported an issue related to phpMyAdmin earlier today. He was getting a blank page of phpmyadmin that only says “Welcome to phpMyAdmin”

Once I hoped into the ssh and checked the cpanel error log file located under

/usr/local/cpanel/logs/error_log

I observed the following error:

PHP Fatal error: require_once(): Failed opening required './libraries/display_select_lang.lib.php' (include_path='/usr/local/cpanel/3rdparty/php/56/lib/php:.') in /usr/local/cpanel/base/3rdparty/phpMyAdmin/libraries/plugins/auth/AuthenticationCpanel.php on line 147

The error was peculiar because display_select_lang.lib.php wasn’t available in any other cpanel phpmyadmin source files I searched. Then I realized “AuthenticationCpanel.php” mentions the error which usually because Cpanel Authentication wasn’t done properly with the MySQL. Cpanel pass wasn’t synced with the MySQL.

Going to WHM >> Password Modification >> If you select the user and WHM shows you the ‘Sync with MySQL Password’ option, that means the MySQL password is outdated to cpanel and requires syncing (NB: If the password doesn’t require syncing, this option won’t be there). You can reset the pass and check the option to Sync the new pass with MySQL. That should restore your phpmyadmin.

What is Kondemand? Why do I see a lot of Kondemand process in my process list?

Question: What is Kondemand? Why do I see a lot of Kondemand process in my process list?

Answer: Kondemand is the process used for automatic CPU scaling on multi core linux system. It automatically reduce/drops the CPU clock speed to power usage when the CPU is not in use. This is done through scaling_governor available on linux. To see if your scaling_governor is set to ‘ondemand’ or not, you may use the following command:

# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

If your CPU is showing ‘ondemand’ scanling governor then the kondemand kernel process is active and will reduce your CPU clock speed on fly to reduce power usage. You can change this settings to performance on fly using the following small shell code:

for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done

There is a linux service called CPUSpeed, this can tune your scaling governor back to ondemand after the reboot. You may shut it down:

service cpuspeed stop
chkconfig off cpuspeed

You may check your CPU speed is restored to the original through the proc filesystem:

cat /proc/cpuinfo

[Tue Dec 19 20:40:07.097202 2017] [lsapi:error] [pid 532140:tid 139848266454784] [client IP.IP.IP.IP:25021] mod_lsapi: [host domain.com] [req GET / HTTP/1.1] Could not connect to lsphp backend: connect to lsphp refused: 111 (possibly memory limit for LVE ID 1789 too small), referer: domain.com

If you are running Cloudlinux, cagefs and lsapi with cpanel, you are probably familiar with the error. If the error is appearing for one or two sites, then it is probably because the user is hitting the VM/PM limit you have set through Cloudlinux. But if the error is appearing for all the sites, then it is because the cagefs fails with the suexec permission for some reason.

[Tue Dec 19 20:40:07.097202 2017] [lsapi:error] [pid 532140:tid 139848266454784] [client *:25021] mod_lsapi: [host *] [req GET / HTTP/1.1] Could not connect to lsphp backend: connect to lsphp refused: 111 (possibly memory limit for LVE ID 1789 too small), referer: http://*/

One way to solve the problem is to remount all the users. Sometimes, it doesn’t work and you may require to reinitialize cagefs again:

cagefsctl –remount-all
cagefsctl -r

I have seen times, when nothing works, but reinstalling cagefs does the trick. If cagefs doesn’t work, you may try disabling virtual memory from the CloudLinux LVE manager to see if that fix the problem. CloudLinux also has a known Virtual Memory 503 error issue with LSAPI.

error: Unable to create cgroup for vm**: No such file or directory

The error can appear for any type of KVM VM installation like Virtualizor  or SolusVM or Proxmox. You may face the same error if you are simply using virtmanager to manage your KVM installation. The error appear when you try to start/create the VM from the xml file:

[root@vps8 addvs]# virsh create /etc/libvirt/qemu/v1015.xml
error: Failed to create domain from /etc/libvirt/qemu/v1015.xml
error: Unable to create cgroup for v1015: No such file or directory

It appears because for some CentOS starts unmounting the cgroup and breaks libvirt. Easy way to fix this is to restart libvirtd:

service libvirtd restart

The error is more common in CentOS 7 than CentOS 6, as systemd is known to have the bug:

https://bugzilla.redhat.com/show_bug.cgi?id=678555

device-mapper: cache: You have created a cache device with a lot of individual cache blocks

The error would be similar to the following to be exact:

[172593.817178] device-mapper: cache: You have created a cache device with a lot of individual cache blocks (3276800)
[172593.817182] All these mappings can consume a lot of kernel memory, and take some time to read/write.
[172593.817185] Please consider increasing the cache block size to reduce the overall cache block count.

It usually appears because you have created a large cache pool, while using a small chunk size.

Here is what to be said about chunk size in lvmcache2 manual:

The size of data blocks managed by a cache pool can be specified with
the –chunksize option when the cache LV is created. The default
unit is KiB. The value must be a multiple of 32KiB between 32KiB and
1GiB.

Using a chunk size that is too large can result in wasteful use of
the cache, where small reads and writes can cause large sections of
an LV to be mapped into the cache. However, choosing a chunk size
that is too small can result in more overhead trying to manage the
numerous chunks that become mapped into the cache. Overhead can
include both excessive CPU time searching for chunks, and excessive
memory tracking chunks.

Basically chunk size is determining the block size at which your caches are going to be stored. To find your current chunk size run:

lvs -o+chunksize vgname/cachevolumename

Should return something like the following:

#lvs -o+chunksize vg0/newvz
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Chunk
newvz vg0 Cwi-aoC— 1.50t [cache] 14.14 0.75 5.12 64.00k

A good chunk size for a large cache device of size 200GB+ would be 256k. You may set the chunk size while setting up the device :

lvcreate –type cache-pool –chunksize 256K -L CacheSize -n CachePoolLV VG FastPVs