Linux Cluster setup
http://oboguev.net/kernel-etc/linux-cluster-setup.html
Helpful reading:
https://alteeve.ca/w/AN!Cluster_Tutorial_2
https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial_-_Archive
RedHat 7 documentation:
    RHEL7 High Availability Add-On Administration
    High Availability Add-On Reference
    Global File System 2
    Load Balancer Administration
http://clusterlabs.org
http://clusterlabs.org/quickstart-redhat.html
http://clusterlabs.org/quickstart-ubuntu.html
http://clusterlabs.org/quickstart-suse.html
http://clusterlabs.org/doc
http://clusterlabs.org/faq.html
SUSE documentation
"Pro Linux High Availability Clustering" (Kindle)
"CentOS High Availability" (Kindle)
http://corosync.org
https://alteeve.ca/w/Corosync
google: corosync totem
google: OpenAIS
Older (CMAN-based) clusters included:
/etc/cluster/cluster.conf => corosync.conf + cib.xml
system-config-cluster or conga (luci + ricci) configuration UI => replaced by (still deficient) pcs-gui on port 2224
rgmanager => pacemaker
ccs => pcs

Set up a Corosync/Pacemaker cluster named vc composed of three nodes (vc1, vc2, vc3).
Based on Fedora Server 22.
Warning: a bug in the virt-manager Clone command may destroy the AppArmor profile on both the source and target virtual machines.
Replicate virtual machines manually, or at least back up the source machine profile (located in /etc/apparmor.d/libvirt).
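For example, a minimal precaution before cloning (a sketch; assumes the default profile location mentioned above and root access on the VM host):

    # back up libvirt AppArmor profiles before using virt-manager Clone
    mkdir -p /root/apparmor-libvirt-backup
    rsync -av /etc/apparmor.d/libvirt/ /root/apparmor-libvirt-backup/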
Network set-up:
It is desirable to set up separate network cards for general internet traffic, SAN traffic and cluster backchannel traffic.
Ideally, interfaces should be link-aggregated (bonded or teamed) pairs, with each link in a pair connected to separate stacked switches.
- backchannel/cluster network
    - can be two sub-nets (on separate interfaces) with a corosync redundant ring configured through them (see the corosync.conf sketch below)
    - however a bonded interface is easier to set up, more resilient to failures, and allows traffic for other components to be fail-safe too
    - it is also possible to bind multiple addresses to the bonded interface and set up a corosync redundant ring among them - but it does not make sense
- SAN network
    - can be two sub-nets (on separate interfaces), with iSCSI multi-pathing configured between them
    - however can also be bonded: either utilizing one sub-net for all SAN traffic (with disks dual-ported between iSCSI portals within the same sub-net, but different addresses), or binding multiple sub-nets to the bonded interface (with disks dual-ported between iSCSI portals located on different sub-nets)
- general network
    - better be bonded, so each node can be conveniently accessed by a single IP address
    - however a load balancer can instead be configured to use multiple addresses for a node
It makes sense to use dual-port network cards and scatter general/SAN/cluster traffic ports between them, so a card failure does not bring down the whole network category.
If interfaces are bonded or teamed (rather than configured for separate sub-nets), switches should allow cross-traffic, i.e. be either stackable (preferably) or have ISL/IST (inter-switch link/trunking, aka SMLT/DSMLT/R-SMLT). 802.1aq (Shortest Path Bridging) support may be desirable. See here.
Note that an IPMI (AMT/SOL) interface cannot be included in a bond or team without losing its IPMI capability, since it ceases to be individually addressable (having its own IP address).
Thus if IPMI is to be used for fencing or remote management, the IPMI port is to be left alone.
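For reference, a minimal corosync.conf sketch of the redundant-ring variant mentioned in the backchannel list above (a sketch only; the 10.0.1.0/24 and 10.0.2.0/24 sub-nets and node addresses are made-up examples, and corosync 2.x syntax is assumed):

    # /etc/corosync/corosync.conf (fragment)
    totem {
        version: 2
        cluster_name: vc
        transport: udpu
        rrp_mode: passive            # redundant ring protocol: use ring 1 only when ring 0 fails
    }
    nodelist {
        node {
            ring0_addr: 10.0.1.11    # vc1 on the first backchannel sub-net
            ring1_addr: 10.0.2.11    # vc1 on the second backchannel sub-net
            nodeid: 1
        }
        node {
            ring0_addr: 10.0.1.12
            ring1_addr: 10.0.2.12
            nodeid: 2
        }
        node {
            ring0_addr: 10.0.1.13
            ring1_addr: 10.0.2.13
            nodeid: 3
        }
    }
    quorum {
        provider: corosync_votequorum
    }

pcs can generate an equivalent configuration when nodes are given to "pcs cluster setup" as name,alternate-name pairs.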
For a real physical NIC, can identify port with
ethtool --identify ethX [10] => flashes LED 10 times
When hosting cluster nodes in KVM, create KVM macvtap interfaces (virtio/Bridge).
Bond interfaces:
Note that bonded/teamed interfaces in most setups do not provide increased data speed or increased bandwidth from one node to another. They provide failover and may provide an increased aggregate bandwidth for concurrent connections to multiple target hosts (but not to the same target host). However, see further down below.
Use network manager GUI:
"+" -> select Bond
Add -> Create -> Ethernet -> select eth0
Add -> Create -> Ethernet -> select eth1
Link Monitoring:
    MII => check media state
    ARP => use ARP to "ping" specified IP addresses (comma-separated),
           at least one responds -> link ok (can also configure to require all to respond)
Mode = 802.3ad => if linked to a real switch (802.3ad-compliant peer)
       Adaptive load balancing => otherwise (if connected directly or via a hub, not a switch)
Monitoring frequency = 100 ms
Or create files:
/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
NAME=bond0
TYPE=Bond
ONBOOT=yes
BONDING_MASTER=yes
BOOTPROTO=none
#DEFROUTE=yes
#IPV4_FAILURE_FATAL=no
#UUID=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
#BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-rr"
BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-alb"
IPADDR=223.100.0.10
PREFIX=24
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_PEERDNS=yes
#IPV6_PEERROUTES=yes
#IPV6_PRIVACY=no
/etc/sysconfig/network-scripts/ifcfg-bond0_slave_1
HWADDR=52:54:00:9C:32:50
TYPE=Ethernet
NAME="bond0 slave 1"
#UUID=97b83c1b-de26-43f0-91e7-885ef758d0ec
ONBOOT=yes
MASTER=bond0
#MASTER=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
SLAVE=yes
/etc/sysconfig/network-scripts/ifcfg-bond0_slave_2
HWADDR=52:54:00:CE:B6:91
TYPE=Ethernet
NAME="bond0 slave 2"
#UUID=2bf74af0-191a-4bf3-b9df-36b930e2cc2f
ONBOOT=yes
MASTER=bond0
#MASTER=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
SLAVE=yes
Then apply:
nmcli device disconnect <ifname>
nmcli connection reload [ifname]
nmcli connection up <ifname>
route -n => must go to bond, not slaves
also make sure default route is present
if not, add to /etc/sysconfig/network: GATEWAY=xx.xx.xx.xx
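Alternatively, the bond can be created with nmcli rather than by editing ifcfg files (a sketch; assumes the eth0/eth1 slaves and the address used above, NetworkManager then writes the corresponding ifcfg files):

    nmcli connection add type bond con-name bond0 ifname bond0 mode balance-alb
    nmcli connection modify bond0 bond.options "mode=balance-alb,miimon=100"
    nmcli connection modify bond0 ipv4.method manual ipv4.addresses 223.100.0.10/24
    nmcli connection add type bond-slave con-name "bond0 slave 1" ifname eth0 master bond0
    nmcli connection add type bond-slave con-name "bond0 slave 2" ifname eth1 master bond0
    nmcli connection up bond0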
To team interfaces:
dnf install -y teamd NetworkManager-team
then configure the team interface with the NetworkManager GUI
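A team can also be created directly with nmcli (a sketch; the activebackup runner, ethtool link watcher and eth0/eth1 slaves are example choices):

    nmcli connection add type team con-name team0 ifname team0
    nmcli connection modify team0 team.config '{"runner": {"name": "activebackup"}, "link_watch": {"name": "ethtool"}}'
    nmcli connection modify team0 ipv4.method manual ipv4.addresses 223.100.0.10/24
    nmcli connection add type team-slave con-name team0-slave-1 ifname eth0 master team0
    nmcli connection add type team-slave con-name team0-slave-2 ifname eth1 master team0
    nmcli connection up team0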
Bonded/teamed interfaces in most setups do not provide increased data speed or increased bandwidth from one node to another. They provide failover and may provide an increased aggregate bandwidth for concurrent connections to multiple target hosts (but not to the same target host). However, there are a couple of workarounds:

Option 1:
Use bonding mode=4 (802.3ad)
    lacp_rate=0
    xmit_hash_policy=layer3+4
The latter hashes using src-(ip,port) and dst-(ip,port).
Still not good for a single connection.
Option 2:
Create a separate VLAN for each port (on each of the nodes) and use bonding mode = Adaptive load balancing.
Then an LACP-compliant bridge will consider the links separate and won't try to correlate the traffic and direct it via a single link according to xmit_hash_policy.
However this will somewhat reduce failover capacity: for example if Node1.LinkVLAN1 and Node2.LinkVLAN2 both fail.
It also requires that all peer systems (such as iSCSI servers, iSNS, etc.) have their interfaces configured according to the same VLAN scheme (see the nmcli sketch below).
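A per-port VLAN of this kind can be created with nmcli (a sketch; the VLAN id 10, parent eth0 and address are made-up examples; repeat with a different id for the other port):

    nmcli connection add type vlan con-name eth0.10 ifname eth0.10 dev eth0 id 10
    nmcli connection modify eth0.10 ipv4.method manual ipv4.addresses 10.10.0.11/24
    nmcli connection up eth0.10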
Remember to enable jumbo frames: ifconfig ethX mtu 9000.
Prepare:
Names vc1, vc2 and vc3 below are for the cluster backchannel.
On each node:
# set node name
hostnamectl set-hostname vcx
# disable "captive portal" detection in Fedora
dnf install -y crudini
crudini --set /etc/NetworkManager/conf.d/21-connectivity-local.conf connectivity interval 0
systemctl restart NetworkManager
Cluster shells
Install
dnf install -y pdsh clustershell
To use pdsh:
#non-interactive:
pdsh -R exec -f 1 -w vc1,vc2,vc3 cmd | dshbak
pdsh -R exec -f 1 -w vc[1-3] cmd | dshbak
#interactive:
pdsh -R exec -f 1 -w vc1,vc2,vc3
pdsh -R exec -f 1 -w vc[1-3]
cmd substitution:
%h => remote host name
%u => remote user name
%n => 0, 1, 2, 3 ...
%% => %
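For example, the %h substitution lets the exec rcmd module wrap an arbitrary transport (a sketch; assumes the password-less ssh set up below):

    # run "uname -r" on every node over ssh, collating identical output
    pdsh -R exec -f 1 -w vc[1-3] ssh -o BatchMode=yes %h uname -r | dshbak -c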
To set up for clush, first enable password-less ssh.
Clumsy way:
ssh vc1
    ssh-keygen -t rsa
    ssh vc1 mkdir -p .ssh
    ssh vc2 mkdir -p .ssh
    ssh vc3 mkdir -p .ssh
    ssh vc1 chmod 700 .ssh
    ssh vc2 chmod 700 .ssh
    ssh vc3 chmod 700 .ssh
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
ssh vc2
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
ssh vc3
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
Cleaner way:
Create id_rsa.pub, id_rsa and authorized_keys on one node,
then replicate them to the other nodes in the cluster.
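A sketch of that cleaner way (run on vc1 as root; the loop over vc2/vc3 matches the node names above):

    ssh-keygen -t rsa                              # accept defaults
    cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys    # the key authorizes itself
    for node in vc2 vc3; do
        rsync -av ~/.ssh/ ${node}:.ssh/            # replicate id_rsa, id_rsa.pub, authorized_keys
        ssh ${node} chmod 700 .ssh
    done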
To use clush:
clush -w vc1,vc2,vc3 -b [cmd]
clush -w vc[1-3] -b [cmd]
Basic cluster install:
On each node:
dnf install -y pcs fence-agents-all fence-agents-virsh resource-agents pacemaker
Optional: dnf install -y dlm lvm2-cluster gfs2-utils iscsi-initiator-utils lsscsi httpd wget

# either use firewalld and open the high-availability service:
systemctl start firewalld.service
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
# or disable the firewall altogether:
systemctl stop firewalld.service
iptables --flush

## optionally disable SELinux:
#setenforce 0
#edit /etc/selinux/config and change SELINUX=enforcing => SELINUX=permissive

passwd hacluster
systemctl start pcsd.service
systemctl enable pcsd.service
# make sure no http_proxy is exported
pcs cluster auth vc1.example.com vc2.example.com vc3.example.com -u hacluster -p xxxxx --force
e.g. pcs cluster auth vc1 vc2 vc3 -u hacluster -p abc123 --force
# created auth data is stored in /var/lib/pcsd
On one node:
pcs cluster setup [--force] --name vc vc1.example.com vc2.example.com vc3.example.com
pcs cluster start --all
to stop: pcs cluster stop --all
On each node:
# to auto-start cluster on reboot
# alternatively can manually do "pcs cluster start" on each reboot
pcs cluster enable --all
to disable: pcs cluster disable --all
View status:
pcs status
pcs cluster status
pcs cluster pcsd-status
systemctl status corosync.service
journalctl -xe
cibadmin --query
pcs property list [--all] [--defaults]
corosync-quorumtool -oi [-i]
corosync-cpgtool
corosync-cmapctl [ | grep members]
corosync-cfgtool -s
pcs cluster cib
Verify current configuration
crm_verify --live --verbose
Start/stop node
pcs cluster stop vc2
pcs status
pcs cluster start vc2
pcs status
Disable/enable hosting resources on the node (standby state)
pcs cluster standby vc2
pcs status
pcs cluster unstandby vc2
pcs status
"Transactional"configuration:
pcs clustercib my.xml # get a copy ofCIB to my.xml
pcs -f my.xml ... change command ... #make changes of config in my.xml
crm_verify --verbose --xml-file=q.xml # verifyconfig
pcs cluster cib-push my.xml # push config from my.xml to CIB
Configure STONITH
fence_virsh - fences a machine via ssh to the VM host, executing sudo virsh destroy <vmid> or
sudo virsh reboot <vmid>
Alternative to virsh: fence_virt/fence_xvm
dnf install -y fence-virt
STONITH is needed:
- In resource (non-quorum) based clusters, for obvious reasons
- In two-node clusters without a quorum disk (a special case of the above), for obvious reasons
- In quorum-based clusters, because Linux clustering solutions including Corosync and CMAN run as user-level processes and are unable to interdict user-level and kernel-level activity on the node when a cluster node loses connection to the majority-votes partition. By comparison, in VMS CNXMAN is a kernel component which makes all CPUs spin in IOPOST by requeueing the request to the tail of the IOPOST queue until quorum is restored and the node re-joins the majority partition. During this time, no user-level processes can execute, and no new IO can be initiated, except the controlled IO to the quorum disk and SCS datagrams by CNXMAN. When connection to the majority partition is restored, mount verification is further executed, and all file system requests are held off until mount verification completes. If a node restores connection to the majority partition and detects a new incarnation of the cluster, the node executes a bugcheck to reboot.
Configure virsh STONITH
On the vm host:
define user stonithmgr
add it to sudoers as
    stonithmgr ALL=(ALL) NOPASSWD: ALL
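A sketch of that setup on the VM host (the /etc/sudoers.d file name is an example; a tighter entry could be restricted to the virsh binary only):

    useradd -m stonithmgr
    passwd stonithmgr
    echo 'stonithmgr ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/stonithmgr
    chmod 440 /etc/sudoers.d/stonithmgr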
On a cluster node:
pcs stonith list
pcs stonith describe fence_virsh
man fence_virsh
fence_virsh -h

# test
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --action=metadata
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=status
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=list
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=monitor
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=off

create file /root/stonithmgr-passwd.sh as
    #!/bin/sh
    echo "vc-cluster-passwd"
chmod 755 /root/stonithmgr-passwd.sh
rsync -av /root/stonithmgr-passwd.sh vc2:/root
rsync -av /root/stonithmgr-passwd.sh vc3:/root

for node in vc1 vc2 vc3; do
    pcs stonith delete fence_${node}_virsh
    pcs stonith create fence_${node}_virsh \
        fence_virsh \
        priority=10 \
        ipaddr=${node}-vmhost \
        login=stonithmgr passwd_script="/root/stonithmgr-passwd.sh" sudo=1 \
        port=${node} \
        pcmk_host_list=${node}
done
pcmk_host_list => vc1.example.com
port => vm name in virsh
ipaddr => name of machine hosting vm
delay=15 => delay for execution of fencing action
STONITH commands:
pcs stonith show --full
pcs stonith fence vc2 --off
pcs stonith confirm vc2
pcs stonith delete fence_vc1_virsh
Reading:
https://alteeve.ca/w/Anvil!_Tutorial_3#Fencing_using_fence_virsh
https://www.centos.org/forums/viewtopic.php?f=48&t=50904
https://www.ibm.com/developerworks/community/blogs/mhhaque/entry/configure_two_node_highly_available_cluster_using_kvm_fencing_on_rhel7
http://www.hpuxtips.es/?q=content/part6-fencing-fencevirsh-my-study-notes-red-hat-certificate-expertise-clustering-and-storage
Management GUI:
https://vc1:2224
log in as hacluster
Management GUI, Hawk:
Essential files:
/etc/corosync/corosync.conf
/etc/corosync/corosync.xml
/etc/corosync/authkey
/var/lib/pacemaker/cib/cib.xml (do not edit manually)
/etc/sysconfig/corosync
/etc/sysconfig/corosync-inotifyd
/etc/sysconfig/pacemaker
/var/log/pacemaker.log
/var/log/corosync.log (but by default sent to syslog)
/var/log/pcsd/...
/var/log/cluster/...
/var/log/syslog
or on new Fedora:
journalctl --boot -x
journalctl --list-boots
journalctl --follow -x
journalctl --all -x
journalctl -xe
Man pages:
man corosync.conf
man corosync.xml
man corosync-xmlproc
man corosync_overview
man corosync
man corosync-cfgtool
man quorum_overview // quorum library
man votequorum_overview // ...
man votequorum // quorum configuration
man corosync-quorumtool
man cibadmin
man cmap_overview // corosync config registry
man cmap_keys
man corosync-cmapctl
man sam_overview // library to register processfor a restart on failure
man cpg_overview // closed group messaginglibrary w/virtual synchrony
man corosync-cpgtool
man corosync-blackbox // dump protocol"blackbox" data
man qb-blackbox
man ocf-tester
man crmadmin
man gfs2
man tunegfs2
Essential processes:
corosync      | totem, membership and quorum manager, messaging |
cib           | cluster information base |
stonithd      | fencing daemon |
crmd          | cluster resource management daemon |
lrmd          | local resource management daemon |
pengine       | policy engine |
attrd         | co-ordinates updates to cib, as an intermediary |
dlm_controld  | distributed lock manager |
clvmd         | clustered LVM daemon |
Alternatives to corosync: CMAN or CCM + HEARTBEAT
DC ≡ Designated Controller. One of the CRMd instances, elected to act as a master. Should the elected CRMd process or its node fail, a new master is elected. DC carries out PEngine's instructions by passing them to LRMd on the local node, or to CRMd peers on other nodes, which in turn pass them to their LRMds. Peers then report the results of execution to DC.
Resource categories:
LSB | Services from /etc/init.d |
Systemd | systemd units |
Upstart | upstart jobs |
OCF | Open Cluster Framework scripts |
Nagios | Nagios monitoring plugins |
STONITH | fence agents |
pcs resource standards
pcs resource providers
pcs resource agents ocf:heartbeat
pcs resource agents ocf:pacemaker
pcs resource agents systemd
pcs resource agents service
pcs resource agents lsb
pcs resource agents stonith
Resource constraints:
location   | Which nodes the resource can run on |
order      | The order in which the resource is launched |
colocation | Where the resource will be placed relative to other resources |
Connect to iSCSI drives:
See iSCSI page.
Briefly, on each cluster node:
Install the open-iscsi package. The package is also known as the Linux Open-iSCSI Initiator.
Ubuntu:
apt-get install open-iscsi lsscsi
gedit /etc/iscsi/iscsid.conf
/etc/init.d/open-iscsi restart
Fedora:
dnf install -y iscsi-initiator-utils lsscsi
systemctl enable iscsid.service
systemctl start iscsid.service
Display/edit initiator name, ensure it is unique in the landscape (especially if the system was cloned)
cat /etc/iscsi/initiatorname.iscsi
e.g.
InitiatorName=iqn.1994-05.com.redhat:cbf2ba2dff2 => iqn.1994-05.com.redhat:mynode1
InitiatorName=iqn.1993-08.org.debian:01:16c1be18eee8 => iqn.1993-08.org.debian:01:myhost2
Optional: edit configuration
gedit /etc/iscsi/iscsid.conf
restart the service
Discover the iSCSI targets on a specific host
iscsiadm -m discovery -t sendtargets -p qnap1x:3260 \
    --name discovery.sendtargets.auth.authmethod --value CHAP \
    --name discovery.sendtargets.auth.username --value sergey \
    --name discovery.sendtargets.auth.password --value abc123abc123
Check the available iSCSI node(s) to connect to.
iscsiadm -m node
Delete node(s) you don't want to connect to when the service is on with the following command:
iscsiadm -m node --op delete--targetname <target_iqn>
Configure authentication for the remaining targets:
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c" -p qnap1x:3260 --login

iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs2.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs2.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs2.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs2.e4cd7c" -p qnap1x:3260 --login

iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs3.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs3.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs3.e4cd7c" -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
iscsiadm --mode node --targetname "iqn.2004-04.com.qnap:ts-569l:iscsi.xs3.e4cd7c" -p qnap1x:3260 --login
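The same settings can be applied in a loop over the target names (a sketch using the example target names, portal and CHAP credentials above):

    for tgt in xs1 xs2 xs3; do
        iqn="iqn.2004-04.com.qnap:ts-569l:iscsi.${tgt}.e4cd7c"
        iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
        iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
        iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
        iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --login
    done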
You should be able to see a login message like:
Login session [iface: default, target: iqn.2004-04.com:NAS:iSCSI.ForUbuntu.B9281B, portal: 10.8.12.31,3260] [ OK ]
Restart open-iscsi to log in to all of the available nodes.
Fedora: systemctl restart iscsid.service
Ubuntu: /etc/init.d/open-iscsi restart
Check the device status with dmesg.
dmesg | tail -30
List available devices:
lsscsi
lsscsi -s
lsscsi -dg
lsscsi -c
lsscsi -Lvl
iscsiadm -m session [-P 3] [-o show]
For multipathing, see a section below.
Format volume with cluster LVM
See RHEL7 LVM Administration, chapters 1.4, 3.1, 4.3.3, 4.3.8, 4.7, 5.5.
On each node:
lvmconf --enable-cluster
systemctl stop lvm2-lvmetad.service
systemctl disable lvm2-lvmetad.service

To revert (if desired later):
    lvmconf --disable-cluster
    edit /etc/lvm/lvm.conf, change use_lvmetad to 1
    systemctl start lvm2-lvmetad.service
    systemctl enable lvm2-lvmetad.service
On one node (cluster must be running):
pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm with_cmirrord=true op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs constraint show
pcs resource show
If clvmd was already configured earlier, but without cmirrord, can enable the latter with:
pcs resource update clvmdwith_cmirrord=true
Identify the drive
iscsiadm -m session -P 3 | grep Target
iscsiadm -m session -P 3 | grep scsi | grep Channel
lsscsi
tree /dev/disk
Partition the drive and create volume group
fdisk /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0
    respond: n, p, ...., w, p, q
Refresh partition table view on all other nodes:
    partprobe
pvcreate /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0-part1
vgcreate [--clustered y] vg1 /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0-part1
vgdisplay vg1
pvdisplay
vgs
Create logical volume:
lvcreate vg1 --name lv1 --size 9500M
lvcreate vg1 --name lv1 --extents 2544      # find the number of free extents from vgdisplay
lvcreate vg1 --name lv1 --extents 100%FREE
lvdisplay
ls -l /dev/vg1/lv1
Multipathing
See here.
GFS2
RedHat GFS2 documentation
- File system name must be unique in a cluster (DLM lock names derive from it)
- File system hosts journal files. One journal is required per each cluster node that mounts this file system.
  Default journal size: 128 MB (per journal).
  Minimum journal size: 8 MB.
  For large file systems, increase to 256 MB.
  If the journal is too small, requests will have to wait for journal space, and performance will suffer.
- Do not use SELinux with GFS2.
  SELinux stores information in every file's extended attributes, which will cause significant GFS2 slowdown.
- If a GFS2 file system is mounted manually (rather than through a Pacemaker resource), unmount it manually.
  Otherwise the shutdown script will kill cluster processes and will then try to unmount the GFS2 file system,
  but without the processes the unmount will fail and the system will hang (and a hardware reboot will be required).
pcs property set no-quorum-policy=freeze
By default, the value of no-quorum-policy is set to stop, indicating that once quorum is lost, all the resources on the remaining (minority) partition will immediately be stopped. Typically this default is the safest and most optimal option, but unlike most resources, GFS2 and OCFS2 require quorum to function.
When quorum is lost, both the applications using the GFS2 mounts and the GFS2 mount itself cannot be correctly stopped in a partition that has become non-quorate. Any attempts to stop these resources without quorum will fail, which will ultimately result in the entire cluster being fenced every time quorum is lost.
To address this situation, set no-quorum-policy=freeze when GFS2 is in use. This means that when quorum is lost, the remaining (minority) partition will do nothing until quorum is regained.
If a majority partition remains, it will fence the minority partition.
Find out for sure: whether the majority partition can launch a failover replica of a service (that was running inside a minority partition) before fencing the minority partition, or will do it only after fencing the minority partition. If before, two replicas can conflict when no-quorum-policy is freeze (and even when it is stop).
Create file system and Pacemaker resource for it:
mkfs.gfs2 -j 3 -p lock_dlm -t vc:cfs1 /dev/vg1/lv1
    -j 3 => pre-create journals for three cluster nodes
    -t value => locking table name (must be ClusterName:FilesystemName)
    -O => do not ask for confirmation
    -J 256 => create journal with size of 256 MB (default: 128, min: 8)
    -r <mb> => size of allocation "resource group", usually 256 MB
# view settings
tunegfs2 /dev/vg1/lv1
# change label (note: the label is also the lock table name)
tunegfs2 -L vc:cfs1 /dev/vg1/lv1
# some other settings can also later be changed with tunegfs2
pcs resource create cfs1 Filesystem device="/dev/vg1/lv1" directory="/var/mnt/cfs1" fstype=gfs2 \
    options="noatime,nodiratime" run_fsck=no \
    op monitor interval=10s on-fail=fence clone interleave=none
pcs constraint order start clvmd-clone then cfs1-clone
pcs constraint colocation add cfs1-clone with clvmd-clone
mount | grep /var/mnt/cfs1
Mount options:
acl                     enable ACLs
discard                 when on SSD or SCSI devices, enable UNMAP function for blocks being freed
quota=on                enforce quota
quota=account           maintain quota, but do not enforce it
noatime                 disable update of access time
nodiratime              same for directories
lockproto=lock_nolock   => mounting out of cluster (no DLM)
To suspend write activity on file system (e.g. to create LVM snapshot)
dmsetup suspend /dev/vg1/lv1
[... use LVM to create a snapshot ...]
dmsetup resume /dev/vg1/lv1
To run fsck, stop the resource to unmount the file system from all the nodes:
pcs resource disable cfs1 [--wait=60]    # default wait time is 60 seconds
fsck.gfs2 -y /dev/vg1/lv1
pcs resource enable cfs1
To expand file system:
lvextend ... vg1/lv1
gfs2_grow /var/mnt/cfs1
When adding node to cluster, provide enough journals first:
# find out how many journals are available
# must unmount the file system first
pcs resource disable cfs1
gfs2_edit -p jindex /dev/vg1/lv1 | grep journal
pcs resource enable cfs1
# add one more journal, sized 128 MB
gfs2_jadd /var/mnt/cfs1
# add two more journals sized 256 MB
gfs2_jadd -j 2 -J 256 /var/mnt/cfs1
[... add the node ...]
Optional – Performance tuning – Increase DLM table sizes
echo 1024 >/sys/kernel/config/dlm/cluster/lkbtbl_size
echo 1024 > /sys/kernel/config/dlm/cluster/rsbtbl_size
echo 1024 > /sys/kernel/config/dlm/cluster/dirtbl_size
Optional – Performance tuning – Tune VFS
# percentage of system memory that can be filled with "dirty" pages before pdflush kicks in
sysctl -n vm.dirty_background_ratio      # default is 5-10
sysctl -w vm.dirty_background_ratio=20
# discard inodes and directory entries from cache more aggressively
sysctl -n vm.vfs_cache_pressure          # default is 100
sysctl -w vm.vfs_cache_pressure=500
# can be permanently changed in /etc/sysctl.conf
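To make these settings persistent, the corresponding /etc/sysctl.conf entries would be (a sketch using the example values above):

    vm.dirty_background_ratio = 20
    vm.vfs_cache_pressure = 500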
Optional – Tuning
/sys/fs/gfs2/vc:cfs1/tune/...
To enable data journaling on a file (default: disabled)
chattr +j /var/mnt/cfs1/path/file #enable
chattr -j /var/mnt/cfs1/path/file #disable
Program optimizations:
- preallocate file space – use fallocate(...) if possible
- flock(...) is faster than fcntl(...) with GFS2
- with fcntl(...), l_pid may refer to a process on a different node
echo 3 > /proc/sys/vm/drop_caches
View lock etc. status:
/sys/kernel/debug/gfs2/vc:cfs1/glocks    # decoded here
dlm_tool ls [-n] [-s] [-v] [-w]
dlm_tool plocks lockspace-name [options]
dlm_tool dump [options]
dlm_tool log_plock [options]
dlm_tool lockdump lockspace-name [options]
dlm_tool lockdebug lockspace-name [options]
tunegfs2 /dev/vg1/lv1
Quota manipulations:
mount with "quota=on"
to create quota files: quotacheck -cug /var/mnt/cfs1
to edit user quota: export EDITOR=`which nano` ; edquota username
to edit group quota: export EDITOR=`which nano` ; edquota -g groupname
grace periods: edquota -t
verify user quota: quota -u username
verify group quota: quota -g groupname
report quota: repquota /var/mnt/cfs1
synchronize quota data between nodes: quotasync -ug /var/mnt/cfs1
NFS over GFS2: see here
=========
### multipath: man mpathpersist https://www.suse.com/documentation/sles-12/stor_admin/data/sec_multipath_mpiotools.html
### LVM: fsfreeze
misc, iscsi: https://www.ibm.com/developerworks/community/blogs/mhhaque/entry/configure_two_node_highly_available_cluster_using_kvm_fencing_on_rhel7?lang=en
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/IPaddr2
### add node (also GFS2 journals)
### virtual-ip
### httpd
### nfs
### fence_scsi
### GFS2
### OCFS2
### DRBD
### interface bonding/teaming
### quorum disk, qdiskd, mkqdisk
### GlusterFS
### Lustre
### hawk GUI https://github.com/ClusterLabs/hawk
### http://www.spinics.net/lists/cluster/threads.html