28 January 2022

How to implement `pixz` for the HiveOS PXE boot server and mining rig clients

 @hiveos
Have you, or has anybody, ever tried moving to pixz instead of pxz for the parallel compression and decompression of the boot archive (i.e. moving from hiveramfs.tar.xz to hiveramfs.tpxz)?

I tried dissecting through the scripts and I can't seem to find the part where the system knows to use tar to extract the hiveramfs.tar.xz file into tmpfs.

I've tried looking in /path/to/pxeserver/tftp and also in /path/to/pxeserver/hiveramfs and I wasn't able to find where it codifies the instruction and/or the command to unpack the hiveramfs.tar.xz.

If you can provide some guidance as to where I would find that in the startup script, where it would instruct the client to decompress and unpack the hiveramfs.tar.xz, that would be greatly appreciated.

Thank you.

*edit*
I've now implemented pixz for both the parallel compression (i.e. the creation of the boot archive hiveramfs.tpxz) and the parallel decompression of the same.

It replaces the boot archive hiveramfs.tar.xz.

The PXE server host, if you are running an Ubuntu PXE boot server, will need to have pixz installed (which you can get by running sudo apt install -y pixz), so it's pretty easy to get and install.

The primary motivation for this is on the mining rig side: depending on the CPU that you have in it, you will usually have excess CPU capacity at boot time, and therefore, if you can use parallel decompression for the hiveramfs archive, you can get your mining rig up and running that much quicker.

The side benefit is that, in managing the hiveramfs image on the PXE server, pixz also worked out to be faster than pxz at creating the FS archive.

Tested on my PXE server which has a Celeron J3455 (4-core, 1.5 GHz base clock), it compressed the FS archive using pxz in 11 minutes 2 seconds whilst pixz was able to complete the same task (on a fresh install of the HiveOS PXE server) in 8 minutes 57 seconds. (Sidebar: For reference, previously, when using only xz (without the parallelisation), on my system, it would take somewhere between 40-41 minutes to create the FS archive.)

On my mining rig, which has a Core i5-6500T, it takes about 8.70 seconds to decompress hiveramfs.tpxz to hiveramfs.tar and then it takes about another 1.01 seconds to unpack the tarball file.

Unfortunately, I don't have the benchmarking data for how long it took my mining rig to decompress and unpack the hiveramfs.tar.xz file.
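(If anybody wants to produce comparable numbers on their own rig, timing the two steps separately with time should be enough for a rough comparison. This is just a sketch of how I'd measure it, run wherever you have a copy of the archives:

time pixz -d hiveramfs.tpxz    # produces hiveramfs.tar
time tar -xf hiveramfs.tar

and the equivalent for the old archive would be time tar --lzma -xf hiveramfs.tar.xz.)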

Here are the steps to deploying pixz, and using that to replace pxz.

On the PXE server, install pixz:
sudo apt install -y pixz

Run pxe-config.sh to specify your farm hash, server IPv4 address, etc., and also to change the name of the FS archive from hiveramfs.tar.xz to hiveramfs.tpxz.

DO NOT RUN the HiveOS update/upgrade yet!!!

When it asks if you want to upgrade HiveOS, type n for no.

For safety/security, make a backup copy of the initial hiveramfs.tar.xz file that can be found in /path/to/pxeserver/hiveramfs.

(For me, I just ran sudo cp hiveramfs.tar.xz hiveramfs.tar.xz.backup.)

You will need to manually create the initial hiveramfs.tpxz file that the system will act upon next when you run the hive-upgrade.sh script.

To do that, run the following:

/path/to/pxeserver$ sudo mkdir -p tmp/root
/path/to/pxeserver$ cd tmp/root
/path/to/pxeserver/tmp/root$ cp ../../hiveramfs/hiveramfs.tar.xz .
/path/to/pxeserver/tmp/root$ tar --lzma -xf hiveramfs.tar.xz
/path/to/pxeserver/tmp/root$ rm hiveramfs.tar.xz
/path/to/pxeserver/tmp/root$ tar -I pixz -cf ../hiveramfs.tpxz .
/path/to/pxeserver/tmp/root$ cd ..
/path/to/pxeserver/tmp$ cp hiveramfs.tpxz ../hiveramfs
/path/to/pxeserver/tmp$ cd ../hiveramfs
/path/to/pxeserver/hiveramfs$ cp hiveramfs.tpxz hiveramfs.tpxz.backup

(The rm of the copied hiveramfs.tar.xz is there so that it doesn't get packed into the new hiveramfs.tpxz. Note that after the cd .., you are in /path/to/pxeserver/tmp, which is why the copy back to the hiveramfs directory is ../hiveramfs.)
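If you want to sanity-check the new archive before going any further, listing its contents back through pixz is a quick way to do it (this is just a check for peace of mind; nothing in the HiveOS scripts requires it):

/path/to/pxeserver/hiveramfs$ tar -I pixz -tf hiveramfs.tpxz | head

You should see the usual ./ entries of the root filesystem listed.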


Now, edit the pxe-config.sh:
at about line 51, it should say something like:
#adde pxz (typo included)

copy lines 51-53 and paste them after line 53

(basically, add an i so that where it says pxz now says pixz instead)
edit the lines to read:
#adde pixz
dpkg -s pixz > /dev/null 2>&1
[[ $? -ne 0 ]] && need_install="$need_install pixz"
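For clarity, after the edit, that dependency-check block in pxe-config.sh should end up looking something like this (the pxz lines are what was already there; the pixz lines are the new copy):

#adde pxz
dpkg -s pxz > /dev/null 2>&1
[[ $? -ne 0 ]] && need_install="$need_install pxz"
#adde pixz
dpkg -s pixz > /dev/null 2>&1
[[ $? -ne 0 ]] && need_install="$need_install pixz"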


save, quit

Run pxe-config.sh again.

DO NOT RUN the HiveOS update/upgrade yet!!!

Now, your farm hash, IP address, etc. should all have been set previously. Again, when it asks you if you want to upgrade HiveOS, type n for no.

Now, we are going to make a bunch of updates to hive-upgrade.sh.

(For me, I still use vi, but you can use whatever text editor you want.)

/path/to/pxeserver$ sudo vi hive-upgrade.sh
at line 71, add pixz to the end of the line so that the new line 71 would read:
apt install -y pv pixz

I haven't been able to figure out how to decompress the hiveramfs.tpxz archive and unpack it in the same line.

(I also wasn't able to get pv working properly so that it would show the progress indicator, so if someone who is smarter than I am can help figure that out, that would be greatly appreciated. In the meantime, you can remote into your PXE server in another terminal window and run top to make sure that it is working in the absence of said progress indicator.)
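One thing that might solve both of those at once, though I haven't tested it on the rig, is to let tar drive pixz directly, with pv feeding the archive in:

pv $FS | tar -I pixz -xf -

Treat that as a sketch rather than gospel. (pixz writes standard xz-compatible streams, which is probably also why the rig's existing boot tooling copes with the .tpxz without any changes.)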

So the section starting at line 79 (echo -e "> Extract Hive FS to tmp dir") now reads:

line80: #pv $FS | tar --lzma -xf -
line81: cp $FS .
line82: pixz -d $ARCH_NAME
line83: tar -xf hiveramfs.tar .
line84: rm hiveramfs.tar


Line 84 is needed because otherwise, when you go to create the archive, it will try to compress the old hiveramfs.tar in as well, and you don't need that.

Now fast forward to the section where it creates the archive (around line 121) where it says:
line121: echo -e "> Create FS archive"
line122: #tar -C root -I pxz -cpf - . | pv -s $arch_size | cat > $ARCH_NAME
line123: tar -C root -I pixz -cpf - . | pv -s $arch_size | cat > $ARCH_NAME


(in other words, copy that line, paste it, comment out the old line, and add an i to the new line.)

line125 is still the old line where it used the single threaded xz compression algorithm/tool, which should be already commented out for you.

The rest of the hive-upgrade.sh should be fine. You shouldn't have to touch/update the rest of it.

Now you can run hive-upgrade.sh:
/path/to/pxeserver$ sudo ./hive-upgrade.sh

and you can watch it to check and make sure that it is copying the hiveramfs.tpxz from /path/to/pxeserver/hiveramfs to /path/to/pxeserver/tmp/root, decompressing the archive, and unpacking the files properly.

If it does that properly, then the updating portion of it should be running fine, without any issues (or none that I observed).

The next section that you want to check is the one that repacks and compresses the archive back up - make sure that that is also working properly for you.

Again, it is useful/helpful to have a second terminal window open where you've ssh'd into the PXE server again, with top running so that you can make sure that the pixz process is working/running.

After that is done, you can reboot your mining rig to make sure that it picks up the new hiveramfs.tpxz file ok and that it is also successful in decompressing and unpacking the archive.

I have NO idea how it is doing that because normally, I would have to issue that as two separate commands, but again, it appears to be working with my mining rig.

*shrug*

It's working.

I don't know/understand why/how.

But I'm not going to mess with it too much to try and figure out why/how it works, because it IS working.

(Again, if there are other people who are smarter than I am that might be able to explain how it is able to decompress and unpack a .tpxz file, I would be interested in learning, but on the other hand, like I said, my mining rig is up with the new setup, so I'm going to leave it here.)

Feel free to ask questions if you would want to implement pixz so that you would have faster compression and decompression times.

If your PXE server is fast enough that pxz already does the job for you and this isn't going to make enough of a difference, then that's fine. That's up to you.

For me, my PXE server, running on a Celeron J3455 is quite slow, so anything that I can do to speed things up a little bit is still a speed up.

Thanks.

06 January 2022

Getting the latest and greatest hardware running in Linux is sometimes a bit of a nightmare

Just prior to the holidays, I decided to upgrade my systems and consolidate three of them down to two. My old Supermicro Big Twin^2 Pro micro cluster server and two HP Z420 workstations (that I was using in lieu of the Supermicro because the Supermicro was "too loud") were getting replaced by an AMD system, built on the Ryzen 9 5950X CPU, and an Intel system, built on the latest and greatest that Intel had to offer - the Core i9-12900K.

So, I specced out all of the rest of the hardware, which really consisted of the motherboard, RAM, and the CPU heatsink and fan assembly, whilst I was able to reuse some of my older, existing components as well. (I did have to buy an extra power supply though, because I had originally miscalculated how many power supplies I would need.)

So that's all fine and dandy. All of the hardware arrived just before the start of the Christmas break for me, so I started to set up the AMD system. Install the CPU, the RAM, the CPU HSF, plug everything in, check and double check all of the connections - everything is good to go. I used Rufus to write the CentOS 7.7.1908 installer onto a USB drive, plugged in the keyboard and mouse, flipped the switch on the power supply, and off I go, right?

[buzzer]

Nope!


Near instant kernel panic. Nice.

 

As you can see from the picture above, less than 3 seconds into the boot sequence from the USB drive - Linux has a kernel panic.

Great.

So now I get the "fun" [/sarcasm] job of trying to sort this kernel panic out. Try it a few more times, the same thing happens.

So, ok. Now I'm thinking that the hardware is too new for this older Linux distro and version (and kernel). So, I take out my Intel Core i7-3930K system (one of them that I use to run my tape backup system), and I plug the hard drive into that system, along with the video card back in, and run through the boot and installation process (which worked without any issues of course), power down the 3930K, take the hard drive back out, and plug it into the 5950X system. Power it on. (I set the BIOS to power on after AC loss so that I can turn on the system even when it isn't inside a case and I don't have a power button connected to it.)

The official CentOS forums state that they only support CentOS 7.9.2009, so I try that as well, still to no avail.

Eventually, I end up using a spare Intel 545 series 512 GB SATA 6 Gbps SSD that I had laying around so that I could try installing and re-installing, trying different drivers, kernel modules, kernels, etc. a LOT faster than I was able to with a 7,200 rpm HDD.

End net result: I filed a bug report with kernel.org because the mainline kernel 5.15.11 kept producing kernel panics with the Mellanox 100 Gbps Infiniband network card installed. And it didn't matter whether I tried to use the "inbox" CentOS Infiniband drivers or the "official" Mellanox OFED Infiniband drivers.

Yet another Linux kernel panic.

Interestingly enough, the mainline kernel 5.14.15 works with the Infiniband NIC just fine. So that's what I landed on/with. 

The other major problem that I ran into was that the Asus X570 TUF Gaming Pro (WiFi) uses the Intel I225-V 2.5 GbE NIC. Unbeknownst to me when I originally purchased the motherboard, Intel does NOT have a Linux driver (even on Intel's website) for said Intel I225-V 2.5 GbE NIC. And what was weird was that when I was migrating the SSD over during the testing, trying to find/figure out a configuration that worked, said onboard Intel 2.5 GbE NIC would work initially, but then it would eventually and periodically drop out, and that was quite the puzzle, because if there wasn't a driver for it, then how was it able to work at all when I moved the drive over?

That took up another couple of days, where I was trying to clone the disk image from the Intel SSD over onto the HGST HDD using dd, and in the end, that didn't work either.

So, what did I end up with?

These are the hardware specs that I ended up with on the AMD system:

CPU: AMD Ryzen 9 5950X (16-core, 3.4 GHz stock base clock, 4.9 GHz max boost clock, SMT enabled)

Motherboard: Asus X570 TUF Gaming Pro (WiFi)

RAM: 4x Crucial 32 GB DDR4-3200 unbuffered, non-ECC RAM CL22 (128 GB total)

CPU HSF: Noctua NH-D15 with one stock 140 mm fan, and one NF-A14 industrialPPC 3000 PWM fan

Video card: EVGA GeForce GTX 980

Hard drive: 1x HGST 1 TB SATA 6 Gbps 7200 rpm HDD

NIC: Mellanox ConnectX-4 dual port 100 Gbps 4x EDR Infiniband (MCX456A-ECAT)

NIC: Intel Gigabit CT Desktop 1 GbE NIC (Intel 82574L chipset)

Power Supply: Corsair CX750M

OS: CentOS 7.7.1908 kernel 5.14.15-1-el7.elrepo.x86_64

 

I ended up adding the Intel Gigabit CT Desktop NIC because a) it was an extra Intel GbE NIC AIC that I had also laying around, and b) it proved to be able to provide a vastly more reliable connection than the onboard Intel I225-V 2.5 GbE due to the driver issue.

Now that I have the system set up and running, there is a higher probability that the igc kernel module works more reliably now than it did when I was originally setting up the system, but given that it was not reliable when I was doing the initial setup and testing, I am less likely to use said onboard NIC, which is a pity. Brand spankin' new motherboard and I can't even use nor trust the reliability of the onboard NIC. And I can't even blame Asus for it, because it is an Intel NIC. (Sidebar: Ironically, the Asus Z690 Prime-P D4 motherboard that I also purchased uses a Realtek RTL8125 2.5 GbE NIC, which I WAS able to find a driver for, and it has been working flawlessly.)

That took probably on the order of around 10 days, from beginning to end, to get the AMD system up and running.


The Intel system was a little bit easier to set up.

The kernel panic issue with the mainline 5.15.11 kernel and Infiniband was also present on the Intel platform as well.

Interestingly and ironically enough, the newer kernel kept crashing or had severe stability issues. It turns out that I did NOT install the RAM correctly (i.e. in the DIMM_A2 and DIMM_B2 slots), and I've since corrected that.

Keen readers might note that I have stated that I have 4 sticks of RAM, except that one of the sticks arrived DOA and is currently being sent back to Crucial under RMA. When it comes back, I will be able to install both the extra stick that is currently not installed and the stick that is due back from the RMA exchange.

I might try the newer kernels again later, but for now, at least the system is up and running so that I can start making it do the work that I need it to be doing.

Because of the error that I made when installing (and uninstalling) the RAM (I was testing the stick of RAM that wouldn't POST, which ended up getting RMA'd back to Crucial), I ended up with a RAM installation configuration that wasn't correct, and the resulting system stability issues ate up a few more days.

So, in the end, it took me almost the entire Christmas holiday to get both of these systems up and running.

(This is also a really good reason why traditionally, I have stuck with workstation and server hardware because on my old Supermicro micro cluster, I can deploy all four nodes in 2 hours or less. It's a pity that the system is too loud.)


This is the hardware that I ended up with on the Intel system:

CPU: Intel Core i9-12900K (16 cores (8P + 8E), 3.2 GHz/2.4 GHz base clock speed, 5.2 GHz/3.9 GHz max boost clock, HTT enabled)

Motherboard: Asus Z690 Prime-P D4

RAM: 4x Crucial 32 GB DDR4-3200 unbuffered, non-ECC RAM CL22 (128 GB total)

CPU HSF: Noctua NH-D15 with one stock 140 mm fan, and one NF-A14 industrialPPC 3000 PWM fan

Video card: EVGA GeForce GTX 660

Hard drive: 1x HGST 1 TB SATA 6 Gbps 7200 rpm HDD

NIC: Mellanox ConnectX-4 dual port 100 Gbps 4x EDR Infiniband (MCX456A-ECAT)

NIC: Intel Gigabit CT Desktop 1 GbE NIC (Intel 82574L chipset)

Power Supply: Corsair CX750M

OS: CentOS 7.7.1908 kernel 3.10.0-1127.el7.x86_64

AMD Ryzen 9 5950X is faster than the Intel Core i9-12900K for mining Raptoreum

The results speak for themselves.

 

The AMD Ryzen 9 5950X is faster at mining Raptoreum than Intel's latest and greatest 12th gen Core i9-12900K.

 

System/hardware specs:

AMD:

CPU: AMD Ryzen 9 5950X (16-core, 3.4 GHz stock base clock, 4.9 GHz max boost clock, SMT enabled)

Motherboard: Asus X570 TUF Gaming Pro (WiFi)

RAM: 4x Crucial 32 GB DDR4-3200 unbuffered, non-ECC RAM CL22 (128 GB total)

CPU HSF: Noctua NH-D15 with one stock 140 mm fan, and one NF-A14 industrialPPC 3000 PWM fan

Video card: EVGA GeForce GTX 980

Hard drive: 1x HGST 1 TB SATA 6 Gbps 7200 rpm HDD

NIC: Mellanox ConnectX-4 dual port 100 Gbps 4x EDR Infiniband (MCX456A-ECAT)

NIC: Intel Gigabit CT Desktop 1 GbE NIC (Intel 82574L chipset)

Power Supply: Corsair CX750M

OS: CentOS 7.7.1908 kernel 5.14.15-1-el7.elrepo.x86_64


Intel:

CPU: Intel Core i9-12900K (16 cores (8P + 8E), 3.2 GHz/2.4 GHz base clock speed, 5.2 GHz/3.9 GHz max boost clock, HTT enabled)

Motherboard: Asus Z690 Prime-P D4

RAM: 4x Crucial 32 GB DDR4-3200 unbuffered, non-ECC RAM CL22 (128 GB total)

CPU HSF: Noctua NH-D15 with one stock 140 mm fan, and one NF-A14 industrialPPC 3000 PWM fan

Video card: EVGA GeForce GTX 660

Hard drive: 1x HGST 1 TB SATA 6 Gbps 7200 rpm HDD

NIC: Mellanox ConnectX-4 dual port 100 Gbps 4x EDR Infiniband (MCX456A-ECAT)

NIC: Intel Gigabit CT Desktop 1 GbE NIC (Intel 82574L chipset)

Power Supply: Corsair CX750M

OS: CentOS 7.7.1908 kernel 3.10.0-1127.el7.x86_64


Configuration notes:

I had about two weeks over the Christmas break 2021 to receive all of the hardware, assemble the systems, and get the systems set up and up and running. And that was quite the endeavour, because the older version of CentOS (7.7.1908) and the older kernel didn't work with all of the features and functions of this level of hardware.

As a result, I had to "jumpstart" both systems by first installing the OS using my Intel Core i7-3930K system (Asus X79 Sabertooth motherboard, 4x Crucial 8 GB DDR3-1600 unbuffered, non-ECC RAM, Mellanox MCX456A-ECAT, GTX 660), and then updating the systems (at least in part) before I could transplant the hard drive with the OS install into their respective systems and finish setting the systems up. (I will write more about that "journey"/clusterf in another blog post here shortly, because it was quite the journey to jumpstart both of these systems simultaneously, which took pretty much the full two weeks that I had.)

You will find out how and why I ended up with the respective hardware choices in that blog post.

I am using cpuminer-gr-1.2.4.1-x86_64_linux from here (https://github.com/WyvernTKC/cpuminer-gr-avx2/releases/tag/1.2.4.1).

For the Intel system, the combination of (P)erformance cores, (E)fficiency cores, and HyperThreading resulted in more combinations that I had to test in order to find the setting that had the highest Raptoreum hash rate as reported by their benchmarking tool. Each time the CPU configuration changed, I ran a full tune again, which, as you might well imagine, took quite some time to do. In the cases where the efficiency cores were disabled, I also tested and re-ran the full tune for both AVX2 and AVX512.

The AVX512 runs (both times I ran it, i.e. with and without HyperThreading), resulted in thermal throttling with about a 23-24 C ambient at the time.

For the AMD system, testing was a lot simpler because the only variable was SMT on or off.

Results:


The results speak for themselves.

The AMD Ryzen 9 5950X with SMT enabled produces the highest hash rate (3953.64 hashes/second). Compared with the run where SMT was disabled, enabling SMT results in about a 10.3% increase in the hash rate performance result.

The Intel Core i9-12900K results are an interesting case. Despite the plethora of benchmarks talking about how great and how fast the latest and greatest from Intel is (and there ARE some things that said latest and greatest from Intel are great at), unfortunately, for Raptoreum mining, this is not one of them.

At best, the 5950X with SMT enabled is about 80.9% faster in Raptoreum hash rate performance than the 12900K with all 16 cores AND HyperThreading enabled.

Comparing like-for-like thread counts: against the 12900K running 8P+8E without HyperThreading, the 5950X at 16 cores/16 threads is still approximately 64.2% faster. Without the efficiency cores but with HyperThreading turned back on (i.e. 8P+0E with HyperThreading), the 12900K is again bested by the 5950X, this time by a 71.8% margin.

Unfortunately, running this in Linux meant that I didn't have or didn't know of a tool like Hardware Info 64 to be able to report power consumption figures/values. Maybe I might get around to re-running this test again in Windows, but for now, this might be helpful to those who might be interested in looking for guidance if mining Raptoreum is on your mind.

30 July 2021

A Better Way To Test Computers

If you really want a better way to test a system, especially for single-threaded performance, try testing it with a text file of upwards of 90 MB which contains approximately 1.45 million lines of text, by running a compare operation against two copies of the text file, where the difference between the two files is about +/- 7200 lines, but +/- 20 MB. (Yeah, I don't know how THAT happened, but that's the result.) (It is an index of a directory that I have that contains about 1.45 million files and folders.)

The compare operation, which has been running in Notepad++ 8.1.2 with the compare plugin, has now been going for over an hour on an Intel Core i7-6700K (4-core, 4.0 GHz base clock speed, 4.2 GHz max boost speed, HyperThreading enabled), and I have no idea how much longer said compare operation is going to take.

You want to test single threaded performance?

Test it with something like this.

This will give you a REAL sense as to what different processors can do.
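(If you want to reproduce the same kind of workload outside of Notepad++, the shell equivalent is easy enough to set up - the file names here are just placeholders for your two copies of the index:

time diff index_copy1.txt index_copy2.txt > /dev/null

It won't replicate the compare plugin's fancier alignment logic, but it is the same sort of single-threaded grind over the same data.)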

*edit*
The comparison ended up taking 51 hours 45 minutes and 28 seconds on said computer.

Mellanox Lies (and Why It's Important To Speak Up)

In the 2020 version of Mellanox's (now Nvidia Networking, but I'm old school, so I still call it Mellanox) IB Adapter Product Brochure, it states that their Infiniband network adapters support NFS(o)RDMA.

Page 4 from the Mellanox IB Adapter Product Brochure


But when I actually purchased said Mellanox ConnectX-4 dual-port 100 Gbps VPI cards (MCX456A-ECAT) and tried to use Mellanox's OFED drivers for Linux (MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.6-x86_64.iso), said NFSoRDMA wasn't working, so I posted on the Mellanox community forums to try and get some help (because sometimes, it can very well be that either I didn't read or execute the installation instructions correctly, or that I am not understanding something, or that there was another step that I needed to do that wasn't documented in the installation instructions).

After fiddling around with it for a little while, I ended up reverting back to the "inbox" driver, i.e. the driver that ships with the OS (originally in my case, CentOS 7.6, and then I eventually moved up to CentOS 7.7.1908), because for some strange reason, NFSoRDMA was working with the "inbox" driver, but not the Mellanox OFED driver.

Turns out, the NFSoRDMA feature was disabled in that version of the Linux driver (and may have been actually disabled in some versions prior to that as well).

Oh really?

So with that admission, I was able to get Mellanox to go on record stating that their own driver, at the time, actually did NOT do what their advertising material says their products can do, which constitutes false advertising (which would be illegal).

Since then, I'm not sure at what point or version, but their latest driver now has NFSoRDMA re-enabled.

Source: https://docs.mellanox.com/display/MLNXOFEDv541030/General+Support
 
I am writing about this now because this was quite the adventure in trying to get NFSoRDMA up and running on my micro cluster.
 
Because Mellanox (Nvidia Networking) has taken away a feature/functionality like this before, I no longer trust that they won't do it again.
 
Therefore, as a result, I currently still will only use the "inbox" drivers that ship with the OS, because at least I know that they'll work.
 
The ONLY time where this may not work or won't be true is if I were to start using Infiniband with the Windows platform (either on the server side and/or on the client side). Then I might have to use the Mellanox drivers for that, but for Linux, I can vouch for certain that the "inbox" driver that ships with 'Infiniband Support' for CentOS 7.6.1810 and 7.7.1908 works like a charm!
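(For anybody who wants to go the same "inbox" driver route, this is roughly what bringing NFSoRDMA up looks like on CentOS 7, from memory - the export path, mount point, and server name below are placeholders, so treat it as a sketch rather than a definitive recipe:

# on the NFS server, after installing the 'Infiniband Support' group and exporting /export/scratch:
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# on the client:
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 server-ib:/export/scratch /mnt/scratch

Port 20049 is the standard NFSoRDMA port.)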

29 July 2021

The Part Of Computing That Nobody Ever Talks About

Whenever people talk about computers, whether it's CPUs, GPUs, RAM speeds, HDDs, SSDs, or networking, they will always put each of those components through its paces with some kind of test or benchmark and/or a suite of benchmarks.

At the component level, that's usually useful and good enough so that people can be informed about how one part compares to another, so that you can make informed decisions about your current and/or future purchases, and that's fine and dandy and all.

Today, I'm going to be talking about something in the computing and IT industry/world (especially as far as general public consumption goes) that is pretty much NEVER talked about, as far as I can tell/see.
 
In the computing world, people will talk about how much data a CPU or a GPU is able to process, whether it's virtual machines, hyperscalers, databases, HPC/CAE applications, machine learning, etc.
 
In the storage subsystem world of HDDs and SSDs (or PMEM), they'll talk about a drive's performance in terms of sustained transfer rates (STRs) or the number of input/output operations per second (IOps) that a drive can deliver.

In the networking world, they'll either talk in millions of messages sent/received per second, the latency, or the raw bandwidth.

But notice that NONE of these ever talks about, for example, what it really means when you put it all together.

Take HPC/CAE for example. One simulation can generate terabytes (TBs) of data, if not more. A LOT more. As far as I can tell, NOBODY in the entire IT/computing industry talks about what you do with all of that data and/or how you manage that volume of data.

Moving the data around, organising it, making sure that it's consistent, especially when you have, say, upwards of 10 million files (I'm currently just shy of 7.5 million files - 7,499,865 to be exact as of the last scan, as of this writing) - nobody ever talks about nor benchmarks what it's like to manage that much data.

For example, let's say that I want to update the user and the group that owns all of the files on a given system. That process, on my systems, where one is hosting 2.8 million files and the other is hosting close to 4 million files, can take anywhere between 40 minutes and an hour (each). And usually, I will run those tasks overnight so that there are no other changes happening on the file system/server, and even then it still takes between 40 minutes and an hour (each) just to do that. So up to 2 hours, JUST to update the user and the group that owns the files. Nothing else.
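(For the curious, the task in question is nothing more exotic than a recursive ownership change - the path and names here are placeholders:

time chown -R newuser:newgroup /path/to/project/data

It's the sheer number of files that makes it slow, not the complexity of the command.)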

Why aren't people talking about the data processing speed of having to do just this kind of a basic, simple task?

Whenever people benchmark CPUs or drives, nobody bothers to talk about something like this.

Now you might argue that this isn't done very often, but that isn't the point.

The point is that there have been tremendous improvements made to CPUs, GPUs, HDDs/SSDs, and networking, but because nobody tests them as a collective system, there aren't any improvements being made to actually make these operations run/go any faster.

And more broadly speaking, people will spend a LOT of time talking about, for example, how fast a CPU or GPU is, or how fast a drive is, or how fast networking is.

Nobody talks about how fast it is when you put it all together and you have to manage a (relatively large) volume of data.

27 July 2021

Apparently, GlusterFS No Longer Supports RDMA And You Can't Use It Across Ramdrives Anymore

Back in the day, I used to use CentOS 7.6.1810 with GlusterFS 3.7 and I was able to create ramdrives in said CentOS and then tie a bunch of ramdrives together with GlusterFS.

Apparently, that's not the case anymore and it hasn't been since Version 5.0 as RDMA was deprecated.

Bummer.

Here's why this was important (and useful):

SSDs, regardless of whether they're consumer-grade or enterprise-grade, all use NAND flash memory cells/chips/modules that have a finite number of program/erase cycles.

Therefore, ALL SSDs are consumable wear components (like brake pads on a car) that are designed to be replaced after a few years due to said wear. (This is a point that, unfortunately, I don't think the SSD industry, as a whole, spends enough time focusing on, because a LOT of people were and are using SSDs as a boot drive, and since a boot drive with a finite number of program/erase cycles will eventually wear out, it is only a matter of time before the system will fail - but I'm going to write/rant about that some other time/day.)

But for now, the key takeaway is that SSDs have a finite number of erase/program cycles and that can cause SSDs to fail.

So, in the HPC space, where I am running simulations, I can produce a LOT of data over the course of a run, sometimes, into the PB/run territory.

Therefore, if I have a large amount of data that needs to be read and written, but I don't need to keep all of the transient data that the solver produces over the course of a simulation, then I want it to be as fast as possible, but also NOT have it be a money pit where I am constantly pouring money into replacing SSDs (again, regardless of whether it's consumer-grade SATA SSDs or enterprise-grade U.2 NVMe SSDs).

So, this was where the idea came from - what if I were to create a bunch of ramdrives, and then tie them together somehow?

Originally, GlusterFS was able to do this with gluster version 3.7.

I would be able to create a tmpfs partition/mount point, make that a GlusterFS brick, and then create a GlusterFS volume with those bricks and then export the GlusterFS volume onto the Infiniband network as a NFSoRDMA file system.
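To make that concrete, the setup looked roughly like this back on gluster 3.7 (host names, sizes, and brick paths here are illustrative and from memory, so take it as a sketch of the approach rather than exact commands):

# on each node: carve a ramdrive out of RAM and stage a brick directory on it
mount -t tmpfs -o size=64G tmpfs /mnt/ramdisk
mkdir -p /mnt/ramdisk/brick

# on one node: tie the ramdrives together into a striped volume that can speak RDMA
gluster volume create ramvol stripe 4 transport tcp,rdma node1:/mnt/ramdisk/brick node2:/mnt/ramdisk/brick node3:/mnt/ramdisk/brick node4:/mnt/ramdisk/brick force
gluster volume start ramvol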

And it worked ok for the most part.

I think that I was getting somewhere around maybe 30 Gbps write speeds on it (for the distributed striped volume).

Lately, I wanted to try and deploy that again, but for creating plots for the chia cryptocurrency.

Apparently, that isn't possible anymore.

And that just makes me sad because it had so much potential.

You can create the tmpfs.

Gluster will make you think that you can create the Gluster bricks and volume.

Gluster lies (which you only find out when you attempt to mount the gluster volume that it never really created the bricks (on tmpfs) to begin with).

And then Gluster-hell-breaks-loose because it thinks that the bricks are a part of a gluster volume already which locks the bricks and volume together, and nowhere in the Gluster documentation does it tell you how to dissociate a brick from a volume or vice versa.

And that's too bad, because GlusterFS had so much potential.

Re-deploying My Old Server Supermicro X7DBE

So, I was originally moving from "real" servers to NAS units, in order to try and cut down on my power consumption a little bit and also to make managing dumb storage a lot easier. I purchased a Buffalo LinkStation 441e 4-bay diskless NAS unit with the intention of putting four 6 TB HGST SATA 6 Gbps 7200 rpm HDDs in it, and when that didn't work, said NAS unit was relegated to only using four 3 TB drives instead.

Fast forward three years, and I guess that I just got tired of the fact that the Buffalo LinkStation 441e couldn't read/write the data at anything more than 20 MB/s with my Windows clients. So I decided that I am going to re-deploy my server as an actual server for dumb file storage and data serving tasks.

Hardware specs:
Supermicro SC826 12-bay, 2U, SATA rackmount chassis
Supermicro X7DBE dual Socket 771 motherboard
2x Intel Xeon E5310 (4-cores, 1.6 GHz stock, no HyperThreading available)
8x 2 GB DDR2-533 ECC Registered RAM
2x LSI MegaRAID SAS 12 Gbps 9240-8i (SAS3008)
SIMLP (it came with it, but I think that the IPMI card is dead)
4x HGST 3TB SATA 6 Gbps 7200 rpm HDDs

OSes that I tried:
1) TrueNAS Core 12.0 U1.1
2) Solaris 10 1/13 (U11)
3) CentOS 7.7.1908
4) TrueNAS Core 12.0 U1.1

I first tested the server using four HGST 1 TB SATA 3 Gbps 7200 rpm HDDs in order to test the resiliency of ZFS in TrueNAS Core 12 (FreeBSD) by randomly yanking drives and plugging them back into random locations to see what ZFS would do about it.

Of course, with a raidz pool, it was only fault tolerant to one drive (out of the four), which meant that, as expected, pulling out two drives killed the pool (took the pool permanently offline such that even if I plugged the drives back in, it would still report the pool as having failed).
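For reference, the pool under test was just a plain four-disk raidz1, along these lines (device names are whatever FreeBSD/TrueNAS enumerated the drives as; these are placeholders):

zpool create tank raidz da0 da1 da2 da3
zpool status tank

With a single raidz1 vdev like that, the pool will show up as DEGRADED but stay usable with any one disk missing; losing a second disk is what takes it out, exactly as I saw.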

I copied a little bit of data onto the pool to see what it would do.

And then I figured "well, since I was going to use ZFS, why not use the OS that started the whole ZFS thing in the first place?" - Solaris.

After anywhere from a day-and-a-half to three days, I finally got Solaris onto the system.

But then I wasn't able to get Samba up and running. (It still surprises me that the native samba package that ships with Solaris 10 NEVER got working out-of-the-box, even with all of the updates across the Solaris 10 release cadence.) I found a samba package from OpenCSW, but that was only SMB 1.0, which is disabled in Windows 7 by default due to security vulnerabilities.

So, I wasn't able to get that up and running and quickly abandoned it.

Next, I tried to use CentOS.

I use CentOS for my micro cluster and also for the micro cluster headnode, so I have some experience with it, but again, because of how my backplane was wired up - three of the 3 TB drives were connected to one of the HW RAID HBAs and the fourth drive was connected to the second one - it meant that I would have needed to use mdadm to make it work, which I was a little bit wary of, on the off chance that said md array would fail.

At least with ZFS, there appear to be more resources available for help, should I need it, as that seems to be all the rage in the Linux storage subspace right now, despite the fact that for years, I have been telling people that it's not my favourite filesystem to work with, due to the fact that there are ZERO external data recovery tools for data that resides on hard drives that belonged to a ZFS pool. (A point which is still true today, i.e. if your ZFS pool dies, you can't do a bit read on the drive in order to salvage whatever information you can off the platters themselves and try to reconstruct the data/your files, whereas with (at least) a single NTFS drive, you CAN do a bit read on the drive and try to pick up/pick off whatever data you can with that.)

So, CentOS didn't really work like I thought it would've/could've.

So that sent me back to TrueNAS Core 12.0 because at least, it has a nice, web-based GUI (I'm so done and tired of looking up commands to copy-and-paste into a command prompt or terminal or over ssh to get the system setup).

I did consider UNraid, but the problem with UNraid is that it will fill up one drive and then the next and then the next, which means that the write speed of a single drive can quickly become a bottleneck for your entire server.

It's too bad that Qnap doesn't publish/sell their QTS software so that you can install it on any hardware, because I really like Qnap's software. And it's also too bad that the only way that you can get the Qnap software is if you buy their hardware as well, which can be quite expensive for the hardware that you are getting, and even then, it doesn't necessarily support some of the other nice-to-haves that I would want, perhaps in the future, from my future storage server(s) (like being able to install a Mellanox/Nvidia (I'm old school, I still call it Mellanox because that's what it is) ConnectX-4 100 Gbps Infiniband network card).
 
So TrueNAS Core 12.0 became the selected candidate, and I proceeded to get everything all set up and up and running.
 
As it stands, CIFS/SMB and NFS are running, although I AM running into a permissions issue between the two differing protocols right now (Windows clients connect over CIFS/SMB, and my Qnap Linux-based NAS units connect over NFS because, for some reason, they fail to log in to the TrueNAS Core 12.0 system with CIFS/SMB). I posted the question in the forums, and it seems that nobody has identified a cause nor a fix for this yet.

Luckily, most of the time, it's the Windows clients that use this newly re-deployed server rather than my Qnap NAS units.

But this partially documents and chronicles my journey with TrueNAS Core 12.0 and also my old Supermicro server.

It took almost like 4 days to get the server back up and running. But it's humming quite nicely now.

The only downside is that I think it's consuming somewhere around 205 W of power or something like that, vs. the Buffalo LinkStation 441e, which only had a 90 W AC adapter.

And the weight of having to move/lug the server around before I finally put it into the rack. (It's just sitting on top of stuff. I don't know if I actually have rails for it.)

My Review Of The Buffalo LinkStation 441e 4-bay Diskless NAS

TL;DR - it fails to meet performance expectations.

I bought the Buffalo LinkStation 441e 4-bay diskless NAS unit in May of 2018 and there were a number of problems that I immediately ran into when I was trying to use it. For one, I was originally going to try and put in four 6 TB HGST SATA 6 Gbps 7200 rpm HDDs into it, and right away, that didn't work. It wouldn't recognise the drives nor properly identify the storage capacity of each of the drives.

It was only then that I found out that it looked like it couldn't support drives bigger than maybe 3 TB (the only other drives that I had available to me were 3 TB drives, so I popped four of those in instead, and that seemed to work).

This limitation was not advertised on Microcenter's product page at the time when I bought it. Had I known that a priori, I probably wouldn't have got it. But seeing that I bought it already, I just tried to see if I could make the best of it.

Setting it up was pretty easy and straight-forward.

The administrator's password, if you decide to assign a new one, can only take alphanumeric characters (apparently, or so it seemed). If you tried to use special characters, it didn't seem to work. I think that I had to reset the unit back to the factory default a few times on account of it when I was first trying to get the unit set up and configured the way I like it.

The other problem that I ran into was copying files to and from the Buffalo LinkStation 441e: despite the fact that it has gigabit ethernet (GbE) networking, with four HGST 3 TB drives in a RAID5 configuration, I was NEVER able to get more than 20 MB/s read or write to this NAS.

It didn't matter what I tried to do - the unit just wasn't capable of it. It was also irrelevant whether the NAS was full or nearly empty - it couldn't read nor write at speeds > 20 MB/s (using a Windows client). The Buffalo LinkStation 441e does not support NFS, so I couldn't really test it with a Linux client.

The one thing that I will say I liked about it is that Buffalo has NAS Navigator, which will find your NAS units that have a dynamically assigned IPv4 address, which is useful and helpful for the initial setup/configuration.

Beyond that, its inability to read/write at anything > 20 MB/s really limited what I could actually use it for.

Pity.

(When you're reading/writing at up to 20 MB/s, even a USB 2.0 high-speed device could outperform this NAS unit with four hard drives in it. And I know that the drives weren't the limiting factor, because I had them in a "real" server before I moved them into the LinkStation 441e NAS and they were able to read/write at > 100 MB/s (limited only by the gigabit ethernet).)

Oracle Solaris Is A Self-Terminating Ecosystem

So, recently, I was trying to re-deploy one of my servers to take over dummy fileserving duties from a NAS unit that really just wasn't up to the task (more on that later).

Here is what I was looking to be able to at least try and do (here is a list of my performance/technical requirements, in no particular order):

1) Actually be able to hit and/or sustain gigabit ethernet (1 GbE) line speeds (e.g. ~ 100 MB/s sustained transfer rates).

2) Be able to create a shared folder that both Windows and Linux clients can use simultaneously. (So, using a combination of CIFS/SMB for the Windows side and NFS (mostly) for the *nix side.)

3) Due to the way that my old server was set up and how the SATA backplane was wired up, three of the drives in the top row of the 12-bay, 2U server were attached to hardware RAID controller #1, whilst the fourth drive (also in the top row of said 12-bay, 2U server) was attached to hardware RAID controller #2. What this meant was that I couldn't create a single logical drive/RAID array consisting of all four drives, because all four drives weren't attached to a single controller. And I didn't want to rewire the backplane.

Therefore, as a result of this, I needed a software-based solution that would be able to create a RAID5 array across the two controllers, and my preference was for ZFS.
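In ZFS terms, that just means building one raidz pool out of the disks hanging off both controllers, something like the following (Solaris-style device names, purely illustrative):

zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c2t0d0
zfs create tank/share

ZFS doesn't care that the disks sit behind two different controllers, which is exactly why it fit this requirement.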


So here are the OS candidates:

1) FreeBSD (TrueNAS Core 12.0 U1.1)

2) Solaris 10 1/13 (U11)

3) Linux (CentOS 7.7.1908)

I've used Solaris 10 before for a similar kind of dumb file serving duty, and given that, for example, ZFS on root on Linux JUST became a "thing" as of Ubuntu 20.04, whilst ZFS on root for Solaris has been a thing since Solaris 10 10/08 (U6) (Source: https://en.wikipedia.org/wiki/Oracle_Solaris#Version_history), I figured that I would give ZFS another go.

(Previously, I had sworn off of ZFS because I had some major problems with it - problems that even after purchasing a premium support subscription directly from Sun Microsystems themselves, the creators of ZFS, they weren't able to help me fix/resolve my problem/issue with ZFS. So I avoided ZFS like the plague.)

As it turned out, trying to get Solaris 10 1/13 onto my old server became quite the challenge.

First off, the server only sports a Supermicro X7DBE motherboard, which only comes with two USB 2.0 ports in the back and none in the front. (I think the chassis that I am using is a Supermicro SC826, but I'm not 100% sure.) The motherboard itself has ports for a PS/2 mouse and keyboard, but I don't really have nor use PS/2 mice or keyboards anymore (I've finally "modernised" and moved up to USB mice and keyboards), which meant that I was short on USB ports.

So, how do I plug in a mouse, keyboard, AND install Solaris?

Well, at first, I tried installing with a slim, external USB DVD drive, but for some reason, the graphical installer had a problem loading/reading off the DVD+RW disc that I use for pretty much all of my OS installs that need to be done from an actual DVD.

And then I tried to use a USB flash drive, but that had problems as well.

So, I remember years ago, deploying a PXE install server so that I can get Solaris onto the system.

So that's what I set off to do.

And oh boy, what a journey (in a bad way) that was.

First off, to be able to deploy Solaris over PXE, pretty much all of the official Oracle documentation assumes that you have another system already running Oracle Solaris.

(I did try to deploy it using a Linux system (Ubuntu), but that didn't really work out the way it should've, based on what people who've written about it on other blogs previously described.)

So, deploying it using another Oracle Solaris system it is.

I fire up Oracle VirtualBox, and get to installing Solaris 10 1/13 on said VM and I didn't really have much of a problem with that.

(Maybe perhaps in a little bit of irony, I was running the Solaris VM off an Intel NUC. I'm actually finding myself using my Intel NUC for a BUNCH of VMs now whereas before, it was only relegated to the task of being a license server for some of my applications, i.e. it wasn't really doing much.)

I get the VM all set up, running Java Desktop 3 (yes, that was a thing), per usual. Nothing special to write about there.

Oracle Solaris 10 1/13 Java Desktop System 3. Yes, it's a thing.



After that, I then proceed to follow the instructions, as published by Oracle, for Oracle Solaris 10 1/13 Installation Guide: Network-Based Installations.

The instructions for creating an install server with DVD media (for x86 systems, as opposed to SPARC systems) are here, and there is nothing special to write about them.

Now, the instructions say that if you are going to be using DHCP (which I was, because apparently my system is so old that PXE booting can ONLY happen with DHCP, i.e. you cannot assign a static IP address to the onboard NICs themselves), then you don't need to create a boot server, but I did it anyways.

Now this is the part where it gets messed up.


Step 3 states:

"3. Add the client to the install server's /etc/ethers file.
a. On the client, find the ethers address. The /etc/ethers map is taken from the local file.

# ifconfig -a | grep ether
ether 8:0:20:b3:39:1d"
 
Here is the problem with this:
 
If you have a system that you are trying to deploy Oracle Solaris onto, you're NOT going to be able to run "ifconfig -a | grep ether" on the client!
 

This is a really stupid step/line in the instructions.


Think about it. You're trying to "jumpstart" the system. The client isn't up and running yet - that is what (and why) you are setting up the PXE server for a network installation - and therefore NOTHING is running on the target/client.

So, instead, I had to pull the MAC address by powering on the server, setting the BIOS so that it will boot from PXE, and then letting the PXE request time out so that it would show me the MAC address for both interfaces (because I think that it might have actually tried to enumerate the second NIC first, which meant that I had to wait until that failed before the primary NIC's MAC address would show up).

And then I took a picture with my phone so that I could go back to my desk and keep going with the instructions.

That was a really, really stupid step in the instructions on the part of Oracle.

(But this isn't why I think that the entire Oracle Solaris ecosystem is a self-terminating one though.)

Okay, so besides that, you keep going with the instructions after you get the MAC address off the NIC from the server.

Note here: if you are running these steps back to back, make sure you get out of the directory where the DVD is mounted (e.g. just type cd to get back to your home directory) so that you can unmount the DVD before adding your client.

I made this mistake a bunch of times.

And THEN go to the directory where you copied the contents of the DVD to.

Proceed with the next step.
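For context, the step in question is the add_install_client run from the Tools directory of the copied media. With DHCP and the MAC address gathered earlier, it looks something like this (the install path and MAC address are placeholders, and this is from memory, so check it against the guide):

# cd /export/install/Solaris_10/Tools
# ./add_install_client -d -e 00:11:22:33:44:55 i86pc

The -d flag says the client gets its network parameters via DHCP, -e is the client's MAC address, and i86pc is the x86 platform group.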

At the very bottom of that page is a one liner that says that if you are going to be using DHCP, then you need to set up the DHCP server.

Now, I thought that this was going to be like Linux installations, where you just set up some dummy DHCP server, tell it what IP address range you want said DHCP server to dish out/serve up, it already has a TFTP server running, and away you go.

[buzzer]
 
I couldn't be more wrong.

Here's what it takes to ACTUALLY set up the DHCP for a DHCP based, PXE/network installation of Solaris 10 1/13, in Solaris 10 1/13:



You have to go through the steps of running
 
# /usr/sadm/admin/bin/dhcpmgr
 
And then you get this:
 

 
Once you're done with that, you then have to create the macro for the client itself, BY MAC ADDRESS.
 
Oracle Solaris 10 1/13 DHCP Manager Macros BY CLIENT MAC ADDRESS

And THIS is where I think that Oracle Solaris is a self-terminating ecosystem.

In order for you to be able to install the system, especially over PXE, according to their own instructions, you apparently have to set up and permit each system, BY CLIENT MAC ADDRESS, to be able to boot from and pull the Oracle Solaris 10 install image.

(The rest of the installation ran fine, by the way. It wasn't the fastest install, but at least it worked. And it IS technically faster than USB 2.0, or at least faster than what a slim, external USB DVD drive can do at 4x speed.)

In other words, you can't just deploy Solaris 10 on ANY system that you want. It's almost like you have to "register" it with the DHCP manager before this will work.

That's incredibly dumb/stupid.
 
Now, it is entirely possible that you don't have to "register" the client's MAC address to get this to work, and that would be nice, but according to the Oracle Solaris official documentation, that's NOT the example that they've shown here.
 
I have NO idea why they chose to make this more complicated way (which DOES work, mind you) the example that they use to show/tell/teach people how to deploy an installation over your local area network, but that's precisely what they've done.

Also, and I don't know exactly when this happened, but apparently, as of this writing (27 July 2021), Oracle has deleted the Oracle Solaris page from Facebook.

Sounds to me like the days for Oracle Solaris may be numbered, even if that number may still be years, possibly a decade-or-more away.

(Also, by the way, the reason why I went BACK to Solaris 10 instead of using the latest Solaris 11 is because I currently have Solaris 11 deployed on another Intel NUC of mine, that exists to perform some basic web serving tasks and for some strange, stupid reason, after a while, one of the NICs that has been assigned to it (it too, also runs in a VM), which has a static IP assigned to it, stops working. And when I try to switch it from static IP to DHCP, it fails to pick up an IPv4 address from the router's DHCP server. So I ended up just adding a second NIC to the VM and left that as being a DHCP managed interface and haven't had any issues with that NIC since, but I can't give it a static IPv4 address, of course.)

Looks like what we are seeing is the beginning of the end for Oracle Solaris.

And that's a shame/pity. I used to really like Solaris.