15 June 2022

Engineering data consolidation efforts

Ever since I built my Ryzen 5950X system and the 12900K system, then had to completely disassemble the 12900K system, and then built another Ryzen 5950X system whilst arguing with Asus, I have been in the middle of a data consolidation effort for all of my engineering data from the various projects that I've worked on over the years.

Today marks the day that the first pass of this data consolidation effort completed, and I ended up saving almost 14 TB of storage space.

It feels nice, and I get a sense of accomplishment as the data is being written to tape right now.

I can't believe that it's taken me about 6 months to finish this data consolidation effort.

At some points during the process of unpacking, packing, and then re-packing the data, both Ryzen 5950Xs and the Intel Core i7-4930K that's in the headnode were oversubscribed 3:1 while processing the data. That seems pretty crazy to me, and it's also a bit of an indication of how much work the CPUs had to do to process and re-process the data.

Not to mention my poor, poor hard drives, which have been working so hard throughout all of this.

13 June 2022

Welp....this is a problem.

Let me begin with the problem statement:

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0c QPN 0x008d PSN 0x277b7c
 remote address: LID 0x05 QPN 0x010f PSN 0xda4554
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          100000           0.000000            0.064007            4.000452
 4          100000           0.00               0.11              3.516592
 8          100000           0.00               0.26              4.078050
 16         100000           0.00               0.52              4.069701
 32         100000           0.00               1.05              4.086223
 64         100000           0.00               2.09              4.074705
 128        100000           0.00               4.27              4.167070
 256        100000           0.00               9.31              4.547246
 512        100000           0.00               12.20             2.978638
 1024       100000           0.00               13.17             1.607263
 2048       100000           0.00               13.64             0.832231
 4096       100000           0.00               13.82             0.421746
 8192       100000           0.00               13.96             0.212971
 16384      100000           0.00               14.08             0.107404
 32768      100000           0.00               14.12             0.053869
 65536      100000           0.00               14.17             0.027029
 131072     100000           0.00               14.19             0.013528
 262144     100000           0.00               14.17             0.006759
 524288     100000           0.00               14.15             0.003375
 1048576    100000           0.00               14.16             0.001688
 2097152    100000           0.00               14.14             0.000843
 4194304    100000           0.00               14.13             0.000421
 8388608    100000           0.00               14.12             0.000210
---------------------------------------------------------------------------------------

What you see above are the results from the 100 Gbps Infiniband network bandwidth test between my two AMD Ryzen 5950X systems. Both of them have a discrete GPU in the primary PCIe slot, and the Mellanox ConnectX-4 dual-port 100 Gbps Infiniband NIC is in the next available PCIe slot.

I can't really tell from the motherboard manual for the Asus ROG Strix X570-E Gaming WiFi II motherboard what speed the second PCIe slot is supposed to be when there is a discrete GPU plugged into the primary PCIe slot.

The Mellanox ConnectX-4 card is a PCIe 3.0 x16 card, which means that the slot is supposed to support up to 128 Gbps raw (about 126 Gbps after 128b/130b encoding), and the ports themselves are supposed to go up to a maximum of 100 Gbps out of that theoretically available bandwidth. Even if the slot were running as PCIe 3.0 x4, it should be capable of 32 Gbps.

As the results show, clearly, that is not the case: the average bandwidth tops out at about 14 Gbps.
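To put numbers on that, here is a minimal Python sketch of the PCIe arithmetic. The per-lane rates and encoding efficiencies are the standard figures for each PCIe generation. The function name `pcie_gbps` and the list of link configurations are illustrative choices, and 14.19 Gbps is the best average from the test above. It doesn't say what the slot actually negotiated, only what each possible link could carry at most.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency for each PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_gbps(generation, lanes, include_encoding=True):
    """Upper bound on one-direction PCIe bandwidth in Gb/s (before protocol overhead)."""
    rate_gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    raw = rate_gt_per_s * lanes
    return raw * efficiency if include_encoding else raw


MEASURED_GBPS = 14.19  # best "BW average" value in the test output above

# Illustrative link configurations to compare against (not a claim about this board).
for generation, lanes in [(3, 16), (3, 8), (3, 4), (3, 2), (2, 4)]:
    limit = pcie_gbps(generation, lanes)
    verdict = "above" if limit > MEASURED_GBPS else "below"
    print(f"PCIe {generation}.0 x{lanes}: {limit:6.2f} Gb/s ({verdict} the measured {MEASURED_GBPS} Gb/s)")
```

On Linux, the negotiated link speed and width should also show up in the LnkSta line of `sudo lspci -vv` for the card, which would be a more direct check than inferring it from throughput.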

I'll have to see if I can run both of those systems without the discrete GPU, so that I can plug the Mellanox cards into the primary PCIe slot.


*Update 2022-06-14*:

So I took out the discrete GPUs from both systems and put the Mellanox card into the primary PCIe slot and this is what I get from the bandwidth test results:

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0c QPN 0x008c PSN 0x5ccdd5
 remote address: LID 0x05 QPN 0x010a PSN 0x178491
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          100000           0.000000            0.066552            4.159479
 4          100000           0.00               0.11              3.529205
 8          100000           0.00               0.27              4.225857
 16         100000           0.00               0.54              4.254547
 32         100000           0.00               1.09              4.254549
 64         100000           0.00               2.19              4.276291
 128        100000           0.00               4.51              4.408332
 256        100000           0.00               9.21              4.498839
 512        100000           0.00               18.60             4.540925
 1024       100000           0.00               36.74             4.485289
 2048       100000           0.00               75.76             4.623960
 4096       100000           0.00               96.55             2.946372
 8192       100000           0.00               96.57             1.473530
 16384      100000           0.00               96.58             0.736823
 32768      100000           0.00               96.58             0.368421
 65536      100000           0.00               96.58             0.184218
 131072     100000           0.00               96.58             0.092109
 262144     100000           0.00               96.58             0.046055
 524288     100000           0.00               96.58             0.023027
 1048576    100000           0.00               96.58             0.011514
 2097152    100000           0.00               96.58             0.005757
 4194304    100000           0.00               96.58             0.002878
 8388608    100000           0.00               96.58             0.001439
---------------------------------------------------------------------------------------



Ahhhh.....much better. That's more like it.
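A quick way to sanity-check these tables: the average bandwidth column should simply be the message size times the message rate. Here is a minimal Python sketch, where `bandwidth_gbps` is a hypothetical helper name and the assumption is that Gb/sec means 10^9 bits per second. The two rows are the 4096-byte lines copied from the two test runs above.

```python
def bandwidth_gbps(message_bytes, msg_rate_mpps):
    """Bandwidth in Gb/s from message size (bytes) and message rate (millions of messages per second)."""
    bits_per_second = message_bytes * 8 * msg_rate_mpps * 1e6
    return bits_per_second / 1e9  # assumes 1 Gb = 10**9 bits


# (label, message size in bytes, MsgRate[Mpps], reported BW average[Gb/sec])
rows = [
    ("NIC in second slot", 4096, 0.421746, 13.82),
    ("NIC in primary slot", 4096, 2.946372, 96.55),
]

for label, size, rate, reported in rows:
    computed = bandwidth_gbps(size, rate)
    print(f"{label}: computed {computed:.2f} Gb/s, reported {reported:.2f} Gb/s")
```

Looked at this way, the second run holds a message rate of roughly 4.2 to 4.6 Mpps all the way up to 2048-byte messages before the link saturates, whereas in the first run the message rate already starts dropping at 512 bytes.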

05 April 2022

Moral of the story: Do NOT buy from Asus. Intel is willing to offer a refund. Asus is not.

 As a follow-up to my previous blog post about the data corruption issue that I was experiencing with the Intel Core i9-12900K processor that was running on the Asus Z690 Prime-P D4 motherboard, Intel has offered a full refund on the defective unit whilst Asus has not.

So, moral of the story:

Don't buy from Asus.

I mean, clearly, if the interaction between the Intel Core i9-12900K and the Asus Z690 Prime-P D4 motherboard causes the system to spontaneously reset itself when I attempt to run memtest86 a second time, using the same four DIMMs that PASSED memtest86 in my AMD Ryzen 9 5950X system (which also uses an Asus motherboard), that's NOT a good sign of a reliable motherboard.

Asus was ONLY willing to offer an RMA repair. I told them that the CPU is in the process of being sent back, so even if they attempted to repair the board, I would have no way of verifying whether the issue is still there, because the CPU would've already been sent back, and I'm not buying another Alder Lake CPU from Intel only to give this problem the chance to repeat itself.

So, moral of the story:

Don't buy from Asus.

01 April 2022

memtest86 self-aborted "due to too many errors" -- My Intel Core i9-12900K on an Asus Z690 Prime-P D4 motherboard regularly corrupting data

 So this happened:

I am currently using an Intel Core i9-12900K processor (purchased November 10th, 2021) on an Asus Z690 Prime-P D4 motherboard (purchased November 18th, 2021) and the system was finally assembled around Christmas time, 2021. So the system has only been running for about 3 months, and within those 3 months of normal, un-overclocked usage, this happens.

(I don't even use XMP.)

(I am using Crucial 32 GB DDR4-3200 unbuffered, non-ECC memory (Crucial part number: CT2K32G4DFD832A) - four sticks in total, for a total of 128 GB.)

memtest86 has aborted "due to too many errors".

Wow.

I've NEVER seen that message before.

"Too many errors."

10035 errors to be precise (before the test self-aborted).

Think about how bad the problem must be for the CPU and/or the motherboard to cause memtest86 to self-abort the test "due to too many errors".

I am in the process of trying to get a refund from Intel via an RMA, because the processor has to be royally screwed up to produce 10035 errors in memtest86 before memtest86 self-aborted, and also a refund on the motherboard from Asus (also under an RMA).

(Sidenote: I tested the same four sticks of memory in my AMD Ryzen 9 5950X system with an Asus TUF X570 Gaming Pro WiFi motherboard and it passed memtest86 with zero (0) cumulative errors, which is how I know that the problem is NOT with the memory.)

Asus, so far, has issued an RMA number for a board repair, but the problem is that if I send the CPU back to Intel and Intel issues the refund, then I won't have a CPU to test the Asus motherboard with once it comes back, to see whether the issue has been resolved or not.

And I don't want to play the game where I am just making the parcel delivery companies rich by constantly sending stuff back and forth in order to try and get this taken care of.

Stay tuned for this saga.

27 March 2022

Minisforum HX90 Conclusion

In order to conclude the saga that was the Minisforum HX90, I ended up trying out Pop!_OS 21.04 from System76. At first, the results looked promising because I was able to install Steam and VirtualBox, import all of my VMs, and get them up and running. I never got around to testing the games in Steam though.

Unfortunately though, what initially appeared to be a success still ended up a failure.

The system did eventually freeze at least once, at which point it was clear that there is something wrong with the hardware, the engineering, compatibility, and/or the software running on it.

I don't have the tools to diagnose the root cause of the issue, even when I had Pop!_OS installed on the NVMe SSD. As such, I have already sent the SODIMM RAM back for an RMA, and I am currently in the process of trying to do the same with the HX90 itself.

This is a bummer/shame because I was really hoping that said HX90 would be able to take the place of my former Intel NUC, be more performant, and not have the same kind of thermal throttling issues that are wayyy too common in my Intel NUCs.

Sadly, that just didn't turn out to be the case.

So now I have my old Intel Core i7-6700K taking on the duties that were originally designated for the HX90 and I have bought two sets of 2x 16 GB DDR4-3200 Kingston HyperX Fury RAM modules (4 DIMMs total, 64 GB total), in the hopes that I would be able to upgrade the RAM in the 6700K system, and make that take on those duties instead.

We shall see how that goes.

24 March 2022

Still working on the Minisforum HX90

 About two weeks ago, my Minisforum HX90 finally arrived and I was able to get the system up and going.

So far, it's been a bit of a mixed bag.

The system is actually VERY performant and I don't have any real complaints in that regard. However, with the way that I had it set up, where the system was hosting 9 VMs, it started freezing daily, which necessitated a hard power cycle, only for it to freeze again the next day, and the next, etc.

So between last night and this morning, I was trying alternative operating systems to see if I would be able to get said Minisforum HX90 to be stable.

Proxmox VE 7.1.2 would install and it would pick up the onboard Intel I225-V 2.5 GbE NIC, but after the system rebooted post-install, said NIC WASN'T available and I couldn't quickly discern the root cause of that issue. Tried installing it again. Same problem.

So Proxmox was a bust.

Next I tried Ubuntu 20.04 LTS. It installed, but I wasn't able to get Oracle VirtualBox 5.2 installed and working the way that it is supposed to, so that failed.

Then I tried downgrading to Ubuntu 18.04 LTS, figuring "okay, at least I should be able to get Oracle VirtualBox 5.2 installed." Well, that part was true, except that Ubuntu 18.04 was too old and didn't recognise the integrated Radeon GPU that is on the AMD Ryzen 9 5900HX processor that is in the HX90. The maximum resolution that it would display was 800x600. So, after getting Oracle VirtualBox 5.2 installed, I figured "okay, maybe I can upgrade the system from here and that should give me the proper resolution back".

Nope.

I updated and upgraded to Ubuntu 20.04 from 18.04 and not only did I NOT get the proper screen resolution back, I also lost connectivity to the 2.5 GbE NIC which was, ironically, working in 18.04 before.

So, let's just say - trying to make the system stable has been a complete and utter nightmare.

I've got a fresh install of Windows 10 21H2 now (well...I think that the installer was actually 20H2, but then I was able to run Windows update to update it to 21H2), so hopefully, that will be able to help stabilise the system, but we shall see. I'm in the middle of re-installing all of my Windows applications along with re-importing the Oracle VirtualBox VMs back into VirtualBox.

And if that doesn't work, it would be such a pity because the system has a LOT of potential, but if it doesn't work, I'll likely end up RMAing the system back to Minisforum, and then just buying 4x 16 GB of DDR4-whatever RAM (whatever is the most cost efficient, which, perhaps ironically, might be DDR4-3200), install that back into my Intel Core i7-6700K system, and use that system to host all of the VMs once again instead.

It won't be as fast as the AMD Ryzen 9 5900HX, but hopefully, at least it'll work and it won't freeze on me daily.

Hopefully.

23 February 2022

Vastly differing results in WSL2 between 5950X and 12900K in Windows 10 21H2

A little while ago, I came across this video which was talking about how you can run Linux graphical applications natively in Windows (more specifically, in Windows 11).

However, at the time when I watched said video, I didn't have any hardware that could really run that properly. My "newest" system was an Intel Core i7-6700K and, as far as I know, it didn't have a Trusted Platform Module (TPM) anywhere (whether as an external add-on module or integrated into the motherboard firmware/BIOS).

So, I didn't really make much of it back then.

But since then, I've built both my AMD Ryzen 9 5950X system and my Intel Core i9-12900K system, and I figured that with some of the work that I needed the systems to do now over with, I had a little bit of time to do some more testing with them.

So I grabbed two extra HGST 1 TB SATA 6 Gbps 7200 rpm HDDs (one per system), threw Windows 10 21H2 on them, and proceeded with the instructions on how to install and configure Windows Subsystem for Linux 2 (WSL2). I installed Ubuntu 20.04 LTS (which really turned out to be 20.04.4 LTS), and proceeded to try and install the graphical layer/elements to it.

So that's all fine and dandy. (Well, not really because in both instances, neither of the systems was able to start the display and I can't tell if it is because I have older video cards in the system (Nvidia GeForce GTX 980 and a GTX 660 respectively - because as a CentOS 7.7.1908 system, it didn't really matter what I had in there since I was going to remote in over VNC anyways).)

But, since I had it installed, AND by some miracle, Windows 10 picked up on the Mellanox ConnectX-4 dual port VPI 100 Gbps Infiniband cards automatically, I just had to manually give each card in each system an IPv4 address so that it can talk to my cluster headnode (which was still running CentOS along with the OpenSM), and connect up to the network shares that I had set up. (SELinux is a PITA. But I got Samba going on said CentOS system so that on the Linux side, it can connect up to the RAID arrays using NFS-over-RDMA whilst in Windows, it's just through "normal" Samba (i.e. NOT SMB Direct).)

So, I might as well benchmark the systems to see how fast they would be able to write and read a 1024*1024*10240 byte (10 GiB) file.

And for fun, I also installed Cygwin on both of the systems as well, so that I can compare the two together.

Being that both systems were able to pick up the Mellanox ConnectX-4 card right away (I didn't have to do anything special, install the Mellanox drivers, etc.), I was able to connect up to my cluster headnode and the Samba shares were visible immediately. As a result, I was able to right-click on both of those shared folders and map them to network drives directly.

Now, in WSL2, I had to mount the mapped network drive using the command:

$ sudo mount -t drvfs V: /mnt/V

(Source: https://superuser.com/questions/1128634/how-to-access-mounted-network-drive-on-windows-linux-subsystem)

And then once that was done, I was able to run the following commands in both Ubuntu on WSL2 and also in Cygwin:

Write test:
$ time -p dd if=/dev/zero of=10Gfile bs=1024k count=10240

Read test:
$ time -p dd if=10Gfile of=/dev/null bs=1024k
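
For anyone repeating this, the elapsed time that `time -p` prints can be turned into a throughput figure with a few lines of Python. This is only a sketch: `parse_real_seconds` and `throughput_mib_per_s` are hypothetical helper names, and the sample timing text is a placeholder, not one of my measurements.

```python
FILE_BYTES = 1024 * 1024 * 10240  # bs=1024k x count=10240, i.e. 10 GiB


def parse_real_seconds(time_output):
    """Pull the 'real' (wall-clock) seconds out of POSIX-format `time -p` output."""
    for line in time_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "real":
            return float(fields[1])
    raise ValueError("no 'real' line found in the time output")


def throughput_mib_per_s(elapsed_seconds, total_bytes=FILE_BYTES):
    """Average throughput in MiB/s for total_bytes moved in elapsed_seconds."""
    return total_bytes / (1024 * 1024) / elapsed_seconds


# Placeholder text in the shape that `time -p` prints; NOT a measured result.
sample_time_output = "real 20.00\nuser 0.01\nsys 4.50\n"

elapsed = parse_real_seconds(sample_time_output)
print(f"{FILE_BYTES} bytes in {elapsed:.2f} s = {throughput_mib_per_s(elapsed):.1f} MiB/s")
```

Wall-clock ("real") time is the one that matters here, since the user and sys figures only count CPU time and not the time spent waiting on the network share.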

Here are the results:

Huh. Interrresting.

I have absolutely NO clue why WSL2 on the 5950X is so much slower compared to WSL2 on the 12900K.

But what is interesting though is that under Cygwin the speeds are close, with the 5950X being a little bit faster than the 12900K.

I decided to blog about this because, for those that might be working with WSL2, there is a possibility that the hardware that you pick MAY have an adverse performance impact.

I'm not sure who, if anybody, has done a cross-platform comparison like this before, and to be honest, I haven't really bothered to look for one either, because you might reasonably have expected that a performance difference like this wouldn't exist. But the results clearly show that there's a difference, and a rather significant one at that.

Please be aware of this, and do your own testing for your workload/case/circumstance if you get a chance to do so.

22 February 2022

Why is Intel keeping the overall physical dimensions of their Intel 670p Series 2 TB SSD a secret?

I recently submitted my order for a Minisforum HX90 (specs). I am looking to use it to replace my very hot Intel NUC that I had previously written about (it's back up to 100 C nominal now), and I might also offload all of the virtualisation duties from my Intel Core i7-6700K system onto this new system instead. I didn't know if said new system would support RAID0 with my two existing Samsung 850 EVO 1 TB SATA 6 Gbps SSDs that are no longer deployed in a system, so I figured that I was going to get a 2 TB NVMe SSD just to be safe, and I landed on this - an Intel 670p Series 2 TB PCIe 3.0 x4 NVMe SSD (specs).

Whilst browsing through YouTube, I stumbled upon a video where they were talking about NVMe SSDs, putting heatsinks on them, and how the drives would thermal throttle if they got too hot whilst being used/under load.

So, that got me thinking - should I start looking into getting an NVMe SSD heatsink of my own for this drive?

So, I reached out to the customer support at Minisforum (based out of Hong Kong, which is interesting because their first email back to me was written entirely in Traditional Chinese) and asked them about an SSD heatsink (because some of the review units that they've sent to tech YouTubers included an NVMe SSD with a heatsink pre-installed in the system). They told me that the total height that the HX90 can take, INCLUDING the NVMe SSD, is 7 mm.

So, ok. No problems, right? If I can find out the overall height of the Intel 670p Series 2 TB PCIe 3.0 x4 NVMe SSD, then I can figure out the maximum height of heatsink that the HX90 can accept, and then I can start to look into my purchasing options.

So then I reached out to Intel's customer support because, of course, lo and behold, the overall height of the Intel 670p Series 2 TB PCIe 3.0 x4 NVMe SSD isn't listed on their spec page.


Huh. No overall physical dimensions listed on Intel's website.

I asked Intel's customer service this basic question and told them that I needed it because the manufacturer of the computer had told me the maximum combined height of the SSD and heatsink, so that I can properly size and purchase said heatsink. Their customer service rep said that they understood why I was asking for this information, that they would need to do further research on the matter, and that they would get back to me. Okay. Not a big deal.

Well earlier today, I got an email from said customer service rep stating quote:


Why would Intel keep the overall physical dimensions of their product under an NDA?

So, at this point, it seemed awfully suspicious.

I told them that I am not asking on behalf of the company where I work, and therefore I have no idea if they have a signed NDA with Intel or not. (And frankly, that shouldn't matter because a customer should be able to ask for the overall physical dimensions of their product (and not the overall dimensions of the box/packaging that their product gets shipped in either).)

I then told them that I will just measure my drive when it arrives and that, as such, I will not be signing an NDA in regards to this.

Well, about 3 hours later, my drive arrived.

So, for those that are interested in knowing, the overall physical dimensions of the Intel 670p Series 2 TB PCIe 3.0 x4 NVMe SSD are:





Overall length: 80.12 mm
Overall width: 22.05 mm
Overall height: 2.0525 mm (average of 2.09 mm, 2.06 mm, 1.97 mm, and 2.09 mm)

So, in case you're out shopping for an NVMe heatsink for a small form factor (SFF) or ultra compact form factor (UCFF) build, now you know how much of the height budget the drive itself takes up.
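
Here is the arithmetic as a small Python sketch. The four height readings and the 7 mm limit are the figures quoted above. The variable names are illustrative, and the result assumes that nothing else (thermal pad, label, retention clip) eats into the stack height.

```python
# Four height readings of the Intel 670p 2 TB drive, in millimetres (from the measurements above).
height_readings_mm = [2.09, 2.06, 1.97, 2.09]

# Maximum combined SSD + heatsink height quoted by Minisforum for the HX90.
MAX_STACK_HEIGHT_MM = 7.0

average_height_mm = sum(height_readings_mm) / len(height_readings_mm)
heatsink_budget_mm = MAX_STACK_HEIGHT_MM - average_height_mm

# More conservative figure: size against the thickest reading rather than the average.
worst_case_budget_mm = MAX_STACK_HEIGHT_MM - max(height_readings_mm)

print(f"Average drive height: {average_height_mm:.4f} mm")
print(f"Heatsink budget (average): {heatsink_budget_mm:.4f} mm")
print(f"Heatsink budget (thickest reading): {worst_case_budget_mm:.2f} mm")
```

Any thermal pad sits between the drive and the heatsink, so its compressed thickness has to come out of that budget too.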

08 February 2022

A friendly reminder to periodically clean your NUC

I have an Intel BOXNUC8i7BEH (specs) and I have been using it to run a VM and also as a host system/unit.

Lately, it's been having issues where even when I tried to run it without the chassis (i.e. running it in an "open case" configuration), the temps were still hitting a peak of 100 C whilst downloading something in the VM and also with 12 Firefox tabs open on the host itself.

So, given that it was still running so hot, even with it running out of the case/in the "open case" configuration, I figured that I would shut the unit down, wait for it to cool off a bit, and proceed with further disassembling the unit.

Once I took the fan off, there was a LOT of dust that had been trapped where the inlet to the copper heatsink was, so I was able to clean that off with damp tissue paper.

And I also figured that since I had some Thermal Grizzly Kryonaut sitting around, I might as well also remove the plate that the heatpipes are connected to, clean off the old thermal paste that's on the CPU, and give it some new thermal paste whilst I'm at it.

Lo and behold, the current system, still doing exactly what it was doing before (picking up from where it left off when I powered down the system), is now sitting at a cooler 85 C or so.

Yay!

Moral of the story: remember to periodically clean your NUC!