pointertovoid

Member
  1. Thoughts about the cables that connect the cabinets.

     Crosstalk seems low, around -40dB, at just 400µm stripline pitch for 100+100µm dielectric thickness; this needs double-checking. A 45mm cable width and 10 layers would carry 2*500 signals plus redundancy. The shield shall be grounded at the transmitter end and serve as a voltage reference at the receiver.

     All superconductors usable at 77K or 20K are brittle and excluded, I think. This leaves NbTi and Nb below a less pleasant 9K. Aluminium below 10K looks preferable: 1.5pOhm*m at the easily available 99.9998% purity. Over a dielectric with very low loss at 10K, it would transmit >40Gb/s over 30m in one hop. BUT the aluminium shield leaks too much heat from 77K to 10K at the ends, even if only 100nm thin there: the helium cooler would consume as much as the computing elements. The other heat leaks are manageable: the dielectric at the ends, the tubes with vacuum and Mli insulation for 30 cables.

     Data sources: NBS' technical note 365 [5.5MB] - NBS' circular 556

     Maybe many signals can use differential pairs and share one shield. But putting together the difficult design with helium, nitrogen, vacuum, stiff tubes... presently I clearly prefer cables at room temperature embedding 2-3 repeaters per metre. The design and operating constraints are lighter, and they consume less power.

     Marc Schaefer, aka Enthalpy
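     A quick check of the signal count in the cable geometry above, as a minimal C sketch. The 45mm width, 400µm pitch and 10 layers come from the post; treat it as back-of-envelope arithmetic, not a design tool:

     ```c
     #include <stdio.h>

     int main(void)
     {
         /* Geometry from the post: 45 mm wide cable, 400 µm stripline pitch,
            10 signal layers, target of 2*500 signals plus redundancy. */
         double cable_width_mm = 45.0;
         double pitch_mm       = 0.4;
         int    layers         = 10;

         int per_layer = (int)(cable_width_mm / pitch_mm);   /* 112 lines per layer */
         int total     = per_layer * layers;                 /* 1120 lines in all   */
         int needed    = 2 * 500;

         printf("%d lines per layer, %d lines total, %d needed -> %d spare for redundancy\n",
                per_layer, total, needed, total - needed);
         return 0;
     }
     ```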
  2. The US gov wanted in August 2015 an exaflops computer, but in January 2022 none exists: Fugaku achieved 0.442 EFlops on Linpack (64b). I proposed an architecture in November 2015, but meanwhile Cpu benefit from a 7nm process and Dram have grown
     scienceforums - msfn [Here just above]
     so here's a 2022 architecture for an exaflops computer.

     The eight 2.2GHz scalar Cpu stacked on 2GB Dram, as just described, cumulate 35.2GFlops (64b, FMAdd, peak everywhere) and 422GB/s. Alas, I keep the 2015 network design, with narrow copper lines and thin insulator stacked as thick rigid Pcb or as loose multilayer flexible Pcb. Repeaters are needed, in the cables too - but they retransmit every received bit immediately, of course, and don't wait for a complete packet. So do the switch matrix chips that route up to some 60 inputs to 60 outputs. Because the exaflops computer has fewer cabinets in 2022, the network's throughput drops to 70PB/s in each direction through any equator. That's one 64b word per scalar core every 70 cycles: much better than existing supercomputers, but far less than I'd like for easy programming.

     A copper line takes 0.5mm width pitch and 0.3mm thickness to carry several 10Gb/s. For the electric comms between the cabinets, I'd consider superconducting lines very seriously. Using capacitive coupling and chips, one 40mm*40mm connector transmits 2*500 electric signals. Fibres don't achieve that as far as I know. Usual plastic fibres are no competitors; thinner ones maybe, with coarse Wavelength Division Multiplexing over a few 10m. Monomode fibres seem better, if Dense WDM fits in a computer. I suppose a ribbon can integrate many fibres. Fibres would save the power-hungry repeaters. Fibre connectors are the limit: I didn't see nor imagine any small one. But if one fibre carries 500 wavelengths at 40Gb/s each, the connector can be Sma-sized. WDM on plastic fibres over 100m, fibre ribbons, tiny connectors... would be useful research topics, not addressed by the research for the Internet and needing disruptive improvements for computers.

     The compute boards integrate the improved Dram and Cpu in the same 40*50*2 amount and 2-D switch matrix organisation as on 22 November 2015
     scienceforums
     and one compute cabinet stacks 60 horizontal boards, with the previously described data vessels and capacitive connectors leading to vertical boards at the cabinet's back that make the 3rd dimension of the switch matrix. One cabinet cumulates 240 000 Dram for 480TB and 1 920 000 Cpu for 8.45PFlops. The 30 upper compute boards send data cables to an upper floor, the 30 lower ones to a lower floor, with each floor holding cables in the X and Y directions, which are the 4th and 5th dimensions of the switch matrix.

     Just 144=12*12 cabinets provide 1.22EFlops peak, which at 83% efficiency must be 1EFlops on Linpack, combined with 69PB of Dram at 15EB/s, and hopefully much Flash memory speed. I can even display true item numbers this time. Each compute board carries 32 connectors, and the switch boards must route the signals among the connectors at the same position on all 12 compute cabinets in a row or column of the floorplan. As switch boards carry 32 connectors too, each can manage 2 connector positions at the compute boards, so only 6 switch cabinets manage the 12 compute cabinets in a row or column.
     scienceforums - scienceforums

     While one floor is good for the compute cabinets at 1EFlops, the cables better spread over two floors, and then the switch cabinets can too. US companies can produce the necessary chips.
Marc Schaefer, aka Enthalpy aka Pointertovoid
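     As a cross-check of the cabinet and system totals above, a minimal C sketch. All counts are taken from the post (40*50*2 stacks per board, 60 boards per cabinet, 8 scalar cores per 2GB stack at 2.2GHz, one 64b FMAdd per core and cycle, 144 cabinets); back-of-envelope only:

     ```c
     #include <stdio.h>

     int main(void)
     {
         /* Figures from the post */
         long stacks_per_board  = 40L * 50 * 2;        /* 4 000 Dram stacks per compute board */
         long boards_per_cab    = 60;
         long cores_per_stack   = 8;                   /* scalar cores stacked on each 2GB Dram */
         double clock_hz        = 2.2e9;
         double flops_per_core  = 2.0 * clock_hz;      /* one 64b FMAdd = 2 Flops per cycle */
         double bytes_per_stack = 2e9;                 /* 2 GB */
         double bw_per_stack    = 8 * 3 * 8 * clock_hz;/* 8 cores * 3 words * 8 B ~ 422 GB/s */
         long cabinets          = 144;

         long   stacks_per_cab = stacks_per_board * boards_per_cab;      /* 240 000    */
         long   cores_per_cab  = stacks_per_cab * cores_per_stack;       /* 1 920 000  */
         double pflops_per_cab = cores_per_cab * flops_per_core / 1e15;  /* ~8.45      */

         printf("Per cabinet: %ld stacks, %ld cores, %.2f PFlops, %.0f TB Dram\n",
                stacks_per_cab, cores_per_cab, pflops_per_cab,
                stacks_per_cab * bytes_per_stack / 1e12);
         printf("System: %.2f EFlops peak, %.1f PB Dram, %.1f EB/s Dram throughput\n",
                cabinets * pflops_per_cab / 1e3,
                cabinets * stacks_per_cab * bytes_per_stack / 1e15,
                cabinets * stacks_per_cab * bw_per_stack / 1e18);
         return 0;
     }
     ```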
  3. One 4*2GB Hbm2 Dram stack reads 250GB/s, a true improvement over Dimm. So what computing power would it feed according to the golden rule? Two reads and one write of 64b each take 24B of transfers, so the throughput feeds 10G FMAdd/s. This suffices for the following core counts and clocks, with two equivalent options per Simd width:

     Avx-512 |             |  1 * 1.3GHz
     Avx-256 |  1 * 2.6GHz |  2 * 1.3GHz
     Sse     |  2 * 2.6GHz |  4 * 1.3GHz
     Scalar  |  4 * 2.6GHz |  8 * 1.3GHz

     That's less processing power than a desktop. Worse for a supercomputer, it's little processing power per unit area, even if the Cpu can sit on the Dram somehow. At least such a Cpu is easily made and it's frugal, even at 10nm. Depending on the Hbm2 latency, the Cpu still needs more or fewer caches. The Dram capacity is big at 2.6 Flops/Byte, so 4*1GB or 4*512MB would suffice if they provide the same throughput.

     ==========

     But what if a socket carries 8 Hbm2 Dram stacks around the Cpu? If the socket and Cpu handle the 2TB/s, the golden rule lets it feed a still modest Cpu:

     Avx-512 |  4 * 2.6GHz |  8 * 1.3GHz
     Avx-256 |  8 * 2.6GHz | 16 * 1.3GHz
     Sse     | 16 * 2.6GHz | 32 * 1.3GHz
     Scalar  | 32 * 2.6GHz | 64 * 1.3GHz

     ==========

     Or the throughput must increase, as I proposed above.
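     A minimal C sketch of the golden-rule arithmetic above. The 24 B per FMAdd and the 250 GB/s and 2 TB/s figures come from the post; the core counts it prints are the same trade-off as in the two tables:

     ```c
     #include <stdio.h>

     /* FMAdd rate (per second) that a given Dram throughput can feed when
        every FMAdd needs two 64b reads and one 64b write = 24 B of traffic. */
     static double fed_fmadd_per_s(double dram_bytes_per_s)
     {
         return dram_bytes_per_s / 24.0;
     }

     int main(void)
     {
         double one_stack   = 250e9;   /* one Hbm2 stack, 250 GB/s   */
         double eight_stack = 2e12;    /* eight stacks, about 2 TB/s */

         double fmadd = fed_fmadd_per_s(one_stack);
         printf("250 GB/s feeds %.1f G FMAdd/s: %.1f scalar cores at 2.6GHz, %.1f Avx-512 cores at 1.3GHz\n",
                fmadd / 1e9, fmadd / (1 * 2.6e9), fmadd / (8 * 1.3e9));

         fmadd = fed_fmadd_per_s(eight_stack);
         printf("2 TB/s feeds %.1f G FMAdd/s: %.1f scalar cores at 2.6GHz, %.1f Avx-512 cores at 1.3GHz\n",
                fmadd / 1e9, fmadd / (1 * 2.6e9), fmadd / (8 * 1.3e9));
         return 0;
     }
     ```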
  4. Chips have progressed since my last check on 17 December 2018, so here's an update.

     ---------- Dram update

     The capacity only doubled in 3 years, to 2GB=16Gb per chip at Hynix (even 3GB), Samsung, Micron. Chips stack using Through-Silicon Vias (Tsv). Fugaku has 100Flops/Byte because the faster Hbm2 on the processor socket limits the capacity; I won't imitate that. Other supercomputers offer 10-40Flops/Byte. I strongly advocate spreading the cores among the Dram for throughput and latency. One 2GB Dram can host 8 scalar cores (no Simd, one 64b FMAdd per cycle) at 2.2GHz for 18Flops/Byte. Each scalar core has a comfortable 256MB of Dram for OS subsets and application data. Or 16 scalar cores at 1.1GHz should save power.

     ---------- Flash update

     One 32GB Flash chip suffices for 2GB Dram. If a divided Flash chip stacked under the Dram can provide 600MB/s to each core, virtual memory saves the whole Dram in 0.4s, and data centres improve a lot too. If not, the 8 cores (or several sockets) can share a (stacked?) Flash socket nearby.

     ---------- Cpu update

     Intel is stuck at 14nm (10nm for chips not too big) and Amd buys its performance chips from a foundry, but Tsmc produces 7nm processors, and Samsung too. This improves the integration and power consumption over my latest update, and explains the Flops*2.6 from Summit to Fugaku: each A64fx Cpu socket holds 48 cores that compute sixteen 64b FMAdd (two 512b Simd) per cycle, while the whole Fugaku consumes 30MW, or 0.24W per 64b FMAdd unit at 2.2GHz - maybe 0.15W locally.

     ---------- Divided Dram, split Cpu

     A divided Dram has a shorter latency. 8 sections, with address and data access at their centres, cut the distances by sqrt(8). Diffusion time over RC propagation lines explains the latency well, and then the latency is cut by (sqrt(8))^2 = 8, so 50ns become 6-7ns, or some 15 Cpu cycles: excellent. The core needs only the L1 caches, no L2, no L3, saving almost half the silicon area. We could go further: 32 sections at 0.55GHz (wasting silicon area) would slash the Dram latency to about 2 Cpu cycles and eliminate the L1 too. That's excellent for applications accessing random addresses, like Lisp or expert systems. Then instructions with 3 arguments in Dram, as on the Vax-11, make sense again.

     To cut the distances on the Dram, each Cpu has its own chip and sits in the middle of a Dram section. Maybe some thick upper metal layers could transport signals quickly on the Dram, but I doubt enough signals fit on the area. I expect many <1mm^2 Cpu are cheaper than one huge many-core. A 14nm or 10nm process can produce it, at Intel etc, and then a slower clock and more Cpu would limit the consumption. Spare Dram sections can improve the production yield. A Bga socket suffices; I expect many of them come cheaper than one huge Lga. Cooling 8*0.15W is very easy. I displayed a switch matrix chip, the same as between the sockets, stacked on the Dram, but the Dram could integrate the function or the Cpu provide direct links.

     Stacking and splitting apply also to graphics cards, gamestations, desktops. Massive multitasking is necessary anyway and difficult enough. The scalar Cpu are easier to program and fit non-vectorisable applications. I hope one sequencer per scalar Cpu is small and frugal.

     ---------- Dram to Cpu interconnect

     For easy programming, I want to restore the "golden rule": the L1, L2... Dram throughputs match the Cpu needs.
     Such a loop
         for (j blahblah) s += a[j];
     or
         s += a[j] * b[j];
     or even
         c[j] = a[j] * b[j];
     can run within the L1d of present processors, within the L2 of some recent ones, maybe the L3, but not over the Dram, and this worsens over the generations.

     The older and smaller Xeon E5-2630L v2 computes on 4*64b (Avx-256) over 6 cores at 2.4GHz. a[j] alone needs 461GB/s; b[j] and c[j] too, depending on the program. But 4*64b Ddr-1600 supplies only 51GB/s. The Cpu must wait 9 cycles to obtain one a[j] from the Dram, or 27 cycles for a[j], b[j] and c[j]. The big Xeon Phi 7290 computes on 8*64b (Avx-512) over 72 cores at 1.5GHz; 6*64b Ddr-2400 takes 60 cycles to supply one a[j], ouch. The recent A64fx gets 1TB/s from Hbm2 Dram on the socket, better; but 2*8*64b (two Avx-512) over 48 cores at 2.2GHz would need 14TB/s, so the Cpu still waits 14 cycles for one a[j]. That's "Cy/T" in the table of 09 January 2022.

     To feed the Cpu, the programmer must find operations that use the data many times, write external loops that move data subsets to the caches, and internal loops that compute there. Slow L3 or L2 can demand intermediate loops. And some computations, like an Fft, just use the data too few times. Separate Dram and Cpu create the horrible bottleneck at the worst place, and Hbm2 Dram on the Cpu socket improves too little.

     My answer is to stack a Cpu over each Dram chip and connect them densely. I explained there how to transfer enough signals
     scienceforums
     here's the drawing again. The chips carry solder islands smaller than we can align, and after reflow the chips adaptively assign the signals to the obtained contacts. The tiny islands consume little power to transmit the signals. Capacitive coupling can't supply power easily, alas. If Tsv were narrow enough, a stack of Dram could carry more Cpu.

     In my example, a scalar 64b core needs 3*64b transfers per clock cycle. With some links and the power supply, that's 1000 contacts to the Dram at 2.2GHz (it can be faster). If each contact takes (5µm)^2, just (160µm)^2 suffices on each Dram section and Cpu. A 0.5µm etching suffices, even coarser for a demonstrator, and two Pcb make a proof-of-concept, so that's a nice university project. Ready?

     With an estimated 15 cycles latency at 2.2GHz, the Dram only has to access data 15*3*64 = 2880b = 360B wide to achieve the throughput. This relates to the L1 line size too. Close coupling between Cpu and Dram allows original, useful accesses at full speed, as I described there
     scienceforums
         for (j blahblah) L1d[j] = Dram[j*n];   // Matrices, struct and many more
     or
         L1d[j] = Dram[bitreversed(j)];         // For FFT
     or also
         L1d[j, bits k] = Dram[k, bits j];      // For database filtering
     A Dram access width that is a prime number of bytes or words avoids most collisions with periodic addresses. Some primes like 2^8+1 or 2^13-1 simplify the hardware. Bit shuffling works best with transfer sizes that are multiples of 64 words of 64b.

     Marc Schaefer, aka Pointertovoid aka Enthalpy
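     A minimal C sketch of two of the access patterns listed above, written as plain loops over arrays. The function and array names are mine, chosen for illustration; in the proposed hardware these gathers would run at full Dram speed instead of one cache miss per element:

     ```c
     #include <stdint.h>
     #include <stddef.h>

     /* Reverse the lowest 'bits' bits of j, e.g. for an FFT of size 1<<bits. */
     static uint32_t bitreversed(uint32_t j, unsigned bits)
     {
         uint32_t r = 0;
         for (unsigned b = 0; b < bits; b++)
             r = (r << 1) | ((j >> b) & 1);
         return r;
     }

     /* Strided gather: one column of an n-wide row-major matrix into l1d[]. */
     static void gather_column(double *l1d, const double *dram, size_t count, size_t n)
     {
         for (size_t j = 0; j < count; j++)
             l1d[j] = dram[j * n];                /* matrices, structs and many more */
     }

     /* Bit-reversed gather, as an FFT wants its inputs. */
     static void gather_bitreversed(double *l1d, const double *dram, unsigned bits)
     {
         for (uint32_t j = 0; j < (1u << bits); j++)
             l1d[j] = dram[bitreversed(j, bits)]; /* for FFT */
     }

     int main(void)
     {
         double dram[64 * 8], l1d[64];
         for (int i = 0; i < 64 * 8; i++) dram[i] = i;

         gather_column(l1d, dram, 64, 8);    /* l1d[j] = dram[8*j]                   */
         gather_bitreversed(l1d, dram, 6);   /* l1d[j] = dram[bitreversed(j)], 64 pt */
         return 0;
     }
     ```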
  5. I supposed it, and now I'm convinced: hyperthreading is a drawback in a supercomputer.

     A desktop runs much single-task software that demands a fast clock, and then hyperthreading combines this nicely with occasional multitasking. But a supercomputer always has enough tasks to load all cores, and in that case twice as many slow cores are better than one hyperthreaded core.

     Hyperthreading provides some 30% more speed by feeding the execution units with operations from two threads. Doubled execution units with a clock 0.65 times as fast achieve the same, and because each draws power like F^2.4, together they consume 0.71 times as much. Slower duplicated schedulers would also consume 0.71 times as much, but since hyperthreading adds complexity, two slower, simpler schedulers save even more power.

     Each thread needs capacity in the cache memories, so each of the doubled execution units needs just its half of the L1i, L1d, L2, L3... The chip area for caches is identical. Two small caches may respond faster than one big one, depending on the design. Even if not, a core 0.65 times as fast is stalled for 0.65 times as many cycles by memory latency, including from the Dram. Two slower cores stall less, hence run faster.

     Conceptually, a hyperthreaded core offers two register sets, but a core has about 180 dynamically renamed registers for parallelism, and this number depends more on the execution capability. So two slow cores would total more registers; the area is tiny anyway. The execution units stay as big each, so they total twice the area, which is small anyway. Unclear to me, as I've been out of this job for decades: how big is the scheduler? On pictures of vectored hyperthreaded chips, it seems huge. If hyperthreading less than doubles its complexity, two slow cores may need more area. But hyperthreading at least doubles the instruction decode, the splitting into µOP and their regrouping.

     Summarizing: two slower cores may together be slightly bigger or smaller than one hyperthreaded core, but they run faster and consume less. Will someone redesign a core just for one 1G€ supercomputer?
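     A minimal C check of the power argument above, assuming the stated power-versus-frequency exponent of 2.4 and a hyperthreading gain of 30%:

     ```c
     #include <stdio.h>
     #include <math.h>

     int main(void)
     {
         double ht_gain  = 1.30;    /* one hyperthreaded core ~ 1.3x one plain core   */
         double exponent = 2.4;     /* power scales roughly like F^2.4 (post's figure) */

         /* Two plain cores matching the hyperthreaded throughput need clock 1.3/2 = 0.65x. */
         double rel_clock = ht_gain / 2.0;
         double rel_power = 2.0 * pow(rel_clock, exponent);

         printf("Relative clock %.2f, relative power of two slow cores: %.2f\n",
                rel_clock, rel_power);   /* prints about 0.65 and 0.71 */
         return 0;
     }
     ```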
  6. Thanks Tripredacus! I haven't looked at Ssd for some time and still use some on a Sata cable at home, but better interfaces seem to exist now. Sata is available without an enclosure, and Pci-E delivers a throughput that fits the Ssd capabilities better. The designer of the compute boards might also buy new Flash chips and integrate them on the board; I didn't take the time to compare the many options.

     In this particular use, Flash is still a bit slow to make a good virtual memory for a number cruncher, which can hence still use remote Ssd and Sata cables. But a database or a file-server machine needs file performance, and for them the interface is paramount. The considered sockets commonly have 40 Pci-E 3.0 lanes each, of which 16 per computing board target the backplane Pcb. The other lanes are available, and Flash throughput is among the very best uses for them.
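     A rough C estimate of what those spare lanes represent, assuming the usual ~985 MB/s of usable throughput per Pci-E 3.0 lane and, for simplicity, one socket per board; the 40 and 16 lane counts come from the post:

     ```c
     #include <stdio.h>

     int main(void)
     {
         double gb_per_lane   = 0.985;   /* usable GB/s per Pci-E 3.0 lane, roughly */
         int    lanes_total   = 40;      /* per socket, from the post    */
         int    lanes_network = 16;      /* per board, to the backplane  */

         int lanes_free = lanes_total - lanes_network;
         printf("%d free lanes ~ %.1f GB/s available for Flash or other uses\n",
                lanes_free, lanes_free * gb_per_lane);
         return 0;
     }
     ```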
  7. The offer has evolved since July 2019, so here's an updated selection of second-hand server Cpu to assemble a supercomputer or a mega-server on my Pcb backplane.

     I checked only Intel, even though they stick at 14nm. Not Amd. Not used Arm Cpu. Horrible mistake? Still Pci-E 3.0 only, bad! Wider connectors (Dram?) may improve the network; a bigger Pcb backplane too. Newer Cpu still increase the tiny Dram bandwidth less than the compute power. Ruinous LGA3647 bring 6*ddr4: better twice as many LGA2011 sockets per board with 4*ddr4 each. Ruinous Skylake and Cascade Lake bring Avx512: better twice as many Avx256 sockets per board.

     Time squeezes prices. I report reasonably good bargains for small amounts on eBay. The E5-2630L v2 cost 40€ in July 2019 and 20€ now. The E5-2650 v4 brings 1.5* the E5-2630L v2 everywhere except Pci-E (*1) and price (*2). 4 or 8 server Cpu on one computing board can share the chipset and Bios and are affordable; alas, they must share the network.

       |  #  GHz (64b) | #  MT/s |   W | W/GHz     | Cy/T     | Qpi | PciE
     =====================================================================
     d |  6  2.4  (4)  | 4  1600 |  60 | 4.2 (1.0) | 2.3 (10) |  2  | 3.0  REF
     h | 12  2.2  (4)  | 4  2400 | 105 | 4.0 (1.0) | 2.8 (11) |  2  | 3.0  <==
     i | 12  2.3  (4)  | 4  1600 | 105 | 3.8 (1.0) | 4.3 (17) |  4  | 3.0
     j | 16  2.2  (4)  | 4  1866 | 140 | 4.0 (1.0) | 4.7 (19) |  8  | 3.0
     =====================================================================
     d = Ivy Bridge-EP  Xeon E5-2630L v2. LGA2011 20€.
     h = Broadwell-EP   Xeon E5-2650 v4.  LGA2011 40€.
     i = Ivy Bridge-EX  Xeon E7-4850 v2.  LGA2011 30€.
     j = Haswell-EX     Xeon E7-8860 v3.  LGA2011 40€.
     =====================================================================

     The first # is the number of cores, GHz the base clock, 64b is 4 for Avx-256 and 8 for Avx-512. The next # is the number of Dram channels, MT/s how many 64b words a channel transfers in a µs. W is the maximum design consumption of a socket, called TDP by Intel. W/GHz deduces an energy per scalar or (Simd) computation; it neglects the small gains in cycle efficiency of newer Core architectures. Cy/T compares the Cpu and Dram throughputs in scalar and (Simd) mode, ouch! It's the number of cycles the cores shall wait to obtain one 64b or (Avx) word from the Dram running ideally.

     So 768 compute boards carrying two E5-2650 v4 give you 0.32 PFlops (64b, Fmadd, Simd, peak). Four E7-4850 v2 give 0.68 PFlops. And eight E7-8860 v3 on long compute boards give 1.7 PFlops, at the edge of the Top500 supercomputer list, but with a better network.

     Marc Schaefer, aka Enthalpy
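     A minimal C sketch that recomputes the W/GHz and Cy/T columns from the socket data above (core count, clock, Simd width in 64b words, Dram channels, MT/s and TDP are copied from the table; the printed values match the table up to rounding):

     ```c
     #include <stdio.h>

     struct socket {
         const char *name;
         int    cores;
         double ghz;
         int    simd64;    /* 64b words per Simd operation: 4 for Avx-256 */
         int    channels;
         double mts;       /* 64b transfers per µs and channel */
         double tdp_w;
     };

     int main(void)
     {
         struct socket s[] = {
             { "E5-2630L v2",  6, 2.4, 4, 4, 1600,  60 },
             { "E5-2650 v4",  12, 2.2, 4, 4, 2400, 105 },
             { "E7-4850 v2",  12, 2.3, 4, 4, 1600, 105 },
             { "E7-8860 v3",  16, 2.2, 4, 4, 1866, 140 },
         };

         for (unsigned i = 0; i < sizeof s / sizeof s[0]; i++) {
             double dram_words = s[i].channels * s[i].mts * 1e6;   /* 64b words per second      */
             double cpu_words  = s[i].cores * s[i].ghz * 1e9;      /* scalar words per second   */
             double cyt_scalar = cpu_words / dram_words;           /* wait cycles per 64b word  */
             double cyt_simd   = cyt_scalar * s[i].simd64;         /* wait cycles per Avx word  */
             double w_per_ghz  = s[i].tdp_w / (s[i].cores * s[i].ghz);

             printf("%-12s  W/GHz %.1f (%.1f)   Cy/T %.1f (%.0f)\n",
                    s[i].name, w_per_ghz, w_per_ghz / s[i].simd64, cyt_scalar, cyt_simd);
         }
         return 0;
     }
     ```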
  8. Hello you all! Nice to meet you again.

     I had described how to assemble a small cheap supercomputer from used server parts, there:
     scienceforums and around
     It reuses second-hand Xeons E5-2630L v2 for instance, laptop Ddr3, Ssd with Sata (bad choice?), Atx power supplies. The computing Pcb (up to 768) are specially designed, as is the 1m^2 many-layer Pcb backbone. Pci-E signals and connectors make a network whose 3TB/s through any equator outperforms existing supercomputers of the same computing power. The network uses specially designed chips that make dense interconnection matrices from 32 to 32 Pci-E lanes and build a superior bidimensional matrix
     scienceforums
     Such chips exist already but may not be commercially available and may be too expensive to redesign; then a hypertorus avoids the matrices but divides the performance by 3.

     From 0.25M€ of hardware cost, you get 88 TFlops (peak, 64b, Fmadd) with less bad Dram throughput, superior network throughput and decent consumption. A very nice university project, both to develop and to use - and possibly to sell.

     ==========
     Now with A64FX

     I just checked the performance with a recent Cpu, the Arm-Fujitsu A64FX, four of them per computing Pcb, 3072 in total. The same backbone still provides as much throughput per Cpu as the Fugaku supercomputer does, so it's usable. The 2m*2m*2m cabinet then crunches 10 PFlops, enough for the 70th position in the supercomputer Top500 list
     A64FX – Fugaku – TOP500 on wikipedia
     Nov 2021 list on top500.org
     All intermediate designs are possible, say with four less old second-hand Xeons or one A64FX per computing Pcb...

     So, plug in the soldering iron? Have your very own supercomputer?

     Marc Schaefer, aka Pointertovoid aka Enthalpy
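     A minimal C check of the two peak figures above (88 TFlops for the used-Xeon build and about 10 PFlops for the A64FX variant), assuming one E5-2630L v2 per computing Pcb in the first case, which is what 768 boards at 6 cores, 2.4GHz and Avx-256 FMAdd work out to:

     ```c
     #include <stdio.h>

     int main(void)
     {
         /* Used-Xeon build: 768 boards, one E5-2630L v2 each (assumption),
            6 cores, 2.4 GHz, Avx-256 FMAdd = 4 * 2 Flops per cycle. */
         double xeon_peak = 768.0 * 6 * 2.4e9 * 4 * 2;

         /* A64FX build: 3072 sockets, 48 cores, 2.2 GHz,
            two 512b Simd pipes = 16 FMAdd = 32 Flops per cycle. */
         double a64fx_peak = 3072.0 * 48 * 2.2e9 * 32;

         printf("Used Xeon cabinet: %.0f TFlops peak\n", xeon_peak / 1e12);   /* ~88   */
         printf("A64FX cabinet:     %.1f PFlops peak\n", a64fx_peak / 1e15);  /* ~10.4 */
         return 0;
     }
     ```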
  9. Thanks Jaclaz! I'd need more evidence before involving little green men in this story (or elsewhere). And I'd like to know why a mobo with a new battery should lose the date. Or even two mobos. So I'm still interested in testimonies: did other people observe an abnormal clock reset on their mobo? Thank you!
  10. Hi everybody, nice to see you again!

      I had imagined that resetting a computer's clock might be a way to weaken its encrypted communications, especially if they are as poorly programmed as what the cited webpage gives as an example. Recently, two computers at close relatives of mine have shown abnormal clock resets.

      At my elderly mother's, the Cmos battery was empty. I replaced it with a new one from a reputed brand, bought in a sealed package a quarter of an hour before, I set the clock, and the computer worked properly. A week later, this computer lost the date again. The computer was far from me; I got the account from my mother. At my nephew's, who is comfortable with Pc hardware and can replace a battery, the computer lost the date despite a recent battery.

      I had expected a reset date like 01 Jan 2000 or 01 Jan 1970, as usual with an empty battery, but on my mother's machine it was something like 31 December 2001.

      So: did you observe something similar recently? The alternative explanation would be malware that sets the computer's clock, possibly to exploit the weakness I outlined above. I could admit a hardware defect draining the battery on one mobo, not on both.

      Thank you!
  11. Thanks Jaclaz! Yes, that's about what I plan.

      I already had good Apacer Cf cards, though I didn't use them for very long. Good experience with Transcend, EXCEPT that in their 300x 8GB they put Mlc chips without warning and contrary to their datasheet, which promises Slc. Recently, two Toshiba 32GB 1000x broke at the same time on two different readers - still not understood, maybe a high-tech Usb virus. So I consider Sata instead of Usb.

      I already had an adapter, or cartridge, for Ssd on Sata. My bad experience is that the connectors failed after months, with random bad contacts. Avoid them, everybody - unless someone can report a good specimen used for an extended period. Meanwhile, I connect and remove my Ssd on plain long cables daily, without a cartridge, and this has been reliable over years. So I hope to do the same with Cfast, as I do need a smaller format than Ssd.

      My main concern would be some incompatibility between signalling voltages, or a similar bad joke like the one that plagues Cf cards - you know, Pata ports at 5V and some Cf cards that need 3.3V for Udma, or even fail to transmit properly at 5V.

      More reports, opinions, comments?
  12. Hello everyone and everybody, nice to see you again!

      I consider switching from CF (on a USB reader) to CFast Flash cards. Has someone experience with them?

      From what I understand, the electric interface is Sata/6000 or Sata/3000 but the connector differs, needing an adapter card. Correct? Is there any bad joke with Sata and CFast, like the varied signalling voltages that plague CF cards? I plan to buy some Chinese adapter card, as they comprise essentially two connectors and copper lines in between, plus seemingly a regulator. Thoughts?

      The CFast cards I covet have 32GB and Sata/6000; my mobo has Sata/3000 from Intel's ich10r, run with W2k and Xp, maybe Seven some day if needed. Opinions? I can live without hot plug and unplug, and have some 3rd-party software to force-eject a disk. Other thoughts?

      Thank you!
  13. My answer is late and sketchy, sorry... I only tried Nt4 briefly, so this is more a repeat of what I've read than first-hand experience.

      Most slipstream tools for 2k and later rely on the slipstream capability that Microsoft built into the updates. This did not exist for Nt4. Nevertheless, some enthusiasts did write software to slipstream updates into Nt4, including the Sp6a. Do I remember that at least 3 such tools exist? I don't have my notes here.

      I saw a workaround, using an F6 diskette, to provide Lba access to a disk right from the beginning of an installation: possibly a service-packed version of atapi.sys on the diskette, plus some text files telling the setup to use it. Whether this can provide Lba48 too, I don't know.

      My answer may be 40dB behind what you already know... Apologies in advance.
  14. Not necessarily. The weakness results from the CPU imperfectly restoring its state when an exception occurs. Speculative execution makes the restoration difficult, but alone it doesn't imply a weakness. From Intel's list, the Core 2 for instance seems immune, with the design flaw beginning at the Core i3/i5/i7. I trust Intel's list (...which can evolve) better than arbitrary claims from other sources, which often rely only on the presence of speculative execution, a very old feature indeed.

      I wonder: exceptions occur much more frequently than just after a violation of memory protection, including during legitimate operation of the OS and applications. If the restoration of state is faulty, then the CPU must introduce erroneous behaviour into the machine. Hasn't this been observed before?
  15. In the past few hours, eBay and Paypal have shown the same symptoms of bad operation as during the last two attacks by Wannacrypt and its successor. Just in case a new wide attack has started, take your precautions!

