By coincidence, I was just having a look at the same author's work on languages based on Interaction Nets. Incredibly cool work, although the main repos seem to have been quiet for the last couple of months? This work, however, is much older and doesn't seem to follow the same approach.
The author is working on a program synthesizer using interaction nets/calculus, which should be released soon. It sounds quite interesting:
https://x.com/VictorTaelin/status/1907976343830106592
WebMonkeys feels a bit like array programming: you create buffers and then have a simple language to perform operations on those buffers.
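If memory serves from the README (treat the exact API names as recalled, not verified), the flow looks roughly like this:

    const monkeys = require("WebMonkeys")();            // spawns a pool of GPU "monkeys"
    monkeys.set("nums", [1, 2, 3, 4]);                  // upload a buffer
    monkeys.work(4, "nums(i) := nums(i) * nums(i);");   // each monkey squares one element
    console.log(monkeys.get("nums"));                   // read it back: [1, 4, 9, 16]

As far as I can tell, the task string is its own small GLSL-flavoured language rather than JavaScript, which is what gives it the array-programming feel.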
HVM is one of the most interesting developments in programming languages that I know of. I just don't know if it will prove to be relevant for the problem space it is trying to address. It is a very difficult technology that is trying to solve another very complex problem (AI) by seemingly sight stepping the issues. It's like: you already have to know linear algebra and statistics to do ML, and they are saying: yes, and you have to know category theory too.
FYI, just in case you didn't know, it's "side-stepping," not "sight-stepping."
Thanks for introducing me to the concept of higher-order virtual machines.
The title should say 2016
Unfortunately this is not maintained since 2017: https://github.com/VictorTaelin/WebMonkeys/issues/26
Are there other projects doing something similar on current browsers?
Still a draft, experimental and not widely used[0], but WebGPU[1] will bring support for actual compute shaders[2] to the web.
It's much more low-level than these "web monkeys", but I'd say if you really need GPU performance instead of toy examples like squaring a list of numbers, you really need to go low level and understand how GPU threads and work batching work (a rough sketch of the same toy example follows the links below).
[0] https://developer.mozilla.org/en-US/docs/Web/API/WebGPU_API
[1] https://en.m.wikipedia.org/wiki/WebGPU
[2] https://webgpufundamentals.org/webgpu/lessons/webgpu-compute...
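To give a flavour of what that low level looks like, here's a rough, untested sketch of squaring a list with a WebGPU compute shader in JavaScript (assumes a browser exposing navigator.gpu). Even for the toy example you already have to think about buffer usage flags, workgroup sizes and dispatch counts:

    async function squareOnGpu(input /* Float32Array */) {
      const adapter = await navigator.gpu.requestAdapter();
      const device = await adapter.requestDevice();

      const module = device.createShaderModule({ code: `
        @group(0) @binding(0) var<storage, read_write> data: array<f32>;
        @compute @workgroup_size(64)
        fn main(@builtin(global_invocation_id) id: vec3<u32>) {
          if (id.x < arrayLength(&data)) {
            data[id.x] = data[id.x] * data[id.x];
          }
        }
      ` });

      // Storage buffer the shader reads and writes, plus a readback buffer for the CPU.
      const storage = device.createBuffer({
        size: input.byteLength,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
      });
      device.queue.writeBuffer(storage, 0, input);
      const readback = device.createBuffer({
        size: input.byteLength,
        usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
      });

      const pipeline = device.createComputePipeline({
        layout: "auto",
        compute: { module, entryPoint: "main" },
      });
      const bindGroup = device.createBindGroup({
        layout: pipeline.getBindGroupLayout(0),
        entries: [{ binding: 0, resource: { buffer: storage } }],
      });

      const encoder = device.createCommandEncoder();
      const pass = encoder.beginComputePass();
      pass.setPipeline(pipeline);
      pass.setBindGroup(0, bindGroup);
      pass.dispatchWorkgroups(Math.ceil(input.length / 64)); // one thread per element
      pass.end();
      encoder.copyBufferToBuffer(storage, 0, readback, 0, input.byteLength);
      device.queue.submit([encoder.finish()]);

      await readback.mapAsync(GPUMapMode.READ);
      const result = new Float32Array(readback.getMappedRange().slice(0));
      readback.unmap();
      return result;
    }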
With "going low level" do you mean leaving the browser all together and shipping a native application?
Although I currently don't need anything like this for work, the use case I see for GPU use in the browser is that it's often the easiest way to run a program on the user's machine - anything else requires an explicit install.
I meant to compare abstract-ish stuff (like these monkeys) vs. actual low-level work within the GPU realm, i.e. thinking in GPU architecture terms: appropriately choosing a workgroup[0] size, optimizing your buffer layouts for specific access patterns, knowing when and how to read/write from/to VRAM, when (or whether) to split into multiple stages, etc.
I see space for abstractions over this mess of complexity[1] but there's not a lot of room for simplification.
It's almost like thinking in bare-metal terms but the GPU driver is your interface (and the browser's sandbox of course).
Although WGSL is not that low-level itself (in the sense that you're not writing SPIR-V), that's for a good reason: it needs to be portable, and each vendor does its own thing, so the truly low-level details are often hardware dependent.
Going native will still help with performance AFAIK (the aforementioned sandbox has a cost for example) but I agree with you. I love the web as a platform.
[0] https://gpuweb.github.io/gpuweb/wgsl/#compute-shader-workgro...
[1] https://developer.chrome.com/docs/capabilities/web-apis/gpu-...
Maybe the guys here know:
Is there a little 3D/GFX/game engine (written in plain and simple C) strapped to a JavaScript interpreter (like QuickJS), without being buried in Apple's or gogol's gigantic and ultra-complex web engines?
Basically, a set of JavaScript APIs with a runtime for Wayland/Vulkan 3D, FreeType2, and input devices.
You can access the GPU without a browser using Deno[1] (and probably Node too if you search for it).
Not to be patronising here, but if you are looking for something that makes 3D/GFX/game programming easier without all the paralysing complexity, you should recalibrate how hard this is going to be.
[1] https://windowing.deno.dev/
You can use Node.js or Bun with bindings for stuff like raylib or SDL.
Examples:
https://github.com/RobLoach/node-raylib https://github.com/kmamal/node-sdl
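For reference, node-raylib exposes the raylib API pretty much one-to-one, so a basic window is roughly this (from memory of its README, untested):

    const r = require('raylib');

    r.InitWindow(800, 450, "node-raylib window");
    r.SetTargetFPS(60);

    while (!r.WindowShouldClose()) {
      r.BeginDrawing();
      r.ClearBackground(r.RAYWHITE);
      r.DrawText("Hello from Node", 190, 200, 20, r.LIGHTGRAY);
      r.EndDrawing();
    }
    r.CloseWindow();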
I assume OP mentioned QuickJS specifically because they're looking for a tiny runtime. Node and Bun aren't that.
Ahh, my bad. That makes sense.
I wonder if QuickJS not having JIT capabilities could have a noticeable impact on a bigger game.
And whether it could be compiled without requiring gcc or clang - with cproc, tinycc, scc, etc., for instance...
You could take raylib (https://www.raylib.com) and bolt quickjs to that.
https://datoviz.org will have a webgpu js backend in a year or so.
You can quite easily make bindings for raylib/sokol-gpu/bgfx from Bun
You could use Deno with WebGPU.
This is cool but doesn't actually do any heavy lifting, because it runs GLSL 1.0 code directly instead of transpiling Javascript to GLSL internally.
Does anyone know of a Javascript to GLSL transpiler?
My interest in this is that the world abandoned true multicore processing 30 years ago around 1995 when 3D video cards went mainstream. Had it not done that, we could have continued with Moore's law and had roughly 100-1000 CPU cores per billion transistors, along with local memories and data-driven processing using hash trees and copy-on-write provided invisibly by the runtime or even in microcode so that we wouldn't have to worry about caching. Apple's M series is the only mainstream CPU I know of that is attempting to do anything close to this, albeit poorly by still having GPU and AI cores instead of emulating single-instruction-multiple-data (SIMD) with multicore.
So I've given up on the world ever offering a 1000+ core CPU for under $1000, even though it would be straightforward to design and build today. The closest approximation would be some kind of multiple-instruction-multiple-data (MIMD) transpiler that converts ordinary C-style code to something like GLSL without intrinsics, pragmas, compiler-hints, annotations, etc.
In practice, that would look like simple for-loops and other conditionals being statically analyzed to detect codepaths free of side effects and auto-parallelize them for a GPU. We would never deal with SIMD or copying buffers to/from VRAM directly. The code would probably end up looking like GNU Octave, MATLAB or Julia, but we could also use stuff like scatter-gather arrays and higher-order methods like map reduce, or even green threads. Vanilla fork/join code could potentially run thousands of times faster on GPU than CPU if implemented properly.
The other reason I'm so interested in this is that GPUs can't easily do genetic programming with thousands of agents acting and evolving independently in a virtual world. So we're missing out on the dozen or so other approaches to AI which are getting overshadowed by LLMs. I would compare the current situation to using React without knowing how simple the HTTP form submit model was in the 1990s, which used declarative programming and idempotent operations to avoid build processes and the imperative hell we've found ourselves in. We're all doing it the hard way with our bare hands and I don't understand why.
Would your 1000 core CPU do well on neuroevolution?
Thank you, believe it or not I hadn't heard that word so it filled in a piece of the puzzle for me.
https://en.wikipedia.org/wiki/Neuroevolution
https://medium.com/@roopal.tatiwar20/neuroevolution-evolving...
That was the state of the art when I got my ECE degree in 1999. They were using genetic algorithms (GAs) to evolve initial weights for neural net (NN) gradient descent right before the Dot Bomb. Then pretty much all R&D was cut/abandoned overnight, and the 2000s went to offshoring jobs and shuttering factories which led to the Housing Bubble popping. IMHO that set AI research back at least 10 years, maybe 20, at least in the US. I know it derailed my career dreams.
I feel that GPUs, by their SIMD nature, do poorly with a large number of concurrent processes. For example, consider a population of GA agents running lisp, with each instruction of the intermediate code (icode) tree encoded as bits in a bitfield that evolves using techniques like mutation, crossover and sexual reproduction.
Some of those bits represent conditional branching. Ideally we'd want the opposite of a GPU, more like a transputer, on the order of thousands of independent arithmetic logic units (ALUs) with their own local memories, and perhaps able to use custom instructions. That way each lisp tree can execute concurrently in isolation. There may be branch-avoidance transformations from shaders that could help with this, to transpile the von Neumann SISD and MIMD code we're used to into the narrower SIMD so it can run on GPUs. Companies like Nvidia aren't touching this for some reason, and I don't know if it's just a blind spot or deliberate.
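Just to make the bitfield idea concrete, here's a toy sketch of the genome operations in JavaScript (my own illustration of mutation and single-point crossover, not tied to any particular GP system, and leaving out the lisp/icode decoding entirely):

    // Hypothetical bitfield genomes evolved by mutation and crossover.
    function randomGenome(length) {
      return Array.from({ length }, () => (Math.random() < 0.5 ? 0 : 1));
    }

    function mutate(genome, rate = 0.01) {
      return genome.map(bit => (Math.random() < rate ? bit ^ 1 : bit)); // flip bits at random
    }

    function crossover(a, b) {
      const cut = Math.floor(Math.random() * a.length);
      return [...a.slice(0, cut), ...b.slice(cut)];                     // single-point crossover
    }

    // One generation: score everyone, keep the fittest half, refill with mutated offspring.
    function step(population, fitness) {
      const ranked = [...population].sort((x, y) => fitness(y) - fitness(x));
      const parents = ranked.slice(0, population.length / 2);
      const children = parents.map((p, i) =>
        mutate(crossover(p, parents[(i + 1) % parents.length]))
      );
      return [...parents, ...children];
    }

The hard part, as the paragraph above says, is that evaluating thousands of these genomes means thousands of independently branching interpreters, which is exactly what SIMD hardware dislikes.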
This is one reason why I wanted FPGAs to take off, since they'd make great transputers. Unfortunately they stagnated due to proprietary compilers and a lack of profitable applications. Today it looks like microcontroller units (MCUs) might take their place, but they're still at least 4 orders of magnitude too small and 100 times too slow to be cost-effective.
Imagine having a 1 GHz CPU that can run 1000 concurrent UNIX processes running lisp that only communicate through sockets. Mutating them is trivial, and running them as unit tests to see which ones pass is also straightforward. Erlang and Go figured that out years ago.
It's so easy in fact to "just come up with the answer" this way, that I think this technique has been suppressed.
With that foundation and removing the element of time, it's easy to see that this could be a drop-in replacement for gradient descent. Then the work shifts to curating large data and training sets. I suspect that this is the real work of AI research, and that the learning technique doesn't matter so much.
I think of it as: GAs are good for getting to a local minimum error - finding a known-good state. NNs are better for hill climbing - exploring the local solution space.
Another way to say that is: corporations spent billions of dollars to train the first LLM MVPs with gradient descent and other "hands on" techniques, when maybe they could have spent thousands of dollars if they had used GAs instead and "let go".
Now that we're here, refinement is more important, so gradient descent is here to stay. Although I think of LLM blueprints as eventually fitting on a napkin and running on one chip. Then we'll orchestrate large numbers of them to solve problems in a coordinated fashion. At which point it might make sense to use techniques from GAs to create an orchestra of the mind, where each region of an artificial brain is always learning and evolving and bouncing ideas off the others.
Erlang figured this out to the point that you've essentially described the BEAM. You can have hundreds of thousands of small "processes" communicating through BEAM mailboxes on a modern machine. Elixir adds protocols and macros as improvements on Erlang, and if you really require lisp there is LFE, Lisp Flavored Erlang, which has true lisp homoiconicity.
Wow LFE is really cool, thanks for showing me another new thing! Admittedly I got pretty depressed about the lack of progress on this stuff and kinda checked out over the years. So my knowledge has gaps since the 90s.
-
I'll use this as an excuse to talk about another learning model besides NNs and GAs: Boltzmann Machines (BMs). For anyone unfamiliar with them, they can be thought of as a temperature minimization or gas diffusion model. Rather than gradient descent, simpler formulas like summations and random sampling can be used, so they are "big dumb algorithms" like GAs as opposed to "fickle algorithms" like NNs. They borrow concepts from Markov chains and simulated annealing. I'm having trouble finding a single approachable summary though:
https://en.wikipedia.org/wiki/Boltzmann_machine
https://www.geeksforgeeks.org/types-of-boltzmann-machines/
https://medium.com/@soumallya160/a-complete-guide-to-boltzma...
https://blog.paperspace.com/beginners-guide-to-boltzmann-mac...
https://deepai.org/machine-learning-glossary-and-terms/boltz...
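The core sampling step is short enough to sketch here, though. This is my own gloss in JavaScript (made-up names, not from the links above): each unit turns on with probability sigmoid((sum of weighted neighbours + bias) / temperature), and annealing just lowers the temperature over time.

    function sigmoid(x) {
      return 1 / (1 + Math.exp(-x));
    }

    // One Gibbs-sampling sweep over a tiny, fully connected Boltzmann machine.
    function gibbsSweep(states, weights, biases, temperature = 1.0) {
      const next = [...states];
      for (let i = 0; i < next.length; i++) {
        let input = biases[i];
        for (let j = 0; j < next.length; j++) {
          if (j !== i) input += weights[i][j] * next[j];
        }
        next[i] = Math.random() < sigmoid(input / temperature) ? 1 : 0;
      }
      return next;
    }

    // Simulated-annealing flavour: start hot (noisy), cool down so the
    // network settles into a low-energy configuration.
    function anneal(states, weights, biases, steps = 100) {
      let s = states;
      for (let t = 0; t < steps; t++) {
        const temperature = 10 * (1 - t / steps) + 0.1;
        s = gibbsSweep(s, weights, biases, temperature);
      }
      return s;
    }

No gradients anywhere - just sums and coin flips, which is what I mean by "big dumb algorithm".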
The main problem is that their fully-connected nature makes them difficult to train beyond a certain size. Restricted Boltzmann Machines (RBMs) try to overcome that, but their increased complexity makes them more like NNs.
I think that GAs could be useful for training fully-connected BMs by using evolution to decide which weights to update, similarly to your original question of whether GAs could be used in place of gradient descent. Unfortunately I can't find succinct articles about it because it's pretty fringe:
Using GAs to train BMs:
http://www.icj-e.org/download/ICJE-5-10-108-116.pdf
Using BMs to train GAs:
https://doc.lagout.org/science/0_Computer%20Science/2_Algori...
Towards combining GAs with BMs and possibly NNs:
https://www.sciencedirect.com/science/article/abs/pii/037596...
https://www.sciencedirect.com/science/article/abs/pii/016781...
Note the dates on those last papers: 1987 and 1995. These are not new ideas. Unfortunately they're paywalled and I can't find online PDFs for them.
I think what went wrong with LLMs is their fixation on NNs. The code smell for that is their exorbitant energy cost of training. I'd compare it to the high energy cost of proof-of-work crypto like Bitcoin vs proof-of-stake crypto like Ethereum. Or the high mental load of React vs Vue or especially htmx.
A programmer's productivity is proportional to the level of abstraction, in other words how much mental load can be offloaded to the tooling.
Which means that programming started regressing when we abandoned web for native mobile apps and doubled down on imperative programming with single-page applications around 2007, probably as a result of the Dot Bomb and lean approaches like Agile which resemble austerity, as opposed to those like Waterfall which resemble central planning.
Understandable from a business perspective, but tragic for the cause of pure research and human progress towards achieving self-actualization through freedom from labor.
Since efficiency didn't pan out, I see that as a sign that big dumb approaches like GAs deserve further study. My feeling is that all machine learning methods are actually equivalent and interconvertible, like with change of coordinates in math and transpiling. So we should use the conceptually simplest ones to avoid the complexity pitfalls of more complex ones that often lead to poor performance.
-
To put this in perspective, a neuron can perform about 1000 operations per second, or roughly 1/1000 of a megaflop, which is about as powerful as ENIAC:
https://jetpress.org/volume1/moravec.htm
Our mind is about 100 billion neurons that have the potential to connect to others very far away. So a 100 million megaflops or 100 teraflops (10^14 flops) computer should be able to simulate aspects of the human brain. We just reached 1 exaflops (10^18 flops) supercomputers 3 years ago, which is about 10,000 times faster than needed:
https://en.wikipedia.org/wiki/List_of_fastest_computers
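Spelled out, with the same rough assumptions as above:

    // Back-of-the-envelope check of the numbers above.
    const opsPerNeuron = 1e3;                  // ~1000 ops/s per neuron
    const neurons = 1e11;                      // ~100 billion neurons
    const brainFlops = opsPerNeuron * neurons; // 1e14 = 100 teraflops
    const exascale = 1e18;                     // current top supercomputers
    console.log(exascale / brainFlops);        // 10000 -> ~10,000x headroom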
So something has gone terribly wrong with the way we utilize computing power. I blame it on our fixation with narrow single-threaded performance leading to an inability to see outside-the-box solutions. For a tiny fraction of the effort we've expended, we could have built wide multi-threaded CPUs like the ones I mentioned, that more naturally handle the parallel computation of GAs and the human brain.
This is analogous to the problems we're facing when 1 billionaire has the wealth and resources of 1 million people. Even if they have an IQ of 200, they're still orders of magnitude less effective than all of those people solving a problem in a unified fashion. That's why systems-level problems like infant mortality, mass-starvation, global warming, etc aren't being solved. When we look deep enough, we start to see that everything is connected, and that it's impossible to talk about technological progress without political progress. But I digress.
-
Where I'm going with this is that whatever machine learning algorithm we settle on will have an underlying concept like evolution at its core. Each neuron is doing the best it can to find connection with the others and minimize its workload to maximize the resources available to it. Just like ants and fungi and human beings.
I believe that this evolution away from entropy towards complex structure is the essence of life, and that instinct, awareness, feeling and meaning - along with problem-solving, thinking, doing and reason - are two sides of the same coin.
Currently tech mainly addresses the right (masculine?) side of that coin, while the left (feminine?) side was barely touched on by companies like Apple and Atari early on. I mean masculine/feminine in the divine sense in all of us, not gender. Now we're immersed in a world of technological magic with virtually no understanding of our magical nature or how to wield that power responsibly.
Basically we're so distracted that we don't realize that we're divine beings with the power to change our outer reality through manifestation by being mindful of where we place our attention in our inner reality. That quantum effects bubble up into the real world like the butterfly effect, how life "finds a way" and subtly shifts probability to create outcomes favorable to it.
I believe that if we mimic this quantum-determinism bridge using highly parallel processors running on the order of 100 billion (10^11) threads, machines will acquire consciousness. Because at some level it won't matter if the smallest components are made of carbon or silicon.
In other words, artificial intelligence can be achieved by brute forcing sequential computing power (intellect/knowledge). But artificial consciousness can be achieved by brute forcing parallel computing power (intuition/wisdom).
This is the piece of the puzzle that's missing with LLMs and most other competing approaches. A GPU is basically a hugely wide processor running in the low hundreds of threads. It may never reach a level that could be mistaken for consciousness, because it can't explore a large number of potential solutions simultaneously. Unless we build transputers from FPGAs or find a way to transpile SISD and MIMD into SIMD to truly run billions of isolated threads with something like the BEAM/LFE you mentioned.
On that note, we also need advances in programming languages to play with this sort of intelligent machine. There are countless examples of running lisp in C, but almost no examples of running C in lisp. That's a huge problem, because the real world is imperative. We need to be able to tell computers our problems and the solutions we need using familiar formula-style syntax like C, but have the computer work with lisp's icode tree internally, like a spreadsheet, so that it doesn't get lost in the weeds.
This is where I would start, if I had the resources to do so. I'd write a functional imperative language that uses all const variables so that all logic is free of side effects and can be statically analyzed so that it translates directly to lisp and back, avoiding the complexities of borrow checkers like in Rust. But it would have every convenience method we've come to expect from languages like PHP and Javascript so that we can get real work done. And be decomposable into as many concurrent threads as possible from the start using a divide and conquer strategy, so that even uninspired/unoptimized code would run thousands or millions of times faster than the languages we're used to. And it would have a JIT to transpile most other languages to/from itself to create a functional/imperative bridge.
Then I'd write a runtime for GPUs to run those isolated threads. I don't know if this is possible, but if it isn't, that might create demand for true multicore CPUs.
Well, I've completely gotten off topic and revealed some of my most heartfelt dreams here, and I doubt that anyone will read this. I wish I could be as concise as you, because you conveyed so many insights in so few words.
Awesome stuff. Btw: "For one, the only way to upload data is as 2D textures of pixels. Even worse, your shaders (programs) can't write directly to them." With WebGPU you have atomics, so you can actually write to them.
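For example, a WGSL compute shader can declare a read_write storage buffer and bump it with atomics - something a WebGL 1 fragment shader simply can't do. Rough, untested sketch (a histogram kernel of my own invention, shown as the shader string you'd pass to createShaderModule):

    const histogramShader = /* wgsl */ `
      @group(0) @binding(0) var<storage, read> values: array<u32>;
      @group(0) @binding(1) var<storage, read_write> bins: array<atomic<u32>, 256>;

      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) id: vec3<u32>) {
        if (id.x < arrayLength(&values)) {
          // Many threads can safely write into the same buffer.
          atomicAdd(&bins[values[id.x] % 256u], 1u);
        }
      }
    `;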
So what are the practical use cases for this?