Monitoring my Minecraft server with OpenTelemetry and Prometheus

94 points by mmanciop 2 months ago

My kids demand SLOs stricter than Moon exploration technology, so I had to monitor our family’s Minecraft server like a pro. As luck would have it, I am one.

darknavi 2 months ago

> Microsoft made things confusing by adding the Bedrock server, which reportedly uses a combination of C, C# and Java,

No C# in Bedrock. No Java unless you're talking about the Android versions. Very little C.

It's mostly C++.

mmanciop 2 months ago

Thanks for setting me straight :-) I updated the article to reflect that.

doabell 2 months ago

> I am a man of simple tastes, and running the “vanilla” Minecraft server as a Systemd unit on a Linux VM in the cloud

Minecraft is famously under-optimized and needy in terms of CPU frequency. If running a vanilla (no server mods) version, then using something optimized, like PaperMC, is a better idea for datacenter VMs. (Until you need to dupe sand or something.)

The other route is installing a bunch of optimization mods - some really do help.

ehnto 2 months ago

People love to bother about Java MC performance, but I ran a modded Tekkit server for like 10 years on a base Digital Ocean VM. Shoutout to Digital Ocean for having no impactful changes for 10 years too. They give me a VM, I run the thing, life is good.
strogonoff 2 months ago

From my understanding, Paper and the like are good for Minecraft servers focused around specific mini-games (rather than freeform building), and are the only sensible choice for servers with many people (or not that many people, but really underpowered hardware).
However, they may be a problem if players are sensitive to possible non-vanilla behaviour (as you mentioned, and it’s not limited to cheaty duping). Thankfully, spinning up a server with a selection of performance mods is very easy these days. Various tricks like pre-generating chunks in advance also help.
- treyd 2 months ago
  
  It's kinda nuts. The upstream Mojang server binary starts to groan if you have >4-5 players on the same server doing stuff. They've really been dropping the ball on optimization in recent years.
  Paper is good enough for anyone but very technical players pushing to the limits of redstone tick timing logic, entity behavior, chunk loading mechanics, etc. These don't matter even for advanced players doing normal things.
  - mmanciop 2 months ago
    
    I actually had to splurge on 2 vCPUs on Digital Ocean to avoid "skipping ticks", and it does sound pretty nuts to me. We play with 3 players at most. I would expect a server under such a load to be able to run on a slightly tuned-up toaster.
    
    strogonoff 2 months ago
    
    It is not cheap for the cloud. Had to use some beefy variety of EC2 medium instance for 4 players or so, with a simple dash for starting it up and terminating, I think using spot instance pricing. Otherwise it cost a pretty penny. At that point I did not use any performance mods, though.
    
    skrtskrt 2 months ago
    
    to be fair with the power on most people's laptops and phones now I think we tend to lose track of just how little "1 CPU" is if you're not just running like, a small web app.
  - frollogaston 2 months ago
    
    Wasn't it always like this? There's a lot going on in the game, especially if generating new chunks, and it's in Java.
    
    treyd 2 months ago
    
    It was not always like this. You used to comfortably be able to handle 70+ players in a single server before Paper existed (my memory of this is from before like 2015). You'd need to allocate a lot more memory than normal, like 8 gigs instead of the normal suggestion of 1 or 2, but it could handle it without regular lag.
    
    frollogaston 2 months ago
    
    I forget what heap setting I used, maybe it was 2G, but the old 2010 Mac mini I had as a server would lag if just one player was exploring land quickly (maybe by boat). Was online from 1.5 beta to 1.9 release, no more than 8 players usually.
    
    strogonoff 2 months ago
    
    I would say that without either setting your render distance to arm’s length, figuratively speaking, or allowing movement glitches and holes where terrain does not appear in time, “moving quickly while exploring” has pretty much not been a use case supported by the base game for a long time.
    
    frollogaston 2 months ago
    
    Right, or having some kind of entity-heavy autofarm, especially with version-specific bugs involved. Both things that in a moderately active server, someone will trigger.

strogonoff 2 months ago

Monitoring and metric collection makes a lot of sense when you run a production system, or a personal but critical system.

Promoting a telemetry solution when it comes to a hobby server, which you host for yourself and which can’t bankrupt you by running up a massive AWS bill, doesn’t seem to make much sense when simply bottling it up in Docker and being able to restart or recreate at will is enough (mount volumes for logs and persistent data, back it up, and you’re good).

With games like Minecraft in particular there’s value in being able to have multiple servers with different worlds, perhaps different mods, etc. If you decide not to have more servers because they are snowflakes you do not have time to set up monitoring for, then you rob yourself and your players of the opportunity to have more fun.

Furthermore, containerizing it allows you to upgrade quickly as new game versions come out, by simply spinning up a new container with your preexisting world as a test, and you get basic system resource usage monitoring built in.

What I think could be a more interesting exercise is a dashboard for friends or family that lets them manage the lifetime and configuration of their respective containers.

gmuslera 2 months ago

Implementing proper monitoring in a toy system doesn't prepare you to do it in a massive critical system, but at least you may have learned something in the process, and noticed things that may not be as evident at large scale.
In any case, the fun starts when the system has more interdependent components.
- strogonoff 2 months ago
  
  I think there is value in learning which pattern is good to apply in which scenario, and I will argue that in this case the best pattern is “servers are cattle”.
  - mmanciop 2 months ago
    
    One of the stretch goals for me writing this article was indeed to show between the lines how Prometheus Exporters, the OpenTelemetry Collector and Systemd can all work together. That is a very reusable skill on monitoring workflows running outside containers on Linux VMs or hosts.
jeroenhd 2 months ago

The goal of this article is to show you how to integrate with this service from just about anything. It's an ad that was fun to make as a hobby project. I doubt the goal was ever to set up a fully integrated Minecraft monitoring pipeline. At best, this is an employee at this company who just decided to show the flexibility of their product by integrating with a random piece of kit they like.
Luckily, all of the interesting components are existing third party libraries so if you don't want to use their SaaS service, you can build your own Minecraft dashboard pretty easily.
- mmanciop 2 months ago
  
  I am indeed an employee of Dash0. The setup for the telemetry collector will work with anything that accepts OTLP, and with minor adjustments the data can be sent elsewhere in other formats too, as the OpenTelemetry Collector is very flexible in that regard.
  Alerting is specific to Dash0. I know of no other monitoring solution that lets you run real PromQL on logs. But there will be similar ways of accomplishing the same alerting logic.
dpe82 2 months ago

Have you never just built something for fun?
- dengolius 2 months ago
  
  Do you mean something like launching k3s on smartphones https://blog.denv.it/posts/pmos-k3s-cluster/?
- strogonoff 2 months ago
  
  I have built a panel like the one I mentioned for fun with friends!
  The goal of my comment was to highlight opportunities for more fun and less what seems like toil.
  Furthermore, this is an article about a telemetry solution posted on a site of that telemetry solution. They make money from this.
  - dewey 2 months ago
    
    One person's toil is another person's fun.
    
    strogonoff 2 months ago
    
    And sometimes a person is paid to pretend toil is fun. We are talking about spending hours setting up telemetry instead of playing a game.
    
    dewey 2 months ago
    
    Not everyone is into gaming. I'd rather code on my side projects than use my console. Or people tweak and customize their Linux installation instead of doing work on it. Some people like to work on their cars; driving is a small part of it.
    
    strogonoff 2 months ago
    
    I agree, and I am as guilty of procrastination. However, the author is not really procrastinating—he gets paid for this. Me, I do in fact procrastinate on setting up a Minecraft server infra in the cloud. Maybe that’s precisely why the solution to this problem strikes me as inadequate:
    > So, the Minecraft server should work reliably and, if it goes down, I should know well before they do
    How are metrics helpful? There is so much fun that could be had in setting up an actually resilient system instead.
    Why worry over metrics and alerts when you could orchestrate an infrastructure that grants you the superpower of being able to spin up a server with a copy of the world on a whim instead (or even a system that auto-starts one whenever there is demand)?
    
    dewey 2 months ago
    
    You are somehow very negative about this piece and are not understanding that your definition of fun is not universal.
    As you said "There is so much fun that could be had in setting up an actually resilient system instead.", maybe the author has more fun setting up alerts and metrics instead of a resilient system like you do?
    The truth is that in most real-world scenarios getting alerts, metrics is much more important than building a fully resilient system (Expensive, maybe overengineering for early stage etc.).
    > However, the author is not really procrastinating—he gets paid for this.
    As the first sentence in the blog post says, "One of the secret pleasures of life is to be paid for things you would do for free.", which I can very much understand, as I often work or play with things I could use at work in my free time.
    
    mmanciop 2 months ago
    
    > As you said "There is so much fun that could be had in setting up an actually resilient system instead.", maybe the author has more fun setting up alerts and metrics instead of a resilient system like you do?
    Adding the backup for the world files, on top of already having Systemd bring back a crashing server, makes the setup rather resilient. Sure, there are infinitely more things that can go wrong, but with swiftly decreasing likelihood.
    > The truth is that in most real-world scenarios getting alerts, metrics is much more important than building a fully resilient system (Expensive, maybe overengineering for early stage etc.).
    This, very much this.
    > However, the author is not really procrastinating—he gets paid for this. As the first sentence in the blog post says "One of the secret pleasures of life is to be paid for things you would do for free.", which I can very much understand as I often work or play with things I could use at work in my free time.
    Yes :-)
    
    strogonoff 2 months ago
    
    > The truth is that in most real-world scenarios getting alerts, metrics is much more important than building a fully resilient system (Expensive, maybe overengineering for early stage etc.).
    Funny, because I have the opposite opinion. Build for failure first; if it’s critical/production then also monitor, but if an earthquake takes down an EC2 zone and you have no ability to spin it up exactly the way it was then the avalanche of alerts and metrics falling off a cliff[0] isn’t exactly going to help you (or your mental well-being).
    Generally speaking, if you build for failure first, then monitoring becomes much more useful and actionable; and simultaneously it becomes much less important for a hobby project.
    [0] That's assuming you gather them from a different zone that wasn’t affected by the same downtime in the first place; speaking of, how are you monitoring your monitors? And so on.
    
    dewey 2 months ago
    
    This thread isn't going anywhere. If your startup hasn't found paying customers there's no need to build earthquake-resilient software. For most businesses that are not billion dollar companies there isn't.
    Of course for engineers that's a nice challenge, but that's the reason why engineers without business sense have a hard time building their own companies: they prioritize perfect code and overengineered infrastructure over talking to customers or building the business.
    
    strogonoff 2 months ago
    
    I don’t think running a container, which takes one command and one small YAML file, is either overengineering or difficult.
    
    mmanciop 2 months ago
    
    > How are metrics helpful? There is so much fun that could be had in setting up an actually resilient system instead.
    Metrics are the means to an end of alerting. And with alerting, I mean getting pinged on my phone when something important breaks. Like, you know, the server going down.
    > Why worry over metrics and alerts when you could orchestrate an infrastructure that grants you the superpower of being able to spin up a server with a copy of the world on a whim instead (or even a system that auto-starts one whenever there is demand)?
    As somebody who has run cloud and enterprise software for almost two decades now, I can bet that needs monitoring too. The more moving parts there are, the more things go wrong. The more things go wrong, and the more you care they get fixed, the more monitoring you need :-)
    
    strogonoff 2 months ago
    
    Do you really need to be urgently made aware that it’s down, if the system could simply spin up a new container and keep on as it were? You could still see that it had to do it, and if in the mood investigate it, but the matter of first importance is taken care of for you.
    > As somebody who has run cloud and enterprise software for almost two decades now, I can bet that needs monitoring too
    To be clear, I strongly believe that if you run anything seriously in production, you must monitor it—but first you need to be able to spin it back up with minimal effort. It may take a while to get there if you just inherited a rusty legacy snowflake monolith that no one dares to breathe the wrong way near, but if you are starting anew it is a bad mistake to not have that down first considering how straightforward it is nowadays.
    Then, for hobby projects of low criticality (because people in this thread mistakenly assume I mean any personal project, I have to reiterate: nothing controlling points of ingress into your house or the like), you may find that once you have the latter, the former becomes optional and not really that interesting anymore.
    
    mmanciop 2 months ago
    
    I swear I had a lot of fun doing the setup.
    I am also a massive observability nerd, so YMMV :-)
    
    strogonoff 2 months ago
    
    I believe you! Just due to your affiliation I wanted to highlight to any newbie SREs in the audience that perhaps there is a better way. I still think my approach is better, but we can do things differently.
    
    mmanciop 2 months ago
    
    Indeed if there were “official” container images out there, I might have instead run the server on Google Cloud Run or AWS AppRunner, without having to take care of the Linux underneath. Or an Amazon ECS task. I don’t have a Kubernetes cluster, but I will at some point make a version of this blog to run it on K8s :-)
koinedad 2 months ago

I’ve recently added telemetry to some “toy” apps at my house because a power outage or other unforeseen issue has caused things like my Siri-enabled garage doors to stop working. Now I get alerts through Grafana and Telegram for basically free, which comes in handy.
- strogonoff 2 months ago
  
  A garage door is a security concern.
  For a game, a solution that simply restarts the container if it’s down solves the issue. You can mount game logs in a volume if you want, and you can see resource usage in the container host dashboard. What value do detailed system metrics bring?
  Furthermore, you don’t care what software you run to make your garage door system Siri-enabled, as long as it does its job and is not vulnerable; whereas with a game that adds new gameplay features multiple times per month, you do want to update it frequently. Babysitting a snowflake server makes it way more difficult than it should be.
ajmurmann 2 months ago

I am currently planning to add monitoring to some toy apps I host on a Raspberry Pi cluster. The intent is that this might save me time and stress further down the road. If a new version makes performance worse, I want to see that in the data. If resource needs go up, I want to know that before it's time to move, so that I can plan without any kind of scheduling stress. (I also want to do this in part as an exercise, which is partial motivation for the cluster and most things I built that run on it. But don't tell anyone!)
Am I misguided?
- strogonoff 2 months ago
  
  Well, as far as I’m concerned, if they are toy apps, why stress? If they are going to go in production at some point, then sure; but this certainly is not happening with a family game server.
  - ajmurmann 2 months ago
    
    Family game server going down can be very stressful, especially if you have kids.
    Also, I've had phone tech support sessions with family that were more stressful than calls with large banks who were worried about losing very large amounts of money in case of an outage. Different stressful, but nonetheless...
    
    strogonoff 2 months ago
    
    > Family game server going down can be very stressful, especially if you have kids.
    Telemetry does not address this, though. Shoving it into a container and assigning it a simple “restart if down” rule does. Minecraft is a flaky beast, if you run snapshots and/or mods. Metrics or not, often “start again” is all you need.
    Furthermore, this is a game that adds new gameplay features multiple times per month. If you do not update it frequently and your kid misses out on a new mob, you run into the same stress. Containerizing it makes the upgrade very straightforward, and once you run a couple of containerized instances… Do you not struggle to see the value of detailed system monitoring?
    
    mmanciop 2 months ago
    
    > Telemetry does not address this, though. Shoving it into a container and assigning it a simple “restart if down” rule does.
    A Systemd unit as shown in [1] does it too, without containers and with fewer moving parts. I use containers every day at $work. I have been using containers since before Docker was a thing. In this case, it's entirely overkill: Systemd units use the important things like cgroups already.
    For the upgrade: depends. You do need a container image regardless, and I have not seen official ones. Upgrading servers in Minecraft requires upgrading clients to match, and my kids prefer to play, more than upgrade. (Unless a biome is released. Then it must be immediately available to them.) But then again, I just need to download the binary with a cURL call. And if the configurations change, Docker won't help me there one bit anyhow.
    [1] https://github.com/dash0hq/minecraft-server/blob/main/drople...
    
    strogonoff 2 months ago
    
    There are no official ones (Microsoft profits from operating its own servers, why would it make things any easier), but there are community-maintained images.
    I found that the vanilla server is insufficient, and that the ability to declaratively define mods, the seed, OP players, etc. through the container environment is very important for iterative evolution, but of course this is individual.
    
    mmanciop 2 months ago
    
    Indeed.
    My personal definition of nanosecond is the time passing between the Minecraft server having a hiccup, and the first scream piercing the air.
    The printer not printing is DEFCON 1 material.
  - jauntywundrkind 2 months ago
    
    Seeing what computers are doing is good, actually. Period.
    
    strogonoff 2 months ago
    
    This is a real-time game. What the computer is doing is directly in front of your eyes.
    
    jauntywundrkind 2 months ago
    
    I know I sound like a freak to you, but you sound like a deranged freak to me too. Who would opt for ignorance? Who would opt not to have data? Who would opt not to see more? It's insanity to me to resist enrichment so.
    Limiting yourself to only naive senses is a wild proposition to me. The scientific mindset compels us to see further: it is a wild privilege to see more, to build and use tools that expand how we can see.
    
    strogonoff 2 months ago
    
    I don’t think you’re deranged. I do think this is a post about using telemetry 1) in (to me) excessive ways that defeat the point of the thing being measured 2) published on the website of a company that sells said telemetry solution.
    Furthermore, to me useless or excessive data is very much a reality (if you do not agree that it is a possibility and a thing that happens, we have clearly no way of understanding each other), and per my criteria it is just that sort of data in this use scenario.
    
    mmanciop 2 months ago
    
    To be fair, the setup of the article works with most modern observability solutions, in some cases just by replacing the endpoint and authentication token. Turning telemetry processing into a sort of utility is one of the great things that OpenTelemetry did. Now, among vendors, we compete on delivering insights on the telemetry, as opposed to just collecting it. If you are interested, I wrote about it a while back [1].
    About excessive telemetry, that depends on what you want to achieve. Using facilities in the OpenTelemetry Collector like [2], you can easily drop all telemetry you have no use for. At the risk of tooting my own horn, we actually provide super easy ways of doing the same dropping at no charge whatsoever to the end user in Dash0 [3].
    [1] https://thenewstack.io/is-otel-the-last-observability-agent-...
    [2] https://github.com/open-telemetry/opentelemetry-collector-co...
    [3] https://www.dash0.com/changelog/spam-filters
harrall 2 months ago

Setting up telemetry is really easy if you’ve done it before and it’s a learning opportunity if you haven’t.
I have Dockerfiles from 10 years ago for Grafana and a time-series DB so basically you learn it once and you can bang out basic telemetry infra in an hour afterwards.
And I still actually use InfluxDB and Grafana for my hobby stuff. My current Dockerfiles just look like my old ones…
- strogonoff 2 months ago
  
  What happens if Grafana or InfluxDB is down? Who monitors the monitors?
mmanciop 2 months ago

For this, I have the impression that https://github.com/dirien/minectl might be very close to what you are thinking. I did not try it, but I took the Minecraft Exporter from it and used it in the setup.

cpburns2009 2 months ago

> The minecraft-prometheus-exporter ... which uses Fabric, another way to run Minecraft servers with mods. Like Bukkit, Fabric was not an option for me.

Forge and its recent fork NeoForge are supported too.

Lirael 2 months ago

[dead]

Calliope1 2 months ago

[flagged]

Yasuraka 2 months ago

Why are you doing this?
- mmanciop 2 months ago
  
  Because I enjoy observability and monitoring a LOT, and because my kids nag me to hell and back when our home IT infrastructure is having a bad day.
  - Yasuraka 2 months ago
    
    I was asking some LLM spammer
acedTrex 2 months ago

Hi gippity