I've been using a shitty streaming website whose player interrupts the playback of a video in irregular intervals and presents a cryptic error message. I've started looking into the JavaScript code to see if I can't code up a work-around mechanism (basically debugging their garbage implementation), and of course (why actually?) their player code is also obfuscated.
And I've gotta say, employing an AI assistant has proven to be an invaluable help in trying to understand obfuscated code. It's actually really cool to take a function of gobbledegook JavaScript and ask the AI to rewrite it in a more canonical and easily understandable way, with inline comments. Of course, there are flaws every now and then, but the ability to do this has been such a game changer for reverse engineering, IMO.
I can even ask it to take a guess at better variable/function names, and the AI can infer from the code (maybe it has seen the unobfuscated libraries during training?) what the code is actually doing at a high level, turning something like e.g(e.g) into player.initialize(player.state), which is nothing short of amazing.
So for anyone doing similar work, I cannot recommend highly enough to have an AI agent as another tool in your tool belt.
Which AI agents did you use?
I've tried different ones, they all seem to do a great job.
Could you name a couple?
Out of curiosity (as someone disappointingly new to prompt engineering), what’s an example prompt you used with some success?
This seems like quite a lot of work to hide the code. What would the legitimate reasons for this be? It looks like it would make the program less optimized, and more complexity just leads to more errors.
I understand the desire to make it harder for bots, but 1) it doesn't seem to be effective, and bots seem to be going a very different route; 2) there have to be better, more effective ways. It's not like you're going to stop clones through this, because clones can replicate just by seeing how things work and reverse engineering blackbox-style.
A generous take would be that they have their own internal GUI tools that make it easier for non-programmers to set up visual elements in this. That was historically the reason to invent VMs like Flash. A less generous take would account for the enormous potential for hiding nefarious code inside such a thing, and account for the nature of the government which deployed it, and conclude that it was a national security / defense project disguised as a candy-coated trojan horse.
VM-based architectures are really common in the obfuscation space, which is why you have executable packers[1], JS packers[2] and bot management products[3][4] leveraging similar techniques.
As for why the obfuscation is needed: bot management products suffer from a fundamental weakness in that, ultimately, all of them simply collect static data from the environment, so it makes sense to make the collection steps as difficult to reverse engineer as possible. Once that is done, all you need to do is slightly change the structure of your script every few weeks and publish a new bundle, and you've got yourself a pretty unsubvertible* protection scheme.
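A minimal sketch of that "collect static data from the environment" step. All property and function names here are invented for illustration; real products probe far more signals and obfuscate every probe:

```javascript
// Hypothetical environment-signal collection, the kind of payload a bot
// management script assembles before letting a request through.
// In a browser these values would come from `navigator`/`screen`;
// here the environment is passed in explicitly so the sketch is testable.
function collectSignals(env) {
  return {
    ua: env.userAgent || "",
    webdriver: Boolean(env.webdriver),           // headless browsers often expose this flag
    plugins: (env.plugins || []).length,         // zero plugins is a weak headless hint
    screen: `${env.screenWidth || 0}x${env.screenHeight || 0}`,
  };
}

// A suspiciously bare environment, as a scripted client might present:
const payload = collectSignals({
  userAgent: "Mozilla/5.0",
  webdriver: true,
  plugins: [],
  screenWidth: 1920,
  screenHeight: 1080,
});
console.log(payload.webdriver); // true
```

The protection scheme then only has to obfuscate *how* these probes are made and rotate the payload schema between bundle versions, which is exactly what the VM layer buys.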
Regarding the "trojan horse", I think someone has yet to show proof that it's a JavaScript exploit.
(*Unsubvertible is obviously relative, but raising the cost of an attack from, say, $0.01/1000 requests to $10/1000 requests would massively cut down on abuse.)
[1] https://vmpsoft.com/
[2] https://jscrambler.com/
[3] https://github.com/neuroradiology/InsideReCaptcha
[4] https://www.zenrows.com/blog/bypass-cloudflare#_qEu5MvVdnILJ...
Making it harder for bots usually means driving up the cost for the bots to operate; if they need to run a headless browser to get around the anti-bot measures, it might take, for example, 1.5 seconds to execute a request compared to the 0.1 seconds it would take without them in place.
On top of that 1.5 seconds, there is also a much larger CPU and memory cost from having to run that browser, whereas a simple direct HTTP request is near negligible.
So while you'll never truly defeat a sufficiently motivated actor, you may be able to drive their costs up high enough that it makes it difficult to enter the space or difficult to turn a profit if they're so inclined.
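The throughput arithmetic can be sketched directly. The 0.1s/1.5s figures are the comment's illustrative numbers, not measurements:

```javascript
// Back-of-the-envelope model of the friction argument:
// same hardware, fifteen times fewer requests per hour.
const DIRECT_MS = 100;    // plain HTTP request
const HEADLESS_MS = 1500; // full headless-browser round trip

function requestsPerHour(msPerRequest) {
  return Math.floor(3600000 / msPerRequest); // 3,600,000 ms in an hour
}

console.log(requestsPerHour(DIRECT_MS));   // 36000
console.log(requestsPerHour(HEADLESS_MS)); // 2400
// And that's before counting the extra CPU/RAM a browser needs per
// concurrent session, which multiplies the hardware bill again.
```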
Google has been doing this since forever for reCAPTCHA. And, to be fair, it seems to be fairly effective for bot detection.
https://github.com/neuroradiology/InsideReCaptcha
> bots seem to be going a very different route
If the "very different route" means running a headless browser, then it's a success for this tech. The bot must now run blackbox JS, and this gives people a whole new avenue for bot detection, using the bot's own CPU.
Makes it easier to hide code that does browser fingerprinting.
Very impressive work! I always enjoy a good write up about reverse engineering efforts and yours was really simple to follow.
Many popular/large websites and bot protection services usually have environment checking as a baseline and mouse-movement tracking in some of the more aggressive anti-bot checks.
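As a toy illustration of the mouse-movement idea (a hypothetical heuristic, not any vendor's actual check): scripted pointer paths are often perfectly linear, while human ones jitter.

```javascript
// Flag a pointer trail as scripted if every point lies exactly on the
// line through the first two points — a collinearity check using
// integer cross-multiplication, so no floating-point tolerance needed.
function looksScripted(points) {
  if (points.length < 3) return false;
  const [a, b] = points;
  const dx = b.x - a.x, dy = b.y - a.y;
  return points.every(p => (p.x - a.x) * dy === (p.y - a.y) * dx);
}

console.log(looksScripted([{x: 0, y: 0}, {x: 5, y: 5}, {x: 10, y: 10}])); // true — perfectly linear
console.log(looksScripted([{x: 0, y: 0}, {x: 5, y: 6}, {x: 10, y: 11}])); // false — human-like jitter
```

Real trackers look at timing, acceleration, and curvature as well, which is part of why the collection code is worth obfuscating: once an attacker knows exactly which properties are measured, synthesizing plausible values is easy.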
It's always interesting to see how long it takes from when the measures have been defeated/publicised until the service ends up making changes to their mechanism to make you start over (hopefully not from scratch).
There is no legitimate reason for a social media platform to employ this much obfuscation.
If you believe this, you underestimate how adversarial the software world really is. TikTok will be on the receiving end of botnets from everything from commercial entities to state-backed groups and criminals.
They won't be betting that this stops it entirely, but it adds a layer of friction that is easy for them to change on a continuous basis. These schemes are also very good for leaving honeypots: if someone is found to still be using something after a change, you can tag them as a bot or otherwise hacking. Both approaches are also widely used in game anti-cheat mechanisms, and as shown there, the lengths people will go to anyway are completely insane.
The legitimate reason could be bot protection, the same way recaptcha uses a similar VM technique for obfuscation.
You not being able to come up with one is different from there not being any possible reason.
See my other comment on this thread: https://news.ycombinator.com/item?id=43748994
It's to keep bots away and not turn into another Twitter.
That's probably not the goal. There are bots advertising illegal services (e.g. ads for "hacking services", illegal drugs) in most comment sections. If you report these comments, 99.9% of the time the report will be rejected with "no violations found" and the spam stays up.
That doesn’t mean that it’s “probably not the intention”.
The balance of evidence suggests otherwise. If they cared about spam bots they would take action when spammers are handed to them on a silver platter. The kinds of spammers who will leave 30 identical comments advertising illegal services, not some weird moderation corner case.
If you ever end up on a video that's related to drugs, there will be entire chains of bots just advertising to each other and TikTok won't find any violations when reported. But sure, I'm sure they care a whole lot about not ending up like Twitter.
This is not a social media platform but a government backed tool for doing stuff for the government.
This is cool. I briefly worked on a TikTok bot a while back and it was a huge pain in the ass.
Is this VM somehow related to Lynx (their cross-platform dev tooling)?
https://lynxjs.org/
Also discussed on HN
https://news.ycombinator.com/item?id=43264957
Is there also a VM in their iOS app? I thought a VM would be against Apple's policies?
Apple's policies prevent using JIT compilation; they don't ban VMs outright.
Is TikTok so obfuscated to prevent people from knowing the full extent of data collection and device fingerprinting?
What's terrible are the humans writing such software...
But if AI can help to fight those people's work, good for humanity I guess.
That said... is AI going to de-obfuscate/reverse engineer their obfuscated AI prompts or web apps?
Looks like a lot of work. I recently discovered webcrack and jehna/humanify for such deobfuscation tasks.
It could be interesting to see a comparison with OP's work.
That's a very strong obfuscation. Takes a lot of work to deobfuscate such a thing. Great writeup.
Is calling a massive embedded JS obfuscator a "VM" a bit of a stretch? Ultimately it's not translating anything to a lower-level language.
Still, I had no idea. This is really taking JS obfuscation to the next level.
One kind of wonders, what is the purpose of that level of obfuscation? The naive take is that obfuscation is usually to protect intellectual property... but this is client-side code that wouldn't give away anything about their secret sauce algorithm.
> Is calling a massive embedded JS obfuscator a "VM" a bit of a stretch? Ultimately it's not translating anything to a lower-level language.
From the Repo's README:
"TikTok is using a full-fledged bytecode VM, if you browse through it, it supports scopes, nested functions and exception handling. This isn't a typical VM and shows that it is definitely sophisticated."
VM obfuscation is a common technique for malware developers.
The VM term is applied because the obfuscator creates a custom instruction set and executes custom byte code. This is generated per build.
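A toy sketch of the idea. The opcode numbers below are arbitrary; a real obfuscator emits a much larger instruction set and a fresh opcode mapping on every build, so a disassembler written against one bundle is useless against the next:

```javascript
// Minimal stack-based bytecode VM of the kind an obfuscator embeds:
// the original logic ships only as bytecode for this made-up ISA,
// and the interpreter is all a reverse engineer sees at first.
const OP = { PUSH: 7, ADD: 3, MUL: 9, RET: 1 }; // per-build mapping

function run(bytecode) {
  const stack = [];
  let pc = 0;
  while (pc < bytecode.length) {
    const op = bytecode[pc++];
    switch (op) {
      case OP.PUSH: stack.push(bytecode[pc++]); break;          // push immediate operand
      case OP.ADD:  stack.push(stack.pop() + stack.pop()); break;
      case OP.MUL:  stack.push(stack.pop() * stack.pop()); break;
      case OP.RET:  return stack.pop();
      default: throw new Error("bad opcode " + op);
    }
  }
}

// (2 + 3) * 4, expressed in the custom instruction set:
const program = [OP.PUSH, 2, OP.PUSH, 3, OP.ADD, OP.PUSH, 4, OP.MUL, OP.RET];
console.log(run(program)); // 20
```

TikTok's VM adds scopes, nested functions, and exception handling on top of this skeleton, which is what makes it "full-fledged" rather than a simple expression evaluator.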
You are replying to a comment that looks extremely non-human.
It looks like OP filled out the text area along with the URL when submitting the post.
HN takes that text and turns it into a comment. I’ve seen it happen before.
The unfortunate outcome of that, IMO, is that text that makes sense as a description of a submission sometimes feels a bit out of place as a comment because of how it is worded. And these comments sometimes end up getting downvoted.
I wouldn’t be completely sure it was not human-written, even though it feels a bit weird to read as a comment.