The one-liner:

dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz

This is brilliant.

top 50 comments
[–] palordrolap@fedia.io 4 points 1 year ago (5 children)

The article writer kind of complains that they're having to serve a 10MB file, which is the result of the gzip compression. If that's a problem, they could switch to bzip2. It's available pretty much everywhere that gzip is available and it packs the 10GB down to 7506 bytes.

That's not a typo. bzip2 is way better with highly redundant data.
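For illustration, the bzip2 equivalent of the one-liner would presumably be a drop-in swap (bzip2, like gzip, compresses stdin to stdout with -c):

dd if=/dev/zero bs=1G count=10 | bzip2 -c > 10GB.bz2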

[–] just_another_person@lemmy.world 2 points 1 year ago (2 children)

I believe he's returning a gzip HTTP response stream, not just a file payload that the requester then downloads and decompresses.

Bzip isn't used in HTTP compression.
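(A rough sketch of what that means: the pre-built file goes out as an ordinary HTTP response marked with Content-Encoding: gzip, so the client's HTTP library inflates it on receipt. Netcat flags vary between implementations; a real server would set these headers in its config.)

printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Encoding: gzip\r\nConnection: close\r\n\r\n' | cat - 10GB.gz | nc -l -p 8080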

[–] bss03@infosec.pub 1 points 1 year ago

For scrapers that aren't just implementing HTTP but are trying to extract zip files, you can possibly drive them insane with zip quines: https://github.com/ruvmello/zip-quine-generator, or otherwise compressed files that contain themselves at some level of nesting, possibly with other data, so that they recursively expand to an unbounded ("infinite") size.

Brotli is an option, and it's comparable to Bzip.

[–] sugar_in_your_tea@sh.itjust.works 1 points 1 year ago (1 children)

Brotli gets it to 8.3K, and is supported in most browsers, so there's a chance scrapers also support it.
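For comparison, assuming the brotli CLI is installed (it compresses stdin to stdout with -c, at maximum quality by default):

dd if=/dev/zero bs=1G count=10 | brotli -c > 10GB.br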

[–] Aceticon@lemmy.dbzer0.com 0 points 1 year ago (1 children)

Gzip encoding has been part of the HTTP protocol for a long time and every server-side HTTP library out there supports it. Phishing/scraper bots will be built with server-side libraries, not browser engines.

~~Further, judging by the guy's example in his article, he's not using gzip with maximum compression when generating the zip bomb files: he'd need to add -9 to the gzip command line to get the best compression (but it will be slower).~~ (I tested this and it made no difference at all).
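For reference, the struck-through suggestion would have amounted to this, which for all-zero input ends up essentially the same size:

dd if=/dev/zero bs=1G count=10 | gzip -9 -c > 10GB.gz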

[–] sugar_in_your_tea@sh.itjust.works 0 points 1 year ago* (last edited 1 year ago)

You can make multiple files with different encodings and select based on the Accept-Encoding header.
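A minimal sketch of that idea as a CGI-style shell handler (assumes the server exposes the header as HTTP_ACCEPT_ENCODING, as CGI does; file names are placeholders):

#!/bin/sh
case "$HTTP_ACCEPT_ENCODING" in
  *br*)   printf 'Content-Type: text/html\r\nContent-Encoding: br\r\n\r\n';   cat 10GB.br ;;
  *gzip*) printf 'Content-Type: text/html\r\nContent-Encoding: gzip\r\n\r\n'; cat 10GB.gz ;;
  *)      printf 'Content-Type: text/html\r\n\r\n'; echo 'nothing to see here' ;;
esac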

[–] Bishma@discuss.tchncs.de 2 points 1 year ago (1 children)

When I was serving high volume sites (that were targeted by scrapers) I had a collection of files in CDN that contained nothing but the word "no" over and over. Scrapers who barely hit our detection thresholds saw all their requests go to the 50M version. Super aggressive scrapers got the 10G version. And the scripts that just wouldn't stop got the 50G version.

It didn't move the needle on budget, but hopefully it cost them.
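Files like that are trivial to generate, e.g. (sizes illustrative, GNU head size suffixes assumed):

yes no | head -c 50M > no-50M.txt
yes no | head -c 10G > no-10G.txt
yes no | head -c 50G > no-50G.txt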

[–] sugar_in_your_tea@sh.itjust.works 2 points 1 year ago (1 children)

How do you tell scrapers from regular traffic?

[–] Bishma@discuss.tchncs.de 2 points 1 year ago (1 children)

Most often because they don't download any of the CSS or external JS files from the pages they scrape. But there are a lot of other patterns you can detect once you have their traffic logs loaded in a time series database. I used an ELK stack back in the day.
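A rough illustration of that first signal, assuming a combined-format access.log where the request path is field 7: list clients that fetched pages but never any CSS/JS.

awk '$7 ~ /\.(css|js)(\?|$)/ {assets[$1]++; next}
     {pages[$1]++}
     END {for (ip in pages) if (!(ip in assets)) print pages[ip], ip}' access.log | sort -rn | head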

[–] sugar_in_your_tea@sh.itjust.works 2 points 1 year ago (1 children)

That sounds like a lot of effort. Are there any tools that get like 80% of the way there? Like something I could plug into Caddy, nginx, or haproxy?

[–] Bishma@discuss.tchncs.de 2 points 1 year ago (1 children)

My experience is with systems that handle nearly 1000 pageviews per second. We did use a spread of haproxy servers to handle routing and SNI, but they were being fed offender lists by external analysis tools (built in-house).

[–] sugar_in_your_tea@sh.itjust.works 0 points 1 year ago (1 children)

Dang, I was hoping for a FOSS project that would do most of the heavy lifting for me. Maybe such a thing exists, idk, but it would be pretty cool to have a pluggable system that analyzes activity and tags connections w/ some kind of identifier so I could configure a web server to either send it nonsense (i.e. poison AI scrapers), zip bombs (i.e. bots that aren't respectful of resources), or redirect to a honey pot (i.e. malicious actors).

A quick search didn't yield anything immediately, but I wasn't that thorough. I'd be interested if anyone knows of such a project that's pretty easy to play with.

[–] ABasilPlant@lemmy.world 0 points 1 year ago* (last edited 1 year ago)

Not exactly what you asked, but do you know about ufw-blocklist?

I've been using this on my multiple VPSes for some time now and the number of fail2ban failed/banned entries has gone down like crazy. Previously, I had 20k failed attempts after a few months and 30-50 currently-banned IPs at all times; now it's less than 1k failed after a year and maybe 3-ish banned at any time.

There was also that paid service where users share their spammy IP address attempts with a centralized network, which does some dynamic intelligence monitoring. I forgot the name and search these days isn't great. Something to do with "Sense"? It was paid, but well recommended as far as I remember.

Edit: seems like the keyword is "threat intelligence platform"

[–] lemmylommy@lemmy.world 2 points 1 year ago (2 children)

Before I tell you how to create a zip bomb, I do have to warn you that you can potentially crash and destroy your own device.

LOL. Destroy your device, kill the cat, what else?

[–] Albbi@lemmy.ca 2 points 1 year ago (4 children)

It'll email your grandmother all of your porn!

[–] CrazyLikeGollum@lemmy.world 0 points 1 year ago

Ah yes, the infamous "stinky cheese" email virus. Who knew zip bombs could be so destructive. It erased all of the easter eggs off of my DVDs.

[–] Exec@pawb.social 0 points 1 year ago

outstanding reference

[–] turkalino@lemmy.yachts 0 points 1 year ago

Haven’t thought about that Weird Al song in a while

[–] dwt@feddit.org 2 points 1 year ago (3 children)

Sadly about the only thing that reliably helps against malicious crawlers is Anubis

https://anubis.techaro.lol/

[–] alehel@lemmy.zip 2 points 1 year ago (5 children)

That URL is telling me "Invalid response". Am I a bot?

[–] doorknob88@lemmy.world 4 points 1 year ago

I’m sorry you had to find out this way.

[–] L_Acacia@lemmy.ml 1 points 1 year ago

https://anubis.techaro.lol/docs/user/known-broken-extensions

If you have JShelter installed, it breaks the proof of work from Anubis.

[–] MonkderVierte@lemmy.ml 0 points 1 year ago (2 children)

You're using a VPN, right?

[–] alehel@lemmy.zip 0 points 1 year ago

Nope. Just using Vivaldi on my Android device.

[–] xavier666@lemm.ee 0 points 1 year ago

Now you know why your mom spent so much time with the Amiga

[–] mbirth@lemmy.ml 2 points 1 year ago

And if you want some customisation, e.g. some repeating string over and over, you can use something like this:

yes "b0M" | tr -d '\n' | head -c 10G | gzip -c > 10GB.gz

yes repeats the given string (followed by a line feed) indefinitely - originally meant to type "yes" + ENTER into prompts. tr then removes the line breaks again and head makes sure to only take 10GB and not have it run indefinitely.

If you want to be really fancy, you can even wrap the output in an HTML header and footer stored in files (named header and footer here) and then run it like this:

yes "b0M" | tr -d '\n' | head -c 10G | cat header - footer | gzip -c > 10GB.gz
[–] moopet@sh.itjust.works 1 points 1 year ago (2 children)

I'd be amazed if this works, since these sorts of tricks have been around since dinosaurs ruled the Earth. Most bots will use pretty modern zip libraries which will just return "nope" or throw an exception, and that gets treated exactly the same way as any corrupt file - for example, a site saying it's serving a zip file when the contents are actually a generic 404 HTML page, which is not uncommon.

Also, be careful because you could destroy your own device? What the hell? No. Unless you're using dd backwards and as root, you can't do anything bad, and even then it's the drive contents you overwrite, not the device you "destroy".

[–] Lucien@mander.xyz 1 points 1 year ago

Yeah, this article came across as if written by a complete beginner. They mention having their WordPress hacked, but failed to admit it was because they didn't upgrade the install.

On the other hand, there are lots of bots scraping Wikipedia even though it's easy to download the entire website as a single archive.

So they're not really that smart....

[–] fmstrat@lemmy.nowsci.com 1 points 1 year ago (3 children)

I've been thinking about making an nginx plugin that randomizes words on a page to poison AI scrapers.
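A toy illustration of the idea in shell (not an nginx plugin, and it ignores HTML markup entirely): scramble the words of a page before handing it to a flagged client.

tr -s ' ' '\n' < page.html | shuf | tr '\n' ' ' > poisoned.html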

[–] owsei@programming.dev 1 points 1 year ago (1 children)

There are "AI mazes" that do that.

I remember reading an article about this but haven't found it yet.

[–] arc@lemm.ee 0 points 1 year ago (4 children)

Probably only works for dumb bots and I'm guessing the big ones are resilient to this sort of thing.

Judging from recent stories the big threat is bots scraping for AIs and I wonder if there is a way to poison content so any AI ingesting it becomes dumber. e.g. text which is nonsensical or filled with counter information, trap phrases that reveal any AIs that ingested it, garbage pictures that purport to show something they don't etc.

[–] mostlikelyaperson@lemmy.world 0 points 1 year ago (1 children)

There have been some attempts in that regard, I don’t remember the names of the projects, but there were one or two that’d basically generate a crapton of nonsense to do just that. No idea how well that works.

[–] frezik@midwest.social 0 points 1 year ago (2 children)

When it comes to attacks on the Internet, doing simple things to get rid of the stupid bots means kicking 90% of attacks out. No, it won't work against a determined foe, but it does something useful.

Same goes for setting SSH to a random port. Logs are so much cleaner after doing that.
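For example (assumes a systemd-based distro; the unit may be called "ssh" on Debian/Ubuntu, and the port number here is arbitrary):

sudo sed -i 's/^#\?Port .*/Port 49222/' /etc/ssh/sshd_config
sudo systemctl reload sshd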

[–] echodot@feddit.uk 0 points 1 year ago* (last edited 1 year ago)

I don't know about poisoning AI, but one thing I used to do was redirect any suspicious bots, or ones that were hitting the server too much, to a simple HTML page with no JS, CSS, or forward links. Then they used to go away.

[–] fmstrat@lemmy.nowsci.com 0 points 1 year ago

This is why I use things like Docusaurus to generate static sites. Vulnerability injections are pretty hard when there's no code to inject into.

[–] billwashere@lemmy.world 0 points 1 year ago

I want to know how they built that visualization.
