• drkt@scribe.disroot.org · 32 points · 9 days ago

    I'm currently watching several malicious crawlers get stuck in a 404 hole I created. Check it out yourself at https://drkt.eu/asdfasd

    I respond to every 404 with a 200 and serve that page, which is full of juicy bot targets. A lot of bots can't get out of it, and I'm hoping the drive-by bots that probe for login pages flag it as a hit (because it answered 200 instead of 404), so a real human has to go check it and waste their time.
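
    A minimal nginx sketch of the trick (paths and names here are illustrative guesses, not drkt's actual config):

    ```
    # Sketch: turn every would-be 404 into a 200 that serves the trap page.
    # /trap.html is a hypothetical page stuffed with links back into itself
    # and fake login targets.
    server {
        listen 80;
        server_name example.org;
        root /var/www/site;

        # Swap the 404 status for a 200 and serve the honeypot instead.
        error_page 404 =200 /trap.html;

        location = /trap.html {
            internal;  # only reachable via the error_page redirect above
        }
    }
    ```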

    • Daniel Quinn@lemmy.ca · 7 points · 9 days ago

      This is pretty slick, but doesn't this just mean the bots hammer your server, looping forever? How much processing do you do of those forms, for example?

      • drkt@scribe.disroot.org · 7 points · 9 days ago

        > doesn't this just mean the bots hammer your server, looping forever?

        Yes

        > How much processing do you do of those forms

        None

        It costs me nothing to have bots spending bandwidth on me because I’m not on a metered connection and electricity is cheap enough that the tiny overhead of processing their requests might amount to a dollar or two per year.

      • jagged_circle@feddit.nl · 4 points · 9 days ago

        The best move is to redirect them to a 1TB file served by Hetzner's cache. There are some nginx configs that do this.
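
        A sketch of that idea (the user-agent patterns and target URL are placeholders; providers like Hetzner host large speed-test files that get used this way):

        ```
        # Sketch: bounce suspected crawlers to a huge file on someone else's
        # well-provisioned cache so the bandwidth bill isn't yours.
        # The pattern list and URL are illustrative placeholders.
        map $http_user_agent $bad_bot {
            default 0;
            ~*(GPTBot|CCBot|Bytespider) 1;
        }

        server {
            listen 80;
            server_name example.org;

            location / {
                if ($bad_bot) {
                    return 302 https://speed.hetzner.de/10GB.bin;
                }
                # ... normal site config ...
            }
        }
        ```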

  • nothacking@discuss.tchncs.de · 9 points (1 down) · 10 days ago

    Perhaps feed them convincing fake data so they don't realize they've been IP-banned/user-agent filtered.
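
    For example, nginx could route filtered user agents to a static decoy page instead of an obvious 403 (the patterns and paths below are made up for illustration):

    ```
    # Sketch: serve banned user agents plausible decoy content instead of a
    # 403, so their operators don't realize they've been filtered.
    map $http_user_agent $serve_decoy {
        default 0;
        ~*(GPTBot|CCBot) 1;
    }

    server {
        listen 80;
        server_name example.org;
        root /var/www/site;

        location / {
            if ($serve_decoy) {
                rewrite ^ /decoy.html last;
            }
            # ... real content ...
        }

        location = /decoy.html {
            internal;             # only reachable via the rewrite above
            root /var/www/decoy;
        }
    }
    ```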

  • Deckweiss@lemmy.world · 8 points (1 down) · 9 days ago (edited)

    The only way I can think of is blacklisting everything by default, redirecting to a proper, challenging captcha (which can be self-hosted), and temporarily whitelisting proven-human IPs.

    When you try to "enumerate badness" and block all AI user agents and IP ranges, you'll always let some new ones through, and you'll never be done adding to the list.

    Only allow proven humans.


    A captcha will inconvenience the users. If you just want to make it worse for the crawlers, make them spend compute resources through something like https://altcha.org/ (which would still allow them to crawl your site, but makes DDoSing very expensive) or AI honeypots.
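
    A rough shape of that default-deny idea in nginx (the allowlist file and challenge endpoint are assumptions; a real setup would have the captcha or proof-of-work backend regenerate the allowlist, expire entries, and reload nginx):

    ```
    # Sketch: deny everyone by default; allow only IPs that passed a challenge.
    # /etc/nginx/allowlist.conf is (hypothetically) rewritten by the captcha
    # backend with lines like "203.0.113.7 1;".
    geo $proven_human {
        default 0;
        include /etc/nginx/allowlist.conf;
    }

    server {
        listen 80;
        server_name example.org;

        location / {
            if ($proven_human = 0) {
                return 302 /challenge;        # hand off to the challenge page
            }
            # ... real content ...
        }

        location /challenge {
            proxy_pass http://127.0.0.1:8080; # self-hosted captcha/PoW app
        }
    }
    ```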

  • dudeami0@lemmy.dudeami.win · 6 points (1 down) · 10 days ago

    The only way I can think of is to require users to authenticate themselves, but this isn't much of a hurdle.

    To get into the details of it: what do you define as an AI bot? Are you worried about scrapers grabbing the contents of your website? What are the activities of an "AI bot"? Are you worried about AI bots registering and using your platform?

    The real answer is that not even Cloudflare will fully defend you from this. If anything, Cloudflare is just making sure they get paid for AI scrapers' access to your website. As someone who has worked around bot protections (albeit in a different context than web scraping), it's a game of cat and mouse. If you, or some company you hire, are not actively working against automated access, you lose, because the other side is.

    Just think of your point that they are using residential IP addresses. How do they get these addresses? They provide addons/extensions for browsers that offer some service (generally free VPNs) in exchange for access to your PC, and therefore your internet connection, under the contract you agree to. The same goes for any addon: if it has permission to read any website, it can scrape those websites through legit users' browsers for whatever purpose it wants. The recent exposure of the Honey scam highlights this, as it's very easy to get users to install addons by telling them they might save a small amount of money (or make money through other programs). There will be users compromised by addons/extensions, or even just viruses, that can extract the data you are trying to protect.

    • DaGeek247@fedia.io · 2 points · 9 days ago

      > Just think of your point that they are using residential IP addresses. How do they get these addresses?

      You can ping every IPv4 address in under an hour. If all you're looking for is publicly available words written by people, you only have to poke port 80, and suddenly you have practically every small self-hosted website out there.

      • dudeami0@lemmy.dudeami.win · 2 points · 9 days ago (edited)

        When I say residential IP addresses, I mostly mean proxies using residential IPs, which allow scrapers to mask themselves as organic traffic.

        Edit: Your point stands that there are a lot of services without these protections in place, but many services do protect against scraping.

    • ctag@lemmy.sdf.org (OP) · 1 point · 9 days ago

      Thank you for the detailed response. It's disheartening to consider that the traffic is coming from 'real' browsers/IPs, but that actually makes a lot of sense.

      I’m coming at this from the angle of AI bots ingesting a website over and over to obsessively look for new content.

      My understanding is there are two reasons to try blocking this: to protect bandwidth from aggressive crawling, or to protect the page contents from AI ingestion. I think the former is doable, and the latter is an unwinnable task. My personal reason is that I'm an AI curmudgeon: I'd rather spend CPU resources blocking bots than serving any content to them.

  • WasPentalive@lemmy.one · 3 points · 9 days ago (edited)

    When one of these guys attacks your site, do they send the info back to the spoofed address, or does the scraped info go to their real IP address? Is there some way to get a fix on the actual bot, and not on some home user whose network-facing IP address got hijacked?

  • Scrubbles@poptalk.scrubbles.tech · 1 point · 10 days ago

    If I'm reading your link right, they identify themselves with user agents. Granted, there are a lot of them. Maybe you could whitelist user agents you approve of? Or one of the commenters had a list you could block. Nginx would be able to handle that.
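
    A minimal sketch of the whitelist idea (the allowed patterns are examples, not a vetted list, and as the reply below notes, bots can spoof any of them):

    ```
    # Sketch: allow only user agents matching recognized browser strings;
    # everything else gets a 403. Patterns are examples, not a vetted list.
    map $http_user_agent $ua_allowed {
        default 0;
        ~*(Firefox|Chrome|Safari|Edg) 1;
    }

    server {
        listen 80;
        server_name example.org;

        if ($ua_allowed = 0) {
            return 403;
        }
        # ... rest of site config ...
    }
    ```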

    • ctag@lemmy.sdf.org (OP) · 1 point · 10 days ago

      Thank you for the reply, but at least one commenter claims they’ll impersonate Chrome UAs.

          • ctag@lemmy.sdf.org (OP) · 3 points · 9 days ago

            In the Hacker News comments for that Geraspora link, people discussed websites shutting down due to hosting costs, which may be attributable in part to overly aggressive crawling. So maybe it's just a different form of DDoS than we're used to.