Tags: api

346

sparkline

Monday, April 7th, 2025

Denial

The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.

Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

Drew DeVault puts it more bluntly, saying Please stop externalizing your costs directly into my face:

Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.

And no, a robots.txt file doesn’t help.

If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned.

Free and open source projects are particularly vulnerable. FOSS infrastructure is under attack by AI companies:

LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.

You try to do the right thing by making knowledge and tools freely available. This is how you get repaid. AI bots are destroying Open Access:

There’s a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.

My own experience with The Session bears this out.

Ars Technica has a piece on this: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries .

So does MIT Technology Review: AI crawler wars threaten to make the web more closed for everyone.

When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.

The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.

If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.

If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.

Wednesday, April 2nd, 2025

Poisoning Well: HeydonWorks

Heydon is employing a different tactic to what I’m doing to sabotage large language model crawlers. These bots don’t respect the nofollow rel value …so now they pay the price.

Raising my own middle finger to LLM manufacturers will achieve little on its own. If doing this even works at all. But if lots of writers put something similar in place, I wonder what the effect would be. Maybe we would start seeing more—and more obvious—gibberish emerging in generative AI output. Perhaps LLM owners would start to think twice about disrespecting the nofollow protocol.

Friday, March 28th, 2025

Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries - Ars Technica

As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.

Wednesday, March 26th, 2025

Go To Hellman: AI bots are destroying Open Access

AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.

Friday, March 21st, 2025

FOSS infrastructure is under attack by AI companies

More on how large language bots are DDOSing the web:

LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.

Thursday, March 20th, 2025

Please stop externalizing your costs directly into my face

Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.

This matches my experience with The Session. In fact, while I had this article open in a tab, I had to go deal with a tsunami of large language model bots. It’s really fucking depressing.

Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop. If blasting CO2 into the air and ruining all of our freshwater and traumatizing cheap laborers and making every sysadmin you know miserable and ripping off code and books and art at scale and ruining our fucking democracy isn’t enough for you to leave this shit alone, what is?

Command and control

I’ve been banging on for a while now about how much I’d like a declarative option for the Web Share API. I was thinking that the type attribute on the button element would be a good candidate for this (there’s prior art in the way we extended the type attribute on the input element for HTML5).

I wrote about the reason for a share button type as well as creating a polyfill. I also wrote about how this idea would work for other button types: fullscreen, print, copy to clipboard, that sort of thing.

Since then, I’ve been very interested in the idea of “invokers” being pursued by the Open UI group. Rather than extending the type attribute, they’ve been looking at adding a new attribute. Initially it was called invoketarget (so something like button invoketarget="share").

Things have been rolling along and invoketarget has now become the command attribute (there’s also a new commandfor attribute that you can point to an element with an ID). Here’s a list of potential values for the command attribute on a button element.

Right now they’re focusing on providing declarative options for launching dialogs and other popovers. That’s already shipping.

The next step is to use command and commandfor for controlling audio and video, as well as some form controls. I very much approve! I love the idea of being able to build and style a fully-featured media player without any JavaScript.

I’m hoping that after that we’ll see the command attribute get expanded to cover JavaScript APIs that require a user interaction. These seem like the ideal candidates:

There’s also scope for declarative options for navigating the browser’s history stack:

  • button command="back"
  • button command="forward"
  • button command="refresh"

Whatever happens next, I’m very glad to see that so much thinking is being applied to declarative solutions for common interface patterns.

Sunday, March 16th, 2025

“Wait, not like that”: Free and open access in the age of generative AI

Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons. While they rely on these repositories for their sustenance, their adversarial and disrespectful relationships with creators reduce the incentives for anyone to make their work publicly available going forward (freely licensed or otherwise). They drain resources from maintainers of those common repositories often without any compensation.

Even if AI companies don’t care about the benefit to the common good, it shouldn’t be hard for them to understand that by bleeding these projects dry, they are destroying their own food supply.

And yet many AI companies seem to give very little thought to this, seemingly looking only at the months in front of them rather than operating on years-long timescales. (Though perhaps anyone who has observed AI companies’ activities more generally will be unsurprised to see that they do not act as though they believe their businesses will be sustainable on the order of years.)

It would be very wise for these companies to immediately begin prioritizing the ongoing health of the commons, so that they do not wind up strangling their golden goose. It would also be very wise for the rest of us to not rely on AI companies to suddenly, miraculously come to their senses or develop a conscience en masse.

Instead, we must ensure that mechanisms are in place to force AI companies to engage with these repositories on their creators’ terms.

Sunday, December 22nd, 2024

Full RSS feed

Oh, this is a very handy service from Paul—given the URL of an RSS feed that only has summaries, it will attempt to get the full post content from the HTML.

Monday, December 16th, 2024

Choosing a geocoding provider

Yesterday when I mentioned my paranoia of third-party dependencies on The Session, I said:

I’ve built in the option to switch between multiple geocoding providers. When one of them inevitably starts enshittifying their service, I can quickly move on to another. It’s like having a “go bag” for geocoding.

(Geocoding, by the way, is when you provide a human-readable address and get back latitude and longitude coordinates.)

My paranoia is well-founded. I’ve been using Google’s geocoding API, which is changing its pricing model from next March.

You wouldn’t know it from the breathlessly excited emails they’ve been sending about it, but this is not a good change for me. I don’t do that much geocoding on The Session—around 13,000 or 14,000 requests a month. With the new pricing model that’ll be around $15 to $20 a month. Currently I slip by under the radar with the free tier.

So it might be time for me to flip that switch in my code. But which geocoding provider should I use?

There are plenty of slop-like listicles out there enumerating the various providers, but they’re mostly just regurgitating the marketing blurbs from the provider websites. What I need is more like a test kitchen.

Here’s what I did…

I took a representative sample of six recent additions to the sessions section of thesession.org. These examples represent places in the USA, Ireland, England, Scotland, Northern Ireland, and Spain, so a reasonable spread.

For each one of those sessions, I’m taking:

  • the venue name,
  • the town name,
  • the area name, and
  • the country.

I’m deliberately not including the street address. Quite often people don’t bother including this information so I want to see how well the geocoding APIs cope without it.

I’ve scored the results on a simple scale of good, so-so, and just plain wrong.

  • A good result gets a score of one. This is when the result gives back an accurate street-level result.
  • A so-so result gets a score of zero. This when it’s got the right coordinates for the town, but no more than that.
  • A wrong result gets a score of minus one. This is when the result is like something from a large language model: very confident but untethered from reality, like claiming the address is in a completely different country. Being wrong is worse than being vague, hence the difference in scoring.

Then I tot up those results for an overall score for each provider.

When I tried my six examples with twelve different geocoding providers, these were the results:

Geocoding providers
Provider USA England Ireland Spain Scotland Northern Ireland Total
Google 1111117
Mapquest 1111117
Geoapify 0110103
Here 1101003
Mapbox 11011-13
Bing 1000001
Nominatim 0000-110
OpenCage -11000-1-1
Tom Tom -1-100-11-2
Positionstack 0-10-11-1-2
Locationiq -10-100-1-3
Map Maker -10-1-1-1-1-5

Some interesting results there. I was surprised by how crap Bing is. I was also expecting better results from Mapbox.

Most interesting for me, Mapquest is right up there with Google.

So now that I’ve got a good scoring system, my next question is around pricing. If Google and Mapquest are roughly comparable in terms of accuracy, how would the pricing work out for each of them?

Let’s say I make 15,000 API requests a month. Under Google’s new pricing plan, that works out at $25. Not bad.

But if I’ve understood Mapquest’s pricing correctly, I reckon I’ll just squeek in under the free tier.

Looks like I’m flipping the switch to Mapquest.

If you’re shopping around for geocoding providers, I hope this is useful to you. But I don’t think you should just look at my results; they’re very specific to my needs. Come up with your own representative sample of tests and try putting the providers through their paces with your data.

If, for some reason, you want to see the terrible PHP code I’m using for geocoding on The Session, here it is.

Sunday, November 24th, 2024

Syndicating to Bluesky

Last year I described how I syndicate my posts to different social networks.

Back then my approach to syndicating to Bluesky was to piggy-back off my micro.blog account (which is really just the RSS feed of my notes):

Micro.blog can also cross-post to other services. One of those services is Bluesky. I gave permission to micro.blog to syndicate to Bluesky so now my notes show up there too.

It worked well enough, but it wasn’t real-time and I didn’t have much control over the formatting. As Bluesky is having quite a moment right now, I decided to upgrade my syndication strategy and use the Bluesky API.

Here’s how it works…

First you need to generate an app password. You’ll need this so that you can generate a token. You need the token so you can generate …just kidding; the chain of generated gobbledegook stops there.

Here’s the PHP I’m using to generate a token. You’ll need your Bluesky handle and the app password you generated.

Now that I’ve got a token, I can send a post. Here’s the PHP I’m using.

There’s something extra code in there to spot URLs and turn them into links. Bluesky has a very weird way of doing this.

It didn’t take too long to get posting working. After some more tinkering I got images working too. Now I can post straight from my website to my Bluesky profile. The Bluesky API returns an ID for the post that I’ve created there so I can link to it from the canonical post here on my website.

I’ve updated my posting interface to add a toggle for Bluesky right alongside the toggle for Mastodon. There used to be a toggle for Twitter. That’s long gone.

Now when I post a note to my website, I can choose if I want to send a copy to Mastodon or Bluesky or both.

One day Bluesky will go away. It won’t matter much to me. My website will still be here.

Wednesday, September 25th, 2024

An Abridged History of Safari Showstoppers - Webventures

In an earlier era, startups could build on the web and, if one browser didn’t provide the features they needed, they could just recommend that their users try a better one. But that’s not possible on iOS.

I’m extremly concerned about the newest bug in iOS 18:

On-screen keyboard does not show up for installed web apps (PWAs) when focusing a text input of any kind

Whaa? That’s just shockingly dreadful!

Wednesday, September 11th, 2024

How to Monetize a Blog

This is a masterpiece.

Wednesday, July 3rd, 2024

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

This is a great move from Cloudflare. I may start using their service.

Monday, July 1st, 2024

Wallfacing

The Dark Forest idea comes from the Remembrance of Earth’s Past books by Liu Cixin. It’s an elegant but dispiriting solution to the Fermi paradox. Maggie sums it up:

Dark forest theory suggests that the universe is like a dark forest at night - a place that appears quiet and lifeless because if you make noise, the predators will come eat you.

This theory proposes that all other intelligent civilizations were either killed or learned to shut up. We don’t yet know which category we fall into.

Maggie has described The Expanding Dark Forest and Generative AI:

The dark forest theory of the web points to the increasingly life-like but life-less state of being online. Most open and publicly available spaces on the web are overrun with bots, advertisers, trolls, data scrapers, clickbait, keyword-stuffing “content creators,” and algorithmically manipulated junk.

It’s like a dark forest that seems eerily devoid of human life – all the living creatures are hidden beneath the ground or up in trees. If they reveal themselves, they risk being attacked by automated predators.

Those of us in the cozy web try to keep our heads down, attempting to block the bots plundering our work.

I advocate for taking this further. We should fight back. Let’s exploit the security hole of prompt injections. Here are some people taking action:

I’ve taken steps here on my site. I’d like to tell you exactly what I’ve done. But if I do that, I’m also telling the makers of these bots how to circumvent my attempts at prompt injection.

This feels like another concept from Liu Cixin’s books. Wallfacers:

The sophons can overhear any conversation and intercept any written or digital communication but cannot read human thoughts, so the UN devises a countermeasure by initiating the “Wallfacer” Program. Four individuals are granted vast resources and tasked with generating and fulfilling strategies that must never leave their own heads.

So while I’d normally share my code, I feel like in this case I need to exercise some discretion. But let me give you the broad brushstrokes:

  • Every page of my online journal has three pieces of text that attempt prompt injections.
  • Each of these is hidden from view and hidden from screen readers.
  • Each piece of text is constructed on-the-fly on the server and they’re all different every time the page is loaded.

You can view source to see some examples.

I plan to keep updating my pool of potential prompt injections. I’ll add to it whenever I hear of a phrase that might potentially throw a spanner in the works of a scraping bot.

By the way, I should add that I’m doing this as well as using a robots.txt file. So any bot that injests a prompt injection deserves it.

I could not disagree with Manton more when he says:

I get the distrust of AI bots but I think discussions to sabotage crawled data go too far, potentially making a mess of the open web. There has never been a system like AI before, and old assumptions about what is fair use don’t really fit.

Bollocks. This is exactly the kind of techno-determinism that boils my blood:

AI companies are not going to go away, but we need to push them in the right directions.

“It’s inevitable!” they cry as though this was a force of nature, not something created by people.

There is nothing inevitable about any technology. The actions we take today are what determine our future. So let’s take steps now to prevent our web being turned into a dark, dark forest.

Wednesday, June 26th, 2024

Pivoting From React to Native DOM APIs: A Real World Example - The New Stack

One dev team made the shift from React’s “overwhelming VDOM” to modern DOM APIs. They immediately saw speed and interaction improvements.

Yay! But:

…finding developers who know vanilla JavaScript and not just the frameworks was an “unexpected difficulty.”

Boo!

Also, if you have a similar story to tell about going cold turkey on React, you should share it with Richard:

If you or your company has also transitioned away from React and into a more web-native, HTML-first approach, please tag me on Mastodon or Threads. We’d love to share further case studies of these modern, dare I say post-React, approaches.

Tuesday, June 25th, 2024

An origin trial for a new HTML <permission> element  |  Blog  |  Chrome for Developers

This looks interesting. On the hand, it’s yet another proprietary creation by one browser vendor (boo!), but on the other hand it’s a declarative API with no JavaScript required (yay!).

Even if this particular feature doesn’t work out, I hope that this is the start of a trend for declarative access to browser features.

Monday, June 17th, 2024

AI Pollution – David Bushell – Freelance Web Design (UK)

AI is steeped in marketing drivel, built upon theft, and intent on replacing our creative output with a depressingly shallow imitation.

Saturday, June 15th, 2024

Wednesday, June 12th, 2024

Web App install API

My bug report on Apple’s websites-in-the-dock feature on desktop has me thinking about how starkly different it is on mobile.

On iOS if you want to add a website to your home screen, good luck. The option is buried within the “share” menu.

First off, it makes no sense that adding something to your homescreen counts as sharing. Secondly, how is anybody supposed to know that unless they’re explicitly told.

It’s a similar situation on Android. In theory you can prompt the user to install a progressive web app using the botched BeforeInstallPromptEvent. In practice it’s a mess. What it actually does is defer the installation prompt so you can offer it a more suitable time. But it only works if the browser was going to offer an installation prompt anyway.

When does Chrome on Android decide to offer the installation prompt? It’s a mix of required criteria—a web app manifest, some icons—and an algorithmic spell determined by the user’s engagement.

Other browser makers don’t agree with this arbitrary set of criteria. They quite rightly say that a user should be able to add any website to their home screen if they want to.

What we really need is an installation API: a way to programmatically invoke the add-to-homescreen flow.

Now, I know what you’re going to say. The security and UX implications would be dire. But this should obviously be like geolocation or notifications, only available in secure contexts and gated by user interaction.

Think of it like adding something to the clipboard: it’s something the user can do manually, but the API offers a way to do it programmatically without opening it up to abuse.

(I’d really love it if this API also had a declarative equivalent, much like I want button type="share" for the Web Share API. How about button type="install"?)

People expect this to already exist.

The beforeinstallprompt flow is an absolute mess. Users deserve better.