Vision for W3C
We believe the World Wide Web should be inclusive and respectful of all participants: a Web that supports facts over falsehoods, people over profits, humanity over hate.
The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.
Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
Drew DeVault puts it more bluntly, saying Please stop externalizing your costs directly into my face:
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
And no, a robots.txt file doesn’t help.
If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned.
Free and open source projects are particularly vulnerable. FOSS infrastructure is under attack by AI companies:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
You try to do the right thing by making knowledge and tools freely available. This is how you get repaid. AI bots are destroying Open Access:
There’s a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
My own experience with The Session bears this out.
Ars Technica has a piece on this: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries.
So does MIT Technology Review: AI crawler wars threaten to make the web more closed for everyone.
When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.
The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.
If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.
If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.
As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.
AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
More on how large language bots are DDOSing the web:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
- Support open source software
- Support open web platform technology
- Distribution on the web should never be throttled
- External links should be encouraged, not de-emphasized
Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons. While they rely on these repositories for their sustenance, their adversarial and disrespectful relationships with creators reduce the incentives for anyone to make their work publicly available going forward (freely licensed or otherwise). They drain resources from maintainers of those common repositories often without any compensation.
Even if AI companies don’t care about the benefit to the common good, it shouldn’t be hard for them to understand that by bleeding these projects dry, they are destroying their own food supply.
And yet many AI companies seem to give very little thought to this, seemingly looking only at the months in front of them rather than operating on years-long timescales. (Though perhaps anyone who has observed AI companies’ activities more generally will be unsurprised to see that they do not act as though they believe their businesses will be sustainable on the order of years.)
It would be very wise for these companies to immediately begin prioritizing the ongoing health of the commons, so that they do not wind up strangling their golden goose. It would also be very wise for the rest of us to not rely on AI companies to suddenly, miraculously come to their senses or develop a conscience en masse.
Instead, we must ensure that mechanisms are in place to force AI companies to engage with these repositories on their creators’ terms.
About halfway through this talk transcript, Aaron starts dropping a barrage of truth bombs:
I understand the web, whose distinguishing characteristic is asynchronous recall on a global scale, as the technology which makes revisiting possible in a way that has genuinely never existed before the web.
What the web has made possible are the economics of keeping something, something which has not enjoyed “hockey stick growth”, around long enough for people to warm up to it. Or to survive long past the moment when people may have grown tired of it.
If your goal is to build something which is designed to flip inside of ten years, like many things in the private sector, that may not seem like a very compelling argument.
If, however, your goal is to build something to match the longevity of the cultural heritage sector, to meet the goal of fostering revisiting, or for novel ideas to outlast the reluctance of the present and to do so at a global scale, or really any scale larger than shouting distance, then I will challenge you to find a better vehicle for doing so than the internet, and the web in particular.
Explore our hand-picked collection of 10,046 out-of-copyright works, free for all to browse, download, and reuse. This is a living database with new images added every week.
The Session has been online for over 20 years. When you maintain a site for that long, you don’t want to be relying on third parties—it’s only a matter of time until they’re no longer around.
Some third party APIs are unavoidable. The Session has maps for sessions and other events. When people add a new entry, they provide the address but then I need to get the latitude and longitude. So I have to use a third-party geocoding API.
My code is like a lesson in paranoia: I’ve built in the option to switch between multiple geocoding providers. When one of them inevitably starts enshittifying their service, I can quickly move on to another. It’s like having a “go bag” for geocoding.
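To give a rough idea, here’s a minimal sketch of that provider-switching pattern; the provider, the function names, and the endpoint are illustrative, not the actual code running on The Session:

    // Each provider exposes the same interface: address in, coordinates out.
    // Nominatim is shown here as one example of a geocoding endpoint.
    const geocoders = {
        nominatim: async (address) => {
            const url = 'https://nominatim.openstreetmap.org/search?format=json&q=' + encodeURIComponent(address);
            const results = await (await fetch(url)).json();
            return { latitude: Number(results[0].lat), longitude: Number(results[0].lon) };
        }
        // ...other providers, each returning the same shape...
    };

    // Switching to a different provider is a one-line configuration change.
    const activeGeocoder = 'nominatim';

    async function geocode(address) {
        return geocoders[activeGeocoder](address);
    }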
Things are better on the client side. I’m using other people’s JavaScript libraries—like the brilliant abcjs—but at least I can self-host them.
I’m using Leaflet for embedding maps. It’s a great little library that works beautifully with OpenStreetMap data.
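If you haven’t used it, the basic set-up is pleasingly small; something like this (the element ID, coordinates, and zoom level here are just placeholder values):

    // Create a map in an element with the ID "map" and add raster OpenStreetMap tiles.
    const map = L.map('map').setView([53.3498, -6.2603], 13);
    L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png', {
        attribution: '&copy; OpenStreetMap contributors'
    }).addTo(map);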
A little while back I linked to a new project called OpenFreeMap. It’s a mapping provider where you even have the option of hosting the tiles yourself!
For now, I’m not self-hosting my map tiles (yet!), but I did want to switch to OpenFreeMap’s tiles. They’re vector-based rather than bitmap, so they’re lovely and crisp.
But there’s an issue.
I can use OpenFreeMap with Leaflet, but to do that I also have to use the MapLibre GL library. But whereas Leaflet is 148K of JavaScript, MapLibre GL is 800K! Yowzers!
That’s mahoosive by the standards of The Session’s performance budget. I’m not sure the loveliness of the vector maps is worth increasing the JavaScript payload by so much.
But this doesn’t have to be an either/or decision. I can use progressive enhancement to get the best of both worlds.
If you land straight on a map page on The Session for the first time, you’ll get the old-fashioned bitmap map tiles. There’s no MapLibre code.
But if you browse around The Session and then arrive on a map page, you’ll get the lovely vector maps.
Here’s what’s happening…
The maps are embedded using an HTML web component called embed-map. The fallback is a static image between the opening and closing tags. The web component then loads up Leaflet.
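The markup and registration look roughly like this; the attribute names and image path are illustrative rather than the exact code on The Session:

    // Hypothetical markup:
    //
    //   <embed-map latitude="53.3498" longitude="-6.2603">
    //       <img src="/images/static-map.png" alt="Map showing the venue">
    //   </embed-map>

    class EmbedMap extends HTMLElement {
        connectedCallback() {
            // The static image fallback is already in the light DOM,
            // so everything from here on is purely additive.
        }
    }
    customElements.define('embed-map', EmbedMap);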
Here’s where the enhancement comes in. When the web component is initiated (in its connectedCallback method), it uses the Cache API to see if MapLibre has been stored in a cache. If it has, it loads that library:
    caches.match('/path/to/maplibre-gl.js')
    .then( responseFromCache => {
        if (responseFromCache) {
            // load maplibre-gl.js
        }
    });
Then when it comes to drawing the map, I can check for the existence of the maplibreGL object. If it exists, I can use OpenFreeMap tiles. Otherwise I use the old Leaflet tiles.
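In other words, the drawing code branches on that object. Here’s a sketch of the logic, assuming a drawMap function and using illustrative tile and style URLs:

    function drawMap(container, latitude, longitude) {
        if (window.maplibreGL) {
            // Crisp vector tiles from OpenFreeMap, rendered by MapLibre GL
            new maplibreGL.Map({
                container: container,
                style: 'https://tiles.openfreemap.org/styles/liberty',
                center: [longitude, latitude],
                zoom: 15
            });
        } else {
            // Fall back to the old raster tiles with Leaflet
            const map = L.map(container).setView([latitude, longitude], 15);
            L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png', {
                attribution: '&copy; OpenStreetMap contributors'
            }).addTo(map);
        }
    }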
But how does the MapLibre library end up in a cache? That’s thanks to the service worker script.
During the service worker’s install event, I give it a list of static files to cache: CSS, JavaScript, and so on. That includes third-party libraries like abcjs, Leaflet, and now MapLibre GL.
Crucially this caching happens off the main thread. It happens in the background and it won’t slow down the loading of whatever page is currently being displayed.
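A minimal sketch of that install-time caching; the cache name and file paths are placeholders, not the real ones:

    const staticCacheName = 'static-assets-v1';
    const staticFiles = [
        '/path/to/abcjs.js',
        '/path/to/leaflet.js',
        '/path/to/maplibre-gl.js'
        // ...CSS and other static assets...
    ];

    addEventListener('install', (installEvent) => {
        installEvent.waitUntil(
            // Runs in the background, off the main thread.
            caches.open(staticCacheName)
                .then((cache) => cache.addAll(staticFiles))
        );
    });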
That’s it. If the service worker installation works as planned, you’ll get the nice new vector maps. If anything goes wrong, you’ll get the older version.
By the way, it’s always a good idea to use a service worker and the Cache API to store your JavaScript files. As you know, JavaScript is unduly expensive to performance; not only does the JavaScript file have to be downloaded, it then has to be parsed and compiled. But JavaScript stored in a cache during a service worker’s install event is already parsed and compiled.
While I’ve grown more cynical about much of tech, movements like the Indieweb and the Fediverse remind me that the ideals I once loved, and that spirit of the early web, aren’t lost. They’re evolving, just like everything else.
This project, based on OpenStreetMap, looks great:
OpenFreeMap lets you display custom maps on your website and apps for free.
You can either self-host or use our public instance.
I’m going to try it out on The Session once there’s documentation for using this with Leaflet.
This is how I write:
As an online writer, my philosophy is link maximalism; links add another layer to my writing, whether I’m linking to an expansion of a particular idea or another person’s take, providing evidence or citation, or making a joke by juxtaposing text and target. Links reveal personality as much as the text. Linking allows us to stretch our ideas, embedding complexity, acknowledging ambiguity, holding contradictions.
This is a very handy piece of work by Rich:
The idea is to set sensible typographic defaults for use on prose (a column of text), making particular use of the font features provided by OpenType. The main principle is that it can be used as starting point for all projects, so doesn’t include design-specific aspects such as font choice, type scale or layout (including how you might like to set the line-length).
Since the early days of the web, large corporations have seemingly always wanted more than the web platform or web standards could offer at any given moment. Whether they were aiming for cross-platform-compatibility, more advanced capabilities, or just to be the one runtime/framework/language to rule them all, there’s always been a company that believes they can “fix” it or “own” it.
Applets. ActiveX. Flash. Flex. Silverlight. Angular. React.
A library of CC-licensed photos.
Next time you’re tempted to use a generative “AI” tool to make an image for a slide deck, use this instead.
I like this framing:
If you’ve ever corrected a typo in an Open Source readme, or added alt-text to an image, or tidied up some broken references in Wikipedia - you’re doing Digital Litter Picking. You’re cleaning up after others. And I think that’s a marvellous way to spend a little time.
Great stuff from Maggie—reminds me of the storyforming workshop I did with Ellen years ago.
Mind you, I disagree with Maggie about giving a talk’s outline at the beginning—that’s like showing the trailer of the movie you’re about to watch.
Subvert the status quo. Own a website. Make and share links.
What podcasting holds in the promise of its open format is the proof that an open web can still thrive and be relevant, that it can inspire new systems that are similarly open to take root and grow. Even the biggest companies in the world can’t displace these kinds of systems once they find their audiences.