Online ID / age verification: the death of online search, and non-browser web access?

- Posted in hosting by - Permalink

[screenshot of curl accessing decoded.legal]

Neil note: I've published this with the hope that others familiar with what this could entail will give feedback. As such, what you read now might not be the final version. If I make material changes, I'll indicate them.

Regular readers will not have missed the UK government's ongoing ~~something-must-be-done~~ attempt to put ID or age verification barriers on the web and other online services.

In the Full government response to the consultation on the Online Harms White Paper, the government says:

2.41 Under our proposals companies will be expected to use a range of tools proportionately, to take reasonable steps to prevent children from accessing age-inappropriate content and to protect them from other harms. This includes, for example, the use of age assurance and age verification technologies, which are expected to play a key role for companies in order to fulfil their duty of care.

(Thanks to user tyingq on HackerNews for asking for the context of this.)

The thesis behind the proposal — which I critique here — is that the offline world is safe for children because of the measures put in place to protect them, but the online world is not. Spoiler: the offline world is not "safe for children", even with the protections and rules which are in place. And it's certainly not safe for children by default. (Update: Thanks to Jess Figueras for her feedback on this bit.)

Part of the current proposal is to compel some website and service operators to implement ID or age verification. In other words, forcing users to prove who they are, or at least how old they are, before they can access the restricted content.

(What is that restricted content / which website operators have to do this? It's impossible to say at the moment, as the proposal is insufficiently specific about the harm against which it is purporting to protect. Is it enough if your website simply has a page about anorexia (Twitter link)? Internet law guru Graham Smith has written an excellent introduction to the issue.)

ID / age verification could take various forms, from mildly intrusive and inconvenient, to very intrusive and inconvenient. Imagine the "cookie banners" you see everywhere at the moment, but requiring you to submit to a facial scan via your device's in-built camera every time you want to load a page.

There are numerous flaws with this. I'm hesitant to call it a plan at this stage, given how little it has been thought through, but let's just say that a proposal which could require people to turn on their cameras before accessing, say, a porn site sounds like a route ripe for exploitation and abuse...

There are also two issues which I don't recall seeing discussed elsewhere.

ID / age verification could (unintentionally?) kill spidering, and thus search

Automated, non-browser access is the mechanism by which search engine providers "crawl" or "spider" sites. They use these automated tools to work through a site, drawing on both the content and the links to and from the site, to determine what results appear in their search engine.

Think Google, Bing, Baidu, Yandex, and others.

Here's the problem: without significant change to the way in which search is done at the moment, sites will have to choose between implementing effective ID or age verification and having their content spidered for search engines.

They cannot have both: if sites want to permit the numerous search engine spiders to access content which would be behind ID or age verification controls (such as user generated content, or content which Ofcom might deem unsuitable for children), they will also permit users pretending to be those spiders.

There might be an approach based on allow-listing known IPs of spiders, but it's hardly optimal. And, as Michał "rysiek" Woźniak said in feedback on Mastodon:

It will also block any chance for new independent small search engines to pop-up, effectively locking the search mainstream engine space down for the oligopolists.

The quite brilliant US tech lawyer Daphne Keller (and if you don't follow her on Twitter, you really should) said on Twitter:

I ... wonder if there is some model with a trusted 3rd party verifier of "legitimate crawls" based on IP or some other identifier. Then each website just has to trust that verifier. This has a bunch of problems, but it greatly lowers the barrier to entry for new crawlers.

She went on to say:

problems include:

  • Latency
  • More room for technical screw-ups that ramify
  • Centralizing control over who can crawl, creating a new locus for undue influence, abuse, error
  • Creating a centralized record of who crawled what

If the site in question considers that, to comply with the law, it must obtain a facial scan from every user for each access, I do not think there is a workaround other than trying to keep on top of every spider's IP address: spiders would likely be locked out.

Let's assume that a number of sites will not be content with losing their search engine originated traffic because they can no longer permit spiders or bots.

And let's further assume, then, that they continue to permit these spiders to access their site, much in the same way that providers of paywalled content — for example, news sites — do today. They attempt to erect their paywalls against Actual Human Users, while letting spiders and bots crawl their sites unencumbered.

With close to minimal effort, anyone can circumvent the purported restrictions.

I'm not going to detail precisely how, because there may still be an argument that, in doing so, I infringe copyright law. I don't think it's a good argument, but it's not an argument I intend to have. For the sake of this blog post, it is sufficient to say that a simple piece of software, free of charge, which a user can add to their browser, will circumvent many — not all, but many — paywalls.

You don't need that piece of software — if you know what you are doing, you can do it for many sites with just changes to your browser's configuration — but that is the easiest approach.

And, while children may not be chatting in the playground about ways to view the restricted content on — say — the Financial Times website, they may be far more willing to have conversations about circumventing age verification.

Update 20210716: I was given some feedback from a reader who wishes to remain anonymous, saying that this may be addressable through a combination of reverse IP lookups for spiders and other bots and — particularly interestingly — a change to the way in which bots operate, to use an authenticated connection, where the bot authenticates to the sites it wishes to crawl. A bit like certificate-based authentication for ssh.

This sounds like a pretty fundamental change, with implications for search in the short term, as well as for the development of new search engines if the ability to crawl sites is restricted. However, it's an interesting idea, and there is clearly money in search which could be focussed on resolving problems which arose. If developments like this took place, what appears to be a loophole today may be closed off. Perhaps search will not die after all!

If a measure can be avoided trivially, it is neither effective nor proportionate

If one can avoid going through intrusive age verification with the same technique used to permit spidering (and, if sites want to allow spidering, it's hard to see why that would not be the case), we are left with a measure that is so ineffective that it can be circumvented within a user's browser, with no technical knowledge.

Can it possibly be proportionate to command that sites implement something so trivially circumventable?

Or is the decree that they must deprive themselves of search traffic, because the only way to prevent this access is to stop traffic from spiders? If so, the commercial impact of this is not something I have seen in any dialogue to date.

Has anyone told sites this? Has it been raised in the debate about ID and age verification, that search — something massively important to many sites — could be adversely affected? Have the commercial implications of this been costed into the impact assessment?

Or will it simply be the case that, having made a rash policy decision, it is left to the sites' technical teams to simply "nerd harder", being told that it is their problem to solve, and that, if only they worked a bit harder, they would find a solution?

ID / age verification could (unintentionally?) kill non-browser access to the web

Browsers — Firefox, Safari, Chrome, whatever Microsoft calls its browser these days — are a common way of accessing the web.

But browsers are not the only way of accessing the web — there are myriad other common ways of doing it, especially programmatically.

For example, I routinely use three different non-browser tools for accessing the web: curl, wget, and a technology which, sadly, seems to have fallen out of favour a bit, rss.

(I expect that lynx might have the same problem in at least some situations, being a text-only browser, but I'm not a regular lynx user.)

I have a nasty feeling that some of the plans for ID and age verification will kill off these important tools, and most probably do so without even realising it.

curl and wget

curl and wget are command line tools for transferring files. Not just remotely-hosted files, and not just files on the web, but they are most often used for these.

For example, imagine that you wanted to get a snapshot of a website every hour.

You could visit that site in a browser every hour, on the hour, and take a screenshot.

Or you could use curl or wget, in combination with something like cron, to do it for you automatically.

You don't want to take a snapshot? How about monitoring a site for updates? In which case, curl or wget, triggered by cron, comparing what it downloaded now with what it downloaded last time (e.g. via diff), linked to an email script which emails you if it detects a difference.

This is incredibly useful for monitoring the content on large numbers of sites automatically, and getting updates straight into your inbox — think press releases, legislative changes, regulatory announcements, and so on. An essential tool for lawyers, journalists, researchers, and many others.

Or perhaps you want to undertake an archival project, keeping track of sites over a period of many years. You're not going to do that through a browser. Has there been any discussion about important archival projects, like the Internet Archive? Perhaps I have missed it.

Not all ID and age verification implementations will kill off these tools. For example, you can use these tools to log in to sites where you have an account. Whether you can use them to prove your identity or age when creating an account is a different matter. So you are back to the browser.

And, as with spiders, if you have to submit to a facial scan every time you want to access the site, these tools are likely to be locked out.

As cryptography and security expert Alec Muffett pointed out on Twitter, any solution predicated on scans of a user's face, or real-time photography:

disenfranchis[es] devices that do not have a camera more fundamentally

And that covers many command line tools, as well as users who do not have a device with a working camera.

rss

rss. Really Simple Syndication.

I may be showing my age, but rss is a lovely way of interacting (or not interacting) with websites.

A bit like your own curated newspaper, you hook your chosen reading device — an rss reader / client — into the sites whose content you want to read, and then you can read it in a plain, simple, no-flashing-adverts-or-cookie-notices manner.

In a typical setup, you have a client which talks to your chosen rss server, and your chosen rss server programmatically grabs the feeds made available by the sites you have specified. (Here's the rss feed for this blog.)
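As a flavour of how simple that programmatic grab can be, this sketch fetches a feed and lists the entry titles. The feed URL is a placeholder, and a real rss server parses the XML properly rather than pattern-matching it as I do here:

```shell
#!/bin/sh
# list-titles.sh — sketch: crudely pull <title> elements out of a feed.
# The URL is a placeholder; real clients use a proper XML parser.
FEED="https://example.com/feed.rss"

curl --silent "$FEED" |
    grep -o '<title>[^<]*</title>' |
    sed 's/<[^>]*>//g'
```

The point of the sketch is that the whole exchange is a plain, unauthenticated HTTP fetch — precisely the kind of access that a per-request verification requirement would break.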

We have a problem similar to that of curl and wget. rss servers can often handle authentication, meaning that someone can collect feeds from a page or site which requires them to log in. This means that, if someone can be verified once, when they set up their account, and they need to log in to view the content, this mechanism likely survives.

If, however, someone must validate their identity or age for every access, or even just intermittently, this valuable tool likely ceases to work.

And if the mechanism for ID or age verification requires something more than sending a static, or rss-server-side generated, text parameter to the server — for example, a real-time facial scan — then this is, most probably, out of the question.

This would be a shame, because if you want to read the web without the comments, without the distractions, rss is a great way of doing so.

Have those pushing ID / age verification even considered this?

More worrying still, I don't think that the prospect of killing search, or non-browser access tools, is even on the radar of those proposing ID and age verification online.

If these things had been considered, debated, and ruled to be insufficiently important, then so be it. I might have thought it to be an odd decision, but I'd have been comforted by the fact that they had been considered properly.

I've a nasty feeling that they have not, and that we are driving forward a policy without considering what we will lose along the way, in a rush to force sites to splash out on implementing a solution which may end up being circumvented trivially anyway.