Github Copilot: initial thoughts from an English law perspective
(I have more questions than answers, but I thought it was worth writing something, even if just as a straw man for others to rip apart.)
Introducing Github Copilot
Github has launched a beta, available only to some users, of a tool it calls "Copilot".
It describes it as:
Your AI pair programmer
In a little more detail:
Trained on billions of lines of public code, GitHub Copilot puts the knowledge you need at your fingertips, saving you time and helping you stay focused.
It says it has trained the tool using:
a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.
(Because if it's "publicly available" on the Internet, it's fair game, right?)
If you are a copyright owner, does Github need your permission to train a model using your code?
If you are a copyright owner in some source code, and Github has trained its model using your code as one of the many inputs, has Github infringed the rights reserved to you under copyright law?
Which law applies?
There is a question here as to whether English copyright law has any relevance. Even if a developer writes code in the UK and pushes it to Github, there's a reasonable chance that Github will claim that its service is provided by Github, Inc, which is established in the USA, such that English law is irrelevant.
Assuming English law applies, what then?
Github says that using code to train an AI tool is "fair use":
In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.
It may well be under US copyright law (I do not know), but "fair use" is not a concept which appears in English copyright law.
Does training entail "copying", or some other restricted/reserved act?
My starting point would be that the act of training a model is likely to entail the act of copying.
I assume (and it is only an assumption, so please tell me if you think I am wrong) that the act of training will require a computer processing the code in its memory.
To my mind, this is an act of copying, reserved to the copyright owner unless there's statutory permission.
And it's not taking insignificant, immaterial portions of the code. As I understand it, it would be using all the code in the training.
(If I understand "fair use" correctly, my analysis is consistent with Github's. "Fair use" is an affirmative defence, meaning that the person pleading fair use accepts that they did something restricted by copyright law, but that doing it was legally acceptable. If they hadn't done something which was otherwise infringing, there would be no need to claim "fair use".)
I doubt there is statutory permission
Under English copyright law, I am doubtful that there is a statutory permission available to Github.
English law has a provision for use for research purposes, but only if they are "non-commercial".
Similarly, there is a specific statutory right for:
[copies] made in order that a person who has lawful access to the work may carry out a computational analysis of anything recorded in the work for the sole purpose of research for a non-commercial purpose
but, again, there is the "non-commercial purpose" restriction.
My feeling is that, no, there is no statutory permission.
And, if I am right in saying that what Github has done entails copying (or another reserved act), then the only other way to avoid infringement would be for Github to have a licence.
Github probably has a licence for code uploaded to Github by the copyright owner
Looking at Github's terms of service, the relevant section is D4:
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
A6 contains the definition of "Service":
The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.
The licence is broadly worded, and I'm confident that there is scope for argument, but if it turns out that Github does require a licence for its activities then, in respect of the code hosted on Github, I suspect it could make a reasonable case that the mandatory licence grant in its terms covers this as against the uploader.
It might get more interesting if, say, the copyright owner of some code licensed it under terms which did not permit this type of activity, or conditioned copying on adherence to certain requirements, and the Github user uploaded it anyway...
In any case, this does not address the situation in which Github scraped code from other "publicly available sources", and I've yet to come up with a good answer to this, if their training does indeed require an act restricted by copyright. Perhaps they really are gambling on the exclusive application of US copyright law.
If you use Github Copilot, what about...
Ownership of the code?
Github says that the person who uses Copilot owns the code it creates:
GitHub Copilot is a tool, like a compiler or a pen. The suggestions GitHub Copilot generates, and the code you write with its help, belong to you, and you are responsible for it.
This is helpful when it comes to potential arguments as between the user and Github, but is perhaps less helpful if someone else alleges infringement.
Infringing the copyright of one of the authors of code in the training set?
For example, if a particular function is protected by copyright, but it is commonly used because it is licensed under Free / open source licensing terms, could that function be suggested as part of Copilot's activities?
Again, Github has an answer:
GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before.
It goes on to say that:
about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set
Assuming that this is correct, the risk of infringement seems very low: in the vast majority of cases, the output generated for you does not reproduce the original code in the training data set.
Again, assuming this is correct, it seems we are more like someone who was inspired by reading someone else's code, potentially in a completely different context, and then went and wrote their own code to achieve the goal they were trying to achieve.
That being said, I've yet to find a contractual promise from Github that it will indemnify you against claims relating to infringement in code generated by Copilot. So while you might take some comfort from their statement, you are, I suspect, rather on your own here.
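To put the "about 0.1%" figure quoted above into perspective, here is a rough back-of-the-envelope sketch. It assumes that each suggestion independently has a 0.1% chance of containing a verbatim snippet. The 0.1% comes from Github's statement, but the independence assumption and the function names are mine, purely for illustration, and say nothing about how Github actually measures this.

```python
# Rough arithmetic only. VERBATIM_RATE is Github's quoted "about 0.1%";
# treating suggestions as independent events is an assumption of this sketch.
VERBATIM_RATE = 0.001


def expected_verbatim(suggestions: int, rate: float = VERBATIM_RATE) -> float:
    """Expected number of suggestions containing a verbatim snippet."""
    return suggestions * rate


def prob_at_least_one(suggestions: int, rate: float = VERBATIM_RATE) -> float:
    """Probability that at least one suggestion contains a verbatim snippet."""
    return 1.0 - (1.0 - rate) ** suggestions


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, expected_verbatim(n), round(prob_at_least_one(n), 3))
```

On those assumptions, a developer who accepts a thousand suggestions would expect about one of them to contain a verbatim snippet, with a roughly 63% chance of at least one. A verbatim snippet is not necessarily an infringing one, of course, but "very low" per suggestion is not the same as "very low" over the life of a project.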
Practical considerations for developers
Let's assume that you are a developer in England, looking to make use of Github Copilot. What sort of issues might you need to consider? (This list might grow over time.)
A warranty that your code meets an agreed specification
If you are writing code to achieve a purchaser's goal, there's a strong chance that the purchaser will insist that the code you produce meets their specification.
If they've done it well, they'll probably set out a series of acceptance tests, and a framework for doing the testing, with the code only being accepted if it meets those tests.
Does it matter if some of the code was generated by Copilot? IMHO, no. You, the developer, are still responsible for what you produce, and for ensuring that it does the job to the required standard.
It's a bit like a lawyer turning to their precedent documents to prepare something for you: you don't care from whence it came, but rather that it does the job that you want it to do.
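As a minimal sketch of what such acceptance tests might look like in practice, consider the following. Everything in it is hypothetical: `apply_discount` stands in for whatever the developer delivers, and `ACCEPTANCE_CASES` stands in for the tests agreed with the purchaser. The point is that the tests only look at behaviour, so they are indifferent to whether a human or Copilot wrote the function body.

```python
from typing import Callable, List, Sequence, Tuple


def apply_discount(price_pence: int, percent: int) -> int:
    """Hypothetical deliverable: price after a whole-number percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_pence * (100 - percent) // 100


# Hypothetical agreed acceptance cases: (description, inputs, expected result).
ACCEPTANCE_CASES: List[Tuple[str, Tuple[int, int], int]] = [
    ("10% off £20.00", (2000, 10), 1800),
    ("no discount", (2000, 0), 2000),
    ("full discount", (2000, 100), 0),
]


def run_acceptance(
    func: Callable[..., int],
    cases: Sequence[Tuple[str, Tuple[int, int], int]],
) -> List[str]:
    """Run each agreed case against func and return a list of failure messages."""
    failures = []
    for description, args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append(f"{description}: expected {expected}, got {actual}")
    return failures


if __name__ == "__main__":
    failures = run_acceptance(apply_discount, ACCEPTANCE_CASES)
    print("ACCEPTED" if not failures else "REJECTED:\n" + "\n".join(failures))
```

An empty list of failures means the code is accepted. Anything else means it is not, regardless of where the code came from.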
A warranty of non-infringement
If you are developing software for someone, there's a good chance that they will expect you to take responsibility for it as regards infringement. In other words, they want to know (and get a contractual commitment underpinning it) that what you give them does not infringe anyone else's rights.
There's often negotiation about the scope of a warranty of non-infringement (and the remedies available).
For example, you might attempt to carve out responsibility for patent infringement (on the basis that it would be prohibitive to carry out exhaustive patent searching), or responsibility for copyright infringement claims arising from use of Free / open source software. There's plenty of scope for nuance around this, and the extent to which a carve-out should apply, but it's a relatively common thing to see in software development agreements.
Based on what Github has said, it seems unlikely that you need to be concerned about this from a copyright point of view, if the code it generates really is "uniquely generated".
And, unless you're using Copilot's generated code "as is", rather than modifying it to suit your needs more precisely, then determining if your code, or their code, is the problematic bit could be interesting.
Of course, Github has not said that the code it produces will not infringe patents (bah, software patents), so you'd need to think about how you deal with that, just as you would your own code or any third party code you use.
Pricing when you use Copilot
I've seen several instances where a would-be purchaser has said that they want to pay for some custom development, but that the developer cannot use any Free / open source code.
"No problem," says the developer. "But our price will go up by [£££] because we will either need to license in non-FOSS libraries or write our own."
Perhaps we will see the same here: a price for a solution which uses Copilot, and a price for one which does not.
Do you have thoughts? Let me know and, if they're the kind of thing I feel comfortable linking to, I'll be happy to do so.
Need advice? Again, just get in touch.
For me, it's back to installing Debian on All The Things.
Interesting things others have written about this
- Andres Guadamuz, a superb IP law academic, has focussed on potential infringement of the GNU GPL, concluding that, in his view, there is no infringement.