Training AI on Copyrighted Data: Innovation or Infringement?

By Prof. Dr. Vikneswaran Nair | Deputy Vice Chancellor, University Malaysia of Computer Science & Engineering (UNIMY)

Artificial intelligence does not create in isolation. It learns by analysing enormous volumes of data, including books, news articles, images, music, and computer code. The challenge is that much of this material is protected by copyright.

This raises a difficult legal and ethical question: when AI systems are trained on copyrighted works without permission, is that a legitimate form of learning, or is it copyright infringement?

Why This Issue Matters

The question is no longer theoretical. It is already at the centre of major legal disputes involving AI developers, publishers, artists, and other creators. At stake is not only the future of AI innovation, but also the rights of those whose work helps make these systems possible. One of the most widely discussed cases is The New York Times Company vs. OpenAI and Microsoft. 

The core allegation is that copyrighted news content was used to train AI models without authorisation, and that the systems can generate outputs that closely resemble the original material. The significance of this case extends well beyond one publisher. It may influence how courts, policymakers, and the wider public define the legal boundaries of AI training. 

The Case for Fair Use

AI companies often argue that training on copyrighted data can fall under the doctrine of fair use, or equivalent legal principles in other jurisdictions. Their argument usually rests on several points: 

  • The material is used to train the model, not to resell the original work 
  • The process is said to be transformative because the system uses data to learn patterns rather than simply store and redistribute content 
  • In principle, the model does not reproduce an entire work verbatim as part of training 

This is an important defence, but it is far from settled. Courts will have to decide whether using protected works to train a model is genuinely transformative, or whether it amounts to unauthorised copying on a very large scale. 

Why Creators Are Pushing Back

Writers, artists, musicians, photographers, and software developers have raised strong objections. From their perspective, the issue is straightforward: their work is being used without permission, without payment, and often without acknowledgement. 

Their concerns are not only about legal ownership. They are also about economic displacement. If AI systems can generate text, art, code, or music that competes with the original creators, then the training process may do more than borrow from human creativity. It may undermine the market for it. 

This is why many creators are asking a basic question: why should AI companies be allowed to profit from material they did not license?

Why the Law Struggles to Keep Up

The legal difficulty arises because AI training does not fit neatly into existing copyright categories. 

It is not traditional copying in the usual sense, but neither is it mere inspiration. It falls somewhere in between.

During training, copyrighted works are processed, broken into data points, and converted into mathematical relationships within the model. This makes the concept of “use” harder to define. The system may not store a work in its original form, yet it may still rely on that work to produce valuable outputs. Existing copyright law was not designed with this kind of technological process in mind. 
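For readers who want a concrete picture, the idea of converting text into mathematical relationships can be sketched with a deliberately tiny example. The Python below is a simplified illustration and not how any commercial AI system actually works: it counts which word tends to follow which, and the function names and sample sentence are invented for this sketch. What is kept after "training" is a table of counts rather than the sentence itself, yet those counts can still be used to predict what comes next.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count how often each word follows another (a toy stand-in for training)."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1
    return model

def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical sample text, invented for illustration only.
sample = "the court will decide and the court will rule"
model = train_bigram_model(sample)

# The model holds word-pair counts, not the original sentence.
print(dict(model["the"]))
print(most_likely_next(model, "court"))
```

Real systems are vastly larger and far more sophisticated, but the sketch shows why "use" is hard to pin down: the original wording is not stored as such, and yet the output depends entirely on it.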

Key Questions That Will Shape the Future

Several issues are likely to determine how this debate develops: 

  • How courts define “transformative use” in the context of AI 
  • Whether AI companies will be required to obtain licences for training data 
  • Whether governments will introduce new laws specifically to regulate AI training and data use 
  • How the law will distinguish between learning from content and reproducing it 

At its core, this debate is about more than copyright law. 

It concerns the balance between technological progress and creative rights. It asks whether everything available online can be treated as free raw material for machine learning, or whether permission and compensation remain essential. 

The answer will shape not only the future of artificial intelligence, but also the future value of human creativity in a digital economy increasingly shaped by automation. 
