To Share or Not to Share: Debates and New Approaches on Data Sharing for AI Training
Artificial Intelligence (AI) continues to evolve at an extraordinary pace, and its dependence on vast amounts of training data, especially for Large Language Models (LLMs) and Generative AI, continues to raise significant concerns in the media industry about the use of proprietary intellectual property (IP). In other words, there is growing concern that AI companies are building multibillion-dollar valuations and capturing existing markets on the backs of scraped third-party data, often without compensating the original content owners.
On one hand, a series of high-profile lawsuits will shape what is considered best practice in the industry and the creator economy. For instance, The New York Times, along with a growing number of news publishers and prominent authors, has sued OpenAI and Microsoft for allegedly using their copyrighted content to train AI models, including ChatGPT and Copilot. These lawsuits argue that the AI models store and reproduce significant portions of their articles verbatim, thereby infringing on their copyrights. In a statement, The New York Times noted, "This suit will be the first big test for AI in the copyright law space." Prominent legal experts agree.
However, these best practices might also take years to be defined and adopted by the wider industry. In the meantime, companies adopting AI-powered solutions, for example, for content discovery on streaming services or to accelerate post-production workflows, are already unlocking significant productivity gains and cost reductions. Over the past three years, we've developed the Imaginario AI platform and API, tackling both commercial and legal complexities. Here, I share perspectives and insights from customers who leverage our AI technology for operational gains, alongside our shared concerns about the industry's future direction.
What are the main risks of sharing your data?
Let’s start with the basics: What’s the risk to content owners? Here is a non-exhaustive list of the negative implications:
- Potential market loss: AI-generated and verbatim content can replace original works, leading to a loss of revenue for the original creators. For instance, if AI can generate summaries, news articles, scripts, or b-roll footage based on existing works, it could diminish demand for the original content through lower click-through rates and, therefore, lower revenue.
- Accuracy, biases, and hallucinations: AI models can sometimes produce inaccurate or biased content, which could be mistakenly attributed to the original publishers.
- AI as a credited author and generated content as source material: In some jurisdictions, AI could be credited as the sole creator in the future. If the original content creators are not involved in the creative process, they will also miss out on the recognition, job opportunities, and royalties that would otherwise flow to them.
What are the main advantages of enabling AI?
Despite the risks, the application of AI in media workflows is already proving to add considerable value, cost savings, and revenue opportunities across the media supply chain. According to various reports, the productivity gains from implementing “supermind” technologies (a term coined at MIT for the combination of AI and human intelligence) can range from 20% to close to 70% in many industries.
For example, AI is significantly transforming the streaming experience, enhancing personalized content discovery and providing contextual advertising for viewers. Human-curated AI is also optimizing video QC (quality control), monitoring, post-production workflows (including search, b-roll generation, and repurposing), and localization services.
Wait too long and you’ll miss the AI train while your competitors optimize their workflows. Move too soon and you might be giving away too much data for a limited upside.
Types of Training Data
It’s important to familiarize ourselves with the types of data that help models understand and categorize information, generate content, and/or personalize experiences. Here are some examples relevant to the media industry:
- Audiovisual Assets
- Photographs
- Audio Recordings
- Scripts and Storyboards
- Transcripts, Subtitles, and Closed Captions
- Articles, Opinion Pieces, Reviews, How-to Guides, and Other Textual Data
- Asset-level Metadata
- In-Video Tags and Metadata
- User Engagement / Human Input and Output Data (prompts, clicks, and responses)
- User Demographics and Behavioral Information
  - User Engagement
  - Playback
  - Content Recommendations and Search
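To make the distinction between asset-level metadata (describing a whole file) and in-video tags (time-coded labels inside a file) concrete, here is a minimal sketch of how a media catalog might model both. The class and field names are hypothetical illustrations, not any particular vendor's schema:

```python
from dataclasses import dataclass, field


@dataclass
class AssetMetadata:
    """Asset-level metadata: describes the asset as a whole."""
    title: str
    genre: str
    language: str


@dataclass
class InVideoTag:
    """In-video tag: a time-coded label within the asset."""
    start_s: float  # tag start, in seconds
    end_s: float    # tag end, in seconds
    label: str      # e.g. a detected scene, object, or spoken topic


@dataclass
class VideoAsset:
    """One audiovisual asset with both kinds of training-relevant data."""
    asset_id: str
    metadata: AssetMetadata
    tags: list = field(default_factory=list)


# Example: one asset carrying both levels of data
asset = VideoAsset("ep-001", AssetMetadata("Pilot", "Drama", "en"))
asset.tags.append(InVideoTag(12.0, 18.5, "car chase"))
```

The point of the split is that licensing and opt-out decisions can differ per layer: an owner might share asset-level metadata freely while restricting time-coded tags or the underlying footage.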
Main Claims in Training and Fair Use/Fair Dealing
Now that we are familiar with the high-level risks, opportunities, and types of data that AI models use, it’s time to unpack the main argument used by incumbent AI companies: Fair Use (US) and Fair Dealing (UK). Disclaimer: I am not a solicitor/lawyer and the following is not legal advice.
The primary claims against AI training involve the unauthorized reproduction of copyrighted material and the generation of derivative works, including verbatim reproductions. OpenAI, for instance, argues that their use of scraped data falls under "fair use," a legal doctrine that allows limited use of copyrighted material without requiring permission from the rights holders under certain conditions.
Fair Use in the US and Fair Dealing in the UK, albeit different, provide similar frameworks for defense, primarily considering:
- Character of the Use: Whether the allegedly infringing work is transformative or merely duplicative of the original work.
- Nature of the Original Work: Fair use is harder to claim for highly creative works with significant expression.
- Amount and Substantiality: The extent of the original work used—whether it's a small segment or the entire work.
- Impact on Commercial Market: Whether the derivative work displaces the original or affects its market potential. According to experts, this is the most important element.
Fair use allows for criticism, comment, news reporting, teaching, scholarship, or research. In contrast, fair dealing in the UK includes non-commercial research, private study, criticism and review, news reporting, quotations, and educational use.
This means that even though incumbent AI companies might not be able to store and use entire news broadcasts, films, books, and more to train their models, they could use smaller portions and argue that the amount used is not substantial. In addition, summaries, analyses, and other model outputs could constitute highly creative and transformative works that do not copy the expression of the original work. This is still up for debate in the US and UK courts.
Industry Approaches and New Initiatives
In response to these challenges, new initiatives are emerging in the industry to embrace the adoption of AI technologies. At Imaginario AI, we are implementing these initiatives because we believe they offer the best balance between protecting rights holders and developing genuinely useful products.
- Embracing AI with trusted partners that operate with caution: Companies like Getty Images are now partnering with AI firms like Nvidia to embrace generative AI within controlled parameters. Getty promises full indemnification for commercial use and shares revenue with contributors whose images are used in training datasets.
- Explainability and data transparency: Ensuring transparency about the data used to train AI models is becoming crucial. This involves detailing how data was sourced (provenance), how it was cleaned, annotated, shaped, and updated, as well as the main datasets included in the training process.
- Licensing data: Licensing agreements are being explored as a way to ensure that original content creators are compensated when their data is used for AI training. For example, OpenAI has entered into multi-million-dollar licensing agreements with content providers including the Associated Press, Reddit, News Corp, and Axel Springer SE to use their data for training purposes.
- Guardrails and control measures: AI companies are introducing guardrails to prevent the generation of harmful or misleading content. For instance, Getty Images has implemented measures to block the creation of politically harmful deepfakes, ensuring that AI-generated content does not produce recognizable people or brands without permission.
- Opting out of general models and fine-tuning: Some companies prefer to opt out of training general models and instead train their own AI models on proprietary data in secure cloud ecosystems (e.g., AWS, GCP, Azure) or on-premises. This approach minimizes the risk of IP infringement, ensures compliance with legal standards, and produces better results fine-tuned for the client’s use cases.
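Separate from opting out of general model training, content owners who want to keep AI crawlers from collecting their public pages as training data can signal this in robots.txt. The user agents below are documented by their operators (OpenAI's GPTBot, Google's Google-Extended, and Common Crawl's CCBot); note that compliance with robots.txt is voluntary on the crawler's side:

```text
# robots.txt — opt out of common AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

This only addresses future web scraping; it does not remove content from datasets that have already been collected, which is where the licensing and contractual approaches above come in.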
Conclusion
The debate over the use of third-party intellectual property in AI training is far from settled. As AI technology continues to advance, especially by incumbents like OpenAI and Microsoft, the tension between innovation and protecting IP rights will likely intensify in the coming months and years. However, by adopting new initiatives focused on data transparency, opting-out, contractual approaches, and licensing, the industry can move towards a more equitable solution that balances the interests of AI developers and content creators.
Contextual Search. Imaginario AI can discover specific scenes across vision, speech and sounds for social media repurposing, dailies and compliance.
Social Media Clipping. Resize, add captions and brand your videos for TikTok, Instagram Reels and YouTube Shorts.
Chapterization. Imaginario AI breaks down all your long-form videos into snackable chapters including a title and short summary.
[Editor's note: This is a contributed article from Imaginario. Streaming Media accepts vendor bylines based solely on their value to our readers.]