AI and scraped data: Data protection implications

11 minute read  12.03.2024 Prasanth Kapilan, Susan Kantor, Paul Kallenbach

We explore the data protection implications that arise when data scraping is conducted for the purpose of training artificial intelligence tools.

Key takeouts

  • Regulators around the world, as well as members of the general public, have raised concerns about the mass scraping of data to train sophisticated artificial intelligence tools such as large language models.
  • Australian privacy law implicitly restricts data scraping, at least in certain circumstances. In addition, data security risks arise in connection with the use of data scraping tools.
  • Given the increased proliferation of data scraping, and the increased uptake and rapid development of artificial intelligence tools, more explicit regulation may be required to address the privacy risks posed by these practices.

The role of data as a key driver for artificial intelligence (AI) innovation and development remains paramount. The increased sophistication of AI tools, which are capable of advanced language processing, enables vast troves of data to be ingested and analysed. Through these processing capabilities, important insights are derived that have the capacity to shape our society, including its geopolitics, social mores and financial markets.

In response to these capabilities, large tech companies (including Amazon, Apple, Google, Meta, Microsoft and Nvidia) have redefined their strategic approach to ensure that AI development is a key focus. Smaller organisations also perceive the upside of AI tools and technologies, and many are developing and commercialising their own models, or using existing AI tools to enhance their business operations. As a result, the AI landscape has expanded rapidly. In each case, advanced AI models rely on large amounts of data for their training. To obtain this data, large-scale scraping of information (including personal data) has been undertaken – raising concerns amongst regulators globally about the potential implications for privacy and data protection.

Organisations in Australia who are using or are considering using data scraping tools for the purposes of training AI should be aware of the potential risks associated with these practices.

AI lawsuits commence

In response to the rapid development and increased uptake of AI tools, legislative and executive bodies, such as the European Commission, continue to accelerate their discussions to regulate AI technologies (as reflected by the provisional agreement reached by the Council and European Parliament on its AI Act in December 2023 and the imminent plenary vote brought forward to 13 March 2024).

Although these global regulatory efforts have recently accelerated, they still lag behind the deployment of AI. Consequently, aggrieved parties have turned to the courts to seek relief against large tech firms.

Last year, several lawsuits were brought against Microsoft and OpenAI. In particular, one class action lawsuit, filed in federal court in San Francisco in June 2023, alleged that Microsoft and OpenAI violated the privacy of more than one hundred million individuals by scraping data to train ChatGPT.

The lawsuit was subsequently withdrawn. Whilst the reasons behind the withdrawal are unclear, the issues relating to the scraping of all forms of data (including personal information) to train AI tools remain contentious.

What is data scraping?

Data scraping involves the extraction of publicly available information from online sources. Broadly, there are two types of online data scraping:

  • website scraping – the extraction of data from a website (such as a social media site). In particular, this involves extracting information from a website's URL, corpus of text, and HTML code; and
  • screen scraping – the extraction of visual data through emulating an end user. This could include the extraction of various kinds of display data, such as screenshots and visual images.

Typically, data scraping is used to obtain and aggregate large volumes of data for analytical purposes. Then, other technologies are used to analyse this data and derive strategic insights that can be applied in various domains (for example, acquiring leads, monitoring competitors, or extracting product reviews).
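By way of illustration, the website scraping described above can be sketched with a short routine that parses a page's HTML and pulls out its visible text and link targets. This is a hypothetical, minimal example using only Python's standard library (a real scraper would fetch live pages, for instance via `urllib.request`, and the sample page below is invented for demonstration):

```python
from html.parser import HTMLParser

# A hypothetical page, standing in for HTML fetched from a live site.
SAMPLE_HTML = """
<html><body>
  <h1>Product reviews</h1>
  <p>Great battery life, says <a href="/users/jane">Jane</a>.</p>
  <p>Fast shipping, says <a href="/users/amir">Amir</a>.</p>
</body></html>
"""

class PageScraper(HTMLParser):
    """Collects visible text fragments and link targets from a page."""

    def __init__(self):
        super().__init__()
        self.links = []   # href attributes of <a> tags
        self.text = []    # fragments of visible text

    def handle_starttag(self, tag, attrs):
        # Record the destination of every hyperlink encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty text fragments (the page's visible content).
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

scraper = PageScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.links)
print(" ".join(scraper.text))
```

Even this toy example surfaces the privacy issue discussed in this article: the extracted links and text include personal identifiers (here, user names and profile URLs), which a large-scale scraper would collect indiscriminately alongside everything else on the page.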

There has been ongoing regulatory scrutiny of data scraping, with the Office of the Australian Information Commissioner (OAIC) and 11 other international data protection and privacy authorities releasing a joint statement on this practice, Global expectations of social media platforms and other sites to safeguard against unlawful data scraping, in August 2023. The joint statement raises concerns around the large-scale scraping of personal information for the purposes of reselling data to third party entities (which may include malicious actors) for commercial gain. The joint statement also refers to other risks that may be heightened by this practice, such as identity fraud, targeted cyberattacks, and unauthorised use of such data for political or intelligence purposes, including by foreign intelligence agencies.

AI and the proliferation of data scraping

Large language models (LLMs) are a form of AI which combines deep learning techniques and large data sets (with specific parameters) to perform natural language processing tasks. Their purpose is to recognise inputs, and generate text, graphics and other outputs, in order to solve problems and simplify processes. Examples of LLMs include OpenAI's ChatGPT, Google's Gemini and Meta's LLaMA.

LLMs require vast troves of data, drawn from myriad forms of user-generated content, to train their algorithms. This can result in the indiscriminate scraping of personal information (including names, addresses and other personal identifiers). Major platforms, including Wikipedia, Twitter (and other social media repositories), large forums such as Reddit, developer sites such as Stack Overflow, and e-commerce websites such as eBay, have reportedly been subject to large-scale data scraping for the purpose of training LLMs.

The challenge, therefore, is to balance the demand for data to train and fine-tune these powerful LLMs (which is essential for their functionality) and the indiscriminate mass scraping of personal information (which may infringe on privacy rights).

Australia's approach to data scraping

The Australian Government is considering a range of regulatory reforms to manage risks associated with the use of AI. In tandem with the long-awaited reforms to the Privacy Act 1988 (Cth) (Privacy Act), which reflect the Australian Government's ongoing commitment to enhancing privacy rights, the Government is considering introducing AI-specific laws (we recently explored the Federal Government's response to the Privacy Act 1988 review and its reform plans). Pending these changes, organisations should be aware of their existing obligations under the Privacy Act relating to data scraping.

The OAIC's current stance

The Privacy Act imposes various obligations on entities governed by the Act who scrape publicly accessible personal information:

  • Restrictions on the collection of personal information: APP 3 provides that an organisation must only collect personal information that is reasonably necessary for its functions or activities. Whether this objective test is satisfied (that is, whether a reasonable person who is properly informed would consider the collection to be necessary) will depend on, amongst other things, the primary purpose of the data scraping itself. Large-scale, indiscriminate data scraping may (objectively) be considered excessive, since it is conceivable that less data could be scraped in order to achieve the relevant primary purpose. Furthermore, the collection of sensitive information – which includes information about an individual's racial or ethnic origin, religious beliefs, sexual orientation or health – requires the scraper to obtain the relevant individuals' consent, which is unlikely to be attainable.
  • Notification of the collection of personal information: APP 5 requires that an organisation must, at or before the time of collection (or, if this is not practicable, as soon as practicable after collection), take reasonable steps to notify the individual of certain matters (or otherwise ensure that the individual is aware of any of those matters). This notification generally takes place via the provision of a collection notice to the individual. The matters required to be notified include the collector's identity, its contact details, and the purposes of collection. Given the automated nature of data scraping, it may not be practicable for organisations engaging in data scraping to give APP 5 compliant collection notices to each affected individual.

Safe and responsible AI in Australia consultation

On 17 January 2024, the Australian Government published its Safe and responsible AI in Australia consultation interim response (Report) (see our previous article, Mandatory safety guardrails foreshadowed for high-risk AI for further insights into the Report). Although the Report does not refer to data scraping, it alludes to potential privacy harms arising from the development phase of an AI tool. In particular, the Report refers to implementing transparency on the data used to train AI tools, including to enable the general public to better understand the extent to which an AI tool has ingested large amounts of personal information. Additionally, these transparency requirements may encourage organisations to adopt a more moderate approach to the collection of data for training LLMs, as large-scale indiscriminate data scraping may be viewed unfavourably by regulators and the general public.

Europe's approach to data scraping

Europe's data protection framework is relevant to Australian organisations in two ways.

First, Australian organisations should be mindful of the potential applicability of Europe's General Data Protection Regulation (GDPR) to their activities.

The GDPR's extraterritorial scope will apply to an Australian organisation if:

  • it offers goods or services to individuals located within the European Union (EU) (irrespective of whether the goods or services are offered for a payment or for free); or
  • it monitors the behaviour of individuals located within the EU (which may include tracking individuals through cookies).

Failure to comply with the GDPR (where it applies) may expose an Australian organisation to European regulators' aggressive approach to data scraping, including the large fines levied on organisations that regulators conclude have used data scraping technologies inappropriately.

Second, Australian privacy law reform, as well as regulators' interpretation of current privacy laws, has been influenced by the GDPR. Consequently, Europe's approach to data scraping (explained further below) may be reflected in the interpretation of current Australian privacy laws or in the impending Commonwealth privacy reforms.

Europe's General Data Protection Regulation

Similar to Australia's privacy framework, the GDPR imposes implicit parameters around data scraping.

Data minimisation

Article 5(1)(c) requires the processing of personal data to be 'adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed'. This principle of data minimisation is particularly relevant to data scrapers. Where an AI tool undertakes large-scale, indiscriminate scraping, an assumption that all personal data collected is relevant to the purposes of processing is unlikely to hold.

Lawful processing

Article 6 sets out the lawful bases on which entities, such as a data scraper (which would be considered a 'data controller' for these purposes), may process personal data. Broadly, the lawful bases are consent, contract, legal obligation, vital interests, public task and legitimate interests. Relevantly, a data scraper would likely seek to rely on the 'legitimate interests' basis. Yet, in 2022, Greece's Hellenic Data Protection Authority levied a €20 million fine against Clearview AI on the grounds that it could not rely on the 'legitimate interests' basis for engaging in the large-scale scraping of facial images from the internet.

Notifying the relevant data subject

Article 14 applies where personal data has not been obtained directly from the data subject (an identified or identifiable individual). In that case, the data controller must provide the data subject with certain information, including the identity and contact details of the data controller, the contact details of the relevant data protection officer, the purpose(s) of the processing, and the relevant lawful basis for processing.

There are several exceptions to this requirement, which are set out in Article 14(5). The most relevant exception that may be relied upon by a data scraper, in Article 14(5)(b), is that the 'provision of such information proves impossible or would involve a disproportionate effort'. This exception is, however, notoriously difficult to rely upon. Specifically, the United Kingdom's Information Commissioner's Office (ICO) has clarified that, in order to rely on this exception, the organisation would need to make an assessment (and document it) as to whether there is a proportionate balance between the effort involved in providing the relevant notification to data subjects, and the impact that the processing (ie scraping) of their personal information may have on them.

Europe's AI Act

On 9 December 2023, the European Parliament and Council reached a provisional agreement on the remaining points of Europe's regulation of artificial intelligence (EU AI Act). In particular, the proposed EU AI Act considers the untargeted scraping of online facial images to constitute an unacceptable threat to fundamental rights and freedoms. A breach of the EU AI Act can attract maximum fines of the higher of €35 million or 7% of global annual turnover. The European Parliament's plenary vote on the proposed EU AI Act is expected to occur on 13 March 2024.

What will 2024 bring for AI and technology regulation?

2024 will likely see further claims brought against tech companies, as well as heightened regulatory activity in Australia, Europe and other jurisdictions, relating to the scraping of data to train and fine-tune AI models.

AI offers significant benefits, but also poses privacy and data protection challenges that require careful consideration. In particular, LLMs that rely on massive amounts of scraped data may infringe on the privacy rights of individuals whose personal information is used without their knowledge or consent. Organisations that develop or adopt AI tools should be aware of the potential legal implications in Australia, Europe, and other jurisdictions, and ensure they comply with relevant data protection frameworks. Moreover, they should keep up with ongoing reforms and developments in this area, as lawmakers and regulators seek to address the emerging risks and opportunities of AI.

The team at MinterEllison can assist you in understanding the legal issues and risks associated with AI to your organisation. If your organisation is planning to use or interact with AI and you need more detailed advice, contact us.