Data-Scraping, AI, and the New Internet Frontier

Posted by Ethan Storck.

Nearly every day we see news articles and stories about “Artificial Intelligence” (or “AI”) and how AI is the future of so many different industries, including medicine, law, banking, cybersecurity, and many more. However, the growth of AI brings new and expanding issues for us to deal with as we bring that technology into our daily lives. One of those issues is the use and regulation of “data-scraping.” “Data-scraping,” also known as “web-scraping,” is where companies such as OpenAI, Google, data aggregators, and others use algorithms to pull large amounts of data from the web (often millions of webpages) for machine learning in developing AI tools. The scraped data then becomes part of those AI tools, whether they are used to write songs, sell products, help with research, or in endless other ways. Gary Drenik covers this in his December 18, 2023 Forbes article, “Data Privacy And Ownership To Remain Key Concerns In Web Scraping Industry Next Year,” through an interview with Denas Grybauskas, Head of Legal at Oxylabs, a leading web intelligence collection platform.

Data-scraping presents several different kinds of legal issues, including intellectual property ownership, privacy, and other concerns, which are now the subject of several lawsuits, and these cases may be just the beginning of a battle over this new frontier. As discussed in Drenik’s article, the current disputes generally come down to two issues. The first focuses on intellectual property rights and the “fair use” of publicly available data. Since data-scraping algorithms often pull information that is available on the internet, which any of us can access through a Google or Bing search, the issue becomes whether that scraping raises any ownership questions. The information pulled may have been created by millions of individual contributors and, as discussed in Drenik’s article, those creators did not necessarily create it with the intent of allowing scrapers and others to profit from it. A key issue is whether scrapers and the companies that use that information should be able to profit from it, and whether the content creators are entitled to any of that profit.

The second major issue focuses on privacy concerns relating to the scraping of personal information. For example, individual contributors are also putting their pictures, voices, opinions, information about their families, and other personal information online. Should aggregators get permission to use that information? How would that even work? Then there is the question of whether contributors who put their personal information on public websites should have any protection in the first place. After all, if someone puts their information on the web where anyone can access it, isn’t that their choice, and didn’t they give up any protection? Should they not have assumed that anyone could use, retweet, or copy that information whenever they want? Where is the line, legally and ethically? And what if someone posts the information of another person, who did not agree to their information being public, and it gets scraped and used?

These are the types of issues that companies that use AI will need to deal with. Recent lawsuits involving OpenAI and Google focus on some of these issues, but they may be only the beginning. The bigger issue is where the line should be. How much should the government regulate businesses that scrape the data? What about the companies that use the data: will they all be subject to lawsuits if they get it wrong? What about cases where someone did not consent to their information being made public? Or where companies seek information that was intended to be private?

When looking at the implications of data-scraping, it’s clear we’re navigating a complex terrain where innovation and personal rights intersect. I agree with the concern, covered in Drenik’s article, that individual content creators did not create their content with the intent of someone else profiting from it, and I agree that there should be some limitations on how scraped information can be used. Still, the contributors did make that information public, and there are consequences to that. Once information is publicly available, trying to figure out ownership and what someone intended may not be realistic. However, that should not include information behind password-protected websites and other areas where there are efforts to keep such information private, or at least not available to everyone. I also agree that we should explore the idea of disclaimers on websites, such as those notifying contributors that information on the website is “public” and therefore can be used by third parties, including any personal information they post. Websites can also decide whether to limit access to their content to invited users, so that it is not public. But for anything that remains publicly available, it is a much harder issue. We will need to find the right balance between what is “public” and what is not and, if anything is used that was not intended to be public, decide how to set damages and penalties, especially for very sensitive information.

Ultimately, this year and in the years to come, we will see more and more discussion about the scope of data-scraping and what the rules should be, including possible responses from agencies and decisions in AI lawsuits that could impact many different industries. We will need to find a balanced approach that will not only foster innovation but also ensure that content creators are acknowledged and protected where appropriate in this fast-evolving digital landscape.

Ethan Storck is a psychology major and business/marketing minor at Seton Hall University, Class of 2026.

Article: https://www.forbes.com/sites/garydrenik/2023/12/18/data-privacy-and-ownership-to-remain-key-concerns-in-web-scraping-industry-next-year/?sh=61418d4b2c05