
Why You Can't Share Your Data

Reporting Xpress Blogger

Using AI for data analysis seems like the holy grail for helping organizations make informed decisions faster, leading to increased revenue, improved operating efficiencies, faster responsiveness to board questions, etc. We hear over and over what a game-changer it can be.

However, most of the tools that your teams are likely to come across have a major flaw: when you expose your data to them for processing, the AI companies may have the right to collect your data and apply it to train their generative AI models. Think about the possible consequences of that.

Consider this scenario: if you are a nonprofit, would you be comfortable with a random person asking one of the AI Chat models for a profile of one of your largest donors and getting a list of their contributions to your organization? The origin of that data would be obvious.

Or, if you are a company and someone on your team uses an AI Chat tool to analyze sales data, could a question that someone else poses later expose information from your detailed sales results for this year?

**Just in case you are now panicking about how your team might be using AI for data analytics, and maybe you should be, I will point out that Xpress Analytics does not share data with any AI model. More on this below.**

There is a very good article on this subject by a couple of academics in The Strategist, titled "The high cost of GPT-4o."

The gist is that the major AI firms, including Microsoft, Meta, and Google, have quietly updated their privacy policies in ways that potentially allow them to collect user data and use it to train their models.

Why are they doing this?

They are quickly running out of high-quality data to use to train their models and continue to improve them, and no one has figured out yet whether “synthetic data” created by an AI model to train an AI model is anywhere near as valuable as “original human-constructed content” for training purposes.

In the case of GPT-4o, OpenAI has dropped the price of certain services to free to entice people to share their data. There are, in theory, ways to tell AI companies not to use your data, but they are typically not the default settings, and according to the article, they are not easy to turn on.

Then, there is the question of whether you really trust those companies with your data regardless of the settings. There are always opportunities for a technical issue to allow something unintended to happen or, as has been seen at Amazon in the past, cases where mid-level employees allegedly ignored company policy meant to protect seller data to accomplish their job goals more effectively.

It is not all bad news, though. You can take advantage of this incredible technology without sharing your data. At Reporting Xpress, we recognized very early on what a game changer applying AI to data analysis (versus content creation) could be. In fact, it was so early that when we started building prototypes of Xpress Analytics, there was no way to send large datasets in their native form to the AI companies, and the context windows were too small to hold a useful dataset of any real size in any form.

We had to develop proprietary methods of describing large datasets and optimizing their structure so that we could pass information about an optimized dataset to an LLM without including any of the actual data. We could then ask AI to help with analysis by generating queries that it would send back, and we could execute them in a private environment. It turns out this works really well. As concerns about data privacy began to manifest, Xpress Analytics was in the clear.
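To make the general idea concrete, here is a minimal sketch in Python. It is not the Xpress Analytics method, which is proprietary; it only illustrates the pattern of sending a description of the data (column names and types) instead of the data itself, then running the returned query locally. The table, the prompt wording, and the `fake_llm` function are all assumptions made up for this illustration; `fake_llm` stands in for a real model call and simply returns a canned query.

```python
import sqlite3

def describe_schema(conn, table):
    """Return column names and declared types only; no row values are read."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    cols = ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in rows)
    return f"Table {table}({cols})"

def build_prompt(schema, question):
    """Combine the schema description and the user's question into a prompt."""
    return (
        "You are helping with data analysis.\n"
        f"Schema: {schema}\n"
        f"Question: {question}\n"
        "Reply with a single SQLite SELECT statement only."
    )

def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned query."""
    return (
        "SELECT region, SUM(amount) AS total "
        "FROM sales GROUP BY region ORDER BY region"
    )

def run_private(conn, sql):
    """Execute the generated query locally, allowing only SELECT statements."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

# Private environment: the raw rows live only in this local database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0)],
)

schema = describe_schema(conn, "sales")
prompt = build_prompt(schema, "What are total sales by region?")
sql = fake_llm(prompt)           # only the prompt would leave your environment
result = run_private(conn, sql)  # the query runs against local data
print(prompt)
print(result)
```

The only text that would reach the model here is the prompt, which contains the table layout and the question but none of the sales rows. A real implementation would need a much stricter check on the returned query than the simple `SELECT` test shown.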

If any of the AI-powered data analysis software you are considering relies on technology from any of these companies, you need to understand the mechanics of how it uses that technology and whether your raw data is being shared or exposed. Software companies that started later and are playing catch-up often use an easier-to-implement but more dangerous approach, which requires sharing customer data with the AI company whose technology they are leveraging.

Users of Xpress Analytics do not have to worry about their data being sent to any of the great AI models that power its analysis capabilities. We can also easily make available multiple models from multiple AI companies to enhance the user experience and the speed at which you can complete data analysis.

P.S. If the vocabulary around AI, Generative AI, AI companies, Models, and LLMs (an abbreviation for Large Language Models) has you a bit confused, here is a quick primer.

When we speak of AI today or “an AI,” we are typically referring to Generative AI. Generative AI is a type of artificial intelligence that can create new content such as text, images, or music based on the data it has been trained on. It operates by learning patterns from vast amounts of data and then using those patterns to generate novel outputs that resemble the training data.

For example, Generative AI models like OpenAI's GPT-4 can produce human-like text based on the prompts they receive. It turns out that the computer code required to perform data analysis is just another form of written text, so Generative AI models can be used effectively for this purpose. So, in general, “AI”, “LLM”, “Generative AI model”, and “Model” mean more or less the same thing these days when you are reading about them, and that is definitely the case in this post.

When we speak of “AI Companies”, the ones whose access to our data we worry about, we typically mean those that have released one or more LLMs/Models. The AI Companies currently believed to have the best publicly available models include OpenAI (with major investment from Microsoft), Anthropic, Google, Meta, and Mistral.

Each company typically has multiple models or versions of its core models, usually with version names or numbers to differentiate them. For instance, OpenAI has released several versions of its GPT (Generative Pre-trained Transformer) model; when someone mentions GPT-4, they are referring to the fourth iteration of that LLM. Google's models are known as Gemini, Anthropic’s models are named Claude, and so on, and most have versions designated primarily by numbers.

Most “AI-powered” software companies leverage “models” from one or more of the above companies. Again, the question you should always ask is how they leverage those models and whether your data is shared with those companies in the process.