IoT For All
How can IoT use synthetic data and what are the risks? Adam Kamor, Co-Founder and Head of Engineering of Tonic.ai, joins Ryan Chacon on the IoT For All Podcast to discuss synthetic data with AI and IoT. They cover data synthesis and ML model training, synthetic data for IoT simulations, generative AI, data risks with generative AI, and solutions using generative AI.
About Adam
Adam Kamor, PhD, is Co-Founder and Head of Engineering of Tonic.ai. Since completing his PhD in Physics at Georgia Tech, Adam has committed himself to enabling the work of others through the programs he develops. In his roles at Microsoft and Kabbage, he handled UI design and led the development of new features to anticipate customer needs. At Tableau, he played a role in developing the platform’s analytics/calculation capabilities. As a founder of Tonic.ai, he is leading the development of data generation solutions that are transforming the work of fellow developers, analysts, and data engineers alike.
Interested in connecting with Adam? Reach out on LinkedIn!
About Tonic.ai
Tonic.ai is the fake data company. They mimic your production data to create de-identified, realistic, and safe data for your test environments.
Key Questions and Topics from this Episode:
(00:54) Introduction to Adam and Tonic.ai
(02:25) Data synthesis and ML model training
(05:44) How can IoT use synthetic data?
(10:45) Generative AI
(12:28) Data risks with generative AI
(15:00) Solutions using generative AI
(17:31) Learn more and follow up
Transcript:
- [Ryan] Hello everyone and welcome to another episode of the IoT For All Podcast. I'm Ryan Chacon, and on today's episode we have Adam Kamor, the Co-Founder and Head of Engineering at Tonic AI. They are a fake data company focused on helping companies mimic their production data to create de-identified, realistic, and safe data for your testing environments.
Interesting conversation. We're gonna talk about generative AI, challenges with it, data risks, solutions utilizing generative AI. What is data synthesis? What does it mean? How can you synthesize time series data and how can it be used for IoT simulation and ML modeling? I think you'll get a lot of value out of this one, but before we get into it, I would love it if you could give this video a thumbs up, subscribe to our channel, hit that bell icon, so you never miss an episode, and if you're listening to this on a podcast directory, subscribe, so you get the latest episodes as soon as they are out.
Other than that, let's get onto the episode.
Welcome Adam to the IoT For All Podcast. Thanks for being here this week.
- [Adam] Hey, glad to be here. Thanks for the invite.
- [Ryan] Absolutely. I'd love to kick this off by having you give a quick introduction about yourself and the company to our audience.
- [Adam] Sure. My name's Adam, and I am the co-founder and Head of Engineering of a company called Tonic.ai. And that's also our website domain. So please check us out. Tonic AI is the fake data company, that is to say we generate fake data for our customers, and we meet them where they are and for the use cases that they have.
Typically our customers use Tonic for one of two things. They want to de-identify sensitive data. A common use case there would be, oh man, I have a production database. My application testers and developers can't use production data in lower environments. So let's create a de-identified version of that database, which we can use for testing and development.
That's a great use case for Tonic, and that's actually how the company got started. But we also now generate synthetic data for our customers, which has a slightly different use case. It's typically more for machine learning training or model training or for analytics. But the point of synthetic data is to essentially build a mathematical model of your data, like an understanding of that data from a mathematical point of view.
And then generate synthetic rows of data where you can't tie any like individual synthetic record back to an original record. But it's more like generated from your understanding of the model itself. And the use cases there, like I said, are primarily for like analytics, data science and machine learning model training.
- [Ryan] Yeah, tell me a little bit more about that, how data synthesis works with ML model training and things like that.
- [Adam] So it- we have found it, like we have found two use cases that I think speak best to folks generating synthetic data for machine learning model training. The first use case is primarily around privacy. When we started our company and we were focusing basically entirely on application databases, there was already a movement underway to get production data out of lower environments.
To restrict access in the development organization. And we're starting to see that happen now with data scientists. Where data scientists when they're doing their model development or their exploratory data analysis, prior to like actually developing models, they're not being given that unrestricted access to production that they used to be, that they used to have.
So it's useful for tools like Tonic to come in to generate either de-identified or synthetic datasets, which can be used for exploratory data analysis and for that initial model development. So that's the privacy use case. I actually could break that down a little further, and I'll just do it really quickly.
I already said this. There's like the exploratory data analysis and then there's like the initial model development. When you're doing exploratory data analysis, it's best to typically work with a completely de-identified database or data warehouse or what have you.
Whereas when you're doing that initial model development, you typically want to work with synthetic data, which has a really high statistical accuracy and similarity to the original dataset. And because each approach has like limitations and you know where it works best. And that's what we have found.
And now the other use case for synthetic data is actually not related to privacy at all. It's related to model efficacy. A good example would be like the data augmentation use case. You're training a model, and I'll do a simple use case. It is a- it's like a logistic regression. It's meant to give you a yes no. It's a binary classifier.
And maybe you're trying to classify churn at your company, right? Like you don't have a lot of customers that churn, but you have some, and you'd like to predict who is likely to churn, so you can take steps to prevent it. That's a pretty common thing that companies do. If you're training your model on your company data, and you don't have a lot of churning customers and, hopefully, you don't, then your logistic regression will sometimes do a poor job in determining who is likely to churn.
Because as it's training on this data, the people that don't churn are swamping out the people that do churn, and it's just not picking up on those that churn. So you can actually generate synthetic examples of churning customers. Take that synthetic set of churning customers, throw it into the original training dataset to rebalance, so you could have perhaps equal numbers of churning and non-churning.
And then you can train your logistic regression on this more balanced dataset and hopefully improve the quality of your classifier.
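As a rough sketch of the rebalancing idea Adam describes, here is a minimal Python example. The column names and values are purely hypothetical, and the synthetic_churners DataFrame simply stands in for the output of a synthetic data generator such as the one discussed here:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical customer data: non-churners heavily outnumber churners.
real = pd.DataFrame({
    "monthly_spend":   [20, 25, 30, 22, 80, 28, 24, 27],
    "support_tickets": [0, 1, 0, 0, 5, 1, 0, 2],
    "churned":         [0, 0, 0, 0, 1, 0, 0, 0],
})

# Stand-in for synthetic churners produced by a synthetic data tool.
synthetic_churners = pd.DataFrame({
    "monthly_spend":   [75, 90, 85, 70],
    "support_tickets": [4, 6, 5, 3],
    "churned":         [1, 1, 1, 1],
})

# Rebalance the training set by appending the synthetic minority-class rows.
balanced = pd.concat([real, synthetic_churners], ignore_index=True)

clf = LogisticRegression()
clf.fit(balanced[["monthly_spend", "support_tickets"]], balanced["churned"])

# Score churn probability for the original customers on the rebalanced model.
print(clf.predict_proba(real[["monthly_spend", "support_tickets"]])[:, 1])
```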
- [Ryan] One thing I wanted to bring this back around to and just get your thoughts on is how this fake data, synthetic data can be used for IoT. And the reason I bring it up, obviously our audience is very much focused in that area, but simulations are a big area for IoT, being able to simulate environments, simulate use cases, different situations. But having access to data is not always a given, or they can't use certain data because of privacy reasons and so forth to run these models. So how can that- how can what we're talking about now be adapted or be thought about in the IoT simulation space?
- [Adam] It definitely can. Let's go- from the privacy side of the house, it is roughly the same story, whether it's IoT or not IoT, right? You have a dataset. Let's assume, I guess for this conversation, that maybe you have two types of data, right?
Like you have a table which records all of the IoT devices in the field and properties about them. A dataset like that, it's more about like the relationships between columns, right? If the device type is this, it's more likely to be in this location than in that location, right?
So you care about column- relationships between columns, individual rows are typically independent of each other in a dataset like that. Then you have your other table, which actually records the values that perhaps IoT sensors are sending you over time. This table might have a handful of columns.
It would have the IoT device ID, it would have the value that it sent, and it would have a timestamp for when it sent it. It might have other columns like related to categorical properties of the device. Perhaps it could have a name, it could have the type of device it is, it could have the location perhaps, or that information might be in that earlier table.
And you would typically grab it by doing some type of join. On that first table, which has the properties of the devices, it's a very similar approach when I was talking about that churning example earlier, right? Because the churning dataset that we were talking about would typically be more relationships between columns and not rows.
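To make the two-table picture concrete, here is a minimal sketch of that join in Python. The table layouts and column names are hypothetical stand-ins for the device-properties table and the readings table Adam describes:

```python
import pandas as pd

# Hypothetical device-properties table: one row per device, with related columns.
devices = pd.DataFrame({
    "device_id":   ["d1", "d2", "d3"],
    "device_type": ["thermostat", "flow_meter", "thermostat"],
    "location":    ["plant_a", "plant_b", "plant_a"],
})

# Hypothetical readings table: one row per emitted value, ordered in time.
readings = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d3"],
    "value":     [21.5, 21.7, 3.2, 19.9],
    "sent_at":   pd.to_datetime([
        "2023-04-01 00:00", "2023-04-01 00:05",
        "2023-04-01 00:00", "2023-04-01 00:00",
    ]),
})

# Pull the categorical device properties onto each reading via a join.
enriched = readings.merge(devices, on="device_id", how="left")
print(enriched)
```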
You could de-identify the data, or you could just generate synthetic IoT devices. And I think that's like pretty- that's pretty straightforward. The time series data- oh, and so to that point, let's say you have several IoT devices that just aren't very common.
You can certainly generate more examples of them, but okay, you can add entries to that first table, that's all well and good, but it's fairly straightforward also. Like the complex part is, okay, what we really care about is what data are these IoT devices sending? Like maybe there's like very rare events that sensors pick up that you would like to create more of these events, for example.
So at Tonic, we make a distinction when we're talking about synthetic data. We make a distinction between column type relationships and datasets that have this longitudinal aspect to them, or where there's relationships between rows, right? For example, talking about that second table, the- if I focus in on a single IoT device, if it emitted a value of X at the first timestamp, the next timestamp, it's likely gonna emit a value that's a function of the previous value.
And if I look at all of the values emitted and their timestamps for a given device, that graph of those values should tell a story about that device over time that makes sense. So generating synthetic event streams is, I think, really where our tool shines because it's not easy to do, and it- event-driven or time series data plays a really serious role in many different industries.
IoT is a very good example. The banking and finance sectors where you talk about banking account transactions or credit card transactions is another really good example. So we have the ability to generate synthetic IoT devices, but more importantly, to generate synthetic event streams.
And you can even have the tool focus in on the event streams that are most interesting to you. Maybe you have tons of the boring events. The ones that just happen all the time. But, oh, every once in a while, this thing in the physical world happens that causes the event streams to go crazy and do really interesting things.
And it might be that which is what you're trying to focus in on and synthesize, and that's what the tool would be able to do for you.
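As a toy illustration of the row-to-row dependence Adam describes, and not a description of how Tonic's models actually work, the sketch below generates a synthetic event stream where each value is a function of the previous value, with occasional rare spikes mixed in. All parameters and names are made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def synthetic_event_stream(device_id: str, n: int = 100,
                           spike_prob: float = 0.02) -> pd.DataFrame:
    """Toy stream: each value depends on the previous value, plus rare spikes
    of the kind you might want a synthesis tool to focus on and oversample."""
    values = [20.0]
    for _ in range(n - 1):
        nxt = 0.9 * values[-1] + 2.0 + rng.normal(0, 0.5)  # drifts around ~20
        if rng.random() < spike_prob:                        # rare "interesting" event
            nxt += rng.uniform(15, 30)
        values.append(nxt)
    return pd.DataFrame({
        "device_id": device_id,
        "sent_at": pd.date_range("2023-04-01", periods=n, freq="5min"),
        "value": values,
    })

stream = synthetic_event_stream("d1")
print(stream.head())
```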
- [Ryan] Yeah, we've had some guests on talking about just the simulation side, digital twins, and so forth in the IoT space and the role that they play in helping people get to different stages of their IoT journey more successfully. And the data is a big part of that. If you don't have accurate data to run these simulations and model different scenarios, and it doesn't have to be real data, but it does have to be accurate, then it's almost useless.
So I appreciate you breaking that down a little bit. So I wanted to pivot a little bit to the AI side of things a little bit on- talk about generative AI and things along those lines. What has your experience been? What do you all do? Where's the- how do you overlap into that space, if at all?
- [Adam] So, I spoke earlier about the tool. The two high level use cases we have, data de-identification and synthetic data. Synthetic data or at least our approach to synthetic data is generative AI. Right now because of what's happened in the past month or two with OpenAI and ChatGPT, when you hear generative AI, you typically think of large language models and unstructured data.
So it could be like give me an image of a polar bear using roller blades or whatever people are doing with Midjourney. Or it could be like having a conversation with ChatGPT, asking it, hey, how do I fix my lawnmower? It has these symptoms, right? There's other types of data that you can generate via AI and one of them is structured data.
So our synthetic data offering is in earnest generative AI for structured data. So I think we've actually been in this space for, I think the early versions of this are probably closer to two years old. And then we productized it into its own offering specifically for data scientists about a year ago.
And that's what we've been doing. We're currently like- we're doing a lot of thinking at the moment on how we can best utilize these large language models in our own offerings to make our structured synthesis either better or just maybe to use a different approach, but it really is an exciting time to be a data scientist in this space.
- [Ryan] Definitely. Yeah. What are some of the challenges that you've found or some of the maybe risks on the data side that are associated with generative AI? Just because I think our audience, like you said, when they think of generative AI, most of the time they're thinking of what they're hearing about, ChatGPT and stuff, but how this applies in this setting, I'd be curious to just get your thoughts on what are the real challenges and things people need to be thinking about.
- [Adam] Yes. That's a good question, and my answer applies whether you're using some of these large language models or whether you're using Tonic's own models. It's really the same. So when you train these models on your own data, the output from these models, you know what they give back to you, like the actual thing generated, is gonna reflect what it was trained on, right? And that's good. That's what you want. But it's also bad because what if you train it on something that's very sensitive and then it emits those values when it shouldn't.
So let's use ChatGPT as an example because not everyone knows it, but it's certainly more well known than our offering. With ChatGPT or really any of these large language models, you can take these models, the ones that are open sourced at least, and add your own training on top of them, and that process is called fine tuning. And then you can use it to get like more specific outputs for like your industry or your use case.
When you fine tune, you do two things. One, if you're using some third party service, you're sending that service your sensitive company information. That alone can be problematic.
It depends what contracts and regulations are in place between those two entities though. But then on the other end, when you go train that data, when you go fine tune this LLM and then you start asking it questions and it giving you outputs, it might emit unmodified data based on what you sent it.
So in IoT, let's say it's a medical device, and so the data it emits is covered under very, frankly, pretty serious government regulations and privacy regulations. You wouldn't want to just feed medical device data into one of these large language models.
It might go and emit something it should not on the other end. And then people will still see things they shouldn't and there can be violations and fines and even jail time in some cases. So that I think is probably a good example. Does that kind of cover what you were getting at?
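One common mitigation, in the spirit of what Adam describes but not a description of any specific product or API, is to de-identify records before they ever reach a fine-tuning pipeline. A minimal, purely illustrative Python sketch with made-up fields and patterns:

```python
import re

# Hypothetical raw record that should never reach a third-party fine-tuning service as-is.
record = {
    "patient_name": "Jane Doe",
    "mrn": "MRN-0042917",
    "device_reading": "SpO2 94%, pulse 88 bpm",
    "note": "Patient Jane Doe reported dizziness; contact 555-867-5309.",
}

# Toy redaction rules; a real pipeline would use NER or a dedicated de-identification tool.
REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\bMRN-\d+\b"), "[MRN]"),
    (re.compile(r"\bJane Doe\b"), "[NAME]"),
]

def deidentify(text: str) -> str:
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

# Only this de-identified version would be considered for fine-tuning.
safe_record = {k: deidentify(v) if isinstance(v, str) else v for k, v in record.items()}
print(safe_record)
```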
- [Ryan] Yeah. Yeah. No, I appreciate you breaking that down. One of the last things I want to ask you before we wrap up here is around bringing generative AI into solutions for people that are listening to this because a lot of times we're introduced to it as a tool to use, but not necessarily as a tool to integrate into something like an IoT solution or a solution with data that is something that is more closely connected to what they do directly.
How do you see generative AI folding into different solutions or how do you envision that happening or what is happening now from your perspective on that side of things?
- [Adam] I'll tell you what I've seen so far. What I've seen so far in talking with our customers and with others is there is a bit of a reluctance and uncertainty right now on how to bring large language models into customer facing experiences if you plan on fine tuning these models first, which for a lot of companies is a requirement.
And it's because of that privacy piece. Imagine the situation where you want to train a chat widget for your health tech company so that your customers can ask questions as a front line before they get to a nurse. Hey, I have these symptoms, what do you think I should do?
You want to go then train that on like medical records. That's how you would make it better. Medical records and their outcomes, obviously a huge privacy problem if you do that. So I'm seeing right now the first steps aren't even incorporating it. It's more like how do we even start using these things in a safe way is what I'm seeing at the moment. And of course the answer depends on like the stage of the company and where they're at. Like small companies, startups, folks in industries that don't have highly sensitive data, this answer probably doesn't apply to them.
But that's typically not who I talk to, right? I talk to folks in tightly regulated industries, large companies, sensitive data, and those are the concerns they have. It's how can we take advantage of these things while preserving customer privacy?
And it's a big question, but it's actually one that we at Tonic are actively working on. And we've already made some great inroads and are already beginning to work with customers on solving this for them so they can unblock the power of these large language models.
- [Ryan] Appreciate you diving into that. That's a- it's a very interesting topic. For our audience out there who wants to learn more about what you all have going on at Tonic and maybe follow up with any questions, anything like that, what's the best way they can do that?
- [Adam] Thank you for asking that. So tonic.ai. t o n i c.ai is our website. You can also reach out to [email protected] if you have any questions. On the website there's various forms where you can reach out, say hello. You can also go over, on the website, you can create an account and begin using the tool for a free trial as well.
And we're also on Twitter. I could pull up the Twitter alias, but if you just go to Twitter and type in Tonic AI, we'll come right up.
- [Ryan] Well, Adam, thank you so much for taking the time, man. I really appreciate it. Great conversation. A lot of topics we haven't talked about pretty much ever. So I really appreciate you coming in and sharing your expertise on the data side of things. A lot of exciting stuff going on over there so appreciate it and excited to get this out to our audience.
- [Adam] I'm excited too. Thanks for- these were great questions. This was a fun convo. I appreciate it.