Mirek and AB discuss the rise of fast large language models and their impact on various industries and workflows. We cover the recent announcements and capabilities of models like Meta’s Llama 3, Google’s Gemini, OpenAI’s GPT-4o, and Microsoft’s Copilot+. We explore the models’ speed, multimodal capabilities, context windows, and conversational interfaces – and discuss the implications of these advancements, such as improved human-AI interaction, potential privacy concerns, and the challenges of integrating AI into existing workflows. And because we can’t help it, we speculate on Apple’s upcoming AI developments and the future of AI assistants in coding and robotics.
We’re also publishing this episode on YouTube, if you’d like to watch along in full living colour: https://youtu.be/ROBjyhYZKgs
Chapters
01:45 – Meta’s Llama 3 and Google’s Gemini
We begin the episode with Meta’s Llama 3 and Google’s Gemini models. Llama 3 is praised for its speed, with responses generated almost instantly, making it difficult for users to keep up. Gemini, on the other hand, is highlighted for its multimodal capabilities, allowing it to identify objects in live video and engage in conversations about them. We also discuss Gemini’s large context window, which lets it build conceptual knowledge during conversations.
10:08 – OpenAI’s GPT-4o and Microsoft’s Copilot+
We analyze OpenAI’s GPT-4o and Microsoft’s Copilot+ offerings. GPT-4o is praised for its conversational interface, including synthesized voices with human-like qualities. However, concerns are raised about the potential privacy implications of Microsoft’s Copilot+ PCs, which can record users’ screens to provide context-aware assistance. The discussion also touches on the challenges of integrating these models into existing workflows and the potential for AI hallucinations or inaccuracies.
30:22 – Apple’s Potential AI Developments
We speculate on Apple’s upcoming AI developments, given the company’s control over both its hardware and software stacks. The anticipation is that Apple may announce tighter AI integration across its devices and operating systems at the upcoming Worldwide Developers Conference (WWDC). The discussion also touches on the potential for Apple to improve its virtual assistant, Siri, and integrate AI capabilities more deeply into its products.
52:30 – AI in Coding and Robotics
Finally, we shift to the potential impact of large language models on coding and robotics workflows by exploring the benefits of using AI assistants for tasks like documentation, code generation, and problem-solving. However, we also acknowledge the limitations of these models and the need for human expertise in integrating and understanding complex systems.
Transcript and Links
AB
Well, G’day, welcome to SPAITIAL. This is Episode 19, coming to you in the last week of May after a gap of one week. So it’s been a fortnight since we had Mirek, who’s here online. G’day, Mirek.
Mirek
Hello, how’s everybody doing?
AB
Good, good. Yeah, we had a great reaction to your All About ROS episode, which has been good. A lot of podcast downloads. The YouTube channel is starting to bite a little bit, which is really good. We are definitely still in the playpen of low numbers.
So, hey, if you haven’t checked out the YouTube series, it is what you are listening to now, but you can watch it. So tell you what, we’ll wave – something that you can’t see on the podcast. There you go.
You also can’t see Hubert in the background doing laps. He’s the turtle. And you can’t tell what Mirek is drinking at this moment. Look, welcome. This is almost a “news of the month” kind of thing.
AB
There has absolutely been a trend that we’ve been watching. In fact, about three weeks ago, we were about to do an episode on this topic, but we got pulled away. And in the last three, four weeks, this topic has just blown up.
What is it? Well, you probably read the title. This is the rise of really fast, large language models or really fast models. A question we’ll ask sometime in this episode is, are they just language models?
Are they also spatially-aware general world models? Casting the clock back about four weeks, I’m actually looking at Meta’s announcement. So Meta, i.e. Facebook, in late April, April 18th, announced the release of Llama 3.
[https://ai.meta.com/blog/meta-llama-3/]
Llama 3 is an open source language model – and language only. It doesn’t do anything multimodal: no images, no uploads, simply text. But the big thing was that it was fast. If you’d played with any of the GPTs before, you were used to typing in a request and then that tiny pause, the pregnant pause, where it goes and has a bit of a think and comes back with a wall of text, and often you can scroll and just about catch up with it as it appears.
Well, literally four weeks ago with Llama 3, you couldn’t do either of those things. The moment you hit enter, you pretty much got the wall of text, and you couldn’t scroll fast enough to keep up with it.
What that meant was it suddenly opened up doors. Previously you could normally outthink what the chat modes of a large language model were doing; Llama 3 was really the first one that made it quite astounding: the moment you asked it a large, deep question, bang, there’s the answer.
Mirek, had you tried Llama 3 at all? It is available on a website called Groq.com, which is easy to get to and nice to use, and it’s also hosted elsewhere.
Mirek 03:12
I don’t think I did. I just follow the news and, you know, I’m playing with OpenAI’s latest and greatest every now and then. But what’s happening, is it just faster, or is it better? There was a bunch of news from Google also about their new Gemini, and I remember the previous Gemini wasn’t this glamorous, right?
There were some funny moments there.
AB
Yes, it was hallucinating a little bit, and while it wasn’t quite taken off the board, it was certainly put up in great lights and then maybe deprecated slightly. Yeah, look, next on the list is Gemini, and I’m glad to hear how it’s pronounced: like the old NASA rocket platform, it has multiple ways of saying it.
But Google came out with that at just the same time as OpenAI. We’ll talk about Google first, it sort of makes more sense. The number one, most amazing thing was that it was fast and multimodal. The demos they were showing were not only text-based long queries and bang, a wall of text, which was brilliant, but grabbing the live camera and actually walking around the offices looking at things, sort of doing what YOLO would do, you only look once, which ironically just got a new version yesterday, I think, called version 10, which is faster. So that’s the thread of this episode.
But the Gemini model was able to take live video and not just identify objects where needed, but have conversations about them: pointing at a pot plant while you walk around an office and asking, what is it?
[https://gemini.google.com/app]
The fun one was looking at someone’s computer screen at a wall of code and asking it to have a look, decipher and explain that code, and it was happening fast. Oh my God, and we’re going to have to pull out the phrase again, it’s one of my favourite phrases to drop in a meeting and basically scare people.
It’s nice when you can paraphrase Stalin. I’m just going to look at your eyes now. No, okay, not scary. His quote was, quantity has a quality all of its own, talking about tanks and soldiers in World War Two. But to badly paraphrase it: does speed have a quality all of its own?
Does going fast get you out of trouble so much that it can do more things than if you were doing something well but slower?
Mirek 05:43
It’s hard to benchmark these models, right, or compare them with one another. So it’s really not just the speed, it’s what you’re getting, how useful it is. And you mentioning that multimodality that Google demonstrated, I think there’s something different happening there than YOLO. YOLO is a model that you train to recognize classes of objects, so fundamentally it can only recognize what you trained it on. Unless that is what’s happening in those demos, which is possible, you could pre-train models to recognize a certain person, there’s probably something else going on when you’re able to detect from the image more information than just bounding boxes and respond to it. I hope, yeah.
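To make the distinction Mirek is drawing concrete, here is a minimal sketch contrasting a fixed-class detector with an open-ended multimodal query. It assumes the ultralytics package for YOLO and OpenAI’s vision-capable chat API; the image file name and the question are purely illustrative.

```python
# Sketch: fixed-class detector vs. open-ended multimodal query.
# Assumes `pip install ultralytics openai`; the image file and prompt are illustrative.
import base64
from ultralytics import YOLO
from openai import OpenAI

# 1) A classic detector: it can only label the classes it was trained on.
detector = YOLO("yolov8n.pt")            # pretrained on the 80 COCO classes
results = detector("office_frame.jpg")   # returns boxes plus class IDs, nothing more
for box in results[0].boxes:
    print(detector.names[int(box.cls)], box.xyxy.tolist())

# 2) A vision-language model: ask an open question about the same frame.
client = OpenAI()
with open("office_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the plant on the desk, and does it look healthy?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(reply.choices[0].message.content)
```

The first half can only ever answer with a label from its training set; the second can be asked anything about the frame, which is closer to what the Gemini demo appeared to show.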
AB
Yeah, that’s what the Gemini demo showed: at least it can do live video object identification, which is not new. The amazing thing was really how, panning around an office, it would just find things.
Sorry, it wouldn’t highlight everything, because that would make your screen quite cluttered. But if you pointed it at something and asked a question in context, it kind of figured out what you were asking about and could have a serious conversation.
The reason I’m asking about that bad Stalin quote is this: in human interface terms, HCI, or the human interface guidelines, the HIG, isn’t it, if something has a wait built in, then humans can also react to and run with those pauses.
But if something is instant or near instant, that crosses the gap and changes the mode of interfacing to being conversational, and, I hate to say it, a little bit like the humanistic chatbot that we all think they are.
Mirek 07:47
Well, it’s definitely interesting for, like, you know, human interaction and getting as much as possible done in a little time. At Google, the Gemini announcement was also about something else: they have this very large context window now, I think they said a million tokens, or ten million.
Yeah, you can correct me. So that’s, I think, more interesting than the sheer response time, because with a large context window like that, it can remember things and sort of build up context on top of the training during the conversation you have with the human.
And that can be much more interesting, I believe, than just pure speed. But I really can’t tell the difference from, you know, those benchmarks; there are people who are doing great work actually doing all that and comparing what they’re getting from these models.
I can’t do that. I have other things to do.
AB 08:49
There are two more models to talk about, and then I do have a tab, a resource, there’ll be a link in the show notes, to a lovely, you know, one of those population graphs that moves over time, but this one has to do with chat responses.
So it’s using Elo scores of the major models, essentially how humanistic and responsive they are. We’ll tackle that at the end, you know, trying to look at the overall comparison between the models over time.
But hot on the heels of Google, of course, came OpenAI’s GPT-4o, which of course inspired the title of this episode, Oh My, for a few reasons. The capabilities were, you know, quite similar to Gemini: multimodal, images, video, being able to pan around. But the real killer feature, and all the last week of news, which we may cover later on, regards the live chat with the synthesised voices that had, well, ums, ahs and giggles, little bits of humanistic input that make it more than just the familiar Siri and Alexa.
Mirek
Yeah, some sound suspiciously close to certain humans that we know, right? We’ll go straight there, yes. Humans we can recognize by their voice.
AB
Oh, yes, Sky. I must say I was able to use it for the first few days, and my wife was able to use it while we were out and about, but it has vanished, and there is some good drama and some good back and forth over that. Let’s leave Scarlett for a second. But by all means, it is an absolutely new way of working. If you can text quickly, or use the microphone to do your own dictation into a text field, that’s fine, that speeds things up a lot. But OpenAI’s mode of just having an ongoing, open-microphone conversation with your AI, that had some character, I’ve got to say. In actual fact, I would have loved to have seen, like we’ve been used to on models, buttons that give you a heat control: if it’s hot, it’s a bit noisy and hallucinating; if it’s cold, it’s a bit mechanical and a bit straightforward. We’d love to see a heat sort of dial on 4o. Essentially, while it’s lovely to have your AI sort of, yes, well, okay, put in those bits and bobs to make it sound human, it gets a little bit tiring after a while. I’d almost like to take the gaps out of some of that and have an option for when I’m in a hurry: I just want the answer. I don’t need the icing on the cake.
I just need the cake. Did you get a chance to try the chat mode of GPT-4o?
Mirek 11:48
I didn’t spend much time with it, but I believe you could try it for a while. I remember my girlfriend having one of the GPT versions draft something formal for her, and then it was too pushy, too salesy.
So she just said, less hyperbole. It’s as simple as that. And it produced something that was like human language, you know, not too pushy. So I think you can already customize the output a little bit like that.
That should probably work with synthesized voices also.
AB
Nice. I didn’t do my free prompt engineering to say, I want this voice, but I also want it terse, with less of the stumbles and giggles and stuff like that.
Mirek
But the context window, how would that tie in with that customization? Because I believe what you get with every new prompt in GPT these days, on OpenAI’s website, is a new instance that doesn’t know anything about anything else. So for the customization, you can just inject a prompt right before everything. But that only gets you so far, yeah, in the way of making the model, in general, behave in terms of what you actually want from it and how you want to get it.
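For what it’s worth, here is a minimal sketch of that kind of prompt injection, using OpenAI’s chat completions API; the style instructions, model name and temperature value are illustrative assumptions, not what ChatGPT itself does under the hood.

```python
# Sketch: customizing tone by injecting a system prompt ahead of every request.
# Assumes `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM_STYLE = (
    "Be terse. No hyperbole, no sales language, no filler words or giggles. "
    "Answer in plain human language."
)

def ask(question: str) -> str:
    # Each call is effectively a fresh instance, so the style prompt
    # has to ride along with every request.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,  # lower is more mechanical, higher is noisier (AB's 'heat dial')
        messages=[
            {"role": "system", "content": SYSTEM_STYLE},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Draft a two-sentence reply accepting the meeting invite."))
```

Presumably something similar sits behind the voice mode’s personality, which is why, as Mirek says, it only gets you so far.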
AB
Yeah, and I’ve got to say that live chat interface was pretty phenomenal. It was kind of the interface that I think people had been forecasting for a very long time, and again, with that bad quote from Stalin, it does change the game a little bit to have something that responsive. It lets you stay in the flow as the human, as opposed to, I am driving a cursor and I’m waiting for the screen to scroll.
In this case having a conversation with a chatbot, an agent is enlightening. It starts a new way of working. It means that you can on the fly refine it. I’ve just been loading up a few tabs here which might be on the show notes and in the video for some same queries across all these four different large models that we’re talking about today.
I’ve got to say copying and pasting and typing was probably the hardest bit. I did the same typo on two out of four frames. So that’s my fat fingers or my muscle memory not working well. If I could have just said oi try this spoken out my prompt across all four models that would have been brilliant.
I think the new tools are going to be a good microphone, a good set of headphones, a quiet room and large screens. I’m not saying keyboard and mouse are going away, but this actually does start that extended conversational mode. Up until now, there’s always been a microphone button on almost every text input option, or every keyboard has got a microphone option so you can dictate, but this changes from dictation to conversation. Yeah, pretty wild.
I must say, I should have…
Mirek 14:59
I mean, yeah, I see it coming. I don’t think we’re there quite yet, and I think there’s maybe not all sorts of work that you want to apply this to. I was just listening to a podcast conversation with a neuroscientist, his name is David Eagleman, and he wrote a book with his friend, who’s a musician, on, you know, how all these things come together: artificial intelligence, human intelligence, how we’re creative and how that happens. So they argue, and I agree, that these AIs are already very creative. But think about what you can do when, say, you’re writing a piece of music.
You work on the whole piece all the time, you come back and make changes, you see the whole thing in its entirety, and you figure out what doesn’t work so well. You do this in software also, and in other areas of intellectual work. We don’t work like that with AIs. We just feed it something and we expect the perfect result. But evaluating that and, you know, making fundamental changes…
Yeah, or little changes as we go, little tweaks, is really difficult. It’s like a one-shot as opposed to a week-long endeavour. Exactly, exactly. It’s this one-shot thing, kind of like, give me a good brief for a graphic designer about the logo for the company, something you might otherwise spend a long time refining, and what comes back is averaged, because of this whole data compression it is doing.
And I mean, we’ll see how far we can get with these kinds of systems. But it’s interesting; it even shines light on various things that we do and don’t think about that much, you know, things about how we learn and how we create, and what that actually is.
Anyway, I’m getting too philosophical. This is what I see. I see what you mean, but at the same time I don’t think it’s going to replace intellectual work just yet.
AB
Gotcha. So at this stage, this is the version of: the machine can’t sleep on it, or come back with a good idea on day three or four.
Mirek
Also, the reality will be quite different from what you see in these tech demos. I wanted to circle back to this Rabbit R1, the orange gadget that we mentioned on this podcast. I was just watching this YouTube video by a pretty big channel called Coffeezilla.
We can link that in the description. He looks into that product, which seems to do none of what it advertises. They based all these promises on how it can do everything for you: you just talk to it and it will order your food, order your rides, handle your day-to-day mundane things.
It does none of that. They based it all on saying they’re developing this large action model, which is supposed to be the new smart thing that makes the Rabbit click. That apparently doesn’t exist.
What it is, in fact, is a GPT that’s instructed not to tell you it is a GPT, plus some UI scripting on various websites that orders your food and does other things. And that means that when the interface of a website changes, that whole line of functionality stops working.
So this guy on YouTube tries everything that was advertised in the promotional videos, and none of it works. It can’t even tell its position from GPS correctly. So it’s concerning at that level.
Mirek 19:00
So don’t spend your money on that. I think we should not advise buying it. It is entertaining to watch, so we can link the video and let people decide for themselves. But even the way the company reacts to the criticism is a huge red flag.
They pretty much gaslight the guy and say that he’s not an AI expert and that they will only take criticism from AI experts, with a certificate or something like that, they say. So that’s… it’s very entertaining, but don’t spend your money on it.
AB
Makes sense. The Rabbit has been put in the same bucket by some large reviewers as the Humane Pin. I’ve got to say I love the concept of both of them. The Humane Pin, I think the news of the last few days is that they’re advertising that they’re happy to be bought out, looking for a buyer, looking to get absorbed into one of the big four.
It kind of alludes to a conversation about these large language models. We’ve had Meta, Google, OpenAI, and the last one is Microsoft, and then the hanging asterisk in the air is Apple. Those five are the largest companies out there by market share, by thought share.
How possibly, and I’m going to ask this rhetorically, but please have a think about it, can a company that’s coming in at one hundredth of the size have both a product and a software focus, a hardware and software interface?
There’s almost no doubt that it’s going to be impossible for the small fry to break into these large markets now. At the dawn of the web, 25, 30 years ago, everyone could do their own thing, and some amazing one-person, two-person, five-person bands did epic work and became the big players. But now we’ve got to the point where the big players are sadly probably 89% of the market.
It’s going to be almost impossible to break in.
Mirek
I think there’s still room for lots of creativity, and we’re just discovering what these models can do, and I wouldn’t underestimate the creativity of startups. But you have to put some actual work into it to create products that, you know, do something interesting. This seems too good to be true, you know, many of these products that will apparently magically do everything for you. It just doesn’t seem like we’re there quite yet, and it seems like, yeah, products actually over-promise and under-deliver quite often.
AB
No, it feels like it’s a bit too easy to do epic marketing, nice PR.
Mirek
I don’t want to sound just negative. I started noticing language models for software packages, like, I think, 3JS, which I’m using for something: they have documentation, and next to it is a language model you can talk to about the documentation and examples, and I think this is enormously useful.
But it doesn’t do everything for me, right? There’s still work that remains to be done, and things need to be… I think there’s a lot of work that remains to be done in integrating all these services together into products that actually make sense. It’s not just the language model, there’s so much to be done, and if you can sort out a different kind of area of problems or pain for the users and just combine that with this, there’s a product right there.
AB 22:46
Yeah, so having a large document library like 3JS’s, with all the comments there, as a searchable, talkable documentation index is great, but that’s then a silo, a pillar, which is itself smart but separate from the others, surely.
So there needs to come a time when either there is a standard interface between silos, or something that can go across all of the documentation and do topic-specific querying. Right now we probably couldn’t ask any one of these large models, okay, within the bounds of just 3JS, can you answer this question?
They are general models. They’ve either read half the internet or the whole internet, or not; they either know what’s behind a closed door or they don’t. But you also can’t say, just for the next five minutes, I want you to focus on this code base and only answer within that sphere. They don’t have that level of filtering just yet. I think they would hallucinate their way out of jail pretty quickly and start to answer things in a general sense far too fast.
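One way people approximate that kind of scoping today is retrieval: embed the documentation, pull back only the relevant chunks, and tell the model to answer from those chunks alone. Here is a minimal sketch under those assumptions; the embedding model, the chunking and the “answer only from the docs” instruction are illustrative, and this is not necessarily how the hosted documentation bots work.

```python
# Sketch: scope a general model to one documentation set via retrieval.
# Assumes `pip install openai numpy`; the doc chunks below stand in for real three.js docs.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    out = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in out.data])

doc_chunks = [
    "BufferGeometry stores vertex data in typed arrays for efficient rendering.",
    "OrbitControls lets the camera orbit around a target point.",
    # ... the rest of the documentation, chunked ...
]
doc_vectors = embed(doc_chunks)

def ask_docs(question: str, k: int = 2) -> str:
    q_vec = embed([question])[0]
    # Cosine similarity picks the k most relevant chunks.
    sims = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n".join(doc_chunks[i] for i in np.argsort(sims)[-k:])
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided documentation. "
                                          "If the answer is not there, say you don't know."},
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content

print(ask_docs("How do I orbit the camera around an object?"))
```

It is a workaround rather than true filtering, which is AB’s point: the general model can still wander outside the fence if the instruction loses out to its training.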
Mirek
Yeah, that would be fantastic to have, but I think that’s fundamentally not how these models work. Maybe with that, I’m really interested in what’s going to happen with this larger context window, because I think that kind of allows you to build up more conceptual knowledge, you know, around that area or task or subject matter during the conversation.
AB 24:24
Great segue, because the fourth of the large models announced in the last three or four weeks was Microsoft’s, with Copilot+. It has an asterisk here as well, because Copilot’s been around for a little bit.
[https://www.microsoft.com/en-au/microsoft-365/business/copilot-for-microsoft-365]
There are two things they announced, or two sides of the world that just climbed into the middle. I’m looking down. Yes, I’m on a PC. Sorry to all my Mac friends around the world. I now have a Copilot button in the bottom right corner of my screen, which is useful and okay.
It’s not fast like the others. It does have that big pregnant pause, the I’m thinking, I’m thinking, and you can scroll faster than the output comes. The announcement from Microsoft is that their next round of hardware-slash-software tools are going to be Copilot+ PCs.
That is, having Copilot more directly tied into the operating system experience. The first bullet point that everyone just opened their eyes at, and had half a fit about, is that it’s able to almost constantly record your screen to have that context of what you’re doing and what you’re working on.
If you were to type, or grab the microphone and say, hey, I’m having problems with what I’m doing now, it already knows what you’re doing, with context in both senses, the context window and context in the everyday sense, and it can assist you with what you’re working on.
“Hey, I’m having problems with this document. Can you reach in and help me?” It already has knowledge based not on screen scraping, but on a direct connection to the GPU and the drawing of the screen, so it knows what that is.
That makes sense in that regard, rather than a bolt-on over the top, pointing a video camera at your screen. It makes sense to go right to the ones and zeros before they become pixels and have that knowledge.
But that Copilot+ PC has been both welcomed and divisive. I think everyone with a privacy bent has done a double take or a triple take. It’s been quite an astounding revelation, and it’s probably something we’re going to be well used to in two years, but announcing it right now has definitely hit a few people pretty hard.
Thoughts on that? If an app, natively in the operating system, was able to assist you with your tasks, whatever they were, without having to interpret pixels, but knowing before it’s even drawn on screen?
Mirek 26:55
Yeah, it definitely makes sense technically to integrate something like that as close to the metal as possible, right? But I agree with the concerns that you’re raising, because, yes, it’s a Big Brother that you’re letting into your computer and giving all the keys. Yes, I mean…
AB
Yeah, almost certainly, with the way that these large language models are working, it’s not using your pixels to train on. It’s already been trained. It’s just using them to interpret that request at that second. But at some point in time, it’s going to assimilate your culture into its own, as the book would say.
Mirek
There will be concerns. At the same time, Microsoft is really going all in on AI, on all fronts, and I can’t even name all the companies they’ve invested in. It’s not just OpenAI; they’re even investing in OpenAI’s competition and just covering as much of the market as possible.
And that tight integration with user products is, I believe, where the value is. Yeah. So, yeah, it’s kind of concerning and interesting to watch. You want the computers to be smarter and faster and do everything for you, but at the same time…
You don’t want that.
AB
Look, your segues are working really well – I don’t know whether you knew where we were going to be heading. Of the four models talked about, again: Facebook’s Llama 3, Google’s Gemini, OpenAI’s and Microsoft’s. But the gorilla in the room (the elephant in the corner?) is Apple.
You know, the pundits say they are lagging behind, but their WWDC is happening in the next couple of weeks, and I would probably bet the farm that they could have… well, let me put it this way. Apple is the only company that has both the hardware and software stack completely under their control. Every Apple phone and Apple device, well, at least in the M series, has just as many neural processing units, basically AI chips, as it does graphics or CPU cores. There’s a lot that happens when you take a photo on your iPhone that is more AI than anything: if it’s a picture of the sky, it’ll probably make it a bit clearer and a bit bluer; if it’s someone’s face, they’ll probably try to figure out, let’s not make them look like they were pale and about to keel over. There’s a lot of computing resource tied directly to the operating system in that whole platform.
And ChatGPT, well, at the same time that they announced 4 Omni, OpenAI also put out a first native-ish app for macOS. Kind of neat, kind of weird that that was the case: you can access it perfectly fine in a browser or relatively natively on a device, so to have it as a first-party application is a sign. Yeah, I’d bet the farm that next month at the worldwide Apple conference we’ll probably see some tighter AI integration everywhere. Let’s face it, Siri has… (the devices didn’t go off, excellent).
Mirek 30:22
Siri has been able to set my timer almost perfectly for years now. Yep. We’re thrilled to see what’s next.
AB
There it goes, one, two, three devices just woke up. It’s okay, Siri, you can stay.
Mirek
That’s pretty much all I was ever able to use Siri for. So if they just make this work as they advertised years ago, that would be fantastic.
AB
Well, that actually is probably one of the fundamental human interface hassles of the time Siri has been out, and I’d almost say it’s been seven years that Siri has been around. The problem with an interfaceless interface is that for all of us mere mortals who are used to screens full of icons, if we don’t recall the keyboard combination or what the icon looks like, hunting the menus is a thing, and within half a second you’ve found what you’re looking for.
It’s not too hard; “it’s there somewhere” is normally the answer when you have a rich interface. When you have an interfaceless interface, it’s very binary: either it does a brilliant job and you love it, or it sucks.
The early days of Siri were leaning more to the second than the first. If you asked it how to get from A to B, or asked a question it literally couldn’t answer, I think in the beginning it couldn’t even go out to the web and find a half-assed answer for you.
It would simply say, I can’t answer that. You could do things like control your timer, set alarms, integration with maps, a few core phone apps, nothing more. In the last few years though, and this is the peril of an interfaceless interface and a lesson now for ChatGPT-4 Omni with its chat interface, there’s that balance between an AI hallucinating and telling me why the sky’s purple in a convincing 12,000-page essay, great, but you’re wrong, versus Siri and other agents being helpful and able to solve more problems, rather than solving just the right number of problems and then quietly giving up on a common query they really should be able to answer.
Have you used Siri for more stuff in recent times? Or, more to the point…
Mirek 32:38
I would love to, but it simply never worked. And I don’t think it’s just my funky accent; it’s like that for everybody. So, a timer or a wake-up alarm that I barely use, and that’s it. You know, you can’t ask it what song is playing when you’re driving, because it just pops up a Google search.
AB
My number one use of Siri, I must say, is around the family dinner table, when someone in my family says, oh yeah, I must remember to do that, and would invariably forget. So we go, grab your phone: hey doofus, remind me tomorrow to do the thing.
And more often than not, that’s perfect. That is an example of the human being kind of trained to use keywords, get the order vaguely right and, you know, do commands.
Mirek
At the same time, the human is trained to not trust Siri, because it just doesn’t do anything else well. So I can’t rely on it putting anything in my calendar if I just tell it and don’t check on it.
And if I have to go and check whether it’s really there, well, I could have just done it myself. You could have done it yourself, yeah, yeah. It’s really that you figure out what kinds of tasks you can use it for, and then you use it just for those, because you know that nothing else works.
And it’s really hard to convince you that, oh, we updated it and now it’s fine, it can do all these other things. That’s it.
AB
There is no manual for it, and if five years later they release a slew of features or a slew of APIs, you know, to check the baseball scores in your region, unless you are made aware of that, then if you’ve tried twice to ask that question and got stupid answers, you will not try a third time. Even if they sell you later on that the feature has been put in, the chance of you using it is low. So that is the classic example of building up trust over a period of years with an interfaceless, no-button, no-icon interface that you either trust or use only for small things. My question to you is, comparing that to OpenAI’s chat interface with audio, with, yes, not Scarlett Johansson, but Sky and the other four voices, now that one’s being removed: what would you prefer?
Would you prefer something that is conversational and, ironically, might make stuff up over time, or would you prefer something that’s a bit more mechanical and concrete, but where you have to learn how to trust it, how to trust certain features? Where’s your level of preference there?
Mirek 35:09
I think it really depends on what you’re doing, right? First, it’s a personal preference, but then sometimes you want the hallucination, sometimes you want some inspiration for something creative. And I think that’s what this works really well for, although, you know, it’s not the most creative; it’s kind of mediocre creative.
Oh, gee, wow, you’re in trouble. But it’s there. So one task is to do that; the other is, you know, a yes/no answer to something, and I don’t want to read five paragraphs of fluff to get to yes or no. That would be fine.
But it really depends on the situation, I guess.
AB
I’ll put forward two scenarios that our family has used ChatGPT for in the last couple of weeks, good use cases that we wouldn’t have thought of. My daughter’s doing university, yeah, okay, astrophysics, so not exactly rocket science, but the next thing over.
She was, now I’ll say this carefully, she was having to do an essay on a really deep gnarly topic, all good. She couldn’t get started because the one phrase that she had in her mind of the topic that she was trying to do was the wrong phrase.
It was the one that she had in her mind of what she was trying to focus on. And guess what? There were no references in scientific papers for it. It was just the wrong word. And Google could not help when she searched for that phrase.
She found millions of answers, but none with that context. Well, guess what? ChatGPT was able to assist by basically being the world’s biggest thesaurus for synonyms, basically putting in that key phrase saying, if I’ve got this term, what else should I be looking for?
And bang, it came out with a long bullet-point word list of, if you meant this, here are other topics that are highly related to it. And suddenly that opened the door for her: okay, I had my keywords wrong, now the doors are open, and I could find everything I need to research my essay. That was really good, because all of the online tools were being quite mechanical, search phrase equals this, with no creativity and little context beyond, did you mean this?
AB 37:23
So that was a win.
The other one was showing my wife, yes, ChatGPT-4o and having a conversation about geology. Now, we know with ChatGPT three and four plain, you could have GPTs, basically other GPT instances that people had made where they’d done the pre-prompt engineering to say, you’re a geologist, answer this in either a lay person’s terms or a professor of a certain educational level to make it work.
But they’re basically prefaced: most questions need to be answered with this context in mind. Those were okay, not exactly awesome. They were fair, but they didn’t really have a free range of options.
They basically said, here are five texts, five corpuses, use that as your foundation. 4o was able to have a conversation with my wife about not only the ground underneath which she was standing, but also, and here’s the spatial part of it:
If I’m traveling from A to B, what shall I be looking for while I’m on this route? And it did a pretty bang up job to actually say, well, as you transfer from A through B, make sure you’re watching out for this kind of feature in this first half.
And if you look across here, you’ll find this. It didn’t say left or right; it didn’t know which way she was facing. But it had that context, both of the topic, an arcane, deep topic, and in terms of relevance of place.
That was quite phenomenal. We were tourists at that point in time; we didn’t have our encyclopedia or our geologist’s handbook to Estonia and Finland. There you go, that drops where we were last week for you.
AB 39:18
So we couldn’t fact-check it. But tell you what, we turned over every rock that we saw, and my wife was able to validate most of what it was talking through. So pretty phenomenal. Obviously, this is a tool that has read the internet; we all kind of know that.
So the body of knowledge is huge. But to be able to have a deep, dark conversation on a pretty specific topic, spatially aware: I’m here, I’m going there, tell me everything I need to know as I travel.
Yeah, I must say, we were reasonably floored. And then, the icing on the cake: to have that in the Scarlett Johansson voice. It was a delight to see and to hear, but it was definitely a little bit freaky.
The controversy there has only escalated in the last few days. Apparently, and this is not rocket science now, OpenAI did not use her voice. They did hire an actress who is still remaining unnamed.
That’s all fine. But, without fail, it’s eerily similar, and I think everybody in the world kind of thinks that voice number one sounds a lot like a certain person.
Mirek
Well, it doesn’t matter if it’s not, if it’s meant to be replicating that person. It makes for some weird times. It’s only going to get funnier, yeah, and weirder.
AB
It is, it is. I know the Siri lady, I believe, is from Australia, and she’s been having an awesome time over the last many years extending her vocal repertoire to in-person events and the rest of it. I believe she spent many, many sessions recording all sorts of things.
It wasn’t with Apple at that point; it was with a third party she was hired to be the voice for. Obviously it went further. I dare say she’s been paid quite well, but that voice is certainly pretty famous now.
I must ask, what voice do you use on your devices, your Apple devices? Do you use the default Siri, by which I mean the female voice, or do you use…
Mirek
I think I do. I think I use the British Siri. Yeah, I know it.
AB
I didn’t realise it was such a regional thing, but it does have a psychology to it: do you want your assistant to be something that you’re used to, or something that you’re not used to? It’s the same with interfaceless interfaces, it’s almost hard to figure out what other voices there even are.
Once you start with one, I dare say if they add new voices 9, 10 and 11, you probably won’t ever realise. So for me, it’s just the default, and it still sits well. I dare say OpenAI’s Sky won’t make a comeback anytime soon.
I know I might get fried for this, but I really can’t handle a Southern USA accent. For me, that involves a little too much mental processing; I’ve got to really listen carefully. It’d be like having someone as my chat AI who was quoting Shakespeare, in a really high Shakespearean voice.
Love it in context, great, but it takes a few too many brain cells to parse what was said before I can fluidly run with it. It takes me out of my zone a little bit. You’re currently out of Vancouver; a nice Canadian voice might go down well.
How are you fitting in amongst the natives?
Mirek 43:23
Oh, yeah, that was good. Oh, yeah, it’s good.
AB
Sorry, everyone. We’re just basically going to lose half the viewership in most nations. No, the question of which voice you want for your AI system is going to be a great conversation for the rest of this year.
I wonder if soon you’ll be able to bring your own voices. We know there are many services on the internet where you can grab small snippets of someone speaking and have that turned into a speech model. I wonder if, after a while, you can basically reference or upload a few minutes of someone talking and then use that as your own fodder.
That may get around some of the legal things, if you can install your own voices locally. I think a David Attenborough chat voice would be a bit much, though; again, it’s got to be something that’s generic enough to not take you out of the moment.
Mirek
Wow, I would use David every day.
AB
Explain my code to me “across the African savannah”. Suddenly there are a lot of famous people whose well-known voices are going to be hot on the chopping block.
Yeah. Um, I guess, a fraction more serious: would you say that these fast language models are going to change your workflow? I know you’re deep in code, you’re interfacing, you’re doing gnarly problems.
I have to ask, are your fingers and a mouse literally the tools of choice forever, or can you see a transition to talking through some of the blocks that you’re working on?
Mirek
I think I see myself using anything that helps, but at the same time a lot of what I do is integrating complicated things together, and it really helps to understand those individual bits. It’s kind of like outsourcing your work to individual engineers who do exactly what you tell them, right?
And that sometimes is not what you want. It works in teams because people have initiative and, you know, if you can afford to have experts for each individual thing that you’re dealing with or somebody to learn it, then you expect that initiative and not just, you know, to follow through what you tell them blindly.
So I use AI where it helps to sort of make sense of this or that, but then I sort of need to be able to understand all these elements and that’s kind of the job. That’s what you do to make a product or put things together in order to make something useful.
So I’m open to having something write a piece of boring code for me any time. I don’t enjoy boilerplate and, you know, starting from scratch and reinventing the wheel. So I’m all for that, but, you know, I just can’t outsource my thinking to a machine.
Yeah, yeah, completely. And I don’t want to. No, no, it makes perfect sense. It’s definitely getting useful, and I’m following it with a very, very open mind.
AB
Yeah, look, you are a classic example of being at the pointy end, you are doing something which is seriously bespoke, there’s no one else in the world doing exactly what you’re doing.
Mirek
I think you could say that about many things. You know, if you have a GPT write a book for you, then who did the writing, and where is the value? And was it an interesting read in the first place?
AB
You came up in conversation also during the week. Don’t panic. I saw a meme that made me think of you. It’s all right, you weren’t in the meme. But one summary of generative AI, large language models or multimodal models, hit home a little bit.
It was a meme on LinkedIn, so apologies. It only had two panels, same text twice. The text was: I would much rather my AI did my washing and my cooking for me, rather than my painting and my writing. Versus right now, it does my creative arts.
And it doesn’t do the damn physical thing.
Mirek
Interestingly enough, that’s where we started, right? And all the people who do the drawing and writing and storytelling are concerned and see these technologies quite differently than the tech people, which is very interesting to watch.
And for many people, there’s this class-action lawsuit, with Sarah Silverman, I believe, being part of it, or was that settled? That was a really quick kind of awakening for me, because there’s a whole bunch of other people who see this as just stolen data and scraped examples of other people’s work, and it gets you philosophical about how we learn, what’s original, what’s creative.
And I don’t think it’s all that different from how people learn. I would say that it’s not just faking it, but it produces this blend of whatever you feed it, or a replica of that. And that’s not good or bad.
That’s just what it is, I think.
AB 48:50
Well, the reason why you’re in that meme, of course, was to basically say if you could please hurry up and finish your operating system for robots so that we can control them and have them around our house, that would be fantastic.
Mirek
I’m on it, I’m on it, yes.
AB
Okay, so more coffee is required, more energy drinks are required. Look, last thoughts: I’ll leave a link in the show notes, and I’ll quickly share my screen, which is probably easiest anyway.
[https://public.flourish.studio/visualisation/18055935/]
Look, yep, this is a classic chart in the style of population or GDP over time. This one has the top-ranked large language and multimodal models by company, based on scores from Chatbot Arena. So basically, how responsive, how life-like were they?
It only covers the last year and doesn’t start at zero, it sort of starts part-way up, but I think the chart’s about to finish. Yeah, here we go. It’s basically showing some major jumps. I guess the headline is that OpenAI has been the leader for all but a couple of weeks.
Anthropic came in there for a little while early in March, it says. Google just made a massive jump with Gemini, and Meta a few weeks ago with Llama. So look, with that said, all those charts are going up.
All the major competitors for large language models are becoming more conversational, more convincing. It is, as you say, hard to measure these things. On our own website, spatial.space, we’ve got the continuing charts of LifeArchitect’s ratings of how efficient a large model is, basically its parameters versus its output. The one that’s conspicuously missing is OpenAI’s 4 Omni. We don’t have a lot of data on that, so there’s probably a big dot to go on that screen later on. It is near impossible now to assess these from any mechanical point of view.
It almost has to be felt, not tested. I know I’ve scared half of my colleagues over the last five years by saying that you can’t test AI. And I work in the field.
Mirek
I think it should be tested at the same time. I know, I know.
AB
But I work in a field where testing is a large portion of everything, and being able to sign off on something and say it hits all your requirements perfectly and passed every test is important. These are tools that you can test forever, but you can’t ever sign off and say that everything was hit 100% perfectly.
That there were no misses, that we’ve wrung every last nickel’s worth out of these things. In fact, breaking them seems to be a worldwide sport: the moment a new model comes out, how quickly can I show it doing something stupid, as opposed to using it for good?
Testing is a process, but it’s not an exhaustive process anymore. A model might hit the marks well against our known benchmarks, but, yeah, it read the entire internet; we can’t ask it enough questions to prove that it did.
AB 52:10
I’d love to know whether you are going to be using large language models for coding more. Is there a go-to for you? Are you always able to just grab onto the next thing, or have you actually tied your IDE, your code environment, to one assistant?
Mirek
I don’t know yet. I’m definitely thinking about what I can do with these models. In the last episode, I spoke about ROS, and I think I didn’t do it good service; I sounded like another depressed engineer, I think.
So to keep it on the bright side, there’s a lot of data that you get in standardized form and shape. And if you then somehow vectorize that and feed it into a model and train it to do something that you want, that’s exactly how you do that.
But you think about, I think, distinct tasks that robots perform, and not general intelligence most of the time. You think about a problem that you want the robot to solve, because that, in my mind, defines what physical capabilities it needs to have and what kind of intelligence it needs to come with.
So you don’t need your Roomba to talk back to you about Shakespeare, right? It’s not necessary. So you use these tools where you can, and there’s only so much you can actually calculate on the edge unless you use expensive hardware.
Like one might say NVIDIA enables you to do just that, but it comes at a cost. Maybe you don’t need to do all of that. Maybe you can do it somewhere else. Maybe, you know, it really depends on the use case.
So definitely I will at some point, when I get to it, but most of my work is just integrating things together and seeing where it takes me, and, you know, using whatever is there to make it faster and more useful in the end, for us.
AB
Look, I’ve got the luxury of not having a single project to focus on, so I get to really scatterbrain and try out most of the leading lights. I must say, here in my open tabs, I’m paying for OpenAI.
I have not forked out for Google, and Meta and others are sort of open and free. I’m actually interested, and we may tackle it in a future episode, in what we can do locally as opposed to using these large models.
Basically the opposite of today’s topic of, yes, these are the leading lights, cloud-based, huge. I’m actually coming down to, what can we use that’s a distilled, teacher-student local model that can still do good work for us?
That may be a great topic to find out what is faster, and we can actually train on our own data. Something that really is the opposite of what we’re talking about here. Hopefully, still fast, but likely a fraction slower.
Mirek
Well, things like, you spoke about voice synthesis, right? That’s the thing to do locally, I guess, to get around those lags, because you send something as text and then you synthesize it into, you know, human speech at pace.
So that’s definitely something that helps you speed up response times. In robotics, you want to do certain computer vision tasks as fast as possible to avoid, you know, dangerous situations.
But there’s a higher level of abstraction that you might do slower than that, maybe, where a few seconds of lag is fine. It really comes down to what you’re doing and how fast you really need the result.
Or whether doing something locally makes your experience much, much better. Say you’re waiting for a voice to be synthesized in the cloud, there’s going to be a significant lag. Yeah, I mean, you’re streaming in almost real time, but maybe it makes sense to send text and synthesize locally.
Maybe it doesn’t, you know; it comes down to…
AB 56:35
This is the AI version of on-device processing, edge processing, which of course you’re using in your robotics world big time. So you’re saying that vision needs to be closer to the robot so that it can avoid the thing quickly, yet there can be slower commands that you send back to a larger brain to do some meta tasks.
Mirek
But even with vision, it might be just some of the tasks, you know, just the time-critical things, like, I don’t want to bump into anything as I’m driving my robot around real fast, right?
So that obstacle avoidance must be as fast as possible, and most likely you want it to work even when the connection goes down. Yeah, I mean, the robot should stop and all that. But yeah, you might want some level of autonomy on the edge, with the higher-level functions, the planning and the whole…
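Here is a minimal sketch of that split, with a fast, fully local loop for the time-critical stop and a slower, best-effort call out to a larger remote model for high-level planning. The sensor read, motor command and cloud planner below are illustrative placeholders, not any particular robot's API.

```python
# Sketch: time-critical obstacle avoidance stays on the edge; high-level planning
# can be slower and remote. All three helpers are stand-ins for real robot code.
import random
import time

STOP_DISTANCE_M = 0.3      # hard safety threshold, checked every cycle
CONTROL_PERIOD_S = 0.02    # ~50 Hz local loop

def read_min_range_m() -> float:
    # Placeholder: a real robot would read a lidar or sonar here.
    return random.uniform(0.1, 2.0)

def set_velocity(linear: float) -> None:
    # Placeholder: a real robot would publish a motor command here.
    print(f"cmd_vel linear={linear:.2f}")

def request_plan_from_cloud(goal: str) -> list[str]:
    # Placeholder: a slow call to a large remote model; may lag or fail entirely.
    time.sleep(0.5)
    return [f"head towards {goal}", "take the corridor on the left"]

def control_loop(goal: str, cycles: int = 500) -> None:
    plan: list[str] = []
    last_plan_time = 0.0
    for _ in range(cycles):
        # 1) Fast, always-on, fully local: stop before hitting anything,
        #    even if the network is down.
        if read_min_range_m() < STOP_DISTANCE_M:
            set_velocity(0.0)
        else:
            set_velocity(0.5)

        # 2) Slow, best-effort, remote: refresh the high-level plan occasionally.
        if time.monotonic() - last_plan_time > 5.0:
            try:
                plan = request_plan_from_cloud(goal)
                print("plan:", plan)
            except Exception:
                pass  # keep driving safely on the old plan
            last_plan_time = time.monotonic()

        time.sleep(CONTROL_PERIOD_S)

control_loop("the charging dock")
```

The point is the shape of it: the safety check never waits on the network, while the planner is allowed to be as slow, and as large, as it needs to be.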
AB
Yeah, look, an interesting parallel there between the world of large language models and robotics. I think that’s one of the threads we’ll draw between last episode and this one: the fact that it’s an ecosystem.
These aren’t simply one-trick ponies that you either use or love, and the pace of change is not going to let up from here. We might leave it there, but I’ll tell you what, this topic hasn’t stopped at all.
As I say, the elephant in the room, the elephant in the corner, anyway, mixing metaphors horribly: Apple is yet to reveal what they’re doing in this sphere. We may well come back in two months with a précis of what they’ve been up to or what they’re planning.
In their typical fashion, they’ll probably pre-announce what they’re going to do, but they’re not actually going to put that out until later in the year. That’s the normal way. They don’t have the same competitive urgency as some of their other large players, so they may talk about it for a while and give us a chance to get used to the idea before software gets rolled into our devices close to hand.
AB 58:40
Mirek, thanks for that. I know this is an exciting time, another epoch. I think we can almost draw a line in the sand at pretty much April and May 2024. This is the time when our AIs, our language models, our multimodal models, started to go from query-and-response to, oh my God, I’m having a serious conversation.
I don’t think we’re going to go back from this date at all. I think this is going to be the way going forward. We’re probably not going to tolerate blinking cursors and “thinking…”; we’re going to have conversations and be able to iterate and just refine our ideas on the fly.
As I say, I dare say the tools of the future are going to be a good microphone and a glass of water next to us if we’re going to be talking a hell of a lot. Well, if that’s the case for microphones and glasses of water, we’re in the right spot.
Thanks for that. We’ll catch you next week. We’re back on a regular schedule. I’m back in country. We’ve got a series of interviews lined up actually starting from here, which is brilliant. We’ll reveal all as we go.
As always, show notes. This one’s available in audio format everywhere, and in video, so do check it out on YouTube. We can wave. You can see the turtle, you can see my cats, your background. Yeah, but not enough robots.
You need to up your game. That’s all right. We want them everywhere. Bonus points if you can have them running around in the room behind you while you’re talking. I guess that’s a bit noisy. Alrighty, that’s a challenge for next time.
AB
From us here, we’ll say catch you next time, and thanks for listening to SPAITIAL. But for now, bye bye.
Mirek
Bye.
HOSTS
AB – Andrew Ballard
Spatial AI Specialist at Leidos.
Robotics & AI defence research.
Creator of SPAITIAL
Mirek Burkon
CEO at Phantom Cybernetics.
Creator of Augmented Robotality AR-OS.
To absent friends.