Emerging Audio Terminals

Val Gretchev

Hi Guys

I am surprised that there are no threads discussing the “Audio Terminal” craze introduced recently by several large corporations. I am, of course, referring to:

1. Google with Google Home Speaker: https://store.google.com/product/google_home

2. Amazon Echo with Alexa: https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-Alexa-Black/dp/B00X4WHP5E

3. Apple HomePod: https://www.apple.com/homepod/

I just bought the Google Home Speaker from BestBuy Canada for $99.99, a big discount due to Black Friday. I am anxious to try it out and see how good its voice recognition is. It will be arriving next week.

I was wondering if anyone here knows how the device works with the remote server.

1. Does it connect to the server App at Google over a raw TCP/IP connection, or does it use HTTP?

2. Is the voice recognition in the device firmware? If so, what is in the data packets that it sends to the server? Is it simple text or is it encoded somehow?

3. Does it convert received text to audio on the way back?

4. Or does it send VoIP audio to the server, where the voice recognition takes place?

If it is a simple known interface, wouldn’t it be nice to redirect its input and output to a private server running a different App than the Google search engine? I can think of several applications for an audio terminal that works well in sending and receiving voice messages. I believe that the novelty of the current applications these 3 companies are promoting (music, switching electrical devices on and off) will wear off quickly. There are many more serious applications that a device like this could be put to.

So, if you have any information on the communications aspects of these devices, please let me know.

Val Gretchev

Wow! Not a single reply in over a week. I guess I stumped everyone here with my little question or perhaps there are no electronics enthusiasts here any more and this is just another social website.

I received my Google Home Speaker in the mail and set it up close to my PC. It works quite well and the voice recognition is very good. It can hear me even when the music is playing loudly.

I then started Wireshark on my PC and set the filter for the IP address assigned to the speaker by my router. I can see some broadcast messages from the speaker to destination IP address on port 5353 using the mDNS protocol. I looked this up and it is at Google. However, I can’t see any traffic when I make a request to the speaker or when it responds. I assume these messages are not echoed to all nodes connected to the router.
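For anyone repeating this capture, here is a minimal Python sketch of the two things I was doing by hand: deciding whether a destination address is multicast or unicast, and building the Wireshark display filter for one device. The address 192.168.1.50 is a made-up example, not my speaker's real address, and 224.0.0.251 is the multicast group that mDNS normally uses on port 5353 (an assumption about my capture, since I did not note the destination above). The function names are my own.

```python
import ipaddress

MDNS_PORT = 5353  # UDP port seen in the capture described above


def classify_destination(ip: str) -> str:
    """Return 'multicast', 'broadcast', or 'unicast' for an IPv4 destination."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    if addr.is_multicast:
        return "multicast"
    if ip == "255.255.255.255":
        return "broadcast"
    return "unicast"


def display_filter(device_ip: str, hide_mdns: bool = False) -> str:
    """Build a Wireshark display filter that shows traffic for one device."""
    ipaddress.ip_address(device_ip)  # validate before building the string
    flt = f"ip.addr == {device_ip}"
    if hide_mdns:
        # Drop the mDNS chatter so that anything else stands out.
        flt += f" && !(udp.port == {MDNS_PORT})"
    return flt


if __name__ == "__main__":
    print(classify_destination("224.0.0.251"))   # usual mDNS group
    print(classify_destination("192.168.1.50"))  # hypothetical speaker address
    print(display_filter("192.168.1.50", hide_mdns=True))
```

Multicast packets are delivered to every node on the network, which would explain why my PC sees them, while unicast packets between the speaker and the router normally never reach the PC's network adapter.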

I looked up a Wireshark help file at:
and see that I need to turn on “Monitor Mode”. However, it also says that “Windows is very limited here”. Sure enough, there is no monitor mode checkbox anywhere to be found. It looks like I may have to get a PC running Linux to complete my investigation.

Any thoughts?

Val Gretchev

It's not an electronics question, it's a computer/networking one.
I guess you are making it abundantly clear that I am posting on the wrong website. I don't see any computer/networking category on the main page. Do you know of any website where I could fit in better?

However, I must point out that in today's world, to be successful, one must have several skills. Electronics knowledge must be augmented with software/firmware know-how and perhaps a bit of mechanical know-how. And you certainly can't function without a working knowledge of network connections and protocols.


Wow! Not a single reply in over a week....
...Any thoughts?
None whatsoever, no interest in such things.

If I had such a device, little Basil would have great fun with it.


As the postman struggles up to the front door with an enormous sack of assorted bird seed, I wonder what else has been ordered by online voice while I am out of the house.


Val Gretchev

None whatsoever, no interest in such things.
Thanks for your answer. At least I know where I stand with my little project. Tell me, is everyone here a "Super Moderator" or am I inviting censure by voicing some negative opinion?


am I inviting censure by voicing some negative opinion?
No problem, most of us here are grown up and can stand a bit of negative opinion.

You asked "Any Thoughts"
I replied "No Interest"
Just indicating my view on the situation.

Whitby, Ontario, I have never been there.
Whitby, North Yorkshire, I have been there many times, a favourite place for holidays in days gone by.


Val Gretchev

I don’t think you got the gist of what I am thinking about regarding the speakers that I re-named “Audio Terminal”. I have an application in mind that involves medical assistance in a patient’s home which requires a “terminal”. The standard model of screen, keyboard, and mouse is just not in the cards as many older people are not familiar with computer usage and would not take kindly to such an intrusion into their lives. An audio terminal that talks to them in their native language might be the answer. Take a look at the following article that may shed some light on the subject:


Thanks, VG.

Very interesting use of the device(s) I hadn't considered.

I think your first post must have arrived on a busy day and slid off the bottom before anyone noticed/bothered to respond.
... So, if you have any information on the communications aspects of these devices, please let me know.
Not really, but I suspect that the device is "hard coded" to reach out (via WiFi) to a specific IP address (as you noted) and that may not be something you can alter.

And, again from your description, VoIP appears to be the "data" that is sent and received from/to the AT, hence no discernible data. See this description. Note the highlighted "Google Talk" - just the reference indicates the kind of digital muscle supporting the concept.

Going no deeper than that though, it strikes me that interfacing to one of the current ATs appears to be doable, but rather complex (setting aside the voice recognition encoding and database manipulation at a receiver computer for recognition and response).

I would be very interested in following your progress in this endeavor.

Val Gretchev

Cowboybob, thanks for your reply.

Yes, the IP address is most likely hard coded and would be difficult to change. The Google Home Speaker has a grille on the bottom that is held to the main body by magnets. One tug on it exposes the speakers inside (the power cord must be removed first). At the back of the speaker assembly is a USB-mini B jack. This may be a way to alter the programming, but I won’t dare to plug anything into it until I have more information.

You are quite right about the voice protocol to and from the speaker. It is most certainly VoIP since it wouldn’t make much sense to put voice recognition into every unit. It’s best to keep the voice recognition at the server where a more powerful system can perform the analysis much better and faster. Also, a single app at the server is easier to maintain and upgrade rather than having to upload changes to all the units out there. And there are going to be millions.

The big clue to me was when I read the following article that allows a Google Home user to telephone any land or mobile phone:
I haven’t tried this yet, but it’s next on my agenda.

One way to get familiar with Google Home is to join the Google Developer Community and write some apps.

One thing the Google Home speaker does not have is a camera. That is unfortunate, since a camera would add an extra dimension to human interaction by analyzing facial expressions and body language. This is absolutely essential in artificial intelligence apps that attempt to act as a psychologist. A camera could also take a picture of a user’s dinner plate, identify the food groups, estimate the volume of each, and calculate the caloric content of the food. This would be a Dietitian app.

Although a Google Home device would be a cheap hardware solution (leveraging their huge manufacturing volume), I am not averse to developing my own hardware that fits my application better.

Val Gretchev

Are you old enough to remember when the movie 2001: A Space Odyssey first came out in 1968? The epic film by Stanley Kubrick featured HAL 9000, a sentient artificial intelligence, who eventually went berserk, but that was a Hollywood necessity in order to put some meat on a rather dry plot. Later, in 1984, HAL was reprised in the movie 2010.

If you observe HAL in the movie 2001, you will notice that he not only entertains the crew by playing chess, he also carefully interprets any changes in crew behaviour or mood to determine their mental state and whether they continue to be fit for duty. This is his primary function in interacting with the crew. His other function is controlling every aspect of the ship, which he performs automatically in the background.

Perhaps some younger readers do not remember the stories that circulated in those days that Stanley Kubrick played with the name HAL to send a message to his audience. If you take the next higher character in the alphabet for each letter in the name HAL, you will spell IBM.
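If you want to check the letter shift for yourself, here is a tiny Python sketch. The function name shift_letters is just something I made up for illustration, and it assumes the input contains only the letters A to Z.

```python
def shift_letters(word: str, offset: int = 1) -> str:
    """Shift each letter A-Z forward by `offset` places, wrapping after Z."""
    shifted = []
    for ch in word.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"only letters A-Z are supported, got {ch!r}")
        shifted.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
    return "".join(shifted)


if __name__ == "__main__":
    print(shift_letters("HAL"))  # prints IBM
```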

Artificial intelligence does not yet equal that of HAL as depicted in the movie. It will be a long time before artificial intelligence can achieve sentience. Then, we will face a moral dilemma: can such a sentient program be shut down or is that tantamount to murder?

In 1968 I worked at IBM as a Customer Engineer and serviced computers and their peripherals (Tape Drives, Disk Drives, Printers, Communications Interfaces) in the downtown Toronto area and vicinity.

IBM was the foremost computer company at that time. Thomas Watson Jr. was the second President of IBM from 1952 to 1971. You can read all about him here:

Before that, Thomas J. Watson (his father) served as Chairman and CEO of IBM from 1914 to 1956. His bio is here:

It is appropriate, therefore, that IBM named its Artificial Intelligence software Watson, after Thomas J. Watson Sr., the first of the two influential presidents who led IBM to greatness by building improved computing machines. You can read about Watson (computer) here:

I believe that artificial intelligence is about to go ballistic in the next few years. Here is a system that Amazon will have available in April 2018 that will allow everyone to dabble with machine learning and object recognition. This will be the ideal tool to implement the Dietitian App I mentioned in the previous segment of this post.

If you are interested in what a VoIP packet structure looks like, you can read about it here:
and here:
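As a rough illustration, here is a short Python sketch that unpacks the fixed 12-byte header of an RTP packet (RFC 3550), which is the usual carrier for VoIP audio. I am only assuming that a VoIP stream would be carried as RTP; I have not confirmed what the Google Home actually sends. The sample packet is built by hand for the demonstration and is not captured data.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int        # always 2 for current RTP
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int   # e.g. 0 = PCMU (G.711 mu-law) in the standard A/V profile
    sequence: int
    timestamp: int
    ssrc: int           # identifies the stream source


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header at the start of `packet`."""
    if len(packet) < 12:
        raise ValueError("an RTP header needs at least 12 bytes")
    # Network byte order: two flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


if __name__ == "__main__":
    # Hand-made example: version 2, payload type 0, followed by 160 bytes of dummy audio
    sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x12345678) + b"\x00" * 160
    print(parse_rtp_header(sample))
```

Everything after the header (and any optional CSRC entries) is the encoded audio itself, so without the codec there is no readable text to find in a capture.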


I agree that AI is going to boom in the coming years but can't see sentient AI in my lifetime or even this century. Do you know of any progress towards sentience?


Val Gretchev

Do you know of any progress towards sentience?
No, I don’t have any new information on research that is going on in that field. This article I just found seems to describe the current thinking quite well:

I believe that computers of today (based on a stored program with arithmetic-logic unit hardware) will never become self-aware. A completely new computer architecture must be invented first. Perhaps Neural Network computers with a lot more network nodes might be the answer, or something entirely different.

I agree with you that sentient computers will not occur in my time. I can only hope for somewhat intelligent programs executing a pre-defined decision matrix and fooling people into thinking that they are talking and interacting with a human. See Turing Test:

But that is also very useful. Think about a psychologist making a diagnosis of a patient. He may ask the patient a number of questions and make notes of the answers. He already has a list of possible answers and what they might mean. He then draws on his experience of having asked these questions of many patients and reaches a conclusion. I think that a computer program with a vast database at its disposal can make a similar diagnosis.

Val Gretchev

If anyone is interested in learning what makes a Google Home Speaker tick, here are a couple of URLs that disassemble a unit and spell out the modules used in the hardware.

Taking apart the Google Home! What's inside the Google Home?

Google Home Teardown

From the software point of view, the Google website discloses details of the protocols used in Google Talk.
It is clear from this FAQ that Google uses the XMPP protocol and lists the codecs for voice and video.
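To give a feel for what XMPP traffic looks like, here is a minimal Python sketch that builds a basic XMPP chat message stanza, which is just a small piece of XML. The addresses are made-up examples and the helper name build_message is my own. This only shows the general shape of the protocol; I have no evidence that the Google Home sends exactly this.

```python
import xml.etree.ElementTree as ET


def build_message(sender: str, recipient: str, text: str) -> str:
    """Return a basic XMPP <message/> stanza as a string."""
    msg = ET.Element("message", {"from": sender, "to": recipient, "type": "chat"})
    body = ET.SubElement(msg, "body")
    body.text = text  # ElementTree escapes characters such as < and & on output
    return ET.tostring(msg, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only
    print(build_message("speaker@example.com", "server@example.com", "What time is it?"))
```

Voice and video calls are negotiated with additional stanzas and the audio itself travels separately, so a stanza like this covers only the signalling side.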

If you have a Google Home, you can try programming a game by duplicating this writer’s work.
How To Build Your Own Action For Google Home Using API.AI

Once you have this working, you can try a more complicated verbal exchange with Google Assistant.


My daughter received a Google Home speaker as a Christmas gift, but certainly not from me. It sounds awful and does the same things a smart phone can do.

When my daughter asks about something that she cannot pronounce properly, the Google Home speaker tries to understand, then makes a few attempts or gets all confused.

The news says that much better sounding "Smart Speakers" are coming soon.

