Issue with Lua Libraries WebSockets on Edge Driver

Hi all,

I’m writing the Harman Luxury Edge driver and I get a FATAL error that I don’t quite understand why it happens or if I’m missing something in my setup.
We are using the Lua v9 libraries’ Lustre.WebSockets and you can find the relevant code in this repo (commit: 9ec68c9666 on branch harman-websocket). The websockets are handled in <repo>/drivers/SmartThings/harman-luxury/src/hl_websocket.lua.

A summary of what we do:
We are creating a websocket with the following parameters:

local sock, _ = socket.tcp()
sock:settimeout(10)
local config = Config.default():keep_alive(30)
local websocket = ws.client(sock, "/", config)
websocket:connect(<ip>, <port>)

But everytime after 30s of inactivity on the websocket (the ping time), we get this fatal error:

 FATAL Harman Luxury  runtime error: [string "cosock.lua"]:313: [string "lustre/ws.lua"]:299: attempt to index a nil value (local 'recv')
stack traceback:
	[string "lustre/ws.lua"]:299: in method '_handle_recvs'
	[string "lustre/ws.lua"]:286: in method '_receive_loop'
	[string "lustre/ws.lua"]:178: in function <[string "lustre/ws.lua"]:177>
stack traceback:
	[C]: in function 'error'
	[string "cosock.lua"]:313: in field 'run'
	[string "st/driver.lua"]:1016: in method 'run'
	[string "init.lua"]:207: in main chunk

So it seems I might have missed something while setting the socket up and when the ping is attempted it tries to access an empty table.

Anyone has any suggestions?

Hi, @alon-tchelet
Welcome to the SmartThings Community!

The engineering team is already working on the solution for this. Once we have an update, we’ll let you know.

1 Like

@alon-tchelet I think you’ve found a bug in the WebSocket library that’s rearing its head because you’re putting a timeout on the ping-pong heartbeat, which I don’t normally do so I haven’t come across it before.

It looks like what’s happening is the following:

  • socket.select(rs, nil, self.config._keep_alive) is returning nil, nil, "timeout" (ws.lua line 277)
  • not recv is truthy, so we call self:_handle_select_err(loop_state, err) (ws.lua line 282)
  • err == "timeout" is truthy; state.pending_pongs >= 2 is falsey, so we send a ping (ws.lua lines 319-321).
  • not err is truthy if the send succeeds (which send rarely fails)
  • The function reaches the end of execution without an explicit return, meaning its return value is falsey
  • We check the return value of the function where we called it on line 282, it’s falsey, so we don’t early return
  • We proceed to call self.:_handle_recvs(loop_state, recv, 1) on line 286 even though recv is nil

You can “monkey patch” this file in your driver by making lustre/ws.lua in your driver’s src dir (you don’t have to vendor the entire driver, just the file you want to change).

Try changing the if on ws.lua line 286 to an elseif and see if that gets you unstuck using the method I just described.

And I’ll take a longer look at this later this week to see if there’s more that needs to be done, or if this is sufficient.

5 Likes

Hi @doug.stephen ,
Thank you for your answer. I was thinking it something like that, but wanted to make sure before I make a local copy of the lustre library to our driver (again) as that’s what I was advised/requested by the community members on the PR.

I’m doing this ping to have a low traffic regular check that the device is still online and reachable. Does your application not doing something similar or you are doing it in a different manner?

edit: just wanted to let you know that doing this change, like you suggested, seems to fix the issue:

@@ -282,8 +282,7 @@
       if self:_handle_select_err(loop_state, err) then
         return
       end
-    end
-    if self:_handle_recvs(loop_state, recv, 1) then
+    elseif self:_handle_recvs(loop_state, recv, 1) then
       break
     end
   end

before I make a local copy of the lustre library to our driver

You can copy just this one file and it’ll override only that file when you package your driver

Does your application not doing something similar or you are doing it in a different manner?

I’ve just been relying on the TCP socket reporting closed itself. There’s nothing wrong with your approach, though, so we’ll fix this in the libs themselves. But it might be a bit before that’s available because we’re in the middle of an entire firmware release for some of our targets, so I think patching just this file is your best bet in the short term.

I’ve just been relying on the TCP socket reporting closed itself.

This would be sufficient for the general use cases, but in cases where the WiFi device loses internet momentarily, gets powered off “dirtily”, or even just being assigned a new IP, the closed message won’t be sent and SmartThings won’t know the device is offline until a communication is attempted. With the keep_alive ping I can routinely inspect that the connection is still available and have logic based on it.