Extracting Sky Router Crash Data Amidst Kernel Panics
I have a Sky broadband connection with fibre into a Sky Router / Sky Hub. I have noticed very short outages of internet service with increasing frequency recently. The outages are short, maybe around 3-5 minutes long but these are annoying enough in the middle of an online meeting or some other synchronous activity. Sometimes two or three of these short outages occur in a relatively short time frame.
There did not seem to be a correlation to temperature, or even usage. In other words, the device doesn't seem to be crashing because it is overheating under load or ambient temperature.
Extracting the Crash Data
It turns out that some enterprising engineer took the decision to embed crash or reset data in what is perhaps a surprising place, or at least a surprising place for easy user access. If one backs up the router configuration settings to a file, the result is an XML file that includes two interesting stanzas towards the end.
<X_SKY_REBOOT_CAUSE>
  <RebootTime>1660120232</RebootTime>
  <RebootReasonType>KLRT</RebootReasonType>
  <RebootReasonCode>1001</RebootReasonCode>
  <RebootInfo>kernel panic marker detected in the flash</RebootInfo>
</X_SKY_REBOOT_CAUSE>
<X_SKY_COM_DEVICE_DOCTOR>
  <Enable>TRUE</Enable>
  <CmsLockDuration>10</CmsLockDuration>
  <MemoryThreshold>95</MemoryThreshold>
  <SharedMemoryThreshold>95</SharedMemoryThreshold>
</X_SKY_COM_DEVICE_DOCTOR>
Now the X_SKY_COM_DEVICE_DOCTOR also looks quite interesting, but the piece for us right now is the X_SKY_REBOOT_CAUSE. You will note in this case that the RebootInfo data contains the following:
kernel panic marker detected in the flash
Which is interesting, if not encouraging. For those not in the know, a kernel panic indicates a type of crash in the base operating system of the device. It could be caused by faulty software (or in this case firmware) in the device, or it could be caused by some hardware problem. In Sky's case firmware updates are highly automated. It could easily be caused by a firmware bug, but in that case it would likely be experienced by many customers. If that's not the case, the smart money might be on faulty hardware.
Importantly, even if there is some problem in the broadband provision coming from the fibre, the router should not panic or crash - it should just deal with the problem, reconnect when possible and move on. That would be obvious in the device logs.
I called Sky to report this, and the first conversation wasn't too productive, though it wasn't surprising: I was asked to turn the device off and on again. Not bad advice, but not successful. I was then told to do a total factory reset, which was a time-consuming pain and didn't fix the problem.
By this point, I'd started to write a very small Python script to automate extracting the crash data and time-stamping it, provided the backup file was downloaded by hand. To complete the script I really need to automate the web request part, which I haven't attempted yet as it looks like I need to handle the session data - not insurmountable, but a bit of work. Here is that short unvarnished script.
import xml.etree.ElementTree as ET
import datetime as dt
import os

# Extract the XML from the settings file and get the root
tree = ET.parse('sky_router_settings.conf')
root = tree.getroot()

# Look for the X_SKY_REBOOT_CAUSE stanza should it exist
for item in root.findall('.//X_SKY_REBOOT_CAUSE'):
    # It does, so let's extract the details
    reboot_time = int(item.find('RebootTime').text)
    reboot_reason_type = item.find('RebootReasonType').text
    reboot_reason_code = item.find('RebootReasonCode').text
    reboot_info = item.find('RebootInfo').text

    # Make an ISO datetime from the Unix epoch timestamp
    reboot_format_time = dt.datetime.utcfromtimestamp(reboot_time).strftime("%Y-%m-%d %H:%M:%S")

    # Let's write the data if it isn't already there (in current working directory)
    if not os.path.exists(reboot_format_time):
        print(f"Found new crash data... {reboot_format_time} {reboot_info}")
        try:
            fh = open(reboot_format_time, 'x')
            print(f'Time:{reboot_format_time}', file=fh)
            print(f'Type:{reboot_reason_type}', file=fh)
            print(f'Code:{reboot_reason_code}', file=fh)
            print(f'Info:{reboot_info}', file=fh)
            fh.close()
        except Exception as e:
            print(f'oops, something went wrong')
            print(e)
So, to use this, one logs into the Sky router (probably by browsing to http://192.168.0.1, or whatever address your router is on from within your network) and goes to Maintenance and Backup Settings. Drop the saved file in the same directory as the script, and run it. I was doing this periodically to check for crashes I had not witnessed. It will save any new data into a file with the timestamp as a filename in the same directory. Crude, but it works.
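Two caveats may be worth noting about the timestamp handling in that script: `datetime.utcfromtimestamp` is deprecated from Python 3.12 onward, and colons are not valid in Windows filenames. A minimal sketch of a variant that avoids both follows. The helper name `reboot_filename`, the hyphenated time format and the `.txt` suffix are assumptions made for illustration, not part of the original script.

```python
import datetime as dt


def reboot_filename(epoch_seconds: int) -> str:
    """Return a filename-safe UTC timestamp for a RebootTime value."""
    # Timezone-aware equivalent of the deprecated utcfromtimestamp()
    when = dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc)
    # Hyphens instead of colons keep the name valid on Windows as well
    return when.strftime("%Y-%m-%dT%H-%M-%SZ") + ".txt"


if __name__ == "__main__":
    # RebootTime value taken from the sample stanza earlier in the post
    print(reboot_filename(1660120232))  # prints 2022-08-10T08-30-32Z.txt
```

The trailing `Z` is only a reminder that the time is UTC, so an event may appear an hour off from local UK time during summer.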
Armed with a number of crash events I called Sky back. What followed was a highly frustrating conversation for all sides, where I was advised that I had to plug the hub into a different electrical socket. I duly did so, and incidentally noticed that the hub records a different reason for the reset.
Power On Reset detected
In other words, the hub notes when it was power cycled. To the surprise of virtually no-one, changing the socket the hub was plugged into did not prevent the crashes. My script detected another one just before 2 am yesterday.
Time:2022-09-07 01:56:12
Type:KLRT
Code:1001
Info:kernel panic marker detected in the flash
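The files the script writes hold four `Key:value` lines, so they are easy to read back when tallying events before a support call. A small sketch, assuming that layout; `parse_record` and `tally_by_info` are hypothetical helper names. The split is on the first colon only, because the timestamp value itself contains colons.

```python
from collections import Counter


def parse_record(text: str) -> dict:
    """Parse 'Key:value' lines into a dict, splitting on the first colon only."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        record[key.strip()] = value.strip()
    return record


def tally_by_info(records: list) -> Counter:
    """Count events per Info message."""
    return Counter(r.get("Info", "unknown") for r in records)


if __name__ == "__main__":
    sample = (
        "Time:2022-09-07 01:56:12\n"
        "Type:KLRT\n"
        "Code:1001\n"
        "Info:kernel panic marker detected in the flash\n"
    )
    rec = parse_record(sample)
    print(rec["Time"])  # prints 2022-09-07 01:56:12
    print(tally_by_info([rec]))
```

Counting by message makes it easier to separate deliberate power cycles from genuine crashes.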
I called Sky again and finally had a constructive conversation - they are sending me a new hub to test. Hopefully this will solve the problem. I doubt it's a firmware issue or it would have been more widely reported.
#TODO
I think I will probably bite the bullet and use Python to download the backup file too. If I can get that bit working I can rig the whole thing up to cron to check for crash data automatically. New hub or not, keeping track of these crash events would be useful.
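For unattended runs, a single append-only log may be easier to manage than one file per event. The sketch below covers only the record-keeping half and assumes the backup file has already been saved locally; downloading it from the hub is not attempted here. The function name `log_new_reboots`, the CSV column order and the `reboots.csv` file name are assumptions for illustration.

```python
import csv
import os
import xml.etree.ElementTree as ET


def log_new_reboots(settings_path: str, log_path: str) -> int:
    """Append reboot events not yet in log_path; return how many were added."""
    # Collect the RebootTime values already logged, so reruns add nothing new
    seen = set()
    if os.path.exists(log_path):
        with open(log_path, newline="") as fh:
            seen = {row[0] for row in csv.reader(fh) if row}

    added = 0
    root = ET.parse(settings_path).getroot()
    with open(log_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for item in root.findall(".//X_SKY_REBOOT_CAUSE"):
            reboot_time = item.findtext("RebootTime", default="")
            if not reboot_time or reboot_time in seen:
                continue
            writer.writerow([
                reboot_time,
                item.findtext("RebootReasonType", default=""),
                item.findtext("RebootReasonCode", default=""),
                item.findtext("RebootInfo", default=""),
            ])
            seen.add(reboot_time)
            added += 1
    return added


if __name__ == "__main__":
    # Only runs if a backup file is present in the current directory
    if os.path.exists("sky_router_settings.conf"):
        print(log_new_reboots("sky_router_settings.conf", "reboots.csv"))
```

Because events are keyed on the raw RebootTime value, running it repeatedly against the same backup adds nothing, which is the property a scheduled job needs.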
It's curious that the hub obviously does this hard work of storing crash data, but the data doesn't seem to be transmitted to Sky, which would really help them diagnose problems when customers call.
UPDATE
Unfortunately a new hub has not solved the problem, so it looks as though it may be a firmware problem after all. I have collected some more messages that might be in the reboot reason. Here is my list so far:
kernel panic marker detected in the flash
Power On Reset detected
The new firmware image downloaded by FUS is being written to flash. The device may REBOOT
CPE has been software resetted (possibly watchdog timeout, if no other indicators
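With a handful of known messages, a simple lookup can sort events into those worth reporting and those that are expected. A rough sketch: the category labels are invented for illustration, and matching is by message prefix because the full set of strings the hub can emit is not known.

```python
# Message prefixes seen so far, mapped to illustrative category labels
KNOWN_REASONS = {
    "kernel panic marker detected": "kernel_panic",
    "CPE has been software resetted": "software_reset",
    "Power On Reset detected": "power_cycle",
    "The new firmware image downloaded by FUS": "firmware_update",
}


def classify_reboot(info: str) -> str:
    """Return a category label for a RebootInfo message, or 'unknown'."""
    for prefix, category in KNOWN_REASONS.items():
        if info.startswith(prefix):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_reboot("Power On Reset detected"))  # prints power_cycle
```

Anything that comes back as `unknown` is a candidate for adding to the list.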
Thanks for this, it's really helpful; we have exactly the same issue (got a replacement hub, but crashing every 6 hours or so). We've also got a similar cause to you (CPE has been software resetted (possibly watchdog timeout, if no other indicators):
1664633814
KLRT
1003
CPE has been software resetted (possibly watchdog timeout, if no other indicators)
Did you manage to get any further with it?
Hi Mat, no, I'm not any further. In my most recent call with Sky they asked me to photo the back of the hub and then complained I wasn't using a Cat 5e cable to connect it to OpenReach (I'm using a Cat 6 cable, so the latest bizarre suggestion is to downgrade my patch cable).
I'm beginning to suspect the hub is prone to some RFI, and that might be the issue. Still haven't found a clear cause.
Yeah, that sounds plausible. When I was looking at the syslog output I noticed that it's doing memory allocations, and it seems like before a crash it's probably running out of resources, which definitely points to a firmware issue.
Hi Mat, I'm not sure you'll get this, but by way of closure I now have a stable network. My best guess is that a failing webcam was emitting RFI or packet bursts and the hub is not robust enough to cope. Not sure if that helps you!
Thanks; I also have a stable network now - though frankly I don't remember if I changed anything in particular or whether it was just a firmware update that fixed the underlying crashes.
In my case I narrowed down the crashes to being caused by my Toshiba ThinkPad laptop that had been issued by work; it was running bog-standard Windows 10 but I wasn't sure if some combination of the Secure Access VPN client, Zscaler corporate proxy and/or Tailscale VPN was causing it to broadcast something a bit spicy.
Nevertheless, happy to hear you managed to get it sorted too!