Forum - MCS Electronics

 


watchdog

 
njepsen (Bascom Member, joined 13 Aug 2007, posts: 469)
Posted: Wed Dec 07, 2016 4:57 am    Post subject: watchdog

I have been researching a "reliable" watchdog recently. Why? Because I have stumbled on a situation with unreliable internet connectivity that caused my code to misbehave.
Simply kicking the dog once every time through main() doesn't cut it.
There is a very well-written article on watchdogs here: http://www.ganssle.com/watchdogs.htm

And this statement is worth considering:
Quote:

A pair of design rules leads to decent WDTs: kick the dog only after your code has done several unrelated good things, and make sure that erratic execution streams that wander into your watchdog routine won't issue incorrect tickles.

This is a great place to use a simple state machine. Suppose we define a global variable named "state". At the beginning of the main loop set state to 0x5555. Call watchdog routine A, which adds an offset - say 0x1111 - to state and then insures the variable is now 0x6666. Return if the compare matches; otherwise halt or take other action that will cause the WDT to fire.

Later, maybe at the end of the main loop, add another offset to state, say 0x2222. Call watchdog routine B, which makes sure state is now 0x8888. Set state to zero. Kick the dog if the compare worked. Return. Halt otherwise.

Suppose the code crashes and for inscrutable reasons probably having to do with Murphy's Law and the perversity of nature vectors into wdt_b() just before the kick_dog command. The protection mechanism of the state machine won't help.

Perhaps it's safe to assume that the code will again crash when wdt_b() returns, so the system will miss the next watchdog tickle. But. . . . perhaps not - who knows what evil lurks in the mind of runaway software?

Is this fear paranoid? You bet. But the WDT might be the last line of defense between deflecting the Earth-bound asteroid and utter disaster, or at least in rebooting the pacemaker before grandpa collapses. Assuming that crashed code will operate in any benign mode is naïve.

So wdt_b() double-checks variable "state" to insure the system halts (so the watchdog can issue a reset) even if rogue code wandered into wdt_b() just before issuing the kick_dog.

This is a trivial bit of code, but now runaway code that stumbles into any of the tickling routines cannot errantly kick the dog. Further, no tickles will occur unless the entire main loop executes in the proper sequence. If the code just calls routine B repeatedly, no tickles will occur because it sets state to zero before exiting.
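
For anyone who wants to play with the state machine described in the quote, here is a minimal host-side sketch in C. It is a model for a PC, not AVR or BASCOM code: halt() and kick_dog() only record what happened, and the names loop_start, loop_end, good_pass and reset_model are my own assumptions, not from the article. The order of the checks inside wdt_b() is also just one possible arrangement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model: halting and kicking are recorded in variables
   instead of touching real hardware. All names are illustrative. */
uint16_t state;      /* the global "state" from the article */
bool     halted;     /* true once the code has parked itself */
unsigned kicks;      /* how many times the dog was kicked */

void reset_model(void) { state = 0; halted = false; kicks = 0; }

void halt(void)     { halted = true; }   /* real code: spin until the WDT fires */
void kick_dog(void) { kicks++; }

void loop_start(void) { if (!halted) state = 0x5555; }

void wdt_a(void)
{
    if (halted) return;
    state += 0x1111;
    if (state != 0x6666) halt();
}

/* Main loop adds the second offset just before calling wdt_b(). */
void loop_end(void) { if (!halted) state += 0x2222; }

void wdt_b(void)
{
    if (halted) return;
    if (state != 0x8888) { halt(); return; }
    kick_dog();
    /* Double check: code that jumped in just above kick_dog() skipped
       the first compare, so test again before clearing the state. */
    if (state != 0x8888) { halt(); return; }
    state = 0;
}

/* One correct pass through the main loop. */
void good_pass(void)
{
    loop_start();
    wdt_a();
    /* ... the real work would happen here ... */
    loop_end();
    wdt_b();
}

/* Small self-check showing the two behaviours the quote promises. */
void demo(void)
{
    reset_model();
    good_pass();
    assert(kicks == 1);            /* full sequence: one kick */
    wdt_b();                       /* stray call: state is 0, so it halts */
    assert(halted && kicks == 1);  /* no extra kick was issued */
}
```

Calling demo() runs one good pass followed by one stray call to wdt_b(). The stray call halts instead of kicking, which on real hardware would let the WDT time out and reset the chip.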



(BASCOM-AVR version : 2.0.7.9 , Latest : 2.0.7.8 )

_________________
Neil
Duval JP (Bascom Member, joined 22 Jun 2004, posts: 1161, location: France)
Posted: Mon Dec 19, 2016 5:12 pm

Thank you for the article, very interesting indeed, but I hope that this dog does not bite ;-)
JP
njepsen (Bascom Member, joined 13 Aug 2007, posts: 469)
Posted: Sat Sep 14, 2019 11:43 pm

Further to my earlier notes about watchdogs:
For some time (a few years, in fact) I have been unsatisfied with the watchdog in my code. The code is about 8000 lines long and is embedded in solar-powered loggers that are many hundreds of km from my shop, so access and maintenance are not possible if the code falls over. The board uses the ATmega1284P, whose watchdog timer has a maximum timeout of about 8 seconds, but many of my routines can take upwards of a minute. This is because they communicate with 2 servers via a cellular modem, as well as talking serially (RS232) to several asynchronous sensors. These things take time because connection times with the servers vary. There are also instances where the telecom provider turns off the cellular service in the middle of the night without warning, often in the middle of a data upload. So I need to be able to handle unexpected delays while still managing a watchdog timer.
8 s is too short, and I am not happy with having multiple instances of "reset watchdog" scattered throughout the code, because after a rogue crash one of those parts of the code could end up running continuously, thus defeating the watchdog.
I have come up with what I think is a good workaround, and I'd like any comments. In this code, a rogue crash would have to stop timer T1 to defeat the watchdog. The holdoff is set in only one place in the main loop, so if the main loop takes longer than the holdoff (15 s in the example below) to run, there is a reset.

Here is a working example of my solution for long watchdog times. The main loop in the example takes 10 s to run and, because T1 and the main loop are asynchronous, wd_holdoff_int must be set to at least 12 s to be safe. I've used 15 s. The Config Watchdog timeout must also be longer than the T1 period.


Code:
  '*******************************************************************************
$regfile = "m1284pdef.dat"                                  'my processor
$crystal = 9830400                                          'system crystal
$framesize = 800                                            ' Located at top of 16k of SRAM
$hwstack = 550                                              '
$swstack = 550                                              '
$frameprotect = 1
Disable Jtag

 '------------------------------------------------------------

dim WD_holdoff_int as integer

'-----------------------------------------------------------
'Set up Timer 1
'----------------
'This counter ONLY handles the SW watchdog
'we want to set up counter 1 to toggle 1x per sec
'timer counts UP
'Timer 1 = 16 bit , i.e 65536 counts
'counts 9830400 / 1024 = 9600 per sec
'65536 - 9600 = 55936
'-----------------------------------------------------------


Config Timer1 = timer , Prescale = 1024                     '
On Timer1 Timer1overflow
enable OVF1
enable interrupts

 '---------------------------------------------------------------------------------

 config watchdog = 2048                                     '2 sec  Must be > T1 setting
 stop watchdog


'-------------------------------------------------------------------------------
'Open a software UART TRANSMIT channel for debug on software ser port
Open "comc.3:38400,8,n,1" For Output As #1                  '
Open "comc.2:38400,8,n,1" For Input As #5


Declare Sub mySub1()
Declare Sub mySub2()
Declare Sub mySub3()
Declare Sub mySub4()
Declare Sub mySub5()


'****************************************************


   Print #1 , ""
   Print #1 , ">>*******  DID A REBOOT   **********<<"
   Print #1 , ">> File Version " ; Version(3)
   Print #1 , ">> File Version " ; Version(1)
   Print #1 , ""

   'Main Loop
      do
          Print #1 , ""
          print #1 , ">> start over"
          wd_holdoff_Int = 15                               'loop takes 10sec, so lets use 5 sec margin

          mysub1
          mysub2
          mysub3
          mysub4
          mysub5

      loop

End

'**************************************************

sub mysub1()

'my code
print #1 , ">> waiting 2s in sub 1:"
wait 2
print #1 , ">> holdoff= " ; wd_holdoff_int


end sub

sub mysub2()


'my code
print #1 , ">> waiting 4s in sub 2:"
wait 4
Print #1 , ">> holdoff= " ; wd_holdoff_int

end sub

sub mysub3()


'my code
print #1 , ">> waiting 1s in sub 3:"
wait 1
Print #1 , ">> holdoff= " ; wd_holdoff_int


end sub

sub mysub4()

'my code
print #1 , ">> waiting 1s in sub 4:"
wait 1
print #1 , ">> holdoff= " ; wd_holdoff_int


end sub


sub mysub5()

'my code
print #1 , ">> waiting 2s in sub 5:"
wait 2
print #1 , ">> holdoff= " ; wd_holdoff_int




end sub


'**************************************************************
timer1overflow:
 '(
 **************************************************************
 * Handles the SW watchdog.
 * wd_holdoff_int is decremented every 1 sec under interrupt
 * But - if the SW watchdog times out before the holdoff is reset in the main code
 * somewhere, the SW WD operates.
 * wd_holdoff_int needs to be an integer - byte is too small & rolls over
 *                                                                                                                
 **************************************************************
')

  timer1 = 55936
 start watchdog                                             'ensure that software WD is running continuously
 decr wd_holdoff_Int
 if wd_holdoff_Int > 0 then
   reset watchdog
 else
   print #1 , ">> WD TIME OUT"                              'do a watchdog
   goto 0
 end if


return
  '*************************************************************


  end
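
The timer arithmetic and the holdoff counting in the listing above can be checked on a PC with a small C model. This is only a sketch under my own assumptions: timer1_preload, holdoff_tick and run_model are hypothetical helper names, time advances in whole seconds, and the 1 Hz tick is taken to run just before the main loop reloads the counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Preload so a 16-bit up-counter overflows rate_hz times per second.
   Assumes 0 < f_cpu / prescale / rate_hz <= 65535. */
uint16_t timer1_preload(uint32_t f_cpu, uint32_t prescale, uint32_t rate_hz)
{
    uint32_t counts = f_cpu / prescale / rate_hz;  /* counts per period */
    return (uint16_t)(65536UL - counts);
}

int16_t wd_holdoff;   /* mirrors wd_holdoff_int in the listing */

/* Model of the 1 Hz ISR body: returns true if the hardware
   watchdog would be reset on this tick. */
bool holdoff_tick(void)
{
    wd_holdoff--;
    return wd_holdoff > 0;
}

/* Main loop takes loop_secs (> 0) per pass and reloads the holdoff at
   the start of each pass. Returns the second at which the holdoff
   first expires, or -1 if it survives total_secs. */
int run_model(int holdoff_secs, int loop_secs, int total_secs)
{
    wd_holdoff = (int16_t)holdoff_secs;
    for (int t = 1; t <= total_secs; t++) {
        if (!holdoff_tick()) return t;
        if (t % loop_secs == 0) wd_holdoff = (int16_t)holdoff_secs;
    }
    return -1;
}

void demo(void)
{
    /* 9830400 / 1024 = 9600 counts per second, 65536 - 9600 = 55936 */
    assert(timer1_preload(9830400UL, 1024UL, 1UL) == 55936);
    assert(run_model(15, 10, 100) == -1);  /* 10 s loop, 15 s holdoff: OK */
    assert(run_model(15, 20, 100) == 15);  /* 20 s loop: expires at 15 s */
    assert(run_model(10, 10, 100) == 10);  /* no margin: expires */
}
```

The last case in demo() is why the holdoff needs some margin over the loop time: with both set to 10 s, the tick can reach zero just before the main loop gets around to reloading the counter.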

_________________
Neil
MWS (Bascom Member, joined 22 Aug 2009, posts: 2262)
Posted: Sun Sep 15, 2019 10:16 am

njepsen wrote:
In this code, a rogue crash would have to stop timer T1 to defeat the watchdog.

It is sufficient to run into CLI.
Code:
goto 0

This is not the same as a hardware WD reset, a goto 0 will not reset accidentally altered registers, while a hardware WD reset will.
njepsen (Bascom Member, joined 13 Aug 2007, posts: 469)
Posted: Sun Sep 15, 2019 10:48 pm

Quote:
It is sufficient to run into CLI.


Actually, MWS, I don't think that is true (assuming the built-in WD is running), because if either T1 stops or CLI is executed somehow, the chip's built-in 2 s WD would fire.


Quote:
This is not the same as a hardware WD reset, a goto 0 will not reset accidentally altered registers, while a hardware WD reset will.


The "goto 0" is not needed; I forgot to take it out after testing. With this instruction removed, the 2 s WD will fire if wd_holdoff_int is not reset (and any corrupted registers will get reset).

If I replaced the goto instruction with "wait 10", that would achieve the same outcome (an inevitable WD reset) and would not allow the system to recover inside the WD interval by re-assigning a new wd_holdoff_int.
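
To make the argument about CLI and the 2 s hardware WD concrete, here is a rough timeline model in C that runs on a PC. It is a sketch under stated assumptions, not a measurement: simulate is a hypothetical function, the hardware WD is treated as exactly 2048 ms and as running from time 0 (the real one runs from its own oscillator, is only approximate, and is first started in the ISR), the Timer1 tick is exactly 1000 ms, and the ISR has no "goto 0".

```c
#include <assert.h>
#include <stdbool.h>

#define HW_WDT_MS 2048   /* Config Watchdog = 2048 */
#define TICK_MS   1000   /* Timer1 overflow period */

/* holdoff_s:       value the main loop loads into the holdoff counter
   refresh_every_s: main loop reloads the holdoff this often (0 = hung)
   cli_at_ms:       time at which interrupts get disabled (-1 = never)
   Returns the time in ms at which the hardware WD resets the chip,
   or -1 if no reset happens before end_ms. */
int simulate(int holdoff_s, int refresh_every_s, int cli_at_ms, int end_ms)
{
    int  holdoff = holdoff_s;
    int  last_kick_ms = 0;
    bool irq_enabled = true;

    for (int t = 1; t <= end_ms; t++) {
        if (cli_at_ms >= 0 && t >= cli_at_ms) irq_enabled = false;

        if (t % TICK_MS == 0) {
            if (irq_enabled) {                 /* Timer1 ISR runs */
                holdoff--;
                if (holdoff > 0) last_kick_ms = t;   /* reset watchdog */
            }
            int sec = t / TICK_MS;             /* main loop reload */
            if (refresh_every_s > 0 && sec % refresh_every_s == 0)
                holdoff = holdoff_s;
        }
        if (t - last_kick_ms >= HW_WDT_MS) return t;  /* hardware reset */
    }
    return -1;
}

void demo(void)
{
    assert(simulate(15, 10, -1, 60000) == -1);      /* healthy system */
    assert(simulate(15, 0, -1, 60000) == 16048);    /* main loop hung */
    assert(simulate(15, 10, 3500, 60000) == 5048);  /* CLI at 3.5 s */
}
```

In this model a hung main loop gets a hardware reset about 2 s after the last allowed kick at 14 s, and a stray CLI gets one about 2 s after the last ISR that ran. Both are real hardware resets rather than a jump to address 0.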

_________________
Neil
All times are GMT + 1 Hour