5

I realize this is a somewhat vague question, but I have a Python script that needs to run for two years on a Raspberry Pi, and it is failing after about 3 hours. Without getting into too much detail about what the script does (I'm not sure the script itself is at fault), what is interesting is that the process appears to stop dead in its tracks: no warnings, errors or failures are generated when it fails. The process just stops and breaks my terminal session, i.e., I can't enter any more commands when it happens. The process also disappears from the list of processes the Pi is running (as shown by the top command).

Anybody have any idea what might be going on? Is there any reason the script would just stop after some time? I'm more than happy to post extensive details about what the script does if need be, I just thought it might have something more to do with how it's interacting with the OS.

This is how I am running the script:

python animation.py &

Running a Model B+ 512MB, connected to the internet via WiFi, powering the Pi via USB.

UPDATE:

I tried running the script from my Mac, and the same thing happened about 3 hours in. This time, the program didn't disappear from the process list; it remained in a sleeping state and its CPU usage dropped to 0%, while the screen where I was watching the stdout seemed to be frozen. I am doing some serial communication with the script. Is it possible it's getting hung up waiting on a response?
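A read with no timeout blocks for as long as the other side stays silent, and a blocked process shows up exactly as described: sleeping, at 0% CPU. The following is a minimal sketch of a read that gives up after a deadline, using only the standard library on a plain file descriptor. The function name `read_with_timeout` is hypothetical, and how the real script opens its serial port is not known; if it happens to use pySerial (an assumption), the `timeout` argument of `serial.Serial` serves the same purpose.

```python
import os
import select


def read_with_timeout(fd, max_bytes, timeout_s):
    """Return up to max_bytes from fd, or None if nothing arrives in time."""
    # select() waits until fd is readable or timeout_s seconds have passed.
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        return None  # the device never answered; the caller can log and retry
    return os.read(fd, max_bytes)
```

Logging every `None` result with a timestamp would show whether the freeze lines up with a missing reply.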

Collin Schupman
  • We are going to need to see the script (please edit your question to add it). We also need to know what model Pi you are using, how you are powering it, and whether you are connecting via Ethernet or WiFi. You mention no errors etc.; have you tailed the syslog? – Steve Robillard Mar 11 '15 at 17:14
  • "I'm not sure the script itself is at fault" -- of course, and presumably that's why you haven't bothered to include any logging or debugging, by the sounds of it. Why would you need to? I'm sure the fault lies with either the python interpreter, the operating system, or the hardware. ;\ – goldilocks Mar 11 '15 at 17:57
  • @goldilocks - Perhaps through your snarky, sarcastic tone you're suggesting I advance my logging, debugging efforts? – Collin Schupman Mar 11 '15 at 18:13
  • @SteveRobillard - The script itself is about 300 lines long, includes some classes I created and about 100 lines dropped down to cython, so posting the entirety is not an option. I'm having a hard time finding critical parts of the program that may be the culprit, as the PROGRAM DOESN'T FAIL AT THE SAME PLACE, DOESN'T GENERATE ANY ERRORS WHEN IT FAILS and ISN'T CAUGHT BY THE COUPLE ERRORS I'M TRYING TO CATCH in the script. I haven't tried tailing the syslog. – Collin Schupman Mar 11 '15 at 18:22
  • Without this information we can not really help. – Steve Robillard Mar 11 '15 at 18:29
  • @CollinSchupman Yep, although it now sounds like you have made an effort in that direction. Always good to be explicit about these things from the top! I do feel your pain -- no one is likely to pore over a 300 line script, and this is a kind of "hail mary" post. But there's not much to go on here (what would you think?). I'm curious about a detail: "the process just stops and breaks my terminal session"... – goldilocks Mar 11 '15 at 18:30
  • You've indicated you've forked this into the background (&), so are you saying the shell where it was started suddenly becomes unresponsive at the same time as the script fails and disappears from top? Also, you aren't doing this via ssh by any chance? – goldilocks Mar 11 '15 at 18:30
  • @goldilocks - Yes. I have the script running on Pi startup (which I ran overnight), but I do use ssh to observe the stdout and watch the processes through top while editing it during the day. Both ways, it seems to fail in apparently the same way. During the night, I redirected the stdout to a .txt file (> log.txt) and again, it didn't generate any errors or messages when it failed, it just seemed to stop. – Collin Schupman Mar 11 '15 at 18:42
  • Post your script for better understanding – guyd Apr 26 '18 at 13:56
  • You say it fails around 3 hours. Does it fail at exactly the same length of time every time you run it? – NomadMaker Apr 26 '18 at 18:43

3 Answers

10

Not quite an answer but a guess, since this is a pretty vague question.

I'm presuming something you are starting with the intent of having it run for years is also intended to outlast the login session which started it -- unless you start it via the init system, which you don't refer to in the question.

If/when you are starting it from a login (including ssh), simply back-grounding something is not sturdy enough. You also have to take care of a few things:

  • Making sure the process is properly re-parented by init.
  • Cutting off standard input and output streams, if you aren't otherwise redirecting those.

So,

setsid python animation.py < /dev/zero &> /dev/null &

See man setsid -- this ensures the forked process will be re-parented by init. The other stuff is input/output redirection (the output you probably actually want to send to a log instead of /dev/null).
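The same setup can be sketched from Python for anyone who prefers a small launcher script over a shell one-liner. This is only an illustration under stated assumptions: `launch_detached` and the log file name are hypothetical, and `start_new_session=True` is the standard `subprocess` option that makes the child call `setsid()` before it runs.

```python
import subprocess
import sys


def launch_detached(argv, log_path):
    """Start argv in its own session, stdin cut off, output appended to a log."""
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,   # no input from the terminal
            stdout=log,                 # keep output in a log, not /dev/null
            stderr=subprocess.STDOUT,   # errors go to the same log
            start_new_session=True,     # child calls setsid()
        )
    finally:
        log.close()  # the child holds its own copy of the descriptor
    return proc


# Hypothetical usage:
# launch_detached([sys.executable, "animation.py"], "animation.log")
```

Sending stderr to the same log matters here, since a Python traceback is written to stderr and would otherwise be lost.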

If this doesn't solve your problem, and/or you want a way to monitor the process over a long period of time, have a look at plog.

goldilocks
  • What if I fork the process on startup? Instead of through an ssh session? Should these same precautions be taken? – Collin Schupman Mar 11 '15 at 18:56
  • If you're starting it from an init script or rc.local you don't need to use setsid but you should take care of the input and output. – goldilocks Mar 11 '15 at 19:08
  • Thanks. Also (out of curiosity), why do you need to take steps to cut off the input/output streams? – Collin Schupman Mar 11 '15 at 19:54
  • Without that, the controlling terminal may hang (although setsid should take care of that -- this may be overkill...). It's more of a pattern when using nohup (<- read that); notice the problem is either the terminal hanging or the terminal exiting and the process receiving SIGHUP. Setsid is a better choice IMO because it's orphaned right away and adopted by init (i.e., it is properly daemonized). You might want to look into screen too. – goldilocks Mar 12 '15 at 10:23
2

Just something I ran across a few weeks ago. I have seen people with similar problems when building a hydroponics system. It turned out that variables were being incremented past the range their type could hold. When a program runs for an extended period of time, I have seen that this can be the issue in a few cases. I think (as a workaround) they used the "long long" type or an "unsigned int" so that the incremented values could grow larger before overflowing.
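For a Python script, this only applies to fixed-width types, since Python's own int grows as needed. A short sketch of the difference, using `ctypes` to stand in for a 32-bit C integer such as a Cython `cdef int` typically is (the helper name `wrap_int32` is hypothetical):

```python
import ctypes
import sys


def wrap_int32(n):
    """Return n as a signed 32-bit C int would store it (ctypes does no overflow check)."""
    return ctypes.c_int32(n).value


# A plain Python int never overflows; it just gets bigger.
counter = 2**31 - 1
counter += 1
print(counter)                 # 2147483648

# The same value in a 32-bit C int wraps around to a negative number.
print(wrap_int32(2**31))       # -2147483648

# A float holding a time.time() value is nowhere near the float limit.
print(sys.float_info.max > 1e300)  # True
```

So the place to look for this kind of bug would be any counters declared with C types in the compiled part of the script, not the pure Python variables.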

Justin C.
  • Since we can't see your code, we have to throw best guesses at you. Here is one. – Justin C. Mar 11 '15 at 18:48
  • This might be the issue. I have a section of my code where I am updating a variable based on how many seconds have passed since the program started, i.e., every 30'ish seconds it gets larger, by a lot. I could easily see this getting past the size of an int :) – Collin Schupman Mar 11 '15 at 19:01
  • Keep us posted! – Justin C. Mar 11 '15 at 22:32
  • I double checked the type. It turned out to be a float, and it's basically just a copy of time.time() at a certain time in the program. Don't think that's overflowing ;) – Collin Schupman Mar 11 '15 at 22:34
1

Run your code with a profiler and debugger. Something in the script or how it's set up is causing the script to fail.
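When a failure only shows up after hours, it can also help to have the script record its own crash information. Below is a minimal sketch, assuming Python 3 on Linux or macOS; the function name `install_crash_logging` and the file paths are hypothetical. It uses the standard `faulthandler` module to dump a traceback on fatal signals such as a segfault in compiled code, registers SIGUSR1 so a hung process can be asked where it is stuck, and logs uncaught exceptions.

```python
import faulthandler
import logging
import signal
import sys

_fault_file = None  # kept open for the life of the process; faulthandler writes to it


def install_crash_logging(log_path, fault_path):
    """Send uncaught exceptions to log_path and fatal-signal tracebacks to fault_path."""
    global _fault_file
    logging.basicConfig(filename=log_path, level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    _fault_file = open(fault_path, "a")
    faulthandler.enable(file=_fault_file)                    # SIGSEGV, SIGFPE, SIGABRT, ...
    faulthandler.register(signal.SIGUSR1, file=_fault_file)  # dump on demand

    def log_uncaught(exc_type, exc, tb):
        logging.critical("uncaught exception", exc_info=(exc_type, exc, tb))

    sys.excepthook = log_uncaught
```

With this in place, `kill -USR1 <pid>` on a process that looks frozen appends the current stack of every thread to the fault file without stopping the script.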

My best guess is a memory leak or a variable overflowing.

Especially with C code involved, my mind jumps to memory leaks. It is easy to forget who is supposed to free memory that comes out of a function. And automatic garbage collectors are not perfect.
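One cheap way to test the leak theory is to log the process's memory use at intervals and see whether it climbs over the three hours. A rough sketch using the standard `resource` module follows (the function names are hypothetical). It reports peak resident set size, which also covers memory allocated by C code, something Python-level tools like `tracemalloc` would not fully see.

```python
import resource
import sys
import time


def peak_rss_kib():
    """Peak resident set size in KiB (ru_maxrss is KiB on Linux, bytes on macOS)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak // 1024 if sys.platform == "darwin" else peak


def log_memory(stream):
    """Append one timestamped memory reading to an open text stream."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    stream.write("%s peak_rss_kib=%d\n" % (stamp, peak_rss_kib()))
    stream.flush()  # make sure the line survives a sudden death
```

Calling `log_memory` once per main-loop iteration, or every minute, gives a series where a steady climb toward the Pi's 512MB would point at a leak.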

A variable overflowing is a distinct possibility. Why are you increasing a variable every 30 seconds or so? Wouldn't it be easier just to calculate the value when you need it?

Posting large code has been done before. Unless you post it, anything we say is just a guess.

NomadMaker