Work units crash computer and restart at zero

Message boards : Number crunching : Work units crash computer and restart at zero

admpicard999

Joined: 12 Feb 17
Posts: 1
Credit: 33,345
RAC: 0
United States
Message 934 - Posted: 20 Feb 2017, 4:25:20 UTC

I've had the same issue happen twice now. I wasn't able to write to the message boards before because of the anti-spam requirements. Thankfully one work unit did complete successfully between the two times so I'm able to post now.

With both Work Unit 835263 and Work Unit 848116, my computer spent the better part of a day (about 10 hours each, I believe) processing the work unit on multiple cores. With the first one, I shut my computer off because a thunderstorm was coming, and when I booted it back up, nearly 70 hours of computing time (10 hours on 7 cores) were lost because the progress had been reset to zero. With the more recent one, my computer shut off unexpectedly (this does not happen with any other BOINC projects), and when I turned it back on, the work unit was back at 0%. In both instances I aborted the work unit, and the website now shows it as having been worked on for only a few seconds or minutes instead of many hours on multiple cores.

Is there no checkpointing in this project, so that a work unit doesn't have to start over when the system restarts? If not, this seems like a serious oversight for work units with 12-hour expected runtimes (my laptop is a bit older, but still gets the job done). Is there some other reason why the work units would start over when they were more than 70% complete?

yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 726
Credit: 16,445,605
RAC: 0
Germany
Message 937 - Posted: 23 Feb 2017, 19:43:42 UTC - in response to Message 934.  

There is checkpointing, but it seems to happen only in a later phase of the WU's runtime.

The WU has a Cxxx in its name. A smaller xxx means a shorter runtime. The longest runtimes belong to work units where xxx is 135 or higher. Maybe you should try to get some of the smaller ones.
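
As a rough illustration of the naming rule above, the following Python sketch pulls the number after the "C" out of a work unit name and flags the ones at or above 135. The exact name layout is an assumption (only "a Cxxx in its name" is stated here), the example names are made up, and the helper names `c_number` and `is_long_running` are hypothetical.

```python
import re

# Threshold taken from the reply above: xxx of 135 or higher means the longest runtimes.
LONG_RUNNING_THRESHOLD = 135


def c_number(wu_name):
    """Return the digits following the first 'C' in wu_name as an int, or None.

    Assumption: the size marker appears as an uppercase 'C' directly followed
    by digits somewhere in the name. The real naming scheme may differ.
    """
    match = re.search(r"C(\d+)", wu_name)
    return int(match.group(1)) if match else None


def is_long_running(wu_name):
    """True if the name carries a C number at or above the threshold."""
    n = c_number(wu_name)
    return n is not None and n >= LONG_RUNNING_THRESHOLD


if __name__ == "__main__":
    # Made-up names, for illustration only.
    for name in ["example_C120_001", "example_C135_002", "no_marker_here"]:
        print(name, c_number(name), is_long_running(name))
```

A check like this only helps with deciding which tasks to abort or keep by hand; it does not change which work units the server sends.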

Datenschutz / Privacy Copyright © 2011-2024 Rechenkraft.net e.V. & yoyo