When should a "stalled" task be aborted?
Author | Message |
---|---|
Gunnar Hjern Joined: 19 Oct 17 Posts: 4 Credit: 1,065,909 RAC: 0 |
Hi! For about a week, one of my computers has been working on one and the same task, and it is still running! The task is already marked as "Timed out - no response", as its final deadline was 2 Dec 2018, 0:04:47 UTC. This is a bit surprising, as the task was from the "YAFU for small composites" bin, and I thought it would complete in hours rather than days. Facts: task: yafu_ali_1294872_L86_C68_1543104023_10_0; sent: 25 Nov 2018, 0:04:47 UTC; computer CPU: Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [Family 6 Model 15 Stepping 10]; computer nr: 38654. When should I give up and abort such tasks? //Gunnar |
yoyo_rkn Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 22 Aug 11 Posts: 734 Credit: 17,574,526 RAC: 510 |
I would say a "small" task should not run longer than 24 hours. In general, the files in a workunit's slot directory keep changing while it runs. If the most recent change to any file is more than 5 hours old, the workunit is probably dead. You can also check whether the workunit still consumes CPU. Do you use CPU throttling in BOINC? |
Gunnar Hjern Joined: 19 Oct 17 Posts: 4 Credit: 1,065,909 RAC: 0 |
Hi! Thanks for the fast reply! :-) Yes, both CPU cores are still working at 100%. No, I don't throttle: BOINC gets 100% CPU. I've found a directory, /var/lib/boinc-client/slots/2/, but most of the files in there seem to date from the start of the task. There is ONE file, "boinc_mmap_file", that seems to change every minute or so, but the rest are 5 or 7 days old. The next newest file is "init_data.xml", from Nov 27; all the other files are from Nov 25. //Gunnar |
yoyo_rkn Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 22 Aug 11 Posts: 734 Credit: 17,574,526 RAC: 510 |
The workunit seems to be dead. Which process is using CPU? |
Gunnar Hjern Joined: 19 Oct 17 Posts: 4 Credit: 1,065,909 RAC: 0 |
Hi! top says the process is named "yafu" and is running two threads. The command $ ps -aux gave the following information: "boinc 19047 197 0.7 112968 14328 ? SNl nov25 21898:42 yafu -threads 2 -batchfile in" Should I abort the WU now? //Gunnar |
yoyo_rkn Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 22 Aug 11 Posts: 734 Credit: 17,574,526 RAC: 510 |
I would say abort it. |
Gunnar Hjern Joined: 19 Oct 17 Posts: 4 Credit: 1,065,909 RAC: 0 |
Hi! Yes, I've done so now, and I also flushed another task that had run for nearly three days and showed the same signs: looking at the processes (with 'top' on Linux), I did not find the usual "gnfs-lasieve4I1", only a single "yafu" that seems to occupy 195% or 390% CPU (depending on the number of cores). That task was yafu_ali_1468926_L85_C71_1543572912_18_0. Have a nice new week! //Gunnar |
yoyo_rkn Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 22 Aug 11 Posts: 734 Credit: 17,574,526 RAC: 510 |
After gnfs it is normal that only yafu is running, to combine the results. But this should not take days, or even hours, for yafu small. |
Matthias Lehmkuhl Joined: 7 Oct 11 Posts: 34 Credit: 2,419,370 RAC: 223 |
Hello, I too have a result that looks like it is running forever: yafu_ali_2145600_L131_C124_1571470508_8_5 https://yafu.myfirewall.org/yafu/result.php?resultid=5325104 The result is running with one gnfs-lasieve4I13e.exe (one CPU), now at 114 hours. The first result was started on 19.10.2019, and no result has finished so far. Matthias |
Chooka Joined: 4 Mar 19 Posts: 11 Credit: 28,616,045 RAC: 0 |
I'm going to ask the same question, sorry... How long do I wait before I abort a task? I have a couple of 4t tasks on different PCs that have been crunching for 2 days, sitting at 100%. How much longer do I wait? It's the one issue with this project that annoys me. This isn't the first time this has happened. |
yoyo_rkn Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 22 Aug 11 Posts: 734 Credit: 17,574,526 RAC: 510 |
|
vaughan Joined: 1 Sep 11 Posts: 9 Credit: 13,483,012 RAC: 742 |
|
vaughan Joined: 1 Sep 11 Posts: 9 Credit: 13,483,012 RAC: 742 |
|
marsinph Joined: 1 Apr 18 Posts: 22 Credit: 715,524 RAC: 0 |
It seems that nobody is reading this thread as no reply from Admin in 2 days. Time to move to another project then. Yes, I am sure there are people who read it, but they do not answer. |
Speedy51 Joined: 25 Jan 12 Posts: 23 Credit: 1,529,974 RAC: 0 |
Why are there so many Yafu tasks that get to 100% but never finish? I haven't had what you describe happen to me. To fix this, I would recommend exiting BOINC for 30 seconds or so and then restarting it. The task will start again from the most recent checkpoint. If the percentage drops from 100 when the task restarts, let it run and hopefully it will complete and upload. |
vaughan Joined: 1 Sep 11 Posts: 9 Credit: 13,483,012 RAC: 742 |
|
Speedy51 Joined: 25 Jan 12 Posts: 23 Credit: 1,529,974 RAC: 0 |
I aborted 7905872 on 27 Mar 2022, 17:35:52 UTC because it had progressed as far as it could in the time my computer is on before it needs to be turned off. Each time, it started from a checkpoint in the .DAT file, but it would not reach the next phase/checkpoint before the computer had to be turned off again. I hope the person who currently has this task is able to complete it. |
Matthias Lehmkuhl Joined: 7 Oct 11 Posts: 34 Credit: 2,419,370 RAC: 223 |
Hello, I again have a long-running gnfs-lasieve4l3e thread, now with over 24 hours of CPU runtime (LatSieveTime). The other threads have no more than 750 seconds of LatSieveTime for this task, and no further threads are started until this thread has finished. From the log: "06/09/24 08:55:28 v1.34.5 @ MA64-003, nfs: commencing lattice sieving with 4 threads". The 3 other threads have been finished since 06/09/2024 09:04. To me it looks like the thread has "wrong" parameters and cannot finish. Task with this behaviour: https://yafu.myfirewall.org/yafu/result.php?resultid=8928821 Matthias |
kl8610 Joined: 22 Mar 24 Posts: 1 Credit: 13,664,387 RAC: 192 |
Hi, you can try an older BOINC version, as I see your computer is using the latest version, 8.0.2. |
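
The rule of thumb given earlier in the thread (a workunit is probably dead if nothing in its slot directory has changed for more than 5 hours) can be checked with a short script. What follows is only a minimal sketch, not a project tool: the function names are made up for illustration, and the decision to ignore "boinc_mmap_file" is an assumption based on Gunnar's report that this file kept changing every minute even though the task was dead. The slot path is the one from his post and may differ on other systems.

```python
import time
from pathlib import Path

# Threshold quoted in the thread: no file change for more than 5 hours.
STALE_AFTER_HOURS = 5.0

# Assumption: this file was seen changing every minute on a dead task,
# so it is ignored here because it says nothing about the science app.
IGNORED_NAMES = frozenset({"boinc_mmap_file"})


def newest_change_age_hours(slot_dir, now=None, ignored=IGNORED_NAMES):
    """Hours since the most recent file modification in slot_dir.

    Returns None when the directory holds no files worth looking at.
    """
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in Path(slot_dir).iterdir()
        if entry.is_file() and entry.name not in ignored
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def looks_stalled(slot_dir, threshold_hours=STALE_AFTER_HOURS, now=None):
    """True when no tracked file changed within the last threshold_hours."""
    age = newest_change_age_hours(slot_dir, now=now)
    return age is not None and age > threshold_hours


if __name__ == "__main__":
    slot = "/var/lib/boinc-client/slots/2"  # path taken from the thread
    if Path(slot).is_dir():
        print(slot, "looks stalled:", looks_stalled(slot))
```

A result of True is only a hint. As yoyo_rkn notes, it should be combined with a look at which process is using the CPU before aborting anything.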
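
The ps line Gunnar posted already shows how much CPU time the task had used. The sketch below turns such a line into hours. It assumes the BSD-style column order of `ps aux` on Linux (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) and that the TIME column holds cumulative CPU time as minutes:seconds, optionally with hours or a day prefix. The function names are hypothetical, and the wall-clock estimate relies on %CPU being CPU time divided by elapsed time.

```python
def parse_ps_cpu_time(field):
    """Convert a ps TIME field to seconds.

    Accepts "MMM:SS", "HH:MM:SS" and "DD-HH:MM:SS".
    """
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    parts = [int(piece) for piece in field.split(":")]
    if len(parts) == 2:
        hours = 0
        minutes, seconds = parts
    elif len(parts) == 3:
        hours, minutes, seconds = parts
    else:
        raise ValueError("unrecognised TIME field: %r" % field)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def summarize_ps_line(line):
    """Pull PID, %CPU, CPU hours and the command out of one `ps aux` line."""
    # Assumed columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError("expected 11 columns in ps output")
    cpu_percent = float(fields[2])
    cpu_hours = parse_ps_cpu_time(fields[9]) / 3600.0
    # %CPU is CPU time over elapsed time, so this recovers the elapsed time.
    wall_hours = cpu_hours / (cpu_percent / 100.0) if cpu_percent > 0 else None
    return {
        "pid": int(fields[1]),
        "cpu_percent": cpu_percent,
        "cpu_hours": cpu_hours,
        "approx_wall_hours": wall_hours,
        "command": fields[10],
    }


if __name__ == "__main__":
    line = ("boinc 19047 197 0.7 112968 14328 ? SNl nov25 21898:42 "
            "yafu -threads 2 -batchfile in")
    info = summarize_ps_line(line)
    print("PID %(pid)d: %(cpu_hours).1f CPU-hours for '%(command)s'" % info)
```

For the line from the thread, 21898 minutes and 42 seconds is roughly 365 CPU-hours across two threads, which is far beyond the 24 hours suggested as the limit for a "small" task.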