When should a "stalled" task be aborted?

Gunnar Hjern

Joined: 19 Oct 17
Posts: 4
Credit: 1,065,909
RAC: 0
Sweden
Message 1314 - Posted: 2 Dec 2018, 13:39:01 UTC

Hi!

For about a week now, one of my computers has been working on one and the same task, and it is still running!
The task is already marked as "Timed out - no response", as its final deadline was 2 Dec 2018, 0:04:47 UTC.
This is a bit surprising, as the task was from the "YAFU for small composites" bin, and I thought it would complete in hours rather than in days.

Facts:
task: yafu_ali_1294872_L86_C68_1543104023_10_0
sent: 25 Nov 2018, 0:04:47 UTC
computer CPU: Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [Family 6 Model 15 Stepping 10]
computer nr: 38654

When should I give up and abort such tasks??

//Gunnar
ID: 1314 · Rating: 0
yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 736
Credit: 17,612,101
RAC: 76
Germany
Message 1315 - Posted: 2 Dec 2018, 14:59:30 UTC - in response to Message 1314.  

I would say a "small" task should not run longer than 24 hours.
In general, the files in a workunit's slot directory keep changing while it runs. If the last change to a file is more than 5 hours back, the workunit is probably dead.
You can also check whether the workunit still consumes CPU.

Do you use CPU throttling in BOINC?
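
The slot-directory check can be sketched in Python as follows. This is only a minimal sketch, not a project tool: the function names, the `ignore` parameter and the default of skipping `boinc_mmap_file` are assumptions (the last one is based on the report further down this thread that this file kept changing every minute on a task that was judged dead). The 5-hour limit is the rule of thumb given above.

```python
import os
import time


def newest_change_age_hours(slot_dir, ignore=("boinc_mmap_file",), now=None):
    """Hours since the most recent modification of any file in slot_dir.

    Files whose names are listed in `ignore` are skipped (assumption: see
    the note above). Returns None if no file is left to look at.
    """
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(slot_dir)
        if entry.is_file() and entry.name not in ignore
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def looks_stalled(slot_dir, limit_hours=5.0, **kwargs):
    """True if nothing in slot_dir changed within the last limit_hours."""
    age = newest_change_age_hours(slot_dir, **kwargs)
    return age is not None and age > limit_hours
```

On a Linux package install, the directory to pass would be something like `/var/lib/boinc-client/slots/2/`; reading it may require the permissions of the BOINC user. A `True` result is only a hint and should be combined with the CPU check.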
ID: 1315 · Rating: 0
Gunnar Hjern

Joined: 19 Oct 17
Posts: 4
Credit: 1,065,909
RAC: 0
Sweden
Message 1316 - Posted: 2 Dec 2018, 15:39:08 UTC - in response to Message 1315.  

Hi!

Thanks for the fast reply! :-)

Yes, both CPU cores are still working at 100%.

No, I don't throttle: BOINC gets 100% CPU.

I've found a directory:
/var/lib/boinc-client/slots/2/
but most of the files in there seem to be from the start date of the task.
There is ONE file, "boinc_mmap_file", that seems to change every minute or so, but the rest are 5 or 7 days old.
The next newest file is "init_data.xml", which dates from Nov 27; all the other files are from Nov 25.

//Gunnar
ID: 1316 · Rating: 0
yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 736
Credit: 17,612,101
RAC: 76
Germany
Message 1317 - Posted: 2 Dec 2018, 15:57:41 UTC - in response to Message 1316.  

The workunit seems to be dead.
Which process is using CPU?
ID: 1317 · Rating: 0
Gunnar Hjern

Joined: 19 Oct 17
Posts: 4
Credit: 1,065,909
RAC: 0
Sweden
Message 1318 - Posted: 2 Dec 2018, 17:16:33 UTC - in response to Message 1317.  

Hi!

Top says the process is named "yafu" and is running two threads.
Running the "ps -aux" command gave the following information:
"boinc 19047 197 0.7 112968 14328 ? SNl nov25 21898:42 yafu -threads 2 -batchfile in"
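
As a worked example, the `ps` line above can be turned into hours and days. The sketch below is only illustrative: the function name is hypothetical, and it assumes the usual `ps aux` column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) with the TIME column in minutes:seconds, as procps prints it in this BSD-style output.

```python
def parse_ps_aux_line(line):
    """Parse one `ps aux` line (assumed column layout, see above)."""
    fields = line.split(None, 10)
    if len(fields) != 11:
        raise ValueError("unexpected number of columns in ps line")
    # Assumption: TIME is accumulated CPU time as MMM:SS.
    minutes, seconds = fields[9].split(":")
    return {
        "pid": int(fields[1]),
        "cpu_percent": float(fields[2]),
        "cpu_seconds": int(minutes) * 60 + int(seconds),
        "command": fields[10],
    }


line = "boinc 19047 197 0.7 112968 14328 ? SNl nov25 21898:42 yafu -threads 2 -batchfile in"
info = parse_ps_aux_line(line)

cpu_hours = info["cpu_seconds"] / 3600
# ps reports %CPU as CPU time divided by elapsed time, so dividing the
# CPU time by that ratio gives an estimate of the wall-clock time.
wall_days = cpu_hours / (info["cpu_percent"] / 100) / 24

print(f"{cpu_hours:.0f} CPU hours, roughly {wall_days:.1f} days of wall-clock time")
# prints "365 CPU hours, roughly 7.7 days of wall-clock time"
```

That matches a process started on Nov 25 and still running on Dec 2, far beyond the 24 hours suggested for a "small" task.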

Should I abort the WU now?

//Gunnar
ID: 1318 · Rating: 0
yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 736
Credit: 17,612,101
RAC: 76
Germany
Message 1319 - Posted: 2 Dec 2018, 18:32:33 UTC - in response to Message 1318.  

I would say abort it.
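
Aborting can be done in BOINC Manager (select the task and press Abort) or from the command line with `boinccmd --task <project URL> <task name> abort`. A minimal Python sketch of the latter follows. The helper names are hypothetical, and the project URL is an assumption derived from the result links in this thread; check the URL your client actually shows for the project. Depending on the setup, `boinccmd` may also need to be run from the BOINC data directory or be given a password.

```python
import subprocess

# Assumption: master URL guessed from the result links in this thread.
PROJECT_URL = "https://yafu.myfirewall.org/yafu/"


def abort_task_command(task_name, project_url=PROJECT_URL):
    """Build the boinccmd argument list for aborting one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def abort_task(task_name, project_url=PROJECT_URL):
    """Run boinccmd; raises CalledProcessError if it reports failure."""
    return subprocess.run(abort_task_command(task_name, project_url), check=True)
```

For the task in this thread the call would be `abort_task("yafu_ali_1294872_L86_C68_1543104023_10_0")`.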
ID: 1319 · Rating: 0
Gunnar Hjern

Joined: 19 Oct 17
Posts: 4
Credit: 1,065,909
RAC: 0
Sweden
Message 1320 - Posted: 2 Dec 2018, 23:36:35 UTC - in response to Message 1319.  

Hi!

Yes, I've done so now, and I even flushed another task that had run for nearly three days and showed the same signs:
i.e., when looking at the processes (with 'top' on Linux), I did not find the usual "gnfs-lasieve4I1", but only a single "yafu"
that seems to occupy 195% or 390% CPU (depending on the number of cores).
That task was: yafu_ali_1468926_L85_C71_1543572912_18_0.

Have a nice new week!

//Gunnar
ID: 1320 · Rating: 0
yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 736
Credit: 17,612,101
RAC: 76
Germany
Message 1321 - Posted: 3 Dec 2018, 6:55:44 UTC - in response to Message 1320.  

After gnfs it is normal that only yafu is running to combine the results. But this should not take days, or even hours, for yafu small.
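
To tell the two situations apart, the process names seen in `top` or `ps` can be classified roughly as below. This is a sketch only: the function name and the phase labels are made up for illustration, and matching on the prefixes "gnfs-lasieve" and "yafu" is an assumption based on the process names quoted in this thread.

```python
def describe_phase(process_names):
    """Guess what a yafu task is doing from the names of running processes."""
    names = list(process_names)
    if any(name.startswith("gnfs-lasieve") for name in names):
        return "sieving"      # a gnfs siever is still at work
    if any(name.startswith("yafu") for name in names):
        return "combining"    # only yafu itself is left
    return "not running"
```

A task that stays in "combining" for hours on a yafu small is the suspicious case described above.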
ID: 1321 · Rating: 0
Matthias Lehmkuhl

Joined: 7 Oct 11
Posts: 34
Credit: 2,443,777
RAC: 290
Germany
Message 1457 - Posted: 25 Nov 2019, 17:55:32 UTC

Hello,
I too have a result that looks like it is running forever:
yafu_ali_2145600_L131_C124_1571470508_8_5
https://yafu.myfirewall.org/yafu/result.php?resultid=5325104

The result is running with one gnfs-lasieve4I13e.exe (one CPU), now at 114 hours.

The first result was started on 19.10.2019, and no result has finished so far.
Matthias
ID: 1457 · Rating: 0
Chooka

Joined: 4 Mar 19
Posts: 11
Credit: 28,616,045
RAC: 0
Australia
Message 1548 - Posted: 30 Dec 2020, 15:30:15 UTC

I'm going to ask the same question, sorry... How long do I wait before I abort a task?
I have a couple of 4t tasks on different PCs that have been crunching for 2 days, sitting at 100%. How much longer do I wait?
It's the one issue with this project that annoys me. This isn't the first time this has happened.

ID: 1548 · Rating: 0
yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 736
Credit: 17,612,101
RAC: 76
Germany
Message 1550 - Posted: 31 Dec 2020, 21:57:04 UTC - in response to Message 1548.  

Do they still consume CPU? If so, let them run.
Which process consumes the CPU?
ID: 1550 · Rating: 0
vaughan

Joined: 1 Sep 11
Posts: 9
Credit: 13,786,383
RAC: 5,443
Australia
Message 1589 - Posted: 12 Dec 2021, 1:45:27 UTC

Why are there so many Yafu tasks that get to 100% but never finish?
ID: 1589 · Rating: 0
vaughan

Joined: 1 Sep 11
Posts: 9
Credit: 13,786,383
RAC: 5,443
Australia
Message 1590 - Posted: 14 Dec 2021, 22:39:09 UTC - in response to Message 1589.  

It seems that nobody is reading this thread as no reply from Admin in 2 days. Time to move to another project then.
ID: 1590 · Rating: 0
marsinph

Joined: 1 Apr 18
Posts: 22
Credit: 715,524
RAC: 0
Belgium
Message 1591 - Posted: 4 Jan 2022, 16:50:41 UTC - in response to Message 1590.  

It seems that nobody is reading this thread as no reply from Admin in 2 days. Time to move to another project then.



Yes, I am sure there are people who read it.
But they do not answer.
ID: 1591 · Rating: 0
Speedy51

Joined: 25 Jan 12
Posts: 23
Credit: 1,529,974
RAC: 0
New Zealand
Message 1596 - Posted: 30 Jan 2022, 0:41:35 UTC - in response to Message 1589.  

Why are there so many Yafu tasks that get to 100% but never finish?

I haven't had what you describe happen to me. To fix this, I would recommend exiting BOINC for 30 seconds or so and then restarting it. The task will start again from the most recent checkpoint. If the percentage drops from 100 when the task restarts, let it run and hopefully it will complete and upload.
ID: 1596 · Rating: 0
vaughan

Joined: 1 Sep 11
Posts: 9
Credit: 13,786,383
RAC: 5,443
Australia
Message 1603 - Posted: 20 Apr 2022, 22:46:10 UTC - in response to Message 1596.  

I aborted an ali task that had run for 3 days 5 hours, hogging all CPU threads. It was stuck at 100 percent for days. Useless application.
ID: 1603 · Rating: 0
Speedy51

Joined: 25 Jan 12
Posts: 23
Credit: 1,529,974
RAC: 0
New Zealand
Message 1604 - Posted: 20 Apr 2022, 23:09:14 UTC
Last modified: 20 Apr 2022, 23:10:32 UTC

I aborted 7905872 on 27 Mar 2022, 17:35:52 UTC because it had progressed as far as it could in the time my computer is on before it needs to be turned off. Each time, it restarted from a checkpoint in the .DAT file, but it would not reach the next phase/checkpoint before the computer had to be turned off again. I hope the person who currently has this task is able to complete it.
ID: 1604 · Rating: 0
Matthias Lehmkuhl

Joined: 7 Oct 11
Posts: 34
Credit: 2,443,777
RAC: 290
Germany
Message 1779 - Posted: 11 Jun 2024, 15:05:39 UTC

Hello,
I again have a long-running gnfs-lasieve4l3e thread, now with over 24 hours of CPU runtime (LatSieveTime).
The other threads have no more than 750 seconds of LatSieveTime for this task,
and other threads are not started until this thread has finished.

06/09/24 08:55:28 v1.34.5 @ MA64-003, nfs: commencing lattice sieving with 4 threads
The 3 other threads have been finished since 06/09/2024 09:04.

To me it looks like the thread has "wrong" parameters with which it cannot finish.
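
A runaway siever thread like this can be spotted by comparing the per-thread sieve times. The sketch below is only an illustration: the function name and the factor of 10 are arbitrary assumptions, and the sample values are the bounds mentioned above (at most 750 seconds for three threads, at least 24 hours for the fourth), not values read from a log.

```python
from statistics import median


def long_running_threads(sieve_seconds, factor=10.0):
    """Indices of threads whose time exceeds factor times the median."""
    typical = median(sieve_seconds)
    return [i for i, s in enumerate(sieve_seconds) if s > factor * typical]


# Illustrative values taken from the bounds in the post above.
times = [750, 750, 750, 24 * 3600]
print(long_running_threads(times))  # prints [3]
```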

Task with this behaviour
https://yafu.myfirewall.org/yafu/result.php?resultid=8928821
Matthias
ID: 1779 · Rating: 0
kl8610

Joined: 22 Mar 24
Posts: 1
Credit: 13,721,845
RAC: 2,142
Malaysia
Message 1780 - Posted: 13 Jun 2024, 6:42:14 UTC - in response to Message 1779.  

Hi, you could try an older BOINC version, as I see your computer is using the latest version, 8.0.2.
ID: 1780 · Rating: 0

Datenschutz / Privacy Copyright © 2011-2024 Rechenkraft.net e.V. & yoyo