Long running work units

AMDave
Volunteer moderator
Volunteer tester

Joined: 30 Aug 11
Posts: 41
Credit: 100,018
RAC: 0
Australia
Message 262 - Posted: 20 Nov 2011, 10:25:32 UTC
Last modified: 20 Nov 2011, 11:09:27 UTC

I have several work units that have been running for more than 40 hours.

One is about to pass the deadline and is still running.

What are the acceptable run times for the current batch of work units?

If the longer run time is acceptable, please extend the deadline.

If the longer run time is not acceptable, please activate the duration limit test to kill the work unit when it exceeds the acceptable time limit.

Edit: the longest-running WU has now exceeded its deadline and is still running.
Edit 2: the next one has also exceeded its deadline and is still running.
yoyo_rkn
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 22 Aug 11
Posts: 736
Credit: 17,612,101
RAC: 51
Germany
Message 263 - Posted: 20 Nov 2011, 11:16:35 UTC
Last modified: 20 Nov 2011, 11:17:00 UTC

You will get credit even if you report the result up to 5 days after the deadline.
Which WUs are running so long?
yoyo
AMDave
Message 264 - Posted: 20 Nov 2011, 11:18:50 UTC
Last modified: 20 Nov 2011, 11:43:37 UTC

Workunit 552683 (C99)
Workunit 552361 (C99)

I checked the server status in the task list.
Both come up with the error status "Too many total results",
and the task status message is "Timed out - no response".

Both clients have been refreshed with Update, and both work units are continuing without interruption.

I expected that once the server had set the work unit status to error, the project would terminate the work unit on the client.

On the project server side, there should also be a notification to the admin of this undesirable state, because it indicates that the work batch is exceeding acceptable parameters and requires admin intervention to resolve.

If the WU status is now "Error", how can we get credit for them?

I suspect that I should terminate these WUs immediately,
but I will await your advice.
yoyo_rkn
Message 265 - Posted: 20 Nov 2011, 14:59:48 UTC

Both are "Timed out - no response". They stay there for 5 more days. If you report them during these days, they will be validated and credited.
yoyo
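The rule described above can be written down as a small check. This is only a sketch: the 5-day window is taken from the post above, not from any server configuration, and the function name and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 5-day window comes from the description in this thread;
# it is not read from any project or server setting.
GRACE_PERIOD = timedelta(days=5)


def still_creditable(deadline: datetime, reported: datetime,
                     grace: timedelta = GRACE_PERIOD) -> bool:
    """Return True if a result reported at `reported` is inside the grace window."""
    return reported <= deadline + grace


# Hypothetical timestamps, for illustration only.
deadline = datetime(2011, 11, 20, 10, 0, tzinfo=timezone.utc)
print(still_creditable(deadline, deadline + timedelta(days=1)))  # inside the window
print(still_creditable(deadline, deadline + timedelta(days=6)))  # outside the window
```

Whether a report made exactly at the end of the window still counts is not stated in the thread; the sketch treats it as inside.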
AMDave
Message 266 - Posted: 20 Nov 2011, 20:11:49 UTC

OK.

Approaching 60 hours.

AMDave
Message 267 - Posted: 21 Nov 2011, 8:12:35 UTC
Last modified: 21 Nov 2011, 8:27:06 UTC

One of them completed at 239,496.30 seconds (66.5 hours).
The task record has updated as you described.
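For anyone who wants to check the conversion, here is a quick sketch using only the run time quoted in this post:

```python
from datetime import timedelta

# Run time as reported for the completed task, in seconds.
run_time = timedelta(seconds=239_496.30)

hours = run_time.total_seconds() / 3600
print(f"{hours:.1f} hours")  # 66.5 hours

# The same duration split into whole days and remaining whole hours.
print(run_time.days, "days plus", run_time.seconds // 3600, "hours")
```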

The other is still going.
But it looks like there may be a problem.
When I left home this morning the task showed about 58 hours duration.

Now when I check it, it says it has been running for only 1 hour 23 minutes.
I cross-checked the task ID and it is definitely the same one.
It looks like it stopped and restarted, and the BOINC Manager has dropped the 58+ hours already accounted for in the duration.

The task record probably still has the real duration value in it, as opposed to what the BOINC Manager says.
I have seen that before
(in which case it should complete soon).
We'll see what happens when it completes.
AMDave
Message 268 - Posted: 21 Nov 2011, 11:53:29 UTC

It appears to have finished and been awarded.
It aged off the task list immediately so I did not see the final duration.
However, the important observation is that the work was contributed without running forever or getting cancelled.
We can get stuck into those long running WUs with confidence.
Thanks Yoyo :)
Matthias Lehmkuhl

Joined: 7 Oct 11
Posts: 34
Credit: 2,445,252
RAC: 318
Germany
Message 434 - Posted: 10 Jul 2012, 9:28:14 UTC

I had a very long-running result.
I cancelled it today after it had gone 1 day without updating the log files or any other files.
http://yafu.dyndns.org/yafu/result.php?resultid=752333
It looks like it did not survive the d**** "no heartbeat" on Sunday 08.07.2012:
17:50:19 (2904): No heartbeat from core client for 30 sec - exiting

Matthias
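The "no files updated for a day" symptom can be checked with a few lines of Python. This is a rough sketch, not anything the project provides: the function names, the 12-hour threshold, and the `slots/0` path are all assumptions, so point it at whichever slot directory your BOINC client is actually using.

```python
import time
from pathlib import Path
from typing import Optional


def seconds_since_last_write(slot_dir: Path,
                             now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the newest file under slot_dir, or None if it holds no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in slot_dir.rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def looks_stalled(slot_dir: Path, max_idle_hours: float = 12.0,
                  now: Optional[float] = None) -> bool:
    """True if no file in slot_dir has been written for more than max_idle_hours."""
    idle = seconds_since_last_write(slot_dir, now)
    return idle is not None and idle > max_idle_hours * 3600


if __name__ == "__main__":
    slot = Path("slots/0")  # hypothetical location, adjust to your installation
    if slot.is_dir():
        print("stalled" if looks_stalled(slot) else "still writing files")
```

A long quiet period is only a hint; some phases of a factorization may simply write nothing for a while, so compare against the CPU time before cancelling anything.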
yoyo_rkn
Message 436 - Posted: 10 Jul 2012, 15:23:13 UTC - in response to Message 434.  

It would be good to save the slot directory before canceling the WU and send it to me so that I can check what the WU status was.
yoyo
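One way to take that snapshot is sketched below. The function name and both paths are placeholders chosen for illustration; the thread does not say where the slot directories live, so substitute the real location from your BOINC data directory.

```python
import shutil
import time
from pathlib import Path


def backup_slot_dir(slot_dir: Path, backup_root: Path) -> Path:
    """Copy slot_dir into a timestamped folder under backup_root and return the copy's path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{slot_dir.name}-{stamp}"
    shutil.copytree(slot_dir, target)  # also creates backup_root if it is missing
    return target


if __name__ == "__main__":
    slot = Path("slots/0")             # hypothetical slot directory
    if slot.is_dir():
        copy = backup_slot_dir(slot, Path("slot-backups"))
        # Pack the copy into a single zip file that is easy to send along.
        archive = shutil.make_archive(str(copy), "zip", root_dir=copy)
        print("saved", copy, "and", archive)
```

Take the copy before aborting the task or resetting the project, since both can clear the slot directory.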
Matthias Lehmkuhl

Message 441 - Posted: 11 Jul 2012, 9:55:00 UTC - in response to Message 436.  
Last modified: 11 Jul 2012, 10:01:03 UTC

Will it help if I send you the 3 log files (factor.log, ggnfs.log, nfs.log)?
I made a copy of these before resetting the project.
Sorry, I was a little bit too early yesterday :(
Matthias
yoyo_rkn
Message 442 - Posted: 11 Jul 2012, 15:23:04 UTC - in response to Message 441.  

Will it help if I send you the 3 log files (factor.log, ggnfs.log, nfs.log)?
I made a copy of these before resetting the project.
Sorry, I was a little bit too early yesterday :(

factor.log is sufficient.
Matthias Lehmkuhl

Message 458 - Posted: 19 Jul 2012, 12:56:49 UTC - in response to Message 442.  

Late again: there were no changes to the files in the slot directory for around the last 12 hours, while the CPU time of the process kept growing.
Here is the requested factor.log:

07/07/12 23:35:04 v1.30 @ host ID: 1397 ,
07/07/12 23:35:04 v1.30 @ host ID: 1397 , ****************************
07/07/12 23:35:04 v1.30 @ host ID: 1397 , Starting factorization of 8938861779072390809014308542579704644988289566455519966972540828521729483117056503398097479159550310460743973
07/07/12 23:35:04 v1.30 @ host ID: 1397 , using pretesting plan: normal
07/07/12 23:35:04 v1.30 @ host ID: 1397 , no tune info: using qs/gnfs crossover of 95 digits
07/07/12 23:35:04 v1.30 @ host ID: 1397 , ****************************
07/07/12 23:35:04 v1.30 @ host ID: 1397 , pp1: starting B1 = 20K, B2 = gmp-ecm default on C109
07/07/12 23:35:04 v1.30 @ host ID: 1397 , pp1: starting B1 = 20K, B2 = gmp-ecm default on C109
07/07/12 23:35:04 v1.30 @ host ID: 1397 , pp1: starting B1 = 20K, B2 = gmp-ecm default on C109
07/07/12 23:35:04 v1.30 @ host ID: 1397 , pm1: starting B1 = 100K, B2 = gmp-ecm default on C109
07/07/12 23:35:05 v1.30 @ host ID: 1397 , current ECM pretesting depth: 0.00
07/07/12 23:35:05 v1.30 @ host ID: 1397 , scheduled 30 curves at B1=2000 toward target pretesting depth of 33.54
07/07/12 23:35:06 v1.30 @ host ID: 1397 , Finished 30 curves using Lenstra ECM method on C109 input, B1 = 2K, B2 = gmp-ecm default
07/07/12 23:35:06 v1.30 @ host ID: 1397 , current ECM pretesting depth: 15.18
07/07/12 23:35:06 v1.30 @ host ID: 1397 , scheduled 74 curves at B1=11000 toward target pretesting depth of 33.54
07/07/12 23:35:22 v1.30 @ host ID: 1397 , Finished 74 curves using Lenstra ECM method on C109 input, B1 = 11K, B2 = gmp-ecm default
07/07/12 23:35:22 v1.30 @ host ID: 1397 , current ECM pretesting depth: 20.24
07/07/12 23:35:22 v1.30 @ host ID: 1397 , scheduled 214 curves at B1=50000 toward target pretesting depth of 33.54
07/07/12 23:37:04 v1.30 @ host ID: 1397 , Finished 214 curves using Lenstra ECM method on C109 input, B1 = 50K, B2 = gmp-ecm default
07/07/12 23:37:04 v1.30 @ host ID: 1397 , pp1: starting B1 = 1250K, B2 = gmp-ecm default on C109
07/07/12 23:37:09 v1.30 @ host ID: 1397 , pp1: starting B1 = 1250K, B2 = gmp-ecm default on C109
07/07/12 23:37:13 v1.30 @ host ID: 1397 , pp1: starting B1 = 1250K, B2 = gmp-ecm default on C109
07/07/12 23:37:18 v1.30 @ host ID: 1397 , pm1: starting B1 = 2500K, B2 = gmp-ecm default on C109
07/07/12 23:37:24 v1.30 @ host ID: 1397 , current ECM pretesting depth: 25.33
07/07/12 23:37:24 v1.30 @ host ID: 1397 , scheduled 430 curves at B1=250000 toward target pretesting depth of 33.54
07/07/12 23:49:58 v1.30 @ host ID: 1397 , Finished 430 curves using Lenstra ECM method on C109 input, B1 = 250K, B2 = gmp-ecm default
07/07/12 23:49:58 v1.30 @ host ID: 1397 , pp1: starting B1 = 5M, B2 = gmp-ecm default on C109
07/07/12 23:50:16 v1.30 @ host ID: 1397 , pp1: starting B1 = 5M, B2 = gmp-ecm default on C109
07/07/12 23:50:33 v1.30 @ host ID: 1397 , pp1: starting B1 = 5M, B2 = gmp-ecm default on C109
07/07/12 23:50:51 v1.30 @ host ID: 1397 , pm1: starting B1 = 10M, B2 = gmp-ecm default on C109
07/07/12 23:51:19 v1.30 @ host ID: 1397 , current ECM pretesting depth: 30.45
07/07/12 23:51:19 v1.30 @ host ID: 1397 , scheduled 559 curves at B1=1000000 toward target pretesting depth of 33.54
07/08/12 01:56:09 v1.30 @ host ID: 1397 , Finished 560 curves using Lenstra ECM method on C109 input, B1 = 1M, B2 = gmp-ecm default
07/08/12 01:56:09 v1.30 @ host ID: 1397 , final ECM pretested depth: 33.55
07/08/12 01:56:09 v1.30 @ host ID: 1397 , scheduler: switching to sieve method
07/08/12 01:56:09 v1.30 @ host ID: 1397 , nfs: commencing gnfs on c109: 8938861779072390809014308542579704644988289566455519966972540828521729483117056503398097479159550310460743973
07/08/12 01:56:09 v1.30 @ host ID: 1397 , nfs: commencing poly selection with 2 threads
07/08/12 01:56:09 v1.30 @ host ID: 1397 , nfs: setting deadline of 1641 seconds
07/08/12 02:22:36 v1.30 @ host ID: 1397 , nfs: completed 14 ranges of size 250 in 1587.4484 seconds
07/08/12 02:22:36 v1.30 @ host ID: 1397 , nfs: best poly = # norm 5.590786e-015 alpha -4.665071 e 3.341e-009 rroots 4
07/08/12 02:22:36 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 03:15:33 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 05:26:45 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 06:19:52 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 07:37:05 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 14:36:32 v1.30 @ host ID: 1397 , nfs: commencing msieve filtering
07/08/12 14:38:21 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 21:27:12 v1.30 @ host ID: 1397 ,
07/08/12 21:27:12 v1.30 @ host ID: 1397 , ****************************
07/08/12 21:27:12 v1.30 @ host ID: 1397 , Starting factorization of 8938861779072390809014308542579704644988289566455519966972540828521729483117056503398097479159550310460743973
07/08/12 21:27:12 v1.30 @ host ID: 1397 , using pretesting plan: normal
07/08/12 21:27:12 v1.30 @ host ID: 1397 , no tune info: using qs/gnfs crossover of 95 digits
07/08/12 21:27:12 v1.30 @ host ID: 1397 , ****************************
07/08/12 21:27:12 v1.30 @ host ID: 1397 , pp1: starting B1 = 20K, B2 = gmp-ecm default on C109
07/08/12 21:27:12 v1.30 @ host ID: 1397 , pp1: starting B1 = 20K, B2 = gmp-ecm default on C109
07/08/12 21:27:12 v1.30 @ host ID: 1397 , pp1: starting B1 = 20K, B2 = gmp-ecm default on C109
07/08/12 21:27:12 v1.30 @ host ID: 1397 , pm1: starting B1 = 100K, B2 = gmp-ecm default on C109
07/08/12 21:27:13 v1.30 @ host ID: 1397 , current ECM pretesting depth: 0.00
07/08/12 21:27:13 v1.30 @ host ID: 1397 , scheduled 30 curves at B1=2000 toward target pretesting depth of 33.54
07/08/12 21:27:14 v1.30 @ host ID: 1397 , Finished 30 curves using Lenstra ECM method on C109 input, B1 = 2K, B2 = gmp-ecm default
07/08/12 21:27:14 v1.30 @ host ID: 1397 , current ECM pretesting depth: 15.18
07/08/12 21:27:14 v1.30 @ host ID: 1397 , scheduled 74 curves at B1=11000 toward target pretesting depth of 33.54
07/08/12 21:27:30 v1.30 @ host ID: 1397 , Finished 74 curves using Lenstra ECM method on C109 input, B1 = 11K, B2 = gmp-ecm default
07/08/12 21:27:30 v1.30 @ host ID: 1397 , current ECM pretesting depth: 20.24
07/08/12 21:27:30 v1.30 @ host ID: 1397 , scheduled 214 curves at B1=50000 toward target pretesting depth of 33.54
07/08/12 21:29:10 v1.30 @ host ID: 1397 , Finished 214 curves using Lenstra ECM method on C109 input, B1 = 50K, B2 = gmp-ecm default
07/08/12 21:29:10 v1.30 @ host ID: 1397 , pp1: starting B1 = 1250K, B2 = gmp-ecm default on C109
07/08/12 21:29:15 v1.30 @ host ID: 1397 , pp1: starting B1 = 1250K, B2 = gmp-ecm default on C109
07/08/12 21:29:20 v1.30 @ host ID: 1397 , pp1: starting B1 = 1250K, B2 = gmp-ecm default on C109
07/08/12 21:29:24 v1.30 @ host ID: 1397 , pm1: starting B1 = 2500K, B2 = gmp-ecm default on C109
07/08/12 21:29:30 v1.30 @ host ID: 1397 , current ECM pretesting depth: 25.33
07/08/12 21:29:30 v1.30 @ host ID: 1397 , scheduled 430 curves at B1=250000 toward target pretesting depth of 33.54
07/08/12 21:41:58 v1.30 @ host ID: 1397 , Finished 430 curves using Lenstra ECM method on C109 input, B1 = 250K, B2 = gmp-ecm default
07/08/12 21:41:58 v1.30 @ host ID: 1397 , pp1: starting B1 = 5M, B2 = gmp-ecm default on C109
07/08/12 21:42:17 v1.30 @ host ID: 1397 , pp1: starting B1 = 5M, B2 = gmp-ecm default on C109
07/08/12 21:42:34 v1.30 @ host ID: 1397 , pp1: starting B1 = 5M, B2 = gmp-ecm default on C109
07/08/12 21:42:52 v1.30 @ host ID: 1397 , pm1: starting B1 = 10M, B2 = gmp-ecm default on C109
07/08/12 21:43:20 v1.30 @ host ID: 1397 , current ECM pretesting depth: 30.45
07/08/12 21:43:20 v1.30 @ host ID: 1397 , scheduled 559 curves at B1=1000000 toward target pretesting depth of 33.54
07/08/12 22:46:27 v1.30 @ host ID: 1397 , Finished 560 curves using Lenstra ECM method on C109 input, B1 = 1M, B2 = gmp-ecm default
07/08/12 22:46:27 v1.30 @ host ID: 1397 , final ECM pretested depth: 33.55
07/08/12 22:46:27 v1.30 @ host ID: 1397 , scheduler: switching to sieve method
07/08/12 22:46:27 v1.30 @ host ID: 1397 , nfs: commencing gnfs on c109: 8938861779072390809014308542579704644988289566455519966972540828521729483117056503398097479159550310460743973
07/08/12 22:46:27 v1.30 @ host ID: 1397 , nfs: commencing NFS restart
07/08/12 22:46:27 v1.30 @ host ID: 1397 , nfs: previous data file found - commencing search for last special-q
07/08/12 22:46:38 v1.30 @ host ID: 1397 , nfs: parsing special-q
07/08/12 22:46:38 v1.30 @ host ID: 1397 , nfs: found 4689644 relations, continuing job at specialq = 1980301
07/08/12 22:46:38 v1.30 @ host ID: 1397 , nfs: commencing msieve filtering

Matthias
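A log like the one above can be summarized with a short script that counts how often the factorization was started from scratch and reports the last entry written. This is a sketch built only from the line layout visible in this excerpt; the regular expression, the function name, and the reading of the timestamps as month/day/year are assumptions, not documented YAFU behaviour.

```python
import re
from datetime import datetime

# Assumed layout: "MM/DD/YY HH:MM:SS vX.YY @ <host info> , <message>"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \S+ @ .*? ,\s*(.*)$")


def summarize(lines):
    """Return the start times of each factorization run and the last non-empty log entry."""
    starts = []
    last_time = last_message = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(2)
        if message.startswith("Starting factorization of"):
            starts.append(when)
        if message:
            last_time, last_message = when, message
    return {"starts": starts, "last_time": last_time, "last_message": last_message}


# A few lines copied from the log above.
SAMPLE = """\
07/08/12 14:38:21 v1.30 @ host ID: 1397 , nfs: commencing lattice sieving with 2 threads
07/08/12 21:27:12 v1.30 @ host ID: 1397 ,
07/08/12 21:27:12 v1.30 @ host ID: 1397 , Starting factorization of 8938861779072390809014308542579704644988289566455519966972540828521729483117056503398097479159550310460743973
07/08/12 22:46:27 v1.30 @ host ID: 1397 , nfs: commencing NFS restart
07/08/12 22:46:38 v1.30 @ host ID: 1397 , nfs: commencing msieve filtering
"""

summary = summarize(SAMPLE.splitlines())
print(len(summary["starts"]), "start(s); last entry at",
      summary["last_time"], "->", summary["last_message"])
```

Run against the whole file, this reports two starts (07/07/12 23:35:04 and 07/08/12 21:27:12) and shows that nothing was logged after the msieve filtering step began at 22:46:38.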
