PiDP-8/I Software

Updating test-mkos8

(1) By Bill Cattey (poetnerd) on 2018-11-16 22:40:03 [link] [source]

I've read the test-mkos8 code.

Converting it to accept os8-run as the build system is easier than I thought, because it calls make on the build target.

I'm going to take an initial stab at updating it in the following ways:

  1. Rip out the stuff conditional on --disable-os8-patches, since that option is now gone. This means there's only one target we will build against.
  2. Update the target name to v3f.rk05, our new simpler conventional name.

and see if it runs.

(2) By Bill Cattey (poetnerd) on 2018-11-16 23:25:45 in reply to 1 [link] [source]

Note that I needed to install the following Perl modules:

Math::Subsets::List Parallel::Loops

To do that I needed to:

  1. upgrade cpan
  2. install cpanminus
  3. run cpanm from a non-standard path (sudo /opt/local/libexec/perl5.24/sitebin/cpanm) to do the module installs.

(3) By Bill Cattey (poetnerd) on 2018-11-16 23:35:55 in reply to 1 [link] [source]

I am not sure if I've got all the prerequisites, or if everything is configured right.

This is what I get for output after doing #1 and #2:

bash-3.2$ tools/test-mkos8
Configuring test {e3b0c44298fc}, PID 21208...
Configuring test {4b562c60e3db}, PID 21209...
Configuring test {4ca6a47775a5}, PID 21210...
Configuring test {c4ee51b85006}, PID 21211...
Configuring test {373cac38c48a}, PID 21212...
Configuring test {34c8537e7a21}, PID 21213...
Configuring test {e736b422ffa9}, PID 21214...
Configuring test {05d81df4108b}, PID 21215...
Error from child: Failed to configure {4b562c60e3db}!
Couldn't print to pipe at /opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 502.
Warning: unable to close filehandle GEN12 properly: Broken pipe at 
/opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 504.
bash-3.2$ Couldn't print to pipe at /opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 502.
Warning: unable to close filehandle GEN8 properly: Broken pipe at 
/opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 504.

It hasn't returned to the command prompt, so it's still doing something.

I see that there's a test subdirectory now containing directories with long names like 05d81df4108b5a9748f63510442b1f450981bd220bb905a6b1ebb812a90e68de, and a tmp subdirectory with numbered subdirectories.

In each of those numbered subdirectories I see two files, cfg.log and config.log. The last line of every cfg.log file is:

Error: couldn't open "lib/pidp8i/os8opts.py": no such file or directory

I wonder if that's a failed dependency or benign, since stuff IS apparently still running and doing something...

(4) By Bill Cattey (poetnerd) on 2018-11-17 00:19:44 in reply to 3 [link] [source]

There is definitely something subtly different in the dependencies such that running config from a subdir fails on the os8opts.py dependency. I thought it was a superfluous dependency on the os8-sys target I created to replace the two targets, os8-patched and os8-bin, but nope.

I'll check in a snapshot of the changes so you can try what I've got so far, and maybe fix what's wrong.

(5) By Bill Cattey (poetnerd) on 2018-11-17 09:17:16 in reply to 1 [link] [source]

I think I got make test-mkos8 working.

I have updated Makefile.in, auto.def, and test-mkos8 such that the various combinations of the --disable options successfully build v3d.rk05 for comparison purposes.

I didn't build all 131072 of them. I quit the tester after about 175 combinations had been built.

The changes are checked in and ready for you to try out.

(6) By Warren Young (tangent) on 2018-11-17 15:39:03 in reply to 2 [link] [source]

I've checked in an autosetup test for Perl and the non-core modules we use, with an automated one-command method for installing all of those modules if any are missing.

I'm not sure why you'd need to use a non-platform Perl version. I don't think we're doing anything particularly bleeding edge here. I'd expect what we're doing to work back to at least Perl 5.16, perhaps farther.

That's a guess because I'm out of the state, away from my test VMs, so I can't just try it and find out until Sunday night.

(7) By Warren Young (tangent) on 2018-11-17 15:45:20 in reply to 4 [link] [source]

I won't be able to get deep into this properly until tonight.

I don't think I ever tried to get test-mkos8 working for out-of-tree builds. It may need help to get that working.

The configure error is a sign that test-mkos8 has found a real problem, a case that the configure script doesn't properly handle. You'll find logs under the tmp directory to help you debug this.

In the meantime, would you upload a few of the images to Google Drive for me to download and compare against? You might want to run it in shuffle (-s) mode so that you get Monte Carlo coverage of the test space, rather than bias towards the front of the list. That'll also get me several random samples to test.

(8) By Bill Cattey (poetnerd) on 2018-11-17 18:42:42 in reply to 5 [link] [source]

Correction: there would be 65536 images, not 131072; I forgot to eliminate --disable-os8-patches from the list I made when I first started working on this.

(9) By Bill Cattey (poetnerd) on 2018-11-17 18:50:27 in reply to 7 [link] [source]

Any guess about what the errors:

wdc-home2:os8-v3f-extensions wdc$ Couldn't print to pipe at /opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 502.
Warning: unable to close filehandle GEN80 properly: Broken pipe at /opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 504.
Couldn't print to pipe at /opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 502.
Warning: unable to close filehandle GEN78 properly: Broken pipe at /opt/local/lib/perl5/site_perl/5.24/Parallel/Loops.pm line 504.

are all about?

(10) By Warren Young (tangent) on 2018-11-18 05:04:28 in reply to 9 [link] [source]

That's two errors repeated twice, not four errors, and the second error in each case follows from the first, which means the parent caught SIGPIPE. I've just seen it here when I Ctrl-C'd the parent, which is odd, since that should be suppressed. Maybe it's the child catching the Ctrl-C and exploding.

If these errors are happening in another case, then I'd guess it's related: the child is dying uncleanly.

Anyway, my tests so far have built correctly in my Debian 9 VM under Chrome OS, which should be fairly similar to Raspbian Stretch. I've uploaded what I've generated here to Google Drive so you can try building those on your system with tools/test-os8-run -e, which tells it to try building only tests that already exist in test/*. Invitation to that Drive folder will be sent via private email.

(11) By Bill Cattey (poetnerd) on 2018-11-18 18:52:16 in reply to 10 [link] [source]

Those errors arose when the build in one of the tmp subdirs blew out.

I think that the status reporting from a failed sub-build may not be working on my Mac platform.

(12) By Warren Young (tangent) on 2018-11-19 01:43:06 in reply to 11 [link] [source]

I was looking at your build results on the road, and I think what happened is clear from the log output in the tmp/$pid directory. Near the end of the v3d-dist-rk05.os8 script it runs one last command (SET SYS INIT) and then exits the simulator, expecting to see the OS/8 prompt one last time in response to an "os8" script command before it tells the simulator to exit. But instead of seeing "." at the end of a line, it sees ".^E", so pexpect times out, never matching ".$".

Which is why it keeps all those logs. :)
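
To make this concrete, the failing wait is essentially the following (a simplified sketch with an illustrative spawn command; the real pattern and timeout handling live in simh.py and differ in detail):

  import pexpect

  # Simplified sketch: after SET SYS INIT, the script layer waits for an
  # OS/8 dot prompt at the end of the received output before telling the
  # simulator to exit.
  child = pexpect.spawn ("make run")    # stand-in for how the harness drives SIMH
  try:
    child.expect (r"\.$", timeout=60)   # a line ending in "." -- the dot prompt
  except pexpect.TIMEOUT:
    # When the output ends with ".^E" instead of a bare ".", this pattern
    # never matches, and we land here with the logs preserved.
    raise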

Bottom line, I think there's some kind of race condition here that only happens when there are multiple simulators and/or C compiler runs happening in parallel.

You can choke test-os8-run down to a single thread with the -1 flag to see if this suppresses the symptom. If it does, that's not a fix, it's an indication that the problem is due to the parallel builds, which is what really needs to be fixed.

I recommend studying the tail end of the following logs, in this order:

  1. test/tmp/84835/obj/os8-run.log — where the condition described above occurs.
  2. media/os8/scripts/v3d-dist-rk05.os8 — which issues the SET SYS INIT command involved.
  3. test/tmp/85835/make.log — which shows the pexpect timeout and guides you to #1 and #2.

(13) By Bill Cattey (poetnerd) on 2018-11-19 02:15:26 in reply to 12 [link] [source]

One thing I reviewed, but forgot to mention, was that the config said "--disable os8-init".

Root cause here could be as simple as, "Bill forgot there was an os8-init conditional and never implemented it."

(14) By Warren Young (tangent) on 2018-11-19 02:21:49 in reply to 13 [link] [source]

Speaking of, while trying — and failing — to reproduce the symptom, os8-run complained that it didn't understand the "if" command in the script file in question. I don't have it in front of me, but you can reproduce it readily enough.

Is that a planned but still-missing feature of the os8-run language?

(15) By Bill Cattey (poetnerd) on 2018-11-19 03:24:58 in reply to 14 [link] [source]

There's no if statement in os8-run.

It's done with

begin defined symname
end defined symname

(16) By Warren Young (tangent) on 2018-11-19 03:32:30 in reply to 15 [link] [source]

It's on line 149 of media/os8/scripts/v3d-dist-rk05.os8.

(17) By Bill Cattey (poetnerd) on 2018-11-19 05:05:00 in reply to 16 [link] [source]

Oh jeez!

That's why code coverage is so important!

It's a paste-o, with the result that the vtedit option, if enabled, would blow out with a syntax error! That if statement is from the old mkos8 python code.

I've just checked in a new version of v3d-dist-rk05.os8 with the line removed.

(18.1) By Bill Cattey (poetnerd) on 2018-11-19 05:18:57 edited from 18.0 in reply to 1 [link] [source]

(I hope this lands at the end of the thread. I'm still learning how the threading works.)

During the re-run of the make in one of the tmp subdirs, I've found a concerning bit of output:

bin/os8-run --disable ba --disable cc8 --disable chess --disable fortran-ii --disable init --disable macrel --enable music --enable vtedit ../../../media/os8/scripts/v3d-dist-rk05.os8
Running script file: ../../../media/os8/scripts/v3d-dist-rk05.os8
Pre-existing /Users/wdc/src/pidp8i/os8-v3f-extensions/bin/v3d-dist.rk05 found.  Saving as /Users/wdc/src/pidp8i/os8-v3f-extensions/bin/v3d-dist.rk05.save
Overwriting old /Users/wdc/src/pidp8i/os8-v3f-extensions/bin/v3d-dist.rk05.save
bin/os8-run --disable ba --disable cc8 --disable chess --disable fortran-ii --disable init --disable macrel --enable music --enable vtedit ../../../media/os8/scripts/v3d-rk05.os8

I think this explains our parallel race condition. The scripts are using $os8mo, the OS/8 media output directory, as the destination of the v3d.rk05 image being created.

Our lib/pidp8i/dirs.py is setting that directory. I would have expected that file to define the build directory appropriate to this out-of-tree build, i.e. .../src/pidp8i/os8-v3f-extensions/test/tmp/85835/, not src/pidp8i/os8-v3f-extensions.

And yet: dirs.py defines build as follows:

build = "/Users/wdc/src/pidp8i/os8-v3f-extensions"

(19) By Warren Young (tangent) on 2018-11-22 19:40:52 in reply to 12 [link] [source]

This same ".^E" with pexpect timeout symptom just happened to me with the latest code.

I don't have time to debug it now, something about eating turkey or some such nonsense.

I'm sending this in case you were hoping your latest checkin solved it.

This is with the Homebrew version of Perl 5.28, rather than the macOS 10.14 platform version, which is 5.18. But again, I don't think any of that matters. I think the test program is showing us a problem outside itself, which could happen to actual users.

(20) By Bill Cattey (poetnerd) on 2018-11-23 03:41:46 in reply to 19 [link] [source]

Now knowing that the repair of an obvious bit of brokenness didn't fix the problem, I actually went back and read the logs. (Sometimes I'm so lazy. :-( )

The smoking gun appears in .../test/tmp/84835/obj/os8-run.log:

...
^ * $
.SET SYS INIT
SET SYS INIT
SET?
. 
Simulation stopped, PC: 01207 (KSF)

The relevant line is "SET?"

I'm away from my build computer at the moment and can't dig in, but I think this is because CCL wasn't running. I think CCL was not running because of a race condition owing to our using an image we're not supposed to be using for an out-of-tree build.

I think if we fix dirs.py to point to the ...test/tmp/<pid>/... subtree we should be good.

(21) By Bill Cattey (poetnerd) on 2018-11-24 21:50:22 in reply to 18.1 [link] [source]

I checked in a trivial change that should make a big difference: adding --prefix=. to the call out to configure.

Alas, it STILL did not fix everything.

The next test I ran, it put v3d-dist.rk05 and v3d.rk05 in the tmp/<pid>/bin subdir.

But one of the parallel builds still blew out.

I just uploaded subdir 27285 to the test tree in google drive that I shared with you.

It was waiting for an OS/8 dot prompt and I can see no reason why it would not get one.

I re-ran the same config by hand and it ran to completion. :-(

Any thoughts on what this might be?

(22) By Warren Young (tangent) on 2018-11-24 23:03:37 in reply to 21 [link] [source]

adding --prefix=. to the call out to configure

The prefix tells where files land on "make install", which never happens in test-os8-run. Even if it did do that, the only way that would help us with these problems in the build process is if you've got parts of the build process that depend on installed files, which would be a bug in itself. The build process must depend only on things in the source or build tree.

In an out-of-tree build, you can predict which of the two locations a file should be in based on whether it is somehow generated from something in the source tree. That clear distinction gives you unambiguous guidance on how to construct paths in build rules:

  1. Add no prefix for files that are generated by a prior step, and therefore land in the build tree.

  2. Prefix a relative path in Makefile.in with "@srcdir@/" for files that are not generated — either by the Make or Autosetup stages of the build process — to get a path that points to the file back in the source tree.

Regardless, your most recent checkin is missing the leading double dash to make Autosetup treat "prefix" as an option. As written, your change is just defining a variable which our auto.def script never references, so there is no effect. The syntax you used is most often used for overriding the default CFLAGS and such; that's why the configure script doesn't give an error: it's valid syntax, it just doesn't do what you think it should.

To see this, try "./configure prefix=oogabooga" and search for where "oogabooga" appears in Makefile.in.

It was waiting for an OS/8 dot prompt and I can see no reason why it would not get one.

This is not the same as the prior Pexpect timeout.

The Python debug dump at the end of make.log shows that Pexpect only has the SIMH command prompts following the copy-in of music.tu56 in its buffer, but it's waiting for the OS/8 dot prompt on resuming after swapping to advent.tu56. On re-entering the simulator, os8-run sends a few CRs, presumably to elicit another dot prompt, which doesn't occur, thus the timeout.

If you run the command in question manually, you see the CRs on return from SIMH command mode do elicit a dot prompt:

 $ bin/os8-run --enable music ./media/os8/scripts/v3d-dist-rk05.os8

Here's what I get in the same condition when running it manually:

...copying MU files...
PLAYOV.PA^M
MUSIC.PA^M
^M
.^E^M
Simulation stopped, PC: 01210 (JMP 1207)^M
sim> det dt1
det dt1^M
sim> att -r dt1 /usr/local/src/pidp8i/os8-v3f-extensions/media/os8/subsys/advent.tu56
att -r dt1 /usr/local/src/pidp8i/os8-v3f-extensions/media/os8/subsys/advent ^H.tu56^M
DT: unit is read only^M
DT1: 12b format, buffering file in memory^M
sim> cont
^C^M
^M
.COPY RKB0:<DTA1:*.*^M^M

There's that screwy ".^E^M" pattern again, so I don't see how the dot prompt pattern matching works in this case.

What's that Ctrl-C doing in there?

Although it all works, it seems inherently unreliable, depending on specific byte orderings which vary from one run to the next.

Do we need to increase the OS/8 command delays to ensure that things like ".^E^M" become ".^M^E" so that patterns like ".$" always match correctly?

Incidentally, I see a curious thing in the most recent obj/os8-run.log file here:

  att dt1 /usr/local/src/pidp8i/os8-v3f-extensions/media/os8/al-4696c-sa-os8- ^Hv3d-6.1978.tu56^M

Why is there a space and backspace in that command to SIMH?

(23) By Bill Cattey (poetnerd) on 2018-11-25 02:15:30 in reply to 22 [link] [source]

So much to respond to. Thanks for thinking about this.

Re: prefix:

As written, your change is just defining a variable which our auto.def script never references, so there is no effect.

I may have stumbled on the right thing through a typo. I'm not sure HOW it is working, but it is working. I saved the old subdir build, tmp/84835. In its lib/pidp8i/dirs.py we see lines 44 and 45:

build = "/Users/wdc/src/pidp8i/os8-v3f-extensions"
src   = "/Users/wdc/src/pidp8i/os8-v3f-extensions"

In the recent testing I did, in tmp/27285 lines 44 and 45 of lib/pidp8i/dirs.py are different:

build = "/Users/wdc/src/pidp8i/os8-v3f-extensions/test/tmp/27285"
src   = "/Users/wdc/src/pidp8i/os8-v3f-extensions"

I think this has something to do with how auto.def is using the prefix variable. I see this in auto.def:

options-defaults {prefix /opt/pidp8i}
...
# Canonicalize some paths which may be relative and generate others from them
set abspfx [file-normalize [get-define prefix]]
define ABSPREFIX $abspfx
define BOOTDIR  "$abspfx/share/boot"
define MEDIADIR "$abspfx/share/media"
...
# High-level definitions
set builddir [get-define builddir]
set srcdir [get-define srcdir]
set cores [exec $srcdir/tools/corecount]

dirs.py.in defines build and src as follows:

build = "@builddir@"
src   = "@abs_top_srcdir@"

In the make-template section of the current autosetup docs I see:

The special value @srcdir@ is substituted with the relative path to the source directory from the directory where the output file is created, while the special value @top_srcdir@ is substituted with the relative path to the top level source directory.

I presume that prefix and builddir actually have some undocumented meaning that I stumbled upon to get what I was looking for.

Sigh.

Re: Waiting for OS/8 prompts.

I looked through the code in v3d-dist.os8. The code for copyin of music is pretty much the same as the copyin of other stuff that succeeded just fine:

# Copying in Music score files and source code...
begin enabled music
umount dt1
mount dt1 $os8mi/subsys/music.tu56 ro required
# resume
os8 COPY RKB0:<DTA1:*.*
end enabled music

# Adventure
begin default advent
umount dt1
mount dt1 $os8mi/subsys/advent.tu56 ro required
# resume
os8 COPY RKB0:<DTA1:*.*
end default advent

The reason why resume is commented out everywhere is that the latest os8-run has the resume call built in.

The mount command just returns success. The call to the os8 command has:

# Resume OS/8 if necessary.
if self.simh._context == "simh":
  self.resume_command(line, script_file)

In os8script.py, resume_command simply calls os8_resume() in simh.py. And therein may live our problem. Viz, the comments I put in that method:

  #### os8_resume #######################################################
  # Resume OS/8.
  #
  # It would be nice if we could just send  the "cont" command
  # and have python expect and OS/8 synch right up.
  # But so far we have not figured out how to do that.
  # To resume OS/8 from SIMH we need to provoke a prompt.
  # Typing a rubout or ^U at a SIMH terminal session does this.
  # But not when SIMH is run under python expect.
  # We don't know why.
  #
  # boot works
  # go 7600 works
  # ^C <pause> \n\r works.
  #
  # The resume command uses the ^C method as the least disruptive
  # to system state.

I think the ".^E^M" pattern is a red herring. Our problem here is that the output mixes what is being sent with what is being received.

After staring at the code for a while, I think perhaps the ^C I want to send to OS/8 is getting to SIMH before it has successfully transferred control to OS/8. I'm going to add a keyboard delay BEFORE as well as after sending ^C.
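
Roughly what I have in mind, as a sketch (names and the delay value are illustrative, not the actual simh.py code):

  import time

  def os8_resume_sketch (child, kbd_delay=0.2):
    # child is assumed to be the pexpect handle driving SIMH.
    child.sendline ("cont")    # resume the simulation from the sim> prompt
    time.sleep (kbd_delay)     # NEW: delay before ^C, so SIMH has handed the
                               # console back to OS/8 first
    child.send ("\x03")        # ^C to provoke a keyboard monitor prompt
    time.sleep (kbd_delay)     # the existing delay after ^C
    child.send ("\n\r")
    child.expect (r"\.$")      # wait for the OS/8 dot prompt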

(24.1) By Bill Cattey (poetnerd) on 2018-11-25 03:46:28 edited from 24.0 in reply to 23 [link] [source]

Adding the additional keyboard delay before sending ^C caused TWO of the sub-processes to blow out instead of one.

But I had one more trick up my sleeve...

Instead of the "less intrusive" resume, I did the more conservative, more intrusive restart:

go 7600

By changing the call from resume_command to restart_command, the race condition no longer happens.
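
For contrast, the restart path boils down to something like this (again a sketch with illustrative names; the real restart lives in simh.py):

  def os8_restart_sketch (child):
    # child is assumed to be the pexpect handle, sitting at the sim> prompt.
    child.sendline ("go 7600")   # restart the OS/8 keyboard monitor
    child.expect (r"\.$")        # a fresh dot prompt, with no prior state to race against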

At this point, all I can say is that os8_resume in simh.py isn't reliable enough.

The updated version of os8script.py is checked in now.

(25) By Bill Cattey (poetnerd) on 2018-11-30 03:50:46 in reply to 1 [link] [source]

I now understand how to update the sanitizing code.

Some but not all of the tests pass.

I created a random block of 16 tests. 15 pass. The one that does not pass is:

ba2a9637b3f4fd8cbc62f2d7a76f512f8599b3ba7756db49aefa6db7bdcbff00 --disable-os8-advent --disable-os8-ba --disable-os8-cc8 --disable-os8-crt --os8-focal69

I've put that directory in my Google Drive share, test-os8-run-wdc-2018-11-17, and appended that row to the tests-manifest.txt file.

What sorts of lzcmp problems have you seen with previous testing?

(26) By Warren Young (tangent) on 2018-11-30 11:29:15 in reply to 25 [link] [source]

Since the diff file is empty, it's pointing out that the disk contents differ even though the commands given and the responses are the same in both cases.

You might have to dissect the disk contents to find out how they differ. Or, maybe it's time to write bin/os8-diff. :)

The only reason lz is involved is because these disk images are pretty flabby, so it's worth compressing them, especially on a Pi's SD card.

(27) By Bill Cattey (poetnerd) on 2018-12-01 01:23:57 in reply to 26 [link] [source]

Yea, I figured as much.

The diff file should be empty now, since the sanitizing has removed the differing path names.

The log data is the same.

I decompressed the two rk05 images, used os8-cp to pull all the files from them and diff'ed them. There was exactly one difference: BUILD.SV.

BUILD.SV is only touched in the v3d-dist-rk05.os8 script, where it is run out of a scratch tu56 image cloned from the read-only $os8mi/al-4711c-ba-os8-v3d-1.1978.tu56, saved back to that scratch image, and used for subsequent work. These activities are pretty much the first things done in the script, and the first things done after booting the scratch tu56 image.

It seems all very deterministic to me.

I guess subsequent runs of SimH might have random bits in the uninitialized memory image data.

Any thoughts on where the deltas in BUILD.SV might have come from?

Here's what the script looks like around the touching of BUILD.SV.

cat -n ../media/os8/scripts/v3d-dist-rk05.os8
 1  # Creates an OS/8 v3d binary rk05 image from distribution DECtapes.
 2  # Caution: the boot tape must be writeable.
 3  
 4  mount dt0 $os8mi/al-4711c-ba-os8-v3d-1.1978.tu56 required scratch
 5  mount dt1 $os8mi/al-4712c-ba-os8-v3d-2.1978.tu56 ro required
 6  
 7  mount rk0 $os8mo/v3d-dist.rk05 new
 8  
 9  boot dt0
10  
11  begin build SYS:BUILD
12  LOAD DTA1:RK8ESY.BN
13  LOAD DTA1:PT8E.BN
14  LOAD DTA1:TD8ESY.BN
...
33  DSK RK8E:RKB0
34  PRINT
35  BOOT
36  end build
37  
38  # Note:
39  # If we save BUILD to RKA0 it gets overwritten with our DTA0: copyin.
40  # If we save BUILD to DTA0: that does not happen but then we modify
41  # our tape. (It is supposed to be a scratch tape.)
42  os8 SAVE DTA0 BUILD.SV
43  
44  boot dt0

(28) By Bill Cattey (poetnerd) on 2018-12-01 03:07:56 in reply to 27 [link] [source]

Oh, there's one more thing I did...

I looked at the contents of the BUILD.SV deltas with a tool I found, OCOMP.SV, an OS/8 octal compare program. I registered the changes against code from an assembly listing I made of BUILD.PA, and by examining the two differing .SV files under FUTIL.

Data that is different is NOT just uninitialized data. It's values of temporary variables from a running BUILD.SV.

Here is a sample of output from OCOMP.SV. It's a chunk of page 0. The blank zones represent octal values that are the same in both files. Indeed I found page zero in FUTIL and the blocks that are the same match up. So I'm pretty confident I'm looking at differences on page zero.

Excerpt of diffs from failed rklz:

RELATIVE BLOCK 0021   ( ABSOLUTE BLOCK 0147 AND 0514 )
0000            2403 6070                                 5353 5373
0014  5273 5203 5202                 0025
0030  6736      2200 6565  0022 0000 0000 0000  0000 2000 6000
0044  3657 3216 4160
0314                                 4007
0330            6553

Excerpt of diffs from last rklz:

RELATIVE BLOCK 0021   ( ABSOLUTE BLOCK 0514 AND 0147 )
0000            2213 7005                                 5356 5376
0014  5276 5202 5201                 0024
0030  7322      4000 6767  0023 2213 7005 2213  0260 2400 5400
0044  0000 6211 4230
0314                                 4011
0330            6745

For the most part, I don't understand how BUILD.SV does its work, but after staring at stuff for a while, it seems that indeed this is a run of BUILD.SV with a different code path.

The line beginning with 0030 has a simple explanation: The name of the default storage device DSK is not defined in the failed rklz, but is set to RK8ERKB0, i.e. 2213 7005 2213 0260 versus all zeros.
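
For anyone who wants to check that decoding: OS/8 packs two six-bit (stripped ASCII) characters into each 12-bit word, so a quick sketch like this recovers the text (the helper is mine, purely for illustration):

  def sixbit_word_to_chars (word):
    # Decode one 12-bit OS/8 word into its two six-bit (stripped ASCII) characters.
    chars = []
    for six in ((word >> 6) & 0o77, word & 0o77):
      # Codes 01-37 map back to ASCII 0101-0137 (A-Z and friends);
      # codes 40-77 are already printable ASCII (space through ?).
      chars.append (chr (six + 0o100) if six < 0o40 else chr (six))
    return "".join (chars)

  words = [0o2213, 0o7005, 0o2213, 0o0260]
  print ("".join (sixbit_word_to_chars (w) for w in words))   # prints RK8ERKB0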

Line 33 in the script sets those values! But somehow, in the creation of the v3d-dist.rk05 used by the failed run, that value is not set.

I shall dig deeper...

(29) By Bill Cattey (poetnerd) on 2018-12-01 03:49:56 in reply to 28 [link] [source]

It is confirmed that the version of BUILD.SV in last.rk05 is an un-configured save image:

sim> att rk0 last.rk05
sim> boot rk0

PIDP-8/I OS8-V3F-EXTENSIONS:ID[5FECCFDF3A] - OS/8 V3D - KBM V3Q - CCL V1F
CONFIGURED BY WDC@WDC-HOME2.LAN ON 2018.11.20 AT 23:17:54 EST

RESTART ADDRESS = 07600

TYPE:
    .DIR                -  TO GET A LIST OF FILES ON DSK:
    .DIR SYS:           -  TO GET A LIST OF FILES ON SYS:
    .R PROGNAME         -  TO RUN A SYSTEM PROGRAM
    .HELP FILENAME      -  TO TYPE A HELP FILE

.R FUTIL

FILE BUILD.SV
BUILD.SV   0255-0315  0041 (0033)  1.101  21-JUL-69
SET MODE SAVE
LIST 0030-45

0000.00030:   7^R55RP0^/ZN

^C
.R BUILD

$PR

RK05: *RKA0 *RKB0 *RKA1 *RKB1  RKA2  RKB2  RKA3  RKB3 
RX01: *RXA0 *RXA1 
KL8E: *TTY  
LPSV: *LPT  
TC08: *SYS  *DTA0 
TC  :  DTA0 *DTA1  DTA2  DTA3  DTA4  DTA5  DTA6  DTA7 
$

The PR output from BUILD should look like this:

.R BUILD

$PR

RK05:  RKA0  RKB0 *RKA1 *RKB1 *RKA2 *RKB2  RKA3  RKB3 
RX01: *RXA0  RXA1 
KL8E: *TTY  
LPSV: *LPT  
TC08:  SYS   DTA0 
TC  : *DTA0 *DTA1  DTA2  DTA3  DTA4  DTA5  DTA6  DTA7 
RK8E: *SYS  *RKA0 *RKB0 
PT8E: *PTP  *PTR  
TD8E:  SYS   DTA0  DTA1 
ROM :  SYS   DTA0  DTA1 
TD8A:  DTA0  DTA1 

DSK=RK8E:RKB0
$

The relevant question is, "How can test-os8-run change so as to make this failure mode visible?" The test case image initially created is wrong, not the subsequently created one.

Is it possible that when I hit ^C on my initial test run of 16 cases, I interrupted creation of #16 such that the image was corrupt? I think yes.

That leads me to the next question, which might give reasonable insight into the previous two: How could test-os8-run change to stop gracefully after creating a given number of test cases?

(30) By Bill Cattey (poetnerd) on 2018-12-01 03:53:54 in reply to 29 [link] [source]

I just checked in an amendment to test-os8-run that adds a -c option to limit the count of tests created. With it, I created 17 tests and they all passed.

I'm going to presume the error I got before was from the un-graceful kill of the run of test-os8-run.

I'm going to kick off a test run of 200 and see what happens.

(31) By Warren Young (tangent) on 2018-12-01 05:03:29 in reply to 27 [link] [source]

I guess subsequent runs of SimH might have random bits in the uninitialized memory image data.

About a year ago, I reported it as a bug that the SIMH PDP-8 simulator doesn't initialize the core memory buffer on startup, and Bob Supnik himself closed the bug, pointing out that a real PDP-8 doesn't initialize its core on startup, so any code that depends on the initial state of core is buggy.

For our purposes, I think we want to have a dep 0/100000 0 command somewhere at the start of each script run to zero all 32 kWords of core.

(32) By Warren Young (tangent) on 2018-12-01 05:06:50 in reply to 31 [link] [source]

Shortly after posting that, my Spidey sense started tingling, and then I remembered writing such a thing already: see [os8_]zero_core() in lib/simh.py. One zeroes all of core; the other skips over areas precious to OS/8.
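
The full-zero variant amounts to little more than a deposit command sent at the sim> prompt before booting; roughly this (a sketch, not the actual lib/simh.py code):

  def zero_core_sketch (child):
    # child is assumed to be a pexpect handle attached to the SIMH console,
    # sitting at the sim> prompt.
    child.sendline ("dep 0/100000 0")   # zero locations 0-77777 octal: all 32 kWords
    child.expect ("sim> ")              # wait for SIMH to finish and prompt again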

(33) By Bill Cattey (poetnerd) on 2018-12-01 05:24:59 in reply to 1 [link] [source]

I've built 217 tests, and all 217 pass.

Running the verify pass on my Mac Mini took 42m10.8s.

So I can tell it "generate 3000 tests" and go to bed, expecting it will have that many done in the morning.

(34.1) By Bill Cattey (poetnerd) on 2018-12-01 16:29:31 edited from 34.0 in reply to 33 [link] [source]

Generating an additional 3000 tests took 636m20.5s.

(35) By Bill Cattey (poetnerd) on 2018-12-01 16:22:45 in reply to 32 [link] [source]

Right! I was trying to remember about that.

However, as you saw from my subsequent investigation, it was an actual build error, presumably provoked by ^C during test generation.

(36) By Bill Cattey (poetnerd) on 2018-12-01 16:27:35 in reply to 1 [link] [source]

Since progressively larger numbers of test runs are coming up successful...

QUESTION: At what point do we declare victory and merge os8-v3f-extensions into trunk?

I suppose I could start up a complete test run. With the creation of 65536 images (half baseline, half comparison), am I going to burn out the SSD in my Mac Mini?

(37) By Warren Young (tangent) on 2018-12-01 22:14:29 in reply to 34.1 [link] [source]

That fits pretty well with my results with test-mkos8: linear extrapolation puts a set of 32768 at ~4.4 days; we had 65,536 test cases with the prior release, and it was taking me about a week solid to generate the test corpus.

How about you clear out your GDrive shared folder and copy over all test sets beginning with 0, which should get me about 200 to play with in -e mode on my various machines. UUIDs are random, so the set should be fairly evenly distributed among the OS/8 configuration options.

I suggest deleting those you've already uploaded because until recently, there was cause to be uncertain whether they are true exemplars of correct output.

(38) By Warren Young (tangent) on 2018-12-01 22:28:25 in reply to 36 [link] [source]

At what point do we declare victory

When multiple machines can rebuild at least a large subset of the test results. The most important test platforms are a Raspberry Pi Model B+ and an original Pi 3, both of which I have. Next after that would be a Pi Zero, but I don't have one of those. These are my estimates of the most-used Pi models with the PiDP-8/I.

We don't have to re-test everything, since it's fine for trunk to be in-development. We just want to have some reasonable assurance that it usually builds everywhere, at least.

am I going to burn out the SSD in my Mac Mini?

Unless it's really old, I wouldn't bet on it. A typical endurance rating for a recent consumer-grade SSD is 72 TB written per 128 GB of capacity, so for a 256 GB SSD — my best guess as to what you have — a full test corpus would consume about 1/4400 the total write capacity of the disk, based on a rough test size of 1 MB.

In that estimate I'm trading off the conservative warranted capacity against the fact that it takes more than 1 MB of writing to produce 1 MB of test result data. I expect these factors to roughly cancel out.
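
Back-of-the-envelope, if you want to check those figures (assuming ~32768 test builds at roughly 1 MB of result data apiece, and decimal units):

  endurance_gb = 72000.0 * 256 / 128   # 72 TB per 128 GB of capacity, scaled to 256 GB
  corpus_gb    = 32768 * 1.0 / 1000    # ~32768 test builds at ~1 MB each
  print ("one corpus uses ~ 1/%d of rated endurance" % (endurance_gb / corpus_gb))
  # prints: one corpus uses ~ 1/4394 of rated endurance -- call it 1/4400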

To put that in perspective, you'd have to re-generate the whole test corpus 44 times to consume 1% of the expected endurance of the drive. Since that would take you about a year to accomplish, I don't think you have much to worry about.

I stopped worrying about SSD endurance years ago.

(39) By Bill Cattey (poetnerd) on 2018-12-02 01:23:34 in reply to 37 [link] [source]

I'm in the process of running 3217 tests (16 with a fencepost error + 200 + 3000). So far 2566 have come back passed. At this rate the whole set should complete in a couple of hours.

I have deleted the old set of tests from GDrive.

I've copied in the directories corresponding to the first 217 lines of tests-manifest.txt, because I presume they correspond to the 217 test runs already mentioned thus far.

Look in test-os8-run-2018-11-30 which I have just shared with you.

Give them a whirl!

(40) By Bill Cattey (poetnerd) on 2018-12-02 01:27:50 in reply to 38 [link] [source]

Thanks for the reassurance.

(41) By Bill Cattey (poetnerd) on 2018-12-02 03:02:42 in reply to 38 [link] [source]

I just copied my 217 tests to my Pi3B. They are all failing.

Cutting apart the packs like I did before, there's exactly one file differing: INIT.TX

(I think I shall turn my cut-apart ritual into a script.)

Looks like the code for creating a "neutral version" isn't working:

pi@pidp8:~/src/pidp8i/os8-v3f-extensions/test/c4ee51b850063f086f3b82581e0bab016534ef4e1ec5edbb92e172c82bd014ba $ diff rkb_last/INIT.TX.txt rkb_fail/INIT.TX.txt 

2,3c2,3
< PiDP-8/I os8-v3f-extensions:id[5feccfdf3a] - OS/8 V3D - KBM V3Q - CCL V1F
< Configured by wdc@wdc-home2.lan on 2018.11.20 at 23:17:54 EST
---
> PiDP-8/I os8-v3f-extensions:id[75497225c9] - OS/8 V3D - KBM V3Q - CCL V1F
> Configured by pi@pidp8 on 2018.12.01 at 20:30:48 EST

Root cause here is again an out-of-tree build issue. v3d-dist-rk05.os8 pulls init.tx from $os8mi, which evaluates to @srcdir@/media/os8. test-os8-run puts the sanitized version into the local media/os8 directory -- not where we pull it from.

I shall have to meditate on how to fix this.

(42) By Bill Cattey (poetnerd) on 2018-12-02 03:11:46 in reply to 41 [link] [source]

Additional issues on running test-os8-run on my Pi:

I did not have lz4, and had to search around to learn that the relevant debian package is named liblz4-tool. As soon as I did sudo apt-get install liblz4-tool, it found it, installed it, and used it.

Note that I didn't call test-os8-run from make, because I wanted to use the -e option. So I never got any of the messages about missing prereqs or how to get them. With liblz4-tool not installed, the tests all passed even though lz4 was not found. I think test-os8-run should proactively test for the existence of the tool rather than both complaining about its absence and saying the test passed.

The instruction sudo cpanm --inst did not work because I first had to do sudo cpan App::cpanminus.

So the prerequisite installs are:

sudo apt-get install liblz4-tool
sudo cpan App::cpanminus
sudo cpanm --inst Math::Subsets::List
sudo cpanm --inst Parallel::Loops

(43) By Warren Young (tangent) on 2018-12-02 04:02:23 in reply to 41 [link] [source]

Regarding INIT.TX, I'm seeing the same thing here, and I don't entirely understand what's going wrong.

The version string is supposed to be sanitized by tools/version. I can see that init.tx and 3finit.tx have the version line set to the generic form they're supposed to have, but if you cd into test/tmp/$PID and type "make run", you see it gets built with a proper version string, not the sanitized form.

My current guess is that it's reaching outside the test directory and using material from the checkout tree via a path that's relative on trunk and is now absolute, perhaps done as part of an effort to make out-of-tree builds work.

Maybe it's best if we just leave this to the sanitizer, and give up on generic version strings in test mode.

We can do the same for the "Configured by" line.

(44) By Warren Young (tangent) on 2018-12-02 04:07:15 in reply to 42 [link] [source]

had to search around to learn that the relevant debian package is named liblz4-tool

The configure script output actually tells you that. See line 465 of auto.def.

I think test-os8-run should proactively test for the existence of the tool

Good plan.

So the prerequisite installs are:

A shorter command set is:

 sudo apt install liblz4-tool cpanminus
 sudo cpanm --installdeps .

Differences are "apt" vs "apt-get", getting cpanm pre-built via apt rather than going through CPAN, and using the cpanfile to automatically install all of the dependencies rather than listing them explicitly.

(45) By Bill Cattey (poetnerd) on 2018-12-02 04:22:58 in reply to 39 [link] [source]

Tests finished, but we know the gap: they won't succeed across hosts until I fix the code that sanitizes init.tx.

4 tests failed:

RESULT DIFFERS!  See test/5d6c212f96869837adc32359611a15444825bf05ce0a349131cc194b1a80eab2/fail.diff
RESULT DIFFERS!  See test/b478f1b135331d50c53cd6a046d93d3f16bfc0655a00d3cb8e92737ac58ca76b/fail.diff
RESULT DIFFERS!  See test/38bbd2d8b6ae9148a6714e64bbf39005a50a1bbb3dc58224978ff61e65c43c33/fail.diff
RESULT DIFFERS!  See test/8e529187ed35bf851f3dd11e6afe5d36c6581703e2f036fa0e2f5c599d43c423/fail.diff

If I have time tomorrow, I'll script out the unpack and diff work and see what the delta actually is.

(46) By Bill Cattey (poetnerd) on 2018-12-02 21:50:57 in reply to 43 [link] [source]

I have fixed this problem.

Root cause: v3d-dist-rk05.os8 uses init.tx as created in the source tree by auto.def. That's the value that we get when we expand the path value of $os8mi/init.tx. Here's the fix I just checked in:

# INIT message setting stuff goes here
cpto $os8mi/init.cm
# Location of init.tx is funky so that test-os8-run will be able to
# sanitize it.
cpto $build/media/os8/init.tx

We have to rebuild all the test cases because init.tx is wrong.

I just did a quick sanity-check run, and 2 builds on my Mac Mini now pass when checked on my Pi3 B.

I have added a new script, .../tools/diff-os8, which uncompresses, unpacks, and compares the last.rklz and failed.rklz files. This should be helpful for debugging.

I ran it against the 4 out of 3217 tests that failed. All four failed cases complain solely of a difference in the BUILD.SV image!

(47) By Bill Cattey (poetnerd) on 2018-12-02 22:22:10 in reply to 46 [link] [source]

For reasons I do not understand, the second execution of os8-run (the one that creates v3d.rk05) replaces the log file from the run that creates v3d-dist.rk05 instead of appending to it.

I've just checked in a change to Makefile.in to explicitly save the first run log output, and create obj/os8-run.log with the contents of both runs.

I am now building 200 tests, will test them on both my Mac and my Pi, and will see if any of them say what went sideways with BUILD.SV. When the 200 test run completes, I'll replace the GDrive contents with these tests so you can play with them.

(48) By Bill Cattey (poetnerd) on 2018-12-03 00:36:20 in reply to 47 [link] [source]

There's a new set of 200 tests in the GDrive. It's in a subdir test because it was quickest and easiest to just un-tar what I sent over to my Pi.

There's a bit of additional log sanitizing that's needed, now that we're getting both os8-run.log chunks. The temp filenames created for scratch aren't yet cleaned out:

-sim> att dt0 .../media/os8/al-4711c-ba-os8-v3d-1.1978-temp-HMBSJv.tu56
-att dt0 .../media/os8/al-4711c-ba-os8-v3d-1.1978-temp-HMBSJv.tu56
+sim> att dt0 .../media/os8/al-4711c-ba-os8-v3d-1.1978-temp-8S1WQa.tu56
+att dt0 .../media/os8/al-4711c-ba-os8-v3d-1.1978-temp-8S1WQa.tu56
-DT0    190KW, attached to .../media/os8/al-4711c-ba-os8-v3d-1.1978-temp-HMBSJv.tu56, write enabled
+DT0    190KW, attached to .../media/os8/al-4711c-ba-os8-v3d-1.1978-temp-8S1WQa.
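
The extra rule should be simple; something along these lines, sketched in Python for illustration (the real sanitizing lives in the test harness, and the six-character suffix length is my assumption):

  import re

  # Collapse the random scratch-copy suffix so runs compare equal, e.g.
  # "al-4711c-ba-os8-v3d-1.1978-temp-HMBSJv.tu56" -> "...-temp-XXXXXX.tu56"
  def sanitize_scratch_names (line):
    return re.sub (r"-temp-[A-Za-z0-9]{6}\.tu56", "-temp-XXXXXX.tu56", line)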

I've had two distinct failures: one on the Mac, and two so far on the Pi. One of those Pi failures is the same as the one on the Mac. Here are details from the Mac:

Question: What does Error 70 : Write error : cannot write decoded block in the below example mean?

Configuring test {d9de1cfe1e34}, PID 91697...
Building bin/v3d.rk05 for test {d9de1cfe1e34} (PID 91697)...
/dev/fd/63 /dev/fd/62 differ: char 643, line 23
Error 70 : Write error : cannot write decoded block 
Error 70 : Write error : cannot write decoded block 
RESULT DIFFERS!  See test/5a13b69c464b2ad59cab5bfe8eff4d333498f918ef848907eea86322e2b21f53/fail.diff

(I suspect the reason this test failed on both the Mac and the Pi is that the bad save of BUILD.SV happened in the test creation phase, not in the test validation phase.)

Again, only BUILD.SV is reported as different after running diff-os8.

This time I have a smoking gun, in fail.diff:

@@ -70,11 +70,13 @@
 WRITE ZERO DIRECT?YY
 SYS BUILT
 .^CSAVE DTA0 BUILD.SV^C
+.SAVE DTA0 BUILD.SV
+
 .^E
 Simulation stopped.
 sim> show dt0
 show dt0

It looks like a race condition where the SAVE DTA0 BUILD.SV is getting aborted by the ^C. But it should not be sending a ^C. I will have to re-run this with more verbose debugging to understand where the build_subcomm state machine is, and what it thinks it's doing.

There still are calls to os8script.py:resume_command that have not been replaced with os8script.py:restart_command. (And I'm not sure about those replacements.)

Race conditions suck.

(49) By Bill Cattey (poetnerd) on 2018-12-03 01:19:21 in reply to 48 [link] [source]

Another question: Why do we save obj/os8-run.log but not make.log?

At the moment, I'm generating another set of ~200 tests in the hope that the race condition bites. I am running os8-run with -d on the run for v3d-dist.rk05 in the hopes that the race condition fires and I get decent debugging output.

However, that debugging output appears in make.log, not in obj/os8-run.log.

(50.1) By Warren Young (tangent) on 2018-12-03 10:48:56 edited from 50.0 in reply to 48 [link] [source]

There's a new set of 200 tests in the GDrive.

Are you using -s to shuffle the test set before generating a subset, or are we retreading old ground on each test set?

Until the tests become reliable, it's a pointless distinction, but I thought you should be aware of the option soon, if not already.

What does Error 70 : Write error : cannot write decoded block in the below example mean?

I've seen it, too, and the message is from lz4.

When this happens, you get a bunch of pointless diffs in fail.diff. I think the diffs would go away if it would run the exemplar last.log through the sanitizer before doing the diff.

I've added a feature to test-os8-run which lets you pass a prefix of a test hash after any option arguments to re-test the named test(s). This overrides -c, -e, -g and -s. I've also added a usage message and handlers for the --help and --version flags.

Using this new feature, a re-test of the failed test case I had here doesn't fail again, so I think there may still be a race condition somewhere.

Curiously, the test that failed here was 5a13b6, the same test that failed on you there, based on your second quoted block above. Don't be fooled by the first two lines that talk about test d9de1c: there are multiple tests running in parallel, so it's the last line that tells you which test actually failed. Your log fragment says that test d9de1c was just getting started when 5a13b6 failed.

Why would both our machines fail in the same test if this is just a race condition? What makes this test special? Why doesn't it fail when run solo?

Race conditions suck.

“There is no teacher but the enemy. No one but the enemy will tell you what the enemy is going to do. No one but the enemy will ever teach you to destroy and conquer. Only the enemy shows you where you are weak. Only the enemy shows you where he is strong.”

— Orson Scott Card, "Ender’s Game"

Down with the buggers!

Why do we save obj/os8-run.log but not make.log?

I have no objections to changing that.

(51) By Bill Cattey (poetnerd) on 2018-12-03 18:15:00 in reply to 50.1 [link] [source]

Yes, I used -s. I try to do that habitually.

Overnight, I generated 1000 tests with the latest sanitizing code and the update to eliminate that ^C race condition. On my Mac it passed all 1000 without a problem. On my Pi3 B, it went through a few hundred, and then had one fail. I just ran diff-os8, and it seems to complain of the ^C race condition that I thought I fixed.

The log files still show ^C where they should not. I am investigating...

(52) By Bill Cattey (poetnerd) on 2018-12-03 18:21:24 in reply to 51 [link] [source]

Found it!

I had a nesting error in my conditionals, so that the ^C fix was only active when debugging was enabled.
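
In other words, the structure was effectively this (illustrative names only, not the literal build_subcomm code):

  # What I had -- the ^C timing fix was nested inside the debug-only block:
  if debug:
    print ("resuming OS/8")
    send_ctrl_c_with_delays ()   # oops: only runs when debugging is on

  # What I meant -- the fix runs unconditionally:
  if debug:
    print ("resuming OS/8")
  send_ctrl_c_with_delays ()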

Fixed now. Re-running tests.

(53) By Bill Cattey (poetnerd) on 2018-12-04 05:18:29 in reply to 1 [link] [source]

With the working ^C fixup, I've successfully run 1000 test builds (with -s on creation to spread them out) both on my Mac and my Pi3 B+.

How many of them would you like in the GDrive?

(54) By Warren Young (tangent) on 2018-12-04 09:23:45 in reply to 53 [link] [source]

200's a good number. It's about as many as you can reasonably re-test on a Pi 3 between email exchanges.

(55) By Bill Cattey (poetnerd) on 2018-12-04 19:12:00 in reply to 54 [link] [source]

OK. A new set of tests (created with -s), the first 200 from my set of 1000, is on the GDrive in the new-test directory.

Is there any reason why I should not remove the test directory there, since it's been superseded?

(56) By Warren Young (tangent) on 2018-12-05 00:15:10 in reply to 55 [link] [source]

No, nuke it. I'd do it myself, but GDrive usually prevents doing it from the shared-to direction.

(57) By Warren Young (tangent) on 2018-12-05 02:19:10 in reply to 56 [link] [source]

All 200 built successfully on my 3B. Didn't bother on my iMac. Didn't re-test on the Pi B+, since the code hasn't changed since then.

(58) By Bill Cattey (poetnerd) on 2018-12-08 03:16:26 in reply to 1 [link] [source]

We got the tests to be 100% reproducible for a critical count of 200 on 3 different platforms.

To get a slow platform working, we lengthened the timeout.

This thread is resolved.

(59.1) By Warren Young (tangent) on 2018-12-08 18:36:35 edited from 59.0 in reply to 24.1 [link] [source]

I've just come across this in the code, and saw something that wasn't clear to me from your post above: os8_resume and os8_restart now have the same implementation as long as you're not in verbose mode.

Unless you intend for these two to diverge again at some point, I think we should just remove the resume command from the language. And if so, that should be done before the next release, so it isn't an interface we're pressured to maintain.

(60) By Bill Cattey (poetnerd) on 2018-12-08 19:46:06 in reply to 59.1 [link] [source]

"long as you're not in verbose mode"???

Did I spazz on an indent? (Again. :-( )

Can you point me to the code block you found?

(61) By Warren Young (tangent) on 2018-12-08 19:49:42 in reply to 60 [link] [source]

It's my mental wires that are crossed, not yours. I'm talking about os8script.resume_command() and restart_command(). I had that crossed up with simh.os8_restart(), which both of those are now based on.

You'll see what I mean about verbose mode when you look at the code.

(62) By Bill Cattey (poetnerd) on 2018-12-09 16:04:09 in reply to 61 [link] [source]

Am I looking in the wrong place?

  def restart_command (self, line, script_file):
    if self.verbose: print "Restarting OS/8 at line " + \
       str(self.line_ct_stack[0]) + "."

    self.simh.os8_restart()
    return "success"

The call to self.simh.os8_restart is at the same level as the if, not conditional upon it.

(63) By Warren Young (tangent) on 2018-12-10 07:29:08 in reply to 62 [link] [source]

The call to self.simh.os8_restart is at the same level as the if, not conditional upon it.

My point is that if you're not in verbose mode, those two commands do exactly the same thing, and if you are in verbose mode, they differ only in a small amount of text they output to the console.

I'm proposing that we remove either "restart" or "resume". Once upon a time, they behaved substantially differently, but not since your change up-thread.

(64) By Bill Cattey (poetnerd) on 2018-12-10 22:39:38 in reply to 62 [source]

Ah! The light comes on! Sorry for taking so long to understand.

Normally, I'd just rip out the resume command from os8-run, exactly as you suggest. But there's another aspect that I discovered subsequently. When I was chasing down the ^C issue in os8script.py:build_subcomm, I discovered that there is a bunch of stuff that is calling simh.os8_resume.

Here's what I know so far:

  1. There should be a way to just resume -- go back to the OS/8 keyboard monitor and get that prompt. But so far that code has been flaky and I don't understand why.

  2. There does seem to be a race condition in handling interrupt characters sent through pexpect, whether it lands on the sim> prompt or the OS/8 dot prompt. Again, I don't understand why.

  3. I believe there really are situations where you should just go back to OS/8 without restarting the monitor. But I have not thought them through yet.

  4. I think all this might go away when the "basic command loop order is backwards" problem is fixed.

At this point, I'd like some more time to meditate on whether or not to support the resume command. However, perhaps the best way forward is to disable the resume command and remove it from the public-facing documentation.

Then, if we solve the problems with it, we could add it back in a subsequent release.