CAC 2017-10-15

boot crash

The MC data in the RCU is not the correct data.

Either the data is wrong or the wrong data set was selected.

Watch the memory locations to see where the data came from

The last write into the memory location:

    000073                         177  copy_mc:
    000073  aa  0 00100 1007 00    178            mlr       (pr),(pr)           copy machine conditions
    000073  5a   000750 0000 00
    000073  aa   052721 247134

called by;

    000231                         425  pre_empt: 
    000231  0a   000312 7002 00    426            tsx0      ring_test           ring 0 & not idle?
    000232  0a   000056 7102 00    427            tra       ralr_set_1          if so, delay this for later
                                   428
    000233  4a  4 00112 3503 20    429            eppap     pds$page_fault_data copy machine conditions
    000234  4a  4 00100 7003 20    430            tsx0      fim_util$copy_mc    ..

This would appear to be copying to page_fault_data; the same location as the bad preempt. Thus we conclude that the location is correct.

source/library_dir_dir/system_library_1/source/bound_init_1.s.archive/initialize_faults.pl1:    wired_fim$int_scuinfo = addr (pds$page_fault_data.scu);

Yes; the pre-empt RCU points to page_fault_data

So either the wrong data was copied into page_fault data or the data was corrupted before copying.

The last RCU before the crash

WATCH [98943430] 00034:002463 read   00530030 000041250011 (APU_DATA_READ)

so page_fault_data is at 00530030

Watching page_fault data:

hdbg 4000000

    000253  0a   000004 6132 20    466            rcu       int_scuinfo,*       bye-bye
DBG(91716530)> CPU TRACE: 00034:002463 0 002214613220 (RCU 002214,N*)

This is a pre-empt return to a process not on prds

    000231                         425  pre_empt:
    000231  0a   000312 7002 00    426            tsx0      ring_test           ring 0 & not idle?
DBG(91718184)> CPU TRACE: 00034:002441 0 002522700200 (TSX0 002522)
...
    000322  0a   000010 1072 00    550            cmpx7     trouble_scuinfo     is it prds?
DBG(91718202)> CPU TRACE: 00034:002532 0 002220107200 (CMPX7 002220)
DBG(91718202)> CPU FINAL: Read 00126220 000072000043
DBG(91718203)> CPU FINAL: Read 00126533 000000600210
00034:002533 bound_interceptors:wired_fim+0323
    000323  aa   000000 6002 10    551            tze       0,0                 if so, go away
DBG(91718204)> CPU TRACE: 00034:002533 0 000000600210 (TZE 000000,0)
00034:002534 bound_interceptors:wired_fim+0324
    000324  aa   000001 7102 10    552            tra       1,0                 if not, take second return
DBG(91718206)> CPU TRACE: 00034:002534 0 000001710210 (TRA 000001,0)

DBG(91718212)> CPU FINAL: Read 77024210 000034650011
DBG(91718212)> CPU FINAL: Read 77024211 000000000021
DBG(91718212)> CPU FINAL: Read 77024212 000034000000
DBG(91718212)> CPU FINAL: Read 77024213 000000000000
DBG(91718212)> CPU FINAL: Read 77024214 002332500200
DBG(91718212)> CPU FINAL: Read 77024215 002332000100
DBG(91718212)> CPU FINAL: Read 77024216 002332710200
DBG(91718212)> CPU FINAL: Read 77024217 000410710220

DBG(91718212)> CPU FINAL: Write 00530030 000034650011
DBG(91718212)> CPU FINAL: Write 00530031 000000000021
DBG(91718212)> CPU FINAL: Write 00530032 000034000000
DBG(91718212)> CPU FINAL: Write 00530033 000000000000
DBG(91718212)> CPU FINAL: Write 00530034 002332500200
DBG(91718212)> CPU FINAL: Write 00530035 002332000100
DBG(91718212)> CPU FINAL: Write 00530036 002332710200
DBG(91718212)> CPU FINAL: Write 00530037 000410710220

This is a RCU without a write to page_fault_data followed immediately by a TRO.

    000253  0a   000004 6132 20    466            rcu       int_scuinfo,*       bye-bye
DBG(92963702)> CPU TRACE: 00034:002463 0 002214613220 (RCU 002214,N*)
DBG(92963702)> CPU FINAL: Read 00530030 000034650011
DBG(92963702)> CPU FINAL: Read 00530031 000000000021
DBG(92963702)> CPU FINAL: Read 00530032 000034000000
DBG(92963702)> CPU FINAL: Read 00530033 000000000000
DBG(92963702)> CPU FINAL: Read 00530034 002332500200
DBG(92963702)> CPU FINAL: Read 00530035 002332000100
DBG(92963702)> CPU FINAL: Read 00530036 002332710200
DBG(92963702)> CPU FINAL: Read 00530037 000410710220
00034:002332 bound_interceptors:wired_fim+0122
    000122                         275  timer_runout:
    000122  0a   000000 2542 20    276            spri      prs,*               save prs
DBG(92963704)> CPU TRACE: 00034:002332 0 002210254220 (SPRI 002210,N*)

Event over view:

The code is in a steady state:

DBG(92307911)> CPU INTR: intr_pair_addr 24
DBG(92307911)> CPU INTR: 000170 005404235000 (LDA 005404) 005404 235(0) 0 0 0 00
[92309021] rcu to 00043:000170  PSR:IC 00034:002766 final address 77024130
[92319349] scu to 00000:000004  PSR:IC 00043:000170 final address 77163404
DBG(92319349)> CPU INTR: intr_pair_addr 24
DBG(92319349)> CPU INTR: 000170 005404235000 (LDA 005404) 005404 235(0) 0 0 0 00
[92320459] rcu to 00043:000170  PSR:IC 00034:002766 final address 77024130
[92330821] scu to 00000:000004  PSR:IC 00043:000170 final address 77163404
DBG(92330821)> CPU INTR: intr_pair_addr 24
DBG(92330821)> CPU INTR: 000170 005404235000 (LDA 005404) 005404 235(0) 0 0 0 00
[92331931] rcu to 00043:000170  PSR:IC 00034:002766 final address 77024130

The code is in unlock_ptl with interrupts inhibited, the code disinhibits and the IOM interrupt is serviced.

DBG(92338531)> CPU FAULT: Fault 17(021), sub 0(00), dfc N, 'PTW0.F == 0'
DBG(92338531)> CPU FAULT: 00126:001101 0 040140100540 (MLR PR0|40140,F1) 040140 100(1) 0 0 0 00
[92338532] scu to 02235:000000  PSR:IC 00126:001101 final address 00016000
[92346308] rcu to 00126:001101  PSR:IC 00042:001405 final address 77124030

disk_reader page faults.

DBG(92346622)> CPU FINAL: Read 77442237 600024650100
00041:030237 bound_library_wired_:pl1_operators_+07727
    007727                        5098  alm_operators_begin:
    007727                        5099  alm_call:
    007727  aa  6 00024 6501 00   5100            sprilp    sp|stack_frame.return_ptr save return pointer
DBG(92346623)> CPU TRACE: 00041:030237 0 600024650100 (SPRI4 PR6|24)
DBG(92346624)> CPU FAULT: Fault 4(04), sub 0(00), 'Timer runout'
DBG(92346626)> CPU TRACE: 030240 0 000510657220 (SCU 000510,N*)
DBG(92346626)> CPU FINAL: Write 77024210 000041250011

So at cycle 92346624, a TRO occurred at the spot the pxss later returns to when trying to preempt the idle process; telling us that the MC that was used was, at least at one time, a legitimate MC. This would seem to indicate that the pxss code to locate the idle process MC found the wrong MC, or that the idle process MC was overwritten.

Track 77024210:

00034:002522 bound_interceptors:wired_fim+0312
    000312                         539  ring_test:
    000312  aa  2 00030 2353 00    540            lda       bp|mc.scu.ppr.prr_word  get PRR
DBG(92346747)> CPU TRACE: 00034:002522 0 200030235300 (LDA PR2|30)
DBG(92346747)> CPU FINAL: Read 77024210 000041250011

00034:002276 bound_interceptors:wired_fim+066
    000066  0a   000002 6132 20    206            rcu       scuinfo,*           good-bye
DBG(92346783)> CPU TRACE: 00034:002276 0 002212613220 (RCU 002212,N*)
DBG(92346783)> CPU FINAL: Read 77024210 000041250011
DBG(92346783)> CPU FINAL: Read 77024211 000000000011

The TRO is resumed; that should have been the end of that MC

    000312                         539  ring_test:
    000312  aa  2 00030 2353 00    540            lda       bp|mc.scu.ppr.prr_word  get PRR
DBG(92963826)> CPU TRACE: 00034:002522 0 200030235300 (LDA PR2|30)
DBG(92963826)> CPU FINAL: Read 77024210 000041250011

00034:003063 bound_interceptors:fim_util+073
    000073                         177  copy_mc:
    000073  aa  0 00100 1007 00    178            mlr       (pr),(pr)           copy machine conditions
    000073  5a   000750 0000 00
    000073  aa   052721 247134
DBG(92963852)> CPU TRACE: 00034:003063 0 000100100700 (MLR PR0|100)
DBG(92963852)> CPU FINAL: Read 77024210 000041250011
DBG(92963852)> CPU FINAL: Write 00530030 000041250011

Here we see the old MC being copied into page_fault_data.

00034:002463 bound_interceptors:wired_fim+0253
    000253  0a   000004 6132 20    466            rcu       int_scuinfo,*       bye-bye
DBG(92964692)> CPU TRACE: 00034:002463 0 002214613220 (RCU 002214,N*)
DBG(92964692)> CPU FINAL: Read 00530030 000041250011

And then the bogus RCU.

Analyzing the fault handling..

Returning from a fault or interrupt

DBG(92963702)> CPU TRACE: 00034:002463 bound_interceptors:wired_fim+0253
DBG(92963702)> CPU TRACE:     000253  0a   000004 6132 20    466      rcu       int_scuinfo,*       bye-bye
DBG(92963702)> CPU TRACE: 00034:002463 0 002214613220 (RCU 002214,N*) 002214 613(0) 0 1 1 00

TRO is pending

DBG(92963704)> CPU TRACE: 00034:002332 bound_interceptors:wired_fim+0122
DBG(92963704)> CPU TRACE:     000122                         275  timer_runout:
DBG(92963704)> CPU TRACE:     000122  0a   000000 2542 20    276      spri      prs,*               save prs
DBG(92963704)> CPU TRACE: 00034:002332 0 002210254220 (SPRI 002210,N*) 002210 254(0) 0 1 1 00

DBG(92963706)> CPU TRACE: 00034:002333 bound_interceptors:wired_fim+0123
DBG(92963706)> CPU TRACE:     000123  0a   000000 3522 20    277      eppbp     prs,*               bp -> machine conditions
DBG(92963706)> CPU TRACE: 00034:002333 0 002210352220 (EPP2 002210,N*) 002210 352(0) 0 1 1 00

DBG(92963708)> CPU TRACE: 00034:002334 bound_interceptors:wired_fim+0124
DBG(92963708)> CPU TRACE:     000124  aa  2 00020 7533 00    279      sreg      bp|mc.regs          save regs
DBG(92963708)> CPU TRACE: 00034:002334 0 200020753300 (SREG PR2|20) 000020 753(0) 1 1 0 00

...

DBG(92963850)> CPU TRACE: 00034:002444 bound_interceptors:wired_fim+0234
DBG(92963850)> CPU TRACE:     000234  4a  4 00100 7003 20    430      tsx0      fim_util$copy_mc    ..
DBG(92963850)> CPU TRACE: 00034:002444 0 003063700200 (TSX0 003063) 003063 700(0) 0 1 0 00
    000253  0a   000004 6132 20    466            rcu       int_scuinfo,*       bye-bye
DBG(92963702)> CPU CORE: core_read  00530030 000034650011 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530031 000000000021 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530032 000034000000 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530033 000000000000 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530034 002332500200 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530035 002332000100 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530036 002332710200 (APU_DATA_READ)
DBG(92963702)> CPU CORE: core_read  00530037 000410710220 (APU_DATA_READ)

DBG(92963704)> CPU TRACE: 00034:002332 bound_interceptors:wired_fim+0122
DBG(92963704)> CPU TRACE:     000122                         275  timer_runout:
DBG(92963704)> CPU TRACE:     000122  0a   000000 2542 20    276      spri      prs,*               save prs
DBG(92963704)> CPU TRACE: 00034:002332 0 002210254220 (SPRI 002210,N*) 002210 254(0) 0 1 1 00

Wait… the RCU returns to the fault handler? No!

Track 00530030 back…

WATCH [90603312] 00034:003063 write  00530030 000041250011 (APU_DATA_STORE)
    000073                         177  copy_mc:
    000073  aa  0 00100 1007 00    178            mlr       (pr),(pr)           copy machine conditions

watch 77024210

My cycle numbers changed? Trying again…

Faulting at

DBG(98943432)> CPU FAULT: Fault 20(024), sub 1(01), dfc N, 'acvFault: fetchDSPTW out of segment bounds fault'
DBG(98943432)> CPU FAULT: 00041:030240 bound_library_wired_:pl1_operators_+07730
DBG(98943432)> CPU FAULT:     007730  aa  6 00025 7541 00   5101      sti       sp|stack_frame.return_ptr+1   save indicators
DBG(98943432)> CPU FAULT: 00041:030240 0 600025754100 (STI PR6|25) 000025 754(0) 1 0 0 00

from RCU

DBG(98940877)> CPU FAULT: 00431:017156 0 000140100540 (MLR PR0|140,F1) 000140 100(1) 0 0 0 00
[98940878] scu to 00431:017156  PSR:IC 00431:017156 final address 00306000
WATCH [98943430] 00034:002463 read   00530030 000041250011 (APU_DATA_READ)
    000253  0a   000004 6132 20    466            rcu       int_scuinfo,*       bye-bye

Bad data in 00530030

Tracking back…

WATCH [88359848] 00034:003063 write  00530030 000041250011 (APU_DATA_STORE)
    000073                         177  copy_mc:
    000073  aa  0 00100 1007 00    178            mlr       (pr),(pr)           copy machine conditions

DBG(88359848)> CPU FINAL: Write(Actual) Write: iefpFinalAddress=00530030 writeData=000041250011^M
DBG(88359848)> CPU CORE: core_read  77024210 000041250011 (APU_DATA_READ)

Watching 77024210

WATCH [88358833] 00041:030240 write  77024210 000041250011 (APU_DATA_STORE)
DBG(88358831)> CPU TRACE: is_priv_mode P 0 get_addr_mode 1 get_bar_mode 0 IR 100200
DBG(88358831)> CPU FAULT: Fault 4(04), sub 0(00), dfc N, 'Timer runout'
DBG(88358831)> CPU FAULT: 00041:030240 bound_library_wired_:pl1_operators_+07730
DBG(88358831)> CPU FAULT:     007730  aa  6 00025 7541 00   5101      sti       sp|stack_frame.return_ptr+1   save indicators
DBG(88358833)> CPU TRACE: 030240 000510657220 (SCU 000510,N*) 000510 657(0) 0 1 1 00
DBG(88358833)> CPU CORE: core_write 77024210 000041250011 (APU_DATA_STORE)

This is the original fault.

Time sequence:

88358831: TRO @ 00041:030240
88358833: SCU stores at 77024210
88359848: pxss copies 77024210 to 00530030
98940878: RCU to 00041:030240

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License