Timestamp:
Jan 5, 2013, 4:19:48 PM
Author:
rousseau
Message:

Possibly fixed VirtualBox trap #8/reset problem (TRAC ticket #15)

Depending on the kernel (W4 or SMP) and on whether ACPI is enabled, copying
large numbers of files produced a trap #8 or a reset when using OS2AHCI in
VirtualBox.

Causes & Fixes

o Interrupts were not disabled before doing DevHelp_EOI().

Since the AHCI controller in VirtualBox is a software implementation,
it cannot process requests as fast as a true hardware controller.
As a result, interrupts stacked up until the interrupt stack was
exhausted, resulting in a trap #8 on W4 or a reset on SMP with ACPI.
Interrupts are now disabled before doing the EOI.
The "Physical Device Driver Reference" mentions this in:
?:\IBMDDK\DOCS\PDDREF.INF->Device Helper (DevHlp) Services->EOI
and cross-referencing with DANIS506 shows she does the same in:
s506sm.c (state machine) around line 244.
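The stacking effect described above can be sketched with a small model.
This is not OS2AHCI code: sim_t, maybe_deliver(), handler() and simulate()
are hypothetical names, and the model assumes that the handler body runs
with interrupts enabled and that a pending interrupt is delivered as soon
as both the CPU flag and the interrupt controller allow it.

```c
#include <stdbool.h>

/* Hypothetical model of one CPU, one IRQ level and a bursty device. */
typedef struct {
    int  pending;     /* interrupts the device still wants to raise */
    bool if_enabled;  /* CPU interrupt flag */
    bool in_service;  /* IRQ level has not been EOI'd yet */
    int  depth;       /* current handler nesting */
    int  max_depth;   /* deepest nesting seen so far */
} sim_t;

static void handler(sim_t *s, bool cli_before_eoi);

/* Deliver one interrupt if both the CPU and the controller allow it. */
static void maybe_deliver(sim_t *s, bool cli_before_eoi)
{
    if (s->pending > 0 && s->if_enabled && !s->in_service) {
        s->pending--;
        s->in_service = true;
        handler(s, cli_before_eoi);
    }
}

static void handler(sim_t *s, bool cli_before_eoi)
{
    s->depth++;
    if (s->depth > s->max_depth)
        s->max_depth = s->depth;

    s->if_enabled = true;           /* assumption: body runs with interrupts on */
    if (cli_before_eoi)
        s->if_enabled = false;      /* models disable() */
    s->in_service = false;          /* models DevHelp_EOI(irq) */

    maybe_deliver(s, cli_before_eoi);   /* window between EOI and return */
    s->depth--;
}

/* Returns the deepest handler nesting reached for a burst of interrupts. */
int simulate(int burst, bool cli_before_eoi)
{
    sim_t s = { burst, true, false, 0, 0 };

    while (s.pending > 0) {
        s.if_enabled = true;        /* interrupted code runs with interrupts on */
        maybe_deliver(&s, cli_before_eoi);
    }
    return s.max_depth;
}
```

In this model, simulate(burst, false) nests once per pending interrupt,
while simulate(burst, true) never nests deeper than one handler, because
the next interrupt is only delivered after the handler has returned.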

To investigate

o SMP safety

The comments on the disable() function in lib.c (around line 793) mention
that SMP systems should use spinlocks. This is possibly because a CLI only
affects the current CPU, so a new interrupt could still be taken on another
CPU where interrupts are enabled. However, doing the EOI before unlocking
did not solve the VBox issue, possibly because spin_unlock() enables
interrupts again. So, at least in the case of VBox, the interrupt handler
has to return with interrupts disabled.
This will only be an issue in VBox if its software implementation really
can receive interrupts from multiple CPUs.
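The difference between the two orderings can be sketched as follows. This
is a hypothetical single-CPU model, not the driver's lib.c: the model_*
functions are made up for illustration, and the behaviour of
model_spin_unlock() (it re-enables interrupts) is exactly the assumption
made in the text above, not something verified here.

```c
#include <stdbool.h>

/* Hypothetical model: only the interrupt flag and the lock state. */
typedef struct {
    bool if_enabled;  /* CPU interrupt flag */
    bool locked;      /* driver spinlock held */
} cpu_model_t;

/* Assumption: taking the lock disables interrupts. */
static void model_spin_lock(cpu_model_t *c)
{
    c->if_enabled = false;
    c->locked = true;
}

/* Assumption from the text: releasing the lock enables interrupts again. */
static void model_spin_unlock(cpu_model_t *c)
{
    c->locked = false;
    c->if_enabled = true;
}

static void model_disable(cpu_model_t *c)
{
    c->if_enabled = false;
}

/* Variant A: EOI inside the locked region, then unlock.
 * Returns true if interrupts are enabled when the handler returns. */
bool window_open_eoi_before_unlock(void)
{
    cpu_model_t c = { true, false };

    model_spin_lock(&c);
    /* trigger_engine() and the EOI would happen here */
    model_spin_unlock(&c);
    return c.if_enabled;
}

/* Variant B (this changeset): unlock, disable(), then EOI. */
bool window_open_disable_before_eoi(void)
{
    cpu_model_t c = { true, false };

    model_spin_lock(&c);
    /* trigger_engine() would happen here */
    model_spin_unlock(&c);
    model_disable(&c);
    /* the EOI would happen here */
    return c.if_enabled;
}
```

Under these assumptions variant A leaves interrupts enabled after the EOI
has been sent, and variant B does not. The model says nothing about a
second CPU, which is the open question above.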

o Real hardware

As far as I can tell, this patch has no influence on performance when
using real hardware. Also, since real hardware can handle requests in a
much shorter timespan, the likelihood of stacked interrupts occurring is
much lower.

Performance measurements in VirtualBox

VirtualBox is a virtual machine and is therefore subject to any system load
on the host OS. Even comparing DANIS506 with OS2AHCI makes no real sense,
since with DANIS506 the SATA controller runs in ATA-compatibility mode and
with OS2AHCI it runs in AHCI mode. Also, I have noticed that when my CPU
gets hot, duty-cycle throttling kicks in, reducing performance and thus
affecting any benchmark in VBox (or on real hardware).
The only way to compare DANIS506 with OS2AHCI is on real hardware, on the
same controller, with the same disks and running the same tests.
And, most importantly, at the same CPU core temperature.

Note

Diffs may show changed lines that are actually the same.
That is because my editor is configured to strip trailing whitespace.

File:
1 edited

  • trunk/src/os2ahci/ahci.c

    r144 r145
    1305 1305     * writes), we may be stacking interrupts on top of each other. If we
    1306 1306     * detect this, we'll pass this on to the engine context hook.
         1307     *
         1308     * Rousseau:
         1309     * The "Physycal Device Driver Reference" states that it's a good idea
         1310     * to disable interrupts before doing EOI so that it can proceed for this
         1311     * level without being interrupted, which could cause stacked interrupts,
         1312     * possibly exhausting the interrupt stack.
         1313     * (?:\IBMDDK\DOCS\PDDREF.INF->Device Helper (DevHlp) Services)->EOI)
         1314     *
         1315     * This is what seemed to happen when running in VirtualBox.
         1316     * Since in VBox the AHCI-controller is a software implementation, it is
         1317     * just not fast enough to handle a large bulk of requests, like when JFS
         1318     * flushes it's caches.
         1319     *
         1320     * Cross referencing with DANIS506 shows she does the same in the
         1321     * state-machine code in s506sm.c around line 244; disable interrupts
         1322     * before doing the EOI.
         1323     *
         1324     * Comments on the disable() function state that SMP systems should use
         1325     * a spinlock, but putting the EOI before spin_unlock() did not solve the
         1326     * VBox ussue. This is probably because spin_unlock() enables interrupts,
         1327     * which implies we need to return from this handler with interrupts
         1328     * disabled.
    1307 1329     */
    1308 1330    if ((u16) (u32) (void _far *) &irq_stat < 0xf000) {
    1309 1331      ddprintf("IRQ stack running low; arming engine context hook\n");
         1332      /* Rousseau:
         1333       * A context hook cannot be re-armed before it has completed.
         1334       * (?:\IBMDDK\DOCS\PDDREF.INF->Device Helper (DevHlp) Services)->ArmCtxHook)
         1335       * Also, it is executed at task-time, thus in the context of some
         1336       * application thread. Stacked interrupts with a stack below the
         1337       * threshold specified above, (0xf000), will repeatly try to arm the
         1338       * context hook, but since we are in an interrupted interrupt handler,
         1339       * it's highly unlikely the hook has completed.
         1340       * So, possibly only the first arming is succesful and subsequent armings
         1341       * will fail because no task-time thread has run between the stacked
         1342       * interrupts. One hint would be that if the dispatching truely worked,
         1343       * excessive stacked interrupts in VBox would not be a problem.
         1344       * This needs some more investigation.
         1345       */
    1310 1346      DevHelp_ArmCtxHook(0, engine_ctxhook_h);
    1311
         1347 //      DevHelp_EOI(irq);
    1312 1348    } else {
    1313 1349      spin_lock(drv_lock);
    1314 1350      trigger_engine();
         1351 //      DevHelp_EOI(irq);
    1315 1352      spin_unlock(drv_lock);
    1316 1353    }
    1317
         1354    /* disable interrupts to prevent stacking. (See comments above) */
         1355    disable();
    1318 1356    /* complete the interrupt */
    1319 1357    DevHelp_EOI(irq);
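
The re-arming concern raised in the new comment in the diff can be
sketched with a small model. This is hypothetical code, not the DevHelp
API: ctx_hook_model_t, model_arm_ctx_hook(), model_task_time() and
count_failed_arms() are made-up names, and the model simply assumes that
an armed hook rejects further arming until it has run at task time.

```c
#include <stdbool.h>

/* Hypothetical model of a context hook that cannot be re-armed while armed. */
typedef struct {
    bool armed;       /* hook is armed and has not run yet */
    int  arm_ok;      /* successful arm requests */
    int  arm_failed;  /* rejected arm requests */
    int  hook_runs;   /* times the hook ran at task time */
} ctx_hook_model_t;

/* Models an arm request; returns 0 on success, -1 if already armed. */
int model_arm_ctx_hook(ctx_hook_model_t *h)
{
    if (h->armed) {
        h->arm_failed++;
        return -1;
    }
    h->armed = true;
    h->arm_ok++;
    return 0;
}

/* Models a task-time slot: the hook runs once and can be armed again. */
void model_task_time(ctx_hook_model_t *h)
{
    if (h->armed) {
        h->hook_runs++;
        h->armed = false;
    }
}

/* Stacked interrupts each try to arm the hook; no task time in between. */
int count_failed_arms(int stacked_irqs)
{
    ctx_hook_model_t h = { false, 0, 0, 0 };
    int i;

    for (i = 0; i < stacked_irqs; i++)
        model_arm_ctx_hook(&h);
    return h.arm_failed;
}
```

In this model only the first of several stacked interrupts arms the hook
and all later attempts are rejected, which matches the suspicion in the
comment. Whether the real hook behaves this way still needs the
investigation mentioned there.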