skip to main content
research-article
Open access

DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip

Published: 15 December 2023 Publication History

Abstract

With the high-performance requirement of safety-critical real-time tasks, the platforms of many-core processors with high parallelism are widely utilized, where network-on-chip (NoC) is generally employed for inter-core communication due to its scalability and high efficiency. Unfortunately, large uncertainties are suffered on NoCs from both the overly parallel architecture and the distributed scheduling strategy (e.g., wormhole flow control), which complicates the response time upper bounds estimation (i.e., either unsafe or pessimistic). For DAG-based real-time parallel tasks, to solve this problem, we propose DAG-Order, an order-based dynamic DAG scheduling approach, which strictly guarantees NoC real-time services. First, rather than build the new analysis to fit the widely used best-effort wormhole NoC, DAG-Order is built upon a kind of advanced low-latency NoC with SLT (Single-cycle Long-range Traversal) to avoid the unpredictable parallel transmission on the shared source-destination link of wormhole NoCs. Second, DAG-Order is a non-preemptive dynamic scheduling strategy, which jointly considers communication as well as computation workloads, and fits SLT NoC. With such an order-based dynamic scheduling strategy, the provably bound safety is ensured by enforcing certain order constraints among DAG edges/vertices that eliminate the execution-timing anomaly at runtime. Third, the order constraints are further relaxed for higher average-case runtime performance without compromising bound safety. Finally, an effective heuristic algorithm seeking a proper schedule order is developed to tighten the bounds. Experiments on synthetic and realistic benchmarks demonstrate that DAG-Order performs better than the state-of-the-art related scheduling methods.

1 Introduction

Many-core platforms are progressively attractive for embedded real-time applications, such as robotics, avionics, and autonomous vehicles, to satisfy the high parallel performance requirement. In consideration of the scalability need, Network-on-Chip (NoC) [15] is a common interconnection network for many-core platforms. One of the main challenges on real-time NoCs is to estimate the response time upper bounds (simply called bound henceforth) in the design-time schedulability tests, which is related to both the underlying NoC architecture and the scheduling strategy. However, the widely used NoCs are generally designed for best-effort traffics and average-case performance, thus showing poor quality on bounds estimation. Besides, their dynamic scheduling decision may be different in the same situation of concurrently running tasks, which challenges the bounds estimation.
To ensure real-time services, rather than present the analysis to match the widely used wormhole NoCs, we employ SLT NoC (another kind of advanced low-latency NoC with Single-cycle Long-range Traversal) (e.g., [17, 23]), in which the low-latency transmission can be provided, and the process of real-time schedulability analysis becomes easy. Specifically, on SLT NoCs (e.g., SMART [23]/optical [17] NoCs), a flit is able to reach the destination in a single cycle via an exclusively long-range link, which cannot be shared with other flits, eliminating thenotoriously Multi-point Progressive Blocking (MPB) [38] problem of wormhole NoCs naturally. Combined with SLT NoC, we propose a non-preemptive and dynamic scheduling paradigm to guarantee timing predictability and maximize the runtime scheduling flexibility. Dynamic scheduling methods make flexible decisions at runtime, whereas the non-preemptive scheme simplifies the Worst-Case Execution Time (WCET) estimation and incurs less interference. The first challenge of applying the non-preemptive dynamic scheduling strategy to SLT NoC is that the timing anomaly may occur during the runtime execution of parallel tasks, whereas the second challenge is that it is not straightforward to eliminate the execution-timing anomaly without losing too much scheduling flexibility.
In this article, we consider parallel real-time tasks modeled as a Directed Acyclic Graph (DAG). To facilitate analysis, we develop the job and resource abstractions from the DAG structure and the SLT NoCs. We address the aforementioned challenges with DAG-Order in the following steps:
Based on the abstractions, we formally prove that the timing-anomaly-free execution on SLT NoCs is guaranteed for any non-preemptive dynamic scheduler if certain schedule order constraints are enforced among DAG edges/vertices during runtime, and thus the safety requirement of generic bounds is guaranteed.
The order constraints are further relaxed for more scheduling flexibility at runtime, where a higher average-case runtime performance is achieved without compromising the bound safety, saving resource consumption.
We present an effective heuristic algorithm to produce a proper schedule order, with which specific tight bounds are generated by “simulating” the runtime execution of the DAG task on SLT NoCs.
We run the experiments with a variety of synthetic and realistic benchmarks, showing the effectiveness of DAG-Order compared to the State-of-the-Art (SOTA) counterparts with different parameters.
Although the schedulability analysis of real-time NoCs is a mature research domain, such as priority-preemptive wormhole NoCs (e.g., [11, 20, 38]) and TDM NoCs (e.g., [4, 26, 27]), there exist only a few works addressing real-time issues on advanced low-latency NoCs such as SLT NoCs. To our best knowledge, this is the first work to guarantee the bound safety for non-preemptive dynamic schedulers on SLT NoCs, assuming that NoC is the only component in a computer. The novel feature of DAG-Order is that, like a uniprocessor resource, the source-destination long-range NoC link can be physically regarded as an exclusively single resource. Distinct from the SOTA works with unilateral consideration of computation (e.g., [12, 18, 19]) or communication tasks (e.g., [5, 25, 35]), to be general, we consider both of them in this article.
The remainder of this article is organized as follows. Section 2 provides the system model. Section 3 introduces the background and motivation behind the proposed approach. Section 4 presents a certain order constraint enforced among DAG edges/vertices to guarantee timing-anomaly-free execution. Section 5 proposes a specific order at design time with an effective algorithm, whereas Section 6 describes how to use this order at runtime, and details the implementation and overhead. Section 7 presents the experimental evaluation of DAG-Order with synthetic and realistic benchmarks. Section 8 discusses the related work, and Section 9 concludes this article.

2 System Model

2.1 Task Model

This article considers a single task modeled as a DAG, which is a general setup in many fields (e.g., autonomous vehicles, industrial automation). Such a task can be represented by a computation-communication graph \(G\) = (V, E), where \(V = \lbrace v_1, v_2, \ldots , v_n\rbrace\) is a set of vertices (corresponding to computation workload) denoted by rectangles, and \(E \subseteq V \times V\) is a set of directed edges (corresponding to communication workload), shown in Figure 1. Each vertex \(v_i\) of \(G\) represents a segment of program (called subtask), and the number inside the vertex denotes the WCET \(c_i\) of the corresponding subtask. As for each directed communication edge \(m_{i,j} \in E\) , it represents the data dependence between subtasks \(v_i\) and \(v_j\) , and it has an associated worst-case message size also denoted by \(m_{i,j}\) . The unit of this message is a flit [15] (i.e., NoC link width). To granularly consider the influence of shared cache/memory in NoCs, the DAG task model should be extended. Specifically, before and after task computation, the communication workloads from/to shared caches and memory banks (if local cache misses in the worst case) can be abstracted as a directed communication edge, whereas a memory-related access operation can be abstracted as a “computation” vertex into the DAG task model. Besides, we only consider a single DAG task with no conditional structures (e.g., switch-case statement). The frequent notations1 used are summarized in Table 1.
Table 1.
NotationDefinition
\(G\) A real-time parallel DAG task
\(v_i\) \(i^{\it th}\) subtask of \(G\)
\(n\) The number of jobs in \(G\)
\(\tau _i/\tau _1/\tau _{n}\) \(i^{\it th}\) /first/last job of \(G\) = (V, E)
\(m_{i,j}\) A communication message from \(v_i\) to \(v_j\)
\(r_i\) \(i^{\it th}\) NoC router
\(t_a\) \(a^{\it th}\) schedule time point
\(\Xi _{p \rightarrow s}\) Communication route from \({\it PE}_p\) to \({\it PE}_s\)
\(R^{\prime }(R^{\prime \prime })\) Actual response time of \(G\)
\(R\) Estimated bound of \(G\)
\(\Phi _i/\Phi _{p,\lambda }\) The resource allocated to \(\tau _i\)
\({\it Conf}\) Job-to-resource configuration of \(G\)
\(\mathcal {M}_{\tau _i\rightarrow \Phi _i}\) Allocation of job \(\tau _i\) to resource \(\Phi _i\)
\(s_i^{\prime }(s_i^{\prime \prime })/s_i\) Start time of \(\tau _i\) at runtime/design time
\(c_i^{\prime }(c_i^{\prime \prime })/c_i\) Execution time length of \(\tau _i\) at runtime/design time
\(f_i^{\prime }(f_i^{\prime \prime })/f_i\) Finish time of \(\tau _i\) at runtime/design time
\({\it Sch}^{\prime }({\it Sch}^{\prime \prime })\) Runtime schedule order of jobs
\({\it Sch}\) Design-time schedule order of jobs
\(\textsf {dc}(\tau _i)\) Job set that has direct contention with \(\tau _i\)
\(\textsf {ic}(\tau _i)\) Job set that has indirect contention with \(\tau _i\)
\(\textsf {cf}(\tau _i)\) Job set that is free of contention with \(\tau _i\)
\({\it group}_\chi\) \(\chi ^{\it th}\) job set that mutually has direct/indirect contention
\(\mathcal {L}(\tau _i)\) The execution length of the remaining critical path of \(\tau _i\)
\({\it HPC}_{{\it max}}\) The maximum bypass hops per cycle on SLT NoC
Table 1. Frequent Notations Used throughout the Article
Fig. 1.
Fig. 1. A parallel real-time task modeled as a computation-communication DAG \(G\) .

2.2 Runtime Behavior

The DAG task \(G\) is assigned a cluster of dedicated NoC by using federated scheduling [3], which is an effective method to provide theoretically and empirically guaranteed real-time services. The minimum NoC cluster is chosen from the whole platform until the deadline can be guaranteed. A subtask/message is assumed to be scheduled in a non-preemptive manner. Note that to reduce the transmission latency of each message, the processing cores communicate by using the message-passing communication paradigm like that of some commercial NoC products (e.g., Arteris FlexNoC and SonicsGN) instead of the shared-memory paradigm. In other words, the message is sent/received through the NoC to achieve direct core-to-core communication (peer-to-peer communication). Due to the pessimistic estimation of \(c_i\) and \(m_{i,j}\) , the actual execution time \(c_i^{\prime }\) of each subtask \(v_i\) and the actual message size \(m_{i,j}^{\prime }\) of each message \(m_{i,j}\) can be potentially less than their estimated values (i.e., \(c_i^{\prime } \le c_i\) , \(m_{i,j}^{\prime } \le m_{i,j}\) ). To validate the schedulability of \(G\) , let \(R\) refer to the estimated bound which is the maximum duration time between the start time of the first subtask and the finish time of the last subtask, and \(R^{\prime }\) be the actual total execution time at runtime. \(G\) is schedulable if the bound \(R\) is no more than its deadline. With high-quality and timing-predictable bounds, the DAG task’s safety can be ensured and resource allocation can be also well controlled to avoid resource overprovisioning.
Therefore, to ensure the response-time predictability, the goal is to generate a safe and tight bound for a real-time DAG task \(G\) . Specifically, safe indicates the actual response time under any execution scenario is no more than the estimated bound (i.e., \(R^{\prime } \le R\) ), whereas tight means the bound \(R\) should be minimized as possible without the violation of the safe condition under current resource budget constraints.

3 Background AND Motivation

On real-time NoCs, the computation/communication subtasks will experience high resource contention due to the parallel execution. Thus, the actual response time at runtime also shows high timing variation, which is jointly decided by the underlying NoC architecture and the corresponding scheduling strategy. To make the bounds on real-time NoCs accurately estimated, we introduce another off-the-shelf emerging SLT NoCs (e.g., SMART [23]/optical [17] NoCs) and illustrate the scheduling challenges on top of them.

3.1 Background of SLT NoC

The wormhole NoCs [20] are widely utilized and generally developed for best-effort communications. Specifically, they can perform high performance in average latency, but demonstrate significant unsafety and pessimism of worst-case latency for time-critical communications. For example, the worst-case runtime behavior is difficult to identify due to the parallel transmission of multiple packets over a shared link. Such behavior will contribute to the notorious MPB [38] problem due to the limited router buffer size, and thus the generated latency bounds are usually overestimated.
For bounds predictability, rather than fix the erroneous patch of the current analysis for wormhole NoCs, we introduce SLT NoC, which is another kind of NoC (optical NoC, e.g., [17]; electrical NoC, e.g., [23]) featured with single-cycle long-range traversal. To be specific, among the SLT NoCs, in the rest of this article, we employ some terminology and techniques of the representative one, SMART NoC [8, 23], which has been proven as a promising NoC technique for many-core platforms.
SMART (Single-Cycle Multi-hop Asynchronous Repeated Traversal) [2, 23] is an advanced NoC architecture that enables single-cycle long-range (up to \({\it HPC}_{{\it max}}\) hops) traversal by dynamically building a direct bypass from the source to the destination. For long-range bypass traversal, low-swing clockless repeated link and bypass mechanisms are integrated into the on-chip routers. Before packet transmission, the long-range bypass path is pre-built by setting the control signals at runtime. If the bypass path is built, SMART enables the single-cycle exclusive transmission by avoiding buffering at intermediate routers. Recent research works (e.g., [8, 9, 13, 24, 39]) improve its communication performance and employ SMART techniques to develop new works (e.g., [7, 11, 21]).
For a source-destination link, different from wormhole NoC of Figure 2(a) with parallel traversal, a flit of \(\tau _1\) can exclusively traverse the Link \(1\rightarrow 4\) in SLT NoC of Figure 2(b). For example, when \({\it HPC}_{{\it max}}\) = 3, a flit of \(\tau _1\) can reach the destination (Router \(r_4\) ) in a single cycle. Then, like a uniprocessor resource, the source-destination Link \(1\rightarrow 4\) can be regarded as a physically single resource without being shared and logically “1-hop” traversal. The observations of the non-preemptive low-latency SLT NoC are twofold. First, compared to the works (e.g., [20, 38]), MPB is naturally avoided since the flit-level preemption and parallel traversal on a source-destination link are eliminated. Note that for a single DAG task model considered in priority-preemptive wormhole NoCs and compared with the communication task model as in the work of Indrusiak et al. [20], the difference is that the formula of packet latency bounds is slightly different, but the MPB phenomenon still exists. Our proposed scheduling paradigm also shows benefits compared with the single DAG task considered in priority-preemptive wormhole NoCs. Second, in other work (e.g., [35]), the long-range source-destination link is considered as a logically single resource without being shared, which leads to large resource waste at runtime. By contrast, the links within \({\it HPC}_{{\it max}}\) [23] in SLT NoCs are naturally and physically single resources without being shared. Therefore, to ensure the real-time performance of NoCs, we introduce SLT NoC to real-time many-core systems.
Fig. 2.
Fig. 2. \(\tau _1\) transmits on the wormhole and SLT NoCs.

3.2 Unpredictability from Timing Anomaly

On many-core systems, dynamic scheduling schemes [1] are widely utilized, where the scheduling decisions are conducted at runtime. In this article, we adopt the non-preemptive dynamic schedulers due to their advantages in some real cases compared with the preemptive ones. However, the non-preemption behaviors bring challenges to real-time systems, as the safety of estimated bounds may not be ensured due to timing anomalies. Timing anomaly is observed by Graham [16] when employing the non-preemptive dynamic scheduler on multi-core systems. Specifically, the actual response time at runtime can probably be more than the estimated bound under the same dynamic scheduler, when the actual execution time of subtasks is less than the designed WCET. Thus, this phenomenon may make the real-time tasks that are originally schedulable become unschedulable.
When dynamically scheduling the messages on SLT NoC in a non-preemptive manner, we observe that the execution-timing anomaly phenomenon can also occur. We use an example to illustrate such a phenomenon. As shown in Figure 3, the task is scheduled upon a \(3\times 3\) SLT NoC. Assume the allocated NoC resource and the DAG structure are given and fixed, whereas the actual transmission time of the message \(m_{1,3}\) is reduced from the designed WCET 3 to 1.5 at runtime. Counter-intuitively, the actual response time is increased from the design-time estimated bound 10 to 10.5 under the same dynamic scheduler (e.g., breadth-first scheduler). In this case, the originally schedulable real-time task can be potentially unschedulable if the specified deadline is set as 10.
Fig. 3.
Fig. 3. Illustration of the timing-anomaly phenomenon with reduced transmission time of the communication message.
The recent works [12, 30, 36] have addressed the timing anomaly problems for real-time task scheduling on multi-core systems, but their approaches cannot be trivially extended from multi-core to NoC scheduling due to direct and indirect contention in communication resources, showing pessimistic average-case runtime performance. Besides, when jointly scheduling the computation and communication subtasks, all of the possible runtime execution instances can be exponential. Deriving the optimal bound by enumerating all of the runtime instances can be computationally infeasible.

4 Safety Guarantee of Generic Bound

In this section, we propose certain order constraints among DAG edges/vertices to eliminate execution-timing anomaly, and thus safety of generic bounds can be ensured. Then, the order constraints are further relaxed for more scheduling flexibility, with which the average-case runtime performance is improved without compromising the bound safety.

4.1 Preparation

To facilitate analysis of computation and communication subtasks, we develop job and resource abstractions—that is, resource (e.g., computation PE, communication route) from SLT NoC and job (e.g., computation task, communication message) from the DAG structure. This indicates that each job can be mapped to the corresponding resource, like the relationship between task and core in uniprocessor scheduling.
Definition 1 (Resource).
Given an SLT NoC, a resource abstraction \(\Phi _i\) is a computation PE or a given communication route \(\Xi _{p \rightarrow s}\) from predecessor \({\it PE}_p\) to successor \({\it PE}_s\) .
For simplicity, only a part of PEs are shown in Figure 4, and SLT candidates refer to the candidates that can be reached via an SLT path from router \(r_{13}\) . If \(v_p\) is mapped on \({\it PE}_{13}\) , the route \(r_{13} \rightarrow r_{18} \rightarrow r_{23}\) or \({\it PE}_{23}\) is considered as a resource. We uniformly call such a computation PE or communication route as a resource.
Fig. 4.
Fig. 4. An example of the abstracted resource and job definitions on 2D mesh-based SLT NoC, assuming \({\it HPC}_{\it max} = 2\) .
Definition 2 (Job).
For \(G\) = (V, E), a job abstraction refers to a computation subtask \(v_s\) or an inter-subtask message \(m_{p,s}\) from predecessor \(v_p\) to successor \(v_s\) , satisfying \(m_{p,s} \in E\) .
For example, in Figure 4, inter-subtask message \(m_{p,s}\) or computation subtask \(v_s\) is considered as a job denoted by \(\tau _i\) . In Figure 1, the task graph \(G\) can be divided into jobs \(\lbrace v_1\) , \(m_{1,2}\) , \(m_{1,3}\) , \(v_2\) , \(v_3\) , \(m_{2,4}\) , \(m_{2,5}\) , \(m_{3,6}\) , \(v_4\) , \(v_5\) , \(v_6\) , \(m_{4,7}\) , \(m_{5,7}\) , \(m_{6,7}\) , \(v_7\rbrace\) . Note that \(G = (V, E)\) can generate \(n\) (i.e., \(n = |E|+|V|\) ) jobs. For job \(\tau _i\) (e.g., \(m_{p,s}, v_s\) ) and resource \(\Phi _j\) (e.g., \(\Xi _{p \rightarrow s}, {\it PE}_s\) ), \(\tau _i\) can be “executed” on the exclusive \(\Phi _i\) , in which \(v_s\) is allocated to \({\it PE}_s\) and \(m_{p,s}\) is transmitted to \({\it PE}_s\) via the route \(\Xi _{p \rightarrow s}\) , respectively. \(\Phi _j\) , which is assigned to \(\tau _i\) , can be simply called \(\Phi _i\) . For \(\tau _i\) , let \(s_i^{\prime }\) and \(s_i\) be the execution start time points, \(f_i^{\prime }\) and \(f_i\) be the finish time points, and \(c_i^{\prime }\) and \(c_i\) ( \(c_i^{\prime } \le c_i\) ) be the execution/transmission time at the runtime and design-time estimation phases, respectively.
Before the execution of DAG tasks, configuration, such as the job-to-resource allocation and communication route, should be given. We define the configuration as follows.
Definition 3 (Job-to-Resource Configuration).
For \(G\) = (V, E) on SLT NoC, the job-to-resource configuration is defined as
\({\it Conf} = \lbrace \mathcal {M}_{\tau _1\rightarrow \Phi _1}, \ldots , \mathcal {M}_{\tau _i\rightarrow \Phi _i}, \ldots , \mathcal {M}_{\tau _{n}\rightarrow \Phi _{n}}\rbrace\) ,
where \(\mathcal {M}_{\tau _i\rightarrow \Phi _i}\) refers to the allocation of the job \(\tau _i\) to the resource \(\Phi _i\) , and \(n\) (i.e., \(n = |E|+|V|\) ) is the number of computation and communication jobs of \(G\) .
The bound \(R\) must be compared with the deadline to validate the schedulability. In this article, assume \(G\) is scheduled by a certain non-preemptive dynamic scheduler and configured by \({\it Conf}\) , and \(R\) is generated by “simulating” such scheduler, assuming the jobs are executed for their WCETs. Specifically, the simulated bound of \(G\) is \(R = 10\) in Figure 3(a). Intuitively, the actual response time \(R^{\prime }\) at runtime would be less than estimated bound \(R\) with the same scheduler, \(R^{\prime } \le R\) , since some jobs may be executed for less than their WCETs. However, the actual schedule result may be the opposite.

4.2 Partial Schedule Order

In the previous subsection, the estimated bound \(R\) is generated by simulating the dynamic scheduler at design time. However, due to a variety of execution instances, \(R\) can be potentially unsafe. To uniquely identify a particular runtime execution instance of a DAG task, we introduce the following definition of schedule order of jobs.
Definition 4 (Schedule Order).
For \(G\) = (V, E) on SLT NoC, the schedule order \({\it Sch}\) is the job sequence by scheduling time points, where the jobs start execution, in ascending order:
\({\it Sch}\) = \(\lbrace \tau _1\) \(\rightsquigarrow\) \(\tau _2\) \(\rightsquigarrow\) \(\tau _3\) \(\rightsquigarrow\) ... \(\rightsquigarrow\) \(\tau _i\) \(\rightsquigarrow\) ... \(\rightsquigarrow\) \(\tau _{n}\rbrace\) .
For instance, the DAG task in Figure 3 can be divided into jobs \(\lbrace v_1\) , \(m_{1,2}\) , \(m_{1,3}\) , \(v_2\) , \(v_3\) , \(m_{2,4}\) , \(m_{3,5}\) , \(v_4\) , \(v_5\) , \(m_{4,6}\) , \(m_{5,6}\) , \(v_6\rbrace\) , sequentially denoted by \(\tau _1\) , \(\tau _2\) , ..., \(\tau _{12}\) . In Figure 3(a), the design-time schedule order \({\it Sch}\) of \(G\) is as follows:
\({\it Sch} = \lbrace \tau _1 \rightsquigarrow \tau _2 \rightsquigarrow \tau _3 \rightsquigarrow \tau _4 \rightsquigarrow \tau _5 \rightsquigarrow \tau _6\) \(\rightsquigarrow \tau _7 \rightsquigarrow \tau _8 \rightsquigarrow \tau _9 \rightsquigarrow \tau _{10} \rightsquigarrow \tau _{11} \rightsquigarrow \tau _{12}\rbrace\) .
Specifically, “ \(\tau _2 \rightsquigarrow \tau _3\) ” means that \(\tau _2\) (i.e., \(m_{1,2}\) ) starts execution before or not later than \(\tau _3\) (i.e., \(m_{1,3}\) ). The bound \(R\) of the DAG task in Figure 3(a) is 10, whereas in Figure 3(b), the runtime schedule order is as follows:
\({\it Sch}^{\prime } = \lbrace \tau _1 \rightsquigarrow \tau _2 \rightsquigarrow \tau _3 \rightsquigarrow \tau _5 \rightsquigarrow \tau _4 \rightsquigarrow \tau _7\) \(\rightsquigarrow \tau _6 \rightsquigarrow \tau _9 \rightsquigarrow \tau _8 \rightsquigarrow \tau _{11} \rightsquigarrow \tau _{10} \rightsquigarrow \tau _{12}\rbrace\) .
In Figure 3(b), the actual response time \(R^{\prime } = 10.5\) , which is more than the bound \(R = 10\) . Thus, \(R\) is timing-anomaly and unsafe, and \(G\) that is originally regarded as schedulable becomes unschedulable if the deadline is set as 10, leading to a false-positive phenomenon. The observations are twofold:
When \(G\) is rescheduled at runtime, some jobs may not execute up to their WCETs. The earlier execution behavior changes the design-time order that is used to obtain the estimated bound. Specifically, in Figure 3, when the time of \(\tau _3 = \langle m_{1,3},\) \(v_3 \rangle\) is reduced from 3 to 1.5, the sub-schedule order is changed from \((\tau _4 \rightsquigarrow \tau _5 \rightsquigarrow \tau _6\) \(\rightsquigarrow \tau _7 \rightsquigarrow \tau _8\) \(\rightsquigarrow \tau _9 \rightsquigarrow \tau _{10} \rightsquigarrow \tau _{11}) \subseteq {\it Sch}\) to \((\tau _5 \rightsquigarrow \tau _4 \rightsquigarrow \tau _7\) \(\rightsquigarrow\) \(\tau _6\) \(\rightsquigarrow\) \(\tau _9\) \(\rightsquigarrow\) \(\tau _8\) \(\rightsquigarrow\) \(\tau _{11}\) \(\rightsquigarrow\) \(\tau _{10})\) \(\subseteq {\it Sch}^{\prime }\) .
In addition, a dynamic scheduler may behave differently in the same situation, even if the jobs are actually executed for their WCETs. For example, if \(G\) is scheduled by a breadth-first scheduler, the decision to choose which job from the ready queue (including the jobs in the same layer) to be scheduled first may be different.
In the following, we propose a polynomial-time approach, with which safe bounds can be obtained. Specifically, given configuration \({\it Conf}\) and any non-preemptive dynamic scheduler, we present a schedule order constraint \({\it Sch}\) among jobs to ensure safe execution. By the aforementioned observations, the timing-anomaly execution is caused by the change of schedule order used to estimate the bound. A feasible solution is to consistently keep the same schedule order of jobs at design time and runtime. Before that, motivated by the work [32] on priority-preemptive wormhole NoCs, we define the following resource contentions regarding the spatial relationship between computation/communication jobs, including direct, indirect contention, and contention-free relationships.
Definition 5 (Direct Contention).
For the analyzed job \(\tau _i\) and any other job \(\tau _j (\forall j \in \lbrace 1, 2, \ldots , n\rbrace \wedge (i \ne j))\) , if
\((\mathcal {M}_{\tau _i\rightarrow \Phi _i}) \wedge (\mathcal {M}_{\tau _j\rightarrow \Phi _j}) \wedge (\Phi _i \cap \Phi _j \ne \emptyset) \wedge (\tau _j \rightsquigarrow \tau _i),\)
then \(\tau _i\) suffers the direct contention from \(\tau _j\) . Such direct contention (dc) is denoted by \(\tau _j \in \textsf {dc}(\tau _i)\) , where \(\textsf {dc}(\tau _i)\) refers to the set of jobs that have direct contention with \(\tau _i\) .
Example 1.
In Figure 5(a), assume \(\tau _1\) and \(\tau _4\) have \(\Phi _1 = r_{1} \rightarrow r_{2}\) , \(\Phi _4 = r_{1} \rightarrow r_{2} \rightarrow r_{3}\) , meeting \(\Phi _1 \cap \Phi _4 \ne \emptyset\) and \(\tau _1 \rightsquigarrow \tau _4\) , thus \(\tau _4\) suffers the direct contention (formally \(\tau _1 \in \textsf {dc}(\tau _4)\) ) and blocking from \(\tau _1\) in the worst case. Similar example for the relationship between computation jobs.
Fig. 5.
Fig. 5. The contention relationship between communication job \(\tau _1\) and the analyzed communication job \(\tau _4\) .
Definition 6 (Indirect Contention).
For the analyzed job \(\tau _i\) and any other job \(\tau _j (\forall j \in \lbrace 1, 2, \ldots , n\rbrace \wedge (i \ne j))\) , if
\((\mathcal {M}_{\tau _i\rightarrow \Phi _i}) \wedge (\mathcal {M}_{\tau _j\rightarrow \Phi _j}) \wedge\) \((\tau _j \not\in \textsf {dc}(\tau _i))\) \(\wedge (\exists \tau _k \in \textsf {dc}(\tau _i), \tau _j \in \textsf {dc}(\tau _k) \cup \textsf {ic}(\tau _k)),\)
then \(\tau _i\) suffers the indirect contention from \(\tau _j\) . Such indirect contention (ic) is denoted by \(\tau _j \in \textsf {ic}(\tau _i)\) , where \(\textsf {ic}(\tau _i)\) refers to the set of jobs that have indirect contention with \(\tau _i\) via an(multiple) intermediate job(s) \(\tau _k\) .
Example 2.
In Figure 5(b), \(\tau _1\) , \(\tau _2\) \(\not\in\) \(\textsf {dc}(\tau _4)\) . \(\textsf {dc}(\tau _4) = \lbrace \tau _3\rbrace\) . Assume \(\exists \tau _3\) \(\in\) \(\textsf {dc}(\tau _4)\) , \(\textsf {dc}(\tau _3) = \lbrace \tau _2\rbrace\) , then \(\tau _2 \in \textsf {ic}(\tau _4)\) . Similarly, \(\tau _1 \in \textsf {ic}(\tau _3)\) . By definition, \(\textsf {ic}(\tau _4) = \textsf {dc}(\tau _3) \cup \textsf {ic}(\tau _3) = \lbrace \tau _1, \tau _2\rbrace\) . \(\tau _4\) suffers the indirect contention and thus blocking from \(\tau _1\) and \(\tau _2\) via \(\tau _3\) in the worst case. Note that such a relationship does not exist between computation jobs.
Definition 7 (Contention Free).
For the analyzed job \(\tau _i\) and a job \(\tau _j (\forall j \in \lbrace 1, 2, \ldots , n\rbrace \wedge (i \ne j))\) , if
\((\mathcal {M}_{\tau _i\rightarrow \Phi _i}) \wedge (\mathcal {M}_{\tau _j\rightarrow \Phi _j}) \wedge\) \((\tau _j \not\in \textsf {dc}(\tau _i)) \wedge (\tau _j \not\in \textsf {ic}(\tau _i)),\)
then \(\tau _i\) suffers no contention from \(\tau _j\) . Such contention-free (cf) relationship is denoted by \(\tau _j \in \textsf {cf}(\tau _i)\) , where \(\textsf {cf}(\tau _i)\) denotes the set of jobs that are free of contention with \(\tau _i\) .
Example 3.
In Figure 5(c), assume \(\textsf {dc}(\tau _4) = \emptyset\) and \(\textsf {ic}(\tau _4) = \emptyset\) . Thus, \(\tau _1 \not\in \textsf {dc}(\tau _4)\) and \(\tau _1 \not\in \textsf {ic}(\tau _4)\) . \(\tau _4\) suffers no contention and blocking from \(\tau _1\) . A similar example of the relationship between computation jobs.
Note that since there is no resource contention (i.e., \(\Phi _i \cap \Phi _j == \emptyset\) ) between computation job \(\tau _i\) and communication job \(\tau _j\) , \(\tau _i\) and \(\tau _j\) have the contention-free relationship. To granularly judge the resource contention, we divide all of the jobs into multiple groups according to the spatial relationships defined earlier. Without loss of generality, \(\tau _i\) and \(\tau _j\) ( \(\forall i, j \in \lbrace 1, 2, \ldots , |V|+|E|\rbrace (i \ne j)\) are mapped to the same group if \(\tau _j \in \textsf {dc}(\tau _i)\cup \textsf {ic}(\tau _i)\) or \(\tau _i \in \textsf {dc}(\tau _j)\cup \textsf {ic}(\tau _j)\) . For example, in Figure 5(b), \(\tau _1\) , \(\tau _2\) , \(\tau _3,\) and \(\tau _4\) should belong to the same group due to the direct or indirect contention relationship, whereas in Figure 5(c), \(\tau _1\) and \(\tau _4\) belong to different groups due to the absence of direct and indirect contention. Suppose all jobs are divided into \(\chi\) groups (i.e., \({\it group}_1\) , \({\it group}_2\) , ..., \({\it group}_\chi\) ), and the schedule order in \({\it group}_\chi\) is denoted by \({\it sch}({\it group}_\chi)\) . Thus, the total schedule order consists of multiple independent partial orders (i.e., \({\it Sch}^{\prime }\) = \({\it sch}({\it group}_1)\cup {\it sch}({\it group}_2)\cup\) ... \(\cup {\it sch}({\it group}_\chi)\) ). Let \(R^{\prime }\) be the actual response time of \(G\) enforced with a certain order constraint \({\it Sch}^{\prime }\) .
Lemma 1.
If \((\tau _j \rightsquigarrow \tau _i) \subseteq {\it sch}({\it group}_\mu)\) and \((\tau _k \rightsquigarrow \tau _j) \subseteq {\it sch}({\it group}_\mu)\) , \((\tau _k \rightsquigarrow \tau _j \rightsquigarrow \tau _i) \subseteq {\it sch}({\it group}_\mu)\) .
Proof.
By Definition 4, “ \((\tau _j \rightsquigarrow \tau _i) \subseteq {\it sch}({\it group}_\mu)\) ” indicates that \(\tau _j\) starts execution before or not later than \(\tau _i\) in \({\it sch}({\it group}_\mu)\) (e.g., \(s_j \le s_i\) ). With \(\tau _j \rightsquigarrow \tau _i\) and \(\tau _k \rightsquigarrow \tau _j\) , the start times can be sorted in ascending order with \(s_k \le s_j \le s_i\) . Thus, \((\tau _k \rightsquigarrow \tau _j \rightsquigarrow \tau _i) \subseteq {\it sch}({\it group}_\mu)\) and the lemma is true. □
Theorem 1.
For \(G\) = (V, E), configured by \({\it Conf}\) and scheduled by any given non-preemptive dynamic scheduler, if
\({\it Sch}^{\prime }\) = \({\it sch}({\it group}_1)\cup {\it sch}({\it group}_2)\ \cup \cdots \cup \ {\it sch}({\it group}_\chi),\)
where sub-order \({\it sch}({\it group}_\mu) \subseteq {\it Sch} (\forall \mu \in \lbrace 1, 2, \ldots , \chi \rbrace)\) ,
then
\(R(G)^{\prime } \le R(G).\)
Proof.
Assume \(s_i\) and \(s_i^{\prime }\) are the start time points; \(f_i\) and \(f_i^{\prime }\) are the finish time points; and \(c_i\) and \(c_i^{\prime }\) ( \(c_i^{\prime } \le c_i\) ) are the execution time of job \(\tau _i\) at design time and runtime, respectively. To prove the bound safety ( \(R(G)^{\prime } \le R(G)\) ), it is sufficient to prove the actual finish time \(f_n^{\prime }\) of the \({n}^{th}/last\) job in DAG task is no more than that ( \(f_n\) ) of the estimated bound, since \(R(G)^{\prime } = f_n^{\prime } + \mathcal {C}_{\it ctrl}^{\prime }\) and \(R(G) = f_n + \mathcal {C}_{\it ctrl}\) :
\(f_n^{\prime } + \mathcal {C}_{\it ctrl}^{\prime } \le f_n + \mathcal {C}_{\it ctrl}\) \(\Rightarrow\) \(f_n^{\prime } \le f_n\) .
By assumption, \(\mathcal {C}_{\it ctrl}\) and \(\mathcal {C}_{\it ctrl}^{\prime }\) are the worst-case and actual time deviation due to the job scheduling and NoC control operations ( \(\mathcal {C}_{\it ctrl}^{\prime } \le \mathcal {C}_{\it ctrl}\) ), respectively. We use the mathematical induction to prove \(f_i^{\prime } \le f_i\) in the following.
Base Case: When \(i = 1\) , since the first job \(\tau _1\) starts execution at time 0 at design time and runtime, \(s_1^{\prime } = s_1 = 0\) . Then \(s_1^{\prime } + c_1^{\prime } \le s_1 + c_1\) , namely \(f_1^{\prime } \le f_1\) .
Inductive Step: Assume \(f_j^{\prime } \le f_j (j = 1, 2, \ldots , i-1)\) , and we will prove \(f_i^{\prime } \le f_i\) in the following.
During runtime execution, the job \(\tau _i\) will be scheduled if the following conditions are met at the same time: (i) precedence constraints of DAG structure are met (i.e., all of the predecessor jobs must finish execution), (ii) the allocated resource \(\Phi _i\) is available, and (iii) schedule order constraints are met (i.e., \({\it Sch}^{\prime }\) ). We prove this theorem in the following steps:
As for condition (i), assume \(\tau _\psi\) ( \(\psi \le i-1\) ) is the last predecessor of \(\tau _i\) in \({\it Sch}\) , indicating \(\tau _i\) starts execution only after \(\tau _\psi\) completes execution—that is, \(f_\psi \le s_i\) . At the time \(s_i\) at runtime, all of the predecessors must have finished execution, since \(f_\psi ^{\prime } \le f_\psi \le s_i\) by assumption.
As for condition (ii), assume \(\tau _\varphi\) ( \(\varphi \le i-1\) ) is the last job that fully or partly shares the resource \(\Phi _i\) in \({\it Sch}\) . \(\Phi _i\) is free at \(s_i\) at runtime since \(f_\varphi ^{\prime } \le f_\varphi \le s_i\) .
As for condition (iii), assume \(\tau _i \in {\it group}_\mu\) , and there are two cases to be considered according to the spatial relationship between \(\tau _\omega\) ( \(\forall \omega \le i-1\) ) and \(\tau _i\) :
(1)
\(\tau _\omega \in {\it group}_\mu\) . This indicates that \(\tau _\omega\) contends with \(\tau _i\) directly or indirectly, satisfying \(\tau _\omega \rightsquigarrow \tau _i \subseteq {\it sch}({\it group}_\mu)\) . \(\tau _\omega\) started/finished execution at \(s_i\) with the design-time schedule order \({\it sch}({\it group}_\mu)\) .
(2)
\(\tau _\omega \not\in {\it group}_\mu\) . \(\tau _\omega\) is free of contention with \(\tau _i\) . No order is needed between \(\tau _\omega\) and \(\tau _i\) .
Thus, the three conditions (i–iii) are met at \(s_i\) at runtime. With the greed of dynamic scheduler, \(\tau _i\) is scheduled and starts execution before or not later than \(s_i\) at runtime, namely \(s_i^{\prime } \le s_i\) ; we get \(f_i^{\prime } \le f_i\) since \(f_i^{\prime } = s_i^{\prime } + c_i^{\prime }\) , \(f_i = s_i + c_i\) and \(c_i^{\prime } \le c_i\) . As a result, the inequation \(f_i^{\prime } \le f_i\) is met under the order constraint for any non-preemptive dynamic scheduler. When \(i = n\) (i.e., \(n = |V|+|E|\) ), the theorem is proved. □
With Theorem 1, the timing-anomaly-free execution can be ensured by enforcing the consistent order at design time and runtime, and thus the generic bounds (for any non-preemptive dynamic scheduler) are safe (i.e., \(R^{\prime } \le R\) ).

4.3 Relaxed Partial Order

Based on Theorem 1, the essential idea behind our runtime timing-anomaly-free guarantee of a DAG task is to insert extra schedule constraints between jobs if resource contention occurs. Even if the partial order constraint \({\it Sch}^{\prime }\) can guarantee the timing-anomaly-free execution, the inserted constraint in \({\it Sch}^{\prime }\) is pessimistic due to coarse-grained consideration (i.e., only in the spatial dimension).
We revisit the four jobs of Figure 5(b), meeting \((\tau _1 \rightsquigarrow \tau _2 \rightsquigarrow \tau _3 \rightsquigarrow \tau _4)\) due to the direct and indirect contention relationship. For \(\tau _4\) , \(\textsf {dc}(\tau _4) = \lbrace \tau _3\rbrace\) , \(\textsf {ic}(\tau _4) = \textsf {dc}(\tau _3) \cup \textsf {ic}(\tau _3) = \lbrace \tau _1, \tau _2\rbrace\) . By Theorem 1, \(\tau _4\) starts execution after \(\tau _3\) that also after \(\tau _2\) , whereas \(\tau _2\) starts execution after \(\tau _1\) , resulting in a dependent order-chain, called the cascading effect. \(\tau _4\) finally starts execution after the distant \(\tau _1\) via intermediate jobs \(\tau _2\) and \(\tau _3\) . Suppose three time points \(t_a\) , \(t_b\) , \(t_c\) meet \(t_a \lt t_b \le t_c\) . \(t_a\) is the current scheduling point in which \(\tau _1\) and \(\tau _4\) are ready, whereas \(\tau _3\) becomes ready at \(t_c\) . There is a chance that \(\tau _4\) can start execution at \(t_a\) (in parallel with \(\tau _1\) ) and finish execution at \(t_b\) not later than \(t_c\) . Although the time interval ( \(t_a\) , \(t_b\) ) is idle and available, it cannot be utilized in the originally partial order, leading to poor communication link utilization.
To reduce such cascading effect, without loss of generality, assume \(\tau _j \in \textsf {ic}(\tau _i)\) , \(\exists \tau _k \in \textsf {dc}(\tau _i), \tau _j \in \textsf {dc}(\tau _k)\) . \(\tau _j \rightsquigarrow \tau _k \rightsquigarrow \tau _i\) (e.g., \(\tau _2 \in \textsf {ic}(\tau _4)\) , \(\tau _3 \in \textsf {dc}(\tau _4), \tau _2 \in \textsf {dc}(\tau _3)\) in Figure 5(b)). The actual start time of the analyzed job \(\tau _i\) is indirectly dependent on \(\tau _j\) via \(\tau _k\) , meaning that \(\Phi _i\) (i.e., Link \(r_4 \rightarrow r_5\) in Figure 5(b)) may not be utilized even if it is idle. In fact, the runtime performance can be strictly improved by allowing a part of idle resources, which will not cause temporal and spatial contentions to the unexecuted jobs \(\tau _k\) (i.e., \(\in \textsf {dc}(\tau _i)\) ), to be utilized before \(s_k^{\prime }\) at runtime.
In this subsection, for the analyzed ready job \(\tau _i\) at scheduling point \(t_{\it sp}\) , we conditionally remove the preceding partial order constraint \(\tau _k \rightsquigarrow \tau _i\) (i.e., \(\tau _i\) can instead start execution before \(\tau _k\) ) under the following condition \(\xi _i^{\prime }\) :
(1)
\(\Phi _i\) is available at \(t_{\it sp}\) ,
(2)
\(\forall \tau _k \in \mathcal {Q}_{\it hp}\) and \(\tau _k \in \textsf {dc}(\tau _i)\) , \(t_{\it sp} + c_i^{\prime \prime } \le s_k^{\prime }\) ,
where
\(\Phi _i\) is the assigned resource (e.g., route, PE) of \(\tau _i\) and \(t_{\it sp}\) refers to the current scheduling point,
\(\mathcal {Q}_{\it hp}\) records the higher-priority (hp) jobs (i.e., a job \(\tau _w\) with a smaller design-time start time \(s_w\) has a higher priority) that are not dispatched, and
\(c_i^{\prime }\) ( \(c_i^{\prime \prime }\) ) is the execution time of \(\tau _i\) at runtime and \(s_k^{\prime }\) ( \(s_k^{\prime \prime }\) ) is the start time of \(\tau _k\) at runtime.
Note that we assume the notations attached with one and two apostrophes in the top right corner denote the runtime parameters when enforced with the original partial (i.e., \({\it Sch}^{\prime }\) ) and relaxed partial (i.e., \({\it Sch}^{\prime \prime }\) ) order constraints, respectively. Specifically, \(R^{\prime \prime }\) denotes the actual response time with the relaxed partial order \({\it Sch}^{\prime \prime }\) , whereas \(R^{\prime }\) denotes the original partial order \({\it Sch}^{\prime }\) . If the partial schedule order \(\tau _k \rightsquigarrow \tau _i\) is broken, \(\tau _k\) starts execution after the contending resources are released by \(\tau _i\) . The new actual start time of \(\tau _k\) with the relaxed order \({\it Sch}^{\prime \prime }\) is denoted by \(s_k^{\prime \prime }\) , where \(s_k^{\prime \prime } \ge t_{\it sp} (s_i^{\prime \prime }) + c_i^{\prime \prime }\) , \(c_i^{\prime \prime } = c_i^{\prime }\) .
Lemma 2.
For \(G\) = (V, E), if \(s_w^{\prime \prime } \le s_w\) for \(\tau _w\) ( \(\forall w \in \lbrace 1, 2, \ldots , |V|+|E|\rbrace\) ), the runtime execution enforced with the relaxed partial order is free of timing anomaly.
Proof.
For any job \(\tau _w\) ( \(\forall w \in \lbrace 1, 2, \ldots , |V|+|E|\rbrace\) ), since \(s_w^{\prime \prime } \le s_w\) , the runtime execution is free of timing anomaly by Theorem 1 (i.e., replace \(s_w^{\prime \prime }\) with \(s_w^{\prime }\) ). □
Theorem 2.
For each job \(\tau _k\) ( \(\forall k \in \lbrace 1, 2, \ldots , n\rbrace\) ), \(R^{\prime \prime }\) strictly dominates \(R^{\prime }\) when \(s_k^{\prime \prime } \le s_k^{\prime }\) , whereas \(R^{\prime \prime }\) partly dominates \(R^{\prime }\) when \(s_k^{\prime } \le s_k^{\prime \prime } \le s_k\) .
Proof.
(i) If \(s_k^{\prime \prime } \le s_k^{\prime }\) under \(\xi _i^{\prime }\) , it means \(\tau _k\) starts execution at \(s_k^{\prime \prime }\) after order relaxation, which is not later than \(s_k^{\prime }\) of the original partial order. We prove this case in two steps: (i.1) when \(s_k^{\prime \prime } \le s_k^{\prime }\) , similar to Theorem 1, \(R^{\prime \prime }\) is no larger than \(R^{\prime }\) (formally \(R^{\prime \prime } \le R^{\prime }\) ) with the mathematical induction, and (i.2) when \(s_k^{\prime \prime } \lt s_k^{\prime }\) , \(R^{\prime \prime }\) is strictly less than \(R^{\prime }\) (formally \(R^{\prime \prime } \lt R^{\prime }\) ). Therefore, when \(s_k^{\prime \prime } \le s_k^{\prime }\) , \(R^{\prime \prime }\) dominates \(R^{\prime }\) .
(ii) If \(s_k^{\prime } \le s_k^{\prime \prime } \le s_k\) (i.e., remove the order constraint \(\tau _k \rightsquigarrow \tau _i\) by ignoring the second condition of \(\xi _i^{\prime }\) ). Since \(\tau _k \in \textsf {dc}(\tau _i)\) , \(\tau _k\) and \(\tau _i\) must be executed sequentially. Assume \(\tau _k\) and \(\tau _i\) are regarded as a single job \(\tau _{\it new}\) :
(1)
If \(s_i^{\prime \prime } \le s_k^{\prime } \le s_k^{\prime \prime } \le s_i^{\prime }\) in Figure 6(a), we have \(s_i^{\prime } - s_i^{\prime \prime } \ge s_i^{\prime } - s_k^{\prime } \ge s_k^{\prime \prime } - s_k^{\prime }\) . The start time reduction (i.e., \(s_i^{\prime } - s_i^{\prime \prime }\) ) of \(\tau _i\) is more than the start time increasing (i.e., \(s_k^{\prime \prime } - s_k^{\prime }\) ) of \(\tau _k\) . \(\tau _{\it new}\) finishes earlier in the relaxed order than that of original order, namely \(f_k^{\prime \prime } \lt f_i^{\prime }\) . Assume the schedule order of the remaining unexecuted jobs would not be changed, \(R^{\prime \prime } \lt R^{\prime }\) can hold since the remaining jobs start execution ahead of the original start time \(f_i^{\prime }\) .
Fig. 6.
Fig. 6. Runtime timeline of original and relaxed partial order, where \(R^{\prime \prime }\) partly dominates \(R^{\prime }\) when \(s_k^{\prime } \le s_k^{\prime \prime } \le s_k\) .
(2)
If \(s_k^{\prime } \le s_i^{\prime \prime } \le s_i^{\prime } \le s_k^{\prime \prime }\) , shown in Figure 6(b), we have \(s_k^{\prime \prime } - s_k^{\prime } \ge s_k^{\prime \prime } - s_i^{\prime \prime } \ge s_i^{\prime } - s_i^{\prime \prime }\) . This case occurs when the orders of previously scheduled jobs are relaxed and then \(\tau _k\) is not ready at \(t_{\it sp}\) . The start time increasing (i.e., \(s_k^{\prime \prime } - s_k^{\prime }\) ) of \(\tau _k\) is more than the start time reduction (i.e., \(s_i^{\prime } - s_i^{\prime \prime }\) ) of \(\tau _i\) . \(\tau _{\it new}\) finishes later in the relaxed order than that of original order, namely \(f_i^{\prime } \lt f_k^{\prime \prime }\) . Similar to (1), \(R^{\prime \prime } \gt R^{\prime }\) can hold.
Therefore, \(R^{\prime \prime }\) partly dominates \(R^{\prime }\) when \(s_k^{\prime } \le s_k^{\prime \prime } \le s_k\) . Since \(s_k^{\prime \prime } \le s_k\) , timing-anomaly-free execution is guaranteed by Lemma 2. The theorem is proved. □
Following Theorem 2, the runtime performance of the original partial order can be strictly improved (i.e., \(R^{\prime \prime } \lt R^{\prime }\) ) without timing anomaly, when a part of order constraints (e.g., \(\tau _k \rightsquigarrow \tau _i \subseteq {\it Sch}^{\prime \prime }\) ) is removed under the condition \(\xi _i^{\prime }\) . This is because \(\tau _k\) will still start execution not later than \(s_k^{\prime }\) (i.e., \(s_k^{\prime \prime } \le s_k^{\prime }\) ) when the original order is broken by \(\tau _i\) . Moreover, there is a chance that the runtime performance can also be strictly improved when ignoring the second condition of \(\xi _i^{\prime }\) (i.e., Figure 6(a)) and removing part of orders. When the runtime execution time of the current DAG task is reduced, the computation/communication resources can be reassigned to other DAG tasks as early as possible if the workload is heavy, saving resource consumption.

5 Exploration of Specific Tight Bound

In Section 4, when a DAG task is scheduled by any non-preemptive dynamic scheduler with configuration \({\it Conf}\) , the timing-anomaly-free execution is eliminated if the order constraints \({\it Sch}\) are enforced among jobs at runtime, without losing too much scheduling flexibility. Thus, the safety of generic bounds can be ensured for any \({\it Conf}\) and \({\it Sch}\) . To further generate a specific tight bound, we present a heuristic algorithm to seek suitable parameters \({\it Conf}\) and \({\it Sch}\) .
Since the design space of the mapping can be exponentially increased with the increase of DAG/NoC size, obtaining the optimal parameter \({\it Conf}\) is an NP-hard problem [22]. Specifically, \(G\) may produce an exponential number of schedule orders (i.e., topological order) when scheduled by a dynamic scheduler. Thus, for the large-size DAG and SLT NoC, obtaining the tightest bound by using the optimal \({\it Conf}\) and \({\it Sch}\) parameters is computationally infeasible to perform a search in the spatial and temporal dimensions to find the best solution among the design space within the accepted time. Instead, we propose suitable \({\it Conf}\) and \({\it Sch}\) parameters with heuristic algorithms to make the DAG bound as tight as possible. Before presenting the heuristic algorithm, we give the following definitions.
Definition 8 (Remaining Path).
For \(\tau _i\) of \(G\) = (V, E), its remaining path \(\Lambda\) is defined as the job sequence from \(\tau _i\) to \(\tau _n\) . The schedule length of \(\Lambda\) , including the execution time of computation and communication jobs, is denoted by \(\mathcal {L}(\Lambda)\) .
Definition 9 (Remaining Critical Path).
For \(\tau _i\) of \(G\) = (V, E), its remaining critical path \(\Lambda _c\) is a path that satisfies the following condition:
\(\forall \Lambda _{\it p} \in S_\mathcal {P}(\tau _i), \mathcal {L}(\Lambda _c) \ge \mathcal {L}(\Lambda _{\it p}),\)
where \(S_\mathcal {P}(\tau _i)\) is the set of paths from \(\tau _i\) to the last job \(\tau _n\) .
With Definition 9, the execution length of remaining critical path \(\Lambda _c\) for \(\tau _i\) is \(\mathcal {L}(\tau _i) = \mathcal {L}(\Lambda _c)\) . To decrease the runtime non-determinism from pipeline transmission on a single communication link \(\Phi _i\) , we assume all of the communication jobs are assigned to the communication links that are within the \({\it HPC}_{{\it max}}\) (i.e., \(|\Phi _i| \le {\it HPC}_{{\it max}}\) ). The remaining worst-case computation and communication time can be calculated together since the worst-case latency of a communication job remains the same for any job-to-resource configuration and any DOR (Dimension-Order Route) [15] within \({\it HPC}_{{\it max}}\) .
To demonstrate the remaining critical path \(\Lambda _c\) , assume it costs \(t_r = 1\) to establish the SLT path and the unit data size of messages (e.g., flit) is the width of NoC links. Thus, the intra-DAG message size \(m_{i,j}\) can be transformed into a time parameter with a time offset constant (i.e., \(t_r\) ). Specifically, for \(v_2\) in Figure 1, \(\Lambda _1\) = \(\lbrace v_2, m_{2,4}, v_4, m_{4,7}, v_7\rbrace\) , \(\mathcal {L}(\Lambda _1)= 16\) ; \(\Lambda _2\) = \(\lbrace v_2, m_{2,5}, v_5, m_{5,7}, v_7\rbrace\) , \(\mathcal {L}(\Lambda _2) = 14\) . Thus, \(\Lambda _c\) of \(v_2\) is \(\Lambda _1\) where \(\mathcal {L}(v_2) = \mathcal {L}(\Lambda _c) = 16\) .
Definition 10 (Resource Pair).
On SLT NoC, a resource pair \(\Phi _{p,\lambda } = \langle \Xi _{p \rightarrow s}, {\it PE}_s \rangle\) is a pair of a given route \(\Xi _{p \rightarrow s}\) from PE \({\it PE}_p\) to PE \({\it PE}_s\) , and \({\it PE}_s\) , where \(\Phi _{p,\lambda }\) is the \(\lambda ^{\it th}\) resource pair candidate of the predecessor job \(v_p\) .
In Figure 4, \(\Phi _{p,\lambda } = \langle r_{13} \rightarrow r_{18} \rightarrow r_{23}, {\it PE}_{23} \rangle\) is the \(\lambda ^{th}\) resource pair of predecessor computation job \(v_p\) .
Definition 11 (Job Pair).
For \(G\) = (V, E), a job pair \(\langle m_{p,s}, v_s \rangle\) is a pair of inter-subtask message \(m_{p,s}\) and successor subtask \(v_s\) , satisfying \(m_{p,s} \in E\) .
For instance, in Figure 1, the task graph \(G\) can be divided into job pairs \(\lbrace \langle null,\) \(v_1 \rangle ,\) \(\langle m_{1,2},\) \(v_2 \rangle ,\) \(\ldots\) , \(\langle m_{6,7}, null\rangle\) , \(\langle m_{4,7}, v_7 \rangle \rbrace\) . Note that the task \(G = (V, E)\) can produce \(|E|+1\) job pairs. For job pair \(\tau _i = \langle m_{p,s}, v_s \rangle\) and resource pair \(\Phi _{p,\lambda } = \langle \Xi _{p \rightarrow s}, {\it PE}_s \rangle\) , \(v_s\) can be allocated to \({\it PE}_s\) and \(m_{p,s}\) can be transmitted to \({\it PE}_s\) via the route \(\Xi _{p \rightarrow s}\) .
Next, we present a heuristic algorithm to obtain parameters \({\it Conf}\) and \({\it Sch}\) , to shorten the bound \(R\) as much as possible. The bound and parameters are obtained by “simulating” the runtime execution of DAG tasks at design time, assuming the execution time of jobs is WCET. We still use \(\tau _i\) to represent job pair or job, and \(\Phi _{p,\lambda }\) ( \(\Phi _i\) ) for resource pair or resource. For simplicity, we combine \({\it Conf}\) and \({\it Sch}\) as \({\it CS}\) . This configuration symbol for each generated job pair \(\tau _i\) is defined as a tuple \(\Upsilon _i\) = ( \(\mathcal {M}_{\tau _i\rightarrow \Phi _i}, s_i\) ), where \(\mathcal {M}_{\tau _i\rightarrow \Phi _i}\) refers to the allocation of job pair \(\tau _i\) to resource pair \(\Phi _i\) , and \(s_i\) is the design-time start time of \(\tau _i\) . Then, the configuration vector can be obtained by \({\it CS} = \lbrace \Upsilon _1, \Upsilon _2, \ldots , \Upsilon _{|E|+1}\rbrace\) .
In Algorithm 1, we initialize the scheduling time point \(t_{\it sp}\) from 0 and put the first job pair \(\tau _1\) = \(\langle null,\) \(v_1 \rangle\) into the ready queue \(\mathcal {Q}_{\it ready}\) . \(\mathcal {Q}_{jp}\) contains the job pairs that have been dispatched, whereas \(\mathcal {Q}_{na}\) includes the job pairs that cannot find available computation and communication resources at current scheduling point. The job pair \(\tau _i\) ( \(\in \mathcal {Q}_{\it ready}\) ) with the longest remaining critical path is also defined as the critical job pair \(\tau _c = \langle m_{p,s}, v_s \rangle\) . To reduce the estimated bound, we always schedule \(\tau _c\) at each \(t_{\it sp}\) first in line 5. When there are multiple available resource pairs (collected in \(\mathcal {Q}_{\it rp}(v_p)\) ) of \(v_p\) , \(\tau _c\) will be scheduled on the idle resource pair \(\Phi _{p,\lambda } = \langle \Xi _{p \rightarrow s}, {\it PE}_s \rangle\) that has minimum distance so that it would cause less potential contention to unexecuted job pairs. If the distance of the resource pair is 0, the child job will be mapped on the PE where its parent job locates, eliminating the communication cost. If a job pair includes a synchronization (syn) computation job \(v_{\it syn}\) and the distance between its predecessor and \(v_{\it syn}\) is more than \({\it HPC}_{{\it max}}\) , an available DOR route [15] is employed via intermediate router(s). According to the returned \(\Phi _{p,min}\) , there are two cases to be considered:
(1)
\(\Phi _{p,min} \ne null\) in lines 10 through 12. Each critical job pair \(\langle m_{p,s}, v_s \rangle\) will be allocated to the returned resource pair \(\langle \Xi _{p \rightarrow s}, {\it PE}_s \rangle\) . The runtime state is “updated.” \(\tau _c\) is inserted into \(\mathcal {Q}_{jp}\) and removed from \(\mathcal {Q}_{\it ready}\) .
(2)
\(\Phi _{p,min} = null\) in lines 13 and 14. None of the resource pairs around \(v_p\) are available. \(\tau _c\) is put into \(\mathcal {Q}_{na}\) and will be scheduled at the next scheduling time point.
When all the remaining ready job pairs are traversed, the next scheduling time point is triggered by the completion of a job pair. Together with \(\mathcal {Q}_{na}\) , currently generated ready job pairs that have no unfinished predecessors are created and put into \(\mathcal {Q}_{\it ready}\) in lines 16 and 17. The final scheduling time point \(t_{\it sp}\) triggered by the completion of the last job pair is the estimated bound \(R\) of \(G\) . Then, \(R\) and the configuration vector \({\it CS} = \lbrace \Upsilon _1, \Upsilon _2, \ldots , \Upsilon _{|E|+1}\rbrace\) are returned in line 18. Since \(\Upsilon _i\) = ( \(\mathcal {M}_{\tau _i\rightarrow \Phi _i}, s_i\) ), \(\mathcal {M}_{\tau _i\rightarrow \Phi _i}\) and \(s_i\) constitute \({\it Conf}\) and \({\it Sch}\) , respectively. The total time complexity of Algorithm 1 is \(\mathcal {O}({\it HPC}_{\it max}^2 \times |E|^2)\) . Since \({\it HPC}_{\it max}\) is a constant (e.g., \({\it HPC}_{\it max} \le 9\) ) when SLT NoC (e.g., SMART NoC [23]) is designed, the actual time complexity is \(\mathcal {O}(|E|^2)\) , which can be polynomially solvable.

6 Implementation of DAG-Order

In this section, we introduce how to conduct runtime scheduling, and detail the implementation and overhead.

6.1 Runtime Scheduling

To derive a specific tight bound \(R\) , Section 5 has developed proper parameters \({\it Conf}\) and \({\it Sch}\) with a heuristic algorithm. In this subsection, we show how to adopt \({\it Conf}\) and \({\it Sch}\) to safely schedule DAG edges/vertices at runtime.
Without loss of generality, assume \(G\) starts execution at time point 0, and the assigned resources (i.e., decided by \({\it Conf}\) ) are initially idle. According to the design-time schedule order \({\it Sch}\) , the runtime order constraints could have different possible implementations. By Theorem 1, one solution is to set priorities for jobs according to the offline start time points generated at the design time. For example, “ \(s_k \le s_i\) ” denotes the order constraint “ \(\tau _k \rightsquigarrow \tau _i\) ,” meaning that \(\tau _k\) (with higher priority) actually starts execution not later than \(\tau _i\) at runtime.
Even if some part of partial order constraints (e.g., \(\tau _k \rightsquigarrow \tau _i\) ) are relaxed under the condition \(\xi _i^{\prime }\) , the runtime performance \(R^{\prime \prime }\) of the relaxed order dominates \(R^{\prime }\) of the original order when \(s_k^{\prime \prime } \le s_k^{\prime }\) by Theorem 2. However, the runtime time parameters \(s_k^{\prime }, s_k^{\prime \prime }, c_i^{\prime \prime }\) in \(\xi _i^{\prime }\) cannot be known before a job starts execution. Instead, in runtime scheduling, the original order relaxing condition \(\xi _i^{\prime }\) are relaxed to the following \(\xi _i\) by using the design-time parameters \(c_i\) and \(s_k\) :
(1)
\(\Phi _i\) is available at current scheduling point \(t_{\it sp}\) ;
(2)
\(\forall \tau _k \in \mathcal {Q}_{\it hp}\) and \(\tau _k \in \textsf {dc}(\tau _i)\) , \(t_{\it sp} + c_i \le s_k\) .
Although the relaxed condition \(\xi _i\) cannot guarantee strict runtime performance improvement of the partial order in all cases, the runtime execution enforced with the relaxed order under \(\xi _i\) is free of timing anomaly by Lemma 2, since \(\tau _k\) that suffers order break from \(\tau _i\) starts execution not later than \(s_k\) , namely \(s_k^{\prime \prime } \le s_k\) . Besides, the observed cascading effect (see Section 4.3) can be reduced/eliminated, which thus will avoid extremely poor NoC resource utilization. Experimental results (see Section 7) demonstrate that the average-case runtime performance of the relaxed order under \(\xi _i\) is further improved in our test set, compared with that of the original order.
The runtime scheduler dispatches the ready job \(\tau _i\) on its assigned resource \(\Phi _i\) at the scheduling point \(t_{\it sp}\) if the condition \(\xi _i\) is met, shown in Algorithm 2. The procedure of the runtime scheduler stops when all the ready/eligible jobs are traversed and is repeated at the next scheduling point triggered by the completion of a job. In this article, we mainly focus on the strategic and theoretical design of the runtime scheduler and do not limit the actual implementation in the operating system layer. One possible solution is that the online scheduler is incorporated into the operating system of the NoC controller, consisting of one main scheduler and multiple sub-schedulers with threads. The main scheduler selects the highest-priority job from the ready queue \(\mathcal {Q}_{\it ready}\) and allocates a thread for such a job. Then, the thread sub-scheduler will check the scheduling condition \(\xi _i\) to see if the job is to be scheduled or put in \(\mathcal {Q}_{\it hp}\) .
Specifically, in Algorithm 2, \(\mathcal {Q}_{\it ready}\) records the set of ready jobs. At each scheduling point \(t_{\it sp}\) , if \(\mathcal {Q}_{\it ready}\) is not empty, \(\mathcal {Q}_{\it hp}\) is set as an empty set in line 3. According to the offline schedule order \({\it Sch}\) , the job \(\tau _i\) with the earliest design-time start time (i.e., \((\tau _i \rightsquigarrow \tau _q) \subseteq {\it Sch}, \forall \tau _q \in \mathcal {Q}_{\it ready}, q \ne i\) ) is selected. Before the execution of the selected job \(\tau _i\) , the condition \(\xi _i\) is checked in line 5. If \(\xi _i\) is met, \(\tau _i\) starts execution on its assigned resource \(\Phi _i\) . Note the condition \(t_{\it sp} + c_i \le s_k\) aims to guarantee \(s_k^{\prime \prime } \le s_k\) , meaning that \(\tau _i\) can remove the order constraint \(\tau _k \rightsquigarrow \tau _i\) but the worst-case execution of \(\tau _i\) must finish before \(s_k\) to avoid timing anomaly. Thus, the time interval ( \(t_{\it sp}, s_k^{\prime \prime }\) ) that originally cannot be utilized in the original order can be utilized when enforcing the relaxed order at runtime, which can further improve the runtime performance of the original order. If \(\xi _i\) is violated, \(\tau _i\) is inserted into \(\mathcal {Q}_{\it hp}\) and removed from \(\mathcal {Q}_{\it ready}\) .

6.2 Implementation Sketch

The proposed DAG-Order is based on SLT NoC. SLT NoC is identified as a kind of single-cycle long-range traversal NoCs that are well developed and need not be designed from scratch. Thus, off-the-shelf NoCs characterizing such data transmission behavior are available, such as SMART [23]/optical [17] NoCs. To be scalable, an SLT NoC can be split into multiple cluster networks (e.g., [31]), in which the cluster size is defined by user requirements. The global control wires are limited to the cluster level instead of the whole NoC, and thus SLT NoC can be extended to a larger-size network and real-time system. On account of the requirement for job-to-resource allocation (from off-chip memory to NoC) and chip thermal distribution, each cluster in NoCs is necessarily equipped with a resource controller [6] to achieve the cluster-level global control in real-time guarantee and chip thermal reliability in the dark silicon era.
Specifically, given a DAG task \(G\) , a suitable cluster (e.g., cluster size, voltage, and frequency) is selected from the whole SLT NoC until the required deadline is met. \(G\) is executed within the allocated cluster and scheduled with Algorithm 2. The proposed scheduling algorithm and arbitration is conducted in the operating system of the resource controller. For each computation/communication job, a thread is allocated by the main scheduler, whereas the scheduling decision is distributed through the dedicated control links within the cluster network. The information of each job \(\tau _i\) can be partly/fully stored in on-chip memories (e.g., private/shared cache) of the resource controller, encoded by a \(ready\) bit (i.e., eligible for execution), \(priority\) (i.e., design-time execution start time \(s_i\) ), \(c_i\) (i.e., WCET), and assigned resource (i.e., \(core\_id\) for computation job, a bit vector \(route\) for communication job). Bit 1 represents the busy state of core/link, and 0 represents the idle state. The availability of the assigned route is checked by a bit-wise AND operation according to the resource occupation state. Similar to circuit switching, the contention-free “1-cycle” SLT path is established for the winners of the scheduler by sending a path-setup control packet along with the control layer. Once the resources are released by the finished jobs, the resource occupation state is updated immediately. This update operation of resource/job state can be conducted concurrently with the schedule processing from communications/computations, and thus cannot delay the schedule processing and schedule information delivery to the routers/PEs of communications/computations. Especially for the large-size real-time task, it can be divided into multiple moderate sub-tasks, with each sub-task assigned to one cluster network. Note that if a parent job and its child are assigned to different clusters, the control message of the dependence relationship is propagated through the control links between cluster controllers, whereas the generated data of parents is delivered to the border NoC node between clusters or shared cache where the child can fetch the data. Then, the child job will become ready and eligible to be executed if all input dependencies are met.
The overhead of the control layer (e.g., resource occupation update, control links) on SLT NoC is a necessary part of cluster-based control (e.g., [31]), in which the existing hardware resources can be reused. Since resource control is conducted in a cluster-based centralized manner, the existing router buffer of SLT NoC (e.g., SMART NoC) can be reduced to one flit to mitigate the overhead in this work, whereas the additional time overhead \(\mathcal {C}_{\it ctrl}\) of conducting job scheduling and NoC control operations for each DAG vertex/edge is assumed to be incorporated into the WCET of jobs, which should be accurately estimated in real implementation. As in the work of Krishna et al. [23], each core of SMART NoCs has 32KB private L1 cache and 1MB private/shared cache. The additional storage space for job information is acceptable when equipped with KB- and MB-level caches. Here, we mainly focus on the strategic design of the scheduling paradigm based on off-the-shelf SLT NoCs. Additional research such as accurate estimation of time deviation \(\mathcal {C}_{\it ctrl}\) , exploration of more efficient designs of scheduling schemes, and cluster-based SLT NoC will be conducted in future work.

7 Experimental Evaluation

In this section, we run a variety of experiments to evaluate the effectiveness (i.e., bound tightness, runtime performance) of the scheduling strategy DAG-Order using synthetic and realistic benchmarks with different parameter settings.

7.1 Experimental Setup

To derive the response time results and validate the effectiveness of the scheduling strategy, we build a high-level simulator by “simulating” (which is different from the critical-path-based analysis like in the work of He et al. [18]) the runtime execution according to the schedule order based on the abstracted SLT NoC in a C/C++ environment. Although the runtime behavior is not the actual execution behavior in realistic systems, it reflects the technical efficiencies of our scheduling approach to some extent, which is used for comparison of all approaches in experiments. Like recent DAG works [12, 18], the synthetic DAG tasks are generated by the Erdos-Renyi approach \(G(|V|, p)\) [14], which is a commonly experimental setup in real-time DAG scheduling. The DAG tasks are totally run on a \(4 \times 4\) mesh-based cluster of SLT NoC. Specifically, the parameters of the DAG tasks are generated as follows:
Connectivity parameter \(p_{\it con}\) : \(p_{\it con}\) is designed for \(G(|V|, p_{\it con})\) . If a random probability from (0, 1) is less than the given probability threshold \(p_{\it con}\) , a directed edge between two vertices is generated for a DAG task.
WCET: The WCET of each edge/vertex is randomly chosen in (300, 1500), which complies with the real measurement from OpenMP programs in the work of Wang et al. [37].
Maximum number of vertexes \(N_{max}\) : The actual number of vertexes of the DAG is chosen within the range of (10, \(N_{max}\) ).
Communication-to-computation ratio: The Communication-to-Computation Ratio (CCR) is the ratio of the average communication time to the average computation time in the DAG tasks.
WCET varying parameter \(p_{\it vary}\) : A probability \(p_{\it vary}^{\prime }\) is randomly generated in (0.05, \(p_{\it vary}\) ), and the actual execution time of each edge/vertex at runtime is set as \(p_{\it vary}^{\prime } \times {\it WCET}\) .
Since SLT NoCs (e.g., optical [17]/SMART [23] NoCs) have different implementations in real systems and we mainly focus on the theoretical design of the new real-time scheduling paradigm and do not limit the actual implementation in the hardware part, in the experiments, the simulator infrastructure only models the task graph, the assumed WCET of each edge/vertex, and the SLT NoC behavior. The response time of the DAG task begins with the corresponding necessary programming codes already loaded to the NoC private/shared cache/memory and ends with the execution finish of the last DAG vertex, including the time of vertex/PE execution and inter-vertex/PE message transmission, whereas the other influencing factors are kept the same. The additional time overhead of conducting job scheduling and NoC control operations for each job is assumed to be incorporated into the WCET of the corresponding jobs. The real instruction-level dependency generated at runtime and the memory/cache impact do not factor into the results being evaluated, as these operations (e.g., memory-/cache-related operations) can be embedded into the DAG task model by adding the corresponding vertices and edges at design time. Besides, we generate the task graphs from the realistic StreamIt benchmarks [29, 33] and AI applications. In the experiments, we collect three StreamIt benchmarks including Audiobeam (adm), FMRadio (fmr), and H264 (h264), and four AI applications including MobileNetV2 (mnt), UNet (unet), VGG16 (vgg), and ResNet50 (res). The employed benchmarks are representative, as they have different orders of magnitude in workloads, and cover audio processing, scientific processing, and AI applications, among others. In addition, the proposed scheduling scheme DAG-Order is applicable and extended to other benchmarks that can be modeled as DAG.
To facilitate quantitative comparison, we use the performance metrics response time bound and actual response time. Specifically, response time bound is estimated by “simulating” the runtime execution of DAG tasks at design time and assuming the jobs are executed for their WCETs, whereas actual response time is measured by varying the actual execution time of jobs within WCET. The proposed approaches and the corresponding baselines are as follows in our experiments:
\(R_{\it base}\) is derived through “simulating” the runtime execution of DAG tasks at design time by Chen et al. [12], which is regarded as a SOTA related work. To guarantee the high parallelism of vertices, it always selects the vertex job first with maximum WCET, when there are multiple ready jobs.
\(R_{\it our}\) is derived by selecting the job pair that has the longest remaining critical path when there are multiple ready job pairs in our proposed Algorithm 1.
\(R_{\it base}^{\prime }\) refers to the actual execution time of DAG tasks with certain order constraints in the work of Chen et al. [12], which is regarded as the SOTA related work.
\(R_{\it opo}^{\prime }\) refers to the actual execution time of DAG tasks with original partial order (opo), which is proposed in our earlier conference version [10].
\(R_{\it rpo}^{\prime }\) refers to the actual execution time of DAG tasks with relaxed partial order (rpo) in this article.
Note that the task model of real-time communication analysis in other works [11, 20, 38] focuses on the independent periodic packet flows, priority-preemptive scheduling schemes, and traditional multi-hop transmission NoCs, whereas in this article, we adopt a general DAG task model of jointly considering the computation and communication workloads and the non-preemptive scheduling scheme in SLT NoCs. Similarly, the classical bounds (e.g., [18]) on multi-cores do not consider inter-core communication and thus cannot be fully applicable in SLT NoCs. Therefore, these related approaches [11, 18, 20, 38] are only qualitatively compared to demonstrate the advantages of the proposed NoC scheduling paradigm and are not adopted as quantitative baselines. Instead, we quantitatively compare our work with another SOTA related simulation-based approach [12] and the conference version [10] of this work.

7.2 Comparison of Response Time Bound

Figure 7 demonstrates the comparison of bounds. Figure 7(a) shows the results under synthetic benchmarks by varying \(p_{\it con}\) , Figure 7(b) shows the results by varying the \(N_{max}\) , Figure 7(c) shows the results with the CCR, and Figure 7(d) shows the results under realistic benchmarks. Each point of all the results is averaged with multiple experiments and normalized to the bound \(R_{\it our}\) obtained by Algorithm 1 in this article. It can be observed that the bound \(R_{\it our}\) is consistently smaller or not more than the baseline bounds \(R_{\it base}\) , indicating that Algorithm 1 can tighten the bounds.
Fig. 7.
Fig. 7. Normalized response time bound under synthetic and realistic benchmarks.
Specifically, as shown in Figure 7(a), the bound gap between \(R_{\it our}\) and \(R_{\it base}\) increases with the increasing of \(p_{\it con}\) . It indicates that the superiority of \(R_{\it our}\) is more apparent when the DAG structure is more complicated (i.e., more data/control dependence). This is because the number of runtime execution instances will be larger when enforcing the complicated dependence among vertices. Thus, the response time bound experiences high time variation due to a large number of instances. Without considering the criticality and dependence of jobs, \(R_{\it base}\) cannot produce tighter bounds. \(R_{\it our}\) conducts the critical remaining path scheduling (considers the inter-job dependence) and performs better. Similarly, as shown in Figure 7(b) through (d), the bound \(R_{\it our}\) also performs better, with no larger bounds compared with the baseline bound \(R_{\it base}\) . By employing our bound \(R_{\it our}\) , (i) if the DAG tasks are set as the same deadlines, less computation and communication resources are needed, in which the resource assignment is well controlled and the resource over-provisioning is avoided, and (ii) if the DAG tasks are assigned the same resources, \(R_{\it our}\) can produce a higher acceptance ratio in the schedulability test.

7.3 Comparison of Actual Response Time

Figure 8 demonstrates the runtime performance (i.e., actual response time) by varying the execution time within WCET of jobs. Figure 8(a) shows the results under synthetic benchmarks by varying \(p_{\it con}\) , Figure 8(b) shows the results with \(p_{\it vary}\) , Figure 8(c) shows the results with the CCR, and Figure 8(d) shows the results under realistic benchmarks. All the results are averaged and normalized to the response time \(R_{\it base}^{\prime }\) . It can be observed that \(R_{\it rpo}^{\prime }\) and \(R_{\it opo}^{\prime }\) , which are enforced with the proposed order constraints, perform better compared with the SOTA related work \(R_{\it base}^{\prime }\) [12]. Meanwhile, it also reflects the communication performance/resource utilization of \(R_{\it rpo}^{\prime }\) and \(R_{\it opo}^{\prime }\) on SLT NoC is better than \(R_{\it base}^{\prime }\) under the same resource constraints.
Fig. 8.
Fig. 8. Normalized actual response time under synthetic and realistic benchmarks.
Specifically, \(R_{\it rpo}^{\prime }\) and \(R_{\it opo}^{\prime }\) granularly consider the direct and indirect contention, and less schedule constraints are enforced among jobs. This means that the scheduling flexibility of \(R_{\it rpo}^{\prime }\) and \(R_{\it opo}^{\prime }\) is better than that of \(R_{\it base}^{\prime }\) , while guaranteeing the timing-anomaly-free execution. When the jobs of \(R_{\it rpo}^{\prime }\) and \(R_{\it opo}^{\prime }\) are finished earlier than that of \(R_{\it base}^{\prime }\) , the assigned resources are available and can be assigned to other jobs earlier, improving the resource utilization. Although \(R_{\it rpo}^{\prime }\) and \(R_{\it opo}^{\prime }\) both consider the direct and indirect contention relationship among jobs, they are slightly different in average-case runtime performance. The reason is that the observed cascading effect is reduced/eliminated. In other words, when the shared resources are available for an upcoming scheduled job while the contending higher-priority jobs have not been started execution on these shared resources, the upcoming scheduled ready job can conditionally start execution, further improving the resource utilization. When the actual response time of the current DAG task is reduced, the resources can be reassigned to other DAG tasks as early as possible if the remaining workload is heavy, reducing the resource consumption. Besides, all of the runtime execution instances are free of timing anomaly in the experiments (e.g., \(R_{\it rpo}^{\prime } \le R_{\it our}\) ), also proved in Lemma 2.

8 Related Work

The response time upper bounds of real-time DAG tasks on multi-cores are well studied in recent works (e.g., [12, 18, 19, 36]). By accurately analyzing the inter-task and intra-task interferences of DAG tasks, the safe and tight upper bounds are derived. But most of these works did not pay attention to the inter-core communication delay.
To address the communication issues, some authors [20, 25] conduct real-time analysis on the distributed priority-preemptive wormhole NoCs, borrowing the uniprocessor scheduling theory. However, the communication resources (e.g., link) from source to destination are not mutually exclusive (i.e., single resource) and can be occupied concurrently by multiple communication flows. The actual worst-case timing scenario is difficult to identify due to the MPB, and thus a series of new analysis-based works [20, 25, 38] tried to solve the safety problem of the latency bounds by considering more runtime counterexamples. The works [20, 25] on real-time wormhole NoCs are presented by precisely considering the MPB latency, but the latency bounds are still not proven to be safe. Another research approach adopts the tools of network calculus (e.g., [28]) or compositional performance analysis (e.g., [34]).
Significantly, there are two recent works to ensure the safety and tightness of the latency bounds. One is presented by Burns et al. [5] with a novel distributed flow control scheme. Such a communication scheduling scheme can improve the latency bound tightness by eliminating MPB [38] (i.e., router backpressure due to the finite buffer size). Another is presented by Ueter et al. [35] with centralized priority-preemptive scheduling with all-or-nothing property. The key idea is that a long-range source-destination link, including multiple router-to-router links, is regarded as a logically Single Link (SL) (like uniprocessor), thus building the correct relation between NoCs and uniprocessor scheduling theory. The logically “single” long-range link is mutually exclusively occupied by a packet flow without being shared at a time, at the expense of the communication resource waste. However, both the distributed (e.g., [20]) and centralized (e.g., [35]) priority-preemptive communication scheduling strategies did not consider the computation tasks and inter-packet dependence simultaneously.
To address the drawbacks of the aforementioned main works, Table 2 presents a comparison of our work with the representative related works for real-time NoCs:
Table 2.
approachcomp.comm.MPBpartial orderrelaxed partial orderlogically SL schedulingphysically SL schedulingpreemptive or nottype of managementsafe
[18]yesnoyescentralizedyes
[12]yesnoyesnonocentralizedyes
[20, 25]noyesyesnonoyesdistributedno
[5]noyesnononoyesdistributedyes
[26]noyesnonononocentralizedyes
[35]noyesnoyesnoyescentralizedyes
[10]yesyesnoyesnoyesyesnocentralizedyes
oursyesyesnoyesyesyesyesnocentralizedyes
Table 2. Comprehensive Comparison of Representative Related Works for Real-Time NoCs
comp.: Whether the computation is considered.
comm.: Whether the communication is considered.
MPB: Whether the notorious MPB problem is solved.
partial order: Whether the tasks are scheduled with partial order constraints.
relaxed partial order: Whether the tasks are scheduled with relaxed partial order constraints.
logically SL scheduling: Whether the communication tasks are scheduled in the unit of the whole source-destination link. Such link is regarded as a logically SL, and needs to be traversed with multiple cycles.
physically SL scheduling: Like logically SL scheduling, but the difference is that the link is regarded as a physically SL, and needs to be traversed with only one cycle. This scheduling intuitively shows higher resource utilization than logically SL scheduling.
preemptive or not: Whether the tasks are scheduled based on preemptive approach or not?
type of management: Whether the tasks are scheduled based on distributed or centralized approach?
safe: Whether the bounds are safe or not.
As illustrated in Table 2, except for the conference version [10] of this work, there are few works that consider the computation and communication workloads simultaneously. Although the work of Benchehida et al. [4] has considered both computation and communication issues for TDM NoC-based multi-core architectures, it presents challenges for global clock synchronization and TDM schedule setup. Compared with the conference version [10] and another prior work [12], this work further relaxes the partial order constraints, which can achieve higher average runtime performance and resource utilization.

9 Conclusion AND Future Work

In this article, we presented a new dynamic scheduling paradigm on a kind of SLT NoC. With this scheduling paradigm, the notoriously MPB problem is naturally eliminated in traditional priority-preemptive wormhole scheduling, simplifying the real-time analysis. In particular, task computation and inter-core communication are easily considered simultaneously, which is missing in SOTA works. Compared to the previous conference version [10], the current work improves the average runtime performance, saving resource consumption. Especially, this scheduling paradigm opens new research directions in future work, such as the design of more efficient SLT NoC and the exploration of finer-grained scheduling schemes to improve the runtime performance and bounds tightness, and the scheduling design of multiple recurrent DAG tasks.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback and improvements to this article.

Footnote

1
We assume the notations attached with one and two apostrophes in the top right corner denote the runtime parameters when enforced with the original partial and relaxed partial order constraints, respectively.

References

[1]
Hamid Arabnejad and Jorge G. Barbosa. 2013. List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Transactions on Parallel and Distributed Systems 25, 3 (2013), 682–694.
[2]
Yashar Asgarieh and Bill Lin. 2019. Smart-hop arbitration request propagation: Avoiding quadratic arbitration complexity and false negatives in SMART NoCs. ACM Transactions on Design Automation of Electronic Systems 24, 6 (2019), 1–25.
[3]
Sanjoy Baruah. 2015. The federated scheduling of systems of conditional sporadic DAG tasks. In Proceedings of the 2015 International Conference on Embedded Software (EMSOFT’15). IEEE, Los Alamitos, CA, 1–10.
[4]
Chawki Benchehida, Mohammed Kamel Benhaoua, Houssam-Eddine Zahaf, and Giuseppe Lipari. 2020. Task and communication allocation for real-time tasks to networks-on-chip multiprocessors. In Proceedings of the 2020 2nd International Conference on Embedded and Distributed Systems (EDiS’20). IEEE, Los Alamitos, CA, 9–14.
[5]
Alan Burns, Leandro Soares Indrusiak, N. Smirnov, and J. Harrison. 2020. A novel flow control mechanism to avoid multi-point progressive blocking in hard real-time priority-preemptive NoCs. In Proceedings of the 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’20). IEEE, Los Alamitos, CA, 137–147.
[6]
Guilherme Castilhos, Marcelo Mandelli, Guilherme Madalozzo, and Fernando Moraes. 2013. Distributed resource management in NoC-based MPSoCs with dynamic cluster sizes. In Proceedings of the 2013 IEEE Computer Society Annual Symposium on VLSI (ISVLSI’13). IEEE, Los Alamitos, CA, 153–158.
[7]
Hui Chen, Peng Chen, Xiangzhong Luo, Shuo Huai, and Weichen Liu. 2022. LAMP: Load-balanced multipath parallel transmission in point-to-point NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 12 (2022), 5232–5245.
[8]
Hui Chen, Peng Chen, Jun Zhou, Luan H. K. Duong, and Weichen Liu. 2021. ArSMART: An improved SMART NoC design supporting arbitrary-turn transmission. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 5 (2021), 1316–1329.
[9]
Peng Chen, Hui Chen, Jun Zhou, Mengquan Li, Weichen Liu, Chunhua Xiao, Yiyuan Xie, and Nan Guan. 2021. Contention minimization in emerging smart NoC via direct and indirect routes. IEEE Transactions on Computers 71, 8 (2021), 1874–1888.
[10]
Peng Chen, Hui Chen, Jun Zhou, Di Liu, Shiqing Li, Weichen Liu, Wanli Chang, and Nan Guan. 2021. Partial order based non-preemptive communication scheduling towards real-time networks-on-chip. In Proceedings of the 36th Annual ACM Symposium on Applied Computing. 145–154.
[11]
Peng Chen, Weichen Liu, Hui Chen, Shiqing Li, Mengquan Li, Lei Yang, and Nan Guan. 2020. Reduced worst-case communication latency using single-cycle multihop traversal network-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40, 7 (2020), 1381–1394.
[12]
Peng Chen, Weichen Liu, Xu Jiang, Qingqiang He, and Nan Guan. 2019. Timing-anomaly free dynamic scheduling of conditional DAG tasks on multi-core systems. ACM Transactions on Embedded Computing Systems 18, 5s (2019), 1–19.
[13]
Peng Chen, Weichen Liu, Mengquan Li, Lei Yang, and Nan Guan. 2020. Contention minimized bypassing in SMART NoC. In Proceedings of the 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC’20). IEEE, Los Alamitos, CA, 205–210.
[14]
Daniel Cordeiro, Grégory Mounié, Swann Perarnau, Denis Trystram, Jean-Marc Vincent, and Frédéric Wagner. 2010. Random graph generation for scheduling simulations. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques (SIMUTools’10). 10.
[15]
William James Dally and Brian Patrick Towles. 2004. Principles and Practices of Interconnection Networks. Elsevier.
[16]
Ronald L. Graham. 1969. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics 17, 2 (1969), 416–429.
[17]
Huaxi Gu, Jiang Xu, and Zheng Wang. 2008. A novel optical mesh network-on-chip for gigascale systems-on-chip. In Proceedings of the 2008 IEEE Asia Pacific Conference on Circuits and Systems. IEEE, Los Alamitos, CA, 1728–1731.
[18]
Qingqiang He, Xu Jiang, Nan Guan, and Zhishan Guo. 2019. Intra-task priority assignment in real-time scheduling of DAG tasks on multi-cores. IEEE Transactions on Parallel and Distributed Systems 30, 10 (2019), 2283–2295.
[19]
Qingqiang He, Nan Guan, Mingsong Lv, Xu Jiang, and Wanli Chang. 2022. Bounding the response time of DAG tasks using long paths. In Proceedings of the 2022 IEEE Real-Time Systems Symposium (RTSS’22). IEEE, Los Alamitos, CA, 474–486.
[20]
Leandro Soares Indrusiak, Alan Burns, and Borislav Nikolić. 2018. Buffer-aware bounds to multi-point progressive blocking in priority-preemptive NoCs. In Proceedings of the 2018 Design, Automation, and Test in Europe Conference and Exhibition (DATE’18). IEEE, Los Alamitos, CA, 219–224.
[21]
Biresh Kumar Joardar, Karthi Duraisamy, and Partha Pratim Pande. 2018. High performance collective communication-aware 3D network-on-chip architectures. In Proceedings of the 2018 Design, Automation, and Test in Europe Conference and Exhibition (DATE’18). IEEE, Los Alamitos, CA, 1351–1356.
[22]
Muhammad Kafil and Ishfaq Ahmad. 1998. Optimal task assignment in heterogeneous distributed computing systems. IEEE Concurrency 6, 3 (1998), 42–50.
[23]
Tushar Krishna, Chia-Hsin Owen Chen, Woo Cheol Kwon, and Li-Shiuan Peh. 2013. Breaking the on-chip latency barrier using SMART. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA’13). IEEE, Los Alamitos, CA, 378–389.
[24]
Daeyeal Lee, Bill Lin, and Chung-Kuan Cheng. 2021. SMT-based contention-free task mapping and scheduling on 2D/3D SMART NoC with mixed dimension-order routing. ACM Transactions on Architecture and Code Optimization 19, 1 (2021), 1–21.
[25]
Borislav Nikolić, Sebastian Tobuschat, Leandro Soares Indrusiak, Rolf Ernst, and Alan Burns. 2019. Real-time analysis of priority-preemptive NoCs with arbitrary buffer sizes and router delays. Real-Time Systems 55, 1 (2019), 63–105.
[26]
Tomas Picornell, José Flich, Carles Hernández, and Jose Duato. 2020. Enforcing predictability of many-cores with DCFNoC. IEEE Transactions on Computers 70, 2 (2020), 270–283.
[27]
Anastasios Psarras, Junghee Lee, Ioannis Seitanidis, Chrysostomos Nicopoulos, and Giorgos Dimitrakopoulos. 2015. PhaseNoC: Versatile network traffic isolation through TDM-scheduled virtual channels. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 5 (2015), 844–857.
[28]
Yue Qian, Zhonghai Lu, and Wenhua Dou. 2009. Analysis of worst-case delay bounds for best-effort communication in wormhole networks on chip. In Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip. IEEE, Los Alamitos, CA, 44–53.
[29]
Benjamin Rouxel and Isabelle Puaut. 2017. STR2RTS: Refactored StreamIT benchmarks into statically analyzable parallel benchmarks for WCET estimation & real-time scheduling. In Proceedings of the 17th International Workshop on Worst-Case Execution Time Analysis (WCET’17).
[30]
Sanjit Kumar Roy, Rajesh Devaraj, Arnab Sarkar, Kankana Maji, and Sayani Sinha. 2020. Contention-aware optimal scheduling of real-time precedence-constrained task graphs on heterogeneous distributed systems. Journal of Systems Architecture 105 (2020), 101706.
[31]
Marcelo Ruaro, Nedison Velloso, Axel Jantsch, and Fernando G. Moraes. 2019. Distributed SDN architecture for NoC-based many-core SoCs. In Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip. 1–8.
[32]
Zheng Shi and Alan Burns. 2008. Real-time communication analysis for on-chip networks with wormhole switching. In Proceedings of the 2nd ACM/IEEE International Symposium on Networks-on-Chip (NOCS’08). IEEE, Los Alamitos, CA, 161–170.
[33]
William Thies and Saman Amarasinghe. 2010. An empirical characterization of stream programs and its implications for language and compiler design. In Proceedings of the 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, Los Alamitos, CA, 365–376.
[34]
Sebastian Tobuschat and Rolf Ernst. 2017. Real-time communication analysis for networks-on-chip with backpressure. In Proceedings of the 2017 Design, Automation, and Test in Europe Conference and Exhibition (DATE’17). IEEE, Los Alamitos, CA, 590–595.
[35]
Niklas Ueter, Jian-Jia Chen, Georg von der Brüggen, Vanchinathan Venkataramani, and Tulika Mitra. 2020. Simultaneous progressing switching protocols for timing predictable real-time network-on-chips. In Proceedings of the 2020 IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA’20). IEEE, Los Alamitos, CA, 1–10.
[36]
Petros Voudouris, Per Stenström, and Risat Pathan. 2017. Timing-anomaly free dynamic scheduling of task-based parallel applications. In Proceedings of the 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’17). IEEE, Los Alamitos, CA, 365–376.
[37]
Yang Wang, Nan Guan, Jinghao Sun, Mingsong Lv, Qingqiang He, Tianzhang He, and Wang Yi. 2017. Benchmarking OpenMP programs for real-time scheduling. In Proceedings of the 2017 IEEE 23rd International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA’17). IEEE, Los Alamitos, CA, 1–10.
[38]
Qin Xiong, Fei Wu, Zhonghai Lu, and Changsheng Xie. 2017. Extending real-time analysis for wormhole NoCs. IEEE Transactions on Computers 66, 9 (2017), 1532–1546.
[39]
Lei Yang, Weichen Liu, Peng Chen, Nan Guan, and Mengquan Li. 2017. Task mapping on SMART NoC: Contention matters, not the distance. In Proceedings of the 2017 54th Annual Design Automation Conference. 1–6.

Index Terms

  1. DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 21, Issue 1
    March 2024
    500 pages
    EISSN:1544-3973
    DOI:10.1145/3613496
    • Editor:
    • David Kaeli
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 December 2023
    Online AM: 03 November 2023
    Accepted: 19 October 2023
    Revised: 19 September 2023
    Received: 16 May 2023
    Published in TACO Volume 21, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Real-time systems
    2. non-preemptive dynamic scheduler
    3. SLT NoC
    4. execution-timing anomaly
    5. order constraint

    Qualifiers

    • Research-article

    Funding Sources

    • Science and Technology Research Program of Chongqing Municipal Education Commission
    • Startup Grant of Chongqing University of Posts and Telecommunications
    • Nanyang Technological University, Singapore, under its NAP

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 1,205
      Total Downloads
    • Downloads (Last 12 months)1,205
    • Downloads (Last 6 weeks)156
    Reflects downloads up to 14 Sep 2024

    Other Metrics

    Citations

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media