Fault Tolerance And Reliability

If n different events each have the same mean time m, then the mean time to the first of these events is m/n.

Theorem 1:

Mean time to event A: MT(A) = 1/P(A), where P(A) is the probability of A occurring in a given unit of time.

Theorem 2:

P(A or B) = P(A) + P(B) - P(A and B)
Assuming A and B are independent:
          = P(A) + P(B) - P(A) * P(B)
          ≈ P(A) + P(B) (if P(A) and P(B) are very small)
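
As a quick numeric check of the approximation in Theorem 2, the sketch below compares the exact and approximate values. The function names and the sample probabilities are illustrative assumptions, not part of the original notes.

```python
def p_either_exact(p_a: float, p_b: float) -> float:
    """P(A or B) for independent events A and B."""
    return p_a + p_b - p_a * p_b


def p_either_approx(p_a: float, p_b: float) -> float:
    """Small-probability approximation: drop the P(A) * P(B) term."""
    return p_a + p_b


# Hypothetical per-time-unit failure probabilities, chosen only for illustration.
p_a, p_b = 1e-4, 2e-4
exact = p_either_exact(p_a, p_b)
approx = p_either_approx(p_a, p_b)

# The error of the approximation is exactly the dropped product term.
print(f"exact={exact:.10f} approx={approx:.10f} error={approx - exact:.2e}")
```

The error equals P(A) * P(B), so it shrinks quadratically as the probabilities get smaller.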

Theorem 3:

If events A and B have mean times MT(A) and MT(B), then the mean time to the first event is 1/(P(A) + P(B)) = 1/(1/MT(A) + 1/MT(B)).

Proof:

If p is the probability of an event in a given unit of time, then its mean time is m = 1/p.
If there are n such events, the probability that one of them occurs is approximately n * p (by Theorem 2, for small p).
Therefore, the mean time to the first of these events is 1/(n * p) = m/n.
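
The result can be illustrated with a small calculation. This is a minimal sketch, assuming each event occurs independently with probability 1/MT per unit of time; the function names and the sample mean times are hypothetical.

```python
from typing import Sequence


def mean_time_to_first(mean_times: Sequence[float]) -> float:
    """Approximate mean time to the first event: add the probabilities 1/MT."""
    return 1.0 / sum(1.0 / mt for mt in mean_times)


def exact_mean_time_to_first(mean_times: Sequence[float]) -> float:
    """Exact value under the independence assumption.

    The chance that at least one event occurs in a time unit is
    1 - product(1 - p_i), and the mean waiting time is its reciprocal.
    """
    p_none = 1.0
    for mt in mean_times:
        p_none *= 1.0 - 1.0 / mt
    return 1.0 / (1.0 - p_none)


# Hypothetical example: n = 4 modules, each with mean time m = 10000 time units.
m, n = 10_000.0, 4
print(mean_time_to_first([m] * n))        # close to m / n
print(exact_mean_time_to_first([m] * n))  # slightly larger than m / n

# Theorem 3 with two different mean times.
print(mean_time_to_first([1_000.0, 2_000.0]))
```

The approximate and exact values stay close as long as each individual probability is small, which is the same condition Theorem 2 relies on.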




Fault Tolerance Strategy:

  1. Fail-vote:
    Use two or more modules and compare their outputs; the system stops if no majority of the outputs agree. With duplication (two modules) it fails twice as often as a single module, but it gives clean failure semantics.

  2. Fail-fast:
    Similar to fail-vote, except that the system senses which modules are available and then uses the majority of the available modules.
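
The difference between the two strategies can be sketched in a few lines. This is an illustrative sketch only: `fail_vote`, `fail_fast_vote`, and `VoteFailure` are hypothetical names, and a module that is known to be unavailable is represented by `None`.

```python
from collections import Counter
from typing import Optional, Sequence


class VoteFailure(Exception):
    """Raised when no majority agrees, so the system stops cleanly."""


def _majority(values: Sequence[int], needed_from: int) -> int:
    """Return the value held by a strict majority of `needed_from` modules."""
    votes = Counter(values)
    if votes:
        value, count = votes.most_common(1)[0]
        if count * 2 > needed_from:
            return value
    raise VoteFailure("no majority output")


def fail_vote(outputs: Sequence[Optional[int]]) -> int:
    """Fail-vote: a majority of ALL modules must agree."""
    available = [o for o in outputs if o is not None]
    return _majority(available, needed_from=len(outputs))


def fail_fast_vote(outputs: Sequence[Optional[int]]) -> int:
    """Fail-fast: a majority of the AVAILABLE modules is enough."""
    available = [o for o in outputs if o is not None]
    return _majority(available, needed_from=len(available))


# Three modules, two of which are known to be down.
outputs = [7, None, None]
print(fail_fast_vote(outputs))  # the single available module forms the majority
try:
    fail_vote(outputs)
except VoteFailure:
    print("fail-vote stops: 1 of 3 is not a majority")
```

With two modules, `fail_vote` needs both to agree, which is why a duplexed fail-vote system stops whenever either module fails.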

Improving software reliability:

  1. Periodic transfer of data: The primary process does all the work until it fails; a second process, called the backup, then takes over from the primary and continues.
  2. Checkpoint-restart: The primary records its state on a duplexed storage module. At takeover, the secondary reads the primary's state from the duplexed storage and resumes the application.
  3. Checkpoint messages: The primary sends its state changes as messages to the backup. At takeover, the backup gets its current state from the most recent checkpoint message.
  4. Persistent: The backup restarts in the null state and lets the transaction mechanism clean up all uncommitted transactions. This is the approach taken by most database systems.
  5. Highly available storage
    • Write to several storage modules.
    • Use some kind of checksum to make sure that the data read is correct with very high probability.
    • Disk mirroring is an example of this.
    • Shadowing is another mirroring technique, which allows atomic write operations.
  6. Highly available processes
    • Process pairing
    • Transaction-based restart
    • Checkpoint-restart
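
The checkpoint-message approach from the list above can be sketched as a process pair. This is a simplified, in-memory illustration: the `Primary` and `Backup` classes and their methods are hypothetical, and a real system would send the checkpoint messages over a reliable channel between separate processes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Backup:
    """Holds the state reconstructed from checkpoint messages."""
    state: Dict[str, int] = field(default_factory=dict)

    def on_checkpoint(self, changes: Dict[str, int]) -> None:
        # Apply the state change carried by one checkpoint message.
        self.state.update(changes)

    def take_over(self) -> "Primary":
        # Resume as the new primary from the most recently checkpointed state.
        return Primary(backup=None, state=dict(self.state))


@dataclass
class Primary:
    """Does all the work and forwards every state change to the backup."""
    backup: Optional[Backup]
    state: Dict[str, int] = field(default_factory=dict)

    def apply(self, key: str, value: int) -> None:
        self.state[key] = value
        if self.backup is not None:
            # Checkpoint message: send only the change, not the whole state.
            self.backup.on_checkpoint({key: value})


backup = Backup()
primary = Primary(backup=backup)
primary.apply("balance", 100)
primary.apply("balance", 80)

# Simulate a primary failure: the backup takes over with the checkpointed state.
new_primary = backup.take_over()
print(new_primary.state)
```

Sending only the changes keeps each message small; the trade-off is that the backup must receive every message, in order, to stay consistent with the primary.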

Improving communication reliability
