Very Simple File System (VSFS)

Very Simple File System (VSFS)

VSFS is pure software that lets us study and experience a simplified version of the UNIX file system.

  • Key Question: How can we build a simple file system? What data structures are needed on disk, and how do they track the data they describe?

Approach

  1. Data structures: What on-disk structures does the file system need to manage its data and metadata?
  2. Access methods: How do system calls such as open(), read(), and write() map onto those structures, and how efficient are they?

Overall Organization

Disk block: normally 4KB.
Data region: the disk space where user data lives.
Allocation structures: indicate whether each block is in use or free.

  • Free list: manages unused blocks as a linked list, so only the head of the list needs to be remembered.

Disk Space (assume 64 blocks):

  • Metadata: information such as which data blocks (in the data region) belong to a file, its size, owner, access permissions, and access/change times; usually stored in a structure called an inode.
  • Inode table: the disk space that holds the inodes, stored as an array.
  • An inode is usually 128-256 bytes, so a 4KB block holds 16 inodes (at 256 bytes each), and with 5 inode-table blocks the file system has 80 inodes.
  • Superblock: contains information about the entire file system: how many inodes and data blocks there are, where the inode table begins, and so on.
    • When we say the "file system is broken", it usually means the superblock is damaged.
    • File systems therefore often keep duplicate copies of the superblock as backups.

When mounting the file system, the OS reads the superblock, initializes the file-system state, and attaches the partition to the system tree; from then on it knows where to find the required on-disk structures whenever a file is accessed.

File Organization: inode

Inode has:

  • File type: regular file, directory, etc.
  • Size, number of allocated blocks, protection info (owner, permissions), timestamps.
  • The locations of its data blocks on disk (pointers).
    All of this is called metadata.

An important design decision in the inode is how to express the locations of the data blocks. The simplest method is to include several direct pointers in the inode, each pointing to one disk block.
-> This limits file size: the maximum size is (number of pointers) * (block size).

Multi Level Index

  • To support bigger files, one more structure is added: the indirect pointer.
    • An indirect pointer does not point to a data block; it points to a block full of pointers that in turn point to data blocks.
    • An inode has a fixed number of direct pointers (e.g., 12) and one indirect pointer.
    • For a large file, an indirect block is allocated (from the data-block region) and the inode's indirect pointer points to it.

  • For even bigger files, a double indirect pointer points to a block of indirect pointers; beyond that, a triple indirect pointer.

  • Altogether the disk blocks form a lopsided tree, called a multi-level index.

  • Extent:

    • Consists of a disk pointer and a length (in blocks), so each extent requires contiguous disk space.
    • Multiple extents per file relax this restriction, at the cost of larger metadata.
  • Linux ext2 and ext3 use the multi-level index; some file systems use extents.

The reason the tree (multi-level index) is lopsided: most files are small.
If most files are small, the file system should make small files fast to read -> the vsfs inode reaches the first (usually 12) blocks directly through its direct pointers, as sketched below.
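
As a rough illustration (not the actual vsfs on-disk layout), an inode with 12 direct pointers and one indirect pointer over 4KB blocks might look like the sketch below; with 4-byte block pointers, one indirect block adds 1024 more block addresses, so the maximum file size is (12 + 1024) * 4KB, or about 4MB.

#define NUM_DIRECT   12
#define BLOCK_SIZE   4096
#define PTRS_PER_BLK (BLOCK_SIZE / sizeof(unsigned int))  // 1024 with 4-byte pointers

// Hypothetical inode layout, for illustration only.
typedef struct inode {
    unsigned int type;                // regular file, directory, ...
    unsigned int size;                // file size in bytes
    unsigned int direct[NUM_DIRECT];  // block numbers of the first 12 data blocks
    unsigned int indirect;            // block number of a block full of pointers
} inode_t;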

Directory Organization

A vsfs directory (like those of other file systems) holds (entry name, inode number) pairs.
In a directory data block, each entry has the form: inum | reclen | strlen | name.

  • dot (.): the current directory
  • dot-dot (..): the parent directory
  • inum = 0: the entry is unlinked (deleted); the record length is kept so a new entry can be created in the gap later.

We should also note again that this simple linear list of directory entries is not the only way to store such information. As before, any data structure is possible.
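
A minimal sketch of such an entry (the fixed-size name field is a simplification; real systems store variable-length names):

// Illustrative directory entry matching the inum | reclen | strlen | name layout.
typedef struct dirent {
    unsigned int   inum;     // inode number; 0 means the entry was deleted
    unsigned short reclen;   // total bytes of this record, including any padding
    unsigned short strlen;   // length of the name actually stored
    char           name[28]; // entry name, e.g. ".", "..", "foo"
} dirent_t;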

Free space maintenance

The file system must track which inodes and data blocks are free, so that it can allocate new space for files and directories; this is free space management.

vsfs uses two bitmaps.

  • File creation:
    1. The file system searches the inode bitmap (sketched below) and allocates a free inode to the file.
    2. It marks that inode as in use and writes the updated bitmap back to disk.
    3. Data block allocation follows similar steps with the data bitmap.
  • Other considerations:
    • ext2/ext3: when possible, find a run of free blocks so that future allocations for the file stay contiguous on disk, improving performance (pre-allocation).
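
A bitmap is simply an array of bits, one per inode or data block. A minimal sketch of the search-and-mark step might look like this (the helper name and sizes are made up for illustration):

#include <stdint.h>

#define NUM_BLOCKS 64
static uint8_t bitmap[NUM_BLOCKS / 8];  // one bit per block: 0 = free, 1 = in use

// Find the first free block, mark it used, and return its index (-1 if none free).
int bitmap_alloc(void) {
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if ((bitmap[i / 8] & (1 << (i % 8))) == 0) {
            bitmap[i / 8] |= (1 << (i % 8));
            return i;
        }
    }
    return -1;
}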

Access Paths: Reading and Writing

So far we have seen how files and directories are stored on disk. Now we need to understand how reads and writes actually proceed.

  • Reading a file from disk
  • open() -> traverse the path name -> find the inode to get basic information (metadata)
    1. The open() system call is issued.
    2. Traverse the path name.
      • The file system reads the root inode first; its inode number is fixed when the file system is mounted (usually 2).
      • The file system extracts a pointer to a data block that contains the root directory's entries.
      • It reads those entries to find the next path component, and repeats until it reaches the target file.
    3. Find the block that holds the target's inode, and read that inode.
    4. open() reads the target's inode into memory.
    5. The file system checks the access permissions.
    6. A file descriptor is allocated in the open-file table and returned to the user.
  • read() - comes after open()
    1. The inode tells where the file's blocks are on disk, so the file system reads the required block.
    2. For the first read (file offset 0), that is the file's first block.
    3. After the read, the inode's last-access time is updated.
    4. The file offset in the open-file-table entry (behind the file descriptor) is advanced.
    5. The next read() starts from the updated offset.
  • close(): deallocate the file descriptor.
    In summary (opening and reading /foo/bar):
  1. Read root inode -> read root data -> read foo inode -> read foo data -> read bar inode.
  2. Read bar inode -> read bar data[0] -> write bar inode (updating its access time).
  3. Read bar inode -> read bar data[1] -> write bar inode, and so on.

As can be seen, the longer the path, the more inodes and directory data must be read.
And if a directory is large, many data blocks must be read just to find the target entry.

  • Writing to disk
  • As with reading, the sequence is open(), then write(), then deallocating the file descriptor.
    • Unlike a read, a write may require block allocation: the file system must decide which block to allocate before writing -> the on-disk structures (such as the data bitmap and the inode) must be updated.
    • Logically, each such write therefore generates five I/Os:
      • One to read the data bitmap, one to write it back (recording the new allocation on disk).
      • Two to read and write the inode (recording the new block's location).
      • One to write the actual data block.
  • File creation is even more involved and needs many I/Os:
    • One to read the inode bitmap (to find a free inode), one to write the inode bitmap (recording the allocation).
    • One to write (initialize) the new inode, one to write the directory's data block (linking the new name to its inode number).
    • Two more to read and update (write) the directory's inode, and so on.

Caching and buffering

For performance, most file systems cache frequently used blocks in DRAM.

  • Without a cache: every open needs at least two reads per directory in the path (the directory's inode, then its data).
  • Early systems used a fixed cache of about 10% of total memory -> such static partitioning is wasteful.
  • Modern OSes use dynamic partitioning instead.
    • Virtual-memory pages and file-system pages share one unified page cache.
    • Memory can then be shifted flexibly between the file system and virtual memory as needed.
  • Cached reads: the first open of a path may cause a lot of I/O for directory inodes and data, but later accesses to the same directories hit the cache and need no extra I/O.
  • Cached writes: changed data must eventually be written back to disk, so the cache delays and collects it -> write buffering.
    • By delaying writes in memory:
      1. Several writes can be batched into fewer I/Os.
      2. Performance improves because the resulting I/Os can be scheduled.
    • Of course, if the system crashes before the buffered data is written, that data is lost.
    • To avoid this, call fsync() (sketched below) for writes that must reach the disk right away.
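
A minimal sketch of forcing a buffered write out with fsync() (the file name is just an example):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/journal.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    const char *msg = "must survive a crash\n";
    write(fd, msg, strlen(msg));  // may sit in the page cache for a while
    fsync(fd);                    // force the dirty data (and metadata) to disk now
    close(fd);
    return 0;
}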

Concurrency Bugs

Concurrency Bugs

Key Question: How do we handle common concurrency bugs?
* They tend to follow common patterns; knowing those patterns tells us what to avoid.

Types of bugs

Non-deadlock Bugs

There are two major types of non-deadlock bugs: atomicity violations and order violations.

  • Atomicity Violation: “The desired serializability among multiple memory accesses is violated (i.e. a code region is intended to be atomic, but the atomicity is not enforced during execution).”
    • If region A relies on B not changing, nothing should be allowed to interrupt A in the middle and change B.
    • This violation can usually be fixed simply by putting a lock (pthread_mutex_t) around the region.
  • Order Violation: “The desired order between two (groups of) memory accesses is flipped (i.e., A should always be executed before B, but the order is not enforced during execution)”
    • If A must be guaranteed to happen before B, enforce that order.
    • This can be fixed by enforcing the order with condition variables (pthread_cond_t).

Deadlock Bugs

A cycle in the resource-allocation graph implies the possibility of deadlock.

Why is this happening?

  • Complex dependencies between components.
    • In the OS, for example: the virtual memory system calls into the file system to fetch a disk block, while the file system needs a memory page to load that block into, so it calls back into the virtual memory system -> a cycle.
  • Encapsulation: modularization and locking do not mix well, because callers cannot see which locks a module acquires internally.

Conditions required for deadlock:

  • Mutual exclusion: threads claim exclusive control of resources that they require (e.g., a thread grabs a lock) - exclusive control of a resource.
  • Hold-and-wait: threads hold resources allocated to them (e.g., locks they have already acquired) while waiting for additional resources (e.g., locks they wish to acquire) - holding resources while waiting for more.
  • No preemption: resources (e.g., locks) cannot be forcibly removed from threads that are holding them - resources cannot be forcibly taken back.
  • Circular wait: there exists a circular chain of threads such that each thread holds one or more resources (e.g., locks) that are being requested by the next thread in the chain - a cycle.

All four conditions must hold for a deadlock to occur.

Deadlock Prevention:

  • Circular wait: breaking this condition is the most common and practical way to prevent deadlock (a lock-ordering sketch follows this list).
    • Easiest: impose a total ordering on lock acquisition.
    • In complex systems, a partial ordering over groups of locks is used instead.
  • Hold-and-wait: acquire all the locks that will be needed in one step.
    • For example, hold a global "prevention" lock around the whole lock-acquisition phase so no other thread can interleave its own acquisitions.
    • This hurts encapsulation (a caller must know every lock it will need up front) and also lowers concurrency and performance.
  • No preemption: trylock() offers a more flexible interface (take the lock if it is free, otherwise back off and release what you already hold).
    • It can cause livelock: threads may keep retrying without ever making progress.
    • It also has encapsulation problems: if the lock-acquisition routine is complicated and expensive, restarting it repeatedly is hard to implement correctly.
  • Mutual exclusion: avoid locks altogether by building lock-free/wait-free data structures.
    • Instead of lock/update/unlock, use compare-and-swap to retry the update until it succeeds (this still carries a risk of livelock).
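
A minimal sketch of total lock ordering, here by always acquiring two mutexes in address order, which breaks the circular-wait condition (the function name is made up for illustration):

#include <pthread.h>

// Every thread acquires any pair of locks in the same global (address) order,
// so no cycle of waiting threads can form.
void lock_pair_in_order(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a < b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}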

Sometimes it is better to "avoid" deadlock rather than "prevent" it.
To do so, the scheduler must know which threads will require which locks; based on that information it schedules threads so that deadlock can never arise.
Because it depends on knowing the workload in advance, this is not a generally applicable method.

Deadlock Detection and Recovery

  • Detection: build the resource-allocation graph and check it for cycles.
  • Recovery: depends on the system; it may even be left to a human operator.

Semaphore

Semaphore

Definition

A semaphore is an integer object that is manipulated through two routines; under POSIX these are sem_wait() and sem_post().

#include <semaphore.h>
sem_t s;
sem_init(&s, 0, 1);
// Initialize the value to 1.
// Second parameter 0 -> the semaphore is shared among the threads of this process.
  1. sem_wait(): decrements the semaphore.
    1. Returns immediately if the value was >= 1.
    2. Otherwise the caller waits until the value becomes >= 1 again.
    3. Remember there are two ways of waiting: sleeping and spinning.
  2. sem_post(): does not wait, unlike sem_wait().
    1. It increments the semaphore and wakes one waiting thread, if any.
  3. If the semaphore value is negative:
    1. Threads wait only while the value is negative.
    2. Its magnitude equals the number of waiting threads; user code cannot observe this value.

Binary Semaphore (lock)

To use a semaphore as a lock, the initial value must be 1, no matter how many threads will contend for it. The walk-through below uses two threads (a code sketch follows it).

  1. Thread A decrements the semaphore to 0 (sem_wait).
    1. Since the value did not go negative, A does not wait and enters the critical section.
  2. Thread B decrements the semaphore to -1 (sem_wait).
    1. Since the value is negative, B waits.
  3. A calls sem_post, and the semaphore becomes 0.
    1. This wakes thread B.
  4. B calls sem_post after its critical section, and the semaphore returns to 1.
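
A minimal sketch of this pattern, using a binary semaphore to protect a shared counter (names are illustrative; compile with -pthread):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t mutex;      // binary semaphore used as a lock
int counter = 0;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);  // enter the critical section
        counter++;
        sem_post(&mutex);  // leave the critical section
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    sem_init(&mutex, 0, 1);  // initial value 1 -> the lock starts out free
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d\n", counter);  // expect 200000
    return 0;
}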

Semaphore as a condition variable

For example, when a parent thread is waiting for its child thread to finish, that signaling can be done through a semaphore.

  1. The parent creates the child, then calls sem_wait().
  2. The child calls sem_post() once it has finished its job.
  3. The parent then knows the child is done and proceeds.

In this example the parent must block on sem_wait() whenever the child has not finished yet, so the initial semaphore value must be 0.
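
A minimal sketch of that ordering pattern with one child thread (names are illustrative):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t done;  // initialized to 0: the parent must wait until the child posts

void *child(void *arg) {
    printf("child: running\n");
    sem_post(&done);         // signal "I am finished"
    return NULL;
}

int main(void) {
    pthread_t c;
    sem_init(&done, 0, 0);
    pthread_create(&c, NULL, child, NULL);
    sem_wait(&done);         // blocks until the child posts
    printf("parent: child is done\n");
    pthread_join(c, NULL);
    return 0;
}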

Producer/consumer - Finite Buffer

  • Also called the bounded-buffer problem.

Assumption: MAX = 1, empty is initialized to MAX, and full to 0.
The consumer waits on full; the producer waits on empty.
With these values the queue can hold only one element at a time, so the producer and consumer strictly alternate in the walk-through below.

1. The consumer runs first -> sem_wait(full) -> full becomes -1 -> the consumer waits.
2. The producer runs -> sem_wait(empty) -> empty becomes 0, so it continues -> puts an element in the queue -> sem_post(full) -> full becomes 0.
3. The consumer wakes up -> takes the element -> sem_post(empty) -> empty becomes 1.
Now empty = 1 and full = 0, back to the initial state.
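
A minimal sketch of the bounded buffer with semaphores, with MAX set to 1 to match the walk-through above (with several producers or consumers, a mutex around put/get would also be needed):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define MAX 1
int buffer[MAX];
int fill = 0, use = 0;
sem_t empty, full;

void put(int value) { buffer[fill] = value; fill = (fill + 1) % MAX; }
int  get(void)      { int v = buffer[use]; use = (use + 1) % MAX; return v; }

void *producer(void *arg) {
    for (int i = 0; i < 5; i++) {
        sem_wait(&empty);  // wait for a free slot
        put(i);
        sem_post(&full);   // signal that a slot has been filled
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 5; i++) {
        sem_wait(&full);   // wait for a filled slot
        printf("consumed %d\n", get());
        sem_post(&empty);  // signal that a slot is free again
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&empty, 0, MAX);  // MAX slots are empty at the start
    sem_init(&full, 0, 0);     // no slots are full at the start
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}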

Reader-Writer Lock

  1. A reader increments the reader count after acquiring the read lock.
    1. The first reader also acquires the write lock by calling sem_wait() on the writelock semaphore.
    2. When releasing the read lock, the last reader out releases the write lock with sem_post().
      1. -> This lets any number of readers hold the read lock at the same time.
      2. -> A writer, on the other hand, must wait until all readers are done.
      3. -> The last reader is the one that finally releases the write lock.
  2. This is unfair to writers and can even starve them; there are implementations that fix this.
  3. Use it with caution, as it can carry significant overhead (a sketch of the simple version follows).
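
A minimal sketch of the simple (writer-starving) reader-writer lock described above, built from two semaphores:

#include <semaphore.h>

typedef struct rwlock {
    sem_t lock;       // protects the readers count
    sem_t writelock;  // held by the first reader, or by a writer
    int   readers;
} rwlock_t;

void rwlock_init(rwlock_t *rw) {
    rw->readers = 0;
    sem_init(&rw->lock, 0, 1);
    sem_init(&rw->writelock, 0, 1);
}

void rwlock_acquire_readlock(rwlock_t *rw) {
    sem_wait(&rw->lock);
    rw->readers++;
    if (rw->readers == 1)        // the first reader locks out writers
        sem_wait(&rw->writelock);
    sem_post(&rw->lock);
}

void rwlock_release_readlock(rwlock_t *rw) {
    sem_wait(&rw->lock);
    rw->readers--;
    if (rw->readers == 0)        // the last reader lets writers in again
        sem_post(&rw->writelock);
    sem_post(&rw->lock);
}

void rwlock_acquire_writelock(rwlock_t *rw) { sem_wait(&rw->writelock); }
void rwlock_release_writelock(rwlock_t *rw) { sem_post(&rw->writelock); }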

RCU: Read-Copy-Update - a lock-free synchronization approach

  • Can be used for the reader-writer problem.
  • The idea is that new readers are directed to the modified copy of the data.
  • Once the old readers have finished with the old copy, it can be reclaimed.
  • This approach effectively splits removal and reclamation into two separate phases.

Lock

Lock

Basic Concept

A lock protects an ordinary shared-variable update, such as the increment below.

lock_t mutex;
lock(&mutex);
balance = balance + 1; // Critical Section
unlock(&mutex);

Since a lock is a variable, it must be initialized before use.
A lock is always in one of two states:

  • Available/unlocked/free: no thread holds the lock.
  • Acquired: exactly one thread (the owner) holds the lock, presumably inside the critical section.

Locks give the programmer a minimal degree of control over scheduling.
Generally the programmer creates threads and the OS schedules them; a lock hands a small part of that control back to the programmer.

Lock Implementation

Key Question: How can we build an efficient lock? What hardware support does it need? What OS support?

Lock Evaluation

  1. Does it provide mutual exclusion?
    1. The basic job: preventing multiple threads from entering the critical section at once.
  2. Performance: especially the overhead the lock adds.
    1. No contention: what is the overhead when a single thread locks and unlocks?
    2. Contention: what happens when multiple threads compete for the lock on a single CPU?

Interrupt Control

Mutual exclusion can be implemented by disabling interrupts inside the critical section.

  1. Pro: very easy; the code in the critical section executes atomically (on that CPU).
  2. Cons:
    1. The calling thread must be privileged to enable/disable interrupts.
      1. We must trust that it uses this privilege only for its intended purpose.
      2. e.g., a greedy or buggy program could disable interrupts, monopolize the CPU, or loop forever.
    2. Does not work on multiprocessors.
      1. When threads run on multiple CPUs, each may try to enter the same critical section.
      2. Disabling interrupts on one CPU has no effect on the others, so mutual exclusion is not guaranteed.
    3. Important interrupts may be missed.
      1. For example, if the CPU misses the interrupt signaling that a disk read has completed, when would the process waiting on that read ever be woken?
    4. Inefficient: on modern CPUs, enabling/disabling interrupts tends to be slower than regular instructions.
  3. Usage:
    1. Interrupt masking should therefore be used only in limited places.
    2. e.g., the OS disables interrupts to update its own internal data structures atomically.
    3. Inside the kernel the trust issue disappears, since the OS already holds full privilege over the machine.

Spin Lock

Test-And-Set (Atomic Exchange)

As discussed, disabling interrupts is pointless on multiprocessors, so multiprocessor systems rely on hardware support instead.

The simplest hardware support is Test-And-Set or atomic exchange.

Without hardware support

typedef struct __lock_t {
    int flag;
} lock_t;

void init(lock_t *mutex) {
    // 0 -> lock is available, 1 -> held
    mutex->flag = 0;
}

void lock(lock_t *mutex) {
    while (mutex->flag == 1)  // TEST the flag
        ;                     // spin-wait (do nothing)
    // an interrupt can occur here, after the test but before the set
    mutex->flag = 1;          // now SET it!
}

void unlock(lock_t *mutex) {
    mutex->flag = 0;
}

The code above has two problems:

  1. Correctness: with an untimely interrupt (between the test and the set), both threads can see flag as 0 and both set it to 1 -> mutual exclusion fails.
  2. Performance: the spin-wait checks the flag endlessly, wasting CPU time until the other thread releases the lock -> especially wasteful on a single processor.

With hardware support

// TestAndSet is atomic (provided by hardware)
int TestAndSet(int *old_ptr, int new) {
    int old = *old_ptr;  // fetch old value at old_ptr
    *old_ptr = new;      // store 'new' into old_ptr
    return old;          // return the old value
}

typedef struct __lock_t {
    int flag;
} lock_t;

void init(lock_t *lock) {
    // 0: lock is available, 1: lock is held
    lock->flag = 0;
}

void lock(lock_t *lock) {
    while (TestAndSet(&lock->flag, 1) == 1)
        ;  // someone else has the lock, so spin-wait (do nothing)
}

void unlock(lock_t *lock) {
    lock->flag = 0;
}

This is a spin lock: it burns CPU cycles until it acquires the lock -> on a single CPU it only makes sense with a preemptive scheduler, which can interrupt the spinning thread via the timer.

Spin Lock Evaluation

  1. Mutual exclusion: works.
  2. Fairness: does it guarantee waiting threads a chance at the lock?
    1. No. A spinning thread may spin forever under contention; simple spin locks (as discussed so far) are not fair and may lead to starvation.
  3. Performance
    1. Single CPU: the overhead of spinning can be significant (a waiting thread burns its whole time slice).
    2. Multiple CPUs: it works reasonably well.

Compare-And-Swap

int CompareAndSwap(int *ptr, int expected, int new) {
    int actual = *ptr;
    if (actual == expected)
        *ptr = new;
    return actual;
}

Compare-and-swap is a stronger primitive than test-and-set; it can be used to build wait-free synchronization.

In a spin-lock implementation, however, it is used exactly like TestAndSet, as sketched below.
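
A minimal sketch of lock() built on CompareAndSwap (reusing the lock_t defined above): keep trying to change flag from 0 (free) to 1 (held), spinning while the swap fails.

void lock(lock_t *lock) {
    while (CompareAndSwap(&lock->flag, 0, 1) == 1)
        ;  // spin-wait: someone else still holds the lock
}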

Load-Linked & Store-Conditional

int LoadLinked(int *ptr) {
    return *ptr;
}

int StoreConditional(int *ptr, int value) {
    if (not updated since the LoadLinked) {
        *ptr = value;
        return 1;  // success
    } else {
        return 0;  // fail to update
    }
}

Load-linked behaves like an ordinary load instruction.

void lock(lock_t *lock) {
    while (LoadLinked(&lock->flag) || !StoreConditional(&lock->flag, 1))
        ;  // spin
}

void unlock(lock_t *lock) {
    lock->flag = 0;
}

As expected, StoreConditional can fail if another thread sneaks in between the two operations -> but its "no update since the load-linked" check guarantees that only one thread acquires the lock at a time.

Fetch-And-Add

int FetchAndAdd(int *ptr) {
    int old = *ptr;
    *ptr = old + 1;
    return old;
}

This enables the ticket lock: since FetchAndAdd is atomic, each thread draws a ticket number, and a turn counter decides whose turn it is to hold the lock.

Once a thread is done with the lock, it increments the turn counter, which lets exactly one thread (the holder of the matching ticket) exit its while loop and acquire the lock.

  • The important difference from the previous locks is that threads acquire the lock in ticket order, so there is no starvation.
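
A minimal sketch of the ticket lock built on FetchAndAdd:

typedef struct __ticket_lock_t {
    int ticket;  // next ticket number to hand out
    int turn;    // ticket number currently allowed to hold the lock
} ticket_lock_t;

void ticket_init(ticket_lock_t *lock) {
    lock->ticket = 0;
    lock->turn = 0;
}

void ticket_lock(ticket_lock_t *lock) {
    int myturn = FetchAndAdd(&lock->ticket);  // atomically draw a ticket
    while (lock->turn != myturn)
        ;  // spin until it is this thread's turn
}

void ticket_unlock(ticket_lock_t *lock) {
    lock->turn = lock->turn + 1;  // pass the lock to the next ticket holder
}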

How can we avoid wasting CPU time spinning?

First Approach: Yield (give up the CPU)

The approach is simple: instead of spinning, a thread that cannot get the lock yields the CPU to another thread.

  • It works fine when there are only a couple of threads.
  • If many threads exist:
    • Say the scheduling policy is round-robin with n threads: up to n-1 run-and-yield cycles pass before the lock holder runs again.
    • This is better than n-1 threads spinning away their time slices, but it still pays context-switch overhead every time (a problem when context switches are expensive).
    • Starvation is still possible: a thread may keep yielding and never find the lock free on its turn.

Second Approach: Queues - sleeping instead of spinning

The fundamental problem with the methods so far is that they leave too much to luck: whether spinning or yielding, they are wasteful and cannot prevent starvation.

So we need to explicitly control which thread gets the lock next -> OS support plus a queue of waiting threads.

The lock is implemented in two stages, a guard and the lock itself. The cases are:

  • Guard held, lock already taken: add yourself to the queue, release the guard, and go to sleep.
  • Guard held, lock free: take the lock, release the guard, and run the critical section.
  • On unlock (guard held): wake the next thread in the queue, then release the guard.

The guard is still acquired by spinning, but this is far more efficient: it covers only a few instructions around the queue and the flag, never the critical section itself.

Linux Lock - Futex

Futex:

  • Each futex is associated with a particular physical memory address.
  • The kernel keeps an internal wait queue per futex.
  • Callers use the futex calls to sleep or wake as needed:
    • futex_wait(address, expected) puts the caller to sleep only if *address still equals expected; futex_wake(address) wakes one waiting thread.

Two Phase lock

  • First phase: spin for a while, hoping the lock will be released soon.
  • Second phase: if that fails, sleep, and get woken once the lock is released.

Concurrency

Concurrency

So far, we have discussed how the OS:

  • Turns a single CPU into multiple virtual CPUs so that many programs appear to run at the same time.
  • Makes each process look as if it has its own private (virtualized) memory.

Through the address space abstraction, each program behaves as if it owns all of memory, while the OS multiplexes physical memory among the many address spaces.

The key difference between threads and processes:

  • Threads of the same process share the address space -> they can access the same values.

Thread:

  • Each thread's state is very similar to a process's state:
  • it has its own program counter (PC) and registers.
  • If two threads run within one process, switching between them requires a context switch.
    • A thread context switch is similar to a process context switch.
    • Just as a process has a process control block (PCB), each thread has a thread control block (TCB).
    • The biggest difference is that a thread switch does not change the address space (the same page table stays in use).

  • A single-threaded process has one stack, at the bottom of the address space.
  • In a multi-threaded process each thread runs independently and may call its own chain of routines, so each thread gets its own stack.

Concurrency Problems

Race condition: a situation where the output changes depending on the order in which instructions happen to execute.

  • If a context switch occurs at an unfortunate moment, a wrong (unexpected) result comes out.
    • The result is indeterminate.
  • Critical section: a code region that multiple threads must not execute at the same time.
    • That is why we need mutual exclusion.
  • Waiting for another: a dependency in which one thread must wait for another before it can proceed.

Concurrency must be handled carefully so that kernel data structures (page tables, the process list, file system structures, and so on) keep working correctly.

Thread API

Thread Creation

pthread_create(thread, attr, routine, arg);
// thread: pointer to a pthread_t; used afterwards to refer to this thread.
// attr: attributes such as stack size and scheduling priority - NULL is usually fine.
// routine: the function the thread will run.
// arg: the argument passed to that routine.

Thread Exit

pthread_join(thread, value_ptr);
// thread: which thread are we waiting for?
// value_ptr: pointer (void **) that pthread_join fills with the thread's return value.
  • Key points (see the sketch below)
    • Pass NULL when no argument or return value is needed.
    • A single argument does not need to be wrapped in a struct.
    • Be careful how values are passed to and returned from the thread (e.g., never return a pointer to something on the thread's stack).
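
A minimal sketch of creating a thread, passing it an argument, and collecting its return value (all names are illustrative; compile with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *square(void *arg) {
    int n = *(int *)arg;
    int *result = malloc(sizeof(int));  // heap-allocated, not on the thread's stack
    *result = n * n;
    return result;
}

int main(void) {
    pthread_t t;
    int n = 7;
    void *ret;
    pthread_create(&t, NULL, square, &n);
    pthread_join(t, &ret);              // blocks until the thread finishes
    printf("square = %d\n", *(int *)ret);
    free(ret);
    return 0;
}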

Lock

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
// Or initialize at runtime and check the return code:
int rc = pthread_mutex_init(&lock, NULL);
assert(rc == 0);

pthread_mutex_lock(&lock);             // waits here if another thread holds the lock
while (ready == 0)
    pthread_cond_wait(&cond, &lock);   // releases the lock while sleeping,
                                       // re-acquires it before returning
x = x + 1;                             // critical section
pthread_mutex_unlock(&lock);

// The other thread gives the wake-up call:
pthread_mutex_lock(&lock);
// Holding the lock while setting ready and signaling
// prevents a race condition on ready.
ready = 1;
pthread_cond_signal(&cond);            // wakes one thread waiting on cond
pthread_mutex_unlock(&lock);

Address Translation

Address Translation

Limited Direct Execution (LDE): at certain moments, such as system calls and timer interrupts, the OS steps in so the process cannot cause problems; with a little hardware support the OS provides efficient virtualization while otherwise staying out of the process's way.

  • We want:

    • Efficiency: requires hardware support, such as TLBs and page tables.
    • Control: the OS must guarantee that an application cannot touch memory that is not its own -> also needs hardware support.
    • Flexibility: address spaces should be easy to use, and the system easy to program.
  • Address translation = hardware-based address translation

    • It can be thought of as an extension of LDE.
    • Through address translation:
      • On each memory reference (instruction fetch, load, store), the hardware translates the virtual address into a physical address.
      • In other words, the hardware relocates the program's memory references to where the data really lives.
      • Of course, the OS must know which memory is free and which is in use, and manage memory overall -> it sets up the hardware so the translations come out correct.

From now on we assume each user address space is contiguous, not too big (smaller than physical memory), and the same size for every process.

Dynamic (hardware-based) relocation = base and bound

  1. Each CPU needs two hardware registers: a base register and a bound (or limit) register.
  2. The base-and-bound pair lets us place an address space wherever we want in physical memory.
  3. In this setup, each program is written and compiled as if it were loaded at (virtual) address 0.
  4. When the program starts, the OS decides where it will live in physical memory and sets the base register accordingly.
    • physical address = virtual address + base (this is how address translation happens).
  5. The address space can also be relocated later, after the process has started.

Note: static relocation (rewriting addresses at load time) has a major problem: it offers no protection -> other processes can make illegal accesses, and relocating later is hard.

  1. The bound register is used for protection.
    • The hardware first checks that each memory reference is legal by comparing it against the bound.
    • If the process references memory outside its base-bound region -> the CPU raises an exception, and the OS will likely kill the process.
    • In other words, the bound literally marks the end of the region (we already know where it starts: the base).

Note: the base and bound registers are hardware structures that live on the CPU chip.

Memory management unit (MMU): the part of the processor that performs address translation; more sophisticated memory management requires more circuitry in the MMU.
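
A minimal sketch of what happens on each memory reference under base and bounds (in reality this check is done by MMU circuitry on every access, not by C code):

#include <stdbool.h>
#include <stdint.h>

// Illustrative only: returns false when the reference is out of bounds,
// which in hardware would raise an exception (and the OS would likely kill the process).
bool translate(uint32_t virt, uint32_t base, uint32_t bound, uint32_t *phys) {
    if (virt >= bound)
        return false;
    *phys = base + virt;  // relocate: physical = virtual + base
    return true;
}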

Hardware support: summary

  1. Two CPU modes are needed:
    • Kernel (privileged) mode: has access to the entire machine.
    • User mode: application programs run here with restricted capabilities.
  2. Processor status word: a bit in a register indicates which mode the CPU is currently executing in.
  3. Base and bound registers, part of the MMU:
    • Physical address = virtual address + base.
    • The hardware must check that every address is legal; the check uses the bound register and other MMU circuitry.
    • The hardware must provide (privileged, kernel-only) instructions to change the base and bound registers.
    • The hardware must be able to raise exceptions and jump to their handlers.

OS issues

To virtualize memory with base and bounds, the OS must handle several important tasks.

  1. At process creation, the OS must find free memory to hold the new address space.
    • The OS views memory as an array of slots and tracks each slot's usage, searching a free list for empty space.
  2. At process termination, the OS must reclaim the memory for later use: it returns the space to the free list and updates the relevant data structures.
  3. The OS must do extra work when a context switch happens.
    • There is only one base-bound pair per CPU, and each program lives at a different physical address, so on a context switch the OS saves the base and bound values in the PCB and restores those of the next process.
      Note: while a process is not running, the OS can easily move its address space: copy it to the new location and update the base register saved in the PCB. The program never notices, because this happens before its registers are restored.
  4. The OS must provide exception handlers.
    • The OS installs the handlers at boot time using privileged instructions.
      Note: ordinary address translation is handled by the hardware, with no OS intervention.
    • We still follow LDE; the OS intervenes only when a process does something illegal.

Summary

Dynamic relocation can be inefficient -> it causes internal fragmentation (the unused space inside each allocated region is wasted).
Internal fragmentation can be reduced by generalizing base and bounds, which leads to segmentation.

Proportional Share

Proportional share = fair share

Instead of optimizing turnaround or response time, the scheduler tries to guarantee each job a fair share of the CPU.

Lottery Scheduling - Randomness

Each job's share of the CPU is determined by how many lottery tickets it holds.
This does not guarantee the exact share the tickets represent, but the longer the jobs run, the closer the achieved shares come to the target proportions.

  • Why randomness:
  1. It often avoids strange corner-case behaviors that a more traditional algorithm may have trouble handling.
  2. Lightweight: it does not have to track how much CPU each job has used, only the tickets each job holds.
  3. Very fast: each decision is simply drawing a winning ticket.

Method

  • Ticket currency: a user can freely slice its allotted tickets into its own currency; if user A has 100 global tickets, it can hand out, say, 300 tickets in its own currency to its jobs, which the system converts back to the global currency.
  • Ticket transfer: a process can temporarily hand its tickets to another process, which is useful in client/server settings.
  • Ticket inflation: makes no sense among competing processes, but among processes that trust one another a process can simply tell the system it needs more tickets.

Implementation

It is very easy to implement: all it needs is a random number generator, a list of jobs, and the total number of tickets.

int counter = 0;
int winner = getrandom(0, totaltickets);  // pick a winning ticket number
node_t *current = head;                   // walk the job list
while (current) {
    counter += current->tickets;
    if (counter > winner)
        break;                            // found the winner
    current = current->next;
}
// 'current' is now the winning job, so schedule it next.
// Keeping the list sorted by descending ticket count minimizes the search.

Stride Scheduling - deterministic

As we have seen, randomness keeps the scheduler simple, but it does not guarantee exact proportions, especially over short time scales.

  • Stride scheduling: a deterministic fair-share scheduler (see the sketch after this list).
    • Each job has a stride, inversely proportional to the number of tickets it holds.
    • Each time a process runs, its counter (the pass value) is incremented by its stride, tracking how much CPU it has consumed.
    • The scheduler decides which process to run using the pass values:
      • run the process with the lowest pass value, then increase its pass by its stride.
    • So why would we still use lottery scheduling?
    • Imagine a new job arriving: what should its pass value be set to?
    • Lottery scheduling keeps no such per-process state (CPU usage, pass values); it only considers tickets.
    • Therefore it is much easier to add a new process to the mix.
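
A minimal sketch of stride scheduling for a fixed set of jobs (the constant 10000 used to derive strides is just a convention; all names and numbers are illustrative):

#include <stdio.h>

#define NUM_JOBS 3

typedef struct {
    const char *name;
    int tickets;
    int stride;  // a large constant divided by the ticket count
    int pass;    // how much "virtual time" this job has consumed so far
} job_t;

int main(void) {
    job_t jobs[NUM_JOBS] = {
        { "A", 100, 0, 0 }, { "B", 50, 0, 0 }, { "C", 250, 0, 0 }
    };
    for (int i = 0; i < NUM_JOBS; i++)
        jobs[i].stride = 10000 / jobs[i].tickets;

    // Make 20 scheduling decisions: always run the job with the lowest pass value.
    for (int step = 0; step < 20; step++) {
        job_t *next = &jobs[0];
        for (int i = 1; i < NUM_JOBS; i++)
            if (jobs[i].pass < next->pass)
                next = &jobs[i];
        printf("run %s (pass=%d)\n", next->name, next->pass);
        next->pass += next->stride;  // charge the job for its time slice
    }
    return 0;
}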

Summary

Lottery and stride scheduling are not widely used, for these reasons:

  1. These approaches do not mesh well with I/O.
  2. Ticket assignment is hard: how many tickets should each job get?

In a virtualized environment, however, proportional-share scheduling can be useful: for example, to dedicate 25% of the machine to running a particular virtual machine.

Multi Level Feedback Queue

Multi Level Feedback Queue

  • Key Question: How can we construct a scheduler that minimizes both response time and turnaround time without any prior information about the jobs?

MLFQ Basic Rules

MLFQ consists of multiple queues, each with its own priority level.

  1. If priority(A) > priority(B), run A.
  2. If priority(A) == priority(B), run A and B in round-robin.
  • MLFQ sets priority based on a job's observed behavior, e.g., whether it keeps waiting for keyboard input.
  • If a job stays at high priority too long, its level is lowered.
  3. When a job enters the system, it is placed at the highest priority.
    4a. If it uses up its entire time slice, its priority is reduced.
    4b. If it gives up the CPU before the time slice ends, it keeps its priority level.

Key Points:
* The scheduler does not know whether a job will be short.
* So it first assumes the job is short and gives it the highest priority.
* If the job really is short, it finishes quickly.
* If not, it gradually sinks into the lower-priority queues, revealing itself as a long-running job.
* In this way MLFQ approximates SJF without any prior information.

Problems:

  1. Starvation: with too many interactive jobs (which give up the CPU before their slice ends), jobs in the lower-priority queues never get to run.
  2. Gaming the scheduler: a program can be written to yield just before its slice expires, keeping high priority and seizing the CPU for far longer than its share.
  3. A job may change its behavior over time, but it has no way to regain a higher priority once it has sunk.
  To address the first and third problems, a priority boost rule is added:
  5. After some time period S, move all the jobs in the system to the topmost queue.
    1. Starvation is avoided: every job is guaranteed some CPU time.
    2. A job whose behavior has become interactive is treated as intended after the boost.

To prevent the second problem, gaming the scheduler, rules 4a and 4b are merged into a single rule:
4. Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).

Other problems/concerns

  1. How many queues should there be?
  2. How big should the time slice be in each queue?
  3. How often should priorities be boosted to avoid starvation?
    There is no definitive answer to any of these.
    Implementations differ, e.g., giving each priority queue its own time slice length (longer slices for the lower levels).

Scheduling

Scheduling policy = discipline

Workload Assumptions

Let us first make a number of simplifying assumptions about the processes running in the system, sometimes collectively called the workload.

  • Assumptions are:
    1. Each job runs for the same amount of time.
    2. All jobs arrive at the same time.
    3. Once started, each job runs to completion.
    4. All jobs only use the CPU (i.e., they perform no I/O)
    5. The run-time of each job is known.

Scheduling Metric

  • Turnaround time = T_completion - T_arrival.

    • Since we assume all jobs arrive at the same time, T_arrival = 0 and turnaround time equals completion time.
  • Fairness: performance and fairness are often at odds in scheduling.

  • FIFO

    • Convoy effect: short jobs get stuck waiting for a long job ahead of them to finish.
  • Shortest Job First (SJF)

    • A convoy effect is still possible once jobs may arrive at different times: if the shortest job arrives last, it still has to wait behind a long job that already started.
  • Shortest Time-to-Completion First (STCF)

    • Whenever a new job arrives, compare the remaining completion times of all jobs and run the one with the least time left.
    • It is preemptive: the running job can be stopped in the middle.

New scheduling metric: response time

Response time = T_firstrun - T_arrival: the gap between a job's arrival and the first time it runs.
Intuitively, policies such as STCF and FIFO have poor response time.

  • Round Robin
    • Time slice (scheduling quantum) -> the slice must be a multiple of the timer-interrupt period.
    • If the time slice is too short -> context-switch costs become significant.
    • In general, fair policies such as Round Robin do badly when the metric is turnaround time.

Incorporating IO


While a process is doing I/O it does not use the CPU; it sits in the blocked (waiting) state.
To avoid wasting the CPU, the scheduler overlaps the I/O by running other ready jobs in the meantime.

Limited Direct Execution

  • Virtualization via time sharing of the CPU. Two concerns must be balanced:
    • Performance: virtualize without adding too much overhead.
    • Control: run processes efficiently while keeping the OS in control of the CPU.

Basic Concept: Limited Direct Execution

  • Direct execution:
    • Just run the program directly on the CPU.
    • When a program is run, the OS creates a process entry in the process list.
    • It allocates memory, loads the program code from disk, and jumps to the program's main() entry point to run the code.


Running a program with no limits on direct execution creates several problems:

  1. How can we ensure the program does not do things the OS does not want?
  2. How can we stop a running program and switch to another process in the middle of its execution?
    --> In other words, how can we implement time sharing?

This is why we need “Limited” Direct Execution.

Problem 1: Limited Execution

What if the program needs to execute privileged operations such as I/O or requesting additional memory? -> Introduce two processor modes:

  • User mode: restricted access; for example, a process cannot issue I/O requests directly.
  • Kernel mode: unrestricted access.

To let user processes perform privileged operations, modern OSes provide system calls.

  • File-system access, process creation and deletion, inter-process communication, memory allocation, and so on.

To make a system call, a program executes a trap instruction.

  • The trap raises the privilege level from user to kernel as control branches into the kernel.
  • Once in kernel mode, the OS can execute any instruction and carry out whatever the process requested.
  • When finished, the OS executes return-from-trap, which lowers the privilege back to user mode.

When the trap instruction executes, the hardware must save the necessary registers of the calling process,
so that return-from-trap can later resume the user process correctly.

* On x86, the PC, flags, and a few other registers are pushed onto the per-process kernel stack.
* return-from-trap pops this state back off to resume the user-mode process where it left off.

A process cannot specify the address to branch to when it traps, since that would amount to giving it access to the kernel.

  • Instead, the kernel sets up a trap table at boot time and uses it to control where traps land.
  • Booting happens in kernel mode, so the hardware can be configured however the OS wants.

The OS uses privileged instructions to tell the hardware where the trap handlers are located.
Trap handler: the code the hardware should run when a given interrupt or system call occurs.


At boot time:

1. The kernel initializes the trap table.
2. The OS tells the hardware the locations of the trap handlers.

After boot, on each system call:

  1. The user program traps into the OS.
  2. The hardware saves its registers (onto the kernel stack) and branches to the trap handler.
  3. The kernel handles the trap, then issues return-from-trap.
  4. The hardware restores the registers from the kernel stack and branches back to the user program.
  5. The user program continues where it left off.

Problem 2: Process Switch

How does the OS get the CPU back so that it can switch to another process?

A Cooperative Approach: Wait For System Calls

  • The OS trusts processes to behave reasonably: each process periodically gives up the CPU so that other processes get a chance.
  • Processes frequently hand control to the OS anyway by making system calls (e.g., while reading files); there is also an explicit yield system call.
  • If an application misbehaves (e.g., performs an illegal operation), a trap is triggered and the OS regains control of the CPU.

A Non-Cooperative Approach: The OS Takes Control

Timer interrupt: fires periodically and runs the OS's interrupt handler.
The OS then decides whether to continue the current process or switch to another; the decision is made by the scheduler policy.

  • Context save and restore
    • The OS saves the current process's register state on its kernel stack (and in its PCB) and restores the saved state of the next process.
    • As a result, return-from-trap resumes the next process rather than the one that was interrupted.
    • When a later switch goes the other way, the previously interrupted process continues from exactly where it stopped.

ASIDE: KEY CPU VIRTUALIZATION TERMS (MECHANISMS)

  • The CPU should support at least two modes of execution: a restricted user mode and a privileged (non-restricted) kernel mode.
  • Typical user applications run in user mode, and use a system call to trap into the kernel to request operating system services.
  • The trap instruction saves register state carefully, changes the hardware status to kernel mode, and jumps into the OS to a pre-specified destination: the trap table.
  • When the OS finishes servicing a system call, it returns to the user program via another special return-from-trap instruction, which reduces privilege and returns control to the instruction after the trap that jumped into the OS.
  • The trap tables must be set up by the OS at boot time, and make sure that they cannot be readily modified by user programs. All of this is part of the limited direct execution protocol which runs programs efficiently but without loss of OS control.
  • Once a program is running, the OS must use hardware mechanisms to ensure the user program does not run forever, namely the timer interrupt. This approach is a non-cooperative approach to CPU scheduling.
  • Sometimes the OS, during a timer interrupt or system call, might wish to switch from running the current process to a different one, a low-level technique known as a context switch.