* How to approach HW1 (and any project)

Key: break the problem into many small steps, and develop+test each step
separately.

General comments for hw1:
- leave the crypto to the end
- avoid corner cases (complex errors) for now
- hard-code various values to start with: e.g., so you don't have to worry
  about how to pass args from the cmd line to the program and how to process
  them.
- don't worry about efficiency just yet: it's more important that your code
  is first working and functional, then that it runs efficiently.
- test new features w/ small new programs, then integrate into the "main"
  program.

1. a prog to open a file for reading then close it: test that it works even
   when the file doesn't exist, namely that it reports the right error
   message.
2. then read a few bytes from the file, and printf them.  Be sure your input
   file has plain ascii text and not binary data (can't print binary data).
3. make the prog read all data from the file, in a loop until there's no
   more.
4. write a SEPARATE program that performs steps 1-3, but for writing to a new
   file.  You can test that the dummy data you wrote to the file is there
   using commands such as:

   # dumps contents of the file to stdout (avoid binary data)
   $ cat myfile
   # displays file contents as hex data
   $ od -Ax -h myfile

5. integrate the read+write loop.

Now you have a baseline program, and there are several things you can
investigate in any order.  Realize that the hw1 program is basically a "copy"
program that copies file 1 to file 2 with encryption/decryption in between.
The remaining sections (A, B, C, ...) can be done as standalone pieces of
code, almost in any order.

(A) handling corner cases: all kinds of errors, esp. in the middle of the
inner read/write loop.  E.g., how to recover from a partially written o/p
file.  Or how do you handle getting an actual errno from read or write in the
inner "copy" loop (e.g., EIO, ENOMEM, etc.).
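The five steps above can be sketched as one helper function -- a minimal
sketch, not the required hw1 solution: the function name `copy_file`, the
file names, and the 4KB buffer are illustrative choices, and error handling
is deliberately minimal at this stage.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Steps 1-5: open the input, open/create the output, then copy in a
 * read/write loop until read(2) returns 0 (EOF).
 * Returns 0 on success, -1 on error (message printed to stderr). */
int copy_file(const char *inname, const char *outname)
{
    char buf[4096];
    ssize_t n;

    int infd = open(inname, O_RDONLY);
    if (infd < 0) {
        fprintf(stderr, "%s: %s\n", inname, strerror(errno));
        return -1;
    }
    int outfd = open(outname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (outfd < 0) {
        fprintf(stderr, "%s: %s\n", outname, strerror(errno));
        close(infd);
        return -1;
    }
    /* steps 3+5: read all data in a loop until there's no more */
    while ((n = read(infd, buf, sizeof(buf))) > 0) {
        if (write(outfd, buf, (size_t)n) != n) {
            fprintf(stderr, "write %s: %s\n", outname, strerror(errno));
            n = -1;
            break;
        }
    }
    close(infd);
    close(outfd);
    return n < 0 ? -1 : 0;
}
```

A main() that hard-codes the two names and calls copy_file() is enough to
start testing with cat/od as shown above.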
Recall: the OS will free memory and close open descriptors when your process
exits (but good programmers will do that cleanup before exiting anyway).  The
OS will NOT clean up or delete any state that you left on the file system or
disk.  Imagine your program fails to fully encrypt a source file, and users
ignore or don't notice the err message and delete the source file: when they
try to decrypt the o/p file, at best they'll get part of their orig data ->
users unhappy!

Cleanup: if you failed mid-way, of course exit w/ an appropriate error code
and message, but also don't leave behind half-written files -- delete the
partially written files w/ unlink(2).

Option 1: rename the source file to some temp name, and your program will
have to read from the "temp" name.  If the prog succeeded, delete the temp
name.  If the prog failed half-way: delete the partially written o/p file,
and rename the temp name back to the source's original name.

Option 2: write your new o/p file under a temp name.  If you failed, delete
the temp name; if you succeeded, rename the temp name on top of the old o/p
file, and then you can delete the original src file.

You can form temp names any way you want: for example, if the user asked for
"myinput" and "myoutput", your temp name can be ".myinput.tmp".  Even better,
use functions designed for creating unique temporary file names and even
opening them for you securely -- see mkstemp(3); avoid older variants such as
mktemp(3).

It's a good idea in general to try and "hide" such names from users by
prefixing them with a '.'.  Such "dot-files" are not listed by default by
/bin/ls and other file browsers, unless you ask to see them with "ls -a".
Note that hidden files are not a UNIX OS concept, but a feature of ls and
file browsers.

Good idea to put this "partial failure recovery" code into its own function.
Note you can't really do the above recovery if you don't have a file name,
e.g., when using stdin or stdout.
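Option 2 can be sketched as follows -- an illustrative sketch, where
`write_encrypted()` stands in for the real hw1 copy/encrypt loop and the
".hw1out.XXXXXX" template and function names are made up for the example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Placeholder for the real copy+encrypt loop; here it just writes a
 * string.  Returns 0 on success, -1 on error. */
static int write_encrypted(int fd, const char *data)
{
    size_t len = strlen(data);
    return write(fd, data, len) == (ssize_t)len ? 0 : -1;
}

/* Option 2: write to a hidden temp name first; on failure unlink(2) the
 * partial file, on success rename(2) it on top of the final o/p name. */
int safe_write_output(const char *outname, const char *data)
{
    char tmpname[] = ".hw1out.XXXXXX"; /* mkstemp replaces the XXXXXX */
    int fd = mkstemp(tmpname);         /* creates+opens a unique file */
    if (fd < 0)
        return -1;
    if (write_encrypted(fd, data) != 0) {
        close(fd);
        unlink(tmpname);               /* don't leave half-written files */
        return -1;
    }
    close(fd);
    if (rename(tmpname, outname) != 0) {
        unlink(tmpname);
        return -1;
    }
    return 0;
}
```

Note that rename(2) within the same file system is atomic, which is why the
final o/p file is either the complete old version or the complete new one.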
(B) more basic error conditions: pre-conditions to check before you begin the
main inner loop, and post-conditions.  E.g., a pre-condition: do I have
enough space to write the whole o/p file?

Space issues: check if there's enough, and if not, you can abort.  Note this
is a heuristic that isn't 100% guaranteed: even if you have space at the
start of the program, you may not have space by the end, and vice versa.
Alternative: modern Unix systems include a syscall to "preallocate" file
space (see fallocate(2)).

Q: how to check how much space there is on the file system?
A: df(1) is a good user program
A: in C, use statfs(2) or statvfs(3)

(C) option processing using getopt(3).  Once done, you can remove all the
hard-coded values you had before.

(D) enc/decryption.  Check that you can take some input data string, encrypt
it with some key K, and then decrypt it back to get the same orig data.

(E) Efficiency (definitely the last thing)

recall the read syscall:

  n = read(fd, buf, len)
  // ask to read len bytes from file fd into buf;
  // returns the no. of bytes successfully read, 'n'

Option 1: read the file 1 byte at a time.  Bad: too many syscalls, will be
too slow.

Option 2: read all data into one buffer.  I can use stat(2) to find out the
size of the infile, then alloc a buffer that large.  Bad: if the file is very
large, you put too much mem pressure on the OS, and your program will be
penalized and slowed down.  Worse, the OS may kill your program under heavy
memory pressure (OOMK: Out-of-Memory Killer).

Option 3: pick a unit native to the system and hardware.
- Networking: 1500B MTU packet size (Ethernet)
- Memory: depends on the OS "page size"; most are 4KB (some processors may
  have 8KB or even multiple sizes)
- Disks: traditional block sizes (or sectors) on storage devices were 512B
  long.  In recent years, more storage devices read/write in 4KB sectors (w/
  512B supported for backwards compatibility): 512B was deemed too slow for
  today's h/w, and 4KB also maps nicely onto kernel memory caches.
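The free-space pre-condition from (B) can be sketched with statvfs(3) -- a
heuristic check, as noted above; the function name `have_space` is made up
for the example, and in hw1 the needed byte count would come from stat(2) on
the input file:

```c
#include <sys/statvfs.h>

/* Returns 1 if the file system holding 'path' appears to have at least
 * need_bytes free, 0 if not, -1 on error.  Only a heuristic: space can
 * run out (or free up) between this check and the writes. */
int have_space(const char *path, unsigned long long need_bytes)
{
    struct statvfs sv;
    if (statvfs(path, &sv) != 0)
        return -1;
    /* f_bavail = free blocks available to unprivileged users,
     * f_frsize = fundamental block size in bytes */
    unsigned long long avail =
        (unsigned long long)sv.f_bavail * sv.f_frsize;
    return avail >= need_bytes ? 1 : 0;
}
```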
You can find out the page size on a system w/ getpagesize(2).

Option 4: consider a small integer multiple of the page-size unit.  B/c the
architecture can support reading several times the 4K at once, plus the OS
"readahead" will have already pre-fetched some data from I/O devices.  Even
better if the multiple is a small power of 2: 2 pages, 4 pages, 8 pages, etc.

* Handling Errors (a summary)

First, not all errors are created equal - it depends on the program in
question, and who's using it.  Any function has docs, man pages, listing
possibly many error conditions: e.g., man open(2) lists many errors whose
names begin with "E".  See <errno.h> or related header files for descriptions
of those errors.

Review all errors and decide how to handle each of them:
- don't write code to handle each error separately, unless necessary.
- rather, group them into categories:

1. FATAL: nothing we can do; print an error and exit (maybe cleanup disk):
   e.g., EIO -- I/O error
2. IGNORE: nothing to worry about, can ignore, maybe log a message
3. Some user action may be needed: maybe prompt the user for feedback (e.g.,
   passwords don't match)
4. Error is serious, but you can try to work around it: e.g., if you get NULL
   back from malloc(3), meaning ENOMEM.  Usually the OS goes to great lengths
   to try and give you memory, and often suspends the program until memory is
   available (then wakes up the sleeping program).  But note the OS doesn't
   HAVE to wait until it can give a program memory; the OS can return NULL
   right away.  So it's better for a program that gets NULL from malloc() (or
   any mem alloc function) to try again.  The prog can sleep(1) [1 sec] then
   try malloc again; if that fails, sleep again, etc.  The prog can try this
   indefinitely, or for some no. of times (print warnings, esp. if you abort
   after N tries).
5. Transitory "errors": consider read(2): if you asked to read N bytes and
   you get fewer than N (but more than 0), normally this "partial" or "short"
   read indicates you've read the remaining bytes in the file.
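The retry idea from category 4 can be sketched like this -- the function
name `malloc_retry` and the bounded retry count are illustrative choices,
not a required interface:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Retry malloc(3) up to 'tries' times, sleeping 1 sec between attempts
 * to give the OS a chance to free up memory.  Returns NULL only after
 * all attempts fail; the caller decides whether to abort. */
void *malloc_retry(size_t size, int tries)
{
    for (int i = 0; i < tries; i++) {
        void *p = malloc(size);
        if (p != NULL)
            return p;
        fprintf(stderr, "warning: malloc(%zu) failed, retrying...\n", size);
        sleep(1);
    }
    return NULL;
}
```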
   This happens when you get close to EOF, but it's not guaranteed: you
   still have to call read(2) once again, until you get a 0.  In networking
   environments, more transitory errors can happen: meaning you can get
   fewer bytes than asked for, and it's NOT the end of the file (or network
   stream).
6. If reading from a network socket or resource, you may get the error
   EAGAIN: meaning you should try the action again, and it may succeed.
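Categories 5 and 6 together suggest a "read fully" loop -- a sketch, with
the made-up name `read_fully`; EINTR is also retried, which isn't discussed
above but is the same kind of transitory condition:

```c
#include <errno.h>
#include <unistd.h>

/* Keep calling read(2): loop on short reads until read returns 0 (EOF),
 * and retry on the transitory errors EAGAIN and EINTR.  Returns the
 * total number of bytes read, or -1 on a real error. */
ssize_t read_fully(int fd, char *buf, size_t len)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, buf + total, len - total);
        if (n == 0)                 /* EOF: no more data */
            break;
        if (n < 0) {
            if (errno == EAGAIN || errno == EINTR)
                continue;           /* transitory: try the action again */
            return -1;              /* real error */
        }
        total += (size_t)n;         /* short read: loop for the rest */
    }
    return (ssize_t)total;
}
```

(On a non-blocking descriptor this retry loop spins; a real program would
use select(2)/poll(2) to wait instead, but that's beyond hw1.)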