// writing system call code
// assume we're implementing read(2): sys_read() below is its entry point in
// the kernel.  In Linux, any ptr holding a user virt addr is usually
// annotated with "__user".
int sys_read(int fd, void __user *buf, int len)
{
  // 1. validate your arguments!
  // fd: it could be a negative number
  if (fd < 0)
    return -EBADF; // linux convention: return -errno; read(2) uses EBADF
                   // for a bad fd
  // is fd pointing to a VALID, open file?  Else -EBADF, see read(2).
  // if fd was open, and now is closed, it'll also point to an "invalid"
  // file (i.e., a file that's not open NOW).
  // Was the file opened with O_RDONLY (or O_RDWR)?  Else read(2) also says
  // -EBADF ("not open for reading"), not -EPERM or -EACCES.

  // Every process has a 'struct task_struct' in linux, with everything you
  // ever want to know about that process while it's running, including
  // (via its 'files' field) its table of open files.
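  // To make the fd checks above concrete, here's a minimal user-space
  // sketch.  ASSUMPTION: toy_task, toy_file, and toy_fget_read are
  // hypothetical stand-ins, far simpler than Linux's task_struct and
  // files_struct.  The point is only that "is fd valid?" means a bounds
  // check, a lookup in the task's open-file table, and a check of the mode
  // the file was opened with, all failing with -EBADF as read(2) documents.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_NR_OPEN    16   /* assumed max no. of open fds per task */
#define TOY_MODE_READ  0x1  /* file was opened for reading */
#define TOY_MODE_WRITE 0x2  /* file was opened for writing */

/* Hypothetical, simplified open file: just the mode it was opened with. */
struct toy_file {
    int mode;               /* TOY_MODE_* bits set at open() time */
};

/* Hypothetical, simplified task: an array of open files indexed by fd. */
struct toy_task {
    struct toy_file *files[TOY_NR_OPEN]; /* NULL slot => fd not open */
};

/* Look up 'fd' in 'task' for reading.  Returns the file, or NULL and sets
 * *err to -EBADF if fd is negative, out of range, closed, or not open for
 * reading. */
static struct toy_file *toy_fget_read(struct toy_task *task, int fd, int *err)
{
    if (fd < 0 || fd >= TOY_NR_OPEN || task->files[fd] == NULL) {
        *err = -EBADF;      /* negative, out of range, or closed */
        return NULL;
    }
    if (!(task->files[fd]->mode & TOY_MODE_READ)) {
        *err = -EBADF;      /* e.g. opened write-only */
        return NULL;
    }
    *err = 0;
    return task->files[fd];
}
```

  // In the real kernel, the lookup goes through current->files, and the
  // caller must also drop the reference it took on the file when done.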

  // When a user process issues a syscall, the kernel syscall handler knows
  // which process made it, b/c that process was just running on this
  // CPU/core.  The kernel keeps track of which task (struct task_struct) is
  // currently running on each CPU.  While you're executing syscall code on
  // behalf of that task, you can always refer to "current" (a macro that
  // yields a struct task_struct *) as a ptr to the process that the syscall
  // is running for.
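  // The "which task runs on which CPU" bookkeeping above can be modeled as
  // one slot per CPU.  A minimal sketch, ASSUMING hypothetical names
  // (toy_proc, toy_running, toy_switch_to, toy_current); real Linux keeps
  // this in per-CPU data and exposes it through the 'current' macro, whose
  // implementation varies by architecture.

```c
#include <stddef.h>

#define TOY_NR_CPUS 4               /* assumed no. of CPUs */

struct toy_proc {
    int pid;
};

/* One slot per CPU: the task currently assigned to that CPU (NULL = idle). */
static struct toy_proc *toy_running[TOY_NR_CPUS];

/* Called by the (toy) scheduler when it switches 'cpu' to task 't'. */
static void toy_switch_to(int cpu, struct toy_proc *t)
{
    toy_running[cpu] = t;
}

/* What syscall code running on 'cpu' sees as "current": the task that
 * trapped into the kernel, since it's still the one assigned to this CPU. */
static struct toy_proc *toy_current(int cpu)
{
    return toy_running[cpu];
}
```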

  // buf: is it NULL -> error (-EFAULT)
  // Q: what is actually passed to the kernel in "buf"?
  // A: it's just a NUMBER that happens to represent a mem addr.
  // Q: what's special about THIS particular 'buf' number?
  // A: it's a VIRTUAL MEMORY address in the scope of THIS user process!
  // Recall, every process has its own virt addr space (as large as 4GB on
  // 32-bit processors).  Those virt addr spaces are private: the same
  // number can mean different memory (or nothing) in another process.

  // assume i had the data i wanted to give the user process, in a buffer
  // called data, could i now do this:
  //   memcpy(buf, data, len); // NO
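  // 'buf' cannot be used as is in the kernel, b/c it's a number chosen by
  // the user that's only meaningful inside THAT process's address space!  If
  // you access that mem addr or copy to it directly, nothing checks it: the
  // page may not be mapped at all, the addr may point into kernel memory (so
  // a malicious user could make the kernel overwrite itself), or it may map
  // to something else entirely, and you'll corrupt whatever is there.  On
  // CPUs with SMAP (x86) or PAN (ARM), the hardware even faults on such
  // direct kernel accesses to user pages.  End result is a hang, crash,
  // security hole, or some bad mem corruption inside the OS.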

  // So you have to go through THIS process's virtual-to-physical (V2P)
  // mapping to reach the real memory behind 'buf'.  Those mappings live in
  // the process's page tables (reached in Linux via current->mm), which are
  // loaded into the MMU while the process runs.
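  // A V2P translation splits the virtual addr into a page number and an
  // offset within the page, then looks the page number up in a table.  A
  // minimal sketch, ASSUMING a single-level table over a tiny 64KB address
  // space (real x86 page tables are multi-level; toy_pte and toy_v2p are
  // hypothetical names):

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                      /* 4KB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_NPAGES     16                      /* assumed 64KB address space */

/* One page-table entry: is the virtual page mapped, and to which frame? */
struct toy_pte {
    int present;       /* 1 if this virtual page is mapped */
    uint64_t pfn;      /* physical frame number */
};

/* Translate 'vaddr' using 'pt'.  Returns 0 and stores the physical addr in
 * *paddr, or -1 if the page is outside the table or not mapped (where the
 * real MMU would raise a page fault). */
static int toy_v2p(const struct toy_pte pt[TOY_NPAGES], uint64_t vaddr,
                   uint64_t *paddr)
{
    uint64_t vpn = vaddr >> TOY_PAGE_SHIFT;     /* virtual page no. */
    uint64_t off = vaddr & (TOY_PAGE_SIZE - 1); /* offset within the page */

    if (vpn >= TOY_NPAGES || !pt[vpn].present)
        return -1;
    *paddr = (pt[vpn].pfn << TOY_PAGE_SHIFT) | off;
    return 0;
}
```

  // E.g., if virtual page 1 maps to frame 0x42, vaddr 0x1234 translates to
  // paddr 0x42234: same offset (0x234), different page number.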

  // So the kernel can check whether 'buf' is a valid virt mem addr for this
  // process, else return -EFAULT.

  // Also need to validate that all the bytes in buf are valid, not just
  // buf[0], but all the way to buf[len-1].  How to validate: should we check
  // each and every byte in the range?  Maybe just check first and last?  But
  // what if len=10000: a page in the middle could be unmapped.  We need to
  // validate every page in the range, not every byte or just first/last.  So
  // in reality, you take the byte range and expand it to the aligned 4KB
  // pages that enclose it.  Then you check each page in that range ONCE.
  // Q: what is valid?
  // A: check that there is a valid, actual V2P mapping for that process.
  // A: we also need to check the permissions on these pages.  read() writes
  // into buf, so each page must have the PROT_WRITE permission bit enabled.
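  // The "expand to whole pages, check each page once" idea above, as a
  // minimal sketch.  ASSUMPTION: a hypothetical flat table perms[i] holds
  // the PROT_*-style bits of virtual page i (0 = not mapped); toy_range_ok
  // and the TOY_PROT_* values are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE  4096UL
#define TOY_PROT_READ  0x1
#define TOY_PROT_WRITE 0x2

/* Returns 1 if every page touched by [addr, addr+len) is mapped with all
 * the bits in 'need', else 0.  'npages' is the size of the perms table. */
static int toy_range_ok(const int *perms, size_t npages,
                        uintptr_t addr, size_t len, int need)
{
    if (len == 0)
        return 1;                           /* nothing to touch */
    if (addr + len - 1 < addr)
        return 0;                           /* range wraps around */

    uintptr_t first = addr / TOY_PAGE_SIZE;            /* page of buf[0] */
    uintptr_t last = (addr + len - 1) / TOY_PAGE_SIZE; /* page of buf[len-1] */

    for (uintptr_t pg = first; pg <= last; pg++) {     /* once per page */
        if (pg >= npages || (perms[pg] & need) != need)
            return 0;
    }
    return 1;
}
```

  // With addr=100 and len=10000, the loop checks only pages 0, 1, and 2
  // (3 lookups), instead of 10000 bytes or just the two end pages.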

  // also check that 'len' is valid, not negative; there may be an upper
  // bound allowed (e.g., Linux clamps a single read/write to MAX_RW_COUNT,
  // just under 2GB), perhaps also a ulimit/setrlimit imposed.

  // How to do all of this?  Thankfully a kernel can be thought of as a
  // collection of lots of services, many middle-ware layers, and many
  // "libraries" of functionality that you can invoke (meaning: lots and
  // lots of helper functions).
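  if (!access_ok(buf, len)) // one of several routines in linux for
    return -EFAULT;         // validating user buffers.
  // Note: access_ok(addr, size) (2 args since 5.0; older kernels also took
  // a VERIFY_READ/VERIFY_WRITE type) only checks that the whole range lies
  // in user space, below the kernel's addrs.  The per-page mapping and
  // PROT_* checks happen lazily: copy_to_user() touches the pages, and if
  // one faults, it stops and reports how many bytes it couldn't copy.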
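  // Roughly what that range-only check amounts to, as a minimal sketch.
  // ASSUMPTION: TOY_TASK_SIZE is a made-up top-of-user-space value, not any
  // architecture's real TASK_SIZE, and toy_access_ok is hypothetical.  Note
  // it's written as two comparisons so 'addr + size' can never overflow.

```c
#include <stdint.h>

#define TOY_TASK_SIZE 0x7ffffffff000ULL  /* assumed end of user space */

/* Returns 1 if [addr, addr+size) lies entirely below TOY_TASK_SIZE, else 0.
 * It does NOT look at page tables: mapped-ness and PROT_* bits are checked
 * later, when the pages are actually touched. */
static int toy_access_ok(uint64_t addr, uint64_t size)
{
    return size <= TOY_TASK_SIZE && addr <= TOY_TASK_SIZE - size;
}
```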

  // Note: unit of protection is on a per page (4KB) basis, even if that
  // page may be shared with other processes.  As long as the page in THIS
  // process is valid and has the right PROT_* flags, the kernel's happy.

  // Q: how does kernel prevent an overflow: the len is greater than what
  // was actually allocated in user space?
  // A: can't use sizeof() b/c it gets determined at compile time, not
  // runtime.  And users can alloc a much larger buffer and choose to use a
  // portion of it -- that's ok.  The kernel still checks that all pages in
  // the range are valid and have the right flags, so if user passed a len
  // larger than the actual buf, the kernel CAN do a buf overflow and
  // scribble data read from fd, past the end of your buf.  This could
  // corrupt the state of a user program, but kernel won't go into invalid
  // pages or those without the right flags.

  // Tip: if you want to write or modify a syscall, find the closest syscall
  // in nature, and find its sys_XXX entry point, and read the code.  It'll
  // tell you about helpful routines to validate args, what are the "right"
  // errors to return from many conditions, etc.

  // __user is a marker that's a noop for gcc (the macro expands to
  // nothing).  It's a hint to programmers that this addr is a user virt
  // addr, so be careful and only access it through the proper helpers.
  // The "sparse" static analysis tool for linux (run via "make C=1") does
  // see it: it tracks a "chain of custody" for every ptr marked __user and
  // warns if one is dereferenced directly or mixed up with kernel ptrs.
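  // Here's roughly how older kernels defined __user (in
  // include/linux/compiler.h; newer ones spell the address space
  // differently), plus a function showing why gcc alone won't save you.
  // toy_peek is a hypothetical name.  sparse defines __CHECKER__ when it
  // runs, so only then does __user turn into a real attribute.

```c
#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
#else
# define __user                 /* plain gcc build: expands to nothing */
#endif

/* gcc compiles this without complaint b/c __user vanished; run through
 * sparse, the direct dereference of 'p' would be flagged. */
static char toy_peek(const char __user *p)
{
    return *p;  /* BAD in real kernel code: use get_user()/copy_from_user() */
}
```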

  // 2. find the bytes of the file and stuff them into the user buffer.
  // vfs_read()/vfs_write() are helpers that move file data directly to/from
  // a __user buffer like 'buf' (they do the user copy for you).  To use a
  // kernel buffer instead, newer kernels provide kernel_read()/kernel_write().

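  // memcpy() doesn't check or handle user addrs at all (on some configs it
  // may appear to work, b/c the user's pages stay mapped while the kernel
  // runs on the process's behalf), but there's no guarantee.  To copy bufs
  // b/t user and kernel space safely, use:
  //   unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
  //   unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
  // Both return the no. of bytes NOT copied (0 on success), e.g.:
  //   if (copy_to_user(buf, data, len)) return -EFAULT;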
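  // The "return bytes NOT copied" contract of copy_to_user(), as a minimal
  // user-space model.  ASSUMPTION: a "user buffer" is simulated by a
  // hypothetical toy_ubuf whose first 'valid' bytes count as mapped and
  // writable; the real routine instead lets the MMU fault and recovers via
  // the kernel's exception tables.

```c
#include <stddef.h>
#include <string.h>

/* Simulated user mapping: only the first 'valid' bytes are writable. */
struct toy_ubuf {
    char *base;     /* start of the simulated user mapping */
    size_t valid;   /* no. of bytes from 'base' that are mapped & writable */
};

/* Copy up to n bytes to offset 'off' of 'to'.  Stops where the simulated
 * mapping ends and returns how many bytes were NOT copied (0 = success). */
static size_t toy_copy_to_user(struct toy_ubuf *to, size_t off,
                               const void *from, size_t n)
{
    size_t ok = 0;

    if (off < to->valid)
        ok = to->valid - off;   /* bytes we can write before "faulting" */
    if (ok > n)
        ok = n;
    if (ok > 0)
        memcpy(to->base + off, from, ok);
    return n - ok;              /* bytes NOT copied */
}
```

  // So a partial copy is reported as a nonzero leftover count, which
  // callers usually turn into -EFAULT.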

  // 3. at end, return appropriate value or error (as per read syscall): the
  // no. of bytes actually read (0 at EOF), or -errno on failure.
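  // On the user side, a C library wrapper turns the kernel's "-errno"
  // convention into the familiar "return -1 and set errno".  A minimal
  // sketch, ASSUMING the -4095..-1 error range that glibc uses on Linux;
  // toy_syscall_ret is a hypothetical name.

```c
#include <errno.h>

/* Map a raw kernel return value to the libc convention: errors become -1
 * with errno set; anything else is passed through unchanged. */
static long toy_syscall_ret(long kret)
{
    if (kret < 0 && kret >= -4095) {
        errno = (int)-kret;  /* e.g. -EBADF from sys_read => errno = EBADF */
        return -1;
    }
    return kret;             /* e.g. no. of bytes read, 0 at EOF */
}
```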
}

int sys_open(const char __user *name, int flags, int mode)
{
  // 1. validate that flags contains only valid flags, and that mode is a
  // valid mode.
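  // A whitelist-style flags/mode check, as a minimal sketch.  ASSUMPTION:
  // TOY_KNOWN_FLAGS and toy_check_open_args are hypothetical, and this is
  // the STRICT style: classic open(2) has historically ignored unknown flag
  // bits, and openat2(2) is the variant that rejects them with EINVAL.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical set of flags this toy open() understands. */
#define TOY_KNOWN_FLAGS \
    (O_ACCMODE | O_CREAT | O_EXCL | O_TRUNC | O_APPEND | O_NONBLOCK)

/* Returns 0 if flags/mode look valid, else -EINVAL. */
static int toy_check_open_args(int flags, int mode)
{
    if (flags & ~TOY_KNOWN_FLAGS)
        return -EINVAL;            /* some bit we don't understand */
    if ((flags & O_ACCMODE) == O_ACCMODE)
        return -EINVAL;            /* not RDONLY, WRONLY, or RDWR */
    if (mode & ~07777)
        return -EINVAL;            /* only perm + setuid/setgid/sticky bits */
    return 0;
}
```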

  // validate that 'name' (another ptr) is a valid addr.  Note that "char *"
  // is nothing more than a pointer to a mem addr that's supposed to hold a
  // '\0'-terminated array of characters: the name of a file to open.  Must
  // validate this name ptr like the 'buf' in sys_read().

  // Problem: I don't know how long 'name' is.
  // Why not use strlen(): count no. of bytes until you get a '\0'?
  // There is a strlen() in the kernel.  Careful, strlen has to READ a byte
  // to determine if it's null or not.  Suppose I verify that the first byte
  // in 'name' is in a valid page with PROT_READ, but a later byte may be in
  // a DIFFERENT page (crosses a 4KB boundary) that isn't valid.  On Linux, a
  // single file name can't be longer than NAME_MAX (255) bytes, and a whole
  // /path/name can't be longer than PATH_MAX (4096) bytes, incl. the '\0'
  // (POSIX only sets minimums for these).  So in the worst case, a valid
  // name might span two pages.  Still, the kernel has to check however long
  // the name is, but it can stop after 4096 bytes and fail with
  // -ENAMETOOLONG.
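  // The bounded scan described above, as a minimal user-space sketch:
  // copy until '\0', but give up after 'max' bytes instead of trusting the
  // user to terminate the string.  ASSUMPTION: toy_copy_name and
  // TOY_PATH_MAX are hypothetical; the real getname() uses
  // strncpy_from_user(), which additionally survives a fault mid-string.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_PATH_MAX 4096  /* mirrors Linux's PATH_MAX, counting the '\0' */

/* Returns the string length on success (not counting '\0'), or
 * -ENAMETOOLONG if no '\0' was found within 'max' bytes.  'dst' must have
 * room for 'max' bytes. */
static long toy_copy_name(char *dst, const char *src, size_t max)
{
    for (size_t i = 0; i < max; i++) {
        dst[i] = src[i];        /* in the kernel: one checked user read */
        if (src[i] == '\0')
            return (long)i;
    }
    return -ENAMETOOLONG;       /* never saw a terminator */
}
```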

  // use a helper routine:
  struct filename *kname;
  kname = getname(name); // validates the __user 'name', allocates a buf in
                         // kernel space, and copies the string into it
                         // (similar to strdup).
  if (IS_ERR(kname))     // errors come back encoded in the ptr itself
    return PTR_ERR(kname);
  // (older kernels' getname() returned a plain "char *"; since ~3.7 it
  // returns a "struct filename *" whose ->name is the kernel string.)
  // if you didn't get an error from getname(), then kname->name is a valid
  // string, null terminated, in kernel space, that you can pass to other
  // routines.  For example, pass it to filp_open() to open a named file
  // inside the kernel.  Best not to touch user memory any more than needed.

  // once you're done with kname, you must free it using
  putname(kname);
  // must remember to free any kmalloc'd objects in the kernel, else you
  // leak PHYSICAL memory forever (until the system is rebooted)!  Mem might
  // be allocated as a side effect of all sorts of routines in the kernel,
  // such as this getname()/putname() pair.

  // there's a malloc library in the linux kernel (kmalloc()/kfree(), plus
  // all sorts of other object/mem allocators, such as slab caches), and it
  // keeps metadata that knows how much was allocated and where, so kfree()
  // needs only the ptr, not the no. of bytes to free.
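  // Why free() needs only a ptr, as a toy: the allocator records the size
  // in a header just before the block it hands out.  ASSUMPTION: toy_kmalloc,
  // toy_ksize, and toy_kfree are hypothetical wrappers around malloc/free;
  // Linux's slab allocators actually find the size from page/slab metadata,
  // not a per-object header, so this only illustrates the idea.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored right before each block.  The extra members only pad the
 * header so the caller's block after it stays suitably aligned. */
union toy_hdr {
    size_t size;      /* bytes the caller asked for */
    long double ld_;  /* alignment padding */
    void *p_;         /* alignment padding */
};

static void *toy_kmalloc(size_t n)
{
    union toy_hdr *h = malloc(sizeof(*h) + n);
    if (h == NULL)
        return NULL;
    h->size = n;      /* remember the size for toy_ksize()/toy_kfree() */
    return h + 1;     /* caller's block starts right after the header */
}

/* Recover the recorded size from the ptr alone. */
static size_t toy_ksize(const void *p)
{
    return ((const union toy_hdr *)p - 1)->size;
}

static void toy_kfree(void *p)
{
    if (p != NULL)
        free((union toy_hdr *)p - 1); /* step back to the real start */
}
```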
}

// next time: allocating resources in the kernel, releasing them, and coding
// conventions.