// 3. show stack trace and OOPS trace, how to analyze
// - messages from OOPS (and "negative" ptrs)
// - stack trace analysis
// - location within function binary

int sys_foo(void)
{
  // print a stack trace of the current task.  The same kind of trace is
  // also printed as part of any "oops" or BUG/WARN message.
  dump_stack();
  // helpful for learning the code path that led to THIS function.

  // caveat: dump_stack() does its best to display an accurate stack, but
  // there is no easy way to tell whether a value on a kstack is a function
  // (return) address or just a variable.  Sometimes you'll see functions
  // listed with a '?' in front.  Entries marked with '?' may not be actual
  // callers: inspect the actual code to see who calls whom.

  // 1. ext4_read(...)
  // 2. vfs_read(...)
  // 3. ? do_read(...)
  // 4. sys_read(...)

  // When you see a stack entry with a '?' in front of it, it may be a good idea to
  // check the actual code: that is, in above example, can sys_read actually
  // call do_read? If so, then more likely that sys_read called do_read;
  // else unlikely.

  // sometimes you'll see a stack trace that makes NO SENSE, i.e., some
  // "random" looking functions calling each other, where they're not even
  // related.  This often indicates stack memory corruption, so don't try
  // to analyze that stack; rather, look for a bug that could have trashed
  // your stack (e.g., bad ptr, buffer overflow).

  // what's in an oops trace?
  // 1. a message like "NULL pointer dereference at 0x00000000"
  //    The message could also identify the assertion that failed.  If you
  //    trigger a "BUG_ON(ptr != NULL && i < 10)", the message points at
  //    that BUG_ON(): either the text inside the parentheses or, on many
  //    kernels, a "kernel BUG at <file>:<line>!" line naming its location.
  // 2. cpu register dumps
  // 3. stack trace
  // 4. the hex addr and (hopefully also the) name of the function where the
  //    problem occurred.  Note that the top of the stack may NOT be the
  //    actual fxn that triggered the oops, but its parent function.
  // 5. the hex instruction position inside the function that triggered the
  //    oops, relative to the entire size of that function
  //    e.g., BUG in foo(...) at 0x12A/0x7F9. (roughly at start)
  //    e.g., BUG in foo(...) at 0x71B/0x7F9. (roughly at end)
  //    caveat: compiled code includes optimizations, inlined functions, and
  //            expanded CPP macros, so the position in the binary only
  //            roughly matches the position in the source code.
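
  // A minimal user-space sketch of reading one stack-trace entry, assuming
  // the common Linux text form "[? ]name+0xOFFSET/0xSIZE" (e.g.,
  // "? do_read+0x12a/0x7f9").  The struct and function names below are made
  // up for illustration; they are not kernel APIs.

```c
#include <stdio.h>

/* Hypothetical holder for one parsed stack-trace entry. */
struct trace_entry {
    int unreliable;        /* 1 if the entry was prefixed with '?' */
    char name[64];         /* function name */
    unsigned long offset;  /* instruction offset inside the function */
    unsigned long size;    /* total size of the function's code */
};

/* Parse "[? ]name+0xOFF/0xSIZE"; returns 0 on success, -1 on bad input. */
static int parse_trace_entry(const char *line, struct trace_entry *e)
{
    while (*line == ' ')
        line++;
    e->unreliable = (*line == '?');
    if (e->unreliable)
        line++;
    /* %63[^+] reads the name up to '+'; %lx accepts an optional 0x prefix. */
    if (sscanf(line, " %63[^+]+%lx/%lx", e->name, &e->offset, &e->size) != 3)
        return -1;
    if (e->size == 0 || e->offset > e->size)
        return -1;
    return 0;
}

/* Rough position of the instruction, as a percentage of the function. */
static unsigned int trace_entry_percent(const struct trace_entry *e)
{
    return (unsigned int)(e->offset * 100 / e->size);
}
```

  // With the two examples above, 0x12A/0x7F9 comes out at about 14% into
  // the function (near the start) and 0x71B/0x7F9 at about 89% (near the
  // end).  An entry with unreliable == 1 is one to double-check in the code.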

  // Sometimes you get a "NULL ptr dereference at 0x00000008"
  struct bar {
    int x; // each INT is 4 bytes
    int y;
    int field;
  };
  struct bar *ptr = NULL; // or uninitialized and happens to be 0.
  printk("%d\n", ptr->field); // "ptr" is a pointer to a struct.
  // meaning ptr was NULL, and the code tried to deref a field that was 0x8
  // bytes into the struct that ptr points to (x at 0x0, y at 0x4, field at
  // 0x8).

  // Sometimes you get a "NULL ptr dereference at 0xFFFFFFF0"
  // same as above, but trying to deref something that's 0x10 (16 bytes)
  // BEFORE the ptr: on a 32-bit system, 0 - 16 wraps around to 0xFFFFFFF0.
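
  // The arithmetic behind both messages can be checked in user space.  This
  // is only a sketch, assuming 4-byte ints and 32-bit addresses; struct bar
  // is repeated so the block is self-contained, and the two helper
  // functions are hypothetical names, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

struct bar {
    int x;      /* offset 0x0 (assuming 4-byte int) */
    int y;      /* offset 0x4 */
    int field;  /* offset 0x8 */
};

/* Address a 32-bit CPU would fault on when accessing `offset` bytes from
 * a NULL base pointer; negative offsets wrap around to high addresses. */
static uint32_t null_fault_addr32(int32_t offset)
{
    return (uint32_t)offset;  /* 0 + offset, modulo 2^32 */
}

/* Inverse: recover the signed offset from a faulting 32-bit address. */
static int32_t offset_from_fault_addr32(uint32_t addr)
{
    if (addr >= 0x80000000u)
        return -(int32_t)(uint32_t)~addr - 1;  /* undo two's complement */
    return (int32_t)addr;
}
```

  // offsetof(struct bar, field) is 8 here, so a NULL ptr->field faults at
  // 0x00000008, while offset_from_fault_addr32(0xFFFFFFF0) gives -16.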

  // If you get multiple OOPS traces, try to find the FIRST one, as that one
  // is the most likely to point at the bug to fix.  Often, a bug in one
  // place will trigger a bug in another place, etc.

  return 0;
}

// 4. develop an intuition for bugs
// - what if the whole system freezes?
// - what if the system appears to get slower and slower?
// - what if variables/pointers seem to have "strange" values?
// - when to reboot?
// - time b/t when bug happens and when its effects are visible.
// - what to do if you get an OOPS
// - stack trace

// If you get an OOPS, record it somewhere, then reboot the system.  After
// an oops, the system is in an "unknown" state: you may not be able to
// unload any modules (esp. if the OOPS code was inside the module), and
// the system may not even reboot cleanly.

// First try to reboot using the /sbin/reboot tool (see flags to reboot(8)
// for "forced" reboot).  If that doesn't work, use the VMware console to
// perform an unclean restart.  An unclean, or "cold", restart may cause
// buffered data to be lost and not written to disk.  So be sure to have
// saved any important code, and even pushed it to the git repo.

// If the system reboots, then you can move on to debug the code based on
// the OOPS trace you recorded.  But if the system now does NOT reboot, then
// try to reboot to the vanilla ubuntu kernel (maybe your custom kernel or
// modules need to be recompiled/reinstalled).  If even the ubuntu kernel
// doesn't boot, your only choice is to restore to an earlier, known good
// snapshot.

// If you restore to an earlier state of the VM, all virtual disks and files
// will be restored as well!  That means that even your own files, as well
// as your own GIT repo files, will be restored to an earlier state.  So
// where is the code you wrote b/t the snapshot and the later point in time
// where your system no longer rebooted?  Hopefully you git-push'ed it.  If
// so, the newer code commits are on the git server, and your own VM is now
// at an earlier state of that git repo.  What you should do now is run a
// "git pull" in your git repo, to RETRIEVE all newer changes from the git
// server back into your current git repo.

// Note that if you don't git pull first, and you then make new commits
// locally, then try to push them, you will get a git-push error that the
// "HEAD" of this git repo has diverged.  Don't use "git push -f" to (F)orce
// your new changes, b/c that'll result in a loss of ALL newer changes you
// previously pushed!  Instead, use "git merge" to merge the newer local
// changes with the previously pushed git changes.

// When you take a snapshot, you may also choose the option "Snapshot the
// virtual machine's memory": that lets you store the memory of the system
// as well, all phys+virt, running processes, etc.  Useful when you want to
// restore a VM that was running ok (not one you suspect was buggy).  When
// you restore a VM, some things don't work well, esp. any process that
// doesn't like seeing the system time going backwards (TCP timeouts, and
// more).  So what I do after a restore, is reboot the system, to get it to
// a fresh, clean state with proper time.  Sometimes it's best to take a
// snapshot of a VM that was shut down: faster to take the snapshot anyway
// (no mem to snapshot needed), and then you can reboot.

// If you get an oops, try to run "dmesg" on the shell, and capture the last
// few printks, and oops trace.  A "small" bug, like a bad ptr deref inside
// module code, would typically only lock out that module: you still need to
// reboot, but the system is still running somewhat, just enough for you to
// capture info useful for debugging, then reboot.

// If you can't even get a shell, and the system is completely hung, then no
// choice but a cold restart.  Sometimes you won't even see an oops, but the
// kernel will suddenly just reboot.  This often indicates a really bad bug,
// like a massive buffer overflow or scrambling of many KB of kernel memory,
// esp. at addresses where the actual kernel code lives.  Could also be a
// wrong or missing translation b/t user and kernel address spaces.

// TEMPORAL DISTANCE: time passes b/t cause and effect of a bug.
// A NULL ptr deref will trigger an immediate oops.
// A small mem corruption may take many runs, even days, before it
// manifests.  The same goes for "small" things like leaking a few bytes of
// mem, forgetting to close a file here and there, etc.: you may only
// notice them much later.

// You have to decide if the problem you're seeing is from the latest code
// change, or something older (that you may or may not have already fixed).
// Often, it's good to reboot first, to see if the problem can be reproduced
// consistently.  Note: your latest code may not be at fault, it could be
// old code, so don't just assume recent code is bad and revert it
// unnecessarily.

// mem leaks, esp. small ones, may not be seen initially.  But over time,
// the system overall will feel sluggish and slower.
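
// A toy user-space sketch of a small leak and its delayed effect.  It
// assumes a made-up fixed pool of 4 slots standing in for kernel memory;
// slot_alloc(), slot_free(), and the store_name_*() functions are
// hypothetical names for illustration only.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SLOTS 4
#define SLOT_SIZE  32

static char pool[POOL_SLOTS][SLOT_SIZE];
static int  in_use[POOL_SLOTS];

/* Hand out the first free slot, or NULL when the pool is exhausted. */
static char *slot_alloc(void)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;
}

/* Mark the slot that p points to as free again. */
static void slot_free(char *p)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (p == pool[i])
            in_use[i] = 0;
}

/* Number of slots currently allocated. */
static int slots_in_use(void)
{
    int n = 0;
    for (int i = 0; i < POOL_SLOTS; i++)
        n += in_use[i];
    return n;
}

/* Buggy: the early return on a too-long name never frees buf. */
static int store_name_leaky(const char *name, char *out, size_t outlen)
{
    char *buf = slot_alloc();
    if (!buf)
        return -2;  /* out of memory */
    if (strlen(name) >= SLOT_SIZE || strlen(name) >= outlen)
        return -1;  /* BUG: leaks one slot per failed call */
    strcpy(buf, name);
    strcpy(out, buf);
    slot_free(buf);
    return 0;
}

/* Fixed: every exit path after the allocation goes through one label. */
static int store_name_fixed(const char *name, char *out, size_t outlen)
{
    int err = -1;
    char *buf = slot_alloc();
    if (!buf)
        return -2;  /* out of memory */
    if (strlen(name) >= SLOT_SIZE || strlen(name) >= outlen)
        goto out;
    strcpy(buf, name);
    strcpy(out, buf);
    err = 0;
out:
    slot_free(buf);
    return err;
}
```

// With POOL_SLOTS = 4, four failed calls to store_name_leaky() use up the
// pool, and the fifth call returns -2 even for a valid name: the symptom
// shows up far away from the buggy line.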

// sometimes you may corrupt your ON DISK kernel/module state.  So it's a
// good idea to do a "make clean", rebuild the kernel from scratch, and
// reinstall.


//////////////////////////////////////////////////////////////////////
// 5. steps to develop hw1
// - take it slowly!
// see 10.txt notes