Security is very broad.  Define a policy first:

1. privacy: keep data secret from others, using encryption for example.
2. authentication: verify your identity (login usernames and passwords,
   PINs, etc.)
3. authorization: control access to data/resources at a later time, once
   you have been authenticated.
4. integrity: verify that data you're accessing hasn't been modified or
   corrupted.  Usually done through use of checksums or hashes.
5. non-repudiation: ability to track actions so that they cannot be refuted.
   Ex., collect log files of users' actions, encrypt them, add integrity
   checking, and transmit securely to a trusted 3rd party.

* Hashes, checksums, digital fingerprints, parity, signatures

A class of functions that take input data D of any length L, and produce
an output number H of size S.  Usually S is much smaller than L.  When S is
large (128 bits or more, even as much as 512), we call it a
"cryptographically strong function" (e.g., MD5, SHA1, SHA256, ...).  Note
that practical collisions have since been found for MD5 and SHA1, so prefer
SHA256 or later for new designs.

BTW, today, many x86 processors have built-in instructions for encryption
and hashing (e.g., the AES-NI and SHA extensions, which operate on the "SSE"
registers) -- this is much faster than computing the same thing in plain
software.

STRONG HASHES: are designed to produce an effectively "unique" hash, with a
very small probability of collision.  A collision is defined as two different
inputs X and Y that produce the same hash H.  Collision probabilities are
very small: the larger the hash size S is, and the fewer pieces of data D
you hash, the lower the probability.  See
https://preshing.com/20110504/hash-collision-probabilities/

The uniqueness property is easier to accomplish if the hash function
distributes the output hashes nearly uniformly.  E.g., MD5 has 128 bits, so
2^128 possible hashes, about 3.4 x 10^38.

Another important property: changing a single input bit in D should result,
on average, in 50% of the bits in hash H changing.  If you plot the number of
changed bits over many trials, you'll get an approximately Normal/Gaussian
curve centered at S/2.  This is also called the "avalanche property".

Non-invertibility: given a hash H, it should be VERY hard to find any data
input D that would have produced that hash.
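The collision-probability and avalanche claims above can be checked with a
short sketch.  This is only an illustration, assuming Python's standard
hashlib and SHA256 as the strong hash; the function names
(collision_probability, bits_changed) are made up for this example, and the
collision formula is the usual birthday approximation, not an exact result.

```python
import hashlib
import math

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday approximation: chance that at least two of num_items
    uniformly distributed hashes of size hash_bits collide."""
    space = 2 ** hash_bits
    x = num_items * (num_items - 1) / (2 * space)
    return -math.expm1(-x)  # 1 - exp(-x), accurate even for tiny x

def bits_changed(data: bytes, bit_index: int) -> int:
    """Flip one input bit and count how many SHA256 output bits differ."""
    flipped = bytearray(data)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    h1 = int.from_bytes(hashlib.sha256(data).digest(), "big")
    h2 = int.from_bytes(hashlib.sha256(bytes(flipped)).digest(), "big")
    return bin(h1 ^ h2).count("1")

data = b"The quick brown fox jumps over the lazy dog"
counts = [bits_changed(data, i) for i in range(len(data) * 8)]
average = sum(counts) / len(counts)
print(f"average bits changed: {average:.1f} of 256")
print(f"P(collision), 10^9 items, 128-bit hash: "
      f"{collision_probability(10**9, 128):.3e}")
```

The average should come out close to 128 (half of 256 bits), and a histogram
of counts should look roughly bell-shaped.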
Because they are hard to invert, hash functions are called "one-way" hash
functions.

Example, in login programs:
1. when you re/set your password P, the computer hashes P and produces a
   hash H.
2. store H in some file; in unix it's /etc/passwd or /etc/shadow
3. next time you log in, you type your password P'
4. computer computes H' from P' using the same hash function
5. compare the previously saved H and H'
6. if they match, login proceeds; else, deny login, re-prompt you for the
   password, or even lock out your account.

Other uses of hashes:
- data deduplication: find matching files or chunks of files and avoid
  storing duplicate data.  Often yielding 10-40x dedup ratios.
- verify integrity of data you store.  Each time you store a file or any
  data, the OS (or storage system) stores a hash of that data with it.  Next
  time you read the data/file back, the OS recalculates the hash and compares
  it to the stored hash.  Alert or produce an error if they don't match.

Note: hashes can tell you if your data's integrity has been compromised, but
cannot tell you how to fix it (get back your original data).  For that, you
need some offline backups, or "error correcting codes" (ECCs).

Example of a very simple hash function: an 8-bit additive checksum (a much
simpler relative of parity bits and CRCs).
1. assume you want an 8 bit hash
2. take your input data, can be long
3. read each byte in the input as a number b/t 0..255
4. sum up all the bytes, truncating to just the lower 8 bits
5. at the end, what you're left with is
   H = (sum of all bytes of data D) % 256

* ENCRYPTION

Encryption preserves the privacy of some data D.
- original unencrypted data is called "cleartext".
- encrypted data is called "ciphertext".
- the encryption alg/software is called the "cipher".
- ciphers take input data and at least one (often secret) key K, and
  produce ciphertext.

Ciphers have different properties/classes:
1. symmetric ciphers use the same key K to enc/dec (HW1)
2. asymmetric ciphers use one key K1 to enc, and a different key K2 to
   decrypt.
- symmetric ciphers are much faster than asymmetric ones.
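Before looking at how ciphers work internally, here is a minimal sketch of
two of the hash examples from earlier in these notes: the 8-bit additive
checksum and the login flow.  The names checksum8, set_password, check_login
and the in-memory password_file dictionary are hypothetical stand-ins (the
dictionary plays the role of /etc/shadow), and SHA256 is an assumed choice of
hash.  Real login systems also add a per-user salt and a deliberately slow
hash, which this sketch leaves out.

```python
import hashlib
import hmac

def checksum8(data: bytes) -> int:
    """H = (sum of all bytes of data) % 256."""
    total = 0
    for b in data:
        total = (total + b) & 0xFF  # keep only the lower 8 bits
    return total

# Hypothetical stand-in for the stored-hash file (e.g., /etc/shadow).
password_file = {}

def set_password(user: str, password: str) -> None:
    # Steps 1-2: hash P and store only the hash H.
    password_file[user] = hashlib.sha256(password.encode()).digest()

def check_login(user: str, attempt: str) -> bool:
    # Steps 3-6: hash the typed password P' and compare H' with stored H.
    stored = password_file.get(user)
    if stored is None:
        return False
    candidate = hashlib.sha256(attempt.encode()).digest()
    return hmac.compare_digest(stored, candidate)

set_password("alice", "correct horse")
print(check_login("alice", "correct horse"))    # True
print(check_login("alice", "wrong guess"))      # False
print(checksum8(b"ab") == checksum8(b"ba"))     # True: reordering goes unnoticed
```

The last line shows why such a checksum is not a strong hash: any reordering
of the bytes produces the same H, so collisions are trivial to find.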
Example of how enc works:
1. ciphers share many properties with hash functions (lots of AND, OR, XOR,
   bit shifts and rotates).
2. XOR (eXclusive OR) is a primary useful function

       X  Y  XOR(X,Y)
       0  0  0
       0  1  1
       1  0  1
       1  1  0

   If X and Y are the same, the output is 0; if they differ, the output is 1.

For crypto: if you take any input bit Z and XOR it with a '1', Z's value is
flipped.  So a good cipher can use XOR where the key is a random number:
everywhere there's a '0' in the key, the input bits remain; everywhere
there's a '1' in the key, the input bits flip.  Result: ciphertext looks
nothing like cleartext.

The size of K has to be large enough: if it's too small, the attacker can
brute force all keys, trying to decrypt your data with each one until the
result looks like "text".  They need to know how long your key was, and what
cipher was used.

Many ciphers are called "block" ciphers: meaning they encrypt in units of a
certain size, often 64 bits (AES uses 128-bit blocks).  The cipher breaks the
data into units of that size and encrypts them.

In most ciphers, you have to say what input unit YOU want to encrypt, e.g.,
4KB.  Internally, the cipher will encrypt each small 64-bit chunk, and then
it'll take a previously encrypted 64-bit chunk and add it to the mix when
encrypting the next chunk.

Suppose I broke my input into 64-bit (8B) units and encrypted each unit
separately.  Ciphers are deterministic: given an input X and key K and cipher
C, you'll always get the same output Y.  What it means is that if you have
multiple inputs that are the same, they'd all encrypt to the same output
sequence.  That is a "dictionary problem": when multiple ciphertext chunks
are all the same, it gives the attacker a way to guess your input (e.g.,
English text has certain letters/words that are more frequent).

To prevent these dictionary attacks, encrypt the new chunk of data with some
of the material of the just previously encrypted data:
1. read D1 of size 64 bits
2. enc D1 with key K, you get C1
3. read next 64-bit chunk, call it D2
4. enc (D2 XOR Z) with K -> C2
   Z can be all or part of C1 or D1
5. repeat steps 3-4 until no more data to encrypt.

This is a form of "chaining" in ciphers.  Many ciphers have different modes
of operation that use "chaining" or "feedback" to prevent dictionary attacks.

One disadvantage of these modes is that to decrypt data at the end of a file,
you may have to decrypt all the data that came before it -- because it
depends on it.

Internally, if you give a cipher a chunk of, say, 4KB, it'll break it into
smaller native units (64 bits), and use an internal chaining/feedback mode.
That allows more efficiency, b/c you only have to decrypt 4KB (aligned) units
to get the data you want -- not the whole file.  BUT, different 4KB chunks
with the same content still encrypt to the same ciphertext, so they are
still vulnerable to dictionary attacks (though less so than if you encrypted
one byte at a time).

To prevent dictionary attacks at the level of the input you give, we use an
"Initialization Vector" (IV).  A common way is to use an integer that
increments.  E.g., for the first 4KB of the file set IV=0; for the next 4KB
chunk, use IV=1; next one, IV=2, etc...

The important property to remember is that even if the attacker KNOWS what IV
numbers you've used, it still doesn't help them.  And you have to know what
IV you used to enc a chunk, and pass the same IV for the same chunk when you
decrypt it.

Also important: if you chose your enc/dec unit to be 4KB (or any other
multiple of pagesize), you must use the same unit when decrypting.

Note: symmetric ciphers are just math.  They can't "fail".  Which means: if
you give the wrong IV or ciphertext or key upon decryption, you get no
error, you just won't get the original cleartext back!  There are ways to
combine encryption + integrity techniques together, such that the cipher can
"detect" if it's decrypting with the wrong key: these are called
"authenticated encryption."

I recommended using the AES alg, but there is also a newer, separate cipher
called "ARIA".  More variants of ARIA are available in ubuntu 18, esp.
"Counter mode" (CTR).  CTR mode ensures that your o/p is the same size as
your input.
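The chaining steps and the per-chunk IV described above can be sketched as
follows.  This is a toy under stated assumptions, not a real cipher: the
8-byte block cipher here is a small hash-based Feistel construction invented
for this example (NOT AES or ARIA, and not secure), Z is taken to be the
whole previous ciphertext block, and the IV is simply the chunk number.  All
function names are hypothetical.

```python
import hashlib

BLOCK = 8  # bytes (64 bits), the block size used in the notes

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Toy round function: 4 bytes derived from key, round number, half-block.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:4]

def enc_block(key: bytes, block: bytes) -> bytes:
    """Toy 4-round Feistel cipher on one 8-byte block (illustration only)."""
    left, right = block[:4], block[4:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def dec_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right

def encrypt_unchained(key: bytes, data: bytes) -> bytes:
    # Each block encrypted separately: equal inputs give equal outputs.
    return b"".join(enc_block(key, data[o:o + BLOCK])
                    for o in range(0, len(data), BLOCK))

def encrypt_chained(key: bytes, iv: bytes, data: bytes) -> bytes:
    assert len(data) % BLOCK == 0, "sketch handles whole blocks only"
    prev, out = iv, b""
    for o in range(0, len(data), BLOCK):
        prev = enc_block(key, _xor(data[o:o + BLOCK], prev))  # enc(D XOR Z)
        out += prev
    return out

def decrypt_chained(key: bytes, iv: bytes, data: bytes) -> bytes:
    prev, out = iv, b""
    for o in range(0, len(data), BLOCK):
        cur = data[o:o + BLOCK]
        out += _xor(dec_block(key, cur), prev)
        prev = cur
    return out

def iv_for_chunk(chunk_number: int) -> bytes:
    return chunk_number.to_bytes(BLOCK, "big")  # IV = 0, 1, 2, ...

key = b"example key"
chunk = b"AAAAAAAA" * 4  # four identical 8-byte blocks
plain = encrypt_unchained(key, chunk)
c0 = encrypt_chained(key, iv_for_chunk(0), chunk)
c1 = encrypt_chained(key, iv_for_chunk(1), chunk)
print(plain[:8] == plain[8:16])  # repeated blocks are visible without chaining
print(c0 == c1)                  # same data, different IV
print(decrypt_chained(key, iv_for_chunk(0), c0) == chunk)
```

Decrypting c0 with the wrong IV raises no error; it just returns bytes that
differ from the original chunk.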
Modes other than CTR may round up your file size to the next multiple of 8B
(the cipher block size).  Then you'd have to record how long the orig. file
was and truncate(2) it after decryption.

* asymmetric ciphers

Asymmetric ciphers use one key K1 to enc, and a different key K2 to decrypt.
- User 1 can enc data D with key K1, produce C, send C to user 2.
- User 2 can dec C with K2, and they'll get the original data D.
- this is what public key cryptography is all about (the basis of PKI)
- K1 is often called the "private" key; and K2 is called the "public" key
- user 1 protects K1!  No one else should know it.
- user 1 can freely distribute K2 publicly.
- since K2 is public, anyone can decrypt C; what this proves is that C came
  from the holder of K1.
- asymmetric ciphers can "fail", as they can detect if C was NOT encrypted
  with K1.

Digital signatures:
- same as before, but now user 1 takes data D, hashes it, and produces H.
- user 1 encrypts D+H with K1, and sends the result (C') to user 2
- user 2 decrypts C' with K2, first verifying that decryption worked
- user 2 computes the hash over the part of the decrypted message that is
  the original data D
- user 2 compares the hashes: if they match, D came from user 1 and was not
  modified.

Private 1-to-1 communications.
1. U1 encrypts D with U1's private key K1, and then with U2's (the target
   user's) public key J2, producing C
2. U1 sends C to U2
3. U2 decrypts C with U2's private key J1, then with U1's public key K2
- result: both U1 and U2 can communicate privately and guarantee they know
  who's on the other end.

The most famous public-key algorithm is RSA.  It is used heavily in SSL/TLS
Web site certificates all over the world.

BUT PKI is slow!  So PKI is used only to establish a trust relationship b/t
two entities (e.g., your Web browser connecting to a bank's Web server).
Then the two sides exchange a randomly generated cipher key to be used with a
much faster symmetric cipher.  Use the symmetric cipher to encrypt all
subsequent communications.
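The digital-signature steps above can be sketched with textbook RSA on tiny
numbers.  Everything here is an assumption for illustration: the primes 61
and 53 are far too small for real use, each byte is "encrypted" separately
(which no real system does), SHA256 is the assumed hash, and the names sign
and verify are hypothetical.  Real code should use a vetted crypto library
with proper padding.

```python
import hashlib

# Toy textbook-RSA parameters (illustration only, trivially breakable).
P, Q = 61, 53
N = P * Q          # modulus 3233
PUB_EXP = 17       # K2: the public key exponent
PRIV_EXP = 2753    # K1: the private key exponent, (17 * 2753) % 3120 == 1
HASH_LEN = 32      # SHA256 digest size in bytes

def sign(data: bytes) -> list:
    """User 1: hash D to get H, then encrypt D+H with the private key K1."""
    h = hashlib.sha256(data).digest()
    return [pow(b, PRIV_EXP, N) for b in data + h]

def verify(message: list):
    """User 2: decrypt with the public key K2, then compare hashes.
    Returns the original data, or None if anything does not check out."""
    values = [pow(c, PUB_EXP, N) for c in message]
    if len(values) < HASH_LEN or any(v > 255 for v in values):
        return None  # decryption did not yield bytes: not encrypted with K1
    plain = bytes(values)
    data, received_hash = plain[:-HASH_LEN], plain[-HASH_LEN:]
    if hashlib.sha256(data).digest() != received_hash:
        return None  # data or hash was modified in transit
    return data

signed = sign(b"pay Bob $10")
print(verify(signed))              # b'pay Bob $10'

tampered = list(signed)
tampered[0] = (tampered[0] + 1) % N
print(verify(tampered))            # None
```

Only the holder of PRIV_EXP can produce a message that passes verify, while
anyone who knows PUB_EXP and N can check it.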