Lecture 4: Data Wrangling
Missing Semester IAP 2020
Data Wrangling
A broad term for changing the format and content of data with various tools: the process of going from unstructured data, like a log file, to something useful, like a list of usernames or IP addresses.
SSH tricks
Given an SSH config entry for a server aliased `tsp`, let's grab some raw data:
ssh tsp journalctl | grep ssh | grep "Disconnected from"
From a network-traffic perspective this is wasteful: the command above streams the entire journalctl output from the server, even though the local greps discard most of it. To run the grep filters remotely instead, single-quote the whole pipeline so it executes on the server, then pipe only the filtered result into less:
ssh tsp 'journalctl | grep ssh | grep "Disconnected from"' | less
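To avoid re-fetching the data while we experiment, we can redirect the remotely-filtered output into a local file; this is one way to produce the `ssh.log` file used in the next section:

ssh tsp 'journalctl | grep ssh | grep "Disconnected from"' > ssh.log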
sed stuff
Assuming we have now written the SSH data to a local file `ssh.log`, let's remove everything before "Disconnected from" with `sed`:
cat ssh.log | sed 's/.*Disconnected from//' | less
- `s` is a substitute expression which takes two arguments: a search pattern and a replacement. Here we're saying "find the pattern `.*Disconnected from` and replace it with blank"
- While `sed` is a stream editor, `awk` is a column-based stream processor (see the quick contrast below)
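A minimal, made-up illustration of that difference: `sed` rewrites the text of each line, while `awk` splits each line into whitespace-separated fields and works on the columns.

echo 'hello world' | sed 's/world/there/'   # stream edit: prints "hello there"
echo 'hello world' | awk '{ print $2 }'     # column processing: prints "world"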
RegEx
`.*` in the above sed command is a regular expression
- dot means any single character
- star means zero or more of that character
- i.e. this boils down to: select zero or more of any character preceding the literal string "Disconnected from", and replace it all with blank
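A toy demonstration with a made-up log line: everything up to and including the literal text is removed, leaving only what follows it (note the leading space, since the space after "from" isn't part of the pattern):

echo 'Jan 17 sshd[1234]: Disconnected from invalid user admin' | sed 's/.*Disconnected from//'

- output: ` invalid user admin`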
Examples & Breakdowns
echo 'aba' | sed 's/[ab]//'
- output: `ba`
- Replace (the first occurrence of) any character that is either "a" or "b" with blank
- By default, `sed` applies the matching and replacement once per line

echo 'aba' | sed 's/[ab]//g'

- You can provide the `g` modifier to repeat the substitution as many times as it matches (here the whole line is wiped, leaving an empty line)
echo 'abcaba' | sed -E 's/(ab)*//g'
output: `ca`
- Select zero or more of the string “ab” and replace them with “blank”
- The `-E` flag makes `sed` interpret the pattern as an extended ("modern") regular expression rather than a basic one
- If you wanted to do this with Basic Regular Expressions (BRE), you'd escape the parentheses with backslashes:
echo 'abcaba' | sed 's/\(ab\)*//g'
echo 'abcababc' | sed -E 's/(ab|bc)*//g'
output: cc
- Anything that matches "ab" or "bc" gets replaced with blank
echo 'abcabbc' | sed -E 's/(ab|bc)*//g'
output: c
- Here an "a" has been removed from the input string; now "ab" followed by "bc" matches the whole tail, so only a single "c" remains
cat ssh.log | sed -E 's/^.*Disconnected from (invalid )?user .* [0-9.]+ port [0-9]+$//'
- output: blank lines, since each matching line is replaced with nothing
- `^` matches the beginning of a line
- `$` matches the end of a line
- These two are known as "anchoring" the regular expression (i.e. the regular expression now has to match the complete line)
- `(string )?` matches zero or one occurrence of "string " (here, the optional "invalid " prefix)
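A small sketch (toy input) of what anchoring changes:

echo 'say hello please' | sed -E 's/hello//'     # prints "say  please": the pattern can match anywhere in the line
echo 'say hello please' | sed -E 's/^hello$//'   # prints the line unchanged: the anchored pattern must match the whole line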
ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c
- `sort` will, well, sort its input
- `uniq -c` will collapse consecutive identical lines into a single line, prefixed with a count of the number of occurrences
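A minimal illustration with made-up input; note that `uniq -c` only collapses adjacent duplicates, which is why the `sort` comes first:

printf 'b\na\nb\nb\n' | sort | uniq -c

- output: `1 a` and `3 b` (counts prefixed, whitespace-padded)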
ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10
- `sort -n` will sort in numeric (instead of lexicographic) order
- lexicographic order (also known as lexical or dictionary order) is a generalization of alphabetical order to sequences of ordered symbols
- `-k1,1` means "sort by only the first whitespace-separated column"; the `1,1` arguments say "start at the 1st column and end at the 1st column"
- In general the `,n` part says "sort up to and including the nth field"; the default is the end of the line (sorting by the whole line wouldn't matter much in this case)
- Printing the last 10 lines with `tail -n10` is convenient here since `sort` outputs in ascending order by default, meaning the largest counts end up at the bottom
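A made-up two-line example of why the numeric, key-restricted sort matters:

printf '10 zz\n2 aa\n' | sort          # lexicographic: "10 zz" sorts before "2 aa"
printf '10 zz\n2 aa\n' | sort -nk1,1   # numeric on column 1: "2 aa" comes first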
ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10 | awk '{print $2}' | paste -sd,
- Creating a comma-separated list (a one-line .csv) of the most common usernames used to log in
- `awk` by default splits each line into whitespace-separated columns and lets you perform operations on that data
- `paste` takes a bunch of lines and pastes them into a single line, joined by a given delimiter (a comma in this case, via `-d,`; `-s` makes it serial, i.e. one output line)
- `awk`, being an entire programming language in its own right, can also match regular expressions (see next)
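Toy examples of those two tools in isolation:

echo '  12   root' | awk '{ print $2 }'    # whitespace-split columns: prints "root"
printf 'root\nadmin\ntest\n' | paste -sd,  # joins the lines: prints "root,admin,test"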
| awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
- Print the number of single-use usernames that start with “c” and end with “e”
awk is a programming language
- You can do the same thing as above (replacing the `wc -l` too) entirely in `awk` syntax:

BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }

- `BEGIN` is a pattern that matches the start of the input (and `END` matches the end)
- Now, the per-line block just adds the count from the first field (although it'll always be 1 in this case), and then we print the total out at the end
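To actually run it, you can either paste the program inline as the `awk` argument, or (a sketch, with a made-up file name) save it as `count.awk` and point `awk` at it with `-f`, replacing the `| awk '...' | wc -l` step of the previous pipeline:

| awk -f count.awk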
- ref: https://backreference.org/2010/02/10/idiomatic-awk/
Greedy vs Non-greedy Expressions
- RegEx patterns like dot star (`.*`) are greedy, meaning they will match as much text as they can
- Example:

echo 'Disconnected from invalid user Disconnected from 84.211' | sed 's/.*Disconnected from //'

- output: `84.211`
- Here the "username" itself contains the text "Disconnected from", so the greedy `.*` matches all the way up to the second occurrence, and everything before "84.211" gets replaced with blank
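`sed`'s regular expressions (both basic and extended) have no non-greedy quantifiers; if you really need one, a common workaround is `perl`, whose `.*?` matches as little as possible:

echo 'Disconnected from invalid user Disconnected from 84.211' | perl -pe 's/.*?Disconnected from //'

- output: `invalid user Disconnected from 84.211`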
RegEx Symbol Cheatsheet-ish
- `.` means "any single character" except newline
- `*` zero or more of the preceding match
- `+` one or more of the preceding match
- `[abc]` any one character of `a`, `b`, and `c`
- `(RX1|RX2)` either something that matches `RX1` or `RX2`
- `^` the start of the line
- `$` the end of the line
- adding a suffix of `?` changes the greed of a given expression
RegEx Debugger
- RegEx can get really complicated and layered in a single command; a debugger can be a great way to break down more complicated expressions
- Example: https://regex101.com/r/qqbZqh/2
Capture Groups
- Capture groups are for when you want to capture a value and refer to it later; in the case of our log file, we want to capture the username
- Any parenthesized pattern is technically a capture group, e.g. `(invalid |authenticating )`
- But they won't necessarily do anything until you refer back to them in the replacement
- Example:

sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'

- output: `root`, `ftp_test`, ... (one username per line)
- Above, the second capture group is referred to in the replacement (`\2`), and thus only the captured username is output
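A smaller, self-contained capture-group example (made-up input) that swaps the two captured words by referring to them as `\1` and `\2`:

echo 'hello world' | sed -E 's/(hello) (world)/\2 \1/'

- output: `world hello`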
RegEx Black Hole
- This stuff gets out of hand quickly, especially for matching things like email addresses and URLs:
- https://www.regular-expressions.info/email.html
- https://emailregex.com
- https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression/1917982#1917982
- https://fightingforalostcause.net/content/misc/2006/compare-email-regex.php
- https://mathiasbynens.be/demo/url-regex
Analyzing the Data
| paste -sd+ | bc -l
- Concatenate all the values with a plus sign and pipe them into `bc` for more calculation capabilities
- `bc` (basic calculator) is an arbitrary-precision calculator language that can read expressions from STDIN; `-l` loads its standard math library
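A self-contained sketch of the summation trick with made-up numbers:

printf '1\n2\n3\n' | paste -sd+           # prints "1+2+3"
printf '1\n2\n3\n' | paste -sd+ | bc -l   # prints "6"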
R
ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | awk '{print $1}' | R --no-echo -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
- R is weird, but good at analysis and plotting
- Here we created a vector from the input stream of numbers and ran the `summary()` function on it
gnuplot
ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10 | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
- If you just want simple plotting capabilities, gnuplot is probably easier
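The same plotting recipe works on any two-column "count label" input; a toy sketch (assumes gnuplot is installed and a graphical display is available):

printf '3 alice\n7 bob\n5 carol\n' | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'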
xargs
rustup toolchain list | grep nightly | grep -vE "nightly-x86" | sed 's/-x86.*//' | xargs rustup toolchain uninstall
- If you're looking to install or remove things on your system based on a longer list, `xargs` plus the data wrangling we've seen already can be a powerful combination
- Above we see a command to uninstall old nightly builds of Rust by extracting the old build names and then passing them via `xargs` to the uninstaller
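At its core, `xargs` just turns lines of input into command-line arguments; a harmless toy example that echoes instead of uninstalling anything:

printf 'one\ntwo\nthree\n' | xargs echo deleting   # prints "deleting one two three"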
Wrangling Binary Data
ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 - | convert - -colorspace gray - | gzip | ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'
- Command to use ffmpeg to capture an image from our camera, convert it to grayscale, compress it, send it to a remote machine over SSH, decompress it there, make a copy, and then display it
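The takeaway is that pipes carry arbitrary bytes, not just text; a local sketch of the same round trip without a webcam or a remote machine (compress some random bytes, decompress them, and hex-dump the result):

head -c 16 /dev/urandom | gzip | gzip -d | od -An -tx1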
Resources
- https://regexone.com
- https://regex101.com
- Especially useful for Mac users looking for data sources for the exercises: https://eclecticlight.co/2018/03/21/macos-unified-log-3-finding-your-way/