Preparing data for a database

I previously posted about my new Snow Survey of Great Britain paper. Long overdue, but I've now started working with the dataset as a whole, instead of individual stations. One of the key steps in doing this is moving the data into a database. However, before I could even get to that step I needed to prepare my data and make sure the column headings were consistent.

When I digitised the 60+ years of data at 140+ sites I did so using a spreadsheet, with one tab per year, each column a different station and each row a different day. To make life more difficult, the same stations didn't appear each year and didn't always appear in the same order (some decades alphabetical, some north to south), and they were digitised as I came to them. To ease tidying the columns and importing into a database, I exported each tab as a csv, using the pipe (|) as a delimiter to avoid confusion with commas in column headings. I exported these csvs manually; there were only 60-ish, so it took ~5 mins. Create a backup of these csv files so you don't need to do it again for iterations, or if the following goes very wrong!
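
The backup can be as simple as the following (a minimal sketch, assuming the exported files start with SSGB and you're in the folder containing them):

# Keep pristine copies of the exported csvs alongside the working set
mkdir -p backup
cp SSGB*.csv backup/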

This left me with 60+ csv files which may or may not have had all the right spellings for station names in the column headers. To extract a unique list of column headers from all files I used the following bash commands:

# Navigate to your folder of data files
cd /dir/dir
# Extract first line, sort and send unique values to headers.txt
head -qn 1 SSGB* | awk '{gsub(/\|/,"\n"); print}' | sort | uniq > headers.txt

If you're not familiar with bash, it's a program that runs in the terminal and is native to UNIX-like systems (Linux and Apple's macOS). I think you can get emulators to run terminal commands in MS Windows, but you're better off doing this sort of work on a UNIX-like system.

So to break down the above commands:

  • #: indicates a comment, not code
  • cd: change directory
  • head -qn 1: take the first line of each file (-n 1) without printing the file names (-q)
  • SSGB*: all files that begin with SSGB (* is a wildcard)
  • |: send the output from previous command to the next
  • awk …: replace each | with a new line, \n (the pipe needs escaping as \| because an unescaped | means alternation in awk's regular expressions; the single quotes already protect it from bash)
  • sort: … pretty obvious!
  • uniq: takes the unique values from the sorted list
  • >: writes to a file (nb >> appends to an existing file).
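
To see the pipeline in action, here's a toy run with two hypothetical files (the station names are just examples):

printf 'Date|Achnagoichan|Braemar\n' > SSGB_demo1.csv
printf 'Date|Braemar|Dalwhinnie\n' > SSGB_demo2.csv
head -qn 1 SSGB_demo* | awk '{gsub(/\|/,"\n"); print}' | sort | uniq
# Output:
# Achnagoichan
# Braemar
# Dalwhinnie
# Date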

You need to be doing this day in and day out to stand any chance of remembering all these things; thankfully there are great resources on the web. I picked up the bones of what I needed from the UNIX manual page for head and this forum for awk, as I don't remember them between each time I use these commands!

Next step was a tedious look through the 140+ names to check there weren't misspelt duplicates and that sites with multiple names were all using the same one. To do this I used a spreadsheet and built up a column of fixes against the actual names, looking like:

Original column headings with corrections, as required.
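
One extra check bash can give you for free (a sketch, not part of my original workflow): flag any names in headers.txt that differ only by case, as these are usually typos:

# sort -f folds case so variants sit together; uniq -di prints duplicates ignoring case
sort -f headers.txt | uniq -di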

I then used the auto-filter in the spreadsheet to show those rows where the fixed column wasn't empty. I could then paste these into a new tab to generate the script to rename the incorrect column headings. Prior to this, another handy line of code let me look through all the files to find where a heading cropped up, so I could check the context:

# Remember to cd into your file directory
grep -r "Barra" .
# -r is recursive (all files), . means look here (in current directory)
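
A couple of standard grep flags are handy here too, for example listing just the matching files regardless of case (again with a hypothetical search term):

# -l prints only the matching file names, -i ignores case
grep -rli "barra" .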

The spreadsheet tab for script generation looks like the below (with the csv files in a folder called data). Obviously I could have achieved this with a loop over the old and new values (sketched after the image), although the spreadsheet gave me more customisation.

String replacement script building in a spreadsheet.
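
For comparison, the loop version might look something like this (a minimal sketch; fixes.txt is a hypothetical pipe-delimited file of old and new names, not something I actually used):

# fixes.txt holds lines like: Achnaeoichan|Achnagoichan
while IFS='|' read -r old new; do
    # xargs -r skips sed when grep finds no matches (GNU xargs)
    grep -rl "|$old|" ./data/* | xargs -r sed -i "s#|$old|#|$new|#"
done < fixes.txt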

I saved column C of the above in a text file called name_clean.sh

One line of the script to find and replace the incorrect station names is:

# Remember to cd into your file directory
grep -rl '|Achnaeoichan|' ./data/* | xargs sed -i 's#|Achnaeoichan|#|Achnagoichan|#I'

I can’t exactly remember where this came from… probably a stackoverflow question! Things to note:

  • I've used # as the sed delimiter (instead of the usual /) as it doesn't appear in the file
  • Having seen undergrads not realising they can fill down in a spreadsheet, make sure you don’t type each input name but refer to A2 and B2 like in the image above
  • 'I' at the end of the sed command makes the match case-insensitive (a GNU sed extension)
  • I had to manually adjust some lines as I'd set up the script to only look at whole column names (i.e. bounded by a pipe: |column name|), and this caused a problem for names that appeared at the end of lines with no pipe after them (see the sketch after this list).
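
One way around that edge case is to give sed two expressions, the second anchored to the end of the line with $ (a sketch using the same misspelling as above):

# First expression handles |name|, the second handles |name at a line end
grep -rl 'Achnaeoichan' ./data/* | xargs -r sed -i \
    -e 's#|Achnaeoichan|#|Achnagoichan|#' \
    -e 's#|Achnaeoichan$#|Achnagoichan#'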

To run the script, type the following in the terminal:

# Remember to have your file in your current directory
sh name_clean.sh

I ended up going through a few iterations of these steps (re-running the header extract script on fresh csv files – don’t forget that backup!) to locate names I’d missed, or parts of the script that needed adjusting. All in, this is a much more robust method than doing the work by hand, and very quick to repeat if you make a mistake!
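
Each iteration then boils down to a few lines (a sketch, assuming the backup folder from earlier and the csvs sitting in ./data):

# Restore clean copies, re-apply the fixes, then re-extract the headers
cp backup/SSGB*.csv ./data/
sh name_clean.sh
head -qn 1 ./data/SSGB* | awk '{gsub(/\|/,"\n"); print}' | sort | uniq > headers.txt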
