Importing data from lots of systems with PHP and Regular Expressions

Preamble: I found this article sitting as a draft for several years (2009), so the software release it mentions was years ago! But I figured I'd finally proof read and release the article since the problem it addresses is still a real one.

On Friday I released version 4.1.0 of OPUS which has been in the pipeline for a while. Although the previous version had been working remarkably well, I hadn't realised just how many bugfixes and new features had accumulated in the last months, and as several Universities had just last week installed the older version, I really needed to get the new version out.

One of the features I introduced in the 4.x family, and which seems to have settled and matured nicely in 4.1 was functionality to import student data, but really, it's useful for importing diverse CSV formats into a standard base, where the CSV conversion can have more complex problems than just shuffling columns. Our software really uses a thin layer of web services to interface to our student records system, but for the first few versions we imported data from the CSV files that our system could easily export. This is still the solution other universities will have to use unless and until they want to write their own web services layer. Even back when I needed this for internal use, this functionality was a pain, since the University was liable to change the format without notice, and even had slightly different formats for different lists. This made data validation and interpretation a real pain.

So I hit upon the idea of using Regular Expressions (regexps) to validate the data, naturally enough, but then had an idea that would go further. I thought I would define a "standard" CSV format that OPUS would expect, a very simple one. Then I would, for each input system define three regexps.

  1. A match regular expression that would define what each line of data should look like, and also group the bits we wanted to capture in parentheses.
  2. A replacement regular expression that would be used to remap the captured data into a "standard" line.
  3. An exclude regular expression that would define lines to be explicitly exempt from parsing, particularly to deal with header rows, for example.

Since University's frequently change the content and order of their CSV files and they vary from institution to institution this provides a mechanism to not just validate the input data, but to re-map data from a specified format from any given institution to this "common" format for use within OPUS.

Here is the basic bones of the algorithm.

// This is the pattern OPUS expects at the end of a mapping
$standard_pattern = "/^\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\",\"(.*)\"$/";
while($line = fgets ($fp, 2048))
{
  $line = trim($line);
  // Valid lines must match the normal pattern
  if(!preg_match($csvmap->pattern, $line))
  {
    array_push($rejected_lines, $line);
    continue; // move on
  }
  // and not be excluded
  if((strlen($csvmap->exclude) && preg_match($csvmap->exclude, $line)))
  {
    array_push($excluded_lines, $line);
    continue; // move on
  }
 
  // Ok, do the replacement to change to standard format
  $line = preg_replace($csvmap->pattern, trim($csvmap->replacement), $line);
  // Finally extract data from the standard format to an array as if from SRS
  if(!preg_match($standard_pattern, $line, $matches))
  {
    array_push($mismapped_lines, $line);
    continue; // move on
  }
}

You can now analyse the arrays of bad lines to explore problems. Rejected lines indicate the fundamental regexp might be wrong. Excluded lines should be expected, but not if they contain valid data obviously. Mismapped lines indicate that the conversion (replacement) regexp is likely to be at fault. This functionality has so far, allowed me to easily adapt OPUS to import data from lots of different university systems without fuss. (Although setting up the initial regexps can be a real pain.)

To give an example, I use the following regexp's for the University of Ulster's current programme listing.

// Pattern to match
/^Y([0-9]*),(B[0-9]*),"([A-Za-z'\-]*), ([A-Za-z'\-]*)",(.*),(.*),"(.*)"$/
// Pattern to replace with
"${1}","${2}","","${4}","${3}","${6}","","","${2}"
// Exclusion pattern
/^Year,.*$/

This excludes the top, key row of labels, and reorders some of the columns appropriately. This little trick may be of value to you if you have the same problem.

Follow me!

Leave a Reply

Your email address will not be published. Required fields are marked *