PHP Wizardry Lessons: Parsing CSV

February 17, 2007 at 4:51 am (PHP Spell Book)

The young apprentice asked fearfully the old and wise wizard how to parse a CSV file (plain text format used mostly in spreadsheet apps like MS Excel), because he knew how to use

http://php.net/fgetcsv

but this only worked with files, not with variables already in memory. The old wizard looked him seriously and answered with a reproval – “no wizard that is proud of being such would fear such a trivial thing as parsing a plain text file”. Then he took from his magic hat his old book “the one thousand and one most powerful and dangerous regular expressions”. He went through the pages slowly and ceremonially. Lastly he closed the book and took a deep breath. Holding his breath he wrote the following on the floor (reciting aloud a regular expression would look stupid, wouldn’t it?):

/(?<=^|,)(“(.|\n)*(?<!”)”|.*)(,|$)/Um

The old wizard explain no more, that’s why we’ll look a little more into the mystics behind the power of this regexp.

On execution, this regexp applied sequentially to a string the way

http://php.net/preg_match_all

does, the result will be each one the fields of the CSV. The content of each will be captured by the first parenthesis, but this must be processed first, because the CSV format allows fields to be enclosed by double quotes which can have double quotes enconded by duplicating them (“….”"…”).

To know when a row ends we must check the third parenthesis, which captures the end of the field (,|$). If the captured content is empty, a row end was found (the new line). Note that it is not safe to split first in lines and then parse each line, because there can be fields which hold a line break, using the double quotes to hold this character.

Each part of the expression has an specific purpose:

(?<=^|,) is a back assertion which assures that each regexp run starts where the one before ended, that is a comma or a line break, or the beginning of the string (for the first run).

“(.|\n)*(?<!”)” is the expression that looks for fields that use the double quotes. The can hold line breaks (look at .|\n), and it has a back assertion to prevent that a double quote followed by another double quote is interpreted as the end of the field.

|.* is the expression that capture fields that don’t follow the double quotes format, in case that fails.

(,|$) is the expression that captures the field end, which can be a comma or a line end.

U (ungreedy) and m (multiline) modifiers are necessary to prevent expressions to keep capturing beyond the intented end of it (ungreedy), and so that ^ and $ also match line breaks.

The full content of the spell, that builds the array using the results of preg_match_all can be found at:

http://www.martinalterisio.com.ar/php-spell-book/parse-csv.php

—-

But the old wizard left other less cryptic solutions untold, maybe because of his senililty, maybe because he couldn’t keep up with the latest spells. All those who have reached level 5 of PHP wizardry (PHP version 5.x) have new options in their skills menu. Just copy&paste (skill you should have probably acquired while being an apprentice) the example code for the custom stream wrappers in the PHP manual:

http://www.php.net/stream_wrapper_register

Once the summon of the new stream wrapper has been completed, we can use global variables as files:

$csv = “….”;

$fp = fopen(“var://csv”, “r”);

while ($data = fgetcsv($fp)) {

…etc…

}

For those who lack the copy&paste skill (shame on you) here you can look the stream wrapper code:

http://www.martinalterisio.com.ar/php-spell-book/VariableStream.phps

Post a Comment