开发者

Manipulating data to matrix-like format in PHP

开发者 https://www.devze.com 2023-03-28 03:59 出处:网络
Given input, which shows tag assignments to images, as follows (reading this from php://stdin line by line, as the input can get rather large)

Given input, which shows tag assignments to images, as follows (reading this from php://stdin line by line, as the input can get rather large)

image_a tag_lorem
image_a tag_ipsum
image_a tag_amit
image_b tag_sit
image_b tag_dolor
image_b tag_ipsum
... (there are more lines, may get up to a million)

Output of the input is shown as follows. Basically it is the same format with another entry showing whether the image-tag combination exists in input. Note that for every image, it will list all the available tags and show whether the tag is assigned to the image by using 1/0 at the end of each line.

image_a tag_sit 0
image_a tag_lorem 1开发者_JAVA百科
image_a tag_dolor 0
image_a tag_ipsum 1
image_a tag_amit 1
image_b tag_sit 1
image_b tag_lorem 0
image_b tag_dolor 1
image_b tag_ipsum 1
image_b tag_amit 0
... (more)

I have posted my no-so-efficient solution down there. To give a better picture of input and output, I fed 745 rows (which explains tag assignment of 10 images) into the script via stdin, and I receive 555025 lines after the execution of the script using about 0.4MB of memory. However, it may kill the harddisk faster because of the heavy disk I/O activity (while writing/reading to the temporary column cache file).

Is there any other way of doing this? I have another script that can turn the stdin into something like this (not sure if this is useful)

image_foo tag_lorem tag_ipsum tag_amit
image_bar tag_sit tag_dolor tag_ipsum

p/s: order of tag_* is not important, but it has to be the same for all rows, i.e. this is not what i want (notice the order of tag_* is inconsistent for both tag_a and tag_b)

image_foo tag_lorem 1
image_foo tag_ipsum 1
image_foo tag_dolor 0
image_foo tag_sit 0
image_foo tag_amit 1
image_bar tag_sit 1
image_bar tag_lorem 0
image_bar tag_dolor 1
image_bar tag_ipsum 1
image_bar tag_amit 0

p/s2: I don't know the range of tag_* until i finish reading stdin

p/s3: I don't understand why I get down-voted, if clarification is needed I am more than happy to provide them, I am not trying to make fun of something or posting nonsense here. I have re-written the question again to make it sound more like a real problem (?). However, the script really doesn't have to care about what the input really is or whether database is used (well, the data is retrieved from an RDF data store if you MUST know) because I want the script to be usable for other type of data as long as the input is in right format (hence the original version of this question was very general).

p/s4: I am trying to avoid using array because I want to avoid out of memory error as much as possible (if 745 lines expaining just 10 images will be expanded into 550k lines, just imagine I have 100, 1000, or even 10000+ images).

p/s5: if you have answer in other language feel free to post it here. I have thought of solving this using clojure but still couldn't find a way to do it properly.


Sorry, maby I misunderstood you - this looks too easy:

$stdin = fopen('php://stdin', 'r');
$columns_arr=array();
$rows_arr=array();
function set_empty_vals(&$value,$key,$columns_arr) {
    $value=array_merge($columns_arr,$value);
    ksort($value);
    foreach($value AS $val_name => $flag) {
        echo $key.' '.$val_name.' '.$flag.PHP_EOL;
    }
    $value=NULL;
}
while ($line = fgets($stdin)) {
    $line=trim($line);
    list($row,$column)=explode(' ',$line);
    $row=trim($row);
    $colum=trim($column);
    if(!isset($rows_arr[$row]))
        $rows_arr[$row]=array();
    $rows_arr[$row][$column]=1;
    $columns_arr[$column]=0;
}
array_walk($rows_arr,'set_empty_vals',$columns_arr);

UPD:

1 million lines is easy for php:

$columns_arr = array();
$rows_arr = array();

function set_null_arr(&$value, $key, $columns_arr) {
    $value = array_merge($columns_arr, $value);
    ksort($value);
    foreach($value AS $val_name => $flag) {
        //echo $key.' '.$val_name.' '.$flag.PHP_EOL;
    }
    $value=NULL;
}

for ($i = 0; $i < 100000; $i++) {
    for ($j = 0; $j < 10; $j++) {
        $row='row_foo'.$i;
        $column='column_ipsum'.$j;
        if (!isset($rows_arr[$row]))
            $rows_arr[$row] = array();
        $rows_arr[$row][$column] = 1;
        $columns_arr[$column] = 0;
    }
}
array_walk($rows_arr, 'set_null_arr', $columns_arr);

echo memory_get_peak_usage();

147Mb for me.

Last UPD - this is how I see low memory usage(but rather fast) script:

//Approximate stdin buffer size, 1Mb should be good
define('MY_STDIN_READ_BUFF_LEN', 1048576);
//Approximate tmpfile buffer size, 1Mb should be good
define('MY_TMPFILE_READ_BUFF_LEN', 1048576);
//Custom stdin line delimiter(\r\n, \n, \r etc.)
define('MY_STDIN_LINE_DELIM', PHP_EOL);
//Custom stmfile line delimiter - chose smallset possible
define('MY_TMPFILE_LINE_DELIM', "\n");
//Custom stmfile line delimiter - chose smallset possible
define('MY_OUTPUT_LINE_DELIM', "\n");

function my_output_arr($field_name,$columns_data) {
    ksort($columns_data);
    foreach($columns_data AS $column_name => $column_flag) {
        echo $field_name.' '.$column_name.' '.$column_flag.MY_OUTPUT_LINE_DELIM;
    }
}

$tmpfile=tmpfile() OR die('Can\'t create/open temporary file!');
$buffer_len = 0;
$buffer='';
//I don't think there is a point to save columns array in file -
//it should be small enough to hold in memory.
$columns_array=array();

//Open stdin for reading
$stdin = fopen('php://stdin', 'r') OR die('Failed to open stdin!');

//Main stdin reading and tmp file writing loop
//Using fread + explode + big buffer showed great performance boost
//in comparison with fgets();
while ($read_buffer = fread($stdin, MY_STDIN_READ_BUFF_LEN)) {
    $lines_arr=explode(MY_STDIN_LINE_DELIM,$buffer.$read_buffer);
    $read_buffer='';
    $lines_arr_size=count($lines_arr)-1;
    $buffer=$lines_arr[$lines_arr_size];
    for($i=0;$i<$lines_arr_size;$i++) {
        $line=trim($lines_arr[$i]);
        //There must be a space in each line - we break in it
        if(!strpos($line,' '))
            continue;
        list($row,$column)=explode(' ',$line,2);
        $columns_array[$column]=0;
        //Save line in temporary file
        fwrite($tmpfile,$row.' '.$column.MY_TMPFILE_LINE_DELIM);
    }
}
fseek($tmpfile,0);

$cur_row=NULL;
$row_data=array();
while ($read_buffer = fread($tmpfile, MY_TMPFILE_READ_BUFF_LEN)) {
    $lines_arr=explode(MY_TMPFILE_LINE_DELIM,$buffer.$read_buffer);
    $read_buffer='';
    $lines_arr_size=count($lines_arr)-1;
    $buffer=$lines_arr[$lines_arr_size];
    for($i=0;$i<$lines_arr_size;$i++) {
        list($row,$column)=explode(' ',$lines_arr[$i],2);
        if($row!==$cur_row) {
            //Output array
            if($cur_row!==NULL)
                my_output_arr($cur_row,array_merge($columns_array,$row_data));
            $cur_row=$row;
            $row_data=array();
        }
        $row_data[$column]=1;
    }
}

if(count($row_data)&&$cur_row!==NULL) {
    my_output_arr($cur_row,array_merge($columns_array,$row_data));
}


Here's a MySQL example that works with your supplied test data:

CREATE TABLE `url` (
  `url1` varchar(255) DEFAULT NULL,
  `url2` varchar(255) DEFAULT NULL,
  KEY `url1` (`url1`),
  KEY `url2` (`url2`)
);

INSERT INTO url (url1, url2) VALUES
('image_a', 'tag_lorem'),
('image_a', 'tag_ipsum'),
('image_a', 'tag_amit'),
('image_b', 'tag_sit'),
('image_b', 'tag_dolor'),
('image_b', 'tag_ipsum');

  SELECT url1, url2, assigned FROM (
  SELECT t1.url1, t1.url2, 1 AS assigned
    FROM url t1
   UNION
  SELECT t1.url1, t2.url2, 0 AS assigned
    FROM url t1
    JOIN url t2
      ON t1.url1 != t2.url1
    JOIN url t3
      ON t1.url1 != t3.url1
     AND t1.url2  = t3.url2
     AND t2.url2 != t3.url2 ) tmp
ORDER BY url1, url2;

Result:

+---------+-----------+----------+
| url1    | url2      | assigned |
+---------+-----------+----------+
| image_a | tag_amit  |        1 |
| image_a | tag_dolor |        0 |
| image_a | tag_ipsum |        1 |
| image_a | tag_lorem |        1 |
| image_a | tag_sit   |        0 |
| image_b | tag_amit  |        0 |
| image_b | tag_dolor |        1 |
| image_b | tag_ipsum |        1 |
| image_b | tag_lorem |        0 |
| image_b | tag_sit   |        1 |
+---------+-----------+----------+

This should be simple enough to convert to SQLite, so if required you could use PHP to read the data into a temporary SQLite database, and then extract the results.


Put your input data in array and then sort them by using usort, define comparison function which compares array elements by row values and then column values if row values are equal.


This is my current implementation, I don't like it, but it does the job for now.

#!/usr/bin/env php
<?php

define('CACHE_MATCH', 0);
define('CACHE_COLUMN', 1);

define('INPUT_ROW', 0);
define('INPUT_COLUMN', 1);
define('INPUT_COUNT', 2);

output_expanded_entries(
    cache_input(array(tmpfile(), tmpfile()), STDIN, fgets(STDIN))
);
echo memory_get_peak_usage();

function cache_input(Array $cache_files, $input_pointer, $input) {
    if(count($cache_files) != 2) {
        throw new Exception('$cache_files requires 2 file pointers');
    }

    if(feof($input_pointer) == FALSE) {
        cache_match($cache_files[CACHE_MATCH], trim($input));
        cache_column($cache_files[CACHE_COLUMN], process_line($input));

        cache_input(
            $cache_files,
            $input_pointer,
            fgets($input_pointer)
        );
    }

    return $cache_files;
}

function cache_column($cache_column, $input) {
    if(empty($input) === FALSE) {
        rewind($cache_column);

        $column = get_field($input, INPUT_COLUMN);

        if(column_cached_in_memory($column) === FALSE && column_cached_in_file($cache_column, fgets($cache_column), $column) === FALSE) {
            fputs($cache_column, $column . PHP_EOL);
        }
    }
}

function cache_match($cache_match, $input) {
    if(empty($input) === FALSE) {
        fputs($cache_match, $input . PHP_EOL);
    }
}

function column_cached_in_file($cache_column, $current, $column, $result = FALSE) {
    return $result === FALSE && feof($cache_column) === FALSE ?
        column_cached_in_file($cache_column, fgets($cache_column), $column, $column == $current)
        : $result;
}

function column_cached_in_memory($column) {
    static $local_cache = array(), $index = 0, $count = 500;

    $result = TRUE;

    if(in_array($column, $local_cache) === FALSE) {
        $result = FALSE;

        $local_cache[$index++ % $count] = $column;
    }

    return $result;
}

function output_expanded_entries(Array $cache_files) {
    array_map('rewind', $cache_files);

    for($current_row = NULL, $cache = array(); feof($cache_files[CACHE_MATCH]) === FALSE;) {
        $input = process_line(fgets($cache_files[CACHE_MATCH]));

        if(empty($input) === FALSE) {
            if($current_row !== get_field($input, INPUT_ROW)) {
                output_cache($current_row, $cache);

                $cache = read_columns($cache_files[CACHE_COLUMN]);
                $current_row = get_field($input, INPUT_ROW);
            }

            $cache = array_merge(
                $cache,
                array(get_field($input, INPUT_COLUMN) => get_field($input, INPUT_COUNT))
            );
        }
    }

    output_cache($current_row, $cache);
}

function output_cache($row, $column_count_list) {
    if(count($column_count_list) != 0) {
        printf(
            '%s %s %s%s',
            $row,
            key(array_slice($column_count_list, 0, 1)),
            current(array_slice($column_count_list, 0, 1)),
            PHP_EOL
        );

        output_cache($row, array_slice($column_count_list, 1));
    }
}

function get_field(Array $input, $field) {
    $result = NULL;

    if(in_array($field, array_keys($input))) {
        $result = $input[$field];
    } elseif($field == INPUT_COUNT) {
        $result = 1;
    }

    return $result;
}

function process_line($input) {
    $result = trim($input);

    return empty($result) === FALSE && strpos($result, ' ') !== FALSE ?
        explode(' ', $result)
        : NULL;
}

function push_column($input, Array $result) {
    return empty($input) === FALSE && is_array($input) ?
        array_merge(
            $result,
            array(get_field($input, INPUT_COLUMN))
        )
        : $result;
}

function read_columns($cache_columns) {
    rewind($cache_columns);

    $result = array();

    while(feof($cache_columns) === FALSE) {
        $column = trim(fgets($cache_columns));

        if(empty($column) === FALSE) {
            $result[$column] = 0;
        }
    }

    return $result;
}

EDIT: yesterday's version was bugged :/

0

精彩评论

暂无评论...
验证码 换一张
取 消