Imagine I have the following table available to me:
A: { x: int, y: int, z: int, ...99 other columns... }
I now want to transform this so that z is set to NULL where x > y, with the resulting dataset stored as B, and I want to do it without having to explicitly mention all the other columns, as this becomes a maintenance nightmare.
Is there a simple solution?
This issue is tracked in JIRA as PIG-1693: "There needs to be a way in foreach to indicate 'and all the rest of the fields'".
Currently I don't know of anything simpler than either listing the columns as you describe, or not loading z and then adding a new column z alongside the star expression.
I was able to drop some of the column bloat by nesting them in single-row bags and flattening afterwards.
Still, it feels like a bit of a hack, so I'm also investigating Cascading to see if it's a better fit for my scenario.
A feature to facilitate your scenario was added in Pig 0.9. The new project-range operator (..) allows you to express a range of fields by indicating the starting and/or ending field names as in this example:
result = FOREACH someInput GENERATE field1, field2, null as field3, field4 .. ;
In the example above field1/2/3/4 are actual field names. One of the fields is set to null while the other fields are kept intact.
More details in this "New Apache Pig 0.9 Features – Part 3" article: http://hortonworks.com/blog/new-apache-pig-0-9-features-part-3-additional-features/
To solve your specific problem you probably want to use a FILTER to separate the rows where x > y, null out z for those rows, and then use a UNION to combine the results.
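For the table in the question, the FILTER-plus-UNION idea might look roughly like the sketch below. Treat it as an untested sketch under a few assumptions: `c1` is a hypothetical name standing in for the first of the 99 other columns, the relation names `to_null`, `keep` and `nulled` are made up for illustration, and the `(int)null` cast is only there so that z keeps its int type on both sides of the UNION.

```pig
-- Assumes A has the schema (x:int, y:int, z:int, c1, ...), where c1 is a
-- hypothetical name for the first of the remaining columns.

-- Rows whose z should be nulled.
to_null = FILTER A BY x > y;

-- Everything else. Rows where x or y is null are tested explicitly,
-- because x > y evaluates to null for them and a plain negation of the
-- first filter would drop those rows.
keep = FILTER A BY (x is null) OR (y is null) OR (x <= y);

-- Replace z with a typed null and carry the remaining columns along with
-- the project-range operator (requires Pig 0.9 or later).
nulled = FOREACH to_null GENERATE x, y, (int)null AS z, c1 ..;

-- Combine both halves again; UNION does not preserve row order.
B = UNION nulled, keep;
```

Only x, y, z and the first of the other columns are named here, so adding or removing columns after `c1` should not require touching the script.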
Of course you can select columns by column number, but that can easily become a nightmare if you change anything at all. I have found column names to be much more stable, and therefore I recommend the following solution:
Update mycol when it is between two known columns
You can use .. to indicate leading or trailing columns (or the columns in between). Here is how that would work out if you want to change the value of 'MyCol' to 'updatedvalue'.
aliasAfter = FOREACH aliasBefore GENERATE
    .. colBeforeMyCol, 'updatedvalue' AS MyCol, colAfterMyCol ..;