Hive: converting a comma separated string to array

2020-05-29 06:42发布


I am creating a Hive table on Amazon's Elastic MapReduce by using a gzipped JSON encoded file. I am using this JSON SerDe:

The unencoded file looks like this:

{"id":"101", "items":"A:231,234,119,12"}

{"id":"102", "items":"B:13,89,121"}


I'd like to create an array of the "items" column for user with a table generating function. The array I want would be the "exploded" CSV of ints ignoring the ":" and the letter before it. I want to be able to GROUP BY the item ints (231,234,etc.) or JOIN them against other tables. Changing around the format of this file would be difficult and all my old data is stored in this format so I'd like to solve this in Hive without making changes to how to store my data.

Is there a way for me to do this?


Thanks @mark-grover, this worked!

I did something like this:

SELECT id, item FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item

to get:

101 231

101 234

101 119

101  12

102  13

102  89

102 121

In doing this I found out I'd like to do a little more. I'd like to also select the position that each item was in in the CSV. So I'd like the output to look like:

101 231 0

101 234 1

101 119 2

101  12 3

102  13 1

102  89 2

102 121 3

Any ideas here?

Edit again:

I came up with something for my followup question doing something like this:

SELECT id, item, find_in_set(item, substr(items, 3, length(items) - 2)) AS position
FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item


You can do that use Hive UDFs. You would want to use the split UDF to split the string on commas to get an array and then use Lateral view along with explode UDTF to "explode" the contents of the array.

To answer your second question: This is what I can think of right now. There might be a better/easier/less hackier way to do it.

Create a UDF say split_extended that would return as a struct with 2 elements, first one being the actual contents of the array element, second one being the index. When you explode this the type of exploded column will be a struct. One of the entries of this struct would the actual element from the array, the other would be the index corresponding to the element.

If you think about it, we are augmenting split UDF to return 2 pieces of information - the element and the index. These are several ways of going about returning such information as a single object - structure is one of them, array being another (first element of the array being the actual element from the array, second element being the index). Another hackier way to return 2 pieces of information is to concat them with something (say a ':') that you know doesn't exist in your data so then you can have your UDF return '231:0', '234:0', '119:2', etc. and in the end instead of accessing them members of a struct, you can split the resulting string out based on ':' (using split UDF) to get the desired two pieces.

For creating your own UDF, you will be able to leverage most, if not all of split UDF code available here.