Remove last pipe-delimited value from dataframe column

Published 2019-08-23 05:51

I am using Spark 2.1 and have a dataframe column containing values like AB|12|XY|4. I want to create a new column by removing the last element, so it should read AB|12|XY.

I tried split and rsplit, but neither worked, so I need some suggestions for getting the desired output.

1 Answer
女痞
#2 · 2019-08-23 06:10

Use the Spark SQL split function as follows:

>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+                   
|                c1|                  
+------------------+                  
|        AB|12|XY|4|                  
|11|22|33|44|remove|                  
+------------------+                  

>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern; r'\|\w+$' matches the trailing "|token"
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+ 
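The regex used above can be sanity-checked outside of Spark with Python's own `re` module. The sketch below (plain Python, not Spark code, with a hypothetical helper name `strip_last`) illustrates what the pattern removes:

```python
import re

def strip_last(value):
    # Same pattern as in the split example: a pipe followed by one or more
    # word characters, anchored to the end of the string.
    return re.sub(r'\|\w+$', '', value)

print(strip_last('AB|12|XY|4'))          # AB|12|XY
print(strip_last('11|22|33|44|remove'))  # 11|22|33|44
```

Because the pattern is anchored with `$`, only the final segment is removed; earlier pipes are untouched.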

If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(s):  # avoid shadowing the built-in str
...     return s.rsplit('|', 1)[0]
...
>>> my_udf    = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))

>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+ 
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
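The UDF body itself is plain Python, so its behavior can be checked without a Spark session. One edge case worth noting: a value with no pipe at all is returned unchanged, because `rsplit` then produces a single-element list:

```python
def my_func(s):
    # rsplit with maxsplit=1 splits once from the right; [0] keeps
    # everything before the last pipe.
    return s.rsplit('|', 1)[0]

print(my_func('AB|12|XY|4'))    # AB|12|XY
print(my_func('no_pipe_here'))  # no_pipe_here (nothing to split on)
```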

Built-in SQL functions are generally preferred because your data stays inside the JVM; with a Python UDF, every row gets serialized back and forth between the JVM process and the Python process.
