Remove last pipe-delimited value from dataframe column

Published 2019-08-23 05:51

I am using Spark 2.1 and have a dataframe column containing values like AB|12|XY|4. I want to create a new column by removing the last element, so it should read AB|12|XY.

I tried split and rsplit, but neither worked, so I need some suggestions for getting the desired output.

1 Answer
女痞
#2 · 2019-08-23 06:10

Use the Spark SQL split function as follows:

>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+                   
|                c1|                  
+------------------+                  
|        AB|12|XY|4|                  
|11|22|33|44|remove|                  
+------------------+                  

>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern; r'\|\w+$' matches the trailing "|token"
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+ 
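The regex used above can be sanity-checked outside of Spark with Python's own `re` module. The sketch below (plain Python, not Spark code, with a hypothetical helper name `strip_last`) illustrates what the pattern removes:

```python
import re

def strip_last(value):
    # Same pattern as in the split example: a pipe followed by one or more
    # word characters, anchored to the end of the string.
    return re.sub(r'\|\w+$', '', value)

print(strip_last('AB|12|XY|4'))          # AB|12|XY
print(strip_last('11|22|33|44|remove'))  # 11|22|33|44
```

Because the pattern is anchored with `$`, only the final segment is removed; earlier pipes are untouched.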

If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(s):  # avoid shadowing the built-in str
...     return s.rsplit('|', 1)[0]
...
>>> my_udf    = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))

>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+ 
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
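The UDF body itself is plain Python, so its behavior can be checked without a Spark session. One edge case worth noting: a value with no pipe at all is returned unchanged, because `rsplit` then produces a single-element list:

```python
def my_func(s):
    # rsplit with maxsplit=1 splits once from the right; [0] keeps
    # everything before the last pipe.
    return s.rsplit('|', 1)[0]

print(my_func('AB|12|XY|4'))    # AB|12|XY
print(my_func('no_pipe_here'))  # no_pipe_here (nothing to split on)
```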

Built-in SQL functions are generally preferred because your data stays inside the JVM; with a Python UDF, every row gets serialized back and forth between the JVM process and the Python process.
