Efficient way to search for invalid characters in

Posted 2020-07-28 08:03

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan the whole post for invalid characters. What I have so far is below, but it does not work correctly and I don't think the approach is very efficient.

def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    for word in words:
        if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))
    return topic_message

Thanks for any help.

9 answers
做自己的国王
#2 · 2020-07-28 08:10

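re.match and re.search behave differently: re.match only matches at the beginning of the string, while re.search scans the whole string for a match. You also don't need to split the message into words to search it with a regular expression.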

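import re

# Character class matching any one forbidden character.
# No leading ^ (that would negate the class); the backslash and both brackets are escaped.
symbols_re = re.compile(r"[<>/\\{}\[\]~`]")

if symbols_re.search(self.cleaned_data['topic_message']):
    raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\\{}[]~`'))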
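To make the difference between the two calls concrete, here is a minimal standalone sketch outside of Django. The names FORBIDDEN_RE and has_forbidden are made up for illustration, and the character set is the one listed in the question's error message.

```python
import re

# Hypothetical names; the class escapes the backslash and both brackets.
FORBIDDEN_RE = re.compile(r"[<>/\\{}\[\]~`]")

def has_forbidden(text):
    """Return True if any forbidden character occurs anywhere in text."""
    return FORBIDDEN_RE.search(text) is not None

message = "hello <b>world</b>"
# match() only tries at position 0, where 'h' is not a forbidden character.
print(FORBIDDEN_RE.match(message))   # None
# search() scans the whole string and finds the first '<'.
print(has_forbidden(message))        # True
print(has_forbidden("plain text"))   # False
```

Compiling the pattern once at module level means each validation call only pays for a single scan of the message.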
等我变得足够好
#3 · 2020-07-28 08:10

I can't say what would be more efficient, but you should certainly get rid of the $ (unless it is itself an invalid character for the message). Right now the pattern only matches if the character is at the end of topic_message, because $ anchors the match to the end of the string.
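A small sketch of what the anchor does, using a deliberately reduced character class ([<>] only) so the effect is easy to see. The variable names are made up for this example.

```python
import re

anchored = re.compile(r"[<>]$")    # forbidden character must be the last one
unanchored = re.compile(r"[<>]")   # forbidden character may be anywhere

text = "a < b"
print(anchored.search(text))                 # None: '<' is not at the end
print(unanchored.search(text) is not None)   # True
print(anchored.search("a <") is not None)    # True: '<' is the final character
```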

唯我独甜
#4 · 2020-07-28 08:11

In any case you need to scan the entire message, so wouldn't something simple like this work?

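def checkMessage(topic_message):
    # Return False as soon as a forbidden character is found.
    for char in topic_message:
        if char in "<>/\\{}[]~`":  # backslash escaped so the literal is valid
            return False
    return True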
闹够了就滚
#5 · 2020-07-28 08:20

Example: just tailor to your needs.

# Valid characters: 0-9, a-z, A-Z only
import re
REGEX_FOR_INVALID_CHARS = re.compile(r'[^0-9a-zA-Z]+')
list_of_invalid_chars_found = REGEX_FOR_INVALID_CHARS.findall(topic_message)
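The same findall idea can be pointed at the question's blacklist instead of a whitelist, which makes it possible to tell the user exactly which characters were rejected. This is only a sketch; FORBIDDEN_RE and find_forbidden are hypothetical names, and the character set is taken from the question's error message.

```python
import re

# Blacklist from the question; the backslash and both brackets are escaped.
FORBIDDEN_RE = re.compile(r"[<>/\\{}\[\]~`]")

def find_forbidden(text):
    """Return the distinct forbidden characters in text, in order of first appearance."""
    found = []
    for char in FORBIDDEN_RE.findall(text):
        if char not in found:
            found.append(char)
    return found

print(find_forbidden("<b>bold</b> and {x}"))  # ['<', '>', '/', '{', '}']
print(find_forbidden("nothing wrong here"))   # []
```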
Bombasti
#6 · 2020-07-28 08:22

I agree with gnibbler, regex is overkill for this situation. After removing these unwanted characters you'll probably want to remove unwanted words as well; here's a basic way to do it:

def remove_bad_words(title):
    '''Helper to remove bad words from a sentence based on a collection of words.'''
    # BAD_WORDS is a list of unwanted words.
    # Build a new list instead of removing items while iterating,
    # which would skip the word after each removal.
    word_list = [word for word in title.split(' ') if word not in BAD_WORDS]
    # Build the string again.
    return u' '.join(word_list)
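One thing to watch for when filtering a word list: calling remove() on a list while looping over that same list skips the item that follows each removal. The sketch below contrasts the two approaches; BAD_WORDS and both function names are made up for illustration.

```python
BAD_WORDS = {"spam", "junk"}  # hypothetical word list

def remove_in_place(words):
    """Buggy variant: mutates the list it is iterating over."""
    words = list(words)
    for word in words:
        if word in BAD_WORDS:
            words.remove(word)  # later items shift left, so the next one is skipped
    return words

def filter_copy(words):
    """Safe variant: builds a new list and leaves the input untouched."""
    return [word for word in words if word not in BAD_WORDS]

sample = ["spam", "junk", "ok"]
print(remove_in_place(sample))  # ['junk', 'ok']  ("junk" survives)
print(filter_copy(sample))      # ['ok']
```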
神经病院院长
#7 · 2020-07-28 08:24

You have to be much more careful when using regular expressions - they are full of traps.

In the case of [^<>/\{}[]~] the first ] closes the character class, which is probably not what you intended. If you want to use ] in a character class, it has to be the first character after the [ (or after the ^ in a negated class), or be escaped as \], e.g. []^<>/\{}[~]

A simple test confirms this:

>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>

Regex is overkill for this problem anyway:

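def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    # Only the characters listed in the question; ^ and $ there were regex syntax.
    invalid_chars = '<>/\\{}[]~`'
    if topic_message == "":
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    # A non-empty intersection means at least one forbidden character is present.
    if set(invalid_chars).intersection(topic_message):
        # Interpolate after translation so the lookup uses the constant string.
        raise forms.ValidationError(_(u'Topic message cannot contain the following: %s') % invalid_chars)
    return topic_message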
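The set-intersection check can also be tried on its own, outside of a Django form. A minimal sketch, assuming the forbidden set from the question's error message; INVALID_CHARS and invalid_chars_in are hypothetical names.

```python
# Built once; frozenset membership tests take constant time on average.
INVALID_CHARS = frozenset("<>/\\{}[]~`")

def invalid_chars_in(text):
    """Return the sorted list of forbidden characters present in text."""
    # intersection() accepts any iterable, so the string is scanned once.
    return sorted(INVALID_CHARS.intersection(text))

print(invalid_chars_in("see [docs] ~ here"))  # ['[', ']', '~']
print(invalid_chars_in("all good"))           # []
```

Returning the offending characters rather than a bare boolean lets the validation message name exactly what needs to be removed.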