Efficient way to search for invalid characters in

2020-07-28 08:03发布

I am building a forum application in Django and I want to make sure that users dont enter certain characters in their forum posts. I need an efficient way to scan their whole post to check for the invalid characters. What I have so far is the following although it does not work correctly and I do not think the idea is very efficient.

def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    ***for word in words:
        if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))***
    return topic_message

Thanks for any help.

9条回答
看我几分像从前
2楼-- · 2020-07-28 08:27

For a regex solution, there are two ways to go here:

  1. Find one invalid char anywhere in the string.
  2. Validate every char in the string.

Here is a script that implements both:

import re
topic_message = 'This topic is a-ok'

# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]");
if re1.search(topic_message):
    print ("RE1: Invalid char detected.")
else:
    print ("RE1: No invalid char detected.")

# Option 2: Validate all chars in string.
re2 =  re.compile(r"^[^<>/{}[\]~`]*$");
if re2.match(topic_message):
    print ("RE2: All chars are valid.")
else:
    print ("RE2: Not all chars are valid.")

Take your pick.

Note: the original regex erroneously has a right square bracket in the character class which needs to be escaped.

Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious to find out which of these methods would actually be fastest, so I decided to measure them. Here are the benchmark data and statements measured and the timeit result values:

Test data:

r"""
TEST topic_message STRINGS:
ok:  'This topic is A-ok.     This topic is     A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'

MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""

Results:

r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method  Ok-time  Bad-time
1        1.054    1.190
2        1.830    1.636
3        4.364    4.577
"""

The benchmark tests show that Option 1 is slightly faster than option 2 and both are much faster than the set().intersection() method. This is true for strings which both match and don't match.

查看更多
霸刀☆藐视天下
3楼-- · 2020-07-28 08:32

is_valid = not any(k in text for k in '<>/{}[]~`')

查看更多
小情绪 Triste *
4楼-- · 2020-07-28 08:33

If efficiency is a major concern I would re.compile() the re string, since you're going to use the same regex many times.

查看更多
登录 后发表回答