
01.01.01 - Regular expressions and simple search patterns

Chapter 1

This workshop is intended for beginners in regular expressions. It is a shortened and revised version of a tutorial that I keep on my website for users of TheBat!, and it does not replace that tutorial. This workshop is written to teach only the syntax, regardless of where you want to use the regexes later. I have tried to adapt the examples and exercises to common problems. However, the reader will find that a large number of the example regexes come from the world of mail. Nevertheless, I hope that the workshop is suitable as a general introduction to the fascinating world of "regexes".

1.1. What exactly does "regular expression" mean

Regexes can be found in many UNIX tools, in programming languages such as Perl (Practical Extraction and Report Language), PHP and JavaScript, and in various editors such as UltraEdit or jEdit. My Perl book says that the term seems nonsensical at first glance (to me, at second glance too), since these are not really expressions and it is hard to explain what is actually "regular" about them. Let's just accept that the term "regular expressions" comes from formal algebra, and that regexes are in fact part of mathematics.

Perhaps this is the right place to point out that, as with any language, there are also regex dialects. I concentrate on the Perl-style dialect known as PCRE (Perl Compatible Regular Expressions). Other tools, such as GNU grep, use different dialects. The basic structure is the same, so this tutorial still works as an introduction for everyone, but the details can differ. Please check beforehand whether your tool uses PCRE. In chapter 5.4 I mention this difference explicitly for the options and modifiers.

While we're at it: I'm only describing the regex itself here, not the syntax with which a regex is used in a particular language. That would go beyond the scope of this tutorial, and the usage in each language is certainly described better in its own tutorials and manuals than I ever could 🙂

The easiest way to describe regular expressions is probably as search patterns for "pattern matching". Anyone who has looked for files at DOS level or in Explorer has already used such search patterns.

There, patterns consisting of asterisks and question marks are used to narrow down the selection of files. One such pattern lists all files with the extension .doc. Another one copies only files with a three-letter extension and a t as the last letter.
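The wildcard idea can be tried out in Python with the standard `fnmatch` module. This is only a sketch: the file names are invented, and the patterns `*.doc` and `*.??t` are my reading of the two descriptions above, not the original commands.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for illustration.
files = ["letter.doc", "notes.txt", "report.dot", "photo.jpeg", "readme"]

# All files with the extension .doc
doc_files = [name for name in files if fnmatchcase(name, "*.doc")]

# Files whose extension has three characters and ends in t
t_files = [name for name in files if fnmatchcase(name, "*.??t")]

print(doc_files)
print(t_files)
```

Note that `?` stands for any single character, not only for a letter, which already hints at how coarse these placeholders are.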

But these wildcard patterns are pure placeholders and are trivial. They are nowhere near as powerful as regexes, which - as we will soon see - are more than just placeholders for characters.

Chapter 2

In order to be able to give examples of regexes in the following, we have to agree on a form of representation. I'm going to delimit the regexes with quotes and format them as code, like "this". If you want to try them out, use only the text between the quotation marks. Try them out? Yes, you can test the patterns with a helper tool: please download the Regex Coach for this course if you want to test the regexes.

To do this, go to the Regex Coach and install it. For operating instructions, please refer to the product itself. Users of the editors Weaverslave or UltraEdit can use plugins or built-in functions to apply regexes in the editor. Caution: some of these programs have their own syntax. Please read the help beforehand!

2.1. Simple familiar characters

Let's start with simple search patterns: "this or that"

Yes, that's a regex: it finds the character sequence 'this or that' in a text, and exactly that. No, the 'or' does not mean that either 'this' or 'that' is found; only the whole string between the quotation marks is.

Regexes are stubborn: they look for exactly what you ask them to look for. They are case sensitive, and they don't care about word boundaries unless told to. The pattern above is also found in 'The Parathis or that Woman'.
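The stubbornness can be checked with a few lines of Python. This is a minimal sketch using the standard `re` module; Python's dialect is not PCRE, but I assume it behaves the same for plain literal text like this.

```python
import re

pattern = "this or that"

# Found as a plain substring, even in the middle of another word.
hit = re.search(pattern, "The Parathis or that Woman")
print(hit.group())

# Case matters: a capital 'T' is a different character.
miss = re.search(pattern, "This or that")
print(miss)

# 'or' is not an alternative: 'this' on its own does not match.
alone = re.search(pattern, "only this")
print(alone)
```

`re.search` returns a match object when the pattern is found anywhere in the text, and `None` otherwise.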

2.2. Search for metacharacters

A regex can be used to search for any character - alphanumeric, hexadecimal, binary, etc. A small but important exception are the characters that the regex itself uses as special characters, the metacharacters. These metacharacters are: * + ? . ( ) [ ] { } \ / | ^ $

(Hello experts: you are right, I cheated a bit. Not all of them are actually metacharacters, but let's just assume they are. I'll show later why I prefer this definition of metacharacters.)

We will get to know their meaning in the course of the workshop, so more on that later. For now, just this much: if you want to search for these characters in their original sense, i.e. literally, you have to make this clear to the regex. The metacharacter must be preceded by an escape character: the backslash \

If you are looking for a question mark, the regex must be "\?". If you are looking for the slash, it must be "\/". And even if it looks strange: if you are looking for a backslash, you have to enter two of them: "\\"
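Here is how the escaped patterns look in Python, as a small sketch. Two assumptions of mine: Python itself does not require the slash to be escaped (the backslash in front of it is simply harmless there), and the sample strings are made up.

```python
import re

# Raw strings (r"...") keep Python from interpreting the backslash itself.
question = re.search(r"\?", "Really?")
slash = re.search(r"\/", "and/or")
backslash = re.search(r"\\", r"C:\temp")

print(question.group(), slash.group(), backslash.group())

# re.escape adds the needed backslashes when a text is meant literally.
escaped = re.escape("1+1=2?")
print(bool(re.fullmatch(escaped, "1+1=2?")))
```

`re.escape` is handy when the text to search for comes from somewhere else and may contain metacharacters you did not expect.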

2.3. Simple unknown characters

The first metacharacter is the dot ".". It stands for exactly one character, regardless of which character that is. (Well, a couple more experts who know better *g*? Let's get to the exceptions later, ok?)

"M.ier" thus finds 'Maier', 'Meier' and the 'Maier' in 'Maiering' (the 'ing' is no longer covered by the pattern), but not 'Manners'. "H..s" finds both 'Hans' and 'Haus', but it will not find 'Hase' (German for hare). In the German word 'Hirse' (millet), everything except the final 'e' is found; the pattern matches exactly 'Hirs'.

Later we will see that additional metacharacters let you search for more than one unknown character without writing a "." for each of them.
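The dot examples can be replayed in Python. A small sketch; `first_match` is a helper I made up for this illustration, and 'Hirse' is the German word for millet.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern is not found."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match("M.ier", "Meier"))     # Meier
print(first_match("M.ier", "Maiering"))  # Maier
print(first_match("M.ier", "Manners"))   # None
print(first_match("H..s", "Haus"))       # Haus
print(first_match("H..s", "Hirse"))      # Hirs
```

The match is always exactly as long as the pattern: every "." consumes one character, no more and no less.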

2.4. Character groups and classes

Another powerful tool is the set of metacharacters for groups of characters. There are several options here. Let's start with the simple ones:

"\d" stands for a digit. So "\d\d" searches for two consecutive digits.

"\w" stands for any letter, digit or underscore (word character), also called alphanumeric characters.

In this way, more complex search patterns can be built up.

"Re \[\d\]:" searches a text for the character string 'Re' followed by a space, an opening square bracket, any single digit and a closing square bracket, with a colon at the end.

Regexes also offer the counterparts to the two above: "\D" and "\W" (non-digit and non-word).

Here \W stands for any non-alphanumeric character and \D for any character that is not a digit.
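A short Python sketch shows the four shorthands side by side. One assumption to be aware of: for ordinary strings, Python's `\d` and `\w` also accept non-ASCII digits and letters, so I pass `re.ASCII` to get the narrow meaning used in this workshop. The sample text is made up.

```python
import re

text = "Order_7 costs 42 EUR!"

digits = re.findall(r"\d", text, flags=re.ASCII)        # every single digit
two_digits = re.findall(r"\d\d", text, flags=re.ASCII)  # two digits in a row
non_word = re.findall(r"\W", text, flags=re.ASCII)      # neither letter, digit nor _
first_three = re.match(r"\D\D\D", text, flags=re.ASCII).group()  # three non-digits

print(digits)
print(two_digits)
print(non_word)
print(first_three)
```

The underscore in 'Order_7' does not show up among the `\W` hits, because it counts as a word character.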

Another elegant way to define groups of characters is to use [] for character classes. Square brackets always stand for exactly one character, regardless of how many characters are listed inside them: "[AEX]" matches a single character, which must be A, E or X.

If you want to specify whole ranges, you don't have to list all elements individually; you can connect the first and last element with a hyphen: "[e-z]" matches any one letter from e to z.
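Both class forms can be tried out quickly in Python; the sample words are my own.

```python
import re

# Each hit is exactly one character out of the listed set.
listed = re.findall("[AEX]", "EXAMPLE AXE")

# A range is case sensitive: the capital 'R' is outside e-z.
ranged = re.findall("[e-z]", "Regex")

print(listed)
print(ranged)
```

`re.findall` returns every non-overlapping hit, which makes it easy to see that a class never matches more than one character at a time.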

A very powerful method: with "[0-3][0-9]\.[0-1][0-9]\." only dates in the format DD.MM. are found. Other number combinations that cannot be a date, such as 47.35., are not found. (Yes, attentive readers have noticed that my regex above also finds 39.19., which is definitely not an earthly date. We'll come to that later; we are still missing a few tools for it…)

It is also very practical that you can negate the search criterion in one fell swoop, according to the motto: "Find any character unless it is 1, 2, 3 or 4!" The regex is: "[^1-4]". The negation is done with a ^. Careful, we should remember that, because we will see later that ^ has a completely different meaning when it is not inside square brackets.
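The date pattern and its weak spot can be checked in Python. A minimal sketch; `fullmatch` is my choice here so that the whole candidate has to fit the pattern.

```python
import re

day_month = re.compile(r"[0-3][0-9]\.[0-1][0-9]\.")

candidates = ["24.12.", "47.35.", "39.19."]
results = {c: bool(day_month.fullmatch(c)) for c in candidates}

for candidate, found in results.items():
    print(candidate, found)
```

'47.35.' is rejected because 4 is not in [0-3] and 3 is not in [0-1], while '39.19.' slips through: each position is checked on its own, not the number as a whole.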
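A negated class in Python, as a small sketch with made-up input:

```python
import re

# Every character that is NOT one of 1, 2, 3 or 4.
not_1_to_4 = re.findall("[^1-4]", "0123456")
print(not_1_to_4)

# The negation covers everything else, not only other digits.
also_letters = re.findall("[^1-4]", "a1b")
print(also_letters)
```

The second call is a reminder that "not 1 to 4" also includes letters, spaces and punctuation.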

2.5. Overview of this chapter

In this chapter we got to know simple search patterns:

  • Strings entered directly are searched for as such. "er" searches for the letters e and r in sequence. A distinction is made between upper and lower case.
  • Regexes use metacharacters that can only be searched for literally if you put a backslash in front of them: * + ? . ( ) [ ] { } \ / | ^ $
  • The dot "." is used to search for any unknown character. If you are looking for the dot itself, put a backslash in front of it: "\."
  • Regexes use groups of characters such as
    • \d for digits ([0-9])
    • \D for non-digits ([^0-9])
    • \w for alphanumeric characters ([a-zA-Z0-9_])
    • \W for non-alphanumeric characters ([^a-zA-Z0-9_])
  • Character classes are defined in square brackets, e.g. "[A-Z]". A class can be negated by using a ^ as the first character inside the square brackets.

Exercises

What do the following regexes find?

Solution for the first case: two digits, followed by a backslash and only then the dot, which means that the dot is to be taken literally, i.e. as a dot and not as any character. Two more digits with a dot follow, and then a four-digit number: a date in the form DD.MM.YYYY.
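To try this solution out, here is a Python sketch. The exercise pattern itself is not shown above, so the regex below is reconstructed from the description and should be read as an assumption.

```python
import re

# Reconstructed from the solution text, not the original exercise pattern.
full_date = re.compile(r"\d\d\.\d\d\.\d\d\d\d")

is_date = bool(full_date.fullmatch("19.02.2002"))
wrong_separator = bool(full_date.fullmatch("19-02-2002"))
short_year = bool(full_date.fullmatch("19.02.02"))

print(is_date, wrong_separator, short_year)
```

Because the dots are escaped, a hyphen in their place is rejected; an unescaped "." would have accepted it.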

The second case searches for three alphanumeric characters with a comma, a space, two digits, again three alphanumeric characters with a space and finally a four-digit number. This also looks like a date, but in the Anglo-American notation: Tue, 19 Feb 2002. Unfortunately, this regex is not optimal: it only recognizes dates with two-digit day numbers. We'll see a little later how to modify the regex to match both single and double digit day numbers.
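The weakness of the second pattern is easy to demonstrate in Python. Again, the regex is reconstructed from the description, read literally (day before month), and I assume a single space between the day and the month; the mail header lines are invented.

```python
import re

# Reconstructed from the solution text; the spacing is an assumption.
mail_date = re.compile(r"\w\w\w, \d\d \w\w\w \d\d\d\d")

two_digit_day = bool(mail_date.search("Date: Tue, 19 Feb 2002 10:15:00"))
one_digit_day = bool(mail_date.search("Date: Tue, 5 Feb 2002 10:15:00"))

print(two_digit_day, one_digit_day)
```

The second line is a perfectly valid date, but with only one digit for the day the fixed "\d\d" has nothing to match.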

The third case searches for any two characters followed by a space and an opening square bracket. The next square bracket is not preceded by a backslash, so a character class begins here. All digits from 0 to 9 are allowed in this class. The digit must be followed by a closing square bracket and a colon. For example: 'Re [2]:'
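A Python sketch for the third case. The pattern is my reconstruction from the description (with a literal space after the two dots), and the subject prefixes are made-up examples.

```python
import re

# Escaped brackets are literal; the unescaped pair forms a character class.
reply_prefix = re.compile(r".. \[[0-9]\]:")

english = bool(reply_prefix.fullmatch("Re [2]:"))
german = bool(reply_prefix.fullmatch("AW [7]:"))
no_digit = bool(reply_prefix.fullmatch("Re [x]:"))

print(english, german, no_digit)
```

Since the first two positions are dots, any two-character prefix is accepted, not only 'Re'.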

In the last case the regex should find only a single character, which must be a lowercase or uppercase letter. Why is "\w" not used here? Well, that would also include digits and the underscore, which might be undesirable.
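The difference between a pure letter class and "\w" can be seen in Python. The class "[a-zA-Z]" is my reconstruction of the exercise pattern from the description; `re.ASCII` keeps `\w` to the narrow [a-zA-Z0-9_] meaning used in this workshop.

```python
import re

sample = "a_Z9"

letters_only = re.findall("[a-zA-Z]", sample)
word_chars = re.findall(r"\w", sample, flags=re.ASCII)

print(letters_only)
print(word_chars)
```

The explicit class drops both the underscore and the digit, which "\w" would let through.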
