What is a word boundary in regexes?























I am using Java regexes in Java 1.6 (inter alia to parse numeric output) and cannot find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b-?\d+\b), but it appears that this does not work. I'd be grateful to know of ways of matching space-separated numbers.



Example:



import java.util.regex.Pattern;

// With \b in front of the optional minus sign
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
// Same pattern without \b
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());


This returns:



true
false
true

































  • Can you post a small example with input and expected output?
    – Brent Writes Code
    Aug 24 '09 at 20:52










  • Will try to construct one
    – peter.murray.rust
    Aug 24 '09 at 20:58










  • Example: Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*"); String plus = " 12 "; System.out.println(""+pattern.matcher(plus).matches()); String minus = " -12 "; System.out.println(""+pattern.matcher(minus).matches()); pattern = Pattern.compile("\\s*\\-?\\d+\\s*"); System.out.println(""+pattern.matcher(minus).matches()); gives: true false true
    – peter.murray.rust
    Aug 24 '09 at 21:06















regex word-boundary














edited Oct 25 '17 at 17:11









Wiktor Stribiżew











asked Aug 24 '09 at 20:46









peter.murray.rust














11 Answers




















accepted










A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).



So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
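
As a minimal sketch of the point above, the following Java snippet lists the indices at which \b matches in a string. The class name BoundaryPositions and the helper boundaries are made up for this illustration; only Pattern and Matcher come from java.util.regex. The diamond operator assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Returns every index at which the zero-width assertion \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // No boundary at index 0: the dash is not a word character.
        System.out.println(boundaries("-12")); // [1, 3]
        System.out.println(boundaries("12"));  // [0, 2]
    }
}
```

Index 1 is the position between the dash and the 1; index 3 is the end of the string, right after the 2.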

























  • 18




    Correctamundo. \b is a zero-width assertion that matches if there is \w on one side, and either there is \W on the other or the position is beginning or end of string. \w is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English.
    – hobbs
    Aug 24 '09 at 21:02










  • 100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer.
    – Brent Writes Code
    Aug 24 '09 at 21:05






  • 1




    For the sake of understanding, is it possible to rewrite the regex \bhello\b without using \b (using \w, \W and other)?
    – David Portabella
    Sep 28 '16 at 9:40






  • 1




    Sort of: (^|\W)hello($|\W), except that it wouldn't capture any non-word characters before and after, so it would be more like (^|(?<=\W))hello($|(?=\W)) (using lookahead/lookbehind assertions).
    – brianary
    Sep 28 '16 at 9:58






  • 3




    @brianary Slightly simpler: (?<!\w)hello(?!\w).
    – David Knipe
    Nov 19 '17 at 17:16
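
To check the suggestion in these comments, here is a small sketch comparing \bhello\b with the lookaround form (?<!\w)hello(?!\w). The two agree here because hello itself starts and ends with a word character. The class name, method names and sample strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class HelloBoundary {
    static final Pattern WITH_B = Pattern.compile("\\bhello\\b");
    static final Pattern WITH_LOOKAROUND = Pattern.compile("(?<!\\w)hello(?!\\w)");

    static boolean findsWithB(String s) {
        return WITH_B.matcher(s).find();
    }

    static boolean findsWithLookaround(String s) {
        return WITH_LOOKAROUND.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"hello", "say hello!", "hello_world", "othello", "-hello-"};
        for (String s : samples) {
            // Both columns should print the same value for every sample.
            System.out.println(s + " -> " + findsWithB(s) + " / " + findsWithLookaround(s));
        }
    }
}
```

Both patterns reject hello_world and othello, since an underscore or a letter next to hello is a word character.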































A word boundary can occur in one of three positions:




  1. Before the first character in the string, if the first character is a word character.

  2. After the last character in the string, if the last character is a word character.

  3. Between two characters in the string, where one is a word character and the other is not a word character.


Word characters are alphanumeric characters plus the underscore; a minus sign is not.
Taken from Regex Tutorial.



















    A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
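
That definition translates almost directly into code. The sketch below is an illustration only: isWordChar assumes the ASCII definition [A-Za-z0-9_], and the class and method names are invented for this example. It is not a claim about how java.util.regex implements \b internally.

```java
public class ManualBoundary {
    // ASCII word character, i.e. [A-Za-z0-9_] (an assumption of this sketch).
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // True if position i (0..s.length()) is a word boundary in s.
    static boolean isBoundary(String s, int i) {
        boolean before = i > 0 && isWordChar(s.charAt(i - 1));
        boolean after = i < s.length() && isWordChar(s.charAt(i));
        return before != after; // exactly one side is a word character
    }

    public static void main(String[] args) {
        String s = "-12";
        for (int i = 0; i <= s.length(); i++) {
            System.out.println(i + ": " + isBoundary(s, i));
        }
    }
}
```

For "-12" this reports a boundary at positions 1 and 3 only, which is why a pattern that puts \b in front of the minus sign cannot match.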



























    • This is the best explanation.
      – Chris Leung
      Feb 7 at 6:47































    Check out the documentation on boundary conditions:



    http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html



    Check out this sample:



    import java.util.Arrays;

    public static void main(final String[] args)
    {
        String x = "I found the value -12 in my string.";
        // Split on integers delimited by word boundaries.
        System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
    }


    When you print it out, notice that the output is this:



    [I found the value -,  in my string.]



    This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.



















      I talk about what \b-style regex boundaries actually are here.



      The short story is that they’re conditional. Their behavior depends on what they’re next to.



      # same as using a \b before:
      (?(?=\w) (?<!\w) | (?<!\W) )

      # same as using a \b after:
      (?(?<=\w) (?!\w) | (?!\W) )


      Sometimes that isn’t what you want. See my other answer for elaboration.



















        I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.



        Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).



        The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
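
For what it is worth, Java 7 and later offer the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w Unicode-aware. A minimal sketch follows; the class name, method names and the sample character are chosen for illustration, and the example only exercises \w, not \b.

```java
import java.util.regex.Pattern;

public class UnicodeWord {
    // Default behaviour: \w is ASCII-only, [a-zA-Z_0-9].
    static boolean isWordAscii(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // With the flag, \w follows the Unicode definition of a word character.
    static boolean isWordUnicode(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(isWordAscii(eAcute));   // false
        System.out.println(isWordUnicode(eAcute)); // true
        System.out.println(isWordUnicode("-"));    // false
    }
}
```

The dash stays a non-word character either way, so the flag does not help with the -12 case from the question.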



        Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.



        Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.



        Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:



        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // Returns the lines of the input that contain a match for regexp, separated by newlines.
        public static String grep(String regexp, String multiLineStringToSearch) {
            String result = "";
            String[] lines = multiLineStringToSearch.split("\n");
            Pattern pattern = Pattern.compile(regexp);
            for (String line : lines) {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    result = result + "\n" + line;
                }
            }
            return result.trim();
        }


        Then in your test or main function:



        String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
        String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println("text="+text);
        // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
        System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
        System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
        System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
        System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
        System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
        System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

        System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
        System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
        System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works OK for this example, but see below
        // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
        text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
        System.out.println("text="+text);
        System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
        System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
        // Make sure the first and last cases work OK.

        text = "C is a language that should have been named differently.";
        System.out.println("text="+text);
        System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

        text = "One language that should have been named differently is C";
        System.out.println("text="+text);
        System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

        // Make sure we don't get false positives
        text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
        System.out.println("text="+text);
        System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));


        P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!





























        • I struggled trying to understand why I couldn't match C# but now it's clearer
          – Mugoma J. Okomba
          Dec 6 '16 at 19:48































        I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that word boundary is a zero-width match.



        One possible alternative is



        (?:(?:^|\s)-?)\d+\b


        This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
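
A sketch of how a pattern along those lines might be used to pull space-separated integers out of a string. Note the assumption: a capturing group is placed around -?\d+ (a variation on the pattern above, not the exact pattern) so that the leading whitespace is not part of the extracted value. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpaceSeparatedInts {
    // Start of string or whitespace, then an optionally negative integer
    // (group 1), then a word boundary.
    private static final Pattern INT = Pattern.compile("(?:^|\\s)(-?\\d+)\\b");

    static List<Integer> extract(String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = INT.matcher(input);
        while (m.find()) {
            out.add(Integer.parseInt(m.group(1)));
        }
        return out;
    }

    public static void main(String[] args) {
        // "x-3" is skipped: the dash is preceded by neither whitespace nor the start of the string.
        System.out.println(extract("12 -12 x-3 7")); // [12, -12, 7]
    }
}
```

Because the pattern ends in \b rather than a whitespace check, an input such as 12.5 would still yield 12; replacing the trailing \b with (?!\S) is one way to tighten that, if stricter matching is wanted.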



















          While learning regular expressions, I was really stuck on the metacharacter \b. I did not understand its meaning and kept asking myself "what is it, what is it?" After some experiments on the website, I noticed the pink vertical dashes at the beginning and at the end of every word. At that point its meaning became clear to me: it is exactly a word (\w) boundary.



          My view is aimed purely at building intuition. The logic behind it is better explained in the other answers.



          enter image description here



















            I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.























            • 1




              You're thinking of \G: matches the beginning of the string (like \A) on the first match attempt; after that it matches the position where the previous match ended.
              – Alan Moore
              Jun 24 '16 at 20:50
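
A small sketch of the \G behaviour described in this comment: each match has to start exactly where the previous one ended, so the scan stops at the first character that breaks the chain. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GAnchorDemo {
    // Collects single digits for as long as they are contiguous from the start.
    static List<String> leadingDigits(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\G\\d").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The 'a' breaks the chain, so 4 and 5 are never reached.
        System.out.println(leadingDigits("123a45")); // [1, 2, 3]
    }
}
```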































            When you use \b(\w+)+\b, that means an exact match with a word containing only word characters ([a-zA-Z0-9_]).



            In your case, for example, setting \b at the beginning of the regex will accept -12 (with space), but again it won't accept -12 (without space).



            for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html



















              Word boundary \b is used where one side should be a word character and the other side a non-word character.
              The regular expression for a negative number should be



              --?\b\d+\b


              check working DEMO






              Anubhav Shakya (new contributor)


















                Accepted answer: answered Aug 24 '09 at 21:00 by brianary; edited Jul 4 '12 at 21:40 by Gilles








                • 18




                  Correctamundo. b is a zero-width assertion that matches if there is w on one side, and either there is W on the other or the position is beginning or end of string. w is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English.
                  – hobbs
                  Aug 24 '09 at 21:02










                • 100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer.
                  – Brent Writes Code
                  Aug 24 '09 at 21:05






                • 1




                  for the sake of understanding, is it possible to rewrite the regex bhellob without using b (using w, W and other)?
                  – David Portabella
                  Sep 28 '16 at 9:40






                • 1




                  Sort of: (^|W)hello($|W), except that it wouldn't capture any non-word characters before and after, so it would be more like (^|(?<=W))hello($|(?=W)) (using lookahead/lookbehind assertions).
                  – brianary
                  Sep 28 '16 at 9:58






                • 3




                  @brianary Slightly simpler: (?<!w)hello(?!w).
                  – David Knipe
                  Nov 19 '17 at 17:16














                • 18




                  Correctamundo. b is a zero-width assertion that matches if there is w on one side, and either there is W on the other or the position is beginning or end of string. w is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English.
                  – hobbs
                  Aug 24 '09 at 21:02










                • 100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer.
                  – Brent Writes Code
                  Aug 24 '09 at 21:05






                • 1




                  for the sake of understanding, is it possible to rewrite the regex bhellob without using b (using w, W and other)?
                  – David Portabella
                  Sep 28 '16 at 9:40






                • 1




                  Sort of: (^|W)hello($|W), except that it wouldn't capture any non-word characters before and after, so it would be more like (^|(?<=W))hello($|(?=W)) (using lookahead/lookbehind assertions).
                  – brianary
                  Sep 28 '16 at 9:58






                • 3




                  @brianary Slightly simpler: (?<!w)hello(?!w).
                  – David Knipe
                  Nov 19 '17 at 17:16








                18




                18




                Correctamundo. b is a zero-width assertion that matches if there is w on one side, and either there is W on the other or the position is beginning or end of string. w is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English.
                – hobbs
                Aug 24 '09 at 21:02




                Correctamundo. b is a zero-width assertion that matches if there is w on one side, and either there is W on the other or the position is beginning or end of string. w is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English.
                – hobbs
                Aug 24 '09 at 21:02












                100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer.
                – Brent Writes Code
                Aug 24 '09 at 21:05




                100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer.
                – Brent Writes Code
                Aug 24 '09 at 21:05




                1




                1




                for the sake of understanding, is it possible to rewrite the regex bhellob without using b (using w, W and other)?
                – David Portabella
                Sep 28 '16 at 9:40




                for the sake of understanding, is it possible to rewrite the regex bhellob without using b (using w, W and other)?
                – David Portabella
                Sep 28 '16 at 9:40




                1




                1




                Sort of: (^|W)hello($|W), except that it wouldn't capture any non-word characters before and after, so it would be more like (^|(?<=W))hello($|(?=W)) (using lookahead/lookbehind assertions).
                – brianary
                Sep 28 '16 at 9:58




                3




@brianary Slightly simpler: (?<!\w)hello(?!\w).
                – David Knipe
                Nov 19 '17 at 17:16




                up vote
                18
                down vote













                A word boundary can occur in one of three positions:




                1. Before the first character in the string, if the first character is a word character.

                2. After the last character in the string, if the last character is a word character.

                3. Between two characters in the string, where one is a word character and the other is not a word character.


Word characters are alphanumeric characters plus the underscore; a minus sign is not one of them.
                Taken from Regex Tutorial.
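To see those three rules on the strings from the question, here is a small sketch that lists every index where \b matches. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Returns every index at which the zero-width \b assertion matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries(" 12 "));  // prints [1, 3]
        System.out.println(boundaries(" -12 ")); // prints [2, 4]
    }
}
```

In " -12 " there is no boundary at index 1, between the space and the minus sign, because neither of them is a word character. That is the spot where the question's pattern requires \b, so the match fails.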






                    edited Jul 19 '16 at 0:15









                    SongWithoutWords

                    answered Aug 24 '09 at 21:05









                    WolfmanDragon

                        up vote
                        7
                        down vote













                        A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
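That definition translates almost word for word into code. The sketch below is my own illustration, not part of java.util.regex: isWordChar assumes the ASCII set [A-Za-z0-9_], and all names are hypothetical. It compares the hand-written check against the positions where the regex engine reports \b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualBoundary {
    // Assumption of this sketch: a word character is one of [A-Za-z0-9_].
    static boolean isWordChar(char c) {
        return c == '_' || (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Position pos sits between s[pos-1] and s[pos]. It is a boundary when
    // exactly one of the two neighbours is a word character.
    static boolean isBoundary(String s, int pos) {
        boolean before = pos > 0 && isWordChar(s.charAt(pos - 1));
        boolean after = pos < s.length() && isWordChar(s.charAt(pos));
        return before != after;
    }

    // All boundary positions according to the hand-written check.
    static List<Integer> byHand(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i <= s.length(); i++) {
            if (isBoundary(s, i)) {
                result.add(i);
            }
        }
        return result;
    }

    // All boundary positions according to the regex engine.
    static List<Integer> byRegex(String s) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(s);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "value -12 ok";
        System.out.println("by hand:  " + byHand(s));
        System.out.println("by regex: " + byRegex(s));
    }
}
```

For plain ASCII input like this the two lists should come out the same; with non-ASCII text the engine's idea of a word character may differ from the simple check used here.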






                        • This is the best explanation.
                          – Chris Leung
                          Feb 7 at 6:47















                        answered Aug 25 '09 at 1:36









                        Alan Moore

                        up vote
                        4
                        down vote













                        Check out the documentation on boundary conditions:



                        http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html



                        Check out this sample:



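import java.util.Arrays;

public class SplitDemo { // class name is arbitrary
    public static void main(final String[] args) {
        String x = "I found the value -12 in my string.";
        // Backslashes must be doubled inside a Java string literal.
        System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
    }
}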


                        When you print it out, notice that the output is this:



[I found the value -,  in my string.]



This means that the "-" is left behind in the first piece: it is not a word character, so there is no word boundary between the space and the "-", and \b can only match between the "-" and the "1". Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
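For the space-separated numbers the question asks about, one option is to drop \b and assert on whitespace instead. This is a sketch of my own, not something taken from the linked tutorial; the class name, the NUMBER constant and the sample input are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpaceSeparatedNumbers {
    // An optionally signed integer that is not glued to other text:
    // (?<!\S) = start of input or preceded by whitespace,
    // (?!\S)  = end of input or followed by whitespace.
    static final Pattern NUMBER = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    // Collects every match of NUMBER in the input, in order.
    static List<String> numbers(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(numbers(" 12  -12 x-3 7a 42")); // prints [12, -12, 42]
    }
}
```

Unlike \b, these lookarounds do not care whether the neighbouring character is a word character, only whether it is whitespace, so the minus sign stays part of the number.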






                            answered Aug 24 '09 at 21:03









                            Brent Writes Code

                                up vote
                                4
                                down vote













I talk about what \b-style regex boundaries actually are here.



                                The short story is that they’re conditional. Their behavior depends on what they’re next to.



# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )


                                Sometimes that isn’t what you want. See my other answer for elaboration.
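Note that java.util.regex has no (?(condition)yes|no) construct, so the two patterns above cannot be pasted into Java as they are. When what you actually want is a one-directional boundary, a possible workaround is to spell out the lookarounds directly. The sketch below is an assumption-laden illustration (the class, constants and mark helper are invented here); it marks each matching position with a | character.

```java
class DirectionalBoundaries {
    // Start of a word: no word character before, a word character after.
    static final String WORD_START = "(?<!\\w)(?=\\w)";
    // End of a word: a word character before, no word character after.
    static final String WORD_END = "(?<=\\w)(?!\\w)";
    // Either of the two, which is the check that \b performs.
    static final String EITHER = "(?:" + WORD_START + "|" + WORD_END + ")";

    // Inserts a '|' at every position where the zero-width regex matches.
    static String mark(String regex, String input) {
        return input.replaceAll(regex, "|");
    }

    public static void main(String[] args) {
        String s = "x -12 y";
        System.out.println(mark(WORD_START, s)); // prints |x -|12 |y
        System.out.println(mark(WORD_END, s));   // prints x| -12| y|
        System.out.println(mark(EITHER, s));     // prints |x| -|12| |y|
        System.out.println(mark("\\b", s));      // prints |x| -|12| |y|
    }
}
```

WORD_START and WORD_END never change meaning depending on what follows or precedes them in the larger pattern, which is the behaviour \b cannot give you.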






                                    edited May 23 '17 at 12:34









                                    Community

                                    answered Nov 18 '10 at 13:35









                                    tchrist

                                        up vote
                                        4
                                        down vote













                                        I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.



Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).



The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.



Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.



                                        Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.



Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:



import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the lines of the input that contain a match, joined by newlines.
public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}


                                        Then in your test or main function:



String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where \b does not help with "cutesy" computer language names: they start or end with non-word characters.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works OK for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.

text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

// Make sure we don't get false positives.
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));


                                        P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!






                                        • I struggled trying to understand why I couldn't match C# but now it's clearer
                                          – Mugoma J. Okomba
                                          Dec 6 '16 at 19:48















                                        up vote
                                        4
                                        down vote










                                        up vote
                                        4
                                        down vote









                                        I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.



                                        Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for b but not for w. (I'm sure there was a good reason for it at the time).



                                        The w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in w. But Java, JavaScript, and PCRE match only ASCII characters with w.



                                        Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the b.



                                        Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.



                                        Anyway, in Java, if you're searching text for the those weird-named languages, you need to replace the b with before and after whitespace and punctuation designators. For example:



    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Returns every line of the input that contains a match for regexp, one per line.
    public static String grep(String regexp, String multiLineStringToSearch) {
        String result = "";
        String[] lines = multiLineStringToSearch.split("\n");
        Pattern pattern = Pattern.compile(regexp);
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            if (matcher.find()) {
                result = result + "\n" + line;
            }
        }
        return result.trim();
    }


                                        Then in your test or main function:



    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
    String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works OK for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    // Make sure the first and last cases work OK.
    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    // Make sure we don't get false positives.
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));


                                        P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!














answered Dec 16 '13 at 16:54 by Tihamer; edited Aug 10 '16 at 9:24 by Alan Moore












                                        • I struggled trying to understand why I couldn't match C# but now it's clearer
                                          – Mugoma J. Okomba
                                          Dec 6 '16 at 19:48































I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary can only match after the -, between it and the first digit, and so a \b placed before the - will not capture it when the - follows a space. A word boundary matches at the start of the string if the first character is a word character, at the end of the string if the last character is a word character, and at any position where one side is a word character and the other side is not. Also note that a word boundary is a zero-width match.



                                        One possible alternative is



(?:(?:^|\s)-?)\d+\b


                                        This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
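As a rough sketch of how that alternative behaves next to the pattern from the question, the snippet below compiles both with Java string escaping. The class name and the firstNumber helper are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SignedNumberDemo {
    // The alternative suggested above, with backslashes doubled for a Java string.
    static final Pattern SIGNED = Pattern.compile("(?:(?:^|\\s)-?)\\d+\\b");
    // The pattern from the question, for comparison.
    static final Pattern ORIGINAL = Pattern.compile("\\s*\\b\\-?\\d+\\s*");

    // Returns the first match in the input with surrounding whitespace trimmed,
    // or null if the pattern does not match anywhere.
    static String firstNumber(String input) {
        Matcher m = SIGNED.matcher(input);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        // The question's pattern accepts " 12 " but rejects " -12 ",
        // because there is no word boundary between the space and the dash.
        System.out.println(ORIGINAL.matcher(" 12 ").matches());
        System.out.println(ORIGINAL.matcher(" -12 ").matches());
        // The alternative finds the signed number when it follows whitespace
        // or the start of the string, and nothing when it follows a letter.
        System.out.println(firstNumber(" -12 "));
        System.out.println(firstNumber("-12"));
        System.out.println(firstNumber("x-12"));
    }
}
```

Note that the match itself includes the leading whitespace character, which is why the helper trims it.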
















answered Aug 24 '09 at 20:59 by Sean



































While learning regular expressions, I was really stuck on the \b metacharacter. I didn't understand what it meant, and kept asking myself "what is it, what is it?". After some experiments on the website, I noticed the pink vertical dashes drawn at the beginning and at the end of every word, and at that point its meaning became clear to me: it is exactly a word (\w) boundary.

This answer is only meant to build intuition. For the logic behind it, see the other answers.

[Image: regex tester output with pink vertical dashes marking the word boundaries]
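The same picture can be reproduced in code. This is a small sketch (the class and method names are made up for the example) that collects the index of every position where \b matches, which corresponds to where a tester would draw its marks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Returns the index of every position in the input where \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        // \b is zero-width, so each match is empty; find() still advances.
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello, world")); // prints [0, 5, 7, 12]
        System.out.println(boundaries(" -12 "));        // prints [2, 4]
    }
}
```

For " -12 " the only boundaries are around the digits, at index 2 and index 4; there is none between the space and the dash.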
















answered Jun 1 at 1:19 by snr



































                                                        I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.























• You're thinking of \G: it matches the beginning of the string (like \A) on the first match attempt; after that it matches the position where the previous match ended.
                                                          – Alan Moore
                                                          Jun 24 '16 at 20:50
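For contrast with \b, here is a short sketch of the \G behaviour described in that comment. The class and method names are hypothetical; the pattern \G\d is just an illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContinueMatchDemo {
    // Collects digits one at a time, but only while each match starts
    // exactly where the previous one ended (or at the start of the input).
    static List<String> leadingDigits(String input) {
        List<String> digits = new ArrayList<>();
        Matcher m = Pattern.compile("\\G\\d").matcher(input);
        while (m.find()) {
            digits.add(m.group());
        }
        return digits;
    }

    public static void main(String[] args) {
        // The later "456" is never reached, because \G cannot skip over "abc".
        System.out.println(leadingDigits("123abc456")); // prints [1, 2, 3]
    }
}
```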

























answered Aug 24 '09 at 20:55 by user130076




























When you use \b(\w+)+\b, that means an exact match with a word containing only word characters ([a-zA-Z0-9_]).

In your case, for example, setting \b at the beginning of the regex will accept -12 (with space) but again it won't accept -12 (without space).

For reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html














answered Nov 19 '17 at 16:41 by vic; edited Nov 19 '17 at 18:53



































A word boundary \b matches where one side is a word character and the other side is a non-word character.
The regular expression for a negative number should be



--?\b\d+\b


Check the working DEMO.
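A quick way to try that pattern in Java, as a sketch only: the class and method names are invented for the example, and matches() is used so the whole input has to be the number.

```java
import java.util.regex.Pattern;

public class NegativeNumberDemo {
    // The pattern from the answer above, escaped for a Java string.
    // The \b sits after the sign, between the dash and the first digit.
    static final Pattern NEGATIVE = Pattern.compile("--?\\b\\d+\\b");

    static boolean isNegativeNumber(String s) {
        return NEGATIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNegativeNumber("-12"));  // prints true
        System.out.println(isNegativeNumber("12"));   // prints false, the dash is required
        System.out.println(isNegativeNumber("-12a")); // prints false
    }
}
```

Because the leading dash is mandatory here, unsigned numbers are rejected; making the sign optional would be a different pattern from the one in the answer.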






Anubhav Shakya (new contributor)






















                                                                  up vote
                                                                  0
                                                                  down vote













                                                                  Word boundary b is used where one word should be a word character and another one a non-word character.
                                                                  Regular Expression for negative number should be



                                                                  --?bd+b


                                                                  check working DEMO






                                                                  share|improve this answer








                                                                  New contributor




                                                                  Anubhav Shakya is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                  Check out our Code of Conduct.




















                                                                    up vote
                                                                    0
                                                                    down vote










                                                                    up vote
                                                                    0
                                                                    down vote









                                                                    Word boundary b is used where one word should be a word character and another one a non-word character.
                                                                    Regular Expression for negative number should be



                                                                    --?bd+b


                                                                    check working DEMO






                                                                    share|improve this answer








                                                                    New contributor




                                                                    Anubhav Shakya is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.









                                                                    Word boundary b is used where one word should be a word character and another one a non-word character.
                                                                    Regular Expression for negative number should be



                                                                    --?bd+b


                                                                    check working DEMO







                                                                    share|improve this answer








                                                                    New contributor




                                                                    Anubhav Shakya is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.









                                                                    share|improve this answer



                                                                    share|improve this answer






                                                                    New contributor




                                                                    Anubhav Shakya is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                                                                    Check out our Code of Conduct.









                                                                    answered Nov 8 at 10:38









                                                                    Anubhav Shakya

                                                                    11


































                                                                         




















































































